5 Tips for Designing a Multi-Agent System: Grab Case Study

Learning from Grab’s Analytics Data Warehouse Multi-Agent System

When a data platform supporting over one thousand internal users and fifteen thousand tables begins to strain under routine support requests, the engineering team faces a stark choice: hire more people or redesign the workflow. Grab’s Analytics Data Warehouse (ADW) team chose the latter, building a multi-agent AI system that reclaims hundreds of engineering hours each month. This real-world implementation offers valuable lessons for anyone interested in multi-agent system design. By examining their approach, we can extract five actionable tips that apply to enterprise automation projects, whether you are architecting a new system or evolving an existing one.

The ADW platform at Grab is the backbone for analytics across the company. As usage grew, engineers found themselves buried in repetitive troubleshooting tasks: query analysis, log retrieval, schema lookups, and code fixes. The team needed a way to automate these support workflows without sacrificing safety or reliability. Their solution was a multi-agent system orchestrated with LangGraph and FastAPI that separates requests into investigation and enhancement paths. Below, we break down the five design principles that made their system effective, using the Grab case study as our guide.

1. Separate Investigation and Enhancement Workflows

The first and perhaps most impactful decision in the Grab system was to split incoming engineering requests into two distinct workflow types: investigation and enhancement. This separation is a core principle of multi-agent system design because it dramatically reduces complexity for the agents involved. Instead of a single agent trying to decide whether to diagnose a problem, generate code, or both, each agent has a clear, bounded mission.

Why Two Workflows?

Investigation workflows handle diagnostic tasks. When a data warehouse user reports a slow query or an unexpected error, the system first determines what is happening. Agents in this pathway perform query analysis, retrieve relevant logs, look up schema definitions, and summarise the issue. They do not change anything. Enhancement workflows, on the other hand, are purely generative. They produce code changes, SQL fixes, and automated merge requests. Because the two paths are kept separate, an agent working on a diagnostic step has no route to modifying production data. The Grab engineers noted that this architecture reduced complexity in agent reasoning and improved reliability in production workflows.

Implementation in Practice

Imagine a data engineer who submits a ticket about a failing pipeline. The system’s classifier routes the request to the investigation workflow. A diagnostic agent pulls logs, finds a schema mismatch, and summarises the root cause. That summary is then handed to the enhancement workflow, which generates a corrected SQL script and opens a merge request for human review. Each step is performed by a specialised agent with constrained responsibilities, avoiding the chaos of a monolithic agent that tries to do everything at once.

For teams designing their own systems, this separation offers a clear template. Map out the typical support requests you receive. Group them into those requiring only analysis and those needing action. Build distinct agent pipelines for each group, and ensure the handoff between them is clean and well-documented. This approach also makes it easier to measure the performance of each workflow independently.
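As a rough illustration of this split, the sketch below routes a ticket to one of two workflows and models the handoff as an explicit object. It is not Grab's implementation: the keyword hints, the `Request` and `InvestigationSummary` types, and the `propose_fix` function are all hypothetical names chosen for the example, and a production classifier would more likely be a model call than a keyword match.

```python
from dataclasses import dataclass
from enum import Enum


class Workflow(Enum):
    INVESTIGATION = "investigation"  # read-only diagnosis
    ENHANCEMENT = "enhancement"      # proposes changes for human review


@dataclass(frozen=True)
class Request:
    ticket_id: str
    text: str


@dataclass(frozen=True)
class InvestigationSummary:
    """Handoff object: the only input the enhancement workflow accepts."""
    ticket_id: str
    root_cause: str


# Hypothetical keyword hints, used here only to keep the example runnable.
ACTION_HINTS = ("please fix", "add a column", "update the", "create a")


def classify(request: Request) -> Workflow:
    """Requests that ask for a change go to enhancement, the rest to investigation."""
    lowered = request.text.lower()
    if any(hint in lowered for hint in ACTION_HINTS):
        return Workflow.ENHANCEMENT
    return Workflow.INVESTIGATION


def propose_fix(summary: InvestigationSummary) -> dict:
    """Enhancement entry point: takes a finished summary, never a raw ticket."""
    return {
        "ticket_id": summary.ticket_id,
        "title": f"Fix: {summary.root_cause}",
        "status": "awaiting_human_review",
    }
```

Typing the handoff as its own object is one way to keep the boundary between the two pipelines clean: the enhancement side cannot start until a diagnosis exists, and each side can be tested and measured on its own.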

2. Consolidate the Tool Ecosystem for Maintainability

When the Grab team first prototyped their multi-agent system, they exposed over thirty internal tools to the agents. These tools ranged from data access APIs to logging systems and code repositories. The result was unpredictable: agents would occasionally call the wrong tool or choose a suboptimal one for the task at hand. The solution was a deliberate consolidation of the tool ecosystem down to a smaller, curated set. This is a critical lesson in multi-agent system design: more tools do not equal smarter agents.

The Problem of Tool Proliferation

In any large organisation, internal tools accumulate over years. A querying interface, a log viewer, a metadata catalog, a code review system—each serves a purpose, but an agent with too many options struggles to select the right one consistently. The Grab team observed that reducing the toolset improved maintainability and reduced unpredictable tool selection by agents. This directly addresses a common challenge: how do you ensure that agents produce safe and reviewable outputs? Fewer tools mean fewer paths to error.

How to Curate Your Toolset

Start by listing every internal tool that your agents could potentially use. Then, for each tool, ask: is this tool essential for the core workflows we are automating? Can its functionality be merged into a more general tool? For example, instead of separate tools for reading logs, searching SQL queries, and fetching table schemas, consider a single “metadata and diagnostics” tool that bundles these capabilities. At Grab, the final toolset included controlled SQL execution, metadata access, log retrieval, and Git-based workflow integration. That focused set made agent behaviour far more predictable.

Additionally, design each tool with clear, narrow interfaces. An agent should not have to parse a complex API to get a simple answer. Use well-defined input and output schemas, and enforce consistent error handling. When you reduce the cognitive load on the agent (and on the human debugging it), the entire system becomes more robust.
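One way to picture a narrow tool interface is a function with a typed input, a typed result, and a single error shape. The sketch below is illustrative only: `SchemaLookupInput`, `ToolResult`, `lookup_schema`, and the in-memory `_CATALOG` are hypothetical stand-ins, not tools from the Grab system.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SchemaLookupInput:
    table: str  # the only thing the agent has to supply


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None  # every tool reports failures in this one field


# Hypothetical in-memory catalog standing in for a real metadata service.
_CATALOG: Dict[str, Dict[str, str]] = {
    "orders": {"order_id": "BIGINT", "created_at": "TIMESTAMP"},
}


def lookup_schema(params: SchemaLookupInput) -> ToolResult:
    """Return column types for one table, or a structured error."""
    columns = _CATALOG.get(params.table)
    if columns is None:
        return ToolResult(ok=False, error=f"unknown table: {params.table}")
    return ToolResult(ok=True, data={"table": params.table, "columns": columns})
```

Because failures come back as data rather than as exceptions with varying shapes, both the agent and the engineer debugging it see the same predictable structure.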

3. Embed Safety and Human-in-the-Loop Governance

Automating engineering support means giving agents the power to execute SQL queries and generate code changes. That power carries obvious risks: accidental data exposure, unintended schema modifications, or deployment of faulty code. The Grab team integrated safety measures directly into the multi-agent system design, rather than bolting them on later.

Constrained SQL Execution

Any SQL execution is passed through a validation layer that restricts operations. For example, agents can only run SELECT queries on read replicas, and any DDL or DML statements are blocked at the database level. This ensures that even if an agent misclassifies a request or generates a harmful query, the database is never at risk. The validation layer also checks for sensitive data patterns—such as personally identifiable information—and either redacts the results or blocks the query entirely.
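A validation check of this kind might look like the sketch below. It is a deliberately conservative approximation and not Grab's validator: `is_read_only` and its keyword denylist are assumptions made for illustration, it does not cover the sensitive-data checks described above, and it will reject some harmless queries (for example, one containing the word "update" inside a string literal). A real deployment would more likely rely on a SQL parser plus read-only database credentials.

```python
import re

# Keywords that indicate DDL or DML; a hypothetical denylist for this example.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)
_READ_PREFIX = re.compile(r"^(select|with)\b", re.IGNORECASE)


def is_read_only(sql: str) -> bool:
    """Accept a single SELECT (or WITH ... SELECT) statement and nothing else."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input or more than one statement
    if not _READ_PREFIX.match(statement):
        return False  # must start with SELECT or WITH
    return _FORBIDDEN.search(statement) is None
```

The check errs on the side of refusing: an agent that is wrongly denied a query can ask a human, whereas a wrongly permitted write is much harder to undo.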

Human-in-the-Loop for Code Changes

Enhancement workflows that produce code or SQL fixes undergo a mandatory human review before any change is deployed. The system generates a merge request with a clear description of the proposed change, the reasoning behind it, and any relevant logs or test results. An engineer reviews the request, modifies it if needed, and approves it manually. This guardrail is essential for maintaining trust in the system. As one Grab engineer put it, “Automation should amplify human judgment, not replace it.”

Designing for Reviewability

To make human review practical, the system outputs must be easy to understand. Each agent’s output includes a chain-of-thought summary: what data it accessed, what decision it made, and why. This transparency allows a reviewer to quickly verify the correctness of a fix without re-tracing the agent’s entire reasoning path. In your own design, consider adding a structured log of agent actions to every change request. It reduces the time engineers spend on oversight and builds confidence in the automation.
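A structured action log attached to a change request could be as simple as the following sketch. The `ActionRecord` and `ChangeRequest` classes and the rendered layout are hypothetical; the article does not describe the format Grab uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ActionRecord:
    agent: str   # which agent acted
    action: str  # what it accessed or decided
    reason: str  # why it did so


@dataclass
class ChangeRequest:
    title: str
    actions: List[ActionRecord] = field(default_factory=list)

    def record(self, agent: str, action: str, reason: str) -> None:
        self.actions.append(ActionRecord(agent, action, reason))

    def render_description(self) -> str:
        """Build the text a reviewer reads at the top of the merge request."""
        lines = [self.title, "", "Agent actions:"]
        for number, item in enumerate(self.actions, start=1):
            lines.append(f"{number}. [{item.agent}] {item.action} (why: {item.reason})")
        return "\n".join(lines)
```

For example, recording a log read and a schema comparison produces a numbered list that a reviewer can scan in seconds instead of replaying the agent's full reasoning.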

4. Manage Context with Structured Compression and Selective Retrieval

Multi-step agent reasoning often requires maintaining state across multiple interactions. An agent might first look up a schema, then run a log query, then reference an earlier result to generate a fix. But large language models (LLMs) operate within token limits, and raw context can quickly exceed those boundaries. The Grab team faced this challenge head-on and developed strategies that are highly instructive for multi-agent system design.

The Context Management Challenge

In the investigation workflow, an agent might need to read several error messages, compare them with historical logs, and examine schema definitions—all before summarising the issue. If the agent naively concatenates every piece of data it retrieves, it will exceed the context window of the LLM, leading to truncated reasoning or hallucinated responses. To address this, the system uses structured context compression. Instead of storing raw logs, the agent summarises each log entry into a structured fact. For example, “User X encountered error Y at time Z on table Q.” This compression reduces token usage while preserving the key information.
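To make the idea of a structured fact concrete, the sketch below reduces a raw log line to the four fields in the example sentence. It substitutes a regular expression for the model-generated summary the article describes, and the log format, `LogFact`, and `compress` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogFact:
    user: str
    error: str
    time: str
    table: str

    def as_sentence(self) -> str:
        return (
            f"User {self.user} encountered error {self.error} "
            f"at {self.time} on table {self.table}."
        )


# Hypothetical log format: "<time> user=<u> table=<t> error=<e> ...rest..."
_LINE = re.compile(
    r"^(?P<time>\S+) user=(?P<user>\S+) table=(?P<table>\S+) error=(?P<error>\S+)"
)


def compress(line: str) -> Optional[LogFact]:
    """Keep the four key fields and drop everything else on the line."""
    match = _LINE.match(line)
    if match is None:
        return None  # unparseable lines are skipped rather than guessed at
    return LogFact(**match.groupdict())
```

Whatever produces the fact, the point is the same: the long tail of the log entry (stack traces, request IDs, repeated headers) never reaches the model's context.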

Selective Retrieval

Not every piece of information needs to be in every agent’s context. The system implements selective retrieval: each agent decides what additional data it needs based on its current step. If the agent is diagnosing a schema error, it retrieves only the relevant table definitions, not the entire catalog. This is implemented as a retrieval-augmented generation (RAG) layer that is called on demand. The supervisor agent controls which subordinate agents can retrieve what, ensuring that each agent sees only the context necessary for its specific task.
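The sketch below shows the shape of on-demand, permission-scoped retrieval. It uses a plain keyword match over table names in place of a real retrieval layer, and `CATALOG`, `PERMISSIONS`, and `retrieve_schemas` are hypothetical names; how Grab's supervisor actually enforces access is not described in the source.

```python
import re
from typing import Dict, Set

# Hypothetical metadata catalog: table name -> column definitions.
CATALOG: Dict[str, str] = {
    "orders": "order_id BIGINT, created_at TIMESTAMP",
    "users": "user_id BIGINT, email STRING",
    "payments": "payment_id BIGINT, amount DECIMAL",
}

# Set by the supervisor: which tables each agent is allowed to see.
PERMISSIONS: Dict[str, Set[str]] = {"schema_agent": {"orders", "users"}}


def retrieve_schemas(agent: str, error_text: str) -> Dict[str, str]:
    """Return only the permitted table definitions that the error mentions."""
    allowed = PERMISSIONS.get(agent, set())
    mentioned = set(re.findall(r"[a-z_]+", error_text.lower()))
    return {
        name: CATALOG[name]
        for name in sorted(allowed & mentioned)
        if name in CATALOG
    }
```

An agent diagnosing an error on `orders` receives one definition rather than the whole catalog, and an agent without permissions receives nothing at all.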

Practical Application for Your Team

When you build a multi-agent system, plan for context management from the start. Design each agent to produce a condensed summary of its findings, which can be passed to the next agent in the pipeline. Also, avoid the temptation to include every possible tool or data source in the initial context. Instead, let agents request information dynamically. This keeps token usage low and reasoning sharp. You can also implement a sliding window of recent interactions, discarding older state that is no longer relevant.
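The sliding window mentioned above can be sketched with a bounded queue from the standard library. `RecentContext` is a hypothetical helper, and the window size is an arbitrary choice for the example.

```python
from collections import deque
from typing import Deque


class RecentContext:
    """Keep only the most recent summaries; older ones fall off automatically."""

    def __init__(self, max_items: int) -> None:
        self._items: Deque[str] = deque(maxlen=max_items)

    def add(self, summary: str) -> None:
        self._items.append(summary)

    def render(self) -> str:
        """Text to place in the next agent's prompt, oldest entry first."""
        return "\n".join(self._items)
```

With `max_items=2`, adding three summaries leaves only the last two in the rendered context, so prompt size stays bounded no matter how long the workflow runs.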

5. Design for Modular Orchestration with Supervisors

The final tip from the Grab case study is about orchestration architecture. Their system uses a LangGraph-based workflow engine combined with FastAPI services for coordination. A central supervisor controls communication flow and task delegation among specialised agents. This modular approach is a hallmark of effective multi-agent system design in enterprise settings.

Why a Supervisor Pattern?

In a complex system with multiple agents, you need a way to route tasks, manage state, and handle errors. The supervisor pattern centralises these responsibilities. When a new request arrives, the supervisor classifies it, determines which investigation or enhancement path to follow, and delegates sub-tasks to the appropriate agents. It also monitors the progress of each agent, re-routing if an agent times out or produces an invalid output. This avoids the chaos of agents calling each other directly, which can quickly lead to cyclic dependencies and unpredictable behaviour.
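Stripped of any framework, the supervisor pattern can be sketched in plain Python as below. This is not LangGraph code and not Grab's supervisor: the `Supervisor` class, the route table, and the two sample agents are hypothetical, and a crash is treated the same way as a timeout to keep the example short.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]


class Supervisor:
    """Central router: agents never call each other, only the supervisor calls agents."""

    def __init__(self, routes: Dict[str, List[Agent]]) -> None:
        # Each path maps to a primary agent followed by optional fallbacks.
        self._routes = routes

    def run(self, path: str, task: str) -> str:
        if path not in self._routes:
            raise ValueError(f"unknown path: {path}")
        for agent in self._routes[path]:
            try:
                output = agent(task)
            except Exception:
                continue  # failed agent: re-route to the next one
            if isinstance(output, str) and output.strip():
                return output  # first valid output wins
        raise RuntimeError(f"no agent produced a valid output on path: {path}")


def flaky_agent(task: str) -> str:
    raise TimeoutError("simulated timeout")


def backup_agent(task: str) -> str:
    return f"summary for {task}"
```

Registering `[flaky_agent, backup_agent]` under an `"investigation"` path means the caller still gets a summary when the first agent fails, and because all calls pass through `run`, there is no way for agents to form a cycle among themselves.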

Tool Coordination and State Management

With a supervisor, tool coordination becomes simpler. The supervisor maintains a registry of available tools and ensures that agents do not conflict over shared resources. For instance, only one agent at a time is allowed to access the log retrieval tool to prevent request throttling. The supervisor also manages the state of the entire workflow: it knows which steps have been completed, what context has been collected, and when the human review gate is needed. This state can be serialised and stored, allowing the system to resume interrupted workflows.
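Serialisable workflow state can be illustrated with a small dataclass that round-trips through JSON. The step names and the `WorkflowState` class below are assumptions for the example; the source does not say how Grab stores its state.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional

# Hypothetical step names for a single ticket, in execution order.
STEPS = ["classify", "investigate", "propose_fix", "human_review"]


@dataclass
class WorkflowState:
    ticket_id: str
    completed_steps: List[str] = field(default_factory=list)
    context: Dict[str, str] = field(default_factory=dict)

    def next_step(self) -> Optional[str]:
        """First step that has not run yet, or None when the workflow is done."""
        for step in STEPS:
            if step not in self.completed_steps:
                return step
        return None

    def needs_human_review(self) -> bool:
        return self.next_step() == "human_review"

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "WorkflowState":
        return cls(**json.loads(payload))
```

After a restart, the supervisor can load the stored JSON, call `next_step()`, and continue from where the workflow stopped instead of repeating the diagnosis.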

Scaling from Prototype to Production

If you are starting with a prototype, you might initially hardcode the agent interactions. But as the system grows, invest in a proper orchestration layer like LangGraph or a workflow engine such as Temporal or AWS Step Functions. The modular design pays off in production because you can swap out individual agents without rewriting the entire pipeline. For example, you might replace a diagnostic agent with a more advanced version that uses a fine-tuned model, leaving the rest of the workflow unchanged. The supervisor pattern also makes it easier to add new capabilities, such as a new investigation tool or a different code-generation approach, without disrupting existing paths.

Measuring the Impact and Shifting Engineering Culture

While the Grab team did not disclose exact performance numbers, they reported a measurable reduction in time spent on routine support tasks and faster resolution cycles. More importantly, they observed a shift in engineering culture: the team moved from reactive firefighting to proactive platform engineering. Engineers who once spent hours investigating common issues now have bandwidth to design new features and improve system architecture. This cultural change is one of the most valuable outcomes of a well-designed multi-agent system.

For teams evaluating the ROI of agent automation, it is important to look beyond hours saved. Consider the quality of work: are your engineers spending more time on creative problem-solving? Are support ticket response times improving? Is the system learning from each interaction to become more accurate over time? Grab’s example shows that when you invest in thoughtful multi-agent system design, the benefits compound across the organisation.

By studying how Grab separated workflows, consolidated tools, embedded safety measures, managed context, and used a supervisor for orchestration, you can apply these principles to your own projects. Whether you are automating data platform support, incident response, or any other engineering task, these five tips provide a solid foundation for building a system that is both powerful and trustworthy. The result is not just reclaimed engineering hours—it is a team that can focus on the work that truly moves the platform forward.