Tool Calling Feedback That Sharpens AI Agent Performance

Tool-calling agents are getting smarter, but they still need help making the right decisions during execution. That’s where inference-time feedback comes in — a method that evaluates decisions as they happen. A recent paper accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026 introduces a practical way to improve tool calling feedback. Instead of waiting until after execution, a specialized reviewer agent checks provisional tool calls before they are made.

This creates a clean separation of responsibilities. The primary execution agent focuses on taking action, while the reviewer agent checks each provisional call before it runs. Errors are corrected before they reach an external system, at the cost of one extra review step per call.

Architecture: Separation of Execution and Review Agents

This two-agent structure works because each role is clearly defined. The execution agent handles the task at hand, while the review agent checks the work before it reaches the outside world. By splitting these responsibilities, you avoid the common pitfall where a single agent tries to both act and self-correct — a process that often leads to missed errors or slower responses.

The Role of the Execution Agent

The execution agent is designed to move quickly. Its job is to generate provisional tool calls based on the current context and user request. It focuses on taking action, not on second-guessing itself. This speed is useful, but it comes with a trade-off: the execution agent may select the wrong tool, pass incorrect parameters, or misinterpret the scope of the request. That’s where the second agent comes in.

The Role of the Review Agent

The review agent evaluates every provisional tool call before it is executed. It checks three key areas: tool selection, parameter accuracy, and scope recognition. If the execution agent picks a calculator tool when a database query is needed, the reviewer catches it. If the parameters are off by a decimal point, the reviewer flags it. If the request falls outside what any available tool can do, the reviewer flags that no call should be made. This tool call validation step acts as a safety net, preventing bad calls from ever reaching the external system.

This separation of concerns is the foundation of reliable tool calling feedback. The execution agent stays fast and focused, while the review agent provides a quality gate. Together, they form an agent architecture that reduces error propagation — a mistake in the execution step is caught before it can cause cascading failures. The result is a system that balances speed with accuracy, giving you dependable outputs without constant manual oversight.
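To make the propose-review-execute gate concrete, here is a minimal sketch. It is not the paper's implementation: the names `ToolCall`, `Verdict`, and `run_with_review` are hypothetical, and the stub functions stand in for what would be LLM calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    tool: str
    args: dict


@dataclass
class Verdict:
    approved: bool
    feedback: str = ""


def run_with_review(
    request: str,
    propose: Callable[[str, str], ToolCall],
    review: Callable[[str, ToolCall], Verdict],
    execute: Callable[[ToolCall], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Execute a tool call only after the reviewer approves it."""
    feedback = ""
    for _ in range(max_rounds):
        call = propose(request, feedback)  # provisional call, not yet executed
        verdict = review(request, call)    # gate: runs before any side effect
        if verdict.approved:
            return execute(call)
        feedback = verdict.feedback        # hand the critique back to the executor
    return None  # no approved call within the budget; escalate to a fallback


# --- Stub agents standing in for LLM calls (assumptions, illustrative only) ---
def stub_propose(request: str, feedback: str) -> ToolCall:
    if not feedback:
        return ToolCall("calculator", {"expression": "count(orders)"})
    return ToolCall("db_query", {"table": "orders", "filter": "status = 'open'"})


def stub_review(request: str, call: ToolCall) -> Verdict:
    # Tool selection check
    if "orders" in request and call.tool != "db_query":
        return Verdict(False, "Wrong tool: this request needs a database query.")
    # Parameter check
    if call.tool == "db_query" and not {"table", "filter"} <= set(call.args):
        return Verdict(False, "Missing required parameters: table, filter.")
    return Verdict(True)


executed = []  # records which tools actually ran


def stub_execute(call: ToolCall) -> str:
    executed.append(call.tool)
    return f"ran {call.tool}"


result = run_with_review(
    "How many open orders are there?", stub_propose, stub_review, stub_execute
)
```

In this sketch the rejected calculator call never reaches `stub_execute`; only the approved database query runs. The `max_rounds` cap is a design choice of the sketch, added so a persistently strict reviewer cannot stall the loop forever.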

Measuring Feedback Quality: Helpfulness and Harmfulness Metrics

That balance between speed and accuracy sounds great in theory, but how do you know if your feedback loop is actually improving things — or making them worse? You need a way to measure the real impact of that inference-time review. That’s where the benefit-risk ratio comes into play. The paper introduces two straightforward feedback metrics called Helpfulness and Harmfulness to quantify exactly this tradeoff.

Helpfulness measures the percentage of base agent errors that your tool calling feedback successfully corrects. In other words, it tells you how often the reviewer catches a mistake and fixes it before the output reaches you. A high Helpfulness score means your feedback system is doing its job — it’s error correction at work, reducing the need for you to manually spot issues.

Harmfulness, on the other hand, tracks the percentage of correct responses that the feedback process degrades. Sometimes a reviewer misinterprets a perfectly good answer and modifies it for the worse. That’s a degradation, and Harmfulness captures how often that happens. You want this number as low as possible.

Together, these two metrics give you a clear view of the benefit-risk ratio for any feedback configuration. They guide critical decisions like model selection and prompt optimization. For example, if one reviewer model shows high Helpfulness but also high Harmfulness, you might need to adjust its prompts or switch to a more conservative reviewer. By tracking both numbers, you can tune your system to maximize corrections while minimizing new errors — making your tool calling feedback practical and reliable.
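As a rough illustration, both metrics can be computed from paired outcomes: whether each example was correct without feedback and whether it was correct with feedback. The function name and the sample data below are made up for the example, and defining the benefit-risk ratio as Helpfulness divided by Harmfulness is an assumption; the paper may define it differently.

```python
from typing import Dict, Iterable, Tuple


def feedback_metrics(outcomes: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Each outcome is (correct_without_feedback, correct_with_feedback)."""
    base_wrong = base_right = fixed = broken = 0
    for before, after in outcomes:
        if before:
            base_right += 1
            broken += int(not after)  # a correct response that feedback degraded
        else:
            base_wrong += 1
            fixed += int(after)       # a base error that feedback corrected
    helpfulness = fixed / base_wrong if base_wrong else 0.0
    harmfulness = broken / base_right if base_right else 0.0
    # Assumption: ratio of the two rates; infinite when nothing was degraded.
    ratio = helpfulness / harmfulness if harmfulness else float("inf")
    return {
        "helpfulness": helpfulness,
        "harmfulness": harmfulness,
        "benefit_risk_ratio": ratio,
    }


# Made-up evaluation run: 4 base errors (2 fixed), 8 correct responses (1 degraded).
outcomes = (
    [(False, True)] * 2
    + [(False, False)] * 2
    + [(True, True)] * 7
    + [(True, False)] * 1
)
metrics = feedback_metrics(outcomes)
```

Keeping the two rates separate, rather than reporting only net accuracy, is what lets you see a reviewer that fixes many errors while also breaking many correct answers.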

Benchmark Evaluation: BFCL and τ2-Bench Results

That kind of practical tuning only matters if the underlying approach actually delivers results. To find out, the method was put through two well-known benchmark evaluations. The first, BFCL (the Berkeley Function Calling Leaderboard), focuses on single-turn tool calls where the agent gets one shot to pick the right function. The second, τ2-Bench, tests multi-turn tasks in stateful scenarios, the kind where context builds across several interactions and the agent must remember what happened earlier.

On BFCL, the approach improved irrelevance detection by 5.5 percent. In plain terms, your system gets noticeably better at recognizing when none of the available tools fits what the user actually asked for, so no call should be made. That might sound small, but catching irrelevant calls early prevents wasted processing and avoids confusing outputs downstream.

The τ2-Bench results were even more striking: a 7.1 percent improvement on multi-turn tasks. Stateful scenarios plausibly benefit more from iterative feedback because each turn gives the reviewer another chance to catch a mistake before it carries into later turns. Instead of one unchecked call shaping the rest of the conversation, every step passes through the same quality gate.

The pattern across both benchmarks is consistent. Whether you are building a simple single-turn assistant or a complex multi-turn agent, reviewer feedback improves tool-calling accuracy. The BFCL results show you can catch irrelevant calls before they waste time. The τ2-Bench results show you can handle longer conversations where maintaining state and context matters most. For anyone evaluating AI agent pipelines, this benchmark evaluation offers solid evidence that iterative feedback pays off, especially when the conversation gets complicated.

Model Impact: Reasoning vs Standard Models

While iterative feedback loops clearly improve tool-calling behavior, the scale of that improvement hinges on the AI model you choose. Different base models respond to reviewer corrections in markedly different ways. The reasoning model o3-mini achieved a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o during benchmark evaluations. A higher ratio means more benefit from corrected errors for each unit of harm from degraded correct responses. Reasoning models like o3-mini seem to absorb and apply feedback more selectively, so the net benefit from each review cycle is higher.

If you are integrating tool calling feedback into your pipeline, opting for a reasoning model could amplify the gains while limiting the risks. GPT-4o’s lower ratio suggests that its feedback integration was less precise: the same reviewer feedback produced a smaller net benefit. This difference matters because tool calling feedback isn’t just about making corrections; it’s about making the right corrections with minimal disruption to responses that were already correct.

So when planning your AI agent architecture, consider the model’s reasoning capability. It directly influences how much you get out of each review cycle. For tasks that involve complex decision-making or multi-step tool use, a reasoning model like o3-mini likely provides a better foundation for your feedback mechanism.

Automated Prompt Optimization with GEPA

So, you have a solid reviewer agent in place, and you’re seeing the benefits of its feedback. But what if you could push that performance even further without retraining your entire base model? That’s where automated prompt optimization comes into play. A technique called GEPA offers a lightweight, complementary boost to your existing setup.

Think of GEPA as a tool for automated tuning. Instead of you manually tweaking the reviewer agent’s instructions, GEPA systematically refines the reviewer’s prompt. This process targets the tool calling feedback loop itself, making your agent’s evaluations more precise and consistent. The result is an additional +1.5–2.8% improvement in overall performance, a solid gain for a method that requires no heavy model retraining.

Here’s the practical takeaway: GEPA works best as an inference-time enhancement. It doesn’t replace your core feedback mechanism; it sharpens it. By automating the prompt adjustments, you reduce the guesswork in setting up your reviewer. This is especially useful for complex tasks where multi-step tool use demands clear, structured feedback. A few key steps to get started:

1. Audit your existing prompts for clarity and specificity.
2. Run GEPA on a small sample of your agent’s outputs.
3. Integrate the refined prompts into your feedback loop.

This approach keeps your workflow efficient and practical, giving you a reliable lift without the overhead of full model retraining. For anyone looking to maximize agent improvement with minimal friction, GEPA is a straightforward next step.
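The sketch below shows only the scoring-and-selection step that any prompt optimizer needs: evaluate candidate reviewer prompts on a small labeled sample and keep the one with the best net benefit. It is not GEPA itself, which also generates new candidate prompts, and it does not use the GEPA library's API. Every name here, including `toy_reviewer`, is hypothetical, and the toy reviewer is a stand-in for an LLM call.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, bool]              # (provisional call, is the call actually correct?)
Reviewer = Callable[[str, str], bool]  # (prompt, call) -> approved?


def score_prompt(prompt: str, samples: List[Sample], reviewer: Reviewer) -> int:
    """Net benefit: bad calls rejected minus good calls rejected."""
    score = 0
    for call, is_correct in samples:
        if not reviewer(prompt, call):
            score += -1 if is_correct else 1
    return score


def pick_best_prompt(
    candidates: List[str], samples: List[Sample], reviewer: Reviewer
) -> str:
    return max(candidates, key=lambda p: score_prompt(p, samples, reviewer))


def toy_reviewer(prompt: str, call: str) -> bool:
    """Assumption: a rule-based stand-in for an LLM reviewer following `prompt`."""
    if "reject everything" in prompt:
        return False
    if "check units" in prompt:
        return "unit=?" not in call
    return True


samples = [
    ("convert(5, unit=?)", False),  # malformed call the reviewer should block
    ("convert(5, unit=km)", True),
    ("lookup(id=7)", True),
]
candidates = [
    "approve all calls",
    "reject everything",
    "check units before approving",
]
scores = {p: score_prompt(p, samples, toy_reviewer) for p in candidates}
best = pick_best_prompt(candidates, samples, toy_reviewer)
```

Scoring on net benefit rather than on rejections alone penalizes an overly strict prompt, which mirrors the Helpfulness and Harmfulness tradeoff described earlier.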

Tradeoffs and Limitations: Latency, Cost, and Error Modes

Adding a reviewer agent to your tool-calling workflow is not free. While it can catch mistakes and improve reliability, you need to weigh the practical costs. The biggest tradeoff is speed: every time your AI proposes a tool call, execution must pause while the reviewer analyzes the trajectory and decides whether to approve or reject it. This latency tradeoff can slow down real-time applications, especially if you’re handling many calls in quick succession.

There’s also the computational cost of running that extra LLM inference. You’re adding at least one more model call for each tool call, which can come close to doubling the inference load. The paper does not provide hard numbers on how much slower or more expensive this makes things; that will depend heavily on your model choice, API pricing, and request volume. For low-stakes tasks, the overhead might not be worth it. For critical operations, the reliability gain may justify the extra spend.
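Since the paper gives no figures, a back-of-the-envelope estimate with your own numbers is the practical substitute. The sketch below assumes one review per tool call and a flat cost per failed call. All parameter names and the example values are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ReviewerOverhead:
    added_latency_s: float
    added_cost_usd: float
    expected_savings_usd: float

    @property
    def worth_it(self) -> bool:
        # Purely financial view; ignores latency and non-monetary harm.
        return self.expected_savings_usd > self.added_cost_usd


def estimate_overhead(
    tool_calls: int,
    review_latency_s: float,
    review_cost_usd: float,
    errors_prevented_per_call: float,
    cost_per_failure_usd: float,
) -> ReviewerOverhead:
    """Assumes exactly one review per tool call (no re-review after a rejection)."""
    return ReviewerOverhead(
        added_latency_s=tool_calls * review_latency_s,
        added_cost_usd=tool_calls * review_cost_usd,
        expected_savings_usd=(
            tool_calls * errors_prevented_per_call * cost_per_failure_usd
        ),
    )


# Placeholder inputs: replace with figures from your own deployment.
estimate = estimate_overhead(
    tool_calls=1000,
    review_latency_s=0.5,
    review_cost_usd=0.002,
    errors_prevented_per_call=0.04,
    cost_per_failure_usd=0.25,
)
```

The estimate is deliberately crude: rejected calls trigger a second proposal and review, so real overhead will be somewhat higher than this lower bound.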

Beyond speed and money, you should watch for failure modes. A reviewer agent can overcorrect: it might reject a perfectly valid tool call because its own judgment is too strict. Conversely, it can miss errors entirely, especially if the mistake is subtle or the reviewer’s reasoning is flawed. Remember, too, that the reviewer never runs the tool; it judges the proposed call and the trajectory so far before anything executes. This means it can’t catch problems that only appear during live execution, like timing issues or partial failures. Weigh these deployment considerations carefully before adding a reviewer to your pipeline.

Frequently Asked Questions

How does the reviewer agent evaluate provisional tool calls before execution?

The reviewer agent inspects each proposed tool call against the original user request and the conversation context. It checks that the selected tool is appropriate for the task, the parameters are logically consistent, and the call stays within the intended scope. This evaluation happens without executing the tool, acting as an inference-time gate.

What is the tradeoff in terms of latency or cost when adding a reviewer agent to the execution loop?

Adding a reviewer agent increases processing time for each tool call, as the system must run an additional model inference. This makes each response slower by roughly one reviewer inference per tool call, but it often reduces errors that would require retries or corrections. The cost tradeoff involves paying for the reviewer’s computation against the potential savings from avoiding failed or harmful tool executions.

Can the reviewer agent be improved without retraining the base agent, and how?

Yes, you can refine the reviewer agent independently by updating its instructions, adding new validation rules, or adjusting its evaluation criteria. This separation of concerns makes it practical to enhance tool calling feedback without modifying the underlying execution model. You can also swap the reviewer for a different, more capable model as needed.