AI Agent Long Task Failures: Microsoft Study

Companies exploring automated workflows would be well advised to keep their AI agents on a short leash. Microsoft researchers have found that even the priciest frontier models introduce errors in long workflows, directly challenging the core promise of autonomous AI software. This discovery lands as a sobering reality check for the long-running AI agent narrative that has dominated recent tech marketing. The costs of failure are not hypothetical. They represent lost time, corrupted data, and eroded trust in automation.

The Grand Vision Meets the Cold Hard Data

Anthropic markets Claude Cowork as a tool that “handles tasks autonomously,” returning a finished deliverable from a simple goal. Similarly, Microsoft promotes its 365 Copilot for “complex, multistep research across your work data and the web.” These are compelling pitches for anyone drowning in repetitive digital chores.

Yet the Windows maker’s own scientists are not so convinced. Philippe Laban, Tobias Schnabel, and Jennifer Neville from Microsoft Research set out to study what happens when large language models (LLMs) handle multistep tasks. Their findings, published in a preprint titled “LLMs Corrupt Your Documents When You Delegate,” paint a starkly different picture. The gap between marketing promises and scientific evidence is wide enough to swallow a budget whole.

How They Tested: The DELEGATE-52 Benchmark

To test how LLMs handle long-running knowledge work tasks, the researchers devised a benchmark called DELEGATE-52. It simulates multistep workflows across 52 professional domains, from writing code and crystallography to music notation. This is a far more taxing test than sorting a spreadsheet, which should be table stakes for any aspiring workflow agent.

In the accounting domain, for instance, the challenge begins with a seed document representing the accounting ledger of Hack Club, a nonprofit organization. The model must split the seed document into separate category-based files and then merge these chronologically back into a single file. The benchmark specifically targets the cumulative error problem inherent in any long task an AI agent executes. It measures not just whether the model can act, but whether it can sustain coherence over time.
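To make the split-then-merge round trip concrete, here is a minimal Python sketch of what a lossless version of that task looks like when done deterministically. The ledger rows, field layout, and function names are illustrative assumptions, not the benchmark's actual data or scoring code.

```python
from collections import defaultdict

# Hypothetical ledger rows: (ISO date, category, amount).
# These values are made up for illustration; they are not from DELEGATE-52.
LEDGER = [
    ("2024-01-05", "donations", 500.0),
    ("2024-01-09", "hosting", -42.0),
    ("2024-02-01", "donations", 120.0),
    ("2024-02-03", "events", -310.5),
]


def split_by_category(rows):
    """Forward step: group the rows into one list per category."""
    files = defaultdict(list)
    for row in rows:
        files[row[1]].append(row)
    return dict(files)


def merge_chronologically(files):
    """Inverse step: recombine every row and sort by date (then category, amount)."""
    merged = [row for rows in files.values() for row in rows]
    return sorted(merged)


def round_trip_is_lossless(rows):
    """True if splitting and merging returns exactly the original rows."""
    return merge_chronologically(split_by_category(rows)) == sorted(rows)


print(round_trip_is_lossless(LEDGER))
```

A deterministic script passes this check trivially. The study's point is that an LLM asked to perform the same two steps by editing text may not.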

The Results: A 25 Percent Content Loss for Frontier Models

“Our findings show that current LLMs introduce substantial errors when editing work documents, with frontier models losing on average 25 percent of document content over 20 delegated interactions,” the authors report. The average degradation across all models was a staggering 50 percent. That means half the work product disappeared or was mangled.

The authors found that LLMs performed better on programming tasks and worse on natural language tasks. For a model to be considered “ready” for a given work domain, the researchers required a score of 98 percent or higher after 20 interactions. Only one domain qualified: Python programming. For every other domain, the authors found LLMs fell short of “ready.”

The study found that “catastrophic corruption,” meaning a benchmark score of 80 percent or less, occurred in more than 80 percent of model/domain combinations. The best performing model, Google Gemini 3.1 Pro, achieved readiness for only 11 of the 52 domains. That is a roughly one-in-five success rate for the most capable system on the market.

Deletion vs. Corruption: A Tale of Two Failure Modes

In weaker models, degradation took the form of content deletion. Important sections of the document simply vanished. In frontier models, the failure mode shifted to content corruption. The information was still there, but it was wrong, rearranged, or nonsensical. Both outcomes are unacceptable for professional use.

And when errors occurred, they tended to happen all at once, resulting in the loss of 10 to 30 points in a single round-trip interaction. They did not accumulate gradually over the entire test run. This makes the behavior unpredictable and dangerous for long-running processes.

“The stronger models aren’t avoiding small errors better, they delay critical failures to later rounds and experience them in fewer interactions,” the researchers observe in their paper. This means that a model performing well in early tests provides no guarantee that it won’t catastrophically fail on the next step of a long agent task.

When Agents Make Things Worse

The Microsoft authors went on to test how agents — LLMs given access to file reading, writing, and code execution through a basic harness — handle the DELEGATE-52 benchmark. Tools in this instance did not help.

“The four tested models perform worse when operated agentically with tools than without, incurring an average additional degradation of 6 percent by the end of simulation,” the authors observe in reference to GPT-5.4, 5.2, 5.1, and 4.1.

Given that task delegation is the whole point of an AI agent, this casts a shadow on the AI hype train. An intern who corrupted a quarter of a document over a long workflow would be shown the door. Yet organizations are spending an average of 36 percent of their digital budgets on AI automation, according to Deloitte. That might make sense if arming LLMs with tools meant less degradation, but the data shows the opposite trend.

The Cumulative Latency Problem: The Root Cause

Why do LLMs fail so spectacularly on long workflows? The root issue is what researchers call the Cumulative Latency Problem. Each interaction adds a small chance of misinterpretation or hallucination. Over 20 interactions, these risks compound exponentially.
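A rough way to see how per-step risk compounds is to assume each interaction fails independently with a fixed probability. Both the independence assumption and the failure rates below are illustrative, not figures from the study.

```python
def survival_probability(per_step_failure: float, steps: int) -> float:
    """Chance that no step fails, assuming every step fails independently."""
    return (1.0 - per_step_failure) ** steps


# Illustrative per-step failure rates (assumptions, not measured values).
for p in (0.01, 0.05, 0.10):
    clean = survival_probability(p, 20)
    print(f"per-step failure {p:.0%}: chance of a clean 20-step run = {clean:.2f}")
```

Under these assumptions, even a 5 percent chance of a critical failure per step leaves only about a one-in-three chance of finishing 20 steps cleanly.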

Unlike human workers who can maintain a mental model of their task for hours, LLMs operate within a finite attention window. As the workflow lengthens, the model struggles to maintain consistency across all the modifications it has made. The research shows that errors often happen all at once, with scores dropping 10 to 30 points in a single interaction. This suggests the model reaches a threshold where it fundamentally loses track of the document’s state.

The study also reveals a hidden danger: early performance does not predict final outcomes. LLM performance after two interactions looks stable, but performance after 20 interactions is a gamble. This makes it nearly impossible to trust a model based on a short trial run. Users still need to closely monitor LLM systems as they operate and complete tasks on their behalf.

Navigating the Agentic Era: Practical Steps for Automation Teams

Given the findings, businesses must adopt a cautious, evidence-based approach. Here are concrete strategies informed by the DELEGATE-52 study.

1. Implement Rigorous Checkpoints

Do not let an AI agent run for 20 interactions without human review. Use automated diff tools to compare document states after each step. The research shows scores can drop by 10 to 30 points in a single interaction, so frequent validation catches problems early. Treat it like code review for document workflows.
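As one possible shape for such a checkpoint, the sketch below uses Python's standard difflib module to score how much a document changed in a step and to produce a diff for human review. The function names and the 0.9 threshold are assumptions for illustration. A sensible threshold depends on how much change each step is expected to make.

```python
import difflib


def similarity(before: str, after: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two texts are identical."""
    return difflib.SequenceMatcher(None, before, after).ratio()


def checkpoint(before: str, after: str, min_similarity: float = 0.9):
    """Return (ok, diff_text). ok is False when the step changed too much.

    min_similarity is a hypothetical default, not a value from the study.
    """
    diff_lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    )
    return similarity(before, after) >= min_similarity, "\n".join(diff_lines)


before = "line one\nline two\nline three\n"
after = "line one\n"  # simulated step that silently dropped two lines

ok, diff_text = checkpoint(before, after)
if not ok:
    print("Step flagged for human review:")
    print(diff_text)
```

A similarity score only catches large deletions or rewrites. Subtle corruption, such as a changed figure in a ledger, still needs domain-specific checks or a human reading the diff.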

2. Match the Model to the Domain

The study found that only Python programming met the readiness threshold. For other domains, expect failure. Use structured, well-defined output formats like JSON or code before attempting free-form natural language generation. If your task is highly creative or open-ended, expect higher corruption rates.
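One practical benefit of structured formats is that output can be checked mechanically before it is accepted. The sketch below validates a model's JSON output against a small set of required fields using only the standard library. The field names and the ledger-entry shape are hypothetical, chosen to echo the accounting example.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"date": str, "category": str, "amount": (int, float)}


def parse_entries(raw: str):
    """Parse model output as a JSON array of entries; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of entries")
    for index, entry in enumerate(data):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not an object")
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(entry.get(field), expected_type):
                raise ValueError(f"entry {index}: missing or invalid field {field!r}")
    return data


entries = parse_entries('[{"date": "2024-01-05", "category": "donations", "amount": 500}]')
print(len(entries), "entries accepted")
```

Validation of this kind confirms the shape of the output, not its correctness. A well-formed entry with the wrong amount would still pass.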

3. Use Agents as Assistants, Not Delegates

Treat the agent as a collaborator that requires guidance, not an autonomous employee. The paper’s title, “LLMs Corrupt Your Documents When You Delegate,” is a direct warning against handing over complete authority. Keep a human in the loop for validation and decision making.

4. Test with DELEGATE-52

Use the benchmark published by Microsoft Research to evaluate any model’s fitness for your specific domain. If the model fails the benchmark, it will likely fail your real-world data as well. This gives you an objective baseline before committing production workloads.

5. Start with Small Bets, Not Big Exposures

Start with low-risk tasks. Validate thoroughly before scaling to mission-critical data. The 25 percent content loss observed in frontier models is a serious financial risk. Protect your core assets by isolating AI experiments from production records until you have confidence in the outcomes.