Picture this: you run your AI code review tool on a pull request. It responds with a clean “LGTM.” You merge with confidence. A few hours later, production catches fire: a silent regression has slipped through and is corrupting data for real users. This scenario repeats more often than developers realize. A single pass of AI review can perform worse than a tired human’s first glance, not because the model lacks intelligence, but because the review process itself is broken.

The fix isn’t switching to a better model. The fix is changing how you use the one you already have. Instead of one shallow read, you need multiple structured passes — what we call code review loops. These loops force the AI to examine your diff from different angles, catching the expensive bugs that a single pass waves through. This article explains why one pass fails, how to run three focused review loops, and why forbidding the word “LGTM” is the single most powerful tweak you can make.
Why Single-Pass AI Reviews Fail
When you feed a diff to a current-generation AI model, it performs roughly the same scan a junior developer does on a first read. It spots obvious smells: wrong indentation, unused variables, a missing await. The low-hanging fruit. The dangerous bugs — the ones that cost money and sleep — hide elsewhere. They live in cross-file invariants, race conditions, silent regressions, and security holes that look like features.
A single review pass treats the diff as a closed system. Given only the changed lines, the model has no view of how a change in auth.ts quietly breaks an assumption in billing.ts. Unprompted, it rarely reasons about two requests hitting the same database row simultaneously, or notices that a refactor preserves behavior in 99% of cases but corrupts data in the remaining 1%. And because the AI defaults to agreement when nothing obvious screams at it, it outputs “LGTM” (the polite default) and stops thinking.
This is the trap. You ship believing you have a rigorous review, but you only got a surface scan. According to a 2024 study by researchers at Microsoft, single-pass AI reviews missed roughly 37% of functional bugs that multi-pass reviews caught. The model isn’t dumb. It just never got asked the right questions.
The Three-Loop Code Review Method
Think about how a senior engineer reviews code. They don’t read the diff once and approve. They read it with different mental hats. First, they understand what the change does. Then they ask about system-wide impact. Then they imagine failure modes. Then they check for leaks. Finally, they ask how to detect problems in production. That’s five perspectives, but we can group them into three powerful code review loops. Each loop uses a fresh context window with no memory of previous answers. Each loop forces the model to find something specific or explicitly state nothing applies.
Loop 1: Behavior and Cross-File Impact
The first loop answers the question: “What does this PR actually change, and what outside the diff might break?” You ask the AI to summarize the behavior in plain English, then list specific files or functions outside the diff that depend on the changed logic. This catches the classic cross-file invariant bug — the one where a type change ripples silently through a dozen modules. The summary also helps you see if the model’s interpretation matches your intent.
A concrete prompt for this loop: “Summarize what this diff changes in plain English. Then list at least three files or functions outside the diff that might be affected. Be specific about why.”
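One way to sanity-check the answer to this loop is to confirm that the files the model names really are outside the diff. The sketch below assumes a git-style unified diff with `+++ b/path` headers; the helper names `changed_files` and `outside_diff` are hypothetical, and the list of mentioned paths is assumed to have been pulled from the model's reply by hand or by a separate step.

```python
import re


def changed_files(diff: str) -> set[str]:
    """Return the paths touched by a git-style unified diff ('+++ b/path' headers)."""
    paths = set()
    for line in diff.splitlines():
        match = re.match(r"^\+\+\+ b/(.+)$", line)
        if match:
            paths.add(match.group(1).strip())
    return paths


def outside_diff(mentioned: list[str], diff: str) -> list[str]:
    """Keep only the mentioned paths that the diff does not already touch."""
    touched = changed_files(diff)
    return [path for path in mentioned if path not in touched]
```

If `outside_diff` returns an empty list, the model only restated the diff's own files, which is a sign the cross-file question needs to be asked again with more context.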
Loop 2: Failure Modes and Security Leaks
The second loop tackles two related dangers: inputs that crash the code or corrupt data, and unintended leaks. First, ask the AI to imagine concrete failure scenarios: “Give five specific inputs that would cause this code to crash, hang, or corrupt data.” Then follow with: “Find any new leak of authentication, PII, secrets, internal IDs, or error stack traces.”
This is where the model’s ability to pattern-match against known vulnerabilities shines — but only if you force it to enumerate. A single pass never bothers, because nothing in the diff looks obviously dangerous. But when you demand examples, the AI will catch an ID exposed in a URL that “the frontend needed” — a security hole dressed as a feature.
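Because the value of this loop comes from forcing enumeration, it can help to check that the reply actually contains a list. The following is a rough sketch, assuming the model formats its inputs as numbered or bulleted lines; `count_list_items` and `enumerated_enough` are hypothetical helpers, and the default of five mirrors the prompt above.

```python
import re

# Matches lines that start with '1.', '2)', '-', or '*' followed by text.
LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+\S")


def count_list_items(response: str) -> int:
    """Count lines in the response that look like list items."""
    return sum(1 for line in response.splitlines() if LIST_ITEM.match(line))


def enumerated_enough(response: str, minimum: int = 5) -> bool:
    """True when the response lists at least `minimum` items."""
    return count_list_items(response) >= minimum
```

A reply that fails this check is a candidate for re-asking the question rather than accepting a vague paragraph.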
Loop 3: Observability and Post-Deployment Detection
The third loop shifts focus to production readiness. “If this change is wrong in production, how would we detect it? Are the existing tests and logs sufficient? If not, what should we add?” This loop forces the model to consider what you cannot see in a static diff. Maybe the code silently swallows errors. Maybe a new branch never gets logged. Maybe the test coverage gap is exactly where the bug will occur.
By asking this loop separately, you make the AI think about monitoring and alerting, not just correctness. It’s the difference between shipping code that passes tests and shipping code that survives real traffic.
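As an illustration of the kind of gap this loop tends to surface, compare two versions of the same fallback. This is a made-up example, not taken from any real codebase: the function names and the `billing` logger name are hypothetical.

```python
import logging

logger = logging.getLogger("billing")  # hypothetical logger name


def parse_amount_silent(raw: str) -> float:
    """Swallows bad input: callers see 0.0 and nothing records the failure."""
    try:
        return float(raw)
    except ValueError:
        return 0.0


def parse_amount_observable(raw: str) -> float:
    """Same fallback, but the failure leaves a log record that can drive an alert."""
    try:
        return float(raw)
    except ValueError:
        logger.warning("unparseable amount %r, defaulting to 0.0", raw)
        return 0.0
```

Both versions pass a test that only checks the return value, which is why the question about detection has to be asked separately from the question about correctness.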
Forbidding the “LGTM” Default
The most important line in any of these prompts is the one that bans “LGTM.” Language models are trained to be agreeable. When nothing screams “wrong,” they default to approval. You must actively override that tendency. Include a system instruction: “You are a senior engineer. Be critical. List at least two concerns, even if they are minor. If the change is genuinely safe, explain why — do not simply assert it. No ‘LGTM’ allowed.”
This single change transforms the output. Instead of a one-word approval, you get a paragraph detailing what the model considered, what it found safe, and what edge cases it noted. You also get an audit trail: a record of what the AI considered and dismissed, which you can review later if a bug surfaces. The prompt essentially forces the model to do work instead of pattern-matching to “approve.”
You can take it further. “Rate severity from 1 to 5. If everything is a 1, justify it against the file’s history.” Or “Imagine this PR ships and breaks. What is the post-mortem headline?” These aren’t tricks. They are prompts that change the model’s behavior from passive affirmation to active investigation.
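Even with the ban in the system prompt, a model may occasionally return a bare approval. A small guard can catch that before the reply is accepted. This is a minimal sketch: `is_rubber_stamp` is a hypothetical helper, the list of approval phrases is illustrative, and the 30-word threshold is an arbitrary assumption to tune for your own prompts.

```python
import re

# Replies that consist of nothing but an approval phrase.
APPROVAL_ONLY = re.compile(
    r"^\W*(lgtm|looks good( to me)?|no (issues|concerns)( found)?)\W*$",
    re.IGNORECASE,
)


def is_rubber_stamp(response: str, min_words: int = 30) -> bool:
    """True when the reply is a bare approval or too short to contain reasoning."""
    text = response.strip()
    if APPROVAL_ONLY.match(text):
        return True
    return len(text.split()) < min_words
```

When the guard fires, the simplest response is to re-run that loop, or to flag the pull request for a closer human look.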
Implementing the Three-Loop Review Today
You don’t need a special tool. A few lines of Python and an API key give you a harness that runs these three code review loops automatically. Here’s a minimal example using Anthropic’s Claude:
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"

# One (name, question) pair per review loop.
LOOPS = [
    ("behavior_impact", "Summarize what this diff changes in plain English. List at least three files or functions outside the diff that may break."),
    ("failure_security", "Give 5 concrete inputs that would crash, hang, or corrupt data. Also find any new leak: auth, PII, secrets, internal IDs, stack traces."),
    ("observability", "If this change is wrong in production, how would we detect it? Are existing tests and logs sufficient? If not, what should we add?"),
]

def review(diff: str) -> dict:
    """Run each loop as an independent API call and collect the answers by loop name."""
    findings = {}
    for name, question in LOOPS:
        # Each call starts a new conversation, so no loop sees another loop's answer.
        msg = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system="You are a senior engineer. Be concrete. List at least two concerns. No 'LGTM' allowed.",
            messages=[{"role": "user", "content": f"{question}\n\nDIFF:\n{diff}"}],
        )
        findings[name] = msg.content[0].text
    return findings
Run this on a 200-line PR and three API calls cost roughly six cents on Claude Sonnet. That’s cheaper than a cup of coffee and catches bugs a single-shot reviewer waves through. Each call uses a fresh context window, so no “LGTM” from an earlier pass pollutes the next one. The same approach works with GPT-4, Gemini, or any model that accepts system instructions.
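The cost figure can be reproduced with a back-of-the-envelope calculation. The sketch below rests on stated assumptions rather than measurements: roughly 2,000 input tokens per call for a 200-line diff plus prompt, the full 1,024 output tokens used each time, and per-million-token prices of $3 for input and $15 for output. Check the current price list before relying on these numbers.

```python
def review_cost_usd(
    input_tokens: int,
    output_tokens: int,
    calls: int = 3,
    input_price_per_mtok: float = 3.00,    # assumed USD per million input tokens
    output_price_per_mtok: float = 15.00,  # assumed USD per million output tokens
) -> float:
    """Estimate the total cost of `calls` API calls of the same size."""
    per_call = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return calls * per_call


# Assumed sizes: ~2,000 input tokens and 1,024 output tokens per loop.
estimate = review_cost_usd(input_tokens=2000, output_tokens=1024)
print(f"${estimate:.4f}")
```

Under these assumptions the estimate comes to a little over six cents, and most of that is output tokens, so a tighter `max_tokens` lowers the bill more than a shorter diff does.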
You can integrate this into your CI pipeline. After a human reviews, run the three loops on the same diff. Compare the outputs. If the AI flags something the human missed, you have a conversation worth having. If both are silent, you ship with higher confidence.
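A CI step along these lines might look like the sketch below. It is only an outline under assumptions: the base branch is taken to be `origin/main`, the report format is invented for illustration, and `review_fn` stands in for any function that maps a diff to a dictionary of findings keyed by loop name, such as the `review` function above.

```python
import subprocess
from typing import Callable


def get_diff(base: str = "origin/main") -> str:
    """Return the diff between the base branch and HEAD (base name is an assumption)."""
    result = subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def format_report(findings: dict[str, str]) -> str:
    """Render each loop's answer under its own heading for a CI log or PR comment."""
    sections = [f"== {name} ==\n{text.strip()}" for name, text in findings.items()]
    return "\n\n".join(sections)


def run_review(review_fn: Callable[[str], dict[str, str]], diff: str) -> str:
    """Skip empty diffs; otherwise run the loops and format their answers."""
    if not diff.strip():
        return "No changes to review."
    return format_report(review_fn(diff))
```

In a pipeline, the job would call `run_review(review, get_diff())` and print or post the result. Passing the review function in as a parameter keeps the formatting logic testable without an API key.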
What This Fixes in Your Workflow
Adopting multi-pass code review loops shifts your mental model of AI from oracle to checklist. The model is not an all-knowing judge. It is a junior engineer with infinite stamina and zero ego. Give it a single open-ended question and it will guess “fine.” Give it a structured list of specific concerns to check, and it will actually check them.
For solo developers and small teams without dedicated reviewers, these loops replace the missing second and third pairs of eyes. Each loop forces the model to imagine failure, spread attention across the codebase, and leave a written audit trail. When a bug does sneak through, you can look at the three answers and see exactly where the review gap was.
The deeper lesson is that AI review quality depends far more on process than on model capability. A better model fed a single prompt still returns a narrow answer. A mid-tier model fed multiple structured prompts returns deep analysis. Do the math: three focused questions cost pennies and catch cross-file regressions, security leaks, and observability gaps that a single pass never even sees.
One pass isn’t enough. Three loops, each with a fresh perspective and a ban on polite approval, turn your AI reviewer from a rubber stamp into a real safety net. Try it on your next pull request. You might be surprised what the model finds when you force it to actually look.






