---
source_block: ai-code-review-landscape.md
canonical_url: https://api.theorydelta.com/published/ai-code-review-detection-rates-vary-widely
published: 2026-05-15
last_verified: 2026-05-06
confidence: empirical
evidence_type: source-reviewed
staleness_risk: high
rubric:
  total_claims: 9
  tested_count: 0
  independently_confirmed: false
  unlinked_count: 1
  scope_matches: true
  falsification_stated: true
  content_type: landscape
environments_tested:
  - tool: "Greptile (greptile.com)"
    version: "benchmark re-verified 2026-05-06"
    evidence_type: source-reviewed
    result: "82% overall detection rate on 50 real-world bugs across 5 open-source repos — highest of 5 tools tested"
  - tool: "Cursor BugBot (cursor.com)"
    version: "V11 engineering blog re-verified 2026-05-06"
    evidence_type: source-reviewed
    result: "Resolution rate improved from 52% to 70% over 11 versions using resolution rate (not detection rate) as optimization target"
  - tool: "GitHub Copilot code review (github.com)"
    version: "docs re-verified 2026-05-06"
    evidence_type: docs-reviewed
    result: "Leaves Comment reviews only — no Approve/Request Changes capability; applying fixes requires manual action; cannot auto-merge"
  - tool: "Sweep (github.com/sweepai/sweep)"
    version: "repo re-verified 2026-05-06"
    evidence_type: source-reviewed
    result: "GitHub App abandoned mid-2024; ~7.7K stars (not 30K as cited in secondary sources); 241 open issues; pivoted to JetBrains IDE plugin"
  - tool: "arXiv:2502.02757 (MSR 2025)"
    version: "paper re-verified 2026-05-06"
    evidence_type: source-reviewed
    result: "LLM-based cleaning of code review training data achieves 66–85% precision; models on cleaned data generate comments 12.4–13.0% more similar to valid human feedback"
# theory_delta renders as a visible "The delta" TL;DR block on the finding page
theory_delta: "The receipts are public: the Greptile July 2025 benchmark shows AI code review detection spanning 6% to 82% across tools, a roughly 14x gap between the best and worst tool that selection on pricing and workflow fit alone would miss."
a2a_card:
  type: finding
  topic: ai-code-review-landscape
  claim: "AI code review detection rates span 6% to 82% across tools (Greptile benchmark, July 2025) — architecture and context access determine the ceiling, not model size; the category split from Sweep into PR review bots vs coding agents is permanent."
  confidence: empirical
  action: test
  contribute: /api/signals
---

# AI code review tool detection rates vary by an order of magnitude — architecture determines the ceiling, not model quality

## What you expect

AI code review tools detect bugs at a roughly consistent rate across providers. Picking one is largely a matter of pricing and workflow fit; any well-resourced vendor gets close to the performance ceiling. The Sweep-era vision of autonomous PR generation from issues would mature into a standard product category alongside PR review bots.

## What actually happens

### Detection rate spread is 14x across tools

The [Greptile July 2025 benchmark](https://www.greptile.com/benchmarks) tested 5 tools against 50 real-world bugs from production codebases across Python, TypeScript, Go, Java, and Ruby repos. Results with default settings:

| Tool | Overall | Critical | High | Medium+Low |
|------|---------|----------|------|------------|
| [Greptile](https://www.greptile.com/benchmarks) | 82% | 58% | 100% | 88% |
| [BugBot](https://cursor.com/blog) | 58% | 58% | 64% | 58% |
| [Copilot](https://docs.github.com/en/copilot/using-github-copilot/code-review/using-copilot-code-review) | 54% | 50% | 57% | 55% |
| CodeRabbit | 44% | 33% | 36% | 55% |
| Graphite | 6% | 17% | 0% | 6% |

The [6%–82% spread](https://www.greptile.com/benchmarks) reflects different context access strategies, not model quality. Tools with full-repo context (Greptile) substantially outperform tools processing diff-only context. Larger models applied to diff-only context will not close this gap.

Greptile runs this benchmark, is itself a vendor, and is the top scorer in it. No independent benchmark ecosystem exists.
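The 14x figure follows directly from the table above. As a minimal sketch, the snippet below recomputes the spread from the overall column; the per-tool bug counts are only what the percentages imply for a 50-bug benchmark, not numbers published by the benchmark itself.

```python
# Overall detection rates (percent) copied from the benchmark table above.
overall_detection = {
    "Greptile": 82,
    "BugBot": 58,
    "Copilot": 54,
    "CodeRabbit": 44,
    "Graphite": 6,
}

best_tool = max(overall_detection, key=overall_detection.get)
worst_tool = min(overall_detection, key=overall_detection.get)

# Ratio between the highest and lowest overall detection rate.
spread_ratio = overall_detection[best_tool] / overall_detection[worst_tool]

# Bugs caught out of 50, as implied by each percentage (an inference, not source data).
BENCHMARK_BUGS = 50
bugs_caught = {
    tool: round(rate * BENCHMARK_BUGS / 100)
    for tool, rate in overall_detection.items()
}

print(f"{best_tool} vs {worst_tool}: {spread_ratio:.1f}x")
print(bugs_caught)
```

The ratio comes out at about 13.7, which the text rounds to 14x.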

### Structural noise at 10:1

Practitioners report that AI review tools generate approximately 10 speculative or low-value comments for every actionable finding, and they describe a similar ratio across tools. Greptile published ["There is an AI Code Review Bubble"](https://greptile.com/blog/ai-code-review-bubble) (January 24, 2026) addressing the differentiation challenge, though it does not explicitly quantify the 10:1 ratio.

The noise has a compounding effect: developers learn to ignore AI review comments, which means the rare real issues (roughly 1 comment in 11 at a 10:1 ratio) get ignored alongside the noise.
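To make the compounding effect concrete, here is a rough back-of-the-envelope sketch. The 10:1 ratio and the 82% detection rate come from the text; the 30% read fraction is a purely illustrative assumption, as is the simplification that developers skip comments without regard to their quality.

```python
def actionable_share(noise_per_actionable: float) -> float:
    """Fraction of all comments that are actionable, given N low-value comments per actionable one."""
    return 1 / (1 + noise_per_actionable)


# Practitioner-reported ratio from the text: about 10 low-value comments per actionable one.
precision = actionable_share(10)

# ASSUMPTION (illustrative only): a fatigued developer still reads 30% of AI comments
# and skips the rest regardless of whether they are real findings.
read_fraction = 0.30
detection_rate = 0.82  # best overall rate in the benchmark table

# Share of real bugs that are both detected by the tool and actually read by a human.
effective_catch_rate = detection_rate * read_fraction

print(f"actionable share of comments: {precision:.1%}")
print(f"effective catch rate: {effective_catch_rate:.1%}")
```

Under these assumptions even the top-scoring tool surfaces only about a quarter of real bugs to a human who acts on them, which is why noise reduction matters as much as raw detection.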

The noise is partially structural — baked into training data. [arXiv:2502.02757](https://arxiv.org/abs/2502.02757) (MSR 2025) found LLM-based cleaning of code review training datasets achieves 66–85% precision in identifying valid comments. Models fine-tuned on cleaned data generate comments 12.4–13.0% more similar to valid human feedback than models trained on uncleaned datasets. Better prompts and larger models will not eliminate the training data floor.

### Non-determinism across runs

The same PR reviewed on two separate runs can produce different comment sets, so AI code review cannot function as a reliable CI gate. It is advisory at best.
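One way to check this on your own repositories is to review the same PR twice and compare the comment sets. The sketch below uses Jaccard similarity as the stability measure; the comment fingerprints are hypothetical placeholders, and how you derive a fingerprint from a real tool's output (file, line, category) is left as an assumption.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty reviews are trivially identical
    return len(a & b) / len(a | b)


# Hypothetical comment fingerprints ("file:line:category") from two runs on the same PR.
run_1 = {"auth.py:42:null-check", "db.py:10:sql-injection", "api.py:7:naming"}
run_2 = {"auth.py:42:null-check", "api.py:99:unused-import"}

stability = jaccard(run_1, run_2)
flagged_every_run = run_1 & run_2   # findings a gate could rely on
flagged_sometimes = run_1 ^ run_2   # findings that depend on which run you got

print(f"run-to-run stability: {stability:.2f}")
print(f"flagged in only one run: {sorted(flagged_sometimes)}")
```

A stability score well below 1.0 means a pass/fail decision based on any single run would itself vary between runs.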

### Sweep is dead and the category split is permanent

[Sweep (sweepai/sweep)](https://github.com/sweepai/sweep) — the highest-profile autonomous PR generation tool — was abandoned mid-2024. As of May 2026: ~7.7K stars (often inflated to 30K in secondary sources), 241 open issues, last substantive commit June 2024. The team pivoted to a JetBrains IDE plugin; the GitHub App has no maintainer engagement.

Sweep's failure exposed a structural problem: generating PRs from issue descriptions requires an interactive agent loop with execution capability and human checkpoints. Sweep was a stateless GitHub App — the opposite architecture. The tools that now fill the issue-to-PR niche (Codex, Claude Code in CI, Devin) all require interactive loops.

The "AI code review" label now covers two incompatible product categories:

**PR review bots** (comment on existing PRs): [CodeRabbit](https://www.greptile.com/benchmarks) (632K PRs reviewed in 2025, self-reported), [GitHub Copilot](https://docs.github.com/en/copilot/using-github-copilot/code-review/using-copilot-code-review) (561K PRs, self-reported), Cursor BugBot, Greptile. These cannot generate PRs.

**Autonomous coding agents** (generate PRs from instructions): Codex, Claude Code + GitHub Actions, Devin. These do not review existing PRs.

No tool spans both. This split will not reconverge — the architectures are incompatible.

### Multi-stage filtering is the structural answer to noise

All effective noise-reduction approaches layer multiple independent passes:

- **[Cursor BugBot](https://cursor.com/blog)**: 8 parallel passes with randomized diff order → majority voting → bucket merge → category filter → validator model → dedup against previous runs (V1–V11, [52% → 70% resolution rate](https://cursor.com/blog))
- **Ellipsis**: dedup filter → confidence filter → hallucination filter (cross-references against actual code)
- **CodeRabbit**: path_filters + path_instructions context injection + per-repo learnings + review profile

Single-pass architectures (one LLM call, no filtering) cannot match this regardless of model quality.
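The layering idea can be illustrated with a minimal sketch of two of the stages named above: majority voting across independent passes, then dedup against findings reported on earlier runs. This is not Cursor's implementation; the function names and finding identifiers are hypothetical, and the bucket merge, category filter, and validator model stages are omitted.

```python
from collections import Counter


def majority_vote(passes: list[set[str]], min_votes: int) -> set[str]:
    """Keep only findings reported by at least `min_votes` independent passes."""
    votes = Counter(finding for findings in passes for finding in findings)
    return {finding for finding, count in votes.items() if count >= min_votes}


def drop_previously_reported(findings: set[str], already_reported: set[str]) -> set[str]:
    """Dedup against findings that earlier review runs already surfaced."""
    return findings - already_reported


# Hypothetical output of four independent review passes over the same diff.
passes = [
    {"null-deref@auth.py:42", "style@api.py:7"},
    {"null-deref@auth.py:42", "race@queue.py:88"},
    {"null-deref@auth.py:42", "race@queue.py:88", "naming@db.py:3"},
    {"null-deref@auth.py:42", "race@queue.py:88"},
]
already_reported = {"race@queue.py:88"}

# Strict majority: at least 3 of 4 passes must agree.
voted = majority_vote(passes, min_votes=len(passes) // 2 + 1)
surviving = drop_previously_reported(voted, already_reported)

print(sorted(voted))
print(sorted(surviving))
```

One-off findings (the style and naming comments) fall out at the voting stage, and the repeated race-condition finding is suppressed by dedup, leaving a single comment for the developer.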

## What this means for you

**Tool choice determines whether you catch bugs, not whether you have AI review.** The [6%–82% range](https://www.greptile.com/benchmarks) means a team running the lowest-scoring tool is operating with false confidence: they have an AI reviewer that catches [6%](https://www.greptile.com/benchmarks) of real bugs while generating noise at 10:1. The gap between "using AI code review" and "using AI code review that works" is a roughly 14x difference in detection rate (6% vs [82%](https://www.greptile.com/benchmarks)).

**You cannot use AI code review as a CI gate.** Non-determinism across runs means the same bug will be flagged in one run and missed in another. It is an advisory layer, not a quality control gate.

**Qodo/PR-Agent is not execution-capable at PR time.** The `REQUIRE_TESTS_REVIEW` config flag controls whether the review checks for the *presence* of tests; it does not execute them. The test-generation capability from Qodo (formerly CodiumAI) is IDE-based at write time, not PR review time. Teams expecting Qodo to run tests and flag failures at review time will find it does not.

**Resolution rate is a better metric than detection rate.** [Cursor BugBot](https://cursor.com/blog)'s V1-to-V11 trajectory ([52% → 70% resolution rate](https://cursor.com/blog), with resolution rate itself as the optimization target) is the only published falsifiable metric for AI code review improvement among the sources reviewed here. Teams tracking comment volume or comment acceptance are optimizing a proxy that doesn't predict whether bugs get fixed.
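As a rough sketch of what tracking this metric could look like, the snippet below computes resolution rate as the share of flagged issues that were actually fixed by merge time. The `FlaggedIssue` record and its `resolved_at_merge` label are hypothetical; how that label is produced (human triage, an LLM judge, or something else) is an assumption outside this example.

```python
from dataclasses import dataclass


@dataclass
class FlaggedIssue:
    comment_id: str
    # Hypothetical label: True if the flagged problem was fixed before the PR merged.
    resolved_at_merge: bool


def resolution_rate(issues: list[FlaggedIssue]) -> float:
    """Share of flagged issues that engineers actually fixed by merge time."""
    if not issues:
        return 0.0
    return sum(issue.resolved_at_merge for issue in issues) / len(issues)


# Illustrative data only: 7 of 10 flagged issues resolved at merge.
issues = [FlaggedIssue(f"c{i}", resolved_at_merge=(i < 7)) for i in range(10)]

print(f"resolution rate: {resolution_rate(issues):.0%}")
```

Unlike comment volume, this number only goes up when flagged problems get fixed, so it is harder to inflate by generating more comments.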

## What to do

1. **Identify which category your tool is in before evaluating it.** PR review bots (CodeRabbit, Copilot, BugBot, Greptile) comment on existing PRs. Coding agents (Codex, Claude Code + CI, Devin) generate PRs from specs. Using a review bot to replace a coding agent, or vice versa, is a category error.

2. **Choose tools with full-repo context access if detection rate matters.** Diff-only tools top out around [54–58%](https://www.greptile.com/benchmarks). Full-repo context tools reach [82%](https://www.greptile.com/benchmarks) in the same benchmark. The gap is architectural — upgrading models won't close it.

3. **Enable multi-stage filtering or configure noise thresholds aggressively.** Single-pass tools generate 10 speculative comments per real issue. Tools with majority voting + validator + dedup (BugBot model) reduce noise structurally. For CodeRabbit: configure `path_filters` and `path_instructions` to focus scope. Unfiltered AI review is worse than no AI review — developers learn to ignore the channel.

4. **Do not use AI code review as a CI gate.** Non-determinism across runs means the same bug will be flagged in one run and missed in another. Treat it as an advisory signal, not a quality control checkpoint.

5. **Track resolution rate at merge time, not comment volume.** "Did engineers actually fix what was flagged?" (judged by an LLM at merge) is the only metric that maps to real bug prevention. Comment acceptance rate, reaction counts, and detection rate are all proxies that can be optimized independently of bug reduction.

6. **For Qodo/PR-Agent users: `REQUIRE_TESTS_REVIEW` does not run tests.** It controls whether the review checks for the *presence* of tests in the diff. If you need PR-time test execution, you need a separate CI step, not a Qodo config flag.

**Falsification criterion:** This finding would be disproved by a publicly verified independent benchmark (not run by a vendor) showing two or more AI code review tools achieving comparable detection rates across diverse codebases, demonstrating that architecture differences do not drive the observed spread.

## Evidence

| Tool | Version | Evidence | Result |
|------|---------|----------|--------|
| [Greptile](https://www.greptile.com/benchmarks) | July 2025 benchmark | source-reviewed | 82% overall detection on 50 real-world bugs; highest of 5 tools (vendor benchmark) |
| [BugBot (Cursor)](https://cursor.com/blog) | V1 (July 2025) – V11 (Jan 2026) | source-reviewed | Resolution rate 52% → 70% using resolution rate as optimization target; 40 major experiments across V1-V11 |
| [GitHub Copilot code review](https://docs.github.com/en/copilot/using-github-copilot/code-review/using-copilot-code-review) | GA April 2025 | docs-reviewed | Comment-only reviews; no Approve/Request Changes; no auto-merge capability; 54% detection in Greptile benchmark |
| [CodeRabbit](https://www.greptile.com/benchmarks) | July 2025 benchmark | source-reviewed | 44% overall detection; 632K PRs reviewed in 2025 (self-reported) |
| [Graphite](https://www.greptile.com/benchmarks) | July 2025 benchmark | source-reviewed | 6% overall detection — lowest of 5 tools tested |
| [Sweep (sweepai/sweep)](https://github.com/sweepai/sweep) | Reviewed May 2026 | source-reviewed | ~7.7K stars, 241 open issues, GitHub App abandoned mid-2024; pivoted to JetBrains plugin |
| [arXiv:2502.02757](https://arxiv.org/abs/2502.02757) | MSR 2025 | source-reviewed | LLM cleaning of code review training data: 66–85% precision; cleaned-data models 12.4–13.0% closer to valid human feedback |
| [Qodo/PR-Agent config](https://github.com/qodo-ai/pr-agent/blob/main/pr_agent/settings/configuration.toml) | Reviewed 2026-05 | source-reviewed | REQUIRE_TESTS_REVIEW is a boolean under `[pr_reviewer]`; controls test presence analysis, not test execution |

**Confidence:** empirical — 5 tools and 8 sources reviewed. No independent benchmark exists for the primary detection rate claims; the Greptile benchmark is vendor-run. The 10:1 noise ratio is practitioner consensus without an explicit published source. The training dataset finding (arXiv:2502.02757) is independently published.

**Strongest case against:** The entire detection rate spread may reflect benchmark design artifacts rather than real-world differences. Greptile's benchmark uses 50 bugs from 5 repos — a narrow sample that may favor Greptile's full-repo context approach for the bug types selected. Tools optimized for different bug classes (security vs. style vs. logic) might rank differently on a more diverse benchmark. The absence of an independent benchmark makes this impossible to rule out.

**Open questions:** Would an independent benchmark (not run by a vendor) confirm the [6%–82% spread](https://www.greptile.com/benchmarks), or narrow it? Does the noise ratio vary systematically with codebase size, language, or tool configuration? Has [BugBot](https://cursor.com/blog)'s post-V11 resolution rate (with learned rules, April 2026) improved beyond the [70%](https://cursor.com/blog) reported at V11?

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — share a repro or counter-example and we'll review it against this finding. Reader evidence is what keeps these findings accurate.