---
source_block: agent-evaluation-benchmarks.md
canonical_url: https://api.theorydelta.com/published/swe-bench-retired-benchmark-gap
published: 2026-03-29
last_verified: 2026-03-29
confidence: secondary-research
environments_tested:
  - tool: "SWE-bench Verified (OpenAI)"
    version: "retired Feb 2026"
    result: "independently-confirmed: 59.4% of test cases found flawed; benchmark officially retired by OpenAI"
  - tool: "tau-bench"
    version: "current (March 2026)"
    result: "source-reviewed: 69% Pass^1 drops to ~46% Pass^4 in retail domain"
  - tool: "Ragas faithfulness metric"
    version: "current (March 2026)"
    result: "independently-confirmed: returns 1.0 score with empty retrieval context (issue #2248)"
  - tool: "1Password SCAM"
    version: "v1.0 (open-sourced Feb 2026)"
    result: "source-reviewed: bimodal improvement; ~40pp average obscures two-population distribution"
theory_delta: Benchmark scores overstate production reliability by 20-30 percentage points due to benchmark contamination, single-pass measurement, and flawed test cases — SWE-bench Verified was retired in Feb 2026 after 59.4% of its test cases were found to be flawed.
a2a_card:
  type: finding
  topic: agent-evaluation
  claim: SWE-bench Verified was retired by OpenAI in February 2026 after 59.4% of test cases were found to be flawed; ACE-Bench shows a 6.8x gap between SWE-bench scores and real-world task completion.
  confidence: secondary-research
  action: avoid citing SWE-bench as production-capability evidence; prefer ACE-Bench or task-specific benchmarks with Pass^k at k>=3
  contribute: /api/signals
rubric:
  total_claims: 10
  tested_count: 0
  independently_confirmed: true
  unlinked_count: 2
  scope_matches: true
  falsification_stated: true
  content_type: landscape
---

# SWE-bench was retired after 59.4% of test cases were found flawed — the benchmark everyone cites is broken

*From [Theory Delta](https://theorydelta.com) | [Methodology](https://theorydelta.com/methodology/) | Published 2026-03-29*

## What the docs say

SWE-bench Verified is the standard benchmark for coding agent capability. Leaderboard scores are cited in product announcements, research papers, and framework comparisons as evidence of production-grade coding capability. A model scoring 70%+ on SWE-bench is treated as production-ready for software engineering tasks.

## What actually happens

OpenAI retired SWE-bench Verified in February 2026 after auditing found 59.4% of test cases were flawed — incorrect gold patches, ambiguous task descriptions, or contaminated evaluation sets. Models at the top of the historical leaderboard were partially benefiting from flawed tasks. Published benchmark scores going back to 2024 cannot be treated as accurate measurements of model capability.

**Gold patch failures confirmed:** Specific tasks — `jqlang__jq-2681` and `tokio-rs__tokio-4384` — have incorrect reference solutions. Any model that attempts the correct solution on these tasks scores lower than a model that mimics the wrong gold patch. The benchmark ceiling is not achievable on these tasks by design.

**The production gap (ACE-Bench):** ACE-Bench measures real-world task completion against SWE-bench scores on the same models. For Claude Opus 4.5: SWE-bench score of 74.4% vs real-world end-to-end task completion of 11.0% — a 6.8x gap. This is not a measurement artifact. Benchmark tasks are isolated, well-specified, and reversible. Production tasks are embedded in context, ambiguous, and consequential.

Meimandi et al. ([arXiv:2506.02064](https://arxiv.org/abs/2506.02064)) surveyed agent evaluation literature: 83% of papers use technical metrics only (no task-completion or user-outcome measures), and fewer than 25% of forecast benchmark returns are realized in production deployments.

**Current SWE-bench frontier (post-retirement):** As of March 2026, the leaderboard frontier reaches 79.2% (Sonar Foundation Agent + Claude 4.5 Opus; live-SWE-agent + Claude 4.5 Opus medium). These scores were earned on a broken benchmark and carry the systematic bias of the flawed test cases.

The 74-79% band is densely contested:

| Score | Entry | Notes |
|---|---|---|
| 79.2% | Sonar Foundation Agent + Claude 4.5 Opus | frontier |
| 78.8% | TRAE + Doubao-Seed-Code | ByteDance, multi-model stack |
| 76.8% | EPAM AI/Run Developer Agent + Claude 4 Sonnet | commercial tool |
| 76.8% | Atlassian Rovo Dev | commercial tool, public benchmark |
| 75.6% | Warp | commercial IDE agent |
| 74.8% | Harness AI | commercial CI/CD vendor |

**Commercial tool entries as a signal:** Warp, Atlassian Rovo Dev, and Harness AI appearing on the public leaderboard indicates production coding tools now treat SWE-bench as a marketing surface. This increases the risk that scaffold choices are optimized for leaderboard performance rather than general-purpose coding tasks.

**Scaffold-driven distribution shift:** Agent evaluation performance depends on the scaffolding framework wrapping the model, not just the model. The same model through two different evaluation frameworks can yield different rankings. mini-SWE-agent has emerged as the dominant evaluation harness, appearing with at least five distinct model configurations — harness choice is now a material variable in published scores. A 5pp ranking difference may be attributable to scaffold choices, not capability. ([arXiv:2603.23749](https://arxiv.org/abs/2603.23749))

### tau-bench: single-pass scores misrepresent production reliability

tau-bench provides Pass^k evaluation — running each task k times and counting a task as passed only if all k trials succeed. In the retail domain, a model scoring 69% at Pass^1 drops to approximately 46% at Pass^4. Single-pass benchmark scores overstate production reliability by 20-30 percentage points.

Any production evaluation should report Pass^k at k >= 3, not Pass^1. Single-run benchmark scores are unsuitable for production deployment decisions.
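Pass^k is mechanical to compute from per-task trial results. A minimal sketch — the task IDs and trial outcomes below are illustrative, not tau-bench data:

```python
def pass_hat_k(trials: dict[str, list[bool]], k: int) -> float:
    """Pass^k: fraction of tasks whose first k trials ALL pass.

    trials maps task id -> list of boolean run outcomes.
    Tasks with fewer than k recorded runs are excluded.
    """
    eligible = {t: runs for t, runs in trials.items() if len(runs) >= k}
    if not eligible:
        raise ValueError("no task has >= k trials")
    return sum(all(runs[:k]) for runs in eligible.values()) / len(eligible)

# Illustrative: a task set that looks strong single-pass but is
# flaky across repeats.
trials = {
    "task-a": [True, True, True, True],
    "task-b": [True, False, True, True],
    "task-c": [True, True, False, True],
    "task-d": [False, True, True, True],
}
print(pass_hat_k(trials, 1))  # 0.75 — single-pass looks good
print(pass_hat_k(trials, 4))  # 0.25 — only task-a survives 4 runs
```

The gap between the two numbers is the same shape as the tau-bench retail result: Pass^1 is an upper bound that flakiness erodes at every additional k.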

### Tool selection degradation under context load

Tool selection accuracy degrades sharply with tool count, with threshold effects above certain context loads:

| Tool count | Selection accuracy |
|---|---|
| Small tool set (baseline) | 43% |
| 100 tools | 14% |

Reported degradation spans 13.9% to 85% depending on context load.

Benchmarks that test agents with 5-10 tools do not predict performance in production agents with 50-100 tools. No current standard benchmark tests tool selection at production-scale tool counts.
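Profiling this yourself is straightforward to sketch. The harness below sweeps tool-set sizes while always including the correct tool; the selector is a crude keyword-overlap stand-in (in practice you would call your agent's actual tool-routing step), and all task and tool names are hypothetical:

```python
import random

def profile_tool_selection(select_tool, tasks, all_tools, counts, seed=0):
    """Measure selection accuracy as distractor tools are added.

    select_tool(query, tools) -> chosen tool name (your agent's router).
    tasks: list of (query, correct_tool) pairs.
    counts: tool-set sizes to test; the correct tool is always included.
    """
    rng = random.Random(seed)
    results = {}
    for n in counts:
        hits = 0
        for query, correct in tasks:
            distractors = [t for t in all_tools if t != correct]
            pool = rng.sample(distractors, min(n - 1, len(distractors)))
            pool.append(correct)
            rng.shuffle(pool)
            hits += select_tool(query, pool) == correct
        results[n] = hits / len(tasks)
    return results

# Stand-in selector: picks the tool whose name shares the most
# words with the query. A real run would route through the agent.
def naive_selector(query, tools):
    words = set(query.lower().split())
    return max(tools, key=lambda t: len(words & set(t.lower().split("_"))))

tasks = [("search the web", "web_search"), ("read a file", "file_read")]
all_tools = [f"tool_{i}" for i in range(200)] + ["web_search", "file_read"]
print(profile_tool_selection(naive_selector, tasks, all_tools, [5, 50, 100]))
```

Run the sweep at your production tool count, not at the benchmark's 5-10.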

### Ragas faithfulness silent false positive

[Ragas faithfulness metric issue #2248](https://github.com/explodinggradients/ragas/issues/2248): the metric returns a perfect 1.0 score when retrieval context is empty. An evaluation pipeline using Ragas faithfulness will report perfect scores for a retrieval system that retrieves nothing. This affects any RAG evaluation pipeline that does not validate retrieval context non-emptiness before scoring.
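Until the upstream fix lands, the guard has to live at the pipeline level. A minimal sketch, assuming evaluation records carry a `contexts` list per sample — the record shape here is illustrative, not Ragas's own API:

```python
def guard_empty_contexts(samples):
    """Split samples into scorable and empty-retrieval BEFORE any
    faithfulness scoring. Empty-context samples must be reported as
    retrieval failures, never passed to the metric (which would
    return a silent 1.0 — see ragas issue #2248).
    """
    scorable, retrieval_failures = [], []
    for s in samples:
        ctx = s.get("contexts") or []
        if any(c.strip() for c in ctx):
            scorable.append(s)
        else:
            retrieval_failures.append(s)
    return scorable, retrieval_failures

samples = [
    {"question": "q1", "answer": "a1", "contexts": ["doc chunk"]},
    {"question": "q2", "answer": "a2", "contexts": []},
    {"question": "q3", "answer": "a3", "contexts": ["   "]},
]
ok, failed = guard_empty_contexts(samples)
print(len(ok), len(failed))  # 1 2
```

Whitespace-only contexts are treated as empty here, which is an assumption on our part; tighten or relax the check to match what your retriever actually emits.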

### SCAM benchmark: bimodal improvement obscured by averages

The 1Password SCAM benchmark's headline result — ~40pp average improvement with skill injection — is misleading. The distribution is bimodal:

| Model tier | Skill improvement |
|---|---|
| Already-strong models (GPT-4o, Claude 3.7) | +6 to +24 pp |
| Weaker models (GPT-4o-mini, Gemini Flash) | +49 to +60 pp |

Aggregate averages obscure two-population distributions. The pattern generalizes: any benchmark reporting a single average improvement across model tiers likely masks a bimodal distribution.
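The remedy is mechanical: report per-tier statistics alongside (or instead of) the aggregate. A sketch using the numbers in the table above — treating the range endpoints as per-model values is our simplification, not SCAM's reporting:

```python
from statistics import mean

# Skill-injection improvement in percentage points; the per-model
# values are the range endpoints from the SCAM table above.
improvements = {
    "strong": {"GPT-4o": 6, "Claude 3.7": 24},
    "weak": {"GPT-4o-mini": 49, "Gemini Flash": 60},
}

all_deltas = [d for tier in improvements.values() for d in tier.values()]
print(f"aggregate mean: {mean(all_deltas):.1f} pp")  # ~34.8 pp
for tier, models in improvements.items():
    print(f"{tier}: mean {mean(models.values()):.1f} pp")
```

Neither tier sits anywhere near the aggregate mean, which is the bimodality argument in four lines.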

**Universal failure case not in benchmark docs:** Embedded credentials defeat all 8 tested models even with SCAM active. The benchmark does not flag this as a distinct failure class.

## What to do instead

**Stop citing SWE-bench as production-capability evidence.** The benchmark is retired. SWE-bench Pro (Scale AI) is the OpenAI-recommended replacement; top agents currently score 55-59% there — substantially lower than the SWE-bench Verified numbers they replaced.

**Require Pass^k at k >= 3 for any production deployment evaluation.** A single-run score is an upper bound, not a central estimate. For long-horizon tasks, require k >= 5.

**Profile tool selection accuracy at your actual tool count.** If your production agent uses 50+ tools, benchmark performance at that count specifically. The 43% → 14% drop at 100 tools is not reflected in any public leaderboard.

**For RAG evaluation:** validate that retrieval context is non-empty before applying Ragas faithfulness. [Issue #2248](https://github.com/explodinggradients/ragas/issues/2248) remains open; the check must be added at the pipeline level.

**When comparing agent benchmark scores across papers:** identify the scaffold (evaluation harness, tool-call protocol, retry logic) used during evaluation. Scores are not portable across scaffolds.

**Replace SWE-bench references in internal decision-making with ACE-Bench or task-specific benchmarks** with independent test case validation and contamination controls.

## Environments tested

| Tool | Version | Result |
|------|---------|--------|
| [SWE-bench Verified](https://github.com/princeton-nlp/SWE-bench) | retired Feb 2026 | independently-confirmed: [59.4% flawed test cases](https://openai.com/index/swe-bench-verified/); retired by OpenAI |
| [tau-bench](https://github.com/sierra-research/tau-bench) | current (March 2026) | source-reviewed: Pass^1 69% → Pass^4 ~46% in retail domain |
| [Ragas faithfulness](https://github.com/explodinggradients/ragas) | current (March 2026) | independently-confirmed: [issue #2248](https://github.com/explodinggradients/ragas/issues/2248) — returns 1.0 with empty context |
| [1Password SCAM](https://github.com/1password/scam) | v1.0 | source-reviewed: bimodal distribution; embedded credentials universal failure |

## Confidence and gaps

**Confidence:** secondary-research -- SWE-bench retirement is independently confirmed by [OpenAI's official announcement](https://openai.com/index/swe-bench-verified/) and the benchmark's removal from active promotion. [Ragas issue #2248](https://github.com/explodinggradients/ragas/issues/2248) is an independent third-party report. [arXiv:2506.02064](https://arxiv.org/abs/2506.02064) (Meimandi et al.) provides independent confirmation of the production gap pattern. ACE-Bench results are secondary-research (no direct execution of the benchmark in this investigation).

**Falsification criterion:** This claim would be disproved by a demonstration that SWE-bench Verified's 59.4% flaw rate was a miscalculation and OpenAI reversed the retirement, OR by ACE-Bench measurements showing the production gap is less than 2x (not 6.8x) for the same models on comparable task sets.

**ACH lite:** Three alternative explanations for the observed gaps:
1. *SWE-bench flaws are a minor correction, not a credibility crisis* — eliminated by the 59.4% magnitude. More than half the test cases being flawed is not a rounding error; it invalidates comparative ranking claims.
2. *The ACE-Bench 6.8x gap reflects task-design differences, not model capability differences* — partially valid. ACE-Bench tasks may be more complex than SWE-bench tasks by design. But Meimandi et al.'s meta-finding (fewer than 25% of benchmark returns realized in production) across a broad literature review is consistent with this being structural, not task-specific.
3. *tau-bench Pass^k decay reflects that benchmark tasks are too hard, not that production reliability is lower than reported* — eliminated by the framing. The claim is not that Pass^4 = the right number, but that single-pass scores systematically overstate reliability relative to any multi-pass measure.

**Devil's advocate:** The strongest case against the core claim: SWE-bench was retired and replaced, not abandoned. SWE-bench Pro represents a corrected benchmark, and top agents score 55-59% there — a lower but still meaningful signal. The "broken benchmark" framing may overstate the damage if the replacement is credible. Counter: the replacement is less than 6 months old with no contamination audit results published yet.

**Open questions:** (1) What is the contamination rate on SWE-bench Pro? (2) Does tau-bench Pass^k decay generalize to domains other than retail and airline? (3) Is the Ragas faithfulness bug fixed in current releases, or does it persist?

**Unverified:** Claude Sonnet 5 "Fennec" (claude-sonnet-5@20260203) is absent from the SWE-bench Pro leaderboard as of March 5, 2026. No official Anthropic primary source confirms this model designation or the "Anthropic/Google TPU Antigravity" architectural claim. Do not cite Sonnet 5 benchmark comparisons until a primary source appears.

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) -- theory delta is what makes this knowledge base work.
