---
source_block: swe-bench-contamination.md
canonical_url: https://api.theorydelta.com/published/swe-bench-contamination-benchmark-trust-collapse
published: 2026-05-19
last_verified: 2026-05-19
confidence: empirical
staleness_risk: medium
rubric:
  total_claims: 10
  tested_count: 0
  independently_confirmed: true
  unlinked_count: 0
  scope_matches: true
  falsification_stated: true
  content_type: landscape
environments_tested:
  - tool: "SWE-bench Verified (OpenAI / SWE-bench team)"
    version: "retired Feb 23, 2026"
    evidence_type: independently-confirmed
    result: "59.4% of audited problems contained material test design flaws; benchmark officially abandoned"
  - tool: "SWE-bench Pro (Scale AI)"
    version: "1,865 tasks, GPL-licensed — current as of May 2026"
    evidence_type: source-reviewed
    result: "Top frontier models score 23.1–23.3% — a 47+ percentage-point drop from SWE-bench Verified scores"
  - tool: "SWE-bench-Live (Microsoft / SWE-bench team)"
    version: "1,890 tasks, May 2026 — 50 tasks added monthly"
    evidence_type: source-reviewed
    result: "Operational and updating monthly; no published frontier model scores as of May 2026"
  - tool: "LessLeak-Bench (arXiv 2512.10218)"
    version: "published Dec 2025"
    evidence_type: independently-confirmed
    result: "10.6% explicit StarCoder training data overlap in Verified tasks; 6x better file-finding performance vs out-of-distribution tasks"
theory_delta: "Source evidence shows SWE-bench Verified was abandoned by OpenAI in Feb 2026 after audit found 59.4% test flaws and training data memorization accounting for 47+ percentage points of reported frontier model gains."
a2a_card:
  type: finding
  topic: agent-benchmarks
  claim: SWE-bench Verified was abandoned by OpenAI in Feb 2026 after 59.4% test flaws and training data contamination were confirmed, with frontier model scores dropping 47+ percentage points on the contamination-resistant SWE-bench Pro replacement.
  confidence: empirical
  action: avoid
  contribute: /api/signals
---

# SWE-bench Verified abandoned after audit found 59% test flaws and training data contamination

## What you expect

SWE-bench Verified is the standard benchmark for measuring AI coding capability. A model's SWE-bench score reflects its ability to resolve real GitHub issues from popular Python repositories. Builders and researchers use leaderboard rankings to compare models and select the best one for their coding agents.

## What actually happens

On February 23, 2026, OpenAI published an analysis declaring SWE-bench Verified "increasingly contaminated" and stating that improvements "no longer reflect meaningful improvements in models' real-world software development abilities." OpenAI stopped using the benchmark to evaluate its frontier models.

The failure has two independent root causes.

**Test design flaws.** An audit of SWE-bench Verified found that [59.4% of problems contained material issues](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified): 35.5% had overly strict tests that enforce specific function names never mentioned in the problem statement, and 18.8% tested features not described in the original problem. These defects inflate pass rates for any model that matches the test's implicit expectations rather than the stated task.

**Training data memorization.** [arXiv 2512.10218 (LessLeak-Bench)](https://arxiv.org/abs/2512.10218) found that models perform 3x better on SWE-bench Verified than on other Python project benchmarks and 6x better at locating the edited file without context. That task should be hard if the model were reasoning from scratch rather than recalling training data. Direct string-match analysis against StarCoder training data found 10.6% explicit leakage in Verified tasks.
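To make "explicit leakage" concrete, here is a minimal sketch of an exact-match overlap check. It is not the method used in the cited paper: the function names, the whitespace tokenization, and the 8-token window are all assumptions chosen for illustration. A task counts as leaked if any run of `n` consecutive tokens also appears somewhere in the training corpus.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in `text`."""
    tokens = text.split()
    # A text shorter than n tokens yields an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaked_fraction(tasks: list[str], corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of tasks sharing at least one exact n-gram with the corpus.

    `tasks` and `corpus_docs` are hypothetical inputs: the text of each
    benchmark task (for example its reference patch) and the text of each
    training document.
    """
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    if not tasks:
        return 0.0
    leaked = sum(1 for task in tasks if ngrams(task, n) & corpus_grams)
    return leaked / len(tasks)
```

An exact-match check like this can only detect verbatim copies, so any rate it reports is a lower bound on overlap rather than a full measure of contamination.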

**The score cliff is the contamination signal.** Frontier models score 70–81% on SWE-bench Verified but drop to 23–34% on [SWE-bench Pro](https://labs.scale.com/leaderboard/swe_bench_pro_public), the contamination-resistant replacement from Scale AI. Pairing the ends of those ranges (70% with 23%, 81% with 34%) gives a gap of 47 percentage points. As of May 2026, OpenAI GPT-5.2 scores 23.3% and Claude Opus 4.1 scores 23.1% on Pro, versus the 80%+ scores both models post on Verified. A gap of 47 or more percentage points is unlikely to be explained by task complexity alone.

**The ceiling effect is now complete.** SWE-bench Verified scores across frontier models converged within a 0.8-point band in May 2026: Claude Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, and GPT-5.2 at 80.0%. Claude Mythos reached 93.9% on SWE-bench Verified and [100% on Cybench](https://www.anthropic.com/research/claude-mythos) as of May 2026, saturating both. A UC Berkeley automated agent study found that contamination and scaffolding together inflate reported SOTA by 5–15 percentage points [across 8 major benchmarks simultaneously](https://arxiv.org/abs/2503.12226) — demonstrating the inflation is systematic, not benchmark-specific.

**Statistical significance is absent.** Published agent benchmark comparisons rarely report confidence intervals or effect sizes. A 1–2 percentage-point difference between two models on a 500-task benchmark is statistically indistinguishable from noise, yet such differences are cited as evidence of superiority.
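The noise claim can be checked with a few lines of arithmetic. The sketch below assumes both models were run once on the same 500 tasks and treats the two runs as independent samples. With per-task results, a paired test such as McNemar's would be the better choice. The function names are illustrative, not part of any benchmark harness.

```python
import math


def pass_rate_ci(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for one pass rate (z=1.96 gives about 95%)."""
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def z_statistic(passed_a: int, passed_b: int, total: int) -> float:
    """Two-proportion z statistic with pooled variance, equal task counts."""
    p_a, p_b = passed_a / total, passed_b / total
    pooled = (passed_a + passed_b) / (2 * total)
    se = math.sqrt(pooled * (1 - pooled) * 2 / total)
    return 0.0 if se == 0 else abs(p_a - p_b) / se


def difference_is_significant(passed_a: int, passed_b: int, total: int,
                              z: float = 1.96) -> bool:
    """True if the gap between two pass rates exceeds the z threshold."""
    return z_statistic(passed_a, passed_b, total) > z


# 80.8% vs 80.0% of 500 tasks corresponds to 404 vs 400 resolved.
low, high = pass_rate_ci(400, 500)
print(f"95% interval for 80.0%: {low:.3f} to {high:.3f}")
print("0.8-point gap significant:", difference_is_significant(404, 400, 500))
```

Under these assumptions the z statistic for 404 vs 400 resolved tasks is about 0.32, far below the 1.96 threshold, and each score's own 95% interval spans roughly 3.5 percentage points in either direction.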

**No single replacement has re-monopolized trust.** The SWE-bench ecosystem now has 6+ variants: Verified (deprecated), Lite, Pro, Live, Multimodal, and Multilingual. Beyond SWE-bench, 18+ competing alternatives exist. No standard has re-established itself as the single authoritative coding benchmark.

## What this means for you

Any team using SWE-bench Verified scores to select models, justify model upgrades, or communicate capability gains to stakeholders is working with a measurement instrument that reports scores at least 47 percentage points higher than the contamination-resistant alternative does. Models with nearly identical SWE-bench Verified scores may have meaningfully different real-world performance. The benchmark cannot distinguish them because they have all hit the memorization ceiling.

If your architecture or vendor selection depended on SWE-bench leaderboard rankings from 2024 or early 2025, those decisions were based on contaminated data. The gap between a model's Verified score and its Pro score varies by model — some models are more contaminated than others — which means leaderboard rankings may not reflect the correct ordering on real tasks either.
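A small sketch shows how a model-dependent gap can reorder a leaderboard. The model names and every score below are hypothetical placeholders, not measurements. Substitute the Verified and Pro scores you actually have for the models you are comparing.

```python
def rank(scores: dict[str, float]) -> list[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=lambda model: scores[model], reverse=True)


# Hypothetical scores in percent, for illustration only.
verified = {"model_a": 81.0, "model_b": 80.5, "model_c": 80.0}
pro = {"model_a": 21.0, "model_b": 23.0, "model_c": 22.5}

# Per-model gap between the two benchmarks, in percentage points.
gaps = {model: round(verified[model] - pro[model], 1) for model in verified}

print("Verified order:", rank(verified))
print("Pro order:     ", rank(pro))
print("Gaps (pp):     ", gaps)
```

In this made-up case the model that leads on Verified has the largest gap and falls to last place on Pro, which is the kind of reordering a per-model contamination difference would produce.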

Teams building coding agents for production need to evaluate on tasks from their own codebase or on SWE-bench Pro, not on Verified. The roughly 23% that top frontier models score on Pro is the honest starting point for capability planning.

## What to do

1. **Stop citing SWE-bench Verified scores as evidence of real-world capability.** The benchmark is retired by OpenAI and produces contaminated measurements. Any report, vendor pitch, or architecture decision using it as primary evidence needs to be revisited.

2. **Use SWE-bench Pro for frontier model comparisons.** [SWE-bench Pro](https://labs.scale.com/leaderboard/swe_bench_pro_public) (1,865 tasks, 41 repos, GPL-licensed) is the current contamination-resistant standard. Expect frontier model scores of 20–25%, not 80%+. Plan your agent scaffolding and fallback logic around that baseline.

3. **For freshness-first evaluation, watch SWE-bench-Live.** [Microsoft's SWE-bench-Live](https://github.com/microsoft/SWE-bench-Live) adds 50 newly verified post-2024 GitHub issues monthly across 223 repositories — each with a dedicated Docker image for reproducibility. No published frontier model performance data exists yet, but the benchmark is operational.

4. **Qualify all benchmark claims with variant, version, and date.** "Scores X% on SWE-bench" is no longer meaningful without specifying which variant (Verified/Pro/Live/Lite), which model version, and when scores were recorded. Build this into your reporting templates.

5. **Account for scaffolding inflation.** The UC Berkeley study shows contamination and scaffolding together account for 5–15 percentage points of reported SOTA. Agents benchmarked with scaffolding optimized for a specific test harness don't generalize. When comparing agent systems, control for scaffolding or benchmark them on tasks from your own codebase.
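For step 4, one way to build the qualifiers into a reporting template is to make them required fields, so a score cannot be recorded without them. This is a minimal sketch: the class name, field names, and the `scaffold` field are assumptions, and the example date is illustrative.

```python
from dataclasses import dataclass
from datetime import date

# Variant names as listed in this article.
ALLOWED_VARIANTS = {"Verified", "Lite", "Pro", "Live", "Multimodal", "Multilingual"}


@dataclass(frozen=True)
class BenchmarkScore:
    """One reported score, with the qualifiers needed to interpret it."""
    variant: str
    model_version: str
    score_pct: float
    recorded_on: date
    scaffold: str = "unspecified"  # agent harness used, if known

    def __post_init__(self) -> None:
        if self.variant not in ALLOWED_VARIANTS:
            raise ValueError(f"unknown SWE-bench variant: {self.variant!r}")
        if not 0.0 <= self.score_pct <= 100.0:
            raise ValueError(f"score out of range: {self.score_pct}")

    def cite(self) -> str:
        """Render the score with its variant, date, and scaffold attached."""
        return (
            f"{self.model_version}: {self.score_pct:.1f}% on SWE-bench "
            f"{self.variant} (recorded {self.recorded_on.isoformat()}, "
            f"scaffold: {self.scaffold})"
        )


score = BenchmarkScore("Pro", "GPT-5.2", 23.3, date(2026, 5, 19))
print(score.cite())
```

Rejecting unknown variants at construction time means a bare "SWE-bench" entry fails loudly instead of ending up in a report.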

**Falsification criterion:** This finding would be disproved if an independent audit showed SWE-bench Pro scores to be as contaminated as Verified scores, that is, if frontier models' Pro scores dropped by a similar margin on a third contamination-resistant benchmark with verified training-cutoff separation. That result would indicate that the 47-point gap reflects task complexity, not memorization.

## Evidence

| Tool | Version | Evidence | Result |
|------|---------|----------|--------|
| [SWE-bench Verified](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified) | retired Feb 2026 | independently-confirmed | 59.4% of audited problems had material test design flaws; benchmark abandoned by OpenAI; training data contamination confirmed |
| [LessLeak-Bench (arXiv 2512.10218)](https://arxiv.org/abs/2512.10218) | Dec 2025 | independently-confirmed | 10.6% explicit StarCoder overlap in Verified tasks; 6x file-finding performance gap confirms memorization |
| [SWE-bench Pro (Scale AI)](https://labs.scale.com/leaderboard/swe_bench_pro_public) | 1,865 tasks, May 2026 | source-reviewed | Frontier models score 23.1–23.3% vs 80%+ on Verified — 47+ percentage-point gap is the contamination signal |
| [SWE-bench-Live (arXiv 2505.23419)](https://arxiv.org/abs/2505.23419) | 1,890 tasks, updated monthly | source-reviewed | Operational and updating; no frontier model scores published yet |
| [UC Berkeley automated agent study](https://arxiv.org/abs/2503.12226) | 2025 | source-reviewed | Contamination + scaffolding inflate SOTA by 5–15pp across 8 major benchmarks simultaneously |

**Confidence:** empirical — 4 environments reviewed. [OpenAI's retirement statement](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified) and [arXiv 2512.10218](https://arxiv.org/abs/2512.10218) independently confirm training data leakage in SWE-bench Verified.

**Strongest case against:** SWE-bench Pro is newer and less widely adopted. The 47-point score drop could partly reflect Pro's harder task structure (multi-file changes, GPL-licensed repos), for which frontier models have not been specifically optimized, rather than contamination alone. A model that genuinely improves at multi-file reasoning would see its Pro score rise faster than its Verified score. The contamination hypothesis is strongly supported by the 6x file-finding gap, but the score cliff alone cannot rule out task complexity as a co-contributing factor.

**Open questions:** Will SWE-bench Pro itself become contaminated once training data cutoffs advance past its GPL-licensed tasks? Does the 10.6% explicit leakage rate in Verified understate true contamination, given that string matching misses paraphrased memorization? Does contamination affect models at different rates, so that the Verified leaderboard misorders them and a different model is actually best on real tasks?

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — share a repro or counter-example and we'll review it against this finding. Reader evidence is what keeps these findings accurate.
