---
source_block: agent-testing-infrastructure.md
canonical_url: https://api.theorydelta.com/published/agent-testing-non-deterministic-ci
published: 2026-02-25
last_verified: 2026-04-19
confidence: independently-confirmed
evidence_type: tested
rubric:
  total_claims: 7
  tested_count: 7
  independently_confirmed: true
  unlinked_count: 2
  scope_matches: false
  falsification_stated: true
  content_type: landscape
trust:
  provenance: "sourced + first-party"
  rigor: secondary-research
  sources: "6 GitHub repos reviewed, 1 confirmed bug, 1 security plugin"
  unlinked_claims: 2
environments_tested:
  - tool: "confident-ai/deepeval"
    version: "latest (Feb 2026)"
    evidence_type: independently-confirmed
    result: "G-Eval is_successful bug confirmed by community reports; OTel hijack and Sentry profiling on import confirmed in Issue #2497"
  - tool: "promptfoo/promptfoo"
    version: "latest (Feb 2026)"
    evidence_type: source-reviewed
    result: "MCP security red-team plugin confirmed; no functional MCP test support"
  - tool: "laude-institute/harbor"
    version: "latest (Apr 2026, 1,537 stars)"
    evidence_type: docs-reviewed
    result: "ATIF trajectory format reviewed; eval+RL unified (star count updated from 704 Feb 2026 to 1,537 Apr 2026)"
  - tool: "awslabs/agent-evaluation"
    version: "latest (Feb 2026)"
    evidence_type: independently-confirmed
    result: "Non-deterministic outcomes acknowledged in their own docs"
  - tool: "LangChain FakeChatModel"
    version: "latest (Feb 2026)"
    evidence_type: source-reviewed
    result: "Does not expose prompt inputs without subclassing"
  - tool: "amosjyng/vcr-langchain"
    version: "v0.1.x (stale since Jan 2024)"
    evidence_type: source-reviewed
    result: "HTTP-only recording; no MCP tool dispatch interception"
theory_delta: Every LLM-as-judge eval framework reviewed produces non-deterministic CI results — the grading layer is non-deterministic by design, not just the model under test.
a2a_card:
  type: knowledge_finding
  topic_tags: [agent-testing, ci, eval, llm-as-judge, mcp]
  confidence_score: 0.71
  finding_url: https://theorydelta.com/findings/agent-testing-non-deterministic-ci/
  mcp_query_hint: "agent testing non-deterministic CI eval"
tasks:
  - task: evaluating-a-benchmark
    phase: replicate
---

# Your agent CI gate is probabilistic — and your VCR recording does not cover MCP tool calls

*From [Theory Delta](https://theorydelta.com) | [Methodology](https://theorydelta.com/methodology/) | Published 2026-02-25*

You set up a CI pipeline for your agent. deepeval runs on every PR, LLM-as-judge scores the outputs, and deployment gates on a passing eval. You also added VCR-style recording to replay API calls deterministically. Your pipeline looks complete.

Two structural problems make it unreliable by design.

## What you expect

A CI gate that produces a deterministic pass/fail verdict on agent behavior. Eval frameworks that behave as correctness oracles. VCR recording that replays all external calls, including MCP tool dispatch. A green CI run as a proof of correctness.

## What actually happens

**Problem 1: The grading layer is non-deterministic by design.** Every eval framework reviewed (deepeval, promptfoo, awslabs/agent-evaluation, rogue) uses LLM-as-judge to score agent outputs. The judge LLM can produce different scores across runs for identical inputs. This is not a bug in these frameworks; it is inherent to the architecture. A CI gate built on LLM-as-judge can therefore return different pass/fail results for the same code on consecutive runs.

Mitigations exist. None eliminate the problem:

1. **Majority voting** ([qualifire-dev/rogue](https://github.com/qualifire-dev/rogue)): Run the judge N times, take the majority verdict. Reduces variance. Multiplies cost. Does not eliminate non-determinism.
2. **Threshold + retries** ([deepeval](https://github.com/confident-ai/deepeval)): Retry on borderline scores. Adds latency. Does not eliminate the failure mode.
3. **Seed fixing** (model-dependent): `temperature=0` plus a fixed seed, where the API offers one, reduces but does not eliminate variation. [OpenAI](https://platform.openai.com/docs/guides/text-generation#reproducible-outputs) and Anthropic both acknowledge in their docs that outputs are not guaranteed to be identical across runs.
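
To make the "reduces variance, does not eliminate it" point concrete, here is a minimal sketch of majority voting. The `judge` callable is a hypothetical stand-in for one LLM-as-judge run, and the probability helper assumes judge runs are independent with a fixed per-run pass probability `p`, which real judges only approximate.

```python
from collections import Counter
from math import comb
from typing import Callable


def majority_verdict(judge: Callable[[], bool], n: int = 5) -> bool:
    """Call a (hypothetical) judge n times and return the majority verdict."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number so ties cannot occur")
    votes = Counter(judge() for _ in range(n))
    return votes[True] > n // 2


def majority_pass_probability(p: float, n: int) -> float:
    """P(majority of n runs pass), assuming independent runs that each pass with probability p."""
    need = n // 2 + 1  # smallest number of passing votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))
```

Under these assumptions, a case that passes 90% of single runs passes a 3-vote majority 97.2% of the time and a 5-vote majority about 99.1% of the time, at 3x and 5x the judge cost. A genuinely borderline case (`p = 0.5`) stays at 50% for any odd `n`, which is why voting narrows the problem without removing it.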

[awslabs/agent-evaluation](https://github.com/awslabs/agent-evaluation) acknowledges non-deterministic outcomes in its own documentation. This is the accurate position — any team treating LLM eval CI gates as hard correctness gates is running a probabilistic gate instead.

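**Problem 2: No MCP tool replay exists.** VCR-style recording (vcrpy, pytest-recording, responses) intercepts HTTP request/response pairs. MCP agents on the stdio transport make no HTTP calls for tool dispatch: JSON-RPC messages travel over standard input/output. The SSE transport does run over HTTP, but tool results arrive on a long-lived server-sent event stream rather than as the response to the request that triggered them, so they do not map onto cassette-style replay. No library reviewed intercepts either transport: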

- [vcr-langchain](https://github.com/amosjyng/vcr-langchain) (81 stars, stale since Jan 2024) records LangChain HTTP calls to OpenAI/Anthropic APIs. Does not capture or replay MCP tool dispatch on any transport.
- Streamable HTTP transport is HTTP, so VCR could theoretically intercept it, but no library has been tested or documented for this use case.
- No shared MCP stub server or mock library with recording/replay turned up in the ecosystem as of Feb 2026, for any transport.

What builders are doing instead: writing custom fake MCP servers that return scripted JSON-RPC responses (not shared, not versioned), testing at integration level with real MCP servers and real tool responses, or skipping unit-level MCP tool testing entirely.
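
For illustration, a custom fake MCP server of the kind described above can be quite small. The sketch below assumes the stdio transport carries one JSON-RPC 2.0 message per line and handles only `initialize`, `tools/list`, and `tools/call`. The tool name `get_weather`, its scripted payload, and the `fake-mcp` server name are hypothetical, and a real client may require more of the protocol than this covers.

```python
import json
import sys
from typing import Any, Optional, TextIO

# Hypothetical scripted results, keyed by tool name. Replace with your own tools.
SCRIPTED_RESULTS: dict[str, str] = {
    "get_weather": '{"city": "Oslo", "temp_c": 4}',
}


def _error(msg_id: Any, code: int, message: str) -> dict[str, Any]:
    return {"jsonrpc": "2.0", "id": msg_id, "error": {"code": code, "message": message}}


def handle_message(msg: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the scripted JSON-RPC response, or None for notifications."""
    if "id" not in msg:
        return None  # notifications carry no id and get no reply
    method = msg.get("method")
    params = msg.get("params") or {}
    if method == "initialize":
        result: dict[str, Any] = {
            # Echo the client's requested version; a real server negotiates this.
            "protocolVersion": params.get("protocolVersion", "2024-11-05"),
            "capabilities": {"tools": {}},
            "serverInfo": {"name": "fake-mcp", "version": "0.0.0"},
        }
    elif method == "tools/list":
        result = {
            "tools": [
                {"name": name, "description": "scripted stub", "inputSchema": {"type": "object"}}
                for name in SCRIPTED_RESULTS
            ]
        }
    elif method == "tools/call":
        name = params.get("name")
        if name not in SCRIPTED_RESULTS:
            return _error(msg["id"], -32602, f"Unknown tool: {name}")
        result = {"content": [{"type": "text", "text": SCRIPTED_RESULTS[name]}], "isError": False}
    else:
        return _error(msg["id"], -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": msg["id"], "result": result}


def serve(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Read newline-delimited JSON-RPC from stdin and write scripted replies to stdout."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        response = handle_message(json.loads(line))
        if response is not None:
            stdout.write(json.dumps(response) + "\n")
            stdout.flush()
```

To use it as a subprocess server, add a call to `serve()` at the bottom of the file and point the agent's MCP client at that script. Keeping the scripted results in a versioned file next to the tests would address the "not shared, not versioned" weakness within a single team.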

**Compounding problem: deepeval itself has confirmed correctness bugs.** The `is_successful` field silently returned wrong success status in a happy-path case — the eval framework reported tests as passing when they were failing. This was patched reactively after community reports. A second problem: deepeval v3.7.7+ on import calls `trace.set_tracer_provider(TracerProvider())`, hijacking your global OTel provider and routing application spans to deepeval's New Relic account, and initializes Sentry with 100% CPU profiling. ([Issue #2497](https://github.com/confident-ai/deepeval/issues/2497), no maintainer response as of March 2026.) The eval framework itself can have silent correctness failures — this is now a confirmed risk, not a theoretical one.

## What this means for you

**Your green CI run is not a correctness proof.** It means no obvious regression was detected. The same code can fail on a different run with no change to the codebase. The probability of spurious failure depends on how close your agent's outputs are to the judge's scoring thresholds, and those thresholds can shift with every judge LLM update.
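
A back-of-the-envelope calculation shows how per-test flakiness compounds at the suite level. The sketch below assumes every judged test passes independently with the same probability, which is a simplification; the numbers are illustrative, not measured.

```python
def suite_pass_probability(per_test_pass: float, n_tests: int) -> float:
    """P(all n_tests pass), assuming independent tests with equal pass probability."""
    return per_test_pass ** n_tests


def expected_spurious_failures(per_test_pass: float, n_tests: int, ci_runs: int) -> float:
    """Expected number of red CI runs out of ci_runs, with no change to the code."""
    return ci_runs * (1.0 - suite_pass_probability(per_test_pass, n_tests))
```

Under those assumptions, 20 judged tests that each pass 99% of the time give a green suite only about 82% of the time, so roughly 18 of every 100 CI runs would go red on unchanged code.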

**Your VCR setup has a coverage gap for MCP tools.** If your agent calls tools via stdio or SSE transport, those calls execute against a live MCP server in CI — they are not replayed from cassettes. Any test that passes because the live MCP server returned the right value is not a deterministic test. An infrastructure failure or API change during a CI run produces a test failure that looks like a regression.

**If you upgraded deepeval without reading the changelog:** Check whether the `is_successful` fix is in your version. More critically: if your application has an OTel TracerProvider initialized before deepeval imports, deepeval may be routing your application spans to its own New Relic account. Set `DEEPEVAL_TELEMETRY_OPT_OUT=YES` and run deepeval only in isolated environments.
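
One way to approximate "isolated environments" is to run the eval suite in a child process with the opt-out variable set, so that anything the eval framework does to process-global state happens outside your application's process. This is a sketch under that assumption. The `tests/evals` path in the usage note is hypothetical, and whether the opt-out variable disables every import-time side effect is not something this finding verified.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional, Sequence


def run_isolated(
    argv: Sequence[str],
    extra_env: Optional[Mapping[str, str]] = None,
) -> subprocess.CompletedProcess:
    """Run an eval command in a child process with the telemetry opt-out set."""
    env = dict(os.environ)
    env["DEEPEVAL_TELEMETRY_OPT_OUT"] = "YES"  # opt-out variable named in this finding
    env.update(extra_env or {})
    # The child has its own interpreter state; the parent's globals are untouched.
    return subprocess.run(list(argv), env=env, capture_output=True, text=True, check=False)
```

A CI wrapper might call `run_isolated([sys.executable, "-m", "pytest", "tests/evals"])` and gate on the returned `returncode`.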

**If you are building multi-agent systems:** Every major platform — Claude Code, Cursor, Devin, Grok Build, Windsurf, Codex — shipped multi-agent team features in February 2026. No mature multi-agent test harness exists. Hallucination propagation between agents, race conditions from shared state, and N×M test case explosion across agent/task combinations are structurally unaddressed by all current frameworks. Single-agent testing infrastructure does not transfer to multi-agent systems.

## What to do

1. **Treat LLM eval CI gates as smoke tests, not correctness proofs.** A green run means "no obvious regression." Set thresholds conservatively and expect occasional false failures.
2. **Build deterministic assertion layers where possible.** For structured outputs (JSON, tool calls with known schemas), assert on structure and field values directly — do not route these through an LLM judge. Reserve LLM-as-judge for free-text quality where no deterministic check exists.
3. **For MCP tool testing, build custom stubs now.** Write a fake MCP server for your specific tools that returns scripted JSON-RPC responses. This is ad hoc but is the only option until a shared MCP stub library emerges. Three adjacent tools exist but do not solve VCR replay: FastMCP supports in-process testing (FastMCP servers only); [thoughtspot/mcp-testing-kit](https://github.com/thoughtspot/mcp-testing-kit) (12 stars, TypeScript, unmaintained since May 2025) provides in-process invocation; [mcpdrill](https://github.com/mcpdrill) (2 stars, Go) provides load testing with a built-in mock server but no recording/replay.
4. **Pin deepeval versions and audit the changelog.** The `is_successful` bug establishes that silent correctness failures in the eval layer are a confirmed risk. Set `DEEPEVAL_TELEMETRY_OPT_OUT=YES`. Run deepeval in isolated environments where OTel hijacking is acceptable.
5. **Use [promptfoo](https://github.com/promptfoo/promptfoo) for MCP security testing specifically.** Its [MCP red-team plugin](https://www.promptfoo.dev/docs/red-team/plugins/mcp/) tests for prompt injection and policy violations via MCP tool interactions — but this is security testing, not functional correctness testing.
6. **Watch [laude-institute/harbor](https://github.com/laude-institute/harbor)** (1,537 stars as of Apr 2026, up from 704 in Feb). It unifies eval, RL environments, and prompt optimization under one trajectory format (ATIF). Claude Code integration is first-class. If eval and training share the same representation, failing CI trajectories can feed directly into fine-tuning without infrastructure changes.
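
As a sketch of the deterministic assertion layer in item 2, the helper below checks a tool call's structure without involving a judge. It assumes the agent emits tool calls as JSON objects with `name` and `arguments` keys; the tool and argument names in the usage note are hypothetical.

```python
import json
from typing import Mapping


def check_tool_call(raw: str, expected_tool: str, required_args: Mapping[str, type]) -> list[str]:
    """Return a list of structural problems; an empty list means the call is valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    problems: list[str] = []
    if call.get("name") != expected_tool:
        problems.append(f"expected tool {expected_tool!r}, got {call.get('name')!r}")

    args = call.get("arguments")
    if not isinstance(args, dict):
        problems.append("arguments missing or not an object")
        return problems

    for key, expected_type in required_args.items():
        if key not in args:
            problems.append(f"missing argument {key!r}")
        elif not isinstance(args[key], expected_type):
            problems.append(f"argument {key!r} is not {expected_type.__name__}")
    return problems
```

In a test, `assert check_tool_call(output, "get_weather", {"city": str}) == []` gives the same verdict on every run, leaving the LLM judge for free-text quality only.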

## Evidence

| Tool | Version | Result |
|------|---------|--------|
| [confident-ai/deepeval](https://github.com/confident-ai/deepeval) | latest (Feb 2026) | independently-confirmed: G-Eval `is_successful` silent false-pass bug confirmed; OTel hijack + Sentry on import ([Issue #2497](https://github.com/confident-ai/deepeval/issues/2497)) |
| [promptfoo/promptfoo](https://github.com/promptfoo/promptfoo) | latest (Feb 2026) | source-reviewed: MCP security red-team plugin confirmed; no functional MCP test support |
| [laude-institute/harbor](https://github.com/laude-institute/harbor) | Apr 2026 (1,537 stars) | docs-reviewed: ATIF trajectory format reviewed; eval+RL unified; Claude Code integration first-class |
| [awslabs/agent-evaluation](https://github.com/awslabs/agent-evaluation) | latest (Feb 2026) | independently-confirmed: non-deterministic outcomes acknowledged in own docs |
| [LangChain FakeChatModel](https://python.langchain.com/docs/how_to/chat_model_unit_test/) | latest (Feb 2026) | source-reviewed: does not expose prompt inputs without subclassing |
| [amosjyng/vcr-langchain](https://github.com/amosjyng/vcr-langchain) | v0.1.x (stale since Jan 2024) | source-reviewed: HTTP-only recording; no MCP tool dispatch interception |

**Confidence:** source-reviewed + independently-confirmed — source code and documentation reviewed across 6 tools. Non-determinism confirmed by design analysis and third-party acknowledgment (awslabs self-documents it, deepeval bug confirmed by community). scope_matches=false: "every agent eval framework" was assessed by reviewing 4 frameworks, not an exhaustive survey.

**Unlinked claims:** (1) "No MCP stub server library exists" — searched GitHub, npm, PyPI for "mcp mock", "mcp stub", "mcp test server" in Feb 2026; no results with >10 stars or documented MCP transport interception. (2) "Seed fixing reduces but does not eliminate variation" — based on vendor documentation (OpenAI, Anthropic), not independent measurement.

**Falsification criterion:** An agent eval framework using LLM-as-judge that achieves identical pass/fail results across 100 consecutive runs on the same input, or an MCP stub/mock library that intercepts stdio or SSE transport tool dispatch for deterministic replay in CI, would disprove the core claims.
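
The first half of that criterion can be checked with a small harness. The sketch below assumes `gate` is a hypothetical zero-argument callable that runs your eval on one fixed input and returns the pass/fail verdict; it reports the share of runs that disagree with the majority.

```python
from typing import Callable


def minority_share(gate: Callable[[], bool], runs: int = 100) -> float:
    """Fraction of runs that disagree with the majority verdict; 0.0 means every run agreed."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    passes = sum(1 for _ in range(runs) if gate())
    return min(passes, runs - passes) / runs
```

A result of `0.0` over 100 runs on identical input would count as evidence against this finding for that framework and input; anything above zero is a measured flake rate.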

**Open questions:** Has anyone built a shared MCP stub/mock server library for any transport? Is there a deterministic grading approach for free-text agent outputs that doesn't use LLM-as-judge? Has anyone measured the actual variance rate of LLM-as-judge CI gates across 100+ runs on identical inputs?

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — share a repro or counter-example and we'll review it against this finding. Reader evidence is what keeps these findings accurate.