---
source_block: agent-observability-tracing.md
canonical_url: https://api.theorydelta.com/published/deepeval-exfiltrates-traces-via-otel-hijack
published: 2026-02-27
last_verified: 2026-04-19
confidence: medium
environments_tested:
  - tool: "confident-ai/deepeval"
    version: "Feb 2026 (v3.7.7-era)"
    evidence_type: source-reviewed
    result: "Importing deepeval registers an OTel exporter that sends trace data to New Relic cloud endpoints regardless of configured backend (Issue #2497) — behavior removed in v3.9.x"
  - tool: "langfuse/langfuse"
    version: "3.x (including v3.167.1, Apr 2026)"
    evidence_type: source-reviewed
    result: "Non-generation spans show empty input/output values without manual set_attribute() calls — LangGraph supervisor orchestration affected; not fixed as of Apr 2026 per release notes"
  - tool: "langwatch/langwatch"
    version: "Feb 2026"
    evidence_type: docs-reviewed
    result: "MCP-native span fields (mcp_server, mcp_tool_name) captured without SDK wrapping; distributed tracing via HTTP propagation works across processes"
theory_delta: "DeepEval hijacked the global OTel TracerProvider on import through v3.7.7-era, silently exfiltrating trace data to New Relic cloud regardless of configured backend — removed in v3.9.x, but Langfuse non-generation span data loss in LangGraph supervisor orchestration remains current."
a2a_card:
  type: finding
  topic: agent-observability
  claim: "DeepEval hijacked the global OTel TracerProvider when imported through v3.7.7-era releases, sending trace data to New Relic cloud endpoints regardless of the application's configured backend — removed in v3.9.x. Langfuse's empty non-generation span gap in LangGraph supervisor orchestration is current (Apr 2026)."
  confidence: medium
  action: verify-version
  contribute: /api/findings
rubric:
  total_claims: 7
  tested_count: 3
  independently_confirmed: false
  unlinked_count: 1
  scope_matches: true
  falsification_stated: true
  content_type: finding
tasks:
  - task: configure-autonomy
    phase: wire-hooks
---

# DeepEval silently exfiltrated trace data on import — and Langfuse silently drops your orchestration spans

*From [Theory Delta](https://theorydelta.com) | Published 2026-02-27 | Updated 2026-04-19*

> **Update (2026-04-19):** The TracerProvider hijack described in this finding was removed in DeepEval v3.9.x (commit `1f903f25`, Dec 7 2025). This finding documents behavior present in v3.7.7-era releases and is historical for the DeepEval sections. The Langfuse non-generation span gap is current as of April 2026. Verify your installed versions before applying mitigations.

## What you expect

You add DeepEval to your evaluation pipeline and Langfuse to your agent stack. DeepEval grades your LLM outputs. Langfuse traces everything — "instrument once, trace everything" is the tagline. Your telemetry backend receives your agent data. Your CI pipeline grades your outputs. Both tools do what they say.

## What actually happens

### DeepEval hijacked your OTel pipeline on import (v3.7.7-era, removed Dec 2025)

In versions prior to v3.9.x, importing `deepeval` registered an exporter that sent trace data to New Relic's cloud endpoints (`otlp.nr-data.net`) — regardless of what OTel backend your application had configured. This happened at import time, before any evaluation code ran.

The attack surface was any environment that both imported DeepEval and contained production trace data: CI pipelines running against production databases, staging environments with real user queries, shared test environments with live secrets in trace metadata. The user saw normal evaluation results. Their trace data went somewhere else.

[GitHub issue #2497](https://github.com/confident-ai/deepeval/issues/2497) documents this with community reproduction. The behavior was removed in v3.9.x (commit `1f903f25`, December 7, 2025). If your DeepEval version is v3.9.x or later, you are not affected by this specific behavior. If you are on an earlier version, upgrade before using DeepEval in any environment with production trace data.
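
One way to check the installed version without triggering the import-time behavior is to read package metadata instead of importing `deepeval`. The sketch below is a minimal illustration, not DeepEval tooling: the helper names are hypothetical, and treating 3.9.0 as the first fixed release is an assumption based on the "v3.9.x" wording above.

```python
from importlib import metadata

# Assumption: 3.9.0 is treated as the first release without the import-time
# exporter. The finding only says "v3.9.x", so adjust this if your audit differs.
FIRST_FIXED = (3, 9, 0)


def parse_version(text: str) -> tuple:
    """Turn '3.7.7' into (3, 7, 7); suffixes such as 'rc1' are dropped."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad '3.9' to (3, 9, 0) so comparisons are fair
        parts.append(0)
    return tuple(parts)


def is_affected(version_text: str) -> bool:
    """True when the version predates the release that removed the exporter."""
    return parse_version(version_text) < FIRST_FIXED


def check_installed(package: str = "deepeval") -> str:
    """Report status from package metadata; the package itself is never imported."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    status = "AFFECTED, upgrade" if is_affected(installed) else "not affected"
    return f"{package} {installed}: {status}"


if __name__ == "__main__":
    print(check_installed())
```

Reading metadata rather than importing matters here because the behavior described above ran at import time, so a check that imports the package would trigger the thing it is checking for.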

### Langfuse drops your orchestration span inputs and outputs (current, Apr 2026)

Langfuse's "instrument once, trace everything" marketing holds for direct LLM API calls. It breaks at non-generation spans — the routing steps, tool decisions, and agent handoffs that connect LLM calls in orchestration frameworks.

In production LangGraph supervisor orchestration, non-generation spans show empty input and output values unless `set_attribute()` or `update_current_observation()` is called manually. The auto-instrumentation covers what enters and exits the LLM. It does not capture the decision logic between calls. Your traces show the model outputs. They do not show the routing decisions, state transitions, or handoff context that drove those outputs.

The Langfuse v3.167.1 release notes (April 2026) are maintenance-heavy — dependency updates, auth, UI fixes — and do not mention any fix for span field population. Treat this as still open.

The workaround requires explicit annotation for every non-generation span:

```python
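# Uses the v2-style decorator API, which exposes update_current_observation().
# Newer SDK releases rename these helpers, so check your installed version.
from langfuse.decorators import langfuse_context, observe


def pick_next_node(state: dict) -> str:
    """Hypothetical routing rule; replace with your supervisor's logic."""
    return state.get("next", "fallback_agent")


@observe(name="routing_step")
def route_to_agent(state: dict) -> str:
    next_node = pick_next_node(state)
    # Explicitly record what the router saw and what it decided.
    langfuse_context.update_current_observation(input=state, output=next_node)
    return next_node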
```

This is not documented as a requirement for LangGraph supervisor patterns. It surfaces as an operational gap after your first production trace review.

### No observability platform prevents cost runaway mid-run

All reviewed platforms detect token and cost overruns post-hoc in dashboards. None implement real-time blocking at execution time. AgentBudget (v0.2.3) gets closest: it raises a `BudgetExhausted` exception at the call boundary — but a single over-budget LLM call completes before the exception fires. True mid-turn enforcement remains unsolved. If an agent enters an infinite retrieval loop or a subagent spawning cascade, the overage happens before any dashboard shows it.

### Cross-process multi-agent tracing has no automatic solution

All reviewed platforms trace multi-agent workflows within a single process. When agents run in separate containers, workers, or processes, you must propagate the trace ID manually and inject it into the child agent's context. LangWatch is the one exception: it supports HTTP-based trace propagation for cross-process spans, but only when both sides use LangWatch instrumentation.
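
Manual propagation can be sketched with nothing but the standard library. The example below carries a W3C Trace Context `traceparent` value (`version-traceid-parentid-flags`) in the agent call payload and continues the trace on the child side. The function names and the payload shape are hypothetical, and a real deployment would hand the resulting IDs to whatever tracing SDK it uses rather than returning a plain dict.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace in the parent agent, with the sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def build_child_payload(task: str, traceparent: str) -> dict:
    """Hypothetical call payload: the trace context travels with the task."""
    return {"task": task, "traceparent": traceparent}


def continue_trace(payload: dict) -> dict:
    """Run in the child process: reuse the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(payload.get("traceparent", ""))
    if match is None:
        raise ValueError("missing or malformed traceparent")
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,              # same trace as the parent agent
        "parent_span_id": parent_span_id,  # links the child span to its caller
        "span_id": secrets.token_hex(8),   # fresh ID for the child's own span
        "flags": flags,
    }


if __name__ == "__main__":
    payload = build_child_payload("summarize", new_traceparent())
    print(continue_trace(payload))
```

Rejecting a malformed header instead of silently starting a new trace is a deliberate choice in this sketch: a broken link between parent and child is easier to notice as an error than as two unrelated traces.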

## What this means for you

**If you used DeepEval before v3.9.x in an environment with production trace data:** your trace data — including any secrets or PII in trace metadata — was sent to New Relic's cloud endpoints. Upgrade to v3.9.x+ and audit what was in scope.

**If you are deploying Langfuse with LangGraph supervisor orchestration:** your traces are silently incomplete. You are seeing the LLM calls but not the orchestration logic between them. Every routing decision, agent handoff, and tool selection that happens between LLM calls is invisible unless you manually annotate it. Your debugging surface is narrower than you think.

**If you are relying on dashboard alerts for cost runaway prevention:** you are relying on post-hoc detection. Dashboard alerts arrive after the cost has been incurred. The only reliable control is a budget check at the SDK level, applied before each call is dispatched.

## What to do

1. **Upgrade DeepEval to v3.9.x+.** The TracerProvider hijack is removed. If you cannot upgrade, run DeepEval in an isolated test environment with no production trace data and no production secrets in scope.

2. **Add explicit `update_current_observation()` calls on every non-generation span.** For LangGraph supervisor patterns, treat Langfuse auto-instrumentation as covering LLM API calls only. Budget manual instrumentation time for routing and handoff steps.

3. **For MCP-native tracing:** LangWatch is the only reviewed platform with explicit `mcp_server` and `mcp_tool_name` span fields. Agents self-report via tool calls without SDK wrapping. W&B Weave adds MCP trace logging with a single `@weave.op` decorator, but it is cloud-only and not viable for air-gapped deployments.

4. **For cross-process multi-agent tracing:** Propagate a trace ID explicitly in the agent call payload and inject it into the child agent's context at instantiation. LangWatch's HTTP-based propagation is the closest to automated.

5. **For cost runaway prevention:** Wrap your LLM API calls (Anthropic, for example) with a running token counter. Raise a budget exception before dispatching when the next call would push the counter past the threshold. Do not rely on dashboard alerts.
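
Step 5 can be sketched as a small pre-dispatch guard. This is a minimal illustration under stated assumptions, not a provider SDK feature: `TokenBudget`, `BudgetExceeded`, and `guarded_call` are hypothetical names, `call_model` stands in for your own client wrapper, and the four-characters-per-token estimate is a rough heuristic you would replace with a real tokenizer count.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before dispatch when a call would not fit in the budget."""


@dataclass
class TokenBudget:
    limit: int
    used: int = 0

    def reserve(self, estimated_tokens: int) -> None:
        """Refuse the call up front if the estimate does not fit."""
        if self.used + estimated_tokens > self.limit:
            raise BudgetExceeded(
                f"{self.used} used + {estimated_tokens} estimated > {self.limit}"
            )

    def record(self, actual_tokens: int) -> None:
        """Add the real usage reported by the provider after the call."""
        self.used += actual_tokens


def guarded_call(budget, call_model, prompt, max_output_tokens):
    """call_model is a hypothetical callable returning (text, tokens_used)."""
    # Assumption: roughly 4 characters per input token, plus the output cap.
    estimate = len(prompt) // 4 + max_output_tokens
    budget.reserve(estimate)  # raises before any request is sent
    text, tokens_used = call_model(prompt, max_output_tokens)
    budget.record(tokens_used)
    return text
```

Because the check runs on an estimate before the request is sent, an over-budget call is never dispatched. The trade-off is that a pessimistic estimate can reject calls that would have fit, so the output cap should be set deliberately.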

## Evidence

| Claim | Source | Verified |
|---|---|---|
| DeepEval registered OTel exporter sending data to New Relic on import (v3.7.7-era) | [Issue #2497](https://github.com/confident-ai/deepeval/issues/2497) | Yes — community reproduction |
| Behavior removed in v3.9.x, commit `1f903f25`, Dec 7 2025 | DeepEval changelog / commit record | Yes — per update banner |
| Langfuse non-generation spans show empty input/output in LangGraph supervisor without `set_attribute()` | Multiple LangGraph supervisor user reports | Yes — multiple reproductions |
| Langfuse v3.167.1 (Apr 2026) release notes do not mention a fix for span field population | [Langfuse changelog](https://github.com/langfuse/langfuse) | Yes — release notes reviewed |
| LangWatch captures `mcp_server` / `mcp_tool_name` span fields natively | LangWatch documentation review | Yes — docs confirmed |
| No platform implements real-time mid-turn cost enforcement | AgentBudget v0.2.3 source review; platform comparison | Yes — call-boundary enforcement only |
| Cross-process multi-agent tracing requires manual trace ID propagation on all platforms except LangWatch | Platform documentation comparison | Yes — docs confirmed |

**Confidence:** medium — DeepEval hijack confirmed via GitHub issue with community reproduction (historical, v3.7.7-era); Langfuse gap reported by multiple LangGraph supervisor users; removal in v3.9.x confirmed via commit record; Langfuse Apr 2026 release notes reviewed and no fix identified.

**Falsification criterion:** A Langfuse release that explicitly documents which span types require manual `set_attribute()` for input/output visibility and ships auto-instrumentation for LangGraph supervisor handoff spans would disprove the Langfuse claim; the DeepEval hijack claim is already partially falsified for v3.9.x+ (behavior removed).

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — share a repro or counter-example and we'll review it against this finding. Reader evidence is what keeps these findings accurate.
