---
source_block: agentic-error-suppression.md
canonical_url: https://api.theorydelta.com/published/agentic-error-suppression
published: 2026-03-29
last_verified: 2026-03-28
confidence: medium
evidence_type: independently-confirmed
environments_tested:
  - tool: "Claude Code (Anthropic)"
    version: "March 2026 (headless/CI mode)"
    evidence_type: source-reviewed
    result: "~8% of CI/CD executions return exit code 0 with empty or irrelevant response (SFEIR Institute documentation)"
  - tool: "OpenAI Codex CLI"
    version: "v0.36–v0.72 (regression window)"
    evidence_type: source-reviewed
    result: "All commands returned exit code 0 with empty stdout/stderr; root cause: shell environment config (issue #9091, closed)"
  - tool: "Vercel AI SDK"
    version: "< v4.1.22"
    evidence_type: source-reviewed
    result: "streamText() silently swallows errors by default; fixed in v4.1.22 via opt-in onError callback (issue #4726, closed)"
theory_delta: Error suppression that a human would notice and investigate leaves an agent with exit code 0 and no signal to stop — the Tools Fail paper measured a 76-point accuracy drop under silent failures.
a2a_card:
  type: finding
  topic: agentic-error-suppression
  claim: "Error suppression patterns (2>/dev/null, || true, bare except) are structurally incompatible with agent execution — agents have no secondary feedback channel and proceed on false premises when exit code 0 is returned for a failed operation."
  confidence: medium
  action: avoid
  contribute: /api/signals
rubric:
  total_claims: 8
  tested_count: 0
  independently_confirmed: true
  unlinked_count: 1
  scope_matches: true
  falsification_stated: true
  content_type: finding
tasks:
  - task: configure-autonomy
    phase: run-production
---

# Error suppression patterns common in scripts will silently break your agent

*From [Theory Delta](https://theorydelta.com) | [Methodology](https://theorydelta.com/methodology/) | Published 2026-03-29*

## What you expect

Shell scripting guides treat `2>/dev/null`, `|| true`, and `set +e` as style preferences. Python tutorials present broad `except:` blocks as acceptable shortcuts. Both treat error suppression as a minor code smell — the assumption is that the person running the script will notice if something looks wrong.

## What actually happens

Agents have exactly one feedback channel: exit code plus stdout/stderr. When error suppression destroys this channel, the agent cannot detect, diagnose, or recover from failures.

The ["Tools Fail" paper (arXiv:2406.19228)](https://arxiv.org/abs/2406.19228) measures the impact quantitatively: when tools returned incorrect output without error signals, GPT-3.5 accuracy dropped from 98.7% to 22.7% — a **76 percentage point collapse**. The failure mode is not "agent notices something is wrong but cannot fix it." It is "agent does not notice anything is wrong" — models copy incorrect tool output and generate hallucinated justifications for why it was correct.

This pattern appears across the full agent toolchain, not just bash scripts:

**SDK layer:** Vercel AI SDK `streamText()` [silently swallows errors by default](https://github.com/vercel/ai/issues/4726) — no logs, no stack trace without an opt-in `onError` callback. Fixed in v4.1.22, but the default behavior shipped silently for all prior versions.

**CLI layer:** OpenAI Codex CLI [issue #9091](https://github.com/openai/codex/issues/9091) documents a regression between v0.36 and v0.72 where all commands returned exit code 0 with empty stdout and stderr. The agent could not observe any command output and could not detect whether commands succeeded or failed. Root cause was local shell environment configuration — but the failure mode (stdout invisible to agent) is the same regardless of cause. Related issues [#7317](https://github.com/openai/codex/issues/7317), [#6033](https://github.com/openai/codex/issues/6033), and [#6991](https://github.com/openai/codex/issues/6991) confirm this is a recurring failure class.

**Shell layer:** Bash 3.2 (macOS default) silently crashes agent scripts on empty array expansion under `set -u` — `${arr[@]}` expands to nothing and the script terminates with an unbound variable error. Bash 3.2 also does not support `declare -A` associative arrays — scripts using them return exit code 0 with no output rather than surfacing the unsupported syntax error.

**Observability layer:** AgentOps silent 401 authentication failures occur without visible error output by default. An agent using AgentOps for observability may be operating without any telemetry while believing instrumentation is active.

**CI layer:** [SFEIR Institute documentation on Claude Code headless mode](https://institute.sfeir.com/en/claude-code/claude-code-headless-mode-and-ci-cd/errors/) reports approximately 8% of CI/CD executions produce a silent failure: exit code 0 returned, but response is empty, incomplete, or irrelevant.

Note: the observatory prevalence claim (58% of repos exhibit error suppression) is sourced from an internal scan tagged as `[legacy, unverified]` in the source block and should not be treated as a confirmed statistic until independently reproduced.

## What this means for you

Your agent pipeline is not just vulnerable to error suppression you introduce — it is vulnerable to the error suppression already in every shell script, every SDK, and every tool your agent calls. The difference between human and agent execution is not degree but kind:

| Property | Human-executed script | Agent-executed script |
|---|---|---|
| Visual feedback | Terminal output visible | None — only structured return value |
| Error detection | Intuition, domain knowledge | Exit code + stderr only |
| Recovery path | Re-run, inspect logs, ask a colleague | Retry same command, or hallucinate success |
| Suppression impact | Inconvenient — human will eventually notice | Catastrophic — agent proceeds on false premise |

The compounding factor: [Dagster's "Dignified Python" analysis](https://dagster.io/blog/dignified-python-10-rules-to-improve-your-llm-agents) found that LLMs default to broad exception swallowing because `try: ... except: pass` patterns dominate training data. Agents writing scripts **actively introduce error suppression** unless their context explicitly prohibits it. You are not just defending against legacy patterns — you are defending against your own agent.

Your staging environment (low concurrency, single platform) will pass. Production (concurrent requests, mixed platforms, real-world tool failures) is where error suppression triggers the 76-point drop.

## What to do

Add `set -euo pipefail` as the first line of every agent-executed bash script. This converts the largest class of silent failures into explicit errors that agents can detect and act on. The `pipefail` flag is especially important: without it, `failing_command | grep pattern` returns the exit code of `grep`, not the failing command.

```bash
#!/bin/bash
set -euo pipefail
```

For commands where non-zero exit is not an error (e.g., `grep` returning 1 for "no matches"), use targeted handling with logging rather than blanket suppression:

```bash
# BAD: suppresses all error information
command 2>/dev/null || true

# GOOD: capture and surface errors explicitly
if ! output=$(command 2>&1); then
  echo "ERROR: command failed with: $output" >&2
  exit 1
fi

# Acceptable: grep no-match is not an error, but document why
matches=$(grep -r "pattern" dir/ 2>&1) || true  # grep returns 1 on no match
```

For Python code agents execute or write:

```python
# BAD: swallows all errors
try:
    result = risky_operation()
except:
    pass

# GOOD: catch specific exceptions, surface the error
try:
    result = risky_operation()
except SpecificError as e:
    print(f"ERROR: {e}", file=sys.stderr)
    raise
```

When agents write scripts, include an explicit rule in their context: "Never use `2>/dev/null`, `|| true`, or bare `except:` blocks. Use `set -euo pipefail`. Catch specific exceptions and re-raise or log to stderr."

For the Tools Fail paper's mitigation: adding a simple disclaimer ("The tool can sometimes give incorrect answers") to tool descriptions improved silent error detection by approximately 30% (GPT-3.5: from 23% to 53% in zero-shot). This is low-cost and can be added to tool definitions without changing the tool implementation.

**Falsification criterion:** A controlled study showing agent task accuracy is not significantly degraded by silent tool failures in real-world (non-synthetic) codebases, or agent architectures achieving >80% accuracy on tasks where tool failures are silently suppressed at a scale comparable to the Tools Fail paper, would disprove this finding.

## Evidence

| Claim | Primary Source | Verified |
|---|---|---|
| 76pp accuracy drop with silent tool failures | [arXiv:2406.19228](https://arxiv.org/abs/2406.19228) | Yes — controlled study, GPT-3.5 |
| ~8% of Claude Code CI/CD runs return exit 0 with empty/irrelevant response | [SFEIR Institute](https://institute.sfeir.com/en/claude-code/claude-code-headless-mode-and-ci-cd/errors/) | Source-reviewed |
| Codex CLI all commands returned exit 0 with empty stdout/stderr | [openai/codex#9091](https://github.com/openai/codex/issues/9091) | Source-reviewed; root cause: local shell env config |
| Vercel AI SDK streamText() swallows errors silently by default | [vercel/ai#4726](https://github.com/vercel/ai/issues/4726) | Source-reviewed; fixed in v4.1.22 |
| Adding disclaimer to tool description improves error detection ~30% | [arXiv:2406.19228](https://arxiv.org/abs/2406.19228) | Yes — same study |

**Confidence:** medium — primary quantitative evidence from one academic paper (arXiv:2406.19228), independently confirmed by GitHub issues in three separate tools. Not tested by execution in Theory Delta's environment. The Tools Fail paper uses synthetic calculator tools, not real-world bash scripts; the magnitude of accuracy degradation in production agentic systems may differ.

**Open questions:** Does the accuracy degradation replicate on more capable models (Claude 3.5+, GPT-4o)? Does `set -euo pipefail` actually improve end-task accuracy in agent-executed scripts, or just surface more visible errors? What is the real-world prevalence of error suppression in agent-executed scripts (the internal observatory stat is unverified)?

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — share a repro or counter-example and we'll review it against this finding. Reader evidence is what keeps these findings accurate.