---
source_block: agent-production-readiness-checklist.md
canonical_url: https://api.theorydelta.com/published/agent-production-readiness-nine-gates
published: 2026-05-19
last_verified: 2026-05-19
confidence: secondary-research
evidence_type: independently-confirmed
staleness_risk: medium
rubric:
  total_claims: 14
  tested_count: 0
  independently_confirmed: true
  unlinked_count: 0
  scope_matches: true
  falsification_stated: true
  content_type: landscape
environments_tested:
  - tool: "LangChain State of Agent Engineering 2026"
    version: "2026 survey"
    evidence_type: docs-reviewed
    result: "89% of teams have observability tooling; only 52% have evaluation — 37-point gap between visibility and quality measurement"
  - tool: "LangGraph (LangChain)"
    version: "deepagents issue #1698 (fixed 2026-03-25)"
    evidence_type: docs-reviewed
    result: "recursion_limit=25 in subagent graphs did not propagate to spawned child graphs (fixed in deepagents March 2026); teams on pre-fix versions remain exposed"
  - tool: "OpenAI Agents SDK"
    version: "deployment checklist (2026)"
    evidence_type: docs-reviewed
    result: "Official deployment checklist lacks per-run tool-call cap or circuit-breaker requirement; rate limiting is explicitly not built in"
  - tool: "LangChain Deep Agents"
    version: "production guide (2026)"
    evidence_type: docs-reviewed
    result: "Production guide documents phased rollout pattern (5/25/100%) as recommended but notes it is commonly skipped"
  - tool: "Vercel"
    version: "agentic infrastructure blog (May 2026)"
    evidence_type: docs-reviewed
    result: "agent-auditing-agent pattern documented: a second agent reviews first agent's planned actions before execution; used in production"
  - tool: "OpenTelemetry GenAI semconv"
    version: "stable Q1 2026"
    evidence_type: docs-reviewed
    result: "GenAI semantic conventions reached stable status Q1 2026; ~80% of agent failures attributed to control-flow issues not model errors"
  - tool: "Galileo"
    version: "production readiness checklist blog (2026)"
    evidence_type: docs-reviewed
    result: "89% observability / 52% eval gap cited; checklist covering cost bounds, HITL, eval, and rollout strategy confirmed as standard practitioner reference"
  - tool: "GetOnStack / dev.to incident report"
    version: "2026"
    evidence_type: independently-confirmed
    result: "$47K cost explosion from two-agent circular handoff running 11 days undetected; no token budget enforcement in place"
theory_delta: "There is no canonical agent maturity ladder — the useful framing is a 9-gate operational checklist, and most production deployments pass only 4-6 of them, leaving silent cost, safety, and quality gaps the vendor frameworks don't enforce."
a2a_card:
  type: finding
  topic: agent-production-readiness
  claim: Agent production readiness is not a maturity stage — it is a 9-gate checklist; most deployments pass 4-6 gates, and the missing gates (cost bounds, eval, HITL, durability) are the ones that produce the documented $47K incidents.
  confidence: secondary-research
  action: test
  contribute: /api/signals
---

# "Production-ready" agents have no canonical definition — most deployments pass 4-6 of 9 operational gates

## What you expect

Agent framework vendors describe their tooling as enterprise-ready or production-grade. The underlying assumption is that production readiness is a binary state: you deploy, it works, or it fails obviously. Practitioners seeking guidance find maturity model language ("stage 3 production", "enterprise tier") that implies a ladder with clear advancement criteria.

## What actually happens

There is no canonical production readiness standard for agents. What exists is a set of independent operational concerns that each require deliberate engineering. The useful framing is a checklist of gates, not a maturity stage. A deployment can be production-ready on cost enforcement while completely missing evaluation or HITL gating. The [LangChain State of Agent Engineering 2026 survey](https://www.langchain.com/state-of-agent-engineering) puts numbers on this gap: 89% of teams have observability tooling, but only 52% have evaluation — a 37-point gap between visibility and quality measurement across the practitioner population.

Most production deployments pass 4-6 of the 9 gates below. "Production-ready" in practice means 7+.

### Gate 1: Cost and loop bounds

Documentation from framework vendors describes iteration limits and recursion limits, but does not wire them into hard cost ceilings by default. The [GetOnStack $47K incident](https://dev.to/waxell/the-47000-agent-loop-why-token-budget-alerts-arent-budget-enforcement-389i) is the documented anchor: a two-agent circular handoff ran for 11 days, accumulating $47K in API charges before manual detection. Token budget alerts existed; budget enforcement did not.

Three independent control mechanisms are required, and none substitute for the others:

1. **Iteration caps** — LangGraph `recursion_limit` (default 25 in subagent graphs), CrewAI `max_iter`, Cursor Ralph 20-cap. These limit decision steps, not token spend.
2. **Token budget enforcement** — a gateway-level hard ceiling that blocks requests when a per-run or per-session budget is exhausted. This is distinct from iteration limits.
3. **Exact-repetition detection** — a hash ledger of prior calls that breaks circular loops before they accumulate cost. The $47K incident was reproduced for $0.20 with this control in place.
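The three controls above can be combined in a single per-run guard. The following is a minimal sketch using only the standard library; `RunGuard`, its exception classes, and the `est_tokens` estimate are hypothetical names introduced for illustration, not part of any framework named here.

```python
import hashlib
import json


class StepLimitExceeded(RuntimeError):
    """The run hit its iteration cap."""


class BudgetExceeded(RuntimeError):
    """The run would exceed its hard token ceiling."""


class RepeatedCall(RuntimeError):
    """The same call was seen too many times in one run."""


class RunGuard:
    """Hypothetical per-run guard; call check() before every agent step."""

    def __init__(self, max_steps: int, max_tokens: int, max_repeats: int = 2):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.steps = 0
        self.tokens = 0
        self.ledger = {}  # call hash -> times seen in this run

    def check(self, agent: str, action: str, payload: dict, est_tokens: int) -> None:
        # Control 3: exact-repetition detection via a hash ledger.
        raw = json.dumps([agent, action, payload], sort_keys=True)
        key = hashlib.sha256(raw.encode()).hexdigest()
        seen = self.ledger.get(key, 0)
        if seen >= self.max_repeats:
            raise RepeatedCall(f"{agent}.{action} repeated {seen + 1} times")
        # Control 1: iteration cap on decision steps.
        if self.steps >= self.max_steps:
            raise StepLimitExceeded(f"step cap {self.max_steps} reached")
        # Control 2: hard token ceiling, enforced before the call is made.
        if self.tokens + est_tokens > self.max_tokens:
            raise BudgetExceeded(f"token ceiling {self.max_tokens} would be exceeded")
        self.ledger[key] = seen + 1
        self.steps += 1
        self.tokens += est_tokens
```

Each check raises instead of logging, which is the difference between enforcement and alerting. In a real deployment the token ceiling would more likely live in a gateway than in process memory.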

Before the fix documented in [deepagents issue #1698](https://github.com/langchain-ai/deepagents/issues/1698) landed on 2026-03-25, a `recursion_limit=25` set on a parent graph did not propagate to spawned child graphs; child graphs silently used their own default limit. Teams should verify they are on a post-fix version and, regardless of version, set `recursion_limit` explicitly in every nested subgraph configuration as defense in depth.
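One way to make the limit explicit is to build each child's config from the parent's rather than relying on propagation. This is a sketch under assumptions: `child_config` is a hypothetical helper, and the only LangGraph detail it relies on is that the run config is a dict with a `recursion_limit` key passed at invocation time.

```python
from typing import Any, Dict, Optional

DEFAULT_RECURSION_LIMIT = 25


def child_config(
    parent_config: Optional[Dict[str, Any]],
    limit: int = DEFAULT_RECURSION_LIMIT,
) -> Dict[str, Any]:
    """Return a copy of the parent config with recursion_limit set explicitly.

    The child never receives a higher cap than its parent already had.
    """
    config = dict(parent_config or {})
    parent_limit = config.get("recursion_limit", limit)
    config["recursion_limit"] = min(parent_limit, limit)
    return config


# Assumed usage inside a parent node that receives its own run config:
#   result = child_graph.invoke(state, config=child_config(config))
```

Copying the parent config keeps other keys intact while making the cap visible at every call site.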

### Gate 2: Observability

The [OpenTelemetry GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/) reached stable status in Q1 2026, providing a standard schema for agent span data. Industry reporting attributes approximately 80% of agent failures to control-flow issues rather than model errors — which means span data on tool calls, branching, and handoffs is more diagnostic than model output alone.

[Honeycomb's agent observability tooling](https://www.honeycomb.io/blog/honeycomb-launches-agent-observability-full-visibility-agentic-workflows) frames the requirement: visibility into tool call sequences, not just LLM input/output. Most existing observability stacks instrument the model calls but not the tool execution layer or the handoff decisions between agents.

The 89% observability adoption number from the LangChain survey is misleading without this nuance: most teams have *some* observability, but whether it covers the control-flow layer that drives 80% of failures is a different question.
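To make the tool-execution layer concrete, here is a minimal standard-library sketch of a decorator that records one span per tool call. `traced_tool` and the in-memory `SPANS` list are hypothetical stand-ins for a real tracer and exporter. The attribute names follow my reading of the GenAI conventions and should be checked against the spec before use.

```python
import functools
import time
from typing import Any, Callable, Dict, List

SPANS: List[Dict[str, Any]] = []  # stand-in for a real span exporter


def traced_tool(name: str) -> Callable:
    """Record entry, exit, duration, and error type for each tool call."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            span = {
                "name": f"execute_tool {name}",
                "attributes": {
                    "gen_ai.operation.name": "execute_tool",
                    "gen_ai.tool.name": name,
                },
                "status": "ok",
            }
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                # Failed tool calls are the spans you most need to see.
                span["status"] = "error"
                span["attributes"]["error.type"] = type(exc).__name__
                raise
            finally:
                span["duration_s"] = time.perf_counter() - start
                SPANS.append(span)

        return wrapper

    return decorator
```

Wrapping every registered tool this way yields a call sequence per run, which is the data needed to reconstruct control flow after a failure.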

### Gate 3: Controlled rollout

A 5/25/100% phased rollout pattern is documented in the [LangChain Deep Agents production guide](https://docs.langchain.com/oss/python/deepagents/going-to-production) as the recommended approach: start with 5% of traffic, validate failure rates, then expand. The same guide notes that the pattern is commonly skipped. The consequence of skipping it is that production failure modes surface at full traffic rather than on a bounded user cohort.

Controlled rollout is not an agent-specific concern, but it is frequently treated as optional for agents because early agent deployments are small by headcount. As agent systems handle consequential decisions (writes, external calls, financial operations), rollout gating becomes the only way to bound the blast radius of a configuration error.
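A phased rollout needs a stable way to decide which users see the new agent version. The sketch below assumes hash-based bucketing, which is one common approach and not something the guide prescribes; `in_rollout` and the `salt` value are hypothetical.

```python
import hashlib

ROLLOUT_STAGES = (5, 25, 100)  # percent of traffic at each stage


def in_rollout(user_id: str, percent: int, salt: str = "agent-v2") -> bool:
    """Deterministically assign a user to a bucket in [0, 100).

    The same user always lands in the same bucket, so a user admitted
    at 5% stays admitted when the stage expands to 25% and 100%.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Changing the salt per release reshuffles the cohort, so the same users are not always the first to absorb new failure modes.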

### Gate 4: Security and governance

The security gate covers four independent sub-concerns:

- **RBAC and tool scoping** — agents should hold minimum required permissions; tool access should be scoped per agent type
- **Sandboxing** — code-executing agents require container or subprocess isolation; `LocalPythonInterpreter`-style soft sandboxes are bypassable (NCC Group has a working exploit against smolagents' default)
- **PII and data handling** — agent sessions that touch untrusted input should not, in the same session, access sensitive systems (Oso's session tainting pattern)
- **Memory namespacing** — multi-tenant agents sharing a memory layer without namespace isolation are vulnerable to cross-session data leakage
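Two of these sub-concerns, tool scoping and memory namespacing, reduce to small checks at the call boundary. The sketch below is illustrative only: the agent types, tool names, and key format are hypothetical, and real deployments would back this with a policy engine or the framework's own permission model.

```python
# Hypothetical allowlist: each agent type gets only the tools it needs.
TOOL_ALLOWLIST = {
    "researcher": frozenset({"web_search", "read_file"}),
    "coder": frozenset({"read_file", "write_file", "run_tests"}),
}


class ToolNotPermitted(PermissionError):
    """Raised when an agent requests a tool outside its scope."""


def authorize_tool(agent_type: str, tool: str) -> None:
    """Deny by default: unknown agent types have no tools."""
    allowed = TOOL_ALLOWLIST.get(agent_type, frozenset())
    if tool not in allowed:
        raise ToolNotPermitted(f"{agent_type} may not call {tool}")


def memory_key(tenant_id: str, session_id: str, key: str) -> str:
    """Prefix every memory key with tenant and session identifiers."""
    for part in (tenant_id, session_id):
        # Reject empty parts and the separator itself, so one tenant
        # cannot craft an identifier that lands in another's namespace.
        if not part or ":" in part:
            raise ValueError(f"invalid namespace part: {part!r}")
    return f"{tenant_id}:{session_id}:{key}"
```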

The [Vercel agentic infrastructure post (May 2026)](https://vercel.com/blog/agentic-infrastructure) documents an agent-auditing-agent pattern currently in production use: a second agent reviews the first agent's planned actions before execution, providing a lightweight safety layer without blocking the primary workflow.

### Gate 5: Evaluation

The 89%/52% gap (observability vs. evaluation) from the [LangChain survey](https://www.langchain.com/state-of-agent-engineering) establishes the population baseline. Observability catches when things break; evaluation catches when they silently degrade. An agent that returns a response without errors can still be producing semantically wrong output on a meaningful fraction of queries.

The [Galileo production readiness checklist](https://galileo.ai/blog/production-readiness-checklist-ai-agent-reliability) and practitioner guidance converge on the same requirement: evaluation is distinct from observability and requires deliberate instrumentation. The minimum viable evaluation baseline: a fixed set of regression inputs with expected outputs, run against every model or prompt change before deployment.

Without evaluation, confidence in agent behavior is derived entirely from the absence of error signals — which is not the same as evidence of correctness.
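The minimum viable baseline described above, a fixed regression set run before each change ships, can be sketched in a few lines. `Case`, `run_regression`, and the exact-match comparison are assumptions for illustration; real suites usually need fuzzier scoring than string equality.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


def run_regression(
    agent: Callable[[str], str],
    cases: List[Case],
    min_pass_rate: float = 1.0,
) -> Tuple[bool, float, List[Case]]:
    """Run every case and return (gate_passed, pass_rate, failing_cases)."""
    if not cases:
        raise ValueError("regression set is empty")
    # Exact match after whitespace trimming; swap in a scorer as needed.
    failures = [c for c in cases if agent(c.prompt).strip() != c.expected]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate >= min_pass_rate, pass_rate, failures
```

Wiring the boolean result into CI turns the regression set into a deployment gate rather than a report.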

### Gate 6: Human-in-the-loop (HITL) gates

Three production agent systems document their HITL patterns:

- [Claude Code permission modes](https://code.claude.com/docs/en/permission-modes) — granular approval scopes for file write, execution, and network operations
- Devin plan checkpoint — human approval required before multi-step execution plans are committed
- GitHub Copilot Workspace — pull request creation requires explicit human trigger; agent does not open PRs autonomously

The EU AI Act enforcement date of August 2, 2026 applies to high-risk AI systems including those making consequential automated decisions. HITL gating is one of the structural requirements. Deployments that do not have explicit HITL for irreversible or high-stakes actions face both compliance exposure and the operational risk documented in the enterprise-agent-production literature: agent mistakes execute at the same speed as agent successes, which can mean global-scale impact before human review is possible.

The unresolved gap noted in the literature: no agent framework has published HITL throughput benchmarks. At scale, a HITL queue can accumulate decisions faster than reviewers can drain it — the pattern is architecturally unaddressed.
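As an illustration of what an explicit gate for irreversible actions looks like in code, here is a minimal sketch. The action classes, `execute_with_gate`, and the `approve` callback are hypothetical; none of the three systems listed above is claimed to work this way.

```python
from typing import Any, Callable

# Hypothetical set of action classes that must never run unapproved.
IRREVERSIBLE = frozenset({"delete", "financial_commit", "permission_escalation"})


class ApprovalDenied(RuntimeError):
    """Raised when a human reviewer rejects an irreversible action."""


def execute_with_gate(
    action_class: str,
    run: Callable[[], Any],
    approve: Callable[[str], bool],
) -> Any:
    """Run the action only after approval when its class is irreversible.

    `approve` blocks until a human decides; reversible actions skip it.
    """
    if action_class in IRREVERSIBLE and not approve(action_class):
        raise ApprovalDenied(action_class)
    return run()
```

Because `approve` blocks, every gated action adds to a reviewer queue, which is where the throughput question raised above becomes a practical concern.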

### Gate 7: Model control

Three independent levers exist at the model layer:

- **Reasoning-effort tuning** — for models that expose a reasoning effort parameter, lower effort reduces latency and cost on tasks where the reasoning budget is routinely underutilized; misconfigured effort is a silent cost multiplier
- **Fallback models** — when a primary model is unavailable or over budget, automatic fallback to a cheaper or locally-hosted model prevents hard failures; without it, transient model API issues propagate as agent outages
- **Prompt caching** — for workflows with stable system prompts and preambles, prompt caching reduces costs by up to 88% on cached prefixes; this is cost reduction, not cost enforcement, but it changes the unit economics of operating within a budget
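The fallback lever can be sketched as an ordered chain of model callables. Everything here is hypothetical: `call_with_fallback`, the `(name, callable)` pairing, and the choice of which exceptions count as transient are assumptions that a real client library would define differently.

```python
from typing import Callable, Sequence, Tuple, Type


class AllModelsFailed(RuntimeError):
    """Raised when every model in the chain hit a transient error."""


def call_with_fallback(
    prompt: str,
    models: Sequence[Tuple[str, Callable[[str], str]]],
    retryable: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> Tuple[str, str]:
    """Try each model in order; return (model_name, response).

    Only errors listed in `retryable` trigger a fallback. Anything else
    propagates, so bad requests are not silently retried elsewhere.
    """
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except retryable as exc:
            errors.append(f"{name}: {exc!r}")
    raise AllModelsFailed("; ".join(errors))
```

Returning the model name alongside the response lets downstream evaluation and cost accounting see which model actually served the request.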

The [OpenAI Agents SDK deployment checklist](https://developers.openai.com/api/docs/guides/deployment-checklist) covers model selection and fallback. Reasoning-effort tuning is model-provider-specific and not covered in any single cross-provider guide.

### Gate 8: Rate limiting

Per-run tool-call caps are distinct from token limits: they constrain the number of external interactions (API calls, file writes, web fetches) regardless of token budget. A tool-call storm on a cheap model can exhaust external API rate limits without triggering a token budget ceiling.

LangChain ships middleware for per-run tool-call capping. The [OpenAI Agents SDK](https://developers.openai.com/api/docs/guides/deployment-checklist) does not include this as a built-in — teams using the SDK must implement tool-call rate limiting at the application layer.

This is not an LLM token rate limit (the model API provider's concurrency controls). It is an agent-level cap on how many tool invocations a single agent run is permitted to make.
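For teams that need to implement this at the application layer, a framework-agnostic cap might look like the sketch below. `ToolCallCap` is a hypothetical class, not the LangChain middleware mentioned above, and the per-tool limits are an optional extension.

```python
from typing import Dict, Optional


class ToolCallLimitExceeded(RuntimeError):
    """Raised when a run exceeds its tool-call allowance."""


class ToolCallCap:
    """Count tool invocations for one agent run and refuse past the cap."""

    def __init__(self, max_calls: int, per_tool: Optional[Dict[str, int]] = None):
        self.max_calls = max_calls
        self.per_tool = dict(per_tool or {})
        self.total = 0
        self.counts: Dict[str, int] = {}

    def record(self, tool: str) -> None:
        """Call before each tool invocation; raises instead of invoking."""
        if self.total >= self.max_calls:
            raise ToolCallLimitExceeded(f"run cap of {self.max_calls} reached")
        limit = self.per_tool.get(tool)
        used = self.counts.get(tool, 0)
        if limit is not None and used >= limit:
            raise ToolCallLimitExceeded(f"{tool} cap of {limit} reached")
        self.total += 1
        self.counts[tool] = used + 1
```

Create one instance per run so that counts reset between runs and a long-lived process does not lock itself out.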

### Gate 9: Async and durability

Agent workflows that span multiple LLM calls and external tool invocations can run for minutes or hours. Synchronous execution breaks above approximately 30 seconds due to HTTP gateway timeouts and client connection limits.

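Durable execution (Temporal, LangGraph checkpointing) makes long-running workflows resumable: a transient failure mid-workflow resumes from the last checkpoint rather than restarting from scratch or losing state silently. Steps that may re-run after a failure should be idempotent, because checkpointing alone does not guarantee that an external side effect happens exactly once.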

The documented gap: durable execution does not prevent cost accumulation. The GetOnStack incident used durable patterns — the workflow executed reliably and durably while accumulating $47K. Durability and cost enforcement are orthogonal gates. Both are required.
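The checkpoint-and-resume idea can be shown with a toy file-based version. This is a sketch of the concept only, not how Temporal or LangGraph implement persistence; `run_with_checkpoints` and the JSON state layout are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[dict], dict]]


def run_with_checkpoints(steps: List[Step], path: Path) -> dict:
    """Run named steps in order, persisting state after each one.

    On restart, steps already recorded as done are skipped. A step that
    fails before its checkpoint is written will run again on resume.
    """
    state = {"done": [], "data": {}}
    if path.exists():
        state = json.loads(path.read_text())
    for name, step in steps:
        if name in state["done"]:
            continue
        state["data"] = step(state["data"])
        state["done"].append(name)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a truncated checkpoint behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(path)
    return state["data"]
```

Nothing in this loop knows what a step costs, which is the point made above: a durable workflow resumes faithfully whether or not it should still be running.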

## What this means for you

The gap between "passes 4-6 gates" and "passes 7+" is where production incidents live. The pattern in the literature:

- **Missing gate 1 (cost bounds)** → runaway cost incidents ($47K range)
- **Missing gate 5 (evaluation)** → silent semantic degradation that observability cannot catch
- **Missing gate 6 (HITL)** → irreversible operations executed without human approval
- **Missing gate 9 (durability)** → state loss on failures in long-running workflows

The gates are independent. Passing 8 does not protect you if the 9th gate for your specific risk profile is missing. A deployment with excellent observability, evaluation, and HITL but no cost enforcement is one coordination-loop bug away from a significant incident.

## What to do

Audit your current deployment against the 9 gates and score honestly. Most teams discover the same gaps: cost enforcement (gate 1), evaluation (gate 5), and async durability (gate 9).

1. **Gate 1 — Cost bounds**: Add a token-budget gateway (LiteLLM, MLflow AI Gateway) with a hard reject (HTTP 429) at your per-run ceiling. Set `require_trace_id_on_calls_by_agent: true` in LiteLLM or equivalent — silent-fail tracking is not enforcement. Add exact-repetition detection to any multi-agent loop. Explicitly set and propagate `recursion_limit` in all nested LangGraph subgraphs.

2. **Gate 2 — Observability**: Instrument tool call entry/exit spans, not just LLM call spans. GenAI semconv provides the schema; Arize Phoenix, Langfuse, and Honeycomb all support it. Without tool-call visibility you cannot diagnose the control-flow failures that reportedly account for roughly 80% of agent failures.

3. **Gate 3 — Controlled rollout**: Route 5% of traffic to new agent versions first. Measure tool call error rate, cost per session, and task completion rate over 24h before expanding.

4. **Gate 4 — Security**: Scope tool permissions per agent. For code-executing agents, use Docker or subprocess isolation — soft Python sandbox approaches are bypassable. Namespace memory by tenant. Consider the agent-auditing-agent pattern for high-stakes write operations.

5. **Gate 5 — Evaluation**: If you have zero evals, start with 20 regression inputs covering the highest-stakes task types. Add Pass^k (k≥3) for any task class where reliability is a commitment — single-run pass rates overstate reliability by 20-30 points.

6. **Gate 6 — HITL**: Identify your irreversible action classes (delete, financial commit, permission escalation). Wire explicit human approval gates for each. Treat the August 2, 2026 EU AI Act enforcement date as a forcing function if you operate in scope.

7. **Gate 7 — Model control**: Set fallback model configs. Enable prompt caching on stable system prompts. Audit reasoning-effort settings if your model provider exposes them.

8. **Gate 8 — Rate limiting**: Add per-run tool-call caps at the application layer if your framework doesn't enforce them. Separate this from token limits — they address different failure modes.

9. **Gate 9 — Async/durability**: Any workflow expected to exceed 30s needs checkpointing, for example with Temporal or LangGraph persistence. Remember that durability is not cost enforcement; add gate 1 separately.
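The Pass^k recommendation in step 5 can be made concrete with a small calculation. The sketch assumes the simple reading of Pass^k, where a task counts as passed only if all k independent trials succeed; `pass_rates` is a hypothetical helper and the sample data in the usage note is invented.

```python
from typing import List, Tuple


def pass_rates(results: List[List[bool]], k: int = 3) -> Tuple[float, float]:
    """Return (single_run_pass_rate, pass_hat_k) for a suite of tasks.

    results[i] holds the trial outcomes for task i; each task needs at
    least k trials. Only the first k trials per task are used.
    """
    if not results or any(len(trials) < k for trials in results):
        raise ValueError("every task needs at least k trials")
    trials_used = [trials[:k] for trials in results]
    total = sum(len(t) for t in trials_used)
    single = sum(sum(t) for t in trials_used) / total
    # A task passes only when every one of its k trials passed.
    all_k = sum(all(t) for t in trials_used) / len(trials_used)
    return single, all_k
```

With four tasks whose three trials come out as `[T, T, T]`, `[T, T, F]`, `[T, F, T]`, and `[T, T, T]`, the single-run rate is 10/12 while Pass^3 is 2/4, which shows how averaging over trials hides per-task unreliability.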

**Falsification criterion:** This finding would be disproved by a major agent framework (LangGraph, OpenAI Agents SDK, or CrewAI) shipping a default configuration that enforces all 9 gates out of the box with zero additional configuration, or by survey data showing median production deployments pass 8+ gates.

## Evidence

| Tool | Version | Evidence | Result |
|------|---------|----------|--------|
| [LangChain State of Agent Engineering 2026](https://www.langchain.com/state-of-agent-engineering) | 2026 survey | docs-reviewed | 89% of teams have observability tooling; only 52% have evaluation — 37-point gap confirmed |
| [GetOnStack $47K incident](https://dev.to/waxell/the-47000-agent-loop-why-token-budget-alerts-arent-budget-enforcement-389i) | 2026 post-mortem | independently-confirmed | Two-agent circular handoff ran 11 days, $47K accumulated; no budget enforcement; reproduced for $0.20 with repetition detection |
| [LangGraph deepagents issue #1698](https://github.com/langchain-ai/deepagents/issues/1698) | fixed 2026-03-25 | docs-reviewed | recursion_limit=25 in parent graph did not propagate to spawned child subagent graphs; fix landed March 2026 — teams on pre-fix versions remain exposed |
| [OpenAI Agents SDK deployment checklist](https://developers.openai.com/api/docs/guides/deployment-checklist) | 2026 | docs-reviewed | Official checklist lacks per-run tool-call cap or circuit-breaker requirement; rate limiting not built in |
| [LangChain Deep Agents production guide](https://docs.langchain.com/oss/python/deepagents/going-to-production) | 2026 | docs-reviewed | 5/25/100% phased rollout documented as recommended; noted as commonly skipped in practice |
| [Vercel agentic infrastructure](https://vercel.com/blog/agentic-infrastructure) | May 2026 | docs-reviewed | agent-auditing-agent pattern in production: second agent reviews first agent's planned actions before execution |
| [Honeycomb agent observability](https://www.honeycomb.io/blog/honeycomb-launches-agent-observability-full-visibility-agentic-workflows) | 2026 | docs-reviewed | Tool call sequence visibility identified as the diagnostic gap; ~80% of agent failures attributed to control-flow not model errors |
| [Galileo production readiness checklist](https://galileo.ai/blog/production-readiness-checklist-ai-agent-reliability) | 2026 | docs-reviewed | Checklist covering 9 readiness dimensions cited; 89%/52% observability/eval gap used as baseline |
| [Claude Code permission modes](https://code.claude.com/docs/en/permission-modes) | 2026 | docs-reviewed | Granular HITL approval scopes for file write, execution, and network documented |
| [OpenTelemetry GenAI semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/) | stable Q1 2026 | docs-reviewed | GenAI semantic conventions reached stable status; standard schema for agent span data |

**Confidence:** secondary-research — 10 sources reviewed (survey data, incident post-mortems, vendor docs, framework issue trackers). Zero runtime testing; all evidence is docs-reviewed or independently-confirmed. The $47K incident is the only independently-confirmed data point; the 89%/52% gap is survey-derived.

**Strongest case against:** The 9-gate framing is synthesized — no single practitioner or framework vendor has validated this exact list as the canonical set. The 89%/52% survey numbers come from LangChain, which has an interest in promoting evaluation tooling (they sell it). The $47K incident, while real, may represent a tail risk rather than typical exposure. Teams with strong platform engineering practices may pass all 9 gates implicitly without naming them as such. The "4-6 gates" claim for most deployments is inference from survey and incident data, not direct measurement.

**Open questions:** What is the actual distribution of gate coverage across production deployments? Does passing all 9 gates meaningfully reduce incident rates, or are there other failure modes outside this taxonomy? Is the HITL throughput gap (no framework has published HITL queue benchmarks) a real constraint in practice or a theoretical concern?

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — share a repro or counter-example and we'll review it against this finding. Reader evidence is what keeps these findings accurate.