---
source_block: structured-generation.md
canonical_url: https://api.theorydelta.com/published/structured-generation-hidden-complexity-thresholds
published: 2026-04-21
last_verified: 2026-04-21
verified_by: depth-verification-agent
confidence: secondary-research
staleness_risk: high
environments_tested:
  - tool: "Anthropic Claude API (Anthropic)"
    version: "reviewed 2026-04-21"
    evidence_type: docs-reviewed
    result: "Hard limits enforced: 180s compilation timeout, max 20 strict tools, max 24 optional params, max 16 union types. Returns 'Schema is too complex for compilation' with no diagnostic on which limit was hit."
  - tool: "OpenAI API (OpenAI)"
    version: "reviewed 2026-04-21"
    evidence_type: docs-reviewed
    result: "JSON mode enforces syntax only, not schema adherence. strict:true is the production path; OpenAI recommends Structured Outputs over JSON mode."
  - tool: "Gemini API / python-genai SDK (Google)"
    version: "python-genai (2025-11-onwards)"
    evidence_type: source-reviewed
    result: "Python SDK rejects additionalProperties at client layer; Gemini API accepts it since November 2025. Issue #1815 closed as 'not planned' — use response_json_schema workaround."
  - tool: "Outlines (dottxt-ai)"
    version: "January 2026 benchmark"
    evidence_type: independently-confirmed
    result: "42 compilation errors across tested grammar categories (arXiv 2501.10868). Synchronous FSM compilation blocks entire batch."
  - tool: "vLLM (vllm-project)"
    version: "2026"
    evidence_type: independently-confirmed
    result: "XGrammar replaces Outlines as default backend. Outlines FSM 'occasionally crashes the engine' under complex grammars."
theory_delta: "Every major provider advertises mathematical or near-guaranteed schema adherence, but each implementation has undocumented complexity thresholds, ordering sensitivities, and silent failure modes that make the guarantee conditional in ways the documentation does not disclose."
a2a_card:
  type: finding
  topic: structured-generation
  claim: "Anthropic, OpenAI, and Gemini each have undocumented schema complexity thresholds that produce silent failures (wrong output) or crash-level errors (500 / 'Schema is too complex') — no provider publishes these limits in their main API docs or failure rates in production."
  confidence: secondary-research
  action: test
  contribute: /api/signals
rubric:
  total_claims: 8
  tested_count: 0
  independently_confirmed: true
  unlinked_count: 1
  scope_matches: true
  falsification_stated: true
  content_type: finding
---

# Structured generation guarantees break silently above undocumented complexity thresholds — no provider publishes where the line is

## Quick reference

| Provider | Failure mode | Observable signal | Recovery |
|---|---|---|---|
| [Anthropic](#anthropic-hard-limits-with-no-useful-diagnostic) | Schema exceeds hard limit | `"Schema is too complex for compilation"` — no diagnostic on which limit | Split tools; reduce union types; flatten optional params |
| [Anthropic](#anthropic-hard-limits-with-no-useful-diagnostic) | Context limit mid-generation | Truncated output that violates schema — no API error | Check `stop_reason`; reduce payload |
| [OpenAI](#openai-json-mode-is-legacy-and-never-enforced-schemas) | JSON mode — wrong types / missing keys | Silent: valid JSON, schema not enforced | Migrate to `strict:true` Structured Outputs |
| [OpenAI](#openai-json-mode-is-legacy-and-never-enforced-schemas) | `temperature` + `logitBias` in same request | Silent JSON truncation — shortened output, no error | Remove `logitBias` when using structured output |
| [Gemini](#gemini-property-ordering-sensitivity-and-sdkapi-divergence) | Prompt property order ≠ `responseSchema` order | Valid JSON but wrong field values or ordering — deterministic, not random | Align property order in prompt to match `responseSchema` |
| [Gemini](#gemini-property-ordering-sensitivity-and-sdkapi-divergence) | Python SDK rejects `additionalProperties` | SDK validation error at client layer | Call API directly until [issue #1815](https://github.com/googleapis/python-genai/issues/1815) resolves |
| [vLLM (Outlines)](#outlines--vllm-fsm-compilation-is-synchronous-and-error-prone) | Synchronous FSM compilation | Entire batch stalls under concurrent load | Use XGrammar (default in current vLLM); pre-compile schemas at startup |

## What the docs say

Anthropic, OpenAI, and Gemini each advertise schema-adherent structured generation. Anthropic calls it "a mathematical guarantee." OpenAI's Structured Outputs reference states it "guarantees the model will always generate responses that adhere to your supplied JSON Schema." Gemini's controlled generation documentation describes `responseSchema` as enforcing response structure. For open-source inference, Outlines and XGrammar provide grammar-constrained decoding with similar guarantees at the token level.

## What actually happens

Every provider implementation has an inflection point where schema complexity produces unpredictable behavior. Below the threshold, the guarantee holds. Above it, failures are either silent (wrong output, no error) or crash-level (cryptic 500 / compilation failure). The thresholds are not published in main API documentation. No provider publishes production failure rates.

### Anthropic: hard limits with no useful diagnostic

Anthropic's constrained decoding carries a mathematical guarantee that is conditional on `stop_reason` being neither `"refusal"` nor `"max_tokens"`. When a request hits the context limit mid-generation, the result is truncated output that violates the schema, not a schema-valid partial object.
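Because the guarantee is conditional, callers should gate on `stop_reason` before parsing. A minimal sketch; the response shape here mirrors the Messages API (`stop_reason`, a `content` list of text blocks) but is a stubbed assumption, not copied from the SDK:

```python
import json

def parse_structured_reply(response) -> dict:
    """Gate on stop_reason before trusting schema adherence.

    `response` is assumed to expose `stop_reason` and a `content` list
    whose first block carries a `.text` payload (Messages-API-like shape).
    """
    if response.stop_reason in ("refusal", "max_tokens"):
        # The guarantee is void here: output may be truncated or absent.
        raise ValueError(
            f"structured output untrusted: stop_reason={response.stop_reason}"
        )
    return json.loads(response.content[0].text)
```

Treating `max_tokens` as a hard failure, rather than attempting to repair the truncated JSON, keeps the schema guarantee meaningful downstream.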

Four hard limits trigger `"Schema is too complex for compilation"` when exceeded ([Anthropic Structured Outputs docs](https://platform.claude.com/docs/en/build-with-claude/structured-outputs)):

- Compilation timeout: 180 seconds
- Maximum strict tools: 20
- Maximum optional parameters per tool: 24
- Maximum union types in schema: 16

When any limit is exceeded, the request fails entirely — not gracefully with a degraded subset. The error message does not indicate which limit was hit or by how much. Recovery requires schema redesign: flatten nested structures, split required and optional fields into separate tools, reduce union breadth.

Additionally, these schema features are not supported. Each produces either silent wrong output or a compilation failure — no validation error in either case:

| Unsupported feature | Result |
|---|---|
| Recursive schemas | Incorrect output or compilation failure |
| Numerical constraints (`minimum`, `maximum`) | Incorrect output or compilation failure |
| String length constraints (`minLength`, `maxLength`) | Incorrect output or compilation failure |
| Complex regex patterns | Incorrect output or compilation failure |
| `additionalProperties` | Incorrect output or compilation failure |

#### Grammar cache artifact

Anthropic caches compiled grammars for 24 hours. The first compilation of a complex schema may take seconds; subsequent calls within the cache window return immediately. A benchmark that evaluates the same schema on back-to-back calls therefore measures warm-cache latency, not compilation latency, and misses the real first-call cost.
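A benchmark can defeat the cache by making each run's schema unique. The sketch below assumes the cache is keyed on schema content (an assumption consistent with the behavior described above); the unique dummy property marginally changes the grammar, which is the point:

```python
import copy
import uuid

def cache_busted(schema: dict) -> dict:
    """Return a copy of `schema` with a unique dummy property so each
    benchmark run triggers a fresh grammar compilation instead of
    hitting the 24-hour compile cache. A benchmarking trick, not an API.
    """
    variant = copy.deepcopy(schema)
    nonce = f"benchmark_nonce_{uuid.uuid4().hex[:8]}"
    variant.setdefault("properties", {})[nonce] = {"type": "string"}
    return variant
```

Comparing timings of `cache_busted` variants against repeated identical schemas separates cold-compile cost from warm-cache latency.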

### OpenAI: JSON mode is legacy and never enforced schemas

> **Watch out:** combining high `temperature` with `logitBias` in the same request may trigger silent JSON truncation. The model stops generating before the schema is complete — no API error, just shortened output. Remove `logitBias` when using structured output. (This failure mode is community-observed; not currently documented in OpenAI's official reference.)

OpenAI JSON mode enforces syntactically valid JSON. It does not enforce schema adherence. Wrong types, missing required keys, and invented keys all pass without error. This was not a temporary gap — JSON mode has always been a syntax-only guarantee. OpenAI describes Structured Outputs as "the evolution of JSON mode" and recommends it over JSON mode when possible; the production path is Structured Outputs with `strict:true`, which applies token-level masking against the schema ([OpenAI Structured Outputs reference](https://developers.openai.com/api/docs/guides/structured-outputs)).

Builders migrating from JSON mode to `strict:true` may discover that schemas which were silently accepted by JSON mode now trigger parameter-conflict failures or hit complexity limits they were unaware of.
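A minimal request body for the `strict:true` path, shown as a plain dict without making the API call. The `response_format` shape follows the Structured Outputs guide cited above; the model name and field names are illustrative. Note that strict mode requires every property to appear in `required` and `additionalProperties` to be pinned to `false`:

```python
# Schema restrictions under strict mode: all properties required,
# additionalProperties explicitly false.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
    "additionalProperties": False,
}

# Chat Completions request body for Structured Outputs (not JSON mode).
request_body = {
    "model": "gpt-4o",  # illustrative model choice
    "messages": [{"role": "user", "content": "Extract the book metadata."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "book", "strict": True, "schema": schema},
    },
}
```

Sending this same schema through legacy JSON mode would not enforce the types or the `required` list at all; under `strict:true` it is token-masked.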

### Gemini: property ordering sensitivity and SDK/API divergence

Gemini's `responseSchema` compliance is model-capability-dependent — required fields with insufficient context produce hallucinated values rather than null or an error. The documented mitigation is marking fields `nullable: true` when the model may not have context to fill them ([Google Developer Blog — Mastering Controlled Generation with Gemini 1.5](https://developers.googleblog.com/en/mastering-controlled-generation-with-gemini-15-schema-adherence/)).

The undocumented failure: if the order of properties in the prompt differs from the order of properties in `responseSchema`, Gemini produces output where properties appear in the wrong order, required values are missing or hallucinated, or field names don't match the schema. The output is valid JSON — it parses without error — but does not conform to `responseSchema`. The failure is ordering-dependent and deterministic: same prompt structure, same violation, every run.

SDK/API divergence: the `additionalProperties` field has been accepted by the Gemini API since November 2025, but the official Python SDK (`python-genai`) still rejects it at the client layer. [googleapis/python-genai issue #1815](https://github.com/googleapis/python-genai/issues/1815) was closed as "not planned," so no SDK fix is coming. The workaround is `response_json_schema` instead of `response_schema`, which bypasses SDK validation but loses the Pydantic integration ergonomics.
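A sketch of the workaround, expressed as a plain config mapping rather than SDK types; the exact parameter spelling in your SDK version is an assumption to verify against the issue thread:

```python
# Raw JSON Schema passed via response_json_schema bypasses the SDK's
# client-side validation, which still rejects additionalProperties.
raw_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    # Accepted by the Gemini API since Nov 2025; rejected by the SDK's
    # response_schema validation layer.
    "additionalProperties": False,
}

generation_config = {
    "response_mime_type": "application/json",
    "response_json_schema": raw_schema,  # instead of response_schema
}
```

The cost, as noted above, is losing Pydantic-model ergonomics: you hand-maintain the raw schema dict.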

For Gemini 2.0+, property ordering is no longer an implicit sensitivity but an explicit requirement: Gemini 2.0 requires a `propertyOrdering` list in the JSON input to define preferred structure. Omitting it produces unordered output, not a documented error.
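For Gemini 2.0+, that means shipping the ordering explicitly. A hedged sketch of a response schema carrying `propertyOrdering` (field names illustrative; the camelCase key is as described above for raw JSON input):

```python
# Explicit propertyOrdering matching the declared properties, so output
# field order is defined rather than left to the model.
response_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "country": {"type": "string"},
        "population": {"type": "integer"},
    },
    "propertyOrdering": ["city", "country", "population"],
}
```

Keeping the `propertyOrdering` list mechanically derived from `properties` (rather than hand-typed) avoids silent drift between the two.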

### Outlines / vLLM: FSM compilation is synchronous and error-prone

A January 2026 benchmark ([arXiv 2501.10868](https://arxiv.org/html/2501.10868v1)) documented 42 compilation errors across tested grammar categories for Outlines. The vLLM engineering blog characterizes the FSM engine as one that "occasionally crashes the engine" under complex grammars; complex Pydantic models produce "poorly constructed regex" that is often unviable in practice ([vLLM blog — Structured Decoding Introduction](https://vllm.ai/blog/struct-decode-intro)).

The specific production failure for batch workloads: Outlines' FSM compilation is synchronous. A single complex schema blocks the entire batch. Under concurrent load, one slow schema stalls all parallel requests.
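The mitigation for batch workloads is to move compilation out of the request path. A backend-agnostic sketch: `compile_fn` stands in for whatever constructor your backend exposes (e.g. an Outlines JSON generator), and the 1-second threshold is an arbitrary assumption:

```python
import json
import time

def precompile_all(compile_fn, schemas: dict) -> dict:
    """Compile every schema at service startup so a slow grammar compile
    never blocks a live batch. Timing each compile surfaces the schemas
    that would otherwise have stalled concurrent requests."""
    compiled, timings = {}, {}
    for name, schema in schemas.items():
        t0 = time.perf_counter()
        compiled[name] = compile_fn(json.dumps(schema))
        timings[name] = time.perf_counter() - t0
    slow = {n: round(t, 3) for n, t in timings.items() if t > 1.0}
    if slow:
        print(f"warning: slow schema compiles at startup: {slow}")
    return compiled
```

Run this before accepting traffic; a schema that blocks here is a schema that would have blocked a production batch.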

XGrammar, using pushdown automata, replaced Outlines as vLLM's default structured decoding backend; Outlines is now the fallback. XGrammar enables batch compilation and supports recursive grammars. That Outlines is retained as a fallback indicates XGrammar's coverage is incomplete: some schemas still require the FSM path.

The trade-off cuts both ways: the January 2026 benchmark shows XGrammar had only 3 compilation errors versus Outlines' 42, but XGrammar produced 38 under-constrained outputs (cases where the grammar was accepted yet the output was not fully constrained by the schema), against 8 for Outlines. Higher compilation success with XGrammar does not mean tighter schema enforcement across the board.

## What to do instead

**For Anthropic:** stay under the documented hard limits with margin — treat 20 strict tools as 15 in practice. If you hit `"Schema is too complex for compilation"`, split tools by extracting optional params into a separate tool call. Benchmark schema compilation with cache-busting (unique schema variant per run, or wait >24 hours) to measure real first-call cost before production deployment.

**For OpenAI:** use `strict:true` Structured Outputs, not JSON mode. Before migrating existing pipelines, test each schema against the Structured Outputs endpoint explicitly — JSON mode silently absorbed failures that `strict:true` will surface as errors. Do not combine high temperature with `logitBias` when using structured output.

**For Gemini:** align prompt property order to `responseSchema` property order; on Gemini 2.0+, set `propertyOrdering` explicitly. Mark fields `nullable: true` for any field the model may lack context to fill. If using `additionalProperties`, use `response_json_schema` instead of `response_schema` to bypass SDK validation — [issue #1815](https://github.com/googleapis/python-genai/issues/1815) was closed as "not planned."

**For vLLM / open-source:** use XGrammar (default in current vLLM). If falling back to Outlines for schema coverage reasons, test compilation at startup rather than at request time — identify blocking schemas before they reach production. Do not use Outlines for batch workloads without an explicit compilation queue.

**Universal:** test schemas at complexity boundaries — nested objects, large union counts, optional field density — under the same token budget constraints as production requests. Retry logic does not fix compilation failures or ordering sensitivity. Schema redesign is the only recovery path.
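Boundary probing can be automated by generating schemas of increasing complexity. A sketch that sweeps union counts around a documented limit (the 16-union figure from the Anthropic section; field names are illustrative):

```python
def union_schema(n_variants: int) -> dict:
    """Schema whose `payload` field is an anyOf over n_variants object
    shapes: a probe for union-count compilation thresholds."""
    return {
        "type": "object",
        "properties": {
            "payload": {
                "anyOf": [
                    {
                        "type": "object",
                        "properties": {f"field_{i}": {"type": "string"}},
                        "required": [f"field_{i}"],
                        "additionalProperties": False,
                    }
                    for i in range(n_variants)
                ]
            }
        },
        "required": ["payload"],
        "additionalProperties": False,
    }

# Sweep around the documented boundary and log which sizes still compile.
probes = [union_schema(n) for n in (14, 16, 18)]
```

The same generator pattern extends to nesting depth and optional-field density; the point is to find your provider's inflection point before production traffic does.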

## Evidence

| Tool | Version | Method | What was observed |
|---|---|---|---|
| [Anthropic Claude API](https://platform.claude.com/docs/en/build-with-claude/structured-outputs) | reviewed 2026-04-21 | docs-reviewed | Hard limits enforced: 180s compilation timeout, max 20 strict tools, max 24 optional params, max 16 union types. Exceeding any returns `"Schema is too complex for compilation"` with no indication of which limit was hit. |
| [OpenAI API](https://developers.openai.com/api/docs/guides/structured-outputs) | reviewed 2026-04-21 | docs-reviewed | JSON mode enforces JSON syntax only — wrong types, missing keys, and invented keys pass silently. `strict:true` is the production path; OpenAI describes Structured Outputs as "the evolution of JSON mode." |
| [Gemini API / python-genai](https://github.com/googleapis/python-genai) | python-genai (2025-11-onwards) | source-reviewed | Python SDK client rejects `additionalProperties` at the validation layer; Gemini API has accepted the field since November 2025. [Issue #1815](https://github.com/googleapis/python-genai/issues/1815) closed as "not planned" — workaround: use `response_json_schema`. |
| [Gemini API](https://developers.googleblog.com/en/mastering-controlled-generation-with-gemini-15-schema-adherence/) | Gemini 1.5 (2024) | docs-reviewed | Property ordering sensitivity documented in Google engineering blog: output does not conform to `responseSchema` when prompt property order diverges from schema property order. |
| [Outlines (dottxt-ai)](https://github.com/dottxt-ai/outlines) | January 2026 benchmark | independently-confirmed | 42 compilation errors documented across tested grammar categories in [arXiv 2501.10868](https://arxiv.org/html/2501.10868v1). Synchronous FSM compilation confirmed in vLLM engineering blog. |
| [vLLM](https://vllm.ai/blog/struct-decode-intro) | 2026 | independently-confirmed | vLLM engineering blog: FSM engine "occasionally crashes the engine" under complex grammars; XGrammar adopted as default backend, Outlines retained as fallback. |

## Confidence and gaps

**Falsification criterion:** This finding would be disproved by any provider publishing documented complexity thresholds and failure rates that match the behavior described above, OR by demonstrating that the Anthropic compilation limits, Gemini ordering sensitivity, or Outlines FSM errors described here are version-specific and resolved in current releases.

**Confidence:** secondary-research — no claims were reproduced by execution in Theory Delta's environment. Evidence sources are: Anthropic documentation (docs-reviewed), OpenAI documentation (docs-reviewed), Google engineering blog (docs-reviewed), a third-party GitHub issue (source-reviewed, now closed as "not planned"), a peer-reviewed benchmark paper (independently-confirmed), and a vLLM engineering blog post (independently-confirmed). The block carries `staleness_risk: high` — provider limits and defaults shift frequently. Anthropic, OpenAI, and Gemini have each changed structured generation behavior within the 6 months prior to publication.

**Strongest case against:** Several claims may already be obsolete. Anthropic, OpenAI, and Gemini release frequently and do not version their structured generation implementations in a way that makes "which version has this bug" easy to determine. The Gemini property ordering claim is sourced from a 2024 blog post against Gemini 1.5; current Gemini 2.x behavior may differ. The Outlines compilation error count is from a January 2026 benchmark snapshot.

**Open questions:** Do the Anthropic hard limits apply identically to all Claude model versions, or are they implementation-dependent? Does the Gemini property ordering sensitivity persist in Gemini 2.5+ or is it addressed by the explicit `propertyOrdering` requirement introduced in 2.0? Which grammar patterns still fall back to Outlines in XGrammar's current coverage, and does the under-constrained output rate differ across schema types? Has any provider added failure rate telemetry to their structured output APIs?

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — the theory delta is what makes this knowledge base work.
