---
source_block: agentic-rag-pipeline.md
canonical_url: https://api.theorydelta.com/published/langgraph-checkpoint-serialization-silent-loss
published: 2026-03-29
last_verified: 2026-03-01
confidence: empirical
environments_tested:
  - tool: "LangGraph (langchain-ai)"
    version: "v1.0.10"
    result: "source-reviewed: JsonPlusSerializer replaces deserialized values with None on failure — no exception raised (bug #6970, open)"
  - tool: "LangGraph (langchain-ai)"
    version: "v1.0.10"
    result: "source-reviewed: StrEnum values silently coerced to plain str after checkpoint round-trip — type information lost (bug #6598)"
  - tool: "Microsoft GraphRAG"
    version: "v3.0.5"
    result: "source-reviewed: v3 pipeline is extremely slow compared to v2 after NetworkX removal; regression unresolved (issue #2250, open)"
  - tool: "LangGraph (langchain-ai)"
    version: "v1.0.10"
    result: "source-reviewed: get_state().next returns empty tuple after resuming from first of two interrupt() calls — graph paused but snapshot reports complete (bug #6956, open)"
theory_delta: "GraphRAG v3 (Jan 2026) trades NetworkX for a DataFrame-based pipeline and ships a performance regression vs v2 (issue #2250). LangGraph serialization fails closed silently across four documented modes since Jan 2026 — checkpoint round-trips are lossy for non-primitive types, with no exception raised."
a2a_card:
  type: finding
  topic: agentic-rag-pipeline
  claim: "LangGraph checkpoint round-trips are lossy for non-primitive types — four distinct silent failure modes (JsonPlusSerializer null-on-failure, StrEnum→str coercion, nested Enum→None, BinaryOperatorAggregate wrapper leak) corrupt state without raising exceptions, making stateful RAG pipelines unreliable."
  confidence: empirical
  action: test
  contribute: /api/signals
rubric:
  total_claims: 6
  tested_count: 0
  independently_confirmed: true
  unlinked_count: 0
  scope_matches: true
  falsification_stated: true
  content_type: finding
---

# LangGraph checkpoint round-trips silently corrupt non-primitive types — four distinct confirmed modes

*From [Theory Delta](https://theorydelta.com) | [Methodology](https://theorydelta.com/methodology/) | Published 2026-03-29*

## What the docs say

LangGraph documentation presents checkpointing as a reliable mechanism for persisting and resuming stateful agent workflows. The checkpoint/resume pattern is the foundation of LangGraph's human-in-the-loop features and is used extensively in production agentic RAG pipelines that store Pydantic models, Enums, or custom classes in graph state.

## What actually happens

LangGraph checkpoint round-trips are lossy for non-primitive types. Four distinct silent failure modes have been confirmed since January 2026 in open bugs, all affecting LangGraph v1.0.10:

**1. JsonPlusSerializer null-on-failure ([bug #6970](https://github.com/langchain-ai/langgraph/issues/6970), open as of 2026-02-28):** When deserialization fails, `JsonPlusSerializer` replaces the failed value with `None` instead of raising an exception. The graph continues with a corrupted state object. The failure is invisible — no warning, no log entry, no exception. This affects any complex type stored in checkpoint state.
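Until the bug is fixed, one defensive pattern is to treat `None` in a required field as a failed deserialization and fail loudly. A minimal stdlib sketch (which fields count as required is up to your own state schema):

```python
def strict_get(values: dict, key: str):
    """Fetch a required checkpoint field, raising instead of propagating a silent None."""
    value = values.get(key)
    if value is None:
        raise ValueError(
            f"checkpoint field {key!r} is None: possible silent "
            "deserialization failure (see langgraph #6970)"
        )
    return value
```

A call like `strict_get(state.values, "retrieved_docs")` after a resume turns a silent `None` into a loud failure (the field name here is hypothetical).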

**2. StrEnum coerced to plain str ([bug #6598](https://github.com/langchain-ai/langgraph/issues/6598), January 2026):** `StrEnum` values silently become plain `str` after a checkpoint round-trip. Type information is lost. Code checking `isinstance(value, MyStrEnum)` will fail silently after a resume. Any state machine logic that routes on enum type (rather than enum value) breaks without error.
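The failure mode can be reproduced with a plain JSON round-trip, which is a stand-in for what a checkpoint cycle does to the value (stdlib only; the enum here is illustrative):

```python
import json
from enum import Enum

class Stage(str, Enum):  # same failure shape applies to StrEnum on Python 3.11+
    RETRIEVE = "retrieve"
    SYNTHESIZE = "synthesize"

state = {"stage": Stage.RETRIEVE}
restored = json.loads(json.dumps(state))  # stand-in for a checkpoint round-trip

assert isinstance(state["stage"], Stage)         # before: enum member
assert not isinstance(restored["stage"], Stage)  # after: plain str
assert restored["stage"] == "retrieve"           # the value survives; the type does not
```

Routing on the enum's value (`value == "retrieve"`) survives the round-trip; routing on its type (`isinstance`) silently stops matching.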

**3. Nested Enum fields become None ([bug #6718](https://github.com/langchain-ai/langgraph/issues/6718), February 2026):** Nested `Enum` fields in checkpoint state deserialize as `None` rather than raising. Like bug #6970, this is silent replacement — the state object looks valid but contains corrupted values.

**4. BinaryOperatorAggregate wrapper leak ([bug #6909](https://github.com/langchain-ai/langgraph/issues/6909), 2026-02-27):** When a channel starts `MISSING`, `BinaryOperatorAggregate` with `Overwrite` returns the wrapper object rather than the unwrapped payload. Downstream code receives a `BinaryOperatorAggregate` instance where it expects the actual state value.
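A generic guard can catch a leaked wrapper at the node boundary. This is a sketch with a stand-in wrapper class; in a real pipeline you would pass the actual `BinaryOperatorAggregate` type, assuming it is importable in your LangGraph version:

```python
def assert_payload(value, wrapper_types: tuple):
    """Raise if a channel handed back its wrapper object instead of the unwrapped payload."""
    if isinstance(value, wrapper_types):
        raise TypeError(
            f"channel returned wrapper {type(value).__name__} instead of a payload "
            "(see langgraph #6909)"
        )
    return value

class FakeAggregate:
    """Stand-in for the real wrapper type, for illustration only."""
    pass
```

Usage: `assert_payload(state.values["scores"], (FakeAggregate,))`, with the real wrapper class substituted for `FakeAggregate` and `"scores"` a hypothetical field.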

These are not edge cases in obscure usage paths. They affect any LangGraph pipeline storing Pydantic models, Enums, or custom classes through checkpointing — which is most production agentic RAG pipelines.

### Interrupt state snapshot bug affects human-in-the-loop workflows

[Bug #6956](https://github.com/langchain-ai/langgraph/issues/6956) (open, 2026-02-27): `get_state().next` returns an empty tuple `()` after resuming from the first of two `interrupt()` calls in the same node. The graph is still paused — but the snapshot reports it as complete. Any code checking `state.next` to determine whether a graph is still running will silently misread a paused graph as finished. Human-in-the-loop workflows with chained interrupts are directly affected. An agent waiting for human approval may receive a "complete" signal and proceed without it.

### Conditional edge routing can corrupt branch selection

[Issues #4968](https://github.com/langchain-ai/langgraph/issues/4968), [#4891](https://github.com/langchain-ai/langgraph/issues/4891), [#4226](https://github.com/langchain-ai/langgraph/issues/4226): a bare string literal placed as an inline docstring inside a Python dict literal used as a conditional edge mapping is implicitly concatenated into the adjacent dictionary key, corrupting the routing key and producing a `KeyError` at runtime during tool routing. Under async streaming, the error may be swallowed entirely. A newer variant ([bug #6770](https://github.com/langchain-ai/langgraph/issues/6770)): `KeyError('__end__')` when a conditional router returns `'__end__'` but `path_map` does not explicitly include an `__end__`/END key. Fix: add `"__end__": "__end__"` to `path_map`.
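Both routing pitfalls reduce to dictionary-key mechanics and can be reproduced with plain dicts. The final lookup below simulates what the framework does internally with `path_map`, which is an assumption for illustration:

```python
def route(state: dict) -> str:
    """Toy conditional router: finish when the state says it is done."""
    return "__end__" if state.get("done") else "tools"

# Pitfall 1: adjacent string literals are implicitly concatenated into one key,
# so this "docstring" silently merges with "tools" and the intended key vanishes.
corrupted = {
    "tools"
    "this string was meant as documentation": "tools",
}
assert "tools" not in corrupted  # routing on "tools" would now raise KeyError

# Pitfall 2: the router may return "__end__", so path_map must map it explicitly
path_map = {"tools": "tools", "__end__": "__end__"}
assert path_map[route({"done": True})] == "__end__"
assert path_map[route({})] == "tools"
```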

### GraphRAG v3 shipped a performance regression vs v2

GraphRAG v3 (January 2026, current: v3.0.5) removed the NetworkX dependency and moved to DataFrame-based graph utilities. [Issue #2250](https://github.com/microsoft/graphrag/issues/2250) (2026-02-26) documents the v3 pipeline as "extremely slow compared to v2." The regression is unresolved in v3.0.5. Teams that benchmarked on v2 must re-benchmark before deploying v3. The v3 restructure also adds opt-in LLM-based entity resolution ([PR #2234](https://github.com/microsoft/graphrag/pull/2234), open) that addresses semantic fragmentation ("Ahab" vs "Captain Ahab") — but the entity type deduplication bug ([issue #1718](https://github.com/microsoft/graphrag/issues/1718), marked fatal, still open) is orthogonal and unresolved. Both problems can coexist.

**This finding would be disproved by:** LangGraph v1.0.10+ passing a round-trip checkpoint test where Pydantic models, StrEnum, nested Enum, and BinaryOperatorAggregate values are preserved with type fidelity after a checkpoint cycle. It would also be disproved for the GraphRAG regression by a benchmark showing v3 matching or exceeding v2 throughput at equivalent corpus size.

## What to do instead

**For LangGraph stateful pipelines:** Treat checkpoint round-trips as lossy for non-primitive types until bugs #6970, #6598, #6718, and #6909 are closed. Add explicit checkpoint validation after every `resume` call:

```python
# After resuming a LangGraph graph
state = graph.get_state(config)

# Validate that critical fields are not None and have the expected types.
# "my_enum" and MyExpectedType are placeholders for your own state schema.
assert state.values.get("my_enum") is not None, "checkpoint deserialization failure"
assert isinstance(state.values["my_enum"], MyExpectedType), \
    f"type corrupted: {type(state.values['my_enum'])}"
```

For state that must survive checkpoint round-trips, prefer primitive types (str, int, dict with primitive values) over Pydantic models and Enums where possible. If Enums are required, serialize them to their `.value` before storing in graph state and reconstruct on read.
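The store-as-value, reconstruct-on-read pattern looks like this (stdlib only; `Phase` is an illustrative enum):

```python
from enum import Enum

class Phase(Enum):
    PLAN = "plan"
    ACT = "act"

# Write: store only the primitive value in graph state
stored = {"phase": Phase.PLAN.value}

# Read: reconstruct explicitly. Phase("bogus") raises ValueError instead of
# silently yielding None, so any corruption fails loudly at the read site.
phase = Phase(stored["phase"])
assert phase is Phase.PLAN
```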

**For human-in-the-loop workflows with chained interrupts:** Do not rely solely on `state.next` to determine if a graph is paused. Track interrupt state explicitly in your application layer until bug #6956 is closed.
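One way to track interrupt state at the application layer is a small bookkeeping object that your interrupt-raising and resume-handling code updates explicitly, so pause detection never depends on `state.next`. All names here are hypothetical:

```python
class InterruptLedger:
    """Application-level record of outstanding interrupts (hypothetical helper)."""

    def __init__(self):
        self._pending: set[str] = set()

    def raised(self, interrupt_id: str) -> None:
        """Record an interrupt just before the graph pauses on it."""
        self._pending.add(interrupt_id)

    def resolved(self, interrupt_id: str) -> None:
        """Clear an interrupt once its resume value has been applied."""
        self._pending.discard(interrupt_id)

    @property
    def paused(self) -> bool:
        return bool(self._pending)
```

Call `raised()` where your node issues `interrupt()` and `resolved()` where you feed the resume value back in; check `ledger.paused` instead of trusting an empty `state.next`.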

**For conditional edge routing:** Do not use inline docstrings inside Python dict literals in edge mappings. Always include an explicit `"__end__": "__end__"` entry in `path_map` for any conditional router that may return `__end__`.

**For GraphRAG v3:** If migrating from v2, benchmark your specific corpus before deploying to production. The regression in issue #2250 is unresolved. If performance is critical and entity resolution quality is acceptable in v2, consider staying on v2 until the regression is addressed.
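A minimal harness for the re-benchmark, where `run_v2_index` and `run_v3_index` are placeholders for however you invoke each pipeline on the same corpus:

```python
import time

def time_run(fn, *args, **kwargs) -> float:
    """Wall-clock one pipeline invocation."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start

# v2_seconds = time_run(run_v2_index, corpus_path)
# v3_seconds = time_run(run_v3_index, corpus_path)
# Deploy v3 only if v3_seconds is within your budget relative to v2_seconds.
```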

**For agentic RAG with step limits:** When `max_agent_steps` triggers mid-retrieval, frameworks return raw tool output — JSON, an API response, a schema — instead of a synthesized answer. Wrap any agentic loop in a handler that detects a step-limit exit and forces one final synthesis call before returning to the user.
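A sketch of that wrap, where `run_agent` and `synthesize_answer` are placeholders for your framework's entry points and the `stop_reason` sentinel is an assumption about what your loop reports:

```python
def answer_with_synthesis(run_agent, synthesize_answer, question: str, max_steps: int = 8):
    """Force a final synthesis pass when the agent loop exits on its step limit."""
    result = run_agent(question, max_steps=max_steps)
    if result.get("stop_reason") == "max_steps":
        # The loop was cut off mid-retrieval: result["raw"] is unsynthesized tool
        # output, not an answer. Run one explicit synthesis call before returning.
        return synthesize_answer(question, result.get("raw", ""))
    return result["answer"]
```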

## Environments tested

| Tool | Version | Result |
|------|---------|--------|
| [LangGraph](https://github.com/langchain-ai/langgraph) | v1.0.10 | source-reviewed: JsonPlusSerializer replaces deserialization failures with None ([#6970](https://github.com/langchain-ai/langgraph/issues/6970), open) |
| [LangGraph](https://github.com/langchain-ai/langgraph) | v1.0.10 | source-reviewed: StrEnum coerced to str after checkpoint round-trip ([#6598](https://github.com/langchain-ai/langgraph/issues/6598)) |
| [LangGraph](https://github.com/langchain-ai/langgraph) | v1.0.10 | source-reviewed: nested Enum fields become None after resume ([#6718](https://github.com/langchain-ai/langgraph/issues/6718)) |
| [LangGraph](https://github.com/langchain-ai/langgraph) | v1.0.10 | source-reviewed: BinaryOperatorAggregate returns wrapper instead of payload ([#6909](https://github.com/langchain-ai/langgraph/issues/6909)) |
| [LangGraph](https://github.com/langchain-ai/langgraph) | v1.0.10 | source-reviewed: get_state().next empty after first of two interrupt() calls ([#6956](https://github.com/langchain-ai/langgraph/issues/6956), open) |
| [Microsoft GraphRAG](https://github.com/microsoft/graphrag) | v3.0.5 | source-reviewed: v3 pipeline extremely slow vs v2 after NetworkX removal ([#2250](https://github.com/microsoft/graphrag/issues/2250), open) |

## Confidence and gaps

**Confidence:** empirical — all four LangGraph serialization bugs are confirmed in open GitHub issues by third-party reporters (not Theory Delta). The GraphRAG performance regression is confirmed in a separate user-filed issue. Not tested by execution in Theory Delta's environment — these are source-reviewed from the respective GitHub issue trackers. The bug status (open vs closed) reflects the state as of 2026-03-01; some may have been addressed in subsequent LangGraph releases.

**Strongest case against:** These bugs may already be fixed in LangGraph versions later than v1.0.10. Open issues do not guarantee unfixed behavior — LangGraph releases frequently. The serialization failures affect specific type patterns; pipelines using only primitive types in checkpoint state are unaffected. The GraphRAG regression may be workload-dependent and could be a benchmark-specific observation rather than universal throughput degradation.

**Open questions:** Which LangGraph version (if any) closes all four serialization bugs? Is there a LangGraph release where checkpoint round-trips can be considered reliable for Pydantic models? Does the GraphRAG v3 performance regression appear for all corpus sizes, or only at specific scale thresholds?

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — theory delta is what makes this knowledge base work.
