---
source_block: agent-memory-benchmarks-2026.md
canonical_url: https://api.theorydelta.com/published/compression-beats-graph-memory
published: 2026-04-27
last_verified: 2026-04-27
confidence: secondary-research
staleness_risk: high
environments_tested:
  - tool: "mem0ai/mem0"
    version: "v1.x (Mar 2026)"
    evidence_type: source-reviewed
    result: "Vector + graph; ~49% on LongMemEval (independent test); graph layer hardcoded to OpenAI in the TypeScript SDK; graph hard-DELETE was patched to soft-delete via PR #4188 (Mar 2026)"
  - tool: "getzep/graphiti (Zep)"
    version: "Feb 2026 (now mcp-v1.0.2)"
    evidence_type: source-reviewed
    result: "Real temporal knowledge graph; Zep scores 71.2% on the Mastra LongMemEval leaderboard (gpt-4o); user-reported async event loop conflict in self-hosted async stacks (no public tracking issue)"
  - tool: "Mastra Observational Memory"
    version: "Feb 2026"
    evidence_type: source-reviewed
    result: "Compression-based; 84.23% on LongMemEval (gpt-4o), 94.87% (gpt-5-mini); 3-6x text compression, no vector DB, no graph store, no embeddings"
  - tool: "OMEGA"
    version: "Mar 2026"
    evidence_type: source-reviewed
    result: "Self-reported 95.4% on LongMemEval (gpt-5-mini) — highest reported as of March 2026; SQLite + sqlite-vec + FTS5; single-developer project, score not independently reproduced"
  - tool: "Claude Code / Windsurf / Cursor / claude-mem"
    version: "Apr 2026"
    evidence_type: source-reviewed
    result: "All four coding agents converged on compression/injection as their memory pattern — none use a graph store"
theory_delta: Compression-based agent memory beats graph-based agent memory on the LongMemEval benchmark on the same model, and four independently-built coding agents converged on the same compression pattern.
a2a_card:
  type: finding
  topic: agent-memory
  claim: Compression-based agent memory (Mastra OM, OMEGA, claude-mem) outperforms graph-based memory (Zep/Graphiti) on the Mastra LongMemEval leaderboard by 13 percentage points on the same model (gpt-4o), and the gap persists after mem0's hard-delete bug was patched.
  confidence: secondary-research
  action: pick compression over graph for agent memory; do not pay the operational cost of a graph store unless you have a falsifiable use case it solves
  contribute: /api/signals
rubric:
  total_claims: 12
  tested_count: 0
  independently_confirmed: true
  unlinked_count: 0
  scope_matches: true
  falsification_stated: true
  content_type: finding
tasks:
  - task: rag-pipeline
    phase: pick-stack
---

# You picked vector vs graph for agent memory — the empirical answer is neither, pick compression

*From [Theory Delta](https://theorydelta.com) | [Methodology](https://theorydelta.com/methodology/) | Published 2026-04-27*

You are choosing an agent memory layer. Mem0 ([~48K stars](https://github.com/mem0ai/mem0)) markets vector + graph as a hybrid. Zep/Graphiti markets temporal knowledge graphs as the answer to "facts that change over time." The framing the ecosystem hands you is **vector vs graph**: pick vector for semantic recall, pick graph for temporal reasoning.

The benchmarks contradict that framing. The architecture that wins is in neither category.

## What you expect

Vector memory (mem0) will get you broad semantic recall but struggles with relational structure. Graph memory (mem0 graph mode, Zep/Graphiti) will get you temporal reasoning — knowing that "user preferred Python" was superseded by "user now prefers Rust" — at the cost of operational complexity. The choice is a trade-off between recall quality and infrastructure overhead.

## What actually happens

**On the same model, compression-based memory beats graph-based memory by 13 points.** Scores below are from the [Mastra research page](https://mastra.ai/research/observational-memory) LongMemEval leaderboard. Same model (gpt-4o), 500-question LongMemEval set, architecture-only comparison:

| System | Architecture | LongMemEval (gpt-4o) |
|--------|-------------|----------------------|
| Mastra Observational Memory | Compression (Observer/Reflector) | 84.23% |
| gpt-4o Oracle (cheat upper bound — relevant 1–3 sessions only) | Filtered context | 82.40% |
| Supermemory | Memory graph + RAG | 81.60% |
| Zep | Temporal knowledge graph | 71.20% |
| gpt-4o full context (all ~50 sessions stuffed in) | None — raw context | 60.20% |
| Mem0 (independent test, not on Mastra leaderboard) | Vector + graph | ~49% |

Two baselines matter here. **Oracle** (82.4%) is a cheat configuration — it filters the input down to only the 1–3 conversations that contain the answer. It's an upper bound; you cannot run an oracle in production without already knowing the answer. **Full context** (60.2%) is the realistic baseline — stuff all ~50 conversations in and let the model figure it out. These are different scores measuring different things; collapsing them is the kind of error that makes graph memory look more competitive than it actually is.

The picture that survives the same-model comparison: Mastra OM (compression) beats both the oracle and full context. Zep (temporal KG) beats full context by 11 points but loses to compression by 13. Mem0 underperforms even the full-context baseline. Cross-model results in the [agent-memory-benchmarks-2026 block](https://github.com/roryford/theorydelta-blocks/blob/master/blocks/agent-memory-benchmarks-2026.md) widen the gap further (Mastra OM reaches 94.87% on [gpt-5-mini](https://platform.openai.com/docs/models/gpt-5) — a newer 2026-class OpenAI model; OMEGA reports 95.4% on the same model, self-reported and not independently reproduced). Scores across different models are not directly comparable, but the same-model gap above is.

**Mem0's graph layer used to defeat the temporal-reasoning promise; a fix has since shipped.** The Mem0 paper described soft-delete with temporal reasoning — old facts marked superseded but retained for temporal queries. The implementation did a destructive delete instead. PR #4188 (merged 2026-03-21) switched the graph delete path to `r.valid = false, r.invalidated_at = datetime()`. ([mem0 PR #4188](https://github.com/mem0ai/mem0/pull/4188) closed the underlying [mem0 Issue #4187](https://github.com/mem0ai/mem0/issues/4187).)
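
To make the semantics concrete, here is a minimal sketch of the difference at the Cypher level, via the official `neo4j` Python driver. The `User`/`Fact` labels, the `PREFERS` relationship, and the queries are hypothetical illustrations of the pattern, not mem0's actual schema or code:

```python
# Sketch of the delete-semantics difference via the official neo4j
# Python driver. The User/Fact labels and PREFERS relationship are
# hypothetical -- this is the shape of the change, not mem0's schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Pre-patch behavior: destructive delete. The superseded fact is gone,
# so "what did the user used to prefer?" becomes unanswerable.
HARD_DELETE = """
MATCH (:User {id: $user_id})-[r:PREFERS]->(:Fact {text: $old_fact})
DELETE r
"""

# Post-PR #4188 behavior: soft-delete. The edge survives, flagged
# invalid with a timestamp, so temporal queries still work.
SOFT_DELETE = """
MATCH (:User {id: $user_id})-[r:PREFERS]->(:Fact {text: $old_fact})
SET r.valid = false, r.invalidated_at = datetime()
"""

# A temporal query that only the soft-delete path can answer:
SUPERSEDED = """
MATCH (:User {id: $user_id})-[r:PREFERS]->(f:Fact)
WHERE r.valid = false
RETURN f.text AS fact, r.invalidated_at AS invalidated_at
"""

with driver.session() as session:
    session.run(SOFT_DELETE, user_id="u1", old_fact="prefers Python")
    for record in session.run(SUPERSEDED, user_id="u1"):
        print(record["fact"], record["invalidated_at"])
```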

The benchmark gap above predates that patch and was not closed by it. Mastra OM scored 84.23% on gpt-4o in February 2026; the patch landed five weeks later. A delete-semantics fix would not lift mem0's ~49% LongMemEval score onto compression's curve — the underperformance sits in extraction and retrieval, not in delete behavior.

**Mem0's TypeScript SDK has graph features locked to OpenAI.** [Issue #3711](https://github.com/mem0ai/mem0/issues/3711) (label: `sdk-typescript`) documents `MemoryGraph.structuredLlm` hardcoded to `"openai_structured"` in the TypeScript SDK, with Anthropic, Groq, and other providers failing on the graph pipeline. The issue was closed as duplicate in March 2026; the underlying fix may track under another issue.

The Python SDK's status on the same constraint has not been independently verified — re-test against your provider before relying on graph memory with a non-OpenAI model. The cloud product (Mem0g) requires a $249/month Pro tier, also OpenAI-only.

**Self-hosted Graphiti has a user-reported async event loop conflict.** Embedding `graphiti-core` directly in FastAPI or LangGraph — the most common production Python agent stack — has been reported to produce `RuntimeError: Future attached to a different loop` under real async load. (User-reported; no public tracking issue. The reported workaround is to run `graphiti-core` in its own subprocess with HTTP/queue communication, as in the sketch below.) The failure mode is not documented in Graphiti's README. ([covered in detail in graph-memory-self-hosted-not-production-ready](https://theorydelta.com/findings/graph-memory-self-hosted-not-production-ready/))
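
A minimal sketch of that workaround: the child process owns its own event loop, and the parent app only touches queues. The `Graphiti` constructor and `add_episode` keyword arguments follow graphiti-core's published quickstart at the time of writing; verify them against the version you deploy:

```python
# Sketch of the reported workaround: isolate graphiti-core in a child
# process that owns its own event loop, and communicate over queues.
# The Graphiti constructor and add_episode kwargs follow graphiti-core's
# published quickstart -- verify against the version you deploy.
import asyncio
import multiprocessing as mp
from datetime import datetime, timezone


def graphiti_worker(jobs: mp.Queue, results: mp.Queue) -> None:
    """Runs in the child process. asyncio.run() creates a fresh loop
    here, so no Future is ever attached to the parent app's loop."""
    from graphiti_core import Graphiti  # imported inside the child only

    async def serve() -> None:
        client = Graphiti("bolt://localhost:7687", "neo4j", "password")
        while True:
            # Blocking queue read moved off the loop thread:
            episode = await asyncio.to_thread(jobs.get)
            if episode is None:  # shutdown sentinel
                break
            await client.add_episode(
                name=episode["name"],
                episode_body=episode["body"],
                source_description="agent session",
                reference_time=datetime.now(timezone.utc),
            )
            results.put({"ok": True, "name": episode["name"]})

    asyncio.run(serve())


if __name__ == "__main__":
    jobs, results = mp.Queue(), mp.Queue()
    worker = mp.Process(target=graphiti_worker, args=(jobs, results))
    worker.start()
    # The parent (your FastAPI/LangGraph app) never imports graphiti --
    # it only puts jobs on the queue and reads acknowledgements back.
    jobs.put({"name": "session-42", "body": "user now prefers Rust"})
    print(results.get())
    jobs.put(None)
    worker.join()
```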

**Coding agents have already voted with their architectures.** All four shipping coding agents surveyed use compression-based memory. The table below lists architectural patterns only — each agent runs on a different LLM, so this is a convergence observation, not a same-model benchmark:

| Agent | Memory pattern | Graph store? |
|-------|----------------|--------------|
| Claude Code | CLAUDE.md + MEMORY.md (flat files, 200-line load limit) | No |
| Windsurf | Auto-generated memories from ~48 hours of codebase analysis | No |
| Cursor | Rules files + context injection | No |
| claude-mem ([~26K stars](https://github.com/thedotmack/claude-mem)) | Session compression + injection | No |

Four independently developed coding agents — built by teams that compete on agent memory quality — converged on the same architectural pattern: compress past sessions, inject relevant fragments into the next context. None of them use a graph store. None of them use a vector DB. ([source: agent-memory-landscape](https://github.com/roryford/theorydelta-blocks/blob/master/blocks/agent-memory-landscape.md))

**Token efficiency points the same direction.** Zep's temporal-KG retrieval uses ~1.6K tokens of context to score 71.2% on LongMemEval (gpt-4o), versus the full ~115K-token context baseline at 60.2%. Both numbers come from the same [Mastra leaderboard](https://mastra.ai/research/observational-memory) above. The lesson is not that graphs are good — it is that compressing the right context beats stuffing all the context. Compression architectures generalize this without paying for a graph store.

## What this means for you

**If you picked Mem0 for "vector + graph hybrid":** With a non-OpenAI provider in the TypeScript SDK, the graph pipeline returns HTTP 401 — graph mode is effectively off. With OpenAI, the graph layer now soft-deletes (the hard-delete defect was fixed in March 2026), but mem0's overall LongMemEval score (~49%) is well below the compression baseline. You are paying the operational cost of running Neo4j to get a vector store with worse semantics than treating the conversation as a single context.

**If you picked Zep/Graphiti for "real" temporal reasoning:** Self-hosted has a user-reported async event loop bug that surfaces in production, not in development. Local tests pass; the production deploy degrades under concurrent async load. Zep Cloud avoids this but requires vendor dependency and silently changed its default OpenAI model from gpt-4o to gpt-4o-mini in v0.27.1 — pin your model explicitly or accept silent quality regression on upgrade.

**If you are choosing now:** The vector-vs-graph dichotomy is a category error. The empirically validated answer is compression: maintain a per-session digest, inject relevant fragments into the next agent context, do not run a graph store. Mastra OM, OMEGA, Hindsight, and the entire coding-agent cohort are independent confirmations of the same pattern.

**If your use case actually requires temporal reasoning** (legal, audit, "what did the user used to believe"): Mem0's graph delete is no longer a hard wipe (PR #4188), but no shipping OSS implementation has been independently benchmarked on the temporal-reasoning sub-task. Graphiti has the data model but the operational hazard. Zep Cloud is the most plausible path, with a vendor commitment and a model-pinning gotcha.

## What to do

**For most agent memory use cases:** Use compression. Mastra OM if you are in TypeScript and want a framework. claude-mem or session-injection patterns if you are building on Claude Code. The compression target is a per-session digest the next agent reads — not a vector index, not a graph.
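
A minimal sketch of that pattern, not Mastra OM's or claude-mem's actual implementation: the file path, prompt wording, and model choice are illustrative, and the only storage is a flat file:

```python
# Minimal sketch of the compress-and-inject pattern the coding-agent
# cohort converged on. Not Mastra OM's or claude-mem's code -- just the
# shape: compress each finished session into a digest, inject the
# digests into the next session's context. No vector DB, no graph.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
DIGESTS = Path("memory/digests.md")  # flat file; path is illustrative


def compress_session(transcript: str) -> None:
    """Observer step: reduce a finished session to a dense digest."""
    digest = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Compress this session into dense bullet points: stable "
                "facts, preferences, open tasks. Note superseded facts "
                "explicitly (e.g. 'preferred Python, now prefers Rust')."
            )},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content
    DIGESTS.parent.mkdir(parents=True, exist_ok=True)
    with DIGESTS.open("a") as f:
        f.write(f"\n## Session digest\n{digest}\n")


def next_session_context(user_message: str) -> list[dict]:
    """Injection step: the next agent reads digests, not raw logs."""
    memory = DIGESTS.read_text() if DIGESTS.exists() else ""
    return [
        {"role": "system", "content": f"Prior-session memory:\n{memory}"},
        {"role": "user", "content": user_message},
    ]
```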

**If you are committed to a retrieval-based architecture:** Use Mem0 self-hosted in vector-only mode against Qdrant or PgVector. It works with any LLM provider in vector mode. Do not enable graph features on the TypeScript SDK with a non-OpenAI provider until you have re-verified the fix status against your provider; on the Python SDK, test before relying on it.
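
A hedged config sketch of that setup; the keys follow mem0's documented Python config shape at the time of writing, so verify them against the version you install. The load-bearing detail is what is absent: no `graph_store` key, so no Neo4j and no graph pipeline:

```python
# Hedged sketch: mem0 self-hosted, vector-only, against Qdrant. Config
# keys follow mem0's documented Python config shape -- verify against
# the version you install. No "graph_store" key = graph mode stays off.
from mem0 import Memory

config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "localhost", "port": 6333},
    },
    # Vector-only mode works with non-OpenAI providers; the graph
    # pipeline is where the OpenAI lock-in was reported (Issue #3711).
    "llm": {
        "provider": "anthropic",
        "config": {"model": "claude-3-5-sonnet-latest"},
    },
}

memory = Memory.from_config(config)
memory.add("User now prefers Rust over Python", user_id="u1")
print(memory.search("language preference", user_id="u1"))
```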

**If temporal reasoning is genuinely a requirement:** Use Zep Cloud, accept the vendor dependency, and pin your OpenAI model explicitly to avoid silent quality regression. Do not embed graphiti-core directly in FastAPI or LangGraph — run it in a subprocess. Verify before relying on temporal queries that your chosen implementation soft-deletes.
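
A sketch of pinning the model at the graphiti layer; the import path and `LLMConfig` shape are assumed from graphiti-core's docs and may differ in your version. For Zep Cloud, set the equivalent in the SDK or console rather than inheriting the server default:

```python
# Sketch: pin the LLM explicitly rather than inheriting the default.
# Import path and LLMConfig shape are assumed from graphiti-core's docs
# at the time of writing -- verify against the version you deploy. The
# Zep server default changed silently in v0.27.1; do not rely on it.
from graphiti_core import Graphiti
from graphiti_core.llm_client import OpenAIClient
from graphiti_core.llm_client.config import LLMConfig

graphiti = Graphiti(
    "bolt://localhost:7687",
    "neo4j",
    "password",
    llm_client=OpenAIClient(
        config=LLMConfig(model="gpt-4o")  # explicit pin, not the default
    ),
)
```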

**Do not pick based on stars.** Mem0 is the [star leader](https://github.com/mem0ai/mem0) and scores ~49% on LongMemEval — below the 60.2% baseline of just dumping the whole conversation into a gpt-4o context window. Star count measures awareness; benchmark scores measure capability; convergent architectures across independent teams measure what works.

The evidence table below collects scores reported by each project; some entries are across models (gpt-4o, gpt-5-mini) and not directly comparable, which is called out explicitly in each row.

## Evidence

| Claim | Source | Verified |
|-------|--------|----------|
| Mastra OM 84.23% on LongMemEval (gpt-4o) | [Mastra LongMemEval leaderboard](https://mastra.ai/research/observational-memory) | 2026-04-27 |
| Zep 71.20% on LongMemEval (gpt-4o) | [Mastra LongMemEval leaderboard](https://mastra.ai/research/observational-memory) — same source as Mastra OM and the baselines below, ensuring same-model comparison | 2026-04-27 |
| gpt-4o Oracle 82.40% (filtered to relevant 1–3 sessions only — upper-bound cheat) | [Mastra LongMemEval leaderboard](https://mastra.ai/research/observational-memory) | 2026-04-27 |
| gpt-4o Full context 60.20% (all ~50 sessions stuffed in — realistic baseline) | [Mastra LongMemEval leaderboard](https://mastra.ai/research/observational-memory) | 2026-04-27 |
| Mem0 ~49% on LongMemEval (independent test, not on Mastra leaderboard) | LongMemEval independent reproductions; vectorize.io and dev.to comparisons. **Note:** this score comes from a different evaluation run than the Mastra leaderboard rows above; treat the cross-source comparison as directional, not exact | 2026-03-21 |
| OMEGA 95.4% (self-reported, gpt-5-mini) | OMEGA project documentation; single-developer, not independently reproduced; cross-model and not directly comparable to gpt-4o rows | 2026-03-21 |
| Mem0 graph delete-semantics patched (soft-delete via PR #4188) | [Mem0 Issue #4187](https://github.com/mem0ai/mem0/issues/4187) (closed-completed 2026-03-21) | 2026-04-27 |
| Mem0 graph in TypeScript SDK locked to OpenAI provider | [Mem0 Issue #3711](https://github.com/mem0ai/mem0/issues/3711) (label `sdk-typescript`; closed as duplicate Mar 2026; Python SDK status unverified) | 2026-04-27 |
| Graphiti async event loop conflict | User-reported, no public tracking issue; covered in [graph-memory-self-hosted-not-production-ready](https://theorydelta.com/findings/graph-memory-self-hosted-not-production-ready/) | 2026-04-20 |
| Coding agents converged on compression | Claude Code (CLAUDE.md/MEMORY.md), Windsurf (auto-memories), Cursor (rules), claude-mem (session compression) | 2026-04-25 |

**Confidence:** secondary-research — based on public benchmark leaderboards, vendor-reported scores, and source-reviewed GitHub issues. LongMemEval scores are vendor-reported; OMEGA's 95.4% claim has not been independently reproduced. The ~49% mem0 score comes from independent reproductions but is itself secondary to those reports.

**Open questions (Apr 2026):** Now that mem0 graph soft-deletes (post-PR #4188), does its LongMemEval score change materially? Has the Python SDK been verified for non-OpenAI graph providers, or only the TypeScript SDK? Is there a public benchmark on which graph-based memory beats compression-based memory on the same model? Does Graphiti's mcp-v1.0.2 release fix the user-reported async event loop conflict?

**Falsification criterion:** A LongMemEval, LoCoMo, or MemoryBench result where a graph-based memory tool (mem0 graph mode, Zep/Graphiti, Cognee) beats Mastra OM, OMEGA, or another compression-based system on the same model and dataset would disprove this finding; an independent benchmark of mem0 post-PR #4188 showing graph memory decisively beats compression for temporal-reasoning queries would partially falsify it.

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/)
