---
source_block: ollama-local-inference.md
canonical_url: https://api.theorydelta.com/published/ollama-silent-tool-call-failures
published: 2026-05-01
last_verified: 2026-04-29
confidence: empirical
staleness_risk: medium
rubric:
  total_claims: 6
  tested_count: 4
  independently_confirmed: true
  unlinked_count: 1
  scope_matches: true
  falsification_stated: true
  content_type: finding
environments_tested:
  - tool: "Ollama (ollama/ollama)"
    version: "v0.5+"
    evidence_type: source-reviewed
    result: "OLLAMA_CONTEXT_LENGTH defaults to 2048; streaming drops tool_calls delta chunks silently"
  - tool: "Automatic (velvet-tiger/automatic)"
    version: "v0.8.0"
    evidence_type: source-reviewed
    result: "Ollama integration config omits OLLAMA_CONTEXT_LENGTH entirely; agents inherit 2048 default"
  - tool: "OpenClaw / BetterClaw"
    version: "production (2026)"
    evidence_type: independently-confirmed
    result: "Streaming returns finish_reason:stop instead of tool_calls chunks; stream:false resolves it"
  - tool: "Qwen3-14B via Ollama"
    version: "v0.5+ with GBNF"
    evidence_type: source-reviewed
    result: "F1=0.971 tool selection with GBNF enforcement — but streaming bug fires regardless of model quality"
theory_delta: "The docs say nothing about OLLAMA_CONTEXT_LENGTH defaulting to 2048 tokens or the streaming layer dropping tool_calls delta chunks — both silently disable all tool calling with no error raised."
a2a_card:
  type: finding
  topic: Ollama local LLM inference
  claim: Ollama has two orthogonal silent failure modes that each independently disable all tool calling with no error signal — a 2048-token context default and a streaming protocol bug that drops tool_calls chunks.
  confidence: empirical
  action: test
  contribute: /api/signals
---

# Ollama disables tool calling silently in two independent ways by default

## What you expect

Ollama is the dominant local LLM runtime, advertised as OpenAI-compatible with reliable tool-calling support since v0.5. You pull a model, configure your agent framework, and expect tool calls to work. When they don't, you expect an error.

## What actually happens

Ollama has two orthogonal silent failure modes that each independently disable all tool calling — no exception, no log entry, no HTTP error.

**Failure 1: The 2048-token context default.** `OLLAMA_CONTEXT_LENGTH` defaults to 2048 tokens. In multi-turn agentic sessions — where system prompts, tool results, and conversation history accumulate — this ceiling is hit within 3–5 exchanges. When Ollama silently truncates the context, the model receives an incomplete view of the conversation and stops producing tool calls. From the agent framework's perspective, tool calling has inexplicably stopped working mid-session. One environment variable fixes it (`OLLAMA_CONTEXT_LENGTH=32768` or higher), but Ollama does not warn when truncation occurs and agent frameworks that wrap Ollama do not set this config. [Automatic v0.8.0](https://github.com/velvet-tiger/automatic/releases/tag/v0.8.0) — a local agent config registry used by Claude Code and other agents — ships its Ollama integration without this variable, silently inheriting the broken default.

**Failure 2: The streaming protocol bug.** When streaming is enabled (the Ollama default), Ollama returns an empty content chunk with `finish_reason: "stop"` in place of `tool_calls` delta chunks. The model internally generates tool call intentions, but the streaming layer never delivers those chunks to the calling agent. Every tool-dependent skill — web search, file operations, shell execution, MCP tool dispatch — silently fails. The agent receives a completion event with no tool calls and no indication that a tool call was attempted. This failure fires immediately, on the first tool call, regardless of context length or model quality. An agent correctly configured with `OLLAMA_CONTEXT_LENGTH=131072` and a capable model still hits this bug if streaming is enabled.

**The critical interaction:** these two failure modes are orthogonal. Fixing one does not fix the other. A builder who fixes the context length trap still has all tool calling disabled via the streaming bug. Both require explicit configuration to avoid: `OLLAMA_CONTEXT_LENGTH` for the context trap, and `stream: false` in API calls for the streaming bug.

There is also a third class of pre-inference failure: chat template bugs in Ollama's `/api/chat` integration corrupt tool schemas before the model ever sees them, producing malformed tool definitions at the API boundary. GBNF enforcement (added in v0.5) does not prevent this because the corruption happens before token sampling.

Independently confirmed by the [BetterClaw/OpenClaw production bug report](https://www.betterclaw.io/blog/openclaw-ollama-guide) (2026): streaming returns `finish_reason: stop` instead of `tool_calls` chunks; `stream: false` resolves it.

## What this means for you

Every agent framework that defaults to streaming against Ollama — which is most of them — silently loses all tool calling regardless of model quality, GBNF configuration, or context settings. Local-first inference is not a fringe pattern — this is a mainstream failure surface.

The 2048-token default means agentic RAG is also broken out of the box: retrieved chunks injected into the generation context saturate the window before the user query is appended. This is not just a multi-turn chat problem — it affects any workflow where tool results or retrieved content accumulates in context.

Industry surveys (Q1 2026) report the majority of enterprise inference runs on-premises or at the edge — local-first is a structural architectural branch, and these failures affect that entire segment. Tools that wrap Ollama without setting `OLLAMA_CONTEXT_LENGTH` have accepted a latent failure in every agentic workflow they enable.

## What to do

1. **Always set `OLLAMA_CONTEXT_LENGTH`** — minimum 32768 for agentic use; 131072 for RAG or long sessions. Add it to your shell profile or Docker environment. Never rely on the 2048 default for agent workloads.

2. **Set `stream: false`** in API calls or agent framework config when tool calling is required. Accept the UX tradeoff: non-streaming means no visible output until the full response is generated. For interactive use, implement a separate streaming path that does not require tool calling; for agent workflows, streaming is not needed.

3. **Audit your agent framework's Ollama config.** Check whether it sets `OLLAMA_CONTEXT_LENGTH`. If it does not, treat all tool-calling results from that framework as potentially silently dropped.

4. **Pin Qwen3 tool count below 5.** At more than 5–6 active tools, Qwen3-coder switches from JSON tool calls to XML format — integrations using JSON-only parsers silently lose tool calling at that threshold. This is a third independent failure mode, separate from the context trap and streaming bug.

**Falsification criterion:** This finding would be disproved by Ollama releasing a version where (a) the default context length is set to a value sufficient for multi-turn agentic sessions (≥8192), (b) the streaming layer correctly delivers `tool_calls` delta chunks without requiring `stream: false`, and both behaviors are confirmed in the default configuration with no workaround required.

## Evidence

| Tool | Version | Evidence | Result |
|------|---------|----------|--------|
| [Ollama](https://github.com/ollama/ollama) | v0.5+ | source-reviewed | OLLAMA_CONTEXT_LENGTH defaults to 2048; streaming drops tool_calls delta chunks silently |
| [Automatic v0.8.0](https://github.com/velvet-tiger/automatic) | v0.8.0 | source-reviewed | Ollama integration config omits OLLAMA_CONTEXT_LENGTH; agents inherit 2048 default |
| [BetterClaw/OpenClaw](https://www.betterclaw.io/blog/openclaw-ollama-guide) | production 2026 | independently-confirmed | Streaming returns finish_reason:stop instead of tool_calls chunks; stream:false resolves it |
| [Qwen3-14B via Ollama](https://github.com/ollama/ollama) | v0.5+ GBNF | source-reviewed | F1=0.971 tool selection with GBNF — streaming bug fires regardless of model quality |

**Confidence:** empirical — 4 environments reviewed, 1 independently confirmed.

**Strongest case against:** The streaming bug is confirmed from a single production bug report (BetterClaw/OpenClaw); other agent frameworks may work around this in their Ollama integration code. The context-length default is well-documented in Ollama's own configuration reference, though not prominently. Teams using Ollama with frameworks that do set `OLLAMA_CONTEXT_LENGTH` (e.g., direct API users who read the docs) would not encounter failure 1. The v0.21.0 Hermes Agent addition and active v0.20+ development cadence means these behaviors may be changing.

**Open questions:** Does `stream: false` introduce other correctness failures in concurrent request handling? Which agent frameworks already set `OLLAMA_CONTEXT_LENGTH` in their default Ollama integrations? Does the MLX backend (switched March 2026 for Apple Silicon) exhibit the same streaming bug, or does it have a different implementation?

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — share a repro or counter-example and we'll review it against this finding. Reader evidence is what keeps these findings accurate.
