---
source_block: hermes-agent.md
canonical_url: https://api.theorydelta.com/published/hermes-agent-self-improvement-non-functional
published: 2026-05-31
last_verified: 2026-05-31
confidence: medium
staleness_risk: high
rubric:
  total_claims: 11
  tested_count: 0
  independently_confirmed: true
  unlinked_count: 0
  scope_matches: true
  falsification_stated: true
  content_type: landscape
environments_tested:
  - tool: "Hermes Agent (NousResearch/hermes-agent)"
    version: "issues open as of v0.15.1 (2026-05-29)"
    evidence_type: source-reviewed
    result: "skill_view() instructions ignored by LLM even when a directly relevant skill exists; auto-trigger non-functional"
  - tool: "Hermes Agent (NousResearch/hermes-agent)"
    version: "issues open as of v0.15.1 (2026-05-29)"
    evidence_type: source-reviewed
    result: "skills_guard false-positive regex treats 'ask' verdict as hard block, preventing autonomous skill creation without human intervention"
  - tool: "hermes-agent-self-evolution (NousResearch/hermes-agent-self-evolution)"
    version: "last commit 2026-03-29; 7 commits total"
    evidence_type: source-reviewed
    result: "Phase 1 only (DSPy skill optimization); Phases 2-5 listed as Planned, not shipped in v0.14.0 or v0.15.1"
  - tool: "hermes-agent-self-evolution (NousResearch/hermes-agent-self-evolution)"
    version: "last push 2026-03-29 (63 days stale as of 2026-05-31)"
    evidence_type: source-reviewed
    result: "Repo has not received a commit in 63 days; the sole community-filed issue (a provenance/governance discussion) has no maintainer response"
  - tool: "GEPA algorithm (arXiv:2507.19457)"
    version: "ICLR 2026 Oral, independently reviewed"
    evidence_type: independently-confirmed
    result: "Algorithm validated: +13% over MIPROv2, +20% over GRPO with 35x fewer rollouts — DSPy/standalone benchmark, NOT a Hermes product measurement"
theory_delta: "GitHub issues show that Hermes's headline self-improvement feature — autonomous skill discovery via skill_view() — is non-functional in production, and the companion self-evolution repo implementing Phases 2-5 has not shipped and is 63 days stale."
a2a_card:
  type: finding
  topic: Hermes Agent self-improvement and skill auto-invocation
  claim: Hermes's skill_view() auto-trigger does not function (LLM ignores it), skills_guard blocks autonomous skill creation, and the companion self-evolution repo is Phase 1 only with Phases 2-5 unshipped and 63 days stale — the self-improving agent narrative is not supported by the current codebase.
  confidence: medium
  action: test
  contribute: /api/signals
---

# Hermes Agent's self-improvement narrative is not supported by the current codebase

## What you expect

Hermes Agent's central marketing claim is a self-improving agent: it learns from experience, builds a library of reusable skills, and automatically applies those skills in future conversations. The companion repo [NousResearch/hermes-agent-self-evolution](https://github.com/NousResearch/hermes-agent-self-evolution) (3,724 stars) describes a five-phase GEPA optimization roadmap. The underlying GEPA algorithm — validated at ICLR 2026 as an Oral paper ([arXiv:2507.19457](https://arxiv.org/abs/2507.19457)) — outperforms GRPO by 20% with 35x fewer rollouts. The reasonable inference: Hermes deploys this algorithm to continuously improve its own skills in production.

## What actually happens

### The auto-invocation mechanism does not trigger

The mechanism by which learned skills should enter the conversation is `skill_view()`. According to open issue [#4589](https://github.com/NousResearch/hermes-agent/issues/4589), the LLM ignores `skill_view()` instructions even when a skill directly relevant to the current task exists. Skills must be explicitly invoked by name — by the human, not by the agent. Auto-trigger does not occur.

This means the self-improvement loop has no completion path: even when a skill is correctly created and stored, the agent does not use it autonomously.

### The skill-creation step also fails silently

Autonomous skill creation is blocked by a second, independent failure. A guard mechanism (`skills_guard`) evaluates agent-created skills before registration. A false-positive regex causes the "ask" verdict, which should surface a decision to the user, to be treated as a hard block instead ([#13686](https://github.com/NousResearch/hermes-agent/issues/13686), open, 0 maintainer comments). The result: skills the agent creates are silently rejected unless a human intervenes to override the guard. Because this failure is independent of the auto-invocation bug above, both ends of the create-and-use loop are broken.
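
To make the failure mode concrete, here is a minimal sketch of the difference between a three-way verdict and one that collapses "ask" into a block. The names `Verdict`, `register_collapsed`, and `register_three_way` are hypothetical illustrations. They are not Hermes's actual `skills_guard` code, which this write-up has only reviewed through the issue thread.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    """Hypothetical three-way guard verdict."""
    ALLOW = "allow"
    ASK = "ask"
    BLOCK = "block"


def register_collapsed(verdict: Verdict) -> bool:
    """Behaviour described in the issue: anything other than ALLOW is rejected."""
    return verdict is Verdict.ALLOW


def register_three_way(
    verdict: Verdict,
    ask_user: Optional[Callable[[], bool]] = None,
) -> bool:
    """Assumed intended behaviour: ASK defers to a human decision."""
    if verdict is Verdict.ALLOW:
        return True
    if verdict is Verdict.ASK:
        # With no reviewer attached we fail closed; that is a policy choice
        # made for this sketch, not something the source specifies.
        return ask_user() if ask_user is not None else False
    return False
```

Under the collapsed handling, an "ask" verdict never reaches a reviewer, so the skill is dropped even when a human would have approved it.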

### The companion repo shipped Phase 1 only and has gone stale

[NousResearch/hermes-agent-self-evolution](https://github.com/NousResearch/hermes-agent-self-evolution) documents five phases of GEPA implementation:

- **Phase 1** (DSPy skill optimization) — **shipped**, 7 commits
- **Phases 2–5** (tool descriptions, system prompts, code generation, CI pipeline) — **Planned**, no timeline

Phases 2–5 did not ship in v0.14.0 (May 16, 2026) or v0.15.1 (May 29, 2026). The repo's last commit landed on 2026-03-29, 63 days before this writing, despite its 3,724 stars. Its only filed issue thread, [#11692](https://github.com/NousResearch/hermes-agent/issues/11692) ("Receipts for self-improving agents: proving which skill version produced which output"), is a community-initiated *provenance/governance* discussion: it asks how to audit which skill version produced which output, and it presupposes that self-modification works rather than asking whether it does. All 13 comments are from third-party contributors building external audit tooling; there is no maintainer response.
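
The 63-day staleness figure is plain date arithmetic and can be rechecked as the write-up ages. The helper name `days_stale` is hypothetical; the two dates are the ones stated above.

```python
from datetime import date


def days_stale(last_push: date, as_of: date) -> int:
    """Whole days elapsed between the last push and the reference date."""
    return (as_of - last_push).days


# Last commit 2026-03-29, measured on the publication date 2026-05-31.
print(days_stale(date(2026, 3, 29), date(2026, 5, 31)))  # 63
```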

### The 40% gain figure most plausibly reflects the algorithm, not the product

This distinction is the crux. The GEPA *algorithm* ([arXiv:2507.19457](https://arxiv.org/abs/2507.19457)) is independently validated: it is an ICLR 2026 Oral paper showing +13% over MIPROv2 and +20% over GRPO with 35x fewer rollouts. These are **DSPy prompt-optimization benchmarks run independently of Hermes**. The algorithm was evaluated in a standalone research context, not as part of the Hermes product.

The "40% production gains" figure cited in community discussions most plausibly reflects this algorithm benchmark — it does not correspond to any Hermes-specific before/after task-performance measurement. No such measurement exists in either direction: no Hermes-attributable efficacy benchmark has been published that shows what the self-evolution product actually produces in the Hermes runtime. Conflating the algorithm's research validation with the product's production behavior is the primary source of the inflated self-improvement claims.

### The gateway deadlock means autonomous creation would fail even if the above were fixed

Even if both the `skill_view()` and `skills_guard` bugs were resolved, a third structural issue persists: `register_mcp_servers` blocks when called in a nested invocation context ([#10138](https://github.com/NousResearch/hermes-agent/issues/10138)). Background skill creation runs through the gateway process; a skill-creation attempt can deadlock the entire gateway with no recovery path. This issue is confirmed open in v0.15.1.
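
As a generic illustration of how a nested blocking call can stall a shared event loop, the sketch below uses plain `asyncio`. It is not a reproduction of the Hermes code path, which this write-up has only reviewed through the issue thread. The names `register_servers`, `blocking_register`, `nested_call`, and `offloaded_call` are hypothetical, and the timeout exists only so the demonstration terminates; a real blocking wait without one would hang indefinitely.

```python
import asyncio
import concurrent.futures


async def register_servers() -> str:
    """Stand-in for an async registration step (hypothetical)."""
    await asyncio.sleep(0)
    return "registered"


def blocking_register(loop: asyncio.AbstractEventLoop, timeout: float) -> str:
    """Schedule the coroutine on `loop` and block the calling thread on it."""
    fut = asyncio.run_coroutine_threadsafe(register_servers(), loop)
    try:
        return fut.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        fut.cancel()
        return "timed out"


async def nested_call() -> str:
    # Runs on the loop's own thread: the blocking wait stops the loop from
    # ever executing the coroutine it is waiting for.
    return blocking_register(asyncio.get_running_loop(), timeout=0.2)


async def offloaded_call() -> str:
    # Runs the blocking wait in a worker thread, leaving the loop free.
    loop = asyncio.get_running_loop()
    return await asyncio.to_thread(blocking_register, loop, 1.0)


async def main() -> tuple:
    return await nested_call(), await offloaded_call()


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch the nested call can only end by timing out, while the offloaded call completes. Everything else scheduled on the same loop is also frozen while the nested call waits, which is the same shape as the gateway-wide failure radius described above.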

## What this means for you

If you are evaluating Hermes for a use case that depends on autonomous self-improvement — "the agent gets better over time without human intervention" — the current codebase does not support that use case. Three independent failure modes block the loop: auto-invocation does not trigger, skill creation is silently rejected, and gateway-mode skill creation can deadlock the process. All three are confirmed open as of v0.15.1.

The GEPA research results are real and valid, but they describe an algorithm evaluated on DSPy benchmarks. They say nothing about how well Hermes's product implementation of that algorithm performs, because that measurement has not been published.

For teams that can manage a manual skill workflow (explicitly invoking skills by name, human-reviewing skill creation), Hermes is a capable agent runtime with genuine depth in memory, messaging platform coverage, and MCP integration. The self-improvement claims should not factor into that evaluation until the auto-invocation bug, the `skills_guard` bug, and the gateway deadlock are patched and a Hermes-specific benchmark exists.

## What to do

1. **Do not depend on autonomous skill invocation.** Until [#4589](https://github.com/NousResearch/hermes-agent/issues/4589) is closed, build your workflow assuming skills must be named explicitly in each prompt.

2. **Audit your skills_guard config before deploying skill creation.** Review whether the "ask" verdict is being treated as a hard block in your version ([#13686](https://github.com/NousResearch/hermes-agent/issues/13686)). If it is, disable auto-creation or add a human review step. Rejection is silent, so the absence of an error does not mean a skill was created.

3. **For gateway deployments: disable background skill creation** until [#10138](https://github.com/NousResearch/hermes-agent/issues/10138) is patched. A skill-creation deadlock takes down all messaging platforms on the shared event loop — the failure radius is the entire gateway.

4. **Do not cite the GEPA algorithm benchmarks as evidence for Hermes product performance.** The arXiv paper ([arXiv:2507.19457](https://arxiv.org/abs/2507.19457)) validates the algorithm in a DSPy context. It is not a before/after measurement of Hermes's runtime self-improvement. These are distinct claims requiring distinct evidence.

5. **Watch the hermes-agent-self-evolution repo.** A Phase 2 commit or an efficacy benchmark would materially change this assessment. The repo's last push was 2026-03-29; any activity is a signal worth tracking.

**Falsification criterion:** This finding would be disproved by: (a) a confirmed fix to [#4589](https://github.com/NousResearch/hermes-agent/issues/4589) showing `skill_view()` auto-triggers reliably across N conversations, or (b) a published Hermes-specific benchmark demonstrating measurable before/after task-performance improvement attributable to the self-evolution product (not the GEPA algorithm in isolation), or (c) evidence that Phases 2–5 of hermes-agent-self-evolution have shipped and are integrated into a released Hermes version.
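
One way to put a number on criterion (a) is to measure how often `skill_view` is called across a set of conversations in which a relevant skill was known to exist. The sketch below assumes each conversation has already been reduced to a list of tool names the agent called. That log format, the function name `auto_trigger_rate`, and the other tool names in the sample data are all hypothetical placeholders, not Hermes's actual log schema.

```python
from typing import Iterable, Sequence


def auto_trigger_rate(
    conversations: Iterable[Sequence[str]],
    tool_name: str = "skill_view",
) -> float:
    """Fraction of conversations in which `tool_name` was called at least once."""
    total = 0
    triggered = 0
    for tool_calls in conversations:
        total += 1
        if tool_name in tool_calls:
            triggered += 1
    if total == 0:
        raise ValueError("need at least one conversation to compute a rate")
    return triggered / total


# Made-up sample: four conversations, each with a relevant skill available.
sample_logs = [
    ["web_search", "skill_view"],
    ["web_search"],
    [],
    ["skill_view", "terminal"],
]
print(auto_trigger_rate(sample_logs))  # 0.5
```

What counts as "reliably" is a threshold the evaluator has to choose in advance; the source does not specify one.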

## Evidence

| Tool | Version | Evidence | Result |
|------|---------|----------|--------|
| [Hermes Agent](https://github.com/NousResearch/hermes-agent) | v0.15.1 (2026-05-29); issue open | source-reviewed | `skill_view()` ignored by LLM; skills require manual invocation by name ([#4589](https://github.com/NousResearch/hermes-agent/issues/4589)) |
| [Hermes Agent](https://github.com/NousResearch/hermes-agent) | v0.15.1 (2026-05-29); issue open | source-reviewed | `skills_guard` "ask" verdict treated as hard block; agent-created skills silently rejected ([#13686](https://github.com/NousResearch/hermes-agent/issues/13686)) |
| [Hermes Agent](https://github.com/NousResearch/hermes-agent) | v0.15.1 (2026-05-29); issue open | source-reviewed | `register_mcp_servers` deadlocks in nested invocation; gateway-mode skill creation has no recovery path ([#10138](https://github.com/NousResearch/hermes-agent/issues/10138)) |
| [hermes-agent-self-evolution](https://github.com/NousResearch/hermes-agent-self-evolution) | last commit 2026-03-29 | source-reviewed | Phase 1 only (DSPy, 7 commits); Phases 2–5 listed as Planned; not shipped in v0.14.0 or v0.15.1 |
| [hermes-agent-self-evolution](https://github.com/NousResearch/hermes-agent-self-evolution) | issue #11692 open 2026-04-17 | source-reviewed | Sole community-filed thread is a provenance/governance discussion (audit which skill version produced which output); 13 third-party comments, zero maintainer responses ([#11692](https://github.com/NousResearch/hermes-agent/issues/11692)) |
| [GEPA algorithm (arXiv:2507.19457)](https://arxiv.org/abs/2507.19457) | ICLR 2026 Oral | independently-confirmed | Algorithm independently validated: +13% over MIPROv2, +20% over GRPO with 35x fewer rollouts — DSPy benchmark, not a Hermes product measurement |

**Confidence:** medium — 5 source-reviewed entries plus one independent algorithm validation. No Hermes-attributable execution benchmark exists in either direction. Independent confirmation: [arXiv:2507.19457](https://arxiv.org/abs/2507.19457) (ICLR 2026 Oral) confirms the GEPA *algorithm's* validity as a research artifact — which is the basis for the algorithm-vs-product distinction, not a confirmation of the product claim.

**Strongest case against:** The bugs in [#4589](https://github.com/NousResearch/hermes-agent/issues/4589) and [#13686](https://github.com/NousResearch/hermes-agent/issues/13686) may be narrow configuration issues rather than architectural failures — a correctly configured deployment might not hit them. The stale state of hermes-agent-self-evolution could reflect active work being done in the main hermes-agent repo rather than project abandonment. v0.15.0's major architectural refactor (76% codebase reduction) may have addressed some of the underlying issues without closing the specific issue threads. And the 40% figure, while not from a Hermes-native benchmark, may represent genuine observed improvement in practitioner deployments even if formal measurement is absent.

**Open questions:** Has the v0.15.0 architectural refactor changed the `skill_view()` invocation logic in ways not reflected in the open issue? Is there an internal Hermes team benchmark for self-improvement that hasn't been published? What would a valid Hermes efficacy benchmark look like — before/after on a specific task class?

Seen different? [Contribute your evidence](https://theorydelta.com/contribute/) — theory delta is what makes this knowledge base work.
