Blog · 2026-05-02 · 6 min read

Hermes Agent Orchestrates Claude Code + Codex

Hermes positions itself as the orchestrator, not another coder. Claude Opus 4.7 leads SWE-bench Verified (87.6%), Codex GPT-5.5 leads Terminal-Bench 2.0 (82.7%); Hermes' job is to pick the right specialist per task. This is Specialist Routing - the canonical D1 hub-and-spoke pattern arriving in production tooling.

D1 · D3 · hermes · orchestration · claude-code
[Image: painterly walnut relay desk; two sealed envelopes labelled Claude Code and Codex pass between three Loops, and the central Loop signs a coordination ledger.]

TL;DR

  • Hermes Agent is a coordinator from Nous Research that routes work between Claude Code and Codex under one persistent project memory
  • The pattern is Specialist Routing: pick the best-suited (and often cheaper) specialist per task instead of running everything through one big model
  • Claude Opus 4.7 leads SWE-bench Verified (87.6%); Codex GPT-5.5 leads Terminal-Bench 2.0 (82.7%) - different benchmarks, different specialists
  • This is the canonical CCA-F D1 hub-and-spoke pattern, now surfaced in production tooling
  • Worth ~3-4 questions on the exam (D1 = 27% of total)

Quick answer

Hermes Agent is an orchestration layer from Nous Research that coordinates multiple specialist coding agents - most commonly Claude Code and Codex - under one persistent context. It is not another coder. It is the coordinator that picks the right specialist per task. The pattern is called Specialist Routing, and it is the production-shaped version of the CCA-F D1 hub-and-spoke pattern.

What just happened

Hermes Agent shipped v0.8 in late April 2026, and the design choice it codifies matters more than the version bump. Hermes is no longer a single-model wrapper. It is a persistent coordinator that maintains long-term project memory and routes tasks to specialist workers.

Two of those workers are Claude Code (the reasoning-heavy "Lead Engineer" powered by Claude Opus 4.7) and Codex (the rapid-scaffolding "Autonomous Worker" powered by GPT-5.5). The pattern Hermes operationalizes is exactly the agent orchestration shape the CCA-F D1 blueprint calls hub-and-spoke with subagent isolation - except now you can see it in production tooling instead of just in exam prose.

That alignment is why this release is exam-relevant, not just news. Production teams reaching for orchestration are reaching for the same primitives the certification tests.

Why this matters now

The AI coding-agent space spent 2025 in a "single agent does everything" model - Cursor, Copilot, Cody, even Claude Code each tried to be the one tool you reach for. Through 2026 that consolidation is reversing. Hermes' release is part of a wave: Aider's MoE routing, Cline's specialist-mode toggles, and now Hermes' explicit Claude-Code-plus-Codex orchestration. The bet is the same one the CCA-F D1 blueprint already tests: one big model is rarely the right answer when several smaller specialists are cheaper and better at narrow tasks.

The "use multiple specialists" framing has a real failure mode, though. Each specialist needs an isolated context, a clear task prompt, and a return contract - and that integration overhead is non-trivial. Hermes makes those problems explicit (Kanban for task isolation, Task Completion Judgment for the return contract, immutable skills for protection from self-improvement loops). A team that just wires Claude Code + Codex without those guardrails will burn more time on coordination than they save on inference. The CCA-F's D1 distractors are calibrated to this exact trap.
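
The three requirements named above (isolated context, a clear task prompt, a return contract) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hermes' actual data model: `TaskEnvelope`, `WorkerResult`, `build_envelope`, and `validate_result` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskEnvelope:
    """Everything a worker is allowed to see for one task (hypothetical shape)."""
    task_id: str
    prompt: str
    context: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class WorkerResult:
    """The return contract: what a worker must hand back (hypothetical shape)."""
    task_id: str
    summary: str
    files_changed: tuple[str, ...] = ()


def build_envelope(task_id: str, prompt: str, memory: dict[str, str],
                   share: list[str]) -> TaskEnvelope:
    """Copy only the explicitly listed memory keys into the worker's context."""
    missing = [key for key in share if key not in memory]
    if missing:
        raise KeyError(f"memory has no entries for: {missing}")
    return TaskEnvelope(task_id, prompt, {key: memory[key] for key in share})


def validate_result(envelope: TaskEnvelope, result: WorkerResult) -> None:
    """Reject results that break the contract before they reach project memory."""
    if result.task_id != envelope.task_id:
        raise ValueError("result does not belong to this task")
    if not result.summary.strip():
        raise ValueError("result is missing a summary")
```

The design choice worth noticing is that sharing is opt-in: nothing from the coordinator's memory reaches a worker unless it is named in `share`.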

For teams building in production, the Hermes pattern argues for a coordinator-first stack rather than a model-first stack. Pick your coordinator (Hermes, Aider's MoE mode, or your own subagent dispatcher) before you pick which models to coordinate. The coordinator is the API your team uses; the workers behind it are interchangeable. Treat the coordinator as the load-bearing piece, version-pin it, and add observability around its task assignments. That's also what the canonical multi-agent-research-system scenario teaches.
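
One way to picture "the coordinator is the API, the workers are interchangeable" is the sketch below. It assumes a worker is just a function from prompt to text, and it records every assignment so routing decisions can be inspected later. `Coordinator`, `register`, and `run` are hypothetical names for illustration, not part of Hermes or any other tool mentioned here.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a worker is anything that takes a task prompt and returns text.
Worker = Callable[[str], str]


class Coordinator:
    """Hypothetical coordinator: callers see `run`, never the workers behind it."""

    def __init__(self) -> None:
        self._workers: Dict[str, Worker] = {}
        self._routes: Dict[str, str] = {}  # task kind -> worker name
        self.assignments: List[Tuple[str, str]] = []  # (task kind, worker name)

    def register(self, name: str, worker: Worker, kinds: List[str]) -> None:
        """Attach a worker and declare which task kinds it handles."""
        self._workers[name] = worker
        for kind in kinds:
            self._routes[kind] = name

    def run(self, kind: str, prompt: str) -> str:
        """Dispatch one task and record the assignment for later inspection."""
        name = self._routes[kind]
        self.assignments.append((kind, name))
        return self._workers[name](prompt)
```

Re-registering a task kind with a different worker swaps the model behind it without changing any caller, and the `assignments` list is the simplest possible form of the observability the paragraph above recommends.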

The CCA-F has at least four kinds of D1 questions that this pattern lights up directly: hub-and-spoke vs flat agent ensembles (correct: hub-and-spoke), context inheritance in subagents (correct: subagents do not inherit), cost-aware model routing (correct: route cheap work to cheaper specialists), and verification loops at handoffs (correct: structured human-in-the-loop (HITL) review with rollback below threshold). Knowing Hermes is the worked example for all four lets you map any D1 question to a known production pattern within 60 seconds.

The open questions are around skill marketplaces and protocol drift. Hermes deliberately uses self-generated skills rather than a public marketplace - that's a security-driven choice, but it limits ecosystem velocity. If a marketplace standard emerges (MCP-style, but for agent skills rather than tools), the next generation of orchestrators may bypass Hermes entirely. The competing bet is OpenClaw (marketplace-heavy, faster ecosystem, more CVEs). Whichever wins shapes what D1 looks like on the 2027 version of the exam.

5 things that matter for the exam

  1. Put the orchestrator in charge of context, not coding. Hermes owns long-term memory and task hand-offs. The workers do not share context with each other - they share context only with the coordinator. That is the canonical D1 subagent isolation rule, surfaced as a real product feature.

  2. Split work by benchmark strength. Claude Opus 4.7 hits 87.6% on SWE-bench Verified - best for refactors, security review, architecture. Codex GPT-5.5 hits 82.7% on Terminal-Bench 2.0 - best for scaffolding, tests, lint, terminal-heavy chores. Different benchmarks, different specialists.

  3. Lock task boundaries to prevent context bleed. Hermes ships a Kanban feature that pins a task to one agent. Without explicit task boundaries, both workers touch everything and context leakage tanks reliability. Same lesson as the canonical subagent isolation rule on the exam.

  4. Watch the token-drain refactor loop. Claude Code can consume up to 4× more tokens than Codex on the same task, and the bill can diverge further: one reported refactor cost $155 with Opus vs $15 with Codex, roughly a 10× gap. Use Codex first for scaffolding; use Claude only for final architectural review. Cost-aware routing is part of D1 model-selection.

  5. Run the night-shift in Docker, with a PTY for risky commands. Autonomous overnight runs are now common. Use Docker isolation for the workers and hermes start --pty so Claude Code pauses for human confirmation on rm -rf, git push --force, and other destructive commands. Hooks-as-product, exam-relevant for D3.
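
To make the confirmation idea in item 5 concrete, here is a small Python sketch of a gate that flags the two destructive commands named above. It does not describe how `hermes start --pty` works internally; `needs_confirmation` and `run_guarded` are hypothetical helpers, and the check only inspects a single simple command (it would miss `sudo rm -rf` or commands chained with `&&`).

```python
import shlex
from typing import Callable


def needs_confirmation(command: str) -> bool:
    """Flag `rm` with recursive+force flags and `git push` with a force flag."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input: be conservative and ask a human
    if not tokens:
        return False
    flags = [t for t in tokens[1:] if t.startswith("-")]
    if tokens[0] == "rm":
        # Merge short flags so "-rf", "-fr", and "-r -f" are treated alike.
        short = "".join(f[1:] for f in flags if not f.startswith("--"))
        recursive = "r" in short or "R" in short or "--recursive" in flags
        force = "f" in short or "--force" in flags
        return recursive and force
    if tokens[:2] == ["git", "push"]:
        return "--force" in flags or "-f" in flags
    return False


def run_guarded(command: str, confirm: Callable[[str], bool],
                execute: Callable[[str], None]) -> str:
    """Run the command unless it is flagged and the human declines."""
    if needs_confirmation(command) and not confirm(command):
        return "skipped"
    execute(command)
    return "executed"
```

In a real overnight run, `confirm` would block on a human prompt and `execute` would run inside the container; both are left as injected callables here so the gate itself stays testable.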

3 production patterns the Hermes release codifies

  1. The Coordinator pattern. One agent owns long-term project memory; specialist workers spin up per task with fresh context and a narrow tool whitelist. The coordinator never codes - it routes. Mirrors the CCA-F D1 expectation that subagents do not inherit parent context.

  2. The Specialist-by-benchmark pattern. Don't pick one model and route everything to it. Pick the model per task type based on benchmark strength: Opus for reasoning-heavy multi-file edits, Codex for terminal-heavy single-shot commands. Same idea applies to Haiku for cheap classification.

  3. The Verification-loop pattern. After workers complete, the coordinator runs a Task Completion Judgment check. If the score is below 0.9, it rolls back changes and escalates to a human. This is the structured HITL handoff the CCA-F D5 questions probe - operationalized.
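
The verification-loop pattern reduces to a threshold check with two side effects. The sketch below uses the 0.9 threshold quoted above and assumes the score is a float between 0 and 1; how Hermes actually computes Task Completion Judgment is not covered here, and `settle` is a hypothetical function name.

```python
from typing import Callable

THRESHOLD = 0.9  # threshold quoted in the post


def settle(score: float, rollback: Callable[[], None],
           escalate: Callable[[str], None],
           threshold: float = THRESHOLD) -> str:
    """Accept the work, or roll back and escalate when the score is too low."""
    # Assumption: scores are normalized to the range 0..1.
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < threshold:
        rollback()  # undo the worker's changes first
        escalate(f"completion score {score:.2f} is below {threshold:.2f}")
        return "escalated"
    return "accepted"
```

Rolling back before escalating means the human reviews a clean working tree rather than a half-finished change; a score of exactly 0.9 is accepted, matching the "below 0.9" wording.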

Three orchestration anti-patterns

  • Skill overwriting on the self-improving loop. Hermes' self-improvement can clobber manually tuned skills. Mark them immutable: true in the skill markdown frontmatter - same protective discipline as pinned case-facts blocks in long Claude sessions.

  • Letting Claude Code do the cheap work. If you route lint + scaffolding through Opus 4.7, you are paying premium prices for cheap output. This is the classic "bigger model" distractor the exam plants in D1 questions: the right fix is routing, not escalation.

  • Skipping version pinning. Hermes shipped v0.1 → v0.8 in months. Pin your version in CI/CD; treat protocol drift as a real failure mode. The same discipline applies to MCP servers and the tools/list handshake.
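
The first anti-pattern above can be guarded against with a check before any skill file is overwritten. This sketch assumes the frontmatter is a leading block delimited by `---` lines with simple `key: value` pairs, which is a common convention but not confirmed for Hermes skill files; `is_immutable` and `safe_overwrite` are hypothetical helpers.

```python
def is_immutable(skill_markdown: str) -> bool:
    """Return True if the leading frontmatter block contains `immutable: true`."""
    lines = skill_markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return False  # no frontmatter at all
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, _, value = line.partition(":")
        if key.strip() == "immutable":
            return value.strip().lower() == "true"
    return False


def safe_overwrite(current: str, proposed: str) -> str:
    """Keep the current skill text when it is marked immutable."""
    return current if is_immutable(current) else proposed
```

Only the frontmatter is inspected, so the phrase `immutable: true` appearing in the body of a skill does not lock it by accident.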

How this shows up on the exam

Domain 1 (Agentic Architecture & Orchestration) - 27% of the exam, the biggest single domain. Expect at least two questions whose correct answer is some version of "pick the smaller specialist agent for the cheap step and route to the larger reasoning agent only for the architectural step." The distractors will suggest "increase model size", "lengthen context window", or "add a system prompt that lists every step" - all wrong. Hermes' Specialist Routing is the canonical answer pattern surfaced in production form.
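
The answer pattern described above is a routing decision rather than a model upgrade. As a hedged sketch: the task kinds below come from this post's own examples, while the two-set split, the `human-triage` fallback, and the function names are assumptions made for illustration.

```python
# Task kinds taken from the post's examples; the grouping is an assumption.
REASONING_HEAVY = {"refactor", "security-review", "architecture"}
TERMINAL_HEAVY = {"scaffolding", "tests", "lint"}


def pick_specialist(task_kind: str) -> str:
    """Return a worker name for a task kind; unknown kinds go to a human."""
    if task_kind in REASONING_HEAVY:
        return "claude-code"
    if task_kind in TERMINAL_HEAVY:
        return "codex"
    return "human-triage"  # hypothetical fallback, not a Hermes feature


def plan(task_kinds: list[str]) -> list[tuple[str, str]]:
    """Pair each task kind with the specialist that would handle it."""
    return [(kind, pick_specialist(kind)) for kind in task_kinds]
```

For example, `plan(["scaffolding", "lint", "architecture"])` sends the first two steps to Codex and only the final step to Claude Code, which is the shape the correct exam answers take.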

For study-next, pair this post with the Subagents concept page (isolation rule), the Multi-agent research system scenario (the canonical D1 build-along), and the Day-of distractor patterns in the Exam Guide (especially the Model vs Design trap). Hermes is the reference implementation; the exam will probe the principle behind it.

02 · FAQ

What is Hermes Agent and how is it different from Claude Code?
Hermes Agent is an orchestration layer from Nous Research that coordinates multiple coding agents under one persistent project memory. Claude Code is one of the agents Hermes orchestrates - it does the actual reasoning and code edits. Think of Hermes as the project manager and Claude Code as the senior engineer it assigns work to. They are complementary, not competing.

What is Specialist Routing in agent design?
Specialist Routing is the pattern where a coordinator agent inspects each incoming task and forwards it to the worker best-suited by benchmark and cost. Claude Code (Opus 4.7) handles reasoning-heavy multi-file edits; Codex (GPT-5.5) handles terminal-heavy scaffolding. The coordinator owns context; workers run in isolation. This is the canonical CCA-F D1 hub-and-spoke pattern.

Why does Claude Opus 4.7 beat Codex on SWE-bench but lose on Terminal-Bench?
Different optimization targets. SWE-bench Verified rewards multi-file reasoning and architectural correctness - Opus 4.7's strength (87.6%). Terminal-Bench 2.0 rewards single-shot terminal commands and rapid tool-use - Codex GPT-5.5's strength (82.7%). Specialist Routing exploits both rather than picking one and forcing it on every task.

Does Hermes Agent map to a CCA-F exam domain?
Yes - primarily Domain 1 (Agentic Architecture & Orchestration, 27% of the exam) and secondarily Domain 3 (Claude Code Configuration & Workflows, 20%). Expect questions on hub-and-spoke patterns, subagent context isolation, cost-aware model routing, and verification loops. Hermes is the reference implementation of all four.

How does subagent context isolation actually work?
When Hermes (or any coordinator) spawns a subagent, the subagent receives only the task prompt the coordinator hands it - not the parent's conversation history, not the other subagents' tool results, and not the parent's system prompt unless it is passed explicitly. This prevents context bleed that would otherwise tank reliability. The exam probes this with distractors that assume parent context is inherited.

What is the Hermes Nightshift pattern?
Nightshift is an autonomous overnight workflow where Hermes deploys Codex for scaffolding and Claude Code for fix-proposals while the developer sleeps. It runs inside Docker for isolation, evaluates a Task Completion Judgment score, rolls back changes that score below 0.9, and leaves a morning summary. It is the structured HITL escalation pattern made into a product.

Should I use Hermes for the exam or Claude Code directly?
Use Claude Code directly for exam study and most build-alongs - Hermes adds a coordination layer you don't need for solo work. Use Hermes when you have multiple specialist agents running concurrently and need shared project memory. The exam tests the principle (Specialist Routing); your study tooling can be simpler.

Synthesized from research output on 2026-05-02.
Last reviewed 2026-05-06.
