Pillar 9 · Blog · 2026-05-12 · 3 min read

The Local Bridge Stack: Claude Code on Llama.cpp + Gemma 4 at 22-28 t/s

Route Claude Code through Llama.cpp to a local Gemma 4 31B model and you get 22-28 tokens/sec autonomous coding, zero API spend, and data that never leaves your NVMe. The trick is two env vars (ANTHROPIC_BASE_URL, ANTHROPIC_API_KEY) plus the free-claude-code bridge, a 32768-token context window, and a one-line CLAUDE_CODE_MAX_TOKENS=16384 ceiling so the client doesn't try a 200k context and wipe the run.

D3 · D5 · claude-code · local-llm · llama-cpp
Painterly walnut signal-routing console with brass pneumatic tubes curving inward back to a workshop bench. A hand-painted brass dial reads LOCAL // CLOUD with the needle locked to LOCAL. Loop in wire-rim glasses reads a BASE_URL instruction card at the workbench.

Quick answer

The Local Bridge Stack = Claude Code client + ANTHROPIC_BASE_URL redirect + free-claude-code proxy + Llama.cpp + Gemma 4 31B. Output: 22-28 t/s autonomous coding on an RTX 4090, zero API bill, data stays local. Two non-obvious settings do the heavy lifting: the CLAUDE_CODE_MAX_TOKENS=16384 env var and the --api-type anthropic server flag. A root CLAUDE.md mapping the repo architecture is the highest-leverage configuration step after the env vars.
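
A minimal sketch of the client-side redirect, assuming the bridge is already listening on localhost:8080. The variable names are the ones this post uses; the key value is any placeholder string.

    export ANTHROPIC_BASE_URL="http://localhost:8080/v1"  # send Anthropic-format calls to the local bridge
    export ANTHROPIC_API_KEY="sk-local-token"             # placeholder; the bridge ignores it, the client insists on it
    export CLAUDE_CODE_MAX_TOKENS=16384                   # output ceiling so the client doesn't assume a 200k budget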

The setup, in 11 lines

Running Anthropic's Claude Code with Llama.cpp and Gemma 4 has hit a new sweet spot. Before: weekend science project. After: a stack worth running every day.

  1. Route Claude Code to local inference with ANTHROPIC_BASE_URL="http://localhost:8080/v1" so Anthropic-format calls hit Llama.cpp.
  2. Set ANTHROPIC_API_KEY="sk-local-token" anyway. The bridge accepts any placeholder; the client refuses to start without one. Yes, it feels illegal.
  3. Install the bridge tooling with uv, not pip detours. The free-claude-code proxy is the piece that needs uv's dependency resolution; the Claude Code client itself installs through its usual npm package (@anthropic-ai/claude-code).
  4. Don't use old uv. Pre-v0.6.0 reportedly breaks free-claude-code's proxy dependency resolution. Upgrade first.
  5. Run the bridge from free-claude-code so Claude Code can talk to a local server without pretending it's a cloud app.
  6. Start Llama.cpp with --ctx-size 32768 (full launch command sketched after this list). Smaller contexts look fine until repo analysis lands and everything catches fire.
  7. Add --cont-batching — Claude Code isn't one clean completion. It's a think/act loop, and batching matters.
  8. On newer Llama.cpp builds, use --api-type anthropic so /v1/messages works natively. Less glue code, fewer haunted bugs.
  9. Set CLAUDE_CODE_MAX_TOKENS=16384. Most people miss this. If you don't set it, Claude Code may try a 200k context and wipe your local run.
  10. Pick the right Gemma variant. Gemma 4 31B Q4_K_M is the practical logic/speed tradeoff; the 26B-A4B MoE is faster but loops more on harder refactors.
  11. The payoff: Gemma 4 31B on an RTX 4090 reportedly runs autonomous coding at 22-28 tokens/sec, with data staying on your NVMe instead of crossing the network.
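
Pulled together, the server side looks roughly like the sketch below. The binary name llama-server and the model path are assumptions; the flags are the ones steps 6-8 cite, so verify them against your Llama.cpp build.

    # serve the model with the context, batching, and API settings from steps 6-8
    llama-server -m ./gemma-4-31b-q4_k_m.gguf \
      --ctx-size 32768 \
      --cont-batching \
      --api-type anthropic

    # start the free-claude-code bridge on port 8080 (launch command per its README),
    # export the three env vars from the quick answer above, then run the client as usual
    claude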

The bonus tip nobody mentions

Create a root CLAUDE.md mapping the repo architecture. Top-level dirs, module roles, where side-effects live, entry points. Anthropic-format tooling is smart, not psychic — and a structural index in CLAUDE.md gives the local model navigational grip even when individual file contents aren't in context. Twenty minutes of authoring; pays back across every session.
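
A minimal sketch of what that file can look like. Every directory and entry point named here is hypothetical; the structural shape is the point.

    # CLAUDE.md (repo root)

    ## Repo map
    - api/       HTTP handlers; entry point is api/main.py
    - core/      pure business logic, no side-effects
    - adapters/  all side-effects (DB, queues, external HTTP) live here
    - cli/       operational scripts; entry point is cli/__main__.py

    ## Conventions
    - New side-effects go in adapters/, never in core/.
    - Run the test suite before proposing multi-file diffs.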

Where it still goes weird

Three honest failure modes:

  • Deletions over 500 lines get weird. The model's reasoning about cross-file impact thins out at that scale.
  • Repo-wide refactors need explicit chunking via Plan Mode. Don't ask for them in one prompt.
  • The 32768 context is enough for most files but not whole-repo analysis. Pair with the root CLAUDE.md above.

This isn't perfect. It's just good enough to be the default for most personal-productivity coding work.

How this shows up on the exam

D3 (Claude Code Configuration) repeatedly tests whether you understand that the client and the inference layer are separable concerns. Most candidates assume the Claude Code workflows (Plan Mode, CLAUDE.md hierarchy, Skills, hooks, slash commands) are tied to Anthropic's API. They're not — the protocols are public, the bridge layer is real, and the workflows survive a model swap. Exam questions that ask "where does configuration X live?" expect you to know it lives in the client and the file system, not in the API.

D5 (Context Management) tests the same pattern from the other side. The CLAUDE_CODE_MAX_TOKENS=16384 ceiling and --ctx-size 32768 are the worked example of deliberate context budgeting — you don't just take whatever the default is, you size the budget to the model and the task. The exam will distract you with "raise the context limit" answers; the correct architecture is to bound the budget and use Plan Mode + CLAUDE.md to make the bounded context productive.
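
A quick back-of-the-envelope on why those two numbers pair up, assuming output and input share the same server-side window (the numbers are the post's; the split is the assumption):

    CTX_SIZE=32768        # --ctx-size on the Llama.cpp server
    OUTPUT_CEILING=16384  # CLAUDE_CODE_MAX_TOKENS on the client
    echo $((CTX_SIZE - OUTPUT_CEILING))  # 16384 tokens left for system prompt, CLAUDE.md, tool results, file excerpts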

If you're running a different local stack, what token/sec are you seeing — or where's the bottleneck?

That's the question worth asking around. The local-Claude-Code field is moving fast enough that the sweet spot in May 2026 won't be the sweet spot in August. But the architectural patterns — env-var redirect, bridge proxy, deliberate context budgeting, structural CLAUDE.md — survive every model swap.

FAQ

7 questions answered

What is the Local Bridge Stack?
Local Bridge Stack is the configuration that lets the Anthropic Claude Code client talk to a Llama.cpp server running Gemma 4 31B on your own hardware. The three load-bearing pieces are: (1) ANTHROPIC_BASE_URL="http://localhost:8080/v1" to redirect API calls, (2) the free-claude-code proxy bridge to translate between formats, and (3) Llama.cpp launched with --ctx-size 32768 --cont-batching --api-type anthropic so the /v1/messages endpoint works natively. The output: autonomous coding at 22-28 tokens/sec on an RTX 4090, no API bill, no data leaving the machine.
Why run Claude Code locally instead of against Anthropic's API?
Three reasons. Privacy — code stays on your NVMe; no transcript ever crosses Anthropic's network. Cost — zero per-token spend, which compounds for heavy autonomous-coding workloads. Sovereignty — you keep working when Anthropic has an outage or rate-limits you. The trade-off is honest: 22-28 t/s is faster than typing but slower than cloud Claude, and the model behind it is Gemma 4 31B, not Sonnet 4.6. Choose by workload, not by ideology.
What's free-claude-code?
It's the proxy bridge that translates between Anthropic's API format and OpenAI-compatible local servers like Llama.cpp. Without it, the Claude Code client would refuse to talk to a non-Anthropic endpoint. Install it via uv (not pip — pre-v0.6.0 uv reportedly breaks the proxy's dependency resolution). Pair it with the env var redirect and Claude Code stops pretending it's a cloud app.
Gemma 4 31B Q4_K_M vs the 26B-A4B MoE — which one?
31B Q4_K_M is the practical logic-vs-speed sweet spot for most coding workloads. The 26B-A4B MoE is faster on average but loops more on harder refactors, which is exactly when you don't want a faster wrong answer. Default to 31B; switch to MoE only after you've measured your specific repo's behavior. Either way, set CLAUDE_CODE_MAX_TOKENS=16384 — Claude Code's default is 200k tokens, and a local model trying to allocate that buffer will wipe your run.
Does the local stack work for CCA-F-style scenarios?
Mostly yes, with one structural caveat. The Claude Code *workflows* the exam tests — Plan Mode, CLAUDE.md hierarchy, Skills, hooks, slash commands — all still work because the client is identical; the inference layer just moved. The caveat: Gemma 4 is not Sonnet 4.6, so capability-bound exam patterns (long-context reasoning, multi-step tool orchestration, nuanced refactoring) will degrade gracefully. The exam tests architectural choices, and the architecture you build with the local stack is the same architecture you'd build against the cloud API. The exam-relevant skills transfer.
What still breaks in this setup?
Three known failure modes worth tracking. (1) Big deletions over 500 lines still go weird — the model's reasoning about cross-file impact thins out at that scale. (2) Repo-wide refactors need explicit chunking via Plan Mode; don't ask for them in one prompt. (3) The 32768 context is enough for most files but not whole-repo analysis; pair with a root CLAUDE.md that maps the architecture so the model has the structural index even when individual file contents aren't in context.
What's the role of the root CLAUDE.md in this stack?
Anthropic-format tooling is smart, not psychic. A root CLAUDE.md that maps the repo architecture (top-level dirs, module roles, where the side-effects live, what the entry points are) gives the local model a *structural index* even when individual files aren't in context. That single file is often the difference between a useful local agent and a confused one. Worth 20 minutes to author per project; it's the highest-leverage configuration step after the env vars themselves.

Last reviewed 2026-05-12.

Blog post · D3 · Pillar 9

The Local Bridge Stack: Claude Code on Llama.cpp + Gemma 4 at 22-28 t/s, complete.

You've covered the full breakdown for this primitive: definition, mechanics, code, false positives, comparison, decision tree, exam patterns, and FAQ. One technical primitive down on the path to CCA-F.
