Quick answer
The Local Bridge Stack = Claude Code client + ANTHROPIC_BASE_URL redirect + free-claude-code proxy + Llama.cpp + Gemma 4 31B. Output: 22-28 t/s autonomous coding on an RTX 4090, zero API bill, data stays local. Two non-obvious settings do the heavy lifting: the CLAUDE_CODE_MAX_TOKENS=16384 env var and the --api-type anthropic server flag. A root CLAUDE.md mapping the repo architecture is the highest-leverage configuration step after those two.
The setup, in 11 lines
Running Anthropic Claude Code with Llama.cpp and Gemma 4 has a new sweet spot. Before: weekend science project. After: a stack worth running every day.
- Route Claude Code to local inference with `ANTHROPIC_BASE_URL="http://localhost:8080/v1"` so Anthropic-format calls hit Llama.cpp.
- Set `ANTHROPIC_API_KEY="sk-local-token"` anyway. The bridge accepts any placeholder; the client refuses to start without one. Yes, it feels illegal.
- Install the modern stack with `uv`, not pip detours: `uv tool install @anthropic-ai/claude-code`.
- Don't use old `uv`. Pre-v0.6.0 reportedly breaks free-claude-code's proxy dependency resolution. Upgrade first.
- Run the bridge from `free-claude-code` so Claude Code can talk to a local server without pretending it's a cloud app.
- Start Llama.cpp with `--ctx-size 32768`. Smaller contexts look fine until repo analysis lands and everything catches fire.
- Add `--cont-batching`: Claude Code isn't one clean completion. It's a think/act loop, and batching matters.
- On newer Llama.cpp builds, use `--api-type anthropic` so `/v1/messages` works natively. Less glue code, fewer haunted bugs.
- Set `CLAUDE_CODE_MAX_TOKENS=16384`. Most people miss this. If you don't set it, Claude Code may try a 200k context and wipe your local run.
- Pick the right Gemma variant. `Gemma 4 31B Q4_K_M` is the practical logic/speed tradeoff; the `26B-A4B MoE` is faster but loops more on harder refactors.
- The payoff: Gemma 4 31B on an RTX 4090 reportedly runs autonomous coding at 22-28 tokens/sec, with data staying local instead of leaving your NVMe.
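The eleven steps above collapse into one launch script. A minimal sketch, assuming a `llama-server` binary on PATH and an illustrative model path; `--api-type anthropic` is the post's claim for newer builds, and the launch is guarded so the exports still apply on a machine without the binary:

```shell
# Local Bridge Stack launch sketch. Paths and the model filename are assumptions.

# Point the Claude Code client at the local bridge instead of Anthropic.
export ANTHROPIC_BASE_URL="http://localhost:8080/v1"
export ANTHROPIC_API_KEY="sk-local-token"   # any placeholder; the client just needs a non-empty key

# Bound the output budget so the client doesn't assume a 200k cloud context.
export CLAUDE_CODE_MAX_TOKENS=16384

# Serve the model: 32k context, continuous batching, native /v1/messages.
if command -v llama-server >/dev/null 2>&1; then
  llama-server \
    --model ./models/gemma-4-31b-Q4_K_M.gguf \
    --ctx-size 32768 \
    --cont-batching \
    --api-type anthropic \
    --port 8080
fi
```

Run `claude` from the same shell and its Anthropic-format calls land on port 8080 instead of the cloud.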
The bonus tip nobody mentions
Create a root CLAUDE.md mapping the repo architecture. Top-level dirs, module roles, where side-effects live, entry points. Anthropic-format tooling is smart, not psychic — and a structural index in CLAUDE.md gives the local model navigational grip even when individual file contents aren't in context. Twenty minutes of authoring; pays back across every session.
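A hypothetical skeleton for that file (directory names, module roles, and commands are illustrative, not taken from any real repo):

```markdown
# Project map

## Layout
- src/core/: domain logic, pure functions only
- src/adapters/: all side-effects live here (db, http, fs)
- src/cli/: entry points
- tests/: mirrors src/ one-to-one

## Conventions
- New side-effects go in src/adapters/, never src/core/
- Run the test suite before proposing a multi-file diff
```

The point isn't completeness; it's giving the model the skeleton it can't infer from the handful of files currently in context.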
Where it still goes weird
Three honest failure modes:
- Deletions over 500 lines get weird. The model's reasoning about cross-file impact thins out at that scale.
- Repo-wide refactors need explicit chunking via Plan Mode. Don't ask for them in one prompt.
- The 32768 context is enough for most files but not whole-repo analysis. Pair with the root `CLAUDE.md` above.
This isn't perfect. It's just good enough to be the default for most personal-productivity coding work.
How this shows up on the exam
D3 (Claude Code Configuration) repeatedly tests whether you understand that the client and the inference layer are separable concerns. Most candidates assume the Claude Code workflows (Plan Mode, CLAUDE.md hierarchy, Skills, hooks, slash commands) are tied to Anthropic's API. They're not — the protocols are public, the bridge layer is real, and the workflows survive a model swap. Exam questions that ask "where does configuration X live?" expect you to know it lives in the client and the file system, not in the API.
D5 (Context Management) tests the same pattern from the other side. The CLAUDE_CODE_MAX_TOKENS=16384 ceiling and --ctx-size 32768 are the worked example of deliberate context budgeting — you don't just take whatever the default is, you size the budget to the model and the task. The exam will distract you with "raise the context limit" answers; the correct architecture is to bound the budget and use Plan Mode + CLAUDE.md to make the bounded context productive.
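The budgeting arithmetic behind those two numbers is trivial but worth making explicit. A sketch in Python; the helper name is illustrative, not part of Claude Code:

```python
CTX_SIZE = 32_768      # llama.cpp --ctx-size: the model's real window
MAX_OUTPUT = 16_384    # CLAUDE_CODE_MAX_TOKENS: the output ceiling

def prompt_budget(ctx_size: int, max_output: int) -> int:
    """Tokens left over for CLAUDE.md, file contents, and conversation history."""
    budget = ctx_size - max_output
    if budget <= 0:
        raise ValueError("output ceiling leaves no room for input context")
    return budget

# Half the window is reserved for output; Plan Mode + CLAUDE.md have to
# make the remaining input tokens count.
print(prompt_budget(CTX_SIZE, MAX_OUTPUT))  # 16384
```

Seen this way, "raise the context limit" is the wrong axis: the budget is fixed by the model, so the winning move is spending the input half deliberately.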
If you're running a different local stack, what token/sec are you seeing — or where's the bottleneck?
That's the question worth asking around. The local-Claude-Code field is moving fast enough that the sweet spot in May 2026 won't be the sweet spot in August. But the architectural patterns — env-var redirect, bridge proxy, deliberate context budgeting, structural CLAUDE.md — survive every model swap.
Where this lands in the exam-prep map
Each blog post bridges into the evergreen pillars. These are the most relevant follow-ups for this story.
- Scenario: Code generation with Claude Code. The local stack runs the same Claude Code client; every Claude Code architecture question still applies, the inference layer just moves.
- Scenario: Developer productivity agent. Local inference is the highest-leverage privacy + cost lever for personal-productivity agents.
- Concept: CLAUDE.md hierarchy. Anthropic-format tooling is smart, not psychic; the repo-mapping CLAUDE.md tip in this post is a direct application of the hierarchy concept.
- Concept: Context window. The 32768 ctx-size + CLAUDE_CODE_MAX_TOKENS=16384 ceiling is a worked example of why context-window discipline matters in production.
7 questions answered
What is the Local Bridge Stack?
Three pieces: (1) ANTHROPIC_BASE_URL="http://localhost:8080/v1" to redirect API calls, (2) the free-claude-code proxy bridge to translate between formats, and (3) Llama.cpp launched with --ctx-size 32768 --cont-batching --api-type anthropic so the /v1/messages endpoint works natively. The output: autonomous coding at 22-28 tokens/sec on an RTX 4090, no API bill, no data leaving the machine.

Why run Claude Code locally instead of against Anthropic's API?
Zero API bill and zero data leaving the machine, at a throughput (22-28 tokens/sec on an RTX 4090) good enough for daily autonomous coding work.

What's free-claude-code?
The bridge proxy that lets Claude Code talk to a local server. Install it with uv (not pip; pre-v0.6.0 uv reportedly breaks the proxy's dependency resolution). Pair it with the env var redirect and Claude Code stops pretending it's a cloud app.

Gemma 4 31B Q4_K_M vs the 26B-A4B MoE — which one?
The 31B Q4_K_M is the practical logic/speed tradeoff; the 26B-A4B MoE is faster but loops more on harder refactors. Whichever you pick, set CLAUDE_CODE_MAX_TOKENS=16384: Claude Code's default is 200k tokens, and a local model trying to allocate that buffer will wipe your run.

Does the local stack work for CCA-F-style scenarios?
Yes. The workflows (Plan Mode, the CLAUDE.md hierarchy, Skills, hooks, slash commands) live in the client and the file system, not in Anthropic's API, so they survive the swap to local inference.

What still breaks in this setup?
Deletions over 500 lines, repo-wide refactors attempted in one prompt, and whole-repo analysis beyond the 32768 context. Mitigate the last with a root CLAUDE.md that maps the architecture so the model has the structural index even when individual file contents aren't in context.

What's the role of the root CLAUDE.md in this stack?
A root CLAUDE.md that maps the repo architecture (top-level dirs, module roles, where the side-effects live, what the entry points are) gives the local model a *structural index* even when individual files aren't in context. That single file is often the difference between a useful local agent and a confused one. Worth 20 minutes to author per project; it's the highest-leverage configuration step after the env vars themselves.

Synthesized from research output on 2026-05-12. LinkedIn cross-post pending.
Last reviewed 2026-05-12.
