Streaming (D4, 20% of CCA-F) - Claude Architect Concept

01 · Summary

TLDR

Streaming returns tokens as they are generated, for low-latency UX.

Tags: messages, stream, events

Use case: low latency
Exam domain: D4
Coverage tier: C
Status: stub
Action: research

02 · Definition

What it is

Streaming is the capability to receive Claude's response incrementally as Server-Sent Events (SSE) rather than waiting for the entire response. Pass stream=True to messages.create() and, instead of a single response object, you get an HTTP event stream in which each event carries a chunk of text or a piece of metadata. The user sees text appear a few words at a time, which makes chat feel responsive instead of showing a blank screen for 10-30 seconds.

What makes streaming feel real-time is that the connection stays open while the model writes. Each chunk of text is emitted as a ContentBlockDelta event (content_block_delta on the wire) shortly after it is generated, so the client can render incrementally. The non-streaming alternative waits for the full response and ships it all at once, so the user waits the full generation time before seeing anything. Streaming is especially valuable for long-form responses.

The event stream has six core event types: MessageStart, ContentBlockStart, ContentBlockDelta (the actual text chunks), ContentBlockStop, MessageDelta, and MessageStop, plus periodic ping events and optional error events. On the wire the type names are snake_case (message_start, content_block_delta, and so on). Production handling requires three guarantees: (1) graceful recovery on network drop, (2) correct handling of tool_use blocks (their input streams as partial JSON fragments that cannot be parsed until the block ends), (3) cost tracking (you pay per token regardless).
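To make the event order concrete, here is a minimal sketch that parses a hand-written sample of the SSE wire format into event dicts. The frames follow the shape of the events listed above, but the values (msg_demo, the token counts) are invented for illustration, and parse_sse is a hypothetical helper: in practice the SDK does this parsing for you.

```python
import json
from typing import Iterable, Iterator, List

def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON payload per SSE frame; a blank line ends a frame."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single optional space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # "event:" lines and ":" comments are skipped here because each
        # payload in this sample already carries its own "type" field.
    if data_parts:  # flush a final frame that lacks a trailing blank line
        yield json.loads("\n".join(data_parts))

# Hand-written sample stream (illustrative values only).
SAMPLE_WIRE = """\
event: message_start
data: {"type": "message_start", "message": {"id": "msg_demo", "role": "assistant", "content": [], "usage": {"input_tokens": 12, "output_tokens": 1}}}

event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}}

event: ping
data: {"type": "ping"}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}

event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 3}}

event: message_stop
data: {"type": "message_stop"}
"""

event_types = [event["type"] for event in parse_sse(SAMPLE_WIRE.splitlines())]
print(event_types)
```

Only the content_block_delta frame carries displayable text; everything else is bookkeeping.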

The main risk is premature disconnection. If the connection closes before completion, tokens generated so far are still billed, and any output the client did not buffer is lost. Mitigations: track stream state, buffer all received tokens, use exponential backoff when retrying, and log connection drops. A secondary risk is treating streaming as a cost optimization: it costs the same as non-streaming, so use it for UX.

03 · Mechanics

How it works

The request structure is identical to non-streaming, except for stream=True. The SDK returns an iterator (Python) or an async iterable (TypeScript) that yields events, so the core loop is: for event in stream: process(event). With the synchronous Python client the loop blocks until the next event arrives.

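Each event payload is JSON. ContentBlockStart signals the block type (text or tool_use). ContentBlockDelta events carry a delta field; for text blocks the chunk is in delta.text, which you extract and accumulate. MessageStart carries the message metadata (model, id, and initial usage including input tokens). MessageDelta carries stop_reason and the cumulative output token count. MessageStop marks the end of the stream; the Python SDK's messages.stream() helper also attaches the fully assembled message to that event. Do cost tracking from the usage fields once the stream ends.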
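The field layout described above can be sketched as a small fold over event dicts. This is an illustrative sketch, not SDK code: StreamResult and fold_events are hypothetical names, and the sample events are hand-written to mirror the wire format rather than captured from a real call.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class StreamResult:
    text: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    stop_reason: Optional[str] = None

def fold_events(events: Iterable[dict]) -> StreamResult:
    """Accumulate display text and usage from a sequence of stream events."""
    result = StreamResult()
    for event in events:
        kind = event["type"]
        if kind == "message_start":
            # Input token count is known as soon as the message starts.
            usage = event["message"].get("usage", {})
            result.input_tokens = usage.get("input_tokens", 0)
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "text_delta":
                result.text += delta["text"]
        elif kind == "message_delta":
            # stop_reason and the cumulative output count arrive here.
            result.stop_reason = event["delta"].get("stop_reason")
            usage = event.get("usage", {})
            result.output_tokens = usage.get("output_tokens", result.output_tokens)
        # content_block_start/stop, ping, and message_stop carry nothing to show.
    return result

# Hand-written sample events (illustrative values only).
SAMPLE_EVENTS = [
    {"type": "message_start", "message": {"usage": {"input_tokens": 12}}},
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "ping"},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": ", world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"},
     "usage": {"output_tokens": 5}},
    {"type": "message_stop"},
]

result = fold_events(SAMPLE_EVENTS)
print(result)
```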

For tool_use blocks, the stream works differently. ContentBlockStart carries the tool's id and name with an empty input; the input then arrives across one or more ContentBlockDelta events of type input_json_delta, each holding a partial_json string fragment. Accumulate the fragments, parse and validate only after ContentBlockStop, then execute. The most common mistake is trying to execute halfway through the JSON stream.
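A minimal sketch of that accumulate-then-parse rule, keyed by block index so interleaved blocks stay separate. collect_tool_calls is a hypothetical helper, and the tool name lookup_order, the id toolu_demo, and the sample events are made up for illustration.

```python
import json
from typing import Dict, Iterable, List

def collect_tool_calls(events: Iterable[dict]) -> List[dict]:
    """Buffer input_json_delta fragments and parse only at content_block_stop."""
    pending: Dict[int, dict] = {}
    finished: List[dict] = []
    for event in events:
        kind = event["type"]
        if kind == "content_block_start":
            block = event["content_block"]
            if block["type"] == "tool_use":
                pending[event["index"]] = {
                    "id": block["id"], "name": block["name"], "json": ""}
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "input_json_delta":
                # Fragments are not valid JSON on their own; just append.
                pending[event["index"]]["json"] += delta["partial_json"]
        elif kind == "content_block_stop" and event["index"] in pending:
            call = pending.pop(event["index"])
            finished.append({
                "id": call["id"],
                "name": call["name"],
                # An empty buffer means the tool takes no arguments.
                "input": json.loads(call["json"] or "{}"),
            })
    return finished

# Hand-written sample: a text block, then a tool_use block at index 1.
SAMPLE_EVENTS = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Let me check the order..."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_demo",
                       "name": "lookup_order", "input": {}}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": ""}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": "{\"order_id\": \"A-"}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": "100\"}"}},
    {"type": "content_block_stop", "index": 1},
]

tool_calls = collect_tool_calls(SAMPLE_EVENTS)
print(tool_calls)
```

Calling json.loads on the middle fragment alone would raise, which is the failure the paragraph above warns about.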

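Network handling is critical. Streaming uses a single long-lived HTTP response (Server-Sent Events), not long-polling; the connection stays open for as long as generation takes. Wrap the loop in try/except: catch the SDK's connection and API errors (anthropic.APIConnectionError, anthropic.APIStatusError), log, and decide whether to retry or escalate. The SDK provides a context manager, with client.messages.stream(...) as stream:, that closes the connection and handles cleanup automatically.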
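One way to sketch the retry decision is a wrapper that buffers text, backs off exponentially, and re-issues the request on a dropped connection. Everything here is an assumption for illustration: stream_with_retry and open_stream are hypothetical names, the built-in ConnectionError stands in for the SDK's connection error, and the sketch restarts the request from scratch rather than resuming it.

```python
import time
from typing import Callable, Iterable, List, Tuple

def stream_with_retry(
    open_stream: Callable[[], Iterable[str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, bool]:
    """Return (text, completed). On failure, text is the last partial buffer."""
    partial = ""
    for attempt in range(max_attempts):
        partial = ""  # each attempt is a fresh request
        try:
            for chunk in open_stream():
                partial += chunk  # buffer everything received so far
            return partial, True
        except ConnectionError:
            # With the real SDK you would catch its connection error here.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return partial, False

# Simulated stream that drops once, then succeeds.
attempts = {"count": 0}

def flaky_stream() -> Iterable[str]:
    attempts["count"] += 1
    yield "Hel"
    if attempts["count"] == 1:
        raise ConnectionError("simulated drop")
    yield "lo"

delays: List[float] = []
text, completed = stream_with_retry(flaky_stream, sleep=delays.append)
print(text, completed, delays)
```

Returning the partial buffer when all attempts fail lets the caller choose between showing a truncated answer and escalating.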

04 · In production

Where you'll see it

Real-time chat UI in Streamlit or Flask

Web app uses stream=True, emits each text chunk via SSE to the frontend. JavaScript EventSource API consumes SSE automatically. User perceives instant feedback; same cost as non-streaming.

Live code generation in IDE extensions

VS Code extension calls Claude. With streaming, user sees the function appear line-by-line. Real incremental generation, not a fake typewriter effect. Extension inserts text into editor buffer as events arrive.

Streaming agentic loops with tool results

Refund agent loop streams text ("Let me check the order..."), then a tool_use block (its input arrives as JSON fragments and is usable only once the block is complete). Harness executes the tool, appends the result. Next turn streams reasoning again. Streaming improves perceived responsiveness even in multi-turn flows.

Long-form document generation

Research assistant generates 5000-word report. Non-streaming = blank screen 20+ seconds. With streaming, user sees intro, sections, conclusions live. User can interrupt mid-generation and keep the partial text; tokens already generated are still billed.
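For the chat UI scenario above, the server has to re-frame each text chunk as an SSE message before the browser's EventSource can read it. A minimal sketch of that framing follows; format_sse and sse_frames are hypothetical helper names, and the JSON envelope with a "text" key and the "done" event name are choices made for this example, not a required format.

```python
import json
from typing import Iterable, Iterator, Optional

def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one SSE message: optional event line, data lines, blank line."""
    lines = [f"event: {event}"] if event else []
    # A payload containing newlines must be split across several data: lines.
    lines += [f"data: {part}" for part in data.split("\n")]
    return "\n".join(lines) + "\n\n"

def sse_frames(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap each text chunk in JSON so embedded newlines survive framing."""
    for chunk in chunks:
        yield format_sse(json.dumps({"text": chunk}))
    # A named event tells the browser the stream finished normally.
    yield format_sse("{}", event="done")

frames = list(sse_frames(["Hi", " there\n"]))
for frame in frames:
    print(repr(frame))
```

In a web framework, these strings would be written to a response whose Content-Type is text/event-stream; the client listens for the named "done" event with addEventListener.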

05 · Implementation

Code examples

Streaming with full event handling

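from typing import Optional

import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

def stream_response(user_msg: str, tools: Optional[list] = None) -> str:
    """Stream one reply, printing text as it arrives; return the full text."""
    accumulated_text = ""

    # The context manager closes the connection when the block exits.
    with client.messages.stream(
        model="claude-opus-4-5",
        max_tokens=2048,
        system="You are a helpful assistant.",
        tools=tools or [],
        messages=[{"role": "user", "content": user_msg}],
    ) as stream:
        for event in stream:
            if event.type == "content_block_delta":
                # Text arrives as text_delta; tool input arrives as
                # input_json_delta and is deliberately not printed here.
                if event.delta.type == "text_delta":
                    text = event.delta.text
                    accumulated_text += text
                    print(text, end="", flush=True)  # Real-time output

            elif event.type == "message_stop":
                # The stream helper attaches the assembled message here,
                # so every tool_use block is complete at this point.
                print(f"\n[Stop reason: {event.message.stop_reason}]")
                if event.message.stop_reason == "tool_use":
                    for block in event.message.content:
                        if block.type == "tool_use":
                            print(f"[Tool: {block.name}]")

    return accumulated_text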

Stream loop with text accumulation. Tool calls are read from the assembled message at message_stop, so each tool_use block is complete by then. flush=True forces real-time output.

06 · Distractor patterns

Looks right, isn't

Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.

Looks right

Streaming is cheaper than non-streaming.

Actually wrong

Same cost. You pay per token regardless. Streaming is purely UX, not cost optimization.

Looks right

If stream connection drops, retry the entire request.

Actually wrong

A drop doesn't lose what was already received: the client retains buffered text. A blind retry pays again for tokens you already have. Log the error, then decide based on context whether to keep the partial output or re-request.

Looks right

Tool_use blocks stream character-by-character; parse JSON as it arrives.

Actually wrong

The tool name arrives in ContentBlockStart, and the input arrives as input_json_delta fragments that must be fully accumulated before parsing. Parsing a partial fragment as JSON fails.

Looks right

Display each event to the user immediately.

Actually wrong

Some events (MessageStart, ping) are metadata, not displayable. Filter: display only the text from ContentBlockDelta events (delta.text). Metadata events go to logging.

Looks right

Cancel a streaming request mid-stream by closing connection.

Actually wrong

Closing stops receiving, but tokens already generated are still billed, and the server may not stop the instant the client disconnects. Don't design around cancellation as a cost-saving mechanism.

07 · Compare

Side-by-side

Aspect	Streaming	Non-streaming	Polling endpoint	WebSocket
Time to first token	100-200ms	Entire response time	Batch delay	Similar to streaming
Cost	Same per token	Same per token	Same	Same
Connection	HTTP SSE	Request/response	Repeated polling	Persistent TCP
UX	Real-time, progressive	Batch, instant or long wait	Polling jitter	Real-time, lowest latency
Complexity	Event loop, buffer	Simple	Poll interval tuning	Server upgrade
Best for	Chat UIs, long-form	Quick queries, APIs	Legacy systems	High-frequency real-time

08 · When to use

Decision tree

01

Response >1000 tokens (likely >10 seconds)?

Yes: Stream to show progressive output and reduce perceived latency.

No: Non-streaming fine; instant response either way.

02

User-facing chat or interactive interface?

Yes: Stream. Improves UX dramatically.

No: Non-streaming acceptable for backend tasks.

03

Agentic loop with tool_use blocks?

Yes: Streaming still works; tool input arrives as JSON fragments. Buffer until ContentBlockStop.

No: Simple text streaming.

04

Need to reconnect on network failures?

Yes: Implement exponential backoff on stream exception; buffer received tokens.

No: Single-shot, no retry needed.

05

Bandwidth a constraint?

Yes: Streaming doesn't help; the total bytes are the same. Use HTTP compression or shorter outputs instead.

No: Streaming is purely UX.

09 · On the exam

Question patterns

20 V2 questions are wired to this concept; six are shown here.

A customer-facing chatbot is wired to the Batch API to save money. What goes wrong in production?

Streaming reduces token cost. True or false?

A stream connection drops mid-response. Should you retry the entire request?

Tool_use blocks stream character-by-character. Should you parse the JSON as it arrives?

Should you display every streaming event to the user immediately?

Cancelling a streaming request by closing the connection: does it stop the bill?

14 additional questions for this concept live in the practice pillar.

10 · FAQ

Frequently asked

Latency of first streamed token?

Typically 100-200ms from request to the first ContentBlockDelta text event, though this varies with model, prompt length, and load. A non-streaming call shows nothing until the whole response is done.

Can I interrupt and use partial output?

Yes. Close connection, keep buffered text. Don't retry; you've already paid.

Tool_use blocks in streaming?

The tool name arrives in ContentBlockStart; the input arrives as input_json_delta fragments, not as finished JSON. Buffer the fragments and parse when ContentBlockStop fires.

Reduce token cost with streaming?

No. Same cost. UX feature only.

Stream drops 50% in?

Connection closes, buffered text retained. Request billed up to disconnection. Decide retry vs partial result.

Convert streaming to non-streaming retroactively?

No. Once stream=True, you get events. You can buffer all events to assemble the complete response (the Python SDK's stream helper offers get_final_message() for this), but you can't "un-stream" a request.

Minimum response time for streaming value?

Responses >3-5 seconds become noticeably more responsive. Shorter (<1 sec) shows negligible improvement.

Emit chunks to a browser client?

Use Server-Sent Events. Server: expose a /stream endpoint with Content-Type: text/event-stream, iterate the Claude stream, and write each chunk as a data: line followed by a blank line. Client: const es = new EventSource('/stream').

Should streaming change how I parse tool_use blocks?

No. Even though text deltas arrive token-by-token, tool_use blocks are atomic. Accumulate input_json_delta events into a buffer keyed by index; finalize on the matching content_block_stop. Treating each delta as a complete tool call leads to malformed JSON crashes.

How do I detect mid-stream that the model is stuck or repeating?

Watch the delta cadence: long gaps (>5s) between deltas can signal a stall, though network or server load can cause them too. Maintain a sliding window over the most recent text; if the model emits the same phrase 3+ times, terminate the stream. Don't wait for message_stop, or you'll burn tokens on the loop.
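The repetition check in the last answer can be sketched as follows. This is one possible variation, not an established recipe: it inspects the tail of the accumulated text rather than individual deltas (single tokens such as "the" repeat naturally), and the names has_repeated_tail and consume_until_repeat as well as the thresholds are assumptions chosen for the example.

```python
from typing import Iterable, Tuple

def has_repeated_tail(text: str, min_len: int = 8, max_len: int = 80,
                      repeats: int = 3) -> bool:
    """True if the text ends with one phrase repeated `repeats` times in a row."""
    for size in range(min_len, max_len + 1):
        if size * repeats > len(text):
            break  # not enough text for a phrase of this size
        unit = text[-size:]
        if text.endswith(unit * repeats):
            return True
    return False

def consume_until_repeat(chunks: Iterable[str], max_len: int = 80,
                         repeats: int = 3) -> Tuple[str, bool]:
    """Accumulate chunks; stop early once the tail starts looping."""
    text = ""
    window = max_len * repeats  # only the recent tail needs checking
    for chunk in chunks:
        text += chunk
        if has_repeated_tail(text[-window:], max_len=max_len, repeats=repeats):
            return text, True  # caller should close the stream here
    return text, False

# Simulated deltas: the phrase loops, split unevenly across chunks.
chunks = ["Let me check. ", "Let me ", "check. ", "Let me check. ", "unreached"]
text, stopped_early = consume_until_repeat(chunks)
print(repr(text), stopped_early)
```

With the SDK, returning early from inside the with client.messages.stream(...) block closes the connection, since the context manager handles cleanup.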

11 · Practice with AI

Work this with your AI

Work this concept hands-on with Claude Code, Codex, or claude.ai. Copy a prompt, paste it into your assistant, and practise in tandem. Each one keeps you active (explain it back, get drilled, or build) rather than just reading.

Drill it like the exam (scenario MCQs): Practice in the exam's scenario-MCQ format with trap awareness.
Explain it back (Feynman): Build durable, transferable understanding of a concept you can half-state.
Test me, adapting the difficulty: Active recall practice on a concept you think you know.
Check my prerequisites first: Before studying a concept that keeps not sticking.
Find the high-leverage 20%: When a domain feels too big and you are short on time.
