# Streaming

> Streaming returns tokens as they're generated for low-latency UX. Vault coverage thin; needs Phase 6 research.

**Domain:** D4 · Prompt Engineering (20% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/streaming
**Last reviewed:** 2026-05-04

## Quick stats

- **Use case:** low latency
- **Exam domain:** D4
- **Coverage tier:** C
- **Status:** stub
- **Action:** research

## What it is

Streaming is the capability to receive Claude's response token-by-token as Server-Sent Events (SSE) rather than waiting for the entire response. Add stream=True to messages.create(), and instead of a single response object, you get an HTTP event stream where each event represents a chunk of text or metadata. The user sees text appear word by word, creating a responsive chat experience instead of a blank screen for 10-30 seconds.

What makes streaming real-time is that the connection stays open while the model writes. Each token is emitted as a ContentBlockDelta event within milliseconds of being generated, allowing the client to render incrementally. The alternative (non-streaming) waits for the full response and ships it all at once, adding perceived latency. Streaming is especially valuable for long-form responses.

The event stream contains seven main types: MessageStart, ContentBlockStart, ContentBlockDelta (the actual text chunks), ContentBlockStop, MessageDelta, MessageStop, and optional error events. Production handling requires three guarantees: (1) graceful reconnection on network drop, (2) correct handling of tool_use blocks (they don't stream character-by-character), (3) cost tracking (you pay per token regardless).

The main risk is premature disconnection. If the client disconnects before completion, you're billed for everything generated up to that point, and any unbuffered output is lost. Mitigations: track stream state, buffer all received tokens, use exponential backoff on reconnection, log connection drops. Secondary risk: treating streaming as a cost optimization; it costs the same as non-streaming, so use it for UX.

## How it works

Request structure is identical to non-streaming, except stream=True. The SDK returns an iterator (Python) or async iterable (TypeScript) that yields events: for event in stream: process(event). From the caller's perspective this is fundamentally synchronous: you block reading events until more arrive.

Event payload structure is JSON. ContentBlockDelta events contain a delta field with text. ContentBlockStart signals block type (text or tool_use). MessageStart carries metadata (model, input token usage). MessageDelta carries the stop_reason and cumulative output token count; MessageStop signals the end of the stream (the SDK helpers attach the fully assembled message to it). Extract from ContentBlockDelta.delta.text, accumulate. Cost tracking happens at the end of the stream.
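
A minimal sketch of that raw loop, assuming the Python SDK with stream=True on messages.create() (prompt and model choice are illustrative):

```python
import anthropic

client = anthropic.Anthropic()

# Raw event stream: stream=True on create() instead of the stream() helper.
events = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize SSE in two sentences."}],
    stream=True,
)

text_parts = []
for event in events:
    if event.type == "message_start":
        pass                                         # metadata (model, input usage): log only
    elif event.type == "content_block_delta" and event.delta.type == "text_delta":
        text_parts.append(event.delta.text)          # accumulate
        print(event.delta.text, end="", flush=True)  # render incrementally
    elif event.type == "message_delta":
        # cumulative output tokens; the final one also carries stop_reason
        print(f"\n[output_tokens so far: {event.usage.output_tokens}]")
    elif event.type == "message_stop":
        break                                        # end of stream

full_text = "".join(text_parts)
```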

For tool_use blocks, the stream works differently. The ContentBlockStart event announces the block and carries its name and id; the input JSON then arrives spread across ContentBlockDelta events with type input_json_delta. Accumulate the JSON fragments incrementally, parse and validate only after ContentBlockStop, then execute. Most common mistake: trying to execute halfway through the JSON stream.
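
A sketch of the accumulate-then-parse pattern; handle_event would be fed raw events from a loop like the one above, and run_tool is a hypothetical stand-in for a real dispatcher:

```python
import json

def run_tool(name: str, tool_input: dict) -> None:
    print(f"[executing {name} with {tool_input}]")  # stand-in for a real dispatcher

# pending tool blocks, keyed by the stream's content block index
pending: dict[int, dict] = {}

def handle_event(event) -> None:
    if event.type == "content_block_start" and event.content_block.type == "tool_use":
        pending[event.index] = {"name": event.content_block.name,
                                "id": event.content_block.id,
                                "parts": []}
    elif event.type == "content_block_delta" and event.delta.type == "input_json_delta":
        pending[event.index]["parts"].append(event.delta.partial_json)
    elif event.type == "content_block_stop" and event.index in pending:
        block = pending.pop(event.index)
        # only now is the buffer guaranteed to be valid JSON
        tool_input = json.loads("".join(block["parts"]) or "{}")
        run_tool(block["name"], tool_input)
```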

Network handling is critical. Streaming uses server-sent events over a single long-lived HTTP connection that stays open for the whole generation, often tens of seconds. Wrap the loop in try/except: catch the SDK's connection errors (e.g. anthropic.APIConnectionError), log, decide retry or escalate. The SDK provides a context manager (with client.messages.stream(...) as stream:) that auto-closes the connection and handles cleanup.
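
A hedged sketch of that retry guidance, assuming the SDK's anthropic.APIConnectionError and the helper's text_stream iterator; note that each retry re-sends the request from scratch and is billed separately:

```python
import time

import anthropic

client = anthropic.Anthropic()

def stream_with_backoff(messages: list, max_retries: int = 3) -> str:
    """Retry dropped streams with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries + 1):
        buffered = []                            # tokens received this attempt
        try:
            with client.messages.stream(
                model="claude-opus-4-5",
                max_tokens=2048,
                messages=messages,
            ) as stream:
                for text in stream.text_stream:  # helper: text deltas only
                    buffered.append(text)
            return "".join(buffered)
        except anthropic.APIConnectionError as exc:
            partial = "".join(buffered)          # retained; a retry re-bills from scratch
            print(f"stream dropped after {len(partial)} chars: {exc}")
            if attempt == max_retries:
                raise                            # or: return partial, if it's usable
            time.sleep(2 ** attempt)
```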

## Where you'll see it in production

### Real-time chat UI in Streamlit or Flask

Web app uses stream=True, emits each text chunk via SSE to the frontend. JavaScript EventSource API consumes SSE automatically. User perceives instant feedback; same cost as non-streaming.
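A minimal Flask sketch of the relay (route, prompt, and framing choices are illustrative; JSON-encoding each delta keeps embedded newlines from breaking SSE framing):

```python
import json

import anthropic
from flask import Flask, Response

app = Flask(__name__)
client = anthropic.Anthropic()

@app.route("/stream")
def stream():
    def generate():
        with client.messages.stream(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": "Explain SSE briefly."}],
        ) as s:
            for text in s.text_stream:
                # one SSE frame per delta: "data: <payload>\n\n"
                yield f"data: {json.dumps(text)}\n\n"
        yield "event: done\ndata: {}\n\n"  # tell the client we're finished
    return Response(generate(), mimetype="text/event-stream")

# Browser side: new EventSource("/stream").onmessage = (e) => render(JSON.parse(e.data));
```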

### Live code generation in IDE extensions

VS Code extension calls Claude. With streaming, user sees the function appear line-by-line. Real incremental generation, not a fake typewriter effect. Extension inserts text into editor buffer as events arrive.

### Streaming agentic loops with tool results

Refund agent loop streams text ("Let me check the order..."), then a tool_use block (arrives complete, not streamed). Harness executes, appends result. Next turn streams reasoning again. Streaming improves perceived responsiveness even in multi-turn flows.
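A sketch of one such turn, with a hypothetical lookup_order executor standing in for the real tool; text streams live, while the tool call is read from the assembled final message:

```python
import anthropic

client = anthropic.Anthropic()

def lookup_order(order_id: str) -> str:
    """Hypothetical tool implementation."""
    return f"Order {order_id}: shipped"

def run_turn(messages: list, tools: list) -> bool:
    """Stream one assistant turn; return True if another turn is needed."""
    with client.messages.stream(
        model="claude-opus-4-5",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:      # "Let me check the order..." renders live
            print(text, end="", flush=True)
        final = stream.get_final_message()   # assembled message, incl. tool_use blocks

    messages.append({"role": "assistant", "content": final.content})
    if final.stop_reason != "tool_use":
        return False                         # done
    results = [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": lookup_order(**block.input),
        }
        for block in final.content
        if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
    return True                              # next turn streams reasoning again
```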

### Long-form document generation

Research assistant generates 5000-word report. Non-streaming = blank screen 20+ seconds. With streaming, user sees intro, sections, conclusions live. User can interrupt mid-generation without paying for full output.

## Code examples

### Streaming with full event handling

**Python:**

```python
import anthropic
client = anthropic.Anthropic()

def stream_response(user_msg: str, tools: list | None = None):
    accumulated_text = ""

    with client.messages.stream(
        model="claude-opus-4-5",
        max_tokens=2048,
        system="You are a helpful assistant.",
        tools=tools or [],
        messages=[{"role": "user", "content": user_msg}],
    ) as stream:
        for event in stream:
            if event.type == "content_block_delta":
                if event.delta.type == "text_delta":
                    text = event.delta.text
                    accumulated_text += text
                    print(text, end="", flush=True)  # Real-time output

            elif event.type == "message_stop":
                print(f"\n[Stop reason: {event.message.stop_reason}]")
                if event.message.stop_reason == "tool_use":
                    for block in event.message.content:
                        if block.type == "tool_use":
                            print(f"[Tool: {block.name}]")

    return accumulated_text
```

> Stream loop with text accumulation. tool_use blocks are read from the assembled final message, not parsed mid-stream. flush=True for real-time output.

**TypeScript:**

```typescript
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

async function streamResponse(userMsg: string, tools: Anthropic.Tool[] = []) {
  let accumulated = "";

  const stream = client.messages.stream({
    model: "claude-opus-4-5",
    max_tokens: 2048,
    system: "You are a helpful assistant.",
    tools,
    messages: [{ role: "user", content: userMsg }],
  });

  for await (const chunk of stream) {
    if (chunk.type === "content_block_delta" && chunk.delta.type === "text_delta") {
      const text = chunk.delta.text;
      accumulated += text;
      process.stdout.write(text);  // Real-time output
    }
    if (chunk.type === "message_stop") {
      console.log(`\n[Stop reason: ${chunk.message.stop_reason}]`);
    }
  }

  const final = await stream.finalMessage();  // assembled message: stop_reason + usage
  console.log(`\n[Stop reason: ${final.stop_reason}] [Output tokens: ${final.usage.output_tokens}]`);
  return accumulated;
}
```

> Async iteration over the stream. process.stdout.write for real-time output. finalMessage() returns the assembled message with stop_reason and token counts.

## Looks-right vs actually-wrong

| Looks right | Actually wrong |
|---|---|
| Streaming is cheaper than non-streaming. | Same cost. You pay per token regardless. Streaming is purely UX, not cost optimization. |
| If stream connection drops, retry the entire request. | Drop doesn't lose what was received. Client retains buffered text. Retrying wastes tokens on duplicate work. Log error, decide based on context. |
| Tool_use blocks stream character-by-character; parse JSON as it arrives. | The block's name arrives in ContentBlockStart; its input arrives as chunked input_json_delta events that must be fully accumulated before parsing. Parsing partial JSON fails. |
| Display each event to the user immediately. | Some events (MessageStart) are metadata, not displayable. Filter: display only ContentBlockDelta.text. Meta events go to logging. |
| Cancel a streaming request mid-stream by closing the connection. | Closing stops receiving, but you're billed for everything generated up to the disconnect, and generation may continue briefly server-side. Cancellation is not a reliable cost-control mechanism. |

## Comparison

| Aspect | Streaming | Non-streaming | Polling endpoint | WebSocket |
| --- | --- | --- | --- | --- |
| Time to first token | 100-200ms | Entire response time | Batch delay | Similar to streaming |
| Cost | Same per token | Same per token | Same | Same |
| Connection | HTTP SSE | Request/response | Repeated polling | Persistent TCP |
| UX | Real-time, progressive | Batch, instant or long wait | Polling jitter | Real-time, lowest latency |
| Complexity | Event loop, buffer | Simple | Poll interval tuning | Server upgrade |
| Best for | Chat UIs, long-form | Quick queries, APIs | Legacy systems | High-frequency real-time |

## Decision tree

1. **Response >1000 tokens (likely >10 seconds)?**
   - **Yes:** Stream to show progressive output and reduce perceived latency.
   - **No:** Non-streaming fine; instant response either way.

2. **User-facing chat or interactive interface?**
   - **Yes:** Stream. Improves UX dramatically.
   - **No:** Non-streaming acceptable for backend tasks.

3. **Agentic loop with tool_use blocks?**
   - **Yes:** Streaming still works; tool blocks arrive complete. Buffer until ContentBlockStop.
   - **No:** Simple text streaming.

4. **Need to reconnect on network failures?**
   - **Yes:** Implement exponential backoff on stream exception; buffer received tokens.
   - **No:** Single-shot, no retry needed.

5. **Bandwidth a constraint?**
   - **Yes:** Streaming doesn't help. Same total bytes. Use compression or prompt caching.
   - **No:** Streaming is purely UX.

## Exam-pattern questions

### Q1. Streaming reduces token cost: true?

No. Streaming costs the same per token. It's a UX feature (responsive text), not a cost optimization. The total tokens generated and billed are identical to non-streaming.

### Q2. Stream connection drops mid-response. Retry the entire request?

No. The drop doesn't lose what was already received. The client retains the buffered text. Retrying sends a new prompt and wastes tokens on duplicate work. Log the error, decide based on context.

### Q3. Tool_use blocks stream character-by-character: parse JSON as it arrives?

No. The block's name and id arrive in ContentBlockStart; the input arrives as chunked input_json_delta events that must be fully accumulated before parsing. Parsing partial JSON fails.

### Q4. Display every event to the user immediately: good UX?

No. Some events (MessageStart) are metadata, not displayable. Filter: display only ContentBlockDelta.text. Meta events go to logging or internal state tracking.

### Q5. Cancel a streaming request by closing the connection: stops the bill?

No. Closing stops receiving, but the request is still processed server-side and billed up to that point. Cancellation is not a cost-saving mechanism; budget for completion before opening the stream.

### Q6. Latency of the first streamed token?

~100-200ms from request to first ContentBlockDelta.text event. Similar to non-streaming first-token time. Streaming optimizes time-to-screen, not time-to-first-token.

### Q7. Streaming response shorter than 1 second: noticeable UX improvement?

No. Responses >3-5 seconds become noticeably more responsive with streaming. Shorter (<1 sec) shows negligible improvement. Use streaming for long-form responses, not quick queries.

### Q8. Emit streamed chunks to a browser client: which protocol?

Server-Sent Events (SSE). Server: open /stream endpoint, iterate Claude stream, response.write(event) each chunk. Client: const es = new EventSource('/stream'); es.onmessage = .... Simple, reliable, browser-native.

## FAQ

### Q1. Latency of first streamed token?

Typically 100-200ms from request to first ContentBlockDelta.text event, similar to non-streaming first-token time.

### Q2. Can I interrupt and use partial output?

Yes. Close connection, keep buffered text. Don't retry; you've already paid.

### Q3. Tool_use blocks in streaming?

Not character-by-character in the text sense: the block's name arrives up front in ContentBlockStart. Buffer its input_json_delta events, parse when ContentBlockStop fires.

### Q4. Reduce token cost with streaming?

No. Same cost. UX feature only.

### Q5. Stream drops 50% in?

Connection closes, buffered text retained. Request billed up to disconnection. Decide retry vs partial result.

### Q6. Convert streaming to non-streaming retroactively?

No. Once stream=True, you get events. Buffer all events to assemble complete response, but can't "un-stream".
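
A sketch of the closest equivalent with the Python helper, which buffers events for you and hands back the assembled message:

```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)   # incremental display
    message = stream.get_final_message()  # same shape as a non-streaming response

print(f"\n{message.stop_reason}, {message.usage.output_tokens} output tokens")
```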

### Q7. Minimum response time for streaming value?

Responses >3-5 seconds become noticeably more responsive. Shorter (<1 sec) shows negligible improvement.

### Q8. Emit chunks to a browser client?

Use Server-Sent Events. Server: /stream endpoint, iterate Claude stream, response.write(event) each chunk. Client: const es = new EventSource('/stream').

### Q9. Should streaming change how I parse tool_use blocks?

No. Even though text deltas arrive token-by-token, tool_use blocks are atomic. Accumulate input_json_delta events into a buffer keyed by index; finalize on the matching content_block_stop. Treating each delta as a complete tool call leads to malformed JSON crashes.

### Q10. How do I detect mid-stream that the model is stuck or repeating?

Watch the delta cadence: long gaps (>5s) between deltas often signal a stalled connection or a degenerate generation. Maintain a sliding window of the last N text deltas; if the model emits the same phrase 3+ times, terminate the stream. Don't wait for message_stop; you'll burn tokens on the loop.
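
A sketch of that watchdog (thresholds illustrative). Caveat: the gap check only runs when a delta arrives, so a fully hung connection additionally needs a client-side read timeout:

```python
import time
from collections import deque

import anthropic

client = anthropic.Anthropic()

STALL_SECONDS = 5    # illustrative threshold
WINDOW = 12          # sliding window of recent deltas

def guarded_stream(messages: list) -> str:
    recent: deque[str] = deque(maxlen=WINDOW)
    chunks: list[str] = []
    with client.messages.stream(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=messages,
    ) as stream:
        last = time.monotonic()
        for text in stream.text_stream:
            now = time.monotonic()
            if now - last > STALL_SECONDS:
                break            # slow cadence: stop reading; the context manager closes
            last = now
            chunks.append(text)
            recent.append(text)
            # crude repetition check: same delta 3+ times in the window
            if len(recent) == WINDOW and recent.count(text) >= 3:
                break
    return "".join(chunks)
```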

---

**Source:** https://claudearchitectcertification.com/concepts/streaming
**Vault sources:** ACP-T03 capabilities
**Last reviewed:** 2026-05-04

**Evidence tiers** — 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.
