Sybil-2/docs/api/streaming-chat.md

# Streaming Chat API Contract

This document defines the server-sent events (SSE) contract for chat completions.

Endpoint:
- `POST /v1/chat-completions/stream`

Transport:
- HTTP response uses `Content-Type: text/event-stream; charset=utf-8`
- Events are emitted in SSE format (`event: ...`, `data: ...`)
- Request body is JSON

Authentication:
- Same as REST endpoints (`Authorization: Bearer <token>` when token mode is enabled)

## Request Body

```json
{
  "chatId": "optional-chat-id",
  "provider": "openai|anthropic|xai",
  "model": "string",
  "messages": [
    { "role": "system|user|assistant|tool", "content": "string", "name": "optional" }
  ],
  "temperature": 0.2,
  "maxTokens": 256
}
```

Notes:
- If `chatId` is omitted, the backend creates a new chat.
- If `chatId` is provided, the backend validates that the chat exists.
- To avoid duplicate history rows, the backend stores only input messages that are new and have a non-assistant role.
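
As an illustration of the request shape above, the following is a minimal sketch of a client-side helper that assembles the JSON body. The function name `build_request_body` is hypothetical, and treating `temperature` and `maxTokens` as optional is an assumption; the contract above does not state which fields are required.

```python
import json

# Values taken from the request body shown above.
ALLOWED_PROVIDERS = {"openai", "anthropic", "xai"}
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def build_request_body(provider, model, messages, chat_id=None,
                       temperature=None, max_tokens=None):
    """Serialize a request body (hypothetical helper, not part of the API)."""
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {message.get('role')!r}")

    body = {"provider": provider, "model": model, "messages": messages}
    # Leaving chatId out asks the backend to create a new chat.
    if chat_id is not None:
        body["chatId"] = chat_id
    # Assumption: sampling fields are optional and may be left out.
    if temperature is not None:
        body["temperature"] = temperature
    if max_tokens is not None:
        body["maxTokens"] = max_tokens
    return json.dumps(body)
```

Calling the helper without `chat_id` produces a body with no `chatId` key, which corresponds to the "create a new chat" case in the notes.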

## Event Stream Contract

Event order:
1. Exactly one `meta`
2. Zero or more `tool_call`
3. Zero or more `delta`
4. Exactly one terminal event: `done` or `error`
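
The ordering above can be checked mechanically, for example in a client test suite. The sketch below is one possible reading, not the backend's implementation: it assumes the listed phases never interleave (no `tool_call` after the first `delta`) and skips unknown event names. The function name `validate_event_order` is hypothetical.

```python
from typing import Iterable

# Rank of each event name in the documented order; both terminal events share a rank.
_PHASE = {"meta": 0, "tool_call": 1, "delta": 2, "done": 3, "error": 3}


def validate_event_order(names: Iterable[str]) -> bool:
    """Return True if the event names follow the documented order."""
    seen_meta = False
    terminated = False
    last_phase = -1
    for name in names:
        if name not in _PHASE:
            continue  # unknown event names are skipped, not rejected
        if terminated:
            return False  # nothing may follow the terminal event
        phase = _PHASE[name]
        if phase < last_phase:
            return False  # assumption: phases do not go backwards
        if name == "meta":
            if seen_meta:
                return False  # exactly one meta
            seen_meta = True
        elif not seen_meta:
            return False  # meta must come first
        if phase == 3:
            terminated = True
        last_phase = phase
    # A valid stream has both a meta and a terminal event.
    return seen_meta and terminated
```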

### `meta`

```json
{
  "type": "meta",
  "chatId": "chat-id",
  "callId": "llm-call-id",
  "provider": "openai",
  "model": "gpt-4.1-mini"
}
```

### `delta`

```json
{ "type": "delta", "text": "next chunk" }
```

`text` may contain partial words, punctuation, or whitespace.

### `tool_call`

```json
{
  "toolCallId": "call_123",
  "name": "web_search",
  "status": "completed",
  "summary": "Performed web search for 'latest CPI release'.",
  "args": { "query": "latest CPI release" },
  "startedAt": "2026-03-02T10:00:00.000Z",
  "completedAt": "2026-03-02T10:00:00.820Z",
  "durationMs": 820,
  "error": null,
  "resultPreview": "{\"ok\":true,...}"
}
```
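
In the example payload, `durationMs` matches the gap between `startedAt` and `completedAt`. The sketch below recomputes that gap from the two timestamps; that the backend derives `durationMs` this way is an assumption based only on the example, and `parse_timestamp` and `elapsed_ms` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def parse_timestamp(value: str) -> datetime:
    # Older Python versions reject a trailing "Z", so map it to an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def elapsed_ms(tool_call: dict) -> int:
    """Whole milliseconds between startedAt and completedAt."""
    started = parse_timestamp(tool_call["startedAt"])
    completed = parse_timestamp(tool_call["completedAt"])
    # Dividing two timedeltas with // yields an integer count of milliseconds.
    return (completed - started) // timedelta(milliseconds=1)
```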

### `done`

```json
{
  "type": "done",
  "text": "full assistant response",
  "usage": {
    "inputTokens": 123,
    "outputTokens": 456,
    "totalTokens": 579
  }
}
```

`usage` may be omitted when the provider does not expose final token accounting in stream mode.

### `error`

```json
{ "type": "error", "message": "provider timeout" }
```

## Provider Streaming Behavior

- `openai`: backend may execute internal tool calls (`web_search`, `fetch_url`) before producing final text.
- `xai`: same tool-enabled behavior as OpenAI.
- `anthropic`: streamed via event stream; emits `delta` from `content_block_delta` with `text_delta`.

Tool-enabled streaming notes (`openai`/`xai`):
- Stream still emits standard `meta`, `delta`, `done|error` events.
- Stream may emit `tool_call` events while tool calls are executed.
- `delta` events stream incrementally as text is generated.

## Persistence + Consistency Model

Backend database remains source of truth.

During stream:
- Client may optimistically render accumulated `delta` text.

On successful completion:
- Backend persists assistant `Message` and updates `LlmCall` usage/latency in a transaction.
- Backend then emits `done`.

On failure:
- Backend records call error and emits `error`.

Client recommendation (for iOS/web):
1. Render deltas in real time for UX.
2. On `done`, refresh chat detail from REST (`GET /v1/chats/:chatId`) and use DB-backed data as canonical.
3. On `error`, preserve user input and show retry affordance.
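
The client recommendation can be sketched as a small state reducer. This is an illustrative outline rather than a reference client: `StreamState`, `apply_event`, and the `needs_refresh` flag are hypothetical names, and the actual REST refresh is left to the caller.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StreamState:
    chat_id: Optional[str] = None
    text: str = ""                      # optimistic text built from deltas
    tool_calls: List[dict] = field(default_factory=list)
    status: str = "streaming"           # "streaming" | "done" | "error"
    error_message: Optional[str] = None
    needs_refresh: bool = False         # caller should GET /v1/chats/:chatId


def apply_event(state: StreamState, name: str, payload: dict) -> StreamState:
    """Fold one parsed stream event into the client-side state."""
    if name == "meta":
        state.chat_id = payload["chatId"]
    elif name == "tool_call":
        state.tool_calls.append(payload)
    elif name == "delta":
        state.text += payload["text"]   # render in real time
    elif name == "done":
        state.text = payload["text"]    # prefer the final text over accumulated deltas
        state.status = "done"
        state.needs_refresh = True      # DB-backed data is canonical
    elif name == "error":
        state.status = "error"          # keep the user's input; offer a retry in the UI
        state.error_message = payload["message"]
    # Unknown event names fall through and leave the state unchanged.
    return state
```

Once `needs_refresh` is set, the caller would fetch `GET /v1/chats/:chatId` and replace the optimistic text with the stored message.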

## SSE Parsing Rules

- Join multiple `data:` lines of one event with a newline (`\n`) before parsing the result as JSON.
- A blank line ends the current event.
- Ignore unknown event names for forward compatibility.
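
A minimal parser that follows these rules might look like the sketch below. It works on an iterable of already-decoded text lines; how those lines are read from the HTTP response is out of scope here. The names `parse_sse`, `known_events`, and `KNOWN_EVENTS` are hypothetical. Comment lines (starting with `:`) are skipped and an event without an `event:` field defaults to `message`, as in standard SSE.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple

# Event names defined by this contract; anything else is ignored.
KNOWN_EVENTS = {"meta", "tool_call", "delta", "done", "error"}


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event name, parsed JSON payload) for each complete event."""
    event_name: Optional[str] = None
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                payload = json.loads("\n".join(data_lines))
                yield (event_name or "message", payload)
            event_name, data_lines = None, []
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field_name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # strip the single optional space after the colon
        if field_name == "event":
            event_name = value
        elif field_name == "data":
            data_lines.append(value)


def known_events(lines: Iterable[str]) -> List[Tuple[str, dict]]:
    """Drop events whose name this contract does not define."""
    return [(n, p) for n, p in parse_sse(lines) if n in KNOWN_EVENTS]
```

An event that is still open when the input ends (no trailing blank line) is not yielded, so a truncated stream never produces a partial payload.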

## Example Stream

```text
event: meta
data: {"type":"meta","chatId":"c1","callId":"k1","provider":"openai","model":"gpt-4.1-mini"}

event: delta
data: {"type":"delta","text":"Hello"}

event: delta
data: {"type":"delta","text":" world"}

event: done
data: {"type":"done","text":"Hello world"}

```