
# Streaming Chat API Contract

This document defines the server-sent events (SSE) contract for chat completions.

Endpoints:
- `POST /v1/chat-completions/stream`
- `POST /v1/chats/:chatId/stream/attach`

Transport:
- HTTP response uses `Content-Type: text/event-stream; charset=utf-8`
- Events are emitted in SSE format (`event: ...`, `data: ...`)
- Request body is JSON
- Request body supports the same inline attachment schema and limits documented in `docs/api/rest.md`.

Authentication:
- Same as REST endpoints (`Authorization: Bearer <token>` when token mode is enabled)

## Request Body

```json
{
  "chatId": "optional-chat-id",
  "persist": true,
  "provider": "openai|anthropic|xai",
  "model": "string",
  "messages": [
    {
      "role": "system|user|assistant|tool",
      "content": "string",
      "name": "optional",
      "attachments": [
        {
          "kind": "image",
          "id": "attachment-id",
          "filename": "photo.jpg",
          "mimeType": "image/jpeg",
          "sizeBytes": 12345,
          "dataUrl": "data:image/jpeg;base64,..."
        },
        {
          "kind": "text",
          "id": "attachment-id",
          "filename": "notes.md",
          "mimeType": "text/markdown",
          "sizeBytes": 4567,
          "text": "# Notes\\n...",
          "truncated": false
        }
      ]
    }
  ],
  "temperature": 0.2,
  "maxTokens": 256
}
```

Notes:
- `persist` defaults to `true`.
- If `persist` is `true` and `chatId` is omitted, backend creates a new chat.
- If `chatId` is provided, backend validates that it exists.
- If `persist` is `false`, `chatId` must be omitted. Backend does not create a chat and does not persist input messages, tool-call messages, assistant output, or `LlmCall` metadata.
- For persisted streams, backend stores only new non-assistant input history rows to avoid duplicates.
- Attachments are optional and are persisted under `message.metadata.attachments` on stored user messages when `persist` is `true`.
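
The `persist`/`chatId` rules above can be checked client-side before sending a request. The following is a minimal sketch, not the backend's validator: the function name `validate_stream_request` and the problem strings are hypothetical, and the server's actual error responses are not specified here.

```python
from typing import Any, Dict, List

VALID_PROVIDERS = {"openai", "anthropic", "xai"}
VALID_ROLES = {"system", "user", "assistant", "tool"}


def validate_stream_request(body: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the body looks valid.

    Hypothetical pre-flight check; the messages are illustrative only.
    """
    problems: List[str] = []
    persist = body.get("persist", True)  # `persist` defaults to true
    if not persist and body.get("chatId") is not None:
        problems.append("chatId must be omitted when persist is false")
    if body.get("provider") not in VALID_PROVIDERS:
        problems.append("provider must be one of openai, anthropic, xai")
    if not isinstance(body.get("model"), str):
        problems.append("model must be a string")
    for i, message in enumerate(body.get("messages", [])):
        if message.get("role") not in VALID_ROLES:
            problems.append(f"messages[{i}].role is not a known role")
    return problems
```

Whether `chatId` refers to an existing chat can only be decided by the backend, so this check does not attempt it.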

Persisted chat streams with a `chatId` are backend-owned active runs:
- Once started, the backend keeps the stream running even if the HTTP client disconnects or refreshes.
- While running, `GET /v1/active-runs` includes the `chatId`.
- Starting a second persisted stream for the same active `chatId` returns `409`.
- Clients can reattach with `POST /v1/chats/:chatId/stream/attach`.

## Attach Endpoint

`POST /v1/chats/:chatId/stream/attach`
- Body: none.
- Response uses the same `text/event-stream` transport and event names as `POST /v1/chat-completions/stream`.
- Replays buffered events for the active in-memory stream, then emits new events until `done` or `error`.
- Returns `404 { "message": "active chat stream not found" }` if no stream is currently active for that chat.
- Authentication is the same as all other API endpoints.

This endpoint is intended for clients that restored an active `chatId` from `GET /v1/active-runs`, especially after browser refresh. Replayed `delta` events may include text that was originally emitted before the client attached.
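
The `409`/`404` and replay semantics can be pictured with a toy in-memory model. This is an illustrative sketch only: `ActiveRunRegistry` and its methods are hypothetical names, the `200` success status is an assumption, and so is the choice to drop the buffer as soon as a terminal event is emitted.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, dict]  # (event name, JSON payload)


class ActiveRunRegistry:
    """Toy model of backend-owned active runs, keyed by chatId."""

    def __init__(self) -> None:
        self._buffers: Dict[str, List[Event]] = {}

    def start(self, chat_id: str) -> int:
        if chat_id in self._buffers:
            return 409  # a persisted stream is already active for this chat
        self._buffers[chat_id] = []
        return 200  # assumed success status

    def emit(self, chat_id: str, name: str, payload: dict) -> None:
        self._buffers[chat_id].append((name, payload))
        if name in ("done", "error"):
            # Assumption: a terminal event ends the active run immediately.
            del self._buffers[chat_id]

    def attach(self, chat_id: str) -> Tuple[int, List[Event]]:
        if chat_id not in self._buffers:
            return 404, []  # no active stream for that chat
        # Replay everything buffered so far; live events would follow.
        return 200, list(self._buffers[chat_id])

    def active_chat_ids(self) -> List[str]:
        return sorted(self._buffers)
```

Because the replay starts from the first buffered event, a reattaching client should rebuild its rendered text from scratch rather than append replayed `delta` text to what it showed before the refresh.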

## Event Stream Contract

Event order:
1. Exactly one `meta`
2. Zero or more `tool_call`
3. Zero or more `delta`
4. Exactly one terminal event: `done` or `error`
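
A test harness might check a finished stream against this order. The sketch below is one possible reading, with a hypothetical helper name `check_event_order`: it requires `meta` first and exactly one terminal event last, skips unknown event names, and deliberately does not enforce the relative order of `tool_call` and `delta` events.

```python
from typing import List

TERMINAL = {"done", "error"}
KNOWN = {"meta", "tool_call", "delta"} | TERMINAL


def check_event_order(names: List[str]) -> bool:
    """Check the event names of a completed stream."""
    known = [n for n in names if n in KNOWN]  # unknown names are ignored
    if len(known) < 2:
        return False  # at least `meta` plus one terminal event is required
    first, middle, last = known[0], known[1:-1], known[-1]
    return (
        first == "meta"
        and last in TERMINAL
        # No second `meta` and no early terminal event in between.
        and all(n in ("tool_call", "delta") for n in middle)
    )
```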

### `meta`

```json
{
  "type": "meta",
  "chatId": "chat-id-or-null",
  "callId": "llm-call-id-or-null",
  "provider": "openai",
  "model": "gpt-4.1-mini"
}
```

For `persist: false` streams, `chatId` and `callId` are `null`.

### `delta`

```json
{ "type": "delta", "text": "next chunk" }
```

`text` may contain partial words, punctuation, or whitespace.

### `tool_call`

```json
{
  "toolCallId": "call_123",
  "name": "web_search",
  "status": "completed",
  "summary": "Performed web search for 'latest CPI release'.",
  "args": { "query": "latest CPI release" },
  "startedAt": "2026-03-02T10:00:00.000Z",
  "completedAt": "2026-03-02T10:00:00.820Z",
  "durationMs": 820,
  "error": null,
  "resultPreview": "{\"ok\":true,...}"
}
```
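
In the example payload, `durationMs` matches the gap between `startedAt` and `completedAt`. The sketch below recomputes it from the timestamps; treating the two as always equal is an assumption drawn from this one example, and `elapsed_ms` is a hypothetical helper name.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    # A trailing "Z" is only accepted by fromisoformat on newer Python
    # versions, so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def elapsed_ms(tool_call: dict) -> int:
    """Milliseconds between startedAt and completedAt of a tool_call payload."""
    started = parse_timestamp(tool_call["startedAt"])
    completed = parse_timestamp(tool_call["completedAt"])
    return round((completed - started).total_seconds() * 1000)
```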

### `done`

```json
{
  "type": "done",
  "text": "full assistant response",
  "usage": {
    "inputTokens": 123,
    "outputTokens": 456,
    "totalTokens": 579
  }
}
```

`usage` may be omitted when the provider does not expose final token accounting for stream mode.
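
Since `usage` is optional, clients should read it defensively. A minimal sketch, with `parse_done` as a hypothetical helper name:

```python
import json
from typing import Optional, Tuple


def parse_done(data: str) -> Tuple[str, Optional[int]]:
    """Return (full text, total tokens or None) from a `done` data payload."""
    payload = json.loads(data)
    usage = payload.get("usage") or {}  # tolerate a missing or null `usage`
    return payload["text"], usage.get("totalTokens")
```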

### `error`

```json
{ "type": "error", "message": "provider timeout" }
```

## Provider Streaming Behavior

- `openai`: backend uses OpenAI's Responses API and may execute internal function tool calls (`web_search`, `fetch_url`, optional `codex_exec`, and optional `shell_exec`) before producing final text.
- `xai`: backend uses xAI's OpenAI-compatible Chat Completions API and may execute the same internal tool calls before producing final text.
- `openai`: image attachments are sent as Responses `input_image` items; text attachments are sent as `input_text` items.
- `xai`: image attachments are sent as Chat Completions content parts; text attachments are inlined as text parts.
- `openai`: Responses calls that can enter the server-managed tool loop use `store: true` so reasoning and function-call items can be passed between tool rounds.
- `anthropic`: streamed via event stream; emits `delta` from `content_block_delta` with `text_delta`. Image attachments are sent as base64 `image` blocks and text attachments are appended as `text` blocks.
- `web_search` uses `CHAT_WEB_SEARCH_ENGINE` (`exa` default, or `searxng` with `SEARXNG_BASE_URL` set). SearXNG mode requires the instance to allow `format=json`. This only affects chat-mode tool calls, not search-mode endpoints.
- `codex_exec` is available only when `CHAT_CODEX_TOOL_ENABLED=true`. It SSHes to `CHAT_CODEX_REMOTE_HOST`, creates/uses `CHAT_CODEX_REMOTE_WORKDIR`, and runs `codex exec --dangerously-bypass-approvals-and-sandbox --skip-git-repo-check <non-interactive wrapped prompt>` there with SSH stdin closed. Prefer `CHAT_CODEX_SSH_KEY_PATH` with a read-only mounted private key; `CHAT_CODEX_SSH_PRIVATE_KEY_B64` is also supported.
- `shell_exec` is available only when `CHAT_SHELL_TOOL_ENABLED=true`. It uses the same devbox SSH configuration, starts in `CHAT_CODEX_REMOTE_WORKDIR`, and runs non-interactive shell commands there with SSH stdin closed, not inside the Sybil server container.
- `CHAT_MAX_TOOL_ROUNDS` controls how many model/tool result cycles may occur before the backend returns a tool-call limit message; default is 100.
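
The round limit can be pictured as a bounded loop around model and tool steps. The sketch below is not the backend's implementation: `run_tool_loop`, the callback shapes, and the `TOOL_LIMIT_MESSAGE` text are all hypothetical placeholders.

```python
from typing import Callable, List, Optional, Tuple

# Placeholder text; the backend's actual limit message is not specified here.
TOOL_LIMIT_MESSAGE = "tool-call limit reached"

# A model step takes the tool results so far and returns
# (final_text, tool_requests); an empty request list means a final answer.
ModelStep = Callable[[List[dict]], Tuple[Optional[str], List[dict]]]


def run_tool_loop(
    model_step: ModelStep,
    run_tool: Callable[[dict], str],
    max_rounds: int = 100,  # mirrors the documented CHAT_MAX_TOOL_ROUNDS default
) -> str:
    history: List[dict] = []
    for _ in range(max_rounds):
        text, requests = model_step(history)
        if not requests:
            return text or ""  # the model produced its final text
        for request in requests:
            # Feed each tool result back into the next model round.
            history.append({"name": request["name"], "result": run_tool(request)})
    return TOOL_LIMIT_MESSAGE
```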

Tool-enabled streaming notes (`openai`/`xai`):
- Stream still emits standard `meta`, `delta`, `done|error` events.
- Stream may emit `tool_call` events while tool calls are executed.
- `delta` events carry assistant text and are emitted incrementally for normal text rounds. The backend may buffer model-native text briefly while determining whether a provider round contains tool calls.
- OpenAI Responses stream events are normalized by the backend into this SSE contract; clients do not consume OpenAI's raw Responses stream event names.

## Persistence + Consistency Model

Backend database remains source of truth.

For persisted streams:
- Client may optimistically render accumulated `delta` text.
- Backend persists each completed tool call as a `tool` message before emitting its `tool_call` SSE event, so chat detail refreshes can show completed tool calls while the assistant response is still running.

On successful persisted completion:
- Backend persists assistant `Message` and updates `LlmCall` usage/latency in a transaction.
- Backend then emits `done`.

On persisted failure:
- Backend records call error and emits `error`.

For `persist: false` streams:
- Client may render the same `meta`, `tool_call`, `delta`, and terminal events.
- Backend does not write any chat, message, tool-call log, assistant output, or call metadata rows.
- `done.text` is the canonical assistant text if the client later imports the result into a saved chat.

Client recommendation (for iOS/web):
1. Render deltas in real time for UX.
2. On `done`, refresh chat detail from REST (`GET /v1/chats/:chatId`) and use DB-backed data as canonical.
3. On `error`, preserve user input and show retry affordance.
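
The recommendation above maps onto a small client-side state update. This is a sketch under stated assumptions: `ChatViewState`, `apply_event`, and the status strings are hypothetical, and the REST refresh itself is left to the caller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChatViewState:
    draft: str                     # user input, kept so a retry stays possible
    rendered: str = ""             # optimistic text accumulated from deltas
    tool_calls: List[dict] = field(default_factory=list)
    status: str = "streaming"      # "streaming" | "needs_refresh" | "failed"
    error: str = ""


def apply_event(state: ChatViewState, name: str, payload: dict) -> ChatViewState:
    if name == "delta":
        state.rendered += payload["text"]
    elif name == "tool_call":
        state.tool_calls.append(payload)
    elif name == "done":
        state.rendered = payload["text"]  # full assistant response
        state.status = "needs_refresh"    # caller should now GET /v1/chats/:chatId
    elif name == "error":
        state.status = "failed"           # draft is untouched for retry
        state.error = payload["message"]
    return state  # `meta` and unknown names leave the state unchanged
```

For `persist: false` streams there is no chat to refresh, so a client would treat `done.text` as final instead of acting on `needs_refresh`.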

## SSE Parsing Rules

- Concatenate multiple `data:` lines with newline before JSON parse.
- Event completes on blank line.
- Ignore unknown event names for forward compatibility.
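
These rules fit in a short line-based parser. The following is a minimal sketch, assuming the caller supplies already-decoded text lines; `parse_sse` is a hypothetical name, and SSE fields other than `event:` and `data:` are skipped.

```python
import json
from typing import Iterable, Iterator, List, Tuple

KNOWN_EVENTS = {"meta", "tool_call", "delta", "done", "error"}


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event name, parsed JSON payload) for each known event."""
    event_name = "message"  # SSE default when no `event:` field is sent
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line completes the event.
            if data_lines and event_name in KNOWN_EVENTS:
                # Multiple `data:` lines are joined with newlines before parsing.
                yield event_name, json.loads("\n".join(data_lines))
            event_name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # SSE comment line
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Drop the single optional space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
```

An event that is not followed by a blank line is never yielded, so a truncated stream does not produce a partial payload.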

## Example Stream

```text
event: meta
data: {"type":"meta","chatId":"c1","callId":"k1","provider":"openai","model":"gpt-4.1-mini"}

event: delta
data: {"type":"delta","text":"Hello"}

event: delta
data: {"type":"delta","text":" world"}

event: done
data: {"type":"done","text":"Hello world"}

```