# Streaming Chat API Contract
This document defines the server-sent events (SSE) contract for chat completions.
Endpoint:
- `POST /v1/chat-completions/stream`
Transport:
- HTTP response uses `Content-Type: text/event-stream; charset=utf-8`
- Events are emitted in SSE format (`event: ...`, `data: ...`)
- Request body is JSON
- Request body supports the same inline attachment schema and limits documented in `docs/api/rest.md`.
Authentication:
- Same as REST endpoints (`Authorization: Bearer <token>` when token mode is enabled)
## Request Body
```json
{
  "chatId": "optional-chat-id",
  "persist": true,
  "provider": "openai|anthropic|xai",
  "model": "string",
  "messages": [
    {
      "role": "system|user|assistant|tool",
      "content": "string",
      "name": "optional",
      "attachments": [
        {
          "kind": "image",
          "id": "attachment-id",
          "filename": "photo.jpg",
          "mimeType": "image/jpeg",
          "sizeBytes": 12345,
          "dataUrl": "data:image/jpeg;base64,..."
        },
        {
          "kind": "text",
          "id": "attachment-id",
          "filename": "notes.md",
          "mimeType": "text/markdown",
          "sizeBytes": 4567,
          "text": "# Notes\n...",
          "truncated": false
        }
      ]
    }
  ],
  "temperature": 0.2,
  "maxTokens": 256
}
```
Notes:
- `persist` defaults to `true`.
- If `persist` is `true` and `chatId` is omitted, backend creates a new chat.
- If `chatId` is provided, backend validates it exists.
- If `persist` is `false`, `chatId` must be omitted. Backend does not create a chat and does not persist input messages, tool-call messages, assistant output, or `LlmCall` metadata.
- For persisted streams, backend stores only new non-assistant input history rows to avoid duplicates.
- Attachments are optional and are persisted under `message.metadata.attachments` on stored user messages when `persist` is `true`.
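The request shape and the `persist`/`chatId` rules above can be sketched as a small client-side builder. This is an illustrative helper, not part of the contract; the function name, defaults, and chosen provider/model values are assumptions.

```typescript
// Sketch: build a request body for POST /v1/chat-completions/stream.
// Helper name and defaults are illustrative, not part of the contract.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  name?: string;
}

function buildStreamRequest(
  messages: ChatMessage[],
  opts: { chatId?: string; persist?: boolean } = {},
): string {
  const persist = opts.persist ?? true; // contract default
  if (!persist && opts.chatId !== undefined) {
    // Contract rule: persist:false streams must omit chatId.
    throw new Error("chatId must be omitted when persist is false");
  }
  return JSON.stringify({
    chatId: opts.chatId, // dropped by JSON.stringify when undefined
    persist,
    provider: "openai",
    model: "gpt-4.1-mini",
    messages,
    temperature: 0.2,
    maxTokens: 256,
  });
}
```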
## Event Stream Contract
Event order:
1. Exactly one `meta`
2. Zero or more `tool_call`
3. Zero or more `delta`
4. Exactly one terminal event: `done` or `error`
### `meta`
```json
{
  "type": "meta",
  "chatId": "chat-id-or-null",
  "callId": "llm-call-id-or-null",
  "provider": "openai",
  "model": "gpt-4.1-mini"
}
```
For `persist: false` streams, `chatId` and `callId` are `null`.
### `delta`
```json
{ "type": "delta", "text": "next chunk" }
```
`text` may contain partial words, punctuation, or whitespace.
### `tool_call`
```json
{
  "toolCallId": "call_123",
  "name": "web_search",
  "status": "completed",
  "summary": "Performed web search for 'latest CPI release'.",
  "args": { "query": "latest CPI release" },
  "startedAt": "2026-03-02T10:00:00.000Z",
  "completedAt": "2026-03-02T10:00:00.820Z",
  "durationMs": 820,
  "error": null,
  "resultPreview": "{\"ok\":true,...}"
}
```
### `done`
```json
{
  "type": "done",
  "text": "full assistant response",
  "usage": {
    "inputTokens": 123,
    "outputTokens": 456,
    "totalTokens": 579
  }
}
```
`usage` may be omitted when the provider does not expose final token accounting in stream mode.
### `error`
```json
{ "type": "error", "message": "provider timeout" }
```
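The event payloads above can be modeled on the client as a discriminated union on `type`. A possible TypeScript sketch; the type names themselves are illustrative, only the field names come from this contract.

```typescript
// Sketch: client-side types for the stream event payloads above.
interface MetaEvent {
  type: "meta";
  chatId: string | null; // null for persist:false streams
  callId: string | null; // null for persist:false streams
  provider: string;
  model: string;
}

interface DeltaEvent {
  type: "delta";
  text: string; // may be partial words, punctuation, or whitespace
}

interface DoneEvent {
  type: "done";
  text: string; // full assistant response
  usage?: { inputTokens: number; outputTokens: number; totalTokens: number };
}

interface StreamErrorEvent {
  type: "error";
  message: string;
}

type StreamEvent = MetaEvent | DeltaEvent | DoneEvent | StreamErrorEvent;

// The contract guarantees exactly one terminal event: done or error.
function isTerminal(e: StreamEvent): e is DoneEvent | StreamErrorEvent {
  return e.type === "done" || e.type === "error";
}
```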
## Provider Streaming Behavior
- `openai`: backend uses OpenAI's Responses API and may execute internal function tool calls (`web_search`, `fetch_url`, optional `codex_exec`, and optional `shell_exec`) before producing final text.
- `xai`: backend uses xAI's OpenAI-compatible Chat Completions API and may execute the same internal tool calls before producing final text.
- `openai`: image attachments are sent as Responses `input_image` items; text attachments are sent as `input_text` items.
- `xai`: image attachments are sent as Chat Completions content parts; text attachments are inlined as text parts.
- `openai`: Responses calls that can enter the server-managed tool loop use `store: true` so reasoning and function-call items can be passed between tool rounds.
- `anthropic`: streamed via event stream; emits `delta` from `content_block_delta` with `text_delta`. Image attachments are sent as base64 `image` blocks and text attachments are appended as `text` blocks.
- `web_search` uses `CHAT_WEB_SEARCH_ENGINE` (`exa` default, or `searxng` with `SEARXNG_BASE_URL` set). SearXNG mode requires the instance to allow `format=json`. This only affects chat-mode tool calls, not search-mode endpoints.
- `codex_exec` is available only when `CHAT_CODEX_TOOL_ENABLED=true`. It SSHes to `CHAT_CODEX_REMOTE_HOST`, creates/uses `CHAT_CODEX_REMOTE_WORKDIR`, and runs `codex exec --dangerously-bypass-approvals-and-sandbox --skip-git-repo-check <non-interactive wrapped prompt>` there with SSH stdin closed. Prefer `CHAT_CODEX_SSH_KEY_PATH` with a read-only mounted private key; `CHAT_CODEX_SSH_PRIVATE_KEY_B64` is also supported.
- `shell_exec` is available only when `CHAT_SHELL_TOOL_ENABLED=true`. It uses the same devbox SSH configuration, starts in `CHAT_CODEX_REMOTE_WORKDIR`, and runs non-interactive shell commands there with SSH stdin closed, not inside the Sybil server container.
- `CHAT_MAX_TOOL_ROUNDS` controls how many model/tool result cycles may occur before the backend returns a tool-call limit message; default is 100.
Tool-enabled streaming notes (`openai`/`xai`):
- Stream still emits standard `meta`, `delta`, `done|error` events.
- Stream may emit `tool_call` events while tool calls are executed.
- `delta` events carry assistant text and are emitted incrementally for normal text rounds. The backend may buffer model-native text briefly while determining whether a provider round contains tool calls.
- OpenAI Responses stream events are normalized by the backend into this SSE contract; clients do not consume OpenAI's raw Responses stream event names.
## Persistence + Consistency Model
Backend database remains source of truth.
For persisted streams:
- Client may optimistically render accumulated `delta` text.
- Backend persists each completed tool call as a `tool` message before emitting its `tool_call` SSE event, so chat detail refreshes can show completed tool calls while the assistant response is still running.
On successful persisted completion:
- Backend persists assistant `Message` and updates `LlmCall` usage/latency in a transaction.
- Backend then emits `done`.
On persisted failure:
- Backend records call error and emits `error`.
For `persist: false` streams:
- Client may render the same `meta`, `tool_call`, `delta`, and terminal events.
- Backend does not write any chat, message, tool-call log, assistant output, or call metadata rows.
- `done.text` is the canonical assistant text if the client later imports the result into a saved chat.
Client recommendation (for iOS/web):
1. Render deltas in real time for UX.
2. On `done`, refresh chat detail from REST (`GET /v1/chats/:chatId`) and use DB-backed data as canonical.
3. On `error`, preserve user input and show retry affordance.
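The three recommendations above can be sketched as a pure state reducer: deltas accumulate optimistically, `done.text` replaces the accumulation, and errors preserve what was rendered so a retry affordance can be shown. Names here are illustrative; the REST refresh on `done` is left to the caller.

```typescript
// Sketch of the client recommendation as a pure reducer over stream events.
type ChatStreamState = {
  text: string;
  status: "streaming" | "done" | "error";
  error?: string;
};

type Ev =
  | { type: "delta"; text: string }
  | { type: "done"; text: string }
  | { type: "error"; message: string };

function reduceStream(state: ChatStreamState, ev: Ev): ChatStreamState {
  switch (ev.type) {
    case "delta":
      // Optimistically append streamed text for real-time rendering.
      return { ...state, text: state.text + ev.text };
    case "done":
      // done.text is the full assistant response; prefer it over the
      // accumulation, then refresh chat detail from REST for canonical data.
      return { text: ev.text, status: "done" };
    case "error":
      // Preserve accumulated text so the UI can offer a retry.
      return { ...state, status: "error", error: ev.message };
  }
}
```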
## SSE Parsing Rules
- Join multiple `data:` lines with a newline before parsing the JSON payload.
- An event is terminated by a blank line.
- Ignore unknown event names for forward compatibility.
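The parsing rules above can be sketched as a pure function over a raw SSE chunk. This is a minimal sketch that assumes the input contains only complete events (each terminated by a blank line); a real client must buffer partial events across network chunks.

```typescript
// Sketch: parse a raw SSE string into (event name, data) pairs per the
// rules above. Assumes complete events; real clients buffer across chunks.
function parseSseEvents(raw: string): Array<{ event: string; data: string }> {
  const events: Array<{ event: string; data: string }> = [];
  let name = "message"; // SSE default event name
  let dataLines: string[] = [];
  for (const line of raw.split(/\r?\n/)) {
    if (line === "") {
      // A blank line completes the current event.
      if (dataLines.length > 0) {
        // Multiple data: lines are joined with a newline before JSON parsing.
        events.push({ event: name, data: dataLines.join("\n") });
      }
      name = "message";
      dataLines = [];
      continue;
    }
    const sep = line.indexOf(":");
    if (sep <= 0) continue; // comment line or field without value
    const field = line.slice(0, sep);
    let value = line.slice(sep + 1);
    if (value.startsWith(" ")) value = value.slice(1); // SSE strips one leading space
    if (field === "event") name = value;
    else if (field === "data") dataLines.push(value);
    // Other fields and unknown event names are ignored for forward compatibility.
  }
  return events;
}
```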
## Example Stream
```text
event: meta
data: {"type":"meta","chatId":"c1","callId":"k1","provider":"openai","model":"gpt-4.1-mini"}

event: delta
data: {"type":"delta","text":"Hello"}

event: delta
data: {"type":"delta","text":" world"}

event: done
data: {"type":"done","text":"Hello world"}
```