Sybil-2/docs/api/streaming-chat.md

Streaming Chat API Contract

This document defines the server-sent events (SSE) contract for chat completions.

Endpoint:

  • POST /v1/chat-completions/stream
  • POST /v1/chats/:chatId/stream/attach

Transport:

  • HTTP response uses Content-Type: text/event-stream; charset=utf-8
  • Events are emitted in SSE format (event: ..., data: ...)
  • Request body is JSON
  • Request body supports the same inline attachment schema and limits documented in docs/api/rest.md.

Authentication:

  • Same as REST endpoints (Authorization: Bearer <token> when token mode is enabled)

Request Body

{
  "chatId": "optional-chat-id",
  "persist": true,
  "provider": "openai|anthropic|xai|hermes-agent",
  "model": "string",
  "messages": [
    {
      "role": "system|user|assistant|tool",
      "content": "string",
      "name": "optional",
      "attachments": [
        {
          "kind": "image",
          "id": "attachment-id",
          "filename": "photo.jpg",
          "mimeType": "image/jpeg",
          "sizeBytes": 12345,
          "dataUrl": "data:image/jpeg;base64,..."
        },
        {
          "kind": "text",
          "id": "attachment-id",
          "filename": "notes.md",
          "mimeType": "text/markdown",
          "sizeBytes": 4567,
          "text": "# Notes\n...",
          "truncated": false
        }
      ]
    }
  ],
  "temperature": 0.2,
  "maxTokens": 256
}

Notes:

  • persist defaults to true.
  • If persist is true and chatId is omitted, backend creates a new chat.
  • If chatId is provided, backend validates it exists.
  • If persist is false, chatId must be omitted. Backend does not create a chat and does not persist input messages, tool-call messages, assistant output, or LlmCall metadata.
  • For persisted streams, backend stores only new non-assistant input history rows to avoid duplicates.
  • Attachments are optional and are persisted under message.metadata.attachments on stored user messages when persist is true.
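
The request body above can be described with illustrative TypeScript types. The type names are ours, not part of the API; the field names, unions, and optionality follow the schema and notes above:

```typescript
type Role = "system" | "user" | "assistant" | "tool";

interface ImageAttachment {
  kind: "image";
  id: string;
  filename: string;
  mimeType: string;
  sizeBytes: number;
  dataUrl: string; // data:<mime>;base64,...
}

interface TextAttachment {
  kind: "text";
  id: string;
  filename: string;
  mimeType: string;
  sizeBytes: number;
  text: string;
  truncated: boolean;
}

interface ChatMessage {
  role: Role;
  content: string;
  name?: string;
  attachments?: Array<ImageAttachment | TextAttachment>;
}

interface StreamRequestBody {
  chatId?: string;   // must be omitted when persist is false
  persist?: boolean; // defaults to true
  provider: "openai" | "anthropic" | "xai" | "hermes-agent";
  model: string;
  messages: ChatMessage[];
  temperature?: number;
  maxTokens?: number;
}

// Minimal non-persisted request.
const body: StreamRequestBody = {
  persist: false,
  provider: "openai",
  model: "gpt-4.1-mini",
  messages: [{ role: "user", content: "Hello" }],
};
```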

Persisted chat streams with a chatId are backend-owned active runs:

  • Once started, the backend keeps the stream running even if the HTTP client disconnects or refreshes.
  • While running, GET /v1/active-runs includes the chatId.
  • Starting a second persisted stream for the same active chatId returns 409.
  • Clients can reattach with POST /v1/chats/:chatId/stream/attach.

Attach Endpoint

POST /v1/chats/:chatId/stream/attach

  • Body: none.
  • Response uses the same text/event-stream transport and event names as POST /v1/chat-completions/stream.
  • Replays buffered events for the active in-memory stream, then emits new events until done or error.
  • Returns 404 { "message": "active chat stream not found" } if no stream is currently active for that chat.
  • Authentication is the same as all other API endpoints.

This endpoint is intended for clients that restored an active chatId from GET /v1/active-runs, especially after browser refresh. Replayed delta events may include text that was originally emitted before the client attached.

Event Stream Contract

Event order:

  1. Exactly one meta
  2. Zero or more tool_call
  3. Zero or more delta
  4. Exactly one terminal event: done or error
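
A client can check a completed stream against this ordering defensively. The sketch below is a local helper, not part of the API; it accepts exactly one leading meta, then tool_call events, then delta events, then one terminal done or error:

```typescript
// Returns true when the sequence matches: meta, tool_call*, delta*, (done | error)
function isValidEventOrder(events: string[]): boolean {
  if (events.length < 2) return false;
  if (events[0] !== "meta") return false;
  const last = events[events.length - 1];
  if (last !== "done" && last !== "error") return false;
  let sawDelta = false;
  for (const ev of events.slice(1, -1)) {
    if (ev === "tool_call") {
      // tool_call may not appear after the delta phase has begun.
      if (sawDelta) return false;
    } else if (ev === "delta") {
      sawDelta = true;
    } else {
      return false; // meta or terminal events are not allowed mid-stream
    }
  }
  return true;
}
```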

meta

{
  "type": "meta",
  "chatId": "chat-id-or-null",
  "callId": "llm-call-id-or-null",
  "provider": "openai",
  "model": "gpt-4.1-mini"
}

For persist: false streams, chatId and callId are null.

delta

{ "type": "delta", "text": "next chunk" }

text may contain partial words, punctuation, or whitespace.

tool_call

{
  "toolCallId": "call_123",
  "name": "web_search",
  "status": "completed",
  "summary": "Performed web search for 'latest CPI release'.",
  "args": { "query": "latest CPI release" },
  "startedAt": "2026-03-02T10:00:00.000Z",
  "completedAt": "2026-03-02T10:00:00.820Z",
  "durationMs": 820,
  "error": null,
  "resultPreview": "{\"ok\":true,...}"
}

done

{
  "type": "done",
  "text": "full assistant response",
  "usage": {
    "inputTokens": 123,
    "outputTokens": 456,
    "totalTokens": 579
  }
}

usage may be omitted when the provider does not expose final token accounting in streaming mode.

error

{ "type": "error", "message": "provider timeout" }

Provider Streaming Behavior

  • openai: backend uses OpenAI's Responses API and may execute internal function tool calls (web_search, fetch_url, optional codex_exec, and optional shell_exec) before producing final text.
  • xai: backend uses xAI's OpenAI-compatible Chat Completions API and may execute the same internal tool calls before producing final text.
  • hermes-agent: backend uses the configured Hermes Agent OpenAI-compatible Chat Completions API. Sybil does not add its own tool definitions for this provider; Hermes Agent handles its own tools server-side. Custom Hermes stream events are normalized away; only events that produce assistant text surface as delta events in this SSE contract.
  • openai: image attachments are sent as Responses input_image items; text attachments are sent as input_text items.
  • xai and hermes-agent: image attachments are sent as Chat Completions content parts; text attachments are inlined as text parts.
  • openai: Responses calls that can enter the server-managed tool loop use store: true so reasoning and function-call items can be passed between tool rounds.
  • anthropic: streamed via event stream; emits delta from content_block_delta with text_delta. Image attachments are sent as base64 image blocks and text attachments are appended as text blocks.
  • web_search uses CHAT_WEB_SEARCH_ENGINE (exa default, or searxng with SEARXNG_BASE_URL set). SearXNG mode requires the instance to allow format=json. This only affects chat-mode tool calls, not search-mode endpoints.
  • codex_exec is available only when CHAT_CODEX_TOOL_ENABLED=true. It SSHes to CHAT_CODEX_REMOTE_HOST, creates/uses CHAT_CODEX_REMOTE_WORKDIR, and runs codex exec --dangerously-bypass-approvals-and-sandbox --skip-git-repo-check <non-interactive wrapped prompt> there with SSH stdin closed. Prefer CHAT_CODEX_SSH_KEY_PATH with a read-only mounted private key; CHAT_CODEX_SSH_PRIVATE_KEY_B64 is also supported.
  • shell_exec is available only when CHAT_SHELL_TOOL_ENABLED=true. It uses the same devbox SSH configuration, starts in CHAT_CODEX_REMOTE_WORKDIR, and runs non-interactive shell commands there with SSH stdin closed, not inside the Sybil server container.
  • CHAT_MAX_TOOL_ROUNDS controls how many model/tool result cycles may occur before the backend returns a tool-call limit message; default is 100.
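
As a hedged illustration, the environment variables mentioned above might be set as follows. All values are placeholders; only the variable names come from this document:

```shell
# Placeholder values; variable names are from the notes above.
CHAT_WEB_SEARCH_ENGINE=searxng                       # or "exa" (the default)
SEARXNG_BASE_URL=https://searxng.example.internal    # required in searxng mode
CHAT_CODEX_TOOL_ENABLED=true                         # enables codex_exec
CHAT_SHELL_TOOL_ENABLED=true                         # enables shell_exec
CHAT_CODEX_REMOTE_HOST=devbox.example.internal
CHAT_CODEX_REMOTE_WORKDIR=/home/dev/sandbox
CHAT_CODEX_SSH_KEY_PATH=/run/secrets/codex_ssh_key   # preferred over CHAT_CODEX_SSH_PRIVATE_KEY_B64
CHAT_MAX_TOOL_ROUNDS=100                             # the default
```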

Tool-enabled streaming notes (openai/xai):

  • Stream still emits standard meta, delta, done|error events.
  • Stream may emit tool_call events while tool calls are executed.
  • delta events carry assistant text and are emitted incrementally for normal text rounds. The backend may buffer model-native text briefly while determining whether a provider round contains tool calls.
  • OpenAI Responses stream events are normalized by the backend into this SSE contract; clients do not consume OpenAI's raw Responses stream event names.

Persistence + Consistency Model

Backend database remains source of truth.

For persisted streams:

  • Client may optimistically render accumulated delta text.
  • Backend persists each completed tool call as a tool message before emitting its tool_call SSE event, so chat detail refreshes can show completed tool calls while the assistant response is still running.

On successful persisted completion:

  • Backend persists assistant Message and updates LlmCall usage/latency in a transaction.
  • Backend then emits done.

On persisted failure:

  • Backend records call error and emits error.

For persist: false streams:

  • Client may render the same meta, tool_call, delta, and terminal events.
  • Backend does not write any chat, message, tool-call log, assistant output, or call metadata rows.
  • done.text is the canonical assistant text if the client later imports the result into a saved chat.

Client recommendation (for iOS/web):

  1. Render deltas in real time for UX.
  2. On done, refresh chat detail from REST (GET /v1/chats/:chatId) and use DB-backed data as canonical.
  3. On error, preserve user input and show retry affordance.
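
The three steps can be sketched as a small event handler with the side effects injected as callbacks. Everything here is illustrative client code, not part of the API surface:

```typescript
interface ClientHooks {
  renderDelta(text: string): void; // step 1: live UX rendering
  refreshChatDetail(): void;       // step 2: e.g. GET /v1/chats/:chatId, DB-backed data wins
  showRetry(): void;               // step 3: preserve input, offer retry
}

function handleStreamEvent(
  ev: { type: string; text?: string; message?: string },
  hooks: ClientHooks,
): void {
  switch (ev.type) {
    case "delta":
      hooks.renderDelta(ev.text ?? "");
      break;
    case "done":
      hooks.refreshChatDetail();
      break;
    case "error":
      hooks.showRetry();
      break;
    default:
      // Ignore unknown event types for forward compatibility.
      break;
  }
}
```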

SSE Parsing Rules

  • Concatenate multiple data: lines with newline before JSON parse.
  • Event completes on blank line.
  • Ignore unknown event names for forward compatibility.
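
A minimal parser following these three rules might look like this. It is a sketch, not a spec-complete SSE implementation (for instance, it strips all leading whitespace after data:, where the SSE spec strips at most one space, and it ignores id: and retry: fields):

```typescript
interface SseEvent { event: string; data: string; }

// known: the event names this contract defines; anything else is dropped.
function parseSse(raw: string, known: Set<string>): SseEvent[] {
  const events: SseEvent[] = [];
  let name = "message";
  let dataLines: string[] = [];
  for (const line of raw.split("\n")) {
    if (line === "") {
      // Blank line completes the pending event.
      if (dataLines.length > 0 && known.has(name)) {
        // Multiple data: lines are joined with a newline before JSON parsing.
        events.push({ event: name, data: dataLines.join("\n") });
      }
      name = "message";
      dataLines = [];
    } else if (line.startsWith("event:")) {
      name = line.slice("event:".length).trim();
    } else if (line.startsWith("data:")) {
      dataLines.push(line.slice("data:".length).trimStart());
    }
    // Other fields (id:, retry:, comments) are ignored in this sketch.
  }
  return events;
}
```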

Example Stream

event: meta
data: {"type":"meta","chatId":"c1","callId":"k1","provider":"openai","model":"gpt-4.1-mini"}

event: delta
data: {"type":"delta","text":"Hello"}

event: delta
data: {"type":"delta","text":" world"}

event: done
data: {"type":"done","text":"Hello world"}