AuDesign Voice — Wire Protocol (v1)¶

JSON-over-WebSocket with interleaved binary frames for raw audio. Event names are inspired by the OpenAI Realtime API so existing clients port easily, but this is not a drop-in replacement.

Connection¶

Client POST /v1/sessions with API key, receives:

{ "session_id": "...", "ws_url": "wss://.../v1/voice", "token": "<JWT, 5 min TTL>" }

Client opens WS to ws_url with header Authorization: Bearer <token>.
Server replies with session.created.

Audio format¶

Direction	Format
Client → Server	PCM16 LE, mono, 16 kHz, sent as binary WebSocket frames (recommended) or base64 in `input_audio.append`
Server → Client	PCM16 LE, mono, 24 kHz, sent as binary frames prefixed with a 4-byte stream-id header

Future: Opus support (negotiated via session.update.audio_codec).

Events — Client → Server¶

`session.update`¶

Configure the session. May be sent any time; merged with current config.

{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful assistant for Dubai municipality...",
    "languages": ["ar-AE", "ar-SA", "en-US", "en-GB"],
    "voice": "en-US-AvaMultilingualNeural",
    "model": "gpt-4.1",
    "tools": [ { "type": "function", "function": { ... OpenAI tool schema ... } } ],
    "turn_detection": { "type": "server_vad", "silence_ms": 600 },
    "rag": { "index_name": "kb-municipality", "top_k": 5 }
  }
}

`input_audio.append`¶

Optional JSON wrapper if not using binary frames.

{ "type": "input_audio.append", "audio": "<base64 PCM16>" }

`input_audio.commit`¶

Force end-of-utterance (overrides VAD). Optional.

`response.cancel`¶

Abort the current assistant response (also triggered automatically on barge-in).

`tool.result`¶

{ "type": "tool.result", "call_id": "call_abc", "output": "{...json...}" }

`conversation.item.create`¶

Inject text into the conversation without going through STT.

{ "type": "conversation.item.create", "item": { "role": "user", "content": "..." } }

Events — Server → Client¶

`session.created`¶

{ "type": "session.created", "session_id": "...", "model": "gpt-4.1", "voice": "..." }

`vad.speech_started` / `vad.speech_stopped`¶

Server-side voice activity flags.

`transcript.delta`¶

Streaming partial recognition.

{ "type": "transcript.delta", "text": "ابي اعرف ", "language": "ar-AE" }

`transcript.final`¶

{ "type": "transcript.final", "text": "ابي اعرف what's the weather اليوم",
  "languages": ["ar-AE", "en-US"], "duration_ms": 2840 }

`response.text.delta`¶

Streaming assistant text (token deltas from the LLM).

`response.audio.delta`¶

Synthesized audio chunk. Sent as binary frame; the JSON event acts as metadata only when streaming via base64.

{ "type": "response.audio.delta", "stream_id": 7, "bytes": 4800 }

`response.audio.done`¶

End of one synthesized assistant turn.

`response.done`¶

Logical end of the assistant turn (after audio + text both flushed).

{ "type": "response.done", "stream_id": 7, "usage": { "prompt_tokens": ..., "completion_tokens": ... } }

`tool.call`¶

Mirrors OpenAI tool-call streaming.

{ "type": "tool.call", "call_id": "call_abc", "name": "get_weather",
  "arguments": "{ \"city\": \"Dubai\" }" }

`barge_in.detected`¶

Server detected user speech during TTS playback. TTS stream is cancelled.

{ "type": "barge_in.detected", "stream_id": 7, "spoken_chars": 42 }

`transcript.analysis`¶

Optional, when PII redaction or sentiment is enabled.

{ "type": "transcript.analysis", "pii": [{ "category": "PhoneNumber", "offset": 12, "length": 9 }],
  "sentiment": "neutral" }

`error`¶

{ "type": "error", "code": "stt_unavailable", "message": "..." }

State machine (server)¶

        ┌─────────────┐  speech_started     ┌────────────────┐
        │  IDLE/LISTEN├────────────────────▶│ USER_SPEAKING  │
        └─────────────┘                     └───────┬────────┘
              ▲                                     │ speech_stopped
              │                                     ▼
              │                          ┌─────────────────────┐
              │   response.done          │  THINKING (LLM)     │
              │◀─────────────────────────┤                     │
              │                          └──────────┬──────────┘
              │                                     │ first audio chunk
              │                                     ▼
              │                          ┌─────────────────────┐
              │  vad.speech_started      │  AGENT_SPEAKING     │
              ├────────────────── BARGE-IN ─────┤  (TTS playing) │
              │                          └─────────────────────┘
              │                                     │ audio.done
              └─────────────────────────────────────┘

Auth¶

JWT (HS256) with claims sub (tenant ID), sid (session ID), exp. Issued by /v1/sessions after API-key validation. WS upgrade rejects with 4401 close code on invalid/expired token.

Limits (v1)¶

Max session duration: 30 min (configurable per tenant)
Max audio frame size: 32 KB
Max concurrent sessions per tenant: configurable (default 5)

AuDesign Voice — Wire Protocol (v1)¶

Connection¶

Audio format¶

Events — Client → Server¶

session.update¶

input_audio.append¶

input_audio.commit¶

response.cancel¶

tool.result¶

conversation.item.create¶

Events — Server → Client¶

session.created¶

vad.speech_started / vad.speech_stopped¶

transcript.delta¶

transcript.final¶

response.text.delta¶

response.audio.delta¶

response.audio.done¶

response.done¶

tool.call¶

barge_in.detected¶

transcript.analysis¶

error¶