AuDesign Voice — Wire Protocol (v1)¶
JSON-over-WebSocket with interleaved binary frames for raw audio. Event names are inspired by the OpenAI Realtime API so existing clients port easily, but this is not a drop-in replacement.
Connection¶
- Client sends POST /v1/sessions with an API key and receives a response that includes ws_url and a token.
- Client opens a WebSocket to ws_url with the header Authorization: Bearer <token>.
- Server replies with session.created.
Audio format¶
| Direction | Format |
|---|---|
| Client → Server | PCM16 LE, mono, 16 kHz, sent as binary WebSocket frames (recommended) or base64 in input_audio.append |
| Server → Client | PCM16 LE, mono, 24 kHz, sent as binary frames prefixed with a 4-byte stream-id header |
Future: Opus support (negotiated via session.update.audio_codec).
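The server → client framing in the table can be sketched as follows. The spec does not state the byte order or signedness of the 4-byte stream-id header, so the big-endian unsigned layout below is an assumption, and the function names are hypothetical.

```python
import struct
from typing import Tuple

# ASSUMPTION: the stream-id header is an unsigned 32-bit big-endian integer.
STREAM_ID_HEADER = struct.Struct(">I")


def split_server_frame(frame: bytes) -> Tuple[int, bytes]:
    """Split a server -> client binary frame into (stream_id, pcm_bytes)."""
    if len(frame) < STREAM_ID_HEADER.size:
        raise ValueError("frame shorter than the 4-byte stream-id header")
    (stream_id,) = STREAM_ID_HEADER.unpack_from(frame, 0)
    pcm = frame[STREAM_ID_HEADER.size:]
    if len(pcm) % 2:
        raise ValueError("PCM16 payload must contain whole 2-byte samples")
    return stream_id, pcm


def pcm16_duration_ms(pcm: bytes, sample_rate_hz: int) -> float:
    """Duration of a mono PCM16 buffer in milliseconds."""
    return len(pcm) * 1000 / (2 * sample_rate_hz)


# 480 samples of silence at 24 kHz on stream 7.
frame = STREAM_ID_HEADER.pack(7) + b"\x00\x00" * 480
stream_id, pcm = split_server_frame(frame)
print(stream_id, pcm16_duration_ms(pcm, 24_000))  # 7 20.0
```

Client → server binary frames carry no header in the table, so raw PCM16 bytes would be sent as-is.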
Events — Client → Server¶
session.update¶
Configure the session. May be sent any time; merged with current config.
{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful assistant for Dubai municipality...",
    "languages": ["ar-AE", "ar-SA", "en-US", "en-GB"],
    "voice": "en-US-AvaMultilingualNeural",
    "model": "gpt-4.1",
    "tools": [ { "type": "function", "function": { ... OpenAI tool schema ... } } ],
    "turn_detection": { "type": "server_vad", "silence_ms": 600 },
    "rag": { "index_name": "kb-municipality", "top_k": 5 }
  }
}
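The spec says an update is merged with the current config but does not say how deep the merge goes. The sketch below assumes nested objects merge key by key while lists and scalars are replaced wholesale; `merge_session_config` is a hypothetical helper, not part of the protocol.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_session_config(current: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config with `update` merged into `current`.

    ASSUMPTION: nested objects merge recursively; lists and scalars are replaced.
    """
    merged = deepcopy(current)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_session_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


current = {
    "voice": "en-US-AvaMultilingualNeural",
    "turn_detection": {"type": "server_vad", "silence_ms": 600},
}
update = {"turn_detection": {"silence_ms": 400}, "languages": ["ar-AE", "en-US"]}
print(merge_session_config(current, update))
```

Under this assumption a client can change silence_ms alone without resending the turn_detection type. If the server merges only at the top level, the whole turn_detection object would need to be resent.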
input_audio.append¶
Optional JSON wrapper if not using binary frames.
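A minimal sketch of building this event in Python. The payload field name `audio` is an assumption, since the spec does not show the event body.

```python
import base64
import json


def input_audio_append_event(pcm: bytes) -> str:
    """Wrap raw PCM16 bytes in a JSON input_audio.append event."""
    return json.dumps({
        "type": "input_audio.append",
        # ASSUMPTION: the base64 payload lives in a field named "audio".
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


# 20 ms of silence at 16 kHz mono PCM16 is 320 samples, i.e. 640 bytes.
print(len(input_audio_append_event(b"\x00\x00" * 320)))
```

Base64 encodes every 3 bytes as 4 characters, so this path carries roughly a third more data than the equivalent binary frame.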
input_audio.commit¶
Force end-of-utterance (overrides VAD). Optional.
response.cancel¶
Abort the current assistant response (also triggered automatically on barge-in).
tool.result¶
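A client presumably answers a tool.call event (described under Events, Server → Client) by running the named tool and sending tool.result. The sketch below assumes tool.result echoes call_id and carries the tool output as a JSON string in a field named `output`; both field names, the `TOOLS` registry, and the `get_weather` stand-in are hypothetical.

```python
import json
from typing import Any, Callable, Dict


def get_weather(city: str) -> Dict[str, Any]:
    # Hypothetical stand-in; a real tool would query a weather service.
    return {"city": city, "forecast": "unknown"}


TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def handle_tool_call(event: Dict[str, Any]) -> Dict[str, Any]:
    """Run the tool named in a tool.call event and build a tool.result reply."""
    arguments = json.loads(event["arguments"])  # arguments arrive as a JSON string
    handler = TOOLS.get(event["name"])
    if handler is None:
        output: Any = {"error": "unknown tool: " + event["name"]}
    else:
        output = handler(**arguments)
    return {
        "type": "tool.result",
        "call_id": event["call_id"],    # ASSUMPTION: correlates with tool.call
        "output": json.dumps(output),   # ASSUMPTION: field name and encoding
    }


call = {"type": "tool.call", "call_id": "call_abc", "name": "get_weather",
        "arguments": "{ \"city\": \"Dubai\" }"}
print(handle_tool_call(call))
```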
conversation.item.create¶
Inject text into the conversation without going through STT.
Events — Server → Client¶
session.created¶
vad.speech_started / vad.speech_stopped¶
Server-side voice activity flags.
transcript.delta¶
Streaming partial recognition.
transcript.final¶
{ "type": "transcript.final", "text": "ابي اعرف what's the weather اليوم",
"languages": ["ar-AE", "en-US"], "duration_ms": 2840 }
response.text.delta¶
Streaming assistant text (token deltas from the LLM).
response.audio.delta¶
Synthesized audio chunk. Sent as a binary frame; the JSON event acts as metadata only when streaming via base64.
response.audio.done¶
End of one synthesized assistant turn.
response.done¶
Logical end of the assistant turn (after audio + text both flushed).
{ "type": "response.done", "stream_id": 7, "usage": { "prompt_tokens": ..., "completion_tokens": ... } }
tool.call¶
Mirrors OpenAI tool-call streaming.
{ "type": "tool.call", "call_id": "call_abc", "name": "get_weather",
"arguments": "{ \"city\": \"Dubai\" }" }
barge_in.detected¶
Server detected user speech during TTS playback. TTS stream is cancelled.
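The spec only says the server cancels the TTS stream. On the client side, audio already buffered for playback would keep playing unless it is dropped, so one plausible reaction is to flush the playback buffer when this event arrives. The sketch below assumes exactly that; `PlaybackQueue` and `handle_event` are hypothetical names.

```python
from collections import deque
from typing import Any, Deque, Dict, Optional, Tuple


class PlaybackQueue:
    """Client-side buffer of (stream_id, pcm) chunks awaiting playback."""

    def __init__(self) -> None:
        self._chunks: Deque[Tuple[int, bytes]] = deque()

    def push(self, stream_id: int, pcm: bytes) -> None:
        self._chunks.append((stream_id, pcm))

    def pop(self) -> Optional[Tuple[int, bytes]]:
        return self._chunks.popleft() if self._chunks else None

    def flush(self) -> int:
        """Drop everything still queued and return how many chunks were dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped


def handle_event(event: Dict[str, Any], queue: PlaybackQueue) -> None:
    # ASSUMPTION: the client discards all buffered audio on barge-in.
    if event.get("type") == "barge_in.detected":
        queue.flush()


queue = PlaybackQueue()
queue.push(7, b"\x00\x00" * 480)
queue.push(7, b"\x00\x00" * 480)
handle_event({"type": "barge_in.detected"}, queue)
print(queue.pop())  # None
```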
transcript.analysis¶
Optional; emitted when PII redaction or sentiment analysis is enabled.
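Each pii entry carries a category, an offset, and a length (see the sample payload at the end of this section). A client could use these spans to mask a transcript before logging it. The sketch assumes offsets and lengths count characters in the transcript text, which the spec does not confirm; `redact` is a hypothetical helper.

```python
from typing import Any, Dict, List


def redact(text: str, pii: List[Dict[str, Any]]) -> str:
    """Replace each PII span in `text` with a [Category] placeholder.

    ASSUMPTION: offset and length are measured in characters of `text`.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(pii, key=lambda s: s["offset"], reverse=True):
        start = span["offset"]
        end = start + span["length"]
        text = text[:start] + "[" + span["category"] + "]" + text[end:]
    return text


spans = [{"category": "PhoneNumber", "offset": 13, "length": 9}]
print(redact("my number is 050123456 thanks", spans))
# my number is [PhoneNumber] thanks
```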
{ "type": "transcript.analysis", "pii": [{ "category": "PhoneNumber", "offset": 12, "length": 9 }],
"sentiment": "neutral" }
error¶
State machine (server)¶
┌─────────────┐   speech_started    ┌────────────────┐
│ IDLE/LISTEN ├────────────────────▶│ USER_SPEAKING  │
└─────────────┘                     └───────┬────────┘
       ▲                                    │ speech_stopped
       │                                    ▼
       │                         ┌─────────────────────┐
       │      response.done      │   THINKING (LLM)    │
       │◀────────────────────────┤                     │
       │                         └──────────┬──────────┘
       │                                    │ first audio chunk
       │                                    ▼
       │                         ┌─────────────────────┐
       │   vad.speech_started    │   AGENT_SPEAKING    │
       ├──────── BARGE-IN ───────┤   (TTS playing)     │
       │                         └──────────┬──────────┘
       │                                    │ audio.done
       └────────────────────────────────────┘
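The diagram can be read as a transition table. The sketch below encodes the edges as drawn, including the barge-in edge returning to IDLE/LISTEN. The trigger strings follow the diagram labels rather than wire event names, and the rule that unknown triggers leave the state unchanged is an assumption.

```python
from enum import Enum
from typing import Dict, Tuple


class State(Enum):
    IDLE = "IDLE/LISTEN"
    USER_SPEAKING = "USER_SPEAKING"
    THINKING = "THINKING"
    AGENT_SPEAKING = "AGENT_SPEAKING"


# (current state, trigger) -> next state, one entry per arrow in the diagram.
TRANSITIONS: Dict[Tuple[State, str], State] = {
    (State.IDLE, "speech_started"): State.USER_SPEAKING,
    (State.USER_SPEAKING, "speech_stopped"): State.THINKING,
    (State.THINKING, "first_audio_chunk"): State.AGENT_SPEAKING,
    (State.THINKING, "response.done"): State.IDLE,
    (State.AGENT_SPEAKING, "speech_started"): State.IDLE,  # barge-in
    (State.AGENT_SPEAKING, "audio.done"): State.IDLE,
}


def step(state: State, trigger: str) -> State:
    # ASSUMPTION: triggers with no matching arrow are ignored.
    return TRANSITIONS.get((state, trigger), state)


state = State.IDLE
for trigger in ["speech_started", "speech_stopped", "first_audio_chunk", "speech_started"]:
    state = step(state, trigger)
print(state.value)  # IDLE/LISTEN
```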
Auth¶
JWT (HS256) with claims sub (tenant ID), sid (session ID), and exp (expiry). Issued by /v1/sessions after API-key validation. The WS upgrade is rejected with close code 4401 if the token is invalid or expired.
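As an illustration of the check behind that 4401 rejection, here is a stdlib-only HS256 sketch. It is not the server's implementation: how the signing secret is provisioned is not specified, so `secret` is a hypothetical parameter, and `sign_token` exists only to make the example self-contained.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


class TokenError(Exception):
    """Raised when a token is malformed, tampered with, or expired."""


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit "=" padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _hs256(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_token(claims: Dict[str, Any], secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url_encode(_hs256(secret, signing_input))


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> Dict[str, Any]:
    """Return the claims if the signature and exp check out, else raise TokenError."""
    try:
        header_b64, claims_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(claims_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError as exc:
        raise TokenError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise TokenError("malformed token")
    if header.get("alg") != "HS256":
        raise TokenError("unexpected algorithm")
    if not hmac.compare_digest(signature, _hs256(secret, header_b64 + "." + claims_b64)):
        raise TokenError("bad signature")
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        raise TokenError("token expired")
    return claims


token = sign_token({"sub": "tenant-1", "sid": "sess-1", "exp": 2000}, b"demo-secret")
print(verify_token(token, b"demo-secret", now=1000)["sid"])  # sess-1
```

A server following this sketch would close the socket with code 4401 whenever `verify_token` raises `TokenError`.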
Limits (v1)¶
- Max session duration: 30 min (configurable per tenant)
- Max audio frame size: 32 KB
- Max concurrent sessions per tenant: configurable (default 5)
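To stay under the frame-size limit, a client can split captured audio before sending. The sketch assumes "32 KB" means 32 768 bytes and that the limit applies to the PCM payload of each client frame; neither is stated in the spec, and `split_pcm16` is a hypothetical helper.

```python
from typing import List

MAX_FRAME_BYTES = 32 * 1024  # ASSUMPTION: "32 KB" means 32 768 bytes
BYTES_PER_SAMPLE = 2         # PCM16


def split_pcm16(pcm: bytes, max_bytes: int = MAX_FRAME_BYTES) -> List[bytes]:
    """Split PCM16 audio into chunks of at most `max_bytes`, never cutting a sample."""
    step = max_bytes - (max_bytes % BYTES_PER_SAMPLE)
    if step <= 0:
        raise ValueError("max_bytes must hold at least one sample")
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]


chunks = split_pcm16(bytes(70_000))
print([len(chunk) for chunk in chunks])  # [32768, 32768, 4464]
```

At 16 kHz mono PCM16 (32 000 bytes per second), a 32 768-byte frame holds 1.024 s of audio, so a client streaming short frames of a few tens of milliseconds sits far below the limit.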