
Realtime Client

File: main_logic/omni_realtime_client.py

The OmniRealtimeClient manages the streaming connection to Realtime API providers (Qwen, OpenAI, Gemini, Step, GLM). All providers speak over a raw WebSocket except Gemini, which goes through the Google GenAI SDK.

Supported providers

Provider           Protocol           Notes
Qwen (DashScope)   WebSocket          Primary provider, most tested
OpenAI             WebSocket          GPT Realtime API
Step               WebSocket          Step Audio
GLM                WebSocket          Zhipu Realtime
Gemini             Google GenAI SDK   Uses the SDK wrapper, not a raw WebSocket

Key methods

connect()

Establishes a WebSocket connection to the provider's Realtime API endpoint.
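
A minimal connection sketch. The constructor arguments shown here (base_url, api_key, model) are assumptions for illustration; check the class definition for the actual signature and defaults.

```python
import asyncio

from main_logic.omni_realtime_client import OmniRealtimeClient

async def main() -> None:
    # Hypothetical constructor arguments; verify against the real signature.
    client = OmniRealtimeClient(
        base_url="wss://dashscope.aliyuncs.com/api-ws/v1/realtime",  # assumed Qwen endpoint
        api_key="sk-...",
        model="qwen-omni-turbo-realtime",  # assumed model name
    )
    await client.connect()  # opens the WebSocket and starts the session

asyncio.run(main())
```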

send_text(text)

Sends user text input to the LLM.
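
A brief usage sketch, assuming `client` is an already-connected OmniRealtimeClient as in the example above:

```python
async def ask(client) -> None:
    # Send one user turn as plain text; the reply arrives through the
    # event handlers described below.
    await client.send_text("Summarize what is currently on my screen.")
```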

send_audio(audio_bytes, sample_rate)

Streams user audio chunks to the LLM. Audio is sent as raw PCM data.
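
A sketch of chunked streaming from a file. The 16 kHz, 16-bit mono format and the 100 ms chunk size are assumptions; match whatever your capture pipeline actually produces.

```python
async def stream_pcm_file(client, path: str) -> None:
    CHUNK_BYTES = 3200  # 100 ms at 16 kHz, 16-bit mono (0.1 * 16000 * 2)
    with open(path, "rb") as f:
        # Read raw PCM and forward each chunk as it becomes available.
        while chunk := f.read(CHUNK_BYTES):
            await client.send_audio(chunk, sample_rate=16000)
```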

send_screenshot(base64_data)

Sends a screenshot for multi-modal understanding. Rate-limited by NATIVE_IMAGE_MIN_INTERVAL (1.5s default).
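
A sketch of the encoding step, assuming the caller already holds the capture as PNG bytes; only the base64 handling is shown.

```python
import base64

async def send_capture(client, png_bytes: bytes) -> None:
    # The method takes base64 text, not raw bytes. Frames arriving faster
    # than NATIVE_IMAGE_MIN_INTERVAL are throttled (see "Image throttling" below).
    await client.send_screenshot(base64.b64encode(png_bytes).decode("ascii"))
```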

Event handlers

Event                    Purpose
on_text_delta()          Streamed text response from the LLM
on_audio_delta()         Streamed audio response from the LLM
on_input_transcript()    The user's speech transcribed to text (STT)
on_output_transcript()   Transcript of the LLM's audio output
on_interrupt()           The user interrupted the LLM's output
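
One plausible way to wire these handlers, assuming they are plain callables assigned on the client; the actual registration mechanism (constructor kwargs, decorators, etc.) may differ.

```python
import asyncio

playback_queue: asyncio.Queue = asyncio.Queue()  # hypothetical audio playback buffer

def on_text(delta: str) -> None:
    print(delta, end="", flush=True)  # stream the assistant's text to stdout

def on_audio(pcm: bytes) -> None:
    playback_queue.put_nowait(pcm)    # hand PCM chunks to the player

def on_interrupt() -> None:
    # Drop any queued audio so playback stops as soon as the user barges in.
    while not playback_queue.empty():
        playback_queue.get_nowait()

client.on_text_delta = on_text
client.on_audio_delta = on_audio
client.on_interrupt = on_interrupt
```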

Turn detection

The client uses server-side VAD (Voice Activity Detection) by default. The LLM provider decides when the user has finished speaking, enabling natural conversation turn-taking.
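
For reference, this is the shape of the session.update payload that enables server VAD in the OpenAI Realtime API; the field names follow OpenAI's protocol, other providers accept similar but not identical shapes, and the numeric values are commonly cited defaults rather than values confirmed from this client's source.

```python
# OpenAI-style turn detection configuration sent over the WebSocket.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,            # speech-probability cutoff
            "prefix_padding_ms": 300,    # audio kept from just before speech starts
            "silence_duration_ms": 500,  # silence that ends the user's turn
        }
    },
}
```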

Image throttling

Screen captures are rate-limited to avoid overwhelming the API; a sketch of the rule follows the list:

  • Active speaking: images are sent at most every NATIVE_IMAGE_MIN_INTERVAL seconds (1.5 s)
  • Idle (no voice): the interval is multiplied by IMAGE_IDLE_RATE_MULTIPLIER (5x, i.e. 7.5 s)
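
The throttling rule as described above, as a sketch: the constant names come from the configuration, but the surrounding bookkeeping is illustrative, not the client's actual implementation.

```python
import time

NATIVE_IMAGE_MIN_INTERVAL = 1.5   # seconds between frames while the user speaks
IMAGE_IDLE_RATE_MULTIPLIER = 5    # slow-down factor when no voice is detected

def min_frame_interval(user_speaking: bool) -> float:
    if user_speaking:
        return NATIVE_IMAGE_MIN_INTERVAL                           # 1.5 s
    return NATIVE_IMAGE_MIN_INTERVAL * IMAGE_IDLE_RATE_MULTIPLIER  # 7.5 s

def should_send(last_sent_monotonic: float, user_speaking: bool) -> bool:
    # Send only if enough time has passed since the last frame.
    return time.monotonic() - last_sent_monotonic >= min_frame_interval(user_speaking)
```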
