mathisarends/rtvoice

rtvoice

A Python library for building real-time voice agents powered by the OpenAI Realtime API. It handles the full session lifecycle — microphone input, WebSocket streaming, turn detection, tool calling, and audio playback — so you can focus on what your agent does, not how it talks.


Installation

pip install "rtvoice[audio]"

Requires Python 3.13+ and an OPENAI_API_KEY environment variable (or pass api_key= directly).


Quickstart

import asyncio
from rtvoice import RealtimeAgent

async def main():
    agent = RealtimeAgent(
        instructions="You are Jarvis, a concise and helpful voice assistant.",
    )
    await agent.run()

asyncio.run(main())

Run it, speak into your microphone, and the agent responds through your speakers. Press Ctrl+C to end the session.



Tool calling

Basic tools

Create a Tools instance, decorate functions with @tools.action(description), then pass the instance to RealtimeAgent. Both async and regular def functions are supported.

import asyncio
from rtvoice import RealtimeAgent, Tools

tools = Tools()

@tools.action("Get the current weather for a given city")
async def get_weather(city: str) -> str:
    return f"It's 18°C and partly cloudy in {city}."

async def main():
    agent = RealtimeAgent(
        instructions="Answer weather questions using get_weather.",
        tools=tools,
    )
    await agent.run()

asyncio.run(main())

Parameter types are inferred from the function signature and included in the schema sent to the model. All parameters without a default value are marked required.
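To make the inference rule concrete, here is a minimal standard-library sketch of how a schema could be derived from a signature. It is not rtvoice's actual implementation: build_schema, _JSON_TYPES, and get_forecast are hypothetical names used only for illustration, and the real schema may carry more detail.

```python
import inspect
from typing import Any, Callable, get_type_hints

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def build_schema(func: Callable[..., Any]) -> dict[str, Any]:
    """Derive a JSON-Schema-style parameter object from a function signature."""
    hints = get_type_hints(func)
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(func).parameters.items():
        annotation = hints.get(name, str)
        properties[name] = {"type": _JSON_TYPES.get(annotation, "string")}
        # A parameter without a default must be supplied by the model.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


async def get_forecast(city: str, days: int = 3) -> str:
    return f"{days}-day forecast for {city}"


schema = build_schema(get_forecast)
print(schema["required"])
```

In this sketch city ends up in required because it has no default, while days stays optional.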

Pydantic model tools

For richer schemas, register a Pydantic model with param_model=. The model fields become the tool parameters, and the function receives a validated model instance.

from typing import Literal

from pydantic import BaseModel, Field
from rtvoice import Tools

tools = Tools()

class CalendarSearchParams(BaseModel):
    query: str = Field(description="What to search for")
    date: str | None = Field(default=None, description="Optional ISO date filter")
    limit: int = Field(default=5, description="Maximum number of matches")
    source: Literal["work", "personal"] = "work"

@tools.action(
    "Search calendar events",
    param_model=CalendarSearchParams,
)
async def search_calendar(params: CalendarSearchParams) -> str:
    return await calendar.search(
        query=params.query,
        date=params.date,
        limit=params.limit,
        source=params.source,
    )

Nested Pydantic models, typed lists, enums, literals, defaults, and Field(description=...) are included in the generated tool schema.

Long-running tools

Set holding_instruction to have the assistant speak a phrase while the tool runs. The agent will say it immediately after calling the tool, before the result arrives.

@tools.action(
    "Search the web for a query",
    holding_instruction="Let me search that for you, give me a moment.",
)
async def search_web(query: str) -> str:
    result = await do_search(query)
    return result

Optionally add result_instruction to tell the model how to present the result once the tool returns:

@tools.action(
    "Fetch the latest headlines",
    holding_instruction="Fetching the news...",
    result_instruction="Summarise the headlines in two sentences.",
)
async def get_headlines() -> str: ...

Status templates

status is a spoken update for tools registered with param_model=. Use {field_name} placeholders from the Pydantic model — rtvoice validates them at registration time.

class PlaySongParams(BaseModel):
    song: str = Field(description="Song title")

@tools.action(
    "Play a song by name",
    param_model=PlaySongParams,
    status="Playing {song} now.",
)
async def play_song(params: PlaySongParams) -> str:
    await music_player.play(params.song)
    return f"Now playing: {params.song}"

status can also be a callable that receives the validated Pydantic model and returns a string dynamically.

@tools.action(
    "Play a song by name",
    param_model=PlaySongParams,
    status=lambda params: f"Playing {params.song} now.",
)
async def play_song(params: PlaySongParams) -> str:
    await music_player.play(params.song)
    return f"Now playing: {params.song}"
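The registration-time placeholder check mentioned above could look roughly like the following. This is a sketch under assumptions, not the library's code: validate_status_template is a hypothetical helper, and a plain set of names stands in for the fields of the Pydantic model.

```python
from string import Formatter


def validate_status_template(template: str, field_names: set[str]) -> None:
    """Raise ValueError if the template uses a placeholder that is not a known field."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    used = {name for _, name, _, _ in Formatter().parse(template) if name}
    unknown = used - field_names
    if unknown:
        raise ValueError(f"Unknown placeholders in status template: {sorted(unknown)}")


fields = {"song"}  # stands in for the field names of PlaySongParams
validate_status_template("Playing {song} now.", fields)  # passes silently

try:
    validate_status_template("Playing {track} now.", fields)
except ValueError as exc:
    print(exc)
```

Failing at registration rather than at call time means a typo in a template surfaces when the module is imported, not in the middle of a live voice session.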

Context injection

Any tool parameter typed as Inject[T] is filled automatically by the framework — the model never sees it and does not need to supply a value. Three types are injectable:

| Type | What it provides |
| --- | --- |
| Inject[EventBus] | Internal event bus |
| Inject[ConversationHistory] | Full conversation so far |
| Inject[YourContextType] | Your custom context= object |

from rtvoice import Tools, Inject
from rtvoice.conversation import ConversationHistory

tools = Tools()

@tools.action("Summarise the conversation so far")
async def summarise(
    history: Inject[ConversationHistory],
) -> str:
    text = history.format()
    return await llm.summarise(text)
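One way a framework can hide injected parameters from the model is to mark them with typing.Annotated and split the signature before building the schema. The sketch below illustrates that idea only: the Inject class here is a stand-in, and _InjectMarker, split_params, call_with_injection, and the toy ConversationHistory are hypothetical names, not rtvoice internals.

```python
import inspect
from typing import Annotated, Any, Callable, get_args, get_origin, get_type_hints


class _InjectMarker:
    """Sentinel stored in Annotated metadata to flag injected parameters."""


class Inject:
    """Stand-in marker: Inject[T] expands to Annotated[T, _InjectMarker]."""

    def __class_getitem__(cls, item: Any) -> Any:
        return Annotated[item, _InjectMarker]


def split_params(func: Callable[..., Any]) -> tuple[dict[str, Any], list[str]]:
    """Return (injected name -> type, names the model is allowed to see)."""
    hints = get_type_hints(func, include_extras=True)
    injected: dict[str, Any] = {}
    visible: list[str] = []
    for name in inspect.signature(func).parameters:
        hint = hints.get(name)
        if get_origin(hint) is Annotated and _InjectMarker in get_args(hint)[1:]:
            injected[name] = get_args(hint)[0]
        else:
            visible.append(name)
    return injected, visible


def call_with_injection(func, model_args: dict[str, Any], providers: dict[Any, Any]):
    """Merge model-supplied arguments with framework-supplied objects."""
    injected, _ = split_params(func)
    kwargs = dict(model_args)
    for name, wanted_type in injected.items():
        kwargs[name] = providers[wanted_type]
    return func(**kwargs)


class ConversationHistory:  # toy stand-in for the real class
    def __init__(self, lines: list[str]) -> None:
        self.lines = lines


def count_turns(prefix: str, history: Inject[ConversationHistory]) -> str:
    return f"{prefix}: {len(history.lines)} turns"


providers = {ConversationHistory: ConversationHistory(["hi", "hello"])}
print(call_with_injection(count_turns, {"prefix": "So far"}, providers))
```

Only prefix would appear in a schema built from the visible list; history is resolved by type from the providers mapping.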

Custom application context

Pass any object as context= on RealtimeAgent. It is then injectable in every tool via Inject[YourType].

from dataclasses import dataclass
from rtvoice import RealtimeAgent, Tools, Inject

@dataclass
class AppState:
    user_name: str
    premium: bool

tools = Tools()

@tools.action("Greet the user by name")
async def greet(state: Inject[AppState]) -> str:
    tier = "premium" if state.premium else "free"
    return f"Hello {state.user_name}, you are on the {tier} plan."

agent = RealtimeAgent(
    instructions="Greet the user when asked.",
    tools=tools,
    context=AppState(user_name="Alice", premium=True),
)

Supervisor

Delegate complex, multi-step tasks to an LLM-driven supervisor. The voice agent hands off, speaks a holding phrase, and presents the result when done.

from rtvoice import RealtimeAgent, Supervisor, Tools
from rtvoice.llm import ChatOpenAI

tools = Tools()

@tools.action("Book a restaurant table")
async def book_table(
    restaurant: str,
    date: str,
    time: str,
    party_size: int,
) -> str:
    return f"Booked for {party_size} at {restaurant} on {date} at {time}."

supervisor = Supervisor(
    description="Books restaurant tables on behalf of the user.",
    holding_instruction="I'm checking availability, just a moment.",
    instructions="Use book_table to complete booking requests. Call done() when finished.",
    tools=tools,
    llm=ChatOpenAI(model="gpt-4o-mini"),
)

agent = RealtimeAgent(
    instructions="Delegate restaurant bookings to the supervisor.",
    supervisor=supervisor,
)

How it works: the realtime agent registers the Supervisor as a callable tool. When invoked, the supervisor runs its own agentic loop (tool calls → LLM → tool calls …) until it either calls done() or needs clarification from the user via clarify(). Clarifications are automatically routed back through the voice agent, and the loop resumes once the user has answered.
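The control flow described above can be sketched without any LLM at all. The code below is an illustration under assumptions, not the library's implementation: Step, run_supervisor, and the scripted decide callback are hypothetical, and what rtvoice actually does when the iteration cap is reached is not documented here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One decision from the supervisor's LLM (hypothetical shape)."""
    kind: str          # "tool", "clarify", or "done"
    payload: str = ""


def run_supervisor(
    decide: Callable[[list[str]], Step],
    run_tool: Callable[[str], str],
    ask_user: Callable[[str], str],
    max_iterations: int = 10,
) -> str:
    """Loop until the LLM signals done or the iteration cap is reached."""
    transcript: list[str] = []
    for _ in range(max_iterations):
        step = decide(transcript)
        if step.kind == "done":
            return step.payload
        if step.kind == "clarify":
            # In the library this goes through the voice agent; here it is a callback.
            transcript.append(f"user: {ask_user(step.payload)}")
        else:
            transcript.append(f"tool: {run_tool(step.payload)}")
    return "Stopped: iteration cap reached."  # assumed fallback, not documented


# A scripted stand-in for the LLM: clarify once, call a tool, then finish.
script = iter([
    Step("clarify", "For how many people?"),
    Step("tool", "book_table"),
    Step("done", "Your table is booked."),
])

result = run_supervisor(
    decide=lambda transcript: next(script),
    run_tool=lambda name: f"{name} ok",
    ask_user=lambda question: "Four",
)
print(result)
```

The cap matters because a model that never calls done() would otherwise keep the user waiting behind the holding phrase indefinitely.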

Supervisor parameters:

| Parameter | Description |
| --- | --- |
| description | Shown to the realtime model to decide when to delegate |
| instructions | System prompt for the supervisor's own LLM loop |
| llm | ChatOpenAI(model=...) or any ChatModel implementation |
| tools | Tools instance with the actions the supervisor may call |
| holding_instruction | Spoken while the supervisor works |
| result_instructions | Tells the realtime model how to present the result |
| handoff_instructions | Extra guidance appended to the tool description |
| max_iterations | Loop iteration cap (default: 10) |
| context | Arbitrary object injectable inside supervisor tools |

Conversation seeds

Pre-fill the session with synthetic conversation history before the microphone opens. The model will behave as if those exchanges already happened.

from rtvoice import RealtimeAgent, ConversationSeed, SeedMessage

agent = RealtimeAgent(
    instructions="You are a helpful assistant.",
    conversation_seed=ConversationSeed(
        messages=[
            SeedMessage.user("My name is Alice and I prefer short answers."),
            SeedMessage.assistant("Got it, Alice. I'll keep things brief."),
        ]
    ),
)

Use ConversationSeed.from_pairs() for a more concise form when you have multiple user/assistant exchanges:

seed = ConversationSeed.from_pairs(
    ("My name is Alice.", "Nice to meet you, Alice."),
    ("I prefer short answers.", "Understood, I'll be brief."),
)
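Conceptually, the pair form expands into the same alternating user/assistant sequence as the explicit message list. A small sketch of that expansion, using a hypothetical Message dataclass and a free function instead of the real ConversationSeed and SeedMessage classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Toy stand-in for a seed message."""
    role: str
    text: str


def from_pairs(*pairs: tuple[str, str]) -> list[Message]:
    """Expand (user, assistant) pairs into an alternating message list."""
    messages: list[Message] = []
    for user_text, assistant_text in pairs:
        messages.append(Message("user", user_text))
        messages.append(Message("assistant", assistant_text))
    return messages


seed = from_pairs(
    ("My name is Alice.", "Nice to meet you, Alice."),
    ("I prefer short answers.", "Understood, I'll be brief."),
)
print([m.role for m in seed])
```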

Lifecycle listener

Subclass AgentListener and pass it to RealtimeAgent to hook into session events. Override only the methods you care about — all are async no-ops by default.

from rtvoice import RealtimeAgent, AgentListener

class MyListener(AgentListener):
    async def on_agent_starting(self) -> None:
        print("Agent is starting up...")

    async def on_agent_session_connected(self) -> None:
        print("WebSocket connected, ready to talk.")

    async def on_user_transcript(self, transcript: str) -> None:
        print(f"User said: {transcript}")

    async def on_assistant_transcript(self, transcript: str) -> None:
        print(f"Assistant replied: {transcript}")

    async def on_agent_stopped(self) -> None:
        print("Session ended.")

agent = RealtimeAgent(
    instructions="You are a helpful assistant.",
    listener=MyListener(),
)

All available callbacks:

| Method | When it fires |
| --- | --- |
| on_agent_starting() | Before any I/O or WebSocket setup |
| on_agent_session_connected() | WebSocket session established |
| on_agent_stopped() | Agent fully shut down |
| on_user_started_speaking() | VAD detected speech start |
| on_user_stopped_speaking() | VAD detected speech end |
| on_user_transcript(transcript) | Finalised user transcript (requires transcription_model) |
| on_assistant_started_responding() | Assistant began streaming audio |
| on_assistant_stopped_responding() | Assistant finished streaming audio |
| on_assistant_transcript(transcript) | Full assistant response text |
| on_assistant_transcript_delta(delta) | Incremental assistant text chunk (requires "text" in output_modalities) |
| on_agent_interrupted() | User interrupted the assistant mid-response |
| on_agent_error(error) | Session or API error |
| on_supervisor_started() | The supervisor began running |
| on_supervisor_finished() | The supervisor finished |
| on_user_inactivity_countdown(remaining_seconds) | Fires each second before inactivity timeout |

Custom audio devices

Implement AudioInputDevice or AudioOutputDevice from rtvoice.audio to replace the default microphone or speaker — useful for telephony, file playback, testing, or embedded hardware.

Custom input

from collections.abc import AsyncIterator
from rtvoice import RealtimeAgent
from rtvoice.audio import AudioInputDevice

class CustomMicrophone(AudioInputDevice):
    def __init__(self):
        self._active = False

    async def start(self) -> None:
        self._active = True
        # open your audio source here

    async def stop(self) -> None:
        self._active = False
        # release resources here

    async def stream_chunks(self) -> AsyncIterator[bytes]:
        while self._active:
            chunk = await self._read_pcm_chunk()  # raw 16-bit PCM, 24 kHz mono
            yield chunk

    @property
    def is_active(self) -> bool:
        return self._active

agent = RealtimeAgent(
    instructions="...",
    audio_input=CustomMicrophone(),
)

Custom output

from rtvoice import RealtimeAgent
from rtvoice.audio import AudioOutputDevice

class CustomSpeaker(AudioOutputDevice):
    def __init__(self):
        self._playing = False

    async def start(self) -> None:
        self._playing = True

    async def stop(self) -> None:
        self._playing = False

    async def play_chunk(self, chunk: bytes) -> None:
        # write raw 16-bit PCM audio to your sink
        await self._write_to_device(chunk)

    async def clear_buffer(self) -> None:
        # discard buffered audio (called on user interruption)
        await self._flush()

    @property
    def is_playing(self) -> bool:
        return self._playing

agent = RealtimeAgent(
    instructions="...",
    audio_output=CustomSpeaker(),
)

Audio format: 16-bit PCM, 24 kHz, mono in both directions.
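When writing a custom device it helps to know how many bytes a chunk of a given duration occupies in this format. The sketch below does that arithmetic and builds a synthetic test tone. The helper names are hypothetical, and little-endian sample order is an assumption not stated above.

```python
import math
import struct

SAMPLE_RATE = 24_000      # Hz, from the format note above
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono


def chunk_size_bytes(duration_ms: int) -> int:
    """Number of bytes needed to hold duration_ms of audio."""
    samples = SAMPLE_RATE * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def sine_chunk(freq_hz: float, duration_ms: int, amplitude: float = 0.3) -> bytes:
    """Generate a sine tone as signed 16-bit samples (little-endian assumed)."""
    n = SAMPLE_RATE * duration_ms // 1000
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


print(chunk_size_bytes(20))           # 480 samples * 2 bytes = 960
print(len(sine_chunk(440.0, 20)))     # same size as above
```

A generator like sine_chunk is handy for exercising a custom AudioOutputDevice or feeding a fake AudioInputDevice in tests without touching real hardware.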


Turn detection

Control when the model decides the user has finished speaking.

Semantic VAD (default)

Waits for a semantically complete thought. Less likely to cut off mid-sentence.

from rtvoice import RealtimeAgent, SemanticVAD, SemanticEagerness

agent = RealtimeAgent(
    instructions="...",
    turn_detection=SemanticVAD(eagerness=SemanticEagerness.LOW),
)

SemanticEagerness values: LOW, MEDIUM, HIGH, AUTO (default).

Server VAD

Energy-based: triggers on silence duration. More predictable latency.

from rtvoice import RealtimeAgent, ServerVAD

agent = RealtimeAgent(
    instructions="...",
    turn_detection=ServerVAD(
        threshold=0.5,           # energy threshold 0–1
        prefix_padding_ms=300,   # audio kept before speech onset
        silence_duration_ms=500, # silence needed to commit end-of-turn
    ),
)
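To build intuition for threshold and silence_duration_ms, here is a toy energy-based end-of-turn detector. The real detection runs on the server side of the Realtime API and this is not its algorithm; find_end_of_turn, the per-frame energy list, and the 20 ms frame length are assumptions made for illustration, and prefix_padding_ms is not modelled.

```python
from typing import Optional, Sequence


def find_end_of_turn(
    energies: Sequence[float],
    frame_ms: int = 20,
    threshold: float = 0.5,
    silence_duration_ms: int = 500,
) -> Optional[int]:
    """Return the index of the frame at which the turn is committed, or None."""
    needed = silence_duration_ms // frame_ms  # consecutive quiet frames required
    speaking = False
    silent_frames = 0
    for index, energy in enumerate(energies):
        if energy >= threshold:
            speaking = True
            silent_frames = 0          # any loud frame resets the silence counter
        elif speaking:
            silent_frames += 1
            if silent_frames >= needed:
                return index
    return None


# 100 ms of quiet, 200 ms of speech, then 600 ms of quiet (20 ms frames).
energies = [0.1] * 5 + [0.8] * 10 + [0.1] * 30

print(find_end_of_turn(energies))                            # 500 ms of silence needed
print(find_end_of_turn(energies, silence_duration_ms=200))   # commits earlier
```

A shorter silence_duration_ms commits the turn sooner, which lowers latency but makes it more likely that a natural pause is mistaken for the end of the turn.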

Voice and model

from rtvoice import RealtimeAgent, AssistantVoice, RealtimeModel

agent = RealtimeAgent(
    model=RealtimeModel.GPT_REALTIME,       # or GPT_REALTIME_MINI, GPT_REALTIME_1_5
    voice=AssistantVoice.CORAL,
    speech_speed=1.2,                       # 0.25–1.5, default 1.0
    instructions="...",
)

Available voices: ALLOY, ASH, BALLAD, CORAL, ECHO, FABLE, ONYX, NOVA, SAGE, SHIMMER, VERSE, CEDAR, MARIN.


Recording

Save the raw session audio to a file:

agent = RealtimeAgent(
    instructions="...",
    recording_path="session.pcm",
)

result = await agent.run()
print(result.recording_path)   # Path to the saved file

The returned AgentResult also contains result.turns — a list of ConversationTurn objects with role and text for every exchange.
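Most audio players cannot open a headerless .pcm file directly. Assuming the recording uses the same 16-bit, 24 kHz, mono format as the audio devices (an assumption, since the recording format is not spelled out here), a small standard-library helper can wrap it in a WAV container. pcm_to_wav is a hypothetical name, not part of rtvoice.

```python
import wave
from pathlib import Path


def pcm_to_wav(pcm_path: Path, wav_path: Path, sample_rate: int = 24_000) -> None:
    """Wrap headerless 16-bit mono PCM in a WAV container."""
    with wave.open(str(wav_path), "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_path.read_bytes())


if __name__ == "__main__":
    source = Path("session.pcm")
    if source.exists():              # only convert when a recording is present
        pcm_to_wav(source, source.with_suffix(".wav"))
```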


Inactivity timeout

Automatically stop the agent after a period of user silence:

agent = RealtimeAgent(
    instructions="...",
    inactivity_timeout_enabled=True,
    inactivity_timeout_seconds=30.0,
    listener=MyListener(),   # on_user_inactivity_countdown fires each second 5→1
)

The countdown fires through AgentListener.on_user_inactivity_countdown(remaining_seconds) — useful for playing a "still there?" prompt before the session closes.
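The timing can be pictured as a quiet period followed by a per-second countdown. The sketch below models only that timing: inactivity_countdown and its parameters are hypothetical, the 5 second countdown length is taken from the comment in the example above, and resetting the timer when the user speaks is not modelled.

```python
import asyncio
from typing import Awaitable, Callable


async def inactivity_countdown(
    timeout_seconds: float,
    on_countdown: Callable[[int], Awaitable[None]],
    countdown_from: int = 5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> None:
    """Wait out the quiet period, then report the last few seconds one by one."""
    await sleep(max(timeout_seconds - countdown_from, 0))
    for remaining in range(countdown_from, 0, -1):
        await on_countdown(remaining)
        await sleep(1)


async def demo() -> list[int]:
    seen: list[int] = []

    async def record(remaining: int) -> None:
        seen.append(remaining)

    async def no_wait(_: float) -> None:  # skip real delays in the demo
        return None

    await inactivity_countdown(30.0, record, sleep=no_wait)
    return seen


print(asyncio.run(demo()))
```

A listener could use the first callback of the countdown to trigger a "still there?" prompt, leaving the remaining seconds for the user to respond.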


Azure OpenAI

Pass an AzureOpenAIProvider instead of the default OpenAI provider:

from rtvoice import AzureOpenAIProvider, RealtimeAgent

agent = RealtimeAgent(
    instructions="...",
    provider=AzureOpenAIProvider(
        azure_endpoint="https://your-resource.openai.azure.com",
        azure_deployment="gpt-4o-realtime-preview",
        api_version="2024-12-17",
        api_key="...",          # or omit to use AZURE_OPENAI_API_KEY
    ),
)

About

Framework for real-time voice agents built on OpenAI's Realtime API
