I wrote a 12-part playbook covering every layer of voice AI for interviews — pipeline selection, provider comparison, scaling, cost optimization. Then I built a production S2S agent with OpenAI Realtime and Gemini Live, dynamic prompts, and a state machine for session orchestration. Those posts are the reference architecture. Clean diagrams. Neat abstractions. Everything works on paper.

This series is about what happened when we took that architecture and ran qualitative research interviews at scale. Not technical hiring. Not screening. Actual research — where the goal is to extract rich, nuanced responses from participants who aren’t trying to impress you; they’re trying to articulate experiences they’ve never put into words before.

Research interviews are a fundamentally different beast. In a hiring interview, the candidate is motivated. They prepare. They stay focused. In a research interview, the participant might be exhausted, distracted, or unsure why they agreed to do this. They pause for 10 seconds mid-thought. They ramble. They circle back to something from 5 minutes ago. The AI agent needs to handle all of that gracefully — and it needs to do it hundreds of times a day with zero supervision.

This eight-part series documents the production reality: the architecture decisions that mattered, the bugs that cost us weeks, the state machines that actually work, and the cost math that determines whether the whole thing is viable. Part 1 starts with the architecture — specifically, the parts that no tutorial covers.

Why Research Interviews Need Server-Side Agents

The simplest voice AI architecture is also the most tempting: put the API key in the browser and connect directly to OpenAI Realtime or Gemini Live. The participant’s browser opens a WebSocket, streams audio, gets audio back. No backend needed. Ship it Friday.

Here’s why that doesn’t work for research at scale.

Security. Your API keys are in the browser. Every participant with dev tools open can extract them. For a demo, that’s fine. For a platform running thousands of sessions per month with enterprise clients who’ve signed data processing agreements — that’s a career-ending incident. OpenAI’s Realtime API docs explicitly recommend server-side connections for production deployments.

Research protocol enforcement. A research interview follows a protocol — specific phases, specific questions, specific probing strategies. The protocol is your methodology. If the participant manipulates the client (or the connection drops and reconnects), you need the server to maintain state. Client-side agents can’t enforce server-side invariants.

Observability. When a session goes wrong — and sessions will go wrong — you need full server-side logs. Every audio frame, every LLM response, every state transition. Client-side logging means trusting the participant’s browser to report accurately, which it won’t, because participants close tabs, lose connectivity, and use browsers with aggressive resource management.

Cost enforcement. A single research session can burn $5-15 in API costs depending on the provider and duration. Without server-side budget limits, a stuck session or a participant who leaves their tab open can run up costs for hours. We learned this the hard way — Bug 1 in Part 2 covers exactly this scenario.
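A minimal sketch of the kind of server-side guard this implies — not our actual fix (Part 2 covers that), and the get_cost_usd/end_session hooks are hypothetical stand-ins for however your session object exposes spend and teardown:

```python
import asyncio

async def budget_watchdog(get_cost_usd, end_session, limit_usd: float,
                          interval_s: float = 30.0) -> None:
    """Poll estimated spend and tear the session down once it crosses
    the budget. get_cost_usd and end_session are hypothetical hooks
    into the session object."""
    while True:
        if get_cost_usd() >= limit_usd:
            await end_session("budget_exceeded")
            return
        await asyncio.sleep(interval_s)
```

The point is that the loop runs server-side, next to the agent process, so a closed tab or a zombie session cannot outrun the budget.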

The pattern that works: the participant’s browser never touches the AI provider. It connects to a media server via WebRTC, and a server-side agent handles the AI conversation. The browser is a dumb audio terminal.

The Three-Tier Architecture

The production architecture has three layers, each with a distinct responsibility:

Tier 1: Web Client (React/Next.js + WebRTC SDK). The participant’s browser. It connects to a media server room using a time-limited access token, sends/receives audio tracks via WebRTC, and renders a minimal UI — a “start” button, a timer, maybe a visual indicator that the AI is listening. No business logic. No API keys. No state management beyond “am I connected.”

Tier 2: Backend API (Node.js or Python). The orchestration layer. When a research session is scheduled, the API creates a room on the media server, generates access tokens for the participant and the agent, writes session configuration as room metadata, and dispatches the agent process. It also handles webhooks for session lifecycle events — participant joined, participant left, session ended.

Tier 3: Python Voice Agent. The actual AI. It joins the same media server room as the participant, reads the configuration from room metadata, establishes a speech-to-speech connection to OpenAI Realtime or Gemini Live, and runs the research protocol as a state machine. This is a long-running process — 10 to 60 minutes per session — with its own lifecycle management.
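Tier 2’s provisioning step can be sketched with the LiveKit server SDK (the livekit-api package). Request and method names follow that package but should be verified against the current SDK; the identity string and token TTL are illustrative:

```python
import json
from datetime import timedelta

async def provision_session(room_name: str, session_config: dict,
                            lk_url: str, lk_key: str, lk_secret: str) -> str:
    # Lazy import, matching the style the agent's provider factory uses.
    from livekit import api

    lkapi = api.LiveKitAPI(lk_url, lk_key, lk_secret)
    try:
        # Create the room with the session config attached as metadata,
        # so the agent can read it when it joins.
        await lkapi.room.create_room(api.CreateRoomRequest(
            name=room_name,
            metadata=json.dumps(session_config),
        ))
    finally:
        await lkapi.aclose()

    # Time-limited, room-scoped token for the participant's browser.
    # Tier 1 only ever sees this grant, never a provider API key.
    return (
        api.AccessToken(lk_key, lk_secret)
        .with_identity("participant")
        .with_ttl(timedelta(minutes=15))
        .with_grants(api.VideoGrants(room_join=True, room=room_name))
        .to_jwt()
    )
```

The agent gets its own token the same way, dispatched through the job queue rather than handed to a browser.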

The media server (LiveKit is what we used, but the pattern applies to any SFU with a server SDK) acts as the bridge. Both the participant and the agent connect to the same room, but they’re completely independent processes. The agent can crash and reconnect without the participant knowing. The participant can have network issues without the agent losing state.

This separation matters because research sessions are expensive to restart. If a 30-minute interview fails at minute 25 because the agent process died, you’ve wasted the participant’s time and your budget. The three-tier architecture means each layer can fail and recover independently.

Room Metadata as Configuration Transport

Here’s the problem: the backend API knows everything about the session — which provider to use, which voice, what the research protocol looks like, what the participant’s context is. The Python agent knows none of this at startup. It just knows “join room X.”

We needed a way to pass rich configuration from the API to the agent without coupling them directly. The solution: room metadata. The media server lets you attach arbitrary JSON to a room, and any participant (including the agent) can read it.

This is the JSON contract:

{
  "provider": "openai_realtime",
  "model": "gpt-4o-realtime-preview",
  "voice": "shimmer",
  "instructions": "You are a research interviewer conducting a study on...",
  "vad_config": {
    "threshold": 0.5,
    "silence_duration_ms": 500,
    "prefix_padding_ms": 300
  },
  "phases": [
    {
      "id": "rapport",
      "name": "Rapport Building",
      "duration_seconds": 120,
      "prompt_suffix": "Start with casual conversation..."
    },
    {
      "id": "core",
      "name": "Core Questions",
      "duration_seconds": 900,
      "prompt_suffix": "Transition to the main research questions..."
    },
    {
      "id": "closing",
      "name": "Closing",
      "duration_seconds": 120,
      "prompt_suffix": "Thank the participant and wrap up..."
    }
  ],
  "budget_limit_usd": 10.0,
  "session_timeout_seconds": 3600,
  "created_at": "2026-02-20T10:30:00Z"
}

Every field in this schema was added because we hit a production issue without it. budget_limit_usd was added after a zombie session ran for 3 hours. vad_config was added after research participants kept getting interrupted mid-thought (more in Part 2). The phases array replaced a hardcoded state machine after the third researcher asked for a different interview structure.

The beauty of this approach: the API and the agent share a contract, not a connection. The API can be written in any language. The agent can be restarted without the API knowing. And the configuration is visible in the media server’s admin dashboard for debugging.
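As one example of consuming the contract, the agent can map vad_config onto OpenAI Realtime’s server-VAD settings. The field names below match OpenAI’s session-level turn_detection configuration as documented for the Realtime API; the fallback defaults simply mirror the example contract above:

```python
def to_openai_turn_detection(vad_config: dict) -> dict:
    """Translate the contract's vad_config block into the turn_detection
    payload OpenAI Realtime expects for server-side VAD. Fallback values
    mirror the example contract."""
    return {
        "type": "server_vad",
        "threshold": vad_config.get("threshold", 0.5),
        "prefix_padding_ms": vad_config.get("prefix_padding_ms", 300),
        "silence_duration_ms": vad_config.get("silence_duration_ms", 500),
    }
```

Keeping the contract provider-neutral and translating at the edge is what lets the same metadata drive either provider.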

The Metadata Propagation Latency Problem

Here’s the part nobody warns you about. When the API creates a room and writes metadata, and the agent process starts and joins that room, there’s a window — 100 to 500 milliseconds — where the agent is in the room but the metadata hasn’t propagated yet.

This sounds trivial. It isn’t.

In a synchronous system, you’d create the room, write metadata, and then start the agent. But in production, the agent is a separate process (often on a separate machine). It’s dispatched by a job queue. It starts when resources are available. By the time it joins the room, the metadata should be there. Except sometimes it isn’t.

The SFU distributes metadata updates via its internal pub/sub system. If the agent connects in the narrow window between room creation and metadata propagation — which happens roughly 5-8% of the time under load — it reads empty metadata and either crashes or starts with defaults that are wrong for this session.

Our fix is a polling loop with a hard timeout:

import asyncio
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger("voice-agent")

METADATA_POLL_INTERVAL_MS = 150
METADATA_TIMEOUT_S = 5.0

@dataclass
class SessionConfig:
    provider: str
    model: str
    voice: str
    instructions: str
    vad_config: dict
    phases: list
    budget_limit_usd: float
    session_timeout_seconds: int

async def wait_for_metadata(room, timeout: float = METADATA_TIMEOUT_S) -> SessionConfig:
    """Poll room metadata until available or timeout."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    interval = METADATA_POLL_INTERVAL_MS / 1000.0

    # Measure elapsed time with the event loop's monotonic clock rather
    # than summing sleep intervals, which drifts by ignoring processing time.
    while (elapsed := loop.time() - start) < timeout:
        raw = room.metadata
        if raw:
            try:
                data = json.loads(raw)
                logger.info(f"Metadata received after {elapsed * 1000:.0f}ms")
                return SessionConfig(
                    provider=data["provider"],
                    model=data["model"],
                    voice=data["voice"],
                    instructions=data["instructions"],
                    vad_config=data.get("vad_config", {}),
                    phases=data.get("phases", []),
                    budget_limit_usd=data.get("budget_limit_usd", 10.0),
                    session_timeout_seconds=data.get("session_timeout_seconds", 3600),
                )
            except (json.JSONDecodeError, KeyError) as e:
                logger.warning(f"Malformed metadata, retrying: {e}")

        await asyncio.sleep(interval)

    raise TimeoutError(f"No metadata after {timeout}s — aborting session")

The 150ms polling interval is a balance: fast enough to minimize startup latency, slow enough to not hammer the SFU. The 5-second hard timeout catches the case where the API failed to write metadata entirely — without it, the agent would poll forever.

In production, the average wait is 80ms. The P95 is 320ms. The P99 is 480ms. And about 0.1% of sessions hit the timeout, which means the API had a bug or the SFU had an issue. Those sessions fail fast and get retried automatically by the job queue.

This is the kind of infrastructure detail that doesn’t appear in any tutorial or quickstart guide. But if you skip it, roughly 5-8% of your sessions start with wrong or missing configuration.

OpenAI Realtime vs Gemini Live for Research

Both providers work. Both have tradeoffs that matter specifically for research interviews. Here’s what we learned after running hundreds of sessions on each.

OpenAI Realtime is the lower-latency option. Time-to-first-audio is typically 250-350ms in our measurements. It supports 60-minute sessions natively, which covers even the longest qualitative interviews. Function calling works mid-conversation, which is how we trigger phase transitions and capture structured data during the interview. The server-side VAD is tunable via the session configuration, which is critical for research (more on this in Part 2).

The downside: cost. Audio input is $40/M tokens and audio output is $80/M tokens (as of early 2026 — check the pricing page for current rates). For a typical 30-minute research interview, that works out to roughly $2.50-5.00 per session depending on how much the participant talks. At 100 sessions per day, that’s $250-500/day in API costs alone.

Gemini Live has a different value proposition. It’s multimodal — it can process audio and video simultaneously, which opens up use cases like screen-sharing research sessions or product usability testing where you need the AI to see what the participant sees. Gemini’s affective dialog capabilities are notably better for research: it handles emotional nuance, hesitation, and conversational repair more naturally than OpenAI Realtime in our testing.

The cost advantage is significant. Gemini 2.0 Flash with audio runs at roughly $1.00-1.50 per 30-minute session — less than half of OpenAI Realtime. For high-volume research operations, this is the difference between viable and not viable.

The tradeoff: latency is slightly higher (400-600ms time-to-first-audio in our measurements), and the turn-taking model works differently. Gemini uses its own voice activity detection that’s less configurable than OpenAI’s server VAD. For research interviews where participants pause frequently, this requires different tuning strategies.
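To sanity-check a study budget against either provider’s rates, a naive lower-bound model is useful. The tokens-per-minute figure below is an assumption (roughly 10 audio tokens per second), and the model deliberately ignores context re-billing — Realtime-style APIs re-charge prior conversation context as input on every turn, which is why real per-session costs run higher than this estimate:

```python
def naive_session_cost_usd(minutes_in: float, minutes_out: float,
                           usd_per_m_in: float, usd_per_m_out: float,
                           audio_tokens_per_min: int = 600) -> float:
    """Lower-bound per-session cost: spoken minutes converted to audio
    tokens at an assumed rate, priced at the provider's $/M-token rates.
    Ignores per-turn context re-billing, so treat it as a floor."""
    tokens_in = minutes_in * audio_tokens_per_min
    tokens_out = minutes_out * audio_tokens_per_min
    return (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000
```

Plugging in a 30-minute session (15 minutes of participant speech, 12 of AI speech) at the OpenAI rates quoted above gives a floor under a dollar; the gap between that and the observed $2.50-5.00 is mostly context re-billing.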

Our approach: make the provider a runtime configuration parameter (that’s the provider field in the room metadata). The agent picks the right S2S connector at startup:

from enum import Enum

class VoiceProvider(str, Enum):
    OPENAI_REALTIME = "openai_realtime"
    GEMINI_LIVE = "gemini_live"

def create_s2s_session(config: SessionConfig):
    """Factory: pick the right S2S provider from session config."""
    provider = VoiceProvider(config.provider)

    if provider == VoiceProvider.OPENAI_REALTIME:
        from livekit.plugins import openai
        return openai.realtime.RealtimeModel(
            model=config.model,
            voice=config.voice,
            modalities=["audio", "text"],
        )
    elif provider == VoiceProvider.GEMINI_LIVE:
        from livekit.plugins import google
        return google.beta.realtime.RealtimeModel(
            model=config.model,
            voice=config.voice,
        )
    else:
        raise ValueError(f"Unknown provider: {config.provider}")

This means the research team can A/B test providers per study. Run half the sessions on OpenAI for lower latency, half on Gemini for better affect detection. Compare the transcript quality. Compare the cost. Make a data-driven decision.
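One simple way to drive that A/B split from the backend — a sketch, assuming sessions carry a stable id — is to hash the session id into the provider field before writing room metadata, so assignment is deterministic across retries:

```python
import hashlib

def assign_provider(session_id: str, openai_share: float = 0.5) -> str:
    """Deterministic A/B assignment: hash the session id into a bucket in
    [0, 1) so the same session always lands on the same provider arm,
    even if the agent is re-dispatched."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000
    return "openai_realtime" if bucket < openai_share else "gemini_live"
```

Hashing beats random assignment here because a retried or reconnected session must not silently switch providers mid-study.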

The LiveKit Agents SDK abstracts the WebRTC-to-S2S bridge for both providers, so the agent code above the provider layer is identical regardless of which S2S model handles the conversation. That’s the architecture bet that paid off the most: keeping the provider as a pluggable dependency rather than a structural commitment.

What’s Next

This post covered the static architecture — the pieces and how they connect. Part 2 covers what happens when those pieces interact with real users at real scale: Zombie Agents, Pre-Warming, and the 5 Bugs That Cost Us Weeks.

The short version: every bug on that list is something that works fine in development and fails spectacularly in production. The metadata propagation delay we covered here is just the appetizer.


This is Part 1 of an 8-part series: Production Voice AI for Research at Scale.

Series outline:

  1. The Architecture Nobody Warns You About — Server-side agents, metadata transport, provider selection (you are here)
  2. Zombie Agents, Pre-Warming, and the 5 Bugs That Cost Us Weeks — Production pain points and fixes
  3. Multi-Phase State Machines — Research protocol as code, LLM-driven transitions
  4. From Recording to Insight — The automatic post-interview pipeline
  5. The Real Cost — Per-minute tracking, budgets, self-hosting math
  6. What Breaks at 200 Concurrent Sessions — Scaling bottlenecks and operational metrics
  7. Multi-Language Voice AI — Language detection, provider routing, locale-aware VAD, i18n prompts
  8. Deployment and Go-Live — Docker, Kubernetes, CI/CD, zero-downtime deploys, monitoring
