In Part 1, I walked through the three-tier architecture for production voice AI research: web client, backend API, and Python voice agent connected through an SFU with room metadata as the configuration transport. The architecture diagram looks clean. The code samples compile. Everything makes sense.

Now let me tell you what actually happened when we deployed it.

This post covers the five bugs that consumed the most engineering time during our first two months of production research interviews. None of them appear in any quickstart guide or tutorial. All of them are obvious in hindsight. And together, they interact in ways that make the whole system harder to debug than the sum of its parts.

Bug 1: Zombie Agents

The scenario: the backend API creates a room, dispatches a voice agent, and generates an access token for the participant. The agent joins the room and starts waiting for the participant. Except the participant never shows up. Maybe they changed their mind. Maybe their browser crashed. Maybe they’re having lunch and forgot about the session.

The agent doesn’t know any of this. It’s sitting in the room, connected to the SFU, consuming CPU and memory, waiting. If it pre-warmed a connection to OpenAI Realtime or Gemini Live (which it does — see Bug 2), it’s also burning API costs just idling.

We called these “zombie agents.” They’re alive in the process sense but dead in the useful-work sense.

The math is bad. In our first month, roughly 15% of scheduled sessions had no-shows or late cancellations. At peak, we had 100+ sessions per day. That’s 15 zombie agents per day. Average zombie lifetime before we noticed: 5 minutes (some ran for the full session timeout of 60 minutes before the hard limit killed them). That’s 75 minutes of wasted compute per day at minimum — and on bad days, several hours of wasted S2S API credits.

The fix is straightforward: a participant join timeout. If no participant connects within 60 seconds of the agent joining, the agent terminates and cleans up.

import asyncio
import logging

logger = logging.getLogger("voice-agent")

PARTICIPANT_JOIN_TIMEOUT_S = 60

async def wait_for_participant(room, timeout: float = PARTICIPANT_JOIN_TIMEOUT_S):
    """Wait for a human participant to join, or abort."""
    join_event = asyncio.Event()
    joined = []  # closure slot for the participant who triggers the event

    def on_participant_connected(participant):
        if not participant.is_agent:
            logger.info(f"Participant joined: {participant.identity}")
            joined.append(participant)
            join_event.set()

    # Register the listener before scanning, so a participant who connects
    # between the scan and the wait isn't missed
    room.on("participant_connected", on_participant_connected)

    # Check if someone is already in the room
    for p in room.remote_participants.values():
        if not p.is_agent:
            logger.info(f"Participant already present: {p.identity}")
            return p

    try:
        await asyncio.wait_for(join_event.wait(), timeout=timeout)
        return joined[0]
    except asyncio.TimeoutError:
        logger.warning(f"No participant after {timeout}s — terminating zombie agent")
        await cleanup_agent(room)
        raise

async def cleanup_agent(room):
    """Disconnect from S2S provider and leave the room."""
    # Close any active S2S session
    if hasattr(room, '_s2s_session') and room._s2s_session:
        await room._s2s_session.close()
    await room.disconnect()
    logger.info("Zombie agent cleaned up")

The 60-second timeout is generous — in practice, 95% of participants who are going to join do so within 15 seconds of their scheduled time. But research participants aren’t like software engineers joining a standup. They might be fumbling with the link, asking a family member to watch the kids, or simply running late. Sixty seconds balances cost against courtesy.

After deploying this fix, our wasted compute dropped by 85%. The remaining 15% was from sessions where the participant joined, stayed for under 30 seconds, and left — which is a different problem (and a research methodology problem, not an engineering one).
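The timeout path is easy to exercise without a real SFU. Here's a minimal harness sketch — `FakeRoom` is a hypothetical stand-in we made up for testing, not an SDK class:

```python
import asyncio

# Minimal harness: no participant ever joins, so the wait times out
# and cleanup (here, just disconnect) runs.
class FakeRoom:
    def __init__(self):
        self.remote_participants = {}
        self.disconnected = False

    def on(self, event, handler):
        pass  # nobody will ever fire participant_connected here

    async def disconnect(self):
        self.disconnected = True

async def timed_out_session():
    room = FakeRoom()
    join_event = asyncio.Event()
    room.on("participant_connected", lambda p: join_event.set())
    try:
        await asyncio.wait_for(join_event.wait(), timeout=0.05)
        return False
    except asyncio.TimeoutError:
        await room.disconnect()  # stand-in for cleanup_agent
        return room.disconnected

print(asyncio.run(timed_out_session()))  # True
```

A short timeout in tests keeps the suite fast; the production value stays in the one constant.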

Bug 2: The Pre-Warming Pattern

Here’s a scenario that frustrated us for two weeks before we diagnosed it properly.

A participant clicks “Start Interview.” They see a connection indicator. They wait. And wait. Five seconds. Eight seconds. Ten seconds. Then the AI finally speaks: “Hi, thanks for joining today.”

Ten seconds of silence at the start of a research interview is devastating. The participant thinks something is broken. They start saying “Hello?” They click refresh. They leave. Our completion rate for the first week was 72% — and exit surveys showed “long wait before the AI spoke” as the top complaint.

The problem: when the participant connects, the agent starts cold. It needs to establish a WebSocket connection to the S2S provider (OpenAI Realtime or Gemini Live), negotiate the session parameters, send the system prompt, and wait for the model to be ready. Each step takes time:

WebSocket connection:        200-500ms
Session negotiation:         300-800ms
System prompt transmission:  100-300ms
Model readiness:             500-2000ms
First audio generation:      300-500ms
─────────────────────────────────────
Total cold start:            1.4s - 4.1s (typical: 3-4s)
Network variance:            can push to 8-10s
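A breakdown like this is only trustworthy if you measure it per stage. A minimal instrumentation sketch — the `timings` dict and `stage` helper are our own convention, not a provider API:

```python
import time
from contextlib import contextmanager

# Per-stage cold-start timing. Stage names mirror the breakdown above.
timings: dict = {}

@contextmanager
def stage(name):
    start = time.monotonic()
    yield
    timings[name] = (time.monotonic() - start) * 1000.0  # ms

# Usage inside a cold start (sleep stands in for the real awaited call):
with stage("websocket_connect"):
    time.sleep(0.01)

print(sorted(timings))  # ['websocket_connect']
```

Logging these per session is what let us separate "the model is slow" from "this participant's network is slow."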

The solution is pre-warming: the agent connects to the S2S provider before the participant joins, enters a warm-but-silent state, and waits. When the participant connects and publishes their audio track, the agent flips to active mode and speaks immediately.

We modeled this as a four-state machine:

from enum import Enum
from dataclasses import dataclass
from typing import Optional
import time
import logging

logger = logging.getLogger("voice-agent")

class AgentState(str, Enum):
    COLD = "cold"          # Just started, no connections
    WARMING = "warming"    # Connecting to S2S provider
    WARM = "warm"          # S2S connected, waiting for participant
    ACTIVE = "active"      # Participant connected, conversation live

@dataclass
class AgentLifecycle:
    state: AgentState = AgentState.COLD
    s2s_session: Optional[object] = None
    warm_since: Optional[float] = None
    participant_joined: bool = False

    async def transition_to_warming(self, config):
        """Connect to S2S provider in background."""
        self.state = AgentState.WARMING
        logger.info("State: COLD -> WARMING")
        self.s2s_session = await create_s2s_connection(config)
        await self.s2s_session.send_system_prompt(config.instructions)
        self.state = AgentState.WARM
        self.warm_since = time.monotonic()
        logger.info("State: WARMING -> WARM")

    async def activate(self):
        """Participant joined — go live."""
        if self.state != AgentState.WARM:
            raise RuntimeError(f"Cannot activate from state {self.state}")
        warm_duration = time.monotonic() - self.warm_since
        logger.info(f"State: WARM -> ACTIVE (warm for {warm_duration:.1f}s)")
        self.state = AgentState.ACTIVE
        # The first audio from the model is now immediate
        await self.s2s_session.generate_response()

    async def shutdown(self):
        """Clean up regardless of current state."""
        if self.s2s_session:
            await self.s2s_session.close()
        # Log the state we're leaving *before* resetting, or this always says COLD
        logger.info(f"Agent shut down from {self.state}")
        self.state = AgentState.COLD

After pre-warming, time-to-first-voice dropped from 5-10 seconds to 1-2 seconds. Completion rates went from 72% to 91% in the first week after deployment. The participant hears the AI greeting within about a second of clicking “Start” — fast enough that it feels responsive without feeling rushed.

The tradeoff: a pre-warmed agent that becomes a zombie (Bug 1) wastes more resources than a cold zombie. That 60-second timeout from Bug 1 becomes more important, not less, because the agent is now connected to the S2S provider during the wait. We budget for roughly 60 seconds of S2S idle time per session as a pre-warming overhead.
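To keep that overhead visible, we track it as its own line item. A back-of-envelope version — the per-minute rate below is an illustrative placeholder, not published provider pricing:

```python
# Illustrative pre-warming overhead estimate.
S2S_COST_PER_MIN = 0.06      # $/min — assumed placeholder rate
WARM_IDLE_BUDGET_S = 60      # per-session warm-idle budget from above
SESSIONS_PER_DAY = 100       # peak load from Bug 1

daily_overhead = SESSIONS_PER_DAY * (WARM_IDLE_BUDGET_S / 60) * S2S_COST_PER_MIN
print(f"${daily_overhead:.2f}/day in warm-idle S2S time")  # $6.00/day
```

At these assumed numbers the overhead is small; the point of the calculation is that it scales linearly with session volume, so it's worth rechecking whenever load grows.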

Bug 3: VAD Tuning for Research Respondents

Voice Activity Detection (VAD) determines when the participant has stopped talking and the AI should respond. Every S2S provider ships with default VAD settings that are tuned for customer service and general conversation. Those defaults are catastrophically wrong for research interviews.

Here’s why: in a normal conversation, a 500ms pause means you’re done talking. In a research interview, a 500ms pause means you’re thinking. A research participant might say “I think the main problem was…” then pause for 2-3 seconds while they formulate a thought, then continue with “…actually it goes back to when I first started using the product.” If the AI jumps in during that pause, you’ve interrupted the most valuable part of the response — the part where the participant is accessing a deeper memory or forming a more nuanced opinion.

The default VAD settings we found across providers:

OpenAI Realtime defaults:
  threshold: 0.5, silence_duration: 200ms, prefix_padding: 300ms

Gemini Live:
  Uses internal VAD — less configurable, similar aggressiveness

Our research-tuned settings after two weeks of iteration:

from dataclasses import dataclass

@dataclass
class ResearchVADConfig:
    """VAD settings tuned for qualitative research interviews.

    Research respondents think longer, pause more, and circle back
    to earlier points. Aggressive VAD kills the best insights.
    """
    threshold: float = 0.5          # Sensitivity (0.0-1.0)
    silence_duration_ms: int = 500  # Wait 500ms of silence before responding
    prefix_padding_ms: int = 300    # Capture 300ms of audio before speech onset
    min_speech_duration_ms: int = 100  # Ignore sub-100ms sounds (coughs, clicks)

    def to_openai_config(self) -> dict:
        return {
            "type": "server_vad",
            "threshold": self.threshold,
            "silence_duration_ms": self.silence_duration_ms,
            "prefix_padding_ms": self.prefix_padding_ms,
        }

    def to_session_update(self) -> dict:
        """Format for OpenAI Realtime session.update event."""
        return {
            "turn_detection": self.to_openai_config()
        }

The key change: silence_duration_ms from 200ms (default) to 500ms. This single parameter change improved transcript quality more than any prompt engineering we did. Participants said things like “the AI really listens” and “it doesn’t rush me” in post-session feedback.

For OpenAI Realtime, these settings go into the session.update event. The server VAD documentation covers the parameters in detail. For Gemini Live, the VAD is less configurable — you get coarser controls over turn-taking behavior, and we found that Gemini’s native turn detection is actually slightly better for research out of the box because it’s more conservative about claiming a turn.

One caveat: longer silence duration means longer perceived latency. When the participant actually is done talking, they wait 500ms for the AI to respond instead of 200ms. This is a conscious tradeoff. For research, the cost of interrupting a good answer is much higher than the cost of a slightly slower response.
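One way to make the tradeoff concrete: for a set of mid-thought pause lengths, compute what fraction a given silence threshold would cut off. The pause values here are illustrative, not measured session data:

```python
# Toy model: fraction of mid-thought pauses a VAD setting would interrupt.
# Pause lengths (ms) are illustrative, not measured data.
thinking_pauses_ms = [250, 400, 700, 1200, 1800, 2600]

def cutoff_fraction(pauses, silence_duration_ms):
    """A pause is 'cut off' if it lasts at least the silence threshold."""
    return sum(p >= silence_duration_ms for p in pauses) / len(pauses)

print(cutoff_fraction(thinking_pauses_ms, 200))  # 1.0 — cuts every pause
print(cutoff_fraction(thinking_pauses_ms, 500))
```

No setting wins on both axes; the right threshold depends on how your participants actually pause, which is why we tuned against recorded sessions rather than intuition.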

Bug 4: Turn Detection and Interruption Handling

This bug is the evil cousin of Bug 3. VAD determines when the AI should start responding. Turn detection determines what happens when the participant starts talking while the AI is still speaking.

In a research interview, interruptions happen constantly. Not rude interruptions — collaborative ones. The participant says “Oh, that reminds me of…” while the AI is finishing a transition phrase. Or they say “Yes, exactly” as a backchannel while the AI asks the next question. Or they start answering before the AI finishes the question because they already know where it’s going.

The two providers handle this differently.

OpenAI Realtime has built-in interruption handling. When the server VAD detects that the participant is speaking while the AI is generating audio, it immediately stops the AI’s output and processes the participant’s speech as a new turn. This is the right behavior for research — the participant’s voice always takes priority.

Gemini Live supports interruption configuration. You can control whether and how the model responds to interruptions during its own output. The behavior is configurable, but the defaults needed adjustment for research contexts.

The real complexity is in the event handling:

import logging

logger = logging.getLogger("voice-agent")

class InterruptionHandler:
    """Handle participant interruptions during AI speech."""

    def __init__(self, session, metrics):
        self.session = session
        self.metrics = metrics
        self._ai_speaking = False
        self._interrupted_count = 0

    def on_agent_started_speaking(self, event):
        self._ai_speaking = True

    def on_agent_stopped_speaking(self, event):
        self._ai_speaking = False

    def on_user_started_speaking(self, event):
        if self._ai_speaking:
            self._interrupted_count += 1
            self.metrics.record_interruption(
                timestamp=event.timestamp,
                ai_progress=event.audio_offset_ms,
            )
            logger.info(
                f"Participant interrupted AI at {event.audio_offset_ms}ms "
                f"(total interruptions: {self._interrupted_count})"
            )

    def register(self, agent):
        """Attach event listeners to the agent lifecycle."""
        agent.on("agent_started_speaking", self.on_agent_started_speaking)
        agent.on("agent_stopped_speaking", self.on_agent_stopped_speaking)
        agent.on("user_started_speaking", self.on_user_started_speaking)

We track interruption patterns for two reasons. First, frequent interruptions at specific points (like phase transitions) suggest the AI’s phrasing is too long — a prompt engineering signal. Second, interruption frequency per participant is a research data point: it correlates with engagement level and comfort with the AI.

The testing methodology for turn detection is worth a post on its own, but the short version: we recorded 50 real research sessions, identified the interruption points, then replayed the audio against different VAD/turn-detection configurations to find the settings that matched what a human interviewer would do. The current settings agree with human judgment about 87% of the time — not perfect, but good enough that participants don’t notice the mismatches.
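The scoring step of that replay comparison is simple enough to sketch. The labels below are made up for illustration; True means "the AI should yield the turn" at that candidate point:

```python
def agreement_rate(human_labels, config_decisions):
    """Fraction of candidate interruption points where the config's
    turn-taking decision matches the human interviewer's judgment."""
    matches = sum(h == c for h, c in zip(human_labels, config_decisions))
    return matches / len(human_labels)

# Made-up labels for eight candidate points; True = "yield the turn".
human  = [True, True, False, True, False, True, True, False]
config = [True, True, False, False, False, True, True, True]
print(f"{agreement_rate(human, config):.0%}")  # 75%
```

The harder part is producing the human labels consistently, which is where most of the methodology effort went.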

Bug 5: Provider API Configuration Chaos

This one is embarrassing in its simplicity and maddening in its time cost. We spent three days debugging what turned out to be an environment variable naming issue.

Google’s AI APIs have gone through several naming iterations. Depending on which SDK version, which documentation page, and which example code you’re reading, the API key environment variable might be:

  • GOOGLE_API_KEY
  • GEMINI_API_KEY
  • GOOGLE_GEMINI_API_KEY
  • GOOGLE_APPLICATION_CREDENTIALS (for service accounts, different auth flow entirely)

Different SDK versions look for different variable names. The LiveKit Google plugin looks for one name. The raw Gemini SDK looks for another. If you set the wrong one, you don’t get an error — you get a silent fallback to a default that doesn’t have your key, and then a cryptic authentication failure 10 seconds into the session.

OpenAI has its own version of this: OPENAI_API_KEY is standard, but some orchestration layers also check OPENAI_ORG_ID, OPENAI_PROJECT_ID, or OPENAI_BASE_URL for proxy setups. Missing any of these in the right combination produces different failure modes.

Our fix: a normalizer that runs at agent startup and validates every provider’s configuration before attempting a connection.

import os
import logging
from typing import Optional

logger = logging.getLogger("voice-agent")

class ProviderKeyNormalizer:
    """Normalize and validate provider API keys at startup.

    Different SDKs look for different env var names.
    This ensures the right key is available under all expected names.
    """

    GOOGLE_KEY_NAMES = ["GOOGLE_API_KEY", "GEMINI_API_KEY", "GOOGLE_GEMINI_API_KEY"]
    OPENAI_KEY_NAMES = ["OPENAI_API_KEY"]

    @classmethod
    def normalize_google(cls) -> Optional[str]:
        key = None
        for name in cls.GOOGLE_KEY_NAMES:
            if os.environ.get(name):
                key = os.environ[name]
                break

        if key:
            # Set ALL known names so every SDK finds it
            for name in cls.GOOGLE_KEY_NAMES:
                os.environ[name] = key
            logger.info("Google API key normalized across all env var names")
        else:
            logger.error(
                f"No Google API key found. Checked: {cls.GOOGLE_KEY_NAMES}"
            )
        return key

    @classmethod
    def normalize_openai(cls) -> Optional[str]:
        key = os.environ.get("OPENAI_API_KEY")
        if not key:
            logger.error("OPENAI_API_KEY not set")
        return key

    @classmethod
    def validate_all(cls):
        """Run at agent startup. Fail fast if keys are missing."""
        results = {
            "google": cls.normalize_google(),
            "openai": cls.normalize_openai(),
        }
        missing = [k for k, v in results.items() if v is None]
        if missing:
            logger.warning(f"Missing provider keys: {missing}")
        return results

Is this elegant? No. Is it necessary? Absolutely. This saved us from at least one “why isn’t Gemini working in production” debugging session per week. The normalization runs once at process startup — if a key is missing, the agent logs the exact problem and fails immediately instead of failing 10 seconds into a session with a participant already waiting.
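The aliasing behavior is easy to sandbox-check without the full class. This sketch mirrors normalize_google with a throwaway value in place of a real key:

```python
import os

# Standalone check of the aliasing idea: set one known name, and all
# aliases get populated. Throwaway value, not a real key.
GOOGLE_KEY_NAMES = ["GOOGLE_API_KEY", "GEMINI_API_KEY", "GOOGLE_GEMINI_API_KEY"]

for name in GOOGLE_KEY_NAMES:
    os.environ.pop(name, None)
os.environ["GEMINI_API_KEY"] = "test-key-123"

key = next((os.environ[n] for n in GOOGLE_KEY_NAMES if os.environ.get(n)), None)
if key:
    for name in GOOGLE_KEY_NAMES:
        os.environ[name] = key

print(os.environ["GOOGLE_API_KEY"])  # test-key-123
```

Note that os.environ changes only propagate to child processes spawned after the normalization runs, which is one more reason to do this at the very top of startup.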

The Compound Effect

These bugs don’t exist in isolation. They compound in ways that make the system harder to debug than the sum of its parts.

A zombie agent (Bug 1) that pre-warmed (Bug 2) burns S2S API credits while it waits. If the participant eventually shows up after 45 seconds, the S2S session might have timed out, requiring a cold restart — which means the pre-warming was wasted and the participant gets the worst of both worlds: they waited AND they get a slow start.

Bad VAD settings (Bug 3) interact with interruption handling (Bug 4): if the VAD is too aggressive, the AI starts responding too early, which causes the participant to interrupt because the AI talked over them, which triggers an interruption event, which makes the AI stop, which creates an awkward pause, which makes the participant think something is broken.

Provider key misconfiguration (Bug 5) interacts with the provider factory from Part 1. If the agent is configured for Gemini but the Gemini key is misconfigured, the factory creates a session that fails on first use. Without the key normalizer, the error message says “authentication failed” but doesn’t tell you which environment variable is wrong.

The lesson: production voice AI systems are deeply interconnected. You can’t fix these bugs independently. Every fix needs to account for how it interacts with the other four. Our post-mortem process for voice agent incidents now includes a mandatory “interaction effects” section where we document which other components are affected by both the bug and the fix.

What’s Next

Part 3 covers the research protocol itself — how we encode multi-phase interview structures as state machines, how the LLM decides when to transition between phases, and how we handle the messy reality of participants who don’t follow the expected flow. The state machine is the heart of what makes a research interview different from a chatbot conversation.

For the technical foundations behind all of this, see the Voice AI Interview Playbook series and the S2S Voice AI Agent build guide.


This is Part 2 of an 8-part series: Production Voice AI for Research at Scale.

Series outline:

  1. The Architecture Nobody Warns You About — Server-side agents, metadata transport, provider selection (Part 1)
  2. Zombie Agents, Pre-Warming, and the 5 Bugs That Cost Us Weeks — Production pain points and fixes (you are here)
  3. Multi-Phase State Machines — Research protocol as code, LLM-driven transitions (Part 3)
  4. From Recording to Insight — The automatic post-interview pipeline (Part 4)
  5. The Real Cost — Per-minute tracking, budgets, self-hosting math (Part 5)
  6. What Breaks at 200 Concurrent Sessions — Scaling bottlenecks and operational metrics (Part 6)
  7. Multi-Language Voice AI — Language detection, provider routing, locale-aware VAD, i18n prompts (Part 7)
  8. Deployment and Go-Live — Docker, Kubernetes, CI/CD, zero-downtime deploys, monitoring (Part 8)
