In Part 5, we built real-time cost tracking, budget enforcement, and ran the self-hosting math. All of that was built under load — or more accurately, all of that was built because of load. This post is about what happens when you scale a voice AI research platform from a comfortable 10 sessions per week to 200 concurrent sessions, and which parts of the architecture break first.

The short answer: it’s never the AI provider. OpenAI Realtime and Gemini Live handle concurrent sessions fine — that’s their core business model. It’s everything around the AI that breaks. The post-processing pipeline. The session recovery logic. The monitoring. The things you built when you had 10 sessions and never stress-tested at 200.

Here’s what broke, in the order it broke, and how we fixed each one.

The Enrichment Bottleneck

This was the first thing to buckle. The post-processing pipeline from Part 4 worked beautifully at low volume. Sessions end, recordings get processed, transcripts get enriched with sentiment and topic analysis, structured reports get generated. Neat, sequential, reliable.

Then we ran a large study where 200 sessions ended within a 2-hour window. Each session produces roughly 150 transcript segments that need enrichment — sentiment scoring, topic classification, key quote extraction. That’s 30,000 API calls hitting our enrichment endpoint in a burst.

The original pipeline processed segments sequentially within each session, and sessions were processed in FIFO order. At 200 sessions with 150 segments each, the queue depth hit 30,000 items. With an average enrichment call taking 800ms, sequential processing would take nearly 7 hours. Clients expecting results within 30 minutes of session completion were not impressed.

The fix had three parts:

1. Parallel workers with concurrency control. Instead of one worker processing one segment at a time, we run a pool of workers with bounded concurrency. The concurrency limit exists because the enrichment API (typically GPT-4o-mini for cost efficiency) has rate limits, and because unbounded parallelism creates more problems than it solves.

import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class EnrichmentWorkerPool:
    """Processes enrichment tasks with bounded concurrency."""
    max_concurrency: int = 5
    _semaphore: asyncio.Semaphore = field(init=False)
    _active: int = 0      # gauge: tasks currently holding a slot
    _completed: int = 0   # counter: tasks finished without raising

    def __post_init__(self):
        self._semaphore = asyncio.Semaphore(self.max_concurrency)

    async def submit(self, tasks: list[Callable]) -> list[Any]:
        """Run a batch of enrichment tasks under the concurrency limit.

        Cross-session ordering (newest session first) is handled by the
        priority queue that feeds batches into this pool; within a batch,
        all tasks belong to the same session, so no intra-batch sort is needed.
        """
        results = await asyncio.gather(
            *[self._run_with_limit(task) for task in tasks],
            return_exceptions=True,
        )
        self._completed += sum(1 for r in results if not isinstance(r, BaseException))
        return results

    async def _run_with_limit(self, task: Callable) -> Any:
        async with self._semaphore:
            self._active += 1
            try:
                return await task()
            finally:
                self._active -= 1

2. Batched segment processing. Instead of one API call per segment, we batch 10-20 segments into a single enrichment call. The prompt includes all segments as a numbered list and asks for structured output covering all of them. This reduces 30,000 calls to ~2,000 calls. The quality is comparable — we validated this by running 500 segments through both approaches and comparing outputs. Agreement rate was 94% on sentiment, 97% on topic classification.
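
The batching step is simple enough to sketch. The helper names below (`chunk_segments`, `build_batch_prompt`) are illustrative, not our production code, and the prompt wording is an assumption about how such a request might be phrased:

```python
def chunk_segments(segments: list[str], batch_size: int = 15) -> list[list[str]]:
    """Split a session's transcript segments into enrichment batches."""
    return [segments[i:i + batch_size] for i in range(0, len(segments), batch_size)]

def build_batch_prompt(batch: list[str]) -> str:
    """Render one batch as a numbered list for a single enrichment call."""
    numbered = "\n".join(f"{i + 1}. {seg}" for i, seg in enumerate(batch))
    return (
        "For each numbered segment below, return JSON with fields "
        "`sentiment`, `topic`, and `key_quote` keyed by segment number.\n\n"
        + numbered
    )
```

At 15 segments per batch, a 150-segment session drops from 150 enrichment calls to 10.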

3. Priority queuing. Newer sessions get processed first. If a participant just finished their session 5 minutes ago, their results should be ready before someone who finished 2 hours ago. This seems obvious in retrospect, but FIFO ordering meant a burst of 200 sessions would process in arrival order regardless of when they completed. Priority queuing with newest-first ordering means the most recent sessions get results in 10-15 minutes even during burst periods.

With these three changes, the 200-session burst processes in about 45 minutes instead of 7 hours. The concurrency limit of 5 keeps us well within API rate limits, the batching reduces total calls by over 90%, and priority queuing keeps the participant experience responsive.

Session Recovery Tokens

WebRTC connections drop. It’s not a matter of if — it’s a matter of how often. Participants on WiFi walk between rooms. Mobile users switch from WiFi to cellular. Laptop users close the lid and reopen it. Corporate firewalls aggressively terminate idle UDP connections.

In a research context, a dropped connection is especially painful. The participant might be 25 minutes into a 30-minute session. If the session dies and they have to start over, you’ve lost the data and the participant’s goodwill. If the session just ends, you have an incomplete dataset.

The solution is session recovery: when the WebRTC connection drops, the server-side agent doesn’t immediately terminate. Instead, it enters a PAUSED state and waits for the participant to reconnect.

import asyncio
from dataclasses import dataclass
from enum import Enum
from time import time
from typing import Awaitable, Callable, Optional

class AgentState(str, Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    ENDED = "ended"

@dataclass
class SessionRecoveryHandler:
    """Manages agent lifecycle when WebRTC connections drop."""
    session_id: str
    recovery_timeout: float = 120.0  # seconds to wait for reconnection
    state: AgentState = AgentState.ACTIVE
    _pause_start: Optional[float] = None
    _recovery_token: Optional[str] = None

    def on_participant_disconnected(self) -> str:
        """Called when the WebRTC connection drops. Returns a recovery token."""
        self.state = AgentState.PAUSED
        self._pause_start = time()
        # Timestamp-based tokens are readable for debugging; in production,
        # prefer secrets.token_urlsafe() so tokens are not guessable.
        self._recovery_token = f"recover-{self.session_id}-{int(time())}"
        return self._recovery_token

    def attempt_recovery(self, token: str) -> bool:
        """Validate the recovery token and resume the session."""
        if self.state != AgentState.PAUSED:
            return False
        if token != self._recovery_token:
            return False
        if time() - (self._pause_start or 0.0) > self.recovery_timeout:
            return False
        self.state = AgentState.ACTIVE
        self._pause_start = None
        return True

    async def wait_for_recovery(self, on_timeout: Callable[[str], Awaitable[None]]):
        """Wait out the recovery window, then end the session if still paused.

        Sleeps the full window even after a successful reconnect; the state
        check below makes the late wake-up a no-op in that case."""
        await asyncio.sleep(self.recovery_timeout)
        if self.state == AgentState.PAUSED:
            self.state = AgentState.ENDED
            await on_timeout(self.session_id)

Here’s how it works in practice:

  1. Participant disconnects. The SFU detects the WebRTC connection drop and fires a participant_disconnected event. The agent generates a recovery token and enters PAUSED state.
  2. 120-second window. The agent stays alive, maintaining its S2S connection and conversation context. The recovery token is stored server-side and also sent to the client application (which persists it in sessionStorage).
  3. Participant reconnects. The client detects the connection loss, shows a “Reconnecting…” UI, and attempts to rejoin the SFU room with the recovery token. If the token is valid and the timeout hasn’t expired, the agent resumes from exactly where it left off.
  4. Timeout expires. If 120 seconds pass with no reconnection, the agent ends the session gracefully — delivering a closing statement to the empty room (which gets recorded), then triggering the post-processing pipeline on whatever data was collected.
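
To make the lifecycle concrete, here is a condensed, standalone version of the handler exercised through a bad token, a successful reconnect, and an expired window. For illustration, the expiry check is folded into `reconnect` and time is injectable; the real handler handles expiry in its async timeout task:

```python
from dataclasses import dataclass
from enum import Enum
from time import time
from typing import Optional

class State(str, Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    ENDED = "ended"

@dataclass
class RecoveryDemo:
    """Condensed recovery handler for a standalone walkthrough."""
    session_id: str
    recovery_timeout: float = 120.0
    state: State = State.ACTIVE
    _pause_start: Optional[float] = None
    _token: Optional[str] = None

    def disconnect(self) -> str:
        self.state = State.PAUSED
        self._pause_start = time()
        self._token = f"recover-{self.session_id}-{int(time())}"
        return self._token

    def reconnect(self, token: str, now: Optional[float] = None) -> bool:
        now = now if now is not None else time()
        if self.state != State.PAUSED or token != self._token:
            return False
        if now - (self._pause_start or 0.0) > self.recovery_timeout:
            self.state = State.ENDED  # window expired; session ends gracefully
            return False
        self.state = State.ACTIVE
        return True
```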

The 120-second timeout is a balance. Too short and legitimate reconnections fail. Too long and you’re paying for idle S2S connections (the provider charges for the open session even when no audio is flowing). At $0.074-0.90 per minute depending on provider, a 2-minute idle window costs $0.15-1.80. We eat that cost to preserve the session data.

In production, roughly 8% of sessions experience at least one disconnection. Of those, about 70% successfully recover within the 120-second window. The remaining 30% either have persistent network issues or the participant chose to leave. Session recovery turned what would have been an 8% data loss rate into a 2.4% loss rate — a meaningful improvement for research data quality.

Provider Failover Under Load

S2S providers have outages. OpenAI’s status page shows the history — partial degradations, elevated error rates, full outages. Gemini has rate limits that can throttle you during peak usage periods. When you’re running 200 concurrent sessions and a provider starts returning errors, you need a plan.

Our failover strategy is simple by design: failover happens at session-creation time only. You cannot swap S2S providers mid-session — the conversation context, the audio history, the system prompt interpretation are all bound to the specific provider session. Switching mid-stream would mean starting over with a new model that has no memory of the conversation so far.

from dataclasses import dataclass, field

@dataclass
class ProviderHealthCheck:
    """Monitors S2S provider health and routes new sessions accordingly."""
    providers: list[str] = field(default_factory=lambda: ["openai_realtime", "gemini_live"])
    health: dict[str, bool] = field(default_factory=dict)
    error_counts: dict[str, int] = field(default_factory=dict)
    last_check: dict[str, float] = field(default_factory=dict)  # updated by the background loop
    check_interval: float = 30.0  # seconds between health checks
    error_threshold: int = 3      # consecutive errors before marking unhealthy

    def __post_init__(self):
        for p in self.providers:
            self.health[p] = True
            self.error_counts[p] = 0
            self.last_check[p] = 0.0

    def record_error(self, provider: str):
        self.error_counts[provider] = self.error_counts.get(provider, 0) + 1
        if self.error_counts[provider] >= self.error_threshold:
            self.health[provider] = False

    def record_success(self, provider: str):
        self.error_counts[provider] = 0
        self.health[provider] = True

    def get_available_provider(self, preferred: str = "openai_realtime") -> str | None:
        """Return the preferred provider if healthy, otherwise fall back."""
        if self.health.get(preferred, False):
            return preferred
        for p in self.providers:
            if self.health.get(p, False):
                return p
        return None  # all providers down; alert ops immediately

The health check runs as a background daemon, pinging each provider’s session creation endpoint every 30 seconds. Three consecutive errors mark a provider as unhealthy. One success marks it healthy again.

When a new session request comes in:

  1. Check the preferred provider (usually determined by the project configuration — some studies are configured for OpenAI, some for Gemini).
  2. If the preferred provider is healthy, use it.
  3. If not, fall back to the alternative.
  4. If all providers are down, queue the session and alert operations.

This is intentionally conservative. We don’t do weighted routing or gradual traffic shifting. For research, reliability matters more than optimization. If OpenAI is having issues, send everything to Gemini until OpenAI recovers. The cost difference between providers is smaller than the cost of a failed session.

One nuance: when a provider recovers, we don’t immediately route all traffic back. We let new sessions trickle back — first 10%, then 50%, then 100% — over about 15 minutes. This prevents a thundering herd if the recovery is fragile.

Operational Metrics That Matter

At 200 concurrent sessions, you can’t watch individual sessions. You need dashboards. After iterating through several metric sets, we settled on six numbers that tell you whether the platform is healthy:

from dataclasses import dataclass

@dataclass
class PlatformMetrics:
    """The six metrics that matter for voice AI at scale."""

    # 1. Concurrent sessions (gauge) — "how busy are we right now?"
    #    Normal: 0-300. Alert: >400 (approaching capacity)
    concurrent_sessions: int = 0

    # 2. Time-to-first-voice (histogram) — "how fast does the AI start talking?"
    #    Target: <2s. Alert: p95 >3s
    ttfv_p50_ms: float = 0.0
    ttfv_p95_ms: float = 0.0

    # 3. AI response latency (histogram) — "how fast does the AI respond to speech?"
    #    Target: p95 <500ms. Alert: p95 >800ms
    ai_latency_p50_ms: float = 0.0
    ai_latency_p95_ms: float = 0.0

    # 4. Provider error rate (counter) — "are providers healthy?"
    #    Target: <0.5%. Alert: >2%
    provider_error_rate: float = 0.0

    # 5. Enrichment queue depth (gauge) — "is post-processing keeping up?"
    #    Target: <100. Alert: >500
    enrichment_queue_depth: int = 0

    # 6. Average cost per session (rolling) — "are we on budget?"
    #    Target varies by project. Alert: >150% of project target
    avg_cost_per_session: float = 0.0

Let me explain why each one made the cut and what we dropped:

Concurrent sessions is the capacity indicator. It tells you whether you’re approaching infrastructure limits. At 200 concurrent, we’re comfortable. At 400, we start pre-warming additional agent workers. At 500, something is probably wrong (sessions not ending properly — the zombie agent problem from Part 2).

Time-to-first-voice is the participant experience metric. When someone joins a research session, they expect the AI to greet them within 2 seconds. If TTFV creeps above 3 seconds, participants think the connection failed and start clicking buttons. We measure this from the WebRTC track_subscribed event (participant audio is flowing) to the first audio frame from the AI agent.

AI response latency measures the conversational quality. This is the gap between a participant finishing a sentence and the AI starting its response. S2S models are fast here — typically 300-500ms — but degradation under load is the early warning sign of provider issues. We track p50 and p95 separately because the p95 is what catches intermittent problems.

Provider error rate is the failover trigger. Anything above 2% on a 5-minute rolling window triggers provider health check escalation. We count connection failures, timeout errors, and malformed responses. Normal operation runs at 0.1-0.3% error rate.

Enrichment queue depth tells you whether post-processing is keeping up. If this number climbs above 500, the enrichment workers are falling behind — usually because a batch of sessions ended simultaneously. It’s the canary for the enrichment bottleneck we fixed earlier.

Average cost per session is the financial health check. We compute this on a rolling 1-hour window. If cost per session is significantly above the project’s target rate, something changed — maybe the AI is generating longer responses, maybe sessions are running longer than expected, maybe the provider raised prices.

We export all six metrics to Prometheus and display them on Grafana dashboards. LiveKit also publishes its own metrics for the SFU layer — the LiveKit monitoring guide covers what’s available. We combine LiveKit’s transport metrics with our application metrics on a single dashboard so the on-call engineer sees the full picture.

What we dropped from earlier iterations: per-session CPU usage (too noisy), transcript word count (vanity metric), participant satisfaction scores (too delayed to be operational). The six metrics above are what actually get looked at during incidents.

What We Would Build Differently at Day Zero

Every system has its “if I started over” list. Here’s ours, after 18 months of running voice AI research at scale:

1. Self-host the SFU from day one. We started with LiveKit Cloud because it was faster to set up. Migrating to self-hosted later meant changing infrastructure, updating deployment scripts, and running parallel systems during the transition. The economics clearly favor self-hosting above ~700 sessions/month (Part 5), and most serious research operations hit that number within their first quarter. Start self-hosted, use managed as your failover target, not the other way around.

2. Build the post-processing pipeline before the agent. We built the conversational agent first because it was the exciting part. Then we realized we had hundreds of recordings piling up with no automated processing. Build recording extraction, transcription, and enrichment first. Test it with synthetic data. Then build the agent that produces real data for it. The pipeline is where the research value lives — the agent is just the data collection mechanism.

3. Per-session cost tracking from the first session. We added cost tracking at session ~500. Retroactively computing costs for the first 500 sessions was painful and imprecise. Token-level logging from day one costs nothing to implement and saves enormous pain later.

4. Multi-phase state machine even for simple protocols. Our first agent had no state machine — it was a single-phase “just talk” design. Adding the state machine later (Part 3) required refactoring the agent, the prompt system, and the session management. Even the simplest research protocol benefits from at least three phases: introduction, main conversation, wrap-up. Build the state machine skeleton from the start and populate the phases as the protocol develops.
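
A skeleton along those lines, with the three phases named above. The linear transition table here is a placeholder for the LLM-driven transitions from Part 3; the point is that the structure exists from day one:

```python
from enum import Enum

class Phase(str, Enum):
    INTRODUCTION = "introduction"
    MAIN = "main_conversation"
    WRAP_UP = "wrap_up"
    DONE = "done"

# Legal transitions: a linear protocol to start with; branches come later.
TRANSITIONS: dict[Phase, Phase] = {
    Phase.INTRODUCTION: Phase.MAIN,
    Phase.MAIN: Phase.WRAP_UP,
    Phase.WRAP_UP: Phase.DONE,
}

class ProtocolStateMachine:
    """Minimal phase skeleton; populate per-phase prompts as the protocol develops."""
    def __init__(self):
        self.phase = Phase.INTRODUCTION

    def advance(self) -> Phase:
        if self.phase not in TRANSITIONS:
            raise ValueError(f"cannot advance from terminal phase {self.phase}")
        self.phase = TRANSITIONS[self.phase]
        return self.phase
```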

Series Recap

This 8-part series covers the full arc of taking voice AI from proof-of-concept to production research platform:

  • Part 1 laid out the architecture: server-side agents, metadata transport via SFU data channels, and why provider selection matters more than you think.
  • Part 2 covered the production bugs that cost weeks: zombie agents, pre-warming failures, and the five issues you will encounter regardless of how careful you are.
  • Part 3 introduced the multi-phase state machine: research protocol as code, LLM-driven transitions between phases, and how to maintain conversational quality while enforcing structure.
  • Part 4 built the post-interview pipeline: from raw recording to structured research insights, automatically.
  • Part 5 tackled costs: real-time tracking, provider comparison, budget enforcement, and the self-hosting math.
  • This post (Part 6) covers what breaks at scale and the operational metrics that keep it visible.
  • Part 7 tackles multi-language support: language detection, provider routing, locale-aware VAD tuning, and cross-language analysis pipelines.
  • Part 8 closes the series with the full deployment guide: Docker, Kubernetes, CI/CD, zero-downtime deploys, and the go-live checklist.

For the broader reference architecture — cascaded vs S2S pipelines, framework comparison, multi-provider support, recording and compliance, and the full interview lifecycle — see the 12-part Voice AI Interview Playbook. That series is the foundation. This series is what you build on top of it.

Looking forward: the S2S model landscape is evolving fast. OpenAI and Google are both iterating on latency, cost, and capabilities. Multimodal models that process video alongside audio will open new research methodologies — facial expression analysis during conversation, document sharing during sessions, screen-based tasks integrated with voice interaction. The architecture patterns in this series — state machines, recovery tokens, provider abstraction, cost tracking — will transfer directly to those next-generation platforms.

The most important thing I’ve learned building this: voice AI for research is not a model problem. The models are good enough today. It’s an engineering problem — reliability, cost control, data quality, and operational visibility at scale. Solve those, and the models do their job.


This is Part 6 of an 8-part series: Production Voice AI for Research at Scale.

Series outline:

  1. The Architecture Nobody Warns You About — Server-side agents, metadata transport, provider selection (Part 1)
  2. Zombie Agents, Pre-Warming, and the 5 Bugs That Cost Us Weeks — Production pain points and fixes (Part 2)
  3. Multi-Phase State Machines — Research protocol as code, LLM-driven transitions (Part 3)
  4. From Recording to Insight — The automatic post-interview pipeline (Part 4)
  5. The Real Cost — Per-minute tracking, budgets, self-hosting math (Part 5)
  6. What Breaks at 200 Concurrent Sessions — Scaling bottlenecks and operational metrics (Part 6)
  7. Multi-Language Voice AI — Language detection, provider routing, locale-aware VAD, i18n prompts (Part 7)
  8. Deployment and Go-Live — Docker, Kubernetes, CI/CD, zero-downtime deploys, monitoring (Part 8)

