In Part 9, we handled the compliance layer — recording consent, encrypted storage, GDPR-compliant data lifecycle, and the audit trail that keeps legal teams happy. Now we need to handle a different kind of problem: what happens when your pilot of 10 concurrent interviews becomes a company-wide rollout of 10,000?
Scaling real-time voice is not like scaling a REST API. You cannot just add more web servers behind a load balancer and call it done. Voice sessions carry continuous bidirectional audio streams, hold shared state across their entire duration, and are acutely sensitive to the latency introduced by any architectural decision. A 50ms routing hop that’s invisible in an HTTP response is catastrophically obvious in a live conversation.
This post walks through the infrastructure patterns that scale voice AI from prototype to production at thousands of concurrent sessions — including the mistakes that seem obvious in retrospect and the metrics that actually matter for voice workloads.
Why Scaling Voice Is Different
Before the architecture, let’s ground ourselves in what makes voice a unique scaling challenge.
Continuous streams, not discrete requests. A REST endpoint handles one request and releases resources. A voice session holds an open WebRTC or WebSocket connection for 20-60 minutes while consuming CPU (audio encoding/decoding), network bandwidth (48kbps per stream), and memory (jitter buffer, session state). Every concurrent session is a long-lived resource consumer.
Latency is the service level, not a metric. For a web app, p95 latency of 200ms is excellent. For voice, any server-introduced latency above 50ms is perceptible to users. You’re not optimizing for throughput — you’re optimizing for consistently low latency across all sessions simultaneously.
Bursty, not uniform. Hiring patterns create massive spikes. A company running campus recruiting might have zero voice interviews one week and 500 simultaneous sessions during an on-campus event. Your infrastructure has to scale up in minutes, not hours.
State cannot be on the same server as the stream. In a web application, you can route users to any server because their requests are stateless or sessions are in a database. In voice, if the server handling your audio stream disappears, your session disappears. You need to separate stream handling from session state.
These constraints drive every architectural decision that follows.
The LiveKit SFU Architecture
LiveKit is the media server layer of our stack, and understanding how it scales is foundational to everything else.
What a Selective Forwarding Unit Does
A Selective Forwarding Unit (SFU) receives audio/video streams from participants and selectively forwards them to other participants, without mixing or transcoding. In a two-person interview, this is simple: SFU receives audio from the candidate, forwards it to the AI agent, receives audio from the AI agent, forwards it to the candidate.
The reason this matters for scaling: SFUs are stateless-ish. They hold routing state for active sessions, but that routing state is small and fast to reconstruct. They do not transcode audio (which is CPU-intensive). They are primarily network throughput bottlenecks, not compute bottlenecks.
Horizontal SFU Scaling
A single LiveKit server handles roughly 500-1000 concurrent two-party sessions before it saturates (at 96kbps per session, a 1Gbps network tops out near 10,000 sessions on paper, but packet-processing and CPU overhead bind long before bandwidth does). You want to scale out before you hit those limits, and you want redundancy.
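The parenthetical arithmetic is worth spelling out. The figures are the article's assumptions (96kbps per two-party session, 1Gbps NIC):

```python
# Back-of-envelope check of the bandwidth ceiling quoted above.
PER_SESSION_KBPS = 96     # two 48kbps audio streams per interview
NIC_KBPS = 1_000_000      # 1 Gbps

bandwidth_ceiling = NIC_KBPS // PER_SESSION_KBPS
print(bandwidth_ceiling)  # 10416 — but per-packet overhead and CPU bind
                          # first, hence the practical 500-1000 figure
```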
LiveKit supports horizontal scaling via a distributed architecture. Multiple SFU nodes form a mesh, with room routing coordinated through Redis:
              ┌───────────────────────────────┐
              │        LiveKit Cluster        │
              └──────┬─────────────────┬──────┘
                     │                 │
        ┌────────────┴──────┐   ┌──────┴─────────────┐
        │    SFU Node 1     │   │     SFU Node 2     │
        │   (us-east-1a)    │◄─►│   (us-east-1b)     │
        │     20 rooms      │   │     18 rooms       │
        └─────────┬─────────┘   └─────────┬──────────┘
                  │                       │
                  └───────────┬───────────┘
                              │
                    ┌─────────┴─────────┐
                    │   Redis Cluster   │
                    │   (room routing   │
                    │      state)       │
                    └───────────────────┘
When a new room is created, LiveKit’s control plane (the node that received the request) selects the SFU with the lowest load and registers the room-to-node mapping in Redis. Subsequent connections to that room — whether from a browser client or an AI agent — are routed to the correct SFU node.
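LiveKit performs this selection internally against its Redis-backed node registry, but the placement logic is worth seeing in miniature. This is an illustrative sketch, not LiveKit's actual code:

```python
# Illustrative only: lowest-load room placement, as LiveKit's control
# plane does internally via its node registry in Redis.
def select_sfu_node(node_loads: dict[str, int]) -> str:
    """Pick the node with the fewest active rooms."""
    return min(node_loads, key=node_loads.get)

# Mirrors the diagram above: node 2 carries fewer rooms, so it wins.
placement = select_sfu_node({"sfu-node-1": 20, "sfu-node-2": 18})
print(placement)  # sfu-node-2
```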
Room Routing Configuration
Here is the LiveKit server configuration for a production cluster:
# livekit-config.yaml
port: 7880

rtc:
  tcp_port: 7881
  port_range_start: 50000
  port_range_end: 60000
  use_external_ip: true

redis:
  address: redis-cluster:6379
  # Use Redis cluster mode for HA
  cluster_addresses:
    - redis-1:6379
    - redis-2:6379
    - redis-3:6379

room:
  auto_create: false
  # Auto-close empty rooms after 30 seconds
  empty_timeout: 30
  # Max participants per room (candidate + agent + optional observer)
  max_participants: 4

turn:
  # TURN server for candidates behind strict NAT
  enabled: true
  domain: turn.yourcompany.com
  cert_file: /etc/livekit/tls.crt
  key_file: /etc/livekit/tls.key

keys:
  # LiveKit expects a map of API key -> API secret
  ${LIVEKIT_API_KEY}: ${LIVEKIT_API_SECRET}
The TURN Server Layer
Many enterprise candidates sit behind restrictive corporate firewalls that block UDP entirely. Without TURN (Traversal Using Relays around NAT) servers, their WebRTC connections fail silently or fall back to STUN-only with poor results.
TURN servers relay audio traffic when direct peer-to-peer paths are unavailable. They are the one component you absolutely must deploy in each region alongside your SFU nodes. coturn is the standard open-source choice:
# coturn configuration (turnserver.conf)
listening-port=3478
tls-listening-port=5349
external-ip=YOUR_EXTERNAL_IP
realm=turn.yourcompany.com
server-name=turn.yourcompany.com
# Short-lived REST-style credentials, derived from a secret shared
# with your token service — LiveKit generates these automatically
use-auth-secret
static-auth-secret=YOUR_SHARED_SECRET
cert=/etc/livekit/tls.crt
pkey=/etc/livekit/tls.key
In production, TURN-relayed sessions account for roughly 15-25% of all connections, depending on your candidate demographics. Enterprise candidates connecting through corporate networks push the rate even higher. Budget for it.
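A rough sizing sketch helps with that budgeting. The figures below are assumptions carried over from this series (96kbps per session, 45-minute interviews, a 25% relay rate at the high end):

```python
# Rough TURN relay egress sizing. All inputs are assumptions, not
# measurements — swap in your own relay rate and interview length.
def relayed_mb_per_interview(kbps: int = 96, minutes: int = 45) -> float:
    kilobits = kbps * minutes * 60
    return kilobits / 8 / 1000  # kilobits -> megabytes

def monthly_relay_gb(interviews_per_month: int, relay_rate: float = 0.25) -> float:
    return interviews_per_month * relay_rate * relayed_mb_per_interview() / 1000

# ~32.4 MB relayed per interview; 10,000 interviews/month ≈ 81 GB of
# TURN egress to price into your cloud bill.
print(relayed_mb_per_interview(), monthly_relay_gb(10_000))
```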
Stateless Agent Workers
The AI agent side of the architecture has completely different scaling properties from the SFU layer, and this distinction is critical.
The Stateful Trap
The naive implementation couples the agent worker to the session: one process per interview, running for the duration. This works at 10 concurrent sessions. It becomes a management nightmare at 10,000, because:
- You cannot restart a worker without dropping an active interview
- You cannot scale workers without complex session migration logic
- A bug that affects one session can cascade to all sessions on that worker
- Capacity planning becomes a puzzle of “how many 45-minute interviews fit on this server”
The correct approach: make agent workers stateless, like a web application server.
Stateless Agent Design
Stateless agent workers do not hold session state in memory. They hold:
- The active WebRTC/WebSocket connection to the voice AI provider (OpenAI, Bedrock, Grok)
- The active LiveKit room connection
- A reference to where the session state lives (Redis, database)
All durable state — conversation history, interview progress, candidate information, rubric scores — lives in Redis with a TTL matching the maximum session duration.
# agent_worker.py — stateless session handler
import asyncio
import json
import os

import redis.asyncio as aioredis
from livekit import agents, rtc
from livekit.agents import JobContext

redis_client = aioredis.from_url(
    os.environ["REDIS_URL"],
    decode_responses=True
)

async def entrypoint(ctx: JobContext):
    """
    Each interview session runs in this context.
    No state is stored in the worker process itself.
    """
    session_id = ctx.room.name

    # Load all session state from Redis at startup
    session_state = await load_session_state(session_id)
    if not session_state:
        # New session — initialize from database
        session_state = await initialize_session(session_id, ctx)
        await save_session_state(session_id, session_state)

    # Connect to voice AI provider
    voice_client = await connect_voice_provider(
        provider=session_state["provider"],
        config=session_state["provider_config"]
    )

    try:
        await run_interview_session(ctx, voice_client, session_state, session_id)
    finally:
        # Save final state before worker terminates
        await save_session_state(session_id, session_state)
        await voice_client.disconnect()

async def load_session_state(session_id: str) -> dict | None:
    data = await redis_client.get(f"session:{session_id}")
    return json.loads(data) if data else None

async def save_session_state(session_id: str, state: dict):
    # TTL of 90 minutes (5400s) — covers the longest interview plus buffer
    await redis_client.setex(
        f"session:{session_id}",
        5400,
        json.dumps(state)
    )
With this design, worker processes can be killed and restarted freely. A new worker process picks up from Redis where the old one left off. This is the foundation for Kubernetes-based auto-scaling.
Kubernetes Orchestration
Kubernetes is the right choice for agent worker orchestration at scale. Here is the complete deployment pattern.
Deployment Configuration
# agent-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-agent-workers
  namespace: voice-interview
spec:
  # Initial replica count — auto-scaler will adjust
  replicas: 5
  selector:
    matchLabels:
      app: voice-agent-worker
  template:
    metadata:
      labels:
        app: voice-agent-worker
      annotations:
        # Prometheus scrape configuration
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: agent-worker
          image: yourregistry/voice-agent-worker:latest
          resources:
            requests:
              # Voice processing is I/O-bound, not CPU-bound
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          env:
            - name: LIVEKIT_URL
              valueFrom:
                secretKeyRef:
                  name: livekit-secrets
                  key: url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: redis-secrets
                  key: url
            - name: MAX_CONCURRENT_SESSIONS
              value: "10"
          # Graceful shutdown — finish active sessions
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5 && kill -SIGTERM 1"]
      # Don't kill workers in the middle of sessions
      terminationGracePeriodSeconds: 3600
      # Anti-affinity: spread workers across availability zones
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: [voice-agent-worker]
                topologyKey: topology.kubernetes.io/zone
Custom Metrics Auto-Scaling
CPU-based auto-scaling does not work for voice workloads. Voice processing is I/O-bound — a worker handling 10 concurrent sessions will show 10% CPU utilization. CPU metrics will never trigger scaling when you need it most.
The correct metrics are:
- Active room count: how many concurrent sessions exist right now
- Worker queue depth: how many rooms are waiting for an agent worker
- P95 audio latency: the end-to-end latency across all active sessions
- Worker utilization: sessions per worker vs maximum sessions per worker
Expose these from your agent workers via a Prometheus metrics endpoint:
# metrics.py
from prometheus_client import Gauge, Histogram, start_http_server

# Expose these metrics for the HPA
active_sessions = Gauge(
    'voice_agent_active_sessions_total',
    'Number of currently active interview sessions'
)

session_capacity = Gauge(
    'voice_agent_session_capacity',
    'Maximum sessions this worker can handle'
)

audio_latency = Histogram(
    'voice_audio_latency_ms',
    'End-to-end audio processing latency',
    buckets=[50, 100, 150, 200, 300, 500, 1000, 2000]
)

def record_session_started():
    active_sessions.inc()

def record_session_ended():
    active_sessions.dec()

def record_audio_latency(latency_ms: float):
    audio_latency.observe(latency_ms)

# Start metrics server alongside main agent process
start_http_server(8080)
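The utilization metric the HPA consumes has to be derived from these gauges somewhere. Assuming a Prometheus Adapter (or similar) bridges Prometheus into the Kubernetes external metrics API, one approach is a recording rule — the metric name here must match whatever your adapter exposes:

```yaml
# prometheus-recording-rules.yaml — derive the HPA's utilization metric
# from the per-worker gauges above (name must match your adapter config)
groups:
  - name: voice_agent_derived
    rules:
      - record: voice_agent_session_utilization
        expr: >
          100 * sum(voice_agent_active_sessions_total)
            / sum(voice_agent_session_capacity)
```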
Then configure the HPA to scale on these custom metrics:
# agent-worker-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent-worker-hpa
  namespace: voice-interview
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent-workers
  minReplicas: 3
  maxReplicas: 200
  metrics:
    # Primary: scale on session utilization, not CPU
    - type: External
      external:
        metric:
          # (active_sessions / capacity) across all pods
          name: voice_agent_session_utilization
        target:
          type: AverageValue
          averageValue: "70"   # Scale when 70% of capacity is used
    # Secondary: scale on queue depth (pending rooms waiting for workers)
    - type: External
      external:
        metric:
          name: livekit_rooms_without_agent
        target:
          type: Value
          value: "5"           # Scale if more than 5 rooms have no agent
  behavior:
    scaleUp:
      # Fast scale-up for burst hiring events
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      # Slow scale-down to avoid disrupting active sessions
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
The scale-up is fast (100% increase per minute is possible) and scale-down is slow (2 pods per 2 minutes). This asymmetry protects active sessions during demand drops while allowing rapid response to hiring surges.
ECS Fargate as an Alternative
If your team is not running Kubernetes already, ECS Fargate is a reasonable alternative. You trade fine-grained control for operational simplicity. Fargate spins up containers in seconds, handles the underlying infrastructure, and auto-scales based on target tracking policies.
The main limitation: Fargate does not support custom metrics-based scaling natively. You need Application Auto Scaling with CloudWatch custom metrics, which requires more infrastructure to set up than the Kubernetes HPA approach.
For teams below 100K interview minutes per month, Fargate simplicity often wins over Kubernetes flexibility.
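If you do take the Fargate route, the CloudWatch side looks roughly like the sketch below. The namespace and metric name are placeholders — they just have to match your Application Auto Scaling policy:

```python
import datetime

# Sketch: build the custom CloudWatch datum that a Fargate target-tracking
# policy would consume. "SessionUtilization" is a placeholder name.
def utilization_datum(active: int, capacity: int) -> dict:
    return {
        "MetricName": "SessionUtilization",
        "Value": 100.0 * active / capacity,
        "Unit": "Percent",
        "Timestamp": datetime.datetime.now(datetime.timezone.utc),
    }

# Each worker would publish this on a timer, e.g.:
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="VoiceInterview",
#     MetricData=[utilization_datum(active=7, capacity=10)],
# )
```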
Regional Deployment
Network latency from a candidate’s browser to the SFU node is additive with everything else in your latency budget. A candidate in Singapore connecting to an SFU in us-east-1 adds 150-200ms one-way network latency. That is already your entire latency budget, before audio encoding, AI processing, or anything else.
The rule of thumb: place SFU nodes within 30ms round-trip time of your candidate population. For global deployments, this means at minimum:
Region Coverage Map:
─────────────────────────────────────────────────────────────────
us-east-1      (Virginia)   → North America East, Europe
us-west-2      (Oregon)     → North America West, Asia Pacific
eu-west-1      (Ireland)    → Europe, Middle East
ap-southeast-1 (Singapore)  → Southeast Asia, Australia
ap-northeast-1 (Tokyo)      → Japan, Korea, North Asia

Rule: < 30ms to SFU node means < 150ms end-to-end interview quality
Geographic Routing
Use latency-based DNS routing (AWS Route 53, Cloudflare, or equivalent) to direct each client to the nearest regional cluster:
// Client-side region selection before connecting to LiveKit
async function selectOptimalRegion(): Promise<string> {
  const regionPingTests = [
    { region: 'us-east', url: 'https://sfu-us-east.yourcompany.com/health' },
    { region: 'eu-west', url: 'https://sfu-eu-west.yourcompany.com/health' },
    { region: 'ap-southeast', url: 'https://sfu-ap-se.yourcompany.com/health' },
  ];

  const pingResults = await Promise.all(
    regionPingTests.map(async ({ region, url }) => {
      const start = performance.now();
      try {
        await fetch(url, { method: 'HEAD', cache: 'no-cache' });
        return { region, latency: performance.now() - start };
      } catch {
        return { region, latency: Infinity };
      }
    })
  );

  // Select the lowest-latency healthy region
  const best = pingResults
    .filter(r => r.latency < Infinity)
    .sort((a, b) => a.latency - b.latency)[0];

  return best?.region ?? 'us-east'; // Fallback to primary
}
The backend then creates the LiveKit room on the SFU cluster in the selected region, and the agent worker in that region joins the room.
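A minimal sketch of that backend mapping, with hypothetical endpoints mirroring the client-side list above:

```python
# Hypothetical region -> cluster endpoint map. The backend mints the
# LiveKit token and creates the room against the cluster the client's
# ping test selected.
REGION_ENDPOINTS = {
    "us-east": "wss://sfu-us-east.yourcompany.com",
    "eu-west": "wss://sfu-eu-west.yourcompany.com",
    "ap-southeast": "wss://sfu-ap-se.yourcompany.com",
}

def livekit_url_for(region: str) -> str:
    # Unknown region falls back to the primary, mirroring the client logic
    return REGION_ENDPOINTS.get(region, REGION_ENDPOINTS["us-east"])
```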
Load Testing Voice Systems
Load testing voice is genuinely difficult. HTTP load testing tools (k6, Locust, JMeter) cannot simulate WebRTC sessions. You need specialized tooling.
LiveKit CLI Load Testing
LiveKit provides a CLI tool for simulating concurrent sessions:
# Install livekit-cli
brew install livekit-cli   # or download from GitHub releases

# Simulate 100 concurrent two-party sessions
livekit-cli load-test \
  --url wss://your-livekit-server.com \
  --api-key your_api_key \
  --api-secret your_api_secret \
  --room-count 100 \
  --publishers 1 \
  --subscribers 1 \
  --duration 30m \
  --audio-bitrate 48000

# This creates 100 rooms, each with 1 publisher (simulated candidate)
# and 1 subscriber (simulated agent), running for 30 minutes
Monitor these metrics during load tests:
# Watch your SFU node metrics while load testing
# Watch your SFU node metrics while load testing (quote the pipeline so
# the grep runs inside watch, not on watch's own output)
watch -n 5 'curl -s http://sfu-node:7880/metrics | grep -E "livekit_rooms_total|livekit_participants|livekit_packets_loss"'
Load Testing Checklist
Before declaring a deployment production-ready:
- Sustain target concurrent session count for 60+ minutes (full interview duration)
- Verify p95 audio latency stays below 150ms under load
- Test auto-scaling: ramp from 10% to 100% target capacity in 5 minutes
- Test scale-down: reduce from 100% to 10% without dropping active sessions
- Simulate SFU node failure: verify sessions on failed node reconnect gracefully
- Test Redis failure: verify session state recovery
- Run TURN relay test for 25% of simulated sessions (simulate enterprise NAT)
Database Scaling
The relational database holds interview records, rubric scores, and evaluation results. It is not on the hot path for active voice sessions (Redis holds session state), but it becomes a bottleneck at scale for session initialization and result storage.
Connection Pooling
At 10,000 concurrent sessions, each session initializing from the database means potentially thousands of concurrent database connections. PostgreSQL handles around 500 concurrent connections before performance degrades. PgBouncer solves this:
; pgbouncer.ini
[databases]
interviews_db = host=postgres-primary port=5432 dbname=interviews

[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 50
server_idle_timeout = 600
Transaction-mode pooling is critical here: a connection is held only for the duration of a transaction, not for the entire session. This allows 10,000 application clients to share 50 database connections.
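On the application side, this means letting PgBouncer own the pooling. A sketch with SQLAlchemy (the PgBouncer endpoint is a placeholder):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

def make_pgbouncer_engine(url: str):
    """Engine for use behind PgBouncer in transaction mode.

    NullPool disables SQLAlchemy's client-side pool, so each connection
    is returned the moment it is released — letting PgBouncer do the
    multiplexing instead of stacking two pools on top of each other.
    """
    return create_engine(url, poolclass=NullPool)

# Placeholder endpoint for your PgBouncer instance:
# engine = make_pgbouncer_engine("postgresql://app@pgbouncer:6432/interviews_db")
```

One caveat: transaction-mode pooling breaks session-scoped PostgreSQL features such as server-side prepared statements and advisory locks, so audit any raw SQL that relies on them.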
Read Replicas
Session initialization reads candidate data, job requirements, and interview rubrics. These are read-heavy operations that can go to read replicas:
# db.py — read/write splitting
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Write to primary
write_engine = create_engine(
    os.environ["DATABASE_PRIMARY_URL"],
    pool_size=10,
    max_overflow=20
)

# Read from replica (or round-robin across replicas)
read_engine = create_engine(
    os.environ["DATABASE_REPLICA_URL"],
    pool_size=20,  # More connections for reads
    max_overflow=40
)

WriteSession = sessionmaker(bind=write_engine)
ReadSession = sessionmaker(bind=read_engine)

# Sync helpers — call via asyncio.to_thread from the agent's event loop
# so the blocking DB I/O doesn't stall audio processing.

def get_interview_config(interview_id: str) -> InterviewConfig:
    # This is a pure read — use the replica
    with ReadSession() as session:
        return session.query(InterviewConfig).filter_by(id=interview_id).first()

def save_interview_result(result: InterviewResult):
    # This writes — use the primary
    with WriteSession() as session:
        session.add(result)
        session.commit()
Queue-Based Async Architecture
Not everything in the voice interview pipeline needs to be synchronous with the active session. Evaluation, transcript formatting, and result notification can all be decoupled into async queues.
This is the architecture:
Active Session (real-time, synchronous)
─────────────────────────────────────────────────────────────
LiveKit Room
├── Candidate WebRTC stream
├── Agent Worker
│   ├── Voice AI Provider (OpenAI/Grok/Bedrock)
│   └── Writes: audio chunks to S3 (streaming)
│       session events to SQS
└── Ends → SQS: session.completed event

Post-Session Processing (async, queue-based)
─────────────────────────────────────────────────────────────
SQS: session.completed
├── Transcript Worker
│   └── Runs Whisper/Deepgram on recorded audio
│       → stores transcript in DB
├── Evaluation Worker
│   └── Runs LLM evaluation against rubric
│       → stores scores in DB
│       → triggers interviewer notification
└── Compliance Worker
    └── Processes consent forms
        → schedules data deletion per retention policy
The key insight: nothing in the async queue is time-sensitive. Evaluation can run on spot instances during off-peak hours. Transcript formatting can wait 5 minutes. This is where you recover cost — the real-time infrastructure has to run at full capacity, but the post-processing infrastructure can run at 10% of the cost using cheaper compute.
# sqs_publisher.py — publish session events without blocking the voice session
import asyncio
import json
import os
from datetime import datetime

import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
SESSION_QUEUE_URL = os.environ['SESSION_QUEUE_URL']

async def publish_session_completed(session_id: str, metadata: dict):
    """Non-blocking publish — fire and forget"""
    message = {
        'event': 'session.completed',
        'session_id': session_id,
        'timestamp': datetime.utcnow().isoformat(),
        'metadata': metadata
    }

    # Use executor to avoid blocking the async event loop
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(
        None,
        lambda: sqs.send_message(
            QueueUrl=SESSION_QUEUE_URL,
            MessageBody=json.dumps(message),
            MessageGroupId=session_id  # FIFO queue for ordering
        )
    )
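On the consuming side, a worker drains the queue and dispatches by event type. This is a sketch under stated assumptions — the handler names and queue wiring are hypothetical, with the boto3 polling loop shown in outline:

```python
import json

# Hypothetical dispatch table for the post-session pipeline sketched above.
HANDLERS = {}

def handles(event_name: str):
    def register(fn):
        HANDLERS[event_name] = fn
        return fn
    return register

@handles("session.completed")
def on_session_completed(msg: dict) -> str:
    # Fan out to transcript, evaluation, and compliance work here.
    return msg["session_id"]

def dispatch(body: str) -> str:
    msg = json.loads(body)
    return HANDLERS[msg["event"]](msg)

# Long-polling consumer loop (boto3) — runs fine on cheap spot capacity:
# while True:
#     resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20,
#                                MaxNumberOfMessages=10)
#     for m in resp.get("Messages", []):
#         dispatch(m["Body"])
#         sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
```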
Cost at Scale
Here is the honest cost breakdown across deployment scales. All figures are approximate and reflect Q1 2026 pricing.
Infrastructure Costs by Scale
| Component | 100 concurrent | 1,000 concurrent | 10,000 concurrent |
|---|---|---|---|
| LiveKit Cloud SFU | $150/mo | $1,200/mo | Self-host required |
| Self-hosted SFU (EC2 c6in.4xl) | — | $640/mo (2 nodes) | $3,200/mo (10 nodes) |
| Agent Workers (EKS/ECS) | $80/mo | $600/mo | $4,500/mo |
| Redis Cluster | $50/mo | $200/mo | $800/mo |
| PostgreSQL (RDS Multi-AZ) | $150/mo | $400/mo | $1,200/mo |
| S3 (recordings) | $20/mo | $180/mo | $1,700/mo |
| TURN servers | $40/mo | $320/mo | $2,400/mo |
| Load Balancers / DNS | $20/mo | $60/mo | $240/mo |
| Infrastructure total | ~$510/mo | ~$3,560/mo | ~$14,040/mo |
Note: this is infrastructure only. Voice AI provider costs (OpenAI, Grok, Bedrock) come on top of this and typically run 3-5x the infrastructure cost. We cover that in full in Part 11.
At 10,000 concurrent sessions, self-hosting the SFU layer and running agent workers on spot instances (for non-session-active processing) are the two highest-leverage cost levers.
Monitoring and Alerting
Voice infrastructure requires different monitoring than web applications. Here is the Prometheus/Grafana stack configuration that covers the critical signals.
Key Metrics Dashboard
# prometheus-rules.yaml — alerting rules for voice infrastructure
groups:
  - name: voice_infrastructure
    rules:
      # Alert when audio latency degrades
      - alert: HighAudioLatency
        expr: histogram_quantile(0.95, rate(voice_audio_latency_ms_bucket[5m])) > 200
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P95 audio latency exceeded 200ms"
          description: "P95 latency is {{ $value }}ms. Check SFU node health."

      # Alert on packet loss (usually network/TURN issue)
      - alert: HighPacketLoss
        expr: rate(livekit_packets_lost_total[5m]) / rate(livekit_packets_sent_total[5m]) > 0.02
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Packet loss above 2%"
          description: "Audio quality will be noticeably degraded."

      # Alert when agent worker capacity runs low
      - alert: AgentWorkerCapacityLow
        expr: (sum(voice_agent_active_sessions_total) / sum(voice_agent_session_capacity)) > 0.85
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Agent workers at 85%+ capacity"
          description: "Auto-scaler should be activating. Verify HPA is responding."

      # Alert on provider error rate
      - alert: VoiceProviderErrors
        expr: rate(voice_provider_errors_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Voice AI provider error rate above 5%"
          description: "Provider: {{ $labels.provider }}. May need to activate failover."
Per-Session Observability
Beyond aggregate metrics, you need per-session observability for debugging individual quality issues:
# session_telemetry.py
import structlog
from opentelemetry import trace

from metrics import audio_latency  # Histogram defined in metrics.py above

logger = structlog.get_logger()
tracer = trace.get_tracer(__name__)

class SessionTelemetry:
    def __init__(self, session_id: str, candidate_id: str):
        self.session_id = session_id
        self.candidate_id = candidate_id
        self.log = logger.bind(
            session_id=session_id,
            candidate_id=candidate_id
        )

    def record_audio_received(self, chunk_size: int, timestamp: float):
        self.log.debug("audio_received", chunk_size=chunk_size, ts=timestamp)

    def record_ai_response_latency(self, latency_ms: float, provider: str):
        self.log.info(
            "ai_response",
            latency_ms=latency_ms,
            provider=provider,
            above_threshold=latency_ms > 300
        )
        audio_latency.observe(latency_ms)

    def record_session_quality_event(self, event: str, details: dict):
        self.log.warning("quality_event", event=event, **details)
Store structured session logs in a searchable backend (OpenSearch, Datadog, or similar). When a candidate reports a poor experience, you want to reconstruct exactly what happened in their session within 30 seconds.
Multi-Region Failover
For enterprise deployments, regional outages cannot mean interview outages. The failover architecture routes sessions to the next-nearest healthy region automatically.
Failover Architecture
Primary routing (normal operation):
─────────────────────────────────────────
Candidate (Singapore) → ap-southeast-1 SFU cluster
Candidate (London) → eu-west-1 SFU cluster
Candidate (New York) → us-east-1 SFU cluster
During ap-southeast-1 outage:
─────────────────────────────────────────
Health check fails → Route 53 removes ap-southeast-1
New candidates (Singapore) → us-west-2 SFU cluster (~180ms vs 30ms normal)
Active sessions: attempt reconnect to us-west-2 (graceful degradation)
The active session reconnection is the hard part. When an SFU node fails mid-session, the client’s WebRTC connection breaks. Your client SDK needs to implement reconnection logic:
// livekit-client-reconnect.js
// The LiveKit client SDK retries automatically and emits Reconnecting /
// Reconnected while it works. When its built-in retries are exhausted it
// emits Disconnected — that is the point to fail over to a backup region.
const room = new Room();

room.on(RoomEvent.Reconnecting, () => {
  console.log('Connection lost — attempting reconnect...');
  showCandidateReconnectBanner(); // "Connection interrupted, reconnecting..."
});

room.on(RoomEvent.Reconnected, () => {
  console.log('Reconnected successfully');
  hideCandidateReconnectBanner();
});

room.on(RoomEvent.Disconnected, async () => {
  // Built-in reconnect attempts failed — try the next-nearest region
  const backupRegion = await getBackupRegion(currentRegion);
  const backupUrl = `wss://sfu-${backupRegion}.yourcompany.com`;
  await room.connect(backupUrl, token);
});
From the candidate’s perspective, a regional failover causes a 3-5 second interruption. That is acceptable. A permanent disconnection is not.
The Scaling Hierarchy
When you’re planning your deployment, think in tiers:
Tier 1 (< 50 concurrent sessions): LiveKit Cloud, single-region, managed infrastructure. Do not over-engineer this. You’re still learning your usage patterns.
Tier 2 (50-500 concurrent sessions): LiveKit Cloud multi-region or self-hosted single-region. Kubernetes for agent workers. Custom metrics HPA. Redis for session state.
Tier 3 (500-5,000 concurrent sessions): Self-hosted LiveKit multi-region. Dedicated TURN server pools. Queue-based async processing. Read replicas for the database.
Tier 4 (5,000+ concurrent sessions): Everything in Tier 3, plus: multi-cluster Kubernetes, CDN-optimized WebRTC ICE gathering, custom media server tuning, dedicated spot instance pools for async workloads, and a dedicated infrastructure team.
The biggest mistake I see: teams building Tier 4 infrastructure for Tier 1 traffic. Start with managed services. Move to self-hosted when the managed costs actually justify it. The data in Part 11 will show you exactly where those tipping points are.
In Part 11, we turn from infrastructure capacity to infrastructure cost. The architecture patterns here add up fast — we’ll break down the exact per-minute cost of every component, show where the three major cost tipping points are, and walk through the optimizations that bring a $3.45 per interview cost down to under $1.00 without sacrificing quality.
This is Part 10 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (this post)
- Cost Optimization — From $0.14/min to $0.03/min (Part 11)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)