In Part 9, we handled the compliance layer — recording consent, encrypted storage, GDPR-compliant data lifecycle, and the audit trail that keeps legal teams happy. Now we need to handle a different kind of problem: what happens when your pilot of 10 concurrent interviews becomes a company-wide rollout of 10,000?
Scaling real-time voice is not like scaling a REST API. You cannot just add more web servers behind a load balancer and call it done. Voice sessions carry continuous bidirectional audio streams, hold shared state across their entire duration, and are acutely sensitive to the latency introduced by any architectural decision. A 50ms routing hop that’s invisible in an HTTP response is catastrophically obvious in a live conversation.
This post walks through the infrastructure patterns that scale voice AI from prototype to production at thousands of concurrent sessions — including the mistakes that seem obvious in retrospect and the metrics that actually matter for voice workloads.
Why Scaling Voice Is Different
Before the architecture, let’s ground ourselves in what makes voice a unique scaling challenge.
Continuous streams, not discrete requests. A REST endpoint handles one request and releases resources. A voice session holds an open WebRTC or WebSocket connection for 20-60 minutes while consuming CPU (audio encoding/decoding), network bandwidth (48kbps per stream), and memory (jitter buffer, session state). Every concurrent session is a long-lived resource consumer.
Latency is the service level, not a metric. For a web app, p95 latency of 200ms is excellent. For voice, any server-introduced latency above 50ms is perceptible to users. You’re not optimizing for throughput — you’re optimizing for consistently low latency across all sessions simultaneously.
Bursty, not uniform. Hiring patterns create massive spikes. A company running campus recruiting might have zero voice interviews one week and 500 simultaneous sessions during an on-campus event. Your infrastructure has to scale up in minutes, not hours.
State cannot be on the same server as the stream. In a web application, you can route users to any server because their requests are stateless or sessions are in a database. In voice, if the server handling your audio stream disappears, your session disappears. You need to separate stream handling from session state.
These constraints drive every architectural decision that follows.
The LiveKit SFU Architecture
LiveKit is the media server layer of our stack, and understanding how it scales is foundational to everything else.
What a Selective Forwarding Unit Does
A Selective Forwarding Unit (SFU) receives audio/video streams from participants and selectively forwards them to other participants, without mixing or transcoding. In a two-person interview, this is simple: SFU receives audio from the candidate, forwards it to the AI agent, receives audio from the AI agent, forwards it to the candidate.
The reason this matters for scaling: SFUs are stateless-ish. They hold routing state for active sessions, but that routing state is small and fast to reconstruct. They do not transcode audio (which is CPU-intensive). They are primarily network throughput bottlenecks, not compute bottlenecks.
Horizontal SFU Scaling
A single LiveKit server handles roughly 500-1000 concurrent two-party sessions before it saturates (at 96kbps per session, a 1Gbps network tops out near 10,000 sessions on paper, but packet-processing and CPU overhead bind long before bandwidth does). You want to scale out before you hit those limits, and you want redundancy.
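The parenthetical arithmetic is worth spelling out. The figures are the article's assumptions (96kbps per two-party session, 1Gbps NIC):

```python
# Back-of-envelope check of the bandwidth ceiling quoted above.
PER_SESSION_KBPS = 96     # two 48kbps audio streams per interview
NIC_KBPS = 1_000_000      # 1 Gbps

bandwidth_ceiling = NIC_KBPS // PER_SESSION_KBPS
print(bandwidth_ceiling)  # 10416 — but per-packet overhead and CPU bind
                          # first, hence the practical 500-1000 figure
```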
LiveKit supports horizontal scaling via a distributed architecture. Multiple SFU nodes form a mesh, with room routing coordinated through Redis:
              ┌───────────────────────────────┐
              │        LiveKit Cluster        │
              └──────┬─────────────────┬──────┘
                     │                 │
        ┌────────────┴──────┐   ┌──────┴─────────────┐
        │    SFU Node 1     │   │     SFU Node 2     │
        │   (us-east-1a)    │◄─►│   (us-east-1b)     │
        │     20 rooms      │   │     18 rooms       │
        └─────────┬─────────┘   └─────────┬──────────┘
                  │                       │
                  └───────────┬───────────┘
                              │
                    ┌─────────┴─────────┐
                    │   Redis Cluster   │
                    │   (room routing   │
                    │      state)       │
                    └───────────────────┘
When a new room is created, LiveKit’s control plane (the node that received the request) selects the SFU with the lowest load and registers the room-to-node mapping in Redis. Subsequent connections to that room — whether from a browser client or an AI agent — are routed to the correct SFU node.
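LiveKit performs this selection internally against its Redis-backed node registry, but the placement logic is worth seeing in miniature. This is an illustrative sketch, not LiveKit's actual code:

```python
# Illustrative only: lowest-load room placement, as LiveKit's control
# plane does internally via its node registry in Redis.
def select_sfu_node(node_loads: dict[str, int]) -> str:
    """Pick the node with the fewest active rooms."""
    return min(node_loads, key=node_loads.get)

# Mirrors the diagram above: node 2 carries fewer rooms, so it wins.
placement = select_sfu_node({"sfu-node-1": 20, "sfu-node-2": 18})
print(placement)  # sfu-node-2
```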
Room Routing Configuration
Here is the LiveKit server configuration for a production cluster:
# livekit-config.yaml
port: 7880

rtc:
  tcp_port: 7881
  port_range_start: 50000
  port_range_end: 60000
  use_external_ip: true

redis:
  address: redis-cluster:6379
  # Use Redis cluster mode for HA
  cluster_addresses:
    - redis-1:6379
    - redis-2:6379
    - redis-3:6379

room:
  auto_create: false
  # Auto-close empty rooms after 30 seconds
  empty_timeout: 30
  # Max participants per room (candidate + agent + optional observer)
  max_participants: 4

turn:
  # TURN server for candidates behind strict NAT
  enabled: true
  domain: turn.yourcompany.com
  cert_file: /etc/livekit/tls.crt
  key_file: /etc/livekit/tls.key

keys:
  # LiveKit expects a map of API key -> API secret
  ${LIVEKIT_API_KEY}: ${LIVEKIT_API_SECRET}
The TURN Server Layer
Many enterprise candidates sit behind restrictive corporate firewalls that block UDP entirely. Without TURN (Traversal Using Relays around NAT) servers, their WebRTC connections fail silently or fall back to STUN-only with poor results.
TURN servers relay audio traffic when direct peer-to-peer paths are unavailable. They are the one component you absolutely must deploy in each region alongside your SFU nodes. coturn is the standard open-source choice:
# coturn configuration (turnserver.conf)
listening-port=3478
tls-listening-port=5349
external-ip=YOUR_EXTERNAL_IP
realm=turn.yourcompany.com
server-name=turn.yourcompany.com
# Short-lived REST-style credentials, derived from a secret shared
# with your token service — LiveKit generates these automatically
use-auth-secret
static-auth-secret=YOUR_SHARED_SECRET
cert=/etc/livekit/tls.crt
pkey=/etc/livekit/tls.key
In production, TURN-relayed sessions account for roughly 15-25% of all connections, depending on your candidate demographics. Enterprise candidates connecting through corporate networks push the rate even higher. Budget for it.
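A rough sizing sketch helps with that budgeting. The figures below are assumptions carried over from this series (96kbps per session, 45-minute interviews, a 25% relay rate at the high end):

```python
# Rough TURN relay egress sizing. All inputs are assumptions, not
# measurements — swap in your own relay rate and interview length.
def relayed_mb_per_interview(kbps: int = 96, minutes: int = 45) -> float:
    kilobits = kbps * minutes * 60
    return kilobits / 8 / 1000  # kilobits -> megabytes

def monthly_relay_gb(interviews_per_month: int, relay_rate: float = 0.25) -> float:
    return interviews_per_month * relay_rate * relayed_mb_per_interview() / 1000

# ~32.4 MB relayed per interview; 10,000 interviews/month ≈ 81 GB of
# TURN egress to price into your cloud bill.
print(relayed_mb_per_interview(), monthly_relay_gb(10_000))
```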
Stateless Agent Workers
The AI agent side of the architecture has completely different scaling properties from the SFU layer, and this distinction is critical.
The Stateful Trap
The naive implementation couples the agent worker to the session: one process per interview, running for the duration. This works at 10 concurrent sessions. It becomes a management nightmare at 10,000, because:
- You cannot restart a worker without dropping an active interview
- You cannot scale workers without complex session migration logic
- A bug that affects one session can cascade to all sessions on that worker
- Capacity planning becomes a puzzle of “how many 45-minute interviews fit on this server”
The correct approach: make agent workers stateless, like a web application server.
Stateless Agent Design
Stateless agent workers do not hold session state in memory. They hold:
- The active WebRTC/WebSocket connection to the voice AI provider (OpenAI, Bedrock, Grok)
- The active LiveKit room connection
- A reference to where the session state lives (Redis, database)
All durable state — conversation history, interview progress, candidate information, rubric scores — lives in Redis with a TTL matching the maximum session duration.
# agent_worker.py — stateless session handler
import asyncio
import json
import os

import redis.asyncio as aioredis
from livekit import agents, rtc
from livekit.agents import JobContext

redis_client = aioredis.from_url(
    os.environ["REDIS_URL"],
    decode_responses=True
)

async def entrypoint(ctx: JobContext):
    """
    Each interview session runs in this context.
    No state is stored in the worker process itself.
    """
    session_id = ctx.room.name

    # Load all session state from Redis at startup
    session_state = await load_session_state(session_id)
    if not session_state:
        # New session — initialize from database
        session_state = await initialize_session(session_id, ctx)
        await save_session_state(session_id, session_state)

    # Connect to voice AI provider
    voice_client = await connect_voice_provider(
        provider=session_state["provider"],
        config=session_state["provider_config"]
    )

    try:
        await run_interview_session(ctx, voice_client, session_state, session_id)
    finally:
        # Save final state before worker terminates
        await save_session_state(session_id, session_state)
        await voice_client.disconnect()

async def load_session_state(session_id: str) -> dict | None:
    data = await redis_client.get(f"session:{session_id}")
    return json.loads(data) if data else None

async def save_session_state(session_id: str, state: dict):
    # TTL of 90 minutes (5400s) — covers the longest interview plus buffer
    await redis_client.setex(
        f"session:{session_id}",
        5400,
        json.dumps(state)
    )
With this design, worker processes can be killed and restarted freely. A new worker process picks up from Redis where the old one left off. This is the foundation for Kubernetes-based auto-scaling.
Kubernetes Orchestration
Kubernetes is the right choice for agent worker orchestration at scale. Here is the complete deployment pattern.
Deployment Configuration
# agent-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-agent-workers
  namespace: voice-interview
spec:
  # Initial replica count — auto-scaler will adjust
  replicas: 5
  selector:
    matchLabels:
      app: voice-agent-worker
  template:
    metadata:
      labels:
        app: voice-agent-worker
      annotations:
        # Prometheus scrape configuration
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: agent-worker
          image: yourregistry/voice-agent-worker:latest
          resources:
            requests:
              # Voice processing is I/O-bound, not CPU-bound
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          env:
            - name: LIVEKIT_URL
              valueFrom:
                secretKeyRef:
                  name: livekit-secrets
                  key: url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: redis-secrets
                  key: url
            - name: MAX_CONCURRENT_SESSIONS
              value: "10"
          # Graceful shutdown — finish active sessions
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5 && kill -SIGTERM 1"]
      # Don't kill workers in the middle of sessions
      terminationGracePeriodSeconds: 3600
      # Anti-affinity: spread workers across availability zones
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: [voice-agent-worker]
                topologyKey: topology.kubernetes.io/zone
Custom Metrics Auto-Scaling
CPU-based auto-scaling does not work for voice workloads. Voice processing is I/O-bound — a worker handling 10 concurrent sessions will show 10% CPU utilization. CPU metrics will never trigger scaling when you need it most.
The correct metrics are:
- Active room count: how many concurrent sessions exist right now
- Worker queue depth: how many rooms are waiting for an agent worker
- P95 audio latency: the end-to-end latency across all active sessions
- Worker utilization: sessions per worker vs maximum sessions per worker
Expose these from your agent workers via a Prometheus metrics endpoint:
# metrics.py
from prometheus_client import Gauge, Histogram, start_http_server

# Expose these metrics for the HPA
active_sessions = Gauge(
    'voice_agent_active_sessions_total',
    'Number of currently active interview sessions'
)

session_capacity = Gauge(
    'voice_agent_session_capacity',
    'Maximum sessions this worker can handle'
)

audio_latency = Histogram(
    'voice_audio_latency_ms',
    'End-to-end audio processing latency',
    buckets=[50, 100, 150, 200, 300, 500, 1000, 2000]
)

def record_session_started():
    active_sessions.inc()

def record_session_ended():
    active_sessions.dec()

def record_audio_latency(latency_ms: float):
    audio_latency.observe(latency_ms)

# Start metrics server alongside main agent process
start_http_server(8080)
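The utilization metric the HPA consumes has to be derived from these gauges somewhere. Assuming a Prometheus Adapter (or similar) bridges Prometheus into the Kubernetes external metrics API, one approach is a recording rule — the metric name here must match whatever your adapter exposes:

```yaml
# prometheus-recording-rules.yaml — derive the HPA's utilization metric
# from the per-worker gauges above (name must match your adapter config)
groups:
  - name: voice_agent_derived
    rules:
      - record: voice_agent_session_utilization
        expr: >
          100 * sum(voice_agent_active_sessions_total)
            / sum(voice_agent_session_capacity)
```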
Then configure the HPA to scale on these custom metrics:
# agent-worker-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent-worker-hpa
  namespace: voice-interview
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent-workers
  minReplicas: 3
  maxReplicas: 200
  metrics:
    # Primary: scale on session utilization, not CPU
    - type: External
      external:
        metric:
          # (active_sessions / capacity) across all pods
          name: voice_agent_session_utilization
        target:
          type: AverageValue
          averageValue: "70"   # Scale when 70% of capacity is used
    # Secondary: scale on queue depth (pending rooms waiting for workers)
    - type: External
      external:
        metric:
          name: livekit_rooms_without_agent
        target:
          type: Value
          value: "5"           # Scale if more than 5 rooms have no agent
  behavior:
    scaleUp:
      # Fast scale-up for burst hiring events
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      # Slow scale-down to avoid disrupting active sessions
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
The scale-up is fast (100% increase per minute is possible) and scale-down is slow (2 pods per 2 minutes). This asymmetry protects active sessions during demand drops while allowing rapid response to hiring surges.
ECS Fargate as an Alternative
If your team is not running Kubernetes already, ECS Fargate is a reasonable alternative. You trade fine-grained control for operational simplicity. Fargate spins up containers in seconds, handles the underlying infrastructure, and auto-scales based on target tracking policies.
The main limitation: Fargate does not support custom metrics-based scaling natively. You need Application Auto Scaling with CloudWatch custom metrics, which requires more infrastructure to set up than the Kubernetes HPA approach.
For teams below 100K interview minutes per month, Fargate simplicity often wins over Kubernetes flexibility.
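If you do take the Fargate route, the CloudWatch side looks roughly like the sketch below. The namespace and metric name are placeholders — they just have to match your Application Auto Scaling policy:

```python
import datetime

# Sketch: build the custom CloudWatch datum that a Fargate target-tracking
# policy would consume. "SessionUtilization" is a placeholder name.
def utilization_datum(active: int, capacity: int) -> dict:
    return {
        "MetricName": "SessionUtilization",
        "Value": 100.0 * active / capacity,
        "Unit": "Percent",
        "Timestamp": datetime.datetime.now(datetime.timezone.utc),
    }

# Each worker would publish this on a timer, e.g.:
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="VoiceInterview",
#     MetricData=[utilization_datum(active=7, capacity=10)],
# )
```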
Regional Deployment
Network latency from a candidate’s browser to the SFU node is additive with everything else in your latency budget. A candidate in Singapore connecting to an SFU in us-east-1 adds 150-200ms one-way network latency. That is already your entire latency budget, before audio encoding, AI processing, or anything else.
The rule of thumb: place SFU nodes within 30ms round-trip time of your candidate population. For global deployments, this means at minimum:
Region Coverage Map:
─────────────────────────────────────────────────────────────────
us-east-1      (Virginia)   → North America East, Europe
us-west-2      (Oregon)     → North America West, Asia Pacific
eu-west-1      (Ireland)    → Europe, Middle East
ap-southeast-1 (Singapore)  → Southeast Asia, Australia
ap-northeast-1 (Tokyo)      → Japan, Korea, North Asia

Rule: < 30ms to SFU node means < 150ms end-to-end interview quality
Geographic Routing
Use latency-based DNS routing (AWS Route 53, Cloudflare, or equivalent) to direct each client to the nearest regional cluster:
// Client-side region selection before connecting to LiveKit
async function selectOptimalRegion(): Promise<string> {
  const regionPingTests = [
    { region: 'us-east', url: 'https://sfu-us-east.yourcompany.com/health' },
    { region: 'eu-west', url: 'https://sfu-eu-west.yourcompany.com/health' },
    { region: 'ap-southeast', url: 'https://sfu-ap-se.yourcompany.com/health' },
  ];

  const pingResults = await Promise.all(
    regionPingTests.map(async ({ region, url }) => {
      const start = performance.now();
      try {
        await fetch(url, { method: 'HEAD', cache: 'no-cache' });
        return { region, latency: performance.now() - start };
      } catch {
        return { region, latency: Infinity };
      }
    })
  );

  // Select the lowest-latency healthy region
  const best = pingResults
    .filter(r => r.latency < Infinity)
    .sort((a, b) => a.latency - b.latency)[0];

  return best?.region ?? 'us-east'; // Fallback to primary
}
The backend then creates the LiveKit room on the SFU cluster in the selected region, and the agent worker in that region joins the room.
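A minimal sketch of that backend mapping, with hypothetical endpoints mirroring the client-side list above:

```python
# Hypothetical region -> cluster endpoint map. The backend mints the
# LiveKit token and creates the room against the cluster the client's
# ping test selected.
REGION_ENDPOINTS = {
    "us-east": "wss://sfu-us-east.yourcompany.com",
    "eu-west": "wss://sfu-eu-west.yourcompany.com",
    "ap-southeast": "wss://sfu-ap-se.yourcompany.com",
}

def livekit_url_for(region: str) -> str:
    # Unknown region falls back to the primary, mirroring the client logic
    return REGION_ENDPOINTS.get(region, REGION_ENDPOINTS["us-east"])
```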
Load Testing Voice Systems
Load testing voice is genuinely difficult. HTTP load testing tools (k6, Locust, JMeter) cannot simulate WebRTC sessions. You need specialized tooling.
LiveKit CLI Load Testing
LiveKit provides a CLI tool for simulating concurrent sessions:
# Install livekit-cli
brew install livekit-cli   # or download from GitHub releases

# Simulate 100 concurrent two-party sessions
livekit-cli load-test \
  --url wss://your-livekit-server.com \
  --api-key your_api_key \
  --api-secret your_api_secret \
  --room-count 100 \
  --publishers 1 \
  --subscribers 1 \
  --duration 30m \
  --audio-bitrate 48000

# This creates 100 rooms, each with 1 publisher (simulated candidate)
# and 1 subscriber (simulated agent), running for 30 minutes
Monitor these metrics during load tests:
# Watch your SFU node metrics while load testing
# Watch your SFU node metrics while load testing (quote the pipeline so
# the grep runs inside watch, not on watch's own output)
watch -n 5 'curl -s http://sfu-node:7880/metrics | grep -E "livekit_rooms_total|livekit_participants|livekit_packets_loss"'
Load Testing Checklist
Before declaring a deployment production-ready:
- Sustain target concurrent session count for 60+ minutes (full interview duration)
- Verify p95 audio latency stays below 150ms under load
- Test auto-scaling: ramp from 10% to 100% target capacity in 5 minutes
- Test scale-down: reduce from 100% to 10% without dropping active sessions
- Simulate SFU node failure: verify sessions on failed node reconnect gracefully
- Test Redis failure: verify session state recovery
- Run TURN relay test for 25% of simulated sessions (simulate enterprise NAT)
Database Scaling
The relational database holds interview records, rubric scores, and evaluation results. It is not on the hot path for active voice sessions (Redis holds session state), but it becomes a bottleneck at scale for session initialization and result storage.
Connection Pooling
At 10,000 concurrent sessions, each session initializing from the database means potentially thousands of concurrent database connections. PostgreSQL handles around 500 concurrent connections before performance degrades. PgBouncer solves this:
; pgbouncer.ini
[databases]
interviews_db = host=postgres-primary port=5432 dbname=interviews

[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 50
server_idle_timeout = 600
Transaction-mode pooling is critical here: a connection is held only for the duration of a transaction, not for the entire session. This allows 10,000 application clients to share 50 database connections.
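On the application side, this means letting PgBouncer own the pooling. A sketch with SQLAlchemy (the PgBouncer endpoint is a placeholder):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

def make_pgbouncer_engine(url: str):
    """Engine for use behind PgBouncer in transaction mode.

    NullPool disables SQLAlchemy's client-side pool, so each connection
    is returned the moment it is released — letting PgBouncer do the
    multiplexing instead of stacking two pools on top of each other.
    """
    return create_engine(url, poolclass=NullPool)

# Placeholder endpoint for your PgBouncer instance:
# engine = make_pgbouncer_engine("postgresql://app@pgbouncer:6432/interviews_db")
```

One caveat: transaction-mode pooling breaks session-scoped PostgreSQL features such as server-side prepared statements and advisory locks, so audit any raw SQL that relies on them.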
Read Replicas
Session initialization reads candidate data, job requirements, and interview rubrics. These are read-heavy operations that can go to read replicas:
# db.py — read/write splitting
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Write to primary
write_engine = create_engine(
    os.environ["DATABASE_PRIMARY_URL"],
    pool_size=10,
    max_overflow=20
)

# Read from replica (or round-robin across replicas)
read_engine = create_engine(
    os.environ["DATABASE_REPLICA_URL"],
    pool_size=20,  # More connections for reads
    max_overflow=40
)

WriteSession = sessionmaker(bind=write_engine)
ReadSession = sessionmaker(bind=read_engine)

# Sync helpers — call via asyncio.to_thread from the agent's event loop
# so the blocking DB I/O doesn't stall audio processing.

def get_interview_config(interview_id: str) -> InterviewConfig:
    # This is a pure read — use the replica
    with ReadSession() as session:
        return session.query(InterviewConfig).filter_by(id=interview_id).first()

def save_interview_result(result: InterviewResult):
    # This writes — use the primary
    with WriteSession() as session:
        session.add(result)
        session.commit()
Queue-Based Async Architecture
Not everything in the voice interview pipeline needs to be synchronous with the active session. Evaluation, transcript formatting, and result notification can all be decoupled into async queues.
This is the architecture:
Active Session (real-time, synchronous)
─────────────────────────────────────────────────────────────
LiveKit Room
├── Candidate WebRTC stream
├── Agent Worker
│   ├── Voice AI Provider (OpenAI/Grok/Bedrock)
│   └── Writes: audio chunks to S3 (streaming)
│       session events to SQS
└── Ends → SQS: session.completed event

Post-Session Processing (async, queue-based)
─────────────────────────────────────────────────────────────
SQS: session.completed
├── Transcript Worker
│   └── Runs Whisper/Deepgram on recorded audio
│       → stores transcript in DB
├── Evaluation Worker
│   └── Runs LLM evaluation against rubric
│       → stores scores in DB
│       → triggers interviewer notification
└── Compliance Worker
    └── Processes consent forms
        → schedules data deletion per retention policy
The key insight: nothing in the async queue is time-sensitive. Evaluation can run on spot instances during off-peak hours. Transcript formatting can wait 5 minutes. This is where you recover cost — the real-time infrastructure has to run at full capacity, but the post-processing infrastructure can run at 10% of the cost using cheaper compute.
# sqs_publisher.py — publish session events without blocking the voice session
import asyncio
import json
import os
from datetime import datetime

import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
SESSION_QUEUE_URL = os.environ['SESSION_QUEUE_URL']

async def publish_session_completed(session_id: str, metadata: dict):
    """Non-blocking publish — fire and forget"""
    message = {
        'event': 'session.completed',
        'session_id': session_id,
        'timestamp': datetime.utcnow().isoformat(),
        'metadata': metadata
    }

    # Use executor to avoid blocking the async event loop
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(
        None,
        lambda: sqs.send_message(
            QueueUrl=SESSION_QUEUE_URL,
            MessageBody=json.dumps(message),
            MessageGroupId=session_id  # FIFO queue for ordering
        )
    )
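On the consuming side, a worker drains the queue and dispatches by event type. This is a sketch under stated assumptions — the handler names and queue wiring are hypothetical, with the boto3 polling loop shown in outline:

```python
import json

# Hypothetical dispatch table for the post-session pipeline sketched above.
HANDLERS = {}

def handles(event_name: str):
    def register(fn):
        HANDLERS[event_name] = fn
        return fn
    return register

@handles("session.completed")
def on_session_completed(msg: dict) -> str:
    # Fan out to transcript, evaluation, and compliance work here.
    return msg["session_id"]

def dispatch(body: str) -> str:
    msg = json.loads(body)
    return HANDLERS[msg["event"]](msg)

# Long-polling consumer loop (boto3) — runs fine on cheap spot capacity:
# while True:
#     resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20,
#                                MaxNumberOfMessages=10)
#     for m in resp.get("Messages", []):
#         dispatch(m["Body"])
#         sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
```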
Cost at Scale
Here is the honest cost breakdown across deployment scales. All figures are approximate and reflect Q1 2026 pricing.
Infrastructure Costs by Scale
| Component | 100 concurrent | 1,000 concurrent | 10,000 concurrent |
|---|---|---|---|
| LiveKit Cloud SFU | $150/mo | $1,200/mo | Self-host required |
| Self-hosted SFU (EC2 c6in.4xl) | — | $640/mo (2 nodes) | $3,200/mo (10 nodes) |
| Agent Workers (EKS/ECS) | $80/mo | $600/mo | $4,500/mo |
| Redis Cluster | $50/mo | $200/mo | $800/mo |
| PostgreSQL (RDS Multi-AZ) | $150/mo | $400/mo | $1,200/mo |
| S3 (recordings) | $20/mo | $180/mo | $1,700/mo |
| TURN servers | $40/mo | $320/mo | $2,400/mo |
| Load Balancers / DNS | $20/mo | $60/mo | $240/mo |
| Infrastructure total | ~$510/mo | ~$3,560/mo | ~$14,040/mo |
Note: this is infrastructure only. Voice AI provider costs (OpenAI, Grok, Bedrock) come on top of this and typically run 3-5x the infrastructure cost. We cover that in full in Part 11.
At 10,000 concurrent sessions, self-hosting the SFU layer and running agent workers on spot instances (for non-session-active processing) are the two highest-leverage cost levers.
Monitoring and Alerting
Voice infrastructure requires different monitoring than web applications. Here is the Prometheus/Grafana stack configuration that covers the critical signals.
Key Metrics Dashboard
# prometheus-rules.yaml — alerting rules for voice infrastructure
groups:
  - name: voice_infrastructure
    rules:
      # Alert when audio latency degrades
      - alert: HighAudioLatency
        expr: histogram_quantile(0.95, rate(voice_audio_latency_ms_bucket[5m])) > 200
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P95 audio latency exceeded 200ms"
          description: "P95 latency is {{ $value }}ms. Check SFU node health."

      # Alert on packet loss (usually network/TURN issue)
      - alert: HighPacketLoss
        expr: rate(livekit_packets_lost_total[5m]) / rate(livekit_packets_sent_total[5m]) > 0.02
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Packet loss above 2%"
          description: "Audio quality will be noticeably degraded."

      # Alert when agent worker capacity runs low
      - alert: AgentWorkerCapacityLow
        expr: (sum(voice_agent_active_sessions_total) / sum(voice_agent_session_capacity)) > 0.85
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Agent workers at 85%+ capacity"
          description: "Auto-scaler should be activating. Verify HPA is responding."

      # Alert on provider error rate
      - alert: VoiceProviderErrors
        expr: rate(voice_provider_errors_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Voice AI provider error rate above 5%"
          description: "Provider: {{ $labels.provider }}. May need to activate failover."
Per-Session Observability
Beyond aggregate metrics, you need per-session observability for debugging individual quality issues:
# session_telemetry.py
import structlog
from opentelemetry import trace

from metrics import audio_latency  # Histogram defined in metrics.py above

logger = structlog.get_logger()
tracer = trace.get_tracer(__name__)

class SessionTelemetry:
    def __init__(self, session_id: str, candidate_id: str):
        self.session_id = session_id
        self.candidate_id = candidate_id
        self.log = logger.bind(
            session_id=session_id,
            candidate_id=candidate_id
        )

    def record_audio_received(self, chunk_size: int, timestamp: float):
        self.log.debug("audio_received", chunk_size=chunk_size, ts=timestamp)

    def record_ai_response_latency(self, latency_ms: float, provider: str):
        self.log.info(
            "ai_response",
            latency_ms=latency_ms,
            provider=provider,
            above_threshold=latency_ms > 300
        )
        audio_latency.observe(latency_ms)

    def record_session_quality_event(self, event: str, details: dict):
        self.log.warning("quality_event", event=event, **details)
Store structured session logs in a searchable backend (OpenSearch, Datadog, or similar). When a candidate reports a poor experience, you want to reconstruct exactly what happened in their session within 30 seconds.
Multi-Region Failover
For enterprise deployments, regional outages cannot mean interview outages. The failover architecture routes sessions to the next-nearest healthy region automatically.
Failover Architecture
Primary routing (normal operation):
─────────────────────────────────────────
Candidate (Singapore) → ap-southeast-1 SFU cluster
Candidate (London) → eu-west-1 SFU cluster
Candidate (New York) → us-east-1 SFU cluster
During ap-southeast-1 outage:
─────────────────────────────────────────
Health check fails → Route 53 removes ap-southeast-1
New candidates (Singapore) → us-west-2 SFU cluster (~180ms vs 30ms normal)
Active sessions: attempt reconnect to us-west-2 (graceful degradation)
The active session reconnection is the hard part. When an SFU node fails mid-session, the client’s WebRTC connection breaks. Your client SDK needs to implement reconnection logic:
// livekit-client-reconnect.js
// The LiveKit client SDK retries automatically and emits Reconnecting /
// Reconnected while it works. When its built-in retries are exhausted it
// emits Disconnected — that is the point to fail over to a backup region.
const room = new Room();

room.on(RoomEvent.Reconnecting, () => {
  console.log('Connection lost — attempting reconnect...');
  showCandidateReconnectBanner(); // "Connection interrupted, reconnecting..."
});

room.on(RoomEvent.Reconnected, () => {
  console.log('Reconnected successfully');
  hideCandidateReconnectBanner();
});

room.on(RoomEvent.Disconnected, async () => {
  // Built-in reconnect attempts failed — try the next-nearest region
  const backupRegion = await getBackupRegion(currentRegion);
  const backupUrl = `wss://sfu-${backupRegion}.yourcompany.com`;
  await room.connect(backupUrl, token);
});
From the candidate’s perspective, a regional failover causes a 3-5 second interruption. That is acceptable. A permanent disconnection is not.
The Scaling Hierarchy
When you’re planning your deployment, think in tiers:
Tier 1 (< 50 concurrent sessions): LiveKit Cloud, single-region, managed infrastructure. Do not over-engineer this. You’re still learning your usage patterns.
Tier 2 (50-500 concurrent sessions): LiveKit Cloud multi-region or self-hosted single-region. Kubernetes for agent workers. Custom metrics HPA. Redis for session state.
Tier 3 (500-5,000 concurrent sessions): Self-hosted LiveKit multi-region. Dedicated TURN server pools. Queue-based async processing. Read replicas for the database.
Tier 4 (5,000+ concurrent sessions): Everything in Tier 3, plus: multi-cluster Kubernetes, CDN-optimized WebRTC ICE gathering, custom media server tuning, dedicated spot instance pools for async workloads, and a dedicated infrastructure team.
The biggest mistake I see: teams building Tier 4 infrastructure for Tier 1 traffic. Start with managed services. Move to self-hosted when the managed costs actually justify it. The data in Part 11 will show you exactly where those tipping points are.
In Part 11, we turn from infrastructure capacity to infrastructure cost. The architecture patterns here add up fast — we’ll break down the exact per-minute cost of every component, show where the three major cost tipping points are, and walk through the optimizations that bring a $3.45 per interview cost down to under $1.00 without sacrificing quality.
This is Part 10 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (this post)
- Cost Optimization — From $0.14/min to $0.03/min (Part 11)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)