In Part 7, we tackled multi-language support — per-language VAD profiles, prompt translation workflows, and the cultural nuances that break voice AI in ways no test suite catches. All of that work lives in configuration files and prompt packs that run on… something. Some server, somewhere, that participants connect to at 2 AM in their local time zone, and it has to work.

This final post bridges the gap between “it works on localhost” and “200 researchers are using it simultaneously across three continents.” It’s the deployment story: containers, orchestration, CI/CD, monitoring, and the go-live checklist I wish someone had handed me before our first large-scale study.

The uncomfortable truth about voice AI deployment is that it’s harder than deploying a typical web application. You’re deploying stateful WebRTC sessions alongside stateless APIs alongside async job processors. Each has different scaling characteristics, different failure modes, and different resource requirements. Treating them as a single deployable unit is the fastest path to a 3 AM page.

The Three Services You’re Actually Deploying

When I first deployed the platform, everything was in one container. The web app, the voice agent, and the pipeline workers — all running as processes in a single Docker image. It worked until about 15 concurrent sessions, at which point the voice agents started competing with the pipeline workers for CPU, and session latency spiked from 300ms to 2 seconds.

Voice AI is not a monolith. It’s three distinct services:

1. Web Application (Next.js/React). This serves the participant UI, the researcher dashboard, and the REST API for session management. It’s stateless, horizontally scalable, and the least interesting thing to deploy. Two replicas behind a load balancer, and you’re done. Memory footprint: ~150MB per instance. CPU usage: negligible unless you’re doing server-side rendering.

2. Voice Agent (Python/LiveKit). This is the stateful core. Each active session runs a dedicated agent process that maintains a WebSocket connection to the AI provider (OpenAI Realtime or Gemini Live) and a WebRTC connection to the participant via the SFU. One process per active session. Each process consumes 200-500MB of memory depending on the conversation length and the size of the context window. CPU usage spikes during audio processing and VAD. You cannot share an agent process between sessions — the provider connection is session-scoped.

3. Pipeline Workers (Python/Node.js). These are the async job processors from Part 4 — transcription, enrichment, sentiment analysis, report generation. They pull jobs from a Redis queue, process them, and write results to the database. They’re CPU-intensive during transcription (if running local Whisper) and I/O-bound during enrichment API calls. They have no direct participant interaction, so latency tolerance is higher.

Each service scales differently. The web app scales on request rate. The voice agent scales on concurrent sessions. The pipeline workers scale on queue depth. Coupling them in one container means you can’t scale one without scaling all three, and a memory leak in a pipeline worker can kill an active voice session.

The separation also gives you independent deployment. You can ship a new dashboard feature without restarting active voice sessions, and you can update the pipeline logic without touching the agent code.
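To make the difference concrete, here is a minimal sketch of how the three scaling signals translate into replica targets. The thresholds (500 req/s per web replica, 8 sessions per agent pod, 100 queued jobs per worker) are illustrative assumptions, not tuned values:

```python
import math

# Illustrative replica targets from the three scaling signals.
# All thresholds here are assumptions for the sketch, not measurements.

def web_replicas(requests_per_sec: float, per_replica: float = 500) -> int:
    """Web app scales on request rate; keep a floor of 2 for availability."""
    return max(2, math.ceil(requests_per_sec / per_replica))

def agent_replicas(active_sessions: int, sessions_per_pod: int = 8) -> int:
    """Voice agent scales on concurrent sessions."""
    return max(2, math.ceil(active_sessions / sessions_per_pod))

def worker_replicas(queue_depth: int, jobs_per_worker: int = 100) -> int:
    """Pipeline workers scale on queue depth."""
    return max(1, math.ceil(queue_depth / jobs_per_worker))
```

Three independent functions, three independent scaling decisions. Couple the services into one container and you are forced to take the maximum of all three, paying for capacity two of the services don't need.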

Docker Multi-Stage Builds

Each service gets its own Dockerfile with multi-stage builds to keep images small. Here’s the voice agent — the most complex of the three:

# Voice Agent Dockerfile
# Stage 1: Build dependencies
FROM python:3.12-slim AS builder

WORKDIR /build
COPY requirements-agent.txt .

RUN pip install --no-cache-dir --prefix=/install \
    -r requirements-agent.txt

# Stage 2: Runtime
FROM python:3.12-slim AS runtime

RUN apt-get update && apt-get install -y --no-install-recommends \
    libopus0 libsndfile1 ffmpeg \
    && rm -rf /var/lib/apt/lists/*

RUN useradd --create-home --shell /bin/bash agent
USER agent
WORKDIR /app

COPY --from=builder /install /usr/local
COPY --chown=agent:agent src/agent/ ./agent/
COPY --chown=agent:agent src/shared/ ./shared/
COPY --chown=agent:agent src/config/ ./config/

ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8081/health')"

ENTRYPOINT ["python", "-m", "agent.main"]

The web app is simpler — install dependencies, build the Next.js standalone output, then run it from a slim Node image:

# Web Application Dockerfile
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci    # the build needs devDependencies; standalone output prunes them
COPY . .
RUN npm run build

FROM node:20-alpine AS runtime
ENV NODE_ENV=production
WORKDIR /app
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/public ./public
COPY --from=builder /app/.next/static ./.next/static
EXPOSE 3000
CMD ["node", "server.js"]

The pipeline worker follows the same pattern as the agent but with different dependencies:

# Pipeline Worker Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements-worker.txt .
RUN pip install --no-cache-dir --prefix=/install \
    -r requirements-worker.txt

FROM python:3.12-slim AS runtime
RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

RUN useradd --create-home --shell /bin/bash worker
USER worker
WORKDIR /app

COPY --from=builder /install /usr/local
COPY --chown=worker:worker src/workers/ ./workers/
COPY --chown=worker:worker src/shared/ ./shared/
COPY --chown=worker:worker src/config/ ./config/

ENV PYTHONUNBUFFERED=1
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import os, redis; redis.from_url(os.environ['REDIS_URL']).ping()"

ENTRYPOINT ["python", "-m", "workers.main"]

The key decision: separate requirements-agent.txt and requirements-worker.txt. The agent needs the LiveKit SDK, WebRTC libraries, and provider client libraries. The workers need Whisper (if doing local transcription), enrichment libraries, and report generation tools. There’s overlap in the shared utilities, but keeping requirements separate means your agent image doesn’t carry 2GB of Whisper model weights, and your workers don’t carry LiveKit’s native dependencies.

Final image sizes: web app ~150MB, voice agent ~400MB, pipeline worker ~350MB (or ~1.2GB if bundling Whisper models locally).

For local development, docker-compose.yml brings everything together:

version: "3.9"
services:
  web:
    build:
      context: .
      dockerfile: docker/Dockerfile.web
    ports: ["3000:3000"]
    environment:
      - API_URL=http://agent:8081
      - DATABASE_URL=postgresql://voice:voice@postgres:5432/voiceai
    depends_on:
      postgres: { condition: service_healthy }

  agent:
    build:
      context: .
      dockerfile: docker/Dockerfile.agent
    ports: ["8081:8081"]
    environment:
      - DATABASE_URL=postgresql://voice:voice@postgres:5432/voiceai
      - REDIS_URL=redis://redis:6379/0
      - LIVEKIT_URL=ws://livekit:7880
      - LIVEKIT_API_KEY=${LIVEKIT_API_KEY}
      - LIVEKIT_API_SECRET=${LIVEKIT_API_SECRET}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - GOOGLE_API_KEY=${GOOGLE_API_KEY}
    depends_on:
      redis: { condition: service_healthy }
      livekit: { condition: service_started }

  worker:
    build:
      context: .
      dockerfile: docker/Dockerfile.worker
    environment:
      - DATABASE_URL=postgresql://voice:voice@postgres:5432/voiceai
      - REDIS_URL=redis://redis:6379/0
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    deploy:
      replicas: 2
    depends_on:
      redis: { condition: service_healthy }
      postgres: { condition: service_healthy }

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: voiceai
      POSTGRES_USER: voice
      POSTGRES_PASSWORD: voice
    volumes: ["pgdata:/var/lib/postgresql/data"]
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U voice"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes: ["redisdata:/data"]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  livekit:
    image: livekit/livekit-server:latest
    ports: ["7880:7880", "7881:7881", "7882:7882/udp"]
    command: --config /etc/livekit.yaml
    volumes: ["./config/livekit-dev.yaml:/etc/livekit.yaml"]

volumes:
  pgdata:
  redisdata:

The health checks matter more than they look. In local development, services start in random order. Without the depends_on conditions, the agent tries to connect to Redis before Redis is ready and crashes. The health checks ensure correct startup ordering, which saves you from the “it works if I restart docker-compose twice” debugging sessions.
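Compose's health checks handle startup ordering, but a belt-and-braces pattern is to retry inside the service as well, so the agent also survives a Redis restart mid-run, not just a slow start. A minimal sketch, where the `probe` callable stands in for something like a Redis PING or a `SELECT 1` against Postgres:

```python
import time
from typing import Callable

def wait_for_dependency(
    probe: Callable[[], bool],
    name: str,
    max_attempts: int = 30,
    delay_seconds: float = 2.0,
) -> None:
    """Block until `probe` succeeds, retrying with a fixed delay.

    `probe` should return True when the dependency is healthy and
    raise (or return False) otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if probe():
                return
        except Exception as exc:
            print(f"{name} not ready (attempt {attempt}/{max_attempts}): {exc}")
        time.sleep(delay_seconds)
    raise RuntimeError(f"{name} unavailable after {max_attempts} attempts")

# At agent startup, e.g.:
#   wait_for_dependency(lambda: redis_client.ping(), "redis")
```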

Kubernetes Orchestration (or Docker Swarm If You’re Pragmatic)

Here’s my honest take: if you’re running fewer than 50 concurrent sessions, Kubernetes is overkill. Docker Compose on a single VPS handles 20-30 concurrent sessions comfortably with a 16GB / 8-core machine. Docker Swarm across 2-3 nodes gets you to 50 sessions with minimal operational complexity. The docker stack deploy command is essentially docker-compose for multi-node, and the learning curve is a fraction of Kubernetes.

Kubernetes becomes worth the operational overhead at 50+ concurrent sessions, where you need:

  • Auto-scaling based on custom metrics (active sessions, queue depth)
  • Pod anti-affinity to spread agent processes across nodes
  • Rolling deploys with session drain (more on this in Section 5)
  • Resource limits that prevent one noisy session from degrading others

Here are the key manifests, abbreviated to the parts that matter for voice AI specifically:

# agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-agent
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0        # Never kill a pod until new one is ready
  selector:
    matchLabels:
      app: voice-agent
  template:
    metadata:
      labels:
        app: voice-agent
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      terminationGracePeriodSeconds: 330   # 5 min drain + 30s buffer
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: voice-agent
                topologyKey: kubernetes.io/hostname
      containers:
        - name: agent
          image: ghcr.io/yourorg/voice-agent:latest   # placeholder; CI pins a sha tag via kubectl set image
          ports:
            - containerPort: 8081
            - containerPort: 9090    # Prometheus metrics
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8081
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8081
            initialDelaySeconds: 30
            periodSeconds: 30
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
          envFrom:
            - secretRef:
                name: voice-agent-secrets

# agent-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: voice_sessions_active
        target:
          type: AverageValue
          averageValue: "8"       # Scale up when avg > 8 sessions/pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60             # Remove 1 pod per minute max
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60             # Add up to 4 pods per minute

# worker-deployment.yaml (abbreviated)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pipeline-worker
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: worker
          image: ghcr.io/yourorg/pipeline-worker:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pipeline-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: voice_pipeline_queue_depth
        target:
          type: AverageValue
          averageValue: "100"     # Scale up when queue > 100 jobs/worker

The critical detail is terminationGracePeriodSeconds: 330 on the agent deployment. The default is 30 seconds, which is not enough time to drain active voice sessions. When Kubernetes sends SIGTERM to an agent pod, the agent needs to finish all active sessions gracefully — and a research session can run 30 minutes. The 330-second grace period gives 5 minutes for session drain, which handles the vast majority of cases. Sessions that can’t drain in 5 minutes get the 120-second rejoin window from Part 6 as a safety net.

Resource requests are tuned from experience: each concurrent voice session consumes roughly 200-500MB of memory (as noted earlier) and up to half a CPU, so a pod with a 2Gi limit comfortably handles about 4 heavy sessions, or up to 8 light ones. The HPA's averageValue of 8 therefore sits at the top of that range; if your sessions skew long and heavy, lower it to 4-6 so pods scale out before they approach the memory limit rather than after.
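That arithmetic is worth writing down before a study, not during one. A small sketch using the 200-500MB per-session range from Section 1, the 2Gi pod limit, and the HPA numbers above (figures are illustrative):

```python
import math

# Rough capacity arithmetic: per-session memory vs. the 2Gi (2048Mi)
# pod limit, and pod count at the HPA's sessions-per-pod target.

def sessions_per_pod(pod_limit_mi: int, per_session_mi: int) -> int:
    """How many sessions fit under the pod's memory limit."""
    return pod_limit_mi // per_session_mi

def pods_for_study(expected_sessions: int, hpa_target: int = 8) -> int:
    """Pods needed if the HPA balances sessions at its target average."""
    return math.ceil(expected_sessions / hpa_target)

heavy = sessions_per_pod(2048, 512)   # worst case: every session is heavy
light = sessions_per_pod(2048, 256)   # light sessions pack twice as densely
pods = pods_for_study(50)             # pods to absorb a 50-session study
```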

The scale-down policy is intentionally conservative — remove 1 pod per minute, with a 5-minute stabilization window. Aggressive scale-down risks killing pods with active sessions. Scale-up is aggressive — add up to 4 pods per minute — because latency from insufficient capacity is worse than briefly over-provisioning.

CI/CD Pipeline with GitHub Actions

Here’s the complete workflow that builds, tests, and deploys all three services:

# .github/workflows/deploy.yml
name: Build and Deploy Voice AI Platform

on:
  push:
    branches: [main, staging]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/voice-ai

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: "pip"

      - name: Install dependencies
        run: |
          pip install -r requirements-agent.txt -r requirements-worker.txt
          pip install ruff pytest pytest-asyncio

      - name: Lint
        run: ruff check src/

      - name: Type check
        run: |
          pip install pyright
          pyright src/

      - name: Run tests
        run: pytest tests/ -v --timeout=60
        env:
          DATABASE_URL: sqlite:///test.db
          REDIS_URL: redis://localhost:6379/0

  build-images:
    needs: lint-and-test
    runs-on: ubuntu-latest
    if: github.event_name == 'push'
    permissions:
      contents: read
      packages: write
    strategy:
      matrix:
        service: [web, agent, worker]
    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          file: docker/Dockerfile.${{ matrix.service }}
          push: true
          tags: |
            ${{ env.IMAGE_PREFIX }}-${{ matrix.service }}:sha-${{ github.sha }}
            ${{ env.IMAGE_PREFIX }}-${{ matrix.service }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build-images
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/staging'
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to staging
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.STAGING_HOST }}
          username: deploy
          key: ${{ secrets.STAGING_SSH_KEY }}
          script: |
            cd /opt/voice-ai
            docker compose pull
            docker compose up -d --remove-orphans
            sleep 10
            docker compose ps --format json | python3 -c "
            import sys, json
            services = [json.loads(l) for l in sys.stdin]
            unhealthy = [s['Name'] for s in services if s.get('Health','') == 'unhealthy']
            if unhealthy:
                print(f'Unhealthy services: {unhealthy}')
                sys.exit(1)
            print('All services healthy')
            "

  deploy-production:
    needs: build-images
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to production
        run: |
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig

          # Update image tags
          kubectl set image deployment/voice-web \
            web=${{ env.IMAGE_PREFIX }}-web:sha-${{ github.sha }}
          kubectl set image deployment/voice-agent \
            agent=${{ env.IMAGE_PREFIX }}-agent:sha-${{ github.sha }}
          kubectl set image deployment/pipeline-worker \
            worker=${{ env.IMAGE_PREFIX }}-worker:sha-${{ github.sha }}

          # Wait for rollouts
          kubectl rollout status deployment/voice-web --timeout=120s
          kubectl rollout status deployment/voice-agent --timeout=600s
          kubectl rollout status deployment/pipeline-worker --timeout=120s

      - name: Smoke test
        run: |
          HEALTH_URL="${{ secrets.PRODUCTION_URL }}/api/health"
          for i in $(seq 1 10); do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL")
            if [ "$STATUS" = "200" ]; then
              echo "Health check passed on attempt $i"
              exit 0
            fi
            echo "Attempt $i: status $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Health check failed after 10 attempts"
          exit 1

      - name: Rollback on failure
        if: failure()
        run: |
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > /tmp/kubeconfig
          export KUBECONFIG=/tmp/kubeconfig
          kubectl rollout undo deployment/voice-web
          kubectl rollout undo deployment/voice-agent
          kubectl rollout undo deployment/pipeline-worker
          echo "::error::Deployment failed, rolled back to previous version"

A few things worth noting about this workflow:

The matrix build for Docker images runs all three service builds in parallel. With GitHub Actions’ build cache (cache-from: type=gha), subsequent builds that only change application code take 2-3 minutes instead of 10-15 minutes. The dependency install layer is cached, so only the COPY step for application code triggers a rebuild.

Image tagging with sha-{commit} gives you exact traceability. When a participant reports an issue, you check which image SHA was running, map it to a commit, and know exactly what code was live. The latest tag is a convenience for local development — never use it in production manifests.

The staging deploy is deliberately simple — SSH, docker compose pull, bring up services. Staging doesn’t need Kubernetes. It needs to be fast and easy to iterate on. The production deploy uses kubectl set image which triggers a rolling update with the deployment’s configured strategy.

The rollback step runs on any failure in the deploy or smoke test. kubectl rollout undo reverts to the previous deployment revision, which Kubernetes tracks automatically. The rollback is a single command, and it preserves the session drain behavior configured in the deployment spec. This is why we tested rollbacks before go-live — you don’t want to discover that your rollback procedure doesn’t work during an actual incident.

Zero-Downtime Deploys for Voice Sessions

Standard rolling deploys have a fatal flaw for voice AI: Kubernetes sends SIGTERM to the old pod, the pod gets 30 seconds (by default) to die, and any active voice sessions on that pod are terminated. For a web server handling HTTP requests, this is fine — the load balancer routes new requests elsewhere, in-flight requests finish quickly. For a voice session that’s been running for 20 minutes, sudden termination means lost data and a confused participant.

The solution is session drain with graceful shutdown:

import asyncio
import signal
import logging
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class GracefulShutdownHandler:
    """Manages graceful shutdown for voice agent pods.

    On SIGTERM:
    1. Stop accepting new sessions immediately
    2. Wait for active sessions to complete naturally
    3. Force-close remaining sessions after max_drain_seconds
    4. Exit cleanly
    """
    max_drain_seconds: int = 300    # 5 minutes max drain time
    accepting_sessions: bool = True
    _active_sessions: dict = field(default_factory=dict)
    _shutdown_event: Optional[asyncio.Event] = None
    _drain_started: Optional[datetime] = None

    def __post_init__(self):
        self._shutdown_event = asyncio.Event()

    def register_signals(self, loop: asyncio.AbstractEventLoop):
        """Register SIGTERM and SIGINT handlers."""
        for sig in (signal.SIGTERM, signal.SIGINT):
            loop.add_signal_handler(sig, self._initiate_drain)

    def _initiate_drain(self):
        """Called on SIGTERM. Stops accepting new sessions."""
        logger.warning(
            "Shutdown signal received. Draining %d active sessions...",
            len(self._active_sessions),
        )
        self.accepting_sessions = False
        self._drain_started = datetime.utcnow()
        self._shutdown_event.set()

    def register_session(self, session_id: str):
        """Track a new active session."""
        if not self.accepting_sessions:
            raise RuntimeError("Pod is draining, cannot accept new sessions")
        self._active_sessions[session_id] = datetime.utcnow()
        logger.info("Session %s registered. Active: %d", session_id, len(self._active_sessions))

    def unregister_session(self, session_id: str):
        """Remove a completed session from tracking."""
        self._active_sessions.pop(session_id, None)
        logger.info(
            "Session %s completed. Active: %d",
            session_id, len(self._active_sessions),
        )

    async def wait_for_drain(self):
        """Block until all sessions drain or timeout expires."""
        await self._shutdown_event.wait()

        deadline = datetime.utcnow() + timedelta(seconds=self.max_drain_seconds)

        while self._active_sessions and datetime.utcnow() < deadline:
            remaining = len(self._active_sessions)
            seconds_left = (deadline - datetime.utcnow()).total_seconds()
            logger.info(
                "Drain in progress: %d sessions remaining, %.0fs until timeout",
                remaining, seconds_left,
            )
            await asyncio.sleep(5)

        if self._active_sessions:
            logger.warning(
                "Drain timeout reached. Force-closing %d sessions: %s",
                len(self._active_sessions),
                list(self._active_sessions.keys()),
            )
            # Force-close remaining sessions
            for session_id in list(self._active_sessions):
                await self._force_close_session(session_id)

        logger.info("Drain complete. Exiting.")

    async def _force_close_session(self, session_id: str):
        """Force-close a session that didn't complete during drain."""
        logger.warning("Force-closing session %s", session_id)
        # Send session-end event to participant
        # Save partial transcript and recordings
        # Mark session as INTERRUPTED in database
        self._active_sessions.pop(session_id, None)

    @property
    def health_status(self) -> dict:
        """Health check response. Returns 503 when draining."""
        return {
            "status": "healthy" if self.accepting_sessions else "draining",
            "active_sessions": len(self._active_sessions),
            "accepting_new": self.accepting_sessions,
        }

The flow works like this during a deploy:

  1. Kubernetes sends SIGTERM to the old pod
  2. The preStop hook sleeps 5 seconds, giving the load balancer time to stop routing new connections to this pod
  3. The GracefulShutdownHandler receives SIGTERM, sets accepting_sessions = False
  4. The readiness probe starts returning 503 (because health_status shows “draining”), confirming no new sessions will be routed here
  5. Active sessions continue running normally — the agent doesn’t interrupt them
  6. As sessions complete naturally, they’re unregistered from the handler
  7. After all sessions complete (or after 5 minutes), the pod exits
  8. Kubernetes marks the pod as terminated and the new pod takes over

The 120-second rejoin window from Part 6 acts as a safety net. If a pod is force-killed before drain completes, the participant sees a brief disconnection and the client automatically attempts reconnection. The new agent pod picks up the session state from the database and the participant continues where they left off. In practice, we’ve never had a participant lose more than 15 seconds of conversation during a deployment.

For an even safer approach, you can use blue-green deployments: deploy the new version to a completely separate set of pods (“green”), route all new sessions to green, and let the old pods (“blue”) drain naturally with no timeout pressure. This costs 2x the resources during the transition window but eliminates any risk of session interruption. We used this approach during our first major study and switched to rolling deploys with drain once we were confident in the handler.

The Go-Live Checklist

After three go-lives — one that went smoothly, one that went badly, and one that went badly in ways we hadn’t prepared for — we formalized this checklist. Every item on it exists because we missed it at least once.

Infrastructure

  • PostgreSQL with connection pooling (PgBouncer, 100 max connections) and automated daily backups with tested restoration
  • Redis with AOF persistence enabled — appendonly yes in config, so queue state survives restart
  • Object storage (S3/MinIO) configured with lifecycle policies: move recordings to cold storage after 90 days, delete after retention period
  • LiveKit SFU deployed and load-tested — either self-hosted (Part 5) or Cloud, with TURN/STUN relay for corporate firewalls
  • SSL/TLS certificates on all endpoints with auto-renewal (Let’s Encrypt via Caddy or cert-manager on Kubernetes)
  • DNS configured with 60-second TTL for fast failover during incidents
  • Network: WebRTC UDP ports open (typically 50000-65535), WebSocket endpoints accessible, TURN fallback tested behind at least two corporate firewalls

Application

  • All environment variables and secrets injected via secrets manager — no .env files baked into images, no secrets in docker-compose.yml
  • Provider API keys validated: OpenAI Tier 3+ for rate limits sufficient for concurrent sessions, Gemini production quota approved
  • Voice agent pre-warming enabled — Part 2 explains why cold starts add 3-5 seconds to session start
  • Budget limits configured per session ($15 max) and per study (total cap) — Part 5
  • Research protocol prompts reviewed: all phases of the state machine tested end-to-end, edge cases validated
  • Multi-language prompt packs reviewed by native speakers for each target language — Part 7
  • Session timeout configured: 45-minute hard limit, 2-minute idle timeout, zombie detection at 2 hours

Operations

  • Monitoring dashboards deployed and accessible to the operations team (Section 7 below)
  • Alert rules configured, tested with synthetic alerts, and routed to the correct on-call channel
  • Rollback procedure documented, tested, and executable in one command — kubectl rollout undo or docker compose pull && docker compose up -d with the previous image tag
  • Load test completed: 50+ concurrent sessions sustained for 30 minutes with all pipeline stages running
  • Backup restoration tested end-to-end — not just “backups are being created” but “we restored from backup and verified data integrity”
  • On-call rotation established with escalation paths and a shared incident response runbook
  • Participant support flow: what happens when someone calls/emails saying “it’s not working” — who responds, what they check, how they escalate

The load test deserves special emphasis. We ran a 50-session load test where each session ran for 15 minutes with simulated audio. It uncovered three issues that unit and integration tests missed: a connection pool leak in the enrichment worker that only manifested after 30+ concurrent database connections, a Redis memory spike from accumulated job metadata that wasn’t being cleaned up, and a LiveKit SFU issue where TURN relay allocation failed under concurrent session setup pressure. All three would have caused incidents in the first week of production use.
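A first load test doesn't need a heavyweight framework. A minimal asyncio harness that holds N simulated sessions open concurrently is enough to surface connection-pool and queue-cleanup issues; here start_session is a hypothetical stand-in for a real client that joins a room and streams canned audio:

```python
import asyncio
import time

async def start_session(session_id: int, duration_s: float) -> float:
    """Hypothetical stand-in for a real session client.

    In the actual test this wraps a WebRTC client streaming canned
    audio; here it just holds the slot open and reports its duration.
    """
    start = time.monotonic()
    await asyncio.sleep(duration_s)
    return time.monotonic() - start

async def load_test(n_sessions: int, duration_s: float,
                    ramp_s: float = 0.0) -> int:
    """Hold n_sessions open concurrently, optionally ramping up."""
    tasks = []
    for i in range(n_sessions):
        tasks.append(asyncio.create_task(start_session(i, duration_s)))
        if ramp_s:
            await asyncio.sleep(ramp_s)  # avoid a thundering herd at setup
    durations = await asyncio.gather(*tasks)
    return len(durations)
```

The ramp matters: the TURN allocation failure above only appeared under concurrent session *setup* pressure, so run the test both with a gentle ramp and with ramp_s=0 to hit that path deliberately.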

Production Monitoring — The Dashboard That Saves You

Monitoring for voice AI needs more than the standard web application metrics. Response time and error rate don’t tell you whether sessions are actually working — whether participants hear the AI respond quickly, whether transcriptions are accurate, whether the pipeline is keeping up.

Here’s the Prometheus metrics registry we run on the voice agent:

from prometheus_client import (
    Counter, Gauge, Histogram, CollectorRegistry, generate_latest,
)

# Create a dedicated registry to avoid conflicts
voice_metrics = CollectorRegistry()

# --- Real-time session metrics ---
sessions_active = Gauge(
    "voice_sessions_active",
    "Number of currently active voice sessions",
    ["provider", "language"],
    registry=voice_metrics,
)

session_ttfv = Histogram(
    "voice_session_ttfv_seconds",
    "Time from session start to first voice response from AI",
    ["provider"],
    buckets=[0.5, 1.0, 1.5, 2.0, 3.0, 5.0, 10.0],
    registry=voice_metrics,
)

session_duration = Histogram(
    "voice_session_duration_seconds",
    "Total session duration",
    ["provider", "completion_reason"],
    buckets=[60, 120, 300, 600, 900, 1200, 1800, 2700, 3600],
    registry=voice_metrics,
)

# --- Pipeline metrics ---
pipeline_queue_depth = Gauge(
    "voice_pipeline_queue_depth",
    "Number of pending jobs in pipeline",
    ["stage"],   # transcription, enrichment, reporting
    registry=voice_metrics,
)

pipeline_job_duration = Histogram(
    "voice_pipeline_job_duration_seconds",
    "Time to process a single pipeline job",
    ["stage"],
    buckets=[1, 5, 10, 30, 60, 120, 300],
    registry=voice_metrics,
)

# --- Provider metrics ---
provider_errors = Counter(
    "voice_provider_errors_total",
    "Total provider API errors",
    ["provider", "error_type"],
    registry=voice_metrics,
)

provider_latency = Histogram(
    "voice_provider_response_latency_seconds",
    "Provider API response latency",
    ["provider", "request_type"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
    registry=voice_metrics,
)

# --- Cost metrics ---
session_cost = Histogram(
    "voice_session_cost_dollars",
    "Total cost per session in USD",
    ["provider", "study_id"],  # study_id enables the per-study budget alert
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 15.0, 25.0],
    registry=voice_metrics,
)

study_budget = Gauge(
    "voice_study_budget_dollars",
    "Configured per-study budget in USD, set when the study is created",
    ["study_id"],
    registry=voice_metrics,
)

# --- Health endpoint for Prometheus scraping ---
async def metrics_endpoint(request):
    """Expose metrics at /metrics for Prometheus scraping."""
    from aiohttp.web import Response
    from prometheus_client import CONTENT_TYPE_LATEST

    # CONTENT_TYPE_LATEST includes the exposition-format version and
    # charset; aiohttp's content_type kwarg rejects charset parameters,
    # so set the header directly.
    return Response(
        body=generate_latest(voice_metrics),
        headers={"Content-Type": CONTENT_TYPE_LATEST},
    )

These metrics feed into four alert rules — two criticals that page, two warnings that notify:

# prometheus-alerts.yml
groups:
  - name: voice-ai-critical
    rules:
      - alert: HighTimeToFirstVoice
        expr: |
          histogram_quantile(0.95,
            rate(voice_session_ttfv_seconds_bucket[5m])
          ) > 5
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "TTFV p95 exceeds 5 seconds"
          description: >
            95th percentile time-to-first-voice is {{ $value }}s.
            Participants are waiting too long for the AI to respond.
            Check provider status and agent pod resources.

      - alert: PipelineBacklog
        expr: voice_pipeline_queue_depth > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pipeline queue depth exceeds 1000"
          description: >
            {{ $labels.stage }} queue depth is {{ $value }}.
            Scale worker replicas or investigate stuck jobs.

      - alert: ProviderErrorSpike
        expr: |
          sum by (provider) (rate(voice_provider_errors_total[5m]))
          /
          (
            sum by (provider) (rate(voice_provider_errors_total[5m]))
            + sum by (provider) (rate(voice_session_duration_seconds_count[5m]))
          )
          > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Provider error rate exceeds 5%"
          description: >
            {{ $labels.provider }} error rate is {{ $value | humanizePercentage }}.
            Check provider status page and consider failover.

      - alert: BudgetApproachingLimit
        expr: |
          sum(voice_session_cost_dollars_sum) by (study_id) /
          voice_study_budget_dollars > 0.85
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Study budget at 85%+"
          description: >
            Study {{ $labels.study_id }} has consumed 85% of its budget.
            Notify the study administrator.

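The TTFV alert leans on histogram_quantile, which interpolates linearly inside the bucket where the target rank falls — worth understanding, because it means your bucket boundaries should straddle both the 2-second target and the 5-second alert threshold. A simplified pure-Python sketch of the calculation (the real Prometheus function also handles NaN, counter resets, and malformed buckets):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate Prometheus histogram_quantile().

    `buckets` is a list of (upper_bound, cumulative_count) pairs in
    ascending bound order, ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Prometheus caps the result at the highest finite bound
                return lower_bound
            # Linear interpolation within the bucket
            return lower_bound + (bound - lower_bound) * (
                (rank - lower_count) / (count - lower_count)
            )
        lower_bound, lower_count = bound, count
    return lower_bound

# Example: 100 sessions with the TTFV bucket layout from the registry above.
ttfv_buckets = [
    (0.5, 20), (1.0, 60), (1.5, 75), (2.0, 90),
    (3.0, 95), (5.0, 98), (10.0, 100), (float("inf"), 100),
]
p95 = histogram_quantile(0.95, ttfv_buckets)  # lands in the (2.0, 3.0] bucket
```

The interpolation is why coarse buckets lie: if your only bounds were 1s and 10s, a real p95 of 3s and one of 9s would be indistinguishable.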
The TTFV (time-to-first-voice) metric is the single most important number in the system. If a participant asks a question and waits more than 5 seconds for the AI to start responding, they assume it’s broken. They click away, close the tab, or sit in awkward silence. Our target is under 2 seconds at p95, and we page if it exceeds 5 seconds for more than 3 minutes. The most common cause is provider-side latency spikes, which we can’t fix but can mitigate by routing new sessions to the backup provider.
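Capturing TTFV correctly means marking two instants on the agent: when the participant's connection is up, and when the first audio frame arrives from the provider. A minimal sketch — the hook names are illustrative, and in production the captured value would go to `session_ttfv.observe()` rather than an attribute:

```python
import time
from typing import Optional

class TTFVTracker:
    """Capture time-to-first-voice for one session.

    The agent calls mark_session_start() once the participant's WebRTC
    connection is established, and mark_first_audio() on every audio
    frame from the provider. Only the first frame counts -- later
    frames are normal turn-taking, not TTFV.
    """

    def __init__(self) -> None:
        self._start: Optional[float] = None
        self.ttfv: Optional[float] = None

    def mark_session_start(self) -> None:
        self._start = time.monotonic()

    def mark_first_audio(self) -> None:
        if self._start is not None and self.ttfv is None:
            self.ttfv = time.monotonic() - self._start
            # In production: session_ttfv.labels(provider=...).observe(self.ttfv)
```

Using `time.monotonic()` rather than wall-clock time matters here: NTP adjustments on the agent host would otherwise occasionally produce negative or wildly inflated TTFV samples.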

The Grafana dashboard has four rows: session health (active count, TTFV, duration distribution), pipeline throughput (queue depth, processing rate, error rate per stage), provider status (latency per provider, error rates, cost accumulation), and infrastructure (pod CPU/memory, database connections, Redis memory). The session health row is what’s on the TV in the office during a live study.

The Operational Runbook

After six months of production operation, these are the incidents that actually happen — not the ones you plan for in architecture reviews, but the ones that generate real pages:

Provider outage. Happens 2-3 times per month for 5-30 minutes. The provider abstraction layer from Part 1 routes new sessions to the backup provider automatically. Active sessions continue on the current provider until they complete — you can't switch a stateful WebSocket connection mid-session. If the outage lasts longer than any active session, all sessions naturally migrate. Action: monitor, confirm failover is working, notify study admins. No manual intervention needed if the failover is implemented correctly.

Pipeline backlog. Usually follows a burst of session completions — a study scheduled 100 sessions in a 2-hour window, and 100 sessions ending within minutes of each other generates thousands of enrichment jobs. Action: scale worker replicas (kubectl scale deployment/pipeline-worker --replicas=8). If persistent, enable the batched enrichment mode from Part 4 which reduces API calls by 90%. The backlog resolves itself — it’s a throughput issue, not a failure.
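The batching win is easy to see in miniature — grouping enrichment jobs into fixed-size batches turns one API call per transcript into one per batch. The sketch below is illustrative; the actual batched mode is described in Part 4:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(jobs: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive batches of up to `size` jobs."""
    it = iter(jobs)
    while batch := list(islice(it, size)):
        yield batch

# 100 sessions completing in a burst -> one enrichment job each.
# One API call per job: 100 calls. One call per batch of 10: 10 calls.
jobs = [f"session-{i}" for i in range(100)]
batches = list(batched(jobs, 10))
api_calls = len(batches)  # 10 instead of 100 -- the 90% reduction
```

The trade-off is latency per job: a job may wait for its batch to fill, which is acceptable for post-session enrichment and unacceptable for anything in the live audio path.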

Database connection exhaustion. PgBouncer is configured for 100 connections, but enrichment workers holding long transactions during burst processing can exhaust the pool. Symptom: new sessions fail to start because they can’t acquire a database connection. Action: check PgBouncer stats (SHOW POOLS), identify long-running transactions (SELECT pid, now() - xact_start AS xact_age, query FROM pg_stat_activity WHERE state = 'active' AND now() - xact_start > interval '60 seconds'), and if necessary, increase the pool size temporarily. Long-term fix: ensure workers use short transactions and commit frequently.

LiveKit SFU overload. At 150+ concurrent sessions on a self-hosted SFU, CPU and bandwidth saturate. Symptom: participants report choppy audio or increased disconnects. Action: scale SFU nodes horizontally, or route overflow sessions to LiveKit Cloud as a burst capacity provider. We keep a Cloud account specifically for this — normally dormant, activated when self-hosted capacity is at 80%.
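The overflow routing itself is a one-line decision at session-assignment time. A hedged sketch — the URLs are placeholders, and in production the active-session count comes from the `sessions_active` gauge and capacity from config:

```python
def pick_sfu(active_sessions: int, self_hosted_capacity: int,
             overflow_threshold: float = 0.8) -> str:
    """Route a new session to the self-hosted SFU, or to the dormant
    cloud account once utilization crosses the overflow threshold.

    Both URLs are hypothetical; only *new* sessions are routed this
    way -- active sessions stay on whichever SFU they started on.
    """
    utilization = active_sessions / self_hosted_capacity
    if utilization >= overflow_threshold:
        return "wss://overflow.livekit.cloud"   # hypothetical burst-capacity URL
    return "wss://sfu.internal.example.com"     # hypothetical self-hosted URL
```

Routing at session start (rather than migrating live sessions) keeps the logic trivial: a session's SFU never changes, so there is no mid-call renegotiation to get wrong.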

Session stuck in ACTIVE state. A session shows as ACTIVE in the database but has no corresponding agent process. Caused by an agent crash that didn’t trigger the cleanup handler. The zombie detection from Part 2 catches these, but the timeout is 2 hours by default. Action: manually force-close the session via the admin API. If this happens repeatedly, check for OOM kills in the agent pod logs (kubectl logs --previous).

Each incident should take less than 15 minutes to resolve if the runbook is followed. If it takes longer, the runbook needs updating — and that update is the first post-incident action item.

What I’d Build Differently at Day Zero

Eight posts later, here’s the complete series:

  1. The Architecture Nobody Warns You About — Server-side agents, metadata transport, provider selection
  2. Zombie Agents, Pre-Warming, and the 5 Bugs That Cost Us Weeks — Production pain points and their fixes
  3. Multi-Phase State Machines — Research protocol as code, LLM-driven transitions
  4. From Recording to Insight — The automatic post-interview pipeline
  5. The Real Cost — Per-minute tracking, budgets, self-hosting math
  6. What Breaks at 200 Concurrent Sessions — Scaling bottlenecks and operational metrics
  7. Multi-Language Support — Per-language VAD, prompt translation, cultural adaptation
  8. This post (Part 8) — Deployment, CI/CD, monitoring, and the operational runbook

If I were starting over from day zero, three things get built in sprint one:

Session drain from the first deploy. The GracefulShutdownHandler above is maybe 80 lines of code. We didn’t add it until month three, and every deployment before that risked killing active sessions. It’s the highest-ROI operational code in the entire platform.

Per-language VAD profiles from the first multilingual study. We ran three multilingual studies with default VAD settings before realizing that Japanese participants were being cut off mid-sentence because the pause thresholds were calibrated for English conversation patterns. Language-specific VAD configuration (Part 7) should be table stakes, not an afterthought.

Cost tracking from the first session. Retroactively computing costs is painful and imprecise. Token-level logging from day one costs nothing to implement.
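What day-one cost tracking looks like in miniature: a per-session accumulator fed by provider usage events. The rates below are placeholders, not real provider pricing — load the actual per-token rates from your provider's price sheet:

```python
from dataclasses import dataclass, field

@dataclass
class SessionCostTracker:
    """Accumulate per-session cost from provider usage events.

    Rates are illustrative placeholders. Logging usage at token
    granularity from the first session is what makes retroactive
    cost questions answerable later.
    """
    rates_per_1k: dict    # USD per 1,000 tokens, keyed by token type
    usage: dict = field(default_factory=dict)

    def record(self, token_type: str, tokens: int) -> None:
        self.usage[token_type] = self.usage.get(token_type, 0) + tokens

    @property
    def total_cost(self) -> float:
        return sum(
            tokens / 1000 * self.rates_per_1k[t]
            for t, tokens in self.usage.items()
        )

# Hypothetical rates -- substitute your provider's real price sheet.
tracker = SessionCostTracker(
    rates_per_1k={"audio_in": 0.06, "audio_out": 0.24, "text": 0.01}
)
tracker.record("audio_in", 12_000)   # e.g. from a provider usage event
tracker.record("audio_out", 4_000)
cost = tracker.total_cost  # 12 * 0.06 + 4 * 0.24 = 1.68
```

At session end, the total feeds the `voice_session_cost_dollars` histogram; the raw per-event usage log is what you keep for auditing when a study's bill looks wrong.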

Three things that can wait:

Multi-provider failover. It’s important, but a single provider with good error handling is sufficient for the first few months. Build the abstraction layer, but implement the second provider when you have real traffic patterns to validate the failover logic.

Cross-language analysis in the pipeline. The enrichment pipeline can start monolingual. Cross-language comparison, translation-aware sentiment analysis, and multilingual topic clustering are research features, not operational necessities.

Advanced auto-scaling. Start with fixed replicas and manual scaling. You’ll learn your actual scaling patterns in the first month, and those patterns will inform HPA configuration that’s actually useful instead of theoretical.

Voice AI for research is still early. The S2S models from OpenAI and Google are improving quarterly — latency is dropping, costs are falling, language support is expanding. The infrastructure patterns in this series — state machines, session recovery, provider abstraction, cost tracking, graceful deployment — will transfer to whatever the next generation of models looks like. The models change. The operational problems don’t.

Build the boring infrastructure first. The AI will take care of itself.


This is Part 8 of an 8-part series: Production Voice AI for Research at Scale.


For the broader reference architecture covering cascaded vs S2S pipelines, framework selection, multi-provider support, and the full interview lifecycle, see the 12-part Voice AI Interview Playbook.
