I previously shared lessons from building a real-time AI interviewer — the VAD tuning, provider failover, latency budgets, and the things nobody warns you about, drawn from ~200 production interviews. That post covered the “what worked” from our first three months. This series goes deeper: twelve posts covering every layer of the architecture, from pipeline selection to scaling to thousands of concurrent sessions.

Because real-time voice AI isn’t a toy anymore. It’s a production infrastructure problem. And the landscape just changed.

The 300-Millisecond Constraint

Here’s the fundamental constraint that drives every decision in voice AI: humans notice conversation delays above 300 milliseconds. Above 500ms, it feels broken. Above 1 second, people start talking over the AI.

This isn’t a nice-to-have metric. It’s a hard boundary that separates “AI that feels like talking to a person” from “AI that feels like a bad phone tree.”

Natural human turn-taking gap:     200-300ms
"Feels responsive":                300-500ms
"Something's off":                 500-800ms
"Is it broken?":                   800ms-1.2s
"I'll talk over it":               >1.2s

Text-based AI doesn’t have this problem. A chatbot can take 3 seconds to respond and nobody blinks. But voice is a fundamentally different medium — we’ve evolved over hundreds of thousands of years to expect real-time conversational rhythm. The technology has to meet the biology, not the other way around.

This constraint shapes everything: which models you choose, how you stream audio, whether you use a cascaded pipeline or a speech-to-speech model, and how you architect for failover. Every millisecond you save in one component is a millisecond of budget you can spend elsewhere.
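The budget framing is easiest to see as arithmetic. Here is a minimal sketch of a latency budget for a cascaded pipeline against the ~500ms “feels responsive” ceiling — the per-component numbers are illustrative assumptions, not measured figures from this series:

```python
# Illustrative latency budget for a cascaded voice pipeline.
# Component values are assumptions for illustration only.
TARGET_MS = 500  # the "feels responsive" ceiling from above

budget = {
    "network_rtt": 60,        # client <-> media server round trip
    "vad_endpointing": 100,   # confirming the speaker actually finished
    "stt_final": 100,         # final transcript after end of speech
    "llm_first_token": 150,   # time to first generated token
    "tts_first_audio": 80,    # time to first synthesized audio frame
}

total = sum(budget.values())
headroom = TARGET_MS - total

print(f"total: {total}ms, headroom: {headroom}ms")
for stage, ms in budget.items():
    print(f"  {stage:>16}: {ms:4d}ms ({ms / total:.0%} of total)")
```

Run the numbers and the headroom is tiny — which is exactly why a millisecond saved in one component is a millisecond you can spend in another.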

Why Text-Based AI Interviews Fail

Before diving into voice, let’s acknowledge what most companies are doing today: text-based AI assessments. Chatbot interviews. Automated coding challenges with an LLM evaluator. They work at screening scale, but they have three fundamental problems:

1. They can’t assess communication skills. For roles above junior level, how someone explains their thinking matters as much as the thinking itself. A text interface strips away tone, pacing, confidence, and the ability to think on their feet.

2. Candidates don’t take them seriously. A survey from our HR team found that candidates treat text-based AI interviews like filing a form — minimal effort, copy-pasted answers, and zero engagement. Voice changes the dynamic. When you’re talking to something that sounds human and responds in real-time, you engage differently.

3. They’re trivially gameable. With a text interface, there’s nothing stopping a candidate from having ChatGPT open in another tab generating answers. Voice conversations with natural pacing and follow-up questions make this dramatically harder.

Voice interviews aren’t a gimmick. They’re the format that actually tests what you care about.

The Landscape Just Changed (2025-2026)

Twelve months ago, building a real-time voice AI required stitching together five separate services with duct tape and prayers. Today, we have purpose-built solutions that collapse the complexity:

Gemini Live API (Google)

The Gemini Live API processes audio and video natively — no intermediate text conversion needed. This isn’t STT → LLM → TTS. It’s a single model that hears you, thinks, and speaks back. First audio response in 320-800ms. Multimodal input means it can simultaneously process what you’re saying and what’s on your screen.

For interviews, this is transformative. During a live coding session, Gemini can analyze the code being written while maintaining a natural conversation about it. No separate video analysis pipeline needed.

OpenAI Realtime API

OpenAI’s Realtime API offers both WebRTC (for browser-based agents) and WebSocket (for server-side agents) connections. Native function calling during voice conversations means the AI can trigger section transitions, score answers, and query knowledge bases without breaking the conversational flow. 60-minute session limit covers even the longest interviews.

Amazon Bedrock Nova Sonic

Nova Sonic delivers under 700ms response latency with bidirectional HTTP/2 streaming. The differentiator: 100+ languages with native-quality accents, non-verbal cue recognition (it detects hesitations, laughter, and inter-sentential pauses), and deep AWS integration for enterprise compliance requirements. If your HR team needs SOC 2 audit trails and VPC endpoints, this is your answer.

Grok Voice Agent API (xAI)

The newcomer worth watching. Flat rate of $0.05 per minute of connection time — no per-token billing, no separate STT/TTS charges. Sub-1-second time-to-first-audio. OpenAI Realtime API compatible, so switching from OpenAI is straightforward. For cost-sensitive deployments at scale, Grok changes the math entirely.
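To see why flat per-minute billing “changes the math,” compare it against a token-style pipeline on a back-of-envelope basis. The $0.05/min flat rate is the figure quoted above; the per-token pipeline numbers and the 60/40 listen/speak split are illustrative assumptions:

```python
# Back-of-envelope: flat per-minute billing vs. a token-based pipeline.
# Pipeline audio rates and the speak/listen split are assumed values.
def flat_rate_cost(minutes: float, rate_per_min: float = 0.05) -> float:
    return minutes * rate_per_min

def token_pipeline_cost(minutes: float,
                        audio_in_per_min: float = 0.06,
                        audio_out_per_min: float = 0.11) -> float:
    # Assumes the AI listens ~60% of the session and speaks ~40%.
    return minutes * (0.6 * audio_in_per_min + 0.4 * audio_out_per_min)

interview_minutes = 45
print(f"flat:  ${flat_rate_cost(interview_minutes):.2f}")
print(f"token: ${token_pipeline_cost(interview_minutes):.2f}")
```

Under these assumptions a 45-minute interview costs $2.25 flat versus $3.60 token-billed — and the flat rate is also predictable, which matters as much as the absolute number when you budget at scale.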

The Common Thread

All four share a critical capability: they’re speech-to-speech models. The traditional cascaded pipeline (separate STT → LLM → TTS) added 300-500ms of latency from the intermediate conversion steps. These new models skip that entirely. You send in audio, you get audio back.

But here’s the nuance — you don’t have to choose one approach. The most robust architectures use both: speech-to-speech for the conversation, cascaded for evaluation and transcript generation. We’ll explore this hybrid approach in depth in Part 2.

Three Personas, One Platform

The voice interview platform we’re building across this series supports three distinct AI personas:

The Interviewer

The primary conversational agent. Conducts structured interviews following a rubric, asks follow-up questions based on candidate responses, manages section transitions (intro → technical → system design → behavioral → Q&A → wrap-up), and adapts difficulty based on the candidate’s level.

Key traits: Professional but warm. Doesn’t interrupt unnecessarily. Asks probing follow-ups when answers are vague. Manages time across sections.

The Coach

A practice mode where candidates rehearse before real interviews. Gives real-time feedback: “That answer was good but a bit long — try to keep responses under 2 minutes” or “You used a lot of filler words in that response. Try pausing instead of saying ‘um.’” More encouraging and constructive than the Interviewer persona.

Key traits: Supportive. Points out specific improvements. Replays and discusses specific moments. Never scores — only coaches.
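The filler-word feedback above is the kind of signal the Coach surfaces. A toy sketch of the detection step — a production system would work from word-level STT timestamps rather than raw text, and the filler list here is an assumption:

```python
import re

# Toy filler-word detector for Coach-style feedback.
# FILLERS and the feedback wording are illustrative, not the real rubric.
FILLERS = {"um", "uh", "er", "hmm"}

def filler_word_feedback(transcript: str) -> str:
    words = re.findall(r"[a-z']+", transcript.lower())
    count = sum(1 for w in words if w in FILLERS)
    if count == 0:
        return "Clean delivery - no filler words detected."
    return f"You used {count} filler words. Try pausing instead."
```

Usage: `filler_word_feedback("Um, I think, uh, it's a cache issue.")` flags two fillers and suggests pausing instead.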

The Evaluator

A silent observer during interviews that generates structured assessments afterward. Processes the full transcript plus any multimodal observations (from Gemini Live video analysis) and produces rubric-aligned scoring with evidence citations.

Key traits: Objective. Evidence-based. Calibrated against example transcripts. Runs evaluation twice and flags score discrepancies.

These aren’t three separate applications — they’re three system prompts, three sets of function calling tools, and three evaluation rubrics running on the same infrastructure. The underlying voice pipeline, media transport, recording, and scaling layers are shared. This is the architectural principle that keeps complexity manageable.

The Reference Architecture

Here’s the high-level architecture we’ll build across this series. Each subsequent post zooms into a specific layer:

┌──────────────────────────────────────────────────────────────┐
│                      CLIENT LAYER                            │
│  Web App (React)  │  Mobile (RN/Flutter)  │  Admin Dashboard │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                   MEDIA TRANSPORT LAYER                      │
│        LiveKit SFU (WebRTC)  │  TURN/STUN  │  Edge Nodes    │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                    VOICE PIPELINE LAYER                      │
│  Cascaded: VAD → STT → LLM → TTS                           │
│        OR                                                    │
│  Speech-to-Speech: Gemini Live / OpenAI Realtime / Nova /   │
│                    Grok                                      │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                AGENT ORCHESTRATION LAYER                     │
│  Role Manager    │  Session State Machine  │  Tool Executor  │
│  (3 personas)    │  (interview phases)     │  (fn calling)   │
│                  │                         │                  │
│  Provider Router + Circuit Breaker (multi-provider failover) │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                KNOWLEDGE & CONTEXT LAYER                     │
│  Vector DB (RAG)  │  Redis (session state)  │  Templates     │
│  Rubrics & Questions  │  Company Knowledge Base              │
└──────────────────────────────────────────────────────────────┘

┌────────────────────────────┬─────────────────────────────────┐
│   RECORDING & COMPLIANCE   │      EVALUATION & ANALYTICS     │
│  LiveKit Egress            │  Post-interview scoring         │
│  Transcript storage        │  Scoring calibration            │
│  GDPR/HIPAA compliance     │  Bias detection                 │
│  Encryption + consent mgmt │  Reporting dashboard            │
└────────────────────────────┴─────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                    INFRASTRUCTURE LAYER                       │
│  K8s / ECS Fargate  │  Auto-scaling  │  Regional Deployment  │
│  CDN for static      │  DNS + Load Balancing                 │
└──────────────────────────────────────────────────────────────┘

This is the architecture we’ll refine across the series. Each layer has decisions to make, trade-offs to consider, and code to write. Let me walk through how the layers connect.

A candidate opens the web app. The client layer connects to a LiveKit room via WebRTC. The media transport layer handles the real-time audio (and optionally video) connection.

The candidate speaks. Audio frames flow through the voice pipeline — either a cascaded STT → LLM → TTS chain or a direct speech-to-speech model. The agent orchestration layer manages which persona is active, what section the interview is in, and which provider handles the conversation.

The AI asks a follow-up question. The knowledge layer provides context — the rubric for this role, the candidate’s previous answers, relevant technical documentation for domain-specific questions. Function calling triggers section transitions and flags interesting responses.
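The section transitions triggered by function calling can be modeled as a small state machine. A sketch using the section flow described earlier (intro → technical → system design → behavioral → Q&A → wrap-up); names are illustrative:

```python
# Sketch of the interview-phase state machine the orchestration layer
# drives. Section names mirror the flow described in this post.
SECTIONS = ["intro", "technical", "system_design", "behavioral", "qa", "wrap_up"]

class InterviewSession:
    def __init__(self) -> None:
        self.index = 0

    @property
    def section(self) -> str:
        return SECTIONS[self.index]

    def advance_section(self) -> str:
        # Exposed to the model as a function-calling tool; clamps at the
        # final section so a stray tool call can't overrun the interview.
        if self.index < len(SECTIONS) - 1:
            self.index += 1
        return self.section
```

Exposing only `advance_section` (rather than arbitrary jumps) keeps the model from skipping sections, which is a deliberate constraint rather than a limitation.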

The session is recorded. LiveKit Egress captures the full audio/video. Transcripts are generated in real-time. Consent management ensures GDPR compliance. Everything is encrypted at rest and in transit.

After the interview, evaluation runs. The full transcript, multimodal observations, and rubric go to the evaluation layer. Scores are generated, calibrated against examples, and flagged for human review when confidence is low.
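The Evaluator’s “run twice, flag discrepancies” step can be sketched simply: score each rubric dimension in two independent runs and flag any dimension where the runs disagree beyond a threshold. Scores, dimension names, and the threshold are illustrative:

```python
# Sketch of double-run evaluation calibration: dimensions where two
# independent scoring runs disagree get routed to human review.
# The 1.0-point agreement threshold is an assumed value.
def needs_human_review(run_a: dict[str, float],
                       run_b: dict[str, float],
                       max_delta: float = 1.0) -> list[str]:
    """Return rubric dimensions where the two runs disagree too much."""
    return [dim for dim in run_a
            if abs(run_a[dim] - run_b.get(dim, 0.0)) > max_delta]

flagged = needs_human_review(
    {"communication": 4.0, "problem_solving": 3.5},
    {"communication": 4.5, "problem_solving": 2.0},
)
# A 0.5-point gap on communication passes; the 1.5-point gap on
# problem_solving gets flagged.
```

The point of the pattern: disagreement between runs is a cheap proxy for low confidence, and low-confidence scores are exactly the ones a human should see.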

What This Series Covers

Part  Topic                                    Key Decisions
 1    Why Real-Time Voice Changes Everything   The landscape and reference architecture (this post)
 2    Cascaded vs. Speech-to-Speech            Pipeline architecture and latency budgets
 3    LiveKit vs. Pipecat vs. Direct           Framework selection with code examples
 4    STT, LLM, and TTS Selection              Provider comparison with benchmarks
 5    Multi-Role Agents                        Persona system prompts and interview flows
 6    Knowledge Base and RAG                   Real-time retrieval under 50ms
 7    Web and Mobile Clients                   Cross-platform SDK implementation
 8    Video Interview Integration              Gemini Live multimodal analysis
 9    Recording and Compliance                 GDPR, HIPAA, encryption
10    Scaling to Thousands                     SFU mesh, auto-scaling, regional deployment
11    Cost Optimization                        From $0.14/min to $0.03/min
12    Multi-Provider Support                   Adapter pattern, circuit breakers, failover

Each post includes a detailed architecture diagram, code examples, and real production numbers. The goal isn’t to show you a demo — it’s to give you everything you need to build and run this in production.

Before We Start: The Honest Trade-Offs

Voice AI interviews are powerful, but they’re not a silver bullet. Here are the trade-offs you should consider before investing:

Latency is an ongoing battle. You’ll never “solve” latency. Provider response times fluctuate. Network conditions vary. New model releases change the performance characteristics. You need monitoring, alerting, and the discipline to keep measuring.

Voice quality varies by provider and load. ElevenLabs sounds incredible at 3pm on a Tuesday. At peak load during hiring season? Maybe not as consistent. Fallback chains aren’t optional.

Candidates have opinions. Some love talking to an AI. Some find it weird. Your platform needs a graceful human fallback for candidates who strongly prefer a human interviewer.

Evaluation is the hardest part. Getting the AI to have a conversation is the easy part. Getting it to reliably evaluate that conversation — consistently, fairly, without bias — is where you’ll spend most of your calibration time.

Cost at scale needs architecture. At $3.45 per interview with a managed approach, running 1,000 interviews per month costs $3,450. At 10,000 interviews, that’s $34,500/month. Part 11 of this series covers how to get that down to under $1 per interview with the right architecture.
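The scaling math above, made explicit. The $3.45/interview figure is the managed-stack number quoted in this post:

```python
# Monthly cost at the managed-stack rate of $3.45 per interview,
# the figure quoted above.
COST_PER_INTERVIEW = 3.45

for monthly_interviews in (1_000, 10_000):
    monthly_cost = monthly_interviews * COST_PER_INTERVIEW
    print(f"{monthly_interviews:>6} interviews/mo -> ${monthly_cost:,.0f}/mo")
```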

Let’s Build

The reference architecture above is our blueprint. Over the next eleven posts, we’ll fill in every layer with production-tested patterns, real code, and honest trade-offs.

Next up: Part 2 — Cascaded vs. Speech-to-Speech: Choosing Your Pipeline Architecture. We’ll break down the latency budget for each approach and show you when each one makes sense.


This is Part 1 of a 12-part series: The Voice AI Interview Playbook.

Series outline:

  1. Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (this post)
  2. Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
  3. LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
  4. STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
  5. Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
  6. Knowledge Base and RAG — Making your voice agent an expert (Part 6)
  7. Web and Mobile Clients — Cross-platform voice experiences (Part 7)
  8. Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
  9. Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
  10. Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
  11. Cost Optimization — From $0.14/min to $0.03/min (Part 11)
  12. Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)