In Part 1, we established the 300-millisecond constraint and the reference architecture. Now we need to make the first fundamental decision: how does audio go in and come out?

There are two approaches. The traditional cascaded pipeline — Speech-to-Text, then Language Model, then Text-to-Speech — has been the production standard for years. The newer speech-to-speech models — Gemini Live, OpenAI Realtime, Nova Sonic, Grok — skip the intermediate steps entirely. Each has real trade-offs, and understanding them deeply is the difference between a voice agent that feels natural and one that feels like a bad phone tree.

The Cascaded Pipeline: STT → LLM → TTS

The cascaded architecture processes audio through three distinct stages:

Microphone → [VAD] → [STT] → text → [LLM] → text → [TTS] → Speaker
               ~50ms   ~150ms          ~200ms          ~75ms
                                                    ──────────
                                             Total: ~475ms

Each component is a separate service, usually from a different provider. You pick the best STT (Deepgram), the best LLM (GPT-4o or Gemini Flash), and the best TTS (ElevenLabs). Then you wire them together.

Why Teams Still Choose This

Debuggability. When something goes wrong — and it will — you can see exactly where. Was the STT misinterpreting “React” as “react”? Was the LLM generating a bad follow-up question? Was the TTS mispronouncing a technical term? With cascaded, you have a text transcript at every intermediate step. You can log it, replay it, and fix the specific component that failed.

Flexibility. Swap any component independently. Move from Deepgram to Whisper for STT without touching your LLM or TTS. Switch from ElevenLabs to Cartesia for TTS to save costs. A/B test different LLMs without changing your audio pipeline.

Visibility for evaluation. In an interview platform, the transcript is a first-class artifact. The evaluation engine needs text. Human reviewers need text. Compliance teams need text. With cascaded, you get the transcript as a natural byproduct of the pipeline — no extra processing needed.

Mature ecosystem. Every STT, LLM, and TTS provider has well-documented APIs, SDKs in every language, and years of production hardening. The cascaded approach has been in production at companies like Twilio, Amazon Connect, and Google Contact Center since well before the LLM era.

The Latency Problem

Here’s where cascaded struggles. Each component runs sequentially, and latencies compound:

VAD (speech end detection):     50ms   ← waiting for silence
Audio encoding + network:       20ms   ← uploading to STT
STT processing:                150ms   ← Deepgram Nova-3
Network (STT → LLM):           10ms
LLM first token:              200ms   ← Gemini Flash
LLM remaining tokens:         300ms   ← streaming eliminates this wait
Network (LLM → TTS):           10ms
TTS first audio byte:          75ms   ← ElevenLabs Flash v2.5
Network + playback buffer:     30ms
──────────────────────────────────
Naive total:                  845ms   ← too slow

845 milliseconds puts you firmly in the “something’s off” zone. But this is the naive calculation — sequential processing where each step waits for the previous one to fully complete.

Streaming Saves the Day

The key insight is that you don’t need to wait for each step to finish before starting the next one. With streaming, you can overlap processing:

STT streams partial results → LLM starts generating on first complete sentence
LLM streams tokens          → TTS starts synthesizing on first sentence fragment
TTS streams audio chunks    → Speaker plays while more audio is being generated

With aggressive streaming:

VAD:                    50ms
STT (streaming):        80ms   ← partial result, not final
LLM (first token):     150ms   ← starts on partial STT
TTS (first audio):      60ms   ← starts on first LLM tokens
─────────────────────────
Effective latency:     340ms   ← within budget

The trick is accepting slightly lower accuracy on the initial response in exchange for dramatically lower latency. The STT’s partial results might have minor errors, but the LLM can usually compensate. If the STT later corrects itself, you can adjust the conversation flow without the user noticing.
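The LLM-to-TTS overlap can be made concrete with a small sentence chunker. This is a sketch, not a provider API: the function name, the regex, and the minimum-length threshold are illustrative, and any iterable of text deltas can stand in for the real token stream.

```python
import re

# Plausible sentence boundary: terminal punctuation, optionally followed by a
# closing quote/bracket, at the end of the buffer.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s*$')

def sentence_chunks(token_stream, min_chars=20):
    """Yield TTS-ready chunks at sentence boundaries as LLM tokens arrive."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush as soon as we have a complete sentence of reasonable length,
        # so TTS can start synthesizing while the LLM is still generating.
        if len(buffer) >= min_chars and SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends
```

Each yielded chunk goes straight to the TTS, so the first audio byte depends only on the first sentence, not the full response.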

Best practice: Enable streaming at every stage. Non-streaming cascaded pipelines are not viable for real-time voice.

Best practice: Use “eager processing” — start the LLM as soon as you have a medium-confidence STT transcript, not the final one. This can save 50-100ms.
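One way to implement eager processing is to fire the LLM the first time an interim transcript crosses a confidence threshold, then restart only if the final transcript diverges. A minimal sketch: `EagerDispatcher`, the 0.8 default, and the interim/final callback shape are assumptions for illustration, not any specific STT provider's API.

```python
class EagerDispatcher:
    """Decide when to start LLM generation from interim STT results."""

    def __init__(self, confidence_threshold=0.8):
        self.confidence_threshold = confidence_threshold
        self.dispatched_text = None  # transcript the LLM was started with

    def on_interim(self, text, confidence):
        """Return True exactly once per utterance, when confidence is high enough."""
        if self.dispatched_text is None and confidence >= self.confidence_threshold:
            self.dispatched_text = text
            return True  # caller starts LLM generation now
        return False

    def on_final(self, text):
        """Return True if the final transcript diverged and generation should restart."""
        needs_restart = self.dispatched_text is not None and text != self.dispatched_text
        self.dispatched_text = None
        return needs_restart
```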

Speech-to-Speech Models: The Single-Hop Approach

Speech-to-speech (S2S) models skip the intermediate text entirely:

Microphone → [VAD] → [S2S Model] → Speaker
               ~50ms    ~320-700ms
                      ──────────────
               Total:  ~370-750ms

One model. One API call. Audio in, audio out. No text conversion steps, no inter-service network hops, no cascading latency.

The Players

Gemini Live API (Google)

  • First audio response: 320-800ms
  • Native multimodal: processes audio + video simultaneously
  • WebSocket connection with bidirectional streaming
  • Best for: Interview scenarios requiring visual analysis (coding interviews, screen sharing)

OpenAI Realtime API

  • WebRTC mode for browsers, WebSocket for server-side
  • Native function calling during conversation
  • 60-minute maximum session length
  • Best for: Structured interviews with complex tool use

Amazon Bedrock Nova Sonic

  • Response latency: under 700ms
  • Bidirectional HTTP/2 streaming
  • 100+ languages with native accents
  • Non-verbal cue detection (laughter, hesitations)
  • Best for: Enterprise deployments requiring AWS compliance integration

Grok Voice Agent API (xAI)

  • Time-to-first-audio: under 1 second
  • Flat rate: $0.05/minute (no per-token billing)
  • OpenAI Realtime API compatible
  • Best for: Cost-sensitive deployments at scale

Why S2S Feels More Natural

Speech-to-speech models capture nuances that cascaded pipelines lose. When a cascaded pipeline converts speech to text, it discards:

  • Intonation. The way a candidate says “I think so…” (trailing off, uncertain) versus “I think so!” (confident, decisive) is completely lost in transcription.
  • Pacing. A candidate pausing to think versus pausing because they don’t know the answer produces the same text but very different audio signals.
  • Vocal emotion. Enthusiasm, frustration, confusion — all carry information that text flattens.

S2S models process raw audio features, preserving these signals. The result is responses that feel more contextually appropriate. When a candidate sounds uncertain, the AI might offer encouragement. When they sound confident, it might challenge them more.

Nova Sonic explicitly supports this — it detects laughter, hesitations, and inter-sentential pauses and uses them to shape its responses.

The Trade-Offs

Black box. You can’t inspect intermediate state. If the AI asks a weird follow-up question, you can’t see what it “heard” or what it “thought” — you just hear the output. Debugging requires replaying the entire audio session.

No native transcript. S2S models produce audio, not text. You need a separate transcription pass for evaluation, compliance, and human review. This is typically done asynchronously after the conversation, which means real-time transcript display requires running a parallel STT alongside the S2S model.

Model lock-in. Your LLM and TTS are the same model. You can’t swap to a better LLM while keeping the same voice. If OpenAI Realtime has a bad LLM day, you can’t switch to Gemini’s TTS — you switch the entire pipeline.

Function calling limitations. While OpenAI Realtime and Gemini Live support function calling, it’s less mature than text-based function calling. Complex tool chains with multiple sequential calls can introduce noticeable delays.

The Hybrid Architecture: Use Both

The best production systems don’t choose one approach — they use both for different purposes.

┌─────────────────────────────────────────────────────────┐
│                 INTERVIEW SESSION                        │
│                                                          │
│  PRIMARY PATH (real-time conversation):                  │
│  Mic → VAD → S2S Model (Gemini Live / OpenAI Realtime)  │
│        └→ Speaker                                        │
│                                                          │
│  PARALLEL PATH (transcript + analysis):                  │
│  Mic → VAD → STT (Deepgram) → Transcript Logger         │
│        └→ Real-time transcript display                   │
│        └→ RAG context retrieval                          │
│        └→ Session state updates                          │
│                                                          │
│  POST-SESSION (evaluation):                              │
│  Full transcript + audio → Evaluation LLM → Scores      │
└─────────────────────────────────────────────────────────┘

The S2S model handles the conversation — it’s the fastest path from candidate speech to AI response. Meanwhile, a parallel STT stream generates real-time transcripts for display, RAG retrieval, and session state management.

Why This Works for Interviews

  1. Latency stays optimal. The conversation path uses S2S for the lowest possible latency.
  2. You still get transcripts. The parallel STT provides real-time text for display, evaluation, and compliance.
  3. RAG can feed context. The STT transcript drives knowledge base lookups that get injected into the S2S model’s context.
  4. Evaluation is text-based. The parallel transcript feeds directly into the post-interview evaluation engine.
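At the plumbing level, the two paths share one audio stream, so the core mechanism is a fan-out. A minimal asyncio sketch, assuming each branch consumes from its own queue (the queue-based shape is an assumption; frameworks like LiveKit deliver frames through their own abstractions):

```python
import asyncio

async def fan_out(frame_source, s2s_queue, stt_queue):
    """Duplicate every mic frame to both pipeline branches."""
    async for frame in frame_source:
        await s2s_queue.put(frame)   # real-time conversation path (S2S model)
        await stt_queue.put(frame)   # transcript / RAG / session-state path
    await s2s_queue.put(None)        # signal end of stream to both consumers
    await stt_queue.put(None)
```

The S2S consumer and the STT consumer then run as independent tasks, so a slow transcription branch never adds latency to the conversation branch.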

The Cost of Hybrid

You’re running two services on the same audio: the S2S model and a separate STT. At Deepgram’s streaming rate of $0.0077/minute, adding parallel transcription costs about $0.23 for a 30-minute interview. That’s trivial compared to the S2S model cost.
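The arithmetic is simple enough to encode directly. The helper name is ours; the $0.0077/minute default is the Deepgram streaming rate quoted above.

```python
def parallel_stt_cost(minutes: float, rate_per_min: float = 0.0077) -> float:
    """Marginal cost (USD) of the parallel STT branch for a session."""
    return round(minutes * rate_per_min, 2)

print(parallel_stt_cost(30))  # 30-minute interview → 0.23
```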

Best practice: Always run parallel STT even with S2S models. The transcript is too valuable for evaluation, compliance, and debugging to skip.

VAD: The Unsung Hero (and Villain)

Voice Activity Detection determines when a person has stopped speaking. Get it wrong and your entire pipeline falls apart — regardless of whether you’re using cascaded or S2S.

How VAD Works

Modern VAD uses a neural network (typically Silero VAD) that classifies each audio frame as speech or non-speech. But raw speech detection isn’t enough — you need several additional signals:

# Simplified VAD decision logic
class InterviewVAD:
    def __init__(self, model):
        self.model = model                     # Frame-level speech classifier (e.g. Silero)
        self.min_speech_duration = 0.1         # Ignore speech < 100ms (noise)
        self.min_silence_duration = 0.4        # Wait 400ms of silence before "done"
        self.activation_threshold = 0.5        # Higher = less sensitive to noise
        self.max_buffered_speech = 30.0        # Allow long technical explanations
        self.silence_duration = 0.0            # Running silence tally (seconds)
        self.current_speech_duration = 0.0     # Running speech tally (seconds)

    def should_end_turn(self, audio_frames, frame_duration=0.03):
        # 1. Is there speech in this frame window?
        speech_probability = self.model.predict(audio_frames)

        if speech_probability >= self.activation_threshold:
            self.silence_duration = 0.0
            self.current_speech_duration += frame_duration
        else:
            # 2. Has silence lasted long enough?
            self.silence_duration += frame_duration
            if self.silence_duration >= self.min_silence_duration:
                return True

        # 3. Has the candidate been speaking too long?
        if self.current_speech_duration > self.max_buffered_speech:
            return True  # Force a turn boundary

        return False

Interview-Specific VAD Tuning

Default VAD settings are tuned for casual conversation. Interviews are different:

┌──────────────────────┬─────────┬───────────────────┬──────────────────────────────────────────────────────────┐
│ Parameter            │ Default │ Interview Setting │ Why                                                      │
├──────────────────────┼─────────┼───────────────────┼──────────────────────────────────────────────────────────┤
│ min_silence_duration │ 250ms   │ 400ms             │ Candidates pause to think. Don't cut them off.           │
│ activation_threshold │ 0.25    │ 0.50              │ Interviews are in quiet rooms. Reduce false activations. │
│ max_buffered_speech  │ 10-15s  │ 30s               │ Technical explanations are long.                         │
│ padding_duration     │ 0.1s    │ 0.1s              │ Keep the same — this is about audio quality.             │
└──────────────────────┴─────────┴───────────────────┴──────────────────────────────────────────────────────────┘

Semantic VAD: The Next Level

Basic VAD only looks at audio energy. Semantic VAD also considers what was said:

  • “Well, I think that…” (trailing off) → probably done speaking
  • “First, let me explain the architecture…” → definitely not done
  • “…and that’s how I would approach it.” → done with high confidence

LiveKit’s semantic turn detection combines audio-level VAD with transcript analysis to make smarter end-of-turn decisions. This reduces both premature cutoffs and awkward waiting.
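A toy version of this idea is a rule layer on top of the audio-level VAD. To be clear, this is not LiveKit's turn detector (which uses a trained model); the cue lists and thresholds here are invented for illustration.

```python
def end_of_turn(transcript, silence_ms, fast_ms=250, default_ms=400, slow_ms=900):
    """Combine the audio silence signal with simple transcript cues."""
    text = transcript.strip().lower()
    # Enumeration openers promise more content: wait much longer.
    if text.startswith(("first,", "let me", "so, to start")):
        return silence_ms >= slow_ms
    # A completed sentence or a trailing-off ending: commit quickly.
    if text.endswith((".", "?", "!", "...")):
        return silence_ms >= fast_ms
    # Ambiguous: fall back to the plain audio-VAD threshold.
    return silence_ms >= default_ms
```

The payoff is asymmetric thresholds: fast commits after clear endings, patient waiting mid-explanation, instead of one fixed silence timeout for both.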

Best practice: Start with Silero VAD at 400ms silence threshold. If you’re getting too many cutoffs during thinking pauses, increase to 500ms. If the AI waits too long after the candidate finishes, consider adding semantic VAD on top.

Audio Encoding: The Decisions That Affect Everything

Sample Rate

  • 16kHz for STT input. All major STT models expect 16kHz mono. Sending 48kHz wastes bandwidth.
  • 24kHz for TTS output. Most neural TTS models generate at 24kHz. ElevenLabs and Cartesia both output 24kHz by default.
  • 48kHz for recording. If you’re archiving interviews, record at the highest quality available.
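If you capture once at 48kHz for archival and also feed STT, you need a resample step in between. A crude numpy sketch: block-averaging stands in for a real anti-alias filter, and production code would use a proper polyphase resampler (e.g. scipy.signal.resample_poly with up=1, down=3).

```python
import numpy as np

def downsample_48k_to_16k(pcm48: np.ndarray) -> np.ndarray:
    """Average every 3 int16 samples: 48000 / 3 = 16000 Hz."""
    usable = len(pcm48) - (len(pcm48) % 3)             # drop any ragged tail
    grouped = pcm48[:usable].astype(np.int32).reshape(-1, 3)
    return grouped.mean(axis=1).astype(np.int16)       # back to 16-bit PCM
```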

Codec

  • Opus for WebRTC transport. Variable bitrate, excellent at low bitrates (16-32 kbps), built into every browser and LiveKit.
  • PCM (16-bit) for processing. STT and TTS models expect uncompressed PCM. Decode Opus to PCM at the server boundary.
  • AAC for recording storage. Good compression, universal playback support.

Best practice: Let LiveKit handle codec negotiation — it defaults to Opus for transport and gives you PCM frames in the agent SDK. Don’t try to optimize audio encoding unless you’ve identified it as a bottleneck.

Latency Budget Template

Here’s a practical template for planning your pipeline latency:

CASCADED PIPELINE BUDGET:
┌─────────────────────────────────────┬──────────┬──────────┐
│ Component                           │ Budget   │ Actual   │
├─────────────────────────────────────┼──────────┼──────────┤
│ VAD (end-of-speech detection)       │ 50ms     │ ___ms    │
│ Audio encoding + network            │ 20ms     │ ___ms    │
│ STT (streaming partial)             │ 100ms    │ ___ms    │
│ LLM first token (streaming)         │ 200ms    │ ___ms    │
│ TTS first audio (streaming)         │ 75ms     │ ___ms    │
│ Network + playback buffer           │ 30ms     │ ___ms    │
├─────────────────────────────────────┼──────────┼──────────┤
│ TOTAL                               │ 475ms    │ ___ms    │
└─────────────────────────────────────┴──────────┴──────────┘

S2S PIPELINE BUDGET:
┌─────────────────────────────────────┬──────────┬──────────┐
│ Component                           │ Budget   │ Actual   │
├─────────────────────────────────────┼──────────┼──────────┤
│ VAD (end-of-speech detection)       │ 50ms     │ ___ms    │
│ Audio encoding + network            │ 20ms     │ ___ms    │
│ S2S model first audio               │ 350ms    │ ___ms    │
│ Network + playback buffer           │ 30ms     │ ___ms    │
├─────────────────────────────────────┼──────────┼──────────┤
│ TOTAL                               │ 450ms    │ ___ms    │
└─────────────────────────────────────┴──────────┴──────────┘

Print this out. Fill in actual measurements from your development environment. If any component exceeds its budget, that’s your optimization target.
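The same audit can live in code. A small sketch using the cascaded budget numbers from the table above (the function and dict names are ours):

```python
# Cascaded pipeline budget from the table above, in milliseconds.
CASCADED_BUDGET_MS = {
    "vad": 50,
    "encoding_network": 20,
    "stt_partial": 100,
    "llm_first_token": 200,
    "tts_first_audio": 75,
    "playback_buffer": 30,
}

def over_budget(actual_ms: dict) -> list:
    """Return the components whose measured latency exceeds their budget."""
    return [name for name, ms in actual_ms.items()
            if ms > CASCADED_BUDGET_MS.get(name, 0)]
```

Feed it measurements from your development environment; anything it returns is your optimization target.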

Decision Framework

When should you use which approach?

┌───────────────────────────────┬───────────┬───────────────────────┬─────────────┐
│ Criterion                     │ Cascaded  │ S2S                   │ Hybrid      │
├───────────────────────────────┼───────────┼───────────────────────┼─────────────┤
│ Need real-time transcripts    │ Best      │ Requires parallel STT │ Good        │
│ Latency requirement < 400ms   │ Difficult │ Good                  │ Good        │
│ Debugging/monitoring          │ Best      │ Limited               │ Good        │
│ Provider flexibility          │ Best      │ Locked to model       │ Good        │
│ Voice naturalness             │ Good      │ Best                  │ Best        │
│ Implementation complexity     │ Simple    │ Simple                │ Moderate    │
│ Cost (30-min interview)       │ ~$2.50    │ ~$2.00-3.00           │ ~$2.30-3.25 │
└───────────────────────────────┴───────────┴───────────────────────┴─────────────┘

For an interview platform, I recommend the hybrid approach. The S2S path gives you the lowest latency for conversation, while the parallel STT path gives you the transcripts you need for evaluation and compliance. The added complexity is modest, and the benefits compound across the entire platform.

What’s Next

We’ve decided on a hybrid architecture. Now we need to choose the framework that implements it. In Part 3, we’ll compare LiveKit, Pipecat, and direct provider integration — with code examples in all three and a clear decision framework for choosing.


This is Part 2 of a 12-part series: The Voice AI Interview Playbook.

Series outline:

  1. Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
  2. Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (this post)
  3. LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
  4. STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
  5. Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
  6. Knowledge Base and RAG — Making your voice agent an expert (Part 6)
  7. Web and Mobile Clients — Cross-platform voice experiences (Part 7)
  8. Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
  9. Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
  10. Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
  11. Cost Optimization — From $0.14/min to $0.03/min (Part 11)
  12. Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)