In Part 3, we chose our framework. LiveKit Agents won for production deployments, Pipecat for rapid prototyping, and direct WebSocket for ultra-low-latency custom builds. Now comes the part that actually makes or breaks your voice AI interview system: the individual components inside that pipeline.
STT, LLM, and TTS are not interchangeable commodities. The difference between Deepgram Nova-3 and a generic Whisper deployment is not just $0.003/minute — it is 80ms of latency that users feel as unnatural hesitation. The difference between ElevenLabs Flash v2.5 and a mediocre TTS voice is the difference between “this feels like a real interview” and “I am talking to a robot.”
In this post, I am going to walk through every component with real benchmarks, real production data, and the specific combinations that I have found work best for interview applications. I will also show you the streaming integration code that ties them all together, because reading about pipelines and actually building one are very different things.
The Component Landscape
Before diving in, let me lay out the decision space. You are choosing three components that need to work together within a latency budget:
- Total acceptable latency: under 800ms from end of user speech to start of AI voice response
- STT budget: 100-200ms
- LLM first token budget: 250-400ms
- TTS first audio budget: 75-150ms
- Pipeline overhead: 50-100ms
That adds up to roughly 475-850ms. The low end is achievable. The high end is where you start losing the “real conversation” feeling. Every component choice either tightens or blows this budget.
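It helps to keep these numbers executable so you can assert against them in monitoring. A minimal sketch using only the budget figures above (the stage names are mine, not from any framework):

```python
# Per-component latency budgets in ms, taken from the ranges above.
BUDGET_MS = {
    "stt": (100, 200),
    "llm_first_token": (250, 400),
    "tts_first_audio": (75, 150),
    "pipeline_overhead": (50, 100),
}

def total_budget_ms() -> tuple[int, int]:
    """Sum the best-case and worst-case budgets across all components."""
    low = sum(lo for lo, _ in BUDGET_MS.values())
    high = sum(hi for _, hi in BUDGET_MS.values())
    return low, high

def within_budget(measured_ms: dict[str, float], limit_ms: int = 800) -> bool:
    """Check a set of measured stage latencies against the total limit."""
    return sum(measured_ms.values()) <= limit_ms

low, high = total_budget_ms()
print(f"best case {low}ms, worst case {high}ms")  # best case 475ms, worst case 850ms
```

Wiring `within_budget` into per-turn telemetry gives you an alert the moment a provider regression blows the conversation budget.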
STT: Turning Interview Audio into Accurate Text
Speech-to-text is where most voice AI systems fail quietly. You never notice bad STT until a candidate says “Kubernetes” and your system hears “cube net ease” and the LLM starts answering a question about networking puzzles.
Interview audio has specific challenges:
- Technical vocabulary: AWS, Kubernetes, GraphQL, SOLID principles, CI/CD — words that general speech models see rarely
- Thinking pauses: candidates genuinely pause mid-sentence to collect thoughts. You need a model that does not rush to finalize the transcript
- Filler words: “um,” “uh,” “like,” “you know” — you want these for sentiment analysis but often want them cleaned up before the LLM sees them
- Accents: your candidate pool is global. Your STT needs to be too
Deepgram Nova-3
This is my go-to for production interview systems. The numbers:
| Metric | Deepgram Nova-3 |
|---|---|
| Latency (streaming) | ~150ms to first partial |
| WER (general) | 8.4% |
| WER (technical vocab) | 11.2% |
| Price | $0.0077/min ($0.46/hr) |
| Languages | 36 |
| Custom vocabulary | Yes (free, instant) |
The custom vocabulary feature is underrated. You can pass a list of domain-specific terms at connection time and Nova-3 will bias toward them. For a software engineering interview:
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

dg_client = DeepgramClient(api_key="YOUR_KEY")

options = LiveOptions(
    model="nova-3",
    language="en-US",
    smart_format=True,
    punctuate=True,
    diarize=False,        # single speaker in most interview scenarios
    filler_words=True,    # capture "um", "uh" for analysis
    keywords=[
        "Kubernetes:3",   # boost weight
        "GraphQL:3",
        "microservices:2",
        "SOLID:2",
        "CI/CD:2",
        "AWS:2",
        "PostgreSQL:2",
        "Redis:2",
    ],
    endpointing=300,      # ms of silence before considering an utterance complete
    interim_results=True,
)

async def stream_audio_to_deepgram(audio_stream, on_transcript):
    async with dg_client.listen.asynclive.v("1") as connection:

        async def on_message(self, result, **kwargs):
            transcript = result.channel.alternatives[0].transcript
            if result.is_final and transcript:
                await on_transcript(transcript)
            elif not result.is_final:
                # Stream partial results to reduce perceived latency
                await on_transcript(transcript, is_partial=True)

        connection.on(LiveTranscriptionEvents.Transcript, on_message)
        await connection.start(options)

        async for chunk in audio_stream:
            await connection.send(chunk)
The endpointing=300 setting is important for interviews. The default is 10ms which is fine for quick commands but terrible when someone pauses to think. I have found 250-350ms works well — long enough to not cut people off, short enough to not feel laggy.
OpenAI Whisper (via API or self-hosted)
Whisper is the accuracy leader for challenging audio — heavy accents, background noise, highly technical content. The tradeoff is latency and cost.
| Metric | Whisper API | Whisper Self-Hosted (A10G) |
|---|---|---|
| Latency | 500-2000ms (batch) | 200-400ms (streaming via faster-whisper) |
| WER (general) | 7.1% | 6.8% |
| WER (technical vocab) | 9.3% | 9.0% |
| Price | $0.006/min | ~$0.008/min (compute) |
| Streaming support | No (batch only via API) | Yes (with faster-whisper) |
The official Whisper API does not support streaming — you send audio chunks and get back completed transcripts, which is a deal-breaker for real-time conversation. If you want Whisper’s accuracy with streaming, you need to self-host using faster-whisper with CTranslate2 backend.
For most interview use cases, Deepgram Nova-3 is the better choice. Whisper makes sense if you have a specific accuracy requirement that Nova-3 cannot meet, or if you are dealing with highly accented speech in languages where Deepgram’s support is weaker.
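If you do go the self-hosted route, a minimal faster-whisper sketch looks like the following. The model size, device, and VAD settings are assumptions to tune for your hardware, and `join_segments` is a small helper of mine for assembling the transcript:

```python
from typing import Iterable

def join_segments(segments: Iterable) -> str:
    """Join faster-whisper segments into a single transcript string."""
    return " ".join(seg.text.strip() for seg in segments)

def transcribe_interview_audio(path: str) -> str:
    # Imported inside the function so the helper above works without the package.
    from faster_whisper import WhisperModel

    # large-v3 on GPU with float16 is a common accuracy/latency tradeoff;
    # drop to a smaller model or int8 on CPU if you lack an A10G-class GPU.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, info = model.transcribe(
        path,
        beam_size=5,
        vad_filter=True,  # skip silence, which interview audio has plenty of
    )
    return join_segments(segments)
```

Note that `transcribe` returns a lazy generator of segments, so the actual decoding happens as `join_segments` iterates.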
AssemblyAI
AssemblyAI sits at an interesting price point — $0.37/hr ($0.0062/min) — and offers features that go beyond pure transcription:
| Metric | AssemblyAI Nano | AssemblyAI Best |
|---|---|---|
| Latency (streaming) | ~180ms | ~250ms |
| WER | 10.2% | 8.8% |
| Price | $0.0062/min | $0.0140/min |
| Sentiment analysis | Yes | Yes |
| Speaker labels | Yes | Yes |
| Content moderation | Yes | Yes |
The sentiment analysis is genuinely useful for interview applications. You can detect when a candidate is frustrated, confident, or uncertain without post-processing. The “LeMUR” feature lets you ask LLM-style questions about the transcript in real-time, though at additional cost.
The downside is that AssemblyAI’s streaming implementation has been less reliable in my experience — occasional connection drops and higher tail latency compared to Deepgram.
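As a sketch of how the sentiment feature is consumed (shown with the batch SDK for clarity; the streaming setup differs, and the tally helper is mine):

```python
from collections import Counter

def summarize_sentiment(results) -> Counter:
    """Tally sentiment labels (POSITIVE/NEUTRAL/NEGATIVE) across utterances."""
    return Counter(str(r.sentiment) for r in results)

def analyze_interview_recording(audio_url: str) -> Counter:
    # Imported inside the function so summarize_sentiment works without the SDK.
    import assemblyai as aai

    aai.settings.api_key = "YOUR_KEY"
    config = aai.TranscriptionConfig(sentiment_analysis=True)
    transcript = aai.Transcriber().transcribe(audio_url, config)
    # Each result carries the utterance text, a sentiment label, and timestamps
    return summarize_sentiment(transcript.sentiment_analysis)
```

A skew toward NEGATIVE in the second half of an interview is the kind of signal you would otherwise need a post-processing LLM pass to extract.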
Google Cloud Speech-to-Text v2
Google’s STT is often overlooked but handles certain edge cases well, particularly phone-quality audio and strong non-native accents:
| Metric | Google STT v2 (Chirp) |
|---|---|
| Latency (streaming) | ~200ms |
| WER | 8.9% |
| Price | $0.016/min (streaming) |
| Languages | 125+ |
The pricing is nearly double Deepgram and the latency is slightly higher, making it hard to recommend as a default choice. Where it shines: language diversity. If you are building for markets where Deepgram has limited language support, Google is the pragmatic choice.
STT Decision Matrix
| Use case | Recommended STT | Why |
|---|---|---|
| English, cost-sensitive | Deepgram Nova-3 | Best price/perf ratio |
| Maximum accuracy, English | OpenAI Whisper (self-hosted) | Best WER on challenging audio |
| Sentiment analysis built-in | AssemblyAI | Features > marginal accuracy loss |
| 100+ language support | Google Cloud STT | Broadest language coverage |
| Low latency + accuracy balance | Deepgram Nova-3 | 150ms is hard to beat |
LLM: The Brain of Your Interview Agent
The LLM is where your interview agent actually thinks — understanding context, formulating follow-up questions, evaluating answers, staying in character as an interviewer. The requirements for interview applications are specific:
- Context window: A 45-minute interview at average speaking pace (~150 words/min) generates ~6,750 words. Plus your system prompt (500-1000 tokens) and structured interview guide (500-1500 tokens). You need at least 16k tokens, ideally 32k+
- First token latency: under 400ms to keep total pipeline latency reasonable
- Instruction following: the LLM must stay in the interviewer persona, not ramble, not go off-script
- Function calling: for triggering scoring, updating candidate records, fetching competency definitions mid-interview
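The context arithmetic above is worth encoding as a quick estimator. The tokens-per-word ratio here is a rough heuristic for English, not a real tokenizer:

```python
TOKENS_PER_WORD = 1.33  # rough heuristic for English prose

def interview_context_tokens(
    minutes: int = 45,
    words_per_minute: int = 150,
    system_prompt_tokens: int = 1000,
    interview_guide_tokens: int = 1500,
) -> int:
    """Estimate total context tokens for a full interview transcript."""
    transcript_words = minutes * words_per_minute
    transcript_tokens = int(transcript_words * TOKENS_PER_WORD)
    return transcript_tokens + system_prompt_tokens + interview_guide_tokens

print(interview_context_tokens())  # roughly 11.5k tokens: inside 16k, but 32k leaves headroom
```

Run it with your own prompt sizes before picking a model; a long rubric or multi-page interview guide can push a 16k window uncomfortably close to its limit.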
Gemini 2.0 Flash
This is the speed champion right now and it is not particularly close:
| Metric | Gemini 2.0 Flash |
|---|---|
| First token latency | ~250-350ms |
| Output tokens/sec | 150-200 |
| Context window | 1M tokens |
| Price (input) | $0.075/1M tokens |
| Price (output) | $0.30/1M tokens |
| Function calling | Yes |
The 1M token context window is genuinely useful for long interview sessions where you want to keep the entire conversation history. The price is also excellent — a typical 45-minute interview with ~8k tokens in and ~2k tokens out costs roughly $0.0012 in LLM costs alone.
The tradeoff is reasoning depth. For behavioral interview questions where the AI needs to probe deeply and recognize complex answer patterns, Gemini Flash occasionally gives shallower follow-ups than GPT-4o or Claude. For structured, formulaic interviews (screening calls, technical trivia), Flash is excellent.
from typing import AsyncIterator

import google.generativeai as genai
from google.generativeai.types import GenerationConfig

genai.configure(api_key="YOUR_KEY")

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction="""You are Alex, a senior technical interviewer at Acme Corp.
You are conducting a 45-minute software engineering interview.
Your job is to assess the candidate's technical depth, problem-solving approach,
and communication skills. Ask one question at a time. When the candidate finishes
answering, either ask a follow-up to probe deeper or move to the next topic.
Keep responses concise — 2-3 sentences maximum for interview questions.
Do not explain your reasoning or evaluation to the candidate.""",
    generation_config=GenerationConfig(
        max_output_tokens=150,  # keep responses short for voice
        temperature=0.7,
    ),
)

async def get_interviewer_response(
    conversation_history: list[dict],
) -> AsyncIterator[str]:
    # Stream the response for lower perceived latency
    async for chunk in await model.generate_content_async(
        conversation_history,
        stream=True,
    ):
        if chunk.text:
            yield chunk.text  # yield tokens as they arrive for TTS streaming
GPT-4o
GPT-4o is the quality benchmark for complex interview scenarios:
| Metric | GPT-4o |
|---|---|
| First token latency | ~400-600ms |
| Output tokens/sec | 60-80 |
| Context window | 128k tokens |
| Price (input) | $2.50/1M tokens |
| Price (output) | $10.00/1M tokens |
| Function calling | Yes (best-in-class) |
GPT-4o’s function calling is noticeably more reliable than alternatives — when you define tools for “mark_answer_complete,” “request_code_example,” or “trigger_evaluation,” GPT-4o calls them at the right moments without hallucinating extra tool calls or forgetting to call them.
The latency is the real issue. At 400-600ms first token, you are right at the edge of the total 800ms budget. You have almost nothing left for STT (150ms) and TTS (75ms). In practice, GPT-4o pushes total round-trip latency to 900-1200ms, which users notice.
My compromise: use GPT-4o for the evaluation and scoring agents (which run asynchronously after each answer) but use Gemini Flash for the live conversation agent.
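That split can be expressed as a fire-and-forget evaluation queue: the live loop submits each answer and never awaits the scorer. This is a sketch of the pattern, with `evaluate` as a stand-in for your GPT-4o scoring call:

```python
import asyncio
from typing import Awaitable, Callable

class AsyncEvaluator:
    """Run per-answer evaluations without blocking the live conversation."""

    def __init__(self, evaluate: Callable[[str, str], Awaitable[dict]]):
        self.evaluate = evaluate      # e.g. a GPT-4o scoring call
        self.results: list[dict] = []
        self._tasks: set[asyncio.Task] = set()

    def submit(self, question: str, answer: str) -> None:
        # Fire-and-forget: the live Gemini loop continues immediately.
        task = asyncio.create_task(self._run(question, answer))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _run(self, question: str, answer: str) -> None:
        self.results.append(await self.evaluate(question, answer))

    async def drain(self) -> list[dict]:
        """Wait for any outstanding evaluations (call at interview end)."""
        if self._tasks:
            await asyncio.gather(*self._tasks)
        return self.results
```

Keeping strong references to the tasks in `_tasks` matters: a bare `create_task` result can be garbage-collected mid-flight.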
Claude 3.5 Sonnet / Claude 3.7
Claude’s strength in interview applications is instruction-following precision and staying in character:
| Metric | Claude 3.5 Sonnet |
|---|---|
| First token latency | ~500-700ms |
| Output tokens/sec | 80-100 |
| Context window | 200k tokens |
| Price (input) | $3.00/1M tokens |
| Price (output) | $15.00/1M tokens |
| Function calling | Yes |
Claude tends to give the most “human-sounding” interview responses — natural transitions, appropriate acknowledgment of answers, good follow-up question construction. The latency knocks it out of first-choice status for live conversation, but it is excellent for the post-interview feedback generation (which can be async).
Claude 3.7 Sonnet (with extended thinking) is worth testing for technical evaluation — its reasoning depth on “did this candidate actually explain the SOLID principles correctly” type questions is noticeably better.
Grok
Grok (via xAI API) is the dark horse for cost-optimized builds:
| Metric | Grok-2 |
|---|---|
| First token latency | ~300-450ms |
| Output tokens/sec | 80-120 |
| Context window | 131k tokens |
| Price (input) | $2.00/1M tokens |
| Price (output) | $10.00/1M tokens |
At roughly $0.05/min all-in for typical interview token volumes, Grok is competitive. The quality is solid for structured interviews, though I have found it occasionally goes off-script in complex roleplay scenarios. If you are building a cost-optimized product and willing to do extra prompt engineering to keep the persona stable, Grok deserves a look.
LLM Decision Matrix
| Use case | Recommended LLM | Why |
|---|---|---|
| Live conversation, speed priority | Gemini 2.0 Flash | 250-350ms TTFT, cheap |
| Complex behavioral interviews | GPT-4o | Best instruction following |
| Post-interview evaluation | Claude 3.5 Sonnet | Best reasoning quality |
| Cost-optimized deployment | Gemini 2.0 Flash | Best cost/quality for voice |
| Function-heavy workflows | GPT-4o | Most reliable tool calling |
TTS: Giving Your AI Interviewer a Voice
TTS is what candidates actually hear. A technically perfect STT and LLM pipeline is worthless if the voice sounds like a 2015 IVR system. For interviews, you need:
- Naturalness: prosody, pacing, and intonation that sound like a real person
- Low TTFB (time to first byte): under 100ms to start audio immediately after LLM generates tokens
- Streaming support: do not wait for complete text before starting audio
- Professional tone: warm but authoritative — an interviewer, not a customer service bot
ElevenLabs Flash v2.5
ElevenLabs is the naturalness leader. Flash v2.5 specifically optimizes for low latency:
| Metric | ElevenLabs Flash v2.5 |
|---|---|
| TTFB | ~75ms |
| Naturalness (MOS score) | 4.3/5.0 |
| Languages | 70+ |
| Price | $0.18/1k characters (~$0.027/min) |
| Voice cloning | Yes (instant clone from 30s sample) |
| Streaming | Yes (WebSocket) |
The voices — particularly “Adam,” “Antoni,” and “Callum” for male voices, “Rachel” and “Bella” for female — are genuinely convincing. In user testing, participants regularly described the interview as feeling “like talking to a real person” when using ElevenLabs.
import asyncio
from typing import AsyncIterator

import httpx

ELEVENLABS_API_KEY = "YOUR_KEY"
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # Rachel - professional, clear

async def stream_tts_elevenlabs(
    text_stream: AsyncIterator[str],
    output_audio_queue: asyncio.Queue,
) -> None:
    """Stream text tokens to ElevenLabs and push audio chunks to queue."""
    async with httpx.AsyncClient() as client:
        # ElevenLabs streaming endpoint
        async with client.stream(
            "POST",
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
            headers={
                "xi-api-key": ELEVENLABS_API_KEY,
                "Content-Type": "application/json",
            },
            json={
                "model_id": "eleven_flash_v2_5",
                "voice_settings": {
                    "stability": 0.5,
                    "similarity_boost": 0.8,
                    "style": 0.2,  # subtle style, professional
                    "use_speaker_boost": True,
                },
                "text": await collect_stream(text_stream),
                "output_format": "pcm_24000",  # 24kHz PCM for quality
            },
        ) as response:
            async for chunk in response.aiter_bytes(chunk_size=4096):
                await output_audio_queue.put(chunk)
    await output_audio_queue.put(None)  # signal end of audio

async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Collect the full streaming text before synthesis.

    Passing the complete text gives the best prosody; sentence-boundary
    buffering (shown in the pipeline section) trades some prosody for latency.
    """
    buffer = ""
    async for token in stream:
        buffer += token
    return buffer
The pricing at $0.18/1k characters works out to roughly $0.16 per minute of synthesized speech (~150 words/min is around 900 characters). Amortized across a full interview, where the interviewer speaks only a fraction of the time, that is about $0.03 per interview minute. It is the most expensive TTS option but delivers the most natural output.
Cartesia Sonic
Cartesia is the latency-first choice:
| Metric | Cartesia Sonic |
|---|---|
| TTFB | ~50-65ms |
| Naturalness (MOS score) | 4.0/5.0 |
| Languages | 15+ |
| Price | $0.065/1k characters (~$0.010/min) |
| Voice cloning | Yes |
| Streaming | Yes (WebSocket) |
Cartesia Sonic-2 (their latest model) is impressive for latency. At 50-65ms TTFB, it shaves 10-25ms off ElevenLabs, which matters in highly latency-sensitive deployments. The voice quality is good — not quite ElevenLabs, but noticeably better than generic TTS.
At $0.065/1k characters, Cartesia is 3x cheaper than ElevenLabs. If your user research shows that the quality delta does not affect conversion or satisfaction metrics, Cartesia is the smart cost choice.
import base64
import json

import websockets

async def stream_tts_cartesia(
    text: str,
    audio_callback,
    voice_id: str = "a0e99841-438c-4a64-b679-ae501e7d6091",  # professional voice
) -> None:
    """Stream TTS from Cartesia via WebSocket."""
    async with websockets.connect(
        "wss://api.cartesia.ai/tts/websocket?api_key=YOUR_KEY&cartesia_version=2024-06-10"
    ) as ws:
        await ws.send(json.dumps({
            "model_id": "sonic-2",
            "transcript": text,
            "voice": {
                "mode": "id",
                "id": voice_id,
            },
            "output_format": {
                "container": "raw",
                "encoding": "pcm_f32le",
                "sample_rate": 24000,
            },
            "context_id": "interview-session",
        }))
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "chunk":
                audio_bytes = base64.b64decode(data["data"])
                await audio_callback(audio_bytes)
            elif data.get("type") == "done":
                break
Deepgram Aura-2
Deepgram’s TTS is underrated, especially if you are already using Deepgram for STT (single SDK, single bill, simpler architecture):
| Metric | Deepgram Aura-2 |
|---|---|
| TTFB | ~90ms |
| Naturalness (MOS score) | 3.8/5.0 |
| Languages | 2 (English, Spanish) |
| Price | $0.015/1k characters (~$0.0023/min) |
| Voice cloning | No |
| Streaming | Yes |
The price is remarkable — $0.015/1k characters versus ElevenLabs’ $0.18/1k. That is a 12x cost difference. The voice quality is noticeably more “TTS-like” — it lacks the naturalness of ElevenLabs or Cartesia, but for an internal recruiting tool where cost matters more than the slight uncanny valley effect, Aura-2 is worth serious consideration.
PlayHT
PlayHT’s 2.0 Turbo model is a solid mid-tier option:
| Metric | PlayHT 2.0 Turbo |
|---|---|
| TTFB | ~95ms |
| Naturalness (MOS score) | 3.9/5.0 |
| Languages | 142 |
| Price | $0.022/1k characters |
| Voice cloning | Yes (Instant Clone) |
| Streaming | Yes |
PlayHT’s language support (142 languages) is a differentiator if you are building for global markets. The quality sits between Deepgram Aura-2 and Cartesia.
Azure Neural TTS
Azure’s neural TTS is the enterprise default for a reason:
| Metric | Azure Neural TTS |
|---|---|
| TTFB | ~100-150ms |
| Naturalness (MOS score) | 3.8/5.0 |
| Languages | 54 locales, 129 voices |
| Price | $0.016/1k characters |
| Voice cloning | Yes (Custom Neural Voice) |
| Streaming | Yes (SSML) |
129 voices across 54 locales makes Azure the choice for truly global deployments. The Custom Neural Voice feature lets you clone a specific voice for brand consistency. Enterprise compliance (SOC 2, ISO 27001, HIPAA BAA) is built-in, which matters for recruiting applications that handle sensitive candidate data.
TTS Decision Matrix
| Use case | Recommended TTS | Why |
|---|---|---|
| Maximum naturalness | ElevenLabs Flash v2.5 | Best MOS score, 70+ languages |
| Minimum latency | Cartesia Sonic | 50-65ms TTFB |
| Minimum cost | Deepgram Aura-2 | $0.015/1k chars |
| Enterprise/compliance | Azure Neural TTS | HIPAA BAA, 129 voices |
| Global language coverage | PlayHT or Azure | 100+ languages |
| Deepgram stack consistency | Deepgram Aura-2 | Single SDK |
Voice Cloning and Custom Voices
For interview applications used by a single company, voice consistency matters. “Our AI interviewer is Alex” — not a random ElevenLabs preset. Every candidate should meet the same Alex.
Creating a Custom Interview Voice
ElevenLabs Instant Clone requires just 30 seconds of clean audio. For a production interview voice:
- Source audio: record a real person (or use a professional voice actor) reading 5-10 minutes of diverse content — questions, statements, numbers, technical terms
- Clean the audio: normalize to -16 LUFS, remove background noise, ensure consistent microphone distance
- Upload and clone: ElevenLabs will create the clone in 10-30 seconds
- Test extensively: run it through your actual interview script — pay attention to technical terms, question intonation, and sentence-ending cadence
import httpx

async def create_voice_clone(
    audio_file_path: str,
    voice_name: str,
    description: str,
) -> str:
    """Create an ElevenLabs voice clone. Returns the voice ID."""
    async with httpx.AsyncClient() as client:
        with open(audio_file_path, "rb") as f:
            response = await client.post(
                "https://api.elevenlabs.io/v1/voices/add",
                headers={"xi-api-key": "YOUR_KEY"},
                data={
                    "name": voice_name,
                    "description": description,
                    "labels": '{"accent": "American", "use_case": "interview"}',
                },
                files={"files": (audio_file_path, f, "audio/mpeg")},
            )
    result = response.json()
    return result["voice_id"]
For Cartesia, the process is similar — upload audio clips and get a custom voice ID back. Cartesia’s voice cloning tends to be slightly more consistent for professional tones, worth testing alongside ElevenLabs.
Audio Quality: Sample Rates and Codecs
Audio pipeline quality is invisible when it is right and obvious when it is wrong. These are the settings that have proven out in production:
Sample Rate Selection
| Stage | Recommended Rate | Why |
|---|---|---|
| STT input | 16kHz | STT models are trained at 16kHz; upsampling wastes bandwidth |
| TTS output | 24kHz | Captures full voice frequency range (20Hz-12kHz) without overhead |
| Recording/archival | 48kHz | Standard for professional audio, editing headroom |
| WebRTC transport | 48kHz | WebRTC standard; browser handles downsampling |
The mismatch trap: WebRTC natively operates at 48kHz. Deepgram works best at 16kHz. You need a resampler in your pipeline. Most frameworks (LiveKit, Pipecat) handle this, but if you are building directly, you need to explicitly resample:
import numpy as np
from scipy import signal

def resample_audio(audio_data: bytes, from_rate: int, to_rate: int) -> bytes:
    """Resample 16-bit PCM audio data between sample rates."""
    if from_rate == to_rate:
        return audio_data
    # Convert bytes to a float array (16-bit PCM input)
    samples = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32)
    # Calculate the target sample count from the rate ratio
    num_samples = int(len(samples) * to_rate / from_rate)
    # FFT-based resampling
    resampled = signal.resample(samples, num_samples)
    # Convert back to 16-bit PCM
    return resampled.astype(np.int16).tobytes()
Codec Choices
| Context | Codec | Why |
|---|---|---|
| WebRTC transport | Opus | Adaptive bitrate, handles packet loss gracefully |
| STT streaming | PCM (raw) | Zero decoding overhead, direct to STT |
| TTS streaming | PCM_24000 | Direct playback without decode step |
| Recording storage | AAC (128kbps) | 60-70% smaller than PCM, perceptually lossless for voice |
| Archival | FLAC | Lossless, compressed, good for long-term storage |
The practical pipeline: WebRTC audio arrives as Opus, you decode to PCM for STT, you request PCM from TTS, you encode to AAC for storage. This sounds complex but the codec operations are cheap — on a modern server, encoding/decoding adds under 5ms to your pipeline.
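For the storage leg, shelling out to ffmpeg is the pragmatic approach, assuming `ffmpeg` is on the server's PATH. The command builder is split out so it can be inspected and tested separately:

```python
import subprocess

def build_aac_encode_cmd(pcm_path: str, out_path: str,
                         sample_rate: int = 24000) -> list[str]:
    """ffmpeg command: raw 16-bit PCM in, 128kbps AAC out."""
    return [
        "ffmpeg", "-y",
        "-f", "s16le",             # raw signed 16-bit little-endian PCM
        "-ar", str(sample_rate),   # input sample rate
        "-ac", "1",                # mono voice audio
        "-i", pcm_path,
        "-c:a", "aac", "-b:a", "128k",
        out_path,
    ]

def encode_recording(pcm_path: str, out_path: str) -> None:
    """Encode a raw PCM recording to AAC for storage."""
    subprocess.run(build_aac_encode_cmd(pcm_path, out_path), check=True)
```

Because raw PCM carries no header, the `-f`, `-ar`, and `-ac` input flags are mandatory; get the sample rate wrong and the recording plays back pitch-shifted.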
Streaming Integration: The Full Pipeline
This is where the rubber meets the road. Streaming integration means you are not waiting for complete text before starting TTS — you are piping tokens from LLM to TTS as they arrive. The latency difference is dramatic: a buffered pipeline adds 500-1000ms of dead air, while a streaming pipeline adds almost nothing beyond the first sentence's synthesis time.
Here is the complete streaming pipeline that I use in production, written for Deepgram STT + Gemini Flash + ElevenLabs TTS:
import asyncio
from typing import AsyncIterator

import google.generativeai as genai
import httpx
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

# Configuration
DEEPGRAM_KEY = "YOUR_DEEPGRAM_KEY"
GEMINI_KEY = "YOUR_GEMINI_KEY"
ELEVENLABS_KEY = "YOUR_ELEVENLABS_KEY"
VOICE_ID = "YOUR_VOICE_ID"

genai.configure(api_key=GEMINI_KEY)

class VoiceInterviewPipeline:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.conversation_history = []
        self.audio_output_queue = asyncio.Queue()
        self.current_utterance = ""
        # Initialize the Gemini model
        self.model = genai.GenerativeModel(
            model_name="gemini-2.0-flash",
            system_instruction=system_prompt,
        )
        self.chat = self.model.start_chat(history=[])

    async def handle_stt_transcript(self, transcript: str, is_final: bool) -> None:
        """Called when Deepgram returns a transcript segment."""
        if not is_final:
            # Partial results — show in UI but don't process yet
            self.current_utterance = transcript
            return
        # Final transcript — send to LLM and pipe the response to TTS
        self.current_utterance = ""
        self.conversation_history.append({
            "role": "user",
            "parts": [transcript],
        })
        # Start the LLM + TTS pipeline (non-blocking)
        asyncio.create_task(self._llm_to_tts_pipeline(transcript))

    async def _llm_to_tts_pipeline(self, user_text: str) -> None:
        """
        Stream LLM tokens directly to TTS.
        Sentence-boundary buffering ensures natural TTS output
        while keeping latency low.
        """
        sentence_buffer = ""
        full_response = ""
        # Stream from Gemini
        async for chunk in await self.chat.send_message_async(
            user_text, stream=True
        ):
            if not chunk.text:
                continue
            sentence_buffer += chunk.text
            full_response += chunk.text
            # Send to TTS at sentence boundaries (natural pause points).
            # Awaiting sequentially keeps audio chunks in order in the queue.
            if any(sentence_buffer.rstrip().endswith(p) for p in [".", "?", "!", "\n"]):
                sentence_to_speak = sentence_buffer.strip()
                if len(sentence_to_speak) > 10:  # avoid tiny fragments
                    await self._send_to_tts(sentence_to_speak)
                    # Short fragments stay buffered and join the next sentence
                    sentence_buffer = ""
        # Send any remaining text
        if sentence_buffer.strip():
            await self._send_to_tts(sentence_buffer.strip())
        # Add the AI response to the conversation history
        self.conversation_history.append({
            "role": "model",
            "parts": [full_response],
        })

    async def _send_to_tts(self, text: str) -> None:
        """Send text to ElevenLabs TTS and push audio to the output queue."""
        async with httpx.AsyncClient(timeout=10.0) as client:
            async with client.stream(
                "POST",
                f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
                headers={
                    "xi-api-key": ELEVENLABS_KEY,
                    "Content-Type": "application/json",
                },
                json={
                    "text": text,
                    "model_id": "eleven_flash_v2_5",
                    "output_format": "pcm_24000",
                    "voice_settings": {
                        "stability": 0.5,
                        "similarity_boost": 0.8,
                    },
                },
            ) as response:
                async for audio_chunk in response.aiter_bytes(chunk_size=4096):
                    await self.audio_output_queue.put(audio_chunk)

    async def start_stt_listener(self, audio_input_stream: AsyncIterator[bytes]) -> None:
        """Connect to Deepgram and stream audio input."""
        dg_client = DeepgramClient(DEEPGRAM_KEY)
        options = LiveOptions(
            model="nova-3",
            language="en-US",
            smart_format=True,
            punctuate=True,
            filler_words=True,
            endpointing=300,
            interim_results=True,
            keywords=["Kubernetes:3", "AWS:2", "Python:2", "GraphQL:3"],
        )
        async with dg_client.listen.asynclive.v("1") as connection:

            async def on_transcript(self_dg, result, **kwargs):
                transcript = result.channel.alternatives[0].transcript
                if transcript:
                    await self.handle_stt_transcript(
                        transcript,
                        is_final=result.is_final,
                    )

            connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
            await connection.start(options)
            async for audio_chunk in audio_input_stream:
                await connection.send(audio_chunk)

    async def get_audio_output(self) -> AsyncIterator[bytes]:
        """Yield audio chunks from the output queue for playback."""
        while True:
            chunk = await self.audio_output_queue.get()
            if chunk is None:
                break
            yield chunk

# Usage
async def run_interview(audio_input_stream: AsyncIterator[bytes]):
    pipeline = VoiceInterviewPipeline(
        system_prompt="""You are Morgan, a senior engineering interviewer at TechCorp.
Conduct a 45-minute technical interview. Ask one question at a time.
Keep all responses under 3 sentences."""
    )
    # Start the STT listener; audio output is consumed by the WebRTC sender
    await asyncio.gather(
        pipeline.start_stt_listener(audio_input_stream),
    )
The key insight in this pipeline is sentence-boundary buffering. Sending tokens one-by-one to TTS would produce terrible prosody — TTS models need semantic context to generate natural intonation. But waiting for the complete LLM response adds 500ms+. Sentence-boundary buffering is the sweet spot: you send one sentence at a time, TTS starts generating audio within 75-100ms of the first sentence completing, and the user hears the first sentence while subsequent sentences are being synthesized.
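The buffering logic distills down to a small generator you can unit-test in isolation. This is a standalone sketch of the same idea, with the fragment threshold as a tunable constant:

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "?", "!", "\n")
MIN_FRAGMENT_CHARS = 10  # don't send tiny fragments to TTS

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS) and \
                len(buffer.strip()) >= MIN_FRAGMENT_CHARS:
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush whatever remains at stream end
        yield buffer.strip()
```

Because it is a plain generator over strings, you can test it against awkward token splits (abbreviations, decimals, trailing fragments) without touching any LLM or TTS API.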
Fallback Chains: When Your Primary Provider Has a Bad Day
Any production voice system needs fallback logic. ElevenLabs has occasional latency spikes (250ms TTFB instead of 75ms) and rare outages. Building a fallback chain is not optional.
Here is the provider priority for my interview systems:
- Primary: ElevenLabs Flash v2.5 (best naturalness)
- Fallback 1: Cartesia Sonic (slightly lower naturalness, similar latency)
- Fallback 2: Deepgram Aura-2 (noticeably less natural, but ultra-low cost and reliable)
import time
from enum import Enum

class TTSProvider(Enum):
    ELEVENLABS = "elevenlabs"
    CARTESIA = "cartesia"
    DEEPGRAM = "deepgram"

class TTSFallbackManager:
    def __init__(self):
        self.provider_health = {
            TTSProvider.ELEVENLABS: {"failures": 0, "last_failure": 0, "circuit_open": False},
            TTSProvider.CARTESIA: {"failures": 0, "last_failure": 0, "circuit_open": False},
            TTSProvider.DEEPGRAM: {"failures": 0, "last_failure": 0, "circuit_open": False},
        }
        self.priority = [TTSProvider.ELEVENLABS, TTSProvider.CARTESIA, TTSProvider.DEEPGRAM]
        self.latency_threshold_ms = 200   # switch to fallback if TTFB exceeds this
        self.circuit_reset_seconds = 60   # try the primary again after 60s

    def get_active_provider(self) -> TTSProvider:
        """Return the highest-priority healthy provider."""
        now = time.time()
        for provider in self.priority:
            health = self.provider_health[provider]
            # Reset the circuit breaker after the cooldown period
            if health["circuit_open"]:
                if now - health["last_failure"] > self.circuit_reset_seconds:
                    health["circuit_open"] = False
                    health["failures"] = 0
                    print(f"[TTS] Circuit breaker reset for {provider.value}")
                else:
                    continue
            return provider
        # All providers unhealthy — use Deepgram as a last resort
        return TTSProvider.DEEPGRAM

    def record_failure(self, provider: TTSProvider) -> None:
        health = self.provider_health[provider]
        health["failures"] += 1
        health["last_failure"] = time.time()
        if health["failures"] >= 3:
            health["circuit_open"] = True
            print(f"[TTS] Circuit breaker OPEN for {provider.value} after {health['failures']} failures")

    def record_success(self, provider: TTSProvider) -> None:
        self.provider_health[provider]["failures"] = 0

    async def synthesize_with_fallback(
        self,
        text: str,
        audio_callback,
    ) -> TTSProvider:
        """Try providers in priority order. Return which provider was used."""
        # synthesize_elevenlabs / synthesize_cartesia / synthesize_deepgram are
        # the per-provider streaming functions from earlier in this post.
        provider = self.get_active_provider()
        try:
            start = time.monotonic()
            if provider == TTSProvider.ELEVENLABS:
                await synthesize_elevenlabs(text, audio_callback)
            elif provider == TTSProvider.CARTESIA:
                await synthesize_cartesia(text, audio_callback)
            else:
                await synthesize_deepgram(text, audio_callback)
            latency_ms = (time.monotonic() - start) * 1000
            # Treat excessive latency as a soft failure (except on the last resort)
            if latency_ms > self.latency_threshold_ms and provider != TTSProvider.DEEPGRAM:
                print(f"[TTS] {provider.value} latency {latency_ms:.0f}ms exceeded threshold")
                self.record_failure(provider)
            else:
                self.record_success(provider)
            return provider
        except Exception as e:
            print(f"[TTS] {provider.value} failed: {e}")
            self.record_failure(provider)
            # Recursively try the next provider
            return await self.synthesize_with_fallback(text, audio_callback)
The circuit breaker pattern is important here. If ElevenLabs starts returning errors, you do not want every TTS request to fail and retry — you want to route around the outage immediately. Three failures open the circuit, and after 60 seconds you try the primary again (in case it was a transient spike).
Per-Minute Cost Breakdown
The question I get asked constantly: “what does this actually cost?” Here is the full breakdown for a 45-minute interview, assuming typical token volumes:
Assumptions:
- 45 minutes of audio
- STT: 45 minutes of transcription
- LLM: ~12k input tokens, ~3k output tokens per interview
- TTS: ~8,100 characters of synthesized speech (interviewer side, ~9 minutes of speech)
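Under these assumptions, every per-minute figure in the table is just the three per-interview component costs summed and divided by the interview length. A quick sanity-check helper (the function name is mine; dollar figures are the component costs from the table):

```python
def cost_per_minute(stt: float, llm: float, tts: float, minutes: float = 45.0) -> float:
    """Sum per-interview component costs (USD) and normalize to a per-minute rate."""
    return (stt + llm + tts) / minutes

# Budget stack: Deepgram Nova-3 + Gemini 2.0 Flash + Deepgram Aura-2
budget = cost_per_minute(0.35, 0.002, 0.12)
# Premium stack: Deepgram Nova-3 + GPT-4o + ElevenLabs Flash
premium = cost_per_minute(0.35, 0.17, 1.46)
```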
| STT | LLM | TTS | STT Cost | LLM Cost | TTS Cost | Total/min |
|---|---|---|---|---|---|---|
| Deepgram Nova-3 | Gemini 2.0 Flash | Deepgram Aura-2 | $0.35 | $0.002 | $0.12 | $0.0107/min |
| Deepgram Nova-3 | Gemini 2.0 Flash | Cartesia Sonic | $0.35 | $0.002 | $0.53 | $0.0196/min |
| Deepgram Nova-3 | Gemini 2.0 Flash | ElevenLabs Flash | $0.35 | $0.002 | $1.46 | $0.0403/min |
| Deepgram Nova-3 | GPT-4o | ElevenLabs Flash | $0.35 | $0.17 | $1.46 | $0.0442/min |
| AssemblyAI | GPT-4o | ElevenLabs Flash | $0.28 | $0.17 | $1.46 | $0.0424/min |
| Whisper API | GPT-4o | ElevenLabs Flash | $0.27 | $0.17 | $1.46 | $0.0422/min |
| Deepgram Nova-3 | Claude 3.5 Sonnet | ElevenLabs Flash | $0.35 | $0.24 | $1.46 | $0.0456/min |
| Google STT | Gemini 2.0 Flash | Azure Neural | $0.72 | $0.002 | $0.13 | $0.0189/min |
Key observations:
- TTS dominates cost in high-quality configurations. ElevenLabs costs more than STT + LLM combined for Flash setups
- Gemini Flash is almost free — $0.002 per interview is remarkable. The LLM is no longer the expensive part
- Minimum viable cost: Deepgram + Gemini + Deepgram Aura-2 at $0.0107/min for a 45-minute interview is $0.48 total. That is commercially viable at scale
- Premium configuration: Deepgram + GPT-4o + ElevenLabs at $0.0442/min for a 45-minute interview is $1.99 total — still inexpensive for a hiring workflow
At 1,000 interviews/month:
- Budget stack (Deepgram + Gemini Flash + Deepgram Aura-2): ~$480/month
- Premium stack (Deepgram + GPT-4o + ElevenLabs): ~$1,990/month
Neither of these is a significant cost compared to recruiter time.
My Recommended Default Stack
After building multiple interview systems in production, my default recommendation:
- STT: Deepgram Nova-3 with custom vocabulary for your domain
- LLM: Gemini 2.0 Flash for live conversation, Claude 3.5 Sonnet for async evaluation
- TTS: ElevenLabs Flash v2.5 with a custom cloned voice for brand consistency
- Fallback TTS: Cartesia Sonic (first fallback), Deepgram Aura-2 (second)
This configuration gives you:
- ~475ms total round-trip latency (p50)
- ~$0.04/min cost
- Highly natural voice quality that passes the “real interviewer” test in user research
- Resilient fallback chain for production reliability
If cost is the primary constraint, swap ElevenLabs for Cartesia (cuts TTS cost by 65%) or Deepgram Aura-2 (cuts by 92%) with some voice quality tradeoff.
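In code, the whole recommendation reduces to a prioritized config that the fallback manager can walk. A sketch, with made-up identifier strings rather than real SDK model IDs:

```python
from dataclasses import dataclass, field

@dataclass
class VoiceStackConfig:
    stt: str
    llm: str
    tts_chain: list[str] = field(default_factory=list)  # providers tried in order

# Default stack described above; swap tts_chain[0] for a cheaper entry if cost dominates.
DEFAULT_STACK = VoiceStackConfig(
    stt="deepgram/nova-3",
    llm="google/gemini-2.0-flash",
    tts_chain=[
        "elevenlabs/flash-v2.5",  # primary: most natural voice
        "cartesia/sonic",         # first fallback: large cost reduction
        "deepgram/aura-2",        # last resort: cheapest, always available
    ],
)
```

Keeping the chain as ordered data rather than hard-coded branches makes the cost/quality tradeoff a one-line config change instead of a code change.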
What is Coming in Part 5
We now have the full voice pipeline running. In Part 5, I will show how to build multi-role agent systems — separate AI personas for interviewer, coach, and evaluator — and how to coordinate them in real-time without the candidate experiencing awkward transitions or the agents stepping on each other.
The interviewer asks questions. The coach (invisible to the candidate) monitors confidence patterns and suggests adjustments to your approach. The evaluator scores each answer against the competency framework in real-time. Getting all three to work together without the candidate hearing the machinery — that is what Part 5 covers.
This is Part 4 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (this post)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
- Cost Optimization — From $0.14/min to $0.03/min (Part 11)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)