In Part 3, we chose our framework. LiveKit Agents won for production deployments, Pipecat for rapid prototyping, and direct WebSocket for ultra-low-latency custom builds. Now comes the part that actually makes or breaks your voice AI interview system: the individual components inside that pipeline.
STT, LLM, and TTS are not interchangeable commodities. The difference between Deepgram Nova-3 and a generic Whisper deployment is not just $0.003/minute — it is 80ms of latency that users feel as unnatural hesitation. The difference between ElevenLabs Flash v2.5 and a mediocre TTS voice is the difference between “this feels like a real interview” and “I am talking to a robot.”
In this post, I am going to walk through every component with real benchmarks, real production data, and the specific combinations that I have found work best for interview applications. I will also show you the streaming integration code that ties them all together, because reading about pipelines and actually building one are very different things.
The Component Landscape
Before diving in, let me lay out the decision space. You are choosing three components that need to work together within a latency budget:
- Total acceptable latency: under 800ms from end of user speech to start of AI voice response
- STT budget: 100-200ms
- LLM first token budget: 250-400ms
- TTS first audio budget: 75-150ms
- Pipeline overhead: 50-100ms
That adds up to roughly 475-850ms. The low end is achievable. The high end is where you start losing the “real conversation” feeling. Every component choice either tightens or blows this budget.
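It helps to keep these numbers executable so you can assert against them in monitoring. A minimal sketch using only the budget figures above (the stage names are mine, not from any framework):

```python
# Per-component latency budgets in ms, taken from the ranges above.
BUDGET_MS = {
    "stt": (100, 200),
    "llm_first_token": (250, 400),
    "tts_first_audio": (75, 150),
    "pipeline_overhead": (50, 100),
}

def total_budget_ms() -> tuple[int, int]:
    """Sum the best-case and worst-case budgets across all components."""
    low = sum(lo for lo, _ in BUDGET_MS.values())
    high = sum(hi for _, hi in BUDGET_MS.values())
    return low, high

def within_budget(measured_ms: dict[str, float], limit_ms: int = 800) -> bool:
    """Check a set of measured stage latencies against the total limit."""
    return sum(measured_ms.values()) <= limit_ms

low, high = total_budget_ms()
print(f"best case {low}ms, worst case {high}ms")  # best case 475ms, worst case 850ms
```

Wiring `within_budget` into per-turn telemetry gives you an alert the moment a provider regression blows the conversation budget.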
STT: Turning Interview Audio into Accurate Text
Speech-to-text is where most voice AI systems fail quietly. You never notice bad STT until a candidate says “Kubernetes” and your system hears “cube net ease” and the LLM starts answering a question about networking puzzles.
Interview audio has specific challenges:
- Technical vocabulary: AWS, Kubernetes, GraphQL, SOLID principles, CI/CD — words that general speech models see rarely
- Thinking pauses: candidates genuinely pause mid-sentence to collect thoughts. You need a model that does not rush to finalize the transcript
- Filler words: “um,” “uh,” “like,” “you know” — you want these for sentiment analysis but often want them cleaned up before the LLM sees them
- Accents: your candidate pool is global. Your STT needs to be too
Deepgram Nova-3
This is my go-to for production interview systems. The numbers:
| Metric | Deepgram Nova-3 |
|---|---|
| Latency (streaming) | ~150ms to first partial |
| WER (general) | 8.4% |
| WER (technical vocab) | 11.2% |
| Price | $0.0077/min ($0.46/hr) |
| Languages | 36 |
| Custom vocabulary | Yes (free, instant) |
The custom vocabulary feature is underrated. You can pass a list of domain-specific terms at connection time and Nova-3 will bias toward them. For a software engineering interview:
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

dg_client = DeepgramClient(api_key="YOUR_KEY")

options = LiveOptions(
    model="nova-3",
    language="en-US",
    smart_format=True,
    punctuate=True,
    diarize=False,        # single speaker in most interview scenarios
    filler_words=True,    # capture "um", "uh" for analysis
    keywords=[
        "Kubernetes:3",   # boost weight
        "GraphQL:3",
        "microservices:2",
        "SOLID:2",
        "CI/CD:2",
        "AWS:2",
        "PostgreSQL:2",
        "Redis:2",
    ],
    endpointing=300,      # ms of silence before considering an utterance complete
    interim_results=True,
)

async def stream_audio_to_deepgram(audio_stream, on_transcript):
    async with dg_client.listen.asynclive.v("1") as connection:

        async def on_message(self, result, **kwargs):
            transcript = result.channel.alternatives[0].transcript
            if result.is_final and transcript:
                await on_transcript(transcript)
            elif not result.is_final:
                # Stream partial results to reduce perceived latency
                await on_transcript(transcript, is_partial=True)

        connection.on(LiveTranscriptionEvents.Transcript, on_message)
        await connection.start(options)

        async for chunk in audio_stream:
            await connection.send(chunk)
The endpointing=300 setting is important for interviews. The default is 10ms which is fine for quick commands but terrible when someone pauses to think. I have found 250-350ms works well — long enough to not cut people off, short enough to not feel laggy.
OpenAI Whisper (via API or self-hosted)
Whisper is the accuracy leader for challenging audio — heavy accents, background noise, highly technical content. The tradeoff is latency and cost.
| Metric | Whisper API | Whisper Self-Hosted (A10G) |
|---|---|---|
| Latency | 500-2000ms (batch) | 200-400ms (streaming via faster-whisper) |
| WER (general) | 7.1% | 6.8% |
| WER (technical vocab) | 9.3% | 9.0% |
| Price | $0.006/min | ~$0.008/min (compute) |
| Streaming support | No (batch only via API) | Yes (with faster-whisper) |
The official Whisper API does not support streaming — you send audio chunks and get back completed transcripts, which is a deal-breaker for real-time conversation. If you want Whisper’s accuracy with streaming, you need to self-host using faster-whisper with CTranslate2 backend.
For most interview use cases, Deepgram Nova-3 is the better choice. Whisper makes sense if you have a specific accuracy requirement that Nova-3 cannot meet, or if you are dealing with highly accented speech in languages where Deepgram’s support is weaker.
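If you do go the self-hosted route, a minimal faster-whisper sketch looks like the following. The model size, device, and VAD settings are assumptions to tune for your hardware, and `join_segments` is a small helper of mine for assembling the transcript:

```python
from typing import Iterable

def join_segments(segments: Iterable) -> str:
    """Join faster-whisper segments into a single transcript string."""
    return " ".join(seg.text.strip() for seg in segments)

def transcribe_interview_audio(path: str) -> str:
    # Imported inside the function so the helper above works without the package.
    from faster_whisper import WhisperModel

    # large-v3 on GPU with float16 is a common accuracy/latency tradeoff;
    # drop to a smaller model or int8 on CPU if you lack an A10G-class GPU.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, info = model.transcribe(
        path,
        beam_size=5,
        vad_filter=True,  # skip silence, which interview audio has plenty of
    )
    return join_segments(segments)
```

Note that `transcribe` returns a lazy generator of segments, so the actual decoding happens as `join_segments` iterates.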
AssemblyAI
AssemblyAI sits at an interesting price point — $0.37/hr ($0.0062/min) — and offers features that go beyond pure transcription:
| Metric | AssemblyAI Nano | AssemblyAI Best |
|---|---|---|
| Latency (streaming) | ~180ms | ~250ms |
| WER | 10.2% | 8.8% |
| Price | $0.0062/min | $0.0140/min |
| Sentiment analysis | Yes | Yes |
| Speaker labels | Yes | Yes |
| Content moderation | Yes | Yes |
The sentiment analysis is genuinely useful for interview applications. You can detect when a candidate is frustrated, confident, or uncertain without post-processing. The “LeMUR” feature lets you ask LLM-style questions about the transcript in real-time, though at additional cost.
The downside is that AssemblyAI’s streaming implementation has been less reliable in my experience — occasional connection drops and higher tail latency compared to Deepgram.
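As a sketch of how the sentiment feature is consumed (shown with the batch SDK for clarity; the streaming setup differs, and the tally helper is mine):

```python
from collections import Counter

def summarize_sentiment(results) -> Counter:
    """Tally sentiment labels (POSITIVE/NEUTRAL/NEGATIVE) across utterances."""
    return Counter(str(r.sentiment) for r in results)

def analyze_interview_recording(audio_url: str) -> Counter:
    # Imported inside the function so summarize_sentiment works without the SDK.
    import assemblyai as aai

    aai.settings.api_key = "YOUR_KEY"
    config = aai.TranscriptionConfig(sentiment_analysis=True)
    transcript = aai.Transcriber().transcribe(audio_url, config)
    # Each result carries the utterance text, a sentiment label, and timestamps
    return summarize_sentiment(transcript.sentiment_analysis)
```

A skew toward NEGATIVE in the second half of an interview is the kind of signal you would otherwise need a post-processing LLM pass to extract.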
Google Cloud Speech-to-Text v2
Google’s STT is often overlooked but handles certain edge cases well, particularly phone-quality audio and strong non-native accents:
| Metric | Google STT v2 (Chirp) |
|---|---|
| Latency (streaming) | ~200ms |
| WER | 8.9% |
| Price | $0.016/min (streaming) |
| Languages | 125+ |
The pricing is nearly double Deepgram and the latency is slightly higher, making it hard to recommend as a default choice. Where it shines: language diversity. If you are building for markets where Deepgram has limited language support, Google is the pragmatic choice.
STT Decision Matrix
| Use case | Recommended STT | Why |
|---|---|---|
| English, cost-sensitive | Deepgram Nova-3 | Best price/perf ratio |
| Maximum accuracy, English | OpenAI Whisper (self-hosted) | Best WER on challenging audio |
| Sentiment analysis built-in | AssemblyAI | Features > marginal accuracy loss |
| 100+ language support | Google Cloud STT | Broadest language coverage |
| Low latency + accuracy balance | Deepgram Nova-3 | 150ms is hard to beat |
LLM: The Brain of Your Interview Agent
The LLM is where your interview agent actually thinks — understanding context, formulating follow-up questions, evaluating answers, staying in character as an interviewer. The requirements for interview applications are specific:
- Context window: A 45-minute interview at average speaking pace (~150 words/min) generates ~6,750 words. Plus your system prompt (500-1000 tokens) and structured interview guide (500-1500 tokens). You need at least 16k tokens, ideally 32k+
- First token latency: under 400ms to keep total pipeline latency reasonable
- Instruction following: the LLM must stay in the interviewer persona, not ramble, not go off-script
- Function calling: for triggering scoring, updating candidate records, fetching competency definitions mid-interview
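The context arithmetic above is worth encoding as a quick estimator. The tokens-per-word ratio here is a rough heuristic for English, not a real tokenizer:

```python
TOKENS_PER_WORD = 1.33  # rough heuristic for English prose

def interview_context_tokens(
    minutes: int = 45,
    words_per_minute: int = 150,
    system_prompt_tokens: int = 1000,
    interview_guide_tokens: int = 1500,
) -> int:
    """Estimate total context tokens for a full interview transcript."""
    transcript_words = minutes * words_per_minute
    transcript_tokens = int(transcript_words * TOKENS_PER_WORD)
    return transcript_tokens + system_prompt_tokens + interview_guide_tokens

print(interview_context_tokens())  # roughly 11.5k tokens: inside 16k, but 32k leaves headroom
```

Run it with your own prompt sizes before picking a model; a long rubric or multi-page interview guide can push a 16k window uncomfortably close to its limit.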
Gemini 2.0 Flash
This is the speed champion right now and it is not particularly close:
| Metric | Gemini 2.0 Flash |
|---|---|
| First token latency | ~250-350ms |
| Output tokens/sec | 150-200 |
| Context window | 1M tokens |
| Price (input) | $0.075/1M tokens |
| Price (output) | $0.30/1M tokens |
| Function calling | Yes |
The 1M token context window is genuinely useful for long interview sessions where you want to keep the entire conversation history. The price is also excellent — a typical 45-minute interview with ~8k tokens in and ~2k tokens out costs roughly $0.0012 in LLM costs alone.
The tradeoff is reasoning depth. For behavioral interview questions where the AI needs to probe deeply and recognize complex answer patterns, Gemini Flash occasionally gives shallower follow-ups than GPT-4o or Claude. For structured, formulaic interviews (screening calls, technical trivia), Flash is excellent.
from typing import AsyncIterator

import google.generativeai as genai
from google.generativeai.types import GenerationConfig

genai.configure(api_key="YOUR_KEY")

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction="""You are Alex, a senior technical interviewer at Acme Corp.
You are conducting a 45-minute software engineering interview.
Your job is to assess the candidate's technical depth, problem-solving approach,
and communication skills. Ask one question at a time. When the candidate finishes
answering, either ask a follow-up to probe deeper or move to the next topic.
Keep responses concise — 2-3 sentences maximum for interview questions.
Do not explain your reasoning or evaluation to the candidate.""",
    generation_config=GenerationConfig(
        max_output_tokens=150,  # keep responses short for voice
        temperature=0.7,
    ),
)

async def get_interviewer_response(
    conversation_history: list[dict],
) -> AsyncIterator[str]:
    # Stream the response for lower perceived latency
    async for chunk in await model.generate_content_async(
        conversation_history,
        stream=True,
    ):
        if chunk.text:
            yield chunk.text  # yield tokens as they arrive for TTS streaming
GPT-4o
GPT-4o is the quality benchmark for complex interview scenarios:
| Metric | GPT-4o |
|---|---|
| First token latency | ~400-600ms |
| Output tokens/sec | 60-80 |
| Context window | 128k tokens |
| Price (input) | $2.50/1M tokens |
| Price (output) | $10.00/1M tokens |
| Function calling | Yes (best-in-class) |
GPT-4o’s function calling is noticeably more reliable than alternatives — when you define tools for “mark_answer_complete,” “request_code_example,” or “trigger_evaluation,” GPT-4o calls them at the right moments without hallucinating extra tool calls or forgetting to call them.
The latency is the real issue. At 400-600ms first token, you are right at the edge of the total 800ms budget. You have almost nothing left for STT (150ms) and TTS (75ms). In practice, GPT-4o pushes total round-trip latency to 900-1200ms, which users notice.
My compromise: use GPT-4o for the evaluation and scoring agents (which run asynchronously after each answer) but use Gemini Flash for the live conversation agent.
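That split can be expressed as a fire-and-forget evaluation queue: the live loop submits each answer and never awaits the scorer. This is a sketch of the pattern, with `evaluate` as a stand-in for your GPT-4o scoring call:

```python
import asyncio
from typing import Awaitable, Callable

class AsyncEvaluator:
    """Run per-answer evaluations without blocking the live conversation."""

    def __init__(self, evaluate: Callable[[str, str], Awaitable[dict]]):
        self.evaluate = evaluate      # e.g. a GPT-4o scoring call
        self.results: list[dict] = []
        self._tasks: set[asyncio.Task] = set()

    def submit(self, question: str, answer: str) -> None:
        # Fire-and-forget: the live Gemini loop continues immediately.
        task = asyncio.create_task(self._run(question, answer))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _run(self, question: str, answer: str) -> None:
        self.results.append(await self.evaluate(question, answer))

    async def drain(self) -> list[dict]:
        """Wait for any outstanding evaluations (call at interview end)."""
        if self._tasks:
            await asyncio.gather(*self._tasks)
        return self.results
```

Keeping strong references to the tasks in `_tasks` matters: a bare `create_task` result can be garbage-collected mid-flight.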
Claude 3.5 Sonnet / Claude 3.7
Claude’s strength in interview applications is instruction-following precision and staying in character:
| Metric | Claude 3.5 Sonnet |
|---|---|
| First token latency | ~500-700ms |
| Output tokens/sec | 80-100 |
| Context window | 200k tokens |
| Price (input) | $3.00/1M tokens |
| Price (output) | $15.00/1M tokens |
| Function calling | Yes |
Claude tends to give the most “human-sounding” interview responses — natural transitions, appropriate acknowledgment of answers, good follow-up question construction. The latency knocks it out of first-choice status for live conversation, but it is excellent for the post-interview feedback generation (which can be async).
Claude 3.7 Sonnet (with extended thinking) is worth testing for technical evaluation — its reasoning depth on “did this candidate actually explain the SOLID principles correctly” type questions is noticeably better.
Grok
Grok (via xAI API) is the dark horse for cost-optimized builds:
| Metric | Grok-2 |
|---|---|
| First token latency | ~300-450ms |
| Output tokens/sec | 80-120 |
| Context window | 131k tokens |
| Price (input) | $2.00/1M tokens |
| Price (output) | $10.00/1M tokens |
At roughly $0.05/min all-in for typical interview token volumes, Grok is competitive. The quality is solid for structured interviews, though I have found it occasionally goes off-script in complex roleplay scenarios. If you are building a cost-optimized product and willing to do extra prompt engineering to keep the persona stable, Grok deserves a look.
LLM Decision Matrix
| Use case | Recommended LLM | Why |
|---|---|---|
| Live conversation, speed priority | Gemini 2.0 Flash | 250-350ms TTFT, cheap |
| Complex behavioral interviews | GPT-4o | Best instruction following |
| Post-interview evaluation | Claude 3.5 Sonnet | Best reasoning quality |
| Cost-optimized deployment | Gemini 2.0 Flash | Best cost/quality for voice |
| Function-heavy workflows | GPT-4o | Most reliable tool calling |
TTS: Giving Your AI Interviewer a Voice
TTS is what candidates actually hear. A technically perfect STT and LLM pipeline is worthless if the voice sounds like a 2015 IVR system. For interviews, you need:
- Naturalness: prosody, pacing, and intonation that sound like a real person
- Low TTFB (time to first byte): under 100ms to start audio immediately after LLM generates tokens
- Streaming support: do not wait for complete text before starting audio
- Professional tone: warm but authoritative — an interviewer, not a customer service bot
ElevenLabs Flash v2.5
ElevenLabs is the naturalness leader. Flash v2.5 specifically optimizes for low latency:
| Metric | ElevenLabs Flash v2.5 |
|---|---|
| TTFB | ~75ms |
| Naturalness (MOS score) | 4.3/5.0 |
| Languages | 70+ |
| Price | $0.18/1k characters (~$0.027/min) |
| Voice cloning | Yes (instant clone from 30s sample) |
| Streaming | Yes (WebSocket) |
The voices — particularly “Adam,” “Antoni,” and “Callum” for male voices, “Rachel” and “Bella” for female — are genuinely convincing. In user testing, participants regularly described the interview as feeling “like talking to a real person” when using ElevenLabs.
import asyncio
from typing import AsyncIterator

import httpx

ELEVENLABS_API_KEY = "YOUR_KEY"
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # Rachel - professional, clear

async def stream_tts_elevenlabs(
    text_stream: AsyncIterator[str],
    output_audio_queue: asyncio.Queue,
) -> None:
    """Stream text tokens to ElevenLabs and push audio chunks to queue."""
    async with httpx.AsyncClient() as client:
        # ElevenLabs streaming endpoint
        async with client.stream(
            "POST",
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
            headers={
                "xi-api-key": ELEVENLABS_API_KEY,
                "Content-Type": "application/json",
            },
            json={
                "model_id": "eleven_flash_v2_5",
                "voice_settings": {
                    "stability": 0.5,
                    "similarity_boost": 0.8,
                    "style": 0.2,  # subtle style, professional
                    "use_speaker_boost": True,
                },
                "text": await collect_stream(text_stream),
                "output_format": "pcm_24000",  # 24kHz PCM for quality
            },
        ) as response:
            async for chunk in response.aiter_bytes(chunk_size=4096):
                await output_audio_queue.put(chunk)
    await output_audio_queue.put(None)  # signal end of audio

async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Collect the full streaming text before synthesis.

    Passing the complete text gives the best prosody; sentence-boundary
    buffering (shown in the pipeline section) trades some prosody for latency.
    """
    buffer = ""
    async for token in stream:
        buffer += token
    return buffer
The pricing at $0.18/1k characters works out to roughly $0.16 per minute of synthesized speech (~150 words/min is around 900 characters). Amortized across a full interview, where the interviewer speaks only a fraction of the time, that is about $0.03 per interview minute. It is the most expensive TTS option but delivers the most natural output.
Cartesia Sonic
Cartesia is the latency-first choice:
| Metric | Cartesia Sonic |
|---|---|
| TTFB | ~50-65ms |
| Naturalness (MOS score) | 4.0/5.0 |
| Languages | 15+ |
| Price | $0.065/1k characters (~$0.010/min) |
| Voice cloning | Yes |
| Streaming | Yes (WebSocket) |
Cartesia Sonic-2 (their latest model) is impressive for latency. At 50-65ms TTFB, it shaves 10-25ms off ElevenLabs, which matters in highly latency-sensitive deployments. The voice quality is good — not quite ElevenLabs, but noticeably better than generic TTS.
At $0.065/1k characters, Cartesia is 3x cheaper than ElevenLabs. If your user research shows that the quality delta does not affect conversion or satisfaction metrics, Cartesia is the smart cost choice.
import base64
import json

import websockets

async def stream_tts_cartesia(
    text: str,
    audio_callback,
    voice_id: str = "a0e99841-438c-4a64-b679-ae501e7d6091",  # professional voice
) -> None:
    """Stream TTS from Cartesia via WebSocket."""
    async with websockets.connect(
        "wss://api.cartesia.ai/tts/websocket?api_key=YOUR_KEY&cartesia_version=2024-06-10"
    ) as ws:
        await ws.send(json.dumps({
            "model_id": "sonic-2",
            "transcript": text,
            "voice": {
                "mode": "id",
                "id": voice_id,
            },
            "output_format": {
                "container": "raw",
                "encoding": "pcm_f32le",
                "sample_rate": 24000,
            },
            "context_id": "interview-session",
        }))
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "chunk":
                audio_bytes = base64.b64decode(data["data"])
                await audio_callback(audio_bytes)
            elif data.get("type") == "done":
                break
Deepgram Aura-2
Deepgram’s TTS is underrated, especially if you are already using Deepgram for STT (single SDK, single bill, simpler architecture):
| Metric | Deepgram Aura-2 |
|---|---|
| TTFB | ~90ms |
| Naturalness (MOS score) | 3.8/5.0 |
| Languages | 2 (English, Spanish) |
| Price | $0.015/1k characters (~$0.0023/min) |
| Voice cloning | No |
| Streaming | Yes |
The price is remarkable — $0.015/1k characters versus ElevenLabs’ $0.18/1k. That is a 12x cost difference. The voice quality is noticeably more “TTS-like” — it lacks the naturalness of ElevenLabs or Cartesia, but for an internal recruiting tool where cost matters more than the slight uncanny valley effect, Aura-2 is worth serious consideration.
PlayHT
PlayHT’s 2.0 Turbo model is a solid mid-tier option:
| Metric | PlayHT 2.0 Turbo |
|---|---|
| TTFB | ~95ms |
| Naturalness (MOS score) | 3.9/5.0 |
| Languages | 142 |
| Price | $0.022/1k characters |
| Voice cloning | Yes (Instant Clone) |
| Streaming | Yes |
PlayHT’s language support (142 languages) is a differentiator if you are building for global markets. The quality sits between Deepgram Aura-2 and Cartesia.
Azure Neural TTS
Azure’s neural TTS is the enterprise default for a reason:
| Metric | Azure Neural TTS |
|---|---|
| TTFB | ~100-150ms |
| Naturalness (MOS score) | 3.8/5.0 |
| Languages | 54 locales, 129 voices |
| Price | $0.016/1k characters |
| Voice cloning | Yes (Custom Neural Voice) |
| Streaming | Yes (SSML) |
129 voices across 54 locales makes Azure the choice for truly global deployments. The Custom Neural Voice feature lets you clone a specific voice for brand consistency. Enterprise compliance (SOC 2, ISO 27001, HIPAA BAA) is built-in, which matters for recruiting applications that handle sensitive candidate data.
TTS Decision Matrix
| Use case | Recommended TTS | Why |
|---|---|---|
| Maximum naturalness | ElevenLabs Flash v2.5 | Best MOS score, 70+ languages |
| Minimum latency | Cartesia Sonic | 50-65ms TTFB |
| Minimum cost | Deepgram Aura-2 | $0.015/1k chars |
| Enterprise/compliance | Azure Neural TTS | HIPAA BAA, 129 voices |
| Global language coverage | PlayHT or Azure | 100+ languages |
| Deepgram stack consistency | Deepgram Aura-2 | Single SDK |
Voice Cloning and Custom Voices
For interview applications used by a single company, voice consistency matters. “Our AI interviewer is Alex” — not a random ElevenLabs preset. Every candidate should meet the same Alex.
Creating a Custom Interview Voice
ElevenLabs Instant Clone requires just 30 seconds of clean audio. For a production interview voice:
- Source audio: record a real person (or use a professional voice actor) reading 5-10 minutes of diverse content — questions, statements, numbers, technical terms
- Clean the audio: normalize to -16 LUFS, remove background noise, ensure consistent microphone distance
- Upload and clone: ElevenLabs will create the clone in 10-30 seconds
- Test extensively: run it through your actual interview script — pay attention to technical terms, question intonation, and sentence-ending cadence
import httpx

async def create_voice_clone(
    audio_file_path: str,
    voice_name: str,
    description: str,
) -> str:
    """Create an ElevenLabs voice clone. Returns the voice ID."""
    async with httpx.AsyncClient() as client:
        with open(audio_file_path, "rb") as f:
            response = await client.post(
                "https://api.elevenlabs.io/v1/voices/add",
                headers={"xi-api-key": "YOUR_KEY"},
                data={
                    "name": voice_name,
                    "description": description,
                    "labels": '{"accent": "American", "use_case": "interview"}',
                },
                files={"files": (audio_file_path, f, "audio/mpeg")},
            )
    result = response.json()
    return result["voice_id"]
For Cartesia, the process is similar — upload audio clips and get a custom voice ID back. Cartesia’s voice cloning tends to be slightly more consistent for professional tones, worth testing alongside ElevenLabs.
Audio Quality: Sample Rates and Codecs
Audio pipeline quality is invisible when it is right and obvious when it is wrong. These are the settings that have proven out in production:
Sample Rate Selection
| Stage | Recommended Rate | Why |
|---|---|---|
| STT input | 16kHz | STT models are trained at 16kHz; upsampling wastes bandwidth |
| TTS output | 24kHz | Captures full voice frequency range (20Hz-12kHz) without overhead |
| Recording/archival | 48kHz | Standard for professional audio, editing headroom |
| WebRTC transport | 48kHz | WebRTC standard; browser handles downsampling |
The mismatch trap: WebRTC natively operates at 48kHz. Deepgram works best at 16kHz. You need a resampler in your pipeline. Most frameworks (LiveKit, Pipecat) handle this, but if you are building directly, you need to explicitly resample:
import numpy as np
from scipy import signal

def resample_audio(audio_data: bytes, from_rate: int, to_rate: int) -> bytes:
    """Resample 16-bit PCM audio data between sample rates."""
    if from_rate == to_rate:
        return audio_data
    # Convert bytes to a float array (16-bit PCM input)
    samples = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32)
    # Calculate the target sample count from the rate ratio
    num_samples = int(len(samples) * to_rate / from_rate)
    # FFT-based resampling
    resampled = signal.resample(samples, num_samples)
    # Convert back to 16-bit PCM
    return resampled.astype(np.int16).tobytes()
Codec Choices
| Context | Codec | Why |
|---|---|---|
| WebRTC transport | Opus | Adaptive bitrate, handles packet loss gracefully |
| STT streaming | PCM (raw) | Zero decoding overhead, direct to STT |
| TTS streaming | PCM_24000 | Direct playback without decode step |
| Recording storage | AAC (128kbps) | 60-70% smaller than PCM, perceptually lossless for voice |
| Archival | FLAC | Lossless, compressed, good for long-term storage |
The practical pipeline: WebRTC audio arrives as Opus, you decode to PCM for STT, you request PCM from TTS, you encode to AAC for storage. This sounds complex but the codec operations are cheap — on a modern server, encoding/decoding adds under 5ms to your pipeline.
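For the storage leg, shelling out to ffmpeg is the pragmatic approach, assuming `ffmpeg` is on the server's PATH. The command builder is split out so it can be inspected and tested separately:

```python
import subprocess

def build_aac_encode_cmd(pcm_path: str, out_path: str,
                         sample_rate: int = 24000) -> list[str]:
    """ffmpeg command: raw 16-bit PCM in, 128kbps AAC out."""
    return [
        "ffmpeg", "-y",
        "-f", "s16le",             # raw signed 16-bit little-endian PCM
        "-ar", str(sample_rate),   # input sample rate
        "-ac", "1",                # mono voice audio
        "-i", pcm_path,
        "-c:a", "aac", "-b:a", "128k",
        out_path,
    ]

def encode_recording(pcm_path: str, out_path: str) -> None:
    """Encode a raw PCM recording to AAC for storage."""
    subprocess.run(build_aac_encode_cmd(pcm_path, out_path), check=True)
```

Because raw PCM carries no header, the `-f`, `-ar`, and `-ac` input flags are mandatory; get the sample rate wrong and the recording plays back pitch-shifted.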
Streaming Integration: The Full Pipeline
This is where the rubber meets the road. Streaming integration means you are not waiting for complete text before starting TTS — you are piping tokens from LLM to TTS as they arrive. The latency difference is dramatic: a buffered pipeline adds 500-1000ms of dead air, while a streaming pipeline adds almost nothing beyond the first sentence's synthesis time.
Here is the complete streaming pipeline that I use in production, written for Deepgram STT + Gemini Flash + ElevenLabs TTS:
import asyncio
from typing import AsyncIterator

import google.generativeai as genai
import httpx
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

# Configuration
DEEPGRAM_KEY = "YOUR_DEEPGRAM_KEY"
GEMINI_KEY = "YOUR_GEMINI_KEY"
ELEVENLABS_KEY = "YOUR_ELEVENLABS_KEY"
VOICE_ID = "YOUR_VOICE_ID"

genai.configure(api_key=GEMINI_KEY)

class VoiceInterviewPipeline:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.conversation_history = []
        self.audio_output_queue = asyncio.Queue()
        self.current_utterance = ""
        # Initialize the Gemini model
        self.model = genai.GenerativeModel(
            model_name="gemini-2.0-flash",
            system_instruction=system_prompt,
        )
        self.chat = self.model.start_chat(history=[])

    async def handle_stt_transcript(self, transcript: str, is_final: bool) -> None:
        """Called when Deepgram returns a transcript segment."""
        if not is_final:
            # Partial results — show in UI but don't process yet
            self.current_utterance = transcript
            return
        # Final transcript — send to LLM and pipe the response to TTS
        self.current_utterance = ""
        self.conversation_history.append({
            "role": "user",
            "parts": [transcript],
        })
        # Start the LLM + TTS pipeline (non-blocking)
        asyncio.create_task(self._llm_to_tts_pipeline(transcript))

    async def _llm_to_tts_pipeline(self, user_text: str) -> None:
        """
        Stream LLM tokens directly to TTS.
        Sentence-boundary buffering ensures natural TTS output
        while keeping latency low.
        """
        sentence_buffer = ""
        full_response = ""
        # Stream from Gemini
        async for chunk in await self.chat.send_message_async(
            user_text, stream=True
        ):
            if not chunk.text:
                continue
            sentence_buffer += chunk.text
            full_response += chunk.text
            # Send to TTS at sentence boundaries (natural pause points).
            # Awaiting sequentially keeps audio chunks in order in the queue.
            if any(sentence_buffer.rstrip().endswith(p) for p in [".", "?", "!", "\n"]):
                sentence_to_speak = sentence_buffer.strip()
                if len(sentence_to_speak) > 10:  # avoid tiny fragments
                    await self._send_to_tts(sentence_to_speak)
                    # Short fragments stay buffered and join the next sentence
                    sentence_buffer = ""
        # Send any remaining text
        if sentence_buffer.strip():
            await self._send_to_tts(sentence_buffer.strip())
        # Add the AI response to the conversation history
        self.conversation_history.append({
            "role": "model",
            "parts": [full_response],
        })

    async def _send_to_tts(self, text: str) -> None:
        """Send text to ElevenLabs TTS and push audio to the output queue."""
        async with httpx.AsyncClient(timeout=10.0) as client:
            async with client.stream(
                "POST",
                f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
                headers={
                    "xi-api-key": ELEVENLABS_KEY,
                    "Content-Type": "application/json",
                },
                json={
                    "text": text,
                    "model_id": "eleven_flash_v2_5",
                    "output_format": "pcm_24000",
                    "voice_settings": {
                        "stability": 0.5,
                        "similarity_boost": 0.8,
                    },
                },
            ) as response:
                async for audio_chunk in response.aiter_bytes(chunk_size=4096):
                    await self.audio_output_queue.put(audio_chunk)

    async def start_stt_listener(self, audio_input_stream: AsyncIterator[bytes]) -> None:
        """Connect to Deepgram and stream audio input."""
        dg_client = DeepgramClient(DEEPGRAM_KEY)
        options = LiveOptions(
            model="nova-3",
            language="en-US",
            smart_format=True,
            punctuate=True,
            filler_words=True,
            endpointing=300,
            interim_results=True,
            keywords=["Kubernetes:3", "AWS:2", "Python:2", "GraphQL:3"],
        )
        async with dg_client.listen.asynclive.v("1") as connection:

            async def on_transcript(self_dg, result, **kwargs):
                transcript = result.channel.alternatives[0].transcript
                if transcript:
                    await self.handle_stt_transcript(
                        transcript,
                        is_final=result.is_final,
                    )

            connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
            await connection.start(options)
            async for audio_chunk in audio_input_stream:
                await connection.send(audio_chunk)

    async def get_audio_output(self) -> AsyncIterator[bytes]:
        """Yield audio chunks from the output queue for playback."""
        while True:
            chunk = await self.audio_output_queue.get()
            if chunk is None:
                break
            yield chunk

# Usage
async def run_interview(audio_input_stream: AsyncIterator[bytes]):
    pipeline = VoiceInterviewPipeline(
        system_prompt="""You are Morgan, a senior engineering interviewer at TechCorp.
Conduct a 45-minute technical interview. Ask one question at a time.
Keep all responses under 3 sentences."""
    )
    # Start the STT listener; audio output is consumed by the WebRTC sender
    await asyncio.gather(
        pipeline.start_stt_listener(audio_input_stream),
    )
The key insight in this pipeline is sentence-boundary buffering. Sending tokens one-by-one to TTS would produce terrible prosody — TTS models need semantic context to generate natural intonation. But waiting for the complete LLM response adds 500ms+. Sentence-boundary buffering is the sweet spot: you send one sentence at a time, TTS starts generating audio within 75-100ms of the first sentence completing, and the user hears the first sentence while subsequent sentences are being synthesized.
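The buffering logic distills down to a small generator you can unit-test in isolation. This is a standalone sketch of the same idea, with the fragment threshold as a tunable constant:

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "?", "!", "\n")
MIN_FRAGMENT_CHARS = 10  # don't send tiny fragments to TTS

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS) and \
                len(buffer.strip()) >= MIN_FRAGMENT_CHARS:
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush whatever remains at stream end
        yield buffer.strip()
```

Because it is a plain generator over strings, you can test it against awkward token splits (abbreviations, decimals, trailing fragments) without touching any LLM or TTS API.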
Fallback Chains: When Your Primary Provider Has a Bad Day
Any production voice system needs fallback logic. ElevenLabs has occasional latency spikes (250ms TTFB instead of 75ms) and rare outages. Building a fallback chain is not optional.
Here is the provider priority for my interview systems:
- Primary: ElevenLabs Flash v2.5 (best naturalness)
- Fallback 1: Cartesia Sonic (slightly lower naturalness, similar latency)
- Fallback 2: Deepgram Aura-2 (noticeably less natural, but ultra-low cost and reliable)
import time
from enum import Enum

class TTSProvider(Enum):
    ELEVENLABS = "elevenlabs"
    CARTESIA = "cartesia"
    DEEPGRAM = "deepgram"

class TTSFallbackManager:
    def __init__(self):
        self.provider_health = {
            TTSProvider.ELEVENLABS: {"failures": 0, "last_failure": 0, "circuit_open": False},
            TTSProvider.CARTESIA: {"failures": 0, "last_failure": 0, "circuit_open": False},
            TTSProvider.DEEPGRAM: {"failures": 0, "last_failure": 0, "circuit_open": False},
        }
        self.priority = [TTSProvider.ELEVENLABS, TTSProvider.CARTESIA, TTSProvider.DEEPGRAM]
        self.latency_threshold_ms = 200   # switch to fallback if TTFB exceeds this
        self.circuit_reset_seconds = 60   # try the primary again after 60s

    def get_active_provider(self) -> TTSProvider:
        """Return the highest-priority healthy provider."""
        now = time.time()
        for provider in self.priority:
            health = self.provider_health[provider]
            # Reset the circuit breaker after the cooldown period
            if health["circuit_open"]:
                if now - health["last_failure"] > self.circuit_reset_seconds:
                    health["circuit_open"] = False
                    health["failures"] = 0
                    print(f"[TTS] Circuit breaker reset for {provider.value}")
                else:
                    continue
            return provider
        # All providers unhealthy — use Deepgram as a last resort
        return TTSProvider.DEEPGRAM

    def record_failure(self, provider: TTSProvider) -> None:
        health = self.provider_health[provider]
        health["failures"] += 1
        health["last_failure"] = time.time()
        if health["failures"] >= 3:
            health["circuit_open"] = True
            print(f"[TTS] Circuit breaker OPEN for {provider.value} after {health['failures']} failures")

    def record_success(self, provider: TTSProvider) -> None:
        self.provider_health[provider]["failures"] = 0

    async def synthesize_with_fallback(
        self,
        text: str,
        audio_callback,
    ) -> TTSProvider:
        """Try providers in priority order. Return which provider was used."""
        # synthesize_elevenlabs / synthesize_cartesia / synthesize_deepgram are
        # the per-provider streaming functions from earlier in this post.
        provider = self.get_active_provider()
        try:
            start = time.monotonic()
            if provider == TTSProvider.ELEVENLABS:
                await synthesize_elevenlabs(text, audio_callback)
            elif provider == TTSProvider.CARTESIA:
                await synthesize_cartesia(text, audio_callback)
            else:
                await synthesize_deepgram(text, audio_callback)
            latency_ms = (time.monotonic() - start) * 1000
            # Treat excessive latency as a soft failure (except on the last resort)
            if latency_ms > self.latency_threshold_ms and provider != TTSProvider.DEEPGRAM:
                print(f"[TTS] {provider.value} latency {latency_ms:.0f}ms exceeded threshold")
                self.record_failure(provider)
            else:
                self.record_success(provider)
            return provider
        except Exception as e:
            print(f"[TTS] {provider.value} failed: {e}")
            self.record_failure(provider)
            # Recursively try the next provider
            return await self.synthesize_with_fallback(text, audio_callback)
The circuit breaker pattern is important here. If ElevenLabs starts returning errors, you do not want every TTS request to fail and retry — you want to route around the outage immediately. Three failures open the circuit, and after 60 seconds you try the primary again (in case it was a transient spike).
Per-Minute Cost Breakdown
The question I get asked constantly: “what does this actually cost?” Here is the full breakdown for a 45-minute interview, assuming typical token volumes:
Assumptions:
- 45 minutes of audio
- STT: 45 minutes of transcription
- LLM: ~12k input tokens, ~3k output tokens per interview
- TTS: ~8,100 characters of synthesized speech (interviewer side, ~9 minutes of speech)
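Under these assumptions, every per-minute figure in the table is just the three per-interview component costs summed and divided by the interview length. A quick sanity-check helper (the function name is mine; dollar figures are the component costs from the table):

```python
def cost_per_minute(stt: float, llm: float, tts: float, minutes: float = 45.0) -> float:
    """Sum per-interview component costs (USD) and normalize to a per-minute rate."""
    return (stt + llm + tts) / minutes

# Budget stack: Deepgram Nova-3 + Gemini 2.0 Flash + Deepgram Aura-2
budget = cost_per_minute(0.35, 0.002, 0.12)
# Premium stack: Deepgram Nova-3 + GPT-4o + ElevenLabs Flash
premium = cost_per_minute(0.35, 0.17, 1.46)
```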
| STT | LLM | TTS | STT Cost | LLM Cost | TTS Cost | Total/min |
|---|---|---|---|---|---|---|
| Deepgram Nova-3 | Gemini 2.0 Flash | Deepgram Aura-2 | $0.35 | $0.002 | $0.12 | $0.0107/min |
| Deepgram Nova-3 | Gemini 2.0 Flash | Cartesia Sonic | $0.35 | $0.002 | $0.53 | $0.0196/min |
| Deepgram Nova-3 | Gemini 2.0 Flash | ElevenLabs Flash | $0.35 | $0.002 | $1.46 | $0.0403/min |
| Deepgram Nova-3 | GPT-4o | ElevenLabs Flash | $0.35 | $0.17 | $1.46 | $0.0442/min |
| AssemblyAI | GPT-4o | ElevenLabs Flash | $0.28 | $0.17 | $1.46 | $0.0424/min |
| Whisper API | GPT-4o | ElevenLabs Flash | $0.27 | $0.17 | $1.46 | $0.0422/min |
| Deepgram Nova-3 | Claude 3.5 Sonnet | ElevenLabs Flash | $0.35 | $0.24 | $1.46 | $0.0456/min |
| Google STT | Gemini 2.0 Flash | Azure Neural | $0.72 | $0.002 | $0.13 | $0.0189/min |
Key observations:
- TTS dominates cost in high-quality configurations. ElevenLabs costs more than STT + LLM combined for Flash setups
- Gemini Flash is almost free — $0.002 per interview is remarkable. The LLM is no longer the expensive part
- Minimum viable cost: Deepgram + Gemini + Deepgram Aura-2 at $0.0107/min for a 45-minute interview is $0.48 total. That is commercially viable at scale
- Premium configuration: Deepgram + GPT-4o + ElevenLabs at $0.0442/min for a 45-minute interview is $1.99 total — still inexpensive for a hiring workflow
At 1,000 interviews/month:
- Budget stack (Deepgram + Gemini Flash + Deepgram Aura-2): ~$480/month
- Premium stack (Deepgram + GPT-4o + ElevenLabs): ~$1,990/month
Neither of these is a significant cost compared to recruiter time.
My Recommended Default Stack
After building multiple interview systems in production, my default recommendation:
- STT: Deepgram Nova-3 with custom vocabulary for your domain
- LLM: Gemini 2.0 Flash for live conversation, Claude 3.5 Sonnet for async evaluation
- TTS: ElevenLabs Flash v2.5 with a custom cloned voice for brand consistency
- Fallback TTS: Cartesia Sonic (first fallback), Deepgram Aura-2 (second)
This configuration gives you:
- ~475ms total round-trip latency (p50)
- ~$0.04/min cost
- Highly natural voice quality that passes the “real interviewer” test in user research
- Resilient fallback chain for production reliability
If cost is the primary constraint, swap ElevenLabs for Cartesia (cuts TTS cost by 65%) or Deepgram Aura-2 (cuts by 92%) with some voice quality tradeoff.
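In code, the whole recommendation reduces to a prioritized config that the fallback manager can walk. A sketch, with made-up identifier strings rather than real SDK model IDs:

```python
from dataclasses import dataclass, field

@dataclass
class VoiceStackConfig:
    stt: str
    llm: str
    tts_chain: list[str] = field(default_factory=list)  # providers tried in order

# Default stack described above; swap tts_chain[0] for a cheaper entry if cost dominates.
DEFAULT_STACK = VoiceStackConfig(
    stt="deepgram/nova-3",
    llm="google/gemini-2.0-flash",
    tts_chain=[
        "elevenlabs/flash-v2.5",  # primary: most natural voice
        "cartesia/sonic",         # first fallback: large cost reduction
        "deepgram/aura-2",        # last resort: cheapest, always available
    ],
)
```

Keeping the chain as ordered data rather than hard-coded branches makes the cost/quality tradeoff a one-line config change instead of a code change.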
What is Coming in Part 5
We now have the full voice pipeline running. In Part 5, I will show how to build multi-role agent systems — separate AI personas for interviewer, coach, and evaluator — and how to coordinate them in real-time without the candidate experiencing awkward transitions or the agents stepping on each other.
The interviewer asks questions. The coach (invisible to the candidate) monitors confidence patterns and suggests adjustments to your approach. The evaluator scores each answer against the competency framework in real-time. Getting all three to work together without the candidate hearing the machinery — that is what Part 5 covers.
This is Part 4 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (this post)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
- Cost Optimization — From $0.14/min to $0.03/min (Part 11)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)