In Part 3, we chose our framework. LiveKit Agents won for production deployments, Pipecat for rapid prototyping, and direct WebSocket for ultra-low-latency custom builds. Now comes the part that actually makes or breaks your voice AI interview system: the individual components inside that pipeline.

STT, LLM, and TTS are not interchangeable commodities. The difference between Deepgram Nova-3 and a generic Whisper deployment is not just $0.003/minute — it is 80ms of latency that users feel as unnatural hesitation. The difference between ElevenLabs Flash v2.5 and a mediocre TTS voice is the difference between “this feels like a real interview” and “I am talking to a robot.”

In this post, I am going to walk through every component with real benchmarks, real production data, and the specific combinations that I have found work best for interview applications. I will also show you the streaming integration code that ties them all together, because reading about pipelines and actually building one are very different things.

The Component Landscape

Before diving in, let me lay out the decision space. You are choosing three components that need to work together within a latency budget:

  • Total acceptable latency: under 800ms from end of user speech to start of AI voice response
  • STT budget: 100-200ms
  • LLM first token budget: 250-400ms
  • TTS first audio budget: 75-150ms
  • Pipeline overhead: 50-100ms

That adds up to roughly 475-850ms. The low end is achievable. The high end is where you start losing the “real conversation” feeling. Every component choice either tightens or blows this budget.
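The budget above is simple enough to encode as a sanity check that can run in CI whenever a component target changes. A minimal sketch (the dictionary and function names are mine):

```python
# Component latency budgets in ms, as (low, high) ranges -- the figures above.
LATENCY_BUDGET_MS = {
    "stt": (100, 200),
    "llm_first_token": (250, 400),
    "tts_first_audio": (75, 150),
    "pipeline_overhead": (50, 100),
}

def total_budget_ms(budget: dict[str, tuple[int, int]] = LATENCY_BUDGET_MS) -> tuple[int, int]:
    """Sum the per-component ranges into a total pipeline range."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

print(total_budget_ms())  # (475, 850)
```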

STT: Turning Interview Audio into Accurate Text

Speech-to-text is where most voice AI systems fail quietly. You never notice bad STT until a candidate says “Kubernetes” and your system hears “cube net ease” and the LLM starts answering a question about networking puzzles.

Interview audio has specific challenges:

  • Technical vocabulary: AWS, Kubernetes, GraphQL, SOLID principles, CI/CD — words that general speech models see rarely
  • Thinking pauses: candidates genuinely pause mid-sentence to collect thoughts. You need a model that does not rush to finalize the transcript
  • Filler words: “um,” “uh,” “like,” “you know” — you want these for sentiment analysis but often want them cleaned up before the LLM sees them
  • Accents: your candidate pool is global. Your STT needs to be too
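Keeping the raw transcript (fillers included) for analysis while cleaning the copy sent to the LLM can be a small text pass. A minimal sketch; the regex and helper name are mine, and a production version would handle more filler variants:

```python
import re

# Strip common fillers from the LLM-bound copy of the transcript.
# The raw transcript keeps them for sentiment/confidence analysis.
FILLER_RE = re.compile(r"\b(?:um+|uh+|erm+)\b[,.]?\s*", flags=re.IGNORECASE)

def strip_fillers(text: str) -> str:
    cleaned = FILLER_RE.sub("", text)
    # Collapse any doubled whitespace left behind by the removal
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(strip_fillers("Um, I would, uh, use Kubernetes for that"))
# prints: I would, use Kubernetes for that
```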

Deepgram Nova-3

This is my go-to for production interview systems. The numbers:

Metric                 Deepgram Nova-3
Latency (streaming)    ~150ms to first partial
WER (general)          8.4%
WER (technical vocab)  11.2%
Price                  $0.0077/min ($0.46/hr)
Languages              36
Custom vocabulary      Yes (free, instant)

The custom vocabulary feature is underrated. You can pass a list of domain-specific terms at connection time and Nova-3 will bias toward them. For a software engineering interview:

import deepgram
from deepgram import DeepgramClient, LiveOptions

dg_client = DeepgramClient(api_key="YOUR_KEY")

options = LiveOptions(
    model="nova-3",
    language="en-US",
    smart_format=True,
    punctuate=True,
    diarize=False,  # single speaker in most interview scenarios
    filler_words=True,  # capture "um", "uh" for analysis
    keywords=[
        "Kubernetes:3",      # boost weight
        "GraphQL:3",
        "microservices:2",
        "SOLID:2",
        "CI/CD:2",
        "AWS:2",
        "PostgreSQL:2",
        "Redis:2",
    ],
    endpointing=300,  # ms of silence before considering utterance complete
    interim_results=True,
)

async def stream_audio_to_deepgram(audio_stream, on_transcript):
    async with dg_client.listen.asynclive.v("1") as connection:
        async def on_message(self, result, **kwargs):
            transcript = result.channel.alternatives[0].transcript
            if result.is_final and transcript:
                await on_transcript(transcript)
            elif not result.is_final:
                # Stream partial results to reduce perceived latency
                await on_transcript(transcript, is_partial=True)

        connection.on(deepgram.LiveTranscriptionEvents.Transcript, on_message)
        await connection.start(options)

        async for chunk in audio_stream:
            await connection.send(chunk)

The endpointing=300 setting is important for interviews. The default is 10ms, which is fine for quick voice commands but terrible when someone pauses mid-answer to think. I have found 250-350ms works well — long enough not to cut people off, short enough not to feel laggy.

OpenAI Whisper (via API or self-hosted)

Whisper is the accuracy leader for challenging audio — heavy accents, background noise, highly technical content. The tradeoff is latency and cost.

Metric                 Whisper API              Whisper Self-Hosted (A10G)
Latency                500-2000ms (batch)       200-400ms (streaming via faster-whisper)
WER (general)          7.1%                     6.8%
WER (technical vocab)  9.3%                     9.0%
Price                  $0.006/min               ~$0.008/min (compute)
Streaming support      No (batch only via API)  Yes (with faster-whisper)

The official Whisper API does not support streaming — you send audio chunks and get back completed transcripts, which is a deal-breaker for real-time conversation. If you want Whisper’s accuracy with streaming, you need to self-host using faster-whisper with CTranslate2 backend.
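If you go the self-hosted route, the faster-whisper API is straightforward. A sketch, assuming the package is installed and a CUDA GPU is available; the model size and decoding settings are illustrative, and the PCM chunking helper is mine (faster-whisper itself decodes whole buffers, so pseudo-streaming means feeding it fixed-size windows):

```python
def chunk_pcm(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 500) -> list[bytes]:
    """Split 16-bit mono PCM into fixed-duration frames (2 bytes/sample)
    for incremental decoding."""
    frame_bytes = sample_rate * 2 * frame_ms // 1000
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]

def transcribe_file(path: str) -> str:
    # Lazy import so chunk_pcm stays usable without the package installed
    from faster_whisper import WhisperModel  # pip install faster-whisper

    # "large-v3" and these settings are illustrative, not a tuned config
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path, beam_size=5, vad_filter=True)
    return " ".join(seg.text.strip() for seg in segments)
```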

For most interview use cases, Deepgram Nova-3 is the better choice. Whisper makes sense if you have a specific accuracy requirement that Nova-3 cannot meet, or if you are dealing with highly accented speech in languages where Deepgram’s support is weaker.

AssemblyAI

AssemblyAI sits at an interesting price point — $0.37/hr ($0.0062/min) — and offers features that go beyond pure transcription:

Metric               AssemblyAI Nano  AssemblyAI Best
Latency (streaming)  ~180ms           ~250ms
WER                  10.2%            8.8%
Price                $0.0062/min      $0.0140/min
Sentiment analysis   Yes              Yes
Speaker labels       Yes              Yes
Content moderation   Yes              Yes

The sentiment analysis is genuinely useful for interview applications. You can detect when a candidate is frustrated, confident, or uncertain without post-processing. The “LeMUR” feature lets you ask LLM-style questions about the transcript in real-time, though at additional cost.

The downside is that AssemblyAI’s streaming implementation has been less reliable in my experience — occasional connection drops and higher tail latency compared to Deepgram.

Google Cloud Speech-to-Text v2

Google’s STT is often overlooked but handles certain edge cases well, particularly phone-quality audio and strong non-native accents:

Metric               Google STT v2 (Chirp)
Latency (streaming)  ~200ms
WER                  8.9%
Price                $0.016/min (streaming)
Languages            125+

The pricing is nearly double Deepgram and the latency is slightly higher, making it hard to recommend as a default choice. Where it shines: language diversity. If you are building for markets where Deepgram has limited language support, Google is the pragmatic choice.

STT Decision Matrix

Use case                        Recommended STT               Why
English, cost-sensitive         Deepgram Nova-3               Best price/perf ratio
Maximum accuracy, English       OpenAI Whisper (self-hosted)  Best WER on challenging audio
Sentiment analysis built-in     AssemblyAI                    Features > marginal accuracy loss
100+ language support           Google Cloud STT              Broadest language coverage
Low latency + accuracy balance  Deepgram Nova-3               150ms is hard to beat

LLM: The Brain of Your Interview Agent

The LLM is where your interview agent actually thinks — understanding context, formulating follow-up questions, evaluating answers, staying in character as an interviewer. The requirements for interview applications are specific:

  • Context window: A 45-minute interview at average speaking pace (~150 words/min) generates ~6,750 words. Plus your system prompt (500-1000 tokens) and structured interview guide (500-1500 tokens). You need at least 16k tokens, ideally 32k+
  • First token latency: under 400ms to keep total pipeline latency reasonable
  • Instruction following: the LLM must stay in the interviewer persona, not ramble, not go off-script
  • Function calling: for triggering scoring, updating candidate records, fetching competency definitions mid-interview
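The context-window arithmetic in the first bullet is easy to sanity-check. A quick estimator (the ~1.33 tokens-per-word ratio is a rough rule of thumb for English, and the helper name is mine):

```python
def interview_token_estimate(
    minutes: int = 45,
    words_per_min: int = 150,
    tokens_per_word: float = 1.33,   # rough rule of thumb for English text
    system_prompt_tokens: int = 1000,
    guide_tokens: int = 1500,
) -> int:
    """Estimate total context needed for a full interview transcript
    plus system prompt and interview guide."""
    transcript_tokens = int(minutes * words_per_min * tokens_per_word)
    return transcript_tokens + system_prompt_tokens + guide_tokens

print(interview_token_estimate())  # roughly 11,500 tokens
```

With headroom for LLM responses and tool-call payloads, that estimate is why 16k is the floor and 32k+ is comfortable.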

Gemini 2.0 Flash

This is the speed champion right now and it is not particularly close:

Metric               Gemini 2.0 Flash
First token latency  ~250-350ms
Output tokens/sec    150-200
Context window       1M tokens
Price (input)        $0.075/1M tokens
Price (output)       $0.30/1M tokens
Function calling     Yes

The 1M token context window is genuinely useful for long interview sessions where you want to keep the entire conversation history. The price is also excellent — a typical 45-minute interview with ~8k tokens in and ~2k tokens out costs roughly $0.0012 in LLM costs alone.
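Plugging the table's prices into those token counts (a back-of-envelope helper, not a billing tool; the names are mine):

```python
GEMINI_FLASH_PRICE = {"input": 0.075, "output": 0.30}  # $ per 1M tokens

def llm_cost(input_tokens: int, output_tokens: int,
             price: dict[str, float] = GEMINI_FLASH_PRICE) -> float:
    """Dollar cost for one interview's worth of LLM traffic."""
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

print(f"${llm_cost(8_000, 2_000):.4f}")  # $0.0012
```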

The tradeoff is reasoning depth. For behavioral interview questions where the AI needs to probe deeply and recognize complex answer patterns, Gemini Flash occasionally gives shallower follow-ups than GPT-4o or Claude. For structured, formulaic interviews (screening calls, technical trivia), Flash is excellent.

import google.generativeai as genai
from google.generativeai.types import GenerationConfig
from typing import AsyncIterator

genai.configure(api_key="YOUR_KEY")

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction="""You are Alex, a senior technical interviewer at Acme Corp.
    You are conducting a 45-minute software engineering interview.
    Your job is to assess the candidate's technical depth, problem-solving approach,
    and communication skills. Ask one question at a time. When the candidate finishes
    answering, either ask a follow-up to probe deeper or move to the next topic.
    Keep responses concise — 2-3 sentences maximum for interview questions.
    Do not explain your reasoning or evaluation to the candidate.""",
    generation_config=GenerationConfig(
        max_output_tokens=150,  # Keep responses short for voice
        temperature=0.7,
    )
)

async def get_interviewer_response(conversation_history: list[dict]) -> AsyncIterator[str]:
    # Stream the response for lower perceived latency
    async for chunk in await model.generate_content_async(
        conversation_history,
        stream=True,
    ):
        if chunk.text:
            yield chunk.text  # Yield tokens as they arrive for TTS streaming

GPT-4o

GPT-4o is the quality benchmark for complex interview scenarios:

Metric               GPT-4o
First token latency  ~400-600ms
Output tokens/sec    60-80
Context window       128k tokens
Price (input)        $2.50/1M tokens
Price (output)       $10.00/1M tokens
Function calling     Yes (best-in-class)

GPT-4o’s function calling is noticeably more reliable than alternatives — when you define tools for “mark_answer_complete,” “request_code_example,” or “trigger_evaluation,” GPT-4o calls them at the right moments without hallucinating extra tool calls or forgetting to call them.

The latency is the real issue. At 400-600ms first token, you are right at the edge of the total 800ms budget. You have almost nothing left for STT (150ms) and TTS (75ms). In practice, GPT-4o pushes total round-trip latency to 900-1200ms, which users notice.

My compromise: use GPT-4o for the evaluation and scoring agents (which run asynchronously after each answer) but use Gemini Flash for the live conversation agent.

Claude 3.5 Sonnet / Claude 3.7

Claude’s strength in interview applications is instruction-following precision and staying in character:

Metric               Claude 3.5 Sonnet
First token latency  ~500-700ms
Output tokens/sec    80-100
Context window       200k tokens
Price (input)        $3.00/1M tokens
Price (output)       $15.00/1M tokens
Function calling     Yes

Claude tends to give the most “human-sounding” interview responses — natural transitions, appropriate acknowledgment of answers, good follow-up question construction. The latency knocks it out of first-choice status for live conversation, but it is excellent for the post-interview feedback generation (which can be async).

Claude 3.7 Sonnet (with extended thinking) is worth testing for technical evaluation — its reasoning depth on “did this candidate actually explain the SOLID principles correctly” type questions is noticeably better.

Grok

Grok (via xAI API) is the dark horse for cost-optimized builds:

Metric               Grok-2
First token latency  ~300-450ms
Output tokens/sec    80-120
Context window       131k tokens
Price (input)        $2.00/1M tokens
Price (output)       $10.00/1M tokens

At roughly $0.05/min all-in for typical interview token volumes, Grok is competitive. The quality is solid for structured interviews, though I have found it occasionally goes off-script in complex roleplay scenarios. If you are building a cost-optimized product and willing to do extra prompt engineering to keep the persona stable, Grok deserves a look.

LLM Decision Matrix

Use case                           Recommended LLM    Why
Live conversation, speed priority  Gemini 2.0 Flash   250-350ms TTFT, cheap
Complex behavioral interviews      GPT-4o             Best instruction following
Post-interview evaluation          Claude 3.5 Sonnet  Best reasoning quality
Cost-optimized deployment          Gemini 2.0 Flash   Best cost/quality for voice
Function-heavy workflows           GPT-4o             Most reliable tool calling

TTS: Giving Your AI Interviewer a Voice

TTS is what candidates actually hear. A technically perfect STT and LLM pipeline is worthless if the voice sounds like a 2015 IVR system. For interviews, you need:

  • Naturalness: prosody, pacing, and intonation that sound like a real person
  • Low TTFB (time to first byte): under 100ms to start audio immediately after LLM generates tokens
  • Streaming support: do not wait for complete text before starting audio
  • Professional tone: warm but authoritative — an interviewer, not a customer service bot

ElevenLabs Flash v2.5

ElevenLabs is the naturalness leader. Flash v2.5 specifically optimizes for low latency:

Metric                   ElevenLabs Flash v2.5
TTFB                     ~75ms
Naturalness (MOS score)  4.3/5.0
Languages                70+
Price                    $0.18/1k characters (~$0.16/min of speech)
Voice cloning            Yes (instant clone from 30s sample)
Streaming                Yes (WebSocket)

The voices — particularly “Adam,” “Antoni,” and “Callum” for male voices, “Rachel” and “Bella” for female — are genuinely convincing. In user testing, participants regularly described the interview as feeling “like talking to a real person” when using ElevenLabs.

import httpx
import asyncio
from typing import AsyncIterator

ELEVENLABS_API_KEY = "YOUR_KEY"
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # Rachel - professional, clear

async def stream_tts_elevenlabs(
    text_stream: AsyncIterator[str],
    output_audio_queue: asyncio.Queue,
) -> None:
    """Stream text tokens to ElevenLabs and push audio chunks to queue."""

    async with httpx.AsyncClient() as client:
        # ElevenLabs streaming endpoint
        async with client.stream(
            "POST",
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
            headers={
                "xi-api-key": ELEVENLABS_API_KEY,
                "Content-Type": "application/json",
            },
            json={
                "model_id": "eleven_flash_v2_5",
                "voice_settings": {
                    "stability": 0.5,
                    "similarity_boost": 0.8,
                    "style": 0.2,  # subtle style, professional
                    "use_speaker_boost": True,
                },
                "text": await collect_stream(text_stream),
                "output_format": "pcm_24000",  # 24kHz PCM for quality
            },
        ) as response:
            async for chunk in response.aiter_bytes(chunk_size=4096):
                await output_audio_queue.put(chunk)

    await output_audio_queue.put(None)  # Signal end of audio

async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Collect the full streamed text before synthesis.
    Simple but adds latency: for lower time-to-first-audio, buffer at
    sentence boundaries and send each completed sentence to TTS instead."""
    buffer = ""
    async for token in stream:
        buffer += token
    return buffer

The pricing at $0.18/1k characters works out to roughly $0.16 per minute of generated speech at a typical pace of ~150 words/min (around 900 characters). It is the most expensive TTS option but delivers the most natural output.

Cartesia Sonic

Cartesia is the latency-first choice:

Metric                   Cartesia Sonic
TTFB                     ~50-65ms
Naturalness (MOS score)  4.0/5.0
Languages                15+
Price                    $0.065/1k characters (~$0.06/min of speech)
Voice cloning            Yes
Streaming                Yes (WebSocket)

Cartesia Sonic-2 (their latest model) is impressive for latency. At 50-65ms TTFB, it shaves 10-25ms off ElevenLabs, which matters in highly latency-sensitive deployments. The voice quality is good — not quite ElevenLabs, but noticeably better than generic TTS.

At $0.065/1k characters, Cartesia is nearly 3x cheaper than ElevenLabs. If your user research shows that the quality delta does not affect conversion or satisfaction metrics, Cartesia is the smart cost choice.

import websockets
import json
import base64

async def stream_tts_cartesia(
    text: str,
    audio_callback,
    voice_id: str = "a0e99841-438c-4a64-b679-ae501e7d6091",  # professional voice
) -> None:
    """Stream TTS from Cartesia via WebSocket."""

    async with websockets.connect(
        "wss://api.cartesia.ai/tts/websocket?api_key=YOUR_KEY&cartesia_version=2024-06-10"
    ) as ws:
        await ws.send(json.dumps({
            "model_id": "sonic-2",
            "transcript": text,
            "voice": {
                "mode": "id",
                "id": voice_id,
            },
            "output_format": {
                "container": "raw",
                "encoding": "pcm_f32le",
                "sample_rate": 24000,
            },
            "context_id": "interview-session",
        }))

        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "chunk":
                audio_bytes = base64.b64decode(data["data"])
                await audio_callback(audio_bytes)
            elif data.get("type") == "done":
                break

Deepgram Aura-2

Deepgram’s TTS is underrated, especially if you are already using Deepgram for STT (single SDK, single bill, simpler architecture):

Metric                   Deepgram Aura-2
TTFB                     ~90ms
Naturalness (MOS score)  3.8/5.0
Languages                2 (English, Spanish)
Price                    $0.015/1k characters (~$0.014/min of speech)
Voice cloning            No
Streaming                Yes

The price is remarkable — $0.015/1k characters versus ElevenLabs’ $0.18/1k. That is a 12x cost difference. The voice quality is noticeably more “TTS-like” — it lacks the naturalness of ElevenLabs or Cartesia, but for an internal recruiting tool where cost matters more than the slight uncanny valley effect, Aura-2 is worth serious consideration.
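To compare the options on equal footing, here is a quick per-minute cost helper built from the per-character prices quoted above and the ~900 characters/min speaking pace used earlier (the dictionary keys and function name are mine):

```python
TTS_PRICE_PER_1K_CHARS = {
    "elevenlabs_flash_v2_5": 0.18,
    "cartesia_sonic": 0.065,
    "deepgram_aura_2": 0.015,
}

def tts_cost_per_min(provider: str, chars_per_min: int = 900) -> float:
    """Cost per minute of generated speech at ~150 words/min (~900 chars)."""
    return chars_per_min / 1000 * TTS_PRICE_PER_1K_CHARS[provider]

for name in TTS_PRICE_PER_1K_CHARS:
    print(f"{name}: ${tts_cost_per_min(name):.4f}/min")
```

Note that only the AI interviewer's speech is billed, so multiply by the fraction of the session the AI actually talks, not the full interview length.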

PlayHT

PlayHT’s 2.0 Turbo model is a solid mid-tier option:

Metric                   PlayHT 2.0 Turbo
TTFB                     ~95ms
Naturalness (MOS score)  3.9/5.0
Languages                142
Price                    $0.022/1k characters
Voice cloning            Yes (Instant Clone)
Streaming                Yes

PlayHT’s language support (142 languages) is a differentiator if you are building for global markets. The quality sits between Deepgram Aura-2 and Cartesia.

Azure Neural TTS

Azure’s neural TTS is the enterprise default for a reason:

Metric                   Azure Neural TTS
TTFB                     ~100-150ms
Naturalness (MOS score)  3.8/5.0
Languages                54 locales, 129 voices
Price                    $0.016/1k characters
Voice cloning            Yes (Custom Neural Voice)
Streaming                Yes (SSML)

129 voices across 54 locales makes Azure the choice for truly global deployments. The Custom Neural Voice feature lets you clone a specific voice for brand consistency. Enterprise compliance (SOC 2, ISO 27001, HIPAA BAA) is built-in, which matters for recruiting applications that handle sensitive candidate data.

TTS Decision Matrix

Use case                    Recommended TTS        Why
Maximum naturalness         ElevenLabs Flash v2.5  Best MOS score, 70+ languages
Minimum latency             Cartesia Sonic         50-65ms TTFB
Minimum cost                Deepgram Aura-2        $0.015/1k chars
Enterprise/compliance       Azure Neural TTS       HIPAA BAA, 129 voices
Global language coverage    PlayHT or Azure        100+ languages
Deepgram stack consistency  Deepgram Aura-2        Single SDK

Voice Cloning and Custom Voices

For interview applications used by a single company, voice consistency matters. “Our AI interviewer is Alex” — not a random ElevenLabs preset. Every candidate should meet the same Alex.

Creating a Custom Interview Voice

ElevenLabs Instant Clone requires just 30 seconds of clean audio. For a production interview voice:

  1. Source audio: record a real person (or use a professional voice actor) reading 5-10 minutes of diverse content — questions, statements, numbers, technical terms
  2. Clean the audio: normalize to -16 LUFS, remove background noise, ensure consistent microphone distance
  3. Upload and clone: ElevenLabs will create the clone in 10-30 seconds
  4. Test extensively: run it through your actual interview script — pay attention to technical terms, question intonation, and sentence-ending cadence

import httpx

async def create_voice_clone(
    audio_file_path: str,
    voice_name: str,
    description: str,
) -> str:
    """Create an ElevenLabs voice clone. Returns the voice ID."""

    async with httpx.AsyncClient() as client:
        with open(audio_file_path, "rb") as f:
            response = await client.post(
                "https://api.elevenlabs.io/v1/voices/add",
                headers={"xi-api-key": "YOUR_KEY"},
                data={
                    "name": voice_name,
                    "description": description,
                    "labels": '{"accent": "American", "use_case": "interview"}',
                },
                files={"files": (audio_file_path, f, "audio/mpeg")},
            )

        result = response.json()
        return result["voice_id"]

For Cartesia, the process is similar — upload audio clips and get a custom voice ID back. Cartesia’s voice cloning tends to be slightly more consistent for professional tones, worth testing alongside ElevenLabs.

Audio Quality: Sample Rates and Codecs

Audio pipeline quality is invisible when it is right and obvious when it is wrong. These are the settings that have proven out in production:

Sample Rate Selection

Stage               Recommended Rate  Why
STT input           16kHz             STT models are trained at 16kHz; upsampling wastes bandwidth
TTS output          24kHz             Captures full voice frequency range (20Hz-12kHz) without overhead
Recording/archival  48kHz             Standard for professional audio, editing headroom
WebRTC transport    48kHz             WebRTC standard; browser handles downsampling

The mismatch trap: WebRTC natively operates at 48kHz. Deepgram works best at 16kHz. You need a resampler in your pipeline. Most frameworks (LiveKit, Pipecat) handle this, but if you are building directly, you need to explicitly resample:

import numpy as np
from scipy import signal

def resample_audio(audio_data: bytes, from_rate: int, to_rate: int) -> bytes:
    """Resample audio PCM data between sample rates."""
    if from_rate == to_rate:
        return audio_data

    # Convert bytes to numpy array (16-bit PCM)
    samples = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32)

    # Calculate resampling ratio
    ratio = to_rate / from_rate
    num_samples = int(len(samples) * ratio)

    # Resample
    resampled = signal.resample(samples, num_samples)

    # Convert back to 16-bit PCM
    return resampled.astype(np.int16).tobytes()

Codec Choices

Context            Codec          Why
WebRTC transport   Opus           Adaptive bitrate, handles packet loss gracefully
STT streaming      PCM (raw)      Zero decoding overhead, direct to STT
TTS streaming      PCM_24000      Direct playback without decode step
Recording storage  AAC (128kbps)  60-70% smaller than PCM, perceptually lossless for voice
Archival           FLAC           Lossless, compressed, good for long-term storage

The practical pipeline: WebRTC audio arrives as Opus, you decode to PCM for STT, you request PCM from TTS, you encode to AAC for storage. This sounds complex but the codec operations are cheap — on a modern server, encoding/decoding adds under 5ms to your pipeline.

Streaming Integration: The Full Pipeline

This is where the rubber meets the road. Streaming integration means you are not waiting for complete text before starting TTS — you are piping tokens from the LLM to TTS as they arrive. The latency difference is dramatic: a buffered pipeline adds 500-1000ms of dead air, while a streaming pipeline adds essentially nothing beyond the components' own latencies.

Here is the complete streaming pipeline that I use in production, written for Deepgram STT + Gemini Flash + ElevenLabs TTS:

import asyncio
import google.generativeai as genai
import httpx
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents
from typing import AsyncIterator

# Configuration
DEEPGRAM_KEY = "YOUR_DEEPGRAM_KEY"
GEMINI_KEY = "YOUR_GEMINI_KEY"
ELEVENLABS_KEY = "YOUR_ELEVENLABS_KEY"
VOICE_ID = "YOUR_VOICE_ID"

genai.configure(api_key=GEMINI_KEY)

class VoiceInterviewPipeline:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.conversation_history = []
        self.audio_output_queue = asyncio.Queue()
        self.current_utterance = ""

        # Initialize Gemini model
        self.model = genai.GenerativeModel(
            model_name="gemini-2.0-flash",
            system_instruction=system_prompt,
        )
        self.chat = self.model.start_chat(history=[])

    async def handle_stt_transcript(self, transcript: str, is_final: bool) -> None:
        """Called when Deepgram returns a transcript segment."""
        if not is_final:
            # Partial results — show in UI but don't process yet
            self.current_utterance = transcript
            return

        # Final transcript — send to LLM and pipe response to TTS
        self.current_utterance = ""
        self.conversation_history.append({
            "role": "user",
            "parts": [transcript],
        })

        # Start LLM + TTS pipeline (non-blocking)
        asyncio.create_task(self._llm_to_tts_pipeline(transcript))

    async def _llm_to_tts_pipeline(self, user_text: str) -> None:
        """
        Stream LLM tokens directly to TTS.
        Sentence-boundary buffering ensures natural TTS output
        while keeping latency low.
        """
        sentence_buffer = ""
        full_response = ""

        # Stream from Gemini
        async for chunk in await self.chat.send_message_async(
            user_text, stream=True
        ):
            if not chunk.text:
                continue

            sentence_buffer += chunk.text
            full_response += chunk.text

            # Send to TTS at natural pause points. Note: rstrip() removes
            # trailing newlines, so check for a newline boundary before stripping.
            if sentence_buffer.endswith("\n") or sentence_buffer.rstrip().endswith((".", "?", "!")):
                sentence_to_speak = sentence_buffer.strip()
                if len(sentence_to_speak) > 10:  # Avoid tiny fragments
                    asyncio.create_task(
                        self._send_to_tts(sentence_to_speak)
                    )
                sentence_buffer = ""

        # Send any remaining text
        if sentence_buffer.strip():
            asyncio.create_task(self._send_to_tts(sentence_buffer.strip()))

        # Add AI response to conversation history
        self.conversation_history.append({
            "role": "model",
            "parts": [full_response],
        })

    async def _send_to_tts(self, text: str) -> None:
        """Send text to ElevenLabs TTS and push audio to output queue."""
        async with httpx.AsyncClient(timeout=10.0) as client:
            async with client.stream(
                "POST",
                f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
                headers={
                    "xi-api-key": ELEVENLABS_KEY,
                    "Content-Type": "application/json",
                },
                json={
                    "text": text,
                    "model_id": "eleven_flash_v2_5",
                    "output_format": "pcm_24000",
                    "voice_settings": {
                        "stability": 0.5,
                        "similarity_boost": 0.8,
                    },
                },
            ) as response:
                async for audio_chunk in response.aiter_bytes(4096):
                    await self.audio_output_queue.put(audio_chunk)

    async def start_stt_listener(self, audio_input_stream: AsyncIterator[bytes]) -> None:
        """Connect to Deepgram and stream audio input."""
        dg_client = DeepgramClient(DEEPGRAM_KEY)

        options = LiveOptions(
            model="nova-3",
            language="en-US",
            smart_format=True,
            punctuate=True,
            filler_words=True,
            endpointing=300,
            interim_results=True,
            keywords=["Kubernetes:3", "AWS:2", "Python:2", "GraphQL:3"],
        )

        async with dg_client.listen.asynclive.v("1") as connection:
            async def on_transcript(self_dg, result, **kwargs):
                transcript = result.channel.alternatives[0].transcript
                if transcript:
                    await self.handle_stt_transcript(
                        transcript,
                        is_final=result.is_final,
                    )

            connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
            await connection.start(options)

            async for audio_chunk in audio_input_stream:
                await connection.send(audio_chunk)

    async def get_audio_output(self) -> AsyncIterator[bytes]:
        """Yield audio chunks from the output queue for playback."""
        while True:
            chunk = await self.audio_output_queue.get()
            if chunk is None:
                break
            yield chunk


# Usage
async def run_interview(audio_input_stream: AsyncIterator[bytes]):
    pipeline = VoiceInterviewPipeline(
        system_prompt="""You are Morgan, a senior engineering interviewer at TechCorp.
        Conduct a 45-minute technical interview. Ask one question at a time.
        Keep all responses under 3 sentences."""
    )

    # Start STT listener and audio output in parallel
    await asyncio.gather(
        pipeline.start_stt_listener(audio_input_stream),
        # Audio output is consumed by the WebRTC sender
    )

The key insight in this pipeline is sentence-boundary buffering. Sending tokens one-by-one to TTS would produce terrible prosody — TTS models need semantic context to generate natural intonation. But waiting for the complete LLM response adds 500ms+. Sentence-boundary buffering is the sweet spot: you send one sentence at a time, TTS starts generating audio within 75-100ms of the first sentence completing, and the user hears the first sentence while subsequent sentences are being synthesized.
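The buffering logic is worth isolating so it can be unit-tested away from the async machinery. A synchronous sketch of the same sentence-boundary rule (the function name is mine; the async variant is the same logic with `async for`):

```python
from typing import Iterable, Iterator

def sentences_from_tokens(tokens: Iterable[str], min_len: int = 10) -> Iterator[str]:
    """Group a stream of LLM tokens into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        candidate = buffer.strip()
        # Emit at sentence boundaries, skipping tiny fragments like "Ok."
        if candidate.endswith((".", "?", "!")) and len(candidate) > min_len:
            yield candidate
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing fragment

chunks = list(sentences_from_tokens(
    ["Tell me about", " a recent project.", " What was", " your role?"]
))
print(chunks)  # ['Tell me about a recent project.', 'What was your role?']
```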

Fallback Chains: When Your Primary Provider Has a Bad Day

Any production voice system needs fallback logic. ElevenLabs has occasional latency spikes (250ms TTFB instead of 75ms) and rare outages. Building a fallback chain is not optional.

Here is the provider priority for my interview systems:

Primary: ElevenLabs Flash v2.5 (best naturalness)
Fallback 1: Cartesia Sonic (slightly lower naturalness, similar latency)
Fallback 2: Deepgram Aura-2 (noticeably less natural, but ultra-low cost and reliable)

import asyncio
import time
from enum import Enum

class TTSProvider(Enum):
    ELEVENLABS = "elevenlabs"
    CARTESIA = "cartesia"
    DEEPGRAM = "deepgram"

class TTSFallbackManager:
    def __init__(self):
        self.provider_health = {
            TTSProvider.ELEVENLABS: {"failures": 0, "last_failure": 0, "circuit_open": False},
            TTSProvider.CARTESIA: {"failures": 0, "last_failure": 0, "circuit_open": False},
            TTSProvider.DEEPGRAM: {"failures": 0, "last_failure": 0, "circuit_open": False},
        }
        self.priority = [TTSProvider.ELEVENLABS, TTSProvider.CARTESIA, TTSProvider.DEEPGRAM]
        self.latency_threshold_ms = 200  # Switch to fallback if TTFB exceeds this
        self.circuit_reset_seconds = 60   # Try primary again after 60s

    def get_active_provider(self) -> TTSProvider:
        """Return the highest-priority healthy provider."""
        now = time.time()

        for provider in self.priority:
            health = self.provider_health[provider]

            # Reset circuit breaker after cooldown period
            if health["circuit_open"]:
                if now - health["last_failure"] > self.circuit_reset_seconds:
                    health["circuit_open"] = False
                    health["failures"] = 0
                    print(f"[TTS] Circuit breaker reset for {provider.value}")
                else:
                    continue

            return provider

        # All providers unhealthy — use Deepgram as last resort
        return TTSProvider.DEEPGRAM

    def record_failure(self, provider: TTSProvider) -> None:
        health = self.provider_health[provider]
        health["failures"] += 1
        health["last_failure"] = time.time()

        if health["failures"] >= 3:
            health["circuit_open"] = True
            print(f"[TTS] Circuit breaker OPEN for {provider.value} after {health['failures']} failures")

    def record_success(self, provider: TTSProvider) -> None:
        self.provider_health[provider]["failures"] = 0

    async def synthesize_with_fallback(
        self,
        text: str,
        audio_callback,
        _attempts: int = 0,
    ) -> TTSProvider:
        """Try providers in priority order. Return which provider was used."""
        # Bound the retries so a total outage raises instead of recursing forever
        if _attempts >= len(self.priority) * 3:
            raise RuntimeError("All TTS providers failed")

        provider = self.get_active_provider()

        try:
            start = time.monotonic()
            first_chunk_at = None

            async def timed_callback(chunk):
                # Capture when the first audio chunk arrives so we measure
                # true time-to-first-byte rather than total synthesis time
                nonlocal first_chunk_at
                if first_chunk_at is None:
                    first_chunk_at = time.monotonic()
                result = audio_callback(chunk)
                if asyncio.iscoroutine(result):
                    await result

            if provider == TTSProvider.ELEVENLABS:
                await synthesize_elevenlabs(text, timed_callback)
            elif provider == TTSProvider.CARTESIA:
                await synthesize_cartesia(text, timed_callback)
            else:
                await synthesize_deepgram(text, timed_callback)

            ttfb_ms = ((first_chunk_at or time.monotonic()) - start) * 1000

            # A slow TTFB counts as a soft failure (never for the last resort)
            if ttfb_ms > self.latency_threshold_ms and provider != TTSProvider.DEEPGRAM:
                print(f"[TTS] {provider.value} TTFB {ttfb_ms:.0f}ms exceeded threshold")
                self.record_failure(provider)
            else:
                self.record_success(provider)

            return provider

        except Exception as e:
            print(f"[TTS] {provider.value} failed: {e}")
            self.record_failure(provider)

            # Try again with the next healthy provider
            return await self.synthesize_with_fallback(text, audio_callback, _attempts + 1)

The circuit breaker pattern is important here. If ElevenLabs starts returning errors, you do not want every TTS request to fail and retry — you want to route around the outage immediately. Three failures open the circuit, and after 60 seconds you try the primary again (in case it was a transient spike).

Per-Minute Cost Breakdown

The question I get asked constantly: “what does this actually cost?” Here is the full breakdown for a 45-minute interview, assuming typical token volumes:

Assumptions:

  • 45 minutes of audio
  • STT: 45 minutes of transcription
  • LLM: ~12k input tokens, ~3k output tokens per interview
  • TTS: ~8,100 characters of synthesized speech (interviewer side, ~9 minutes of speech)
Component costs below are per interview; the final column is the blended per-minute rate.

| STT | LLM | TTS | STT Cost | LLM Cost | TTS Cost | Total/min |
| --- | --- | --- | --- | --- | --- | --- |
| Deepgram Nova-3 | Gemini 2.0 Flash | Deepgram Aura-2 | $0.35 | $0.002 | $0.12 | $0.0107/min |
| Deepgram Nova-3 | Gemini 2.0 Flash | Cartesia Sonic | $0.35 | $0.002 | $0.53 | $0.0196/min |
| Deepgram Nova-3 | Gemini 2.0 Flash | ElevenLabs Flash | $0.35 | $0.002 | $1.46 | $0.0403/min |
| Deepgram Nova-3 | GPT-4o | ElevenLabs Flash | $0.35 | $0.17 | $1.46 | $0.0442/min |
| AssemblyAI | GPT-4o | ElevenLabs Flash | $0.28 | $0.17 | $1.46 | $0.0424/min |
| Whisper API | GPT-4o | ElevenLabs Flash | $0.27 | $0.17 | $1.46 | $0.0422/min |
| Deepgram Nova-3 | Claude 3.5 Sonnet | ElevenLabs Flash | $0.35 | $0.24 | $1.46 | $0.0456/min |
| Google STT | Gemini 2.0 Flash | Azure Neural | $0.72 | $0.002 | $0.13 | $0.0189/min |

Key observations:

  1. TTS dominates cost in high-quality configurations. ElevenLabs costs more than STT + LLM combined for Flash setups
  2. Gemini Flash is almost free — $0.002 per interview is remarkable. The LLM is no longer the expensive part
  3. Minimum viable cost: Deepgram + Gemini + Deepgram Aura-2 at $0.0107/min for a 45-minute interview is $0.48 total. That is commercially viable at scale
  4. Premium configuration: Deepgram + GPT-4o + ElevenLabs at $0.0442/min for a 45-minute interview is $1.99 total — still inexpensive for a hiring workflow

At 1,000 interviews/month:

  • Budget stack (Deepgram + Gemini Flash + Deepgram Aura-2): ~$480/month
  • Premium stack (Deepgram + GPT-4o + ElevenLabs): ~$1,990/month

Neither of these is a significant cost compared to recruiter time.
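The arithmetic is easy to sanity-check. The figures below are the per-interview component costs from the table above; small differences from the table's per-minute column come from rounding.

```python
# Per-minute cost from per-interview component costs (dollars), 45-min interview
def per_minute_cost(stt: float, llm: float, tts: float, minutes: int = 45) -> float:
    return (stt + llm + tts) / minutes

# Budget stack: Deepgram Nova-3 + Gemini 2.0 Flash + Deepgram Aura-2
budget = per_minute_cost(stt=0.35, llm=0.002, tts=0.12)

# Premium stack: Deepgram Nova-3 + GPT-4o + ElevenLabs Flash
premium = per_minute_cost(stt=0.35, llm=0.17, tts=1.46)
```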

After building multiple interview systems in production, this is my default recommendation:

  • STT: Deepgram Nova-3 with custom vocabulary for your domain
  • LLM: Gemini 2.0 Flash for live conversation, Claude 3.5 Sonnet for async evaluation
  • TTS: ElevenLabs Flash v2.5 with a custom cloned voice for brand consistency
  • Fallback TTS: Cartesia Sonic (first fallback), Deepgram Aura-2 (second)

This configuration gives you:

  • ~475ms total round-trip latency (p50)
  • ~$0.04/min cost
  • Highly natural voice quality that passes the “real interviewer” test in user research
  • Resilient fallback chain for production reliability

If cost is the primary constraint, swap ElevenLabs for Cartesia (cuts TTS cost by 65%) or Deepgram Aura-2 (cuts by 92%) with some voice quality tradeoff.

What is Coming in Part 5

We now have the full voice pipeline running. In Part 5, I will show how to build multi-role agent systems — separate AI personas for interviewer, coach, and evaluator — and how to coordinate them in real-time without the candidate experiencing awkward transitions or the agents stepping on each other.

The interviewer asks questions. The coach (invisible to the candidate) monitors confidence patterns and suggests adjustments to your approach. The evaluator scores each answer against the competency framework in real-time. Getting all three to work together without the candidate hearing the machinery — that is what Part 5 covers.


This is Part 4 of a 12-part series: The Voice AI Interview Playbook.

Series outline:

  1. Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
  2. Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
  3. LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
  4. STT, LLM, and TTS That Actually Work — Building the voice pipeline (this post)
  5. Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
  6. Knowledge Base and RAG — Making your voice agent an expert (Part 6)
  7. Web and Mobile Clients — Cross-platform voice experiences (Part 7)
  8. Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
  9. Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
  10. Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
  11. Cost Optimization — From $0.14/min to $0.03/min (Part 11)
  12. Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)