In Part 10, we built the infrastructure to handle hiring season surges — LiveKit SFU mesh, stateless agent workers, Kubernetes auto-scaling, and multi-region failover. We also previewed the infrastructure cost table at the end, which probably made a few people reach for antacids.
Now let’s talk about the full cost picture.
Real-time voice AI is not cheap. A poorly-optimized stack running at scale can cost $0.14 per minute per session or more. A 30-minute interview becomes $4.20. Multiply that across 10,000 interviews per month and you’re spending $42,000 monthly just on AI infrastructure. That’s a budget item that gets noticed in board meetings.
The good news: with the right architecture decisions, you can get that same 30-minute interview to under $1.00 without meaningfully degrading candidate experience. This post shows you exactly how.
The Per-Minute Cost Anatomy
Let’s start with the baseline: what does one minute of voice AI interview actually cost, at the component level? I’ll use a typical production setup with managed services as the reference point.
Managed Stack Cost Breakdown (per minute of active session)
| Component | Provider | Cost/min | % of total |
|---|---|---|---|
| Voice AI (STT+LLM+TTS) | OpenAI Realtime | $0.06/min | 43% |
| Media transport | LiveKit Cloud | $0.04/min | 29% |
| Recording storage | S3 + CloudFront | $0.004/min | 3% |
| Agent worker compute | ECS Fargate | $0.012/min | 9% |
| Redis (session state) | ElastiCache | $0.003/min | 2% |
| Database queries | RDS PostgreSQL | $0.002/min | 2% |
| Async evaluation | Lambda + GPT-4o mini | $0.018/min | 13% |
| TOTAL (fully managed) | | ~$0.139/min | 100% |
A 25-minute interview on this stack costs approximately $3.47. That’s before any HR platform licensing, recruiter time, or infrastructure amortization.
The three biggest cost drivers are immediately obvious: the voice AI provider, the media transport layer, and the async evaluation. These three components account for 85% of per-minute cost. Everything else is rounding error.
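The table above reduces to a few lines of arithmetic, which is worth having in code when you model scenarios later in this post. The sketch below hard-codes the per-minute rates from the table; the dictionary keys are illustrative names, not any billing API:

```python
# Per-minute component rates from the managed-stack table above (USD/min).
MANAGED_RATES = {
    "voice_ai": 0.060,          # OpenAI Realtime
    "media_transport": 0.040,   # LiveKit Cloud
    "recording": 0.004,         # S3 + CloudFront
    "agent_compute": 0.012,     # ECS Fargate
    "redis": 0.003,             # ElastiCache
    "database": 0.002,          # RDS PostgreSQL
    "evaluation": 0.018,        # Lambda + GPT-4o mini
}

def cost_per_interview(rates: dict[str, float], minutes: float) -> float:
    """Total session cost for an interview of the given length."""
    return sum(rates.values()) * minutes

def top_drivers(rates: dict[str, float], n: int = 3) -> list[str]:
    """The n most expensive components, largest first."""
    return sorted(rates, key=rates.get, reverse=True)[:n]

print(cost_per_interview(MANAGED_RATES, 25))  # close to the $3.47 quoted above
print(top_drivers(MANAGED_RATES))             # voice AI, transport, evaluation
```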
What “Managed” Means
“Managed” in this context means:
- LiveKit Cloud for media transport (no servers to run)
- OpenAI Realtime API for the voice AI provider
- ECS Fargate for agent workers (AWS manages the container infrastructure)
- All post-session processing on Lambda and managed LLM APIs
You’re paying a significant markup for the convenience of not managing infrastructure. This is the right tradeoff at low volume. It becomes the wrong tradeoff at scale.
The Three Cost Tipping Points
Not every team should fully optimize. The cost profile follows a classic step-function, with three tipping points where the effort-to-savings ratio changes dramatically.
Tipping Point 1: Below 10,000 Minutes/Month — Stay Managed
Below 10,000 interview minutes per month (roughly 330 thirty-minute interviews — far more than a small company doing 15-20 interviews per week), you do not have a cost problem. You have a reliability problem. Focus on uptime, quality, and iteration speed.
At this volume:
- Monthly managed-stack spend: ~$1,400 (10,000 min × $0.139/min)
- Cost to engineer self-hosted alternatives: $15,000-30,000 in engineering time
- Break-even: 12-24 months minimum
The math does not work. Stay on fully managed services. Revisit when volume grows.
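The break-even logic is worth making explicit, since it recurs at every tier. A small helper, using illustrative numbers from this section (the $15k engineering figure is the low end of the range above; your loaded rate will differ):

```python
def breakeven_months(engineering_cost: float,
                     monthly_minutes: float,
                     managed_rate: float,
                     optimized_rate: float) -> float:
    """Months until a one-time engineering investment is repaid
    by the per-minute savings it unlocks."""
    monthly_savings = monthly_minutes * (managed_rate - optimized_rate)
    return engineering_cost / monthly_savings

# At 10,000 min/month, even a cheap $15k effort takes over two years:
print(breakeven_months(15_000, 10_000, 0.139, 0.085))  # ≈ 28 months
# Triple the volume and the same effort pays back in under a year:
print(breakeven_months(15_000, 30_000, 0.139, 0.085))  # ≈ 9 months
```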
Tipping Point 2: 10,000–50,000 Minutes/Month — Hybrid Optimization (~40% savings available)
This is where optimization starts paying for itself within 2-3 months. The target: replace the two highest-cost managed components (voice AI provider and media transport) with lower-cost alternatives while keeping everything else managed.
Key moves at this tier:
- Switch from OpenAI Realtime to Grok ($0.06/min → $0.05/min flat, simpler billing)
- Move from LiveKit Cloud to self-hosted LiveKit ($0.04/min → ~$0.008/min on reserved EC2)
- Replace GPT-4o mini evaluation with a smaller self-hosted model ($0.018/min → $0.006/min on batch GPU)
New cost breakdown after hybrid optimization:
| Component | Provider | Cost/min | Savings |
|---|---|---|---|
| Voice AI | Grok Voice Agent | $0.050/min | -17% |
| Media transport | Self-hosted LiveKit | $0.008/min | -80% |
| Recording storage | S3 + CloudFront | $0.004/min | 0% |
| Agent worker compute | ECS Fargate | $0.012/min | 0% |
| Redis (session state) | ElastiCache | $0.003/min | 0% |
| Database queries | RDS PostgreSQL | $0.002/min | 0% |
| Async evaluation | Llama 3.1 on spot | $0.006/min | -67% |
| TOTAL (hybrid optimized) | | ~$0.085/min | -39% |
A 25-minute interview now costs $2.13, down from $3.47. A 39% reduction from two infrastructure changes. At 30,000 minutes/month, this saves ~$1,620/month, paying back the engineering investment within a few months.
Tipping Point 3: Above 50,000 Minutes/Month — Full Optimization (68% total savings possible)
Above 50,000 minutes per month, you’re spending enough on voice AI that the economics of full optimization become compelling. This requires more engineering investment but reduces per-minute cost to roughly $0.04, approaching $0.03 with caching and pre-generation.
Full optimization adds:
- Self-hosted STT (Whisper.cpp or Deepgram self-hosted on GPU): replaces $0.01-0.02/min of the provider cost
- Self-hosted TTS for low-stakes interactions (Coqui/Piper for filler and transitions): reduces TTS portion to near zero
- Context window optimization (aggressive conversation summarization): reduces LLM token cost by 40-60%
- Spot instances for all non-real-time processing: 70% discount on async evaluation compute
- TTS pre-generation for scripted content: zero incremental cost for opening scripts
| Component | Implementation | Cost/min | vs Baseline |
|---|---|---|---|
| Voice AI (core) | Grok / Gemini Live (with context optimization) | $0.020/min | -67% |
| Media transport | Self-hosted LiveKit | $0.008/min | -80% |
| Recording storage | S3 Intelligent-Tiering | $0.003/min | -25% |
| Agent workers | EKS Spot + On-demand | $0.006/min | -50% |
| Redis | Self-managed Redis | $0.002/min | -33% |
| Database | Aurora Serverless v2 | $0.001/min | -50% |
| Async evaluation | Llama on spot batch | $0.004/min | -78% |
| TOTAL (fully optimized) | | ~$0.044/min | -68% |
A 25-minute interview now costs $1.10. With caching and pre-generation of scripted content, you can push below $0.90 for standard interview formats.
Provider Cost Comparison
The voice AI provider is the single largest cost lever. Here is the full comparison as of Q1 2026:
| Provider | Model | Billing Model | Cost/min | Latency (TTFA) | Notes |
|---|---|---|---|---|---|
| OpenAI Realtime | GPT-4o Realtime | Per token (audio in + out) | $0.04-0.08 | 300-600ms | Native function calling, 60-min sessions |
| Grok Voice Agent | Grok-2 Voice | Flat per minute | $0.05 | <1s | OpenAI-compatible, no token math needed |
| Gemini Live | Gemini 2.0 Flash | Per token (audio + video) | $0.03-0.06 | 320-800ms | Multimodal, best for video interviews |
| Bedrock Nova Sonic | Nova Sonic | Per token (audio) | $0.04-0.07 | <700ms | AWS-native, 100+ languages, compliance |
| Build your own | Whisper + Llama + Coqui | Infrastructure | $0.01-0.02 | 200-800ms | High effort, max control |
Why Grok Wins on Cost Clarity
The flat per-minute billing model of Grok deserves special attention. Every other provider charges per token of audio input and output, which means your monthly bill depends on:
- How much candidates talk (variable)
- How much the AI talks (variable, depends on verbosity settings)
- Audio sampling rate and encoding
- Whether silence is billed (it is, for most providers)
With Grok’s $0.05/min flat rate, a 30-minute interview costs exactly $1.50 in voice AI costs. No surprises. This predictability is genuinely valuable for financial planning and is why Grok often wins cost comparisons even when its nominal rate is not the absolute lowest.
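To see why token billing resists forecasting, here is a rough model of how per-minute cost moves with talk ratio. The token rates and tokens-per-audio-minute figures below are hypothetical placeholders, not any provider's published pricing; the point is the spread, not the absolute numbers:

```python
def token_cost_per_min(ai_talk_fraction: float,
                       in_per_1k: float = 0.04,
                       out_per_1k: float = 0.08,
                       tokens_per_audio_min: int = 800) -> float:
    """Per-minute cost under token billing (hypothetical rates).
    Input tokens accrue while the candidate speaks, output tokens
    while the AI speaks, so verbosity directly moves the bill."""
    candidate_fraction = 1.0 - ai_talk_fraction
    return (candidate_fraction * tokens_per_audio_min * in_per_1k / 1000
            + ai_talk_fraction * tokens_per_audio_min * out_per_1k / 1000)

FLAT_RATE = 0.05  # Grok-style flat per-minute billing

# The same interview length yields different token-billed costs
# depending purely on who talks more:
for frac in (0.3, 0.5, 0.7):
    print(f"AI talks {frac:.0%}: "
          f"token-billed ${token_cost_per_min(frac):.3f}/min "
          f"vs flat ${FLAT_RATE:.3f}/min")
```

Under these placeholder rates the token-billed cost swings roughly ±15% around the flat rate as the talk ratio shifts, which is exactly the variance that makes monthly forecasting painful.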
The Self-Build Economics
Building your own speech-to-speech pipeline (Whisper.cpp for STT, Llama 3.3 70B for conversation, Coqui XTTS v2 or Piper for TTS) can reach $0.01-0.02/min at scale, but the true cost includes:
- Engineering time: 3-6 months to build a production-quality pipeline
- Model serving infrastructure: GPU instances for Whisper and TTS ($1,000-5,000/mo baseline)
- Quality gaps: Open-source TTS still lags commercial providers for naturalness
- Maintenance burden: Model updates, infrastructure management, incident response
For most teams, self-building the full voice pipeline is a false economy below 500,000 minutes per month. The engineering cost, quality gap, and maintenance burden outweigh the infrastructure savings.
STT Cost Optimization
If you’re on a cascaded pipeline (separate STT → LLM → TTS), STT is a significant cost component. Here’s the breakdown:
| STT Provider | Cost | Quality | Latency | Best For |
|---|---|---|---|---|
| Deepgram Nova 3 | $0.0043/min | Excellent | 150-300ms | Production cascaded pipeline |
| Whisper API (OpenAI) | $0.006/min | Excellent | 400-800ms | High accuracy needed, latency-tolerant |
| Google STT v2 | $0.016/min | Good | 200-400ms | GCP-native stacks |
| AssemblyAI Nano | $0.003/min | Good | 200-500ms | Cost-sensitive |
| Whisper.cpp (self-hosted) | ~$0.001/min on GPU | Excellent | 100-300ms | High volume, GPU available |
Deepgram Nova 3 is the production choice for managed STT: best accuracy-to-cost ratio, 150ms latency that fits comfortably in the voice budget, and a WebSocket streaming API that integrates cleanly with LiveKit.
For self-hosted at scale: Whisper.cpp running on a GPU instance (g5.xlarge at ~$1.00/hr handles approximately 100 concurrent streams) brings STT cost to under $0.001/min. At 100,000 minutes/month, that’s $200 vs $430 for Deepgram — a $2,760/year difference. Meaningful, but not transformative unless you’re at very high volume.
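The self-hosted STT math generalizes to a one-liner. This assumes the ~100-concurrent-streams figure above; the utilization parameter is my addition, since real fleets sit partly idle off-peak, which is how a theoretical fraction of a cent becomes the ~$200/month quoted above:

```python
def self_hosted_stt_cost_per_min(gpu_hourly_rate: float,
                                 concurrent_streams: int,
                                 utilization: float = 1.0) -> float:
    """Effective per-minute STT cost when one GPU serves many streams.
    utilization < 1.0 accounts for idle capacity outside peak hours."""
    effective_streams = concurrent_streams * utilization
    return gpu_hourly_rate / (effective_streams * 60)

# g5.xlarge at ~$1.00/hr handling ~100 concurrent Whisper.cpp streams:
print(self_hosted_stt_cost_per_min(1.00, 100))        # fully utilized
print(self_hosted_stt_cost_per_min(1.00, 100, 0.25))  # 25% average utilization
```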
TTS Cost Optimization
TTS is where the quality-cost tradeoff is sharpest. Commercial TTS providers are noticeably better than open-source alternatives for interview contexts, but the gap is closing.
| TTS Provider | Cost | Voice Quality | Latency | Best For |
|---|---|---|---|---|
| ElevenLabs Turbo v2 | $0.006-0.012/min | Best-in-class | 200-400ms | High-stakes interviews, executive roles |
| Cartesia Sonic | $0.005/min | Excellent | 90-200ms | Production default, great latency |
| OpenAI TTS | $0.015-0.030/min | Very good | 300-600ms | OpenAI ecosystem |
| Google TTS | $0.004-0.008/min | Good | 200-500ms | GCP stacks |
| Coqui XTTS v2 (self-hosted) | ~$0.001/min on GPU | Good | 200-500ms | Mid-volume, GPU available |
| Piper TTS (self-hosted) | ~$0.0001/min CPU | Acceptable | 50-150ms | Low-stakes interactions only |
The practical tiered strategy:
- Interviewer persona: Cartesia Sonic or ElevenLabs for maximum naturalness — this is what the candidate hears most
- System transitions (“Let’s move on to the next section”): Coqui XTTS v2 — quality is sufficient for scripted transitions
- Pre-recorded common phrases: Piper TTS, pre-generated — zero incremental cost
The transition phrases optimization alone can reduce TTS cost by 20-30% because section transitions, acknowledgments, and filler phrases make up a large fraction of AI speech volume.
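The tiered strategy reduces to a lookup table at runtime. A minimal sketch; the tier names are placeholders for whatever clients wrap your actual Cartesia, Coqui, and Piper integrations:

```python
# Cheapest TTS tier whose quality suffices for each utterance type.
# Provider identifiers are hypothetical labels, not real client names.
TTS_TIERS = {
    "interviewer": "cartesia_sonic",   # candidate-facing, maximum naturalness
    "transition": "coqui_xtts",        # scripted section transitions
    "canned": "piper_pregenerated",    # pre-recorded common phrases
}

def pick_tts_tier(utterance_type: str) -> str:
    """Route an utterance to a TTS tier; unknown types fall back
    to the highest-quality (candidate-facing) tier."""
    return TTS_TIERS.get(utterance_type, TTS_TIERS["interviewer"])

print(pick_tts_tier("transition"))    # scripted line, cheap tier
print(pick_tts_tier("deep_followup")) # unknown type, safe fallback
```

Falling back to the expensive tier on unknown types is the safe default: a misrouted phrase costs a fraction of a cent, while a robotic voice mid-answer costs candidate trust.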
LLM Cost Optimization
For speech-to-speech providers (Grok, OpenAI Realtime, Gemini Live, Bedrock Nova), the LLM cost is embedded in the per-minute or per-token rate. For cascaded pipelines, it’s a separate line item.
Context Window Management
The most impactful LLM cost optimization is aggressive context window management. A naive implementation passes the entire conversation history to every LLM call. A 45-minute interview with a candidate who talks a lot can accumulate 20,000-40,000 tokens of conversation history. At GPT-4o pricing, that context adds $0.12-0.24 per call, and you’re making dozens of calls per interview.
The fix: rolling summarization.
```python
# context_manager.py
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


class InterviewContextManager:
    """
    Maintains a sliding context window for LLM calls.
    Older conversation turns are summarized rather than sent verbatim.
    """

    def __init__(self, max_recent_turns: int = 6, max_context_tokens: int = 8000):
        self.max_recent_turns = max_recent_turns
        self.max_context_tokens = max_context_tokens
        self.recent_turns: list[dict] = []
        self.summary: str = ""

    async def add_turn(self, role: str, content: str):
        self.recent_turns.append({"role": role, "content": content})
        # If we exceed max recent turns, summarize the oldest ones
        if len(self.recent_turns) > self.max_recent_turns:
            turns_to_summarize = self.recent_turns[:-self.max_recent_turns]
            self.recent_turns = self.recent_turns[-self.max_recent_turns:]
            await self._update_summary(turns_to_summarize)

    async def _update_summary(self, turns: list[dict]):
        """Summarize old turns into a compact representation."""
        turns_text = "\n".join(
            f"{t['role'].upper()}: {t['content']}" for t in turns
        )
        prompt = f"""Previous context summary: {self.summary}

New turns to incorporate:
{turns_text}

Write a concise summary (max 200 words) preserving:
- Key technical topics discussed
- Candidate's demonstrated knowledge level
- Any commitments made about next topics
- Red flags or strong positives noted"""

        # Use a cheap model for summarization — GPT-4o mini works well
        response = await openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
        )
        self.summary = response.choices[0].message.content

    def get_context_for_llm(self) -> list[dict]:
        """Return the context to pass to the LLM: the rolling summary
        (if any) followed by the recent verbatim turns."""
        messages = []
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"[Previous conversation summary: {self.summary}]"
            })
        messages.extend(self.recent_turns)
        return messages
```
In practice, this reduces the average context size by 60-70% with minimal impact on interview quality. The summarization model call costs roughly $0.001 per invocation on GPT-4o mini — well worth the token savings on the main model.
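The 60-70% figure follows directly from the structure: only the most recent turns travel verbatim, plus a fixed-size summary. A back-of-envelope estimator (the 30% recent-turn share is an assumption for illustration, matching the reduction range above):

```python
def estimate_context_tokens(full_history_tokens: int,
                            recent_percent: int = 30,
                            summary_tokens: int = 300) -> int:
    """Approximate context size after rolling summarization:
    the recent turns go in verbatim, plus a fixed-size summary."""
    return full_history_tokens * recent_percent // 100 + summary_tokens

full = 30_000  # a long interview's raw conversation history
compact = estimate_context_tokens(full)
print(compact, f"({1 - compact / full:.0%} smaller)")  # 9300 (69% smaller)
```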
Model Tiering for Different Question Types
Not every interview question needs GPT-4o. Routine acknowledgments, simple clarifying questions, and structured rubric scoring can run on smaller, cheaper models:
```python
# llm_router.py
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

# Per-task model settings: cheap models for routine turns,
# the full model only where nuanced judgment matters.
MODEL_CONFIG = {
    # Simple acknowledgments and filler
    "acknowledgment": {"model": "gpt-4o-mini", "max_tokens": 50, "temperature": 0.7},
    # Standard interview questions and follow-ups
    "interview_turn": {"model": "gpt-4o-mini", "max_tokens": 200, "temperature": 0.8},
    # Complex technical follow-up requiring nuanced judgment
    "deep_followup": {"model": "gpt-4o", "max_tokens": 300, "temperature": 0.6},
    # Final rubric scoring and evaluation
    "evaluation": {"model": "gpt-4o", "max_tokens": 500, "temperature": 0.3},
}


class InterviewLLMRouter:
    """Route LLM calls to appropriate models based on task type."""

    async def generate_response(
        self,
        task_type: str,
        context: list[dict],
        prompt: str,
    ) -> str:
        # Unknown task types fall back to the standard interview tier
        config = MODEL_CONFIG.get(task_type, MODEL_CONFIG["interview_turn"])
        response = await openai_client.chat.completions.create(
            model=config["model"],
            messages=context + [{"role": "user", "content": prompt}],
            max_tokens=config["max_tokens"],
            temperature=config["temperature"],
        )
        return response.choices[0].message.content
```
On a typical 30-minute interview with the routing above, roughly 60% of calls go to gpt-4o-mini and 40% go to gpt-4o. This reduces LLM cost by approximately 35% compared to using gpt-4o for all calls.
Caching Strategies
Pre-Generated TTS for Scripted Content
Every interview follows a predictable structure: opening greeting, section transitions, standard question setups, and closing. These are scripted and never change. Pre-generating them as audio files eliminates TTS cost for a significant fraction of AI speech:
```python
# tts_cache.py
import hashlib

import boto3

s3 = boto3.client('s3')
AUDIO_CACHE_BUCKET = 'your-audio-cache-bucket'


async def get_or_generate_audio(
    text: str,
    voice_id: str,
    tts_client: "CartesiaTTS",  # thin async wrapper around the TTS API, defined elsewhere
) -> bytes:
    """
    Check S3 cache before calling TTS API.
    For scripted phrases, this will almost always hit the cache.
    """
    # Hash the text + voice ID to create cache key
    cache_key = hashlib.sha256(f"{voice_id}:{text}".encode()).hexdigest()
    s3_key = f"tts-cache/{voice_id}/{cache_key}.opus"

    try:
        # Try cache first
        response = s3.get_object(Bucket=AUDIO_CACHE_BUCKET, Key=s3_key)
        return response['Body'].read()
    except s3.exceptions.NoSuchKey:
        pass

    # Cache miss — generate and store
    audio_bytes = await tts_client.generate(text, voice_id=voice_id)
    s3.put_object(
        Bucket=AUDIO_CACHE_BUCKET,
        Key=s3_key,
        Body=audio_bytes,
        ContentType='audio/ogg',
        # Cache for 30 days — scripted phrases rarely change
        CacheControl='max-age=2592000',
    )
    return audio_bytes


# Pre-warm cache for all scripted phrases at deployment time
SCRIPTED_PHRASES = [
    "Hello! I'm Alex, your AI interviewer today. Before we begin, I want to confirm you've consented to this session being recorded.",
    "Great. Let's start with a brief introduction. Could you tell me a bit about your current role and what brought you to apply for this position?",
    "Thank you. Let's move on to the technical portion of the interview.",
    "Excellent. Now I'd like to discuss a system design scenario.",
    "Let's shift to some behavioral questions.",
    "We're coming up on the end of our time. Do you have any questions for me?",
    "Thank you so much for your time today. We'll be in touch with next steps within the next few business days.",
]
```
For a standard 30-minute interview, pre-generated phrases cover roughly 15-20% of TTS usage by duration (openings, transitions, and closings are verbose). At Cartesia Sonic pricing, this saves on the order of $0.01 per interview — small individually but meaningful at scale.
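Pre-warming is just running each scripted phrase through the same cache-key derivation used at request time; because the key is a pure function of voice and text, deploy-time generation and runtime lookup always agree. A standalone sketch of that derivation (the voice ID is hypothetical):

```python
import hashlib

def tts_cache_key(text: str, voice_id: str) -> str:
    """Deterministic S3 key for a (voice, text) pair: the same
    derivation at pre-warm time and at request time."""
    digest = hashlib.sha256(f"{voice_id}:{text}".encode()).hexdigest()
    return f"tts-cache/{voice_id}/{digest}.opus"

phrases = [
    "Thank you. Let's move on to the technical portion of the interview.",
    "Let's shift to some behavioral questions.",
]
# At deploy time, generate and upload audio for any key not already
# present in the bucket; at runtime, the lookup recomputes the same key.
for phrase in phrases:
    print(tts_cache_key(phrase, voice_id="alex-v1"))
```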
Question Intro Caching
Similarly, the opening line of each interview question is scripted and can be cached. Only the follow-up responses, which depend on candidate answers, must be dynamically generated.
Batch Evaluation on Spot Instances
Post-session evaluation is the most cost-controllable workload because it has no latency requirement. You have hours, not milliseconds. This makes it ideal for spot/preemptible instances.
```python
# batch_evaluator.py — runs as an ECS Spot or Kubernetes spot-pool job
import json
from datetime import datetime

import boto3


class BatchEvaluator:
    """
    Runs evaluation jobs on spot instances during off-peak hours.
    Handles instance interruption gracefully via checkpointing.
    """

    def __init__(self, session_id: str, checkpoint_bucket: str):
        self.session_id = session_id
        self.checkpoint_bucket = checkpoint_bucket
        self.s3 = boto3.client('s3')

    async def evaluate(self, transcript: str, rubric: dict) -> dict:
        # Load checkpoint if we've been interrupted before
        checkpoint = await self.load_checkpoint()
        completed_sections = checkpoint.get('completed_sections', [])
        results = checkpoint.get('results', {})

        for section_name, section_criteria in rubric['sections'].items():
            if section_name in completed_sections:
                continue  # Skip already-evaluated sections

            # extract_section / score_section are the model-calling
            # helpers, implemented elsewhere in this class
            section_transcript = self.extract_section(transcript, section_name)
            section_score = await self.score_section(
                section_transcript,
                section_criteria,
            )
            results[section_name] = section_score
            completed_sections.append(section_name)

            # Checkpoint after each section so a spot interruption
            # loses at most one section of work
            await self.save_checkpoint({
                'completed_sections': completed_sections,
                'results': results,
                'timestamp': datetime.utcnow().isoformat(),
            })

        return results

    async def save_checkpoint(self, data: dict):
        self.s3.put_object(
            Bucket=self.checkpoint_bucket,
            Key=f"eval-checkpoints/{self.session_id}.json",
            Body=json.dumps(data),
        )

    async def load_checkpoint(self) -> dict:
        try:
            response = self.s3.get_object(
                Bucket=self.checkpoint_bucket,
                Key=f"eval-checkpoints/{self.session_id}.json",
            )
            return json.loads(response['Body'].read())
        except self.s3.exceptions.NoSuchKey:
            return {}
```
Running evaluation on EC2 Spot (g5.xlarge, ~$0.36/hr vs $1.01/hr on-demand, using Llama 3.1 70B) versus GPT-4o mini API calls reduces evaluation cost by approximately 70% at the cost of managing spot interruptions. The checkpointing above makes interruptions recoverable.
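The ~70% figure combines the instance discount with the model swap, so it is worth separating the two. The instance discount alone, using the rates quoted above:

```python
def spot_discount(spot_hourly: float, on_demand_hourly: float) -> float:
    """Fractional savings from running the same workload on spot capacity."""
    return 1 - spot_hourly / on_demand_hourly

# g5.xlarge: ~$0.36/hr spot vs $1.01/hr on-demand
print(f"{spot_discount(0.36, 1.01):.0%}")  # roughly two-thirds off compute
```

The remaining savings come from replacing metered API calls with a self-hosted model, and they only materialize if checkpointing keeps interruptions from forcing full re-runs.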
The Build vs. Buy Decision Matrix
Here is the decision framework I use for teams at different scales:
| Monthly Volume | Recommended Stack | Estimated Cost/Interview | Engineering Effort |
|---|---|---|---|
| < 5,000 min | Fully managed (LiveKit Cloud + Grok + ECS) | ~$3.00 | Minimal |
| 5K-20K min | Hybrid: self-hosted LiveKit + Grok + ECS | ~$2.00 | 2-3 weeks |
| 20K-100K min | Hybrid + spot evaluation + context optimization | ~$1.50 | 4-6 weeks |
| 100K-500K min | Self-hosted SFU + Grok + model tiering + caching | ~$1.00 | 2-3 months |
| > 500K min | Full optimization including self-hosted STT/TTS | ~$0.70 | 4-6 months + ongoing |
The “engineering effort” column represents the one-time cost to implement each tier, not ongoing maintenance. The ongoing maintenance cost (roughly 0.25-0.5 engineering weeks per month) is not included but should factor into your ROI calculation.
When Grok Is the Right Full-Stack Answer
For teams between 5K-100K minutes per month who want a simple path to significant cost reduction without complex infrastructure, Grok’s flat $0.05/min rate with OpenAI-compatible API is often the right answer.
The migration from OpenAI Realtime to Grok takes approximately one week (they share the same WebSocket protocol). The savings are immediate. And you skip the complexity of self-hosted SFU or model tiering.
At 50,000 minutes/month:
- OpenAI Realtime: ~$3,000/month in voice AI costs
- Grok flat rate: $2,500/month in voice AI costs
- Delta: $500/month savings from a one-week migration
Combined with self-hosted LiveKit (from Part 10), a mid-scale company spending $8,000/month on fully managed voice AI can reach $3,500/month with three weeks of engineering work and no quality degradation.
The Number That Matters
I started this post at $3.47 per 25-minute interview on a managed stack. Here is where each tier lands:
- Small company, fully managed: $3.47/interview (appropriate — don’t optimize prematurely)
- Medium company, hybrid: $2.13/interview (39% savings, ~3 weeks engineering)
- Large company, optimized: $1.10/interview (68% savings, ~3 months engineering)
- Very large, full self-host: $0.88/interview (75% savings, 4-6 months + maintenance)
The break-even on each optimization tier typically falls between 2-6 months at the volumes that justify it. Start with managed, migrate to hybrid when costs become visible in quarterly reviews, and invest in full optimization only when you can see a clear 12-month ROI.
In Part 12, we close out the series with the architecture that makes all of this cost optimization possible: the multi-provider adapter pattern. Supporting OpenAI Realtime, Bedrock Nova Sonic, Grok, and Gemini Live behind a single clean interface means you can switch providers for cost or reliability reasons without rewriting your interview logic. It also gives you the circuit breaker and failover patterns that keep your system running when any single provider has an outage.
This is Part 11 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
- Cost Optimization — From $0.14/min to $0.03/min (this post)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)