In Part 10, we built the infrastructure to handle hiring season surges — LiveKit SFU mesh, stateless agent workers, Kubernetes auto-scaling, and multi-region failover. We also previewed the infrastructure cost table at the end, which probably made a few people reach for antacids.

Now let’s talk about the full cost picture.

Real-time voice AI is not cheap. A poorly optimized stack running at scale can cost $0.14 per minute per session or more. A 30-minute interview becomes $4.20. Multiply that across 10,000 interviews per month and you’re spending $42,000 monthly just on AI infrastructure. That’s a budget item that gets noticed in board meetings.

The good news: with the right architecture decisions, you can get that same 30-minute interview to under $1.00 without meaningfully degrading candidate experience. This post shows you exactly how.

The Per-Minute Cost Anatomy

Let’s start with the baseline: what does one minute of voice AI interview actually cost, at the component level? I’ll use a typical production setup with managed services as the reference point.

Managed Stack Cost Breakdown (per minute of active session)

Component                    Provider              Cost/min    % of total
─────────────────────────────────────────────────────────────────────────
Voice AI (STT+LLM+TTS)       OpenAI Realtime       $0.06/min      43%
Media transport              LiveKit Cloud         $0.04/min      29%
Recording storage            S3 + CloudFront       $0.004/min      3%
Agent worker compute         ECS Fargate           $0.012/min      9%
Redis (session state)        ElastiCache           $0.003/min      2%
Database queries             RDS PostgreSQL        $0.002/min      2%
Async evaluation             Lambda + GPT-4o mini  $0.018/min     13%
─────────────────────────────────────────────────────────────────────────
TOTAL (fully managed)                              ~$0.139/min   100%

A 25-minute interview on this stack costs approximately $3.47. That’s before any HR platform licensing, recruiter time, or infrastructure amortization.

The three biggest cost drivers are immediately obvious: the voice AI provider, the media transport layer, and the async evaluation. These three components account for 85% of per-minute cost. Everything else is rounding error.
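
These figures are easy to sanity-check. A minimal sketch, with the per-minute rates copied from the table above:

```python
# Per-minute component rates from the managed-stack table above.
MANAGED_RATES = {
    "voice_ai": 0.060,
    "media_transport": 0.040,
    "recording_storage": 0.004,
    "agent_compute": 0.012,
    "redis": 0.003,
    "database": 0.002,
    "async_evaluation": 0.018,
}

def cost_per_interview(rates: dict[str, float], minutes: float) -> float:
    """Total cost of one interview of the given length."""
    return sum(rates.values()) * minutes

def top_drivers(rates: dict[str, float], n: int = 3) -> list[str]:
    """The n components contributing the most cost per minute."""
    return sorted(rates, key=rates.get, reverse=True)[:n]

print(cost_per_interview(MANAGED_RATES, 25))  # ≈ $3.47 for a 25-min interview
print(top_drivers(MANAGED_RATES))             # the three lines worth optimizing
```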

What “Managed” Means

“Managed” in this context means:

  • LiveKit Cloud for media transport (no servers to run)
  • OpenAI Realtime API for the voice AI provider
  • ECS Fargate for agent workers (AWS manages the container infrastructure)
  • All post-session processing on Lambda and managed LLM APIs

You’re paying a significant markup for the convenience of not managing infrastructure. This is the right tradeoff at low volume. It becomes the wrong tradeoff at scale.

The Three Cost Tipping Points

Not every team should fully optimize. The cost profile follows a classic step function, with three tipping points where the effort-to-savings ratio changes dramatically.

Tipping Point 1: Below 10,000 Minutes/Month — Stay Managed

Below 10,000 interview minutes per month (roughly 330 30-minute interviews — a small company running 15-20 interviews per week sits well under this), you do not have a cost problem. You have a reliability problem. Focus on uptime, quality, and iteration speed.

At this volume:

  • Monthly managed-stack spend: ~$1,400 (10,000 min × $0.139)
  • Cost to engineer self-hosted alternatives: $15,000-30,000 in engineering time
  • Break-even: 12-24 months minimum

The math does not work. Stay on fully managed services. Revisit when volume grows.
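
The break-even arithmetic above generalizes into a small helper worth keeping around. The per-minute rates are this post’s baseline and fully-optimized figures; the engineering cost is an assumption for illustration:

```python
def break_even_months(
    monthly_minutes: int,
    managed_rate: float,      # $/min on the managed stack
    optimized_rate: float,    # $/min after optimization
    engineering_cost: float,  # one-time migration cost in dollars
) -> float:
    """Months until optimization savings repay the engineering investment."""
    monthly_savings = monthly_minutes * (managed_rate - optimized_rate)
    if monthly_savings <= 0:
        return float("inf")  # no savings: never breaks even
    return engineering_cost / monthly_savings

# At 10,000 min/month, even the full 68% reduction repays slowly:
print(break_even_months(10_000, 0.139, 0.044, 20_000))  # ≈ 21 months
```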

Tipping Point 2: 10,000–50,000 Minutes/Month — Hybrid Optimization (~40% savings available)

This is where optimization starts paying for itself within 2-3 months. The target: replace the two highest-cost managed components (voice AI provider and media transport) with lower-cost alternatives while keeping everything else managed.

Key moves at this tier:

  1. Switch from OpenAI Realtime to Grok ($0.06/min → $0.05/min flat, simpler billing)
  2. Move from LiveKit Cloud to self-hosted LiveKit ($0.04/min → ~$0.008/min on reserved EC2)
  3. Replace GPT-4o mini evaluation with a smaller self-hosted model ($0.018/min → $0.006/min on batch GPU)

New cost breakdown after hybrid optimization:

Component                    Provider              Cost/min    Savings
─────────────────────────────────────────────────────────────────────────
Voice AI                     Grok Voice Agent      $0.050/min   -17%
Media transport              Self-hosted LiveKit   $0.008/min   -80%
Recording storage            S3 + CloudFront       $0.004/min    0%
Agent worker compute         ECS Fargate           $0.012/min    0%
Redis (session state)        ElastiCache           $0.003/min    0%
Database queries             RDS PostgreSQL        $0.002/min    0%
Async evaluation             Llama 3.1 on spot     $0.006/min   -67%
─────────────────────────────────────────────────────────────────────────
TOTAL (hybrid optimized)                           ~$0.085/min   -39%

A 25-minute interview now costs $2.13, down from $3.47. A 39% reduction from two infrastructure changes. At 30,000 minutes/month, this saves ~$1,620/month, paying back the engineering investment in about 6 weeks.

Tipping Point 3: Above 50,000 Minutes/Month — Full Optimization (~70% total savings possible)

Above 50,000 minutes per month, you’re spending enough on voice AI that the economics of full optimization become compelling. This requires more engineering investment but reduces per-minute cost to about $0.04, with a path toward $0.03 once caching and pre-generation are in place.

Full optimization adds:

  1. Self-hosted STT (Whisper.cpp or Deepgram self-hosted on GPU): replaces $0.01-0.02/min of the provider cost
  2. Self-hosted TTS for low-stakes interactions (Coqui/Piper for filler and transitions): reduces TTS portion to near zero
  3. Context window optimization (aggressive conversation summarization): reduces LLM token cost by 40-60%
  4. Spot instances for all non-real-time processing: 70% discount on async evaluation compute
  5. TTS pre-generation for scripted content: zero incremental cost for opening scripts

Component                    Implementation         Cost/min    vs Baseline
─────────────────────────────────────────────────────────────────────────
Voice AI (core)              Grok / Gemini Live     $0.020/min   -67%
  (with context optimization)
Media transport              Self-hosted LiveKit    $0.008/min   -80%
Recording storage            S3 Intelligent-Tier    $0.003/min   -25%
Agent workers                EKS Spot + On-demand   $0.006/min   -50%
Redis                        Self-managed Redis     $0.002/min   -33%
Database                     Aurora Serverless v2   $0.001/min   -50%
Async evaluation             Llama on spot batch    $0.004/min   -78%
─────────────────────────────────────────────────────────────────────────
TOTAL (fully optimized)                            ~$0.044/min   -68%

A 25-minute interview now costs $1.10. With caching and pre-generation of scripted content, you can push below $0.90 for standard interview formats.

Provider Cost Comparison

The voice AI provider is the single largest cost lever. Here is the full comparison as of Q1 2026:

Provider            Model                 Billing model              Cost/min    Latency (TTFA)  Notes
──────────────────────────────────────────────────────────────────────────────────────────────────────
OpenAI Realtime     GPT-4o Realtime       Per token (audio in+out)   $0.04-0.08  300-600ms       Native function calling, 60-min sessions
Grok Voice Agent    Grok-2 Voice          Flat per minute            $0.05       <1s             OpenAI-compatible, no token math needed
Gemini Live         Gemini 2.0 Flash      Per token (audio+video)    $0.03-0.06  320-800ms       Multimodal, best for video interviews
Bedrock Nova Sonic  Nova Sonic            Per token (audio)          $0.04-0.07  <700ms          AWS-native, 100+ languages, compliance
Build your own      Whisper+Llama+Coqui   Infrastructure             $0.01-0.02  200-800ms       High effort, max control

Why Grok Wins on Cost Clarity

The flat per-minute billing model of Grok deserves special attention. Every other provider charges per token of audio input and output, which means your monthly bill depends on:

  • How much candidates talk (variable)
  • How much the AI talks (variable, depends on verbosity settings)
  • Audio sampling rate and encoding
  • Whether silence is billed (it is, for most providers)

With Grok’s $0.05/min flat rate, a 30-minute interview costs exactly $1.50 in voice AI costs. No surprises. This predictability is genuinely valuable for financial planning and is why Grok often wins cost comparisons even when its nominal rate is not the absolute lowest.
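
The difference is easy to see in a side-by-side model. The token rates below are illustrative, not any provider’s published pricing; only the $0.05/min flat rate comes from the comparison above:

```python
def token_billed_cost(
    minutes: float,
    audio_tokens_per_min: float,  # varies with talk time and encoding
    usd_per_1k_tokens: float,     # illustrative rate, not real pricing
) -> float:
    """Per-token billing: cost depends on how much audio flows each way."""
    return minutes * audio_tokens_per_min * usd_per_1k_tokens / 1000

def flat_billed_cost(minutes: float, usd_per_min: float = 0.05) -> float:
    """Flat per-minute billing: cost depends only on session length."""
    return minutes * usd_per_min

# A 30-minute interview at the flat rate is always the same number:
print(flat_billed_cost(30))  # $1.50, every time

# Token billing for the same session swings with how chatty it is:
quiet = token_billed_cost(30, audio_tokens_per_min=500, usd_per_1k_tokens=0.10)
chatty = token_billed_cost(30, audio_tokens_per_min=1500, usd_per_1k_tokens=0.10)
print(quiet, chatty)  # ≈ $1.50 vs ≈ $4.50 for identical session length
```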

The Self-Build Economics

Building your own speech-to-speech pipeline (Whisper.cpp for STT, Llama 3.3 70B for conversation, Coqui XTTS v2 or Piper for TTS) can reach $0.01-0.02/min at scale, but the true cost includes:

  • Engineering time: 3-6 months to build a production-quality pipeline
  • Model serving infrastructure: GPU instances for Whisper and TTS ($1,000-5,000/mo baseline)
  • Quality gaps: Open-source TTS still lags commercial providers for naturalness
  • Maintenance burden: Model updates, infrastructure management, incident response

For most teams, self-building the full voice pipeline is a false economy below 500,000 minutes per month. The engineering cost, quality gap, and maintenance burden outweigh the infrastructure savings.

STT Cost Optimization

If you’re on a cascaded pipeline (separate STT → LLM → TTS), STT is a significant cost component. Here’s the breakdown:

STT Provider               Cost               Quality    Latency    Best for
────────────────────────────────────────────────────────────────────────────────────────────
Deepgram Nova 3            $0.0043/min        Excellent  150-300ms  Production cascaded pipeline
Whisper API (OpenAI)       $0.006/min         Excellent  400-800ms  High accuracy needed, latency-tolerant
Google STT v2              $0.016/min         Good       200-400ms  GCP-native stacks
AssemblyAI Nano            $0.003/min         Good       200-500ms  Cost-sensitive
Whisper.cpp (self-hosted)  ~$0.001/min (GPU)  Excellent  100-300ms  High volume, GPU available

Deepgram Nova 3 is the production choice for managed STT: best accuracy-to-cost ratio, 150ms latency that fits comfortably in the voice budget, and a WebSocket streaming API that integrates cleanly with LiveKit.

For self-hosted at scale: Whisper.cpp running on a GPU instance (g5.xlarge at ~$1.00/hr handles approximately 100 concurrent streams) brings STT cost to under $0.001/min. At 100,000 minutes/month, that’s $200 vs $430 for Deepgram — a $2,760/year difference. Meaningful, but not transformative unless you’re at very high volume.
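
The gap between the theoretical floor and the $200/month figure is utilization: a GPU instance bills whether or not its streams are busy. A rough model, assuming the g5.xlarge numbers above:

```python
def self_hosted_stt_cost_per_min(
    instance_usd_per_hour: float,
    max_concurrent_streams: int,
    utilization: float = 1.0,  # fraction of stream capacity actually in use
) -> float:
    """Effective $/min of transcription on a shared GPU instance."""
    busy_streams = max_concurrent_streams * utilization
    return instance_usd_per_hour / (busy_streams * 60)

# Fully packed: 100 Whisper.cpp streams on a ~$1.00/hr g5.xlarge.
print(self_hosted_stt_cost_per_min(1.00, 100))       # ≈ $0.00017/min
# At 10% average utilization (instances kept warm for bursts):
print(self_hosted_stt_cost_per_min(1.00, 100, 0.1))  # ≈ $0.0017/min
```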

TTS Cost Optimization

TTS is where the quality-cost tradeoff is sharpest. Commercial TTS providers are noticeably better than open-source alternatives for interview contexts, but the gap is closing.

TTS Provider                 Cost                Voice quality  Latency    Best for
──────────────────────────────────────────────────────────────────────────────────────────────────
ElevenLabs Turbo v2          $0.006-0.012/min    Best-in-class  200-400ms  High-stakes interviews, executive roles
Cartesia Sonic               $0.005/min          Excellent      90-200ms   Production default, great latency
OpenAI TTS                   $0.015-0.030/min    Very good      300-600ms  OpenAI ecosystem
Google TTS                   $0.004-0.008/min    Good           200-500ms  GCP stacks
Coqui XTTS v2 (self-hosted)  ~$0.001/min (GPU)   Good           200-500ms  Mid-volume, GPU available
Piper TTS (self-hosted)      ~$0.0001/min (CPU)  Acceptable     50-150ms   Low-stakes interactions only

The practical tiered strategy:

  • Interviewer persona: Cartesia Sonic or ElevenLabs for maximum naturalness — this is what the candidate hears most
  • System transitions (“Let’s move on to the next section”): Coqui XTTS v2 — quality is sufficient for scripted transitions
  • Pre-recorded common phrases: Piper TTS, pre-generated — zero incremental cost

The transition phrases optimization alone can reduce TTS cost by 20-30% because section transitions, acknowledgments, and filler phrases make up a large fraction of AI speech volume.
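
A minimal router for this tiering might look like the following sketch. The tier assignments are the strategy above; the actual TTS client wiring is left out:

```python
from enum import Enum

class SpeechType(Enum):
    INTERVIEWER = "interviewer"  # dynamic, candidate-facing speech
    TRANSITION = "transition"    # scripted section transitions
    CANNED = "canned"            # fixed phrases, pre-generated offline

# Cheapest engine whose quality suffices for each speech type.
TTS_TIER = {
    SpeechType.INTERVIEWER: "cartesia-sonic",  # premium, most-heard voice
    SpeechType.TRANSITION: "coqui-xtts-v2",    # self-hosted, good enough
    SpeechType.CANNED: "piper-pregenerated",   # near-zero incremental cost
}

def pick_engine(speech_type: SpeechType) -> str:
    """Route an utterance to the engine for its tier."""
    return TTS_TIER[speech_type]

print(pick_engine(SpeechType.TRANSITION))  # coqui-xtts-v2
```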

LLM Cost Optimization

For speech-to-speech providers (Grok, OpenAI Realtime, Gemini Live, Bedrock Nova), the LLM cost is embedded in the per-minute or per-token rate. For cascaded pipelines, it’s a separate line item.

Context Window Management

The most impactful LLM cost optimization is aggressive context window management. A naive implementation passes the entire conversation history to every LLM call. A 45-minute interview with a candidate who talks a lot can accumulate 20,000-40,000 tokens of conversation history. At GPT-4o pricing, that context adds $0.12-0.24 per call, and you’re making dozens of calls per interview.

The fix: rolling summarization.

# context_manager.py
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


class InterviewContextManager:
    """
    Maintains a sliding context window for LLM calls.
    Older conversation turns are summarized rather than sent verbatim.
    """

    def __init__(self, max_recent_turns: int = 6, max_context_tokens: int = 8000):
        self.max_recent_turns = max_recent_turns
        self.max_context_tokens = max_context_tokens
        self.recent_turns: list[dict] = []
        self.summary: str = ""

    async def add_turn(self, role: str, content: str):
        self.recent_turns.append({"role": role, "content": content})

        # If we exceed max recent turns, summarize the oldest ones
        if len(self.recent_turns) > self.max_recent_turns:
            turns_to_summarize = self.recent_turns[:-self.max_recent_turns]
            self.recent_turns = self.recent_turns[-self.max_recent_turns:]
            await self._update_summary(turns_to_summarize)

    async def _update_summary(self, turns: list[dict]):
        """Summarize old turns into a compact representation."""
        turns_text = "\n".join(
            f"{t['role'].upper()}: {t['content']}" for t in turns
        )

        prompt = f"""Previous context summary: {self.summary}

New turns to incorporate:
{turns_text}

Write a concise summary (max 200 words) preserving:
- Key technical topics discussed
- Candidate's demonstrated knowledge level
- Any commitments made about next topics
- Red flags or strong positives noted"""

        # Use a cheap model for summarization — GPT-4o mini works well
        response = await openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300
        )
        self.summary = response.choices[0].message.content

    def get_context_for_llm(self) -> list[dict]:
        """Return the context to pass to the LLM."""
        messages = []

        if self.summary:
            messages.append({
                "role": "system",
                "content": f"[Previous conversation summary: {self.summary}]"
            })

        messages.extend(self.recent_turns)
        return messages

In practice, this reduces the average context size by 60-70% with minimal impact on interview quality. The summarization model call costs roughly $0.001 per invocation on GPT-4o mini — well worth the token savings on the main model.
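
To see where that reduction comes from, here is a rough token model of a single late-interview LLM call (turn counts and per-turn token sizes are illustrative):

```python
def context_tokens(
    turns: int,
    tokens_per_turn: int,
    max_recent: int = 6,        # matches max_recent_turns above
    summary_tokens: int = 260,  # a 200-word summary plus framing
) -> int:
    """Tokens sent per LLM call under the rolling-summary scheme."""
    recent = min(turns, max_recent)
    summary = summary_tokens if turns > max_recent else 0
    return recent * tokens_per_turn + summary

# Late in a long interview: 60 turns at ~400 tokens each.
naive = 60 * 400                   # full history: 24,000 tokens per call
managed = context_tokens(60, 400)  # 6 recent turns + summary: 2,660 tokens
print(naive, managed)
```

Early calls have little history to trim, so the average saving across a whole interview lands in the 60-70% range rather than the ~90% this single call shows.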

Model Tiering for Different Question Types

Not every interview question needs GPT-4o. Routine acknowledgments, simple clarifying questions, and structured rubric scoring can run on smaller, cheaper models:

# llm_router.py
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()


class InterviewLLMRouter:
    """Route LLM calls to appropriate models based on task type."""

    async def generate_response(
        self,
        task_type: str,
        context: list[dict],
        prompt: str
    ) -> str:

        model_config = {
            # Simple acknowledgments and filler
            "acknowledgment": {
                "model": "gpt-4o-mini",
                "max_tokens": 50,
                "temperature": 0.7
            },
            # Standard interview questions and follow-ups
            "interview_turn": {
                "model": "gpt-4o-mini",
                "max_tokens": 200,
                "temperature": 0.8
            },
            # Complex technical follow-up requiring nuanced judgment
            "deep_followup": {
                "model": "gpt-4o",
                "max_tokens": 300,
                "temperature": 0.6
            },
            # Final rubric scoring and evaluation
            "evaluation": {
                "model": "gpt-4o",
                "max_tokens": 500,
                "temperature": 0.3
            }
        }

        config = model_config.get(task_type, model_config["interview_turn"])

        response = await openai_client.chat.completions.create(
            model=config["model"],
            messages=context + [{"role": "user", "content": prompt}],
            max_tokens=config["max_tokens"],
            temperature=config["temperature"]
        )

        return response.choices[0].message.content

On a typical 30-minute interview with the routing above, roughly 60% of calls go to gpt-4o-mini and 40% go to gpt-4o. This reduces LLM cost by approximately 35% compared to using gpt-4o for all calls.

Caching Strategies

Pre-Generated TTS for Scripted Content

Every interview follows a predictable structure: opening greeting, section transitions, standard question setups, and closing. These are scripted and never change. Pre-generating them as audio files eliminates TTS cost for a significant fraction of AI speech:

# tts_cache.py
import hashlib

import boto3

s3 = boto3.client('s3')
AUDIO_CACHE_BUCKET = 'your-audio-cache-bucket'

async def get_or_generate_audio(
    text: str,
    voice_id: str,
    tts_client: "CartesiaTTS"  # your Cartesia client wrapper
) -> bytes:
    """
    Check S3 cache before calling TTS API.
    For scripted phrases, this will almost always hit the cache.
    """
    # Hash the text + voice ID to create cache key
    cache_key = hashlib.sha256(f"{voice_id}:{text}".encode()).hexdigest()
    s3_key = f"tts-cache/{voice_id}/{cache_key}.opus"

    try:
        # Try cache first
        response = s3.get_object(Bucket=AUDIO_CACHE_BUCKET, Key=s3_key)
        return response['Body'].read()
    except s3.exceptions.NoSuchKey:
        pass

    # Cache miss — generate and store
    audio_bytes = await tts_client.generate(text, voice_id=voice_id)

    s3.put_object(
        Bucket=AUDIO_CACHE_BUCKET,
        Key=s3_key,
        Body=audio_bytes,
        ContentType='audio/ogg',
        # Cache for 30 days — scripted phrases rarely change
        CacheControl='max-age=2592000'
    )

    return audio_bytes

# Pre-warm cache for all scripted phrases at deployment time
SCRIPTED_PHRASES = [
    "Hello! I'm Alex, your AI interviewer today. Before we begin, I want to confirm you've consented to this session being recorded.",
    "Great. Let's start with a brief introduction. Could you tell me a bit about your current role and what brought you to apply for this position?",
    "Thank you. Let's move on to the technical portion of the interview.",
    "Excellent. Now I'd like to discuss a system design scenario.",
    "Let's shift to some behavioral questions.",
    "We're coming up on the end of our time. Do you have any questions for me?",
    "Thank you so much for your time today. We'll be in touch with next steps within the next few business days.",
]

For a standard 30-minute interview, pre-generated phrases cover roughly 15-20% of TTS usage by duration (openings, transitions, and closings are verbose). At Cartesia Sonic pricing, this saves roughly $0.01 per interview — small individually but meaningful at scale.

Question Intro Caching

Similarly, the opening line of each interview question is scripted and can be cached. Only the follow-up responses, which depend on candidate answers, must be dynamically generated.

Batch Evaluation on Spot Instances

Post-session evaluation is the most cost-controllable workload because it has no latency requirement. You have hours, not milliseconds. This makes it ideal for spot/preemptible instances.

# batch_evaluator.py — runs as an ECS Spot or Kubernetes spot-pool job
import json
from datetime import datetime

import boto3

class BatchEvaluator:
    """
    Runs evaluation jobs on spot instances during off-peak hours.
    Handles instance interruption gracefully via checkpointing.
    """

    def __init__(self, session_id: str, checkpoint_bucket: str):
        self.session_id = session_id
        self.checkpoint_bucket = checkpoint_bucket
        self.s3 = boto3.client('s3')

    async def evaluate(self, transcript: str, rubric: dict) -> dict:
        # Load checkpoint if we've been interrupted before
        checkpoint = await self.load_checkpoint()
        completed_sections = checkpoint.get('completed_sections', [])
        results = checkpoint.get('results', {})

        for section_name, section_criteria in rubric['sections'].items():
            if section_name in completed_sections:
                continue  # Skip already-evaluated sections

            # extract_section and score_section are the evaluator's own
            # helpers: transcript slicing and the LLM scoring call
            section_transcript = self.extract_section(transcript, section_name)
            section_score = await self.score_section(
                section_transcript,
                section_criteria
            )

            results[section_name] = section_score
            completed_sections.append(section_name)

            # Checkpoint after each section
            await self.save_checkpoint({
                'completed_sections': completed_sections,
                'results': results,
                'timestamp': datetime.utcnow().isoformat()
            })

        return results

    async def save_checkpoint(self, data: dict):
        self.s3.put_object(
            Bucket=self.checkpoint_bucket,
            Key=f"eval-checkpoints/{self.session_id}.json",
            Body=json.dumps(data)
        )

    async def load_checkpoint(self) -> dict:
        try:
            response = self.s3.get_object(
                Bucket=self.checkpoint_bucket,
                Key=f"eval-checkpoints/{self.session_id}.json"
            )
            return json.loads(response['Body'].read())
        except self.s3.exceptions.NoSuchKey:
            return {}
Running evaluation on EC2 Spot (g5.xlarge, ~$0.36/hr vs $1.01/hr on-demand, serving a quantized Llama 3.1 model) versus GPT-4o mini API calls reduces evaluation cost by approximately 70%, at the cost of managing spot interruptions. The checkpointing above makes interruptions recoverable.

The Build vs. Buy Decision Matrix

Here is the decision framework I use for teams at different scales:

Monthly Volume   Recommended Stack                                 Estimated Cost/Interview  Engineering Effort
────────────────────────────────────────────────────────────────────────────────────────────────────────────────
< 5,000 min      Fully managed (LiveKit Cloud + Grok + ECS)        ~$3.00                    Minimal
5K-20K min       Hybrid: self-hosted LiveKit + Grok + ECS          ~$2.00                    2-3 weeks
20K-100K min     Hybrid + spot evaluation + context optimization   ~$1.50                    4-6 weeks
100K-500K min    Self-hosted SFU + Grok + model tiering + caching  ~$1.00                    2-3 months
> 500K min       Full optimization incl. self-hosted STT/TTS       ~$0.70                    4-6 months + ongoing

The “engineering effort” column represents the one-time cost to implement each tier, not ongoing maintenance. The ongoing maintenance cost (roughly 0.25-0.5 engineering weeks per month) is not included but should factor into your ROI calculation.
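
Folding that maintenance line into a 12-month view keeps the decision honest. The build and maintenance figures below are assumptions for illustration; the per-minute rates are this post’s baseline and hybrid tiers:

```python
def twelve_month_roi(
    monthly_minutes: int,
    baseline_rate: float,          # $/min before optimization
    optimized_rate: float,         # $/min after optimization
    build_cost: float,             # one-time engineering cost, $
    maintenance_per_month: float,  # ongoing upkeep, $ per month
) -> float:
    """Net dollars saved (negative = lost) over the first 12 months."""
    gross_savings = 12 * monthly_minutes * (baseline_rate - optimized_rate)
    return gross_savings - build_cost - 12 * maintenance_per_month

# 30,000 min/month moving to the hybrid tier ($0.139 → $0.085/min),
# assuming a $5,000 build and $500/month of maintenance:
print(twelve_month_roi(30_000, 0.139, 0.085, 5_000, 500))  # positive → worth it
```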

When Grok Is the Right Full-Stack Answer

For teams between 5K and 100K minutes per month who want a simple path to significant cost reduction without complex infrastructure, Grok’s flat $0.05/min rate with OpenAI-compatible API is often the right answer.

The migration from OpenAI Realtime to Grok takes approximately one week (they share the same WebSocket protocol). The savings are immediate. And you skip the complexity of self-hosted SFU or model tiering.

At 50,000 minutes/month:

  • OpenAI Realtime: ~$3,000/month in voice AI costs
  • Grok flat rate: $2,500/month in voice AI costs
  • Delta: $500/month savings from a one-week migration

Combined with self-hosted LiveKit (from Part 10), a mid-scale company spending $8,000/month on fully managed voice AI can reach $3,500/month with three weeks of engineering work and no quality degradation.

The Number That Matters

I started this post with $3.47 per interview on a managed stack. After full optimization:

  • Small company, fully managed: $3.47/interview (appropriate — don’t optimize prematurely)
  • Medium company, hybrid: $2.13/interview (39% savings, ~3 weeks engineering)
  • Large company, optimized: $1.10/interview (68% savings, ~3 months engineering)
  • Very large, full self-host: $0.88/interview (75% savings, 4-6 months + maintenance)

The break-even on each optimization tier typically falls between 2-6 months at the volumes that justify it. Start with managed, migrate to hybrid when costs become visible in quarterly reviews, and invest in full optimization only when you can see a clear 12-month ROI.


In Part 12, we close out the series with the architecture that makes all of this cost optimization possible: the multi-provider adapter pattern. Supporting OpenAI Realtime, Bedrock Nova Sonic, Grok, and Gemini Live behind a single clean interface means you can switch providers for cost or reliability reasons without rewriting your interview logic. It also gives you the circuit breaker and failover patterns that keep your system running when any single provider has an outage.


This is Part 11 of a 12-part series: The Voice AI Interview Playbook.

Series outline:

  1. Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
  2. Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
  3. LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
  4. STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
  5. Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
  6. Knowledge Base and RAG — Making your voice agent an expert (Part 6)
  7. Web and Mobile Clients — Cross-platform voice experiences (Part 7)
  8. Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
  9. Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
  10. Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
  11. Cost Optimization — From $0.14/min to $0.03/min (this post)
  12. Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)