In Part 10, we built the infrastructure to handle hiring season surges — LiveKit SFU mesh, stateless agent workers, Kubernetes auto-scaling, and multi-region failover. We also previewed the infrastructure cost table at the end, which probably made a few people reach for antacids.
Now let’s talk about the full cost picture.
Real-time voice AI is not cheap. A poorly-optimized stack running at scale can cost $0.14 per minute per session or more. A 30-minute interview becomes $4.20. Multiply that across 10,000 interviews per month and you’re spending $42,000 monthly just on AI infrastructure. That’s a budget item that gets noticed in board meetings.
The good news: with the right architecture decisions, you can get that same 30-minute interview to under $1.00 without meaningfully degrading candidate experience. This post shows you exactly how.
The Per-Minute Cost Anatomy
Let’s start with the baseline: what does one minute of voice AI interview actually cost, at the component level? I’ll use a typical production setup with managed services as the reference point.
Managed Stack Cost Breakdown (per minute of active session)
| Component | Provider | Cost/min | % of total |
|---|---|---|---|
| Voice AI (STT+LLM+TTS) | OpenAI Realtime | $0.06/min | 43% |
| Media transport | LiveKit Cloud | $0.04/min | 29% |
| Recording storage | S3 + CloudFront | $0.004/min | 3% |
| Agent worker compute | ECS Fargate | $0.012/min | 9% |
| Redis (session state) | ElastiCache | $0.003/min | 2% |
| Database queries | RDS PostgreSQL | $0.002/min | 2% |
| Async evaluation | Lambda + GPT-4o mini | $0.018/min | 13% |
| TOTAL (fully managed) | | ~$0.139/min | 100% |
A 25-minute interview on this stack costs approximately $3.47. That’s before any HR platform licensing, recruiter time, or infrastructure amortization.
The three biggest cost drivers are immediately obvious: the voice AI provider, the media transport layer, and the async evaluation. These three components account for 85% of per-minute cost. Everything else is rounding error.
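The table above reduces to a few lines of arithmetic, which is worth having in code when you model scenarios later in this post. The sketch below hard-codes the per-minute rates from the table; the dictionary keys are illustrative names, not any billing API:

```python
# Per-minute component rates from the managed-stack table above (USD/min).
MANAGED_RATES = {
    "voice_ai": 0.060,          # OpenAI Realtime
    "media_transport": 0.040,   # LiveKit Cloud
    "recording": 0.004,         # S3 + CloudFront
    "agent_compute": 0.012,     # ECS Fargate
    "redis": 0.003,             # ElastiCache
    "database": 0.002,          # RDS PostgreSQL
    "evaluation": 0.018,        # Lambda + GPT-4o mini
}

def cost_per_interview(rates: dict[str, float], minutes: float) -> float:
    """Total session cost for an interview of the given length."""
    return sum(rates.values()) * minutes

def top_drivers(rates: dict[str, float], n: int = 3) -> list[str]:
    """The n most expensive components, largest first."""
    return sorted(rates, key=rates.get, reverse=True)[:n]

print(cost_per_interview(MANAGED_RATES, 25))  # close to the $3.47 quoted above
print(top_drivers(MANAGED_RATES))             # voice AI, transport, evaluation
```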
What “Managed” Means
“Managed” in this context means:
- LiveKit Cloud for media transport (no servers to run)
- OpenAI Realtime API for the voice AI provider
- ECS Fargate for agent workers (AWS manages the container infrastructure)
- All post-session processing on Lambda and managed LLM APIs
You’re paying a significant markup for the convenience of not managing infrastructure. This is the right tradeoff at low volume. It becomes the wrong tradeoff at scale.
The Three Cost Tipping Points
Not every team should fully optimize. The cost profile follows a classic step-function, with three tipping points where the effort-to-savings ratio changes dramatically.
Tipping Point 1: Below 10,000 Minutes/Month — Stay Managed
Below 10,000 interview minutes per month (roughly 330 thirty-minute interviews — far more than a small company doing 15-20 interviews per week), you do not have a cost problem. You have a reliability problem. Focus on uptime, quality, and iteration speed.
At this volume:
- Monthly managed-stack spend: ~$1,400 (10,000 min × $0.139/min)
- Cost to engineer self-hosted alternatives: $15,000-30,000 in engineering time
- Break-even: 12-24 months minimum
The math does not work. Stay on fully managed services. Revisit when volume grows.
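The break-even logic is worth making explicit, since it recurs at every tier. A small helper, using illustrative numbers from this section (the $15k engineering figure is the low end of the range above; your loaded rate will differ):

```python
def breakeven_months(engineering_cost: float,
                     monthly_minutes: float,
                     managed_rate: float,
                     optimized_rate: float) -> float:
    """Months until a one-time engineering investment is repaid
    by the per-minute savings it unlocks."""
    monthly_savings = monthly_minutes * (managed_rate - optimized_rate)
    return engineering_cost / monthly_savings

# At 10,000 min/month, even a cheap $15k effort takes over two years:
print(breakeven_months(15_000, 10_000, 0.139, 0.085))  # ≈ 28 months
# Triple the volume and the same effort pays back in under a year:
print(breakeven_months(15_000, 30_000, 0.139, 0.085))  # ≈ 9 months
```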
Tipping Point 2: 10,000–50,000 Minutes/Month — Hybrid Optimization (~40% savings available)
This is where optimization starts paying for itself within 2-3 months. The target: replace the two highest-cost managed components (voice AI provider and media transport) with lower-cost alternatives while keeping everything else managed.
Key moves at this tier:
- Switch from OpenAI Realtime to Grok ($0.06/min → $0.05/min flat, simpler billing)
- Move from LiveKit Cloud to self-hosted LiveKit ($0.04/min → ~$0.008/min on reserved EC2)
- Replace GPT-4o mini evaluation with a smaller self-hosted model ($0.018/min → $0.006/min on batch GPU)
New cost breakdown after hybrid optimization:
| Component | Provider | Cost/min | Savings |
|---|---|---|---|
| Voice AI | Grok Voice Agent | $0.050/min | -17% |
| Media transport | Self-hosted LiveKit | $0.008/min | -80% |
| Recording storage | S3 + CloudFront | $0.004/min | 0% |
| Agent worker compute | ECS Fargate | $0.012/min | 0% |
| Redis (session state) | ElastiCache | $0.003/min | 0% |
| Database queries | RDS PostgreSQL | $0.002/min | 0% |
| Async evaluation | Llama 3.1 on spot | $0.006/min | -67% |
| TOTAL (hybrid optimized) | | ~$0.085/min | -39% |
A 25-minute interview now costs $2.13, down from $3.47. A 39% reduction from two infrastructure changes. At 30,000 minutes/month, this saves ~$1,620/month, paying back the engineering investment within a few months.
Tipping Point 3: Above 50,000 Minutes/Month — Full Optimization (68% total savings possible)
Above 50,000 minutes per month, you’re spending enough on voice AI that the economics of full optimization become compelling. This requires more engineering investment but reduces per-minute cost to roughly $0.04, approaching $0.03 with caching and pre-generation.
Full optimization adds:
- Self-hosted STT (Whisper.cpp or Deepgram self-hosted on GPU): replaces $0.01-0.02/min of the provider cost
- Self-hosted TTS for low-stakes interactions (Coqui/Piper for filler and transitions): reduces TTS portion to near zero
- Context window optimization (aggressive conversation summarization): reduces LLM token cost by 40-60%
- Spot instances for all non-real-time processing: 70% discount on async evaluation compute
- TTS pre-generation for scripted content: zero incremental cost for opening scripts
| Component | Implementation | Cost/min | vs Baseline |
|---|---|---|---|
| Voice AI (core) | Grok / Gemini Live (with context optimization) | $0.020/min | -67% |
| Media transport | Self-hosted LiveKit | $0.008/min | -80% |
| Recording storage | S3 Intelligent-Tiering | $0.003/min | -25% |
| Agent workers | EKS Spot + On-demand | $0.006/min | -50% |
| Redis | Self-managed Redis | $0.002/min | -33% |
| Database | Aurora Serverless v2 | $0.001/min | -50% |
| Async evaluation | Llama on spot batch | $0.004/min | -78% |
| TOTAL (fully optimized) | | ~$0.044/min | -68% |
A 25-minute interview now costs $1.10. With caching and pre-generation of scripted content, you can push below $0.90 for standard interview formats.
Provider Cost Comparison
The voice AI provider is the single largest cost lever. Here is the full comparison as of Q1 2026:
| Provider | Model | Billing Model | Cost/min | Latency (TTFA) | Notes |
|---|---|---|---|---|---|
| OpenAI Realtime | GPT-4o Realtime | Per token (audio in + out) | $0.04-0.08 | 300-600ms | Native function calling, 60-min sessions |
| Grok Voice Agent | Grok-2 Voice | Flat per minute | $0.05 | <1s | OpenAI-compatible, no token math needed |
| Gemini Live | Gemini 2.0 Flash | Per token (audio + video) | $0.03-0.06 | 320-800ms | Multimodal, best for video interviews |
| Bedrock Nova Sonic | Nova Sonic | Per token (audio) | $0.04-0.07 | <700ms | AWS-native, 100+ languages, compliance |
| Build your own | Whisper + Llama + Coqui | Infrastructure | $0.01-0.02 | 200-800ms | High effort, max control |
Why Grok Wins on Cost Clarity
The flat per-minute billing model of Grok deserves special attention. Every other provider charges per token of audio input and output, which means your monthly bill depends on:
- How much candidates talk (variable)
- How much the AI talks (variable, depends on verbosity settings)
- Audio sampling rate and encoding
- Whether silence is billed (it is, for most providers)
With Grok’s $0.05/min flat rate, a 30-minute interview costs exactly $1.50 in voice AI costs. No surprises. This predictability is genuinely valuable for financial planning and is why Grok often wins cost comparisons even when its nominal rate is not the absolute lowest.
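To see why token billing resists forecasting, here is a rough model of how per-minute cost moves with talk ratio. The token rates and tokens-per-audio-minute figures below are hypothetical placeholders, not any provider's published pricing; the point is the spread, not the absolute numbers:

```python
def token_cost_per_min(ai_talk_fraction: float,
                       in_per_1k: float = 0.04,
                       out_per_1k: float = 0.08,
                       tokens_per_audio_min: int = 800) -> float:
    """Per-minute cost under token billing (hypothetical rates).
    Input tokens accrue while the candidate speaks, output tokens
    while the AI speaks, so verbosity directly moves the bill."""
    candidate_fraction = 1.0 - ai_talk_fraction
    return (candidate_fraction * tokens_per_audio_min * in_per_1k / 1000
            + ai_talk_fraction * tokens_per_audio_min * out_per_1k / 1000)

FLAT_RATE = 0.05  # Grok-style flat per-minute billing

# The same interview length yields different token-billed costs
# depending purely on who talks more:
for frac in (0.3, 0.5, 0.7):
    print(f"AI talks {frac:.0%}: "
          f"token-billed ${token_cost_per_min(frac):.3f}/min "
          f"vs flat ${FLAT_RATE:.3f}/min")
```

Under these placeholder rates the token-billed cost swings roughly ±15% around the flat rate as the talk ratio shifts, which is exactly the variance that makes monthly forecasting painful.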
The Self-Build Economics
Building your own speech-to-speech pipeline (Whisper.cpp for STT, Llama 3.3 70B for conversation, Coqui XTTS v2 or Piper for TTS) can reach $0.01-0.02/min at scale, but the true cost includes:
- Engineering time: 3-6 months to build a production-quality pipeline
- Model serving infrastructure: GPU instances for Whisper and TTS ($1,000-5,000/mo baseline)
- Quality gaps: Open-source TTS still lags commercial providers for naturalness
- Maintenance burden: Model updates, infrastructure management, incident response
For most teams, self-building the full voice pipeline is a false economy below 500,000 minutes per month. The engineering cost, quality gap, and maintenance burden outweigh the infrastructure savings.
STT Cost Optimization
If you’re on a cascaded pipeline (separate STT → LLM → TTS), STT is a significant cost component. Here’s the breakdown:
| STT Provider | Cost | Quality | Latency | Best For |
|---|---|---|---|---|
| Deepgram Nova 3 | $0.0043/min | Excellent | 150-300ms | Production cascaded pipeline |
| Whisper API (OpenAI) | $0.006/min | Excellent | 400-800ms | High accuracy needed, latency-tolerant |
| Google STT v2 | $0.016/min | Good | 200-400ms | GCP-native stacks |
| AssemblyAI Nano | $0.003/min | Good | 200-500ms | Cost-sensitive |
| Whisper.cpp (self-hosted) | ~$0.001/min on GPU | Excellent | 100-300ms | High volume, GPU available |
Deepgram Nova 3 is the production choice for managed STT: best accuracy-to-cost ratio, 150ms latency that fits comfortably in the voice budget, and a WebSocket streaming API that integrates cleanly with LiveKit.
For self-hosted at scale: Whisper.cpp running on a GPU instance (g5.xlarge at ~$1.00/hr handles approximately 100 concurrent streams) brings STT cost to under $0.001/min. At 100,000 minutes/month, that’s $200 vs $430 for Deepgram — a $2,760/year difference. Meaningful, but not transformative unless you’re at very high volume.
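The self-hosted STT math generalizes to a one-liner. This assumes the ~100-concurrent-streams figure above; the utilization parameter is my addition, since real fleets sit partly idle off-peak, which is how a theoretical fraction of a cent becomes the ~$200/month quoted above:

```python
def self_hosted_stt_cost_per_min(gpu_hourly_rate: float,
                                 concurrent_streams: int,
                                 utilization: float = 1.0) -> float:
    """Effective per-minute STT cost when one GPU serves many streams.
    utilization < 1.0 accounts for idle capacity outside peak hours."""
    effective_streams = concurrent_streams * utilization
    return gpu_hourly_rate / (effective_streams * 60)

# g5.xlarge at ~$1.00/hr handling ~100 concurrent Whisper.cpp streams:
print(self_hosted_stt_cost_per_min(1.00, 100))        # fully utilized
print(self_hosted_stt_cost_per_min(1.00, 100, 0.25))  # 25% average utilization
```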
TTS Cost Optimization
TTS is where the quality-cost tradeoff is sharpest. Commercial TTS providers are noticeably better than open-source alternatives for interview contexts, but the gap is closing.
| TTS Provider | Cost | Voice Quality | Latency | Best For |
|---|---|---|---|---|
| ElevenLabs Turbo v2 | $0.006-0.012/min | Best-in-class | 200-400ms | High-stakes interviews, executive roles |
| Cartesia Sonic | $0.005/min | Excellent | 90-200ms | Production default, great latency |
| OpenAI TTS | $0.015-0.030/min | Very good | 300-600ms | OpenAI ecosystem |
| Google TTS | $0.004-0.008/min | Good | 200-500ms | GCP stacks |
| Coqui XTTS v2 (self-hosted) | ~$0.001/min on GPU | Good | 200-500ms | Mid-volume, GPU available |
| Piper TTS (self-hosted) | ~$0.0001/min CPU | Acceptable | 50-150ms | Low-stakes interactions only |
The practical tiered strategy:
- Interviewer persona: Cartesia Sonic or ElevenLabs for maximum naturalness — this is what the candidate hears most
- System transitions (“Let’s move on to the next section”): Coqui XTTS v2 — quality is sufficient for scripted transitions
- Pre-recorded common phrases: Piper TTS, pre-generated — zero incremental cost
The transition phrases optimization alone can reduce TTS cost by 20-30% because section transitions, acknowledgments, and filler phrases make up a large fraction of AI speech volume.
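The tiered strategy reduces to a lookup table at runtime. A minimal sketch; the tier names are placeholders for whatever clients wrap your actual Cartesia, Coqui, and Piper integrations:

```python
# Cheapest TTS tier whose quality suffices for each utterance type.
# Provider identifiers are hypothetical labels, not real client names.
TTS_TIERS = {
    "interviewer": "cartesia_sonic",   # candidate-facing, maximum naturalness
    "transition": "coqui_xtts",        # scripted section transitions
    "canned": "piper_pregenerated",    # pre-recorded common phrases
}

def pick_tts_tier(utterance_type: str) -> str:
    """Route an utterance to a TTS tier; unknown types fall back
    to the highest-quality (candidate-facing) tier."""
    return TTS_TIERS.get(utterance_type, TTS_TIERS["interviewer"])

print(pick_tts_tier("transition"))    # scripted line, cheap tier
print(pick_tts_tier("deep_followup")) # unknown type, safe fallback
```

Falling back to the expensive tier on unknown types is the safe default: a misrouted phrase costs a fraction of a cent, while a robotic voice mid-answer costs candidate trust.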
LLM Cost Optimization
For speech-to-speech providers (Grok, OpenAI Realtime, Gemini Live, Bedrock Nova), the LLM cost is embedded in the per-minute or per-token rate. For cascaded pipelines, it’s a separate line item.
Context Window Management
The most impactful LLM cost optimization is aggressive context window management. A naive implementation passes the entire conversation history to every LLM call. A 45-minute interview with a candidate who talks a lot can accumulate 20,000-40,000 tokens of conversation history. At GPT-4o pricing, that context adds $0.12-0.24 per call, and you’re making dozens of calls per interview.
The fix: rolling summarization.
```python
# context_manager.py
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


class InterviewContextManager:
    """
    Maintains a sliding context window for LLM calls.
    Older conversation turns are summarized rather than sent verbatim.
    """

    def __init__(self, max_recent_turns: int = 6, max_context_tokens: int = 8000):
        self.max_recent_turns = max_recent_turns
        self.max_context_tokens = max_context_tokens
        self.recent_turns: list[dict] = []
        self.summary: str = ""

    async def add_turn(self, role: str, content: str):
        self.recent_turns.append({"role": role, "content": content})
        # If we exceed max recent turns, summarize the oldest ones
        if len(self.recent_turns) > self.max_recent_turns:
            turns_to_summarize = self.recent_turns[:-self.max_recent_turns]
            self.recent_turns = self.recent_turns[-self.max_recent_turns:]
            await self._update_summary(turns_to_summarize)

    async def _update_summary(self, turns: list[dict]):
        """Summarize old turns into a compact representation."""
        turns_text = "\n".join(
            f"{t['role'].upper()}: {t['content']}" for t in turns
        )
        prompt = f"""Previous context summary: {self.summary}

New turns to incorporate:
{turns_text}

Write a concise summary (max 200 words) preserving:
- Key technical topics discussed
- Candidate's demonstrated knowledge level
- Any commitments made about next topics
- Red flags or strong positives noted"""

        # Use a cheap model for summarization — GPT-4o mini works well
        response = await openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
        )
        self.summary = response.choices[0].message.content

    def get_context_for_llm(self) -> list[dict]:
        """Return the context to pass to the LLM: the rolling summary
        (if any) followed by the recent verbatim turns."""
        messages = []
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"[Previous conversation summary: {self.summary}]"
            })
        messages.extend(self.recent_turns)
        return messages
```
In practice, this reduces the average context size by 60-70% with minimal impact on interview quality. The summarization model call costs roughly $0.001 per invocation on GPT-4o mini — well worth the token savings on the main model.
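The 60-70% figure follows directly from the structure: only the most recent turns travel verbatim, plus a fixed-size summary. A back-of-envelope estimator (the 30% recent-turn share is an assumption for illustration, matching the reduction range above):

```python
def estimate_context_tokens(full_history_tokens: int,
                            recent_percent: int = 30,
                            summary_tokens: int = 300) -> int:
    """Approximate context size after rolling summarization:
    the recent turns go in verbatim, plus a fixed-size summary."""
    return full_history_tokens * recent_percent // 100 + summary_tokens

full = 30_000  # a long interview's raw conversation history
compact = estimate_context_tokens(full)
print(compact, f"({1 - compact / full:.0%} smaller)")  # 9300 (69% smaller)
```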
Model Tiering for Different Question Types
Not every interview question needs GPT-4o. Routine acknowledgments, simple clarifying questions, and structured rubric scoring can run on smaller, cheaper models:
```python
# llm_router.py
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

# Per-task model settings: cheap models for routine turns,
# the full model only where nuanced judgment matters.
MODEL_CONFIG = {
    # Simple acknowledgments and filler
    "acknowledgment": {"model": "gpt-4o-mini", "max_tokens": 50, "temperature": 0.7},
    # Standard interview questions and follow-ups
    "interview_turn": {"model": "gpt-4o-mini", "max_tokens": 200, "temperature": 0.8},
    # Complex technical follow-up requiring nuanced judgment
    "deep_followup": {"model": "gpt-4o", "max_tokens": 300, "temperature": 0.6},
    # Final rubric scoring and evaluation
    "evaluation": {"model": "gpt-4o", "max_tokens": 500, "temperature": 0.3},
}


class InterviewLLMRouter:
    """Route LLM calls to appropriate models based on task type."""

    async def generate_response(
        self,
        task_type: str,
        context: list[dict],
        prompt: str,
    ) -> str:
        # Unknown task types fall back to the standard interview tier
        config = MODEL_CONFIG.get(task_type, MODEL_CONFIG["interview_turn"])
        response = await openai_client.chat.completions.create(
            model=config["model"],
            messages=context + [{"role": "user", "content": prompt}],
            max_tokens=config["max_tokens"],
            temperature=config["temperature"],
        )
        return response.choices[0].message.content
```
On a typical 30-minute interview with the routing above, roughly 60% of calls go to gpt-4o-mini and 40% go to gpt-4o. This reduces LLM cost by approximately 35% compared to using gpt-4o for all calls.
Caching Strategies
Pre-Generated TTS for Scripted Content
Every interview follows a predictable structure: opening greeting, section transitions, standard question setups, and closing. These are scripted and never change. Pre-generating them as audio files eliminates TTS cost for a significant fraction of AI speech:
```python
# tts_cache.py
import hashlib

import boto3

s3 = boto3.client('s3')
AUDIO_CACHE_BUCKET = 'your-audio-cache-bucket'


async def get_or_generate_audio(
    text: str,
    voice_id: str,
    tts_client: "CartesiaTTS",  # thin async wrapper around the TTS API, defined elsewhere
) -> bytes:
    """
    Check S3 cache before calling TTS API.
    For scripted phrases, this will almost always hit the cache.
    """
    # Hash the text + voice ID to create cache key
    cache_key = hashlib.sha256(f"{voice_id}:{text}".encode()).hexdigest()
    s3_key = f"tts-cache/{voice_id}/{cache_key}.opus"

    try:
        # Try cache first
        response = s3.get_object(Bucket=AUDIO_CACHE_BUCKET, Key=s3_key)
        return response['Body'].read()
    except s3.exceptions.NoSuchKey:
        pass

    # Cache miss — generate and store
    audio_bytes = await tts_client.generate(text, voice_id=voice_id)
    s3.put_object(
        Bucket=AUDIO_CACHE_BUCKET,
        Key=s3_key,
        Body=audio_bytes,
        ContentType='audio/ogg',
        # Cache for 30 days — scripted phrases rarely change
        CacheControl='max-age=2592000',
    )
    return audio_bytes


# Pre-warm cache for all scripted phrases at deployment time
SCRIPTED_PHRASES = [
    "Hello! I'm Alex, your AI interviewer today. Before we begin, I want to confirm you've consented to this session being recorded.",
    "Great. Let's start with a brief introduction. Could you tell me a bit about your current role and what brought you to apply for this position?",
    "Thank you. Let's move on to the technical portion of the interview.",
    "Excellent. Now I'd like to discuss a system design scenario.",
    "Let's shift to some behavioral questions.",
    "We're coming up on the end of our time. Do you have any questions for me?",
    "Thank you so much for your time today. We'll be in touch with next steps within the next few business days.",
]
```
For a standard 30-minute interview, pre-generated phrases cover roughly 15-20% of TTS usage by duration (openings, transitions, and closings are verbose). At Cartesia Sonic pricing, this saves on the order of $0.01 per interview — small individually but meaningful at scale.
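Pre-warming is just running each scripted phrase through the same cache-key derivation used at request time; because the key is a pure function of voice and text, deploy-time generation and runtime lookup always agree. A standalone sketch of that derivation (the voice ID is hypothetical):

```python
import hashlib

def tts_cache_key(text: str, voice_id: str) -> str:
    """Deterministic S3 key for a (voice, text) pair: the same
    derivation at pre-warm time and at request time."""
    digest = hashlib.sha256(f"{voice_id}:{text}".encode()).hexdigest()
    return f"tts-cache/{voice_id}/{digest}.opus"

phrases = [
    "Thank you. Let's move on to the technical portion of the interview.",
    "Let's shift to some behavioral questions.",
]
# At deploy time, generate and upload audio for any key not already
# present in the bucket; at runtime, the lookup recomputes the same key.
for phrase in phrases:
    print(tts_cache_key(phrase, voice_id="alex-v1"))
```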
Question Intro Caching
Similarly, the opening line of each interview question is scripted and can be cached. Only the follow-up responses, which depend on candidate answers, must be dynamically generated.
Batch Evaluation on Spot Instances
Post-session evaluation is the most cost-controllable workload because it has no latency requirement. You have hours, not milliseconds. This makes it ideal for spot/preemptible instances.
```python
# batch_evaluator.py — runs as an ECS Spot or Kubernetes spot-pool job
import json
from datetime import datetime

import boto3


class BatchEvaluator:
    """
    Runs evaluation jobs on spot instances during off-peak hours.
    Handles instance interruption gracefully via checkpointing.
    """

    def __init__(self, session_id: str, checkpoint_bucket: str):
        self.session_id = session_id
        self.checkpoint_bucket = checkpoint_bucket
        self.s3 = boto3.client('s3')

    async def evaluate(self, transcript: str, rubric: dict) -> dict:
        # Load checkpoint if we've been interrupted before
        checkpoint = await self.load_checkpoint()
        completed_sections = checkpoint.get('completed_sections', [])
        results = checkpoint.get('results', {})

        for section_name, section_criteria in rubric['sections'].items():
            if section_name in completed_sections:
                continue  # Skip already-evaluated sections

            # extract_section / score_section are the model-calling
            # helpers, implemented elsewhere in this class
            section_transcript = self.extract_section(transcript, section_name)
            section_score = await self.score_section(
                section_transcript,
                section_criteria,
            )
            results[section_name] = section_score
            completed_sections.append(section_name)

            # Checkpoint after each section so a spot interruption
            # loses at most one section of work
            await self.save_checkpoint({
                'completed_sections': completed_sections,
                'results': results,
                'timestamp': datetime.utcnow().isoformat(),
            })

        return results

    async def save_checkpoint(self, data: dict):
        self.s3.put_object(
            Bucket=self.checkpoint_bucket,
            Key=f"eval-checkpoints/{self.session_id}.json",
            Body=json.dumps(data),
        )

    async def load_checkpoint(self) -> dict:
        try:
            response = self.s3.get_object(
                Bucket=self.checkpoint_bucket,
                Key=f"eval-checkpoints/{self.session_id}.json",
            )
            return json.loads(response['Body'].read())
        except self.s3.exceptions.NoSuchKey:
            return {}
```
Running evaluation on EC2 Spot (g5.xlarge, ~$0.36/hr vs $1.01/hr on-demand, using Llama 3.1 70B) versus GPT-4o mini API calls reduces evaluation cost by approximately 70% at the cost of managing spot interruptions. The checkpointing above makes interruptions recoverable.
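The ~70% figure combines the instance discount with the model swap, so it is worth separating the two. The instance discount alone, using the rates quoted above:

```python
def spot_discount(spot_hourly: float, on_demand_hourly: float) -> float:
    """Fractional savings from running the same workload on spot capacity."""
    return 1 - spot_hourly / on_demand_hourly

# g5.xlarge: ~$0.36/hr spot vs $1.01/hr on-demand
print(f"{spot_discount(0.36, 1.01):.0%}")  # roughly two-thirds off compute
```

The remaining savings come from replacing metered API calls with a self-hosted model, and they only materialize if checkpointing keeps interruptions from forcing full re-runs.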
The Build vs. Buy Decision Matrix
Here is the decision framework I use for teams at different scales:
| Monthly Volume | Recommended Stack | Estimated Cost/Interview | Engineering Effort |
|---|---|---|---|
| < 5,000 min | Fully managed (LiveKit Cloud + Grok + ECS) | ~$3.00 | Minimal |
| 5K-20K min | Hybrid: self-hosted LiveKit + Grok + ECS | ~$2.00 | 2-3 weeks |
| 20K-100K min | Hybrid + spot evaluation + context optimization | ~$1.50 | 4-6 weeks |
| 100K-500K min | Self-hosted SFU + Grok + model tiering + caching | ~$1.00 | 2-3 months |
| > 500K min | Full optimization including self-hosted STT/TTS | ~$0.70 | 4-6 months + ongoing |
The “engineering effort” column represents the one-time cost to implement each tier, not ongoing maintenance. The ongoing maintenance cost (roughly 0.25-0.5 engineering weeks per month) is not included but should factor into your ROI calculation.
When Grok Is the Right Full-Stack Answer
For teams between 5K-100K minutes per month who want a simple path to significant cost reduction without complex infrastructure, Grok’s flat $0.05/min rate with OpenAI-compatible API is often the right answer.
The migration from OpenAI Realtime to Grok takes approximately one week (they share the same WebSocket protocol). The savings are immediate. And you skip the complexity of self-hosted SFU or model tiering.
At 50,000 minutes/month:
- OpenAI Realtime: ~$3,000/month in voice AI costs
- Grok flat rate: $2,500/month in voice AI costs
- Delta: $500/month savings from a one-week migration
Combined with self-hosted LiveKit (from Part 10), a mid-scale company spending $8,000/month on fully managed voice AI can reach $3,500/month with three weeks of engineering work and no quality degradation.
The Number That Matters
I started this post at $3.47 per 25-minute interview on a managed stack. Here is where each tier lands:
- Small company, fully managed: $3.47/interview (appropriate — don’t optimize prematurely)
- Medium company, hybrid: $2.13/interview (39% savings, ~3 weeks engineering)
- Large company, optimized: $1.10/interview (68% savings, ~3 months engineering)
- Very large, full self-host: $0.88/interview (75% savings, 4-6 months + maintenance)
The break-even on each optimization tier typically falls between 2-6 months at the volumes that justify it. Start with managed, migrate to hybrid when costs become visible in quarterly reviews, and invest in full optimization only when you can see a clear 12-month ROI.
In Part 12, we close out the series with the architecture that makes all of this cost optimization possible: the multi-provider adapter pattern. Supporting OpenAI Realtime, Bedrock Nova Sonic, Grok, and Gemini Live behind a single clean interface means you can switch providers for cost or reliability reasons without rewriting your interview logic. It also gives you the circuit breaker and failover patterns that keep your system running when any single provider has an outage.
This is Part 11 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
- Cost Optimization — From $0.14/min to $0.03/min (this post)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)