In Part 4, we built the automatic post-interview pipeline: recording extraction, parallel enrichment, transcript generation, and structured insight delivery. That pipeline runs after every session. This post is about the question that runs during every session: how much is this costing?
Research projects have budgets. Grant-funded studies have per-participant caps. Commercial clients ask “how much did this 500-person study cost?” and they expect an answer down to the cent. When you’re running voice AI sessions at scale — hundreds or thousands of 20-40 minute conversations — cost visibility isn’t optional. It’s a core product feature.
I learned this the hard way. Early on, we tracked costs monthly, in aggregate. A client would ask for a per-session breakdown and we’d spend half a day pulling token logs, cross-referencing timestamps, and estimating transport costs. That was with 50 sessions. At 500 sessions per week, it was untenable.
The fix was real-time cost tracking: per-session, per-provider, per-minute, streaming to the client as the session happens. Here’s exactly how we built it, what the real numbers look like across providers, and the self-hosting math that changed our economics.
Real-Time Token Usage via Data Channel
Speech-to-speech providers — both OpenAI Realtime and Gemini Live — publish token usage events during active sessions. OpenAI sends response.done events containing usage objects with input and output token counts. Gemini Live surfaces usage metadata in its server responses. These events arrive continuously throughout the conversation.
The architecture is straightforward: the server-side agent captures these events, computes a running cost estimate using the provider’s published rates, and pushes updates to the client via the SFU data channel. The client renders a live cost meter. No polling, no post-hoc estimation. The researcher watching the session sees the cost tick up in real time.
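On the agent side, the event handler is a few lines. Here is a sketch assuming the usage payload shape OpenAI documents for the response.done event; the token-detail field names have shifted across API versions, so verify them against the SDK you run:

```python
def handle_response_done(event: dict, tracker) -> None:
    """Extract token counts from an OpenAI Realtime response.done event
    and feed them into the session cost tracker.

    Assumed payload shape: response.usage with input_token_details /
    output_token_details sub-objects containing audio_tokens and
    text_tokens. Check your API version before relying on these names.
    """
    usage = event.get("response", {}).get("usage", {})
    in_details = usage.get("input_token_details", {})
    out_details = usage.get("output_token_details", {})
    tracker.record_usage(
        audio_in=in_details.get("audio_tokens", 0),
        audio_out=out_details.get("audio_tokens", 0),
        text=in_details.get("text_tokens", 0) + out_details.get("text_tokens", 0),
    )
```

The Gemini Live handler is analogous: pull the usage metadata off the server response and call the same `record_usage` method, so the tracker stays provider-agnostic.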
Here’s the core cost tracker:
```python
from dataclasses import dataclass, field
from time import time

@dataclass
class SessionCostTracker:
    """Tracks real-time cost for an active S2S session."""
    provider: str  # "openai_realtime" or "gemini_live"
    audio_input_tokens: int = 0
    audio_output_tokens: int = 0
    text_tokens: int = 0
    session_start: float = field(default_factory=time)

    # Per-token rates (USD) — updated from provider pricing pages.
    # A single text rate (the input rate) is a deliberate simplification:
    # text output is negligible in voice sessions.
    RATES = {
        "openai_realtime": {"audio_in": 100 / 1_000_000, "audio_out": 200 / 1_000_000, "text": 5 / 1_000_000},
        "gemini_live": {"audio_in": 0.70 / 1_000_000, "audio_out": 2.80 / 1_000_000, "text": 0.15 / 1_000_000},
    }

    def record_usage(self, audio_in: int, audio_out: int, text: int = 0):
        self.audio_input_tokens += audio_in
        self.audio_output_tokens += audio_out
        self.text_tokens += text

    @property
    def total_cost(self) -> float:
        r = self.RATES[self.provider]
        return (self.audio_input_tokens * r["audio_in"]
                + self.audio_output_tokens * r["audio_out"]
                + self.text_tokens * r["text"])

    @property
    def cost_per_minute(self) -> float:
        elapsed = max(time() - self.session_start, 1)  # guard against divide-by-zero
        return self.total_cost / (elapsed / 60)
```
Every 10 seconds, the agent publishes a cost snapshot to the client through the data channel:
```python
import asyncio
import json
from time import time

async def publish_cost_updates(tracker: SessionCostTracker, data_channel, interval: float = 10.0):
    """Push cost updates to the client via SFU data channel every N seconds."""
    while True:
        await asyncio.sleep(interval)
        payload = json.dumps({
            "type": "cost_update",
            "total_usd": round(tracker.total_cost, 4),
            "per_minute_usd": round(tracker.cost_per_minute, 4),
            "audio_in_tokens": tracker.audio_input_tokens,
            "audio_out_tokens": tracker.audio_output_tokens,
            "elapsed_sec": round(time() - tracker.session_start, 1),
        })
        await data_channel.send(payload)
```
The client receives these updates and renders them however the research dashboard needs — a running total, a per-minute rate, a budget gauge. The data channel approach means zero additional API calls. It piggybacks on the existing WebRTC connection that’s already carrying the audio.
One subtlety: the cost calculation is an estimate, not an invoice. Token counts from response.done events are accurate for that response, but there’s a lag — the final response’s usage event arrives after the session ends. We reconcile the real-time estimate against the final provider invoice during post-processing. In practice, the real-time estimate is within 2-5% of the final bill.
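The reconciliation step is worth making explicit. A minimal sketch, assuming our own record shape rather than any provider API:

```python
def reconcile(estimated_usd: float, invoiced_usd: float,
              tolerance_pct: float = 5.0) -> dict:
    """Compare the real-time estimate against the provider invoice line.

    Flags sessions whose delta falls outside the expected 2-5% band so
    they can be reviewed (rate change, missed usage event, etc.).
    """
    delta = invoiced_usd - estimated_usd
    delta_pct = abs(delta) / invoiced_usd * 100 if invoiced_usd else 0.0
    return {
        "delta_usd": round(delta, 4),
        "delta_pct": round(delta_pct, 2),
        "needs_review": delta_pct > tolerance_pct,
    }
```

We run this during post-processing, once the final invoice line is available, and surface the flagged sessions in the ops dashboard.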
OpenAI Realtime vs Gemini Live: The Cost Comparison
This is the comparison everyone asks about. Both providers offer true speech-to-speech: audio in, audio out, one model doing the hearing, thinking, and speaking. But the cost structures differ significantly.
OpenAI Realtime API
Based on OpenAI’s pricing, the Realtime API (gpt-4o-realtime) charges:
- Audio input: $100 per 1 million tokens (~100 tokens/second of audio)
- Audio output: $200 per 1 million tokens (~100 tokens/second of generated speech)
- Text input (system prompts, function results): $5 per 1 million tokens
- Text output (function calls, metadata): $20 per 1 million tokens
For a typical 30-minute research session with roughly 50/50 participant-agent speaking ratio:
- ~15 minutes of audio input = 900 seconds = ~90,000 tokens = $9.00
- ~15 minutes of audio output = 900 seconds = ~90,000 tokens = $18.00
- System prompt + function calls: ~3,000 text tokens = $0.08
- Total for 30 minutes: ~$27.08, or roughly ~$0.90/min
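Plugging the rates into that session profile, as a back-of-envelope check using the ~100 tokens/second figure quoted above:

```python
# Worked example: 30-minute session, 50/50 speaking split, gpt-4o-realtime rates
TOKENS_PER_SEC = 100               # assumed audio tokenization rate from above
AUDIO_IN_RATE = 100 / 1_000_000    # USD per audio input token
AUDIO_OUT_RATE = 200 / 1_000_000   # USD per audio output token

audio_in_cost = 15 * 60 * TOKENS_PER_SEC * AUDIO_IN_RATE    # 90,000 tokens -> $9.00
audio_out_cost = 15 * 60 * TOKENS_PER_SEC * AUDIO_OUT_RATE  # 90,000 tokens -> $18.00
text_cost = 0.08                   # system prompt + function calls, estimated
total = audio_in_cost + audio_out_cost + text_cost

print(round(total, 2), round(total / 30, 2))  # 27.08 0.9
```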
Update (February 2026): OpenAI has since introduced cached audio input pricing and the gpt-4o-mini-realtime variant. The mini model brings audio input to $10/1M tokens and audio output to $20/1M tokens — roughly 10x cheaper. A 30-minute session on mini runs about $2.78, or ~$0.093/min. The quality delta is noticeable for complex research protocols but acceptable for straightforward conversational interviews. Check the current pricing page for the latest rates, as these have been changing quarter to quarter.
Gemini Live (gemini-2.0-flash)
Google’s Gemini API pricing for the Live API with gemini-2.0-flash is also token-based, but the rates sit far below OpenAI’s:
- Audio input: approximately $0.70 per 1 million tokens
- Audio output: approximately $2.80 per 1 million tokens
- Text input: $0.15 per 1 million tokens
- Text output: $0.60 per 1 million tokens
The token-to-seconds mapping differs from OpenAI’s, but for a comparable 30-minute session:
- Audio input (~15 min): ~$0.44
- Audio output (~15 min): ~$1.76
- Text tokens (system prompt + function results): ~$0.01
- Total for 30 minutes: ~$2.21, or roughly ~$0.074/min
Gemini Live also offers a generous free tier (up to certain rate limits) which is useful for development and pilot studies. Check the Gemini pricing page for current free tier limits and paid rates.
The Real Comparison
Here’s what a month of research looks like at different volumes:
```
                       OpenAI Realtime   OpenAI Mini     Gemini Live
                       (gpt-4o)          (gpt-4o-mini)   (2.0-flash)
─────────────────────────────────────────────────────────────────────
Per-minute rate (AI)   ~$0.90            ~$0.093         ~$0.074
30-min session         ~$27.08           ~$2.78          ~$2.21
100 sessions/month     ~$2,708           ~$278           ~$221
1,000 sessions/month   ~$27,080          ~$2,780         ~$2,210
─────────────────────────────────────────────────────────────────────
```
The choice isn’t purely about cost. OpenAI Realtime (full gpt-4o) has better reasoning, more consistent turn-taking, and stronger function calling reliability. Gemini Live is faster to first response and handles multi-turn context well, but can be less predictable with complex function call schemas. For research use cases where protocol adherence matters, we typically use OpenAI Realtime for complex multi-phase protocols and Gemini Live for simpler conversational formats.
The practical approach: use the mini/flash variants for high-volume studies where cost matters, reserve the full models for protocols that need the reasoning power, and track per-session costs so you can make data-driven provider decisions.
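One way to encode that routing policy in code. The protocol taxonomy here is illustrative, not our production schema:

```python
from enum import Enum

class Protocol(str, Enum):
    MULTI_PHASE = "multi_phase"        # complex branching research protocols
    CONVERSATIONAL = "conversational"  # straightforward interview formats

def pick_provider(protocol: Protocol, high_volume: bool) -> str:
    """Illustrative routing policy from the tradeoffs above: complex
    protocols get the stronger reasoning model, high-volume work gets
    the cheaper variant, simple formats go to Gemini Live."""
    if protocol is Protocol.MULTI_PHASE:
        return "openai_realtime_mini" if high_volume else "openai_realtime"
    return "gemini_live"
```

The point is not this exact policy but that the decision is data-driven: with per-session cost records, you can revisit the routing table quarterly against actual spend and protocol-adherence metrics.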
Budget Enforcement: Soft and Hard Limits
Every research project has a budget. A 500-participant study at $3.00 per session means a $1,500 AI budget. Overruns are not acceptable — especially on grant-funded work where the budget was locked 18 months before the study started.
We enforce budgets at two levels: project-level (total spend across all sessions) and session-level (per-individual-session cap). Each level has a soft limit and a hard limit.
- Soft limit (80% of budget): The agent receives a system message injection telling it to wrap up naturally. “You’re approaching the time limit. Please summarize and conclude the conversation in the next 2-3 minutes.” The participant never sees anything about costs.
- Hard limit (100% of budget): The session ends gracefully. The agent delivers a closing statement, the connection terminates, and post-processing begins.
```python
from dataclasses import dataclass
from enum import Enum

class BudgetStatus(str, Enum):
    OK = "ok"
    SOFT_LIMIT = "soft_limit"
    HARD_LIMIT = "hard_limit"

@dataclass
class BudgetEnforcer:
    """Monitors session cost against project and session limits."""
    project_budget: float        # Total project budget in USD
    project_spent: float         # Already spent across prior sessions
    session_limit: float         # Max cost for this single session
    soft_threshold: float = 0.8  # Trigger wrap-up at 80%

    def check(self, session_cost: float) -> BudgetStatus:
        # Check session-level limit first
        if session_cost >= self.session_limit:
            return BudgetStatus.HARD_LIMIT
        if session_cost >= self.session_limit * self.soft_threshold:
            return BudgetStatus.SOFT_LIMIT
        # Check project-level limit
        total = self.project_spent + session_cost
        if total >= self.project_budget:
            return BudgetStatus.HARD_LIMIT
        if total >= self.project_budget * self.soft_threshold:
            return BudgetStatus.SOFT_LIMIT
        return BudgetStatus.OK
```
The enforcement loop runs alongside the cost tracker. Every time a cost update fires, it checks the budget status and takes action:
- OK — continue normally.
- SOFT_LIMIT — inject a system message into the S2S session. Both OpenAI Realtime and Gemini Live support mid-session instruction updates. The message is carefully worded to guide the AI toward a natural conclusion without revealing the budget constraint to the participant.
- HARD_LIMIT — send a final AI message (“Thank you for your time, we’ve covered everything we need”), wait 5 seconds for the response to complete, then disconnect.
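Wired together, the loop looks roughly like this. The session object is a placeholder for your own session interface, assumed here to expose active, inject_instruction(), say_goodbye(), and disconnect():

```python
import asyncio
from enum import Enum

class BudgetStatus(str, Enum):  # same enum as defined above
    OK = "ok"
    SOFT_LIMIT = "soft_limit"
    HARD_LIMIT = "hard_limit"

async def enforce_budget(tracker, enforcer, session,
                         interval: float = 10.0, goodbye_grace: float = 5.0):
    """Run beside the cost tracker: check budget status on every tick."""
    wrap_up_sent = False
    while session.active:
        await asyncio.sleep(interval)
        status = enforcer.check(tracker.total_cost)
        if status == BudgetStatus.HARD_LIMIT:
            await session.say_goodbye()         # closing statement to participant
            await asyncio.sleep(goodbye_grace)  # let the final response complete
            await session.disconnect()
            return
        if status == BudgetStatus.SOFT_LIMIT and not wrap_up_sent:
            # Wrap-up nudge; never mentions cost to the participant
            await session.inject_instruction(
                "You're approaching the time limit. Please summarize and "
                "conclude the conversation in the next 2-3 minutes."
            )
            wrap_up_sent = True
```

The `wrap_up_sent` flag matters: without it, the soft-limit instruction fires on every tick, and repeated injections make the agent wrap up abruptly instead of naturally.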
This has saved us from runaway sessions multiple times. One memorable case: a participant and the AI agent got into a deeply engaging tangent about their career history. Without budget enforcement, that session would have run 90 minutes and cost over $8.00. The soft limit triggered at 35 minutes, the AI wrapped up by minute 38, and the session cost $4.18 — within the per-session cap.
The Hidden Line Item: SFU Transport Costs
Here’s the line item that surprises everyone: the media transport layer. LiveKit Cloud charges for SFU bandwidth — the WebRTC relay that carries audio between participants and the server-side agent. This is separate from the AI provider cost.
LiveKit Cloud pricing is usage-based: you pay per participant-minute of media transport. For a typical voice-only research session (two participants: the human and the agent), the cost is roughly $0.004-0.006 per participant-minute. A 30-minute session with 2 participants = 60 participant-minutes = roughly $0.24-0.36.
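The participant-minute math as a small helper, using the rough rate range quoted above (check LiveKit's current pricing before relying on it):

```python
def transport_cost(duration_min: float, participants: int = 2,
                   rate_per_pm: float = 0.005) -> float:
    """Estimated SFU transport cost for one session.

    rate_per_pm is an assumed LiveKit Cloud audio rate, taken as the
    midpoint of the ~$0.004-0.006 per participant-minute range.
    """
    return duration_min * participants * rate_per_pm
```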
That sounds small. But look at it as a percentage of total session cost:
```
Provider           AI Cost/30min   Transport/30min   Transport %
──────────────────────────────────────────────────────────────────
OpenAI Realtime    ~$27.08         ~$0.30            ~1%
OpenAI Mini        ~$2.78          ~$0.30            ~10%
Gemini Live        ~$2.21          ~$0.30            ~12%
──────────────────────────────────────────────────────────────────
```
When you use the cheaper AI providers — which is exactly what you should do for high-volume research — transport becomes a proportionally larger share of the bill. At 1,000 sessions per month on Gemini Live, transport adds ~$300 to the ~$2,210 AI cost. That’s 12% of total cost going to media relay.
The transport cost also scales linearly. There’s no volume discount on bandwidth. Whether you’re running 10 or 10,000 sessions, you pay the same per-minute rate. This is the cost line that motivated our self-hosting investigation.
Self-Hosting Economics
The question we asked at ~3,000 sessions/month: is it cheaper to run our own LiveKit server?
LiveKit is open source. You can deploy it on any Linux server with a public IP and decent bandwidth. The software is free. You pay for the server and bandwidth.
Here’s the math we ran:
A dedicated server with 8 vCPUs, 16GB RAM, and 10TB bandwidth runs about $150-200/month from providers like Hetzner or OVH. LiveKit’s resource usage is modest for audio-only sessions — it’s a media relay, not a transcoding engine. A single server in this tier comfortably handles 300-500 concurrent 2-party audio sessions.
```python
import math

def cost_comparison(sessions_per_month: int, avg_duration_min: float = 30.0):
    """Compare managed vs self-hosted transport costs."""
    total_minutes = sessions_per_month * avg_duration_min
    participant_minutes = total_minutes * 2  # 2-party sessions

    # Managed: LiveKit Cloud ~$0.005/participant-minute (audio)
    managed_cost = participant_minutes * 0.005

    # Self-hosted: dedicated server ~$200/month fixed,
    # each handling up to ~500 concurrent 2-party audio sessions
    server_cost = 200.0
    peak_concurrent = sessions_per_month * (avg_duration_min / 60) / 20  # rough peak estimate
    servers_needed = max(1, math.ceil(peak_concurrent / 500))
    self_hosted_cost = servers_needed * server_cost

    savings = managed_cost - self_hosted_cost
    savings_pct = (savings / managed_cost * 100) if managed_cost > 0 else 0
    return {
        "managed_monthly": round(managed_cost, 2),
        "self_hosted_monthly": round(self_hosted_cost, 2),
        "savings_monthly": round(savings, 2),
        "savings_pct": round(savings_pct, 1),
        "break_even_sessions": int(server_cost / (avg_duration_min * 2 * 0.005)),
    }
```
The numbers at different scales:
```
Sessions/month   Managed Transport   Self-Hosted   Monthly Savings
──────────────────────────────────────────────────────────────────────
100              $30                 $200          -$170 (worse)
500              $150                $200          -$50 (worse)
1,000            $300                $200          $100 (33%)
3,000            $900                $200          $700 (78%)
10,000           $3,000              $400          $2,600 (87%)
──────────────────────────────────────────────────────────────────────
```
The break-even point is around 670 sessions/month at 30 minutes each. Below that, managed is simpler and cheaper. Above that, self-hosted wins — and the gap widens fast.
At our scale of 3,000+ sessions/month, self-hosting the SFU saves roughly $700/month. At 10,000 sessions, it’s $2,600/month. The operational cost of maintaining a LiveKit server is modest — it’s a single Go binary with straightforward configuration. We run it with systemd, monitor with Prometheus (more on monitoring in Part 6), and it just works.
The tradeoff is real though: you own the uptime. No managed service absorbing outages for you. No automatic multi-region failover. If your server goes down during a live session, that session is interrupted. We mitigate this with health checks, automatic restart, and a fast failover to LiveKit Cloud as a backup. The self-hosting deployment guide covers the setup in detail.
The Cost Logging Schema
All of this tracking is useless if you can’t query it later. Every session writes a cost record to PostgreSQL with enough detail to answer any billing question: per-session, per-project, per-provider, per-time-period.
```sql
CREATE TABLE session_costs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    session_id UUID NOT NULL REFERENCES sessions(id),
    project_id UUID NOT NULL REFERENCES projects(id),
    provider VARCHAR(50) NOT NULL,  -- 'openai_realtime', 'gemini_live'
    audio_in_tokens BIGINT NOT NULL DEFAULT 0,
    audio_out_tokens BIGINT NOT NULL DEFAULT 0,
    text_tokens BIGINT NOT NULL DEFAULT 0,
    ai_cost_usd NUMERIC(10, 6) NOT NULL,
    transport_cost NUMERIC(10, 6) NOT NULL,
    total_cost_usd NUMERIC(10, 6) NOT NULL,
    duration_sec INTEGER NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_session_costs_project ON session_costs(project_id, created_at);
CREATE INDEX idx_session_costs_provider ON session_costs(provider, created_at);
```
The ai_cost_usd and transport_cost columns are split intentionally. When a client asks “why did this study cost more than projected?”, you can immediately point to whether it was the AI provider (longer sessions, more complex conversations) or the transport layer (more participants, video sessions mixed in). The provider column lets you compare costs across A/B tests where you’re routing sessions to different models.
We run nightly aggregation queries to populate a project dashboard: total spend, average cost per session, cost trend over time, projected budget consumption. The index on (project_id, created_at) makes these aggregations fast even with millions of rows.
One lesson learned: log the rates you used for cost calculation alongside the token counts. Provider pricing changes. If OpenAI drops audio output pricing by 20%, you need to know which sessions were calculated at the old rate and which at the new rate. We added a rate_snapshot JSONB column after getting burned by a mid-study pricing change that made our historical cost data inconsistent.
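What we write into that column at session start is simple. A sketch, assuming the rate dict shape used by the tracker earlier:

```python
import json
from datetime import datetime, timezone

def build_rate_snapshot(provider: str, rates: dict) -> str:
    """Serialize the rates used for this session's cost math, so the
    record stays interpretable after a provider price change.

    The returned JSON string goes into the rate_snapshot JSONB column.
    """
    return json.dumps({
        "provider": provider,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "rates_usd_per_token": rates,
    })
```

Capturing the snapshot at session start, not at write time, is the important detail: a price change mid-session should not retroactively alter how that session's tokens were priced.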
Where This Leads
Cost tracking isn’t just about billing. It’s an architectural feedback loop. When you can see per-session costs in real time, you make different decisions:
- You route high-value protocol sessions to better (more expensive) models and routine sessions to cheaper ones.
- You set session duration targets based on cost data, not guesswork.
- You identify sessions where the AI “went off script” — they show up as cost outliers.
- You justify self-hosting investments with concrete savings projections.
The infrastructure we built here — real-time tracking, budget enforcement, provider comparison, self-hosting economics — all feeds into the operational challenge of running this at scale. In Part 6, we’ll cover what breaks when you go from 10 sessions per week to 200 concurrent: the enrichment bottleneck, session recovery, provider failover, and the metrics that keep it all visible.
References:
- OpenAI Realtime API — Pricing and Usage
- Google Gemini API Pricing
- LiveKit Cloud Pricing
- LiveKit Self-Hosting Deployment Guide
This is Part 5 of an 8-part series: Production Voice AI for Research at Scale.
Series outline:
- The Architecture Nobody Warns You About — Server-side agents, metadata transport, provider selection (Part 1)
- Zombie Agents, Pre-Warming, and the 5 Bugs That Cost Us Weeks — Production pain points and fixes (Part 2)
- Multi-Phase State Machines — Research protocol as code, LLM-driven transitions (Part 3)
- From Recording to Insight — The automatic post-interview pipeline (Part 4)
- The Real Cost — Per-minute tracking, budgets, self-hosting math (Part 5)
- What Breaks at 200 Concurrent Sessions — Scaling bottlenecks and operational metrics (Part 6)
- Multi-Language Voice AI — Language detection, provider routing, locale-aware VAD, i18n prompts (Part 7)
- Deployment and Go-Live — Docker, Kubernetes, CI/CD, zero-downtime deploys, monitoring (Part 8)
For the broader reference architecture covering cascaded vs S2S pipelines, framework selection, multi-provider support, and the full interview lifecycle, see the 12-part Voice AI Interview Playbook.