In Part 2, we chose the hybrid architecture — cascaded pipeline for multi-turn interviews with rich evaluation, speech-to-speech for quick practice sessions. Now we need to decide where that architecture lives. What framework do we build on top of?
This is the question I got wrong on my first voice AI project. I picked the shiniest option without thinking through what I actually needed six months later. So let me save you that pain.
There are three meaningful choices in 2026:
- LiveKit — WebRTC infrastructure with an Agents SDK built for exactly this use case
- Pipecat — a Python pipeline framework by Daily.co that stays vendor-neutral and treats transports as swappable components
- Direct integration — connect straight to provider APIs (Gemini Live WebSocket, OpenAI Realtime WebRTC, Grok WebSocket) with no intermediary framework
None of these is universally correct. The right answer depends on your team’s size, existing infrastructure, and how much abstraction you want to pay for in complexity. Let me walk through each one properly, then give you a decision framework you can actually use.
LiveKit: WebRTC Infrastructure Done Right
LiveKit started as an open-source WebRTC Selective Forwarding Unit (SFU). If you’ve never worked with WebRTC infrastructure before, an SFU is the server that sits in the middle of a multi-party call — it receives media streams from each participant and forwards the right streams to the right people, without doing the expensive work of mixing or transcoding.
That foundation matters for voice interviews because you’re not just doing audio processing in isolation. You’re connecting a browser client, an AI agent, potentially a recording system, and maybe a video feed — all in real time, all with latency that can’t exceed 300ms before the conversation starts to feel broken.
The Room Model
LiveKit organizes everything around rooms. A room is a real-time session where participants (humans, AI agents, recording egress jobs) connect and share media tracks. Each participant publishes audio/video tracks and subscribes to others.
For a voice interview platform, this maps beautifully:
- The candidate connects to a room via the browser SDK
- Your AI agent connects as a participant with its own audio track
- An egress job can record the entire room to S3 for compliance
- A supervisor dashboard could join as a silent observer
The room model gives you a natural session boundary. When the interview ends, you close the room. Everything — recording, transcripts, agent state — is scoped to that room identity.
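Room scoping is enforced with per-participant access tokens. In production you would mint these with the `livekit-api` package (`AccessToken(...).with_identity(...).with_grants(api.VideoGrants(room_join=True, room=...)).to_jwt()`), but since a LiveKit token is just an HS256 JWT whose `video` claim carries the room grant, a stdlib sketch makes the session scoping visible. All names and values below are placeholders:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def make_room_token(api_key: str, api_secret: str,
                    identity: str, room: str, ttl_s: int = 3600) -> str:
    """Build an HS256 JWT scoped to a single room (illustrative only)."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                  # which API key signed the token
        "sub": identity,                 # the participant's identity
        "exp": int(time.time()) + ttl_s,
        "video": {"roomJoin": True, "room": room},  # the room-scoped grant
    }
    signing_input = (
        b64url(json.dumps(header).encode()) + "." +
        b64url(json.dumps(claims).encode())
    )
    sig = hmac.new(api_secret.encode(), signing_input.encode(),
                   hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


token = make_room_token("devkey", "secret", "candidate-42", "interview-room-001")
```

The point is the `video` claim: the token, not the client, decides which room a participant can join, so closing the room really does end everything the token granted.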
LiveKit Agents SDK
The Agents SDK is what makes LiveKit genuinely interesting for AI voice applications. It was built specifically to solve the “how do I run an AI agent in a LiveKit room” problem, and it does it well.
Here’s the architecture: you write an AgentWorker process that registers with the LiveKit server. When a room is created (or when certain conditions are met), the server dispatches a job to your worker pool. Your worker connects to the room as a participant and starts processing audio.
The Agents SDK handles the gnarly parts:
- VAD (Voice Activity Detection) integrated via Silero
- Turn-taking logic so the agent doesn’t interrupt the human mid-sentence
- Pipeline orchestration between STT → LLM → TTS
- Graceful reconnection if the agent loses its server connection
```python
# LiveKit Agents SDK — basic voice interview agent
from livekit import agents
from livekit.agents import AgentSession, Agent, RoomInputOptions
from livekit.plugins import openai, silero, deepgram, noise_cancellation


class InterviewAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a technical interviewer conducting a
            30-minute Python backend interview. Ask one question at a time.
            Listen to the candidate's answer before moving to the next question.
            Be encouraging but thorough."""
        )


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=deepgram.STT(model="nova-2"),
        llm=openai.LLM(model="gpt-4o"),
        tts=openai.TTS(voice="alloy"),
        vad=silero.VAD.load(),  # Silero VAD drives turn detection
    )

    await session.start(
        room=ctx.room,
        agent=InterviewAgent(),
        room_input_options=RoomInputOptions(
            noise_cancellation=noise_cancellation.BVC()
        ),
    )

    # Kick off the interview
    await session.say(
        "Hello! I'm your interviewer today. Let's start with something "
        "straightforward — can you explain the difference between "
        "processes and threads in Python?"
    )


if __name__ == "__main__":
    agents.cli.run_app(
        agents.WorkerOptions(entrypoint_fnc=entrypoint)
    )
```
This is clean. The Agents SDK abstracts away the WebRTC complexity and the media pipeline wiring. You focus on the interview logic.
Cross-Platform SDKs
One of LiveKit’s real strengths is SDK coverage. They maintain official SDKs for:
- Web/React: `@livekit/components-react` — pre-built UI components for audio/video
- React Native: `@livekit/react-native` — same API, works on iOS and Android
- Flutter: `livekit_client` — Dart package for cross-platform mobile
- Swift/iOS: `LiveKit-Swift` — native iOS SDK
- Kotlin/Android: `livekit-android` — native Android SDK
- Python: server-side and agent SDK
- Go, Rust, Unity: community and official variants
If you’re building a product that needs to run on web, iOS, and Android from a single voice AI platform, LiveKit is the only choice that doesn’t require you to maintain three separate WebRTC stacks.
Egress for Recording
LiveKit Egress is the recording system. It connects to a room as a hidden participant and can:
- Record the entire room to a file (MP4 or OGG)
- Stream to an RTMP endpoint (YouTube, Twitch, your own ingest)
- Capture just an audio track from a specific participant
- Produce room-level composite recordings with mixed audio
For a voice interview platform with GDPR or HIPAA requirements (covered in Part 9), this is enormous. You don’t need to build your own recording pipeline. You start an egress job via API when the interview begins, stop it when it ends, and the recording lands in your S3 bucket.
```python
# Starting a recording egress via LiveKit API
from livekit import api


async def start_interview_recording(room_name: str, candidate_id: str):
    lk = api.LiveKitAPI(
        url=LIVEKIT_URL,
        api_key=LIVEKIT_API_KEY,
        api_secret=LIVEKIT_API_SECRET,
    )

    req = api.RoomCompositeEgressRequest(
        room_name=room_name,
        audio_only=True,  # voice interview — no video needed
        file_outputs=[
            api.EncodedFileOutput(
                file_type=api.EncodedFileType.OGG,
                filepath=f"interviews/{candidate_id}/{room_name}.ogg",
                s3=api.S3Upload(
                    access_key=AWS_ACCESS_KEY,
                    secret=AWS_SECRET,
                    bucket="interview-recordings",
                    region="us-east-1",
                ),
            )
        ],
    )
    response = await lk.egress.start_room_composite_egress(req)
    return response.egress_id
```
Horizontal Scaling
LiveKit is designed to scale horizontally. Multiple agent worker processes connect to the same LiveKit server (or cluster), and jobs are dispatched across the pool automatically. You can run 100 agent workers and the LiveKit server load-balances room assignments across them.
LiveKit Cloud (their managed offering) handles the SFU scaling for you. Self-hosted requires you to run the LiveKit server — it’s a single Go binary that’s genuinely straightforward to deploy, but you own the ops.
Pipecat: The Flexible Pipeline Alternative
Pipecat is an open-source Python framework from Daily.co (the WebRTC infrastructure company behind daily-python). It takes a fundamentally different architectural stance from LiveKit Agents.
Where LiveKit Agents gives you an opinionated “connect to a room, here’s your pipeline” model, Pipecat gives you a pipeline DAG (directed acyclic graph) where you explicitly wire together frame processors. It’s more verbose but also more transparent and more flexible about what transport layer you use underneath.
The Frame Processor Model
Everything in Pipecat is a `FrameProcessor`. Audio moves through the pipeline as `AudioRawFrame` objects and text as `TextFrame` objects, while control frames such as `LLMFullResponseEndFrame` mark events like the end of an LLM response. You chain processors together, and frames flow through the pipeline from left to right.
AudioInput → VAD → STT → LLM → TTS → AudioOutput
Each processor in the chain can:
- Pass frames downstream unchanged
- Transform frames (STT converts audio frames to text frames)
- Generate new frames (TTS generates audio from text)
- Buffer frames (the LLM context aggregator collects tokens into a full response)
- Drop frames (VAD drops silence frames)
This model is explicit and debuggable. When something goes wrong in your pipeline, you can add a logging processor at any point and see exactly what frames are flowing through.
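To make those four behaviors concrete without pulling in Pipecat itself, here is a toy model of the frame-flow idea in plain Python, with generators standing in for frame processors. The `Frame` type and processor functions are invented for illustration, not Pipecat's actual classes:

```python
from dataclasses import dataclass


@dataclass
class Frame:
    kind: str     # "audio", "silence", "text"
    payload: str


def vad(frames):
    """Drop silence frames, pass everything else downstream."""
    return (f for f in frames if f.kind != "silence")


def stt(frames):
    """Transform audio frames into text frames; pass others unchanged."""
    for f in frames:
        if f.kind == "audio":
            yield Frame("text", "transcript:" + f.payload)
        else:
            yield f


def run_pipeline(frames, *processors):
    """Chain processors left to right, then realize the stream."""
    for p in processors:
        frames = p(frames)
    return list(frames)


out = run_pipeline(
    [Frame("audio", "hello"), Frame("silence", ""), Frame("audio", "world")],
    vad, stt,
)
# out is two text frames: transcript:hello and transcript:world
```

A "logging processor" in this model is just another generator that prints each frame before yielding it, which is exactly why debugging a real Pipecat pipeline feels so direct.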
Transport Layer Abstraction
Here’s Pipecat’s killer feature: the transport layer is swappable. The same pipeline code can run over:
- Daily.co — Daily’s WebRTC infrastructure (their own product, obviously well-supported)
- LiveKit — use Pipecat’s pipeline with LiveKit for WebRTC transport
- WebSocket — for simpler integrations or server-to-server scenarios
- Local — for testing and development, no network required
This means your pipeline code — the STT/LLM/TTS wiring, your prompt logic, your evaluation hooks — is completely decoupled from your infrastructure choice. You can start with Daily.co during development, switch to LiveKit when you need their mobile SDKs, and your Python code barely changes.
Hello World in Pipecat
```python
# Pipecat voice interview agent with Daily.co transport
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask, PipelineParams
from pipecat.frames.frames import LLMMessagesFrame, EndFrame
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.openai import OpenAILLMService, OpenAITTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyTransport, DailyParams


async def run_interview(room_url: str, token: str):
    transport = DailyTransport(
        room_url,
        token,
        "Interview Agent",
        DailyParams(
            audio_out_enabled=True,
            audio_in_enabled=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.5)),
        ),
    )

    stt = DeepgramSTTService(api_key=DEEPGRAM_API_KEY, model="nova-2")
    llm = OpenAILLMService(api_key=OPENAI_API_KEY, model="gpt-4o")
    tts = OpenAITTSService(
        api_key=OPENAI_API_KEY,
        voice="alloy",
        model="tts-1",
    )

    messages = [
        {
            "role": "system",
            "content": """You are a technical interviewer conducting a Python
            backend interview. Ask one question at a time. Listen carefully
            to answers. Be encouraging but thorough.""",
        },
        {
            "role": "user",  # bootstrap the conversation
            "content": "Start the interview.",
        },
    ]

    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),               # audio frames from Daily
        stt,                             # audio → text
        context_aggregator.user(),       # accumulate user speech
        llm,                             # text → LLM response
        tts,                             # LLM response → audio
        transport.output(),              # audio → Daily (to candidate)
        context_aggregator.assistant(),  # track assistant turns
    ])

    task = PipelineTask(
        pipeline,
        PipelineParams(allow_interruptions=True),
    )

    @transport.event_handler("on_first_participant_joined")
    async def on_first_participant_joined(transport, participant):
        await task.queue_frames([LLMMessagesFrame(messages)])

    @transport.event_handler("on_participant_left")
    async def on_participant_left(transport, participant, reason):
        await task.queue_frames([EndFrame()])

    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(run_interview(DAILY_ROOM_URL, DAILY_TOKEN))
```
The explicit pipeline wiring is more code than the LiveKit version, but you can see exactly what’s happening at each stage. When you need to add evaluation logic — say, capturing the transcript and scoring the candidate’s answer after each turn — you insert a custom FrameProcessor into the pipeline.
Custom Frame Processors
This is where Pipecat earns its keep for complex interview workflows:
```python
import asyncio

from pipecat.frames.frames import (
    Frame, TranscriptionFrame, LLMFullResponseEndFrame
)
from pipecat.processors.frame_processor import FrameProcessor


class InterviewEvaluator(FrameProcessor):
    """Captures candidate responses and scores them in the background."""

    def __init__(self, question_context: dict):
        super().__init__()
        self.question_context = question_context
        self.current_transcript = []

    async def process_frame(self, frame: Frame, direction):
        await super().process_frame(frame, direction)

        # Pass everything downstream unchanged
        await self.push_frame(frame, direction)

        # But also capture transcription frames for evaluation
        if isinstance(frame, TranscriptionFrame):
            self.current_transcript.append({
                "participant": frame.user_id,
                "text": frame.text,
                "timestamp": frame.timestamp,
            })

        # When the LLM finishes a turn (question asked),
        # score the previous answer in the background
        if isinstance(frame, LLMFullResponseEndFrame):
            if self.current_transcript:
                asyncio.create_task(
                    self._evaluate_response(self.current_transcript.copy())
                )
                self.current_transcript = []

    async def _evaluate_response(self, transcript: list):
        # Call your evaluation service without blocking the main pipeline.
        # evaluate_candidate_answer and store_evaluation are your own
        # application functions, not part of Pipecat.
        response = await evaluate_candidate_answer(
            transcript=transcript,
            question_context=self.question_context,
        )
        # Store result, update candidate profile, etc.
        await store_evaluation(response)
```
You insert InterviewEvaluator into the pipeline between the context aggregator and LLM, and it runs evaluation asynchronously without adding latency to the main conversation loop.
Pipecat with LiveKit Transport
If you want Pipecat’s pipeline flexibility but LiveKit’s infrastructure (mobile SDKs, egress, scalability), you can combine them:
```python
from pipecat.transports.services.livekit import LiveKitTransport, LiveKitParams

transport = LiveKitTransport(
    url=LIVEKIT_URL,
    token=LIVEKIT_TOKEN,
    room_name=ROOM_NAME,
    params=LiveKitParams(
        audio_out_enabled=True,
        audio_in_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)

# Rest of your pipeline code is identical to the Daily.co version
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant(),
])
```
The pipeline code is genuinely identical. Only the transport initialization changes. This is the combination I’d reach for if I were building a production system today: LiveKit infrastructure, Pipecat pipeline logic.
Direct Integration: When You Don’t Need a Framework
Both LiveKit Agents and Pipecat add abstraction layers. Sometimes you don’t want that. Direct integration means connecting your server directly to the provider’s real-time API — no SFU, no framework, just WebSocket or WebRTC connections.
The three main options in 2026 are:
OpenAI Realtime API
OpenAI Realtime supports both WebSocket and WebRTC connections. The WebSocket approach is server-side; the WebRTC approach is designed for browser-direct connections with your server acting as an ephemeral key provider.
```python
# Direct WebSocket connection to OpenAI Realtime
import asyncio
import base64
import json

import websockets


async def run_direct_interview(audio_input_queue: asyncio.Queue,
                               audio_output_queue: asyncio.Queue):
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }

    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a technical interviewer...",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 600,
                },
            },
        }))

        async def send_audio():
            while True:
                audio_chunk = await audio_input_queue.get()
                if audio_chunk is None:
                    break
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(audio_chunk).decode(),
                }))

        async def receive_events():
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    audio_data = base64.b64decode(event["delta"])
                    await audio_output_queue.put(audio_data)
                elif event["type"] == "response.audio_transcript.done":
                    print(f"Assistant said: {event['transcript']}")
                elif event["type"] == "conversation.item.input_audio_transcription.completed":
                    print(f"Candidate said: {event['transcript']}")

        await asyncio.gather(send_audio(), receive_events())
```
This is about 80 lines of code for a working voice conversation. No framework overhead. You control every byte.
The downside: you’re responsible for audio capture, audio playback, WebRTC signaling (if you go that route), connection management, reconnection logic, and error handling. That’s a lot of plumbing.
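One small piece of that plumbing, just to make it concrete: reconnection with exponential backoff. Here's a sketch with a pluggable `connect` coroutine factory, so the retry policy stays separate from any particular provider SDK:

```python
import asyncio
import random


async def connect_with_backoff(connect, max_attempts=5,
                               base_delay=0.5, max_delay=10.0):
    """Retry an async connect() factory with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except (OSError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of retries — surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # jitter spreads out reconnect storms after a provider outage
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```

With the `websockets` library you would use it as `ws = await connect_with_backoff(lambda: websockets.connect(url, additional_headers=headers))`. In a real client you also need to widen the exception tuple to the library's handshake errors, and to rebuild session state (the `session.update` message above) after every reconnect, which is exactly the kind of work the frameworks do for you.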
Gemini Live WebSocket
Gemini Live uses a similar pattern but with Google’s multimodal capabilities — you can send video frames alongside audio, which is compelling for video interviews (covered in Part 8).
```python
# Direct Gemini Live connection via the google-genai SDK
import asyncio

from google import genai


async def run_gemini_interview():
    client = genai.Client(api_key=GEMINI_API_KEY)

    config = {
        "generation_config": {
            "response_modalities": ["AUDIO"],
            "speech_config": {
                "voice_config": {
                    "prebuilt_voice_config": {"voice_name": "Aoede"}
                }
            },
        },
        "system_instruction": "You are a technical interviewer...",
    }

    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=config,
    ) as session:
        # Send audio input
        async def send_audio(audio_stream):
            async for chunk in audio_stream:
                await session.send(
                    input={"data": chunk, "mime_type": "audio/pcm"},
                    end_of_turn=False,
                )

        # Receive audio output
        async def receive_audio():
            async for response in session.receive():
                if response.data:
                    # PCM audio bytes — play them
                    yield response.data
                if response.text:
                    print(f"Transcript: {response.text}")

        # get_microphone_stream() and play_audio() are your own audio I/O
        # helpers, not part of the SDK
        await asyncio.gather(
            send_audio(get_microphone_stream()),
            play_audio(receive_audio()),
        )
```
Grok WebSocket
xAI’s Grok also supports a real-time WebSocket API for voice, following a similar pattern to OpenAI’s. The interface is close enough to the OpenAI Realtime API that adapters are straightforward — which we’ll cover in depth in Part 12 on multi-provider support.
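As a taste of that adapter pattern, the core move is to normalize provider-specific events into one internal shape so the rest of the app never sees provider names. The OpenAI event types below are the ones handled in the WebSocket snippet above; the internal `kind`/`role` schema is our own invention, and a Grok normalizer would slot in beside it once you've mapped its event names:

```python
def normalize_openai_event(event: dict):
    """Map OpenAI Realtime event types onto a provider-neutral shape."""
    etype = event.get("type")
    if etype == "response.audio.delta":
        return {"kind": "audio", "data": event["delta"]}
    if etype == "response.audio_transcript.done":
        return {"kind": "transcript", "role": "assistant",
                "text": event["transcript"]}
    if etype == "conversation.item.input_audio_transcription.completed":
        return {"kind": "transcript", "role": "user",
                "text": event["transcript"]}
    return None  # events the app doesn't care about


# A per-provider registry is then all the "adapter" you need at this layer
NORMALIZERS = {"openai": normalize_openai_event}
```

The receive loop calls `NORMALIZERS[provider](event)` and dispatches on `kind`, so adding a provider never touches the conversation logic.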
When Direct Makes Sense
Direct integration is the right choice when:
- You’re building a narrow single-provider integration (MVP, PoC, internal tool)
- You have existing WebSocket infrastructure you’re adding voice to
- You want zero framework dependencies and full control
- You need to minimize cold start time in serverless functions
- Your team is small and you can afford the plumbing work
The moment you need multi-provider support, browser WebRTC signaling, mobile clients, or recording — you’ll want one of the frameworks.
Comparison Matrix
Let me put the three options side by side on the dimensions that actually matter for a voice interview platform:
| Dimension | LiveKit Agents | Pipecat | Direct |
|---|---|---|---|
| Latency (P50) | 180–250ms | 200–300ms | 150–200ms |
| Latency (P95) | 300–400ms | 350–500ms | 250–350ms |
| Horizontal scaling | Native (worker pools) | Manual (process management) | DIY |
| Mobile SDKs | React Native, Flutter, Swift, Kotlin | Depends on transport | None |
| Recording | Built-in Egress | Transport-dependent | DIY |
| Multi-provider | Limited (SDK-specific) | Excellent (swap services) | Per-provider code |
| Implementation complexity | Medium | Medium-High | Low initially, High long-term |
| Vendor lock-in | LiveKit infrastructure | Low (transport-agnostic) | Provider-dependent |
| Python maturity | Excellent | Excellent | Excellent |
| Community / docs | Large, growing fast | Active, well-documented | N/A (provider docs) |
| Self-hostable | Yes (OSS server) | Yes (framework + any transport) | N/A |
A few notes on the latency numbers: direct integration is fastest because you eliminate the SFU hop. But in practice, the difference between 180ms and 200ms is imperceptible. The P95 numbers matter more — a 500ms tail latency is where users start noticing “lag” in a conversation.
Pipecat’s slightly higher P95 comes from Python pipeline serialization overhead — frames queued between processors add small amounts of latency that compound. In practice this is manageable, and the latest Pipecat versions have improved significantly on this front.
Decision Framework
Here’s how I’d think through the choice:
Use LiveKit Agents if:
- You need mobile clients (the SDK coverage is unmatched)
- You want managed recording and egress with minimal code
- Your team comes from a WebRTC or video conferencing background
- You plan to scale to thousands of concurrent sessions (LiveKit Cloud auto-scales)
- You want an opinionated, batteries-included experience
Use Pipecat if:
- You need multi-provider flexibility (swap OpenAI for Anthropic for Gemini without rewriting your pipeline)
- Your interview workflow is complex and you want explicit frame-level control
- You’re already using Daily.co infrastructure
- You want to combine LiveKit infrastructure with pipeline flexibility (Pipecat + LiveKit transport)
- You have a strong Python team that values explicit over implicit
Use Direct Integration if:
- You’re building a proof-of-concept or internal tool with a 2-week timeline
- You’re committing to a single provider for the foreseeable future
- You’re deploying to serverless (Cloudflare Workers, Lambda) where framework cold starts matter
- Your use case is simple enough that framework overhead isn’t worth it
- You want to learn how these APIs work before adding abstraction
The Combination Play
For a production voice interview platform, I’d actually recommend a hybrid:
Pipecat (pipeline) + LiveKit (transport + egress)
You get:
- Pipecat’s explicit, debuggable pipeline with easy custom processor insertion
- LiveKit’s WebRTC infrastructure with cross-platform mobile SDKs
- LiveKit Egress for GDPR-compliant recordings without custom code
- The ability to swap LLM/STT/TTS providers without touching infrastructure
The main downside is complexity: you’re learning two frameworks instead of one. But for a production system that needs to run for years, that investment pays off.
LiveKit Cloud vs. Self-Hosted: The Cost Analysis
If you choose LiveKit, you have two deployment options.
LiveKit Cloud
LiveKit Cloud charges for:
- Participant minutes: ~$0.006 per participant-minute
- Egress minutes: ~$0.025 per minute for audio recordings
For a voice interview platform with:
- 1,000 interviews/month
- Average 30 minutes each
- 2 participants (candidate + agent)
Monthly cost:
- Participant minutes: 1,000 × 30 × 2 × $0.006 = $360/month
- Egress/recordings: 1,000 × 30 × $0.025 = $750/month
- Total: ~$1,110/month
Self-Hosted LiveKit
LiveKit Server is a single Go binary. Deployed on a reasonable instance:
- 2 vCPU / 4GB RAM handles ~200 concurrent participants (for voice-only)
- On Hetzner: ~$8/month
- On AWS EC2 (t3.medium): ~$30/month
- 3-node cluster for HA: ~$90/month on Hetzner
At 1,000 interviews/month (assuming non-concurrent), a single small instance is sufficient. You eliminate the ~$1,110 Cloud cost and pay ~$30–90/month in compute.
The tradeoff: you own ops. LiveKit Server is genuinely simple to run — it’s a single binary with a YAML config file — but you’re responsible for uptime, upgrades, and capacity planning.
My recommendation: start with LiveKit Cloud during development (the free tier is generous for testing), switch to self-hosted once you’re generating enough interviews to justify the ops overhead. The break-even point is around 200–300 interviews/month.
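To sanity-check those numbers, here's a tiny cost model. The rates are the illustrative ones quoted above, and the self-hosted figure is compute only (the 3-node Hetzner estimate), which is why the pure-compute break-even lands well below the 200–300 interviews/month rule of thumb; that rule of thumb also prices in your ops time:

```python
PARTICIPANT_RATE = 0.006    # $ per participant-minute (LiveKit Cloud)
EGRESS_RATE = 0.025         # $ per recorded audio-minute
SELF_HOSTED_MONTHLY = 90.0  # 3-node Hetzner cluster, compute only


def cloud_cost(interviews: int, minutes: int = 30, participants: int = 2) -> float:
    """Monthly LiveKit Cloud bill: participant minutes plus audio egress."""
    participant = interviews * minutes * participants * PARTICIPANT_RATE
    egress = interviews * minutes * EGRESS_RATE
    return participant + egress


def break_even_interviews(minutes: int = 30, participants: int = 2) -> int:
    """Smallest monthly interview count where Cloud exceeds self-hosted compute."""
    n = 0
    while cloud_cost(n, minutes, participants) <= SELF_HOSTED_MONTHLY:
        n += 1
    return n


print(round(cloud_cost(1_000), 2))  # the ~$1,110/month figure from above
print(break_even_interviews())      # pure-compute break-even, in interviews/month
```

Plug in your own interview length and rates; the shape of the answer (Cloud scales linearly, self-hosted is a step function) is the real takeaway.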
```yaml
# livekit.yaml — minimal self-hosted config
port: 7880
rtc:
  tcp_port: 7881
  udp_port: 7882
  use_external_ip: true
keys:
  your-api-key: your-api-secret
logging:
  level: info
room:
  empty_timeout: 300   # close room after 5 min idle
  max_participants: 10
```
Run it: `livekit-server --config livekit.yaml`
That’s it. One command. No Kubernetes required at modest scale.
Pipecat with Different Transports
Since transport-swapping is Pipecat’s headline feature, let me show concretely how the same pipeline looks with two different transports.
With Daily.co
```python
from pipecat.transports.services.daily import DailyTransport, DailyParams

transport = DailyTransport(
    room_url="https://yourdomain.daily.co/interview-room",
    token="your-daily-token",
    bot_name="Interview Agent",
    params=DailyParams(
        audio_out_enabled=True,
        audio_in_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)
```
With LiveKit
```python
from pipecat.transports.services.livekit import LiveKitTransport, LiveKitParams

transport = LiveKitTransport(
    url="wss://your-livekit-server.com",
    token="your-livekit-token",
    room_name="interview-room-001",
    params=LiveKitParams(
        audio_out_enabled=True,
        audio_in_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)
```
The pipeline definition after this — `Pipeline([transport.input(), stt, llm, tts, transport.output()])` — is byte-for-byte identical. This is the abstraction Pipecat is selling, and it genuinely works.
With WebSocket (for testing or custom protocols)
```python
from pipecat.transports.network.websocket_server import (
    WebsocketServerTransport, WebsocketServerParams
)

transport = WebsocketServerTransport(
    host="0.0.0.0",
    port=8765,
    params=WebsocketServerParams(
        audio_out_enabled=True,
    ),
)
```
The WebSocket transport is useful for server-to-server testing, for integrations with telephony systems that speak WebSocket (Twilio Media Streams, for example), and for local development where you don’t want to spin up a WebRTC infrastructure.
Making the Call for the Interview Platform
After going through all three options in detail, here’s what I’m building the rest of this series around:
Primary stack: LiveKit Agents SDK
The reasons:
- The mobile SDK coverage is a product requirement, not optional
- LiveKit Egress eliminates months of recording infrastructure work
- The Agents SDK has matured rapidly — the v1.0 API is stable
- The worker pool model maps naturally to interview session management
- Horizontal scaling on LiveKit Cloud is “add more workers” — not a YAML adventure
Where Pipecat wins for complex pipelines: In Part 5 (Multi-Role Agents) and Part 6 (RAG), where we need custom processing steps between pipeline stages, I’ll show how to implement equivalent functionality on top of LiveKit by writing custom agent logic. The patterns are transferable even if you choose differently.
Direct integration for specific scenarios: Part 8 (Video Interview Integration with Gemini Live) will use a direct Gemini Live WebSocket connection because Gemini Live’s multimodal capabilities aren’t yet wrapped by either framework in a first-class way.
Quick Start: Running Your First Agent
Before Part 4, here’s the minimum you need to get a LiveKit agent running locally:
```bash
# Install dependencies
pip install "livekit-agents[openai,deepgram,silero]~=1.0"

# Set environment variables
export LIVEKIT_URL="ws://localhost:7880"
export LIVEKIT_API_KEY="devkey"
export LIVEKIT_API_SECRET="secret"
export OPENAI_API_KEY="sk-..."
export DEEPGRAM_API_KEY="..."

# Run the agent (from the code example above)
python interview_agent.py dev
```
The `dev` subcommand starts the agent in development mode — it watches for file changes, auto-restarts, and gives you a built-in playground URL to test in your browser without building a frontend.
When you load the playground URL, you’ll hear: “Hello! I’m your interviewer today…”
That’s it. You’re running a voice interview agent in under 20 minutes.
What’s Next
In Part 4, we go deep on the components inside the pipeline: which STT model to use when, how to choose between GPT-4o and Claude 3.5 for interview logic, and TTS voice selection that doesn’t make candidates want to hang up.
We’ll also cover the latency math in detail — STT adds 100–150ms, LLM first-token adds 200–800ms, TTS streaming adds 50–100ms — and how to structure your pipeline to bring end-to-end latency below 600ms even on cheaper model tiers.
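A quick sanity check on that arithmetic: summing the quoted component ranges shows both the spread and how much of a 600ms budget the LLM's first token can consume (the figures below are the ones stated above, assumed rather than measured):

```python
# Latency component ranges in milliseconds, as quoted in the text
components = {
    "stt": (100, 150),
    "llm_first_token": (200, 800),
    "tts_streaming": (50, 100),
}

best = sum(lo for lo, _ in components.values())   # best case: 350 ms
worst = sum(hi for _, hi in components.values())  # worst case: 1050 ms

# With best-case STT and TTS, the remaining budget for the LLM's
# first token under a 600 ms end-to-end target:
llm_budget = 600 - components["stt"][0] - components["tts_streaming"][0]  # 450 ms
```

In other words, the sub-600ms target lives or dies on keeping LLM first-token latency under roughly 450ms, which is exactly where model-tier choices matter in Part 4.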
This is Part 3 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (this post)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
- Cost Optimization — From $0.14/min to $0.03/min (Part 11)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)