In Part 2, we chose the hybrid architecture — cascaded pipeline for multi-turn interviews with rich evaluation, speech-to-speech for quick practice sessions. Now we need to decide where that architecture lives. What framework do we build on top of?
This is the question I got wrong on my first voice AI project. I picked the shiniest option without thinking through what I actually needed six months later. So let me save you that pain.
There are three meaningful choices in 2026:
- LiveKit — WebRTC infrastructure with an Agents SDK built for exactly this use case
- Pipecat — a Python pipeline framework by Daily.co that stays vendor-neutral and treats transports as swappable components
- Direct integration — connect straight to provider APIs (Gemini Live WebSocket, OpenAI Realtime WebRTC, Grok WebSocket) with no intermediary framework
None of these is universally correct. The right answer depends on your team’s size, existing infrastructure, and how much abstraction you want to pay for in complexity. Let me walk through each one properly, then give you a decision framework you can actually use.
LiveKit: WebRTC Infrastructure Done Right
LiveKit started as an open-source WebRTC Selective Forwarding Unit (SFU). If you’ve never worked with WebRTC infrastructure before, an SFU is the server that sits in the middle of a multi-party call — it receives media streams from each participant and forwards the right streams to the right people, without doing the expensive work of mixing or transcoding.
That foundation matters for voice interviews because you’re not just doing audio processing in isolation. You’re connecting a browser client, an AI agent, potentially a recording system, and maybe a video feed — all in real time, all with latency that can’t exceed 300ms before the conversation starts to feel broken.
The Room Model
LiveKit organizes everything around rooms. A room is a real-time session where participants (humans, AI agents, recording egress jobs) connect and share media tracks. Each participant publishes audio/video tracks and subscribes to others.
For a voice interview platform, this maps beautifully:
- The candidate connects to a room via the browser SDK
- Your AI agent connects as a participant with its own audio track
- An egress job can record the entire room to S3 for compliance
- A supervisor dashboard could join as a silent observer
The room model gives you a natural session boundary. When the interview ends, you close the room. Everything — recording, transcripts, agent state — is scoped to that room identity.
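Room scoping is enforced with per-participant access tokens. In production you would mint these with the `livekit-api` package (`AccessToken(...).with_identity(...).with_grants(api.VideoGrants(room_join=True, room=...)).to_jwt()`), but since a LiveKit token is just an HS256 JWT whose `video` claim carries the room grant, a stdlib sketch makes the session scoping visible. All names and values below are placeholders:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def make_room_token(api_key: str, api_secret: str,
                    identity: str, room: str, ttl_s: int = 3600) -> str:
    """Build an HS256 JWT scoped to a single room (illustrative only)."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                  # which API key signed the token
        "sub": identity,                 # the participant's identity
        "exp": int(time.time()) + ttl_s,
        "video": {"roomJoin": True, "room": room},  # the room-scoped grant
    }
    signing_input = (
        b64url(json.dumps(header).encode()) + "." +
        b64url(json.dumps(claims).encode())
    )
    sig = hmac.new(api_secret.encode(), signing_input.encode(),
                   hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


token = make_room_token("devkey", "secret", "candidate-42", "interview-room-001")
```

The point is the `video` claim: the token, not the client, decides which room a participant can join, so closing the room really does end everything the token granted.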
LiveKit Agents SDK
The Agents SDK is what makes LiveKit genuinely interesting for AI voice applications. It was built specifically to solve the “how do I run an AI agent in a LiveKit room” problem, and it does it well.
Here’s the architecture: you write an AgentWorker process that registers with the LiveKit server. When a room is created (or when certain conditions are met), the server dispatches a job to your worker pool. Your worker connects to the room as a participant and starts processing audio.
The Agents SDK handles the gnarly parts:
- VAD (Voice Activity Detection) integrated via Silero
- Turn-taking logic so the agent doesn’t interrupt the human mid-sentence
- Pipeline orchestration between STT → LLM → TTS
- Graceful reconnection if the agent loses its server connection
```python
# LiveKit Agents SDK — basic voice interview agent
from livekit import agents
from livekit.agents import AgentSession, Agent, RoomInputOptions
from livekit.plugins import openai, silero, deepgram, noise_cancellation


class InterviewAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a technical interviewer conducting a
            30-minute Python backend interview. Ask one question at a time.
            Listen to the candidate's answer before moving to the next question.
            Be encouraging but thorough."""
        )


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=deepgram.STT(model="nova-2"),
        llm=openai.LLM(model="gpt-4o"),
        tts=openai.TTS(voice="alloy"),
        vad=silero.VAD.load(),  # Silero VAD drives turn detection
    )

    await session.start(
        room=ctx.room,
        agent=InterviewAgent(),
        room_input_options=RoomInputOptions(
            noise_cancellation=noise_cancellation.BVC()
        ),
    )

    # Kick off the interview
    await session.say(
        "Hello! I'm your interviewer today. Let's start with something "
        "straightforward — can you explain the difference between "
        "processes and threads in Python?"
    )


if __name__ == "__main__":
    agents.cli.run_app(
        agents.WorkerOptions(entrypoint_fnc=entrypoint)
    )
```
This is clean. The Agents SDK abstracts away the WebRTC complexity and the media pipeline wiring. You focus on the interview logic.
Cross-Platform SDKs
One of LiveKit’s real strengths is SDK coverage. They maintain official SDKs for:
- Web/React: `@livekit/components-react` — pre-built UI components for audio/video
- React Native: `@livekit/react-native` — same API, works on iOS and Android
- Flutter: `livekit_client` — Dart package for cross-platform mobile
- Swift/iOS: `LiveKit-Swift` — native iOS SDK
- Kotlin/Android: `livekit-android` — native Android SDK
- Python: server-side and agent SDK
- Go, Rust, Unity: community and official variants
If you’re building a product that needs to run on web, iOS, and Android from a single voice AI platform, LiveKit is the only choice that doesn’t require you to maintain three separate WebRTC stacks.
Egress for Recording
LiveKit Egress is the recording system. It connects to a room as a hidden participant and can:
- Record the entire room to a file (MP4 or OGG)
- Stream to an RTMP endpoint (YouTube, Twitch, your own ingest)
- Capture just an audio track from a specific participant
- Produce room-level composite recordings with mixed audio
For a voice interview platform with GDPR or HIPAA requirements (covered in Part 9), this is enormous. You don’t need to build your own recording pipeline. You start an egress job via API when the interview begins, stop it when it ends, and the recording lands in your S3 bucket.
```python
# Starting a recording egress via LiveKit API
from livekit import api


async def start_interview_recording(room_name: str, candidate_id: str):
    lk = api.LiveKitAPI(
        url=LIVEKIT_URL,
        api_key=LIVEKIT_API_KEY,
        api_secret=LIVEKIT_API_SECRET,
    )

    req = api.RoomCompositeEgressRequest(
        room_name=room_name,
        audio_only=True,  # voice interview — no video needed
        file_outputs=[
            api.EncodedFileOutput(
                file_type=api.EncodedFileType.OGG,
                filepath=f"interviews/{candidate_id}/{room_name}.ogg",
                s3=api.S3Upload(
                    access_key=AWS_ACCESS_KEY,
                    secret=AWS_SECRET,
                    bucket="interview-recordings",
                    region="us-east-1",
                ),
            )
        ],
    )
    response = await lk.egress.start_room_composite_egress(req)
    return response.egress_id
```
Horizontal Scaling
LiveKit is designed to scale horizontally. Multiple agent worker processes connect to the same LiveKit server (or cluster), and jobs are dispatched across the pool automatically. You can run 100 agent workers and the LiveKit server load-balances room assignments across them.
LiveKit Cloud (their managed offering) handles the SFU scaling for you. Self-hosted requires you to run the LiveKit server — it’s a single Go binary that’s genuinely straightforward to deploy, but you own the ops.
Pipecat: The Flexible Pipeline Alternative
Pipecat is an open-source Python framework from Daily.co (the WebRTC infrastructure company behind daily-python). It takes a fundamentally different architectural stance from LiveKit Agents.
Where LiveKit Agents gives you an opinionated “connect to a room, here’s your pipeline” model, Pipecat gives you a pipeline DAG (directed acyclic graph) where you explicitly wire together frame processors. It’s more verbose but also more transparent and more flexible about what transport layer you use underneath.
The Frame Processor Model
Everything in Pipecat is a `FrameProcessor`. Audio moves through the pipeline as `AudioRawFrame` objects and text as `TextFrame` objects, while control frames such as `LLMFullResponseEndFrame` mark events like the end of an LLM response. You chain processors together, and frames flow through the pipeline from left to right.
AudioInput → VAD → STT → LLM → TTS → AudioOutput
Each processor in the chain can:
- Pass frames downstream unchanged
- Transform frames (STT converts audio frames to text frames)
- Generate new frames (TTS generates audio from text)
- Buffer frames (the LLM context aggregator collects tokens into a full response)
- Drop frames (VAD drops silence frames)
This model is explicit and debuggable. When something goes wrong in your pipeline, you can add a logging processor at any point and see exactly what frames are flowing through.
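To make those four behaviors concrete without pulling in Pipecat itself, here is a toy model of the frame-flow idea in plain Python, with generators standing in for frame processors. The `Frame` type and processor functions are invented for illustration, not Pipecat's actual classes:

```python
from dataclasses import dataclass


@dataclass
class Frame:
    kind: str     # "audio", "silence", "text"
    payload: str


def vad(frames):
    """Drop silence frames, pass everything else downstream."""
    return (f for f in frames if f.kind != "silence")


def stt(frames):
    """Transform audio frames into text frames; pass others unchanged."""
    for f in frames:
        if f.kind == "audio":
            yield Frame("text", "transcript:" + f.payload)
        else:
            yield f


def run_pipeline(frames, *processors):
    """Chain processors left to right, then realize the stream."""
    for p in processors:
        frames = p(frames)
    return list(frames)


out = run_pipeline(
    [Frame("audio", "hello"), Frame("silence", ""), Frame("audio", "world")],
    vad, stt,
)
# out is two text frames: transcript:hello and transcript:world
```

A "logging processor" in this model is just another generator that prints each frame before yielding it, which is exactly why debugging a real Pipecat pipeline feels so direct.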
Transport Layer Abstraction
Here’s Pipecat’s killer feature: the transport layer is swappable. The same pipeline code can run over:
- Daily.co — Daily’s WebRTC infrastructure (their own product, obviously well-supported)
- LiveKit — use Pipecat’s pipeline with LiveKit for WebRTC transport
- WebSocket — for simpler integrations or server-to-server scenarios
- Local — for testing and development, no network required
This means your pipeline code — the STT/LLM/TTS wiring, your prompt logic, your evaluation hooks — is completely decoupled from your infrastructure choice. You can start with Daily.co during development, switch to LiveKit when you need their mobile SDKs, and your Python code barely changes.
Hello World in Pipecat
```python
# Pipecat voice interview agent with Daily.co transport
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask, PipelineParams
from pipecat.frames.frames import LLMMessagesFrame, EndFrame
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.openai import OpenAILLMService, OpenAITTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyTransport, DailyParams


async def run_interview(room_url: str, token: str):
    transport = DailyTransport(
        room_url,
        token,
        "Interview Agent",
        DailyParams(
            audio_out_enabled=True,
            audio_in_enabled=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.5)),
        ),
    )

    stt = DeepgramSTTService(api_key=DEEPGRAM_API_KEY, model="nova-2")
    llm = OpenAILLMService(api_key=OPENAI_API_KEY, model="gpt-4o")
    tts = OpenAITTSService(
        api_key=OPENAI_API_KEY,
        voice="alloy",
        model="tts-1",
    )

    messages = [
        {
            "role": "system",
            "content": """You are a technical interviewer conducting a Python
            backend interview. Ask one question at a time. Listen carefully
            to answers. Be encouraging but thorough.""",
        },
        {
            "role": "user",  # bootstrap the conversation
            "content": "Start the interview.",
        },
    ]

    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),               # audio frames from Daily
        stt,                             # audio → text
        context_aggregator.user(),       # accumulate user speech
        llm,                             # text → LLM response
        tts,                             # LLM response → audio
        transport.output(),              # audio → Daily (to candidate)
        context_aggregator.assistant(),  # track assistant turns
    ])

    task = PipelineTask(
        pipeline,
        PipelineParams(allow_interruptions=True),
    )

    @transport.event_handler("on_first_participant_joined")
    async def on_first_participant_joined(transport, participant):
        await task.queue_frames([LLMMessagesFrame(messages)])

    @transport.event_handler("on_participant_left")
    async def on_participant_left(transport, participant, reason):
        await task.queue_frames([EndFrame()])

    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(run_interview(DAILY_ROOM_URL, DAILY_TOKEN))
```
The explicit pipeline wiring is more code than the LiveKit version, but you can see exactly what’s happening at each stage. When you need to add evaluation logic — say, capturing the transcript and scoring the candidate’s answer after each turn — you insert a custom FrameProcessor into the pipeline.
Custom Frame Processors
This is where Pipecat earns its keep for complex interview workflows:
```python
import asyncio

from pipecat.frames.frames import (
    Frame, TranscriptionFrame, LLMFullResponseEndFrame
)
from pipecat.processors.frame_processor import FrameProcessor


class InterviewEvaluator(FrameProcessor):
    """Captures candidate responses and scores them in the background."""

    def __init__(self, question_context: dict):
        super().__init__()
        self.question_context = question_context
        self.current_transcript = []

    async def process_frame(self, frame: Frame, direction):
        await super().process_frame(frame, direction)

        # Pass everything downstream unchanged
        await self.push_frame(frame, direction)

        # But also capture transcription frames for evaluation
        if isinstance(frame, TranscriptionFrame):
            self.current_transcript.append({
                "participant": frame.user_id,
                "text": frame.text,
                "timestamp": frame.timestamp,
            })

        # When the LLM finishes a turn (question asked),
        # score the previous answer in the background
        if isinstance(frame, LLMFullResponseEndFrame):
            if self.current_transcript:
                asyncio.create_task(
                    self._evaluate_response(self.current_transcript.copy())
                )
                self.current_transcript = []

    async def _evaluate_response(self, transcript: list):
        # Call your evaluation service without blocking the main pipeline.
        # evaluate_candidate_answer and store_evaluation are your own
        # application functions, not part of Pipecat.
        response = await evaluate_candidate_answer(
            transcript=transcript,
            question_context=self.question_context,
        )
        # Store result, update candidate profile, etc.
        await store_evaluation(response)
```
You insert InterviewEvaluator into the pipeline between the context aggregator and LLM, and it runs evaluation asynchronously without adding latency to the main conversation loop.
Pipecat with LiveKit Transport
If you want Pipecat’s pipeline flexibility but LiveKit’s infrastructure (mobile SDKs, egress, scalability), you can combine them:
```python
from pipecat.transports.services.livekit import LiveKitTransport, LiveKitParams

transport = LiveKitTransport(
    url=LIVEKIT_URL,
    token=LIVEKIT_TOKEN,
    room_name=ROOM_NAME,
    params=LiveKitParams(
        audio_out_enabled=True,
        audio_in_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)

# Rest of your pipeline code is identical to the Daily.co version
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant(),
])
```
The pipeline code is genuinely identical. Only the transport initialization changes. This is the combination I’d reach for if I were building a production system today: LiveKit infrastructure, Pipecat pipeline logic.
Direct Integration: When You Don’t Need a Framework
Both LiveKit Agents and Pipecat add abstraction layers. Sometimes you don’t want that. Direct integration means connecting your server directly to the provider’s real-time API — no SFU, no framework, just WebSocket or WebRTC connections.
The three main options in 2026 are:
OpenAI Realtime API
OpenAI Realtime supports both WebSocket and WebRTC connections. The WebSocket approach is server-side; the WebRTC approach is designed for browser-direct connections with your server acting as an ephemeral key provider.
```python
# Direct WebSocket connection to OpenAI Realtime
import asyncio
import base64
import json

import websockets


async def run_direct_interview(audio_input_queue: asyncio.Queue,
                               audio_output_queue: asyncio.Queue):
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }

    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a technical interviewer...",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 600,
                },
            },
        }))

        async def send_audio():
            while True:
                audio_chunk = await audio_input_queue.get()
                if audio_chunk is None:
                    break
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(audio_chunk).decode(),
                }))

        async def receive_events():
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    audio_data = base64.b64decode(event["delta"])
                    await audio_output_queue.put(audio_data)
                elif event["type"] == "response.audio_transcript.done":
                    print(f"Assistant said: {event['transcript']}")
                elif event["type"] == "conversation.item.input_audio_transcription.completed":
                    print(f"Candidate said: {event['transcript']}")

        await asyncio.gather(send_audio(), receive_events())
```
This is about 80 lines of code for a working voice conversation. No framework overhead. You control every byte.
The downside: you’re responsible for audio capture, audio playback, WebRTC signaling (if you go that route), connection management, reconnection logic, and error handling. That’s a lot of plumbing.
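One small piece of that plumbing, just to make it concrete: reconnection with exponential backoff. Here's a sketch with a pluggable `connect` coroutine factory, so the retry policy stays separate from any particular provider SDK:

```python
import asyncio
import random


async def connect_with_backoff(connect, max_attempts=5,
                               base_delay=0.5, max_delay=10.0):
    """Retry an async connect() factory with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except (OSError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of retries — surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # jitter spreads out reconnect storms after a provider outage
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```

With the `websockets` library you would use it as `ws = await connect_with_backoff(lambda: websockets.connect(url, additional_headers=headers))`. In a real client you also need to widen the exception tuple to the library's handshake errors, and to rebuild session state (the `session.update` message above) after every reconnect, which is exactly the kind of work the frameworks do for you.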
Gemini Live WebSocket
Gemini Live uses a similar pattern but with Google’s multimodal capabilities — you can send video frames alongside audio, which is compelling for video interviews (covered in Part 8).
```python
# Direct Gemini Live connection via the google-genai SDK
import asyncio

from google import genai


async def run_gemini_interview():
    client = genai.Client(api_key=GEMINI_API_KEY)

    config = {
        "generation_config": {
            "response_modalities": ["AUDIO"],
            "speech_config": {
                "voice_config": {
                    "prebuilt_voice_config": {"voice_name": "Aoede"}
                }
            },
        },
        "system_instruction": "You are a technical interviewer...",
    }

    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=config,
    ) as session:
        # Send audio input
        async def send_audio(audio_stream):
            async for chunk in audio_stream:
                await session.send(
                    input={"data": chunk, "mime_type": "audio/pcm"},
                    end_of_turn=False,
                )

        # Receive audio output
        async def receive_audio():
            async for response in session.receive():
                if response.data:
                    # PCM audio bytes — play them
                    yield response.data
                if response.text:
                    print(f"Transcript: {response.text}")

        # get_microphone_stream() and play_audio() are your own audio I/O
        # helpers, not part of the SDK
        await asyncio.gather(
            send_audio(get_microphone_stream()),
            play_audio(receive_audio()),
        )
```
Grok WebSocket
xAI’s Grok also supports a real-time WebSocket API for voice, following a similar pattern to OpenAI’s. The interface is close enough to the OpenAI Realtime API that adapters are straightforward — which we’ll cover in depth in Part 12 on multi-provider support.
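As a taste of that adapter pattern, the core move is to normalize provider-specific events into one internal shape so the rest of the app never sees provider names. The OpenAI event types below are the ones handled in the WebSocket snippet above; the internal `kind`/`role` schema is our own invention, and a Grok normalizer would slot in beside it once you've mapped its event names:

```python
def normalize_openai_event(event: dict):
    """Map OpenAI Realtime event types onto a provider-neutral shape."""
    etype = event.get("type")
    if etype == "response.audio.delta":
        return {"kind": "audio", "data": event["delta"]}
    if etype == "response.audio_transcript.done":
        return {"kind": "transcript", "role": "assistant",
                "text": event["transcript"]}
    if etype == "conversation.item.input_audio_transcription.completed":
        return {"kind": "transcript", "role": "user",
                "text": event["transcript"]}
    return None  # events the app doesn't care about


# A per-provider registry is then all the "adapter" you need at this layer
NORMALIZERS = {"openai": normalize_openai_event}
```

The receive loop calls `NORMALIZERS[provider](event)` and dispatches on `kind`, so adding a provider never touches the conversation logic.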
When Direct Makes Sense
Direct integration is the right choice when:
- You’re building a narrow single-provider integration (MVP, PoC, internal tool)
- You have existing WebSocket infrastructure you’re adding voice to
- You want zero framework dependencies and full control
- You need to minimize cold start time in serverless functions
- Your team is small and you can afford the plumbing work
The moment you need multi-provider support, browser WebRTC signaling, mobile clients, or recording — you’ll want one of the frameworks.
Comparison Matrix
Let me put the three options side by side on the dimensions that actually matter for a voice interview platform:
| Dimension | LiveKit Agents | Pipecat | Direct |
|---|---|---|---|
| Latency (P50) | 180–250ms | 200–300ms | 150–200ms |
| Latency (P95) | 300–400ms | 350–500ms | 250–350ms |
| Horizontal scaling | Native (worker pools) | Manual (process management) | DIY |
| Mobile SDKs | React Native, Flutter, Swift, Kotlin | Depends on transport | None |
| Recording | Built-in Egress | Transport-dependent | DIY |
| Multi-provider | Limited (SDK-specific) | Excellent (swap services) | Per-provider code |
| Implementation complexity | Medium | Medium-High | Low initially, High long-term |
| Vendor lock-in | LiveKit infrastructure | Low (transport-agnostic) | Provider-dependent |
| Python maturity | Excellent | Excellent | Excellent |
| Community / docs | Large, growing fast | Active, well-documented | N/A (provider docs) |
| Self-hostable | Yes (OSS server) | Yes (framework + any transport) | N/A |
A few notes on the latency numbers: direct integration is fastest because you eliminate the SFU hop. But in practice, the difference between 180ms and 200ms is imperceptible. The P95 numbers matter more — a 500ms tail latency is where users start noticing “lag” in a conversation.
Pipecat’s slightly higher P95 comes from Python pipeline serialization overhead — frames queued between processors add small amounts of latency that compound. In practice this is manageable, and the latest Pipecat versions have improved significantly on this front.
Decision Framework
Here’s how I’d think through the choice:
Use LiveKit Agents if:
- You need mobile clients (the SDK coverage is unmatched)
- You want managed recording and egress with minimal code
- Your team comes from a WebRTC or video conferencing background
- You plan to scale to thousands of concurrent sessions (LiveKit Cloud auto-scales)
- You want an opinionated, batteries-included experience
Use Pipecat if:
- You need multi-provider flexibility (swap OpenAI for Anthropic for Gemini without rewriting your pipeline)
- Your interview workflow is complex and you want explicit frame-level control
- You’re already using Daily.co infrastructure
- You want to combine LiveKit infrastructure with pipeline flexibility (Pipecat + LiveKit transport)
- You have a strong Python team that values explicit over implicit
Use Direct Integration if:
- You’re building a proof-of-concept or internal tool with a 2-week timeline
- You’re committing to a single provider for the foreseeable future
- You’re deploying to serverless (Cloudflare Workers, Lambda) where framework cold starts matter
- Your use case is simple enough that framework overhead isn’t worth it
- You want to learn how these APIs work before adding abstraction
The Combination Play
For a production voice interview platform, I’d actually recommend a hybrid:
Pipecat (pipeline) + LiveKit (transport + egress)
You get:
- Pipecat’s explicit, debuggable pipeline with easy custom processor insertion
- LiveKit’s WebRTC infrastructure with cross-platform mobile SDKs
- LiveKit Egress for GDPR-compliant recordings without custom code
- The ability to swap LLM/STT/TTS providers without touching infrastructure
The main downside is complexity: you’re learning two frameworks instead of one. But for a production system that needs to run for years, that investment pays off.
LiveKit Cloud vs. Self-Hosted: The Cost Analysis
If you choose LiveKit, you have two deployment options.
LiveKit Cloud
LiveKit Cloud charges for:
- Participant minutes: ~$0.006 per participant-minute
- Egress minutes: ~$0.025 per minute for audio recordings
For a voice interview platform with:
- 1,000 interviews/month
- Average 30 minutes each
- 2 participants (candidate + agent)
Monthly cost:
- Participant minutes: 1,000 × 30 × 2 × $0.006 = $360/month
- Egress/recordings: 1,000 × 30 × $0.025 = $750/month
- Total: ~$1,110/month
Self-Hosted LiveKit
LiveKit Server is a single Go binary. Deployed on a reasonable instance:
- 2 vCPU / 4GB RAM handles ~200 concurrent participants (for voice-only)
- On Hetzner: ~$8/month
- On AWS EC2 (t3.medium): ~$30/month
- 3-node cluster for HA: ~$90/month on Hetzner
At 1,000 interviews/month (assuming non-concurrent), a single small instance is sufficient. You eliminate the ~$1,110 Cloud cost and pay ~$30–90/month in compute.
The tradeoff: you own ops. LiveKit Server is genuinely simple to run — it’s a single binary with a YAML config file — but you’re responsible for uptime, upgrades, and capacity planning.
My recommendation: start with LiveKit Cloud during development (the free tier is generous for testing), switch to self-hosted once you’re generating enough interviews to justify the ops overhead. The break-even point is around 200–300 interviews/month.
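To sanity-check those numbers, here's a tiny cost model. The rates are the illustrative ones quoted above, and the self-hosted figure is compute only (the 3-node Hetzner estimate), which is why the pure-compute break-even lands well below the 200–300 interviews/month rule of thumb; that rule of thumb also prices in your ops time:

```python
PARTICIPANT_RATE = 0.006    # $ per participant-minute (LiveKit Cloud)
EGRESS_RATE = 0.025         # $ per recorded audio-minute
SELF_HOSTED_MONTHLY = 90.0  # 3-node Hetzner cluster, compute only


def cloud_cost(interviews: int, minutes: int = 30, participants: int = 2) -> float:
    """Monthly LiveKit Cloud bill: participant minutes plus audio egress."""
    participant = interviews * minutes * participants * PARTICIPANT_RATE
    egress = interviews * minutes * EGRESS_RATE
    return participant + egress


def break_even_interviews(minutes: int = 30, participants: int = 2) -> int:
    """Smallest monthly interview count where Cloud exceeds self-hosted compute."""
    n = 0
    while cloud_cost(n, minutes, participants) <= SELF_HOSTED_MONTHLY:
        n += 1
    return n


print(round(cloud_cost(1_000), 2))  # the ~$1,110/month figure from above
print(break_even_interviews())      # pure-compute break-even, in interviews/month
```

Plug in your own interview length and rates; the shape of the answer (Cloud scales linearly, self-hosted is a step function) is the real takeaway.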
```yaml
# livekit.yaml — minimal self-hosted config
port: 7880
rtc:
  tcp_port: 7881
  udp_port: 7882
  use_external_ip: true
keys:
  your-api-key: your-api-secret
logging:
  level: info
room:
  empty_timeout: 300   # close room after 5 min idle
  max_participants: 10
```
Run it: `livekit-server --config livekit.yaml`
That’s it. One command. No Kubernetes required at modest scale.
Pipecat with Different Transports
Since transport-swapping is Pipecat’s headline feature, let me show concretely how the same pipeline looks with two different transports.
With Daily.co
```python
from pipecat.transports.services.daily import DailyTransport, DailyParams

transport = DailyTransport(
    room_url="https://yourdomain.daily.co/interview-room",
    token="your-daily-token",
    bot_name="Interview Agent",
    params=DailyParams(
        audio_out_enabled=True,
        audio_in_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)
```
With LiveKit
```python
from pipecat.transports.services.livekit import LiveKitTransport, LiveKitParams

transport = LiveKitTransport(
    url="wss://your-livekit-server.com",
    token="your-livekit-token",
    room_name="interview-room-001",
    params=LiveKitParams(
        audio_out_enabled=True,
        audio_in_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)
```
The pipeline definition after this — `Pipeline([transport.input(), stt, llm, tts, transport.output()])` — is byte-for-byte identical. This is the abstraction Pipecat is selling, and it genuinely works.
With WebSocket (for testing or custom protocols)
```python
from pipecat.transports.network.websocket_server import (
    WebsocketServerTransport, WebsocketServerParams
)

transport = WebsocketServerTransport(
    host="0.0.0.0",
    port=8765,
    params=WebsocketServerParams(
        audio_out_enabled=True,
    ),
)
```
The WebSocket transport is useful for server-to-server testing, for integrations with telephony systems that speak WebSocket (Twilio Media Streams, for example), and for local development where you don’t want to spin up a WebRTC infrastructure.
Making the Call for the Interview Platform
After going through all three options in detail, here’s what I’m building the rest of this series around:
Primary stack: LiveKit Agents SDK
The reasons:
- The mobile SDK coverage is a product requirement, not optional
- LiveKit Egress eliminates months of recording infrastructure work
- The Agents SDK has matured rapidly — the v1.0 API is stable
- The worker pool model maps naturally to interview session management
- Horizontal scaling on LiveKit Cloud is “add more workers” — not a YAML adventure
Where Pipecat wins for complex pipelines: In Part 5 (Multi-Role Agents) and Part 6 (RAG), where we need custom processing steps between pipeline stages, I’ll show how to implement equivalent functionality on top of LiveKit by writing custom agent logic. The patterns are transferable even if you choose differently.
Direct integration for specific scenarios: Part 8 (Video Interview Integration with Gemini Live) will use a direct Gemini Live WebSocket connection because Gemini Live’s multimodal capabilities aren’t yet wrapped by either framework in a first-class way.
Quick Start: Running Your First Agent
Before Part 4, here’s the minimum you need to get a LiveKit agent running locally:
```bash
# Install dependencies
pip install "livekit-agents[openai,deepgram,silero]~=1.0"

# Set environment variables
export LIVEKIT_URL="ws://localhost:7880"
export LIVEKIT_API_KEY="devkey"
export LIVEKIT_API_SECRET="secret"
export OPENAI_API_KEY="sk-..."
export DEEPGRAM_API_KEY="..."

# Run the agent (from the code example above)
python interview_agent.py dev
```
The `dev` subcommand starts the agent in development mode — it watches for file changes, auto-restarts, and gives you a built-in playground URL to test in your browser without building a frontend.
When you load the playground URL, you’ll hear: “Hello! I’m your interviewer today…”
That’s it. You’re running a voice interview agent in under 20 minutes.
What’s Next
In Part 4, we go deep on the components inside the pipeline: which STT model to use when, how to choose between GPT-4o and Claude 3.5 for interview logic, and TTS voice selection that doesn’t make candidates want to hang up.
We’ll also cover the latency math in detail — STT adds 100–150ms, LLM first-token adds 200–800ms, TTS streaming adds 50–100ms — and how to structure your pipeline to bring end-to-end latency below 600ms even on cheaper model tiers.
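A quick sanity check on that arithmetic: summing the quoted component ranges shows both the spread and how much of a 600ms budget the LLM's first token can consume (the figures below are the ones stated above, assumed rather than measured):

```python
# Latency component ranges in milliseconds, as quoted in the text
components = {
    "stt": (100, 150),
    "llm_first_token": (200, 800),
    "tts_streaming": (50, 100),
}

best = sum(lo for lo, _ in components.values())   # best case: 350 ms
worst = sum(hi for _, hi in components.values())  # worst case: 1050 ms

# With best-case STT and TTS, the remaining budget for the LLM's
# first token under a 600 ms end-to-end target:
llm_budget = 600 - components["stt"][0] - components["tts_streaming"][0]  # 450 ms
```

In other words, the sub-600ms target lives or dies on keeping LLM first-token latency under roughly 450ms, which is exactly where model-tier choices matter in Part 4.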
This is Part 3 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (this post)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
- Cost Optimization — From $0.14/min to $0.03/min (Part 11)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)