In the Voice AI Interview Playbook series, we covered the full stack — cascaded pipelines, speech-to-speech models, framework selection, scaling, cost optimization, and multi-provider failover across twelve posts. That series is the reference architecture. This post is the build guide.
Specifically: how to build a production voice interview agent using only the speech-to-speech approach. No cascaded STT → LLM → TTS pipeline. No intermediate text. Audio goes in, audio comes out. One model does the hearing, thinking, and speaking.
The core innovation here isn’t the S2S model itself — it’s the dynamic prompt system that generates system prompts programmatically from interview configurations. We call this the “researcher approach” because every prompt is a reproducible data product, not a hand-crafted template. Change the config, get a different prompt. Version the config, version the prompt. Test the config, test the prompt.
Here’s what we’re building: a LiveKit MultimodalAgent that supports OpenAI Realtime and Gemini Live, three interview personas (Interviewer, Coach, Evaluator), function calling for flow control, and a state machine for session management. All of it driven by a single InterviewConfig dataclass.
Why S2S-Only?
The pipeline architecture post (Part 2) covered the cascaded vs. S2S tradeoff in depth. Here’s the short version of why you’d go S2S-only:
What you gain:
- Lower latency. No STT → LLM → TTS chain. Audio-to-audio in 300-500ms end-to-end.
- Simpler architecture. One model instead of three. One connection instead of three. One failure mode instead of three.
- Better prosody. S2S models generate speech natively — they control tone, pacing, emphasis, and emotion. Cascaded pipelines lose these nuances in the text intermediate step.
- Fewer components to manage. No VAD tuning, no STT provider, no TTS voice selection, no text-audio synchronization.
What you lose:
- No real-time transcript. The S2S model doesn’t expose intermediate text. You get transcripts post-hoc by running STT on the recording.
- Harder debugging. You can’t inspect the “thinking” step because there is no text step to log.
- Model lock-in per session. You can’t mix providers mid-conversation (OpenAI Realtime for the conversation, Whisper for transcription, ElevenLabs for voice).
- Fewer voice options. S2S models offer 4-8 built-in voices vs. hundreds from dedicated TTS providers.
For interviews, the tradeoffs favor S2S. Latency matters more than voice variety. Simplicity matters more than debugging visibility. And post-hoc transcription is fine when the recording quality is high.
The S2S-Only Architecture
The architecture has three layers sitting below the LiveKit transport:
- S2S Conversation Engine — OpenAI Realtime or Gemini Live, with a provider factory for runtime selection
- Dynamic Prompt System — InterviewConfig → PromptBuilder → system prompt with rubric injection
- Session Orchestration — State machine, function calling, context updates, scoring
LiveKit’s MultimodalAgent class bridges the gap between the LiveKit room (WebRTC audio tracks) and the S2S model’s WebSocket connection. It handles audio framing, turn detection, and the bidirectional streaming protocol — you don’t write any of that plumbing.
Candidate Browser
↓ WebRTC audio
LiveKit SFU
↓ audio track
MultimodalAgent
↓ S2S WebSocket
OpenAI Realtime / Gemini Live
↓ function calls
Session Orchestration (state machine + scoring)
↑ context updates
Dynamic Prompt System (InterviewConfig → PromptBuilder)
The key insight: in an S2S architecture, your system prompt is your only control lever. There’s no text pipeline to intercept, no intermediate representation to modify, no separate LLM call to customize. Everything the agent knows, every behavior you want — it all goes into the system prompt. That’s why the prompt system matters more here than anywhere else.
Dynamic Prompts: The Researcher Approach
Most voice AI tutorials show you a static system prompt: “You are a helpful interviewer. Ask questions about Python.” That works for a demo. It falls apart in production because:
- Different roles need different questions (backend engineer vs. product manager vs. data scientist)
- Different levels need different difficulty (junior vs. senior vs. staff)
- Different companies have different evaluation criteria
- You need to iterate on prompts without changing code
- You need to A/B test prompt variations
The researcher approach treats prompts as data products. The prompt isn’t a string literal in your code — it’s the output of a build function that takes a structured configuration as input. Change the config, get a different prompt. Version the config, version the prompt. Run the config through validation, you validate the prompt.
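One practical consequence of treating prompts as data products: a config can be fingerprinted, giving every generated prompt a stable version identifier for free. A minimal sketch, where `config_version` is an illustrative helper (not part of the code below) operating on a plain dict:

```python
import hashlib
import json

def config_version(config_dict: dict) -> str:
    """Derive a stable version ID from a config: same config, same prompt, same ID."""
    canonical = json.dumps(config_dict, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

v1 = config_version({"role": "Senior Backend Engineer", "level": "senior"})
v2 = config_version({"level": "senior", "role": "Senior Backend Engineer"})
v3 = config_version({"role": "Staff Data Scientist", "level": "staff"})
# v1 == v2 (order-insensitive), v1 != v3 (different config, different prompt version)
```

Store the version ID alongside session recordings and you can always answer "which prompt did this candidate get?"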
The Configuration Layer
from dataclasses import dataclass, field
from enum import Enum
class InterviewType(str, Enum):
TECHNICAL = "technical"
BEHAVIORAL = "behavioral"
SYSTEM_DESIGN = "system_design"
MIXED = "mixed"
class CandidateLevel(str, Enum):
JUNIOR = "junior"
MID = "mid"
SENIOR = "senior"
STAFF = "staff"
PRINCIPAL = "principal"
@dataclass
class ScoringAnchor:
"""Concrete examples of what a score looks like."""
score: int # 1, 5, or 10
description: str
example_response: str # few-shot calibration
@dataclass
class RubricArea:
"""A single competency to evaluate, with scoring anchors."""
name: str
weight: float # 0.0 to 1.0, all weights sum to 1.0
description: str
anchors: list[ScoringAnchor] = field(default_factory=list)
follow_up_triggers: list[str] = field(default_factory=list)
@dataclass
class InterviewConfig:
"""Everything needed to generate a system prompt."""
role: str # "Senior Backend Engineer"
company: str # "Acme Corp"
level: CandidateLevel
interview_type: InterviewType
duration_minutes: int = 30
rubric_areas: list[RubricArea] = field(default_factory=list)
company_context: str = "" # "We use Python, FastAPI, PostgreSQL..."
special_instructions: str = ""
language: str = "en"
This is a dataclass, not a prompt. It contains zero prompt language. No “you are” statements, no behavioral instructions, no formatting directives. Those come next.
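Before moving on, note that the `weight` field carries an invariant the dataclass itself doesn't enforce: the comment promises all weights sum to 1.0, but nothing checks it. A minimal standalone check, where `validate_rubric_weights` is a hypothetical helper shown over plain floats for brevity:

```python
def validate_rubric_weights(weights: list[float], tolerance: float = 1e-6) -> None:
    """Fail fast at config time, not mid-interview, if weights don't sum to 1.0."""
    total = sum(weights)
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"Rubric weights sum to {total:.3f}, expected 1.0")

validate_rubric_weights([0.3, 0.3, 0.2, 0.2])  # the weights used later in this post pass
```

In practice you would call this in `InterviewConfig.__post_init__` so a bad config can never reach a live session.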
The Prompt Builder
The PromptBuilder takes an InterviewConfig and produces a complete system prompt. Every section of the prompt is generated from the config — nothing is hard-coded.
class PromptBuilder:
"""Generates system prompts from InterviewConfig.
The researcher approach: prompts are data products.
Every prompt is reproducible from its config.
"""
def build_interviewer_prompt(self, config: InterviewConfig) -> str:
sections = [
self._build_identity(config),
self._build_role_context(config),
self._build_interview_structure(config),
self._build_rubric_section(config),
self._build_behavioral_guidelines(config),
self._build_function_calling_instructions(config),
]
        # drop empty sections (e.g. no company_context) to avoid blank gaps
        return "\n\n".join(filter(None, sections))
def build_coach_prompt(self, config: InterviewConfig) -> str:
sections = [
self._build_coach_identity(config),
self._build_role_context(config),
self._build_coaching_guidelines(config),
self._build_feedback_format(config),
]
        return "\n\n".join(filter(None, sections))
def build_evaluator_prompt(self, config: InterviewConfig) -> str:
sections = [
self._build_evaluator_identity(config),
self._build_rubric_section(config),
self._build_scoring_instructions(config),
self._build_bias_mitigation(config),
]
        return "\n\n".join(filter(None, sections))
def _build_identity(self, config: InterviewConfig) -> str:
return (
f"You are a professional technical interviewer conducting a "
f"{config.interview_type.value} interview for the role of "
f"{config.role} at {config.company}.\n"
f"The candidate is targeting a {config.level.value}-level position.\n"
f"This interview is {config.duration_minutes} minutes long."
)
def _build_role_context(self, config: InterviewConfig) -> str:
if not config.company_context:
return ""
return (
f"## Company & Role Context\n"
f"{config.company_context}"
)
def _build_interview_structure(self, config: InterviewConfig) -> str:
total = config.duration_minutes
return (
f"## Interview Structure\n"
f"Follow this time allocation strictly:\n"
f"1. Opening & rapport (2 min)\n"
f"2. Background deep-dive ({int(total * 0.15)} min)\n"
f"3. Technical questions ({int(total * 0.40)} min)\n"
f"4. System design or case study ({int(total * 0.25)} min)\n"
f"5. Candidate Q&A ({int(total * 0.10)} min)\n"
f"6. Wrap-up (2 min)\n\n"
f"Call the `transition_section` function when moving between sections. "
f"Call `flag_time_warning` when 80% of a section's time has elapsed."
)
def _build_rubric_section(self, config: InterviewConfig) -> str:
if not config.rubric_areas:
return ""
lines = ["## Evaluation Rubric\n"]
lines.append("Score each competency on a 1-10 scale using these anchors:\n")
for area in config.rubric_areas:
lines.append(f"### {area.name} (weight: {area.weight:.0%})")
lines.append(f"{area.description}\n")
for anchor in area.anchors:
lines.append(f"**Score {anchor.score}:** {anchor.description}")
lines.append(f" Example: \"{anchor.example_response}\"\n")
if area.follow_up_triggers:
triggers = ", ".join(f'"{t}"' for t in area.follow_up_triggers)
lines.append(f"Follow-up triggers: {triggers}\n")
return "\n".join(lines)
def _build_behavioral_guidelines(self, config: InterviewConfig) -> str:
return (
"## Behavioral Guidelines\n"
"- Speak naturally, not robotically. Use conversational filler "
"like 'That's interesting' or 'I see' between questions.\n"
"- Never interrupt the candidate mid-sentence. Wait for a natural pause.\n"
"- If a candidate gives a vague answer, ask ONE specific follow-up. "
"Do not ask multiple follow-ups in sequence.\n"
"- Acknowledge good answers briefly before moving on.\n"
"- If the candidate is struggling, offer a hint after 15 seconds of silence. "
"Frame it as 'Let me rephrase that' rather than giving the answer.\n"
"- Never reveal scores or evaluation criteria during the interview."
)
def _build_function_calling_instructions(self, config: InterviewConfig) -> str:
return (
"## Function Calling\n"
"You have access to these functions. Call them at the appropriate moments:\n"
"- `transition_section(section_name)`: Call when moving to the next interview section.\n"
"- `score_answer(area, score, evidence)`: Call after each substantive answer to record your assessment.\n"
"- `flag_followup(reason)`: Call when you identify a topic worth deeper exploration.\n"
"- `query_knowledge(topic)`: Call to retrieve company-specific context for follow-up questions."
)
def _build_coach_identity(self, config: InterviewConfig) -> str:
return (
f"You are a supportive interview coach helping a candidate prepare for a "
f"{config.interview_type.value} interview for {config.role} at {config.company}.\n"
f"Target level: {config.level.value}.\n"
f"Your goal is to build confidence and improve specific skills. "
f"Never score or judge — only coach."
)
def _build_coaching_guidelines(self, config: InterviewConfig) -> str:
return (
"## Coaching Guidelines\n"
"- After each practice answer, give exactly ONE specific improvement suggestion.\n"
"- Frame feedback positively: 'That was solid. To make it even stronger, try...'\n"
"- If the candidate uses filler words excessively, mention it once, gently.\n"
"- Suggest the STAR format for behavioral questions (Situation, Task, Action, Result).\n"
"- Offer to re-do any answer the candidate wants to practice again.\n"
"- Keep energy high. This is practice, not evaluation."
)
def _build_feedback_format(self, config: InterviewConfig) -> str:
return (
"## Feedback Format\n"
"After each practice answer, structure feedback as:\n"
"1. What worked well (1 sentence)\n"
"2. One specific improvement\n"
"3. Optional: suggested rephrasing of one sentence"
)
def _build_evaluator_identity(self, config: InterviewConfig) -> str:
return (
f"You are an objective evaluator assessing a {config.level.value}-level "
f"candidate for {config.role} at {config.company}.\n"
f"You will receive the full interview transcript and must produce a "
f"structured assessment."
)
def _build_scoring_instructions(self, config: InterviewConfig) -> str:
return (
"## Scoring Instructions\n"
"- Score each rubric area independently on a 1-10 scale.\n"
"- For every score, cite the specific evidence from the transcript.\n"
"- Run the evaluation twice. If scores differ by more than 2 points "
"on any area, flag it for human review.\n"
"- Compute the weighted total using rubric weights.\n"
"- Provide a hire/no-hire/strong-hire recommendation with confidence level."
)
def _build_bias_mitigation(self, config: InterviewConfig) -> str:
return (
"## Bias Mitigation\n"
"- Evaluate based solely on demonstrated competency, not communication style.\n"
"- Do not penalize for accent, speech patterns, or non-native English.\n"
"- Apply the same scoring anchors consistently across all candidates.\n"
"- Flag any area where you feel uncertain — human reviewers will calibrate."
)
This is roughly 150 lines of Python, but look at what it gives you: three fully formed system prompts generated from one config object. Change the role from “Senior Backend Engineer” to “Staff Data Scientist” and every section updates automatically: the identity, the rubric, the structure, the coaching guidelines. No prompt surgery required.
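That claim is easy to demonstrate in miniature. A simplified stand-in for `_build_identity` (just the pattern, not the class above):

```python
def build_identity(role: str, company: str, level: str) -> str:
    # Simplified stand-in for PromptBuilder._build_identity
    return (
        f"You are a professional interviewer for the role of {role} "
        f"at {company}. Target level: {level}."
    )

p1 = build_identity("Senior Backend Engineer", "Acme Corp", "senior")
p2 = build_identity("Staff Data Scientist", "Acme Corp", "staff")
# Same template, two different prompts, zero hand-editing
```

Every `_build_*` method in the real builder is this pattern at a larger scale: config fields in, prompt section out.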
Why This Matters More in S2S
In a cascaded pipeline, you have multiple control points: you can modify the STT output before it reaches the LLM, you can post-process the LLM response before TTS, you can inject context at the text level between turns.
In S2S, the system prompt is your only lever. Once the session starts, the model receives audio and produces audio. The only way to influence its behavior is through the initial system prompt and mid-session context updates via set_chat_ctx(). That’s it.
This is why the researcher approach matters: if you’re going to have one lever, make it a precise one. A reproducible, testable, configurable one. Not a hand-written template you tweak in production and hope for the best.
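Mid-session updates via `set_chat_ctx()` carry plain text, so the update itself reduces to composing a system message from session state. A sketch of that composition step, where `build_context_update` is a hypothetical helper; in the agent you would append its output as a system message to the chat context and call `set_chat_ctx()`:

```python
import json

def build_context_update(section: str, scores: dict, minutes_left: float) -> str:
    """Compose a system message to inject mid-session: after the initial
    prompt, this is the only way to steer an S2S model's behavior."""
    return (
        f"[CONTEXT UPDATE] Current section: {section}. "
        f"Scores so far: {json.dumps(scores)}. "
        f"{minutes_left:.0f} minutes remain. Adjust pacing accordingly."
    )

msg = build_context_update("technical_questions", {"System Design": 7}, 12.0)
```

Keeping the message a deterministic function of state means context updates are as reproducible and testable as the prompts themselves.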
Building the Agent: LiveKit + OpenAI Realtime
Here’s a complete, working agent using LiveKit’s MultimodalAgent with OpenAI Realtime:
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.multimodal import MultimodalAgent
from livekit.plugins.openai import realtime
async def entrypoint(ctx: JobContext):
await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
participant = await ctx.wait_for_participant()
# Build config from your database / API
config = InterviewConfig(
role="Senior Backend Engineer",
company="Acme Corp",
level=CandidateLevel.SENIOR,
interview_type=InterviewType.TECHNICAL,
duration_minutes=30,
rubric_areas=[
RubricArea(
name="System Design",
weight=0.3,
description="Ability to design scalable distributed systems",
anchors=[
ScoringAnchor(1, "Cannot articulate basic components",
"I would just use a single server"),
ScoringAnchor(5, "Solid design with standard patterns",
"I'd use a load balancer, app servers, and a read replica"),
ScoringAnchor(10, "Exceptional depth with novel tradeoff analysis",
"Given the read-heavy access pattern, I'd use CQRS with "
"event sourcing to decouple writes from the read path"),
],
follow_up_triggers=["mentions scaling", "proposes single point of failure"],
),
RubricArea(
name="Coding Proficiency",
weight=0.3,
description="Problem-solving ability and code quality",
anchors=[
ScoringAnchor(1, "Cannot write basic functions",
"I'm not sure how to start"),
ScoringAnchor(5, "Correct solution with reasonable complexity",
"Here's a solution using a hash map for O(n) lookup"),
ScoringAnchor(10, "Optimal solution with edge case handling",
"The hash map handles collisions, and I've added bounds "
"checking for the edge case where the input is empty"),
],
),
RubricArea(
name="Communication",
weight=0.2,
description="Clarity of explanation and collaborative thinking",
anchors=[
ScoringAnchor(1, "Cannot explain thought process", "Um... let me think"),
ScoringAnchor(5, "Clear explanations with some structure",
"Let me walk through my approach: first I'd..."),
ScoringAnchor(10, "Exceptional clarity with proactive communication",
"Before I start coding, let me outline my approach and "
"confirm we're aligned on the requirements"),
],
),
RubricArea(
name="Technical Depth",
weight=0.2,
description="Understanding of underlying systems and tradeoffs",
anchors=[
ScoringAnchor(1, "Surface-level knowledge only",
"I use PostgreSQL because it's popular"),
ScoringAnchor(5, "Good understanding of common tradeoffs",
"PostgreSQL gives us ACID guarantees but we'd need "
"connection pooling at this scale"),
ScoringAnchor(10, "Deep expertise with production experience",
"We hit this exact issue at scale — connection pooling "
"with PgBouncer in transaction mode, plus we added "
"statement-level timeouts to prevent long-running queries"),
],
),
],
company_context=(
"Acme Corp is a B2B SaaS company processing 50K requests/second. "
"Stack: Python (FastAPI), PostgreSQL, Redis, Kubernetes on AWS. "
"Team size: 8 backend engineers. Microservices architecture."
),
)
builder = PromptBuilder()
system_prompt = builder.build_interviewer_prompt(config)
model = realtime.RealtimeModel(
instructions=system_prompt,
voice="sage",
temperature=0.7,
modalities=["text", "audio"],
turn_detection=realtime.ServerVadOptions(
threshold=0.5,
prefix_padding_ms=300,
silence_duration_ms=500,
),
)
agent = MultimodalAgent(model=model)
agent.start(ctx.room, participant)
# Initial greeting
session = agent.session
session.conversation.item.create(
llm.ChatMessage(
role="assistant",
content="Hello! Welcome to your technical interview for the Senior "
"Backend Engineer role at Acme Corp. I'm excited to chat "
"with you today. Before we dive into the technical questions, "
"could you give me a brief overview of your background?",
)
)
session.response.create()
if __name__ == "__main__":
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
That’s a production-ready S2S interview agent in about 100 lines (plus the config/builder we defined earlier). The MultimodalAgent handles all the WebRTC ↔ WebSocket bridging. The RealtimeModel manages the OpenAI Realtime session. Your job is to define the config and build the prompt.
Provider Swap: Gemini Live
One of the advantages of LiveKit’s agent framework: swapping providers is a model replacement, not a rewrite. Here’s the same agent using Gemini Live:
from livekit.plugins.google import realtime as google_realtime
model = google_realtime.RealtimeModel(
instructions=system_prompt,
voice="Puck",
temperature=0.7,
modalities=["AUDIO"],
model="gemini-2.0-flash-live",
)
agent = MultimodalAgent(model=model)
Same MultimodalAgent. Same InterviewConfig. Same PromptBuilder. Different model class. The system prompt is identical because it’s generated from the config, not hard-coded for a specific provider.
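The "provider factory for runtime selection" mentioned in the architecture section can be as small as a dispatch on provider name. A sketch that returns per-provider constructor settings rather than the model objects themselves, so it runs without the LiveKit plugins installed; the defaults mirror the examples above:

```python
def realtime_model_settings(provider: str, instructions: str) -> dict:
    """Return kwargs for the provider's RealtimeModel constructor.
    In the agent, unpack these: realtime.RealtimeModel(**settings) or
    google_realtime.RealtimeModel(**settings)."""
    if provider == "openai":
        return {
            "instructions": instructions,
            "voice": "sage",
            "temperature": 0.7,
            "modalities": ["text", "audio"],
        }
    if provider == "gemini":
        return {
            "instructions": instructions,
            "voice": "Puck",
            "temperature": 0.7,
            "modalities": ["AUDIO"],
            "model": "gemini-2.0-flash-live-001",  # pinned version; see the Gemini gotchas later in the post
        }
    raise ValueError(f"Unknown provider: {provider}")
```

Because the settings dict is plain data, provider selection can live in the same config layer as everything else.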
The differences that matter in practice:
| Feature | OpenAI Realtime | Gemini Live |
|---|---|---|
| Latency | ~300ms | ~320-800ms |
| Voices | alloy, ash, coral, sage | Puck, Charon, Kore, Fenrir |
| Multimodal | Audio only | Audio + video |
| Function calling | Native, parallel supported | Tool declarations, serial only |
| Session limit | 60 min | 15 min (context resets on resume) |
| Pricing model | Per-token (audio + text) | Per-token (audio) |
| Key production issues | VAD aggression, prompt cache miss | 15-min context loss, latency spikes |
For interviews, OpenAI Realtime has the edge on latency and session duration. Gemini Live wins if you need video analysis (e.g., watching a candidate write code on a shared screen).
Known Issues & Production Gotchas
The comparison table above makes both providers look clean. Production disagrees. Here are the issues you’ll actually hit when running a real interview platform — not theoretical edge cases, but things that surface during a 30-minute session with a real candidate. Each issue includes the root cause and a mitigation that works.
OpenAI Realtime
1. VAD interrupts candidates mid-thought
The default silence_duration_ms=200 is too short for technical thinking pauses. Candidates who pause to think — “I would use… a distributed hash table with consistent hashing” — get cut off after 200ms of silence. The model interprets the thinking pause as a turn-end and starts responding before the candidate finishes.
# Default — interrupts technical thinkers
turn_detection=realtime.ServerVadOptions(
threshold=0.5,
prefix_padding_ms=300,
silence_duration_ms=200, # too fast
)
# Tuned for technical interviews
turn_detection=realtime.ServerVadOptions(
threshold=0.5,
prefix_padding_ms=400, # capture the full thought
silence_duration_ms=700, # 700ms = natural thinking pause
)
2. WebSocket reconnects don’t restore session state
OpenAI Realtime WebSocket connections disconnect occasionally under load (network hiccup, server-side timeout, load balancer idle timeout). When the SDK reconnects, it starts a fresh session — the conversation history is gone. Mid-interview, this means the model has no memory of what was discussed, what questions were asked, or what rubric areas were already scored.
Implement a state snapshot on disconnect and re-inject on reconnect:
import json

class ReconnectHandler:
def __init__(self, state: InterviewState, config: InterviewConfig):
self.state = state
self.config = config
async def on_reconnect(self, agent: MultimodalAgent):
"""Re-inject critical context after a WebSocket reconnect."""
session = agent.session
ctx = session.chat_ctx.copy()
ctx.add_message(
role="system",
content=(
f"[RECONNECT] Session resumed mid-interview. "
f"Current section: {self.state.current_section}. "
f"Scores recorded so far: {json.dumps(self.state.scores)}. "
f"Topics covered: {', '.join(self.state.followups)}. "
f"Continue the interview naturally without acknowledging the interruption."
),
)
session.set_chat_ctx(ctx)
3. Audio token accumulation silently erases rubric context
The Realtime API keeps every conversation turn in context, and audio tokenizes at roughly 100 tokens per second: a 30-minute session with both speakers active generates ~180K audio tokens. As this total approaches the model’s context window, the oldest turns are silently dropped. The first casualty is usually the rubric and scoring anchors from your system prompt.
Symptoms: the model starts giving vague, uncalibrated scores or stops calling score_answer entirely. It hasn’t broken — it’s just forgotten the scoring instructions.
Re-inject the rubric at the session midpoint:
async def reinject_rubric_at_midpoint(
self,
agent: MultimodalAgent,
config: InterviewConfig,
):
"""Called at 50% of session duration to re-anchor evaluation criteria."""
builder = PromptBuilder()
rubric_reminder = builder._build_rubric_section(config)
session = agent.session
ctx = session.chat_ctx.copy()
ctx.add_message(
role="system",
content=(
f"[CONTEXT REFRESH] Re-anchoring evaluation rubric. "
f"Continue scoring per these criteria:\n{rubric_reminder}"
),
)
session.set_chat_ctx(ctx)
Schedule this at duration_minutes * 0.5 * 60 seconds into the session.
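The scheduling itself is a one-line asyncio task. A sketch, where `fire_at_midpoint` is illustrative and the callback in the real agent would be `reinject_rubric_at_midpoint`:

```python
import asyncio

async def fire_at_midpoint(duration_minutes: int, callback) -> None:
    """Sleep until 50% of the session has elapsed, then run the refresh."""
    await asyncio.sleep(duration_minutes * 0.5 * 60)
    await callback()

# In the entrypoint, alongside agent.start(...):
# asyncio.create_task(fire_at_midpoint(config.duration_minutes, refresh))
```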
4. Function calls race with streaming audio
The model can emit a function_call event while audio for the same response is still streaming. If you return the function result immediately (before audio finishes), the model may start generating its next audio turn before the current one completes — producing overlapping speech or a response that contradicts what it just said.
Keep function tool return values short and don’t trigger any follow-up actions in the handler:
@function_tool()
async def score_answer(
self,
context: RunContext,
rubric_area: str,
score: int,
evidence: str,
) -> str:
self.state.scores[rubric_area] = {"score": score, "evidence": evidence}
# Short return string — don't trigger further responses here.
# The model receives this AFTER response.done fires, not mid-audio.
return f"Score {score}/10 recorded for {rubric_area}."
Avoid returning complex objects. The model generates verbal acknowledgment from function results — a long return string causes it to read the result aloud.
5. Dynamic prompts defeat prompt caching
OpenAI Realtime offers a 50% discount on cached input tokens — but only when the system prompt prefix is identical across sessions. The PromptBuilder generates different prompts per role, company, and level. Every session gets a unique prompt, so every session pays full price.
At scale (1,000 sessions/month), this adds up. Fix it by splitting into a static cached prefix and a dynamic session-specific suffix:
class CachingPromptBuilder(PromptBuilder):
def build_cached_prefix(self) -> str:
"""Identical across all sessions — qualifies for prompt caching."""
return (
"You are a professional technical interviewer. "
"Your role is to evaluate candidates fairly and consistently. "
"Follow all behavioral guidelines in this prompt exactly.\n\n"
+ self._build_behavioral_guidelines(None) # no config needed
)
def build_dynamic_suffix(self, config: InterviewConfig) -> str:
"""Session-specific — not cached, but small."""
return "\n\n".join(filter(None, [
self._build_identity(config),
self._build_role_context(config),
self._build_rubric_section(config),
self._build_function_calling_instructions(config),
]))
def build_instructions(self, config: InterviewConfig) -> str:
return self.build_cached_prefix() + "\n\n" + self.build_dynamic_suffix(config)
OpenAI caches the longest common prefix across your sessions. The static prefix — behavioral guidelines, which rarely change — gets cached. The dynamic suffix (role, rubric) pays full price but is smaller.
6. response.create() conflicts with server VAD
MultimodalAgent uses server-side VAD for turn detection. If you also call session.response.create() manually at session start (for the initial greeting) and the VAD triggers at the same moment, the model generates two overlapping responses. The candidate hears the agent start two sentences simultaneously.
Don’t trigger a response manually after injecting the opening assistant message — let the VAD handle the turn:
# WRONG: manual response.create() races with server VAD
session.conversation.item.create(
llm.ChatMessage(role="assistant", content="Welcome to your interview...")
)
session.response.create() # may double-trigger
# RIGHT: inject a synthetic user turn, then trigger exactly one response
session.conversation.item.create(
    llm.ChatMessage(role="user", content="[BEGIN]")
)
session.response.create()  # one deliberate trigger; no candidate audio exists yet for VAD to race against
Alternatively, disable server VAD entirely and manage turn detection in your application code for full control.
Gemini Live
1. 15-minute hard limit with context-blind session resumption
Gemini Live terminates sessions at 15 minutes. Session resumption exists — you get a session_resumption_handle — but it only reconnects the WebSocket. It does not restore conversation history. The model starts the new segment with no memory of the previous 15 minutes.
For a 30-minute technical interview, this means the session resets at the most critical point, midway through a candidate’s system design question. The model will greet the candidate again as if meeting them for the first time.
The only mitigation is external state management with a proactive checkpoint:
import asyncio
import json
from datetime import datetime

class GeminiSessionManager:
SEGMENT_WARNING_SECONDS = 13 * 60 # warn at 13 min, reset at 15
async def watch_session_expiry(self, agent: MultimodalAgent):
"""Proactively manage Gemini's 15-minute segment limit."""
await asyncio.sleep(self.SEGMENT_WARNING_SECONDS)
# Snapshot current state before the reset
snapshot = {
"section": self.state.current_section,
"scores": self.state.scores,
"followups": self.state.followups,
"minutes_elapsed": self.SEGMENT_WARNING_SECONDS / 60,
}
# Give the model 2 minutes to wrap up the current topic
await self._inject_context(
agent,
"[SYSTEM] Approaching session checkpoint. "
"Finish your current question, then pause briefly.",
)
await asyncio.sleep(90)
# Rebuild the agent with state pre-loaded into the new system prompt
builder = PromptBuilder()
base_prompt = builder.build_interviewer_prompt(self.config)
resume_context = (
f"\n\n[SESSION RESUME] Continuing from prior segment.\n"
f"Section completed so far: {snapshot['section']}.\n"
f"Scores recorded: {json.dumps(snapshot['scores'])}.\n"
f"Resume naturally — do not mention the session restart."
)
new_model = create_realtime_model(
"gemini",
instructions=base_prompt + resume_context,
)
        agent._model = new_model  # private attribute; in production, recreate the MultimodalAgent instead
self.session_start = datetime.now()
asyncio.create_task(self.watch_session_expiry(agent))
For interviews longer than 15 minutes, OpenAI Realtime’s 60-minute limit makes it the better default choice.
2. P95 latency reaches 800ms+ under load
Gemini Live’s P50 latency is competitive (~320ms), but P95 spikes to 800ms–1.5s during peak US hours (roughly 9–11am and 2–4pm Eastern). At P95, the conversation feels like a bad Zoom call rather than a natural interview. The candidate notices.
OpenAI Realtime’s P95 stays closer to 450ms. The gap is real.
Instrument latency per turn and route to OpenAI Realtime when Gemini degrades:
import time
class LatencyMonitor:
def __init__(self, p95_threshold_ms: float = 600):
self.samples: list[float] = []
self.threshold = p95_threshold_ms
def record(self, latency_ms: float):
self.samples.append(latency_ms)
if len(self.samples) > 20:
self.samples.pop(0) # rolling window
@property
def p95(self) -> float:
if len(self.samples) < 5:
return 0.0
return sorted(self.samples)[int(len(self.samples) * 0.95)]
def should_failover(self) -> bool:
return self.p95 > self.threshold
# In your session loop:
monitor = LatencyMonitor(p95_threshold_ms=600)
# Record turn latency: time from candidate stop speaking to AI first audio
# If monitor.should_failover(): create_agent(..., provider="openai")
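The nearest-rank p95 and the failover decision are easy to check in isolation. The same logic as the class above, expressed as free functions:

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile over a small rolling window."""
    if len(samples) < 5:
        return 0.0  # not enough data to judge
    return sorted(samples)[int(len(samples) * 0.95)]

def should_failover(samples: list[float], threshold_ms: float = 600) -> bool:
    return p95(samples) > threshold_ms

# Nineteen fast turns and one 900ms spike: with a 20-sample window,
# index int(20 * 0.95) = 19 selects the spike, so one bad turn
# is enough to cross the threshold
window = [320.0] * 19 + [900.0]
```

That sensitivity is deliberate here: in a live interview, reacting to a single visible latency spike beats waiting for sustained degradation.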
3. set_chat_ctx() doesn’t work reliably for mid-session injection
The S2SSessionManager.update_context() pattern used throughout this post calls session.set_chat_ctx() to inject time warnings and section transitions. This works reliably on OpenAI Realtime. On Gemini Live, system-role messages sent via set_chat_ctx() are silently dropped in several versions of livekit-plugins-google — the model never sees them.
The root cause is that Gemini Live uses a different mechanism for mid-session context updates (client_content messages with a turn_complete flag), and the LiveKit plugin’s mapping between set_chat_ctx() and this mechanism has changed across plugin versions.
For Gemini, consolidate all session context into the initial system instruction. Use function return values for dynamic updates during the session — those work reliably because they go through the tool call response channel, not the context channel:
@function_tool()
async def check_time_remaining(self, context: RunContext) -> str:
"""Model calls this to get current time status. Works on both providers."""
elapsed = (datetime.now() - self.state.section_start).total_seconds()
remaining = (self.config.duration_minutes * 60) - elapsed
return (
f"{remaining // 60:.0f} minutes remaining. "
f"Current section: {self.state.current_section}. "
f"{'Begin wrapping up.' if remaining < 300 else 'Continue at current pace.'}"
)
Teach the model to call this function instead of relying on system-level time injection.
4. Unpinned model version causes silent behavior regressions
model="gemini-2.0-flash-live" (no version suffix) auto-upgrades when Google deploys a new release. Between major versions, function calling schema validation has tightened — tool declarations that worked in one version return 400 errors in the next. Voice characteristics have also changed between versions.
Production deployments that used the unpinned model identifier saw unexpected failures after Google’s releases, with no prior warning.
Always pin the version:
# WRONG — auto-upgrades silently
model="gemini-2.0-flash-live"
# RIGHT — predictable; you control when to upgrade
model="gemini-2.0-flash-live-001"
Test version upgrades in a staging environment against a full interview scenario (not just a short smoke test) before promoting to production. Function calling behavior in edge cases is where regressions appear.
5. No parallel function calls
OpenAI Realtime can call multiple functions in a single turn — the model can simultaneously call score_answer for a rubric area and flag_followup for a topic worth probing. Gemini Live serializes tool calls. If the model wants to call two functions, it calls the first, waits for the result, then calls the second. If your tool schema implies both should happen together, only the first executes reliably in some scenarios.
Design tool schemas as independent, self-contained operations. Never build logic that requires two function calls to be atomic:
```python
# WRONG — combined score+flag assumes parallel-execution atomicity
@function_tool()
async def score_and_flag(
    self, context: RunContext, rubric_area: str, score: int,
    evidence: str, followup_reason: str,
) -> str:
    ...

# RIGHT — independent tools; Gemini can chain them serially
@function_tool()
async def score_answer(
    self, context: RunContext, rubric_area: str, score: int, evidence: str
) -> str:
    ...

@function_tool()
async def flag_followup(self, context: RunContext, reason: str) -> str:
    ...
```
6. Background noise triggers false turn-end
Gemini Live’s VAD is sensitive to background audio in ways that OpenAI Realtime’s isn’t. Keyboard clicks, HVAC noise, or a second voice in the room can cause the model to stop mid-sentence — it classifies the background sound as a candidate turn-end and yields. The model then waits for the next input, producing an awkward silence in the middle of its own question.
This is more common in home-office environments than in controlled interview rooms. Apply client-side noise processing before audio reaches the LiveKit room:
```javascript
// LiveKit browser SDK — apply before publishing the audio track
const track = await createLocalAudioTrack({
  noiseSuppression: true,   // suppresses background hiss and HVAC
  echoCancellation: true,   // prevents the AI's audio from feeding back
  autoGainControl: true,    // normalizes candidate volume
});
await room.localParticipant.publishTrack(track);
```
Server-side, you can also gate audio publication: only publish the candidate’s audio track when their microphone level exceeds a threshold, rather than streaming continuous ambient audio to the model.
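The server-side gate reduces to a level check per audio frame. A minimal sketch of that idea for 16-bit little-endian mono PCM — the threshold value is a placeholder you'd calibrate against real recordings, and the hook into your audio pipeline is left out:

```python
import math
import struct

def frame_rms(frame: bytes) -> float:
    """RMS level of a frame of 16-bit little-endian mono PCM."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)

def should_publish(frame: bytes, threshold: float = 500.0) -> bool:
    """Gate: forward audio only when the candidate is audibly speaking.

    The threshold is a placeholder — calibrate against real recordings.
    Ambient room noise typically sits well below speech-level RMS.
    """
    return frame_rms(frame) >= threshold
```

In practice you'd add hysteresis (hold the gate open for a few hundred milliseconds after speech stops) so trailing soft syllables aren't clipped.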
Issues at a Glance
| Issue | OpenAI Realtime | Gemini Live |
|---|---|---|
| VAD false interruptions | High severity — tune silence_duration_ms | Medium — background noise sensitive |
| Session hard limit | 60 min — manageable with time tracking | 15 min — critical for interviews |
| Context after reconnect | Lost — snapshot + re-inject | Lost — more frequent, harder to manage |
| Mid-session context updates | set_chat_ctx() — works reliably | Partial — use function tools instead |
| Parallel function calls | Supported natively | Not supported — design tools as independent |
| Latency P95 | ~450ms — consistent | ~800ms+ — variable, degrades at peak hours |
| Prompt caching | Works with static prefix strategy | Not applicable |
| Model version stability | Stable named releases | Pin version or risk silent regressions |
Three Personas, One Infrastructure
The three personas — Interviewer, Coach, Evaluator — share the same infrastructure. The difference is which prompt the PromptBuilder generates and which function tools are attached:
```python
from enum import Enum

class PersonaType(str, Enum):
    INTERVIEWER = "interviewer"
    COACH = "coach"
    EVALUATOR = "evaluator"

def create_agent(
    config: InterviewConfig,
    persona: PersonaType,
    provider: str = "openai",
) -> MultimodalAgent:
    """Factory: one config, one persona, one agent."""
    builder = PromptBuilder()
    if persona == PersonaType.INTERVIEWER:
        prompt = builder.build_interviewer_prompt(config)
        tools = InterviewerTools(config)
    elif persona == PersonaType.COACH:
        prompt = builder.build_coach_prompt(config)
        tools = CoachTools(config)
    elif persona == PersonaType.EVALUATOR:
        prompt = builder.build_evaluator_prompt(config)
        tools = EvaluatorTools(config)
    else:
        raise ValueError(f"Unknown persona: {persona}")

    model = create_realtime_model(
        provider=provider,
        instructions=prompt,
        voice="sage" if provider == "openai" else "Puck",
    )
    return MultimodalAgent(model=model, tools=tools)
```
The InterviewConfig is the same for all three. The PromptBuilder generates different prompts based on the persona. The tools are different because each persona needs different capabilities. But the model, the transport, the audio pipeline — all shared.
This is the architectural principle from the playbook series (Part 5): personas are configuration, not infrastructure.
Function Calling: Flow Control Without Text
In a cascaded pipeline, you control the interview flow by inspecting the LLM’s text output. In S2S, you don’t have text output — you have audio. Function calling is how the model communicates structured decisions back to your application.
LiveKit’s @function_tool() decorator makes this clean:
```python
from dataclasses import dataclass, field
from datetime import datetime

from livekit.agents import RunContext, function_tool

@dataclass
class InterviewState:
    current_section: str = "opening"
    interview_start: datetime = field(default_factory=datetime.now)
    section_start: datetime = field(default_factory=datetime.now)
    scores: dict = field(default_factory=dict)
    followups: list = field(default_factory=list)
    section_order: list = field(default_factory=lambda: [
        "opening", "background", "technical",
        "system_design", "qa", "wrapup",
    ])

class InterviewerTools:
    def __init__(self, config: InterviewConfig):
        self.config = config
        self.state = InterviewState()

    @function_tool()
    async def transition_section(
        self, context: RunContext, section_name: str
    ) -> str:
        """Move to the next interview section. Call this when the current
        section is complete and you're ready to move on."""
        previous = self.state.current_section
        self.state.current_section = section_name
        self.state.section_start = datetime.now()
        # Measure against the interview start — not section_start, which
        # was just reset to now and would always report the full budget.
        elapsed = (datetime.now() - self.state.interview_start).total_seconds()
        remaining = (self.config.duration_minutes * 60) - elapsed
        return (
            f"Transitioned from {previous} to {section_name}. "
            f"{remaining // 60:.0f} minutes remaining in the interview."
        )

    @function_tool()
    async def score_answer(
        self,
        context: RunContext,
        rubric_area: str,
        score: int,
        evidence: str,
    ) -> str:
        """Record a score for the candidate's answer in a specific
        rubric area. Call after each substantive answer."""
        self.state.scores.setdefault(rubric_area, []).append({
            "score": score,
            "evidence": evidence,
            "section": self.state.current_section,
            "timestamp": datetime.now().isoformat(),
        })
        return f"Recorded score {score}/10 for {rubric_area}."

    @function_tool()
    async def flag_followup(
        self, context: RunContext, reason: str
    ) -> str:
        """Flag a topic for deeper follow-up exploration.
        Call when the candidate mentions something worth probing."""
        self.state.followups.append({
            "reason": reason,
            "section": self.state.current_section,
            "timestamp": datetime.now().isoformat(),
        })
        return f"Flagged for follow-up: {reason}"

    @function_tool()
    async def query_knowledge(
        self, context: RunContext, topic: str
    ) -> str:
        """Retrieve company-specific context for a follow-up question.
        Call when you need technical details about the company's stack."""
        # In production, this queries a vector DB or knowledge base
        return f"Context for {topic}: {self.config.company_context}"
```
The S2S model calls these functions during the conversation — silently, without the candidate hearing anything. transition_section updates the state machine. score_answer records assessments in real time. flag_followup marks topics for deeper exploration. The function return values go back to the model as context for its next response.
This is how you get structured data out of a pure audio conversation. The model never speaks the scores — it calls a function. You get JSON-structured evaluation data without any text parsing.
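At session end, that accumulated state serializes directly into an evaluation report. A sketch — the field names follow the dicts that score_answer and flag_followup record above, but the averaging scheme and report shape here are my own, not part of the post's code:

```python
import json
from statistics import mean

def build_report(scores: dict, followups: list) -> str:
    """Summarize recorded interview state into a JSON report.

    `scores` maps rubric_area -> list of {"score", "evidence", ...} dicts,
    exactly as score_answer records them. Aggregation by simple mean is
    an illustrative choice — weight rubric areas however your process does.
    """
    report = {
        "rubric_averages": {
            area: round(mean(entry["score"] for entry in entries), 2)
            for area, entries in scores.items()
        },
        "followups": [f["reason"] for f in followups],
        "total_answers_scored": sum(len(v) for v in scores.values()),
    }
    return json.dumps(report, indent=2)
```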
Provider Switching: OpenAI to Gemini Failover
The provider factory abstracts away the model-specific configuration:
```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ProviderConfig:
    name: Literal["openai", "gemini"]
    model_id: str
    voice: str
    max_session_minutes: int
    cost_per_minute: float

PROVIDERS = {
    "openai": ProviderConfig(
        name="openai",
        model_id="gpt-4o-realtime-preview",
        voice="sage",
        max_session_minutes=60,
        cost_per_minute=0.06,
    ),
    "gemini": ProviderConfig(
        name="gemini",
        model_id="gemini-2.0-flash-live-001",  # pinned — see pitfall #4
        voice="Puck",
        max_session_minutes=15,
        cost_per_minute=0.035,
    ),
}
```
```python
def create_realtime_model(
    provider: str,
    instructions: str,
    voice: str | None = None,
    temperature: float = 0.7,
):
    """Create a RealtimeModel for the specified provider."""
    pc = PROVIDERS[provider]
    if provider == "openai":
        from livekit.plugins.openai import realtime
        return realtime.RealtimeModel(
            instructions=instructions,
            voice=voice or pc.voice,
            temperature=temperature,
            modalities=["text", "audio"],
            turn_detection=realtime.ServerVadOptions(
                threshold=0.5,
                prefix_padding_ms=300,
                silence_duration_ms=500,
            ),
        )
    elif provider == "gemini":
        from livekit.plugins.google import realtime as google_realtime
        return google_realtime.RealtimeModel(
            instructions=instructions,
            voice=voice or pc.voice,
            temperature=temperature,
            modalities=["AUDIO"],
            model=pc.model_id,
        )
    else:
        raise ValueError(f"Unknown provider: {provider}")
```
In production, the provider selection happens at session creation time — typically based on cost, regional availability, or A/B testing. If the primary provider fails during session setup, the factory can fall back:
```python
import logging

logger = logging.getLogger(__name__)

async def create_agent_with_failover(
    config: InterviewConfig,
    persona: PersonaType,
    preferred: str = "openai",
    fallback: str = "gemini",
) -> MultimodalAgent:
    """Try the preferred provider; fall back if setup fails."""
    try:
        return create_agent(config, persona, provider=preferred)
    except Exception as e:
        logger.warning(f"{preferred} failed: {e}, falling back to {fallback}")
        return create_agent(config, persona, provider=fallback)
```
Note: failover happens at session creation, not mid-session. Once an S2S session is established, you can’t switch providers without disconnecting and reconnecting. This is a fundamental constraint of the S2S approach — the model holds conversational state that isn’t transferable between providers.
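When a reconnect is unavoidable (the issues table above notes that context is lost on both providers), the recoverable state is your own InterviewState, not the model’s. A sketch of the snapshot → re-inject path — the resume wording is illustrative, and the function only builds the instruction for the replacement session; actually tearing down and re-creating the agent is provider plumbing left out here:

```python
def build_resume_instructions(base_prompt: str, state) -> str:
    """Build the system instruction for a replacement session after a drop.

    `state` is an InterviewState-shaped object (section_order,
    current_section, scores, followups). The new session gets the original
    prompt plus a summary of where the interview was, so the model resumes
    instead of restarting.
    """
    covered = state.section_order[: state.section_order.index(state.current_section)]
    summary = (
        f"[RESUME] The connection dropped mid-interview. "
        f"Sections already completed: {', '.join(covered) or 'none'}. "
        f"Current section: {state.current_section}. "
        f"Rubric areas scored so far: {', '.join(state.scores) or 'none'}. "
        f"Pending follow-ups: {len(state.followups)}. "
        f"Apologize briefly for the glitch, then continue — do not restart."
    )
    return f"{base_prompt}\n\n{summary}"
```

This works on both providers because it goes through the initial system instruction of the new session, not a mid-session context update.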
Session Management: State Machine Meets S2S
The interview state machine tracks where we are in the interview and manages time budgets. In a cascaded pipeline, you’d update context between turns by modifying the text. In S2S, you use set_chat_ctx():
```python
from datetime import datetime

class S2SSessionManager:
    """Manages interview session state for S2S agents."""

    def __init__(self, agent: MultimodalAgent, config: InterviewConfig):
        self.agent = agent
        self.config = config
        self.state = InterviewState()
        self.started_at = datetime.now()  # interview start, for time budgets

    async def update_context(self, additional_instructions: str):
        """Inject new instructions into the active S2S session.

        This is the S2S equivalent of modifying the LLM prompt
        between turns in a cascaded pipeline.
        """
        session = self.agent.session
        ctx = session.chat_ctx.copy()
        ctx.add_message(
            role="system",
            content=additional_instructions,
        )
        session.set_chat_ctx(ctx)

    async def handle_section_transition(self, new_section: str):
        """Update context when the interview moves to a new section."""
        # Measure against the interview start — the budget being tracked
        # here is the whole interview's, not the current section's.
        time_used = (datetime.now() - self.started_at).total_seconds()
        total_budget = self.config.duration_minutes * 60
        remaining = total_budget - time_used
        await self.update_context(
            f"[SECTION TRANSITION] Moving to {new_section}. "
            f"Time remaining: {remaining // 60:.0f} minutes. "
            f"Adjust question depth accordingly."
        )

    async def handle_time_warning(self, minutes_left: int):
        """Warn the agent about time running low."""
        await self.update_context(
            f"[TIME WARNING] Only {minutes_left} minutes remaining. "
            f"Begin wrapping up the current section and prepare to "
            f"transition to the next one. Shorten your questions."
        )
```
The set_chat_ctx() call is the key mechanism. It appends a system message to the active session’s context without interrupting the audio stream. The model sees the new instruction and adjusts its behavior — asking shorter questions, transitioning topics, or wrapping up — without any audible interruption to the candidate.
This is how you steer an S2S conversation: not by intercepting text, but by injecting context. The model receives your update and incorporates it into its next response. It’s less precise than text-level manipulation, but it works reliably for the kind of high-level flow control that interviews need.
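Something has to decide *when* to fire those warnings. A sketch of a scheduler that runs alongside the session — the threshold list and the asyncio-task shape are my own choices, and `manager` is assumed to be an S2SSessionManager like the one above:

```python
import asyncio

def warning_offsets(duration_minutes: int, warn_at=(10, 5, 2)) -> list:
    """Seconds into the interview at which to fire each time warning.

    warn_at lists minutes-remaining thresholds (illustrative defaults);
    thresholds that don't fit inside the interview are dropped.
    """
    total = duration_minutes * 60
    return sorted(total - m * 60 for m in warn_at if 0 < m * 60 < total)

async def run_time_warnings(manager, duration_minutes: int):
    """Fire handle_time_warning at each offset.

    Sketch — launch as an asyncio task when the session starts, e.g.
    asyncio.create_task(run_time_warnings(manager, config.duration_minutes)).
    """
    last = 0
    for offset in warning_offsets(duration_minutes):
        await asyncio.sleep(offset - last)
        last = offset
        minutes_left = duration_minutes - offset // 60
        await manager.handle_time_warning(int(minutes_left))
```

Remember to cancel the task if the session ends early, and to pair this with the function-tool approach from pitfall #3 when running on Gemini Live.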
S2S-Only Costs: The Real Numbers
Here’s what a 30-minute interview costs with each provider in S2S mode:
| Component | OpenAI Realtime | Gemini Live |
|---|---|---|
| Audio input | $0.60 (100 tokens/sec × $0.06/1K) | $0.21 (per audio second) |
| Audio output | $0.60 (AI speaking ~50%) | $0.42 (per audio second) |
| Text tokens (context) | $0.15 (system prompt + function results) | included |
| Function calling | included | included |
| LiveKit SFU | $0.15 | $0.15 |
| Egress recording | $0.09 | $0.09 |
| Total (30 min) | ~$1.59 | ~$0.87 |
| Per minute | ~$0.053 | ~$0.029 |
Compare to cascaded (from Part 11):
| Cascaded Component | Cost per 30 min |
|---|---|
| Deepgram STT | $0.22 |
| GPT-4o (text) | $0.45 |
| ElevenLabs TTS | $0.90 |
| LiveKit SFU + Egress | $0.24 |
| Total | $1.81 |
S2S with Gemini Live is roughly 52% cheaper than a cascaded pipeline at comparable quality. OpenAI Realtime is about 12% cheaper. The cost advantage comes from eliminating the separate STT and TTS billing — the S2S model bundles everything.
At scale (1,000 interviews/month), Gemini Live S2S saves ~$940/month vs. cascaded. At 10,000 interviews, that’s ~$9,400/month.
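The arithmetic behind those savings figures is straightforward enough to keep as a sanity-check helper. A sketch using the per-interview totals from the two tables above:

```python
# Per-30-minute-interview totals from the cost tables above.
COST_PER_INTERVIEW = {
    "openai_s2s": 1.59,
    "gemini_s2s": 0.87,
    "cascaded": 1.81,
}

def monthly_savings(stack: str, baseline: str, interviews_per_month: int) -> float:
    """Monthly savings of `stack` relative to `baseline` at a given volume."""
    delta = COST_PER_INTERVIEW[baseline] - COST_PER_INTERVIEW[stack]
    return round(delta * interviews_per_month, 2)
```

Re-run it whenever provider pricing changes — the cascaded-vs-S2S crossover is sensitive to TTS pricing in particular, since TTS is the largest line item in the cascaded stack.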
What You Lose Without a Cascaded Pipeline
S2S-only is simpler and cheaper, but it’s not strictly better. Here’s what you give up:
No real-time transcript. In a cascaded pipeline, the STT step produces a transcript as a byproduct. S2S has no text intermediate — you get transcripts by running STT on the recording after the session ends. For interviews, this means you can’t show live captions, can’t do real-time keyword detection, and can’t feed the transcript to a separate evaluation model during the conversation.
Harder debugging. When a cascaded agent says something weird, you can check: was the STT wrong? Was the LLM response bad? Was the TTS garbled? With S2S, you have a single audio stream. Debugging requires listening to the recording and guessing where the model went wrong.
Limited voice options. OpenAI Realtime offers 4 voices. Gemini Live offers 4-6. ElevenLabs alone offers 1,000+. If voice branding matters (matching your company’s existing voice AI), cascaded gives you more options.
Post-hoc evaluation only. In a cascaded pipeline, you can run a second LLM (the Evaluator persona) in parallel, watching the transcript in real-time and building a score card as the interview progresses. With S2S, evaluation happens after the session using the recording + post-hoc transcript.
Model-specific quirks. Each S2S model handles conversation differently. OpenAI Realtime is more aggressive about turn-taking. Gemini Live is more patient but sometimes has longer pauses. You can’t mix and match characteristics the way you can with separate STT/LLM/TTS components.
When to Use This Approach
Use S2S-only when:
- Latency is critical. Sub-500ms response time is a hard requirement.
- Simplicity matters. You want fewer moving parts to deploy, monitor, and debug.
- Cost efficiency at scale. You’re running thousands of sessions and the per-minute savings add up.
- You don’t need real-time transcription. Post-hoc transcription is acceptable.
Use cascaded (or hybrid) when:
- You need live transcription. Accessibility requirements, real-time dashboards, or live evaluation.
- Voice branding matters. You need a specific voice identity that S2S models don’t offer.
- You want maximum control. Text-level manipulation between turns is important for your use case.
- You’re integrating multiple AI services. Different providers for different parts of the pipeline.
The playbook series covers the hybrid approach — S2S for the conversation, cascaded for evaluation and transcription. That’s the most robust production architecture. But if you’re starting from scratch and want the simplest path to a working voice interview agent, S2S-only with the dynamic prompt system described here will get you there fastest.
The complete code in this post — InterviewConfig, PromptBuilder, InterviewerTools, provider factory, session manager — forms a working foundation. Clone it, configure it for your roles and rubrics, deploy it on LiveKit Cloud, and you have a voice interview agent in production.
The key insight worth repeating: in S2S, your system prompt is your only lever. Make it a good one. Make it generated, versioned, and testable. The researcher approach isn’t optional — it’s the difference between a demo and a production system.
This is a standalone companion to the Voice AI Interview Playbook series. For the full reference architecture, pipeline selection, framework comparison, scaling, and cost optimization, start with Part 1.