Research interviews are not free-form conversations. They follow protocols — structured, validated, peer-reviewed protocols. A market research session exploring consumer attitudes toward electric vehicles has a different conversational arc than a user experience study on mobile banking. Both have warmup phases, exploration phases, probing phases, and wrap-up phases. The difference is in the instructions at each phase, the time allocated to each phase, and the follow-up triggers that tell the moderator when to dig deeper.
This is fundamentally different from what most voice AI systems are built for. Chatbots have one mode: answer the question. Hiring interview agents have a linear topic guide. Customer service bots have intent routing. None of these need to change their entire conversational personality mid-session. Research interviews do.
In Part 2, we covered the production bugs that cost us weeks. This post is about the core architectural pattern that makes research-grade voice AI possible: the multi-phase state machine driven by LLM function calling with dynamic instruction swapping.
Why Phases Matter for Research
A typical qualitative research interview has 4-6 distinct phases, each with a different conversational objective:
Warmup (3-5 minutes). Build rapport. Open-ended, encouraging. “Tell me about yourself” territory. The AI needs to be warm, curious, and non-directive. No probing. No challenging. Just establish trust.
Exploration (8-12 minutes). Follow the topic guide. The researcher has specific themes they want to cover — brand perception, purchase triggers, usage patterns. The AI needs to ask about each theme but let the respondent lead. Follow the natural flow of conversation rather than running through a checklist.
Deep Probing (5-8 minutes). This is where the insights live. When a respondent says something interesting — an unexpected opinion, a contradictory behavior, an emotional reaction — the AI needs to dig in. “You mentioned you love the brand but haven’t bought anything in two years. Tell me more about that.” The instructions here are more aggressive, more Socratic.
Synthesis (3-5 minutes). Reflect back what the respondent said. “So it sounds like your main hesitation is around charging infrastructure, not the vehicle itself — is that fair?” This phase validates understanding and gives the respondent a chance to correct misinterpretations.
Wrap-up (2-3 minutes). Thank the respondent. Ask if there’s anything they want to add. Explain next steps. The tone shifts to professional and appreciative.
Each phase has a different system prompt, different follow-up triggers, different conversational boundaries, and a different time budget. A single static prompt cannot capture this. You need a state machine.
The key insight from building this: the LLM itself should decide when to transition between phases. Not a timer. Not a keyword detector. The LLM, because it understands the conversational context well enough to know when a phase’s objectives have been met. We give it a function tool — next_phase() — and trust it to call that tool at the right moment.
The Phases Array
Every research session is configured through room metadata that arrives when the agent connects to the LiveKit room. The critical piece is the phases array — a JSON structure that defines the complete research protocol.
```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PhaseConfig:
    """A single phase in a research interview protocol."""
    index: int
    name: str                # "warmup", "exploration", "probing", "synthesis", "wrapup"
    instructions: str        # Full system prompt for this phase
    duration_minutes: float  # Target time budget
    topics: list[str] = field(default_factory=list)
    follow_up_triggers: list[str] = field(default_factory=list)
    transition_hint: str = ""  # Guidance on when to move to next phase


@dataclass
class ResearchProtocol:
    """Complete research protocol loaded from room metadata."""
    session_id: str
    study_name: str
    phases: list[PhaseConfig]
    global_instructions: str = ""  # Applied to ALL phases
    max_duration_minutes: float = 45.0
```
The backend — whatever system manages the research study — defines the protocol. The voice agent reads it. This separation matters because researchers iterate on protocols constantly. They tweak phase instructions after the first five sessions. They add follow-up triggers when they notice respondents mentioning unexpected themes. They adjust time budgets when warmup consistently runs long. None of these changes require agent code changes. It is all metadata.
The instructions field in each PhaseConfig is a complete system prompt fragment. Not a single sentence — a full behavioral specification. For a probing phase, it might be 400 words covering techniques like laddering (“Why is that important to you? And why does that matter?”), projective techniques (“If this brand were a person, how would you describe them?”), and explicit boundaries (“Do not introduce new topics — only probe deeper on topics the respondent has already raised”).
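For concreteness, here is what a (hypothetical) two-phase slice of that metadata might look like. The field names mirror the PhaseConfig dataclass above; the instruction text, triggers, and study name are illustrative, not from a real protocol:

```python
import json

# Illustrative protocol metadata -- the backend serializes this into
# the LiveKit room metadata string that the agent reads on connect.
protocol_metadata = {
    "session_id": "sess-001",
    "study_name": "EV attitudes wave 2",
    "global_instructions": "You are a professional qualitative research moderator.",
    "max_duration_minutes": 35.0,
    "phases": [
        {
            "index": 0,
            "name": "warmup",
            "instructions": "Build rapport. Be warm and non-directive. No probing.",
            "duration_minutes": 4.0,
            "topics": [],
            "follow_up_triggers": [],
            "transition_hint": "Respondent is relaxed and giving multi-sentence answers",
        },
        {
            "index": 1,
            "name": "exploration",
            "instructions": "Cover each topic, but let the respondent lead.",
            "duration_minutes": 10.0,
            "topics": ["brand perception", "purchase triggers", "usage patterns"],
            "follow_up_triggers": ["mentions charging", "mentions cost"],
            "transition_hint": "All topics have been touched at least once",
        },
    ],
}

room_metadata = json.dumps(protocol_metadata)
```

A researcher edits this JSON, not agent code, when they tweak a protocol between sessions.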
The next_phase() Function Tool
Here is where S2S function calling becomes essential. Both OpenAI Realtime and Gemini Live support function calling — the model can invoke tools you define and receive structured responses. We define a next_phase function tool that the LLM calls when it determines the current phase’s objectives have been met.
The implementation uses LiveKit’s function tool framework which normalizes the differences between OpenAI and Gemini tool declarations:
```python
from livekit.agents import function_tool, RunContext
from livekit.agents.llm import ChatContext, ChatMessage
import time
import json
import logging

logger = logging.getLogger("research-agent")


class PhaseManager:
    """Manages phase transitions for a research interview session."""

    def __init__(self, protocol: ResearchProtocol):
        self.protocol = protocol
        self.current_index = 0
        self.phase_start_time = time.time()
        self.phase_summaries: list[str] = []

    @property
    def current_phase(self) -> PhaseConfig:
        return self.protocol.phases[self.current_index]

    @property
    def is_final_phase(self) -> bool:
        return self.current_index >= len(self.protocol.phases) - 1

    @function_tool()
    async def next_phase(self, context: RunContext) -> str:
        """Advance to the next phase of the research interview.

        Call this when the current phase objectives are met and
        you are ready to transition. Do NOT call during active
        respondent speech.
        """
        elapsed = (time.time() - self.phase_start_time) / 60.0
        completed_phase = self.current_phase

        if self.is_final_phase:
            return json.dumps({
                "status": "already_final",
                "message": "You are in the final phase. Wrap up the session.",
            })

        # Record summary of completed phase
        self.phase_summaries.append(
            f"Phase '{completed_phase.name}' completed in {elapsed:.1f} min"
        )

        # Advance
        self.current_index += 1
        self.phase_start_time = time.time()
        new_phase = self.current_phase

        logger.info(
            f"Phase transition: {completed_phase.name} -> {new_phase.name} "
            f"(elapsed: {elapsed:.1f}m, budget: {completed_phase.duration_minutes}m)"
        )

        # Build new context and swap instructions
        new_instructions = self._build_phase_instructions(new_phase)
        agent = context.agent
        agent.session.set_chat_ctx(
            self._build_chat_context(new_instructions)
        )

        return json.dumps({
            "status": "transitioned",
            "new_phase": new_phase.name,
            "phase_number": f"{self.current_index + 1}/{len(self.protocol.phases)}",
            "instructions": f"You are now in the {new_phase.name} phase. "
                            f"Time budget: {new_phase.duration_minutes} minutes.",
            "topics": new_phase.topics,
        })
```
The critical line is agent.session.set_chat_ctx(). This is how you replace the active system prompt on a live S2S session. The function tool returns a JSON response to the LLM confirming the transition, but the real work happens in the set_chat_ctx() call — the model’s behavioral instructions change immediately.
For OpenAI Realtime, this triggers a session.update event on the WebSocket that replaces the instructions field. For Gemini Live, it updates the system_instruction in the session configuration. LiveKit’s MultimodalAgent abstracts this difference, but if you are building directly on the provider APIs, you need to handle each protocol separately.
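If you are working against the raw OpenAI Realtime WebSocket rather than the LiveKit abstraction, the instruction swap is a single `session.update` event. A minimal sketch, assuming an already-open WebSocket (the helper name is ours, and error handling is omitted):

```python
import json

def build_instruction_update(new_instructions: str) -> str:
    """Build the raw session.update event that replaces the active
    instructions on an OpenAI Realtime WebSocket session."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": new_instructions},
    })

# Typical usage over the live connection (ws is your WebSocket client):
#   await ws.send(build_instruction_update(new_instructions))
```

Gemini Live has no direct equivalent of this mid-session event in the same shape; the `system_instruction` update goes through its own session configuration path, which is exactly the difference the framework hides.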
One important detail: the function tool declaration tells the LLM when to call it. The docstring — “Call this when the current phase objectives are met and you are ready to transition. Do NOT call during active respondent speech.” — is not decoration; it is a behavioral instruction to the model. OpenAI Realtime and Gemini Live both use the tool description as part of their reasoning about when to invoke tools. Vague descriptions lead to premature or missed transitions.
Dynamic Instruction Swapping
Replacing the system prompt mid-session sounds straightforward. In practice, there are three approaches, and only one works well for research.
Full replacement. Drop the old system prompt, inject the new one. Problem: the model loses all context about what happened in previous phases. It might re-ask questions the respondent already answered. For research, this is unacceptable — the probing phase needs to reference what was said during exploration.
Append-only. Keep the original prompt and append new phase instructions at the end. Problem: the prompt grows linearly with phases. By phase 4, you have four full instruction sets in the context window. The model gets confused about which instructions are active. Contradictory guidance between phases causes erratic behavior.
Summary + Replace (what works). When transitioning, summarize the completed phase into 2-3 sentences of context. Prepend this summary to the new phase’s instructions. Replace the entire system prompt with: global instructions + phase summaries + current phase instructions.
```python
    # PhaseManager helper methods (continued) -- these use self, so they
    # live on the same class as next_phase() above.

    def _build_phase_instructions(self, phase: PhaseConfig) -> str:
        """Build complete instructions for a phase, including prior context."""
        sections = []

        # Global instructions always come first
        if self.protocol.global_instructions:
            sections.append(self.protocol.global_instructions)

        # Summaries from previous phases provide continuity
        if self.phase_summaries:
            summary_block = "CONTEXT FROM PREVIOUS PHASES:\n"
            for summary in self.phase_summaries:
                summary_block += f"- {summary}\n"
            summary_block += (
                "\nUse this context to avoid repeating questions and to "
                "reference earlier responses when relevant."
            )
            sections.append(summary_block)

        # Current phase instructions
        phase_header = (
            f"CURRENT PHASE: {phase.name.upper()} "
            f"(Phase {phase.index + 1} of {len(self.protocol.phases)})\n"
            f"Time budget: {phase.duration_minutes} minutes\n"
        )
        if phase.transition_hint:
            phase_header += f"Transition when: {phase.transition_hint}\n"
        sections.append(phase_header + phase.instructions)

        # Topics to cover in this phase
        if phase.topics:
            topic_block = "TOPICS TO COVER:\n"
            for i, topic in enumerate(phase.topics, 1):
                topic_block += f"{i}. {topic}\n"
            sections.append(topic_block)

        return "\n\n---\n\n".join(sections)

    def _build_chat_context(self, instructions: str) -> ChatContext:
        """Create a fresh ChatContext with updated system instructions."""
        ctx = ChatContext()
        ctx.add_message(ChatMessage.create(
            text=instructions,
            role="system",
        ))
        return ctx
```
The summary approach keeps the context window bounded — each phase summary is 2-3 sentences regardless of how long the phase lasted — while preserving the conversational thread. When the probing phase instructions say “dig deeper on themes the respondent raised,” the model has the exploration phase summary to reference.
One thing I learned the hard way: the summary needs to be generated by the LLM, not extracted from transcripts. We initially tried parsing the conversation log for key themes, but the LLM’s own understanding of what was discussed is more reliable for continuity. We inject a hidden system message asking the model to internally summarize before calling next_phase(), and the phase summaries capture the model’s understanding.
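A minimal sketch of that pre-transition summarization nudge. The message text and the `build_summary_nudge_message` helper are illustrative, not the exact production code — the point is that the nudge is a hidden system message, never spoken aloud:

```python
# Illustrative nudge text asking the model to summarize before transitioning.
SUMMARY_NUDGE = (
    "Before calling next_phase(), state in one or two sentences -- "
    "internally, not aloud to the respondent -- the key themes the "
    "respondent raised in this phase."
)

def build_summary_nudge_message(phase_name: str) -> dict:
    """Hypothetical helper: wrap the summarization nudge as a system
    message dict, ready to append to the active chat context."""
    return {
        "role": "system",
        "content": f"[Phase: {phase_name}] {SUMMARY_NUDGE}",
    }
```

The model's resulting summary is what lands in `phase_summaries`, so continuity reflects what the model understood, not what a transcript parser guessed.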
Two Modes: Simple vs Stateful
Not every session needs a multi-phase state machine. Sometimes you just need a single-mode agent — a research receptionist that screens participants, or a simple survey bot that asks five questions. We handle this with a branching pattern based on the phases array length:
```python
class ResearchAgent:
    """Voice agent supporting both simple and stateful modes."""

    async def start(self, room_metadata: dict):
        protocol = self._parse_protocol(room_metadata)

        if len(protocol.phases) <= 1:
            # Simple mode: single instruction set, no state machine
            await self._run_simple(protocol)
        else:
            # Stateful mode: multi-phase with transitions
            await self._run_stateful(protocol)

    async def _run_simple(self, protocol: ResearchProtocol):
        """Single-phase agent. No function tools for transitions."""
        instructions = protocol.phases[0].instructions
        if protocol.global_instructions:
            instructions = protocol.global_instructions + "\n\n" + instructions
        # Launch agent with static instructions, no next_phase tool
        await self._launch_agent(instructions, tools=[])

    async def _run_stateful(self, protocol: ResearchProtocol):
        """Multi-phase agent with phase transition function tools."""
        phase_manager = PhaseManager(protocol)
        initial_instructions = phase_manager._build_phase_instructions(
            phase_manager.current_phase
        )
        await self._launch_agent(
            initial_instructions,
            tools=[phase_manager.next_phase],
        )
```
The same codebase, the same deployment, the same container image. The difference is entirely in the metadata. This matters for operations — you do not want to maintain two separate agent codebases for simple and complex sessions.
Simple mode also serves as a fallback. If the phases array is malformed or missing, the agent drops to simple mode with a generic instruction set rather than crashing. Defensive design for a system where the configuration comes from an external API.
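A defensive parser in that spirit might look like the sketch below. The fallback instruction text and the standalone `parse_protocol` name are illustrative (the production version is the `_parse_protocol` method); the dataclasses are repeated here from earlier in the post so the snippet is self-contained:

```python
from dataclasses import dataclass, field

@dataclass
class PhaseConfig:  # as defined earlier in this post (trimmed)
    index: int
    name: str
    instructions: str
    duration_minutes: float
    topics: list = field(default_factory=list)
    follow_up_triggers: list = field(default_factory=list)
    transition_hint: str = ""

@dataclass
class ResearchProtocol:  # as defined earlier in this post
    session_id: str
    study_name: str
    phases: list
    global_instructions: str = ""
    max_duration_minutes: float = 45.0

# Illustrative generic instructions used when the phases array is unusable.
FALLBACK_INSTRUCTIONS = (
    "You are a friendly research assistant. Hold a natural, open-ended "
    "conversation with the participant and answer their questions."
)

def parse_protocol(metadata: dict) -> ResearchProtocol:
    """Parse room metadata into a ResearchProtocol, dropping to a
    single-phase fallback if the phases array is missing or malformed."""
    try:
        phases = [
            PhaseConfig(
                index=i,
                name=p["name"],
                instructions=p["instructions"],
                duration_minutes=float(p.get("duration_minutes", 10.0)),
                topics=list(p.get("topics", [])),
                follow_up_triggers=list(p.get("follow_up_triggers", [])),
                transition_hint=p.get("transition_hint", ""),
            )
            for i, p in enumerate(metadata["phases"])
        ]
        if not phases:
            raise ValueError("empty phases array")
    except (KeyError, TypeError, ValueError):
        # Malformed config: fall back to simple mode rather than crash
        phases = [PhaseConfig(index=0, name="simple",
                              instructions=FALLBACK_INSTRUCTIONS,
                              duration_minutes=30.0)]
    return ResearchProtocol(
        session_id=metadata.get("session_id", "unknown"),
        study_name=metadata.get("study_name", "unknown"),
        phases=phases,
        global_instructions=metadata.get("global_instructions", ""),
        max_duration_minutes=float(metadata.get("max_duration_minutes", 45.0)),
    )
```

A malformed `phases` value lands the session in simple mode with a single generic phase instead of killing the agent mid-connect.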
Time Budget Enforcement
Each phase has a target duration, but the LLM does not have an internal clock. It does not know that 8 minutes have passed. You need to inject time awareness externally.
The approach: a background task monitors elapsed time per phase and injects system messages at key thresholds. We use three trigger points — 50%, 80%, and 100% of the phase budget.
```python
import asyncio


class TimeBudgetTracker:
    """Monitors phase duration and injects time-awareness messages."""

    def __init__(self, phase_manager: PhaseManager, agent):
        self.phase_manager = phase_manager
        self.agent = agent
        self._task: Optional[asyncio.Task] = None

    async def start(self):
        self._task = asyncio.create_task(self._monitor_loop())

    async def _monitor_loop(self):
        warned_phases: dict[int, set[int]] = {}  # phase_index -> set of thresholds hit

        while True:
            await asyncio.sleep(15)  # Check every 15 seconds
            pm = self.phase_manager
            phase = pm.current_phase
            elapsed_min = (time.time() - pm.phase_start_time) / 60.0
            pct = (
                (elapsed_min / phase.duration_minutes) * 100
                if phase.duration_minutes > 0 else 0
            )
            phase_warnings = warned_phases.setdefault(pm.current_index, set())

            if pct >= 100 and 100 not in phase_warnings:
                phase_warnings.add(100)
                await self._inject_time_message(
                    f"TIME: Phase '{phase.name}' has exceeded its "
                    f"{phase.duration_minutes}-minute budget. "
                    f"Wrap up this phase and call next_phase() now."
                )
            elif pct >= 80 and 80 not in phase_warnings:
                phase_warnings.add(80)
                await self._inject_time_message(
                    f"TIME: Phase '{phase.name}' is at 80% of its time budget "
                    f"({elapsed_min:.1f}/{phase.duration_minutes} min). "
                    f"Begin transitioning to the next phase soon."
                )
            elif pct >= 50 and 50 not in phase_warnings:
                phase_warnings.add(50)
                await self._inject_time_message(
                    f"TIME: Phase '{phase.name}' is at the halfway point "
                    f"({elapsed_min:.1f}/{phase.duration_minutes} min). "
                    f"Ensure remaining topics are covered."
                )

    async def _inject_time_message(self, message: str):
        """Inject a system message into the active session."""
        ctx = self.agent.session.chat_ctx
        ctx.add_message(ChatMessage.create(
            text=message,
            role="system",
        ))
        self.agent.session.set_chat_ctx(ctx)
        logger.info(f"Time budget alert: {message}")
```
The 80% threshold is the most important one. It gives the LLM enough runway to naturally wrap up — ask one last follow-up, acknowledge the respondent’s answer, then call next_phase(). The 100% message is more forceful, using language like “call next_phase() now.”
Why not just use a hard timer that forces the transition? Because research conversations need graceful transitions. Cutting off a respondent mid-sentence to jump to the next phase violates basic qualitative research methodology. The time budget is a guide, not a gate. Some phases run long because the respondent is providing rich insights. That is fine — the researcher would rather have good data from three phases than thin data from five.
In practice, with the 80% warning, the LLM hits the next_phase() call within 90 seconds of the budget threshold about 85% of the time. The 100% override catches the remaining cases. We have never had a phase run more than 150% of its budget with this system.
What This Looks Like in Practice
A real session with a 5-phase, 35-minute protocol plays out like this:
- Agent connects, reads room metadata, parses 5 phases
- Phase 1 (Warmup, 4 min): Agent introduces itself, asks opening questions
- At ~3.5 min, 80% time warning injected
- LLM calls next_phase() at ~4.2 min
- System prompt swapped to Exploration instructions with warmup summary prepended
- Phase 2 (Exploration, 10 min): Agent follows topic guide, adapts to respondent flow
- Continue through all phases, each with its own instruction set and time budget
- Phase 5 (Wrap-up, 3 min): Agent thanks respondent, session ends
The whole mechanism is invisible to the respondent. They experience a natural conversation that happens to cover the research objectives in the right order with the right depth. That is the goal.
Looking Ahead
This state machine handles the live conversation. But research interviews produce data that needs processing — transcription, enrichment, analysis. In Part 4, we will walk through the fully automatic post-interview pipeline that turns raw recordings into structured, queryable research datasets in 3-7 minutes.
References:
- OpenAI Realtime function calling
- Gemini Live tool declarations
- LiveKit Agents function tools
- S2S Voice AI Interview Agent — full build guide
This is Part 3 of an 8-part series: Production Voice AI for Research at Scale.
Series outline:
- The Architecture Nobody Warns You About — Server-side agents, metadata transport, provider selection (Part 1)
- Zombie Agents, Pre-Warming, and the 5 Bugs That Cost Us Weeks — Production pain points and fixes (Part 2)
- Multi-Phase State Machines — Research protocol as code, LLM-driven transitions (Part 3)
- From Recording to Insight — The automatic post-interview pipeline (Part 4)
- The Real Cost — Per-minute tracking, budgets, self-hosting math (Part 5)
- What Breaks at 200 Concurrent Sessions — Scaling bottlenecks and operational metrics (Part 6)
- Multi-Language Voice AI — Language detection, provider routing, locale-aware VAD, i18n prompts (Part 7)
- Deployment and Go-Live — Docker, Kubernetes, CI/CD, zero-downtime deploys, monitoring (Part 8)