Research interviews are not free-form conversations. They follow protocols — structured, validated, peer-reviewed protocols. A market research session exploring consumer attitudes toward electric vehicles has a different conversational arc than a user experience study on mobile banking. Both have warmup phases, exploration phases, probing phases, and wrap-up phases. The difference is in the instructions at each phase, the time allocated to each phase, and the follow-up triggers that tell the moderator when to dig deeper.
This is fundamentally different from what most voice AI systems are built for. Chatbots have one mode: answer the question. Hiring interview agents have a linear topic guide. Customer service bots have intent routing. None of these need to change their entire conversational personality mid-session. Research interviews do.
In Part 2, we covered the production bugs that cost us weeks. This post is about the core architectural pattern that makes research-grade voice AI possible: the multi-phase state machine driven by LLM function calling with dynamic instruction swapping.
Why Phases Matter for Research
A typical qualitative research interview has 4-6 distinct phases, each with a different conversational objective:
Warmup (3-5 minutes). Build rapport. Open-ended, encouraging. “Tell me about yourself” territory. The AI needs to be warm, curious, and non-directive. No probing. No challenging. Just establish trust.
Exploration (8-12 minutes). Follow the topic guide. The researcher has specific themes they want to cover — brand perception, purchase triggers, usage patterns. The AI needs to ask about each theme but let the respondent lead. Follow the natural flow of conversation rather than running through a checklist.
Deep Probing (5-8 minutes). This is where the insights live. When a respondent says something interesting — an unexpected opinion, a contradictory behavior, an emotional reaction — the AI needs to dig in. “You mentioned you love the brand but haven’t bought anything in two years. Tell me more about that.” The instructions here are more aggressive, more Socratic.
Synthesis (3-5 minutes). Reflect back what the respondent said. “So it sounds like your main hesitation is around charging infrastructure, not the vehicle itself — is that fair?” This phase validates understanding and gives the respondent a chance to correct misinterpretations.
Wrap-up (2-3 minutes). Thank the respondent. Ask if there’s anything they want to add. Explain next steps. The tone shifts to professional and appreciative.
Each phase has a different system prompt, different follow-up triggers, different conversational boundaries, and a different time budget. A single static prompt cannot capture this. You need a state machine.
The key insight from building this: the LLM itself should decide when to transition between phases. Not a timer. Not a keyword detector. The LLM, because it understands the conversational context well enough to know when a phase’s objectives have been met. We give it a function tool — next_phase() — and trust it to call that tool at the right moment.
The Phases Array
Every research session is configured through room metadata that arrives when the agent connects to the LiveKit room. The critical piece is the phases array — a JSON structure that defines the complete research protocol.
```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PhaseConfig:
    """A single phase in a research interview protocol."""
    index: int
    name: str                # "warmup", "exploration", "probing", "synthesis", "wrapup"
    instructions: str        # Full system prompt for this phase
    duration_minutes: float  # Target time budget
    topics: list[str] = field(default_factory=list)
    follow_up_triggers: list[str] = field(default_factory=list)
    transition_hint: str = ""  # Guidance on when to move to next phase


@dataclass
class ResearchProtocol:
    """Complete research protocol loaded from room metadata."""
    session_id: str
    study_name: str
    phases: list[PhaseConfig]
    global_instructions: str = ""  # Applied to ALL phases
    max_duration_minutes: float = 45.0
```
The backend — whatever system manages the research study — defines the protocol. The voice agent reads it. This separation matters because researchers iterate on protocols constantly. They tweak phase instructions after the first five sessions. They add follow-up triggers when they notice respondents mentioning unexpected themes. They adjust time budgets when warmup consistently runs long. None of these changes require agent code changes. It is all metadata.
The instructions field in each PhaseConfig is a complete system prompt fragment. Not a single sentence — a full behavioral specification. For a probing phase, it might be 400 words covering techniques like laddering (“Why is that important to you? And why does that matter?”), projective techniques (“If this brand were a person, how would you describe them?”), and explicit boundaries (“Do not introduce new topics — only probe deeper on topics the respondent has already raised”).
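For concreteness, here is what a (hypothetical) two-phase slice of that metadata might look like. The field names mirror the PhaseConfig dataclass above; the instruction text, triggers, and study name are illustrative, not from a real protocol:

```python
import json

# Illustrative protocol metadata -- the backend serializes this into
# the LiveKit room metadata string that the agent reads on connect.
protocol_metadata = {
    "session_id": "sess-001",
    "study_name": "EV attitudes wave 2",
    "global_instructions": "You are a professional qualitative research moderator.",
    "max_duration_minutes": 35.0,
    "phases": [
        {
            "index": 0,
            "name": "warmup",
            "instructions": "Build rapport. Be warm and non-directive. No probing.",
            "duration_minutes": 4.0,
            "topics": [],
            "follow_up_triggers": [],
            "transition_hint": "Respondent is relaxed and giving multi-sentence answers",
        },
        {
            "index": 1,
            "name": "exploration",
            "instructions": "Cover each topic, but let the respondent lead.",
            "duration_minutes": 10.0,
            "topics": ["brand perception", "purchase triggers", "usage patterns"],
            "follow_up_triggers": ["mentions charging", "mentions cost"],
            "transition_hint": "All topics have been touched at least once",
        },
    ],
}

room_metadata = json.dumps(protocol_metadata)
```

A researcher edits this JSON, not agent code, when they tweak a protocol between sessions.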
The next_phase() Function Tool
Here is where S2S function calling becomes essential. Both OpenAI Realtime and Gemini Live support function calling — the model can invoke tools you define and receive structured responses. We define a next_phase function tool that the LLM calls when it determines the current phase’s objectives have been met.
The implementation uses LiveKit’s function tool framework which normalizes the differences between OpenAI and Gemini tool declarations:
```python
from livekit.agents import function_tool, RunContext
from livekit.agents.llm import ChatContext, ChatMessage
import time
import json
import logging

logger = logging.getLogger("research-agent")


class PhaseManager:
    """Manages phase transitions for a research interview session."""

    def __init__(self, protocol: ResearchProtocol):
        self.protocol = protocol
        self.current_index = 0
        self.phase_start_time = time.time()
        self.phase_summaries: list[str] = []

    @property
    def current_phase(self) -> PhaseConfig:
        return self.protocol.phases[self.current_index]

    @property
    def is_final_phase(self) -> bool:
        return self.current_index >= len(self.protocol.phases) - 1

    @function_tool()
    async def next_phase(self, context: RunContext) -> str:
        """Advance to the next phase of the research interview.

        Call this when the current phase objectives are met and
        you are ready to transition. Do NOT call during active
        respondent speech.
        """
        elapsed = (time.time() - self.phase_start_time) / 60.0
        completed_phase = self.current_phase

        if self.is_final_phase:
            return json.dumps({
                "status": "already_final",
                "message": "You are in the final phase. Wrap up the session.",
            })

        # Record summary of completed phase
        self.phase_summaries.append(
            f"Phase '{completed_phase.name}' completed in {elapsed:.1f} min"
        )

        # Advance
        self.current_index += 1
        self.phase_start_time = time.time()
        new_phase = self.current_phase

        logger.info(
            f"Phase transition: {completed_phase.name} -> {new_phase.name} "
            f"(elapsed: {elapsed:.1f}m, budget: {completed_phase.duration_minutes}m)"
        )

        # Build new context and swap instructions
        new_instructions = self._build_phase_instructions(new_phase)
        agent = context.agent
        agent.session.set_chat_ctx(
            self._build_chat_context(new_instructions)
        )

        return json.dumps({
            "status": "transitioned",
            "new_phase": new_phase.name,
            "phase_number": f"{self.current_index + 1}/{len(self.protocol.phases)}",
            "instructions": f"You are now in the {new_phase.name} phase. "
                            f"Time budget: {new_phase.duration_minutes} minutes.",
            "topics": new_phase.topics,
        })
```
The critical line is agent.session.set_chat_ctx(). This is how you replace the active system prompt on a live S2S session. The function tool returns a JSON response to the LLM confirming the transition, but the real work happens in the set_chat_ctx() call — the model’s behavioral instructions change immediately.
For OpenAI Realtime, this triggers a session.update event on the WebSocket that replaces the instructions field. For Gemini Live, it updates the system_instruction in the session configuration. LiveKit’s MultimodalAgent abstracts this difference, but if you are building directly on the provider APIs, you need to handle each protocol separately.
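If you are working against the raw OpenAI Realtime WebSocket rather than the LiveKit abstraction, the instruction swap is a single `session.update` event. A minimal sketch, assuming an already-open WebSocket (the helper name is ours, and error handling is omitted):

```python
import json

def build_instruction_update(new_instructions: str) -> str:
    """Build the raw session.update event that replaces the active
    instructions on an OpenAI Realtime WebSocket session."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": new_instructions},
    })

# Typical usage over the live connection (ws is your WebSocket client):
#   await ws.send(build_instruction_update(new_instructions))
```

Gemini Live has no direct equivalent of this mid-session event in the same shape; the `system_instruction` update goes through its own session configuration path, which is exactly the difference the framework hides.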
One important detail: the function tool declaration tells the LLM when to call it. The docstring — “Call this when the current phase objectives are met and you are ready to transition. Do NOT call during active respondent speech.” — is not decoration; it is a behavioral instruction to the model. OpenAI Realtime and Gemini Live both use the tool description as part of their reasoning about when to invoke tools. Vague descriptions lead to premature or missed transitions.
Dynamic Instruction Swapping
Replacing the system prompt mid-session sounds straightforward. In practice, there are three approaches, and only one works well for research.
Full replacement. Drop the old system prompt, inject the new one. Problem: the model loses all context about what happened in previous phases. It might re-ask questions the respondent already answered. For research, this is unacceptable — the probing phase needs to reference what was said during exploration.
Append-only. Keep the original prompt and append new phase instructions at the end. Problem: the prompt grows linearly with phases. By phase 4, you have four full instruction sets in the context window. The model gets confused about which instructions are active. Contradictory guidance between phases causes erratic behavior.
Summary + Replace (what works). When transitioning, summarize the completed phase into 2-3 sentences of context. Prepend this summary to the new phase’s instructions. Replace the entire system prompt with: global instructions + phase summaries + current phase instructions.
```python
    # PhaseManager helper methods (continued) -- these use self, so they
    # live on the same class as next_phase() above.

    def _build_phase_instructions(self, phase: PhaseConfig) -> str:
        """Build complete instructions for a phase, including prior context."""
        sections = []

        # Global instructions always come first
        if self.protocol.global_instructions:
            sections.append(self.protocol.global_instructions)

        # Summaries from previous phases provide continuity
        if self.phase_summaries:
            summary_block = "CONTEXT FROM PREVIOUS PHASES:\n"
            for summary in self.phase_summaries:
                summary_block += f"- {summary}\n"
            summary_block += (
                "\nUse this context to avoid repeating questions and to "
                "reference earlier responses when relevant."
            )
            sections.append(summary_block)

        # Current phase instructions
        phase_header = (
            f"CURRENT PHASE: {phase.name.upper()} "
            f"(Phase {phase.index + 1} of {len(self.protocol.phases)})\n"
            f"Time budget: {phase.duration_minutes} minutes\n"
        )
        if phase.transition_hint:
            phase_header += f"Transition when: {phase.transition_hint}\n"
        sections.append(phase_header + phase.instructions)

        # Topics to cover in this phase
        if phase.topics:
            topic_block = "TOPICS TO COVER:\n"
            for i, topic in enumerate(phase.topics, 1):
                topic_block += f"{i}. {topic}\n"
            sections.append(topic_block)

        return "\n\n---\n\n".join(sections)

    def _build_chat_context(self, instructions: str) -> ChatContext:
        """Create a fresh ChatContext with updated system instructions."""
        ctx = ChatContext()
        ctx.add_message(ChatMessage.create(
            text=instructions,
            role="system",
        ))
        return ctx
```
The summary approach keeps the context window bounded — each phase summary is 2-3 sentences regardless of how long the phase lasted — while preserving the conversational thread. When the probing phase instructions say “dig deeper on themes the respondent raised,” the model has the exploration phase summary to reference.
One thing I learned the hard way: the summary needs to be generated by the LLM, not extracted from transcripts. We initially tried parsing the conversation log for key themes, but the LLM’s own understanding of what was discussed is more reliable for continuity. We inject a hidden system message asking the model to internally summarize before calling next_phase(), and the phase summaries capture the model’s understanding.
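A minimal sketch of that pre-transition summarization nudge. The message text and the `build_summary_nudge_message` helper are illustrative, not the exact production code — the point is that the nudge is a hidden system message, never spoken aloud:

```python
# Illustrative nudge text asking the model to summarize before transitioning.
SUMMARY_NUDGE = (
    "Before calling next_phase(), state in one or two sentences -- "
    "internally, not aloud to the respondent -- the key themes the "
    "respondent raised in this phase."
)

def build_summary_nudge_message(phase_name: str) -> dict:
    """Hypothetical helper: wrap the summarization nudge as a system
    message dict, ready to append to the active chat context."""
    return {
        "role": "system",
        "content": f"[Phase: {phase_name}] {SUMMARY_NUDGE}",
    }
```

The model's resulting summary is what lands in `phase_summaries`, so continuity reflects what the model understood, not what a transcript parser guessed.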
Two Modes: Simple vs Stateful
Not every session needs a multi-phase state machine. Sometimes you just need a single-mode agent — a research receptionist that screens participants, or a simple survey bot that asks five questions. We handle this with a branching pattern based on the phases array length:
```python
class ResearchAgent:
    """Voice agent supporting both simple and stateful modes."""

    async def start(self, room_metadata: dict):
        protocol = self._parse_protocol(room_metadata)

        if len(protocol.phases) <= 1:
            # Simple mode: single instruction set, no state machine
            await self._run_simple(protocol)
        else:
            # Stateful mode: multi-phase with transitions
            await self._run_stateful(protocol)

    async def _run_simple(self, protocol: ResearchProtocol):
        """Single-phase agent. No function tools for transitions."""
        instructions = protocol.phases[0].instructions
        if protocol.global_instructions:
            instructions = protocol.global_instructions + "\n\n" + instructions
        # Launch agent with static instructions, no next_phase tool
        await self._launch_agent(instructions, tools=[])

    async def _run_stateful(self, protocol: ResearchProtocol):
        """Multi-phase agent with phase transition function tools."""
        phase_manager = PhaseManager(protocol)
        initial_instructions = phase_manager._build_phase_instructions(
            phase_manager.current_phase
        )
        await self._launch_agent(
            initial_instructions,
            tools=[phase_manager.next_phase],
        )
```
The same codebase, the same deployment, the same container image. The difference is entirely in the metadata. This matters for operations — you do not want to maintain two separate agent codebases for simple and complex sessions.
Simple mode also serves as a fallback. If the phases array is malformed or missing, the agent drops to simple mode with a generic instruction set rather than crashing. Defensive design for a system where the configuration comes from an external API.
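A defensive parser in that spirit might look like the sketch below. The fallback instruction text and the standalone `parse_protocol` name are illustrative (the production version is the `_parse_protocol` method); the dataclasses are repeated here from earlier in the post so the snippet is self-contained:

```python
from dataclasses import dataclass, field

@dataclass
class PhaseConfig:  # as defined earlier in this post (trimmed)
    index: int
    name: str
    instructions: str
    duration_minutes: float
    topics: list = field(default_factory=list)
    follow_up_triggers: list = field(default_factory=list)
    transition_hint: str = ""

@dataclass
class ResearchProtocol:  # as defined earlier in this post
    session_id: str
    study_name: str
    phases: list
    global_instructions: str = ""
    max_duration_minutes: float = 45.0

# Illustrative generic instructions used when the phases array is unusable.
FALLBACK_INSTRUCTIONS = (
    "You are a friendly research assistant. Hold a natural, open-ended "
    "conversation with the participant and answer their questions."
)

def parse_protocol(metadata: dict) -> ResearchProtocol:
    """Parse room metadata into a ResearchProtocol, dropping to a
    single-phase fallback if the phases array is missing or malformed."""
    try:
        phases = [
            PhaseConfig(
                index=i,
                name=p["name"],
                instructions=p["instructions"],
                duration_minutes=float(p.get("duration_minutes", 10.0)),
                topics=list(p.get("topics", [])),
                follow_up_triggers=list(p.get("follow_up_triggers", [])),
                transition_hint=p.get("transition_hint", ""),
            )
            for i, p in enumerate(metadata["phases"])
        ]
        if not phases:
            raise ValueError("empty phases array")
    except (KeyError, TypeError, ValueError):
        # Malformed config: fall back to simple mode rather than crash
        phases = [PhaseConfig(index=0, name="simple",
                              instructions=FALLBACK_INSTRUCTIONS,
                              duration_minutes=30.0)]
    return ResearchProtocol(
        session_id=metadata.get("session_id", "unknown"),
        study_name=metadata.get("study_name", "unknown"),
        phases=phases,
        global_instructions=metadata.get("global_instructions", ""),
        max_duration_minutes=float(metadata.get("max_duration_minutes", 45.0)),
    )
```

A malformed `phases` value lands the session in simple mode with a single generic phase instead of killing the agent mid-connect.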
Time Budget Enforcement
Each phase has a target duration, but the LLM does not have an internal clock. It does not know that 8 minutes have passed. You need to inject time awareness externally.
The approach: a background task monitors elapsed time per phase and injects system messages at key thresholds. We use three trigger points — 50%, 80%, and 100% of the phase budget.
```python
import asyncio


class TimeBudgetTracker:
    """Monitors phase duration and injects time-awareness messages."""

    def __init__(self, phase_manager: PhaseManager, agent):
        self.phase_manager = phase_manager
        self.agent = agent
        self._task: Optional[asyncio.Task] = None

    async def start(self):
        self._task = asyncio.create_task(self._monitor_loop())

    async def _monitor_loop(self):
        warned_phases: dict[int, set[int]] = {}  # phase_index -> set of thresholds hit

        while True:
            await asyncio.sleep(15)  # Check every 15 seconds
            pm = self.phase_manager
            phase = pm.current_phase
            elapsed_min = (time.time() - pm.phase_start_time) / 60.0
            pct = (
                (elapsed_min / phase.duration_minutes) * 100
                if phase.duration_minutes > 0 else 0
            )
            phase_warnings = warned_phases.setdefault(pm.current_index, set())

            if pct >= 100 and 100 not in phase_warnings:
                phase_warnings.add(100)
                await self._inject_time_message(
                    f"TIME: Phase '{phase.name}' has exceeded its "
                    f"{phase.duration_minutes}-minute budget. "
                    f"Wrap up this phase and call next_phase() now."
                )
            elif pct >= 80 and 80 not in phase_warnings:
                phase_warnings.add(80)
                await self._inject_time_message(
                    f"TIME: Phase '{phase.name}' is at 80% of its time budget "
                    f"({elapsed_min:.1f}/{phase.duration_minutes} min). "
                    f"Begin transitioning to the next phase soon."
                )
            elif pct >= 50 and 50 not in phase_warnings:
                phase_warnings.add(50)
                await self._inject_time_message(
                    f"TIME: Phase '{phase.name}' is at the halfway point "
                    f"({elapsed_min:.1f}/{phase.duration_minutes} min). "
                    f"Ensure remaining topics are covered."
                )

    async def _inject_time_message(self, message: str):
        """Inject a system message into the active session."""
        ctx = self.agent.session.chat_ctx
        ctx.add_message(ChatMessage.create(
            text=message,
            role="system",
        ))
        self.agent.session.set_chat_ctx(ctx)
        logger.info(f"Time budget alert: {message}")
```
The 80% threshold is the most important one. It gives the LLM enough runway to naturally wrap up — ask one last follow-up, acknowledge the respondent’s answer, then call next_phase(). The 100% message is more forceful, using language like “call next_phase() now.”
Why not just use a hard timer that forces the transition? Because research conversations need graceful transitions. Cutting off a respondent mid-sentence to jump to the next phase violates basic qualitative research methodology. The time budget is a guide, not a gate. Some phases run long because the respondent is providing rich insights. That is fine — the researcher would rather have good data from three phases than thin data from five.
In practice, with the 80% warning, the LLM hits the next_phase() call within 90 seconds of the budget threshold about 85% of the time. The 100% override catches the remaining cases. We have never had a phase run more than 150% of its budget with this system.
What This Looks Like in Practice
A real session with a 5-phase, 35-minute protocol plays out like this:
- Agent connects, reads room metadata, parses 5 phases
- Phase 1 (Warmup, 4 min): Agent introduces itself, asks opening questions
- At ~3.5 min, 80% time warning injected
- LLM calls next_phase() at ~4.2 min
- System prompt swapped to Exploration instructions with warmup summary prepended
- Phase 2 (Exploration, 10 min): Agent follows topic guide, adapts to respondent flow
- Continue through all phases, each with its own instruction set and time budget
- Phase 5 (Wrap-up, 3 min): Agent thanks respondent, session ends
The whole mechanism is invisible to the respondent. They experience a natural conversation that happens to cover the research objectives in the right order with the right depth. That is the goal.
Looking Ahead
This state machine handles the live conversation. But research interviews produce data that needs processing — transcription, enrichment, analysis. In Part 4, we will walk through the fully automatic post-interview pipeline that turns raw recordings into structured, queryable research datasets in 3-7 minutes.
References:
- OpenAI Realtime function calling
- Gemini Live tool declarations
- LiveKit Agents function tools
- S2S Voice AI Interview Agent — full build guide
This is Part 3 of an 8-part series: Production Voice AI for Research at Scale.
Series outline:
- The Architecture Nobody Warns You About — Server-side agents, metadata transport, provider selection (Part 1)
- Zombie Agents, Pre-Warming, and the 5 Bugs That Cost Us Weeks — Production pain points and fixes (Part 2)
- Multi-Phase State Machines — Research protocol as code, LLM-driven transitions (Part 3)
- From Recording to Insight — The automatic post-interview pipeline (Part 4)
- The Real Cost — Per-minute tracking, budgets, self-hosting math (Part 5)
- What Breaks at 200 Concurrent Sessions — Scaling bottlenecks and operational metrics (Part 6)
- Multi-Language Voice AI — Language detection, provider routing, locale-aware VAD, i18n prompts (Part 7)
- Deployment and Go-Live — Docker, Kubernetes, CI/CD, zero-downtime deploys, monitoring (Part 8)