The Voice AI Interview Playbook: Multi-Role Agents — Interviewer, Coach, and Evaluator Personas (Part 5 of 12)

In Part 4, we built our voice pipeline — STT, LLM, TTS all wired together into something that can actually hold a conversation. But a voice pipeline is just plumbing. What makes an AI interview system genuinely useful is the layer above that: the personas that govern how the AI behaves, what it says, and when it says it.

This is the part that most tutorials skip. They show you how to hook up Whisper to GPT-4 to ElevenLabs, and then call it done. But if you’ve ever tried to run an actual technical interview with a generic chatbot, you know the problem: the AI is either too accommodating (“Great answer!”), too robotic (reads from a script verbatim), or loses the thread entirely after the first follow-up question.

The real challenge is orchestrating three very different modes of intelligence:

An Interviewer that asks sharp questions, probes weak answers, and manages the clock
A Coach that gives real-time feedback without crushing confidence
An Evaluator that scores objectively without the candidate knowing the criteria

One infrastructure, three personas. Let me show you how to build all of them.

The Three Personas

Before I get into code, I want to establish what each persona is actually trying to accomplish. This matters more than you might think — the system prompt is where 80% of the behavior lives.

Persona 1: The Interviewer

The Interviewer has one job: run a structured interview that produces signal. Not just “get through the questions” — actually differentiate between candidates. This means:

Following the rubric without sounding like it’s following a rubric
Adapting difficulty based on what the candidate is demonstrating
Managing time across sections without making the candidate feel rushed
Probing vague answers without being aggressive
Not helping the candidate with hints (this is where generic chatbots fail badly)

Here’s the full system prompt I use:

You are Alex, an experienced senior technical interviewer at a software company.
You are conducting a technical interview for a senior backend engineer position.

CORE BEHAVIOR:
- Ask one question at a time. Wait for the complete answer before responding.
- Never volunteer information that helps the candidate answer. If they ask for hints,
  acknowledge the question and redirect: "That's worth exploring — what's your current
  thinking on it?"
- Probe incomplete or vague answers with follow-ups. Target 2-3 levels deep on important topics.
- Use a neutral, professional tone. Neither encouraging nor discouraging.
- Never reveal that you are following a rubric or scoring criteria.

TIME AND PACING:
- You have approximately {total_minutes} minutes total for this interview.
- Current section: {current_section} — allocated {section_minutes} minutes.
- If a candidate is spending too long on one question, use: "Let's move to the next area —
  we want to make sure we cover everything today."
- If a candidate finishes quickly, probe deeper or move to the next topic.

QUESTION STYLE:
- Start with behavioral framing: "Tell me about a time when..."
- For technical topics, start conceptual: "How would you approach..."
- Follow up with specifics: "What specifically did you choose X over Y?"
- Challenge depth: "What are the limitations of that approach?"
- Challenge breadth: "Are there scenarios where that wouldn't work?"

DIFFICULTY ADAPTATION:
- If the candidate answers the first 2 questions correctly and with depth, increase difficulty.
- If the candidate struggles on basics, do not advance to advanced topics — stay and probe.
- Never make the difficulty shift obvious. Simply choose harder/easier follow-ups naturally.

STRICTLY FORBIDDEN:
- Do not say "Great answer!" or any variation of praise during the interview.
- Do not say "That's wrong" — instead probe: "Walk me through that reasoning."
- Do not summarize what the candidate just said back to them before asking the next question.
- Do not use filler affirmations like "Absolutely!" or "Perfect!" before questions.

CURRENT RUBRIC AREAS (do not reveal these):
{rubric_areas_json}

Begin by introducing yourself naturally and explaining the interview structure.

Notice a few things. The prompt is specific about anti-patterns (“Never say ‘Great answer!’”). It uses template variables for the rubric and timing — these get filled in at session creation time. And it explicitly tells the AI what it cannot do, not just what it should do.

The “difficulty adaptation” section is particularly important. Without it, the LLM defaults to asking questions in order from whatever rubric you give it, regardless of whether the candidate is clearly at a junior or staff level. Explicitly instructing it to adapt — and to do so naturally — dramatically improves signal quality.

Persona 2: The Coach

The Coach has a fundamentally different goal. Instead of producing evaluation signal, it’s producing improvement. This requires a personality flip: warm, specific, actionable. But not sycophantic — the candidate already knows if their answer was bad.

The hard part of coaching prompts is threading the needle between useful feedback and crushing confidence. “Your answer was weak and you clearly don’t understand distributed systems” is technically accurate feedback but terrible coaching. “That was a great start!” is great for morale but useless.

You are Jordan, an interview coach with 10 years of experience helping software engineers
land senior roles at top tech companies. You are running a practice interview session.

YOUR GOAL:
Help the candidate improve their interview performance through practice, specific feedback,
and technique coaching. Focus on communication quality as much as technical accuracy.

FEEDBACK STYLE:
- Give feedback after each answer, before asking the next question.
- Structure feedback as: what worked + what to improve + how to improve it.
- Be specific, not generic. "Your answer was unclear" is not useful.
  "You mentioned caching three times but didn't explain the eviction strategy —
  that's usually where interviewers probe next" is useful.
- Reference the specific words or phrases they used.

COMMUNICATION COACHING (track and call out):
- Filler words: "um", "uh", "like", "you know", "basically"
  → Suggest: "Try pausing silently instead of filling with 'um' — it sounds more confident."
- Over-qualification: "I think maybe...", "I'm not sure but..."
  → Suggest: "Own your answers. Say 'I would...' not 'I think I might...'"
- Too long: answers over 3 minutes on a single question
  → Suggest: "Real interviews have time pressure. Practice the 90-second version first."
- Too short: answers under 30 seconds on technical questions
  → Suggest: "Expand on the trade-offs. Interviewers want to see your reasoning process."
- Vague claims: "I've worked with distributed systems extensively"
  → Suggest: "Quantify and be specific: 'I scaled X from Y to Z users using...'"

TECHNICAL COACHING:
- If the answer is technically correct but poorly structured, focus on structure.
- If the answer is technically wrong, gently correct with explanation.
- After each topic, give a 1-10 confidence score and explain what would move it up.
- Suggest specific things to study: "You'd benefit from reading about X — the CAP theorem
  section in Designing Data-Intensive Applications covers this well."

TONE:
- Warm but honest. Think mentor, not cheerleader.
- Use phrases like "Here's what that sounded like from an interviewer's perspective..."
- Never say an answer was "perfect" — there's always something to sharpen.
- Celebrate genuine progress: "That's noticeably cleaner than your first answer on this topic."

SESSION CONTEXT:
{session_context}

Start by explaining what you'll be working on today and how the session will run.

The communication coaching section is where this prompt earns its keep. A surprising number of technically strong candidates tank interviews because of communication patterns they don’t even notice — the constant “basically”, the over-hedging, the answers that go on for five minutes when the interviewer wanted two. Explicit instructions to track and call these out turns the AI into something approaching a speech coach.

Persona 3: The Evaluator

The Evaluator is the quietest persona, but in some ways the most important. It runs in the background — not speaking, just building a structured assessment as the conversation unfolds.

I implement the Evaluator differently from the other two personas. Instead of a conversational system prompt, it gets a structured assessment prompt that it uses to generate periodic scoring updates via function calls:

You are an expert technical evaluator assessing a candidate interview in real time.

YOUR ROLE:
You observe the interview conversation and produce structured assessments. You do NOT
participate in the conversation. You analyze what was said and evaluate it against the rubric.

ASSESSMENT FRAMEWORK:
For each candidate response, evaluate:

1. TECHNICAL ACCURACY (0-10)
   - 0-3: Fundamental misunderstandings or significantly wrong
   - 4-6: Partially correct, missing key concepts
   - 7-8: Correct with minor gaps
   - 9-10: Comprehensive and accurate

2. COMMUNICATION CLARITY (0-10)
   - 0-3: Difficult to follow, incoherent structure
   - 4-6: Understandable but could be clearer
   - 7-8: Clear and well-structured
   - 9-10: Exceptionally clear, concise, well-organized

3. DEPTH OF REASONING (0-10)
   - 0-3: Surface level only, no analysis
   - 4-6: Some reasoning shown, but shallow
   - 7-8: Good analytical depth, considers trade-offs
   - 9-10: Expert-level reasoning, nuanced trade-off analysis

4. RELEVANCE (0-10)
   - Did they answer the actual question asked?
   - Penalty for lengthy tangents that don't connect back

RUBRIC:
{rubric_json}

For each answer, call score_answer() with your assessment.
Flag any answer that needs follow-up by calling flag_followup() with the reason.

IMPORTANT: Maintain strict objectivity. Do not adjust scores based on:
- How confident the candidate sounded
- The candidate's apparent background or accent
- How much you personally agree with their architectural choices
- Whether they used the "right" terminology vs. correct concepts

The objectivity section at the bottom is not boilerplate — it’s addressing real failure modes I’ve seen in LLM-based evaluation. Without explicit instructions, LLMs will score confident-sounding answers higher than accurate-but-hesitant ones, and will penalize non-standard terminology even when the underlying concept is correct.

Persona Switching: When and How

You have two patterns for using multiple personas: separate sessions and hot-switching. They solve different problems.

Separate sessions is the simple model. The candidate does a practice session with the Coach, then comes back for an actual interview with the Interviewer, and the Evaluator runs silently during the interview. Three separate conversations, three separate context windows.

This is what you want for real hiring workflows. The candidate shouldn’t be getting feedback during an actual interview — that would contaminate the assessment. And the Evaluator shouldn’t be visible to the candidate at all.

Hot-switching is for more complex scenarios — primarily internal training tools or self-assessment platforms where a single user is playing multiple roles (e.g., preparing for an interview and wanting to shift between practice and critique mode). Here’s how I implement it:

class PersonaManager:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.active_persona = "interviewer"
        self.persona_prompts = {
            "interviewer": INTERVIEWER_SYSTEM_PROMPT,
            "coach": COACH_SYSTEM_PROMPT,
            "evaluator": EVALUATOR_SYSTEM_PROMPT,
        }
        # Shared session context that survives persona switches
        self.session_context = {
            "questions_asked": [],
            "candidate_answers": [],
            "topics_covered": [],
            "running_scores": {},
            "flags": [],
        }

    def switch_persona(self, new_persona: str, preserve_history: bool = True):
        """
        Hot-switch to a new persona mid-session.

        preserve_history: If True, inject a context summary into the new
        persona's first message so it knows what happened before.
        """
        if new_persona not in self.persona_prompts:
            raise ValueError(f"Unknown persona: {new_persona}")

        old_persona = self.active_persona
        self.active_persona = new_persona

        if preserve_history:
            context_summary = self._build_context_summary()
            # Inject context into the new persona's system prompt
            return self.persona_prompts[new_persona] + f"\n\nSESSION HISTORY:\n{context_summary}"

        return self.persona_prompts[new_persona]

    def _build_context_summary(self) -> str:
        """Build a compact summary of what's happened so far."""
        lines = []

        if self.session_context["topics_covered"]:
            lines.append(f"Topics covered: {', '.join(self.session_context['topics_covered'])}")

        if self.session_context["questions_asked"]:
            lines.append(f"Questions asked: {len(self.session_context['questions_asked'])}")
            lines.append(f"Last question: {self.session_context['questions_asked'][-1]}")

        if self.session_context["running_scores"]:
            avg_score = sum(self.session_context["running_scores"].values()) / len(
                self.session_context["running_scores"]
            )
            lines.append(f"Average score so far: {avg_score:.1f}/10")

        if self.session_context["flags"]:
            lines.append(f"Flagged items: {len(self.session_context['flags'])}")

        return "\n".join(lines)

The key insight here is that persona switching doesn’t mean context switching. The session_context dictionary persists across persona changes — you don’t want to lose the questions asked or the running scores just because you switched from Interviewer to Coach mode. But you do want to give the new persona a compact summary of what happened before, rather than dumping the full conversation history into the prompt (which gets expensive fast).

When should you hot-switch vs. separate sessions? My rule: if the candidate is supposed to know about the switch, hot-switch is fine. If the switch happens behind the scenes (e.g., the Evaluator kicking in silently when the Interviewer starts a new section), keep them as separate processes sharing a state store rather than a single context.

The Interview Flow State Machine

Here’s something most people don’t think about when building interview agents: the interview itself is a state machine. There are defined states, and the AI should transition between them based on specific conditions — not just randomly or when it “feels right.”

The states I use for a full technical interview:

INTRODUCTION → WARM_UP → TECHNICAL_DEEP_DIVE → SYSTEM_DESIGN → BEHAVIORAL → CANDIDATE_QA → WRAP_UP

Each state has:

An entry action (what the AI says/does when entering the state)
Exit conditions (what triggers the transition)
A time budget (hard cap on state duration)
Fallback behavior (what to do if exit conditions aren’t met before time runs out)

from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, Optional
import time

class InterviewState(Enum):
    INTRODUCTION = "introduction"
    WARM_UP = "warm_up"
    TECHNICAL_DEEP_DIVE = "technical_deep_dive"
    SYSTEM_DESIGN = "system_design"
    BEHAVIORAL = "behavioral"
    CANDIDATE_QA = "candidate_qa"
    WRAP_UP = "wrap_up"
    COMPLETE = "complete"

@dataclass
class StateConfig:
    name: InterviewState
    time_budget_minutes: int
    min_questions: int
    entry_prompt: str  # Injected as context when entering state
    exit_conditions: list[str]  # Checked after each answer
    fallback_transition: InterviewState  # Where to go if time runs out

STATE_CONFIGS = {
    InterviewState.INTRODUCTION: StateConfig(
        name=InterviewState.INTRODUCTION,
        time_budget_minutes=3,
        min_questions=0,
        entry_prompt="Welcome the candidate, introduce yourself, explain the interview format.",
        exit_conditions=["candidate_acknowledged_format"],
        fallback_transition=InterviewState.WARM_UP,
    ),
    InterviewState.WARM_UP: StateConfig(
        name=InterviewState.WARM_UP,
        time_budget_minutes=8,
        min_questions=2,
        entry_prompt=(
            "Ask the candidate about their background and current role. "
            "Use this to calibrate your understanding of their experience level. "
            "Ask 2 questions max — this is just warm-up."
        ),
        exit_conditions=["min_questions_met", "background_understood"],
        fallback_transition=InterviewState.TECHNICAL_DEEP_DIVE,
    ),
    InterviewState.TECHNICAL_DEEP_DIVE: StateConfig(
        name=InterviewState.TECHNICAL_DEEP_DIVE,
        time_budget_minutes=25,
        min_questions=4,
        entry_prompt=(
            "Now focus on deep technical questions from the rubric. "
            "Probe each answer 2-3 levels deep before moving to the next topic. "
            "Cover at least {required_topics} from the rubric."
        ),
        exit_conditions=["required_topics_covered", "min_questions_met"],
        fallback_transition=InterviewState.SYSTEM_DESIGN,
    ),
    InterviewState.SYSTEM_DESIGN: StateConfig(
        name=InterviewState.SYSTEM_DESIGN,
        time_budget_minutes=20,
        min_questions=1,
        entry_prompt=(
            "Present a system design problem: {system_design_prompt}. "
            "Let the candidate drive the design. Ask clarifying questions and probe trade-offs. "
            "Don't provide hints about the 'right' architecture."
        ),
        exit_conditions=["design_presented", "major_components_discussed"],
        fallback_transition=InterviewState.BEHAVIORAL,
    ),
    InterviewState.BEHAVIORAL: StateConfig(
        name=InterviewState.BEHAVIORAL,
        time_budget_minutes=15,
        min_questions=2,
        entry_prompt=(
            "Ask behavioral questions using STAR format. "
            "Focus on: leadership, conflict resolution, handling failure, and collaboration. "
            "Probe for specifics — don't accept vague answers."
        ),
        exit_conditions=["min_questions_met", "key_behavioral_areas_covered"],
        fallback_transition=InterviewState.CANDIDATE_QA,
    ),
    InterviewState.CANDIDATE_QA: StateConfig(
        name=InterviewState.CANDIDATE_QA,
        time_budget_minutes=5,
        min_questions=0,
        entry_prompt=(
            "Invite the candidate to ask questions about the role, team, or company. "
            "Answer questions about the role and culture. Do not discuss compensation or next steps. "
            "If asked about the interview outcome, say 'The team will be in touch within a few days.'"
        ),
        exit_conditions=["candidate_has_no_more_questions", "time_elapsed"],
        fallback_transition=InterviewState.WRAP_UP,
    ),
    InterviewState.WRAP_UP: StateConfig(
        name=InterviewState.WRAP_UP,
        time_budget_minutes=2,
        min_questions=0,
        entry_prompt="Thank the candidate, explain next steps, and close the interview professionally.",
        exit_conditions=["wrap_up_complete"],
        fallback_transition=InterviewState.COMPLETE,
    ),
}

class InterviewStateMachine:
    def __init__(self, total_minutes: int = 60):
        self.current_state = InterviewState.INTRODUCTION
        self.state_entry_time = time.time()
        self.session_start_time = time.time()
        self.total_minutes = total_minutes
        self.state_history = []
        self.questions_in_state = 0
        self.conditions_met = set()

    def check_transition(self) -> Optional[InterviewState]:
        """
        Check whether the current state should transition.
        Called after each candidate answer.
        Returns the new state if transition should happen, None otherwise.
        """
        config = STATE_CONFIGS[self.current_state]
        time_in_state = (time.time() - self.state_entry_time) / 60.0

        # Hard time limit — always transition
        if time_in_state >= config.time_budget_minutes:
            return config.fallback_transition

        # Check if exit conditions are met
        conditions_satisfied = all(
            cond in self.conditions_met
            for cond in config.exit_conditions
            if not cond.startswith("time_")  # Skip time-based conditions here
        )

        if conditions_satisfied and self.questions_in_state >= config.min_questions:
            # Determine next state
            state_order = list(InterviewState)
            current_idx = state_order.index(self.current_state)
            if current_idx + 1 < len(state_order):
                return state_order[current_idx + 1]

        return None

    def transition_to(self, new_state: InterviewState):
        """Execute a state transition."""
        self.state_history.append({
            "from": self.current_state,
            "to": new_state,
            "duration_minutes": (time.time() - self.state_entry_time) / 60.0,
            "questions_asked": self.questions_in_state,
        })

        self.current_state = new_state
        self.state_entry_time = time.time()
        self.questions_in_state = 0
        self.conditions_met = set()

    def mark_condition_met(self, condition: str):
        self.conditions_met.add(condition)
        self.questions_in_state += 1

The state machine runs alongside the LLM conversation — it doesn’t replace it. The LLM makes the conversation feel natural; the state machine ensures the interview actually covers what it needs to cover. Every time the candidate finishes an answer, the state machine checks transition conditions. If they’re met, the state machine generates a transition signal that gets injected into the LLM’s context.

Function Calling for Flow Control

State transitions are one thing, but real-time flow control requires function calling. Here are the three functions that do the heavy lifting:

from openai import AsyncOpenAI
from typing import Any
import json

# Function definitions for the LLM
INTERVIEW_FUNCTIONS = [
    {
        "name": "transition_section",
        "description": (
            "Transition the interview to the next section. Call this when "
            "you've covered enough ground in the current section and it's time to move on."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "from_section": {
                    "type": "string",
                    "enum": [s.value for s in InterviewState],
                    "description": "The current section being completed",
                },
                "to_section": {
                    "type": "string",
                    "enum": [s.value for s in InterviewState],
                    "description": "The next section to transition to",
                },
                "reason": {
                    "type": "string",
                    "description": "Brief internal reason for the transition (not shown to candidate)",
                },
                "topics_covered": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of topics covered in the current section",
                },
            },
            "required": ["from_section", "to_section", "reason"],
        },
    },
    {
        "name": "score_answer",
        "description": (
            "Score the candidate's most recent answer against the rubric. "
            "Call this after every substantive answer. Do not reveal scores to the candidate."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "The question that was asked",
                },
                "answer_summary": {
                    "type": "string",
                    "description": "A 1-2 sentence summary of what the candidate said",
                },
                "technical_accuracy": {
                    "type": "integer",
                    "minimum": 0,
                    "maximum": 10,
                    "description": "Technical accuracy score",
                },
                "communication_clarity": {
                    "type": "integer",
                    "minimum": 0,
                    "maximum": 10,
                    "description": "Communication clarity score",
                },
                "depth_of_reasoning": {
                    "type": "integer",
                    "minimum": 0,
                    "maximum": 10,
                    "description": "Depth of reasoning score",
                },
                "relevance": {
                    "type": "integer",
                    "minimum": 0,
                    "maximum": 10,
                    "description": "How well they answered the actual question",
                },
                "rubric_area": {
                    "type": "string",
                    "description": "Which rubric area this answer maps to",
                },
                "notable_observations": {
                    "type": "string",
                    "description": "Any standout positives or concerns",
                },
            },
            "required": [
                "question",
                "answer_summary",
                "technical_accuracy",
                "communication_clarity",
                "depth_of_reasoning",
                "relevance",
                "rubric_area",
            ],
        },
    },
    {
        "name": "flag_followup",
        "description": (
            "Flag the candidate's answer for follow-up or special handling. "
            "Use this when an answer warrants deeper probing or special attention."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "flag_type": {
                    "type": "string",
                    "enum": [
                        "vague_answer",
                        "interesting_depth",
                        "potential_red_flag",
                        "claimed_expertise",
                        "contradiction",
                        "needs_verification",
                    ],
                    "description": "The type of flag to apply",
                },
                "answer_excerpt": {
                    "type": "string",
                    "description": "The specific part of the answer being flagged",
                },
                "suggested_followup": {
                    "type": "string",
                    "description": "A suggested follow-up question or probe",
                },
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high"],
                    "description": "How important is it to follow up on this?",
                },
            },
            "required": ["flag_type", "answer_excerpt", "suggested_followup", "priority"],
        },
    },
]


class InterviewFunctionHandler:
    def __init__(self, state_machine: InterviewStateMachine, scoring_store):
        self.state_machine = state_machine
        self.scoring_store = scoring_store
        self.flags = []

    async def handle_function_call(
        self, function_name: str, arguments: dict[str, Any]
    ) -> dict[str, Any]:
        """Route function calls to appropriate handlers."""
        handlers = {
            "transition_section": self._handle_transition,
            "score_answer": self._handle_score,
            "flag_followup": self._handle_flag,
        }

        handler = handlers.get(function_name)
        if not handler:
            return {"error": f"Unknown function: {function_name}"}

        return await handler(arguments)

    async def _handle_transition(self, args: dict) -> dict:
        from_state = InterviewState(args["from_section"])
        to_state = InterviewState(args["to_section"])

        # Validate transition is allowed
        if from_state != self.state_machine.current_state:
            return {
                "status": "rejected",
                "reason": f"Current state is {self.state_machine.current_state.value}, not {from_state.value}",
            }

        self.state_machine.transition_to(to_state)

        # Inject new state context into the LLM
        new_config = STATE_CONFIGS[to_state]

        return {
            "status": "transitioned",
            "new_state": to_state.value,
            "new_instructions": new_config.entry_prompt,
            "time_budget_minutes": new_config.time_budget_minutes,
            "topics_covered": args.get("topics_covered", []),
        }

    async def _handle_score(self, args: dict) -> dict:
        score_record = {
            "timestamp": time.time(),
            "state": self.state_machine.current_state.value,
            "question": args["question"],
            "answer_summary": args["answer_summary"],
            "scores": {
                "technical_accuracy": args["technical_accuracy"],
                "communication_clarity": args["communication_clarity"],
                "depth_of_reasoning": args["depth_of_reasoning"],
                "relevance": args["relevance"],
            },
            "rubric_area": args["rubric_area"],
            "observations": args.get("notable_observations", ""),
        }

        # Compute composite score (weighted average)
        composite = (
            args["technical_accuracy"] * 0.35
            + args["communication_clarity"] * 0.20
            + args["depth_of_reasoning"] * 0.30
            + args["relevance"] * 0.15
        )
        score_record["composite_score"] = round(composite, 2)

        await self.scoring_store.save_score(score_record)

        return {
            "status": "scored",
            "composite": score_record["composite_score"],
            "recorded": True,
        }

    async def _handle_flag(self, args: dict) -> dict:
        flag = {
            "timestamp": time.time(),
            "state": self.state_machine.current_state.value,
            "flag_type": args["flag_type"],
            "excerpt": args["answer_excerpt"],
            "suggested_followup": args["suggested_followup"],
            "priority": args["priority"],
            "resolved": False,
        }

        self.flags.append(flag)

        # For high-priority flags, instruct the AI to follow up immediately
        if args["priority"] == "high":
            return {
                "status": "flagged",
                "action": "follow_up_now",
                "instruction": f"Ask this follow-up question now: {args['suggested_followup']}",
            }

        return {
            "status": "flagged",
            "action": "queue_for_later",
            "flag_id": len(self.flags) - 1,
        }

The flag_followup function is the one that really elevates the system. The four most useful flag types in practice:

vague_answer: “I’ve worked extensively with distributed systems” — no specifics. The suggested follow-up digs in: “Which distributed systems specifically, and what were you solving for?”
claimed_expertise: “I’m very strong with Kubernetes” — this gets queued as a follow-up probe area. If they claim expertise, they need to demonstrate it.
potential_red_flag: “We didn’t really have a deployment process, I just pushed to prod” — this gets flagged for behavioral probing later.
interesting_depth: Candidate mentioned something unexpectedly sophisticated. Flag it to come back and explore further.

Dynamic Question Generation

Static question banks have a fundamental problem: experienced candidates practice them. When a candidate gives you a textbook answer to a textbook question, you learn almost nothing.

Dynamic follow-up generation — where the AI generates questions based on what the candidate actually said — is one of the most valuable things you can do with an LLM-based interviewer. The key is giving the AI permission and enough context to go off-script intelligently:

DYNAMIC_FOLLOWUP_PROMPT = """
Based on the candidate's answer, generate a highly specific follow-up question.

CANDIDATE SAID: {candidate_answer}
QUESTION THEY WERE ANSWERING: {original_question}
TOPICS MENTIONED: {extracted_topics}

FOLLOW-UP RULES:
1. Use something specific they mentioned — reference their exact words
2. Go one level deeper than what they said
3. Don't ask what they should have said unprompted — probe what they did say
4. If they mentioned a specific technology, tool, or incident — dig into it
5. If they mentioned a number (scale, latency, users), probe the constraints behind it

GOOD FOLLOW-UPS:
- "You mentioned the cache miss rate was 40% — how did you determine that threshold was acceptable?"
- "Walk me through the Kubernetes incident you mentioned. What was the root cause?"
- "You said you chose Redis over Memcached — what were the specific factors in that decision?"

BAD FOLLOW-UPS:
- "Can you tell me more?" (too vague)
- "What are the advantages of X?" (textbook, not personalized)
- "Have you considered using Y instead?" (volunteering hints)

Generate ONE follow-up question.
"""

async def generate_dynamic_followup(
    client: AsyncOpenAI,
    candidate_answer: str,
    original_question: str,
    extracted_topics: list[str],
) -> str:
    """Generate a context-aware follow-up based on candidate's actual answer."""

    prompt = DYNAMIC_FOLLOWUP_PROMPT.format(
        candidate_answer=candidate_answer,
        original_question=original_question,
        extracted_topics=", ".join(extracted_topics),
    )

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150,
        temperature=0.7,
    )

    return response.choices[0].message.content.strip()

I extract topics from each candidate answer using a lightweight classification call before generating the follow-up. This keeps the follow-up tightly targeted — instead of asking a generic follow-up about distributed systems, it asks about the specific Kafka lag issue the candidate mentioned.

Rubric-Driven Interviewing Without Sounding Robotic

The tension in rubric-driven interviewing is that rubrics are rigid but good conversations aren’t. If you just give the AI a list of rubric items and say “cover all of these,” you get an interview that feels like a form being filled out.

The trick is separating the rubric from the question bank entirely:

RUBRIC_EXAMPLE = {
    "distributed_systems": {
        "weight": 0.30,
        "required": True,
        "areas": [
            "consistency vs. availability trade-offs",
            "failure modes and recovery",
            "data partitioning strategies",
            "consensus algorithms (at a high level)",
        ],
        "signal_questions": [
            "Design a distributed cache",
            "How would you handle network partitions in a payment system?",
        ],
        "red_flags": [
            "Doesn't know CAP theorem",
            "Can't explain eventual consistency",
            "Has never dealt with distributed failures in production",
        ],
    },
    "system_design": {
        "weight": 0.25,
        "required": True,
        "areas": [
            "capacity estimation",
            "database selection and schema design",
            "API design",
            "scalability bottlenecks",
        ],
    },
    # ... more rubric areas
}

Then I inject the rubric into the system prompt not as a checklist but as context that informs the AI’s judgment:

EVALUATION AREAS (use these to guide your questions, not as a script):

For each area below, you need sufficient signal to evaluate the candidate.
"Sufficient signal" means: you understand their depth of knowledge AND
have seen how they reason through problems in this area.

Areas to cover: {rubric_areas}

IMPORTANT: Don't ask all the signal questions directly. Use them as starting
points and let the conversation evolve. If a candidate naturally covers an area
well in their answer to a different question, that counts — you don't need to
ask about it explicitly.

After the interview, you will have covered the following areas: {required_areas}
If you haven't covered a required area with {n_minutes_remaining} minutes left,
pivot the conversation to address it.

The last paragraph is crucial. The AI won’t robotically go through every item, but it knows what gaps remain and will naturally steer the conversation to fill them as time pressure increases.

Coaching Mode: Real-Time Feedback Patterns

When I’m in Coach mode, the feedback loop needs to be tight — less than five seconds between when the candidate finishes speaking and when the coach responds with specific, actionable feedback.

Here’s the pattern I use for structuring coaching feedback:

COACHING_FEEDBACK_TEMPLATE = """
Analyze the candidate's practice answer and provide coaching feedback.

QUESTION ASKED: {question}
CANDIDATE ANSWER: {answer}
ANSWER DURATION: {duration_seconds} seconds

ANALYZE FOR:
1. FILLER WORDS — count instances of: um, uh, like, basically, you know, right?
   Threshold: >3 in a single answer = mention it

2. ANSWER LENGTH:
   - Under 45 seconds on a technical question: too short, needs expansion
   - Over 3 minutes on any question: too long, needs pruning
   - Sweet spot: 90-120 seconds for technical, 2-3 minutes for behavioral (STAR)

3. STRUCTURE:
   - Did they start with a clear direct answer, then explain?
   - Or did they bury the answer in context?
   - Did they acknowledge trade-offs?

4. SPECIFICITY:
   - Generic claims: "I'm experienced with X" — needs examples
   - Specific: "At [company], we had X problem and I solved it by..." — good
   - Numerical anchors: scales, percentages, timeframes — does the answer have them?

5. CONFIDENCE LANGUAGE:
   - Over-hedging: "I think maybe", "I'm not sure but" — coach to remove
   - Under-confidence: "I'm probably wrong but" — coach to replace with directness
   - Good: Stating positions clearly, then defending them

FEEDBACK FORMAT:
Start with one genuine positive (be specific, not generic).
Then give 1-2 concrete improvements with examples of how to say it differently.
End with the one thing that would most improve their next answer.

Do NOT give a numbered list — make it conversational, like a mentor talking.
"""

The feedback on filler words is particularly interesting in a voice context, because you have access to the raw audio duration and can count actual speech patterns. In practice, I run a lightweight speech pattern analyzer alongside the STT transcript:

import re
from dataclasses import dataclass

@dataclass
class SpeechPatternAnalysis:
    filler_count: int
    filler_words_found: list[str]
    answer_duration_seconds: float
    words_per_minute: int
    hedge_phrases: list[str]
    confidence_score: float  # 0-1

FILLER_PATTERNS = re.compile(
    r'\b(um+|uh+|er+|ah+|like|basically|you know|right\?|kind of|sort of)\b',
    re.IGNORECASE
)

HEDGE_PATTERNS = re.compile(
    r'\b(I think maybe|I\'m not sure|I could be wrong|probably|I guess|I suppose)\b',
    re.IGNORECASE
)

def analyze_speech_patterns(
    transcript: str,
    duration_seconds: float,
) -> SpeechPatternAnalysis:
    words = transcript.split()
    word_count = len(words)
    wpm = int((word_count / duration_seconds) * 60) if duration_seconds > 0 else 0

    fillers = FILLER_PATTERNS.findall(transcript)
    hedges = HEDGE_PATTERNS.findall(transcript)

    # Confidence score: penalize for fillers and hedges
    filler_penalty = min(len(fillers) * 0.05, 0.3)
    hedge_penalty = min(len(hedges) * 0.08, 0.3)
    confidence_score = max(0.0, 1.0 - filler_penalty - hedge_penalty)

    return SpeechPatternAnalysis(
        filler_count=len(fillers),
        filler_words_found=list(set(f.lower() for f in fillers)),
        answer_duration_seconds=duration_seconds,
        words_per_minute=wpm,
        hedge_phrases=list(set(h.lower() for h in hedges)),
        confidence_score=confidence_score,
    )

The coach gets this analysis object alongside the transcript, so it can give feedback like “I counted three ‘basically’ in that answer — try pausing instead” rather than just “watch your filler words.”

Guardrails: Preventing AI Failure Modes

The three failure modes I see most often in LLM-based interviewers:

1. The Sycophant Interviewer. Without explicit guardrails, the LLM defaults to being encouraging. This tanks your evaluation signal. Every answer gets “That’s a great point!” before the next question, the LLM avoids asking challenging follow-ups, and the interview feels more like a friendly chat than an assessment.

2. The Harsh Evaluator. The opposite problem — especially when you give the LLM explicit scoring criteria. It starts treating every question like a test with a right answer, penalizing alternative approaches, and making the candidate feel interrogated.

3. Off-Topic Drift. In a 60-minute conversation, LLMs will drift. The candidate mentions an interesting side project, the AI gets curious, and suddenly you’ve spent 15 minutes on something outside the rubric.

Here are the guardrails I’ve found most effective:

INTERVIEWER_GUARDRAILS = """
GUARDRAILS — CRITICAL RULES:

1. NEUTRALITY ENFORCEMENT
   Never use: "Great!", "Perfect!", "Excellent!", "Absolutely!", "Wonderful!"
   Never use: "That's wrong" or any dismissive language
   Allowed affirmations: "I see", "Okay", "Understood", "Got it", "Thanks"
   After an answer, either ask a follow-up OR move to the next question.
   Never editorialize about answer quality to the candidate.

2. HINT PROHIBITION
   If a candidate says "I'm not sure how to approach this":
   → Do NOT give hints
   → Say: "Take your time, walk me through your initial thinking"
   If a candidate asks what the right answer is:
   → Say: "I'd like to hear your perspective first"
   If a candidate asks if they're on the right track:
   → Say: "Keep going — tell me more about that approach"

3. SCOPE MANAGEMENT
   If conversation drifts more than 2 exchanges from the rubric area:
   → Redirect: "That's interesting context. Let's bring it back to [topic] —
     specifically, [next rubric question]."
   Off-limits topics: compensation, headcount, internal processes not in your briefing,
   other candidates, specific team members.

4. SCORING SECRECY
   Never reveal rubric areas, scoring criteria, or evaluation dimensions.
   If asked "How am I doing?": "The team will be in touch after reviewing everything."
   If asked "Was that the right answer?": "I want to make sure I understand your thinking
     fully — tell me more about [specific aspect]."

5. EDGE CASE HANDLING
   If candidate becomes upset or frustrated: "Let's take a breath — this is a difficult
     topic. What aspect would you like to focus on?"
   If candidate goes completely silent for >15 seconds: "Take your time — would it help
     to approach it from a different angle?"
   If candidate appears to be reading from notes: Note it, do not call it out directly.
     Ask a follow-up that requires synthesis rather than recall.
"""

The “no praise” rule is the one that requires the most discipline to enforce. LLMs are trained on human conversation data, and humans say “great question!” constantly. You have to be explicit that your AI interviewer is not doing that.

Session Context Management

A 60-minute conversation generates a lot of tokens. Left unmanaged, you’ll hit context limits mid-interview, or you’ll spend a fortune on context that isn’t contributing to the conversation quality.

My approach: rolling context compression. Every 10 minutes (or every N messages), I compress earlier conversation turns into a summary that gets appended to the system prompt, then drop the raw messages:

class InterviewContextManager:
    def __init__(self, max_messages_before_compression: int = 20):
        self.messages = []
        self.compressed_summaries = []
        self.max_messages = max_messages_before_compression
        self.total_tokens_used = 0

    def add_message(self, role: str, content: str, token_count: int = 0):
        self.messages.append({"role": role, "content": content})
        self.total_tokens_used += token_count

        if len(self.messages) > self.max_messages:
            self._compress_oldest_messages()

    def _compress_oldest_messages(self):
        """Compress the oldest half of messages into a summary."""
        messages_to_compress = self.messages[: self.max_messages // 2]
        self.messages = self.messages[self.max_messages // 2 :]

        summary = self._summarize_exchange(messages_to_compress)
        self.compressed_summaries.append(summary)

    def _summarize_exchange(self, messages: list[dict]) -> str:
        """
        Summarize a set of messages into a compact representation.
        In production, this is an LLM call. For illustration:
        """
        questions_and_answers = []

        for i in range(0, len(messages) - 1, 2):
            if messages[i]["role"] == "assistant" and i + 1 < len(messages):
                q = messages[i]["content"][:200]  # Truncate
                a = messages[i + 1]["content"][:300]  # Truncate
                questions_and_answers.append(f"Q: {q}\nA: {a}")

        return "EARLIER CONVERSATION SUMMARY:\n" + "\n\n".join(questions_and_answers)

    def get_context_for_llm(self) -> list[dict]:
        """
        Build the message list for the LLM, including compressed summaries.
        """
        context = []

        # Inject compressed summaries as a system message at the top
        if self.compressed_summaries:
            summary_text = "\n\n".join(self.compressed_summaries)
            context.append({
                "role": "system",
                "content": f"CONTEXT FROM EARLIER IN THIS INTERVIEW:\n{summary_text}",
            })

        # Add recent messages
        context.extend(self.messages)

        return context

    def get_topics_covered(self) -> list[str]:
        """Extract topics from compressed summaries for state machine."""
        # In production: run an extraction LLM call on the summaries
        # For now: simple keyword extraction
        all_text = " ".join(s for s in self.compressed_summaries)
        known_topics = [
            "distributed systems", "system design", "databases",
            "scalability", "security", "leadership", "conflict resolution"
        ]
        return [t for t in known_topics if t.lower() in all_text.lower()]

What to summarize vs. keep is a judgment call, but my rule of thumb: keep everything from the last 15 minutes verbatim (recent exchanges are most relevant for follow-ups), compress everything before that into a structured summary that preserves: topics covered, key claims the candidate made, flags raised, and scores assigned.

The key is that the compressed summary feeds the state machine’s “topics covered” check and the dynamic follow-up generator. The LLM doesn’t need the raw transcript to remember that the candidate claimed expertise in Kubernetes — it needs the structured flag that said “claimed_expertise: Kubernetes” 20 minutes ago.

Putting It Together

Here’s how all the pieces connect in a single interview session:

async def run_interview_session(
    candidate_id: str,
    role: str,
    rubric: dict,
    duration_minutes: int = 60,
):
    client = AsyncOpenAI()

    # Initialize all components
    state_machine = InterviewStateMachine(total_minutes=duration_minutes)
    context_manager = InterviewContextManager(max_messages_before_compression=20)
    scoring_store = ScoringStore(candidate_id=candidate_id)
    function_handler = InterviewFunctionHandler(state_machine, scoring_store)
    persona_manager = PersonaManager(session_id=candidate_id)

    # Build the initial system prompt
    system_prompt = persona_manager.persona_prompts["interviewer"].format(
        total_minutes=duration_minutes,
        current_section=state_machine.current_state.value,
        section_minutes=STATE_CONFIGS[state_machine.current_state].time_budget_minutes,
        rubric_areas_json=json.dumps(rubric, indent=2),
    )

    # Main conversation loop
    while state_machine.current_state != InterviewState.COMPLETE:
        # Get candidate audio, transcribe, add to context
        candidate_turn = await get_candidate_input()  # Your STT pipeline
        context_manager.add_message("user", candidate_turn)

        # Analyze speech patterns for coach mode (optional in interviewer mode)
        patterns = analyze_speech_patterns(candidate_turn, duration_seconds=10.0)

        # Check state transition
        transition = state_machine.check_transition()
        if transition:
            state_machine.transition_to(transition)
            # Inject new state context into the next LLM call

        # Build LLM request
        messages = [{"role": "system", "content": system_prompt}]
        messages.extend(context_manager.get_context_for_llm())

        # Call LLM with function calling enabled
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            functions=INTERVIEW_FUNCTIONS,
            function_call="auto",
            temperature=0.7,
        )

        # Handle function calls
        if response.choices[0].finish_reason == "function_call":
            fn_call = response.choices[0].message.function_call
            fn_result = await function_handler.handle_function_call(
                fn_call.name, json.loads(fn_call.arguments)
            )
            # Add function result to context and get final response
            context_manager.add_message("function", json.dumps(fn_result))
            response = await client.chat.completions.create(
                model="gpt-4o",
                messages=messages + [{"role": "function", "name": fn_call.name,
                                       "content": json.dumps(fn_result)}],
                temperature=0.7,
            )

        # Get AI response and synthesize to speech
        ai_response = response.choices[0].message.content
        context_manager.add_message("assistant", ai_response)
        await synthesize_and_stream(ai_response)  # Your TTS pipeline

    # Generate final assessment report
    return await scoring_store.generate_final_report()

This is the skeleton — your actual implementation will have more error handling, retry logic, and integration with your specific STT/TTS providers. But the architecture here — state machine + function calling + persona management + context compression — is what makes the system production-ready.

Where We Are

We now have an interview agent that knows who it is (Interviewer, Coach, or Evaluator), knows where it is in the conversation (state machine), scores and flags in real time (function calling), and manages its own memory across long sessions (context compression). That’s a genuinely useful system — not a demo, but something you could actually run real interviews with.

In Part 6, we make this agent genuinely knowledgeable. Right now it knows what’s in its system prompt and rubric — but what if you want it to answer questions about your company’s engineering culture, your specific tech stack, or the role requirements with real depth? That’s where RAG (Retrieval-Augmented Generation) comes in, and voice RAG has specific challenges that text-based RAG doesn’t. We’ll build a knowledge base that the agent can query in real time, with latency budgets tight enough to stay within the conversation flow.

This is Part 5 of a 12-part series: The Voice AI Interview Playbook.

Series outline:

Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
Multi-Role Agents — Interviewer, coach, and evaluator personas (this post)
Knowledge Base and RAG — Making your voice agent an expert (Part 6)
Web and Mobile Clients — Cross-platform voice experiences (Part 7)
Video Interview Integration — Multimodal analysis with Gemini Live (Part 8)
Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
Cost Optimization — From $0.14/min to $0.03/min (Part 11)
Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)

Export for reading

The Voice AI Interview Playbook: Multi-Role Agents — Interviewer, Coach, and Evaluator Personas (Part 5 of 12)

The Three Personas

Persona 1: The Interviewer

Persona 2: The Coach

Persona 3: The Evaluator

Persona Switching: When and How

The Interview Flow State Machine

Function Calling for Flow Control

Dynamic Question Generation

Rubric-Driven Interviewing Without Sounding Robotic

Coaching Mode: Real-Time Feedback Patterns

Guardrails: Preventing AI Failure Modes

Session Context Management

Putting It Together

Where We Are

Comments

On this page

The Voice AI Interview Playbook: Multi-Role Agents — Interviewer, Coach, and Evaluator Personas (Part 5 of 12)