In Part 7, we built the clients — web, mobile, and Flutter. Candidates can now connect to a voice interview from any device. The pipeline works. But for certain interview types, voice alone leaves signal on the table.
This post is about adding video analysis to your interviews — specifically using Gemini Live’s multimodal capabilities to process video frames alongside the voice conversation. I’ll be honest about where video actually helps and where it’s a distraction or worse. The goal is to build something useful, not surveillance.
When to Add Video
The first question isn’t “how” — it’s “should you, and for what?”
Where video analysis genuinely earns its cost:
Live coding interviews are the clearest use case. When a candidate is coding on screen, the AI can observe what they’re actually typing, catch that they’re Googling solutions versus reasoning through them, and ask follow-up questions that reference what it sees. “I noticed you started with a recursive approach — walk me through why you switched to iterative.” That’s a much more valuable interview than one that relies entirely on verbal description.
Engagement tracking during long technical explanations also provides real signal. Is the candidate reading from notes while claiming to explain from memory? Are they confused but not asking for clarification? These behavioral cues are genuinely useful to surface.
Where video analysis adds noise:
Emotional analysis — detecting nervousness, confidence, or deception from facial expressions — is unreliable and ethically fraught. The research on facial action coding systems shows poor accuracy across different demographics, lighting conditions, and individual expression styles. You’ll get false signals that disadvantage candidates who are neurodivergent, camera-shy, or interviewing from a bright window. Don’t do it.
Lie detection from video is even worse — it simply doesn’t work, has been debunked repeatedly in peer-reviewed research, and using it in hiring creates serious legal exposure in many jurisdictions. The same goes for “confidence scoring” that’s really just penalizing introverts.
The working principle: use video when you can make a specific, verifiable claim about what the analysis detects. “The candidate looked up three times while describing their architecture” is verifiable. “The candidate seems nervous” is not.
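One way to hold yourself to this principle is to enforce it in code: an allow-list of verifiable observation types, with everything else dropped before it reaches the agent. A minimal sketch (the type names are illustrative, chosen to match the categories used later in this post):

```python
# Illustrative type names; an allow-list means anything new or vague
# (emotion, confidence, deception) is dropped by default.
ALLOWED_OBSERVATION_TYPES = {
    "code_observation",  # "switched from a recursive to an iterative approach"
    "reference_check",   # "looked away from the screen three times"
    "screen_content",    # "opened a browser tab during the algorithm question"
}


def filter_observations(observations: list[dict]) -> list[dict]:
    """Keep only observations whose type is explicitly allow-listed."""
    return [o for o in observations if o.get("type") in ALLOWED_OBSERVATION_TYPES]
```

The allow-list direction matters: a deny-list of "emotion" and "deception" fails open when the model invents a new unverifiable category; an allow-list fails closed.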
Gemini Live Multimodal Architecture
Gemini 2.0 Flash’s Live API supports simultaneous audio and video input over a persistent WebSocket session. The key architectural insight is that you’re running two parallel channels:
- The voice conversation channel: real-time audio in, audio out, manages the interview flow
- The video analysis channel: frame samples in, structured observations out, provides context to the voice channel
These channels share state through a coordinator layer:
Candidate Browser
 ├── Audio Track ───────────────────────► Voice Pipeline (LiveKit + Agent)
 │                                                      │
 └── Video Track ──► Frame Sampler                      │
                          │                             │
                          ▼                             │
                 Gemini Live Session ◄──────────────────┘
                    (multimodal)
                          │
                          ▼
                  Observation Buffer
                          │
                          ▼
                Interview Agent Context
The interview agent has access to both the real-time conversation and the observation buffer from video analysis. When it’s appropriate, it can surface observations as interview questions or follow-ups.
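The coordinator layer itself can be very small. A sketch of the shared state between the two channels, assuming the observation dicts used throughout this post (class and method names here are illustrative, not from any SDK):

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class InterviewCoordinator:
    """Shared state between the voice and video channels.

    The video channel appends observations; the voice pipeline reads
    the most recent ones when building its next turn.
    """
    observations: list[dict] = field(default_factory=list)
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def record(self, observation: dict) -> None:
        async with self._lock:
            self.observations.append(observation)

    async def recent(self, n: int = 3) -> list[dict]:
        async with self._lock:
            return self.observations[-n:]
```

Both channels run on the same event loop, so a plain list plus an `asyncio.Lock` is enough; you only need heavier machinery if the channels live in separate processes.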
Setting Up Gemini Live Multimodal Session
Google’s google-genai Python SDK handles the multimodal session:
# pip install google-genai opencv-python numpy
import asyncio
import json
import os
import time

from google import genai
from google.genai import types


class GeminiLiveVideoAnalyzer:
    """
    Runs a persistent Gemini Live session that accepts video frames
    and returns structured observations about candidate behavior.
    """

    def __init__(self, session_id: str, job_role: str):
        self.session_id = session_id
        self.job_role = job_role
        self.client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
        self.session = None
        self.observation_buffer: list[dict] = []
        self._running = False

    async def start(self):
        """Initialize the Gemini Live multimodal session."""
        config = types.LiveConnectConfig(
            response_modalities=["TEXT"],  # We want text observations, not audio
            system_instruction=types.Content(
                parts=[types.Part(text=self._build_system_prompt())]
            ),
            generation_config=types.GenerationConfig(
                temperature=0.2,  # Low temperature for consistent observations
                max_output_tokens=512,
            ),
        )
        # The model is passed to connect(), not inside LiveConnectConfig
        async with self.client.aio.live.connect(
            model="models/gemini-2.0-flash-live-001", config=config
        ) as session:
            self.session = session
            self._running = True
            await self._run_session()

    def _build_system_prompt(self) -> str:
        return f"""You are a behavioral observation assistant for a {self.job_role} technical interview.

Your role is to observe video frames and provide brief, factual observations about:
1. Whether the candidate appears to be reading from notes or an external source
2. Visible code quality, structure, or patterns during live coding
3. Screen content during screen sharing (code editors, whiteboards, browsers)
4. Signs of technical engagement (typing actively, scrolling through code, drawing diagrams)

Do NOT make claims about:
- Emotional states, confidence levels, or personality traits
- Honesty or deception
- Physical appearance
- Personal characteristics

Format each observation as JSON:
{{"type": "observation_type", "detail": "specific factual observation", "confidence": 0.0-1.0}}

Only output observations when you see something genuinely noteworthy. Silence is fine."""

    async def _run_session(self):
        """Process incoming responses from Gemini."""
        try:
            async for message in self.session.receive():
                if message.text:
                    try:
                        observation = json.loads(message.text)
                        observation["timestamp"] = time.monotonic()
                        self.observation_buffer.append(observation)
                        # Keep buffer size manageable
                        if len(self.observation_buffer) > 50:
                            self.observation_buffer.pop(0)
                    except json.JSONDecodeError:
                        pass  # Non-JSON responses are fine, just skip
        except Exception as e:
            print(f"Gemini session error: {e}")

    async def send_frame(self, frame_bytes: bytes, mime_type: str = "image/jpeg"):
        """Send a single video frame to Gemini for analysis."""
        if not self.session or not self._running:
            return
        await self.session.send(
            input=types.LiveClientRealtimeInput(
                media_chunks=[
                    # Blob takes raw bytes; the SDK handles base64 encoding
                    types.Blob(data=frame_bytes, mime_type=mime_type)
                ]
            )
        )

    def get_recent_observations(self, last_n_seconds: float = 30.0) -> list[dict]:
        """Get observations from the last N seconds."""
        cutoff = time.monotonic() - last_n_seconds
        return [o for o in self.observation_buffer if o.get("timestamp", 0) > cutoff]

    async def stop(self):
        self._running = False
        if self.session:
            await self.session.close()
Frame Sampling Strategies
You absolutely do not want to send every video frame to Gemini. At 30fps, that’s 108,000 frames for a 60-minute interview — which would cost a small fortune and provide zero additional signal over sampling.
The right sampling rate depends on what you’re trying to observe:
# frame_sampler.py
import io
import time
from enum import Enum

import cv2
import numpy as np
from PIL import Image

# GeminiLiveVideoAnalyzer is the class defined in the previous section


class SamplingMode(Enum):
    GENERAL = "general"          # 0.5 fps — conversation observation
    ENGAGEMENT = "engagement"    # 1 fps — tracking attention and note-reading
    CODE_REVIEW = "code_review"  # 2 fps — live coding, fast iteration


class FrameSampler:
    """
    Intelligently samples video frames based on interview phase.
    Designed to minimize API cost while capturing meaningful moments.
    """

    def __init__(self, analyzer: GeminiLiveVideoAnalyzer):
        self.analyzer = analyzer
        self.mode = SamplingMode.GENERAL
        self.last_sent_time = 0.0

    @property
    def interval_seconds(self) -> float:
        return {
            SamplingMode.GENERAL: 2.0,      # 1 frame per 2 seconds
            SamplingMode.ENGAGEMENT: 1.0,   # 1 frame per second
            SamplingMode.CODE_REVIEW: 0.5,  # 2 frames per second
        }[self.mode]

    def set_mode(self, mode: SamplingMode):
        self.mode = mode

    async def process_frame(self, raw_frame: np.ndarray):
        """
        Receive a raw video frame (numpy array from WebRTC).
        Decide whether to sample it and send to Gemini.
        """
        now = time.monotonic()
        if now - self.last_sent_time < self.interval_seconds:
            return  # Not time yet
        # Resize to reduce payload size
        resized = self._resize_frame(raw_frame, max_dimension=720)
        # Compress to JPEG
        frame_bytes = self._encode_frame(resized, quality=75)
        self.last_sent_time = now
        await self.analyzer.send_frame(frame_bytes)

    def _resize_frame(self, frame: np.ndarray, max_dimension: int) -> np.ndarray:
        h, w = frame.shape[:2]
        if max(h, w) <= max_dimension:
            return frame
        scale = max_dimension / max(h, w)
        new_w, new_h = int(w * scale), int(h * scale)
        return cv2.resize(frame, (new_w, new_h), interpolation=cv2.INTER_AREA)

    def _encode_frame(self, frame: np.ndarray, quality: int = 75) -> bytes:
        # Convert BGR (OpenCV default) to RGB
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        img = Image.fromarray(rgb)
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality, optimize=True)
        return buf.getvalue()
Switching Modes During the Interview
The interview agent can signal when to change sampling modes based on what’s happening:
# In your LiveKit agent
async def on_interview_phase_change(self, phase: str):
    """Called when the interview transitions between phases."""
    mode_map = {
        "introduction": SamplingMode.GENERAL,
        "technical_screening": SamplingMode.ENGAGEMENT,
        "live_coding": SamplingMode.CODE_REVIEW,
        "system_design": SamplingMode.ENGAGEMENT,
        "behavioral": SamplingMode.GENERAL,
        "candidate_questions": SamplingMode.GENERAL,
    }
    new_mode = mode_map.get(phase, SamplingMode.GENERAL)
    self.frame_sampler.set_mode(new_mode)
    print(f"Frame sampling mode: {new_mode.value} for phase: {phase}")
Integrating Video Observations into the Voice Agent
The video observations need to feed into the voice agent’s context without disrupting the natural conversation flow. The trick is treating observations as optional context, not mandatory input:
# In your voice agent's context builder
from typing import Optional


class InterviewContextBuilder:
    def __init__(self, video_analyzer: Optional[GeminiLiveVideoAnalyzer] = None):
        self.video_analyzer = video_analyzer

    def build_context_injection(self) -> str:
        """
        Build a context string to inject into the agent's next turn.
        Returns empty string if no video or no notable observations.
        """
        if not self.video_analyzer:
            return ""
        recent_observations = self.video_analyzer.get_recent_observations(
            last_n_seconds=60.0
        )
        if not recent_observations:
            return ""
        # Filter to only high-confidence, actionable observations
        actionable = [
            o for o in recent_observations
            if o.get("confidence", 0) >= 0.7
            and o.get("type") in ("code_observation", "reference_check", "screen_content")
        ]
        if not actionable:
            return ""
        obs_text = "\n".join(
            f"- [{o['type']}] {o['detail']}"
            for o in actionable[-3:]  # Last 3 relevant observations max
        )
        return (
            "\n[Video context — for your awareness only, use naturally if relevant]:\n"
            f"{obs_text}\n"
        )
This approach lets the agent be aware of video observations without forcing awkward references. If the agent sees “candidate opened browser during algorithm question” in its context, it can naturally ask “I noticed you paused there — what were you looking up?” rather than robotically reciting observations.
Screen Sharing for Live Coding Interviews
Screen sharing is the highest-value video use case. When a candidate shares their screen, you can see their actual code, their IDE layout, their terminal output, and their debugging process.
Web Client Screen Sharing
// src/components/ScreenShare.tsx
import { useLocalParticipant } from "@livekit/components-react";
import { Track } from "livekit-client";
import { useState } from "react";

export function ScreenShareButton() {
  const { localParticipant } = useLocalParticipant();
  const [isSharing, setIsSharing] = useState(false);

  async function toggleScreenShare() {
    if (!localParticipant) return;
    if (isSharing) {
      await localParticipant.setScreenShareEnabled(false);
      setIsSharing(false);
    } else {
      try {
        await localParticipant.setScreenShareEnabled(true, {
          audio: false,
          // Prefer high resolution for code readability
          resolution: {
            width: 1920,
            height: 1080,
            frameRate: 5, // Low fps is fine for code, saves bandwidth
          },
        });
        setIsSharing(true);
      } catch (err) {
        // User cancelled screen picker or denied permission
        console.log("Screen share cancelled:", err);
      }
    }
  }

  return (
    <button
      onClick={toggleScreenShare}
      className={`screen-share-btn ${isSharing ? "active" : ""}`}
      aria-label={isSharing ? "Stop screen sharing" : "Share your screen"}
    >
      {isSharing ? "Stop Sharing" : "Share Screen"}
    </button>
  );
}
Processing Screen Share Frames
Screen share frames need different handling than webcam frames. Code is text-heavy, so higher JPEG quality is needed:
class ScreenShareSampler(FrameSampler):
    """
    Specialized sampler for screen share tracks.
    Uses higher quality encoding since code legibility matters.
    """

    def _encode_frame(self, frame: np.ndarray, quality: int = 85) -> bytes:
        # Screen share uses higher quality — text needs to be readable
        return super()._encode_frame(frame, quality=quality)

    async def process_frame(self, raw_frame: np.ndarray):
        # For screen share, we also detect if the screen changed significantly
        # to avoid sending identical frames
        if hasattr(self, "_last_frame") and self._frames_are_similar(raw_frame):
            return
        self._last_frame = raw_frame.copy()
        await super().process_frame(raw_frame)

    def _frames_are_similar(self, frame: np.ndarray, threshold: float = 0.98) -> bool:
        """Check if frame is similar to last frame using normalized cross-correlation."""
        if not hasattr(self, "_last_frame"):
            return False
        # Downscale both frames for fast comparison
        small_current = cv2.resize(frame, (64, 36))
        small_last = cv2.resize(self._last_frame, (64, 36))
        correlation = np.corrcoef(small_current.flatten(), small_last.flatten())[0, 1]
        return correlation > threshold
Bandwidth Management
Video adds significant bandwidth. Here are the numbers you need to plan around:
| Mode | Video Track | Frame Sampling | Extra Monthly Cost (1000 interviews) |
|---|---|---|---|
| Voice only | None | None | $0 |
| Webcam (general) | 720p/30fps | 0.5 fps to Gemini | ~$120 |
| Webcam (engagement) | 720p/30fps | 1 fps to Gemini | ~$240 |
| Screen share (code) | 1080p/5fps | 2 fps to Gemini | ~$400 |
For bandwidth between candidate and server, screen sharing at 1080p/5fps uses roughly 1-2 Mbps — acceptable on any broadband connection. The Gemini API frames are compressed down to 30-80KB each, so the API upload bandwidth is negligible.
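You can sanity-check the per-frame payload figure yourself by encoding a synthetic 720p frame. The sketch below uses a smooth gradient as a stand-in for typical webcam content; real frames compress somewhat differently depending on detail and sensor noise:

```python
import io

import numpy as np
from PIL import Image


def jpeg_payload_bytes(width: int = 1280, height: int = 720, quality: int = 75) -> int:
    """Encode a synthetic frame and return the JPEG payload size in bytes."""
    x = np.linspace(0, 255, width)
    y = np.linspace(0, 255, height)
    frame = np.stack(
        [
            np.tile(x, (height, 1)),          # red ramps left to right
            np.tile(y[:, None], (1, width)),  # green ramps top to bottom
            np.full((height, width), 128.0),  # constant blue channel
        ],
        axis=-1,
    ).astype(np.uint8)
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=quality, optimize=True)
    return buf.tell()


print(jpeg_payload_bytes())  # tiny compared to the ~2.7 MB raw frame
```

Whatever number you see, the point holds: a sampled, compressed frame is orders of magnitude smaller than the raw video the WebRTC track carries.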
# Cost estimation utility
def estimate_video_cost(
    interview_minutes: float,
    frames_per_second: float,
    include_screen_share: bool = False,
) -> dict:
    """
    Estimate Gemini API costs for video analysis.
    The rates below are effective blended rates that reproduce the
    per-interview breakdown discussed in this post; check current
    Gemini 2.0 Flash Live pricing before relying on them.
    """
    total_frames = interview_minutes * 60 * frames_per_second
    # A 720p JPEG frame ≈ 500-800 tokens
    tokens_per_frame = 800 if include_screen_share else 500
    total_tokens = total_frames * tokens_per_frame
    # Plus audio tokens for the voice channel: ~32 tokens per second of audio
    audio_tokens = interview_minutes * 60 * 32
    # Effective per-1M-token rates (audio input is priced higher than image tokens)
    VIDEO_RATE_PER_1M = 0.07
    AUDIO_RATE_PER_1M = 0.80
    input_cost = (
        total_tokens / 1_000_000 * VIDEO_RATE_PER_1M
        + audio_tokens / 1_000_000 * AUDIO_RATE_PER_1M
    )
    # Output tokens (observations) — minimal
    output_cost = 0.01  # ~$0.01 flat per interview for observations
    return {
        "video_tokens": int(total_tokens),
        "audio_tokens": int(audio_tokens),
        "input_cost_usd": round(input_cost, 4),
        "output_cost_usd": round(output_cost, 4),
        "total_cost_usd": round(input_cost + output_cost, 4),
    }


# Example: 45-minute coding interview with screen share at 2fps
cost = estimate_video_cost(45, 2.0, include_screen_share=True)
# → total_cost_usd: ~$0.38
This tracks with the $0.40/interview target mentioned in the post description. The breakdown for a 45-minute session: roughly $0.30 in video input tokens, $0.07 in audio tokens, $0.01 in output tokens.
Privacy Considerations
Video changes the privacy calculus significantly. Here’s what you need to think through before shipping:
Consent and Disclosure
Informed consent before video analysis is non-negotiable — both ethically and legally in most jurisdictions. The consent must be specific: “This interview will record video of you and analyze your screen activity.” Burying video analysis in general terms of service is not sufficient.
// src/components/VideoConsentModal.tsx
interface VideoConsentProps {
  onAccept: () => void;
  onDecline: () => void;
}

export function VideoConsentModal({ onAccept, onDecline }: VideoConsentProps) {
  return (
    <div className="consent-modal">
      <h2>Video Analysis for This Interview</h2>
      <p>
        This coding interview uses video analysis to help your interviewer
        understand your problem-solving process. Specifically:
      </p>
      <ul>
        <li>Your screen will be analyzed when you share it</li>
        <li>The AI may reference what it observes during the interview</li>
        <li>Video data is processed in real-time and not stored after the session</li>
      </ul>
      <p>
        Video is not used to analyze your facial expressions, emotions, or
        physical appearance.
      </p>
      <div className="consent-actions">
        <button onClick={onAccept} className="btn-primary">
          I understand and agree
        </button>
        <button onClick={onDecline} className="btn-secondary">
          Proceed with voice only
        </button>
      </div>
    </div>
  );
}
Critically — and this is the “btn-secondary” in the code above — candidates must be able to decline video analysis and still complete the interview. Making video mandatory creates accessibility issues and signals that you’re prioritizing surveillance over candidate experience.
Analyze-and-Discard vs Store
For most use cases, analyze-and-discard is the right default. The Gemini Live session processes frames in real-time; you log the structured observations (text), not the video frames. This significantly reduces your GDPR/privacy surface area.
If you do need to store video (for asynchronous review by human interviewers, for example), that triggers a different compliance posture: explicit retention policies, access controls, deletion workflows, and likely a Data Processing Agreement with your cloud provider. Part 9 covers this in depth.
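If you do choose storage, make the retention policy explicit in code rather than implicit in infrastructure, so it can be reviewed and tested like anything else. A hedged sketch; the field names are hypothetical, not taken from any compliance framework:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRetentionPolicy:
    """An explicit, reviewable retention policy for interview video."""
    store_raw_video: bool = False         # analyze-and-discard by default
    retention_days: int = 0               # 0 means nothing is kept
    reviewer_roles: tuple[str, ...] = ()  # who may access stored recordings
    deletion_on_request: bool = True      # honor erasure requests


# The default posture for most deployments
ANALYZE_AND_DISCARD = VideoRetentionPolicy()

# A stored-video posture for asynchronous human review
HUMAN_REVIEW = VideoRetentionPolicy(
    store_raw_video=True,
    retention_days=30,
    reviewer_roles=("hiring_manager", "interview_panel"),
)
```

Encoding the policy as a frozen dataclass means the storage layer can assert against it at write time, instead of relying on a bucket lifecycle rule nobody re-reads.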
# The analyze-and-discard pattern
import cv2
import numpy as np


class PrivacyFirstVideoProcessor:
    """
    Processes video frames through Gemini without persisting raw video.
    Only structured observations are retained.
    """

    def __init__(self, analyzer: GeminiLiveVideoAnalyzer):
        self.analyzer = analyzer
        # Frames are never stored — they go directly to Gemini and are discarded

    async def process_and_discard(self, frame: np.ndarray):
        """Send frame to Gemini; discard raw frame immediately."""
        await self.analyzer.send_frame(self._encode(frame))
        # frame goes out of scope here — no reference retained

    @staticmethod
    def _encode(frame: np.ndarray) -> bytes:
        ok, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 75])
        return jpeg.tobytes()
Putting It Together: The Complete Video Interview Flow
# interview_agent_with_video.py
import asyncio
import json
from typing import Optional

import cv2
import numpy as np
from livekit import rtc
from livekit.agents import JobContext, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, openai, silero


class VideoEnhancedInterviewAgent:
    def __init__(self, ctx: JobContext):
        self.ctx = ctx
        # Job metadata arrives as a JSON string; parse it once
        self.metadata = json.loads(ctx.job.metadata or "{}")
        self.session_id = self.metadata.get("session_id")
        self.job_role = self.metadata.get("job_role")
        self.video_analyzer: Optional[GeminiLiveVideoAnalyzer] = None
        self.frame_sampler: Optional[FrameSampler] = None
        self.context_builder: Optional[InterviewContextBuilder] = None

    async def run(self):
        room = self.ctx.room

        # Initialize video analysis if enabled
        if self._video_enabled():
            self.video_analyzer = GeminiLiveVideoAnalyzer(
                self.session_id, self.job_role
            )
            self.frame_sampler = FrameSampler(self.video_analyzer)
            self.context_builder = InterviewContextBuilder(self.video_analyzer)
            # Start Gemini session in background
            asyncio.create_task(self.video_analyzer.start())

        # Set up track subscriptions for video frames
        @room.on("track_subscribed")
        def on_track_subscribed(track, publication, participant):
            if track.kind == rtc.TrackKind.KIND_VIDEO:
                asyncio.create_task(self._process_video_track(track))

        # Build voice assistant with video context injection
        assistant = VoiceAssistant(
            vad=silero.VAD.load(),
            stt=deepgram.STT(),
            llm=openai.LLM(model="gpt-4o"),
            tts=openai.TTS(voice="nova"),
            before_llm_cb=self._inject_video_context,
        )
        assistant.start(room)

    async def _process_video_track(self, track: rtc.VideoTrack):
        """Pull frames from the video track and sample them."""
        video_stream = rtc.VideoStream(track)
        async for event in video_stream:
            if self.frame_sampler:
                # Convert the LiveKit frame to RGBA, then to the BGR
                # layout OpenCV expects
                rgba = event.frame.convert(rtc.VideoBufferType.RGBA)
                numpy_frame = np.frombuffer(rgba.data, dtype=np.uint8).reshape(
                    rgba.height, rgba.width, 4
                )
                bgr_frame = cv2.cvtColor(numpy_frame, cv2.COLOR_RGBA2BGR)
                await self.frame_sampler.process_frame(bgr_frame)

    async def _inject_video_context(self, assistant, chat_context):
        """Inject video observations into the LLM context before each turn."""
        if not self.context_builder:
            return
        injection = self.context_builder.build_context_injection()
        if injection:
            # Add as a system message that the LLM can reference
            chat_context.messages.append(
                llm.ChatMessage.create(text=injection, role="system")
            )

    def _video_enabled(self) -> bool:
        return self.metadata.get("video_enabled", False)
What We Built
Adding video to a voice interview is not a trivial feature — it’s a deliberate design decision with real cost and privacy implications. What we built:
- A Gemini Live multimodal session that accepts video frames and returns structured observations
- Frame sampling at three modes: 0.5fps for general observation, 1fps for engagement tracking, 2fps for live coding
- Screen sharing integration with change detection to avoid sending identical frames
- A context injection layer that feeds video observations to the voice agent without disrupting conversation flow
- Consent and privacy patterns: analyze-and-discard, opt-out for candidates, no emotional analysis
- Cost analysis showing $0.38-0.40 per 45-minute session with video enabled
The most important constraint to hold onto: every video feature must have a specific, defensible use case. If you can’t explain precisely what signal you’re capturing and why it improves hiring decisions, cut it.
In Part 9, we tackle the operational and legal side of what we’ve built. Recording interviews is easy. Doing it in a way that’s GDPR-compliant, HIPAA-aware, and defensible under employment law is a different challenge entirely.
This is Part 8 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (this post)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
- Cost Optimization — From $0.14/min to $0.03/min (Part 11)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)