In Part 7, we built the clients — web, mobile, and Flutter. Candidates can now connect to a voice interview from any device. The pipeline works. But for certain interview types, voice alone leaves signal on the table.
This post is about adding video analysis to your interviews — specifically using Gemini Live’s multimodal capabilities to process video frames alongside the voice conversation. I’ll be honest about where video actually helps and where it’s a distraction or worse. The goal is to build something useful, not surveillance.
When to Add Video
The first question isn’t “how” — it’s “should you, and for what?”
Where video analysis genuinely earns its cost:
Live coding interviews are the clearest use case. When a candidate is coding on screen, the AI can observe what they’re actually typing, catch that they’re Googling solutions versus reasoning through them, and ask follow-up questions that reference what it sees. “I noticed you started with a recursive approach — walk me through why you switched to iterative.” That’s a much more valuable interview than one that relies entirely on verbal description.
Engagement tracking during long technical explanations also provides real signal. Is the candidate reading from notes while claiming to explain from memory? Are they confused but not asking for clarification? These behavioral cues are genuinely useful to surface.
Where video analysis adds noise:
Emotional analysis — detecting nervousness, confidence, or deception from facial expressions — is unreliable and ethically fraught. The research on facial action coding systems shows poor accuracy across different demographics, lighting conditions, and individual expression styles. You’ll get false signals that disadvantage candidates who are neurodivergent, camera-shy, or interviewing from a bright window. Don’t do it.
Lie detection from video is even worse — it simply doesn’t work, has been debunked repeatedly in peer-reviewed research, and using it in hiring creates serious legal exposure in many jurisdictions. The same goes for “confidence scoring” that’s really just penalizing introverts.
The working principle: use video when you can make a specific, verifiable claim about what the analysis detects. “The candidate looked up three times while describing their architecture” is verifiable. “The candidate seems nervous” is not.
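One way to hold yourself to this principle is to enforce it in code: an allow-list of verifiable observation types, with everything else dropped before it reaches the agent. A minimal sketch (the type names are illustrative, chosen to match the categories used later in this post):

```python
# Illustrative type names; an allow-list means anything new or vague
# (emotion, confidence, deception) is dropped by default.
ALLOWED_OBSERVATION_TYPES = {
    "code_observation",  # "switched from a recursive to an iterative approach"
    "reference_check",   # "looked away from the screen three times"
    "screen_content",    # "opened a browser tab during the algorithm question"
}


def filter_observations(observations: list[dict]) -> list[dict]:
    """Keep only observations whose type is explicitly allow-listed."""
    return [o for o in observations if o.get("type") in ALLOWED_OBSERVATION_TYPES]
```

The allow-list direction matters: a deny-list of "emotion" and "deception" fails open when the model invents a new unverifiable category; an allow-list fails closed.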
Gemini Live Multimodal Architecture
Gemini 2.0 Flash’s Live API supports simultaneous audio and video input over a persistent WebSocket session. The key architectural insight is that you’re running two parallel channels:
- The voice conversation channel: real-time audio in, audio out, manages the interview flow
- The video analysis channel: frame samples in, structured observations out, provides context to the voice channel
These channels share state through a coordinator layer:
Candidate Browser
 ├── Audio Track ───────────────────────► Voice Pipeline (LiveKit + Agent)
 │                                                      │
 └── Video Track ──► Frame Sampler                      │
                          │                             │
                          ▼                             │
                 Gemini Live Session ◄──────────────────┘
                    (multimodal)
                          │
                          ▼
                  Observation Buffer
                          │
                          ▼
                Interview Agent Context
The interview agent has access to both the real-time conversation and the observation buffer from video analysis. When it’s appropriate, it can surface observations as interview questions or follow-ups.
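The coordinator layer itself can be very small. A sketch of the shared state between the two channels, assuming the observation dicts used throughout this post (class and method names here are illustrative, not from any SDK):

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class InterviewCoordinator:
    """Shared state between the voice and video channels.

    The video channel appends observations; the voice pipeline reads
    the most recent ones when building its next turn.
    """
    observations: list[dict] = field(default_factory=list)
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def record(self, observation: dict) -> None:
        async with self._lock:
            self.observations.append(observation)

    async def recent(self, n: int = 3) -> list[dict]:
        async with self._lock:
            return self.observations[-n:]
```

Both channels run on the same event loop, so a plain list plus an `asyncio.Lock` is enough; you only need heavier machinery if the channels live in separate processes.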
Setting Up Gemini Live Multimodal Session
Google’s google-genai Python SDK handles the multimodal session:
# pip install google-genai opencv-python numpy
import asyncio
import json
import os
import time

from google import genai
from google.genai import types


class GeminiLiveVideoAnalyzer:
    """
    Runs a persistent Gemini Live session that accepts video frames
    and returns structured observations about candidate behavior.
    """

    def __init__(self, session_id: str, job_role: str):
        self.session_id = session_id
        self.job_role = job_role
        self.client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
        self.session = None
        self.observation_buffer: list[dict] = []
        self._running = False

    async def start(self):
        """Initialize the Gemini Live multimodal session."""
        config = types.LiveConnectConfig(
            response_modalities=["TEXT"],  # We want text observations, not audio
            system_instruction=types.Content(
                parts=[types.Part(text=self._build_system_prompt())]
            ),
            generation_config=types.GenerationConfig(
                temperature=0.2,  # Low temperature for consistent observations
                max_output_tokens=512,
            ),
        )
        # The model is passed to connect(), not inside LiveConnectConfig
        async with self.client.aio.live.connect(
            model="models/gemini-2.0-flash-live-001", config=config
        ) as session:
            self.session = session
            self._running = True
            await self._run_session()

    def _build_system_prompt(self) -> str:
        return f"""You are a behavioral observation assistant for a {self.job_role} technical interview.

Your role is to observe video frames and provide brief, factual observations about:
1. Whether the candidate appears to be reading from notes or an external source
2. Visible code quality, structure, or patterns during live coding
3. Screen content during screen sharing (code editors, whiteboards, browsers)
4. Signs of technical engagement (typing actively, scrolling through code, drawing diagrams)

Do NOT make claims about:
- Emotional states, confidence levels, or personality traits
- Honesty or deception
- Physical appearance
- Personal characteristics

Format each observation as JSON:
{{"type": "observation_type", "detail": "specific factual observation", "confidence": 0.0-1.0}}

Only output observations when you see something genuinely noteworthy. Silence is fine."""

    async def _run_session(self):
        """Process incoming responses from Gemini."""
        try:
            async for message in self.session.receive():
                if message.text:
                    try:
                        observation = json.loads(message.text)
                        observation["timestamp"] = time.monotonic()
                        self.observation_buffer.append(observation)
                        # Keep buffer size manageable
                        if len(self.observation_buffer) > 50:
                            self.observation_buffer.pop(0)
                    except json.JSONDecodeError:
                        pass  # Non-JSON responses are fine, just skip
        except Exception as e:
            print(f"Gemini session error: {e}")

    async def send_frame(self, frame_bytes: bytes, mime_type: str = "image/jpeg"):
        """Send a single video frame to Gemini for analysis."""
        if not self.session or not self._running:
            return
        await self.session.send(
            input=types.LiveClientRealtimeInput(
                media_chunks=[
                    # Blob takes raw bytes; the SDK handles base64 encoding
                    types.Blob(data=frame_bytes, mime_type=mime_type)
                ]
            )
        )

    def get_recent_observations(self, last_n_seconds: float = 30.0) -> list[dict]:
        """Get observations from the last N seconds."""
        cutoff = time.monotonic() - last_n_seconds
        return [o for o in self.observation_buffer if o.get("timestamp", 0) > cutoff]

    async def stop(self):
        self._running = False
        if self.session:
            await self.session.close()
Frame Sampling Strategies
You absolutely do not want to send every video frame to Gemini. At 30fps, that’s 108,000 frames for a 60-minute interview — which would cost a small fortune and provide zero additional signal over sampling.
The right sampling rate depends on what you’re trying to observe:
# frame_sampler.py
import io
import time
from enum import Enum

import cv2
import numpy as np
from PIL import Image

# GeminiLiveVideoAnalyzer is the class defined in the previous section


class SamplingMode(Enum):
    GENERAL = "general"          # 0.5 fps — conversation observation
    ENGAGEMENT = "engagement"    # 1 fps — tracking attention and note-reading
    CODE_REVIEW = "code_review"  # 2 fps — live coding, fast iteration


class FrameSampler:
    """
    Intelligently samples video frames based on interview phase.
    Designed to minimize API cost while capturing meaningful moments.
    """

    def __init__(self, analyzer: GeminiLiveVideoAnalyzer):
        self.analyzer = analyzer
        self.mode = SamplingMode.GENERAL
        self.last_sent_time = 0.0

    @property
    def interval_seconds(self) -> float:
        return {
            SamplingMode.GENERAL: 2.0,      # 1 frame per 2 seconds
            SamplingMode.ENGAGEMENT: 1.0,   # 1 frame per second
            SamplingMode.CODE_REVIEW: 0.5,  # 2 frames per second
        }[self.mode]

    def set_mode(self, mode: SamplingMode):
        self.mode = mode

    async def process_frame(self, raw_frame: np.ndarray):
        """
        Receive a raw video frame (numpy array from WebRTC).
        Decide whether to sample it and send to Gemini.
        """
        now = time.monotonic()
        if now - self.last_sent_time < self.interval_seconds:
            return  # Not time yet
        # Resize to reduce payload size
        resized = self._resize_frame(raw_frame, max_dimension=720)
        # Compress to JPEG
        frame_bytes = self._encode_frame(resized, quality=75)
        self.last_sent_time = now
        await self.analyzer.send_frame(frame_bytes)

    def _resize_frame(self, frame: np.ndarray, max_dimension: int) -> np.ndarray:
        h, w = frame.shape[:2]
        if max(h, w) <= max_dimension:
            return frame
        scale = max_dimension / max(h, w)
        new_w, new_h = int(w * scale), int(h * scale)
        return cv2.resize(frame, (new_w, new_h), interpolation=cv2.INTER_AREA)

    def _encode_frame(self, frame: np.ndarray, quality: int = 75) -> bytes:
        # Convert BGR (OpenCV default) to RGB
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        img = Image.fromarray(rgb)
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality, optimize=True)
        return buf.getvalue()
Switching Modes During the Interview
The interview agent can signal when to change sampling modes based on what’s happening:
# In your LiveKit agent
async def on_interview_phase_change(self, phase: str):
    """Called when the interview transitions between phases."""
    mode_map = {
        "introduction": SamplingMode.GENERAL,
        "technical_screening": SamplingMode.ENGAGEMENT,
        "live_coding": SamplingMode.CODE_REVIEW,
        "system_design": SamplingMode.ENGAGEMENT,
        "behavioral": SamplingMode.GENERAL,
        "candidate_questions": SamplingMode.GENERAL,
    }
    new_mode = mode_map.get(phase, SamplingMode.GENERAL)
    self.frame_sampler.set_mode(new_mode)
    print(f"Frame sampling mode: {new_mode.value} for phase: {phase}")
Integrating Video Observations into the Voice Agent
The video observations need to feed into the voice agent’s context without disrupting the natural conversation flow. The trick is treating observations as optional context, not mandatory input:
# In your voice agent's context builder
from typing import Optional


class InterviewContextBuilder:
    def __init__(self, video_analyzer: Optional[GeminiLiveVideoAnalyzer] = None):
        self.video_analyzer = video_analyzer

    def build_context_injection(self) -> str:
        """
        Build a context string to inject into the agent's next turn.
        Returns empty string if no video or no notable observations.
        """
        if not self.video_analyzer:
            return ""
        recent_observations = self.video_analyzer.get_recent_observations(
            last_n_seconds=60.0
        )
        if not recent_observations:
            return ""
        # Filter to only high-confidence, actionable observations
        actionable = [
            o for o in recent_observations
            if o.get("confidence", 0) >= 0.7
            and o.get("type") in ("code_observation", "reference_check", "screen_content")
        ]
        if not actionable:
            return ""
        obs_text = "\n".join(
            f"- [{o['type']}] {o['detail']}"
            for o in actionable[-3:]  # Last 3 relevant observations max
        )
        return (
            "\n[Video context — for your awareness only, use naturally if relevant]:\n"
            f"{obs_text}\n"
        )
This approach lets the agent be aware of video observations without forcing awkward references. If the agent sees “candidate opened browser during algorithm question” in its context, it can naturally ask “I noticed you paused there — what were you looking up?” rather than robotically reciting observations.
Screen Sharing for Live Coding Interviews
Screen sharing is the highest-value video use case. When a candidate shares their screen, you can see their actual code, their IDE layout, their terminal output, and their debugging process.
Web Client Screen Sharing
// src/components/ScreenShare.tsx
import { useLocalParticipant } from "@livekit/components-react";
import { Track } from "livekit-client";
import { useState } from "react";

export function ScreenShareButton() {
  const { localParticipant } = useLocalParticipant();
  const [isSharing, setIsSharing] = useState(false);

  async function toggleScreenShare() {
    if (!localParticipant) return;
    if (isSharing) {
      await localParticipant.setScreenShareEnabled(false);
      setIsSharing(false);
    } else {
      try {
        await localParticipant.setScreenShareEnabled(true, {
          audio: false,
          // Prefer high resolution for code readability
          resolution: {
            width: 1920,
            height: 1080,
            frameRate: 5, // Low fps is fine for code, saves bandwidth
          },
        });
        setIsSharing(true);
      } catch (err) {
        // User cancelled screen picker or denied permission
        console.log("Screen share cancelled:", err);
      }
    }
  }

  return (
    <button
      onClick={toggleScreenShare}
      className={`screen-share-btn ${isSharing ? "active" : ""}`}
      aria-label={isSharing ? "Stop screen sharing" : "Share your screen"}
    >
      {isSharing ? "Stop Sharing" : "Share Screen"}
    </button>
  );
}
Processing Screen Share Frames
Screen share frames need different handling than webcam frames. Code is text-heavy, so higher JPEG quality is needed:
class ScreenShareSampler(FrameSampler):
    """
    Specialized sampler for screen share tracks.
    Uses higher quality encoding since code legibility matters.
    """

    def _encode_frame(self, frame: np.ndarray, quality: int = 85) -> bytes:
        # Screen share uses higher quality — text needs to be readable
        return super()._encode_frame(frame, quality=quality)

    async def process_frame(self, raw_frame: np.ndarray):
        # For screen share, we also detect if the screen changed significantly
        # to avoid sending identical frames
        if hasattr(self, "_last_frame") and self._frames_are_similar(raw_frame):
            return
        self._last_frame = raw_frame.copy()
        await super().process_frame(raw_frame)

    def _frames_are_similar(self, frame: np.ndarray, threshold: float = 0.98) -> bool:
        """Check if frame is similar to last frame using normalized cross-correlation."""
        if not hasattr(self, "_last_frame"):
            return False
        # Downscale both frames for fast comparison
        small_current = cv2.resize(frame, (64, 36))
        small_last = cv2.resize(self._last_frame, (64, 36))
        correlation = np.corrcoef(small_current.flatten(), small_last.flatten())[0, 1]
        return correlation > threshold
Bandwidth Management
Video adds significant bandwidth. Here are the numbers you need to plan around:
| Mode | Video Track | Frame Sampling | Extra Monthly Cost (1000 interviews) |
|---|---|---|---|
| Voice only | None | None | $0 |
| Webcam (general) | 720p/30fps | 0.5 fps to Gemini | ~$120 |
| Webcam (engagement) | 720p/30fps | 1 fps to Gemini | ~$240 |
| Screen share (code) | 1080p/5fps | 2 fps to Gemini | ~$400 |
For bandwidth between candidate and server, screen sharing at 1080p/5fps uses roughly 1-2 Mbps — acceptable on any broadband connection. The Gemini API frames are compressed down to 30-80KB each, so the API upload bandwidth is negligible.
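You can sanity-check the per-frame payload figure yourself by encoding a synthetic 720p frame. The sketch below uses a smooth gradient as a stand-in for typical webcam content; real frames compress somewhat differently depending on detail and sensor noise:

```python
import io

import numpy as np
from PIL import Image


def jpeg_payload_bytes(width: int = 1280, height: int = 720, quality: int = 75) -> int:
    """Encode a synthetic frame and return the JPEG payload size in bytes."""
    x = np.linspace(0, 255, width)
    y = np.linspace(0, 255, height)
    frame = np.stack(
        [
            np.tile(x, (height, 1)),          # red ramps left to right
            np.tile(y[:, None], (1, width)),  # green ramps top to bottom
            np.full((height, width), 128.0),  # constant blue channel
        ],
        axis=-1,
    ).astype(np.uint8)
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=quality, optimize=True)
    return buf.tell()


print(jpeg_payload_bytes())  # tiny compared to the ~2.7 MB raw frame
```

Whatever number you see, the point holds: a sampled, compressed frame is orders of magnitude smaller than the raw video the WebRTC track carries.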
# Cost estimation utility
def estimate_video_cost(
    interview_minutes: float,
    frames_per_second: float,
    include_screen_share: bool = False,
) -> dict:
    """
    Estimate Gemini API costs for video analysis.
    The rates below are effective blended rates that reproduce the
    per-interview breakdown discussed in this post; check current
    Gemini 2.0 Flash Live pricing before relying on them.
    """
    total_frames = interview_minutes * 60 * frames_per_second
    # A 720p JPEG frame ≈ 500-800 tokens
    tokens_per_frame = 800 if include_screen_share else 500
    total_tokens = total_frames * tokens_per_frame
    # Plus audio tokens for the voice channel: ~32 tokens per second of audio
    audio_tokens = interview_minutes * 60 * 32
    # Effective per-1M-token rates (audio input is priced higher than image tokens)
    VIDEO_RATE_PER_1M = 0.07
    AUDIO_RATE_PER_1M = 0.80
    input_cost = (
        total_tokens / 1_000_000 * VIDEO_RATE_PER_1M
        + audio_tokens / 1_000_000 * AUDIO_RATE_PER_1M
    )
    # Output tokens (observations) — minimal
    output_cost = 0.01  # ~$0.01 flat per interview for observations
    return {
        "video_tokens": int(total_tokens),
        "audio_tokens": int(audio_tokens),
        "input_cost_usd": round(input_cost, 4),
        "output_cost_usd": round(output_cost, 4),
        "total_cost_usd": round(input_cost + output_cost, 4),
    }


# Example: 45-minute coding interview with screen share at 2fps
cost = estimate_video_cost(45, 2.0, include_screen_share=True)
# → total_cost_usd: ~$0.38
This tracks with the $0.40/interview target mentioned in the post description. The breakdown for a 45-minute session: roughly $0.30 in video input tokens, $0.07 in audio tokens, $0.01 in output tokens.
Privacy Considerations
Video changes the privacy calculus significantly. Here’s what you need to think through before shipping:
Consent and Disclosure
Informed consent before video analysis is non-negotiable — both ethically and legally in most jurisdictions. The consent must be specific: “This interview will record video of you and analyze your screen activity.” Burying video analysis in general terms of service is not sufficient.
// src/components/VideoConsentModal.tsx
interface VideoConsentProps {
  onAccept: () => void;
  onDecline: () => void;
}

export function VideoConsentModal({ onAccept, onDecline }: VideoConsentProps) {
  return (
    <div className="consent-modal">
      <h2>Video Analysis for This Interview</h2>
      <p>
        This coding interview uses video analysis to help your interviewer
        understand your problem-solving process. Specifically:
      </p>
      <ul>
        <li>Your screen will be analyzed when you share it</li>
        <li>The AI may reference what it observes during the interview</li>
        <li>Video data is processed in real-time and not stored after the session</li>
      </ul>
      <p>
        Video is not used to analyze your facial expressions, emotions, or
        physical appearance.
      </p>
      <div className="consent-actions">
        <button onClick={onAccept} className="btn-primary">
          I understand and agree
        </button>
        <button onClick={onDecline} className="btn-secondary">
          Proceed with voice only
        </button>
      </div>
    </div>
  );
}
Critically — and this is the “btn-secondary” in the code above — candidates must be able to decline video analysis and still complete the interview. Making video mandatory creates accessibility issues and signals that you’re prioritizing surveillance over candidate experience.
Analyze-and-Discard vs Store
For most use cases, analyze-and-discard is the right default. The Gemini Live session processes frames in real-time; you log the structured observations (text), not the video frames. This significantly reduces your GDPR/privacy surface area.
If you do need to store video (for asynchronous review by human interviewers, for example), that triggers a different compliance posture: explicit retention policies, access controls, deletion workflows, and likely a Data Processing Agreement with your cloud provider. Part 9 covers this in depth.
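If you do choose storage, make the retention policy explicit in code rather than implicit in infrastructure, so it can be reviewed and tested like anything else. A hedged sketch; the field names are hypothetical, not taken from any compliance framework:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRetentionPolicy:
    """An explicit, reviewable retention policy for interview video."""
    store_raw_video: bool = False         # analyze-and-discard by default
    retention_days: int = 0               # 0 means nothing is kept
    reviewer_roles: tuple[str, ...] = ()  # who may access stored recordings
    deletion_on_request: bool = True      # honor erasure requests


# The default posture for most deployments
ANALYZE_AND_DISCARD = VideoRetentionPolicy()

# A stored-video posture for asynchronous human review
HUMAN_REVIEW = VideoRetentionPolicy(
    store_raw_video=True,
    retention_days=30,
    reviewer_roles=("hiring_manager", "interview_panel"),
)
```

Encoding the policy as a frozen dataclass means the storage layer can assert against it at write time, instead of relying on a bucket lifecycle rule nobody re-reads.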
# The analyze-and-discard pattern
import cv2
import numpy as np


class PrivacyFirstVideoProcessor:
    """
    Processes video frames through Gemini without persisting raw video.
    Only structured observations are retained.
    """

    def __init__(self, analyzer: GeminiLiveVideoAnalyzer):
        self.analyzer = analyzer
        # Frames are never stored — they go directly to Gemini and are discarded

    async def process_and_discard(self, frame: np.ndarray):
        """Send frame to Gemini; discard raw frame immediately."""
        await self.analyzer.send_frame(self._encode(frame))
        # frame goes out of scope here — no reference retained

    @staticmethod
    def _encode(frame: np.ndarray) -> bytes:
        ok, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 75])
        return jpeg.tobytes()
Putting It Together: The Complete Video Interview Flow
# interview_agent_with_video.py
import asyncio
import json
from typing import Optional

import cv2
import numpy as np
from livekit import rtc
from livekit.agents import JobContext, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, openai, silero


class VideoEnhancedInterviewAgent:
    def __init__(self, ctx: JobContext):
        self.ctx = ctx
        # Job metadata arrives as a JSON string; parse it once
        self.metadata = json.loads(ctx.job.metadata or "{}")
        self.session_id = self.metadata.get("session_id")
        self.job_role = self.metadata.get("job_role")
        self.video_analyzer: Optional[GeminiLiveVideoAnalyzer] = None
        self.frame_sampler: Optional[FrameSampler] = None
        self.context_builder: Optional[InterviewContextBuilder] = None

    async def run(self):
        room = self.ctx.room

        # Initialize video analysis if enabled
        if self._video_enabled():
            self.video_analyzer = GeminiLiveVideoAnalyzer(
                self.session_id, self.job_role
            )
            self.frame_sampler = FrameSampler(self.video_analyzer)
            self.context_builder = InterviewContextBuilder(self.video_analyzer)
            # Start Gemini session in background
            asyncio.create_task(self.video_analyzer.start())

        # Set up track subscriptions for video frames
        @room.on("track_subscribed")
        def on_track_subscribed(track, publication, participant):
            if track.kind == rtc.TrackKind.KIND_VIDEO:
                asyncio.create_task(self._process_video_track(track))

        # Build voice assistant with video context injection
        assistant = VoiceAssistant(
            vad=silero.VAD.load(),
            stt=deepgram.STT(),
            llm=openai.LLM(model="gpt-4o"),
            tts=openai.TTS(voice="nova"),
            before_llm_cb=self._inject_video_context,
        )
        assistant.start(room)

    async def _process_video_track(self, track: rtc.VideoTrack):
        """Pull frames from the video track and sample them."""
        video_stream = rtc.VideoStream(track)
        async for event in video_stream:
            if self.frame_sampler:
                # Convert the LiveKit frame to RGBA, then to the BGR
                # layout OpenCV expects
                rgba = event.frame.convert(rtc.VideoBufferType.RGBA)
                numpy_frame = np.frombuffer(rgba.data, dtype=np.uint8).reshape(
                    rgba.height, rgba.width, 4
                )
                bgr_frame = cv2.cvtColor(numpy_frame, cv2.COLOR_RGBA2BGR)
                await self.frame_sampler.process_frame(bgr_frame)

    async def _inject_video_context(self, assistant, chat_context):
        """Inject video observations into the LLM context before each turn."""
        if not self.context_builder:
            return
        injection = self.context_builder.build_context_injection()
        if injection:
            # Add as a system message that the LLM can reference
            chat_context.messages.append(
                llm.ChatMessage.create(text=injection, role="system")
            )

    def _video_enabled(self) -> bool:
        return self.metadata.get("video_enabled", False)
What We Built
Adding video to a voice interview is not a trivial feature — it’s a deliberate design decision with real cost and privacy implications. What we built:
- A Gemini Live multimodal session that accepts video frames and returns structured observations
- Frame sampling at three modes: 0.5fps for general observation, 1fps for engagement tracking, 2fps for live coding
- Screen sharing integration with change detection to avoid sending identical frames
- A context injection layer that feeds video observations to the voice agent without disrupting conversation flow
- Consent and privacy patterns: analyze-and-discard, opt-out for candidates, no emotional analysis
- Cost analysis showing $0.38-0.40 per 45-minute session with video enabled
The most important constraint to hold onto: every video feature must have a specific, defensible use case. If you can’t explain precisely what signal you’re capturing and why it improves hiring decisions, cut it.
In Part 9, we tackle the operational and legal side of what we’ve built. Recording interviews is easy. Doing it in a way that’s GDPR-compliant, HIPAA-aware, and defensible under employment law is a different challenge entirely.
This is Part 8 of a 12-part series: The Voice AI Interview Playbook.
Series outline:
- Why Real-Time Voice Changes Everything — The landscape, the vision, and the reference architecture (Part 1)
- Cascaded vs. Speech-to-Speech — Choosing your pipeline architecture (Part 2)
- LiveKit vs. Pipecat vs. Direct — Picking your framework (Part 3)
- STT, LLM, and TTS That Actually Work — Building the voice pipeline (Part 4)
- Multi-Role Agents — Interviewer, coach, and evaluator personas (Part 5)
- Knowledge Base and RAG — Making your voice agent an expert (Part 6)
- Web and Mobile Clients — Cross-platform voice experiences (Part 7)
- Video Interview Integration — Multimodal analysis with Gemini Live (this post)
- Recording, Transcription, and Compliance — GDPR, HIPAA, and getting it right (Part 9)
- Scaling to Thousands — Architecture for concurrent voice sessions (Part 10)
- Cost Optimization — From $0.14/min to $0.03/min (Part 11)
- Multi-Provider Support — OpenAI Realtime, Bedrock Nova, Grok, and the adapter pattern (Part 12)