
AI Voice Interview Platform

Real-time voice interview platform powered by multiple AI providers — LiveKit for WebRTC infrastructure, Gemini Live for multimodal understanding, OpenAI Realtime API for low-latency conversation, and Amazon Bedrock Nova for enterprise-grade evaluation.

LiveKit · Gemini Live · OpenAI Realtime · Amazon Bedrock Nova · WebRTC · Python · TypeScript · AI

Technical hiring is broken. Recruiters spend 60% of their time on initial screening calls that could be automated. Candidates wait days for feedback. And the quality of screening varies wildly depending on who conducts it. Four engineering managers were spending 10+ hours a week each on first-round phone screens, and the signal-to-noise ratio was terrible.

I wanted to build something that could handle the initial technical screen with real voice conversation — not the clunky “record yourself answering these 5 questions” kind, but actual back-and-forth dialogue where the AI listens, follows up, and adapts based on the candidate’s responses.

What We Built

A platform where candidates join a call and have a natural conversation with an AI interviewer. The AI asks technical questions, follows up based on answers, probes deeper when responses are vague, and generates structured evaluation reports afterward.

The key insight: no single AI provider does everything well. So we built a multi-provider architecture that uses each model for what it’s best at.

Architecture

┌──────────────┐     ┌──────────────────┐     ┌────────────────────┐
│   Browser    │────▶│   LiveKit SFU    │────▶│   Agent Server     │
│  (WebRTC)    │◀────│   (Media Relay)  │◀────│   (Python)         │
└──────────────┘     └──────────────────┘     └────────┬───────────┘
                                                       │
                                    ┌──────────────────┼──────────────────┐
                                    │                  │                  │
                             ┌──────▼─────┐     ┌──────▼──────┐    ┌──────▼──────┐
                             │  OpenAI    │     │  Gemini     │    │  Bedrock    │
                             │  Realtime  │     │  Live       │    │  Nova       │
                             │  (Voice)   │     │  (Analysis) │    │  (Eval)     │
                             └────────────┘     └─────────────┘    └─────────────┘

Provider Responsibilities

LiveKit handles the real-time infrastructure — WebRTC connections, audio routing, room management, and recording. We use the LiveKit Agents SDK to bridge media streams and our AI backends.

from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero

async def entrypoint(ctx: JobContext):
    # Audio only: Gemini's video analysis runs in a separate agent process.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),           # turn detection; tuning covered below
        stt=openai.STT(),
        llm=openai.LLM(model="gpt-4o"),  # the Realtime path swaps this STT→LLM→TTS
        tts=openai.TTS(),                # pipeline for openai.realtime.RealtimeModel
    )

    assistant.start(ctx.room)
    await assistant.say(
        "Hi, I'm your interviewer today. Tell me about a "
        "challenging project you've worked on recently.",
        allow_interruptions=True,
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

OpenAI Realtime API is the primary conversational engine. Sub-200ms latency is critical — anything above 500ms and conversations feel unnatural. We use function calling to let the AI transition between interview sections (intro → technical → behavioral → wrap-up).
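A minimal sketch of how such a section transition can be wired. The tool definition follows the OpenAI Realtime API's function-tool format; the `advance_section` name, the `InterviewState` guard, and the forward-only rule are illustrative choices of ours, not part of the API:

```python
# Interview flows through fixed sections; the model requests transitions
# by calling the tool, and the server validates them.
SECTIONS = ["intro", "technical", "behavioral", "wrap-up"]

# Tool schema in the Realtime API's function-tool format (name is ours).
ADVANCE_SECTION_TOOL = {
    "type": "function",
    "name": "advance_section",
    "description": "Move the interview to the next section.",
    "parameters": {
        "type": "object",
        "properties": {
            "next_section": {"type": "string", "enum": SECTIONS},
            "reason": {"type": "string"},
        },
        "required": ["next_section"],
    },
}

class InterviewState:
    """Server-side guard: only single forward steps are allowed."""

    def __init__(self):
        self.section = "intro"

    def advance(self, next_section: str) -> bool:
        cur, nxt = SECTIONS.index(self.section), SECTIONS.index(next_section)
        if nxt == cur + 1:  # no skipping sections, no going backwards
            self.section = next_section
            return True
        return False
```

Keeping the transition rule on the server means a confused model call can never skip the behavioral section or loop back to the intro.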

Gemini Live runs multimodal analysis in parallel. While OpenAI handles voice, Gemini processes the video feed for engagement signals — is the candidate reading from notes? Are they screen-sharing for whiteboarding? It also does real-time code analysis when candidates share their IDE.
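The parallel wiring can be sketched as two asyncio tasks sharing a queue. The coroutine names and the queue are stand-ins; a real agent would call the Gemini Live API where the comment indicates:

```python
import asyncio

async def run_voice_turns(events: asyncio.Queue, turns: int) -> int:
    """Stand-in for the OpenAI Realtime conversation loop."""
    handled = 0
    for _ in range(turns):
        await asyncio.sleep(0)     # placeholder for one voice turn
        await events.put("frame")  # forward a video frame for analysis
        handled += 1
    await events.put(None)         # sentinel: session is over
    return handled

async def analyze_engagement(events: asyncio.Queue) -> list[str]:
    """Stand-in for the Gemini Live analysis loop."""
    signals = []
    while (frame := await events.get()) is not None:
        # A real implementation would send `frame` to Gemini Live here.
        signals.append("engaged")
    return signals

async def run_interview(turns: int = 3):
    events: asyncio.Queue = asyncio.Queue()
    # Both loops run concurrently; neither blocks the other.
    return await asyncio.gather(
        run_voice_turns(events, turns),
        analyze_engagement(events),
    )
```

The point of the queue is isolation: a slow Gemini analysis call only grows the backlog, it never adds latency to the voice turn the candidate is hearing.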

Amazon Bedrock Nova handles post-interview evaluation. Nova Pro generates structured assessment reports, scoring candidates across dimensions (technical depth, communication, problem-solving). We chose Bedrock for its enterprise features — VPC endpoints, IAM integration, and the compliance certifications HR required.

Provider Comparison

Aspect                 Gemini Live           OpenAI Realtime      Bedrock Nova
Latency                ~300ms                ~200ms               ~400ms
Natural conversation   Excellent             Very good            Good
Technical depth        Strong                Strongest            Good
Cost per interview     $$                    $$$                  $
Best for               Multimodal analysis   Voice conversation   Evaluation & high-volume

We use OpenAI Realtime for the primary voice conversation, Gemini Live for parallel multimodal analysis, and Bedrock Nova for post-interview structured evaluation at enterprise scale.

The Hard Parts

Voice Activity Detection tuning. Default VAD settings cut off candidates mid-sentence. We spent two weeks tuning Silero VAD parameters — endpointing delay, speech probability thresholds, padding durations. The sweet spot was 400ms of silence before considering a turn complete.

Provider failover. OpenAI Realtime has occasional latency spikes. We built a circuit breaker that falls back to Gemini Live voice if OpenAI p95 exceeds 800ms. The handoff has to be seamless — the candidate shouldn’t notice.
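A minimal sketch of that gate, assuming a rolling window of response latencies; the class and provider names are ours:

```python
import statistics
from collections import deque

class LatencyCircuitBreaker:
    """Trip to the fallback provider when rolling p95 exceeds the budget."""

    def __init__(self, p95_budget_ms: float = 800.0, window: int = 50):
        self.p95_budget_ms = p95_budget_ms
        self.samples: deque[float] = deque(maxlen=window)
        self.tripped = False

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)
        if len(self.samples) >= 20:  # wait for a meaningful sample size
            # quantiles(n=100) yields 99 cut points; index 94 is p95
            p95 = statistics.quantiles(self.samples, n=100)[94]
            self.tripped = p95 > self.p95_budget_ms

    @property
    def provider(self) -> str:
        return "gemini-live" if self.tripped else "openai-realtime"
```

Using p95 over a rolling window rather than a single slow response keeps one network blip from triggering a mid-interview provider swap.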

Audio quality. Opus codec is optimized for speech, but compressed audio sometimes trips up STT models. Using LiveKit’s audio processing pipeline with echo cancellation and noise suppression before feeding to AI models improved transcription accuracy by 23%.

Evaluation consistency. Early Bedrock evaluations had high variance — same transcript, different scores on repeated runs. We fixed this with structured output schemas, few-shot calibrated examples, and temperature 0.3.
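A sketch of how such a request can be assembled for Bedrock's `converse` API. The schema fields, example values, and helper name are illustrative; only the `modelId`, message shape, and `inferenceConfig` keys follow boto3's documented request format:

```python
import json

# Illustrative scoring schema the model must fill in.
EVAL_SCHEMA = {
    "type": "object",
    "properties": {
        "technical_depth": {"type": "integer", "minimum": 1, "maximum": 5},
        "communication": {"type": "integer", "minimum": 1, "maximum": 5},
        "problem_solving": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary": {"type": "string"},
    },
    "required": ["technical_depth", "communication", "problem_solving"],
}

def build_eval_request(transcript: str, few_shots: list[tuple[str, str]]) -> dict:
    """Build converse() kwargs: calibrated examples plus a fixed low temperature."""
    system = (
        "Score the interview transcript. Respond ONLY with JSON matching "
        f"this schema: {json.dumps(EVAL_SCHEMA)}"
    )
    messages = []
    # Few-shot calibration as prior user/assistant turns, so roles alternate.
    for shot_transcript, shot_json in few_shots:
        messages.append({"role": "user", "content": [{"text": shot_transcript}]})
        messages.append({"role": "assistant", "content": [{"text": shot_json}]})
    messages.append({"role": "user", "content": [{"text": transcript}]})
    return {
        "modelId": "amazon.nova-pro-v1:0",
        "system": [{"text": system}],
        "messages": messages,
        "inferenceConfig": {"temperature": 0.3},
    }
```

The returned dict can be passed straight to `boto3.client("bedrock-runtime").converse(**req)`. Pinning the schema and the calibration examples in every request is what pulled repeat-run scores together; temperature 0.3 alone was not enough.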

Tech Stack

Component                  Technology
Frontend                   Next.js, LiveKit Components SDK
Real-time Infrastructure   LiveKit Cloud (SFU)
Conversational AI          OpenAI Realtime API (gpt-4o-realtime)
Multimodal Analysis        Google Gemini 2.0 Flash (Live API)
Evaluation Engine          Amazon Bedrock (Nova Pro)
Agent Framework            LiveKit Agents SDK (Python)
Recording & Storage        LiveKit Egress, S3
Auth                       Clerk
Database                   PostgreSQL, Prisma
Monitoring                 OpenTelemetry, Grafana

Results

After 3 months in production with ~200 interviews:

  • Time saved: Engineering managers freed up 8+ hours/week from first-round screens
  • Candidate experience: Satisfaction scores improved 34% — 92% rated the conversation natural
  • Throughput: 3x more candidates screened per week, available 24/7
  • Pass-through accuracy: 91% agreement between AI evaluation and subsequent human interviewer assessments
  • Latency: Average voice response time of 180ms, 99.2% of interactions under 500ms
  • Scoring correlation: 0.87 with human interviewers over 500 calibration interviews

Timeline: Prototype in 3 weeks, production since late 2025. Still iterating on evaluation calibration, adding screen-sharing support for live coding, and adding multi-language support (Vietnamese, Japanese).