
AI Voice Interview Platform

Real-time voice interview platform powered by multiple AI providers — LiveKit for WebRTC infrastructure, Gemini Live for multimodal understanding, OpenAI Realtime API for low-latency conversation, and Amazon Bedrock Nova for enterprise-grade evaluation.

LiveKit · Gemini Live · OpenAI Realtime · Amazon Bedrock Nova · WebRTC · Python · TypeScript · AI

Technical hiring is broken. Recruiters spend 60% of their time on initial screening calls that could be automated. Candidates wait days for feedback. And the quality of screening varies wildly depending on who conducts it. Four engineering managers were spending 10+ hours a week each on first-round phone screens, and the signal-to-noise ratio was terrible.

I wanted to build something that could handle the initial technical screen with real voice conversation — not the clunky “record yourself answering these 5 questions” kind, but actual back-and-forth dialogue where the AI listens, follows up, and adapts based on the candidate’s responses.

What We Built

A platform where candidates join a call and have a natural conversation with an AI interviewer. The AI asks technical questions, follows up based on answers, probes deeper when responses are vague, and generates structured evaluation reports afterward.

The key insight: no single AI provider does everything well. So we built a multi-provider architecture that uses each model for what it’s best at.

Architecture

┌──────────────┐     ┌──────────────────┐     ┌────────────────────┐
│   Browser    │────▶│   LiveKit SFU    │────▶│   Agent Server     │
│  (WebRTC)    │◀────│   (Media Relay)  │◀────│   (Python)         │
└──────────────┘     └──────────────────┘     └────────┬───────────┘
                                                       │
                                    ┌──────────────────┼──────────────────┐
                                    │                  │                  │
                             ┌──────▼─────┐     ┌──────▼──────┐    ┌──────▼──────┐
                             │  OpenAI    │     │  Gemini     │    │  Bedrock    │
                             │  Realtime  │     │  Live       │    │  Nova       │
                             │  (Voice)   │     │  (Analysis) │    │  (Eval)     │
                             └────────────┘     └─────────────┘    └─────────────┘

Provider Responsibilities

LiveKit handles the real-time infrastructure — WebRTC connections, audio routing, room management, and recording. We use the LiveKit Agents SDK to bridge media streams and our AI backends.

from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero

async def entrypoint(ctx: JobContext):
    # Audio only: Gemini's video analysis runs in a separate agent process.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),           # turn detection; tuning covered below
        stt=openai.STT(),
        llm=openai.LLM(model="gpt-4o"),  # the Realtime path swaps this STT→LLM→TTS
        tts=openai.TTS(),                # pipeline for openai.realtime.RealtimeModel
    )

    assistant.start(ctx.room)
    await assistant.say(
        "Hi, I'm your interviewer today. Tell me about a "
        "challenging project you've worked on recently.",
        allow_interruptions=True,
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

OpenAI Realtime API is the primary conversational engine. Sub-200ms latency is critical — anything above 500ms and conversations feel unnatural. We use function calling to let the AI transition between interview sections (intro → technical → behavioral → wrap-up).
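A minimal sketch of how such a section transition can be wired. The tool definition follows the OpenAI Realtime API's function-tool format; the `advance_section` name, the `InterviewState` guard, and the forward-only rule are illustrative choices of ours, not part of the API:

```python
# Interview flows through fixed sections; the model requests transitions
# by calling the tool, and the server validates them.
SECTIONS = ["intro", "technical", "behavioral", "wrap-up"]

# Tool schema in the Realtime API's function-tool format (name is ours).
ADVANCE_SECTION_TOOL = {
    "type": "function",
    "name": "advance_section",
    "description": "Move the interview to the next section.",
    "parameters": {
        "type": "object",
        "properties": {
            "next_section": {"type": "string", "enum": SECTIONS},
            "reason": {"type": "string"},
        },
        "required": ["next_section"],
    },
}

class InterviewState:
    """Server-side guard: only single forward steps are allowed."""

    def __init__(self):
        self.section = "intro"

    def advance(self, next_section: str) -> bool:
        cur, nxt = SECTIONS.index(self.section), SECTIONS.index(next_section)
        if nxt == cur + 1:  # no skipping sections, no going backwards
            self.section = next_section
            return True
        return False
```

Keeping the transition rule on the server means a confused model call can never skip the behavioral section or loop back to the intro.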

Gemini Live runs multimodal analysis in parallel. While OpenAI handles voice, Gemini processes the video feed for engagement signals — is the candidate reading from notes? Are they screen-sharing for whiteboarding? It also does real-time code analysis when candidates share their IDE.
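The parallel wiring can be sketched as two asyncio tasks sharing a queue. The coroutine names and the queue are stand-ins; a real agent would call the Gemini Live API where the comment indicates:

```python
import asyncio

async def run_voice_turns(events: asyncio.Queue, turns: int) -> int:
    """Stand-in for the OpenAI Realtime conversation loop."""
    handled = 0
    for _ in range(turns):
        await asyncio.sleep(0)     # placeholder for one voice turn
        await events.put("frame")  # forward a video frame for analysis
        handled += 1
    await events.put(None)         # sentinel: session is over
    return handled

async def analyze_engagement(events: asyncio.Queue) -> list[str]:
    """Stand-in for the Gemini Live analysis loop."""
    signals = []
    while (frame := await events.get()) is not None:
        # A real implementation would send `frame` to Gemini Live here.
        signals.append("engaged")
    return signals

async def run_interview(turns: int = 3):
    events: asyncio.Queue = asyncio.Queue()
    # Both loops run concurrently; neither blocks the other.
    return await asyncio.gather(
        run_voice_turns(events, turns),
        analyze_engagement(events),
    )
```

The point of the queue is isolation: a slow Gemini analysis call only grows the backlog, it never adds latency to the voice turn the candidate is hearing.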

Amazon Bedrock Nova handles post-interview evaluation. Nova Pro generates structured assessment reports, scoring candidates across dimensions (technical depth, communication, problem-solving). We chose Bedrock for its enterprise features — VPC endpoints, IAM integration, and the compliance certifications HR required.

Provider Comparison

Aspect                 Gemini Live           OpenAI Realtime      Bedrock Nova
Latency                ~300ms                ~200ms               ~400ms
Natural conversation   Excellent             Very good            Good
Technical depth        Strong                Strongest            Good
Cost per interview     $$                    $$$                  $
Best for               Multimodal analysis   Voice conversation   Evaluation & high-volume

We use OpenAI Realtime for the primary voice conversation, Gemini Live for parallel multimodal analysis, and Bedrock Nova for post-interview structured evaluation at enterprise scale.

The Hard Parts

Voice Activity Detection tuning. Default VAD settings cut off candidates mid-sentence. We spent two weeks tuning Silero VAD parameters — endpointing delay, speech probability thresholds, padding durations. The sweet spot was 400ms of silence before considering a turn complete.

Provider failover. OpenAI Realtime has occasional latency spikes. We built a circuit breaker that falls back to Gemini Live voice if OpenAI p95 exceeds 800ms. The handoff has to be seamless — the candidate shouldn’t notice.
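A minimal sketch of that gate, assuming a rolling window of response latencies; the class and provider names are ours:

```python
import statistics
from collections import deque

class LatencyCircuitBreaker:
    """Trip to the fallback provider when rolling p95 exceeds the budget."""

    def __init__(self, p95_budget_ms: float = 800.0, window: int = 50):
        self.p95_budget_ms = p95_budget_ms
        self.samples: deque[float] = deque(maxlen=window)
        self.tripped = False

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)
        if len(self.samples) >= 20:  # wait for a meaningful sample size
            # quantiles(n=100) yields 99 cut points; index 94 is p95
            p95 = statistics.quantiles(self.samples, n=100)[94]
            self.tripped = p95 > self.p95_budget_ms

    @property
    def provider(self) -> str:
        return "gemini-live" if self.tripped else "openai-realtime"
```

Using p95 over a rolling window rather than a single slow response keeps one network blip from triggering a mid-interview provider swap.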

Audio quality. Opus codec is optimized for speech, but compressed audio sometimes trips up STT models. Using LiveKit’s audio processing pipeline with echo cancellation and noise suppression before feeding to AI models improved transcription accuracy by 23%.

Evaluation consistency. Early Bedrock evaluations had high variance — same transcript, different scores on repeated runs. We fixed this with structured output schemas, few-shot calibrated examples, and temperature 0.3.
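A sketch of how such a request can be assembled for Bedrock's `converse` API. The schema fields, example values, and helper name are illustrative; only the `modelId`, message shape, and `inferenceConfig` keys follow boto3's documented request format:

```python
import json

# Illustrative scoring schema the model must fill in.
EVAL_SCHEMA = {
    "type": "object",
    "properties": {
        "technical_depth": {"type": "integer", "minimum": 1, "maximum": 5},
        "communication": {"type": "integer", "minimum": 1, "maximum": 5},
        "problem_solving": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary": {"type": "string"},
    },
    "required": ["technical_depth", "communication", "problem_solving"],
}

def build_eval_request(transcript: str, few_shots: list[tuple[str, str]]) -> dict:
    """Build converse() kwargs: calibrated examples plus a fixed low temperature."""
    system = (
        "Score the interview transcript. Respond ONLY with JSON matching "
        f"this schema: {json.dumps(EVAL_SCHEMA)}"
    )
    messages = []
    # Few-shot calibration as prior user/assistant turns, so roles alternate.
    for shot_transcript, shot_json in few_shots:
        messages.append({"role": "user", "content": [{"text": shot_transcript}]})
        messages.append({"role": "assistant", "content": [{"text": shot_json}]})
    messages.append({"role": "user", "content": [{"text": transcript}]})
    return {
        "modelId": "amazon.nova-pro-v1:0",
        "system": [{"text": system}],
        "messages": messages,
        "inferenceConfig": {"temperature": 0.3},
    }
```

The returned dict can be passed straight to `boto3.client("bedrock-runtime").converse(**req)`. Pinning the schema and the calibration examples in every request is what pulled repeat-run scores together; temperature 0.3 alone was not enough.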

Tech Stack

Component                  Technology
Frontend                   Next.js, LiveKit Components SDK
Real-time Infrastructure   LiveKit Cloud (SFU)
Conversational AI          OpenAI Realtime API (gpt-4o-realtime)
Multimodal Analysis        Google Gemini 2.0 Flash (Live API)
Evaluation Engine          Amazon Bedrock (Nova Pro)
Agent Framework            LiveKit Agents SDK (Python)
Recording & Storage        LiveKit Egress, S3
Auth                       Clerk
Database                   PostgreSQL, Prisma
Monitoring                 OpenTelemetry, Grafana

Results

After 3 months in production with ~200 interviews:

  • Time saved: Engineering managers freed up 8+ hours/week from first-round screens
  • Candidate experience: Satisfaction scores improved 34% — 92% rated the conversation natural
  • Throughput: 3x more candidates screened per week, available 24/7
  • Pass-through accuracy: 91% agreement between AI evaluation and subsequent human interviewer assessments
  • Latency: Average voice response time of 180ms, 99.2% of interactions under 500ms
  • Scoring correlation: 0.87 with human interviewers over 500 calibration interviews

Timeline: Prototype in 3 weeks, production since late 2025. Still iterating on evaluation calibration, adding screen-sharing support for live coding, and adding multi-language support (Vietnamese, Japanese).