Most voice AI applications feel like talking into a recording device and waiting for a callback. There’s a pause, a beep, and then a response — perfectly correct, but utterly unnatural. For an interview simulator, that latency breaks everything. People don’t interview well when the AI feels like a server processing a batch job.

Azure Foundry Voice Live changes the equation. It is a real-time, bidirectional voice API that streams audio in and out simultaneously, with built-in voice activity detection, interruption handling, and natural turn-taking. When tuned correctly, end-to-end latency drops below 150ms — comfortably within the pause people expect in natural conversation.

This 7-part series is the complete guide to building a production-grade Interview Voice System on Azure Voice Live with Next.js as the application layer.


What Is Azure Foundry Voice Live?

Azure Foundry Voice Live is part of the Azure AI Foundry platform (formerly Azure OpenAI Service + Azure Speech). It provides a real-time audio WebSocket API that replaces the traditional pipeline of:

User speaks → Convert to text (STT) → Send to LLM → Get response → Convert to audio (TTS) → Play

With a single streaming connection:

User speaks → Audio frames → Azure Voice Live → Response audio frames → Play

The key difference is end-to-end streaming. There is no waiting for the full utterance to be transcribed. Azure Voice Live processes audio frames as they arrive, begins forming a response as soon as intent is clear, and starts streaming the reply before the user has even finished speaking (when configured to do so).
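
To make the frame exchange concrete, here is a minimal TypeScript sketch of the client-side send path: each mic chunk is converted to 16-bit PCM, base64-encoded, and wrapped in a JSON event. The `input_audio_buffer.append` event name follows the Realtime-style protocol and `Buffer` is Node-only (use `btoa` in the browser); treat the details as illustrative rather than the exact Voice Live wire format.

```typescript
// Sketch of the upstream audio path. Event names are illustrative and
// should be checked against the current Voice Live API reference.

/** Convert Float32 samples (range -1..1) to 16-bit PCM. */
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

/** Base64-encode a PCM buffer for transport inside a JSON event. */
function pcmToBase64(pcm: Int16Array): string {
  // Node-only shortcut; in the browser, base64-encode the bytes via btoa.
  return Buffer.from(pcm.buffer, pcm.byteOffset, pcm.byteLength).toString("base64");
}

/** Append one mic frame to the server-side input buffer. */
function sendAudioFrame(send: (msg: string) => void, samples: Float32Array): void {
  send(
    JSON.stringify({
      type: "input_audio_buffer.append", // illustrative event name
      audio: pcmToBase64(floatTo16BitPCM(samples)),
    })
  );
}
```

Because frames are appended continuously, the server can start responding before the utterance is complete, which is exactly the behavior described above.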

Core Capabilities

| Capability | Detail |
| --- | --- |
| Latency | 120–200ms end-to-end (optimal region + config) |
| Protocol | WebSocket (wss://) — bidirectional, persistent |
| Audio Input | PCM 16-bit, 16kHz or 24kHz mono |
| Audio Output | PCM, G.711 μ-law, or Opus |
| VAD | Built-in Voice Activity Detection |
| Interruption | Natural barge-in when user speaks over AI |
| Languages | 100+ languages |
| Model | GPT-4o Realtime Audio (Azure-hosted) |

The Interview Voice System: Why It Needs This

An interview simulator has unusually strict UX requirements:

  1. Response time < 300ms — anything longer feels like the interviewer is confused
  2. Natural interruption — the user should be able to cut off a long answer mid-stream
  3. Turn-taking clarity — the user needs to know when to speak vs. listen
  4. High audio quality — robotic voices destroy confidence
  5. Session continuity — context must persist across a 30-minute interview

Classic TTS pipelines fail at requirements 1, 2, and 5. Voice Live satisfies all five natively.


System Architecture

The production architecture for our Interview Voice System looks like this:

```mermaid
flowchart TD
    A["🎤 Microphone<br/>(MediaRecorder API)"] -->|"PCM audio frames"| B["useVoiceLive Hook<br/>(React custom hook)"]
    B -->|"wss:// WebSocket"| C["Next.js API Route<br/>/api/voice — proxy"]
    C -->|"Injects API key<br/>wss:// to Azure"| D["Azure Foundry<br/>Voice Live API"]
    D -->|"GPT-4o Realtime Audio + VAD"| D
    D -->|"Response audio frames"| C
    C -->|"wss:// relay"| B
    B -->|"Web Audio API"| E["🔊 Speaker Playback"]
    B -->|"Status events"| F["Interview UI<br/>(React components)"]

    style A fill:#1e3a5f,color:#fff
    style B fill:#1e3a5f,color:#fff
    style C fill:#2d4a7a,color:#fff
    style D fill:#0078d4,color:#fff
    style E fill:#1e3a5f,color:#fff
    style F fill:#1e3a5f,color:#fff
```

Why a Server-Side Proxy?

The Azure API key must never be exposed in the browser. The Next.js API route acts as a WebSocket proxy — it authenticates with Azure and relays frames between the browser and Azure with minimal overhead.
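
The relay itself is small. Here is a sketch with the socket type duck-typed so the same wiring applies to both the browser-facing and Azure-facing connections; in practice you would use the `ws` package behind a custom Next.js server, since App Router API routes cannot upgrade a connection to WebSocket on their own. The URL shape and query parameter names (`deployment`, `api-version`) are assumptions to verify against the current API reference.

```typescript
// Sketch of the proxy's core relay. Connection setup (via the `ws` package
// and a custom server) is omitted; URL and parameter names are illustrative.

/** The minimal surface both socket implementations share. */
interface SocketLike {
  send(data: string | ArrayBuffer): void;
  close(): void;
  onMessage(cb: (data: string | ArrayBuffer) => void): void;
  onClose(cb: () => void): void;
}

/** Build the Azure-side URL server-side; the API key travels in a header
 *  here, never in browser code. Parameter names are assumptions. */
function buildAzureUrl(endpoint: string, deployment: string, apiVersion: string): string {
  const base = endpoint.replace(/^https/, "wss").replace(/\/$/, "");
  return `${base}/openai/realtime?deployment=${deployment}&api-version=${apiVersion}`;
}

/** Pipe frames in both directions; tear down both sides together. */
function relay(client: SocketLike, azure: SocketLike): void {
  client.onMessage((data) => azure.send(data)); // mic frames up
  azure.onMessage((data) => client.send(data)); // response frames down
  client.onClose(() => azure.close());
  azure.onClose(() => client.close());
}
```

Keeping the relay this thin is what preserves the latency budget: the proxy adds one network hop and no buffering.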


Comparing Alternatives

Before committing to Azure Voice Live, it’s worth evaluating the alternatives:

| Solution | Latency | Cost | Interruption | Control |
| --- | --- | --- | --- | --- |
| Azure Voice Live | 120–200ms | $0.06/min | ✅ Native | ✅ Full |
| OpenAI Realtime API | 200–400ms | $0.10/min | ✅ Native | Medium |
| ElevenLabs Conversational | 300–600ms | $0.30/min | ⚠️ Limited | Low |
| Twilio + Deepgram + GPT | 400–800ms | $0.15/min | ❌ Manual | ✅ Full |
| Classic STT→LLM→TTS | 1,000–3,000ms | Variable | ❌ Manual | ✅ Full |

Azure Voice Live wins on latency and cost when you’re already in the Azure ecosystem. The OpenAI Realtime API is a close second but costs roughly 1.7x more per minute at the rates above.


Technology Choices for This Series

| Layer | Technology | Why |
| --- | --- | --- |
| Front-end | Next.js 14 (App Router) | Industry standard, excellent WebSocket support |
| Voice API | Azure Voice Live | Lowest latency, Azure ecosystem |
| Audio capture | MediaRecorder API | Native browser, no library needed |
| Audio playback | Web Audio API | Low-latency, fine-grained control |
| Deployment | Azure Container Apps | Scales WebSocket sessions, sticky sessions |
| Monitoring | Azure Application Insights | First-party, correlates with Voice Live logs |
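
On the playback side of that stack, incoming frames have to be decoded from base64 PCM16 back into the Float32 samples the Web Audio API expects. A minimal sketch of that decode step; the AudioContext scheduling is only outlined in comments, and the exact payload shape is an assumption to check against the API reference.

```typescript
// Sketch: decoding a received PCM16 audio payload for Web Audio playback.

/** Convert 16-bit PCM samples back to Float32 in the -1..1 range. */
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    const s = pcm[i];
    out[i] = s < 0 ? s / 0x8000 : s / 0x7fff;
  }
  return out;
}

/** Decode a base64 audio payload (little-endian PCM16 assumed). */
function base64ToPcm16(b64: string): Int16Array {
  // Node-only shortcut; in the browser, decode via atob into a Uint8Array.
  const bytes = Uint8Array.from(Buffer.from(b64, "base64"));
  return new Int16Array(bytes.buffer);
}

// In the browser, each decoded frame would be copied into an AudioBuffer
// and scheduled back-to-back on an AudioContext (details omitted here):
//
//   const audioBuf = ctx.createBuffer(1, samples.length, 24000);
//   audioBuf.copyToChannel(samples, 0);
//   const src = ctx.createBufferSource();
//   src.buffer = audioBuf;
//   src.connect(ctx.destination);
//   src.start(nextStartTime);
```

This is why the table picks the Web Audio API over an `<audio>` element: buffer sources can be scheduled sample-accurately, which keeps streamed chunks gapless.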

The 7-Part Roadmap

Here’s what this series covers:

| Part | Topic | Key Question |
| --- | --- | --- |
| Part 1 | Architecture Overview | Why Voice Live? What does the system look like? |
| Part 2 | Setup & Configuration | How do I get started fast? |
| Part 3 | Next.js Integration | How do I connect it to my app? |
| Part 4 | Minimum Latency | How do I get below 150ms? |
| Part 5 | Audio Quality | How do I make it sound natural? |
| Part 6 | Debugging & Issues | How do I fix what breaks? |
| Part 7 | Deploy, Scale & Pricing | How do I run this in production? |

Prerequisites

To follow this series you will need:

  • An Azure account with Foundry access
  • Basic familiarity with Next.js (App Router)
  • Comfort with TypeScript and async JavaScript
  • A modern browser with microphone (getUserMedia) support (Chrome, Edge, Firefox)

No prior voice AI experience required — we explain all the concepts as we go.


What You’ll Build

By the end of Part 7, you will have a fully working AI Interview Coach with:

  • ✅ Real-time voice conversation with sub-200ms response time
  • ✅ Natural interruption — speak over the AI at any point
  • ✅ Voice activity indicator so users know when to speak
  • ✅ Interview context persistence across the session
  • ✅ Production deployment on Azure Container Apps
  • ✅ Cost monitoring and usage tracking

A Note on “Foundry Voice Live” vs. “Azure OpenAI Realtime Audio”

You may encounter both terms. They refer to the same underlying API, accessed through different control planes:

  • Azure AI Foundry (ai.azure.com) — the newer unified portal
  • Azure OpenAI Service — the older portal

Both use the same gpt-4o-realtime-preview model and the same wss:// WebSocket endpoint. In this series we use the Foundry portal for setup, but all code works identically regardless of which portal you used to provision the resource.
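
Whichever portal you provision through, the application layer only needs three values: the endpoint, an API key, and the deployment name. A sketch of what a `.env.local` might hold; the variable names are our own placeholders, not official names, and the values are illustrative.

```shell
# .env.local -- variable names are illustrative placeholders
AZURE_VOICE_LIVE_ENDPOINT="https://<your-resource>.openai.azure.com"
AZURE_VOICE_LIVE_API_KEY="<key-from-the-portal>"
AZURE_VOICE_LIVE_DEPLOYMENT="gpt-4o-realtime-preview"
```

Keeping these server-side only (no `NEXT_PUBLIC_` prefix) is what lets the proxy pattern above work without leaking the key to the browser.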


Next: Part 2 — Project Setup & Azure Configuration →

This is Part 1 of the Azure Voice Live series.
