A voice interview system with 150ms latency but choppy audio, robotic speech, or broken interruption handling is worse than useless. Candidates will lose confidence in the product immediately. Audio quality is the second critical dimension — and it’s more configurable than most developers realize.
Audio Format Selection
Azure Voice Live supports three output formats. Choosing correctly has a significant impact on quality and bandwidth:
| Format | Bitrate | Latency | Quality | Use Case |
|---|---|---|---|---|
| pcm16 | 384 kbps | Lowest ✅ | Excellent | LAN / good network |
| g711_ulaw | 64 kbps | Low | Good | Limited bandwidth |
| opus | 16–128 kbps | Medium | Excellent | Mobile / variable network |
For an Interview System: Use pcm16
When your users are on a stable broadband connection (office or home), PCM16 gives the shortest processing pipeline and highest fidelity: at 24kHz, 16-bit, mono it streams at 384 kbps, with zero codec encode/decode overhead.
// In session.update config
session: {
  input_audio_format: 'pcm16',
  output_audio_format: 'pcm16',
  // ...
}
For Mobile Users: Use Opus
If you need to support mobile users or expect variable network quality:
session: {
  input_audio_format: 'pcm16',  // Input stays as PCM (browser capture)
  output_audio_format: 'pcm16', // Azure → your server as PCM
  // Then transcode to Opus in your proxy before sending to browser
  // (requires additional FFmpeg or libopus integration)
}
Note: Azure Voice Live does not natively expose Opus output as of early 2026. Transcode in your proxy if needed.
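If you do go the Opus route, the transcode can live in the proxy process. A minimal sketch using FFmpeg as a child process — `browserSocket` and `pcmChunk` are placeholders for your proxy's client WebSocket and the PCM feed from Azure, not names from this series' codebase:

// Hypothetical proxy-side transcode: raw PCM16 from Azure → Opus for the browser
import { spawn } from 'node:child_process';

const ffmpeg = spawn('ffmpeg', [
  '-f', 's16le', '-ar', '24000', '-ac', '1', '-i', 'pipe:0', // PCM16 mono @ 24kHz in
  '-c:a', 'libopus', '-b:a', '32k', '-f', 'ogg', 'pipe:1',   // Opus in an Ogg container out
]);

// browserSocket / pcmChunk: placeholders for your own proxy wiring
ffmpeg.stdout.on('data', (chunk) => browserSocket.send(chunk)); // forward Opus to the client
// For every PCM chunk received from Azure:
ffmpeg.stdin.write(pcmChunk);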
Browser Audio Capture Quality
The quality of captured microphone audio directly affects how well Azure’s VAD and transcription work:
// Optimal constraints for interview audio capture
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    channelCount: 1,        // Mono — Voice Live requires mono
    sampleRate: 24000,      // Match Azure's native 24kHz rate
    echoCancellation: true, // CRITICAL — prevent AI voice feeding back into mic
    noiseSuppression: true, // Filter background noise
    autoGainControl: true,  // Normalize volume for different microphones
    // The legacy goog* constraints (googEchoCancellation, googNoiseSuppression,
    // googHighpassFilter) are non-standard, ignored by modern Chrome, and rejected
    // by TypeScript's MediaTrackConstraints type, so they are omitted here.
  },
});
Sample Rate Alignment
Azure Voice Live's internal rate is 24kHz. If your browser captures at 48kHz (the default on most hardware), the stream is automatically downsampled, which can introduce artifacts:
// Force 24kHz capture to avoid resampling
const ctx = new AudioContext({ sampleRate: 24000 });
Compatibility note: Safari supports 24kHz capture only on macOS 12+. Always add a fallback:
let sampleRate = 24000;
try {
  const testCtx = new AudioContext({ sampleRate: 24000 });
  sampleRate = testCtx.sampleRate; // Browser may override to 48000
  testCtx.close();
} catch {
  sampleRate = 48000; // Fallback — Azure handles resampling
}
Voice Selection for Interviews
Azure Voice Live exposes 8 neural voice options via GPT-4o Realtime. Choose based on persona:
| Voice | Gender | Tone | Best For |
|---|---|---|---|
| alloy | Neutral | Professional, warm | General interviewer |
| echo | Male | Authoritative | Senior technical roles |
| shimmer | Female | Friendly, clear | Early-career, tech-focused |
| ash | Male | Conversational | Startup culture |
| ballad | Male | Articulate | Consulting, finance |
| coral | Female | Energetic | Product, design roles |
| sage | Neutral | Calm, measured | Executive interviews |
| verse | Male | Expressive | Creative, marketing roles |
Implementation: Voice Selection UI
// Allow users to choose their interviewer
const VOICES = [
  { id: 'echo', name: 'Marcus', role: 'Senior Engineer' },
  { id: 'shimmer', name: 'Aisha', role: 'Principal Engineer' },
  { id: 'ballad', name: 'James', role: 'Engineering Director' },
  { id: 'sage', name: 'Morgan', role: 'VP of Engineering' },
] as const;
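Wiring the selection into the session is a single session.update field. A sketch, assuming the GPT-4o Realtime convention that the voice cannot change once the model has produced audio, so it must be set before the first response:

// Hypothetical helper: apply the chosen persona before the interview starts
function applyVoice(ws: WebSocket, voiceId: (typeof VOICES)[number]['id']) {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { voice: voiceId },
  }));
}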
Handling Interruptions (Barge-In)
One of the most powerful features of Voice Live is native barge-in — the user can interrupt the AI mid-speech, just like in a real conversation.
How It Works
When Azure’s VAD detects user speech while the AI is speaking:
- Azure immediately stops sending audio output
- Azure sends `response.audio.done` to signal the truncation
- Azure starts processing the user's new utterance
Client-Side: Clear the Audio Buffer on Interruption
When barge-in occurs, you must clear any buffered AI audio the browser is still playing:
// In useVoiceLive.ts — handleServerMessage
case 'input_audio_buffer.speech_started':
  // User started speaking — clear any AI audio still in the playback buffer
  clearPlaybackBuffer();
  updateStatus('speaking');
  break;

// Add clearPlaybackBuffer to the hook
const clearPlaybackBuffer = useCallback(() => {
  if (workletNodeRef.current) {
    // Send an empty array to flush the buffer
    workletNodeRef.current.port.postMessage(new Float32Array(0));
  }
}, []);
Critical: Without this, the old AI audio will continue playing for 100–500ms after the user starts speaking, creating a jarring overlap.
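On the worklet side, the processor has to treat that empty message as a flush. A minimal sketch, assuming the processor keeps a simple FIFO of Float32Array chunks — the processor name and queue layout are illustrative, not this series' actual worklet:

// playback-processor.js — empty message = flush, anything else = enqueue
// (illustrative sketch, assumes audio arrives in 128-frame chunks)
class PlaybackProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.queue = [];
    this.port.onmessage = (e) => {
      if (e.data.length === 0) {
        this.queue = []; // barge-in: drop all buffered AI audio
      } else {
        this.queue.push(e.data);
      }
    };
  }

  process(inputs, outputs) {
    const out = outputs[0][0]; // one 128-frame render quantum
    const chunk = this.queue.shift();
    if (chunk) out.set(chunk.subarray(0, out.length));
    return true; // keep the processor alive
  }
}

registerProcessor('playback-processor', PlaybackProcessor);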
Silence and Pause Handling
Long pauses are natural in interviews — candidates think. The VAD must not trigger prematurely.
Extended Thinking Pause Detection
Add a visual indicator when the user has been silent for a long time, so they know the AI is waiting:
let silenceTimer: ReturnType<typeof setTimeout>; // browser-safe timer type

case 'input_audio_buffer.speech_stopped':
  updateStatus('thinking');
  silenceTimer = setTimeout(() => {
    setShowThinkingHint(true);
  }, 8000); // Show hint after 8s of silence
  break;

case 'input_audio_buffer.speech_started':
  clearTimeout(silenceTimer);
  setShowThinkingHint(false);
  break;
Display a gentle prompt: “Take your time — the interviewer is listening.”
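In the component, that's a conditional render off the showThinkingHint flag (a sketch; the class name is illustrative):

{showThinkingHint && (
  <p className="thinking-hint">Take your time — the interviewer is listening.</p>
)}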
Transcript Quality
Enable input_audio_transcription to receive text versions of what the user said — important for:
- Displaying the conversation as text alongside audio
- Post-interview review and feedback
- Detecting when Azure misheard the user
session: {
  input_audio_transcription: {
    model: 'whisper-1',
  },
},
Listen for transcription events:
case 'conversation.item.input_audio_transcription.completed': {
  const transcribedText = msg.transcript;
  onTranscript?.(transcribedText, 'user');
  break;
}
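It's also worth handling the failure counterpart so a misheard answer doesn't vanish silently. Assuming the Realtime-style conversation.item.input_audio_transcription.failed event is surfaced the same way:

case 'conversation.item.input_audio_transcription.failed':
  // Whisper couldn't transcribe this turn — flag it for post-interview review
  console.warn('Transcription failed:', msg.error);
  break;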
Dynamic Instructions Mid-Session
You can update the AI’s behavior during the session without disconnecting. Useful for:
- Escalating difficulty after the first question
- Switching interview mode (technical → behavioral)
- Injecting context (e.g., candidate’s resume)
function sendInstructions(ws: WebSocket, additionalContext: string) {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      instructions: `${systemPrompt}\n\n${additionalContext}`,
    },
  }));
}

// Example: after first answer, inject role-specific context
sendInstructions(ws, `
  The candidate mentioned they have 5 years of React experience.
  Focus next questions on advanced React patterns, state management, and performance optimization.
`);
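Note that session.update replaces the instructions string wholesale rather than appending to it, which is why sendInstructions always resends the full systemPrompt with the new context concatenated.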
Audio Quality Checklist
- Sample rate set to `24000` in both `AudioContext` and `getUserMedia`
- `echoCancellation: true` in microphone constraints
- `noiseSuppression: true` in microphone constraints
- Playback buffer cleared on `speech_started` (barge-in support)
- `input_audio_transcription` enabled for Whisper transcripts
- Voice persona selected based on interview type
- Silence indicator implemented for long thinking pauses
- Dynamic `session.update` instructions implemented for adaptive interviews
Next: Part 6 — Debugging & Common Issues →
← Part 4 — Minimum Latency | This is Part 5 of the Azure Voice Live series.