A voice interview system with 150ms latency but choppy audio, robotic speech, or broken interruption handling is worse than useless. Candidates will lose confidence in the product immediately. Audio quality is the second critical dimension — and it’s more configurable than most developers realize.


Audio Format Selection

Azure Voice Live supports three output formats. Choosing correctly has a significant impact on quality and bandwidth:

Format      Bitrate        Latency     Quality     Use Case
pcm16       768 kbps       Lowest ✅    Excellent   LAN / good network
g711_ulaw   64 kbps        Low         Good        Limited bandwidth
opus        16–128 kbps    Medium      Excellent   Mobile / variable network

For an Interview System: Use pcm16

When your users are on a stable broadband connection (office or home), PCM16 provides the shortest processing pipeline and highest fidelity. There is zero codec encode/decode overhead.

// In session.update config
session: {
  input_audio_format: 'pcm16',
  output_audio_format: 'pcm16',
  // ...
}
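
On the wire, pcm16 means raw 16-bit little-endian samples, base64-encoded inside input_audio_buffer.append messages. A minimal sketch of the Float32 → Int16 → base64 conversion your capture path needs — the helper names floatTo16BitPCM and buildAppendMessage are illustrative, not SDK APIs:

```typescript
// Convert captured Float32 samples (range -1..1) to 16-bit PCM,
// then base64-encode for an input_audio_buffer.append message.
function floatTo16BitPCM(float32: Float32Array): Int16Array {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

function buildAppendMessage(float32: Float32Array): string {
  const pcm = floatTo16BitPCM(float32);
  // Node/proxy side shown; in the browser use
  // btoa(String.fromCharCode(...new Uint8Array(pcm.buffer))) instead
  const base64 = Buffer.from(pcm.buffer).toString('base64');
  return JSON.stringify({ type: 'input_audio_buffer.append', audio: base64 });
}
```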

For Mobile Users: Use Opus

If you need to support mobile users or expect variable network quality:

session: {
  input_audio_format: 'pcm16',   // Input stays as PCM (browser capture)
  output_audio_format: 'pcm16',  // Azure → your server as PCM
  // Then transcode to Opus in your proxy before sending to browser
  // (requires additional FFmpeg or libopus integration)
}

Note: Azure Voice Live does not natively expose Opus output as of early 2026. Transcode in your proxy if needed.
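
If you do transcode in the proxy, ffmpeg's libopus encoder is the usual route. A sketch of the argument list for a stdin-to-stdout PCM → Opus pipe, assuming ffmpeg is on PATH — buildOpusTranscodeArgs is an illustrative helper, not part of any SDK:

```typescript
// Proxy-side transcode step: raw 24kHz mono PCM in, Ogg/Opus out.
function buildOpusTranscodeArgs(bitrateKbps = 32): string[] {
  return [
    '-f', 's16le',        // raw 16-bit little-endian PCM from Azure
    '-ar', '24000',       // Voice Live output sample rate
    '-ac', '1',           // mono
    '-i', 'pipe:0',       // read PCM from stdin
    '-c:a', 'libopus',
    '-b:a', `${bitrateKbps}k`,
    '-f', 'ogg',          // Ogg container for browser playback
    'pipe:1',             // write Opus to stdout
  ];
}
```

Usage: spawn('ffmpeg', buildOpusTranscodeArgs()), pipe the Azure PCM stream into stdin, and forward stdout frames to the browser over your existing WebSocket.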


Browser Audio Capture Quality

The quality of captured microphone audio directly affects how well Azure’s VAD and transcription work:

// Optimal constraints for interview audio capture
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    channelCount: 1,        // Mono — Voice Live requires mono
    sampleRate: 24000,      // Match Azure's native 24kHz rate
    echoCancellation: true, // CRITICAL — prevent AI voice feeding back into mic
    noiseSuppression: true, // Filter background noise
    autoGainControl: true,  // Normalize volume for different microphones
    
    // Legacy "goog*" constraints (pre-standard Chrome) are deprecated
    // and ignored by modern browsers — the standard constraints above
    // already cover them. Kept commented out for reference:
    // googEchoCancellation: true,
    // googNoiseSuppression: true,
    // googHighpassFilter: true,   // remove low-frequency rumble
  },
});

Sample Rate Alignment

Azure Voice Live's internal rate is 24kHz. If your browser captures at 48kHz (the default on most hardware), the audio is automatically downsampled, which can introduce artifacts:

// Force 24kHz capture to avoid resampling
const ctx = new AudioContext({ sampleRate: 24000 });

Compatibility note: Safari supports 24kHz capture only on macOS 12+. Always add a fallback:

let sampleRate = 24000;
try {
  const testCtx = new AudioContext({ sampleRate: 24000 });
  sampleRate = testCtx.sampleRate; // Browser may override to 48000
  testCtx.close();
} catch {
  sampleRate = 48000; // Fallback — Azure handles resampling
}
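
If you land on the 48kHz fallback but would rather not rely on server-side resampling, you can downsample client-side before sending. A minimal linear-interpolation sketch — fine for voice, though a production resampler would add a low-pass filter to avoid aliasing:

```typescript
// Downsample e.g. 48kHz capture to Azure's 24kHz via linear interpolation.
function downsampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Interpolate between the two nearest source samples
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```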

Voice Selection for Interviews

Azure Voice Live exposes 8 neural voice options via GPT-4o Realtime. Choose based on persona:

Voice     Gender    Tone                 Best For
alloy     Neutral   Professional, warm   General interviewer
echo      Male      Authoritative        Senior technical roles
shimmer   Female    Friendly, clear      Early-career, tech-focused
ash       Male      Conversational       Startup culture
ballad    Male      Articulate           Consulting, finance
coral     Female    Energetic            Product, design roles
sage      Neutral   Calm, measured       Executive interviews
verse     Male      Expressive           Creative, marketing roles

Implementation: Voice Selection UI

// Allow users to choose their interviewer
const VOICES = [
  { id: 'echo',    name: 'Marcus',   role: 'Senior Engineer' },
  { id: 'shimmer', name: 'Aisha',    role: 'Principal Engineer' },
  { id: 'ballad',  name: 'James',    role: 'Engineering Director' },
  { id: 'sage',    name: 'Morgan',   role: 'VP of Engineering' },
] as const;
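
To apply the selection, send the voice in a session.update. In the Realtime protocol the voice generally can't be changed once the model has produced audio, so send this right after the session is created. A sketch with an illustrative buildVoiceUpdate helper:

```typescript
// Apply the chosen persona before the conversation starts.
type VoiceId = 'echo' | 'shimmer' | 'ballad' | 'sage';

function buildVoiceUpdate(voiceId: VoiceId): string {
  return JSON.stringify({
    type: 'session.update',
    session: { voice: voiceId },
  });
}

// Usage: ws.send(buildVoiceUpdate('echo'));
```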

Handling Interruptions (Barge-In)

One of the most powerful features of Voice Live is native barge-in — the user can interrupt the AI mid-speech, just like in a real conversation.

How It Works

When Azure’s VAD detects user speech while the AI is speaking:

  1. Azure immediately stops sending audio output
  2. Azure sends response.audio.done to signal the truncation
  3. Azure starts processing the user’s new utterance

Client-Side: Clear the Audio Buffer on Interruption

When barge-in occurs, you must clear any buffered AI audio the browser is still playing:

// In useVoiceLive.ts — handleServerMessage

case 'input_audio_buffer.speech_started':
  // User started speaking — clear any AI audio still in the playback buffer
  clearPlaybackBuffer();
  updateStatus('speaking');
  break;
// Add clearPlaybackBuffer to the hook
const clearPlaybackBuffer = useCallback(() => {
  if (workletNodeRef.current) {
    // Send an empty array to flush the buffer
    workletNodeRef.current.port.postMessage(new Float32Array(0));
  }
}, []);

Critical: Without this, the old AI audio will continue playing for 100–500ms after the user starts speaking, creating a jarring overlap.


Silence and Pause Handling

Long pauses are natural in interviews — candidates think. The VAD must not trigger prematurely.

Extended Thinking Pause Detection

Add a visual indicator when the user has been silent for a long time, so they know the AI is waiting:

let silenceTimer: ReturnType<typeof setTimeout>; // browser-safe timer type

case 'input_audio_buffer.speech_stopped':
  updateStatus('thinking');
  silenceTimer = setTimeout(() => {
    setShowThinkingHint(true);
  }, 8000); // Show hint after 8s silence
  break;

case 'input_audio_buffer.speech_started':
  clearTimeout(silenceTimer);
  setShowThinkingHint(false);
  break;

Display a gentle prompt: “Take your time — the interviewer is listening.”


Transcript Quality

Enable input_audio_transcription to receive text versions of what the user said — important for:

  • Displaying the conversation as text alongside audio
  • Post-interview review and feedback
  • Detecting when Azure misheard the user

session: {
  input_audio_transcription: {
    model: 'whisper-1',
  },
},

Listen for transcription events:

case 'conversation.item.input_audio_transcription.completed': {
  // Braces create a block scope for the const inside the switch case
  const transcribedText = msg.transcript;
  onTranscript?.(transcribedText, 'user');
  break;
}

Dynamic Instructions Mid-Session

You can update the AI’s behavior during the session without disconnecting. Useful for:

  • Escalating difficulty after the first question
  • Switching interview mode (technical → behavioral)
  • Injecting context (e.g., candidate’s resume)

function sendInstructions(ws: WebSocket, additionalContext: string) {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      instructions: `${systemPrompt}\n\n${additionalContext}`,
    },
  }));
}

// Example: after first answer, inject role-specific context
sendInstructions(ws, `
The candidate mentioned they have 5 years of React experience.
Focus next questions on advanced React patterns, state management, and performance optimization.
`);

Audio Quality Checklist

  • Sample rate set to 24000 in both AudioContext and getUserMedia
  • echoCancellation: true in microphone constraints
  • noiseSuppression: true in microphone constraints
  • Playback buffer cleared on speech_started (barge-in support)
  • input_audio_transcription enabled for Whisper transcripts
  • Voice persona selected based on interview type
  • Silence indicator implemented for long thinking pauses
  • Dynamic session.update instructions implemented for adaptive interviews

Next: Part 6 — Debugging & Common Issues →

Part 4 — Minimum Latency | This is Part 5 of the Azure Voice Live series.
