A voice interview system with 150ms latency but choppy audio, robotic speech, or broken interruption handling is worse than useless. Candidates will lose confidence in the product immediately. Audio quality is the second critical dimension — and it’s more configurable than most developers realize.


Audio Format Selection

Azure Voice Live supports three output formats. Choosing correctly has a significant impact on quality and bandwidth:

Format      Bitrate        Latency     Quality     Use Case
pcm16       768 kbps       Lowest ✅    Excellent   LAN / good network
g711_ulaw   64 kbps        Low         Good        Limited bandwidth
opus        16–128 kbps    Medium      Excellent   Mobile / variable network

For an Interview System: Use pcm16

When your users are on a stable broadband connection (office or home), PCM16 provides the shortest processing pipeline and highest fidelity. There is zero codec encode/decode overhead.

// In session.update config
session: {
  input_audio_format: 'pcm16',
  output_audio_format: 'pcm16',
  // ...
}
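
On the wire, pcm16 means raw 16-bit little-endian samples, base64-encoded inside input_audio_buffer.append messages. A minimal sketch of the Float32 → Int16 → base64 conversion your capture path needs — the helper names floatTo16BitPCM and buildAppendMessage are illustrative, not SDK APIs:

```typescript
// Convert captured Float32 samples (range -1..1) to 16-bit PCM,
// then base64-encode for an input_audio_buffer.append message.
function floatTo16BitPCM(float32: Float32Array): Int16Array {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

function buildAppendMessage(float32: Float32Array): string {
  const pcm = floatTo16BitPCM(float32);
  // Node/proxy side shown; in the browser use
  // btoa(String.fromCharCode(...new Uint8Array(pcm.buffer))) instead
  const base64 = Buffer.from(pcm.buffer).toString('base64');
  return JSON.stringify({ type: 'input_audio_buffer.append', audio: base64 });
}
```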

For Mobile Users: Use Opus

If you need to support mobile users or expect variable network quality:

session: {
  input_audio_format: 'pcm16',   // Input stays as PCM (browser capture)
  output_audio_format: 'pcm16',  // Azure → your server as PCM
  // Then transcode to Opus in your proxy before sending to browser
  // (requires additional FFmpeg or libopus integration)
}

Note: Azure Voice Live does not natively expose Opus output as of early 2026. Transcode in your proxy if needed.
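
If you do transcode in the proxy, ffmpeg's libopus encoder is the usual route. A sketch of the argument list for a stdin-to-stdout PCM → Opus pipe, assuming ffmpeg is on PATH — buildOpusTranscodeArgs is an illustrative helper, not part of any SDK:

```typescript
// Proxy-side transcode step: raw 24kHz mono PCM in, Ogg/Opus out.
function buildOpusTranscodeArgs(bitrateKbps = 32): string[] {
  return [
    '-f', 's16le',        // raw 16-bit little-endian PCM from Azure
    '-ar', '24000',       // Voice Live output sample rate
    '-ac', '1',           // mono
    '-i', 'pipe:0',       // read PCM from stdin
    '-c:a', 'libopus',
    '-b:a', `${bitrateKbps}k`,
    '-f', 'ogg',          // Ogg container for browser playback
    'pipe:1',             // write Opus to stdout
  ];
}
```

Usage: spawn('ffmpeg', buildOpusTranscodeArgs()), pipe the Azure PCM stream into stdin, and forward stdout frames to the browser over your existing WebSocket.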


Browser Audio Capture Quality

The quality of captured microphone audio directly affects how well Azure’s VAD and transcription work:

// Optimal constraints for interview audio capture
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    channelCount: 1,        // Mono — Voice Live requires mono
    sampleRate: 24000,      // Match Azure's native 24kHz rate
    echoCancellation: true, // CRITICAL — prevent AI voice feeding back into mic
    noiseSuppression: true, // Filter background noise
    autoGainControl: true,  // Normalize volume for different microphones
    
    // Legacy "goog*" constraints (pre-standard Chrome) are deprecated
    // and ignored by modern browsers — the standard constraints above
    // already cover them. Kept commented out for reference:
    // googEchoCancellation: true,
    // googNoiseSuppression: true,
    // googHighpassFilter: true,   // remove low-frequency rumble
  },
});

Sample Rate Alignment

Azure Voice Live's internal rate is 24kHz. If your browser captures at 48kHz (the default on most hardware), the audio is automatically downsampled, which can introduce artifacts:

// Force 24kHz capture to avoid resampling
const ctx = new AudioContext({ sampleRate: 24000 });

Compatibility note: Safari supports 24kHz capture only on macOS 12+. Always add a fallback:

let sampleRate = 24000;
try {
  const testCtx = new AudioContext({ sampleRate: 24000 });
  sampleRate = testCtx.sampleRate; // Browser may override to 48000
  testCtx.close();
} catch {
  sampleRate = 48000; // Fallback — Azure handles resampling
}
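
If you land on the 48kHz fallback but would rather not rely on server-side resampling, you can downsample client-side before sending. A minimal linear-interpolation sketch — fine for voice, though a production resampler would add a low-pass filter to avoid aliasing:

```typescript
// Downsample e.g. 48kHz capture to Azure's 24kHz via linear interpolation.
function downsampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Interpolate between the two nearest source samples
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```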

Voice Selection for Interviews

Azure Voice Live exposes 8 neural voice options via GPT-4o Realtime. Choose based on persona:

Voice     Gender    Tone                 Best For
alloy     Neutral   Professional, warm   General interviewer
echo      Male      Authoritative        Senior technical roles
shimmer   Female    Friendly, clear      Early-career, tech-focused
ash       Male      Conversational       Startup culture
ballad    Male      Articulate           Consulting, finance
coral     Female    Energetic            Product, design roles
sage      Neutral   Calm, measured       Executive interviews
verse     Male      Expressive           Creative, marketing roles

Implementation: Voice Selection UI

// Allow users to choose their interviewer
const VOICES = [
  { id: 'echo',    name: 'Marcus',   role: 'Senior Engineer' },
  { id: 'shimmer', name: 'Aisha',    role: 'Principal Engineer' },
  { id: 'ballad',  name: 'James',    role: 'Engineering Director' },
  { id: 'sage',    name: 'Morgan',   role: 'VP of Engineering' },
] as const;
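
To apply the selection, send the voice in a session.update. In the Realtime protocol the voice generally can't be changed once the model has produced audio, so send this right after the session is created. A sketch with an illustrative buildVoiceUpdate helper:

```typescript
// Apply the chosen persona before the conversation starts.
type VoiceId = 'echo' | 'shimmer' | 'ballad' | 'sage';

function buildVoiceUpdate(voiceId: VoiceId): string {
  return JSON.stringify({
    type: 'session.update',
    session: { voice: voiceId },
  });
}

// Usage: ws.send(buildVoiceUpdate('echo'));
```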

Handling Interruptions (Barge-In)

One of the most powerful features of Voice Live is native barge-in — the user can interrupt the AI mid-speech, just like in a real conversation.

How It Works

When Azure’s VAD detects user speech while the AI is speaking:

  1. Azure immediately stops sending audio output
  2. Azure sends response.audio.done to signal the truncation
  3. Azure starts processing the user’s new utterance

Client-Side: Clear the Audio Buffer on Interruption

When barge-in occurs, you must clear any buffered AI audio the browser is still playing:

// In useVoiceLive.ts — handleServerMessage

case 'input_audio_buffer.speech_started':
  // User started speaking — clear any AI audio still in the playback buffer
  clearPlaybackBuffer();
  updateStatus('speaking');
  break;
// Add clearPlaybackBuffer to the hook
const clearPlaybackBuffer = useCallback(() => {
  if (workletNodeRef.current) {
    // Send an empty array to flush the buffer
    workletNodeRef.current.port.postMessage(new Float32Array(0));
  }
}, []);

Critical: Without this, the old AI audio will continue playing for 100–500ms after the user starts speaking, creating a jarring overlap.


Silence and Pause Handling

Long pauses are natural in interviews — candidates think. The VAD must not trigger prematurely.

Extended Thinking Pause Detection

Add a visual indicator when the user has been silent for a long time, so they know the AI is waiting:

let silenceTimer: ReturnType<typeof setTimeout>; // browser-safe timer type

case 'input_audio_buffer.speech_stopped':
  updateStatus('thinking');
  silenceTimer = setTimeout(() => {
    setShowThinkingHint(true);
  }, 8000); // Show hint after 8s silence
  break;

case 'input_audio_buffer.speech_started':
  clearTimeout(silenceTimer);
  setShowThinkingHint(false);
  break;

Display a gentle prompt: “Take your time — the interviewer is listening.”


Transcript Quality

Enable input_audio_transcription to receive text versions of what the user said — important for:

  • Displaying the conversation as text alongside audio
  • Post-interview review and feedback
  • Detecting when Azure misheard the user

session: {
  input_audio_transcription: {
    model: 'whisper-1',
  },
},

Listen for transcription events:

case 'conversation.item.input_audio_transcription.completed': {
  // Braces create a block scope for the const inside the switch case
  const transcribedText = msg.transcript;
  onTranscript?.(transcribedText, 'user');
  break;
}

Dynamic Instructions Mid-Session

You can update the AI’s behavior during the session without disconnecting. Useful for:

  • Escalating difficulty after the first question
  • Switching interview mode (technical → behavioral)
  • Injecting context (e.g., candidate’s resume)

function sendInstructions(ws: WebSocket, additionalContext: string) {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      instructions: `${systemPrompt}\n\n${additionalContext}`,
    },
  }));
}

// Example: after first answer, inject role-specific context
sendInstructions(ws, `
The candidate mentioned they have 5 years of React experience.
Focus next questions on advanced React patterns, state management, and performance optimization.
`);

Audio Quality Checklist

  • Sample rate set to 24000 in both AudioContext and getUserMedia
  • echoCancellation: true in microphone constraints
  • noiseSuppression: true in microphone constraints
  • Playback buffer cleared on speech_started (barge-in support)
  • input_audio_transcription enabled for Whisper transcripts
  • Voice persona selected based on interview type
  • Silence indicator implemented for long thinking pauses
  • Dynamic session.update instructions implemented for adaptive interviews

Next: Part 6 — Debugging & Common Issues →

Part 4 — Minimum Latency | This is Part 5 of the Azure Voice Live series.
