Testing real-time audio AI is one of the most frustrating challenges in this space. You can’t simply call a function and expect(result).toBe(...). There’s no deterministic output — the AI may say the same thing in different words each time. The timing, audio quality, and conversational flow all matter. But there’s a lot you can test, and some powerful tools for the parts you can’t.

The second half of this post is equally important: what do you do after the interview? A voice interview is only valuable if candidates receive actionable feedback. We’ll build a transcript analysis pipeline that scores responses and generates a structured improvement report.


Part 1 — Testing Azure Voice Live

What Makes Voice AI Testing Hard

  1. Non-determinism — GPT generates different responses each session. Classic assert-equal tests fail.
  2. Audio I/O dependency — Real tests require a microphone input and speaker output.
  3. Real-time timing — Latency issues only appear under load. Unit tests won’t catch them.
  4. WebSocket state machine — Sessions have complex state (connecting → ready → listening → speaking → done).

The Testing Pyramid for Voice Systems

flowchart TD
    A["🔺 E2E Tests<br/>(few, slow, real audio)"] --> B["🔷 Integration Tests<br/>(WebSocket protocol, session state)"]
    B --> C["🟢 Unit Tests<br/>(audio processing, transcript parsing)"]
    C --> D["📊 Eval Tests<br/>(LLM-as-judge, response quality)"]

    style A fill:#ef4444,color:#fff
    style B fill:#f97316,color:#fff
    style C fill:#22c55e,color:#fff
    style D fill:#3b82f6,color:#fff

Layer 1: Unit Tests — Audio Processing Logic

The deterministic parts of the system are fully testable:

// tests/audio.test.ts
import { describe, it, expect } from 'vitest';
import { float32ToInt16, int16ToFloat32, calculateRMS } from '../utils/audio';

describe('Audio conversion utils', () => {
  it('converts float32 to int16 without clipping', () => {
    const input = new Float32Array([0, 0.5, 1.0, -1.0, -0.5]);
    const result = float32ToInt16(input);
    expect(result[0]).toBe(0);
    expect(result[1]).toBe(16384);    // 0.5 * 32768
    expect(result[2]).toBe(32767);    // clamped max
    expect(result[3]).toBe(-32768);   // clamped min
  });

  it('calculates RMS correctly for silence', () => {
    const silence = new Float32Array(1024).fill(0);
    expect(calculateRMS(silence)).toBe(0);
  });

  it('identifies speech vs silence by RMS threshold', () => {
    const speech = new Float32Array(1024).fill(0.3);
    const silence = new Float32Array(1024).fill(0.01);
    expect(calculateRMS(speech)).toBeGreaterThan(0.1);
    expect(calculateRMS(silence)).toBeLessThan(0.1);
  });
});

// tests/session.test.ts — Test message parsing
import { describe, it, expect } from 'vitest';
import { parseServerMessage } from '../utils/voice-live';

describe('Server message parsing', () => {
  it('parses response.audio.delta correctly', () => {
    const raw = JSON.stringify({
      type: 'response.audio.delta',
      delta: 'SGVsbG8gV29ybGQ=', // Base64
    });
    const msg = parseServerMessage(raw);
    expect(msg.type).toBe('response.audio.delta');
    expect(msg.audio).toBeInstanceOf(Uint8Array);
  });

  it('parses conversation.item.input_audio_transcription.completed', () => {
    const raw = JSON.stringify({
      type: 'conversation.item.input_audio_transcription.completed',
      transcript: 'I have 5 years of React experience.',
    });
    const msg = parseServerMessage(raw);
    expect(msg.type).toContain('transcription.completed');
    expect(msg.transcript).toBe('I have 5 years of React experience.');
  });
});
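These tests assume a parseServerMessage helper in utils/voice-live. For reference, a minimal sketch that would satisfy them (the real helper presumably handles more event types):

```typescript
// Hypothetical minimal parseServerMessage matching the tests above.
interface ServerMessage {
  type: string;
  audio?: Uint8Array;    // decoded bytes from base64 `delta` on audio events
  transcript?: string;   // present on transcription events
}

function parseServerMessage(raw: string): ServerMessage {
  const json = JSON.parse(raw);
  const msg: ServerMessage = { type: json.type };

  if (json.type === 'response.audio.delta' && typeof json.delta === 'string') {
    // Decode the base64 audio payload into raw PCM bytes
    msg.audio = Uint8Array.from(Buffer.from(json.delta, 'base64'));
  }
  if (typeof json.transcript === 'string') {
    msg.transcript = json.transcript;
  }
  return msg;
}
```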

Layer 2: Integration Tests — WebSocket Protocol

Use a mock WebSocket server to test the session state machine without real Azure:

// tests/voice-proxy.integration.test.ts
import { describe, it, expect, beforeEach, afterEach } from 'vitest';
import { WebSocketServer } from 'ws';
import { VoiceProxyClient } from '../lib/voice-proxy-client';

describe('Voice Proxy Integration', () => {
  let mockAzureServer: WebSocketServer;
  let azurePort: number;

  beforeEach(async () => {
    azurePort = 9999;
    mockAzureServer = new WebSocketServer({ port: azurePort });
    
    mockAzureServer.on('connection', (ws) => {
      // Simulate Azure session creation response
      ws.on('message', (data) => {
        const msg = JSON.parse(data.toString());
        
        if (msg.type === 'session.update') {
          // Azure acknowledges session configuration
          ws.send(JSON.stringify({
            type: 'session.updated',
            session: { id: 'mock-session-123', ...msg.session },
          }));
        }
        
        if (msg.type === 'input_audio_buffer.append') {
          // Simulate VAD detecting speech end and generating a response
          setTimeout(() => {
            ws.send(JSON.stringify({ type: 'input_audio_buffer.speech_stopped' }));
            ws.send(JSON.stringify({ type: 'response.audio.delta', delta: 'AAAA' }));
            ws.send(JSON.stringify({ type: 'response.audio.done' }));
          }, 50);
        }
      });
    });
  });

  afterEach(() => {
    mockAzureServer.close();
  });

  it('completes a full send-receive cycle', async () => {
    const client = new VoiceProxyClient(`ws://localhost:${azurePort}`);
    
    const messages: string[] = [];
    client.onMessage((msg) => messages.push(msg.type));
    
    await client.connect();
    await client.sendAudio(new Uint8Array(1024)); // Fake audio frame
    
    // Wait for mock Azure to respond
    await new Promise(r => setTimeout(r, 100));
    
    expect(messages).toContain('input_audio_buffer.speech_stopped');
    expect(messages).toContain('response.audio.done');
    
    client.disconnect();
  });

  it('reconnects after unexpected close', async () => {
    const client = new VoiceProxyClient(`ws://localhost:${azurePort}`, {
      maxReconnectAttempts: 2,
      reconnectDelayMs: 50,
    });
    
    await client.connect();
    
    // Force close the mock server connection
    mockAzureServer.clients.forEach(ws => ws.terminate());
    
    // Wait for reconnect
    await new Promise(r => setTimeout(r, 200));
    
    expect(client.isConnected()).toBe(true); // Should have reconnected
  });
});

Layer 3: Latency Benchmarks

Use vitest's bench mode, built on the Node.js performance API, to measure what matters:

// tests/benchmarks/latency.bench.ts
import { bench, describe } from 'vitest';
import { AudioEncoder } from '../lib/audio-encoder';

describe('Audio encoding latency', () => {
  const samples = new Float32Array(4096).fill(0.3); // ~170ms of fake audio at 24 kHz

  bench('float32 → int16 conversion (4096 samples)', () => {
    // Should complete in < 1ms to not affect real-time pipeline
    const encoder = new AudioEncoder();
    encoder.encode(samples);
  });

  bench('base64 decode for audio delta', () => {
    const base64 = btoa(String.fromCharCode(...new Uint8Array(8192)));
    Buffer.from(base64, 'base64');
  });
});

Run with: vitest bench
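For one-off budget checks inside ordinary tests, you can also time a stage directly with performance.now() rather than vitest bench. A sketch; the inlined conversion and iteration count are illustrative:

```typescript
import { performance } from 'node:perf_hooks';

// Average the cost of a synchronous pipeline stage over many iterations.
function measureMs(fn: () => void, iterations = 1000): number {
  // Warm up so JIT compilation doesn't skew the first samples
  for (let i = 0; i < 10; i++) fn();

  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  return (performance.now() - start) / iterations; // avg ms per call
}

const samples = new Float32Array(1024).fill(0.3);
const avg = measureMs(() => {
  // Inline float32 -> int16 conversion, the hot path from Layer 1
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 32768 : s * 32767;
  }
});

console.log(`avg conversion time: ${avg.toFixed(4)} ms`);
```

A regular test can then assert `avg` against the budget for that stage, so a latency regression fails CI instead of surfacing as glitchy audio.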

Acceptable tolerances:

Operation                               Max Acceptable Time
Audio chunk conversion (1024 samples)   < 0.5ms
WebSocket frame serialization           < 0.1ms
Session message parsing                 < 0.2ms

Layer 4: LLM-as-Judge Evaluation

The hardest part to test is response quality — was the AI’s answer relevant, helpful, and appropriate? Use GPT-4o as an evaluator:

// tests/eval/response-quality.eval.ts
import OpenAI from 'openai';

const client = new OpenAI();

interface EvalResult {
  score: number;       // 1–5
  relevance: string;   // Was the response on-topic?
  quality: string;     // Was the response well-formed?
  pass: boolean;
}

async function evaluateInterviewerResponse(
  context: string,      // What the candidate said
  aiResponse: string,   // What the AI said
  jobRole: string
): Promise<EvalResult> {
  const result = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are an expert interviewer evaluator for ${jobRole} roles.
        Score the AI interviewer's response on a 1–5 scale.
        Return JSON: { score, relevance, quality, pass }
        pass=true if score >= 3`,
      },
      {
        role: 'user',
        content: `Candidate said: "${context}"
        AI interviewer responded: "${aiResponse}"
        
        Evaluate the AI's response quality as an interviewer.`,
      },
    ],
    response_format: { type: 'json_object' },
  });
  
  return JSON.parse(result.choices[0].message.content!) as EvalResult;
}

// Run a batch evaluation
const testCases = [
  {
    context: "I've worked with React for 3 years, building e-commerce apps.",
    expectedTopic: 'Should follow up on React depth or specific challenges',
  },
  {
    context: "I don't really have experience with TypeScript.",
    expectedTopic: 'Should explore transferable skills or learning approach',
  },
];

for (const tc of testCases) {
  const result = await evaluateInterviewerResponse(
    tc.context,
    aiResponse, // Captured from a real session
    'Senior Frontend Engineer'
  );
  
  console.log(`Score: ${result.score}/5 | Pass: ${result.pass}`);
  if (!result.pass) {
    console.warn(`FAIL: ${result.quality}`);
  }
}
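Because the judge itself is non-deterministic, gating CI on any single eval case is flaky. A more robust pattern is to assert an aggregate pass rate across the batch (the threshold here is illustrative):

```typescript
// Gate CI on the aggregate pass rate rather than individual judgments:
// one borderline eval shouldn't fail the build.
interface EvalOutcome {
  score: number;
  pass: boolean;
}

function passRate(results: EvalOutcome[]): number {
  if (results.length === 0) return 0;
  return results.filter(r => r.pass).length / results.length;
}

function assertEvalQuality(results: EvalOutcome[], minPassRate = 0.8): void {
  const rate = passRate(results);
  if (rate < minPassRate) {
    throw new Error(
      `Eval pass rate ${(rate * 100).toFixed(0)}% below threshold ${minPassRate * 100}%`
    );
  }
}
```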

Part 2 — Transcript Analysis & Interview Report

After an interview session, you have a transcript — a sequence of exchanges between the AI interviewer and the candidate. The transcript is a gold mine. From it, you can extract:

  • Communication quality — clarity, filler words, response length
  • Technical accuracy — were claims correct? Were buzzwords used without depth?
  • Soft skills signals — confidence, structure (STAR method), problem-solving approach
  • Action items — specific things to practice before the next interview
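The first bullet doesn't need an LLM at all: filler-word counts and response lengths are deterministic, so compute them in plain code and reserve the model for the judgment calls. A sketch; the filler list and metric names are illustrative:

```typescript
// Deterministic communication metrics — no LLM required for these signals.
// Naive by design: counts "like" even when used as a verb.
const FILLERS = ['um', 'uh', 'like', 'you know', 'basically', 'actually'];

interface CandidateMetrics {
  totalWords: number;
  fillerCount: number;
  fillerRatio: number;     // fillers as % of total words
  avgAnswerLength: number; // words per candidate turn
}

function analyzeCandidateTurns(turns: string[]): CandidateMetrics {
  const allText = turns.join(' ').toLowerCase();
  const words = allText.split(/\s+/).filter(Boolean);

  const fillerCount = FILLERS.reduce((n, f) => {
    // Count whole-word (or whole-phrase) occurrences only
    const re = new RegExp(`\\b${f.replace(' ', '\\s+')}\\b`, 'g');
    return n + (allText.match(re)?.length ?? 0);
  }, 0);

  return {
    totalWords: words.length,
    fillerCount,
    fillerRatio: words.length ? (fillerCount / words.length) * 100 : 0,
    avgAnswerLength: turns.length ? words.length / turns.length : 0,
  };
}
```

These numbers can be fed into the analysis prompt as hard data, which keeps the LLM's scoring grounded in measurable facts.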

Capturing the Transcript

In your useVoiceLive hook, collect all exchanges:

// In useVoiceLive.ts

interface TranscriptEntry {
  speaker: 'interviewer' | 'candidate';
  text: string;
  timestamp: number;
  durationMs?: number;
}

const transcript = useRef<TranscriptEntry[]>([]);

// When AI speaks
case 'response.text.done':
  transcript.current.push({
    speaker: 'interviewer',
    text: msg.text,
    timestamp: Date.now(),
  });
  break;

// When candidate speaks
case 'conversation.item.input_audio_transcription.completed': {
  const entry: TranscriptEntry = {
    speaker: 'candidate',
    text: msg.transcript,
    timestamp: Date.now(),
  };
  transcript.current.push(entry);
  onTranscriptUpdate?.(entry);
  break;
}

The Analysis API Route

// app/api/analyze-interview/route.ts

import OpenAI from 'openai';
import { NextRequest, NextResponse } from 'next/server';

const client = new OpenAI();

export interface InterviewReport {
  overallScore: number;            // 0–100
  summary: string;
  strengths: string[];
  weaknesses: string[];
  actionItems: ActionItem[];
  categoryScores: CategoryScores;
  redFlags: string[];
}

interface ActionItem {
  priority: 'high' | 'medium' | 'low';
  area: string;                    // 'communication' | 'technical' | 'structure'
  action: string;                  // Specific thing to practice
  resource?: string;               // Optional link/book/course
}

interface CategoryScores {
  technicalAccuracy: number;       // 0–10
  communicationClarity: number;    // 0–10
  problemSolvingApproach: number;  // 0–10
  structuredThinking: number;      // 0–10 (STAR method usage)
  confidence: number;              // 0–10
}

export async function POST(req: NextRequest) {
  const { transcript, jobRole, jobLevel } = await req.json();

  const transcriptText = transcript
    .map((e: any) => `[${e.speaker.toUpperCase()}]: ${e.text}`)
    .join('\n');

  const completion = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are an expert career coach and technical interviewer.
Analyze the interview transcript below for a ${jobLevel} ${jobRole} position.

Return a detailed JSON report with this exact structure:
{
  "overallScore": <0-100>,
  "summary": "<2-3 sentence overall assessment>",
  "strengths": ["<specific strength>", ...],
  "weaknesses": ["<specific weakness>", ...],
  "actionItems": [
    {
      "priority": "high|medium|low",
      "area": "technical|communication|structure|confidence",
      "action": "<concrete, specific action to take>",
      "resource": "<optional: specific book, course, or practice method>"
    }
  ],
  "categoryScores": {
    "technicalAccuracy": <0-10>,
    "communicationClarity": <0-10>,
    "problemSolvingApproach": <0-10>,
    "structuredThinking": <0-10>,
    "confidence": <0-10>
  },
  "redFlags": ["<critical issues that would likely fail the interview>"]
}

Be specific and actionable. "Practice communication" is NOT a valid action item.
"Record yourself explaining the event loop for 2 minutes, then watch it back" IS valid.`,
      },
      {
        role: 'user',
        content: `Interview Transcript:\n\n${transcriptText}`,
      },
    ],
    response_format: { type: 'json_object' },
    temperature: 0.3, // Lower = more consistent analysis
  });

  const report = JSON.parse(
    completion.choices[0].message.content!
  ) as InterviewReport;

  return NextResponse.json(report);
}
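One caveat with the route above: JSON.parse plus an `as InterviewReport` cast trusts the model's output shape completely, and LLMs occasionally return malformed JSON even with response_format set. A dependency-free runtime check (a schema library such as zod would be more thorough) catches bad reports before they reach the UI:

```typescript
// Sketch of a runtime guard for the model's JSON output.
// Field names mirror the InterviewReport interface above.
function isValidReport(r: any): boolean {
  const isNum = (v: any, max: number) =>
    typeof v === 'number' && v >= 0 && v <= max;
  const isStrArray = (v: any) =>
    Array.isArray(v) && v.every((s: any) => typeof s === 'string');

  return (
    r != null &&
    isNum(r.overallScore, 100) &&
    typeof r.summary === 'string' &&
    isStrArray(r.strengths) &&
    isStrArray(r.weaknesses) &&
    isStrArray(r.redFlags) &&
    Array.isArray(r.actionItems) &&
    r.actionItems.every(
      (a: any) =>
        ['high', 'medium', 'low'].includes(a.priority) &&
        typeof a.action === 'string'
    ) &&
    r.categoryScores != null &&
    Object.values(r.categoryScores).every(v => isNum(v, 10))
  );
}
```

In the route, a failed check can return a 502 or trigger one retry of the completion call instead of rendering a broken report.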

The Report UI Component

// components/InterviewReport.tsx

export function InterviewReport({ report }: { report: InterviewReport }) {
  const scoreColor = report.overallScore >= 70 ? 'text-green-500' 
    : report.overallScore >= 50 ? 'text-yellow-500' 
    : 'text-red-500';

  return (
    <div className="max-w-3xl mx-auto space-y-6 p-6">
      {/* Overall Score */}
      <div className="bg-card border rounded-2xl p-6 text-center">
        <div className={`text-6xl font-bold ${scoreColor}`}>
          {report.overallScore}
          <span className="text-2xl text-muted">/100</span>
        </div>
        <p className="mt-2 text-muted-foreground">{report.summary}</p>
      </div>

      {/* Category Radar */}
      <div className="grid grid-cols-5 gap-3">
        {Object.entries(report.categoryScores).map(([key, score]) => (
          <div key={key} className="bg-card border rounded-xl p-3 text-center">
            <div className="text-2xl font-bold">{score}</div>
            <div className="text-xs text-muted-foreground mt-1">
              {key.replace(/([A-Z])/g, ' $1').trim()}
            </div>
          </div>
        ))}
      </div>

      {/* Strengths & Weaknesses */}
      <div className="grid grid-cols-2 gap-4">
        <div className="bg-emerald-500/10 border border-emerald-500/20 rounded-xl p-4">
          <h3 className="font-semibold text-emerald-400 mb-2">✓ Strengths</h3>
          <ul className="space-y-1">
            {report.strengths.map((s, i) => (
              <li key={i} className="text-sm">{s}</li>
            ))}
          </ul>
        </div>
        <div className="bg-red-500/10 border border-red-500/20 rounded-xl p-4">
          <h3 className="font-semibold text-red-400 mb-2">✗ Areas to Improve</h3>
          <ul className="space-y-1">
            {report.weaknesses.map((w, i) => (
              <li key={i} className="text-sm">{w}</li>
            ))}
          </ul>
        </div>
      </div>

      {/* Red Flags */}
      {report.redFlags.length > 0 && (
        <div className="bg-red-500/10 border border-red-500/30 rounded-xl p-4">
          <h3 className="font-semibold text-red-400 mb-2">
            ⚠️ Critical Issues (Would likely fail this interview)
          </h3>
          <ul className="space-y-1">
            {report.redFlags.map((flag, i) => (
              <li key={i} className="text-sm font-medium text-red-300">{flag}</li>
            ))}
          </ul>
        </div>
      )}

      {/* Action Items */}
      <div>
        <h3 className="font-semibold text-lg mb-3">📋 Action Plan</h3>
        <div className="space-y-3">
          {report.actionItems
            .sort((a, b) => {
              const order = { high: 0, medium: 1, low: 2 };
              return order[a.priority] - order[b.priority];
            })
            .map((item, i) => (
              <div
                key={i}
                className="flex gap-3 bg-card border rounded-xl p-4"
              >
                <span className={`
                  shrink-0 text-xs font-bold px-2 py-0.5 rounded-full h-fit mt-0.5
                  ${item.priority === 'high' ? 'bg-red-500/20 text-red-400' :
                    item.priority === 'medium' ? 'bg-yellow-500/20 text-yellow-400' :
                    'bg-blue-500/20 text-blue-400'}
                `}>
                  {item.priority.toUpperCase()}
                </span>
                <div>
                  <div className="text-xs text-muted-foreground uppercase tracking-wide mb-1">
                    {item.area}
                  </div>
                  <p className="text-sm font-medium">{item.action}</p>
                  {item.resource && (
                    <p className="text-xs text-muted-foreground mt-1">
                      📖 {item.resource}
                    </p>
                  )}
                </div>
              </div>
            ))}
        </div>
      </div>
    </div>
  );
}

Triggering Analysis After Session Ends

// In InterviewSession.tsx

const [report, setReport] = useState<InterviewReport | null>(null);
const [isAnalyzing, setIsAnalyzing] = useState(false);

async function handleSessionEnd() {
  disconnect();
  setIsAnalyzing(true);
  
  try {
    const res = await fetch('/api/analyze-interview', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        transcript: transcript.current,
        jobRole: 'Senior Frontend Engineer',
        jobLevel: 'Senior',
      }),
    });
    
    const data: InterviewReport = await res.json();
    setReport(data);
  } catch (err) {
    console.error('Analysis failed:', err);
  } finally {
    setIsAnalyzing(false);
  }
}

Cost of Transcript Analysis

A typical 30-minute interview generates ~3,000–5,000 words of transcript. GPT-4o analysis cost:

Session Length   Tokens            Cost per Analysis
15 minutes       ~5,000 tokens     ~$0.025
30 minutes       ~9,000 tokens     ~$0.045
60 minutes       ~18,000 tokens    ~$0.09

Essentially free compared to the ~$0.34 cost of the voice session itself.
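The table is simple arithmetic: token count times a blended per-token rate. The roughly $5 per million tokens implied by the figures above is an assumption inferred from the table; check current GPT-4o pricing for your deployment:

```typescript
// Reproduce the table's estimates: cost = tokens * blended rate.
// The ~$5 / 1M-token blended rate is an assumption, not official pricing.
const BLENDED_RATE_PER_TOKEN = 5 / 1_000_000;

function analysisCost(sessionMinutes: number): number {
  const tokensPerMinute = 300; // ~9,000 tokens for a 30-minute session
  return sessionMinutes * tokensPerMinute * BLENDED_RATE_PER_TOKEN;
}

console.log(analysisCost(30).toFixed(3)); // prints "0.045"
```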


Summary

Capability                      What to Use
Audio processing unit tests     Vitest + deterministic conversions
WebSocket protocol tests        Mock WS server + ws package
Latency benchmarks              vitest bench + performance.now()
Response quality evaluation     GPT-4o as judge
Transcript generation           input_audio_transcription (Whisper)
Interview scoring               GPT-4o + structured JSON output
Report UI                       React component with action items

This is the Bonus Part 8 of the Azure Voice Live series, following Part 7 — Deploy, Scale & Pricing.
