In Part 6, we scaled the platform to 200 concurrent sessions, fixed the enrichment bottleneck, built session recovery, and established the six operational metrics that keep everything visible. The platform was solid. Then a Japanese-speaking research participant joined a session, the English-only agent tried to process their speech, and what came back was a hallucinatory mix of phonetic gibberish and apologetic English. The participant disconnected after 40 seconds.
That was the moment multi-language support stopped being a “nice to have” and became a production requirement. Within three weeks we had participants joining in Japanese, Spanish, French, and Korean. Within two months, non-English sessions accounted for 35% of all research activity on the platform.
This post covers everything we built to get there: language detection and provider routing, locale-aware VAD tuning, i18n prompt packs, voice selection per language, and the cross-language analysis pipeline that makes sense of transcripts in five different languages.
Why Multi-Language Voice AI Is Different from Multi-Language Text
Text translation is, for practical purposes, solved. GPT-4o translates Japanese to English at near-native quality. DeepL handles European languages beautifully. You can build a multi-language text chatbot in an afternoon by wrapping your English logic with translation calls on both ends.
Voice is a different problem entirely.
Speech-to-speech models like OpenAI Realtime and Gemini Live handle language natively — there’s no STT-translate-TTS pipeline in the middle. The model hears Japanese audio and produces Japanese audio in a single pass. That’s the whole point of S2S: end-to-end, with the latency and naturalness benefits that cascaded pipelines can’t match.
But native language handling doesn’t mean equal quality across languages. Three things differ dramatically between languages in S2S models:
Voice quality. The naturalness of synthesized speech varies by language. A model that sounds like a real person in English might sound robotic or stilted in Korean. The voice timbre, prosody, and intonation patterns are trained on different data volumes per language, and it shows.
Accent reproduction. This is subtler. A Japanese participant expects the AI to speak standard Japanese (hyojungo), not Japanese-with-an-English-accent. Gemini Live handles this well for most Asian and European languages because Google’s training data includes massive multilingual speech corpora. OpenAI Realtime, as of early 2026, is notably stronger in English — the voices were designed for English-first.
Cultural conversational norms. Japanese speakers use longer pauses. Spanish speakers overlap more. Korean speakers use different turn-taking patterns depending on the formality level. The AI needs to respect these norms or it feels foreign, even if the words are technically correct.
The key insight we arrived at: you don’t want one model doing everything. You want the best model per language. OpenAI Realtime for English, where it’s fastest and most natural. Gemini Live for everything else, where its 30+ language support and accent quality genuinely shine. This is a routing problem, not a model problem.
Language Detection and Provider Routing
Before you can route a session to the right provider, you need to know what language the participant will speak. We tried three approaches:
1. Browser locale from the Accept-Language header. The participant’s browser sends this with every HTTP request. It’s automatic — no user action required. It’s also unreliable. A Japanese researcher in San Francisco might have their browser set to en-US. An immigrant participant might have es-MX configured but plan to speak English for the session. Browser locale tells you what language the user reads, not what language they’ll speak.
2. Explicit user selection in the session UI. A language dropdown before the session starts. “What language would you like to conduct this session in?” This is the safest approach because the participant knows what language they intend to speak. We show the browser locale as the default selection to minimize friction.
3. Audio-based detection using the first 3 seconds of speech. Route the initial audio through Whisper for language identification, then connect to the appropriate provider. This is elegant in theory and terrible in practice. It adds 3-5 seconds of latency before the AI can respond (the participant speaks, you detect, you connect, the AI responds). In research, the first 5 seconds set the tone for the entire session. A 5-second dead silence after the participant says “konnichiwa” is a non-starter.
We use approach 2 — explicit selection — with the browser locale as the default. It’s slightly more friction (one extra click) but eliminates misrouting almost entirely.
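The browser-default logic is simple enough to sketch. Here's a hypothetical `default_locale()` helper (the name and `SUPPORTED` set are ours, not a library API) that picks the dropdown default from the `Accept-Language` header; the participant still confirms or overrides it in the UI:

```python
SUPPORTED = {"en", "ja", "es", "fr", "ko"}

def default_locale(accept_language: str) -> str:
    """Pick the dropdown default from an Accept-Language header value.

    Takes the highest-q supported language; falls back to "en". This only
    seeds the default selection -- it never routes a session by itself.
    """
    candidates = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        lang_range, _, q_str = piece.partition(";q=")
        try:
            q = float(q_str) if q_str else 1.0  # no q-value means q=1.0
        except ValueError:
            q = 0.0
        lang = lang_range.split("-")[0].lower()  # "ja-JP" -> "ja"
        if lang in SUPPORTED:
            candidates.append((q, lang))
    return max(candidates)[1] if candidates else "en"

default_locale("ja-JP,ja;q=0.9,en-US;q=0.8")  # -> "ja"
default_locale("de-DE,de;q=0.9")              # -> "en" (unsupported)
```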
Once you know the language, the routing logic maps it to a provider, model, voice, VAD profile, and prompt pack:
```python
from dataclasses import dataclass


@dataclass
class LanguageConfig:
    """Complete configuration for a language-specific voice AI session."""
    locale: str          # "en", "ja", "es", "fr", "ko"
    provider: str        # "openai_realtime" or "gemini_live"
    model: str           # Provider-specific model identifier
    voice_id: str        # Provider-specific voice identifier
    vad_profile: str     # Reference to VADProfile name
    prompt_pack: str     # Path to locale-specific prompt YAML
    display_name: str    # Human-readable language name
    ttfv_target_ms: int  # Expected time-to-first-voice for this config


LANGUAGE_REGISTRY: dict[str, LanguageConfig] = {
    "en": LanguageConfig(
        locale="en",
        provider="openai_realtime",
        model="gpt-4o-realtime-preview",
        voice_id="shimmer",
        vad_profile="english",
        prompt_pack="prompts/research_en.yaml",
        display_name="English",
        ttfv_target_ms=300,
    ),
    "ja": LanguageConfig(
        locale="ja",
        provider="gemini_live",
        model="gemini-2.0-flash",
        voice_id="Puck",
        vad_profile="japanese",
        prompt_pack="prompts/research_ja.yaml",
        display_name="Japanese",
        ttfv_target_ms=400,
    ),
    "es": LanguageConfig(
        locale="es",
        provider="gemini_live",
        model="gemini-2.0-flash",
        voice_id="Kore",
        vad_profile="spanish",
        prompt_pack="prompts/research_es.yaml",
        display_name="Spanish",
        ttfv_target_ms=400,
    ),
    "fr": LanguageConfig(
        locale="fr",
        provider="gemini_live",
        model="gemini-2.0-flash",
        voice_id="Charon",
        vad_profile="french",
        prompt_pack="prompts/research_fr.yaml",
        display_name="French",
        ttfv_target_ms=420,
    ),
    "ko": LanguageConfig(
        locale="ko",
        provider="gemini_live",
        model="gemini-2.0-flash",
        voice_id="Aoede",
        vad_profile="korean",
        prompt_pack="prompts/research_ko.yaml",
        display_name="Korean",
        ttfv_target_ms=410,
    ),
}
```
The ProviderRouter class ties it together. Given a locale, it returns the full configuration or falls back gracefully:
```python
@dataclass
class ProviderRouter:
    """Routes sessions to the optimal provider based on language."""
    registry: dict[str, LanguageConfig]
    default_locale: str = "en"
    fallback_provider: str = "gemini_live"

    def resolve(self, locale: str) -> LanguageConfig:
        """Get the optimal config for a locale, with fallback chain."""
        # Exact match
        if locale in self.registry:
            return self.registry[locale]
        # Language-only match (e.g., "ja-JP" -> "ja")
        lang = locale.split("-")[0].lower()
        if lang in self.registry:
            return self.registry[lang]
        # Unsupported language: route to Gemini Live (broadest support)
        # and use the requested locale as a hint in the system prompt
        fallback = LanguageConfig(
            locale=lang,
            provider=self.fallback_provider,
            model="gemini-2.0-flash",
            voice_id="Puck",
            vad_profile="default",
            prompt_pack="prompts/research_en.yaml",
            display_name=f"Unsupported ({lang})",
            ttfv_target_ms=450,
        )
        return fallback
```
The fallback to Gemini Live for unsupported languages is deliberate. Gemini Live supports 30+ languages natively, so even if we don’t have a tuned configuration for a language, the model can usually handle it at a baseline level. For the handful of cases where even Gemini can’t handle the language, we fall back to English with a polite notification to the participant through the session UI.
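The fallback chain is easy to sanity-check in isolation. Here's a condensed, self-contained stand-in for the router above, stripped to two fields so the three branches can be exercised as a unit test:

```python
from dataclasses import dataclass

# Condensed stand-in for LanguageConfig/ProviderRouter, reduced to the
# two fields needed to demonstrate the fallback chain.
@dataclass
class Config:
    locale: str
    provider: str

REGISTRY = {
    "en": Config("en", "openai_realtime"),
    "ja": Config("ja", "gemini_live"),
}

def resolve(locale: str) -> Config:
    if locale in REGISTRY:              # exact match
        return REGISTRY[locale]
    lang = locale.split("-")[0].lower() # "ja-JP" -> "ja"
    if lang in REGISTRY:
        return REGISTRY[lang]
    return Config(lang, "gemini_live")  # unsupported -> broadest support

assert resolve("en").provider == "openai_realtime"
assert resolve("ja-JP").provider == "gemini_live"  # region tag normalized
assert resolve("de-DE").provider == "gemini_live"  # unsupported fallback
```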
For English specifically, OpenAI Realtime wins on latency — roughly 300ms TTFV versus 400ms for Gemini Live. That 100ms difference is perceptible in conversation. For every other language we’ve tested, Gemini Live produces more natural-sounding speech with better accent reproduction, so the routing math is straightforward: English goes to OpenAI, everything else goes to Gemini.
Locale-Aware VAD Tuning
This was the biggest surprise. Voice Activity Detection settings that work perfectly for English speakers actively harm the experience for Japanese speakers.
The problem is cultural. Japanese speakers use longer pauses — what linguists call “think-silences” — as a natural part of conversation. A 700ms pause in English usually means the speaker is done. A 700ms pause in Japanese might mean the speaker is formulating the next part of a complex thought. If your VAD interprets that silence as end-of-turn and triggers the AI to respond, you’ve just interrupted the participant mid-thought. In a research context, that interruption destroys the very insight you were trying to capture.
Spanish and Portuguese speakers have the opposite pattern. Turn-taking is faster, overlaps are more common, and silences are shorter. A VAD tuned for English will feel sluggish — the AI waits too long before responding, creating awkward gaps that don’t match the conversational rhythm.
We built per-language VAD profiles. This connects directly to the VAD tuning work from Part 2, but at a per-locale level:
```python
from dataclasses import dataclass


@dataclass
class VADProfile:
    """Language-specific Voice Activity Detection configuration."""
    name: str
    threshold: float             # Activation threshold (0.0-1.0)
    silence_duration_ms: int     # Silence before end-of-turn (ms)
    prefix_padding_ms: int       # Audio to keep before speech onset (ms)
    min_speech_duration_ms: int  # Minimum speech to count as utterance


VAD_PROFILES: dict[str, VADProfile] = {
    "english": VADProfile(
        name="english",
        threshold=0.5,
        silence_duration_ms=500,
        prefix_padding_ms=300,
        min_speech_duration_ms=200,
    ),
    "japanese": VADProfile(
        name="japanese",
        threshold=0.4,            # Lower threshold: catch softer speech
        silence_duration_ms=800,  # Longer: respect think-pauses
        prefix_padding_ms=500,    # More padding: Japanese speech starts softer
        min_speech_duration_ms=150,
    ),
    "spanish": VADProfile(
        name="spanish",
        threshold=0.5,
        silence_duration_ms=400,  # Shorter: faster turn-taking rhythm
        prefix_padding_ms=200,    # Less padding: quicker onsets
        min_speech_duration_ms=180,
    ),
    "korean": VADProfile(
        name="korean",
        threshold=0.45,
        silence_duration_ms=600,  # Moderate: between English and Japanese
        prefix_padding_ms=350,
        min_speech_duration_ms=170,
    ),
    "french": VADProfile(
        name="french",
        threshold=0.5,
        silence_duration_ms=450,
        prefix_padding_ms=280,
        min_speech_duration_ms=190,
    ),
    "default": VADProfile(
        name="default",
        threshold=0.5,
        silence_duration_ms=500,
        prefix_padding_ms=300,
        min_speech_duration_ms=200,
    ),
}
```
The numbers didn’t come from linguistics papers. They came from running 50+ mock sessions per language, recording every instance where the AI interrupted the participant or waited too long, and adjusting until the interruption rate dropped below 5%. The Japanese profile took the most iteration — we started at 600ms silence duration and kept raising it until think-pauses stopped being misinterpreted as end-of-turn.
The threshold parameter also matters more than I expected. Japanese speech often starts quieter — a soft intake of breath, a filler word (“eto…”), then the actual statement. A threshold of 0.5 would miss those quiet onsets and clip the beginning of utterances. Dropping to 0.4 for Japanese catches those soft starts without significantly increasing false activations from background noise.
These VAD profiles plug directly into the room metadata configuration. When the ProviderRouter resolves a locale to a LanguageConfig, the vad_profile field references the appropriate profile, and the agent applies it to the SFU’s audio processing pipeline before the session starts.
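As a concrete example of that last step, here's a sketch of how a profile might translate into OpenAI Realtime's `server_vad` turn-detection block (`type`, `threshold`, `prefix_padding_ms`, and `silence_duration_ms` are real fields of that API; Gemini Live has its own equivalent config, and `min_speech_duration_ms` has no direct `server_vad` counterpart, so in this sketch we assume it's enforced in our own audio pipeline):

```python
from dataclasses import dataclass

@dataclass
class VADProfile:
    """Redeclared here so the snippet runs standalone."""
    name: str
    threshold: float
    silence_duration_ms: int
    prefix_padding_ms: int
    min_speech_duration_ms: int

def to_openai_turn_detection(profile: VADProfile) -> dict:
    """Map a VADProfile onto OpenAI Realtime's server_vad settings.

    min_speech_duration_ms is intentionally absent: server_vad has no
    such field, so we filter too-short utterances upstream (assumption).
    """
    return {
        "type": "server_vad",
        "threshold": profile.threshold,
        "prefix_padding_ms": profile.prefix_padding_ms,
        "silence_duration_ms": profile.silence_duration_ms,
    }

english = VADProfile("english", 0.5, 500, 300, 200)
to_openai_turn_detection(english)
# -> {"type": "server_vad", "threshold": 0.5,
#     "prefix_padding_ms": 300, "silence_duration_ms": 500}
```

The payload goes out in a `session.update` event when the agent connects, before any participant audio is processed.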
i18n Prompt Packs — Instructions in the Target Language
Here’s something that seems obvious in retrospect but caused two weeks of debugging: if the system prompt is in English and the participant speaks Japanese, the AI produces a confused mix. It might respond in Japanese for the first few turns, then randomly switch to English. Or it’ll use Japanese words with English grammar. Or it’ll translate its English-native thinking into stilted, overly formal Japanese that sounds like a textbook rather than a person.
The fix: write the entire system prompt — every phase instruction, every probe template, every wrap-up script — in the target language. The AI produces dramatically better output when the instructions match the expected output language.
We organized this as prompt packs — YAML files per locale that contain the full research protocol in that language:
```python
from dataclasses import dataclass
from pathlib import Path

import yaml


@dataclass
class PromptPack:
    """Locale-specific research prompts and templates."""
    locale: str
    system_intro: str                  # Base system instruction
    phases: dict[str, str]             # phase_id -> phase-specific instructions
    probe_templates: list[str]         # Follow-up probe patterns
    wrap_up_script: str                # Session closing language
    filler_acknowledgments: list[str]  # Locale-appropriate "mm-hmm" equivalents
    polite_interrupt: str              # How to redirect gently


@dataclass
class PromptPackLoader:
    """Loads locale-specific prompt packs from YAML files."""
    prompts_dir: Path = Path("prompts")

    def load(self, locale: str) -> PromptPack:
        """Load a prompt pack for the given locale."""
        pack_path = self.prompts_dir / f"research_{locale}.yaml"
        if not pack_path.exists():
            # Fall back to English if the locale pack doesn't exist
            pack_path = self.prompts_dir / "research_en.yaml"
        with open(pack_path, "r", encoding="utf-8") as f:
            data = yaml.safe_load(f)
        return PromptPack(
            locale=locale,
            system_intro=data["system_intro"],
            phases=data["phases"],
            probe_templates=data.get("probe_templates", []),
            wrap_up_script=data.get("wrap_up_script", ""),
            filler_acknowledgments=data.get("filler_acknowledgments", []),
            polite_interrupt=data.get("polite_interrupt", ""),
        )
```
A simplified example of the Japanese prompt pack structure:
```yaml
# prompts/research_ja.yaml
# (English glosses added in comments for readers who don't read Japanese)

# Gloss: "You are a qualitative research interviewer. Treat participants with
# respect, use polite desu/masu forms, respect silences while the participant
# thinks, and don't rush them."
system_intro: |
  あなたは定性調査のインタビュアーです。参加者に敬意を持って接し、
  丁寧語(です・ます調)を使用してください。参加者が考えている間の
  沈黙を尊重し、急かさないでください。

phases:
  # Gloss: warm-up -- build rapport, start with light topics, open with
  # "Thank you for your time today."
  warmup: |
    ウォーミングアップの段階です。参加者との信頼関係を築いてください。
    軽い話題から始め、リラックスした雰囲気を作りましょう。
    「本日はお時間をいただきありがとうございます」から始めてください。
  # Gloss: exploration -- move naturally through the research questions,
  # showing empathy to draw out deeper insights.
  exploration: |
    探索の段階です。研究テーマに関する質問を自然に進めてください。
    参加者の回答に共感を示しながら、深い洞察を引き出しましょう。
  # Gloss: probing -- when something interesting comes up, dig in with
  # "Could you tell me a little more about that?"
  probing: |
    深掘りの段階です。参加者が興味深い発言をした場合、
    「もう少し詳しくお聞かせいただけますか?」のように掘り下げてください。
  # Gloss: wrap-up -- summarize, confirm, and close with
  # "Thank you for the valuable discussion today."
  wrapup: |
    まとめの段階です。参加者の回答を要約し、
    確認を取ってください。「本日は貴重なお話をありがとうございました」で締めくくります。

# Gloss: "Could you tell me more?" / "That's interesting -- what do you mean,
# specifically?" / "How did you feel at that moment?"
probe_templates:
  - "もう少し詳しくお聞かせいただけますか?"
  - "それは興味深いですね。具体的にはどういうことでしょうか?"
  - "その時、どのようにお感じになりましたか?"

# Gloss: "I see" / "That's right" / "Yes, please continue"
filler_acknowledgments:
  - "なるほど"
  - "そうですね"
  - "はい、続けてください"

# Gloss: "Thank you very much for your valuable time today. What you shared
# will be a great help to the research."
wrap_up_script: |
  本日は貴重なお時間をいただき、誠にありがとうございました。
  お話しいただいた内容は研究にとって大変参考になります。

# Gloss: "This is very interesting, but may I also ask about it from a
# slightly different angle?"
polite_interrupt: |
  大変興味深いお話ですが、もう少し別の観点からもお聞きしたいのですが、
  よろしいでしょうか?
```
Three things I learned the hard way about i18n prompt packs:
Native speakers must review every prompt pack. We initially machine-translated the English prompts to Japanese using GPT-4o. The translations were grammatically correct but culturally wrong. The polite forms were inconsistent — mixing casual and formal registers in ways that would confuse a Japanese speaker. A native Japanese speaker rewrote the prompts in two hours and the quality difference was night and day. The AI’s Japanese went from “technically correct but awkward” to “sounds like an actual research moderator.”
Cultural adaptation goes beyond translation. Japanese research protocols need more polite framing. The keigo (honorific language) patterns matter. You use です/ます forms, not plain forms. You add softening phrases before direct questions. Spanish protocols, conversely, can be more direct — a probe like “Cuénteme más sobre eso” (tell me more about that) works without the layers of indirection that Japanese requires. These are research methodology decisions, not just language decisions.
The set_chat_ctx() dynamic instruction swap works the same regardless of language. This was a relief. The state machine from Part 3 uses set_chat_ctx() to swap system instructions when the AI transitions between phases. The same mechanism works perfectly with Japanese, Spanish, or Korean instructions. The phase transition is a prompt swap — the language of the prompt doesn’t change the mechanics.
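To make that mechanics claim concrete, here's a minimal sketch of how a phase's system instruction might be assembled from a loaded pack before the swap (`build_phase_instructions` is our illustrative helper, and the pack here is a dict stand-in with made-up English content; the real packs carry the full localized text):

```python
# Minimal stand-in for a loaded PromptPack, using the same field names.
pack = {
    "system_intro": "You are a qualitative research interviewer...",
    "phases": {
        "warmup": "Build rapport with light topics...",
        "probing": "Dig deeper into interesting statements...",
    },
}

def build_phase_instructions(pack: dict, phase_id: str) -> str:
    """Assemble the full system instruction for one phase.

    The whole string stays in one language; on phase transition the agent
    swaps it in via set_chat_ctx() exactly as in the English-only flow.
    """
    return pack["system_intro"] + "\n\n" + pack["phases"][phase_id]

build_phase_instructions(pack, "probing")
# -> intro text, blank line, then the probing-phase instructions
```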
Voice Selection Per Language
Each S2S provider offers a set of voices, and their quality varies by language. This is not documented well — you discover it by testing.
OpenAI Realtime currently offers voices like alloy, echo, shimmer, ash, ballad, coral, sage, and verse. These are all English-first voices. They can produce other languages, but the accent and naturalness degrade. Shimmer in Japanese sounds distinctly like an English speaker reading Japanese text. It’s intelligible but not convincing.
Gemini Live voices — Puck, Charon, Kore, Fenrir, Aoede — handle multilingual speech more naturally. Puck in Japanese sounds like a Japanese speaker. Kore in Spanish sounds like a native Spanish speaker. This is likely because Google’s voice training data spans a wider range of languages from their search and assistant products.
We maintain a voice registry that maps provider, language, and use case to the best voice:
```python
from dataclasses import dataclass


@dataclass
class VoiceOption:
    """A voice option with quality scores per language."""
    voice_id: str
    provider: str
    naturalness_scores: dict[str, float]  # locale -> 1.0-5.0 score
    gender_presentation: str              # "neutral", "feminine", "masculine"


VOICE_REGISTRY: list[VoiceOption] = [
    # OpenAI Realtime voices -- strong in English
    VoiceOption("shimmer", "openai_realtime",
                {"en": 4.8, "ja": 2.1, "es": 3.0, "fr": 3.2, "ko": 2.0},
                "feminine"),
    VoiceOption("alloy", "openai_realtime",
                {"en": 4.5, "ja": 2.3, "es": 2.8, "fr": 3.0, "ko": 2.2},
                "neutral"),
    VoiceOption("ash", "openai_realtime",
                {"en": 4.7, "ja": 2.0, "es": 2.9, "fr": 3.1, "ko": 1.9},
                "masculine"),
    # Gemini Live voices -- strong across languages
    VoiceOption("Puck", "gemini_live",
                {"en": 4.2, "ja": 4.5, "es": 4.3, "fr": 4.1, "ko": 4.0},
                "neutral"),
    VoiceOption("Charon", "gemini_live",
                {"en": 4.0, "ja": 4.1, "es": 4.0, "fr": 4.6, "ko": 3.8},
                "masculine"),
    VoiceOption("Kore", "gemini_live",
                {"en": 4.1, "ja": 4.3, "es": 4.7, "fr": 4.2, "ko": 4.1},
                "feminine"),
    VoiceOption("Aoede", "gemini_live",
                {"en": 3.9, "ja": 4.0, "es": 4.1, "fr": 4.0, "ko": 4.5},
                "feminine"),
]


def select_voice(provider: str, locale: str,
                 preferred_gender: str = "neutral") -> str:
    """Select the best voice for a provider+locale combination."""
    candidates = [v for v in VOICE_REGISTRY if v.provider == provider]
    if preferred_gender != "neutral":
        gendered = [v for v in candidates
                    if v.gender_presentation == preferred_gender]
        if gendered:
            candidates = gendered
    # Sort by naturalness score for the target locale
    candidates.sort(
        key=lambda v: v.naturalness_scores.get(locale, 0.0),
        reverse=True,
    )
    return candidates[0].voice_id if candidates else "Puck"
```
The naturalness scores came from structured testing: we ran 10-minute mock research sessions in each language with each voice, then had native speakers rate the AI’s speech on a 1-5 scale across four dimensions — accent naturalness, prosody (rhythm and intonation), clarity, and conversational flow. The scores in the registry are the averages.
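The averaging itself is trivial, but writing it down keeps the methodology honest. A sketch with hypothetical rating data (the four dimension names are from our rubric; the numbers are illustrative, not real scores):

```python
# Each rater scores one voice/locale pair on four dimensions, 1-5 scale.
ratings = [
    {"accent": 4.5, "prosody": 4.0, "clarity": 5.0, "flow": 4.5},
    {"accent": 4.0, "prosody": 4.5, "clarity": 4.5, "flow": 4.0},
]

def naturalness_score(ratings: list[dict]) -> float:
    """Mean of per-rater means -- the single number stored in the registry."""
    per_rater = [sum(r.values()) / len(r) for r in ratings]
    return round(sum(per_rater) / len(per_rater), 2)

naturalness_score(ratings)  # -> 4.38
```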
One operational rule we enforce strictly: voice consistency within a study. If a research study starts with Puck for Japanese sessions, every Japanese session in that study uses Puck. Participants in longitudinal studies (multiple sessions over weeks) notice voice changes and it disrupts their comfort level. The voice_id is locked at the study level, not the session level.
Cross-Language Analysis Pipeline
After sessions complete, you have transcripts in Japanese, English, Spanish, French, and Korean. The research team needs to analyze them together — identify themes across languages, compare sentiment patterns, extract insights that span the entire study regardless of what language each participant spoke.
We evaluated three strategies:
1. Analyze in source language only. Run sentiment analysis, topic extraction, and theme coding in the original language. This preserves cultural nuances — Japanese sentiment expressions don’t map 1:1 to English scales — but makes cross-language comparison nearly impossible. The research team would need to read reports in five languages.
2. Translate to English, then analyze. Translate every transcript to English, then run a single English analysis pipeline. This simplifies cross-language comparison but loses nuance. The Japanese concept of “kuuki wo yomu” (reading the atmosphere) translates literally but the cultural weight disappears. Sarcasm, humor, and understatement are language-specific and often get flattened in translation.
3. Dual analysis: source language + English translation. Analyze in the source language for accuracy, translate for cross-study comparison, and keep both. More expensive, more storage, but the research team gets the best of both approaches.
We use strategy 3 — dual analysis. Here’s the core class:
```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TranscriptSegment:
    """A single segment of a research session transcript."""
    segment_id: str
    locale: str
    speaker: str                   # "participant" or "agent"
    text: str                      # Original language text
    text_en: Optional[str] = None  # English translation (if non-English)
    start_ms: int = 0
    end_ms: int = 0


@dataclass
class SegmentEnrichment:
    """Enrichment data for a transcript segment."""
    segment_id: str
    locale: str
    sentiment: float      # -1.0 to 1.0
    sentiment_label: str  # Source-language label
    topics: list[str]     # Semantic topic IDs (language-agnostic)
    key_quote: bool = False
    theme_ids: list[str] = field(default_factory=list)


@dataclass
class CrossLanguageEnricher:
    """Runs dual-language enrichment on research transcripts."""
    translation_model: str = "gpt-4o"
    enrichment_model_map: dict[str, str] = field(default_factory=lambda: {
        "en": "gpt-4o-mini",
        "ja": "gpt-4o",  # Full model for Japanese nuance
        "es": "gpt-4o-mini",
        "fr": "gpt-4o-mini",
        "ko": "gpt-4o",  # Full model for Korean nuance
    })

    async def enrich_segment(self, segment: TranscriptSegment) -> SegmentEnrichment:
        """Run source-language enrichment, then optional translation."""
        # Step 1: Enrichment in the source language
        model = self.enrichment_model_map.get(segment.locale, "gpt-4o-mini")
        enrichment = await self._source_language_enrichment(segment, model)

        # Step 2: Translate to English if not already English
        if segment.locale != "en":
            segment.text_en = await self._translate_to_english(
                segment.text, segment.locale
            )

        # Step 3: Map themes to semantic IDs (language-agnostic)
        enrichment.theme_ids = await self._map_to_semantic_themes(
            enrichment.topics, segment.locale
        )
        return enrichment

    async def _source_language_enrichment(
        self, segment: TranscriptSegment, model: str
    ) -> SegmentEnrichment:
        """Analyze sentiment and topics in the original language."""
        # Prompt is in the source language for accuracy.
        # Returns sentiment, topics, key_quote flag.
        # Implementation calls the LLM with locale-appropriate prompts.
        ...

    async def _translate_to_english(self, text: str, source_locale: str) -> str:
        """Translate segment text to English, preserving meaning over literalness."""
        # Uses GPT-4o for translation quality.
        # Prompt instructs: "Translate for meaning, not word-for-word.
        # Preserve cultural context in brackets where needed."
        ...

    async def _map_to_semantic_themes(
        self, topics: list[str], locale: str
    ) -> list[str]:
        """Map locale-specific topics to universal semantic theme IDs."""
        # "ブランド認知" (Japanese) and "brand awareness" (English)
        # both map to theme_id "brand_perception".
        # Uses a predefined taxonomy + LLM fallback for unmapped topics.
        ...
```
The semantic theme IDs are the key to cross-language comparison. Instead of matching theme strings across languages (which would require translation), we define a universal taxonomy of research themes — brand_perception, purchase_intent, usability_friction, emotional_response, etc. — and map each language’s topics to these IDs. “ブランド認知” in Japanese and “reconocimiento de marca” in Spanish both resolve to brand_perception. The mapping is a combination of a hand-curated dictionary (covers ~80% of cases) and an LLM fallback for novel topics.
Storage uses GIN-indexed JSONB in PostgreSQL, with the locale field enabling per-language filtering:
```sql
-- GIN index for containment queries on the enrichment JSONB
CREATE INDEX idx_enrichments_data
    ON session_enrichments USING GIN (enrichment_data jsonb_path_ops);

-- B-tree index on locale for per-language filtering
CREATE INDEX idx_segments_locale ON transcript_segments (locale);

-- Query: all key quotes about brand perception, any language
SELECT s.text, s.text_en, e.enrichment_data->>'sentiment_label'
FROM transcript_segments s
JOIN session_enrichments e ON s.segment_id = e.segment_id
WHERE e.enrichment_data @> '{"theme_ids": ["brand_perception"]}'
  AND e.enrichment_data @> '{"key_quote": true}'
ORDER BY (e.enrichment_data->>'sentiment')::float DESC;
```
The research team gets unified reports with per-locale breakdowns and combined cross-language views. A typical report shows: “Across all 150 sessions (98 English, 32 Japanese, 20 Spanish), the dominant theme was usability_friction with negative sentiment. Japanese participants expressed this through indirect language (mean sentiment -0.3), while Spanish participants were more direct (mean sentiment -0.6).” That level of cross-cultural insight is only possible with dual analysis.
For post-session ASR, when we need higher-quality transcripts (the S2S model's built-in transcription is good but not perfect), we use Whisper, which supports 99 languages. The pipeline from Part 4 runs Whisper with the language parameter set to the session's locale for optimal accuracy.
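Pinning the language skips Whisper's auto-detect pass, which helps on short or code-switched segments. A minimal sketch, assuming a `whisper_kwargs()` helper of our own (`language` and `task` are real parameters of open-source Whisper's `transcribe()`):

```python
def whisper_kwargs(locale: str) -> dict:
    """Transcription kwargs for the post-session ASR pass.

    Normalizes a full locale tag ("ja-JP") to Whisper's bare language
    code ("ja") and pins decoding to that language.
    """
    lang = locale.split("-")[0].lower()
    return {"language": lang, "task": "transcribe"}

# Usage (requires `pip install openai-whisper` and an audio file):
#   model = whisper.load_model("large-v3")
#   result = model.transcribe("session_1234.wav", **whisper_kwargs("ja-JP"))

whisper_kwargs("ja-JP")  # -> {"language": "ja", "task": "transcribe"}
```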
Production Numbers and What We Learned
After six months of multi-language operation, here’s where the numbers landed.
Language distribution:
- English: 65% of sessions (routed to OpenAI Realtime)
- Japanese: 20% of sessions (routed to Gemini Live)
- Spanish: 10% of sessions (routed to Gemini Live)
- Other (French, Korean, Portuguese): 5% combined (routed to Gemini Live)
Time-to-first-voice by language and provider:
| Language | Provider | TTFV (p50) | TTFV (p95) |
|---|---|---|---|
| English | OpenAI Realtime | 290ms | 450ms |
| English | Gemini Live | 390ms | 580ms |
| Japanese | Gemini Live | 410ms | 620ms |
| Spanish | Gemini Live | 380ms | 560ms |
| Korean | Gemini Live | 420ms | 640ms |
| French | Gemini Live | 395ms | 575ms |
OpenAI Realtime consistently wins on English TTFV by ~100ms. For non-English languages, Gemini Live is the only viable option for native-quality speech, so the comparison is academic.
Session completion rates (full protocol, participant didn’t drop out):
- English: 94%
- Japanese: 91% (up from 83% before VAD tuning)
- Spanish: 93%
- Korean: 90%
- French: 92%
The Japanese completion rate tells the story of VAD tuning. Before locale-specific VAD profiles, Japanese sessions had an 83% completion rate. Participants were getting frustrated by constant interruptions during think-pauses and abandoning sessions. After raising the silence duration to 800ms and lowering the threshold to 0.4, the completion rate jumped to 91%. That 8-point improvement came entirely from not interrupting people while they're thinking.
Cost per session hour by provider:
- OpenAI Realtime (English): ~$3.18/hour
- Gemini Live (Japanese/Spanish/other): ~$1.74/hour
Gemini Live sessions are approximately 45% cheaper. This is partly the per-token rate difference (covered in Part 5) and partly because Gemini Live’s audio tokenization is more efficient for the session lengths typical in research (20-40 minutes).
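For capacity planning, the blended rate matters more than either provider's number. A back-of-envelope check using the language mix and per-hour rates above:

```python
# Observed routing mix: 65% English -> OpenAI, 35% non-English -> Gemini
mix = {"openai_realtime": 0.65, "gemini_live": 0.35}
rate = {"openai_realtime": 3.18, "gemini_live": 1.74}  # $/session-hour

blended = sum(mix[p] * rate[p] for p in mix)
assert round(blended, 2) == 2.68  # blended $/session-hour

# Sanity check on the "~45% cheaper" claim
savings = 1 - rate["gemini_live"] / rate["openai_realtime"]
assert round(savings, 2) == 0.45
```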
What we learned, distilled:
Don’t try to force one provider to do everything. The temptation is to pick one provider and use it for all languages. It’s simpler to build, simpler to maintain, simpler to debug. But the quality difference is real. OpenAI Realtime in Japanese sounds like a foreigner speaking Japanese. Gemini Live in Japanese sounds like a native speaker. For research, where participant comfort determines data quality, the multi-provider approach is worth the additional complexity.
VAD profiles per language are not optional. They’re the difference between 83% and 91% session completion for Japanese. Every language has different conversational rhythms. A one-size-fits-all VAD configuration guarantees a poor experience for at least some of your language cohorts. The effort to tune per-language profiles is a few weeks of testing — the payoff is permanent.
Prompt packs must be written by native speakers, not machine-translated. Machine translation produces grammatically correct but culturally wrong prompts. The AI mirrors the quality of its instructions. Stilted instructions produce stilted conversation. Native-speaker-written prompts produce natural conversation. Budget for this as a real line item, not an afterthought.
The cross-language analysis pipeline is where the research value multiplies. Running studies in five languages doesn’t give you 5x the insights — it gives you qualitatively different insights. Cultural differences in how people express the same underlying sentiment are themselves research findings. But only if your analysis pipeline can surface them, which requires dual-language enrichment and semantic theme mapping.
Gemini Live’s language breadth is a strategic advantage. If your research spans more than two languages, Gemini Live’s 30+ language support with native accent quality is hard to beat. We route English to OpenAI for the latency edge, but Gemini handles everything else and could handle English too if needed. For teams building multi-language voice AI, starting with Gemini Live as the default and adding OpenAI for English optimization is the pragmatic path.
References:
- OpenAI Realtime API Documentation
- Google Gemini Live API
- OpenAI Whisper — Multilingual ASR
- PostgreSQL GIN Index Documentation
This is Part 7 of an 8-part series: Production Voice AI for Research at Scale.
Series outline:
- The Architecture Nobody Warns You About — Server-side agents, metadata transport, provider selection (Part 1)
- Zombie Agents, Pre-Warming, and the 5 Bugs That Cost Us Weeks — Production pain points and fixes (Part 2)
- Multi-Phase State Machines — Research protocol as code, LLM-driven transitions (Part 3)
- From Recording to Insight — The automatic post-interview pipeline (Part 4)
- The Real Cost — Per-minute tracking, budgets, self-hosting math (Part 5)
- What Breaks at 200 Concurrent Sessions — Scaling bottlenecks and operational metrics (Part 6)
- Multi-Language Voice AI — Language detection, provider routing, i18n prompt packs (this post)
- Deployment and Go-Live — Docker, Kubernetes, CI/CD, zero-downtime deploys, monitoring (Part 8)
For the broader reference architecture covering cascaded vs S2S pipelines, framework selection, multi-provider support, and the full interview lifecycle, see the 12-part Voice AI Interview Playbook.