Introduction

The voice AI landscape in 2026 has a clear frontrunner that most developers did not see coming. While OpenAI and Google dominated headlines with GPT-4o and Gemini, Alibaba’s Qwen team quietly assembled the most comprehensive open-source voice ecosystem available today. What makes Qwen different is not just one model — it is an entire interconnected stack: speech recognition (Qwen3-ASR), text-to-speech (Qwen3-TTS), omni-modal reasoning (Qwen3-Omni), an agent framework (Qwen-Agent), and a coding agent (Qwen3-Coder-Next). All open-weight. All self-hostable.

This matters for three reasons. First, privacy: you can run the entire voice pipeline on your own infrastructure without sending a single audio byte to a third-party API. Second, cost: after the initial GPU investment, inference is essentially free at scale. Third, customization: open weights mean you can fine-tune for your domain, clone voices from just 3 seconds of reference audio, and build agent behaviors that closed APIs simply do not support.

In this guide, we will walk through the entire Qwen voice ecosystem, then build three complete systems from scratch: a Voice Interview Agent, a Language Tutor, and a Voice Tutor with voice cloning. Every code block is practical, and every architecture diagram is grounded in production patterns.


The Qwen Voice Ecosystem

Before writing a single line of code, it is worth understanding what each piece does and how they fit together.

graph TB
    A[Qwen Voice Ecosystem] --> B[Qwen3-ASR]
    A --> C[Qwen3-TTS]
    A --> D[Qwen3-Omni]
    A --> E[Qwen-Agent]
    A --> F[Qwen3-Coder-Next]

    B --> B1[Speech to Text]
    B --> B2[30 Languages + 22 Chinese Dialects]
    B --> B3[Streaming + Offline]

    C --> C1[Text to Speech]
    C --> C2[Voice Cloning from 3s]
    C --> C3[10 Languages]

    D --> D1[Thinker-Talker Architecture]
    D --> D2[Voice-in Voice-out]
    D --> D3[119 Text Languages]

    E --> E1[Function Calling + MCP]
    E --> E2[Code Interpreter]
    E --> E3[RAG + Browser]

    F --> F1[Terminal AI Agent]
    F --> F2[3B active / 80B total MoE]
    F --> F3[256K Context]

Qwen3-Omni: The Flagship

Qwen3-Omni is a natively end-to-end omni-modal LLM. Unlike cascaded systems that chain separate ASR, LLM, and TTS models together, Qwen3-Omni processes audio, images, video, and text in a single forward pass and generates both text and speech output simultaneously.

The Thinker-Talker Architecture

The model is split into two tightly coupled components:

  • The Thinker is the reasoning backbone. It ingests all input modalities through specialized encoders — including a native Audio Transformer (AuT) encoder pre-trained on over 100 million hours of audio-visual data. The Thinker performs deep multimodal reasoning and produces rich hidden representations.

  • The Talker receives the Thinker’s output and generates contextual speech in real time. It handles prosody, emotion, and natural turn-taking without requiring a separate TTS model.

TMRoPE (Time-aware Multimodal Rotary Position Embedding) is the architectural innovation that makes this work. It aligns different modalities along a shared temporal axis, so the model understands that frame 42 of a video corresponds to second 1.4 of the audio track. This temporal grounding is what enables natural, synchronized voice responses.
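The shared-time-axis idea can be sketched in a few lines. This is an illustration only: the real TMRoPE interleaves rotary frequency components across modalities, while this toy version (with an assumed 25 ms temporal resolution, a hypothetical parameter) just shows how tokens from different streams map onto one time axis.

```python
# Toy illustration of TMRoPE's shared time axis (not the real embedding).
# The 25 ms resolution is an assumption for demonstration purposes.

def temporal_position(t_seconds: float, resolution: float = 0.025) -> int:
    """Map a timestamp to a shared temporal position index."""
    return round(t_seconds / resolution)

def video_frame_time(frame_index: int, fps: float = 30.0) -> float:
    """Timestamp of a video frame at a given frame rate."""
    return frame_index / fps

# Frame 42 of a 30 fps video lands at t = 1.4 s, the same temporal
# position as an audio event occurring 1.4 s into the track.
assert temporal_position(video_frame_time(42)) == temporal_position(1.4)
```

With aligned position indices, attention between the audio token at 1.4 s and video frame 42 behaves, roughly speaking, like attention between adjacent text tokens, which is what gives the model its temporal grounding.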

Key specifications:

  Specification             Value
  ------------------------  -------------------------------------------------------------
  Text languages            119
  Speech input languages    19 (EN, ZH, KO, JA, DE, RU, IT, FR, ES, PT, MS, NL, ID, TR, VI, Cantonese, AR, UR)
  Speech output languages   10 (EN, ZH, FR, DE, RU, IT, ES, PT, JA, KO)
  Available models          Qwen3-Omni-30B-A3B, Qwen2.5-Omni-7B
  Architecture              MoE, 30B total / 3B active parameters
  SOTA benchmarks           215 audio and audio-visual subtasks

Compared to GPT-4o and Gemini 2.5, Qwen3-Omni offers comparable quality on most benchmarks while being fully open-weight and self-hostable. The MoE architecture means you get 30B-class reasoning at 3B-class inference cost.

Qwen3-TTS: Text-to-Speech

Qwen3-TTS is a dedicated speech synthesis model series available in two sizes: 0.6B and 1.7B parameters. It is not just a basic TTS engine — it supports three distinct modes of voice control:

  1. Voice Cloning: Provide as little as 3 seconds of reference audio, and Qwen3-TTS will replicate the speaker’s voice characteristics, timbre, and speaking style.

  2. Voice Design: Describe the voice you want in natural language — “a warm, authoritative male voice with a slight British accent, speaking at a moderate pace” — and the model generates speech matching that description.

  3. Preset Voices: Use built-in voice profiles for quick prototyping.

Performance numbers that matter:

  • WER (Word Error Rate): 1.835% averaged across 10 languages — this is exceptionally low for an open-source TTS model
  • First-packet latency: 97ms (0.6B) / 101ms (1.7B) — fast enough for real-time conversation
  • Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, plus Chinese dialect support
  • Streaming: Full streaming support with chunked audio output

Available APIs: DashScope (Alibaba Cloud), Replicate, fal.ai, and self-hosted via vLLM or the official inference server.

Qwen3-ASR: Speech-to-Text

Qwen3-ASR complements TTS with equally capable speech recognition, available in 0.6B and 1.7B variants:

  • 52 languages and dialects supported (30 languages + 22 Chinese dialects)
  • Unified streaming and offline inference: The same model handles both real-time microphone input and batch file processing
  • Forced alignment: Qwen3-ForcedAligner-0.6B provides word-level timestamps in 11 languages — critical for pronunciation assessment and subtitle generation
  • Speed: The 0.6B model transcribes 2,000 seconds of speech in 1 second at a concurrency of 128
  • TTFT (Time to First Token): As low as 92ms for the 0.6B model
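Since the forced aligner exposes word-level timestamps, subtitle generation is mostly a matter of grouping words into cues. The sketch below assumes a simple list-of-dicts shape for the aligner output (the `word`, `start`, `end` keys are illustrative, not the SDK's actual schema) and emits SRT.

```python
# Sketch: turning word-level timestamps into SRT subtitle cues.
# The input schema here is assumed for illustration.

def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words: list, max_words: int = 7) -> str:
    """Group word-level alignments into SRT cues of up to max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"] for w in group)
        cues.append(
            f"{len(cues) + 1}\n"
            f"{to_srt_time(group[0]['start'])} --> {to_srt_time(group[-1]['end'])}\n"
            f"{text}"
        )
    return "\n\n".join(cues)

words = [
    {"word": "Hello", "start": 0.00, "end": 0.42},
    {"word": "world", "start": 0.48, "end": 0.95},
]
print(words_to_srt(words))
```

The same word-level confidences and timings feed pronunciation assessment, which the Language Tutor section uses later in this guide.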

The 1.7B version achieves state-of-the-art performance among open-source ASR models and is competitive with the strongest alternatives, both open and proprietary, including OpenAI's Whisper and Google Cloud Speech-to-Text.

DashScope WebSocket API provides a production-ready streaming interface:

import os
import dashscope
from dashscope.audio.asr import Recognition, RecognitionCallback

dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")


class ResultHandler(RecognitionCallback):
    """Receives streaming recognition results."""

    def on_event(self, result):
        print(result.get_sentence())


recognition = Recognition(
    model="qwen3-asr-1.7b",
    format="pcm",
    sample_rate=16000,
    callback=ResultHandler(),
)
recognition.start()
# Feed audio chunks in real time
recognition.send_audio(audio_chunk)
recognition.stop()

Qwen-Agent Framework

Qwen-Agent is the orchestration layer that turns individual models into functional AI agents. It provides:

  • Function Calling: Native support for parallel function calls with structured tool definitions
  • MCP (Model Context Protocol): Deep integration with MCP servers via stdio, SSE, or streamable HTTP — enabling file system access, database queries, memory management, and more
  • Code Interpreter: Sandboxed Python execution for data analysis and computation
  • RAG: Built-in retrieval-augmented generation for knowledge-grounded responses
  • Chrome Extension & Browser Assistant: Web-aware agents that can read and interact with browser content

For voice applications, Qwen-Agent provides the glue between ASR input, LLM reasoning, and TTS output, handling conversation state, tool calls, and streaming coordination.
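That glue can be pictured as a three-stage turn loop. The sketch below stubs out the three models with trivial functions (the real calls would hit Qwen3-ASR, Qwen3-Omni, and Qwen3-TTS via their SDKs); only the orchestration pattern, including shared conversation state, is the point.

```python
# Minimal sketch of the ASR -> LLM -> TTS glue an orchestration layer
# provides. The three stage methods are stubs standing in for real models.

from dataclasses import dataclass, field


@dataclass
class VoiceTurn:
    transcript: str = ""
    reply: str = ""
    audio: bytes = b""


@dataclass
class VoicePipeline:
    history: list = field(default_factory=list)

    def asr(self, audio: bytes) -> str:      # stub for Qwen3-ASR
        return audio.decode("utf-8")

    def llm(self, text: str) -> str:         # stub for Qwen3-Omni
        self.history.append({"role": "user", "content": text})
        reply = f"echo: {text}"
        self.history.append({"role": "assistant", "content": reply})
        return reply

    def tts(self, text: str) -> bytes:       # stub for Qwen3-TTS
        return text.encode("utf-8")

    def run_turn(self, audio_in: bytes) -> VoiceTurn:
        """One conversational turn: transcribe, reason, synthesize."""
        turn = VoiceTurn()
        turn.transcript = self.asr(audio_in)
        turn.reply = self.llm(turn.transcript)
        turn.audio = self.tts(turn.reply)
        return turn


pipeline = VoicePipeline()
turn = pipeline.run_turn(b"hello")
print(turn.reply)  # echo: hello
```

In a production agent, each stage would stream rather than block, and tool calls would be interleaved inside the `llm` stage, but the state flow is the same.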

Qwen3-Coder-Next and Qwen Code

While not directly voice-related, Qwen3-Coder-Next deserves mention because it powers the coding agent ecosystem:

  • Architecture: MoE with 3B active / 80B total parameters, 256K context window
  • Training: 800K executable tasks with environment interaction and reinforcement learning
  • Qwen Code: An open-source terminal AI agent (similar to Claude Code) optimized for Qwen models
  • Compatibility: Works with Claude Code, Cline, Kilo, Trae, and other agent scaffolds

This is relevant because you can use Qwen Code to build and iterate on your voice agent code directly from the terminal.


Building a Voice Interview Agent with Qwen

Let us build a complete voice-based technical interview system. The architecture uses three Qwen models in a pipeline:

graph LR
    MIC[Microphone] --> ASR[Qwen3-ASR]
    ASR --> |Transcript| LLM[Qwen3-Omni]
    LLM --> |Response text| TTS[Qwen3-TTS]
    TTS --> SPK[Speaker]
    LLM --> |Score + Notes| DB[(Session DB)]
    DB --> |History| LLM

Project Setup

pip install dashscope openai sounddevice numpy

Core Interview Agent

"""
Voice Interview Agent using Qwen3-ASR + Qwen3-Omni + Qwen3-TTS
Conducts technical interviews with real-time voice interaction.
"""

import os
import json
import queue
import threading
import numpy as np
import sounddevice as sd
from dataclasses import dataclass, field
from dashscope.audio.asr import Recognition, RecognitionCallback
from dashscope.audio.tts_v2 import SpeechSynthesizer
from openai import OpenAI

DASHSCOPE_API_KEY = os.getenv("DASHSCOPE_API_KEY")
SAMPLE_RATE = 16000
CHANNELS = 1


@dataclass
class InterviewSession:
    role: str = "Senior Python Developer"
    difficulty: str = "medium"
    history: list = field(default_factory=list)
    scores: list = field(default_factory=list)
    current_topic: str = ""
    questions_asked: int = 0
    max_questions: int = 8


SYSTEM_PROMPT = """You are an experienced technical interviewer conducting a
{role} interview at {difficulty} difficulty.

Rules:
- Ask one question at a time, then wait for the candidate's response
- Start with introductions, then move to technical questions
- Adapt question difficulty based on candidate performance
- After each answer, briefly acknowledge it before the next question
- Track topics covered: algorithms, system design, language specifics
- Be professional but conversational -- this is a voice interview
- Keep responses under 3 sentences for natural conversation flow

Current topic: {topic}
Questions asked: {asked}/{max}
"""


class ASRHandler(RecognitionCallback):
    """Handles real-time ASR results from Qwen3-ASR."""

    def __init__(self):
        self.transcript_queue = queue.Queue()
        self.partial = ""

    def on_partial(self, result):
        self.partial = result.get_sentence().get("text", "")

    def on_complete(self, result):
        text = result.get_sentence().get("text", "")
        if text.strip():
            self.transcript_queue.put(text)
        self.partial = ""

    def on_error(self, error):
        print(f"ASR Error: {error}")


def create_llm_client():
    """Create OpenAI-compatible client for Qwen3-Omni."""
    return OpenAI(
        api_key=DASHSCOPE_API_KEY,
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )


def get_interviewer_response(client, session: InterviewSession, candidate_text: str):
    """Generate interviewer response using Qwen3-Omni."""
    session.history.append({"role": "user", "content": candidate_text})

    system = SYSTEM_PROMPT.format(
        role=session.role,
        difficulty=session.difficulty,
        topic=session.current_topic,
        asked=session.questions_asked,
        max=session.max_questions,
    )

    messages = [{"role": "system", "content": system}] + session.history

    response = client.chat.completions.create(
        model="qwen3-omni", messages=messages, max_tokens=300
    )

    reply = response.choices[0].message.content
    session.history.append({"role": "assistant", "content": reply})
    session.questions_asked += 1

    return reply


def speak(text: str, voice: str = "ethan"):
    """Convert text to speech using Qwen3-TTS and play it."""
    synthesizer = SpeechSynthesizer(
        model="qwen3-tts-1.7b", voice=voice, sample_rate=24000
    )

    audio_data = b""
    for chunk in synthesizer.streaming_call(text):
        audio_data += chunk

    audio_array = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768
    sd.play(audio_array, samplerate=24000)
    sd.wait()


def listen(asr_handler: ASRHandler, timeout: float = 30.0) -> str:
    """Listen for speech input via Qwen3-ASR, return transcript."""
    recognition = Recognition(
        model="qwen3-asr-1.7b",
        format="pcm",
        sample_rate=SAMPLE_RATE,
        callback=asr_handler,
    )
    recognition.start()

    audio_queue = queue.Queue()

    def audio_callback(indata, frames, time_info, status):
        audio_queue.put(indata.copy())

    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype="int16",
        callback=audio_callback,
        blocksize=3200,
    ):
        max_chunks = int(timeout * SAMPLE_RATE / 3200)
        silence_limit = 8  # stop after ~1.6s of trailing silence (0.2s per chunk)
        silence_count = 0
        for _ in range(max_chunks):
            try:
                chunk = audio_queue.get(timeout=1.0)
                recognition.send_audio(chunk.tobytes())
                amplitude = np.abs(chunk).mean()
                if amplitude < 200:
                    silence_count += 1
                else:
                    silence_count = 0
                if silence_count >= silence_limit:
                    break
            except queue.Empty:
                break

    recognition.stop()

    try:
        return asr_handler.transcript_queue.get(timeout=2.0)
    except queue.Empty:
        return ""


def score_answer(client, question: str, answer: str) -> dict:
    """Score a candidate's answer on multiple dimensions."""
    scoring_prompt = f"""Score this interview answer on a scale of 1-10.
Question: {question}
Answer: {answer}

Return JSON: {{"relevance": int, "depth": int, "clarity": int, "overall": int, "notes": str}}
Only return the JSON, nothing else."""

    response = client.chat.completions.create(
        model="qwen3-omni",
        messages=[{"role": "user", "content": scoring_prompt}],
        max_tokens=200,
    )

    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return {"relevance": 5, "depth": 5, "clarity": 5, "overall": 5, "notes": ""}


def run_interview():
    """Main interview loop."""
    session = InterviewSession()
    client = create_llm_client()
    asr_handler = ASRHandler()

    print("=== Voice Interview Agent ===")
    print(f"Role: {session.role} | Difficulty: {session.difficulty}")
    print("Starting interview...\n")

    # Opening statement
    opening = get_interviewer_response(
        client, session, "[Interview begins. Greet the candidate.]"
    )
    print(f"Interviewer: {opening}")
    speak(opening)

    while session.questions_asked < session.max_questions:
        # Listen to candidate
        print("\n[Listening...]")
        candidate_response = listen(asr_handler)

        if not candidate_response:
            speak("I didn't catch that. Could you repeat?")
            continue

        print(f"Candidate: {candidate_response}")

        # Score the previous answer
        if session.questions_asked > 1:
            last_q = session.history[-1]["content"]  # the question just answered
            score = score_answer(client, last_q, candidate_response)
            session.scores.append(score)
            print(f"  [Score: {score.get('overall', 'N/A')}/10]")

        # Generate next interviewer response
        reply = get_interviewer_response(client, session, candidate_response)
        print(f"Interviewer: {reply}")
        speak(reply)

    # Final summary
    if session.scores:
        avg = sum(s.get("overall", 5) for s in session.scores) / len(session.scores)
        print(f"\n=== Interview Complete ===")
        print(f"Average Score: {avg:.1f}/10")
        print(f"Questions Asked: {session.questions_asked}")
        for i, s in enumerate(session.scores, 1):
            print(f"  Q{i}: {s.get('overall', 'N/A')}/10 - {s.get('notes', '')}")


if __name__ == "__main__":
    run_interview()

This agent captures microphone audio, transcribes it with Qwen3-ASR in real time, feeds the transcript to Qwen3-Omni for intelligent interviewer responses, scores each answer, and speaks the response back through Qwen3-TTS. The silence detection ensures the agent waits for the candidate to finish speaking before responding.
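The silence-detection idea deserves a closer look, factored out of listen() into a standalone, testable endpoint detector: an utterance is considered finished after N consecutive low-energy chunks. The thresholds below are illustrative, not tuned.

```python
# Energy-based endpoint detection: end-of-utterance is declared after
# N consecutive low-energy chunks. Thresholds are illustrative.

import numpy as np


class EndpointDetector:
    """Marks end-of-utterance after N consecutive low-energy chunks."""

    def __init__(self, silence_threshold: float = 200.0, silence_chunks: int = 8):
        self.silence_threshold = silence_threshold  # mean |amplitude|, int16 PCM
        self.silence_chunks = silence_chunks        # ~1.6s at 0.2s per chunk
        self._count = 0

    def feed(self, chunk: np.ndarray) -> bool:
        """Feed one audio chunk; return True once the utterance has ended."""
        if np.abs(chunk).mean() < self.silence_threshold:
            self._count += 1
        else:
            self._count = 0
        return self._count >= self.silence_chunks


detector = EndpointDetector()
speech = np.full(3200, 1000, dtype=np.int16)    # a loud chunk
silence = np.zeros(3200, dtype=np.int16)        # a quiet chunk

assert detector.feed(speech) is False
assert all(not detector.feed(silence) for _ in range(7))
assert detector.feed(silence) is True
```

For production use you would likely swap the amplitude threshold for a trained VAD model, but a chunk-count hysteresis like this is still useful to avoid cutting the speaker off mid-pause.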


Building a Language Tutor with Qwen

A language tutor needs more than conversation — it needs pronunciation assessment, grammar correction, vocabulary tracking, and adaptive difficulty. Here is the full implementation:

graph TB
    Student[Student speaks] --> ASR[Qwen3-ASR with timestamps]
    ASR --> |Transcript + confidence| Eval[Pronunciation Evaluator]
    ASR --> |Transcript| Grammar[Grammar Checker]
    Eval --> |Pronunciation score| Tutor[Qwen3-Omni Tutor Brain]
    Grammar --> |Corrections| Tutor
    Tutor --> |Feedback text| TTS[Qwen3-TTS]
    TTS --> Speaker[Speaker output]
    Tutor --> |Progress data| Vocab[Vocabulary Tracker]
    Vocab --> |Spaced repetition| Tutor

Language Tutor Agent

"""
Language Tutor Agent using the Qwen voice stack.
Supports conversation practice, pronunciation feedback, grammar correction,
and spaced repetition vocabulary building.
"""

import os
import json
import time
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from openai import OpenAI

DASHSCOPE_API_KEY = os.getenv("DASHSCOPE_API_KEY")


@dataclass
class VocabCard:
    word: str
    translation: str
    context: str
    ease_factor: float = 2.5
    interval_days: int = 1
    next_review: str = ""
    correct_count: int = 0

    def schedule_review(self, quality: int):
        """SM-2 spaced repetition algorithm."""
        if quality >= 3:
            if self.correct_count == 0:
                self.interval_days = 1
            elif self.correct_count == 1:
                self.interval_days = 6
            else:
                self.interval_days = int(self.interval_days * self.ease_factor)
            self.correct_count += 1
        else:
            self.correct_count = 0
            self.interval_days = 1

        self.ease_factor = max(
            1.3,
            self.ease_factor + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)),
        )
        next_dt = datetime.now() + timedelta(days=self.interval_days)
        self.next_review = next_dt.isoformat()


@dataclass
class TutorSession:
    target_language: str = "Japanese"
    native_language: str = "English"
    level: str = "intermediate"
    mode: str = "conversation"  # conversation | pronunciation | vocabulary
    history: list = field(default_factory=list)
    vocab_deck: list = field(default_factory=list)
    pronunciation_scores: list = field(default_factory=list)
    grammar_errors: list = field(default_factory=list)


TUTOR_SYSTEM_PROMPT = """You are a patient, encouraging {target_lang} language tutor.
The student's native language is {native_lang}. Their level is {level}.

Current mode: {mode}

Guidelines:
- In conversation mode: Have natural dialogues in {target_lang}, gently
  correcting mistakes. Mix in new vocabulary appropriate to their level.
- In pronunciation mode: Give specific phonetic feedback. Use IPA when helpful.
- In vocabulary mode: Introduce words in context, quiz the student.
- Always provide corrections in a supportive way.
- Keep responses short for natural voice conversation.
- After correcting, continue the conversation naturally.

If the student makes a grammar error, respond in this format:
[Correction: "wrong phrase" -> "correct phrase" (explanation)]
Then continue naturally.

Recent vocabulary to reinforce: {vocab_words}
"""


def create_client():
    return OpenAI(
        api_key=DASHSCOPE_API_KEY,
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )


def assess_pronunciation(asr_result: dict) -> dict:
    """
    Evaluate pronunciation using ASR confidence scores and forced alignment.
    Uses Qwen3-ForcedAligner for word-level timing and confidence.
    """
    words = asr_result.get("words", [])
    scores = []
    problem_words = []

    for word_info in words:
        confidence = word_info.get("confidence", 1.0)
        scores.append(confidence)
        if confidence < 0.7:
            problem_words.append(
                {
                    "word": word_info.get("word", ""),
                    "confidence": confidence,
                    "start": word_info.get("start", 0),
                    "end": word_info.get("end", 0),
                }
            )

    avg_score = sum(scores) / len(scores) if scores else 0
    return {
        "overall_score": round(avg_score * 100, 1),
        "problem_words": problem_words,
        "word_count": len(words),
    }


def check_grammar(client, text: str, target_lang: str) -> list:
    """Use Qwen3-Omni to identify grammar errors."""
    response = client.chat.completions.create(
        model="qwen3-omni",
        messages=[
            {
                "role": "user",
                "content": f"""Analyze this {target_lang} text for grammar errors.
Text: "{text}"
Return JSON array: [{{"error": "the mistake", "correction": "fixed version",
"rule": "grammar rule"}}]
Return empty array [] if no errors. Only return JSON.""",
            }
        ],
        max_tokens=300,
    )
    try:
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, IndexError):
        return []


def extract_vocabulary(client, text: str, target_lang: str, level: str) -> list:
    """Extract new vocabulary words from the conversation for the deck."""
    response = client.chat.completions.create(
        model="qwen3-omni",
        messages=[
            {
                "role": "user",
                "content": f"""From this {target_lang} text, extract 1-3 vocabulary
words appropriate for a {level} learner that they might not know.
Text: "{text}"
Return JSON array: [{{"word": str, "translation": str, "context": str}}]
Only return JSON.""",
            }
        ],
        max_tokens=200,
    )
    try:
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, IndexError):
        return []


def get_due_vocab(session: TutorSession) -> list:
    """Get vocabulary cards due for review."""
    now = datetime.now().isoformat()
    return [
        card
        for card in session.vocab_deck
        if not card.next_review or card.next_review <= now
    ]


def tutor_respond(client, session: TutorSession, student_text: str) -> str:
    """Generate tutor response with context from all subsystems."""
    # Check grammar
    errors = check_grammar(client, student_text, session.target_language)
    session.grammar_errors.extend(errors)

    # Get vocab words to reinforce
    due_vocab = get_due_vocab(session)
    vocab_words = ", ".join(c.word for c in due_vocab[:5]) if due_vocab else "none"

    session.history.append({"role": "user", "content": student_text})

    system = TUTOR_SYSTEM_PROMPT.format(
        target_lang=session.target_language,
        native_lang=session.native_language,
        level=session.level,
        mode=session.mode,
        vocab_words=vocab_words,
    )

    # Add grammar context if errors found
    if errors:
        error_context = "\n".join(
            f'- "{e["error"]}" should be "{e["correction"]}" ({e["rule"]})'
            for e in errors
        )
        student_text_with_ctx = (
            f"{student_text}\n\n[Grammar errors detected:\n{error_context}]"
        )
        session.history[-1]["content"] = student_text_with_ctx

    messages = [{"role": "system", "content": system}] + session.history[-20:]

    response = client.chat.completions.create(
        model="qwen3-omni", messages=messages, max_tokens=400
    )

    reply = response.choices[0].message.content
    session.history.append({"role": "assistant", "content": reply})

    # Extract new vocabulary from tutor's response
    new_vocab = extract_vocabulary(
        client, reply, session.target_language, session.level
    )
    for v in new_vocab:
        card = VocabCard(
            word=v["word"], translation=v["translation"], context=v["context"]
        )
        if not any(c.word == card.word for c in session.vocab_deck):
            session.vocab_deck.append(card)

    return reply


def run_conversation_practice(client, session: TutorSession):
    """Run a conversation practice session (text-based for demonstration)."""
    session.mode = "conversation"
    print(f"=== Language Tutor: {session.target_language} ===")
    print(f"Level: {session.level} | Mode: {session.mode}")
    print("Type 'quit' to end, 'vocab' to review vocabulary\n")

    # Opening
    opener = tutor_respond(
        client, session, "[Session starts. Begin a natural conversation.]"
    )
    print(f"Tutor: {opener}\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "quit":
            break
        if user_input.lower() == "vocab":
            due = get_due_vocab(session)
            print(f"\n--- Vocabulary Review ({len(due)} cards due) ---")
            for card in due[:5]:
                print(f"  {card.word} = {card.translation} ({card.context})")
            print()
            continue

        response = tutor_respond(client, session, user_input)
        print(f"Tutor: {response}\n")

    # Session summary
    print("\n=== Session Summary ===")
    print(f"Grammar errors corrected: {len(session.grammar_errors)}")
    print(f"Vocabulary cards collected: {len(session.vocab_deck)}")
    for card in session.vocab_deck:
        print(f"  {card.word} - {card.translation}")


if __name__ == "__main__":
    client = create_client()
    session = TutorSession(target_language="Japanese", level="intermediate")
    run_conversation_practice(client, session)

The Language Tutor integrates grammar checking, pronunciation assessment via ASR confidence scores, and a full SM-2 spaced repetition system for vocabulary retention. Each conversation naturally builds the student’s vocabulary deck, and due cards are woven into future tutor prompts to reinforce learning.
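To see what VocabCard.schedule_review actually does over time, here is a standalone trace of three perfect reviews (quality = 5) using the same SM-2 update rules: the interval grows from 1 day to 6 days to roughly two weeks.

```python
# Standalone trace of the SM-2 update rules used by VocabCard:
# three quality-5 reviews in a row.

ease, interval, correct = 2.5, 1, 0
intervals = []
for _ in range(3):
    quality = 5
    if correct == 0:
        interval = 1
    elif correct == 1:
        interval = 6
    else:
        interval = int(interval * ease)
    correct += 1
    # Ease factor rises slightly on perfect recall, floored at 1.3
    ease = max(1.3, ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)))
    intervals.append(interval)

print(intervals)  # [1, 6, 16]
```

A single failed review (quality below 3) resets the interval to 1 day while keeping the card in the deck, which is what makes the schedule self-correcting.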


Building a Voice Tutor with Qwen

The Voice Tutor builds on the Language Tutor by adding voice cloning for a custom tutor persona, adaptive difficulty, and session recording with progress tracking.

"""
Voice Tutor with voice cloning, adaptive difficulty, and progress tracking.
Uses Qwen3-TTS voice cloning to create a consistent tutor persona.
"""

import os
import json
import wave
import time
import numpy as np
import sounddevice as sd
from dataclasses import dataclass, field
from pathlib import Path
from dashscope.audio.tts_v2 import SpeechSynthesizer
from openai import OpenAI

DASHSCOPE_API_KEY = os.getenv("DASHSCOPE_API_KEY")
SESSIONS_DIR = Path("./tutor_sessions")
SESSIONS_DIR.mkdir(exist_ok=True)


@dataclass
class StudentProfile:
    name: str = "Student"
    level: int = 5  # 1-10 scale
    strengths: list = field(default_factory=list)
    weaknesses: list = field(default_factory=list)
    session_count: int = 0
    total_score: float = 0.0
    pronunciation_avg: float = 0.0
    topics_covered: list = field(default_factory=list)


@dataclass
class VoiceTutorConfig:
    tutor_name: str = "Professor Tanaka"
    voice_reference_audio: str = ""  # Path to 3s+ reference audio
    voice_description: str = (
        "A warm, patient female voice with clear enunciation, "
        "speaking at a moderate pace suitable for language learners"
    )
    target_language: str = "Japanese"
    native_language: str = "English"


def create_cloned_voice_synthesizer(config: VoiceTutorConfig):
    """
    Create a TTS synthesizer with a cloned or designed voice.
    Qwen3-TTS supports both voice cloning (from audio) and
    voice design (from text description).
    """
    if config.voice_reference_audio and os.path.exists(config.voice_reference_audio):
        # Voice cloning mode: replicate voice from reference audio
        synthesizer = SpeechSynthesizer(
            model="qwen3-tts-1.7b",
            voice="clone",
            voice_reference_audio=config.voice_reference_audio,
            sample_rate=24000,
        )
    else:
        # Voice design mode: generate voice from description
        synthesizer = SpeechSynthesizer(
            model="qwen3-tts-1.7b",
            voice="design",
            voice_description=config.voice_description,
            sample_rate=24000,
        )
    return synthesizer


def adaptive_difficulty(profile: StudentProfile, recent_scores: list) -> str:
    """Adjust difficulty based on student performance trends."""
    if len(recent_scores) < 3:
        return "maintain"

    avg_recent = sum(recent_scores[-3:]) / 3

    if avg_recent >= 8.5 and profile.level < 10:
        profile.level += 1
        return "increase"
    elif avg_recent <= 4.0 and profile.level > 1:
        profile.level -= 1
        return "decrease"
    return "maintain"


VOICE_TUTOR_PROMPT = """You are {tutor_name}, a voice-based {target_lang} tutor.
Student: {student_name} (Level {level}/10)
Difficulty adjustment: {adjustment}
Student strengths: {strengths}
Student weaknesses: {weaknesses}

Your persona: You are warm, encouraging, and patient. You speak clearly.
Adapt your {target_lang} complexity to level {level}.

For pronunciation feedback:
- If score >= 90: Brief praise, move on
- If score 70-89: Note specific sounds to improve
- If score < 70: Slow down, demonstrate correct pronunciation, have them repeat

Session focus areas: {weaknesses}

Keep responses concise (2-3 sentences) for natural voice conversation."""


def speak_with_cloned_voice(
    synthesizer: SpeechSynthesizer, text: str, save_path: str = None
):
    """Synthesize and play speech, optionally saving to file."""
    audio_data = b""
    for chunk in synthesizer.streaming_call(text):
        audio_data += chunk

    audio_array = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768

    if save_path:
        with wave.open(save_path, "w") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(24000)
            wf.writeframes(audio_data)

    sd.play(audio_array, samplerate=24000)
    sd.wait()
    return audio_data


def record_session_audio(duration: float, sample_rate: int = 16000) -> np.ndarray:
    """Record audio from microphone for a specified duration."""
    print(f"[Recording for {duration}s...]")
    audio = sd.rec(
        int(duration * sample_rate), samplerate=sample_rate, channels=1, dtype="int16"
    )
    sd.wait()
    return audio


def save_progress(profile: StudentProfile, session_dir: Path):
    """Save student progress to disk."""
    progress_file = session_dir / f"{profile.name}_progress.json"
    data = {
        "name": profile.name,
        "level": profile.level,
        "session_count": profile.session_count,
        "average_score": (
            round(profile.total_score / max(profile.session_count, 1), 2)
        ),
        "pronunciation_avg": profile.pronunciation_avg,
        "strengths": profile.strengths,
        "weaknesses": profile.weaknesses,
        "topics_covered": profile.topics_covered,
    }
    with open(progress_file, "w") as f:
        json.dump(data, f, indent=2)
    print(f"Progress saved to {progress_file}")


def run_voice_tutor_session(config: VoiceTutorConfig, profile: StudentProfile):
    """Run an interactive voice tutoring session."""
    client = OpenAI(
        api_key=DASHSCOPE_API_KEY,
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )

    synthesizer = create_cloned_voice_synthesizer(config)
    session_dir = SESSIONS_DIR / f"session_{int(time.time())}"
    session_dir.mkdir(exist_ok=True)

    profile.session_count += 1
    recent_scores = []
    history = []

    print(f"=== Voice Tutor: {config.tutor_name} ===")
    print(f"Student: {profile.name} | Level: {profile.level}/10")
    print(f"Language: {config.target_language}\n")

    # Generate opening based on adaptive difficulty
    adjustment = adaptive_difficulty(profile, recent_scores)
    system = VOICE_TUTOR_PROMPT.format(
        tutor_name=config.tutor_name,
        target_lang=config.target_language,
        student_name=profile.name,
        level=profile.level,
        adjustment=adjustment,
        strengths=", ".join(profile.strengths) or "unknown",
        weaknesses=", ".join(profile.weaknesses) or "to be assessed",
    )

    # Opening greeting
    response = client.chat.completions.create(
        model="qwen3-omni",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": "[New session begins. Greet the student.]"},
        ],
        max_tokens=200,
    )

    greeting = response.choices[0].message.content
    print(f"{config.tutor_name}: {greeting}")
    speak_with_cloned_voice(
        synthesizer, greeting, str(session_dir / "turn_000_tutor.wav")
    )

    history.append({"role": "assistant", "content": greeting})

    turn = 1
    while turn <= 15:  # Max 15 turns per session
        # Record student audio
        student_audio = record_session_audio(duration=10.0)
        audio_path = str(session_dir / f"turn_{turn:03d}_student.wav")
        with wave.open(audio_path, "w") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(16000)
            wf.writeframes(student_audio.tobytes())

        # In production, you would run ASR here and get the transcript
        # student_text = run_asr(student_audio)
        student_text = input("Student (text fallback): ")

        if student_text.lower() in ("quit", "exit", "end"):
            break

        history.append({"role": "user", "content": student_text})

        # Generate tutor response with full context
        adjustment = adaptive_difficulty(profile, recent_scores)
        system = VOICE_TUTOR_PROMPT.format(
            tutor_name=config.tutor_name,
            target_lang=config.target_language,
            student_name=profile.name,
            level=profile.level,
            adjustment=adjustment,
            strengths=", ".join(profile.strengths) or "unknown",
            weaknesses=", ".join(profile.weaknesses) or "to be assessed",
        )

        messages = [{"role": "system", "content": system}] + history[-16:]
        response = client.chat.completions.create(
            model="qwen3-omni", messages=messages, max_tokens=300
        )

        reply = response.choices[0].message.content
        history.append({"role": "assistant", "content": reply})

        print(f"{config.tutor_name}: {reply}")
        speak_with_cloned_voice(
            synthesizer, reply, str(session_dir / f"turn_{turn:03d}_tutor.wav")
        )

        turn += 1

    save_progress(profile, SESSIONS_DIR)
    print(f"\nSession recordings saved to {session_dir}")


if __name__ == "__main__":
    config = VoiceTutorConfig(
        tutor_name="Professor Tanaka",
        voice_description=(
            "A warm, patient female voice with clear enunciation, "
            "speaking at a moderate pace suitable for language learners"
        ),
        target_language="Japanese",
    )
    profile = StudentProfile(name="Alex", level=5)
    run_voice_tutor_session(config, profile)

The Voice Tutor records every turn to WAV files for review, adapts difficulty using a sliding window of recent scores, and uses Qwen3-TTS voice cloning to maintain a consistent tutor persona across sessions. The voice_reference_audio parameter lets you clone any voice from just 3 seconds of sample audio.
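Since save_progress serializes the profile to JSON, the next session needs a matching loader. Here is a minimal sketch of one; the StudentProfile dataclass below is a stand-in mirroring only the fields save_progress writes, and note that because save_progress stores the derived average rather than the running total, the loader rebuilds total_score from it:

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class StudentProfile:
    # Minimal stand-in mirroring the fields save_progress serializes
    name: str
    level: int = 1
    session_count: int = 0
    total_score: float = 0.0
    pronunciation_avg: float = 0.0
    strengths: list = field(default_factory=list)
    weaknesses: list = field(default_factory=list)
    topics_covered: list = field(default_factory=list)


def load_progress(name: str, session_dir: Path) -> StudentProfile:
    """Restore a saved profile, or start fresh if no file exists."""
    progress_file = session_dir / f"{name}_progress.json"
    if not progress_file.exists():
        return StudentProfile(name=name)
    data = json.loads(progress_file.read_text())
    profile = StudentProfile(name=data["name"], level=data["level"])
    profile.session_count = data["session_count"]
    # save_progress stores the derived average; rebuild the running total
    profile.total_score = data["average_score"] * max(data["session_count"], 1)
    profile.pronunciation_avg = data["pronunciation_avg"]
    profile.strengths = data["strengths"]
    profile.weaknesses = data["weaknesses"]
    profile.topics_covered = data["topics_covered"]
    return profile
```

Calling load_progress("Alex", SESSIONS_DIR) at startup then makes level, strengths, and weaknesses persist across sessions instead of resetting each run.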


Advanced Patterns

Streaming Bidirectional Voice Chat

For production-grade voice agents, you need full-duplex streaming — the agent should be able to listen while speaking and handle interruptions naturally.

"""
Streaming bidirectional voice chat with interrupt handling.
Uses async pipelines for simultaneous ASR and TTS.
"""

import asyncio
import numpy as np
import sounddevice as sd
from collections import deque


class StreamingVoiceChat:
    """Full-duplex voice chat with interrupt support."""

    def __init__(self):
        self.is_speaking = False
        self.is_listening = True
        self.interrupt_threshold = 500  # amplitude threshold
        self.audio_buffer = deque(maxlen=100)

    async def asr_stream(self):
        """Continuous ASR streaming coroutine."""
        # In production, connect to Qwen3-ASR WebSocket
        while self.is_listening:
            if self.audio_buffer:
                chunk = self.audio_buffer.popleft()
                amplitude = np.abs(chunk).mean()

                # Detect user interruption during TTS playback
                if self.is_speaking and amplitude > self.interrupt_threshold:
                    print("[Interrupt detected - stopping speech]")
                    self.is_speaking = False
                    sd.stop()
                    yield chunk  # Process the interrupting speech

                elif not self.is_speaking:
                    yield chunk

            await asyncio.sleep(0.01)

    async def tts_stream(self, text: str):
        """Stream TTS output with interrupt awareness."""
        self.is_speaking = True
        # In production, use Qwen3-TTS streaming API
        # Each chunk is played and can be interrupted
        chunks = text.split(". ")  # Simplified chunking
        for chunk in chunks:
            if not self.is_speaking:
                break  # Interrupted
            # synthesize and play chunk
            await asyncio.sleep(0.1)
        self.is_speaking = False

    async def run(self):
        """Main event loop for bidirectional voice chat."""

        def _capture(indata, frames, time_info, status):
            # sounddevice invokes this callback on its own audio thread;
            # deque.append is thread-safe, so we just enqueue raw chunks
            self.audio_buffer.append(indata.copy())

        stream = sd.InputStream(
            samplerate=16000, channels=1, dtype="int16", callback=_capture
        )
        with stream:
            print("Streaming voice chat active. Speak to interact.")
            await self._process_asr()

    async def _process_asr(self):
        async for chunk in self.asr_stream():
            # Process recognized text, generate response, stream TTS
            pass

Emotion Detection and Adaptive Responses

Qwen3-Omni’s audio understanding extends beyond words to prosodic features. You can leverage this for emotion-aware responses:

import json


def detect_emotion_and_adapt(client, audio_description: str, transcript: str) -> dict:
    """Use Qwen3-Omni to detect emotion from speech patterns."""
    response = client.chat.completions.create(
        model="qwen3-omni",
        messages=[
            {
                "role": "user",
                "content": f"""Analyze the speaker's emotional state from this context.
Transcript: "{transcript}"
Speaking pattern: {audio_description}

Return JSON: {{
    "emotion": "confident|nervous|frustrated|confused|enthusiastic",
    "confidence": 0.0-1.0,
    "suggested_tone": "description of how to respond"
}}""",
            }
        ],
        max_tokens=150,
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return {"emotion": "neutral", "confidence": 0.5, "suggested_tone": "balanced"}

Docker Deployment with vLLM

For self-hosted production deployment, here is a Docker Compose configuration that runs the full Qwen voice stack:

# docker-compose.yml
version: "3.8"

services:
  qwen-asr:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen3-ASR-1.7B
      --port 8001
      --max-model-len 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8001:8001"

  qwen-omni:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen3-Omni-30B-A3B-Instruct
      --port 8002
      --tensor-parallel-size 2
      --max-model-len 32768
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    ports:
      - "8002:8002"

  qwen-tts:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen3-TTS-1.7B
      --port 8003
      --max-model-len 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8003:8003"

  voice-agent:
    build: .
    environment:
      - ASR_ENDPOINT=http://qwen-asr:8001/v1
      - LLM_ENDPOINT=http://qwen-omni:8002/v1
      - TTS_ENDPOINT=http://qwen-tts:8003/v1
    depends_on:
      - qwen-asr
      - qwen-omni
      - qwen-tts
    ports:
      - "8080:8080"

This gives you a fully self-contained voice agent stack running on your own GPUs — no API keys, no data leaving your infrastructure, no per-token costs.
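Before routing traffic, the voice-agent container can verify that all three services are up. A minimal health check against vLLM's OpenAI-compatible /models route, reading the same environment variables the compose file sets:

```python
import os
import urllib.request


def check_services(timeout: float = 2.0) -> dict:
    """Ping each vLLM service's OpenAI-compatible /models route."""
    endpoints = {
        "asr": os.environ.get("ASR_ENDPOINT", "http://localhost:8001/v1"),
        "llm": os.environ.get("LLM_ENDPOINT", "http://localhost:8002/v1"),
        "tts": os.environ.get("TTS_ENDPOINT", "http://localhost:8003/v1"),
    }
    status = {}
    for name, base in endpoints.items():
        try:
            with urllib.request.urlopen(f"{base}/models", timeout=timeout) as resp:
                status[name] = resp.status == 200
        except OSError:
            # connection refused, DNS failure, or timeout
            status[name] = False
    return status
```

Running this in a startup loop (or as a Docker healthcheck) ensures the agent waits for the large Omni model, which can take minutes to load weights, before accepting requests.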


Comparison Table

How does the Qwen voice stack compare to the alternatives?

| Feature | Qwen Stack | OpenAI (Whisper + GPT-4o + TTS) | Google (Gemini + Cloud Speech) | ElevenLabs + AssemblyAI |
|---|---|---|---|---|
| Self-hosting | Full (open weights) | Whisper only | No | No |
| ASR Languages | 52 (30 + 22 dialects) | 100+ (Whisper) | 125+ | 99 |
| TTS Languages | 10 | 57 | 40+ | 32 |
| Voice Cloning | Yes (3s, open-source) | No | No | Yes (proprietary) |
| End-to-end Voice | Qwen3-Omni | GPT-4o (closed) | Gemini 2.5 (closed) | No (cascaded) |
| Latency (TTS) | 97ms first packet | ~200ms | ~150ms | ~300ms |
| Cost (1M chars TTS) | Free (self-hosted) | $15.00 | $16.00 | $11-24 |
| Cost (1hr ASR) | Free (self-hosted) | $0.36 | $0.96 | $0.37 |
| Privacy | Full control | Data sent to OpenAI | Data sent to Google | Data sent to third party |
| Fine-tuning | Full access to weights | Limited | No | No |
| Forced Alignment | Built-in (11 langs) | Via Whisper timestamps | Via API | Via API |
| MoE Efficiency | 3B active / 30B total | N/A (dense) | MoE (closed) | N/A |
| License | Apache 2.0 | Proprietary | Proprietary | Proprietary |

Key takeaways: If privacy, cost control, or customization matter to your use case, Qwen is the clear winner. If you need maximum language coverage for ASR or prefer a managed service with no GPU management, OpenAI Whisper or Google Cloud Speech remain strong choices. ElevenLabs offers the best voice cloning quality in a managed service, but Qwen3-TTS is closing the gap rapidly while being fully open-source.


Conclusion

When to Use the Qwen Voice Stack

The Qwen voice ecosystem is the right choice when:

  • Privacy is non-negotiable: Medical, legal, financial, or defense applications where audio cannot leave your infrastructure
  • Cost at scale matters: High-volume voice applications where per-minute API pricing becomes prohibitive
  • Customization is required: Fine-tuning ASR for domain-specific vocabulary, cloning specific voices, or building specialized agent behaviors
  • You need end-to-end voice AI: Qwen3-Omni eliminates the latency and error compounding of cascaded ASR-LLM-TTS pipelines
  • Multilingual is core: With 52 ASR languages, 10 TTS languages, and 119 text languages, Qwen covers most of the world
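The cost argument is easy to make concrete with the comparison table's prices. A rough break-even sketch; the $1.50/hr GPU figure is an assumption (roughly one cloud A100), so adjust it for your hardware:

```python
GPU_HOURLY = 1.50            # assumed self-hosted GPU cost, $/hr
API_ASR_PER_HOUR = 0.36      # OpenAI ASR price, from the table
API_TTS_PER_MCHAR = 15.00    # OpenAI TTS price per 1M chars, from the table


def monthly_savings(asr_hours: float, tts_mchars: float,
                    gpu_count: int = 1, hours_in_month: float = 730.0) -> float:
    """API spend minus GPU spend for one month.

    Positive means self-hosting is already cheaper at this volume.
    """
    api_cost = asr_hours * API_ASR_PER_HOUR + tts_mchars * API_TTS_PER_MCHAR
    gpu_cost = gpu_count * hours_in_month * GPU_HOURLY
    return api_cost - gpu_cost
```

At roughly 1,000 ASR hours plus 50M TTS characters a month, a single always-on GPU breaks even against the API prices above; beyond that, every additional minute is effectively free.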

What is Coming Next

The Qwen team has maintained a rapid release cadence. Qwen3.5-Omni has already been released (March 2026) with Hybrid-Attention MoE across all modalities and SOTA on 215 benchmarks. The trajectory points toward even larger models, more languages for speech output, and deeper agent integration.

The combination of open weights, competitive performance, and a complete ecosystem — from speech recognition to agent frameworks to coding assistants — makes Qwen the most practical choice for developers building voice AI systems in 2026. The code in this guide gives you a working starting point. Clone the voices, build the agents, and ship something real.
