Introduction
The voice AI landscape in 2026 has a clear frontrunner that most developers did not see coming. While OpenAI and Google dominated headlines with GPT-4o and Gemini, Alibaba’s Qwen team quietly assembled the most comprehensive open-source voice ecosystem available today. What makes Qwen different is not just one model — it is an entire interconnected stack: speech recognition (Qwen3-ASR), text-to-speech (Qwen3-TTS), omni-modal reasoning (Qwen3-Omni), an agent framework (Qwen-Agent), and a coding agent (Qwen3-Coder-Next). All open-weight. All self-hostable.
This matters for three reasons. First, privacy: you can run the entire voice pipeline on your own infrastructure without sending a single audio byte to a third-party API. Second, cost: after the initial GPU investment, inference is essentially free at scale. Third, customization: open weights mean you can fine-tune for your domain, clone voices in 3 seconds, and build agent behaviors that closed APIs simply do not support.
In this guide, we will walk through the entire Qwen voice ecosystem, then build three complete systems from scratch: a Voice Interview Agent, a Language Tutor, and a Voice Tutor with voice cloning. Every code block is practical, and every architecture diagram is grounded in production patterns.
The Qwen Voice Ecosystem
Before writing a single line of code, it is worth understanding what each piece does and how they fit together.
graph TB
A[Qwen Voice Ecosystem] --> B[Qwen3-ASR]
A --> C[Qwen3-TTS]
A --> D[Qwen3-Omni]
A --> E[Qwen-Agent]
A --> F[Qwen3-Coder-Next]
B --> B1[Speech to Text]
B --> B2[30 Languages + 22 Chinese Dialects]
B --> B3[Streaming + Offline]
C --> C1[Text to Speech]
C --> C2[Voice Cloning from 3s]
C --> C3[10 Languages]
D --> D1[Thinker-Talker Architecture]
D --> D2[Voice-in Voice-out]
D --> D3[119 Text Languages]
E --> E1[Function Calling + MCP]
E --> E2[Code Interpreter]
E --> E3[RAG + Browser]
F --> F1[Terminal AI Agent]
F --> F2[3B active / 80B total MoE]
F --> F3[256K Context]
Qwen3-Omni: The Flagship
Qwen3-Omni is a natively end-to-end omni-modal LLM. Unlike cascaded systems that chain separate ASR, LLM, and TTS models together, Qwen3-Omni processes audio, images, video, and text in a single forward pass and generates both text and speech output simultaneously.
The Thinker-Talker Architecture
The model is split into two tightly coupled components:
- The Thinker is the reasoning backbone. It ingests all input modalities through specialized encoders — including a native Audio Transformer (AuT) encoder pre-trained on over 100 million hours of audio-visual data. The Thinker performs deep multimodal reasoning and produces rich hidden representations.
- The Talker receives the Thinker’s output and generates contextual speech in real time. It handles prosody, emotion, and natural turn-taking without requiring a separate TTS model.
TMRoPE (Time-aware Multimodal Rotary Position Embedding) is the architectural innovation that makes this work. It aligns different modalities along a shared temporal axis, so the model understands that frame 42 of a video corresponds to second 1.4 of the audio track. This temporal grounding is what enables natural, synchronized voice responses.
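To make the alignment idea concrete, here is a toy sketch of mapping audio tokens and video frames onto one temporal grid; this is not Qwen's implementation, and the 40 ms token step and 2 fps frame rate are illustrative assumptions only.
# Toy illustration of time-aligned position ids (not the actual TMRoPE code).
# Assumption: each audio token covers 40 ms and video frames arrive every 0.5 s.

def temporal_positions(num_tokens: int, step_s: float, grid_s: float = 0.04) -> list[int]:
    """Map each token's timestamp onto a shared temporal grid measured in grid_s units."""
    return [round(i * step_s / grid_s) for i in range(num_tokens)]

audio_pos = temporal_positions(num_tokens=50, step_s=0.04)  # 2 s of audio tokens
video_pos = temporal_positions(num_tokens=4, step_s=0.5)    # 2 s of video at 2 fps
# Tokens from different modalities that occur at the same wall-clock time end up with
# the same or neighboring position ids, which is what lets rotary embeddings relate them in time.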
Key specifications:
| Specification | Value |
|---|---|
| Text languages | 119 |
| Speech input languages | 19 (EN, ZH, KO, JA, DE, RU, IT, FR, ES, PT, MS, NL, ID, TR, VI, Cantonese, AR, UR) |
| Speech output languages | 10 (EN, ZH, FR, DE, RU, IT, ES, PT, JA, KO) |
| Available models | Qwen3-Omni-30B-A3B, Qwen2.5-Omni-7B |
| Architecture | MoE, 30B total / 3B active parameters |
| SOTA benchmarks | 215 audio and audio-visual subtasks |
Compared to GPT-4o and Gemini 2.5, Qwen3-Omni offers comparable quality on most benchmarks while being fully open-weight and self-hostable. The MoE architecture means you get 30B-class reasoning at 3B-class inference cost.
Qwen3-TTS: Text-to-Speech
Qwen3-TTS is a dedicated speech synthesis model series available in two sizes: 0.6B and 1.7B parameters. It is not just a basic TTS engine — it supports three distinct modes of voice control:
- Voice Cloning: Provide as little as 3 seconds of reference audio, and Qwen3-TTS will replicate the speaker’s voice characteristics, timbre, and speaking style.
- Voice Design: Describe the voice you want in natural language — “a warm, authoritative male voice with a slight British accent, speaking at a moderate pace” — and the model generates speech matching that description.
- Preset Voices: Use built-in voice profiles for quick prototyping.
Performance numbers that matter:
- WER (Word Error Rate): 1.835% averaged across 10 languages — this is exceptionally low for an open-source TTS model
- First-packet latency: 97ms (0.6B) / 101ms (1.7B) — fast enough for real-time conversation
- Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, plus Chinese dialect support
- Streaming: Full streaming support with chunked audio output
Available APIs: DashScope (Alibaba Cloud), Replicate, fal.ai, and self-hosted via vLLM or the official inference server.
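As a quick illustration of the preset-voice mode via DashScope, a minimal synthesis call might look like the sketch below; the model id, voice name, and output format are assumptions to verify against the current DashScope documentation.
import os
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer

dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# Preset-voice synthesis: one call in, audio bytes out (model and voice ids are assumptions).
synthesizer = SpeechSynthesizer(model="qwen3-tts-1.7b", voice="ethan")
audio_bytes = synthesizer.call("Welcome to the Qwen voice stack.")
with open("welcome.mp3", "wb") as f:
    f.write(audio_bytes)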
Qwen3-ASR: Speech-to-Text
Qwen3-ASR complements TTS with equally capable speech recognition, available in 0.6B and 1.7B variants:
- 52 languages and dialects supported (30 languages + 22 Chinese dialects)
- Unified streaming and offline inference: The same model handles both real-time microphone input and batch file processing
- Forced alignment: Qwen3-ForcedAligner-0.6B provides word-level timestamps in 11 languages — critical for pronunciation assessment and subtitle generation
- Speed: The 0.6B model transcribes 2,000 seconds of speech in 1 second at a concurrency of 128
- TTFT (Time to First Token): As low as 92ms for the 0.6B model
The 1.7B version achieves state-of-the-art performance among open-source ASR models and is competitive with leading hosted services such as OpenAI's Whisper API and Google Cloud Speech-to-Text.
DashScope WebSocket API provides a production-ready streaming interface:
import dashscope
from dashscope.audio.asr import Recognition
recognition = Recognition(
model="qwen3-asr-1.7b",
format="pcm",
sample_rate=16000,
    callback=on_result,  # a RecognitionCallback instance that receives partial and final transcripts
)
recognition.start()
# Feed audio chunks in real-time
recognition.send_audio(audio_chunk)
recognition.stop()
Qwen-Agent Framework
Qwen-Agent is the orchestration layer that turns individual models into functional AI agents. It provides:
- Function Calling: Native support for parallel function calls with structured tool definitions
- MCP (Model Context Protocol): Deep integration with MCP servers via stdio, SSE, or streamable HTTP — enabling file system access, database queries, memory management, and more
- Code Interpreter: Sandboxed Python execution for data analysis and computation
- RAG: Built-in retrieval-augmented generation for knowledge-grounded responses
- Chrome Extension & Browser Assistant: Web-aware agents that can read and interact with browser content
For voice applications, Qwen-Agent provides the glue between ASR input, LLM reasoning, and TTS output, handling conversation state, tool calls, and streaming coordination.
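A minimal Qwen-Agent assistant with one built-in tool looks roughly like the sketch below, following the patterns in the Qwen-Agent README; the model name and the DashScope backend are assumptions you would swap for your own deployment (an OpenAI-compatible model_server URL also works).
from qwen_agent.agents import Assistant

# LLM backend: DashScope here (reads DASHSCOPE_API_KEY from the environment);
# any OpenAI-compatible endpoint can be supplied via model_server instead.
llm_cfg = {"model": "qwen-max", "model_server": "dashscope"}

bot = Assistant(
    llm=llm_cfg,
    function_list=["code_interpreter"],  # built-in sandboxed Python tool
    system_message="You are a voice assistant. Keep answers to one or two sentences.",
)

messages = [{"role": "user", "content": "What is 17 * 23? Use the code interpreter."}]
response = []
for response in bot.run(messages=messages):
    pass  # run() streams progressively longer lists of response messages
if response:
    print(response[-1]["content"])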
Qwen3-Coder-Next and Qwen Code
While not directly voice-related, Qwen3-Coder-Next deserves mention because it powers the coding agent ecosystem:
- Architecture: MoE with 3B active / 80B total parameters, 256K context window
- Training: 800K executable tasks with environment interaction and reinforcement learning
- Qwen Code: An open-source terminal AI agent (similar to Claude Code) optimized for Qwen models
- Compatibility: Works with Claude Code, Cline, Kilo, Trae, and other agent scaffolds
This is relevant because you can use Qwen Code to build and iterate on your voice agent code directly from the terminal.
Building a Voice Interview Agent with Qwen
Let us build a complete voice-based technical interview system. The architecture uses three Qwen models in a pipeline:
graph LR
MIC[Microphone] --> ASR[Qwen3-ASR]
ASR --> |Transcript| LLM[Qwen3-Omni]
LLM --> |Response text| TTS[Qwen3-TTS]
TTS --> SPK[Speaker]
LLM --> |Score + Notes| DB[(Session DB)]
DB --> |History| LLM
Project Setup
pip install dashscope sounddevice numpy openai
Core Interview Agent
"""
Voice Interview Agent using Qwen3-ASR + Qwen3-Omni + Qwen3-TTS
Conducts technical interviews with real-time voice interaction.
"""
import os
import json
import queue
import threading
import numpy as np
import sounddevice as sd
from dataclasses import dataclass, field
from dashscope.audio.asr import Recognition, RecognitionCallback
from dashscope.audio.tts_v2 import SpeechSynthesizer
from openai import OpenAI
DASHSCOPE_API_KEY = os.getenv("DASHSCOPE_API_KEY")
SAMPLE_RATE = 16000
CHANNELS = 1
@dataclass
class InterviewSession:
role: str = "Senior Python Developer"
difficulty: str = "medium"
history: list = field(default_factory=list)
scores: list = field(default_factory=list)
current_topic: str = ""
questions_asked: int = 0
max_questions: int = 8
SYSTEM_PROMPT = """You are an experienced technical interviewer conducting a
{role} interview at {difficulty} difficulty.
Rules:
- Ask one question at a time, then wait for the candidate's response
- Start with introductions, then move to technical questions
- Adapt question difficulty based on candidate performance
- After each answer, briefly acknowledge it before the next question
- Track topics covered: algorithms, system design, language specifics
- Be professional but conversational -- this is a voice interview
- Keep responses under 3 sentences for natural conversation flow
Current topic: {topic}
Questions asked: {asked}/{max}
"""
class ASRHandler(RecognitionCallback):
"""Handles real-time ASR results from Qwen3-ASR."""
def __init__(self):
self.transcript_queue = queue.Queue()
self.partial = ""
def on_partial(self, result):
self.partial = result.get_sentence().get("text", "")
def on_complete(self, result):
text = result.get_sentence().get("text", "")
if text.strip():
self.transcript_queue.put(text)
self.partial = ""
def on_error(self, error):
print(f"ASR Error: {error}")
def create_llm_client():
"""Create OpenAI-compatible client for Qwen3-Omni."""
return OpenAI(
api_key=DASHSCOPE_API_KEY,
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
def get_interviewer_response(client, session: InterviewSession, candidate_text: str):
"""Generate interviewer response using Qwen3-Omni."""
session.history.append({"role": "user", "content": candidate_text})
system = SYSTEM_PROMPT.format(
role=session.role,
difficulty=session.difficulty,
topic=session.current_topic,
asked=session.questions_asked,
max=session.max_questions,
)
messages = [{"role": "system", "content": system}] + session.history
response = client.chat.completions.create(
model="qwen3-omni", messages=messages, max_tokens=300
)
reply = response.choices[0].message.content
session.history.append({"role": "assistant", "content": reply})
session.questions_asked += 1
return reply
def speak(text: str, voice: str = "ethan"):
"""Convert text to speech using Qwen3-TTS and play it."""
synthesizer = SpeechSynthesizer(
model="qwen3-tts-1.7b", voice=voice, sample_rate=24000
)
audio_data = b""
for chunk in synthesizer.streaming_call(text):
audio_data += chunk
audio_array = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768
sd.play(audio_array, samplerate=24000)
sd.wait()
def listen(asr_handler: ASRHandler, timeout: float = 30.0) -> str:
"""Listen for speech input via Qwen3-ASR, return transcript."""
recognition = Recognition(
model="qwen3-asr-1.7b",
format="pcm",
sample_rate=SAMPLE_RATE,
callback=asr_handler,
)
recognition.start()
audio_queue = queue.Queue()
def audio_callback(indata, frames, time_info, status):
audio_queue.put(indata.copy())
with sd.InputStream(
samplerate=SAMPLE_RATE,
channels=CHANNELS,
dtype="int16",
callback=audio_callback,
blocksize=3200,
):
silence_count = 0
while silence_count < int(timeout * SAMPLE_RATE / 3200):
try:
chunk = audio_queue.get(timeout=1.0)
recognition.send_audio(chunk.tobytes())
amplitude = np.abs(chunk).mean()
if amplitude < 200:
silence_count += 1
else:
silence_count = 0
except queue.Empty:
break
recognition.stop()
try:
return asr_handler.transcript_queue.get(timeout=2.0)
except queue.Empty:
return ""
def score_answer(client, question: str, answer: str) -> dict:
"""Score a candidate's answer on multiple dimensions."""
scoring_prompt = f"""Score this interview answer on a scale of 1-10.
Question: {question}
Answer: {answer}
Return JSON: {{"relevance": int, "depth": int, "clarity": int, "overall": int, "notes": str}}
Only return the JSON, nothing else."""
response = client.chat.completions.create(
model="qwen3-omni",
messages=[{"role": "user", "content": scoring_prompt}],
max_tokens=200,
)
try:
return json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
return {"relevance": 5, "depth": 5, "clarity": 5, "overall": 5, "notes": ""}
def run_interview():
"""Main interview loop."""
session = InterviewSession()
client = create_llm_client()
asr_handler = ASRHandler()
print("=== Voice Interview Agent ===")
print(f"Role: {session.role} | Difficulty: {session.difficulty}")
print("Starting interview...\n")
# Opening statement
opening = get_interviewer_response(
client, session, "[Interview begins. Greet the candidate.]"
)
print(f"Interviewer: {opening}")
speak(opening)
while session.questions_asked < session.max_questions:
# Listen to candidate
print("\n[Listening...]")
candidate_response = listen(asr_handler)
if not candidate_response:
speak("I didn't catch that. Could you repeat?")
continue
print(f"Candidate: {candidate_response}")
# Score the previous answer
if session.questions_asked > 1:
            last_q = session.history[-1]["content"]  # the interviewer's most recent question
score = score_answer(client, last_q, candidate_response)
session.scores.append(score)
print(f" [Score: {score.get('overall', 'N/A')}/10]")
# Generate next interviewer response
reply = get_interviewer_response(client, session, candidate_response)
print(f"Interviewer: {reply}")
speak(reply)
# Final summary
if session.scores:
avg = sum(s.get("overall", 5) for s in session.scores) / len(session.scores)
print(f"\n=== Interview Complete ===")
print(f"Average Score: {avg:.1f}/10")
print(f"Questions Asked: {session.questions_asked}")
for i, s in enumerate(session.scores, 1):
print(f" Q{i}: {s.get('overall', 'N/A')}/10 - {s.get('notes', '')}")
if __name__ == "__main__":
run_interview()
This agent captures microphone audio, transcribes it with Qwen3-ASR in real time, feeds the transcript to Qwen3-Omni for intelligent interviewer responses, scores each answer, and speaks the response back through Qwen3-TTS. The silence detection ensures the agent waits for the candidate to finish speaking before responding.
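One refinement worth noting, sketched below under the same 3,200-sample block size and amplitude threshold used above (it is not part of the agent as written): ending the turn after a fixed stretch of continuous silence rather than a single long timeout gives snappier turn-taking.
# End-of-turn heuristic: stop listening after ~1.5 s of continuous silence.
BLOCK_SECONDS = 3200 / SAMPLE_RATE              # 0.2 s of audio per block
END_OF_TURN_BLOCKS = int(1.5 / BLOCK_SECONDS)   # ~7 consecutive quiet blocks

def is_end_of_turn(recent_amplitudes: list[float], threshold: float = 200.0) -> bool:
    """True once the most recent blocks were all below the silence threshold."""
    tail = recent_amplitudes[-END_OF_TURN_BLOCKS:]
    return len(tail) == END_OF_TURN_BLOCKS and all(a < threshold for a in tail)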
Building a Language Tutor with Qwen
A language tutor needs more than conversation — it needs pronunciation assessment, grammar correction, vocabulary tracking, and adaptive difficulty. Here is the full implementation:
graph TB
Student[Student speaks] --> ASR[Qwen3-ASR with timestamps]
ASR --> |Transcript + confidence| Eval[Pronunciation Evaluator]
ASR --> |Transcript| Grammar[Grammar Checker]
Eval --> |Pronunciation score| Tutor[Qwen3-Omni Tutor Brain]
Grammar --> |Corrections| Tutor
Tutor --> |Feedback text| TTS[Qwen3-TTS]
TTS --> Speaker[Speaker output]
Tutor --> |Progress data| Vocab[Vocabulary Tracker]
Vocab --> |Spaced repetition| Tutor
Language Tutor Agent
"""
Language Tutor Agent using the Qwen voice stack.
Supports conversation practice, pronunciation feedback, grammar correction,
and spaced repetition vocabulary building.
"""
import os
import json
import time
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from openai import OpenAI
DASHSCOPE_API_KEY = os.getenv("DASHSCOPE_API_KEY")
@dataclass
class VocabCard:
word: str
translation: str
context: str
ease_factor: float = 2.5
interval_days: int = 1
next_review: str = ""
correct_count: int = 0
def schedule_review(self, quality: int):
"""SM-2 spaced repetition algorithm."""
if quality >= 3:
if self.correct_count == 0:
self.interval_days = 1
elif self.correct_count == 1:
self.interval_days = 6
else:
self.interval_days = int(self.interval_days * self.ease_factor)
self.correct_count += 1
else:
self.correct_count = 0
self.interval_days = 1
self.ease_factor = max(
1.3,
self.ease_factor + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)),
)
next_dt = datetime.now() + timedelta(days=self.interval_days)
self.next_review = next_dt.isoformat()
@dataclass
class TutorSession:
target_language: str = "Japanese"
native_language: str = "English"
level: str = "intermediate"
mode: str = "conversation" # conversation | pronunciation | vocabulary
history: list = field(default_factory=list)
vocab_deck: list = field(default_factory=list)
pronunciation_scores: list = field(default_factory=list)
grammar_errors: list = field(default_factory=list)
TUTOR_SYSTEM_PROMPT = """You are a patient, encouraging {target_lang} language tutor.
The student's native language is {native_lang}. Their level is {level}.
Current mode: {mode}
Guidelines:
- In conversation mode: Have natural dialogues in {target_lang}, gently
correcting mistakes. Mix in new vocabulary appropriate to their level.
- In pronunciation mode: Give specific phonetic feedback. Use IPA when helpful.
- In vocabulary mode: Introduce words in context, quiz the student.
- Always provide corrections in a supportive way.
- Keep responses short for natural voice conversation.
- After correcting, continue the conversation naturally.
If the student makes a grammar error, respond in this format:
[Correction: "wrong phrase" -> "correct phrase" (explanation)]
Then continue naturally.
Recent vocabulary to reinforce: {vocab_words}
"""
def create_client():
return OpenAI(
api_key=DASHSCOPE_API_KEY,
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
def assess_pronunciation(asr_result: dict) -> dict:
"""
    Evaluate pronunciation from word-level ASR confidence scores.
    Word timing fields (e.g., from Qwen3-ForcedAligner alignment) can supply start/end.
"""
words = asr_result.get("words", [])
scores = []
problem_words = []
for word_info in words:
confidence = word_info.get("confidence", 1.0)
scores.append(confidence)
if confidence < 0.7:
problem_words.append(
{
"word": word_info.get("word", ""),
"confidence": confidence,
"start": word_info.get("start", 0),
"end": word_info.get("end", 0),
}
)
avg_score = sum(scores) / len(scores) if scores else 0
return {
"overall_score": round(avg_score * 100, 1),
"problem_words": problem_words,
"word_count": len(words),
}
def check_grammar(client, text: str, target_lang: str) -> list:
"""Use Qwen3-Omni to identify grammar errors."""
response = client.chat.completions.create(
model="qwen3-omni",
messages=[
{
"role": "user",
"content": f"""Analyze this {target_lang} text for grammar errors.
Text: "{text}"
Return JSON array: [{{"error": "the mistake", "correction": "fixed version",
"rule": "grammar rule"}}]
Return empty array [] if no errors. Only return JSON.""",
}
],
max_tokens=300,
)
try:
return json.loads(response.choices[0].message.content)
except (json.JSONDecodeError, IndexError):
return []
def extract_vocabulary(client, text: str, target_lang: str, level: str) -> list:
"""Extract new vocabulary words from the conversation for the deck."""
response = client.chat.completions.create(
model="qwen3-omni",
messages=[
{
"role": "user",
"content": f"""From this {target_lang} text, extract 1-3 vocabulary
words appropriate for a {level} learner that they might not know.
Text: "{text}"
Return JSON array: [{{"word": str, "translation": str, "context": str}}]
Only return JSON.""",
}
],
max_tokens=200,
)
try:
return json.loads(response.choices[0].message.content)
except (json.JSONDecodeError, IndexError):
return []
def get_due_vocab(session: TutorSession) -> list:
"""Get vocabulary cards due for review."""
now = datetime.now().isoformat()
return [
card
for card in session.vocab_deck
if not card.next_review or card.next_review <= now
]
def tutor_respond(client, session: TutorSession, student_text: str) -> str:
"""Generate tutor response with context from all subsystems."""
# Check grammar
errors = check_grammar(client, student_text, session.target_language)
session.grammar_errors.extend(errors)
# Get vocab words to reinforce
due_vocab = get_due_vocab(session)
vocab_words = ", ".join(c.word for c in due_vocab[:5]) if due_vocab else "none"
session.history.append({"role": "user", "content": student_text})
system = TUTOR_SYSTEM_PROMPT.format(
target_lang=session.target_language,
native_lang=session.native_language,
level=session.level,
mode=session.mode,
vocab_words=vocab_words,
)
# Add grammar context if errors found
if errors:
error_context = "\n".join(
f'- "{e["error"]}" should be "{e["correction"]}" ({e["rule"]})'
for e in errors
)
student_text_with_ctx = (
f"{student_text}\n\n[Grammar errors detected:\n{error_context}]"
)
session.history[-1]["content"] = student_text_with_ctx
messages = [{"role": "system", "content": system}] + session.history[-20:]
response = client.chat.completions.create(
model="qwen3-omni", messages=messages, max_tokens=400
)
reply = response.choices[0].message.content
session.history.append({"role": "assistant", "content": reply})
# Extract new vocabulary from tutor's response
new_vocab = extract_vocabulary(
client, reply, session.target_language, session.level
)
for v in new_vocab:
card = VocabCard(
word=v["word"], translation=v["translation"], context=v["context"]
)
if not any(c.word == card.word for c in session.vocab_deck):
session.vocab_deck.append(card)
return reply
def run_conversation_practice(client, session: TutorSession):
"""Run a conversation practice session (text-based for demonstration)."""
session.mode = "conversation"
print(f"=== Language Tutor: {session.target_language} ===")
print(f"Level: {session.level} | Mode: {session.mode}")
print("Type 'quit' to end, 'vocab' to review vocabulary\n")
# Opening
opener = tutor_respond(
client, session, "[Session starts. Begin a natural conversation.]"
)
print(f"Tutor: {opener}\n")
while True:
user_input = input("You: ").strip()
if user_input.lower() == "quit":
break
if user_input.lower() == "vocab":
due = get_due_vocab(session)
print(f"\n--- Vocabulary Review ({len(due)} cards due) ---")
for card in due[:5]:
print(f" {card.word} = {card.translation} ({card.context})")
print()
continue
response = tutor_respond(client, session, user_input)
print(f"Tutor: {response}\n")
# Session summary
print("\n=== Session Summary ===")
print(f"Grammar errors corrected: {len(session.grammar_errors)}")
print(f"Vocabulary cards collected: {len(session.vocab_deck)}")
for card in session.vocab_deck:
print(f" {card.word} - {card.translation}")
if __name__ == "__main__":
client = create_client()
session = TutorSession(target_language="Japanese", level="intermediate")
run_conversation_practice(client, session)
The Language Tutor integrates grammar checking, pronunciation assessment via ASR confidence scores, and a full SM-2 spaced repetition system for vocabulary retention. Each conversation naturally builds the student’s vocabulary deck, and due cards are woven into future tutor prompts to reinforce learning.
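To see the scheduler in action, here is a small usage sketch of the VocabCard defined above, with a hypothetical card and three successful reviews:
# Hypothetical card; quality grades run 0-5 as in SM-2, where >= 3 counts as a successful recall.
card = VocabCard(word="kippu", translation="ticket", context="Kippu o kaimashita.")
for quality in (5, 4, 5):
    card.schedule_review(quality)
    print(card.interval_days, round(card.ease_factor, 2), card.next_review)
# Review intervals grow 1 -> 6 -> 15 days while the ease factor drifts with answer quality.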
Building a Voice Tutor with Qwen
The Voice Tutor builds on the Language Tutor by adding voice cloning for a custom tutor persona, adaptive difficulty, and session recording with progress tracking.
"""
Voice Tutor with voice cloning, adaptive difficulty, and progress tracking.
Uses Qwen3-TTS voice cloning to create a consistent tutor persona.
"""
import os
import json
import wave
import time
import numpy as np
import sounddevice as sd
from dataclasses import dataclass, field
from pathlib import Path
from dashscope.audio.tts_v2 import SpeechSynthesizer
from openai import OpenAI
DASHSCOPE_API_KEY = os.getenv("DASHSCOPE_API_KEY")
SESSIONS_DIR = Path("./tutor_sessions")
SESSIONS_DIR.mkdir(exist_ok=True)
@dataclass
class StudentProfile:
name: str = "Student"
level: int = 5 # 1-10 scale
strengths: list = field(default_factory=list)
weaknesses: list = field(default_factory=list)
session_count: int = 0
total_score: float = 0.0
pronunciation_avg: float = 0.0
topics_covered: list = field(default_factory=list)
@dataclass
class VoiceTutorConfig:
tutor_name: str = "Professor Tanaka"
voice_reference_audio: str = "" # Path to 3s+ reference audio
voice_description: str = (
"A warm, patient female voice with clear enunciation, "
"speaking at a moderate pace suitable for language learners"
)
target_language: str = "Japanese"
native_language: str = "English"
def create_cloned_voice_synthesizer(config: VoiceTutorConfig):
"""
Create a TTS synthesizer with a cloned or designed voice.
Qwen3-TTS supports both voice cloning (from audio) and
voice design (from text description).
"""
if config.voice_reference_audio and os.path.exists(config.voice_reference_audio):
# Voice cloning mode: replicate voice from reference audio
synthesizer = SpeechSynthesizer(
model="qwen3-tts-1.7b",
voice="clone",
voice_reference_audio=config.voice_reference_audio,
sample_rate=24000,
)
else:
# Voice design mode: generate voice from description
synthesizer = SpeechSynthesizer(
model="qwen3-tts-1.7b",
voice="design",
voice_description=config.voice_description,
sample_rate=24000,
)
return synthesizer
def adaptive_difficulty(profile: StudentProfile, recent_scores: list) -> str:
"""Adjust difficulty based on student performance trends."""
if len(recent_scores) < 3:
return "maintain"
avg_recent = sum(recent_scores[-3:]) / 3
if avg_recent >= 8.5 and profile.level < 10:
profile.level += 1
return "increase"
elif avg_recent <= 4.0 and profile.level > 1:
profile.level -= 1
return "decrease"
return "maintain"
VOICE_TUTOR_PROMPT = """You are {tutor_name}, a voice-based {target_lang} tutor.
Student: {student_name} (Level {level}/10)
Difficulty adjustment: {adjustment}
Student strengths: {strengths}
Student weaknesses: {weaknesses}
Your persona: You are warm, encouraging, and patient. You speak clearly.
Adapt your {target_lang} complexity to level {level}.
For pronunciation feedback:
- If score >= 90: Brief praise, move on
- If score 70-89: Note specific sounds to improve
- If score < 70: Slow down, demonstrate correct pronunciation, have them repeat
Session focus areas: {weaknesses}
Keep responses concise (2-3 sentences) for natural voice conversation."""
def speak_with_cloned_voice(
synthesizer: SpeechSynthesizer, text: str, save_path: str = None
):
"""Synthesize and play speech, optionally saving to file."""
audio_data = b""
for chunk in synthesizer.streaming_call(text):
audio_data += chunk
audio_array = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768
if save_path:
with wave.open(save_path, "w") as wf:
wf.setnchannels(1)
wf.setsampwidth(2)
wf.setframerate(24000)
wf.writeframes(audio_data)
sd.play(audio_array, samplerate=24000)
sd.wait()
return audio_data
def record_session_audio(duration: float, sample_rate: int = 16000) -> np.ndarray:
"""Record audio from microphone for a specified duration."""
print(f"[Recording for {duration}s...]")
audio = sd.rec(
int(duration * sample_rate), samplerate=sample_rate, channels=1, dtype="int16"
)
sd.wait()
return audio
def save_progress(profile: StudentProfile, session_dir: Path):
"""Save student progress to disk."""
progress_file = session_dir / f"{profile.name}_progress.json"
data = {
"name": profile.name,
"level": profile.level,
"session_count": profile.session_count,
"average_score": (
round(profile.total_score / max(profile.session_count, 1), 2)
),
"pronunciation_avg": profile.pronunciation_avg,
"strengths": profile.strengths,
"weaknesses": profile.weaknesses,
"topics_covered": profile.topics_covered,
}
with open(progress_file, "w") as f:
json.dump(data, f, indent=2)
print(f"Progress saved to {progress_file}")
def run_voice_tutor_session(config: VoiceTutorConfig, profile: StudentProfile):
"""Run an interactive voice tutoring session."""
client = OpenAI(
api_key=DASHSCOPE_API_KEY,
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
synthesizer = create_cloned_voice_synthesizer(config)
session_dir = SESSIONS_DIR / f"session_{int(time.time())}"
session_dir.mkdir(exist_ok=True)
profile.session_count += 1
recent_scores = []
history = []
print(f"=== Voice Tutor: {config.tutor_name} ===")
print(f"Student: {profile.name} | Level: {profile.level}/10")
print(f"Language: {config.target_language}\n")
# Generate opening based on adaptive difficulty
adjustment = adaptive_difficulty(profile, recent_scores)
system = VOICE_TUTOR_PROMPT.format(
tutor_name=config.tutor_name,
target_lang=config.target_language,
student_name=profile.name,
level=profile.level,
adjustment=adjustment,
strengths=", ".join(profile.strengths) or "unknown",
weaknesses=", ".join(profile.weaknesses) or "to be assessed",
)
# Opening greeting
response = client.chat.completions.create(
model="qwen3-omni",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": "[New session begins. Greet the student.]"},
],
max_tokens=200,
)
greeting = response.choices[0].message.content
print(f"{config.tutor_name}: {greeting}")
speak_with_cloned_voice(
synthesizer, greeting, str(session_dir / "turn_000_tutor.wav")
)
history.append({"role": "assistant", "content": greeting})
turn = 1
while turn <= 15: # Max 15 turns per session
# Record student audio
student_audio = record_session_audio(duration=10.0)
audio_path = str(session_dir / f"turn_{turn:03d}_student.wav")
with wave.open(audio_path, "w") as wf:
wf.setnchannels(1)
wf.setsampwidth(2)
wf.setframerate(16000)
wf.writeframes(student_audio.tobytes())
# In production, you would run ASR here and get the transcript
# student_text = run_asr(student_audio)
student_text = input("Student (text fallback): ")
if student_text.lower() in ("quit", "exit", "end"):
break
history.append({"role": "user", "content": student_text})
# Generate tutor response with full context
        # recent_scores stays empty in this sketch; in production, append a per-turn
        # score (pronunciation or answer quality) so adaptive_difficulty can actually react
        adjustment = adaptive_difficulty(profile, recent_scores)
system = VOICE_TUTOR_PROMPT.format(
tutor_name=config.tutor_name,
target_lang=config.target_language,
student_name=profile.name,
level=profile.level,
adjustment=adjustment,
strengths=", ".join(profile.strengths) or "unknown",
weaknesses=", ".join(profile.weaknesses) or "to be assessed",
)
messages = [{"role": "system", "content": system}] + history[-16:]
response = client.chat.completions.create(
model="qwen3-omni", messages=messages, max_tokens=300
)
reply = response.choices[0].message.content
history.append({"role": "assistant", "content": reply})
print(f"{config.tutor_name}: {reply}")
speak_with_cloned_voice(
synthesizer, reply, str(session_dir / f"turn_{turn:03d}_tutor.wav")
)
turn += 1
save_progress(profile, SESSIONS_DIR)
print(f"\nSession recordings saved to {session_dir}")
if __name__ == "__main__":
config = VoiceTutorConfig(
tutor_name="Professor Tanaka",
voice_description=(
"A warm, patient female voice with clear enunciation, "
"speaking at a moderate pace suitable for language learners"
),
target_language="Japanese",
)
profile = StudentProfile(name="Alex", level=5)
run_voice_tutor_session(config, profile)
The Voice Tutor records every turn to WAV files for review, adapts difficulty using a sliding window of recent scores, and uses Qwen3-TTS voice cloning to maintain a consistent tutor persona across sessions. The voice_reference_audio parameter lets you clone any voice from just 3 seconds of sample audio.
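If you have a short reference clip, switching from voice design to voice cloning is just a config change; the file path below is hypothetical, and create_cloned_voice_synthesizer picks cloning mode automatically when the file exists.
# Clone a specific tutor voice from a short reference recording (path is hypothetical).
config = VoiceTutorConfig(
    tutor_name="Professor Tanaka",
    voice_reference_audio="voices/tanaka_reference_3s.wav",
    target_language="Japanese",
)
profile = StudentProfile(name="Alex", level=5)
run_voice_tutor_session(config, profile)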
Advanced Patterns
Streaming Bidirectional Voice Chat
For production-grade voice agents, you need full-duplex streaming — the agent should be able to listen while speaking and handle interruptions naturally.
"""
Streaming bidirectional voice chat with interrupt handling.
Uses async pipelines for simultaneous ASR and TTS.
"""
import asyncio
import numpy as np
import sounddevice as sd
from collections import deque
class StreamingVoiceChat:
"""Full-duplex voice chat with interrupt support."""
def __init__(self):
self.is_speaking = False
self.is_listening = True
self.interrupt_threshold = 500 # amplitude threshold
self.audio_buffer = deque(maxlen=100)
async def asr_stream(self):
"""Continuous ASR streaming coroutine."""
# In production, connect to Qwen3-ASR WebSocket
while self.is_listening:
if self.audio_buffer:
chunk = self.audio_buffer.popleft()
amplitude = np.abs(chunk).mean()
# Detect user interruption during TTS playback
if self.is_speaking and amplitude > self.interrupt_threshold:
print("[Interrupt detected - stopping speech]")
self.is_speaking = False
sd.stop()
yield chunk # Process the interrupting speech
elif not self.is_speaking:
yield chunk
await asyncio.sleep(0.01)
async def tts_stream(self, text: str):
"""Stream TTS output with interrupt awareness."""
self.is_speaking = True
# In production, use Qwen3-TTS streaming API
# Each chunk is played and can be interrupted
chunks = text.split(". ") # Simplified chunking
for chunk in chunks:
if not self.is_speaking:
break # Interrupted
# synthesize and play chunk
await asyncio.sleep(0.1)
self.is_speaking = False
async def run(self):
"""Main event loop for bidirectional voice chat."""
asr_task = asyncio.create_task(self._process_asr())
# Audio capture runs in a separate thread via sounddevice
print("Streaming voice chat active. Speak to interact.")
await asr_task
async def _process_asr(self):
async for chunk in self.asr_stream():
# Process recognized text, generate response, stream TTS
pass
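Wiring the class above to a live microphone is mostly a matter of feeding chunks into audio_buffer; a minimal sketch using the same sounddevice and asyncio imports:
def main():
    chat = StreamingVoiceChat()

    def mic_callback(indata, frames, time_info, status):
        # int16 chunks land in the shared buffer and are consumed by asr_stream()
        chat.audio_buffer.append(indata.copy())

    with sd.InputStream(samplerate=16000, channels=1, dtype="int16",
                        blocksize=3200, callback=mic_callback):
        asyncio.run(chat.run())

if __name__ == "__main__":
    main()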
Emotion Detection and Adaptive Responses
Qwen3-Omni’s audio understanding extends beyond words to prosodic features. You can leverage this for emotion-aware responses:
import json

def detect_emotion_and_adapt(client, audio_description: str, transcript: str) -> dict:
"""Use Qwen3-Omni to detect emotion from speech patterns."""
response = client.chat.completions.create(
model="qwen3-omni",
messages=[
{
"role": "user",
"content": f"""Analyze the speaker's emotional state from this context.
Transcript: "{transcript}"
Speaking pattern: {audio_description}
Return JSON: {{
"emotion": "confident|nervous|frustrated|confused|enthusiastic",
"confidence": 0.0-1.0,
"suggested_tone": "description of how to respond"
}}""",
}
],
max_tokens=150,
)
try:
return json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
return {"emotion": "neutral", "confidence": 0.5, "suggested_tone": "balanced"}
Docker Deployment with vLLM
For self-hosted production deployment, here is a Docker Compose configuration that runs the full Qwen voice stack:
# docker-compose.yml
version: "3.8"
services:
qwen-asr:
image: vllm/vllm-openai:latest
command: >
--model Qwen/Qwen3-ASR-1.7B
--port 8001
--max-model-len 4096
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
ports:
- "8001:8001"
qwen-omni:
image: vllm/vllm-openai:latest
command: >
--model Qwen/Qwen3-Omni-30B-A3B-Instruct
--port 8002
--tensor-parallel-size 2
--max-model-len 32768
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 2
capabilities: [gpu]
ports:
- "8002:8002"
qwen-tts:
image: vllm/vllm-openai:latest
command: >
--model Qwen/Qwen3-TTS-1.7B
--port 8003
--max-model-len 4096
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
ports:
- "8003:8003"
voice-agent:
build: .
environment:
- ASR_ENDPOINT=http://qwen-asr:8001/v1
- LLM_ENDPOINT=http://qwen-omni:8002/v1
- TTS_ENDPOINT=http://qwen-tts:8003/v1
depends_on:
- qwen-asr
- qwen-omni
- qwen-tts
ports:
- "8080:8080"
This gives you a fully self-contained voice agent stack running on your own GPUs — no API keys, no data leaving your infrastructure, no per-token costs.
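Inside the voice-agent container, the services are reached through the standard OpenAI-compatible client using the endpoints defined above; a minimal sketch follows (vLLM ignores the API key unless you configure one, and the model name must match what the qwen-omni service is serving).
import os
from openai import OpenAI

# Endpoints come from the docker-compose environment block above.
llm = OpenAI(base_url=os.environ["LLM_ENDPOINT"], api_key="EMPTY")

resp = llm.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",  # must match the model served by vLLM
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)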
Comparison Table
How does the Qwen voice stack compare to the alternatives?
| Feature | Qwen Stack | OpenAI (Whisper + GPT-4o + TTS) | Google (Gemini + Cloud Speech) | ElevenLabs + AssemblyAI |
|---|---|---|---|---|
| Self-hosting | Full (open weights) | Whisper only | No | No |
| ASR Languages | 52 (30 + 22 dialects) | 100+ (Whisper) | 125+ | 99 |
| TTS Languages | 10 | 57 | 40+ | 32 |
| Voice Cloning | Yes (3s, open-source) | No | No | Yes (proprietary) |
| End-to-end Voice | Qwen3-Omni | GPT-4o (closed) | Gemini 2.5 (closed) | No (cascaded) |
| Latency (TTS) | 97ms first packet | ~200ms | ~150ms | ~300ms |
| Cost (1M chars TTS) | Free (self-hosted) | $15.00 | $16.00 | $11-24 |
| Cost (1hr ASR) | Free (self-hosted) | $0.36 | $0.96 | $0.37 |
| Privacy | Full control | Data sent to OpenAI | Data sent to Google | Data sent to third party |
| Fine-tuning | Full access to weights | Limited | No | No |
| Forced Alignment | Built-in (11 langs) | Via Whisper timestamps | Via API | Via API |
| MoE Efficiency | 3B active / 30B total | N/A (dense) | MoE (closed) | N/A |
| License | Apache 2.0 | Proprietary | Proprietary | Proprietary |
Key takeaways: If privacy, cost control, or customization matter to your use case, Qwen is the clear winner. If you need maximum language coverage for ASR or prefer a managed service with no GPU management, OpenAI Whisper or Google Cloud Speech remain strong choices. ElevenLabs offers the best voice cloning quality in a managed service, but Qwen3-TTS is closing the gap rapidly while being fully open-source.
Conclusion
When to Use the Qwen Voice Stack
The Qwen voice ecosystem is the right choice when:
- Privacy is non-negotiable: Medical, legal, financial, or defense applications where audio cannot leave your infrastructure
- Cost at scale matters: High-volume voice applications where per-minute API pricing becomes prohibitive
- Customization is required: Fine-tuning ASR for domain-specific vocabulary, cloning specific voices, or building specialized agent behaviors
- You need end-to-end voice AI: Qwen3-Omni eliminates the latency and error compounding of cascaded ASR-LLM-TTS pipelines
- Multilingual is core: With 52 ASR languages, 10 TTS languages, and 119 text languages, Qwen covers most of the world
What is Coming Next
The Qwen team has maintained a rapid release cadence. Qwen3.5-Omni has already been released (March 2026) with Hybrid-Attention MoE across all modalities and SOTA on 215 benchmarks. The trajectory points toward even larger models, more languages for speech output, and deeper agent integration.
The combination of open weights, competitive performance, and a complete ecosystem — from speech recognition to agent frameworks to coding assistants — makes Qwen the most practical choice for developers building voice AI systems in 2026. The code in this guide gives you a working starting point. Clone the voices, build the agents, and ship something real.