Qwen3‑ASR: Alibaba’s Open‑Source 52‑Language ASR Model
Alibaba Cloud’s new Qwen3‑ASR series brings a powerful, all‑in‑one speech‑recognition system to the open‑source ecosystem. Built on the Qwen‑Omni foundation model, Qwen3‑ASR now supports 52 languages and 22 Chinese dialects, delivers timestamp predictions, and can run efficiently on a single GPU with the vLLM backend.
Why Qwen3‑ASR Stands Out
- Multilingual Breadth – 52 languages (English, Mandarin, Arabic, German, Spanish, French, Italian, Vietnamese, Japanese, Korean, Hindi, and many more) plus 22 Chinese dialects. The model can even differentiate between accents within a language.
- All‑in‑One – Language detection, speech recognition, and timestamp prediction are bundled into one inference call. No need for external language‑ID libraries.
- State‑of‑the‑Art Performance – On LibriSpeech, Qwen3‑ASR‑1.7B achieves a WER of 1.63 % (vs 2.78 % for Whisper‑large‑v3). For singing‑voice tasks, it reaches 5.98 % WER, beating leading commercial demos.
- Fast, Scalable Inference – With the vLLM backend, the 0.6B model reaches roughly 2000× real‑time throughput at a concurrency of 128. Stream‑mode inference lets you transcribe live audio with sub‑second latency.
- Easy Deployment – Docker images, Gradio demos, and an OpenAI‑compatible API are all available out of the box.
Getting Started
Below is a step‑by‑step guide to download, install, and run Qwen3‑ASR. All commands assume a Unix‑style shell.
1. Clone the Repo
git clone https://github.com/QwenLM/Qwen3-ASR.git
cd Qwen3-ASR
2. Install Dependencies
Create a clean Python 3.12 environment:
conda create -n qwen3-asr python=3.12 -y
conda activate qwen3-asr
Install the core package:
pip install -U qwen-asr
If you want the vLLM backend:
pip install -U "qwen-asr[vllm]"
Tip – Install FlashAttention‑2 for lower GPU memory usage and faster inference:
pip install -U flash-attn --no-build-isolation
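If the qwen_asr loader follows the usual Hugging Face transformers convention, FlashAttention‑2 can then be requested when loading the model. The attn_implementation argument below is an assumption about Qwen3ASRModel rather than a documented parameter, so drop it if the loader rejects it:
import torch
from qwen_asr import Qwen3ASRModel

# Assumption: from_pretrained forwards transformers-style kwargs such as
# attn_implementation; requires the flash-attn package installed above.
model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)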
3. Download the Model Weights
For users outside Mainland China, the easiest method is via Hugging Face:
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir ./Qwen3-ASR-1.7B
huggingface-cli download Qwen/Qwen3-ASR-0.6B --local-dir ./Qwen3-ASR-0.6B
If you’re in Mainland China, use ModelScope:
pip install -U modelscope
modelscope download --model Qwen/Qwen3-ASR-1.7B --local_dir ./Qwen3-ASR-1.7B
modelscope download --model Qwen/Qwen3-ASR-0.6B --local_dir ./Qwen3-ASR-0.6B
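You can also fetch the weights programmatically, which is convenient in setup scripts. This sketch uses the standard huggingface_hub API and is independent of the qwen_asr package:
from huggingface_hub import snapshot_download

# Download the 1.7B checkpoint into a local directory (resumable and cached).
snapshot_download(
    repo_id="Qwen/Qwen3-ASR-1.7B",
    local_dir="./Qwen3-ASR-1.7B",
)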
4. Quick Inference Demo
import torch
from qwen_asr import Qwen3ASRModel

# Load the 1.7B transformer model
model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

# Transcribe a sample audio
results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,  # Auto-detect
)
print("Predicted language:", results[0].language)
print("Transcription:", results[0].text)
5. Streaming Inference (vLLM)
import torch
from qwen_asr import Qwen3ASRModel

if __name__ == "__main__":
    # Load the model through the vLLM backend for high-concurrency inference
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
    )
    # Streaming example omitted for brevity – see the repository for the full script
6. Forced‑Alignment
Qwen3‑ForcedAligner‑0.6B can provide word‑level timestamps for up to 5 minutes of speech.
import torch
from qwen_asr import Qwen3ForcedAligner

aligner = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)
alignment = aligner.align(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
    text="甚至出现交易几乎停滞的情况。",
    language="Chinese",
)
# Print each aligned word with its start and end timestamps
for word in alignment[0]:
    print(word.text, word.start_time, word.end_time)
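A common next step is turning word‑level timestamps into subtitles. The sketch below continues from the alignment object above and assumes start_time and end_time are expressed in seconds (the unit is not stated here):
def to_srt_timestamp(seconds: float) -> str:
    # Format seconds as an SRT timestamp: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words_per_cue: int = 8) -> str:
    # Group consecutive aligned words into numbered SRT cues.
    cues = []
    for i in range(0, len(words), max_words_per_cue):
        chunk = words[i:i + max_words_per_cue]
        start = to_srt_timestamp(chunk[0].start_time)
        end = to_srt_timestamp(chunk[-1].end_time)
        text = "".join(w.text for w in chunk)  # join without spaces for Chinese
        cues.append(f"{i // max_words_per_cue + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)

print(words_to_srt(alignment[0]))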
Benchmark Highlights
| Dataset | Qwen3‑ASR‑1.7B (WER) | Whisper‑large‑v3 (WER) |
|---|---|---|
| LibriSpeech | 1.63 % | 2.78 % |
| Fleurs‑en | 3.35 % | 5.70 % |
| Singing Voice | 5.98 % | 7.88 % |
The 0.6B version offers roughly a 2× speed‑up at the cost of a modest WER increase of about 0.4 percentage points, making it ideal for low‑latency applications.
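For reference, word error rate is the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the number of reference words. You can reproduce the metric on your own transcripts with the jiwer package (not part of Qwen3‑ASR):
# pip install jiwer
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions against nine reference words -> WER of about 22 %
print(f"WER: {wer(reference, hypothesis):.2%}")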
Deploy with vLLM in Production
- Install vLLM – use the nightly wheel for CUDA 12.9 compatibility.
uv venv
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly/cu129
uv pip install "vllm[audio]"
- Launch a Local Server
vllm serve Qwen/Qwen3-ASR-1.7B
- Query via OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[{
        "role": "user",
        "content": [{"type": "audio_url", "audio_url": {"url": "<YOUR_AUDIO_URL>"}}],
    }],
)
print(response.choices[0].message.content)
Feel free to expose the server behind Nginx or any API gateway – the OpenAI‑compatible endpoints make integration straightforward.
Docker‑Based Quickstart
docker run --gpus all --name qwen3-asr \
-p 8000:80 \
-v /your/workspace:/data/shared/Qwen3-ASR \
qwenllm/qwen3-asr:latest
With the port mapping above, the container’s Gradio UI and OpenAI‑compatible vLLM API are reachable at http://localhost:8000 on the host.
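Once the container is running, a quick smoke test is to list the served models through the OpenAI‑compatible endpoint; this assumes the API is reachable on the host port mapped above:
from openai import OpenAI

# Point the client at the containerized vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The Qwen3-ASR model name should appear in the list of served models.
for model in client.models.list():
    print(model.id)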
Summary
Qwen3‑ASR is more than just a new open‑source ASR model. It’s a complete ecosystem that offers:
- High‑quality multilingual transcription – 52 languages, 22 Chinese dialects.
- Real‑time & batch inference – via transformers, vLLM, or streaming.
- Forced‑Alignment – fast, non‑autoregressive timestamps.
- Zero‑config demos – Gradio UI, Docker, and API servers.
Whether you’re building a multilingual customer‑support bot, a music‑transcription service, or a research prototype, Qwen3‑ASR gives you the performance of a commercial API at a fraction of the cost.
Get started now by cloning the repository, downloading the weights, and running the sample scripts. The community is active on GitHub and Discord, so share your use cases and help shape the next generation of open‑source speech recognition.