Qwen3‑ASR: Alibaba’s Open‑Source 52‑Language ASR Model
Alibaba Cloud’s new Qwen3‑ASR series brings a powerful, all‑in‑one speech‑recognition system to the open‑source ecosystem. Built on the Qwen‑Omni foundation model, Qwen3‑ASR now supports 52 languages and 22 Chinese dialects, delivers timestamp predictions, and can run efficiently on a single GPU with the vLLM backend.
Why Qwen3‑ASR Stands Out
- Multilingual Breadth – 52 languages (English, Mandarin, Arabic, German, Spanish, French, Italian, Vietnamese, Japanese, Korean, Hindi, and many more) plus 22 Chinese dialects. The model can even differentiate between accents within a language.
- All‑in‑One – Language detection, speech recognition, and timestamp prediction are bundled into one inference call. No need for external language‑ID libraries.
- State‑of‑the‑Art Performance – On LibriSpeech, Qwen3‑ASR‑1.7B achieves a WER of 1.63 % (vs 2.78 % for Whisper‑large‑v3). For singing‑voice tasks, it reaches 5.98 % WER, beating leading commercial demos.
- Fast, Scalable Inference – With the vLLM backend, the 0.6B model reaches roughly 2000× real‑time throughput at a concurrency of 128. Stream‑mode inference lets you transcribe live audio with sub‑second latency.
- Easy Deployment – Docker images, Gradio demos, and an OpenAI‑compatible API are all available out of the box.
Getting Started
Below is a step‑by‑step guide to download, install, and run Qwen3‑ASR. All commands assume a Unix‑style shell.
1. Clone the Repo
git clone https://github.com/QwenLM/Qwen3-ASR.git
cd Qwen3-ASR
2. Install Dependencies
Create a clean Python 3.12 environment:
conda create -n qwen3-asr python=3.12 -y
conda activate qwen3-asr
Install the core package:
pip install -U qwen-asr
If you want the vLLM backend:
pip install -U "qwen-asr[vllm]"
Tip – Install FlashAttention‑2 for lower GPU memory usage and faster inference:
pip install -U flash-attn --no-build-isolation
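If the qwen_asr loader follows the usual Hugging Face transformers convention, FlashAttention‑2 can then be requested when loading the model. The attn_implementation argument below is an assumption about Qwen3ASRModel rather than a documented parameter, so drop it if the loader rejects it:
import torch
from qwen_asr import Qwen3ASRModel

# Assumption: from_pretrained forwards transformers-style kwargs such as
# attn_implementation; requires the flash-attn package installed above.
model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)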
3. Download the Model Weights
For users outside Mainland China, the easiest method is via Hugging Face:
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir ./Qwen3-ASR-1.7B
huggingface-cli download Qwen/Qwen3-ASR-0.6B --local-dir ./Qwen3-ASR-0.6B
If you’re in Mainland China, use ModelScope:
pip install -U modelscope
modelscope download --model Qwen/Qwen3-ASR-1.7B --local_dir ./Qwen3-ASR-1.7B
modelscope download --model Qwen/Qwen3-ASR-0.6B --local_dir ./Qwen3-ASR-0.6B
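You can also fetch the weights programmatically, which is convenient in setup scripts. This sketch uses the standard huggingface_hub API and is independent of the qwen_asr package:
from huggingface_hub import snapshot_download

# Download the 1.7B checkpoint into a local directory (resumable and cached).
snapshot_download(
    repo_id="Qwen/Qwen3-ASR-1.7B",
    local_dir="./Qwen3-ASR-1.7B",
)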
4. Quick Inference Demo
import torch
from qwen_asr import Qwen3ASRModel

# Load the 1.7B transformer model
model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

# Transcribe a sample audio
results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,  # Auto-detect
)
print("Predicted language:", results[0].language)
print("Transcription:", results[0].text)
5. Streaming Inference (vLLM)
import torch
from qwen_asr import Qwen3ASRModel

if __name__ == "__main__":
    # Load the model through the vLLM backend for high-concurrency inference
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
    )
    # Streaming example omitted for brevity – see the repository for the full script
6. Forced‑Alignment
Qwen3‑ForcedAligner‑0.6B can provide word‑level timestamps for up to 5 minutes of speech.
import torch
from qwen_asr import Qwen3ForcedAligner

aligner = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)
alignment = aligner.align(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
    text="甚至出现交易几乎停滞的情况。",
    language="Chinese",
)
# Print each aligned word with its start and end timestamps
for word in alignment[0]:
    print(word.text, word.start_time, word.end_time)
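A common next step is turning word‑level timestamps into subtitles. The sketch below continues from the alignment object above and assumes start_time and end_time are expressed in seconds (the unit is not stated here):
def to_srt_timestamp(seconds: float) -> str:
    # Format seconds as an SRT timestamp: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words_per_cue: int = 8) -> str:
    # Group consecutive aligned words into numbered SRT cues.
    cues = []
    for i in range(0, len(words), max_words_per_cue):
        chunk = words[i:i + max_words_per_cue]
        start = to_srt_timestamp(chunk[0].start_time)
        end = to_srt_timestamp(chunk[-1].end_time)
        text = "".join(w.text for w in chunk)  # join without spaces for Chinese
        cues.append(f"{i // max_words_per_cue + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)

print(words_to_srt(alignment[0]))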
Benchmark Highlights
| Dataset | Qwen3‑ASR‑1.7B (WER) | Whisper‑large‑v3 (WER) |
|---|---|---|
| LibriSpeech | 1.63 % | 2.78 % |
| Fleurs‑en | 3.35 % | 5.70 % |
| Singing Voice | 5.98 % | 7.88 % |
The 0.6B version offers roughly a 2× speed‑up at the cost of a modest WER increase of about 0.4 percentage points, making it ideal for low‑latency applications.
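For reference, word error rate is the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the number of reference words. You can reproduce the metric on your own transcripts with the jiwer package (not part of Qwen3‑ASR):
# pip install jiwer
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions against nine reference words -> WER of about 22 %
print(f"WER: {wer(reference, hypothesis):.2%}")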
Deploy with vLLM in Production
- Install vLLM – use the nightly wheel for CUDA 12.9 compatibility.
uv venv
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly/cu129
uv pip install "vllm[audio]"
- Launch a Local Server
vllm serve Qwen/Qwen3-ASR-1.7B
- Query via OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[{
        "role": "user",
        "content": [{"type": "audio_url", "audio_url": {"url": "<YOUR_AUDIO_URL>"}}],
    }],
)
print(response.choices[0].message.content)
Feel free to expose the server behind Nginx or any API gateway – the OpenAI‑compatible endpoints make integration straightforward.
Docker‑Based Quickstart
docker run --gpus all --name qwen3-asr \
-p 8000:80 \
-v /your/workspace:/data/shared/Qwen3-ASR \
qwenllm/qwen3-asr:latest
With the port mapping above, the container’s Gradio UI and OpenAI‑compatible vLLM API are reachable at http://localhost:8000 on the host.
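Once the container is running, a quick smoke test is to list the served models through the OpenAI‑compatible endpoint; this assumes the API is reachable on the host port mapped above:
from openai import OpenAI

# Point the client at the containerized vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The Qwen3-ASR model name should appear in the list of served models.
for model in client.models.list():
    print(model.id)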
Summary
Qwen3‑ASR is more than just a new open‑source ASR model. It’s a complete ecosystem that offers:
- High‑quality multilingual transcription – 52 languages, 22 Chinese dialects.
- Real‑time & batch inference – via transformers, vLLM, or streaming.
- Forced‑Alignment – fast, non‑autoregressive timestamps.
- Zero‑config demos – Gradio UI, Docker, and API servers.
Whether you’re building a multilingual customer‑support bot, a music‑transcription service, or a research prototype, Qwen3‑ASR gives you the performance of a commercial API at a fraction of the cost.
Get started now by cloning the repository, downloading the weights, and running the sample scripts. The community is active on GitHub and Discord, so share your use cases and help shape the next generation of open‑source speech recognition.