Qwen/Qwen3-Coder-480B-A35B-Instruct
A large coder MoE with 480B total / 35B active parameters and strong tool-use and code-generation capabilities.
Overview
Qwen3-Coder is an advanced large language model created by the Qwen team. Qwen3-Coder-480B-A35B-Instruct is the flagship coder MoE with 480B total / 35B active parameters. vLLM supports the model, including tool calling; the guide below covers BF16 and FP8 serving on NVIDIA and AMD GPUs.
Prerequisites
CUDA
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
ROCm (MI300X, MI325X, MI355X)
```bash
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
```
The ROCm wheel requires Python 3.12, ROCm 7.0, and glibc >= 2.35.
Deployment Configurations
8xH200 / 8xH20 BF16
```bash
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
--max-model-len 32000 \
--enable-expert-parallel \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
8xH200 / 8xH20 FP8 (DP=8, recommended)
```bash
VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--max-model-len 131072 \
--enable-expert-parallel \
--data-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
8xMI300X/MI325X/MI355X BF16
```bash
VLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
--max-model-len 32000 \
--enable-expert-parallel \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
8xMI300X/MI325X/MI355X FP8
```bash
VLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--trust-remote-code \
--max-model-len 131072 \
--enable-expert-parallel \
--data-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
Client Usage
```bash
vllm bench serve \
--backend vllm \
--model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--endpoint /v1/completions \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--max-concurrency 10 \
--num-prompts 100
```
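Beyond benchmarking, any OpenAI-compatible client can talk to the server. A minimal sketch of a chat-completion request with a tool definition, matching the `qwen3_coder` tool-call parser configured above (assumes the default base URL `http://localhost:8000/v1`; the `run_tests` tool schema is purely hypothetical):

```python
import json
import urllib.request

# Hypothetical tool definition; replace with your own function schema.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

payload = {
    "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8",
    "messages": [{"role": "user", "content": "Run the tests under src/."}],
    "tools": tools,
    "tool_choice": "auto",
}

# Uncomment to send against a running server:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
print(json.dumps(payload)[:60])
```

With `--enable-auto-tool-choice`, the server decides when to emit tool calls; they come back in the response's `tool_calls` field.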
Troubleshooting
- Context-length OOM: A single H20 node cannot serve the native 262144-token context. Reduce `--max-model-len` or raise `--gpu-memory-utilization`.
- TP=8 failure on FP8: Expect `ValueError: The output_size of gate's and up's weight = 320 is not divisible by weight quantization block_n = 128.` on FP8 with TP=8. Switch to `--data-parallel-size 8` instead.
- DeepGEMM: Set `VLLM_USE_DEEP_GEMM=1` for faster FP8 matmul. Follow the setup instructions to install it.
- Tool calling: Add `--tool-call-parser qwen3_coder` as shown above.
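The context-length OOM point follows from simple weight-memory arithmetic. A rough sketch (weights only; KV cache, activations, and runtime overhead come on top of this, so the remainder of each GPU's HBM is what bounds the usable context):

```python
# Approximate per-GPU weight memory for the 480B-parameter model
# when sharded across 8 GPUs (TP=8 or DP=8 with expert parallelism).
params = 480e9
bytes_per_param = {"bf16": 2, "fp8": 1}

for dtype, nbytes in bytes_per_param.items():
    total_gb = params * nbytes / 1e9   # weights only, in GB
    per_gpu_gb = total_gb / 8          # spread over 8 GPUs
    print(f"{dtype}: ~{total_gb:.0f} GB total, ~{per_gpu_gb:.0f} GB per GPU")
```

This is why FP8 leaves roughly twice as much headroom for KV cache as BF16 and can serve the 131072-token configurations above.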
Evaluation
| Dataset | Test Type | Pass@1 Score |
|---|---|---|
| HumanEval | Base tests | 0.939 |
| HumanEval+ | Base + extra tests | 0.902 |
| MBPP | Base tests | 0.918 |
| MBPP+ | Base + extra tests | 0.794 |
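The Pass@1 column reports the standard unbiased pass@k estimator evaluated at k = 1. A sketch of the formula (the general estimator from the HumanEval paper; the exact evaluation harness used for the numbers above is not specified here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples, c of them correct -> 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# At k = 1 this reduces to the fraction of correct samples:
print(pass_at_k(10, 3, 1))  # ~0.3 with 3/10 correct samples
```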