
Qwen/Qwen3-Coder-480B-A35B-Instruct

Large coder MoE with 480B total / 35B active parameters and strong tool-use and code-generation capabilities.

MoE · 480B total / 35B active · 262,144 context window · vLLM 0.10.0+ · text-only

Overview

Qwen3-Coder is the Qwen team's code-focused model series; Qwen3-Coder-480B-A35B-Instruct is its flagship coder MoE with 480B total / 35B active parameters. vLLM supports the model, including tool calling; the guide below covers BF16 and FP8 serving on NVIDIA and AMD GPUs.

Prerequisites

CUDA

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

ROCm (MI300X, MI325X, MI355X)

uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

The ROCm wheel requires Python 3.12, ROCm 7.0, and glibc >= 2.35.

Deployment Configurations

8xH200 / 8xH20 BF16

vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --max-model-len 32000 \
  --enable-expert-parallel \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

8xH200 / 8xH20 FP8

VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --max-model-len 131072 \
  --enable-expert-parallel \
  --data-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

8xMI300X/MI325X/MI355X BF16

VLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --max-model-len 32000 \
  --enable-expert-parallel \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

8xMI300X/MI325X/MI355X FP8

VLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --trust-remote-code \
  --max-model-len 131072 \
  --enable-expert-parallel \
  --data-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Client Usage

Benchmark the running server with vLLM's built-in serving benchmark:

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
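
Beyond benchmarking, the server exposes the standard OpenAI-compatible API. A minimal sketch of a chat request, assuming the FP8 deployment above is listening on vLLM's default port 8000:

```shell
# Send a chat completion request to the OpenAI-compatible endpoint.
# Assumes the server was started with one of the deployment commands above
# and is reachable at localhost:8000 (vLLM's default port).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8",
    "messages": [
      {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256
  }'
```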

Troubleshooting

  • Context-length OOM: A single H20 node cannot serve the native 262144 context. Reduce --max-model-len or raise --gpu-memory-utilization.
  • TP=8 failure on FP8: Expect ValueError: The output_size of gate's and up's weight = 320 is not divisible by weight quantization block_n = 128. on FP8 with TP=8. Switch to --data-parallel-size 8 instead.
  • DeepGEMM: set VLLM_USE_DEEP_GEMM=1 to use DeepGEMM's faster FP8 GEMM kernels. DeepGEMM must be installed separately; follow its setup instructions.
  • Tool calling: add --tool-call-parser qwen3_coder as shown above.
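
With --enable-auto-tool-choice and the qwen3_coder parser enabled, tool calls come back in the standard OpenAI tool-calling format. A sketch of a request that offers the model a tool, assuming a server from the deployment section on localhost:8000; the get_weather function is a hypothetical example, not part of the model or vLLM:

```shell
# Offer the model a (hypothetical) get_weather tool; with
# --enable-auto-tool-choice the server decides whether to call it and
# returns any call in the response's tool_calls field.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the weather in Berlin right now?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```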

Evaluation

Dataset      Test Type            Pass@1 Score
HumanEval    Base tests           0.939
HumanEval+   Base + extra tests   0.902
MBPP         Base tests           0.918
MBPP+        Base + extra tests   0.794
