
Qwen/Qwen3-235B-A22B-Instruct-2507

Flagship Qwen3 MoE instruct model with 235B total and 22B active parameters, tuned for high-quality text generation.

MoE · 235B total / 22B active · 262,144-token context · vLLM 0.10.0+ · text

Overview

Qwen3-235B-A22B-Instruct-2507 is the flagship instruct MoE model in the Qwen3 series, with 235B total parameters and 22B active per token. This guide covers deploying it efficiently with vLLM on NVIDIA and AMD GPUs.
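To see why the configurations below use four GPUs and why FP8 matters, a rough back-of-the-envelope weight-memory estimate helps. The numbers below are a sketch, not measured figures; the parameter count is from the model card and 192 GB is the HBM capacity of an MI300X.

```python
# Rough weight-memory estimate for Qwen3-235B-A22B (sketch, not measured).
# MoE: all experts are stored, so total (not active) params set the footprint.
TOTAL_PARAMS = 235e9
BYTES_BF16, BYTES_FP8 = 2, 1

weights_bf16_gb = TOTAL_PARAMS * BYTES_BF16 / 1e9   # ~470 GB
weights_fp8_gb = TOTAL_PARAMS * BYTES_FP8 / 1e9     # ~235 GB

# Four MI300X GPUs at 192 GB HBM each:
hbm_gb = 4 * 192                                    # 768 GB
print(weights_bf16_gb, weights_fp8_gb, hbm_gb)
```

BF16 weights alone consume roughly 470 GB of the 768 GB pool, leaving the remainder for KV cache and activations; FP8 halves the weight footprint and frees correspondingly more cache space.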

Prerequisites

NVIDIA CUDA (pip)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto

AMD ROCm (pip)

Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. Use the Docker flow if your environment is incompatible.

uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700

Deployment Configurations

BF16 on MI300X/MI325X/MI355X (4 GPUs)

HIP_VISIBLE_DEVICES="4,5,6,7" \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
VLLM_USE_TRITON_FLASH_ATTN=0 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --trust-remote-code \
  -tp 4 \
  --disable-log-requests \
  --swap-space 32 \
  --distributed-executor-backend mp \
  --max-num-batched-tokens 32768 \
  --max-model-len 32768 \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8
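Once the server is up, you can exercise it through the OpenAI-compatible API. The sketch below only builds the request body; `localhost:8000` is vLLM's default bind address (adjust if you passed `--host`/`--port`), and the model name must match what you gave `vllm serve`.

```python
import json

# Minimal OpenAI-compatible chat request for the server started above.
# Endpoint (vLLM default): http://localhost:8000/v1/chat/completions
payload = {
    "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",  # must match `vllm serve`
    "messages": [
        {"role": "user", "content": "Give me a short introduction to LLMs."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)
print(body)
# POST `body` with curl, `requests`, or the `openai` client pointed at the
# server's base_url ("http://localhost:8000/v1").
```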

FP8 on MI300X/MI325X/MI355X (4 GPUs)

HIP_VISIBLE_DEVICES="4,5,6,7" \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
VLLM_USE_TRITON_FLASH_ATTN=0 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --trust-remote-code \
  -tp 4 \
  --disable-log-requests \
  --swap-space 16 \
  --distributed-executor-backend mp \
  --max-num-batched-tokens 32768 \
  --max-model-len 32768 \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8


Benchmarking

vllm bench serve \
  --model "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8" \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos \
  --trust-remote-code
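To interpret the benchmark, note that `--request-rate 10000` effectively fires all 16 prompts at once and `--ignore-eos` forces every response to the full output length, so the token load on the server is fixed and easy to compute:

```python
# Token load implied by the benchmark flags above (sketch).
num_prompts = 16
input_len, output_len = 8192, 1024        # --random-input/output-len

prefill_tokens = num_prompts * input_len  # tokens to prefill
decode_tokens = num_prompts * output_len  # tokens to generate
print(prefill_tokens, decode_tokens)      # 131072 16384
```

With `--max-num-batched-tokens 32768`, the 131,072 prefill tokens are processed over several scheduler steps rather than one giant batch.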

Troubleshooting

  • Lower --max-num-batched-tokens and --max-model-len to fit memory constraints on smaller nodes.
  • If the server OOMs at startup, reduce --gpu-memory-utilization (e.g. to 0.8) and disable prefix caching with --no-enable-prefix-caching.
  • On AMD GPUs, set VLLM_ROCM_USE_AITER=1 to enable the AITER kernels for best performance.
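When tuning --max-model-len against OOMs, it helps to estimate the KV-cache cost per sequence. The layer and head counts below are assumed from the Qwen3-235B-A22B architecture (94 layers, 4 KV heads, head dim 128); verify them against the model's config.json before relying on the numbers.

```python
# KV-cache footprint per sequence (sketch; architecture numbers assumed,
# check config.json: num_hidden_layers, num_key_value_heads, head_dim).
layers, kv_heads, head_dim = 94, 4, 128
bytes_per_el = 2                 # BF16 KV cache
max_model_len = 32768            # matches --max-model-len above

per_token = 2 * layers * kv_heads * head_dim * bytes_per_el  # K and V
per_seq_gb = per_token * max_model_len / 1e9
print(per_token, round(per_seq_gb, 1))
```

Each full-length sequence costs on the order of 6 GB of KV cache (sharded across the tensor-parallel ranks), which is why shrinking --max-model-len is usually the quickest OOM fix.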
