Qwen/Qwen3.5-397B-A17B
Multimodal MoE model with gated delta networks architecture, 397B total / 17B active parameters, up to 262K context
Overview
Qwen3.5 is a multimodal mixture-of-experts model featuring a gated delta networks architecture with 397B total parameters and 17B active parameters. This guide covers how to efficiently deploy and serve the model across different hardware configurations and workload profiles using vLLM.
Prerequisites
Pip Install
NVIDIA
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
AMD
Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup. Supported GPUs: MI300X, MI325X, MI355X.
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Docker
NVIDIA
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai Qwen/Qwen3.5-397B-A17B \
--tensor-parallel-size 8 \
--reasoning-parser qwen3 \
--enable-prefix-caching
For Blackwell GPUs, use vllm/vllm-openai:cu130-nightly.
AMD
docker run --device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined \
--group-add video \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai-rocm:latest \
Qwen/Qwen3.5-397B-A17B-FP8 \
--tensor-parallel-size 8 \
--reasoning-parser qwen3 \
--enable-prefix-caching
Deployment Configurations
The configurations below have been verified on 8x H200 GPUs and 8x MI300X/MI355X GPUs. We recommend using the official FP8 checkpoint Qwen/Qwen3.5-397B-A17B-FP8 for optimal serving efficiency.
Throughput-Focused (Text-Only)
For maximum text throughput under high concurrency, use --language-model-only to skip loading the vision encoder and free up memory for KV cache, and enable Expert Parallelism.
vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
-dp 8 \
--enable-expert-parallel \
--language-model-only \
--reasoning-parser qwen3 \
--enable-prefix-caching
Throughput-Focused (Multimodal)
For multimodal workloads, use --mm-encoder-tp-mode data for data-parallel vision encoding and --mm-processor-cache-type shm for shared-memory caching of preprocessed multimodal inputs.
vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
-dp 8 \
--enable-expert-parallel \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--reasoning-parser qwen3 \
--enable-prefix-caching
To enable tool calling, add --enable-auto-tool-choice --tool-call-parser qwen3_coder to the serve command.
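Once the server is started with those flags, tool calls flow through the standard OpenAI function-calling interface. The sketch below is illustrative, assuming a hypothetical get_weather tool; the schema format is the OpenAI function-calling format that the parsed tool calls map back onto, and the local dispatch helper is our own convention, not part of vLLM.

```python
import json

# Hypothetical tool for illustration; any callable with a matching
# JSON-schema description works the same way.
def get_weather(city: str) -> str:
    # Stub implementation so the example is self-contained.
    return json.dumps({"city": city, "temp_c": 21})

# Tool schema in the OpenAI function-calling format, passed as
# tools=... to client.chat.completions.create.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call_name: str, arguments_json: str) -> str:
    # Route a parsed tool call back to the local implementation.
    registry = {"get_weather": get_weather}
    return registry[tool_call_name](**json.loads(arguments_json))

# With a running server, pass tools=tools to
# client.chat.completions.create(...) and read the parsed calls from
# response.choices[0].message.tool_calls.
```

Each returned tool call carries a function name and a JSON-encoded arguments string, which is why the dispatcher decodes the arguments before invoking the local function.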
Latency-Focused
For latency-sensitive workloads at low concurrency, enable MTP-1 speculative decoding and disable prefix caching. MTP-1 reduces time-per-output-token (TPOT) with a high acceptance rate, at the cost of lower throughput under load.
Note: MTP-1 speculative decoding for AMD GPUs is under development.
vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
--tensor-parallel-size 8 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
--reasoning-parser qwen3
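To verify the TPOT improvement from MTP-1 on your own traffic, you can stream a request and measure the gaps between token arrivals. The helper below is a minimal sketch; the streaming call in the comment assumes the OpenAI client from the Client Usage section, and `mean_tpot` is our own illustrative name.

```python
import time

def mean_tpot(token_timestamps):
    """Mean time-per-output-token (seconds) from per-token arrival
    timestamps; requires at least two tokens."""
    if len(token_timestamps) < 2:
        raise ValueError("need at least two token timestamps")
    span = token_timestamps[-1] - token_timestamps[0]
    return span / (len(token_timestamps) - 1)

# With a running server, collect one timestamp per streamed chunk:
#   stream = client.chat.completions.create(..., stream=True)
#   stamps = [time.monotonic() for _chunk in stream]
#   print(f"TPOT: {mean_tpot(stamps) * 1000:.1f} ms")
```

Comparing this number with and without the `--speculative-config` flag shows the latency effect of MTP-1 directly.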
GB200 Deployment
We recommend using the NVFP4 checkpoint nvidia/Qwen3.5-397B-A17B-NVFP4 for optimal serving efficiency on GB200 nodes.
vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
-dp 4 \
--enable-expert-parallel \
--language-model-only \
--reasoning-parser qwen3 \
--enable-prefix-caching
MI355X Deployment
vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
-tp 2 \
--enable-expert-parallel \
--language-model-only \
--reasoning-parser qwen3 \
--enable-prefix-caching
Client Usage
import time

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                },
            },
            {
                "type": "text",
                "text": "Read all the text in the image.",
            },
        ],
    }
]

start = time.time()
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
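The example above fetches the image from a remote URL. For local files, the same image_url content part accepts a base64 data URL. The helpers below are a small sketch using only the standard library; the message structure mirrors the remote-URL example.

```python
import base64

def image_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL for an image_url content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def image_message(image_bytes: bytes, prompt: str) -> dict:
    # Same content-part structure as the remote-URL example above.
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_data_url(image_bytes)}},
            {"type": "text", "text": prompt},
        ],
    }

# Usage: messages = [image_message(open("receipt.png", "rb").read(),
#                                  "Read all the text in the image.")]
```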
Troubleshooting
CUDA graph / Mamba cache size error
You may encounter:
assert num_cache_lines >= batch
This occurs because the CUDA graph capture size is larger than the Mamba cache size. Reduce --max-cudagraph-capture-size (default is 512). See https://github.com/vllm-project/vllm/pull/34571 for details.
Configuration tips
- Disable Reasoning: Add --reasoning-parser qwen3 --default-chat-template-kwargs '{"enable_thinking": false}' to the serve command to disable reasoning mode via command-line parameters.
- Prefix Caching: Prefix caching for Mamba cache "align" mode is currently experimental.
- Multi-token Prediction: MTP-1 reduces per-token latency but degrades throughput under high concurrency because speculative tokens consume KV cache capacity. Adjust num_speculative_tokens (1-5) based on your use case.
- Encoder Data Parallelism: --mm-encoder-tp-mode data deploys the vision encoder in a data-parallel fashion. This consumes additional memory and may require adjusting --gpu-memory-utilization.
- Ultra-Long Texts: Qwen3.5 natively supports up to 262,144 tokens of context. For longer contexts, use RoPE scaling (YaRN). See the HuggingFace model card for details.
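Reasoning can also be toggled per request rather than server-wide: vLLM's OpenAI-compatible server passes chat_template_kwargs supplied via the client's extra_body through to the chat template. The helper below is a sketch under that assumption; the function name is ours.

```python
def no_thinking_kwargs(messages, model="Qwen/Qwen3.5-397B-A17B-FP8"):
    """Build kwargs for client.chat.completions.create that disable
    reasoning for a single request. chat_template_kwargs inside
    extra_body is a vLLM-specific extension (assumption: the server
    forwards it to the chat template's enable_thinking switch)."""
    return {
        "model": model,
        "messages": messages,
        "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
    }

# Usage with the client from the Client Usage section:
#   response = client.chat.completions.create(**no_thinking_kwargs(messages))
```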
References
- Model card: https://huggingface.co/Qwen/Qwen3.5-397B-A17B
- FP8 checkpoint: https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8
- NVFP4 checkpoint: https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4
- vLLM documentation: https://docs.vllm.ai/