Qwen/Qwen3-VL-235B-A22B-Instruct
Qwen3-VL flagship MoE vision-language model with 235B total / 22B active parameters, supporting images, video, and long context.
Overview
Qwen3-VL is the most powerful vision-language model in the Qwen series, delivering upgrades to text understanding & generation, visual perception & reasoning, extended context, spatial/video dynamics, and agent interaction. The flagship Qwen3-VL-235B-A22B-Instruct is a MoE model that requires at least 8 GPUs with ≥80 GB memory each (A100/H100/H200 class).
Prerequisites
uv venv
source .venv/bin/activate
# Install vLLM >= 0.11.0
uv pip install -U vllm
# Install Qwen-VL utility library (recommended for offline inference)
uv pip install qwen-vl-utils==0.0.14
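Before serving, it can be worth confirming the installed vLLM meets the 0.11.0 minimum. A minimal sketch using only the standard library (the `meets_minimum` helper is illustrative, not part of vLLM):

```python
from importlib.metadata import version, PackageNotFoundError

def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically (pre-release tags are ignored)."""
    def parts(v: str):
        return [int(p) for p in v.split("+")[0].split(".") if p.isdigit()]
    return parts(installed) >= parts(required)

try:
    v = version("vllm")
    print(f"vllm {v} found; new enough: {meets_minimum(v, '0.11.0')}")
except PackageNotFoundError:
    print("vllm is not installed in this environment")
```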
Deployment Configurations
H100 (Image + Video, FP8)
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
--tensor-parallel-size 8 \
--mm-encoder-tp-mode data \
--enable-expert-parallel \
--async-scheduling
H100 (Image-Only, FP8, TP4)
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
--tensor-parallel-size 4 \
--limit-mm-per-prompt.video 0 \
--async-scheduling \
--gpu-memory-utilization 0.95 \
--max-num-seqs 128
A100 & H100 (Image-Only, BF16)
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--tensor-parallel-size 8 \
--limit-mm-per-prompt.video 0 \
--async-scheduling
A100 & H100 (Image + Video, BF16)
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 128000 \
--async-scheduling
H200 & B200
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--tensor-parallel-size 8 \
--mm-encoder-tp-mode data \
--async-scheduling
MI300X/MI325X/MI355X (BF16)
MIOPEN_USER_DB_PATH="$(pwd)/miopen" \
MIOPEN_FIND_MODE=FAST \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data
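Once any of the servers above is running, the OpenAI-compatible `/v1/models` endpoint reports which model ID to pass to clients. A small sketch of parsing that response (the `served_model_ids` helper is illustrative; it is shown against a sample payload rather than a live server):

```python
import json
from urllib.request import urlopen

def served_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

# Against a live server you would fetch the payload like this:
#   payload = json.load(urlopen("http://localhost:8000/v1/models"))
sample = {"object": "list",
          "data": [{"id": "Qwen/Qwen3-VL-235B-A22B-Instruct", "object": "model"}]}
print(served_model_ids(sample))  # ['Qwen/Qwen3-VL-235B-A22B-Instruct']
```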
Configuration Tips
- Use `--limit-mm-per-prompt.video 0` if your server only serves image inputs, to save memory.
- `OMP_NUM_THREADS=1` reduces CPU contention during preprocessing.
- The model's context length is 262K. Reduce `--max-model-len` (e.g. 128000) if you don't need the full range.
- `--async-scheduling` overlaps scheduling with decoding for better throughput.
- `--mm-encoder-tp-mode data` deploys the vision encoder in data-parallel fashion for better performance.
- If your inputs are mostly unique, pass `--mm-processor-cache-gb 0` to skip caching overhead.
- Extend context with YaRN: `--rope-scaling '{"rope_type":"yarn","factor":3.0,"original_max_position_embeddings":262144,"mrope_section":[24,20,20],"mrope_interleaved":true}' --max-model-len 1000000`
- Text-only mode: pass `--limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 0` to free memory for the KV cache when serving text-only traffic.
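The `--rope-scaling` flag takes a JSON string that is easy to mistype inline. One way to generate it programmatically (a sketch; the values are copied from the YaRN tip above):

```python
import json

# YaRN settings from the tip above; adjust to your target context window.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 3.0,
    "original_max_position_embeddings": 262144,
    "mrope_section": [24, 20, 20],
    "mrope_interleaved": True,
}
flag = f"--rope-scaling '{json.dumps(rope_scaling)}'"
print(flag)
```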
Client Usage
import time
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)
messages = [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"}},
{"type": "text", "text": "Read all the text in the image."},
],
}]
start = time.time()
response = client.chat.completions.create(
model="Qwen/Qwen3-VL-235B-A22B-Instruct",
messages=messages,
max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(response.choices[0].message.content)
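The example above fetches the image by URL; the same chat endpoint also accepts base64 data URIs in the `image_url` field, which is handy for local files. A sketch (the `to_data_uri` helper and the file path are illustrative):

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable in an image_url content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# For a local file, build the URI and drop it into the message payload:
#   with open("receipt.png", "rb") as f:
#       uri = to_data_uri(f.read())
#   messages[0]["content"][0]["image_url"]["url"] = uri
print(to_data_uri(b"abc"))  # data:image/png;base64,YWJj
```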
Troubleshooting
- OOM on A100/H100 BF16: reduce `--max-model-len`, drop to image-only, or switch to the FP8 checkpoint.
- If enabling `--mm-encoder-tp-mode data` raises memory pressure, lower `--gpu-memory-utilization`.