# Qwen/Qwen-Image
Text-to-image diffusion model (20B parameters) from the Qwen-Image family, served via vLLM-Omni.
## Overview
Qwen-Image is a diffusion-based text-to-image model. This recipe documents the Qwen-Image family served via vLLM-Omni:
| Model | HuggingFace | Description |
|---|---|---|
| Qwen-Image | Qwen/Qwen-Image | Text-to-image (20B, Aug 2025) |
| Qwen-Image-2512 | Qwen/Qwen-Image-2512 | Updated T2I with enhanced realism (Dec 2025) |
| Qwen-Image-Edit | Qwen/Qwen-Image-Edit | Single-image editing (Aug 2025) |
| Qwen-Image-Edit-2509 | Qwen/Qwen-Image-Edit-2509 | Multi-image editing (Sep 2025) |
| Qwen-Image-Edit-2511 | Qwen/Qwen-Image-Edit-2511 | Enhanced consistency + built-in LoRA (Nov 2025) |
| Qwen-Image-Layered | Qwen/Qwen-Image-Layered | Decomposes input into RGBA layers (Dec 2025) |
All models share the same DiT transformer core, so the acceleration methods below apply across the entire series.
## Prerequisites

```shell
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv venv
source .venv/bin/activate
uv pip install -e . vllm==0.18.0
```
## Usage

### Text-to-Image

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image \
  --prompt "a cup of coffee on the table" \
  --output output_qwen_image.png \
  --num-inference-steps 50 \
  --cfg-scale 4.0
```
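To generate several images in one session, the single invocation above can be wrapped in a shell loop. This is a sketch assuming the `text_to_image.py` flags shown above; note that each iteration pays the full model-load cost, since the script is a one-shot CLI:

```shell
# Batch a list of prompts through the same CLI (hedged sketch; the second
# prompt here is an arbitrary example, not from the recipe).
i=0
for p in "a cup of coffee on the table" "a red bicycle in the rain"; do
  out="output_qwen_image_${i}.png"
  echo "queued: $out ($p)"
  # python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  #   --model Qwen/Qwen-Image --prompt "$p" --output "$out" \
  #   --num-inference-steps 50 --cfg-scale 4.0
  i=$((i+1))
done
```

Uncomment the `python3` lines to actually run the generation; distinct `--output` names keep iterations from overwriting each other.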
### Image Editing (Qwen-Image-Edit)

```shell
python3 ./examples/offline_inference/image_to_image/image_edit.py \
  --model Qwen/Qwen-Image-Edit \
  --image qwen_bear.png \
  --prompt "Let this mascot dance under the moon, surrounded by floating stars" \
  --output output_image_edit.png \
  --num-inference-steps 50 \
  --cfg-scale 4.0
```
### Layered RGBA Decomposition

```shell
python3 ./examples/offline_inference/image_to_image/image_edit.py \
  --model Qwen/Qwen-Image-Layered \
  --image input.png \
  --prompt "" \
  --output layered \
  --num-inference-steps 50 \
  --cfg-scale 4.0 \
  --layers 4 \
  --color-format "RGBA"
```
## Acceleration

Pick one cache backend and combine it with any supported parallel strategy.
### Cache-DiT

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --cache-backend cache_dit
```
### TeaCache

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --cache-backend tea_cache
```
### Ulysses / Ring Sequence Parallelism

```shell
# Ulysses sequence parallelism across 4 GPUs
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --ulysses-degree 4
```

```shell
# Ring sequence parallelism across 4 GPUs
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --ring-degree 4
```
### CFG Parallelism (2 GPUs, non-distilled models with `--cfg-scale` > 1)

```shell
python3 ./examples/offline_inference/image_to_image/image_edit.py \
  --model Qwen/Qwen-Image-Edit --image qwen_bear.png --prompt "..." \
  --cfg-parallel-size 2 --num-inference-steps 50 --cfg-scale 4.0
```
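When stacking parallel strategies, a quick way to budget hardware is to multiply the degrees. This is a sketch under the assumption (usual for such schemes, but not confirmed by this page) that the degrees multiply into the required world size:

```shell
# Rough GPU budget when combining CFG parallelism with sequence parallelism.
# Assumption: parallel degrees multiply into the required world size.
CFG_PARALLEL=2     # --cfg-parallel-size
ULYSSES=2          # --ulysses-degree
WORLD_SIZE=$((CFG_PARALLEL * ULYSSES))
echo "GPUs required: $WORLD_SIZE"
```

Check the Feature Compatibility Guide before stacking strategies, since not every combination is supported.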
### Tensor Parallelism

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --tensor-parallel-size 2
```
### CPU / Layerwise Offload (low VRAM)

```shell
# Offload whole components to CPU when idle
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --enable-cpu-offload
```

```shell
# Offload layer by layer (lower peak VRAM, slower)
python3 ./examples/offline_inference/image_to_image/image_edit.py \
  --model Qwen/Qwen-Image-Edit --image qwen_bear.png --prompt "..." \
  --enable-layerwise-offload
```
### VAE Patch Parallelism

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." \
  --height 1536 --width 1536 \
  --ulysses-degree 2 --vae-patch-parallel-size 2
```

VAE patch parallelism must be combined with another parallel method (here, Ulysses SP with degree 2).
### Quantization (Qwen-Image / Qwen-Image-2512 only)

```shell
# FP8 quantization, keeping the img_mlp layers in full precision
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --quantization fp8 \
  --ignored-layers "img_mlp"
```

```shell
# INT8 quantization
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --quantization int8
```

The Qwen-Image-Edit variants do not support quantization.
## Configuration Tips
- Cache + SP is the recommended combo for long-sequence generation.
- Sequence parallelism (Ulysses / Ring) generally outperforms tensor parallelism for high-resolution and long-sequence workloads.
- Tensor parallelism is most useful when model weights alone don't fit on one GPU.
- CFG parallelism targets non-distilled diffusion with full CFG (not for guidance-distilled models).
- To reduce peak VRAM, use CPU/layerwise offload and/or VAE patch parallelism.
- TeaCache and Cache-DiT cannot be used together.
- `--enforce-eager` disables torch.compile if needed.
See the Feature Support Table and Feature Compatibility Guide for combinations.
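Putting the first tip into practice, a cache backend and sequence parallelism can be combined in a single invocation. This is a sketch using only the flag spellings shown on this page; the command is built as a string first so it can be inspected before running:

```shell
# Recommended combo from the tips above: Cache-DiT + Ulysses SP (4 GPUs).
CMD="python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt 'a cup of coffee on the table' \
  --cache-backend cache_dit --ulysses-degree 4 \
  --num-inference-steps 50 --cfg-scale 4.0"
echo "$CMD"
# eval "$CMD"   # uncomment to run (needs 4 GPUs)
```

Swap `cache_dit` for `tea_cache` if preferred, but never pass both, since the two cache backends cannot be used together.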