# Qwen/Qwen-Image
Text-to-image diffusion model (20B parameters) from the Qwen-Image family, served via vLLM-Omni.
## Overview
Qwen-Image is a diffusion-based text-to-image model. This recipe documents the Qwen-Image family served via vLLM-Omni:
| Model | HuggingFace | Description |
|---|---|---|
| Qwen-Image | Qwen/Qwen-Image | Text-to-image (20B, Aug 2025) |
| Qwen-Image-2512 | Qwen/Qwen-Image-2512 | Updated T2I with enhanced realism (Dec 2025) |
| Qwen-Image-Edit | Qwen/Qwen-Image-Edit | Single-image editing (Aug 2025) |
| Qwen-Image-Edit-2509 | Qwen/Qwen-Image-Edit-2509 | Multi-image editing (Sep 2025) |
| Qwen-Image-Edit-2511 | Qwen/Qwen-Image-Edit-2511 | Enhanced consistency + built-in LoRA (Nov 2025) |
| Qwen-Image-Layered | Qwen/Qwen-Image-Layered | Decomposes input into RGBA layers (Dec 2025) |
All models share the same DiT transformer core, so the acceleration methods below apply across the entire series.
## Prerequisites

```shell
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv venv
source .venv/bin/activate
uv pip install -e . vllm==0.18.0
```
## Usage

### Text-to-Image

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image \
  --prompt "a cup of coffee on the table" \
  --output output_qwen_image.png \
  --num-inference-steps 50 \
  --cfg-scale 4.0
```
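To generate several images in one session, the single invocation above can be wrapped in a shell loop. This is a sketch assuming the `text_to_image.py` flags shown above; note that each iteration pays the full model-load cost, since the script is a one-shot CLI:

```shell
# Batch a list of prompts through the same CLI (hedged sketch; the second
# prompt here is an arbitrary example, not from the recipe).
i=0
for p in "a cup of coffee on the table" "a red bicycle in the rain"; do
  out="output_qwen_image_${i}.png"
  echo "queued: $out ($p)"
  # python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  #   --model Qwen/Qwen-Image --prompt "$p" --output "$out" \
  #   --num-inference-steps 50 --cfg-scale 4.0
  i=$((i+1))
done
```

Uncomment the `python3` lines to actually run the generation; distinct `--output` names keep iterations from overwriting each other.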
### Image Editing (Qwen-Image-Edit)

```shell
python3 ./examples/offline_inference/image_to_image/image_edit.py \
  --model Qwen/Qwen-Image-Edit \
  --image qwen_bear.png \
  --prompt "Let this mascot dance under the moon, surrounded by floating stars" \
  --output output_image_edit.png \
  --num-inference-steps 50 \
  --cfg-scale 4.0
```
### Layered RGBA Decomposition

```shell
python3 ./examples/offline_inference/image_to_image/image_edit.py \
  --model Qwen/Qwen-Image-Layered \
  --image input.png \
  --prompt "" \
  --output layered \
  --num-inference-steps 50 \
  --cfg-scale 4.0 \
  --layers 4 \
  --color-format "RGBA"
```
## Acceleration

Pick one cache backend and combine it with any supported parallel strategy.
### Cache-DiT

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --cache-backend cache_dit
```
### TeaCache

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --cache-backend tea_cache
```
### Ulysses / Ring Sequence Parallelism

```shell
# Ulysses sequence parallelism across 4 GPUs
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --ulysses-degree 4
```

```shell
# Ring sequence parallelism across 4 GPUs
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --ring-degree 4
```
### CFG Parallelism (2 GPUs, non-distilled models with `--cfg-scale` > 1)

```shell
python3 ./examples/offline_inference/image_to_image/image_edit.py \
  --model Qwen/Qwen-Image-Edit --image qwen_bear.png --prompt "..." \
  --cfg-parallel-size 2 --num-inference-steps 50 --cfg-scale 4.0
```
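When stacking parallel strategies, a quick way to budget hardware is to multiply the degrees. This is a sketch under the assumption (usual for such schemes, but not confirmed by this page) that the degrees multiply into the required world size:

```shell
# Rough GPU budget when combining CFG parallelism with sequence parallelism.
# Assumption: parallel degrees multiply into the required world size.
CFG_PARALLEL=2     # --cfg-parallel-size
ULYSSES=2          # --ulysses-degree
WORLD_SIZE=$((CFG_PARALLEL * ULYSSES))
echo "GPUs required: $WORLD_SIZE"
```

Check the Feature Compatibility Guide before stacking strategies, since not every combination is supported.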
### Tensor Parallelism

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --tensor-parallel-size 2
```
### CPU / Layerwise Offload (low VRAM)

```shell
# Offload whole components to CPU when idle
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --enable-cpu-offload
```

```shell
# Offload layer by layer (lower peak VRAM, slower)
python3 ./examples/offline_inference/image_to_image/image_edit.py \
  --model Qwen/Qwen-Image-Edit --image qwen_bear.png --prompt "..." \
  --enable-layerwise-offload
```
### VAE Patch Parallelism

```shell
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." \
  --height 1536 --width 1536 \
  --ulysses-degree 2 --vae-patch-parallel-size 2
```

VAE patch parallelism must be combined with another parallel method (here, Ulysses SP with degree 2).
### Quantization (Qwen-Image / Qwen-Image-2512 only)

```shell
# FP8 quantization, keeping the img_mlp layers in full precision
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --quantization fp8 \
  --ignored-layers "img_mlp"
```

```shell
# INT8 quantization
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "..." --quantization int8
```

The Qwen-Image-Edit variants do not support quantization.
## Configuration Tips
- Cache + SP is the recommended combo for long-sequence generation.
- Sequence parallelism (Ulysses / Ring) generally outperforms tensor parallelism for high-resolution and long-sequence workloads.
- Tensor parallelism is most useful when model weights alone don't fit on one GPU.
- CFG parallelism targets non-distilled diffusion with full CFG (not for guidance-distilled models).
- To reduce peak VRAM, use CPU/layerwise offload and/or VAE patch parallelism.
- TeaCache and Cache-DiT cannot be used together.
- `--enforce-eager` disables torch.compile if needed.
See the Feature Support Table and Feature Compatibility Guide for combinations.
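Putting the first tip into practice, a cache backend and sequence parallelism can be combined in a single invocation. This is a sketch using only the flag spellings shown on this page; the command is built as a string first so it can be inspected before running:

```shell
# Recommended combo from the tips above: Cache-DiT + Ulysses SP (4 GPUs).
CMD="python3 ./examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt 'a cup of coffee on the table' \
  --cache-backend cache_dit --ulysses-degree 4 \
  --num-inference-steps 50 --cfg-scale 4.0"
echo "$CMD"
# eval "$CMD"   # uncomment to run (needs 4 GPUs)
```

Swap `cache_dit` for `tea_cache` if preferred, but never pass both, since the two cache backends cannot be used together.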