vLLM/Recipes
Ernie (Baidu)

baidu/ERNIE-4.5-VL-28B-A3B-PT

Baidu ERNIE 4.5 VL MoE vision-language models (28B-A3B, 424B-A47B) with heterogeneous text/vision experts

MoE · 28B total / 3B active · 131,072-token context · vLLM 0.11.0+ · multimodal

Overview

ERNIE 4.5 VL is Baidu's multimodal MoE model with heterogeneous experts (separate text and vision experts). Because of the heterogeneous architecture, torch.compile and CUDA graphs are not supported.

  • baidu/ERNIE-4.5-VL-28B-A3B-PT — 28B total / 3B active (1x80GB)
  • baidu/ERNIE-4.5-VL-424B-A47B-PT — 424B total / 47B active (8x140GB BF16, 8x80GB FP8+offload, or 16x80GB BF16)
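As a rough sanity check on these hardware figures, a back-of-envelope sketch of the weight memory (weights only; activations, KV cache, and the vision tower add tens of GB on top of this):

```python
# Rough weight-memory estimate for the 424B variant.
# Counts weights only; KV cache and activations come on top.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """GB needed to hold the weights alone (1e9 params * N bytes = N GB)."""
    return params_billion * bytes_per_param

bf16 = weight_gb(424, 2.0)  # BF16: 2 bytes/param -> 848 GB
fp8 = weight_gb(424, 1.0)   # FP8:  1 byte/param  -> 424 GB

print(f"BF16 weights: {bf16:.0f} GB")  # > 8x80 GB = 640 GB, fits 8x140 GB = 1120 GB
print(f"FP8 weights:  {fp8:.0f} GB")   # tight on 8x80 GB once KV cache is added,
# hence the --cpu-offload-gb 50 per GPU (= 400 GB spilled to CPU) in the FP8 recipe
```

This is why BF16 needs 8x140GB (or 16x80GB) while FP8 with CPU offload can squeeze onto 8x80GB for testing.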

Prerequisites

  • vLLM: ERNIE 4.5 VL support landed on the main branch recently; install the latest release (vLLM 0.11.0+)
  • Hardware: requirements depend on the variant (see Launch commands below)

Install vLLM (CUDA)

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Install vLLM (AMD ROCm MI300X/MI325X/MI355X)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

Launch commands

28B on 1x80GB:

vllm serve baidu/ERNIE-4.5-VL-28B-A3B-PT --trust-remote-code
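Once the server is up it exposes the OpenAI-compatible chat API (by default at http://localhost:8000). A minimal sketch of a vision request payload, assuming a locally running server and a reachable image URL (both the endpoint and the example image URL here are assumptions, not from the recipe):

```python
import json

# Minimal OpenAI-style chat payload with an image attachment.
# POST this to http://localhost:8000/v1/chat/completions (assumed default).
payload = {
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-PT",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                # Hypothetical image URL for illustration:
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
print(body[:50])
# Send with urllib.request or the `openai` client once the server is running.
```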

424B BF16 on 8x140GB:

vllm serve baidu/ERNIE-4.5-VL-424B-A47B-PT \
  --trust-remote-code \
  --tensor-parallel-size 8

424B with FP8 + CPU offload on 8x80GB (testing only):

vllm serve baidu/ERNIE-4.5-VL-424B-A47B-PT \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --cpu-offload-gb 50

28B on AMD MI300X+:

VLLM_ROCM_USE_AITER=1 SAFETENSORS_FAST_GPU=1 \
  vllm serve baidu/ERNIE-4.5-VL-28B-A3B-PT \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --disable-log-requests \
  --trust-remote-code

Benchmarking

vllm bench serve \
  --model baidu/ERNIE-4.5-VL-28B-A3B-PT \
  --dataset-name random \
  --random-input-len 8000 --random-output-len 1000 \
  --request-rate 10 --num-prompts 16 --ignore-eos --trust-remote-code
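To interpret the run above: a quick sketch of the offered load, assuming the random dataset hits the configured lengths exactly:

```python
# Nominal load of the benchmark command above.
num_prompts = 16
input_len, output_len = 8000, 1000
request_rate = 10  # requests/s

total_tokens = num_prompts * (input_len + output_len)
burst_seconds = num_prompts / request_rate  # time to submit all requests

print(f"{total_tokens:,} total tokens submitted over ~{burst_seconds:.1f}s")
# -> 144,000 total tokens submitted over ~1.6s
```

With `--ignore-eos`, every request generates the full 1,000 output tokens, so runs are comparable across configurations.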

References