google/gemma-4-E4B-it
Google's compact Gemma 4 multimodal model (effective 4B parameters) with native text, image, and audio support, plus a thinking mode and a tool-call protocol.
Overview
Gemma 4 E4B is Google's effective-4B unified multimodal model — text + images + audio in a single model, with structured thinking/reasoning, function calling, and dynamic vision resolution. It fits on a single 24 GB+ GPU.
Key Features
- Multimodal: Text + images + audio natively (video via custom frame-extraction pipeline).
- Dual Attention: Alternating sliding-window (local) and global attention with different head dimensions.
- Thinking Mode: Structured reasoning via <|channel>thought\n...<channel|> delimiters.
- Function Calling: Custom tool-call protocol with dedicated special tokens.
- Dynamic Vision Resolution: Per-request configurable vision token budget (70, 140, 280, 560, 1120 tokens).
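The vision token budget is a discrete set, so an arbitrary requested resolution has to map onto one of the five supported values. A small helper (hypothetical convenience code, not part of vLLM or the model release) that rounds a request up to the nearest supported budget:

```python
# Supported per-request vision token budgets, from the feature list above.
SUPPORTED_VISION_BUDGETS = (70, 140, 280, 560, 1120)

def nearest_vision_budget(requested: int) -> int:
    """Round a requested token budget up to the nearest supported value.

    Hypothetical helper: the server only accepts the discrete budgets
    listed in SUPPORTED_VISION_BUDGETS, so anything in between must be
    rounded, and anything above the maximum is clamped to it.
    """
    for budget in SUPPORTED_VISION_BUDGETS:
        if requested <= budget:
            return budget
    return SUPPORTED_VISION_BUDGETS[-1]
```

For example, `nearest_vision_budget(200)` returns 280, and any request above 1120 is clamped to 1120.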
TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.
Prerequisites
pip (NVIDIA CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu129 \
--extra-index-url https://download.pytorch.org/whl/cu129 \
--index-strategy unsafe-best-match
pip (AMD ROCm: MI300X, MI325X, MI350X, MI355X)
Requires Python 3.12, ROCm 7.2.1, glibc >= 2.35 (Ubuntu 22.04+).
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --pre \
--extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm721 --upgrade
Docker
docker pull vllm/vllm-openai:gemma4 # CUDA 12.9
docker pull vllm/vllm-openai:gemma4-cu130 # CUDA 13.0
docker pull vllm/vllm-openai-rocm:gemma4 # AMD
Deployment Configurations
Quick Start (Single GPU)
vllm serve google/gemma-4-E4B-it \
--max-model-len 32768
With Audio Support
vllm serve google/gemma-4-E4B-it \
--max-model-len 8192 \
--limit-mm-per-prompt image=4,audio=1
Full-Featured Server Launch
Enables text, image, audio, thinking, and tool calling:
vllm serve google/gemma-4-E4B-it \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template examples/tool_chat_template_gemma4.jinja \
--limit-mm-per-prompt image=4,audio=1 \
--async-scheduling \
--host 0.0.0.0 \
--port 8000
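With `--enable-auto-tool-choice` and the gemma4 tool-call parser active, the server accepts tools in the standard OpenAI function-calling format. A sketch of a request payload, assuming the server launch above; the `get_weather` tool and its parameters are illustrative placeholders, not part of the Gemma 4 release:

```python
# Illustrative tool schema in the OpenAI function-calling format.
# The get_weather name and its parameters are placeholders.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Keyword arguments for client.chat.completions.create(); with
# --enable-auto-tool-choice the server decides when to emit a tool call.
request_kwargs = {
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
```

Pass these to the OpenAI client as `client.chat.completions.create(**request_kwargs)`; parsed calls then appear in `response.choices[0].message.tool_calls`.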
Docker (NVIDIA)
docker run -itd --name gemma4-e4b \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:gemma4 \
--model google/gemma-4-E4B-it \
--max-model-len 32768 \
--host 0.0.0.0 --port 8000
Docker (AMD MI300X/MI325X/MI350X/MI355X)
docker run -itd --name gemma4-rocm \
--ipc=host --network=host --privileged \
--cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri \
--group-add=video --cap-add=SYS_PTRACE \
--security-opt=seccomp=unconfined --shm-size 16G \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai-rocm:gemma4 \
--model google/gemma-4-E4B-it \
--host 0.0.0.0 --port 8000
Docker (Cloud TPU — Trillium / Ironwood)
TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:
docker run -itd --name gemma4-tpu \
--privileged --network host --shm-size 16G \
-v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
vllm/vllm-tpu:latest \
--model google/gemma-4-E4B-it \
--max-model-len 16384 \
--disable_chunked_mm_input \
--host 0.0.0.0 --port 8000
Client Usage
Audio Transcription
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[{"role": "user", "content": [
{"type": "audio_url", "audio_url": {"url": "https://example.com/audio.wav"}},
{"type": "text", "text": "Transcribe this audio."},
]}],
max_tokens=512,
)
print(response.choices[0].message.content)
Image Understanding
response = client.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "Describe this image in detail."},
]}],
max_tokens=1024,
)
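Images do not have to be public URLs; OpenAI-compatible servers also accept base64 data URLs, which is useful for local files. A minimal sketch: the inline bytes below stand in for reading a real image from disk (e.g. `open("cat.png", "rb").read()`), so the snippet is self-contained:

```python
import base64

# Stand-in for the bytes of a local image file; replace with the
# contents of any PNG or JPEG on disk.
image_bytes = b"\x89PNG\r\n\x1a\n"

# Encode the image as a data URL (RFC 2397) for the image_url field.
b64 = base64.b64encode(image_bytes).decode("ascii")
image_url = f"data:image/png;base64,{b64}"

# Drop-in replacement for the content list in the request above.
content = [
    {"type": "image_url", "image_url": {"url": image_url}},
    {"type": "text", "text": "Describe this image in detail."},
]
```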
Thinking Mode
vllm serve google/gemma-4-E4B-it \
--max-model-len 16384 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice \
--chat-template examples/tool_chat_template_gemma4.jinja
Enable per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
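The toggle rides along in the OpenAI client's `extra_body` escape hatch, so a tiny helper can flip thinking on or off per request (a sketch; the helper name is our own, only the `extra_body` payload is from the source):

```python
def thinking_kwargs(enabled: bool) -> dict:
    """Build the extra_body payload that toggles thinking per request."""
    return {"chat_template_kwargs": {"enable_thinking": enabled}}

# Passed as: client.chat.completions.create(..., extra_body=thinking_kwargs(True))
on = thinking_kwargs(True)
off = thinking_kwargs(False)
```

With `--reasoning-parser gemma4` active, the server separates the reasoning text from the final answer in the response message rather than interleaving them.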
Configuration Tips
- Set --max-model-len to match your workload (max 131072).
- Image-only workloads: --limit-mm-per-prompt audio=0.
- Text-only workloads: --limit-mm-per-prompt image=0,audio=0 to skip multimodal profiling.
- --async-scheduling improves throughput.
- FP8 KV cache (--kv-cache-dtype fp8) saves ~50% KV cache memory.
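The ~50% figure follows directly from element width: per-token KV cache size scales linearly with bytes per element, and FP8 uses one byte where FP16 uses two. A back-of-the-envelope sketch (the layer/head dimensions below are illustrative placeholders, not the model's actual config):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """KV cache bytes per token: 2x for the key and value tensors."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Illustrative dimensions only -- not Gemma 4's real architecture.
fp16 = kv_bytes_per_token(layers=30, kv_heads=8, head_dim=128, dtype_bytes=2)
fp8 = kv_bytes_per_token(layers=30, kv_heads=8, head_dim=128, dtype_bytes=1)
print(fp8 / fp16)  # 0.5 -> the ~50% saving from --kv-cache-dtype fp8
```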