google/gemma-4-E4B-it
Google's compact Gemma 4 multimodal model (effective 4B parameters) with native text, image, and audio support, plus a thinking mode and a tool-call protocol.
Overview
Gemma 4 E4B is Google's effective-4B unified multimodal model — text + images + audio in a single model, with structured thinking/reasoning, function calling, and dynamic vision resolution. It fits on a single 24 GB+ GPU.
Key Features
- Multimodal: Text + images + audio natively (video via custom frame-extraction pipeline).
- Dual Attention: Alternating sliding-window (local) and global attention with different head dimensions.
- Thinking Mode: Structured reasoning via <|channel>thought\n...<channel|> delimiters.
- Function Calling: Custom tool-call protocol with dedicated special tokens.
- Dynamic Vision Resolution: Per-request configurable vision token budget (70, 140, 280, 560, 1120 tokens).
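The vision token budget is a discrete set, so an arbitrary requested resolution has to map onto one of the five supported values. A small helper (hypothetical convenience code, not part of vLLM or the model release) that rounds a request up to the nearest supported budget:

```python
# Supported per-request vision token budgets, from the feature list above.
SUPPORTED_VISION_BUDGETS = (70, 140, 280, 560, 1120)

def nearest_vision_budget(requested: int) -> int:
    """Round a requested token budget up to the nearest supported value.

    Hypothetical helper: the server only accepts the discrete budgets
    listed in SUPPORTED_VISION_BUDGETS, so anything in between must be
    rounded, and anything above the maximum is clamped to it.
    """
    for budget in SUPPORTED_VISION_BUDGETS:
        if requested <= budget:
            return budget
    return SUPPORTED_VISION_BUDGETS[-1]
```

For example, `nearest_vision_budget(200)` returns 280, and any request above 1120 is clamped to 1120.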
TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.
Prerequisites
pip (NVIDIA CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu129 \
--extra-index-url https://download.pytorch.org/whl/cu129 \
--index-strategy unsafe-best-match
pip (AMD ROCm: MI300X, MI325X, MI350X, MI355X)
Requires Python 3.12, ROCm 7.2.1, glibc >= 2.35 (Ubuntu 22.04+).
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --pre \
--extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm721 --upgrade
Docker
docker pull vllm/vllm-openai:gemma4 # CUDA 12.9
docker pull vllm/vllm-openai:gemma4-cu130 # CUDA 13.0
docker pull vllm/vllm-openai-rocm:gemma4 # AMD
Deployment Configurations
Quick Start (Single GPU)
vllm serve google/gemma-4-E4B-it \
--max-model-len 32768
With Audio Support
vllm serve google/gemma-4-E4B-it \
--max-model-len 8192 \
--limit-mm-per-prompt image=4,audio=1
Full-Featured Server Launch
Enables text, image, audio, thinking, and tool calling:
vllm serve google/gemma-4-E4B-it \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template examples/tool_chat_template_gemma4.jinja \
--limit-mm-per-prompt image=4,audio=1 \
--async-scheduling \
--host 0.0.0.0 \
--port 8000
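With `--enable-auto-tool-choice` and the gemma4 tool-call parser active, the server accepts tools in the standard OpenAI function-calling format. A sketch of a request payload, assuming the server launch above; the `get_weather` tool and its parameters are illustrative placeholders, not part of the Gemma 4 release:

```python
# Illustrative tool schema in the OpenAI function-calling format.
# The get_weather name and its parameters are placeholders.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Keyword arguments for client.chat.completions.create(); with
# --enable-auto-tool-choice the server decides when to emit a tool call.
request_kwargs = {
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
```

Pass these to the OpenAI client as `client.chat.completions.create(**request_kwargs)`; parsed calls then appear in `response.choices[0].message.tool_calls`.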
Docker (NVIDIA)
docker run -itd --name gemma4-e4b \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:gemma4 \
--model google/gemma-4-E4B-it \
--max-model-len 32768 \
--host 0.0.0.0 --port 8000
Docker (AMD MI300X/MI325X/MI350X/MI355X)
docker run -itd --name gemma4-rocm \
--ipc=host --network=host --privileged \
--cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri \
--group-add=video --cap-add=SYS_PTRACE \
--security-opt=seccomp=unconfined --shm-size 16G \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai-rocm:gemma4 \
--model google/gemma-4-E4B-it \
--host 0.0.0.0 --port 8000
Docker (Cloud TPU — Trillium / Ironwood)
TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:
docker run -itd --name gemma4-tpu \
--privileged --network host --shm-size 16G \
-v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
vllm/vllm-tpu:latest \
--model google/gemma-4-E4B-it \
--max-model-len 16384 \
--disable_chunked_mm_input \
--host 0.0.0.0 --port 8000
Client Usage
Audio Transcription
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[{"role": "user", "content": [
{"type": "audio_url", "audio_url": {"url": "https://example.com/audio.wav"}},
{"type": "text", "text": "Transcribe this audio."},
]}],
max_tokens=512,
)
print(response.choices[0].message.content)
Image Understanding
response = client.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "Describe this image in detail."},
]}],
max_tokens=1024,
)
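Images do not have to be public URLs; OpenAI-compatible servers also accept base64 data URLs, which is useful for local files. A minimal sketch: the inline bytes below stand in for reading a real image from disk (e.g. `open("cat.png", "rb").read()`), so the snippet is self-contained:

```python
import base64

# Stand-in for the bytes of a local image file; replace with the
# contents of any PNG or JPEG on disk.
image_bytes = b"\x89PNG\r\n\x1a\n"

# Encode the image as a data URL (RFC 2397) for the image_url field.
b64 = base64.b64encode(image_bytes).decode("ascii")
image_url = f"data:image/png;base64,{b64}"

# Drop-in replacement for the content list in the request above.
content = [
    {"type": "image_url", "image_url": {"url": image_url}},
    {"type": "text", "text": "Describe this image in detail."},
]
```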
Thinking Mode
vllm serve google/gemma-4-E4B-it \
--max-model-len 16384 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice \
--chat-template examples/tool_chat_template_gemma4.jinja
Enable per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
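The toggle rides along in the OpenAI client's `extra_body` escape hatch, so a tiny helper can flip thinking on or off per request (a sketch; the helper name is our own, only the `extra_body` payload is from the source):

```python
def thinking_kwargs(enabled: bool) -> dict:
    """Build the extra_body payload that toggles thinking per request."""
    return {"chat_template_kwargs": {"enable_thinking": enabled}}

# Passed as: client.chat.completions.create(..., extra_body=thinking_kwargs(True))
on = thinking_kwargs(True)
off = thinking_kwargs(False)
```

With `--reasoning-parser gemma4` active, the server separates the reasoning text from the final answer in the response message rather than interleaving them.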
Configuration Tips
- Set --max-model-len to match your workload (max 131072).
- Image-only workloads: --limit-mm-per-prompt audio=0.
- Text-only workloads: --limit-mm-per-prompt image=0,audio=0 to skip multimodal profiling.
- --async-scheduling improves throughput.
- FP8 KV cache (--kv-cache-dtype fp8) saves ~50% KV cache memory.
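The ~50% figure follows directly from element width: per-token KV cache size scales linearly with bytes per element, and FP8 uses one byte where FP16 uses two. A back-of-the-envelope sketch (the layer/head dimensions below are illustrative placeholders, not the model's actual config):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """KV cache bytes per token: 2x for the key and value tensors."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Illustrative dimensions only -- not Gemma 4's real architecture.
fp16 = kv_bytes_per_token(layers=30, kv_heads=8, head_dim=128, dtype_bytes=2)
fp8 = kv_bytes_per_token(layers=30, kv_heads=8, head_dim=128, dtype_bytes=1)
print(fp8 / fp16)  # 0.5 -> the ~50% saving from --kv-cache-dtype fp8
```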