
zai-org/Glyph

Visual-text compression framework that renders long text into images and processes them with a reasoning VLM, scaling effective context length

Dense · 10B parameters · 131,072 context · vLLM 0.11.0+ · multimodal

Overview

Glyph is a framework from Zhipu AI for scaling context length via visual-text compression. It renders long textual sequences into images and processes them with a vision-language model. This recipe covers the vLLM deployment of the zai-org/Glyph VLM as a component in that framework.
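
To illustrate the rendering idea, here is a minimal sketch of turning plain text into a PNG an image-capable model can consume. It assumes Pillow is installed; the helper name and layout are hypothetical stand-ins, not Glyph's actual rendering pipeline.

```python
from io import BytesIO
from PIL import Image, ImageDraw

def render_text_to_png(text: str, width: int = 800, font_size: int = 16) -> bytes:
    """Render plain text onto a white canvas, one source line per row.
    A toy stand-in for a real text-to-image rendering step."""
    lines = text.splitlines() or [""]
    line_height = font_size + 4
    img = Image.new("RGB", (width, line_height * len(lines) + 8), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        # Default bitmap font; a production renderer would control font,
        # wrapping, and DPI to hit a target compression ratio.
        draw.text((4, 4 + i * line_height), line, fill="black")
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

png = render_text_to_png("A very long document...\nsplit across many lines.")
```

The resulting bytes can be base64-encoded and passed to the model as an image input, trading token count for pixels.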

Glyph is a reasoning multimodal model, so --reasoning-parser glm45 is recommended to parse reasoning traces from outputs.

Prerequisites

  • vLLM version: latest stable (0.11.0 or newer)
  • Hardware: 1x H100 or 1x MI300X/MI325X

Install vLLM (NVIDIA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Install vLLM (AMD ROCm)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm

The ROCm wheel requires Python 3.12, ROCm 7.0, and glibc >= 2.35.

Launching the Server

Single H100 GPU

vllm serve zai-org/Glyph \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --reasoning-parser glm45 \
    --limit-mm-per-prompt.video 0

Single MI300X / MI325X

VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve zai-org/Glyph \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --reasoning-parser glm45 \
    --limit-mm-per-prompt.video 0
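
Once the server is up, requests go through the OpenAI-compatible chat completions API. Below is a sketch of the request payload for pairing a rendered-text image with a question; the placeholder PNG bytes and prompt text are illustrative. POST the JSON body to http://localhost:8000/v1/chat/completions with any HTTP client.

```python
import base64
import json

# Placeholder for PNG bytes produced by a text-rendering step (illustrative only).
png_bytes = b"\x89PNG\r\n\x1a\n"
data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()

payload = {
    "model": "zai-org/Glyph",
    "messages": [
        {
            "role": "user",
            "content": [
                # The rendered long-context text, delivered as an image.
                {"type": "image_url", "image_url": {"url": data_url}},
                # The actual question about that content.
                {"type": "text", "text": "Summarize the document in the image."},
            ],
        }
    ],
    "max_tokens": 512,
}

body = json.dumps(payload)  # POST this to /v1/chat/completions
```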

Configuration Tips

  • --no-enable-prefix-caching and --mm-processor-cache-gb 0 are recommended for OCR-like workloads where image reuse is uncommon; they avoid unnecessary hashing and caching overhead.
  • Adjust --max-num-batched-tokens for throughput according to your hardware.

Benchmarking

vllm bench serve \
  --model zai-org/Glyph \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 512 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos

Troubleshooting

  • Reasoning traces: Use --reasoning-parser glm45 to extract reasoning content.
  • Slow first inference: Disabling prefix caching and multimodal processor caching is intentional for Glyph's use case and trades off first-request latency for predictable throughput.
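
With --reasoning-parser glm45 active, vLLM separates the reasoning trace from the final answer in the response message. A sketch of splitting the two fields client-side; the sample response dict below is illustrative, not real model output:

```python
# Illustrative shape of a /v1/chat/completions response when a
# reasoning parser is enabled: the trace lands in reasoning_content.
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The image shows a contract; key clauses are...",
                "content": "The document is a two-party service contract.",
            }
        }
    ]
}

message = response["choices"][0]["message"]
reasoning = message.get("reasoning_content")  # chain-of-thought trace (may be None)
answer = message["content"]                   # user-facing final answer
```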
