vLLM/Recipes
GLM (Z-AI)

zai-org/GLM-OCR

GLM-OCR image-to-text model with built-in MTP speculative decoding for high-throughput OCR serving

dense · 0.9B · 131,072 ctx · vLLM 0.12.0+ · multimodal

Overview

GLM-OCR is a vision-language model for end-to-end OCR. It includes built-in Multi-Token Prediction (MTP) layers enabling speculative decoding for higher throughput generation.

Prerequisites

  • vLLM version: nightly recommended (or latest stable with MTP support)
  • Transformers: >= 5.0.0 (install from source for latest)
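As a quick sanity check, installed versions can be compared against these minimums with a stdlib-only helper (a sketch; the PyPI package names are real, and the minimums are the ones listed above):

```python
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted numeric versions, e.g. '5.1.0' >= '5.0.0'."""
    def parts(v: str):
        # Keep only the leading numeric components ('5.0.0.dev0' -> (5, 0, 0)).
        nums = []
        for piece in v.split("."):
            if piece.isdigit():
                nums.append(int(piece))
            else:
                break
        return tuple(nums)
    return parts(installed) >= parts(required)


for pkg, minimum in [("vllm", "0.12.0"), ("transformers", "5.0.0")]:
    try:
        ok = meets_minimum(version(pkg), minimum)
        print(pkg, version(pkg), "OK" if ok else "TOO OLD")
    except PackageNotFoundError:
        print(pkg, "not installed")
```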

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Or nightly:

uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers.git

Launching the Server

With MTP Speculative Decoding

vllm serve zai-org/GLM-OCR \
     --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
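Model loading can take a while, so it helps to wait for the server before sending requests. A stdlib poller against vLLM's `/health` endpoint (a sketch; the URL and timeout values are illustrative):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str = "http://localhost:8000/health",
                    timeout: float = 300.0) -> None:
    """Poll the health endpoint until the server answers or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return  # server is up
        except (urllib.error.URLError, OSError):
            pass  # not listening yet; keep polling
        time.sleep(1.0)
    raise TimeoutError(f"server at {url} not ready after {timeout}s")
```

`wait_for_server()` returns as soon as the endpoint responds with 200, so the client code below can follow immediately.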

Client Usage

OpenAI SDK

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"}},
        {"type": "text", "text": "Text Recognition:"}
    ]
}]

response = client.chat.completions.create(
    model="zai-org/GLM-OCR",
    messages=messages,
    max_tokens=2048,
    temperature=0.0,
)
print(response.choices[0].message.content)

cURL

curl -s http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
          "model": "zai-org/GLM-OCR",
          "messages": [{
               "role": "user",
               "content": [
                    {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
                    {"type": "text", "text": "Text Recognition:"}
               ]
          }],
          "max_tokens": 2048,
          "temperature": 0.0
     }'
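The same request can also be issued without the OpenAI SDK. A stdlib sketch that assembles the identical JSON body (the example.com URL is the placeholder from the cURL call above):

```python
import json
import urllib.request


def build_ocr_request(image_url: str, prompt: str = "Text Recognition:") -> dict:
    """Assemble the chat/completions body used in the cURL example."""
    return {
        "model": "zai-org/GLM-OCR",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 2048,
        "temperature": 0.0,
    }


def post_ocr(image_url: str, base: str = "http://localhost:8000") -> str:
    """POST the request to a running server and return the recognized text."""
    body = json.dumps(build_ocr_request(image_url)).encode()
    req = urllib.request.Request(
        f"{base}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=3600) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```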

Troubleshooting

  • Greedy sampling recommended: Use temperature=0.0 for optimal OCR accuracy.
  • Transformers version: Requires transformers >= 5.0.0.

References