GLM (Z-AI)

zai-org/GLM-4.7

GLM-4.7 MoE language model (~358B total parameters) with MTP speculative decoding, an updated tool-call parser, and reasoning support

MoE · 358B total / 32B active parameters · 202,752 token context · vLLM 0.11.0+ · text

Overview

GLM-4.7 is the latest GLM-4.X MoE release from Z-AI. It introduces the glm47 tool call parser while retaining the GLM-4.5 reasoning parser. Built-in Multi-Token Prediction (MTP) layers enable speculative decoding for throughput gains on decode-heavy workloads.

A smaller zai-org/GLM-4.7-Flash variant is also available for lower-latency scenarios.

Prerequisites

  • vLLM version: nightly recommended for GLM-4.7 (until packaged in a stable release)
  • Hardware: 4x-8x H200 (FP8), AMD MI300X / MI325X / MI355X for ROCm
  • Python: 3.10 - 3.13 (3.12 required for ROCm wheels)

Install vLLM (NVIDIA, nightly)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers.git

Install vLLM (AMD ROCm)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm

Launching the Server

Tensor Parallel + MTP (FP8 on 4xH200)

vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice

AMD ROCm

SAFETENSORS_FAST_GPU=1 \
vllm serve zai-org/GLM-4.7 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --disable-log-requests \
    --no-enable-prefix-caching \
    --trust-remote-code

Tuning Tips

  • --max-model-len=65536 is a sensible default; the model supports up to 202,752 tokens.
  • --max-num-batched-tokens=32768 suits prompt-heavy workloads; reduce to 8K-16K for latency-sensitive serving.
  • Use --gpu-memory-utilization=0.95 to maximize KV cache.
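As a rough sanity check for --gpu-memory-utilization, you can estimate how many tokens of KV cache fit in the memory left after weights are loaded. The numbers below (4x H200, a hypothetical FP8 weight footprint, and a hypothetical per-token KV size) are illustrative placeholders, not measured values for GLM-4.7:

```python
# Back-of-envelope KV-cache capacity estimate (all numbers are placeholders).
GPU_MEM_GB = 141                  # one H200
NUM_GPUS = 4                      # --tensor-parallel-size 4
UTIL = 0.95                       # --gpu-memory-utilization
WEIGHTS_GB = 360                  # hypothetical FP8 weight footprint across all GPUs
KV_BYTES_PER_TOKEN = 161 * 1024   # hypothetical per-token KV footprint (all layers)

budget_gb = GPU_MEM_GB * NUM_GPUS * UTIL - WEIGHTS_GB
tokens = int(budget_gb * 1024**3 / KV_BYTES_PER_TOKEN)
print(f"KV budget: {budget_gb:.0f} GiB -> ~{tokens:,} tokens of KV cache")
```

Raising utilization from 0.9 to 0.95 moves a few extra GiB per GPU into this budget, which is why it helps long-context or high-concurrency workloads.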

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
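With --enable-auto-tool-choice and the glm47 parser, tool calls come back in the standard OpenAI tool_calls format, with arguments serialized as a JSON string. A sketch of the request-side tools schema and response-side parsing; the get_weather tool and the sample arguments below are made up for illustration:

```python
import json

# Standard OpenAI function-calling schema; "get_weather" is a made-up example tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Pass tools=tools to client.chat.completions.create(...); the parser then fills
# message.tool_calls. Arguments arrive as a JSON string (simulated value here):
tool_call_args = '{"city": "Berlin"}'
args = json.loads(tool_call_args)
print(args["city"])
```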

Benchmarking

vllm bench serve \
  --model zai-org/GLM-4.7-FP8 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos
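The random-dataset flags above imply a fixed token volume, which makes it easy to turn the reported wall-clock time into aggregate throughput. A quick sketch (the elapsed time is a placeholder, not a measured result):

```python
# Token volume implied by the benchmark flags above.
num_prompts = 16
input_len, output_len = 8000, 1000

total_in = num_prompts * input_len     # prefill tokens
total_out = num_prompts * output_len   # decode tokens
print(f"prefill tokens: {total_in:,}, decode tokens: {total_out:,}")

# Divide by the elapsed time vllm bench reports to get output throughput.
elapsed_s = 10.0  # placeholder, not a measured value
print(f"output throughput: {total_out / elapsed_s:,.0f} tok/s")
```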

Troubleshooting

  • Parser mismatch: GLM-4.7 uses --tool-call-parser glm47 (not glm45).
  • MTP acceptance: 1 speculative token gives ~90%+ acceptance and best throughput.
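A simple way to see why one speculative token with ~90% acceptance is attractive: each verification step always emits the verified token, plus the draft token whenever it is accepted, so expected tokens per step is roughly 1 + acceptance rate (ignoring MTP drafting overhead):

```python
# Expected tokens emitted per target-model step with k=1 MTP draft token.
# Ignores draft-layer overhead; 0.9 matches the ~90% acceptance figure above.
acceptance = 0.9
tokens_per_step = 1 + acceptance  # 1 guaranteed + the draft token when accepted
print(f"~{tokens_per_step:.1f} tokens/step, i.e. up to ~{tokens_per_step:.1f}x decode speedup")
```

With more draft tokens the upside grows, but each extra token must survive all previous acceptances, so returns diminish; that is why this recipe stops at num_speculative_tokens 1.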

References