zai-org/GLM-4.7
GLM-4.7 MoE language model (~358B total parameters) with MTP speculative decoding, updated tool call parser, and reasoning support
Overview
GLM-4.7 is the latest GLM-4.X MoE release from Z-AI. It introduces the glm47
tool call parser while retaining the GLM-4.5 reasoning parser. Built-in
Multi-Token Prediction (MTP) layers enable speculative decoding for throughput
gains on decode-heavy workloads.
A smaller zai-org/GLM-4.7-Flash variant is also available for lower-latency
scenarios.
Prerequisites
- vLLM version: nightly recommended for GLM-4.7 (until packaged in a stable release)
- Hardware: 4x-8x H200 (FP8), AMD MI300X / MI325X / MI355X for ROCm
- Python: 3.10 - 3.13 (3.12 required for ROCm wheels)
Install vLLM (NVIDIA, nightly)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers.git
Install vLLM (AMD ROCm)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Launching the Server
Tensor Parallel + MTP (FP8 on 4xH200)
vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 4 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
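When scripting launches, the command above can be assembled programmatically. A minimal sketch, assuming a helper of our own naming (`build_serve_command` is not part of vLLM); it keeps the speculative-decoding settings as a single JSON value, which is the form `--speculative-config` accepts:

```python
import json
import shlex

# Hypothetical helper: assemble the `vllm serve` argument list,
# keeping the speculative-decoding settings as one JSON value.
def build_serve_command(model: str, tp: int, num_spec_tokens: int) -> list[str]:
    spec_config = {"method": "mtp", "num_speculative_tokens": num_spec_tokens}
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--speculative-config", json.dumps(spec_config),
        "--tool-call-parser", "glm47",
        "--reasoning-parser", "glm45",
        "--enable-auto-tool-choice",
    ]

cmd = build_serve_command("zai-org/GLM-4.7-FP8", tp=4, num_spec_tokens=1)
print(shlex.join(cmd))  # shell-quoted command line, ready to paste or exec
```

This is useful when the same launch script has to target both the FP8 and full-precision checkpoints with different tensor-parallel sizes.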
AMD ROCm
SAFETENSORS_FAST_GPU=1 \
vllm serve zai-org/GLM-4.7 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--disable-log-requests \
--no-enable-prefix-caching \
--trust-remote-code
Tuning Tips
- --max-model-len=65536 is a sensible default; the maximum context length is 128K.
- --max-num-batched-tokens=32768 for prompt-heavy workloads; reduce to 8K-16K for latency-sensitive serving.
- Use --gpu-memory-utilization=0.95 to maximize KV cache capacity.
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
model="zai-org/GLM-4.7-FP8",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=512,
)
print(resp.choices[0].message.content)
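Since the server is launched with the glm47 tool call parser and --enable-auto-tool-choice, the same endpoint also accepts OpenAI-style function calling. A sketch of the request body, assuming a made-up `get_weather` tool; against a running server you would send it with `client.chat.completions.create(**payload)`:

```python
import json

# Hypothetical example tool: the schema follows the OpenAI
# function-calling format that the endpoint expects.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = {
    "model": "zai-org/GLM-4.7-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
    "max_tokens": 512,
}

# With a live server: resp = client.chat.completions.create(**payload)
print(json.dumps(payload, indent=2))
```

When the model decides to call the tool, the parsed call appears on `resp.choices[0].message.tool_calls` rather than in the text content.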
Benchmarking
vllm bench serve \
--model zai-org/GLM-4.7-FP8 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Troubleshooting
- Parser mismatch: GLM-4.7 uses --tool-call-parser glm47 (not glm45).
- MTP acceptance: 1 speculative token gives ~90%+ acceptance and the best throughput.
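The acceptance figure above can be turned into a back-of-envelope throughput estimate using the standard speculative-decoding formula: with k draft tokens and per-token acceptance probability a, the expected tokens emitted per target-model step is (1 - a^(k+1)) / (1 - a). A quick sketch (the ~90% figure is the one quoted above; real acceptance varies by workload):

```python
def expected_tokens_per_step(k: int, accept: float) -> float:
    """Expected tokens emitted per target-model forward pass when
    drafting k speculative tokens, each accepted with probability `accept`."""
    if accept >= 1.0:
        return float(k + 1)  # every draft token accepted
    # Geometric series 1 + a + a^2 + ... + a^k
    return (1.0 - accept ** (k + 1)) / (1.0 - accept)

# With 1 MTP draft token and ~90% acceptance: roughly 1.9 tokens per step,
# i.e. close to the 2x ceiling that k=1 allows.
print(expected_tokens_per_step(1, 0.9))
```

This also shows why num_speculative_tokens=1 is the sweet spot here: at 90% acceptance you already capture most of the possible gain, while larger k adds draft cost for diminishing returns.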