zai-org/GLM-4.6
GLM-4.6 MoE language model (~357B total parameters, BF16) with MTP speculative decoding, native tool calling, and reasoning
Overview
GLM-4.6 is the successor to GLM-4.5 with ~357B total parameters. It retains the MoE architecture and built-in Multi-Token Prediction (MTP) layers used for speculative decoding. FP8 is the recommended precision for cost-efficient serving with minimal accuracy loss relative to BF16.
Prerequisites
- vLLM version: >= 0.11.0 (latest stable recommended)
- Hardware: 8x H200 (BF16) or 4x-8x H200 (FP8), AMD MI300X / MI325X / MI355X for ROCm
- Python: 3.10 - 3.13 (3.12 required for ROCm wheels)
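To see why those GPU counts line up, a back-of-envelope weight-memory check helps (a sketch: the ~357B parameter count is from this card; the per-GPU capacity figure is an assumption, H200 has roughly 141 GB of HBM3e):

```python
# Rough weight-footprint estimate for GLM-4.6 (~357B total parameters).
# H200 capacity (~141 GB/GPU) is an assumption, not from this card.
PARAMS = 357e9

def weight_gb(bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bytes_per_param / 1e9

bf16 = weight_gb(2.0)  # BF16: 2 bytes per parameter
fp8 = weight_gb(1.0)   # FP8: 1 byte per parameter

# 8x H200 offers ~1128 GB total; weights must fit with headroom for KV cache,
# which is why FP8 (~357 GB) can run on 4 GPUs while BF16 (~714 GB) needs 8.
print(f"BF16 weights: ~{bf16:.0f} GB, FP8 weights: ~{fp8:.0f} GB")
```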
Install vLLM (NVIDIA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Install vLLM (AMD ROCm)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Launching the Server
Tensor Parallel (FP8)
vllm serve zai-org/GLM-4.6-FP8 \
--tensor-parallel-size 8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
Enabling MTP Speculative Decoding
vllm serve zai-org/GLM-4.6-FP8 \
--tensor-parallel-size 4 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
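MTP drafts tokens that the base model then verifies in a single forward pass. As a rough mental model (a sketch, not vLLM's internal accounting), with k speculative tokens and an assumed i.i.d. per-token acceptance probability p, the expected tokens emitted per verification step is the geometric sum 1 + p + ... + p^k:

```python
# Expected tokens per verification step in speculative decoding, under the
# simplifying (hypothetical) assumption of an i.i.d. acceptance probability p.
def expected_tokens_per_step(k: int, p: float) -> float:
    """Base token plus accepted drafts: 1 + p + p**2 + ... + p**k."""
    return sum(p**i for i in range(k + 1))

# With num_speculative_tokens=1 (as in the command above) and p = 0.8,
# each step yields ~1.8 tokens instead of 1.
print(expected_tokens_per_step(1, 0.8))
```

This is why raising num_speculative_tokens only pays off when the draft acceptance rate stays high; otherwise the extra verification work is wasted.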
Tuning Tips
- --max-model-len=65536 works well for most scenarios; the maximum is 128K.
- --max-num-batched-tokens=32768 is a good default for prompt-heavy workloads.
- --gpu-memory-utilization=0.95 maximizes KV cache headroom.
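Putting the tips together, a tuned FP8 launch might look like this (flag values are the suggested defaults above, not requirements; adjust per workload):

```shell
vllm serve zai-org/GLM-4.6-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.95 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice
```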
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
model="zai-org/GLM-4.6-FP8",
messages=[{"role": "user", "content": "Summarize MTP speculative decoding."}],
max_tokens=512,
)
print(resp.choices[0].message.content)
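GLM-4.6's native tool calling uses the standard OpenAI tools schema. A minimal sketch of a tool definition to pass via the tools= parameter (the get_weather function here is hypothetical, not part of GLM-4.6 or vLLM):

```python
# OpenAI-style function tool schema; "get_weather" is a made-up example.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# Passed alongside messages, e.g.:
# resp = client.chat.completions.create(
#     model="zai-org/GLM-4.6-FP8",
#     messages=[{"role": "user", "content": "Weather in Paris?"}],
#     tools=[weather_tool],
# )
# Tool invocations then appear in resp.choices[0].message.tool_calls.
print(weather_tool["function"]["name"])
```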
Benchmarking
vllm bench serve \
--model zai-org/GLM-4.6-FP8 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Troubleshooting
- MTP memory overhead: Monitor GPU memory and tune batch size when enabling MTP.
- Tool calling not firing: Ensure --tool-call-parser glm45 and --enable-auto-tool-choice are both present.