vLLM/Recipes
MiniMax

MiniMaxAI/MiniMax-M2.1

MiniMax M2.1 MoE language model (230B total / 10B active) for coding, agent toolchains, and long-context reasoning — native FP8 checkpoint

MoE · 230B total / 10B active · 196,608-token context · vLLM 0.11.0+ · text-only

Overview

MiniMax-M2.1 is part of the MiniMax M2 series of advanced MoE language models. It retains the M2 architecture (10B activated of 230B total parameters) and improves on the original M2 release. Each sequence supports up to 196,608 tokens of context.

Prerequisites

  • OS: Linux
  • Python: 3.10 - 3.13
  • NVIDIA: compute capability >= 7.0; ~220 GB for weights + 240 GB per 1M context tokens
  • AMD: MI300X / MI325X / MI350X / MI355X with ROCm 7.0+
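The NVIDIA figures above imply a quick back-of-envelope VRAM budget. This sketch just applies the numbers from the list (weights plus KV cache for one full-length 196,608-token sequence):

```shell
# Rough VRAM budget using the figures from the prerequisites above
weights_gb=220          # ~220 GB for the FP8 weights
kv_gb_per_m=240         # ~240 GB of KV cache per 1M context tokens
ctx_tokens=196608       # one maximum-length sequence
kv_gb=$(( kv_gb_per_m * ctx_tokens / 1000000 ))
echo "~$(( weights_gb + kv_gb )) GB for weights + one max-length sequence"
```

In practice leave headroom on top of this figure for activations, CUDA graphs, and additional concurrent sequences.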

Install vLLM (NVIDIA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Docker (dedicated M2-series image)

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2.1 \
      --tensor-parallel-size 4 \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2 \
      --enable-auto-tool-choice \
      --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
      --trust-remote-code

Launching the Server

NVIDIA — TP4

vllm serve MiniMaxAI/MiniMax-M2.1 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice \
  --trust-remote-code

Pure TP8 is not supported. For >4 GPUs use DP+EP or TP+EP.

vllm serve MiniMaxAI/MiniMax-M2.1 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice \
  --trust-remote-code
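On an 8-GPU node, one way to scale past 4 GPUs is DP+EP: keep TP at 4 and add data parallelism on top of expert parallelism. This is a sketch only — the flag values below are assumptions to adapt to your hardware, not a tested configuration:

```shell
# Hypothetical 8-GPU layout: TP4 x DP2 with expert parallelism
vllm serve MiniMaxAI/MiniMax-M2.1 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --trust-remote-code
```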

AMD ROCm

VLLM_ROCM_USE_AITER=1 vllm serve MiniMaxAI/MiniMax-M2.1 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --trust-remote-code
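Once any of the servers above is up, you can exercise the tool-calling path that `--tool-call-parser minimax_m2` and `--enable-auto-tool-choice` enable. The payload below is a sketch — `get_weather` is a made-up example tool, not part of the model or vLLM:

```shell
# Sketch of a tool-calling request; get_weather is a hypothetical example tool.
# Validate the payload locally, then POST it to the running server.
cat > /tmp/m2_tool_call.json <<'EOF'
{
  "model": "MiniMaxAI/MiniMax-M2.1",
  "messages": [{"role": "user", "content": "What's the weather in Paris right now?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto"
}
EOF
python3 -m json.tool /tmp/m2_tool_call.json > /dev/null && echo "payload ok"
# With a server listening on the default port:
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @/tmp/m2_tool_call.json
```

With auto tool choice enabled, the response should contain a `tool_calls` entry rather than plain text when the model decides to invoke the tool.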

Benchmarking

vllm bench serve \
  --backend vllm \
  --model MiniMaxAI/MiniMax-M2.1 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
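As a sanity check on the configuration above: 100 prompts at 2,048 input and 1,024 output tokens each push roughly 307K tokens through the server (plain arithmetic, nothing vLLM-specific):

```shell
# Total tokens exercised by the benchmark configuration above
prompts=100
in_len=2048
out_len=1024
echo "$(( prompts * (in_len + out_len) )) total tokens"
```

Divide the reported throughput into this total to cross-check the benchmark's own duration figures.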

Troubleshooting

  • See MiniMax-M2 for shared troubleshooting notes (fuse_minimax_qk_norm, nightly vs stable, DeepGEMM, AITER).
