stepfun-ai/Step-3.5-Flash
Production-grade reasoning MoE (~196B total / 11B active parameters) with hybrid attention schedules, SWA compensation, and multi-token prediction for low-latency long-context inference
Overview
Step-3.5-Flash is an advanced reasoning model from StepFun. Highlights:
- Hybrid attention schedules with compensation for sliding-window attention (SWA)
- Sparse MoE structure (196B total parameters, 11B active)
- Multi-token prediction mechanism for faster inference
Available precisions:
- stepfun-ai/Step-3.5-Flash (BF16)
- stepfun-ai/Step-3.5-Flash-FP8
- stepfun-ai/Step-3.5-Flash-Int4 (not yet supported by vLLM)
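As a rough guide to why multi-GPU hardware is needed, weight memory scales with bytes per parameter. A back-of-the-envelope sketch (weights only; real deployments need extra headroom for KV cache, activations, and runtime overhead):

```python
# Back-of-the-envelope weight-memory estimate for each precision.
# Actual memory usage is higher: KV cache, activations, and framework
# overhead are not included here.
TOTAL_PARAMS = 196e9  # ~196B total parameters (MoE; only ~11B active per token)

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "Int4": 0.5}

weights_gb = {name: TOTAL_PARAMS * b / 1e9 for name, b in BYTES_PER_PARAM.items()}
for name, gb in weights_gb.items():
    print(f"{name}: ~{gb:.0f} GB of weights")
```

This is why even the FP8 variant (~196 GB of weights alone) is served across 4 GPUs.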
Prerequisites
- vLLM version: latest stable
- Hardware: 4x H200/H20/B200
Install vLLM
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto
Launching the Server
Tensor Parallel
vllm serve stepfun-ai/Step-3.5-Flash \
--tensor-parallel-size 4 \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--trust-remote-code
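Once the server is up, it exposes an OpenAI-compatible HTTP API. A minimal client sketch using only the Python standard library (assumes vLLM's default listen address of localhost:8000; the model name must match the one passed to vllm serve):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's default listen address

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a /v1/chat/completions request."""
    payload = {
        "model": "stepfun-ai/Step-3.5-Flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Explain sliding-window attention in one sentence.")
print(req.get_method(), req.full_url)

# To actually send it (requires the server to be running):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```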
Note: The FP8 version cannot be served with TP4; use DP4 with expert parallelism instead.
Data Parallel + Expert Parallel (recommended for FP8)
vllm serve stepfun-ai/Step-3.5-Flash \
--data-parallel-size 4 \
--enable-expert-parallel \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--trust-remote-code
Enabling MTP Speculative Decoding
vllm serve stepfun-ai/Step-3.5-Flash \
--tensor-parallel-size 4 \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--trust-remote-code \
--hf-overrides '{"num_nextn_predict_layers": 1}' \
--speculative-config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'
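The two JSON-valued flags are easy to break with shell quoting. This sketch parses the exact strings passed above so you can sanity-check them before launch; with num_speculative_tokens set to 1, the single MTP head drafts one token per step, which the base model then verifies:

```python
import json

# The exact JSON strings passed to --hf-overrides and --speculative-config above.
hf_overrides = json.loads('{"num_nextn_predict_layers": 1}')
spec_config = json.loads('{"method": "step3p5_mtp", "num_speculative_tokens": 1}')

# One MTP layer is enabled, and one speculative token is drafted per step.
print(hf_overrides["num_nextn_predict_layers"], spec_config["num_speculative_tokens"])
```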
Benchmarking
vllm bench serve \
--backend vllm \
--model stepfun-ai/Step-3.5-Flash \
--endpoint /v1/completions \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--max-concurrency 10 \
--num-prompts 100
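For context, the total load this benchmark generates can be worked out from its parameters (a quick sketch; assumes every request runs to the full random output length):

```python
# Derived load for the benchmark invocation above.
num_prompts = 100      # total requests sent
input_len = 2048       # random input length per request
output_len = 1024      # random output length per request
max_concurrency = 10   # requests in flight at once

total_input_tokens = num_prompts * input_len    # prefill tokens
total_output_tokens = num_prompts * output_len  # decode tokens
waves = num_prompts / max_concurrency           # rough number of concurrency waves

print(total_input_tokens, total_output_tokens, waves)
```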
Troubleshooting
- MoE kernel tuning: See tune-moe-kernel to tune Triton kernels for your hardware.
- FP8 DeepGEMM: For FP8, install DeepGEMM via install_deepgemm.sh.
- B200 FlashInfer FP8 MoE error: If you see "routing_logits must be bfloat16" when serving FP8 on B200, set export VLLM_USE_FLASHINFER_MOE_FP8=0 as a workaround.
- FP8 + TP4 incompatibility: Use DP4 with expert parallelism instead.