deepseek-ai/DeepSeek-V3.2
DeepSeek V3.2 MoE model with MLA attention, sparse attention, and scalable RL for strong reasoning and agent capabilities.
Overview
DeepSeek-V3.2 is a Mixture-of-Experts model that balances computational efficiency with strong reasoning and agent capabilities through three technical innovations: DeepSeek Sparse Attention (DSA) for efficient long-context processing, a scalable reinforcement learning framework achieving GPT-5-level performance, and a large-scale agentic task synthesis pipeline for robust tool-use generalization.
Prerequisites
- Hardware: Minimum 8x H100/H200 80GB GPUs (BF16) or 3x H200 (NVFP4 variant).
- vLLM: Version 0.18.0 or later (nightly recommended).
- Python: 3.10+
- CUDA: 12.x or later (CUDA 13.x may require extra env vars; see Troubleshooting).
- Disk: ~1.3 TB for BF16 weights; ~350 GB for NVFP4 variant.
- DeepGEMM (recommended): uv pip install git+https://github.com/deepseek-ai/DeepGEMM.git@v2.1.1.post3 --no-build-isolation
Note: Set VLLM_USE_DEEP_GEMM=0 to disable MoE DeepGEMM if you experience issues (e.g., on H20 GPUs) or want to skip the long warmup.
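The disk estimates above follow directly from parameter count times bytes per parameter. A quick sanity check, assuming roughly 671B total parameters (inferred from the ~1.3 TB BF16 figure; the exact count may differ) and ignoring quantization scales and any non-quantized layers in the NVFP4 variant:

```python
def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# BF16: 2 bytes per parameter -> ~1.3 TB
print(weight_footprint_gb(671e9, 2.0))   # 1342.0 GB

# NVFP4: 4 bits (0.5 byte) per parameter -> ~350 GB once scales
# and unquantized layers are included
print(weight_footprint_gb(671e9, 0.5))   # 335.5 GB
```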
Client Usage
Launch the server:
vllm serve deepseek-ai/DeepSeek-V3.2 \
--tensor-parallel-size 8 \
--trust-remote-code \
--kernel-config.enable_flashinfer_autotune=False \
--tokenizer-mode deepseek_v32 \
--tool-call-parser deepseek_v32 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v3
Use the OpenAI Python SDK to interact with the server:
from openai import OpenAI
client = OpenAI(
api_key="your-api-key",
base_url="http://localhost:8000/v1",
)
# Standard chat
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-V3.2",
messages=[{"role": "user", "content": "Hello!"}],
)
# Thinking / reasoning mode
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-V3.2",
messages=[{"role": "user", "content": "Solve this step by step..."}],
extra_body={"chat_template_kwargs": {"thinking": True}},
)
Troubleshooting
ptxas fatal: Value 'sm_110a' is not defined for option 'gpu-name'
This can occur on CUDA 13.x. Fix by exporting:
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
TP=8 performance on Hopper/Blackwell
Avoid -tp 8 with FlashMLA-Sparse. Kernel restrictions mean TP=8 yields only 16 attention
heads per rank, which the kernel pads to 64, wasting compute. Prefer TP=2 (Hopper) or
TP=1 (Blackwell) with DP/EP mode: vllm serve deepseek-ai/DeepSeek-V3.2 -dp 8 --enable-expert-parallel.
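The padding overhead above falls out of simple head arithmetic. Assuming 128 total attention heads (implied by 16 heads per rank at TP=8) and a 64-head per-rank minimum for the FlashMLA-Sparse kernel:

```python
TOTAL_HEADS = 128       # implied by 16 heads/rank at TP=8
KERNEL_MIN_HEADS = 64   # FlashMLA-Sparse pads each rank up to this

for tp in (1, 2, 8):
    heads = TOTAL_HEADS // tp
    padded = max(heads, KERNEL_MIN_HEADS)
    print(f"TP={tp}: {heads} heads/rank, padded to {padded} "
          f"({padded // heads}x attention work)")
# TP=1 and TP=2 incur no padding; TP=8 does 4x the attention work.
```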
DeepGEMM warmup too slow
Set VLLM_USE_DEEP_GEMM=0 to disable MoE DeepGEMM and skip the long warmup.