DeepSeek

deepseek-ai/DeepSeek-V3

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts model with native FP8 weights and strong reasoning, coding, and math capabilities.

MoE · 671B total / 37B active · 163,840 context · vLLM 0.12.0+ · text

Overview

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts model (37B activated per token) shipped with native FP8 weights. It shares its architecture with DeepSeek-R1, so the same launch recipes apply to both models. For Blackwell GPUs, NVIDIA publishes an FP4 quantized variant (nvidia/DeepSeek-V3-FP4 / nvidia/DeepSeek-R1-FP4) that runs on fewer GPUs.
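A back-of-envelope weight-memory estimate shows why these GPU counts are in the right range. This is only a sketch: it counts weights alone (no KV cache, activations, or CUDA graph overhead) and assumes nominal HBM sizes of 141 GB per H200 and 192 GB per B200.

```python
# Rough weight memory for DeepSeek-V3 (671B parameters), weights only.
PARAMS = 671e9

fp8_weights_gb = PARAMS * 1.0 / 1e9  # 1 byte/param in FP8  -> ~671 GB
fp4_weights_gb = PARAMS * 0.5 / 1e9  # 0.5 byte/param in FP4 -> ~336 GB

h200_pool_gb = 8 * 141  # 1128 GB aggregate HBM across 8x H200
b200_pool_gb = 4 * 192  # 768 GB aggregate HBM across 4x B200

print(f"FP8 weights ~{fp8_weights_gb:.0f} GB vs {h200_pool_gb} GB on 8x H200")
print(f"FP4 weights ~{fp4_weights_gb:.0f} GB vs {b200_pool_gb} GB on 4x B200")
```

The headroom left after weights is what holds the KV cache, which is why the FP8 checkpoint needs the larger 8-GPU pool.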

Prerequisites

  • Hardware (FP8): 8x H200 GPUs (verified)
  • Hardware (FP4): 4x B200 GPUs
  • vLLM: install the latest release into a fresh virtual environment:

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Launching the Server

8xH200 (FP8)

Tensor Parallel + Expert Parallel (TP8+EP):

vllm serve deepseek-ai/DeepSeek-V3 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --enable-expert-parallel

Data Parallel + Expert Parallel (DP8+EP):

vllm serve deepseek-ai/DeepSeek-V3 \
  --trust-remote-code \
  --data-parallel-size 8 \
  --enable-expert-parallel
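Either launch exposes vLLM's OpenAI-compatible API, by default on port 8000. A minimal client sketch using only the standard library (assuming the server is reachable at localhost:8000):

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "deepseek-ai/DeepSeek-V3") -> urllib.request.Request:
    """Build a request for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Write a haiku about Mixture-of-Experts models.")
# Send it once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `http://localhost:8000/v1`) works the same way.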

4xB200 (FP4)

Enable the FlashInfer MoE kernel matching your checkpoint's quantization before launching (set only the relevant variable):

# For FP4 (recommended on Blackwell)
export VLLM_USE_FLASHINFER_MOE_FP4=1
# For FP8 on Blackwell
export VLLM_USE_FLASHINFER_MOE_FP8=1

Tensor Parallel + Expert Parallel (TP4+EP):

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-V3-FP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-expert-parallel

Data Parallel + Expert Parallel (DP4+EP):

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-V3-FP4 \
  --trust-remote-code \
  --data-parallel-size 4 \
  --enable-expert-parallel

Benchmarking

For benchmarking, add --no-enable-prefix-caching to the server command to disable prefix caching; otherwise repeated random prompts can be served from the cache and inflate throughput numbers.

# Prompt-heavy benchmark (8k/1k)
vllm bench serve \
  --model deepseek-ai/DeepSeek-V3 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos

Test different workloads by adjusting input/output lengths:

  • Prompt-heavy: 8000 input / 1000 output
  • Decode-heavy: 1000 input / 8000 output
  • Balanced: 1000 input / 1000 output
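The three workload shapes can be swept programmatically. A sketch that generates one `vllm bench serve` command per shape, mirroring the flags in the example above (run each with `subprocess.run`):

```python
# Benchmark command generator for the three workload shapes above.
WORKLOADS = {
    "prompt-heavy": (8000, 1000),
    "decode-heavy": (1000, 8000),
    "balanced": (1000, 1000),
}

def bench_command(in_len: int, out_len: int) -> list[str]:
    """Argument list for one vllm bench serve run (pass to subprocess.run)."""
    return [
        "vllm", "bench", "serve",
        "--model", "deepseek-ai/DeepSeek-V3",
        "--dataset-name", "random",
        "--random-input-len", str(in_len),
        "--random-output-len", str(out_len),
        "--request-rate", "10000",
        "--num-prompts", "16",
        "--ignore-eos",
    ]

for name, (in_len, out_len) in WORKLOADS.items():
    print(name + ":", " ".join(bench_command(in_len, out_len)))
```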
