vLLM Recipes / NVIDIA

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

NVIDIA Nemotron-3-Nano Mamba-hybrid MoE (30B total / ~3B active) with BF16 and FP8 variants

moe · 30B total / 3B active · 262,144 ctx · vLLM 0.11.2+ · text

Overview

NVIDIA Nemotron-3-Nano-30B-A3B is a hybrid-Mamba MoE model (30B total, ~3B active) with FP8 and BF16 variants. It supports DGX Spark and Jetson Thor in addition to standard Hopper/Blackwell servers.

Prerequisites

  • Hardware: 1x H100/H200 or comparable; DGX Spark and Jetson Thor supported
  • vLLM >= 0.11.2 (0.12.0 recommended for full support)
  • Docker with NVIDIA Container Toolkit (recommended)
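Before pulling the serving image, it is worth confirming that Docker can actually reach the GPU through the NVIDIA Container Toolkit. A quick sanity check (any CUDA-capable base image works; `ubuntu` is the example used in NVIDIA's toolkit docs):

```shell
# Should print the familiar nvidia-smi table listing your GPUs.
# If this fails, fix the Container Toolkit install before proceeding.
docker run --rm --gpus all ubuntu nvidia-smi
```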

Pull Docker Image

docker pull --platform linux/amd64 vllm/vllm-openai:v0.12.0
docker tag vllm/vllm-openai:v0.12.0 vllm/vllm-openai:deploy

DGX Spark users can build from source (see README) or use the NGC image:

docker pull nvcr.io/nvidia/vllm:25.12.post1-py3

Jetson Thor:

docker pull ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
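On a standard Hopper/Blackwell server, the tagged image can then be launched as a container that serves the model directly. A sketch, assuming the default port and a host-side HuggingFace cache mount (adjust paths and flags to your environment):

```shell
# Serve the FP8 checkpoint from the image tagged above.
# -v mounts the HF cache so weights are not re-downloaded on restart;
# --ipc=host is the usual setting for vLLM's shared-memory use.
docker run --rm --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:deploy \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --trust-remote-code
```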

Launch Commands

FP8 with FlashInfer MoE backend (Blackwell/Hopper):

export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_FLASHINFER_MOE_BACKEND=throughput

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --trust-remote-code \
  --async-scheduling \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1
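Once the server is up, a quick smoke test against the OpenAI-compatible endpoint confirms it is answering (the prompt and token limit here are arbitrary):

```shell
# Minimal chat-completions request against the local vLLM server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```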

BF16 (with reasoning + tool parsers — typical for Spark/Thor):

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3
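With `--enable-auto-tool-choice` and the parsers above, the server accepts OpenAI-style tool definitions. A minimal sketch of a tool-calling request body; the `get_weather` schema is purely illustrative, not part of the model or recipe:

```python
import json

# Hypothetical tool schema -- any OpenAI-style function definition
# is passed the same way; the server decides when to call it because
# tool_choice is "auto".
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# This JSON string is what you would POST to /v1/chat/completions.
body = json.dumps(payload)
```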

Key flags:

  • --kv-cache-dtype: fp8 for the FP8 variant, auto for BF16
  • --async-scheduling reduces host overhead between decode steps
  • --mamba-ssm-cache-dtype: float32 for best accuracy, float16 for speed
  • Cap --max-num-seqs to match client concurrency for lower per-user latency
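Putting the flags above together for the FP8 checkpoint might look like the following sketch; the --max-num-seqs value of 8 is an assumption for a small number of concurrent clients, not a recommendation:

```shell
# Accuracy-leaning FP8 launch: FP8 KV cache plus float32 Mamba SSM cache,
# with async scheduling and a concurrency cap matched to expected clients.
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --trust-remote-code \
  --async-scheduling \
  --kv-cache-dtype fp8 \
  --mamba-ssm-cache-dtype float32 \
  --max-num-seqs 8
```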

Benchmarking

vllm bench serve \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --trust-remote-code \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 \
  --num-warmups 20 \
  --ignore-eos \
  --max-concurrency 1024 \
  --num-prompts 2048
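To put the benchmark's scale in perspective, the token counts implied by the parameters above are easy to work out: every prompt carries 1024 random input tokens and, because of --ignore-eos, is forced to generate exactly 1024 output tokens.

```python
# Back-of-envelope token totals for the benchmark configuration above.
num_prompts = 2048
input_len = 1024    # --random-input-len
output_len = 1024   # --random-output-len, forced by --ignore-eos

prefill_tokens = num_prompts * input_len    # tokens the server must prefill
decode_tokens = num_prompts * output_len    # tokens the server must generate

print(prefill_tokens, decode_tokens)  # 2097152 2097152
```

So the run prefills and decodes roughly 2M tokens each; dividing decode_tokens by the reported wall-clock time gives an aggregate generation throughput to compare across configurations.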

Troubleshooting

  • Use --kv-cache-dtype fp8 only with the FP8 checkpoint.
  • Balance TP and --max-num-seqs for throughput vs. per-user latency.
