nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16
NVIDIA Nemotron-Nano 12B vision-language model with video support and Efficient Video Sampling (EVS)
Overview
Nemotron-Nano-12B-v2-VL is a vision-language model with image and video support. It includes Efficient Video Sampling (EVS) to prune video tokens and reduce compute. The model is available in BF16, FP8, and NVFP4 (QAD) precisions.
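The effect of EVS can be estimated with simple arithmetic: a pruning rate of 0.75 keeps roughly a quarter of the video tokens. A minimal sketch of that budget (the per-frame token count here is an illustrative assumption, not the model's actual visual tokenizer budget):

```python
def video_tokens(num_frames: int, tokens_per_frame: int, pruning_rate: float) -> int:
    """Approximate video tokens kept after EVS prunes a fraction of them."""
    total = num_frames * tokens_per_frame
    return round(total * (1.0 - pruning_rate))

# tokens_per_frame=256 is an assumed value for illustration only
full = video_tokens(128, 256, 0.0)     # no pruning
pruned = video_tokens(128, 256, 0.75)  # the pruning rate used in this guide
print(full, pruned)  # 32768 8192
```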
Prerequisites
- Hardware: 1x GPU (A100/H100/B200, etc.)
- vLLM: release 0.11.0 does NOT include this model; use the latest nightly build or install from source
- DGX Spark: use nvcr.io/nvidia/vllm:25.12.post1-py3
Install vLLM
docker pull vllm/vllm-openai:nightly-8bff831f0aa239006f34b721e63e1340e3472067
# or for DGX Spark:
docker pull nvcr.io/nvidia/vllm:25.12.post1-py3
Launch command
export VLLM_VIDEO_LOADER_BACKEND=opencv
export CHECKPOINT_PATH="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
export CUDA_VISIBLE_DEVICES=0
python3 -m vllm.entrypoints.openai.api_server \
--model ${CHECKPOINT_PATH} \
--trust-remote-code \
--media-io-kwargs '{"video": {"fps": 2, "num_frames": 128}}' \
--max-model-len 131072 \
--data-parallel-size 1 \
--port 5566 \
--allowed-local-media-path / \
--video-pruning-rate 0.75 \
--served-model-name "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
Flags:
- --max-model-len: reduce for shorter contexts to save memory
- --allowed-local-media-path <root>: limit local-file access
- --video-pruning-rate <0..1>: EVS compression; higher values prune more video tokens
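The --media-io-kwargs setting above samples frames at 2 fps up to a cap of 128 frames, so clips longer than about 64 seconds hit the cap. A sketch of that sampling budget (the min-of-the-two-limits interaction is an assumption about how the caps combine):

```python
def sampled_frames(duration_s: float, fps: float = 2.0, max_frames: int = 128) -> int:
    """Frames selected under an fps target with a hard frame cap (assumed semantics)."""
    return min(int(duration_s * fps), max_frames)

print(sampled_frames(30))   # 30 s clip: 60 frames at 2 fps
print(sampled_frames(300))  # 5 min clip: capped at 128 frames
```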
Client Usage
Describe a video:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:5566/v1", api_key="<ignored>")
completion = client.chat.completions.create(
model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Describe the video."},
{"type": "video_url", "video_url": {"url": "file:///path/to/video.mp4"}},
],
}],
)
print(completion.choices[0].message.content)
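The model also accepts images through the same content-list format. A sketch of an image message, assuming the server honors the standard OpenAI-compatible image_url schema (the local path is a placeholder):

```python
# Image message for the same OpenAI-compatible chat API (path is a placeholder).
image_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the image."},
        {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
    ],
}
# Pass it to the running server as:
#   client.chat.completions.create(model=..., messages=[image_message])
```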
Offline / LLM API
from vllm import LLM, SamplingParams
llm = LLM(
"nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
trust_remote_code=True,
max_model_len=2**17,
allowed_local_media_path="/",
video_pruning_rate=0.75,
media_io_kwargs=dict(video=dict(fps=2, num_frames=128)),
)
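Generation with the constructed LLM goes through llm.chat, which takes the same OpenAI-style message list as the server example. A sketch (the actual call is wrapped in a helper since it needs the model weights; the video path is a placeholder):

```python
# OpenAI-style messages, identical in shape to the server example above.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize the video."},
        {"type": "video_url", "video_url": {"url": "file:///path/to/video.mp4"}},
    ],
}]

def run(llm):
    # llm.chat returns a list of RequestOutput objects; take the first completion.
    outputs = llm.chat(messages)
    return outputs[0].outputs[0].text
```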
Troubleshooting
- Set VLLM_VIDEO_LOADER_BACKEND=opencv (required for video inputs).
- OOM: lower --max-model-len or increase --video-pruning-rate.