microsoft/Phi-4-mini-instruct
Microsoft's Phi-4 family of lightweight dense models (mini-instruct, reasoning, multimodal) with 128K context
Overview
The Phi-4 family includes several lightweight, open models from Microsoft. These models process text (and, in the multimodal variant, inputs such as images) to generate text outputs, and support a 128K-token context length.
Prerequisites
- Hardware: 1x GPU with >=16 GB VRAM (A100, L40S, H100, etc.)
- vLLM >= 0.7.0
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
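The prerequisites above call for vLLM >= 0.7.0. A minimal sketch of checking that floor against an installed version string (the helper name is illustrative, not part of vLLM):

```python
import importlib.metadata

def meets_minimum(version: str, minimum: tuple[int, int] = (0, 7)) -> bool:
    """True if a dotted version string satisfies a (major, minor) floor."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= minimum

# Inside the venv created above:
# print(meets_minimum(importlib.metadata.version("vllm")))
print(meets_minimum("0.7.0"))  # True
```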
Launch commands
Phi-4-mini-instruct on a single GPU:
vllm serve microsoft/Phi-4-mini-instruct \
--host 0.0.0.0 \
--max-model-len 4000
Phi-4-multimodal-instruct (requires --trust-remote-code for LoRA modules):
vllm serve microsoft/Phi-4-multimodal-instruct \
--host 0.0.0.0 \
--max-model-len 4000 \
--trust-remote-code
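Once a server is up, vLLM's OpenAI-compatible GET /v1/models endpoint lists what is being served, which is a quick way to confirm the launch worked. A sketch of parsing that response (the helper name and sample payload are illustrative; the endpoint itself is standard OpenAI-compatible API):

```python
import json
import urllib.request

def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style GET /v1/models response body."""
    return [model["id"] for model in payload.get("data", [])]

# Against a running server from the launch commands above:
# with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=10) as resp:
#     print(served_model_ids(json.load(resp)))

# Shape of the response body, for reference:
sample = {"object": "list", "data": [{"id": "microsoft/Phi-4-mini-instruct", "object": "model"}]}
print(served_model_ids(sample))  # ['microsoft/Phi-4-mini-instruct']
```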
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)
response = client.chat.completions.create(
model="microsoft/Phi-4-mini-instruct",
messages=[{"role": "user", "content": "Write a short story"}],
temperature=0.0,
)
print(response.choices[0].message.content)
Multimodal (requires Phi-4-multimodal-instruct):
response = client.chat.completions.create(
model="microsoft/Phi-4-multimodal-instruct",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
],
}],
)
print(response.choices[0].message.content)
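A public URL is not the only way to pass an image: the OpenAI-style chat format also accepts base64 data URLs, which is useful for local files. A sketch of building such a message (the helper name and default MIME type are illustrative):

```python
import base64

def image_message(text: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a multimodal user message embedding the image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# e.g., with a local file:
# with open("photo.jpg", "rb") as f:
#     messages = [image_message("Describe this image.", f.read())]
```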
Available Phi-4 Variants
- microsoft/Phi-4-mini-instruct — conversational instruction-tuned
- microsoft/Phi-4-mini-reasoning — optimized for reasoning
- microsoft/Phi-4-reasoning — advanced reasoning
- microsoft/Phi-4-multimodal-instruct — multimodal (text + image)
Benchmarking
vllm bench serve \
--model microsoft/Phi-4-mini-instruct \
--dataset-name random \
--random-input-len 2000 \
--random-output-len 512 \
--num-prompts 100
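As a rough cross-check on the numbers the benchmark reports: with --num-prompts 100 and --random-output-len 512, aggregate output throughput is just total output tokens over wall-clock time (the helper below is illustrative arithmetic, not part of vLLM):

```python
def output_throughput(num_prompts: int, output_len: int, elapsed_s: float) -> float:
    """Aggregate output tokens per second across all prompts."""
    return num_prompts * output_len / elapsed_s

# e.g., 100 prompts x 512 output tokens completed in 80 s:
print(output_throughput(100, 512, 80.0))  # 640.0
```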
Troubleshooting
- Multimodal variant fails to load: add --trust-remote-code to the serve command.