tencent/HunyuanOCR
Tencent Hunyuan end-to-end OCR expert VLM (~1B) for online OCR serving with an OpenAI-compatible API
Overview
HunyuanOCR is a leading end-to-end OCR expert VLM powered by Hunyuan's native multimodal architecture. This recipe covers online serving with the OpenAI-compatible API.
Prerequisites
- vLLM version: latest stable
- Hardware: single GPU (1B model)
Install vLLM
```shell
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
Launching the Server
```shell
vllm serve tencent/HunyuanOCR \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0
```
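Once the server logs report that startup is complete, you can verify it is reachable before wiring up a client (this assumes the default port 8000; adjust if you passed `--port`):

```shell
# List the served models; the response should include "tencent/HunyuanOCR".
curl http://localhost:8000/v1/models
```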
Configuration Tips
- Use greedy sampling (`temperature=0.0`) or a low temperature for optimal OCR accuracy.
- OCR tasks generally do not benefit from prefix caching or image reuse; disabling them (as above) removes hashing/caching overhead.
- Adjust `--max-num-batched-tokens` for throughput based on your hardware.
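Putting these tips together, a tuned launch might look like the following (the token budget of 16384 is an illustrative value to adapt to your GPU, not a recommendation from the model authors):

```shell
vllm serve tencent/HunyuanOCR \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0 \
  --max-num-batched-tokens 16384
```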
Client Usage
```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)

messages = [
    {"role": "system", "content": ""},
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/chat-ui/tools-dark.png"
                },
            },
            {
                "type": "text",
                "text": (
                    "Extract all information from the main body of the document image "
                    "and represent it in markdown format, ignoring headers and footers. "
                    "Tables should be expressed in HTML format, formulas in the document "
                    "should be represented using LaTeX format, and the parsing should be "
                    "organized according to the reading order."
                ),
            },
        ],
    },
]

response = client.chat.completions.create(
    model="tencent/HunyuanOCR",
    messages=messages,
    temperature=0.0,
    extra_body={"top_k": 1, "repetition_penalty": 1.0},
)
print(response.choices[0].message.content)
```
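Local document scans can be sent inline as base64 data URLs instead of remote links. A minimal helper is sketched below; the function name and the MIME-type handling are illustrative, not part of vLLM or the model:

```python
import base64
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data: URL usable in an image_url part."""
    suffix = Path(path).suffix.lstrip(".").lower() or "png"
    mime = {"jpg": "jpeg"}.get(suffix, suffix)  # .jpg files use the image/jpeg MIME type
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:image/{mime};base64,{encoded}"
```

The returned string drops into the same message structure as above, e.g. `{"type": "image_url", "image_url": {"url": image_to_data_url("scan.png")}}`.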
Troubleshooting
- Accuracy: Use `temperature=0.0` and `top_k=1` for deterministic OCR output.
- Application-oriented prompts: See the official model card for prompts tuned to various document parsing tasks.