Hiro-Smart-Doc

Layout analysis and OCR API service for documents, PDFs, and images.

English | 简体中文

Hiro-Smart-Doc turns a document, PDF, or image into structured, reading-ordered content. It runs an RT-DETR layout model to detect regions (text, tables, equations, figures, chemical structures, …), sorts them into human reading order (including multi-column pages), and runs OCR on each region to recover text, HTML tables, and LaTeX formulas — optionally assembled into Markdown.

Features

Layout analysis — RT-DETR ONNX model detecting 25 region categories, with multi-column reading-order sorting and duplicate-box filtering.
Region OCR — text, tables (HTML), and formulas (LaTeX) via the MOSS-OCR model served behind an OpenAI-compatible (vLLM) endpoint.
Multiple inputs — single images, multi-page PDFs (rendered and processed per page with bounded concurrency), and Office documents (doc/docx/ppt/pptx/xls/xlsx/odt/odp/ods/rtf) via optional LibreOffice conversion.
Streaming API — results stream back per region/page; an optional markdown=true flag returns a single concatenated Markdown document.
FastAPI service — Swagger UI at /docs, plus an optional standalone Gradio UI for interactive exploration.
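
This README does not describe how the duplicate-box filtering works. One common approach is greedy suppression by intersection-over-union (IoU). The sketch below is only an illustration of that idea: the `(x1, y1, x2, y2, score)` box layout, the function names, and the 0.8 threshold are all assumptions, not taken from the project.

```python
from typing import List, Tuple

# Assumed box layout: x1, y1, x2, y2, confidence score.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_duplicates(boxes: List[Box], threshold: float = 0.8) -> List[Box]:
    """Keep the highest-scoring box of each heavily overlapping group."""
    kept: List[Box] = []
    # Visit boxes from most to least confident; keep a box only if it
    # does not overlap an already kept box by `threshold` or more.
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept
```

The output is ordered by descending score, so a real pipeline would apply reading-order sorting afterwards.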

Architecture

  PDF / image / Office doc
          │
          ▼
┌─────────────────────────────────────────────────┐
│  FastAPI service (hiro_smart_doc.base_app)
│
│  1. (Office) LibreOffice → PDF
│  2. Render page → image
│  3. Layout model (RT-DETR ONNX)  ──►  Hiro-Layout
│  4. Reading-order sort + filter
│  5. Region OCR  ──►  MOSS-OCR
│  6. Stream regions / assemble Markdown
└─────────────────────────────────────────────────┘
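
Step 4 of the diagram can be pictured with a deliberately simple column-aware sort. This is a sketch under stated assumptions, not the project's algorithm: regions are assumed to be `(x1, y1, x2, y2)` tuples, `reading_order` is a hypothetical name, and full-width regions such as titles that span several columns are not handled.

```python
from typing import List, Tuple

# Assumed region layout: x1, y1, x2, y2 in page pixels.
Region = Tuple[float, float, float, float]


def reading_order(regions: List[Region]) -> List[Region]:
    """Order regions column by column (left to right), top to bottom within each column."""
    columns: List[List[Region]] = []
    for region in sorted(regions, key=lambda r: r[0]):
        center_x = (region[0] + region[2]) / 2
        for column in columns:
            left = min(r[0] for r in column)
            right = max(r[2] for r in column)
            # A region joins a column when its horizontal center
            # falls inside that column's current x-extent.
            if left <= center_x <= right:
                column.append(region)
                break
        else:
            columns.append([region])
    columns.sort(key=lambda col: min(r[0] for r in col))
    return [r for col in columns for r in sorted(col, key=lambda r: r[1])]
```

On a two-column page this reads the whole left column before the right one, which is the behaviour a plain top-to-bottom sort gets wrong.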

The two models are decoupled from this repo:

Component	What it does	Where it lives
Layout	RT-DETR ONNX region detector	🤗 PatSnap/Hiro-Layout
OCR (MOSS)	Text / table / formula recognition	Hiro-MOSS-OCR

The layout ONNX weights are not bundled in this repository; download them from Hugging Face (see below). The OCR model runs as a separate OpenAI-compatible service that this app calls over HTTP.

Prerequisites

Python 3.12+
uv for environment and dependency management
A running MOSS-OCR endpoint (OpenAI-compatible / vLLM). See Hiro-MOSS-OCR for how to serve the model. Run it as a separate service from Hiro-Smart-Doc; for example, if Smart-Doc listens on 8000, serve MOSS-OCR on 8088 and set MOSS_VLLM_OCR_API=http://127.0.0.1:8088/v1.
(Optional) An external LibreOffice unoserver, only if you need to parse Office documents. Disabled by default; see (Optional) Office document conversion.

Installation

Install uv, then:

git clone https://github.com/patsnap/Hiro-Smart-Doc.git
cd Hiro-Smart-Doc

# Create the virtual environment and install dependencies (CPU onnxruntime)
uv sync

# For GPU inference instead of CPU:
uv sync --extra gpu

# Optional extras: Office-document conversion client, model download helper
uv sync --extra docconvert --extra download

The docconvert extra installs the unoconvert client (to talk to a unoserver). The unoserver server must be deployed separately on LibreOffice's own Python — see (Optional) Office document conversion.

Download the layout model

The RT-DETR layout weights are hosted on Hugging Face at PatSnap/Hiro-Layout. Fetch them into ./layout_model/:

# Using the helper script (requires the `download` extra)
uv run python scripts/download_models.py --models 25

# …or manually with the huggingface CLI
uv run huggingface-cli download PatSnap/Hiro-Layout RT-DETR_25.onnx \
    --local-dir ./layout_model

Files must follow the RT-DETR_<MODEL_ID>.onnx naming pattern (e.g. RT-DETR_25.onnx), matching MODEL_LIST / MODEL_ID in your environment.
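
A quick way to confirm the naming pattern before starting the service is to derive the expected file names from `MODEL_LIST` and check that they exist. The helper names below are hypothetical and the service may perform its own, different validation at startup.

```python
from pathlib import Path
from typing import List


def expected_model_files(model_list: str, model_dir: str = "./layout_model") -> List[Path]:
    """Map a comma-separated MODEL_LIST value to RT-DETR_<id>.onnx paths."""
    ids = [item.strip() for item in model_list.split(",") if item.strip()]
    return [Path(model_dir) / f"RT-DETR_{model_id}.onnx" for model_id in ids]


def missing_model_files(model_list: str, model_dir: str = "./layout_model") -> List[Path]:
    """Return the expected weight files that are not present on disk."""
    return [p for p in expected_model_files(model_list, model_dir) if not p.is_file()]
```

For example, `missing_model_files("25")` returns an empty list once `./layout_model/RT-DETR_25.onnx` has been downloaded.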

Configuration

Copy the example environment file and adjust it:

cp .env.example .env

Key settings:

Variable	Description	Default
`RD_INTERNAL_PORT`	Port the API listens on	`8000`
`RD_API_PATH`	Base path to mount the API under (empty = `/`)	empty
`MODEL_LIST`	Comma-separated layout model ids to load	`25`
`MODEL_ID`	Default layout model id	`25`
`LAYOUT_MODEL_DIR`	Directory holding `RT-DETR_<id>.onnx`	`./layout_model`
`RUNTIME_BACKEND`	Inference backend	`ONNX`
`MOSS_VLLM_OCR_API`	OpenAI-compatible MOSS-OCR endpoint (`.../v1`)	`http://127.0.0.1:8088/v1`
`MOSS_VLLM_OCR_API_KEY`	API key for the OCR endpoint	`EMPTY`
`MOSS_VLLM_MODEL`	OCR model name served by the endpoint	`moss-v1d6-0.3b`
`PDF_RENDER_DPI`	DPI used when rendering PDF pages to images	`150`
`DOCUMENT_CONVERT_ENABLED`	Enable Office→PDF conversion (needs an external LibreOffice unoserver, see below)	`false`
`UNOSERVER_ENDPOINTS`	unoserver address(es), `host:port`, comma-separated	`127.0.0.1:2003`
`DOCUMENT_CONVERT_TIMEOUT`	Per-conversion timeout (seconds)	`60`
`DOCUMENT_CONVERT_MAX_BYTES`	Max upload size in bytes	`52428800`
`DOCUMENT_CONVERT_MAX_CONCURRENCY`	Max concurrent conversions	number of endpoints
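
To make the defaults concrete, the following sketch reads a subset of these variables with the documented fallbacks. The `Settings` class and `load_settings` function are hypothetical, and the boolean parsing (only a case-insensitive `true` enables conversion) is an assumption; the service's own loader may differ.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    port: int
    model_list: List[str]
    ocr_api: str
    pdf_render_dpi: int
    convert_enabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the variables above, falling back to the documented defaults."""
    return Settings(
        port=int(env.get("RD_INTERNAL_PORT", "8000")),
        model_list=[m.strip() for m in env.get("MODEL_LIST", "25").split(",") if m.strip()],
        ocr_api=env.get("MOSS_VLLM_OCR_API", "http://127.0.0.1:8088/v1"),
        pdf_render_dpi=int(env.get("PDF_RENDER_DPI", "150")),
        # Assumption: only the literal string "true" switches conversion on.
        convert_enabled=env.get("DOCUMENT_CONVERT_ENABLED", "false").strip().lower() == "true",
    )
```

Passing a plain dictionary instead of `os.environ` makes it easy to check a configuration without touching the real environment.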

(Optional) Office document conversion

The /document/* endpoints and the Gradio document upload first convert doc/docx/ppt/pptx/xls/xlsx/odt/odp/ods/rtf to PDF. This step is disabled by default and is not built into the service — it relies on an external unoserver (backed by LibreOffice). Skip this section if you only process PDFs and images.

Why external: high-fidelity conversion of these formats realistically requires LibreOffice. Its uno Python bindings (pyuno) are tied to the Python that ships with LibreOffice and cannot be installed into this project's uv virtual environment (Python 3.12). The unoserver server must therefore be started with the Python that LibreOffice uses (its bundled interpreter or the system Python), while this service only connects to it as a client over a socket.

1. Install LibreOffice and unoserver:

# Debian/Ubuntu: LibreOffice ships pyuno
sudo apt-get install -y libreoffice

# Server: install unoserver on the system Python that LibreOffice uses
# (NOT this project's venv); `python3 -c "import uno"` must succeed with that Python
sudo python3 -m pip install unoserver

# Client: install the unoconvert client into this project's venv
uv sync --extra docconvert

2. Start unoserver (long-running, listening on localhost):

# Start with the system Python that can import uno; port matches UNOSERVER_ENDPOINTS
python3 -m unoserver.server --interface 127.0.0.1 --port 2003 &

3. Enable it and point the service at it (.env):

DOCUMENT_CONVERT_ENABLED=true
UNOSERVER_ENDPOINTS=127.0.0.1:2003
# Scale out: run several unoserver ports, comma-separated
# UNOSERVER_ENDPOINTS=127.0.0.1:2003,127.0.0.1:2004

When disabled, the /document/* endpoints return 503 and the Gradio document upload reports that conversion is unavailable; PDF and image pipelines are unaffected.
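
The `host:port`, comma-separated format of `UNOSERVER_ENDPOINTS` can be checked with a small parser. This is a minimal sketch with a hypothetical function name; it is not the service's own parsing code.

```python
from typing import List, Tuple


def parse_endpoints(value: str) -> List[Tuple[str, int]]:
    """Split 'host:port,host:port' into (host, port) pairs."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the last colon so the port is always the final component.
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {item!r}")
        endpoints.append((host, int(port)))
    return endpoints
```

Since `DOCUMENT_CONVERT_MAX_CONCURRENCY` defaults to the number of endpoints, `len(parse_endpoints(...))` gives the default concurrency for a given setting.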

Running

Load the environment, then start the service. uv run automatically uses the project virtual environment.

# export the variables from your .env (or use your preferred loader)
set -a && . ./.env && set +a

# Development (single process, auto-reload via uvicorn)
uv run uvicorn hiro_smart_doc.base_app:app --host 0.0.0.0 --port 8000 --reload

# Production (gunicorn + uvicorn workers)
uv run gunicorn --config gunicorn.conf.py hiro_smart_doc.base_app:app

Open the Swagger UI at http://127.0.0.1:8000/docs.

If the logs show POST /v1/chat/completions ... 404 Not Found, double-check MOSS_VLLM_OCR_API: it must point at the MOSS-OCR/vLLM server (not the Hiro-Smart-Doc API) and include the /v1 suffix.
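
The two mistakes described above can be caught before startup with a small check. The function below is a hypothetical helper, not part of the project, and it assumes Hiro-Smart-Doc listens on port 8000 on the local machine unless told otherwise.

```python
from urllib.parse import urlsplit


def check_ocr_api(ocr_api: str, service_port: int = 8000) -> list:
    """Return human-readable problems with a MOSS_VLLM_OCR_API value (empty if none found)."""
    problems = []
    parts = urlsplit(ocr_api)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        problems.append("not an http(s) URL")
    if not parts.path.rstrip("/").endswith("/v1"):
        problems.append("missing the /v1 suffix")
    # Heuristic: a local URL on the service's own port is almost certainly a mix-up.
    if parts.hostname in ("127.0.0.1", "localhost") and parts.port == service_port:
        problems.append("points at the Hiro-Smart-Doc port, not the OCR server")
    return problems
```

An empty list only means the URL looks plausible; it does not prove that an OCR server is actually reachable there.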

Gradio UI (optional)

An optional standalone interactive UI is available for exploring the pipeline in a browser. It is not required to run the API service:

uv run python -m hiro_smart_doc.gradio_ui
# then open http://127.0.0.1:7860

API

The POST endpoints accept multipart/form-data uploads. The smart-doc endpoints stream newline-delimited JSON results; /document/convert-pdf returns the converted PDF.

Method	Path	Description
POST	`/image/smart-doc`	Full pipeline on a single image
POST	`/pdf/smart-doc`	Full pipeline on a PDF (per-page, concurrent)
POST	`/document/smart-doc`	Convert an Office doc to PDF, then run pipeline
POST	`/document/convert-pdf`	Convert an Office doc to PDF (returns the PDF)
GET	`/health`	Health check

Common form fields: filter_options (which categories to return), ocr_filter_options (which categories to OCR), and markdown (append a single assembled Markdown document). See /docs for the full schema.
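
When calling the API from Python, the form fields can be assembled as strings. The sketch below mirrors the shape used in this README's curl example (a JSON object mapping category names to `true`). The helper name is hypothetical, and the full list of category names should be taken from /docs rather than from this example.

```python
import json
from typing import Dict, Iterable


def build_form_fields(ocr_categories: Iterable[str], markdown: bool = False) -> Dict[str, str]:
    """Build multipart form fields: a JSON-encoded OCR category filter plus the markdown flag."""
    fields = {"ocr_filter_options": json.dumps({name: True for name in ocr_categories})}
    if markdown:
        fields["markdown"] = "true"
    return fields
```

The resulting dictionary can be passed as the non-file part of a multipart request by any HTTP client.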

Example

curl -X POST http://127.0.0.1:8000/pdf/smart-doc \
  -F "pdf=@paper.pdf" \
  -F 'ocr_filter_options={"main_text":true,"table":true,"equation":true}' \
  -F "markdown=true"
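
On the client side, the streamed response can be consumed line by line. The sketch below only shows the newline-delimited JSON decoding step and assumes each non-empty line is one JSON object; the field names inside each object are not documented here, so check /docs for the actual schema.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_ndjson(lines: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Decode a stream of newline-delimited JSON lines, skipping blank lines."""
    for raw in lines:
        text = raw.decode("utf-8").strip()
        if text:
            yield json.loads(text)
```

With the third-party `requests` library, one option is to send the upload with `requests.post(url, files={"pdf": fh}, data=fields, stream=True)` and pass `response.iter_lines()` to `iter_ndjson`, so each result is handled as soon as it arrives.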

Security

This service ships without authentication, and the local image store is served publicly under /static. Before exposing it on an untrusted network, put it behind a gateway or reverse proxy that enforces authentication, TLS, and rate limiting.

Related projects

Hiro-MOSS-OCR — the OCR model used here
Hiro-Layout — the layout detection model

License

Released under the Apache License 2.0. See NOTICE for attribution and trademark information.

Hiro-Smart-Doc, Patsnap, and any associated names, logos, product names, service names, designs, and slogans are trademarks or registered trademarks of Patsnap or its affiliates. No trademark license is granted under the open source license or any model license unless expressly stated.
