Verbatim RAG

Chill, I Ground! 🌶️

A minimalistic approach to Retrieval-Augmented Generation (RAG) that prevents hallucination by ensuring all generated content is explicitly derived from source documents.

Concept

Traditional RAG systems retrieve relevant documents and then allow an LLM to freely generate responses based on that context. This can lead to hallucinations where the model invents facts not present in the source material.

Verbatim RAG solves this by extracting verbatim text spans from documents and composing responses entirely from these exact passages, with direct citations linking back to sources.
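The core guarantee can be illustrated with a minimal sketch. It is not the package's implementation: the `Span` class and the `is_verbatim` and `compose_answer` helpers are hypothetical names introduced here, and the sketch only shows the idea that a passage enters the answer if and only if it occurs character-for-character in its source document.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical container for one extracted passage."""
    doc_title: str  # which source the span claims to come from
    text: str       # the passage itself


def is_verbatim(span: Span, sources: dict[str, str]) -> bool:
    """True only if the span text occurs character-for-character in its source."""
    return span.text in sources.get(span.doc_title, "")


def compose_answer(spans: list[Span], sources: dict[str, str]) -> str:
    """Build an answer only from spans that pass the verbatim check."""
    lines = []
    for span in spans:
        if is_verbatim(span, sources):
            # Number the citation by its position among the accepted spans.
            lines.append(f'[{len(lines) + 1}] "{span.text}" ({span.doc_title})')
    return "\n".join(lines)


sources = {
    "Paper A": "The study found that X leads to Y. Further work is needed.",
    "Paper B": "Results show Z is significant.",
}
spans = [
    Span("Paper A", "The study found that X leads to Y."),
    Span("Paper B", "Z is highly significant."),  # paraphrase, not in the source: dropped
]
answer = compose_answer(spans, sources)
print(answer)
```

The second span is a paraphrase rather than an exact passage, so it never reaches the answer.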

For extraction, we provide two 150M-parameter ModernBERT token classifiers that beat public extractive baselines (Zilliz Semantic Highlight, Provence) across ACL, RAGBench, Squeez, and QASPER — and outperform LLM-based extractors 100× their size on our ACL-Verbatim benchmark. See the paper and HF collection for details.

With this approach, the whole RAG pipeline can run without any LLM, and with SPLADE embeddings it can run entirely on CPU, which keeps it lightweight and efficient.
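To give a feel for why sparse retrieval is cheap on CPU, here is a toy sketch of scoring with sparse vectors. SPLADE-style encoders assign non-zero weights to only a small part of the vocabulary, and relevance is the dot product over shared terms. The weights below are written by hand for illustration and do not come from a real model; `sparse_dot` and `top_k` are hypothetical helpers, not part of the package or of Milvus.

```python
def sparse_dot(query: dict[str, float], doc: dict[str, float]) -> float:
    """Dot product over the terms that two sparse vectors share."""
    # Iterate over the smaller vector so the cost follows the shorter one.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


def top_k(query: dict[str, float],
          docs: dict[str, dict[str, float]],
          k: int = 2) -> list[tuple[str, float]]:
    """Score every document against the query and return the k best."""
    scored = [(doc_id, sparse_dot(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-written term weights standing in for encoder output.
query_vec = {"verbatim": 1.0, "span": 0.5}
doc_vecs = {
    "a": {"verbatim": 2.0, "span": 1.0},
    "b": {"span": 2.0, "bert": 1.0},
    "c": {"milvus": 3.0},
}
results = top_k(query_vec, doc_vecs, k=2)
print(results)
```

Only the terms present in both vectors contribute, so no dense matrix multiplication is involved.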

Installation

# Install the package
pip install verbatim-rag

For local development:

pip install -e packages/core/
pip install -e .

Lightweight Core

If you only need the reusable verbatim core without the full RAG pipeline (no torch, transformers, or Milvus):

pip install verbatim-core

from verbatim_core import VerbatimTransform

vt = VerbatimTransform()
response = vt.transform(
    question="What is the main finding?",
    context=[
        {"content": "The study found that X leads to Y.", "title": "Paper A"},
        {"content": "Results show Z is significant.", "title": "Paper B"},
    ],
)
print(response.answer)

Dependencies: only openai, pydantic, rapidfuzz, and jinja2.

Quick Start

from verbatim_rag import VerbatimIndex, VerbatimRAG
from verbatim_rag.ingestion import DocumentProcessor
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider

# Process documents with intelligent chunking
processor = DocumentProcessor()

# Process PDFs from URLs
document = processor.process_url(
    url="https://aclanthology.org/2025.bionlp-share.8.pdf",
    title="KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering",
    metadata={"authors": ["Adam Kovacs", "Paul Schmitt", "Gabor Recski"]}
)

# Create embedding provider and vector store
sparse_provider = SpladeProvider(
    model_name="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
    device="cpu"
)
vector_store = LocalMilvusStore(
    db_path="./index.db",
    collection_name="verbatim_rag",
    enable_dense=False,
    enable_sparse=True,
)

# Create index with providers
index = VerbatimIndex(
    vector_store=vector_store,
    sparse_provider=sparse_provider
)
index.add_documents([document])

# Then query the index
rag = VerbatimRAG(index)

response = rag.query("What is the main contribution of the paper?")
print(response.answer)

Environment Setup

Set your OpenAI API key before using the system:

export OPENAI_API_KEY=your_api_key_here

How It Works

1. Document Processing: Documents are processed using docling for format conversion and chonkie for chunking
2. Document Indexing: Documents are indexed using vector embeddings (both dense and sparse)
3. Template Management: Response templates are created and stored for common question types
4. Query Processing:
   - Relevant documents are retrieved
   - Key passages are extracted verbatim using either LLM-based or fine-tuned span extractors
   - Responses are structured using templates
   - Citations link back to source documents

This ensures all responses are grounded in the source material, preventing hallucinations.
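The template step can be pictured with a small sketch. The placeholder names `$facts` and `$sources` and the `fill_template` helper are assumptions made for this illustration; the package's actual template format may differ. The point is that the template contributes only framing text, while every factual line is an extracted span inserted unchanged.

```python
from string import Template


def fill_template(template: str, spans: list[tuple[str, str]]) -> str:
    """Insert numbered verbatim spans into a template and list their sources.

    Each item in `spans` is a (span_text, document_title) pair.
    """
    facts = "\n".join(
        f"[{i}] {text}" for i, (text, _) in enumerate(spans, start=1)
    )
    sources = "\n".join(
        f"[{i}] {title}" for i, (_, title) in enumerate(spans, start=1)
    )
    # The span text is passed through untouched; only the framing comes
    # from the template.
    return Template(template).substitute(facts=facts, sources=sources)


template = "The documents state:\n$facts\n\nSources:\n$sources"
spans = [
    ("The study found that X leads to Y.", "Paper A"),
    ("Results show Z is significant.", "Paper B"),
]
answer = fill_template(template, spans)
print(answer)
```

Because the citation numbers are generated from the same list as the spans, each quoted passage and its source stay aligned.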

Architecture

Core Components

- VerbatimRAG (verbatim_rag/core.py): Main orchestrator that coordinates document retrieval, span extraction, and response generation
- VerbatimIndex (verbatim_rag/index.py): Vector-based document indexing and retrieval
- SpanExtractor (verbatim_rag/extractors.py): Abstract interface for extracting relevant text spans from documents
  - LLMSpanExtractor: Uses OpenAI models to identify relevant spans
  - ModelSpanExtractor: Uses fine-tuned BERT-based models for span classification
- DocumentProcessor (verbatim_rag/ingestion/): Docling + Chonkie integration for intelligent document processing
- Document (verbatim_rag/document.py): Core document representation with metadata
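The idea of an abstract extractor with interchangeable implementations can be sketched as follows. The class names `SpanExtractorSketch` and `KeywordOverlapExtractor` and the `extract` signature are hypothetical and are not the package's real interface; the toy subclass replaces the LLM or fine-tuned model with plain keyword overlap so that the example runs without any dependencies.

```python
import re
from abc import ABC, abstractmethod


class SpanExtractorSketch(ABC):
    """Hypothetical stand-in for an abstract span extractor interface."""

    @abstractmethod
    def extract(self, question: str, documents: list[str]) -> dict[int, list[str]]:
        """Map each document index to the verbatim spans selected from it."""


class KeywordOverlapExtractor(SpanExtractorSketch):
    """Toy extractor: keeps sentences that share a longer word with the question."""

    def extract(self, question: str, documents: list[str]) -> dict[int, list[str]]:
        # Ignore very short words such as "the" or "did".
        keywords = {w for w in re.findall(r"\w+", question.lower()) if len(w) > 3}
        result: dict[int, list[str]] = {}
        for i, doc in enumerate(documents):
            # Split after sentence-ending punctuation; sentences stay unmodified.
            sentences = re.split(r"(?<=[.!?])\s+", doc)
            result[i] = [
                s for s in sentences
                if keywords & set(re.findall(r"\w+", s.lower()))
            ]
        return result


docs = [
    "Verbatim spans reduce hallucination. The weather was fine.",
    "Nothing relevant here.",
]
extractor = KeywordOverlapExtractor()
found = extractor.extract("What did the study find about hallucination?", docs)
print(found)
```

Any extractor that follows the same contract, returning only text that already exists in the documents, can be swapped in without changing the rest of the pipeline.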

Data Flow

1. Documents are processed and chunked using docling and chonkie
2. Documents are indexed using vector embeddings
3. User queries retrieve relevant documents
4. Span extractors identify verbatim passages that answer the question
5. Response templates structure the final answer with citations
6. All responses include exact text spans and document references

Web Interface

The package includes a full web interface with a React frontend and a FastAPI backend:

# Start API server
python api/app.py

# Start React frontend (in another terminal)
cd frontend/
npm install
npm start

ModernBERT Span Extractor

KRLabsOrg/verbatim-rag-modern-bert-v2 is a 150M-parameter query-conditioned token classifier built on gte-reranker-modernbert-base. It supports up to 8,192 tokens and is trained on scientific papers, Wikipedia QA, financial tables, medical literature, legal contracts, product manuals, and code/tool output.

It beats public extractive baselines (Zilliz Semantic Highlight, Provence) across ACL, RAGBench, Squeez, and QASPER. See the paper for full results.

ModelSpanExtractor defaults to this model:

from verbatim_rag.core import VerbatimRAG
from verbatim_rag.index import VerbatimIndex
from verbatim_rag.extractors import ModelSpanExtractor
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider

extractor = ModelSpanExtractor(
    model_path="KRLabsOrg/verbatim-rag-modern-bert-v2",  # default
    threshold=0.2,
    min_span_chars=30,
    merge_gap_chars=20,
    device=None,  # auto-detects cuda, mps, cpu
)

sparse_provider = SpladeProvider(
    model_name="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
    device="cpu"
)
vector_store = LocalMilvusStore(
    db_path="./index.db",
    collection_name="verbatim_rag",
    enable_dense=False,
    enable_sparse=True,
)
# Assumes ./index.db was already populated, e.g. by the Quick Start example
index = VerbatimIndex(vector_store=vector_store, sparse_provider=sparse_provider)

rag_system = VerbatimRAG(index=index, extractor=extractor, k=5)
response = rag_system.query("Main findings of the paper?")
print(response.answer)
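The parameters `threshold`, `min_span_chars`, and `merge_gap_chars` suggest a post-processing step that turns per-token scores into readable spans. This section does not document how ModelSpanExtractor applies them, so the sketch below is only one plausible reading: keep tokens scoring at or above the threshold, merge kept ranges separated by a small gap, and drop spans that are too short. The `tokens_to_spans` helper and the hand-assigned scores are hypothetical.

```python
import re


def tokens_to_spans(text: str,
                    tokens: list[tuple[int, int, float]],
                    threshold: float,
                    min_span_chars: int,
                    merge_gap_chars: int) -> list[str]:
    """Turn (start, end, score) token triples into verbatim text spans."""
    # 1. Keep the character ranges of tokens scoring at or above the threshold.
    ranges = sorted((s, e) for s, e, score in tokens if score >= threshold)

    # 2. Merge ranges whose gap is at most merge_gap_chars characters.
    merged: list[tuple[int, int]] = []
    for start, end in ranges:
        if merged and start - merged[-1][1] <= merge_gap_chars:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # 3. Drop short spans and slice the original text, so output stays verbatim.
    return [text[s:e] for s, e in merged if e - s >= min_span_chars]


text = (
    "The model extracts exact spans. "
    "Unrelated filler sentence goes here for padding. "
    "Citations link to sources."
)
# Hand-assigned scores standing in for classifier output.
relevant = {"The", "model", "extracts", "exact", "spans.",
            "here", "Citations", "link", "to", "sources."}
tokens = [
    (m.start(), m.end(), 0.9 if m.group() in relevant else 0.1)
    for m in re.finditer(r"\S+", text)
]
# Small values chosen to suit the toy text, not the defaults shown above.
spans = tokens_to_spans(text, tokens, threshold=0.5,
                        min_span_chars=10, merge_gap_chars=1)
print(spans)
```

In this toy run the isolated high-scoring word "here" is discarded by the minimum-length filter, while adjacent relevant words merge into full sentences.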

Datasets

Resource	Link
114K ACL Anthology papers in structured Markdown	KRLabsOrg/acl-anthology-md
20K+ labelled query-chunk training pairs	KRLabsOrg/verbatim-spans
Human-annotated ACL extraction benchmark	KRLabsOrg/acl-verbatim-spans
Training and evaluation pipeline	KRLabsOrg/acl-verbatim

Citation

If you use Verbatim RAG or the extractive models in your research, please cite our papers:

@misc{Recski:2026,
    title={ACL-Verbatim: hallucination-free question answering for research},
    author={Gábor Recski and Szilveszter Tóth and Nadia Verdha and István Boros and Ádám Kovács},
    year={2026},
    eprint={2605.21102},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2605.21102},
}

@inproceedings{kovacs-etal-2025-kr,
    title = "{KR} Labs at {A}rch{EHR}-{QA} 2025: A Verbatim Approach for Evidence-Based Question Answering",
    author = "Kovacs, Adam  and
      Schmitt, Paul  and
      Recski, Gabor",
    editor = "Soni, Sarvesh  and
      Demner-Fushman, Dina",
    booktitle = "Proceedings of the 24th Workshop on Biomedical Language Processing (Shared Tasks)",
    month = aug,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.bionlp-share.8/",
    pages = "69--74",
    ISBN = "979-8-89176-276-3",
    abstract = "We present a lightweight, domain{-}agnostic verbatim pipeline for evidence{-}grounded question answering. Our pipeline operates in two steps: first, a sentence-level extractor flags relevant note sentences using either zero-shot LLM prompts or supervised ModernBERT classifiers. Next, an LLM drafts a question-specific template, which is filled verbatim with sentences from the extraction step. This prevents hallucinations and ensures traceability. In the ArchEHR{-}QA 2025 shared task, our system scored 42.01{\%}, ranking top{-}10 in core metrics and outperforming the organiser{'}s 70B{-}parameter Llama{-}3.3 baseline. We publicly release our code and inference scripts under an MIT license."
}
