The open-source micro-evaluation tool for Legal AI systems.
Built for evaluating AI agents, RAG pipelines, and LLMs on legal documents.
Metrics · Quickstart · How It Works · Agent Interface · Dataset Schema · Output
LexEval is a lightweight, open-source micro-eval tool purpose-built for Legal AI systems. It gives developers a fast, structured way to measure how accurately an AI agent answers questions about legal documents: contracts, NDAs, service agreements, and more.
Think of it like pytest for your legal AI: plug in your agent, point it at a dataset, and get objective, multi-dimensional scores for every response.
```bash
pip install -r requirements.txt
python cli.py agents/my_agent.py datasets/service_agreement_dataset.json
```
⚠️ LexEval uses an LLM-as-a-judge approach. Set your `OPENAI_API_KEY` before running.
LexEval evaluates every response across 5 independent metrics, each implemented as an async LLM evaluator:
| Metric | What It Checks | Scoring |
|---|---|---|
| Answer Correctness | Does the predicted answer match the expected legal fact? | 1.0 – 5.0 |
| Hallucination Detection | Did the agent invent facts not present in the document? | 1.0 (hallucinated) – 5.0 (clean) |
| Entity Accuracy | Are named legal parties (companies, individuals) correctly identified? | 1.0 – 5.0 |
| Date Accuracy | Are extracted dates chronologically correct across formats? | 1.0 – 5.0 |
| Refusal Correctness | Does the agent properly refuse when information is absent? | 1.0 – 5.0 |
Each metric returns a structured JSON verdict:
```json
{
  "score": 5.0,
  "verdict": "Correct",
  "reasoning": "The model accurately identifies the governing law as India, per Clause 10."
}
```

Scores are averaged per sample and across all samples into an overall score.
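The averaging is a plain arithmetic mean at both levels. A minimal sketch of the idea (illustrative only, not LexEval's actual code; the data shape mirrors the results JSON shown in the Output section):

```python
from statistics import mean

def overall_score(results: list[dict]) -> float:
    # Average the five metric scores within each sample,
    # then average the per-sample scores into one overall number.
    sample_scores = [
        mean(m["score"] for m in r["metrics"].values())
        for r in results
    ]
    return round(mean(sample_scores), 2)
```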
```bash
pip install -r requirements.txt
```

Create a `.env` file (copy from `.env.example`):
```env
OPENAI_API_KEY=sk-...

# Optional: Override evaluator model or endpoint
LLM_MODEL=gpt-4o-mini
LLM_API_URL=https://api.openai.com/v1/chat/completions
```

Place your contract PDF or TXT in the `documents/` folder:
```
documents/
└── service_agreement.pdf
```

Define your evaluation questions in `datasets/`:
```json
{
  "document_file": "service_agreement.pdf",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q2",
      "question": "What arbitration rules apply?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}
```

Create an agent file that wraps your AI system (see Agent Interface):
```python
# agents/my_agent.py
def run_agent(document, question):
    answer = my_legal_ai_system(document, question)
    return {"answer": answer}
```

Run the evaluation:

```bash
python cli.py agents/my_agent.py datasets/service_agreement_dataset.json
```

Results are saved automatically to `results/`.
LexEval follows a simple, deterministic pipeline:
```
Dataset (JSON)
      │
      ▼
Document Loader    →  PDF or TXT extraction via pdfplumber
      │
      ▼
Agent Runner       →  Your run_agent(document, question) function
      │
      ▼
Metric Evaluators  →  5 async LLM-as-a-judge metrics run in parallel
      │
      ▼
Results JSON       →  Per-sample scores + overall score
```
- Dataset is loaded → document text is extracted from PDF/TXT, or provided inline.
- Agent is dynamically imported → your `run_agent` function is loaded at runtime via `importlib`.
- Each question is sent to your agent → the agent returns a predicted answer.
- All 5 metrics evaluate the response → they run concurrently via `asyncio.gather` (see the sketch below).
- Scores are aggregated → per-sample averages and an overall score are computed.
- Results are saved to `results/results-<agent>-<dataset>-<timestamp>.json`.
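The dynamic import and the concurrent metric fan-out rely on standard-library machinery. A rough sketch of the pattern (an illustration under the assumption that each metric exposes an async callable; not the actual `evaluator.py` source):

```python
import asyncio
import importlib.util

def load_run_agent(agent_path: str):
    # Import the agent file at runtime and return its run_agent function.
    spec = importlib.util.spec_from_file_location("user_agent", agent_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run_agent

async def score_sample(document, question, expected, predicted, metrics):
    # Fan the five metric evaluators out concurrently for one predicted answer.
    return await asyncio.gather(
        *(metric(document, question, expected, predicted) for metric in metrics)
    )
```

The repository itself is laid out as follows: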
```
LexEval/
│
├── cli.py                      # CLI entry point
├── evaluator.py                # Async evaluation pipeline
│
├── core/
│   ├── agent_loader.py         # Dynamically loads your agent file
│   └── dataset_loader.py       # Loads JSON datasets + extracts document text
│
├── metrics/
│   ├── answer_correctness.py   # Checks factual correctness of the answer
│   ├── hallucination.py        # Detects hallucinated content
│   ├── entity_accuracy.py      # Validates legal entity identification
│   ├── date_accuracy.py        # Compares dates across formats
│   ├── refusal_correctness.py  # Validates NOT_FOUND handling
│   ├── llm_client.py           # Shared OpenAI-compatible LLM client
│   └── utils/
│       └── normalize.py        # Date parsing + text normalization
│
├── agents/
│   └── my_agent.py             # Example agent (edit or replace)
│
├── datasets/                   # Your evaluation datasets (JSON)
├── documents/                  # Contract PDFs or TXT files
├── results/                    # Evaluation outputs (auto-generated)
│
├── agent_interface.md          # Agent contract specification
├── dataset_schema.md           # Dataset format reference
├── .env.example                # Environment variable template
└── requirements.txt
```
Your agent must expose a function named `run_agent`. LexEval will dynamically import and call it for each evaluation question.
```python
def run_agent(document: str, question: str) -> dict | str:
    ...
```

| Parameter | Type | Description |
|---|---|---|
| `document` | `str` | Document text (full contract) or a document ID, depending on your dataset config |
| `question` | `str` | The evaluation question |
Return value: either a dict with an `answer` key, or a plain string:

```python
# Option A – preferred
return {"answer": "AlphaTech Solutions Pvt. Ltd."}

# Option B – also accepted
return "AlphaTech Solutions Pvt. Ltd."
```

A minimal local agent looks like this:

```python
# agents/my_agent.py
def run_agent(document, question):
    # Call your AI system, RAG pipeline, or LLM API here
    answer = my_legal_ai(document, question)
return {"answer": answer}import requests
def run_agent(document_id, question):
    # 1. Upload the document
    with open(f"documents/{document_id}.pdf", "rb") as f:
        requests.post("http://localhost:8000/upload", files={"file": f},
                      headers={"x-document-id": document_id})

    # 2. Query the agent
    resp = requests.post("http://localhost:8000/ask", json={
        "question": question
    }, headers={"x-document-id": document_id})
return {"answer": resp.json().get("answer", "")}LexEval decides this based on your dataset configuration:
- If `documentId` is set in the dataset, `document` is the document ID (use it to fetch your own file).
- If `document_text` or `document_file` is set without `documentId`, `document` is the full extracted text (see the sketch below).
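Put differently, the hand-off amounts to a rule like this (a hypothetical sketch of the decision, not the real `dataset_loader.py`; the file-reading fallback is simplified to plain text):

```python
from pathlib import Path

def resolve_document(dataset: dict) -> str:
    """Hypothetical sketch of the rule described above."""
    if dataset.get("documentId"):
        # The agent gets the bare ID and fetches the document itself.
        return dataset["documentId"]
    if dataset.get("document_text"):
        # Inline text is passed through as-is.
        return dataset["document_text"]
    # Fall back to the referenced file; real extraction uses pdfplumber for PDFs.
    return Path("documents", dataset["document_file"]).read_text(encoding="utf-8")
```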
See Dataset Schema for details.
Datasets are JSON files stored in `datasets/`. Each dataset represents one document and a list of evaluation questions.
```json
{
  "document_file": "service_agreement.pdf",
  "document_text": "",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q15",
      "question": "What arbitration rules apply under this agreement?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `document_file` | string | One of these | Filename of a PDF or TXT in `documents/`. LexEval extracts text automatically. |
| `document_text` | string | One of these | Inline document text. Use instead of `document_file` for short contracts. |
| `documentId` | string | Optional | If set, this ID is passed directly to `run_agent` instead of the document text. |
| `samples` | array | ✅ | List of evaluation questions. |
| `samples[].id` | string | ✅ | Unique question identifier. |
| `samples[].question` | string | ✅ | The question to send to the agent. |
| `samples[].expected_answer` | string | ✅ | Correct answer. Use `"NOT_FOUND"` when the document does not contain the information. |
Option 1 – External file (PDF or TXT)
```json
{
  "document_file": "contract.pdf"
}
```

LexEval looks for the file in `documents/` and extracts text using pdfplumber.
Option 2 – Inline text
```json
{
  "document_text": "This Service Agreement is entered into on March 1, 2026..."
}
```

Useful for testing with short or synthetic documents.
Option 3 – Document ID (for external systems)
```json
{
  "documentId": "service_agreement"
}
```

The ID is passed directly to `run_agent`. Use this when your agent fetches the document itself (e.g., from a database or cloud storage).
Use "NOT_FOUND" as the expected answer when the document does not contain the information:
```json
{
  "id": "q15",
  "question": "What arbitration rules apply under this agreement?",
  "expected_answer": "NOT_FOUND"
}
```

The Refusal Correctness metric rewards agents that correctly refuse to answer, and also rewards agents that find the information when the reference was incorrectly marked as `NOT_FOUND`.
Results are saved automatically to:
```
results/results-<agent>-<dataset>-<timestamp>.json
```
```json
{
  "overall_score": 4.29,
  "results": [
    {
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd.",
      "predicted_answer": "The Service Provider is AlphaTech Solutions Pvt. Ltd.",
      "sample_score": 4.8,
      "metrics": {
        "answer_correctness": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "The model accurately identifies the Service Provider."
        },
        "hallucination": {
          "score": 5.0,
          "verdict": "Pass",
          "reasoning": "No hallucinated facts detected."
        },
        "entity_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "AlphaTech Solutions Pvt. Ltd. matches the reference entity."
        },
        "date_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "No date comparison required for this question."
        },
        "refusal_correctness": {
          "score": 4.0,
          "verdict": "Correct",
          "reasoning": "Answer was expected and correctly provided."
        }
      }
    }
  ]
}
```

| Score | Interpretation |
|---|---|
| 4.5 – 5.0 | Excellent – agent reliably extracts legal facts |
| 3.5 – 4.4 | Good – minor issues with formatting or partial answers |
| 2.5 – 3.4 | Fair – noticeable errors in specific metric categories |
| < 2.5 | Poor – significant hallucination, entity confusion, or refusal failures |
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (required) | API key for the LLM evaluator |
| `LLM_MODEL` | `gpt-4o-mini` | Model used by all metric evaluators |
| `LLM_API_URL` | `https://api.openai.com/v1/chat/completions` | OpenAI-compatible endpoint |
LexEval supports any OpenAI-compatible API: swap in Mistral, Together AI, Groq, or a local Ollama endpoint by setting `LLM_API_URL` and `LLM_MODEL`.
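Every metric evaluator ultimately just needs one chat-completions response from whatever endpoint is configured. A minimal sketch of such a call (illustrative; the real `llm_client.py` may differ):

```python
import os
import requests

def judge(prompt: str) -> str:
    # POST a single chat-completions request to the configured OpenAI-compatible endpoint.
    resp = requests.post(
        os.environ.get("LLM_API_URL", "https://api.openai.com/v1/chat/completions"),
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": os.environ.get("LLM_MODEL", "gpt-4o-mini"),
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

For example, to point the judge at a local Ollama server: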
```env
LLM_API_KEY=your-key
LLM_API_URL=http://localhost:11434/v1/chat/completions
LLM_MODEL=llama3
```

| Format | Support | Notes |
|---|---|---|
| `.pdf` | ✅ | Text extracted with pdfplumber |
| `.txt` | ✅ | Read directly as UTF-8 |
| Inline text | ✅ | Via `document_text` field in dataset |
| `.docx` | ❌ | Not yet supported |
| Scanned PDFs | ⚠️ | Only if a text layer is present |
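For reference, PDF text extraction follows the usual pdfplumber pattern, roughly (a sketch, not necessarily identical to `dataset_loader.py`):

```python
import pdfplumber

def pdf_to_text(path: str) -> str:
    # Concatenate the text layer of every page; scanned pages without a text layer yield nothing.
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```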
LexEval is designed for teams building or evaluating:
- Legal RAG systems – Retrieval-augmented generation over contracts and agreements
- Contract QA agents – AI assistants that answer questions about specific clauses
- Legal LLM benchmarking – Compare model performance on structured legal extraction tasks
- Prompt engineering – Test whether prompt changes improve factual precision
- CI/CD evaluation pipelines – Automate regression testing for legal AI systems
Contributions are welcome! Here's how to get started:
- Fork and clone the repository.
- Create a feature branch:

  ```bash
  git checkout -b feature/your-feature-name
  ```

- Follow module boundaries:
  - `core/` – document loading and agent loading utilities
  - `metrics/` – individual metric evaluators (one file per metric)
  - `agents/` – example and reference agent implementations
  - `datasets/` – evaluation datasets (JSON)
  - Keep secrets and endpoints in environment variables, never hardcoded.
- Testing: there are currently no automated tests. If you introduce complex logic, please add `pytest` tests. At minimum, run the CLI against a local dataset to verify your changes work end-to-end.
- Submit a pull request describing:
  - What you changed
  - Why it is needed
  - Any new environment variables or configuration introduced

Please coordinate with project maintainers for coding style expectations.
⭐ If LexEval is useful to your team, consider starring the repo!