βš–οΈ LexEval β€” Legal AI Micro-Eval Tool

The open-source micro-evaluation tool for Legal AI systems.
Built for evaluating AI agents, RAG pipelines, and LLMs on legal documents.

Metrics · Quickstart · How It Works · Agent Interface · Dataset Schema · Output

Python 3.9+ License Legal AI


LexEval is a lightweight, open-source micro-eval tool purpose-built for Legal AI systems. It gives developers a fast, structured way to measure how accurately an AI agent answers questions about legal documents: contracts, NDAs, service agreements, and more.

Think of it like pytest for your legal AI: plug in your agent, point it at a dataset, and get objective, multi-dimensional scores for every response.

pip install -r requirements.txt
python cli.py agents/my_agent.py datasets/service_agreement_dataset.json

⚠️ LexEval uses an LLM-as-a-judge approach. Set your OPENAI_API_KEY before running.


🔥 Metrics

LexEval evaluates every response across 5 independent metrics, each implemented as an async LLM evaluator:

Metric What It Checks Scoring
Answer Correctness Does the predicted answer match the expected legal fact? 1.0 – 5.0
Hallucination Detection Did the agent invent facts not present in the document? 1.0 (hallucinated) – 5.0 (clean)
Entity Accuracy Are named legal parties (companies, individuals) correctly identified? 1.0 – 5.0
Date Accuracy Are extracted dates chronologically correct across formats? 1.0 – 5.0
Refusal Correctness Does the agent properly refuse when information is absent? 1.0 – 5.0

Each metric returns a structured JSON verdict:

{
  "score": 5.0,
  "verdict": "Correct",
  "reasoning": "The model accurately identifies the governing law as India, per Clause 10."
}

Scores are averaged per sample and across all samples into an overall score.
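
For illustration, the aggregation amounts to a mean over metric scores per sample and a mean over samples; a minimal sketch, assuming each verdict is a dict with a numeric score field (the helper below is hypothetical, not LexEval's source code):

# Hypothetical sketch of the score aggregation described above (not LexEval's source code).
def aggregate(per_sample_verdicts: list[dict[str, dict]]) -> tuple[list[float], float]:
    # per_sample_verdicts: one dict per sample, mapping metric name -> {"score", "verdict", "reasoning"}.
    sample_scores = [
        sum(v["score"] for v in verdicts.values()) / len(verdicts)
        for verdicts in per_sample_verdicts
    ]
    overall_score = sum(sample_scores) / len(sample_scores)
    return sample_scores, overall_score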


🚀 Quickstart

1. Install dependencies

pip install -r requirements.txt

2. Configure your LLM evaluator

Create a .env file (copy from .env.example):

OPENAI_API_KEY=sk-...

# Optional: Override evaluator model or endpoint
LLM_MODEL=gpt-4o-mini
LLM_API_URL=https://api.openai.com/v1/chat/completions

3. Add your document

Place your contract PDF or TXT in the documents/ folder:

documents/
└── service_agreement.pdf

4. Create a dataset

Define your evaluation questions in datasets/:

{
  "document_file": "service_agreement.pdf",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q2",
      "question": "What arbitration rules apply?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}

5. Connect your agent

Create an agent file that wraps your AI system (see Agent Interface):

# agents/my_agent.py

def run_agent(document, question):
    # Replace this call with your own AI system, RAG pipeline, or LLM API
    answer = my_legal_ai_system(document, question)
    return {"answer": answer}

6. Run the evaluation

python cli.py agents/my_agent.py datasets/service_agreement_dataset.json

Results are saved automatically to results/.


πŸ” How It Works

LexEval follows a simple, deterministic pipeline:

Dataset (JSON)
    │
    ▼
Document Loader        ← PDF or TXT extraction via pdfplumber
    │
    ▼
Agent Runner           ← Your run_agent(document, question) function
    │
    ▼
Metric Evaluators      ← 5 async LLM-as-a-judge metrics run in parallel
    │
    ▼
Results JSON           ← Per-sample scores + overall score

  1. Dataset is loaded: document text is extracted from PDF/TXT, or provided inline.
  2. Agent is dynamically imported: your run_agent function is loaded at runtime via importlib.
  3. Each question is sent to your agent: the agent returns a predicted answer.
  4. All 5 metrics evaluate the response: they run concurrently via asyncio.gather.
  5. Scores are aggregated: per-sample averages and an overall score are computed.
  6. Results are saved to results/results-<agent>-<dataset>-<timestamp>.json.
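
For a concrete picture of steps 2 through 5, here is a simplified sketch of the flow, assuming five async metric functions that share a common signature (an approximation for illustration, not the actual evaluator.py code):

import asyncio
import importlib.util

def load_agent(path):
    # Step 2: import the agent module from its file path and grab run_agent (simplified).
    spec = importlib.util.spec_from_file_location("agent_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run_agent

async def evaluate_sample(run_agent, document, sample, metrics):
    # Step 3: ask the agent for a predicted answer (dict or plain string).
    result = run_agent(document, sample["question"])
    answer = result["answer"] if isinstance(result, dict) else result
    # Step 4: run all metric evaluators concurrently.
    verdicts = await asyncio.gather(
        *(metric(document, sample["question"], sample["expected_answer"], answer)
          for metric in metrics)
    )
    # Step 5: the per-sample score is the mean of the metric scores.
    return sum(v["score"] for v in verdicts) / len(verdicts), verdicts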

πŸ“ Project Structure

LexEval/
│
├── cli.py                     # CLI entry point
├── evaluator.py               # Async evaluation pipeline
│
├── core/
│   ├── agent_loader.py        # Dynamically loads your agent file
│   └── dataset_loader.py      # Loads JSON datasets + extracts document text
│
├── metrics/
│   ├── answer_correctness.py  # Checks factual correctness of the answer
│   ├── hallucination.py       # Detects hallucinated content
│   ├── entity_accuracy.py     # Validates legal entity identification
│   ├── date_accuracy.py       # Compares dates across formats
│   ├── refusal_correctness.py # Validates NOT_FOUND handling
│   ├── llm_client.py          # Shared OpenAI-compatible LLM client
│   └── utils/
│       └── normalize.py       # Date parsing + text normalization
│
├── agents/
│   └── my_agent.py            # Example agent (edit or replace)
│
├── datasets/                  # Your evaluation datasets (JSON)
├── documents/                 # Contract PDFs or TXT files
├── results/                   # Evaluation outputs (auto-generated)
│
├── agent_interface.md         # Agent contract specification
├── dataset_schema.md          # Dataset format reference
├── .env.example               # Environment variable template
└── requirements.txt

🤖 Agent Interface

Your agent must expose a function named run_agent. LexEval will dynamically import and call it for each evaluation question.

Signature

def run_agent(document: str, question: str) -> dict | str:
    ...
Parameter Type Description
document str Document text (full contract) or a document ID; depends on your dataset config
question str The evaluation question

Return value: either a dict with an answer key, or a plain string:

# Option A (preferred)
return {"answer": "AlphaTech Solutions Pvt. Ltd."}

# Option B (also accepted)
return "AlphaTech Solutions Pvt. Ltd."

Minimal example

# agents/my_agent.py

def run_agent(document, question):
    # Call your AI system, RAG pipeline, or LLM API here
    answer = my_legal_ai(document, question)
    return {"answer": answer}

Example: Agent calling an external API

import requests

def run_agent(document_id, question):
    # 1. Upload the document
    with open(f"documents/{document_id}.pdf", "rb") as f:
        requests.post("http://localhost:8000/upload", files={"file": f},
                      headers={"x-document-id": document_id})

    # 2. Query the agent
    resp = requests.post("http://localhost:8000/ask", json={
        "question": question
    }, headers={"x-document-id": document_id})

    return {"answer": resp.json().get("answer", "")}

When does document contain the ID vs. full text?

LexEval decides this based on your dataset configuration:

  • If documentId is set in the dataset → document is the document ID (use it to fetch your own file).
  • If document_text or document_file is set without documentId → document is the full extracted text.

See Dataset Schema for details.
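
A sketch of that decision, under the assumption that PDF text is pulled with pdfplumber as described above (illustrative only, not the actual dataset_loader.py code):

import os
import pdfplumber

def resolve_document(dataset: dict) -> str:
    # If documentId is present, the agent gets the ID and fetches the document itself.
    if dataset.get("documentId"):
        return dataset["documentId"]
    # Inline text is passed through unchanged.
    if dataset.get("document_text"):
        return dataset["document_text"]
    # Otherwise extract text from documents/<document_file>.
    path = os.path.join("documents", dataset["document_file"])
    if path.lower().endswith(".pdf"):
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    with open(path, encoding="utf-8") as f:
        return f.read()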


📋 Dataset Schema

Datasets are JSON files stored in datasets/. Each dataset represents one document and a list of evaluation questions.

Full schema

{
  "document_file": "service_agreement.pdf",
  "document_text": "",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q15",
      "question": "What arbitration rules apply under this agreement?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}

Field reference

Field Type Required Description
document_file string One of these Filename of a PDF or TXT in documents/. LexEval extracts text automatically.
document_text string One of these Inline document text. Use instead of document_file for short contracts.
documentId string Optional If set, this ID is passed directly to run_agent instead of the document text.
samples array ✅ List of evaluation questions.
samples[].id string ✅ Unique question identifier.
samples[].question string ✅ The question to send to the agent.
samples[].expected_answer string ✅ Correct answer. Use "NOT_FOUND" when the document does not contain the information.
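
Before running an eval, a few lines of Python can sanity-check a dataset against these rules (a hypothetical helper, not shipped with LexEval):

import json

REQUIRED_SAMPLE_KEYS = {"id", "question", "expected_answer"}

def check_dataset(path):
    # Lightweight sanity check against the field reference above (hypothetical helper).
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not (data.get("document_file") or data.get("document_text") or data.get("documentId")):
        raise ValueError("Provide document_file, document_text, or documentId")
    for sample in data.get("samples", []):
        missing = REQUIRED_SAMPLE_KEYS - sample.keys()
        if missing:
            raise ValueError(f"Sample {sample.get('id', '?')} is missing: {sorted(missing)}")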

Document loading options

Option 1: External file (PDF or TXT)

{
  "document_file": "contract.pdf"
}

LexEval looks for the file in documents/ and extracts text using pdfplumber.

Option 2: Inline text

{
  "document_text": "This Service Agreement is entered into on March 1, 2026..."
}

Useful for testing with short or synthetic documents.

Option 3: Document ID (for external systems)

{
  "documentId": "service_agreement"
}

The ID is passed directly to run_agent. Use this when your agent fetches the document itself (e.g., from a database or cloud storage).

Handling missing information

Use "NOT_FOUND" as the expected answer when the document does not contain the information:

{
  "id": "q15",
  "question": "What arbitration rules apply under this agreement?",
  "expected_answer": "NOT_FOUND"
}

The Refusal Correctness metric rewards agents that correctly refuse to answer, and also rewards agents that find the information when the reference was incorrectly marked as NOT_FOUND.


📊 Output & Results

Results are saved automatically to:

results/results-<agent>-<dataset>-<timestamp>.json

Output structure

{
  "overall_score": 4.29,
  "results": [
    {
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd.",
      "predicted_answer": "The Service Provider is AlphaTech Solutions Pvt. Ltd.",
      "sample_score": 4.8,
      "metrics": {
        "answer_correctness": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "The model accurately identifies the Service Provider."
        },
        "hallucination": {
          "score": 5.0,
          "verdict": "Pass",
          "reasoning": "No hallucinated facts detected."
        },
        "entity_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "AlphaTech Solutions Pvt. Ltd. matches the reference entity."
        },
        "date_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "No date comparison required for this question."
        },
        "refusal_correctness": {
          "score": 4.0,
          "verdict": "Correct",
          "reasoning": "Answer was expected and correctly provided."
        }
      }
    }
  ]
}
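
Because results are plain JSON, they are easy to post-process. For example, a small script to print per-metric averages from a results file (a hypothetical snippet, not part of LexEval):

import json
from collections import defaultdict

def summarize(results_path):
    # Average each metric across all samples in one results file.
    with open(results_path, encoding="utf-8") as f:
        report = json.load(f)
    scores = defaultdict(list)
    for item in report["results"]:
        for name, verdict in item["metrics"].items():
            scores[name].append(verdict["score"])
    print(f"overall_score: {report['overall_score']}")
    for name, values in scores.items():
        print(f"{name}: {sum(values) / len(values):.2f}")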

Score interpretation

Score Interpretation
4.5 – 5.0 Excellent: agent reliably extracts legal facts
3.5 – 4.4 Good: minor issues with formatting or partial answers
2.5 – 3.4 Fair: noticeable errors in specific metric categories
< 2.5 Poor: significant hallucination, entity confusion, or refusal failures

βš™οΈ Configuration

Environment variables

Variable Default Description
OPENAI_API_KEY (required) API key for the LLM evaluator
LLM_MODEL gpt-4o-mini Model used by all metric evaluators
LLM_API_URL https://api.openai.com/v1/chat/completions OpenAI-compatible endpoint

LexEval supports any OpenAI-compatible API: swap in Mistral, Together AI, Groq, or a local ollama endpoint by setting LLM_API_URL and LLM_MODEL.

Using a custom evaluator endpoint

LLM_API_KEY=your-key
LLM_API_URL=http://localhost:11434/v1/chat/completions
LLM_MODEL=llama3
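
For reference, the request the evaluator sends looks roughly like the following; a minimal sketch assuming the standard OpenAI-compatible /chat/completions format and the OPENAI_API_KEY variable from the table above (not the actual llm_client.py code):

import os
import requests

def judge(prompt: str) -> str:
    # Minimal OpenAI-compatible chat-completions call using the variables above (illustrative only).
    resp = requests.post(
        os.getenv("LLM_API_URL", "https://api.openai.com/v1/chat/completions"),
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": os.getenv("LLM_MODEL", "gpt-4o-mini"),
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]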

🧩 Supported Document Types

Format Support Notes
.pdf ✅ Text extracted with pdfplumber
.txt ✅ Read directly as UTF-8
Inline text ✅ Via document_text field in dataset
.docx ❌ Not yet supported
Scanned PDFs ⚠️ Only if text layer is present

πŸ› οΈ Use Cases

LexEval is designed for teams building or evaluating:

  • Legal RAG systems: Retrieval-augmented generation over contracts and agreements
  • Contract QA agents: AI assistants that answer questions about specific clauses
  • Legal LLM benchmarking: Compare model performance on structured legal extraction tasks
  • Prompt engineering: Test whether prompt changes improve factual precision
  • CI/CD evaluation pipelines: Automate regression testing for legal AI systems (see the sketch below)
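
For the CI/CD case, one simple pattern is a gate script that runs the eval and fails the build when the overall score drops below a threshold (a hypothetical sketch; the 4.0 threshold and file-picking logic are assumptions, not LexEval features):

import glob
import json
import os
import subprocess
import sys

# Hypothetical CI gate: run the eval, then fail the build if the overall score is too low.
subprocess.run(
    ["python", "cli.py", "agents/my_agent.py", "datasets/service_agreement_dataset.json"],
    check=True,
)
# Pick the most recently written results file.
latest = max(glob.glob("results/results-*.json"), key=os.path.getmtime)
with open(latest, encoding="utf-8") as f:
    overall = json.load(f)["overall_score"]
print(f"{latest}: overall_score = {overall}")
sys.exit(0 if overall >= 4.0 else 1)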

🤝 Contributing

Contributions are welcome! Here's how to get started:

  1. Fork and clone the repository.

  2. Create a feature branch:

    git checkout -b feature/your-feature-name
  3. Follow module boundaries:

    • core/: document loading and agent loading utilities
    • metrics/: individual metric evaluators (one file per metric)
    • agents/: example and reference agent implementations
    • datasets/: evaluation datasets (JSON)
    • Keep secrets and endpoints in environment variables, never hardcoded.
  4. Testing: There are currently no automated tests. If you introduce complex logic, please add pytest tests. At minimum, run the CLI against a local dataset to verify your changes work end-to-end.

  5. Submit a pull request describing:

    • What you changed
    • Why it is needed
    • Any new environment variables or configuration introduced

Please coordinate with project maintainers for coding style expectations.


⭐ If LexEval is useful to your team, consider starring the repo!

About

A lightweight, open-source toolkit for unit-testing Legal RAG pipelines. It replaces generic string matching with specialized logic to validate contract data and model behavior, allowing for rapid iteration on small, high-quality datasets.
