SkillTest

A standalone testing framework for agent skills. Currently runs agents via Claude Code; support for additional agent runtimes is planned.

Agent skills are hard to verify. Write a SKILL.md, deploy it, and you're left hoping the model does what you intended. SkillTest changes that: write tests alongside your skills, run them automatically, and know whether your skill works. Watch the demo video here.

Why SkillTest

Most agent skills are tested informally, if at all. Developers prompt the model a few times, eyeball the output, and ship. Existing evaluation tools treat the skill definition and its tests as separate concerns — tests live in a different repo, a different tool, or not at all.

SkillTest is built around a different practice: every skill ships with its tests. When you write a skill, you write its tests. The tests live alongside the skill definition in the same directory:

my-skill/
├── SKILL.md          ← the skill definition
├── Dockerfile        ← execution environment (optional — for pinned dependencies)
└── tests/
    ├── tests.yaml    ← test cases and expectations
    └── pytests/      ← pre-written pytest checks on agent output (optional)
        ├── conftest.py
        └── test_output.py

SkillTest automates the full test cycle — loading the skill, invoking the agent via Docker to finish some task with the skill, grading the agent's output, and reporting results — so you get the same fast feedback loop for agent skills that you get for ordinary code.

Here is a test that tests the agent correctly counts the pdf page with the skill pdf:

# tests/tests.yaml
skill_name: pdf  # the skill under test
tests:
  - name: page-count
    prompt: How many pages does input/guide.pdf have? Just tell me the number.
    input_dir: tests/input
    expectations:
      - text: Agent reports the correct page count (3)
        oracle: pytest
        pytest_path: tests/pytests/test_page_count.py

Agent runtime: SkillTest currently runs agents using Claude Code inside Docker. Support for additional agent runtimes (Gemini CLI, OpenClaw) is on the roadmap.

📋 Requirements

🐍 Python 3.11+
🐳 Docker — required for skilltest run
🔑 ANTHROPIC_API_KEY — required for agent-judge expectations; pytest expectations work without it

🚀 Quickstart

This quickstart uses skills/pdf-skills/ — a working skill included in the repo — so you can run a real test end-to-end in three steps.

Step 1 — Install

git clone https://github.com/JiyangZhang/skilltest
cd skilltest
pip install -e .

Step 2 — Build the Docker image

docker build -t skilltest-pdf:latest -f skills/pdf-skills/Dockerfile .

Step 3 — Run the demo

export ANTHROPIC_API_KEY=sk-ant-...

cd skills/pdf-skills
skilltest run . --docker-image skilltest-pdf:latest --agent-model claude-haiku-4-5-20251001 --judge-model claude-haiku-4-5-20251001

SkillTest will spin up a Docker container per test case, run the Claude Code agent, grade each expectation, and write results to skills/pdf-skills/skilltest-results/:

skills/pdf-skills/
└── skilltest-results/
    ├── grading.json   ← machine-readable results
    ├── grading.xml    ← JUnit XML for CI
    └── report.html    ← HTML dashboard

Open the dashboard:

open skills/pdf-skills/skilltest-results/report.html

Options:

# Stream live agent traces (tool calls, file writes) for debugging
skilltest run . --debug

# Use a faster/cheaper model for the agent
skilltest run . --agent-model claude-haiku-4-5-20251001

# Use a specific model for agent-judge grading
skilltest run . --judge-model claude-haiku-4-5-20251001

⚙️ How it works

skilltest run <skill-dir>
  │
  ├─ for each test case in tests/tests.yaml:
  │    ├─ spin up Docker container (Claude Code agent)
  │    ├─ mount skill dir + input files read-only
  │    ├─ run agent against the test prompt
  │    ├─ capture output/stdout.txt and output/artifacts/
  │    └─ grade each expectation:
  │         ├─ pytest      → run pre-written pytest checks on the host
  │         └─ agent-judge → spin up a Docker container (Claude Code agent) to inspect output + files
  │
  └─ write skilltest-results/grading.json, grading.xml, report.html

The agent under test and the agent-judge both run inside Docker; only pytest grading runs on the host. The per-test workspace (skilltest-results/agent-runs/test-N/) is mounted into the container and remains on disk after the run for inspection.

✍️ Writing tests for your skill

This is the core of SkillTest. Tests live in tests/tests.yaml inside the skill directory and are the primary artifact you author. A test suite is a list of test cases. Each test case is one scenario you want to verify — a task the agent should handle, the input files it needs, and the criteria it must satisfy.

Test cases and expectations

A test case sends a task prompt to the agent and runs it inside Docker. An expectation is a single graded criterion that the agent's output must satisfy. One test case can have multiple expectations — each is graded independently and contributes to the overall pass rate.

Think of it this way:

Test case = the scenario or task ("merge these two PDFs")
Expectation = one thing that must be true about the result ("the merged file has 3 pages", "the content is correct")

test case 2: merge PDFs
  ├── expectation A: merged.pdf exists and has 3 pages  → graded by pytest      → PASS ✓
  └── expectation B: content from both files is present → graded by agent-judge → PASS ✓

Writing `tests.yaml`

Create tests/tests.yaml in your skill directory. Here is the complete file for skills/pdf-skills/:

# skills/pdf-skills/tests/tests.yaml
skill_name: pdf
schema_version: 1

tests:
  - name: page-count
    prompt: |
      How many pages does input/guide.pdf have?
      Just tell me the number.
    input_dir: tests/input          # contents mounted at input/ inside the container
    expectations:
      - text: Output mentions the correct page count (3)
        oracle: pytest
        pytest_path: tests/pytests/test_page_count.py

  - name: merge-pdfs
    prompt: |
      Merge input/chapter1.pdf and input/chapter2.pdf into a single file.
      Save the result as output/artifacts/merged.pdf.
    input_dir: tests/input
    expectations:
      - text: merged.pdf exists in artifacts and contains all 3 pages
        oracle: pytest
        pytest_path: tests/pytests/test_merge.py
      - text: Two PDFs are merged successfully
        oracle: agent-judge
        rubric:
          - merged.pdf exists and includes content from both source PDFs

Test case 1 has one expectation graded by a pytest file. Test case 2 has two expectations: one deterministic check (file exists, correct page count) and one judgment check (content quality). The agent runs once per test case; all expectations are graded against the same output.

Test case fields

Field	Description
`name`	String. Unique within the suite (e.g. `page-count`, `merge-pdfs`).
`prompt`	The task sent to the agent verbatim.
`input_dir`	Path relative to skill root; its contents are copied to `input/` inside the container.
`expectations`	List of graded criteria — see below.
`setup` / `cleanup`	Shell commands or file writes to run on the host before/after the agent.
`constraints`	`{timeout_seconds, max_steps}` — hard limits on the agent run.

Expectations and oracles

Each expectation has a text (what you're checking) and an oracle (how it's graded). Two oracles are supported:

Oracle	How it grades	When to use
`pytest`	Runs a pre-written Python test file. Pass = exit code 0.	Deterministic checks: file exists, page count, JSON structure, required keywords.
`agent-judge`	A Claude Code agent runs in Docker with the artifacts directory mounted read-only. It reads the expectation text, the agent's output, and can inspect artifact files.	Judgment calls: content accuracy, completeness, tone.

If oracle is omitted it defaults to agent-judge.

expectations:
  # deterministic — you write the assertion
  - text: merged.pdf exists in artifacts and has exactly 3 pages
    oracle: pytest
    pytest_path: tests/pytests/test_merge.py   # omit to run all tests/pytests/

  # judgment — Claude evaluates it; rubric adds explicit criteria
  - text: The merged PDF preserves content from both source files
    oracle: agent-judge
    rubric:
      - Chapter 1 content appears on page 1
      - Chapter 2 content appears on pages 2–3

agent-judge requires ANTHROPIC_API_KEY and Docker. pytest needs neither.

Writing pytest checks

Use pytest expectations for anything you can express as a deterministic assertion — file existence, page counts, JSON structure, required keywords. Use agent-judge for judgment calls — accuracy, tone, content quality.

💡 Not sure how to write pytest checks? Use the skills/write-skilltest-pytests skill — it knows the SkillTest conventions and will generate the files for you.

SkillTest injects these environment variables into your pytest subprocess:

Variable	Contents
`SKILLTEST_OUTPUT`	The agent's full stdout as a string
`SKILLTEST_ARTIFACTS_DIR`	Absolute path to `output/artifacts/`
`SKILLTEST_RUN_DIR`	Absolute path to the per-test run bundle root
`SKILLTEST_SKILL_DIR`	Absolute path to the skill root

tests/pytests/conftest.py (scaffolded by skilltest init):

import os
import pytest

@pytest.fixture
def agent_output() -> str:
    return os.environ["SKILLTEST_OUTPUT"]

Example — check a file was created with the right page count (skills/pdf-skills/tests/pytests/test_merge.py):

import os
from pathlib import Path
from pypdf import PdfReader

def test_merged_pdf_exists():
    artifacts = Path(os.environ["SKILLTEST_ARTIFACTS_DIR"])
    assert (artifacts / "merged.pdf").exists()

def test_merged_pdf_has_three_pages():
    artifacts = Path(os.environ["SKILLTEST_ARTIFACTS_DIR"])
    reader = PdfReader(str(artifacts / "merged.pdf"))
    assert len(reader.pages) == 3

Example — check the agent's text output (skills/pdf-skills/tests/pytests/test_page_count.py):

import os, re

def test_output_mentions_page_count(agent_output):
    assert re.search(r"\b3\b|three", agent_output, re.IGNORECASE), \
        f"Expected agent to report 3 pages, got:\n{agent_output[:300]}"

Debugging pytest checks locally without running Docker:

cd skills/pdf-skills
export SKILLTEST_OUTPUT="The guide has 3 pages."
export SKILLTEST_ARTIFACTS_DIR=$(pwd)/skilltest-results/agent-runs/test-1/output/artifacts
export SKILLTEST_SKILL_DIR=$(pwd)
python -m pytest tests/pytests/test_page_count.py -v

Point pytest_path at a specific file when different test cases need different checks:

- text: Page count is correct
  oracle: pytest
  pytest_path: tests/pytests/test_page_count.py

Omit pytest_path to run the entire tests/pytests/ directory.

📊 Output

Results are written to skilltest-results/ inside the skill directory by default, regardless of where you invoke the command:

skilltest run skills/pdf-skills
# → writes to skills/pdf-skills/skilltest-results/

Add skilltest-results/ to your skill's .gitignore.

File	Description
`grading.json`	Per-expectation results: `test_id`, `text`, `passed`, `evidence`, `oracle_used`, `duration_ms`.
`grading.xml`	JUnit XML for the same results (CI dashboards, GitHub Actions test reporter).
`report.html`	Self-contained HTML dashboard — pass rate, per-test cards with prompt and agent output, oracle badges, failure evidence. Supports light/dark mode.
`agent-runs/test-N/`	Full per-test workspace: `prompt.txt`, `input/`, `output/stdout.txt`, `output/artifacts/`, `output/manifest.json`.

Regenerate report.html from a previous run at any time:

skilltest report skilltest-results/grading.json

🛠️ Commands

`skilltest run`

skilltest run <skill-dir> [options]

Option	Default	Purpose
`--output DIR`	`<skill-dir>/skilltest-results/`	Where to write results.
`--agent-model NAME`	image default	Claude model for the agent inside Docker.
`--judge-model NAME`	Claude default	Claude model for `agent-judge` grading.
`--docker-image NAME`	`skilltest-claude:latest`	Docker image to use (or `SKILLTEST_DOCKER_IMAGE` env).
`--run-workspace DIR`	`<output>/agent-runs`	Host directory for per-test run bundles.
`--min-pass-rate FLOAT`	`0.0`	Exit non-zero if pass rate falls below this (CI gate).
`--debug`	off	Stream agent and judge output live; adds `--verbose` to the agent.
`--tests PATH`	`<skill>/tests/tests.yaml`	Alternate tests file.

`skilltest report`

skilltest report <grading.json> [--output DIR]

Generates report.html from a grading.json file. When agent-runs/ exists alongside the grading file, the report is automatically enriched with the original prompt and agent output per test card.

`skilltest init`

skilltest init <skill-name>

Scaffolds a new skill directory with SKILL.md, tests/tests.yaml, and starter pytest files:

my-skill/
├── SKILL.md
└── tests/
    ├── tests.yaml
    └── pytests/
        ├── conftest.py
        └── test_output.py

🐳 Docker

Base image

docker/Dockerfile.claude is the only image you need to build. It contains the Claude Code CLI and Python:

docker build -t skilltest-claude:latest -f docker/Dockerfile.claude docker/

The agent installs any skill-specific packages it needs at runtime via pip install — no skill-level Dockerfile required. This is consistent with how Claude Code works in practice: the agent is capable of setting up its own environment.

Bringing your own image

For production or CI use cases where you want deterministic, fast runs without runtime installs, you can provide a custom image. Place a Dockerfile alongside SKILL.md in the skill directory:

# my-skill/Dockerfile
FROM skilltest-claude:latest
RUN pip3 install --break-system-packages pandas openpyxl

Build it and pass it to skilltest run:

docker build -t my-skill:latest my-skill/
skilltest run ./my-skill --docker-image my-skill:latest

This is an opt-in. The default (skilltest-claude:latest) works for any skill without extra setup.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
docker		docker
skills		skills
skilltest		skilltest
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SkillTest

Why SkillTest

📋 Requirements

🚀 Quickstart

Step 1 — Install

Step 2 — Build the Docker image

Step 3 — Run the demo

⚙️ How it works

✍️ Writing tests for your skill

Test cases and expectations

Writing `tests.yaml`

Test case fields

Expectations and oracles

Writing pytest checks

📊 Output

🛠️ Commands

`skilltest run`

`skilltest report`

`skilltest init`

🐳 Docker

Base image

Bringing your own image

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SkillTest

Why SkillTest

📋 Requirements

🚀 Quickstart

Step 1 — Install

Step 2 — Build the Docker image

Step 3 — Run the demo

⚙️ How it works

✍️ Writing tests for your skill

Test cases and expectations

Writing tests.yaml

Test case fields

Expectations and oracles

Writing pytest checks

📊 Output

🛠️ Commands

skilltest run

skilltest report

skilltest init

🐳 Docker

Base image

Bringing your own image

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Writing `tests.yaml`

`skilltest run`

`skilltest report`

`skilltest init`

Packages