Benchmark and optimize AGENTS.md and SKILL.md for Codex.
CodexOpt is a lightweight CLI for benchmarking and optimizing Codex instruction assets.
It focuses on Codex instruction assets:
- AGENTS.md
- .codex/skills/**/SKILL.md
- .agents/skills/**/SKILL.md
- Documentation: superagenticai.github.io/CodexOpt
- Codex user workflow: docs/codex-users.md
- Demo repository: github.com/SuperagenticAI/codexopt-demo
- PyPI package: pypi.org/project/codexopt
- Docs source: docs/
CodexOpt gives teams a repeatable workflow to:
- Scan instruction files.
- Benchmark quality.
- Generate optimized candidates.
- Apply only improvements.
- Produce a report.
Most teams edit AGENTS.md and SKILL.md manually, but struggle to answer:
- Did quality actually improve?
- Did we increase prompt bloat?
- Did we break skill frontmatter conventions?
CodexOpt turns these edits into measurable runs with artifacts you can inspect and version.
- Project scan with issue detection for agents and skills.
- Benchmark scoring with sub-scores and natural-language feedback.
- Optional evidence inputs from repo task files and issue exports.
- Optimization engine heuristic (default, local and deterministic).
- Reflective engine for Codex-backed SkillOpt/GEPA-style optimization.
- SkillOpt-inspired skillopt engine for SKILL.md files with train/validation evidence splits, bounded edits, and validation-gated acceptance.
- Explicit reporting when a model-backed run falls back to heuristic optimization.
- Safe apply flow with automatic backups.
- Markdown reporting from latest runs.
- Minimal OSS CI (lint, test, build).
- Python >=3.10
- uv (recommended) or pip
Install with uv:
uv sync --extra dev
Run commands through the managed environment:
uv run codexopt --help
uv.lock is committed to keep dependency resolution reproducible across machines and CI.
Install with pip:
pip install -e ".[dev]"
# 1) Create config
uv run codexopt init
# 2) Inspect what will be evaluated
uv run codexopt scan
# 3) Get baseline scores
uv run codexopt benchmark
# 4) Optimize AGENTS.md
uv run codexopt optimize agents --file AGENTS.md
# 5) Optimize skills
uv run codexopt optimize skills --glob ".codex/skills/**/SKILL.md"
# 6) Review apply impact without writing
uv run codexopt apply --kind agents --dry-run
# 7) Apply selected improvements
uv run codexopt apply --kind agents
# 8) Generate markdown summary
uv run codexopt report --output codexopt-report.md
For Codex-specific rollout workflows, including codex exec --json validation tasks, see
Using CodexOpt with Codex.
Developers use CodexOpt in the repository that contains their Codex instruction assets:
- AGENTS.md
- .codex/skills/**/SKILL.md
- .agents/skills/**/SKILL.md
Optional evidence can also be added to improve benchmarking and optimization quality:
- task files (tasks.md, task lists, or JSON fixtures)
- issue/review exports (issues.md or JSON exports)
Typical workflow:
- Run scan and benchmark to measure the current instruction assets.
- Run optimize agents and optimize skills to generate improved candidates.
- Review the generated diffs and report artifacts under .codexopt/runs/.
- Run apply --dry-run first, then apply accepted changes.
- Commit the updated instruction files and, if useful, attach the report to a PR.
Example with optional evidence configured in codexopt.yaml:
evidence:
  task_files:
    - tasks.md
  issue_files:
    - issues.md
With that config in place, benchmark and optimize use:
- static prompt-quality checks
- repo task alignment
- recurring issue/review themes
Today, task and issue files influence scoring and feedback. With --engine skillopt, CodexOpt
uses task evidence as train/validation splits so skill candidates must improve held-out evidence
before they are accepted. JSON task files can also define executable rollout commands; when present,
those rollout pass rates become the held-out validation gate.
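The train/validation gate described above can be sketched in a few lines. This is a minimal illustration, not CodexOpt's implementation: `split_evidence`, `accept_candidate`, and the `score` callback are hypothetical names, and the ordered (non-shuffled) split is an assumption. Only the 0.67 ratio and the 0.01 delta come from the default config.

```python
from typing import Callable, Sequence


def split_evidence(tasks: Sequence[str], train_ratio: float = 0.67) -> tuple[list[str], list[str]]:
    """Split tasks into (train, validation) by position (assumed, not shuffled)."""
    if len(tasks) < 2:
        # Nothing can be held out with fewer than two tasks.
        return list(tasks), []
    cut = round(len(tasks) * train_ratio)
    # Keep at least one task on each side of the split.
    cut = max(1, min(len(tasks) - 1, cut))
    return list(tasks[:cut]), list(tasks[cut:])


def accept_candidate(
    score: Callable[[str, str], float],
    baseline: str,
    candidate: str,
    validation: Sequence[str],
    min_delta: float = 0.01,
) -> bool:
    """Accept only if the mean held-out score improves by at least min_delta."""
    if not validation:
        return False

    def mean_score(document: str) -> float:
        return sum(score(document, task) for task in validation) / len(validation)

    return mean_score(candidate) - mean_score(baseline) >= min_delta
```

The candidate is proposed using only the train split, so an improvement on the validation split is evidence that the edit generalizes rather than overfits the tasks it was written against.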
Use codexopt.example.yaml as a starting point for committed team config.
codexopt --config <path-to-codexopt.yaml> <command>

Create a default config file.
codexopt init [--path PATH] [--force]

Discover AGENTS/SKILL targets and validate shape.
codexopt scan

Score current files using built-in heuristics.
codexopt benchmark

Optimize AGENTS files.
codexopt optimize agents \
  [--file PATTERN] \
  [--engine heuristic|reflective] \
  [--reflection-model MODEL] \
  [--max-metric-calls N]

Optimize SKILL files.
codexopt optimize skills \
  [--glob PATTERN] \
  [--engine heuristic|skillopt|reflective] \
  [--reflection-model MODEL] \
  [--max-metric-calls N]

One command for Codex users: discover targets, mine starter tasks, run the reflective optimizer, and preview the diff.
codexopt improve # offline preview
codexopt improve --live # Codex-backed reflective preview
codexopt improve --live --apply # write validated changes with backups

Apply best candidates from the latest optimization run (or a provided run id).
codexopt apply [--kind agents|skills] [--run-id RUN_ID] [--dry-run]

Generate a markdown report from latest runs in state.
codexopt report [--output FILE.md]

Default codexopt.yaml:
version: 1
targets:
  agents_files:
    - AGENTS.md
    - "**/AGENTS.md"
    - "**/AGENTS.override.md"
  skills_globs:
    - ".codex/skills/**/SKILL.md"
    - "**/.codex/skills/**/SKILL.md"
    - ".agents/skills/**/SKILL.md"
    - "**/.agents/skills/**/SKILL.md"
  exclude_globs:
    - ".git/**"
    - ".codexopt/**"
    - ".venv/**"
    - "node_modules/**"
    - "reference/**"
output:
  root_dir: ".codexopt"
evidence:
  task_files: []
  issue_files: []
optimization:
  engine: "heuristic"
  min_apply_delta: 0.01
  max_metric_calls: 60
  reflection_model: null
  skillopt_train_ratio: 0.67
  skillopt_edit_budget: 24
  skillopt_validation_delta: 0.01
Config notes:
- targets.agents_files: glob patterns for AGENTS targets.
- targets.skills_globs: glob patterns for SKILL.md targets.
- targets.exclude_globs: paths ignored during scan.
- output.root_dir: run artifacts and backups location.
- evidence.task_files: optional markdown/json task lists used for repo-alignment scoring.
- evidence.issue_files: optional markdown/json issue or review exports used for theme-aware feedback.
- optimization.engine: default optimization engine (heuristic, reflective, or skillopt for skills).
- optimization.min_apply_delta: minimum score gain required to apply.
- optimization.max_metric_calls: legacy GEPA metric budget.
- optimization.reflection_model: legacy GEPA reflection model.
- optimization.skillopt_train_ratio: task evidence fraction used for skill candidate proposal.
- optimization.skillopt_edit_budget: maximum line edit operations allowed for SkillOpt candidates.
- optimization.skillopt_validation_delta: minimum held-out validation gain required for SkillOpt acceptance.
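To make the interaction between the include globs and exclude_globs concrete, here is a rough sketch of target discovery. It assumes include patterns are expanded with pathlib and excludes are matched against repo-relative POSIX paths with fnmatch; CodexOpt's real matching rules may differ, and `discover` is a hypothetical helper name.

```python
from fnmatch import fnmatch
from pathlib import Path


def discover(root: Path, include: list[str], exclude: list[str]) -> list[str]:
    """Return sorted repo-relative paths that match an include glob and no exclude glob."""
    found: set[str] = set()
    for pattern in include:
        for path in root.glob(pattern):
            if not path.is_file():
                continue
            rel = path.relative_to(root).as_posix()
            # fnmatch's "*" also matches "/", so "node_modules/**" covers nested paths.
            if any(fnmatch(rel, skip) for skip in exclude):
                continue
            found.add(rel)
    return sorted(found)
```

Under these assumptions, calling `discover(Path("."), ["AGENTS.md", "**/AGENTS.md"], ["node_modules/**"])` would list the root and nested AGENTS.md files while skipping any copy vendored under node_modules/.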
CodexOpt computes a 0.0 to 1.0 score per file.
AGENTS scoring factors include:
- Too short or too long content penalties.
- Token-heaviness estimate penalty.
- Empty file penalty.
- Contradictory guidance penalties.
- Missing workflow / verification / output-format guidance penalties.
- Repo-context and task-alignment signals when evidence files are configured.
SKILL scoring factors include:
- Missing frontmatter penalties.
- Missing name/description penalties.
- Overly long frontmatter fields penalties.
- Too short or too long content penalties.
- Weak trigger/workflow/verification guidance penalties.
- Repo task alignment signals when evidence files are configured.
Each benchmarked file also includes:
- criterion-level sub-scores
- natural-language feedback
- optional evidence summary from configured task/issue files
Candidate transforms include:
- Whitespace normalization.
- Blank-line compaction.
- Duplicate adjacent line removal.
- Skill-specific frontmatter synthesis/trimming.
The best candidate is selected by score delta. If the delta is below min_apply_delta, the original content is kept.
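The transforms and the selection rule above can be approximated as follows. This is a sketch under assumptions: the function names are hypothetical, the exact normalization rules are guesses, and `score` stands in for the benchmark scorer. Only the transform names and the min_apply_delta threshold come from the text.

```python
import re
from typing import Callable, Iterable


def normalize_whitespace(text: str) -> str:
    """Strip trailing spaces per line, drop leading/trailing blank lines, end with one newline."""
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(lines).strip("\n") + "\n"


def compact_blank_lines(text: str) -> str:
    """Collapse runs of blank lines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


def drop_adjacent_duplicates(text: str) -> str:
    """Remove a non-blank line when it exactly repeats the line before it."""
    kept: list[str] = []
    for line in text.splitlines():
        if line.strip() and kept and kept[-1] == line:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


def heuristic_candidate(text: str) -> str:
    """Apply the three generic transforms in sequence."""
    return drop_adjacent_duplicates(compact_blank_lines(normalize_whitespace(text)))


def pick_best(
    original: str,
    candidates: Iterable[str],
    score: Callable[[str], float],
    min_apply_delta: float = 0.01,
) -> str:
    """Return the top-scoring candidate, or the original if the gain is too small."""
    base = score(original)
    best, best_score = original, base
    for candidate in candidates:
        value = score(candidate)
        if value > best_score:
            best, best_score = candidate, value
    return best if best_score - base >= min_apply_delta else original
```

Keeping the threshold check separate from the transforms means a cosmetic rewrite that barely moves the score never replaces the file on disk.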
The maintained SkillOpt/GEPA-inspired path is --engine reflective, or the
Codex-user shortcut codexopt improve. It evaluates a candidate document on
tasks, captures textual feedback, asks an optimizer model to rewrite the
document, and accepts the rewrite only when it improves held-out validation
tasks.
Defaults stay offline and use static/verifier signals. To run the full live Codex loop, use:
codexopt improve --live
--live uses codex exec as both optimizer and judge. You can also set
reflective.optimizer_model and reflective.judge_model to codex,
openai/<model>, or another OpenAI-compatible model.
--engine gepa is deprecated. It targeted an older gepa.optimize_anything
API and now falls back with a clear warning. Use --engine reflective instead.
For SkillOpt-style skill optimization:
optimization:
  engine: "skillopt"
  reflection_model: "openai/gpt-5-mini" # optional; without it, heuristic proposers are used
  skillopt_train_ratio: 0.67
  skillopt_edit_budget: 24
  skillopt_validation_delta: 0.01
Executable rollout task files can be listed in evidence.task_files:
[
  {
    "name": "skill-verifier",
    "description": "Run a repo-local verifier against the candidate skill.",
    "command": ["python", "scripts/verify_skill.py"],
    "timeout_seconds": 30
  }
]
Codex-backed rollout tasks can use backend: "codex" and codex_prompt:
[
  {
    "name": "codex-skill-task",
    "backend": "codex",
    "description": "Run Codex against the candidate skill.",
    "codex_prompt": "Use the local skill to update CHANGELOG.md for a patch release.",
    "timeout_seconds": 120,
    "expected_final_response_contains": "CHANGELOG.md",
    "expected_file_change": "CHANGELOG.md",
    "expected_file_contains": {
      "path": "CHANGELOG.md",
      "contains": "Patch"
    }
  }
]
CodexOpt evaluates those commands in a temporary copy of the repo with the candidate SKILL.md
written in place, then records pass/fail details in optimize.json. For Codex-backed rollouts,
CodexOpt also parses codex exec --json events into trajectory metadata: final response,
commands, file changes, token usage, and errors.
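A command-based rollout of this kind might look roughly like the sketch below. It is illustrative only: `run_rollout` and the shape of the returned dictionary are hypothetical, and treating exit code 0 as a pass is an assumption rather than documented CodexOpt behavior. The temporary copy, the in-place candidate, and the timeout follow the description above.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_rollout(
    repo: Path,
    skill_rel: str,
    candidate: str,
    command: list[str],
    timeout_seconds: float = 30,
) -> dict:
    """Run `command` in a throwaway copy of `repo` with the candidate skill written in place."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "repo"
        shutil.copytree(repo, work)
        target = work / skill_rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(candidate, encoding="utf-8")
        try:
            proc = subprocess.run(
                command,
                cwd=work,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "reason": "timeout"}
        # Assumption: exit code 0 means the verifier passed.
        return {"passed": proc.returncode == 0, "returncode": proc.returncode}
```

Because the command runs against a copy, a failing or destructive verifier cannot modify the working tree that holds the original SKILL.md.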
For OpenAI-compatible reflective models, set the provider credentials and use
reflective.optimizer_model / reflective.judge_model values such as
openai/gpt-5-mini:
export OPENAI_API_KEY="your-openai-key"
For Gemini-compatible endpoints, set the credentials expected by your OpenAI-compatible
client or run through codexopt improve --live to use codex exec directly.
export GEMINI_API_KEY="your-gemini-key"
export GOOGLE_API_KEY="$GEMINI_API_KEY"
Fallback behavior:
- If a configured optimizer or judge model is unavailable, CodexOpt records a note and falls back to the weaker heuristic/static path.
- Fallbacks are recorded in optimization artifacts, CLI summaries, and reports.
By default, everything is written under .codexopt/:
- runs/<run_id>/scan.json
- runs/<run_id>/benchmark.json
- runs/<run_id>/optimize.json
- runs/<run_id>/apply.json
- backups/<timestamp>/... (created on non-dry-run apply)
- state.json (tracks latest run ids per command type)
Run ids are timestamped and namespaced by command kind, for example:
- 20260308T184800123456Z-benchmark
- 20260308T184812654321Z-optimize-skills
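The example ids are consistent with a UTC timestamp at microsecond precision followed by the command kind. A small sketch that reproduces that shape is shown here; `make_run_id` is a hypothetical helper, and the format string is inferred from the examples rather than taken from CodexOpt's source.

```python
from datetime import datetime, timezone
from typing import Optional


def make_run_id(kind: str, now: Optional[datetime] = None) -> str:
    """Build an id like 20260308T184800123456Z-benchmark from a timezone-aware datetime."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # %f is the zero-padded six-digit microsecond field.
    return f"{moment:%Y%m%dT%H%M%S%f}Z-{kind}"
```

Ids built this way have a fixed-width timestamp prefix, so sorting them as plain strings also sorts them chronologically.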
- Commit current AGENTS.md and skills.
- Run scan and benchmark to establish baseline.
- Run optimize agents and/or optimize skills.
- Review optimize.json and diffs.
- Run apply --dry-run first, then apply.
- Run report and attach report to PR.
Before (AGENTS.md):
## Coding Rules
Always run tests before commit.
Always run tests before commit.
Keep changes minimal.
After optimization (heuristic):
## Coding Rules
Always run tests before commit.
Keep changes minimal.
What changed:
- Removed duplicate adjacent line.
- Compacted extra blank lines.
Before (.codex/skills/my_skill/SKILL.md):
Use this skill for repository release checks.
Run lint, tests, and changelog validation.
After optimization (heuristic):
---
name: my-skill
description: Repository-specific workflow skill.
---
Use this skill for repository release checks.
Run lint, tests, and changelog validation.
What changed:
- Added required frontmatter block.
- Generated normalized name from folder name.
- Added default description.
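The frontmatter synthesis in this example could be sketched as below. The helper names and the slug rule (lowercase, non-alphanumeric runs become hyphens) are assumptions chosen to reproduce the my_skill to my-skill result shown above; the default description string is copied from the example output.

```python
import re
from pathlib import PurePosixPath

DEFAULT_DESCRIPTION = "Repository-specific workflow skill."


def normalize_name(folder: str) -> str:
    """Lowercase the folder name and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", folder.lower()).strip("-")


def ensure_frontmatter(skill_path: str, content: str) -> str:
    """Prepend a minimal frontmatter block unless the file already starts with one."""
    if content.lstrip().startswith("---"):
        return content
    name = normalize_name(PurePosixPath(skill_path).parent.name)
    header = f"---\nname: {name}\ndescription: {DEFAULT_DESCRIPTION}\n---\n"
    return header + content
```

Returning existing frontmatter untouched keeps the transform idempotent, so running it twice produces the same file as running it once.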
uv run codexopt init
uv run codexopt scan
uv run codexopt benchmark
uv run codexopt optimize agents --file AGENTS.md
uv run codexopt optimize skills --glob ".codex/skills/**/SKILL.md"
uv run codexopt apply --kind skills --dry-run
uv run codexopt apply --kind skills
uv run codexopt report --output codexopt-report.md
Files to inspect after running:
- .codexopt/runs/*/scan.json
- .codexopt/runs/*/benchmark.json
- .codexopt/runs/*/optimize.json
- .codexopt/runs/*/apply.json
- .codexopt/backups/*
GitHub Actions workflow is included at .github/workflows/ci.yml and runs:
- uv lock --check for lockfile consistency.
- uv sync --extra dev for environment setup.
- Ruff lint checks.
- Pytest tests.
- Package build (uv build).
It does not publish packages.
uv lock
uv sync --extra dev
uv run --no-sync ruff check src tests
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --no-sync pytest -q
uv build
Cause:
- No prior optimization run for the selected kind.
- state.json does not contain the expected latest run pointer.
Fix:
uv run codexopt optimize agents
uv run codexopt apply --kind agents
Or pass an explicit run:
uv run codexopt apply --kind agents --run-id <run_id>
Cause:
- The legacy GEPA engine targeted an older gepa.optimize_anything API.
Behavior:
- CodexOpt falls back to heuristic optimization and records the deprecation reason.
Fix:
uv run codexopt optimize agents --engine reflective
uv run codexopt improve --live
Expected behavior:
- --dry-run reports candidate applications without writing files.
To write changes, run again without --dry-run:
uv run codexopt apply --kind agents
If your environment blocks dependency resolution in isolated builds, use:
uv build
Some environments auto-load global pytest plugins that can break local tests. Run with plugin autoload disabled:
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --no-sync pytest -q
Cause:
- Best candidate delta is below optimization.min_apply_delta, or
- File content is already equivalent.
Fix:
- Lower optimization.min_apply_delta in codexopt.yaml, then re-run optimize/apply.
MIT. See LICENSE.
- Shashi (shashi@super-agentic.ai)
