ClawForge is an interactive benchmark for CLI-style agents. It measures whether an agent can inspect persistent workflow state, issue commands step by step, and complete tasks through correct state changes and observable side effects.
Quick Start · Evaluation · Execution Modes · OpenClaw Setup · Task Generation · Docs Index
ClawForge is for teams that need to evaluate execution-time behavior rather than a single final answer. The current flagship family, hard_decision_workflow, focuses on partial state, branch selection, state repair, replacement, duplicate avoidance, and workflow closure.
The checked-in release snapshot currently includes:
- 17 hard scenarios
- 362 hard tasks in hard_decision_workflow
- 1616 total tasks across the broader repository snapshot
- default split files for train, dev, and test
These are documented release counts, not fixed generator ceilings. The underlying benchmark is generator-backed, and release profiles can be regenerated with different per-scenario counts while keeping task semantics and evaluator contracts stable.
The public repository surface is organized around runnable benchmark code rather than paper artifacts.
- openclaw_env/: environment, task schema, generators, evaluation logic, and checked-in benchmark data
- examples/: primary evaluation entrypoint
- scripts/: task generation and maintainer utilities
- tests/: regression coverage for generation, runtime, and CLI behavior
- docs/: user-facing and maintainer-facing documentation
See docs/release-snapshot.md for what is treated as a checked-in release artifact versus what can be regenerated locally.
Install from the repository root:
pip install -e .
pip install -e ".[dev]"
Generate the benchmark snapshot:
python scripts/generate_tasks.py
Run a first hard-benchmark evaluation:
python examples/train_and_eval.py \
--agent llm \
--llm-provider openai \
--llm-base-url https://api.example.com/v1 \
--model Kimi-K2.5 \
--task-prefix hard_decision_workflow_ \
--split test \
--mode multi \
--max-steps 20 \
--llm-max-tokens 192 \
-v
The benchmark-facing protocol above is intentional. The CLI default step budget is lower (15), so benchmark comparisons should set --max-steps explicitly.
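If you launch evaluations from a script, pinning the step budget in code is one way to keep runs comparable. The sketch below only assembles the argument list for the command shown above. The helper name `build_eval_command` is hypothetical and not part of the repository; the flags and values are the ones documented in this README.

```python
import shlex


def build_eval_command(model, base_url, max_steps=20, split="test", mode="multi"):
    """Return the argv list for the quick-start evaluation command.

    Hypothetical helper: it only mirrors the flags documented in this README
    and always passes --max-steps explicitly instead of relying on the CLI default.
    """
    return [
        "python", "examples/train_and_eval.py",
        "--agent", "llm",
        "--llm-provider", "openai",
        "--llm-base-url", base_url,
        "--model", model,
        "--task-prefix", "hard_decision_workflow_",
        "--split", split,
        "--mode", mode,
        "--max-steps", str(max_steps),
        "--llm-max-tokens", "192",
        "-v",
    ]


argv = build_eval_command("Kimi-K2.5", "https://api.example.com/v1")
# Print a shell-quoted form for logging or copy-paste.
print(shlex.join(argv))
```

The list can be passed to `subprocess.run(argv, check=True)` from the repository root.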
For a longer first-run path, LiteLLM-backed Claude examples, and structured report flags, see docs/quickstart.md.
ClawForge supports four execution modes:
- mock: narrow in-process simulation for tests and simple backend checks
- multi: routed local execution across the benchmark's app-family backends; default benchmark mode
- real: real openclaw CLI subprocess path for openclaw * commands
- hybrid: live OpenClaw gateway plus the routed local skill stack
multi is the default public benchmark mode because it preserves interactive state across command families while remaining reproducible and inexpensive to run.
See docs/execution-modes.md for exact backend behavior and docs/openclaw-setup.md for real and hybrid prerequisites.
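A wrapper script may want to validate the mode name and warn early when a mode needs the extra OpenClaw runtime. The following is a minimal sketch under that assumption. `ExecutionMode` and `needs_openclaw_setup` are illustrative names, not identifiers from the ClawForge codebase.

```python
from enum import Enum


class ExecutionMode(str, Enum):
    """The four mode names accepted by --mode, as listed in this README."""
    MOCK = "mock"
    MULTI = "multi"
    REAL = "real"
    HYBRID = "hybrid"


# Per this README, only real and hybrid require extra OpenClaw runtime setup.
NEEDS_OPENCLAW_RUNTIME = {ExecutionMode.REAL, ExecutionMode.HYBRID}


def needs_openclaw_setup(mode: str) -> bool:
    """Return True if the mode needs OpenClaw prerequisites.

    Raises ValueError for a mode name that is not one of the four above.
    """
    return ExecutionMode(mode) in NEEDS_OPENCLAW_RUNTIME


for name in ("mock", "multi", "real", "hybrid"):
    print(name, needs_openclaw_setup(name))
```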
The primary evaluation entrypoint is:
python examples/train_and_eval.py
Typical outputs include:
- full-pass accuracy
- partial-credit average score
- scenario-level and ability-level summaries
- provider-aware fields such as provider_failures and provider_impacted_tasks
See docs/evaluation.md for common commands, history modes, and output interpretation.
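To make the difference between the first two metrics concrete, here is a small sketch. It assumes each task yields a score between 0.0 and 1.0 and that a full pass means a score of 1.0. The record layout and the `summarize` function are illustrative assumptions, not the evaluator's actual schema.

```python
from collections import defaultdict


def summarize(results):
    """Compute full-pass accuracy, partial-credit average, and per-scenario means.

    `results` is a list of dicts with hypothetical keys 'scenario' and 'score'.
    """
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    # Full-pass accuracy counts only tasks that earned the maximum score.
    full_pass = sum(1 for r in results if r["score"] >= 1.0)
    by_scenario = defaultdict(list)
    for r in results:
        by_scenario[r["scenario"]].append(r["score"])
    return {
        "full_pass_accuracy": full_pass / n,
        # The partial-credit average also rewards incomplete progress.
        "partial_credit_avg": sum(r["score"] for r in results) / n,
        "by_scenario": {k: sum(v) / len(v) for k, v in by_scenario.items()},
    }


results = [
    {"scenario": "a", "score": 1.0},
    {"scenario": "a", "score": 0.5},
    {"scenario": "b", "score": 1.0},
    {"scenario": "b", "score": 0.0},
]
summary = summarize(results)
print(summary["full_pass_accuracy"])  # 0.5
print(summary["partial_credit_avg"])  # 0.625
```

The two numbers diverge whenever an agent makes partial progress, which is why it helps to read them together.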
ClawForge benchmark artifacts are generator-backed. The standard regeneration command is:
python scripts/generate_tasks.py
Generated outputs are written under openclaw_env/data/{tasks,datasets} unless --output-data-dir is used.
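The sketch below shows one way to script regeneration into a scratch directory so the checked-in data stays untouched. Both helpers are hypothetical. The sketch also assumes the override directory keeps the same tasks/ and datasets/ subdirectories as the default location, which this README does not confirm; see docs/task-generation.md for the actual behavior.

```python
from pathlib import Path

# Default output root documented in this README.
DEFAULT_DATA_DIR = Path("openclaw_env/data")


def generate_command(output_data_dir=None):
    """Build the argv for the regeneration script (hypothetical helper)."""
    argv = ["python", "scripts/generate_tasks.py"]
    if output_data_dir is not None:
        argv += ["--output-data-dir", str(output_data_dir)]
    return argv


def expected_output_dirs(output_data_dir=None):
    """Return the (tasks, datasets) directories the run is expected to write.

    Assumption: an override directory mirrors the default tasks/ and datasets/ layout.
    """
    base = Path(output_data_dir) if output_data_dir is not None else DEFAULT_DATA_DIR
    return base / "tasks", base / "datasets"


print(generate_command("scratch/data"))
print([p.as_posix() for p in expected_output_dirs()])
```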
The current hard release is one official profile rather than a hard-coded maximum. You can adjust the shared hard-scenario count or override individual scenarios without changing the underlying benchmark semantics.
See docs/task-generation.md for generation flags and docs/hard-benchmark.md for the current scenario inventory.
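The idea of a shared per-scenario count with individual overrides can be sketched as follows. This illustrates the concept only: `resolve_profile`, the scenario slugs, and the counts are all made up for the example and do not reflect the generator's real flags or the official profile.

```python
def resolve_profile(scenarios, shared_count, overrides=None):
    """Map each scenario to its task count: the override if given, else the shared count."""
    overrides = overrides or {}
    unknown = set(overrides) - set(scenarios)
    if unknown:
        # Reject overrides for scenarios that are not part of the profile.
        raise ValueError(f"unknown scenarios: {sorted(unknown)}")
    return {name: overrides.get(name, shared_count) for name in scenarios}


# Hypothetical slugs and counts, chosen only for illustration.
scenarios = ["scenario_a", "scenario_b", "scenario_c"]
profile = resolve_profile(scenarios, shared_count=20, overrides={"scenario_b": 25})
total = sum(profile.values())
print(profile)
print(total)  # 65
```

Changing the counts changes the size of a release profile, not what each task asks or how it is scored.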
hard_decision_workflow is the most benchmark-focused suite in the repository. It is built around six recurring ability buckets:
- duplicate_avoidance
- gap_completion
- information_transfer
- multi_source_reasoning
- state_repair
- workflow_completion
For the current official hard profile, scenario slugs, and evaluator framing, see docs/hard-benchmark.md.
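An ability-level summary over the six buckets might be computed along these lines. The sketch assumes each task record carries a list of ability tags and a score between 0.0 and 1.0. That record shape and the `ability_summary` function are assumptions for illustration; only the bucket names come from this README.

```python
# The six ability buckets named in this README.
ABILITY_BUCKETS = (
    "duplicate_avoidance",
    "gap_completion",
    "information_transfer",
    "multi_source_reasoning",
    "state_repair",
    "workflow_completion",
)


def ability_summary(results):
    """Average score per ability bucket; None for buckets with no tagged tasks.

    `results` is a list of dicts with hypothetical keys 'abilities' and 'score'.
    """
    scores = {bucket: [] for bucket in ABILITY_BUCKETS}
    for r in results:
        for ability in r["abilities"]:
            if ability not in scores:
                raise ValueError(f"unknown ability bucket: {ability}")
            scores[ability].append(r["score"])
    return {b: (sum(s) / len(s) if s else None) for b, s in scores.items()}


results = [
    {"abilities": ["state_repair", "workflow_completion"], "score": 1.0},
    {"abilities": ["state_repair"], "score": 0.0},
]
by_ability = ability_summary(results)
print(by_ability["state_repair"])         # 0.5
print(by_ability["workflow_completion"])  # 1.0
```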
This repository includes a documented release snapshot of benchmark results. It is not a live leaderboard and should be read together with the benchmark configuration, execution mode, and provider caveats described in the docs.
Additional benchmark composition and result-context figures are documented in docs/results.md.
ClawForge is not a generic shell benchmark and not a continuously refreshed model leaderboard.
- The checked-in release snapshot reflects one documented data profile, not every possible generator configuration.
- multi is the default benchmark mode; real and hybrid require extra OpenClaw runtime setup.
- This repository hardening pass does not refresh or regenerate the current dirty openclaw_env/data/... worktree contents.
- Interactive results can still be affected by provider behavior such as retries, filtered responses, or endpoint instability.
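Because provider instability can depress scores, it can help to report accuracy both overall and on tasks that saw no provider failures. The sketch below assumes a per-task record with a boolean `passed` and an integer `provider_failures` count. That shape is a guess made for illustration; the real report format is described in docs/evaluation.md.

```python
def accuracy_with_provider_context(records):
    """Report accuracy overall and on tasks without provider failures.

    `records` is a list of dicts with hypothetical keys 'passed' and 'provider_failures'.
    """
    if not records:
        raise ValueError("no records")

    def accuracy(rows):
        # None signals that there is nothing to average over.
        return sum(1 for r in rows if r["passed"]) / len(rows) if rows else None

    impacted = [r for r in records if r.get("provider_failures", 0) > 0]
    clean = [r for r in records if r.get("provider_failures", 0) == 0]
    return {
        "accuracy_all": accuracy(records),
        "accuracy_unimpacted": accuracy(clean),
        "provider_impacted_tasks": len(impacted),
    }


records = [
    {"passed": True, "provider_failures": 0},
    {"passed": False, "provider_failures": 0},
    {"passed": False, "provider_failures": 2},
    {"passed": True, "provider_failures": 0},
]
report = accuracy_with_provider_context(records)
print(report["accuracy_all"])             # 0.5
print(report["provider_impacted_tasks"])  # 1
```

A large gap between the two accuracy figures suggests the run should be repeated before comparing models.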
Start with docs/README.md for goal-oriented routing:
- Run the benchmark:
- Understand the hard suite:
- Set up OpenClaw-backed modes:
- Regenerate release artifacts:
- Develop locally:
If you use ClawForge, cite both the repository and the accompanying paper.
@misc{lai2026clawforgegeneratingexecutableinteractive,
title={ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents},
author={Yuxiang Lai and Peng Xia and Haonian Ji and Kaiwen Xiong and Kaide Zeng and Jiaqi Liu and Fang Wu and Jike Zhong and Zeyu Zheng and Cihang Xie and Huaxiu Yao},
year={2026},
eprint={2605.14133},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.14133},
}
This repository is released under the MIT License.



