Most agent leaderboards report a single number. We don't think that's honest.
Live Leaderboard · Methodology · Findings · Roadmap · Contributing
Three things that surprised us when we benchmarked Claude Code and Gemini CLI on 10 real-world tasks:
1 · "Tied overall" hides 7× per-axis gaps. Overall scores look close (Claude 0.63, Gemini 0.52). But on Tool Use, Claude scores 7× higher (0.35 vs 0.05), while on Multi-Step, Gemini is slightly ahead (0.77 vs 0.74). The average hides both. Pick the agent for the task, not the leaderboard.
2 · Code tasks are commodity now.
Both agents score 1.00 on every code task (code-001, code-002). "Which AI writes code better" is the wrong question in 2026. The interesting differences live in tool calling, research, and multi-step orchestration.
3 · Re-running the same agent on the same task gives wildly different scores.
Claude Code on tool-001: 0.0 in trial 1, 0.7 in trial 2. Same agent, same task, same prompt. Most agent leaderboards quietly publish a single trial. We think that's misleading.
→ Full data, methodology, and per-task breakdown: docs/findings.md
10 tasks across 5 domains · Docker sandbox · Auto-eval (pytest) + LLM-as-Judge
| Domain | Claude Code | Gemini CLI | Gap | Verdict |
|---|---|---|---|---|
| Code | 1.00 | 1.00 | 1.0× | Tied — code is solved |
| Data | 0.49 | 0.32 | 1.5× | Claude leads |
| Multi-Step | 0.74 | 0.77 | 1.0× | Gemini slight edge |
| Research | 0.70 | 0.45 | 1.6× | Claude clearly leads |
| Tool Use | 0.35 | 0.05 | 7.0× | Claude dominates |
| Overall | 0.63 | 0.52 | 1.2× | — |
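The Gap column can be reproduced from the two score columns. The sketch below assumes Gap means the higher score divided by the lower one, rounded to one decimal; the README does not state the definition, but this reading matches every domain row in the table. The names `SCORES` and `gap` are illustrative and not part of the `agentbench` package.

```python
# Per-domain scores copied from the table above: (Claude Code, Gemini CLI).
SCORES = {
    "Code": (1.00, 1.00),
    "Data": (0.49, 0.32),
    "Multi-Step": (0.74, 0.77),
    "Research": (0.70, 0.45),
    "Tool Use": (0.35, 0.05),
}


def gap(a: float, b: float) -> float:
    """Ratio of the higher score to the lower one, rounded to one decimal.

    Assumed definition of the Gap column; not taken from the project's code.
    """
    hi, lo = max(a, b), min(a, b)
    if lo == 0:
        return float("inf")  # a zero score makes the ratio unbounded
    return round(hi / lo, 1)


gaps = {domain: gap(claude, gemini) for domain, (claude, gemini) in SCORES.items()}

for domain, ratio in gaps.items():
    print(f"{domain:<11} {ratio:.1f}x")
```

A ratio says nothing about which agent leads, which is why the table carries a separate Verdict column: Multi-Step and Code both round to 1.0×, yet only Code is an exact tie.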
| Agent | Run 1 | Run 2 | Run 3 | Spread |
|---|---|---|---|---|
| Claude Code | 0.604 | 0.656 | — | ±5% |
| Gemini CLI | 0.516 | 0.516 | 0.518 | ±0.4% |
Single-trial benchmarks ignore this. We won't.
| Agent | Adapter | Status |
|---|---|---|
| Claude Code | claude-code | ✅ Benchmarked |
| Gemini CLI | gemini-cli | ✅ Benchmarked |
| Codex CLI | codex-cli | 🔄 Adapter ready, multi-trial run pending |
| Aider | aider | 🔄 Adapter ready, multi-trial run pending |
Most agent benchmarks (SWE-bench, PinchBench, ClawProBench, OSWorld) optimize for headline numbers. We optimize for honest numbers.
| | Most leaderboards | AgentBench-Live |
|---|---|---|
| Trials per task | 1 | ≥3 (target: 5) |
| Reports variance | ❌ | ✅ min / max / median |
| Reports cost | ❌ | 🔄 v0.3 |
| Reports latency | ❌ | 🔄 v0.3 |
| Sandbox | tempdir / mocks | Docker (real isolation) |
| Adapter overhead | ~hundreds of LOC | ~15 lines of Python |
| Open source | varies | MIT, every line |
This is the entire Claude Code adapter. Yours looks the same:

    from agentbench.adapters.base import AgentAdapter
    from agentbench.adapters.registry import register_adapter


    @register_adapter
    class YourAgentAdapter(AgentAdapter):
        name = "your-agent"
        cli_command = "your-cli"
        api_key_env_var = "YOUR_API_KEY"

        def _build_command(self, prompt: str) -> list[str]:
            return ["your-cli", "--run", prompt]

Submit a PR with your adapter. Your agent joins the leaderboard.

    # Clone and install
    git clone https://github.com/jackjin1997/AgentBench-Live.git
    cd AgentBench-Live
    pip install -e ".[dev]"

    # Run the full benchmark with multiple trials
    agentbench run --agent claude-code --tasks all --trials 3

    # Compare two agents on a single domain
    agentbench run --agents claude-code,gemini-cli --domain tool-use --trials 5

    # View results with variance
    agentbench leaderboard --show-variance

    # Generate a shareable comparison card
    agentbench social-card --output comparison.png

Task (YAML) → Docker Sandbox → Agent Execution × N trials → Auto-Eval + LLM Judge → Score (mean ± stdev)
- Task — A structured YAML challenge with inputs, environment setup, and expected outcomes
- Sandbox — Docker prepares an isolated workspace. Falls back to local tempdir if Docker is unavailable
- Agent — Receives the prompt and works autonomously inside the sandbox; runs N independent trials
- Evaluator — Scores output using pytest (code tasks), heuristics, or LLM-as-Judge
- Aggregator — Reports mean, median, min, max, and pass@k across trials
- Leaderboard — Per-domain scores with variance bars, published to GitHub Pages
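The aggregation step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's aggregator: the function names are invented, the 0.5 pass threshold is a placeholder, and pass@k uses the common combinatorial estimator, which may differ from what AgentBench-Live computes. The sample scores are the two tool-001 trials quoted earlier (0.0 and 0.7).

```python
import math
import statistics


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn from n passes, given c passes."""
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of trials")
    if n - c < k:
        return 1.0  # too few failures to fill a sample of k with failures only
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def aggregate(scores: list[float], pass_threshold: float = 0.5, k: int = 1) -> dict[str, float]:
    """Summarize the trial scores of one agent on one task.

    `pass_threshold` is an assumed cutoff for counting a trial as a pass.
    """
    passes = sum(score >= pass_threshold for score in scores)
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
        # Sample standard deviation needs at least two trials.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        f"pass@{k}": pass_at_k(len(scores), passes, k),
    }


summary = aggregate([0.0, 0.7])
for key, value in summary.items():
    print(f"{key:<7} {value:.3f}")
```

With only two trials the standard deviation is nearly as large as the mean, which is the kind of spread a single published trial cannot show.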
| Domain | What We Test | How We Score |
|---|---|---|
| Code | Bug fixes, feature implementation, refactoring | pytest pass rate |
| Data | CSV/JSON analysis, insight generation | Accuracy + insight quality |
| Multi-step | Complex workflows across multiple tools | End-to-end success |
| Research | Technical investigation, comparison reports | LLM-as-Judge |
| Tool Use | API calls, CLI tools, file operations | Success rate |
See the full methodology for details on task design, scoring, and reproducibility.
We'd love help adding adapters for these agents. Each one is ~15 lines of Python:
- Cursor — IDE-native agent
- Windsurf — Codeium's agent
- OpenHands — open-source autonomous dev
- Devin — Cognition's autonomous engineer
- Your agent — see contributing guide
Open an issue to claim one.

    agentbench-live/
    ├── src/agentbench/
    │   ├── adapters/       # Agent adapters (template method pattern)
    │   ├── evaluator/      # Auto-eval, LLM judge, composite scoring
    │   ├── sandbox.py      # Docker + local sandbox (SandboxFactory)
    │   ├── runner.py       # Benchmark orchestrator
    │   └── cli.py          # CLI entry point (click)
    ├── tasks/              # 10 benchmark tasks across 5 domains (YAML)
    ├── leaderboard/        # Static frontend (GitHub Pages)
    ├── docs/               # Methodology, findings, guides
    └── tests/              # 183 tests, 90% coverage

- Add an agent — Write an adapter (~15 lines), submit a PR
- Add tasks — Submit new benchmark tasks (task authoring guide)
- Improve scoring — Better heuristics, judges, evaluation methods
- Run benchmarks — Run existing agents and submit results
See CONTRIBUTING.md for the full guide.
MIT — every line, no asterisks.
The best way to improve agents is to measure them honestly — variance and all.