AgentBench-Live

We re-ran the same coding agent twice. The score went from 0.0 to 0.7.

Most agent leaderboards report a single number. We don't think that's honest.

Live Leaderboard · Methodology · Findings · Roadmap · Contributing

What we found

Three things that surprised us when we benchmarked Claude Code and Gemini CLI on 10 real-world tasks:

1 · "Tied overall" hides 7× per-axis gaps. Overall scores look close (Claude 0.63, Gemini 0.52). But on Tool Use, Claude is 7× better. On Multi-Step, Gemini is slightly ahead. The average lies. Pick the agent for the task, not the leaderboard.

2 · Code tasks are commodity now. Both agents score 1.00 on every code task (code-001, code-002). "Which AI writes code better" is the wrong question in 2026. The interesting differences live in tool calling, research, and multi-step orchestration.

3 · Re-running the same agent on the same task gives wildly different scores. Claude Code on tool-001: 0.0 in trial 1, 0.7 in trial 2. Same agent, same task, same prompt. Most agent leaderboards quietly publish a single trial. We think that's misleading.

→ Full data, methodology, and per-task breakdown: docs/findings.md

Leaderboard

10 tasks across 5 domains · Docker sandbox · Auto-eval (pytest) + LLM-as-Judge

Per-domain (where the real differences live)

Domain	Claude Code	Gemini CLI	Gap	Verdict
Code	1.00	1.00	1.0×	Tied — code is solved
Data	0.49	0.32	1.5×	Claude leads
Multi-Step	0.74	0.77	1.0×	Gemini slight edge
Research	0.70	0.45	1.6×	Claude clearly leads
Tool Use	0.35	0.05	7.0×	Claude dominates
Overall	0.63	0.52	1.2×	—

Variance (the part nobody publishes)

Agent	Run 1	Run 2	Run 3	Spread
Claude Code	0.604	0.656	—	±5%
Gemini CLI	0.516	0.516	0.518	±0.4%

Single-trial benchmarks ignore this. We won't.

Agent	Adapter	Status
Claude Code	`claude-code`	✅ Benchmarked
Gemini CLI	`gemini-cli`	✅ Benchmarked
Codex CLI	`codex-cli`	🔄 Adapter ready, multi-trial run pending
Aider	`aider`	🔄 Adapter ready, multi-trial run pending

What makes us different

Most agent benchmarks (SWE-bench, PinchBench, ClawProBench, OSWorld) optimize for headline numbers. We optimize for honest numbers.

	Most leaderboards	AgentBench-Live
Trials per task	1	≥3 (target: 5)
Reports variance	❌	✅ min / max / median
Reports cost	❌	🔄 v0.3
Reports latency	❌	🔄 v0.3
Sandbox	tempdir / mocks	Docker (real isolation)
Adapter overhead	~hundreds of LOC	~15 lines of Python
Open source	varies	MIT, every line

Add Your Agent in 15 Lines

This is the entire Claude Code adapter. Yours looks the same:

from agentbench.adapters.base import AgentAdapter
from agentbench.adapters.registry import register_adapter

@register_adapter
class YourAgentAdapter(AgentAdapter):
    name = "your-agent"
    cli_command = "your-cli"
    api_key_env_var = "YOUR_API_KEY"

    def _build_command(self, prompt: str) -> list[str]:
        return ["your-cli", "--run", prompt]

Submit a PR with your adapter. Your agent joins the leaderboard.

Quick Start

# Clone and install
git clone https://github.com/jackjin1997/AgentBench-Live.git
cd AgentBench-Live
pip install -e ".[dev]"

# Run the full benchmark with multiple trials
agentbench run --agent claude-code --tasks all --trials 3

# Compare two agents on a single domain
agentbench run --agents claude-code,gemini-cli --domain tool-use --trials 5

# View results with variance
agentbench leaderboard --show-variance

# Generate a shareable comparison card
agentbench social-card --output comparison.png

How It Works

Task (YAML) → Docker Sandbox → Agent Execution × N trials → Auto-Eval + LLM Judge → Score (mean ± stdev)

Task — A structured YAML challenge with inputs, environment setup, and expected outcomes
Sandbox — Docker prepares an isolated workspace. Falls back to local tempdir if Docker is unavailable
Agent — Receives the prompt and works autonomously inside the sandbox; runs N independent trials
Evaluator — Scores output using pytest (code tasks), heuristics, or LLM-as-Judge
Aggregator — Reports mean, median, min, max, and pass@k across trials
Leaderboard — Per-domain scores with variance bars, published to GitHub Pages

Capability Domains

Domain	What We Test	How We Score
Code	Bug fixes, feature implementation, refactoring	pytest pass rate
Data	CSV/JSON analysis, insight generation	Accuracy + insight quality
Multi-step	Complex workflows across multiple tools	End-to-end success
Research	Technical investigation, comparison reports	LLM-as-Judge
Tool Use	API calls, CLI tools, file operations	Success rate

See the full methodology for details on task design, scoring, and reproducibility.

Wanted: Community Adapters

We'd love help adding adapters for these agents. Each one is ~15 lines of Python:

Cursor — IDE-native agent
Windsurf — Codeium's agent
OpenHands — open-source autonomous dev
Devin — Cognition's autonomous engineer
Your agent — see contributing guide

Open an issue to claim one.

Architecture

agentbench-live/
├── src/agentbench/
│   ├── adapters/       # Agent adapters (template method pattern)
│   ├── evaluator/      # Auto-eval, LLM judge, composite scoring
│   ├── sandbox.py      # Docker + local sandbox (SandboxFactory)
│   ├── runner.py       # Benchmark orchestrator
│   └── cli.py          # CLI entry point (click)
├── tasks/              # 10 benchmark tasks across 5 domains (YAML)
├── leaderboard/        # Static frontend (GitHub Pages)
├── docs/               # Methodology, findings, guides
└── tests/              # 183 tests, 90% coverage

Contributing

Add an agent — Write an adapter (~15 lines), submit a PR
Add tasks — Submit new benchmark tasks (task authoring guide)
Improve scoring — Better heuristics, judges, evaluation methods
Run benchmarks — Run existing agents and submit results

See CONTRIBUTING.md for the full guide.

License

MIT — every line, no asterisks.

The best way to improve agents is to measure them honestly — variance and all.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.github		.github
adapters		adapters
docs		docs
evaluators		evaluators
leaderboard		leaderboard
ranking		ranking
results		results
sandbox		sandbox
scripts		scripts
src/agentbench		src/agentbench
tasks		tasks
tests		tests
.coverage		.coverage
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AgentBench-Live

We re-ran the same coding agent twice. The score went from 0.0 to 0.7.

What we found

Leaderboard

Per-domain (where the real differences live)

Variance (the part nobody publishes)

What makes us different

Add Your Agent in 15 Lines

Quick Start

How It Works

Capability Domains

Wanted: Community Adapters

Architecture

Contributing

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AgentBench-Live

We re-ran the same coding agent twice. The score went from 0.0 to 0.7.

What we found

Leaderboard

Per-domain (where the real differences live)

Variance (the part nobody publishes)

What makes us different

Add Your Agent in 15 Lines

Quick Start

How It Works

Capability Domains

Wanted: Community Adapters

Architecture

Contributing

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages