Goal
Produce Chimera's flagship comparative result: the same DeepSWE (Harbor-format) tasks solved by multiple agent architectures under identical model / budget / latency, scored by the real Harbor Docker verifier — not --env local.
Why
The comparative-methodology thesis is currently demonstrated only by one tiny matrix (3 tasks, react vs plan-execute, GLM-5.1 — see docs/benchmarks/2026-06-11-first-controlled-matrix.md). This issue is the full-scale version that validates the framework's reason for existing.
What already exists
- Harbor adapter: chimera/eval/benchmarks/harbor.py parses all 117 DeepSWE tasks (vendor clone at data/vendor/deep-swe, gitignored).
- chimera bench-compare CLI: uniform completed-tool-call budgets, ATIF trajectory emission, terminal/json/markdown/html output.
- docker_env_factory: per-task Docker provisioning, verified against a live daemon.
Scope / steps
- Core work item: wire docker_env_factory per task so each agent's work runs in, and is scored by, that task's Harbor Docker image. The CLI default --env local does not satisfy Harbor verifier semantics for real scoring.
- Pre-pull / cache images (multi-GB each — budget pull time). Start with a subset (e.g. 20 tasks) for the first full run.
- Run the matrix: --agents react,plan-execute,reflexion,... × models (start: glm-5.2, qwen3-coder-next) with --max-tool-calls + --max-wall-clock budgets and a fixed --seed.
- Emit ATIF trajectories (--emit-atif) for every cell; confirm they validate.
- Report: HTML matrix (--format html) + dated writeup in docs/benchmarks/ + update row A12 (DeepSWE) in docs/benchmarks/README.md.
Reference command
chimera bench-compare --agents react,plan-execute \
--benchmark harbor --dataset data/vendor/deep-swe/tasks --limit 20 \
--model glm-5.2 --max-tool-calls 30 --max-wall-clock 900 \
--seed 0 --format html --output matrix.html --emit-atif trajectories
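The reference command covers a single model. One way to expand it across the starting model list is sketched below. It assumes one bench-compare invocation per model, since this issue only shows --model taking a single value. The per-model output file and trajectory directory names are hypothetical, chosen only to keep runs from overwriting each other. All flags are taken from the reference command above.

```python
import shlex

# Starting matrix from this issue; extend both lists as more cells are added.
AGENTS = ["react", "plan-execute"]
MODELS = ["glm-5.2", "qwen3-coder-next"]


def build_command(model: str, agents: list[str], limit: int = 20, seed: int = 0) -> list[str]:
    """Return one bench-compare argv for a single model with fixed budgets."""
    return [
        "chimera", "bench-compare",
        "--agents", ",".join(agents),
        "--benchmark", "harbor",
        "--dataset", "data/vendor/deep-swe/tasks",
        "--limit", str(limit),
        "--model", model,
        "--max-tool-calls", "30",
        "--max-wall-clock", "900",
        "--seed", str(seed),
        "--format", "html",
        # Hypothetical naming: one report and one trajectory dir per model.
        "--output", f"matrix-{model}.html",
        "--emit-atif", f"trajectories/{model}",
    ]


# Same agents, budgets, and seed for every model, so only the model varies.
commands = [build_command(model, AGENTS) for model in MODELS]

for cmd in commands:
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(cmd))
```

Printing rather than executing keeps this safe to run before the single-task validation step; the lines can be pasted into a shell once one task has passed end-to-end.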
Environment notes
- Models run via the Ollama-Cloud bridge: ANTHROPIC_BASE_URL=http://localhost:11434, ANTHROPIC_API_KEY=$OLLAMA_API_KEY; run ollama pull <model>:cloud first. glm-5.2 is a reasoning model: keep per-turn max_tokens at 8192 or higher, or turns come back empty.
- Harbor images are linux/amd64; on an arm64 host QEMU is slow — prefer cloud amd64 runners.
- Validate ONE task end-to-end before fanning out (a premature ProgramBench sweep previously burned ~$8 producing nothing).
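The environment notes above can be turned into a few pre-flight guards. This is a minimal sketch, not Chimera code: the helper names bridge_env, floor_max_tokens, and needs_emulation are hypothetical, and how max_tokens is actually passed to the model client is left to the caller.

```python
import os
import platform

# Floor from the note above: smaller per-turn limits return empty turns on glm-5.2.
MIN_REASONING_MAX_TOKENS = 8192


def bridge_env(base_env=None):
    """Return a copy of the environment pointing the Anthropic client at the local Ollama bridge."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = "http://localhost:11434"
    # Reuse the Ollama key; empty string if it is not set, so the caller can detect it.
    env["ANTHROPIC_API_KEY"] = env.get("OLLAMA_API_KEY", "")
    return env


def floor_max_tokens(requested, floor=MIN_REASONING_MAX_TOKENS):
    """Raise a too-small per-turn max_tokens to the floor; larger values pass through."""
    return max(requested, floor)


def needs_emulation(machine=None):
    """True when the host is arm64, so linux/amd64 Harbor images would run under QEMU."""
    machine = (machine or platform.machine()).lower()
    return machine in {"arm64", "aarch64"}


if __name__ == "__main__":
    env = bridge_env()
    if not env["ANTHROPIC_API_KEY"]:
        print("OLLAMA_API_KEY is not set")
    if needs_emulation():
        print("arm64 host: amd64 images will be emulated; prefer a cloud amd64 runner")
    print("per-turn max_tokens:", floor_max_tokens(4096))
```

Running these checks before the single-task validation catches the cheap failures (missing key, emulated host, undersized token limit) before any paid model calls are made.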
Acceptance criteria
- ≥2 architectures × ≥2 models scored by the real Harbor verifier on a defined task subset, with budgets controlled and reported.
- ATIF trajectories emitted and validating for every cell.
- HTML report + README.md A12 row + dated writeup committed.
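The first two criteria can be checked mechanically once results are collected. The sketch below assumes a hypothetical per-cell record (Cell) with one entry per agent × model pair; it does not reflect the actual bench-compare JSON schema, which this issue does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical summary of one matrix cell (one agent on one model)."""
    agent: str
    model: str
    scored_by_harbor: bool  # scored in the task's Harbor Docker image, not --env local
    atif_valid: bool        # ATIF trajectories were emitted and validated


def unmet_criteria(cells):
    """Return human-readable reasons the matrix fails the acceptance criteria."""
    problems = []
    agents = sorted({c.agent for c in cells})
    models = sorted({c.model for c in cells})
    if len(agents) < 2:
        problems.append("fewer than 2 architectures")
    if len(models) < 2:
        problems.append("fewer than 2 models")
    # Every agent must be run on every model for the matrix to be complete.
    present = {(c.agent, c.model) for c in cells}
    for agent in agents:
        for model in models:
            if (agent, model) not in present:
                problems.append(f"missing cell: {agent} x {model}")
    for c in cells:
        if not c.scored_by_harbor:
            problems.append(f"not scored by Harbor verifier: {c.agent} x {c.model}")
        if not c.atif_valid:
            problems.append(f"ATIF missing or invalid: {c.agent} x {c.model}")
    return problems


if __name__ == "__main__":
    cells = [
        Cell("react", "glm-5.2", True, True),
        Cell("plan-execute", "glm-5.2", True, True),
        Cell("react", "qwen3-coder-next", True, False),
    ]
    for problem in unmet_criteria(cells):
        print(problem)
```

An empty list means the matrix-level criteria hold; the report, README row, and writeup still need a manual check.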
Relates to the comparative-methodology mission; sibling of the ProgramBench matrix (#141).