[bench] DeepSWE flagship comparative matrix — N architectures × M models, real Docker verifier (117 Harbor tasks) #157

Description

@0bserver07

Goal

Produce Chimera's flagship comparative result: the same DeepSWE (Harbor-format) tasks solved by multiple agent architectures under an identical model, budget, and latency, and scored by the real Harbor Docker verifier rather than --env local.

Why

The comparative-methodology thesis is currently demonstrated only by one tiny matrix (3 tasks, react vs plan-execute, GLM-5.1 — see docs/benchmarks/2026-06-11-first-controlled-matrix.md). This issue is the full-scale version that validates the framework's reason for existing.

What already exists

  • Harbor adapter: chimera/eval/benchmarks/harbor.py — parses all 117 DeepSWE tasks (vendor clone at data/vendor/deep-swe, gitignored).
  • chimera bench-compare CLI — uniform completed-tool-call budgets, ATIF trajectory emission, terminal/json/markdown/html output.
  • docker_env_factory — per-task Docker provisioning, verified against a live daemon.

Scope / steps

  1. Core work item: wire docker_env_factory per task so each agent's work runs in — and is scored by — that task's Harbor Docker image. The CLI default --env local does not satisfy Harbor verifier semantics for real scoring.
  2. Pre-pull / cache images (multi-GB each — budget pull time). Start with a subset (e.g. 20 tasks) for the first full run.
  3. Run the matrix: --agents react,plan-execute,reflexion,... × models (start: glm-5.2, qwen3-coder-next) with --max-tool-calls + --max-wall-clock budgets and a fixed --seed.
  4. Emit ATIF trajectories (--emit-atif) for every cell; confirm they validate.
  5. Report: HTML matrix (--format html) + dated writeup in docs/benchmarks/ + update row A12 (DeepSWE) in docs/benchmarks/README.md.
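
Step 2 (pre-pulling images) could be sketched roughly as follows. This is a minimal illustration, not the project's tooling: the image names are placeholders, since this issue does not say how a Harbor task names its image, and `pull_commands` / `pre_pull` are hypothetical helpers. Only `docker pull --platform` is a real CLI call.

```python
import subprocess


def pull_commands(images, platform="linux/amd64"):
    """Build one `docker pull` argv per unique image, preserving input order."""
    unique = []
    for image in images:
        if image not in unique:
            unique.append(image)
    return [["docker", "pull", "--platform", platform, image] for image in unique]


def pre_pull(images, dry_run=True):
    """Pull each image once. With dry_run=True, only return the commands."""
    commands = pull_commands(images)
    if not dry_run:
        for argv in commands:
            # check=True stops on the first failed pull, before any agent runs.
            subprocess.run(argv, check=True)
    return commands


# Placeholder image names (assumption): substitute the images of the chosen subset.
planned = pre_pull(["example/task-a:latest", "example/task-b:latest",
                    "example/task-a:latest"])
for argv in planned:
    print(" ".join(argv))
```

Running with `dry_run=True` first shows how many distinct multi-GB pulls a subset implies before any time is spent on them.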

Reference command

chimera bench-compare --agents react,plan-execute \
  --benchmark harbor --dataset data/vendor/deep-swe/tasks --limit 20 \
  --model glm-5.2 --max-tool-calls 30 --max-wall-clock 900 \
  --seed 0 --format html --output matrix.html --emit-atif trajectories
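
To fan the reference command out across models, one option is to generate one invocation per model and keep every other flag fixed. The sketch below uses only flags that appear in this issue. The per-model output and trajectory paths are an assumed naming scheme, and the Docker environment selection is deliberately left out because the issue does not state which flag value enables it.

```python
import shlex

AGENTS = ["react", "plan-execute"]
MODELS = ["glm-5.2", "qwen3-coder-next"]


def bench_compare_argv(model, agents=AGENTS, limit=20, seed=0):
    """Build the argv for one model; budgets and seed are identical across models."""
    return [
        "chimera", "bench-compare",
        "--agents", ",".join(agents),
        "--benchmark", "harbor",
        "--dataset", "data/vendor/deep-swe/tasks",
        "--limit", str(limit),
        "--model", model,
        "--max-tool-calls", "30",
        "--max-wall-clock", "900",
        "--seed", str(seed),
        "--format", "html",
        # Assumed naming: one report and one trajectory directory per model.
        "--output", f"matrix-{model}.html",
        "--emit-atif", f"trajectories/{model}",
    ]


for model in MODELS:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(bench_compare_argv(model)))
```

Building the commands from one function keeps the budget and seed flags from drifting between cells, which is what the comparison depends on.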

Environment notes

  • Models run via the Ollama-Cloud bridge: set ANTHROPIC_BASE_URL=http://localhost:11434 and ANTHROPIC_API_KEY=$OLLAMA_API_KEY, and run ollama pull <model>:cloud first. glm-5.2 is a reasoning model: set per-turn max_tokens to at least 8192, otherwise turns come back empty.
  • Harbor images are linux/amd64; on an arm64 host QEMU is slow — prefer cloud amd64 runners.
  • Validate ONE task end-to-end before fanning out (a premature ProgramBench sweep previously burned ~$8 producing nothing).
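
The notes above can be turned into a small preflight check, sketched here under assumptions: the helper names are hypothetical, and how Chimera actually reads its token limit or environment is not described in this issue.

```python
import platform

MIN_MAX_TOKENS = 8192  # floor for reasoning models, from the note above


def floor_max_tokens(requested):
    """Raise a per-turn max_tokens request to the floor if it is too low."""
    return max(requested, MIN_MAX_TOKENS)


def bridge_env(ollama_api_key):
    """Environment variables for the Ollama-Cloud bridge described above."""
    return {
        "ANTHROPIC_BASE_URL": "http://localhost:11434",
        "ANTHROPIC_API_KEY": ollama_api_key,
    }


def needs_emulation(machine=None):
    """True if linux/amd64 images would run under emulation on this host."""
    machine = (machine or platform.machine()).lower()
    return machine not in {"x86_64", "amd64"}


if needs_emulation():
    print("warning: non-amd64 host, Harbor images will run under slow emulation")
print("per-turn max_tokens:", floor_max_tokens(4096))
```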

Acceptance criteria

  • ≥2 architectures × ≥2 models scored by the real Harbor verifier on a defined task subset, with budgets controlled and reported.
  • ATIF trajectories emitted and validating for every cell.
  • HTML report + README.md A12 row + dated writeup committed.
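
The first two criteria can be checked mechanically once results exist. The sketch below assumes a hypothetical per-cell record with `agent`, `model`, `env`, and `atif_valid` keys; this is not the schema that bench-compare emits, which the issue does not describe.

```python
def check_acceptance(cells, min_agents=2, min_models=2):
    """Return a list of problems; an empty list means the criteria are met."""
    problems = []
    agents = sorted({c["agent"] for c in cells})
    models = sorted({c["model"] for c in cells})
    if len(agents) < min_agents:
        problems.append(f"only {len(agents)} architecture(s)")
    if len(models) < min_models:
        problems.append(f"only {len(models)} model(s)")
    # Every agent x model combination must be present in the matrix.
    present = {(c["agent"], c["model"]) for c in cells}
    for agent in agents:
        for model in models:
            if (agent, model) not in present:
                problems.append(f"missing cell {agent} x {model}")
    for c in cells:
        label = f'{c["agent"]} x {c["model"]}'
        if c["env"] == "local":
            problems.append(f"{label}: scored with --env local")
        if not c["atif_valid"]:
            problems.append(f"{label}: ATIF trajectory missing or invalid")
    return problems


# Hypothetical records; "docker" is an illustrative env label, not a known flag value.
cells = [
    {"agent": a, "model": m, "env": "docker", "atif_valid": True}
    for a in ("react", "plan-execute")
    for m in ("glm-5.2", "qwen3-coder-next")
]
print(check_acceptance(cells))
```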

Relates to the comparative-methodology mission; sibling of the ProgramBench matrix (#141).
