[bench] DeepSWE flagship comparative matrix — N architectures × M models, real Docker verifier (117 Harbor tasks) #157

## Goal

Produce Chimera's flagship comparative result: the same DeepSWE (Harbor-format) tasks solved by **multiple agent architectures under identical model / budget / latency**, scored by the **real Harbor Docker verifier** — not `--env local`.

## Why

The comparative-methodology thesis is currently demonstrated only by one tiny matrix (3 tasks, react vs plan-execute, GLM-5.1 — see `docs/benchmarks/2026-06-11-first-controlled-matrix.md`). This issue is the full-scale version that validates the framework's reason for existing.

## What already exists

- **Harbor adapter:** `chimera/eval/benchmarks/harbor.py` — parses all 117 DeepSWE tasks (vendor clone at `data/vendor/deep-swe`, gitignored).
- **`chimera bench-compare` CLI** — uniform completed-tool-call budgets, ATIF trajectory emission, terminal/json/markdown/html output.
- **`docker_env_factory`** — per-task Docker provisioning, verified against a live daemon.

## Scope / steps

1. **Core work item:** wire `docker_env_factory` per task so each agent's work runs in — and is scored by — that task's Harbor Docker image. The CLI default `--env local` does **not** satisfy Harbor verifier semantics for real scoring.
2. Pre-pull / cache images (multi-GB each — budget pull time). Start with a subset (e.g. 20 tasks) for the first full run.
3. Run the matrix: `--agents react,plan-execute,reflexion,...` × models (start: `glm-5.2`, `qwen3-coder-next`) with `--max-tool-calls` + `--max-wall-clock` budgets and a fixed `--seed`.
4. Emit ATIF trajectories (`--emit-atif`) for every cell; confirm they validate.
5. Report: HTML matrix (`--format html`) + dated writeup in `docs/benchmarks/` + update row **A12 (DeepSWE)** in `docs/benchmarks/README.md`.
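The fan-out in step 3 can be sketched as a small command builder. This is only an illustration: it assumes `chimera bench-compare` takes one `--model` per invocation (as in the reference command), and the per-model output and trajectory paths are hypothetical naming choices. The flag that selects the Docker environment is not named in this issue, so it is deliberately left out.

```python
import shlex

AGENTS = ["react", "plan-execute"]
MODELS = ["glm-5.2", "qwen3-coder-next"]


def build_matrix_commands(agents, models, *, dataset="data/vendor/deep-swe/tasks",
                          limit=20, max_tool_calls=30, max_wall_clock=900, seed=0):
    """Return one bench-compare argv per model, with every budget flag held constant."""
    commands = []
    for model in models:
        commands.append([
            "chimera", "bench-compare",
            "--agents", ",".join(agents),
            "--benchmark", "harbor",
            "--dataset", dataset,
            "--limit", str(limit),
            "--model", model,
            "--max-tool-calls", str(max_tool_calls),
            "--max-wall-clock", str(max_wall_clock),
            "--seed", str(seed),
            "--format", "html",
            "--output", f"matrix-{model}.html",      # hypothetical file naming
            "--emit-atif", f"trajectories/{model}",  # hypothetical directory layout
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for argv in build_matrix_commands(AGENTS, MODELS):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to confirm that only `--model` and the output paths differ between runs before any budget is spent.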

## Reference command

```shell
chimera bench-compare --agents react,plan-execute \
  --benchmark harbor --dataset data/vendor/deep-swe/tasks --limit 20 \
  --model glm-5.2 --max-tool-calls 30 --max-wall-clock 900 \
  --seed 0 --format html --output matrix.html --emit-atif trajectories
```

## Environment notes

- Models run via the **Ollama-Cloud bridge**: set `ANTHROPIC_BASE_URL=http://localhost:11434` and `ANTHROPIC_API_KEY=$OLLAMA_API_KEY`, and run `ollama pull <model>:cloud` first. `glm-5.2` is a reasoning model: set per-turn `max_tokens` to at least 8192, or turns come back empty.
- Harbor images are linux/amd64; on an arm64 host they run under QEMU emulation, which is slow, so prefer cloud amd64 runners.
- **Validate ONE task end-to-end before fanning out** (a premature ProgramBench sweep previously burned ~$8 producing nothing).
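A preflight check along these lines could catch the environment problems above before any paid run starts. It is a minimal sketch: the `preflight` function and its return shape are hypothetical, and it only inspects the two environment variables, the host architecture, and the `max_tokens` floor named in these notes.

```python
import os
import platform

REQUIRED_ENV = ("ANTHROPIC_BASE_URL", "ANTHROPIC_API_KEY")
MIN_MAX_TOKENS = 8192  # floor quoted above for glm-5.2


def preflight(env, machine, max_tokens):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"{name} is not set")
    if machine.lower() in ("arm64", "aarch64"):
        problems.append("arm64 host: linux/amd64 images will run under slow QEMU emulation")
    if max_tokens < MIN_MAX_TOKENS:
        problems.append(f"max_tokens={max_tokens} is below {MIN_MAX_TOKENS}")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ, platform.machine(), max_tokens=8192):
        print("WARN:", problem)
```

This does not replace the single-task end-to-end validation; it only rules out the cheapest failure modes first.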

## Acceptance criteria

- ≥2 architectures × ≥2 models scored by the **real** Harbor verifier on a defined task subset, with budgets controlled and reported.
- ATIF trajectories emitted and validating for every cell.
- HTML report + `README.md` A12 row + dated writeup committed.
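The first two criteria can be checked mechanically once results are collected. The sketch below assumes a hypothetical per-cell record with `agent`, `model`, `verifier`, and `atif_valid` keys; none of these names come from Chimera's actual output format, and how ATIF validity is determined is left to whatever validator the project uses.

```python
def check_acceptance(cells):
    """Return a list of unmet criteria for a list of per-cell result dicts."""
    cells = list(cells)
    agents = {c["agent"] for c in cells}
    models = {c["model"] for c in cells}
    failures = []

    if len(agents) < 2:
        failures.append("fewer than 2 architectures")
    if len(models) < 2:
        failures.append("fewer than 2 models")

    # Every (agent, model) pair must be present for the matrix to be complete.
    seen = {(c["agent"], c["model"]) for c in cells}
    for agent in sorted(agents):
        for model in sorted(models):
            if (agent, model) not in seen:
                failures.append(f"missing cell: {agent} x {model}")

    for c in cells:
        label = f"{c['agent']} x {c['model']}"
        if c["verifier"] != "docker":  # hypothetical marker for the real Harbor verifier
            failures.append(f"{label}: not scored by the Docker verifier")
        if not c["atif_valid"]:
            failures.append(f"{label}: ATIF trajectory missing or invalid")
    return failures
```

An empty return value means the matrix meets the coverage, verifier, and trajectory criteria; the report and writeup items still need a manual check.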

Relates to the comparative-methodology mission; sibling of the ProgramBench matrix (#141).

