[audit] Adversarially verify documented capability claims against code + expand real-LLM coverage

## Goal

Verify that the repo's documented capability claims match the actual implementation, and expand real-LLM (non-mocked) test coverage. The project makes many capability and parity claims; this is a systematic check for drift between docs and code.

## Claims to verify (non-exhaustive)

- **Parity matrices:** `docs/mink/parity-matrix.md`, `site/src/content/docs/otter/parity-matrix.md`, and the field-guide pages under `site/src/content/docs/field-guide/`.
- **"Replicated scaffolds are real (not stubs)":** re-audit each scaffold's code against its claimed capabilities. Frame as **capability parity** — verify the capability exists, without asserting fidelity to any specific external reference.
- **Capability tables / inventories** in `README.md` and `CLAUDE.md`: tool counts, event-type counts, strategy lists, CLI subcommand inventory, MCP server list.
- **Benchmark numbers** cited in docs vs the writeups they reference (`docs/benchmarks/`).
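The count-style claims above (tool counts, event-type counts, and so on) can be harvested mechanically before any agent looks at them. The following is a minimal sketch, not the repo's tooling: `CountClaim`, `CLAIM_RE`, `extract_count_claims`, and `check_count` are hypothetical names, and the noun list in the pattern is an assumption about how the docs phrase their counts.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CountClaim:
    source: str  # doc path the claim was found in
    line: int    # 1-based line number within that doc
    count: int   # the number the doc states
    noun: str    # what is being counted, lowercased


# Assumed phrasing: "<number> <noun>", e.g. "12 tools" or "4 MCP servers".
CLAIM_RE = re.compile(
    r"\b(\d+)\s+(tools|event types|strategies|CLI subcommands|MCP servers)\b",
    re.IGNORECASE,
)


def extract_count_claims(source: str, text: str) -> list[CountClaim]:
    """Return every numeric count claim in `text`, tagged with its line."""
    claims = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CLAIM_RE.finditer(line):
            claims.append(
                CountClaim(source, lineno, int(match.group(1)), match.group(2).lower())
            )
    return claims


def check_count(claim: CountClaim, actual: int) -> str:
    """Compare a documented count with the count derived from code.

    Only distinguishes "confirmed" from "drifted"; deciding that a claim is
    outright "false" is left to the verifying agent.
    """
    return "confirmed" if claim.count == actual else "drifted"
```

Each extracted claim carries its own `source` and `line`, so the doc side of the `file:line` evidence comes for free; the code-side count still has to be derived separately.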

## Method (parallel agent fan-out)

1. Enumerate concrete, checkable claims from the sources above.
2. For each claim, spawn an agent to verify it against code/tests and return a verdict — **confirmed / drifted / false** — with `file:line` evidence.
3. Adversarially re-check any "confirmed" verdict whose evidence rests on a single source.
4. Produce an `AUDIT.md`-style findings list; fix drift via follow-up PRs (additive edits only, per repo rules).
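The steps above can be sketched as a small fan-out harness. This is an illustration under stated assumptions rather than the project's agent runner: `Finding`, `fan_out`, and the `verify` callable are hypothetical, a thread pool stands in for spawning agents, and "single source" is interpreted here as all evidence coming from one file.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable

VERDICTS = ("confirmed", "drifted", "false")


@dataclass
class Finding:
    claim: str
    verdict: str
    evidence: list[str] = field(default_factory=list)  # "file:line" strings

    def needs_recheck(self) -> bool:
        """True for a "confirmed" verdict backed by at most one distinct file.

        Assumption: two lines in the same file still count as one source.
        """
        files = {item.rsplit(":", 1)[0] for item in self.evidence}
        return self.verdict == "confirmed" and len(files) < 2


def fan_out(
    claims: list[str],
    verify: Callable[[str], Finding],
    max_workers: int = 8,
) -> list[Finding]:
    """Verify each claim concurrently; results keep the order of `claims`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        findings = list(pool.map(verify, claims))
    for finding in findings:
        if finding.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {finding.verdict!r}")
    return findings
```

Findings for which `needs_recheck()` returns true would go through a second, adversarial pass (step 3) before they are written to the findings list.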

## Live-coverage expansion

- There are ~48 live integration tests against a real model today. Identify high-value behaviors currently covered only by mocks — provider streaming, tool-calling, the ProgramBench `--strategy rebuild` path, `bench-compare` budget enforcement — and add real-LLM tests behind the existing live-test gating.
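For orientation, a gated live test might look like the sketch below. The repo's actual gating mechanism is not described here, so everything in it is an assumption: the `RUN_LIVE_TESTS` environment variable, the `live_enabled` helper, and the `stream_completion` placeholder are all hypothetical, and the standard-library `unittest` module is used only to keep the example self-contained.

```python
import os
import unittest

LIVE_ENV_VAR = "RUN_LIVE_TESTS"  # hypothetical name; use the repo's real gate


def live_enabled(env=None) -> bool:
    """Return True when the (assumed) opt-in variable is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get(LIVE_ENV_VAR, "").strip().lower() in {"1", "true", "yes"}


def stream_completion(prompt: str):
    """Placeholder for the real provider streaming call (hypothetical)."""
    raise NotImplementedError("wire this to the real provider client")


@unittest.skipUnless(live_enabled(), f"set {LIVE_ENV_VAR}=1 to run live tests")
class LiveStreamingTest(unittest.TestCase):
    def test_stream_yields_nonempty_text(self):
        # Assert on the shape of the response, not exact wording, because
        # real model output is not deterministic.
        chunks = list(stream_completion("Reply with the single word: ok"))
        self.assertTrue("".join(chunks).strip())
```

Asserting on structure (non-empty text, a well-formed tool call, a budget that was respected) rather than exact strings keeps live tests stable across model versions.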

## Acceptance criteria

- A findings doc listing each claim with a verdict + `file:line` evidence.
- Every claim with a **drifted** or **false** verdict either fixed or filed as a follow-up issue.
- A measurable increase in live-test coverage of the listed behaviors, relative to the current baseline of ~48 live tests.
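One possible shape for the findings doc is a markdown table with one row per claim. The sketch below assumes findings arrive as `(claim, verdict, evidence)` tuples; `render_findings` is a hypothetical helper and the column layout is a suggestion, not an existing `AUDIT.md` convention.

```python
def render_findings(findings):
    """Render (claim, verdict, evidence_list) tuples as a markdown table."""
    rows = ["| Claim | Verdict | Evidence |", "| --- | --- | --- |"]
    for claim, verdict, evidence in findings:
        # Escape pipes so a claim containing "|" cannot break the table.
        safe_claim = claim.replace("|", "\\|")
        cell = ", ".join(f"`{item}`" for item in evidence) or "(none)"
        rows.append(f"| {safe_claim} | {verdict} | {cell} |")
    return "\n".join(rows)
```

A row with `(none)` in the evidence column is a signal that the verdict does not yet meet the `file:line` evidence requirement.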

## Guardrails

- Do not name scrubbed/forbidden upstream identifiers in audit output, even to confirm absence.
- Additive-only on existing docs; never amend commits.

