Goal
Verify that the repo's documented capability claims match the actual implementation, and expand real-LLM (non-mocked) test coverage. The project makes many capability and parity claims; this is a systematic check for drift between docs and code.
Claims to verify (non-exhaustive)
- Parity matrices: docs/mink/parity-matrix.md, site/src/content/docs/otter/parity-matrix.md, and the field-guide pages under site/src/content/docs/field-guide/.
- "Replicated scaffolds are real (not stubs)": re-audit each scaffold's code against its claimed capabilities. Frame as capability parity — verify the capability exists, without asserting fidelity to any specific external reference.
- Capability tables / inventories in README.md and CLAUDE.md: tool counts, event-type counts, strategy lists, CLI subcommand inventory, MCP server list.
- Benchmark numbers cited in docs vs the writeups they reference (docs/benchmarks/).
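For the count-style claims (tool counts, event-type counts, and so on), a check could look roughly like the sketch below. It assumes the docs state counts in phrases such as "12 tools". The regex, the function names, and the sample sentence are all hypothetical illustrations, not taken from the repo, and the actual counts would have to come from the code itself (for example a registry or CLI listing).

```python
import re

# Hypothetical pattern: matches phrases such as "12 tools" or "5 event types".
COUNT_CLAIM = re.compile(
    r"\b(\d+)\s+(tools|event types|strategies|subcommands|MCP servers)\b"
)

def extract_count_claims(doc_text):
    """Return {noun: claimed_count} for each count phrase found in doc_text."""
    claims = {}
    for match in COUNT_CLAIM.finditer(doc_text):
        claims[match.group(2)] = int(match.group(1))
    return claims

def compare_counts(claimed, actual):
    """Return {noun: (claimed, actual)} for every noun whose counts disagree."""
    return {
        noun: (count, actual.get(noun))
        for noun, count in claimed.items()
        if actual.get(noun) != count
    }

# Illustrative input only; these numbers are invented for the example.
sample = "The agent ships 12 tools and emits 5 event types."
claimed = extract_count_claims(sample)
drift = compare_counts(claimed, {"tools": 12, "event types": 6})
print(drift)
```

A non-empty result marks a candidate "drifted" claim that still needs file:line evidence before it goes into the findings list.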
Method (parallel agent fan-out)
- Enumerate concrete, checkable claims from the sources above.
- For each claim, spawn an agent to verify it against code/tests and return a verdict (confirmed / drifted / false) with file:line evidence.
- Adversarially re-check any "confirmed" that rests on a single source.
- Produce an AUDIT.md-style findings list; fix drift via follow-up PRs (additive edits only, per repo rules).
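The fan-out and the single-source re-check rule above can be sketched as follows. This is a minimal illustration under assumptions: `Finding`, `audit`, `needs_recheck`, and `stub_verify` are hypothetical names, a thread pool stands in for whatever agent-spawning mechanism is actually used, and the evidence strings are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    CONFIRMED = "confirmed"
    DRIFTED = "drifted"
    FALSE = "false"

@dataclass
class Finding:
    claim: str
    verdict: Verdict
    evidence: list = field(default_factory=list)  # "path:line" strings

def needs_recheck(finding):
    """True for a 'confirmed' verdict backed by at most one distinct source."""
    return finding.verdict is Verdict.CONFIRMED and len(set(finding.evidence)) < 2

def audit(claims, verify, max_workers=8):
    """Run verify(claim) -> Finding for each claim in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(verify, claims))

def stub_verify(claim):
    # Placeholder standing in for a real verification agent.
    return Finding(claim, Verdict.CONFIRMED, ["README.md:10"])

findings = audit(["claim A", "claim B"], stub_verify)
recheck = [f.claim for f in findings if needs_recheck(f)]
print(recheck)
```

Both stub findings land in `recheck` because each rests on a single evidence entry; a real verifier would return several independent sources for a solid "confirmed".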
Live-coverage expansion
- There are ~48 live integration tests against a real model today. Identify high-value behaviors currently covered only by mocks (provider streaming, tool-calling, the ProgramBench --strategy rebuild path, bench-compare budget enforcement) and add real-LLM tests behind the existing live-test gating.
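How the existing live-test gating works is not described here, so the sketch below only illustrates the general shape of a gated live test. It assumes an environment-variable gate; `RUN_LIVE_TESTS`, `call_model`, and the test class are hypothetical and should be replaced by the repo's actual gate and provider client.

```python
import os
import unittest

# Hypothetical gate: the real repo may use a different variable or a test marker.
LIVE = os.environ.get("RUN_LIVE_TESTS") == "1"

def call_model(prompt):
    """Placeholder for a real streaming provider call; wire to the project's client."""
    raise NotImplementedError("connect this to the real provider client")

@unittest.skipUnless(LIVE, "set RUN_LIVE_TESTS=1 to run against a real model")
class LiveStreamingTest(unittest.TestCase):
    def test_stream_yields_text(self):
        # With the gate open, this exercises the real streaming path, not a mock.
        chunks = list(call_model("Say hello"))
        self.assertTrue(any(chunks))
```

With the gate closed the test is reported as skipped instead of silently passing, which keeps the live-test count honest.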
Acceptance criteria
- A findings doc listing each claim with a verdict + file:line evidence.
- Any confirmed drift either fixed or filed as a follow-up issue.
- A measurable increase in live-test coverage of the listed behaviors.
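The first criterion can be checked mechanically. The sketch below assumes each finding is a dict with `claim`, `verdict`, and `evidence` keys. That row shape and the `validate_finding` helper are hypothetical, and the evidence pattern only checks the `path:line` form, not that the file or line exists.

```python
import re

# A path with no whitespace or colon, then ":" and a line number.
EVIDENCE = re.compile(r"^[^\s:]+:\d+$")
VERDICTS = {"confirmed", "drifted", "false"}

def validate_finding(row):
    """Return a list of problems with one findings row (empty if well formed)."""
    problems = []
    if row.get("verdict") not in VERDICTS:
        problems.append("unknown verdict")
    evidence = row.get("evidence", [])
    if not evidence:
        problems.append("missing evidence")
    problems.extend(f"bad evidence: {e}" for e in evidence if not EVIDENCE.match(e))
    return problems

# Illustrative rows only; the claims and paths are placeholders.
good = {"claim": "c1", "verdict": "confirmed", "evidence": ["README.md:12"]}
bad = {"claim": "c2", "verdict": "maybe", "evidence": ["README.md"]}
print(validate_finding(good), validate_finding(bad))
```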
Guardrails
- Do not name scrubbed/forbidden upstream identifiers in audit output, even to confirm absence.
- Additive-only on existing docs; never amend commits.