[audit] Adversarially verify documented capability claims against code + expand real-LLM coverage #159

Description

@0bserver07

Goal

Verify that the repo's documented capability claims match the actual implementation, and expand real-LLM (non-mocked) test coverage. The project makes many capability and parity claims; this is a systematic check for drift between docs and code.

Claims to verify (non-exhaustive)

  • Parity matrices: docs/mink/parity-matrix.md, site/src/content/docs/otter/parity-matrix.md, and the field-guide pages under site/src/content/docs/field-guide/.
  • "Replicated scaffolds are real (not stubs)": re-audit each scaffold's code against its claimed capabilities. Frame as capability parity — verify the capability exists, without asserting fidelity to any specific external reference.
  • Capability tables / inventories in README.md and CLAUDE.md: tool counts, event-type counts, strategy lists, CLI subcommand inventory, MCP server list.
  • Benchmark numbers cited in docs vs the writeups they reference (docs/benchmarks/).
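Count-style claims (tool counts, event-type counts) are the easiest to check mechanically. A minimal sketch, assuming the claim appears in the docs as a number followed by a noun (e.g. "12 tools") and that the actual count has already been obtained from the code by some other means. The helper names are hypothetical, and returning `None` for "no such claim found" is an assumption, not part of the verdict set used in this issue:

```python
import re


def claimed_counts(doc_text, noun):
    """Return every integer that directly precedes `noun` (singular or plural)."""
    pattern = re.compile(rf"(\d+)\s+{re.escape(noun)}s?\b", re.IGNORECASE)
    return [int(m.group(1)) for m in pattern.finditer(doc_text)]


def check_count_claim(doc_text, noun, actual):
    """Compare documented counts for `noun` against the count found in code.

    Returns "confirmed" if every documented count equals `actual`,
    "drifted" if any differs, and None if the doc makes no such claim.
    """
    counts = claimed_counts(doc_text, noun)
    if not counts:
        return None
    return "confirmed" if all(c == actual for c in counts) else "drifted"
```

A regex like this only catches the simplest phrasing; claims written as tables or spelled-out numbers would still need a manual or agent pass.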

Method (parallel agent fan-out)

  1. Enumerate concrete, checkable claims from the sources above.
  2. For each claim, spawn an agent to verify it against code/tests and return a verdict — confirmed / drifted / false — with file:line evidence.
  3. Adversarially re-check any "confirmed" verdict that rests on a single source.
  4. Produce an AUDIT.md-style findings list; fix drift via follow-up PRs (additive edits only, per repo rules).
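The four steps above could be wired together roughly as follows. This is only a sketch: `Claim`, `Finding`, `audit`, and `render_findings` are hypothetical names, the `verify` callable stands in for whatever agent does the actual checking, and "rests on a single source" is interpreted here as fewer than two distinct file:line evidence entries.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"
    DRIFTED = "drifted"
    FALSE = "false"


@dataclass(frozen=True)
class Claim:
    source: str  # document where the claim appears, e.g. "README.md"
    text: str    # the claim as stated in that document


@dataclass
class Finding:
    claim: Claim
    verdict: Verdict
    evidence: list = field(default_factory=list)  # "path:line" strings

    def needs_recheck(self):
        # Assumption: a "confirmed" backed by fewer than two distinct
        # evidence locations counts as resting on a single source.
        return self.verdict is Verdict.CONFIRMED and len(set(self.evidence)) < 2


def audit(claims, verify, max_workers=8):
    """Fan out one `verify(claim) -> Finding` call per claim, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(verify, claims))


def render_findings(findings):
    """Render findings as a flat markdown list with verdict and evidence."""
    lines = []
    for f in findings:
        flag = " (re-check: single source)" if f.needs_recheck() else ""
        evidence = ", ".join(f.evidence) or "no evidence"
        lines.append(
            f"- [{f.verdict.value}] {f.claim.source}: {f.claim.text} ({evidence}){flag}"
        )
    return "\n".join(lines)
```

Keeping the verdict and evidence in one record means the re-check pass in step 3 can be a filter over the first pass's output rather than a separate enumeration.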

Live-coverage expansion

  • There are ~48 live integration tests against a real model today. Identify high-value behaviors currently covered only by mocks — provider streaming, tool-calling, the ProgramBench --strategy rebuild path, bench-compare budget enforcement — and add real-LLM tests behind the existing live-test gating.
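For illustration, a gated real-LLM test might look like the sketch below. The repo's actual gating mechanism is not described in this issue, so the `RUN_LIVE_TESTS` environment variable, the `requires_live` decorator, and the `stream_completion` function are all hypothetical placeholders; new tests should reuse whatever gate the existing live tests already use.

```python
import os
import unittest


def live_tests_enabled(env=None):
    """True when the (hypothetical) RUN_LIVE_TESTS switch is turned on."""
    env = os.environ if env is None else env
    return env.get("RUN_LIVE_TESTS", "").strip().lower() in {"1", "true", "yes"}


# Skips the decorated test class unless live tests are explicitly enabled.
requires_live = unittest.skipUnless(
    live_tests_enabled(), "set RUN_LIVE_TESTS=1 to run real-LLM tests"
)


def stream_completion(prompt):
    """Hypothetical stand-in for the provider's streaming call."""
    raise NotImplementedError("replace with the real streaming client")


@requires_live
class StreamingLiveTest(unittest.TestCase):
    def test_stream_yields_text_chunks(self):
        # Assert on the shape of the stream, not on exact model output,
        # since real-model responses are not deterministic.
        chunks = list(stream_completion("Reply with the single word: ok"))
        self.assertTrue(chunks)
        self.assertTrue(all(isinstance(c, str) for c in chunks))
```

Asserting on structure (non-empty stream, string chunks) rather than exact text keeps the test stable across model versions while still exercising the real streaming path.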

Acceptance criteria

  • A findings doc listing each claim with a verdict + file:line evidence.
  • Any confirmed drift either fixed or filed as a follow-up issue.
  • A measurable increase in live-test coverage of the listed behaviors.

Guardrails

  • Do not name scrubbed/forbidden upstream identifiers in audit output, even to confirm absence.
  • Additive-only on existing docs; never amend commits.
