Skip to content

test(memory_eval): comprehensive memory subsystem fixture suite#5

Merged
AlanY1an merged 45 commits into
mainfrom
test/memory-eval-suite
Apr 29, 2026
Merged

test(memory_eval): comprehensive memory subsystem fixture suite#5
AlanY1an merged 45 commits into
mainfrom
test/memory-eval-suite

Conversation

@AlanY1an

Copy link
Copy Markdown
Owner

Summary

  • Adds 28 new YAML fixtures + 14 new invariant fields covering every memory layer (L1–L6), all consolidation phases (A–G except G), retrieval scoring, and forget. Total 33 fixtures discoverable, 52 unit tests on the harness itself.
  • Splits tests/memory_eval/harness.py (had grown to 1001 lines) into focused modules: schema.py / runner.py / invariants.py / embedders.py, with a thin re-export shim preserving existing imports.
  • Adds a "..." placeholder sentinel on persona turns that triggers the production assemble_turn so L1 fixtures actually exercise the system-prompt rendering path; adds a per-turn channel field so retrieve_cross_channel_no_filter truly tests the cross-channel铁律.

Branch only adds tests. Zero src/ changes.

What's covered

Layer Fixtures Behaviors
L1 core blocks 5 persona/user/style block reaches system prompt, never-auto-update invariant, location/timezone in PersonaFactsView
L2 recall + FTS 2 ingest count, FTS fallback for literal phrase
L3 extraction 9 identity disclosure, vocative recognition, persona-led blocked, event_time anchor, persona commitment subject, trivial gate, plus the existing e1–e4
L4 reflection 3 TIMER abstraction (e5), SHOCK reflection, hard-limit suppression
L5 entities 4 alias dedup, embedding dedup, ambiguous-keep-separate, anchor retrieve bonus
L6 mood 2 mood-changes-after-shock, neutral session preserves mood
Retrieval scoring 6 impact boost, relational bonus, FTS fallback, pinned force-load, cross-channel, plus e6
Forget 2 orphan keeps thoughts, cascade removes thoughts

Full row-by-row matrix in tests/memory_eval/COVERAGE.md.

Test plan

  • All 47 memory_eval unit tests pass
  • Broader test suite still green (1446 passed)
  • ruff check tests/ src/ clean
  • lint-imports 4/4 contracts kept
  • Single fixture under live LLM passes (l3_trivial_session_skipped)
  • Full live-LLM pass run: 12 pass, 1 xpass, 20 fail. Failures are categorized in the resume notes — most are real production behaviors that the suite is now positioned to track when memory work continues.

Notes

  • Each commit is conventional-commits-prefixed and one-logical-change, so git log reads as a tutorial of what each capability was added for.
  • harness.py becomes a re-export shim; existing imports in test_eval_fixtures.py, test_harness_seed.py, test_check_invariants.py, and synthesize.py continue to work without modification.
  • One fixture (l5_entity_anchored_retrieve_bonus) is marked xfail because seed-event entity resolution wasn't expected to fire — under live LLM it actually passed (xpassed), so the marker can be dropped in a follow-up.

AlanY1an added 30 commits April 29, 2026 16:07
Vector hits and FTS fallback rows now share a single retrieved list,
each tagged with source ('vector' | 'fts') so callers can disambiguate
without inspecting the relevance score. FTS rows carry relevance 0.0
as a sentinel — there is no vector score to report. This unblocks an
L2 fixture where the query word lives only in raw recall_messages and
the existing top_k_must_contain_descriptions_all invariant needs to
match against the FTS-surfaced row.
Two scripted fixtures exercise the L2 layer end-to-end:
- l2_ingest_writes_recall: 4-turn session must produce exactly 4
  recall_messages rows.
- l2_fts_finds_literal_phrase: a query word that appears only in raw
  recall_messages (never extracted as an event) must surface via the
  FTS fallback path.
AlanY1an added 15 commits April 29, 2026 17:43
The dict-comprehension lookup overwrote duplicates so a wrong-status
match silently slipped through whenever two entities shared a
canonical_name. Group matches by name and require all to satisfy the
wanted status.
L6 mood fixtures need to seed Persona.episodic_state before turns run
so the before/after snapshot can prove the session changed (or did
not change) the mood. Wires the field through load_fixture and into
the Persona row constructor.
Two scripted fixtures cover the L6 episodic_state contract: a heavy
disclosure must shift mood off its seeded baseline, and a banal
chit-chat session must leave it untouched.
…bedders

harness.py grew to 1001 lines covering four concerns: dataclass schemas,
LLM/embedder wiring, the end-to-end runner, and the invariant checker.
Reviewers in tasks 6/9/10 flagged it for split.

schema.py owns the Fixture dataclasses + load_fixture/discover_fixtures.
embedders.py owns build_live_llm + build_eval_embedder + keyword_embedder.
runner.py owns run_fixture + render_evidence + the dict-serialise helpers.
invariants.py owns check_invariants. harness.py is now a 46-line shim that
re-exports the public surface so existing test imports keep working.
The README ships a quick-start, fixture-add workflow, module-layout
note for the post-split harness, and a comparison table against
tests/eval so readers pick the right suite.

Status legend in COVERAGE.md tightens its definition of '✅' to mean
verified against a live LLM, with '➕' covering fixtures that exist
but haven't yet been live-LLM-checked.
The runner used to hardcode channel="web" on every ingest_message call,
so the cross-channel fixture (regression line for the D4 铁律) was
ingesting and retrieving from the same channel and would silently keep
passing if a channel filter regressed in.

FixtureTurn / SeedEvent now carry an optional channel field that
load_fixture parses from the YAML, and run_fixture threads through
ingest_message. retrieve_cross_channel_no_filter.yaml seeds the event
under discord and queries channel-blind so the invariant exercises the
no-filter contract.
…ceholder turns

L1 fixtures use "..." as the persona content because their intent is
"verify the seeded core block reaches the system prompt and influences
the persona's reply" — but the runner used to ingest "..." verbatim,
so consolidation had no real persona text to extract from and the
judge prompts saw evidence that contradicted the fixture's premise.

When a persona turn's content is "...", run_fixture now builds a
TurnContext from the existing engine + backend + embed_fn and calls
the production assemble_turn (the same path web/Discord traffic uses
in production), then leaves the LLM-generated reply in L2. The
sentinel string is documented as PERSONA_GENERATE_SENTINEL on the
runner so the contract is greppable.

assemble_turn ingests both the user message and the persona reply,
so the runner skips the explicit ingest_message for the matched
user turn and advances by 2.
@AlanY1an AlanY1an merged commit 4bc7ddd into main Apr 29, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant