test(memory_eval): comprehensive memory subsystem fixture suite#5
Merged
Conversation
Vector hits and FTS fallback rows now share a single retrieved list,
each tagged with source ('vector' | 'fts') so callers can disambiguate
without inspecting the relevance score. FTS rows carry relevance 0.0
as a sentinel — there is no vector score to report. This unblocks an
L2 fixture where the query word lives only in raw recall_messages and
the existing top_k_must_contain_descriptions_all invariant needs to
match against the FTS-surfaced row.
Two scripted fixtures exercise the L2 layer end-to-end: - l2_ingest_writes_recall: 4-turn session must produce exactly 4 recall_messages rows. - l2_fts_finds_literal_phrase: a query word that appears only in raw recall_messages (never extracted as an event) must surface via the FTS fallback path.
The dict-comprehension lookup overwrote duplicates so a wrong-status match silently slipped through whenever two entities shared a canonical_name. Group matches by name and require all to satisfy the wanted status.
L6 mood fixtures need to seed Persona.episodic_state before turns run so the before/after snapshot can prove the session changed (or did not change) the mood. Wires the field through load_fixture and into the Persona row constructor.
Two scripted fixtures cover the L6 episodic_state contract: a heavy disclosure must shift mood off its seeded baseline, and a banal chit-chat session must leave it untouched.
…bedders harness.py grew to 1001 lines covering four concerns: dataclass schemas, LLM/embedder wiring, the end-to-end runner, and the invariant checker. Reviewers in tasks 6/9/10 flagged it for split. schema.py owns the Fixture dataclasses + load_fixture/discover_fixtures. embedders.py owns build_live_llm + build_eval_embedder + keyword_embedder. runner.py owns run_fixture + render_evidence + the dict-serialise helpers. invariants.py owns check_invariants. harness.py is now a 46-line shim that re-exports the public surface so existing test imports keep working.
The README ships a quick-start, fixture-add workflow, module-layout note for the post-split harness, and a comparison table against tests/eval so readers pick the right suite. Status legend in COVERAGE.md tightens its definition of '✅' to mean verified against a live LLM, with '➕' covering fixtures that exist but haven't yet been live-LLM-checked.
The runner used to hardcode channel="web" on every ingest_message call, so the cross-channel fixture (regression line for the D4 铁律) was ingesting and retrieving from the same channel and would silently keep passing if a channel filter regressed in. FixtureTurn / SeedEvent now carry an optional channel field that load_fixture parses from the YAML, and run_fixture threads through ingest_message. retrieve_cross_channel_no_filter.yaml seeds the event under discord and queries channel-blind so the invariant exercises the no-filter contract.
…ceholder turns L1 fixtures use "..." as the persona content because their intent is "verify the seeded core block reaches the system prompt and influences the persona's reply" — but the runner used to ingest "..." verbatim, so consolidation had no real persona text to extract from and the judge prompts saw evidence that contradicted the fixture's premise. When a persona turn's content is "...", run_fixture now builds a TurnContext from the existing engine + backend + embed_fn and calls the production assemble_turn (the same path web/Discord traffic uses in production), then leaves the LLM-generated reply in L2. The sentinel string is documented as PERSONA_GENERATE_SENTINEL on the runner so the contract is greppable. assemble_turn ingests both the user message and the persona reply, so the runner skips the explicit ingest_message for the matched user turn and advances by 2.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tests/memory_eval/harness.py(had grown to 1001 lines) into focused modules:schema.py/runner.py/invariants.py/embedders.py, with a thin re-export shim preserving existing imports."..."placeholder sentinel on persona turns that triggers the productionassemble_turnso L1 fixtures actually exercise the system-prompt rendering path; adds a per-turnchannelfield soretrieve_cross_channel_no_filtertruly tests the cross-channel铁律.Branch only adds tests. Zero
src/changes.What's covered
Full row-by-row matrix in
tests/memory_eval/COVERAGE.md.Test plan
ruff check tests/ src/cleanlint-imports4/4 contracts keptl3_trivial_session_skipped)Notes
git logreads as a tutorial of what each capability was added for.harness.pybecomes a re-export shim; existing imports intest_eval_fixtures.py,test_harness_seed.py,test_check_invariants.py, andsynthesize.pycontinue to work without modification.l5_entity_anchored_retrieve_bonus) is markedxfailbecause seed-event entity resolution wasn't expected to fire — under live LLM it actually passed (xpassed), so the marker can be dropped in a follow-up.