test(memory_eval): comprehensive memory subsystem fixture suite by AlanY1an · Pull Request #5 · AlanY1an/echovessel

AlanY1an · 2026-04-29T23:21:19Z

Summary

Adds 28 new YAML fixtures and 14 new invariant fields covering every memory layer (L1–L6), consolidation phases A–F (phase G is not covered), retrieval scoring, and forget. In total, 33 fixtures are discoverable, and the harness itself has 52 unit tests.
Splits tests/memory_eval/harness.py (had grown to 1001 lines) into focused modules: schema.py / runner.py / invariants.py / embedders.py, with a thin re-export shim preserving existing imports.
Adds a "..." placeholder sentinel on persona turns that triggers the production assemble_turn, so L1 fixtures actually exercise the system-prompt rendering path; adds a per-turn channel field so retrieve_cross_channel_no_filter truly tests the cross-channel iron rule.

Branch only adds tests. Zero src/ changes.

What's covered

Layer	Fixtures	Behaviors
L1 core blocks	5	persona/user/style block reaches system prompt, never-auto-update invariant, location/timezone in PersonaFactsView
L2 recall + FTS	2	ingest count, FTS fallback for literal phrase
L3 extraction	9	identity disclosure, vocative recognition, persona-led blocked, event_time anchor, persona commitment subject, trivial gate, plus the existing e1–e4
L4 reflection	3	TIMER abstraction (e5), SHOCK reflection, hard-limit suppression
L5 entities	4	alias dedup, embedding dedup, ambiguous-keep-separate, anchor retrieve bonus
L6 mood	2	mood-changes-after-shock, neutral session preserves mood
Retrieval scoring	6	impact boost, relational bonus, FTS fallback, pinned force-load, cross-channel, plus e6
Forget	2	orphan keeps thoughts, cascade removes thoughts

Full row-by-row matrix in tests/memory_eval/COVERAGE.md.

Test plan

All 47 memory_eval unit tests pass
Broader test suite still green (1446 passed)
ruff check tests/ src/ clean
lint-imports 4/4 contracts kept
Single fixture under live LLM passes (l3_trivial_session_skipped)
Full live-LLM pass run: 12 pass, 1 xpass, 20 fail. Failures are categorized in the resume notes — most are real production behaviors that the suite is now positioned to track when memory work continues.

Notes

Each commit is conventional-commits-prefixed and contains one logical change, so git log reads as a tutorial on why each capability was added.
harness.py becomes a re-export shim; existing imports in test_eval_fixtures.py, test_harness_seed.py, test_check_invariants.py, and synthesize.py continue to work without modification.
One fixture (l5_entity_anchored_retrieve_bonus) is marked xfail because seed-event entity resolution wasn't expected to fire — under live LLM it actually passed (xpassed), so the marker can be dropped in a follow-up.

Vector hits and FTS fallback rows now share a single retrieved list, each tagged with source ('vector' | 'fts') so callers can disambiguate without inspecting the relevance score. FTS rows carry relevance 0.0 as a sentinel — there is no vector score to report. This unblocks an L2 fixture where the query word lives only in raw recall_messages and the existing top_k_must_contain_descriptions_all invariant needs to match against the FTS-surfaced row.
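The merged list described above can be sketched as follows. This is a minimal illustration, not the harness code: `RetrievedRow`, `merge_retrieved`, and `contains_all` are hypothetical names, and only the `source` tag values and the 0.0 relevance sentinel come from the commit message.

```python
from dataclasses import dataclass
from typing import List, Literal, Sequence, Tuple

# Sentinel from the commit message: FTS rows have no vector score to report.
FTS_RELEVANCE_SENTINEL = 0.0


@dataclass(frozen=True)
class RetrievedRow:
    """Hypothetical row shape; the real harness type may differ."""
    description: str
    relevance: float
    source: Literal["vector", "fts"]


def merge_retrieved(
    vector_hits: Sequence[Tuple[str, float]],
    fts_rows: Sequence[str],
) -> List[RetrievedRow]:
    """Put vector hits and FTS fallback rows into one tagged list."""
    merged = [RetrievedRow(desc, score, "vector") for desc, score in vector_hits]
    merged.extend(
        RetrievedRow(desc, FTS_RELEVANCE_SENTINEL, "fts") for desc in fts_rows
    )
    return merged


def contains_all(rows: Sequence[RetrievedRow], phrases: Sequence[str]) -> bool:
    """Assumed matching rule: every phrase appears in some row's description."""
    return all(any(p in row.description for row in rows) for p in phrases)
```

Because callers check `source` rather than the score, a vector hit that happens to score 0.0 is never mistaken for an FTS row.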

Two scripted fixtures exercise the L2 layer end-to-end: - l2_ingest_writes_recall: 4-turn session must produce exactly 4 recall_messages rows. - l2_fts_finds_literal_phrase: a query word that appears only in raw recall_messages (never extracted as an event) must surface via the FTS fallback path.

L6 mood fixtures need to seed Persona.episodic_state before turns run so the before/after snapshot can prove the session changed (or did not change) the mood. Wires the field through load_fixture and into the Persona row constructor.

…bedders

harness.py grew to 1001 lines covering four concerns: dataclass schemas, LLM/embedder wiring, the end-to-end runner, and the invariant checker. Reviewers in tasks 6/9/10 flagged it for split.

schema.py owns the Fixture dataclasses + load_fixture/discover_fixtures.
embedders.py owns build_live_llm + build_eval_embedder + keyword_embedder.
runner.py owns run_fixture + render_evidence + the dict-serialise helpers.
invariants.py owns check_invariants.

harness.py is now a 46-line shim that re-exports the public surface so existing test imports keep working.

The README ships a quick-start, fixture-add workflow, module-layout note for the post-split harness, and a comparison table against tests/eval so readers pick the right suite. Status legend in COVERAGE.md tightens its definition of '✅' to mean verified against a live LLM, with '➕' covering fixtures that exist but haven't yet been live-LLM-checked.

The runner used to hardcode channel="web" on every ingest_message call, so the cross-channel fixture (the regression line for the D4 iron rule) was ingesting and retrieving from the same channel and would have kept passing silently if a regression introduced a channel filter. FixtureTurn / SeedEvent now carry an optional channel field that load_fixture parses from the YAML, and run_fixture threads it through to ingest_message. retrieve_cross_channel_no_filter.yaml seeds the event under discord and queries channel-blind so the invariant exercises the no-filter contract.
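A rough sketch of the channel threading, under stated assumptions: the `FixtureTurn` fields other than `channel`, the `parse_turn` helper, the fallback to "web" when the YAML omits the field, and the keyword signature of `ingest_message` are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Assumption: turns without an explicit channel keep the old "web" behaviour.
DEFAULT_CHANNEL = "web"


@dataclass
class FixtureTurn:
    role: str
    content: str
    channel: Optional[str] = None  # optional per-turn override from the YAML


def parse_turn(raw: dict) -> FixtureTurn:
    """Build a turn from one YAML mapping; 'channel' may be absent."""
    return FixtureTurn(
        role=raw["role"],
        content=raw["content"],
        channel=raw.get("channel"),
    )


def ingest_turns(turns: Iterable[FixtureTurn], ingest_message: Callable[..., None]) -> None:
    """Pass each turn's channel to the ingest call instead of hardcoding it."""
    for turn in turns:
        ingest_message(
            role=turn.role,
            content=turn.content,
            channel=turn.channel or DEFAULT_CHANNEL,
        )
```

With this shape, a fixture can write under discord and then query without a channel argument, so a reintroduced filter would make the fixture fail rather than pass by accident.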

…ceholder turns

L1 fixtures use "..." as the persona content because their intent is "verify the seeded core block reaches the system prompt and influences the persona's reply". The runner, however, used to ingest "..." verbatim, so consolidation had no real persona text to extract from and the judge prompts saw evidence that contradicted the fixture's premise.

When a persona turn's content is "...", run_fixture now builds a TurnContext from the existing engine + backend + embed_fn and calls the production assemble_turn (the same path web/Discord traffic uses in production), then leaves the LLM-generated reply in L2. The sentinel string is documented as PERSONA_GENERATE_SENTINEL on the runner so the contract is greppable. assemble_turn ingests both the user message and the persona reply, so the runner skips the explicit ingest_message for the matched user turn and advances by 2.
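The advance-by-2 behaviour can be illustrated with a simplified loop. Only PERSONA_GENERATE_SENTINEL and the two function names come from the commit message; the dict-based turn shape and the callable signatures are assumptions, and the real assemble_turn takes a TurnContext rather than a bare string.

```python
from typing import Callable, List

PERSONA_GENERATE_SENTINEL = "..."


def run_turns(
    turns: List[dict],
    ingest_message: Callable[[str, str], None],
    assemble_turn: Callable[[str], None],
) -> None:
    """Walk the scripted turns, delegating sentinel pairs to assemble_turn."""
    i = 0
    while i < len(turns):
        turn = turns[i]
        nxt = turns[i + 1] if i + 1 < len(turns) else None
        is_generated_pair = (
            turn["role"] == "user"
            and nxt is not None
            and nxt["role"] == "persona"
            and nxt["content"] == PERSONA_GENERATE_SENTINEL
        )
        if is_generated_pair:
            # assemble_turn is assumed to ingest both the user message and
            # the generated reply, so neither turn is ingested explicitly.
            assemble_turn(turn["content"])
            i += 2
            continue
        # Scripted turn: ingest verbatim, as before.
        ingest_message(turn["role"], turn["content"])
        i += 1
```

In this sketch a sentinel persona turn with no preceding user turn falls through to the verbatim path; how the real runner treats that case is not stated in the source.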

AlanY1an added 30 commits April 29, 2026 16:07

chore: ignore .worktrees/ for git worktree workflow

729077b

docs(memory_eval): add coverage matrix

dd0baa2

test(memory_eval): scaffold check_invariants unit tests

e133651

feat(memory_eval): add must_have_event_time invariant

519df19

feat(memory_eval): add must_have_subject_any invariant

1517764

feat(memory_eval): add must_have_concept_type_any invariant

6fad062

feat(memory_eval): add forbidden_descriptions_contain_none invariant

6ad3072

feat(memory_eval): add entity_count_eq and entity_count_max invariants

7cf9872

feat(memory_eval): add entity_merge_status_eq invariant

887cabb

feat(memory_eval): add recall_message_count_eq invariant

24133ea

feat(memory_eval): add core_block_count_unchanged invariant

5b23eb2

feat(memory_eval): add top_k_must_contain_descriptions_all invariant

94b0297

feat(memory_eval): add top_k_must_not_contain_descriptions_any invariant

36583e9

feat(memory_eval): add episodic_state_mood_changed invariant

68e4a75

feat(memory_eval): add episodic_state_mood_unchanged invariant

2ffb772

style(memory_eval): rename forbidden phrase loop var to avoid shadowi…

6e4b5d9

…ng f

fix(memory_eval): default top_k_for_check to full retrieved length

b208af0

refactor(memory_eval): rename core_block_count_unchanged to core_bloc…

4bcbe1f

…ks_unchanged

style(memory_eval): rename top_k_must_not_contain phrase loop var

68c831a

feat(memory_eval): add style_block and persona facts to FixtureSeed

9725bd3

test(memory_eval): add L1 core-block fixtures

6ffb471

refactor(memory_eval): drop dead BlockLabel.MOOD plumbing from harness

2c9a64a

refactor(memory_eval): drop unused FixtureSeed legacy fields

7ca40c1

test(memory_eval): add L3 extraction-quality fixtures

ad45611

feat(memory_eval): add seed_thoughts to FixtureSeed

0467918

feat(memory_eval): add thoughts_max invariant

e3d8819

test(memory_eval): add L4 reflection fixtures

22f3b53

test(memory_eval): add L5 entity fixtures

8196ee3

AlanY1an added 15 commits April 29, 2026 17:43

test(memory_eval): mark l5_entity_anchored_retrieve_bonus xfail

fde6441

fix(memory_eval): tighten entity_merge_status_eq for duplicate names

dd66fb8

The dict-comprehension lookup overwrote duplicates so a wrong-status match silently slipped through whenever two entities shared a canonical_name. Group matches by name and require all to satisfy the wanted status.
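The overwrite problem and the grouped fix can be reproduced in a few lines. This is a sketch under assumptions: entities are modelled as plain dicts, and the `merge_status` key and the status strings are hypothetical stand-ins for whatever the harness actually stores.

```python
from collections import defaultdict
from typing import Dict, List


def overwrite_check(entities: List[dict], wanted: Dict[str, str]) -> bool:
    """Old shape: a dict comprehension keeps only the last entity per name."""
    by_name = {e["canonical_name"]: e for e in entities}
    return all(
        name in by_name and by_name[name]["merge_status"] == status
        for name, status in wanted.items()
    )


def grouped_check(entities: List[dict], wanted: Dict[str, str]) -> bool:
    """New shape: group by name and require every match to have the status."""
    by_name: Dict[str, List[dict]] = defaultdict(list)
    for e in entities:
        by_name[e["canonical_name"]].append(e)
    for name, status in wanted.items():
        matches = by_name.get(name, [])
        if not matches or any(e["merge_status"] != status for e in matches):
            return False
    return True


entities = [
    {"canonical_name": "Mochi", "merge_status": "uncertain"},
    {"canonical_name": "Mochi", "merge_status": "confirmed"},
]
print(overwrite_check(entities, {"Mochi": "confirmed"}))  # True: first row was overwritten
print(grouped_check(entities, {"Mochi": "confirmed"}))    # False: one row has the wrong status
```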

test(memory_eval): add L6 mood fixtures

dadba22

Two scripted fixtures cover the L6 episodic_state contract: a heavy disclosure must shift mood off its seeded baseline, and a banal chit-chat session must leave it untouched.
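The before/after comparison behind the two mood invariants might look like the sketch below. It assumes episodic_state is a dict with a `mood` key and that a check returns a list of failure messages; both are illustrative guesses, not the harness's actual shapes.

```python
from typing import List, Optional


def mood_invariant_failures(
    before: Optional[dict],
    after: Optional[dict],
    *,
    expect_changed: bool,
) -> List[str]:
    """Compare the seeded snapshot with the post-session snapshot."""
    before_mood = (before or {}).get("mood")
    after_mood = (after or {}).get("mood")
    changed = before_mood != after_mood
    if expect_changed and not changed:
        return [f"mood stayed at {before_mood!r} but was expected to change"]
    if not expect_changed and changed:
        return [f"mood moved from {before_mood!r} to {after_mood!r} but was expected to stay"]
    return []
```

Seeding the baseline before any turn runs is what makes the comparison meaningful: without a known starting mood, an unchanged value could not be told apart from a missing one.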

refactor(memory_eval): rename FTS fallback fixture to retrieve_ prefix

ce40189

feat(memory_eval): add force_load_user_thoughts to FixtureRetrieve

e322e22

test(memory_eval): add retrieval scoring fixtures

8b6ad34

feat(memory_eval): add post_consolidate_actions runner for delete fix…

89950d1

…tures

feat(memory_eval): add filling spec to SeedThought for parent linkage

5a7e1e5

test(memory_eval): add forget orphan/cascade fixtures

c7e7c82

docs(memory_eval): add e8_bilingual row to COVERAGE.md

e5b97de

AlanY1an merged commit 4bc7ddd into main Apr 29, 2026
6 checks passed

AlanY1an merged 45 commits into main from test/memory-eval-suite
