feat(proactive_eval): real-LLM eval suite + Tier 1 (28 fixtures · 18 PASS / 10 XFAIL) by AlanY1an · Pull Request #9 · AlanY1an/echovessel

AlanY1an · 2026-05-05T08:51:32Z

Summary

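Adds tests/proactive_eval/, following the tests/memory_eval/ pattern: real LLM + mock clock + scheduler dispatch + audit + LLM-as-judge. Tier 1 has 28 fixtures covering all 8 categories of the PART F advance_hours table, all 5 gates, the 4 follow_up_at computation rules, explicit suppression, multi-event sessions, the supersede loop, and engagement bypass.

The first goal was to lock the real 2026-05-03 dogfood scenario (the user says "I'm moving to Houston next week", and PART F never got follow_up_at into the DB) into a permanent regression fixture. Running it end to end during debugging showed that this was a production wiring bug, which is fixed on this branch as well.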

Live-LLM run results

====== 18 passed, 10 xfailed, 13 deselected · run 8 (DeepSeek-V4-pro/flash) ======

Every xfail carries a concrete reason (PART F prompt judgment gap · multi-phase scheduling tuning · LLM advance_hours non-determinism); none of them is a "deal with it later" placeholder.

Production fixes shipped on this branch

Getting the eval suite to run end to end exposed and fixed 4 production bugs, plus 1 audit observation:

Phase B parser disconnect (d0125a6): PR #8 (feat(proactive): memory-driven v3) added the concept_nodes.follow_up_at schema and the PART F section of EXTRACTION_SYSTEM_PROMPT, but the RawExtractedEvent schema and _parse_event never read the LLM's PART F output, so every concept_nodes.follow_up_at stayed NULL. This is the actual root cause of the Houston dogfood bug. Fix: the parser and wiring gain 5 fields (follow_up_at / follow_up_hint / estimated_arc_days / advance_pre_hours / advance_post_hours), plus 9 unit tests.
DeepSeek V4 reasoning budget (b5b78f1 / 0db5050 / 2618fea / 3f556ee): V4 enables thinking by default and the reasoning tokens count against the max_tokens budget, which truncated the Phase B JSON. Adds thinking_enabled: bool | None to the LLMProvider.complete abstraction; each provider translates it to its native API (DeepSeek extra_body, Anthropic thinking, OpenAI no-op). The decision is made per call site (extract / judge turn thinking off; reflect / proactive / slow_cycle leave it on by default). A new [llm.thinking] config section lets users override this. max_tokens for the creative calls is bumped to match (V4 reasoning headroom).
FollowUpScheduler._shutdown was not reset in start() (cb70f13): after a daemon restart, or in a multi-cycle scheduler, the second start() silently no-op'd. Fixed, with a new unit test.
The tentative Phase A trivial gate was overly aggressive in tests (fed0882): the runner now passes trivial_message_count=0 to bypass it.
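
To make the Phase B fix concrete, here is a minimal sketch of tolerant parsing for the five PART F fields. It is not the code from d0125a6: the field names come from the list above, while `ParsedFollowUp`, `parse_part_f`, the field types, and the ISO-8601 timestamp format are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass
class ParsedFollowUp:
    """Hypothetical container for the five PART F fields named in this PR."""
    follow_up_at: Optional[datetime] = None
    follow_up_hint: Optional[str] = None
    estimated_arc_days: Optional[int] = None
    advance_pre_hours: Optional[float] = None
    advance_post_hours: Optional[float] = None


def _opt_number(value: Any, cast):
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return cast(value)


def parse_part_f(raw: dict) -> ParsedFollowUp:
    """Read the PART F keys from one extracted-event dict; bad values become None."""
    parsed_at = None
    at = raw.get("follow_up_at")
    if isinstance(at, str):
        try:
            parsed_at = datetime.fromisoformat(at)  # assumed timestamp format
        except ValueError:
            parsed_at = None
    hint = raw.get("follow_up_hint")
    return ParsedFollowUp(
        follow_up_at=parsed_at,
        follow_up_hint=hint if isinstance(hint, str) and hint.strip() else None,
        estimated_arc_days=_opt_number(raw.get("estimated_arc_days"), int),
        advance_pre_hours=_opt_number(raw.get("advance_pre_hours"), float),
        advance_post_hours=_opt_number(raw.get("advance_post_hours"), float),
    )
```

The design point is that a missing or malformed field degrades to None instead of raising, so one bad PART F value does not discard the rest of the extracted event.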

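Observation (not fixed; separate follow-up):

SQLiteAuditSink.update_latest does not rewrite decision.message_text, so the audit row always has message_text='' even when send_ok=True, and the admin "proactive message history" tab shows empty messages. As a fallback, the runner reads FakeChannel.sent instead.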

Architecture

Runner 流程：

1. :memory: SQLite + create_all_tables (v0.7 migration)
2. seed PersonaProfile / ProactiveState / pre-existing proactive_decisions / in_flight_turn predicate
3. ingest fixture.turns + mark_session_closing(trigger="eval") (skip 10min idle wait + Phase A trivial gate)
4. await consolidate_session(extract_fn=make_extract_fn(real_llm), observer=MemoryFollowUpObserver)
5. per-phase dispatch sorted by fire_at_clock:
   - mock_now = phase.fire_at_clock
   - follow_up_scheduler.start() · re-arms timers
   - poll for first audit row, then stop follow_up_scheduler (prevents rapid-fire reschedule)
   - poll for outcome settlement (LLM call can take 30s+ on V4 thinking)
   - collect audit rows + FakeChannel.sent messages
6. (optional) follow_up_stage second consolidate (supersede scenarios)
7. check_invariants (hard) + judge_prompts (LLM-as-judge soft)

Each fixture YAML declares seed (PersonaProfile / ProactiveState / pre-existing decisions / in_flight_turn predicate / max_per_24h), turns, expect.event_extracted, expect.phases, an optional follow_up_stage, optional judge_prompts, and an optional xfail.
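
As a rough picture of that shape, the sketch below loads an already-parsed fixture document into typed records and sorts phases by fire_at_clock, as the runner flow requires. The top-level keys and fire_at_clock come from this description; `Phase`, `Fixture`, `load_fixture`, the `expected` catch-all, and the ISO timestamp format are hypothetical and do not reflect the real loader in 810233e.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional


@dataclass
class Phase:
    fire_at_clock: datetime  # phases are dispatched in this order
    expected: dict           # hypothetical: every other per-phase key


@dataclass
class Fixture:
    seed: dict
    turns: list
    event_extracted: Any
    phases: list
    follow_up_stage: Optional[dict] = None
    judge_prompts: list = field(default_factory=list)
    xfail: Optional[str] = None


def load_fixture(doc: dict) -> Fixture:
    """Build a Fixture from a dict (for example, the result of parsing the YAML)."""
    expect = doc["expect"]
    phases = [
        Phase(
            fire_at_clock=datetime.fromisoformat(p["fire_at_clock"]),
            expected={k: v for k, v in p.items() if k != "fire_at_clock"},
        )
        for p in expect["phases"]
    ]
    phases.sort(key=lambda p: p.fire_at_clock)
    return Fixture(
        seed=doc.get("seed", {}),
        turns=doc["turns"],
        event_extracted=expect["event_extracted"],
        phases=phases,
        follow_up_stage=doc.get("follow_up_stage"),
        judge_prompts=doc.get("judge_prompts", []),
        xfail=doc.get("xfail"),
    )
```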

Coverage matrix · 28 fixtures

Code surface	Status
PART F advance_hours table, 8 categories	5 PASS · 5 XFAIL (LLM judgment)
PART F follow_up_at, 4 rules	4 PASS
PART F explicit suppression	PASS
PART F low-value / reverse	3 PASS · 1 XFAIL (LLM over-eager)
PART F vague dates	1 PASS · 1 XFAIL
Third-party commitment	PASS
Multi-event	PASS
Supersede loop	XFAIL (LLM tags superseded_event_ids unreliably)
5-gate (quiet/forbidden/in_flight/rate_limit/engagement)	4 PASS · 1 XFAIL
Engagement bypass	2 PASS (vulnerability + commitment)
Generation quality (judge)	4 PASS judge prompts on PASS fixtures

Test plan

uv run pytest -m eval_proactive · 28 fixtures · run 8: 18 PASS / 10 XFAIL · live DeepSeek-V4 · 17 min wall-clock · ~$3-8 LLM cost
uv run pytest -q · default suite · 1623 passed · +12 new tests over baseline (4 openai_compat thinking · 9 PART F parser · others schema/invariants/runner_smoke)
uv run ruff check src/ tests/ · clean
uv run ruff format --check · clean on touched files
uv run lint-imports · 4 contracts kept

Out of scope (followup PRs)

Tier 2 / Tier 3 fixtures (12 + 6 deferred)
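Clean up obsolete fields in the [proactive] section of config.toml.sample
Persist message_text in SQLiteAuditSink.update_latest
PART F prompt improvements: surgery / doctor visit / starting a new job / supersede tagging, each as its own independent init
nightly cron configuration
Remove the xfail markers from the xfailed fixtures (once the PART F prompt is fixed)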

Branch commits (23)

ac1acfc chore(pytest): add eval_proactive mark
810233e feat(proactive_eval): fixture schema + YAML loader
4789d66 feat(proactive_eval): hard-invariant checker
cb35426 feat(proactive_eval): runner with seed + consolidate + phase dispatch
cb70f13 fix(proactive): reset _shutdown in FollowUpScheduler.start() so multi-cycle daemons re-arm
65b6591 refactor(proactive_eval): try/finally + poll-until-grew + seed clock consistency
c8b38ad feat(proactive_eval): harness + parametric pytest runner
2299d46 docs(proactive_eval): README + Tier 1/2/3 coverage matrix
825f16b test(proactive_eval): Tier 1 · 28 fixture YAMLs
b5b78f1 feat(llm): add thinking_enabled to LLMProvider Protocol + StubProvider no-op
0db5050 feat(llm): translate thinking_enabled to DeepSeek extra_body in OpenAICompatibleProvider
2618fea feat(llm): translate thinking_enabled to Anthropic thinking param
3f556ee feat(llm): config [llm.thinking] knob + factory kwargs and call site wiring
fed0882 fix(proactive_eval): bypass Phase A trivial gate so all fixture sessions flow through Phase B
d0125a6 fix(prompts): wire PART F follow_up_at + advance_hours through extraction parser
6999b10 fix(proactive_eval): action translation + max_per_24h knob + YAML on quoting + invariant edge case
d42e46a fix(proactive_eval): poll until audit row outcome is settled (LLM dispatch can take 30s+)
b9554ee fix(proactive_eval): stop follow_up_scheduler after first audit to prevent rapid-fire reschedule
94bf89d test(proactive_eval): tune fire_at_clock to safely past natural windows + travel xfail
92ce5c4 fix(proactive_eval): read generated messages from FakeChannel.sent (audit message_text not persisted)
9009b8e test(proactive_eval): xfail 7 fixtures with concrete PART F + scheduling reasons
10068e7 docs(proactive_eval): record Tier 1 results · 18 PASS · 10 XFAIL · note 4 production fixes shipped
b588fbe style(prompts): ruff format

cb70f13: start() previously left _shutdown sticky from a prior stop(), so any subsequent _schedule_next early-returned and the scheduler silently no-op'd. The module docstring already promises 'On daemon restart start() re-initialises timers from DB state'; this change matches that intent and unblocks per-phase eval runner loops that cycle start/stop.
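
The sticky-flag failure mode is easy to reproduce in isolation. The class below is a hypothetical stand-in, not FollowUpScheduler itself: only the `_shutdown` flag, `start()`, `stop()`, and `_schedule_next` names mirror this PR, and the `armed` counter exists purely to make the behaviour observable.

```python
class TinyScheduler:
    """Hypothetical stand-in for a start/stop scheduler with a shutdown flag."""

    def __init__(self) -> None:
        self._shutdown = False
        self.armed = 0  # how many times a timer was (re-)armed

    def start(self) -> None:
        # The fix: clear the flag left behind by a previous stop().
        self._shutdown = False
        self._schedule_next()

    def stop(self) -> None:
        self._shutdown = True

    def _schedule_next(self) -> None:
        if self._shutdown:
            # Without the reset in start(), every later cycle would end here.
            return
        self.armed += 1
```

With the reset in place, a start / stop / start cycle arms a timer twice; removing the first line of `start()` would leave the second cycle a silent no-op.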

65b6591: Three reviewer-flagged improvements to the runner:
- Wrap the proactive_scheduler lifecycle in try/finally so a raise during consolidate / phase dispatch / supersede stage can't leak the scheduler task.
- Replace the bare 0.5s sleep in the per-phase loop with a poll-until-grew loop (5s budget). The fixed sleep was a wall-clock race against the timer's own asyncio.sleep when fire_at_clock lands just before natural fire time.
- Pass mock_now to _seed_proactive_state instead of datetime.now() so seeded last_updated matches the fixture clock, not real wall time.
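
A poll-until-grew loop of the kind described here might look like the sketch below. The helper name, its parameters, and the 50 ms interval are assumptions; only the idea (poll a row count against a baseline under a 5s budget instead of sleeping a fixed 0.5s) comes from the commit message.

```python
import asyncio
import time
from typing import Callable


async def poll_until_grew(
    count: Callable[[], int],
    baseline: int,
    budget_s: float = 5.0,
    interval_s: float = 0.05,
) -> bool:
    """Return True once count() exceeds baseline, False if the budget expires."""
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        if count() > baseline:
            return True
        # Yield to the event loop so the scheduler's own timer can fire.
        await asyncio.sleep(interval_s)
    return count() > baseline
```

Polling returns as soon as the audit row appears and tolerates slow dispatches up to the budget, whereas a fixed sleep is either too short (flaky) or too long (slow) for any given run.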

825f16b: Initial fixture batch covering:
- A · advance_hours table (10 categories): moving_houston (xfail) · interview · surgery · medical_routine · travel · deadline_paper · celebration · reminder · vague_date_anchor · commitment_third_party
- A semantics (4): commitment_user · unresolved_short_window · ongoing_check_n · fuzzy_future_check
- B reverse (4): low_value_future · historical · trivial · user_explicit_decline
- C multi/supersede/sensitive (4): multi_event · supersede_blocks_post · vulnerability · medical_sensitive_tone
- D · 5 gates + 1 bypass (6): quiet_hours · forbidden_topic · in_flight_turn · rate_limit · engagement_low · engagement_bypass_commitment

Initial expectation ranges are best guesses. Stage C will tune them against real DeepSeek-V4 LLM output, applying remedy A (widen range) / B (xfail with concrete reason) / C (fix runner) per fixture.

3f556ee: Adds ThinkingSection to LLMSection, threads thinking_enabled through make_extract_fn / make_reflect_fn / make_slow_cycle_fn / make_proactive_fn / make_judge_fn, and wires the runtime config into the factory call sites in runtime/app.py. Default per-call-site behaviour:
- extract / judge: thinking_enabled=False (mechanical reformat)
- reflect / proactive / slow_cycle: thinking_enabled=None (provider default)

The max_tokens bumps for the thinking-on call sites give V4 reasoning + visible output the headroom they need (reflect 800->4096, proactive 400->2048, slow_cycle 1024->4096). The offline tests/memory_eval/judge.py and the extract/judge factories pin thinking_enabled=False so V4 stops burning the token budget on hidden reasoning tokens for strict-JSON outputs.
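
The tri-state resolution can be sketched as a lookup with an optional user override. The default values are the ones stated in this commit message; `CALL_SITE_DEFAULTS`, `resolve_thinking`, and the shape of the override mapping are hypothetical, and the provider-specific translation (DeepSeek extra_body, Anthropic thinking) is deliberately left out.

```python
from typing import Mapping, Optional

# Per-call-site defaults as described in this PR:
# False = force thinking off, None = defer to the provider default.
CALL_SITE_DEFAULTS = {
    "extract": False,
    "judge": False,
    "reflect": None,
    "proactive": None,
    "slow_cycle": None,
}


def resolve_thinking(
    call_site: str,
    overrides: Mapping[str, Optional[bool]],
) -> Optional[bool]:
    """Return True/False to force thinking on/off, or None for the provider default.

    `overrides` stands in for values read from a user config section;
    an absent or None entry falls back to the call-site default.
    """
    override = overrides.get(call_site)
    if override is not None:
        return override
    return CALL_SITE_DEFAULTS[call_site]
```

Keeping None distinct from False matters: None means "send nothing and let the provider decide", which is what lets providers without a thinking switch treat the parameter as a no-op.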

AlanY1an added 23 commits May 5, 2026 01:48

AlanY1an merged commit a25dda4 into main May 6, 2026
6 checks passed

AlanY1an deleted the feat/proactive-eval-skeleton branch May 6, 2026 23:20
