feat(proactive_eval): real-LLM eval suite + Tier 1 (28 fixtures · 18 PASS / 10 XFAIL)#9

Merged: AlanY1an merged 23 commits into main from feat/proactive-eval-skeleton on May 6, 2026

@AlanY1an AlanY1an commented May 5, 2026

Summary

Adds tests/proactive_eval/, following the tests/memory_eval/ pattern: real LLM + mock clock + scheduler dispatch + audit + LLM-as-judge. Tier 1 has 28 fixtures covering all 8 PART F advance_hours categories, all 5 gates, the 4 follow_up_at computation rules, explicit suppression, multiple events, the supersede closed loop, and engagement bypass.

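The first goal is to lock the real dogfood scenario from 2026-05-03 (the user said "moving to Houston next week" and PART F never got follow_up_at into the DB) into a permanent regression fixture. Running it end to end while debugging showed that this was a production wiring bug, which is fixed on this branch as well.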

Live-LLM run results

====== 18 passed, 10 xfailed, 13 deselected · run 8 (DeepSeek-V4-pro/flash) ======

Every xfail carries a concrete reason (PART F prompt judgment gap · multi-phase scheduling tuning · LLM advance_hours non-determinism); none of them is a "deal with it later" placeholder.

Production fixes shipped on this branch

Getting the eval suite to run exposed and fixed 4 production bugs, plus 1 audit observation:

  1. Phase B parser layer disconnected (d0125a6): PR #8 (feat(proactive): memory-driven v3) added the concept_nodes.follow_up_at schema and the PART F section of EXTRACTION_SYSTEM_PROMPT, but the RawExtractedEvent schema and _parse_event never read the LLM's PART F output, so every concept_nodes.follow_up_at stayed NULL. This is the real root cause of the Houston dogfood bug. Fix: parser + wiring add 5 fields (follow_up_at / follow_up_hint / estimated_arc_days / advance_pre_hours / advance_post_hours) + 9 unit tests.
  2. DeepSeek V4 reasoning budget (b5b78f1 / 0db5050 / 2618fea / 3f556ee): V4 enables thinking by default and the reasoning consumes the max_tokens budget, which truncated the Phase B JSON. Adds thinking_enabled: bool | None to the LLMProvider.complete abstraction; each provider translates it to its native API (DeepSeek extra_body, Anthropic thinking, OpenAI no-op). The decision is made per call site (extract / judge turn thinking off; reflect / proactive / slow_cycle leave it on by default). A new [llm.thinking] config section lets users override this. max_tokens for the creative calls is bumped at the same time (V4 reasoning headroom).
  3. FollowUpScheduler._shutdown not reset in start() (cb70f13): after a daemon restart, or in a multi-cycle scheduler, the second start() silently no-op'd. Fixed, with a new unit test.
  4. Phase A heuristic trivial gate too aggressive in tests (fed0882): fixed by passing trivial_message_count=0 in the runner to bypass it.
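
As a rough illustration of fix 1, the sketch below reads the five PART F fields from one extracted-event dict. It is not the project's _parse_event: the helper name parse_part_f, the PartFFields container, and the field types (ISO-8601 string for follow_up_at, numbers for the hour and day fields) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class PartFFields:
    """The five PART F fields named in the fix; the types are assumptions."""
    follow_up_at: Optional[datetime] = None
    follow_up_hint: Optional[str] = None
    estimated_arc_days: Optional[float] = None
    advance_pre_hours: Optional[float] = None
    advance_post_hours: Optional[float] = None


def _opt_number(value: Any) -> Optional[float]:
    # Accept ints and floats (but not bools); treat anything else as missing.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


def parse_part_f(raw: Dict[str, Any]) -> PartFFields:
    """Hypothetical helper: read the PART F keys from one extracted-event dict."""
    follow_up_at = None
    text = raw.get("follow_up_at")
    if isinstance(text, str) and text:
        try:
            follow_up_at = datetime.fromisoformat(text)
        except ValueError:
            follow_up_at = None  # malformed timestamp: leave the value unset
    hint = raw.get("follow_up_hint")
    return PartFFields(
        follow_up_at=follow_up_at,
        follow_up_hint=hint if isinstance(hint, str) and hint else None,
        estimated_arc_days=_opt_number(raw.get("estimated_arc_days")),
        advance_pre_hours=_opt_number(raw.get("advance_pre_hours")),
        advance_post_hours=_opt_number(raw.get("advance_post_hours")),
    )
```

The point of the sketch is that missing or malformed keys degrade to None instead of raising, so an event without PART F output still parses.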

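Observation (not fixed; separate follow-up):

  • SQLiteAuditSink.update_latest does not rewrite decision.message_text. The audit row always has message_text='' even when send_ok=True, so the admin "proactive message history" tab shows empty messages. The runner falls back to reading FakeChannel.sent.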

Architecture

Runner flow:

1. :memory: SQLite + create_all_tables (v0.7 migration)
2. seed PersonaProfile / ProactiveState / pre-existing proactive_decisions / in_flight_turn predicate
3. ingest fixture.turns + mark_session_closing(trigger="eval") (skip 10min idle wait + Phase A trivial gate)
4. await consolidate_session(extract_fn=make_extract_fn(real_llm), observer=MemoryFollowUpObserver)
5. per-phase dispatch sorted by fire_at_clock:
   - mock_now = phase.fire_at_clock
   - follow_up_scheduler.start() · re-arms timers
   - poll for first audit row, then stop follow_up_scheduler (prevents rapid-fire reschedule)
   - poll for outcome settlement (LLM call can take 30s+ on V4 thinking)
   - collect audit rows + FakeChannel.sent messages
6. (optional) follow_up_stage second consolidate (supersede scenarios)
7. check_invariants (hard) + judge_prompts (LLM-as-judge soft)

Each fixture YAML declares seed (PersonaProfile / ProactiveState / pre-existing decisions / in_flight_turn predicate / max_per_24h), turns, expect.event_extracted, expect.phases, and optionally follow_up_stage, judge_prompts, and xfail.
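
The fixture shape described above could be modelled roughly as follows. This is a sketch only: the Fixture and Phase classes, the load_fixture helper, and the value types are hypothetical, and the input is assumed to be the dict produced by a YAML loader. Only the top-level key names and fire_at_clock come from the description.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class Phase:
    fire_at_clock: datetime  # mock-clock value at which the phase is dispatched
    expect: Dict[str, Any] = field(default_factory=dict)  # assumed free-form


@dataclass
class Fixture:
    seed: Dict[str, Any]
    turns: List[Dict[str, Any]]
    event_extracted: Dict[str, Any]
    phases: List[Phase]
    follow_up_stage: Optional[Dict[str, Any]] = None
    judge_prompts: List[str] = field(default_factory=list)
    xfail: Optional[str] = None  # assumed to hold the concrete xfail reason


def load_fixture(doc: Dict[str, Any]) -> Fixture:
    """Build a Fixture from an already-parsed YAML document (a plain dict)."""
    expect = doc["expect"]
    phases = [
        Phase(
            fire_at_clock=datetime.fromisoformat(p["fire_at_clock"]),
            expect={k: v for k, v in p.items() if k != "fire_at_clock"},
        )
        for p in expect.get("phases", [])
    ]
    # The runner dispatches phases in fire_at_clock order.
    phases.sort(key=lambda p: p.fire_at_clock)
    return Fixture(
        seed=doc.get("seed", {}),
        turns=doc["turns"],
        event_extracted=expect.get("event_extracted", {}),
        phases=phases,
        follow_up_stage=doc.get("follow_up_stage"),
        judge_prompts=list(doc.get("judge_prompts", [])),
        xfail=doc.get("xfail"),
    )
```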

Coverage matrix · 28 fixtures

| Code surface | Status |
| --- | --- |
| PART F advance_hours table (8 categories) | 5 PASS · 5 XFAIL (LLM judgment) |
| PART F follow_up_at (4 rules) | 4 PASS |
| PART F explicit suppression | PASS |
| PART F low-value / reverse | 3 PASS · 1 XFAIL (LLM over-eager) |
| PART F vague time | 1 PASS · 1 XFAIL |
| Third-party commitment | PASS |
| Multiple events | PASS |
| Supersede closed loop | XFAIL (LLM does not reliably tag superseded_event_ids) |
| 5-gate (quiet/forbidden/in_flight/rate_limit/engagement) | 4 PASS · 1 XFAIL |
| Engagement bypass | 2 PASS (vulnerability + commitment) |
| Generation quality (judge) | 4 PASS (judge prompts on PASS fixtures) |

Test plan

  • uv run pytest -m eval_proactive · 28 fixtures · run 8: 18 PASS / 10 XFAIL · live DeepSeek-V4 · 17 min wall-clock · ~$3-8 LLM cost
  • uv run pytest -q · default suite · 1623 passed · +12 new tests over baseline (4 openai_compat thinking · 9 PART F parser · others schema/invariants/runner_smoke)
  • uv run ruff check src/ tests/ · clean
  • uv run ruff format --check · clean on touched files
  • uv run lint-imports · 4 contracts kept

Out of scope (followup PRs)

  • Tier 2 / Tier 3 fixtures (12 + 6 deferred)
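  • config.toml.sample: clean up deprecated fields in the [proactive] section
  • Persist message_text in SQLiteAuditSink.update_latest
  • PART F prompt improvements: surgery / doctor visit / job onboarding / supersede tagging, each as its own independent init
  • nightly cron configuration
  • Remove xfail from the xfailed fixtures (once the PART F prompt is fixed)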

Branch commits (23)

ac1acfc chore(pytest): add eval_proactive mark
810233e feat(proactive_eval): fixture schema + YAML loader
4789d66 feat(proactive_eval): hard-invariant checker
cb35426 feat(proactive_eval): runner with seed + consolidate + phase dispatch
cb70f13 fix(proactive): reset _shutdown in FollowUpScheduler.start() so multi-cycle daemons re-arm
65b6591 refactor(proactive_eval): try/finally + poll-until-grew + seed clock consistency
c8b38ad feat(proactive_eval): harness + parametric pytest runner
2299d46 docs(proactive_eval): README + Tier 1/2/3 coverage matrix
825f16b test(proactive_eval): Tier 1 · 28 fixture YAMLs
b5b78f1 feat(llm): add thinking_enabled to LLMProvider Protocol + StubProvider no-op
0db5050 feat(llm): translate thinking_enabled to DeepSeek extra_body in OpenAICompatibleProvider
2618fea feat(llm): translate thinking_enabled to Anthropic thinking param
3f556ee feat(llm): config [llm.thinking] knob + factory kwargs and call site wiring
fed0882 fix(proactive_eval): bypass Phase A trivial gate so all fixture sessions flow through Phase B
d0125a6 fix(prompts): wire PART F follow_up_at + advance_hours through extraction parser
6999b10 fix(proactive_eval): action translation + max_per_24h knob + YAML on quoting + invariant edge case
d42e46a fix(proactive_eval): poll until audit row outcome is settled (LLM dispatch can take 30s+)
b9554ee fix(proactive_eval): stop follow_up_scheduler after first audit to prevent rapid-fire reschedule
94bf89d test(proactive_eval): tune fire_at_clock to safely past natural windows + travel xfail
92ce5c4 fix(proactive_eval): read generated messages from FakeChannel.sent (audit message_text not persisted)
9009b8e test(proactive_eval): xfail 7 fixtures with concrete PART F + scheduling reasons
10068e7 docs(proactive_eval): record Tier 1 results · 18 PASS · 10 XFAIL · note 4 production fixes shipped
b588fbe style(prompts): ruff format

AlanY1an added 23 commits May 5, 2026 01:48
…-cycle daemons re-arm

start() previously left _shutdown sticky from a prior stop(), so any
subsequent _schedule_next early-returned and the scheduler silently
no-op'd. Module docstring already promises 'On daemon restart start()
re-initialises timers from DB state' — this matches that intent and
unblocks per-phase eval runner loops that cycle start/stop.
…consistency

Three reviewer-flagged improvements to the runner:

- Wrap proactive_scheduler lifecycle in try/finally so a raise during
  consolidate / phase dispatch / supersede stage can't leak the
  scheduler task.
- Replace the bare 0.5s sleep in the per-phase loop with a
  poll-until-grew loop (5s budget). The fixed sleep was a wall-clock
  race against the timer's own asyncio.sleep when fire_at_clock lands
  just before natural fire time.
- Pass mock_now to _seed_proactive_state instead of datetime.now() so
  seeded last_updated matches the fixture clock, not real wall time.
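
The poll-until-grew idea from the second bullet can be sketched as below. The function name, the count callable, and the polling interval are assumptions for illustration; only the 5s budget comes from the commit message.

```python
import asyncio
import time
from typing import Callable


async def poll_until_grew(
    count: Callable[[], int],
    baseline: int,
    budget_s: float = 5.0,
    interval_s: float = 0.05,
) -> bool:
    """Return True as soon as count() exceeds baseline, False once the budget runs out."""
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        if count() > baseline:
            return True
        # Yield to the event loop so the timer task can actually run.
        await asyncio.sleep(interval_s)
    return count() > baseline
```

Compared with a fixed sleep, this returns early when the audit row appears quickly and tolerates a slow timer up to the budget, which removes the race rather than shrinking it.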
Initial fixture batch covering:
- A · advance_hours table (10 categories): moving_houston (xfail) ·
  interview · surgery · medical_routine · travel · deadline_paper ·
  celebration · reminder · vague_date_anchor · commitment_third_party
- A semantics (4): commitment_user · unresolved_short_window ·
  ongoing_check_n · fuzzy_future_check
- B reverse (4): low_value_future · historical · trivial · user_explicit_decline
- C multi/supersede/sensitive (4): multi_event · supersede_blocks_post ·
  vulnerability · medical_sensitive_tone
- D · 5 gates + 1 bypass (6): quiet_hours · forbidden_topic ·
  in_flight_turn · rate_limit · engagement_low · engagement_bypass_commitment

Initial expectation ranges are best-guesses. Stage C will tune them
against real DeepSeek-V4 LLM output, applying remedy A (widen range) /
B (xfail with concrete reason) / C (fix runner) per fixture.
…wiring

Adds ThinkingSection to LLMSection, threads thinking_enabled through
make_extract_fn / make_reflect_fn / make_slow_cycle_fn / make_proactive_fn /
make_judge_fn, and wires the runtime config into the factory call sites in
runtime/app.py. Default per-call-site behaviour:

  extract / judge: thinking_enabled=False (mechanical reformat)
  reflect / proactive / slow_cycle: thinking_enabled=None (provider default)

max_tokens bumps for the thinking-on call sites give V4 reasoning + visible
output the headroom they need (reflect 800->4096, proactive 400->2048,
slow_cycle 1024->4096). The offline tests/memory_eval/judge.py and the
extract/judge factories pin thinking_enabled=False so V4 stops burning the
token budget on hidden reasoning tokens for strict-JSON outputs.
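
One way to picture the per-call-site defaults and the provider translation is the sketch below. The function names, the overrides dict (standing in for whatever [llm.thinking] parses to), and the provider string keys are hypothetical. The Anthropic thinking parameter shape follows the public Messages API; the DeepSeek extra_body payload is an assumed shape, not confirmed by this PR.

```python
from typing import Any, Dict, Optional

# Per-call-site defaults as listed in the commit message above.
CALL_SITE_DEFAULTS: Dict[str, Optional[bool]] = {
    "extract": False,
    "judge": False,
    "reflect": None,      # None means "use the provider default"
    "proactive": None,
    "slow_cycle": None,
}


def resolve_thinking(call_site: str, overrides: Dict[str, Optional[bool]]) -> Optional[bool]:
    """A user override (hypothetical config shape) wins over the built-in default."""
    if call_site in overrides:
        return overrides[call_site]
    return CALL_SITE_DEFAULTS[call_site]


def provider_kwargs(
    provider: str,
    thinking_enabled: Optional[bool],
    budget_tokens: int = 1024,
) -> Dict[str, Any]:
    """Translate the tri-state flag into extra request kwargs for one provider."""
    if thinking_enabled is None:
        return {}  # leave the provider default untouched
    if provider == "anthropic":
        if thinking_enabled:
            return {"thinking": {"type": "enabled", "budget_tokens": budget_tokens}}
        return {"thinking": {"type": "disabled"}}
    if provider == "deepseek":
        # ASSUMPTION: illustrative payload; the real extra_body keys may differ.
        mode = "enabled" if thinking_enabled else "disabled"
        return {"extra_body": {"thinking": {"type": mode}}}
    return {}  # OpenAI: no-op, as described in the PR summary
```

Keeping None distinct from False is what lets the creative call sites inherit the provider default while the strict-JSON call sites force thinking off.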
…udit message_text not persisted by update_latest)
@AlanY1an AlanY1an merged commit a25dda4 into main May 6, 2026
6 checks passed
@AlanY1an AlanY1an deleted the feat/proactive-eval-skeleton branch May 6, 2026 23:20