feat(proactive_eval): real-LLM eval suite + Tier 1 (28 fixtures · 18 PASS / 10 XFAIL)#9
Merged
…-cycle daemons re-arm
start() previously left _shutdown sticky from a prior stop(), so any subsequent _schedule_next early-returned and the scheduler silently no-op'd. Module docstring already promises 'On daemon restart start() re-initialises timers from DB state'; this change matches that intent and unblocks per-phase eval runner loops that cycle start/stop.
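A minimal sketch of the restart behaviour this commit describes. The class below is a hypothetical stand-in, not the real FollowUpScheduler: only the _shutdown flag handling is modelled, and the `scheduled` list is an illustrative assumption in place of real timers.

```python
class RestartableScheduler:
    """Hypothetical stand-in for a scheduler with a sticky shutdown flag."""

    def __init__(self) -> None:
        self._shutdown = False
        self.scheduled: list[str] = []  # illustrative; real code would arm timers

    def start(self) -> None:
        # The fix described above: clear the flag on every start() so that
        # a start() following a stop() re-arms scheduling.
        self._shutdown = False

    def stop(self) -> None:
        self._shutdown = True

    def _schedule_next(self, name: str) -> bool:
        # Early return while shut down. Without the reset in start(), this
        # branch would be taken forever after the first stop().
        if self._shutdown:
            return False
        self.scheduled.append(name)
        return True
```

Cycling start/stop/start on this sketch schedules work again after the second start(), which is the property the per-phase eval loops rely on.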
…consistency
Three reviewer-flagged improvements to the runner:
- Wrap proactive_scheduler lifecycle in try/finally so a raise during consolidate / phase dispatch / supersede stage can't leak the scheduler task.
- Replace the bare 0.5s sleep in the per-phase loop with a poll-until-grew loop (5s budget). The fixed sleep was a wall-clock race against the timer's own asyncio.sleep when fire_at_clock lands just before natural fire time.
- Pass mock_now to _seed_proactive_state instead of datetime.now() so seeded last_updated matches the fixture clock, not real wall time.
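The poll-until-grew idea from the second item can be sketched roughly as follows. The function name, the `interval_s` parameter, and the use of a plain list are assumptions for illustration; the actual runner may poll a different object.

```python
import asyncio
from typing import Sequence


async def wait_until_grew(
    items: Sequence,
    baseline: int,
    budget_s: float = 5.0,
    interval_s: float = 0.05,
) -> bool:
    """Poll until len(items) exceeds baseline, or the time budget runs out.

    Returns True as soon as growth is observed, False on timeout.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget_s
    while loop.time() < deadline:
        if len(items) > baseline:
            return True
        # Yield to the event loop so the scheduler's own timer can fire.
        await asyncio.sleep(interval_s)
    # One last check in case growth happened during the final sleep.
    return len(items) > baseline
```

Compared with a fixed sleep, this returns as soon as the dispatch lands and only pays the full budget when nothing fires at all.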
Initial fixture batch covering:
- A · advance_hours table (10 categories): moving_houston (xfail) · interview · surgery · medical_routine · travel · deadline_paper · celebration · reminder · vague_date_anchor · commitment_third_party
- A semantics (4): commitment_user · unresolved_short_window · ongoing_check_n · fuzzy_future_check
- B reverse (4): low_value_future · historical · trivial · user_explicit_decline
- C multi/supersede/sensitive (4): multi_event · supersede_blocks_post · vulnerability · medical_sensitive_tone
- D · 5 gates + 1 bypass (6): quiet_hours · forbidden_topic · in_flight_turn · rate_limit · engagement_low · engagement_bypass_commitment

Initial expectation ranges are best guesses. Stage C will tune them against real DeepSeek-V4 LLM output, applying remedy A (widen range) / B (xfail with concrete reason) / C (fix runner) per fixture.
…ICompatibleProvider
…wiring
Adds ThinkingSection to LLMSection, threads thinking_enabled through make_extract_fn / make_reflect_fn / make_slow_cycle_fn / make_proactive_fn / make_judge_fn, and wires the runtime config into the factory call sites in runtime/app.py.

Default per-call-site behaviour:
- extract / judge: thinking_enabled=False (mechanical reformat)
- reflect / proactive / slow_cycle: thinking_enabled=None (provider default)

max_tokens bumps for the thinking-on call sites give V4 reasoning + visible output the headroom they need (reflect 800->4096, proactive 400->2048, slow_cycle 1024->4096). The offline tests/memory_eval/judge.py and the extract/judge factories pin thinking_enabled=False so V4 stops burning the token budget on hidden reasoning tokens for strict-JSON outputs.
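The per-call-site defaults in this commit message can be summarised as a small lookup table. This is only a sketch: the `CallSiteDefaults` type, the `resolve_thinking` helper, and the rule that a user override always wins are assumptions, not the project's actual wiring. The max_tokens values for extract and judge are not stated in the message, so they are left as None here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallSiteDefaults:
    thinking_enabled: Optional[bool]  # None means "defer to the provider default"
    max_tokens: Optional[int]  # None means "not stated in the commit message"


# Values taken from the commit message; the structure itself is hypothetical.
CALL_SITE_DEFAULTS = {
    "extract": CallSiteDefaults(thinking_enabled=False, max_tokens=None),
    "judge": CallSiteDefaults(thinking_enabled=False, max_tokens=None),
    "reflect": CallSiteDefaults(thinking_enabled=None, max_tokens=4096),
    "proactive": CallSiteDefaults(thinking_enabled=None, max_tokens=2048),
    "slow_cycle": CallSiteDefaults(thinking_enabled=None, max_tokens=4096),
}


def resolve_thinking(call_site: str, user_override: Optional[bool] = None) -> Optional[bool]:
    """Assumed precedence: an explicit user override beats the call-site default."""
    if user_override is not None:
        return user_override
    return CALL_SITE_DEFAULTS[call_site].thinking_enabled
```

Keeping three states (True / False / None) lets strict-JSON call sites opt out of thinking while creative call sites stay on whatever the provider ships by default.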
…ons flow through Phase B
…quoting + invariant edge case
…patch can take 30s+)
…event rapid-fire reschedule
…ws + travel xfail
…udit message_text not persisted by update_latest)
…te 4 production fixes shipped
Summary
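Adds a new tests/proactive_eval/ suite, following the tests/memory_eval/ pattern: real LLM + mock clock + scheduler dispatch + audit + LLM-as-judge. Tier 1 has 28 fixtures covering all 8 PART F advance_hours categories, all 5 gates, the 4 follow_up_at computation rules, explicit suppression, multi-event, the supersede loop, and engagement bypass. The first goal is to lock the real 2026-05-03 dogfood scenario (the user said "moving to Houston next week" and PART F never propagated follow_up_at to the DB) into a permanent regression fixture. Running the suite end to end during debugging showed this to be a production wiring bug, which is fixed on this branch as well.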
Live-LLM run results
Every xfail carries a concrete reason (PART F prompt judgment gap · multi-phase scheduling tuning · LLM advance_hours non-determinism); none is a "deal with it later" placeholder.
Production fixes shipped on this branch
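Getting the eval suite to run exposed and fixed 4 production bugs, plus 1 audit observation:
- … (d0125a6): PR #8 (feat(proactive): memory-driven v3) added the concept_nodes.follow_up_at schema and the PART F section of EXTRACTION_SYSTEM_PROMPT, but the RawExtractedEvent schema and _parse_event never read the LLM's PART F output. Every concept_nodes.follow_up_at stayed NULL. This is the real root cause of the Houston dogfood bug. Fix: parser + wiring add 5 fields (follow_up_at / follow_up_hint / estimated_arc_days / advance_pre_hours / advance_post_hours) + 9 unit tests.
- … (b5b78f1/0db5050/2618fea/3f556ee): V4 enables thinking by default and the hidden reasoning consumes the max_tokens budget, which truncated the Phase B JSON. Adds thinking_enabled: bool | None to the LLMProvider.complete abstraction; each provider translates it to its native API (DeepSeek extra_body, Anthropic thinking, OpenAI no-op). Decisions are per call site (extract / judge turn thinking off; reflect / proactive / slow_cycle leave it on by default). A new [llm.thinking] config section lets users override this. max_tokens for the creative calls is bumped to match (V4 reasoning headroom).
- FollowUpScheduler._shutdown not reset in start() (cb70f13): after a daemon restart, or on the second start of a multi-cycle scheduler, the scheduler silently no-op'd. Fix + new unit test.
- … (fed0882): fixed by passing trivial_message_count=0 in the runner as a bypass.

Observation (not fixed; separate follow-up):
SQLiteAuditSink.update_latest does not rewrite decision.message_text. The audit row always has message_text='' even when send_ok=True, so the admin "proactive message history" tab shows empty messages. The runner reads FakeChannel.sent as a fallback instead.

Architecture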
Runner flow:
Each fixture YAML declares seed (PersonaProfile / ProactiveState / pre-existing decisions / in_flight_turn predicate / max_per_24h), turns, expect.event_extracted, expect.phases, optional follow_up_stage, optional judge_prompts, and optional xfail.

Coverage matrix · 28 fixtures
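Each fixture in the matrix follows the key layout listed under Architecture. As a rough illustration, a shape check over an already-parsed fixture could look like the sketch below. The helper name and the example values are hypothetical; only the key names come from the description, and the real runner may validate differently or accept further keys.

```python
# Key names come from the PR description; everything else is illustrative.
REQUIRED_KEYS = ("seed", "turns", "expect")
REQUIRED_EXPECT_KEYS = ("event_extracted", "phases")
OPTIONAL_KEYS = ("follow_up_stage", "judge_prompts", "xfail")


def fixture_problems(fixture: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape is OK."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in fixture:
            problems.append(f"missing key: {key}")
    expect = fixture.get("expect") or {}
    for key in REQUIRED_EXPECT_KEYS:
        if key not in expect:
            problems.append(f"missing key: expect.{key}")
    return problems


# Hypothetical fixture content, as it might look after loading the YAML.
example = {
    "seed": {"max_per_24h": 3},
    "turns": ["I'm moving to Houston next week"],
    "expect": {"event_extracted": True, "phases": []},
    "xfail": "concrete reason goes here",
}
```

Calling `fixture_problems(example)` on the dictionary above yields an empty list, while a fixture lacking `expect.phases` is reported by name.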
Test plan
- uv run pytest -m eval_proactive · 28 fixtures · run 8: 18 PASS / 10 XFAIL · live DeepSeek-V4 · 17 min wall-clock · ~$3-8 LLM cost
- uv run pytest -q · default suite · 1623 passed · +12 new tests over baseline (4 openai_compat thinking · 9 PART F parser · others schema/invariants/runner_smoke)
- uv run ruff check src/ tests/ · clean
- uv run ruff format --check · clean on touched files
- uv run lint-imports · 4 contracts kept

Out of scope (followup PRs)
- Clean up obsolete fields in the [proactive] section of config.toml.sample
- Persist message_text in SQLiteAuditSink.update_latest

Branch commits (23)