scoring: log shadow + canonical encode timings, pre-cache e5-base-v2 #208

Closed
that-guy-wade wants to merge 1 commit into main from sethschilbe/shadow-timing-and-e5

Conversation

@that-guy-wade

Contributor

Description

Extend the opt-in shadow scoring log line with per-pair timing for both the canonical (Qwen3-Embedding-0.6B) and the configured shadow encoder, and pre-cache intfloat/e5-base-v2 in the validator image so it can be enabled as a shadow candidate without a runtime download.

A local benchmark of 10 candidate sentence models on 167 production title pairs showed that intfloat/e5-base-v2 has the highest Pearson correlation with the canonical Qwen3 similarities (0.7591, at 110M params) while running roughly 5.5× faster than Qwen3 on CPU. To quantify the speedup on production validator hardware (Graviton ARM, not Apple Silicon), we need per-pair encode timings logged side by side.
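For reference, the Pearson comparison described above can be sketched as follows. This is a minimal, standard-library illustration and not the benchmark script itself; the function name `pearson` and the idea of passing two aligned lists of per-pair similarities are assumptions made for the example.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two aligned lists of similarity scores.

    Hypothetical helper: xs would hold the canonical model's per-pair
    similarities, ys the candidate model's similarities for the same pairs.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-deviation and per-series squared deviation from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A high value here only says the candidate ranks pairs similarly to the canonical model; it says nothing about agreement with human judgment.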

Changes Made

  • src/agent/rewards/orm.py: time both the encode and similarity calls for the shadow and canonical models, and add the four *_ms fields to the shadow_sim JSON payload.
  • docker/validator/Dockerfile: add intfloat/e5-base-v2 to the pre-cache step alongside Qwen3 and bge-small.
  • tests/test_scoring_perf.py: assert the new timing fields are present and numeric in the logged payload.
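A rough sketch of the timing instrumentation described in the first bullet, under stated assumptions: the field names (`canonical_encode_ms`, `canonical_sim_ms`, `shadow_encode_ms`, `shadow_sim_ms`), the helper names, and the `toy_encode` stand-in are all hypothetical and are not taken from `orm.py`. Real encoders would be the loaded sentence models.

```python
import json
import math
import time
from typing import Callable, Dict, List, Tuple

Encoder = Callable[[str], List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def timed_pair(encode: Encoder, left: str, right: str) -> Tuple[float, float, float]:
    """Return (similarity, encode_ms, sim_ms) for one title pair."""
    t0 = time.perf_counter()
    vec_left, vec_right = encode(left), encode(right)
    t1 = time.perf_counter()
    sim = cosine(vec_left, vec_right)
    t2 = time.perf_counter()
    return sim, (t1 - t0) * 1000.0, (t2 - t1) * 1000.0


def build_shadow_payload(canonical: Encoder, shadow: Encoder,
                         left: str, right: str) -> Dict[str, object]:
    """Assemble a shadow_sim-style payload with four timing fields (names assumed)."""
    c_sim, c_enc_ms, c_sim_ms = timed_pair(canonical, left, right)
    s_sim, s_enc_ms, s_sim_ms = timed_pair(shadow, left, right)
    return {
        "event": "shadow_sim",
        "canonical_sim": c_sim,
        "shadow_sim": s_sim,
        "canonical_encode_ms": round(c_enc_ms, 3),
        "canonical_sim_ms": round(c_sim_ms, 3),
        "shadow_encode_ms": round(s_enc_ms, 3),
        "shadow_sim_ms": round(s_sim_ms, 3),
    }


def toy_encode(text: str) -> List[float]:
    """Stand-in encoder for illustration only; not a real sentence model."""
    return [float(len(text)), float(text.count("a")) + 1.0]


if __name__ == "__main__":
    print(json.dumps(build_shadow_payload(toy_encode, toy_encode, "alpha", "beta")))
```

`time.perf_counter()` is used because it is monotonic and has the highest available resolution for short intervals.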

Issue Link

  • Related to: N/A (shadow scoring observability follow-up)
  • Closes: N/A

Testing

Manual Testing

Ran the local pytest suite; the shadow scoring tests pass with the new timing assertions.

Test Results:

  • uv run pytest tests/test_scoring_perf.py -x -q → 6 passed in 0.13s.

Automated Testing

Existing unit tests cover both the enabled and disabled shadow paths. New assertions verify the timing fields appear in the payload when the shadow model loads successfully.

Test Command(s):

uv run pytest tests/test_scoring_perf.py -x -q

Documentation

  • README updated
  • Code comments added/updated
  • API documentation updated
  • Configuration documentation updated
  • Other documentation updated (please specify):

Documentation Changes:
N/A

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been published and merged

Additional Notes

The canonical timing probe adds one extra single-title encode pair per shadow log line. This is wasted compute outside of measurement, so the timing instrumentation is gated on the same SHADOW_SCORE_MODEL env var that gates the shadow log itself. When shadow scoring is disabled (production default), there is no additional cost.
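The gating described above might look roughly like this. Only the `SHADOW_SCORE_MODEL` variable name comes from this PR; the function names and the `probe` callable are hypothetical placeholders for whatever runs the shadow and canonical encodes.

```python
import os
from typing import Callable, Mapping, Optional

SHADOW_ENV_VAR = "SHADOW_SCORE_MODEL"


def shadow_model_name(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured shadow model, or None when unset or blank."""
    name = env.get(SHADOW_ENV_VAR, "").strip()
    return name or None


def maybe_shadow_probe(probe: Callable[[str], dict],
                       env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Run the (assumed) timing probe only when shadow scoring is enabled.

    When the env var is unset, the probe is never called, so the extra
    canonical encode adds no cost on the default path.
    """
    model = shadow_model_name(env)
    if model is None:
        return None
    return probe(model)
```

Passing `env` explicitly keeps the gate easy to exercise in tests without mutating the process environment.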

Pre-caching e5-base-v2 increases the validator image by approximately 440 MB. The shadow scorer continues to no-op when SHADOW_SCORE_MODEL is unset, so this is purely opt-in plumbing.

Extend the shadow scoring JSON line with per-pair timing fields for both
the canonical and shadow encoders so we can quantify the runtime impact
of swapping the canonical sentence model. Pre-cache `intfloat/e5-base-v2`
in the validator image to make it usable as a shadow candidate without
runtime download.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@that-guy-wade that-guy-wade self-assigned this Jun 29, 2026
@that-guy-wade that-guy-wade requested a review from shardi-b June 29, 2026 17:55

@shardi-b shardi-b left a comment

Contributor

Approved. No issues found during code review.

@that-guy-wade

Contributor Author

Closing — Pearson-to-Qwen3 is not a useful target since Qwen3 is not ground truth (15-pair human judgment showed Qwen3 ≈ bge-small). Will let shadow mode run today's race #76 with bge-small, then judge swap of canonical Qwen3 → bge-small directly.

@that-guy-wade that-guy-wade deleted the sethschilbe/shadow-timing-and-e5 branch June 29, 2026 18:01

2 participants