scoring: log shadow + canonical encode timings, pre-cache e5-base-v2 #208
Closed
that-guy-wade wants to merge 1 commit into
Extend the shadow scoring JSON line with per-pair timing fields for both the canonical and shadow encoders so we can quantify the runtime impact of swapping the canonical sentence model. Pre-cache `intfloat/e5-base-v2` in the validator image to make it usable as a shadow candidate without runtime download.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shardi-b
approved these changes
Jun 29, 2026
shardi-b
left a comment
Contributor
Approved. No issues found during code review.
Contributor
Author
Closing. Pearson correlation to Qwen3 is not a useful target, since Qwen3 is not ground truth (a 15-pair human judgment showed Qwen3 ≈ bge-small). Will let shadow mode run today's race #76 with bge-small, then judge the swap of canonical Qwen3 → bge-small directly.
Description
Extend the opt-in shadow scoring log line with per-pair timing for both the canonical (Qwen3-Embedding-0.6B) and the configured shadow encoder, and pre-cache `intfloat/e5-base-v2` in the validator image so it can be enabled as a shadow candidate without a runtime download.
Local benchmark of 10 candidate sentence models on 167 production title pairs showed `intfloat/e5-base-v2` has the highest Pearson correlation with the canonical Qwen3 sims (0.7591 at 110M params) while running roughly 5.5× faster than Qwen3 on CPU. To quantify the speedup on production validator hardware (Graviton ARM, not Apple Silicon), we need per-pair encode timings logged side-by-side.
Changes Made
- `src/agent/rewards/orm.py`: time both encode and similarity calls for shadow and canonical models, add the four `*_ms` fields to the `shadow_sim` JSON payload.
- `docker/validator/Dockerfile`: add `intfloat/e5-base-v2` to the pre-cache step alongside Qwen3 and bge-small.
- `tests/test_scoring_perf.py`: assert the new timing fields are present and numeric in the logged payload.
Issue Link
Testing
Manual Testing
Ran the local pytest suite; the shadow scoring tests pass with the new timing assertions.
Test Results:
`uv run pytest tests/test_scoring_perf.py -x -q` → 6 passed in 0.13s.
Automated Testing
Existing unit tests cover both the enabled and disabled shadow paths. New assertions verify the timing fields appear in the payload when the shadow model loads successfully.
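The kind of assertion described above could look roughly like the sketch below. It is not the code in `tests/test_scoring_perf.py`: the helper names `timing_fields` and `check_timing_fields`, the `event` key, and the four concrete field names in the usage example are assumptions made for illustration. The PR only states that four `*_ms` fields are added to the `shadow_sim` payload.

```python
import json
from numbers import Real


def timing_fields(log_line: str) -> dict:
    """Parse one JSON log line and return only the keys ending in `_ms`."""
    payload = json.loads(log_line)
    return {key: value for key, value in payload.items() if key.endswith("_ms")}


def check_timing_fields(log_line: str, expected_count: int = 4) -> dict:
    """Assert that the expected number of timing fields exist and are numeric.

    Hypothetical helper: the real test may check specific field names instead
    of counting keys by suffix.
    """
    fields = timing_fields(log_line)
    assert len(fields) == expected_count, f"expected {expected_count} *_ms fields, got {sorted(fields)}"
    for name, value in fields.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        assert isinstance(value, Real) and not isinstance(value, bool), f"{name} is not numeric"
        assert value >= 0, f"{name} is negative"
    return fields


# Usage with a hand-written sample line (field names are illustrative only).
sample = json.dumps({
    "event": "shadow_sim",
    "canonical_encode_ms": 12.5,
    "canonical_sim_ms": 0.1,
    "shadow_encode_ms": 2.3,
    "shadow_sim_ms": 0.05,
})
check_timing_fields(sample)
```

Counting keys by suffix keeps the check independent of the exact field names, at the cost of not catching a misnamed field.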
Test Command(s):
Documentation
Documentation Changes:
N/A
Checklist
Additional Notes
The canonical timing probe adds one extra single-title encode pair per shadow log line. This is wasted compute outside of measurement, so the timing instrumentation is gated on the same `SHADOW_SCORE_MODEL` env var that gates the shadow log itself. When shadow scoring is disabled (production default), there is no additional cost.
Pre-caching e5-base-v2 increases the validator image by approximately 440 MB. The shadow scorer continues to no-op when `SHADOW_SCORE_MODEL` is unset, so this is purely opt-in plumbing.
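For readers who want a concrete picture of the gating, here is a minimal sketch of env-gated timing. It does not reproduce `src/agent/rewards/orm.py`: the function names, the encoder callables, the `event` key, and the four field names are hypothetical. Only `SHADOW_SCORE_MODEL` and the `shadow_sim` payload name come from the PR.

```python
import json
import os
import time


def timed_ms(fn, *args):
    """Call fn(*args) and return (result, elapsed wall time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def shadow_log_line(title_a, title_b, canonical_encode, shadow_encode, env=None):
    """Return a JSON log line with timings, or None when shadow scoring is off.

    canonical_encode / shadow_encode are assumed to map a list of strings to a
    list of embedding vectors (hypothetical interface).
    """
    env = os.environ if env is None else env
    model = env.get("SHADOW_SCORE_MODEL")
    if not model:
        return None  # disabled: no extra encodes, no log line

    (c_a, c_b), canonical_encode_ms = timed_ms(canonical_encode, [title_a, title_b])
    canonical_sim, canonical_sim_ms = timed_ms(cosine, c_a, c_b)
    (s_a, s_b), shadow_encode_ms = timed_ms(shadow_encode, [title_a, title_b])
    shadow_sim, shadow_sim_ms = timed_ms(cosine, s_a, s_b)

    return json.dumps({
        "event": "shadow_sim",  # assumed payload layout
        "shadow_model": model,
        "canonical_sim": canonical_sim,
        "shadow_sim": shadow_sim,
        "canonical_encode_ms": canonical_encode_ms,
        "canonical_sim_ms": canonical_sim_ms,
        "shadow_encode_ms": shadow_encode_ms,
        "shadow_sim_ms": shadow_sim_ms,
    })
```

Because the early return happens before any encoder is called, the disabled path does no timing work at all, which matches the stated intent that the production default carries no additional cost.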