scoring: log shadow + canonical encode timings, pre-cache e5-base-v2 by that-guy-wade · Pull Request #208 · ORO-AI/oro

that-guy-wade · 2026-06-29T17:55:26Z

Description

Extend the opt-in shadow scoring log line with per-pair timing for both the canonical (Qwen3-Embedding-0.6B) and the configured shadow encoder, and pre-cache intfloat/e5-base-v2 in the validator image so it can be enabled as a shadow candidate without a runtime download.

A local benchmark of 10 candidate sentence models on 167 production title pairs showed that intfloat/e5-base-v2 has the highest Pearson correlation with the canonical Qwen3 similarities (0.7591, at 110M parameters) while running roughly 5.5× faster than Qwen3 on CPU. To quantify the speedup on production validator hardware (Graviton ARM, not Apple Silicon), we need per-pair encode timings logged side by side.
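For reference, the Pearson comparison described above can be sketched as follows. This is a minimal illustration, not the benchmark script used for this PR: the `pearson` helper and the sample similarity values are assumptions made up for the example.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-pair similarities for the same title pairs (not the PR's data).
canonical_sims = [0.91, 0.42, 0.77, 0.15, 0.63]
candidate_sims = [0.88, 0.51, 0.70, 0.22, 0.60]
r = pearson(canonical_sims, candidate_sims)
```

A high `r` only says the candidate ranks pairs similarly to the canonical model; it says nothing about which model is closer to human judgment.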

Changes Made

src/agent/rewards/orm.py: time both the encode and similarity calls for the shadow and canonical models, and add the four *_ms fields to the shadow_sim JSON payload.
docker/validator/Dockerfile: add intfloat/e5-base-v2 to the pre-cache step alongside Qwen3 and bge-small.
tests/test_scoring_perf.py: assert the new timing fields are present and numeric in the logged payload.
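
The timing change in orm.py could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: the four field names, the `event` key, `timed_pair`, and `toy_encoder` are all hypothetical, and a real encoder would be a sentence-embedding model rather than the stand-in used here.

```python
import json
import math
import time
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[Sequence[str]], List[Vector]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def timed_pair(encode: Encoder, title_a: str, title_b: str) -> Tuple[float, float, float]:
    """Return (similarity, encode_ms, sim_ms) for one title pair."""
    t0 = time.perf_counter()
    vec_a, vec_b = encode([title_a, title_b])
    t1 = time.perf_counter()
    sim = cosine(vec_a, vec_b)
    t2 = time.perf_counter()
    return sim, (t1 - t0) * 1000.0, (t2 - t1) * 1000.0


def shadow_sim_payload(canonical: Encoder, shadow: Encoder, a: str, b: str) -> str:
    """Build one JSON log line with both similarities and four timing fields."""
    c_sim, c_enc_ms, c_sim_ms = timed_pair(canonical, a, b)
    s_sim, s_enc_ms, s_sim_ms = timed_pair(shadow, a, b)
    return json.dumps({
        "event": "shadow_sim",             # hypothetical key
        "canonical_sim": c_sim,
        "shadow_sim": s_sim,
        "canonical_encode_ms": c_enc_ms,   # hypothetical field names below
        "canonical_sim_ms": c_sim_ms,
        "shadow_encode_ms": s_enc_ms,
        "shadow_sim_ms": s_sim_ms,
    })


def toy_encoder(titles: Sequence[str]) -> List[Vector]:
    """Stand-in encoder for the example; not a real embedding model."""
    return [[float(len(t)), float(t.count(" ") + 1), 1.0] for t in titles]


line = shadow_sim_payload(toy_encoder, toy_encoder, "red running shoes", "blue shoes")
```

`time.perf_counter` is used because it is monotonic and intended for measuring short intervals.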

Issue Link

Related to: N/A (shadow scoring observability follow-up)
Closes: N/A

Testing

Manual Testing

Ran the local pytest suite; the shadow scoring tests pass with the new timing assertions.

Test Results:

uv run pytest tests/test_scoring_perf.py -x -q → 6 passed in 0.13s.

Automated Testing

Existing unit tests cover both the enabled and disabled shadow paths. New assertions verify the timing fields appear in the payload when the shadow model loads successfully.
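
The kind of check the new assertions perform can be sketched as below. This is an assumed shape, not the contents of tests/test_scoring_perf.py: the field names in `TIMING_FIELDS` and the helper `assert_timing_fields` are hypothetical, since the description only refers to "the four *_ms fields".

```python
import json
import numbers

# Hypothetical field names; substitute whatever the payload actually emits.
TIMING_FIELDS = (
    "canonical_encode_ms",
    "canonical_sim_ms",
    "shadow_encode_ms",
    "shadow_sim_ms",
)


def assert_timing_fields(log_line: str) -> None:
    """Fail unless every timing field is present, numeric, and non-negative."""
    payload = json.loads(log_line)
    for field in TIMING_FIELDS:
        assert field in payload, f"missing {field}"
        value = payload[field]
        # bool is a subclass of int in Python, so exclude it explicitly.
        assert isinstance(value, numbers.Real) and not isinstance(value, bool), field
        assert value >= 0, field


# Example log line with made-up values.
sample = json.dumps({
    "canonical_encode_ms": 41.2,
    "canonical_sim_ms": 0.03,
    "shadow_encode_ms": 7.9,
    "shadow_sim_ms": 0.02,
})
assert_timing_fields(sample)
```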

Test Command(s):

uv run pytest tests/test_scoring_perf.py -x -q

Documentation

README updated
Code comments added/updated
API documentation updated
Configuration documentation updated
Other documentation updated (please specify):

Documentation Changes:
N/A

Checklist

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings or errors
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been published and merged

Additional Notes

The canonical timing probe adds one extra single-title encode pair per shadow log line. This is wasted compute outside of measurement, so the timing instrumentation is gated on the same SHADOW_SCORE_MODEL env var that gates the shadow log itself. When shadow scoring is disabled (production default), there is no additional cost.
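
The gating described above can be sketched as follows. Only the SHADOW_SCORE_MODEL variable name comes from this PR; the helper names are hypothetical, and treating an empty or whitespace-only value as unset is an assumption of this sketch.

```python
import os
from typing import Callable, Mapping, Optional


def shadow_model_name(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured shadow model, or None when shadow scoring is off."""
    # Assumption: an empty or whitespace-only value counts as unset.
    name = env.get("SHADOW_SCORE_MODEL", "").strip()
    return name or None


def maybe_run_timing_probe(env: Mapping[str, str], probe: Callable[[str], None]) -> bool:
    """Run the timing probe only when a shadow model is configured."""
    model = shadow_model_name(env)
    if model is None:
        return False  # production default: no extra encodes, no log line
    probe(model)
    return True


# Usage with explicit environments instead of the real process environment.
calls = []
ran_disabled = maybe_run_timing_probe({}, calls.append)
ran_enabled = maybe_run_timing_probe(
    {"SHADOW_SCORE_MODEL": "intfloat/e5-base-v2"}, calls.append
)
```

Passing the environment as a mapping keeps the gate easy to test without mutating `os.environ`.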

Pre-caching e5-base-v2 increases the validator image by approximately 440 MB. The shadow scorer continues to no-op when SHADOW_SCORE_MODEL is unset, so this is purely opt-in plumbing.

Extend the shadow scoring JSON line with per-pair timing fields for both the canonical and shadow encoders so we can quantify the runtime impact of swapping the canonical sentence model. Pre-cache `intfloat/e5-base-v2` in the validator image to make it usable as a shadow candidate without runtime download.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shardi-b

Approved. No issues found during code review.

that-guy-wade · 2026-06-29T18:01:48Z

Closing — Pearson-to-Qwen3 is not a useful target since Qwen3 is not ground truth (15-pair human judgment showed Qwen3 ≈ bge-small). Will let shadow mode run today's race #76 with bge-small, then judge swap of canonical Qwen3 → bge-small directly.

that-guy-wade self-assigned this Jun 29, 2026

that-guy-wade requested a review from shardi-b June 29, 2026 17:55

shardi-b approved these changes Jun 29, 2026

that-guy-wade closed this Jun 29, 2026

that-guy-wade deleted the sethschilbe/shadow-timing-and-e5 branch June 29, 2026 18:01

that-guy-wade wants to merge 1 commit into main from sethschilbe/shadow-timing-and-e5
