Temp [ improve worker observability ] #109
Force-pushed from 6b615b8 to 0101b39
smudge left a comment:
Thanks for this PR!
I know you've marked it WIP/Temp, but I have also noticed that it's possible to trigger ActiveRecord::ConnectionTimeoutError if the pool size is lower than max_claims, so this fix definitely makes sense to me and I would be very open to addressing it. 👍
For the sake of reviewing changes independently, I think it would make sense to split the observability/logging changes out into a separate PR, and keep this PR focused on the thread pool / connection pool / max claims reconciliation. (LMK if that doesn't make sense!)
def thread_pool_size(job_count)
  return job_count unless Delayed::Job.respond_to?(:connection_pool)

  pool_size = Delayed::Job.connection_pool.size
  return job_count unless pool_size

  [job_count, [pool_size - 1, 1].max].min
rescue StandardError
  job_count
end
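To make the clamp in the diff above concrete, here is a minimal standalone sketch of the same arithmetic. The method name `clamped_thread_count` is hypothetical, and the pool size is passed in as an argument so the snippet runs without ActiveRecord. The `pool_size - 1` term presumably leaves one connection free for the thread running the pickup loop; that reading is an assumption, not something the PR states.

```ruby
# Standalone restatement of the clamp from the diff (hypothetical name).
# pool_size is passed in directly instead of being read from
# Delayed::Job.connection_pool, so this runs without ActiveRecord.
def clamped_thread_count(job_count, pool_size)
  # No known pool size: fall back to one thread per job, as the diff does.
  return job_count unless pool_size

  # Use at most pool_size - 1 threads, but never fewer than one.
  [job_count, [pool_size - 1, 1].max].min
end

clamped_thread_count(5, 3)   # => 2 (pool of 3 allows 2 threads)
clamped_thread_count(5, 1)   # => 1 (floor of one thread)
clamped_thread_count(2, 10)  # => 2 (pool is not the limiting factor)
clamped_thread_count(5, nil) # => 5 (unknown pool size)
```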
There may be a way to more proactively (e.g. during worker boot/initialization) establish if Delayed::Worker.max_claims > Delayed::Job.connection_pool.size rather than performing this thread_pool_size logic on every pickup loop. (My understanding is that Delayed::Job.connection_pool.size is informed by the pool size config in database.yml, and should not change once the app has loaded.)
The current pickup strategy is also intended to avoid picking up more work than the worker can immediately begin working off (to avoid holding unworked jobs in memory), so it may make sense to raise or warn up front (again, during boot / worker initialization) if a misconfiguration is detected.
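One rough shape the boot-time check could take is sketched below. Everything named here (`ClaimsCheck`, `check_max_claims`, the message wording) is hypothetical and not part of the gem; the two values are passed in as plain integers so the sketch runs on its own.

```ruby
# Hypothetical result object for a boot-time configuration check.
ClaimsCheck = Struct.new(:ok, :message)

# Compares max_claims against the connection pool size once, up front.
# Both arguments are plain values here; in a real worker they would
# presumably come from Delayed::Worker.max_claims and
# Delayed::Job.connection_pool.size (assumption: read once at boot).
def check_max_claims(max_claims:, pool_size:)
  # An unknown pool size cannot be validated, so treat it as acceptable.
  return ClaimsCheck.new(true, nil) if pool_size.nil?
  return ClaimsCheck.new(true, nil) if max_claims <= pool_size

  ClaimsCheck.new(
    false,
    "max_claims (#{max_claims}) exceeds the connection pool size " \
    "(#{pool_size}); workers may hit ActiveRecord::ConnectionTimeoutError"
  )
end

result = check_max_claims(max_claims: 5, pool_size: 3)
warn(result.message) unless result.ok
```

Whether a failed check should warn or raise is left to the caller, which keeps the comparison itself easy to test.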
don’t give crayons to more kids than the box has.
Summary
Improve Delayed::Worker#work_off observability and failure handling around worker thread setup.
What Changed
… work_off …
… job_say …
… perform from those that happen after it starts
Why
The worker previously had two observability gaps:
work_off did not log enough detail to show where a batch was in the reserve and dispatch flow
This change makes worker execution easier to trace and improves behavior when a thread fails before the job body starts running.
Behavior
debug-level logs for the work_off loop
… perform has started