Reap orphaned rewrite-rpc node servers to bound the live count#8106
Draft
timtebeek wants to merge 2 commits into
Draft
Reap orphaned rewrite-rpc node servers to bound the live count#8106timtebeek wants to merge 2 commits into
timtebeek wants to merge 2 commits into
Conversation
RewriteRpcProcessManager kept one node RPC server per thread in a ThreadLocal, started lazily by getOrStart() and only ever torn down by shutdown(), which is also thread-local and can only reach the calling thread's server. A node server is a separate OS process with no finalizer, so any server whose owning thread terminates before calling shutdown() is orphaned and survives until the JVM-exit shutdown hook fires. Threads that come and go during a run (ForkJoinPool ManagedBlocker compensation threads, cached/elastic pool threads that retire on their idle timeout) each spawn a server via getOrStart() and none of them is the orchestrator thread that calls shutdownCurrent(), so the processes accumulate without bound across a long-lived host JVM and contributed to an OOM on a large multi-repo run. Track every started server in a process-wide, thread-keyed registry instead of a bare ThreadLocal. Per-thread reuse is unchanged, but getOrStart() and shutdown() now reap servers whose owning thread is no longer alive, which is safe because a terminated thread cannot be mid-RPC. This bounds the live-server count by the number of currently-live RPC threads rather than the total number of threads ever created, with no call-site changes. Add shutdownAll() (exposed on each language RPC) to tear down every live server on JVM/service shutdown, and liveCount() for diagnostics and tests.
…ce coverage Review-driven refinements on top of the thread-keyed registry: - Class javadoc: describe reaping as activity-driven — the residual is bounded by the RPC threads retired since the last reap, not a hard currently-live-thread cap. - getOrStart(): replace stale comment referencing a nonexistent computeIfAbsent lock with one matching the code (loser branch is reentrancy-only). - shutdownAll(): tighten the precondition (a server registers after construction, so a concurrent getOrStart() can survive the sweep) and note each subprocess's own JVM-exit hook is the real backstop. - reset(): route through the get() helper instead of duplicating the lookup. - Document on all four facades that shutdownCurrent() also reaps dead-thread orphans. Tests: - Fix a lost-wakeup hang in shutdownAllDrivesLiveServersToZeroAcrossLiveThreads (guarded wait + released flag + bounded join that asserts liveness). - CountingRpc.shutdown() now chains super.shutdown() so each stub's JsonRpc ForkJoinPool is torn down rather than leaking worker threads across the suite. - Add ThrowingRpc and two tests asserting one failing teardown cannot strand the rest, covering both shutdownAll() and shutdown()'s reap path. - @execution(SAME_THREAD) guards the shared static live counter.
Member
Author
|
@kmccarp as you've explored the rpc servers running in longer running processes, would you be comfortable taking a first pass through this if it aligns with your findings and thoughts? |
Member
Author
|
Unassigning myself as this likely needs more scrutiny than I can fit in for now; I've logged an internal issue to follow up. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
OpenRewrite runs Python/JavaScript/TypeScript (and Go/C#) parsing and printing through out-of-process Node
rewrite-rpcservers.RewriteRpcProcessManagerheld one server per thread in aThreadLocal, started lazily bygetOrStart()and only ever torn down byshutdown()/shutdownCurrent()— which is also thread-local and can only reach the calling thread's server.A node server is a separate OS process with no finalizer: letting its owning
RewriteRpcbecome garbage does not stop the process. So any server whose owning thread terminates before callingshutdown()is orphaned and survives until the JVM-exit shutdown hook fires. Threads that come and go during a run do exactly this:RewriteRpc.send()),Each spawns its own server via
getOrStart(); none of them is the orchestrator thread that callsshutdownCurrent(). The orphaned processes accumulate without bound across a long-lived host JVM. Worse, each orphanedRewriteRpcProcesskeeps its entireRewriteRpcobject graph (object/ref caches, prepared recipes) heap-reachable via the shutdown-hook list. On a large run (a DevCenter recipe over ~4200 repos, the tail of which were Python) this produced many live node servers at once and contributed to an OOM.The create↔dispose accounting was asymmetric: servers are created on any thread, but disposed from one thread.
Fix
RewriteRpcProcessManagernow tracks every started server in a process-wide, thread-keyed registry (replacing the bareThreadLocal) so the full set of live servers can be enumerated:getOrStart()andshutdown()reap servers whose owning thread is no longer alive. Reaping a dead thread's server is always safe (a terminated thread cannot be mid-RPC), so it is sound even while other threads are doing concurrent RPC work. Reaping is activity-driven: it runs whenever a new RPC thread appears and whenever the existing per-runshutdownCurrent()is called — no call-site changes required. So a dead thread's server lingers only until the next such event (rather than until JVM exit), and the residual live-server count is bounded by the number of RPC threads that have retired since the last reap, not by the total number of threads ever created.shutdownAll()(exposed asPythonRewriteRpc.shutdownAll(),JavaScriptRewriteRpc.shutdownAll(), etc.) tears down every live server, driving the count to zero. It stops servers on currently-live threads too, so it is documented for use only when no RPC work is in flight (JVM/service shutdown). It is not required for correctness on JVM exit — each subprocess already registers its own JVM-exit shutdown hook — it only bounds the count earlier within a long-running JVM.liveCount()exposes the live-server count for diagnostics and tests.Tests
RewriteRpcProcessManagerTestuses a counting fake server (no real subprocess) to assert:getOrStart(), leaving exactly one live server;shutdown()on the orchestrator thread reaps a dead-thread orphan as well as its own server;shutdownAll()drives the count to zero even with servers held open on live threads;shutdownAll()andshutdown()'s reap path (via aThrowingRpcstub).The fake server chains
super.shutdown()so each stub's backingJsonRpcForkJoinPoolis torn down rather than leaking worker threads across the suite, and the live-threads test uses a guarded wait + bounded join so a future regression fails fast instead of hanging.Out of scope
RECIPE_EXECUTORis an unboundednewCachedThreadPool(); this PR bounds orphaned (dead-thread) servers but not concurrently-live ones, so bounding the executor's parallelism (so the host spawns "no more servers than needed") is a separate worker-side change.DevCenterStartermemory use on Python repos, observed in the same run, is not addressed here.