Reap orphaned rewrite-rpc node servers to bound the live count by timtebeek · Pull Request #8106 · openrewrite/rewrite

timtebeek · 2026-06-23T12:31:37Z

Problem

OpenRewrite runs Python/JavaScript/TypeScript (and Go/C#) parsing and printing through out-of-process Node rewrite-rpc servers. RewriteRpcProcessManager held one server per thread in a ThreadLocal, started lazily by getOrStart() and only ever torn down by shutdown()/shutdownCurrent() — which is also thread-local and can only reach the calling thread's server.

A node server is a separate OS process with no finalizer: letting its owning RewriteRpc become garbage does not stop the process. So any server whose owning thread terminates before calling shutdown() is orphaned and survives until the JVM-exit shutdown hook fires. Threads that come and go during a run do exactly this:

- ForkJoinPool ManagedBlocker compensation threads (the hazard already called out in RewriteRpc.send()),
- cached / elastic pool threads that retire on their idle timeout.

Each spawns its own server via getOrStart(); none of them is the orchestrator thread that calls shutdownCurrent(). The orphaned processes accumulate without bound across a long-lived host JVM. Worse, each orphaned RewriteRpcProcess keeps its entire RewriteRpc object graph (object/ref caches, prepared recipes) heap-reachable via the shutdown-hook list. On a large run (a DevCenter recipe over ~4200 repos, the tail of which were Python) this produced many live node servers at once and contributed to an OOM.

The create↔dispose accounting was asymmetric: servers were created on any thread but disposed of from only one.
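The leak shape described above can be reproduced in miniature. The following is only a sketch under stated assumptions: `ThreadLocalLeakDemo` and `FakeServer` are hypothetical names invented for illustration, and a counter stands in for the real subprocess. It is not the code of RewriteRpcProcessManager.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: hypothetical names, not OpenRewrite classes. */
final class ThreadLocalLeakDemo {
    /** Stands in for an out-of-process server: it keeps "running" until stop() is called. */
    static final class FakeServer {
        static final AtomicInteger RUNNING = new AtomicInteger();
        FakeServer() { RUNNING.incrementAndGet(); }
        void stop() { RUNNING.decrementAndGet(); }
    }

    private static final ThreadLocal<FakeServer> CURRENT = new ThreadLocal<>();

    /** Lazily starts one server for the calling thread. */
    static FakeServer getOrStart() {
        FakeServer server = CURRENT.get();
        if (server == null) {
            server = new FakeServer();
            CURRENT.set(server);
        }
        return server;
    }

    /** Thread-local teardown: it can only ever reach the calling thread's server. */
    static void shutdownCurrent() {
        FakeServer server = CURRENT.get();
        if (server != null) {
            CURRENT.remove();
            server.stop();
        }
    }

    /** Starts servers on transient threads, shuts down from the caller, and returns how many servers were left running. */
    static int orphansAfter(int transientThreads) {
        int before = FakeServer.RUNNING.get();
        try {
            for (int i = 0; i < transientThreads; i++) {
                Thread t = new Thread(ThreadLocalLeakDemo::getOrStart);
                t.start();
                t.join(); // the thread terminates without calling shutdownCurrent()
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        getOrStart();
        shutdownCurrent(); // stops only the caller's own server
        return FakeServer.RUNNING.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("orphaned servers: " + orphansAfter(3));
    }
}
```

In this sketch every transient thread leaves one server behind, because nothing outside the dead thread can reach its ThreadLocal slot.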

Fix

RewriteRpcProcessManager now tracks every started server in a process-wide, thread-keyed registry (replacing the bare ThreadLocal) so the full set of live servers can be enumerated:

- Per-thread reuse is unchanged: a thread still gets and reuses exactly one server.
- getOrStart() and shutdown() reap servers whose owning thread is no longer alive. Reaping a dead thread's server is always safe (a terminated thread cannot be mid-RPC), so it is sound even while other threads are doing concurrent RPC work. Reaping is activity-driven: it runs whenever a new RPC thread appears and whenever the existing per-run shutdownCurrent() is called, with no call-site changes required. So a dead thread's server lingers only until the next such event (rather than until JVM exit), and the residual live-server count is bounded by the number of RPC threads that have retired since the last reap, not by the total number of threads ever created.
- A new shutdownAll() (exposed as PythonRewriteRpc.shutdownAll(), JavaScriptRewriteRpc.shutdownAll(), etc.) tears down every live server, driving the count to zero. It stops servers on currently-live threads too, so it is documented for use only when no RPC work is in flight (JVM/service shutdown). It is not required for correctness on JVM exit, because each subprocess already registers its own JVM-exit shutdown hook; it only bounds the count earlier within a long-running JVM.
- liveCount() exposes the live-server count for diagnostics and tests.
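The registry idea in the list above can be sketched as follows. This is an assumption-laden illustration, not the PR's implementation: the class name `ReapingRegistry`, the use of a `ConcurrentHashMap` keyed by `Thread`, and the `AutoCloseable` server type are all hypothetical choices. Only the method names getOrStart(), shutdownCurrent(), shutdownAll() and liveCount() come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical sketch of a process-wide, thread-keyed registry with dead-owner reaping. */
final class ReapingRegistry<S extends AutoCloseable> {
    private final Map<Thread, S> servers = new ConcurrentHashMap<>();
    private final Supplier<S> factory;

    ReapingRegistry(Supplier<S> factory) { this.factory = factory; }

    /** Per-thread reuse; a thread that needs a new server first reaps dead owners. */
    S getOrStart() {
        Thread self = Thread.currentThread();
        S existing = servers.get(self);
        if (existing != null) return existing;
        reapDead();
        S started = factory.get();
        servers.put(self, started); // only the owning thread ever writes its own key
        return started;
    }

    /** Stops the caller's server and any server whose owner has terminated. */
    void shutdownCurrent() {
        closeQuietly(servers.remove(Thread.currentThread()));
        reapDead();
    }

    /** Stops every server, including those of live threads; only for use when no work is in flight. */
    void shutdownAll() {
        for (Thread owner : new ArrayList<>(servers.keySet())) closeQuietly(servers.remove(owner));
    }

    int liveCount() { return servers.size(); }

    private void reapDead() {
        for (Thread owner : new ArrayList<>(servers.keySet())) {
            // remove() returns the value to at most one caller, so a server is closed once
            if (!owner.isAlive()) closeQuietly(servers.remove(owner));
        }
    }

    private static void closeQuietly(AutoCloseable server) {
        if (server == null) return;
        try { server.close(); } catch (Exception e) { /* one failed teardown must not stop the sweep */ }
    }

    /** Demo: n threads each start a server and die; the caller's getOrStart() then reaps them. */
    static int liveAfterTransientThreads(int n) {
        ReapingRegistry<AutoCloseable> registry = new ReapingRegistry<AutoCloseable>(() -> () -> { });
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(registry::getOrStart);
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        registry.getOrStart(); // a new RPC thread appears, so dead owners are reaped first
        return registry.liveCount();
    }
}
```

In this sketch the demo ends with only the caller's server registered, however many transient threads ran. Keying by `Thread` keeps terminated `Thread` objects reachable until the next reap, which mirrors the activity-driven bound described above rather than a hard cap.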

Tests

RewriteRpcProcessManagerTest uses a counting fake server (no real subprocess) to assert:

- one server per thread, reused;
- across 200 sequential runs the live count never exceeds one and ends at zero;
- 50 transient threads that die without cleanup are all reaped on the next getOrStart(), leaving exactly one live server;
- shutdown() on the orchestrator thread reaps a dead-thread orphan as well as its own server;
- shutdownAll() drives the count to zero even with servers held open on live threads;
- resilience: when one server's teardown throws, the remaining servers are still torn down. This is covered for both shutdownAll() and shutdown()'s reap path (via a ThrowingRpc stub).
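The resilience property in the last item can be illustrated with a small sketch. The helper name `ResilientTeardown.closeAll` and the choice to collect failures rather than log or swallow them are assumptions made for this example; the PR text does not say how failures are reported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: tear down every resource even if some teardowns throw. */
final class ResilientTeardown {
    /** Closes each resource in order; a failure is recorded and the loop continues. */
    static List<Exception> closeAll(List<? extends AutoCloseable> resources) {
        List<Exception> failures = new ArrayList<>();
        for (AutoCloseable resource : resources) {
            try {
                resource.close();
            } catch (Exception e) {
                failures.add(e); // do not let one failing teardown strand the rest
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        AtomicInteger closed = new AtomicInteger();
        List<AutoCloseable> servers = Arrays.<AutoCloseable>asList(
                closed::incrementAndGet,
                () -> { throw new IllegalStateException("teardown failed"); },
                closed::incrementAndGet);
        List<Exception> failures = closeAll(servers);
        System.out.println("closed=" + closed.get() + " failures=" + failures.size());
    }
}
```

The middle entry plays the role of a throwing stub: the servers before and after it are still closed.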

The fake server chains super.shutdown() so each stub's backing JsonRpc ForkJoinPool is torn down rather than leaking worker threads across the suite, and the live-threads test uses a guarded wait + bounded join so a future regression fails fast instead of hanging.
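The "guarded wait + bounded join" pattern mentioned above might look like the following. This is a generic sketch of the technique with hypothetical names (`GuardedWaitDemo`, `awaitRelease`, `release`, `runOnce`), not the test code from the PR.

```java
/** Hypothetical sketch: a flag-guarded wait cannot lose a wakeup, and a bounded join cannot hang. */
final class GuardedWaitDemo {
    private final Object lock = new Object();
    private boolean released; // guarded by lock

    /** Blocks until release() has been called, even if it was called before this method. */
    void awaitRelease() throws InterruptedException {
        synchronized (lock) {
            while (!released) { // re-check the flag: covers early notifies and spurious wakeups
                lock.wait();
            }
        }
    }

    void release() {
        synchronized (lock) {
            released = true;
            lock.notifyAll();
        }
    }

    /** Returns true if the worker terminated within the timeout (which must be positive). */
    static boolean runOnce(long joinTimeoutMillis) {
        GuardedWaitDemo gate = new GuardedWaitDemo();
        Thread worker = new Thread(() -> {
            try {
                gate.awaitRelease(); // in a real test the worker would hold a server open here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        gate.release(); // may run before the worker waits; the flag makes that harmless
        try {
            worker.join(joinTimeoutMillis); // bounded: join(0) would wait forever
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive(); // a test would assert this instead of hanging
    }

    public static void main(String[] args) {
        System.out.println("worker finished: " + runOnce(5000));
    }
}
```

Without the `released` flag, a `notifyAll()` that fires before the worker reaches `wait()` would be lost and the worker would block indefinitely; the bounded join turns any such regression into a failed assertion.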

Out of scope

- The Moderne worker's RECIPE_EXECUTOR is an unbounded newCachedThreadPool(); this PR bounds orphaned (dead-thread) servers but not concurrently-live ones, so bounding the executor's parallelism (so the host spawns "no more servers than needed") is a separate worker-side change.
- DevCenterStarter memory use on Python repos, observed in the same run, is not addressed here.
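For the first item, one possible shape of a bounded executor is sketched below using the standard `ThreadPoolExecutor`. This is purely illustrative: the names `BoundedExecutorDemo` and `boundedElastic`, the pool size, and the idle timeout are assumptions, and nothing here reflects how the Moderne worker is or will be configured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: an executor that is elastic when idle but capped in parallelism. */
final class BoundedExecutorDemo {
    /** At most maxThreads workers; idle workers retire after idleSeconds (must be positive). */
    static ThreadPoolExecutor boundedElastic(int maxThreads, long idleSeconds) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, idleSeconds, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true); // lets the pool shrink to zero like a cached pool
        return pool;
    }

    /** Runs the given number of no-op tasks and reports the largest number of threads ever in the pool. */
    static int largestPoolSize(int maxThreads, int tasks) {
        ThreadPoolExecutor pool = boundedElastic(maxThreads, 1);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                pending.add(pool.submit(() -> { })); // excess tasks queue instead of spawning threads
            }
            for (Future<?> f : pending) f.get();
            return pool.getLargestPoolSize();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("largest pool size: " + largestPoolSize(4, 20));
    }
}
```

With one server per thread, capping the thread count caps the number of concurrently live servers. Workers in such a pool still retire on idle timeout, so dead-thread reaping remains relevant alongside it.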

RewriteRpcProcessManager kept one node RPC server per thread in a ThreadLocal, started lazily by getOrStart() and only ever torn down by shutdown(), which is also thread-local and can only reach the calling thread's server. A node server is a separate OS process with no finalizer, so any server whose owning thread terminates before calling shutdown() is orphaned and survives until the JVM-exit shutdown hook fires. Threads that come and go during a run (ForkJoinPool ManagedBlocker compensation threads, cached/elastic pool threads that retire on their idle timeout) each spawn a server via getOrStart() and none of them is the orchestrator thread that calls shutdownCurrent(), so the processes accumulate without bound across a long-lived host JVM and contributed to an OOM on a large multi-repo run.

Track every started server in a process-wide, thread-keyed registry instead of a bare ThreadLocal. Per-thread reuse is unchanged, but getOrStart() and shutdown() now reap servers whose owning thread is no longer alive, which is safe because a terminated thread cannot be mid-RPC. This bounds the live-server count by the currently-live RPC threads plus those retired since the last reap, rather than by the total number of threads ever created, with no call-site changes.

Add shutdownAll() (exposed on each language RPC) to tear down every live server on JVM/service shutdown, and liveCount() for diagnostics and tests.

…ce coverage

Review-driven refinements on top of the thread-keyed registry:
- Class javadoc: describe reaping as activity-driven; the residual is bounded by the RPC threads retired since the last reap, not a hard currently-live-thread cap.
- getOrStart(): replace stale comment referencing a nonexistent computeIfAbsent lock with one matching the code (loser branch is reentrancy-only).
- shutdownAll(): tighten the precondition (a server registers after construction, so a concurrent getOrStart() can survive the sweep) and note each subprocess's own JVM-exit hook is the real backstop.
- reset(): route through the get() helper instead of duplicating the lookup.
- Document on all four facades that shutdownCurrent() also reaps dead-thread orphans.

Tests:
- Fix a lost-wakeup hang in shutdownAllDrivesLiveServersToZeroAcrossLiveThreads (guarded wait + released flag + bounded join that asserts liveness).
- CountingRpc.shutdown() now chains super.shutdown() so each stub's JsonRpc ForkJoinPool is torn down rather than leaking worker threads across the suite.
- Add ThrowingRpc and two tests asserting one failing teardown cannot strand the rest, covering both shutdownAll() and shutdown()'s reap path.
- @Execution(SAME_THREAD) guards the shared static live counter.

timtebeek · 2026-06-25T11:31:16Z

@kmccarp as you've explored the rpc servers running in longer running processes, would you be comfortable taking a first pass through this if it aligns with your findings and thoughts?

timtebeek · 2026-06-29T10:20:14Z

Unassigning myself as this likely needs more scrutiny than I can fit in for now; I've logged an internal issue to follow up.

github-project-automation Bot moved this to In Progress in OpenRewrite Jun 23, 2026

github-project-automation Bot added this to OpenRewrite Jun 23, 2026

moderne-meeseeks Bot assigned timtebeek Jun 23, 2026

timtebeek marked this pull request as draft June 23, 2026 12:32

timtebeek changed the title ~~Bound the number of live rewrite-rpc node servers~~ Reap orphaned rewrite-rpc node servers to bound the live count Jun 25, 2026

timtebeek removed their assignment Jun 29, 2026

timtebeek wants to merge 2 commits into main from tim/rpc-server-leak-cleanup
