Skip to content

Reap orphaned rewrite-rpc node servers to bound the live count#8106

Draft
timtebeek wants to merge 2 commits into
mainfrom
tim/rpc-server-leak-cleanup
Draft

Reap orphaned rewrite-rpc node servers to bound the live count#8106
timtebeek wants to merge 2 commits into
mainfrom
tim/rpc-server-leak-cleanup

Conversation

@timtebeek

@timtebeek timtebeek commented Jun 23, 2026

Copy link
Copy Markdown
Member

Problem

OpenRewrite runs Python/JavaScript/TypeScript (and Go/C#) parsing and printing through out-of-process Node rewrite-rpc servers. RewriteRpcProcessManager held one server per thread in a ThreadLocal, started lazily by getOrStart() and only ever torn down by shutdown()/shutdownCurrent() — which is also thread-local and can only reach the calling thread's server.

A node server is a separate OS process with no finalizer: letting its owning RewriteRpc become garbage does not stop the process. So any server whose owning thread terminates before calling shutdown() is orphaned and survives until the JVM-exit shutdown hook fires. Threads that come and go during a run do exactly this:

  • ForkJoinPool ManagedBlocker compensation threads (the hazard already called out in RewriteRpc.send()),
  • cached / elastic pool threads that retire on their idle timeout.

Each spawns its own server via getOrStart(); none of them is the orchestrator thread that calls shutdownCurrent(). The orphaned processes accumulate without bound across a long-lived host JVM. Worse, each orphaned RewriteRpcProcess keeps its entire RewriteRpc object graph (object/ref caches, prepared recipes) heap-reachable via the shutdown-hook list. On a large run (a DevCenter recipe over ~4200 repos, the tail of which were Python) this produced many live node servers at once and contributed to an OOM.

The create↔dispose accounting was asymmetric: servers are created on any thread, but disposed from one thread.

Fix

RewriteRpcProcessManager now tracks every started server in a process-wide, thread-keyed registry (replacing the bare ThreadLocal) so the full set of live servers can be enumerated:

  • Per-thread reuse is unchanged — a thread still gets and reuses exactly one server.
  • getOrStart() and shutdown() reap servers whose owning thread is no longer alive. Reaping a dead thread's server is always safe (a terminated thread cannot be mid-RPC), so it is sound even while other threads are doing concurrent RPC work. Reaping is activity-driven: it runs whenever a new RPC thread appears and whenever the existing per-run shutdownCurrent() is called — no call-site changes required. So a dead thread's server lingers only until the next such event (rather than until JVM exit), and the residual live-server count is bounded by the number of RPC threads that have retired since the last reap, not by the total number of threads ever created.
  • New shutdownAll() (exposed as PythonRewriteRpc.shutdownAll(), JavaScriptRewriteRpc.shutdownAll(), etc.) tears down every live server, driving the count to zero. It stops servers on currently-live threads too, so it is documented for use only when no RPC work is in flight (JVM/service shutdown). It is not required for correctness on JVM exit — each subprocess already registers its own JVM-exit shutdown hook — it only bounds the count earlier within a long-running JVM.
  • liveCount() exposes the live-server count for diagnostics and tests.

Tests

RewriteRpcProcessManagerTest uses a counting fake server (no real subprocess) to assert:

  • one server per thread, reused;
  • across 200 sequential runs the live count never exceeds one and ends at zero;
  • 50 transient threads that die without cleanup are all reaped on the next getOrStart(), leaving exactly one live server;
  • shutdown() on the orchestrator thread reaps a dead-thread orphan as well as its own server;
  • shutdownAll() drives the count to zero even with servers held open on live threads;
  • resilience: when one server's teardown throws, the remaining servers are still torn down — covered for both shutdownAll() and shutdown()'s reap path (via a ThrowingRpc stub).

The fake server chains super.shutdown() so each stub's backing JsonRpc ForkJoinPool is torn down rather than leaking worker threads across the suite, and the live-threads test uses a guarded wait + bounded join so a future regression fails fast instead of hanging.

Out of scope

  • The Moderne worker's RECIPE_EXECUTOR is an unbounded newCachedThreadPool(); this PR bounds orphaned (dead-thread) servers but not concurrently-live ones, so bounding the executor's parallelism (so the host spawns "no more servers than needed") is a separate worker-side change.
  • DevCenterStarter memory use on Python repos, observed in the same run, is not addressed here.

RewriteRpcProcessManager kept one node RPC server per thread in a ThreadLocal,
started lazily by getOrStart() and only ever torn down by shutdown(), which is
also thread-local and can only reach the calling thread's server. A node server
is a separate OS process with no finalizer, so any server whose owning thread
terminates before calling shutdown() is orphaned and survives until the JVM-exit
shutdown hook fires. Threads that come and go during a run (ForkJoinPool
ManagedBlocker compensation threads, cached/elastic pool threads that retire on
their idle timeout) each spawn a server via getOrStart() and none of them is the
orchestrator thread that calls shutdownCurrent(), so the processes accumulate
without bound across a long-lived host JVM and contributed to an OOM on a large
multi-repo run.

Track every started server in a process-wide, thread-keyed registry instead of a
bare ThreadLocal. Per-thread reuse is unchanged, but getOrStart() and shutdown()
now reap servers whose owning thread is no longer alive, which is safe because a
terminated thread cannot be mid-RPC. This bounds the live-server count by the
number of currently-live RPC threads rather than the total number of threads ever
created, with no call-site changes. Add shutdownAll() (exposed on each language
RPC) to tear down every live server on JVM/service shutdown, and liveCount() for
diagnostics and tests.
@github-project-automation github-project-automation Bot moved this to In Progress in OpenRewrite Jun 23, 2026
@timtebeek timtebeek marked this pull request as draft June 23, 2026 12:32
…ce coverage

Review-driven refinements on top of the thread-keyed registry:

- Class javadoc: describe reaping as activity-driven — the residual is bounded by
  the RPC threads retired since the last reap, not a hard currently-live-thread cap.
- getOrStart(): replace stale comment referencing a nonexistent computeIfAbsent lock
  with one matching the code (loser branch is reentrancy-only).
- shutdownAll(): tighten the precondition (a server registers after construction, so
  a concurrent getOrStart() can survive the sweep) and note each subprocess's own
  JVM-exit hook is the real backstop.
- reset(): route through the get() helper instead of duplicating the lookup.
- Document on all four facades that shutdownCurrent() also reaps dead-thread orphans.

Tests:
- Fix a lost-wakeup hang in shutdownAllDrivesLiveServersToZeroAcrossLiveThreads
  (guarded wait + released flag + bounded join that asserts liveness).
- CountingRpc.shutdown() now chains super.shutdown() so each stub's JsonRpc
  ForkJoinPool is torn down rather than leaking worker threads across the suite.
- Add ThrowingRpc and two tests asserting one failing teardown cannot strand the
  rest, covering both shutdownAll() and shutdown()'s reap path.
- @execution(SAME_THREAD) guards the shared static live counter.
@timtebeek timtebeek changed the title Bound the number of live rewrite-rpc node servers Reap orphaned rewrite-rpc node servers to bound the live count Jun 25, 2026
@timtebeek

Copy link
Copy Markdown
Member Author

@kmccarp as you've explored the rpc servers running in longer running processes, would you be comfortable taking a first pass through this if it aligns with your findings and thoughts?

@timtebeek timtebeek removed their assignment Jun 29, 2026
@timtebeek

Copy link
Copy Markdown
Member Author

Unassigning myself as this likely needs more scrutiny than I can fit in for now; I've logged an internal issue to follow up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

1 participant