Bound lifespan shutdown so a stuck MQTT disconnect can't hang CI #7
The lifespan shutdown cancelled the source tasks and awaited each one unbounded. aiomqtt's graceful disconnect-on-cancel intermittently hangs on Python 3.11 (the cancelled task never settles), so `await task` would wedge shutdown, and the TestClient/lifespan around it, forever.

This showed up as a flaky, version-biased CI hang: the 3.11 leg sat in pytest until the job's wall-clock cap while 3.12/3.13 finished in ~30s. Captured stack: MainThread blocked in TestClient.__exit__ → wait_shutdown while the event loop sat idle in the selector, i.e. an awaited shutdown task that never completes.

Wait for the cancelled tasks with a bounded grace (asyncio.wait, SHUTDOWN_GRACE_S=5s) and abandon any straggler with a warning; the loop is tearing down regardless, so a stuck client connection is moot (PRD §37 failure isolation). Normal teardown is unaffected (returns as soon as the tasks settle, typically milliseconds).

Reproduced locally under 3.11 (~1 hang in 7 runs); 20/20 clean after the fix. scripts/check.sh green (62 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
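The bounded wait described in the commit message can be illustrated with a small standalone sketch. It is not the PR's lifespan code: `stubborn`, `well_behaved`, and `bounded_shutdown` are hypothetical names, the stuck disconnect is simulated by a coroutine that swallows cancellation, and the grace period is shortened from the PR's 5 s so the demo finishes quickly.

```python
import asyncio
import logging
import time

log = logging.getLogger("lifespan_demo")

# The PR uses 5 s; shortened here (assumption) so the demo runs fast.
SHUTDOWN_GRACE_S = 0.2


async def well_behaved() -> None:
    """Stops promptly when cancelled."""
    await asyncio.sleep(3600)


async def stubborn(release: asyncio.Event) -> None:
    """Stand-in for a disconnect that hangs while handling cancellation."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        # Swallow the cancellation and block on cleanup that does not finish.
        await release.wait()


async def bounded_shutdown(tasks: list[asyncio.Task]) -> set[asyncio.Task]:
    """Cancel every task, wait at most SHUTDOWN_GRACE_S, return stragglers."""
    for task in tasks:
        task.cancel()
    # Note: asyncio.wait raises ValueError when given an empty collection.
    _done, pending = await asyncio.wait(tasks, timeout=SHUTDOWN_GRACE_S)
    if pending:
        log.warning(
            "%d source task(s) did not stop within %.1fs; abandoning",
            len(pending),
            SHUTDOWN_GRACE_S,
        )
    return pending


async def main() -> tuple[int, float]:
    release = asyncio.Event()
    tasks = [
        asyncio.create_task(well_behaved()),
        asyncio.create_task(stubborn(release)),
    ]
    await asyncio.sleep(0)  # let both tasks reach their first await
    started = time.monotonic()
    pending = await bounded_shutdown(tasks)
    elapsed = time.monotonic() - started
    # Demo only: release the straggler so asyncio.run() can exit cleanly.
    release.set()
    await asyncio.gather(*tasks, return_exceptions=True)
    return len(pending), elapsed


stragglers, elapsed = asyncio.run(main())
```

With an unbounded `await task` in place of `asyncio.wait`, the same setup would block on `stubborn` indefinitely. Here shutdown returns after roughly the grace period and reports one straggler.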
Code Review
This pull request introduces a shutdown grace period of 5 seconds in the FastAPI lifespan handler to prevent the application from hanging indefinitely during shutdown. It replaces the unbounded await of cancelled tasks with asyncio.wait and a timeout. The review feedback correctly points out that completed tasks in _done should be awaited to retrieve their exceptions and avoid 'Task exception was never retrieved' warnings.
    _done, pending = await asyncio.wait(tasks, timeout=SHUTDOWN_GRACE_S)
    if pending:
        log.warning(
            "%d source task(s) did not stop within %.0fs; abandoning",
            len(pending),
            SHUTDOWN_GRACE_S,
        )
By using asyncio.wait and not awaiting or retrieving the results of the completed tasks in _done, any unhandled exceptions raised by these tasks (other than CancelledError) will never be retrieved. This can cause Python to log a Task exception was never retrieved warning when the tasks are garbage collected, and hides potential errors during shutdown.
We should iterate over the completed tasks in _done and await them (suppressing CancelledError) to ensure their exceptions are properly retrieved and propagated/logged.
    _done, pending = await asyncio.wait(tasks, timeout=SHUTDOWN_GRACE_S)
    for task in _done:
        with contextlib.suppress(asyncio.CancelledError):
            await task
    if pending:
        log.warning(
            "%d source task(s) did not stop within %.0fs; abandoning",
            len(pending),
            SHUTDOWN_GRACE_S,
        )
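The suggestion above awaits each finished task, so a non-cancellation exception would propagate out of the shutdown path. If logging such errors is preferred over propagating them, one possible variant (an assumption, not something the PR or the review proposes) reads each task's result with `Task.cancelled()` and `Task.exception()`. Calling `exception()` also marks the exception as retrieved, which avoids the "Task exception was never retrieved" warning. The names `fails_on_cancel`, `stops_cleanly`, and `collect_errors` are hypothetical.

```python
import asyncio
import logging

log = logging.getLogger("lifespan_demo")


async def fails_on_cancel() -> None:
    """Simulates a source whose cleanup raises while handling cancellation."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        raise RuntimeError("disconnect failed")


async def stops_cleanly() -> None:
    """Simulates a source that simply accepts cancellation."""
    await asyncio.sleep(3600)


def collect_errors(done: set[asyncio.Task]) -> list[BaseException]:
    """Retrieve and log exceptions from finished tasks without re-raising."""
    errors: list[BaseException] = []
    for task in done:
        if task.cancelled():
            # exception() would raise CancelledError for a cancelled task.
            continue
        exc = task.exception()  # marks the exception as retrieved
        if exc is not None:
            log.warning("source task failed during shutdown: %r", exc)
            errors.append(exc)
    return errors


async def main() -> list[BaseException]:
    tasks = [
        asyncio.create_task(fails_on_cancel()),
        asyncio.create_task(stops_cleanly()),
    ]
    await asyncio.sleep(0)  # let both tasks reach their first await
    for task in tasks:
        task.cancel()
    done, _pending = await asyncio.wait(tasks, timeout=1.0)
    return collect_errors(done)


errors = asyncio.run(main())
```

Whether to log or propagate is a design choice: propagating surfaces shutdown failures to the caller, while logging keeps one failing source from interrupting the rest of the teardown.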
Problem

The flaky, 3.11-biased CI hang (a `test` leg sitting in pytest until the job cap while 3.12/3.13 finish in ~30s) is a lifespan-shutdown hang, not a test-logic bug.

The lifespan's `finally` cancelled the source tasks and `await`ed each one unbounded. aiomqtt's graceful disconnect-on-cancel intermittently hangs on Python 3.11 (the cancelled task never settles), so `await task` wedges shutdown, and the `TestClient`/lifespan around it, forever.

Captured hung stack: `MainThread` blocked in `TestClient.__exit__ → wait_shutdown` while the event loop sat idle in the selector, i.e. an awaited shutdown task that never completes. It's pre-existing (the demo task teardown), just rarely hit and unbounded.

Fix

Wait for the cancelled tasks with a bounded grace (`asyncio.wait`, `SHUTDOWN_GRACE_S=5s`) and abandon any straggler with a warning. The loop is tearing down regardless, so a stuck client connection is moot (PRD §37). Normal teardown is unaffected: `asyncio.wait` returns as soon as the tasks settle (typically milliseconds).

Verification

`scripts/check.sh` green (ruff + mypy strict + pytest, 62 passed).

🤖 Generated with Claude Code