Bound lifespan shutdown so a stuck MQTT disconnect can't hang CI#7

Merged: kninetimmy merged 1 commit into main from fix-lifespan-shutdown-hang on Jun 17, 2026

@kninetimmy

Owner

Problem

The flaky, 3.11-biased CI hang (a test leg sitting in pytest until the job cap while 3.12/3.13 finish in ~30s) is a lifespan-shutdown hang, not a test-logic bug.

The lifespan's `finally` block cancelled the source tasks and then awaited each one with no timeout. aiomqtt's graceful disconnect-on-cancel intermittently hangs on Python 3.11 (the cancelled task never settles), so `await task` wedges shutdown, and the TestClient/lifespan around it, forever.

Captured hung stack: MainThread blocked in TestClient.__exit__ → wait_shutdown while the event loop sat idle in the selector, i.e. an awaited shutdown task that never completes. The bug is pre-existing (it lives in the demo task teardown); it was rarely hit, but when it was hit the wait had no upper bound.
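
The failure mode can be reproduced without aiomqtt. The sketch below is illustrative only: `stubborn_source` and `release` are hypothetical stand-ins for a source task whose disconnect stalls after cancellation, not code from this PR. It probes the cancelled task with `asyncio.wait`, which observes without cancelling, to show that the task is still unsettled after the timeout.

```python
import asyncio


async def stubborn_source(release: asyncio.Event) -> None:
    """Hypothetical source task whose cleanup stalls after cancellation."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        # Stand-in for a graceful disconnect that does not finish on its own.
        await release.wait()
        raise


async def demo() -> bool:
    release = asyncio.Event()
    task = asyncio.create_task(stubborn_source(release))
    await asyncio.sleep(0)  # let the task reach its first await
    task.cancel()

    # An unbounded `await task` here would block until `release` is set.
    # asyncio.wait with a timeout returns even if the task never settles.
    _done, pending = await asyncio.wait({task}, timeout=0.05)
    still_stuck = task in pending

    # Unblock the stand-in so the event loop can close cleanly.
    release.set()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return still_stuck


if __name__ == "__main__":
    print("task still pending after cancel:", asyncio.run(demo()))
```
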

Fix

Wait for the cancelled tasks with a bounded grace (asyncio.wait, SHUTDOWN_GRACE_S=5s) and abandon any straggler with a warning. The loop is tearing down regardless, so a stuck client connection is moot (PRD §37). Normal teardown is unaffected — asyncio.wait returns as soon as the tasks settle (typically milliseconds).
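
A minimal sketch of the bounded wait described above, assuming the source tasks are held in a list. The helper name `stop_tasks` and its return value are assumptions made for illustration; only `asyncio.wait`, `SHUTDOWN_GRACE_S`, and the warning text come from this PR.

```python
import asyncio
import logging
from collections.abc import Iterable

log = logging.getLogger(__name__)

SHUTDOWN_GRACE_S = 5.0  # grace period named in the PR


async def stop_tasks(
    tasks: Iterable[asyncio.Task],
    grace_s: float = SHUTDOWN_GRACE_S,
) -> set[asyncio.Task]:
    """Cancel tasks and wait at most grace_s; return the stragglers (hypothetical helper)."""
    task_set = set(tasks)
    if not task_set:
        return set()  # asyncio.wait raises ValueError on an empty set

    for task in task_set:
        task.cancel()

    # Returns as soon as every task settles, or after grace_s at the latest.
    # Unlike `await task`, it never blocks on a task that ignores cancellation.
    _done, pending = await asyncio.wait(task_set, timeout=grace_s)
    if pending:
        log.warning(
            "%d source task(s) did not stop within %.0fs; abandoning",
            len(pending),
            grace_s,
        )
    return pending
```

In a lifespan handler this would be called from the `finally` block after the `yield`. Well-behaved tasks settle almost immediately, so the grace period only matters when a task is stuck.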

Verification

  • Reproduced locally under Python 3.11: ~1 hang in 7 full-suite runs.
  • 20/20 clean after the fix.
  • scripts/check.sh green (ruff + mypy strict + pytest, 62 passed).

🤖 Generated with Claude Code

The lifespan shutdown cancelled the source tasks and awaited each one
unbounded. aiomqtt's graceful disconnect-on-cancel intermittently hangs
on Python 3.11 (the cancelled task never settles), so `await task` would
wedge shutdown — and the TestClient/lifespan around it — forever. This
showed up as a flaky, version-biased CI hang: the 3.11 leg sat in pytest
until the job's wall-clock cap while 3.12/3.13 finished in ~30s.

Captured stack: MainThread blocked in TestClient.__exit__ → wait_shutdown
while the event loop sat idle in the selector — i.e. an awaited shutdown
task that never completes.

Wait for the cancelled tasks with a bounded grace (asyncio.wait,
SHUTDOWN_GRACE_S=5s) and abandon any straggler with a warning; the loop
is tearing down regardless, so a stuck client connection is moot
(PRD §37 failure isolation). Normal teardown is unaffected (returns as
soon as the tasks settle, typically milliseconds).

Reproduced locally under 3.11 (~1 hang in 7 runs); 20/20 clean after the
fix. scripts/check.sh green (62 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a shutdown grace period of 5 seconds in the FastAPI lifespan handler to prevent the application from hanging indefinitely during shutdown. It replaces the unbounded await of cancelled tasks with asyncio.wait and a timeout. The review feedback correctly points out that completed tasks in _done should be awaited to retrieve their exceptions and avoid 'Task exception was never retrieved' warnings.

Comment on lines +62 to +68

    _done, pending = await asyncio.wait(tasks, timeout=SHUTDOWN_GRACE_S)
    if pending:
        log.warning(
            "%d source task(s) did not stop within %.0fs; abandoning",
            len(pending),
            SHUTDOWN_GRACE_S,
        )

medium

By using asyncio.wait and not awaiting or retrieving the results of the completed tasks in _done, any unhandled exceptions raised by these tasks (other than CancelledError) will never be retrieved. This can cause Python to log a Task exception was never retrieved warning when the tasks are garbage collected, and hides potential errors during shutdown.

We should iterate over the completed tasks in _done and await them (suppressing CancelledError) to ensure their exceptions are properly retrieved and propagated/logged.

Suggested change

    - _done, pending = await asyncio.wait(tasks, timeout=SHUTDOWN_GRACE_S)
    - if pending:
    -     log.warning(
    -         "%d source task(s) did not stop within %.0fs; abandoning",
    -         len(pending),
    -         SHUTDOWN_GRACE_S,
    -     )
    + _done, pending = await asyncio.wait(tasks, timeout=SHUTDOWN_GRACE_S)
    + for task in _done:
    +     with contextlib.suppress(asyncio.CancelledError):
    +         await task
    + if pending:
    +     log.warning(
    +         "%d source task(s) did not stop within %.0fs; abandoning",
    +         len(pending),
    +         SHUTDOWN_GRACE_S,
    +     )
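
One possible variation on the suggestion, sketched under assumptions: awaiting each finished task re-raises the first non-cancellation error out of the `finally` block, whereas reading `Task.exception()` marks the exception as retrieved and lets shutdown log it and carry on. `log_task_failures` is a hypothetical helper, not part of this PR or of the review.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


def log_task_failures(done: set[asyncio.Task]) -> int:
    """Log exceptions from finished tasks; return how many failed (hypothetical helper)."""
    failures = 0
    for task in done:
        if task.cancelled():
            # Expected outcome of task.cancel(); exception() would raise here.
            continue
        exc = task.exception()  # retrieving it silences the "never retrieved" warning
        if exc is not None:
            failures += 1
            log.error(
                "source task %s failed during shutdown",
                task.get_name(),
                exc_info=exc,
            )
    return failures
```

It would be called with the first set returned by `asyncio.wait`; every task in that set is finished, so `Task.exception()` does not raise `InvalidStateError`.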

kninetimmy merged commit 8f8a457 into main on Jun 17, 2026 (5 checks passed)
kninetimmy deleted the fix-lifespan-shutdown-hang branch on June 17, 2026 00:17