
fix(install): visible wait on held apt lock instead of a silent spinner (#740) #213

Closed
LukasWodka wants to merge 1 commit into develop from fix/install-journey-740-apt-lock-wait



@LukasWodka

Contributor

Closes #740

Summary

On a fresh cloud VM, unattended-upgrades / apt-daily hold the dpkg frontend lock for the first few minutes after boot. The system-deps step runs apt-get update/install under spin_cmd, which redirects output and animates a spinner — hiding that apt is just blocked on the lock. The install looks frozen for minutes and users abort. This adds a visible wait-for-lock before the apt spinner, with a heartbeat and a bounded timeout, so no long apt step is silent. (Package-name skew was already handled in #720; this is the lock/visibility dimension only.)

Related

Closes #740 · Part of tracebloc/backend#736 (install-journey epic) · Builds on #720 (conntrack package-name skew).

Type of change

  • Feature
  • Bug fix
  • Tech-debt / refactor
  • Docs
  • Security / hardening
  • Breaking change

What changed

scripts/lib/setup-linux.sh:

  • wait_apt_lock() — called in install_system_deps before the $PM_UPDATE / install spinner. Polls the dpkg/apt locks; while held, prints a clear non-spinner message ("Waiting for the system package lock — unattended-upgrades can hold it for a few minutes on a fresh VM…") plus a same-line heartbeat with a ticking elapsed counter (proof of life). Bounded by TRACEBLOC_APT_LOCK_TIMEOUT (default 300s); on timeout it warns with the likely holder + actionable lsof/systemctl status guidance and proceeds (apt queues behind the holder) rather than looping forever. Apt-only — a silent no-op on dnf/yum/zypper/pacman (out of scope here).
  • _apt_lock_held() — the single lock probe (fuser on lock-frontend / lists/lock / dpkg/lock), split out so tests can stub it at the function boundary. If fuser is absent it reports "free" so we never block on an unknowable state (apt's own internal waiting then takes over).
  • _apt_lock_holder_hint() — best-effort name of the holding service for the timeout message.
  • A header comment documenting the "no silent op > a few seconds" progress contract for the known long-running install steps (apt/dnf, downloads, image pulls, CLI pod). Composable with the existing spin_cmd — the wait runs before the spinner, so there is no double-spin.
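The behaviour described in the bullets above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names come from the description, but the bodies, the 2-second poll interval, the exact messages, and the `PKG_MANAGER` variable are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: names mirror the PR description, bodies are assumptions.

# Returns 0 if any apt/dpkg lock file is held by a process, 1 otherwise.
_apt_lock_held() {
  # Without fuser the state is unknowable: report "free" and let apt
  # do its own internal waiting.
  command -v fuser >/dev/null 2>&1 || return 1
  local lock
  for lock in /var/lib/dpkg/lock-frontend /var/lib/apt/lists/lock /var/lib/dpkg/lock; do
    [ -e "$lock" ] || continue
    # fuser exits 0 when at least one process has the file open.
    fuser "$lock" >/dev/null 2>&1 && return 0
  done
  return 1
}

wait_apt_lock() {
  # PKG_MANAGER is a hypothetical variable; non-apt distros skip the wait.
  [ "${PKG_MANAGER:-apt-get}" = "apt-get" ] || return 0
  local timeout="${TRACEBLOC_APT_LOCK_TIMEOUT:-300}"
  local elapsed=0
  local interval=2
  _apt_lock_held || return 0   # free lock: silent no-op
  echo "Waiting for the system package lock (unattended-upgrades can hold it for a few minutes on a fresh VM)..."
  while _apt_lock_held; do
    if [ "$elapsed" -ge "$timeout" ]; then
      printf '\n'
      echo "Lock still held after ${timeout}s; proceeding (apt will queue behind the holder)." >&2
      return 0
    fi
    # Same-line heartbeat with a ticking elapsed counter.
    printf '\r  still waiting... %ss elapsed' "$elapsed"
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  printf '\n'
}
```

Because the wait returns before the spinner-wrapped apt command starts, the two progress indicators never overlap. Note that `fuser` may need root to see processes owned by other users, which this sketch does not check.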

scripts/tests/setup-linux.bats: 7 new tests (see below).

Test plan

Verified locally (macOS, bats 1.13.0)

  • bash -n on every shell script in scripts/ — all parse.

  • shellcheck --severity=error (mirrors the CI gate) on the libs + entrypoints — clean. Zero warnings on setup-linux.sh even at --severity=warning.

  • bats scripts/tests/setup-linux.bats (all 7 new tests pass):

    • wait_apt_lock: held lock emits a visible wait, then proceeds when it clears
    • wait_apt_lock: never-clearing lock times out cleanly (no infinite spin)
    • wait_apt_lock: free lock is a silent no-op
    • wait_apt_lock: non-apt distro skips the apt lock wait entirely
    • install_system_deps: waits on the apt lock before the install spinner
    • _apt_lock_held: no fuser → reports free (does not block)

    Tests mock the lock at the function boundary (_apt_lock_held returns "held" for the first N probes, then "free", simulating unattended-upgrades releasing it) and stub sleep so they run instantly — the bats sandbox can't take a real kernel lock. Limitation: these assert the loop/messaging/timeout logic, not a real held /var/lib/dpkg/lock-frontend. The real-lock path is exercised by the distro-prereqs job's actual apt run in CI.
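The stubbing approach described above can be illustrated in plain bash, without bats. In this sketch `wait_for_lock_demo` is a hypothetical stand-in for the function under test, and `HELD_PROBES` and `probe_count` are illustrative names; only `_apt_lock_held` and the stubbed `sleep` come from the description.

```shell
#!/usr/bin/env bash
# Sketch: stub the lock probe and sleep at the function boundary.

# Hypothetical stand-in for the function under test (far simpler than
# the PR's wait_apt_lock): polls the probe until it reports "free".
wait_for_lock_demo() {
  local waited=0
  while _apt_lock_held; do
    [ "$waited" -eq 0 ] && echo "Waiting for the system package lock..."
    waited=$((waited + 1))
    sleep 2
  done
  echo "probes_while_held=$waited"
}

# Stub: report "held" for the first HELD_PROBES calls, then "free",
# simulating unattended-upgrades releasing the lock.
HELD_PROBES=3
probe_count=0
_apt_lock_held() {
  probe_count=$((probe_count + 1))
  [ "$probe_count" -le "$HELD_PROBES" ]
}

# Stub sleep (shadows the external command) so the loop runs instantly.
sleep() { :; }

wait_for_lock_demo
# prints:
#   Waiting for the system package lock...
#   probes_while_held=3
```

Shell functions take precedence over external commands of the same name, which is why defining `sleep` as a function is enough to neutralise the real delay.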

Pre-existing local failures (NOT from this PR)

Two existing tests — install_docker_engine: Amazon Linux -> dnf docker and … RHEL clone (#719) -> docker-ce dnf repo — fail on a clean develop checkout on macOS because their branch is guarded by [[ -f /etc/os-release ]], which is false on macOS, so the mocked grep never runs. They pass in CI (Linux, where /etc/os-release exists) and are untouched by this change.
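To make the failure mode concrete, here is a minimal sketch of a guard of that shape. `detect_distro_id` and its path argument are hypothetical and not taken from the repository; the point is only that when the file test fails, the `grep` inside the branch (mocked or not) is never reached.

```shell
#!/usr/bin/env bash
# Sketch: a distro check guarded by a file-existence test.
# detect_distro_id and its optional path argument are hypothetical.
detect_distro_id() {
  local os_release="${1:-/etc/os-release}"
  if [ -f "$os_release" ]; then
    # Only reached when the file exists, e.g. on Linux.
    grep -E '^ID=' "$os_release" | cut -d= -f2 | tr -d '"'
  else
    # On macOS /etc/os-release is absent, so grep never runs.
    echo "unknown"
  fi
}

# Usage: detect_distro_id            (reads /etc/os-release)
#        detect_distro_id ./fixture  (reads a test fixture instead)
```

With the default path, a test that mocks only `grep` still takes the `else` branch on macOS, which matches the local failures described above.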

Needs CI

  • installer-tests.yaml / unit-bash (bats) on Linux: full suite, including the two tests above that fail only on macOS.
  • installer-tests.yaml / distro-prereqs: the real apt path on Ubuntu/Debian images exercises wait_apt_lock against a live (usually free) lock.
  • Manual smoke on a fresh Ubuntu cloud VM during the unattended-upgrades window to see the heartbeat in a real boot-time lock hold.

Deployment notes

New optional env var TRACEBLOC_APT_LOCK_TIMEOUT (seconds, default 300) tunes how long to wait before proceeding. No other config or rollout changes.
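As an illustration of how such a variable might be read, the sketch below applies the documented default of 300 seconds. The helper name `apt_lock_timeout` and the fallback for non-numeric values are assumptions; the PR description does not say how invalid input is handled.

```shell
#!/usr/bin/env bash
# Sketch: resolve the timeout from the environment.
# apt_lock_timeout is a hypothetical helper name.
apt_lock_timeout() {
  local raw="${TRACEBLOC_APT_LOCK_TIMEOUT:-300}"
  case "$raw" in
    # Assumption: anything that is not a plain non-negative integer
    # falls back to the default.
    *[!0-9]*) echo 300 ;;
    *) echo "$raw" ;;
  esac
}

# Usage: timeout="$(apt_lock_timeout)"
```

Setting `TRACEBLOC_APT_LOCK_TIMEOUT=600` in the installer's environment would then double the wait before it proceeds.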

Checklist

  • Tests added / updated and passing locally (new tests green; 2 pre-existing macOS-only failures unrelated, see above)
  • Docs updated if behavior or config changed (progress-contract comment + env var documented in this PR; new var is internal/advanced)
  • No secrets / credentials in the diff
  • For security-sensitive paths: appropriate reviewer requested (N/A — installer UX path, not a CODEOWNERS-gated file)

…er (#740)

On a fresh cloud VM, unattended-upgrades/apt-daily hold the dpkg
frontend lock for the first few minutes after boot. The system-deps
step runs apt-get update/install under spin_cmd, which redirects output
and animates a spinner, hiding the fact that apt is simply blocked on
the lock. The install looks frozen for minutes and users abort.

Add wait_apt_lock(): before the apt spinner in install_system_deps, poll
the dpkg/apt locks via fuser and surface a clear, non-spinner message
("Waiting for the system package lock - unattended-upgrades can hold it
for a few minutes on a fresh VM...") plus a ticking heartbeat so it is
obviously alive. Bounded by TRACEBLOC_APT_LOCK_TIMEOUT (default 300s);
on timeout it prints actionable guidance and proceeds (apt queues behind
the holder) rather than looping forever. Apt-only - a no-op on other PMs.

The lock probe is split into _apt_lock_held so it can be stubbed at the
function boundary in tests (the bats sandbox cannot take a real kernel
lock). Also documents the "no silent op > a few seconds" progress
contract for the known long-running install steps.

Tests: extend setup-linux.bats with a held-then-released lock (asserts
the wait message + proceed), a never-clearing lock (bounded timeout,
no infinite spin), a free-lock silent no-op, a non-apt no-op, the
install_system_deps ordering (wait before spinner), and the no-fuser
fallback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka

Contributor Author

👋 Heads-up — Code review queue is at 17 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@LukasWodka added the bug (Something isn't working) and work-type:bug (Defect or regression) labels Jun 8, 2026
@LukasWodka

Contributor Author

Closing as superseded.

The simpler apt_wait_for_lock from PR #211/#212 has shipped to develop and main and resolves issue #740. This PR's wait_apt_lock is a different (more elaborate) approach to the same problem — replacing the simpler shipped version with this would be a behavior change, not a rebase.

What this PR adds that the shipped version doesn't:

  • _apt_lock_held and _apt_lock_holder_hint helper functions
  • Heartbeat with elapsed-seconds counter
  • Holder-process hint on timeout
  • 77 lines of setup-linux.bats tests

If we want any of those refinements later, refile each as a small targeted PR against current develop.

Closed during conflict cleanup 2026-06-08.

@LukasWodka closed this Jun 8, 2026
@LukasWodka deleted the fix/install-journey-740-apt-lock-wait branch June 8, 2026 18:14
