release: promote develop → main (installer hardening — OOM gate, apt/needrestart fixes, drift CI)#221

Merged
saadqbal merged 11 commits into main from develop on Jun 8, 2026

@saadqbal saadqbal commented Jun 8, 2026

Contributor

Promotes the unreleased develop commits to main. The installer helper scripts are served from raw.githubusercontent.com/tracebloc/client/main/scripts/ at runtime (bash <(curl -fsSL https://tracebloc.io/i.sh)), so merging this to main ships the installer fixes to users immediately — none of these commits touch the Helm chart templates, so no chart release is required.

Contains (unreleased since v1.5.0)

Chart release (optional — your call)

The chart stays at 1.5.0; nothing in client/ or ingestor/ templates changed. If you want a versioned marker release (as with v1.4.3 "installer hardening"), bump client/Chart.yaml → 1.5.1 and publish a GitHub Release after merge — it's a tracking marker only, since the fix already ships via main.

Gate

Targets main, so fr-gate runs and blocks merge until all contained kanban items (#217, #218, #210, #211, #212) are in Ready for prod.

🤖 Generated with Claude Code


Note

Medium Risk
Changes install-time gates and Linux package install behavior that affect all new deployments; risk is mitigated by extensive bats/Pester updates and CI-only chart contract tests rather than runtime chart changes.

Overview
Installer hardening so that small VMs and flaky Ubuntu apt behavior fail fast instead of hanging or OOMing later. Preflight now uses Docker/Colima VM memory and CPU when the daemon is up (with host fallback), raises floors (5 GB RAM hard-fail on Linux, 10 GB disk), adds training-oriented warnings, and re-checks Docker VM RAM at cluster create (_pf_recheck_runtime_mem / Test-PreflightRuntimeMem). Linux apt paths force non-interactive installs (DEBIAN_FRONTEND, NEEDRESTART_MODE) and bound dpkg lock waits (DPkg::Lock::Timeout, apt_wait_for_lock), including the get.docker.com script and the WSL NVIDIA toolkit install. The macOS Colima default sizing is bumped to 4 CPU / 6 GB RAM.

Adds scripts/tests/check-drift.sh plus a Drift checks GitHub workflow (and shellcheck/bats coverage) to keep backend API hosts and chart workload names aligned with summary.sh / diagnose.sh. Helm unittest files lock down ingestion authz ConfigMap behavior and the stable jobs-manager Service contract (tests only; no template edits in this diff).

Reviewed by Cursor Bugbot for commit 6d381f7. Bugbot is set up for automated code reviews on this repo.

LukasWodka and others added 9 commits June 5, 2026 16:54
…stalls on Ubuntu 22.04+

Ubuntu 22.04+ ships needrestart, which hooks `apt-get install` and opens an
interactive "restart services?" prompt that `-y` does not suppress. The
installer runs apt inside spin_cmd (output redirected, process backgrounded),
so the prompt is invisible and blocks on the TTY (SIGTTIN) → the install hangs
forever (reported as the spinner stuck "still pulling conntrack" on Ubuntu 24.04).

Pass DEBIAN_FRONTEND=noninteractive + NEEDRESTART_MODE=a through `sudo env`
(sudo resets the env) on every apt path the installer drives:
- PM_INSTALL: conntrack/openssl/curl/tar (+ nvidia-container-toolkit via $PM_INSTALL)
- get.docker.com convenience script (runs apt-get internally)
- WSL2 NVIDIA Container Toolkit heredoc in install-k8s.ps1 (parity)

Reuses the existing `sudo env VAR=val` pattern (install_k3d, #718). Adds a bats
guard asserting apt stays non-interactive.
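
A minimal sketch of the `sudo env` pattern described above. `apt_install_cmdline` is a hypothetical helper introduced only for illustration (it is not the installer's code); it prints the command line rather than running it, so it can be tried without root:

```shell
#!/usr/bin/env bash
# Sketch only: apt_install_cmdline is a hypothetical helper, not the installer's code.
# It prints the command line instead of executing it.
apt_install_cmdline() {
  # sudo resets the environment, so both variables travel through `sudo env`.
  # DEBIAN_FRONTEND=noninteractive silences debconf prompts;
  # NEEDRESTART_MODE=a lets needrestart restart services without asking.
  echo "sudo env DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a apt-get install -y $*"
}

apt_install_cmdline conntrack openssl curl tar
```

Setting the variables in the calling shell and then running plain `sudo apt-get ...` would typically not work, because sudo drops them before apt starts; that is why they are placed after `env`.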

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ctive

fix(installer): keep apt non-interactive so needrestart can't hang installs on Ubuntu 22.04+
…g the install (#210)

Stacked on #212 (the needrestart non-interactive fix). This adds the lock
dimension #212 doesn't cover: on a freshly-booted Ubuntu, apt-daily /
unattended-upgrades hold the dpkg lock and apt-get waits on it forever —
and spin_cmd hides apt's "Waiting for cache lock" line, so it looks frozen.

- setup_pm: add -o DPkg::Lock::Timeout=600 to the apt update/install commands
  so the wait is bounded and fails with a clear error instead of hanging.
- add apt_wait_for_lock: a visible, bounded "Waiting for background system
  updates to finish…" step before the docker / system-deps installs.
- bats: assert the apt commands carry the lock timeout.
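
The bounded wait can be sketched roughly as follows. This is an illustration under assumptions, not the actual `apt_wait_for_lock`: the `wait_for_lock` and `lock_is_held` names, the `fuser` probe, and the polling interval are all hypothetical. Only the `DPkg::Lock::Timeout=600` option and the user-facing message come from the commit message.

```shell
#!/usr/bin/env bash
# Assumed probe: fuser exits 0 when some process has the lock file open.
# If fuser is missing, the probe reports "not held" and apt's own
# -o DPkg::Lock::Timeout=600 remains the backstop.
lock_is_held() {
  fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1
}

# usage: wait_for_lock <max_seconds> [interval_seconds] [probe_function]
# Returns 0 once the probe reports the lock free, 1 if the budget runs out.
wait_for_lock() {
  local max=$1 interval=${2:-5} probe=${3:-lock_is_held} waited=0
  while "$probe"; do
    if [ "$waited" -ge "$max" ]; then
      echo "dpkg lock still held after ${max}s; giving up" >&2
      return 1
    fi
    echo "Waiting for background system updates to finish..."
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Example (not executed here):
#   wait_for_lock 600 && sudo apt-get -o DPkg::Lock::Timeout=600 update
```

Printing the wait message outside the spinner is the point of the visible step: the user sees why the install is paused, and the wait ends with an error instead of running indefinitely.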

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix(installer): bound the apt dpkg-lock wait so a held lock can't hang the install (#210)
… view

Preflight only *warned* on low RAM and read *host* memory, not the Docker VM's —
so small Linux VMs installed then OOM'd (mysql CrashLoopBackOff cascade; the
Charité/Niklas + Giesan reports), and a 36 GB Mac with a 4 GB Docker VM passed.

- Measure `docker info` MemTotal/NCPU (the budget the pods actually get), with a
  host fallback when the daemon is down.
- Hard-fail below PF_MIN_MEM_GB (5 GiB) on Linux; warn-only on Mac/Win (Docker is
  still down at preflight there). Raise warn 4->8 GiB; recommend 16 GiB to train.
  Raise disk floor 5->10 GiB. 64 MiB grace avoids bytes->GiB truncation false-trips.
- Re-check runtime memory once Docker is up (create_cluster) — closes the Mac/Win
  case the preflight read can't see.
- Linux MemAvailable warn for busy shared VMs.
- Bump headless-Mac Colima default to 6 GB / 4 CPU (env-overridable) so the gate
  never flags a VM the installer itself provisioned.
- CPU stays warn-only (throttling != OOM); recommend 4 cores to train.
- Mirror in install-k8s.ps1; bats + Pester coverage.

Thresholds are derived from the chart's real footprint (always-on ~2.1 GiB
requests + k3s + OS ~= 4.4 GiB to be Online; training-job limit ~8 GiB+), and
corroborated by the EKS-guide instance types and legacy system-requirements.
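
As a rough illustration of the gate, the sketch below compares the runtime's memory in bytes against the 5 GiB floor minus the 64 MiB grace. The function names `read_runtime_mem_bytes` and `mem_gate` are hypothetical, and comparing in bytes is an assumption about how the grace is applied; only `PF_MIN_MEM_GB`, the 64 MiB figure, and the use of `docker info` MemTotal with a host fallback come from the commit message.

```shell
#!/usr/bin/env bash
PF_MIN_MEM_GB=5                          # hard floor named in the commit
GRACE_BYTES=$((64 * 1024 * 1024))        # 64 MiB grace

# Hypothetical reader: prefer the container runtime's view, fall back to the host.
read_runtime_mem_bytes() {
  local bytes
  bytes=$(docker info --format '{{.MemTotal}}' 2>/dev/null)
  case "$bytes" in
    ''|0|*[!0-9]*)
      # Daemon down or unusable output: use host MemTotal (reported in kB on Linux).
      bytes=$(awk '/^MemTotal:/ {printf "%.0f", $2 * 1024}' /proc/meminfo 2>/dev/null)
      ;;
  esac
  echo "${bytes:-0}"
}

# usage: mem_gate <mem_total_bytes>; returns 0 when memory meets the floor.
mem_gate() {
  local floor_bytes=$((PF_MIN_MEM_GB * 1024 * 1024 * 1024 - GRACE_BYTES))
  [ "$1" -ge "$floor_bytes" ]
}

# Fixed sample value (a 4 GiB VM) so the demo does not depend on Docker.
if mem_gate $((4 * 1024 * 1024 * 1024)); then
  echo "memory OK"
else
  echo "below the ${PF_MIN_MEM_GB} GiB floor"
fi
```

In the real flow the argument would come from something like `read_runtime_mem_bytes`, and on macOS and Windows the failure path would only warn at preflight, with the hard check repeated once Docker is up.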

Refs tracebloc/backend#744

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t preflight.sh

The e2e-cluster.sh harness sources cluster.sh but not preflight.sh, so the new
_pf_recheck_runtime_mem call logged "command not found" (harmless under `|| true`,
but wrong). Guard the call with `declare -F`, and source preflight.sh in the e2e
harness so the re-check is exercised on a real cluster bring-up. The real installer
already sources preflight.sh before create_cluster, so production was unaffected.
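
The `declare -F` guard can be sketched like this. `call_if_defined` is a hypothetical wrapper used only to make the idea testable; the commit itself guards the call inline in cluster.sh.

```shell
#!/usr/bin/env bash
# Sketch: invoke a shell function only when the current shell has it defined.
call_if_defined() {
  local fn=$1
  # declare -F NAME exits 0 only if NAME is a defined function (bash-specific).
  if declare -F "$fn" >/dev/null; then
    "$@"
  fi
}

# With preflight.sh not sourced, the function is absent: nothing runs and
# no "command not found" line is logged.
call_if_defined _pf_recheck_runtime_mem || true
```

Compared with relying on `|| true` alone, the guard keeps the log clean while still letting a real failure of the re-check surface when the function does exist.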

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…load names)

Mocked unit tests can't catch a chart rename that breaks the names summary.sh /
diagnose.sh grep for, or a backend host changed in one of the three files that
hardcode it — they ship green and break in the field. Add a drift checker:

- backend API host parity across preflight.sh / install-client-helm.sh /
  install-k8s.ps1 (dev/stg/prod hosts must match).
- workload-name contract: the Deployments/DaemonSet that summary.sh (readiness
  wait) and diagnose.sh (--diagnose bundle) reference by name must be rendered by
  the chart (helm template), and the scripts must still reference each.
- 8 bats cases for the checker; drift-checks.yaml runs on scripts/ or client/
  changes (helm set up in CI); check-drift.sh added to the shellcheck list.
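
A simplified sketch of the host-parity idea follows; it is not the contents of check-drift.sh. The `HOST_PATTERN` regex, the `example.com` domain, and both function names are placeholders chosen for illustration, and the real checker may extract hosts differently.

```shell
#!/usr/bin/env bash
# Placeholder pattern: the real backend hosts are not shown in this PR text.
HOST_PATTERN='https://[a-z0-9.-]+\.example\.com'

# Prints the sorted, de-duplicated set of hosts found in a file.
extract_hosts() {
  grep -oE "$HOST_PATTERN" "$1" | sort -u
}

# usage: check_host_parity <reference_file> <other_file>...
# Returns 1 if any file's host set differs from the reference file's.
check_host_parity() {
  local ref=$1 want rc=0 f
  shift
  want=$(extract_hosts "$ref")
  for f in "$@"; do
    if [ "$(extract_hosts "$f")" != "$want" ]; then
      echo "drift: backend hosts in $f differ from $ref" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example (not executed here):
#   check_host_parity preflight.sh install-client-helm.sh install-k8s.ps1
```

Comparing sorted sets rather than raw lines lets the bash and PowerShell files declare the hosts in any order and syntax, while a host changed in only one file still fails the check.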

Refs tracebloc/backend#746

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix(installer): hard-gate low RAM and measure the container-runtime's view
test(drift): CI guard for source-of-truth drift (backend hosts + workload names)
@LukasWodka

Contributor

👋 Heads-up — Code review queue is at 26 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

saadqbal and others added 2 commits June 8, 2026 12:40
…configmap

test(charts): add helm-unittest suite for ingestion-authz-configmap
…te (#216)

Covers the previously-untested templates/jobs-manager-service.yaml:
asserts it renders a single ClusterIP Service named jobs-manager (the
stable in-cluster name contract relied on by the ingestor subchart
post-install hook), is NOT a NodePort/LoadBalancer (no external
exposure), selects app=manager, exposes HTTP port 8080/TCP only, and
carries the standard chart labels in the release namespace.

Tests-only; security invariants unchanged.

Refs #193

Co-authored-by: Claude <noreply@anthropic.com>
@saadqbal saadqbal merged commit 7eadd30 into main Jun 8, 2026
97 of 123 checks passed