…stalls on Ubuntu 22.04+

Ubuntu 22.04+ ships needrestart, which hooks `apt-get install` and opens an interactive "restart services?" prompt that `-y` does not suppress. The installer runs apt inside spin_cmd (output redirected, process backgrounded), so the prompt is invisible and blocks on the TTY (SIGTTIN) → the install hangs forever (reported as the spinner stuck "still pulling conntrack" on Ubuntu 24.04).

Pass DEBIAN_FRONTEND=noninteractive + NEEDRESTART_MODE=a through `sudo env` (sudo resets the env) on every apt path the installer drives:
- PM_INSTALL: conntrack/openssl/curl/tar (+ nvidia-container-toolkit via $PM_INSTALL)
- get.docker.com convenience script (runs apt-get internally)
- WSL2 NVIDIA Container Toolkit heredoc in install-k8s.ps1 (parity)

Reuses the existing `sudo env VAR=val` pattern (install_k3d, #718). Adds a bats guard asserting apt stays non-interactive.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ctive fix(installer): keep apt non-interactive so needrestart can't hang installs on Ubuntu 22.04+
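A minimal sketch of the `sudo env` pass-through described in the commit above. The helper name `apt_noninteractive_cmd` is hypothetical and not taken from the installer; it only prints the command line so the environment handling can be inspected without running apt.

```bash
# Hypothetical helper (not from the installer): print the apt-get command line
# instead of running it, so the environment pass-through can be inspected.
apt_noninteractive_cmd() {
  # sudo resets the environment, so both variables are handed over via `env`.
  echo "sudo env DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a apt-get install -y $*"
}

apt_cmd=$(apt_noninteractive_cmd conntrack openssl curl tar)
echo "$apt_cmd"
```

Setting the variables before `sudo` (for example `DEBIAN_FRONTEND=noninteractive sudo apt-get ...`) would typically not reach apt under sudo's default environment reset, which is why the variables sit after `sudo env`.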
…g the install (#210)

Stacked on #212 (the needrestart non-interactive fix). This adds the lock dimension #212 doesn't cover: on a freshly-booted Ubuntu, apt-daily / unattended-upgrades hold the dpkg lock and apt-get waits on it forever, and spin_cmd hides apt's "Waiting for cache lock" line, so it looks frozen.

- setup_pm: add -o DPkg::Lock::Timeout=600 to the apt update/install commands so the wait is bounded and fails with a clear error instead of hanging.
- add apt_wait_for_lock: a visible, bounded "Waiting for background system updates to finish…" step before the docker / system-deps installs.
- bats: assert the apt commands carry the lock timeout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix(installer): bound the apt dpkg-lock wait so a held lock can't hang the install (#210)
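The bounded wait could look roughly like the sketch below. This is not the installer's `apt_wait_for_lock`; the names `lock_is_held` and `wait_for_lock_sketch` are hypothetical, and probing the lock with `fuser` on `/var/lib/dpkg/lock-frontend` is an assumption about how the lock holder might be detected.

```bash
# apt's own bounded wait, as named in the commit message (600 seconds).
APT_LOCK_OPT="-o DPkg::Lock::Timeout=600"

# Hypothetical probe: assumes fuser exits 0 while some process holds the file open.
lock_is_held() {
  fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1
}

# Hypothetical sketch of a visible, bounded wait.
# $1 = maximum number of polls, $2 = seconds to sleep between polls.
wait_for_lock_sketch() {
  max_tries=$1
  delay=$2
  tries=0
  while lock_is_held; do
    if [ "$tries" -ge "$max_tries" ]; then
      echo "Timed out waiting for background system updates" >&2
      return 1
    fi
    if [ "$tries" -eq 0 ]; then
      echo "Waiting for background system updates to finish..."
    fi
    tries=$((tries + 1))
    sleep "$delay"
  done
  return 0
}

# Example (not executed here): wait up to 60 polls of 10 s, then install with
# the apt-level timeout as a second bound.
#   wait_for_lock_sketch 60 10 && sudo apt-get $APT_LOCK_OPT install -y conntrack
```

The two bounds are complementary: the visible step tells the user why nothing is happening, and the apt option keeps any remaining wait finite even when output is hidden.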
… view

Preflight only *warned* on low RAM and read *host* memory, not the Docker VM's, so small Linux VMs installed then OOM'd (mysql CrashLoopBackOff cascade; the Charité/Niklas + Giesan reports), and a 36 GB Mac with a 4 GB Docker VM passed.

- Measure `docker info` MemTotal/NCPU (the budget the pods actually get), with a host fallback when the daemon is down.
- Hard-fail below PF_MIN_MEM_GB (5 GiB) on Linux; warn-only on Mac/Win (Docker is still down at preflight there). Raise warn 4->8 GiB; recommend 16 GiB to train. Raise disk floor 5->10 GiB. 64 MiB grace avoids bytes->GiB truncation false-trips.
- Re-check runtime memory once Docker is up (create_cluster), which closes the Mac/Win case the preflight read can't see.
- Linux MemAvailable warn for busy shared VMs.
- Bump headless-Mac Colima default to 6 GB / 4 CPU (env-overridable) so the gate never flags a VM the installer itself provisioned.
- CPU stays warn-only (throttling != OOM); recommend 4 cores to train.
- Mirror in install-k8s.ps1; bats + Pester coverage.

Thresholds are derived from the chart's real footprint (always-on ~2.1 GiB requests + k3s + OS ~= 4.4 GiB to be Online; training-job limit ~8 GiB+), and corroborated by the EKS-guide instance types and legacy system-requirements.

Refs tracebloc/backend#744

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
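A rough sketch of the memory gate, under stated assumptions. `PF_MIN_MEM_GB` and the 64 MiB grace come from the commit message; the helper names `mem_meets_floor` and `runtime_mem_bytes`, the variable `PF_GRACE_BYTES`, and the exact way the grace is applied are hypothetical.

```bash
PF_MIN_MEM_GB=5                       # hard floor named in the commit message
PF_GRACE_BYTES=$((64 * 1024 * 1024))  # 64 MiB grace; variable name is hypothetical

# Hypothetical helper: succeed when the measured bytes clear the floor, with the
# grace applied so a VM reporting slightly under 5 GiB is not rejected.
mem_meets_floor() {
  bytes=$1
  floor_bytes=$((PF_MIN_MEM_GB * 1024 * 1024 * 1024))
  [ $((bytes + PF_GRACE_BYTES)) -ge "$floor_bytes" ]
}

# Hypothetical helper: prefer the container runtime's view, fall back to the
# host's /proc/meminfo (Linux only) when the daemon does not answer.
runtime_mem_bytes() {
  out=$(docker info --format '{{.MemTotal}}' 2>/dev/null) || out=""
  case "$out" in
    ''|*[!0-9]*|0) awk '/^MemTotal:/ {printf "%.0f\n", $2 * 1024}' /proc/meminfo ;;
    *) echo "$out" ;;
  esac
}

# Example (not executed here):
#   mem_meets_floor "$(runtime_mem_bytes)" || echo "below ${PF_MIN_MEM_GB} GiB floor" >&2
```

Comparing in bytes, rather than after dividing down to whole GiB, sidesteps the truncation the commit message mentions: a 5 GiB VM that reports a few MiB less would otherwise round down to 4.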
…t preflight.sh

The e2e-cluster.sh harness sources cluster.sh but not preflight.sh, so the new _pf_recheck_runtime_mem call logged "command not found" (harmless under `|| true`, but wrong). Guard the call with `declare -F`, and source preflight.sh in the e2e harness so the re-check is exercised on a real cluster bring-up. The real installer already sources preflight.sh before create_cluster, so production was unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
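The `declare -F` guard might be sketched as follows. The wrapper name `recheck_if_available` and its "skipped" message are hypothetical; only `_pf_recheck_runtime_mem` and the use of `declare -F` come from the commit message. `declare -F` is a bash builtin, so this assumes the script runs under bash.

```bash
# Hypothetical wrapper: call the re-check only when the function exists in this
# shell, so a harness that never sourced preflight.sh skips it cleanly.
recheck_if_available() {
  if declare -F _pf_recheck_runtime_mem >/dev/null; then
    _pf_recheck_runtime_mem || true
  else
    echo "skipped: _pf_recheck_runtime_mem is not defined"
  fi
}

recheck_if_available
```

With `|| true` alone the missing function is tolerated but still logs "command not found"; testing for the function first avoids the misleading log line.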
…load names)

Mocked unit tests can't catch a chart rename that breaks the names summary.sh / diagnose.sh grep for, or a backend host changed in one of the three files that hardcode it; they ship green and break in the field.

Add a drift checker:
- backend API host parity across preflight.sh / install-client-helm.sh / install-k8s.ps1 (dev/stg/prod hosts must match).
- workload-name contract: the Deployments/DaemonSet that summary.sh (readiness wait) and diagnose.sh (--diagnose bundle) reference by name must be rendered by the chart (helm template), and the scripts must still reference each.
- 8 bats cases for the checker; drift-checks.yaml runs on scripts/ or client/ changes (helm set up in CI); check-drift.sh added to the shellcheck list.

Refs tracebloc/backend#746

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
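The host-parity half of such a checker could be sketched like this. It is not the contents of check-drift.sh; the function names are hypothetical, and matching every `https://` host in a file is a simplification, since the real check presumably narrows the match to the backend API hosts.

```bash
# Hypothetical helper: list the distinct https hosts mentioned in a file.
hosts_in_file() {
  grep -oE 'https://[A-Za-z0-9.-]+' "$1" | sort -u
}

# Hypothetical check: every later file must mention exactly the same host set
# as the first file; report the first file that drifts and fail.
check_host_parity() {
  ref_file=$1
  ref=$(hosts_in_file "$ref_file")
  shift
  for f in "$@"; do
    if [ "$(hosts_in_file "$f")" != "$ref" ]; then
      echo "host drift: $f differs from $ref_file" >&2
      return 1
    fi
  done
  return 0
}

# Example (not executed here), using the three files named in the commit:
#   check_host_parity preflight.sh install-client-helm.sh install-k8s.ps1
```

Comparing sorted, de-duplicated host sets keeps the check independent of where in each file, or in which syntax, the host is written.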
fix(installer): hard-gate low RAM and measure the container-runtime's view
test(drift): CI guard for source-of-truth drift (backend hosts + workload names)
Contributor
👋 Heads-up: Code review queue is at 26 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
…configmap test(charts): add helm-unittest suite for ingestion-authz-configmap
…te (#216)

Covers the previously-untested templates/jobs-manager-service.yaml: asserts it renders a single ClusterIP Service named jobs-manager (the stable in-cluster name contract relied on by the ingestor subchart post-install hook), is NOT a NodePort/LoadBalancer (no external exposure), selects app=manager, exposes HTTP port 8080/TCP only, and carries the standard chart labels in the release namespace. Tests-only; security invariants unchanged.

Refs #193

Co-authored-by: Claude <noreply@anthropic.com>
This was referenced Jun 8, 2026
Merged
aptracebloc approved these changes Jun 8, 2026
Promotes the unreleased `develop` commits to `main`. The installer helper scripts are served from raw.githubusercontent.com/tracebloc/client/main/scripts/ at runtime (`bash <(curl -fsSL https://tracebloc.io/i.sh)`), so merging this to `main` ships the installer fixes to users immediately. None of these commits touch the Helm chart templates, so no chart release is required.

Contains (unreleased since v1.5.0)

Chart release (optional, your call)

The chart stays at 1.5.0; nothing in `client/` or `ingestor/` templates changed. If you want a versioned marker release (as with v1.4.3 "installer hardening"), bump `client/Chart.yaml` → 1.5.1 and publish a GitHub Release after merge. It's a tracking marker only, since the fix already ships via `main`.

Gate

Targets `main`, so `fr-gate` runs and blocks merge until all contained kanban items (#217, #218, #210, #211, #212) are in Ready for prod.

🤖 Generated with Claude Code
Note
Medium Risk
Changes install-time gates and Linux package install behavior that affect all new deployments; risk is mitigated by extensive bats/Pester updates and CI-only chart contract tests rather than runtime chart changes.
Overview
Installer hardening so small VMs and flaky Ubuntu apt behavior fail fast instead of hanging or OOMing later. Preflight now uses Docker/Colima VM memory and CPU when the daemon is up (with host fallback), raises floors (5 GB RAM hard-fail on Linux, 10 GB disk), adds train-oriented warnings, and re-checks Docker VM RAM at cluster create (`_pf_recheck_runtime_mem` / `Test-PreflightRuntimeMem`). Linux apt paths force non-interactive installs (`DEBIAN_FRONTEND`, `NEEDRESTART_MODE`), bound dpkg lock waits (`DPkg::Lock::Timeout`, `apt_wait_for_lock`), including get.docker.com and WSL NVIDIA toolkit install. macOS Colima default sizing bumps to 4 CPU / 6 GB RAM.

Adds `scripts/tests/check-drift.sh` plus a Drift checks GitHub workflow (and shellcheck/bats coverage) to keep backend API hosts and chart workload names aligned with `summary.sh` / `diagnose.sh`. Helm unittest files lock down ingestion authz ConfigMap behavior and the stable `jobs-manager` Service contract (tests only; no template edits in this diff).

Reviewed by Cursor Bugbot for commit 6d381f7. Bugbot is set up for automated code reviews on this repo.