fix(installer): hard-gate low RAM and measure the container-runtime's view by LukasWodka · Pull Request #217 · tracebloc/client

LukasWodka · 2026-06-06T13:08:02Z

What & why

Customers on small Ubuntu VMs hit a node-level OOM after install: mysql-client went into CrashLoopBackOff, with jobs-manager and requests-proxy cascading off it. The install "succeeded," then the stack thrashed. Two gaps in scripts/lib/preflight.sh let this through:

RAM/CPU only warned — never blocked. A warn wouldn't have stopped it.
Memory was read from the host, not the container runtime — on macOS/Windows the pods run in the Docker Desktop/Colima/WSL2 VM with its own (often 2–8 GB) budget, so a 36 GB Mac passed while its 4 GB VM OOM'd.

And the warn threshold (4 GB) sat below the stack's real footprint.
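The host-versus-runtime gap can be illustrated with a minimal bash sketch of a "prefer runtime, fall back to host" selector. The function names and the MEMINFO_FILE variable below are hypothetical and are not taken from preflight.sh; the sketch only assumes that `docker info --format '{{.MemTotal}}'` prints the runtime's memory in bytes and that /proc/meminfo reports MemTotal in kB.

```shell
# Hypothetical helpers; the real ones in scripts/lib/preflight.sh may differ.

# Host view: MemTotal from /proc/meminfo, in kB (Linux only).
# MEMINFO_FILE is an assumed test hook, not a documented variable.
host_mem_kb() {
  awk '/^MemTotal:/ {print $2; exit}' "${MEMINFO_FILE:-/proc/meminfo}" 2>/dev/null
}

# Runtime view: docker info reports MemTotal in bytes; convert to kB.
# Returns non-zero (and prints nothing) if the daemon is unreachable
# or the value is not a positive integer.
runtime_mem_kb() {
  local bytes
  bytes="$(docker info --format '{{.MemTotal}}' 2>/dev/null)" || return 1
  case "$bytes" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$bytes" -gt 0 ] || return 1
  echo $(( bytes / 1024 ))
}

# Selector: prefer the runtime's number, fall back to the host's.
total_mem_kb() {
  runtime_mem_kb || host_mem_kb
}
```

On a Mac whose Docker VM has 4 GiB, a selector like this would report the VM's 4 GiB while the daemon is up, rather than the host's 36 GB.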

Changes

Measure the runtime's view — new _pf_runtime_mem_kb/_pf_runtime_ncpu read docker info MemTotal/NCPU; _pf_total_mem_kb/_pf_ncpu become selectors (prefer runtime, fall back to host). Precedent: _pf_docker_root already reads docker info.
Hard memory gate — PF_MIN_MEM_GB (5 GiB) hard-fails on Linux (mirrors the disk gate); warn-only on Mac/Win (Docker is still down at preflight there → host RAM). Warn raised 4→8; recommend 16 to train. Disk floor 5→10.
64 MiB grace on the floor so a VM reporting a hair under N GiB doesn't truncate-to-(N-1) and false-trip.
Post-Docker re-check in create_cluster (_pf_recheck_runtime_mem, warn-only) — the first point docker info is reliably up on every OS; closes the Mac/Win case the preflight read can't see.
Linux MemAvailable warn for busy shared VMs.
Colima headless default → 6 GB / 4 CPU (env-overridable) so the gate never flags a VM the installer itself provisioned.
CPU stays warn-only (throttling ≠ OOM); recommend 4 cores to train.
PowerShell mirror (install-k8s.ps1) + bats/Pester coverage.
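The floor, the 64 MiB grace, and the Linux-only hard fail can be sketched as one comparison in kB. This is an illustration under stated assumptions, not the code in preflight.sh: `classify_mem`, `PF_WARN_MEM_GB`, and `GRACE_KB` are hypothetical names, and only `PF_MIN_MEM_GB` and the numbers (5, 8, 64 MiB) come from the description above. Comparing in kB against `floor - grace` avoids the integer division that would turn "a hair under 5 GiB" into 4.

```shell
# Hypothetical sketch; only PF_MIN_MEM_GB and the thresholds come from the PR.
PF_MIN_MEM_GB="${PF_MIN_MEM_GB:-5}"   # hard floor in GiB (overridable)
PF_WARN_MEM_GB=8                       # assumed name for the warn threshold
GRACE_KB=$(( 64 * 1024 ))              # 64 MiB expressed in kB

# classify_mem KB OS -> prints "fail", "warn", or "ok"
classify_mem() {
  local kb="$1" os="$2"
  # The grace is subtracted from the floor only, as the PR describes.
  local floor_kb=$(( PF_MIN_MEM_GB * 1024 * 1024 - GRACE_KB ))
  local warn_kb=$(( PF_WARN_MEM_GB * 1024 * 1024 ))
  if [ "$kb" -lt "$floor_kb" ]; then
    # Below the floor: hard-fail on Linux, warn-only elsewhere.
    if [ "$os" = "Linux" ]; then echo fail; else echo warn; fi
  elif [ "$kb" -lt "$warn_kb" ]; then
    echo warn
  else
    echo ok
  fi
}
```

For example, a nominal 5 GiB VM that reports 5201920 kB (40 MiB short) would truncate to 4 GiB under integer division, but lands in the warn band here because it is within the grace.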

Why these numbers (not folklore)

Derived from the chart's real footprint and cross-checked three ways:

Always-on requests ~2.1 GiB + k3s ~0.8 + OS ~0.7 ≈ ~4.4 GiB just to be Online → below 5 it boots then OOMs.
Training-job limit is ~8 GiB+ → 16 GiB to train locally, matching the EKS guide's t3.xlarge training node, legacy_docs/system-requirements.md (16 GB local), and setup-guide.mdx (16+ GB rec).

Testing

bats scripts/tests/preflight.bats — 37/37 green (16 new: Linux floor hard-fail, warn band, macOS warn-only, selector runtime-preference + host-fallback, 64 MiB grace, PF_MIN_MEM_GB override, MemAvailable, recheck).
Full suite: no new failures (4 pre-existing env failures are identical on clean develop).
Pester: runtime-preference (mock docker) + warn-only gate + recheck cases added; ps1 runtime unverified locally (no Windows) — CI Pester covers.

Notes

Out of scope → separate ticket: the training-job default request/limit mismatch (client-runtime/jobs_manager.py request ~202Mi but limit 20G — schedules almost anywhere, then can OOM the node).
Overrides preserved: PF_MIN_MEM_GB=…, TRACEBLOC_SKIP_PREFLIGHT=1.

Refs tracebloc/backend#744

🤖 Generated with Claude Code

… view

Preflight only *warned* on low RAM and read *host* memory, not the Docker VM's, so small Linux VMs installed then OOM'd (mysql CrashLoopBackOff cascade; the Charité/Niklas + Giesan reports), and a 36 GB Mac with a 4 GB Docker VM passed.

- Measure `docker info` MemTotal/NCPU (the budget the pods actually get), with a host fallback when the daemon is down.
- Hard-fail below PF_MIN_MEM_GB (5 GiB) on Linux; warn-only on Mac/Win (Docker is still down at preflight there). Raise warn 4->8 GiB; recommend 16 GiB to train. Raise disk floor 5->10 GiB. 64 MiB grace avoids bytes->GiB truncation false-trips.
- Re-check runtime memory once Docker is up (create_cluster); this closes the Mac/Win case the preflight read can't see.
- Linux MemAvailable warn for busy shared VMs.
- Bump headless-Mac Colima default to 6 GB / 4 CPU (env-overridable) so the gate never flags a VM the installer itself provisioned.
- CPU stays warn-only (throttling != OOM); recommend 4 cores to train.
- Mirror in install-k8s.ps1; bats + Pester coverage.

Thresholds are derived from the chart's real footprint (always-on ~2.1 GiB requests + k3s + OS ~= 4.4 GiB to be Online; training-job limit ~8 GiB+), and corroborated by the EKS-guide instance types and legacy system-requirements.

Refs tracebloc/backend#744

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-06T13:08:46Z

👋 Heads-up — Code review queue is at 21 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#102 — chore(kanban): declare staging as prod-truth branch via .kanban.yml · author: @LukasWodka · no reviewer assigned
averaging-service#103 — fix(averaging): correctness foundations — DP, shape guard, sklearn weighting, bounded ensembles (chore: add default CODEOWNERS for auto-reviewer assignment #73) · author: @LukasWodka · reviewer: @shujaatTracebloc
averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
backend#732 — feat(experiments): add per-experiment aggregation_strategy field (#730) · author: @LukasWodka · reviewer: @saqlainsyed007, @shujaatTracebloc
backend#735 — fix(flops): atomic flops_used + fire stop once on threshold crossing (#733) · author: @saadqbal · no reviewer assigned
cli#61 — fix(install.sh): persist PATH to the shell rc (parity with install.ps1) · author: @LukasWodka · reviewer: @saadqbal
client#207 — test(charts): add helm-unittest suite for ingestion-authz-configmap · author: @saadqbal · no reviewer assigned
client#216 — test(charts): add helm-unittest suite for jobs-manager-service template · author: @saadqbal · no reviewer assigned
data-ingestors#156 — harden(time-series): reject locale-ambiguous timestamps (silent mis-parse guard) · author: @LukasWodka · no reviewer assigned
data-ingestors#158 — harden(image): distinct reasons for unreadable images (empty / corrupt / bomb) · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…t preflight.sh

The e2e-cluster.sh harness sources cluster.sh but not preflight.sh, so the new _pf_recheck_runtime_mem call logged "command not found" (harmless under `|| true`, but wrong). Guard the call with `declare -F`, and source preflight.sh in the e2e harness so the re-check is exercised on a real cluster bring-up. The real installer already sources preflight.sh before create_cluster, so production was unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
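The `declare -F` guard mentioned in this commit can be sketched as a small bash wrapper. `call_if_defined` is a hypothetical helper for illustration; the actual guard in cluster.sh may simply be an inline `if`. `declare -F name` exits zero only when `name` is a defined shell function, so an unsourced preflight.sh leads to a silent skip instead of a "command not found" message.

```shell
# Hypothetical helper: run a shell function only if it is defined.
call_if_defined() {
  local fn="$1"
  shift
  if declare -F "$fn" >/dev/null; then
    # Warn-only semantics: a failing re-check must not abort the caller.
    "$fn" "$@" || true
  fi
}
```

A call site would then read `call_if_defined _pf_recheck_runtime_mem`, which is a no-op in any harness that has not sourced preflight.sh.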

…gate (#54)

The installer now hard-fails below ~5 GB RAM on Linux (was warn-only), so the old "it only warns" line was wrong. State minimum-to-run (2 CPU / 5 GB / 10 GB) vs recommended-to-train (4 CPU / 16 GB / 50 GB) consistently across quickstart, deploy-local, and the setup guide. A single training job reserves ~8 GB.

Pairs with tracebloc/client#217 · refs tracebloc/backend#744

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

saadqbal

Reviewed the runtime-vs-host memory selector and the Linux-hard-fail / Mac+Win-warn-only split — that's the correct distinction (no Docker VM on native Linux, so docker info MemTotal == host there; the gate that matters is the VM budget on Mac/Win). The 64 MiB grace, the post-Docker _pf_recheck_runtime_mem hook (guarded in cluster.sh + sourced in e2e), and the Colima 6 GB default all check out, and the thresholds are justified from the chart's real footprint rather than guessed. 37/37 bats + Pester green. LGTM.

LukasWodka assigned saadqbal Jun 6, 2026

LukasWodka mentioned this pull request Jun 6, 2026

docs(environment-setup): reconcile machine sizing with the preflight gate tracebloc/docs#54

Merged

This was referenced Jun 6, 2026

chore(docs): touch all .mdx files to force Mintlify full rebuild tracebloc/docs#55

Merged

ci(docs): nightly + on-push probe for page coverage on dev and prod tracebloc/docs#56

Merged

LukasWodka added the work-type:bug Defect or regression label Jun 7, 2026

saadqbal approved these changes Jun 8, 2026


saadqbal merged commit 5f9c8d2 into develop Jun 8, 2026
19 checks passed

This was referenced Jun 8, 2026

ci(drift): cross-repo guards — docs selectors ↔ chart labels; preflight thresholds ↔ docs sizing #220

Open

release: promote develop → main (installer hardening — OOM gate, apt/needrestart fixes, drift CI) #221

Merged

LukasWodka added the bug Something isn't working label Jun 8, 2026


saadqbal merged 2 commits into develop from fix/preflight-resource-gate
