Skip to content

fix(installer): hard-gate low RAM and measure the container-runtime's view#217

Merged
saadqbal merged 2 commits into
developfrom
fix/preflight-resource-gate
Jun 8, 2026
Merged

fix(installer): hard-gate low RAM and measure the container-runtime's view#217
saadqbal merged 2 commits into
developfrom
fix/preflight-resource-gate

Conversation

@LukasWodka

@LukasWodka LukasWodka commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

What & why

Customers on small Ubuntu VMs (a customer/the customer contact, the customer contact) hit a node-level OOM after installmysql-client CrashLoopBackOff with jobs-manager/requests-proxy cascading off it. The install "succeeded," then the stack thrashed. Two gaps in scripts/lib/preflight.sh:

  1. RAM/CPU only warned — never blocked. A warn wouldn't have stopped it.
  2. Memory was read from the host, not the container runtime — on macOS/Windows the pods run in the Docker Desktop/Colima/WSL2 VM with its own (often 2–8 GB) budget, so a 36 GB Mac passed while its 4 GB VM OOM'd.

And the warn threshold (4 GB) sat below the stack's real footprint.

Changes

  • Measure the runtime's view — new _pf_runtime_mem_kb/_pf_runtime_ncpu read docker info MemTotal/NCPU; _pf_total_mem_kb/_pf_ncpu become selectors (prefer runtime, fall back to host). Precedent: _pf_docker_root already reads docker info.
  • Hard memory gatePF_MIN_MEM_GB (5 GiB) hard-fails on Linux (mirrors the disk gate); warn-only on Mac/Win (Docker is still down at preflight there → host RAM). Warn raised 4→8; recommend 16 to train. Disk floor 5→10.
  • 64 MiB grace on the floor so a VM reporting a hair under N GiB doesn't truncate-to-(N-1) and false-trip.
  • Post-Docker re-check in create_cluster (_pf_recheck_runtime_mem, warn-only) — the first point docker info is reliably up on every OS; closes the Mac/Win case the preflight read can't see.
  • Linux MemAvailable warn for busy shared VMs.
  • Colima headless default → 6 GB / 4 CPU (env-overridable) so the gate never flags a VM the installer itself provisioned.
  • CPU stays warn-only (throttling ≠ OOM); recommend 4 cores to train.
  • PowerShell mirror (install-k8s.ps1) + bats/Pester coverage.

Why these numbers (not folklore)

Derived from the chart's real footprint and cross-checked three ways:

  • Always-on requests ~2.1 GiB + k3s ~0.8 + OS ~0.7 ≈ ~4.4 GiB just to be Online → below 5 it boots then OOMs.
  • Training-job limit is ~8 GiB+ → 16 GiB to train locally, matching the EKS guide's t3.xlarge training node, legacy_docs/system-requirements.md (16 GB local), and setup-guide.mdx (16+ GB rec).

Testing

  • bats scripts/tests/preflight.bats37/37 green (16 new: Linux floor hard-fail, warn band, macOS warn-only, selector runtime-preference + host-fallback, 64 MiB grace, PF_MIN_MEM_GB override, MemAvailable, recheck).
  • Full suite: no new failures (4 pre-existing env failures are identical on clean develop).
  • Pester: runtime-preference (mock docker) + warn-only gate + recheck cases added; ps1 runtime unverified locally (no Windows) — CI Pester covers.

Notes

  • Out of scope → separate ticket: the training-job default request/limit mismatch (client-runtime/jobs_manager.py request ~202Mi but limit 20G — schedules almost anywhere, then can OOM the node).
  • Overrides preserved: PF_MIN_MEM_GB=…, TRACEBLOC_SKIP_PREFLIGHT=1.

Refs tracebloc/backend#744

🤖 Generated with Claude Code

… view

Preflight only *warned* on low RAM and read *host* memory, not the Docker VM's —
so small Linux VMs installed then OOM'd (mysql CrashLoopBackOff cascade; the
Charité/Niklas + Giesan reports), and a 36 GB Mac with a 4 GB Docker VM passed.

- Measure `docker info` MemTotal/NCPU (the budget the pods actually get), with a
  host fallback when the daemon is down.
- Hard-fail below PF_MIN_MEM_GB (5 GiB) on Linux; warn-only on Mac/Win (Docker is
  still down at preflight there). Raise warn 4->8 GiB; recommend 16 GiB to train.
  Raise disk floor 5->10 GiB. 64 MiB grace avoids bytes->GiB truncation false-trips.
- Re-check runtime memory once Docker is up (create_cluster) — closes the Mac/Win
  case the preflight read can't see.
- Linux MemAvailable warn for busy shared VMs.
- Bump headless-Mac Colima default to 6 GB / 4 CPU (env-overridable) so the gate
  never flags a VM the installer itself provisioned.
- CPU stays warn-only (throttling != OOM); recommend 4 cores to train.
- Mirror in install-k8s.ps1; bats + Pester coverage.

Thresholds are derived from the chart's real footprint (always-on ~2.1 GiB
requests + k3s + OS ~= 4.4 GiB to be Online; training-job limit ~8 GiB+), and
corroborated by the EKS-guide instance types and legacy system-requirements.

Refs tracebloc/backend#744

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka

Copy link
Copy Markdown
Contributor Author

👋 Heads-up — Code review queue is at 21 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…t preflight.sh

The e2e-cluster.sh harness sources cluster.sh but not preflight.sh, so the new
_pf_recheck_runtime_mem call logged "command not found" (harmless under `|| true`,
but wrong). Guard the call with `declare -F`, and source preflight.sh in the e2e
harness so the re-check is exercised on a real cluster bring-up. The real installer
already sources preflight.sh before create_cluster, so production was unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LukasWodka added a commit to tracebloc/docs that referenced this pull request Jun 6, 2026
…gate (#54)

The installer now hard-fails below ~5 GB RAM on Linux (was warn-only), so the old
"it only warns" line was wrong. State minimum-to-run (2 CPU / 5 GB / 10 GB) vs
recommended-to-train (4 CPU / 16 GB / 50 GB) consistently across quickstart,
deploy-local, and the setup guide. A single training job reserves ~8 GB.

Pairs with tracebloc/client#217 · refs tracebloc/backend#744

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka LukasWodka added the work-type:bug Defect or regression label Jun 7, 2026

@saadqbal saadqbal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the runtime-vs-host memory selector and the Linux-hard-fail / Mac+Win-warn-only split — that's the correct distinction (no Docker VM on native Linux, so docker info MemTotal == host there; the gate that matters is the VM budget on Mac/Win). The 64 MiB grace, the post-Docker _pf_recheck_runtime_mem hook (guarded in cluster.sh + sourced in e2e), and the Colima 6 GB default all check out, and the thresholds are justified from the chart's real footprint rather than guessed. 37/37 bats + Pester green. LGTM.

@saadqbal saadqbal merged commit 5f9c8d2 into develop Jun 8, 2026
19 checks passed
@LukasWodka LukasWodka added the bug Something isn't working label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working work-type:bug Defect or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants