fix(installer): hard-gate low RAM and measure the container-runtime's view#217
Conversation
… view Preflight only *warned* on low RAM and read *host* memory, not the Docker VM's — so small Linux VMs installed then OOM'd (mysql CrashLoopBackOff cascade; the Charité/Niklas + Giesan reports), and a 36 GB Mac with a 4 GB Docker VM passed. - Measure `docker info` MemTotal/NCPU (the budget the pods actually get), with a host fallback when the daemon is down. - Hard-fail below PF_MIN_MEM_GB (5 GiB) on Linux; warn-only on Mac/Win (Docker is still down at preflight there). Raise warn 4->8 GiB; recommend 16 GiB to train. Raise disk floor 5->10 GiB. 64 MiB grace avoids bytes->GiB truncation false-trips. - Re-check runtime memory once Docker is up (create_cluster) — closes the Mac/Win case the preflight read can't see. - Linux MemAvailable warn for busy shared VMs. - Bump headless-Mac Colima default to 6 GB / 4 CPU (env-overridable) so the gate never flags a VM the installer itself provisioned. - CPU stays warn-only (throttling != OOM); recommend 4 cores to train. - Mirror in install-k8s.ps1; bats + Pester coverage. Thresholds are derived from the chart's real footprint (always-on ~2.1 GiB requests + k3s + OS ~= 4.4 GiB to be Online; training-job limit ~8 GiB+), and corroborated by the EKS-guide instance types and legacy system-requirements. Refs tracebloc/backend#744 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
👋 Heads-up — Code review queue is at 21 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
…t preflight.sh The e2e-cluster.sh harness sources cluster.sh but not preflight.sh, so the new _pf_recheck_runtime_mem call logged "command not found" (harmless under `|| true`, but wrong). Guard the call with `declare -F`, and source preflight.sh in the e2e harness so the re-check is exercised on a real cluster bring-up. The real installer already sources preflight.sh before create_cluster, so production was unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…gate (#54) The installer now hard-fails below ~5 GB RAM on Linux (was warn-only), so the old "it only warns" line was wrong. State minimum-to-run (2 CPU / 5 GB / 10 GB) vs recommended-to-train (4 CPU / 16 GB / 50 GB) consistently across quickstart, deploy-local, and the setup guide. A single training job reserves ~8 GB. Pairs with tracebloc/client#217 · refs tracebloc/backend#744 Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
saadqbal
left a comment
There was a problem hiding this comment.
Reviewed the runtime-vs-host memory selector and the Linux-hard-fail / Mac+Win-warn-only split — that's the correct distinction (no Docker VM on native Linux, so docker info MemTotal == host there; the gate that matters is the VM budget on Mac/Win). The 64 MiB grace, the post-Docker _pf_recheck_runtime_mem hook (guarded in cluster.sh + sourced in e2e), and the Colima 6 GB default all check out, and the thresholds are justified from the chart's real footprint rather than guessed. 37/37 bats + Pester green. LGTM.
What & why
Customers on small Ubuntu VMs (a customer/the customer contact, the customer contact) hit a node-level OOM after install —
mysql-clientCrashLoopBackOff with jobs-manager/requests-proxy cascading off it. The install "succeeded," then the stack thrashed. Two gaps inscripts/lib/preflight.sh:And the warn threshold (4 GB) sat below the stack's real footprint.
Changes
_pf_runtime_mem_kb/_pf_runtime_ncpureaddocker infoMemTotal/NCPU;_pf_total_mem_kb/_pf_ncpubecome selectors (prefer runtime, fall back to host). Precedent:_pf_docker_rootalready readsdocker info.PF_MIN_MEM_GB(5 GiB) hard-fails on Linux (mirrors the disk gate); warn-only on Mac/Win (Docker is still down at preflight there → host RAM). Warn raised 4→8; recommend 16 to train. Disk floor 5→10.create_cluster(_pf_recheck_runtime_mem, warn-only) — the first pointdocker infois reliably up on every OS; closes the Mac/Win case the preflight read can't see.MemAvailablewarn for busy shared VMs.install-k8s.ps1) + bats/Pester coverage.Why these numbers (not folklore)
Derived from the chart's real footprint and cross-checked three ways:
legacy_docs/system-requirements.md(16 GB local), andsetup-guide.mdx(16+ GB rec).Testing
bats scripts/tests/preflight.bats— 37/37 green (16 new: Linux floor hard-fail, warn band, macOS warn-only, selector runtime-preference + host-fallback, 64 MiB grace,PF_MIN_MEM_GBoverride,MemAvailable, recheck).develop).docker) + warn-only gate + recheck cases added; ps1 runtime unverified locally (no Windows) — CI Pester covers.Notes
client-runtime/jobs_manager.pyrequest ~202Mi but limit 20G — schedules almost anywhere, then can OOM the node).PF_MIN_MEM_GB=…,TRACEBLOC_SKIP_PREFLIGHT=1.Refs tracebloc/backend#744
🤖 Generated with Claude Code