
WS3 · cluster doctor follow-up checks — node fit, image pullability, in-cluster egress #90

Description

@saadqbal

Parent epic: tracebloc/client-runtime#116 (WS3). Follow-up to #88 / PR #89, which shipped the tracebloc cluster doctor MVP (6 checks) and deferred these three.

Checks to add

  • Node fit — compare each node's allocatable CPU, memory, and GPU against the resource requests the jobs-manager stamps on spawned training jobs (the RESOURCE_REQUESTS and GPU_REQUESTS env vars). If no node can fit a job, surface ✖ "training jobs can't schedule". This catches the silent "Pending forever, no node" failure class. Read-only.
  • Image pullability — when the chart uses a registry pull secret (tracebloc.useImagePullSecrets), verify that the secret exists in the namespace and is a well-formed kubernetes.io/dockerconfigjson secret, so private-image pulls don't fail with ImagePullBackOff. Read-only.
  • In-cluster egress probe — the bigger one. Today's Backend egress check probes from the CLI host, which isn't the cluster's egress path (the cluster egresses via the egress-proxy). A real probe needs to run inside the cluster. Two options: port-forward to the egress-proxy's own connectivity probe endpoint (preferred; reuses internal/submit/portforward.go and stays side-effect-light), or exec/spawn a probe pod (side-effecting; breaks doctor's read-only contract). This needs its own design and PR; do not rush it into the read-only checks PR.
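A minimal sketch of the node fit comparison, assuming the node allocatable values and the job requests have already been parsed into plain integers. The `Resources` and `Node` types and the `nodesThatFit` helper are hypothetical names for illustration; a real check would read nodes through client-go and parse quantities with the Kubernetes resource types, and the format of RESOURCE_REQUESTS / GPU_REQUESTS is not assumed here.

```go
package main

import "fmt"

// Resources is a hypothetical, simplified stand-in for parsed
// Kubernetes quantities (CPU in millicores, memory in bytes).
type Resources struct {
	MilliCPU    int64
	MemoryBytes int64
	GPUs        int64
}

// Node is a hypothetical view of a cluster node: its name and
// its allocatable resources.
type Node struct {
	Name        string
	Allocatable Resources
}

// fits reports whether every requested dimension is covered by
// the allocatable amount.
func fits(alloc, req Resources) bool {
	return alloc.MilliCPU >= req.MilliCPU &&
		alloc.MemoryBytes >= req.MemoryBytes &&
		alloc.GPUs >= req.GPUs
}

// nodesThatFit returns the names of nodes whose allocatable
// resources cover the request. An empty result means no node
// could ever host the job.
func nodesThatFit(nodes []Node, req Resources) []string {
	var names []string
	for _, n := range nodes {
		if fits(n.Allocatable, req) {
			names = append(names, n.Name)
		}
	}
	return names
}

func main() {
	nodes := []Node{
		{Name: "cpu-1", Allocatable: Resources{MilliCPU: 4000, MemoryBytes: 8 << 30, GPUs: 0}},
		{Name: "cpu-2", Allocatable: Resources{MilliCPU: 2000, MemoryBytes: 4 << 30, GPUs: 0}},
	}
	// Example request: 2 CPUs, 4 GiB, 1 GPU (values are made up).
	req := Resources{MilliCPU: 2000, MemoryBytes: 4 << 30, GPUs: 1}

	if len(nodesThatFit(nodes, req)) == 0 {
		fmt.Println("✖ training jobs can't schedule")
		return
	}
	fmt.Println("at least one node fits the training job request")
}
```

This compares against allocatable only, as described above. It does not subtract what already-running pods consume, so a passing result means a job could fit on an otherwise idle node, not that it will schedule right now.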

Sequencing

PR 1: node fit + image pullability (read-only, fits doctor's existing client-go pattern).
PR 2: in-cluster egress probe (after confirming the egress-proxy's probe contract in the client repo).
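For the image pullability check in PR 1, the "well-formed" part could look roughly like the sketch below. It assumes the secret's type and decoded data map have already been fetched (client-go exposes secret data as `map[string][]byte`); `checkPullSecret` is a hypothetical helper name. The `.dockerconfigjson` key and the top-level `auths` object are the standard layout for this secret type.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const (
	dockerConfigJSONType = "kubernetes.io/dockerconfigjson"
	dockerConfigJSONKey  = ".dockerconfigjson"
)

// dockerConfig models only the part of the payload this check
// needs: the registry entries under "auths".
type dockerConfig struct {
	Auths map[string]json.RawMessage `json:"auths"`
}

// checkPullSecret (hypothetical helper) validates the shape of a
// pull secret without contacting any registry. It never returns
// or logs the credential contents.
func checkPullSecret(secretType string, data map[string][]byte) error {
	if secretType != dockerConfigJSONType {
		return fmt.Errorf("secret type is %q, want %q", secretType, dockerConfigJSONType)
	}
	raw, ok := data[dockerConfigJSONKey]
	if !ok || len(raw) == 0 {
		return fmt.Errorf("secret has no %q key", dockerConfigJSONKey)
	}
	var cfg dockerConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return fmt.Errorf("%s is not valid JSON: %w", dockerConfigJSONKey, err)
	}
	if len(cfg.Auths) == 0 {
		return errors.New("no registries listed under \"auths\"")
	}
	return nil
}

func main() {
	// Example payload with a made-up registry and placeholder auth.
	data := map[string][]byte{
		dockerConfigJSONKey: []byte(`{"auths":{"registry.example.com":{"auth":"placeholder"}}}`),
	}
	if err := checkPullSecret(dockerConfigJSONType, data); err != nil {
		fmt.Println("✖ image pull secret:", err)
		return
	}
	fmt.Println("image pull secret is well-formed")
}
```

The check is purely structural, which keeps it read-only: it confirms the secret would be accepted as a pull secret, not that the credentials are still valid against the registry.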

Target branch: develop.
