Instances self-heal, restart, and report status correctly on the federation foundation by scotwells · Pull Request #142 · datum-cloud/compute

scotwells · 2026-06-04T21:23:54Z

Value

This makes the federation foundation behave correctly for the people running and using Instances:

Instances self-heal a missed quota grant — a stuck Instance recovers on its own once quota frees up, instead of staying stranded until a manual nudge.
Restarting an Instance actually rolls it — the stateful control path now recreates to roll.
Instance status propagates without a full resync — the federator watches the downstream WorkloadDeployment, so status updates land promptly.
The Instance condition is clearly named Available (renamed from Running), so its meaning matches what users see.
Rollout progress and instanceType sizing are visible — UpdatedReplicas / ObservedGeneration and the instanceType vCPU/memory now surface on status and quota claims.

These fixes were authored downstream on the referenced-data branch (#129) and currently sit at its tip, but none of them depend on the referenced ConfigMap/Secret feature — they complete and correct the foundation itself. This PR moves them onto #107's branch so #107 reviews and merges as a correct, complete foundation rather than a knowingly incomplete one.

What

Thirteen commits, layered on feat/federated-deployment-scheduling:

Status propagation — federator watches the downstream WorkloadDeployment and maps events back to the bare project cluster name, so Instance status updates without waiting for a resync.
Quota self-heal — re-enqueue on quota grant; requeue (observable, conflict-proof, anchored on creation time) while quota is pending so a missed grant recovers on its own; key the claim by Instance name.
Instance restart — roll Instances by recreate so a restart actually rolls them.
refactor(api)!: rename the Instance Running condition to Available — applied across the API constants, controller, and tests; wire values flip "Running" → "Available", plus the CRD/kubebuilder defaults.
Rollout progress — surface UpdatedReplicas / ObservedGeneration on WorkloadDeployment, Workload, and placement status.
Instance blocking reasons + claim sizing — resolve Instance resources (limits → requests → instanceType catalog), include vCPU/memory on the quota claim, and surface a prioritized blocking reason on the Ready condition.
RBAC — grant the Instance controller permission to emit events.

Coordination / decoupling notes

These commits were de-coupled from the referenced-data feature as they moved:

The two feat(controller) commits dropped their referencedData* parameter threading; rollout-progress and blocking-reason surfaces are ported without the refdata tiers.
The blocking-reason priority function was re-authored for feat: route workloads to city locations via distributed scheduling (foundation) #107's scope (Provisioning < PendingQuota < hard runtime errors < network failure); the refdata/network tiers that don't exist on feat: route workloads to city locations via distributed scheduling (foundation) #107 are intentionally omitted.
The work.karmada.io/resourcebindings RBAC rule was not included — it has no controller-gen marker on feat: route workloads to city locations via distributed scheduling (foundation) #107 and belongs with Workloads can reference ConfigMaps and Secrets, delivered to every POP cell #129's hub-side companion-GC work, where the federator reads ResourceBinding annotations.
No ReferencedData* symbols leak into this branch (verified).

After this lands, #129 sheds these commits and re-stacks as the referenced-data feature only.

Verification

go build, go vet, go test ./internal/... ./api/..., and golangci-lint (v2.12.2) all pass clean. Base is a strict fast-forward of feat/federated-deployment-scheduling (no history rewrite).

Reviewer attention

Breaking rename — spot-check the three constants whose wire values flip to "Available" and the CRD/kubebuilder defaults.
Authored blocking-reason priority — confirm the omission of network/refdata tiers is correct for feat: route workloads to city locations via distributed scheduling (foundation) #107's scope.
Dropped resourcebindings RBAC — confirm it's correctly deferred to the hub-GC slice and not needed by feat: route workloads to city locations via distributed scheduling (foundation) #107 at runtime.

🤖 Generated with Claude Code

The WorkloadDeploymentFederator mirrors the downstream Karmada WorkloadDeployment status onto the project (VCP) WorkloadDeployment, but SetupWithManager only watched the project WD via For(). Nothing watched the downstream WD whose status it mirrors, so when Karmada aggregated new status onto the downstream object the federator was not notified — it only caught up on the next informer resync (~10h default) or an incidental project-WD spec write. This is why a freshly created workload's replica counts stayed empty on the VCP long after its projected Instance had already appeared (the InstanceProjector holds the analogous downstream watch and so propagates immediately). Add a downstream watch using the same cross-plane mechanism the InstanceProjector and unikraft-provider use (milosource cluster source + TypedEnqueueRequestsFromMapFunc). The map function correlates a downstream WD event back to its project WD reconcile request: name is stable across planes, namespace comes from the UpstreamOwnerNamespace label the federator stamps, and the project cluster name is recovered by decoding the UpstreamOwnerClusterName label on the downstream namespace (the exact inverse of the encoding applied in ensureDownstreamNamespace). The federation manager already constructed for the InstanceProjector is reused as the watchable source, so there is no additional manager or informer-cache cost beyond the new WD and Namespace informers. Karmada's own status-aggregation interval (edge cell → downstream WD) remains outside this repo; once Karmada writes the aggregated status, the new watch reacts immediately. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The downstream WorkloadDeployment status watch mapped events to a reconcile request whose ClusterName was the full decoded org/project path (decodeUpstreamClusterName turned the "cluster-<org>_<project>" namespace label into "<org>/<project>"). But the Milo multicluster provider keys project clusters by bare project name only. As a result every project except the org-less "datum-cloud" failed to resolve: mcmanager routed the unmatched name (ultimately the empty string) to the local host cluster, which has no compute CRDs, so Reconcile failed with "no matches for kind WorkloadDeployment" in a hot loop (~2 errors/sec observed on staging). Extract the bare project name (final path segment) so it matches the provider key, and guard the mapping with GetCluster: if the project cluster isn't engaged yet, drop the event instead of enqueuing a request that falls back to the host cluster and errors. Dropping is safe — once the provider engages the cluster, the For watch reconciles it and the next downstream status event maps cleanly. Rename decodeUpstreamClusterName to projectClusterNameFromLabel to reflect that it now returns the provider cluster key, and add the not-engaged drop case to the mapping test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The downstream WorkloadDeployment status watch was a complete no-op and the source of a steady ~130 errors/min on the management plane. Two layered causes: milosource.NewClusterSource binds the raw source to the empty cluster name, and the default mchandler.TypedEnqueueRequestsFromMapFunc wraps the map in TypedInjectCluster, which overwrites each request's ClusterName with that bound empty name. So the project cluster name computed by mapDownstreamDeploymentToRequest (and validated by its GetCluster guard) was discarded at enqueue time; every downstream event reached Reconcile with ClusterName="". mcmanager routes the empty name to the local host management cluster, which has no compute CRDs, so the Get failed with "no matches for kind WorkloadDeployment" and requeued in a hot loop — while the watch's actual purpose (immediate status mirror-back) never ran for any project. Switch the handler to TypedEnqueueRequestsFromMapFuncWithClusterPreservation so the map's project cluster name survives to Reconcile, making the downstream watch functional. Add a defensive guard at the top of Reconcile that drops (returns nil, not an error) any request with an empty cluster name, so a host-cluster fallback can never again spin in a requeue loop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tance name An Instance could wedge Pending forever (QuotaGranted=Unknown/QuotaNoBudget, Quota scheduling gate never removed) even though its Milo ResourceClaim was granted: the Instance reconciled once while the claim was still pending, and nothing re-triggered it when the grant landed a beat later. The ResourceClaim watch mapped a claim to its Spec.ResourceRef — the Project — so the grant enqueued the project name, never the owning Instance. Fix the watch to enqueue the owning Instance: its namespace is carried on a new compute.datumapis.com/instance-namespace label (the claim lives in the project quota namespace, not the Instance's), and its name is the claim name with the resource-kind prefix stripped. Also name the claim after the Instance (unique among Instances in the project control plane) with an "instance-" prefix so it cannot collide with other resource kinds' claims sharing the quota namespace, replacing the previous "<namespace>--<name>" scheme. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… them A template-hash change (an image update, or a restartedAt annotation from `datumctl compute restart`) previously resolved to an in-place Update of the Instance. The unikraft provider bakes the pod at creation time and never recomputes an existing pod's spec, so the in-place update silently failed to roll the running workload — instances kept their old pod. Emit a delete (recreate) for drifted Ready instances instead. The next reconcile refills the slot via the create path with the new template, and the provider's finalizer-gated teardown plus create-on-new-Instance roll the pod with no provider changes. Ordered one-at-a-time pacing is preserved by the existing descending-ordinal sort, skip-all-but-first, and the DeletionTimestamp WaitAction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The Instance "Running" status condition is renamed to "Available" (wire value "Available"). An instance can be available while not actively running a pod (e.g. scaled to zero), so "Running" was misleading as a serving/health signal. Renamed constants: InstanceRunning -> InstanceAvailable ("Available") InstanceReadyReasonRunning -> InstanceReadyReasonAvailable ("Available") InstanceRunningReasonRunning -> InstanceAvailableReasonAvailable ("Available") InstanceRunningReasonStopped -> InstanceAvailableReasonStopped InstanceRunningReasonStarting -> InstanceAvailableReasonStarting InstanceRunningReasonStopping -> InstanceAvailableReasonStopping BREAKING CHANGE: the on-the-wire Instance condition type changes from "Running" to "Available". Consumers reading conditions[type=="Running"] must switch to "Available". Existing Instances self-heal on the next provider reconcile (the provider re-asserts the condition under its new name); the stale "Running" entry lingers cosmetically until then and is no longer read by the Ready derivation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…eals The instance controller is re-queued by a ResourceClaim watch when the claim is granted, but that grant event lives on the project control plane and can be missed (informer engagement races, watch relist gaps), wedging the instance at QuotaGranted!=True indefinitely (observed: claim Granted, instance stuck QuotaNoBudget until a manual reconcile cleared it). The pending-quota path returned no RequeueAfter, so there was no safety net. Add a backing-off requeue while QuotaGranted is not True, anchored on the condition's last transition: <60s : 1s (catch a grant landing almost immediately) 60s–5m : 15s 5m–10m : 60s >=10m : 300s Folded into the existing referenced-data requeue (soonest wins). The ResourceClaim watch remains the fast path; this only guarantees a missed grant self-heals instead of wedging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…roof The pending-quota safety-net requeue was wired only at the tail of Reconcile, so an early return during the pending window (a status-update or upstream-writeback conflict) silently dropped it onto controller- runtime's exponential error-backoff — which can stretch to minutes, leaving an instance wedged at QuotaGranted!=True even though its ResourceClaim was granted (observed: the 2nd instance in a rapid burst consistently wedged). - Compute the requeue once, up front, so every return path honors it. - On a Conflict during the pending window, requeue at the bounded quota interval instead of returning the error (which would back off). - Log the requeue decision (and conflict-driven requeues) so the path is observable: a re-firing requeue prints every pass while pending, a dropped one does not. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… LTT Observability revealed the safety-net requeue was firing every reconcile but always at the slowest tier (300s): elapsed was measured from the QuotaGranted condition's LastTransitionTime, which stays at the 1970-01-01 CRD default while quota is pending (PendingEvaluation and NoBudget are both Unknown, so SetStatusCondition never bumps it). Result: a watch-missed instance waited up to 5 minutes for the safety net instead of ~1s, appearing wedged. Anchor elapsed on instance.CreationTimestamp, which reflects actual wait time, so the fast tiers (1s/15s) apply early as intended. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The instance controller emits Warning events on Instances (QuotaNoBudget, ImageUnavailable, InstanceCrashing, ConfigurationError, NetworkFailedToCreate, …) via the event recorder, but no RBAC rule granted it. Every write was rejected — "events is forbidden: ... cannot create resource events in API group \"\" in the namespace ns-<uid>" — so the user-facing signals explaining why an instance is stuck never reached the Instance (kubectl describe / activity timeline). Reconciliation was unaffected; this is an observability gap. Add the kubebuilder marker and regenerate the role. The regen also syncs a pre-existing work.karmada.io/resourcebindings rule (from an existing marker that wasn't reflected in the committed role). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rvedGeneration A restart/rolling update was invisible from the project plane: there was no status field representing how many instances are on the new template revision. Add UpdatedReplicas (instances whose observed template hash matches the desired template, regardless of readiness) and ObservedGeneration to both WorkloadDeployment and Workload (plus placement) status. UpdatedReplicas is computed on the cell WD reconcile alongside CurrentReplicas (which is now its Programmed subset), aggregated up into the Workload, and rides the existing status sync to the project plane. Repoint the "Up-to-date" printcolumn to .status.updatedReplicas to match `kubectl get deployment` semantics, so a roll is visible as the count dips below Replicas and recovers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…emory Two Instance-controller correctness changes: - Blocking-reason rollup: surface the most specific provider sub-condition (ImageUnavailable, InstanceCrashing, ConfigurationError, Provisioning) and its message onto the Instance Ready condition instead of a generic "Instance has not been programmed", so e.g. an image-pull failure reads as ImageUnavailable with the real message. Adds the reason constants and ranks them in the blocking-reason priority. - Quota sizing: resolve vCPU/memory for instanceType-sized instances from a new instanceTypeCatalog (datumcloud/d1-standard-2 = 1 vCPU / 2 GiB) so the quota ResourceClaim requests vcpus + memory, not just instance count. Explicit container limits / instance requests still take precedence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… tests Make the cherry-picked instanceType-sizing and blocking-reason tests lint-clean: hoist the repeated "datumcloud/d1-standard-2", "app", and "test/image:latest" literals into named constants (goconst) and apply gofmt. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scotwells mentioned this pull request Jun 4, 2026

Workloads can reference ConfigMaps and Secrets, delivered to every POP cell #129

Draft

scotwells changed the title ~~fix: complete the federation foundation (bundle stranded fixes from #129)~~ Instances self-heal, restart, and report status correctly on the federation foundation Jun 4, 2026

scotwells force-pushed the feat/federated-deployment-scheduling branch from 82955e2 to bf73355 Compare June 5, 2026 01:42

scotwells force-pushed the split/federation-core-bundled branch from 0920e3e to 741fb15 Compare June 5, 2026 01:42

scotwells mentioned this pull request Jun 5, 2026

End-to-end coverage for federated workload delivery #146

Closed

scotwells force-pushed the feat/federated-deployment-scheduling branch from bf73355 to a5755e4 Compare June 5, 2026 15:13

scotwells force-pushed the split/federation-core-bundled branch from 741fb15 to 119ddfd Compare June 5, 2026 15:13

scotwells force-pushed the feat/federated-deployment-scheduling branch from a5755e4 to cfc79cb Compare June 5, 2026 15:25

scotwells force-pushed the split/federation-core-bundled branch from 119ddfd to 5997231 Compare June 5, 2026 15:25

scotwells force-pushed the feat/federated-deployment-scheduling branch from cfc79cb to a063669 Compare June 5, 2026 16:19

scotwells force-pushed the split/federation-core-bundled branch from 5997231 to 1d0fca7 Compare June 5, 2026 16:19

scotwells and others added 13 commits June 5, 2026 11:42

scotwells force-pushed the feat/federated-deployment-scheduling branch from a063669 to 110778d Compare June 5, 2026 16:46

scotwells force-pushed the split/federation-core-bundled branch from 1d0fca7 to a67b32c Compare June 5, 2026 16:46

scotwells merged commit a67b32c into feat/federated-deployment-scheduling Jun 5, 2026
9 checks passed

scotwells deleted the split/federation-core-bundled branch June 5, 2026 17:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Instances self-heal, restart, and report status correctly on the federation foundation#142

Instances self-heal, restart, and report status correctly on the federation foundation#142
scotwells merged 13 commits into
feat/federated-deployment-schedulingfrom
split/federation-core-bundled

scotwells commented Jun 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

scotwells commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Value

What

Coordination / decoupling notes

Verification

Reviewer attention

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

scotwells commented Jun 4, 2026 •

edited

Loading