
Instances self-heal, restart, and report status correctly on the federation foundation #142

Merged
scotwells merged 13 commits into feat/federated-deployment-scheduling from split/federation-core-bundled
Jun 5, 2026


@scotwells scotwells commented Jun 4, 2026

Value

This makes the federation foundation behave correctly for the people running and using Instances:

  • Instances self-heal a missed quota grant — a stuck Instance recovers on its own once quota frees up, instead of staying stranded until a manual nudge.
  • Restarting an Instance actually rolls it — the stateful control path now recreates to roll.
  • Instance status propagates without a full resync — the federator watches the downstream WorkloadDeployment, so status updates land promptly.
  • The Instance condition is clearly named Available (renamed from Running), so its meaning matches what users see.
  • Rollout progress and instanceType sizing are visible: UpdatedReplicas / ObservedGeneration and the instanceType vCPU/memory now surface on status and quota claims.

These fixes were authored downstream on the referenced-data branch (#129) and currently sit at its tip, but none of them depend on the referenced ConfigMap/Secret feature — they complete and correct the foundation itself. This PR moves them onto #107's branch so #107 reviews and merges as a correct, complete foundation rather than a knowingly incomplete one.

What

Thirteen commits, layered on feat/federated-deployment-scheduling:

  • Status propagation — federator watches the downstream WorkloadDeployment and maps events back to the bare project cluster name, so Instance status updates without waiting for a resync.
  • Quota self-heal — re-enqueue on quota grant; requeue (observable, conflict-proof, anchored on creation time) while quota is pending so a missed grant recovers on its own; key the claim by Instance name.
  • Instance restart — roll Instances by recreate so a restart actually rolls them.
  • refactor(api)!: rename the Instance Running condition to Available, applied across the API constants, controller, and tests; wire values flip "Running" → "Available", plus the CRD/kubebuilder defaults.
  • Rollout progress — surface UpdatedReplicas / ObservedGeneration on WorkloadDeployment, Workload, and placement status.
  • Instance blocking reasons + claim sizing — resolve Instance resources (limits → requests → instanceType catalog), include vCPU/memory on the quota claim, and surface a prioritized blocking reason on the Ready condition.
  • RBAC — grant the Instance controller permission to emit events.

Coordination / decoupling notes

These commits were de-coupled from the referenced-data feature as they moved:

After this lands, #129 sheds these commits and re-stacks as the referenced-data feature only.

Verification

go build, go vet, go test ./internal/... ./api/..., and golangci-lint (v2.12.2) all pass clean. Base is a strict fast-forward of feat/federated-deployment-scheduling (no history rewrite).

Reviewer attention

  1. Breaking rename: spot-check the three constants whose wire values flip to "Available" and the CRD/kubebuilder defaults.
  2. Authored blocking-reason priority: confirm the omission of network/refdata tiers is correct for #107's scope.
  3. Dropped resourcebindings RBAC: confirm it's correctly deferred to the hub-GC slice and not needed by #107 at runtime.

🤖 Generated with Claude Code

@scotwells scotwells changed the title from "fix: complete the federation foundation (bundle stranded fixes from #129)" to "Instances self-heal, restart, and report status correctly on the federation foundation" Jun 4, 2026
@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from 82955e2 to bf73355 Compare June 5, 2026 01:42
@scotwells scotwells force-pushed the split/federation-core-bundled branch from 0920e3e to 741fb15 Compare June 5, 2026 01:42
@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from bf73355 to a5755e4 Compare June 5, 2026 15:13
@scotwells scotwells force-pushed the split/federation-core-bundled branch from 741fb15 to 119ddfd Compare June 5, 2026 15:13
@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from a5755e4 to cfc79cb Compare June 5, 2026 15:25
@scotwells scotwells force-pushed the split/federation-core-bundled branch from 119ddfd to 5997231 Compare June 5, 2026 15:25
@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from cfc79cb to a063669 Compare June 5, 2026 16:19
@scotwells scotwells force-pushed the split/federation-core-bundled branch from 5997231 to 1d0fca7 Compare June 5, 2026 16:19
scotwells and others added 13 commits June 5, 2026 11:42
The WorkloadDeploymentFederator mirrors the downstream Karmada
WorkloadDeployment status onto the project (VCP) WorkloadDeployment, but
SetupWithManager only watched the project WD via For(). Nothing watched
the downstream WD whose status it mirrors, so when Karmada aggregated new
status onto the downstream object the federator was not notified — it
only caught up on the next informer resync (~10h default) or an
incidental project-WD spec write. This is why a freshly created
workload's replica counts stayed empty on the VCP long after its
projected Instance had already appeared (the InstanceProjector holds the
analogous downstream watch and so propagates immediately).

Add a downstream watch using the same cross-plane mechanism the
InstanceProjector and unikraft-provider use (milosource cluster source +
TypedEnqueueRequestsFromMapFunc). The map function correlates a
downstream WD event back to its project WD reconcile request: name is
stable across planes, namespace comes from the UpstreamOwnerNamespace
label the federator stamps, and the project cluster name is recovered by
decoding the UpstreamOwnerClusterName label on the downstream namespace
(the exact inverse of the encoding applied in ensureDownstreamNamespace).

The federation manager already constructed for the InstanceProjector is
reused as the watchable source, so there is no additional manager or
informer-cache cost beyond the new WD and Namespace informers.

Karmada's own status-aggregation interval (edge cell → downstream WD)
remains outside this repo; once Karmada writes the aggregated status, the
new watch reacts immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The downstream WorkloadDeployment status watch mapped events to a
reconcile request whose ClusterName was the full decoded org/project path
(decodeUpstreamClusterName turned the "cluster-<org>_<project>" namespace
label into "<org>/<project>"). But the Milo multicluster provider keys
project clusters by bare project name only. As a result every project
except the org-less "datum-cloud" failed to resolve: mcmanager routed the
unmatched name (ultimately the empty string) to the local host cluster,
which has no compute CRDs, so Reconcile failed with "no matches for kind
WorkloadDeployment" in a hot loop (~2 errors/sec observed on staging).

Extract the bare project name (final path segment) so it matches the
provider key, and guard the mapping with GetCluster: if the project
cluster isn't engaged yet, drop the event instead of enqueuing a request
that falls back to the host cluster and errors. Dropping is safe — once
the provider engages the cluster, the For watch reconciles it and the
next downstream status event maps cleanly.
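
A minimal sketch of the extraction and guard described above, using only the standard library. The "cluster-" prefix and "_" separator are taken from the label shape quoted in this message; the engaged map and the mapToCluster helper are hypothetical stand-ins for the GetCluster check, not the repository's code.

```go
package main

import (
	"fmt"
	"strings"
)

// labelPrefix is assumed from the "cluster-<org>_<project>" label shape.
const labelPrefix = "cluster-"

// projectClusterNameFromLabel returns the bare project name (the final path
// segment) encoded in a downstream namespace label value.
func projectClusterNameFromLabel(label string) (string, bool) {
	rest, ok := strings.CutPrefix(label, labelPrefix)
	if !ok || rest == "" {
		return "", false
	}
	// Assumed decoding: "_" separates org and project.
	path := strings.ReplaceAll(rest, "_", "/")
	name := path[strings.LastIndex(path, "/")+1:]
	return name, name != ""
}

// mapToCluster is a hypothetical stand-in for the map function's guard: it
// yields a cluster key only when that cluster is already engaged.
func mapToCluster(label string, engaged map[string]bool) (string, bool) {
	name, ok := projectClusterNameFromLabel(label)
	if !ok || !engaged[name] {
		// Drop the event instead of falling back to the host cluster.
		return "", false
	}
	return name, true
}

func main() {
	engaged := map[string]bool{"my-project": true}
	fmt.Println(mapToCluster("cluster-acme_my-project", engaged))
	fmt.Println(mapToCluster("cluster-acme_other", engaged))
}
```

An org-less label such as "cluster-datum-cloud" has no separator, so the whole remainder is returned as the project name.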

Rename decodeUpstreamClusterName to projectClusterNameFromLabel to
reflect that it now returns the provider cluster key, and add the
not-engaged drop case to the mapping test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The downstream WorkloadDeployment status watch was a complete no-op and
the source of a steady ~130 errors/min on the management plane. Two
layered causes:

milosource.NewClusterSource binds the raw source to the empty cluster
name, and the default mchandler.TypedEnqueueRequestsFromMapFunc wraps the
map in TypedInjectCluster, which overwrites each request's ClusterName
with that bound empty name. So the project cluster name computed by
mapDownstreamDeploymentToRequest (and validated by its GetCluster guard)
was discarded at enqueue time; every downstream event reached Reconcile
with ClusterName="". mcmanager routes the empty name to the local host
management cluster, which has no compute CRDs, so the Get failed with
"no matches for kind WorkloadDeployment" and requeued in a hot loop —
while the watch's actual purpose (immediate status mirror-back) never
ran for any project.

Switch the handler to TypedEnqueueRequestsFromMapFuncWithClusterPreservation
so the map's project cluster name survives to Reconcile, making the
downstream watch functional. Add a defensive guard at the top of Reconcile
that drops (returns nil, not an error) any request with an empty cluster
name, so a host-cluster fallback can never again spin in a requeue loop.
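
The defensive guard can be sketched as follows. The request type and the get callback are hypothetical simplifications of the cluster-aware reconcile request and client; only the drop-with-nil behavior comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// request is a hypothetical stand-in for a cluster-aware reconcile request.
type request struct {
	ClusterName string
	Namespace   string
	Name        string
}

// errNoKind mimics the error seen when the host cluster lacks compute CRDs.
var errNoKind = errors.New("no matches for kind WorkloadDeployment")

// reconcile drops requests with an empty cluster name by returning nil, so
// the work queue does not retry them. All other requests proceed to get.
func reconcile(req request, get func(request) error) error {
	if req.ClusterName == "" {
		return nil
	}
	return get(req)
}

func main() {
	// hostGet always fails, as a lookup against the host cluster would.
	hostGet := func(request) error { return errNoKind }
	fmt.Println(reconcile(request{Name: "wd"}, hostGet))
	fmt.Println(reconcile(request{ClusterName: "my-project", Name: "wd"}, hostGet))
}
```

Returning nil rather than an error is what keeps the request out of the error-backoff requeue path.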

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tance name

An Instance could wedge Pending forever (QuotaGranted=Unknown/QuotaNoBudget,
Quota scheduling gate never removed) even though its Milo ResourceClaim was
granted: the Instance reconciled once while the claim was still pending, and
nothing re-triggered it when the grant landed a beat later. The ResourceClaim
watch mapped a claim to its Spec.ResourceRef — the Project — so the grant
enqueued the project name, never the owning Instance.

Fix the watch to enqueue the owning Instance: its namespace is carried on a new
compute.datumapis.com/instance-namespace label (the claim lives in the project
quota namespace, not the Instance's), and its name is the claim name with the
resource-kind prefix stripped.

Also name the claim after the Instance (unique among Instances in the project
control plane) with an "instance-" prefix so it cannot collide with other
resource kinds' claims sharing the quota namespace, replacing the previous
"<namespace>--<name>" scheme.
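
A sketch of the naming scheme and the reverse mapping used by the watch. The label key and the "instance-" prefix come from this message; the claim and key types and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	claimPrefix            = "instance-"
	instanceNamespaceLabel = "compute.datumapis.com/instance-namespace"
)

// claim is a hypothetical, trimmed-down view of a ResourceClaim.
type claim struct {
	Name   string
	Labels map[string]string
}

// key identifies the Instance to enqueue.
type key struct {
	Namespace string
	Name      string
}

// claimNameForInstance names the claim after its Instance.
func claimNameForInstance(instanceName string) string {
	return claimPrefix + instanceName
}

// instanceKeyFromClaim recovers the owning Instance: the name comes from the
// claim name with the prefix stripped, the namespace from the label.
func instanceKeyFromClaim(c claim) (key, bool) {
	name, ok := strings.CutPrefix(c.Name, claimPrefix)
	ns := c.Labels[instanceNamespaceLabel]
	if !ok || name == "" || ns == "" {
		return key{}, false // not an Instance claim; nothing to enqueue
	}
	return key{Namespace: ns, Name: name}, true
}

func main() {
	c := claim{
		Name:   claimNameForInstance("web-0"),
		Labels: map[string]string{instanceNamespaceLabel: "default"},
	}
	fmt.Println(instanceKeyFromClaim(c))
}
```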

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… them

A template-hash change (an image update, or a restartedAt annotation from
`datumctl compute restart`) previously resolved to an in-place Update of the
Instance. The unikraft provider bakes the pod at creation time and never
recomputes an existing pod's spec, so the in-place update silently failed to
roll the running workload — instances kept their old pod.

Emit a delete (recreate) for drifted Ready instances instead. The next
reconcile refills the slot via the create path with the new template, and the
provider's finalizer-gated teardown plus create-on-new-Instance roll the pod
with no provider changes. Ordered one-at-a-time pacing is preserved by the
existing descending-ordinal sort, skip-all-but-first, and the
DeletionTimestamp WaitAction.
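
One way to picture the pacing rules above. This is an illustrative sketch only: the instance and action types, and the exact way the three rules interact, are assumptions rather than the controller's actual control strategy.

```go
package main

import (
	"fmt"
	"sort"
)

// instance is a hypothetical, reduced view of an Instance in a deployment.
type instance struct {
	Ordinal      int
	TemplateHash string
	Ready        bool
	Deleting     bool // DeletionTimestamp is set
}

// action is the single step emitted per reconcile.
type action struct {
	Kind    string // "delete" or "wait"
	Ordinal int
}

// nextRollAction returns at most one action: wait while any instance is
// still terminating, otherwise delete the highest-ordinal Ready instance
// whose template hash has drifted from the desired one.
func nextRollAction(instances []instance, desiredHash string) (action, bool) {
	sorted := append([]instance(nil), instances...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Ordinal > sorted[j].Ordinal })
	for _, in := range sorted {
		if in.Deleting {
			return action{Kind: "wait", Ordinal: in.Ordinal}, true
		}
	}
	for _, in := range sorted {
		if in.Ready && in.TemplateHash != desiredHash {
			return action{Kind: "delete", Ordinal: in.Ordinal}, true
		}
	}
	return action{}, false
}

func main() {
	fleet := []instance{
		{Ordinal: 0, TemplateHash: "old", Ready: true},
		{Ordinal: 1, TemplateHash: "old", Ready: true},
	}
	fmt.Println(nextRollAction(fleet, "new"))
}
```

The deleted slot is expected to be refilled by the ordinary create path on a later reconcile, which is why no explicit "create" action appears here.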

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Instance "Running" status condition is renamed to "Available" (wire
value "Available"). An instance can be available while not actively
running a pod (e.g. scaled to zero), so "Running" was misleading as a
serving/health signal.

Renamed constants:
  InstanceRunning                -> InstanceAvailable               ("Available")
  InstanceReadyReasonRunning     -> InstanceReadyReasonAvailable    ("Available")
  InstanceRunningReasonRunning   -> InstanceAvailableReasonAvailable ("Available")
  InstanceRunningReasonStopped   -> InstanceAvailableReasonStopped
  InstanceRunningReasonStarting  -> InstanceAvailableReasonStarting
  InstanceRunningReasonStopping  -> InstanceAvailableReasonStopping

BREAKING CHANGE: the on-the-wire Instance condition type changes from
"Running" to "Available". Consumers reading conditions[type=="Running"]
must switch to "Available". Existing Instances self-heal on the next
provider reconcile (the provider re-asserts the condition under its new
name); the stale "Running" entry lingers cosmetically until then and is
no longer read by the Ready derivation.
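
For consumers, the switch might look like the sketch below. The condition type is a hypothetical reduction of the usual Kubernetes condition shape, and isAvailable is an illustrative helper, not part of this API.

```go
package main

import "fmt"

// condition is a hypothetical, reduced view of a status condition.
type condition struct {
	Type   string
	Status string
}

// isAvailable reads only the "Available" condition. A stale "Running"
// entry left over from before the rename is deliberately ignored.
func isAvailable(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "Available" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	conds := []condition{
		{Type: "Running", Status: "True"}, // stale entry, not yet cleaned up
		{Type: "Available", Status: "False"},
	}
	fmt.Println(isAvailable(conds))
}
```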

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eals

The instance controller is re-queued by a ResourceClaim watch when the
claim is granted, but that grant event lives on the project control plane
and can be missed (informer engagement races, watch relist gaps),
wedging the instance at QuotaGranted!=True indefinitely (observed: claim
Granted, instance stuck QuotaNoBudget until a manual reconcile cleared
it). The pending-quota path returned no RequeueAfter, so there was no
safety net.

Add a backing-off requeue while QuotaGranted is not True, anchored on the
condition's last transition:

  <60s : 1s     (catch a grant landing almost immediately)
  60s–5m : 15s
  5m–10m : 60s
  >=10m : 300s

Folded into the existing referenced-data requeue (soonest wins). The
ResourceClaim watch remains the fast path; this only guarantees a missed
grant self-heals instead of wedging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…roof

The pending-quota safety-net requeue was wired only at the tail of
Reconcile, so an early return during the pending window (a status-update
or upstream-writeback conflict) silently dropped it onto controller-
runtime's exponential error-backoff — which can stretch to minutes,
leaving an instance wedged at QuotaGranted!=True even though its
ResourceClaim was granted (observed: the 2nd instance in a rapid burst
consistently wedged).

- Compute the requeue once, up front, so every return path honors it.
- On a Conflict during the pending window, requeue at the bounded quota
  interval instead of returning the error (which would back off).
- Log the requeue decision (and conflict-driven requeues) so the path is
  observable: a re-firing requeue prints every pass while pending, a
  dropped one does not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… LTT

Observability revealed the safety-net requeue was firing every reconcile
but always at the slowest tier (300s): elapsed was measured from the
QuotaGranted condition's LastTransitionTime, which stays at the
1970-01-01 CRD default while quota is pending (PendingEvaluation and
NoBudget are both Unknown, so SetStatusCondition never bumps it). Result:
a watch-missed instance waited up to 5 minutes for the safety net instead
of ~1s, appearing wedged.

Anchor elapsed on instance.CreationTimestamp, which reflects actual wait
time, so the fast tiers (1s/15s) apply early as intended.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The instance controller emits Warning events on Instances (QuotaNoBudget,
ImageUnavailable, InstanceCrashing, ConfigurationError, NetworkFailedToCreate,
…) via the event recorder, but no RBAC rule granted it. Every write was
rejected — "events is forbidden: ... cannot create resource events in API
group \"\" in the namespace ns-<uid>" — so the user-facing signals explaining
why an instance is stuck never reached the Instance (kubectl describe /
activity timeline). Reconciliation was unaffected; this is an observability gap.

Add the kubebuilder marker and regenerate the role. The regen also syncs a
pre-existing work.karmada.io/resourcebindings rule (from an existing marker
that wasn't reflected in the committed role).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rvedGeneration

A restart/rolling update was invisible from the project plane: there was no
status field representing how many instances are on the new template revision.
Add UpdatedReplicas (instances whose observed template hash matches the desired
template, regardless of readiness) and ObservedGeneration to both
WorkloadDeployment and Workload (plus placement) status.

UpdatedReplicas is computed on the cell WD reconcile alongside CurrentReplicas
(which is now its Programmed subset), aggregated up into the Workload, and rides
the existing status sync to the project plane. Repoint the "Up-to-date"
printcolumn to .status.updatedReplicas to match `kubectl get deployment`
semantics, so a roll is visible as the count dips below Replicas and recovers.
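
The counting rule can be sketched as below. The instance and counts types are hypothetical reductions; only the definitions of UpdatedReplicas and CurrentReplicas are taken from this message.

```go
package main

import "fmt"

// instance is a hypothetical, reduced view of an Instance on the cell.
type instance struct {
	ObservedTemplateHash string
	Programmed           bool
}

// counts holds the two status fields discussed above.
type counts struct {
	UpdatedReplicas int32
	CurrentReplicas int32
}

// countReplicas counts instances on the desired template regardless of
// readiness (UpdatedReplicas) and the Programmed subset of those
// (CurrentReplicas).
func countReplicas(instances []instance, desiredHash string) counts {
	var c counts
	for _, in := range instances {
		if in.ObservedTemplateHash != desiredHash {
			continue
		}
		c.UpdatedReplicas++
		if in.Programmed {
			c.CurrentReplicas++
		}
	}
	return c
}

func main() {
	midRoll := []instance{
		{ObservedTemplateHash: "new", Programmed: true},
		{ObservedTemplateHash: "new", Programmed: false},
		{ObservedTemplateHash: "old", Programmed: true},
	}
	fmt.Printf("%+v\n", countReplicas(midRoll, "new"))
}
```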

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…emory

Two Instance-controller correctness changes:

- Blocking-reason rollup: surface the most specific provider sub-condition
  (ImageUnavailable, InstanceCrashing, ConfigurationError, Provisioning) and its
  message onto the Instance Ready condition instead of a generic "Instance has
  not been programmed", so e.g. an image-pull failure reads as ImageUnavailable
  with the real message. Adds the reason constants and ranks them in the
  blocking-reason priority.

- Quota sizing: resolve vCPU/memory for instanceType-sized instances from a new
  instanceTypeCatalog (datumcloud/d1-standard-2 = 1 vCPU / 2 GiB) so the quota
  ResourceClaim requests vcpus + memory, not just instance count. Explicit
  container limits / instance requests still take precedence.
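
The sizing precedence can be sketched as a three-step fallback. The resources type, its units (millicores and bytes), and resolveResources are assumptions for illustration; the single catalog entry is the one named above.

```go
package main

import "fmt"

// resources is a hypothetical sizing pair; the units are assumed.
type resources struct {
	MilliCPU    int64
	MemoryBytes int64
}

// isSet reports whether any explicit sizing was provided.
func (r resources) isSet() bool {
	return r.MilliCPU > 0 || r.MemoryBytes > 0
}

// instanceTypeCatalog holds the one entry named in this message:
// datumcloud/d1-standard-2 = 1 vCPU / 2 GiB.
var instanceTypeCatalog = map[string]resources{
	"datumcloud/d1-standard-2": {MilliCPU: 1000, MemoryBytes: 2 << 30},
}

// resolveResources applies the precedence limits, then requests, then the
// instanceType catalog. The bool is false when nothing could be resolved.
func resolveResources(limits, requests resources, instanceType string) (resources, bool) {
	if limits.isSet() {
		return limits, true
	}
	if requests.isSet() {
		return requests, true
	}
	r, ok := instanceTypeCatalog[instanceType]
	return r, ok
}

func main() {
	fmt.Println(resolveResources(resources{}, resources{}, "datumcloud/d1-standard-2"))
	fmt.Println(resolveResources(resources{MilliCPU: 500}, resources{}, "datumcloud/d1-standard-2"))
}
```

When nothing resolves, a caller could fall back to claiming instance count only, which matches the behavior described before this change.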

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… tests

Make the cherry-picked instanceType-sizing and blocking-reason tests
lint-clean: hoist the repeated "datumcloud/d1-standard-2", "app", and
"test/image:latest" literals into named constants (goconst) and apply
gofmt. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from a063669 to 110778d Compare June 5, 2026 16:46
@scotwells scotwells force-pushed the split/federation-core-bundled branch from 1d0fca7 to a67b32c Compare June 5, 2026 16:46
@scotwells scotwells merged commit a67b32c into feat/federated-deployment-scheduling Jun 5, 2026
9 checks passed
@scotwells scotwells deleted the split/federation-core-bundled branch June 5, 2026 17:31