Instances say why they can't start: referenced-data vs quota vs runtime error #143

Open
scotwells wants to merge 7 commits into feat/configmap-secret-mounts-federated from split/refdata-blocking-reason

@scotwells scotwells commented Jun 4, 2026

Value

When an Instance or WorkloadDeployment can't start, it now tells the user why — distinguishing "waiting on referenced data" from "waiting on quota" from a hard runtime error — instead of a generic not-ready. The failure mode becomes actionable: a user can see at a glance whether they need to grant quota, fix a missing/oversized/unauthorized referenced object, or wait for delivery to finish.

This builds on the referenced-data delivery in the parent PR (#129).

What

  • Adds referenced-data blocking-reason constants and surfaces a specific reason on Instance.Ready, WorkloadDeployment.Available, and (rolled up) Workload.Available.
  • Extends the foundation's existing instanceBlockingReasonPriority (it is not redefined) with the referenced-data tiers: transient resolving/awaiting-propagation/not-ready rank with startup reasons; terminal source-not-found/too-large/unauthorized rank with hard runtime errors.
  • Resolves quota-vs-referenced-data priority when both gates are pending, and reads the hub→cell terminal-error annotation (from the core PR) in selectWDBlockingCondition.
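The two referenced-data tiers described above can be pictured as a small classification step. This is a minimal sketch, not the PR's code: the `tier` type, the function name `referencedDataTier`, and the transient reason strings `Resolving` and `NotReady` are assumptions made for illustration. Only the terminal reason names and `AwaitingPropagation` appear in the commit messages.

```go
package main

import "fmt"

// tier is a coarse ranking bucket; a higher tier is surfaced first.
// The type and its values are hypothetical, not taken from the PR.
type tier int

const (
	tierNone      tier = iota // not a referenced-data reason
	tierStartup               // transient: expected to clear without user action
	tierHardError             // terminal: the user has to fix the referenced object
)

// referencedDataTier buckets a referenced-data reason string.
// "Resolving" and "NotReady" are assumed spellings of the transient reasons.
func referencedDataTier(reason string) tier {
	switch reason {
	case "SourceNotFound", "SourceTooLarge", "SourceUnauthorized":
		return tierHardError
	case "Resolving", "AwaitingPropagation", "NotReady":
		return tierStartup
	default:
		return tierNone
	}
}

func main() {
	for _, r := range []string{"AwaitingPropagation", "SourceTooLarge", "SomethingElse"} {
		fmt.Printf("%s -> tier %d\n", r, referencedDataTier(r))
	}
}
```

Grouping reasons into tiers before comparing them keeps the "wait" versus "fix something" distinction explicit, which is the distinction the PR wants users to see.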

Reviewer attention

Two priority tables exist by design — the Instance-side table (extended foundation table) and the WD-side wdBlockingReasonPriority. The QuotaVsReferencedData test asserts a terminal refdata reason outranks pending-quota; confirm that ordering matches intended UX.

Stack

Stacks on the referenced-data core PR (#129). go build/vet/test/golangci-lint green at tip.

🤖 Generated with Claude Code

@scotwells scotwells changed the title feat: surface specific blocking reasons for referenced-data and quota gates Instances say why they can't start: referenced-data vs quota vs runtime error Jun 4, 2026
@scotwells scotwells force-pushed the feat/configmap-secret-mounts-federated branch from e420096 to b11ab65 Compare June 5, 2026 01:42
@scotwells scotwells force-pushed the split/refdata-blocking-reason branch from b4714a8 to b244bcb Compare June 5, 2026 01:42
@scotwells scotwells force-pushed the feat/configmap-secret-mounts-federated branch from b11ab65 to e0df8b4 Compare June 5, 2026 15:13
@scotwells scotwells force-pushed the split/refdata-blocking-reason branch from b244bcb to 758a87f Compare June 5, 2026 15:13
@scotwells scotwells force-pushed the feat/configmap-secret-mounts-federated branch from e0df8b4 to 5ec3c00 Compare June 5, 2026 15:25
@scotwells scotwells force-pushed the split/refdata-blocking-reason branch from 758a87f to 90721bf Compare June 5, 2026 15:25
@scotwells scotwells force-pushed the feat/configmap-secret-mounts-federated branch from 5ec3c00 to bc253fd Compare June 5, 2026 16:19
@scotwells scotwells force-pushed the split/refdata-blocking-reason branch from 90721bf to 14299a1 Compare June 5, 2026 16:19
@scotwells scotwells force-pushed the feat/configmap-secret-mounts-federated branch from bc253fd to 643da4d Compare June 5, 2026 16:46
@scotwells scotwells force-pushed the split/refdata-blocking-reason branch from 14299a1 to 7598e21 Compare June 5, 2026 16:46
@scotwells scotwells force-pushed the feat/configmap-secret-mounts-federated branch from 643da4d to 4ef5f5c Compare June 5, 2026 17:49
@scotwells scotwells force-pushed the split/refdata-blocking-reason branch from 7598e21 to 0b5a5c5 Compare June 5, 2026 17:49
scotwells and others added 7 commits June 5, 2026 13:36
Adds a new const block to api/v1alpha/instance_types.go with the
reason constants for the top-level readiness conditions
(Instance.Ready, WorkloadDeployment.Available, Workload.Available):

  WorkloadReasonNetworkNotFound
  WorkloadDeploymentReasonNetworkProvisioning   (replaces "ProvisioningNetwork")
  WorkloadDeploymentReasonInstancesProvisioning (replaces "ProvisioningInstances")
  WorkloadDeploymentReasonStableInstanceFound
  WorkloadDeploymentReasonReferencedDataNotReady (new)
  WorkloadDeploymentReasonQuotaNotGranted        (new)
  WorkloadReasonNoAvailablePlacements
  WorkloadReasonNoAvailableDeployments

Reason-string renames (deliberate, approved):
  "ProvisioningNetwork"   → "NetworkProvisioning"
  "ProvisioningInstances" → "InstancesProvisioning"

These renames align the emitted strings with the RFC-agreed vocabulary.
No client currently consumes these conditions; the rename is safe.

Replaces all inline string literals in workload_controller.go and
workloaddeployment_controller.go with the new named constants. No
behavior change; logic wiring happens in subsequent commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 8403d3e)
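A rough sketch of what replacing inline literals with named constants looks like. The two renamed strings come from the commit message; the values given to the two new constants and the helper `provisioningReason` are assumptions for illustration and may not match `api/v1alpha/instance_types.go`.

```go
package main

import "fmt"

// Reason constants for WorkloadDeployment.Available.
const (
	// Renamed strings, as stated in the commit message.
	WorkloadDeploymentReasonNetworkProvisioning   = "NetworkProvisioning"   // was "ProvisioningNetwork"
	WorkloadDeploymentReasonInstancesProvisioning = "InstancesProvisioning" // was "ProvisioningInstances"

	// New constants; the string values here are assumed, not confirmed.
	WorkloadDeploymentReasonReferencedDataNotReady = "ReferencedDataNotReady"
	WorkloadDeploymentReasonQuotaNotGranted        = "QuotaNotGranted"
)

// provisioningReason is a hypothetical helper showing a call site that
// returns a named constant where an inline string literal used to be.
func provisioningReason(networkReady bool) string {
	if !networkReady {
		return WorkloadDeploymentReasonNetworkProvisioning
	}
	return WorkloadDeploymentReasonInstancesProvisioning
}

func main() {
	fmt.Println(provisioningReason(false))
	fmt.Println(provisioningReason(true))
}
```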
Implements the evaluate-all-then-pick logic in reconcileInstanceReadyCondition
so that the most actionable blocking cause is surfaced on Instance.Ready
instead of always collapsing to SchedulingGatesPresent.

Changes:
- reconcileReferencedDataCondition: when the owning WD carries a terminal
  ReferencedDataReady reason (SourceNotFound, SourceUnauthorized,
  SourceTooLarge), the Instance inherits the WD's reason+message verbatim.
  The companion will never arrive for a terminally missing source, so the
  WD's authoritative resolver verdict supersedes the cell-side "waiting for
  propagation" message. Zero extra API calls (WD already fetched).

- reconcileInstanceReadyCondition (scheduling-gates branch): evaluates ALL
  blocking sub-conditions (ReferencedDataReady, network failure) before
  selecting the winner via instanceBlockingReasonPriority. The previous
  code short-circuited on the first match, which could hide a
  higher-priority error behind a lower-priority one.

- isTerminalReferencedDataReason: helper predicate for the three terminal
  referenced-data reasons.

- instanceBlockingReasonPriority: private priority function implementing
  RFC §5.4 table. Duplicate of wdBlockingReasonPriority (intentional per
  RFC — avoids coupling the two controller packages).

Adds unit tests:
  TestReconcileInstanceReadyCondition_ReferencedDataEnrichment
  TestReconcileInstanceReadyCondition_EvaluateAllThenPick
  TestInstanceBlockingReasonPriority

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 3c075cf)
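The evaluate-all-then-pick idea in the commit above can be sketched as follows. This is an illustration under assumptions, not the controller's implementation: the `blocker` type, `pickBlocker`, and the priority values other than 5 for terminal reasons are hypothetical. The `consider` closure mirrors the name used in the commit messages, but its real signature is not shown in the PR.

```go
package main

import "fmt"

// blocker is one candidate cause for Ready=False (hypothetical type).
type blocker struct {
	Reason  string
	Message string
}

// priority stands in for instanceBlockingReasonPriority. Only the value 5
// for terminal referenced-data reasons comes from the commit messages;
// the other values are assumptions chosen to keep the ordering plausible.
func priority(reason string) int {
	switch reason {
	case "SourceNotFound", "SourceUnauthorized", "SourceTooLarge":
		return 5
	case "AwaitingPropagation":
		return 2 // assumed
	case "SchedulingGatesPresent":
		return 1 // assumed
	default:
		return 0
	}
}

// pickBlocker looks at every candidate before deciding, so a low-priority
// cause seen first cannot hide a higher-priority one seen later.
// On a tie the earlier candidate is kept.
func pickBlocker(fallback blocker, candidates ...blocker) blocker {
	best := fallback
	consider := func(b blocker) {
		if priority(b.Reason) > priority(best.Reason) {
			best = b
		}
	}
	for _, c := range candidates {
		consider(c)
	}
	return best
}

func main() {
	got := pickBlocker(
		blocker{Reason: "SchedulingGatesPresent", Message: "scheduling gates present"},
		blocker{Reason: "AwaitingPropagation", Message: "waiting for propagation"},
		blocker{Reason: "SourceNotFound", Message: "referenced object not found"},
	)
	fmt.Println(got.Reason, "-", got.Message)
}
```

A short-circuiting version would have returned at `AwaitingPropagation`; collecting first and ranking afterwards is what lets the terminal reason win.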
…g reason

Implements evaluate-all-then-pick logic for the WorkloadDeployment
Available condition when readyReplicas == 0. The previous code used a
short-circuiting if/else that let network-not-ready hide higher-priority
referenced-data errors.

Changes in workloaddeployment_controller.go:
- The Available condition assignment block is replaced with a call to
  selectWDBlockingCondition, which evaluates all blocking causes
  (NetworkProvisioning, ReferencedDataNotReady, QuotaNotGranted,
  InstancesProvisioning) and applies wdBlockingReasonPriority to select
  the winner.
- All Available conditions now carry ObservedGeneration set to
  deployment.Generation (previously unset).
- wdBlockingReasonPriority: private priority function implementing RFC §5.4.

Key test added (TestWDAvailableCondition_NetworkProvisioningVsReferencedData):
verifies that ReferencedDataNotReady (priority 4) beats NetworkProvisioning
(priority 2) even when network is not yet ready — the old short-circuit would
have returned NetworkProvisioning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit a75b33e)
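A minimal sketch of the WorkloadDeployment selection described above, including the `ObservedGeneration` stamp. The `condition` and `wdState` types and the function names are hypothetical stand-ins. Priorities 4 (`ReferencedDataNotReady`) and 2 (`NetworkProvisioning`) come from the commit message; the values for `QuotaNotGranted` and `InstancesProvisioning` are assumed.

```go
package main

import "fmt"

// condition mirrors the few status-condition fields this sketch needs.
type condition struct {
	Type               string
	Status             string
	Reason             string
	ObservedGeneration int64
}

// wdState is a hypothetical summary of what the reconciler observed.
type wdState struct {
	Generation          int64
	NetworkReady        bool
	ReferencedDataReady bool
	QuotaGranted        bool
}

// wdPriority stands in for wdBlockingReasonPriority.
func wdPriority(reason string) int {
	switch reason {
	case "ReferencedDataNotReady":
		return 4 // from the commit message
	case "QuotaNotGranted":
		return 3 // assumed
	case "NetworkProvisioning":
		return 2 // from the commit message
	case "InstancesProvisioning":
		return 1 // assumed
	default:
		return 0
	}
}

// selectBlockingCondition evaluates every blocking cause, keeps the
// highest-priority one, and stamps the condition with the generation.
func selectBlockingCondition(s wdState) condition {
	best := "InstancesProvisioning" // used when no other cause blocks
	consider := func(blocked bool, reason string) {
		if blocked && wdPriority(reason) > wdPriority(best) {
			best = reason
		}
	}
	consider(!s.NetworkReady, "NetworkProvisioning")
	consider(!s.ReferencedDataReady, "ReferencedDataNotReady")
	consider(!s.QuotaGranted, "QuotaNotGranted")
	return condition{
		Type:               "Available",
		Status:             "False",
		Reason:             best,
		ObservedGeneration: s.Generation,
	}
}

func main() {
	c := selectBlockingCondition(wdState{Generation: 7, QuotaGranted: true})
	fmt.Printf("%s=%s reason=%s observedGeneration=%d\n",
		c.Type, c.Status, c.Reason, c.ObservedGeneration)
}
```

In `main` both the network and the referenced data are not ready, which is the case the old if/else reported as `NetworkProvisioning`.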
…ailable

Modifies reconcileWorkloadStatus to propagate the highest-priority
blocking reason from non-available WorkloadDeployments up to
Workload.Available, rather than always collapsing to the boolean
NoAvailablePlacements.

Changes in workload_controller.go:
- Iterates all deployments and tracks the worst blocking reason via
  workloadBlockingReasonPriority (RFC §5.4 table).
- Sorts placement names and deployments by name before iteration so the
  tie-break between equal-priority blockers is deterministic (lex-first
  deployment name wins — resolves RFC §12 open question #3).
- Workload.Available now carries ObservedGeneration set to
  workload.Generation (previously unset, RFC §6 requirement).
- workloadBlockingReasonPriority: private priority function, independently
  defined from the WD controller per RFC §5.4.

Creates internal/controller/workload_controller_test.go (new file) with:
  TestReconcileWorkloadStatus_AllDeploymentsSameReason
  TestReconcileWorkloadStatus_MixedReasons
  TestReconcileWorkloadStatus_OneAvailableDeployment
  TestReconcileWorkloadStatus_NoDeployments
  TestReconcileWorkloadStatus_TiebreakerByName
  TestReconcileWorkloadStatus_ObservedGeneration
  TestWorkloadBlockingReasonPriority (exhaustive priority table)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 0abaed2)
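The deterministic tie-break can be illustrated with a short sketch: sort by name first, then replace the current winner only on a strictly higher priority, so the lexicographically first deployment wins among equals. The `deployment` type, `worstBlocker`, and the priority values other than 5 and 2 are assumptions, not the controller's code.

```go
package main

import (
	"fmt"
	"sort"
)

// deployment is a hypothetical summary of one WorkloadDeployment.
type deployment struct {
	Name      string
	Available bool
	Reason    string // blocking reason when Available is false
}

// workloadPriority stands in for workloadBlockingReasonPriority.
func workloadPriority(reason string) int {
	switch reason {
	case "SourceNotFound", "SourceUnauthorized", "SourceTooLarge":
		return 5
	case "NetworkProvisioning":
		return 2
	default:
		return 1 // assumed floor for any other blocking reason
	}
}

// worstBlocker returns the highest-priority blocking reason among
// non-available deployments and the deployment it came from.
// ok is false when no deployment is blocked.
func worstBlocker(deps []deployment) (reason, name string, ok bool) {
	sorted := append([]deployment(nil), deps...) // do not reorder the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	best := -1
	for _, d := range sorted {
		if d.Available {
			continue
		}
		// Strict ">" keeps the earlier (lex-first) name on equal priority.
		if p := workloadPriority(d.Reason); p > best {
			best, reason, name = p, d.Reason, d.Name
		}
	}
	return reason, name, best >= 0
}

func main() {
	reason, name, ok := worstBlocker([]deployment{
		{Name: "wd-b", Reason: "SourceNotFound"},
		{Name: "wd-c", Reason: "NetworkProvisioning"},
		{Name: "wd-a", Reason: "SourceTooLarge"},
	})
	fmt.Println(reason, name, ok)
}
```

Without the sort, the winner between `wd-a` and `wd-b` would depend on list order, and the surfaced reason could flip between reconciles.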
When QuotaGranted=False and scheduling gates are present, the previous
code early-returned Ready=False/PendingQuota before the evaluate-all-then-
pick block could run. This meant SourceNotFound (priority 5) was masked by
PendingQuota (priority 3) — the same class of short-circuit bug the
evaluate-all redesign was meant to eliminate.

Fix: when scheduling gates are present, quota is fed into consider() like
any other blocking cause so instanceBlockingReasonPriority picks the winner.
The Programmed=False and Running=False side effects of quota denial are
preserved unconditionally regardless of which reason wins Ready — they
reflect quota state independently.

The quota early-return is retained only for the no-gates case, where quota
is the sole active blocker and the three-condition atomic write is correct.

The scheduling-gates evaluation block is extracted into
reconcileGatedReadyCondition to keep reconcileInstanceReadyCondition within
the project's cyclomatic-complexity lint limit (gocyclo ≤ 30).

Adds TestReconcileInstanceReadyCondition_QuotaVsReferencedData (RFC §8.1
headline case): QuotaGranted=False/QuotaExceeded + ReferencedDataReady=
False/SourceNotFound → Ready=False/SourceNotFound (priority 5 > 3), with
Programmed=False/Running=False still set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit e4d419b)
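The quota fix can be sketched like this: when scheduling gates are present, quota becomes one more candidate for `Ready`, while its side effects are applied no matter which reason wins. The `gatedResult` type and `gatedReady` function are hypothetical. Priorities 5 and 3 come from the commit message; the value for `SchedulingGatesPresent` is assumed.

```go
package main

import "fmt"

// gatedResult is a hypothetical summary of the reconcile outcome.
type gatedResult struct {
	ReadyReason string
	// QuotaDenied records that Programmed=False and Running=False
	// were also written, independent of the Ready reason.
	QuotaDenied bool
}

func priority(reason string) int {
	switch reason {
	case "SourceNotFound":
		return 5 // from the commit message
	case "PendingQuota":
		return 3 // from the commit message
	case "SchedulingGatesPresent":
		return 1 // assumed
	default:
		return 0
	}
}

// gatedReady handles the scheduling-gates case. refDataReason is empty
// when referenced data is not blocking.
func gatedReady(quotaGranted bool, refDataReason string) gatedResult {
	res := gatedResult{ReadyReason: "SchedulingGatesPresent"}
	consider := func(reason string) {
		if reason != "" && priority(reason) > priority(res.ReadyReason) {
			res.ReadyReason = reason
		}
	}
	if !quotaGranted {
		consider("PendingQuota") // no early return: quota competes like any other cause
		res.QuotaDenied = true   // side effects apply whichever reason wins Ready
	}
	consider(refDataReason)
	return res
}

func main() {
	fmt.Printf("%+v\n", gatedReady(false, "SourceNotFound"))
	fmt.Printf("%+v\n", gatedReady(false, ""))
}
```

Keeping the side effects outside the ranking is what lets `Ready` report `SourceNotFound` while `Programmed` and `Running` still reflect the quota denial.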
…ngCondition

In federated topology the cell WD never receives the ReferencedDataReady
status condition written by the hub-side resolver (Karmada status aggregation
is cell→hub only). The ReferencedDataErrorAnnotation written by #38 already
bridges terminal errors hub→cell via ObjectMeta propagation; this commit
teaches the cell WD reconciler to read it.

selectWDBlockingCondition now checks deployment.Annotations for the terminal
error annotation after the existing status-condition path. When present and
parseable, decodeTerminalError (same package) returns the raw terminal reason
(SourceNotFound / SourceUnauthorized / SourceTooLarge, all priority 5) which
feeds directly into the existing consider() priority-ranked selection. The
annotation path is evaluated before the propagation-lag check so a terminal
annotation wins over the AwaitingPropagation reason at the same bucket.

No changes to the federator merge logic or Workload controller are needed:
once the cell WD Available carries the correct reason, Karmada statusAggregation
carries it hub-ward and syncStatusFromDownstream copies it to the project WD
as-is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit ec668d9)
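A sketch of the annotation path, under loud assumptions: the annotation key, the JSON payload shape, and every function name below are invented for illustration. The PR names `ReferencedDataErrorAnnotation` and `decodeTerminalError` but does not show the key, the encoding, or the signatures.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical key; the real one is defined in the core PR.
const referencedDataErrorAnnotation = "example.com/referenced-data-error"

// terminalError is an assumed payload shape for the annotation value.
type terminalError struct {
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

func isTerminalReason(reason string) bool {
	switch reason {
	case "SourceNotFound", "SourceUnauthorized", "SourceTooLarge":
		return true
	}
	return false
}

// terminalFromAnnotations returns the terminal error carried on the object,
// or false when the annotation is absent, unparseable, or not terminal.
func terminalFromAnnotations(annotations map[string]string) (terminalError, bool) {
	raw, found := annotations[referencedDataErrorAnnotation]
	if !found {
		return terminalError{}, false
	}
	var te terminalError
	if err := json.Unmarshal([]byte(raw), &te); err != nil || !isTerminalReason(te.Reason) {
		return terminalError{}, false
	}
	return te, true
}

// blockingReason checks the annotation before the propagation-lag case,
// so a terminal error wins over AwaitingPropagation.
func blockingReason(annotations map[string]string, awaitingPropagation bool) string {
	if te, ok := terminalFromAnnotations(annotations); ok {
		return te.Reason
	}
	if awaitingPropagation {
		return "AwaitingPropagation"
	}
	return ""
}

func main() {
	ann := map[string]string{
		referencedDataErrorAnnotation: `{"reason":"SourceNotFound","message":"referenced object not found"}`,
	}
	fmt.Println(blockingReason(ann, true))
}
```

Treating a malformed annotation as absent keeps the reconciler on the existing status-condition and propagation-lag paths instead of failing outright; whether the PR does the same is not stated.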
Replace literal "test-workload" occurrences with the existing
rdTestWorkloadName constant so goconst no longer flags them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@scotwells scotwells force-pushed the feat/configmap-secret-mounts-federated branch from 4ef5f5c to 646124c Compare June 5, 2026 18:38
@scotwells scotwells force-pushed the split/refdata-blocking-reason branch from 0b5a5c5 to ada425d Compare June 5, 2026 18:38