Instances say why they can't start: referenced-data vs quota vs runtime error #143
Open
scotwells wants to merge 7 commits into
Adds a new const block to api/v1alpha/instance_types.go with the reason constants for the top-level readiness conditions (Instance.Ready, WorkloadDeployment.Available, Workload.Available):

- WorkloadReasonNetworkNotFound
- WorkloadDeploymentReasonNetworkProvisioning (replaces "ProvisioningNetwork")
- WorkloadDeploymentReasonInstancesProvisioning (replaces "ProvisioningInstances")
- WorkloadDeploymentReasonStableInstanceFound
- WorkloadDeploymentReasonReferencedDataNotReady (new)
- WorkloadDeploymentReasonQuotaNotGranted (new)
- WorkloadReasonNoAvailablePlacements
- WorkloadReasonNoAvailableDeployments

Reason-string renames (deliberate, approved):

- "ProvisioningNetwork" → "NetworkProvisioning"
- "ProvisioningInstances" → "InstancesProvisioning"

These renames align the emitted strings with the RFC-agreed vocabulary. No client currently consumes these conditions; the rename is safe.

Replaces all inline string literals in workload_controller.go and workloaddeployment_controller.go with the new named constants. No behavior change; logic wiring happens in subsequent commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 8403d3e)
Implements the evaluate-all-then-pick logic in reconcileInstanceReadyCondition so that the most actionable blocking cause is surfaced on Instance.Ready instead of always collapsing to SchedulingGatesPresent.

Changes:

- reconcileReferencedDataCondition: when the owning WD carries a terminal ReferencedDataReady reason (SourceNotFound, SourceUnauthorized, SourceTooLarge), the Instance inherits the WD's reason+message verbatim. The companion will never arrive for a terminally missing source, so the WD's authoritative resolver verdict supersedes the cell-side "waiting for propagation" message. Zero extra API calls (WD already fetched).
- reconcileInstanceReadyCondition (scheduling-gates branch): evaluates ALL blocking sub-conditions (ReferencedDataReady, network failure) before selecting the winner via instanceBlockingReasonPriority. The previous code short-circuited on the first match, which could hide a higher-priority error behind a lower-priority one.
- isTerminalReferencedDataReason: helper predicate for the three terminal referenced-data reasons.
- instanceBlockingReasonPriority: private priority function implementing the RFC §5.4 table. Duplicate of wdBlockingReasonPriority (intentional per RFC; avoids coupling the two controller packages).

Adds unit tests:

- TestReconcileInstanceReadyCondition_ReferencedDataEnrichment
- TestReconcileInstanceReadyCondition_EvaluateAllThenPick
- TestInstanceBlockingReasonPriority

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 3c075cf)
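The inheritance rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Condition` struct is a pared-down stand-in for the Kubernetes condition type, `inheritFromDeployment` is a hypothetical helper name, the example message is made up, and the body of `isTerminalReferencedDataReason` is a guess based only on the three reason strings named in the commit message.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition,
// used here so the sketch compiles without external dependencies.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// isTerminalReferencedDataReason reports whether a reason can never resolve
// on its own. The three strings come from the commit message; the body is
// an assumption.
func isTerminalReferencedDataReason(reason string) bool {
	switch reason {
	case "SourceNotFound", "SourceUnauthorized", "SourceTooLarge":
		return true
	}
	return false
}

// inheritFromDeployment (hypothetical name) returns the condition the
// Instance should report. A terminal reason on the owning deployment is
// copied verbatim; otherwise the locally computed condition is kept.
func inheritFromDeployment(local Condition, wd *Condition) Condition {
	if wd != nil && wd.Status == "False" && isTerminalReferencedDataReason(wd.Reason) {
		local.Status = "False"
		local.Reason = wd.Reason
		local.Message = wd.Message
	}
	return local
}

func main() {
	local := Condition{Type: "ReferencedDataReady", Status: "False",
		Reason: "AwaitingPropagation", Message: "waiting for propagation"}
	wd := &Condition{Type: "ReferencedDataReady", Status: "False",
		Reason: "SourceNotFound", Message: "referenced source does not exist"}

	got := inheritFromDeployment(local, wd)
	fmt.Println(got.Reason, "-", got.Message)
}
```

Because the deployment condition is passed in rather than fetched, the sketch mirrors the "zero extra API calls" property: the caller already holds the WD.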
…g reason

Implements evaluate-all-then-pick logic for the WorkloadDeployment Available condition when readyReplicas == 0. The previous code used a short-circuiting if/else that let network-not-ready hide higher-priority referenced-data errors.

Changes in workloaddeployment_controller.go:

- The Available condition assignment block is replaced with a call to selectWDBlockingCondition, which evaluates all blocking causes (NetworkProvisioning, ReferencedDataNotReady, QuotaNotGranted, InstancesProvisioning) and applies wdBlockingReasonPriority to select the winner.
- All Available conditions now carry ObservedGeneration set to deployment.Generation (previously unset).
- wdBlockingReasonPriority: private priority function implementing RFC §5.4.

Key test added (TestWDAvailableCondition_NetworkProvisioningVsReferencedData): verifies that ReferencedDataNotReady (priority 4) beats NetworkProvisioning (priority 2) even when network is not yet ready; the old short-circuit would have returned NetworkProvisioning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit a75b33e)
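A minimal sketch of the evaluate-all-then-pick pattern, assuming a simplified shape for the selection function. Only the priorities 4 (ReferencedDataNotReady) and 2 (NetworkProvisioning) are stated in the commit message; the values for QuotaNotGranted and InstancesProvisioning, the `priorityOf` and `selectBlocking` names, the boolean inputs, and all messages are assumptions made for illustration and may differ from the RFC §5.4 table.

```go
package main

import "fmt"

// blocker is one reason the deployment cannot become Available.
type blocker struct {
	Reason  string
	Message string
}

// priorityOf is a stand-in for the PR's priority function. Higher wins.
func priorityOf(reason string) int {
	switch reason {
	case "ReferencedDataNotReady":
		return 4 // stated in the commit message
	case "QuotaNotGranted":
		return 3 // assumption, not stated in the source
	case "NetworkProvisioning":
		return 2 // stated in the commit message
	case "InstancesProvisioning":
		return 1 // assumption, not stated in the source
	}
	return 0
}

// selectBlocking evaluates every blocking cause and keeps the one with the
// highest priority, instead of returning on the first match.
func selectBlocking(networkReady, referencedDataReady, quotaGranted bool) blocker {
	// Fallback when nothing more specific is blocking.
	best := blocker{Reason: "InstancesProvisioning", Message: "instances are being provisioned"}

	consider := func(candidate blocker) {
		if priorityOf(candidate.Reason) > priorityOf(best.Reason) {
			best = candidate
		}
	}

	if !networkReady {
		consider(blocker{Reason: "NetworkProvisioning", Message: "network is being provisioned"})
	}
	if !referencedDataReady {
		consider(blocker{Reason: "ReferencedDataNotReady", Message: "referenced data is not ready"})
	}
	if !quotaGranted {
		consider(blocker{Reason: "QuotaNotGranted", Message: "quota has not been granted"})
	}
	return best
}

func main() {
	// Network and referenced data are both not ready; quota is granted.
	got := selectBlocking(false, false, true)
	fmt.Println(got.Reason) // prints "ReferencedDataNotReady"
}
```

A short-circuiting if/else that checks the network first would report NetworkProvisioning for the same inputs, which is the masking behavior this commit removes.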
…ailable

Modifies reconcileWorkloadStatus to propagate the highest-priority blocking reason from non-available WorkloadDeployments up to Workload.Available, rather than always collapsing to the boolean NoAvailablePlacements.

Changes in workload_controller.go:

- Iterates all deployments and tracks the worst blocking reason via workloadBlockingReasonPriority (RFC §5.4 table).
- Sorts placement names and deployments by name before iteration so the tie-break between equal-priority blockers is deterministic (lex-first deployment name wins; resolves RFC §12 open question #3).
- Workload.Available now carries ObservedGeneration set to workload.Generation (previously unset, RFC §6 requirement).
- workloadBlockingReasonPriority: private priority function, independently defined from the WD controller per RFC §5.4.

Creates internal/controller/workload_controller_test.go (new file) with:

- TestReconcileWorkloadStatus_AllDeploymentsSameReason
- TestReconcileWorkloadStatus_MixedReasons
- TestReconcileWorkloadStatus_OneAvailableDeployment
- TestReconcileWorkloadStatus_NoDeployments
- TestReconcileWorkloadStatus_TiebreakerByName
- TestReconcileWorkloadStatus_ObservedGeneration
- TestWorkloadBlockingReasonPriority (exhaustive priority table)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 0abaed2)
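The deterministic tie-break can be illustrated with a small sketch. The `deployment` struct, the `worstBlocker` helper, and the `priorityOf` table are hypothetical simplifications; only the priority values 5, 4, and 2 appear in the commit messages. The key point is sorting by name first and using a strict greater-than comparison, so the lexically first deployment wins among equal priorities.

```go
package main

import (
	"fmt"
	"sort"
)

// deployment is a simplified view of a WorkloadDeployment's Available state.
type deployment struct {
	Name      string
	Available bool
	Reason    string
}

// priorityOf is a stand-in priority table; only these three values are
// taken from the commit messages.
func priorityOf(reason string) int {
	switch reason {
	case "SourceNotFound":
		return 5
	case "ReferencedDataNotReady":
		return 4
	case "NetworkProvisioning":
		return 2
	}
	return 0
}

// worstBlocker (hypothetical helper) returns the non-available deployment
// with the highest-priority reason. Ties go to the lexically first name.
func worstBlocker(deps []deployment) (deployment, bool) {
	// Sort a copy so the caller's slice order is left untouched.
	sorted := append([]deployment(nil), deps...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var best deployment
	found := false
	for _, d := range sorted {
		if d.Available {
			continue
		}
		// Strict ">" keeps the earlier (lex-first) deployment on a tie.
		if !found || priorityOf(d.Reason) > priorityOf(best.Reason) {
			best = d
			found = true
		}
	}
	return best, found
}

func main() {
	deps := []deployment{
		{Name: "us-west", Reason: "NetworkProvisioning"},
		{Name: "eu-central", Reason: "NetworkProvisioning"},
	}
	if d, ok := worstBlocker(deps); ok {
		fmt.Println(d.Name, d.Reason)
	}
}
```

Without the sort, the winner among equal-priority blockers would depend on list order, which can make the reported reason flap between reconciles.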
When QuotaGranted=False and scheduling gates are present, the previous code early-returned Ready=False/PendingQuota before the evaluate-all-then-pick block could run. This meant SourceNotFound (priority 5) was masked by PendingQuota (priority 3), the same class of short-circuit bug the evaluate-all redesign was meant to eliminate.

Fix: when scheduling gates are present, quota is fed into consider() like any other blocking cause so instanceBlockingReasonPriority picks the winner. The Programmed=False and Running=False side effects of quota denial are preserved unconditionally regardless of which reason wins Ready; they reflect quota state independently.

The quota early-return is retained only for the no-gates case, where quota is the sole active blocker and the three-condition atomic write is correct.

The scheduling-gates evaluation block is extracted into reconcileGatedReadyCondition to keep reconcileInstanceReadyCondition within the project's cyclomatic-complexity lint limit (gocyclo ≤ 30).

Adds TestReconcileInstanceReadyCondition_QuotaVsReferencedData (RFC §8.1 headline case): QuotaGranted=False/QuotaExceeded + ReferencedDataReady=False/SourceNotFound → Ready=False/SourceNotFound (priority 5 > 3), with Programmed=False/Running=False still set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit e4d419b)
…ngCondition

In federated topology the cell WD never receives the ReferencedDataReady status condition written by the hub-side resolver (Karmada status aggregation is cell→hub only). The ReferencedDataErrorAnnotation written by #38 already bridges terminal errors hub→cell via ObjectMeta propagation; this commit teaches the cell WD reconciler to read it.

selectWDBlockingCondition now checks deployment.Annotations for the terminal error annotation after the existing status-condition path. When present and parseable, decodeTerminalError (same package) returns the raw terminal reason (SourceNotFound / SourceUnauthorized / SourceTooLarge, all priority 5) which feeds directly into the existing consider() priority-ranked selection. The annotation path is evaluated before the propagation-lag check so a terminal annotation wins over the AwaitingPropagation reason at the same bucket.

No changes to the federator merge logic or Workload controller are needed: once the cell WD Available carries the correct reason, Karmada statusAggregation carries it hub-ward and syncStatusFromDownstream copies it to the project WD as-is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit ec668d9)
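Reading a terminal error out of an annotation might look something like the sketch below. Everything concrete here is an assumption: the annotation key, the JSON payload shape, and the `decodeTerminalErrorSketch` name are invented for illustration, since the source names the annotation constant and `decodeTerminalError` but does not show their values or signatures. The sketch only demonstrates the "present and parseable, else ignore" behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical annotation key; the real key behind
// ReferencedDataErrorAnnotation is not shown in the source.
const referencedDataErrorAnnotation = "example.com/referenced-data-error"

// terminalError is an assumed JSON payload shape for the annotation value.
type terminalError struct {
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// decodeTerminalErrorSketch returns the terminal error carried in the
// annotations, or false when the annotation is absent, unparseable, or
// carries a reason that is not one of the three terminal reasons.
func decodeTerminalErrorSketch(annotations map[string]string) (terminalError, bool) {
	raw, ok := annotations[referencedDataErrorAnnotation]
	if !ok || raw == "" {
		return terminalError{}, false
	}
	var te terminalError
	if err := json.Unmarshal([]byte(raw), &te); err != nil {
		return terminalError{}, false
	}
	switch te.Reason {
	case "SourceNotFound", "SourceUnauthorized", "SourceTooLarge":
		return te, true
	}
	return terminalError{}, false
}

func main() {
	annotations := map[string]string{
		referencedDataErrorAnnotation: `{"reason":"SourceNotFound","message":"source is missing"}`,
	}
	if te, ok := decodeTerminalErrorSketch(annotations); ok {
		// In the PR, the decoded reason is fed into the priority-ranked
		// selection; here it is only printed.
		fmt.Println(te.Reason, "-", te.Message)
	}
}
```

Treating a malformed annotation as absent keeps the reconciler on the existing status-condition path rather than failing the reconcile; whether the PR makes the same choice is not stated in the source.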
Replace literal "test-workload" occurrences with the existing rdTestWorkloadName constant so goconst no longer flags them. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Value
When an Instance or WorkloadDeployment can't start, it now tells the user why, distinguishing "waiting on referenced data" from "waiting on quota" from a hard runtime error, instead of reporting a generic not-ready state. The failure mode becomes actionable: a user can see at a glance whether they need to grant quota, fix a missing, oversized, or unauthorized referenced object, or wait for delivery to finish.
This builds on the referenced-data delivery in the parent PR (#129).
What
- Instance.Ready, WorkloadDeployment.Available, and (rolled up) Workload.Available.
- instanceBlockingReasonPriority (it is not redefined) with the referenced-data tiers: transient resolving/awaiting-propagation/not-ready rank with startup reasons; terminal source-not-found/too-large/unauthorized rank with hard runtime errors.
- selectWDBlockingCondition.
Reviewer attention
Two priority tables exist by design: the Instance-side table (extended foundation table) and the WD-side wdBlockingReasonPriority. The QuotaVsReferencedData test asserts a terminal refdata reason outranks pending-quota; confirm that ordering matches intended UX.
Stack
Stacks on the referenced-data core PR (#129).
go build/vet/test/golangci-lint green at tip.
🤖 Generated with Claude Code