Workloads can reference ConfigMaps and Secrets, delivered to every POP cell#129
Draft
scotwells wants to merge 19 commits into
Draft
Workloads can reference ConfigMaps and Secrets, delivered to every POP cell#129scotwells wants to merge 19 commits into
scotwells wants to merge 19 commits into
Conversation
e0b8ee7 to
9a064cc
Compare
This was referenced Jun 2, 2026
scotwells
added a commit
that referenced
this pull request
Jun 3, 2026
Drop the pre-merge dependency framing (the #129 'not yet on main' note) and repoint example references from PR URLs to relative paths (examples/serverless-js-configmap, examples/config-secret-probe), as if the referenced-data delivery and example PRs are merged. Keep the upstream base/base-compat ROM-enablement limitation, which is a Unikraft runtime matter independent of these PRs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scotwells
added a commit
that referenced
this pull request
Jun 4, 2026
Surface rolling-update / restart progress in `datumctl compute workloads` by showing updated/desired replica counts next to ready. UP-TO-DATE counts instances on the latest template revision (status.updatedReplicas), so a roll is visible as the count dips below desired and then recovers. Includes a byte-identical copy of the UpdatedReplicas/ObservedGeneration WorkloadDeployment status fields in api/v1alpha so the plugin can read them. These fields are defined identically on the controller branch (PR #129); the duplicate resolves cleanly once both land on main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d981d02 to
e420096
Compare
0920e3e to
741fb15
Compare
e420096 to
b11ab65
Compare
741fb15 to
119ddfd
Compare
b11ab65 to
e0df8b4
Compare
119ddfd to
5997231
Compare
e0df8b4 to
5ec3c00
Compare
5997231 to
1d0fca7
Compare
5ec3c00 to
bc253fd
Compare
1d0fca7 to
a67b32c
Compare
bc253fd to
643da4d
Compare
An error occurred while trying to automatically change base from
split/federation-core-bundled
to
feat/federated-deployment-scheduling
June 5, 2026 17:30
a67b32c to
b45810f
Compare
643da4d to
4ef5f5c
Compare
Phase 0 of the cross-plane referenced-data delivery design: the API surface, validation, and the management-plane building blocks every later phase compiles against. - API: EnvFrom on SandboxContainer; ReferencedDataReady condition + reasons; ReferencedData scheduling gate; referenced-data label and expected-referenced-data/restartedAt annotations. - Validation: secret-volume validation, ConfigMap/Secret items key->path projection, file-mode range and path safety, EnvFrom validation, and an admission SubjectAccessReview that the submitter can read each referenced ConfigMap/Secret. - internal/referenceddata seam: reference collector, deterministic companion naming, and the scoped ProjectConfigSecretReader (+ single-cluster variant). - Feature-flagged gate insertion (EnableReferencedDataGate, default off) so behavior is unchanged until the delivery and consumption phases are deployed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 8f4014b)
Phase 1: the management-plane ReferencedDataController. For each WorkloadDeployment it collects referenced ConfigMaps/Secrets, reads them with a scoped reader, enforces per-object/aggregate size limits, materializes one shared companion per source as a local copy, records the expected companion set on the deployment, reference-counts across deployments with finalizer cleanup, and surfaces resolution failures on ReferencedDataReady. Source watches refresh companions on rotation. Delivery is abstracted behind a companionWriter seam; the federated writer and PropagationPolicy extension are deferred to Phase 1b (federation merge). Sizes are configurable (referencedData.perObjectLimitBytes / aggregateLimitBytes). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 7f48a97)
Phase 2: the cell-side InstanceReconciler now resolves the referenced-data scheduling gate. For a gated Instance it reads the expected companion set from the owning WorkloadDeployment, lists the labeled companions in the namespace, and clears the gate once they are all present — surfacing ReferencedDataReady (Resolving/AwaitingPropagation/Ready) with the missing set on the message. A companion watch re-checks waiting Instances, with a requeue safety net. Adds Kubernetes Events on transitions and compute_referenced_data_* metrics (companions present/expected, gate-wait duration, condition transitions), and counts referenced-data-blocked replicas in the deployment status. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit b3da528)
Fix type mismatches and test setup issues that arose when landing the
referenced-data commits onto the federated-deployment-scheduling base:
- Use multicluster.ClusterName (not string) in Watches handler signatures
for referenceddata_controller.go and instance_controller.go companion
watches; update enqueueWDsForSource to accept multicluster.ClusterName
- Cast req.ClusterName to string when passing to resolveAndValidateSources
- Replace New(Options{}) with NewWithOptions(Options{}) in stateful_control_test.go
after re-introducing the no-arg New() on the combined Options struct
- Initialize r.finalizers in newRefDataReconciler test helper; add
instanceControllerFinalizer to test instance fixtures so the finalizer
framework does not short-circuit reconcile on first pass
- Update TestReferencedDataEventEmittedOnClear to expect single-pass
gate-clearing (federation branch clears both status and gate in same pass)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 3b00771)
Phase 1b: the resolver now has a federated companion writer. When a federation
client is configured it materializes companions into the project's downstream
namespace (ns-{project-uid}) on the Karmada hub via the same
MappedNamespaceResourceStrategy the federator uses; with no federation client it
falls back to the single-cluster local copy.
The federator's PropagationPolicy now selects the referenced-data-labeled
ConfigMaps and Secrets alongside the WorkloadDeployment, so companions
co-propagate to the same cells, and it forwards the expected-referenced-data
annotation to the downstream deployment so the cell can clear the gate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 5f42053)
The ReferencedDataController was registered unconditionally, so it also started in the cell operator and collided with the cell's WorkloadDeploymentReconciler (both default their controller name to "workloaddeployment"). Gate it to the management controller set and give it a distinct name. Found by running the operators in the multi-cluster e2e environment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 0240e90)
The WorkloadDeploymentFederator and ReferencedDataController both call Update() on the same WorkloadDeployment to add/remove their respective finalizers, producing optimistic-lock conflicts under concurrent reconciliation. The companions DO converge, but the errors are noisy. Wrap all three finalizer mutation sites (add on first reconcile, remove on deletion, remove on empty-refs) in retry.RetryOnConflict so that a concurrent federator update causes a transparent re-read + retry rather than a logged error. Each retry re-GETs the latest object version before re-applying only the single finalizer change. Adds two unit tests using interceptor.Funcs to inject a conflict on the first Update and assert that the finalizer is correctly applied/removed on retry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> (cherry picked from commit 87fe6ac)
…cret-mounts
Fixes six issues identified in code review of the federated configmap/secret
mounts feature:
1. [BLOCKER] ValidateUpdate was a no-op stub: an attacker could create a
Workload with no refs then edit in refs to ConfigMaps/Secrets they cannot
read. Fixed by implementing ValidateUpdate to run the same full template
validation (including SAR) as ValidateCreate on the new object.
2. [MAJOR] SAR for referenced-data was missing UserInfo.Extra, unlike the
sibling Network SAR. Fixed by populating Extra from AdmissionRequest.
3. [MAJOR] Collector still collected both refs from an envFrom entry that had
both configMapRef and secretRef set, even though validateEnvFrom forbids it.
Fixed by short-circuiting such entries before collection in CollectFromTemplate.
4. [MAJOR] validateReferencedDataAccess maintained a hand-rolled traversal
(collectRefsFromSpec) parallel to referenceddata.CollectFromTemplate, with
drift risk. Fixed by adding CollectFromSpec to collector.go and making the
validation call it as the single source of truth. No import cycle (validation
-> referenceddata, not the reverse).
5. [MINOR] names.go: after TrimRight("-.", ...) the truncated segment could be
empty (e.g. source name is all '-' or '.'), producing "<prefix>.-<hash>"
which is invalid DNS-1123 (segment starts with '-'). Fixed by emitting
"<prefix>.<hash>" when truncated is empty.
6. [MINOR] Tests added: both-refs-set envFrom rejection + no SAR for invalid
entry; SAR InternalError/fail-closed path; ValidateUpdate SAR path;
names edge cases (all-dashes, all-dots, name ending on dot, valid prefix).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit a7ac4dd)
Fixes 10 issues found in the code review of the configmap/secret mounts
federation feature.
**Blockers:**
Fix 1 (federator status sync clobbers resolver conditions): syncStatusFromDownstream
now merges downstream status into the project WD while preserving the
resolver-owned ReferencedDataReady condition, rather than replacing .Status
wholesale. Wrapped in RetryOnConflict for concurrent-write safety.
Fix 2 (ref-count annotation race): materialiseOne and releaseOneCompanion now
wrap their read-modify-write of the ref-count annotation in
retry.RetryOnConflict loops so that two WDs reconciling concurrently against
the same companion cannot drop each other's ref entries.
**Majors:**
Fix 3 (optional source escape hatch): resolveAndValidateSources accepts the WD
template spec and uses a new isOptionalRef helper to determine per-source
optionality. NotFound and TooLarge errors for optional sources are silently
skipped; the WD proceeds with the remaining required companions.
Fix 4 (unparseable ref-count): decodeRefCount now returns ([]string, error).
An unmarshal failure propagates as a transient error; release paths return
the error rather than computing an empty remaining set that would incorrectly
delete the companion.
Fix 5 (nil-source panic): resolveOneSource nil-guards the reader's return
value and maps a (nil, nil) return to a SourceNotFound condition error.
Fix 6 (non-conflict-tolerant writes): the expected-set annotation Patch and
Status().Update are each wrapped in retry.RetryOnConflict; residual conflicts
after retries return (Result{}, nil) to requeue cleanly.
Fix 7 (annotation/label overwrite): ApplyConfigMap and ApplySecret in both
localCompanionWriter and downstreamCompanionWriter now use mergeLabels /
mergeAnnotations helpers that only upsert controller-owned keys, preserving
third-party entries such as Karmada bookkeeping annotations.
**Minors:**
Fix 8 (unused scheme field): ProjectReader.scheme field and the second
parameter of NewProjectReader removed; client.Options{} is sufficient.
Fix 9 (misleading localCompanionWriter comment): clarified that
localCompanionWriter is reachable only when FederationClient is nil, which
management-plane federation wiring never does on this branch.
Fix 10 (literal "v1" APIVersion): PropagationPolicy ConfigMap and Secret
selectors now use corev1.SchemeGroupVersion.String() instead of the literal
string "v1".
**Regression tests added:**
- Two WDs sharing a source, interleaved reconciles → both ref-count entries preserved
- Optional missing source → skipped, WD not failed
- Unparseable ref-count annotation → companion NOT deleted
- Federator status sync preserves ReferencedDataReady condition
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 5d5dce1)
…etrics - [BLOCKER] Restore ObservedGeneration guard on gate clearing: both the quota and referenced-data gate removals in reconcileSchedulingGates now check cond.ObservedGeneration == instance.Generation, preventing a stale True condition from generation N clearing a freshly-stamped gate at N+1. - [MAJOR] Suppress AwaitingPropagation event flood: the Warning event is now emitted only on a reason transition (prev != AwaitingPropagation), not on every reconcile where the missing-set message changes. - [MAJOR] Delete dead removeQuotaSchedulingGate: its generation guard has been ported to reconcileSchedulingGates; the standalone function is gone. - [MAJOR] Drop high-cardinality instance label from CompanionsPresent and CompanionsExpected metrics: both gauges now aggregate per namespace only, eliminating the unbounded per-instance series set. - [MINOR] Delete drainEvents test helper (never called). - [MINOR] Remove instanceByWorkloadDeploymentUIDIndex: the constant, its IndexField registration, and the index func are all deleted; the companion watch enqueue path already lists the full namespace and never queried it. Updated addInstanceIndexers doc-comment accordingly. - [MINOR] Route checkForNetworkCreationFailure through fetchOwnerWorkloadDeployment to avoid a duplicate WD Get; fix the stale doc-comment on enqueueInstancesInNamespace. - [MINOR] Fix no-gate self-heal: the branch that flips a non-True ReferencedDataReady condition to True when the gate is absent now validates that all expected companions are actually present before marking Ready, avoiding a false-Ready status when the gate was stripped out-of-band while companions were still missing. - ADD TEST: TestReferencedDataStaleConditionGuard verifies that a stale True condition at ObservedGeneration N does not clear a gate at generation N+1; only the second reconcile (after condition is re-evaluated at N+1) removes the gate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> (cherry picked from commit c433ad1)
The companion ref-count (companionRefCountAnnotation) read-modify-write was non-atomic: materialiseOne fetched the companion (R1), computed the updated ref-count, built a desired object carrying that ref-count, then called ApplyConfigMap which issued a second independent GET (R2) before calling Update. mergeAnnotations then overwrote the referenced-by key on R2 with the R1-derived value. If a concurrent WD had committed its ref-count entry between R1 and R2, R2 already contained that entry, but the merge silently discarded it. Because Apply's own GET+Update targeted R2, no conflict was raised and RetryOnConflict never fired. Fix (approach a): remove the second internal GET from ApplyConfigMap / ApplySecret. Both methods now accept an `existing` argument — the already-fetched object (or nil when the companion does not yet exist). When existing is non-nil, the implementation writes directly onto it and calls Update, targeting the same resourceVersion that the caller used to compute the ref-count. A concurrent change between GET and Update will now produce a Conflict that propagates out to the materialiseOne / releaseOneCompanion RetryOnConflict loop, which re-reads and recomputes. The same double-GET path existed in releaseOneCompanion and is fixed with the same pattern: pass the already-fetched companion as the `existing` argument. Add TestReferencedData_RefCount_ConflictForcesReread, which injects a concurrent write of a third WD's key (wd3Key) into the companion on realCl immediately after the outer GET returns. Under the fixed code the subsequent Update at R1 conflicts with R2, RetryOnConflict re-reads R2 (with wd3Key), and all three keys appear in the final annotation. Under the old code Apply's internal GET would return R2 with wd3Key, but mergeAnnotations would overwrite referenced-by with the R1-derived value ([wd1Key, wd2Key]), silently dropping wd3Key, and the Update at R2 would succeed — the test would fail with only 2 keys instead of 3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> (cherry picked from commit 9b2b7e0)
Replace manual membership loop in refCountAdd with slices.Contains. Replace two manual map-copy loops in mergeLabels/mergeAnnotations with maps.Copy (semantics preserved: copies wanted INTO existing, never replaces the map). Add ReferencedDataLabelValue = "true" to api/v1alpha/labels.go and use it everywhere in place of the repeated string literal. Use kindSecret and kindConfigMap constants in indexers.go and the controller instead of repeated string literals. Extract isOptionalInVolumes and isOptionalInContainers helpers from isOptionalRef to bring gocyclo from 32 down to within the limit. Define named constants for all repeated test strings (rdTestWD1, rdTestBlobKey, testKindConfigMap, etc.) introduced by the configmap/secret change set to clear the goconst and gocyclo lint issues it brought in. The instance_controller.go GetEventRecorderFor call retains its existing //nolint:staticcheck comment: the replacement API (GetEventRecorder) has an incompatible Eventf signature requiring a migration of all emit sites, which is deferred as a separate task. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> (cherry picked from commit 8d782e7)
…olve
Companions were named "<lower-kind>.<source-name>" (e.g. configmap.app-config)
but Instances and Pods reference sources by their verbatim source name
(e.g. app-config). The runtime found only the prefixed companion → 404
→ mounts/env vars failed.
Option B fix: companion objects are now named exactly by their source name.
Cross-kind collision is safe (ConfigMap and Secret are distinct resource
types in the same namespace). Pod/Instance volume and envFrom references
resolve naturally without any translation layer.
Kind-disambiguation is preserved in the expected-referenced-data annotation
as "Kind/name" tokens (e.g. ["ConfigMap/app-config","Secret/app-secret"])
so the cell InstanceReconciler can verify each companion by kind+name
without probing both resource types.
Changes:
- internal/referenceddata/names.go: CompanionName drops the kind prefix;
adds CompanionToken("Kind", "name") → "Kind/name" helper
- internal/referenceddata/names_test.go: update expected values; add
TestCompanionName_SourceNameContract (contract test that would fail
against the old prefixed code) and TestCompanionName_SameSourceDifferentKind
- internal/controller/referenceddata_controller.go: annotation stores
kind-qualified tokens; releaseOneCompanion accepts explicit kind param
- internal/controller/referenceddata_controller_test.go: update assertions
to kind-qualified tokens; fix stale-RV and same-name source/companion cases
- internal/controller/instance_controller.go: gate-clearing parses kind-
qualified tokens via listPresentCompanionsByKindName
- internal/controller/instance_referenced_data_test.go: align to new names
and annotation token format
- test/e2e/referenced-data-mounts/chainsaw-test.yaml: update companion name
assertions (configmap.app-config → app-config, secret.app-secret → app-secret)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 7434216)
Note (split): empty against split/federation-core-bundled. The foundation already guards the nil Instance Status.Controller deref in reconcileInstanceGates (alongside its updatedReplicas rollout tracking), so this commit's production change is a no-op here. Retained to preserve the cherry-pick order; the accompanying nil-guard test lands in 6e7de18. (cherry picked from commit 58783e84a9810a78eae84b67e887be663008c01d5)
Add a regression test that constructs an Instance with Programmed=True but Status.Controller==nil and asserts that reconcileInstanceGates neither panics nor counts it as a current replica. Without the nil guard introduced in the same area the test panics with a nil-pointer dereference, confirming the guard is load-bearing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> (cherry picked from commit 6e7de18)
Spec.Controller is a nilable pointer the infra provider populates independently of an Instance's networking readiness. reconcileInstanceGates dereferenced instance.Spec.Controller.SchedulingGates in the network gate-clearing path without a nil guard, so an Instance that reached networkReady before its controller spec was populated panic-looped the WorkloadDeployment reconcile and froze status. This mirrors the existing Status.Controller guard. Extends the nil-controller regression test to drive networkReady=true with a nil Spec.Controller, the case the first guard missed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 8cf0945)
…rencedDataReady
Add wdReferencedDataChangedPredicate to the WorkloadDeploymentReconciler's For()
watch so the reconciler re-runs when ReferencedDataController writes
ReferencedDataReady=False/SourceNotFound, without creating a self-trigger loop on
the reconciler's own Available/ReplicasReady status writes.
Root cause: the For() watch had no predicate, so each Status().Update by the WD
reconciler (writing Available and ReplicasReady) re-enqueued itself, creating a
continuous reconcile loop. The resolver's write of ReferencedDataReady was lost in
this noise — the WD reconciler would re-run but process a stale queue entry before
the cache reflected the resolver's update, leaving Available stuck at
InstancesProvisioning instead of being promoted to ReferencedDataNotReady.
Fix approach:
- Add wdReferencedDataChangedPredicate: passes on generation change (spec update) or
ReferencedDataReady condition status/reason/message change. Drops updates where
only Available/ReplicasReady changed (own writes).
- Add equality.Semantic.DeepEqual guard before Status().Update as belt-and-suspenders:
avoids the superfluous API call entirely when status did not change.
- Extract "ReplicasAvailable" inline string to reasonReplicasAvailable constant
(goconst compliance).
Loop prevention: the WD reconciler writes Available and ReplicasReady but never
writes ReferencedDataReady. So its own status writes fail the predicate's Update
check and do not re-enqueue via For(). Instance changes still trigger via
Owns(&Instance{}), which has no predicate.
Workload level: WorkloadReconciler.Owns(&WorkloadDeployment{}) has no predicate and
correctly fires on Available condition changes. The reconciler already has the
equality.Semantic.DeepEqual guard so it does not loop on its own writes.
Tests: wdRefDataCondChanged unit tests (nil/nil, added, removed, status change,
message change, identical), wdReferencedDataChangedPredicate tests (resolver write
passes, resolver clear passes, generation change passes, Available-only drops,
ReplicasReady-only drops, create/delete/generic always pass).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 4ba1d46)
…nnotation The referenced-data resolver sets ReferencedDataReady=False/SourceNotFound on the hub WorkloadDeployment, but Karmada statusAggregation only flows cell->hub, so the cell Instance controller (which reads the cell WD copy) never saw the verdict and the Instance stayed Ready=Unknown/Resolving instead of Ready=False/SourceNotFound. Bridge it via an annotation (compute.datumapis.com/referenced-data-error), which DOES propagate hub->cell as part of ObjectMeta. The resolver stamps it on terminal errors (SourceNotFound/TooLarge/Unauthorized) and clears it on resolve; the cell instance controller reads it from the owner WD and surfaces the terminal reason+ message, with a condition-based fallback preserved for single-cluster. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 3751f78)
Extract shared WorkloadDeployment test fixtures (cityCode, wdControllerTestUID) and the quota-denied message into package-level constants so goconst no longer flags repeated string literals across the controller test files. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b45810f to
73177eb
Compare
4ef5f5c to
646124c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Value
Workloads can reference ConfigMaps and Secrets, and they are delivered to the Instances that need them in every POP cell. The management plane resolves those references into labeled companion objects and propagates them cross-plane via the federator's PropagationPolicy, gated by a
ReferencedDatascheduling gate (behind theenableReferencedDataGateflag, default off) so an Instance only starts once its referenced data has landed.This is the core of referenced ConfigMap/Secret delivery — the bottom of a reviewable stack carved out of the original monolithic #129. It lands the resolution + cross-plane delivery mechanism only; status-reason polish, hub-side GC, and e2e follow as separate stacked PRs so each gets focused review.
What
Stack
#107←#142(foundation completion) ← this PR (referenced-data core) ← blocking-reason ← hub-side GC ← e2e. Base is the bundled foundation (#142); merges down in order.Verification
go build/go vet/go test ./internal/... ./api/.../golangci-lintv2.12.2 all green at the branch tip.🤖 Generated with Claude Code