feat: datumctl compute plugin — deploy and manage workloads from the CLI by scotwells · Pull Request #113 · datum-cloud/compute

scotwells · 2026-05-22T21:50:21Z

Summary

Adds the datumctl compute plugin so developers can deploy and manage containerized workloads on Datum Cloud directly from the CLI.

Commands shipped:

deploy — push a container image as a workload with flags or a manifest file; waits for rollout
destroy — tear down a workload with a confirmation prompt
status — show workload health, per-city placement summary, and the active revision
instances — list all running instances across cities, with describe for full detail
scale — adjust minimum replica count across all placements
rollout — watch live rollout progress, browse revision history, and roll back to any prior revision
restart — trigger a rolling restart of a workload or a specific city
quota — inspect per-city instance usage and surface quota-exceeded messages

Revision history is stored as a ConfigMap per workload so rollout history and rollout undo work without server-side tracking.

Dependencies

Depends on feat: extend datumctl with installable service plugins datumctl#198 (plugin dispatch foundation) — go.mod currently uses a replace directive pointing at that PR's worktree; the directive should be removed and replaced with a release tag once that PR merges.

What's not included

logs — telemetry service not yet implemented
Tests — next step is adding envtest-based integration tests for each command
cities / instance-types resource listing commands

Workloads targeting a city location are now automatically routed to the correct physical site via a Karmada-based federation layer. Each POP cell operates independently, instance health is surfaced back to the control plane in real time, and the platform remains available even when parts of the control plane are temporarily unreachable.

Controllers added:

- WorkloadDeploymentFederator: replicates WDs into Karmada and manages PropagationPolicies per city code
- InstanceProjector: mirrors Instance write-backs from Karmada into the project namespace on the control plane

ResourceInterpreterCustomization deployed at config time teaches Karmada how to aggregate replica counts and conditions across POP cells. Operator flags --enable-management-controllers and --enable-cell-controllers allow each deployment to opt into only the controllers it needs.

Includes a 6-test Chainsaw e2e suite covering federation, deletion cascade, propagation policy lifecycle, instance projection, instance write-back, and the full end-to-end chain.

Resolves #85

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…edge Introduces management-plane and cell overlay paths to the compute OCI artifact so the infra repo can deploy compute-manager in the correct mode for each tier of the federation architecture. The management-plane overlay deploys compute-manager with only WorkloadDeploymentFederator and InstanceProjector enabled, connected to the Karmada downstream control plane via projected ServiceAccount token auth. The cell overlay deploys compute-manager with only WorkloadDeploymentReconciler and InstanceReconciler enabled, with no downstream connection or webhook server. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ts for webhook TLS Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove the hardcoded datum-control-plane ClusterIssuer from the csi-webhook-cert component. DNS names stay since they are fixed by the service name and namespace. Each consuming overlay now supplies the issuer via a strategic merge patch, allowing different environments to use different cert issuers without forking the component. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Each WorkloadDeployment is routed to exactly one cell cluster via its PropagationPolicy, so aggregation across multiple members is not needed. Replace the summing logic with a direct pass-through of the single member's status. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The cert issuer name is environment-specific configuration that belongs in the infra repo, not the compute overlay. The infra repo's base manager patch already owns the full webhook-server-tls volume definition including the issuer. Consumers deploying outside infra must patch the issuer in their own overlay. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…moval dev: inline self-signed Issuer + Certificate for host.docker.internal, replace kustomize replacements block with direct annotation patch, remove Certificate-patching from webhook_patch.yaml, and clear webhookServer secretRef from config.yaml. single-cluster: replace cert-manager Certificate approach with the csi-webhook-cert component, matching the main branch overlay. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The WorkloadReconciler watches networkingv1alpha.Network objects, which requires the network-services-operator CRDs to be installed. Cell clusters don't have those CRDs, causing the manager to crash on startup. Gate the WorkloadReconciler behind enableManagementControllers so it only runs where the Network CRDs are present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extracts server config file reading and decoding into a dedicated loadServerConfig helper, reducing main's cyclomatic complexity from 31 to 29 to satisfy the gocyclo linter limit of 30. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Milo's authorization webhook uses Extra claims on the admission request (iam.miloapis.com/parent-name, iam.miloapis.com/parent-type, etc.) to resolve the correct project-scoped policy binding. Dropping them caused the SAR to return Allowed=false even for users with networks.use, because the authorizer couldn't locate the binding without the project context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

metricRules belongs under spec.quota, not spec.billing. The field is not declared in the ServiceBillingConfig schema, causing Flux dry-run failures in staging with: .spec.billing.metricRules: field not declared in schema

Previously, InstanceReconciler wrote ResourceClaim objects against the local deployment cluster via managementCluster.GetClient(). Those claims were never seen by the Milo quota system, leaving every Instance in QuotaGranted=Unknown indefinitely. This change routes claim creation and deletion to the correct Milo project control plane for each instance using a new ProjectQuotaClientManager that builds per-project REST clients by rewriting the host path — mirroring the URL construction already used by the milomulticluster provider. The management-cluster claim watch is replaced with a multicluster Watches call so that grant/denial status changes in project control planes re-trigger instance reconciles. Claims are stamped with a source-cluster label (discovery.clusterName) so each edge controller only reacts to the claims it created. Co-Authored-By: Claude <claude@anthropic.com>

The admission webhook requires that all metrics referenced in spec.quota.limits[].metric and spec.quota.metricRules[].metricCosts match a name declared in spec.metrics[]. The four quota-tracking metrics (workloads, instances, vcpus, memory) were missing from spec.metrics[], causing the webhook to reject the resource.

…o cell setup Controller flags --enable-management-controllers and --enable-cell-controllers now default to false so kustomize components must explicitly opt in, rather than both groups running by default. This prevented the management-plane deployment from crashing when discovery.clusterName was unset — that field is only required by the InstanceReconciler (a cell controller), so the validation now lives in InstanceReconciler.SetupWithManager instead of initializeClusterDiscovery. Also adds cell-controllers and management-controllers components to the single-cluster overlay, which was silently running with no controllers enabled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
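The opt-in behavior described above can be sketched with the standard library `flag` package. This is a minimal illustration, not the operator's actual wiring: only the two flag names come from the commit message, while `controllerFlags`, `parseControllerFlags`, and the help strings are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// controllerFlags holds the two opt-in switches named in the commit message.
// The struct itself is a hypothetical container for this sketch.
type controllerFlags struct {
	management bool
	cell       bool
}

// parseControllerFlags registers both flags with a default of false, so a
// deployment that passes neither flag runs no controller group at all.
func parseControllerFlags(args []string) (controllerFlags, error) {
	var cf controllerFlags
	fs := flag.NewFlagSet("compute-manager", flag.ContinueOnError)
	fs.BoolVar(&cf.management, "enable-management-controllers", false,
		"run the management-plane controllers")
	fs.BoolVar(&cf.cell, "enable-cell-controllers", false,
		"run the cell controllers")
	if err := fs.Parse(args); err != nil {
		return controllerFlags{}, err
	}
	return cf, nil
}

func main() {
	// A cell overlay would pass only the cell flag.
	cf, err := parseControllerFlags([]string{"--enable-cell-controllers"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("management=%v cell=%v\n", cf.management, cf.cell)
}
```

With both defaults at false, an overlay that forgets to opt in runs nothing, which is the situation the commit reports for the single-cluster overlay.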

…scovery The rebase during cherry-pick propagation introduced a mixed state where cmd/main.go had the edgeClusterName/projectRestConfig return values partially reverted. This cleans up the function signature and call sites to be consistent, while keeping the validation removed from initializeClusterDiscovery (it belongs in InstanceReconciler.SetupWithManager per the original fix intent). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… RBAC The workload-deployment-federator calls ensureDownstreamNamespace before federating WorkloadDeployment resources, but the compute-manager ClusterRole was missing core-group namespace permissions, causing every reconcile to fail with a forbidden error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Workload scheduling and admission now consult LocationBinding objects (project-scoped, created by the service catalog) rather than the global Location list. This ensures consumers only see locations that are both healthy and available to their specific project. Also upgrades network-services-operator and milo dependencies to versions that introduce LocationBinding and address multicluster-runtime v0.23 API changes (ClusterName type, ProviderRunnable Start lifecycle, generic webhook builder). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ources WorkloadDeploymentReconciler creates and owns NetworkBinding and SubnetClaim resources, and watches Location, NetworkContext, and Subnet. InstanceReconciler watches ResourceClaim for quota. Neither was granted the necessary ClusterRole rules, causing watch failures on cell clusters. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

From the cell cluster's perspective, Karmada is upstream (the federation control plane), not downstream. Rename the flag, env var, and related variables throughout to reflect the actual relationship. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…viderRunnable fix Points go.miloapis.com/milo to the feature branch commit that implements multicluster.ProviderRunnable on the Milo provider, enabling the mc manager to auto-call provider.Start() and set p.mcAware so project clusters can be registered. Without this, p.mcAware was always nil and every project reconcile logged "Multicluster manager not yet started" forever. Also removes the & from ResourceRef in ResourceClaimSpec — the feature branch has ResourceRef as a value type, not a pointer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove non-existent QuotaRestConfig() call and fix SetupWithManager argument count; pass nil quota config to skip quota enforcement for now. Single-tenant cell mode uses namespace-as-project-id and the fixed 'single' cluster name. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wires up Milo ResourceClaim-based quota accounting for cells running in single-cell discovery mode (mode: single), where the multicluster ClusterName is always "single" rather than the Milo project name.

Key changes:

- Add QuotaKubeconfigPath config field and QuotaRestConfig() method so quota REST config can be configured independently of discovery mode. Returns (nil, nil) when neither path is set, disabling quota rather than silently targeting the local apiserver.
- Add projectIDForInstance and clusterNameForProject func fields to InstanceReconciler. In single mode, project ID is derived from instance.Namespace; the watch map func always enqueues ClusterName "single" rather than the project namespace, avoiding ErrClusterNotFound on every quota-grant event.
- Guard ResourceClaim watch map func against claims with empty ResourceRef to prevent a nil-dereference panic when a label-matching claim from another actor has no ResourceRef set.
- Add TestReconcileQuotaSingleMode covering the full single-mode quota flow: project ID from namespace, watch re-enqueue to "single" cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

v2.1.5 was built with Go 1.24 and refuses to lint Go 1.25 modules. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tatus change Write-back was only triggered inside the statusChanged||readyChanged block, so instances stuck in a scheduling gate (no status transitions) were never replicated to Karmada. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nged Use apiequality.Semantic.DeepEqual to avoid unnecessary API calls to Karmada on every reconcile when nothing has actually changed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

From the cell cluster's perspective, Karmada is upstream. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Exposes per-container entrypoint and argument overrides on the SandboxContainer type, mirroring Kubernetes pod-spec semantics:

- Command []string: overrides the image ENTRYPOINT
- Args []string: overrides the image CMD; combined with Command when both are set

When neither field is set the image's own ENTRYPOINT/CMD are used unchanged, which is the correct default for standard OCI images (e.g. hello-world, nginx). Infrastructure providers that translate Instance specs (such as unikraft-provider) should map these fields through to the underlying runtime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
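A provider mapping these fields to a runtime might resolve the final argv roughly as follows. This is a sketch under assumptions: the `SandboxContainer` here carries only the two fields named above, `effectiveArgv` and its image inputs are hypothetical, and the Command-only case follows the documented Kubernetes rule (image CMD is ignored when a command is supplied), which the commit message does not spell out.

```go
package main

import "fmt"

// SandboxContainer mirrors only the two fields named in the commit message;
// the real API type has more fields.
type SandboxContainer struct {
	Command []string
	Args    []string
}

// concat returns a fresh slice holding a followed by b.
func concat(a, b []string) []string {
	out := make([]string, 0, len(a)+len(b))
	out = append(out, a...)
	return append(out, b...)
}

// effectiveArgv resolves what the runtime would execute, following
// Kubernetes pod-spec semantics. imageEntrypoint and imageCmd stand in for
// the ENTRYPOINT and CMD read from the OCI image config.
func effectiveArgv(c SandboxContainer, imageEntrypoint, imageCmd []string) []string {
	switch {
	case len(c.Command) == 0 && len(c.Args) == 0:
		// Neither set: the image's own ENTRYPOINT and CMD run unchanged.
		return concat(imageEntrypoint, imageCmd)
	case len(c.Args) == 0:
		// Command only: image ENTRYPOINT and CMD are both ignored.
		return concat(c.Command, nil)
	case len(c.Command) == 0:
		// Args only: image ENTRYPOINT runs with the supplied args.
		return concat(imageEntrypoint, c.Args)
	default:
		// Both set: Command followed by Args.
		return concat(c.Command, c.Args)
	}
}

func main() {
	entrypoint := []string{"/docker-entrypoint.sh"}
	cmd := []string{"nginx", "-g", "daemon off;"}

	fmt.Println(effectiveArgv(SandboxContainer{}, entrypoint, cmd))
	fmt.Println(effectiveArgv(SandboxContainer{Args: []string{"nginx", "-T"}}, entrypoint, cmd))
	fmt.Println(effectiveArgv(SandboxContainer{Command: []string{"/bin/sh"}, Args: []string{"-c", "env"}}, entrypoint, cmd))
}
```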

…stant Four occurrences in instance_writeback_test.go triggered goconst because testInstanceType = "d1-standard-2" already exists in the same package. Replacing all four with the constant keeps golangci-lint at 0 issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nd-args feat(api): add Command and Args fields to SandboxContainer

Instances are projected back to the project cluster where the CLI reads them. Previously, correlating an Instance with its workload/deployment required joining on workload-deployment-uid, which differs per Karmada plane (the WD UID in the management cluster does not match the uid assigned in the cell).

Add four new label constants and stamp them on every Instance at create and update time:

- workload-deployment-name (deployment.Name)
- city-code (deployment.Spec.CityCode)
- workload-name (deployment.Spec.WorkloadRef.Name)
- placement-name (deployment.Spec.PlacementName)

These self-describing labels let the CLI resolve WORKLOAD/CITY/placement directly from the projected Instance object, without any cross-plane join. All four labels are included in the writeBackToUpstream allowlist so they propagate through InstanceReconciler → Karmada → InstanceProjector into the user-facing project cluster.

Also persist the resolved Location (already discovered during network reconciliation) onto WorkloadDeployment.Status.Location, and propagate it into Instance.Spec.Location best-effort. A nil location never blocks instance creation; the existing scheduling path is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The owner reference on a projected Instance must reference the actual project-control-plane WorkloadDeployment so that GC cascades and deletes projections when the deployment is removed.

The previous implementation compared the WorkloadDeploymentUIDLabel value (which carries the edge/Karmada plane WD UID) against project-cluster WD UIDs, a match that never succeeds because each Kubernetes plane mints its own UID for the same object. The result was that ownerWD stayed nil, no ownerReference was set, and Instance projections leaked indefinitely (e.g. my-api/test-workload orphaned after WD deletion).

Fix: resolve the owning WD by the federation-stable WorkloadDeployment NAME via a direct projectClient.Get against the project cluster, satisfying the core invariant that the owner reference UID/name/GVK must come from a live project-cluster object. The name is read from the new WorkloadDeploymentNameLabel (already stamped by dd3421a); an ordinal-strip fallback handles Instances created before that label was introduced.

If the project WD is NotFound, requeue with RequeueAfter: 5s without creating the projection, so a projection is never created without an owner reference. This handles the transient ordering race where Karmada propagates an Instance back before WorkloadReconciler has created the project WD. Existing ownerless projections self-heal on the next reconcile once the project WD exists.

Tests added:

- "WD name label present, edge UID differs from project UID": asserts ownerRef.UID == projTestWDUID AND != projTestEdgeWDUID (regression guard against reintroducing cross-plane UID matching).
- "WD name label absent, fallback name extraction from instance name": verifies the ordinal-strip path produces a correct owner reference.
- "project WD not found — requeue, no ownerless projection created": asserts RequeueAfter > 0 and no projection object exists.
- "WD name label absent and instance name yields no resolvable WD": verifies unrecognised instance names are skipped cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Newly-introduced controller labels (city-code, workload-name, workload-deployment-name, placement-name) were only stamped on the create path and the template-hash-mismatch update path. A pre-existing instance that is not-Ready with an unchanged template hash takes the Wait branch and was never re-stamped, so the labels were absent on instances like sre-gate-test-default-dfw-0. Add a dedicated label-backfill pass that runs after the ordered rollout decision and skip-loop. For each existing, non-deleting instance, desiredControllerLabels() computes the full desired label set; if any key differs a NewPatchLabelsAction (ActionTypePatchLabels) is emitted. The action executes via client.MergeFrom patch, which sends only the metadata diff — spec, template, and template-hash are never touched (constraint 1). Backfill actions are appended after the rollout skip-loop so they are never subject to the "skip all but first" rule and never counted as an update in progress (constraint 2). The pass is idempotent: it is a no-op when all labels already match. Fix the misleading comment on addInstanceControllerLabels that overstated coverage by claiming the function ran on "both create and update paths"; the comment now reflects that backfill covers every reconcile pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
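The idempotent backfill pass can be illustrated with plain maps. This is a sketch, not the controller's code: `desiredControllerLabels` is named in the commit, but its body here, the `deployment` struct, the `labelPatch` helper, and the unprefixed label keys are all assumptions (the real keys likely carry a domain prefix).

```go
package main

import "fmt"

// deployment carries only the fields the four labels are derived from.
// It is a hypothetical stand-in for WorkloadDeployment.
type deployment struct {
	Name          string
	CityCode      string
	WorkloadName  string
	PlacementName string
}

// desiredControllerLabels computes the full desired label set for an
// instance owned by d. Keys are shown without any domain prefix.
func desiredControllerLabels(d deployment) map[string]string {
	return map[string]string{
		"workload-deployment-name": d.Name,
		"city-code":                d.CityCode,
		"workload-name":            d.WorkloadName,
		"placement-name":           d.PlacementName,
	}
}

// labelPatch returns only the keys that are missing or differ in current.
// A nil result means the instance is already up to date, so no patch
// action needs to be emitted.
func labelPatch(current, desired map[string]string) map[string]string {
	var patch map[string]string
	for key, want := range desired {
		if got, ok := current[key]; !ok || got != want {
			if patch == nil {
				patch = make(map[string]string)
			}
			patch[key] = want
		}
	}
	return patch
}

func main() {
	d := deployment{
		Name:          "sre-gate-test-default-dfw",
		CityCode:      "dfw",
		WorkloadName:  "sre-gate-test",
		PlacementName: "default",
	}
	// A pre-existing instance that predates the labels.
	labels := map[string]string{"unrelated": "kept"}

	patch := labelPatch(labels, desiredControllerLabels(d))
	fmt.Println("first pass patches", len(patch), "labels")
	for k, v := range patch {
		labels[k] = v // metadata-only merge; nothing else is touched
	}
	fmt.Println("second pass is a no-op:", labelPatch(labels, desiredControllerLabels(d)) == nil)
}
```

Because the patch contains only label keys, applying it cannot alter the spec or template hash, which is the property the commit calls constraint 1.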

…s update

After quota is granted, the Quota scheduling gate was never removed from spec.controller.schedulingGates, leaving instances stuck "Pending (SchedulingGatesPresent)" even though the workload was running.

Root cause: Reconcile returned early after writing QuotaGranted=True to status (statusChanged=true path), before reaching removeQuotaSchedulingGate. Because ResourceClaims are immutable after creation and local Instances are not watched (WithEngageWithLocalCluster(false)), no subsequent event would re-enqueue the instance, so the gate was stranded forever.

Fix: on the success path (quotaErr==nil), fall through to removeQuotaSchedulingGate after persisting the status update rather than returning early. Only return early with quotaErr when it is non-nil, which preserves the transient-failure backoff-requeue behavior.

Also updates existing tests that previously required two reconciles to clear the gate (the second of which could never arrive in production), and adds TestQuotaGateRemovedInSingleReconcile as a regression test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
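The control-flow change can be reduced to a few lines. The sketch below is a simplified model, assuming a hypothetical `instance` struct, a gate literally named "Quota", and a `reconcileQuota` function that stands in for the relevant part of Reconcile; the real reconciler talks to the API server and is considerably more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// quotaGate is the assumed name of the quota scheduling gate.
const quotaGate = "Quota"

// errTransient stands in for any transient quota evaluation failure.
var errTransient = errors.New("transient quota error")

// instance is a hypothetical, flattened view of an Instance.
type instance struct {
	SchedulingGates []string
	QuotaGranted    string // "True", "False", or "Unknown"
}

// removeGate returns gates without the named entry.
func removeGate(gates []string, name string) []string {
	out := make([]string, 0, len(gates))
	for _, g := range gates {
		if g != name {
			out = append(out, g)
		}
	}
	return out
}

// reconcileQuota models the fixed flow: record the grant, return early only
// when quotaErr is non-nil, and otherwise fall through to gate removal in
// the same pass instead of waiting for a second reconcile.
func reconcileQuota(inst *instance, granted bool, quotaErr error) error {
	if granted && inst.QuotaGranted != "True" {
		inst.QuotaGranted = "True" // stands in for the status update call
	}
	if quotaErr != nil {
		return quotaErr // keep the gate; the caller requeues with backoff
	}
	if granted {
		inst.SchedulingGates = removeGate(inst.SchedulingGates, quotaGate)
	}
	return nil
}

func main() {
	inst := &instance{SchedulingGates: []string{quotaGate}, QuotaGranted: "Unknown"}
	if err := reconcileQuota(inst, true, nil); err != nil {
		panic(err)
	}
	fmt.Printf("granted=%s gates=%v\n", inst.QuotaGranted, inst.SchedulingGates)
}
```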

The platform now stamps city-code, workload-name, workload-deployment-name, and placement-name directly onto Instances at creation time. The CLI can therefore resolve CITY/WORKLOAD/placement directly from those labels without performing cross-object joins. The prior approach keyed the WorkloadDeployment map on UID and looked up instances via WorkloadDeploymentUIDLabel. That UID is the edge/Karmada WD UID, which differs from the project-cluster WD UID, causing the join to fail across federation planes and producing "unknown"/"orphaned" output. The new label-first path reads CityCodeLabel, WorkloadNameLabel, PlacementNameLabel, and WorkloadDeploymentNameLabel (name is identical across all planes) before falling back to the WD Get/List join. A wdNameFromInstanceName helper strips the trailing ordinal suffix from the Instance name as a last-resort fallback for instances created before the labels existed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
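The last-resort fallback might look like the following. Only the helper name `wdNameFromInstanceName` comes from the commit message; the signature, the numeric-suffix check, and the second return value are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// wdNameFromInstanceName strips a trailing "-<ordinal>" suffix from an
// Instance name to recover the WorkloadDeployment name. It reports false
// when the name has no numeric ordinal suffix, so callers can skip
// unrecognised names instead of guessing.
func wdNameFromInstanceName(instanceName string) (string, bool) {
	i := strings.LastIndex(instanceName, "-")
	if i <= 0 || i == len(instanceName)-1 {
		return "", false
	}
	// The suffix must be a plain non-negative decimal integer.
	if _, err := strconv.ParseUint(instanceName[i+1:], 10, 32); err != nil {
		return "", false
	}
	return instanceName[:i], true
}

func main() {
	for _, name := range []string{"sre-gate-test-default-dfw-0", "standalone", "web-abc"} {
		wd, ok := wdNameFromInstanceName(name)
		fmt.Printf("%q -> %q (resolved=%v)\n", name, wd, ok)
	}
}
```

Name-based resolution works across planes because, as noted above, the WorkloadDeployment name is identical everywhere while the UID is not.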

The `compute deploy` rollout watcher reported PHASE=Done and exited within seconds of creating the workload, before any instances were scheduled. A WorkloadDeployment's Status.DesiredReplicas stays at zero until the controller first reconciles it, and computePhase treated zero desired as Done — so the very first poll of a fresh deployment looked complete. Resolve the wait target from the spec minimum while the controller has not yet reported a desired count, and require that no stale replicas remain before reporting Done so scale-downs and rolling updates aren't declared complete while old instances are still draining. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
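A reduced model of the corrected phase logic is shown below. `computePhase` and the "Done" phase are named in the commit; the `deploymentView` struct, the `waitTarget` helper, the field names, and the "Progressing" phase are hypothetical simplifications.

```go
package main

import "fmt"

// deploymentView is a hypothetical, flattened view of the fields the
// watcher needs from a WorkloadDeployment.
type deploymentView struct {
	SpecMinReplicas int32 // minimum replicas requested in the spec
	StatusDesired   int32 // stays zero until the controller first reconciles
	Ready           int32 // ready replicas on the current template
	Stale           int32 // replicas still on an old template or draining
}

// waitTarget falls back to the spec minimum while the controller has not
// yet reported a desired count.
func waitTarget(d deploymentView) int32 {
	if d.StatusDesired > 0 {
		return d.StatusDesired
	}
	return d.SpecMinReplicas
}

// computePhase reports Done only when the target is met and no stale
// replicas remain, so a fresh deployment or an in-flight roll is not
// declared complete prematurely.
func computePhase(d deploymentView) string {
	if d.Ready >= waitTarget(d) && d.Stale == 0 {
		return "Done"
	}
	return "Progressing"
}

func main() {
	fresh := deploymentView{SpecMinReplicas: 2}
	rolling := deploymentView{SpecMinReplicas: 2, StatusDesired: 2, Ready: 2, Stale: 1}
	settled := deploymentView{SpecMinReplicas: 2, StatusDesired: 2, Ready: 2}

	fmt.Println("fresh:  ", computePhase(fresh))
	fmt.Println("rolling:", computePhase(rolling))
	fmt.Println("settled:", computePhase(settled))
}
```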

…l-compute-plugin

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Consume the server-side status-blocking-reason contract: each resource's readiness condition (Instance/Ready, WorkloadDeployment/Available, Workload/Available) now carries a machine-readable reason and human message when not True.

- Add ReadinessBlock helper in util/conditions.go: given a condition list and type, returns (reason, message, blocked) with no per-reason branching; this is the single reusable entry-point for the new contract.
- InstanceStatus (list view): falls through to "Pending (<reason>)" from the Ready condition when no specific sub-condition check matches, replacing the bare "Pending" for unknown causes like SourceNotFound or ReferencedDataNotReady.
- InstanceStatusDetail (describe view): falls through to "Pending — <reason>" with the message as detail, replacing "Unknown" for those same causes.
- WorkloadHealth: surfaces the reason from Available when false, e.g. "Unavailable — SourceNotFound" instead of the generic message.
- degradedAnnotation (workloads describe per-city line): rewritten to read the WorkloadDeployment's own Available condition; removes the per-instance List fetch and the quota/InstanceStatusDetail special-casing that was its only logic.
- printBlockedDetail (rollout watch): rewritten to read the deployment's Available condition; removes the per-instance List fetch entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
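The helper's contract can be sketched as below. The name `ReadinessBlock` and its three return values come from the commit message; the local `Condition` struct is a stand-in for the real condition type, and treating an absent condition as not blocked is a choice made for this sketch, not something the source states.

```go
package main

import "fmt"

// Condition is a minimal stand-in for a Kubernetes-style status condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// ReadinessBlock looks up the condition of the given type and, when its
// status is anything other than "True", returns its reason and message
// with blocked=true. It does no per-reason branching. If the condition is
// absent it returns blocked=false so callers fall back to their default.
func ReadinessBlock(conds []Condition, condType string) (reason, message string, blocked bool) {
	for _, c := range conds {
		if c.Type != condType {
			continue
		}
		if c.Status == "True" {
			return "", "", false
		}
		return c.Reason, c.Message, true
	}
	return "", "", false
}

func main() {
	conds := []Condition{{
		Type:    "Ready",
		Status:  "False",
		Reason:  "SourceNotFound",
		Message: "referenced image source does not exist",
	}}
	if reason, msg, blocked := ReadinessBlock(conds, "Ready"); blocked {
		fmt.Printf("Pending (%s): %s\n", reason, msg)
	}
}
```

Checking for "not True" rather than "False" means an Unknown status is also reported as blocked.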

…rovisioning status The Programmed condition starts as Unknown (not False) while programming is in progress, so the ConditionFalse-only checks were bypassed and the raw ProgrammingInProgress reason leaked through the Ready condition fallback. Widen the checks to status != True to cover both Unknown and False states. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add three provider-emitted reason constants to the API types and map them to plain-English STATUS strings in the list and describe views:

- ImageUnavailable → Failed (image unavailable)
- InstanceCrashing → Failed (crashing)
- ConfigurationError → Failed (configuration error)

Rename the PendingProgramming/ProgrammingInProgress cases from the misleading "network provisioning" to "Starting", which accurately describes the transient state without implying network work is involved. Failed statuses are already counted in the "N Failed" summary line via the existing strings.HasPrefix(status, "Failed") check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
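The mapping reads naturally as a lookup table. In the sketch below, the reason names and STATUS strings are taken from the commit message, while `statusByReason`, `statusForReason`, `countFailed`, and the "Pending (<reason>)" fallback wiring are illustrative assumptions about how the CLI might organize it.

```go
package main

import (
	"fmt"
	"strings"
)

// statusByReason maps provider-emitted reasons to the plain-English STATUS
// strings listed in the commit message.
var statusByReason = map[string]string{
	"ImageUnavailable":      "Failed (image unavailable)",
	"InstanceCrashing":      "Failed (crashing)",
	"ConfigurationError":    "Failed (configuration error)",
	"PendingProgramming":    "Starting",
	"ProgrammingInProgress": "Starting",
}

// statusForReason returns the mapped STATUS, or a generic
// "Pending (<reason>)" for reasons without a dedicated mapping.
func statusForReason(reason string) string {
	if status, ok := statusByReason[reason]; ok {
		return status
	}
	return "Pending (" + reason + ")"
}

// countFailed counts statuses for the "N Failed" summary line using the
// same prefix check the commit message mentions.
func countFailed(statuses []string) int {
	n := 0
	for _, s := range statuses {
		if strings.HasPrefix(s, "Failed") {
			n++
		}
	}
	return n
}

func main() {
	reasons := []string{"ImageUnavailable", "ProgrammingInProgress", "SourceNotFound", "InstanceCrashing"}
	statuses := make([]string, 0, len(reasons))
	for _, r := range reasons {
		statuses = append(statuses, statusForReason(r))
	}
	fmt.Println(statuses)
	fmt.Println(countFailed(statuses), "Failed")
}
```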

scotwells · 2026-06-03T13:51:55Z

📋 Real-world UX issue from a user enabling compute

Heads up — we got a user report that surfaces a confusing first-run experience with the enablement flow, and I've traced it end-to-end via the staging audit logs. Sharing here since the fix touches this plugin.

What the user saw:

% datumctl compute instances list
Compute is not enabled for project "personal-project-153fe986".
Would you like to request access? [y/N]: y
Requesting access to compute for project "personal-project-153fe986"...
Error: requesting compute access: serviceentitlements.services.miloapis.com "compute" already exists

From their perspective this looks like a flat-out failure. In reality, their first attempt succeeded — compute was enabled.

What actually happened (from the audit trail):

First run created the entitlement successfully. ✅
But the backend takes a short while (~minutes in this case) to mark it Ready.
During that window, the CLI's "is compute enabled?" check keys off the entitlement's Ready status, not its existence — so it kept reporting "not enabled" and re-offering to request access.
Each retry tried to create the entitlement again and hit a 409 already exists, which we surfaced as a raw, scary error.

Why it matters for the product: the very first thing a new user does is turn compute on, and today that happy path can look broken even when it worked. The error message also leaks an internal resource name (serviceentitlements.services.miloapis.com) that means nothing to a user.

Proposed fix (branch fix/compute-entitlement-pending-state, built off this PR's branch): teach the enablement check to distinguish three states instead of two —

not requested → offer to request access (today's behavior)
requested but still activating → tell the user it's in progress and to try again in a moment (no re-prompt, no error)
active → proceed

…and treat a 409 already exists as "already requested, activation pending" rather than a fatal error. Net result: the user sees a calm "enablement in progress, hang tight" message instead of a stack of confusing failures.

Happy to fold this into this PR or send it as a follow-up — whichever you prefer. 🙏

`compute restart` stamped the non-canonical kubectl.kubernetes.io/restartedAt annotation on the workload/deployment template. Use the documented RestartedAtAnnotation (compute.datumapis.com/restartedAt) instead, matching the controller's restart contract. Both keys change the template hash, but the canonical one is the documented trigger for the ordered instance roll. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Surface rolling-update / restart progress in `datumctl compute workloads` by showing updated/desired replica counts next to ready. UP-TO-DATE counts instances on the latest template revision (status.updatedReplicas), so a roll is visible as the count dips below desired and then recovers. Includes a byte-identical copy of the UpdatedReplicas/ObservedGeneration WorkloadDeployment status fields in api/v1alpha so the plugin can read them. These fields are defined identically on the controller branch (PR #129); the duplicate resolves cleanly once both land on main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Renames the Instance "Running" status condition to "Available" (wire value "Available") across the API types, controller, and CLI. An instance can be available while not actively running a pod (e.g. scaled to zero), so "Running" was a misleading serving/health signal. API/controller: same constant renames as the backend branch (InstanceRunning -> InstanceAvailable, InstanceRunningReason* -> InstanceAvailableReason*, InstanceReadyReasonRunning -> InstanceReadyReasonAvailable) plus the kubebuilder default marker and regenerated Instance CRD. CLI: the derived status now reports availability, never live runtime state. `Ready=True` displays "Available" (was "Running"), failure details read "Not available — …" (was "Not running — …"), and the Available-condition-derived "Starting"/"Stopping" liveness states are dropped — the CLI no longer indicates whether a process is actively running at this instant. IsRunning -> IsAvailable. BREAKING CHANGE: the on-the-wire Instance condition type changes from "Running" to "Available". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scotwells · 2026-06-04T00:59:47Z

Context: should `rollout` become a verb group?

Capturing the context so we can make this call deliberately — the decision is open, including doing it before this PR merges. Not asserting a deferral.

Today restart and rollout are flat siblings under compute, and rollout is a status-watcher (≈ kubectl rollout status). As we add more rollout lifecycle operations there's a natural pull toward the kubectl mental model, where rollout is a group of verbs:

compute rollout restart    # what `compute restart` does today
compute rollout status     # what `compute rollout` does today
compute rollout undo       # rollback
compute rollout history    # revisions
compute rollout pause / resume

The tradeoff

Moving to the grouped form is a breaking change to the surface (compute restart → compute rollout restart). Cheapest to do before this PR merges / before the plugin has real adoption — every release we wait raises the cost of moving users' muscle memory, scripts, and docs.
With only two rollout verbs today (restart + status), the grouping is mostly cosmetic; its clear payoff lands once we ship undo / history (revision tracking) and can design the whole verb set + shared flags (--to-revision, --watch, …) in one coherent pass.
So it's a real "now vs later" call: now = pay the design cost early but lock in the clean structure before adoption; later = avoid churn until the verbs justify it, at the cost of a harder migration.

Open design question — where does history come from?
rollout history and undo both depend on revision history of the resource, and we should decide where that lives before committing to these verbs:

Compute-specific — compute tracks its own Workload/template revisions (à la Deployment → ReplicaSet), owned and stored by this service; or
Generic platform capability — a shared resource-history / revisioning / audit-trail primitive that any resource type plugs into.

If the platform offers (or should offer) generic resource history, compute's rollout {history,undo} should be a thin view over that primitive, not a bespoke revision store — and that also shapes whether these verbs stay under compute or surface more generically across the CLI. This is worth resolving early because it drives both the data model behind rollout and the command surface we'd be locking in.

A signal we're already drifting toward grouping
workloads describe → "Next steps" already advertises datumctl compute rollout undo <wl>, which doesn't exist yet. Our own copy is implicitly assuming the grouped model.

If we do it now, keep restart as a hidden alias of rollout restart for a release or two so we don't break early scripts. If we defer, revisit when we pick up rollout history/undo.

scotwells mentioned this pull request May 22, 2026

Launch Datum Compute Service datum-cloud/enhancements#682

Open

scotwells and others added 28 commits May 26, 2026 15:04

feat: replace cert-manager certificate resources with CSI volume moun…

0f69956

…ts for webhook TLS Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: remove webhook CA injection — Milo trusts the cert issuer directly

a11861e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci: bump Go version to 1.25 to match go.mod requirement

81e73c3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci: bump golangci-lint to v2.2.2 for Go 1.25 compatibility

bed3d12

v2.1.5 was built with Go 1.24 and refuses to lint Go 1.25 modules. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci: bump golangci-lint to v2.12.2 (latest, built with Go 1.25)

0d26598

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: skip upstream write-back when spec, labels, and status are uncha…

3ac5115

…nged Use apiequality.Semantic.DeepEqual to avoid unnecessary API calls to Karmada on every reconcile when nothing has actually changed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor: rename writeBackToDownstream -> writeBackToUpstream

c15161e

From the cell cluster's perspective, Karmada is upstream. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

scotwells mentioned this pull request May 29, 2026

chore: remove milo v0.25.2 replace pin once service-catalog compatibility is resolved #123

Open

scotwells and others added 11 commits May 29, 2026 06:39

Merge pull request #125 from datum-cloud/feat/sandbox-container-comma…

fa711b9

…nd-args feat(api): add Command and Args fields to SandboxContainer

Merge branch 'feat/federated-deployment-scheduling' into feat/datumct…

f5a25aa

…l-compute-plugin

chore: ignore goreleaser dist output and local plugin binary

685e353

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scotwells force-pushed the feat/datumctl-compute-plugin branch from 8bc1efb to 685e353 Compare June 1, 2026 21:23

scotwells mentioned this pull request May 26, 2026

docs: propose datumctl compute developer experience #111

Merged

scotwells and others added 3 commits June 1, 2026 20:10

scotwells and others added 3 commits June 3, 2026 19:04

scotwells mentioned this pull request Jun 4, 2026

Simpler, more reliable webhook TLS via a cert-manager CSI mount #141

Merged

scotwells force-pushed the feat/federated-deployment-scheduling branch 7 times, most recently from b45810f to 73177eb Compare June 5, 2026 18:38

scotwells wants to merge 91 commits into feat/federated-deployment-scheduling from feat/datumctl-compute-plugin
