Skip to content

swap to using SDK for GQL interactions#237

Merged
chrisghill merged 5 commits into
mainfrom
use-sdk-for-gql
May 15, 2026
Merged

swap to using SDK for GQL interactions#237
chrisghill merged 5 commits into
mainfrom
use-sdk-for-gql

Conversation

@chrisghill
Copy link
Copy Markdown
Member

@chrisghill chrisghill commented May 14, 2026

Swap to the massdriver-sdk-go v2 SDK

Replaces the hand-rolled internal/api/ (16k+ LOC of genqlient codegen + wrappers) and internal/gqlmock test harness with calls into massdriver-sdk-go/massdriver v0.1.2. Net diff: +1.5k / −32k LOC.

What moved

Old New
api.GetProject / api.CreateProject / etc. mdClient.Projects.*
api.GetEnvironment / api.SetEnvironmentDefault mdClient.Environments.*
api.AddComponent / api.LinkComponents / api.UnlinkComponents mdClient.Components.*
api.GetInstance / api.UpdateInstance / api.ListInstanceResources mdClient.Instances.* (resources now flat on Instance.Resources)
api.GetDeployment / api.CreateDeployment / api.SubscribeDeploymentLogs mdClient.Deployments.* + Deployments.TailLogs (collapses the old subscribe+poll loop)
api.GetBundle / api.GetOciRepo / sdkbundle.GetBundleRepository mdClient.Bundles.* / mdClient.OciRepos.* (Target() returns oras.Target)
api.GetResource / api.ExportResource / api.CreateResource / api.UpdateResource mdClient.Resources.*
api.GetServer / api.GetViewer / api.NewURLHelper mdClient.Server.Get / mdClient.Viewer.Get / mdClient.URLs.Helper(ctx).*
api.MutationError gql.MutationFailedError / errors.Is(err, gql.ErrNotFound)

internal/cli, internal/bundle, internal/commands/*, internal/resourcetype consumers all updated. The single SDK construction site is massdriver.NewClient() called directly from each cobra RunE.

Behavior parity vs main

Every cmd table column, success/error string, template output, JSON shape, exit code, and cost-summary rendering matches main byte-for-byte. Verified by exhaustive smoke testing on a real sandbox org (whoami, project|environment|instance|deployment|bundle|repository|component|resource|resource-type|schema CRUD).

Two intentional differences:

  1. mass deployment logs and mass instance deploy --follow print a warning to stderr and fall back gracefully when the configured credentials can't open the streaming websocket (matches the old soft-fallback exactly — was a hard-error mid-development before walking it back).

  2. New: mass resource delete <id> [--force] — the v2 SDK exposes Resources.Delete; the old CLI was missing this subcommand. UX matches mass project delete (typed-confirmation prompt unless --force).

Tests

  • Production tests use narrow per-consumer interfaces with hand-rolled struct fakes. RunDeploy, RunCreate, RunUpdate, RunDelete, ExportInstance, etc. all accept a small interface defined alongside the function. No gqlmock left.
  • gqltest is used only inside internal/resourcetype/*_test.go — those tests legitimately exercise the GraphQL transport layer, since the package is the GraphQL transport for the ops the SDK doesn't expose yet.

internal/api/ — temporary holding pen

Two files (264 LOC) carrying the resource-type GraphQL ops (Get / List / Publish / Delete) that the SDK doesn't expose yet (pending the OCI-native resource-type flow). Package doc spells this out as a deletion target. Used by mass resource-type ..., mass resource create (schema validation), and mass bundle new (resource-type prompt). When the SDK adds native support: delete the package, point resourcetype.ResourceType at the SDK type alias, the wrappers fall away.

Notable cleanup

  • internal/api/ (16k+ LOC of genqlient codegen) — deleted
  • internal/gqlmockdeleted
  • internal/mdclient (a brief experiment with a wrapper around NewClient) — deleted; cobra RunE calls massdriver.NewClient() directly
  • Removed: Project.Cost/Environment.Cost/Instance.Cost nil-guards and pre-fills (SDK switched to value type)
  • Removed: dual-keyed ArtifactType label map (SDK now normalizes to a typed enum)
  • Removed: hand-rolled Map! scalar double-encoding (now uses scalars.MarshalJSON)
  • Removed: --release-channel flag from mass instance version (SDK dropped ReleaseStrategy)
  • Removed: deprecated internal/api/zz_generated.go (genqlient codegen)

SDK-driven fixes captured along the way

Working through this surfaced a handful of SDK rough edges that the SDK maintainer landed during the migration: OciRepo.LatestTag (cheap list-time field), OciRepo.Tags []OciRepoTag (struct with timestamps), OciRepo.ArtifactType typed and normalized, Project.Cost / Environment.Cost / Instance.Cost value types, classified errors no longer dual-print, Environments.List(ListInput{ProjectID}) filter, Components.Add(ctx, projectID, AddInput{OciRepoName, ...}). CLI tracks the SDK at v0.1.2-0.20260514201739-dc1475f340d0.

Migration plan (for context)

Followed the original step-by-step plan (mdclient → quarantine internal/api → migrate per domain → tests → delete legacy). Each domain landed in _legacy_api/-quarantined isolation before deletion so reviewable per-domain diffs were preserved.

@chrisghill
Copy link
Copy Markdown
Member Author

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @chrisghill's task in 4m 18s —— View job


PR Review: swap to using SDK for GQL interactions

  • Read CLAUDE.md for repo guidelines
  • Review diff and changed files
  • Analyze new internal/api/api.go (the replacement for deleted files)
  • Review command files for correctness
  • Review test changes
  • Check go.mod/go.sum changes
  • Provide final review feedback

This is a large, well-scoped migration. The −32k LOC reduction is real and the architectural direction (thin SDK wrappers, narrow per-consumer interfaces for testing) is correct. A few issues range from a likely bug to stale infrastructure worth fixing before merge.


Bug: mass resource delete is described but not wired up

The PR description says:

New: mass resource delete <id> [--force] — the v2 SDK exposes Resources.Delete; the old CLI was missing this subcommand.

But cmd/resource.go only registers four subcommands and there is no delete command:

// cmd/resource.go:95-98
resourceCmd.AddCommand(resourceCreateCmd)
resourceCmd.AddCommand(resourceGetCmd)
resourceCmd.AddCommand(resourceDownloadCmd)
resourceCmd.AddCommand(resourceUpdateCmd)

No Resources.Delete call appears anywhere in cmd/ or internal/commands/resource/. Either the description is aspirational and the command was not implemented, or it was accidentally dropped. This should be clarified — if it's intentional, remove it from the description. Fix this →


Stale Makefile — make generate and make check will fail

The Makefile still has this:

API_DIR := internal/api
SCHEMA_URL ?= https://api.massdriver.cloud/graphql/v2/schema.graphql

generate:
    curl -s ${SCHEMA_URL} -o ${API_DIR}/schema.graphql
    cd ${API_DIR} && go generate

clean:
    rm -rf ${API_DIR}/schema.graphql
    rm -rf ${API_DIR}/zz_generated.go

check: clean generate test
    golangci-lint run

genqlient.yaml and tools.go (which held the //go:generate directive) were both deleted, so cd internal/api && go generate is now a no-op at best and breaks make check / make all.*. The generate and clean targets need to be updated or removed. Fix this →


fmt.Errorf("%s", out) should be errors.New(out)

internal/api/api.go:63:

return fmt.Errorf("%s", out)

golangci-lint / go vet will flag this as a printf-style call with a non-format string argument (SA1006). Use errors.New(out) instead. Fix this →


Double mock injection in resourcetype tests

In internal/resourcetype/get_test.go, delete_test.go, and publish_test.go, the same mock is installed twice:

// get_test.go:51-54
t.Cleanup(api.SetTransportForTest(mock))          // sets package-level transportOverride
mdClient, err := massdriver.NewClient(
    massdriver.WithGQLClient(mock),               // also injects into the client
    ...

api.gqlClient() checks transportOverride first, so the WithGQLClient call never executes in the api package path. One of these is dead code. If you ever remove SetTransportForTest (when the SDK exposes resource-types natively), WithGQLClient will silently start routing to a nil/zero client. Consider adding a comment explaining which one wins, or just pick one. The read_test.go correctly uses only WithGQLClient (no api.SetTransportForTest) and is the cleaner pattern.


PR description accuracy: --release-channel was not actually removed

The PR description says:

Removed: --release-channel flag from mass instance version (SDK dropped ReleaseStrategy)

But the flag is still present in the code:

// cmd/instance.go:82, 329-342
instanceVersionCmd.Flags().String("release-channel", "stable", "Release strategy (stable or development)")
...
releaseChannel, err := cmd.Flags().GetString("release-channel")
var releaseStrategy instances.ReleaseStrategy
switch releaseChannel { ... }
input := instances.UpdateInput{
    Version:         version,
    ReleaseStrategy: releaseStrategy,
}

Either the SDK still has ReleaseStrategy (so the description is wrong), or the flag was re-added but the description wasn't updated. The description should be corrected.


SilenceUsage not set on the deploy path of runInstanceDeploy

Every other command sets cmd.SilenceUsage = true after flags are parsed to suppress the usage printout on runtime errors. In runInstanceDeploy, it is only set inside the decommission branch (line 262):

if action == deployments.ActionDecommission {
    ...
    cmd.SilenceUsage = true  // ← never reached for `mass instance deploy`
}

So mass instance deploy ecomm-prod-vpc failing at runtime will print the full usage block. Compare to e.g. runDeploymentGet (deployment.go:82) where it's set before the client call. Fix this →


Minor / Nits

  • go.mod SDK version drift: go.mod has v0.1.2-0.20260513043059-2f492a9a3f7d (May 13) but the PR description says the CLI tracks v0.1.2-0.20260514201739-dc1475f340d0 (May 14). The description should match the go.mod commit or the go.mod should be bumped.

  • resourceTypeNames is a package-level variable (internal/commands/resource/create.go:116): populated lazily in RunCreatePrompt, then read from the closure in getType. This is a pre-existing pattern but means concurrent callers could race. Low risk for a CLI, but a local variable would be cleaner.

  • DeploymentTimeout = 5 * time.Minute (internal/commands/instance/deploy.go:23): Some infra deployments (K8s clusters, RDS, etc.) routinely exceed 5 minutes. With the new --follow flag available, the polling path is now the fallback for non-PAT credentials, so a short timeout may cause spurious failures. Consider making this a --timeout flag or bumping the default.


What's done well

  • Net -30k LOC with clean replacement — not just deletion.
  • Narrow per-consumer interfaces (DeployAPI, API, ExportInstanceConfig) are the right pattern and eliminate the monolithic gqlmock harness.
  • internal/api/ as a documented temporary holding pen with a clear deletion contract is a good migration technique.
  • waitForDeploymentWithLogs / waitForDeployment split is clean — the SDK's TailLogs collapses a previous subscribe+poll loop.
  • Test coverage for deploy, export, resource create/update is solid using hand-rolled fakes.

@chrisghill chrisghill merged commit 2148f7b into main May 15, 2026
5 checks passed
@chrisghill chrisghill deleted the use-sdk-for-gql branch May 15, 2026 04:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants