
A-1320 retry cache commands #3963

Merged: buildkate merged 1 commit into main from kates/A-1320-retry-cache-commands on May 28, 2026

Contributor

@buildkate buildkate commented May 28, 2026

Description

Adds retry coverage to the cache CLI commands (cache restore and cache save), matching the consolidated retry pattern from api/retryable.go and sibling commands. Each API call is wrapped individually with roko.NewRetrier(...).DoWithContext(...) + api.BreakOnNonRetryable, so transient 429/5xx/network errors retry with backoff while non-retryable 4xx errors break immediately.

Context

Changes

  • All five cache API methods now return *api.Response (positioned second-to-last to match the sibling convention in api/) so callers can drive BreakOnNonRetryable decisions; cacheAPI interface and mockAPIClient updated in lockstep.
  • Wrapped each individual cache API call in internal/cache/restore.go and internal/cache/save.go with retriers using the meta_data_get config (WithMaxAttempts(10), Constant(5*time.Second), WithJitter()).
  • Verified CacheEntryCreate and CacheEntryCommit are retry-safe by reading the server-side code in buildkite/buildkite; short comments at each wrap site cite the verified server behaviour.
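The per-call wrapping listed above can be pictured with a standard-library-only sketch. It deliberately does not use roko or the real api package: `response`, `retryable`, `doWithRetry`, and `flaky` are hypothetical stand-ins, and the classification (retry on 429 and 5xx, or on an error with no response; stop on anything else) is an assumption drawn from the description, not the actual behaviour of api.BreakOnNonRetryable.

```go
package main

import (
	"fmt"
	"time"
)

// response is a hypothetical stand-in for *api.Response; only the status matters here.
type response struct{ StatusCode int }

// retryable is an assumed classification: 429 and 5xx are transient, and an
// error with no response at all is treated as a network failure.
func retryable(resp *response, err error) bool {
	if resp == nil {
		return err != nil
	}
	return resp.StatusCode == 429 || resp.StatusCode >= 500
}

// doWithRetry calls fn up to maxAttempts times, sleeping delay between
// attempts, and stops early on success or on a non-retryable failure.
// It returns the number of attempts made and the last error.
func doWithRetry(maxAttempts int, delay time.Duration, fn func() (*response, error)) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		var resp *response
		resp, err = fn()
		if err == nil {
			return attempt, nil
		}
		if !retryable(resp, err) {
			return attempt, err
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
		}
	}
	return maxAttempts, err
}

// flaky returns a call that fails with the given statuses in order, then succeeds.
func flaky(statuses ...int) func() (*response, error) {
	i := 0
	return func() (*response, error) {
		if i < len(statuses) {
			s := statuses[i]
			i++
			return &response{StatusCode: s}, fmt.Errorf("unexpected status %d", s)
		}
		return &response{StatusCode: 200}, nil
	}
}

func main() {
	n, err := doWithRetry(5, time.Millisecond, flaky(503, 429))
	fmt.Println(n, err) // two transient failures, then success on the third attempt
	n, err = doWithRetry(5, time.Millisecond, flaky(404))
	fmt.Println(n, err) // a 404 stops after the first attempt
}
```

Returning the response alongside the error is what lets the loop tell a transient failure from a permanent one, which is the role the description gives to the new *api.Response return value.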

Key decisions

  • Retry lives at the per-API-call boundary inside internal/cache, not at the CLI or as one outer wrap around Save/Restore — outer-wrapping would replay already-succeeded caches in the fan-out and redo expensive non-API work (archive build, upload) when only the final API call failed.
  • Closure-capture pattern over roko.DoFunc[N] — matches the 28 other retry call sites in the codebase and sidesteps the DoFunc3 arity ceiling.
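A rough illustration of the two decisions above, using only the standard library. Everything here is hypothetical (`retry`, `fakeAPI`, `createEntry`, `save`); it is a sketch of the general closure-capture idea, not the code in internal/cache. The expensive step runs once outside the retried closure, and the API call's multiple results are assigned to variables declared outside it.

```go
package main

import (
	"errors"
	"fmt"
)

// retry is a hypothetical stand-in for a retrier: it runs fn until it returns
// nil or attempts run out. A real retrier would also back off between tries.
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// fakeAPI simulates a call with three results that fails on its first try.
type fakeAPI struct{ calls int }

func (a *fakeAPI) createEntry(key string) (uploadID string, url string, err error) {
	a.calls++
	if a.calls == 1 {
		return "", "", errors.New("transient failure")
	}
	return "upload-" + key, "https://example.invalid/" + key, nil
}

// save does the expensive work once, then retries only the API call. The
// call's results are captured by assigning to variables declared outside the
// closure, so the retried function itself only needs to return an error.
func save(api *fakeAPI, key string, builds *int) (string, string, error) {
	*builds++ // stands in for archive build/upload; not repeated on retry

	var uploadID, url string
	err := retry(3, func() error {
		var callErr error
		uploadID, url, callErr = api.createEntry(key)
		return callErr
	})
	return uploadID, url, err
}

func main() {
	api, builds := &fakeAPI{}, 0
	id, url, err := save(api, "deps", &builds)
	fmt.Println(id, url, err, "api calls:", api.calls, "archive builds:", builds)
}
```

Because the closure returns only an error, the same shape works for calls with any number of results, which is the advantage over arity-specific helpers that the second bullet alludes to.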

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Disclosures / Credits

Used opencode (Claude) to walk the codebase, evaluate retry-scoping alternatives, verify server-side idempotency, and write the implementation. All decisions and final code reviewed by me.

@buildkate buildkate requested review from a team as code owners May 28, 2026 02:07

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: abfa38bf81

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread api/cache_test.go Outdated
@buildkate buildkate force-pushed the kates/A-1320-retry-cache-commands branch 2 times, most recently from 9b7a2dd to d73c3a9 Compare May 28, 2026 02:25
@buildkate buildkate requested a review from buildsworth-bk May 28, 2026 02:25
@buildkate buildkate added the internal Non-user facing, internal change. label May 28, 2026

@buildsworth-bk-app buildsworth-bk-app Bot left a comment

Per-call retry boundaries with api.BreakOnNonRetryable match the meta_data_get / agent_pause pattern, and I traced 429 / 5xx / 4xx / cache-miss paths through cacheDo and interpretCacheResponse — the (apiResp, err) plumbing flips break/retry correctly in every case I worked through. The retry-safety claims on CacheEntryCreate (fresh upload_uuid per call) and CacheEntryCommit (unconditional overwrite) rest on the server-side review noted in the PR description; I couldn't independently verify those, but the inline comments at each wrap site make the assumption visible to future readers.

One naming nit inline. No risk: label set on this PR.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:

Download the buildsworth logs from build 326, then answer my questions about the findings.

Comment thread internal/cache/save.go
@zhming0 zhming0 changed the title from "kates/A 1320 retry cache commands" to "A-1320 retry cache commands" May 28, 2026
Contributor

@zhming0 zhming0 left a comment

Looks good to me, with one suggestion: use an exponential retry strategy instead.

Comment thread internal/cache/restore.go Outdated

err = roko.NewRetrier(
roko.WithMaxAttempts(10),
roko.WithStrategy(roko.Constant(5*time.Second)),
Contributor

Would it be better to use exponential instead? Cache restore is latency-sensitive; a 5s cache restore might as well be considered a cache miss 😬.

Contributor Author

I lowered the retry count to 5 and applied an exponential strategy, so the maximum of 5 retries will take ~3.5 seconds.

Also applied to the API calls in save for consistency.
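The latency trade-off in this thread comes down to simple arithmetic, sketched below. The helper names (`worstCaseWait`, `constant`, `doubling`) and the 250ms starting delay are illustrative assumptions, not the parameters used in the PR; the sketch also ignores jitter and assumes no wait after the final attempt.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseWait sums the delays between attempts when every attempt fails.
// With n attempts there are n-1 waits; delay(i) is the wait after attempt i.
func worstCaseWait(attempts int, delay func(attempt int) time.Duration) time.Duration {
	var total time.Duration
	for i := 1; i < attempts; i++ {
		total += delay(i)
	}
	return total
}

// constant waits the same interval after every failed attempt.
func constant(d time.Duration) func(int) time.Duration {
	return func(int) time.Duration { return d }
}

// doubling waits initial after the first failure and doubles after each later one.
func doubling(initial time.Duration) func(int) time.Duration {
	return func(attempt int) time.Duration { return initial << (attempt - 1) }
}

func main() {
	// 10 attempts at a constant 5s: nine waits.
	fmt.Println(worstCaseWait(10, constant(5*time.Second))) // 45s
	// 5 attempts doubling from a hypothetical 250ms: 250ms + 500ms + 1s + 2s.
	fmt.Println(worstCaseWait(5, doubling(250*time.Millisecond))) // 3.75s
}
```

Under these assumptions the constant schedule can block a restore for most of a minute in the worst case, while a short exponential schedule stays within a few seconds, at the cost of giving up sooner during a longer outage.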

Contributor

@zhming0 zhming0 May 28, 2026

[non-blocking] I don't have hard evidence against a 3.5s total wait time, but it's worth keeping in mind that a green build is better than a red build.

A red build usually results in support tickets.

@buildkate buildkate force-pushed the kates/A-1320-retry-cache-commands branch from 4c90e53 to e171d5c Compare May 28, 2026 03:50
@buildkate buildkate enabled auto-merge May 28, 2026 03:51

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e171d5c5f2

Comment thread api/cache.go
  body, err := io.ReadAll(httpResp.Body)
  if err != nil {
- 	return httpResp, fmt.Errorf("failed to read response body: %w", err)
+ 	return apiResp, fmt.Errorf("failed to read response body: %w", err)

[P2] Don't mark mid-body EOFs non-retryable

In the new cache retry paths, this returns a non-nil response even when reading the response body fails. If the cache API sends headers (for example a 200) and then the connection drops or truncates the body, the error can wrap EOF/connection reset, but BreakOnNonRetryable checks the response status first; because 200 is not 429/5xx it calls Break() and the retrier gives up after the first attempt. That skips the intended retry for transient EOF/connection errors in this scenario, so the body transport error should be classified without the successful response status blocking IsRetryableError.
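The ordering issue raised in this comment can be shown in isolation. The sketch below is standard-library only and entirely hypothetical: `errBodyRead`, `transient`, `statusFirst`, and `errorFirst` are invented names, a status of 0 stands for "no response", and `errorFirst` is just one possible way to address the concern, not something the PR or the reviewer prescribes.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errBodyRead is a hypothetical sentinel marking a failure that happened while
// reading the response body, after the status line had already arrived.
var errBodyRead = errors.New("failed to read response body")

// transient reports whether a status code is worth retrying (assumed: 429, 5xx).
func transient(status int) bool { return status == 429 || status >= 500 }

// statusFirst mirrors the behaviour the review describes: when a response is
// present (status != 0), only its status decides, so a 200 with a truncated
// body is not retried.
func statusFirst(status int, err error) bool {
	if status != 0 {
		return transient(status)
	}
	return err != nil
}

// errorFirst is one possible alternative: a body-read failure is treated as a
// transport error and retried regardless of the status that preceded it.
func errorFirst(status int, err error) bool {
	if errors.Is(err, errBodyRead) {
		return true
	}
	return statusFirst(status, err)
}

func main() {
	// Headers said 200, then the connection dropped mid-body.
	err := fmt.Errorf("%w: %v", errBodyRead, io.ErrUnexpectedEOF)
	fmt.Println(statusFirst(200, err))                    // false
	fmt.Println(errorFirst(200, err))                     // true
	fmt.Println(errorFirst(404, errors.New("not found"))) // false
}
```

Another option with the same effect would be to return a nil response whenever the body read fails, so that a status-first classifier falls through to the error check on its own.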

@buildkate buildkate merged commit 64a29d0 into main May 28, 2026
4 checks passed
@buildkate buildkate deleted the kates/A-1320-retry-cache-commands branch May 28, 2026 03:58