A-1320 retry cache commands by buildkate · Pull Request #3963 · buildkite/agent

buildkate · 2026-05-28T02:07:06Z

Description

Adds retry coverage to the cache CLI commands (cache restore and cache save), matching the consolidated retry pattern from api/retryable.go and sibling commands. Each API call is wrapped individually with roko.NewRetrier(...).DoWithContext(...) + api.BreakOnNonRetryable so transient 429/5xx/network errors retry with backoff while non-retryable 4xx breaks immediately.

Context

Linear: A-1320

Changes

All five cache API methods now return *api.Response (positioned second-to-last to match the sibling convention in api/) so callers can drive BreakOnNonRetryable decisions; cacheAPI interface and mockAPIClient updated in lockstep.
Wrapped each individual cache API call in internal/cache/restore.go and internal/cache/save.go with retriers using the meta_data_get config (WithMaxAttempts(10), Constant(5*time.Second), WithJitter()).
Verified CacheEntryCreate and CacheEntryCommit are retry-safe by reading the server-side code in buildkite/buildkite; short comments at each wrap site cite the verified server behaviour.

Key decisions

Retry lives at the per-API-call boundary inside internal/cache, not at the CLI or as one outer wrap around Save/Restore — outer-wrapping would replay already-succeeded caches in the fan-out and redo expensive non-API work (archive build, upload) when only the final API call failed.
Closure-capture pattern over roko.DoFunc[N] — matches the 28 other retry call sites in the codebase and sidesteps the DoFunc3 arity ceiling.
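The scoping argument above can be made concrete with a small counting sketch. Everything here is hypothetical (the `counters` type, the three-step save, and the `retry` helper are not the agent's code); it only shows how often each step runs when the final API call fails once, under an outer wrap versus a per-call wrap.

```go
package main

import (
	"errors"
	"fmt"
)

// counters records how often each step of a hypothetical save ran.
type counters struct{ archive, upload, commit int }

// retry runs fn up to attempts times and returns the last error.
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// newSteps returns the steps of a save. The commit step fails on its first
// invocation and succeeds afterwards, simulating one transient API error.
func newSteps(c *counters) (archive, upload, commit func() error) {
	archive = func() error { c.archive++; return nil }
	upload = func() error { c.upload++; return nil }
	commit = func() error {
		c.commit++
		if c.commit == 1 {
			return errors.New("503 from commit")
		}
		return nil
	}
	return archive, upload, commit
}

// saveOuter wraps the whole save in a single retrier, so a commit failure
// replays the archive build and the upload as well.
func saveOuter() counters {
	var c counters
	archive, upload, commit := newSteps(&c)
	_ = retry(3, func() error {
		if err := archive(); err != nil {
			return err
		}
		if err := upload(); err != nil {
			return err
		}
		return commit()
	})
	return c
}

// savePerCall retries only the API call that failed. The closure passed to
// retry captures its state from the enclosing scope.
func savePerCall() counters {
	var c counters
	archive, upload, commit := newSteps(&c)
	_ = archive()
	_ = upload()
	_ = retry(3, commit)
	return c
}

func main() {
	fmt.Printf("outer:    %+v\n", saveOuter())   // outer:    {archive:2 upload:2 commit:2}
	fmt.Printf("per-call: %+v\n", savePerCall()) // per-call: {archive:1 upload:1 commit:2}
}
```

With the outer wrap the expensive steps run twice; with the per-call wrap only the commit is repeated.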

Testing

Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
Code is formatted (with go tool gofumpt -extra -w .)

Disclosures / Credits

Used opencode (Claude) to walk the codebase, evaluate retry-scoping alternatives, verify server-side idempotency, and write the implementation. All decisions and final code reviewed by me.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: abfa38bf81

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

buildsworth-bk-app

Per-call retry boundaries with api.BreakOnNonRetryable match the meta_data_get / agent_pause pattern, and I traced 429 / 5xx / 4xx / cache-miss paths through cacheDo and interpretCacheResponse — the (apiResp, err) plumbing flips break/retry correctly in every case I worked through. The retry-safety claims on CacheEntryCreate (fresh upload_uuid per call) and CacheEntryCommit (unconditional overwrite) rest on the server-side review noted in the PR description; I couldn't independently verify those, but the inline comments at each wrap site make the assumption visible to future readers.

One naming nit inline. No risk: label set on this PR.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 326, then answer my questions about the findings.

zhming0

Looks good to me, with one suggestion: use an exponential retry strategy instead.

zhming0 · 2026-05-28T03:31:59Z

+
+	err = roko.NewRetrier(
+		roko.WithMaxAttempts(10),
+		roko.WithStrategy(roko.Constant(5*time.Second)),


Would exponential backoff be a better fit here? Cache restore is latency-sensitive; a restore that takes 5s might as well be considered a cache miss 😬.

buildkate

I lowered the retry count to 5 and applied an exponential strategy, so the maximum of 5 retries adds up to ~3.5 seconds.

I also applied this to the API calls in save for consistency.

zhming0

[non-blocking] I don't have hard evidence against a 3.5s total wait time, but it's worth keeping in mind that a green build is better than a red one.

A red build usually results in support tickets.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e171d5c5f2


chatgpt-codex-connector · 2026-05-28T03:56:17Z

 	body, err := io.ReadAll(httpResp.Body)
 	if err != nil {
-		return httpResp, fmt.Errorf("failed to read response body: %w", err)
+		return apiResp, fmt.Errorf("failed to read response body: %w", err)


Don't mark mid-body EOFs non-retryable

In the new cache retry paths, this returns a non-nil response even when reading the response body fails. If the cache API sends headers (for example a 200) and then the connection drops or truncates the body, the error can wrap EOF/connection reset, but BreakOnNonRetryable checks the response status first; because 200 is not 429/5xx it calls Break() and the retrier gives up after the first attempt. That skips the intended retry for transient EOF/connection errors in this scenario, so the body transport error should be classified without the successful response status blocking IsRetryableError.


buildkate requested review from a team as code owners May 28, 2026 02:07

chatgpt-codex-connector Bot reviewed May 28, 2026

Comment thread api/cache_test.go Outdated

buildkate force-pushed the kates/A-1320-retry-cache-commands branch 2 times, most recently from 9b7a2dd to d73c3a9 Compare May 28, 2026 02:25

buildkate requested a review from buildsworth-bk May 28, 2026 02:25

buildkate added the internal Non-user facing, internal change. label May 28, 2026

buildsworth-bk-app Bot reviewed May 28, 2026

Comment thread internal/cache/save.go

zhming0 changed the title ~~kates/A 1320 retry cache commands~~ A-1320 retry cache commands May 28, 2026

zhming0 approved these changes May 28, 2026

chore(cache): wrap save and restore api calls in retryer

e171d5c

buildkate force-pushed the kates/A-1320-retry-cache-commands branch from 4c90e53 to e171d5c Compare May 28, 2026 03:50

buildkate enabled auto-merge May 28, 2026 03:51

chatgpt-codex-connector Bot reviewed May 28, 2026

buildkate merged commit 64a29d0 into main May 28, 2026
4 checks passed

buildkate deleted the kates/A-1320-retry-cache-commands branch May 28, 2026 03:58

buildkate merged 1 commit into main from kates/A-1320-retry-cache-commands
