A-1320 retry cache commands#3963
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: abfa38bf81
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
9b7a2dd to
d73c3a9
Compare
There was a problem hiding this comment.
Per-call retry boundaries with api.BreakOnNonRetryable match the meta_data_get / agent_pause pattern, and I traced 429 / 5xx / 4xx / cache-miss paths through cacheDo and interpretCacheResponse — the (apiResp, err) plumbing flips break/retry correctly in every case I worked through. The retry-safety claims on CacheEntryCreate (fresh upload_uuid per call) and CacheEntryCommit (unconditional overwrite) rest on the server-side review noted in the PR description; I couldn't independently verify those, but the inline comments at each wrap site make the assumption visible to future readers.
One naming nit inline. No risk: label set on this PR.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 326, then answer my questions about the findings.
zhming0
left a comment
There was a problem hiding this comment.
Looks good to me with one suggestion on using exponential retry strategy instead.
|
|
||
| err = roko.NewRetrier( | ||
| roko.WithMaxAttempts(10), | ||
| roko.WithStrategy(roko.Constant(5*time.Second)), |
There was a problem hiding this comment.
Would it be better to use exponential instead? Cache restore is latency sensitive, 5s cache restore might as well be considered cache miss 😬 .
There was a problem hiding this comment.
I lowered retry count to 5, and applied exponential strategy, so max 5 retries will be ~3.5 seconds.
Also applied to the API calls in save for consistency.
There was a problem hiding this comment.
[non-blocking] I don't have a hard evidences against 3.5s total wait time, but it's worth keep in mind that it's better to have a green build than red build.
A red build usually result in support tickets
4c90e53 to
e171d5c
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e171d5c5f2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| body, err := io.ReadAll(httpResp.Body) | ||
| if err != nil { | ||
| return httpResp, fmt.Errorf("failed to read response body: %w", err) | ||
| return apiResp, fmt.Errorf("failed to read response body: %w", err) |
There was a problem hiding this comment.
Don't mark mid-body EOFs non-retryable
In the new cache retry paths, this returns a non-nil response even when reading the response body fails. If the cache API sends headers (for example a 200) and then the connection drops or truncates the body, the error can wrap EOF/connection reset, but BreakOnNonRetryable checks the response status first; because 200 is not 429/5xx it calls Break() and the retrier gives up after the first attempt. That skips the intended retry for transient EOF/connection errors in this scenario, so the body transport error should be classified without the successful response status blocking IsRetryableError.
Useful? React with 👍 / 👎.
Description
Adds retry coverage to the cache CLI commands (
cache restoreandcache save), matching the consolidated retry pattern fromapi/retryable.goand sibling commands. Each API call is wrapped individually withroko.NewRetrier(...).DoWithContext(...)+api.BreakOnNonRetryableso transient 429/5xx/network errors retry with backoff while non-retryable 4xx breaks immediately.Context
Changes
*api.Response(positioned second-to-last to match the sibling convention inapi/) so callers can driveBreakOnNonRetryabledecisions;cacheAPIinterface andmockAPIClientupdated in lockstep.internal/cache/restore.goandinternal/cache/save.gowith retriers using themeta_data_getconfig (WithMaxAttempts(10),Constant(5*time.Second),WithJitter()).CacheEntryCreateandCacheEntryCommitare retry-safe by reading the server-side code inbuildkite/buildkite; short comments at each wrap site cite the verified server behaviour.Key decisions
internal/cache, not at the CLI or as one outer wrap aroundSave/Restore— outer-wrapping would replay already-succeeded caches in the fan-out and redo expensive non-API work (archive build, upload) when only the final API call failed.roko.DoFunc[N]— matches the 28 other retry call sites in the codebase and sidesteps theDoFunc3arity ceiling.Testing
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)Disclosures / Credits
Used opencode (Claude) to walk the codebase, evaluate retry-scoping alternatives, verify server-side idempotency, and write the implementation. All decisions and final code reviewed by me.