Skip to content

fix(cloud-agent-next): wait for tool loops to finish#3753

Open
eshurakov wants to merge 1 commit into
mainfrom
third-drive
Open

fix(cloud-agent-next): wait for tool loops to finish#3753
eshurakov wants to merge 1 commit into
mainfrom
third-drive

Conversation

@eshurakov
Copy link
Copy Markdown
Contributor

Summary

Why

Cloud Agent currently treats a completed assistant message.updated event as successful user-turn completion, but Kilo emits those events for intermediate tool-loop steps. That can pin kg agent result, callbacks, and durable session state to an early tool-call response while the agent continues working. The success boundary needs to live with the wrapper that observes Kilo's run lifecycle and post-processing, while the Durable Object remains a durable coordinator rather than reconstructing prompt lifecycle from transient events.

What was done

  • Keep successful assistant updates as transcript/activity events and reserve prompt failure for explicit assistant errors.
  • Have the wrapper seal the exact admitted batch after three seconds of stable root idle, block new admission while finalizing, run post-processing, and emit one authoritative complete with sealed message IDs.
  • Cancel stable-idle finalization only when the root session explicitly resumes via session.turn.open or busy session.status; trailing post-idle events no longer prevent completion.
  • Persist one run-level finalizing hold in the Durable Object so follow-up work stays pending without retry churn, then settle exact sealed membership and drain held work under a fresh run.
  • Preserve rolling-deployment compatibility by accepting legacy complete payloads without membership for legacy runs while wrapper version 2.3.0 moves fresh work onto the sealed-batch protocol.

High-level architecture

sequenceDiagram
  participant DO as Session Durable Object
  participant Runtime as Agent Runtime
  participant Wrapper
  participant Kilo
  participant Ingest as Ingest Worker

  DO->>Runtime: Deliver fenced pending message
  Runtime->>Wrapper: Submit prompt or command
  Wrapper->>Kilo: Admit work
  Kilo-->>Wrapper: Assistant steps and root idle
  Note over Wrapper: Stable root idle seals exact admitted membership
  Wrapper->>Ingest: wrapper_finalizing(wrapperRunId)
  Ingest->>DO: Hold new delivery for current run
  Wrapper->>Wrapper: Run post-processing
  Wrapper->>Ingest: complete(sealed messageIds)
  Ingest->>DO: Settle exact batch and retire run
  DO->>Runtime: Drain held work under a fresh run
Loading

Architecture decision

Decision: Make fenced wrapper complete with exact sealed batch membership the only normal success boundary, backed by one Durable Object run-level finalizing hold.

Context: Completed assistant messages are model-step boundaries, root idle can be transient between queued prompts, and the Durable Object has no bounded authoritative Kilo query that can reconstruct a prompt's final success after the wrapper is gone.

Rationale: The wrapper directly observes Kilo lifecycle and owns post-processing, so it can seal the admitted run batch without guessing. Exact membership lets the Durable Object settle durable message state while staying focused on coordination, fencing, pending delivery, and failure supervision.

Alternatives considered:

  • Terminalize on completed assistant updates. This is the source of premature tool-loop completion because one user turn can produce multiple completed assistant steps.
  • Infer success from raw root idle in the Durable Object. Root idle is transient and would duplicate wrapper lifecycle logic without proving exact batch membership.
  • Add complete acknowledgements or a terminal outbox. This would add protocol and recovery complexity beyond the current failure-safe requirement; lost complete remains a conservative wrapper failure through existing supervision.

Consequences: Follow-up messages arriving during finalization wait durably for a fresh wrapper run, and rolling deployment retains a narrow legacy-complete fallback. The protocol intentionally favors conservative failure over inferring success when an authoritative complete is lost.

Verification

  • Ran a fake-LLM cold turn and observed terminal complete.
  • Queued two follow-up turns while busy and observed FIFO completion across the sealed batch boundary.
  • Ran fake-LLM chunked-streaming and empty-response paths and observed terminal complete.
  • Ran fake-LLM waiter cleanup and observed zero leaked waiters and live responses.

Visual Changes

N/A

Reviewer Notes

  • Review the wrapper stable-idle/admission fence together with Durable Object exact-membership settlement and run-level finalizing hold; these form one lifecycle contract.
  • Legacy complete without messageIds is accepted only for rolling compatibility; fresh work is pushed to wrapper protocol 2.3.0.
  • The default fake-LLM smoke matrix excludes callback scenarios because callbackTarget uses the internal legacy API; focused callback scenarios remain available manually.
  • The focused external-kill fake-LLM scenario still fails independently with no reconnect/terminal event and is not addressed by this PR.

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented Jun 4, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

Incremental review of the new commit finds no bugs, security issues, or logic errors across the full sealed-batch protocol — wrapper state machine, DO coordination, ingest routing, and integration tests all look correct.

Key Design Decisions Verified (informational)

isDeliveryHeld vs isWrapperDeliveryHeld asymmetry (CloudAgentSession.ts:605): The isDeliveryHeld callback passed to createSessionMessageQueue only checks isWrapperRunFinalizing, not lease.state. Physical cleanup blocking is handled by the pre-existing getDeliveryBlock path. isDeliveryHeld is exclusively for the WRAPPER_FINALIZING hold-without-retry semantics. Alarm scheduling correctly uses the full isWrapperDeliveryHeld (both checks). Confirmed by integration test "holds pending delivery through physical cleanup and drains after confirmed absence".

_admissionsBlocked lifecycle (wrapper/src/state.ts): Only cleared in bindSession() when !this.session. clearAllMessages() does not reset it. bindSessionContext detects isFreshRunAfterFinalization and allows a fresh bind which resets it. Validated by "accepts a fresh wrapper run after finalizing clears its session".

stableIdleTimer race (wrapper/src/lifecycle.ts): trySealIdleBatch sets stableIdleTimer = null before any checks, preventing double-arm. stop() calls clearStableIdleCandidate() — no leaks.

SQL query removal of time.completed constraint (session/queries/events.ts): Intentional. Settlement authority moves to settleSealedBatch which uses the sealed messageIds list. Tool-loop fixture tests in tool-loop-terminalization.test.ts validate end-to-end correctness.

settleSealedBatch null return: Only when getMetadata() is null. onTerminalEvent handles this conservatively by stopping and failing all accepted messages.

markWrapperFinalizing read+write: Safe under DO single-threaded execution guarantee, with run-ID guard.

Files Reviewed (54 files)
  • services/cloud-agent-next/src/execution/errors.ts
  • services/cloud-agent-next/src/execution/orchestrator.test.ts
  • services/cloud-agent-next/src/execution/orchestrator.ts
  • services/cloud-agent-next/src/execution/types.ts
  • services/cloud-agent-next/src/kilo-facade/user-kilo-facade.ts
  • services/cloud-agent-next/src/kilo/wrapper-client.test.ts
  • services/cloud-agent-next/src/kilo/wrapper-client.ts
  • services/cloud-agent-next/src/persistence/CloudAgentSession.ts
  • services/cloud-agent-next/src/session/agent-runtime.test.ts
  • services/cloud-agent-next/src/session/agent-runtime.ts
  • services/cloud-agent-next/src/session/message-settlement-outbox.test.ts
  • services/cloud-agent-next/src/session/message-settlement-outbox.ts
  • services/cloud-agent-next/src/session/pending-messages.ts
  • services/cloud-agent-next/src/session/queries/events.ts
  • services/cloud-agent-next/src/session/queue-message.ts
  • services/cloud-agent-next/src/session/session-message-queue.test.ts
  • services/cloud-agent-next/src/session/session-message-queue.ts
  • services/cloud-agent-next/src/session/session-message-state.test.ts
  • services/cloud-agent-next/src/session/session-message-state.ts
  • services/cloud-agent-next/src/session/wrapper-runtime-state.test.ts
  • services/cloud-agent-next/src/session/wrapper-runtime-state.ts
  • services/cloud-agent-next/src/session/wrapper-supervisor.test.ts
  • services/cloud-agent-next/src/session/wrapper-supervisor.ts
  • services/cloud-agent-next/src/shared/protocol.ts
  • services/cloud-agent-next/src/shared/wrapper-bootstrap.ts
  • services/cloud-agent-next/src/shared/wrapper-version.ts
  • services/cloud-agent-next/src/websocket/ingest.test.ts
  • services/cloud-agent-next/src/websocket/ingest.ts
  • services/cloud-agent-next/src/websocket/types.ts
  • services/cloud-agent-next/test/e2e/smoke.ts
  • services/cloud-agent-next/test/fixtures/tool-loop-turn-events.ts
  • services/cloud-agent-next/test/integration/session/disconnect-and-reaper.test.ts
  • services/cloud-agent-next/test/integration/session/events.test.ts
  • services/cloud-agent-next/test/integration/session/execute-directly-failure.test.ts
  • services/cloud-agent-next/test/integration/session/hot-delivery.test.ts
  • services/cloud-agent-next/test/integration/session/idle-reconciliation.test.ts
  • services/cloud-agent-next/test/integration/session/tool-loop-terminalization.test.ts
  • services/cloud-agent-next/test/unit/wrapper/batch-admission.test.ts
  • services/cloud-agent-next/test/unit/wrapper/batch-lifecycle.test.ts
  • services/cloud-agent-next/test/unit/wrapper/connection.test.ts
  • services/cloud-agent-next/test/unit/wrapper/lifecycle.test.ts
  • services/cloud-agent-next/test/unit/wrapper/network-resume.test.ts
  • services/cloud-agent-next/test/unit/wrapper/reconnection.test.ts
  • services/cloud-agent-next/test/unit/wrapper/server.test.ts
  • services/cloud-agent-next/test/unit/wrapper/snapshot.test.ts
  • services/cloud-agent-next/test/unit/wrapper/state.test.ts
  • services/cloud-agent-next/wrapper/src/connection.ts
  • services/cloud-agent-next/wrapper/src/lifecycle.test.ts
  • services/cloud-agent-next/wrapper/src/lifecycle.ts
  • services/cloud-agent-next/wrapper/src/main.ts
  • services/cloud-agent-next/wrapper/src/server.test.ts
  • services/cloud-agent-next/wrapper/src/server.ts
  • services/cloud-agent-next/wrapper/src/state.ts

Reviewed by claude-sonnet-4.6 · 10,638,218 tokens

Review guidance: REVIEW.md from base branch main

@eshurakov eshurakov changed the title fix(cloud-agent-next): settle tool loops on sealed wrapper batches fix(cloud-agent-next): wait for tool loops to finish Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants