Skip to content

fix(security-agent): make async command lifecycle durable#3726

Open
jeanduplessis wants to merge 1 commit into
mainfrom
security-agent-polish
Open

fix(security-agent): make async command lifecycle durable#3726
jeanduplessis wants to merge 1 commit into
mainfrom
security-agent-polish

Conversation

@jeanduplessis
Copy link
Copy Markdown
Contributor

@jeanduplessis jeanduplessis commented Jun 4, 2026

Summary

Makes Security Agent queue-backed actions durable and observable from admission through terminal completion.

Why this change is needed

Manual sync, finding dismissal, and manual analysis previously returned success at queue admission while the UI inferred completion from broad invalidation and timeouts. This could hide downstream failures, leave stale progress states, and acknowledge rollback callbacks before durable processing was admitted.

How this is addressed

  • Adds an owner-scoped command ledger with guarded lifecycle transitions, stale-command reconciliation, terminal retention, and queue-admission compensation.
  • Settles sync, dismissal, and manual-analysis commands with stable result codes and records per-repository sync freshness even when a repository has zero findings.
  • Exposes owner-scoped command status APIs and polls exact command IDs so pending, success, failure, and recovered-after-reload states remain accurate.
  • Preserves authoritative partial success for enable/config flows and routes rollback callbacks through durable Worker queue admission.

Human Verification

finding-dismissal-lifecycle-guided.mp4
dashboard-refresh-lifecycle-guided.mp4
  • Security auto-analysis Worker suite: 71 tests passed.
  • Security sync Worker suite: 29 tests passed.
  • Focused web, database, callback, router, client, and GDPR suites: 97 tests passed.
  • React Doctor diff scan remained at the 82/100 baseline.
  • Independent logic, security, data, type, resource, and style reviews found no remaining issues.
Local end-to-end browser and database verification

Tested with agent-browser against the local app as local@admin.example.com, using the configured Kilo-Org/security-agent-testbed repository. Correlated each user-visible transition with local Postgres state.

  • Dashboard refresh: Clicking Refresh showed Syncing... and Sync queued, then returned to Refresh with a fresh Last synced timestamp. The durable command settled as succeeded / SYNC_COMPLETED, and repository sync-state success timestamps were persisted.
  • Manual analysis: Clicking Analyze progressed through Queueing and Analyzing. Reloading the Findings page restored the active state. The finding and analysis queue completed after about two minutes, the result rendered as Not Exploitable, and the dashboard analysis counts updated. The launch command settled as succeeded / ANALYSIS_LAUNCH_STARTED with Cloud Agent and CLI session IDs persisted.
  • Manual dismissal: Dismissing the analyzed finding as Inaccurate showed Dismissal queued and Dismissing..., then changed counts from 301 Open / 14 Closed to 300 Open / 15 Closed. The finding appeared in Closed findings. The durable command settled as succeeded / FINDING_DISMISSED, and the finding persisted as ignored with reason inaccurate.
  • Configuration: Config UI values matched persisted configuration, including repository selection, enabled state, models, analysis mode, automation toggles, and SLA thresholds.
  • No functional defects were found in the tested flows. Guided video recordings were captured for refresh and dismissal lifecycles.

Reviewer Notes

Human Reviewer Flags

  • Adds generated migration 0155_wakeful_skaar with owner-scoped command and repository sync-state tables.
  • Reconciliation marks accepted commands stale after 5 minutes, running commands stale after 30 minutes, and retains terminal rows for 30 days.
  • Rollback callback handling now validates in web, then proxies to Worker ingress before acknowledging success.

Code Reviewer Agent

Code Reviewer Notes
  • Queue producers create ledger rows before enqueue and compensate admission failures with QUEUE_ADMISSION_FAILED.
  • Consumers use guarded transitions so retries cannot overwrite terminal outcomes.
  • Personal Security Agent command and repository sync-state rows are included in GDPR soft deletion.
  • Manual sync queue validation now requires commandId; scheduled sync messages remain ledger-free.

Comment thread services/security-sync/src/index.ts
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented Jun 4, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Executive Summary

processSecurityDismissMessage leaves the command ledger row stuck in running when the GitHub Dependabot dismissal API call throws, because the terminal transitionSecurityAgentCommand is never reached on the error path.

Overview

Severity Count
CRITICAL 0
WARNING 1
SUGGESTION 0
Issue Details (click to expand)

WARNING

File Line Issue
services/security-sync/src/index.ts 347 Command stuck in running when processSecurityFindingDismissal throws — terminal transition never called on error path
Other Observations (not in diff)
  • consumeManualAnalysisBatch (manual-analysis.ts:271) — same pattern: if processManualAnalysisStart throws an unexpected error after the running transition, the command is left in running until reconciliation. The InsufficientCreditsError path is handled gracefully, but any other unexpected throw from startSecurityAnalysis would hit this gap. The 30-minute reconciliation timeout is the safety net but it delays visibility.
  • processSecuritySyncMessage (index.ts:404) — same gap: if syncOwner throws (and it does rethrow when all repos fail with the same error via if (successfulRepos === 0 && firstError) throw firstError), the command stays in running. The outer queue.retry() handler catches it but the command ledger won't be updated until reconciliation.
  • Migration 0155_wakeful_skaar.sql: The CREATE INDEX statements are plain (non-CONCURRENTLY). These will take a table-level lock on security_agent_commands and security_agent_repository_sync_state. Since these are new tables being created in the same migration, there will be no existing rows at deploy time, so the lock window is negligible in practice — not a concern here.
  • upsertSecurityAgentRepositorySyncStatelast_succeeded_at in the set object is undefined when called from recordSecurityAgentRepositorySyncAttempt. Drizzle ORM omits undefined values from onConflictDoUpdate sets, so existing success timestamps are preserved correctly. No bug here.
Files Reviewed (40 files)
  • apps/web/src/app/api/internal/security-analysis-callback/[findingId]/route.test.ts
  • apps/web/src/app/api/internal/security-analysis-callback/[findingId]/route.ts
  • apps/web/src/components/security-agent/ClearFindingsCard.tsx
  • apps/web/src/components/security-agent/FindingDetailDialog.tsx
  • apps/web/src/components/security-agent/SecurityAgentContext.tsx
  • apps/web/src/components/security-agent/SecurityFindingsPage.tsx
  • apps/web/src/lib/security-agent/core/schemas.ts
  • apps/web/src/lib/security-agent/db/security-commands.test.ts
  • apps/web/src/lib/security-agent/db/security-commands.ts
  • apps/web/src/lib/security-agent/db/security-findings.ts
  • apps/web/src/lib/security-agent/router/shared-handlers.test.ts
  • apps/web/src/lib/security-agent/router/shared-handlers.ts
  • apps/web/src/lib/security-agent/services/manual-analysis-client.test.ts
  • apps/web/src/lib/security-agent/services/manual-analysis-client.ts
  • apps/web/src/lib/security-agent/services/manual-dismiss-client.test.ts
  • apps/web/src/lib/security-agent/services/manual-dismiss-client.ts
  • apps/web/src/lib/security-agent/services/manual-sync-client.test.ts
  • apps/web/src/lib/security-agent/services/manual-sync-client.ts
  • apps/web/src/lib/user/index.test.ts
  • apps/web/src/lib/user/index.ts - GDPR soft-delete updated ✓
  • apps/web/src/routers/organizations/organization-security-agent-router.ts
  • apps/web/src/routers/security-agent-router.ts
  • packages/db/src/index.ts
  • packages/db/src/migrations/0155_wakeful_skaar.sql
  • packages/db/src/migrations/meta/0155_snapshot.json
  • packages/db/src/migrations/meta/_journal.json
  • packages/db/src/schema.ts
  • packages/db/src/security-agent-command-repository.ts
  • packages/db/src/security-agent-repository-sync-state.ts
  • services/security-auto-analysis/src/dispatcher.ts
  • services/security-auto-analysis/src/index.test.ts
  • services/security-auto-analysis/src/index.ts
  • services/security-auto-analysis/src/manual-analysis.test.ts
  • services/security-auto-analysis/src/manual-analysis.ts
  • services/security-sync/src/dismiss.test.ts
  • services/security-sync/src/dismiss.ts
  • services/security-sync/src/index.test.ts
  • services/security-sync/src/index.ts - 1 issue
  • services/security-sync/src/sync.test.ts
  • services/security-sync/src/sync.ts

Fix these issues in Kilo Cloud


Reviewed by claude-4.6-sonnet-20260217 · 2,528,768 tokens

Review guidance: REVIEW.md from base branch main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant