fix(backup/restore): include S3 Vectors index in the snapshot loop #481

Merged
colinmxs merged 1 commit into develop from fix/restore-rag-vector-index
Jun 16, 2026

@colinmxs

Contributor

Symptom (reported by a forker)

After a teardown → redeploy → restore cycle, RAG assistants showed their documents in the UI (visual indicator from the DDB rag-assistants rows + S3 originals), but the connection to the knowledge base didn't carry forward — every retrieval came back empty.

Root cause

The assistants RAG knowledge base is backed by an AWS::S3Vectors::Index (rag-vector-index-v1) that lives in a separate AWS service from regular S3 — the s3vectors boto3 client / AWS::S3Vectors::* CFN types. aws s3 sync cannot reach it, list_objects_v2 doesn't see it, and nothing in backup.py or restore.py was touching it.

The result: after teardown→redeploy→restore the new vector index was empty. The DDB document metadata restored, the S3 originals restored, every assistant appeared connected — but every s3vectors.query_vectors(filter={"assistant_id": ...}) call returned zero hits because no vectors existed for any assistant_id in the new index.
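A quick way to confirm this state on a restored environment is to check whether the index holds any vectors at all. The helper below is a minimal sketch, not code from this PR: `index_is_empty` is a hypothetical name, and the client is passed in so the check can be exercised without AWS access. It assumes the `list_vectors` request shape (`vectorBucketName`, `indexName`, `maxResults`, `nextToken`) from the boto3 `s3vectors` client.

```python
def index_is_empty(s3vectors, bucket_name, index_name):
    """Return True if list_vectors yields no vectors for the index.

    `s3vectors` is expected to behave like boto3.client("s3vectors").
    Hypothetical helper for diagnosis only.
    """
    next_token = None
    while True:
        kwargs = {
            "vectorBucketName": bucket_name,
            "indexName": index_name,
            "maxResults": 1,  # one hit is enough to prove the index is not empty
        }
        if next_token:
            kwargs["nextToken"] = next_token
        page = s3vectors.list_vectors(**kwargs)
        if page.get("vectors"):
            return False
        # Keep following the token in case a page comes back empty.
        next_token = page.get("nextToken")
        if not next_token:
            return True
```

Against a real environment this would be called as `index_is_empty(boto3.client("s3vectors"), bucket, "rag-vector-index-v1")`, with the bucket name resolved from wherever the deployment publishes it.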

This wasn't caught by the existing supply-chain coverage test because that test only enforces DynamoDB-table and regular-S3-bucket coverage. The vector store was added via an untyped CfnResource, so it slipped past the canary.

Fix

Same "snapshot and replay" model the rest of the restore tool uses, applied to the S3 Vectors index.

backup.py

| Change | Detail |
| --- | --- |
| New VECTOR_INDEXES list | Parallel to S3_BUCKETS. Each entry maps a logical name to two SSM paths (bucket-name + index-name). |
| New backup_vector_index() | Paginates s3vectors.list_vectors(returnData=True, returnMetadata=True) and streams every {key, data, metadata} record to vectors/{logical}.jsonl.gz in the backup bucket. The on-disk shape is byte-compatible with put_vectors on restore, so the round-trip needs no transformation. |
| run() orchestrator | Iterates VECTOR_INDEXES after the regular S3 buckets pass. |
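For readers who want the shape of the backup loop without opening the diff, here is a minimal sketch under stated assumptions. It is not the code in backup.py: the signature is hypothetical, the client is injected so the loop can be tested with a fake, and the output goes to any binary stream rather than being uploaded to the backup bucket. The request and response fields (`nextToken`, `returnData`, `returnMetadata`, `vectors`) follow the boto3 `s3vectors.list_vectors` API.

```python
import gzip
import io
import json


def backup_vector_index(s3vectors, bucket_name, index_name, out_stream):
    """Write every vector in the index to out_stream as gzipped JSONL.

    Returns the number of records written. Sketch only; the real
    backup.py function may differ in signature and destination.
    """
    count = 0
    next_token = None
    with gzip.GzipFile(fileobj=out_stream, mode="wb") as gz:
        while True:
            kwargs = {
                "vectorBucketName": bucket_name,
                "indexName": index_name,
                "returnData": True,
                "returnMetadata": True,
            }
            if next_token:
                kwargs["nextToken"] = next_token
            page = s3vectors.list_vectors(**kwargs)
            for vec in page.get("vectors", []):
                # Same {key, data, metadata} shape that put_vectors accepts.
                record = {
                    "key": vec["key"],
                    "data": vec["data"],
                    "metadata": vec.get("metadata", {}),
                }
                gz.write((json.dumps(record) + "\n").encode("utf-8"))
                count += 1
            next_token = page.get("nextToken")
            if not next_token:
                break
    return count
```

Writing one JSON object per line keeps memory use flat regardless of index size, and the restore side can stream the file back without loading it whole.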

restore.py

| Change | Detail |
| --- | --- |
| New VECTOR_INDEXES constant | Mirrors backup.py. |
| New restore_vector_index() | Reads the gzipped JSONL, batches records 50-at-a-time (matches bedrock_embeddings.store_embeddings_in_s3.BATCH_SIZE), and calls s3vectors.put_vectors. Idempotent on re-run (put_vectors with same key is an upsert). Skips cleanly on older backups that pre-date the vectors snapshot and on target prefixes where RAG is disabled. |
| run_restore() orchestrator | Runs the vector restore step after the S3 buckets pass and before the AgentCore Memory replay. |
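The replay side can be sketched the same way. Again this is an illustration under assumptions rather than the restore.py implementation: the signature, the `dry_run` flag handling, and the `_flush` helper are hypothetical, and the skip paths for missing backups or disabled RAG are omitted. The batch size of 50 is the value stated above; `put_vectors` is called with the boto3 parameter names `vectorBucketName`, `indexName`, and `vectors`.

```python
import gzip
import io
import json

BATCH_SIZE = 50  # value stated in this PR; matches the ingestion path


def _flush(s3vectors, bucket_name, index_name, batch, dry_run):
    """Send one batch (unless dry_run) and return how many records it held."""
    if not dry_run:
        s3vectors.put_vectors(
            vectorBucketName=bucket_name,
            indexName=index_name,
            vectors=batch,
        )
    return len(batch)


def restore_vector_index(s3vectors, bucket_name, index_name, in_stream, dry_run=False):
    """Replay gzipped JSONL records into the index in batches of BATCH_SIZE."""
    total = 0
    batch = []
    with gzip.GzipFile(fileobj=in_stream, mode="rb") as gz:
        for raw in gz:
            line = raw.strip()
            if not line:
                continue
            batch.append(json.loads(line))
            if len(batch) == BATCH_SIZE:
                total += _flush(s3vectors, bucket_name, index_name, batch, dry_run)
                batch = []
    if batch:  # final partial batch
        total += _flush(s3vectors, bucket_name, index_name, batch, dry_run)
    return total
```

Because put_vectors overwrites by key, running this twice against the same file leaves the index in the same state, which is what makes a partially completed restore safe to resume.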

Test coverage

| Suite | Result |
| --- | --- |
| tests/supply_chain/test_backup_coverage.py::TestBackupCoversVectorIndexes (new) | 5 / 5 ✅. Scans CDK for AWS::S3Vectors::Index and verifies backup.py declares it, calls list_vectors with returnData/returnMetadata, and restore.py both defines AND wires in restore_vector_index. Same canary pattern that already covers DDB tables and S3 buckets. |
| scripts/restore-data/test_restore.py (7 new tests) | 48 / 48 ✅. Covers: 1:1 round-trip with batch-of-50 flushing, missing backup file skip, missing target SSM skip, dry-run, idempotent re-run via put_vectors upsert, unknown logical name skip, and the rag-vectors logical-name pin. |

Drive-by: extract_backup_tables in the supply-chain test is now section-scoped so its "logical": regex stops slurping entries from S3_BUCKETS or the new VECTOR_INDEXES.

The 3 pre-existing supply-chain failures (system-prompts table and skill-resources bucket from prior feature PRs) are untouched and unrelated to this PR.

API verification

I verified the API surface against the AWS docs before relying on it:

  • s3vectors:ListVectors supports paginated full enumeration via nextToken and exposes returnData + returnMetadata flags. With both true, the response includes the float32 vector and the metadata document — exactly the shape put_vectors accepts. Permissions required: s3vectors:ListVectors + s3vectors:GetVectors for the backup, s3vectors:PutVectors for the restore. (See https://docs.aws.amazon.com/cli/latest/reference/s3vectors/list-vectors.html.)
  • s3vectors.put_vectors is keyed on key and is an upsert — so a partially-completed prior restore can be resumed by re-invoking the script.

The backup workflow runs under secrets.AWS_ROLE_ARN, which is the deployment role that already has s3vectors:* permissions for the platform. No IAM changes are needed for the typical fork.

What this PR does NOT include

A one-off recovery script for the existing affected forker who's already in the empty-vector-index state. That was scoped as a separate task per the diagnosis discussion. The two viable options for them are:

  1. Reuse the original backup (if the source environment's vector store is still reachable): write a small one-off that runs the new backup_vector_index → restore_vector_index loop directly between the two environments.
  2. Re-ingest from S3 originals — trigger the rag-ingestion lambda for every document in the assistants table. Costs Bedrock Titan embedding calls, but it's a recovery scenario.

I'm happy to write whichever the forker prefers as a follow-up.

Operator notes (for forks adopting this PR)

  • After merging, run backup-data.yml to capture a fresh snapshot that includes the vectors. Existing backup buckets predating this PR will still be readable by the restore (it will skip the vectors restore with a clear "no vectors backup file" reason).
  • The new manifest entry (vectors) is additive; older restore tooling reading newer manifests would simply ignore it.

The assistants RAG knowledge base is backed by an AWS::S3Vectors::Index
('rag-vector-index-v1') that lives in a separate AWS service from
regular S3 — the s3vectors boto3 client / AWS::S3Vectors::* CFN types.
'aws s3 sync' cannot reach it, list_objects_v2 doesn't see it, and
nothing in backup.py or restore.py was touching it. The result: after
teardown -> redeploy -> restore, the new vector index was empty, every
assistant's knowledge base appeared connected (DDB document metadata
restored, S3 originals restored) but every retrieval call returned
zero hits because no vectors existed for any assistant_id.

This change closes that gap with the same 'snapshot and replay' model
the rest of the restore tool uses.

backup.py
  - new VECTOR_INDEXES list (parallel to S3_BUCKETS)
  - new backup_vector_index() that paginates s3vectors.list_vectors
    with returnData=True + returnMetadata=True and streams every
    record to vectors/{logical}.jsonl.gz in the backup bucket. The
    line format ({key, data, metadata}) is byte-compatible with
    s3vectors.put_vectors on restore.
  - run() iterates VECTOR_INDEXES after the regular S3 buckets pass

restore.py
  - new VECTOR_INDEXES constant mirroring backup.py
  - new restore_vector_index() that reads the gzipped JSONL, batches
    records 50-at-a-time (matching
    bedrock_embeddings.store_embeddings_in_s3 BATCH_SIZE), and calls
    s3vectors.put_vectors. Idempotent on re-run (put_vectors with same
    key is upsert), skips cleanly on older backups that pre-date the
    vectors snapshot, and skips cleanly on target prefixes where RAG
    is disabled.
  - run_restore() runs the vector restore step after the S3 buckets
    pass and before the AgentCore Memory replay

tests
  - tests/supply_chain/test_backup_coverage.py adds
    TestBackupCoversVectorIndexes — 5 assertions that scan the CDK
    constructs for AWS::S3Vectors::Index resources and verify backup.py
    declares them, calls list_vectors with returnData/returnMetadata,
    and restore.py both defines AND wires in restore_vector_index.
    Same canary pattern that already covers DynamoDB tables and
    regular S3 buckets.
  - tests/supply_chain/test_backup_coverage.py: extract_backup_tables
    is now section-scoped so its 'logical' regex stops slurping
    entries from S3_BUCKETS or the new VECTOR_INDEXES. Drive-by fix.
  - scripts/restore-data/test_restore.py adds 7 behavioral tests for
    restore_vector_index covering: 1:1 round-trip with batch-of-50
    flushing, missing backup file skip, missing target SSM skip,
    dry-run, idempotent re-run, unknown logical name skip, and the
    rag-vectors logical-name pin.

All 48/48 restore-data tests pass; all 5/5 new vector coverage tests
pass. The 3 pre-existing supply-chain failures (system-prompts table
and skill-resources bucket from prior feature PRs) are untouched and
unrelated.

NOTE: the existing affected forker still has an empty vector index
post-restore. This PR does not include a one-off recovery script —
that's intentionally scoped as a separate task per the diagnosis
discussion.
@colinmxs colinmxs merged commit a02aae9 into develop Jun 16, 2026
1 check passed