fix(backup/restore): include S3 Vectors index in the snapshot loop by colinmxs · Pull Request #481 · Boise-State-Development/agentcore-public-stack

colinmxs · 2026-06-16T21:12:52Z

Symptom (reported by a forker)

After a teardown → redeploy → restore cycle, RAG assistants showed their documents in the UI (visual indicator from the DDB rag-assistants rows + S3 originals), but the connection to the knowledge base didn't carry forward — every retrieval came back empty.

Root cause

The assistants RAG knowledge base is backed by an AWS::S3Vectors::Index (rag-vector-index-v1) that lives in a separate AWS service from regular S3 — the s3vectors boto3 client / AWS::S3Vectors::* CFN types. aws s3 sync cannot reach it, list_objects_v2 doesn't see it, and nothing in backup.py or restore.py was touching it.

The result: after teardown→redeploy→restore the new vector index was empty. The DDB document metadata restored, the S3 originals restored, every assistant appeared connected — but every s3vectors.query_vectors(filter={"assistant_id": ...}) call returned zero hits because no vectors existed for any assistant_id in the new index.

This wasn't caught by the existing supply-chain coverage test because that test only enforces DynamoDB-table and regular-S3-bucket coverage. Adding the vector store via untyped CfnResource slipped through the canary.

Fix

Same "snapshot and replay" model the rest of the restore tool uses, applied to the S3 Vectors index.

`backup.py`

| Change | Detail |
| --- | --- |
| New `VECTOR_INDEXES` list | Parallel to `S3_BUCKETS`. Each entry maps a logical name to two SSM paths (bucket-name + index-name). |
| New `backup_vector_index()` | Paginates `s3vectors.list_vectors(returnData=True, returnMetadata=True)` and streams every `{key, data, metadata}` record to `vectors/{logical}.jsonl.gz` in the backup bucket. The on-disk shape is byte-compatible with `put_vectors` on restore, so records round-trip with no transformation. |
| `run()` orchestrator | Iterates `VECTOR_INDEXES` after the regular S3 buckets pass. |
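
The enumeration step described above can be sketched roughly as follows. This is not the PR's `backup_vector_index()`; it is a minimal illustration that assumes a boto3 `s3vectors` client is passed in, and the helper names `iter_vectors` and `dump_vectors_jsonl_gz` are hypothetical. For brevity it builds the gzipped JSONL in memory rather than streaming it to the backup bucket.

```python
import gzip
import io
import json
from typing import Any, Iterator


def iter_vectors(client: Any, bucket: str, index: str) -> Iterator[dict]:
    """Yield every record in the index, following nextToken until exhausted."""
    token = None
    while True:
        kwargs = {
            "vectorBucketName": bucket,
            "indexName": index,
            "returnData": True,       # include the float32 vector
            "returnMetadata": True,   # include the metadata document
        }
        if token:
            kwargs["nextToken"] = token
        page = client.list_vectors(**kwargs)
        for vec in page.get("vectors", []):
            record = {"key": vec["key"], "data": vec["data"]}
            if "metadata" in vec:
                record["metadata"] = vec["metadata"]
            yield record
        token = page.get("nextToken")
        if not token:
            break


def dump_vectors_jsonl_gz(client: Any, bucket: str, index: str) -> tuple[bytes, int]:
    """Serialize the whole index as gzipped JSONL; returns (blob, record_count)."""
    buf = io.BytesIO()
    count = 0
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for record in iter_vectors(client, bucket, index):
            gz.write((json.dumps(record) + "\n").encode("utf-8"))
            count += 1
    return buf.getvalue(), count
```

With a real client (`boto3.client("s3vectors")`), the returned blob could then be uploaded to `vectors/{logical}.jsonl.gz` with an ordinary S3 `put_object` call.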

`restore.py`

| Change | Detail |
| --- | --- |
| New `VECTOR_INDEXES` constant | Mirrors backup.py. |
| New `restore_vector_index()` | Reads the gzipped JSONL, batches records 50-at-a-time (matches `bedrock_embeddings.store_embeddings_in_s3.BATCH_SIZE`), and calls `s3vectors.put_vectors`. Idempotent on re-run (put_vectors with same key is an upsert). Skips cleanly on older backups that pre-date the vectors snapshot and on target prefixes where RAG is disabled. |
| `run_restore()` orchestrator | Runs the vector restore step after the S3 buckets pass and before the AgentCore Memory replay. |
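
A rough sketch of the replay side, under the same caveats: this is an illustration rather than the PR's `restore_vector_index()`, the names `batched` and `restore_vectors` are hypothetical, and the SSM lookups and skip conditions are left out. It assumes the backup blob has already been downloaded and that each JSONL line is a record `put_vectors` accepts as-is.

```python
import gzip
import json
from typing import Any, Iterable, Iterator

BATCH_SIZE = 50  # batch size quoted in the PR description


def batched(records: Iterable[dict], size: int) -> Iterator[list]:
    """Group records into lists of at most `size`, flushing the remainder."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def restore_vectors(client: Any, bucket: str, index: str,
                    blob: bytes, dry_run: bool = False) -> int:
    """Replay a gzipped JSONL snapshot into the index; returns records handled."""
    lines = gzip.decompress(blob).decode("utf-8").splitlines()
    records = (json.loads(line) for line in lines if line.strip())
    handled = 0
    for batch in batched(records, BATCH_SIZE):
        if not dry_run:
            # put_vectors upserts by key, so re-running after a partial
            # restore rewrites the same records instead of duplicating them.
            client.put_vectors(vectorBucketName=bucket, indexName=index, vectors=batch)
        handled += len(batch)
    return handled
```

Because the write is keyed on `key`, the function needs no checkpointing: an interrupted run can simply be started again from the top.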

Test coverage

| Suite | Result |
| --- | --- |
| `tests/supply_chain/test_backup_coverage.py::TestBackupCoversVectorIndexes` (new) | 5 / 5 ✅. Scans CDK for `AWS::S3Vectors::Index` and verifies `backup.py` declares it, calls `list_vectors` with `returnData`/`returnMetadata`, and `restore.py` both defines AND wires in `restore_vector_index`. Same canary pattern that already covers DDB tables and S3 buckets. |
| `scripts/restore-data/test_restore.py` (7 new tests) | 48 / 48 ✅. Covers: 1:1 round-trip with batch-of-50 flushing, missing backup file skip, missing target SSM skip, dry-run, idempotent re-run via put_vectors upsert, unknown logical name skip, and the `rag-vectors` logical-name pin. |

Drive-by: extract_backup_tables in the supply-chain test is now section-scoped so its "logical": regex stops slurping entries from S3_BUCKETS or the new VECTOR_INDEXES.

The 3 pre-existing supply-chain failures (system-prompts table and skill-resources bucket from prior feature PRs) are untouched and unrelated to this PR.

API verification

I verified the API surface against the AWS docs before relying on it:

s3vectors:ListVectors supports paginated full enumeration via nextToken and exposes returnData + returnMetadata flags. With both true, the response includes the float32 vector and the metadata document — exactly the shape put_vectors accepts. Permissions required: s3vectors:ListVectors + s3vectors:GetVectors for the backup, s3vectors:PutVectors for the restore. (See https://docs.aws.amazon.com/cli/latest/reference/s3vectors/list-vectors.html.)
s3vectors.put_vectors is keyed on key and is an upsert — so a partially-completed prior restore can be resumed by re-invoking the script.

The backup workflow runs under secrets.AWS_ROLE_ARN, which is the deployment role that already has s3vectors:* permissions for the platform. No IAM changes are needed for the typical fork.

What this PR does NOT include

A one-off recovery script for the existing affected forker who's already in the empty-vector-index state. That was scoped as a separate task per the diagnosis discussion. The two viable options for them are:

Reuse the original backup (if the source environment's vector store is still reachable) — write a small one-off that does the new backup_vectors → restore_vectors loop directly between the two environments.
Re-ingest from S3 originals — trigger the rag-ingestion lambda for every document in the assistants table. Costs Bedrock Titan embedding calls, but it's a recovery scenario.

I'm happy to write whichever the forker prefers as a follow-up.

Operator notes (for forks adopting this PR)

After merging, run backup-data.yml to capture a fresh snapshot that includes the vectors. Existing backup buckets predating this PR will still be readable by the restore (it will skip the vectors restore with a clear "no vectors backup file" reason).
The new manifest entry (vectors) is additive; older restore tooling reading newer manifests would simply ignore it.
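
For operators who want a quick sanity check after a restore, one option is to compare the number of records in the vectors backup file with the number of vectors now in the target index. The sketch below is not part of this PR; `count_backup_records` and `count_index_vectors` are hypothetical helpers, and the client is assumed to be a boto3 `s3vectors` client.

```python
import gzip
from typing import Any


def count_backup_records(blob: bytes) -> int:
    """Count non-empty JSONL lines in a gzipped vectors backup."""
    text = gzip.decompress(blob).decode("utf-8")
    return sum(1 for line in text.splitlines() if line.strip())


def count_index_vectors(client: Any, bucket: str, index: str) -> int:
    """Count vectors in the live index by paginating list_vectors (keys only)."""
    total = 0
    token = None
    while True:
        kwargs = {"vectorBucketName": bucket, "indexName": index}
        if token:
            kwargs["nextToken"] = token
        page = client.list_vectors(**kwargs)
        total += len(page.get("vectors", []))
        token = page.get("nextToken")
        if not token:
            return total
```

A live count of zero against a non-zero backup count would point to the empty-index state described in the symptom.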

The assistants RAG knowledge base is backed by an AWS::S3Vectors::Index ('rag-vector-index-v1') that lives in a separate AWS service from regular S3 (the s3vectors boto3 client / AWS::S3Vectors::* CFN types). 'aws s3 sync' cannot reach it, list_objects_v2 doesn't see it, and nothing in backup.py or restore.py was touching it.

The result: after teardown -> redeploy -> restore, the new vector index was empty. Every assistant's knowledge base appeared connected (DDB document metadata restored, S3 originals restored), but every retrieval call returned zero hits because no vectors existed for any assistant_id.

This change closes that gap with the same 'snapshot and replay' model the rest of the restore tool uses.

backup.py
- new VECTOR_INDEXES list (parallel to S3_BUCKETS)
- new backup_vector_index() that paginates s3vectors.list_vectors with returnData=True + returnMetadata=True and streams every record to vectors/{logical}.jsonl.gz in the backup bucket. The line format ({key, data, metadata}) is byte-compatible with s3vectors.put_vectors on restore.
- run() iterates VECTOR_INDEXES after the regular S3 buckets pass

restore.py
- new VECTOR_INDEXES constant mirroring backup.py
- new restore_vector_index() that reads the gzipped JSONL, batches records 50-at-a-time (matching bedrock_embeddings.store_embeddings_in_s3 BATCH_SIZE), and calls s3vectors.put_vectors. Idempotent on re-run (put_vectors with same key is upsert), skips cleanly on older backups that pre-date the vectors snapshot, and skips cleanly on target prefixes where RAG is disabled.
- run_restore() runs the vector restore step after the S3 buckets pass and before the AgentCore Memory replay

tests
- tests/supply_chain/test_backup_coverage.py adds TestBackupCoversVectorIndexes: 5 assertions that scan the CDK constructs for AWS::S3Vectors::Index resources and verify backup.py declares them, calls list_vectors with returnData/returnMetadata, and restore.py both defines AND wires in restore_vector_index. Same canary pattern that already covers DynamoDB tables and regular S3 buckets.
- tests/supply_chain/test_backup_coverage.py: extract_backup_tables is now section-scoped so its 'logical' regex stops slurping entries from S3_BUCKETS or the new VECTOR_INDEXES. Drive-by fix.
- scripts/restore-data/test_restore.py adds 7 behavioral tests for restore_vector_index covering: 1:1 round-trip with batch-of-50 flushing, missing backup file skip, missing target SSM skip, dry-run, idempotent re-run, unknown logical name skip, and the rag-vectors logical-name pin.

All 48/48 restore-data tests pass; all 5/5 new vector coverage tests pass. The 3 pre-existing supply-chain failures (system-prompts table and skill-resources bucket from prior feature PRs) are untouched and unrelated.

NOTE: the existing affected forker still has an empty vector index post-restore. This PR does not include a one-off recovery script; that's intentionally scoped as a separate task per the diagnosis discussion.

colinmxs merged commit a02aae9 into develop Jun 16, 2026
1 check passed

colinmxs merged 1 commit into develop from fix/restore-rag-vector-index
