fix(backup/restore): include S3 Vectors index in the snapshot loop #481
Merged
The assistants RAG knowledge base is backed by an AWS::S3Vectors::Index
('rag-vector-index-v1') that lives in a separate AWS service from
regular S3 — the s3vectors boto3 client / AWS::S3Vectors::* CFN types.
'aws s3 sync' cannot reach it, list_objects_v2 doesn't see it, and
nothing in backup.py or restore.py was touching it. The result: after
teardown -> redeploy -> restore, the new vector index was empty, every
assistant's knowledge base appeared connected (DDB document metadata
restored, S3 originals restored) but every retrieval call returned
zero hits because no vectors existed for any assistant_id.
This change closes that gap with the same 'snapshot and replay' model
the rest of the restore tool uses.
backup.py
- new VECTOR_INDEXES list (parallel to S3_BUCKETS)
- new backup_vector_index() that paginates s3vectors.list_vectors
with returnData=True + returnMetadata=True and streams every
record to vectors/{logical}.jsonl.gz in the backup bucket. The
line format ({key, data, metadata}) is byte-compatible with
s3vectors.put_vectors on restore.
- run() iterates VECTOR_INDEXES after the regular S3 buckets pass
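The backup step described above could look roughly like the following. This is a minimal sketch, not the code in backup.py: it assumes the boto3 s3vectors client's list_vectors call with vectorBucketName/indexName arguments and nextToken pagination, and it writes to a caller-supplied binary stream rather than uploading to the backup bucket. The function signature is hypothetical.

```python
import gzip
import json


def backup_vector_index(s3vectors, bucket_name, index_name, out_stream):
    """Write every vector in the index to out_stream as gzipped JSONL.

    Hypothetical signature: `s3vectors` is anything exposing list_vectors(),
    e.g. boto3.client("s3vectors"). Returns the number of records written.
    """
    count = 0
    next_token = None
    # GzipFile leaves out_stream open when the context exits.
    with gzip.GzipFile(fileobj=out_stream, mode="wb") as gz:
        while True:
            kwargs = {
                "vectorBucketName": bucket_name,
                "indexName": index_name,
                "returnData": True,
                "returnMetadata": True,
            }
            if next_token:
                kwargs["nextToken"] = next_token
            page = s3vectors.list_vectors(**kwargs)
            for vec in page.get("vectors", []):
                # Keep only the fields put_vectors accepts on restore.
                record = {"key": vec["key"], "data": vec["data"]}
                if "metadata" in vec:
                    record["metadata"] = vec["metadata"]
                gz.write((json.dumps(record) + "\n").encode("utf-8"))
                count += 1
            next_token = page.get("nextToken")
            if not next_token:
                break
    return count
```

Writing one JSON object per line keeps memory use flat regardless of index size, and each line can be passed back to put_vectors unchanged.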
restore.py
- new VECTOR_INDEXES constant mirroring backup.py
- new restore_vector_index() that reads the gzipped JSONL, batches
records 50-at-a-time (matching
bedrock_embeddings.store_embeddings_in_s3 BATCH_SIZE), and calls
s3vectors.put_vectors. Idempotent on re-run (put_vectors with the
same key is an upsert), skips cleanly on older backups that pre-date
the vectors snapshot, and skips cleanly on target prefixes where RAG
is disabled.
- run_restore() runs the vector restore step after the S3 buckets
pass and before the AgentCore Memory replay
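The replay side can be sketched in the same spirit. This is an illustrative outline under assumptions, not the code in restore.py: the signature and the dry_run flag's exact behavior are hypothetical, the SSM lookups and skip conditions are omitted, and put_vectors is assumed to take vectorBucketName, indexName, and a vectors list as in the boto3 s3vectors client.

```python
import gzip
import json

BATCH_SIZE = 50  # assumed to mirror the batch size named in this PR


def restore_vector_index(s3vectors, bucket_name, index_name, in_stream, dry_run=False):
    """Replay gzipped JSONL records into the index in batches of BATCH_SIZE.

    Hypothetical signature. Returns the number of records read. With
    dry_run=True the records are counted but put_vectors is never called.
    """
    total = 0
    batch = []

    def flush():
        if batch and not dry_run:
            s3vectors.put_vectors(
                vectorBucketName=bucket_name,
                indexName=index_name,
                vectors=list(batch),
            )
        batch.clear()

    with gzip.GzipFile(fileobj=in_stream, mode="rb") as gz:
        for raw in gz:
            line = raw.strip()
            if not line:
                continue
            batch.append(json.loads(line))
            total += 1
            if len(batch) >= BATCH_SIZE:
                flush()
    flush()  # send the final partial batch, if any
    return total
```

Because put_vectors upserts by key, calling this twice with the same backup file leaves the index in the same state, which is what makes a partially completed restore safe to resume.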
tests
- tests/supply_chain/test_backup_coverage.py adds
TestBackupCoversVectorIndexes — 5 assertions that scan the CDK
constructs for AWS::S3Vectors::Index resources and verify backup.py
declares them, calls list_vectors with returnData/returnMetadata,
and restore.py both defines AND wires in restore_vector_index.
Same canary pattern that already covers DynamoDB tables and
regular S3 buckets.
- tests/supply_chain/test_backup_coverage.py: extract_backup_tables
is now section-scoped so its 'logical' regex stops slurping
entries from S3_BUCKETS or the new VECTOR_INDEXES. Drive-by fix.
- scripts/restore-data/test_restore.py adds 7 behavioral tests for
restore_vector_index covering: 1:1 round-trip with batch-of-50
flushing, missing backup file skip, missing target SSM skip,
dry-run, idempotent re-run, unknown logical name skip, and the
rag-vectors logical-name pin.
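To illustrate the section-scoping drive-by fix mentioned above: a bare "logical" regex run over the whole of backup.py matches entries from every list, so the extraction has to be limited to one list body first. The sketch below is an assumption about how that could be done, not the actual extract_backup_tables. The helper name, the TABLES list name, the SSM paths, and the layout of the sample source are all hypothetical.

```python
import re

# Hypothetical stand-in for the relevant part of backup.py.
SAMPLE = '''
TABLES = [
    {"logical": "rag-assistants", "ssm": "/example/tables/rag-assistants"},
]

S3_BUCKETS = [
    {"logical": "rag-documents", "ssm": "/example/buckets/rag-documents"},
]

VECTOR_INDEXES = [
    {"logical": "rag-vectors", "bucket_ssm": "/example/a", "index_ssm": "/example/b"},
]
'''


def extract_logical_names(source, section):
    """Return the "logical" values declared inside one top-level list."""
    # Capture only the body of `SECTION = [ ... ]`, ending at the first
    # closing bracket that starts a line.
    match = re.search(
        rf"^{re.escape(section)}\s*=\s*\[(.*?)^\]",
        source,
        flags=re.DOTALL | re.MULTILINE,
    )
    if not match:
        return []
    return re.findall(r'"logical":\s*"([^"]+)"', match.group(1))


# Unscoped matching picks up every list, which is the bug being fixed.
unscoped = re.findall(r'"logical":\s*"([^"]+)"', SAMPLE)
scoped = extract_logical_names(SAMPLE, "TABLES")
```

This approach relies on each list closing with a bracket at column 0; a source file formatted differently would need a real parser such as the ast module.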
All 48/48 restore-data tests pass; all 5/5 new vector coverage tests
pass. The 3 pre-existing supply-chain failures (system-prompts table
and skill-resources bucket from prior feature PRs) are untouched and
unrelated.
NOTE: the existing affected forker still has an empty vector index
post-restore. This PR does not include a one-off recovery script —
that's intentionally scoped as a separate task per the diagnosis
discussion.
Symptom (reported by a forker)
After a teardown → redeploy → restore cycle, RAG assistants showed their documents in the UI (a visual indicator driven by the DDB rag-assistants rows + S3 originals), but the connection to the knowledge base didn't carry forward: every retrieval came back empty.
Root cause
The assistants RAG knowledge base is backed by an AWS::S3Vectors::Index (rag-vector-index-v1) that lives in a separate AWS service from regular S3, reached through the s3vectors boto3 client and the AWS::S3Vectors::* CFN types. aws s3 sync cannot reach it, list_objects_v2 doesn't see it, and nothing in backup.py or restore.py was touching it.
The result: after teardown → redeploy → restore, the new vector index was empty. The DDB document metadata was restored, the S3 originals were restored, and every assistant appeared connected, but every s3vectors.query_vectors(filter={"assistant_id": ...}) call returned zero hits because no vectors existed for any assistant_id in the new index.
This wasn't caught by the existing supply-chain coverage test because that test only enforces DynamoDB-table and regular-S3-bucket coverage. Because the vector store was added via an untyped CfnResource, it slipped through the canary.
Fix
Same "snapshot and replay" model the rest of the restore tool uses, applied to the S3 Vectors index.
backup.py
- VECTOR_INDEXES list: parallel to S3_BUCKETS. Each entry maps a logical name to two SSM paths (bucket-name + index-name).
- backup_vector_index(): paginates s3vectors.list_vectors(returnData=True, returnMetadata=True) and streams every {key, data, metadata} record to vectors/{logical}.jsonl.gz in the backup bucket. The on-disk shape is byte-compatible with put_vectors on restore, so records round-trip with no transformation.
- run() orchestrator: iterates VECTOR_INDEXES after the regular S3 buckets pass.
restore.py
- VECTOR_INDEXES constant: mirrors backup.py.
- restore_vector_index(): reads the gzipped JSONL, batches records 50 at a time (matching bedrock_embeddings.store_embeddings_in_s3.BATCH_SIZE), and calls s3vectors.put_vectors. Idempotent on re-run (put_vectors with the same key is an upsert). Skips cleanly on older backups that pre-date the vectors snapshot and on target prefixes where RAG is disabled.
- run_restore() orchestrator: runs the vector restore step after the S3 buckets pass and before the AgentCore Memory replay.
Test coverage
- tests/supply_chain/test_backup_coverage.py::TestBackupCoversVectorIndexes (new): scans the CDK constructs for AWS::S3Vectors::Index and verifies backup.py declares it, calls list_vectors with returnData/returnMetadata, and restore.py both defines AND wires in restore_vector_index. Same canary pattern that already covers DDB tables and S3 buckets.
- scripts/restore-data/test_restore.py (7 new tests): 1:1 round-trip with batch-of-50 flushing, missing backup file skip, missing target SSM skip, dry-run, idempotent re-run, unknown logical name skip, and the rag-vectors logical-name pin.
Drive-by: extract_backup_tables in the supply-chain test is now section-scoped so its "logical": regex stops slurping entries from S3_BUCKETS or the new VECTOR_INDEXES.
The 3 pre-existing supply-chain failures (system-prompts table and skill-resources bucket from prior feature PRs) are untouched and unrelated to this PR.
API verification
I verified the API surface against the AWS docs before relying on it:
- s3vectors:ListVectors supports paginated full enumeration via nextToken and exposes returnData + returnMetadata flags. With both true, the response includes the float32 vector and the metadata document, which is exactly the shape put_vectors accepts. Permissions required: s3vectors:ListVectors + s3vectors:GetVectors for the backup, s3vectors:PutVectors for the restore. (See https://docs.aws.amazon.com/cli/latest/reference/s3vectors/list-vectors.html.)
- s3vectors.put_vectors is keyed on key and is an upsert, so a partially-completed prior restore can be resumed by re-invoking the script.
The backup workflow runs under secrets.AWS_ROLE_ARN, which is the deployment role that already has s3vectors:* permissions for the platform. No IAM changes are needed for the typical fork.
What this PR does NOT include
A one-off recovery script for the existing affected forker who's already in the empty-vector-index state. That was scoped as a separate task per the diagnosis discussion. The two viable options for them are:
I'm happy to write whichever the forker prefers as a follow-up.
Operator notes (for forks adopting this PR)
- Run backup-data.yml to capture a fresh snapshot that includes the vectors. Existing backup buckets predating this PR will still be readable by the restore (it will skip the vectors restore with a clear "no vectors backup file" reason).
- The new manifest entry (vectors) is additive; older restore tooling reading newer manifests would simply ignore it.