Zero-downtime migration tooling; widen File.file_size to bigint (expand stage)#5986
Zero-downtime migration tooling; widen File.file_size to bigint (expand stage)#5986rtibbles wants to merge 3 commits into
Conversation
- Safe-DDL backend: lock timeout, fail-fast, runtime unsafe-op guard - CI linting of new migrations on pull requests - Declarative dual-write trigger decorator (mirror_field) - Reusable batched-backfill command (idempotent, resumable, throttled) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi
Expand stage of the zero-downtime int->bigint widening: - Add nullable file_size_bigint shadow column and its index (built CONCURRENTLY). - Mirror file_size into it via the change-guarded @mirror_field trigger. - Wire the online backfill as a commented deploy-migrate step. - Stage the cutover and contract steps as comments on the File model. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi
bjester
left a comment
There was a problem hiding this comment.
Overall, the code looks good and makes sense. I have lots of comments and a few questions for deliberation. After that, I'll look at testing it
| ) | ||
| parser.add_argument("--start-id", default=None, help="resume from this pk") | ||
| parser.add_argument( | ||
| "--verify", |
There was a problem hiding this comment.
nitpick: --verify doesn't verify backfilled columns, it only checks progress. IMO --check or something like that aligns better with what it does.
There was a problem hiding this comment.
--progress-check maybe to be completely explict?
|
|
||
| total = 0 | ||
| while lo is not None: | ||
| # hi = batch_size-th pk at/after lo (None on the final short batch). |
There was a problem hiding this comment.
hi = batch_size-th pk at/after lo -- this a level of terse LLM writing that makes this comment difficult to understand. IDK what it's saying. I understand the code more than the comment which is telling.
There was a problem hiding this comment.
Yes - explanatory code would be better than comments here.
| # Throttle between batches so WAL generation / replication lag and autovacuum | ||
| # can keep up on large tables. Pass --sleep 0 to disable. | ||
| if throttle: | ||
| time.sleep(throttle) |
There was a problem hiding this comment.
This may not provide a meaningful benefit, especially for large tables like the file table. I don't think there is any worry about the WAL, but autovacuum would be triggered. I just don't know that the sleep time would necessarily align with its completion (likely much much longer). If there is concern, I think a more explicit approach to managing it would be beneficial:
- turn it off while this running, and
- execute vacuum manually at the end
There was a problem hiding this comment.
Yes, I also argued with Claude quite a bit about this, and I relented because it was insistent. I think I mostly gave in out of fear because File is such a large table - but maybe we just assume we don't need the throttle at all, test it out on hotfixes and see whether that's a bad idea?
| models.Index( | ||
| fields=["checksum", "file_size_bigint"], | ||
| name=FILE_DISTINCT_BIGINT_INDEX_NAME, | ||
| ), |
There was a problem hiding this comment.
A constraint on file_size_bigint being not null may speed up this creation of this new index.
| ## Reclaiming the physical name (usually skip) | ||
|
|
||
| `db_column` makes the `file_size_bigint` physical name invisible to all ORM/app code; only raw SQL sees it. Matching the physical name to `file_size` requires a second expand/contract cycle — needed only for raw-SQL or external consumers. |
There was a problem hiding this comment.
As a raw SQL writer, IMO it's tech debt if it's skipped. I think this should be expected after we have propagated to all environments and generated backups-- we don't want to leave remnants of migrations gone by for no real reason.
There was a problem hiding this comment.
I did wonder if you would have an opinion about this - I think perfectly fine to do this, just noting that we'll essentially have to do the reverse operation (but without a field type change) to achieve this with zero downtime - but it's only one more release because we can drop the int column and add the new bigint column in the same release, then start backfilling.
There was a problem hiding this comment.
A column rename should be quick with postgres, at least, that's what I recall with these types of migrations. And yeah we don't need to rush to do it afterwards, but like I said, just not forget about it!
|
|
||
| ## Guards (already configured) | ||
|
|
||
| - **Safe-DDL backend** — `DATABASES["default"]["ENGINE"]` in `settings.py` (prod adds Prometheus via `contentcuration.db.backends.zero_downtime_prometheus`). Lock/statement timeouts and `RAISE_FOR_UNSAFE` make rewriting or exclusively-locking DDL fail at migrate time. See the `ZERO_DOWNTIME_MIGRATIONS_*` settings. |
There was a problem hiding this comment.
See the
ZERO_DOWNTIME_MIGRATIONS_*settings.
Could link to docs?
| ## Guards (already configured) | ||
|
|
||
| - **Safe-DDL backend** — `DATABASES["default"]["ENGINE"]` in `settings.py` (prod adds Prometheus via `contentcuration.db.backends.zero_downtime_prometheus`). Lock/statement timeouts and `RAISE_FOR_UNSAFE` make rewriting or exclusively-locking DDL fail at migrate time. See the `ZERO_DOWNTIME_MIGRATIONS_*` settings. | ||
| - **`django-migration-linter`** (CI, `pythontest.yml`) — flags backward-incompatible schema (drops, renames, NOT NULL adds) the old pods would break on. The backend doesn't judge this axis. |
There was a problem hiding this comment.
The backend doesn't judge this axis.
wut
There was a problem hiding this comment.
I swear I read these docs multiple times - but this still got through. Will do another edit pass here.
| ] | ||
| ``` | ||
|
|
||
| Raw SQL because cutover detached both objects from Django state: the column lost its field, and `RemoveTrigger` was applied state-only to keep the physical trigger alive. `RemoveTrigger.database_forwards` and `RemoveField` look their target up in migration state, so with nothing there they have nothing to drop. Copy the trigger `pgid` from release 1's `AddTrigger` operation. |
| parser.add_argument("--model", required=True, help="app_label.ModelName") | ||
| parser.add_argument("--source-field", required=True) | ||
| parser.add_argument("--target-field", required=True) | ||
| parser.add_argument("--batch-size", type=int, default=1000) |
There was a problem hiding this comment.
Batch size could probably be pushed to 10k or a little more since this operates entirely within the DB
There was a problem hiding this comment.
And I think it also our default elsewhere?
| # studio#5974 cutover (next release, after backfill completes): | ||
| # - drop the @mirror_field decorator and the file_size_bigint field below | ||
| # - file_size = models.BigIntegerField(blank=True, null=True, db_column="file_size_bigint") | ||
| # - FILE_DISTINCT_BIGINT_INDEX_NAME -> fields=["checksum", "file_size"] |
There was a problem hiding this comment.
Not really apart of this step, but how confident are we in Django handling this particular action without significant flux?
There was a problem hiding this comment.
That's why we need to hand write the migrations with the "separate database state and operations" wrapper for this next step - otherwise it will almost certainly do something we definitely do not want.
Summary
Add tooling and playbook for zero downtime migrations for changing field types, e.g. int to bigint, or char to uuid - implement start of migration for file size on File object
Create reusable tooling for migration of file_size and other fields (like our CharField UUIDs to proper UUIDFields) - allow upload of files bigger than 2.1GB > int max value
Implemented with clear tooling to help ensure we implement with guardrails and clear repeatable steps.
References
Fixes #5973
Fixes #5974
Reviewer guidance
settings.py,production_settings.py): DB backend now routes throughdjango-pg-zero-downtime-migrations.0167: nullable shadow column + concurrent index + mirror trigger — all safe DDL (the backend forcesatomic=False, so the in-migrationCREATE INDEX CONCURRENTLYis fine).deploy-migrateruns the backfill this release. New writes dual-write via the trigger.AI usage
Used Claude Code to implement the tooling (safe DB backend wiring, the
mirror_fielddual-write trigger, thebackfill_columncommand) and draft the runbook. I reviewed the output across several local review rounds for correctness, tightened it to Studio's conventions, and ran the test suite.