Skip to content

Zero-downtime migration tooling; widen File.file_size to bigint (expand stage)#5986

Open
rtibbles wants to merge 3 commits into
learningequality:hotfixesfrom
rtibbles:widen_file_size_bigint
Open

Zero-downtime migration tooling; widen File.file_size to bigint (expand stage)#5986
rtibbles wants to merge 3 commits into
learningequality:hotfixesfrom
rtibbles:widen_file_size_bigint

Conversation

@rtibbles

@rtibbles rtibbles commented Jun 24, 2026

Copy link
Copy Markdown
Member

Summary

Add tooling and playbook for zero downtime migrations for changing field types, e.g. int to bigint, or char to uuid - implement start of migration for file size on File object
Create reusable tooling for migration of file_size and other fields (like our CharField UUIDs to proper UUIDFields) - allow upload of files bigger than 2.1GB > int max value
Implemented with clear tooling to help ensure we implement with guardrails and clear repeatable steps.

References

Fixes #5973
Fixes #5974

Reviewer guidance

  • DB engine swap (settings.py, production_settings.py): DB backend now routes through django-pg-zero-downtime-migrations.
  • Migration 0167: nullable shadow column + concurrent index + mirror trigger — all safe DDL (the backend forces atomic=False, so the in-migration CREATE INDEX CONCURRENTLY is fine).
  • deploy-migrate runs the backfill this release. New writes dual-write via the trigger.
  • New CI step lints PR migrations against the merge-base.

AI usage

Used Claude Code to implement the tooling (safe DB backend wiring, the mirror_field dual-write trigger, the backfill_column command) and draft the runbook. I reviewed the output across several local review rounds for correctness, tightened it to Studio's conventions, and ran the test suite.

rtibbles and others added 3 commits June 23, 2026 18:33
- Safe-DDL backend: lock timeout, fail-fast, runtime unsafe-op guard
- CI linting of new migrations on pull requests
- Declarative dual-write trigger decorator (mirror_field)
- Reusable batched-backfill command (idempotent, resumable, throttled)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi
Expand stage of the zero-downtime int->bigint widening:
- Add nullable file_size_bigint shadow column and its index (built CONCURRENTLY).
- Mirror file_size into it via the change-guarded @mirror_field trigger.
- Wire the online backfill as a commented deploy-migrate step.
- Stage the cutover and contract steps as comments on the File model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi
@rtibbles rtibbles marked this pull request as ready for review June 24, 2026 02:43
@bjester bjester self-assigned this Jun 26, 2026

@bjester bjester left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the code looks good and makes sense. I have lots of comments and a few questions for deliberation. After that, I'll look at testing it

)
parser.add_argument("--start-id", default=None, help="resume from this pk")
parser.add_argument(
"--verify",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: --verify doesn't verify backfilled columns, it only checks progress. IMO --check or something like that aligns better with what it does.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--progress-check maybe to be completely explict?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that sounds good


total = 0
while lo is not None:
# hi = batch_size-th pk at/after lo (None on the final short batch).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi = batch_size-th pk at/after lo -- this a level of terse LLM writing that makes this comment difficult to understand. IDK what it's saying. I understand the code more than the comment which is telling.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - explanatory code would be better than comments here.

Comment on lines +95 to +98
# Throttle between batches so WAL generation / replication lag and autovacuum
# can keep up on large tables. Pass --sleep 0 to disable.
if throttle:
time.sleep(throttle)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not provide a meaningful benefit, especially for large tables like the file table. I don't think there is any worry about the WAL, but autovacuum would be triggered. I just don't know that the sleep time would necessarily align with its completion (likely much much longer). If there is concern, I think a more explicit approach to managing it would be beneficial:

  • turn it off while this running, and
  • execute vacuum manually at the end

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also argued with Claude quite a bit about this, and I relented because it was insistent. I think I mostly gave in out of fear because File is such a large table - but maybe we just assume we don't need the throttle at all, test it out on hotfixes and see whether that's a bad idea?

models.Index(
fields=["checksum", "file_size_bigint"],
name=FILE_DISTINCT_BIGINT_INDEX_NAME,
),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A constraint on file_size_bigint being not null may speed up this creation of this new index.

Comment on lines +104 to +106
## Reclaiming the physical name (usually skip)

`db_column` makes the `file_size_bigint` physical name invisible to all ORM/app code; only raw SQL sees it. Matching the physical name to `file_size` requires a second expand/contract cycle — needed only for raw-SQL or external consumers.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a raw SQL writer, IMO it's tech debt if it's skipped. I think this should be expected after we have propagated to all environments and generated backups-- we don't want to leave remnants of migrations gone by for no real reason.

@rtibbles rtibbles Jun 29, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did wonder if you would have an opinion about this - I think perfectly fine to do this, just noting that we'll essentially have to do the reverse operation (but without a field type change) to achieve this with zero downtime - but it's only one more release because we can drop the int column and add the new bigint column in the same release, then start backfilling.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A column rename should be quick with postgres, at least, that's what I recall with these types of migrations. And yeah we don't need to rush to do it afterwards, but like I said, just not forget about it!


## Guards (already configured)

- **Safe-DDL backend** — `DATABASES["default"]["ENGINE"]` in `settings.py` (prod adds Prometheus via `contentcuration.db.backends.zero_downtime_prometheus`). Lock/statement timeouts and `RAISE_FOR_UNSAFE` make rewriting or exclusively-locking DDL fail at migrate time. See the `ZERO_DOWNTIME_MIGRATIONS_*` settings.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the ZERO_DOWNTIME_MIGRATIONS_* settings.

Could link to docs?

## Guards (already configured)

- **Safe-DDL backend** — `DATABASES["default"]["ENGINE"]` in `settings.py` (prod adds Prometheus via `contentcuration.db.backends.zero_downtime_prometheus`). Lock/statement timeouts and `RAISE_FOR_UNSAFE` make rewriting or exclusively-locking DDL fail at migrate time. See the `ZERO_DOWNTIME_MIGRATIONS_*` settings.
- **`django-migration-linter`** (CI, `pythontest.yml`) — flags backward-incompatible schema (drops, renames, NOT NULL adds) the old pods would break on. The backend doesn't judge this axis.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backend doesn't judge this axis.

wut

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I swear I read these docs multiple times - but this still got through. Will do another edit pass here.

]
```

Raw SQL because cutover detached both objects from Django state: the column lost its field, and `RemoveTrigger` was applied state-only to keep the physical trigger alive. `RemoveTrigger.database_forwards` and `RemoveField` look their target up in migration state, so with nothing there they have nothing to drop. Copy the trigger `pgid` from release 1's `AddTrigger` operation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Raw SQL" -> "Raw SQL is used"

parser.add_argument("--model", required=True, help="app_label.ModelName")
parser.add_argument("--source-field", required=True)
parser.add_argument("--target-field", required=True)
parser.add_argument("--batch-size", type=int, default=1000)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Batch size could probably be pushed to 10k or a little more since this operates entirely within the DB

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think it also our default elsewhere?

# studio#5974 cutover (next release, after backfill completes):
# - drop the @mirror_field decorator and the file_size_bigint field below
# - file_size = models.BigIntegerField(blank=True, null=True, db_column="file_size_bigint")
# - FILE_DISTINCT_BIGINT_INDEX_NAME -> fields=["checksum", "file_size"]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really apart of this step, but how confident are we in Django handling this particular action without significant flux?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why we need to hand write the migrations with the "separate database state and operations" wrapper for this next step - otherwise it will almost certainly do something we definitely do not want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants