RFC 089: Identifiers API by kenoir · Pull Request #156 · wellcomecollection/docs

kenoir · 2026-06-18T12:00:57Z

Preview

View the rendered RFC on this branch:

What does this change?

The proposal is a small, read-only Identifiers API. Wellcome Collection gives every catalogue thing (a work, an image, an item) a stable public "canonical" id, and keeps a registry recording which underlying source ids that canonical id was built from. This API does one job: given a canonical id it returns the source id(s) behind it, and given a source id it returns the canonical id (optionally with its siblings). It only ever reads that registry; it never creates or changes ids.
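
To make the two directions concrete, here is a minimal in-memory sketch of the lookups described above. It is not the prototype's code: the `Row` shape, the function names and the FOLIO UUID are hypothetical, and the real service reads the RFC 083 registry rather than a Python list. The Sierra row reuses an id pair quoted later in the review thread.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Row:
    """One registry row: a source id that a canonical id was built from (assumed shape)."""
    canonical_id: str
    ontology_type: str
    source_system: str
    source_id: str


# Illustrative data only. The folio-instance UUID is made up.
REGISTRY = [
    Row("adyxgsj4", "Work", "sierra-system-number", "b19871831"),
    Row("adyxgsj4", "Work", "folio-instance", "00000000-0000-4000-8000-000000000001"),
]


def forward(canonical_id: str) -> List[Tuple[str, str]]:
    """Canonical id -> every (source system, source id) behind it."""
    return [(r.source_system, r.source_id) for r in REGISTRY if r.canonical_id == canonical_id]


def reverse(source_system: str, ontology_type: str, source_id: str) -> Optional[str]:
    """Source tuple -> canonical id, or None when no mapping exists."""
    for r in REGISTRY:
        if (r.source_system, r.ontology_type, r.source_id) == (source_system, ontology_type, source_id):
            return r.canonical_id
    return None
```

Both functions only read `REGISTRY`, mirroring the read-only nature of the proposal.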

It exists because of the Sierra/CALM to FOLIO/Axiell migration. As records move between systems, a single canonical id accumulates several source ids over time (an original plus inherited "predecessor" aliases), and a couple of internal services sit right at the edges where that translation has to happen: the IIIF viewer needs to turn old b-numbers and CALM refs into the canonical id it presents under, and requesting needs to turn a canonical item id into the FOLIO UUID a hold is placed on, and back again. The guiding principle is that everything public speaks canonical, and source ids only appear at those two edges (ingest and the FOLIO boundary). Rather than have each consumer re-derive the mapping or query the catalogue by source id, this API is the single shared place where that translation lives. Because the main running cost is database queries, it is also the natural place to cache aggressively (at the edge, to keep requests off the database) and to attribute that database cost to the consumers driving it.

How it relates to the other RFCs:

- It reads the ID Registry defined in RFC 083 (stable identifiers). RFC 083 owns the data and all the writes; this API is just a read-only window onto it.
- It serves the IIIF/DDS lookup described in RFC 085 (IIIF identities, open PR #143), where the canonical Work id becomes the IIIF URI and older identifier forms redirect to it.
- It is the concrete "service" answer to the open question in RFC 088 (Migrating identity, requesting and items APIs from Sierra to FOLIO, open PR #153) about how requesting should translate item ids. RFC 088 left that access mechanism open (a direct database read, a sync, or a service); this RFC proposes the service, which is for RFC 088 to ratify.

The RFC is written to stand on its own and carries the API contract alongside it, so it can be reviewed without access to the closed discovery/prototype repository where the working prototype lives.

Files added:

- `rfcs/089-identifiers-api/README.md`: the RFC document.
- `rfcs/README.md`: refreshed RFC listing table (RFC 089 row added).
- `rfcs/089-identifiers-api/openapi.yaml`: the OpenAPI 3.0 spec for the two lookup operations (the source of truth).
- `rfcs/089-identifiers-api/openapi.md`: generated human-readable rendering of the spec, browsable on GitHub without a Swagger/Redoc renderer.
- `render_docs.py`, `pyproject.toml`, `.python-version`, `.gitignore`, `uv.lock`: a self-contained uv project that validates `openapi.yaml` and regenerates `openapi.md`.

How to test

- Read `rfcs/089-identifiers-api/README.md` and review the contract, architecture, caching strategy and open questions.
- Confirm the RFC passes repo validation: `.scripts/validate_rfc.py`.
- Confirm the listing table is in sync: `.scripts/create_table_summary.py --check-readme`.
- (Optional) Regenerate the rendered contract and confirm there is no diff: from the RFC directory, run `uv run python render_docs.py`, which validates `openapi.yaml` against the OpenAPI spec validator and rewrites `openapi.md`.

How can we measure success?

No measurable runtime success criteria; this is a documentation RFC. Success is the RFC being reviewed and providing a clear, self-contained contract and architecture that the team can align on, and a decision on whether this service is the access mechanism for identifier translation in RFC 088.

Have we considered potential risks?

- Documentation only; no production code or infrastructure is changed by this PR, so there is no runtime or deployment risk.
- The design risks themselves (the caching strategy and the database cost it controls, the unmet FOLIO-item ingestion dependency, item canonical-id stability) are enumerated in the RFC's Open questions section and are intended to be the subject of review.
- The OpenAPI spec is validated and the rendered Markdown is generated from it, reducing the chance of the contract and its human-readable rendering drifting apart.

Read-only canonical <-> source identifier translation over the RFC 083 ID Registry, for the IIIF/DDS (RFC 085) and requesting (RFC 088) consumers. Carries the OpenAPI contract alongside the RFC (openapi.yaml + a rendered openapi.md via a small uv project, following the RFC 088 pattern) so the proposal stands alone without the private prototype repository. Covers the contract, AWS architecture, API-key auth + usage-plan metering, the caching topology, and the live-data findings (folio-instance aliases present; folio-item-id absent, so the requesting translation has no data yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The concern is the cost of database (Aurora) queries, not policing a per-consumer billing quota. That inverts the caching strategy: an edge (CloudFront) cache that serves hits without touching the database is now preferred, rather than rejected for breaking metering. Recasts the per-consumer story as API keys for identity / cost attribution plus a throttle as a database safety valve, and reorients the caching open question toward hit-ratio, throttle sizing and cost attribution. Also removes the detailed real-data-findings list (kept in the prototype docs), leaving a one-line pointer. Contract unchanged (openapi.yaml/openapi.md untouched). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replaces em dashes with plain punctuation throughout and tones down a few flourishes ("single translation membrane", "exactly the win", "evaporates"). Directional notation (the migration and lookup arrows) is kept. No change to the contract, the decisions, or the meaning. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Removes the decision-log section and its table of contents entry, and renames the contract-summary heading to "API Contract" (the separate "API contract (OpenAPI)" section is unchanged). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Avoids two similarly-named sections after "The contract" became "API Contract". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

paul-butcher · 2026-06-18T15:51:43Z

+
+| Status | Body | Description |
+|---|---|---|
+| `200` | n/a | A mapping exists for the supplied tuple. |


agnesgaroux · 2026-06-19T12:23:51Z

+  about resolves to `404` rather than a spurious `400`.
+
+**Status codes:** `200` found; `304` conditional GET (matched `ETag`); `400` malformed `canonicalId`
+or an unsupported enum value (rejected at the gateway); `404` no mapping: an unknown id, an unknown


Is this

> `400` [...] an unsupported enum value

consistent with

> - **`sourceSystem`** is an open set, not enum-constrained, so a system the gateway has not been told about resolves to `404` rather than a spurious `400`.
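
One reading under which the two statements are compatible, sketched below: the `400` applies only to parameters that are enum-constrained (assumed here to be `type`), while `sourceSystem` is never validated and an unknown value simply finds no mapping. This is an interpretation for discussion, not the RFC's stated behaviour; `SUPPORTED_TYPES`, `KNOWN` and `reverse_status` are hypothetical names.

```python
# Assumption: `type` is the enum-constrained parameter (Work / Image / Item).
SUPPORTED_TYPES = {"Work", "Image", "Item"}

# Illustrative registry contents, keyed by (sourceSystem, type, sourceId).
KNOWN = {("sierra-system-number", "Work", "b19871831"): "adyxgsj4"}


def reverse_status(source_system: str, ontology_type: str, source_id: str) -> int:
    """Return the HTTP status a reverse lookup would produce under this reading."""
    if ontology_type not in SUPPORTED_TYPES:
        return 400  # enum violation, rejected before any lookup
    if (source_system, ontology_type, source_id) not in KNOWN:
        return 404  # includes a sourceSystem nobody has heard of
    return 200
```

Under this reading an unrecognised `sourceSystem` never reaches the `400` branch, because nothing checks it against a list.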

paul-butcher · 2026-06-19T13:18:51Z

+
+Consequences for the two lookups:
+
+- **Forward** (canonical → sources) rides `idx_canonical` and returns N rows: cheap, but not a


Suggested change

- **Forward** (canonical → sources) rides `idx_canonical` and returns N rows: cheap, but not a

- **Forward** (canonical → sources) reads `idx_canonical` and returns N rows: cheap, but not a

typo?

paul-butcher · 2026-06-19T13:24:38Z

+  bills usage). The keys, the throttle, and their stage binding are **not** in the OpenAPI body;
+  they are separate API Gateway resources in Terraform, so re-importing the definition does not
+  disturb them.
+- A **gateway-level regex** on `canonicalId` (`^[a-hjkmnp-z][a-hjkmnp-z2-9]{7}$`) rejects malformed


Might we want something similar in the reverse lookup? Validate sourceSystem against a list?

It's almost certainly overkill to try to validate source system/id pairs (e.g. sierra-system-number/b{0-7}.)
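
For reference, the gateway regex quoted in the diff above behaves as follows when applied in Python. The pattern is copied from the RFC text; the helper name is hypothetical, and the sample ids are the ones quoted elsewhere in this review.

```python
import re

# Pattern as quoted in the RFC diff: 8 characters, no leading digit,
# and an alphabet that omits i, l, o, 0 and 1.
CANONICAL_ID = re.compile(r"^[a-hjkmnp-z][a-hjkmnp-z2-9]{7}$")


def is_well_formed(canonical_id: str) -> bool:
    """True when the value has the shape of a canonical id (says nothing about existence)."""
    return CANONICAL_ID.fullmatch(canonical_id) is not None
```

A b-number such as `b19871831` fails this check (it is nine characters long and contains `1`), which is why a shape check on the forward lookup is cheap, whereas source ids have no single shape to test against.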

paul-butcher · 2026-06-19T13:32:31Z

+
+3. **Item canonical-id stability through the FOLIO migration.** Items are minted canonically today,
+   but the canonical id must survive Sierra → FOLIO via RFC 083 predecessor inheritance at item
+   level (a FOLIO item UUID added as a predecessor of the Sierra item number). RFC 083's transformer


Should this be the other way around?

Suggested change

level (a FOLIO item UUID added as a predecessor of the Sierra item number). RFC 083's transformer

level (a Sierra item number added as a predecessor of the FOLIO item UUID). RFC 083's transformer

paul-butcher · 2026-06-19T13:39:02Z

+   `obsolete` flag (source system retired). That is adjacent to `isAlias` (inherited predecessor) but
+   not identical. Reconcile the two so one representation serves both consumers.
+
+6. **The `type` enum vs reality.** The live registry holds types beyond `Work` / `Image` / `Item`


Can we populate this enum with the types we need to use the API for (those three), and extend it as we need, rather than worrying about the question up front?

agnesgaroux · 2026-06-19T13:58:39Z

+   changes are described at bib/work level; item-level predecessor emission needs confirming with the
+   pipeline workstream.
+
+4. **Bare-value reverse lookup (RFC 085).** The DDS wants to query by bare value without a


We get multiple rows with identical SourceId though, right? e.g.

WHERE SourceId = 'b19871831' AND OntologyType = 'Work' AND SourceSystem = 'sierra-system-number';
WHERE SourceId = 'b19871831' AND OntologyType = 'Work' AND SourceSystem = 'mets';

sierra canonical id -> adyxgsj4
mets canonical id -> ugpnzjrn

We can't tell which one the caller wants, unless we or they know the source system
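
The ambiguity can be reproduced with just the two rows quoted above. This is a sketch over an in-memory list, with a hypothetical `candidates` helper; the column order is an assumption.

```python
from typing import List, Optional

# (SourceId, OntologyType, SourceSystem, CanonicalId), from the two rows quoted above.
ROWS = [
    ("b19871831", "Work", "sierra-system-number", "adyxgsj4"),
    ("b19871831", "Work", "mets", "ugpnzjrn"),
]


def candidates(source_id: str, ontology_type: str, source_system: Optional[str] = None) -> List[str]:
    """Canonical ids matching a source id, optionally narrowed by source system."""
    return sorted(
        canonical
        for (sid, otype, system, canonical) in ROWS
        if sid == source_id
        and otype == ontology_type
        and (source_system is None or system == source_system)
    )
```

Called without a source system, `candidates("b19871831", "Work")` returns both canonical ids; passing `"mets"` narrows it to one.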

agnesgaroux · 2026-06-19T14:09:35Z

+   `If-None-Match` with a `304`, as **prototype defaults, not contract decisions**, and as response
+   headers only (no real edge or stage cache).
+
+2. **The FOLIO-item ingestion dependency (RFC 088).** The requesting translation (canonical item id


Can we get on with transforming FOLIO items, so we can clarify this sooner rather than later?
What if the item ids are HRIDs in the OAI-PMH feed?
Is there a point where we should think about having both HRID and UUID in the identifiers DB, so as to be able to convert from one to the other?
(Okapi offers this as an alternative:
/instance-storage/instances?query=hrid=={hrid} or
/instance-storage/instances?query=id=={uuid})

paul-butcher · 2026-06-19T14:17:34Z

+
+Each has a prototype direction but an unsettled integration point.
+
+1. **Caching and cost.** The cache placement (edge/CloudFront as primary vs the API Gateway stage


Can the caching and cost questions be "suck it and see", or do we need to think more carefully about when and where we expect the lookups to happen and how that might change the access patterns?

Current expected clients include:

- Producing the metadata document upon ingestion of digitised material (probably mostly unique or thereabouts).
- The Items API (unless this is handled within the pipeline): more chance of repeated requests.
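
A rough way to see why the access pattern matters: the sketch below computes the hit ratio of an idealised cache (unbounded, no expiry) for two made-up request streams. The stream sizes and key names are assumptions chosen only to contrast "mostly unique" with "repeated" lookups; they are not measurements.

```python
from typing import Iterable


def hit_ratio(requests: Iterable[str]) -> float:
    """Fraction of requests served from an idealised cache (unbounded, never expires)."""
    seen = set()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / total if total else 0.0


# Hypothetical streams: ingestion touches each id once, the Items API revisits a small set.
ingest_stream = [f"id-{i}" for i in range(1000)]
items_stream = [f"id-{i % 50}" for i in range(1000)]
```

With these streams the ingestion pattern never hits the cache, while the repeated pattern misses only on the first request for each of its 50 ids, so the benefit of an edge cache depends almost entirely on which client dominates.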

paul-butcher · 2026-06-19T14:47:08Z

+|---|---|---|
+| Reverse, bare (`source → canonicalId`) | Immutable once minted | Long TTL, even mid-migration. |
+| Forward (`canonicalId → sources`) | Alias set can grow during migration | Bounded TTL + ETag. |
+| Reverse, `include=siblings` | Carries the canonical → sources set | Same as forward. |


If we offer a reverse + specific sibling, it could be more cacheable.

e.g. ?include=sierra-system-number

This may not always be true, I suppose, but as the id database has migrated from a 1:1 mapping of old system ids to canonical ids, I think this means that there would continue to be a 1:1 mapping from ids in one scheme to ids in another.

That said, doing so may be undesirable or bring along other complications:

One of the envisaged use cases (digitisation) is that, given a folio identifier, we can find the corresponding bNumber for digitisation, but if that does not exist, use the canonical identifier. So we'd either want to always return canonical if it exists, or include canonical in the list of requested returns. In which case, we could not rely on the absence of the requested sibling(s) to provide a 4xx response.

I think that might make it a bit tricky to generalise, because if we are going from a new system id to old system id (like this), then we can expect the result to be unchanging, but old to new might just mean it's not been migrated yet.
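
The digitisation fallback described above could look like this on the client side, assuming (hypothetically) that a reverse lookup hands back the canonical id together with a mapping of sibling source systems to source ids. The function name and response shape are illustrative, not part of the contract.

```python
from typing import Mapping


def identifier_for_digitisation(canonical_id: str, siblings: Mapping[str, str]) -> str:
    """Prefer the b-number when one exists, otherwise fall back to the canonical id."""
    return siblings.get("sierra-system-number", canonical_id)
```

Because the caller needs the canonical id in both cases, a missing sibling cannot be reported as a 4xx without losing the fallback, which is the complication raised above.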

kenoir and others added 5 commits June 18, 2026 13:00

RFC 089: rename "API contract (OpenAPI)" to "OpenAPI specification"

2099c1c


RFC 089: Identifiers API #156
kenoir wants to merge 5 commits into main from rk/identifiers-api-rfc


Suggested change

| `200` | n/a | A mapping exists for the supplied tuple. |

| `200` | IdentifierSet\|CanonicalIdRef | A mapping exists for the supplied tuple. |

