RFC 089: Identifiers API #156
Read-only canonical <-> source identifier translation over the RFC 083 ID Registry, for the IIIF/DDS (RFC 085) and requesting (RFC 088) consumers. Carries the OpenAPI contract alongside the RFC (openapi.yaml + a rendered openapi.md via a small uv project, following the RFC 088 pattern) so the proposal stands alone without the private prototype repository. Covers the contract, AWS architecture, API-key auth + usage-plan metering, the caching topology, and the live-data findings (folio-instance aliases present; folio-item-id absent, so the requesting translation has no data yet).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The concern is the cost of database (Aurora) queries, not policing a per-consumer billing quota. That inverts the caching strategy: an edge (CloudFront) cache that serves hits without touching the database is now preferred, rather than rejected for breaking metering. Recasts the per-consumer story as API keys for identity / cost attribution plus a throttle as a database safety valve, and reorients the caching open question toward hit-ratio, throttle sizing and cost attribution. Also removes the detailed real-data-findings list (kept in the prototype docs), leaving a one-line pointer. Contract unchanged (openapi.yaml/openapi.md untouched).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces em dashes with plain punctuation throughout and tones down a few
flourishes ("single translation membrane", "exactly the win", "evaporates").
Directional notation (the migration and lookup arrows) is kept. No change to the
contract, the decisions, or the meaning.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Removes the decision-log section and its table of contents entry, and renames the contract-summary heading to "API Contract" (the separate "API contract (OpenAPI)" section is unchanged).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Avoids two similarly-named sections after "The contract" became "API Contract".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
||
| | Status | Body | Description | | ||
| |---|---|---| | ||
| | `200` | n/a | A mapping exists for the supplied tuple. | |
Should this be:
| | `200` | n/a | A mapping exists for the supplied tuple. | | |
| | `200` | IdentifierSet\|CanonicalIdRef | A mapping exists for the supplied tuple. | |
| about resolves to `404` rather than a spurious `400`. | ||
|
|
||
| **Status codes:** `200` found; `304` conditional GET (matched `ETag`); `400` malformed `canonicalId` | ||
| or an unsupported enum value (rejected at the gateway); `404` no mapping: an unknown id, an unknown |
Is this
`400` [...] an unsupported enum value
consistent with
- **`sourceSystem`** is an open set, not enum-constrained, so a system the gateway has not been told
about resolves to `404` rather than a spurious `400`.
|
|
||
| Consequences for the two lookups: | ||
|
|
||
| - **Forward** (canonical → sources) rides `idx_canonical` and returns N rows: cheap, but not a |
| - **Forward** (canonical → sources) rides `idx_canonical` and returns N rows: cheap, but not a | |
| - **Forward** (canonical → sources) reads `idx_canonical` and returns N rows: cheap, but not a |
typo?
| bills usage). The keys, the throttle, and their stage binding are **not** in the OpenAPI body; | ||
| they are separate API Gateway resources in Terraform, so re-importing the definition does not | ||
| disturb them. | ||
| - A **gateway-level regex** on `canonicalId` (`^[a-hjkmnp-z][a-hjkmnp-z2-9]{7}$`) rejects malformed |
Might we want something similar in the reverse lookup? Validate sourceSystem against a list?
It's almost certainly overkill to try to validate source system/id pairs (e.g. sierra-system-number/b{0-7}.)
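For reference, here is a minimal sketch of what the gateway-level `canonicalId` pattern quoted above accepts and rejects. The pattern is copied from the RFC text; the function name is hypothetical, and the real check lives in API Gateway request validation rather than in Python. It checks shape only, so a well-formed id can still resolve to `404`.

```python
import re

# Pattern quoted in the RFC for the gateway-level check on canonicalId.
CANONICAL_ID_PATTERN = re.compile(r"^[a-hjkmnp-z][a-hjkmnp-z2-9]{7}$")


def is_well_formed_canonical_id(value: str) -> bool:
    """Return True if value has the shape of a canonical id.

    Hypothetical helper: shape check only, no registry lookup.
    """
    return CANONICAL_ID_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    # The first two ids are the canonical ids quoted elsewhere in this review;
    # the third is a Sierra b-number, the fourth contains the excluded letter "l".
    for candidate in ["adyxgsj4", "ugpnzjrn", "b19871831", "adyxgsl4"]:
        print(candidate, is_well_formed_canonical_id(candidate))
```

A comparable shape check for `sourceSystem` would need a fixed list of systems, which sits in tension with the RFC's statement that `sourceSystem` is an open set.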
|
|
||
| 3. **Item canonical-id stability through the FOLIO migration.** Items are minted canonically today, | ||
| but the canonical id must survive Sierra → FOLIO via RFC 083 predecessor inheritance at item | ||
| level (a FOLIO item UUID added as a predecessor of the Sierra item number). RFC 083's transformer |
Should this be the other way around?
| level (a FOLIO item UUID added as a predecessor of the Sierra item number). RFC 083's transformer | |
| level (a Sierra item number added as a predecessor of the FOLIO item UUID). RFC 083's transformer |
| `obsolete` flag (source system retired). That is adjacent to `isAlias` (inherited predecessor) but | ||
| not identical. Reconcile the two so one representation serves both consumers. | ||
|
|
||
| 6. **The `type` enum vs reality.** The live registry holds types beyond `Work` / `Image` / `Item` |
Can we populate this enum with the types we need to use the API for (those three), and extend it as we need, rather than worrying about the question up front?
| changes are described at bib/work level; item-level predecessor emission needs confirming with the | ||
| pipeline workstream. | ||
|
|
||
| 4. **Bare-value reverse lookup (RFC 085).** The DDS wants to query by bare value without a |
We get multiple rows with an identical SourceId though, right? For example:
WHERE SourceId = 'b19871831'
AND OntologyType = 'Work'
AND SourceSystem = 'sierra-system-number';
WHERE SourceId = 'b19871831'
AND OntologyType = 'Work'
AND SourceSystem = 'mets';
sierra canonical id -> adyxgsj4
mets canonical id -> ugpnzjrn
We can't tell which one the caller wants, unless we or they know the source system
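The ambiguity can be reproduced with a small in-memory sketch. The two rows are the ones from the example above; the `Row` type and `reverse_lookup` function are hypothetical stand-ins for the registry table and the reverse query, not the prototype's code.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    source_id: str
    ontology_type: str
    source_system: str
    canonical_id: str


# The two rows from the example above; the real registry holds many more.
ROWS = [
    Row("b19871831", "Work", "sierra-system-number", "adyxgsj4"),
    Row("b19871831", "Work", "mets", "ugpnzjrn"),
]


def reverse_lookup(
    source_id: str,
    ontology_type: str,
    source_system: Optional[str] = None,
) -> list[str]:
    """Return every canonical id matching the filters that were supplied."""
    return [
        row.canonical_id
        for row in ROWS
        if row.source_id == source_id
        and row.ontology_type == ontology_type
        and (source_system is None or row.source_system == source_system)
    ]


if __name__ == "__main__":
    # Bare value: two candidates, so the caller cannot be given a single answer.
    print(reverse_lookup("b19871831", "Work"))
    # With the source system supplied, the result is unique.
    print(reverse_lookup("b19871831", "Work", "mets"))
```

With only the bare value the lookup returns two canonical ids; adding `source_system` narrows it to one.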
| `If-None-Match` with a `304`, as **prototype defaults, not contract decisions**, and as response | ||
| headers only (no real edge or stage cache). | ||
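As an illustration of the `If-None-Match` / `304` behaviour mentioned here, the following is a minimal sketch of a conditional GET decision. Everything in it is an assumption: deriving the `ETag` by hashing the serialised payload, the payload field names, and the function names are all hypothetical and are not taken from the prototype or the contract.

```python
import hashlib
import json
from typing import Optional


def make_etag(payload: dict) -> str:
    """Derive a quoted ETag from a stable JSON serialisation (an assumption)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(payload: dict, if_none_match: Optional[str]) -> tuple[int, dict, Optional[dict]]:
    """Return (status, headers, body) for a GET with an optional If-None-Match."""
    etag = make_etag(payload)
    headers = {"ETag": etag}
    if if_none_match is not None:
        # Simplified parsing: split the header on commas and ignore the weak
        # prefix, since If-None-Match uses weak comparison.
        candidates = [tag.strip().removeprefix("W/") for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            return 304, headers, None
    return 200, headers, payload


if __name__ == "__main__":
    # Hypothetical response shape, for illustration only.
    mapping = {"canonicalId": "adyxgsj4", "sourceSystem": "sierra-system-number"}
    status, headers, _ = respond(mapping, None)
    print(status, headers["ETag"])
    print(respond(mapping, headers["ETag"])[0])
```

Because the tag is derived from the payload, a forward lookup whose alias set grows during migration would produce a new tag, and a stale `If-None-Match` would fall through to a full `200`.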
|
|
||
| 2. **The FOLIO-item ingestion dependency (RFC 088).** The requesting translation (canonical item id |
Can we get on with transforming FOLIO items, so we can clarify this sooner rather than later?
What if the item ids are HRIDs in the OAI-PMH feed?
Is there a point where we should think about having both HRID and UUID in the identifiers DB, so as to be able to convert from one to the other?
(Okapi does it as an alternative:
/instance-storage/instances?query=hrid=={hrid} or
/instance-storage/instances?query=id=={uuid} )
|
|
||
| Each has a prototype direction but an unsettled integration point. | ||
|
|
||
| 1. **Caching and cost.** The cache placement (edge/CloudFront as primary vs the API Gateway stage |
Can the caching and cost questions be "suck it and see", or do we need to think more carefully about when and where we expect the lookups to happen and how that might change the access patterns?
Current expected clients include:
- Producing the metadata document upon ingestion of digitised material (requests probably mostly unique, or thereabouts)
- The Items API (unless this is handled within the pipeline), with more chance of repeated requests
| |---|---|---| | ||
| | Reverse, bare (`source → canonicalId`) | Immutable once minted | Long TTL, even mid-migration. | | ||
| | Forward (`canonicalId → sources`) | Alias set can grow during migration | Bounded TTL + ETag. | | ||
| | Reverse, `include=siblings` | Carries the canonical → sources set | Same as forward. | |
If we offer a reverse + specific sibling, it could be more cacheable.
e.g. ?include=sierra-system-number
This may not always be true, I suppose, but as the id database has migrated from a 1:1 mapping of old system ids to canonical ids, I think this means that there would continue to be a 1:1 mapping from ids in one scheme to ids in another.
That said, doing so may be undesirable or bring along other complications:
One of the envisaged use cases (digitisation) is that, given a folio identifier, we can find the corresponding bNumber for digitisation, but if that does not exist, use the canonical identifier. So we'd either want to always return canonical if it exists, or include canonical in the list of requested returns. In which case, we could not rely on the absence of the requested sibling(s) to provide a 4xx response.
I think that might make it a bit tricky to generalise, because if we are going from a new system id to old system id (like this), then we can expect the result to be unchanging, but old to new might just mean it's not been migrated yet.
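The digitisation fallback described above can be sketched on the client side as follows. This assumes a reverse lookup has already returned the canonical id together with its siblings; representing the siblings as a dict keyed by source system is an assumption for illustration, and the function name and UUID value are hypothetical.

```python
def identifier_for_digitisation(canonical_id: str, siblings: dict[str, str]) -> str:
    """Prefer the b-number when the registry has one, else the canonical id.

    `siblings` maps source system to source id (a hypothetical shape).
    """
    b_number = siblings.get("sierra-system-number")
    return b_number if b_number else canonical_id


if __name__ == "__main__":
    migrated = {
        "sierra-system-number": "b19871831",
        "folio-instance": "00000000-0000-0000-0000-000000000000",  # placeholder UUID
    }
    folio_only = {"folio-instance": "00000000-0000-0000-0000-000000000000"}
    print(identifier_for_digitisation("adyxgsj4", migrated))
    print(identifier_for_digitisation("adyxgsj4", folio_only))
```

The fallback depends on the canonical id always being present in the response, which is why the absence of the requested sibling cannot simply be turned into a 4xx.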
Preview
View the rendered RFC on this branch:
What does this change?
The proposal is a small, read-only Identifiers API. Wellcome Collection gives every catalogue thing (a work, an image, an item) a stable public "canonical" id, and keeps a registry recording which underlying source ids that canonical id was built from. This API does one job: given a canonical id it returns the source id(s) behind it, and given a source id it returns the canonical id (optionally with its siblings). It only ever reads that registry; it never creates or changes ids.
It exists because of the Sierra/CALM to FOLIO/Axiell migration. As records move between systems a single canonical id accumulates several source ids over time (an original plus inherited "predecessor" aliases), and a couple of internal services sit right at the edges where that translation has to happen: the IIIF viewer needs to turn old b-numbers and CALM refs into the canonical id it presents under, and requesting needs to turn a canonical item id into the FOLIO UUID a hold is placed on, and back again. The guiding principle is that everything public speaks canonical and source ids only appear at those two edges (ingest and the FOLIO boundary). Rather than have each consumer re-derive the mapping or query the catalogue by source id, this API is the single shared place that translation lives. Because the main running cost is database queries, it is also the natural place to cache aggressively (at the edge, to keep requests off the database) and to attribute that database cost to the consumers driving it.
How it relates to the other RFCs:
The RFC is written to stand on its own and carries the API contract alongside it, so it can be reviewed without access to the closed discovery/prototype repository where the working prototype lives.
Files added:
- rfcs/089-identifiers-api/README.md: the RFC document.
- rfcs/README.md: refreshed RFC listing table (RFC 089 row added).
- rfcs/089-identifiers-api/openapi.yaml: the OpenAPI 3.0 spec for the two lookup operations (the source of truth).
- rfcs/089-identifiers-api/openapi.md: generated human-readable rendering of the spec, browsable on GitHub without a Swagger/Redoc renderer.
- render_docs.py, pyproject.toml, .python-version, .gitignore, uv.lock: a self-contained uv project that validates openapi.yaml and regenerates openapi.md.
How to test
- rfcs/089-identifiers-api/README.md and review the contract, architecture, caching strategy and open questions.
- .scripts/validate_rfc.py.
- .scripts/create_table_summary.py --check-readme.
- uv run python render_docs.py: this validates openapi.yaml against the OpenAPI spec validator and rewrites openapi.md.
How can we measure success?
No measurable runtime success criteria; this is a documentation RFC. Success is the RFC being reviewed and providing a clear, self-contained contract and architecture that the team can align on, and a decision on whether this service is the access mechanism for identifier translation in RFC 088.
Have we considered potential risks?