
RFC 089: Identifiers API #156

Draft
kenoir wants to merge 5 commits into main from rk/identifiers-api-rfc



kenoir commented Jun 18, 2026


Preview

View the rendered RFC on this branch:


What does this change?

The proposal is a small, read-only Identifiers API. Wellcome Collection gives every catalogue thing (a work, an image, an item) a stable public "canonical" id, and keeps a registry recording which underlying source ids that canonical id was built from. This API does one job: given a canonical id it returns the source id(s) behind it, and given a source id it returns the canonical id (optionally with its siblings). It only ever reads that registry; it never creates or changes ids.
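
As a rough illustration of the two lookups described above, here is a minimal in-memory sketch. It is not the prototype's implementation: the `Row` shape, the function names, the dictionary keys and the `folio-instance` UUID are hypothetical, and only the `b19871831` rows are taken from examples quoted later in this conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One registry row. Field names are assumptions, not the RFC 083 schema."""
    canonical_id: str
    ontology_type: str
    source_system: str
    source_id: str


REGISTRY = [
    Row("adyxgsj4", "Work", "sierra-system-number", "b19871831"),
    # Hypothetical second source id on the same canonical id; the UUID is made up.
    Row("adyxgsj4", "Work", "folio-instance", "00000000-0000-4000-8000-000000000001"),
    Row("ugpnzjrn", "Work", "mets", "b19871831"),
]


def forward(canonical_id):
    """canonical -> sources: every row behind the id, or None when there is no mapping."""
    rows = [row for row in REGISTRY if row.canonical_id == canonical_id]
    return rows or None


def reverse(source_system, ontology_type, source_id, include_siblings=False):
    """source -> canonical, optionally with the sibling source ids."""
    wanted = (source_system, ontology_type, source_id)
    for row in REGISTRY:
        if (row.source_system, row.ontology_type, row.source_id) == wanted:
            result = {"canonicalId": row.canonical_id}
            if include_siblings:
                result["sources"] = forward(row.canonical_id)
            return result
    return None  # would surface as a 404
```

Both functions only read `REGISTRY`, mirroring the read-only stance of the proposal; nothing here mints or changes ids.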

It exists because of the Sierra/CALM to FOLIO/Axiell migration. As records move between systems a single canonical id accumulates several source ids over time (an original plus inherited "predecessor" aliases), and a couple of internal services sit right at the edges where that translation has to happen: the IIIF viewer needs to turn old b-numbers and CALM refs into the canonical id it presents under, and requesting needs to turn a canonical item id into the FOLIO UUID a hold is placed on, and back again. The guiding principle is that everything public speaks canonical and source ids only appear at those two edges (ingest and the FOLIO boundary). Rather than have each consumer re-derive the mapping or query the catalogue by source id, this API is the single shared place that translation lives. Because the main running cost is database queries, it is also the natural place to cache aggressively (at the edge, to keep requests off the database) and to attribute that database cost to the consumers driving it.

How it relates to the other RFCs:

  • It reads the ID Registry defined in RFC 083 (stable identifiers). RFC 083 owns the data and all the writes; this API is just a read-only window onto it.
  • It serves the IIIF/DDS lookup that RFC 085 (IIIF identities, open PR #143) describes, where the canonical Work id becomes the IIIF URI and older identifier forms redirect to it.
  • It is the concrete "service" answer to the open question in RFC 088 (Migrating identity, requesting and items APIs from Sierra to FOLIO, open PR #153) about how requesting should translate item ids. RFC 088 left that access mechanism open (a direct database read, a sync, or a service); this RFC proposes the service and is for RFC 088 to ratify.

The RFC is written to stand on its own and carries the API contract alongside it, so it can be reviewed without access to the closed discovery/prototype repository where the working prototype lives.

Files added:

  • rfcs/089-identifiers-api/README.md: the RFC document.
  • rfcs/README.md: refreshed RFC listing table (RFC 089 row added).
  • rfcs/089-identifiers-api/openapi.yaml: the OpenAPI 3.0 spec for the two lookup operations (the source of truth).
  • rfcs/089-identifiers-api/openapi.md: generated human-readable rendering of the spec, browsable on GitHub without a Swagger/Redoc renderer.
  • render_docs.py, pyproject.toml, .python-version, .gitignore, uv.lock: a self-contained uv project that validates openapi.yaml and regenerates openapi.md.

How to test

  • Read rfcs/089-identifiers-api/README.md and review the contract, architecture, caching strategy and open questions.
  • Confirm the RFC passes repo validation: .scripts/validate_rfc.py.
  • Confirm the listing table is in sync: .scripts/create_table_summary.py --check-readme.
  • (Optional) Regenerate the rendered contract and confirm no diff: from the RFC directory, run uv run python render_docs.py, which validates openapi.yaml against the OpenAPI spec validator and rewrites openapi.md.

How can we measure success?

No measurable runtime success criteria; this is a documentation RFC. Success is the RFC being reviewed and providing a clear, self-contained contract and architecture that the team can align on, and a decision on whether this service is the access mechanism for identifier translation in RFC 088.

Have we considered potential risks?

  • Documentation only; no production code or infrastructure is changed by this PR, so there is no runtime or deployment risk.
  • The design risks themselves (the caching strategy and the database cost it controls, the unmet FOLIO-item ingestion dependency, item canonical-id stability) are enumerated in the RFC's Open questions section and are intended to be the subject of review.
  • The OpenAPI spec is validated and the rendered Markdown is generated from it, reducing the chance of the contract and its human-readable rendering drifting apart.

kenoir and others added 5 commits June 18, 2026 13:00
Read-only canonical <-> source identifier translation over the RFC 083 ID
Registry, for the IIIF/DDS (RFC 085) and requesting (RFC 088) consumers.

Carries the OpenAPI contract alongside the RFC (openapi.yaml + a rendered
openapi.md via a small uv project, following the RFC 088 pattern) so the
proposal stands alone without the private prototype repository. Covers the
contract, AWS architecture, API-key auth + usage-plan metering, the caching
topology, and the live-data findings (folio-instance aliases present;
folio-item-id absent, so the requesting translation has no data yet).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The concern is the cost of database (Aurora) queries, not policing a per-consumer
billing quota. That inverts the caching strategy: an edge (CloudFront) cache that
serves hits without touching the database is now preferred, rather than rejected
for breaking metering. Recasts the per-consumer story as API keys for identity /
cost attribution plus a throttle as a database safety valve, and reorients the
caching open question toward hit-ratio, throttle sizing and cost attribution.

Also removes the detailed real-data-findings list (kept in the prototype docs),
leaving a one-line pointer. Contract unchanged (openapi.yaml/openapi.md untouched).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replaces em dashes with plain punctuation throughout and tones down a few
flourishes ("single translation membrane", "exactly the win", "evaporates").
Directional notation (the migration and lookup arrows) is kept. No change to the
contract, the decisions, or the meaning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Removes the decision-log section and its table of contents entry, and renames the
contract-summary heading to "API Contract" (the separate "API contract (OpenAPI)"
section is unchanged).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Avoids two similarly-named sections after "The contract" became "API Contract".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

| Status | Body | Description |
|---|---|---|
| `200` | n/a | A mapping exists for the supplied tuple. |


Should this be:

Suggested change
| `200` | n/a | A mapping exists for the supplied tuple. |
| `200` | IdentifierSet\|CanonicalIdRef | A mapping exists for the supplied tuple. |

about resolves to `404` rather than a spurious `400`.

**Status codes:** `200` found; `304` conditional GET (matched `ETag`); `400` malformed `canonicalId`
or an unsupported enum value (rejected at the gateway); `404` no mapping: an unknown id, an unknown


Is this

`400` [...]  an unsupported enum value

consistent with

- **`sourceSystem`** is an open set, not enum-constrained, so a system the gateway has not been told
  about resolves to `404` rather than a spurious `400`.


Consequences for the two lookups:

- **Forward** (canonical → sources) rides `idx_canonical` and returns N rows: cheap, but not a


Suggested change
- **Forward** (canonical → sources) rides `idx_canonical` and returns N rows: cheap, but not a
- **Forward** (canonical → sources) reads `idx_canonical` and returns N rows: cheap, but not a

typo?

bills usage). The keys, the throttle, and their stage binding are **not** in the OpenAPI body;
they are separate API Gateway resources in Terraform, so re-importing the definition does not
disturb them.
- A **gateway-level regex** on `canonicalId` (`^[a-hjkmnp-z][a-hjkmnp-z2-9]{7}$`) rejects malformed


Might we want something similar in the reverse lookup? Validate sourceSystem against a list?

It's almost certainly overkill to try to validate source system/id pairs (e.g. sierra-system-number/b{0-7}.)
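
For reference, the gateway-level check quoted in the hunk above can be reproduced locally. This is a minimal sketch: the pattern is copied from the diff, while the helper name `is_well_formed_canonical_id` is hypothetical, and the real check lives in the API Gateway definition rather than in Python.

```python
import re

# Pattern copied from the RFC text: 8 characters, a letter first, with i, l, o
# and the digits 0 and 1 excluded.
CANONICAL_ID_PATTERN = re.compile(r"^[a-hjkmnp-z][a-hjkmnp-z2-9]{7}$")


def is_well_formed_canonical_id(value: str) -> bool:
    """True when the value has the shape of a canonical id.

    A well-formed id may still be unknown to the registry; that case is a 404,
    whereas a malformed id is rejected up front with a 400.
    """
    return CANONICAL_ID_PATTERN.fullmatch(value) is not None
```

A comparable check on the reverse lookup would need either an enumerated `sourceSystem` list or per-system id patterns, which is the trade-off raised in the comment above.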


3. **Item canonical-id stability through the FOLIO migration.** Items are minted canonically today,
but the canonical id must survive Sierra → FOLIO via RFC 083 predecessor inheritance at item
level (a FOLIO item UUID added as a predecessor of the Sierra item number). RFC 083's transformer


Should this be the other way around?

Suggested change
level (a FOLIO item UUID added as a predecessor of the Sierra item number). RFC 083's transformer
level (a Sierra item number added as a predecessor of the FOLIO item UUID). RFC 083's transformer

`obsolete` flag (source system retired). That is adjacent to `isAlias` (inherited predecessor) but
not identical. Reconcile the two so one representation serves both consumers.

6. **The `type` enum vs reality.** The live registry holds types beyond `Work` / `Image` / `Item`


Can we populate this enum with the types we need to use the API for (those three), and extend it as we need, rather than worrying about the question up front?

changes are described at bib/work level; item-level predecessor emission needs confirming with the
pipeline workstream.

4. **Bare-value reverse lookup (RFC 085).** The DDS wants to query by bare value without a

agnesgaroux commented Jun 19, 2026

We get multiple rows with identical SourceId though, right? E.g.

WHERE SourceId = 'b19871831'
AND OntologyType = 'Work'
AND SourceSystem = 'sierra-system-number';

WHERE SourceId = 'b19871831'
AND OntologyType = 'Work'
AND SourceSystem = 'mets';

sierra canonical id -> adyxgsj4
mets canonical id -> ugpnzjrn

We can't tell which one the caller wants, unless we or they know the source system
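
The ambiguity can be shown with the two rows quoted above. This is a minimal sketch over an in-memory list, not the registry query itself; the tuple layout and function names are hypothetical.

```python
# (source_id, ontology_type, source_system, canonical_id), from the rows quoted above.
ROWS = [
    ("b19871831", "Work", "sierra-system-number", "adyxgsj4"),
    ("b19871831", "Work", "mets", "ugpnzjrn"),
]


def lookup_bare(source_id, ontology_type):
    """Bare-value lookup: may return more than one canonical id."""
    return sorted(
        canonical_id
        for sid, otype, _system, canonical_id in ROWS
        if sid == source_id and otype == ontology_type
    )


def lookup_qualified(source_id, ontology_type, source_system):
    """Lookup qualified by source system: at most one canonical id."""
    for sid, otype, system, canonical_id in ROWS:
        if (sid, otype, system) == (source_id, ontology_type, source_system):
            return canonical_id
    return None
```

With these rows the bare lookup yields two candidates, so a bare-value endpoint would have to return a list or apply a precedence rule, neither of which is settled here.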

`If-None-Match` with a `304`, as **prototype defaults, not contract decisions**, and as response
headers only (no real edge or stage cache).

2. **The FOLIO-item ingestion dependency (RFC 088).** The requesting translation (canonical item id


Can we get on with transforming FOLIO items, so we can clarify this sooner rather than later?
What if the item ids are HRIDs in the OAI-PMH feed?
Is there a point where we should think about having both HRID and UUID in the identifiers DB, so as to be able to convert from one to the other?
(Okapi offers this as an alternative:
/instance-storage/instances?query=hrid=={hrid} or
/instance-storage/instances?query=id=={uuid})


Each has a prototype direction but an unsettled integration point.

1. **Caching and cost.** The cache placement (edge/CloudFront as primary vs the API Gateway stage


Can the caching and cost questions be "suck it and see", or do we need to think more carefully about when and where we expect the lookups to happen and how that might change the access patterns?

Current expected clients include:

  • Producing the metadata document upon ingestion of digitised material (probably mostly unique requests, or thereabouts)
  • The Items API (unless this is handled within the pipeline): more chance of repeated requests

|---|---|---|
| Reverse, bare (`source → canonicalId`) | Immutable once minted | Long TTL, even mid-migration. |
| Forward (`canonicalId → sources`) | Alias set can grow during migration | Bounded TTL + ETag. |
| Reverse, `include=siblings` | Carries the canonical → sources set | Same as forward. |


If we offer a reverse + specific sibling, it could be more cacheable.

e.g. ?include=sierra-system-number

This may not always be true, I suppose, but as the id database has migrated from a 1:1 mapping of old system ids to canonical ids, I think this means that there would continue to be a 1:1 mapping from ids in one scheme to ids in another.

That said, doing so may be undesirable or bring along other complications:

One of the envisaged use cases (digitisation) is that, given a folio identifier, we can find the corresponding bNumber for digitisation, but if that does not exist, use the canonical identifier. So we'd either want to always return canonical if it exists, or include canonical in the list of requested returns. In which case, we could not rely on the absence of the requested sibling(s) to provide a 4xx response.

I think that might make it a bit tricky to generalise, because if we are going from a new system id to old system id (like this), then we can expect the result to be unchanging, but old to new might just mean it's not been migrated yet.
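
For context, the caching table this thread hangs off can be sketched as response headers. This is an illustrative sketch only: the TTL values, the `kind` labels and the function names are assumptions (the RFC leaves TTLs open), and it models headers rather than a real edge or stage cache.

```python
import hashlib
import json
from typing import Optional

# Hypothetical TTLs; the RFC only says "long" and "bounded".
LONG_TTL_SECONDS = 30 * 24 * 60 * 60
BOUNDED_TTL_SECONDS = 5 * 60


def etag_for(body: dict) -> str:
    """Derive a strong ETag from a stable JSON serialisation of the body."""
    canonical_json = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical_json.encode("utf-8")).hexdigest()
    return f'"{digest[:16]}"'


def respond(kind: str, body: dict, if_none_match: Optional[str] = None):
    """Return (status, headers, body) for a lookup that found a mapping."""
    if kind == "reverse_bare":
        # Immutable once minted, so a long TTL is safe even mid-migration.
        return 200, {"Cache-Control": f"public, max-age={LONG_TTL_SECONDS}"}, body
    if kind not in ("forward", "reverse_siblings"):
        raise ValueError(f"unknown lookup kind: {kind}")
    # The alias set can grow during migration: bounded TTL plus ETag revalidation.
    etag = etag_for(body)
    headers = {"Cache-Control": f"public, max-age={BOUNDED_TTL_SECONDS}", "ETag": etag}
    if if_none_match == etag:
        return 304, headers, None
    return 200, headers, body
```

A reverse lookup that names one specific sibling, as suggested above, would need its own `kind`, and whether it could take the long TTL depends on the direction of travel discussed in this comment.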
