
Openconceptlab/ocl_issues#2505 | Fix US/British spelling mismatch in semantic concept search #868

Open
snyaggarwal wants to merge 4 commits into master from issues#2505


snyaggarwal commented May 7, 2026

Refs OpenConceptLab/ocl_issues#2505 (Problem B only — Problem A still open)
Refs OpenConceptLab/ocl_issues#2511 (Lexical Variant Dictionary infrastructure)

Heads up — approach changed mid-flight

Sunny — apologies for the redirect. After deeper review of #2505 with @paynejd, we landed on a substantially different architecture and felt it was faster to push the new approach directly than to iterate via comments. Your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept — only the variant source changed from inline regex to a dictionary-driven lookup. Happy to walk through the diff on a call if useful.

Three reasons for the redirect:

  1. False positives in the original regex list. hem/haem matches themselves, anthem, hemisphere, hemp, hemlock, remember — hard to fix defensively in regex without enumerating exclusions for every English word.
  2. Reusability. The fix should be available beyond ConceptFuzzySearch. Standard /concepts/?q=… has the same gap, and upcoming abbreviation-expansion work needs the same primitive. A dictionary-as-Source approach makes that reuse trivial.
  3. Eat our own dog food. Storing dictionaries as OCL Sources gives versioning, release management, locale support, editability through the OCL UI, and discoverability for free — and aligns with how UMLS / SNOMED / OBO model lexical variants.

Tradeoff worth flagging

This approach puts pressure on oclapi2 to do caching well. The MVP scale is trivial (~40 entries, in-memory dict), but as dictionary types and sizes grow — especially abbreviation dictionaries (UMLS LRABR is ~250k rows) and once variants are enabled by default on high-QPS endpoints — the per-token DB lookup will need to graduate beyond the MVP's module-level cache. Likely Redis-backed token-indexed cache for >1k-entry dictionaries; ES-backed lookup for very large ones. Plan to revisit when (a) a dictionary exceeds ~5k entries, or (b) variants are enabled by default on a high-QPS endpoint. Captured in #2511.

What this PR does

  • New helper: core/common/lexical_variants.py, which provides get_lexical_variants(text, source_uri) and get_variant_terms(text, source_uri). It tokenizes the input (lowercase, whitespace split, ASCII punctuation strip), looks each token up against the dictionary's Names, and returns the sibling Names from the matching Concept. A module-level cache is keyed on (source_uri, version). If the dictionary is missing, the helper returns [] and never raises (graceful degradation).
  • Seed data: core/common/data/lexical-variants-en.json — OCL bulk import file creating OCL/lexical-variants-en (Source + 43 vetted Concepts + Source Version v1.0). Each Concept has en-US and en-GB Names with name_type=Fully Specified. False-positive-prone substring entries dropped.
  • Wired into $match: MetadataToConceptsListView reads variants from request body, passes the resolved URI (or None) to ConceptFuzzySearch.search(..., variants_repo=...). The kNN sub-query construction and rescore expansion are unchanged in shape; only the variant source changed.
  • Empty-synonyms rescore crash fix: preserved from your initial commit.
  • Removed: get_spelling_variant() from core/common/utils.py.
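
The tokenize-then-look-up behavior of the helper can be sketched as follows. This is a minimal stand-in, assuming an in-memory list of Concept Names instead of a real OCL Source. `CONCEPT_NAMES`, `INDEX`, `tokenize`, and `lookup_variant_terms` are hypothetical names, and the real helpers also take a `source_uri`.

```python
import string
from typing import Dict, List

# Hypothetical stand-in for the dictionary Source: each inner list holds the
# Names of one Concept (the real helper reads these from OCL, not a literal).
CONCEPT_NAMES: List[List[str]] = [
    ["leukemia", "leukaemia"],
    ["edema", "oedema"],
]

# Token -> all Names of the Concept that contains it.
INDEX: Dict[str, List[str]] = {
    name: names for names in CONCEPT_NAMES for name in names
}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, strip ASCII punctuation from each token."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def lookup_variant_terms(text: str) -> List[str]:
    """Return sibling spellings for every whole token found in the dictionary."""
    variants: List[str] = []
    for token in tokenize(text):
        # Unknown tokens fall through to [], so a missing entry never raises.
        for sibling in INDEX.get(token, []):
            if sibling != token and sibling not in variants:
                variants.append(sibling)
    return variants
```

Because the lookup is on whole tokens, a word such as "themselves" has no entry and yields no variants, whereas a substring pattern would have to exclude it explicitly.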

API: variants request body field

Lexical variant expansion is OFF by default. Clients opt in. Same shape will apply to standard concept search when that wiring lands as a follow-up.

| Value | Behavior |
| --- | --- |
| absent / null / false / "false" / "0" | disabled (default) |
| true / "true" / "1" | enabled, default dictionary (OCL/lexical-variants-en) |
| URI string (e.g. "/orgs/myorg/sources/my-dictionary/") | enabled, that dictionary |
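
The accepted forms above could be normalized roughly as in the sketch below. The constant name appears in the commit messages, but the URI assigned to it here is an assumption, as are the function name `resolve_variants_repo`, the case-insensitive string matching, and the whitespace trimming.

```python
from typing import Any, Optional

# Assumed URI for the default dictionary; the PR names the constant and the
# OCL/lexical-variants-en Source but does not spell out the URI.
DEFAULT_LEXICAL_VARIANTS_REPO = "/orgs/OCL/sources/lexical-variants-en/"

_TRUE_STRINGS = {"true", "1"}
_FALSE_STRINGS = {"false", "0", ""}


def resolve_variants_repo(value: Any) -> Optional[str]:
    """Map the request-body `variants` value to a dictionary URI or None."""
    if value is None or value is False:
        return None
    if value is True:
        return DEFAULT_LEXICAL_VARIANTS_REPO
    if isinstance(value, str):
        normalized = value.strip().lower()
        if normalized in _FALSE_STRINGS:
            return None
        if normalized in _TRUE_STRINGS:
            return DEFAULT_LEXICAL_VARIANTS_REPO
        return value.strip()  # any other non-empty string is treated as a URI
    return None  # unexpected types (numbers, lists, ...) leave expansion off
```

Returning `None` for anything unrecognized keeps the default "off" behavior, so a malformed request body cannot switch expansion on by accident.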

Tests

core/common/tests.py LexicalVariantsTest covers:

  • Tokenization
  • Multi-token expansion
  • Cache lifecycle (per-source-version) + invalidation
  • Missing-source graceful degradation
  • Regression: themselves, anthem, hemisphere, hemp, hemlock, remember return no variants
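
For a sense of what the false-positive regression might look like, here is a self-contained sketch using `unittest`. The `VARIANTS` table and `variants_for` function are hypothetical stand-ins for the dictionary lookup, and the hematology/haematology entry is an illustrative pair, not a confirmed entry in the seeded dictionary.

```python
import unittest
from typing import Dict, List

# Hypothetical whole-token dictionary used only for this sketch.
VARIANTS: Dict[str, List[str]] = {
    "leukemia": ["leukaemia"],
    "leukaemia": ["leukemia"],
    "hematology": ["haematology"],
    "haematology": ["hematology"],
}


def variants_for(token: str) -> List[str]:
    """Whole-token lookup: a word matches only if it is itself an entry."""
    return VARIANTS.get(token.lower(), [])


class FalsePositiveRegressionTest(unittest.TestCase):
    def test_lookalike_words_return_no_variants(self):
        words = ("themselves", "anthem", "hemisphere", "hemp", "hemlock", "remember")
        for word in words:
            with self.subTest(word=word):
                self.assertEqual(variants_for(word), [])

    def test_whole_word_entry_still_expands(self):
        self.assertEqual(variants_for("Hematology"), ["haematology"])
```

Using `subTest` reports each lookalike word separately, so a single regression does not hide the others.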

Deployment

The dictionary Source has already been seeded on prod (OCL/lexical-variants-en v1.0, 43 concepts, public read) so the wiring will work as soon as this PR merges and deploys. For dev / qa / staging environments, run:

ocl --server ocl-dev import file --wait core/common/data/lexical-variants-en.json
ocl --server ocl-qa import file --wait core/common/data/lexical-variants-en.json
ocl --server ocl-staging import file --wait core/common/data/lexical-variants-en.json

Verify with: ocl repo get OCL lexical-variants-en (should show 43 concepts, source_type: "Lexical Variants", extras.dictionary_kind: "lexical_variant", version v1.0 released).

The helper degrades gracefully (returns [], never raises) if the Source doesn't exist yet — so a deploy gap on any environment won't break search.

Out of scope (deferred follow-ups)

  • ?variants=... on standard concept search (ConceptListView) — touches CustomESSearch / get_raw_search_string layer, smaller blast radius as a separate PR
  • $lexical-variants HTTP operation
  • Abbreviation dictionary type (Mapping-based)
  • Multi-dictionary composition
  • Problem A from #2505 (CIEL bridge exact-match scoring via apply_score() highlight) — entirely separate work; this PR no longer auto-closes that part of the ticket

Architecture / design references

  • Full design proposal: Lexical Variant Dictionaries — UMLS / SNOMED / OBO precedents, dictionary-type modeling decisions, tokenization-first reasoning, multi-dictionary roadmap
  • Infrastructure tracking ticket: #2511 — Phase 2/3 work (standard search wiring, $lexical-variants operation, abbreviation dictionary type, caching strategy, etc.)

Adds get_spelling_variant() to generate US<->British spelling alternatives
(leukemia/leukaemia, haem/hem, paed/ped, oedema/edema, -our/-or, -ise/-ize,
etc.). In semantic search, additional kNN sub-queries are fired using the
variant's embedding so that e.g. querying "leukemia" still retrieves
"leukaemia" concepts when the index is large enough to push them out of the
default top-50 candidates. The rescore query is also expanded to boost exact
matches of either spelling, and a pre-existing crash risk (empty should-clause
when no synonyms are provided) is fixed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
snyaggarwal requested a review from paynejd May 7, 2026 15:55
paynejd and others added 2 commits May 9, 2026 10:31
…structure

Models lexical variants as an OCL Source so they get versioning, release
management, locale handling, editability, and discoverability through
existing OCL infrastructure — replacing the hardcoded regex approach with
a data-driven lookup that aligns with UMLS / SNOMED / OBO conventions.

* core/common/lexical_variants.py — get_lexical_variants() and
  get_variant_terms() helpers with token-level lookup, dataclass result,
  and per-(source_uri, version) cache. Tokenization-first design avoids
  the regex false-positive problems (e.g. "themselves" no longer matches
  "hem", "hemisphere" no longer matches "haem").

* core/common/data/lexical-variants-en.json — OCL bulk import file
  seeding ocl/lexical-variants-en with 43 vetted whole-word spelling
  pairs across en-US and en-GB, plus a v1.0 Source Version. Load via
  the standard OCL bulk import path.

* core/common/tests.py — LexicalVariantsTest covering tokenization,
  cache lifecycle, false-positive regressions (themselves, anthem,
  hemisphere, hemp, hemlock, remember), missing-source graceful
  degradation, and multi-token expansion.

Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
… $match

ConceptFuzzySearch.search() and MetadataToConceptsListView (the $match
endpoint) now consume the dictionary helper instead of the inline regex.

Variant expansion is OFF by default — clients opt in via the request body
`variants` field. Same shape will be used for standard concept search
when that wiring lands as a follow-up.

Variants param accepted forms:
* missing / null / false / "false" / "0" → disabled (default)
* true / "true" / "1" → use DEFAULT_LEXICAL_VARIANTS_REPO
* non-empty URI string → use that dictionary Source

Sunny's kNN sub-query construction and rescore expansion are kept;
only the variant source changes from hardcoded regex to dictionary
lookup. The empty-synonyms rescore crash fix is preserved.

* core/common/utils.py — remove get_spelling_variant() (replaced by
  the dictionary helper).

* core/concepts/search.py — ConceptFuzzySearch.search() accepts
  variants_repo; gates variant kNN and rescore expansion on it being
  truthy. None means skip entirely.

* core/concepts/views.py — MetadataToConceptsListView reads
  request.data.variants and normalizes it to a URI or None via
  _resolve_variants_repo before passing to ConceptFuzzySearch.

Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
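
The gating described in these commit messages can be pictured with the sketch below. It builds plain dicts shaped like Elasticsearch query DSL. The function names, the field names `name_embedding` and `name`, and the `k` / `num_candidates` values are assumptions for illustration, not the actual code in core/concepts/search.py.

```python
from typing import Any, Dict, List, Optional, Sequence

Vector = Sequence[float]


def build_knn_clauses(query_vector: Vector, variant_vectors: List[Vector],
                      variants_repo: Optional[str], k: int = 50) -> List[Dict[str, Any]]:
    """One kNN clause for the query, plus one per variant when expansion is on."""
    vectors = [query_vector]
    if variants_repo:  # None (or "") means skip variant expansion entirely
        vectors.extend(variant_vectors)
    return [
        {"field": "name_embedding", "query_vector": list(vec),
         "k": k, "num_candidates": k * 2}
        for vec in vectors
    ]


def build_rescore_query(exact_terms: List[str]) -> Optional[Dict[str, Any]]:
    """Boost exact matches on any spelling; return None instead of an empty `should`."""
    should = [{"match_phrase": {"name": term}} for term in exact_terms if term]
    if not should:
        return None  # caller omits the rescore block when there is nothing to boost
    return {"bool": {"should": should}}
```

Returning `None` when there are no terms mirrors the empty-synonyms fix in spirit: the caller can leave the rescore block out instead of sending an empty should-clause.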

paynejd commented May 9, 2026

@snyaggarwal — heads up, I pushed two commits on top of yours that redirect this PR to a dictionary-driven approach (full rationale in the updated description above + the proposal doc).

TL;DR: your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept. The variant source changed from inline regex to a dictionary lookup against ocl/lexical-variants-en (a new OCL Source seeded by bulk import). Variant expansion is now off by default; clients opt in via variants: true (or a custom dictionary URI) in the request body. We'll need follow-up work to use this in Mapper (expose in project config?).

Apologies for the in-flight redirect. Two things pushed me to the dictionary-as-Source approach: the false-positive cases in the regex list (themselves, anthem, hemisphere, hemp, etc. all matching hem/haem), and the realization that the same primitive will be needed for the synonym/abbreviation/keyword-expansion evaluation we're starting. Phase 2/3 work is tracked in #2511.

Happy to walk through the diff or rework anything that doesn't sit right.

…riants-en

The OCL organization on prod has mnemonic OCL (uppercase), not ocl. Updates
the bulk import file owner field, the DEFAULT_LEXICAL_VARIANTS_REPO
constant, and test mock URIs to match.

Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>

2 participants