Openconceptlab/ocl_issues#2505 | Fix US/British spelling mismatch in semantic concept search #868
snyaggarwal wants to merge 4 commits into master
Adds get_spelling_variant() to generate US<->British spelling alternatives (leukemia/leukaemia, haem/hem, paed/ped, oedema/edema, -our/-or, -ise/-ize, etc.). In semantic search, additional kNN sub-queries are fired using the variant's embedding so that e.g. querying "leukemia" still retrieves "leukaemia" concepts when the index is large enough to push them out of the default top-50 candidates. The rescore query is also expanded to boost exact matches of either spelling, and a pre-existing crash risk (empty should-clause when no synonyms are provided) is fixed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
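The query shape described in this commit can be sketched roughly as below. This is a minimal illustration, assuming Elasticsearch 8's kNN search option, which accepts a list of clauses; the field names (`embedding`, `name`), the default sizes, the sample vectors, and both helper names are hypothetical and are not taken from the PR's code.

```python
def build_knn_clauses(field, vectors, k=10, num_candidates=50):
    """Build one kNN clause per embedding: the original query first, then each variant."""
    return [
        {
            "field": field,
            "query_vector": list(vector),
            "k": k,
            "num_candidates": num_candidates,
        }
        for vector in vectors
    ]


def build_rescore_query(terms, field="name"):
    """Boost exact matches of any spelling; return None when there is nothing to match."""
    should = [{"match_phrase": {field: term}} for term in terms if term]
    if not should:
        # Skip the rescore entirely instead of emitting an empty should-clause.
        return None
    return {"bool": {"should": should}}


# Made-up embeddings standing in for "leukemia" and its variant "leukaemia".
original_vec = [0.12, 0.48, 0.33]
variant_vec = [0.11, 0.50, 0.31]

knn = build_knn_clauses("embedding", [original_vec, variant_vec])
rescore = build_rescore_query(["leukemia", "leukaemia"])
```

Each variant gets its own candidate pool, so a concept indexed under the other spelling no longer has to compete for a slot in the original query's candidates.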
…structure

Models lexical variants as an OCL Source so they get versioning, release management, locale handling, editability, and discoverability through existing OCL infrastructure, replacing the hardcoded regex approach with a data-driven lookup that aligns with UMLS / SNOMED / OBO conventions.

* core/common/lexical_variants.py: get_lexical_variants() and get_variant_terms() helpers with token-level lookup, dataclass result, and per-(source_uri, version) cache. Tokenization-first design avoids the regex false-positive problems (e.g. "themselves" no longer matches "hem", "hemisphere" no longer matches "haem").
* core/common/data/lexical-variants-en.json: OCL bulk import file seeding ocl/lexical-variants-en with 43 vetted whole-word spelling pairs across en-US and en-GB, plus a v1.0 Source Version. Load via the standard OCL bulk import path.
* core/common/tests.py: LexicalVariantsTest covering tokenization, cache lifecycle, false-positive regressions (themselves, anthem, hemisphere, hemp, hemlock, remember), missing-source graceful degradation, and multi-token expansion.

Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
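To illustrate why whole-token lookup sidesteps the substring false positives, here is a minimal sketch. The two sample pairs come from the spellings mentioned in this PR; the in-memory data layout and the function names `tokenize` and `variant_terms` are assumptions for illustration and do not mirror the real helper's internals.

```python
import string

# Hypothetical in-memory stand-in for the dictionary: one set of spellings per concept.
VARIANT_GROUPS = [
    {"leukemia", "leukaemia"},
    {"edema", "oedema"},
]

# Map each spelling to its sibling spellings so a lookup is a single dict access.
TOKEN_INDEX = {
    term: sorted(group - {term})
    for group in VARIANT_GROUPS
    for term in group
}


def tokenize(text):
    """Lowercase, split on whitespace, and strip ASCII punctuation from each token."""
    tokens = (token.strip(string.punctuation) for token in text.lower().split())
    return [token for token in tokens if token]


def variant_terms(text):
    """Return variant spellings for whole tokens only, in input order, without duplicates."""
    found = []
    for token in tokenize(text):
        for variant in TOKEN_INDEX.get(token, []):
            if variant not in found:
                found.append(variant)
    return found
```

Because only complete tokens are looked up, `variant_terms("themselves anthem hemisphere")` has nothing to match, while `variant_terms("Acute leukemia, with edema.")` picks up the sibling spelling of each matching token.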
… $match

ConceptFuzzySearch.search() and MetadataToConceptsListView (the $match endpoint) now consume the dictionary helper instead of the inline regex. Variant expansion is OFF by default; clients opt in via the request body `variants` field. Same shape will be used for standard concept search when that wiring lands as a follow-up.

Variants param accepted forms:
* missing / null / false / "false" / "0" → disabled (default)
* true / "true" / "1" → use DEFAULT_LEXICAL_VARIANTS_REPO
* non-empty URI string → use that dictionary Source

Sunny's kNN sub-query construction and rescore expansion are kept; only the variant source changes from hardcoded regex to dictionary lookup. The empty-synonyms rescore crash fix is preserved.

* core/common/utils.py: remove get_spelling_variant() (replaced by the dictionary helper).
* core/concepts/search.py: ConceptFuzzySearch.search() accepts variants_repo; gates variant kNN and rescore expansion on it being truthy. None means skip entirely.
* core/concepts/views.py: MetadataToConceptsListView reads request.data.variants and normalizes it to a URI or None via _resolve_variants_repo before passing to ConceptFuzzySearch.

Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
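The accepted forms of `variants` could be normalized along these lines. This is a sketch only: the function name `resolve_variants_repo`, the exact URI of the default dictionary, and the handling of unexpected types (treated as disabled) are assumptions, not the behavior of the PR's `_resolve_variants_repo`.

```python
# Assumed URI form for the default dictionary (OCL/lexical-variants-en).
DEFAULT_LEXICAL_VARIANTS_REPO = "/orgs/OCL/sources/lexical-variants-en/"


def resolve_variants_repo(value):
    """Normalize the request's `variants` field to a dictionary URI, or None if disabled."""
    if value is None or value is False:
        return None
    if value is True:
        return DEFAULT_LEXICAL_VARIANTS_REPO
    if isinstance(value, str):
        text = value.strip()
        lowered = text.lower()
        if lowered in ("", "false", "0"):
            return None
        if lowered in ("true", "1"):
            return DEFAULT_LEXICAL_VARIANTS_REPO
        # Any other non-empty string is taken to be a dictionary Source URI.
        return text
    # Assumption: numbers, lists, and other types fall back to "disabled".
    return None
```

Returning `None` for every disabled form keeps the search side simple: it only has to check whether `variants_repo` is truthy.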
@snyaggarwal, heads up: I pushed two commits on top of yours that redirect this PR to a dictionary-driven approach (full rationale in the updated description above + the proposal doc).
TL;DR: your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept. The variant source changed from inline regex to a dictionary lookup against […].
Apologies for the in-flight redirect; the false-positive cases in the regex list ([…]) […].
Happy to walk through the diff or rework anything that doesn't sit right.
…riants-en The OCL organization on prod has mnemonic OCL (uppercase), not ocl. Updates the bulk import file owner field, the DEFAULT_LEXICAL_VARIANTS_REPO constant, and test mock URIs to match. Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
Refs OpenConceptLab/ocl_issues#2505 (Problem B only — Problem A still open)
Refs OpenConceptLab/ocl_issues#2511 (Lexical Variant Dictionary infrastructure)
Heads up — approach changed mid-flight
Sunny — apologies for the redirect. After deeper review of #2505 with @paynejd, we landed on a substantially different architecture and felt it was faster to push the new approach directly than to iterate via comments. Your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept — only the variant source changed from inline regex to a dictionary-driven lookup. Happy to walk through the diff on a call if useful.
Three reasons for the redirect:
* `hem`/`haem` matches `themselves`, `anthem`, `hemisphere`, `hemp`, `hemlock`, `remember`: hard to fix defensively in regex without enumerating exclusions for every English word.
* […] `ConceptFuzzySearch`. Standard `/concepts/?q=…` has the same gap, and upcoming abbreviation-expansion work needs the same primitive. A dictionary-as-Source approach makes that reuse trivial.
Tradeoff worth flagging
This approach puts pressure on oclapi2 to do caching well. The MVP scale is trivial (~40 entries, in-memory dict), but as dictionary types and sizes grow — especially abbreviation dictionaries (UMLS LRABR is ~250k rows) and once variants are enabled by default on high-QPS endpoints — the per-token DB lookup will need to graduate beyond the MVP's module-level cache. Likely Redis-backed token-indexed cache for >1k-entry dictionaries; ES-backed lookup for very large ones. Plan to revisit when (a) a dictionary exceeds ~5k entries, or (b) variants are enabled by default on a high-QPS endpoint. Captured in #2511.
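For reference, the MVP-style module-level cache keyed on `(source_uri, version)` might look roughly like this. The names `get_token_index` and `clear_cache`, the injected `loader` callable, and the use of `LookupError` to signal a missing dictionary are all assumptions made for the sketch; the real helper reads from the OCL Source instead.

```python
# Module-level cache: one token index per (source_uri, version) pair.
_CACHE = {}


def get_token_index(source_uri, version, loader):
    """Return the token index for one dictionary version, loading it at most once."""
    key = (source_uri, version)
    if key not in _CACHE:
        try:
            _CACHE[key] = loader(source_uri, version)
        except LookupError:
            # Missing dictionary: degrade to "no variants" instead of failing the search.
            # The miss is not cached, so a dictionary seeded later is picked up.
            return {}
    return _CACHE[key]


def clear_cache():
    """Drop every cached index, e.g. between tests."""
    _CACHE.clear()
```

Keying on the version means a newly released Source Version misses the cache and gets loaded on its own, without explicit invalidation. A Redis-backed variant would keep the same key shape and move the storage out of process.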
What this PR does
* `core/common/lexical_variants.py`: `get_lexical_variants(text, source_uri)` and `get_variant_terms(text, source_uri)`. Tokenizes input (lowercase + whitespace + ASCII punctuation strip), looks each token up against the dictionary's Names, returns sibling Names from the matching Concept. Module-level cache keyed on `(source_uri, version)`. Returns `[]` and never raises if the dictionary is missing (graceful degradation).
* `core/common/data/lexical-variants-en.json`: OCL bulk import file creating `OCL/lexical-variants-en` (Source + 43 vetted Concepts + Source Version `v1.0`). Each Concept has en-US and en-GB Names with `name_type=Fully Specified`. False-positive-prone substring entries dropped.
* `$match`: `MetadataToConceptsListView` reads `variants` from request body, passes the resolved URI (or `None`) to `ConceptFuzzySearch.search(..., variants_repo=...)`. The kNN sub-query construction and rescore expansion are unchanged in shape; only the variant source changed.
* Removes `get_spelling_variant()` from `core/common/utils.py`.
API: `variants` request body field
Lexical variant expansion is OFF by default. Clients opt in. Same shape will apply to standard concept search when that wiring lands as a follow-up.
* missing / `null` / `false` / `"false"` / `"0"` → disabled (default)
* `true` / `"true"` / `"1"` → use the default dictionary (`OCL/lexical-variants-en`)
* non-empty URI string (e.g. `"/orgs/myorg/sources/my-dictionary/"`) → use that dictionary Source
Tests
`core/common/tests.py` `LexicalVariantsTest` covers:
* `themselves`, `anthem`, `hemisphere`, `hemp`, `hemlock`, `remember` return no variants
Deployment
The dictionary Source has already been seeded on prod (`OCL/lexical-variants-en` `v1.0`, 43 concepts, public read) so the wiring will work as soon as this PR merges and deploys. For dev / qa / staging environments, load `core/common/data/lexical-variants-en.json` via the standard OCL bulk import path.
Verify with: `ocl repo get OCL lexical-variants-en` (should show 43 concepts, `source_type: "Lexical Variants"`, `extras.dictionary_kind: "lexical_variant"`, version `v1.0` released).
The helper degrades gracefully (returns `[]`, never raises) if the Source doesn't exist yet, so a deploy gap on any environment won't break search.
Out of scope (deferred follow-ups)
* `?variants=...` on standard concept search (`ConceptListView`): touches the `CustomESSearch` / `get_raw_search_string` layer, smaller blast radius as a separate PR
* `$lexical-variants` HTTP operation
* […] `apply_score()` highlight): entirely separate work; this PR no longer auto-closes that part of the ticket
Architecture / design references
* […] `$lexical-variants` operation, abbreviation dictionary type, caching strategy, etc.)