feat: add masked language modeling ingestor template #88

Merged
shujaatTracebloc merged 4 commits into develop from feat/add-masked-language-modeling-template on May 18, 2026

@shujaatTracebloc (Contributor) commented May 18, 2026

Summary

  • Add MASKED_LANGUAGE_MODELING to TaskCategory constants and get_all_categories()
  • Add MLM validator mapping in validators_mapping.py: it validates .txt file extensions in the sequences/ directory and runs the table name and duplicate checks
  • Create templates/masked_language_modeling/ with full ingestion script, README, and 5 sample PrimeKG-style random walk sequences
  • Add examples/yaml/masked_language_modeling.yaml for CLI-based ingestion
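The file type and duplicate checks described in the summary can be pictured with a minimal sketch. This is not the validator chain from validators_mapping.py; `check_manifest` is a hypothetical helper, and it assumes the manifest's filename column holds the file stem while the extension column holds `.txt` or `txt`.

```python
import csv
from collections import Counter
from pathlib import Path


def check_manifest(data_dir: Path, manifest_name: str = "labels_file.csv") -> list[str]:
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper, not part of tracebloc_ingestor.
    """
    problems: list[str] = []
    with open(data_dir / manifest_name, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    # Duplicate check: each filename should appear only once in the manifest.
    counts = Counter(row["filename"] for row in rows)
    for name, n in counts.items():
        if n > 1:
            problems.append(f"duplicate filename in manifest: {name} ({n} rows)")

    # File type check: only .txt files inside sequences/ are accepted.
    for row in rows:
        ext = row["extension"].lower().lstrip(".")
        if ext != "txt":
            problems.append(f"unsupported extension for {row['filename']}: {row['extension']}")
            continue
        # Assumption: filename is stored without its extension.
        path = data_dir / "sequences" / f"{row['filename']}.txt"
        if not path.is_file():
            problems.append(f"missing file: {path}")
    return problems
```

Calling `check_manifest(Path("data"))` on the layout shown under "Data format" would return an empty list when every manifest row points at an existing .txt file.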

Key difference from text_classification

MLM is self-supervised, so no label_column is needed. The CSV manifest contains only the filename and extension columns. The training client applies masking on the fly during training, so the ingested sequences are stored unmasked.
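To illustrate why no label column is required, here is a rough sketch of on-the-fly masking. It is not the training client's implementation: the whitespace tokenization, the 15% masking rate, and the `mask_tokens` helper are all assumptions made for this example.

```python
import random


def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels).

    labels holds the original token at each masked position and None elsewhere,
    so the training targets come from the text itself.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # hide the token from the model
            labels.append(tok)         # keep the original as the target
        else:
            masked.append(tok)
            labels.append(None)        # position is not scored
    return masked, labels


# Assumption: whitespace splitting stands in for the real tokenizer.
sequence = "Lepirudin indication Huntington phenotype_present Chorea".split()
masked, labels = mask_tokens(sequence, rng=random.Random(0))
```

Because the targets are recovered from the unmasked sequence at training time, the manifest only has to say where each sequence file lives.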

Data format

data/
├── labels_file.csv           filename, extension
├── sequences/
│     seq_0000001.txt         "Lepirudin indication Huntington phenotype_present Chorea"
│     seq_0000002.txt         "Soliris target TP53 associated_with Breast_cancer"
└── tokenizer.json            (placed on dataset path, not ingested)
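A manifest matching this layout could be generated from the sequences/ directory with a short script. The sketch below is only illustrative: `write_manifest` is a hypothetical helper, and storing the stem in filename and `.txt` in extension is an assumption about the expected column contents.

```python
import csv
from pathlib import Path


def write_manifest(data_dir: Path, manifest_name: str = "labels_file.csv") -> int:
    """Write a filename/extension manifest for every .txt file in sequences/.

    Returns the number of sequence files listed. Hypothetical helper.
    """
    seq_files = sorted((data_dir / "sequences").glob("*.txt"))
    with open(data_dir / manifest_name, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filename", "extension"])  # no label column for MLM
        for path in seq_files:
            # Assumption: filename is the stem, extension keeps the leading dot.
            writer.writerow([path.stem, path.suffix])
    return len(seq_files)
```

tokenizer.json is deliberately left out of the manifest, since it sits on the dataset path but is not ingested.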

Files changed

| File | Change |
| --- | --- |
| tracebloc_ingestor/utils/constants.py | Add MASKED_LANGUAGE_MODELING to TaskCategory |
| tracebloc_ingestor/utils/validators_mapping.py | Add MLM validator chain |
| templates/masked_language_modeling/ | New template with script, README, sample data |
| examples/yaml/masked_language_modeling.yaml | New YAML config example |

Test plan

  • Verify TaskCategory.is_valid_category("masked_language_modeling") returns True
  • Verify map_validators(TaskCategory.MASKED_LANGUAGE_MODELING, ...) returns correct validator chain
  • Run masked_language_modeling.py template against sample data
  • Verify ingested records have correct filename, extension, and no label column errors

🤖 Generated with Claude Code


Note

Medium Risk
Adds a new ingestion category that changes CLI schema validation, convention resolution, and validator selection, which could affect YAML-driven runs if misconfigured. Main risk is around sidecar path resolution (sequences/SRC_PATH) and the new TokenizerValidator failing datasets missing tokenizer.json or required tokens.

Overview
Adds first-class masked language modeling (MLM) ingestion support across the declarative YAML/CLI path.

The PR extends TaskCategory, the ingest JSON schema, and CLI conventions to recognize category: masked_language_modeling, require a sequences sidecar directory, derive SRC_PATH from it, and default MLM to DataFormat.TEXT with .txt file options.

It wires in an MLM-specific validator chain (including a new TokenizerValidator that enforces presence of tokenizer.json with [MASK]/[PAD]) and adds a complete templates/masked_language_modeling/ example plus an examples/yaml/masked_language_modeling.yaml config.

Reviewed by Cursor Bugbot for commit 5e774ec. Bugbot is set up for automated code reviews on this repo.

- Add MASKED_LANGUAGE_MODELING to TaskCategory constants
- Add MLM validator mapping (FileTypeValidator for sequences/, TableName,
  Duplicate, optional DataValidator)
- Create templates/masked_language_modeling/ with ingestion script, README,
  and 5 sample sequence files from PrimeKG-style random walks
- Add YAML example for CLI-based ingestion
- MLM is self-supervised: no label_column needed, CSV manifest only has
  filename + extension columns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@LukasWodka (Collaborator)

👋 Heads-up — Code review queue is at 18 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@shujaatTracebloc shujaatTracebloc self-assigned this May 18, 2026
Comment thread examples/yaml/masked_language_modeling.yaml
shujaatTracebloc and others added 2 commits May 18, 2026 08:50
Validates that tokenizer.json exists at the data path and contains
required [MASK] and [PAD] tokens before ingestion, preventing silent
embedding out-of-bounds errors during training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds "masked_language_modeling" to the category enum in ingest.v1.json,
defines the "sequences" sidecar property, and wires MLM into conventions.py
so _data_format_for() and resolve() handle the new category without
raising ValueError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
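The convention wiring described in this commit might look roughly like the following. This is a sketch under stated assumptions, not the code in conventions.py: the `CONVENTIONS` table, the `resolve_conventions` function, the stand-in `DataFormat` enum, and the idea that the config carries a top-level `sequences` key are all hypothetical.

```python
from enum import Enum
from pathlib import Path


class DataFormat(Enum):
    """Stand-in for the ingestor's DataFormat; only TEXT is needed here."""
    TEXT = "text"


# Hypothetical per-category conventions table.
CONVENTIONS = {
    "masked_language_modeling": {
        "data_format": DataFormat.TEXT,
        "sidecar": "sequences",
        "extension": ".txt",
    },
}


def resolve_conventions(config: dict) -> dict:
    """Derive data format, source path, and file extension from a parsed config."""
    category = config.get("category")
    if category not in CONVENTIONS:
        # An unknown category is the failure mode the commit avoids for MLM.
        raise ValueError(f"unsupported category: {category!r}")
    conv = CONVENTIONS[category]
    sidecar = config.get(conv["sidecar"])
    if not sidecar:
        raise ValueError(f"category {category!r} requires a {conv['sidecar']!r} path")
    return {
        "data_format": conv["data_format"],
        "src_path": Path(sidecar),  # SRC_PATH is derived from the sidecar directory
        "extension": conv["extension"],
    }
```

Keeping the category defaults in one table means a new category only needs a new entry rather than another branch in the resolver.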
Comment thread examples/yaml/masked_language_modeling.yaml

@cursor (Bot) left a comment
Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit fd3578e.

Comment thread tracebloc_ingestor/validators/tokenizer_validator.py
divyasinghds previously approved these changes May 18, 2026
Unigram tokenizers store vocab as [[token, score], ...] lists, not dicts.
Without this, special tokens in the main vocab would be missed and
ingestion incorrectly blocked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
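The two vocab shapes mentioned in this commit can be handled with one small normalization step. The sketch below is not the TokenizerValidator itself; it assumes the Hugging Face tokenizers JSON layout (`added_tokens` entries with a `content` field, and `model.vocab`), and the helper names are hypothetical.

```python
import json
from pathlib import Path

REQUIRED_TOKENS = ("[MASK]", "[PAD]")


def tokens_in_tokenizer(spec: dict) -> set[str]:
    """Collect every token string declared in a parsed tokenizer.json."""
    tokens = {
        entry["content"]
        for entry in (spec.get("added_tokens") or [])
        if "content" in entry
    }
    vocab = (spec.get("model") or {}).get("vocab") or {}
    if isinstance(vocab, dict):
        # WordPiece/BPE style: {token: id}
        tokens.update(vocab)
    elif isinstance(vocab, list):
        # Unigram style: [[token, score], ...]
        tokens.update(item[0] for item in vocab if item)
    return tokens


def missing_special_tokens(tokenizer_path: Path) -> list[str]:
    """Return the required tokens that tokenizer.json does not define."""
    if not tokenizer_path.is_file():
        raise FileNotFoundError(f"tokenizer.json not found at {tokenizer_path}")
    spec = json.loads(tokenizer_path.read_text(encoding="utf-8"))
    present = tokens_in_tokenizer(spec)
    return [tok for tok in REQUIRED_TOKENS if tok not in present]
```

A check that only handled the dict shape would see an empty vocab for a Unigram tokenizer and report [MASK] and [PAD] as missing, which is the false block this commit removes.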