feat: add masked language modeling ingestor template by shujaatTracebloc · Pull Request #88 · tracebloc/data-ingestors

shujaatTracebloc · 2026-05-18T06:39:18Z

Summary

- Add MASKED_LANGUAGE_MODELING to TaskCategory constants and get_all_categories()
- Add MLM validator mapping in validators_mapping.py: validates .txt file extensions in the sequences/ directory, plus table name and duplicate checks
- Create templates/masked_language_modeling/ with full ingestion script, README, and 5 sample PrimeKG-style random walk sequences
- Add examples/yaml/masked_language_modeling.yaml for CLI-based ingestion

Key difference from text_classification

MLM is self-supervised — no label_column is needed. The CSV manifest only contains filename and extension columns. Masking is applied on-the-fly by the training client during training.

Data format

data/
├── labels_file.csv           filename, extension
├── sequences/
│     seq_0000001.txt         "Lepirudin indication Huntington phenotype_present Chorea"
│     seq_0000002.txt         "Soliris target TP53 associated_with Breast_cancer"
└── tokenizer.json            (placed on dataset path, not ingested)
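
As an illustration of the layout above, the manifest could be generated from the sequences/ directory roughly as follows. This is a minimal sketch, not the template's actual script: `build_manifest` is a hypothetical helper, and it assumes the `filename` column holds the file stem while `extension` holds the suffix including the dot.

```python
import csv
from pathlib import Path


def build_manifest(data_dir: Path) -> list[dict]:
    """Write labels_file.csv with one row per .txt file in sequences/.

    Hypothetical helper: the column semantics (stem vs. full name) are an
    assumption, not something the PR description specifies.
    """
    data_dir = Path(data_dir)
    rows = []
    for path in sorted((data_dir / "sequences").iterdir()):
        if not path.is_file():
            continue
        if path.suffix.lower() != ".txt":
            raise ValueError(f"unexpected file type in sequences/: {path.name}")
        rows.append({"filename": path.stem, "extension": path.suffix})

    # No label column: MLM is self-supervised.
    with (data_dir / "labels_file.csv").open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["filename", "extension"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Note that tokenizer.json sits next to the manifest and is deliberately left out of it, matching the "not ingested" remark in the tree.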

Files changed

File	Change
`tracebloc_ingestor/utils/constants.py`	Add `MASKED_LANGUAGE_MODELING` to `TaskCategory`
`tracebloc_ingestor/utils/validators_mapping.py`	Add MLM validator chain
`templates/masked_language_modeling/`	New template with script, README, sample data
`examples/yaml/masked_language_modeling.yaml`	New YAML config example

Test plan

- Verify TaskCategory.is_valid_category("masked_language_modeling") returns True
- Verify map_validators(TaskCategory.MASKED_LANGUAGE_MODELING, ...) returns correct validator chain
- Run masked_language_modeling.py template against sample data
- Verify ingested records have correct filename, extension, and no label column errors
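
The first check in the plan can be pictured with a stand-in for the constants class. The sketch below is an assumption about its shape: the real `TaskCategory` in `tracebloc_ingestor/utils/constants.py` has more categories, and its internals are not shown in this PR description.

```python
class TaskCategory:
    """Stand-in for the real constants class (shape assumed for illustration)."""

    TEXT_CLASSIFICATION = "text_classification"
    MASKED_LANGUAGE_MODELING = "masked_language_modeling"

    @classmethod
    def get_all_categories(cls) -> list[str]:
        # A new category has to be listed here to be accepted.
        return [cls.TEXT_CLASSIFICATION, cls.MASKED_LANGUAGE_MODELING]

    @classmethod
    def is_valid_category(cls, category: str) -> bool:
        return category in cls.get_all_categories()


print(TaskCategory.is_valid_category("masked_language_modeling"))
```

In this sketch, adding the constant without also extending `get_all_categories()` would leave the category rejected, which is why the summary lists both changes.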

🤖 Generated with Claude Code

Note

Medium Risk
Adds a new ingestion category that changes CLI schema validation, convention resolution, and validator selection, which could affect YAML-driven runs if misconfigured. Main risk is around sidecar path resolution (sequences/SRC_PATH) and the new TokenizerValidator failing datasets missing tokenizer.json or required tokens.

Overview
Adds first-class masked language modeling (MLM) ingestion support across the declarative YAML/CLI path.

The PR extends TaskCategory, the ingest JSON schema, and CLI conventions to recognize category: masked_language_modeling, require a sequences sidecar directory, derive SRC_PATH from it, and default MLM to DataFormat.TEXT with .txt file options.

It wires in an MLM-specific validator chain (including a new TokenizerValidator that enforces presence of tokenizer.json with [MASK]/[PAD]) and adds a complete templates/masked_language_modeling/ example plus an examples/yaml/masked_language_modeling.yaml config.

Reviewed by Cursor Bugbot for commit 5e774ec. Bugbot is set up for automated code reviews on this repo.

- Add MASKED_LANGUAGE_MODELING to TaskCategory constants
- Add MLM validator mapping (FileTypeValidator for sequences/, TableName, Duplicate, optional DataValidator)
- Create templates/masked_language_modeling/ with ingestion script, README, and 5 sample sequence files from PrimeKG-style random walks
- Add YAML example for CLI-based ingestion
- MLM is self-supervised: no label_column needed, CSV manifest only has filename + extension columns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

LukasWodka · 2026-05-18T06:39:46Z

👋 Heads-up — Code review queue is at 18 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

.github#41 — fix(profile): point install snippets to canonical i.sh / i.ps1 URLs · author: @saadqbal · reviewer: @LukasWodka
client#100 — feat(client): scaffolding for stateless requests-proxy auth (HC-1) · author: @saadqbal · no reviewer assigned
client-runtime#35 — feat(removing package for dev for merge #21): jobs-manager submit-ingestion-run HTTP endpoint · author: @saadqbal · no reviewer assigned
client-runtime#36 — Bump requests from 2.32.3 to 2.33.0 in /Node-deploy · author: @dependabot · no reviewer assigned
client-runtime#37 — Bump black from 25.1.0 to 26.3.1 in /Node-deploy · author: @dependabot · no reviewer assigned
docs#3 — Block bots from crawling Mintlify static assets · author: @LukasWodka · reviewer: @saadqbal
frontend-app#465 — chore(deps): bump next from 15.5.9 to 15.5.18 · author: @dependabot · no reviewer assigned
frontend-app#466 — chore(deps): bump systeminformation from 5.28.5 to 5.31.6 · author: @dependabot · no reviewer assigned
frontend-app#467 — chore(deps): bump axios from 0.21.4 to 0.31.1 · author: @dependabot · no reviewer assigned
start-training#16 — chore: merge main into develop to resolve PR Csv ingestor #15 conflicts · author: @saadqbal · reviewer: @LukasWodka

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Validates that tokenizer.json exists at the data path and contains required [MASK] and [PAD] tokens before ingestion, preventing silent embedding out-of-bounds errors during training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
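
A check of this kind might look like the sketch below. It is not the code of the PR's TokenizerValidator: `missing_special_tokens` is a hypothetical function, and it assumes the Hugging Face `tokenizers` JSON layout, where special tokens can appear under `added_tokens` (as objects with a `content` field) or in `model.vocab`.

```python
import json
from pathlib import Path

REQUIRED_TOKENS = ("[MASK]", "[PAD]")


def missing_special_tokens(data_path: Path) -> list[str]:
    """Return the required tokens that tokenizer.json does not define.

    Hypothetical helper; raises FileNotFoundError when the file is absent.
    """
    tokenizer_file = Path(data_path) / "tokenizer.json"
    if not tokenizer_file.is_file():
        raise FileNotFoundError(f"tokenizer.json not found in {data_path}")

    spec = json.loads(tokenizer_file.read_text(encoding="utf-8"))
    known = {entry["content"] for entry in spec.get("added_tokens") or []}

    # model.vocab is a {token: id} dict for WordPiece/BPE and a list of
    # [token, score] pairs for Unigram; iterating covers both shapes.
    vocab = (spec.get("model") or {}).get("vocab") or {}
    known.update(e[0] if isinstance(e, list) else e for e in vocab)

    return [tok for tok in REQUIRED_TOKENS if tok not in known]
```

An empty return value means ingestion can proceed; a non-empty list names the tokens to report in the validation error.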

Adds "masked_language_modeling" to the category enum in ingest.v1.json, defines the "sequences" sidecar property, and wires MLM into conventions.py so _data_format_for() and resolve() handle the new category without raising ValueError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
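
To make the convention concrete, the sketch below shows one way a resolver could turn a parsed YAML spec into ingestion settings. Everything except the names `DataFormat.TEXT`, `SRC_PATH`, and the `sequences` key is an assumption: `resolve_mlm`, the returned dictionary, and its other keys are hypothetical and do not mirror the real conventions.py.

```python
from enum import Enum
from pathlib import Path

MLM = "masked_language_modeling"


class DataFormat(Enum):
    """Stand-in for the ingestor's DataFormat (only TEXT is needed here)."""

    TEXT = "text"


def resolve_mlm(spec: dict, base_dir: Path) -> dict:
    """Derive MLM ingestion settings from a parsed config (hypothetical)."""
    if spec.get("category") != MLM:
        raise ValueError(f"unsupported category: {spec.get('category')!r}")

    sequences = spec.get("sequences")
    if not sequences:
        # The sidecar directory is mandatory for this category.
        raise ValueError(f"{MLM} requires a 'sequences' sidecar directory")

    return {
        "SRC_PATH": str(Path(base_dir) / sequences),  # derived from the sidecar
        "data_format": DataFormat.TEXT,               # MLM defaults to text
        "allowed_extensions": [".txt"],
    }
```

A relative `sequences` value resolves against the directory passed as `base_dir`, which is the sidecar path resolution the risk note singles out.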

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit fd3578e.

Unigram tokenizers store vocab as [[token, score], ...] lists, not dicts. Without this, special tokens in the main vocab would be missed and ingestion incorrectly blocked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
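
The two vocab shapes can be normalized with a small helper like the one below. It is a sketch of the idea in this commit message, not the committed code; `vocab_tokens` is a hypothetical name.

```python
def vocab_tokens(vocab) -> set[str]:
    """Collect token strings from either vocab shape (hypothetical helper).

    WordPiece/BPE: {"token": id, ...}
    Unigram:       [["token", score], ...]
    """
    if isinstance(vocab, dict):
        return set(vocab)
    if isinstance(vocab, list):
        return {
            entry[0]
            for entry in vocab
            if isinstance(entry, (list, tuple)) and entry
        }
    # Missing or unrecognized vocab: treat as empty rather than crash.
    return set()


wordpiece = {"[PAD]": 0, "[MASK]": 1, "chorea": 2}
unigram = [["[PAD]", 0.0], ["[MASK]", 0.0], ["chorea", -3.2]]
assert vocab_tokens(wordpiece) == vocab_tokens(unigram)
```

A check that only looked up dict keys would find nothing in the Unigram form and report [MASK] and [PAD] as missing even though they are present.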

shujaatTracebloc self-assigned this May 18, 2026

cursor Bot reviewed May 18, 2026

Comment thread examples/yaml/masked_language_modeling.yaml

Comment thread examples/yaml/masked_language_modeling.yaml

shujaatTracebloc and others added 2 commits May 18, 2026 08:50

cursor Bot reviewed May 18, 2026

Comment thread examples/yaml/masked_language_modeling.yaml

cursor Bot reviewed May 18, 2026

Comment thread tracebloc_ingestor/validators/tokenizer_validator.py

shujaatTracebloc requested a review from divyasinghds May 18, 2026 07:00

divyasinghds previously approved these changes May 18, 2026

shujaatTracebloc dismissed divyasinghds’s stale review via 5e774ec May 18, 2026 07:04

This was referenced May 18, 2026

docs: clarify setup guide deploys single-node workspace tracebloc/docs#41

Merged

Add 4 trending models (round 2 batch B): EBM, TFT, DeepHit tracebloc/model-zoo#70

Merged

shujaatTracebloc requested a review from aptracebloc May 18, 2026 08:53

This was referenced May 18, 2026

docs: clarify setup guide deploys single-node workspace tracebloc/docs#42

Merged

chore: lift python_requires to 3.11 #89

Merged

aptracebloc approved these changes May 18, 2026

shujaatTracebloc merged commit 04369f8 into develop May 18, 2026
1 check passed

shujaatTracebloc deleted the feat/add-masked-language-modeling-template branch May 18, 2026 09:07

shujaatTracebloc merged 4 commits into develop from feat/add-masked-language-modeling-template

4 participants
