feat: add masked language modeling ingestor template #88
Merged
shujaatTracebloc merged 4 commits on May 18, 2026
- Add MASKED_LANGUAGE_MODELING to TaskCategory constants
- Add MLM validator mapping (FileTypeValidator for sequences/, TableName, Duplicate, optional DataValidator)
- Create templates/masked_language_modeling/ with ingestion script, README, and 5 sample sequence files from PrimeKG-style random walks
- Add YAML example for CLI-based ingestion
- MLM is self-supervised: no label_column needed, CSV manifest only has filename + extension columns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
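For readers without the repository at hand, the category registration could look roughly like the sketch below. This is a stand-in class, not the project's actual TaskCategory: only the two category names and the method names get_all_categories() and is_valid_category() come from this PR, and the method bodies are assumptions.

```python
class TaskCategory:
    """Hypothetical stand-in for the ingestor's constants class."""

    # Only the two categories named in this PR are listed here.
    TEXT_CLASSIFICATION = "text_classification"
    MASKED_LANGUAGE_MODELING = "masked_language_modeling"

    @classmethod
    def get_all_categories(cls):
        # Assumed behaviour: return every registered category string.
        return [cls.TEXT_CLASSIFICATION, cls.MASKED_LANGUAGE_MODELING]

    @classmethod
    def is_valid_category(cls, category):
        # Assumed behaviour: membership test against the registered list.
        return category in cls.get_all_categories()


print(TaskCategory.is_valid_category("masked_language_modeling"))
```

Forgetting to add the new constant to get_all_categories() would make is_valid_category() reject it in a design like this one, which is presumably why the test plan checks that call explicitly.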
👋 Heads-up: the Code review queue is at 18 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
Validates that tokenizer.json exists at the data path and contains required [MASK] and [PAD] tokens before ingestion, preventing silent embedding out-of-bounds errors during training. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
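A minimal sketch of such a pre-ingestion check, assuming the Hugging Face tokenizer.json layout in which special tokens are listed under a top-level "added_tokens" array. The function name missing_special_tokens is hypothetical and is not the PR's TokenizerValidator.

```python
import json
from pathlib import Path

REQUIRED_TOKENS = ("[MASK]", "[PAD]")


def missing_special_tokens(data_path, required=REQUIRED_TOKENS):
    """Return the required tokens that tokenizer.json does not register.

    Raises FileNotFoundError when tokenizer.json is absent from data_path.
    """
    tokenizer_file = Path(data_path) / "tokenizer.json"
    if not tokenizer_file.is_file():
        raise FileNotFoundError(f"tokenizer.json not found in {data_path}")

    with tokenizer_file.open(encoding="utf-8") as fh:
        config = json.load(fh)

    # Assumption: special tokens appear as {"content": "..."} entries
    # in the top-level "added_tokens" list.
    known = {entry["content"] for entry in config.get("added_tokens", [])}
    return [token for token in required if token not in known]
```

A caller would block ingestion when the returned list is non-empty, so a missing [MASK] or [PAD] surfaces before training rather than as an out-of-range embedding index.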
Adds "masked_language_modeling" to the category enum in ingest.v1.json, defines the "sequences" sidecar property, and wires MLM into conventions.py so _data_format_for() and resolve() handle the new category without raising ValueError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
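The failure mode this commit avoids can be illustrated with a small lookup that raises ValueError for categories it does not know. The sketch is hypothetical: the enum value, the mapping, and the public name data_format_for are assumptions standing in for the project's conventions.py, and only the MLM-to-DataFormat.TEXT default comes from this PR.

```python
from enum import Enum


class DataFormat(Enum):
    """Hypothetical stand-in for the ingestor's DataFormat."""

    TEXT = "text"


# Assumed mapping; the PR states only that MLM defaults to DataFormat.TEXT.
_CATEGORY_FORMATS = {
    "masked_language_modeling": DataFormat.TEXT,
}


def data_format_for(category):
    """Return the default data format, or raise ValueError if unmapped."""
    try:
        return _CATEGORY_FORMATS[category]
    except KeyError:
        raise ValueError(f"Unsupported category: {category!r}") from None
```

With a table like this, a category accepted by the schema enum but absent from the mapping would still fail at resolution time, so both places need the new entry.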
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit fd3578e.
divyasinghds previously approved these changes on May 18, 2026
Unigram tokenizers store vocab as [[token, score], ...] lists, not dicts. Without this, special tokens in the main vocab would be missed and ingestion incorrectly blocked. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
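The two vocabulary shapes can be normalized with a small helper. This is a sketch assuming the tokenizer.json layout in which model.vocab is either a token-to-id mapping (for example WordPiece or BPE) or a list of [token, score] pairs (Unigram); the helper name vocab_tokens is hypothetical.

```python
def vocab_tokens(model_vocab):
    """Return the set of token strings from a tokenizer.json model.vocab entry."""
    if isinstance(model_vocab, dict):
        # Mapping form: {"token": id, ...}; the keys are the tokens.
        return set(model_vocab)
    if isinstance(model_vocab, list):
        # Unigram form: [["token", score], ...]; the first item is the token.
        return {entry[0] for entry in model_vocab if entry}
    # Unknown shape: report no tokens rather than guessing.
    return set()


wordpiece_vocab = {"[PAD]": 0, "[MASK]": 1, "gene": 2}
unigram_vocab = [["[PAD]", 0.0], ["[MASK]", 0.0], ["gene", -3.2]]

# Both shapes yield the same token set, so the special-token check can share one path.
assert vocab_tokens(wordpiece_vocab) == vocab_tokens(unigram_vocab)
```

Treating the list form as if it were a dict would either raise or compare tokens against list items, which is how special tokens in a Unigram vocab could be reported as missing.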
This was referenced May 18, 2026
aptracebloc approved these changes on May 18, 2026
This was referenced May 19, 2026

Summary
- Add MASKED_LANGUAGE_MODELING to TaskCategory constants and get_all_categories()
- Add MLM validator mapping in validators_mapping.py: validates .txt file extensions in the sequences/ directory, plus table name and duplicate checks
- Create templates/masked_language_modeling/ with full ingestion script, README, and 5 sample PrimeKG-style random walk sequences
- Add examples/yaml/masked_language_modeling.yaml for CLI-based ingestion

Key difference from text_classification
MLM is self-supervised: no label_column is needed. The CSV manifest contains only the filename and extension columns. Masking is applied on the fly by the training client during training.

Data format
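As a rough illustration of a label-free manifest, the sketch below builds a CSV with only filename and extension columns from a directory of .txt sequence files. The helper name write_manifest is hypothetical, and two details are assumptions not confirmed by this PR: that filename is stored without its suffix, and that extension keeps the leading dot.

```python
import csv
from pathlib import Path


def write_manifest(sequences_dir, manifest_path):
    """Write a filename/extension manifest for the .txt files in sequences_dir.

    Returns the number of rows written.
    """
    rows = [
        # Assumption: stem goes in "filename", dotted suffix in "extension".
        {"filename": path.stem, "extension": path.suffix}
        for path in sorted(Path(sequences_dir).glob("*.txt"))
    ]
    with open(manifest_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["filename", "extension"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

No label column is written at all, which matches the self-supervised setup: the training targets come from the masked tokens themselves.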
Files changed
- tracebloc_ingestor/utils/constants.py: add MASKED_LANGUAGE_MODELING to TaskCategory
- tracebloc_ingestor/utils/validators_mapping.py
- templates/masked_language_modeling/
- examples/yaml/masked_language_modeling.yaml

Test plan
- TaskCategory.is_valid_category("masked_language_modeling") returns True
- map_validators(TaskCategory.MASKED_LANGUAGE_MODELING, ...) returns correct validator chain
- Run the masked_language_modeling.py template against sample data
- Verify a manifest with filename, extension, and no label column ingests without errors

🤖 Generated with Claude Code
Note
Medium Risk
Adds a new ingestion category that changes CLI schema validation, convention resolution, and validator selection, which could affect YAML-driven runs if misconfigured. The main risk is around sidecar path resolution (sequences / SRC_PATH) and the new TokenizerValidator failing datasets that are missing tokenizer.json or required tokens.

Overview
Adds first-class masked language modeling (MLM) ingestion support across the declarative YAML/CLI path.
The PR extends TaskCategory, the ingest JSON schema, and CLI conventions to recognize category: masked_language_modeling, require a sequences sidecar directory, derive SRC_PATH from it, and default MLM to DataFormat.TEXT with .txt file options.

It wires in an MLM-specific validator chain (including a new TokenizerValidator that enforces presence of tokenizer.json with [MASK]/[PAD]) and adds a complete templates/masked_language_modeling/ example plus an examples/yaml/masked_language_modeling.yaml config.

Reviewed by Cursor Bugbot for commit 5e774ec.