refactor(ingest): drop ingest-time tokenizer + fingerprint (#805) #299
Merged
shujaatTracebloc merged 1 commit on Jun 17, 2026
The contributor tokenizer is distributed to every client and validated against the model at upload, so federated correctness never depended on a data-side tokenizer.

Remove the tokenizer.json copy, TokenizerValidator (incl. the MLM-mandatory rule), and the 4-integer fingerprint extraction/registration: they required a tokenizer at ingest yet only cross-checked tokenizer-vs-tokenizer, validating nothing about tokenizer-vs-data fit.

Keep is_nlp/NLP_CATEGORIES, repurposed for the upcoming data-derived text profile.

Supersedes #281, #286.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
shujaatTracebloc approved these changes Jun 17, 2026
shujaatTracebloc added a commit that referenced this pull request Jun 19, 2026
Wire causal_language_modeling as a first-class supported modality across every per-category dispatch site, mirroring how masked_language_modeling (self-supervised) and token_classification (texts/ raw-text layout) are already wired. The training-container side (tracebloc-client) already supports it; this closes the ingestor-side gap (was 0 references).

Causal LM is self-supervised (only a `filename` column, no label) and NLP (ships the #805 data-derived text profile for the contributor-tokenizer-fit check). Each sample is one `.txt` of either plain text (pretraining) or a tab-separated `prompt\tcompletion` pair (SFT). Both are ordinary UTF-8 text, so the centralized TextContentValidator is the whole content check (no bespoke per-modality validator; honors the #317 centralization). It stages raw text from `texts/`, NOT MLM's `sequences/`: the framework reserves `sequences/` for pre-tokenized data. Decoder-only models tie pad=eos, so there is no [MASK] requirement; the vocab+pad alignment check lives on the training-client side and is fed by the is_nlp-gated text profile here (the ingest-time tokenizer was dropped in #299).

Production wiring:
- constants: CAUSAL_LANGUAGE_MODELING + get_all_categories()
- registry: ModalitySpec (file-bearing, self-supervised, NLP, TEXT, file_subdir="texts")
- modalities/validators: causal_language_modeling factory (FileType + TextContentValidator + optional DataValidator; no label validators)
- modalities/transfer: causal_language_modeling -> text_transfer (texts/)
- cli/conventions: added to TEXT_CATEGORIES (.txt default, texts/ SRC_PATH)
- schema/ingest.v1.json: enum, "requires texts" rule, self-supervised "must not set label" rule (#213), texts description
- text_content_validator: docstrings

Template + example:
- templates/causal_language_modeling/ (script delegating to run_ingestion, README, labels CSV, 5 samples: 3 plain-text + 2 SFT tab pairs)
- examples/yaml/causal_language_modeling.yaml + root README table

Tests:
- updated every category enumeration (registry NLP set, template-category list, ALL_CATEGORIES, NLP/non-classification lists, text-profile params, equivalence CASES, e2e cases) + schema accept/reject
- new tests/test_causal_language_modeling.py: CLI run, validation boundaries (header-only / all-files-missing fail-fast, binary reject, clean pass), failure accounting

Full unit suite: 1250 passed, 1 xfailed.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
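For illustration only, here is a minimal sketch of how the two sample layouts described in the commit message above (plain text vs. a tab-separated `prompt\tcompletion` pair) could be told apart. The function name `parse_causal_lm_sample` and the returned dict shape are hypothetical and not taken from the repository. Splitting on the first tab is an assumption; the commit message does not say how extra tabs are handled.

```python
from typing import Dict


def parse_causal_lm_sample(raw: bytes) -> Dict[str, str]:
    """Decode one .txt sample and classify it (hypothetical helper)."""
    # Both layouts are ordinary UTF-8 text, so invalid bytes are rejected
    # here: bytes.decode raises UnicodeDecodeError on non-UTF-8 input.
    text = raw.decode("utf-8")

    # Assumption: the first tab separates prompt from completion (SFT).
    if "\t" in text:
        prompt, completion = text.split("\t", 1)
        return {"kind": "sft", "prompt": prompt, "completion": completion}

    # No tab: treat the whole file as plain pretraining text.
    return {"kind": "pretraining", "text": text}
```

Under these assumptions, a file containing `Q: 2+2?<TAB>4` yields an SFT pair, a file without a tab yields pretraining text, and a binary file fails at the decode step.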
Summary
Removes the ingest-time tokenizer requirement and the 4-integer tokenizer fingerprint (epic #805, superseding #281, #286).
Federated correctness is guaranteed by distributing the contributor tokenizer to every client plus the upload-time tokenizer↔model checks — never by a data-side tokenizer. The fingerprint forced a tokenizer at ingest (mandatory for MLM) yet only compared structural integers, validating nothing about tokenizer↔data fit.
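The point about structural integers can be made concrete with a toy sketch. This PR does not state which four integers made up the fingerprint, so the fields below (vocabulary size, pad id, unk id, maximum length) are assumptions, and `fingerprint` and `unk_rate` are hypothetical helpers rather than code from the repository.

```python
from typing import Dict, List, Tuple


def fingerprint(vocab: Dict[str, int], pad: str, unk: str,
                max_len: int) -> Tuple[int, int, int, int]:
    # Hypothetical 4-integer fingerprint: structure only, no token content.
    return (len(vocab), vocab[pad], vocab[unk], max_len)


def unk_rate(vocab: Dict[str, int], words: List[str]) -> float:
    # Share of whitespace-split words missing from the vocabulary.
    return sum(w not in vocab for w in words) / len(words)


# Two toy vocabularies with identical structure but different tokens.
english = {"[PAD]": 0, "[UNK]": 1, "the": 2, "cat": 3}
german = {"[PAD]": 0, "[UNK]": 1, "die": 2, "katze": 3}
data = "the cat".split()

same_fingerprint = (
    fingerprint(english, "[PAD]", "[UNK]", 128)
    == fingerprint(german, "[PAD]", "[UNK]", 128)
)
```

In this toy setup the two fingerprints match even though one vocabulary covers every word in `data` and the other covers none, which is the sense in which a tokenizer-vs-tokenizer comparison says nothing about tokenizer-vs-data fit.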
Changes
- Remove validators/tokenizer_validator.py (TokenizerValidator + fingerprint extraction).
- Remove the tokenizer.json copy (_copy_tokenizer_if_present, get_shipped_tokenizer_metadata) and the three per-category wirings (drops the MLM-mandatory-at-ingest rule).
- Remove the fingerprint registration from ingestors/base.py.
- Keep is_nlp/NLP_CATEGORIES, repurposed for the upcoming data-derived text profile (P2).

Test plan
- pytest tests/ → 1120 passed, 1 xfailed locally.

Part of the #805 redesign. Next: data-derived text profile + warn-only fit checks.
🤖 Generated with Claude Code