refactor(ingest): drop ingest-time tokenizer + fingerprint (#805) by LukasWodka · Pull Request #299 · tracebloc/data-ingestors

LukasWodka · 2026-06-17T12:24:56Z

Summary

Removes the ingest-time tokenizer requirement and the 4-integer tokenizer fingerprint (epic #805, superseding #281, #286).

Federated correctness is guaranteed by distributing the contributor tokenizer to every client plus the upload-time tokenizer↔model checks — never by a data-side tokenizer. The fingerprint forced a tokenizer at ingest (mandatory for MLM) yet only compared structural integers, validating nothing about tokenizer↔data fit.
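To illustrate why a structural fingerprint says nothing about tokenizer↔data fit, here is a minimal sketch. The PR does not list which four integers made up the fingerprint, so the fields below (vocab size, pad id, unk id, max length), the `ToyTokenizer` type, and the `fingerprint` / `unk_rate` helpers are hypothetical stand-ins, not the removed code.

```python
from typing import Dict, NamedTuple, Tuple


class ToyTokenizer(NamedTuple):
    """Hypothetical stand-in for a tokenizer; not the removed TokenizerValidator."""
    vocab: Dict[str, int]
    pad_id: int
    unk_id: int
    max_length: int


def fingerprint(tok: ToyTokenizer) -> Tuple[int, int, int, int]:
    # Assumed fingerprint fields; the real four integers are not given in the PR.
    return (len(tok.vocab), tok.pad_id, tok.unk_id, tok.max_length)


def unk_rate(tok: ToyTokenizer, text: str) -> float:
    # Fraction of whitespace-separated words that fall outside the vocabulary.
    words = text.split()
    return sum(w not in tok.vocab for w in words) / len(words)


english = ToyTokenizer({"[PAD]": 0, "[UNK]": 1, "the": 2, "cat": 3}, 0, 1, 512)
german = ToyTokenizer({"[PAD]": 0, "[UNK]": 1, "die": 2, "katze": 3}, 0, 1, 512)

text = "the cat the cat"
same_fingerprint = fingerprint(english) == fingerprint(german)
english_unk = unk_rate(english, text)
german_unk = unk_rate(german, text)
```

Both toy tokenizers yield the same fingerprint, yet one covers the text completely and the other maps every word to unknown. A check on structural integers alone cannot tell them apart.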

Changes

- Delete validators/tokenizer_validator.py (TokenizerValidator + fingerprint extraction).
- Remove the tokenizer.json copy (_copy_tokenizer_if_present, get_shipped_tokenizer_metadata) and the three per-category wirings (drops the MLM-mandatory-at-ingest rule).
- Remove the fingerprint attach/registration block in ingestors/base.py.
- Keep is_nlp / NLP_CATEGORIES, repurposed for the upcoming data-derived text profile (P2).
- Remove the obsolete tokenizer tests.
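A minimal sketch of how the retained `is_nlp` / `NLP_CATEGORIES` gate could drive a data-derived profile. The category membership shown, the `text_profile` function, and its fields are assumptions for illustration only; the actual profile belongs to the follow-up work and is not defined in this PR.

```python
from typing import Dict, List, Optional

# Assumed membership, for illustration; the real set lives in the repository.
NLP_CATEGORIES = frozenset({"masked_language_modeling", "text_classification"})


def is_nlp(category: str) -> bool:
    return category in NLP_CATEGORIES


def text_profile(category: str, texts: List[str]) -> Optional[Dict[str, float]]:
    """Hypothetical profile computed from the data alone, with no tokenizer involved."""
    if not is_nlp(category) or not texts:
        return None
    lengths = [len(t) for t in texts]
    chars = "".join(texts)
    return {
        "num_samples": len(texts),
        "mean_chars": sum(lengths) / len(lengths),
        "max_chars": max(lengths),
        # Share of characters outside the ASCII range.
        "non_ascii_ratio": sum(ord(c) > 127 for c in chars) / max(len(chars), 1),
    }


profile = text_profile("masked_language_modeling", ["hello", "h\u00e9llo!"])
skipped = text_profile("image_classification", ["hello"])
```

Non-NLP categories return `None`, so nothing tokenizer-related is required at ingest; a later warn-only fit check could compare such a profile against the contributor tokenizer.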

Test plan

pytest tests/ → 1120 passed, 1 xfailed locally.

Part of the #805 redesign. Next: data-derived text profile + warn-only fit checks.

🤖 Generated with Claude Code

The contributor tokenizer is distributed to every client and validated against the model at upload, so federated correctness never depended on a data-side tokenizer.

Remove the tokenizer.json copy, TokenizerValidator (incl. the MLM-mandatory rule), and the 4-integer fingerprint extraction/registration: they required a tokenizer at ingest yet only cross-checked tokenizer-vs-tokenizer, validating nothing about tokenizer-vs-data fit.

Keep is_nlp/NLP_CATEGORIES, repurposed for the upcoming data-derived text profile.

Supersedes #281, #286.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Wire causal_language_modeling as a first-class supported modality across every per-category dispatch site, mirroring how masked_language_modeling (self-supervised) and token_classification (texts/ raw-text layout) are already wired. The training-container side (tracebloc-client) already supports it; this closes the ingestor-side gap (was 0 references).

Causal LM is self-supervised (only a `filename` column, no label) and NLP (ships the #805 data-derived text profile for the contributor-tokenizer-fit check). Each sample is one `.txt` of either plain text (pretraining) or a tab-separated `prompt\tcompletion` pair (SFT). Both are ordinary UTF-8 text, so the centralized TextContentValidator is the whole content check (no bespoke per-modality validator; honors the #317 centralization). It stages raw text from `texts/`, NOT MLM's `sequences/`; the framework reserves `sequences/` for pre-tokenized data. Decoder-only models tie pad=eos, so there is no [MASK] requirement; the vocab+pad alignment check lives on the training-client side and is fed by the is_nlp-gated text profile here (the ingest-time tokenizer was dropped in #299).

Production wiring:
- constants: CAUSAL_LANGUAGE_MODELING + get_all_categories()
- registry: ModalitySpec (file-bearing, self-supervised, NLP, TEXT, file_subdir="texts")
- modalities/validators: causal_language_modeling factory (FileType + TextContentValidator + optional DataValidator; no label validators)
- modalities/transfer: causal_language_modeling -> text_transfer (texts/)
- cli/conventions: added to TEXT_CATEGORIES (.txt default, texts/ SRC_PATH)
- schema/ingest.v1.json: enum, "requires texts" rule, self-supervised "must not set label" rule (#213), texts description
- text_content_validator: docstrings

Template + example:
- templates/causal_language_modeling/ (script delegating to run_ingestion, README, labels CSV, 5 samples: 3 plain-text + 2 SFT tab pairs)
- examples/yaml/causal_language_modeling.yaml + root README table

Tests:
- updated every category enumeration (registry NLP set, template-category list, ALL_CATEGORIES, NLP/non-classification lists, text-profile params, equivalence CASES, e2e cases) + schema accept/reject
- new tests/test_causal_language_modeling.py: CLI run, validation boundaries (header-only / all-files-missing fail-fast, binary reject, clean pass), failure accounting

Full unit suite: 1250 passed, 1 xfailed.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
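The sample layout described in this commit message (one `.txt` per sample, either plain text or a tab-separated `prompt\tcompletion` pair, both plain UTF-8) can be sketched as follows. `CausalLMSample` and `parse_sample` are hypothetical names, not part of the repository, and the rule "exactly one tab means an SFT pair" is an assumption; the real TextContentValidator may decide differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CausalLMSample:
    """Hypothetical in-memory view of one texts/*.txt sample."""
    text: str                          # plain text, or the prompt of an SFT pair
    completion: Optional[str] = None   # set only for prompt\tcompletion samples

    @property
    def is_sft(self) -> bool:
        return self.completion is not None


def parse_sample(raw: bytes) -> CausalLMSample:
    # Reject binary content: NUL bytes or anything that is not valid UTF-8.
    if b"\x00" in raw:
        raise ValueError("binary content: NUL byte found")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("binary content: not valid UTF-8") from exc
    # Assumption: exactly one tab marks an SFT pair; anything else is plain text.
    if text.count("\t") == 1:
        prompt, completion = text.split("\t")
        return CausalLMSample(prompt, completion.rstrip("\n"))
    return CausalLMSample(text)


plain = parse_sample(b"Once upon a time.\n")
sft = parse_sample(b"Translate: cat\tKatze\n")
```

Because both layouts are ordinary text, a single UTF-8 content check covers them; the plain-vs-SFT distinction only matters to whatever consumes the sample afterwards.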

LukasWodka requested a review from shujaatTracebloc June 17, 2026 12:28

shujaatTracebloc approved these changes Jun 17, 2026


shujaatTracebloc merged commit 82a52bb into develop Jun 17, 2026
6 checks passed

shujaatTracebloc deleted the refactor/805-drop-ingest-tokenizer-fingerprint branch June 17, 2026 12:59

LukasWodka mentioned this pull request Jun 17, 2026

feat(ingest): ship data-derived text profile for NLP datasets (#805) #300

Merged

divyasinghds mentioned this pull request Jun 18, 2026

test(nlp-profile): accept source_record kwarg in map_file_transfer stubs (bugbot #304) #305

Merged

2 tasks

saadqbal mentioned this pull request Jun 18, 2026

Release v0.4.0 (minor: ModalityRegistry refactors + atomicity + NLP text profile) #308

Closed

5 tasks

shujaatTracebloc mentioned this pull request Jun 19, 2026

feat: add causal_language_modeling modality #318

Merged

shujaatTracebloc merged 1 commit into develop from refactor/805-drop-ingest-tokenizer-fingerprint
