refactor(ingest): drop ingest-time tokenizer + fingerprint (#805) #299

Merged
shujaatTracebloc merged 1 commit into develop from refactor/805-drop-ingest-tokenizer-fingerprint
Jun 17, 2026


@LukasWodka

Collaborator

Summary

Removes the ingest-time tokenizer requirement and the 4-integer tokenizer fingerprint (epic #805, superseding #281, #286).

Federated correctness is guaranteed by distributing the contributor tokenizer to every client plus the upload-time tokenizer↔model checks — never by a data-side tokenizer. The fingerprint forced a tokenizer at ingest (mandatory for MLM) yet only compared structural integers, validating nothing about tokenizer↔data fit.
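
To make the argument above concrete, here is a minimal sketch of how two unrelated tokenizers can share an identical structural fingerprint while fitting the data very differently. The four field names, the toy vocabularies, and the `unk_rate` helper are all hypothetical; the PR only states that the fingerprint consisted of four integers, and this is not the removed `TokenizerValidator`.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Fingerprint:
    # Hypothetical fields: the PR only says the fingerprint was "4 integers".
    vocab_size: int
    pad_id: int
    unk_id: int
    max_length: int


def fingerprint(vocab: Dict[str, int], max_length: int) -> Fingerprint:
    """Extract purely structural integers from a toy whitespace vocabulary."""
    return Fingerprint(len(vocab), vocab["[PAD]"], vocab["[UNK]"], max_length)


def unk_rate(vocab: Dict[str, int], text: str) -> float:
    """Fraction of whitespace tokens that the vocabulary cannot represent."""
    tokens = text.split()
    unknown = sum(1 for token in tokens if token not in vocab)
    return unknown / len(tokens)


# Two toy vocabularies with the same size and the same special-token ids.
english = {"[PAD]": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4}
german = {"[PAD]": 0, "[UNK]": 1, "die": 2, "katze": 3, "sass": 4}

# The structural comparison passes even though the vocabularies differ.
same_fingerprint = fingerprint(english, 128) == fingerprint(german, 128)

# Only a look at the data itself reveals the mismatch.
english_unk = unk_rate(english, "the cat sat")
german_unk = unk_rate(german, "the cat sat")
```

In this toy setup the fingerprints compare equal, yet one vocabulary covers the sample text fully and the other covers none of it, which is the kind of gap a data-derived text profile is meant to surface.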

Changes

  • Delete validators/tokenizer_validator.py (TokenizerValidator + fingerprint extraction).
  • Remove the tokenizer.json copy (_copy_tokenizer_if_present, get_shipped_tokenizer_metadata) and the three per-category wirings (drops the MLM-mandatory-at-ingest rule).
  • Remove the fingerprint attach/registration block in ingestors/base.py.
  • Keep is_nlp / NLP_CATEGORIES, repurposed for the upcoming data-derived text profile (P2).
  • Remove the obsolete tokenizer tests.

Test plan

  • pytest tests/ → 1120 passed, 1 xfailed locally.

Part of the #805 redesign. Next: data-derived text profile + warn-only fit checks.

🤖 Generated with Claude Code

The contributor tokenizer is distributed to every client and validated against the model at upload, so federated correctness never depended on a data-side tokenizer. Remove the tokenizer.json copy, TokenizerValidator (incl. the MLM-mandatory rule), and the 4-integer fingerprint extraction/registration: they required a tokenizer at ingest yet only cross-checked tokenizer-vs-tokenizer, validating nothing about tokenizer-vs-data fit. Keep is_nlp/NLP_CATEGORIES, repurposed for the upcoming data-derived text profile. Supersedes #281, #286.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@shujaatTracebloc shujaatTracebloc merged commit 82a52bb into develop Jun 17, 2026
6 checks passed
@shujaatTracebloc shujaatTracebloc deleted the refactor/805-drop-ingest-tokenizer-fingerprint branch June 17, 2026 12:59
shujaatTracebloc added a commit that referenced this pull request Jun 19, 2026
Wire causal_language_modeling as a first-class supported modality across
every per-category dispatch site, mirroring how masked_language_modeling
(self-supervised) and token_classification (texts/ raw-text layout) are
already wired. The training-container side (tracebloc-client) already
supports it; this closes the ingestor-side gap (was 0 references).

Causal LM is self-supervised (only a `filename` column, no label) and NLP
(ships the #805 data-derived text profile for the contributor-tokenizer-fit
check). Each sample is one `.txt` of either plain text (pretraining) or a
tab-separated `prompt\tcompletion` pair (SFT) — both ordinary UTF-8 text, so
the centralized TextContentValidator is the whole content check (no bespoke
per-modality validator; honors the #317 centralization).
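
The two sample layouts described above can be sketched as follows. This is only an illustration: `Sample` and `parse_sample` are hypothetical names, not part of the ingestor, and splitting on the first tab to tell SFT pairs from plain text is an assumption (the commit message does not say how the two layouts are distinguished).

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    prompt: Optional[str]  # None for plain-text (pretraining) samples
    text: str              # completion for SFT pairs, full text otherwise


def parse_sample(raw: bytes) -> Sample:
    """Decode one .txt sample and classify it as plain text or an SFT pair."""
    # Both layouts are ordinary UTF-8; decoding raises UnicodeDecodeError
    # for bytes that are not valid UTF-8.
    content = raw.decode("utf-8").rstrip("\n")
    if "\t" in content:
        # Assumption: the first tab separates prompt from completion.
        prompt, completion = content.split("\t", 1)
        return Sample(prompt, completion)
    return Sample(None, content)


plain = parse_sample(b"Hello world\n")
pair = parse_sample(b"Q: 2+2?\tA: 4\n")
```

Because both layouts decode as plain UTF-8, a single text-content check can cover them, which matches the commit's point that no per-modality validator is needed.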

It stages raw text from `texts/`, NOT MLM's `sequences/` — the framework
reserves `sequences/` for pre-tokenized data. Decoder-only models tie
pad=eos, so there is no [MASK] requirement; the vocab+pad alignment check
lives on the training-client side and is fed by the is_nlp-gated text
profile here (the ingest-time tokenizer was dropped in #299).

Production wiring:
- constants: CAUSAL_LANGUAGE_MODELING + get_all_categories()
- registry: ModalitySpec (file-bearing, self-supervised, NLP, TEXT,
  file_subdir="texts")
- modalities/validators: causal_language_modeling factory (FileType +
  TextContentValidator + optional DataValidator; no label validators)
- modalities/transfer: causal_language_modeling -> text_transfer (texts/)
- cli/conventions: added to TEXT_CATEGORIES (.txt default, texts/ SRC_PATH)
- schema/ingest.v1.json: enum, "requires texts" rule, self-supervised
  "must not set label" rule (#213), texts description
- text_content_validator: docstrings
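
As a rough illustration of the registry entry listed above, the sketch below models a modality spec as a frozen dataclass. Only the name `ModalitySpec`, the category names, and `file_subdir="texts"` (and MLM's `sequences/`) come from the commit message; every other field name, the `REGISTRY` dict, and the helper functions are assumptions made for the example and may not match the real code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModalitySpec:
    # Illustrative field names; only file_subdir is quoted in the commit message.
    name: str
    file_bearing: bool
    self_supervised: bool
    is_nlp: bool
    file_subdir: Optional[str]


# Hypothetical in-memory registry keyed by category name.
REGISTRY: Dict[str, ModalitySpec] = {}


def register(spec: ModalitySpec) -> None:
    """Add a spec, refusing duplicate category names."""
    if spec.name in REGISTRY:
        raise ValueError(f"duplicate modality: {spec.name}")
    REGISTRY[spec.name] = spec


register(ModalitySpec("causal_language_modeling", True, True, True, "texts"))
register(ModalitySpec("masked_language_modeling", True, True, True, "sequences"))


def nlp_categories() -> List[str]:
    """Categories that would receive the is_nlp-gated text profile."""
    return sorted(name for name, spec in REGISTRY.items() if spec.is_nlp)


def label_allowed(category: str) -> bool:
    """Self-supervised categories must not set a label column."""
    return not REGISTRY[category].self_supervised
```

Driving the per-category dispatch sites from one declarative entry like this is one way to keep enumerations such as the NLP set and the "must not set label" rule consistent when a new category is added.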

Template + example:
- templates/causal_language_modeling/ (script delegating to run_ingestion,
  README, labels CSV, 5 samples: 3 plain-text + 2 SFT tab pairs)
- examples/yaml/causal_language_modeling.yaml + root README table

Tests:
- updated every category enumeration (registry NLP set, template-category
  list, ALL_CATEGORIES, NLP/non-classification lists, text-profile params,
  equivalence CASES, e2e cases) + schema accept/reject
- new tests/test_causal_language_modeling.py: CLI run, validation boundaries
  (header-only / all-files-missing fail-fast, binary reject, clean pass),
  failure accounting

Full unit suite: 1250 passed, 1 xfailed.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
