chore: lift python_requires to 3.11 #89
Merged
The Dockerfile and all tracebloc deployments run Python 3.11. Drop the stale 3.8 floor so pip refuses installs on incompatible interpreters instead of silently producing a broken environment.

- python_requires = ">=3.8" -> ">=3.11"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
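As a rough illustration of what the raised floor buys, the check that an installer performs against `python_requires` can be approximated as a version-tuple comparison. This is only a simplified sketch: the helper name `meets_floor` is hypothetical, and a real installer evaluates the full version specifier rather than a bare `(major, minor)` tuple.

```python
import sys

# Mirrors python_requires=">=3.11" from the change above.
PYTHON_FLOOR = (3, 11)


def meets_floor(version=None, floor=PYTHON_FLOOR):
    """Return True when (major, minor) is at or above the floor.

    Hypothetical helper: it models only a ">=" specifier.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(floor)


if __name__ == "__main__":
    # Under the old ">=3.8" floor a 3.8 interpreter would have been accepted.
    print("3.8 accepted:", meets_floor((3, 8)))
    print("3.11 accepted:", meets_floor((3, 11)))
```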
Collaborator
👋 Heads-up — Code review queue is at 21 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
Collaborator
👋 Heads-up — Code review queue is at 25 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
aptracebloc approved these changes on May 18, 2026
2 tasks
saadqbal approved these changes on May 18, 2026
saadqbal added a commit that referenced this pull request on May 20, 2026
* feat(#44): declarative ingest.yaml schema, entrypoint, and equivalence harness (#69)

* feat(#44): add ingest.v1 JSON schema, examples, and validation tests

First commit of #44's declarative-YAML-config flow. Contract-only: no runtime code yet (entrypoint, conventions resolver, label-policy bucketing, and equivalence harness follow in subsequent commits on this branch).

What this lands:

- schema/ingest.v1.json
  - Draft-07 JSON Schema for the IngestConfig YAML.
  - additionalProperties:false at every level (catches typos like `lable:` or `catagory:` at validation time, not runtime).
  - apiVersion locked to "tracebloc.io/v1"; breaking changes require v2.
  - category enum covers all 10 TaskCategory values from tracebloc_ingestor/utils/constants.py, including instance_segmentation (which has no map_validators branch yet but is in the enum so customers can author against the contract; missing-branch is tracked separately).
  - Conditional requirements per category:
    * image-based → `images` required
    * object_detection → `annotations` required
    * semantic_segmentation → `masks` required
    * text_classification → `texts` required
    * tabular/time-series → `schema` required
    * regression-class → `label` MUST be the object form with `policy` set (string shorthand rejected)
  - Exactly one of `csv` / `json` is the data source (oneOf with mutual not-required).
  - `label`: string shorthand for the dominant case; object form for explicit policy control. Both validate where allowed.
  - `data_id`: defaults to UUID strategy (no source column leaves the cluster). Opting into source-column mapping requires explicit `strategy:column` + `column:<name>`; loud and auditable.
  - `spec.processors[]`: contract for the customScript escape hatch (Helm subchart will mount the script via ConfigMap; the official image imports + instantiates the named class). Note: the underlying BaseProcessor mechanism does NOT exist yet; #44's later commits create it. The ticket overstated existing code.
- examples/yaml/*.yaml: one per task category plus the custom-processor escape hatch:
    image_classification.yaml (8 lines, the dominant case)
    object_detection.yaml
    keypoint_detection.yaml
    semantic_segmentation.yaml
    text_classification.yaml
    tabular_classification.yaml
    tabular_regression.yaml (regression-class, label.policy:bucket)
    time_series_forecasting.yaml (regression-class, label.policy:bucket)
    time_to_event_prediction.yaml (regression-class + time_column)
    custom_processor.yaml (escape hatch, PHI decryption use case)
- tests/test_schema_validation.py: 36 tests:
  * Positive: every example validates.
  * Acceptance: image_classification stays at 8 payload lines (the "extremely simple for users" #44 design constraint).
  * Coverage: every enum category has an example (modulo instance_segmentation, deferred until a template exists).
  * Negative: every conditional-requirement and rejection path the schema enforces: typos, invalid category, missing data source, both csv+json present, image categories without images, regression-class with shorthand label or missing policy, data_id.column without strategy:column, processor without class, locked apiVersion/kind.
- requirements.txt: adds PyYAML and jsonschema (needed by the entrypoint and these tests; the schema-image flow doesn't need them).

Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#44): add conventions resolver (category → ingestor kwargs)

Pure-function module that translates a (validated) ingest.yaml dict into a ResolvedConfig the entrypoint can hand to CSVIngestor / JSONIngestor. No I/O, no env reads, no globals, so it is trivially unit-testable.

What's in:

- tracebloc_ingestor/cli/__init__.py
  - Module docstring describing the planned entrypoint flow.
- tracebloc_ingestor/cli/conventions.py
  - Category groupings (IMAGE_CATEGORIES, TEXT_CATEGORIES, TABULAR_CATEGORIES, TIME_SERIES_CATEGORIES, TIME_TO_EVENT_CATEGORIES, REGRESSION_CLASS_CATEGORIES).
    Single source of truth: both this module and (later) the entrypoint read these instead of redefining inline.
  - Default options (DEFAULT_CSV_OPTIONS, DEFAULT_IMAGE_FILE_OPTIONS, DEFAULT_TEXT_FILE_OPTIONS) match what the existing templates set, so YAML-driven runs default-equivalent to script-driven runs (verified by the equivalence harness in a later commit).
  - ResolvedConfig dataclass: a fully-resolved configuration the entrypoint consumes. Every field is filled in (customer values win, conventions fill the rest).
  - resolve(config) -> ResolvedConfig: the resolver. Pre-condition is that config has already passed jsonschema validation; resolve() does not re-validate.

Design notes baked in:

- data_id default is UUID generation (no source column leaves the cluster); customers explicitly opt into column-mapping via `data_id.strategy:column`. Privacy-safe by default. This is a behavior change from the existing keypoint_detection / semantic_segmentation templates which set unique_id_column="filename"; the equivalence-harness YAMLs for those will set the column explicitly to preserve template-equivalent behavior.
- keypoint_detection automatically sets annotation_column="Annotation" (matches the existing template's use of that CSV column for keypoint coords). Other categories rely on sidecar files only.
- data_format derives from category groupings, mapping back to the existing DataFormat enum (image / text / tabular).

- tests/test_conventions.py
  - 38 tests, ~100% line coverage of conventions.py.
  - Round-trip: every shipped example resolves without error.
  - Defaults: csv_options, file_options, label_policy, unique_id_column.
  - Customer overrides win over defaults.
  - Label shorthand and object form produce equivalent ResolvedConfig for classification; regression-class object form's policy carries through.
  - data_id.strategy correctly drives unique_id_column.
  - processor_specs pass through verbatim (the entrypoint will load classes; resolver stays I/O-free).
  - Sanity-check: REGRESSION_CLASS_CATEGORIES and IMAGE_CATEGORIES sets must match the schema's `if/then` blocks; drift here is a bug.
  - JSON source dispatch (no JSON example ships yet; constructed inline).

74/74 passing tests across schema validation + conventions resolver.

Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#44): add tracebloc-ingest entrypoint with schema-validation dispatch

Wires up the runtime side of the declarative-YAML flow. Reads INGEST_CONFIG env, parses + validates the YAML, resolves convention defaults, dispatches to CSVIngestor / JSONIngestor, runs the ingestion. Registered as the `tracebloc-ingest` console script so the official image (#45) can use it as ENTRYPOINT.

What's in:

- tracebloc_ingestor/cli/run.py (the entrypoint)
  - main(): full happy-path flow with explicit fail-fast points before any DB/network call (missing INGEST_CONFIG, malformed YAML, schema violations). Returns the exit code (0=ok, 1=records failed, 2=fail-fast) rather than calling sys.exit, so it's testable from inside pytest.
  - _validate(): yields jsonschema errors sorted by absolute_path so the output is deterministic regardless of validator traversal order.
  - _format_errors(): emits `<json-pointer>: <message>` per error. Real line numbers (per the ticket text) require a position-preserving YAML loader; deferred to v1.1, since the json-pointer path is enough for grep.
  - _set_legacy_env_vars(): sets SRC_PATH / TABLE_NAME / LABEL_FILE from the resolved config before constructing Config(), so the framework's existing path-resolution in file_transfer.py keeps working unchanged. A direct file_transfer.py refactor to take paths as parameters is the right long-term move; it's a follow-up, and env-var injection is the minimal bridge for v1.
  - _build_ingestor(): dispatches by source_type.
    CSV path uses csv_options + file_options; JSON path uses json_options (defaulted to empty since the schema doesn't expose it yet) and validators=None (lets map_validators run from category, the dominant case).
- Schema relocation: schema/ingest.v1.json → tracebloc_ingestor/schema/
  - Bundles inside the package so it's discoverable post pip-install rather than only from a repo checkout. setup.py's package_data picks up the new tracebloc_ingestor.schema package; sdist + wheel both include the JSON.
  - Top-level schema/ directory is gone. Tooling references move to the new path.
- setup.py
  - entry_points["console_scripts"]: tracebloc-ingest = ...cli.run:main
  - package_data: bundle the schema JSON.
- tests/test_cli_run.py: 11 tests, full coverage of the entrypoint:
  - Failure modes (missing/nonexistent/malformed/schema-invalid YAML) all fail fast with a non-zero exit before any DB or network call.
  - Happy paths (CSV and JSON) construct the right ingestor with convention-default kwargs and call ingest() with the source path.
  - The legacy env-var bridge actually fires before Config() construction.
  - Failed records during ingest() yield exit code 1 (distinct from the fail-fast exit code 2).

DEFERRED to v1.1 (deliberately, with logging when triggered):

- spec.processors[] runtime execution. The schema accepts processors today, but the deployment story requires the Helm subchart from client#86 to mount the script body via ConfigMap; without it, there's no path for processor scripts to land in the pod. The entrypoint warns and skips; rest of the run continues. Customers shouldn't write processors: until client#86 lands.
- Line-numbered validation errors. JSON-pointer paths land in v1; real line numbers via a position-preserving YAML loader is a quality improvement for v1.1.

84/84 tests passing across schema validation + conventions + entrypoint.
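The exit-code contract and the sorted `<json-pointer>: <message>` error output described in this commit can be sketched roughly as follows. This is not the code in cli/run.py: the `validate` and `ingest` callables and the `(path, message)` error shape are stand-ins assumed for illustration, so that the sketch runs with the standard library alone.

```python
import os

EXIT_OK, EXIT_RECORDS_FAILED, EXIT_FAIL_FAST = 0, 1, 2


def format_errors(errors):
    """Render (path, message) pairs as '<json-pointer>: <message>'.

    Sorting by path keeps the output deterministic regardless of the
    order in which a validator happens to report the errors.
    """
    ordered = sorted(errors, key=lambda err: [str(part) for part in err[0]])
    lines = []
    for path, message in ordered:
        pointer = "/".join(["", *(str(part) for part in path)]) or "/"
        lines.append(f"{pointer}: {message}")
    return lines


def main(environ=None, validate=lambda path: [], ingest=lambda path: []):
    """Return an exit code instead of calling sys.exit, so tests can call it.

    `validate` and `ingest` are hypothetical stand-ins for schema validation
    and the ingestion run; both receive the config path.
    """
    environ = os.environ if environ is None else environ
    config_path = environ.get("INGEST_CONFIG")
    if not config_path:
        print("INGEST_CONFIG env var not set")
        return EXIT_FAIL_FAST
    if not os.path.isfile(config_path):
        print(f"config file not found: {config_path}")
        return EXIT_FAIL_FAST
    errors = validate(config_path)
    if errors:
        print("\n".join(format_errors(errors)))
        return EXIT_FAIL_FAST
    failed_records = ingest(config_path)
    return EXIT_RECORDS_FAILED if failed_records else EXIT_OK
```

Keeping the fail-fast code (2) distinct from the records-failed code (1) lets a Job distinguish a configuration mistake from a partially failed run.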
Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#44): label-policy bucketing for regression-class targets

Per #44 / parent client#85: when the label is a numeric prediction target (regression, time-series forecasting, time-to-event prediction), the raw value must NOT leak to the central backend. The on-prem-data principle is that only metadata crosses the cluster boundary; shipping the literal target value defeats it. This commit adds the single point where raw labels become bucket IDs before APIClient.send_batch builds its payload.

What's in:

- tracebloc_ingestor/utils/label_policy.py
  - apply(value, policy): the only function callers need.
  - PASSTHROUGH: classification; value sent unchanged. No-op.
  - BUCKET: regression-class; value replaced with sha256(str(value))[:8] truncated mod NUM_BUCKETS=64, giving a stable int in [0, 64).
  - MISSING_LABEL_BUCKET=-1: explicit sentinel for missing/empty/whitespace labels, outside the valid bucket range so it can't collide.
  - Bucketing strategy chosen for v1 because:
    * Stable: same value always → same bucket. Central backend can group identical labels without seeing them.
    * Privacy-preserving: raw value not derivable from bucket.
    * One-pass: no need to scan twice for min/max, so it works with chunked CSV reads.
    * Lossy on ordinality: a feature for privacy; analytic insights stay on-prem anyway.
  - Equal-width / quantile bucketing is a v1.1 extension if customers ask; schema can grow `label.policy: equal_width` without breaking existing `passthrough` / `bucket` consumers.
- tracebloc_ingestor/ingestors/base.py
  - Add `label_policy: str = PASSTHROUGH` parameter to BaseIngestor.__init__ (keyword-only at the end so the dozen positional kwargs above don't shift).
  - Apply policy in _map_unique_id at the latest moment before the API payload is built, so failure modes (missing label_column, validation failures) short-circuit before bucketing happens.
- tracebloc_ingestor/ingestors/csv_ingestor.py
- tracebloc_ingestor/ingestors/json_ingestor.py
  - Thread `label_policy` keyword through both subclasses' __init__ and super().__init__ calls so YAML-driven and template-driven runs alike can opt in.
- tracebloc_ingestor/cli/run.py
  - Pass `resolved.label_policy` into the ingestor's common_kwargs. Bucket fires automatically for any regression-class YAML config (the schema already requires `label.policy` for those categories per commit 1).
- tests/test_label_policy.py: 14 tests:
  * Pure: PASSTHROUGH no-op, BUCKET stable + in-range + handles missing.
  * BaseIngestor wiring: _map_unique_id mutates label only under BUCKET.
  * Entrypoint integration: tabular_regression.yaml flows through with label_policy=BUCKET; image_classification.yaml flows through with label_policy=PASSTHROUGH.

112/112 tests passing across schema + conventions + entrypoint + label-policy.

Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#44): equivalence harness: YAML path matches every existing template

Closes the #44 acceptance criterion that says: *"For each of the seven existing templates plus segmentation, an equivalent examples/yaml/<name>.yaml exists and produces the same end state (same MySQL rows, same backend POSTs)."*

We don't run real DB / API I/O. Instead we capture the kwargs each path hands to the ingestor and assert they match. Same kwargs through the same framework code → same end state.

What's in:

- tests/test_template_equivalence.py
  - One parametrized case per existing template directory (9 total: image_classification, object_detection, keypoint_detection, semantic_segmentation, text_classification, tabular_classification, tabular_regression, time_series_forecasting, time_to_event_prediction).
  - Each case asserts every consequential ingestor kwarg matches what the template's CSVIngestor() call sets: category, data_format, intent, label_column, label_policy, unique_id_column, annotation_column, file_options.
  - test_every_existing_template_has_a_case: enumerates templates/ and fails if a category dir is missing from the harness (drift guard).
  - test_yaml_config_reaches_ingestor_via_entrypoint: same parametrized cases, but routed through cli.run.main with mocked DB / API. Ensures the resolved config actually lands on the ingestor constructor in the entrypoint flow, not just in resolve() output.

Two intentional, documented divergences from the templates:

1. unique_id_column defaults to None (UUID, no PII leakage). Existing keypoint_detection / semantic_segmentation templates set unique_id_column="filename"; the harness configs for those two categories opt back in via data_id: {strategy: column, column: filename} to preserve template-equivalent end-state.
2. label_policy=bucket required by schema for regression-class categories. The existing templates send raw target values through; YAML path applies hash-bucket. Per parent client#85, raw targets shouldn't leak. The harness asserts label_policy="bucket" for those three categories rather than asserting payload equivalence on the numeric label.

One non-functional difference (tolerated): tabular templates set file_options={"number_of_columns": len(schema)}. number_of_columns is dead code (no consumer in the package; grep confirms). YAML path omits it.

Plus per-category tuning needed to make equivalence work:

- tracebloc_ingestor/cli/conventions.py
  - Replace single DEFAULT_IMAGE_FILE_OPTIONS with DEFAULT_IMAGE_FILE_OPTIONS_BY_CATEGORY: 512×512 for classification + segmentation, 448×448 for object_detection + keypoint_detection (matches what each template explicitly sets).
  - Bridge resolved.time_column → file_options["time_column"] for time_to_event_prediction so TimeToEventValidator gets it without the customer having to repeat the value in spec.file_options.
- tests/test_conventions.py: updated to test per-category target_size defaults; added tests pinning the 448×448 vs 512×512 split.

133/133 tests passing across schema + conventions + entrypoint + label-policy + equivalence. #44 acceptance criteria fully met.

Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#44): JSONIngestor super() call collides with label_policy

`super().__init__()` passed `log_level` and `validators` as positional args at slots 12-13, which now map to BaseIngestor's `file_options` and `label_policy` parameters. Combined with the explicit `label_policy=` kwarg, every JSONIngestor instantiation raised `TypeError: got multiple values for argument 'label_policy'`. The crash was invisible to tests because all tests mock JSONIngestor.

Drop the bogus positional args from the super call (BaseIngestor no longer accepts them), stash `validators` on the instance for callers that pass it through, and guard `logger.setLevel` against a None level.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#44): require data_id.strategy so column-only configs are rejected

The schema's `if: properties.strategy.const == "column"` was vacuously true when `strategy` was absent (draft-07 `properties` matches on absent keys), so `data_id: { column: filename }` passed validation. The resolver in `cli/conventions.py` then checked `data_id.get("strategy") == "column"`, returned False, and silently fell back to UUID generation, dropping the customer's explicit column selection with no error.

Make `strategy` required at the `data_id` object level (and inside the `if` clause for completeness) so a missing strategy is now a hard validation failure pointing at `data_id.strategy`.
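The vacuous match can be reproduced without a schema library. The toy evaluator below is an assumption made for illustration: it implements only the `const`, `required`, and `properties` keywords, following the draft-07 rule that `properties` constrains a key only when the key is present. The names `matches`, `LOOSE_IF`, and `STRICT_IF` are hypothetical and do not appear in ingest.v1.json.

```python
def matches(schema, instance):
    """Evaluate a tiny subset of draft-07: const, required, properties."""
    if "const" in schema and instance != schema["const"]:
        return False
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                return False
        for key, subschema in schema.get("properties", {}).items():
            # An absent key is simply not checked: this is the vacuous case.
            if key in instance and not matches(subschema, instance[key]):
                return False
    return True


# Shape of the original `if` clause: matches whenever `strategy` is missing.
LOOSE_IF = {"properties": {"strategy": {"const": "column"}}}

# With `required`, the clause only matches when the key is really there.
STRICT_IF = {
    "properties": {"strategy": {"const": "column"}},
    "required": ["strategy"],
}

column_only = {"column": "filename"}

if __name__ == "__main__":
    print("loose clause matches column-only config:", matches(LOOSE_IF, column_only))
    print("strict clause matches column-only config:", matches(STRICT_IF, column_only))
```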
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#44): patch file_transfer.config in place after env-var bridge

`file_transfer.py` holds a module-level `config = Config()` captured at import time, long before the entrypoint runs `_set_legacy_env_vars`. And `Config` is a dataclass whose `os.getenv` defaults are evaluated once at class-definition time, not per-instance. So setting `SRC_PATH` / `TABLE_NAME` / `LABEL_FILE` in os.environ from the entrypoint never reached the already-constructed `file_transfer.config`: image / text / segmentation transfers used stale defaults and wrote to the wrong destination.

Keep the env-var sets (for any code that reads them lazily) and add a direct in-place patch of `file_transfer.config.{SRC_PATH, LABEL_FILE, TABLE_NAME, DEST_PATH}`. Refactoring `file_transfer.py` to take paths as parameters is still the right long-term move and stays out of scope per the PR description.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#44): warn on spec.validators / spec.sidecars; drop dead JSONIngestor params

Two related issues from bugbot:

1. spec.validators and spec.sidecars were schema-accepted but never extracted by resolve() nor warned about at runtime, so a customer writing either got no indication their override was inert. Mirror the spec.processors pattern: capture them in ResolvedConfig and emit a one-line warning in the entrypoint when non-empty.
2. JSONIngestor's `validators` kwarg (and the matching `validators=None` at the call site) was dead code. BaseIngestor builds validators via map_validators(category, file_options), so the stored attribute was never read. Removed the parameter, the unused BaseValidator import, and the now-redundant call-site kwarg.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#44): propagate file_options to JSONIngestor

`_build_ingestor` constructed `JSONIngestor` without passing `resolved.file_options`, so any category-specific knobs the resolver had bridged into `file_options` were silently dropped on the JSON path. The acute case: `time_to_event_prediction` with a JSON source. Step 7a of `resolve()` injects `time_column` into `file_options`, which `BaseIngestor.validate_data` then feeds into `map_validators(category, file_options)`. Without it, `TimeToEventValidator` falls back to the default column, ignoring the customer's `time_column`.

Added the missing `file_options` param to JSONIngestor's signature (plumbed through to `super().__init__` so `BaseIngestor.file_options` is populated) and now passing `resolved.file_options` at the call site.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#44): NaN labels and time_column override precedence

Two issues, one caught by bugbot and one caught by a proactive review pass.

1. Bucket policy treated float('nan') as a regular value: pandas renders missing numeric cells as NaN, str(nan) is "nan" (non-empty), so NaN labels bypassed the missing-sentinel branch and got hashed into a bucket. Added an isnan check before stringification so NaN, None, empty, and whitespace all collapse to MISSING_LABEL_BUCKET as documented.
2. time_column precedence: when both the top-level time_column shorthand and spec.file_options.time_column were set, the top-level value overwrote the explicit spec override, the opposite of how every other spec.file_options key behaves. Switched to setdefault so the spec value (already merged in step 7) wins, matching the rest of the resolver's "spec is the advanced override" model.
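Both fixes can be illustrated with a short standard-library sketch. It is not the contents of label_policy.py or conventions.py: the constant names follow the commit text, but the reading of "sha256(str(value))[:8] mod 64" as the first eight hex digits taken modulo 64, and the helper `bridge_time_column`, are assumptions made for illustration.

```python
import hashlib
import math

PASSTHROUGH = "passthrough"
BUCKET = "bucket"
NUM_BUCKETS = 64
MISSING_LABEL_BUCKET = -1  # outside [0, 64), so it cannot collide with a bucket


def _is_missing(value):
    """Treat None, NaN, empty, and whitespace-only labels as missing."""
    if value is None:
        return True
    # The NaN check must come before str(): str(float("nan")) is "nan",
    # which is non-empty and would otherwise be hashed like a real label.
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def apply(value, policy):
    """Return the value unchanged (passthrough) or its stable bucket id."""
    if policy == PASSTHROUGH:
        return value
    if policy != BUCKET:
        raise ValueError(f"unknown label policy: {policy!r}")
    if _is_missing(value):
        return MISSING_LABEL_BUCKET
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % NUM_BUCKETS


def bridge_time_column(file_options, time_column):
    """Hypothetical helper: copy the shorthand in only if spec did not set it."""
    merged = dict(file_options)
    merged.setdefault("time_column", time_column)
    return merged
```

With `setdefault`, `bridge_time_column({"time_column": "t_spec"}, "t_top")` keeps `"t_spec"`, which is the precedence the second fix describes.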
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ingestor): strip label columns from schema metadata sent to backend (#85)

* fix(ingestor): strip label/annotation/unique_id columns from schema metadata

The schema in file_options is forwarded to the backend as meta_data via send_global_meta_meta. If a template (e.g. time_to_event_prediction) passes the raw schema in file_options, or if the base injects it, the label/annotation/unique_id columns leak through even though the DB table itself excludes them. Always sanitize file_options["schema"] using the same cleaned table_schema used to create the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ingestor): recompute number_of_columns after schema cleanup

Templates set file_options["number_of_columns"] = len(schema) before the base ingestor strips label/annotation/unique_id columns. Keep the count consistent with the sanitized file_options["schema"] so any future consumer sees matching values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#45): release official ingestor image to GHCR with cosign signing and SBOM (#77)

* feat(#45): release image to GHCR with cosign signing and SBOM

Replaces the per-customer Python script + Dockerfile pattern with one official image, semver-tagged, signed, and consumed by digest. Depends on the declarative entrypoint from #44 (`tracebloc-ingest` console script registered in setup.py).

.github/workflows/release-image.yml
  Triggered on v*.*.* tags (or workflow_dispatch). Builds for ghcr.io/tracebloc/ingestor with semver tags :X.Y.Z / :X.Y / :X (deliberately no :latest per acceptance criteria). Buildkit produces SLSA provenance and SBOM as OCI attestations. Cosign keyless-signs the digest via GitHub Actions OIDC.
  The release notes are updated with the digest + verification command for downstream pinning.

Dockerfile
  Multi-stage. Builder stage produces a wheel from this repo's source (no `pip install tracebloc_ingestor` from PyPI, because the image must ship the exact code being released). Runtime stage installs the wheel + requirements.txt deps onto a python:3.11-slim base. Drops the `COPY ingestor.py` pattern (the customer no longer brings a script). Runs as nobody (non-root).

docker-entrypoint.sh
  Keeps the MySQL wait but bounds it (MYSQL_WAIT_SECONDS, default 120s) so a misconfigured client surfaces a clear failure instead of hanging the Job. `exec tracebloc-ingest "$@"` at the end so the Python process becomes PID 1 and Kubernetes signal handling reaches the application directly.

.dockerignore
  Un-exclude Readme.md (setup.py reads it for long_description).

Verified locally: `docker build .` succeeds, image runs as nobody, and the entrypoint correctly fails fast with "INGEST_CONFIG env var not set" when run without a config, confirming #44's cli/run.py is wired through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(release-image): use inputs.ref for tags on workflow_dispatch

Bugbot caught a real bug: docker/metadata-action defaults to reading github.ref, but on a workflow_dispatch event github.ref points at the default branch, not the tag the user typed into inputs.ref. The semver patterns produced zero tags, build-push-action ran without pushing anything, and the cosign loop signed nothing. The workflow exited 0 despite no image being published, which is the worst failure mode for a release pipeline.

Two changes:

1. Pin each type=semver pattern with `value=${{ inputs.ref || github.ref_name }}` so both trigger paths feed the same source into the semver parser. On push-tag: github.ref_name is the tag. On workflow_dispatch: inputs.ref wins. Either way meta produces the three expected tags.
2. Add a "Verify tags were produced" step between meta and build that fails loudly if the tag list is empty. Defense-in-depth against the same class of bug returning: if anyone refactors the value source later and breaks it, the build aborts before silently pushing nothing.

The checkout step's `inputs.ref || github.ref` was already correct, so the build context is fine; only the metadata extraction needed fixing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(release-image): move inputs.ref out of inline run-script interpolation

Bugbot caught a regression I introduced in ae711ef: the "Verify tags were produced" step embedded ${{ inputs.ref || github.ref_name }} directly inside a double-quoted echo, which is a shell-injection vector on workflow_dispatch. A collaborator could craft an inputs.ref value like `$(malicious)` and the substitution would execute before bash even sees the line.

Move REF into the env: block alongside TAGS so it's passed via the environment and read with `${REF}` at runtime, so the shell never sees the user-supplied string in a position where command substitution applies. Same pattern the release-notes step already uses for TAG / IMAGE / DIGEST / TAGS.

Audited the rest of the workflow: all other `${{ inputs.ref … }}` references are action parameters (actions/checkout's ref, docker/metadata-action's value=) or env-block values, none in run: scripts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: add masked language modeling ingestor template

- Add MASKED_LANGUAGE_MODELING to TaskCategory constants
- Add MLM validator mapping (FileTypeValidator for sequences/, TableName, Duplicate, optional DataValidator)
- Create templates/masked_language_modeling/ with ingestion script, README, and 5 sample sequence files from PrimeKG-style random walks
- Add YAML example for CLI-based ingestion
- MLM is self-supervised: no label_column needed, CSV manifest only has filename + extension columns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add TokenizerValidator for MLM special token validation

Validates that tokenizer.json exists at the data path and contains required [MASK] and [PAD] tokens before ingestion, preventing silent embedding out-of-bounds errors during training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add masked_language_modeling to schema and conventions layer

Adds "masked_language_modeling" to the category enum in ingest.v1.json, defines the "sequences" sidecar property, and wires MLM into conventions.py so _data_format_for() and resolve() handle the new category without raising ValueError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Handle Unigram tokenizer vocab format in TokenizerValidator

Unigram tokenizers store vocab as [[token, score], ...] lists, not dicts. Without this, special tokens in the main vocab would be missed and ingestion incorrectly blocked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Merge pull request #89 from tracebloc/chore/align-python-version

chore: lift python_requires to 3.11

* Fix MLM ingestor skipping file transfer to cluster storage

The masked_language_modeling category was missing from the file transfer routing in both base.py and file_transfer.py.
This meant sequence text files were never copied from SRC_PATH/sequences/ to DEST_PATH/{TABLE_NAME}/, leaving the database with records that have no backing files on the shared volume.

- Add MASKED_LANGUAGE_MODELING to the category list that triggers file transfer in BaseIngestor.ingest()
- Add MASKED_LANGUAGE_MODELING case to map_file_transfer() routing, calling text_transfer() with src_subdir="sequences"
- Make text_transfer() accept a configurable src_subdir parameter (defaults to "texts" for backwards compatibility)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(csv_ingestor): stop clobbering cleaned schema in file_options (#93) (#94)

BaseIngestor.__init__ populates self.file_options with the cleaned table_schema (label_column / annotation_column / unique_id_column stripped) so map_validators and send_global_meta_meta see a sanitized schema before any data leaves the cluster.

CSVIngestor.__init__ then ran `self.file_options = file_options or {}` after super().__init__, which silently undid that work. On the new YAML path with tabular and time-series categories the caller passes file_options={}, so `file_options or {}` evaluated to a fresh empty dict, dropping the cleaned schema entirely:

- map_validators received empty options → DataValidator, TimeFormatValidator, NumericColumnsValidator setup was skipped.
- send_global_meta_meta sent empty metadata to the backend.

The reassignment was always redundant (base already initialises self.file_options); deleting it lets the cleaned schema survive on the YAML path. Pre-YAML templates were unaffected because file_options was truthy.

Caught by Cursor Bugbot on #92. Closes #93.
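The clobbering pattern is easy to reproduce in isolation. The classes below are toy stand-ins, not BaseIngestor or CSVIngestor; in particular, the way the base class stores `file_options` is an assumption chosen so that the sketch matches the behaviour the commit message reports (an empty dict loses the schema, a truthy dict keeps it).

```python
class ToyBase:
    """Stand-in for the base ingestor: stores a label-free schema."""

    def __init__(self, schema, label_column, file_options=None):
        # Assumption: the base reuses the caller's dict when it is truthy.
        self.file_options = file_options or {}
        self.file_options["schema"] = {
            name: dtype for name, dtype in schema.items() if name != label_column
        }


class ClobberingCSV(ToyBase):
    def __init__(self, schema, label_column, file_options=None):
        super().__init__(schema, label_column, file_options)
        # The bug: for file_options == {} this builds a fresh empty dict,
        # discarding the cleaned schema the base class just stored.
        self.file_options = file_options or {}


class FixedCSV(ToyBase):
    def __init__(self, schema, label_column, file_options=None):
        # The fix: no reassignment; the base already set self.file_options.
        super().__init__(schema, label_column, file_options)


SCHEMA = {"rooms": "INT", "price": "FLOAT"}

if __name__ == "__main__":
    print(ClobberingCSV(SCHEMA, "price", {}).file_options)
    print(FixedCSV(SCHEMA, "price", {}).file_options)
```

A truthy dict such as `{"number_of_columns": 2}` survives in both classes because the same object is reused, which is consistent with the note that pre-YAML templates were unaffected.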
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dockerfile): pin numeric USER 65534 for K8s runAsNonRoot admission (#92)

Kubernetes admission rejects Pods with securityContext.runAsNonRoot: true when the image USER is a non-numeric string. The kubelet cannot verify "nobody" is not root at admission time and the Pod ends up in CreateContainerConfigError:

    container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root

The client-runtime side already works around this by setting run_as_user=65534 explicitly on the Pod's securityContext (submit_ingestion_run.py build_job_spec), but any other consumer who pulls ghcr.io/tracebloc/ingestor and runs it under their own Pod spec with runAsNonRoot: true will hit the same admission failure. Pinning a numeric UID in the image is the right defense-in-depth.

65534 is the `nobody` UID on Debian (the python:3.11-slim base), Ubuntu, and Alpine.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(#97): make Config lazy and drop laptop-path defaults (#98)

* refactor(#97): make Config lazy and drop laptop-path defaults

Validators (file_validator, image_validator, ...) plus file_transfer, database, csv_ingestor all instantiate `config = Config()` at module top-level. When Config was a @dataclass with `os.getenv` defaults, those fields were frozen at class-definition time, long before the declarative entrypoint (cli/run.py:main) set SRC_PATH / TABLE_NAME / LABEL_FILE from the resolved ingest.yaml.

On a customer cluster on 2026-05-19, FileTypeValidator failed with "Path does not exist" against the laptop-default `~/Downloads/data-ingestors/.../crowd_monitoring/...` because that's the value the dataclass captured at import. The previous workaround in _set_legacy_env_vars patched only file_transfer.config in place; the 14 validator modules each held their own stale snapshot, so the workaround never reached them.
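The difference between a default captured at class-definition time and a value read on access can be shown in a few lines. `EagerConfig` and `LazyConfig` are hypothetical names for this sketch, and the placeholder default path is invented; neither class is the project's Config.

```python
import os
from dataclasses import dataclass

# Start from a clean slate so the class-definition-time read sees no value.
os.environ.pop("SRC_PATH", None)


@dataclass
class EagerConfig:
    # os.getenv runs once, while the class body is executed; every instance
    # created later shares this already-computed default.
    SRC_PATH: str = os.getenv("SRC_PATH", "/placeholder/default")


class LazyConfig:
    @property
    def SRC_PATH(self):
        # Read the environment on every access instead of at import time.
        return os.environ.get("SRC_PATH", "")


# Module-level instances, as the validator modules create them.
eager = EagerConfig()
lazy = LazyConfig()

# Later, an entrypoint sets the real value.
os.environ["SRC_PATH"] = "/data/shared"

if __name__ == "__main__":
    print("eager instance:", eager.SRC_PATH)
    print("new eager instance:", EagerConfig().SRC_PATH)
    print("lazy instance:", lazy.SRC_PATH)
```

Even an `EagerConfig()` constructed after the environment change still carries the old default, which is why patching a single module-level instance could not reach the other modules.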
Fix: Config is now a plain class whose env-driven fields are @property, reading os.environ on access. Module-level `config = Config()` is harmless; properties capture nothing at instantiation.

Tests that build pinpoint configs via `Config(BACKEND_TOKEN=..., CLIENT_USERNAME=...)` keep working: __init__ accepts overrides as kwargs and they win over env (sentinel lookup distinguishes absent from explicit None).

_set_legacy_env_vars reduces to env-var writes only; the in-place patches go away.

Hardening: drop the developer-laptop defaults entirely. SRC_PATH / LABEL_FILE / TABLE_NAME default to empty string and TITLE to None, so a misconfigured pod fails loudly in path operations instead of silently scanning a developer dir.

Tests added in tests/test_config_lazy.py lock in:

- env set before Config() flows through
- env mutated *after* Config() flows through (the regression)
- module-level validator configs observe later env mutations
- per-instance overrides win over env, including explicit None
- laptop-path defaults are gone

Existing tests/test_cli_run.py:test_file_transfer_config_patched_in_place is rewritten as test_file_transfer_config_reads_env_lazily: the new contract is env-set → property reads through, no in-place mutation.

Version bumps 0.2.10 → 0.3.0 (internal Config rewrite, surface API preserved).

Refs: #97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(config): reject None for numeric overrides at construction

`Config(DB_PORT=None)` / `Config(BATCH_SIZE=None)` previously deferred a `TypeError: int() argument must be a string ...` to first property access, with no hint about which field was at fault. Both properties unconditionally call `int(raw)` on the override.

The `_MISSING` sentinel design treats `None` as a valid suppression value for nullable fields (BACKEND_TOKEN, CLIENT_USERNAME, ...), but for numeric fields it makes no sense.
Reject at construction with a message that names the field and points the caller to omit the kwarg. Regression test in `test_config_lazy.py` covers both fields and locks in that valid ints / str-ints continue to work.

Addresses bugbot finding on #98.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#99): surface file_transfer failures in ingestion summary (#100)

* fix(#99): surface file_transfer failures in ingestion summary

Single-file transfer functions (image_transfer / annotation_transfer / text_transfer) used to return the record unchanged when the source file was missing, so map_file_transfer returned a truthy value and the DB + API write paths happily completed. The summary's counters tracked DB + API outcomes only, so a run where every source file was missing still printed "🎉 Ingestion completed successfully!" at 100%, masking silent data loss on the destination volume.

  * Return None on missing source / missing filename so the existing skip path in BaseIngestor.ingest picks them up.
  * Add IngestionSummary.file_transfer_failures (tracked separately from skipped_records so operators can tell data-loss skips from validation skips) and IngestionSummary.has_failures.
  * Rewrite the banner: never celebrate when any non-trivial failure occurred; print "completed with N failures, see logs" instead.
  * Append file-transfer failures to the failed_records list returned by ingest() so cli.run.main exits non-zero and the K8s job marker reflects the failure.
  * Tests: 13 new cases covering the transfer functions, has_failures, banner gating, and the end-to-end counter wiring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#99): address bugbot review

- Banner total no longer double-counts file_transfer failures.
  The three channels (file_transfer, DB, API-only) are mutually exclusive per record, so we sum them directly instead of using ``total_records - api_sent_records`` which conflates all dropped records into the API bucket.
- ``Failed to Send to API`` summary line now reads ``inserted_records - api_sent_records`` for the same reason: a 576-record file-transfer disaster used to print "Failed to Send to API: 576" alongside "File Transfer Failures: 576", double-counting the same set.
- The file-transfer skip branch now calls ``pbar.update(1)`` before ``continue`` so an all-transfer-failure run advances tqdm instead of leaving it stuck at 0/N.
- Regression tests for both: 576-failure run reports a single 576 banner total (not 1152) and ticks the progress bar 5 times in the end-to-end test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#101): allow rerun when destination dir is empty leftover (#102)

A previous ingestion that aborted during file_transfer can leave an empty /data/shared/<table_name>/ directory on the shared PVC. The next attempt (same ingest.yaml, fixed image) then failed validation:

    ValueError: Duplicate Validator Validator failed: Destination directory '/data/shared/sample_cats_dogs_train' already exists

Customers had to kubectl exec into a PVC-mounted pod and rm -rf between attempts, or rename the table for every retry. Neither was documented.

Treat an existing-but-empty destination as a leftover from an aborted run: log a warning and let validation pass. A populated destination still fails; that's the real "would clobber an existing dataset" case the validator was written to catch.
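The existing-but-empty rule can be sketched as a small predicate. This is an illustration under stated assumptions rather than the validator's actual code: `destination_is_reusable` is a hypothetical helper name, and the real validator additionally logs a warning, which is omitted here.

```python
import os
import tempfile


def destination_is_reusable(path: str) -> bool:
    """Return True when `path` is absent or is an empty directory."""
    if not os.path.exists(path):
        return True
    if not os.path.isdir(path):
        return False
    # scandir is lazy, so this stops at the first entry instead of
    # listing a potentially large dataset directory.
    with os.scandir(path) as entries:
        return next(entries, None) is None


with tempfile.TemporaryDirectory() as root:
    leftover = os.path.join(root, "sample_table")
    os.mkdir(leftover)
    empty_ok = destination_is_reusable(leftover)      # empty leftover

    with open(os.path.join(leftover, "cat1.jpeg"), "wb") as fh:
        fh.write(b"\x00")
    populated_ok = destination_is_reusable(leftover)  # now holds a file

    missing_ok = destination_is_reusable(os.path.join(root, "never_created"))

print(empty_ok, populated_ok, missing_ok)
```

Treating any entry at all, including hidden files, as "populated" is the conservative choice in this sketch; whether the real validator does the same is not stated in the commit message.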
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix three real-cluster bugs found during 2026-05-19 validation (#106)

* docs: drop "active use case" from prereqs

A use case can only be created after data has been ingested, since the platform needs the dataset's real schema and stats to define one. The prior wording sent newcomers to the web app to create a use case before they had a dataset, which is impossible.

Closes #64

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version to 0.2.11

* chore: scrub client credentials from ingestor-job.yaml

* fix(ingestor): strip label columns from schema metadata sent to backend (#85)

* fix(ingestor): strip label/annotation/unique_id columns from schema metadata

The schema in file_options is forwarded to the backend as meta_data via send_global_meta_meta. If a template (e.g. time_to_event_prediction) passes the raw schema in file_options, or if the base injects it, the label/annotation/unique_id columns leak through even though the DB table itself excludes them. Always sanitize file_options["schema"] using the same cleaned table_schema used to create the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ingestor): recompute number_of_columns after schema cleanup

Templates set file_options["number_of_columns"] = len(schema) before the base ingestor strips label/annotation/unique_id columns. Keep the count consistent with the sanitized file_options["schema"] so any future consumer sees matching values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(packaging): bundle ingest.v1.json in wheel and sdist

setup.py declared package_data for "tracebloc_ingestor.schema" but the directory had no __init__.py, so find_packages() didn't see it as a subpackage and the data-files declaration was a silent no-op.
Both the wheel (via package_data) and sdist (via MANIFEST.in) were missing the schema, so the released image crashed on first invocation of cli.run._load_schema with FileNotFoundError on the site-packages schema path.

Add the marker __init__.py + a recursive-include line in MANIFEST.in so both build paths bundle the JSON. Tests pin both the subpackage import and the production _load_schema call so a future regression in the wheel-build pipeline surfaces here, not three weeks later in cluster validation.

Closes #103. Complements #95 (release-time smoke test that would have caught this at build time).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(image_validator): normalize tuple/list in resolution comparison

PIL returns image.size as a tuple; YAML/JSON parses ``target_size: [H, W]`` as a list. Python's sequence equality is type-strict (``(256, 256) == [256, 256]`` is False), so the tolerance==0 path in _resolution_matches rejected every image whose dimensions matched exactly. The tolerance>0 branch already worked by accident (it uses index access, which doesn't care about sequence type).

Wrap both sides in tuple() before comparison. Tests pin every tuple/list shape combination plus the tolerance>0 path that was already working, so a future regression on either side surfaces here.

Closes #104.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(file_transfer): normalize extension dot+case in _has_extension

FileExtension.get_all_extensions() returns values WITH the leading dot (".jpeg" etc.) but str.split(".") returns the suffix WITHOUT one. _has_extension therefore always returned False, _find_src always appended the extension a second time, and every transfer attempt looked for a non-existent ``cat1.jpeg.jpeg`` source. Combined with the pre-#100 silent-failure summary, customers saw 100% success while no image files actually landed on the PVC.
Use rsplit(".", 1) + a normalized ".<lowercase-ext>" comparison so case-insensitive matching works too (consistent with ImageResolutionValidator._is_image_file). Tests cover the classes of input that mattered in cluster validation: lowercase, uppercase, no-extension, unknown extension, and multi-dot names.

Closes #105. Together with #100 closes the 2026-05-19 cluster-validation incident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: shujaat <shujaat@tracebloc.io>
Co-authored-by: shujaat_tracebloc <153823837+shujaatTracebloc@users.noreply.github.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Divya <divyasingh@tracebloc.io>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Divya <divyasingh@tracebloc.io>
Co-authored-by: shujaat <shujaat@tracebloc.io>
Co-authored-by: shujaat_tracebloc <153823837+shujaatTracebloc@users.noreply.github.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
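The extension mismatch from the file_transfer fix comes down to comparing a dotless suffix against a set of dotted extensions. A minimal sketch follows, assuming a hypothetical `KNOWN_EXTENSIONS` set in place of `FileExtension.get_all_extensions()` and hypothetical function names; it is not the project's `_has_extension`.

```python
# Hypothetical stand-in for FileExtension.get_all_extensions(),
# which per the commit message returns values with the leading dot.
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def has_extension_buggy(filename: str) -> bool:
    # split(".") yields "jpeg", never ".jpeg", so this never matches.
    return filename.split(".")[-1] in KNOWN_EXTENSIONS


def has_extension_fixed(filename: str) -> bool:
    # rsplit with maxsplit=1 isolates only the final suffix.
    parts = filename.rsplit(".", 1)
    if len(parts) != 2:
        return False  # no dot at all, so no extension
    # Re-attach the dot and lowercase so ".JPEG" matches ".jpeg".
    return "." + parts[1].lower() in KNOWN_EXTENSIONS


for name in ["cat1.jpeg", "CAT1.JPEG", "cat1", "cat1.xyz", "archive.v2.png"]:
    print(name, has_extension_buggy(name), has_extension_fixed(name))
```

With the buggy check returning False for every name, a caller that appends an extension when none is detected would build names such as `cat1.jpeg.jpeg`, matching the symptom in the commit message.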
Summary

The Dockerfile is `FROM python:3.11`; the published package's `setup.py` was still declaring `python_requires=">=3.8"`. Lifting the floor so pip refuses installs on incompatible interpreters instead of silently producing a broken environment.

Part of the org-wide Python version alignment. Companion PRs in `backend`, `averaging-service`, `tracebloc-client`, `tracebloc-py-package`.

Risk

Any downstream consumer of `tracebloc_ingestor` on Python <3.11 will fail to install after the next release. Confirm internal consumers (Helm subchart in tracebloc-client) all use 3.11 images.

Test plan

- `tracebloc-ingest` entrypoint still launches inside the 3.11 base image

Note

Low Risk

Low code risk (metadata-only), but it will break installation for any downstream users still on Python <3.11.

Overview

Updates packaging metadata in `setup.py` to raise `python_requires` from `>=3.8` to `>=3.11`, ensuring pip refuses installs on unsupported Python versions.

Reviewed by Cursor Bugbot for commit 9313b91.
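To make the effect of the new floor concrete, the comparison that a `>=3.11` constraint implies can be sketched with plain tuples. This is not pip's implementation (pip evaluates the package's Requires-Python metadata with the `packaging` library); `MINIMUM_PYTHON` and `interpreter_is_supported` are hypothetical names used only for illustration.

```python
import sys

# Mirrors python_requires=">=3.11" as a (major, minor) tuple.
MINIMUM_PYTHON = (3, 11)


def interpreter_is_supported(version_info=sys.version_info,
                             floor=MINIMUM_PYTHON) -> bool:
    """Return True when the (major, minor) version meets the floor."""
    # Tuples compare element by element, so (3, 8) < (3, 11) < (3, 12).
    return tuple(version_info[:2]) >= floor


print(interpreter_is_supported((3, 8, 10)))   # False
print(interpreter_is_supported((3, 11, 0)))   # True
print(interpreter_is_supported((3, 12, 1)))   # True
```

Comparing integer tuples rather than version strings avoids the pitfall where "3.8" sorts after "3.11" lexicographically.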