chore: lift python_requires to 3.11 #89

Merged
divyasinghds merged 1 commit into develop from chore/align-python-version
May 18, 2026

Conversation

@divyasinghds divyasinghds commented May 18, 2026

Contributor

Summary

The Dockerfile is FROM python:3.11; the published package's setup.py was still declaring python_requires=">=3.8". Lifting the floor so pip refuses installs on incompatible interpreters instead of silently producing a broken environment.

Part of the org-wide Python version alignment. Companion PRs in backend, averaging-service, tracebloc-client, tracebloc-py-package.

Risk

Any downstream consumer of tracebloc_ingestor on Python <3.11 will fail to install after the next release. Confirm internal consumers (Helm subchart in tracebloc-client) all use 3.11 images.

Test plan

  • CI green
  • Confirm tracebloc-ingest entrypoint still launches inside the 3.11 base image

Note

Low Risk
Low code risk (metadata-only), but it will break installation for any downstream users still on Python <3.11.

Overview
Updates packaging metadata in setup.py to raise python_requires from >=3.8 to >=3.11, ensuring pip refuses installs on unsupported Python versions.
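
As a rough illustration of what the new floor means at install time, the sketch below compares an interpreter's (major, minor) pair against the declared minimum. The helper name `meets_floor` is hypothetical; pip evaluates the full `python_requires` specifier itself rather than calling anything like this.

```python
import sys

# Hypothetical helper: mirrors the effect of python_requires=">=3.11"
# for a plain X.Y floor. It is not part of pip or of this package.
def meets_floor(version_info, floor=(3, 11)):
    """Return True when (major, minor) is at or above the floor."""
    return tuple(version_info[:2]) >= tuple(floor)

# The running interpreter can be checked the same way.
current_ok = meets_floor(sys.version_info)
```

Under this sketch a 3.8 interpreter is rejected and 3.11 or 3.12 is accepted, which is the behavior change the Risk section warns downstream consumers about.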

Reviewed by Cursor Bugbot for commit 9313b91. Bugbot is set up for automated code reviews on this repo.

The Dockerfile and all tracebloc deployments run Python 3.11. Drop the
stale 3.8 floor so pip refuses installs on incompatible interpreters
instead of silently producing a broken environment.

- python_requires = ">=3.8" -> ">=3.11"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukasWodka

Collaborator

👋 Heads-up — Code review queue is at 21 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@divyasinghds divyasinghds self-assigned this May 18, 2026
@divyasinghds divyasinghds requested a review from saadqbal May 18, 2026 09:30
@divyasinghds divyasinghds marked this pull request as ready for review May 18, 2026 09:49
@LukasWodka

Collaborator

👋 Heads-up — Code review queue is at 25 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@divyasinghds divyasinghds merged commit 42d1787 into develop May 18, 2026
7 checks passed
@divyasinghds divyasinghds deleted the chore/align-python-version branch May 18, 2026 12:49
saadqbal added a commit that referenced this pull request May 20, 2026
* feat(#44): declarative ingest.yaml schema, entrypoint, and equivalence harness (#69)

* feat(#44): add ingest.v1 JSON schema, examples, and validation tests

First commit of #44's declarative-YAML-config flow. Contract-only — no
runtime code yet (entrypoint, conventions resolver, label-policy bucketing,
and equivalence harness follow in subsequent commits on this branch).

What this lands:

- schema/ingest.v1.json
  - Draft-07 JSON Schema for the IngestConfig YAML.
  - additionalProperties:false at every level (catches typos like `lable:`
    or `catagory:` at validation time, not runtime).
  - apiVersion locked to "tracebloc.io/v1"; breaking changes require v2.
  - category enum covers all 10 TaskCategory values from
    tracebloc_ingestor/utils/constants.py, including instance_segmentation
    (which has no map_validators branch yet but is in the enum so customers
    can author against the contract; missing-branch is tracked separately).
  - Conditional requirements per category:
      * image-based      → `images` required
      * object_detection → `annotations` required
      * semantic_segmentation → `masks` required
      * text_classification   → `texts` required
      * tabular/time-series   → `schema` required
      * regression-class      → `label` MUST be the object form with
                                 `policy` set (string shorthand rejected)
  - Exactly one of `csv` / `json` is the data source (oneOf with mutual
    not-required).
  - `label`: string shorthand for the dominant case; object form for
    explicit policy control. Both validate where allowed.
  - `data_id`: defaults to UUID strategy (no source column leaves the
    cluster). Opting into source-column mapping requires explicit
    `strategy:column` + `column:<name>`; loud and auditable.
  - `spec.processors[]`: contract for the customScript escape hatch (Helm
    subchart will mount the script via ConfigMap; the official image
    imports + instantiates the named class). Note: the underlying
    BaseProcessor mechanism does NOT exist yet — #44's later commits
    create it. The ticket overstated existing code.

- examples/yaml/*.yaml — one per task category plus the custom-processor
  escape hatch:
      image_classification.yaml      (8 lines, the dominant case)
      object_detection.yaml
      keypoint_detection.yaml
      semantic_segmentation.yaml
      text_classification.yaml
      tabular_classification.yaml
      tabular_regression.yaml        (regression-class, label.policy:bucket)
      time_series_forecasting.yaml   (regression-class, label.policy:bucket)
      time_to_event_prediction.yaml  (regression-class + time_column)
      custom_processor.yaml          (escape hatch, PHI decryption use case)

- tests/test_schema_validation.py — 36 tests:
  * Positive: every example validates.
  * Acceptance: image_classification stays at 8 payload lines (the
    "extremely simple for users" #44 design constraint).
  * Coverage: every enum category has an example (modulo
    instance_segmentation, deferred until a template exists).
  * Negative: every conditional-requirement and rejection path the
    schema enforces — typos, invalid category, missing data source, both
    csv+json present, image categories without images, regression-class
    with shorthand label or missing policy, data_id.column without
    strategy:column, processor without class, locked apiVersion/kind.

- requirements.txt — adds PyYAML and jsonschema (needed by the entrypoint
  and these tests; the schema-image flow doesn't need them).

Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
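
The conditional requirements listed in the commit above can be pictured with a small plain-Python check. This is only a sketch: the real contract lives in the JSON Schema's `if/then` blocks, the flat config layout is an assumption, and `REQUIRED_KEYS` and `missing_keys` are hypothetical names.

```python
# Hypothetical mirror of a few per-category requirements named in the
# commit message. The actual rules are expressed in ingest.v1.json.
REQUIRED_KEYS = {
    "image_classification": ("images",),
    "object_detection": ("annotations",),
    "semantic_segmentation": ("masks",),
    "text_classification": ("texts",),
    "tabular_classification": ("schema",),
}

def missing_keys(config):
    """Return human-readable problems for a flat config dict (assumed layout)."""
    problems = []
    # Exactly one of csv / json may name the data source.
    sources = [key for key in ("csv", "json") if key in config]
    if len(sources) != 1:
        problems.append("exactly one of csv/json is required")
    category = config.get("category")
    for key in REQUIRED_KEYS.get(category, ()):
        if key not in config:
            problems.append(f"{key} is required for {category}")
    return problems
```

Keeping these rules in the schema rather than in code like this is what lets them fail at validation time, before any ingestion work starts.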

* feat(#44): add conventions resolver (category → ingestor kwargs)

Pure-function module that translates a (validated) ingest.yaml dict into a
ResolvedConfig the entrypoint can hand to CSVIngestor / JSONIngestor.
No I/O, no env reads, no globals — trivially unit-testable.

What's in:

- tracebloc_ingestor/cli/__init__.py
  - Module docstring describing the planned entrypoint flow.

- tracebloc_ingestor/cli/conventions.py
  - Category groupings (IMAGE_CATEGORIES, TEXT_CATEGORIES, TABULAR_CATEGORIES,
    TIME_SERIES_CATEGORIES, TIME_TO_EVENT_CATEGORIES, REGRESSION_CLASS_CATEGORIES).
    Single source of truth — both this module and (later) the entrypoint
    read these instead of redefining inline.
  - Default options (DEFAULT_CSV_OPTIONS, DEFAULT_IMAGE_FILE_OPTIONS,
    DEFAULT_TEXT_FILE_OPTIONS) match what the existing templates set, so
    YAML-driven runs are default-equivalent to script-driven runs (verified
    by the equivalence harness in a later commit).
  - ResolvedConfig dataclass: a fully-resolved configuration the entrypoint
    consumes. Every field is filled in (customer values win, conventions
    fill the rest).
  - resolve(config) -> ResolvedConfig: the resolver. Pre-condition is that
    config has already passed jsonschema validation; resolve() does not
    re-validate.

  Design notes baked in:
  - data_id default is UUID generation (no source column leaves the cluster);
    customers explicitly opt into column-mapping via `data_id.strategy:column`.
    Privacy-safe by default. This is a behavior change from the existing
    keypoint_detection / semantic_segmentation templates which set
    unique_id_column="filename" — the equivalence-harness YAMLs for those
    will set the column explicitly to preserve template-equivalent behavior.
  - keypoint_detection automatically sets annotation_column="Annotation"
    (matches the existing template's use of that CSV column for keypoint
    coords). Other categories rely on sidecar files only.
  - data_format derives from category groupings, mapping back to the
    existing DataFormat enum (image / text / tabular).

- tests/test_conventions.py
  - 38 tests, ~100% line coverage of conventions.py.
  - Round-trip: every shipped example resolves without error.
  - Defaults: csv_options, file_options, label_policy, unique_id_column.
  - Customer overrides win over defaults.
  - Label shorthand and object form produce equivalent ResolvedConfig for
    classification; regression-class object form's policy carries through.
  - data_id.strategy correctly drives unique_id_column.
  - processor_specs pass through verbatim (the entrypoint will load classes;
    resolver stays I/O-free).
  - Sanity-check: REGRESSION_CLASS_CATEGORIES and IMAGE_CATEGORIES sets
    must match the schema's `if/then` blocks — drift here is a bug.
  - JSON source dispatch (no JSON example ships yet; constructed inline).

74/74 passing tests across schema validation + conventions resolver.

Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
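
A minimal sketch of the "customer values win, conventions fill the rest" rule described above. The field subset, the `label.column` key, the `spec.csv_options` location, and the placeholder default are all assumptions made for illustration; the real `ResolvedConfig` and `resolve()` carry more fields.

```python
from dataclasses import dataclass, field
from typing import Optional

# Placeholder default; the real DEFAULT_CSV_OPTIONS values are not shown in this PR.
DEFAULT_CSV_OPTIONS = {"chunk_size": 1000}

@dataclass(frozen=True)
class ResolvedConfig:
    category: str
    label_column: str
    label_policy: str
    unique_id_column: Optional[str]
    csv_options: dict = field(default_factory=dict)

def resolve(config):
    """Pure function: no I/O, no env reads. Assumes config already validated."""
    label = config["label"]
    if isinstance(label, str):
        # String shorthand: the column name, with the classification default.
        label_column, label_policy = label, "passthrough"
    else:
        label_column = label["column"]  # assumed key name for the object form
        label_policy = label.get("policy", "passthrough")
    data_id = config.get("data_id", {})
    # UUID generation (None) unless the customer opted into column mapping.
    unique_id_column = data_id.get("column") if data_id.get("strategy") == "column" else None
    overrides = config.get("spec", {}).get("csv_options", {})
    return ResolvedConfig(
        category=config["category"],
        label_column=label_column,
        label_policy=label_policy,
        unique_id_column=unique_id_column,
        csv_options={**DEFAULT_CSV_OPTIONS, **overrides},  # customer values win
    )
```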

* feat(#44): add tracebloc-ingest entrypoint with schema-validation dispatch

Wires up the runtime side of the declarative-YAML flow. Reads INGEST_CONFIG
env, parses + validates the YAML, resolves convention defaults, dispatches
to CSVIngestor / JSONIngestor, runs the ingestion. Registered as the
`tracebloc-ingest` console script so the official image (#45) can use it
as ENTRYPOINT.

What's in:

- tracebloc_ingestor/cli/run.py (the entrypoint)
  - main(): full happy-path flow with explicit fail-fast points before any
    DB/network call (missing INGEST_CONFIG, malformed YAML, schema
    violations). Returns the exit code (0=ok, 1=records failed, 2=fail-fast)
    rather than calling sys.exit, so it's testable from inside pytest.
  - _validate(): yields jsonschema errors sorted by absolute_path so the
    output is deterministic regardless of validator traversal order.
  - _format_errors(): emits `<json-pointer>: <message>` per error. Real
    line numbers (per the ticket text) require a position-preserving YAML
    loader; deferred to v1.1 — the json-pointer path is enough for grep.
  - _set_legacy_env_vars(): sets SRC_PATH / TABLE_NAME / LABEL_FILE from
    the resolved config before constructing Config(), so the framework's
    existing path-resolution in file_transfer.py keeps working unchanged.
    A direct file_transfer.py refactor to take paths as parameters is the
    right long-term move; it's a follow-up — env-var injection is the
    minimal bridge for v1.
  - _build_ingestor(): dispatches by source_type. CSV path uses
    csv_options + file_options; JSON path uses json_options (defaulted to
    empty since the schema doesn't expose it yet) and validators=None
    (lets map_validators run from category, the dominant case).

- Schema relocation: schema/ingest.v1.json → tracebloc_ingestor/schema/
  - Bundles inside the package so it's discoverable post pip-install
    rather than only from a repo checkout. setup.py's package_data picks
    up the new tracebloc_ingestor.schema package; sdist + wheel both
    include the JSON.
  - Top-level schema/ directory is gone. Tooling references move to the
    new path.

- setup.py
  - entry_points["console_scripts"]: tracebloc-ingest = ...cli.run:main
  - package_data: bundle the schema JSON.

- tests/test_cli_run.py — 11 tests, full coverage of the entrypoint:
  - Failure modes (missing/nonexistent/malformed/schema-invalid YAML) all
    fail fast with a non-zero exit before any DB or network call.
  - Happy paths (CSV and JSON) construct the right ingestor with
    convention-default kwargs and call ingest() with the source path.
  - The legacy env-var bridge actually fires before Config() construction.
  - Failed records during ingest() yield exit code 1 (distinct from the
    fail-fast exit code 2).

DEFERRED to v1.1 (deliberately, with logging when triggered):

- spec.processors[] runtime execution. The schema accepts processors
  today, but the deployment story requires the Helm subchart from
  client#86 to mount the script body via ConfigMap — without it, there's
  no path for processor scripts to land in the pod. The entrypoint warns
  and skips; rest of the run continues. Customers shouldn't write
  processors: until client#86 lands.
- Line-numbered validation errors. JSON-pointer paths land in v1; real
  line numbers via a position-preserving YAML loader are a quality
  improvement for v1.1.

84/84 tests passing across schema validation + conventions + entrypoint.

Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
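
The deterministic `<json-pointer>: <message>` output and the three exit codes described above can be sketched as follows. Errors are modeled as plain `(path, message)` tuples rather than jsonschema error objects, and `format_errors` and `exit_code` are hypothetical stand-ins for the private helpers in `cli/run.py`.

```python
EXIT_OK = 0              # ingestion finished, no failed records
EXIT_RECORDS_FAILED = 1  # ingestion ran, some records failed
EXIT_FAIL_FAST = 2       # config problem detected before any DB/network call

def format_errors(errors):
    """Render (path, message) pairs as '<json-pointer>: <message>' lines.

    Sorting by the stringified path keeps the output stable regardless of
    the order in which the validator reported the errors.
    """
    ordered = sorted(errors, key=lambda err: [str(part) for part in err[0]])
    lines = []
    for path, message in ordered:
        pointer = "".join(f"/{part}" for part in path)
        lines.append(f"{pointer}: {message}")
    return lines

def exit_code(config_errors, failed_records):
    """Return an exit code instead of calling sys.exit, so tests can assert on it."""
    if config_errors:
        return EXIT_FAIL_FAST
    return EXIT_RECORDS_FAILED if failed_records else EXIT_OK
```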

* feat(#44): label-policy bucketing for regression-class targets

Per #44 / parent client#85: when the label is a numeric prediction target
(regression, time-series forecasting, time-to-event prediction), the raw
value must NOT leak to the central backend. The on-prem-data principle is
that only metadata crosses the cluster boundary; shipping the literal
target value defeats it.

This commit adds the single point where raw labels become bucket IDs
before APIClient.send_batch builds its payload.

What's in:

- tracebloc_ingestor/utils/label_policy.py
  - apply(value, policy): the only function callers need.
  - PASSTHROUGH: classification — value sent unchanged. No-op.
  - BUCKET: regression-class — value replaced with sha256(str(value))[:8]
    truncated mod NUM_BUCKETS=64, giving a stable int in [0, 64).
  - MISSING_LABEL_BUCKET=-1: explicit sentinel for missing/empty/whitespace
    labels, outside the valid bucket range so it can't collide.
  - Bucketing strategy chosen for v1 because:
      * Stable: same value always → same bucket. Central backend can group
        identical labels without seeing them.
      * Privacy-preserving: raw value not derivable from bucket.
      * One-pass: no need to scan twice for min/max — works with chunked
        CSV reads.
      * Lossy on ordinality: a feature for privacy; analytic insights stay
        on-prem anyway.
  - Equal-width / quantile bucketing is a v1.1 extension if customers ask;
    schema can grow `label.policy: equal_width` without breaking existing
    `passthrough` / `bucket` consumers.

- tracebloc_ingestor/ingestors/base.py
  - Add `label_policy: str = PASSTHROUGH` parameter to BaseIngestor.__init__
    (keyword-only at the end so the dozen positional kwargs above don't
    shift).
  - Apply policy in _map_unique_id at the latest moment before the API
    payload is built, so failure modes (missing label_column, validation
    failures) short-circuit before bucketing happens.

- tracebloc_ingestor/ingestors/csv_ingestor.py
- tracebloc_ingestor/ingestors/json_ingestor.py
  - Thread `label_policy` keyword through both subclasses' __init__ and
    super().__init__ calls so YAML-driven and template-driven runs alike
    can opt in.

- tracebloc_ingestor/cli/run.py
  - Pass `resolved.label_policy` into the ingestor's common_kwargs. Bucket
    fires automatically for any regression-class YAML config (the schema
    already requires `label.policy` for those categories per commit 1).

- tests/test_label_policy.py — 14 tests:
  * Pure: PASSTHROUGH no-op, BUCKET stable + in-range + handles missing.
  * BaseIngestor wiring: _map_unique_id mutates label only under BUCKET.
  * Entrypoint integration: tabular_regression.yaml flows through with
    label_policy=BUCKET; image_classification.yaml flows through with
    label_policy=PASSTHROUGH.

112/112 tests passing across schema + conventions + entrypoint + label-policy.

Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
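
A minimal sketch of the hash-bucket policy described above. It assumes that `sha256(str(value))[:8]` means the first 8 hex digits of the digest interpreted as an integer; the commit message does not spell that out, so treat the exact derivation as an assumption.

```python
import hashlib

PASSTHROUGH = "passthrough"
BUCKET = "bucket"
NUM_BUCKETS = 64
MISSING_LABEL_BUCKET = -1  # outside [0, 64), so it cannot collide with a real bucket

def apply(value, policy):
    """Return the value to send to the backend under the given label policy."""
    if policy == PASSTHROUGH:
        return value  # classification labels are sent unchanged
    if policy != BUCKET:
        raise ValueError(f"unknown label policy: {policy!r}")
    if value is None or str(value).strip() == "":
        return MISSING_LABEL_BUCKET
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    # Assumed reading of "[:8] mod NUM_BUCKETS": 8 hex digits -> int -> mod 64.
    return int(digest[:8], 16) % NUM_BUCKETS
```

Because the bucket depends only on the value, the same label always maps to the same integer, and no min/max pass over the data is needed.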

* feat(#44): equivalence harness — YAML path matches every existing template

Closes the #44 acceptance criterion that says: *"For each of the seven
existing templates plus segmentation, an equivalent examples/yaml/<name>.yaml
exists and produces the same end state (same MySQL rows, same backend
POSTs)."*

We don't run real DB / API I/O. Instead we capture the kwargs each path
hands to the ingestor and assert they match. Same kwargs through the same
framework code → same end state.

What's in:

- tests/test_template_equivalence.py
  - One parametrized case per existing template directory (9 total:
    image_classification, object_detection, keypoint_detection,
    semantic_segmentation, text_classification, tabular_classification,
    tabular_regression, time_series_forecasting, time_to_event_prediction).
  - Each case asserts every consequential ingestor kwarg matches what the
    template's CSVIngestor() call sets: category, data_format, intent,
    label_column, label_policy, unique_id_column, annotation_column,
    file_options.
  - test_every_existing_template_has_a_case: enumerates templates/ and
    fails if a category dir is missing from the harness — drift guard.
  - test_yaml_config_reaches_ingestor_via_entrypoint: same parametrized
    cases, but routed through cli.run.main with mocked DB / API. Ensures
    the resolved config actually lands on the ingestor constructor in
    the entrypoint flow, not just in resolve() output.

Two intentional, documented divergences from the templates:

1. unique_id_column defaults to None (UUID, no PII leakage). Existing
   keypoint_detection / semantic_segmentation templates set
   unique_id_column="filename"; the harness configs for those two
   categories opt back in via data_id: {strategy: column, column: filename}
   to preserve template-equivalent end-state.
2. label_policy=bucket required by schema for regression-class
   categories. The existing templates send raw target values through;
   YAML path applies hash-bucket. Per parent client#85 — raw targets
   shouldn't leak. The harness asserts label_policy="bucket" for those
   three categories rather than asserting payload equivalence on the
   numeric label.

One non-functional difference (tolerated): tabular templates set
file_options={"number_of_columns": len(schema)}. number_of_columns is
dead code (no consumer in the package; grep confirms). YAML path omits it.

Plus per-category tuning needed to make equivalence work:

- tracebloc_ingestor/cli/conventions.py
  - Replace single DEFAULT_IMAGE_FILE_OPTIONS with
    DEFAULT_IMAGE_FILE_OPTIONS_BY_CATEGORY: 512×512 for classification +
    segmentation, 448×448 for object_detection + keypoint_detection
    (matches what each template explicitly sets).
  - Bridge resolved.time_column → file_options["time_column"] for
    time_to_event_prediction so TimeToEventValidator gets it without the
    customer having to repeat the value in spec.file_options.

- tests/test_conventions.py: updated to test per-category target_size
  defaults; added tests pinning the 448×448 vs 512×512 split.

133/133 tests passing across schema + conventions + entrypoint +
label-policy + equivalence. #44 acceptance criteria fully met.

Refs #44

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
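
The kwargs-capture idea behind the harness can be sketched with `unittest.mock`. The two builder functions and their arguments are hypothetical stand-ins for a template script and the YAML path; only the comparison technique is the point.

```python
from unittest import mock

def build_from_template(ingestor_cls):
    # Stand-in for what a template script passes to CSVIngestor(...).
    return ingestor_cls(
        category="image_classification",
        label_column="label",
        label_policy="passthrough",
    )

def build_from_yaml(ingestor_cls, config):
    # Stand-in for the entrypoint path: config dict -> ingestor kwargs.
    return ingestor_cls(
        category=config["category"],
        label_column=config["label"],
        label_policy="passthrough",
    )

def captured_kwargs(builder, *args):
    """Run a builder against a fake class and return the kwargs it received."""
    fake_cls = mock.MagicMock(name="CSVIngestor")
    builder(fake_cls, *args)
    return fake_cls.call_args.kwargs

template_kwargs = captured_kwargs(build_from_template)
yaml_kwargs = captured_kwargs(
    build_from_yaml, {"category": "image_classification", "label": "label"}
)
```

Comparing `template_kwargs` with `yaml_kwargs` avoids real DB or API traffic while still pinning the inputs that determine the end state.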

* fix(#44): JSONIngestor super() call collides with label_policy

`super().__init__()` passed `log_level` and `validators` as positional
args at slots 12-13, which now map to BaseIngestor's `file_options` and
`label_policy` parameters. Combined with the explicit `label_policy=`
kwarg, every JSONIngestor instantiation raised `TypeError: got multiple
values for argument 'label_policy'`. The crash was invisible to tests
because all tests mock JSONIngestor.

Drop the bogus positional args from the super call (BaseIngestor no
longer accepts them), stash `validators` on the instance for callers
that pass it through, and guard `logger.setLevel` against a None level.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
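
The collision is easy to reproduce with a reduced signature. The classes below are hypothetical miniatures, not the real `BaseIngestor` or `JSONIngestor`; they only show how stale positional arguments slide into newly added parameters.

```python
class Base:
    def __init__(self, category, file_options=None, label_policy="passthrough"):
        self.category = category
        self.file_options = file_options or {}
        self.label_policy = label_policy

class BrokenChild(Base):
    def __init__(self, category, log_level=None, validators=None, label_policy="passthrough"):
        # log_level and validators land in the file_options and label_policy
        # slots, so label_policy ends up being supplied twice.
        super().__init__(category, log_level, validators, label_policy=label_policy)

class FixedChild(Base):
    def __init__(self, category, log_level=None, label_policy="passthrough"):
        # Only pass what Base accepts, by keyword.
        super().__init__(category, label_policy=label_policy)
        self.log_level = log_level

try:
    BrokenChild("tabular_regression")
    collision_message = None
except TypeError as exc:
    collision_message = str(exc)
```

Passing by keyword in the `super().__init__` call is what keeps later signature changes in the base class from silently re-binding arguments.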

* fix(#44): require data_id.strategy so column-only configs are rejected

The schema's `if: properties.strategy.const == "column"` was vacuously
true when `strategy` was absent (draft-07 `properties` matches on
absent keys), so `data_id: { column: filename }` passed validation.
The resolver in `cli/conventions.py` then checked
`data_id.get("strategy") == "column"`, returned False, and silently
fell back to UUID generation — dropping the customer's explicit column
selection with no error.

Make `strategy` required at the `data_id` object level (and inside the
`if` clause for completeness) so a missing strategy is now a hard
validation failure pointing at `data_id.strategy`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
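
The vacuous match can be seen with a tiny hand-rolled evaluator that covers only `required` and `properties` with `const`. It is an illustration of the draft-07 semantics described above, not a replacement for the jsonschema library, and `matches` is a hypothetical name.

```python
def matches(instance, schema):
    """Evaluate a very small subset of JSON Schema draft-07 against a dict."""
    for key in schema.get("required", []):
        if key not in instance:
            return False
    for key, subschema in schema.get("properties", {}).items():
        # `properties` only constrains keys that are present.
        if key in instance and "const" in subschema:
            if instance[key] != subschema["const"]:
                return False
    return True

column_only = {"column": "filename"}

# Before the fix: the `if` matches even though `strategy` is absent.
loose_if = {"properties": {"strategy": {"const": "column"}}}

# After the fix: `strategy` must be present for the `if` to match.
strict_if = {"properties": {"strategy": {"const": "column"}}, "required": ["strategy"]}
```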

* fix(#44): patch file_transfer.config in place after env-var bridge

`file_transfer.py` holds a module-level `config = Config()` captured at
import time — long before the entrypoint runs `_set_legacy_env_vars`.
And `Config` is a dataclass whose `os.getenv` defaults are evaluated
once at class-definition time, not per-instance. So setting `SRC_PATH`
/ `TABLE_NAME` / `LABEL_FILE` in os.environ from the entrypoint never
reached the already-constructed `file_transfer.config`: image / text /
segmentation transfers used stale defaults and wrote to the wrong
destination.

Keep the env-var sets (for any code that reads them lazily) and add a
direct in-place patch of `file_transfer.config.{SRC_PATH, LABEL_FILE,
TABLE_NAME, DEST_PATH}`. Refactoring `file_transfer.py` to take paths
as parameters is still the right long-term move and stays out of scope
per the PR description.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
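
The class-definition-time evaluation is standard Python behavior and can be shown in a few lines. The `DEMO_SRC_PATH` variable and this miniature `Config` are hypothetical; the real class has more fields.

```python
import os
from dataclasses import dataclass

os.environ["DEMO_SRC_PATH"] = "/stale"

@dataclass
class Config:
    # os.getenv runs once, while the class body executes, not per instance.
    SRC_PATH: str = os.getenv("DEMO_SRC_PATH", "/default")

config = Config()  # stands in for the module-level instance in file_transfer.py

# Later, the entrypoint updates the environment...
os.environ["DEMO_SRC_PATH"] = "/fresh"

# ...but even a brand-new instance still sees the value captured earlier.
late = Config()

# The fix described above: patch the existing instance in place.
config.SRC_PATH = os.environ["DEMO_SRC_PATH"]
```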

* fix(#44): warn on spec.validators / spec.sidecars; drop dead JSONIngestor params

Two related issues from bugbot:

1. spec.validators and spec.sidecars were schema-accepted but never
   extracted by resolve() nor warned about at runtime — a customer
   writing either got no indication their override was inert. Mirror
   the spec.processors pattern: capture them in ResolvedConfig and
   emit a one-line warning in the entrypoint when non-empty.

2. JSONIngestor's `validators` kwarg (and the matching `validators=None`
   at the call site) was dead code. BaseIngestor builds validators via
   map_validators(category, file_options), so the stored attribute was
   never read. Removed the parameter, the unused BaseValidator import,
   and the now-redundant call-site kwarg.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#44): propagate file_options to JSONIngestor

`_build_ingestor` constructed `JSONIngestor` without passing
`resolved.file_options`, so any category-specific knobs the resolver
had bridged into `file_options` were silently dropped on the JSON path.
The acute case: `time_to_event_prediction` with a JSON source — step
7a of `resolve()` injects `time_column` into `file_options`, which
`BaseIngestor.validate_data` then feeds into
`map_validators(category, file_options)`. Without it, `TimeToEventValidator`
falls back to the default column, ignoring the customer's `time_column`.

Added the missing `file_options` param to JSONIngestor's signature
(plumbed through to `super().__init__` so `BaseIngestor.file_options` is
populated) and now passing `resolved.file_options` at the call site.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#44): NaN labels and time_column override precedence

Two issues, one caught by bugbot and one caught by a proactive
review pass.

1. Bucket policy treated float('nan') as a regular value: pandas renders
   missing numeric cells as NaN, str(nan) is "nan" (non-empty), so NaN
   labels bypassed the missing-sentinel branch and got hashed into a
   bucket. Added an isnan check before stringification so NaN, None,
   empty, and whitespace all collapse to MISSING_LABEL_BUCKET as
   documented.

2. time_column precedence: when both the top-level time_column shorthand
   and spec.file_options.time_column were set, the top-level value
   overwrote the explicit spec override — opposite of how every other
   spec.file_options key behaves. Switched to setdefault so the spec
   value (already merged in step 7) wins, matching the rest of the
   resolver's "spec is the advanced override" model.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
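
Both fixes are small enough to sketch. `is_missing` and `bridge_time_column` are hypothetical helper names; the actual changes live inside the label-policy module and `resolve()`.

```python
import math

def is_missing(value):
    """True for None, NaN, empty, and whitespace-only labels."""
    if value is None:
        return True
    # Check NaN before stringifying: str(float("nan")) is the non-empty "nan".
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""

def bridge_time_column(file_options, time_column):
    """Copy the top-level shorthand in only when spec did not set the key."""
    merged = dict(file_options)
    if time_column is not None:
        merged.setdefault("time_column", time_column)
    return merged
```

With `setdefault`, an explicit `spec.file_options.time_column` survives, and the top-level shorthand only fills the gap when the spec value is absent.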

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ingestor): strip label columns from schema metadata sent to backend (#85)

* fix(ingestor): strip label/annotation/unique_id columns from schema metadata

The schema in file_options is forwarded to the backend as meta_data via
send_global_meta_meta. If a template (e.g. time_to_event_prediction)
passes the raw schema in file_options, or if the base injects it, the
label/annotation/unique_id columns leak through even though the DB
table itself excludes them. Always sanitize file_options["schema"]
using the same cleaned table_schema used to create the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ingestor): recompute number_of_columns after schema cleanup

Templates set file_options["number_of_columns"] = len(schema) before
the base ingestor strips label/annotation/unique_id columns. Keep the
count consistent with the sanitized file_options["schema"] so any
future consumer sees matching values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
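
A minimal sketch of the two fixes above taken together, assuming the schema is a column-name to type mapping. `sanitize_file_options` is a hypothetical name; the real logic sits in the base ingestor.

```python
def sanitize_file_options(file_options, label_column=None,
                          annotation_column=None, unique_id_column=None):
    """Return a copy whose schema omits columns that must not reach the backend."""
    hidden = {col for col in (label_column, annotation_column, unique_id_column) if col}
    cleaned = dict(file_options)
    schema = cleaned.get("schema")
    if isinstance(schema, dict):  # assumed shape: {column_name: column_type}
        cleaned["schema"] = {name: typ for name, typ in schema.items() if name not in hidden}
        if "number_of_columns" in cleaned:
            # Keep the count consistent with the sanitized schema.
            cleaned["number_of_columns"] = len(cleaned["schema"])
    return cleaned
```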

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#45): release official ingestor image to GHCR with cosign signing and SBOM (#77)

* feat(#45): release image to GHCR with cosign signing and SBOM

Replaces the per-customer Python script + Dockerfile pattern with one
official image, semver-tagged, signed, and consumed by digest. Depends
on the declarative entrypoint from #44 (`tracebloc-ingest` console
script registered in setup.py).

.github/workflows/release-image.yml
  Triggered on v*.*.* tags (or workflow_dispatch). Builds for
  ghcr.io/tracebloc/ingestor with semver tags :X.Y.Z / :X.Y / :X
  (deliberately no :latest per acceptance criteria). Buildkit produces
  SLSA provenance and SBOM as OCI attestations. Cosign keyless-signs
  the digest via GitHub Actions OIDC. The release notes are updated
  with the digest + verification command for downstream pinning.

Dockerfile
  Multi-stage. Builder stage produces a wheel from this repo's source
  (no `pip install tracebloc_ingestor` from PyPI — the image must
  ship the exact code being released). Runtime stage installs the
  wheel + requirements.txt deps onto a python:3.11-slim base. Drops
  the `COPY ingestor.py` pattern (the customer no longer brings a
  script). Runs as nobody (non-root).

docker-entrypoint.sh
  Keeps the MySQL wait but bounds it (MYSQL_WAIT_SECONDS, default
  120s) so a misconfigured client surfaces a clear failure instead
  of hanging the Job. `exec tracebloc-ingest "$@"` at the end so the
  Python process becomes PID 1 and Kubernetes signal handling reaches
  the application directly.

.dockerignore
  Un-exclude Readme.md (setup.py reads it for long_description).

Verified locally: `docker build .` succeeds, image runs as nobody,
and the entrypoint correctly fails fast with "INGEST_CONFIG env var
not set" when run without a config — confirming #44's cli/run.py is
wired through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(release-image): use inputs.ref for tags on workflow_dispatch

Bugbot caught a real bug: docker/metadata-action defaults to reading
github.ref, but on a workflow_dispatch event github.ref points at the
default branch, not the tag the user typed into inputs.ref. The semver
patterns produced zero tags, build-push-action ran without pushing
anything, and the cosign loop signed nothing. The workflow exited 0
despite no image being published — the worst failure mode for a
release pipeline.

Two changes:

1. Pin each type=semver pattern with `value=${{ inputs.ref || github.ref_name }}`
   so both trigger paths feed the same source into the semver parser.
   On push-tag: github.ref_name is the tag. On workflow_dispatch:
   inputs.ref wins. Either way meta produces the three expected tags.

2. Add a "Verify tags were produced" step between meta and build that
   fails loudly if the tag list is empty. Defense-in-depth against the
   same class of bug returning: if anyone refactors the value source
   later and breaks it, the build aborts before silently pushing
   nothing. The checkout step's `inputs.ref || github.ref` was already
   correct, so the build context is fine — only the metadata extraction
   needed fixing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
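
The tag fan-out and the "fail loudly on an empty tag list" guard can be pictured in plain Python. This is a sketch of the intended outcome only, not how docker/metadata-action is implemented; `semver_tags` and `require_tags` are hypothetical names.

```python
import re

_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")

def semver_tags(ref):
    """Return [X.Y.Z, X.Y, X] for a vX.Y.Z ref, or [] when the ref is not a semver tag."""
    match = _SEMVER.match(ref)
    if not match:
        return []
    major, minor, patch = match.groups()
    return [f"{major}.{minor}.{patch}", f"{major}.{minor}", major]

def require_tags(ref):
    """Abort instead of continuing with nothing to push."""
    tags = semver_tags(ref)
    if not tags:
        raise ValueError(f"no image tags derived from ref {ref!r}")
    return tags
```

A branch ref such as the one seen on `workflow_dispatch` yields an empty list here, which is the condition the new verification step turns into a hard failure.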

* fix(release-image): move inputs.ref out of inline run-script interpolation

Bugbot caught a regression I introduced in ae711ef: the "Verify tags
were produced" step embedded ${{ inputs.ref || github.ref_name }}
directly inside a double-quoted echo, which is a shell-injection
vector on workflow_dispatch — a collaborator could craft an inputs.ref
value like `$(malicious)` and the substitution would execute before
bash even sees the line.

Move REF into the env: block alongside TAGS so it's passed via the
environment and read with `${REF}` at runtime — the shell never sees
the user-supplied string in a position where command substitution
applies. Same pattern the release-notes step already uses for
TAG / IMAGE / DIGEST / TAGS.

Audited the rest of the workflow: all other `${{ inputs.ref … }}`
references are action parameters (actions/checkout's ref, docker/
metadata-action's value=) or env-block values, none in run: scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: add masked language modeling ingestor template

- Add MASKED_LANGUAGE_MODELING to TaskCategory constants
- Add MLM validator mapping (FileTypeValidator for sequences/, TableName,
  Duplicate, optional DataValidator)
- Create templates/masked_language_modeling/ with ingestion script, README,
  and 5 sample sequence files from PrimeKG-style random walks
- Add YAML example for CLI-based ingestion
- MLM is self-supervised: no label_column needed, CSV manifest only has
  filename + extension columns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add TokenizerValidator for MLM special token validation

Validates that tokenizer.json exists at the data path and contains
required [MASK] and [PAD] tokens before ingestion, preventing silent
embedding out-of-bounds errors during training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add masked_language_modeling to schema and conventions layer

Adds "masked_language_modeling" to the category enum in ingest.v1.json,
defines the "sequences" sidecar property, and wires MLM into conventions.py
so _data_format_for() and resolve() handle the new category without
raising ValueError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Handle Unigram tokenizer vocab format in TokenizerValidator

Unigram tokenizers store vocab as [[token, score], ...] lists, not dicts.
Without this, special tokens in the main vocab would be missed and
ingestion incorrectly blocked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
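
Handling both vocab shapes can be sketched as below. The `model.vocab` and `added_tokens[].content` keys follow the usual `tokenizer.json` layout, but treating them as the validator's exact lookup path is an assumption, and both function names are hypothetical.

```python
def vocab_tokens(tokenizer):
    """Collect token strings from a parsed tokenizer.json dict."""
    vocab = tokenizer.get("model", {}).get("vocab", {})
    if isinstance(vocab, dict):
        # Dict form: {token: id}
        tokens = set(vocab)
    else:
        # Unigram form: [[token, score], ...]
        tokens = {entry[0] for entry in vocab}
    # Assumed: special tokens may also be declared under added_tokens.
    tokens.update(item["content"] for item in tokenizer.get("added_tokens", []))
    return tokens

def missing_special_tokens(tokenizer, required=("[MASK]", "[PAD]")):
    """Return the required special tokens that are absent."""
    present = vocab_tokens(tokenizer)
    return [token for token in required if token not in present]
```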

* Merge pull request #89 from tracebloc/chore/align-python-version

chore: lift python_requires to 3.11

* Fix MLM ingestor skipping file transfer to cluster storage

The masked_language_modeling category was missing from the file
transfer routing in both base.py and file_transfer.py. This meant
sequence text files were never copied from SRC_PATH/sequences/ to
DEST_PATH/{TABLE_NAME}/, leaving the database with records that
have no backing files on the shared volume.

- Add MASKED_LANGUAGE_MODELING to the category list that triggers
  file transfer in BaseIngestor.ingest()
- Add MASKED_LANGUAGE_MODELING case to map_file_transfer() routing,
  calling text_transfer() with src_subdir="sequences"
- Make text_transfer() accept a configurable src_subdir parameter
  (defaults to "texts" for backwards compatibility)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(csv_ingestor): stop clobbering cleaned schema in file_options (#93) (#94)

BaseIngestor.__init__ populates self.file_options with the cleaned
table_schema (label_column / annotation_column / unique_id_column
stripped) so map_validators and send_global_meta_meta see a sanitized
schema before any data leaves the cluster.

CSVIngestor.__init__ then ran `self.file_options = file_options or {}`
after super().__init__, which silently undid that work. On the new YAML
path with tabular and time-series categories the caller passes
file_options={}, so `file_options or {}` evaluated to a fresh empty
dict, dropping the cleaned schema entirely:

  - map_validators received empty options → DataValidator,
    TimeFormatValidator, NumericColumnsValidator setup was skipped.
  - send_global_meta_meta sent empty metadata to the backend.

The reassignment was always redundant (base already initialises
self.file_options); deleting it lets the cleaned schema survive on the
YAML path. Pre-YAML templates were unaffected because file_options was
truthy.
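
A minimal reproduction of the clobbering, as a sketch only. The class names are hypothetical, and the base-class behaviour (reuse the caller's dict when it is truthy, otherwise start a fresh one) is an assumption chosen to match the description above.

```python
CLEANED_SCHEMA = {"feature_a": "FLOAT"}  # stand-in for the cleaned table_schema


class BaseSketch:
    def __init__(self, file_options=None):
        # Assumption: reuse the caller's dict when truthy, else a fresh dict,
        # then store the cleaned schema in it.
        self.file_options = file_options if file_options else {}
        self.file_options["schema"] = dict(CLEANED_SCHEMA)


class ClobberingCSV(BaseSketch):
    def __init__(self, file_options=None):
        super().__init__(file_options)
        # The removed line: {} is falsy, so this yields a brand-new empty dict.
        self.file_options = file_options or {}


class FixedCSV(BaseSketch):
    def __init__(self, file_options=None):
        super().__init__(file_options)  # base already initialised file_options
```

With `file_options={}` the clobbering variant ends up with an empty dict, while a truthy dict is the same object the base mutated, which is why the pre-YAML templates never showed the problem.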

Caught by Cursor Bugbot on #92. Closes #93.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dockerfile): pin numeric USER 65534 for K8s runAsNonRoot admission (#92)

Kubernetes admission rejects Pods with securityContext.runAsNonRoot: true
when the image USER is a non-numeric string — the kubelet cannot verify
"nobody" is not root at admission time and the Pod ends up in
CreateContainerConfigError:

    container has runAsNonRoot and image has non-numeric user (nobody),
    cannot verify user is non-root

The client-runtime side already works around this by setting
run_as_user=65534 explicitly on the Pod's securityContext
(submit_ingestion_run.py build_job_spec), but any other consumer who
pulls ghcr.io/tracebloc/ingestor and runs it under their own Pod spec
with runAsNonRoot: true will hit the same admission failure. Pinning a
numeric UID in the image is the right defense-in-depth.

65534 is the `nobody` UID on Debian (the python:3.11-slim base),
Ubuntu, and Alpine.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(#97): make Config lazy and drop laptop-path defaults (#98)

* refactor(#97): make Config lazy and drop laptop-path defaults

Validators (file_validator, image_validator, ...) plus file_transfer,
database, csv_ingestor all instantiate `config = Config()` at module
top-level. When Config was a @dataclass with `os.getenv` defaults, those
fields were frozen at class-definition time — long before the
declarative entrypoint (cli/run.py:main) set SRC_PATH / TABLE_NAME /
LABEL_FILE from the resolved ingest.yaml. On a customer cluster on
2026-05-19, FileTypeValidator failed with "Path does not exist" against
the laptop-default `~/Downloads/data-ingestors/.../crowd_monitoring/...`
because that's the value the dataclass captured at import.

The previous workaround in _set_legacy_env_vars patched only
file_transfer.config in place — the 14 validator modules each held
their own stale snapshot, so the workaround never reached them.

Fix: Config is now a plain class whose env-driven fields are @property,
reading os.environ on access. Module-level `config = Config()` is
harmless; properties capture nothing at instantiation. Tests that build
pinpoint configs via `Config(BACKEND_TOKEN=..., CLIENT_USERNAME=...)`
keep working: __init__ accepts overrides as kwargs and they win over
env (sentinel lookup distinguishes absent from explicit None).
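
The difference between the two designs can be sketched as follows. This is an illustration under assumptions, not the actual Config: the class names `EagerConfig` / `LazyConfig` and the helper `_get` are hypothetical, and only two fields are shown.

```python
import os
from dataclasses import dataclass

_MISSING = object()  # sentinel: distinguishes "kwarg absent" from explicit None


@dataclass
class EagerConfig:
    # Evaluated once, when the class body runs at import time.
    SRC_PATH: str = os.getenv("SRC_PATH", "")


class LazyConfig:
    def __init__(self, **overrides):
        self._overrides = overrides

    def _get(self, name, default=""):
        value = self._overrides.get(name, _MISSING)
        if value is not _MISSING:
            return value  # explicit override wins, even if it is None
        return os.environ.get(name, default)  # read the env on every access

    @property
    def SRC_PATH(self):
        return self._get("SRC_PATH")

    @property
    def BACKEND_TOKEN(self):
        return self._get("BACKEND_TOKEN", None)
```

An instance created at module import keeps no snapshot, so an environment variable set later by the entrypoint is still visible on the next property access.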

_set_legacy_env_vars reduces to env-var writes only; the in-place
patches go away.

Hardening: drop the developer-laptop defaults entirely. SRC_PATH /
LABEL_FILE / TABLE_NAME default to empty string and TITLE to None — a
misconfigured pod fails loudly in path operations instead of silently
scanning a developer dir.

Tests added in tests/test_config_lazy.py lock in:
  - env set before Config() flows through
  - env mutated *after* Config() flows through (the regression)
  - module-level validator configs observe later env mutations
  - per-instance overrides win over env, including explicit None
  - laptop-path defaults are gone

Existing tests/test_cli_run.py:test_file_transfer_config_patched_in_place
is rewritten as test_file_transfer_config_reads_env_lazily — the new
contract is env-set → property reads through, no in-place mutation.

Version bumps 0.2.10 → 0.3.0 (internal Config rewrite, surface API
preserved).

Refs: #97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(config): reject None for numeric overrides at construction

`Config(DB_PORT=None)` / `Config(BATCH_SIZE=None)` previously deferred a
`TypeError: int() argument must be a string ...` to first property
access, with no hint about which field was at fault. Both properties
unconditionally call `int(raw)` on the override.

The `_MISSING` sentinel design treats `None` as a valid suppression
value for nullable fields (BACKEND_TOKEN, CLIENT_USERNAME, ...) — but
for numeric fields it makes no sense. Reject at construction with a
message that names the field and points the caller to omit the kwarg.

Regression test in `test_config_lazy.py` covers both fields and locks
in that valid ints / str-ints continue to work.
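
A sketch of the construction-time check, under stated assumptions: the class name `ConfigSketch`, the helper `_int_field`, the `ValueError` exception type, the message wording, and the placeholder default `"0"` are all illustrative and not taken from the real Config.

```python
import os

_MISSING = object()
_NUMERIC_FIELDS = ("DB_PORT", "BATCH_SIZE")


class ConfigSketch:
    def __init__(self, **overrides):
        for name in _NUMERIC_FIELDS:
            # None is only meaningful for nullable fields; reject it here so
            # the error names the field instead of surfacing later in int().
            if overrides.get(name, _MISSING) is None:
                raise ValueError(
                    f"{name}=None is not valid for a numeric field; "
                    f"omit the {name} kwarg to fall back to the environment."
                )
        self._overrides = overrides

    def _int_field(self, name, fallback):
        raw = self._overrides.get(name, _MISSING)
        if raw is _MISSING:
            raw = os.environ.get(name, fallback)
        return int(raw)

    @property
    def DB_PORT(self):
        return self._int_field("DB_PORT", "0")  # "0" is a placeholder default
```

Ints and numeric strings still pass through `int()` unchanged; only an explicit `None` is refused.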

Addresses bugbot finding on #98.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#99): surface file_transfer failures in ingestion summary (#100)

* fix(#99): surface file_transfer failures in ingestion summary

Single-file transfer functions (image_transfer / annotation_transfer /
text_transfer) used to return the record unchanged when the source file
was missing, so map_file_transfer returned a truthy value and the DB +
API write paths happily completed. The summary's counters tracked DB +
API outcomes only, so a run where every source file was missing still
printed "🎉 Ingestion completed successfully!" at 100% — masking silent
data loss on the destination volume.

  * Return None on missing source / missing filename so the existing
    skip path in BaseIngestor.ingest picks them up.
  * Add IngestionSummary.file_transfer_failures (tracked separately
    from skipped_records so operators can tell data-loss skips from
    validation skips) and IngestionSummary.has_failures.
  * Rewrite the banner: never celebrate when any non-trivial failure
    occurred; print "completed with N failures, see logs" instead.
  * Append file-transfer failures to the failed_records list returned
    by ingest() so cli.run.main exits non-zero and the K8s job marker
    reflects the failure.
  * Tests: 13 new cases covering the transfer functions, has_failures,
    banner gating, and the end-to-end counter wiring.
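
The return-None contract and the separate failure counter can be sketched like this. It is a simplified illustration: `transfer_record` and `ingest_sketch` are hypothetical names, the `filename` record key is an assumption, and the real transfer functions take different arguments.

```python
import shutil
from pathlib import Path
from typing import Optional


def transfer_record(record: dict, src_dir: Path, dest_dir: Path) -> Optional[dict]:
    """Copy the record's file; return None when there is nothing to copy."""
    filename = record.get("filename")
    if not filename:
        return None  # missing filename
    src = src_dir / filename
    if not src.is_file():
        return None  # missing source file
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest_dir / filename)
    return record


def ingest_sketch(records, src_dir: Path, dest_dir: Path):
    """Skip records whose transfer failed and count them separately."""
    kept, file_transfer_failures = [], 0
    for record in records:
        if transfer_record(record, src_dir, dest_dir) is None:
            file_transfer_failures += 1
            continue
        kept.append(record)
    return kept, file_transfer_failures
```

Because a failed transfer now yields `None` rather than the unchanged record, the caller can no longer mistake it for success.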

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#99): address bugbot review

- Banner total no longer double-counts file_transfer failures. The
  three channels (file_transfer, DB, API-only) are mutually exclusive
  per record, so we sum them directly instead of using
  ``total_records - api_sent_records`` which conflates all dropped
  records into the API bucket.
- ``Failed to Send to API`` summary line now reads
  ``inserted_records - api_sent_records`` for the same reason — a
  576-record file-transfer disaster used to print "Failed to Send to
  API: 576" alongside "File Transfer Failures: 576", double-counting
  the same set.
- The file-transfer skip branch now calls ``pbar.update(1)`` before
  ``continue`` so an all-transfer-failure run advances tqdm instead
  of leaving it stuck at 0/N.
- Regression tests for both: 576-failure run reports a single 576
  banner total (not 1152) and ticks the progress bar 5 times in the
  end-to-end test.
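
The corrected arithmetic, as a sketch. `SummarySketch` is a hypothetical stand-in for IngestionSummary; `db_failures`, `api_failures`, `total_failures` and `banner` are illustrative names, and the success text omits the emoji for brevity.

```python
from dataclasses import dataclass


@dataclass
class SummarySketch:
    total_records: int = 0
    inserted_records: int = 0
    api_sent_records: int = 0
    file_transfer_failures: int = 0
    db_failures: int = 0  # hypothetical field name

    @property
    def api_failures(self) -> int:
        # Only records that reached the DB can fail at the API step.
        return self.inserted_records - self.api_sent_records

    @property
    def total_failures(self) -> int:
        # The three channels are mutually exclusive per record, so sum them.
        return self.file_transfer_failures + self.db_failures + self.api_failures

    @property
    def has_failures(self) -> bool:
        return self.total_failures > 0

    def banner(self) -> str:
        if self.has_failures:
            return f"completed with {self.total_failures} failures, see logs"
        return "Ingestion completed successfully!"
```

For a run where all 576 records fail file transfer, the old `total_records - api_sent_records` term adds another 576 on top of the transfer failures, which is where the 1152 figure came from.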

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#101): allow rerun when destination dir is empty leftover (#102)

A previous ingestion that aborted during file_transfer can leave an
empty /data/shared/<table_name>/ directory on the shared PVC. The next
attempt — same ingest.yaml, fixed image — then failed validation:

    ValueError: Duplicate Validator Validator failed:
    Destination directory '/data/shared/sample_cats_dogs_train' already exists

Customers had to kubectl exec into a PVC-mounted pod and rm -rf between
attempts, or rename the table for every retry. Neither was documented.

Treat an existing-but-empty destination as a leftover from an aborted
run: log a warning and let validation pass. A populated destination
still fails — that's the real "would clobber an existing dataset" case
the validator was written to catch.
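
A minimal sketch of the relaxed check. The function name `check_destination` and the warning text are assumptions; the real validator's interface is not shown in this message.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def check_destination(dest: Path) -> None:
    """Allow a missing or empty destination; refuse a populated one."""
    if not dest.exists():
        return
    if dest.is_dir() and not any(dest.iterdir()):
        # Empty directory: most likely left behind by an aborted run.
        logger.warning("Destination directory '%s' exists but is empty; continuing", dest)
        return
    raise ValueError(f"Destination directory '{dest}' already exists")
```

`any(dest.iterdir())` stops at the first entry, so the check stays cheap even on a large populated dataset directory.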

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix three real-cluster bugs found during 2026-05-19 validation (#106)

* docs: drop "active use case" from prereqs

A use case can only be created after data has been ingested, since
the platform needs the dataset's real schema and stats to define one.
The prior wording sent newcomers to the web app to create a use case
before they had a dataset, which is impossible.

Closes #64

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version to 0.2.11

* chore: scrub client credentials from ingestor-job.yaml

* fix(ingestor): strip label columns from schema metadata sent to backend (#85)

* fix(ingestor): strip label/annotation/unique_id columns from schema metadata

The schema in file_options is forwarded to the backend as meta_data via
send_global_meta_meta. If a template (e.g. time_to_event_prediction)
passes the raw schema in file_options, or if the base injects it, the
label/annotation/unique_id columns leak through even though the DB
table itself excludes them. Always sanitize file_options["schema"]
using the same cleaned table_schema used to create the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ingestor): recompute number_of_columns after schema cleanup

Templates set file_options["number_of_columns"] = len(schema) before
the base ingestor strips label/annotation/unique_id columns. Keep the
count consistent with the sanitized file_options["schema"] so any
future consumer sees matching values.
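
A sketch of the cleanup plus recount, assuming the schema is a column-name-to-type mapping. The function name `sanitize_file_options` and the example column names are hypothetical.

```python
def sanitize_file_options(file_options: dict, excluded_columns: set[str]) -> dict:
    """Drop excluded columns from the schema and keep the column count in sync."""
    schema = file_options.get("schema", {})
    cleaned = {col: typ for col, typ in schema.items() if col not in excluded_columns}
    # Recompute the count from the cleaned schema rather than trusting the
    # value a template computed before the cleanup.
    return {**file_options, "schema": cleaned, "number_of_columns": len(cleaned)}
```

Returning a new dict keeps the caller's original options untouched, which is a design choice of this sketch rather than a statement about the base ingestor.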

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(packaging): bundle ingest.v1.json in wheel and sdist

setup.py declared package_data for "tracebloc_ingestor.schema"
but the directory had no __init__.py, so find_packages() didn't
see it as a subpackage and the data-files declaration was a
silent no-op. Both the wheel (via package_data) and sdist (via
MANIFEST.in) were missing the schema, so the released image
crashed on first invocation of cli.run._load_schema with
FileNotFoundError on the site-packages schema path.

Add the marker __init__.py + a recursive-include line in
MANIFEST.in so both build paths bundle the JSON. Tests pin
both the subpackage import and the production _load_schema
call so a future regression in the wheel-build pipeline
surfaces here, not three weeks later in cluster validation.

Closes #103. Complements #95 (release-time smoke test that
would have caught this at build time).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(image_validator): normalize tuple/list in resolution comparison

PIL returns image.size as a tuple; YAML/JSON parses
``target_size: [H, W]`` as a list. Python's sequence equality
is type-strict — ``(256, 256) == [256, 256]`` is False — so the
tolerance==0 path in _resolution_matches rejected every image
whose dimensions matched exactly. The tolerance>0 branch
already worked by accident (it uses index access, which doesn't
care about sequence type).

Wrap both sides in tuple() before comparison. Tests pin every
tuple/list shape combination plus the tolerance>0 path that
was already working, so a future regression on either side
surfaces here.
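
The pitfall and the fix in isolation, as a sketch. `resolution_matches` is a simplified stand-in for `_resolution_matches`; its signature and the per-dimension absolute tolerance are assumptions.

```python
from typing import Sequence


def resolution_matches(
    actual: Sequence[int], expected: Sequence[int], tolerance: int = 0
) -> bool:
    """Compare two sizes regardless of whether they arrive as tuple or list."""
    actual, expected = tuple(actual), tuple(expected)
    if len(actual) != len(expected):
        return False
    if tolerance == 0:
        return actual == expected
    # Assumed semantics: each dimension may differ by at most `tolerance`.
    return all(abs(a - e) <= tolerance for a, e in zip(actual, expected))
```

Without the `tuple()` normalization, `(256, 256) == [256, 256]` evaluates to False and the exact-match branch rejects images whose dimensions are correct.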

Closes #104.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(file_transfer): normalize extension dot+case in _has_extension

FileExtension.get_all_extensions() returns values WITH the
leading dot (".jpeg" etc.) but str.split(".") returns the
suffix WITHOUT one. _has_extension therefore always returned
False, _find_src always appended the extension a second time,
and every transfer attempt looked for a non-existent
``cat1.jpeg.jpeg`` source. Combined with the pre-#100 silent-
failure summary, customers saw 100% success while no image
files actually landed on the PVC.

Use rsplit(".", 1) + a normalized ".<lowercase-ext>" comparison
so case-insensitive matching works too (consistent with
ImageResolutionValidator._is_image_file). Tests cover the
classes of input that mattered in cluster validation:
lowercase, uppercase, no-extension, unknown extension, and
multi-dot names.
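
A sketch of the normalized comparison. `KNOWN_EXTENSIONS` is a hypothetical stand-in for the values returned by FileExtension.get_all_extensions(), and `has_extension` is a simplified version of `_has_extension`.

```python
KNOWN_EXTENSIONS = {".jpeg", ".jpg", ".png"}  # stand-in values, with leading dot


def has_extension(filename: str, known: set[str] = KNOWN_EXTENSIONS) -> bool:
    """True when the filename already ends in one of the known extensions."""
    if "." not in filename:
        return False
    # rsplit keeps multi-dot names intact and yields only the final suffix.
    suffix = filename.rsplit(".", 1)[1]
    # Re-attach the dot and lowercase both sides before comparing.
    return f".{suffix.lower()}" in {ext.lower() for ext in known}
```

The old comparison tested a dot-less suffix such as `"jpeg"` against dotted values such as `".jpeg"`, so it could never match.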

Closes #105. Together with #100 closes the 2026-05-19 cluster-
validation incident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: shujaat <shujaat@tracebloc.io>
Co-authored-by: shujaat_tracebloc <153823837+shujaatTracebloc@users.noreply.github.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Divya <divyasingh@tracebloc.io>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Divya <divyasingh@tracebloc.io>
Co-authored-by: shujaat <shujaat@tracebloc.io>
Co-authored-by: shujaat_tracebloc <153823837+shujaatTracebloc@users.noreply.github.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>