Conversation
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a per-column unique constraint to dy.Column/dy.Schema, aligning validation and SQLAlchemy output, and updates array/object primary key behavior.
Changes:
- Introduce `unique` as a first-class column attribute and emit `unique=True` in SQLAlchemy column definitions.
- Add schema validation rules for `unique` columns (and tighten primary key validation via `is_unique()`).
- Expand tests for unique constraints and enable primary keys on `Array` columns while disallowing them for `Object` columns.
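The group-and-count idea behind such a per-column unique rule can be sketched in plain Python. This is illustrative only: `violates_unique` is a hypothetical helper, not dataframely's actual API.

```python
from collections import Counter

def violates_unique(values: list) -> list[int]:
    """Return the indices of rows that violate a per-column unique rule.

    A value passes only if it occurs exactly once in the column; every
    occurrence of a duplicated value is flagged (the complement of a
    per-row is_unique mask).
    """
    counts = Counter(values)
    return [i for i, v in enumerate(values) if counts[v] > 1]

# Both occurrences of the duplicated value are flagged.
print(violates_unique(["a@x.com", "b@x.com", "a@x.com"]))  # → [0, 2]
```

The same counting logic covers primary keys as a special case, since a primary key is simply a unique, non-nullable column (or column combination).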
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/schema/test_validate.py | Adds validation tests for unique columns and Schema.unique_columns(). |
| tests/schema/test_sample.py | Adds sampling tests ensuring generated data respects unique constraints. |
| tests/column_types/test_array.py | Adds coverage for primary keys on Array columns. |
| dataframely/columns/_base.py | Adds unique attribute to Column and passes it to SQLAlchemy columns. |
| dataframely/_base_schema.py | Adds unique rules into schema validation; uses is_unique() for primary keys. |
| dataframely/columns/array.py | Allows primary_key on arrays and threads through unique. |
| dataframely/columns/string.py | Threads unique through to Column. |
| dataframely/columns/integer.py | Threads unique through to Column. |
| dataframely/columns/float.py | Threads unique through to Column. |
| dataframely/columns/decimal.py | Threads unique through to Column. |
| dataframely/columns/datetime.py | Threads unique through to Column for date/time/datetime/timedelta. |
| dataframely/columns/enum.py | Threads unique through to Column. |
| dataframely/columns/categorical.py | Threads unique through to Column. |
| dataframely/columns/list.py | Threads unique through to Column. |
| dataframely/columns/struct.py | Threads unique through to Column. |
| dataframely/columns/object.py | Removes primary_key kwarg from Object column constructor. |
```python
self,
*,
nullable: bool = True,
primary_key: bool = False,
check: Check | None = None,
alias: str | None = None,
```
Removing the primary_key keyword argument from Object.__init__ is a breaking API change (existing callers using dy.Object(primary_key=...) will now error at call-time with an unexpected kwarg). If the goal is to disallow object primary keys, consider keeping the parameter and raising a clear ValueError when primary_key=True (optionally with a deprecation path), so callers get a more actionable error while minimizing breakage.
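The suggested alternative, keeping the parameter but failing loudly, might look like the following sketch. The class and parameter names mirror the snippet under review, but this is not dataframely's actual implementation.

```python
class Object:
    """Illustrative sketch of the suggested API, not dataframely's code."""

    def __init__(self, *, nullable=True, primary_key=False, check=None, alias=None):
        if primary_key:
            # Fail with an actionable message instead of letting callers hit
            # an opaque "unexpected keyword argument" TypeError at call-time.
            raise ValueError(
                "primary_key=True is not supported for Object columns; "
                "polars does not support uniqueness checks for object dtypes."
            )
        self.nullable = nullable
        self.check = check
        self.alias = alias

# Existing callers without primary_key keep working unchanged.
col = Object(nullable=False)
```

This keeps the signature consistent with the other column types while still making the unsupported combination impossible.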
Was there a reason to change this?
This is more of a drive-by change: I thought about NOT adding is_unique for object as it is not allowed by polars, and then I saw the primary_key argument here... It shouldn't be here, as primary_key for object dtype is not possible.
You're right, I was not aware of the polars behavior here. TLDR:
- You can call `df.unique()` on a dataframe with an object column and it will work as expected iff your objects are hashable.
- You cannot call `df.select(pl.col("mycol").is_unique())`, which errors saying that uniqueness checks are not supported for object columns.
In any case, passing primary_key=True to dy.Object leads to failing validation no matter what, so I think it's reasonable to remove it here.
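The hashability caveat above is easy to demonstrate with plain Python, since deduplicating arbitrary objects ultimately relies on `hash()`:

```python
# Hashable objects (e.g. tuples) deduplicate fine, which is why
# df.unique() can work on an object column of hashable values.
hashable = [(1, 2), (1, 2), (3, 4)]
print(len(set(hashable)))  # → 2

# Unhashable objects (e.g. lists) cannot even enter a set, so no
# general uniqueness check is possible for arbitrary objects.
unhashable = [[1, 2], [1, 2]]
try:
    set(unhashable)
except TypeError as exc:
    print(f"TypeError: {exc}")
```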
```python
self,
*,
nullable: bool = True,
primary_key: bool = False,
check: Check | None = None,
alias: str | None = None,
metadata: dict[str, Any] | None = None,
```
All other column types in this PR accept unique: bool = False, but Object now accepts neither primary_key nor unique, making the API inconsistent. If unique is unsupported for object dtype, it would be helpful to make that explicit (e.g., accept unique and raise a targeted ValueError when True, or document why it’s omitted). If it is supported, add unique: bool = False and pass it through to super().__init__(unique=unique, ...).
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main      #325  +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           56        56
  Lines         3399      3408    +9
=========================================
+ Hits          3399      3408    +9
```
Andreas Albert (AndreasAlbertQC)
left a comment
Thanks gab23r! This looks quite nice to me already. Small suggestions.
```python
self,
*,
nullable: bool = True,
primary_key: bool = False,
check: Check | None = None,
alias: str | None = None,
```
Was there a reason to change this?
```python
@pytest.mark.parametrize("n", [0, 100])
def test_sample_unique_constraint(n: int) -> None:
    df = UniqueSchema.sample(n, generator=Generator(seed=42))
    assert len(df) == n
    UniqueSchema.validate(df)
    # Verify uniqueness
    assert df.get_column("email").n_unique() == n
```
I think this test would be more meaningful if we used a smaller dtype here, e.g. UInt8. With a string, your chance of sampling two identical strings is very small either way, so it's unclear whether the unique constraint actually worked.
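The birthday-problem arithmetic backs this suggestion up. A small helper (hypothetical, stdlib only) estimates how likely a naive sampler without a unique constraint would be to collide for each dtype:

```python
import math

def collision_probability(n_samples: int, domain: int) -> float:
    """Probability that at least two of n_samples independent uniform
    draws from a domain of the given size coincide (birthday problem)."""
    if n_samples > domain:
        return 1.0  # pigeonhole: a collision is guaranteed
    p_all_distinct = math.prod((domain - i) / domain for i in range(n_samples))
    return 1.0 - p_all_distinct

# 100 samples from UInt8 (256 values): a naive sampler almost surely
# collides, so the test genuinely exercises the unique constraint.
print(collision_probability(100, 256))

# 100 random 10-char lowercase strings: collisions are vanishingly rare,
# so the test would pass even without any unique constraint.
print(collision_probability(100, 26**10))
```

In other words, with UInt8 the test fails loudly if `unique=True` is ignored, while with random strings it would almost never catch the regression.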
```python
class MultiUniqueSchema(dy.Schema):
    a = dy.Int64(unique=True)
```
same here
Add `unique` rule to `dy.Column`
Motivation
Closes #313
Changes
Add the new rule using the same logic as `primary_key`.
Drive-by:
- Allow `primary_key` for array dtype as it now works in polars; I added a test for it.
- Remove `primary_key` for object dtype as it never worked. (Technically breaking, but IMHO it shouldn't break anything.)