Oss upstream #602

Draft
Davis-Zhang-Onehouse wants to merge 52 commits into apache:main from vinishjail97:oss-upstream

Conversation

@Davis-Zhang-Onehouse

Description

How are the changes test-covered

  • N/A
  • Automated tests (unit and/or integration tests)
  • Manual tests
    • Details are described below

@Davis-Zhang-Onehouse Davis-Zhang-Onehouse marked this pull request as draft May 7, 2026 19:32
Davis Zhang and others added 19 commits May 7, 2026 20:52
Brainstormed plan for porting org.apache.hudi.expression (Predicate
hierarchy) plus the org.apache.hudi.internal.schema.{Type,Types}
subset it depends on, then wiring keyFilterOpt through ReaderContext
and the file-group reader so a key-based In-predicate actually filters
output rows on a v9 MOR + COMMIT_TIME_ORDERING table.

Three-phase landing strategy with manual cross-check against
readerContext_callstack.md as Phase 2 verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the previous "trait hierarchy + downcast in create_key_spec"
plan with a kind() accessor approach (Option A from the brainstorm).
Each Predicate / Expression trait gets a kind() -> {Predicate,Expression}Kind<'_>
method returning a borrowed enum view, so create_key_spec is a clean
Rust match instead of Any::downcast_ref scaffolding.

Trait hierarchy stays Java-faithful; kind() is pure inspection sugar.
Documented as deviation #2 in §6, with the Literal { value: LiteralValue }
shape (deviation #1) re-justified by the kind()-driven extraction pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
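
The kind() accessor described in the commit above can be sketched roughly as follows. This is a simplified illustration, not the code in the PR: the struct fields, the three-variant PredicateKind, and the describe helper are assumptions made for the example (the PR's final enum has 12 variants).

```rust
// Hypothetical, reduced version of the trait + borrowed-enum-view pattern.

/// Borrowed view over a concrete predicate, returned by `Predicate::kind()`.
pub enum PredicateKind<'a> {
    In(&'a In),
    StringStartsWithAny(&'a StringStartsWithAny),
    /// Stand-in for every variant this sketch does not model.
    Other,
}

pub trait Predicate {
    /// Pure inspection: exposes the concrete type without `Any::downcast_ref`.
    fn kind(&self) -> PredicateKind<'_>;
}

// Field layouts below are assumptions for illustration only.
pub struct In {
    pub column: String,
    pub values: Vec<String>,
}

pub struct StringStartsWithAny {
    pub column: String,
    pub prefixes: Vec<String>,
}

pub struct TruePredicate;

impl Predicate for In {
    fn kind(&self) -> PredicateKind<'_> {
        PredicateKind::In(self)
    }
}

impl Predicate for StringStartsWithAny {
    fn kind(&self) -> PredicateKind<'_> {
        PredicateKind::StringStartsWithAny(self)
    }
}

impl Predicate for TruePredicate {
    fn kind(&self) -> PredicateKind<'_> {
        PredicateKind::Other
    }
}

/// A caller inspects a trait object with a plain `match` on `kind()`.
pub fn describe(p: &dyn Predicate) -> String {
    match p.kind() {
        PredicateKind::In(i) => format!("in:{}:{}", i.column, i.values.len()),
        PredicateKind::StringStartsWithAny(s) => {
            format!("prefix:{}:{}", s.column, s.prefixes.len())
        }
        PredicateKind::Other => "other".to_string(),
    }
}
```

Because the enum only borrows the concrete value, adding it does not change the trait hierarchy itself, which is consistent with the "pure inspection sugar" framing above.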
Three-phase plan (1: port Predicate hierarchy + Type/Types; 2: wire
key_filter_opt through ReaderContext + reader; 3: e2e smoke test).
Each task is a TDD cycle with concrete code, exact file paths, and
explicit commit instructions. Spec at
docs/superpowers/specs/2026-05-07-keyfilteropt-port-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stub modules to be populated in subsequent tasks of the keyFilterOpt
port. See docs/superpowers/plans/2026-05-07-keyfilteropt-port-implementation.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Java's ArrayData (org.apache.hudi.expression.ArrayData) is a concrete
class implementing StructLike, not a separate interface. Replace the
incorrectly-defined trait with a Vec<Box<dyn Any + Send + Sync>>-backed
struct that implements StructLike.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
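
A minimal sketch of the shape this commit describes: a concrete ArrayData struct backed by Vec<Box<dyn Any + Send + Sync>> that implements StructLike. The method names num_fields and get, and their signatures, are assumptions modelled loosely on the Java interface; the actual hudi-rs trait may differ.

```rust
use std::any::Any;

/// Hypothetical Rust counterpart of Java's StructLike:
/// positional, type-erased field access.
pub trait StructLike {
    fn num_fields(&self) -> usize;
    /// Returns the value at `pos`, or `None` when `pos` is out of range.
    fn get(&self, pos: usize) -> Option<&(dyn Any + Send + Sync)>;
}

/// Concrete struct (not a trait), mirroring the Java class.
pub struct ArrayData {
    data: Vec<Box<dyn Any + Send + Sync>>,
}

impl ArrayData {
    pub fn new(data: Vec<Box<dyn Any + Send + Sync>>) -> Self {
        ArrayData { data }
    }
}

impl StructLike for ArrayData {
    fn num_fields(&self) -> usize {
        self.data.len()
    }

    fn get(&self, pos: usize) -> Option<&(dyn Any + Send + Sync)> {
        // Deref the Box to hand out a borrowed trait object.
        self.data.get(pos).map(|b| &**b)
    }
}
```

Callers recover a typed value with downcast_ref on the returned reference, which plays the role of the Class<T> tag in the Java get(pos, classTag) call.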
ExpressionKind variants land in subsequent tasks as concrete types are
introduced (Literal, NameReference, BoundReference, Predicate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PredicateKind variants are added in Task 1.14 once concrete predicate
types exist. ExpressionKind now has all four variants and the
_Placeholder is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…th*,StringContains} + factories

Finalizes PredicateKind to 12 variants. All Predicates inner classes
from Java's Predicates.java are now ported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Davis Zhang and others added 24 commits May 7, 2026 22:15
Visitors are ported as declarations for hierarchy completeness. The bind
logic is not wired into the reader path; see spec §6 deviation #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors Java HoodieMergedLogRecordReader.createKeySpec via Rust kind()
pattern matching. Pure addition; not yet wired into scanner. Tests
cover In, StringStartsWithAny, non-matching, and KeySpec.matches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
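
The mapping from a key predicate to a KeySpec, and the KeySpec.matches check the tests cover, might look roughly like this. KeyFilter is a hypothetical stand-in for the kind() view of the predicate, and the variant names FullKeys and KeyPrefixes are assumptions; only the In / StringStartsWithAny / non-matching split comes from the commit message.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the predicate view the real code matches on.
pub enum KeyFilter {
    In(Vec<String>),
    StringStartsWithAny(Vec<String>),
    /// Any predicate that cannot be turned into a key spec.
    Other,
}

/// Assumed shape: either an exact key set or a list of key prefixes.
#[derive(Debug, PartialEq)]
pub enum KeySpec {
    FullKeys(HashSet<String>),
    KeyPrefixes(Vec<String>),
}

impl KeySpec {
    /// True when `record_key` is selected by this spec.
    pub fn matches(&self, record_key: &str) -> bool {
        match self {
            KeySpec::FullKeys(keys) => keys.contains(record_key),
            KeySpec::KeyPrefixes(prefixes) => {
                prefixes.iter().any(|p| record_key.starts_with(p.as_str()))
            }
        }
    }
}

/// No filter, or a filter of an unsupported kind, yields no key spec.
pub fn create_key_spec(filter: Option<&KeyFilter>) -> Option<KeySpec> {
    match filter? {
        KeyFilter::In(keys) => Some(KeySpec::FullKeys(keys.iter().cloned().collect())),
        KeyFilter::StringStartsWithAny(prefixes) => {
            Some(KeySpec::KeyPrefixes(prefixes.clone()))
        }
        KeyFilter::Other => None,
    }
}
```

Returning None for unsupported predicates keeps the scan on its unfiltered path, so the function is a pure addition until a caller wires it in.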
Mirrors Java HoodieReaderContext.keyFilterOpt. Default None on all
construction sites including FFI bridge and ReaderContext::empty().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors Java BaseHoodieLogRecordReader.scanInternal(Option<KeySpec>, boolean).
KeyBasedFileGroupRecordBuffer overrides set_key_spec to store the spec
and skip non-matching records in process_data_block/process_delete_block.
All existing callers pass None; behavior unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
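
As a rough illustration of the buffer-side behaviour described above, the sketch below stores an optional key spec and skips non-matching records in both block handlers. It is not the PR's KeyBasedFileGroupRecordBuffer: the struct name, the string payloads, and the use of a plain HashSet in place of KeySpec are all assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Toy record buffer keyed by record key.
pub struct KeyBasedBuffer {
    /// Stand-in for Option<KeySpec>; None means "no filtering".
    key_spec: Option<HashSet<String>>,
    records: HashMap<String, String>,
}

impl KeyBasedBuffer {
    pub fn new() -> Self {
        KeyBasedBuffer { key_spec: None, records: HashMap::new() }
    }

    /// Store the spec; passing None keeps the pre-existing behaviour.
    pub fn set_key_spec(&mut self, spec: Option<HashSet<String>>) {
        self.key_spec = spec;
    }

    fn accepts(&self, key: &str) -> bool {
        self.key_spec.as_ref().map_or(true, |keys| keys.contains(key))
    }

    /// Upsert every record of a data block whose key passes the spec.
    pub fn process_data_block(&mut self, block: &[(&str, &str)]) {
        for (key, value) in block {
            if self.accepts(key) {
                self.records.insert(key.to_string(), value.to_string());
            }
        }
    }

    /// Apply deletes only for keys that pass the spec.
    pub fn process_delete_block(&mut self, keys: &[&str]) {
        for key in keys {
            if self.accepts(key) {
                self.records.remove(*key);
            }
        }
    }

    /// Surviving keys in sorted order, for easy comparison.
    pub fn keys(&self) -> Vec<String> {
        let mut keys: Vec<String> = self.records.keys().cloned().collect();
        keys.sort();
        keys
    }
}
```

With the spec left at None, every record is accepted, which matches the statement that existing callers see unchanged behaviour.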
Removes the existing TODO. Mirrors Java performScan() lines 95-107.
key_filter_opt defaults to None on all current call paths so behavior
is unchanged for existing users.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors Java HoodieAvroReaderContext.getFileRecordIterator (lines
218-228) "fall through to row-level filter" path. Parquet has no native
key predicate, so we filter via Arrow filter_record_batch on the
_hoodie_record_key column. No-op when key_filter_opt is None.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
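
The row-level fallback above amounts to "build a boolean mask over the record-key column, then keep the rows where the mask is true". The sketch below shows that idea with plain vectors standing in for Arrow arrays, so it runs with the standard library only. Batch, key_mask, filter_batch and apply_key_filter are hypothetical names; the PR itself uses Arrow's filter_record_batch for the second step.

```rust
use std::collections::HashSet;

/// Toy columnar batch: one Vec per column, all of equal length.
pub struct Batch {
    /// Stands in for the _hoodie_record_key column.
    pub record_keys: Vec<String>,
    pub names: Vec<String>,
}

/// One boolean per row: true when the row's key is in the filter set.
pub fn key_mask(batch: &Batch, keys: &HashSet<String>) -> Vec<bool> {
    batch.record_keys.iter().map(|k| keys.contains(k)).collect()
}

/// Keep only rows whose mask entry is true, column by column.
pub fn filter_batch(batch: &Batch, mask: &[bool]) -> Batch {
    let keep = |col: &Vec<String>| -> Vec<String> {
        col.iter()
            .zip(mask)
            .filter(|(_, m)| **m)
            .map(|(v, _)| v.clone())
            .collect()
    };
    Batch {
        record_keys: keep(&batch.record_keys),
        names: keep(&batch.names),
    }
}

/// No-op when there is no key filter; otherwise mask and filter.
pub fn apply_key_filter(batch: Batch, key_filter: Option<&HashSet<String>>) -> Batch {
    match key_filter {
        None => batch,
        Some(keys) => {
            let mask = key_mask(&batch, keys);
            filter_batch(&batch, &mask)
        }
    }
}
```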
Maps each Java readerContext interaction in the call-stack doc to its
hudi-rs file:line equivalent. Validates Phase 2 implementation parity
for the keyFilterOpt path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
read_file_group_with_key_filter is a self-contained variant of the
existing read_file_group helper that accepts a key_filter_opt. lookup_record_key
and extract_row_with_id_opt are small Arrow accessors used by the
filter test in the next task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reads city=sf partition of v9_mor_8i4u_commit_time with and without
an In predicate on _hoodie_record_key. Asserts:
  * baseline returns 2 rows
  * filtered returns 1 row matching baseline content for id=1
  * id=2 (base-only, not in filter) is excluded

End-to-end validation of the keyFilterOpt port: ReaderContext field,
log-scan KeySpec, base-file row filter all confirmed working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three sites in cpp/ build ReaderContext via struct literal:
- cpp/src/lib.rs:372 (FFI bridge)
- cpp/tests/read_record_batch_tests.rs:79 (test helper)
- cpp/tests/read_record_batch_tests.rs:609 (per-test setup)

All initialize key_filter_opt to None. FFI integration is out of scope
per spec §2 — this commit just keeps the workspace compiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 new tests across 3 fixtures, AB-pattern validation, isolated
log-scan vs base-file filter coverage, no-op regression guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8-test plan in 4 phases (A: helper + refactor; B: tests 2-5 on
V9Mor8I4UCommitTime; C: tests 6-7 on V9MorNonpart3Commits with
delete-block path; D: test 8 on MorLayoutLogOnly with log-only
isolation). Each task is a TDD cycle with concrete code and
fixture filenames pre-resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds:
- FilterAbResult struct + 4 assertion methods (narrowed/noop/empty/ids_eq)
- ab_read_with_filter shared driver
- 3 fixture locators: sf_file_group, nonpart_3commits_file_group, log_only_file_group
- ids_in_batch helper

Renames fg_reader_with_key_filter_filters_rows → fg_filter_in_log_updated_key,
refactored to use the AB helper. Behavior unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
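
One plausible shape for the AB helper listed above, as a sketch under stated assumptions: the field names, the i64 id type, the exact meaning of each assertion, and the closure-based reader are guesses for illustration; only the names FilterAbResult, ab_read_with_filter and the narrowed/noop/empty/ids_eq quartet come from the commit message.

```rust
use std::collections::BTreeSet;

/// Ids seen without a filter (A) and with a filter (B).
pub struct FilterAbResult {
    pub baseline_ids: BTreeSet<i64>,
    pub filtered_ids: BTreeSet<i64>,
}

impl FilterAbResult {
    /// Filtered output is a non-empty strict subset of the baseline.
    pub fn assert_narrowed(&self) {
        assert!(!self.filtered_ids.is_empty(), "filter removed every row");
        assert!(self.filtered_ids.is_subset(&self.baseline_ids));
        assert!(self.filtered_ids.len() < self.baseline_ids.len());
    }

    /// The filter changed nothing.
    pub fn assert_noop(&self) {
        assert_eq!(self.filtered_ids, self.baseline_ids);
    }

    /// The filter matched no rows.
    pub fn assert_empty(&self) {
        assert!(self.filtered_ids.is_empty());
    }

    /// Filtered output contains exactly `expected`.
    pub fn assert_ids_eq(&self, expected: &[i64]) {
        let expected: BTreeSet<i64> = expected.iter().cloned().collect();
        assert_eq!(self.filtered_ids, expected);
    }
}

/// Shared driver: run the same read twice, once without and once with a filter.
pub fn ab_read_with_filter<F>(read: F, key_filter: &[i64]) -> FilterAbResult
where
    F: Fn(Option<&[i64]>) -> BTreeSet<i64>,
{
    FilterAbResult {
        baseline_ids: read(None),
        filtered_ids: read(Some(key_filter)),
    }
}
```

Keeping the baseline read inside the driver means every test compares against the fixture's actual unfiltered output rather than a hard-coded row count.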
extract_row_with_id_opt_v9nonpart helper

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reads MorLayoutLogOnly with and without an In filter on
_hoodie_record_key. Since the fixture has no base file, ALL output
flows through KeyBasedFileGroupRecordBuffer's process_data_block /
process_delete_block — this test isolates the log-scan filter.

Probe confirmed: _hoodie_record_key is present (String, values "k1"/"k2"),
baseline row count = 2 (k3 deleted), so primary path was taken.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…jection contract

The previous assertions expected a third 'age' column even though the
ReaderContext-supplied schema_handler projects output down to the
requested_schema [id, name]. Update the test to verify the projected
schema, and add a companion test
test_read_record_batch_no_projection_when_requested_equals_data that
covers the requested == data case where OutputConverter is None and all
columns flow through.

Mirrors Java HoodieFileGroupReader constructor lines 119-122 and the
next() projection loop at lines 264-265 (hudi-internal).
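
The projection contract described above can be illustrated with a small sketch: a converter exists only when the requested schema differs from the data schema, and it maps each requested column to its position in the data row. The function names, the Vec<usize> representation of the converter, and the string-typed rows are assumptions made for this example, not the PR's OutputConverter.

```rust
/// Returns None when requested == data (all columns flow through),
/// otherwise the data-schema index of each requested column.
pub fn output_converter(
    data_schema: &[&str],
    requested_schema: &[&str],
) -> Option<Vec<usize>> {
    if data_schema == requested_schema {
        return None;
    }
    Some(
        requested_schema
            .iter()
            .map(|name| {
                data_schema
                    .iter()
                    .position(|col| col == name)
                    .expect("requested column missing from data schema")
            })
            .collect(),
    )
}

/// Apply the converter to one row; without a converter the row is unchanged.
pub fn project_row(row: &[String], converter: Option<&Vec<usize>>) -> Vec<String> {
    match converter {
        None => row.to_vec(),
        Some(indices) => indices.iter().map(|&i| row[i].clone()).collect(),
    }
}
```

With a data schema of [id, name, age] and a requested schema of [id, name], the projected row has two columns, which is the behaviour the updated assertions check.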