[spark][doc] Add Spark batch union read#3142

Merged
luoyuxia merged 3 commits into apache:main from Yohahaha:spark-union-read-doc
Apr 24, 2026

@Yohahaha
Contributor

Purpose

Linked issue: close #xxx

Brief change log

Tests

API and Format

Documentation

@Yohahaha Yohahaha marked this pull request as ready for review April 20, 2026 13:58
@Yohahaha
Contributor Author

@wuchong @YannByron @luoyuxia please take a look, thank you!

Comment thread website/docs/engine-spark/reads.md Outdated
The union read works for both **log tables** and **primary key tables**:

- **Log tables**: Combines Fluss log data with lake historical data
- **Primary key tables**: Merges the latest Fluss snapshot with log changes and lake history to provide the most up-to-date view
Contributor

@beryllw beryllw Apr 22, 2026

Combines lake snapshot data with recent KV log changes using sort-merge to provide the most up-to-date view

The phrase "latest Fluss snapshot" may cause confusion, as Fluss has its own internal snapshot concept (used for KV compaction).
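
To make the suggested wording concrete, here is a minimal sketch of the sort-merge idea in Python. It is not Fluss's actual implementation: the function name `sort_merge_union`, the row shape, and the use of `None` as a delete marker are all hypothetical, and it assumes both inputs are already sorted by primary key and that the log side has been collapsed to the latest change per key.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical row shape: (primary key, value). A value of None marks a delete.
Row = Tuple[int, Optional[dict]]


def sort_merge_union(lake_rows: Iterable[Row], log_changes: Iterable[Row]) -> Iterator[Row]:
    """Merge two key-sorted inputs; on equal keys the log change wins."""
    lake_iter, log_iter = iter(lake_rows), iter(log_changes)
    lake, log = next(lake_iter, None), next(log_iter, None)
    while lake is not None or log is not None:
        if log is None or (lake is not None and lake[0] < log[0]):
            # Key exists only in the lake snapshot (so far): emit it unchanged.
            yield lake
            lake = next(lake_iter, None)
        else:
            if lake is not None and lake[0] == log[0]:
                # The lake row is superseded by the newer log change.
                lake = next(lake_iter, None)
            if log[1] is not None:
                # Upsert: emit the newer value; deletes are dropped.
                yield log
            log = next(log_iter, None)


lake = [(1, {"total_price": 10}), (2, {"total_price": 20}), (4, {"total_price": 40})]
log = [(2, {"total_price": 25}), (3, {"total_price": 30}), (4, None)]
merged = list(sort_merge_union(lake, log))
```

In this example, key 2 takes its value from the log, key 3 appears only in the log, and key 4 is removed by the delete marker, which is the "most up-to-date view" the wording refers to.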

```sql title="Spark SQL"
-- Returns complete view combining Fluss and lake data
SELECT * FROM fluss_order_with_lake ORDER BY order_key;
```

Contributor

Maybe we could add a note:

Union read requires `scan.startup.mode = full` (default). Non-FULL modes (e.g., `earliest`, `latest`) bypass the lake path and read only from Fluss.

Contributor Author

`scan.startup.mode` is not actually used in batch reads; I will fix the related code in another PR.

@beryllw
Contributor

beryllw commented Apr 22, 2026

Thanks for the PR! Overall LGTM, with a few minor comments.

Contributor

@luoyuxia luoyuxia left a comment

@Yohahaha Hi, thanks for the PR. I left some minor comments. PTAL.

Comment thread website/docs/engine-spark/reads.md Outdated
Comment thread website/docs/engine-spark/reads.md Outdated
Contributor

Copilot AI left a comment

Pull request overview

Updates Fluss website documentation to reflect Spark batch support for lake-enabled “union read” and to surface that Spark is now a supported engine for union reads.

Changes:

  • Update Lakehouse overview to state both Flink and Spark support union reads.
  • Expand Spark “Reads” docs with a new section describing union reads for lake-enabled tables.
  • Update Spark “Getting Started” feature matrix note to mention union read support.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| website/docs/streaming-lakehouse/overview.md | Updates union read engine support statement to include Spark. |
| website/docs/engine-spark/reads.md | Removes old limitation and adds documentation + examples for Spark batch union reads on lake-enabled tables. |
| website/docs/engine-spark/getting-started.md | Updates feature support note for batch select to mention union read. |

Comment thread website/docs/engine-spark/reads.md Outdated

```sql title="Spark SQL"
-- Query will union data from Fluss and lake
SELECT SUM(total_amount) AS total_revenue FROM fluss_order_with_lake;
```

Copilot AI Apr 23, 2026

The example query uses total_amount, but the table created below defines total_price (and no total_amount). As written, this SQL will fail when users try it; update the column name in the query (or the example schema) so they match.

Suggested change
SELECT SUM(total_amount) AS total_revenue FROM fluss_order_with_lake;
SELECT SUM(total_price) AS total_revenue FROM fluss_order_with_lake;

Comment thread website/docs/engine-spark/reads.md Outdated

#### Union Read

To read the full dataset, simply query the table without any suffix. The Spark connector automatically unions data from Fluss and the lake storage:

Copilot AI Apr 23, 2026

This section says to query the table "without any suffix", but this page doesn't introduce any table suffix concept for Spark (unlike some Flink docs). Consider rephrasing to "query the table directly" or explicitly documenting what suffixes (if any) are supported for Spark reads and what they do.

Suggested change
To read the full dataset, simply query the table without any suffix. The Spark connector automatically unions data from Fluss and the lake storage:
To read the full dataset, simply query the table directly. The Spark connector automatically unions data from Fluss and the lake storage:

| [SQL Add Partition](ddl.md#add-partition) | ✔️ | |
| [SQL Drop Partition](ddl.md#drop-partition) | ✔️ | |
| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table |
| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table with union read |

Copilot AI Apr 23, 2026

In the feature table note, "Log table and primary-key table with union read" is ambiguous (it reads like only the primary-key table has union read). Consider rewording to make it clear that union read is supported for both table types when lake-enabled.

Suggested change
| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table with union read |
| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table; both support union read when lake-enabled |

Contributor

@luoyuxia luoyuxia left a comment

Thanks for the update. +1

@luoyuxia luoyuxia merged commit b6d2f33 into apache:main Apr 24, 2026
2 checks passed
@Yohahaha Yohahaha deleted the spark-union-read-doc branch April 24, 2026 03:50