
feat(tools): support Parquet and Arrow import with schema/auto modes#793

Open
gengziyand wants to merge 4 commits into apache:develop from gengziyand:feat/tools-format-to-tsfile

Conversation

@gengziyand

Summary

  • Add Parquet and Arrow format support to the TsFile import tool (alongside existing CSV)
  • Add auto mode (schema-less) import for all three formats
  • Unify schema naming: `id_columns` → `tag_columns`, `csv_columns` → `source_columns` (backward compatible)
  • Add parquet2tsfile.sh/.bat and arrow2tsfile.sh/.bat scripts
  • Add CLI options: --format, --table_name, --time_precision, --separator
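As a usage sketch of the new scripts and flags (only `--format`, `--table_name`, `--time_precision`, and `--separator` are named in this PR; the input/output options and their exact spelling are assumptions following the existing csv2tsfile tooling):

```shell
# Hypothetical invocations; remaining source/target options elided (...)
./parquet2tsfile.sh --table_name sensor_data --time_precision ms ...
./arrow2tsfile.sh --table_name sensor_data ...

# The unified entry point also takes an explicit format:
#   --format parquet | arrow | csv   (with --separator for CSV input)
```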

Changes

  • New classes: ParquetSourceReader, ArrowSourceReader, AutoSchemaInferer, ImportExecutor, ImportSchema, ImportSchemaParser, SourceReader,
    SourceBatch, TabletBuilder, TimeConverter, ValueConverter
  • Modified: TsFileTool.java (unified CLI entry point), CsvSourceReader.java (auto mode support)
  • New dependencies: parquet-hadoop 1.14.4, hadoop-common 3.3.6, arrow-vector 15.0.2
  • 189 automated tests across 11 test classes
  • Updated README.md and README-zh.md
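To make the renamed schema keys concrete, a hypothetical schema file might look like the following (the structure and field names here are illustrative only; the actual format is defined by `ImportSchemaParser` and documented in the README):

```json
{
  "table_name": "sensor_data",
  "tag_columns": ["device_id"],
  "source_columns": [
    {"name": "time"},
    {"name": "device_id"},
    {"name": "unused", "action": "SKIP"},
    {"name": "temperature", "default": "0.0"}
  ]
}
```

The legacy `id_columns` / `csv_columns` keys remain accepted for backward compatibility, per the summary above.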

Test plan

  • 189 unit and E2E tests passing (JDK 8)
  • Smoke test on Linux JDK 17 (Docker) - 9 scenarios
  • Smoke test on Windows JDK 8 and JDK 17 - 9 scenarios
  • Data correctness verification - 39 checks (table name, TAG/FIELD, cross-format consistency)
  • Boundary tests - 8 scenarios (tab separator, null values, SKIP/DEFAULT, error handling, fail_dir)

Contributor

@CritasWang left a comment


Need to check the `skip` flag here:

```java
private List<String> getSchemaColumnNames() {
    List<String> names = new ArrayList<>();
    for (ImportSchema.SourceColumn col : schema.getSourceColumns()) {
        if (!col.isSkip()) {
            // … (excerpt truncated in the review)
```
Contributor


In schema mode these readers drop skipped source columns from the `SourceBatch`, but `TabletBuilder` still indexes values by the original `source_columns` positions. With a schema like `time, unused SKIP, value`, `getSchemaColumnNames()` returns only `[time, value]`, while `TabletBuilder` maps `value` to index 2, so `batch.getValue(row, 2)` throws and the import fails (or later columns shift incorrectly).

This affects real Parquet/Arrow schemas that use SKIP, and the current Parquet test only exercises `readBatch()`, not the `ImportExecutor` path. Either keep placeholder columns for skipped entries or change the builder mapping to use batch column names rather than schema positions.
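The index mismatch can be sketched in isolation (class and method names below are illustrative stand-ins, not the PR's actual code):

```java
import java.util.Arrays;
import java.util.List;

public class SkipMappingSketch {
    // Buggy approach: resolve a value's index from its original position
    // in source_columns, even though skipped columns were dropped.
    static int schemaPosition(List<String> sourceColumns, String name) {
        return sourceColumns.indexOf(name);
    }

    // Fix suggested in the review: resolve the index against the batch's
    // own column names, which no longer contain the skipped entries.
    static int batchPosition(List<String> batchColumns, String name) {
        return batchColumns.indexOf(name);
    }

    public static void main(String[] args) {
        // Hypothetical schema: source_columns = [time, unused (SKIP), value]
        List<String> sourceColumns = Arrays.asList("time", "unused", "value");
        // The reader drops the skipped column, so the batch holds [time, value]
        List<String> batchColumns = Arrays.asList("time", "value");

        int bad = schemaPosition(sourceColumns, "value");  // 2: out of range for the batch
        int good = batchPosition(batchColumns, "value");   // 1: valid batch index
        System.out.println(bad + " " + good);              // prints "2 1"
        System.out.println(bad >= batchColumns.size());    // prints "true": getValue(row, 2) would throw
    }
}
```

Keeping placeholder entries for skipped columns in the batch would equally restore the position-based mapping; either fix removes the off-by-N shift.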

