
feat(tools): support Parquet and Arrow import with schema/auto modes#793

Open
gengziyand wants to merge 4 commits into apache:develop from gengziyand:feat/tools-format-to-tsfile

Conversation

@gengziyand

Summary

  • Add Parquet and Arrow format support to the TsFile import tool (alongside existing CSV)
  • Add auto mode (schema-less) import for all three formats
  • Unify schema naming: `id_columns` → `tag_columns`, `csv_columns` → `source_columns` (backward compatible)
  • Add parquet2tsfile.sh/.bat and arrow2tsfile.sh/.bat scripts
  • Add CLI options: --format, --table_name, --time_precision, --separator
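As a usage sketch of the new scripts and flags (only `--format`, `--table_name`, `--time_precision`, and `--separator` are named in this PR; the input/output options and their exact spelling are assumptions following the existing csv2tsfile tooling):

```shell
# Hypothetical invocations; remaining source/target options elided (...)
./parquet2tsfile.sh --table_name sensor_data --time_precision ms ...
./arrow2tsfile.sh --table_name sensor_data ...

# The unified entry point also takes an explicit format:
#   --format parquet | arrow | csv   (with --separator for CSV input)
```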

Changes

  • New classes: ParquetSourceReader, ArrowSourceReader, AutoSchemaInferer, ImportExecutor, ImportSchema, ImportSchemaParser, SourceReader,
    SourceBatch, TabletBuilder, TimeConverter, ValueConverter
  • Modified: TsFileTool.java (unified CLI entry point), CsvSourceReader.java (auto mode support)
  • New dependencies: parquet-hadoop 1.14.4, hadoop-common 3.3.6, arrow-vector 15.0.2
  • 189 automated tests across 11 test classes
  • Updated README.md and README-zh.md
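To make the renamed schema keys concrete, a hypothetical schema file might look like the following (the structure and field names here are illustrative only; the actual format is defined by `ImportSchemaParser` and documented in the README):

```json
{
  "table_name": "sensor_data",
  "tag_columns": ["device_id"],
  "source_columns": [
    {"name": "time"},
    {"name": "device_id"},
    {"name": "unused", "action": "SKIP"},
    {"name": "temperature", "default": "0.0"}
  ]
}
```

The legacy `id_columns` / `csv_columns` keys remain accepted for backward compatibility, per the summary above.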

Test plan

  • 189 unit and E2E tests passing (JDK 8)
  • Smoke test on Linux JDK 17 (Docker) - 9 scenarios
  • Smoke test on Windows JDK 8 and JDK 17 - 9 scenarios
  • Data correctness verification - 39 checks (table name, TAG/FIELD, cross-format consistency)
  • Boundary tests - 8 scenarios (tab separator, null values, SKIP/DEFAULT, error handling, fail_dir)

Contributor

@CritasWang left a comment


Need to check the `skip` flag here:

```java
private List<String> getSchemaColumnNames() {
    List<String> names = new ArrayList<>();
    for (ImportSchema.SourceColumn col : schema.getSourceColumns()) {
        if (!col.isSkip()) {
            // … (excerpt truncated in the review)
```
Contributor


In schema mode these readers drop skipped source columns from the `SourceBatch`, but `TabletBuilder` still indexes values by the original `source_columns` positions. With a schema like `time, unused SKIP, value`, `getSchemaColumnNames()` returns only `[time, value]`, while `TabletBuilder` maps `value` to index 2, so `batch.getValue(row, 2)` throws and the import fails (or later columns shift incorrectly).

This affects real Parquet/Arrow schemas that use SKIP, and the current Parquet test only exercises `readBatch()`, not the `ImportExecutor` path. Either keep placeholder columns for skipped entries or change the builder mapping to use batch column names rather than schema positions.
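The index mismatch can be sketched in isolation (class and method names below are illustrative stand-ins, not the PR's actual code):

```java
import java.util.Arrays;
import java.util.List;

public class SkipMappingSketch {
    // Buggy approach: resolve a value's index from its original position
    // in source_columns, even though skipped columns were dropped.
    static int schemaPosition(List<String> sourceColumns, String name) {
        return sourceColumns.indexOf(name);
    }

    // Fix suggested in the review: resolve the index against the batch's
    // own column names, which no longer contain the skipped entries.
    static int batchPosition(List<String> batchColumns, String name) {
        return batchColumns.indexOf(name);
    }

    public static void main(String[] args) {
        // Hypothetical schema: source_columns = [time, unused (SKIP), value]
        List<String> sourceColumns = Arrays.asList("time", "unused", "value");
        // The reader drops the skipped column, so the batch holds [time, value]
        List<String> batchColumns = Arrays.asList("time", "value");

        int bad = schemaPosition(sourceColumns, "value");  // 2: out of range for the batch
        int good = batchPosition(batchColumns, "value");   // 1: valid batch index
        System.out.println(bad + " " + good);              // prints "2 1"
        System.out.println(bad >= batchColumns.size());    // prints "true": getValue(row, 2) would throw
    }
}
```

Keeping placeholder entries for skipped columns in the batch would equally restore the position-based mapping; either fix removes the off-by-N shift.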

