
GH-3451. Add a JMH benchmark for variants#3452

Open
steveloughran wants to merge 13 commits into apache:master from steveloughran:pr/benchmark-variant

Conversation

@steveloughran
Contributor

@steveloughran steveloughran commented Mar 19, 2026

Rationale for this change

There's no benchmark for variant IO, so there's no way to identify existing problems or to detect regressions.

What changes are included in this PR?

  • adds parquet-variant to parquet-benchmark dependencies
  • new JMH benchmark VariantBuilderBenchmark to measure builder cost.
  • VariantProjectionBenchmark: selective read of fields from shredded/unshredded files
  • Improves VariantConverter's conversion of a binary to a string through a new package-private method in VariantBuilder. Benchmark VariantConverterBenchmark shows a 20% reduction in the time to convert a marshalled string to a Java string within a builder. This codepath is used when reconstructing partially shredded variants.
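The actual package-private method is not shown in this description, so the following is a sketch only, with hypothetical names (`appendString`, `appendUtf8Bytes`, `DirectStringAppend`), of the optimization it enables: a Parquet Binary already holds UTF-8 bytes, so copying them straight into the builder's buffer skips a String decode/encode round trip.

```java
// Sketch only: names are hypothetical, not the real VariantBuilder API.
class DirectStringAppend {
  private final java.io.ByteArrayOutputStream buffer =
      new java.io.ByteArrayOutputStream();

  /** Slow path: decode the bytes into a String, then re-encode on append. */
  void appendString(byte[] utf8) {
    String s = new String(utf8, java.nio.charset.StandardCharsets.UTF_8);
    buffer.writeBytes(s.getBytes(java.nio.charset.StandardCharsets.UTF_8));
  }

  /** Fast path: the bytes are already UTF-8, so copy them straight in. */
  void appendUtf8Bytes(byte[] utf8) {
    buffer.writeBytes(utf8);
  }

  byte[] contents() {
    return buffer.toByteArray();
  }
}
```

Both paths produce identical buffer contents; the fast path avoids one String allocation and two charset conversions per value.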

Are these changes tested?

Apart from a minor change in VariantConverter/VariantBuilder, this is all benchmark code.

See https://steveloughran.github.io/benchmarking-variants/ for the writeup and the interactive benchmark results of Iceberg and Parquet benchmarks.

Are there any user-facing changes?

No

Closes #3451

@steveloughran steveloughran marked this pull request as draft March 19, 2026 10:45
@steveloughran
Contributor Author

Still thinking of what else can be done here...suggestions welcome.

Probably a real write to the local filesystem and a read back in.

@steveloughran
Contributor Author

I'll add a "deep" option too, for consistency with the Iceberg PR.

@steveloughran steveloughran marked this pull request as ready for review March 24, 2026 14:58
private static int count() {
  int c = counter++;
  if (c >= 512) {
    c = 0;
only resets the local copy, counter keeps growing?

Contributor Author

@steveloughran steveloughran Mar 30, 2026


good point. will fix.
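A sketch of the fix the review points at: advance and wrap the shared field itself rather than a local copy, so the returned value cycles through [0, 512) and the counter can never grow without bound. (The actual fix landed in the commit below; this is just an illustration of the idea.)

```java
// Sketch of the suggested fix: wrap the field, not a local copy of it.
class WrappingCounter {
  private static int counter = 0;

  static int count() {
    int c = counter;
    counter = (counter + 1) % 512; // the field itself wraps at 512
    return c;
  }
}
```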

* deser to recurse down
* include uuid and bigdecimal
* reset counter on benchmark setup
iterations of class code and #of rows are the same
for easy compare of overheads.
Using the same structure as the iceberg tests do
@steveloughran
Copy link
Copy Markdown
Contributor Author

There's now a new benchmark which writes a file using the same simple schema as I'm doing in Iceberg apache/iceberg#15629, and tries to do a projection on it.

 SELECT id, category, variant_get('nested.varcategory') FROM table

Review by Copilot


Setup: 1M rows, 4-field nested variant (idstr, varid, varcategory, col4), querying varcategory only. SingleShotTime, 15 iterations, @Fork(0).

Raw Results


  ┌───────────────────────────┬──────────┬───────────────┬─────────┬────────┐
  │ Benchmark                 │ shredded │ Score (ms/op) │ Error   │ µs/row │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readAllRecords            │ false    │ 728.514       │ ±11.253 │ 0.729  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedFileSchema   │ false    │ 760.287       │ ±3.314  │ 0.760  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedLeanSchema   │ false    │ 1405.264      │ ±8.399  │ 1.405  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readAllRecords            │ true     │ 1315.615      │ ±14.598 │ 1.316  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedFileSchema   │ true     │ 1297.870      │ ±19.621 │ 1.298  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedLeanSchema   │ true     │ 725.618       │ ±10.574 │ 0.726  │
  └───────────────────────────┴──────────┴───────────────┴─────────┴────────┘
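The µs/row column follows directly from the score: each operation covers the whole 1M-row file, so microseconds per row is just (ms per op × 1000) / row count. A quick check of the conversion:

```java
// µs/row derived from a JMH score in ms/op over a known row count.
class PerRowCost {
  static double usPerRow(double msPerOp, long rows) {
    return msPerOp * 1000.0 / rows; // ms -> µs, then divide by rows
  }
}
```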

Speedup/Penalty vs readAllRecords Baseline


  ┌───────────────────────────┬──────────────────┬──────────────────┐
  │ Benchmark                 │ shredded=false   │ shredded=true    │
  ├───────────────────────────┼──────────────────┼──────────────────┤
  │ readProjectedFileSchema   │ −4% (overhead)   │ +1% (noise)      │
  ├───────────────────────────┼──────────────────┼──────────────────┤
  │ readProjectedLeanSchema   │ −93% penalty     │ +45% speedup     │
  └───────────────────────────┴──────────────────┴──────────────────┘
  • Lean schema projection is the only technique that skips columns. Projecting the full file schema (readProjectedFileSchema) gives zero benefit in either case — Parquet still reads all column chunks.
  • Lean schema + shredded = 45% faster than reading all columns. Skipping idstr, varid, and col4 typed columns saves ~590ms per 1M rows.
  • Lean schema + unshredded = 93% slower. The lean schema requests typed_value.varcategory which does not exist in the unshredded file. Parquet handles the missing columns at every row, which is more expensive than reading the single binary blob directly.
  • Schema detection in ReadSupport.init() is essential. Applying containsField("typed_value") to choose between lean and full schema prevents the unshredded penalty while preserving the shredded speedup.
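The percentages above follow from the raw scores: (1405.264 − 728.514) / 728.514 ≈ +93% for the unshredded lean read, (725.618 − 1315.615) / 1315.615 ≈ −45% for the shredded one, and 1315.615 − 725.618 ≈ 590 ms saved per 1M rows. A small check of that arithmetic:

```java
// Relative change of each projected read against its readAllRecords baseline;
// negative means faster than the baseline, positive means slower.
class SpeedupMath {
  static double relativeChange(double baselineMs, double measuredMs) {
    return (measuredMs - baselineMs) / baselineMs;
  }
}
```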

Recommendation

Always detect file layout in ReadSupport.init() and apply the lean projection only when the file was written with a shredded schema. For unshredded files, use the full file schema or no projection.

If you have a query with a pushdown predicate that wants to look inside a variant, creating a MessageType schema referring to the shredded values is counterproductive unless you know that the variant is shredded.

That can be determined by looking at the schema and using `containsField("typed_value")` to see if the target variant has any nested values.

    @Override
    public ReadContext init(InitContext context) {
      MessageType fileSchema = context.getFileSchema();
      GroupType nested = fileSchema.getType("nested").asGroupType();
      if (nested.containsField("typed_value")) {
        return new ReadContext(VARCATEGORY_PROJECTION);
      }
      // Unshredded file: projection designed for typed columns provides no benefit and
      // causes schema mismatch overhead — fall back to the full file schema.
      return new ReadContext(fileSchema);
    }
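The lean projection itself (VARCATEGORY_PROJECTION) isn't shown in this comment. As a sketch only, assuming the field names from the benchmark setup and the value/typed_value column layout used by variant shredding (types and names here are illustrative, not copied from the PR), it might look something like:

```
message lean_projection {
  required int64 id;
  required binary category (STRING);
  optional group nested {
    required binary metadata;
    optional binary value;
    optional group typed_value {
      required group varcategory {
        optional binary value;
        optional binary typed_value (STRING);
      }
    }
  }
}
```

Only the variant metadata, the residual value column, and varcategory's columns are requested; the typed columns for idstr, varid, and col4 are never read.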

@steveloughran
Copy link
Copy Markdown
Contributor Author

Build failures are all because the Java 11 javadoc is fussier than the versions on either side of it.

@steveloughran
Copy link
Copy Markdown
Contributor Author

Flame graph profiles so you can compare what is using CPU time when working with variants

profiles.zip

Variant rebuild is expensive. I've got a minor (and will keep package-private) tweak to VariantBuilder to allow VariantStringConverter to add a Binary as a string to the variant without converting it to a String en route, which saves a lot of needless string creation.

Variant.getFieldAtIndex() is pretty expensive too.

@steveloughran
Copy link
Copy Markdown
Contributor Author

Slides. Reading the PR will make the notion of file vs lean schema clearer.

2026-04-01-variant reads considered suboptimal

* Move under variant package to access private members
* add "variant" group to run.sh
* new benchmark VariantConverterBenchmark
