
GH-3451. Add a JMH benchmark for variants#3452

Open
steveloughran wants to merge 13 commits into apache:master from steveloughran:pr/benchmark-variant

Conversation

@steveloughran
Contributor

@steveloughran steveloughran commented Mar 19, 2026

Rationale for this change

There's no benchmark for variant IO, so there's no way to identify existing problems or to detect regressions.

What changes are included in this PR?

  • adds parquet-variant to parquet-benchmark dependencies
  • new JMH benchmark VariantBuilderBenchmark to measure builder cost.
  • VariantProjectionBenchmark: selective read of fields from shredded/unshredded files
  • Improves VariantConverter's conversion of a binary to a string through a new package-private method in VariantBuilder. Benchmark VariantConverterBenchmark shows a 20% reduction in the time to convert a marshalled string to a Java string within a builder. This codepath is used when reconstructing partially shredded variants.
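The actual package-private method is not shown in this description, so the following is a sketch only, with hypothetical names (`appendString`, `appendUtf8Bytes`, `DirectStringAppend`), of the optimization it enables: a Parquet Binary already holds UTF-8 bytes, so copying them straight into the builder's buffer skips a String decode/encode round trip.

```java
// Sketch only: names are hypothetical, not the real VariantBuilder API.
class DirectStringAppend {
  private final java.io.ByteArrayOutputStream buffer =
      new java.io.ByteArrayOutputStream();

  /** Slow path: decode the bytes into a String, then re-encode on append. */
  void appendString(byte[] utf8) {
    String s = new String(utf8, java.nio.charset.StandardCharsets.UTF_8);
    buffer.writeBytes(s.getBytes(java.nio.charset.StandardCharsets.UTF_8));
  }

  /** Fast path: the bytes are already UTF-8, so copy them straight in. */
  void appendUtf8Bytes(byte[] utf8) {
    buffer.writeBytes(utf8);
  }

  byte[] contents() {
    return buffer.toByteArray();
  }
}
```

Both paths produce identical buffer contents; the fast path avoids one String allocation and two charset conversions per value.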

Are these changes tested?

Apart from a minor change in VariantConverter/VariantBuilder, this is all benchmark code.

See https://steveloughran.github.io/benchmarking-variants/ for the writeup and the interactive benchmark results of Iceberg and Parquet benchmarks.

Are there any user-facing changes?

No

Closes #3451

@steveloughran steveloughran marked this pull request as draft March 19, 2026 10:45
@steveloughran
Contributor Author

Still thinking of what else can be done here...suggestions welcome.

Probably a real write to the local filesystem and a read back in.

@steveloughran
Contributor Author

I'll add a "deep" option too, for consistency with the Iceberg PR.

@steveloughran steveloughran marked this pull request as ready for review March 24, 2026 14:58
private static int count() {
  int c = counter++;
  if (c >= 512) {
    c = 0;
only resets the local copy, counter keeps growing?

Contributor Author

@steveloughran steveloughran Mar 30, 2026


good point. will fix.
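A sketch of the fix the review points at: advance and wrap the shared field itself rather than a local copy, so the returned value cycles through [0, 512) and the counter can never grow without bound. (The actual fix landed in the commit below; this is just an illustration of the idea.)

```java
// Sketch of the suggested fix: wrap the field, not a local copy of it.
class WrappingCounter {
  private static int counter = 0;

  static int count() {
    int c = counter;
    counter = (counter + 1) % 512; // the field itself wraps at 512
    return c;
  }
}
```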

* deser to recurse down
* include uuid and bigdecimal
* reset counter on benchmark setup
iterations of class code and #of rows are the same
for easy compare of overheads.
Using the same structure as the iceberg tests do
@steveloughran
Copy link
Copy Markdown
Contributor Author

There's now a new benchmark which writes a file using the same simple schema as I'm doing in Iceberg apache/iceberg#15629, and tries to do a projection on it.

 SELECT id, category, variant_get('nested.varcategory') FROM table

Review by Copilot


Setup: 1M rows, 4-field nested variant (idstr, varid, varcategory, col4), querying varcategory only. SingleShotTime, 15 iterations, @Fork(0).

Raw Results


  ┌───────────────────────────┬──────────┬───────────────┬─────────┬────────┐
  │ Benchmark                 │ shredded │ Score (ms/op) │ Error   │ µs/row │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readAllRecords            │ false    │ 728.514       │ ±11.253 │ 0.729  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedFileSchema   │ false    │ 760.287       │ ±3.314  │ 0.760  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedLeanSchema   │ false    │ 1405.264      │ ±8.399  │ 1.405  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readAllRecords            │ true     │ 1315.615      │ ±14.598 │ 1.316  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedFileSchema   │ true     │ 1297.870      │ ±19.621 │ 1.298  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedLeanSchema   │ true     │ 725.618       │ ±10.574 │ 0.726  │
  └───────────────────────────┴──────────┴───────────────┴─────────┴────────┘
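The µs/row column follows directly from the score: each operation covers the whole 1M-row file, so microseconds per row is just (ms per op × 1000) / row count. A quick check of the conversion:

```java
// µs/row derived from a JMH score in ms/op over a known row count.
class PerRowCost {
  static double usPerRow(double msPerOp, long rows) {
    return msPerOp * 1000.0 / rows; // ms -> µs, then divide by rows
  }
}
```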

Speedup/Penalty vs readAllRecords Baseline


  ┌───────────────────────────┬──────────────────┬──────────────────┐
  │ Benchmark                 │ shredded=false   │ shredded=true    │
  ├───────────────────────────┼──────────────────┼──────────────────┤
  │ readProjectedFileSchema   │ −4% (overhead)   │ +1% (noise)      │
  ├───────────────────────────┼──────────────────┼──────────────────┤
  │ readProjectedLeanSchema   │ −93% penalty     │ +45% speedup     │
  └───────────────────────────┴──────────────────┴──────────────────┘
  • Lean schema projection is the only technique that skips columns. Projecting the full file schema (readProjectedFileSchema) gives zero benefit in either case — Parquet still reads all column chunks.
  • Lean schema + shredded = 45% faster than reading all columns. Skipping idstr, varid, and col4 typed columns saves ~590ms per 1M rows.
  • Lean schema + unshredded = 93% slower. The lean schema requests typed_value.varcategory which does not exist in the unshredded file. Parquet handles the missing columns at every row, which is more expensive than reading the single binary blob directly.
  • Schema detection in ReadSupport.init() is essential. Applying containsField("typed_value") to choose between lean and full schema prevents the unshredded penalty while preserving the shredded speedup.
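The percentages above follow from the raw scores: (1405.264 − 728.514) / 728.514 ≈ +93% for the unshredded lean read, (725.618 − 1315.615) / 1315.615 ≈ −45% for the shredded one, and 1315.615 − 725.618 ≈ 590 ms saved per 1M rows. A small check of that arithmetic:

```java
// Relative change of each projected read against its readAllRecords baseline;
// negative means faster than the baseline, positive means slower.
class SpeedupMath {
  static double relativeChange(double baselineMs, double measuredMs) {
    return (measuredMs - baselineMs) / baselineMs;
  }
}
```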

Recommendation

Always detect file layout in ReadSupport.init() and apply the lean projection only when the file was written with a shredded schema. For unshredded files, use the full file schema or no projection.

If you have a query with a pushdown predicate that wants to look inside a variant, creating a MessageType schema referring to the shredded values is counterproductive unless you know that the variant is shredded.

That can be determined by looking at the schema and using `containsField("typed_value")` to see if the target variant has any nested values.

    @Override
    public ReadContext init(InitContext context) {
      MessageType fileSchema = context.getFileSchema();
      GroupType nested = fileSchema.getType("nested").asGroupType();
      if (nested.containsField("typed_value")) {
        return new ReadContext(VARCATEGORY_PROJECTION);
      }
      // Unshredded file: projection designed for typed columns provides no benefit and
      // causes schema mismatch overhead — fall back to the full file schema.
      return new ReadContext(fileSchema);
    }
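The lean projection itself (VARCATEGORY_PROJECTION) isn't shown in this comment. As a sketch only, assuming the field names from the benchmark setup and the value/typed_value column layout used by variant shredding (types and names here are illustrative, not copied from the PR), it might look something like:

```
message lean_projection {
  required int64 id;
  required binary category (STRING);
  optional group nested {
    required binary metadata;
    optional binary value;
    optional group typed_value {
      required group varcategory {
        optional binary value;
        optional binary typed_value (STRING);
      }
    }
  }
}
```

Only the variant metadata, the residual value column, and varcategory's columns are requested; the typed columns for idstr, varid, and col4 are never read.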

@steveloughran
Copy link
Copy Markdown
Contributor Author

Build failures are all because the Java 11 javadoc is fussier than the versions on either side of it.

@steveloughran
Copy link
Copy Markdown
Contributor Author

Flame graph profiles so you can compare what is using CPU time when working with variants

profiles.zip

Variant rebuild is expensive. I've got a minor (and will keep package-private) tweak to VariantBuilder to allow VariantStringConverter to add a Binary as a string to the variant without converting it to a String en route, which saves a lot of needless string creation.

Variant.getFieldAtIndex() is pretty expensive too.

@steveloughran
Copy link
Copy Markdown
Contributor Author

Slides. Reading the PR will make the notion of file vs lean schema clearer.

2026-04-01-variant reads considered suboptimal

* Move under variant package to access private members
* add "variant" group to run.sh
* new benchmark VariantConverterBenchmark
