feat: ColliderML reader by benjaminhuth · Pull Request #5546 · acts-project/acts

benjaminhuth · 2026-06-04T13:31:48Z

Blocked by

Root cause of g2l failures: the geo map KDTree matching was unreliable at sensor boundaries because normal-distance alone cannot distinguish phi- or z-adjacent sensors (all have near-zero perp distance for hits at the edge).

Fix 1 (generate_colliderml_geo_map.py): use direct (vol, layer, sensitive) lookup as the primary strategy (CML surface_id == ACTS sensitive ID in ODD Gen1). Verify each match by projecting the representative hit and checking surface bounds. Fall back to KDTree only for surfaces where direct lookup fails. Result: 18824/18824 matched, 0 KDTree fallbacks.

Fix 2 (ColliderMLInputConverter): project hits with unlimited perpendicular tolerance (CML positions are 3D track positions inside sensor volume), then validate local coords against surface bounds with 5 mm Euclidean tolerance. 5 mm accepts boundary incidence effects on large sensors (halfX ≈ 56 mm) while still catching genuine wrong-surface assignments (tens of mm outside).

Fix 3 (Python/Core/src/Surfaces.cpp): add Surface.globalToLocal() binding returning Optional[Vector2] with configurable tolerance, used by both the geo map verification and the investigation scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
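For readers following the matching logic, here is a minimal sketch of the two ideas in this commit message: a direct-lookup-first strategy with a fallback, and a bounds check with a Euclidean tolerance. It is not the code from the PR. It assumes rectangular bounds centred on the local origin, and the names `distance_outside_bounds`, `within_tolerance` and `match_surface`, as well as the `verify` and `fallback` callables, are hypothetical. Treating a failed verification as a reason to fall back is also an assumption.

```python
import math


def distance_outside_bounds(local_x, local_y, half_x, half_y):
    """Euclidean distance from a local point to a centred rectangle (0 if inside)."""
    dx = max(abs(local_x) - half_x, 0.0)
    dy = max(abs(local_y) - half_y, 0.0)
    return math.hypot(dx, dy)


def within_tolerance(local_x, local_y, half_x, half_y, tolerance=5.0):
    """Accept a hit that lies inside the bounds or at most `tolerance` mm outside."""
    return distance_outside_bounds(local_x, local_y, half_x, half_y) <= tolerance


def match_surface(key, surfaces_by_id, verify, fallback):
    """Try the direct (volume, layer, sensitive) lookup first, then the fallback.

    `verify` and `fallback` are hypothetical callables standing in for the
    bounds verification and the KDTree search described above.
    """
    candidate = surfaces_by_id.get(key)
    if candidate is not None and verify(candidate):
        return candidate, "direct"
    return fallback(key), "fallback"
```

With a half-width of 56 mm, a hit 2 mm past the edge passes `within_tolerance`, while one 34 mm past the edge is rejected, which mirrors the "tens of mm outside" case mentioned in Fix 2.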

Adds a second PythonTrackFinderPerformanceWriter at the seed-track level (before KF fitting) using the seed_particle_matching created by addSeeding. Saves histograms_proto.pkl alongside histograms.pkl. The plots script overlays proto-track and KF+selector efficiency on slide 1, and adds a dedicated comparison slide (slide 1b) showing the seeding vs KF efficiency breakdown — directly answering where the 30% efficiency gap comes from. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Both proto-track and KF writers now use truth_seeded_particles as the denominator. This cleanly separates seeding layer coverage (~22% gap from geoSelection config not covering all detector layers) from KF quality. Proto-track efficiency should now be ~100% for seeded particles; KF shows the actual per-seeded-particle loss. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
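The effect of sharing one denominator can be illustrated with a toy calculation. This is only a sketch with made-up particle IDs, not output from the PR. The helper name `efficiency` is hypothetical.

```python
def efficiency(matched_ids, denominator_ids):
    """Fraction of denominator particles that also appear in matched_ids."""
    denominator = set(denominator_ids)
    if not denominator:
        return None  # efficiency is undefined without any reference particles
    return len(denominator & set(matched_ids)) / len(denominator)


# Toy numbers, invented purely for illustration.
all_particles = set(range(10))
seeded_particles = set(range(8))   # particles reachable by the seeding
proto_matched = set(range(8))      # every seeded particle has a proto-track
kf_matched = set(range(6))         # two seeded particles are lost in the fit

seeding_coverage = efficiency(seeded_particles, all_particles)
proto_efficiency = efficiency(proto_matched, seeded_particles)
kf_efficiency = efficiency(kf_matched, seeded_particles)
```

In this toy case the seeding coverage is 0.8, the proto-track efficiency relative to seeded particles is 1.0, and the KF efficiency relative to seeded particles is 0.75, so the coverage loss and the fit loss show up as separate numbers.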

- First PDF page is now a title slide with centred title and metadata.
- Efficiency and profile plots use step-function style (horizontal bar per bin + vertical error bars, no connecting lines) via xerr=half_bin_width and fmt="none". Matches standard HEP efficiency plot conventions.
- Title parameter added to make_plots() for reuse.
- Slides skill updated with both conventions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
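The step-function style described above could be sketched as follows. This is an assumption-laden illustration, not the plotting script from the PR. The helper names `step_points` and `draw_step_efficiency` are hypothetical, and the simple binomial error is a stand-in, since the commit message does not say which error estimate is used. The drawing helper expects a matplotlib `Axes` object and relies on `errorbar` with `fmt="none"` to suppress markers and connecting lines.

```python
import math


def step_points(bin_edges, passed, total):
    """Return bin centres, half-widths, efficiencies and errors for non-empty bins."""
    centres, half_widths, effs, errs = [], [], [], []
    for lo, hi, k, n in zip(bin_edges[:-1], bin_edges[1:], passed, total):
        if n == 0:
            continue  # skip empty bins, the efficiency is undefined there
        eff = k / n
        centres.append(0.5 * (lo + hi))
        half_widths.append(0.5 * (hi - lo))
        effs.append(eff)
        # Simple binomial error; an assumption, other estimators are common too.
        errs.append(math.sqrt(eff * (1.0 - eff) / n))
    return centres, half_widths, effs, errs


def draw_step_efficiency(ax, bin_edges, passed, total, label=None):
    """Draw one horizontal bar per bin plus vertical error bars on a matplotlib Axes."""
    x, xerr, y, yerr = step_points(bin_edges, passed, total)
    # xerr spans the bin width; fmt="none" draws error bars only.
    ax.errorbar(x, y, xerr=xerr, yerr=yerr, fmt="none", label=label)
```

Keeping the point calculation separate from the drawing call makes it possible to check the numbers without a plotting backend.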

github-actions · 2026-06-04T14:45:51Z

📊: Physics performance monitoring for `2636986`

Full contents

physmon summary

❗️: Downstream build failure

Key4hep (cc @acts-project/key4hep-contacts)

Integrates the official Arrow/Parquet base (PR acts-project#5410) from upstream/main. Our branch retains only the ColliderML reader on top:

- ArrowUtil: keep flatColumnUInt*/readFlatParquetFile (used by ColliderML)
- Parquet CMakeLists: keep ColliderMLInputConverter target block
- Python Arrow bindings: keep ColliderMLInputConverter pybind11 declarations
- root_file_hashes.txt: keep ColliderML test hashes + upstream strip_space_points rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

paulgessinger

Thanks!

To be completely honest, with the amount of massaging needed here, it's getting close to the point where I would just conclude that the current ColliderML data content is just not suitable to be used as an input here.

We can still go ahead with this, but I would make it a priority to try to augment the parquet file content to take the extra work out of this implementation, like encoding the local dimensions and a way to map the geometry ids.

/cc @murnanedaniel

paulgessinger · 2026-06-09T11:25:07Z

Is this the volume mapping? Should that be parquet? I think there's benefit in having this be ASCII, no? How large is it?

I had it in CSV and thought I'd put it in Parquet because it's a CSV with tens of thousands of lines of samples... but I can do CSV as well.

paulgessinger · 2026-06-09T11:40:31Z

+std::optional<double> sigmaFromSmearer(
+    const ActsFatras::SingleParameterSmearFunction<RandomEngine>& fn) {
+  if (const auto* g = fn.target<const Digitization::Gauss>()) {
+    return g->sigma;
+  }
+  if (const auto* g = fn.target<const Digitization::GaussTrunc>()) {
+    return g->sigma;
+  }
+  if (const auto* g = fn.target<const Digitization::GaussClipped>()) {
+    return g->sigma;
+  }
+  if (const auto* g = fn.target<const Digitization::Exact>()) {
+    return g->sigma;
+  }
+  return std::nullopt;
+}


If we need this information here, we should either provide it as an explicit input (JSON) or rethink the digitization to make this part of an interface (move away from an opaque function).

Yeah, so I think we only have not-nice solutions here. But I think the canonical source of truth for the subspace and sigmas is the digitization config files. I do not want to create a new file for this, and I think this is the best we can do to fill in the missing info as of now.

I thought about a sigma interface, but not all smearers have a sigma canonically...

paulgessinger · 2026-06-09T11:50:35Z

@@ -0,0 +1,257 @@
+#!/usr/bin/env python3


This is probably fairly slow in python? Might be worth writing in C++ instead.

Good point, but it's a do-once task...

paulgessinger · 2026-06-09T11:55:23Z

+    """
+    data_dir_env = os.environ.get("COLLIDERML_DATA_DIR")


Why don't we just produce a file on the fly with the correct schema? I'm not a huge fan of downloading this behind-the-scenes.

Hmm, but this thing is meant to read the ColliderML files as they are on the internet. So the second-best thing is to store a small sample of ColliderML on CERN resources...

Or we just generate it: the Arrow schema pretty much guarantees we're testing the right thing, and avoids both these pitfalls.

- Add `collidermlParticleSchema()` to ArrowUtil with the exact columns ColliderML provides; fix `colliderml_truth_tracking.py` which was using the ACTS `particleSchema()` (a superset) as the expected schema for ColliderML particle files.
- Add upfront schema validation in `ColliderMLInputConverter::execute()` so all downstream column accesses are guaranteed correct.
- Move `readFlatParquetFile` out of the public ArrowUtil API into an anonymous-namespace helper in ColliderMLInputConverter.cpp (its only caller); add the required Arrow/Parquet includes there.
- Replace the `getCol.operator()<T>()` lambda pattern in `loadColliderMLGeoIdMap` with a free template function `getFlatColumn<T>()`.
- Add `--hits-dir` CLI argument to `generate_colliderml_geo_map.py` to avoid a hardcoded dataset subdirectory path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ParquetReader already enforces the schema via expectedSchemas/targetSchema before the table reaches ColliderMLInputConverter, so the per-field checks in execute() are redundant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sonarqubecloud · 2026-06-09T17:17:17Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

paulgessinger added 30 commits May 12, 2026 16:59

feat: add parent id to existing SimParticle EDM

6ede096

feat: Make ScopedTimer threadsafe

eb9835a

particle docs fixes

37b3ec6

clang-format

06005ef

MERGE

59290dd

feat: Initial arrow/parquet support

32d6228

experiment with arrow object library

970f19d

clean up symbol visibility in wrapper target

d2080b8

make the isolated arrow absorption optional

f331c3b

add parquet option to full chain odd

3ce451b

updated particle arrow schema based on colliderml

99478a4

particle arrow converter writes parent id

c28a93c

use row indices as particle ids

5fbc174

add edm4hep to parquet conversion script

5801bf5

update output converters to produce proper nulls

299d69e

add sim hit output converter + connect to track hit_ids

74cd1ea

update detector resolver

c6b9587

add jobs arg to full chain odd

a163863

drop separate generated particles output

a824336

add plan for edm4hep input perf opt

bcd8891

clang-format

4bbc029

initial calo conversion

2de4baf

validated calo output

c421bc2

optimization for calo hits and averaging timers

8c72b08

some timing for edm4hepsiminput

fddfd47

add proper detector encoding, speedup

fed4480

restore pythia script (?)

ffbd6a8

use acts units more

fac62a5

dataset system shards files

b15e10c

address large number of propagation to perigee failures

5865ab1

benjaminhuth and others added 4 commits June 4, 2026 11:53

github-actions Bot added this to the next milestone Jun 4, 2026

github-actions Bot added Component - Examples Affects the Examples module Component - Plugins Affects one or more Plugins Component - Documentation Affects the documentation labels Jun 4, 2026

benjaminhuth and others added 2 commits June 5, 2026 15:28

update

1b7c564

github-actions Bot added Changes Performance and removed Component - Documentation Affects the documentation labels Jun 5, 2026

update unused files

134fa89

github-actions Bot added the Infrastructure Changes to build tools, continous integration, ... label Jun 8, 2026

benjaminhuth added 3 commits June 9, 2026 09:49

lint

a9391d2

Merge remote-tracking branch 'upstream/main' into feature/collider-ml-reader-arrow

5fce52a

remove unrelated stuff

494f445

benjaminhuth commented Jun 9, 2026

benjaminhuth added 2 commits June 9, 2026 12:24

update

b1cf5ed

restore odd.py

a3e6abd

benjaminhuth marked this pull request as ready for review June 9, 2026 10:28

benjaminhuth requested a review from AJPfleger as a code owner June 9, 2026 10:28

benjaminhuth requested a review from paulgessinger June 9, 2026 10:29

paulgessinger added the 🛑 blocked This item is blocked by another item label Jun 9, 2026

paulgessinger reviewed Jun 9, 2026

benjaminhuth and others added 2 commits June 9, 2026 17:24

remove redundant schema validation from execute()

2636986

benjaminhuth marked this pull request as draft June 10, 2026 06:55
