Skip to content

feat: ColliderML reader#5546

Draft
benjaminhuth wants to merge 100 commits into
acts-project:mainfrom
benjaminhuth:feature/collider-ml-reader-arrow
Draft

feat: ColliderML reader#5546
benjaminhuth wants to merge 100 commits into
acts-project:mainfrom
benjaminhuth:feature/collider-ml-reader-arrow

Conversation

@benjaminhuth

@benjaminhuth benjaminhuth commented Jun 4, 2026

Copy link
Copy Markdown
Member

benjaminhuth and others added 4 commits June 4, 2026 11:53
Root cause of g2l failures: the geo map KDTree matching was unreliable at
sensor boundaries because normal-distance alone cannot distinguish phi- or
z-adjacent sensors (all have near-zero perp distance for hits at the edge).

Fix 1 — generate_colliderml_geo_map.py: use direct (vol, layer, sensitive)
lookup as the primary strategy (CML surface_id == ACTS sensitive ID in ODD
Gen1). Verify each match by projecting the representative hit and checking
surface bounds. Fall back to KDTree only for surfaces where direct lookup
fails. Result: 18824/18824 matched, 0 KDTree fallbacks.

Fix 2 — ColliderMLInputConverter: project hits with unlimited perpendicular
tolerance (CML positions are 3D track positions inside sensor volume), then
validate local coords against surface bounds with 5 mm Euclidean tolerance.
5 mm accepts boundary incidence effects on large sensors (halfX ≈ 56 mm)
while still catching genuine wrong-surface assignments (tens of mm outside).

Fix 3 — Python/Core/src/Surfaces.cpp: add Surface.globalToLocal() binding
returning Optional[Vector2] with configurable tolerance, used by both the
geo map verification and the investigation scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a second PythonTrackFinderPerformanceWriter at the seed-track level
(before KF fitting) using the seed_particle_matching created by addSeeding.
Saves histograms_proto.pkl alongside histograms.pkl.

The plots script overlays proto-track and KF+selector efficiency on slide 1,
and adds a dedicated comparison slide (slide 1b) showing the seeding vs KF
efficiency breakdown — directly answering where the 30% efficiency gap comes
from.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both proto-track and KF writers now use truth_seeded_particles as the
denominator. This cleanly separates seeding layer coverage (~22% gap from
geoSelection config not covering all detector layers) from KF quality.
Proto-track efficiency should now be ~100% for seeded particles; KF shows
the actual per-seeded-particle loss.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- First PDF page is now a title slide with centred title and metadata.
- Efficiency and profile plots use step-function style (horizontal bar per
  bin + vertical error bars, no connecting lines) via xerr=half_bin_width
  and fmt="none". Matches standard HEP efficiency plot conventions.
- Title parameter added to make_plots() for reuse.
- Slides skill updated with both conventions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added this to the next milestone Jun 4, 2026
@github-actions github-actions Bot added Component - Examples Affects the Examples module Component - Plugins Affects one or more Plugins Component - Documentation Affects the documentation labels Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

📊: Physics performance monitoring for 2636986

Full contents

physmon summary

❗️: Downstream build failure

  • Key4hep (cc @acts-project/key4hep-contacts)

benjaminhuth and others added 2 commits June 5, 2026 15:28
Integrates the official Arrow/Parquet base (PR acts-project#5410) from upstream/main.
Our branch retains only the ColliderML reader on top:
- ArrowUtil: keep flatColumnUInt*/readFlatParquetFile (used by ColliderML)
- Parquet CMakeLists: keep ColliderMLInputConverter target block
- Python Arrow bindings: keep ColliderMLInputConverter pybind11 declarations
- root_file_hashes.txt: keep ColliderML test hashes + upstream strip_space_points rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added Changes Performance and removed Component - Documentation Affects the documentation labels Jun 5, 2026
@github-actions github-actions Bot added the Infrastructure Changes to build tools, continous integration, ... label Jun 8, 2026
Comment thread Examples/Io/Parquet/src/ColliderMLInputConverter.cpp Outdated
Comment thread Examples/Io/Arrow/src/ColliderMLInputConverter.cpp
Comment thread Examples/Io/Arrow/src/ColliderMLInputConverter.cpp
Comment thread Examples/Io/Parquet/CMakeLists.txt Outdated
Comment thread Examples/Scripts/Python/colliderml_performance_plots.py Outdated
Comment thread Plugins/Arrow/include/ActsPlugins/Arrow/ArrowUtil.hpp Outdated
Comment thread Examples/Io/Parquet/include/ActsExamples/Io/Parquet/ColliderMLInputConverter.hpp Outdated
Comment thread Examples/Io/Parquet/include/ActsExamples/Io/Parquet/ColliderMLInputConverter.hpp Outdated
@benjaminhuth benjaminhuth marked this pull request as ready for review June 9, 2026 10:28
@benjaminhuth benjaminhuth requested a review from AJPfleger as a code owner June 9, 2026 10:28
@paulgessinger paulgessinger added the 🛑 blocked This item is blocked by another item label Jun 9, 2026

@paulgessinger paulgessinger left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

To be completely honest, with the amount of massaging needed here, it's getting close to the point where I would just conclude that the current ColliderML data content is just not suitable to be used as an input here.

We can still go ahead with this, but I would make it a priority to try to augment the parquet file content to take the extra work out of this implementation, like encoding the local dimensions and a way to map the geometry ids.

/cc @murnanedaniel

@paulgessinger paulgessinger Jun 9, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the volume mapping? Should that be parquet? I think there's benefit in having this be ASCII, no? How large is it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it in CSV and thought I put it in parquet because its CSV with ten-thousands of lines of samples... but can do CSV as well

Comment thread Examples/Io/Arrow/src/ColliderMLInputConverter.cpp
Comment on lines +77 to +92
std::optional<double> sigmaFromSmearer(
const ActsFatras::SingleParameterSmearFunction<RandomEngine>& fn) {
if (const auto* g = fn.target<const Digitization::Gauss>()) {
return g->sigma;
}
if (const auto* g = fn.target<const Digitization::GaussTrunc>()) {
return g->sigma;
}
if (const auto* g = fn.target<const Digitization::GaussClipped>()) {
return g->sigma;
}
if (const auto* g = fn.target<const Digitization::Exact>()) {
return g->sigma;
}
return std::nullopt;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I we need this information here we should either provide it as an explicit input (json) or rethink the digitization to make this part of an interface (go away from an opaque function).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so I think we only have not-nice solutions here. But I think the canonical source of truth on subspace and sigmas are the digitization config files. I do not want to create a new file for this, and think this is the best we can do to fill in the missing info as of now.

I thought about a sigma interface, but not all smearers have a sigma canonically...

Comment thread Examples/Io/Arrow/src/ColliderMLInputConverter.cpp Outdated
Comment thread Examples/Scripts/Python/generate_colliderml_geo_map.py
@@ -0,0 +1,257 @@
#!/usr/bin/env python3

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably fairly slow in python? Might be worth writing in C++ instead.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, but its a do-once task...

Comment thread Plugins/Arrow/include/ActsPlugins/Arrow/ArrowUtil.hpp Outdated
Comment on lines +36 to +37
"""
data_dir_env = os.environ.get("COLLIDERML_DATA_DIR")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we just produce a file on the fly with the correct schema? I'm not a huge fan of downloading this behind-the-scenes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but this thing is meant to read in the ColliderML file as they are on the internet. So the second-best thing is to store a small sample of ColliderML on CERN ressources...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we just generate it: the Arrow schema pretty much guarantees we're testing the right thing, and avoids both these pitfalls.

benjaminhuth and others added 2 commits June 9, 2026 17:24
- Add `collidermlParticleSchema()` to ArrowUtil with the exact columns
  ColliderML provides; fix `colliderml_truth_tracking.py` which was
  using the ACTS `particleSchema()` (a superset) as the expected schema
  for ColliderML particle files.
- Add upfront schema validation in `ColliderMLInputConverter::execute()`
  so all downstream column accesses are guaranteed correct.
- Move `readFlatParquetFile` out of the public ArrowUtil API into an
  anonymous-namespace helper in ColliderMLInputConverter.cpp (its only
  caller); add the required Arrow/Parquet includes there.
- Replace the `getCol.operator()<T>()` lambda pattern in
  `loadColliderMLGeoIdMap` with a free template function
  `getFlatColumn<T>()`.
- Add `--hits-dir` CLI argument to `generate_colliderml_geo_map.py`
  to avoid a hardcoded dataset subdirectory path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ParquetReader already enforces the schema via expectedSchemas/targetSchema
before the table reaches ColliderMLInputConverter, so the per-field checks
in execute() are redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sonarqubecloud

sonarqubecloud Bot commented Jun 9, 2026

Copy link
Copy Markdown

@benjaminhuth benjaminhuth marked this pull request as draft June 10, 2026 06:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Changes Performance Component - Examples Affects the Examples module Component - Plugins Affects one or more Plugins Infrastructure Changes to build tools, continous integration, ... 🛑 blocked This item is blocked by another item

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants