feat: ColliderML reader #5546
Root cause of g2l failures: the geo map KDTree matching was unreliable at sensor boundaries because normal distance alone cannot distinguish phi- or z-adjacent sensors (all have near-zero perpendicular distance for hits at the edge).

Fix 1, generate_colliderml_geo_map.py: use direct (vol, layer, sensitive) lookup as the primary strategy (CML surface_id == ACTS sensitive ID in ODD Gen1). Verify each match by projecting the representative hit and checking surface bounds. Fall back to KDTree only for surfaces where direct lookup fails. Result: 18824/18824 matched, 0 KDTree fallbacks.

Fix 2, ColliderMLInputConverter: project hits with unlimited perpendicular tolerance (CML positions are 3D track positions inside the sensor volume), then validate local coords against surface bounds with 5 mm Euclidean tolerance. 5 mm accepts boundary incidence effects on large sensors (halfX ≈ 56 mm) while still catching genuine wrong-surface assignments (tens of mm outside).

Fix 3, Python/Core/src/Surfaces.cpp: add Surface.globalToLocal() binding returning Optional[Vector2] with configurable tolerance, used by both the geo map verification and the investigation scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
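The bounds check in Fix 2 could look roughly like the sketch below. It assumes a rectangular surface centred on the local origin, which will not hold for every ODD surface. The function names and the halfY value in the usage note are hypothetical, chosen for illustration, and are not taken from the converter.

```python
import math


def distance_outside_rectangle(lx, ly, half_x, half_y):
    """Euclidean distance (mm) from a local point to the rectangle
    |x| <= half_x, |y| <= half_y. Returns 0.0 for points inside."""
    dx = max(abs(lx) - half_x, 0.0)
    dy = max(abs(ly) - half_y, 0.0)
    return math.hypot(dx, dy)


def accept_hit(lx, ly, half_x, half_y, tolerance_mm=5.0):
    """Accept a projected hit if it lies within tolerance_mm of the bounds.

    Assumption: rectangular bounds only; other bound shapes would need
    their own distance function.
    """
    return distance_outside_rectangle(lx, ly, half_x, half_y) <= tolerance_mm
```

With halfX = 56 mm and an illustrative halfY = 20 mm, a hit at local x = 58 mm is 2 mm outside and is accepted, while a hit at x = 90 mm is 34 mm outside and is rejected as a likely wrong-surface assignment.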
Adds a second PythonTrackFinderPerformanceWriter at the seed-track level (before KF fitting) using the seed_particle_matching created by addSeeding. Saves histograms_proto.pkl alongside histograms.pkl. The plots script overlays proto-track and KF+selector efficiency on slide 1, and adds a dedicated comparison slide (slide 1b) showing the seeding vs KF efficiency breakdown — directly answering where the 30% efficiency gap comes from. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both proto-track and KF writers now use truth_seeded_particles as the denominator. This cleanly separates seeding layer coverage (~22% gap from geoSelection config not covering all detector layers) from KF quality. Proto-track efficiency should now be ~100% for seeded particles; KF shows the actual per-seeded-particle loss. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
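To illustrate why the shared denominator matters, here is a toy sketch. The particle IDs and the `efficiency` helper are made up for illustration and do not reflect the performance writer's actual implementation or the real numbers.

```python
def efficiency(reconstructed_ids, denominator_ids):
    """Fraction of denominator particles that were reconstructed.

    Returns None when the denominator is empty.
    """
    denominator = set(denominator_ids)
    if not denominator:
        return None
    return len(denominator & set(reconstructed_ids)) / len(denominator)


# Toy inputs (hypothetical): 10 truth particles, 8 of them seeded,
# 6 of the seeded ones surviving the KF + selector.
all_truth = range(1, 11)
truth_seeded = range(1, 9)
proto_matched = range(1, 9)
kf_matched = range(1, 7)

proto_eff = efficiency(proto_matched, truth_seeded)   # seeding stage
kf_eff = efficiency(kf_matched, truth_seeded)         # KF loss per seeded particle
kf_eff_all = efficiency(kf_matched, all_truth)        # mixes both effects
```

In this toy case the proto-track efficiency is 1.0 and the KF efficiency is 0.75 with respect to seeded particles, whereas dividing by all truth particles gives 0.6 and hides which stage lost the tracks.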
- First PDF page is now a title slide with centred title and metadata.
- Efficiency and profile plots use step-function style (horizontal bar per bin + vertical error bars, no connecting lines) via xerr=half_bin_width and fmt="none". Matches standard HEP efficiency plot conventions.
- Title parameter added to make_plots() for reuse.
- Slides skill updated with both conventions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
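A minimal sketch of the step-function style described above, assuming a matplotlib `Axes` object is passed in. The helper names are hypothetical, and the simple binomial error is an assumption; the plots script may compute uncertainties differently.

```python
import math


def step_points(bin_edges, passed, total):
    """Return (centers, half_widths, eff, err) for non-empty bins.

    Assumption: simple binomial error sqrt(p * (1 - p) / n) per bin.
    """
    centers, half_widths, eff, err = [], [], [], []
    for lo, hi, k, n in zip(bin_edges[:-1], bin_edges[1:], passed, total):
        if n == 0:
            continue  # skip empty bins instead of drawing a point at zero
        p = k / n
        centers.append(0.5 * (lo + hi))
        half_widths.append(0.5 * (hi - lo))
        eff.append(p)
        err.append(math.sqrt(p * (1.0 - p) / n))
    return centers, half_widths, eff, err


def draw_efficiency(ax, bin_edges, passed, total, label=None):
    """Draw one horizontal bar per bin plus vertical error bars.

    fmt="none" suppresses markers and connecting lines, so only the
    error bars (x: bin width, y: uncertainty) are drawn.
    """
    x, xerr, y, yerr = step_points(bin_edges, passed, total)
    ax.errorbar(x, y, xerr=xerr, yerr=yerr, fmt="none", label=label)
```

Keeping the bin arithmetic in `step_points` separate from the drawing call makes it possible to check the numbers without a plotting backend.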
Integrates the official Arrow/Parquet base (PR acts-project#5410) from upstream/main. Our branch retains only the ColliderML reader on top:

- ArrowUtil: keep flatColumnUInt*/readFlatParquetFile (used by ColliderML)
- Parquet CMakeLists: keep ColliderMLInputConverter target block
- Python Arrow bindings: keep ColliderMLInputConverter pybind11 declarations
- root_file_hashes.txt: keep ColliderML test hashes + upstream strip_space_points rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
paulgessinger left a comment
Thanks!
To be completely honest, with the amount of massaging needed here, it's getting close to the point where I would just conclude that the current ColliderML data content is just not suitable to be used as an input here.
We can still go ahead with this, but I would make it a priority to try to augment the parquet file content to take the extra work out of this implementation, like encoding the local dimensions and a way to map the geometry ids.
/cc @murnanedaniel
Is this the volume mapping? Should that be parquet? I think there's benefit in having this be ASCII, no? How large is it?
I had it in CSV and thought I'd put it in parquet because it's CSV with tens of thousands of lines of samples... but can do CSV as well
std::optional<double> sigmaFromSmearer(
    const ActsFatras::SingleParameterSmearFunction<RandomEngine>& fn) {
  if (const auto* g = fn.target<const Digitization::Gauss>()) {
    return g->sigma;
  }
  if (const auto* g = fn.target<const Digitization::GaussTrunc>()) {
    return g->sigma;
  }
  if (const auto* g = fn.target<const Digitization::GaussClipped>()) {
    return g->sigma;
  }
  if (const auto* g = fn.target<const Digitization::Exact>()) {
    return g->sigma;
  }
  return std::nullopt;
}
If we need this information here we should either provide it as an explicit input (json) or rethink the digitization to make this part of an interface (go away from an opaque function).
Yeah so I think we only have not-nice solutions here. But I think the canonical source of truth on subspace and sigmas is the digitization config files. I do not want to create a new file for this, and think this is the best we can do to fill in the missing info as of now.
I thought about a sigma interface, but not all smearers have a sigma canonically...
@@ -0,0 +1,257 @@
#!/usr/bin/env python3
This is probably fairly slow in python? Might be worth writing in C++ instead.
good point, but it's a do-once task...
"""
data_dir_env = os.environ.get("COLLIDERML_DATA_DIR")
Why don't we just produce a file on the fly with the correct schema? I'm not a huge fan of downloading this behind-the-scenes.
Hmm but this thing is meant to read in the ColliderML files as they are on the internet. So the second-best thing is to store a small sample of ColliderML on CERN resources...
Or we just generate it: the Arrow schema pretty much guarantees we're testing the right thing, and avoids both these pitfalls.
- Add `collidermlParticleSchema()` to ArrowUtil with the exact columns ColliderML provides; fix `colliderml_truth_tracking.py` which was using the ACTS `particleSchema()` (a superset) as the expected schema for ColliderML particle files.
- Add upfront schema validation in `ColliderMLInputConverter::execute()` so all downstream column accesses are guaranteed correct.
- Move `readFlatParquetFile` out of the public ArrowUtil API into an anonymous-namespace helper in ColliderMLInputConverter.cpp (its only caller); add the required Arrow/Parquet includes there.
- Replace the `getCol.operator()<T>()` lambda pattern in `loadColliderMLGeoIdMap` with a free template function `getFlatColumn<T>()`.
- Add `--hits-dir` CLI argument to `generate_colliderml_geo_map.py` to avoid a hardcoded dataset subdirectory path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ParquetReader already enforces the schema via expectedSchemas/targetSchema before the table reaches ColliderMLInputConverter, so the per-field checks in execute() are redundant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>