Coalesce routing-equivalent fragmented edges in OSM import (DATA-6259)#199
Conversation
Over-fragmented sources like TomTom Orbis split ways at every HD-map attribute change, so adjacent edges frequently carry bit-identical routing data and differ only in provenance (osmid + stable IDs). This adds an optional post-parse pass that merges runs of such edges across degree-2 connection nodes, clawing back the structural portion of the Orbis storage overhead without changing routing behavior. Gated by routing.coalesce_mergeable_edges (default false). When on, the import parses into a temporary RAM graph and writes a coalesced copy into the real graph (BaseGraph is append-only, so in-place merging is not possible). Node IDs are preserved 1:1; edge IDs are renumbered. Merge predicate (EdgeCoalescer): a degree-2 tower node whose two edges are storage-aligned through it, have distinct far endpoints, carry bit-identical flags outside the identity-EV mask, and share equal key values. Merged edges adopt a representative osmid (real OSM preferred over orbis-only) but mint a fresh segment index above the max seen for that osmid, so every merged edge gets a brand-new stable ID and stale IDs resolve to a clean miss rather than silently remapping to different geometry. Non-merged edges are copied untouched. Correctness: - Identity mask is derived from the live EncodingManager bit layout (IdentityEncodedValues), so it cannot drift from the real EV offsets. - Turn restrictions: via nodes are protected from being merged through (merging a from/to edge across a via node would erase the turn); restricted-way edges otherwise merge freely and their way-to-edge registrations are remapped through the merge. - Rings and balloon chains never collapse into loop edges; runs split before exceeding the storable distance bound. Also adds an EdgeIntAccess-based stable-ID read path so SEID-keyed custom-speed parsers read through the graph being written during import (the temporary graph when coalescing), not the not-yet-populated served graph. Tests cover the mask, merge/no-merge cases, orientation handling, osmid preference, fresh-index collision avoidance, via-node protection, ring splitting, and geometry/distance/KV preservation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Claude finished @bnaul's task in 5m 39s —— View job Review: Coalesce fragmented edges (PR #199)
Overall this is a careful, well-tested change. The merge predicate is sound, the storage-orientation invariant is the right way to make raw-flag equality meaningful, the bit-precise identity mask derived from the live Correctness — verified sound
Concurrency / thread-safety — no regressionThe coalescing pass runs single-threaded after Things to confirm during rollout (inline)
Nits (non-blocking): |
| private IntHashSet collectRestrictionViaNodes() { | ||
| IntHashSet viaNodes = new IntHashSet(); | ||
| for (ReaderRelation relation : restrictionRelations) { | ||
| int viaNode = relation.getTag("graphhopper:via_node", -1); | ||
| if (viaNode >= 0) | ||
| viaNodes.add(viaNode); | ||
| } | ||
| return viaNodes; | ||
| } |
There was a problem hiding this comment.
Only single-via-node restrictions are protected here. Via-way restrictions are added to restrictionRelations (in processRelation, the osmViaNode < 0 branch) but contribute no protected node, so the nodes where their from/via/to edges meet are eligible to be merged through.
In practice the junction node of a real turn restriction is almost always degree ≥ 3, which already blocks the merge — so this is likely benign. But two cases are worth confirming in the planned turn-restriction regression run:
- A restricted-way edge merging with a routing-identical non-restricted neighbor across a degree-2 node. The merged target edge then carries the restricted way's
restrictedWaysToEdgesMapregistration but spans beyond the original geometry, so the turn cost applies to the longer merged edge. - A from-edge and via-way edge that meet at a degree-2 node merging into one edge, after which both ways' registrations dedupe to the same target edge and the restriction may resolve degenerately.
If the regression shows neither occurs (because such nodes are always degree ≥ 3), a one-line comment stating that assumption would be worth adding. Otherwise, consider also protecting the endpoint nodes of via-way restriction members.
| private void copyNodes() { | ||
| NodeAccess sourceNodes = source.getNodeAccess(); | ||
| NodeAccess targetNodes = target.getNodeAccess(); | ||
| boolean is3D = sourceNodes.is3D(); | ||
| for (int node = 0; node < nodeCount; node++) { | ||
| if (is3D) | ||
| targetNodes.setNode(node, sourceNodes.getLat(node), sourceNodes.getLon(node), sourceNodes.getEle(node)); | ||
| else | ||
| targetNodes.setNode(node, sourceNodes.getLat(node), sourceNodes.getLon(node)); | ||
| } | ||
| } |
There was a problem hiding this comment.
Nodes are copied 1:1, so a bypassed degree-2 connection node survives in the target graph as an orphan tower node (no incident edges; its coordinate lives on as a pillar of the merged edge). That's fine for snapping (the location index is edge-based and will snap nearby queries onto the merged edge), but worth confirming for any consumer that resolves a request to a specific GH/OSM node id — e.g. a via-point expressed as an OSM node that happened to be a connection node will now map to a node with no edges and become unroutable, whereas before coalescing it was on a real edge.
The PR body says GH-node→OSM-node mappings "stay valid" because ids don't shift, which is true for id stability, but the routability of those bypassed nodes changes. Please make sure the gRPC route-equivalence regression exercises via-node requests that land on former connection nodes.
| tempGraph = new BaseGraph.Builder(encodingManager) | ||
| .setDir(new RAMDirectory()) | ||
| .set3D(nodeAccess.is3D()) | ||
| .withTurnCosts(false) | ||
| .setSegmentSize(baseGraph.getSegmentSize()) | ||
| .create(); | ||
| writeGraph = tempGraph; | ||
| writeEdgeIntAccess = tempGraph.createEdgeIntAccess(); | ||
| writeNodeAccess = tempGraph.getNodeAccess(); | ||
| LOGGER.info("Edge coalescing enabled: parsing into a temporary graph before merging"); |
There was a problem hiding this comment.
With coalescing on, the full parse lands in tempGraph (RAM) and is then copied into baseGraph, so peak import memory is roughly 2× the edge/geometry storage until tempGraph.close() runs in coalesceMergeableEdges(). For mini_nor_cal that's a non-issue, but for full-size Orbis imports this could be the binding constraint — the listed "heap delta" validation should specifically capture peak (not just steady-state) RSS. No code change needed; just flagging so the rollout sizing accounts for it.
Drives real beatty.osm and a synthetic 3-way-chain import through the full GraphHopperManaged pipeline (production encoded values and tag parsers) with the flag off vs on, asserting: - the synthetic chain collapses 3 fragments -> 1 edge; - coalescing never increases edge count and preserves the node set; - routes are byte-identical (distance + time) between the two builds. On beatty this exercises 7 real merges and 40 route comparisons, all identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rounding Adds an ORBIS_PBF-gated test method that imports a real TomTom Orbis state extract off vs on (skipped when the env var is unset, so CI is unaffected). On the US-DC release-26180 extract this coalesces 243,116 -> 214,566 edges (11.7%, 22,907 merged edges) with the node set preserved and all 54 sampled routes equivalent. Route-equivalence now allows a small travel-time tolerance: GraphHopper rounds time to integer milliseconds per edge, so summing N fragments' rounded times differs from one rounded time over the merged edge by single-digit ms (observed 9 ms on a ~25 min route). Distance stays the exact geometric invariant (0.5 m). Shared assertRouteEquivalent() helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reports on-disk graph dir size (total + top files) for the Orbis off/on import, and adds an env-gated characterizeGraph diagnostic (GRAPH_PBF, GRAPH_COALESCE) that dumps node/edge counts, tower-node degree distribution, shape-point density, and file sizes. Used to decompose where an Orbis graph differs from OSM (the difference is degree-2 connection-node fragmentation; junction counts and geometry match OSM, and coalescing brings the base graph to OSM parity). Both are skipped in normal CI (Assumptions on env vars). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refactors isLinkMergeable into classifyLink, which returns a LinkOutcome enum (MERGEABLE plus one bucket per rejection gate, in early-return order). run() tallies the outcomes into a LinkClassificationStats struct exposed on Result; CustomOsmReader logs the breakdown alongside its existing edge-count summary so the diagnostic shows up in every build's INFO log. Used to decompose why a coalescing pass is reducing edges less than expected: each rejection bucket sets the upper bound on additional merges recoverable by relaxing that gate. On US-DC (orbis): of 66,314 degree-2 candidates, 43.1% merged, 40.2% blocked by excluded_edge (barrier-edge proximity, which is dense in DC and structurally unmergeable), 8.6% by flag_mismatch, 6.1% orientation_mismatch. On mini_nor_cal: 1,089,421 candidates, 59.9% merged, 23.0% excluded_edge, 8.8% flag_mismatch, 4.2% orientation. Barriers are not an orbis-specific problem — OSM-DC produced ~the same excluded_edge count. No behavior change. Existing EdgeCoalescerTest suite (15 tests) passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EdgeCoalescer used to copy all source nodes to the target with IDs preserved 1:1, leaving connection nodes whose two incident edges were merged into one as 'dead' tower nodes — referenced by no target edge but still consuming a row in the target's nodes table, each per-CH nodes_ch_<profile> file, and the GH-node-to-OSM-node map. On orbis mini_nor_cal that's ~650K dead nodes (one per merge), and the 10 production CH profiles each carry their own per-node entry. Refactor run() to plan-first: walk every chain and split it into runs before any edge is written, accumulating a bypassedNode bitmap for the interior nodes of every multi-edge run. Compute a dense source-to-target node remap (sourceToTargetNode), copy only surviving nodes into the target with consecutive IDs starting at 0, then emit edges through the remap so base/adj nodes resolve to the new dense IDs. Result exposes sourceToTargetNode so the caller can rewrite its own node-keyed bookkeeping. CustomOsmReader uses it to: - rewrite the graphhopper:via_node tag on each restriction relation (protected via nodes are never bypassed, so the remap is total); - rebuild ghNodeIdToOsmNodeIdMap into the target's ID space, dropping entries for bypassed source nodes. Local single-profile validation on mini_nor_cal orbis cutout: 4.27M edges -> 3.62M (15.2% reduction, unchanged from pre-compaction), 3.38M nodes -> 2.73M (650K compacted, exactly matching the 650K eliminated edges), on-disk graph dir 443.1 -> 385.1 MiB (13.1% smaller, 58 MB saved -> +12 MB on top of pre-compaction's 46 MB). The compaction delta multiplies through the 10 production CH profiles on the deployed config — nodes_ch_<profile> alone accounts for ~52 MB of that scaling, before CH shortcut downstream effects. Tests: all 15 EdgeCoalescerTest cases pass after updating hardcoded node-ID assertions to remap through Result.sourceToTargetNode. EdgeCoalescingImportTest (beatty + 3-way chain) likewise updated to assert renumbering correctness instead of 1:1 preservation; 58 routes through the orbis cutout remain bit-identical between pre- and post- compaction builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Adds an optional post-parse pass that merges runs of routing-equivalent edges across degree-2 connection nodes during OSM import. Over-fragmented sources like TomTom Orbis split ways at every HD-map attribute change (curvature, grade, surface variant…), producing many adjacent edges that carry bit-identical routing data and differ only in provenance (
osmid+ stable IDs). Per the DATA-6256 calibration, ~310 MB of the ~360 MB Orbis-vs-OSM size gap on mini_nor_cal is structural fragmentation; this targets that, amplified through CH copies.Gated by
routing.coalesce_mergeable_edges(defaultfalse). With the flag off the import path is byte-for-byte unchanged.How
BaseGraphis append-only (no edge removal), so the merge can't happen in place. When the flag is on, the import parses into a temporary RAM graph andEdgeCoalescerwrites a coalesced copy into the real graph. Node IDs are preserved 1:1 (so via-node / GH-node→OSM-node mappings stay valid); edge IDs are renumbered densely.Merge predicate — two edges at a tower node
vmerge iff:degree(v) == 2andvis not a protected via node;vstorage-aligned (one's stored adj isv, the other's stored base isv) — this is what makes raw-flag equality meaningful and is the enforced form of the issue's "MVP skips mixed-orientation" carve-out;Maximal mergeable runs are found by a single forward+backward chain walk.
Notable decisions / deviations from the issue's pseudocode
BaseGraphbeing append-only.osmid(real OSM preferred over orbis-only, then min osmid/segment-index) but mint a fresh segment index above the max seen for that osmid. Every merged edge gets a brand-new stable ID; all source-fragment IDs retire, so a stale SEID resolves to a clean miss rather than silently pointing at a longer, different merged geometry. Non-merged edges are copied untouched (exact stable IDs preserved).EncodingManager(IdentityEncodedValues) by writing all-ones per identity EV into a scratch flags array — can't drift from the real bit layout.StableEdgeIdManager, which resolved through the served base graph; during a coalescing import that graph isn't populated yet. Added anEdgeIntAccess-based overload and switched the 5 web-bundle parser call sites to it.Tests
EdgeCoalescerTest(15) +IdentityEncodedValuesTest(2): mask coverage, 2-/3-fragment merges, flag/KV mismatch and degree-3 → no-merge, aligned vs anti-aligned one-ways and two-ways, osmid preference, fresh-index collision avoidance, via-node protection, ring split into two arcs, balloon chain, geometry/distance/KV/coordinate preservation.Validation still to do (per issue rollout)
Build mini_nor_cal/2025_Q4 with the flag on; compare graph size vs the Orbis (2.10 GiB) / DATA-6258 (2.05 GiB) / OSM (1.74 GiB) baselines; gRPC route-equivalence + stable-id-continuity + turn-restriction regression checks; heap delta.
Base branch
Targets
tomtom-orbis-osmid-64bit(notoriginal-direction) — it depends on the splitosmid/osmid_highEVs from #194.🤖 Generated with Claude Code