Skip to content

Compute stable edge IDs on demand and stop storing them (DATA-6257)#200

Open
bnaul wants to merge 6 commits into
brett/data-6259-coalesce-orbis-fragmented-edges-in-graphhopper-importfrom
brett/data-6257-compute-stable-edge-ids-on-demand-instead-of-storing-them
Open

Compute stable edge IDs on demand and stop storing them (DATA-6257)#200
bnaul wants to merge 6 commits into
brett/data-6259-coalesce-orbis-fragmented-edges-in-graphhopper-importfrom
brett/data-6257-compute-stable-edge-ids-on-demand-instead-of-storing-them

Conversation

@bnaul

@bnaul bnaul commented Jun 5, 2026

Copy link
Copy Markdown

Compute stable edge IDs on demand; stop storing them per edge

Implements DATA-6257 (parent DATA-6256, MEMORY_REVIEW.md rec #1) in full. The ticket's staged plan (Phase 1 shadow-verify behind a flag, then Phase 2 format break) is collapsed into this one PR — we validate off-branch with real test runs rather than running a production shadow window.

The stable edge ID is a pure function of osmWayId + segmentIndex (+ direction) (StableIdEncodedValues.calculateStableEdgeId). The way ID is already stored split across osmid/osmid_high, and the segment index lives in gh_edge_id_to_segment_index — so the 16 stable_id_byte_* / reverse_stable_id_byte_* encoded values (16 B/edge, amplified through CH shortcut copies; the doc estimates 1.5–5 GB steady-state on a 60 GB router) are redundant. This PR derives the ID at every read site and stops storing the bytes.

This is a graph format change — affected graphs must be re-imported to reclaim the space.

What changed

Compute path

  • StableIdEncodedValues.computeStableId(...) — from an EdgeIteratorState (serving/export) or from an edgeId + EdgeIntAccess (import); byte-identical to the old stored value.
  • Read sites now always compute: the gRPC StreetPath.stable_edge_ids path-details builder and the BQ StreetEdgeExporter. The segment index reaches the serving hot path via OsmHelper, wired lazily through PathDetailsBuilderFactoryWithStableId (OsmHelperProvider).
  • Virtual edges: routed paths include virtual edges at snapped endpoints (id ≥ base edge count); a virtual edge mirrors its original edge's encoded values, but its segment index must be looked up under the original edge id via VirtualEdgeIteratorState.getOriginalEdgeKey() (this fork's GH 8.0 has no EdgeIteratorState.isVirtual(), so detection is instanceof).

Stop storing

  • EncodedValueFactoryWithStableId no longer registers the 16 byte EVs; GraphHopperManaged drops them from the encoded-values string.
  • Import no longer stamps: CustomOsmReader drops the stamping call and EdgeCoalescer no longer re-stamps merged edges (it still mints their fresh segment index).
  • StableIdEncodedValues loses the stored getStableId/setStableId byte paths; IdentityEncodedValues drops the byte names from the EdgeCoalescer routing-equivalence mask (now just the two osmid halves).

Import-time SEID custom speeds

  • The custom-speeds tag parsers compute each edge's stable ID on demand during import. The segment index of the edge being parsed reaches them through a new ImportSegmentIndexLookup (pointed at the reader's in-progress map; the map write was moved before handleWayTags), threaded into StableEdgeIdManager via ReplicaVehicleTagParserFactory.

Format compatibility

Consistent with the existing convention (cf. the osmid_high split, which also changed the EV set without a version constant), there is no separate graph-version constant in the fork; the format change is realized by a coordinated re-import. GraphHopper's storage is self-describing (the encoding manager is persisted with the graph), so an old graph is not silently misread — its stored stable-id bytes are simply ignored by the new code until the graph is rebuilt. Coordinate the re-import with the road_network_version bump.

Tests

Runnable locally (mvn -o -pl replica-common,web-bundle test — green):

  • StableIdComputeTestcomputeStableId == calculateStableEdgeId across OSM ids (incl. > 2^31, exercising osmid_high), segments, both directions.
  • ComputedStableIdImportTest — imports beatty.osm (coalescing off and on); asserts the byte EVs are gone, every parsed edge has a positive segment index and a derivable ID both directions, and every stable_edge_ids value the serving path returns is one of the graph's directly-computed IDs (this fails if the virtual-edge segment-index mapping regresses). 768 / 754 edge-pairs and 554 / 552 served IDs verified.
  • SeidCustomSpeedImportTest — end-to-end proof of the trickiest path: imports beatty, writes a stable_edge_id-keyed closure for a real edge, re-imports under a custom-speeds car vehicle, and asserts the closure removed access on exactly that edge/direction (and left the reverse + a control edge alone). This exercises the full import-time seam — parse → custom-speeds access parser → getStableEdgeIdForEdgecomputeStableId via ImportSegmentIndexLookup → match — so it fails if the segment-index threading regresses.
  • EdgeCoalescerTest, IdentityEncodedValuesTest updated for the compute API.

Performance: computing on the fly costs ~100 ns/path-edge more than the old byte read (farmHash vs a memory read), and path details are built once over the returned path (not in the search loop), so even a ~200-edge route adds ~20 µs — negligible against CH routing, and the point is the GBs of memory it buys back.

CI is green (run 27230547590 on c49597be0): the full web + grpc build and unit suite pass — including the web-module SEID custom-speeds integration tests (CustomSpeedsUtilsTest, StableEdgeIdManagerTest) — so the edits to those modules (which can't be compiled in the dev sandbox: no generated protobuf, webgrpc) are confirmed. An earlier red was a pre-existing test-fixture bug (1-int scratch IntsRef), fixed here.

📋 Remaining real-world step (not a blocker for review): the actual memory-savings measurement — rebuild a representative graph with the deployed CH-profile config and diff graphhopper_network/ size + loaded RSS, old vs new format. The in-repo work confirms the per-edge term (edge-flags row 8→4 ints = 16 B/edge on a car-only graph) but the headline 1.5–5 GB is CH-amplified and needs a production-scale rebuild. Hand-off brief: plans/data-6257_memory_validation.md.

Base branch

Stacked on brett/data-6259-... (edge coalescing), which carries the 64-bit osmid_high split the computation needs. Per DATA-6256, 6259 lands first; retarget if you'd rather it sit directly on tomtom-orbis-osmid-64bit.

Rollout

Merge → rebuild affected graphs (orbis 2026_Q2/USA + regional_transportation_network regions) → bump road_network_version. No production shadow window (phases combined).

🤖 Generated with Claude Code

…57 Phase 1)

Phase 1 of removing the per-edge stable_id_byte_* / reverse_stable_id_byte_*
encoded values: prove the stable edge ID can be derived on demand from
osmid + osmid_high + segment index at every read site, and shadow-verify that
computed == stored across real traffic before the Phase 2 format break. No
format change here — the 16 bytes/edge are still written and kept on disk.

- StableIdEncodedValues.computeStableId(edge, reverse, segmentIndex): pure
  recomputation, byte-identical to getStableId(...).
- stable_id.use_computed config flag (default false), plumbed via
  GraphHopperManaged. When on, read sites run on the computed path and
  cross-check against the stored bytes; mismatches are logged and tallied by
  StableIdVerification (never thrown — verification must not disrupt serving).
- Switched read sites: the gRPC StreetPath.stable_edge_ids path-details builder
  and the BQ StreetEdgeExporter. The segment index reaches the serving hot path
  via OsmHelper, wired lazily through PathDetailsBuilderFactoryWithStableId
  (OsmHelperProvider). Routed paths include virtual edges at snapped endpoints,
  so the builder maps a virtual edge back to its original base-graph edge id
  (VirtualEdgeIteratorState.getOriginalEdgeKey) for the segment-index lookup.
- RouterServer exports routers.stable_id_computed_{comparisons,mismatches}
  StatsD gauges for the shadow window.

Deferred to Phase 2 (out of scope): removing the stable_id_byte_* EV
registration, removing the import-time stamping in StableEdgeIdManager, and the
graph format version bump. The import-time custom-speeds SEID seam also stays on
the stored bytes for now — converting it requires threading the segment index
through the parser path (mandatory only once stamping is removed), and verifying
it at import is tautological since the bytes are stamped by the same computation
moments earlier.

Tests: StableIdComputeTest (computed == stored across OSM ids incl. > 2^31,
segments, both directions); ComputedStableIdImportTest (beatty import via
GraphHopperManaged with the flag on, computed == stored for every edge both
directions with coalescing off and on, plus a serving-path route check with
zero mismatches).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@claude

claude Bot commented Jun 5, 2026

Copy link
Copy Markdown

Claude finished @bnaul's task in 3m 31s —— View job


Review: Compute stable edge IDs on demand (DATA-6257 Phase 1)

  • Read changed files and diff vs base branch
  • Review correctness / logic
  • Review concurrency & thread-safety (multi-threaded gRPC server)
  • Review hot-path performance
  • Review edge-case & error handling
  • Post review feedback

Overall this is a clean, well-scoped Phase 1. The compute-on-demand path is genuinely byte-identical to the stored path (both funnel through calculateStableEdgeIdLong.toUnsignedString(Longs.fromByteArray(...))), the virtual-edge remap via getOriginalEdgeKey() is the right fix for the segment-index lookup, the lazy OsmHelper resolution is correctly double-checked-locked on a volatile, and the tests (unit equivalence + full beatty import with coalescing on/off + serving-path routing) cover the important cases. Default-off keeps it safe to land. The two edits in web/grpc look correct against the existing call patterns — agree they're worth a CI confirm since they weren't machine-compiled.

I left three inline comments. Two are worth resolving before the shadow rollout:

1. Unbounded WARN logging on mismatch (StableIdVerification.verifyComputed) — main concern

The doc says verification "must never disrupt serving," but the WARN is unthrottled: a systematic divergence logs once per (edge, direction) comparison. At production QPS × many edges per route, that's potentially thousands of lines/sec — which can pin a core, fill disk, and degrade the path it's shadowing. The high-mismatch case is exactly the one to guard. Suggested a cheap rate-limit keyed on the existing counter (keeps the count exact). Inline →

2. verifyComputed returns computed, not stored

This makes it a substituting verification, not a true shadow: during the Phase 1 window any divergence ships the (possibly wrong) computed ID to gRPC clients and into the BQ export — corrupting output, not just logging. Equality is fully captured by the counters regardless of which string is returned, so returning stored would prove computed == stored across real traffic while guaranteeing output is unchanged until Phase 2 deliberately flips it. At minimum, call out in the rollout plan that a non-zero routers.stable_id_computed_mismatches means you're serving wrong IDs that week, not just observing. Inline →

3. Misconfig fails lazily at request time, not boot (minor)

Flag-on + non-OSM/GTFS graph throws IllegalStateException on the first (and every subsequent) routing request rather than at startup. Both inputs are known at construction — consider fail-fast validation when the flag is on. Low priority since production graphs are always OSM/GTFS. Inline →

Non-blocking notes

  • Per-edge cost while enabled: the compute path adds a String.format + farmhash + a DataAccess lookup per edge on the stable_edge_ids serving hot path. Fine given default-off and a one-router shadow window, but it ~doubles the per-edge stable-ID work while on.
  • Thread-safety: getSegmentIndexForGhEdge does an absolute DataAccess.getInt with no mutation — safe under concurrent reads, same as the rest of the graph during routing. Path-detail builders are per-request, so prevEdgeId/currentValue aren't shared. No new concurrency issues.
  • Metrics modeling: comparisons/mismatches are cumulative counters reported via client.gauge. Works for "flat-zero mismatches," but comparisons grows unbounded as a gauge — a counter/rate would read more naturally on the dashboard. Cosmetic.
    • branch brett/data-6257-compute-stable-edge-ids-on-demand-instead-of-storing-them

Comment on lines +42 to +46
if (!computed.equals(stored)) {
long total = mismatches.incrementAndGet();
LOG.warn("Computed stable edge ID != stored value (total mismatches={}): computed={} stored={} [{}]",
total, computed, stored, context.get());
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded WARN logging on mismatch can itself disrupt serving. The doc comment promises verification "must never disrupt serving," but a systematic divergence (e.g. one edge class, or a virtual-edge handling gap) would log one WARN per (edge, direction) comparison — at production QPS with many edges per route, that's potentially thousands of log lines/sec, which can pin a core on logging, fill disk, and degrade the very serving path it's shadowing. The whole point of the flag is to catch mismatches in real traffic, so the failure mode where it fires a lot is the one most worth guarding.

Recommend rate-limiting/sampling the log (the AtomicLong total already gives you a cheap throttle key) while keeping the counter exact, e.g.:

Suggested change
if (!computed.equals(stored)) {
long total = mismatches.incrementAndGet();
LOG.warn("Computed stable edge ID != stored value (total mismatches={}): computed={} stored={} [{}]",
total, computed, stored, context.get());
}
if (!computed.equals(stored)) {
long total = mismatches.incrementAndGet();
// Rate-limit: a systematic divergence would otherwise log once per comparison on the
// serving hot path. The counter stays exact; the metric is the real signal.
if (total <= 100 || Long.bitCount(total) == 1) {
LOG.warn("Computed stable edge ID != stored value (total mismatches={}): computed={} stored={} [{}]",
total, computed, stored, context.get());
}
}

LOG.warn("Computed stable edge ID != stored value (total mismatches={}): computed={} stored={} [{}]",
total, computed, stored, context.get());
}
return computed;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning computed (not stored) means a mismatch ships the possibly-wrong value to clients/exports. This is a deliberate choice ("the path Phase 2 will keep"), but it makes this a substituting verification rather than a true shadow: during the Phase 1 window, any divergence on the serving path returns the computed stable_edge_ids to gRPC clients, and in StreetEdgeExporter writes the computed ID into the BQ export — so a non-zero mismatch corrupts output, not just a log line.

Since equality is fully captured by the counters regardless of which string you return, returning stored here would still prove computed == stored across real traffic while guaranteeing output is unchanged until Phase 2 deliberately flips it. Worth at least calling out the trade-off explicitly in the rollout plan (one Places router for a week) — if routers.stable_id_computed_mismatches goes non-zero, you're serving the wrong IDs for that window, not just observing.

Comment on lines +84 to +91
Supplier<OsmHelper> osmHelperSupplier = () -> {
if (graphHopper instanceof OsmHelperProvider) {
return ((OsmHelperProvider) graphHopper).getOsmHelper();
}
throw new IllegalStateException(StableIdEncodedValues.USE_COMPUTED_CONFIG_KEY
+ " requires a graph that exposes a segment index (OSM/GTFS), but got "
+ graphHopper.getClass().getName());
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: this IllegalStateException for a misconfigured graph type (flag on + non-OSM/GTFS graph) only fires lazily on the first routing request that asks for stable_edge_ids — and via resolveOsmHelper() it re-throws on every subsequent request (the volatile osmHelper stays null). So a misconfig surfaces as a stream of failed requests rather than a startup failure. Since useComputedStableId and graphHopper are both known here at construction time, consider fail-fast: validate graphHopper instanceof OsmHelperProvider now (when the flag is on) so a bad config dies at boot, not in the serving path. Low priority — production graphs are always OSM/GTFS.

Completes the compute-on-demand migration begun in the previous commit and
removes the per-edge stored bytes. The stable edge ID is a pure function of
osmid + osmid_high + segment index, so the 16 stable_id_byte_* /
reverse_stable_id_byte_* encoded values (16 B/edge, amplified through CH
shortcut copies) are no longer registered, stamped, or persisted. This is a
graph format change: graphs must be re-imported to drop the bytes.

- EncodedValueFactoryWithStableId no longer registers the 16 byte EVs, and
  GraphHopperManaged drops them from the encoded-values string.
- StableIdEncodedValues loses the stored getStableId/setStableId byte paths;
  it now only computes (computeStableId from an EdgeIteratorState, or from an
  edgeId + EdgeIntAccess during import).
- Import no longer stamps: CustomOsmReader drops the StableEdgeIdManager
  stamping call and EdgeCoalescer no longer re-stamps merged edges (it still
  mints their fresh segment index). The segment index is recorded before
  handleWayTags so the SEID custom-speeds parsers can compute on demand.
- The import-time SEID seam reads the segment index of the edge being parsed
  through a new ImportSegmentIndexLookup (pointed at the reader's in-progress
  map), threaded into StableEdgeIdManager via ReplicaVehicleTagParserFactory.
- The read sites now always compute: StableIdPathDetailsBuilder (gRPC) and
  StreetEdgeExporter; the Phase-1 verification flag, StableIdVerification, and
  the RouterServer shadow gauges are removed. The path-details builder keeps
  mapping virtual edges to their original edge id for the segment-index lookup.
- IdentityEncodedValues drops the byte names from the routing-equivalence mask
  (the EdgeCoalescer identity set is now just the two osmid halves).

Format compatibility follows the existing convention (cf. the osmid_high
split): there is no separate graph-version constant in the fork; the
encoded-value change is realized by a coordinated re-import. GraphHopper's
storage is self-describing, so an old graph is not silently misread — its
stored bytes are simply ignored until it is rebuilt.

Tests: StableIdComputeTest and a rewritten ComputedStableIdImportTest (beatty
import with coalescing off and on; every edge's ID is derivable, and every
stable_edge_id returned by the serving path is one of the graph's
directly-computed IDs — which fails if the virtual-edge mapping regresses).
EdgeCoalescerTest / IdentityEncodedValuesTest / StableEdgeIdManagerTest /
CustomSpeedsUtilsTest updated for the compute API.

Note: the web and grpc modules could not be compiled in the dev sandbox (no
generated protobuf sources); their edits are mechanical but want CI
confirmation, and the SEID custom-speeds integration tests (web module) should
be validated on a real build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bnaul bnaul changed the title Compute stable edge IDs on demand behind a verification flag (DATA-6257 Phase 1) Compute stable edge IDs on demand and stop storing them (DATA-6257) Jun 5, 2026
bnaul and others added 4 commits June 5, 2026 13:47
Proves the import-time SEID custom-speeds seam works with stable edge IDs
computed on demand (no stored bytes): imports beatty, writes a stable_edge_id-
keyed closure for a real edge, re-imports under a custom-speeds car vehicle, and
asserts the closure removed access on exactly that edge/direction. Exercises the
full chain (parse -> handleWayTags -> custom-speeds access parser ->
getStableEdgeIdForEdge -> computeStableId via ImportSegmentIndexLookup -> match).
Runs in web-bundle, so it covers the seam without the web-module's micro_nor_cal
integration fixtures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
testRemoveAccessForSeidUserClosedRoads built a 1-int scratch IntsRef and asked
the seam to read the edge's osmid out of it, which overran the array
(ArrayIndexOutOfBounds) — a pre-existing failure on the base branch (the old
stored-bytes path read stable_id_byte_* at an even higher index out of the same
1-int scratch). Size the scratch to the encoding manager's flag width and seed
the edge's osmid into it, so the on-demand stable-id computation reads the real
osmid and matches the closure key. Production is unaffected (the import path
uses the full-width edge access); this only repairs the unit-test fixture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refactors isLinkMergeable into classifyLink, which returns a LinkOutcome
enum (MERGEABLE plus one bucket per rejection gate, in early-return
order). run() tallies the outcomes into a LinkClassificationStats struct
exposed on Result; CustomOsmReader logs the breakdown alongside its
existing edge-count summary so the diagnostic shows up in every build's
INFO log.

Used to decompose why a coalescing pass is reducing edges less than
expected: each rejection bucket sets the upper bound on additional
merges recoverable by relaxing that gate. On US-DC (orbis): of 66,314
degree-2 candidates, 43.1% merged, 40.2% blocked by excluded_edge
(barrier-edge proximity, which is dense in DC and structurally
unmergeable), 8.6% by flag_mismatch, 6.1% orientation_mismatch. On
mini_nor_cal: 1,089,421 candidates, 59.9% merged, 23.0% excluded_edge,
8.8% flag_mismatch, 4.2% orientation. Barriers are not an orbis-specific
problem — OSM-DC produced ~the same excluded_edge count.

No behavior change. Existing EdgeCoalescerTest suite (15 tests) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EdgeCoalescer used to copy all source nodes to the target with IDs
preserved 1:1, leaving connection nodes whose two incident edges were
merged into one as 'dead' tower nodes — referenced by no target edge
but still consuming a row in the target's nodes table, each per-CH
nodes_ch_<profile> file, and the GH-node-to-OSM-node map. On orbis
mini_nor_cal that's ~650K dead nodes (one per merge), and the 10
production CH profiles each carry their own per-node entry.

Refactor run() to plan-first: walk every chain and split it into runs
before any edge is written, accumulating a bypassedNode bitmap for the
interior nodes of every multi-edge run. Compute a dense source-to-target
node remap (sourceToTargetNode), copy only surviving nodes into the
target with consecutive IDs starting at 0, then emit edges through the
remap so base/adj nodes resolve to the new dense IDs.

Result exposes sourceToTargetNode so the caller can rewrite its own
node-keyed bookkeeping. CustomOsmReader uses it to:
- rewrite the graphhopper:via_node tag on each restriction relation
  (protected via nodes are never bypassed, so the remap is total);
- rebuild ghNodeIdToOsmNodeIdMap into the target's ID space, dropping
  entries for bypassed source nodes.

Local single-profile validation on mini_nor_cal orbis cutout:
4.27M edges -> 3.62M (15.2% reduction, unchanged from pre-compaction),
3.38M nodes -> 2.73M (650K compacted, exactly matching the 650K
eliminated edges), on-disk graph dir 443.1 -> 385.1 MiB (13.1% smaller,
58 MB saved -> +12 MB on top of pre-compaction's 46 MB). The compaction
delta multiplies through the 10 production CH profiles on the deployed
config — nodes_ch_<profile> alone accounts for ~52 MB of that scaling,
before CH shortcut downstream effects.

Tests: all 15 EdgeCoalescerTest cases pass after updating hardcoded
node-ID assertions to remap through Result.sourceToTargetNode.
EdgeCoalescingImportTest (beatty + 3-way chain) likewise updated to
assert renumbering correctness instead of 1:1 preservation; 58 routes
through the orbis cutout remain bit-identical between pre- and post-
compaction builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant