Skip to content

Support 64-bit OSM way IDs by splitting osmid encoded value#194

Open
bnaul wants to merge 2 commits into
original-directionfrom
tomtom-orbis-osmid-64bit
Open

Support 64-bit OSM way IDs by splitting osmid encoded value#194
bnaul wants to merge 2 commits into
original-directionfrom
tomtom-orbis-osmid-64bit

Conversation

@bnaul

@bnaul bnaul commented May 21, 2026

Copy link
Copy Markdown

Background

We started ingesting TomTom Orbis into BigQuery (vendor/tomtom/orbis_processing.py in the model repo). Orbis is TomTom's open-map product, shipped as a .osm.pbf extract with TomTom-specific tags layered on top of standard OSM tags. As a sanity check that the data is usable for routing, I tried building a GraphHopper graph from the California Orbis PBF using this fork's import-sandbox command — which is when this issue surfaced.

The build crashed deep in pass 2:

java.lang.RuntimeException: Could not parse OSM file: ...
Caused by: java.lang.RuntimeException: Unexpectedly high way id.
    at com.graphhopper.OsmIdTagParser.handleWayTags(OsmIdTagParser.java:20)
    ...

OsmIdTagParser rejects any way with id > Integer.MAX_VALUE, and the osmid IntEncodedValue is declared with 31 bits. Orbis assigns its own ID space, with way IDs running up to ~7.7B in the CA extract — about 3.6× the int32 ceiling. Standard OSM is still safely under int32, but it's been creeping up over the years too (~1.4B today), so the runway isn't infinite even without Orbis.

What this PR does

Splits the osmid value across two 31-bit IntEncodedValues — osmid (low 31 bits) and a new osmid_high (next 31 bits) — for a total of 62 usable bits. Reassembles to a long at every read site.

GraphHopper core only ships IntEncodedValue; there's no LongEncodedValue, and adding one would mean touching upstream-GH internals (new EV type, packed-layout changes, new getters/setters on EdgeIteratorState). The split-int approach uses only the existing public API, and it's the pattern this fork already uses for the 64-bit stable edge IDs (stable_id_byte_0..7).

Touch points:

  • OsmIdTagParser — writes both halves; guard raised from 32-bit to 62-bit.
  • RouterConstants — adds OSM_ID_HIGH_ENCODED_VALUE + OSM_ID_LOW_BITS (31).
  • EncodedValueFactoryWithStableId — registers osmid_high alongside osmid.
  • TagParserFactoryWithOsmId — wires both EVs into the single tag parser; uses NoOpTagParser for osmid_high so GH doesn't try to parse it independently.
  • GraphHopperManaged — adds osmid_high to setEncodedValuesString.
  • StableIdEncodedValues — new readOsmWayId(edge) helper; setStableId now reads the full long.
  • StreetEdgeExporter — reads via the helper; exported street_edges.csv osmid column is unchanged in format (always was a long-formatted string).
  • PathDetailsBuilderFactoryWithStableId + new OsmIdPathDetailsBuilder — explicit builder for the osmid path detail so the API returns the full Long instead of the generic IntDetails truncating to 31 bits.
  • Tests: TestCustomOSMParsers exercises a way ID above 2^31 to confirm the round-trip; StableEdgeIdManagerTest switched from int to long via the new helper.

Safety for non-Orbis data

  • Stable edge IDs are unchanged for any OSM way ID < 2^31. ((long)0 << 31) | (id & 0x7FFF_FFFF) == id, so existing Geofabrik-derived SEIDs are bit-identical. No regression in SEID-keyed downstream artifacts (custom speed files, trajectory speeds, road closures).
  • Storage cost: ~4 bytes per edge (one extra 31-bit field). At ~10M CA edges that's ~40 MB; at ~150M nationwide edges that's ~600 MB — well under 1% of the heap the nationwide builder runs with.
  • Build time impact: negligible — one extra setInt per edge.

Operational caveat

Adding osmid_high to the encoded-values string changes the edge-record layout, so existing pre-built graph snapshots in GCS won't load against a runner built from this branch. The graphhopper_graph_builder inputs_hash should invalidate downstream artifacts naturally once a new runner image tag is rolled in, but the builder + runner image tags should ship together in the same model-repo change rather than independently.

API change worth noting

The osmid path detail in routing responses switches from Integer to Long. JSON-wise it's still a number; only clients that strongly type the field as int would break.

Testing

  • mvn -pl replica-common,web-bundle clean compile test-compile: ✅ all clean.
  • web test-compile is blocked by the separate grpc/protobuf generation prerequisite; running CI is the simplest way to exercise StableEdgeIdManagerTest.
  • End-to-end verification: built a patched JAR and ran import-sandbox against the full TomTom Orbis 2026-05-14 California PBF (200M nodes, 34M ways, way IDs up to 7.7B). Completes in ~13 min, producing a 9.27M-node / 11.57M-edge graph. Spinning up server mode on the result returns sane routes (LAX→SFO: 610 km / 7h10m; SF→Oakland: 18 km / 22 min) and the same routes built from the equivalent Geofabrik CA PBF are within rounding (610 km / 7h07m and 17 km / 20 min) — i.e. on standard-OSM data the patched code is indistinguishable from the old code.

🤖 Generated with Claude Code

@bnaul bnaul changed the base branch from master to original-direction May 21, 2026 13:32
@bnaul bnaul force-pushed the tomtom-orbis-osmid-64bit branch 2 times, most recently from 43c98a9 to 8900f29 Compare May 21, 2026 14:59
@bnaul bnaul force-pushed the tomtom-orbis-osmid-64bit branch from 8900f29 to ed18e47 Compare May 29, 2026 12:27
@bnaul bnaul marked this pull request as draft May 29, 2026 12:28
@bnaul bnaul marked this pull request as ready for review May 29, 2026 12:28
@bnaul bnaul requested a review from mark-idleman May 29, 2026 12:28
bnaul added a commit that referenced this pull request May 29, 2026
So marking PR #194 ready-for-review kicks off an automatic review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bnaul bnaul marked this pull request as draft May 29, 2026 12:46
@bnaul bnaul marked this pull request as ready for review May 29, 2026 12:46
TomTom Orbis assigns OSM way IDs above Integer.MAX_VALUE (CA goes up
to 7.7B), which OsmIdTagParser previously rejected because the osmid
IntEncodedValue is only 31 bits. Split the ID across two encoded
values (osmid low + osmid_high, 31 bits each = 62 usable bits) and
reassemble it everywhere downstream: StableIdEncodedValues, the
street-edge exporter, and a new OsmIdPathDetailsBuilder so the routing
API still returns a full long for the osmid path detail rather than
falling back to the generic IntDetails wiring.

Verified by rebuilding the patched JAR and importing the full TomTom
Orbis 2026-05-14 California PBF (200M nodes, 34M ways) — completes
end-to-end in ~13 min and produces a 9.3M-node graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bnaul bnaul force-pushed the tomtom-orbis-osmid-64bit branch from 79740af to c314078 Compare May 29, 2026 13:26
@bnaul bnaul marked this pull request as draft May 29, 2026 13:28
@bnaul bnaul marked this pull request as ready for review May 29, 2026 13:28
@claude

claude Bot commented May 29, 2026

Copy link
Copy Markdown

Claude encountered an error —— View job


I'll analyze this and get back to you.

@claude

claude Bot commented May 29, 2026

Copy link
Copy Markdown

Claude encountered an error after 1m 12s —— View job


PR Review In Progress

  • Read CLAUDE.md and gather repo context
  • Review changed files (OsmIdTagParser, encoded value wiring, StableIdEncodedValues, path detail builder, exporter)
  • Check correctness of bit-splitting/reassembly (sign-extension, ID 0, max-id boundary)
  • Check thread-safety on hot paths (encoder/decoder reads per edge)
  • Check performance regressions
  • Check edge cases / error handling in tests
  • Post review feedback

Keeps this branch's workflow byte-identical to the default branch so
the review passes claude-code-action's validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bnaul bnaul marked this pull request as draft May 29, 2026 13:43
@bnaul bnaul marked this pull request as ready for review May 29, 2026 13:43
@claude

claude Bot commented May 29, 2026

Copy link
Copy Markdown

Claude finished @bnaul's task in 4m 37s —— View job


PR Review

LGTM overall — the split-encoding approach is the right call here (it sticks to the public IntEncodedValue API, matches the pattern already used for the 8-byte SEID, and the bit-math reassembles cleanly). I dug into the parts that looked most likely to bite and didn't find a correctness or thread-safety problem.

What I checked

Bit math (OsmIdTagParser) — correct. The high-half IEV is unsigned 31-bit, so (long) highValue is always non-negative; no sign-extension trap. id >>> 31 for id ≤ 2^62−1 fits in 31 bits, the id < 0 || id > MAX_SUPPORTED_ID guard catches both ends, and (int)(id & LOW_MASK) is well-defined.

SEID stability for legacy data — preserved. calculateHumanReadableStableEdgeId formats with %d, so for any OSM way ID that fit in 31 bits before, Long.toString(id) equals what the old int-typed code produced. Bit-identical hash input ⇒ bit-identical SEIDs. Confirmed against the safety claim in the PR description.

Encoded-value wiring order — correct. EncodedValueFactoryWithStableId checks equals("osmid_high") before startsWith("osmid"), which it has to since "osmid_high".startsWith("osmid") is true. Flagged inline as a robustness improvement — works today, easy to break if rearranged.

TagParser layering — correct. TagParserFactoryWithOsmId wires one OsmIdTagParser for osmid (writes both halves in one shot) and a NoOpTagParser for osmid_high so the EV exists for layout/lookup but nobody else writes to it.

Concurrency — no new shared mutable state on the read path. OsmIdPathDetailsBuilder is constructed per-request (matching StableIdPathDetailsBuilder); currentValue is per-builder; IntEncodedValue reads are stateless. Writes only happen during single-threaded build. No regressions for the multi-threaded gRPC path.

Performance — negligible. One extra setInt per edge at build, one extra edge.get per edge in the read helper, +4 bytes/edge of storage. Nothing on the hot routing path unless osmid path detail is requested, and even then it's two int reads + a shift + an OR per edge.

Path-detail factory edge cases — handled. Requesting osmid adds OsmIdPathDetailsBuilder once (the generic loop short-circuits on the explicit continue). Requesting osmid_high is filtered out and falls through to the "Cannot find the path details" error — correct, since it's internal.

Inline comments

  • Test coverage gapTestCustomOSMParsers only exercises the happy path. The new rejection branches (id < 0, id > 2^62−1) and the just-above-2^31 boundary that used to throw aren't tested. Suggested concrete additions inline.
  • EncodedValueFactoryWithStableId robustness — suggested switching startsWith("osmid") to equals("osmid") so the branch ordering isn't load-bearing. Inline.

Operational note

The encoded-values-string change shifts edge-record layout, so existing graph snapshots won't load — the PR body already calls this out. Worth re-confirming the builder + runner image tags ship together when this is rolled out.

Todo

  • Read CLAUDE.md and gather repo context
  • Review OsmIdTagParser bit-splitting / reassembly
  • Review encoded value wiring (EncodedValueFactory, TagParserFactory, GraphHopperManaged)
  • Review StableIdEncodedValues reassembly
  • Review OsmIdPathDetailsBuilder + PathDetailsBuilderFactory
  • Review StreetEdgeExporter
  • Review tests
  • Post review feedback
    · Branch: tomtom-orbis-osmid-64bit

Comment on lines +28 to +42
long wayId = (1L << 33) + 12345L;
ReaderWay way = new ReaderWay(wayId);
IntsRef relationFlags = new IntsRef(50);

// Check that handleOsmIdWayTags correctly sets OSM ID for edge
// handleOsmIdWayTags should round-trip the full 64-bit way ID.
customOSMParsers.handleOsmIdWayTags(edgeId, edgeIntAccess, way, relationFlags);
assertEquals(osmEncValue.getInt(false, edgeId, edgeIntAccess), (int) way.getId());
assertEquals(wayId, OsmIdTagParser.readOsmWayId(
osmIdLowEnc.getInt(false, edgeId, edgeIntAccess),
osmIdHighEnc.getInt(false, edgeId, edgeIntAccess)));

// Check that the OSM ID is still stored properly after handleWayTags
// handleWayTags should not clobber the previously-written ID (it skips OsmIdTagParser).
customOSMParsers.handleWayTags(edgeId, edgeIntAccess, way, relationFlags);
assertEquals(osmEncValue.getInt(false, edgeId, edgeIntAccess), (int) way.getId());
assertEquals(wayId, OsmIdTagParser.readOsmWayId(
osmIdLowEnc.getInt(false, edgeId, edgeIntAccess),
osmIdHighEnc.getInt(false, edgeId, edgeIntAccess)));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the test only exercises the round-trip happy path. The new rejection logic in OsmIdTagParser.handleWayTagsid < 0 and id > MAX_SUPPORTED_ID — isn't covered, and neither is the 2^31 boundary (which used to throw and now must succeed). A couple of cheap assertThrows / boundary cases would lock in the contract:

// Just-above-int32 boundary still round-trips (previously threw).
ReaderWay justOverInt = new ReaderWay((1L << 31));
customOSMParsers.handleOsmIdWayTags(edgeId, edgeIntAccess, justOverInt, relationFlags);
assertEquals(1L << 31, OsmIdTagParser.readOsmWayId(
        osmIdLowEnc.getInt(false, edgeId, edgeIntAccess),
        osmIdHighEnc.getInt(false, edgeId, edgeIntAccess)));

// Above 62-bit max is rejected.
ReaderWay tooBig = new ReaderWay(1L << 62);
assertThrows(RuntimeException.class,
        () -> customOSMParsers.handleOsmIdWayTags(edgeId, edgeIntAccess, tooBig, relationFlags));

Comment on lines +62 to 65
} else if (name.equals("osmid_high")) {
return new IntEncodedValueImpl("osmid_high", 31, false);
} else if (name.startsWith("osmid")) {
return new IntEncodedValueImpl("osmid", 31, false);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor robustness: this works today only because equals("osmid_high") is checked before startsWith("osmid")"osmid_high".startsWith("osmid") is true, so swapping the order would silently mis-create osmid_high as a value named "osmid". Tightening the second branch to equals("osmid") removes that footgun and reads more consistently with the new equals branch above:

Suggested change
} else if (name.equals("osmid_high")) {
return new IntEncodedValueImpl("osmid_high", 31, false);
} else if (name.startsWith("osmid")) {
return new IntEncodedValueImpl("osmid", 31, false);
} else if (name.equals("osmid_high")) {
return new IntEncodedValueImpl("osmid_high", 31, false);
} else if (name.equals("osmid")) {
return new IntEncodedValueImpl("osmid", 31, false);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant