Skip to content

Latest commit

 

History

History
264 lines (170 loc) · 19.3 KB

File metadata and controls

264 lines (170 loc) · 19.3 KB

Contributing to code2graph

Thank you for your interest in contributing. code2graph is a purpose-neutral, language-agnostic code-graph extraction library. It turns source files into structural facts (symbols, references, cross-file edges) and stops there. This guide gets you oriented before you open a PR.

The single most common contribution is adding a language or improving resolution quality. Both are covered in detail below.


Where to Start

  • GitHub Discussions — design proposals, architecture questions, "should code2graph support X?"
  • GitHub Issues — bug reports and well-scoped feature requests

For a new language with a maintained grammar, you can usually just open a PR — the recipe below is mechanical. For anything that changes the fact schema, the identity scheme, or the resolver trait, open a Discussion first.


What code2graph Is (and Isn't)

code2graph is a focused primitive, like a tokenizer or a parser generator. Consumers build retrieval, analysis, or navigation on top of its output; code2graph itself takes no position on what the facts mean.

The pipeline is two pure, deterministic stages:

source ──[extract]──▶ FileFacts (symbols + references) ──[resolve]──▶ CodeGraph (symbols + edges)
use code2graph::{extract_path, resolve::{Resolver, SymbolTableResolver}};

let a = extract_path("src/util.rs", "pub fn helper() {}")?;
let b = extract_path("src/main.rs", "pub fn run() { helper() }")?;
let graph = SymbolTableResolver.resolve(&[a, b]); // run --calls--> helper

Core invariants — non-negotiable, enforced in review

These are bright lines, not style preferences. A PR that crosses one will be asked to change regardless of size:

  • No storage, no I/O in the library core. Extractors and resolvers are pure and deterministic. No files, no network, no databases.
  • No source bodies. A Symbol carries a ByteSpan, never the source text. The consumer slices what it needs.
  • Purpose-neutral. No scoring, ranking, embeddings, recall-vs-precision policy, or ACLs baked in. That's the consumer's job.
  • Every edge carries a Confidence and a Provenance. Be honest about resolution quality — never fake precision (see Resolution tiers).
  • Extractors produce neutral facts; resolvers stay pure and only connect/derive edges — they never invent facts the extractor didn't emit.
  • Module-root files are wiring only. mod.rs / lib.rs contain only module declarations and re-exports (mod / pub mod / pub use / pub(crate) use) plus the module doc comment — never trait/type/fn definitions, dispatch matches, helpers, or consts. Put logic in a named sibling submodule and re-export it.
  • Grammars are imported in exactly one place. src/grammar.rs is the sole importer of every tree_sitter_* crate. Extractors call crate::grammar::<lang>(), never a grammar crate directly.

Repository Layout

Module Role
src/lang.rs The Language enum + file-extension dispatch. The canonical source of truth for language coverage.
src/grammar.rs The grammar chokepoint — the only file that imports tree_sitter_* crates, gated per Cargo feature.
src/extract/ Per-language extraction. mod.rs is wiring; dispatch.rs holds the Extractor trait + entry points; support.rs holds shared helpers; <lang>.rs is one tree-sitter walk → defs + refs.
src/resolve/ The Resolver trait (the tier seam) + the resolvers that implement it. Read mod.rs for the current set.
src/symbol/ SCIP-aligned identity: Descriptor + SymbolId rendering to a stable SCIP string. Cross-file match = string equality.
src/graph/types.rs The neutral fact schema: Symbol, Reference, Edge, FileFacts, CodeGraph, SymbolKind, EdgeKind, Confidence, Provenance, ByteSpan.
eval/ The measurement harness — scores ref→def precision/recall per language and per tier (see Validation).

Development Setup

Requirements: Rust stable (see rust-toolchain.toml / Cargo.toml rust-version; MSRV is 1.85, edition 2024). No system dependencies — tree-sitter grammars build from source via cc.

git clone https://github.com/nodedb-lab/code2graph.git
cd code2graph

cargo test                                    # unit + doc tests
cargo test --workspace                        # includes the eval/ integration suite
cargo clippy --all-targets -- -D warnings     # lints — must be clean
cargo fmt --all                               # formatting

Every language is a Cargo feature, all enabled by default. To work on one in isolation:

cargo test --no-default-features --features rust    # just the Rust extractor
cargo test --features scala scala                   # run only the Scala tests

Every public-facing file starts with // SPDX-License-Identifier: Apache-2.0.


Adding a Language

This is the most common contribution and it's mechanical once a grammar exists. The resolver is language-agnostic, so cross-file edges work for free once extraction emits correct facts.

Before you start: check docs/supported-languages.md for what's already covered (plus the candidate, not-feasible, and out-of-scope lists), and docs/ffi-support-matrix.md for cross-language FFI boundaries. Both are guarded by sync tests — supported-languages.md against the Language enum in src/lang.rs, and ffi-support-matrix.md against the per-ABI SPECS registry in src/ffi/ — so adding a language or an FFI boundary without updating the matching doc in the same PR fails the test.

First: is there a usable grammar? code2graph pins tree-sitter to >=0.24, <0.27. A grammar crate must be compatible with that range. If no compatible grammar exists, read When a language has no usable grammar before writing any code.

The recipe

Using a hypothetical language Foo with extension .foo:

1. Add the grammar dependency and feature flag (Cargo.toml):

[features]
# add to the default list so it's on by default
default = [ …, "foo" ]
foo = ["dep:tree-sitter-foo"]

[dependencies]
tree-sitter-foo = { version = "<x.y.z>", optional = true }

2. Register the grammar in the chokepoint (src/grammar.rs) — and add the ABI sanity-check arm at the bottom of the file:

#[cfg(feature = "foo")]
/// Returns the tree-sitter grammar for Foo.
pub fn foo() -> Language {
    tree_sitter_foo::LANGUAGE.into()
}
// …and in the abi_versions_are_compatible test:
#[cfg(feature = "foo")]
check("foo", super::foo());

3. Register the language (src/lang.rs): add the Foo enum variant, the Language::Foo => "foo" arm in as_str(), and the extension dispatch "foo" => Some(Self::Foo).

4. Write the extractor (src/extract/foo.rs): a struct FooExtractor implementing the Extractor trait. One tree-sitter walk that collects:

  • definitions → a Symbol with a SCIP SymbolId (namespace descriptors derived from the file path or the language's package/module convention), a SymbolKind, a ByteSpan, and a one-line signature;
  • references (call sites, imports, type uses) → a Reference with a byte offset and, where the syntax has a receiver (obj.method()), the receiver captured as the reference's qualifier so the scope-aware resolver can disambiguate.

Reuse the shared helpers in src/extract/support.rs (node_text, field_text, one_line_signature, collect_call_references, push_import_ref, push_ref, push_type_ref, the scope/binding helpers, …) — don't reinvent them. The freshest extractor is usually the best template; pick one structurally similar to your language (class-based, module-based, etc.).

5. Wire it up:

  • src/extract/mod.rs (wiring only): #[cfg(feature = "foo")] pub mod foo; and #[cfg(feature = "foo")] pub use foo::FooExtractor;
  • src/extract/dispatch.rs: the use super::FooExtractor; and the Language::Foo => FooExtractor.extract(source, file), match arm.

6. Add unit tests in foo.rs (#[cfg(test)] mod tests): assert that definitions get the expected SCIP id strings and SymbolKinds, and that references (including qualifiers on member calls) are captured. Assert the real rendered SCIP string — derive it from an existing extractor's tests.

Tip: dump the real AST before you write the extractor

Published grammars frequently differ from the node-types.json in their GitHub repo. Don't guess node/field names — verify them against the exact crate version you depend on. The reliable way: wire up the grammar (steps 1–2), drop a throwaway examples/ program that prints tree.root_node().to_sexp() for a few representative snippets, run it, read the real tree, then delete it. This catches surprises (separate signature/body nodes, field labels that point at punctuation, nested wrapper nodes) before they cost a review round-trip.

Embedded / single-file-component languages

Some languages embed another language — a Svelte/Vue <script> block contains real JS/TS. Don't re-implement JS/TS parsing: parse the host document, locate the script's inner source node, run the existing extractor (super::typescript::extract_ecmascript), then remap every byte offset back into the host file with support::shift_offsets. The Svelte extractor is the reference implementation for this shape.


When a Language Has No Usable Grammar

Not every language has a tree-sitter grammar you can depend on. Surface this honestly — don't ship a fragile workaround. Work through these in order:

  1. Search crates.io for a maintained binding. Try the obvious name and common variants (tree-sitter-<lang>, -<lang>-ng, -<lang>3). Prefer the one with real download counts and recent releases.

  2. Check tree-sitter version compatibility — this is the usual blocker. code2graph pins tree-sitter to >=0.24, <0.27. A grammar crate built against an old tree-sitter (e.g. 0.20) exposes a different Language type that cannot be passed to our parser, and its generated parser may have an ABI version outside our supported range. The abi_versions_are_compatible test in src/grammar.rs guards the ABI; a mismatched crate will fail it (or fail to link). If the only available grammar is stuck on an old tree-sitter, the language is not feasible right now — say so in the PR/issue.

  3. Do not bridge incompatible tree-sitter versions. Adding an old tree-sitter as a side dependency and transmuting its Language into ours is a layout-dependent hack that breaks the moment either crate changes. It violates the project's durability bar and will be rejected. The same goes for any "make it link somehow" FFI trick.

  4. If a grammar source exists but isn't published compatibly, vendoring/regenerating it is a real option — but it crosses the "grammars come from crates.io" line and adds generated C to maintain. Open a Discussion first; this is a project-level decision, not a drive-by PR.

  5. Otherwise, document and skip. Note the blocker (which crates exist, what tree-sitter version they target) in an issue so the next person doesn't re-discover it. When a maintained, compatible grammar appears, the language becomes a normal recipe-follows contribution.

The honest "we can't support this yet, and here's exactly why" is a valuable contribution. A grammar shim that links by luck is not.


Resolution Tiers

Resolution is pluggable behind the Resolver trait — the tier seam. Every resolver takes per-file FileFacts and emits the same CodeGraph schema, tagging each edge with a Confidence and a Provenance. Consumers pick a tier without changing how they read the output.

Tier Resolver Confidence Behaviour
A SymbolTableResolver NameOnly Fast, all languages, recall-first. Matches by name/scope. An ambiguous name fans out to all same-named definitions.
B ScopeGraphResolver Scoped / Exact Scope-aware: resolves references through lexical scopes, imports, and qualified paths. Emits an edge only when it can resolve precisely — never fakes precision (zero false positives is the contract).
FfiBridgeResolver Links cross-language boundaries deterministically (e.g. a #[no_mangle] Rust fn called from C, a PyO3 #[pyfunction] from Python) by matching the exported ABI name, which plain name resolution can't recover.

Provenance records which analysis derived an edge (name table, scope graph, FFI bridge, …), orthogonal to Confidence.

The decisive axis is resolution correctness. The highest-value resolution contributions:

  • Extend scope-aware resolution to another language. Tier B's quality depends on the extractor emitting the right facts — qualifiers on member calls, import bindings with their from_path, namespace descriptors. Improving an extractor's references often upgrades that language from Tier-A fan-out to Tier-B precision with no resolver change.
  • Improve the resolver itself behind the trait. New resolution capabilities (e.g. type-qualified call resolution) slot in behind the seam and must hold the contract: uniqueness gates precision — widen the candidate set only where you can still prove a unique match, so you never introduce a false positive.

When you touch resolution, prove it with the eval harness (below).


Validation: Prove the Number

"Best at code→graph" must be a number, not a claim. The eval/ crate scores ref→def precision and recall per language and per tier. The evaluation unit is a located edge (a reference site bound to a definition site), so name-only fan-out is penalised exactly where it over-connects: a reference linking to N same-named definitions scores one true positive and N − 1 false positives.

cargo run -p code2graph-eval      # print the scorecard
cargo test  -p code2graph-eval    # the regression gate on the invariants

The corpus has two kinds of ground truth, both under eval/corpus/:

  • Golden fixtures — hand-authored expected.edges, for role-typed scoring.
  • SCIP oracles<lang>_oracle/<case>/ directories scored location-only against an index produced by a mature, type-aware indexer (rust-analyzer, scip-typescript, scip-java, …). This is how Tier-B's precision thesis is locked against an external source of truth. See eval/ORACLE.md for the (maintainer-only, off-by-default) oracle regeneration workflow — the normal build and test loop pulls no SCIP/indexer dependencies; it only ever reads committed artifacts.

When you add a language, add at least one corpus case. When you improve resolution, add a case that exercises the ambiguity you fixed, and confirm the regression gate locks the gain. The regression tests encode invariants (Tier-A keeps full recall in its lane; Tier-B never emits a false positive; Tier-B beats Tier-A on precision where genuine ambiguity exists) — not brittle exact rates.


Testing

  • Unit tests — in the same file as the code under test, #[cfg(test)] mod tests { use super::*; … }. Can test private functions. This is where extractor tests live (expected SCIP strings, kinds, captured references).
  • Integration tests — in eval/tests/ (and any crate's tests/). Test the public API and end-to-end resolution behaviour.
  • Corpus cases — the data-driven scoring fixtures under eval/corpus/. Adding a directory is automatically picked up by the harness; no plumbing changes.

A language PR should include unit tests for the extractor and at least one corpus case.


Code Standards

Enforced in review — run all three green before opening a PR:

cargo fmt --all
cargo clippy --all-targets -- -D warnings
cargo test --workspace
  • No .unwrap() / .expect() / panic! in library code. Return Result, propagate with ?, use if let/let … else. (Test code may unwrap.)
  • No Result<T, String> — use the crate's typed CodegraphError.
  • Module-root files (mod.rs / lib.rs) are wiring only — declarations and re-exports, no logic. (See the core invariants.)
  • Reuse support.rs helpers rather than re-implementing text extraction, signature building, scope/binding construction, or reference pushing.
  • Keep files and functions focused. If a module is expected to grow, use a directory from the start. Prefer many small, single-purpose files.
  • Grammars only through src/grammar.rs. No direct tree_sitter_* imports elsewhere.

Commits and Pull Requests

Commit formatConventional Commits, scoped by area:

feat(extract): add Scala extractor
fix(resolve): capture receiver as qualifier on qualified calls
test(eval): add ruby_oracle/ambiguous_call corpus fixture
docs(symbol): document SCIP descriptor rendering

Types: feat, fix, perf, refactor, test, docs, chore. Common scopes: extract, resolve, symbol, graph, lang, grammar, eval.

  • One logical change per commit. Each commit must build standalone (no broken bisect) — e.g. an extractor fix lands before the resolver test that depends on it.
  • Keep PRs scoped. A new language is one coherent PR; don't bundle unrelated changes.
  • Draft PRs are welcome for directional feedback before a full implementation.
  • Don't include generated noise — no Cargo.lock churn unrelated to your dependency, no formatter reflows of untouched files.

Code of Conduct

This project follows the Contributor Covenant. By participating you are expected to uphold it. Report unacceptable behaviour through the channels listed there.

License

code2graph is licensed under Apache-2.0. By contributing, you agree your contributions are licensed under the same terms. No CLA is required.