Background
Tree-sitter's tags.scm query files define patterns for identifying symbol definitions (where classes, functions, methods are declared) and references (where they are used). These are the same queries that power GitHub's "Go to Definition" and "Find References" features.
Problem
Our current definition tracking is built from custom tree-sitter queries in language_spec.py and constants.py. While functional, this approach:
- Misses some definition types (e.g., TypeScript type aliases, Rust trait implementations, Java interface implementations)
- Doesn't track references directly: we only discover them through call resolution, so references in type annotations, variable declarations, and inheritance clauses are missed
- Requires manual maintenance per language when grammars update
The upstream tags.scm files are community-maintained and already capture definition/reference pairs we're missing.
Approach
Vendor tags.scm files from official tree-sitter grammar repos for all 12 supported languages. These are already available in the pip packages for 9/12 languages; we supplement the remaining 3 (TypeScript, Scala, C#) from the grammar repos.
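A minimal sketch of how the vendored files might be read at parse time. The function name load_tags_query and the one-file-per-language naming (<language>.scm) are assumptions for illustration, not existing project code.

```python
from pathlib import Path
from typing import Optional


def load_tags_query(tags_dir: Path, language: str) -> Optional[str]:
    """Return the vendored tags query text for a language, or None if absent.

    Assumes a hypothetical layout of <tags_dir>/<language>.scm.
    """
    path = tags_dir / f"{language}.scm"
    if not path.is_file():
        # No vendored file: the caller keeps using the existing custom queries only.
        return None
    return path.read_text(encoding="utf-8")
```

Returning None rather than raising lets a language without a vendored file degrade to the current behavior instead of failing the parse.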
What tags.scm captures
Capture types with semantic meaning:
- @definition.class: class/struct/enum declarations
- @definition.function: function declarations
- @definition.method: method declarations
- @definition.interface: interface declarations (Java, TS, Go)
- @reference.call: function/method call sites
- @reference.class: class references (inheritance, instantiation)
- @reference.implementation: interface implementation references
- @name: the identifier name at each definition/reference site
- @doc: associated documentation comments
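The capture names above follow a role.kind convention, which could be split as in this sketch. TagKind and parse_tag_capture are hypothetical names introduced here for illustration.

```python
from typing import NamedTuple, Optional


class TagKind(NamedTuple):
    role: str  # "definition" or "reference"
    kind: str  # e.g. "class", "function", "call"


def parse_tag_capture(capture: str) -> Optional[TagKind]:
    """Split a capture such as '@definition.class' into role and kind.

    Returns None for helper captures like '@name' and '@doc', which
    annotate a tag rather than being tags themselves.
    """
    name = capture.lstrip("@")
    role, sep, kind = name.partition(".")
    if sep and role in ("definition", "reference") and kind:
        return TagKind(role, kind)
    return None
```

For example, parse_tag_capture("@reference.call") yields role "reference" and kind "call", while parse_tag_capture("@name") yields None.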
Example from Java's tags.scm:
(class_declaration
  name: (identifier) @name) @definition.class
(interface_declaration
  name: (identifier) @name) @definition.interface
(type_list
  (type_identifier) @name) @reference.implementation
(superclass (type_identifier) @name) @reference.class
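To check which tag kinds a vendored file covers, a rough text scan may be enough. The sketch below counts capture names in the excerpt above. It is a plain regex pass, not a real query parser, and count_captures is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures look like @name or @definition.class in query source.
CAPTURE_RE = re.compile(r"@([A-Za-z_][\w.\-]*)")


def count_captures(query_source: str) -> Counter:
    """Count capture names in a tags query, skipping ';' comments."""
    counts: Counter = Counter()
    for line in query_source.splitlines():
        # Naive comment strip: assumes no ';' inside string literals.
        code = line.split(";", 1)[0]
        counts.update(CAPTURE_RE.findall(code))
    return counts


JAVA_EXCERPT = """
(class_declaration
  name: (identifier) @name) @definition.class
(interface_declaration
  name: (identifier) @name) @definition.interface
(type_list
  (type_identifier) @name) @reference.implementation
(superclass (type_identifier) @name) @reference.class
"""

counts = count_captures(JAVA_EXCERPT)
tag_kinds = sorted(name for name in counts if "." in name)
print(tag_kinds)
```

For this excerpt the scan finds four tag kinds (two definition, two reference) and four @name captures, one per pattern.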
Implementation
- Create codebase_rag/queries/tags/ directory with .scm files per language
- Run tags queries alongside existing structural queries during parsing
- Cross-validate our definitions against @definition.* captures and log warnings for any definitions we miss
- Use @reference.* captures to create new relationship types or supplement existing CALLS/IMPORTS edges
- Use @doc captures to associate documentation comments with their definitions
- Add tests per language verifying tag capture completeness
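The cross-validation step could look roughly like the sketch below. It assumes both our structural queries and the tags queries have already been reduced to simple records. The Definition dataclass and find_missed_definitions are hypothetical names, and matching on (name, line) is one possible choice, not a settled design.

```python
import logging
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Definition:
    name: str
    kind: str  # e.g. "class", "function", "interface"
    line: int  # 1-based line of the definition


def find_missed_definitions(
    ours: Iterable[Definition], tagged: Iterable[Definition]
) -> List[Definition]:
    """Return tag-derived definitions that our own queries did not produce.

    Matches on (name, line) only, since kind labels may differ between
    the two sources.
    """
    seen: Set[Tuple[str, int]] = {(d.name, d.line) for d in ours}
    missed = [d for d in tagged if (d.name, d.line) not in seen]
    for d in sorted(missed, key=lambda d: d.line):
        logger.warning(
            "tags.scm found %s %r at line %d that we missed", d.kind, d.name, d.line
        )
    return missed
```

The returned list also gives the per-language tests something concrete to assert on: a fixture file should produce an empty list once our queries reach parity with the tags.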
Languages
All 12: Python, JavaScript, TypeScript, Rust, Java, C, C++, Go, Lua, Scala, PHP, C#
References