Skip to content

[gazelle] switch Java parsing to pure-Go tsparser (no JVM parser server)#439

Open
odvcencio wants to merge 9 commits into
bazel-contrib:mainfrom
odvcencio:draco/tsparser-wiring-318
Open

[gazelle] switch Java parsing to pure-Go tsparser (no JVM parser server)#439
odvcencio wants to merge 9 commits into
bazel-contrib:mainfrom
odvcencio:draco/tsparser-wiring-318

Conversation

@odvcencio

@odvcencio odvcencio commented Mar 4, 2026

Copy link
Copy Markdown

Summary

  • switch Gazelle Java source parsing from the gRPC/JVM javaparser runner to an in-process pure-Go tsparser
  • keep the parser runner interface explicit (private/parser) so backend selection stays implementation-detail
  • use gotreesitter.NewParserPool(...) for concurrency-safe parsing
  • add Java token-source fallback when the default parse tree root contains errors
  • add explicit tree release handling
  • add focused Java parser tests, including TestParsePackageConcurrent
  • pin github.com/odvcencio/gotreesitter to v0.17.4

Non-goals

  • no changes to compile-time dependency analysis (dependency_analyzer)
  • no Kotlin/Scala parsing behavior changes in this PR
  • no dependency resolution logic changes

Why Go/Bazel Version Bumps

gotreesitter requires a newer Go toolchain than the previous Gazelle parser path, so this PR aligns:

  • go.mod Go version
  • MODULE.bazel go_sdk.download(...)

Validation

  • go test ./java/gazelle/private/tsparser -count=1
  • go test -race ./java/gazelle/private/tsparser -run TestParsePackageConcurrent -count=1
  • bazel test --lockfile_mode=off --nocache_test_results --test_output=errors //java/gazelle:gazelle_test
  • go test ./java/gazelle/private/tsparser
  • bazel test //java/gazelle/private/tsparser:tsparser_test

Issue

Ref #318

- bump MODULE go_sdk to 1.24.0 for gotreesitter compatibility\n- pin github.com/odvcencio/gotreesitter to commit ce7d35d (correctness/perf-gates branch)
- Replace single parser instance with concurrency-safe ParserPool to prevent race conditions during parallel parsing
- Add token-source fallback mechanism when default parse trees contain syntax errors, improving resilience against parser edge cases
- Update gotreesitter dependency to version with ParserPool and JavaTokenSource APIs
- Add explicit tree.Release() calls to prevent memory leaks during parse operations
- Add TestParsePackageConcurrent stress test with 12 workers to validate thread safety
@odvcencio odvcencio marked this pull request as ready for review March 4, 2026 05:58
@odvcencio odvcencio changed the title [WIP] gazelle: switch Java parsing to pure-Go tree-sitter runner [gazelle] switch Java parsing to pure-Go tsparser (no JVM parser server) Mar 4, 2026
@jbedard

jbedard commented May 16, 2026

Copy link
Copy Markdown

I would really like to see something like this land, however I've found the github.com/odvcencio/gotreesitter package is simply not comparable to the cgo version when it come to parse performance.

After seeing this PR I tried upgrading the the Aspect js+orion+kotlin extensions and performance on some repos is 10x slower then go-tree-sitter with the cgo bindings. See my branch https://github.com/aspect-build/aspect-gazelle/tree/gotreesitter

@odvcencio are you still hoping to get the pure go version initial parse times up to speed with the cgo one? I know some other metrics are faster with gotreesitter but unfortunately with gazelle the initial parse time is almost all that matters.

@odvcencio

Copy link
Copy Markdown
Author

heya @jbedard thanks so much for tagging me and bringing me back to this. and i especially appreciate your interest in gotreesitter and your effort on that dev branch.

i think my benchmarks on Java have been indicating theres some room there for optimization, and ive confirmed your 10x assessment, i will take a look and see what is possible... im pretty confident i can get that to something decent 👍

odvcencio added 4 commits May 17, 2026 12:32
- Route packages with `.kt` sources through `javaparser` while keeping Java-only packages on `tsparser`
- Resolve nested types, generic references, method bodies, and annotation names more accurately so imported and exported metadata matches the actual API
- Fail fast when a package spans multiple Java packages and key per-class metadata by fully qualified name to avoid ambiguous lookups
- Refresh the Go, rules_go, and parser dependencies, and add regression tests plus corpus benchmarks for the new behavior
- Collect import, export, and main class names as raw strings during parsing, then resolve them into sorted sets after package validation.
- Keep `ClassNameLess` aligned with fully qualified string order without allocating full strings on each comparison.
- Add regressions for field-initializer type references and class-name ordering.
- Upgrade `github.com/odvcencio/gotreesitter` to `v0.17.1`
- Refresh module checksums to keep dependency metadata consistent
- Update `github.com/odvcencio/gotreesitter` from `v0.17.1` to `v0.17.4`
- Refresh module checksums to keep dependency metadata in sync and builds reproducible
@jbedard

jbedard commented May 18, 2026

Copy link
Copy Markdown

Are you still hoping to improve the core gotreesitter to make that initial parse time faster? I'd be interested in how and helping at least test it, although idk how much time I actually have to donate :/

@odvcencio

odvcencio commented May 18, 2026

Copy link
Copy Markdown
Author

heya @jbedard its pretty variable based on language right now but i did some work to target particularly java for this use case as well as JS/TS/Kotlin here

currently working on getting the python fullparse time down... if you check out the changes made through gotreesitter 0.17.4 you should see a considerable boost since the last time you tried it. my primary use cases have been Go-centric so i recognize there is a lot of space to cover to live up to the universal parsing promise across the board. but im targeting updates to the GLR machinery rather than one off language optimizations, so to answer your question, yes i think theres still a ton of room for growth in the parser efficiency especially since i likely still have issues in languages without enough parser exercising or workload-shaped corpora.

believe it or not (at least in the current python case), so far the core parser machinery is more or less there (give or take some funky language cases), the tree materialization has actually been the perf bottleneck. the current parser is materializing too many nodes prematurely and its adding on many times over the wall clock of the parsing itself.

@odvcencio

Copy link
Copy Markdown
Author

oh also yeah i am not really sure what a gazelle or bazel workflow really looks like so simply by using it and telling me if it works/how it performs/if it helps you personally is plenty of help... my goal is to get to roughly 2x cgo full parse times across the board so if a language is not there currently, its just a matter of time

@jbedard

jbedard commented May 19, 2026

Copy link
Copy Markdown

oh also yeah i am not really sure what a gazelle or bazel workflow really looks like so simply by using it and telling me if it works/how it performs/if it helps you personally is plenty of help... my goal is to get to roughly 2x cgo full parse times across the board so if a language is not there currently, its just a matter of time

The summary is gazelle generates BUILD files, normally declaring groups of files that need to be compiled and declaring what they depend on. That requires walking the fs to know what files are available, then different "languages" determine dependencies differently but parsing for "import statements" is the standard as I assume you've noticed here.

Many gazelle "languages" have chosen to use tree-sitter to do the parsing, including all the ones I have in aspect-gazelle as well as in rules_python and others. I would like to update the one here in rules_jvm like you are proposing, but we can't have a 2x performance reduction in parsing.

The problem is gazelle runs once and then exits, so we parse source files once and that's it. No incremental parsing, no warming up golang or tree-sitter, it's just run once and exit.

@odvcencio

odvcencio commented May 19, 2026

Copy link
Copy Markdown
Author

well, what ive found so far is highly corpora-specfic but I did get Java full parse down to ~1.11x or about 6-10ms more than cgo's.... like i mentioned previously, typically the parse machinery is close enough but where the latency is concentrated is typically in tree materialization wiith a big caveat that this is highly language/corpora specific.

i think the more folks like yourself that surface the language-specific issues/workloads/use cases, the more i will know where to improve. im always looking to continue to improve the parser performance so more use cases can surface. so its all helpful.. .thanks again!

@jbedard

jbedard commented May 19, 2026

Copy link
Copy Markdown

FWIW it seems like the AST or at least querying of the AST is quite different with your pure-go version.

See parser.go in my branch and compare it to main, the queries required significant changes and still aren't matching the same logic that was in go-tree-sitter. I haven't investigated why, so idk who's bug it is, but if you're trying to create a replacement for go-tree-sitter then the ASTs and queries need to align...

- Update `github.com/odvcencio/gotreesitter` to v0.18.0 to pick up upstream fixes and improvements
- Refresh module checksums so dependency resolution stays consistent
@odvcencio

Copy link
Copy Markdown
Author

Updated the PR branch to gotreesitter v0.18.0.\n\nValidation run locally:\n- go test ./java/gazelle/private/tsparser -count=1

@odvcencio

Copy link
Copy Markdown
Author

FWIW it seems like the AST or at least querying of the AST is quite different with your pure-go version.

sincere thanks for pointing this out, i had laid out the highlighter parity but never focused specifically on the queries (.scm) parity themselves.... lots of improvements on the way

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants