[gazelle] switch Java parsing to pure-Go tsparser (no JVM parser server)#439
[gazelle] switch Java parsing to pure-Go tsparser (no JVM parser server)#439odvcencio wants to merge 9 commits into
Conversation
- bump MODULE go_sdk to 1.24.0 for gotreesitter compatibility\n- pin github.com/odvcencio/gotreesitter to commit ce7d35d (correctness/perf-gates branch)
- Replace single parser instance with concurrency-safe ParserPool to prevent race conditions during parallel parsing - Add token-source fallback mechanism when default parse trees contain syntax errors, improving resilience against parser edge cases - Update gotreesitter dependency to version with ParserPool and JavaTokenSource APIs - Add explicit tree.Release() calls to prevent memory leaks during parse operations - Add TestParsePackageConcurrent stress test with 12 workers to validate thread safety
|
I would really like to see something like this land, however I've found the After seeing this PR I tried upgrading the the Aspect js+orion+kotlin extensions and performance on some repos is 10x slower then @odvcencio are you still hoping to get the pure go version initial parse times up to speed with the cgo one? I know some other metrics are faster with gotreesitter but unfortunately with gazelle the initial parse time is almost all that matters. |
|
heya @jbedard thanks so much for tagging me and bringing me back to this. and i especially appreciate your interest in gotreesitter and your effort on that dev branch. i think my benchmarks on Java have been indicating theres some room there for optimization, and ive confirmed your 10x assessment, i will take a look and see what is possible... im pretty confident i can get that to something decent 👍 |
- Route packages with `.kt` sources through `javaparser` while keeping Java-only packages on `tsparser` - Resolve nested types, generic references, method bodies, and annotation names more accurately so imported and exported metadata matches the actual API - Fail fast when a package spans multiple Java packages and key per-class metadata by fully qualified name to avoid ambiguous lookups - Refresh the Go, rules_go, and parser dependencies, and add regression tests plus corpus benchmarks for the new behavior
- Collect import, export, and main class names as raw strings during parsing, then resolve them into sorted sets after package validation. - Keep `ClassNameLess` aligned with fully qualified string order without allocating full strings on each comparison. - Add regressions for field-initializer type references and class-name ordering.
- Upgrade `github.com/odvcencio/gotreesitter` to `v0.17.1` - Refresh module checksums to keep dependency metadata consistent
- Update `github.com/odvcencio/gotreesitter` from `v0.17.1` to `v0.17.4` - Refresh module checksums to keep dependency metadata in sync and builds reproducible
|
Are you still hoping to improve the core |
|
heya @jbedard its pretty variable based on language right now but i did some work to target particularly java for this use case as well as JS/TS/Kotlin here currently working on getting the python fullparse time down... if you check out the changes made through gotreesitter 0.17.4 you should see a considerable boost since the last time you tried it. my primary use cases have been Go-centric so i recognize there is a lot of space to cover to live up to the universal parsing promise across the board. but im targeting updates to the GLR machinery rather than one off language optimizations, so to answer your question, yes i think theres still a ton of room for growth in the parser efficiency especially since i likely still have issues in languages without enough parser exercising or workload-shaped corpora. believe it or not (at least in the current python case), so far the core parser machinery is more or less there (give or take some funky language cases), the tree materialization has actually been the perf bottleneck. the current parser is materializing too many nodes prematurely and its adding on many times over the wall clock of the parsing itself. |
|
oh also yeah i am not really sure what a gazelle or bazel workflow really looks like so simply by using it and telling me if it works/how it performs/if it helps you personally is plenty of help... my goal is to get to roughly 2x cgo full parse times across the board so if a language is not there currently, its just a matter of time |
The summary is gazelle generates BUILD files, normally declaring groups of files that need to be compiled and declaring what they depend on. That requires walking the fs to know what files are available, then different "languages" determine dependencies differently but parsing for "import statements" is the standard as I assume you've noticed here. Many gazelle "languages" have chosen to use tree-sitter to do the parsing, including all the ones I have in aspect-gazelle as well as in rules_python and others. I would like to update the one here in rules_jvm like you are proposing, but we can't have a 2x performance reduction in parsing. The problem is gazelle runs once and then exits, so we parse source files once and that's it. No incremental parsing, no warming up golang or tree-sitter, it's just run once and exit. |
|
well, what ive found so far is highly corpora-specfic but I did get Java full parse down to ~1.11x or about 6-10ms more than cgo's.... like i mentioned previously, typically the parse machinery is close enough but where the latency is concentrated is typically in tree materialization wiith a big caveat that this is highly language/corpora specific. i think the more folks like yourself that surface the language-specific issues/workloads/use cases, the more i will know where to improve. im always looking to continue to improve the parser performance so more use cases can surface. so its all helpful.. .thanks again! |
|
FWIW it seems like the AST or at least querying of the AST is quite different with your pure-go version. See parser.go in my branch and compare it to |
- Update `github.com/odvcencio/gotreesitter` to v0.18.0 to pick up upstream fixes and improvements - Refresh module checksums so dependency resolution stays consistent
|
Updated the PR branch to gotreesitter v0.18.0.\n\nValidation run locally:\n- |
sincere thanks for pointing this out, i had laid out the highlighter parity but never focused specifically on the queries ( |
Summary
javaparserrunner to an in-process pure-Gotsparserprivate/parser) so backend selection stays implementation-detailgotreesitter.NewParserPool(...)for concurrency-safe parsingTestParsePackageConcurrentgithub.com/odvcencio/gotreesittertov0.17.4Non-goals
dependency_analyzer)Why Go/Bazel Version Bumps
gotreesitterrequires a newer Go toolchain than the previous Gazelle parser path, so this PR aligns:go.modGo versionMODULE.bazelgo_sdk.download(...)Validation
go test ./java/gazelle/private/tsparser -count=1go test -race ./java/gazelle/private/tsparser -run TestParsePackageConcurrent -count=1bazel test --lockfile_mode=off --nocache_test_results --test_output=errors //java/gazelle:gazelle_testgo test ./java/gazelle/private/tsparserbazel test //java/gazelle/private/tsparser:tsparser_testIssue
Ref #318