Precompute ancestor hierarchy to speed up term indexing#279
Precompute ancestor hierarchy to speed up term indexing#279
Conversation
During term indexing, index_doc called retrieve_hierarchy_ids per class, issuing iterative SPARQL queries level-by-level to collect ancestors. For large ontologies (100K+ classes), this produced hundreds of thousands of SPARQL round-trips. Replace with a single paginated SPARQL query to fetch all parent-child edges, then compute the transitive closure in memory using memoized BFS. The precomputed ancestor map is stored as a class-level cache on LinkedData::Models::Class for the duration of bulk indexing and cleared in an ensure block afterward.
Add test_ancestors_precompute.rb covering linear chain, diamond inheritance, multiple roots, cycles, complex DAG, memoization, and edge cases. All tests are pure in-memory, no triplestore required. Add temporary per-class validation in the indexing loop that compares precomputed ancestors against the old retrieve_hierarchy_ids SPARQL traversal for every class. Logs warnings on mismatches. To be removed once validated against production data.
Per-class ancestor validation is expensive (runs both old and new for every class). Only enable it when explicitly requested via OP_VALIDATE_ANCESTORS=1 so it does not slow down normal indexing.
When OP_VALIDATE_ANCESTORS=1 is set, log old vs new timing for each class and whether ancestors matched or mismatched. Useful for comparing SPARQL traversal cost against in-memory cache lookup.
mdorf
left a comment
There was a problem hiding this comment.
This is a SOLID performance optimization, and we should definitely implement it in a short working order. However, it introduces significant changes to a core model and should NOT be bundled together with the upcoming major release. I propose that we revisit this feature SHORTLY after the main release.
| # end | ||
| # path_ids.delete(class_id) | ||
| path_ids = retrieve_hierarchy_ids(:ancestors) | ||
| path_ids = (self.class.ancestors_cache[class_id] || Set.new).dup |
There was a problem hiding this comment.
Hard dependency on cache — no fallback (High)
class.rb:270 now unconditionally reads from ancestors_cache:
path_ids = (self.class.ancestors_cache[class_id] || Set.new).dupIf index_doc is ever called outside of bulk indexing (e.g., individual class re-indexing, tests, or other code paths), ancestors_cache will be nil, and calling [class_id] on nil will raise NoMethodError. The commented-out fallback to retrieve_hierarchy_ids should be uncommented:
if self.class.ancestors_cache
path_ids = (self.class.ancestors_cache[class_id] || Set.new).dup
else
path_ids = retrieve_hierarchy_ids(:ancestors)
endThere was a problem hiding this comment.
I was thinking about it but is there a case where index_doc is called outside of bulk indexing?
| # SPARQL ancestor traversal. Keyed by class URI string, values are | ||
| # Sets of ancestor URI strings. Set by OntologySubmissionIndexer | ||
| # during bulk indexing and cleared after completion. | ||
| attr_accessor :ancestors_cache |
There was a problem hiding this comment.
Thread safety of class-level mutable state (High)
ancestors_cache is a class-level attr_accessor on LinkedData::Models::Class. If two ontologies are indexed concurrently (separate threads/workers), one worker could overwrite or nil out the cache while another is still reading it. Consider:
- Making the cache per-submission (pass it through the indexing context rather than a global class variable)
- Or at minimum, documenting/asserting that concurrent indexing is not supported
There was a problem hiding this comment.
Thread safety of class-level mutable state (High)
The ancestors_cache is computed once before worker threads are spawned. During indexing, threads only read from it (hash lookups + .dup on the Sets) — no writes occur until all threads complete and the cache is cleared.
The class-level state concern would apply if different ontologies were indexed concurrently in the same process. Currently ncbo_cron processes ontologies sequentially, and if we ever move to concurrent indexing, we would likely use separate worker processes subscribing to a queue — each with their own isolated class-level state. In the unlikely event we needed concurrency within a shared process, we could scope the cache via Thread.current (same pattern already used for RequestStore.store[:requested_lang] in this file).
| end | ||
|
|
||
| # TODO: Remove once precomputed ancestors are validated against production data | ||
| def validate_class_ancestors(cls, logger) |
There was a problem hiding this comment.
Instance variables leaked from validate_class_ancestors (Medium)
In validate_class_ancestors, @old_ancestors_result and @new_ancestors_result are instance variables on the indexer but are only used within Benchmark.realtime blocks. These should be local variables instead. The current approach leaks state across calls:
old_time = Benchmark.realtime do
old_ancestors = cls.retrieve_hierarchy_ids(:ancestors)
old_ancestors.select! { |x| !x["owl#Thing"] }
end
Summary
Closes #278
index_docwith a single bulk query + in-memory transitive closureLinkedData::Models::Class, cleared in anensureblock after indexingChanges
submission_indexer.rb—compute_ancestors_map,fetch_all_parent_edges,compute_ancestors_forprecompute the full ancestor map;validate_class_ancestorsprovides temporary per-class comparison against the old traversal, gated behindOP_VALIDATE_ANCESTORSenv var, with per-class timing logsclass.rb—ancestors_cacheclass accessor;index_docreads from cache instead of callingretrieve_hierarchy_idstest_ancestors_precompute.rb— 9 unit tests covering linear chain, diamond inheritance, multiple roots, cycles, complex DAG, memoization, and edge casesValidation
Run with
OP_VALIDATE_ANCESTORS=1to enable per-class comparison of precomputed ancestors against the old SPARQL traversal. Every class is checked and logs include timing for old vs new approach plus match/mismatch status. Once validated against production data, the validation code can be removed.Test plan
bundle exec rake test:docker:fs TEST='test/models/test_ancestors_precompute.rb'(9/9)bundle exec rake test:docker:fs TEST='test/models/test_search.rb'OP_VALIDATE_ANCESTORS=1— confirm zero mismatches