
Precompute ancestor hierarchy to speed up term indexing#279

Open
alexskr wants to merge 4 commits into develop from feature/precompute-ancestors-indexing

Conversation


alexskr commented Apr 13, 2026

Summary

Closes #278

  • Replace per-class SPARQL ancestor traversal in index_doc with a single bulk query + in-memory transitive closure
  • Before the indexing loop, fetch all parent-child edges in one paginated SPARQL query and compute ancestors via memoized BFS — O(V + E) total
  • Store the precomputed ancestor map as a class-level cache on LinkedData::Models::Class, cleared in an ensure block after indexing
  • For a 100K-class ontology, this eliminates ~800K SPARQL round-trips
  • Observed 2-5x overall indexing performance improvement on large ontologies
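The bulk-query-plus-memoized-closure approach can be sketched as follows. This is illustrative only: the method names mirror the ones listed under Changes below, but the bodies are assumptions, not the actual submission_indexer.rb code, and the cycle guard here is a simplified stand-in for whatever the real implementation does.

```ruby
require "set"

# parent_edges: class URI => Set of direct-parent URIs, as gathered by
# the single paginated SPARQL query. Computes the full ancestor map.
def compute_ancestors_map(parent_edges)
  memo = {}
  parent_edges.each_key { |id| compute_ancestors_for(id, parent_edges, memo) }
  memo
end

def compute_ancestors_for(id, parent_edges, memo, visiting = Set.new)
  return memo[id] if memo[id]
  return Set.new if visiting.include?(id) # simplified cycle guard
  visiting.add(id)
  ancestors = Set.new
  (parent_edges[id] || Set.new).each do |parent|
    ancestors << parent
    ancestors.merge(compute_ancestors_for(parent, parent_edges, memo, visiting))
  end
  visiting.delete(id)
  memo[id] = ancestors
end
```

With memoization, each class's ancestor set is computed once and reused by every descendant, which is what turns the per-class traversal cost into a single pass over the edge list.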

Changes

  • submission_indexer.rb — compute_ancestors_map, fetch_all_parent_edges, and compute_ancestors_for precompute the full ancestor map; validate_class_ancestors provides a temporary per-class comparison against the old traversal, gated behind the OP_VALIDATE_ANCESTORS env var, with per-class timing logs
  • class.rb — adds an ancestors_cache class accessor; index_doc reads from the cache instead of calling retrieve_hierarchy_ids
  • test_ancestors_precompute.rb — 9 unit tests covering linear chain, diamond inheritance, multiple roots, cycles, complex DAG, memoization, and edge cases

Validation

Run with OP_VALIDATE_ANCESTORS=1 to enable per-class comparison of precomputed ancestors against the old SPARQL traversal. Every class is checked and logs include timing for old vs new approach plus match/mismatch status. Once validated against production data, the validation code can be removed.
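A minimal sketch of what such an env-gated comparison can look like (the helper names and log format here are illustrative assumptions, not the exact code in submission_indexer.rb):

```ruby
require "benchmark"
require "set"

# Gate: validation only runs when explicitly requested via the env var.
def validate_ancestors?(env = ENV)
  env["OP_VALIDATE_ANCESTORS"] == "1"
end

# Compare a precomputed ancestor set against the result of the old
# traversal (supplied as a block) and build a timing/status log line.
def compare_ancestors(class_id, precomputed)
  old_result = nil
  old_time = Benchmark.realtime { old_result = yield }
  status = old_result == precomputed ? "match" : "MISMATCH"
  format("%s: %s (old traversal %.3fs)", class_id, status, old_time)
end
```

Gating on the env var keeps the double computation (old SPARQL traversal plus cache lookup) out of normal indexing runs.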

Test plan

  • Unit tests pass: bundle exec rake test:docker:fs TEST='test/models/test_ancestors_precompute.rb' (9/9)
  • Full indexing pipeline: bundle exec rake test:docker:fs TEST='test/models/test_search.rb'
  • Validation run on a large ontology with OP_VALIDATE_ANCESTORS=1 — confirm zero mismatches

alexskr added 3 commits April 11, 2026 01:30
During term indexing, index_doc called retrieve_hierarchy_ids per class,
issuing iterative SPARQL queries level-by-level to collect ancestors.
For large ontologies (100K+ classes), this produced hundreds of thousands
of SPARQL round-trips.

Replace with a single paginated SPARQL query to fetch all parent-child
edges, then compute the transitive closure in memory using memoized BFS.
The precomputed ancestor map is stored as a class-level cache on
LinkedData::Models::Class for the duration of bulk indexing and cleared
in an ensure block afterward.

Add test_ancestors_precompute.rb covering linear chain, diamond
inheritance, multiple roots, cycles, complex DAG, memoization, and
edge cases. All tests are pure in-memory, no triplestore required.

Add temporary per-class validation in the indexing loop that compares
precomputed ancestors against the old retrieve_hierarchy_ids SPARQL
traversal for every class. Logs warnings on mismatches. To be removed
once validated against production data.

Per-class ancestor validation is expensive (runs both old and new
for every class). Only enable it when explicitly requested via
OP_VALIDATE_ANCESTORS=1 so it does not slow down normal indexing.

alexskr requested a review from mdorf April 13, 2026 06:21

When OP_VALIDATE_ANCESTORS=1 is set, log old vs new timing for each
class and whether ancestors matched or mismatched. Useful for comparing
SPARQL traversal cost against in-memory cache lookup.
mdorf left a comment


This is a SOLID performance optimization, and we should definitely implement it in short order. However, it introduces significant changes to a core model and should NOT be bundled with the upcoming major release. I propose that we revisit this feature SHORTLY after the main release.

  # end
  # path_ids.delete(class_id)
- path_ids = retrieve_hierarchy_ids(:ancestors)
+ path_ids = (self.class.ancestors_cache[class_id] || Set.new).dup
Member


Hard dependency on cache — no fallback (High)

class.rb:270 now unconditionally reads from ancestors_cache:

path_ids = (self.class.ancestors_cache[class_id] || Set.new).dup

If index_doc is ever called outside of bulk indexing (e.g., individual class re-indexing, tests, or other code paths), ancestors_cache will be nil, and calling [class_id] on nil will raise NoMethodError. The commented-out fallback to retrieve_hierarchy_ids should be uncommented:

if self.class.ancestors_cache
  path_ids = (self.class.ancestors_cache[class_id] || Set.new).dup
else
  path_ids = retrieve_hierarchy_ids(:ancestors)
end

Member Author


I was thinking about it, but is there a case where index_doc is called outside of bulk indexing?

# SPARQL ancestor traversal. Keyed by class URI string, values are
# Sets of ancestor URI strings. Set by OntologySubmissionIndexer
# during bulk indexing and cleared after completion.
attr_accessor :ancestors_cache
Member


Thread safety of class-level mutable state (High)

ancestors_cache is a class-level attr_accessor on LinkedData::Models::Class. If two ontologies are indexed concurrently (separate threads/workers), one worker could overwrite or nil out the cache while another is still reading it. Consider:

  • Making the cache per-submission (pass it through the indexing context rather than a global class variable)
  • Or at minimum, documenting/asserting that concurrent indexing is not supported

Member Author


Thread safety of class-level mutable state (High)

The ancestors_cache is computed once before worker threads are spawned. During indexing, threads only read from it (hash lookups + .dup on the Sets) — no writes occur until all threads complete and the cache is cleared.

The class-level state concern would apply if different ontologies were indexed concurrently in the same process. Currently ncbo_cron processes ontologies sequentially, and if we ever move to concurrent indexing, we would likely use separate worker processes subscribing to a queue — each with their own isolated class-level state. In the unlikely event we needed concurrency within a shared process, we could scope the cache via Thread.current (same pattern already used for RequestStore.store[:requested_lang] in this file).
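If per-process isolation ever stopped being enough, the Thread.current scoping mentioned above could look roughly like this. A sketch only: the module, key, and method names are made up for illustration, mirroring the RequestStore pattern rather than any existing code.

```ruby
require "set"

# Thread-local ancestor cache; each indexing thread/worker sees only
# the map it stored, so concurrent indexers cannot clobber each other.
module AncestorsCache
  KEY = :ancestors_cache

  def self.store(map)
    Thread.current[KEY] = map
  end

  # Returns a defensive copy (like the `.dup` in index_doc), or nil
  # when no bulk-indexing cache has been populated for this thread.
  def self.fetch(class_id)
    map = Thread.current[KEY]
    map ? (map[class_id] || Set.new).dup : nil
  end

  def self.clear
    Thread.current[KEY] = nil
  end
end
```

A nil return from fetch would then signal "no bulk-indexing context", which is exactly the case where a fallback to retrieve_hierarchy_ids would apply.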

end

# TODO: Remove once precomputed ancestors are validated against production data
def validate_class_ancestors(cls, logger)
Member


Instance variables leaked from validate_class_ancestors (Medium)

In validate_class_ancestors, @old_ancestors_result and @new_ancestors_result are instance variables on the indexer but are only used within Benchmark.realtime blocks. These should be local variables instead. The current approach leaks state across calls:

old_ancestors = nil
old_time = Benchmark.realtime do
  old_ancestors = cls.retrieve_hierarchy_ids(:ancestors)
  old_ancestors.select! { |x| !x["owl#Thing"] }
end

