
Precompute ancestor hierarchy to speed up term indexing#279

Open
alexskr wants to merge 4 commits into develop from feature/precompute-ancestors-indexing

Conversation


alexskr commented Apr 13, 2026

Summary

Closes #278

  • Replace per-class SPARQL ancestor traversal in index_doc with a single bulk query + in-memory transitive closure
  • Before the indexing loop, fetch all parent-child edges in one paginated SPARQL query and compute ancestors via memoized BFS — O(V + E) total
  • Store the precomputed ancestor map as a class-level cache on LinkedData::Models::Class, cleared in an ensure block after indexing
  • For a 100K-class ontology, this eliminates ~800K SPARQL round-trips
  • Observed 2-5x overall indexing performance improvement on large ontologies
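The bulk-query-plus-memoized-closure approach can be sketched as follows. This is illustrative only: the method names mirror the ones listed under Changes below, but the bodies are assumptions, not the actual submission_indexer.rb code, and the cycle guard here is a simplified stand-in for whatever the real implementation does.

```ruby
require "set"

# parent_edges: class URI => Set of direct-parent URIs, as gathered by
# the single paginated SPARQL query. Computes the full ancestor map.
def compute_ancestors_map(parent_edges)
  memo = {}
  parent_edges.each_key { |id| compute_ancestors_for(id, parent_edges, memo) }
  memo
end

def compute_ancestors_for(id, parent_edges, memo, visiting = Set.new)
  return memo[id] if memo[id]
  return Set.new if visiting.include?(id) # simplified cycle guard
  visiting.add(id)
  ancestors = Set.new
  (parent_edges[id] || Set.new).each do |parent|
    ancestors << parent
    ancestors.merge(compute_ancestors_for(parent, parent_edges, memo, visiting))
  end
  visiting.delete(id)
  memo[id] = ancestors
end
```

With memoization, each class's ancestor set is computed once and reused by every descendant, which is what turns the per-class traversal cost into a single pass over the edge list.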

Changes

  • submission_indexer.rb — compute_ancestors_map, fetch_all_parent_edges, and compute_ancestors_for precompute the full ancestor map; validate_class_ancestors provides a temporary per-class comparison against the old traversal, gated behind the OP_VALIDATE_ANCESTORS env var, with per-class timing logs
  • class.rb — adds an ancestors_cache class accessor; index_doc reads from the cache instead of calling retrieve_hierarchy_ids
  • test_ancestors_precompute.rb — 9 unit tests covering linear chain, diamond inheritance, multiple roots, cycles, complex DAG, memoization, and edge cases

Validation

Run with OP_VALIDATE_ANCESTORS=1 to enable per-class comparison of precomputed ancestors against the old SPARQL traversal. Every class is checked and logs include timing for old vs new approach plus match/mismatch status. Once validated against production data, the validation code can be removed.
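A minimal sketch of what such an env-gated comparison can look like (the helper names and log format here are illustrative assumptions, not the exact code in submission_indexer.rb):

```ruby
require "benchmark"
require "set"

# Gate: validation only runs when explicitly requested via the env var.
def validate_ancestors?(env = ENV)
  env["OP_VALIDATE_ANCESTORS"] == "1"
end

# Compare a precomputed ancestor set against the result of the old
# traversal (supplied as a block) and build a timing/status log line.
def compare_ancestors(class_id, precomputed)
  old_result = nil
  old_time = Benchmark.realtime { old_result = yield }
  status = old_result == precomputed ? "match" : "MISMATCH"
  format("%s: %s (old traversal %.3fs)", class_id, status, old_time)
end
```

Gating on the env var keeps the double computation (old SPARQL traversal plus cache lookup) out of normal indexing runs.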

Test plan

  • Unit tests pass: bundle exec rake test:docker:fs TEST='test/models/test_ancestors_precompute.rb' (9/9)
  • Full indexing pipeline: bundle exec rake test:docker:fs TEST='test/models/test_search.rb'
  • Validation run on a large ontology with OP_VALIDATE_ANCESTORS=1 — confirm zero mismatches

alexskr added 3 commits April 11, 2026 01:30
During term indexing, index_doc called retrieve_hierarchy_ids per class,
issuing iterative SPARQL queries level-by-level to collect ancestors.
For large ontologies (100K+ classes), this produced hundreds of thousands
of SPARQL round-trips.

Replace with a single paginated SPARQL query to fetch all parent-child
edges, then compute the transitive closure in memory using memoized BFS.
The precomputed ancestor map is stored as a class-level cache on
LinkedData::Models::Class for the duration of bulk indexing and cleared
in an ensure block afterward.

Add test_ancestors_precompute.rb covering linear chain, diamond
inheritance, multiple roots, cycles, complex DAG, memoization, and
edge cases. All tests are pure in-memory, no triplestore required.

Add temporary per-class validation in the indexing loop that compares
precomputed ancestors against the old retrieve_hierarchy_ids SPARQL
traversal for every class. Logs warnings on mismatches. To be removed
once validated against production data.

Per-class ancestor validation is expensive (runs both old and new
for every class). Only enable it when explicitly requested via
OP_VALIDATE_ANCESTORS=1 so it does not slow down normal indexing.

alexskr requested a review from mdorf April 13, 2026 06:21

When OP_VALIDATE_ANCESTORS=1 is set, log old vs new timing for each
class and whether ancestors matched or mismatched. Useful for comparing
SPARQL traversal cost against in-memory cache lookup.
mdorf left a comment


This is a SOLID performance optimization, and we should definitely implement it in short order. However, it introduces significant changes to a core model and should NOT be bundled with the upcoming major release. I propose that we revisit this feature SHORTLY after the main release.

  # end
  # path_ids.delete(class_id)
- path_ids = retrieve_hierarchy_ids(:ancestors)
+ path_ids = (self.class.ancestors_cache[class_id] || Set.new).dup
Member


Hard dependency on cache — no fallback (High)

class.rb:270 now unconditionally reads from ancestors_cache:

path_ids = (self.class.ancestors_cache[class_id] || Set.new).dup

If index_doc is ever called outside of bulk indexing (e.g., individual class re-indexing, tests, or other code paths), ancestors_cache will be nil, and calling [class_id] on nil will raise NoMethodError. The commented-out fallback to retrieve_hierarchy_ids should be uncommented:

if self.class.ancestors_cache
  path_ids = (self.class.ancestors_cache[class_id] || Set.new).dup
else
  path_ids = retrieve_hierarchy_ids(:ancestors)
end

Member Author


I was thinking about it, but is there a case where index_doc is called outside of bulk indexing?

# SPARQL ancestor traversal. Keyed by class URI string, values are
# Sets of ancestor URI strings. Set by OntologySubmissionIndexer
# during bulk indexing and cleared after completion.
attr_accessor :ancestors_cache
Member


Thread safety of class-level mutable state (High)

ancestors_cache is a class-level attr_accessor on LinkedData::Models::Class. If two ontologies are indexed concurrently (separate threads/workers), one worker could overwrite or nil out the cache while another is still reading it. Consider:

  • Making the cache per-submission (pass it through the indexing context rather than a global class variable)
  • Or at minimum, documenting/asserting that concurrent indexing is not supported

Member Author


Thread safety of class-level mutable state (High)

The ancestors_cache is computed once before worker threads are spawned. During indexing, threads only read from it (hash lookups + .dup on the Sets) — no writes occur until all threads complete and the cache is cleared.

The class-level state concern would apply if different ontologies were indexed concurrently in the same process. Currently ncbo_cron processes ontologies sequentially, and if we ever move to concurrent indexing, we would likely use separate worker processes subscribing to a queue — each with their own isolated class-level state. In the unlikely event we needed concurrency within a shared process, we could scope the cache via Thread.current (same pattern already used for RequestStore.store[:requested_lang] in this file).
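If per-process isolation ever stopped being enough, the Thread.current scoping mentioned above could look roughly like this. A sketch only: the module, key, and method names are made up for illustration, mirroring the RequestStore pattern rather than any existing code.

```ruby
require "set"

# Thread-local ancestor cache; each indexing thread/worker sees only
# the map it stored, so concurrent indexers cannot clobber each other.
module AncestorsCache
  KEY = :ancestors_cache

  def self.store(map)
    Thread.current[KEY] = map
  end

  # Returns a defensive copy (like the `.dup` in index_doc), or nil
  # when no bulk-indexing cache has been populated for this thread.
  def self.fetch(class_id)
    map = Thread.current[KEY]
    map ? (map[class_id] || Set.new).dup : nil
  end

  def self.clear
    Thread.current[KEY] = nil
  end
end
```

A nil return from fetch would then signal "no bulk-indexing context", which is exactly the case where a fallback to retrieve_hierarchy_ids would apply.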

end

# TODO: Remove once precomputed ancestors are validated against production data
def validate_class_ancestors(cls, logger)
Member


Instance variables leaked from validate_class_ancestors (Medium)

In validate_class_ancestors, @old_ancestors_result and @new_ancestors_result are instance variables on the indexer but are only used within Benchmark.realtime blocks. These should be local variables instead. The current approach leaks state across calls:

old_ancestors = nil
old_time = Benchmark.realtime do
  old_ancestors = cls.retrieve_hierarchy_ids(:ancestors)
  old_ancestors.select! { |x| !x["owl#Thing"] }
end

