Term indexing extremely slow for large ontologies due to per-class SPARQL ancestor traversal #278

@alexskr

Description

Summary

Term indexing performance degrades severely for large ontologies. The root cause is that index_doc calls retrieve_hierarchy_ids(:ancestors) individually for every class, issuing iterative SPARQL queries level-by-level to walk up the hierarchy. For an ontology with 100K+ classes at average depth ~8, this produces hundreds of thousands of SPARQL round-trips just to populate the parents field in Solr.

Current behavior

During OntologySubmissionIndexer#index, for each batch of 2,500 classes:

  1. Class.indexBatch(page_classes) is called
  2. This calls indexable_object.index_doc on every class
  3. Each index_doc calls retrieve_hierarchy_ids(:ancestors) which iterates level-by-level up to 40 levels, issuing a SPARQL query at each level
  4. For a class at depth D, that's D SPARQL round-trips

Total cost: O(N × avg_depth) SPARQL queries, where N is the number of classes.
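The level-by-level walk described above can be sketched as follows. This is an illustrative reconstruction, not the actual retrieve_hierarchy_ids implementation; query_parents stands in for the per-level SPARQL query, and the 40-level cap mirrors the limit mentioned above:

```ruby
MAX_LEVELS = 40

# Walk up the hierarchy one level at a time. query_parents is a callable
# standing in for the per-level SPARQL query: it maps a frontier of class
# ids to their direct parents, at the cost of one round-trip per call.
def retrieve_hierarchy_ids_sketch(class_id, query_parents)
  ancestors = []
  frontier = [class_id]
  MAX_LEVELS.times do
    break if frontier.empty?
    parents = frontier.flat_map { |id| query_parents.call(id) }.uniq
    parents -= ancestors          # don't revisit already-seen ancestors
    ancestors.concat(parents)
    frontier = parents            # next SPARQL query starts from here
  end
  ancestors
end
```

For a class at depth D this issues D (or D + 1) queries, which is exactly the O(N × avg_depth) total when repeated for every class.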

Proposed fix

Replace the per-class SPARQL ancestor traversal with a bulk precomputation:

  1. Before the indexing loop, fetch all parent-child edges in a single paginated SPARQL query
  2. Build the transitive closure in memory using memoized BFS — each edge visited at most once, total work O(V + E)
  3. Store the precomputed ancestor map as a class-level cache on LinkedData::Models::Class for the duration of bulk indexing
  4. index_doc reads ancestors from the cache instead of issuing SPARQL queries
  5. Clear the cache after indexing completes (ensure block)

This replaces ~800K SPARQL round-trips with 1 paginated SPARQL query + in-memory computation.
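The in-memory closure (step 2) could look roughly like this. A minimal sketch, assuming the single paginated SPARQL query has already produced a Hash from child id to direct parent ids; the method name and shape are hypothetical:

```ruby
# Build the full ancestor map from direct parent-child edges in memory.
# Memoized recursion visits each node once and each edge at most once,
# so total work is O(V + E) regardless of hierarchy depth.
def build_ancestor_map(edges)
  memo = {}
  visit = lambda do |id|
    memo[id] ||= begin
      memo[id] = []                        # in-progress marker guards against cycles
      parents = edges.fetch(id, [])
      (parents + parents.flat_map { |p| visit.call(p) }).uniq
    end
  end
  edges.each_key { |id| visit.call(id) }
  memo
end
```

index_doc would then read memo[class_id] instead of issuing SPARQL queries, with the map stored in the class-level cache for the duration of the bulk run and cleared in an ensure block.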

Files involved

  • lib/ontologies_linked_data/services/submission_process/operations/submission_indexer.rb — indexing orchestration
  • lib/ontologies_linked_data/models/class.rb — index_doc, retrieve_hierarchy_ids

Additional context

There are secondary performance bottlenecks in the indexer (SPARQL page fetch inside sync block, bring_remaining in CSV writer under lock) that could be addressed in follow-up work.
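For the first of those follow-ups, the usual shape of the fix is to perform the slow fetch outside the critical section and hold the lock only for the cheap shared-state update. A sketch under that assumption, with hypothetical names (the real sync block lives in submission_indexer.rb):

```ruby
# Fetch pages concurrently; only the merge into shared state is serialized.
def collect_pages(page_ids, fetcher)
  mutex = Mutex.new
  results = []
  threads = page_ids.map do |pid|
    Thread.new do
      page = fetcher.call(pid)                   # slow SPARQL I/O, no lock held
      mutex.synchronize { results.concat(page) } # short critical section
    end
  end
  threads.each(&:join)
  results
end
```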
