Summary
Term indexing performance degrades severely for large ontologies. The root cause is that `index_doc` calls `retrieve_hierarchy_ids(:ancestors)` individually for every class, issuing iterative SPARQL queries level by level to walk up the hierarchy. For an ontology with 100K+ classes at average depth ~8, this produces hundreds of thousands of SPARQL round-trips just to populate the `parents` field in Solr.
Current behavior
During `OntologySubmissionIndexer#index`, for each batch of 2,500 classes:
- `Class.indexBatch(page_classes)` is called
- This calls `indexable_object` → `index_doc` on every class
- Each `index_doc` calls `retrieve_hierarchy_ids(:ancestors)`, which iterates level-by-level up to 40 levels, issuing a SPARQL query at each level
- For a class at depth D, that's D SPARQL round-trips
Total cost: O(N × avg_depth) SPARQL queries, where N is the number of classes.
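To make the cost model concrete, here is a minimal Ruby sketch of the level-by-level walk. The names are simplified stand-ins: `parents_of` represents one SPARQL round-trip, and the real `retrieve_hierarchy_ids` batches classes into one query per level, but the per-level round-trip pattern is the same.

```ruby
# Toy hierarchy: child => direct parents (stand-in for the triple store).
PARENTS = { 'C' => ['B'], 'B' => ['A'], 'A' => [] }

QUERY_COUNT = { sparql: 0 }

# Stand-in for one SPARQL round-trip fetching direct parents.
def parents_of(id)
  QUERY_COUNT[:sparql] += 1
  PARENTS.fetch(id, [])
end

# Level-by-level walk, mirroring retrieve_hierarchy_ids(:ancestors):
# one query per level until the frontier is empty or 40 levels are reached.
def ancestors_of(id, max_levels: 40)
  ancestors = []
  frontier = [id]
  max_levels.times do
    frontier = frontier.flat_map { |c| parents_of(c) }.uniq
    break if frontier.empty?
    ancestors |= frontier
  end
  ancestors
end
```

A class at depth 3 costs three round-trips here; across 100K classes at average depth ~8 this multiplies out to the hundreds of thousands of queries described above.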
Proposed fix
Replace the per-class SPARQL ancestor traversal with a bulk precomputation:
- Before the indexing loop, fetch all parent-child edges in a single paginated SPARQL query
- Build the transitive closure in memory using memoized BFS: each edge is visited at most once, for total work O(V + E)
- Store the precomputed ancestor map as a class-level cache on `LinkedData::Models::Class` for the duration of bulk indexing
- `index_doc` reads ancestors from the cache instead of issuing SPARQL queries
- Clear the cache after indexing completes (`ensure` block)
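The closure step could look like the sketch below. `build_ancestor_map` and the edge-map shape are assumptions, not existing code, and it uses memoized depth-first recursion rather than BFS; on an acyclic hierarchy this computes the same closure with the same O(V + E) bound, since each node's ancestor set is computed once.

```ruby
# Sketch: build a child => all-ancestors map from the bulk edge fetch.
# Assumes `parent_edges` is a Hash of child_id => [direct parent ids],
# produced by one paginated SPARQL query, and that the hierarchy is acyclic.
def build_ancestor_map(parent_edges)
  memo = {}
  compute = lambda do |id|
    memo[id] ||= begin
      direct = parent_edges.fetch(id, [])
      # Ancestors = direct parents plus each parent's (memoized) ancestors.
      direct.each_with_object(direct.dup) do |parent, acc|
        acc.concat(compute.call(parent))
      end.uniq
    end
  end
  parent_edges.each_key { |id| compute.call(id) }
  memo
end
```

Shared ancestors in diamond-shaped hierarchies are deduplicated by the `uniq`, and the memo guarantees each node's set is built only once regardless of how many children reference it.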
This replaces ~800K SPARQL round-trips with 1 paginated SPARQL query + in-memory computation.
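The cache lifecycle (populate before the loop, read during indexing, clear in an `ensure` block) might be wrapped as follows. `KlassStub`, `ancestors_for`, and `with_ancestor_cache` are hypothetical names standing in for the class-level attribute on `LinkedData::Models::Class`.

```ruby
# Hypothetical class-level cache, standing in for an attribute on
# LinkedData::Models::Class during bulk indexing.
class KlassStub
  class << self
    attr_accessor :ancestor_cache
  end

  # index_doc would read from here instead of walking the hierarchy.
  def self.ancestors_for(id)
    ancestor_cache ? ancestor_cache.fetch(id, []) : query_ancestors(id)
  end

  def self.query_ancestors(_id)
    raise 'would issue per-level SPARQL queries'
  end
end

# Populate the cache for the duration of indexing; the ensure block
# clears it even if indexing raises mid-batch.
def with_ancestor_cache(map)
  KlassStub.ancestor_cache = map
  yield
ensure
  KlassStub.ancestor_cache = nil
end
```

Because the cleanup lives in `ensure`, a failed or interrupted indexing run cannot leave a stale ancestor map visible to later submissions.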
Files involved
- `lib/ontologies_linked_data/services/submission_process/operations/submission_indexer.rb` — indexing orchestration
- `lib/ontologies_linked_data/models/class.rb` — `index_doc`, `retrieve_hierarchy_ids`
Additional context
There are secondary performance bottlenecks in the indexer (SPARQL page fetch inside the sync block, `bring_remaining` in the CSV writer under lock) that could be addressed in follow-up work.