
feature(filter-subnetwork): Add feature to filter the subnetwork from INDRA by query context #84

Merged
tonywu1999 merged 8 commits into devel from feature-cosine-context on Mar 11, 2026

Conversation

@tonywu1999 (Contributor) commented Mar 10, 2026

Motivation and Context

Large INDRA-derived protein–protein interaction subnetworks include edges supported by literature across many biological contexts. This PR adds a feature to score evidence texts (PubMed abstracts) against a user-supplied query and filter the subnetwork to retain edges supported by contextually relevant evidence. Two scoring approaches are provided: tag-count matching and TF‑IDF cosine similarity. The goal is to let users focus subnetworks on literature contexts relevant to their research question.

Short solution summary

  • Adds exported function filterSubnetworkByContext(nodes, edges, query, cutoff = NULL, method = c("tag_count", "cosine")) that:
    • Extracts evidence text for edges via INDRA,
    • Fetches PubMed abstracts for evidence PMIDs (rentrez),
    • Scores abstracts against the query using tag-count or TF‑IDF cosine similarity,
    • Applies a cutoff to determine passing PMIDs and filters evidence, edges, and nodes,
    • Returns a list(nodes, edges, evidence) with scores (a usage sketch follows this list).
  • Implements internal helpers for INDRA evidence querying, PubMed fetching/cleaning, and scoring.
  • Adds documentation (man page) and a vignette demonstrating workflows.
  • Updates DESCRIPTION and NAMESPACE to declare new imports and export.
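
For orientation, here is a minimal usage sketch. The subnetwork object, its column layout, and the query strings are illustrative assumptions; only the filterSubnetworkByContext() signature comes from this PR, and both calls hit the live INDRA and NCBI APIs:

    # Hypothetical input, e.g. from getSubnetworkFromIndra(); columns assumed
    subnetwork <- list(
        nodes = data.frame(id = c("TP53", "BRCA1", "EGFR")),
        edges = data.frame(source    = c("TP53", "EGFR"),
                           target    = c("BRCA1", "TP53"),
                           stmt_hash = c("hash1", "hash2"))
    )

    # Cosine method: query is a single string; default cutoff is 0.10
    result <- filterSubnetworkByContext(
        nodes  = subnetwork$nodes,
        edges  = subnetwork$edges,
        query  = "DNA damage response and double-strand break repair",
        method = "cosine"
    )

    # Tag-count method: query is a character vector of tags; default cutoff is 1
    result_tags <- filterSubnetworkByContext(
        nodes  = subnetwork$nodes,
        edges  = subnetwork$edges,
        query  = c("dna damage", "double strand break"),
        method = "tag_count"
    )

    str(result, max.level = 1)  # list(nodes, edges, evidence) with per-evidence scores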

Detailed changes

  • DESCRIPTION

    • Added Imports: text2vec, stopwords, xml2, rentrez (in addition to existing stats). (+5/-1)
  • NAMESPACE

    • Export: filterSubnetworkByContext
    • Added importFrom entries:
      • httr: content_type_json
      • rentrez: entrez_fetch
      • stopwords: stopwords
      • text2vec: TfIdf, create_dtm, create_vocabulary, fit_transform, itoken, prune_vocabulary, vocab_vectorizer, word_tokenizer
      • xml2: read_xml, xml_find_all, xml_text
        (+15/-0)
  • R/filterSubnetworkByContext.R (new; ~364 lines)

    • Public API: filterSubnetworkByContext(nodes, edges, query, cutoff = NULL, method = c("tag_count", "cosine"))
      • Defaults: method = "tag_count" (default cutoff 1); method = "cosine" (default cutoff 0.10).
      • Validates inputs and emits progress/info messages and warnings.
      • Workflow: .extract_evidence_text → collect PMIDs → .fetch_clean_abstracts_xml → scoring (.score_by_tag_count or .score_by_cosine) → identify passing PMIDs → merge scores into evidence → filter evidence/edges/nodes → return results.
    • Internal helpers:
      • .extract_evidence_text(df): queries INDRA per statement hash and returns evidence rows with text and pmid; warns if none found.
      • .fetch_clean_abstracts_xml(pmids): fetches/cleans PubMed abstracts via rentrez; handles rate-limiting and errors; returns named pmid→abstract list.
      • .query_indra_evidence(stmt_hash): INDRA API call helper with HTTP error handling.
      • .score_by_tag_count(abstracts, tags): case-insensitive substring counting (a sketch follows this change list).
      • .score_by_cosine(query, abstracts): TF‑IDF vectorization (text2vec) and cosine similarity.
    • Behavior: graceful handling of missing evidence/PMIDs or API failures; extensive console progress output.
  • man/filterSubnetworkByContext.Rd (new)

    • Documentation for filterSubnetworkByContext: usage, arguments, return value, and description.
  • vignettes/Filter-By-Context.Rmd (new)

    • End-to-end vignette showing usage with tag_count and cosine methods and guidance on queries/cutoffs.
  • R/utils_getSubnetworkFromIndra.R

    • Non-functional change: two trailing blank lines added.
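
To make the tag-count matching concrete, here is a hedged sketch of what .score_by_tag_count plausibly does. The body and the NA policy are assumptions; returning NA for missing abstracts follows the reviewers' later suggestion rather than necessarily the merged code:

    # Counts how many tags appear in each abstract as case-insensitive
    # fixed substrings; missing abstracts score NA (unknown), not zero.
    score_by_tag_count_sketch <- function(abstracts, tags) {
        vapply(abstracts, function(abstract) {
            if (is.na(abstract) || !nzchar(abstract)) return(NA_integer_)
            sum(vapply(tags, function(tag) {
                grepl(tolower(tag), tolower(abstract), fixed = TRUE)
            }, logical(1)))
        }, integer(1))
    }

    score_by_tag_count_sketch(
        list(`111` = "We study DNA damage repair after double strand breaks."),
        c("dna damage", "double strand break")
    )
    #> 111
    #>   2

Note the tags are space-separated phrases; as a later review comment points out, underscore_joined tokens would never match PubMed prose under fixed substring matching.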

Unit tests added or modified

  • New test file tests/testthat/test-filterSubnetworkByContext.R added (fixtures and tests).
    • Tests for helpers:
      • .score_by_tag_count: verifies counts and zero behavior.
      • .score_by_cosine: checks numeric vector output and that relevant abstracts score higher than irrelevant ones.
    • Tests for evidence/abstract helpers:
      • .extract_evidence_text: validates input column checks and behavior when INDRA yields no evidence (using mocks).
      • .fetch_clean_abstracts_xml: verifies handling of empty input and simulated API errors (using mocks).
    • End-to-end test for filterSubnetworkByContext:
      • Uses mockery to stub evidence extraction and abstract fetching, asserts returned structure (nodes, edges, evidence), ensures only matching edges pass cutoff, and that evidence includes a non-negative score column (a simplified offline-test sketch follows this list).
  • Despite added tests, Codecov reports the new R/filterSubnetworkByContext.R has 174 missing lines and 0% patch coverage; project coverage on the head branch dropped to ~66.3% from ~80.1% on devel (≈ −13.7%).
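
For reference, a simplified offline-test sketch in the spirit of the added suite. Fixture columns, helper names, and stubbing targets are assumptions inferred from this PR's description and review thread, not the merged test file:

    library(testthat)
    library(mockery)

    test_that("filterSubnetworkByContext keeps only contextually supported edges", {
        nodes <- data.frame(id = c("TP53", "BRCA1", "EGFR"))
        edges <- data.frame(source    = c("TP53", "EGFR"),
                            target    = c("BRCA1", "TP53"),
                            stmt_hash = c("h1", "h2"))
        evidence <- data.frame(source       = edges$source,
                               target       = edges$target,
                               interaction  = "Phosphorylation",
                               site         = NA_character_,
                               evidenceLink = NA_character_,
                               stmt_hash    = edges$stmt_hash,
                               text         = c("ev1", "ev2"),
                               pmid         = c("111", "222"))

        # Stub the network-touching helpers so the test runs offline
        stub(filterSubnetworkByContext, ".extract_evidence_text",
             function(df) evidence)
        stub(filterSubnetworkByContext, ".fetch_clean_abstracts_xml",
             function(pmids) list(`111` = "dna damage repair pathway activation",
                                  `222` = "an unrelated metabolic study"))

        result <- filterSubnetworkByContext(nodes, edges,
                                            query  = c("dna damage"),
                                            method = "tag_count")

        expect_named(result, c("nodes", "edges", "evidence"))
        expect_true(all(result$evidence$score >= 0, na.rm = TRUE))
    })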

Coding guidelines / policy violations

  • Substantial new functionality includes external network calls (INDRA, NCBI/rentrez) and TF‑IDF processing, but the Codecov report indicates the new implementation is largely untested at the patch level (174 uncovered lines). This falls short of coverage expectations for new public-API behavior.
  • Although tests were added, they mock internal data retrieval; additional targeted unit tests exercising scoring logic and integration tests with recorded fixtures or stronger mocks are needed to cover the new lines and reduce reliance on live network I/O.
  • New runtime dependencies (text2vec, stopwords, xml2, rentrez) were added; CI adjustments (dependency installation or mocking) are not present in the PR.
  • No explicit evidence of code formatting/style checks (e.g., styler) being applied; style compliance not confirmed.

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough


Adds exported function filterSubnetworkByContext with helpers to retrieve INDRA evidence and PubMed abstracts, score evidence by tag counts or TF‑IDF cosine similarity, and filter nodes/edges; updates DESCRIPTION and NAMESPACE imports; adds docs, vignette, and tests.

Changes

  • Package Dependencies (DESCRIPTION): Added Imports: text2vec, stopwords, xml2, rentrez (retains stats).
  • Namespace & Exports (NAMESPACE): Exported filterSubnetworkByContext; added importFrom entries for httr::content_type_json, rentrez::entrez_fetch, stopwords::stopwords, multiple text2vec functions, and xml2 functions.
  • Core Implementation (R/filterSubnetworkByContext.R): New exported filterSubnetworkByContext(nodes, edges, query, cutoff = NULL, method = c("tag_count", "cosine")) plus internal helpers: .extract_evidence_text, .fetch_clean_abstracts_xml, .query_indra_evidence, .score_by_tag_count, .score_by_cosine. Implements INDRA queries, PubMed fetching, scoring, and filtering.
  • Utilities (R/utils_getSubnetworkFromIndra.R): Two trailing blank lines added; no behavioral change.
  • Docs & Examples (man/filterSubnetworkByContext.Rd, vignettes/Filter-By-Context.Rmd): Added Rd documentation and a vignette demonstrating tag_count and cosine workflows with examples and guidance.
  • Tests (tests/testthat/test-filterSubnetworkByContext.R): New tests covering scoring helpers, evidence extraction, PubMed fetching (mocked), and end-to-end filtering behavior/structures.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant Package as Package
    participant INDRA as INDRA API
    participant PubMed as PubMed (rentrez)
    participant TextProc as Text Processing (text2vec)

    User->>Package: call filterSubnetworkByContext(nodes, edges, query, ...)
    Package->>INDRA: request evidence per stmt_hash
    INDRA-->>Package: evidence text + PMIDs
    Package->>PubMed: fetch abstracts for PMIDs
    PubMed-->>Package: abstracts
    Package->>TextProc: compute tag counts or TF‑IDF + cosine vs query
    TextProc-->>Package: similarity scores
    Package->>Package: attach scores, filter evidence, compute surviving nodes/edges
    Package-->>User: return filtered nodes, edges, evidence
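
The scoring step above can be sketched with text2vec roughly as follows. This is an assumption-laden reconstruction: the function name and return shape are illustrative, sim2() is used for brevity even though it does not appear in the PR's import list, and the itoken iterator is deliberately recreated before create_dtm() because text2vec iterators are consumed once read (a point raised later in this review):

    library(text2vec)
    library(stopwords)

    score_by_cosine_sketch <- function(query, abstracts) {
        texts  <- c(query, unlist(abstracts, use.names = FALSE))
        tokens <- itoken(texts, preprocessor = tolower,
                         tokenizer = word_tokenizer, progressbar = FALSE)
        vocab  <- prune_vocabulary(
            create_vocabulary(tokens, stopwords = stopwords("en")),
            term_count_min = 1L
        )
        vectorizer <- vocab_vectorizer(vocab)

        # itoken iterators are consumed; build a fresh one before create_dtm()
        tokens2 <- itoken(texts, preprocessor = tolower,
                          tokenizer = word_tokenizer, progressbar = FALSE)
        dtm       <- create_dtm(tokens2, vectorizer)
        dtm_tfidf <- fit_transform(dtm, TfIdf$new())

        # cosine similarity of the query row against each abstract row
        sims <- sim2(dtm_tfidf[1, , drop = FALSE],
                     dtm_tfidf[-1, , drop = FALSE], method = "cosine")
        setNames(as.numeric(sims), names(abstracts))
    }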

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Review effort 3/5

Poem

🐰 I hopped through abstracts, tags, and scores,
Fetching PMIDs, opening literature doors,
TF‑IDF and counts helped me decide,
Which edges stay and which subnets hide,
A rabbit's tidy filter — hop, score, abide.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check: ⚠️ Warning. The PR description is missing entirely; the template sections for Motivation/Context, Changes, Testing, and checklist are all empty placeholders with no substantive content. Resolution: fill in all required template sections with detailed information: explain motivation, list all changes (DESCRIPTION, NAMESPACE, new function, documentation, vignette, tests), describe the testing approach, and complete the pre-review checklist.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The PR title clearly and specifically describes the main feature: adding a filter for subnetworks from INDRA by query context, which matches the core change.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 112-114: Replace the positional column access nodes[[1]] with an
explicit nodes$id reference in the nodes_filtered assignment, and add a
pre-check that the nodes data.frame contains an "id" column (e.g., stop with a
clear error if !"id" %in% colnames(nodes)); update the nodes_filtered creation
that currently uses surviving_nodes and edges_filtered to filter by nodes$id to
avoid silent breakage when columns are reordered; this keeps behavior consistent
with other functions like getSubnetworkFromIndra() and fails fast when the
required column is missing.
- Around line 93-106: Remove the forced conversion of missing similarity values
to 0 (the line setting evidence_scored$similarity[is.na(...)] <- 0) so
unknown/failed-fetch abstracts remain NA, and update the logic that computes
passing_pmids (the filter that uses similarity >= cutoff) to treat NA as passing
by using a condition like is.na(similarity) OR similarity >= cutoff; this
ensures evidence_scored, evidence_filtered, abstracts_df, similarity and
passing_pmids retain and preserve unknown scores rather than conflating them
with zero.
- Around line 217-220: Replace the single-section extraction using
xml_find_first with a concatenation of all <AbstractText> nodes: use
xml_find_all(doc, ".//AbstractText") to retrieve every section, combine their
xml_text() values (e.g., collapse with a space) and assign the combined string
to results[[pmid]] instead of the single abstract_node; additionally update the
package metadata by adding xml_find_all to the roxygen `@importFrom` xml2 line so
the corresponding importFrom() entry is regenerated in the NAMESPACE and the
function is available.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f8397e58-b24c-4ce8-aa8e-d6f237bd05a8

📥 Commits

Reviewing files that changed from the base of the PR and between 5b3b03d and dc9fb91.

📒 Files selected for processing (6)
  • DESCRIPTION
  • NAMESPACE
  • R/filterSubnetworkByContext.R
  • R/utils_getSubnetworkFromIndra.R
  • man/filterSubnetworkByContext.Rd
  • vignettes/Filter-By-Context.Rmd

codecov-commenter commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 62.30366% with 72 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.79%. Comparing base (f02d61d) to head (d4bac72).

Files with missing lines          Patch %   Lines
R/filterSubnetworkByContext.R     62.30%    72 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel      #84      +/-   ##
==========================================
- Coverage   80.09%   76.79%   -3.30%     
==========================================
  Files           7        8       +1     
  Lines         839     1030     +191     
==========================================
+ Hits          672      791     +119     
- Misses        167      239      +72     

☔ View full report in Codecov by Sentry.

@tonywu1999 force-pushed the feature-cosine-context branch from dc9fb91 to 18dae12 on March 11, 2026 at 16:03
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (2)
R/filterSubnetworkByContext.R (2)

139-140: ⚠️ Potential issue | 🟡 Minor

Use nodes$id instead of positional nodes[[1]].

This was flagged in a previous review and remains unaddressed. The function is exported and may receive dataframes with reordered columns. Using nodes[[1]] silently breaks in that case.

Proposed fix
     surviving_nodes  <- union(edges_filtered$source, edges_filtered$target)
-    nodes_filtered   <- nodes[nodes[[1]] %in% surviving_nodes, ]   # assumes first col is node ID
+    if (!"id" %in% names(nodes)) {
+        stop("`nodes` must contain an `id` column.")
+    }
+    nodes_filtered   <- nodes[nodes$id %in% surviving_nodes, ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 139 - 140, Replace the positional
column access nodes[[1]] in the nodes filtering step with the explicit column
name nodes$id to avoid breaking when columns are reordered: update the
expression that builds nodes_filtered (which currently uses surviving_nodes from
edges_filtered$source/target) to use nodes$id %in% surviving_nodes instead of
nodes[[1]] %in% surviving_nodes; ensure this change is applied where
surviving_nodes, edges_filtered and nodes_filtered are referenced.

305-325: ⚠️ Potential issue | 🟠 Major

Concatenate all abstract sections instead of truncating to the first.

This was flagged in a previous review and remains unaddressed. PubMed structured abstracts contain multiple <AbstractText> nodes (OBJECTIVE, METHODS, RESULTS, CONCLUSIONS). Using xml_find_first() retrieves only the first section, causing relevance scoring to miss matches in other sections.

Proposed fix
-            abstract_node <- xml_find_first(doc, ".//AbstractText")
-            
-            if (!is.na(abstract_node)) {
-                results[[pmid]] <- xml_text(abstract_node)
+            abstract_nodes <- xml_find_all(doc, ".//AbstractText")
+            
+            if (length(abstract_nodes) > 0) {
+                results[[pmid]] <- paste(trimws(xml_text(abstract_nodes)), collapse = " ")
             }

Also add xml_find_all to the @importFrom directive at line 298:

-#' @importFrom xml2 read_xml xml_find_first xml_text
+#' @importFrom xml2 read_xml xml_find_all xml_find_first xml_text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 305 - 325, The loop currently
uses xml_find_first to grab only the first <AbstractText>; change it to use
xml_find_all on the parsed doc (replace the abstract_node variable with
abstract_nodes <- xml_find_all(doc, ".//AbstractText")), then if any nodes are
found collapse their texts into a single string (e.g.,
paste(xml_text(abstract_nodes), collapse = " ")) and store that in
results[[pmid]]; also adjust the presence check from is.na(abstract_node) to
testing length(abstract_nodes) > 0 (or similar). Finally, update the Roxygen
`@importFrom` directive that currently lists xml_find_first to also include
xml_find_all so the function is imported.
🧹 Nitpick comments (3)
R/filterSubnetworkByContext.R (3)

247-275: Consider adding rate limiting for INDRA API calls.

Unlike .fetch_clean_abstracts_xml() which includes Sys.sleep(0.34) to respect NCBI rate limits, this function makes INDRA API calls without throttling. For networks with many unique statement hashes, this could overwhelm the INDRA API or trigger rate limiting.

         evidence_list    <- .query_indra_evidence(stmt_hash)
         if (is.null(evidence_list) || length(evidence_list) == 0) next
+        
+        Sys.sleep(0.1)  # respect API rate limits
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 247 - 275, The loop over
unique_hashes calls .query_indra_evidence repeatedly without throttling; add
rate limiting (e.g., a short Sys.sleep delay or a configurable backoff)
immediately after each .query_indra_evidence call inside the for (i in
seq_along(unique_hashes)) loop (or on each iteration before continuing) to avoid
overwhelming the INDRA API—make the delay configurable (default similar to
.fetch_clean_abstracts_xml's 0.34s) and apply it around the
.query_indra_evidence invocation and before continuing to the next stmt_hash.

122-134: Consider preserving evidence with unfetched abstracts.

Setting NA scores to 0 (line 128) conflates transient fetch failures with genuinely low-scoring abstracts. Evidence backed by unfetchable abstracts is silently removed.

Consider either:

  1. Preserving NA-scored evidence in the output (let user decide), or
  2. Adding a warning when significant data is lost:
     evidence_scored$score[is.na(evidence_scored$score)] <- 0
+    na_count <- sum(is.na(evidence_scored$score))
+    if (na_count > 0) {
+        warning(sprintf("%d evidence rows had no abstract score (fetch failed); treated as score=0.", na_count))
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 122 - 134, Do not coerce missing
abstract scores to 0: remove the line that sets
evidence_scored$score[is.na(...)] <- 0 so NA remains, and leave
evidence_filtered as-is; additionally, add a warning when there are any NA
scores (e.g., compute n_na <- sum(is.na(evidence_scored$score)) and if n_na > 0
call warning() with the count/percentage) so callers are informed about
unfetched abstracts while preserving those rows (refer to symbols
evidence_scored, score, evidence_filtered, passing_pmids).

64-81: Add validation for cutoff parameter.

While query validation was added (addressing previous review), cutoff is not validated. Invalid values (e.g., negative for tag_count, non-numeric, NA) will cause silent misbehavior.

Proposed fix
     if (method == "tag_count") {
         if (!is.character(query) || length(query) < 1) {
             stop("`query` must be a character vector of tags when method = 'tag_count'.")
         }
         if (is.null(cutoff)) cutoff <- 1
+        if (!is.numeric(cutoff) || length(cutoff) != 1L || is.na(cutoff) || cutoff < 0) {
+            stop("`cutoff` must be a non-negative integer when method = 'tag_count'.")
+        }
         cat(sprintf(
             "Method: tag_count | Tags: %d | Cutoff: >= %d tag(s)\n",
             length(query), cutoff
         ))
     } else {
         if (!is.character(query) || length(query) != 1) {
             stop("`query` must be a single character string when method = 'cosine'.")
         }
         if (is.null(cutoff)) cutoff <- 0.10
+        if (!is.numeric(cutoff) || length(cutoff) != 1L || is.na(cutoff) ||
+            !is.finite(cutoff) || cutoff < -1 || cutoff > 1) {
+            stop("`cutoff` must be a single numeric value in [-1, 1] when method = 'cosine'.")
+        }
         cat(sprintf(
             "Method: cosine | Cutoff: >= %.2f\n", cutoff
         ))
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 64 - 81, Validate the cutoff
parameter after determining its default but before using it: for method ==
"tag_count" ensure cutoff is a single non-NA numeric/integer >= 1 (or
coerce/round if you expect integer counts) and for method == "cosine" ensure
cutoff is a single non-NA numeric between 0 and 1 (inclusive/exclusive as
appropriate); if validation fails, call stop() with a clear message. Insert
these checks near the existing method branches (referencing the method variable
and cutoff) inside the function in R/filterSubnetworkByContext.R so the code
sets defaults then validates cutoff before the cat(sprintf(...)) and any
downstream logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@man/filterSubnetworkByContext.Rd`:
- Around line 1-37: The man page is stale: regenerate it from the roxygen2
comments in the implementation by running devtools::document() (or
roxygen2::roxygenise()), then verify the generated Rd for
filterSubnetworkByContext matches the function signature and tags in
R/filterSubnetworkByContext.R (parameters: query, cutoff, method =
c("tag_count","cosine"), correct default for query if any, and output column
name score and updated title mentioning both methods); if any roxygen tags in
the R file are missing or incorrect, fix the `@param`/`@return`/`@title` comments in
the filterSubnetworkByContext roxygen block before re-running documentation.

In `@NAMESPACE`:
- Around line 41-43: The NAMESPACE is missing the xml_find_all import needed for
handling PubMed structured abstracts; add importFrom(xml2, xml_find_all) so that
the updated .fetch_clean_abstracts_xml() (and related changes referenced in
filterSubnetworkByContext.R) can call xml_find_all() without namespace errors.

In `@R/filterSubnetworkByContext.R`:
- Around line 195-203: The itoken iterator created by itoken(all_texts,
preprocessor = tolower, tokenizer = word_tokenizer) is consumed by
create_vocabulary(...) and must be recreated before calling create_dtm(...);
update the code so you re-run itoken(...) (or otherwise create a fresh iterator
wrapper) immediately before vectorizer <- vocab_vectorizer(vocab) / dtm <-
create_dtm(tokens, vectorizer) to ensure create_dtm receives a fresh iterator;
keep the rest of the pipeline (TfIdf$new(), fit_transform(dtm, tfidf))
unchanged.

In `@vignettes/Filter-By-Context.Rmd`:
- Around line 242-243: Update the documentation table entries that list a
`similarity` column to use `score` instead (the function returns `score`),
specifically replace the `similarity` column name in the evidence table
descriptions with `score` and ensure both occurrences (the two table rows that
currently read "`similarity` | Cosine similarity score of the abstract vs.
query") are changed; also search the document for any other `similarity`
references related to the evidence table and align them to `score` for
consistency with the function output.
- Around line 2-12: Update the vignette title to be consistent and
method-agnostic by changing the YAML title (the "title" field currently set to
"Filtering Subnetworks by Biological Context Using TF-IDF Similarity") and the
VignetteIndexEntry (currently "Filtering Subnetworks by Biological Context") to
the same unified string (e.g., "Filtering Subnetworks by Biological Context") so
both the top-level title and the VignetteIndexEntry match and avoid the pipeline
warning.
- Around line 132-174: Several R Markdown code chunks reuse the same labels
causing duplicate chunk label errors; rename the second occurrence of each
duplicate chunk label (query, filter, nodes, edges, evidence, sort-evidence,
explore-cutoff) by appending a unique suffix like -cosine (e.g., query-cosine,
filter-cosine, nodes-cosine, edges-cosine, evidence-cosine,
sort-evidence-cosine, explore-cutoff-cosine) so each chunk label is unique;
locate the duplicate chunk delimiters with labels "query", "filter", "nodes",
"edges", "evidence", "sort-evidence", and "explore-cutoff" and update their
opening chunk names accordingly (keeping the chunk content unchanged).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8f4fa88f-c903-444c-9d97-9c124e7d0300

📥 Commits

Reviewing files that changed from the base of the PR and between dc9fb91 and 18dae12.

📒 Files selected for processing (6)
  • DESCRIPTION
  • NAMESPACE
  • R/filterSubnetworkByContext.R
  • R/utils_getSubnetworkFromIndra.R
  • man/filterSubnetworkByContext.Rd
  • vignettes/Filter-By-Context.Rmd
🚧 Files skipped from review as they are similar to previous changes (1)
  • DESCRIPTION

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
R/filterSubnetworkByContext.R (2)

295-315: Consider batch fetching for improved performance.

The current implementation fetches one PMID at a time with a 0.34s delay, which can be very slow for large networks. The rentrez::entrez_fetch() function supports fetching multiple IDs in a single request (up to 200 per batch), which would significantly reduce total execution time while still respecting NCBI rate limits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 295 - 315, The loop fetching one
PMID at a time should be replaced by batch fetching: split pmids into chunks
(max 200), call rentrez::entrez_fetch once per chunk with ids collapsed, parse
the returned XML (using read_xml) to extract each article's AbstractText via
xml_find_first/xml_text and map them back to the original pmid keys to populate
results, preserve error handling per-chunk (wrap each batch call in tryCatch and
assign empty strings on failure), update progress reporting using chunk indices
vs total, and adjust Sys.sleep to respect NCBI rate limits between batches
rather than per PMID.
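
A hedged sketch of that batching strategy (chunk size, XPath expressions, and the NA-for-failure policy are assumptions; PubMed XML nests each article's PMID under MedlineCitation):

    library(rentrez)
    library(xml2)

    fetch_abstracts_batched <- function(pmids, chunk_size = 200) {
        results <- setNames(vector("list", length(pmids)), pmids)
        chunks  <- split(pmids, ceiling(seq_along(pmids) / chunk_size))
        for (chunk in chunks) {
            record <- tryCatch(
                entrez_fetch(db = "pubmed", id = chunk, rettype = "xml"),
                error = function(e) NULL
            )
            if (is.null(record)) {
                # failed fetch: mark the whole chunk unknown, not off-topic
                results[chunk] <- NA_character_
            } else {
                doc      <- read_xml(record)
                articles <- xml_find_all(doc, ".//PubmedArticle")
                for (art in articles) {
                    pmid     <- xml_text(xml_find_first(art, ".//MedlineCitation/PMID"))
                    sections <- xml_find_all(art, ".//AbstractText")
                    if (length(sections) > 0) {
                        results[[pmid]] <- paste(trimws(xml_text(sections)),
                                                 collapse = " ")
                    }
                }
            }
            Sys.sleep(0.34)  # rate limit between batches, not per PMID
        }
        results
    }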

333-338: Consider adding a timeout to the HTTP request.

The POST call has no explicit timeout, which could cause the function to hang indefinitely if the INDRA API is unresponsive. Adding a timeout() configuration would improve reliability.

Example
         response <- POST(
             url,
             body   = list(stmt_hash = stmt_hash),
             encode = "json",
-            content_type_json()
+            content_type_json(),
+            httr::timeout(30)
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 333 - 338, The POST call
(response <- POST(...)) in filterSubnetworkByContext.R lacks a timeout and can
hang; update the POST invocation to include an httr timeout() option (e.g.,
timeout(30)) as an extra argument so the HTTP request fails fast if the INDRA
API is unresponsive, keeping the rest of the parameters (url, body, encode,
content_type_json()) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/filterSubnetworkByContext.R`:
- Line 88: The current construction pmids <-
unique(evidence$pmid[nchar(evidence$pmid) > 0]) can retain NA because nchar(NA)
yields NA; update the filter to explicitly exclude NAs and empty strings, e.g.
compute pmids <- unique(evidence$pmid[!is.na(evidence$pmid) &
nchar(evidence$pmid) > 0]) (or use !is.na(...) & nzchar(...) if preferred) so
that the pmids vector contains only non-NA, non-empty PMID strings.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: df2fe0b7-e533-4f35-a274-f3582d49399c

📥 Commits

Reviewing files that changed from the base of the PR and between 18dae12 and 8d48b89.

📒 Files selected for processing (3)
  • R/filterSubnetworkByContext.R
  • man/filterSubnetworkByContext.Rd
  • vignettes/Filter-By-Context.Rmd
✅ Files skipped from review due to trivial changes (1)
  • man/filterSubnetworkByContext.Rd
🚧 Files skipped from review as they are similar to previous changes (1)
  • vignettes/Filter-By-Context.Rmd

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
R/filterSubnetworkByContext.R (2)

63-80: ⚠️ Potential issue | 🟠 Major

Validate cutoff per scoring mode.

cutoff is consumed unvalidated at Line 108. Right now NA, vectors, negative values, impossible tag-count thresholds, or cosine cutoffs outside [-1, 1] can silently misfilter the network.

Suggested fix
     if (method == "tag_count") {
         if (!is.character(query) || length(query) < 1) {
             stop("`query` must be a character vector of tags when method = 'tag_count'.")
         }
         if (is.null(cutoff)) cutoff <- 1
+        if (!is.numeric(cutoff) || length(cutoff) != 1L || is.na(cutoff) ||
+            cutoff < 0 || cutoff != floor(cutoff) || cutoff > length(query)) {
+            stop("`cutoff` must be a single integer in [0, length(query)] when method = 'tag_count'.")
+        }
         cat(sprintf(
             "Method: tag_count | Tags: %d | Cutoff: >= %d tag(s)\n",
             length(query), cutoff
         ))
     } else {
         if (!is.character(query) || length(query) != 1) {
             stop("`query` must be a single character string when method = 'cosine'.")
         }
         if (is.null(cutoff)) cutoff <- 0.10
+        if (!is.numeric(cutoff) || length(cutoff) != 1L || is.na(cutoff) ||
+            !is.finite(cutoff) || cutoff < -1 || cutoff > 1) {
+            stop("`cutoff` must be a single numeric value in [-1, 1] when method = 'cosine'.")
+        }
         cat(sprintf(
             "Method: cosine | Cutoff: >= %.2f\n", cutoff
         ))
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 63 - 80, Validate the cutoff
argument based on method before it's used: in the function handling method and
cutoff (look for variables method, cutoff and the tag_count / cosine branches in
filterSubnetworkByContext.R), ensure cutoff is a single non-NA numeric scalar;
for method == "tag_count" coerce/validate it as an integer >= 1 (and <=
length(query) if applicable) and reject vectors/NA/negatives, and for method ==
"cosine" validate it as a single numeric within [-1, 1] (reject NA or vectors);
return a clear stop() error message when validation fails so invalid cutoffs
cannot silently misfilter the network.

300-317: ⚠️ Potential issue | 🟠 Major

Preserve missing abstracts as unknown, not off-topic.

When PubMed returns no <AbstractText>, this helper drops the PMID entirely; on fetch errors it records "". Downstream that turns into missing or zero-like scores and the evidence is filtered out as if it were irrelevant, even though it was never actually scored.

Suggested fix
-            if (length(abstract_nodes) > 0) {
-                results[[pmid]] <- paste(trimws(xml_text(abstract_nodes)), collapse = " ")
-            }
+            if (length(abstract_nodes) > 0) {
+                results[[pmid]] <- paste(trimws(xml_text(abstract_nodes)), collapse = " ")
+            } else {
+                results[[pmid]] <- NA_character_
+            }
@@
-            results[[pmid]] <- ""
+            results[[pmid]] <- NA_character_
+    if (length(abstract_list) == 0L ||
+        all(is.na(unlist(abstract_list, use.names = FALSE)))) {
+        warning("No abstracts could be scored - returning unfiltered inputs.")
+        return(list(nodes = nodes, edges = edges, evidence = evidence))
+    }
@@
-    evidence_scored$score[is.na(evidence_scored$score)] <- 0
-
-    evidence_filtered <- evidence_scored[
-        evidence_scored$pmid %in% passing_pmids,
+    evidence_filtered <- evidence_scored[
+        is.na(evidence_scored$score) | evidence_scored$score >= cutoff,
         c("source", "target", "interaction", "site",
           "evidenceLink", "stmt_hash", "text", "pmid", "score")
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 300 - 317, The code currently
stores an empty string for PMIDs with no abstract or on fetch error, which
downstream is treated like a scored/irrelevant result; update the logic in the
tryCatch block so that when length(abstract_nodes) == 0 you set results[[pmid]]
<- NA_character_ (or a sentinel like "UNKNOWN") instead of dropping or leaving
it empty, and also change the error handler to set results[[pmid]] <-
NA_character_ (instead of ""), keeping the existing progress/logging but
ensuring missing abstracts are preserved as NA for downstream processing; modify
the occurrences that reference results[[pmid]] in this block (the abstract
assignment and the error function) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 55-149: Add automated tests for the exported function
filterSubnetworkByContext to cover all branches: write testthat tests that mock
the helper calls .extract_evidence_text and .fetch_clean_abstracts_xml (and
simulate errors/timeouts), and stub scoring functions .score_by_tag_count and
.score_by_cosine to control scores; include cases for method="tag_count" and
method="cosine", no evidence (empty data.frame from .extract_evidence_text),
evidence with no PMIDs (pmid empty/NA), empty abstracts returned by
.fetch_clean_abstracts_xml, and remote-call failures (make
.fetch_clean_abstracts_xml throw an error) and assert the function returns the
unfiltered inputs or raises appropriately and that outputs contain expected
filtered nodes/edges/evidence; use mockery or with_mock to mock rentrez/httr if
those are called inside .fetch_clean_abstracts_xml and assert logging/messages
via expect_warning/expect_message where appropriate.

In `@vignettes/Filter-By-Context.Rmd`:
- Around line 176-178: The documentation still describes cosine semantics while
the code uses method = "tag_count"; update the prose and labels wherever
tag_count is used (e.g., the my_query walkthrough, the >= 0.10 cutoff example,
filtered_network$evidence references, the score variable, and the histogram
title "Cosine score to query") so they reflect integer tag-hit semantics: change
descriptions of score from "cosine score" to "tag hit count" (or "integer tag
hits"), replace the numeric cutoff example with an integer threshold (e.g., ">=
1" or a relevant count), and update plot labels and captions to say "Tag hit
count" (or similar) instead of "Cosine score to query"; apply these edits
consistently in all affected blocks (including the occurrences noted around the
method="tag_count" usage).
- Around line 132-173: The tags vector used for method = "tag_count" uses
underscore_joined tokens which won't match PubMed multi-word phrases; update the
tags variable (the tags object in the R code block) to use space-separated
phrases (e.g., "dna damage repair", "double strand break", "single strand
break", "base excision repair", etc.) instead of underscore-delimited words so
fixed substring matching will find the intended multi-word terms.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 18c0f642-d52f-40cb-ba18-0d9c010ddd4c

📥 Commits

Reviewing files that changed from the base of the PR and between 8d48b89 and b6f513e.

📒 Files selected for processing (4)
  • NAMESPACE
  • R/filterSubnetworkByContext.R
  • man/filterSubnetworkByContext.Rd
  • vignettes/Filter-By-Context.Rmd
🚧 Files skipped from review as they are similar to previous changes (1)
  • man/filterSubnetworkByContext.Rd

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
R/filterSubnetworkByContext.R (1)

301-320: ⚠️ Potential issue | 🟠 Major

Differentiate failed fetches from genuinely empty abstracts.

A request failure and a real “no abstract text” record are both stored as "" here, so downstream scoring treats transient PubMed failures as hard non-matches and can prune valid edges. Preserve an explicit unknown value for failed lookups and let the filter logic decide how to handle those rows.

Proposed fix
         record <- tryCatch(
             entrez_fetch(db = "pubmed", id = pmid, rettype = "xml"),
             error = function(e) {
                 cat(sprintf("Error fetching PMID %s at %d/%d: %s\n", pmid, i, total, e$message))
                 NULL
             }
         )
         
         if (is.null(record)) {
-            results[[pmid]] <- ""
+            results[[pmid]] <- NA_character_
             next
         }

Then keep that NA distinct in filterSubnetworkByContext() instead of coercing it to a zero score.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/filterSubnetworkByContext.R` around lines 301 - 320, The fetch error
handling currently stores failed PubMed fetches as an empty string, conflating
network/transient failures with genuinely empty abstracts; update the tryCatch
error branch so that when entrez_fetch fails you set results[[pmid]] <- NA (not
""), leave successful-but-empty abstracts as "" as before, and then update the
downstream filterSubnetworkByContext() logic to treat NA as "unknown" (i.e., do
not coerce NA to a zero/score) so callers can decide how to handle fetch
failures. Reference: the tryCatch block around entrez_fetch, the results[[pmid]]
assignment, and filterSubnetworkByContext().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 63-79: In filterSubnetworkByContext, validate inputs up front:
when method == "tag_count" ensure query is a character vector with length >=1
and no NA or empty ("" ) entries, and ensure cutoff is a single finite numeric
>=1 (coerce or error if not); when method == "cosine" ensure query is a single
non-NA, non-empty character string and cutoff is a single finite numeric in
[0,1] (default 0.10 otherwise) — replace the loose checks around query, cutoff
and the cat(...) blocks with these stricter validations referencing the existing
method, query and cutoff variables so the function fails fast on invalid tags or
invalid cutoff values.
- Around line 84-92: The early-return branches in filterSubnetworkByContext
currently return the raw evidence data frame without the required score column;
update both branches (the nrow(evidence) == 0 and length(pmids) == 0 cases) to
ensure evidence always contains a numeric score column (e.g., add evidence$score
<- NA_real_ if missing) before returning list(nodes = nodes, edges = edges,
evidence = evidence) so callers can safely access result$evidence$score without
special-casing.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ba5b423e-9cf8-448b-b9a0-443fe909d790

📥 Commits

Reviewing files that changed from the base of the PR and between b6f513e and ca876c4.

📒 Files selected for processing (2)
  • R/filterSubnetworkByContext.R
  • tests/testthat/test-filterSubnetworkByContext.R

tonywu1999 merged commit 8123bd7 into devel on Mar 11, 2026 (4 checks passed)
tonywu1999 deleted the feature-cosine-context branch on March 11, 2026 at 18:22