feature(filter-subnetwork): Add feature to filter the subnetwork from INDRA by query context#84
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository settings.
📝 Walkthrough

Adds exported function `filterSubnetworkByContext()`, which filters an INDRA-derived subnetwork by scoring evidence abstracts against a user-supplied query.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant Package as Package
    participant INDRA as INDRA API
    participant PubMed as PubMed (rentrez)
    participant TextProc as Text Processing (text2vec)
    User->>Package: call filterSubnetworkByContext(nodes, edges, query, ...)
    Package->>INDRA: request evidence per stmt_hash
    INDRA-->>Package: evidence text + PMIDs
    Package->>PubMed: fetch abstracts for PMIDs
    PubMed-->>Package: abstracts
    Package->>TextProc: compute tag counts or TF-IDF + cosine vs query
    TextProc-->>Package: similarity scores
    Package->>Package: attach scores, filter evidence, compute surviving nodes/edges
    Package-->>User: return filtered nodes, edges, evidence
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 112-114: Replace the positional column access nodes[[1]] with an
explicit nodes$id reference in the nodes_filtered assignment, and add a
pre-check that the nodes data.frame contains an "id" column (e.g., stop with a
clear error if !"id" %in% colnames(nodes)); update the nodes_filtered creation
that currently uses surviving_nodes and edges_filtered to filter by nodes$id to
avoid silent breakage when columns are reordered; this keeps behavior consistent
with other functions like getSubnetworkFromIndra() and fails fast when the
required column is missing.
- Around line 93-106: Remove the forced conversion of missing similarity values
to 0 (the line setting evidence_scored$similarity[is.na(...)] <- 0) so
unknown/failed-fetch abstracts remain NA, and update the logic that computes
passing_pmids (the filter that uses similarity >= cutoff) to treat NA as passing
by using a condition like is.na(similarity) OR similarity >= cutoff; this
ensures evidence_scored, evidence_filtered, abstracts_df, similarity and
passing_pmids retain and preserve unknown scores rather than conflating them
with zero.
- Around line 217-220: Replace the single-section extraction using
xml_find_first with a concatenation of all <AbstractText> nodes: use
xml_find_all(doc, ".//AbstractText") to retrieve every section, combine their
xml_text() values (e.g., collapse with a space) and assign the combined string
to results[[pmid]] instead of the single abstract_node; additionally update the
package metadata to import xml_find_all by adding xml_find_all to the
`@importFrom` xml2 line so that importFrom(xml2, xml_find_all) is generated in
the NAMESPACE and the function is available.
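The concatenation fix described above can be sketched as a small helper. This is a minimal sketch, not the PR's code: `collapse_abstract` is a hypothetical name, the `xml2` package is assumed, and the sample XML below is illustrative rather than a real PubMed record.

```r
# Combine every <AbstractText> section of a PubMed record into one string.
# Structured abstracts split text across OBJECTIVE/METHODS/RESULTS/CONCLUSIONS,
# so xml_find_first() alone would drop everything after the first section.
collapse_abstract <- function(doc) {
  abstract_nodes <- xml2::xml_find_all(doc, ".//AbstractText")
  if (length(abstract_nodes) == 0) return(NA_character_)  # no abstract = unknown
  paste(trimws(xml2::xml_text(abstract_nodes)), collapse = " ")
}

# Illustrative structured abstract (not a real PubMed record)
doc <- xml2::read_xml(
  "<Article><Abstract>
     <AbstractText Label='OBJECTIVE'>Study DNA repair.</AbstractText>
     <AbstractText Label='RESULTS'>Repair was impaired.</AbstractText>
   </Abstract></Article>"
)
collapse_abstract(doc)
# "Study DNA repair. Repair was impaired."
```

Returning `NA_character_` (rather than dropping the PMID) keeps "no abstract" distinct from a scored abstract, which matches the later review comments about preserving unknown scores.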
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f8397e58-b24c-4ce8-aa8e-d6f237bd05a8
📒 Files selected for processing (6)
- DESCRIPTION
- NAMESPACE
- R/filterSubnetworkByContext.R
- R/utils_getSubnetworkFromIndra.R
- man/filterSubnetworkByContext.Rd
- vignettes/Filter-By-Context.Rmd
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files: ☔ view the full report in Codecov by Sentry.
Force-pushed from dc9fb91 to 18dae12.
Actionable comments posted: 6
♻️ Duplicate comments (2)
R/filterSubnetworkByContext.R (2)
139-140: ⚠️ Potential issue | 🟡 Minor

Use `nodes$id` instead of positional `nodes[[1]]`.

This was flagged in a previous review and remains unaddressed. The function is exported and may receive data frames with reordered columns. Using `nodes[[1]]` silently breaks in that case.

Proposed fix

```diff
  surviving_nodes <- union(edges_filtered$source, edges_filtered$target)
- nodes_filtered <- nodes[nodes[[1]] %in% surviving_nodes, ]  # assumes first col is node ID
+ if (!"id" %in% names(nodes)) {
+   stop("`nodes` must contain an `id` column.")
+ }
+ nodes_filtered <- nodes[nodes$id %in% surviving_nodes, ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 139 - 140, Replace the positional column access nodes[[1]] in the nodes filtering step with the explicit column name nodes$id to avoid breaking when columns are reordered: update the expression that builds nodes_filtered (which currently uses surviving_nodes from edges_filtered$source/target) to use nodes$id %in% surviving_nodes instead of nodes[[1]] %in% surviving_nodes; ensure this change is applied where surviving_nodes, edges_filtered and nodes_filtered are referenced.
305-325: ⚠️ Potential issue | 🟠 Major

Concatenate all abstract sections instead of truncating to the first.

This was flagged in a previous review and remains unaddressed. PubMed structured abstracts contain multiple `<AbstractText>` nodes (OBJECTIVE, METHODS, RESULTS, CONCLUSIONS). Using `xml_find_first()` retrieves only the first section, causing relevance scoring to miss matches in other sections.

Proposed fix

```diff
- abstract_node <- xml_find_first(doc, ".//AbstractText")
-
- if (!is.na(abstract_node)) {
-   results[[pmid]] <- xml_text(abstract_node)
+ abstract_nodes <- xml_find_all(doc, ".//AbstractText")
+
+ if (length(abstract_nodes) > 0) {
+   results[[pmid]] <- paste(trimws(xml_text(abstract_nodes)), collapse = " ")
  }
```

Also add `xml_find_all` to the `@importFrom` directive at line 298:

```diff
-#' @importFrom xml2 read_xml xml_find_first xml_text
+#' @importFrom xml2 read_xml xml_find_all xml_find_first xml_text
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 305 - 325, The loop currently uses xml_find_first to grab only the first <AbstractText>; change it to use xml_find_all on the parsed doc (replace the abstract_node variable with abstract_nodes <- xml_find_all(doc, ".//AbstractText")), then if any nodes are found collapse their texts into a single string (e.g., paste(xml_text(abstract_nodes), collapse = " ")) and store that in results[[pmid]]; also adjust the presence check from is.na(abstract_node) to testing length(abstract_nodes) > 0 (or similar). Finally, update the Roxygen `@importFrom` directive that currently lists xml_find_first to also include xml_find_all so the function is imported.
🧹 Nitpick comments (3)
R/filterSubnetworkByContext.R (3)
247-275: Consider adding rate limiting for INDRA API calls.

Unlike `.fetch_clean_abstracts_xml()`, which includes `Sys.sleep(0.34)` to respect NCBI rate limits, this function makes INDRA API calls without throttling. For networks with many unique statement hashes, this could overwhelm the INDRA API or trigger rate limiting.

```diff
  evidence_list <- .query_indra_evidence(stmt_hash)
  if (is.null(evidence_list) || length(evidence_list) == 0) next
+
+ Sys.sleep(0.1)  # respect API rate limits
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 247 - 275, The loop over unique_hashes calls .query_indra_evidence repeatedly without throttling; add rate limiting (e.g., a short Sys.sleep delay or a configurable backoff) immediately after each .query_indra_evidence call inside the for (i in seq_along(unique_hashes)) loop (or on each iteration before continuing) to avoid overwhelming the INDRA API—make the delay configurable (default similar to .fetch_clean_abstracts_xml's 0.34s) and apply it around the .query_indra_evidence invocation and before continuing to the next stmt_hash.
122-134: Consider preserving evidence with unfetched abstracts.

Setting `NA` scores to `0` (line 128) conflates transient fetch failures with genuinely low-scoring abstracts. Evidence backed by unfetchable abstracts is silently removed.

Consider either:

- Preserving NA-scored evidence in the output (let the user decide), or
- Adding a warning when significant data is lost (counting the NAs before they are coerced to zero):

```diff
+ na_count <- sum(is.na(evidence_scored$score))
+ if (na_count > 0) {
+   warning(sprintf("%d evidence rows had no abstract score (fetch failed); treated as score=0.", na_count))
+ }
  evidence_scored$score[is.na(evidence_scored$score)] <- 0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 122 - 134, Do not coerce missing abstract scores to 0: remove the line that sets evidence_scored$score[is.na(...)] <- 0 so NA remains, and leave evidence_filtered as-is; additionally, add a warning when there are any NA scores (e.g., compute n_na <- sum(is.na(evidence_scored$score)) and if n_na > 0 call warning() with the count/percentage) so callers are informed about unfetched abstracts while preserving those rows (refer to symbols evidence_scored, score, evidence_filtered, passing_pmids).
64-81: Add validation for the `cutoff` parameter.

While query validation was added (addressing a previous review), `cutoff` is not validated. Invalid values (e.g., negative for tag_count, non-numeric, NA) will cause silent misbehavior.

Proposed fix

```diff
  if (method == "tag_count") {
    if (!is.character(query) || length(query) < 1) {
      stop("`query` must be a character vector of tags when method = 'tag_count'.")
    }
    if (is.null(cutoff)) cutoff <- 1
+   if (!is.numeric(cutoff) || length(cutoff) != 1L || is.na(cutoff) || cutoff < 0) {
+     stop("`cutoff` must be a non-negative integer when method = 'tag_count'.")
+   }
    cat(sprintf(
      "Method: tag_count | Tags: %d | Cutoff: >= %d tag(s)\n",
      length(query), cutoff
    ))
  } else {
    if (!is.character(query) || length(query) != 1) {
      stop("`query` must be a single character string when method = 'cosine'.")
    }
    if (is.null(cutoff)) cutoff <- 0.10
+   if (!is.numeric(cutoff) || length(cutoff) != 1L || is.na(cutoff) ||
+       !is.finite(cutoff) || cutoff < -1 || cutoff > 1) {
+     stop("`cutoff` must be a single numeric value in [-1, 1] when method = 'cosine'.")
+   }
    cat(sprintf(
      "Method: cosine | Cutoff: >= %.2f\n",
      cutoff
    ))
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 64 - 81, Validate the cutoff parameter after determining its default but before using it: for method == "tag_count" ensure cutoff is a single non-NA numeric/integer >= 1 (or coerce/round if you expect integer counts) and for method == "cosine" ensure cutoff is a single non-NA numeric between 0 and 1 (inclusive/exclusive as appropriate); if validation fails, call stop() with a clear message. Insert these checks near the existing method branches (referencing the method variable and cutoff) inside the function in R/filterSubnetworkByContext.R so the code sets defaults then validates cutoff before the cat(sprintf(...)) and any downstream logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@man/filterSubnetworkByContext.Rd`:
- Around line 1-37: The man page is stale: regenerate it from the roxygen2
comments in the implementation by running devtools::document() (or
roxygen2::roxygenise()), then verify the generated Rd for
filterSubnetworkByContext matches the function signature and tags in
R/filterSubnetworkByContext.R (parameters: query, cutoff, method =
c("tag_count","cosine"), correct default for query if any, and output column
name score and updated title mentioning both methods); if any roxygen tags in
the R file are missing or incorrect, fix the `@param`/`@return`/`@title` comments in
the filterSubnetworkByContext roxygen block before re-running documentation.
In `@NAMESPACE`:
- Around line 41-43: The NAMESPACE is missing the xml_find_all import needed for
handling PubMed structured abstracts; add importFrom(xml2, xml_find_all) so that
the updated fetch_clean_abstracts_xml() (and related changes referenced in
filterSubnetworkByContext.R) can call xml_find_all() without namespace errors.
In `@R/filterSubnetworkByContext.R`:
- Around line 195-203: The itoken iterator created by itoken(all_texts,
preprocessor = tolower, tokenizer = word_tokenizer) is consumed by
create_vocabulary(...) and must be recreated before calling create_dtm(...);
update the code so you re-run itoken(...) (or otherwise create a fresh iterator
wrapper) immediately before vectorizer <- vocab_vectorizer(vocab) / dtm <-
create_dtm(tokens, vectorizer) to ensure create_dtm receives a fresh iterator;
keep the rest of the pipeline (TfIdf$new(), fit_transform(dtm, tfidf))
unchanged.
In `@vignettes/Filter-By-Context.Rmd`:
- Around line 242-243: Update the documentation table entries that list a
`similarity` column to use `score` instead (the function returns `score`),
specifically replace the `similarity` column name in the evidence table
descriptions with `score` and ensure both occurrences (the two table rows that
currently read "`similarity` | Cosine similarity score of the abstract vs.
query") are changed; also search the document for any other `similarity`
references related to the evidence table and align them to `score` for
consistency with the function output.
- Around line 2-12: Update the vignette title to be consistent and
method-agnostic by changing the YAML title (the "title" field currently set to
"Filtering Subnetworks by Biological Context Using TF-IDF Similarity") and the
VignetteIndexEntry (currently "Filtering Subnetworks by Biological Context") to
the same unified string (e.g., "Filtering Subnetworks by Biological Context") so
both the top-level title and the VignetteIndexEntry match and avoid the pipeline
warning.
- Around line 132-174: Several R Markdown code chunks reuse the same labels
causing duplicate chunk label errors; rename the second occurrence of each
duplicate chunk label (query, filter, nodes, edges, evidence, sort-evidence,
explore-cutoff) by appending a unique suffix like -cosine (e.g., query-cosine,
filter-cosine, nodes-cosine, edges-cosine, evidence-cosine,
sort-evidence-cosine, explore-cutoff-cosine) so each chunk label is unique;
locate the duplicate chunk delimiters with labels "query", "filter", "nodes",
"edges", "evidence", "sort-evidence", and "explore-cutoff" and update their
opening chunk names accordingly (keeping the chunk content unchanged).
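The itoken point above is worth seeing end to end. The sketch below follows the review prompt's defensive fix of recreating the iterator before `create_dtm()`; the `all_texts` vector is illustrative, and `text2vec` is assumed to be installed.

```r
library(text2vec)

all_texts <- c(
  "dna damage repair in cells",
  "unrelated metabolic pathway text"
)

# First pass: build the vocabulary (this consumes the iterator)
tokens <- itoken(all_texts, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(tokens)

# Second pass: recreate the iterator before building the DTM,
# as the review prompt suggests, so create_dtm() sees all documents
tokens <- itoken(all_texts, preprocessor = tolower, tokenizer = word_tokenizer)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(tokens, vectorizer)

# TF-IDF weighting, unchanged from the pipeline described above
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)
dim(dtm_tfidf)  # one row per document, one column per vocabulary term
```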
---
Duplicate comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 139-140: Replace the positional column access nodes[[1]] in the
nodes filtering step with the explicit column name nodes$id to avoid breaking
when columns are reordered: update the expression that builds nodes_filtered
(which currently uses surviving_nodes from edges_filtered$source/target) to use
nodes$id %in% surviving_nodes instead of nodes[[1]] %in% surviving_nodes; ensure
this change is applied where surviving_nodes, edges_filtered and nodes_filtered
are referenced.
- Around line 305-325: The loop currently uses xml_find_first to grab only the
first <AbstractText>; change it to use xml_find_all on the parsed doc (replace
the abstract_node variable with abstract_nodes <- xml_find_all(doc,
".//AbstractText")), then if any nodes are found collapse their texts into a
single string (e.g., paste(xml_text(abstract_nodes), collapse = " ")) and store
that in results[[pmid]]; also adjust the presence check from
is.na(abstract_node) to testing length(abstract_nodes) > 0 (or similar).
Finally, update the Roxygen `@importFrom` directive that currently lists
xml_find_first to also include xml_find_all so the function is imported.
---
Nitpick comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 247-275: The loop over unique_hashes calls .query_indra_evidence
repeatedly without throttling; add rate limiting (e.g., a short Sys.sleep delay
or a configurable backoff) immediately after each .query_indra_evidence call
inside the for (i in seq_along(unique_hashes)) loop (or on each iteration before
continuing) to avoid overwhelming the INDRA API—make the delay configurable
(default similar to .fetch_clean_abstracts_xml's 0.34s) and apply it around the
.query_indra_evidence invocation and before continuing to the next stmt_hash.
- Around line 122-134: Do not coerce missing abstract scores to 0: remove the
line that sets evidence_scored$score[is.na(...)] <- 0 so NA remains, and leave
evidence_filtered as-is; additionally, add a warning when there are any NA
scores (e.g., compute n_na <- sum(is.na(evidence_scored$score)) and if n_na > 0
call warning() with the count/percentage) so callers are informed about
unfetched abstracts while preserving those rows (refer to symbols
evidence_scored, score, evidence_filtered, passing_pmids).
- Around line 64-81: Validate the cutoff parameter after determining its default
but before using it: for method == "tag_count" ensure cutoff is a single non-NA
numeric/integer >= 1 (or coerce/round if you expect integer counts) and for
method == "cosine" ensure cutoff is a single non-NA numeric between 0 and 1
(inclusive/exclusive as appropriate); if validation fails, call stop() with a
clear message. Insert these checks near the existing method branches
(referencing the method variable and cutoff) inside the function in
R/filterSubnetworkByContext.R so the code sets defaults then validates cutoff
before the cat(sprintf(...)) and any downstream logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8f4fa88f-c903-444c-9d97-9c124e7d0300
📒 Files selected for processing (6)
- DESCRIPTION
- NAMESPACE
- R/filterSubnetworkByContext.R
- R/utils_getSubnetworkFromIndra.R
- man/filterSubnetworkByContext.Rd
- vignettes/Filter-By-Context.Rmd
🚧 Files skipped from review as they are similar to previous changes (1)
- DESCRIPTION
Actionable comments posted: 1
🧹 Nitpick comments (2)
R/filterSubnetworkByContext.R (2)
295-315: Consider batch fetching for improved performance.

The current implementation fetches one PMID at a time with a 0.34s delay, which can be very slow for large networks. The `rentrez::entrez_fetch()` function supports fetching multiple IDs in a single request (up to 200 per batch), which would significantly reduce total execution time while still respecting NCBI rate limits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 295 - 315, The loop fetching one PMID at a time should be replaced by batch fetching: split pmids into chunks (max 200), call rentrez::entrez_fetch once per chunk with ids collapsed, parse the returned XML (using read_xml) to extract each article's AbstractText via xml_find_first/xml_text and map them back to the original pmid keys to populate results, preserve error handling per-chunk (wrap each batch call in tryCatch and assign empty strings on failure), update progress reporting using chunk indices vs total, and adjust Sys.sleep to respect NCBI rate limits between batches rather than per PMID.
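A hedged sketch of that batching idea follows. `chunk_ids` and `fetch_abstracts_batched` are hypothetical helper names, `pmids` is assumed to be a character vector, and the `PubmedArticle`/`PMID`/`AbstractText` element names come from the standard PubMed EFetch XML; only the chunking logic is asserted here, since the fetch itself needs network access.

```r
# Split PMIDs into chunks of at most 200 for batched EFetch requests
chunk_ids <- function(ids, size = 200) {
  split(ids, ceiling(seq_along(ids) / size))
}

fetch_abstracts_batched <- function(pmids) {
  results <- list()
  for (chunk in chunk_ids(pmids)) {
    xml <- tryCatch(
      rentrez::entrez_fetch(db = "pubmed", id = chunk, rettype = "xml"),
      error = function(e) NULL
    )
    if (is.null(xml)) {
      results[chunk] <- NA_character_  # preserve fetch failures as unknown
    } else {
      doc <- xml2::read_xml(xml)
      for (article in xml2::xml_find_all(doc, ".//PubmedArticle")) {
        pmid <- xml2::xml_text(xml2::xml_find_first(article, ".//PMID"))
        texts <- xml2::xml_find_all(article, ".//AbstractText")
        results[[pmid]] <- paste(trimws(xml2::xml_text(texts)), collapse = " ")
      }
    }
    Sys.sleep(0.34)  # NCBI rate limit between batches, not per PMID
  }
  results
}
```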
333-338: Consider adding a timeout to the HTTP request.

The `POST` call has no explicit timeout, which could cause the function to hang indefinitely if the INDRA API is unresponsive. Adding a `timeout()` configuration would improve reliability.

Example

```diff
  response <- POST(
    url,
    body = list(stmt_hash = stmt_hash),
    encode = "json",
-   content_type_json()
+   content_type_json(),
+   httr::timeout(30)
  )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 333 - 338, The POST call (response <- POST(...)) in filterSubnetworkByContext.R lacks a timeout and can hang; update the POST invocation to include an httr timeout() option (e.g., timeout(30)) as an extra argument so the HTTP request fails fast if the INDRA API is unresponsive, keeping the rest of the parameters (url, body, encode, content_type_json()) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/filterSubnetworkByContext.R`:
- Line 88: The current construction pmids <-
unique(evidence$pmid[nchar(evidence$pmid) > 0]) can retain NA because nchar(NA)
yields NA; update the filter to explicitly exclude NAs and empty strings, e.g.
compute pmids <- unique(evidence$pmid[!is.na(evidence$pmid) &
nchar(evidence$pmid) > 0]) (or use !is.na(...) & nzchar(...) if preferred) so
that the pmids vector contains only non-NA, non-empty PMID strings.
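The `nchar(NA)` pitfall described above is easy to demonstrate in base R. This sketch uses an illustrative `evidence` data frame with the `pmid` column the comment refers to:

```r
evidence <- data.frame(
  pmid = c("12345678", NA, "", "12345678"),
  stringsAsFactors = FALSE
)

# nchar(NA_character_) is NA, so the NA comparison yields NA and
# logical indexing with NA keeps an NA element in the result
bad  <- unique(evidence$pmid[nchar(evidence$pmid) > 0])
good <- unique(evidence$pmid[!is.na(evidence$pmid) & nchar(evidence$pmid) > 0])

bad   # "12345678" NA   <- NA slips through
good  # "12345678"      <- only real PMIDs remain
```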
---
Nitpick comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 295-315: The loop fetching one PMID at a time should be replaced
by batch fetching: split pmids into chunks (max 200), call rentrez::entrez_fetch
once per chunk with ids collapsed, parse the returned XML (using read_xml) to
extract each article's AbstractText via xml_find_first/xml_text and map them
back to the original pmid keys to populate results, preserve error handling
per-chunk (wrap each batch call in tryCatch and assign empty strings on
failure), update progress reporting using chunk indices vs total, and adjust
Sys.sleep to respect NCBI rate limits between batches rather than per PMID.
- Around line 333-338: The POST call (response <- POST(...)) in
filterSubnetworkByContext.R lacks a timeout and can hang; update the POST
invocation to include an httr timeout() option (e.g., timeout(30)) as an extra
argument so the HTTP request fails fast if the INDRA API is unresponsive,
keeping the rest of the parameters (url, body, encode, content_type_json())
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: df2fe0b7-e533-4f35-a274-f3582d49399c
📒 Files selected for processing (3)
- R/filterSubnetworkByContext.R
- man/filterSubnetworkByContext.Rd
- vignettes/Filter-By-Context.Rmd
✅ Files skipped from review due to trivial changes (1)
- man/filterSubnetworkByContext.Rd
🚧 Files skipped from review as they are similar to previous changes (1)
- vignettes/Filter-By-Context.Rmd
Actionable comments posted: 3
♻️ Duplicate comments (2)
R/filterSubnetworkByContext.R (2)
63-80: ⚠️ Potential issue | 🟠 Major

Validate `cutoff` per scoring mode.

`cutoff` is consumed unvalidated at line 108. Right now `NA`, vectors, negative values, impossible tag-count thresholds, or cosine cutoffs outside `[-1, 1]` can silently misfilter the network.

Suggested fix

```diff
  if (method == "tag_count") {
    if (!is.character(query) || length(query) < 1) {
      stop("`query` must be a character vector of tags when method = 'tag_count'.")
    }
    if (is.null(cutoff)) cutoff <- 1
+   if (!is.numeric(cutoff) || length(cutoff) != 1L || is.na(cutoff) ||
+       cutoff < 0 || cutoff != floor(cutoff) || cutoff > length(query)) {
+     stop("`cutoff` must be a single integer in [0, length(query)] when method = 'tag_count'.")
+   }
    cat(sprintf(
      "Method: tag_count | Tags: %d | Cutoff: >= %d tag(s)\n",
      length(query), cutoff
    ))
  } else {
    if (!is.character(query) || length(query) != 1) {
      stop("`query` must be a single character string when method = 'cosine'.")
    }
    if (is.null(cutoff)) cutoff <- 0.10
+   if (!is.numeric(cutoff) || length(cutoff) != 1L || is.na(cutoff) ||
+       !is.finite(cutoff) || cutoff < -1 || cutoff > 1) {
+     stop("`cutoff` must be a single numeric value in [-1, 1] when method = 'cosine'.")
+   }
    cat(sprintf(
      "Method: cosine | Cutoff: >= %.2f\n",
      cutoff
    ))
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 63 - 80, Validate the cutoff argument based on method before it's used: in the function handling method and cutoff (look for variables method, cutoff and the tag_count / cosine branches in filterSubnetworkByContext.R), ensure cutoff is a single non-NA numeric scalar; for method == "tag_count" coerce/validate it as an integer >= 1 (and <= length(query) if applicable) and reject vectors/NA/negatives, and for method == "cosine" validate it as a single numeric within [-1, 1] (reject NA or vectors); return a clear stop() error message when validation fails so invalid cutoffs cannot silently misfilter the network.
300-317: ⚠️ Potential issue | 🟠 Major

Preserve missing abstracts as unknown, not off-topic.

When PubMed returns no `<AbstractText>`, this helper drops the PMID entirely; on fetch errors it records `""`. Downstream, that turns into missing or zero-like scores and the evidence is filtered out as if it were irrelevant, even though it was never actually scored.

Suggested fix

```diff
- if (length(abstract_nodes) > 0) {
-   results[[pmid]] <- paste(trimws(xml_text(abstract_nodes)), collapse = " ")
- }
+ if (length(abstract_nodes) > 0) {
+   results[[pmid]] <- paste(trimws(xml_text(abstract_nodes)), collapse = " ")
+ } else {
+   results[[pmid]] <- NA_character_
+ }
@@
-   results[[pmid]] <- ""
+   results[[pmid]] <- NA_character_
```

And in `filterSubnetworkByContext()`:

```diff
+ if (length(abstract_list) == 0L ||
+     all(is.na(unlist(abstract_list, use.names = FALSE)))) {
+   warning("No abstracts could be scored - returning unfiltered inputs.")
+   return(list(nodes = nodes, edges = edges, evidence = evidence))
+ }
@@
- evidence_scored$score[is.na(evidence_scored$score)] <- 0
-
- evidence_filtered <- evidence_scored[
-   evidence_scored$pmid %in% passing_pmids,
+ evidence_filtered <- evidence_scored[
+   is.na(evidence_scored$score) | evidence_scored$score >= cutoff,
    c("source", "target", "interaction", "site",
      "evidenceLink", "stmt_hash", "text", "pmid", "score")
  ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 300 - 317, The code currently stores an empty string for PMIDs with no abstract or on fetch error, which downstream is treated like a scored/irrelevant result; update the logic in the tryCatch block so that when length(abstract_nodes) == 0 you set results[[pmid]] <- NA_character_ (or a sentinel like "UNKNOWN") instead of dropping or leaving it empty, and also change the error handler to set results[[pmid]] <- NA_character_ (instead of ""), keeping the existing progress/logging but ensuring missing abstracts are preserved as NA for downstream processing; modify the occurrences that reference results[[pmid]] in this block (the abstract assignment and the error function) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 55-149: Add automated tests for the exported function
filterSubnetworkByContext to cover all branches: write testthat tests that mock
the helper calls .extract_evidence_text and .fetch_clean_abstracts_xml (and
simulate errors/timeouts), and stub scoring functions .score_by_tag_count and
.score_by_cosine to control scores; include cases for method="tag_count" and
method="cosine", no evidence (empty data.frame from .extract_evidence_text),
evidence with no PMIDs (pmid empty/NA), empty abstracts returned by
.fetch_clean_abstracts_xml, and remote-call failures (make
.fetch_clean_abstracts_xml throw an error) and assert the function returns the
unfiltered inputs or raises appropriately and that outputs contain expected
filtered nodes/edges/evidence; use mockery or with_mock to mock rentrez/httr if
those are called inside .fetch_clean_abstracts_xml and assert logging/messages
via expect_warning/expect_message where appropriate.
In `@vignettes/Filter-By-Context.Rmd`:
- Around line 176-178: The documentation still describes cosine semantics while
the code uses method = "tag_count"; update the prose and labels wherever
tag_count is used (e.g., the my_query walkthrough, the >= 0.10 cutoff example,
filtered_network$evidence references, the score variable, and the histogram
title "Cosine score to query") so they reflect integer tag-hit semantics: change
descriptions of score from "cosine score" to "tag hit count" (or "integer tag
hits"), replace the numeric cutoff example with an integer threshold (e.g., ">=
1" or a relevant count), and update plot labels and captions to say "Tag hit
count" (or similar) instead of "Cosine score to query"; apply these edits
consistently in all affected blocks (including the occurrences noted around the
method="tag_count" usage).
- Around line 132-173: The tags vector used for method = "tag_count" uses
underscore_joined tokens which won't match PubMed multi-word phrases; update the
tags variable (the tags object in the R code block) to use space-separated
phrases (e.g., "dna damage repair", "double strand break", "single strand
break", "base excision repair", etc.) instead of underscore-delimited words so
fixed substring matching will find the intended multi-word terms.
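The phrase-matching concern is mechanical: fixed substring search only hits when the tag literally occurs in the abstract text. A base-R sketch of that tag-hit counting (the `count_tag_hits` helper name and sample abstract are illustrative, not the package's code):

```r
# Count how many query tags appear verbatim (case-insensitive) in a text
count_tag_hits <- function(text, tags) {
  text <- tolower(text)
  sum(vapply(tolower(tags), function(tag) grepl(tag, text, fixed = TRUE), logical(1)))
}

abstract <- "We studied DNA damage repair after a double strand break."

# Underscore-joined tags never match the abstract's plain phrases
count_tag_hits(abstract, c("dna_damage_repair", "double_strand_break"))  # 0

# Space-separated phrases match as intended
count_tag_hits(abstract, c("dna damage repair", "double strand break"))  # 2
```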
---
Duplicate comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 63-80: Validate the cutoff argument based on method before it's
used: in the function handling method and cutoff (look for variables method,
cutoff and the tag_count / cosine branches in filterSubnetworkByContext.R),
ensure cutoff is a single non-NA numeric scalar; for method == "tag_count"
coerce/validate it as an integer >= 1 (and <= length(query) if applicable) and
reject vectors/NA/negatives, and for method == "cosine" validate it as a single
numeric within [-1, 1] (reject NA or vectors); return a clear stop() error
message when validation fails so invalid cutoffs cannot silently misfilter the
network.
- Around line 300-317: The code currently stores an empty string for PMIDs with
no abstract or on fetch error, which downstream is treated like a
scored/irrelevant result; update the logic in the tryCatch block so that when
length(abstract_nodes) == 0 you set results[[pmid]] <- NA_character_ (or a
sentinel like "UNKNOWN") instead of dropping or leaving it empty, and also
change the error handler to set results[[pmid]] <- NA_character_ (instead of
""), keeping the existing progress/logging but ensuring missing abstracts are
preserved as NA for downstream processing; modify the occurrences that reference
results[[pmid]] in this block (the abstract assignment and the error function)
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 18c0f642-d52f-40cb-ba18-0d9c010ddd4c
📒 Files selected for processing (4)
- NAMESPACE
- R/filterSubnetworkByContext.R
- man/filterSubnetworkByContext.Rd
- vignettes/Filter-By-Context.Rmd
🚧 Files skipped from review as they are similar to previous changes (1)
- man/filterSubnetworkByContext.Rd
Actionable comments posted: 2
♻️ Duplicate comments (1)
R/filterSubnetworkByContext.R (1)
301-320: ⚠️ Potential issue | 🟠 Major

Differentiate failed fetches from genuinely empty abstracts.

A request failure and a real "no abstract text" record are both stored as `""` here, so downstream scoring treats transient PubMed failures as hard non-matches and can prune valid edges. Preserve an explicit unknown value for failed lookups and let the filter logic decide how to handle those rows.

Proposed fix

```diff
  record <- tryCatch(
    entrez_fetch(db = "pubmed", id = pmid, rettype = "xml"),
    error = function(e) {
      cat(sprintf("Error fetching PMID %s at %d/%d: %s\n", pmid, i, total, e$message))
      NULL
    }
  )
  if (is.null(record)) {
-   results[[pmid]] <- ""
+   results[[pmid]] <- NA_character_
    next
  }
```

Then keep that `NA` distinct in `filterSubnetworkByContext()` instead of coercing it to a zero score.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/filterSubnetworkByContext.R` around lines 301 - 320, The fetch error handling currently stores failed PubMed fetches as an empty string, conflating network/transient failures with genuinely empty abstracts; update the tryCatch error branch so that when entrez_fetch fails you set results[[pmid]] <- NA (not ""), leave successful-but-empty abstracts as "" as before, and then update the downstream filterSubnetworkByContext() logic to treat NA as "unknown" (i.e., do not coerce NA to a zero/score) so callers can decide how to handle fetch failures. Reference: the tryCatch block around entrez_fetch, the results[[pmid]] assignment, and filterSubnetworkByContext().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 63-79: In filterSubnetworkByContext, validate inputs up front:
when method == "tag_count" ensure query is a character vector with length >=1
and no NA or empty ("" ) entries, and ensure cutoff is a single finite numeric
>=1 (coerce or error if not); when method == "cosine" ensure query is a single
non-NA, non-empty character string and cutoff is a single finite numeric in
[0,1] (default 0.10 otherwise) — replace the loose checks around query, cutoff
and the cat(...) blocks with these stricter validations referencing the existing
method, query and cutoff variables so the function fails fast on invalid tags or
invalid cutoff values.
- Around line 84-92: The early-return branches in filterSubnetworkByContext
currently return the raw evidence data frame without the required score column;
update both branches (the nrow(evidence) == 0 and length(pmids) == 0 cases) to
ensure evidence always contains a numeric score column (e.g., add evidence$score
<- NA_real_ if missing) before returning list(nodes = nodes, edges = edges,
evidence = evidence) so callers can safely access result$evidence$score without
special-casing.
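The early-return contract above can be sketched as a small guard. `ensure_score_column` is a hypothetical helper name; the column names follow the evidence table described in this PR, and `rep()` is used so the guard also works on zero-row data frames:

```r
# Ensure the evidence data frame always exposes a numeric `score` column,
# even on early returns where no scoring took place.
ensure_score_column <- function(evidence) {
  if (!"score" %in% names(evidence)) {
    # rep() sizes the column to the row count, so 0-row frames work too
    evidence$score <- rep(NA_real_, nrow(evidence))
  }
  evidence
}

empty_evidence <- data.frame(pmid = character(0), text = character(0))
out <- ensure_score_column(empty_evidence)
names(out)        # "pmid" "text" "score"
class(out$score)  # "numeric"
```

With this guard in both early-return branches, callers can access `result$evidence$score` without special-casing the empty-evidence path.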
---
Duplicate comments:
In `@R/filterSubnetworkByContext.R`:
- Around line 301-320: The fetch error handling currently stores failed PubMed
fetches as an empty string, conflating network/transient failures with genuinely
empty abstracts; update the tryCatch error branch so that when entrez_fetch
fails you set results[[pmid]] <- NA (not ""), leave successful-but-empty
abstracts as "" as before, and then update the downstream
filterSubnetworkByContext() logic to treat NA as "unknown" (i.e., do not coerce
NA to a zero/score) so callers can decide how to handle fetch failures.
Reference: the tryCatch block around entrez_fetch, the results[[pmid]]
assignment, and filterSubnetworkByContext().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ba5b423e-9cf8-448b-b9a0-443fe909d790
📒 Files selected for processing (2)
- R/filterSubnetworkByContext.R
- tests/testthat/test-filterSubnetworkByContext.R
Motivation and Context
Large INDRA-derived protein–protein interaction subnetworks include edges supported by literature across many biological contexts. This PR adds a feature to score evidence texts (PubMed abstracts) against a user-supplied query and filter the subnetwork to retain edges supported by contextually relevant evidence. Two scoring approaches are provided: tag-count matching and TF‑IDF cosine similarity. The goal is to let users focus subnetworks on literature contexts relevant to their research question.
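A hedged usage sketch of the new export follows. The argument names (`nodes`, `edges`, `query`, `method`, `cutoff`) and the returned `list(nodes, edges, evidence)` shape come from the review discussion above; the `getSubnetworkFromIndra()` call, its `genes` argument, and the query strings are illustrative, and the package must be loaded first.

```r
# Assumes the package exporting filterSubnetworkByContext() is loaded
subnetwork <- getSubnetworkFromIndra(genes)  # nodes + edges, per the existing API

# Tag-count mode: keep evidence whose abstract mentions >= 1 query tag
filtered_tags <- filterSubnetworkByContext(
  nodes  = subnetwork$nodes,
  edges  = subnetwork$edges,
  query  = c("dna damage repair", "double strand break"),
  method = "tag_count",
  cutoff = 1
)

# Cosine mode: TF-IDF similarity of each abstract vs. a free-text query
filtered_cosine <- filterSubnetworkByContext(
  nodes  = subnetwork$nodes,
  edges  = subnetwork$edges,
  query  = "dna damage response and repair signaling",
  method = "cosine",
  cutoff = 0.10
)

str(filtered_cosine)  # list of filtered nodes, edges, evidence
```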
Short solution summary
Detailed changes
- DESCRIPTION
- NAMESPACE (+15/-0)
- R/filterSubnetworkByContext.R (new; ~364 lines)
- man/filterSubnetworkByContext.Rd (new)
- vignettes/Filter-By-Context.Rmd (new)
- R/utils_getSubnetworkFromIndra.R
Unit tests added or modified
Coding guidelines / policy violations