
Enhance CLI and analysis features for QTL Cascade and eQTL/pQTL#79

Merged
enriquea merged 30 commits into main from dev
Mar 28, 2026
Conversation

@enriquea
Collaborator

@enriquea enriquea commented Mar 28, 2026

  • Investigate Codacy static analysis CI failure
  • Fix unused import DEFAULT_COLOC_H4_THRESHOLD in hvantk/qtlcascade/coloc.py
  • Fix unused import CLINGEN_HEADER_SKIP_LINES in hvantk/tables/table_builders.py
  • Fix unused local imports tempfile and os in hvantk/tables/table_builders.py
  • Fix test_cascade_cmd and test_coloc_cmd in test_cli_qtlcascade.py to use patch.dict(sys.modules, ...) for mocking hail and hvantk.core.hail_context (avoids import failure when hail is not installed)
  • All 9 QTL-related tests pass
  • flake8 and CodeQL checks clean


@enriquea enriquea requested a review from Copilot March 28, 2026 11:29
@enriquea enriquea self-assigned this Mar 28, 2026
@enriquea enriquea added the documentation and enhancement labels Mar 28, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 28, 2026

📝 Walkthrough

Adds a new QTL Cascade feature: a qtlcascade package (cascade construction, ABF colocalization, pipeline orchestration, plotting, reporting), CLI subcommands to build/run/report cascades, new eQTL/pQTL table builders and registry entries, docs and tests; plus assorted formatting-only edits elsewhere.

Changes

Cohort / File(s) Summary
QTL Cascade package
hvantk/qtlcascade/__init__.py, hvantk/qtlcascade/cascade.py, hvantk/qtlcascade/coloc.py, hvantk/qtlcascade/constants.py, hvantk/qtlcascade/gene_summary.py, hvantk/qtlcascade/pipeline.py, hvantk/qtlcascade/plot.py, hvantk/qtlcascade/report.py
New package implementing cascade construction, Wakefield ABF coloc, gene-summary aggregation, stage-based pipeline (single/multi-tissue), plotting helpers, HTML report generation, public API exports, and constants.
CLI: qtlcascade + mktable subcommands
hvantk/commands/qtlcascade_cli.py, hvantk/commands/make_table_cli.py, hvantk/hvantk.py
New qtlcascade command group with cascade, coloc, run, report; added mktable eqtl and mktable pqtl subcommands; registered CLI group in root.
Table builders & registry
hvantk/tables/table_builders.py, hvantk/tables/registry.py
Added create_eqtl_tb and create_pqtl_tb builders (GTEx/eQTLGen/Fang parsing, locus/alleles/gene_id keys, optional p-threshold and gene-map handling) and registry entries "eqtl"/"pqtl".
CLI tests
hvantk/tests/test_cli_mktable_qtl.py, hvantk/tests/test_cli_qtlcascade.py
New CLI-level tests exercising mktable and qtlcascade commands with mocks for Hail/pipeline/IO to validate invocation, option parsing, and error paths.
Documentation
docs_site/examples/index.md, docs_site/examples/qtlcascade.md, docs_site/guide/data-sources.md, docs_site/guide/usage.md, docs_site/tools/qtlcascade.md
New and extended docs: quick starts (CLI & Python), data-source guidance (GTEx/eQTLGen/Fang), tool reference, outputs, schemas, and examples.
Public API wiring
hvantk/qtlcascade/__init__.py, hvantk/hvantk.py
Exposes qtlcascade symbols via package init and registers CLI command group.
Formatting / minor edits
multiple modules: hvantk/commands/*.py, hvantk/ptm/*, hvantk/data/*, hvantk/tests/test_ptm.py, hvantk/utils/table_utils.py, etc.
Predominantly stylistic rewrapping, minor logging/message formatting, and small refactors; no behavioral changes reported.
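The Wakefield ABF colocalization named in the package summary follows the standard coloc recipe: per-SNP approximate Bayes factors combined into posterior probabilities for hypotheses H0–H4. A minimal self-contained sketch, assuming coloc's usual defaults (prior effect s.d. 0.15; priors p1 = p2 = 1e-4, p12 = 1e-5) rather than the actual hvantk/qtlcascade/coloc.py implementation:

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(xs)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def labf(beta, se, w=0.15 ** 2):
    """Wakefield log approximate Bayes factor (H1 vs H0) for one SNP."""
    v = se ** 2
    r = w / (w + v)
    z = beta / se
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def coloc_abf(betas1, ses1, betas2, ses2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities for the five coloc hypotheses over one region."""
    l1 = [labf(b, s) for b, s in zip(betas1, ses1)]
    l2 = [labf(b, s) for b, s in zip(betas2, ses2)]
    lh0 = 0.0
    lh1 = math.log(p1) + logsumexp(l1)
    lh2 = math.log(p2) + logsumexp(l2)
    # H3: two distinct causal variants -> all cross pairs with i != j
    cross = [a + b for i, a in enumerate(l1) for j, b in enumerate(l2) if i != j]
    lh3 = math.log(p1) + math.log(p2) + logsumexp(cross)
    # H4: one shared causal variant -> diagonal pairs only
    lh4 = math.log(p12) + logsumexp([a + b for a, b in zip(l1, l2)])
    denom = logsumexp([lh0, lh1, lh2, lh3, lh4])
    return {h: math.exp(l - denom)
            for h, l in zip(("H0", "H1", "H2", "H3", "H4"),
                            (lh0, lh1, lh2, lh3, lh4))}
```

With a strong shared signal at the same variant in both traits, PP.H4 dominates; with distinct top variants, mass shifts to H3.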

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as "hvantk CLI\n(qtlcascade)"
    participant Hail as "Hail Tables\n(eQTL/pQTL)"
    participant Cascade as "Cascade Builder"
    participant Coloc as "Colocalization"
    participant GeneSum as "Gene Summary"
    participant Plot as "Plotting"
    participant Report as "Report Generator"
    participant FS as "Filesystem (outputs)"

    User->>CLI: hvantk qtlcascade run
    CLI->>Hail: load(eQTL_ht), load(pQTL_ht)
    CLI->>Cascade: build_cascade(eqtl_ht, pqtl_ht, thresholds, tissue)
    Cascade->>Hail: outer-join on (locus, alleles, gene_id)
    Cascade->>Cascade: classify rows (eqtl_mediated/discordant/eqtl_only/pqtl_only)
    Cascade->>FS: write cascade.ht
    CLI->>GeneSum: build_cascade_gene_summary(cascade.ht, overlays)
    GeneSum->>FS: write gene_summary.ht / .tsv
    alt colocalization enabled
        CLI->>Coloc: run_coloc_per_gene(allpairs_eqtl, allpairs_pqtl, genes)
        Coloc->>FS: write coloc_results.tsv
    end
    CLI->>Plot: plot_* (class counts, attenuation, coloc posteriors)
    Plot->>FS: save PNGs
    CLI->>Report: generate_report(output_dir, inputs...)
    Report->>FS: write HTML report
    FS->>User: return report path / summary outputs
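The "classify rows" step in the diagram above can be illustrated with a plain-Python sketch. The concrete rule used by cascade.py is not shown in this summary, so the sign-concordance test below is an assumption:

```python
def classify_row(has_eqtl, has_pqtl, eqtl_beta=None, pqtl_beta=None):
    """Hypothetical per-row classifier into the four cascade classes.

    Assumes "eqtl_mediated" means both signals present with concordant
    effect direction, and "discordant" means both present with opposite
    signs; the PR's actual rule may differ.
    """
    if has_eqtl and has_pqtl:
        same_sign = (eqtl_beta >= 0) == (pqtl_beta >= 0)
        return "eqtl_mediated" if same_sign else "discordant"
    if has_eqtl:
        return "eqtl_only"
    if has_pqtl:
        return "pqtl_only"
    return None
```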

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

variant-annotation, Review effort 5/5

Suggested reviewers

  • ypriverol

Poem

🐰 I hopped from variant to gene with cheer,
Tables joined, Bayes whispered what’s clear.
Plots and reports in a tidy cascade,
I nibbled bugs and left docs well-laid—
Carrot for the reviewer, hop-to-dear! 🥕

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 68.31%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The PR title 'Enhance CLI and analysis features for QTL Cascade and eQTL/pQTL' accurately captures the main additions: new CLI commands for eQTL/pQTL table creation and comprehensive QTL Cascade analysis features including documentation, workflows, and supporting infrastructure.


Copilot AI left a comment


Pull request overview

This PR adds a new QTL Cascade module to hvantk, enabling end-to-end molecular QTL integration workflows (eQTL + pQTL table building, cascade join/classification, optional colocalization, plots, and HTML reporting) exposed via both CLI and documented user workflows.

Changes:

  • Added new hvantk qtlcascade ... CLI group plus a stage-based CascadePipeline (single- and multi-tissue execution) with plotting + HTML reporting.
  • Added new hvantk mktable eqtl and hvantk mktable pqtl builders (GTEx v11/v8, eQTLGen, Fang pQTL) and registered them in the table registry.
  • Added extensive documentation + examples for QTL Cascade usage and data sources.

Reviewed changes

Copilot reviewed 40 out of 42 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
hvantk/tables/table_builders.py Adds eQTL/pQTL table builders + shared helpers for parsing GTEx-style variant IDs and scanning per-tissue inputs.
hvantk/tables/registry.py Registers new eqtl / pqtl table builders in the table registry.
hvantk/qtlcascade/__init__.py Exposes the public QTL Cascade API surface (pipeline, coloc, plotting, reporting).
hvantk/qtlcascade/constants.py Introduces cascade class constants, default thresholds/priors, tissue mappings, and source identifiers.
hvantk/qtlcascade/cascade.py Implements the core eQTL⊕pQTL outer-join + cascade classification.
hvantk/qtlcascade/coloc.py Implements ABF coloc logic and a Hail→pandas driver for per-gene coloc.
hvantk/qtlcascade/gene_summary.py Aggregates cascade HT results to gene-level summaries with optional overlays.
hvantk/qtlcascade/pipeline.py Adds stage-based orchestration for cascade → summary → coloc → outputs, including multi-tissue collection mode.
hvantk/qtlcascade/plot.py Adds matplotlib-based plots for cascade classes, coloc posteriors, cross-tissue heatmap, etc.
hvantk/qtlcascade/report.py Adds a static self-contained HTML report generator.
hvantk/commands/qtlcascade_cli.py Adds click CLI commands for cascade, coloc, run (pipeline), and report generation.
hvantk/commands/make_table_cli.py Adds mktable eqtl and mktable pqtl subcommands wiring into the new builders.
hvantk/hvantk.py Registers the new qtlcascade command group in the top-level CLI.
docs_site/tools/qtlcascade.md New full documentation page for QTL Cascade (CLI + Python API + outputs).
docs_site/guide/data-sources.md Adds QTL data sources guidance (GTEx/eQTLGen/Fang pQTL).
docs_site/guide/usage.md Adds usage examples for building eQTL/pQTL tables via mktable.
docs_site/examples/qtlcascade.md Adds an end-to-end QTL Cascade example workflow.
docs_site/examples/index.md Links the new QTL Cascade example from the examples index.
Comments suppressed due to low confidence (1)

hvantk/tables/table_builders.py:1866

  • The reverse mapping table is keyed by gene_name and then indexed via rev[ht.gene_symbol]. If gene_name is not unique in the mapping HT, this can produce non-unique keys and make indexing ambiguous or error. Consider validating uniqueness of gene_name (or grouping/deduplicating deterministically) before using it as a key.


Comment thread: hvantk/qtlcascade/plot.py
Comment on lines +95 to +98

    # Annotate counts
    for i, v in enumerate(counts):
        ax.text(v + max(counts) * 0.01, i, f"{v:,}", va="center", fontsize=10)

Copilot AI Mar 28, 2026


plot_cascade_classes() will raise when class_counts is empty (or contains none of the known CASCADE_CLASSES) because max(counts) is called during annotation. Consider returning an empty/"No data" plot (similar to plot_attenuation) when classes/counts is empty before calling max().
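The suggested guard can be sketched independently of matplotlib. Here annotation_positions is a hypothetical helper producing the (x, y, label) triples the ax.text loop would consume, returning an empty list instead of letting max() raise on empty input:

```python
def annotation_positions(counts):
    """Return (x, y, label) for each bar annotation.

    Empty input yields no positions rather than raising
    ValueError from max() on an empty sequence.
    """
    if not counts:
        return []
    offset = max(counts) * 0.01  # nudge labels just past the bar end
    return [(v + offset, i, f"{v:,}") for i, v in enumerate(counts)]
```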

Copilot uses AI. Check for mistakes.
Comment on lines +1511 to +1515

    for ext in extensions:
        matches.extend(sorted(p.glob(f"*{ext}")))
    if not matches:
        raise FileNotFoundError(f"No files matching {extensions} in {input_path}")
    return [(str(f), f.stem.split(".")[0]) for f in matches]

Copilot AI Mar 28, 2026


_parse_gtex_variant_id() uses the contig string from the GTEx-style variant_id (e.g. "chr1") directly in hl.locus(..., reference_genome=...). For GRCh37, Hail contigs are typically "1" (no "chr" prefix), so this can fail when --ref-genome GRCh37 is used. Consider normalizing contigs based on reference_genome (strip/add "chr" as needed) before constructing the locus.
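The proposed normalization can be sketched as a small helper (normalize_contig is an illustrative name, not part of the PR), relying on the convention that GRCh38 contigs carry a "chr" prefix while GRCh37 contigs do not:

```python
def normalize_contig(contig, reference_genome):
    """Normalize the 'chr' prefix to match the reference genome's naming.

    GRCh38 uses 'chr1'; GRCh37 uses '1'. Strips or adds the prefix
    as needed before the string is handed to hl.locus(...).
    """
    bare = contig[3:] if contig.lower().startswith("chr") else contig
    return f"chr{bare}" if reference_genome == "GRCh38" else bare
```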

Comment thread: hvantk/qtlcascade/pipeline.py (outdated)
Comment on lines +210 to +214

    result = self._stage_build_cascade(result, tissue, tissue_dir)

    # Stage 2: Gene summary
    result = self._stage_gene_summary(result, tissue_dir)

Copilot AI Mar 28, 2026


run() builds the gene summary before running coloc. Since build_cascade_gene_summary() takes coloc_df to overlay coloc_max_h4/coloc_n_tissues, gene_summary.ht will never include coloc annotations with the current stage order. Consider running coloc before gene summary, or regenerating/patching the gene summary after coloc completes.

Comment on lines +21 to +25
@click.group(
name="qtlcascade",
help="Molecular QTL cascade analysis (eQTL → pQTL → disease).",
context_settings=CONTEXT_SETTINGS,
)

Copilot AI Mar 28, 2026


This new qtlcascade CLI module isn't covered by the existing CLI test suite yet (there are many click command tests under hvantk/tests/). Consider adding basic invocation tests (patching CascadePipeline/build_cascade/run_coloc_per_gene) to ensure option wiring and exit codes are stable.

Comment thread: hvantk/qtlcascade/constants.py (outdated)
Comment on lines +77 to +78
EQTL_SOURCES = ("gtex_v11", "gtex_v8", "eqtlgen")
PQTL_SOURCES = ("gtex_fang", "ukb_ppp", "decode", "sun_2018")

Copilot AI Mar 28, 2026


PQTL_SOURCES lists several identifiers (e.g. "ukb_ppp", "decode", "sun_2018") but create_pqtl_tb currently raises NotImplementedError for anything except "gtex_fang". This is likely to confuse downstream callers (and can break CLI choice generation if it ever uses PQTL_SOURCES). Consider restricting PQTL_SOURCES to implemented sources for now, or splitting into SUPPORTED vs IMPLEMENTED constants.
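The suggested split into implemented vs. planned sources might look like the following; the constant and function names are illustrative, not the PR's actual API:

```python
# Sources create_pqtl_tb can actually build today
PQTL_SOURCES_IMPLEMENTED = ("gtex_fang",)
# Sources advertised for future support
PQTL_SOURCES_PLANNED = ("ukb_ppp", "decode", "sun_2018")


def validate_pqtl_source(source):
    """Fail fast with a clear message instead of a late NotImplementedError."""
    if source in PQTL_SOURCES_IMPLEMENTED:
        return source
    if source in PQTL_SOURCES_PLANNED:
        raise NotImplementedError(
            f"pQTL source '{source}' is planned but not yet implemented"
        )
    raise ValueError(f"Unknown pQTL source '{source}'")
```

A CLI could then build its --source choices from PQTL_SOURCES_IMPLEMENTED only, so users never select a backend that cannot run.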

Comment on lines +340 to +343
    ht = hl.read_table(result.cascade_ht_path)
    cascade_genes = list(
        ht.filter(ht.cascade_class == "eqtl_mediated").aggregate(
            hl.agg.collect_as_set(ht.gene_id)

Copilot AI Mar 28, 2026


_stage_coloc() says it is getting "genes with both eQTL and pQTL", but it filters to cascade_class == "eqtl_mediated" only. Genes with both signals but discordant direction (cascade_class == "discordant") will be excluded from coloc. Consider including discordant genes too if the intent is "both present" rather than "concordant only".

Comment on lines +420 to +424
    rows = []
    for tissue, res in results.items():
        if res.class_counts:
            for cls, cnt in res.class_counts.items():
                rows.append(

Copilot AI Mar 28, 2026


_generate_cross_tissue_outputs() populates the heatmap input by setting gene_id to the cascade class label and n_concordant to the class count. This doesn't match plot_cross_tissue_heatmap()'s documented input (genes × tissues) and will produce a misleading plot. Consider either plotting class counts explicitly (rename fields/value_col) or building a true gene×tissue matrix from gene summaries.

Comment on lines +199 to +203
@click.pass_context
def run_cmd(
    ctx,
    eqtl_ht,
    pqtl_ht,

Copilot AI Mar 28, 2026


The multi-tissue summary in run_cmd hardcodes the coloc threshold as 0.8. Since DEFAULT_COLOC_H4_THRESHOLD is already defined in hvantk.qtlcascade.constants and used elsewhere, consider referencing the constant here to keep behavior consistent if the default changes.

Comment on lines +1699 to +1701
    input_path=input_path,
    output_path=output_path,
    import_func=import_func,

Copilot AI Mar 28, 2026


The docstring says tissue can "override inferred tissue name", but the import helpers always set tissue=tname from the filename and use the tissue parameter only as a filter. Either update the docs to reflect filter-only behavior or implement an override (e.g., when tissue is provided for a single file, set the output tissue field to that value).

Comment on lines +100 to +104
    logger.info("Overlaying disease-gene labels from %s", disease_genes_ht_path)
    disease_ht = hl.read_table(disease_genes_ht_path)
    gene_ht = gene_ht.annotate(
        is_disease_gene=hl.is_defined(disease_ht.index(gene_ht.gene_id)),
    )

Copilot AI Mar 28, 2026


The disease-gene overlay indexes disease_ht by gene_ht.gene_id, which assumes the provided disease table is keyed by Ensembl gene_id. hvantk's ClinGen gene-disease table builder keys by hgnc_id by default, so passing a ClinGen HT (as suggested in docs/CLI) will produce no matches. Consider validating the expected key or supporting hgnc_id via an HGNC/Ensembl mapping join.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hvantk/tests/test_ptm.py (1)

224-233: ⚠️ Potential issue | 🟡 Minor

Use tmp_path fixture instead of hardcoded /tmp/test paths.

Line 232 uses hardcoded /tmp/test paths for test validation. This is brittle (permissions, cleanup, Windows compatibility) and inconsistent with the tmp_path fixture already used in test_map_ptm_sites_roundtrip (line 133).

🧪 Proposed fix
-def test_ptm_build_config_validation():
+def test_ptm_build_config_validation(tmp_path):
     """PTMBuildConfig.validate catches missing required fields."""
     config = PTMBuildConfig()
     errors = config.validate()
     assert len(errors) == 2
     assert any("output_dir" in e for e in errors)
     assert any("output_ht" in e for e in errors)
 
-    config2 = PTMBuildConfig(output_dir="/tmp/test", output_ht="/tmp/test/out.ht")
+    config2 = PTMBuildConfig(
+        output_dir=str(tmp_path), 
+        output_ht=str(tmp_path / "out.ht")
+    )
     assert config2.validate() == []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_ptm.py` around lines 224 - 233, Update the
test_ptm_build_config_validation to use the pytest tmp_path fixture instead of
hardcoded "/tmp/test" strings: add tmp_path as a test parameter, set output_dir
to str(tmp_path) (or tmp_path.as_posix()) and output_ht to str(tmp_path /
"out.ht") when constructing PTMBuildConfig (refer to PTMBuildConfig and config2
in the test), so the test uses an ephemeral, OS-independent directory and file
path.
🧹 Nitpick comments (6)
hvantk/datasets/uniprot_ptm_datasets.py (1)

323-323: Use explicit f-string conversion for exception text.

Ruff’s RUF010 applies here: prefer {e!s} instead of str(e) inside the f-string.

Suggested diff
-            raise RuntimeError(f"Failed to download UniProt PTM data: {str(e)}") from e
+            raise RuntimeError(f"Failed to download UniProt PTM data: {e!s}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/uniprot_ptm_datasets.py` at line 323, Replace the explicit
str(e) conversion inside the f-string with the f-string conversion flag {e!s} on
the raise RuntimeError line (i.e., change raise RuntimeError(f"Failed to
download UniProt PTM data: {str(e)}") from e to use {e!s}) so the exception text
uses the explicit f-string conversion and satisfies RUF010; keep the "from e"
chaining intact.
hvantk/data/gencc_streamer.py (1)

274-276: Consider using a single f-string for cleaner formatting.

The current approach using two adjacent f-strings works correctly but is somewhat unconventional. A single f-string would be more idiomatic and readable:

         logger.info(
-            f"Found {len(genes)} consensus genes " f"(min_submitters={min_submitters})"
+            f"Found {len(genes)} consensus genes (min_submitters={min_submitters})"
         )

Also noting that this formatting change appears unrelated to the PR's stated objectives (QTL Cascade and eQTL/pQTL features).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/data/gencc_streamer.py` around lines 274 - 276, The logger.info call
currently uses two adjacent f-strings; replace them with a single f-string in
the logger.info(...) invocation so the message reads in one formatted string
referencing len(genes) and min_submitters (e.g., inside the same f-string),
keeping the call to logger.info and the variables genes and min_submitters
unchanged for readability.
hvantk/ptm/plot.py (1)

120-140: Consider adding strict=True to zip() calls for safer iteration.

Lines 120 and 130 use zip() without an explicit strict= parameter. While the arrays are correctly paired in this context, adding strict=True (Python 3.10+) would catch length mismatches earlier if the code changes in the future.

♻️ Optional enhancement
-    for bar, n in zip(bars_p, path_counts):
+    for bar, n in zip(bars_p, path_counts, strict=True):
         if n > 0:
             ax1.text(
                 bar.get_x() + bar.get_width() / 2,
                 bar.get_height(),
                 str(n),
                 ha="center",
                 va="bottom",
                 fontsize=8,
             )
-    for bar, n in zip(bars_b, benign_counts):
+    for bar, n in zip(bars_b, benign_counts, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` around lines 120 - 140, The zip() calls iterating over
bars_p/path_counts and bars_b/benign_counts should be made strict to catch
future length mismatches; update the two zip(...) calls to use zip(...,
strict=True) when iterating in the loops that call ax1.text (referencing symbols
bars_p, path_counts, bars_b, benign_counts) so a ValueError is raised if lengths
differ.
hvantk/commands/ptm_cli.py (1)

191-191: Remove redundant hail import in ptm_annotate.

hail is already imported at Line 180 in the same try block, so Line 191 is duplicate and can be dropped for clarity.

♻️ Minimal cleanup
-        import hail as hl
-
         counts = result.aggregate(
             hl.struct(
                 n_total=hl.agg.count(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/ptm_cli.py` at line 191, The duplicate import of hail inside
the ptm_annotate try block should be removed: there is already an import hail
earlier in that same try (around Line 180), so delete the redundant "import
hail" statement at the later occurrence to avoid duplicate imports and improve
clarity in the ptm_annotate function/block.
hvantk/qtlcascade/gene_summary.py (1)

75-76: Defensive code for gene_symbol that currently won't trigger.

Per cascade.py lines 91-124, build_cascade does not create a gene_symbol field in its output. This conditional will never evaluate to true with the current cascade builder implementation.

If this is intentional future-proofing for potential schema changes or external cascade tables, consider adding a brief comment to clarify the intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/gene_summary.py` around lines 75 - 76, The conditional
checking "gene_symbol" in row_fields inside gene_summary (the
agg_exprs["gene_symbol"] = hl.agg.take(ht.gene_symbol, 1)[0] branch) is dead
with the current cascade builder (see build_cascade in cascade.py not emitting
gene_symbol); either remove the unreachable conditional and associated agg_expr,
or keep it but add a brief clarifying comment above referencing build_cascade
and that this is intentional defensive/future-proofing for external
cascades/schema changes so reviewers understand it is deliberate. Ensure the
comment names the symbols row_fields, agg_exprs, gene_symbol, and build_cascade
to make the rationale discoverable.
hvantk/qtlcascade/__init__.py (1)

76-113: Consider sorting __all__ for lint stability.

Ruff reports RUF022; sorting avoids unnecessary lint noise/churn in future edits.

Suggested approach
-__all__ = [
-    ...
-]
+__all__ = sorted([
+    ...
+])

Or keep a manually sorted static list if you prefer explicit ordering.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/__init__.py` around lines 76 - 113, The __all__ export list
is unsorted which triggers RUF022; sort the entries in the __all__ list (e.g.,
"build_cascade", "build_cascade_gene_summary", "CASCADE_CLASS_COLORS",
"CASCADE_CLASSES", "CascadeConfig", "CascadePipeline", "CascadeResult",
"DEFAULT_COLOC_H4_THRESHOLD", etc.) into a stable lexicographic order (or
replace with a manually sorted static ordering) so the list is deterministic and
lint-stable; update the existing __all__ definition accordingly while preserving
all original names.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs_site/examples/index.md`:
- Line 123: Update the "Outputs:" line so it clearly marks coloc results as
conditional on providing allpairs inputs: change the Outputs list (the line
beginning "**Outputs:**") to indicate that "coloc results (TSV)" are only
produced if the allpairs inputs are passed (for example add "(if allpairs inputs
provided)" or similar wording next to "coloc results (TSV)") so readers don’t
assume coloc is always generated.

In `@docs_site/tools/qtlcascade.md`:
- Line 108: Several fenced code blocks in qtlcascade.md are missing language
identifiers (causing markdownlint MD040); update each untagged triple-backtick
block (the occurrences around the visible code fences) to include an appropriate
language tag (e.g., ```bash for shell/command blocks, ```text for
pseudo-formula/text blocks, ```python for Python snippets) so every ``` fence in
the file is changed to a tagged form (e.g., ```text, ```bash, or ```python) to
satisfy linting and ensure consistent rendering.
- Line 247: The table cell for `attenuation_ratio` contains raw pipe characters
which break Markdown table columns; update the description in the
`attenuation_ratio` cell (currently "1 - |pQTL_beta|/|eQTL_beta| (eqtl_mediated
only)") by escaping the inner pipe characters (e.g., replace
`|pQTL_beta|/|eQTL_beta|` with `\|pQTL_beta\|/\|eQTL_beta\|`) so the table
renders correctly.

In `@hvantk/commands/qtlcascade_cli.py`:
- Around line 248-250: The dry-run branch calls pipeline.show_plan() but
discards its return value, so nothing is printed; capture the returned formatted
plan from pipeline.show_plan() and output it (e.g., via click.echo or printing)
before returning so `hvantk qtlcascade run --dry-run` actually displays the
plan; update the block that checks dry_run to assign plan = pipeline.show_plan()
and echo/log that plan (use the CLI's existing echo/logger mechanism).

In `@hvantk/ptm/plot.py`:
- Around line 146-161: The displayed CI string uses a Unicode EN DASH between
result.enrichment_ci_low and ci_hi_str inside the ax1.text call; replace that
character with the ASCII hyphen-minus '-' in the f-string (the ax1.text block
that references ci_hi_str, result.enrichment_ci_low, and
result.enrichment_odds_ratio) so the confidence interval is rendered as
f"({result.enrichment_ci_low:.2f}-{ci_hi_str})" to avoid encoding issues.

In `@hvantk/qtlcascade/cascade.py`:
- Around line 77-104: The join currently only keys on (locus, alleles, gene_id)
so when no tissue filter is passed rows from different tissues collapse; fix by
including tissue in the join key whenever the tables contain a tissue column and
the caller did not supply a single tissue: detect if "tissue" in
list(eqtl_ht.row) or "tissue" in list(pqtl_ht.row) and if tissue is None, call
key_by/key on both eqtl_ht and pqtl_ht to include tissue (e.g., key_by(locus,
alleles, gene_id, tissue)) before performing ht = eqtl_ht.join(pqtl_ht,
how="outer"); alternatively raise a clear error requiring the tissue argument if
you prefer to force single-tissue inputs.

In `@hvantk/qtlcascade/coloc.py`:
- Around line 208-227: The join currently performed by eqtl_sel.join(pqtl_sel,
how="inner") can mix rows from different tissues when tissue is not provided;
update the logic so that if the tables contain a "tissue" field (check "tissue"
in list(eqtl_ht.row) and list(pqtl_ht.row)) you add tissue to the selected
fields (e.g., include tissue in eqtl_sel and pqtl_sel) and include tissue in the
join key, otherwise raise or require the tissue argument; modify the code around
eqtl_sel, pqtl_sel and the joined assignment to either require tissue or join on
(locus/position, alleles, gene_id, tissue) so rows from different tissues are
never merged.

In `@hvantk/qtlcascade/constants.py`:
- Around line 77-78: PQTL_SOURCES currently advertises multiple backends but
create_pqtl_tb raises NotImplementedError for all but gtex_fang; update the
constant to only list implemented backends (keep PQTL_SOURCES = ("gtex_fang",))
or split into two constants (e.g., PQTL_SOURCES_IMPLEMENTED and
PQTL_SOURCES_PLANNED) and update any CLI/help references to use the implemented
list; locate the PQTL_SOURCES definition in constants.py and the create_pqtl_tb
function in hvantk/tables/table_builders.py to ensure they are consistent.

In `@hvantk/qtlcascade/gene_summary.py`:
- Around line 91-97: After reading constraint_ht in gene_summary.py, validate
its key fields and types before calling constraint_ht.index(gene_ht.gene_id):
check that constraint_ht.key_fields (or constraint_ht.key) contains "gene_id"
and that the key field type matches the type of gene_ht.gene_id; if not, log an
error (including actual key fields and types) and raise or return early to avoid
silent missing lookups or TableIndexKeyError. Also wrap the .index(...) call in
a try/except for TableIndexKeyError to produce a clear error message referencing
constraint_ht and gene_ht rather than letting the raw exception propagate.
Ensure you update the constraint_idx/oe_lof_upper annotation only after
successful validation.

In `@hvantk/qtlcascade/pipeline.py`:
- Around line 209-217: The gene-summary is built before coloc runs so
result.coloc_df remains None; change the stage ordering to run
_stage_coloc(result, tissue) before _stage_gene_summary(result, tissue_dir) (or
alternatively call _stage_gene_summary again after _stage_coloc to merge coloc
into the summary); apply the same fix for the later block that mirrors lines
316-323 so any path that enables coloc populates result.coloc_df before
_stage_gene_summary is invoked.
- Around line 219-220: The single-tissue run path never generates the HTML
report because generate_report() is only called from run_collection(); ensure
report generation is invoked for the single-sample/run() path by calling
generate_report(...) at the end of the normal run flow (or by updating
_stage_outputs(result, safe_name) to call generate_report when appropriate).
Specifically, add a call to generate_report within run() after _stage_outputs
(or inside _stage_outputs when safe_name/result indicate single-tissue mode) so
that functions generate_report and _stage_outputs both produce the HTML report
for non-collection runs just like run_collection() does.
- Around line 339-345: The code currently filters the cascade table at
ht.filter(ht.cascade_class == "eqtl_mediated") when building cascade_genes,
which excludes genes labeled "discordant" that also have both eQTL and pQTL
signals; update the filter to include both classes (e.g., ht.cascade_class ==
"eqtl_mediated" or ht.cascade_class == "discordant", or use
hl.set(['eqtl_mediated','discordant']).contains(ht.cascade_class)) so
cascade_genes (derived from result.cascade_ht_path / ht.aggregate /
hl.agg.collect_as_set) includes discordant genes too.

In `@hvantk/qtlcascade/plot.py`:
- Around line 126-138: The empty-data branch currently returns a placeholder
figure without saving it, so downstream consumers miss the expected PNG; modify
the branch where both.empty is True to call the module's saving helper (e.g.,
_save_figure or whatever save routine is used elsewhere) with the created fig
and output_path before returning; ensure you pass the same figsize/format
parameters used elsewhere and keep the logger.warning and "No data" text, then
return the saved fig so the output_path file exists for report assembly.

In `@hvantk/qtlcascade/report.py`:
- Around line 89-96: Escape all user/data-derived strings before inserting into
HTML: import html and wrap values like tissues items, title, description, plot
names, TSV cell values, and any variables used in rows.append (e.g., tissues,
cls, cnt, pct, and other interpolated variables in the blocks at lines ~135-145,
181-185, 213-242) with html.escape(...). Update the code paths that build HTML
rows (the rows.append calls shown and similar ones elsewhere) to call
html.escape on each non-markup value and format numbers separately (don’t escape
numeric formatting), ensuring you escape join elements (e.g., ',
'.join(html.escape(t) for t in tissues)) and any string inserted into table
cells.
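The escaping pattern reads as follows in a minimal, self-contained sketch (the row fields and helper name are illustrative, not the module's actual API):

```python
import html

def render_row(cls, cnt, pct, tissues):
    # Escape every data-derived string; numeric values are formatted, not escaped.
    safe_cls = html.escape(cls)
    safe_tissues = ", ".join(html.escape(t) for t in tissues)
    return (
        f"<tr><td>{safe_cls}</td><td>{cnt}</td>"
        f"<td>{pct:.1f}%</td><td>{safe_tissues}</td></tr>"
    )

# The markup stays well-formed even if a label contains '<' or '&'.
row = render_row("eqtl_mediated<script>", 12, 34.5, ["Brain/Cortex", "Liver & Kidney"])
```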

In `@hvantk/tables/table_builders.py`:
- Around line 1820-1839: The pQTL builder currently falls back to raw gene
symbols when gene_map_ht is omitted, which will prevent cascade joins because
create_eqtl_tb keys by Ensembl gene_id; change the logic in the block that uses
gene_map_ht so that if gene_map_ht is None you either raise a clear exception
(e.g., ValueError) requiring a gene_map_ht for cascade-compatible pQTL tables or
require an explicit opt-out flag (add a parameter like
allow_raw_gene_symbol=False) and only permit the fallback when that flag is set
true; update the code paths around gene_map_ht, ht.annotate(gene_id=...) and the
final ht.key_by("locus","alleles","gene_id") to enforce this check and surface a
helpful error message mentioning create_eqtl_tb and the cascade key tuple.
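As a plain-Python sketch of that guard (a dict stands in for the Hail gene-map table; the helper name and opt-out flag are the suggested additions, not existing API):

```python
def resolve_gene_id(gene_symbol, gene_map=None, allow_raw_gene_symbol=False):
    # Require an explicit symbol -> Ensembl ID map unless the caller opts out,
    # since the cascade join keys on (locus, alleles, gene_id) with Ensembl IDs.
    if gene_map is None:
        if not allow_raw_gene_symbol:
            raise ValueError(
                "gene_map_ht is required for cascade-compatible pQTL tables: "
                "create_eqtl_tb keys rows by Ensembl gene_id. Pass "
                "allow_raw_gene_symbol=True only for non-cascade use."
            )
        return gene_symbol
    # Fall back to the raw symbol only when the map has no entry.
    return gene_map.get(gene_symbol, gene_symbol)
```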

---

Outside diff comments:
In `@hvantk/tests/test_ptm.py`:
- Around line 224-233: Update the test_ptm_build_config_validation to use the
pytest tmp_path fixture instead of hardcoded "/tmp/test" strings: add tmp_path
as a test parameter, set output_dir to str(tmp_path) (or tmp_path.as_posix())
and output_ht to str(tmp_path / "out.ht") when constructing PTMBuildConfig
(refer to PTMBuildConfig and config2 in the test), so the test uses an
ephemeral, OS-independent directory and file path.

---

Nitpick comments:
In `@hvantk/commands/ptm_cli.py`:
- Line 191: The duplicate import of hail inside the ptm_annotate try block
should be removed: there is already an import hail earlier in that same try
(around Line 180), so delete the redundant "import hail" statement at the later
occurrence to avoid duplicate imports and improve clarity in the ptm_annotate
function/block.

In `@hvantk/data/gencc_streamer.py`:
- Around line 274-276: The logger.info call currently uses two adjacent
f-strings; replace them with a single f-string in the logger.info(...)
invocation so the message reads in one formatted string referencing len(genes)
and min_submitters (e.g., inside the same f-string), keeping the call to
logger.info and the variables genes and min_submitters unchanged for
readability.

In `@hvantk/datasets/uniprot_ptm_datasets.py`:
- Line 323: Replace the explicit str(e) conversion inside the f-string with the
f-string conversion flag {e!s} on the raise RuntimeError line (i.e., change
raise RuntimeError(f"Failed to download UniProt PTM data: {str(e)}") from e to
use {e!s}) so the exception text uses the explicit f-string conversion and
satisfies RUF010; keep the "from e" chaining intact.

In `@hvantk/ptm/plot.py`:
- Around line 120-140: The zip() calls iterating over bars_p/path_counts and
bars_b/benign_counts should be made strict to catch future length mismatches;
update the two zip(...) calls to use zip(..., strict=True) when iterating in the
loops that call ax1.text (referencing symbols bars_p, path_counts, bars_b,
benign_counts) so a ValueError is raised if lengths differ.

In `@hvantk/qtlcascade/__init__.py`:
- Around line 76-113: The __all__ export list is unsorted which triggers RUF022;
sort the entries in the __all__ list (e.g., "build_cascade",
"build_cascade_gene_summary", "CASCADE_CLASS_COLORS", "CASCADE_CLASSES",
"CascadeConfig", "CascadePipeline", "CascadeResult",
"DEFAULT_COLOC_H4_THRESHOLD", etc.) into a stable lexicographic order (or
replace with a manually sorted static ordering) so the list is deterministic and
lint-stable; update the existing __all__ definition accordingly while preserving
all original names.

In `@hvantk/qtlcascade/gene_summary.py`:
- Around line 75-76: The conditional checking "gene_symbol" in row_fields inside
gene_summary (the agg_exprs["gene_symbol"] = hl.agg.take(ht.gene_symbol, 1)[0]
branch) is dead with the current cascade builder (see build_cascade in
cascade.py not emitting gene_symbol); either remove the unreachable conditional
and associated agg_expr, or keep it but add a brief clarifying comment above
referencing build_cascade and that this is intentional defensive/future-proofing
for external cascades/schema changes so reviewers understand it is deliberate.
Ensure the comment names the symbols row_fields, agg_exprs, gene_symbol, and
build_cascade to make the rationale discoverable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6c03a748-d238-420c-8258-64120eaa0af7

📥 Commits

Reviewing files that changed from the base of the PR and between 7d4f787 and 0cbdea5.

⛔ Files ignored due to path filters (2)
  • docs_site/images/hvantk-architecture.svg is excluded by !**/*.svg
  • docs_site/images/hvantk-qtlcascade-workflow.svg is excluded by !**/*.svg
📒 Files selected for processing (40)
  • docs_site/examples/index.md
  • docs_site/examples/qtlcascade.md
  • docs_site/guide/data-sources.md
  • docs_site/guide/usage.md
  • docs_site/tools/qtlcascade.md
  • hvantk/annotation/annotate.py
  • hvantk/commands/gencc_downloader.py
  • hvantk/commands/genesets_cli.py
  • hvantk/commands/hgc/__init__.py
  • hvantk/commands/make_table_cli.py
  • hvantk/commands/ptm_cli.py
  • hvantk/commands/qtlcascade_cli.py
  • hvantk/commands/uniprot_ptm_downloader.py
  • hvantk/commands/validate_bgzf_cli.py
  • hvantk/data/cosmic_cgc_streamer.py
  • hvantk/data/file_utils.py
  • hvantk/data/gencc_streamer.py
  • hvantk/data/gene_disease_streamer.py
  • hvantk/datasets/uniprot_ptm_datasets.py
  • hvantk/hvantk.py
  • hvantk/psroc/pipeline.py
  • hvantk/ptm/analysis.py
  • hvantk/ptm/annotate.py
  • hvantk/ptm/constants.py
  • hvantk/ptm/mapper.py
  • hvantk/ptm/pipeline.py
  • hvantk/ptm/plot.py
  • hvantk/ptm/report.py
  • hvantk/qtlcascade/__init__.py
  • hvantk/qtlcascade/cascade.py
  • hvantk/qtlcascade/coloc.py
  • hvantk/qtlcascade/constants.py
  • hvantk/qtlcascade/gene_summary.py
  • hvantk/qtlcascade/pipeline.py
  • hvantk/qtlcascade/plot.py
  • hvantk/qtlcascade/report.py
  • hvantk/tables/registry.py
  • hvantk/tables/table_builders.py
  • hvantk/tests/test_ptm.py
  • hvantk/utils/table_utils.py

Comment thread docs_site/examples/index.md Outdated
Comment thread docs_site/tools/qtlcascade.md Outdated
Comment thread docs_site/tools/qtlcascade.md Outdated
Comment thread hvantk/commands/qtlcascade_cli.py
Comment thread hvantk/ptm/plot.py
Comment thread hvantk/qtlcascade/pipeline.py Outdated
Comment thread hvantk/qtlcascade/plot.py
Comment thread hvantk/qtlcascade/report.py Outdated
Comment on lines +1598 to +1623
def _import_eqtl_eqtlgen(input_path, reference_genome):
    """Import eQTLGen cis-eQTL summary statistics."""
    logger.info("Importing eQTLGen: %s", input_path)
    ht = hl.import_table(
        input_path,
        force=True,
        types={"Pvalue": hl.tfloat64, "Zscore": hl.tfloat64},
    )
    # Construct a GTEx-format variant_id so the shared parser can handle it.
    if reference_genome.startswith("GRCh38"):
        contig = hl.format("chr%s", ht.SNPChr)
    else:
        contig = ht.SNPChr
    ht = ht.select(
        gene_id_raw=ht.Gene,
        variant_id=hl.delimit(
            [contig, ht.SNPPos, ht.OtherAllele, ht.AssessedAllele, "b37"],
            "_",
        ),
        beta=hl.missing(hl.tfloat64),  # eQTLGen provides Z-score, not beta
        se=hl.missing(hl.tfloat64),
        p_value=ht.Pvalue,
        maf=hl.missing(hl.tfloat64),
        tissue="Blood",
        gene_symbol=ht.GeneSymbol,
    )

⚠️ Potential issue | 🟠 Major

eqtlgen is exposed as supported, but its tables are unusable downstream.

This path writes beta and se as missing for every row. The cascade builder reads those fields to classify overlap direction, and the coloc driver requires se, so an eqtlgen table cannot participate correctly in either workflow. Either derive usable effect statistics here or fail fast when source="eqtlgen" is requested for cascade/coloc inputs.
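One fail-fast shape for that check, as a plain-Python sketch (the helper name and arguments are illustrative, not the builder's actual signature):

```python
def require_effect_stats(source, has_beta, has_se):
    # Cascade classification reads beta and the coloc driver requires se, so a
    # source that leaves both missing (eQTLGen ships Z-scores only) should be
    # rejected up front rather than producing silently unusable tables.
    if not (has_beta and has_se):
        raise ValueError(
            f"source={source!r} lacks beta/se; it cannot feed the cascade "
            "builder or the coloc driver. Derive beta/se (e.g., from Z-scores "
            "and allele frequencies) or exclude this source."
        )

require_effect_stats("gtex", has_beta=True, has_se=True)  # passes silently
```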

Comment on lines +1820 to +1839
if gene_map_ht:
    logger.info("Mapping gene symbols → Ensembl IDs via %s", gene_map_ht)
    gm = hl.read_table(gene_map_ht)
    rev = (
        gm.key_by()
        .select("gene_id", "gene_name")
        .key_by("gene_name")
        .distinct()
    )
    ht = ht.annotate(
        gene_id=hl.or_else(rev[ht.gene_symbol].gene_id, ht.gene_symbol),
    )
else:
    ht = ht.annotate(gene_id=ht.gene_symbol)

if p_threshold is not None and p_threshold > 0:
    ht = ht.filter(ht.p_value <= p_threshold)

ht = ht.annotate(source=source, is_cis=True)
ht = ht.key_by("locus", "alleles", "gene_id")

⚠️ Potential issue | 🟠 Major

Make Ensembl gene mapping mandatory for cascade-compatible pQTL tables.

When gene_map_ht is omitted, Line 1833 keys the pQTL table by raw gene symbol. create_eqtl_tb() keys eQTL rows by Ensembl gene_id, and the cascade join uses (locus, alleles, gene_id), so the command can successfully build a pQTL HT that is almost guaranteed not to join with any eQTL HT. This should hard-fail or at least require an explicit opt-out for non-cascade use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/table_builders.py` around lines 1820 - 1839, The pQTL builder
currently falls back to raw gene symbols when gene_map_ht is omitted, which will
prevent cascade joins because create_eqtl_tb keys by Ensembl gene_id; change the
logic in the block that uses gene_map_ht so that if gene_map_ht is None you
either raise a clear exception (e.g., ValueError) requiring a gene_map_ht for
cascade-compatible pQTL tables or require an explicit opt-out flag (add a
parameter like allow_raw_gene_symbol=False) and only permit the fallback when
that flag is set true; update the code paths around gene_map_ht,
ht.annotate(gene_id=...) and the final ht.key_by("locus","alleles","gene_id") to
enforce this check and surface a helpful error message mentioning create_eqtl_tb
and the cascade key tuple.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (1)
hvantk/qtlcascade/pipeline.py (1)

76-91: ⚠️ Potential issue | 🟠 Major

Reject invalid thresholds and coloc priors in validate().

validate() currently accepts values like eqtl_p_threshold=-1, pqtl_p_threshold=2, or non-positive coloc priors/window sizes. Those inputs silently disable filters or feed nonsensical parameters into coloc instead of failing fast.

Suggested guardrails
     def validate(self) -> List[str]:
         """Return a list of validation errors (empty if valid)."""
         errors: List[str] = []
         if not self.eqtl_ht:
             errors.append("eqtl_ht is required.")
         if not self.pqtl_ht:
             errors.append("pqtl_ht is required.")
         if not self.output_dir:
             errors.append("output_dir is required.")
         # Coloc needs both allpairs
         if bool(self.eqtl_allpairs_ht) != bool(self.pqtl_allpairs_ht):
             errors.append(
                 "Both eqtl_allpairs_ht and pqtl_allpairs_ht are required "
                 "for colocalization (or omit both to skip coloc)."
             )
+        if not 0 <= self.eqtl_p_threshold <= 1:
+            errors.append("eqtl_p_threshold must be between 0 and 1.")
+        if not 0 <= self.pqtl_p_threshold <= 1:
+            errors.append("pqtl_p_threshold must be between 0 and 1.")
+        if self.coloc_window_kb <= 0:
+            errors.append("coloc_window_kb must be > 0.")
+        for name, value in (
+            ("coloc_p1", self.coloc_p1),
+            ("coloc_p2", self.coloc_p2),
+            ("coloc_p12", self.coloc_p12),
+        ):
+            if not 0 < value < 1:
+                errors.append(f"{name} must be between 0 and 1.")
+        if self.coloc_W <= 0:
+            errors.append("coloc_W must be > 0.")
         return errors
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/pipeline.py` around lines 76 - 91, The validate() method
must reject out-of-range thresholds and coloc parameters: add checks in
validate() to ensure eqtl_p_threshold and pqtl_p_threshold are between 0 and 1
(exclusive of negatives or >1) and, if present, that coloc priors (e.g.,
coloc_prior / coloc_prior_secondary) are >0 and <=1 and coloc window/size
parameters are positive integers; append descriptive error strings to errors
when any of these constraints fail, keeping the existing
eqtl_allpairs_ht/pqtl_allpairs_ht consistency check (use the validate() function
and attribute names eqtl_p_threshold, pqtl_p_threshold, coloc_prior,
coloc_window to locate where to add the new guards).
🧹 Nitpick comments (5)
hvantk/qtlcascade/plot.py (1)

338-340: Consider adding strict=True to zip() for defensive coding.

The zip() at line 338 combines bp["boxes"] and colors, which are derived from the same classes list so their lengths should always match. Adding strict=True would catch any future refactoring bugs where lengths could diverge.

Example fix
-    for patch, color in zip(bp["boxes"], colors):
+    for patch, color in zip(bp["boxes"], colors, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/plot.py` around lines 338 - 340, The loop using
zip(bp["boxes"], colors) should use zip(..., strict=True) to defensively ensure
bp["boxes"] and colors have identical lengths; update the iteration (where
bp["boxes"] and colors are created from classes) to call zip(bp["boxes"],
colors, strict=True) so any future mismatch between bp["boxes"] and colors
raises an error immediately.
hvantk/tests/test_cli_qtlcascade.py (1)

76-116: Consider using tmp_path fixture instead of hardcoded /tmp paths.

The tests at lines 87 and 112 use hardcoded /tmp/cascade_out paths. While the pipeline is mocked so no actual files are written, using pytest's tmp_path fixture would be more consistent with the other tests in this file and avoid static analysis warnings.

Example fix
-def test_run_cmd_dry_run():
+def test_run_cmd_dry_run(tmp_path):
     runner = CliRunner()
     result = runner.invoke(
         qtlcascade_group,
         [
             "run",
             "--eqtl-ht",
             "/data/eqtl.ht",
             "--pqtl-ht",
             "/data/pqtl.ht",
             "-o",
-            "/tmp/cascade_out",
+            str(tmp_path / "cascade_out"),
             "--dry-run",
         ],
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_cli_qtlcascade.py` around lines 76 - 116, Update the two
tests test_run_cmd_dry_run and test_run_cmd_single_tissue to use the pytest
tmp_path fixture instead of the hardcoded "/tmp/cascade_out": add tmp_path as a
test parameter, build an output path from tmp_path (e.g. tmp_path /
"cascade_out" converted to a string) and pass that string to the CLI args in
place of "/tmp/cascade_out"; ensure both test functions (and the runner.invoke
calls) use the new tmp_path-derived path so the tests use an isolated temporary
directory.
hvantk/qtlcascade/coloc.py (2)

80-85: _logdiff can return -inf or NaN when a <= b.

The function assumes a > b per the docstring, but if sum_abf1 + sum_abf2 <= sum_abf_both due to numerical precision issues, np.log1p(-np.exp(b - a)) will return -inf (when equal) or NaN (when b > a).

Consider adding a guard similar to the R coloc implementation:

Add numerical guard
 def _logdiff(a: float, b: float) -> float:
     """Compute ``log(exp(a) - exp(b))`` in log-space (requires ``a > b``).
 
     Matches ``logdiff`` in the R coloc package (``R/claudia.R``).
     """
+    if b >= a:
+        return float("-inf")
     return a + np.log1p(-np.exp(b - a))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/coloc.py` around lines 80 - 85, The _logdiff function can
produce -inf/NaN when a <= b; add a numerical guard at the start of _logdiff to
handle that case (e.g., if a <= b + eps) and return -np.inf (log(0)) instead of
computing np.log1p(-np.exp(b - a)); otherwise proceed with the existing
computation. Use a small epsilon (e.g., 1e-12) to tolerate tiny rounding errors
and reference the function name _logdiff so you update that exact function.

219-232: Warning for multi-tissue join mismatch is helpful but may not prevent data corruption.

When tissue is not provided but allpairs tables contain a tissue field, the code now emits a warning (lines 228-232). However, the join still proceeds on (locus, alleles, gene_id) only, which can incorrectly merge rows from different tissues into one coloc region.

The previous review suggested either requiring tissue when the field exists, or including tissue in the join key. The current approach relies on users heeding the warning, which may be insufficient for data integrity.

Consider enforcing tissue parameter when tissue field exists
     elif eqtl_has_tissue or pqtl_has_tissue:
-        logger.warning(
-            "No tissue filter provided but allpairs table(s) contain a "
-            "'tissue' field. Inner join on (locus, alleles, gene_id) may "
-            "cross-multiply rows from different tissues."
-        )
+        raise ValueError(
+            "No tissue filter provided but allpairs table(s) contain a "
+            "'tissue' field. Provide --tissue to avoid cross-tissue joins."
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/coloc.py` around lines 219 - 232, The code currently only
logs a warning when eqtl_ht or pqtl_ht contain a "tissue" column but no tissue
parameter is provided, which allows incorrect cross-tissue joins; change this to
require an explicit tissue: when eqtl_has_tissue or pqtl_has_tissue and tissue
is falsy, raise a ValueError (or use logger.error then raise) with a clear
message instructing the caller to provide the tissue parameter; keep the
existing filtering logic that applies eqtl_ht.filter(eqtl_ht.tissue == tissue)
and pqtl_ht.filter(pqtl_ht.tissue == tissue) when tissue is provided
(referencing eqtl_ht, pqtl_ht, and tissue), and remove or replace the
logger.warning block so the function fails fast instead of proceeding with a
risky join.
hvantk/tables/table_builders.py (1)

1534-1569: Parquet import requires Spark session to exist; no explicit initialization.

_import_eqtl_gtex_parquet calls SparkSession.builder.getOrCreate() which assumes a Spark context is already configured. If called before Hail initialization or outside a Spark environment, this may fail silently or with a confusing error.

Consider documenting this requirement or adding a check/initialization step.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/ptm/plot.py`:
- Around line 120-139: Update the two zip calls in hvantk/ptm/plot.py to use
strict matching: replace zip(bars_p, path_counts) with zip(bars_p, path_counts,
strict=True) and zip(bars_b, benign_counts) with zip(bars_b, benign_counts,
strict=True) so the loops that call ax1.text(...) on bars_p/path_counts and
bars_b/benign_counts explicitly enforce equal lengths and satisfy Ruff B905.

In `@hvantk/qtlcascade/pipeline.py`:
- Around line 464-467: The report currently collects all PNGs from the shared
plots directory via self._plots_dir.glob("*.png"), causing stale images from
other runs to appear; change plot discovery in the plot_paths construction (and
the similar block around lines 510-513) to only include images produced by the
current run—e.g., restrict the glob or filter by the current run
identifier/output directory used when saving plots (use self.output_dir or
self.run_id / run_name as a prefix or parent directory) so plot_paths only
contains files belonging to this invocation.
- Around line 201-202: The sanitized tissue slug produced by _sanitize() is
lossy and can collide (e.g., "Brain/Cortex" vs "Brain Cortex"), causing
tissue_dir and plot filenames to overwrite; update the code in run and
run_collection to derive a stable, unique slug from tissue_label by combining
the sanitized name with a short deterministic suffix (e.g., first 8 chars of a
hash or base64 digest of tissue_label) and use that single unique_slug for
per_tissue paths and plot filenames (replace uses of safe_name with unique_slug
for tissue_dir and any plots/<slug>_* generation, including the other
occurrences around the 245-253 and 539-543 regions) so distinct tissues never
collapse to the same output path.
- Around line 459-462: The try/except around hl.read_table(...).to_pandas()
silently swallows failures (setting gene_df to None) — change the except to
catch Exception as e, log a clear error including result.gene_summary_ht_path
and the exception details (e) via the module logger (or raise a new RuntimeError
with that context) and then either re-raise so report generation fails fast or
explicitly mark the report as failed; apply the same change to the other
identical block that reads the gene summary (the block using
hl.read_table(...).to_pandas() and gene_df).
- Around line 329-331: In _stage_gene_summary(), gene_ht.export(tsv_path) does
not honor self.config.overwrite, causing inconsistent rerun behavior; modify the
code around the export so that before calling gene_ht.export(tsv_path) you check
self.config.overwrite and if true remove any existing tsv_path (use the
appropriate filesystem removal for your execution environment: local os.remove
or cloud/hadoop rm via hail/utils or subprocess), and if overwrite is false and
tsv_path already exists raise/exit with a clear error; do this adjacent to the
existing call sites _stage_gene_summary, build_cascade_gene_summary, and the
tissue_dir/gene_summary.tsv path to ensure consistent behavior.

In `@hvantk/tables/table_builders.py`:
- Around line 1507-1526: _scan_tissue_files currently uses pathlib.Path and will
fail for cloud URIs (gs://, s3://); replace the Path-based logic with a
cloud-aware filesystem like hailtop.fs (or hailtop.fs.get_fs) similar to
_cleanup_temp_file: detect cloud URIs (e.g., input_path.startswith("gs://") or
"s3://"), use the filesystem client to check existence and file/dir semantics,
list matching files using the fs.glob or fs.ls APIs for each extension, and
build the [(file_path, tissue_name)] list by extracting the filename stem (split
on "/" then split basename on "." to get the tissue prefix) so both local and
cloud paths are supported by _scan_tissue_files.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5922f759-f2df-4dcc-b255-37aec4fefa9a

📥 Commits

Reviewing files that changed from the base of the PR and between 0cbdea5 and 47610c8.

📒 Files selected for processing (15)
  • docs_site/examples/index.md
  • docs_site/tools/qtlcascade.md
  • hvantk/commands/make_table_cli.py
  • hvantk/commands/qtlcascade_cli.py
  • hvantk/ptm/plot.py
  • hvantk/qtlcascade/cascade.py
  • hvantk/qtlcascade/coloc.py
  • hvantk/qtlcascade/constants.py
  • hvantk/qtlcascade/gene_summary.py
  • hvantk/qtlcascade/pipeline.py
  • hvantk/qtlcascade/plot.py
  • hvantk/qtlcascade/report.py
  • hvantk/tables/table_builders.py
  • hvantk/tests/test_cli_mktable_qtl.py
  • hvantk/tests/test_cli_qtlcascade.py
✅ Files skipped from review due to trivial changes (2)
  • docs_site/examples/index.md
  • hvantk/qtlcascade/constants.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • hvantk/qtlcascade/gene_summary.py
  • hvantk/commands/make_table_cli.py
  • hvantk/qtlcascade/cascade.py

Comment thread hvantk/ptm/plot.py
Comment on lines 120 to +139
     for bar, n in zip(bars_p, path_counts):
         if n > 0:
-            ax1.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
-                     str(n), ha="center", va="bottom", fontsize=8)
+            ax1.text(
+                bar.get_x() + bar.get_width() / 2,
+                bar.get_height(),
+                str(n),
+                ha="center",
+                va="bottom",
+                fontsize=8,
+            )
     for bar, n in zip(bars_b, benign_counts):
         if n > 0:
-            ax1.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
-                     str(n), ha="center", va="bottom", fontsize=8)
+            ax1.text(
+                bar.get_x() + bar.get_width() / 2,
+                bar.get_height(),
+                str(n),
+                ha="center",
+                va="bottom",
+                fontsize=8,
+            )

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify the repo's Python baseline supports zip(strict=...) and matches Ruff target.
for f in pyproject.toml setup.cfg setup.py tox.ini ruff.toml .ruff.toml; do
  if [ -f "$f" ]; then
    echo "== $f =="
    rg -n 'requires-python|python_requires|target-version|\[tool\.ruff\]|\[project\]|\[metadata\]' "$f" || true
    echo
  fi
done

Repository: bigbio/hvantk

Length of output: 95


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Read the full content of pyproject.toml to find Python version requirements
echo "=== Content of pyproject.toml ==="
if [ -f "pyproject.toml" ]; then
  cat pyproject.toml
else
  echo "pyproject.toml not found"
fi

echo ""
echo "=== Content of ruff.toml ==="
if [ -f "ruff.toml" ]; then
  cat ruff.toml
else
  echo "ruff.toml not found"
fi

echo ""
echo "=== Checking for Python version in any config files ==="
rg -i "python" --type toml --type cfg --type txt . || echo "No matches found"

Repository: bigbio/hvantk

Length of output: 2839


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify the actual code at lines 120-139 in hvantk/ptm/plot.py
echo "=== Code at lines 120-139 of hvantk/ptm/plot.py ==="
sed -n '115,145p' hvantk/ptm/plot.py | cat -n

echo ""
echo "=== Check if strict= parameter is used anywhere in the codebase ==="
rg "zip\(" hvantk/ -A 1 -B 1

Repository: bigbio/hvantk

Length of output: 7176


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get full context of the hvantk/ptm/plot.py file to identify all zip() calls
wc -l hvantk/ptm/plot.py
echo "---"
# Extract lines with zip() to show context
rg -n "zip\(" hvantk/ptm/plot.py -B 2 -A 2

Repository: bigbio/hvantk

Length of output: 1485


Add explicit strict= to both zip() loops to align with the B905 lint rule.

Lines 120 and 130 trigger Ruff B905. Since the project targets Python 3.10+, strict=True can be used to make the length-matching contract explicit and catch any misalignment early.

🔧 Proposed fix
-    for bar, n in zip(bars_p, path_counts):
+    for bar, n in zip(bars_p, path_counts, strict=True):
         if n > 0:
             ax1.text(
                 bar.get_x() + bar.get_width() / 2,
                 bar.get_height(),
                 str(n),
                 ha="center",
                 va="bottom",
                 fontsize=8,
             )
-    for bar, n in zip(bars_b, benign_counts):
+    for bar, n in zip(bars_b, benign_counts, strict=True):
         if n > 0:
             ax1.text(
                 bar.get_x() + bar.get_width() / 2,
                 bar.get_height(),
                 str(n),
                 ha="center",
                 va="bottom",
                 fontsize=8,
             )
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 120-120: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


[warning] 130-130: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` around lines 120 - 139, Update the two zip calls in
hvantk/ptm/plot.py to use strict matching: replace zip(bars_p, path_counts) with
zip(bars_p, path_counts, strict=True) and zip(bars_b, benign_counts) with
zip(bars_b, benign_counts, strict=True) so the loops that call ax1.text(...) on
bars_p/path_counts and bars_b/benign_counts explicitly enforce equal lengths and
satisfy Ruff B905.

Comment on lines +201 to +202
safe_name = _sanitize(tissue_label)
tissue_dir = str(self._per_tissue_dir / safe_name)

⚠️ Potential issue | 🟠 Major

Sanitized tissue names can overwrite each other's outputs.

_sanitize() is lossy, and run() uses that slug directly for both per_tissue/<slug> and plots/<slug>_*. Distinct tissues like Brain/Cortex and Brain Cortex collapse to the same slug, so the later run overwrites the earlier artifacts and run_collection() can read the same path under multiple tissue labels.

Also applies to: 245-253, 539-543

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/pipeline.py` around lines 201 - 202, The sanitized tissue
slug produced by _sanitize() is lossy and can collide (e.g., "Brain/Cortex" vs
"Brain Cortex"), causing tissue_dir and plot filenames to overwrite; update the
code in run and run_collection to derive a stable, unique slug from tissue_label
by combining the sanitized name with a short deterministic suffix (e.g., first 8
chars of a hash or base64 digest of tissue_label) and use that single
unique_slug for per_tissue paths and plot filenames (replace uses of safe_name
with unique_slug for tissue_dir and any plots/<slug>_* generation, including the
other occurrences around the 245-253 and 539-543 regions) so distinct tissues
never collapse to the same output path.
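One way the suggested fix could look — `unique_slug` is a hypothetical helper, not code from the PR — pairing the sanitized name with a short deterministic hash of the original label:

```python
import hashlib
import re

def unique_slug(label: str) -> str:
    """Sanitized label plus an 8-char hash of the original text.

    The hash suffix keeps labels that sanitize identically
    (e.g. "Brain/Cortex" vs "Brain Cortex") from colliding.
    """
    base = re.sub(r"[^A-Za-z0-9_-]+", "_", label).strip("_")
    digest = hashlib.sha1(label.encode("utf-8")).hexdigest()[:8]
    return f"{base}_{digest}"

a = unique_slug("Brain/Cortex")
b = unique_slug("Brain Cortex")
assert a != b                            # distinct labels stay distinct
assert a == unique_slug("Brain/Cortex")  # stable across calls and runs
```

Computing the slug once per run (e.g. in a label-to-slug map on the pipeline instance) keeps per-tissue directories and plot prefixes consistent.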

Comment on lines +329 to +331
# Export TSV for downstream use
tsv_path = f"{tissue_dir}/gene_summary.tsv"
gene_ht.export(tsv_path)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /tmp && git ls-files | grep -E "pipeline\.py$"

Repository: bigbio/hvantk

Length of output: 125


🏁 Script executed:

cat -n hvantk/qtlcascade/pipeline.py | sed -n '320,340p'

Repository: bigbio/hvantk

Length of output: 1078


🏁 Script executed:

rg -n "\.export\(|\.write\(|\.checkpoint\(" hvantk/qtlcascade/pipeline.py -A 2

Repository: bigbio/hvantk

Length of output: 182


🌐 Web query:

Hail Table.export() API overwrite parameter documentation

💡 Result:

The Hail `Table.export` method does not have an `overwrite` parameter. It exports the table to a text file (typically TSV; a `.bgz` extension is recommended for block-gzipped compression). Unlike `Table.write`, which has an `overwrite` parameter to replace existing files at the destination, `Table.export` lacks this option in the official documentation and source code.

Usage example: `table1.export('output/table1.tsv.bgz')`

Note: `.bgz` files can be read natively with Hail methods and with tools like Python's `gzip.open` or R's `read.table`. Nested structures export as JSON; use `Table.flatten` first for separate fields.


🌐 Web query:

Hail Table.write() overwrite parameter documentation

💡 Result:

The `overwrite` parameter in Hail's `Table.write` method is a boolean option that determines whether to overwrite an existing file at the specified output path.

Signature: `table.write(output, stage_locally=False, overwrite=False)`

  • `overwrite` (bool): if True, overwrite an existing file at the destination. Defaults to False.

Example: `table1.write('output/table1.ht', overwrite=True)`

Warning: do not write or checkpoint to a path that is already an input source for the query, as this can cause data loss. This is from the official Hail documentation for `hail.Table.write`.


🌐 Web query:

Hail Table.checkpoint() overwrite parameter documentation

💡 Result:

The `Table.checkpoint` method in Hail checkpoints the table to disk by writing it to the specified path and then reading it back. It has an `overwrite` parameter of type bool; if `overwrite=True`, it will overwrite an existing file at the destination path. The default is not explicitly stated in the documentation, but usage examples (e.g. `write(..., overwrite=True)`) suggest it defaults to False.

Full parameters:

  • `output` (str): path at which to write.
  • `stage_locally` (bool): if True, major output is written to temporary local storage before being copied to `output`.
  • `overwrite` (bool): if True, overwrite an existing file at the destination.

Warning: do not checkpoint to a path that is already an input source for the query, as this can cause data loss. Returns: `Table`.

Example usage: `table = table.checkpoint('output/table.ht', overwrite=True)`


🏁 Script executed:

cat -n hvantk/qtlcascade/pipeline.py | sed -n '300,335p'

Repository: bigbio/hvantk

Length of output: 1678


🏁 Script executed:

rg -n "def _stage" hvantk/qtlcascade/pipeline.py | head -20

Repository: bigbio/hvantk

Length of output: 284


🏁 Script executed:

cat -n hvantk/qtlcascade/pipeline.py | sed -n '285,335p'

Repository: bigbio/hvantk

Length of output: 2330


gene_ht.export() lacks overwrite handling unlike other writes in this stage.

In _stage_gene_summary(), the call to build_cascade_gene_summary() (line 317) receives overwrite=self.config.overwrite, but gene_ht.export(tsv_path) (line 331) cannot use the same pattern—Hail's Table.export() API does not expose an overwrite parameter, unlike Table.write() and Table.checkpoint() which do. This creates inconsistent behavior on reruns to an existing output directory. Consider an explicit delete or versioned-path strategy to ensure reruns behave as intended.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/pipeline.py` around lines 329 - 331, in
_stage_gene_summary(), gene_ht.export(tsv_path) does not honor
self.config.overwrite, causing inconsistent rerun behavior; modify the code
around the export so that before calling gene_ht.export(tsv_path) you check
self.config.overwrite and if true remove any existing tsv_path (use the
appropriate filesystem removal for your execution environment: local os.remove
or cloud/hadoop rm via hail/utils or subprocess), and if overwrite is false and
tsv_path already exists raise/exit with a clear error; do this adjacent to the
existing call sites _stage_gene_summary, build_cascade_gene_summary, and the
tissue_dir/gene_summary.tsv path to ensure consistent behavior.
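Since `Table.export()` takes no `overwrite` argument, the check has to happen before the call. A local-filesystem sketch — `guard_export_path` is a hypothetical helper; cloud paths would need Hail's hadoop utilities instead of `os`:

```python
import os

def guard_export_path(path: str, overwrite: bool) -> None:
    """Fail fast or clear the destination before Table.export()."""
    if os.path.exists(path):
        if not overwrite:
            raise FileExistsError(
                f"{path} exists; rerun with overwrite=True to replace it"
            )
        os.remove(path)

# usage sketch inside the pipeline stage:
# guard_export_path(tsv_path, overwrite=self.config.overwrite)
# gene_ht.export(tsv_path)
```

This makes the TSV follow the same rerun semantics as the checkpointed Hail Table next to it.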

Comment thread hvantk/qtlcascade/pipeline.py Outdated
Comment on lines +464 to +467
plot_paths = {}
for p in self._plots_dir.glob("*.png"):
plot_paths[p.stem] = str(p)


⚠️ Potential issue | 🟠 Major

Scope report plots to the current run.

Both report builders glob every PNG in the shared plots directory. Reusing output_dir pulls stale images from earlier tissues/runs into the new HTML report, so the report can show plots that were not generated for the current invocation.

Also applies to: 510-513

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/pipeline.py` around lines 464 - 467, The report currently
collects all PNGs from the shared plots directory via
self._plots_dir.glob("*.png"), causing stale images from other runs to appear;
change plot discovery in the plot_paths construction (and the similar block
around lines 510-513) to only include images produced by the current run—e.g.,
restrict the glob or filter by the current run identifier/output directory used
when saving plots (use self.output_dir or self.run_id / run_name as a prefix or
parent directory) so plot_paths only contains files belonging to this
invocation.
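One shape the track-writes-instead-of-glob idea could take — the field and helper names here are assumptions, not the PR's actual `CascadeResult` API:

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class CascadeResult:
    tissue: str
    plot_paths: dict[str, str] = field(default_factory=dict)

def record_plot(result: CascadeResult, path: Path) -> None:
    """Register a plot written during this run; the report reads only these."""
    result.plot_paths[path.stem] = str(path)

result = CascadeResult(tissue="Brain_Cortex")
record_plot(result, Path("plots/Brain_Cortex_cascade.png"))
# A stale file on disk (say plots/old_run.png) never enters the report,
# because the report iterates result.plot_paths rather than globbing.
print(sorted(result.plot_paths))  # ['Brain_Cortex_cascade']
```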

Comment on lines +1507 to +1526
def _scan_tissue_files(input_path, extensions):
"""Return ``[(file_path, tissue_name), ...]`` from *input_path*.

Tissue name is inferred from the filename prefix before the first dot
(e.g. ``Brain_Cortex.v8.signif_variant_gene_pairs.txt.gz`` → ``Brain_Cortex``).
Accepts a single file or a directory.
"""
from pathlib import Path

p = Path(input_path)
if p.is_file():
return [(str(p), p.stem.split(".")[0])]
if not p.is_dir():
raise FileNotFoundError(f"Not a file or directory: {input_path}")
matches = []
for ext in extensions:
matches.extend(sorted(p.glob(f"*{ext}")))
if not matches:
raise FileNotFoundError(f"No files matching {extensions} in {input_path}")
return [(str(f), f.stem.split(".")[0]) for f in matches]

⚠️ Potential issue | 🟠 Major

_scan_tissue_files won't work with cloud URIs (gs://, s3://).

This helper uses pathlib.Path which only handles local filesystem paths. The eQTL/pQTL builders accept input_path that could be cloud URIs, but Path.is_file(), Path.is_dir(), and Path.glob() will fail or return incorrect results for such paths.

Consider using Hail's hailtop.fs or similar cloud-aware filesystem utilities, similar to how _cleanup_temp_file handles this at lines 117-144.

🛠️ Suggested approach
 def _scan_tissue_files(input_path, extensions):
-    from pathlib import Path
+    try:
+        import hailtop.fs as hfs
+    except ImportError:
+        hfs = None
 
-    p = Path(input_path)
-    if p.is_file():
-        return [(str(p), p.stem.split(".")[0])]
-    if not p.is_dir():
-        raise FileNotFoundError(f"Not a file or directory: {input_path}")
+    # Check if cloud URI
+    if input_path.startswith(("gs://", "s3://", "hdfs://")):
+        if hfs is None:
+            raise ImportError("hailtop.fs required for cloud URIs")
+        # Use hailtop.fs for cloud paths
+        if hfs.is_file(input_path):
+            name = input_path.rsplit("/", 1)[-1].split(".")[0]
+            return [(input_path, name)]
+        # ... cloud directory listing logic
+    else:
+        from pathlib import Path
+        p = Path(input_path)
+        # existing local logic...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/table_builders.py` around lines 1507 - 1526, _scan_tissue_files
currently uses pathlib.Path and will fail for cloud URIs (gs://, s3://); replace
the Path-based logic with a cloud-aware filesystem like hailtop.fs (or
hailtop.fs.get_fs) similar to _cleanup_temp_file: detect cloud URIs (e.g.,
input_path.startswith("gs://") or "s3://"), use the filesystem client to check
existence and file/dir semantics, list matching files using the fs.glob or fs.ls
APIs for each extension, and build the [(file_path, tissue_name)] list by
extracting the filename stem (split on "/" then split basename on "." to get the
tissue prefix) so both local and cloud paths are supported by
_scan_tissue_files.
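The tissue-name convention itself does not need `pathlib` and can be made URI-agnostic with plain string operations — `tissue_name_from_path` is a hypothetical helper illustrating that; listing the contents of a cloud directory would still need `hailtop.fs`:

```python
def tissue_name_from_path(path: str) -> str:
    """Basename prefix before the first dot, for local paths and cloud URIs."""
    basename = path.rstrip("/").rsplit("/", 1)[-1]
    return basename.split(".", 1)[0]

assert tissue_name_from_path(
    "Brain_Cortex.v8.signif_variant_gene_pairs.txt.gz"
) == "Brain_Cortex"
assert tissue_name_from_path("gs://bucket/eqtl/Liver.v8.txt.gz") == "Liver"
assert tissue_name_from_path("/data/pqtl/Plasma.tsv") == "Plasma"
```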

@enriquea enriquea assigned enriquea and unassigned enriquea Mar 28, 2026
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
hvantk/qtlcascade/pipeline.py (3)

468-472: ⚠️ Potential issue | 🟠 Major

Reports can still pick up stale plots from earlier runs.

Both paths rebuild plot_paths by globbing a shared directory instead of using the files generated by this invocation. Reusing output_dir can leak old <slug>_*.png files or a stale cross_tissue_heatmap.png into the new HTML. Tracking successful plot writes on CascadeResult would keep the report scoped to the current run.

Also applies to: 519-522

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/pipeline.py` around lines 468 - 472, The current code
globbing self._plots_dir (using _sanitize(result.tissue) and building plot_paths
from files matching "{safe}_*.png") can include stale images from previous runs;
instead, record the exact plot files written during this invocation on the
CascadeResult (e.g., add/append to a result.plots list or result.plot_paths when
each plot is written) and use that list to build plot_paths; similarly, stop
globbing for "cross_tissue_heatmap.png" and read it only if it was recorded on
the CascadeResult for this run. Update the plot-writing functions to append
successful writes to the CascadeResult (unique symbols: CascadeResult,
result.plots or result.plot_paths, self._plots_dir, _sanitize,
cross_tissue_heatmap.png) and replace the glob-based construction with iteration
over the recorded list.

201-202: ⚠️ Potential issue | 🟠 Major

Make the tissue slug collision-resistant.

_sanitize() is lossy, so distinct labels like Brain/Cortex and Brain Cortex still collapse to the same slug. Since that value drives per_tissue directories and plot prefixes, one tissue can overwrite another's artifacts. Generate a unique slug once per run and reuse it everywhere.

Also applies to: 549-553

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/pipeline.py` around lines 201 - 202, The current use of
_sanitize(tissue_label) is lossy and can collapse distinct labels (e.g.
"Brain/Cortex" vs "Brain Cortex"); replace ad-hoc calls with a single-run
mapping that generates a collision-resistant slug per tissue_label (e.g.
sanitized base plus a short deterministic hash or UUID derived from the original
tissue_label) and reuse that slug wherever safe_name or tissue_dir and plot
prefixes are produced (locations using _sanitize, including the code that sets
safe_name and tissue_dir and the later usages around the plot prefix code).
Implement a map on the pipeline instance (e.g. self._tissue_slug_map) populated
once when tissues are enumerated and then reference
self._tissue_slug_map[tissue_label] instead of calling _sanitize directly.
Ensure the slug remains stable for the duration of the run and is used for
per-tissue directories and plot prefixes.

329-331: ⚠️ Potential issue | 🟠 Major

gene_summary.tsv still ignores overwrite.

The HT checkpoint honors self.config.overwrite, but gene_ht.export(tsv_path) does not. A rerun into the same tissue directory can therefore succeed on gene_summary.ht and then fail when gene_summary.tsv already exists. Handle the destination explicitly before calling export().

Does Hail `Table.export()` support an `overwrite` parameter, and what is the documented behavior if the destination path already exists?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/qtlcascade/pipeline.py` around lines 329 - 331, The export step
ignores self.config.overwrite: before calling gene_ht.export(tsv_path) check
whether the destination exists and honor self.config.overwrite — if overwrite is
true, remove the existing tsv (and any related compressed variant) using the
appropriate filesystem helper (Hail's hadoop utilities for remote paths or
os.unlink for local paths), and if overwrite is false and the file exists, raise
or fail early with a clear message; then call gene_ht.export(tsv_path).
Reference gene_ht, export(), tsv_path, tissue_dir and self.config.overwrite to
locate where to add the existence check and deletion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@hvantk/qtlcascade/pipeline.py`:
- Around line 468-472: The current code globbing self._plots_dir (using
_sanitize(result.tissue) and building plot_paths from files matching
"{safe}_*.png") can include stale images from previous runs; instead, record the
exact plot files written during this invocation on the CascadeResult (e.g.,
add/append to a result.plots list or result.plot_paths when each plot is
written) and use that list to build plot_paths; similarly, stop globbing for
"cross_tissue_heatmap.png" and read it only if it was recorded on the
CascadeResult for this run. Update the plot-writing functions to append
successful writes to the CascadeResult (unique symbols: CascadeResult,
result.plots or result.plot_paths, self._plots_dir, _sanitize,
cross_tissue_heatmap.png) and replace the glob-based construction with iteration
over the recorded list.
- Around line 201-202: The current use of _sanitize(tissue_label) is lossy and
can collapse distinct labels (e.g. "Brain/Cortex" vs "Brain Cortex"); replace
ad-hoc calls with a single-run mapping that generates a collision-resistant slug
per tissue_label (e.g. sanitized base plus a short deterministic hash or UUID
derived from the original tissue_label) and reuse that slug wherever safe_name
or tissue_dir and plot prefixes are produced (locations using _sanitize,
including the code that sets safe_name and tissue_dir and the later usages
around the plot prefix code). Implement a map on the pipeline instance (e.g.
self._tissue_slug_map) populated once when tissues are enumerated and then
reference self._tissue_slug_map[tissue_label] instead of calling _sanitize
directly. Ensure the slug remains stable for the duration of the run and is used
for per-tissue directories and plot prefixes.
- Around line 329-331: The export step ignores self.config.overwrite: before
calling gene_ht.export(tsv_path) check whether the destination exists and honor
self.config.overwrite — if overwrite is true, remove the existing tsv (and any
related compressed variant) using the appropriate filesystem helper (Hail's
hadoop utilities for remote paths or os.unlink for local paths), and if
overwrite is false and the file exists, raise or fail early with a clear
message; then call gene_ht.export(tsv_path). Reference gene_ht, export(),
tsv_path, tissue_dir and self.config.overwrite to locate where to add the
existence check and deletion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe20b8a5-776a-43cc-9ad6-8059eb5a192a

📥 Commits

Reviewing files that changed from the base of the PR and between 47610c8 and a8f172e.

📒 Files selected for processing (1)
  • hvantk/qtlcascade/pipeline.py

@enriquea

Hey @copilot
Check this: Codacy Static Code Analysis
Fix only the issues related to this CI/CD.

…d fix test module mocking

Agent-Logs-Url: https://github.com/bigbio/hvantk/sessions/7c166da9-3a7e-46e0-a9de-77c6e59835b8

Co-authored-by: enriquea <8707954+enriquea@users.noreply.github.com>

Copilot AI commented Mar 28, 2026

Hey @copilot
Check this: Codacy Static Code Analysis
Fix only the issues related to this CI/CD.

Fixed in commit 36ad0e6. The Codacy issues were:

  1. Unused imports (F401): Removed DEFAULT_COLOC_H4_THRESHOLD from hvantk/qtlcascade/coloc.py, and CLINGEN_HEADER_SKIP_LINES + local tempfile/os imports from hvantk/tables/table_builders.py.

  2. Failing tests (test_cascade_cmd, test_coloc_cmd): The patch("hvantk.core.hail_context.init_hail") approach failed when hail is not installed in the test environment. Fixed by using patch.dict(sys.modules, {"hail": MagicMock(), "hvantk.core.hail_context": _MOCK_HAIL_CONTEXT}) to mock the module before patching.
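A condensed sketch of that mocking pattern (module names mirror the description above; the actual test code may differ):

```python
import sys
from unittest.mock import MagicMock, patch

mock_hail = MagicMock()
before = sys.modules.get("hail")

# Injecting fakes into sys.modules makes `import hail` succeed even
# when the real package is absent; patch.dict restores state on exit.
with patch.dict(sys.modules, {"hail": mock_hail,
                              "hvantk.core.hail_context": MagicMock()}):
    import hail  # resolves to mock_hail from sys.modules
    assert hail is mock_hail

assert sys.modules.get("hail") is before  # original state restored
```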

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • app.codacy.com
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@enriquea enriquea merged commit 57f2916 into main Mar 28, 2026
5 of 6 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 28, 2026
6 tasks

Labels

  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request


3 participants