Skip to content

dieterich-lab/QutRNA2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

QutRNA2

Robust tRNA modification discovery from Nanopore direct tRNA sequencing

If you use QutRNA2, please cite: https://www.biorxiv.org/content/10.1101/2025.10.20.683443v1

Table of Contents

Quick Start

This quick start guide assumes you have conda installed and a CUDA-capable GPU available. See Installation for full details and a container solution using Singularity.

1. Clone the repository:

git clone https://github.com/dieterich-lab/QutRNA2
cd QutRNA2

2. Install and activate the environment:

conda env create -f conda.yaml -n qutrna2
conda activate qutrna2

3. Copy and edit the example config files:

cp examples/analysis/map_with_gpu.yaml my_analysis.yaml
cp examples/data/sprinzl_cm.yaml my_data.yaml

Edit my_data.yaml to point to your reference FASTA examples, sample description TSV (labels and filepaths to basecalled samples) example, and Sprinzl coordinate labels example for eukaryotic cytosolic tRNA. Edit my_analysis.yaml to set paths for JACUSA2 and gpu-tRNA-mapper, and any GPU init commands.

4. Run a dry run to verify the workflow:

snakemake \
    -c 1 \
    --snakefile <QUTRNA2_LOCAL_DIR>/workflow/Snakefile \
    --use-conda \
    --configfiles my_analysis.yaml \
    --config pepfile=my_data.yaml \
    --directory <ANALYSIS_OUTPUT> \
    -n

A list of jobs to be executed should appear. If it does without errors, remove -n and increase -c to the number of available cores to start the analysis.

New Features

QutRNA2 features the novel GPU-assisted gpu-tRNA-mapper that performs up to 25x faster than the previously used mapper parasail for the same task. Furthermore, a new, improved version of JACUSA v2.1.16 is included, featuring subsampled scores that improve the signal-to-noise ratio for identifying tRNA modifications.

Finally, a filter framework has been added to the analysis workflow to remove spurious alignments by applying the following filters:

  • Filter Random alignments
  • Filter Adapter overlap (with 5' and 3' splint adapters)
  • Filter multimapping reads

We added the following plots to assess the impact of filtering:

  • Alignment threshold summary plot
  • Impact of filters on read length
  • Impact of filters on the number of reads

More customization options for heatmap plots:

  • Filter tRNAs by min. number of reads
  • Display or ignore specific tRNAs by regular expression
  • Mark positions of interest
  • Use patterns to customize the title of heatmap plots

Requirements

To use GPU-assisted mapping, you need a compatible NVIDIA GPU. For details, check Hardware requirements. In brief, a CUDA-capable GPU with Volta architecture or newer is recommended.

If no compatible GPU is present, QutRNA2 can be used with parasail but will run significantly slower. See below.

Installation

Environment setup using Conda

We provide a conda file with all necessary packages. Clone the repository and install the requirements with conda.

Go to your desired <QUTRNA2-LOCAL-DIR> and clone the repository:

cd <QUTRNA2-LOCAL-DIR>
git clone https://github.com/dieterich-lab/QutRNA2

Next, install all the requirements:

cd QutRNA2
conda env create -f conda.yaml -n qutrna2

Finally, activate the environment:

conda activate qutrna2

Using Singularity container

If Singularity/Apptainer is not already available on your system, see the SingularityCE or Apptainer installation guides.

Use the pre-built container image

The Zenodo link in our manuscript contains a pre-built container image. You can download it, set up your config files, and run it as instructed here.

Build your own container image

A Singularity definition file is provided at singularity/qutrna2.def to build a portable container image as follows:

singularity build qutrna2.sif singularity/qutrna2.def

Note: building a Singularity image typically requires root or --fakeroot privileges (e.g. singularity build --fakeroot qutrna2.sif singularity/qutrna2.def) If building on an HPC cluster, check whether --fakeroot is supported.

Setup QutRNA2 analysis

QutRNA2 uses YAML files to define the data (data.yaml) and parametrize the analysis (analysis.yaml). Finally, a TSV file provides the sample description.

In summary, the sample description <SAMPLE_DESC> must be a TAB-separated file and contain the following columns:

condition sample_name subsample_name base_calling fastq|bam
... ... ... ... ...

See the files in the QutRNA2/examples folder for documented YAML and toy examples for sample tables. Note that the column base_calling is a legacy field and should be set to pass for all rows. QutRNA2 distinguishes the configuration of the analysis and the data. The following analysis types are supported:

  • map reads with gpu-tRNA-mapper (see QutRNA2/examples/analysis/map_with_gpu.yaml),
  • map reads with parasail (see QutRNA2/examples/analysis/map_with_parasail.yaml), and
  • use exisiting mapping (see QutRNA2/examples/analysis/existing_mapping.yaml.

QutRNA2 supports the following approaches to assign Sprinzl coordinates and the configuration of data input differs based on it:

  • using a covarince model and secondary structure alignment (see QutRNA2/examples/data/sprinzl_cm.yaml),
  • using an existing aligned FASTA file (see QutRNA2/examples/data/sprinzl_afasta.yaml), or
  • a direct mapping of sequence to Sprinzl coordinates (see QutRNA2/examples/data/seq_to_sprinzl.yaml)).

Those files are templates and must be adjusted to the user's needs.

Setup data configuration

First, define your <SAMPLE_DESC>. This file holds sample-specific information, such as "condition", "sample_name", "subsample", and "fastq" or "bam" - they directly correspond to columns - see examples/sample_desc_fastq.tsv. Data for entries with the same "sample_name" will be merged - they represent technical replicates. For historical reasons, the column "base_calling" is present. Set it to "pass". Finally, the column "fastq" should point to the path of the compressed (gzip) fastq file.

Second, define your <DATA_YAML>. This file describes what reference and Sprinzl coordinates (if any) to use. See examples/data/*.yaml. Make sure to add your <SAMPLE_DESC>. Provide "ref_fasta" and define what Sprinzl coordinates to use and the size of the adapters used! Correct adapter lengths are essential!

Sprinzl

For eukaryotic nuclear tRNAs, we use the following covariance model TRNAinf-euk.cm and labeling data/nuclear-euk-masked.txt.

For human mt-tRNAs, we use the sequence to Sprinzl mapping in https://www.nature.com/articles/s41467-020-18068-6 and deposited the data along with the Sprinzl labels to: data/human_mt_seq_to_sprinzl.tsv and data/human_mt_sprinzl_labels.txt.

It is crucial to obtain covariance models for the organism and tRNAs studied. These models can be acquired, for example, from https://github.com/UCSC-LoweLab/tRNAscan-SE/tree/master/lib/models.

Setup analysis configuration

Finally, define <ANALYSIS_YAML>. Here, the workflow is manipulated, and custom plots are defined. Check examples/analysis/*.yaml for examples. For the recommended GPU run, use examples/analysis//map_with_gpu.yaml as your template. Use examples/analysis/map_with_parasail.yaml instead as a template if you don't use GPU and would like to use parasail, but expect significantly longer runtimes.

Add any necessary init code for the GPU and provide paths for JACUSA2 and gpu-tRNA-mapper if they are not in the standard path.

Examples

Execute workflow

Run using the Conda environment

If not done yet, activate qutrna2 conda environment:

conda activate qutrna2

Use <ANALYSIS_OUTPUT> folder to define where QutRNA2 should write the output to:

snakemake \
    -c 1 \
    --snakefile <QUTRNA2_LOCAL_DIR>/workflow/Snakefile \
    --use-conda \
    --configfiles <ANALYSIS_YAML> \
        --config pepfile=<DATA_YAML> \
        --directory <ANALYSIS_OUTPUT> \
    -n

The -n is to do a dry run of the pipeline. You should see a list of necessary jobs to be run and hopefully no errors. You should increase -c 1 to whatever suits your computing machine.

if you have no errors and a clean dry run, you can start the analysis (remove -n):

snakemake \
    -c 1 \
    --snakefile <QUTRNA2_LOCAL_DIR>/workflow/Snakefile \
    --use-conda \
    --configfiles <ANALYSIS_YAML> \
        --config pepfile=<DATA_YAML> \
        --directory <ANALYSIS_OUTPUT>

Run using the container

The Snakemake command is now the same as above but without --use-conda, and it is called as part of a Singularity command:

singularity run \
  --nv \
  --bind <HOST_DATA_DIR>:<HOST_DATA_DIR> \
  <PATH_TO_SIF>/qutrna2.sif \
  snakemake \
    --cores <N_CORES> \
    --snakefile <QUTRNA2_LOCAL_DIR>/workflow/Snakefile \
    --configfiles <ANALYSIS_YAML> \
        --config pepfile=<DATA_YAML> \
        --directory <ANALYSIS_OUTPUT>
  • singularity run --nv enables GPU access inside the container (required for gpu-tRNA-mapper).
  • singularity run --bind mounts a host directory into the container so input/output paths remain accessible. Bind all directories referenced in your YAML configs or a parent directory.
  • snakemake --executor local is recommended when running inside a SLURM job (i.e. resource allocation is already handled by SLURM). Set --cores to the number of cores allocated to your job (e.g. $SLURM_CPUS_ON_NODE).

Results

When the analysis is finished, the <ANALYSIS_OUTPUT> directory will contain the following subdirectories: "data", "info", "logs", and "results".

<ANALYSIS_OUTPUT>/data/ will contain all unprocessed data used in the analysis. <ANALYSIS_OUTPUT>/info/ will contain runtime information and the configuration files used to track parameters. <ANALYSIS_OUTPUT>/logs/ will contain logs for executed jobs. <ANALYSIS_OUTPUT>/results/ will contain all calculations.

data

The directory <ANALYSIS_OUTPUT>/results/data will contain processed instances of the reference sequence.

Alignments

Alignments are stored in <ANALYSIS_OUTPUT>/results/bam/<read-type>/.... <read-type> corresponds to mapped, filtered, and final reads. The BAMs in the subdirectory "final" are used to calculate JACUSA2 score profiles. Each subdirectory is organised according to "sample_name", "subsample_name", and "base_calling" columns from <SAMPLE_DESC>.

cmalign

If a covariance model was provided in <DATA_YAML>, the secondary structure alignment under <ANALYSIS_OUTPUT>/results/cmalign/align.stk will be generated.

jacusa2

The directory <ANALYSIS_OUTPUT>/results/jacusa2 will contain JACUSA2 results for defined contrasts.

Plots

The directory <ANALYSIS_OUTPUT>/results/plots will contain plots. These are heatmaps of the JACUSA2 scores across tRNA positions or all tRNAs that the reads were mapped to and retained after filtering.

Check <ANALYSIS_OUTPUT>/results/plots/scores/cond1~{cond1}/cond2~{cond1}/{id}/bam~final/heatmap.pdf. This plot concludes the analysis.

stats

If filters were applied, the directory <ANALYSIS_OUTPUT>/results/stats will contain summary statistics for features such as alignment score, read length, and read count.

secondary structure (ss)

<ANALYSIS_OUTPUT>/results/seq_to_sprinzl_filtered.tsv will contain the sequence to sprinzl mapping.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors