Robust tRNA modification discovery from Nanopore direct tRNA sequencing
If you use QutRNA2, please cite: https://www.biorxiv.org/content/10.1101/2025.10.20.683443v1
- Quick Start
- New Features
- Requirements
- Installation
- Setup QutRNA2 Analysis
- Setup Data Configuration
- Setup Analysis Configuration
- Examples
- Execute Workflow
- Results
This quick start guide assumes you have conda installed and a CUDA-capable GPU available. See Installation for full details and a container solution using Singularity.
1. Clone the repository:
git clone https://github.com/dieterich-lab/QutRNA2
cd QutRNA22. Install and activate the environment:
conda env create -f conda.yaml -n qutrna2
conda activate qutrna23. Copy and edit the example config files:
cp examples/analysis/map_with_gpu.yaml my_analysis.yaml
cp examples/data/sprinzl_cm.yaml my_data.yamlEdit my_data.yaml to point to your reference FASTA examples, sample description TSV (labels and filepaths to basecalled samples) example, and Sprinzl coordinate labels example for eukaryotic cytosolic tRNA.
Edit my_analysis.yaml to set paths for JACUSA2 and gpu-tRNA-mapper, and any GPU init commands.
4. Run a dry run to verify the workflow:
snakemake \
-c 1 \
--snakefile <QUTRNA2_LOCAL_DIR>/workflow/Snakefile \
--use-conda \
--configfiles my_analysis.yaml \
--config pepfile=my_data.yaml \
--directory <ANALYSIS_OUTPUT> \
-nA list of jobs to be executed should appear. If it does without errors, remove -n and increase -c to the number of available cores to start the analysis.
QutRNA2 features the novel GPU-assisted gpu-tRNA-mapper that performs up to 25x faster than the previously used mapper parasail for the same task. Furthermore, a new, improved version of JACUSA v2.1.16 is included, featuring subsampled scores that improve the signal-to-noise ratio for identifying tRNA modifications.
Finally, a filter framework has been added to the analysis workflow to remove spurious alignments by applying the following filters:
- Filter Random alignments
- Filter Adapter overlap (with 5' and 3' splint adapters)
- Filter multimapping reads
We added the following plots to assess the impact of filtering:
- Alignment threshold summary plot
- Impact of filters on read length
- Impact of filters on the number of reads
More customization options for heatmap plots:
- Filter tRNAs by min. number of reads
- Display or ignore specific tRNAs by regular expression
- Mark positions of interest
- Use patterns to customize the title of heatmap plots
To use GPU-assisted mapping, you need a compatible NVIDIA GPU. For details, check Hardware requirements. In brief, a CUDA-capable GPU with Volta architecture or newer is recommended.
If no compatible GPU is present, QutRNA2 can be used with parasail but will run significantly slower. See below.
We provide a conda file with all necessary packages. Clone the repository and install the requirements with conda.
Go to your desired <QUTRNA2-LOCAL-DIR> and clone the repository:
cd <QUTRNA2-LOCAL-DIR>
git clone https://github.com/dieterich-lab/QutRNA2Next, install all the requirements:
cd QutRNA2
conda env create -f conda.yaml -n qutrna2Finally, activate the environment:
conda activate qutrna2If Singularity/Apptainer is not already available on your system, see the SingularityCE or Apptainer installation guides.
The Zenodo link in our manuscript contains a pre-built container image. You can download it, set up your config files, and run it as instructed here.
A Singularity definition file is provided at singularity/qutrna2.def to build a portable container image as follows:
singularity build qutrna2.sif singularity/qutrna2.defNote: building a Singularity image typically requires root or
--fakerootprivileges (e.g.singularity build --fakeroot qutrna2.sif singularity/qutrna2.def) If building on an HPC cluster, check whether--fakerootis supported.
QutRNA2 uses YAML files to define the data (data.yaml) and parametrize the analysis (analysis.yaml). Finally, a TSV file provides the sample description.
In summary, the sample description <SAMPLE_DESC> must be a TAB-separated file and contain the following columns:
| condition | sample_name | subsample_name | base_calling | fastq|bam |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
See the files in the QutRNA2/examples folder for documented YAML and toy examples for sample tables. Note that the column base_calling is a legacy field and should be set to pass for all rows.
QutRNA2 distinguishes the configuration of the analysis and the data. The following analysis types are supported:
- map reads with gpu-tRNA-mapper (see
QutRNA2/examples/analysis/map_with_gpu.yaml), - map reads with parasail (see
QutRNA2/examples/analysis/map_with_parasail.yaml), and - use exisiting mapping (see
QutRNA2/examples/analysis/existing_mapping.yaml.
QutRNA2 supports the following approaches to assign Sprinzl coordinates and the configuration of data input differs based on it:
- using a covarince model and secondary structure alignment (see
QutRNA2/examples/data/sprinzl_cm.yaml), - using an existing aligned FASTA file (see
QutRNA2/examples/data/sprinzl_afasta.yaml), or - a direct mapping of sequence to Sprinzl coordinates (see
QutRNA2/examples/data/seq_to_sprinzl.yaml)).
Those files are templates and must be adjusted to the user's needs.
First, define your <SAMPLE_DESC>. This file holds sample-specific information, such as "condition", "sample_name", "subsample", and "fastq" or "bam" - they directly correspond to columns - see examples/sample_desc_fastq.tsv.
Data for entries with the same "sample_name" will be merged - they represent technical replicates. For historical reasons, the column "base_calling" is present. Set it to "pass". Finally, the column "fastq" should point to the path of the compressed (gzip) fastq file.
Second, define your <DATA_YAML>. This file describes what reference and Sprinzl coordinates (if any) to use. See examples/data/*.yaml. Make sure to add your <SAMPLE_DESC>. Provide "ref_fasta" and define what Sprinzl coordinates to use and the size of the adapters used! Correct adapter lengths are essential!
For eukaryotic nuclear tRNAs, we use the following covariance model TRNAinf-euk.cm and labeling data/nuclear-euk-masked.txt.
For human mt-tRNAs, we use the sequence to Sprinzl mapping in https://www.nature.com/articles/s41467-020-18068-6
and deposited the data along with the Sprinzl labels to: data/human_mt_seq_to_sprinzl.tsv and data/human_mt_sprinzl_labels.txt.
It is crucial to obtain covariance models for the organism and tRNAs studied. These models can be acquired, for example, from https://github.com/UCSC-LoweLab/tRNAscan-SE/tree/master/lib/models.
Finally, define <ANALYSIS_YAML>. Here, the workflow is manipulated, and custom plots are defined. Check examples/analysis/*.yaml for examples. For the recommended GPU run, use examples/analysis//map_with_gpu.yaml as your template. Use examples/analysis/map_with_parasail.yaml instead as a template if you don't use GPU and would like to use parasail, but expect significantly longer runtimes.
Add any necessary init code for the GPU and provide paths for JACUSA2 and gpu-tRNA-mapper if they are not in the standard path.
If not done yet, activate qutrna2 conda environment:
conda activate qutrna2Use <ANALYSIS_OUTPUT> folder to define where QutRNA2 should write the output to:
snakemake \
-c 1 \
--snakefile <QUTRNA2_LOCAL_DIR>/workflow/Snakefile \
--use-conda \
--configfiles <ANALYSIS_YAML> \
--config pepfile=<DATA_YAML> \
--directory <ANALYSIS_OUTPUT> \
-nThe -n is to do a dry run of the pipeline. You should see a list of necessary jobs to be run and hopefully no errors.
You should increase -c 1 to whatever suits your computing machine.
if you have no errors and a clean dry run, you can start the analysis (remove -n):
snakemake \
-c 1 \
--snakefile <QUTRNA2_LOCAL_DIR>/workflow/Snakefile \
--use-conda \
--configfiles <ANALYSIS_YAML> \
--config pepfile=<DATA_YAML> \
--directory <ANALYSIS_OUTPUT>The Snakemake command is now the same as above but without --use-conda, and it is called as part of a Singularity command:
singularity run \
--nv \
--bind <HOST_DATA_DIR>:<HOST_DATA_DIR> \
<PATH_TO_SIF>/qutrna2.sif \
snakemake \
--cores <N_CORES> \
--snakefile <QUTRNA2_LOCAL_DIR>/workflow/Snakefile \
--configfiles <ANALYSIS_YAML> \
--config pepfile=<DATA_YAML> \
--directory <ANALYSIS_OUTPUT>singularity run --nvenables GPU access inside the container (required for gpu-tRNA-mapper).singularity run --bindmounts a host directory into the container so input/output paths remain accessible. Bind all directories referenced in your YAML configs or a parent directory.snakemake --executor localis recommended when running inside a SLURM job (i.e. resource allocation is already handled by SLURM). Set--coresto the number of cores allocated to your job (e.g.$SLURM_CPUS_ON_NODE).
When the analysis is finished, the <ANALYSIS_OUTPUT> directory will contain the following subdirectories:
"data", "info", "logs", and "results".
<ANALYSIS_OUTPUT>/data/ will contain all unprocessed data used in the analysis.
<ANALYSIS_OUTPUT>/info/ will contain runtime information and the configuration files used to track parameters.
<ANALYSIS_OUTPUT>/logs/ will contain logs for executed jobs.
<ANALYSIS_OUTPUT>/results/ will contain all calculations.
The directory <ANALYSIS_OUTPUT>/results/data will contain processed instances of the reference sequence.
Alignments are stored in <ANALYSIS_OUTPUT>/results/bam/<read-type>/.... <read-type> corresponds to mapped, filtered, and final reads. The BAMs in the subdirectory "final" are used to calculate JACUSA2 score profiles.
Each subdirectory is organised according to "sample_name", "subsample_name", and "base_calling" columns from <SAMPLE_DESC>.
If a covariance model was provided in <DATA_YAML>, the secondary structure alignment under <ANALYSIS_OUTPUT>/results/cmalign/align.stk will be generated.
The directory <ANALYSIS_OUTPUT>/results/jacusa2 will contain JACUSA2 results for defined contrasts.
The directory <ANALYSIS_OUTPUT>/results/plots will contain plots. These are heatmaps of the JACUSA2 scores across tRNA positions or all tRNAs that the reads were mapped to and retained after filtering.
Check <ANALYSIS_OUTPUT>/results/plots/scores/cond1~{cond1}/cond2~{cond1}/{id}/bam~final/heatmap.pdf. This plot concludes the analysis.
If filters were applied, the directory <ANALYSIS_OUTPUT>/results/stats will contain summary statistics for features such as alignment score, read length, and read count.
<ANALYSIS_OUTPUT>/results/seq_to_sprinzl_filtered.tsv will contain the sequence to sprinzl mapping.