This repository was created as part of the course "053531 PR Softwareentwicklungsprojekt Bioinformatik (2022W)" in the winter semester 2022/2023 at the University of Vienna. The project was supervised by Mag. Stefan Badelt, PhD and Univ.-Prof. Dipl.-Phys. Dr. Ivo Hofacker.
The project focused on testing the deep-learning-based prediction tool UFold in an unbiased way. For this purpose, the repository also contains files for sequence design, which create artificial data in which the bias can be controlled.
UFold is used for predicting the secondary structure of RNAs. Some files within UFold were slightly changed and expanded, but most were left unchanged. The files sequence_design.py and sequence_design.ipynb were created for designing new sequences from a target structure. The scripts create_file.py, ml_forensic.py, and predict.py were written to create data and test the performance of UFold models.
- python >= 3.6.6
- torch >= 1.4 with cudnn >= 10.0
- matplotlib
- numpy
- pandas
- tqdm
The UFold folder contains the deep learning software UFold. To use the files predict.py, create_file.py and ml_forensic.py, they must be located within this folder.
The script create_file.py contains the functions:
- random_sequence
  - creates one random sequence
  - Example Usage:
    random_sequence(length=100)
- generate_random_seq_and_ss
  - creates a random sequence and the corresponding structure (using ViennaRNA) for each length in a list of lengths
  - Example Usage:
    generate_random_seq_and_ss(lengths=[70, 100, 120])
- sample2bpseq
  - writes a bpseq file for the given sequence and structure to the given path
  - Example Usage:
    sample2bpseq(seq='GCCGUCGCGU', ss='((....))..', path='example_bpseq.txt')
- random_bpseq
  - creates a folder of bpseq files for a given number of random sequences of a given length; the purpose, the seed and the output folder can also be specified
  - Example Usage:
    random_bpseq(N_seqs = 1000, n_seq = 100, purpose = "train")
- random_ml_forensic
  - creates random sequences with a given length and saves them in numpy files (needed for the analysis done in ml_forensic.py)
  - Example Usage:
    random_ml_forensic(N_seqs = 1000, n_seq = 100, output_folder = "ml_forensic/N1000_n100", seed=42)
- fa2npy
  - converts a given fasta file to a numpy file
  - Example Usage:
    fa2npy(fa_file = "example_fasta.fa", output_path = "example_numpy.npy")
- pickle2fa
  - converts a given pickle file to a fasta file
  - Example Usage:
    pickle2fa(pickle_file = "example_pickle.pickle", fa_file = "example_fasta.fa")
The functions can be executed within the python script.
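To make the bpseq output of sample2bpseq more concrete, here is a minimal, self-contained sketch. It is not the code from create_file.py: the helper names `random_rna` and `to_bpseq_lines` are hypothetical, the structure is taken as given instead of being folded with ViennaRNA, and the sketch assumes the common three-column bpseq layout (1-based position, base, 1-based pairing partner, with 0 for unpaired positions).

```python
import random


def random_rna(length, seed=None):
    """Return a uniformly random RNA sequence over A, C, G, U (hypothetical helper)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGU") for _ in range(length))


def to_bpseq_lines(seq, ss):
    """Convert a sequence and a dot-bracket structure to bpseq lines (hypothetical helper)."""
    if len(seq) != len(ss):
        raise ValueError("sequence and structure must have the same length")
    partner = [0] * len(seq)  # 0 marks an unpaired position
    stack = []  # indices of opening brackets that are still unmatched
    for i, char in enumerate(ss):
        if char == "(":
            stack.append(i)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            # bpseq positions are 1-based
            partner[i], partner[j] = j + 1, i + 1
    if stack:
        raise ValueError("unbalanced structure")
    return [f"{i + 1} {base} {partner[i]}" for i, base in enumerate(seq)]


lines = to_bpseq_lines("GCCGUCGCGU", "((....))..")
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a file would give one bpseq file per sequence, which is roughly what the path argument of sample2bpseq suggests.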
The script predict.py predicts structures from randomly created sequences and stores the predictions in numpy files.
The created files are used by the script ml_forensic.py.
The model to use for the prediction and the path to the input file can be specified within the python script.
The script ml_forensic.py contains the following analysis functions:
- length_analysis
  - analyses the relation between the number of predicted base pairs and the sequence length, and plots the predictions with and without postprocessing together with the results of RNADeep
  - Example Usage:
    length_analysis(folder="length_files", n_lengths= [70,100], stem_file_name = "N1000", save = "figure.png")
- model_eval
  - evaluates different metrics (MCC, F1, precision, recall) of a given UFold model with and without postprocessing
  - Example Usage:
    model_eval(contact_net=ufold_contact, test_generator=Dataset_generator)
- structural_elements
  - analyses the structural elements (External Loop (EL), Bulge Loop (BL), Hairpin Loop (HL), Internal Loop (IL), Multi Loop (ML)) of truth and prediction with and without postprocessing, as well as the relative frequencies of the base pair types
  - Example Usage:
    structural_elements(file_path="structural_analysis/test_files")
- known_structure_test
  - tests how likely the given model is to predict the given structure; calculates the base pair distance between the prediction and the given structure and plots it as a histogram
  - Example Usage:
    known_structure_test(itr=10, structure="(((...)))", save = "results/structure_test.png", model = r"models/ufold_model.pt", unbias_model = False)
The functions can be executed within the python script.
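The metrics named above (precision, recall, F1, MCC) and the base pair distance can be illustrated on dot-bracket strings. The following is a minimal sketch, not the code from ml_forensic.py: the function names are hypothetical, base pairs are compared as sets, and true negatives are counted over all position pairs i < j, which is an assumption of this sketch. UFold itself evaluates contact matrices, so absolute values may differ.

```python
import math


def base_pairs(ss):
    """Return the set of (i, j) base pairs (0-based, i < j) of a dot-bracket string."""
    stack, pairs = [], set()
    for i, char in enumerate(ss):
        if char == "(":
            stack.append(i)
        elif char == ")":
            pairs.add((stack.pop(), i))
    return pairs


def bp_distance(ss_a, ss_b):
    """Number of base pairs present in exactly one of the two structures."""
    return len(base_pairs(ss_a) ^ base_pairs(ss_b))


def pair_metrics(truth, pred):
    """Precision, recall, F1 and MCC of predicted base pairs against the truth."""
    n = len(truth)
    true_pairs, pred_pairs = base_pairs(truth), base_pairs(pred)
    tp = len(true_pairs & pred_pairs)
    fp = len(pred_pairs - true_pairs)
    fn = len(true_pairs - pred_pairs)
    # Assumption: every other position pair (i < j) counts as a true negative.
    tn = n * (n - 1) // 2 - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


truth, pred = "((....))..", "(......).."
print(bp_distance(truth, pred))
print(pair_metrics(truth, pred))
```

In this toy case the prediction recovers one of the two true base pairs and adds no false ones, so precision is 1.0, recall is 0.5 and the base pair distance is 1.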
The script sequence_design.py creates new sequences from an input fasta file.
It can be executed from the command line with the following arguments:
- -input or -i (str): path to the input fasta file
- -output or -o (str): path to the output txt file
- -design or -d (str): 1, 2 or 3; selects the design approach to use (default = 1)
The design argument selects the sequence design approach:
- 1 = simple sequence design
- 2 = frequency based sequence design
- 3 = constraint generation sequence design
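This README does not describe how the three approaches work internally. Purely as an illustration of what designing a sequence for a target structure means, the sketch below fills paired positions with complementary bases and unpaired positions with random bases. This is an assumed, naive scheme and not the implementation behind any of the three options; `naive_design` and `PAIRS` are hypothetical names, and nothing guarantees that the resulting sequence actually folds into the target.

```python
import random

# Base pairs allowed at paired positions: Watson-Crick plus GU wobble
# (an assumption of this sketch).
PAIRS = [("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")]


def naive_design(structure, seed=None):
    """Return a sequence whose paired positions are complementary (hypothetical helper)."""
    rng = random.Random(seed)
    seq = [None] * len(structure)
    stack = []  # indices of opening brackets that are still unmatched
    for i, char in enumerate(structure):
        if char == "(":
            stack.append(i)
        elif char == ")":
            j = stack.pop()
            # Assign both partners at once so they are guaranteed to pair.
            seq[j], seq[i] = rng.choice(PAIRS)
        else:
            seq[i] = rng.choice("ACGU")
    return "".join(seq)


designed = naive_design("((....))..", seed=1)
print(designed)
```

A real design approach would typically also check the result by folding the designed sequence and comparing it with the target structure.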
Example usage from the command line with frequency-based sequence design:
python sequence_design.py -i input_file.fa -o results/output_file.txt -d 2
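A command-line interface with these three options could be declared with argparse roughly as follows. This is a sketch under assumptions, not the parser from sequence_design.py: `build_parser` is a hypothetical name, and marking the input and output as required is a guess. Only the option names, the string type and the default of 1 are taken from the description above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Design new sequences from an input fasta file."
    )
    # The single-dash long forms (-input, -output, -design) follow the README.
    parser.add_argument("-i", "-input", dest="input", required=True,
                        help="path to the input fasta file")
    parser.add_argument("-o", "-output", dest="output", required=True,
                        help="path to the output txt file")
    parser.add_argument("-d", "-design", dest="design",
                        choices=["1", "2", "3"], default="1",
                        help="design approach: 1 = simple, 2 = frequency based, "
                             "3 = constraint generation")
    return parser


# Same arguments as in the example call above.
args = build_parser().parse_args(
    ["-i", "input_file.fa", "-o", "results/output_file.txt", "-d", "2"]
)
print(args.input, args.output, args.design)
```

Restricting `-d` with `choices` makes argparse reject any value other than 1, 2 or 3 with a usage message instead of failing later in the script.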
The notebook sequence_design.ipynb creates new sequences from an input fasta file (similar to sequence_design.py). It was used to test the sequence design approaches, since the required packages were incompatible with the computer used.
It can only be executed within a Jupyter notebook.
The folder figures contains all the figures that were created during the analysis. The README.md file within this folder explains what each figure represents and which model was used to produce the results.