The purpose of this project is to utilize EVO2, a DNA foundation model developed by the Arc Institute, to predict the impacts of certain genetic variants, with a specific focus on non-coding regions and their influence on protein expression and phenotype related to Ataxia. The goal is to determine whether any of the numerous variants in non-coding regions could contribute to Ataxia phenotypes.
This project will focus on:
- Exploring the capabilities of EVO2 in analyzing non-coding genetic variants and their potential influence on Ataxia phenotypes.
- Developing a workflow to process variant data and predict potential impacts using EVO2.
- Utilizing and adapting existing test data, derived from the Snakemake Workflows for DNA sequencing (https://github.com/snakemake-workflows/dna-seq-varlociraptor) and vembrane output files (https://github.com/vembrane/vembrane), to facilitate this analysis.
- Gain familiarity with the EVO2 model and its applications.
- Develop a processing pipeline that can accept TSV files with variant data.
- Evaluate the impact of specific variants on protein expression, focusing on non-coding regions.
- A fully functional pipeline that accepts TSV files of variants, processes them, runs them through the EVO2 model, and outputs results.
- A report detailing the methods used, challenges faced, and results obtained.
- Documentation for future users to replicate or expand the analysis.
- Project initiation and background research: TBD
- Initial setup and exploration of EVO2: TBD
- Workflow development and testing: TBD
- Analysis and reporting: TBD
- Final presentation: TBD
- Utilize Python and an appropriate workflow management tool compatible with Slurm.
- Use test datasets from the
WES_Analysis_11340_9740repository as initial input. - Implement and refine a pipeline for processing and analyzing variants using EVO2.
- EVO2 model details from Arc Institute: Arc Institute EVO2
- Test data: WES Analysis Repository
- Workflow management tools: TBD (consider exploring alternatives to Cookiecutter Data Science for Python and Slurm compatibility)
- Accuracy and reliability of the workflow and predictions.
- Clarity and completeness of documentation and reporting.
- Innovation in workflow design and problem-solving approaches.
EVO2 is a genome modeling and design framework capable of analyzing DNA sequences across various domains of life, particularly valuable in studying non-coding regions.
- Regular check-ins and feedback sessions: TBD
- Access to shared resources and guidance from [Your Name]: TBD
- The need to adapt to a currently unfamiliar model (EVO2).
- Ensuring compatibility with cluster resources and data formats.
- Preferred channel for communication: TBD
- Frequency of updates and meetings: TBD
- Identify potential technical obstacles early and seek guidance as needed.
- Regularly back up work and maintain version control.