PyPair: Python-Based Short Read Alignment Software

Introduction

PyPair is a Python-based software tool for aligning short DNA sequences (reads) to a reference genome. It employs a learned index for exact-match seed generation, enhancing the efficiency and accuracy of the alignment process. The software integrates several bioinformatics tools and techniques, such as Biopython for parsing sequence data and PySAM for alignment handling, combined with machine learning models for improved seed generation.
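To make the learned-index idea concrete, here is a minimal sketch of how a model can stand in for a plain search over sorted seeds. It is not PyPair's trained model: the class name LinearLearnedIndex, the 2-bit encoding, and the least-squares line are all assumptions chosen for illustration. The line predicts where a k-mer sits in a sorted table, and the recorded worst-case error bounds a short binary search around that guess.

```python
from bisect import bisect_left

# Hypothetical 2-bit base encoding; bases other than A/C/G/T raise KeyError.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


class LinearLearnedIndex:
    """Sorted (key, position) table plus a fitted line predicting rank from key.

    Assumes the reference holds at least one k-mer (len(reference) >= k).
    """

    def __init__(self, reference, k):
        self.k = k
        self.entries = sorted(
            (encode(reference[i:i + k]), i)
            for i in range(len(reference) - k + 1)
        )
        self.keys = [key for key, _ in self.entries]
        n = len(self.keys)
        mean_x = sum(self.keys) / n
        mean_y = (n - 1) / 2
        var = sum((x - mean_x) ** 2 for x in self.keys)
        cov = sum((x - mean_x) * (r - mean_y) for r, x in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_y - self.slope * mean_x
        # The worst prediction error over stored keys bounds the search window.
        self.max_err = max(
            abs(self._predict(x) - r) for r, x in enumerate(self.keys)
        )

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, kmer):
        """Return all reference positions where kmer occurs exactly."""
        key = encode(kmer)
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(guess - self.max_err, 0), n)
        hi = min(max(guess + self.max_err + 1, lo), n)
        i = bisect_left(self.keys, key, lo, hi)
        hits = []
        while i < n and self.keys[i] == key:
            hits.append(self.entries[i][1])
            i += 1
        return hits


reference = "ACGTACGTTAGCCGATAGGCTA"
learned = LinearLearnedIndex(reference, k=4)
acgt_hits = learned.lookup("ACGT")
```

Because the error bound is measured over every stored key, a k-mer that exists in the reference always falls inside the search window. A k-mer that is absent simply yields an empty list.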

Getting Started

Installation

To install PyPair, clone the repository and install the required dependencies:

git clone https://github.com/your-repository/PyPair.git
cd PyPair
pip install -r requirements.txt

Usage

PyPair operates through a series of steps to align reads to a reference genome:

FASTA Reference and FASTQ Reads Parsing: Utilizes Biopython to read FASTA formatted reference genomes and FASTQ formatted read files.
Seed Generation and Mapping: Generates mapping locations from FASTQ seeds and builds them into an internal dictionary.
Sequence Alignment: Employs the Smith-Waterman algorithm for precise sequence alignment, with PySAM handling the resulting alignment records.
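The seed generation and mapping step can be sketched with an ordinary dictionary. This is an illustrative, dependency-free example rather than PyPair's code: the function names build_seed_index and map_read_seeds, the seed length k=4, and the toy sequences are assumptions, and a plain hash lookup is used here in place of the learned model.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def map_read_seeds(read, index, k):
    """Count exact seed hits per candidate read start position.

    A seed at read offset `off` that hits reference position `pos`
    suggests the read starts at `pos - off` in the reference.
    """
    votes = defaultdict(int)
    for off in range(len(read) - k + 1):
        for pos in index.get(read[off:off + k], []):
            votes[pos - off] += 1
    return dict(votes)


# Toy data, invented for the example.
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_seed_index(reference, k=4)
votes = map_read_seeds("TAGCCGAT", index, k=4)
best_start = max(votes, key=votes.get)
```

The candidate position with the most seed votes is a natural place to run the more expensive alignment step.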

To start the seed generation and mapping process, run the following:

python predict_seeds.py <path-to-FASTQ-file>

Script Overview

The main script, predict_seeds.py, orchestrates the alignment process. It includes:

Parsing of FASTQ files for read sequences.
A Smith-Waterman algorithm implementation for sequence alignment.
CIGAR string computation from sequence alignments.
Model loading and sequence processing utilities.
Generation of BAM files from sequence alignments.
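As a rough illustration of the Smith-Waterman and CIGAR items above, the following sketch aligns a read locally against a reference window and derives a CIGAR string from the traceback. The function name smith_waterman, the scoring values (match=2, mismatch=-1, gap=-2), and the linear gap penalty are assumptions for the example and may differ from what predict_seeds.py uses.

```python
from itertools import groupby


def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Local alignment of read against ref with a linear gap penalty.

    Returns (best score, 0-based alignment start in ref, CIGAR string).
    """
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            cell = max(
                0,
                score[i - 1][j - 1] + sub,   # match or mismatch
                score[i - 1][j] + gap,       # base only in the read
                score[i][j - 1] + gap,       # base only in the reference
            )
            score[i][j] = cell
            if cell > best:
                best, best_cell = cell, (i, j)

    # Trace back from the best cell until the score drops to zero.
    ops = []
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if read[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            ops.append("M")
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            ops.append("I")
            i -= 1
        else:
            ops.append("D")
            j -= 1
    ops.reverse()

    # Run-length encode the operations; soft-clip unaligned read ends.
    cigar = f"{i}S" if i > 0 else ""
    cigar += "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))
    tail = len(read) - best_cell[0]
    if tail > 0:
        cigar += f"{tail}S"
    return best, j, cigar


exact = smith_waterman("ACGT", "TTACGTTT")
clipped = smith_waterman("GGACGT", "TTACGTTT")
deletion = smith_waterman("AAACCCGGGTTT", "AAACCCTGGGTTT")
```

Here M covers both matches and mismatches, I marks a base present only in the read, D marks a base present only in the reference, and S marks read bases left outside the local alignment.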

Dependencies

PyPair requires Python 3.x and several libraries, including Biopython, PySAM, and Pandas, along with pickle from the Python standard library. Ensure the third-party packages are installed as per requirements.txt.

Project Structure

The PyPair project is structured as follows:

PyPair/
├── model_weights/         # Trained model weights
├── reference_samples/     # Reference genome samples
├── Validation/            # Validation datasets
├── wandb/                 # Weights & Biases logging (optional)
├── encode_decode.py       # Encoding/decoding utilities
├── gen_index.py           # Index generation script
├── gen_training_data.py   # Training data generation script
├── io_processing.py       # I/O processing utilities
├── multi_model_training.ipynb  # Jupyter notebook for model training
├── predict_seeds.py       # Main prediction script
├── readme                 # Project README
└── requirements.txt       # Python package requirements

Support

For issues, questions, or contributions, please open an issue or pull request in the GitHub repository.
