PyPair is a Python-based software tool for aligning short DNA sequences (reads) to a reference genome. It employs a learned index for exact-match seed generation, enhancing the efficiency and accuracy of the alignment process. The software integrates several bioinformatics tools and techniques, such as Biopython for parsing sequence data and PySAM for alignment handling, combined with machine learning models for improved seed generation.
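For readers unfamiliar with the idea, a learned index replaces a conventional lookup structure with a model that predicts where a key sits in a sorted array, followed by a short search around the prediction. The sketch below illustrates this under stated assumptions and is not PyPair's implementation: it uses a single least-squares line rather than PyPair's trained models, and `encode_kmer` and `LinearLearnedIndex` are hypothetical names introduced here.

```python
from bisect import bisect_left

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = value * 4 + ENCODE[base]
    return value


class LinearLearnedIndex:
    """Sorted keys plus a least-squares line that predicts each key's rank.

    Hypothetical stand-in for a trained model; assumes a non-empty key set.
    """

    def __init__(self, keys):
        self.keys = sorted(set(keys))
        n = len(self.keys)
        if n == 0:
            raise ValueError("at least one key is required")
        mean_x = sum(self.keys) / n
        mean_y = (n - 1) / 2
        var_x = sum((x - mean_x) ** 2 for x in self.keys)
        cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(self.keys))
        self.slope = cov / var_x if var_x else 0.0
        self.intercept = mean_y - self.slope * mean_x
        # The largest error on the stored keys bounds the search window.
        self.max_err = max(
            abs(self._predict(x) - i) for i, x in enumerate(self.keys)
        )

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the rank of `key` in the sorted array, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        pos = bisect_left(self.keys, key, lo, hi)
        if pos < n and self.keys[pos] == key:
            return pos
        return None


kmers = ["ACGT", "CGTA", "GTAC", "TACG", "TTTT"]
learned = LinearLearnedIndex(encode_kmer(k) for k in kmers)
rank = learned.lookup(encode_kmer("GTAC"))
```

In a seed-matching setting, the returned rank would typically index a parallel array of reference positions; an absent k-mer yields `None`, which is what makes the lookup an exact match.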
To install PyPair, clone the repository and install the required dependencies:
git clone https://github.com/your-repository/PyPair.git
cd PyPair
pip install -r requirements.txt
PyPair operates through a series of steps to align reads to a reference genome:
- FASTA Reference and FASTQ Reads Parsing: Utilizes Biopython to read FASTA formatted reference genomes and FASTQ formatted read files.
- Seed Generation and Mapping: Generates seeds from the FASTQ reads and stores their mapping locations in an internal dictionary.
- Sequence Alignment: Aligns reads with the Smith-Waterman algorithm and uses PySAM to handle the resulting alignments.
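The seed generation and mapping step can be pictured with a minimal sketch. It is an illustration only, not PyPair's code: a plain Python dictionary stands in for the learned index, and `build_seed_index`, `map_read`, and the seed length `k` are hypothetical names and parameters.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, index, k):
    """Return candidate reference start positions for a read, with vote counts.

    An exact-match seed at read offset `off` and reference position `pos`
    votes for the read starting at `pos - off`.
    """
    votes = defaultdict(int)
    for off in range(len(read) - k + 1):
        for pos in index.get(read[off:off + k], ()):
            votes[pos - off] += 1
    return dict(votes)


reference = "ACGTACGTTAGC"
index = build_seed_index(reference, k=4)
candidates = map_read("CGTTAG", index, k=4)
```

Candidate positions with the most votes are the natural places to run the more expensive Smith-Waterman alignment.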
To start the seed generation and mapping process, run the following:
python predict_seeds.py <path-to-FASTQ-file>
The main script, predict_seeds.py, orchestrates the alignment process. It includes:
- Parsing of FASTQ files for read sequences.
- A Smith-Waterman algorithm implementation for sequence alignment.
- CIGAR string computation from sequence alignments.
- Model loading and sequence processing utilities.
- Generation of BAM files from sequence alignments.
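As a rough illustration of the Smith-Waterman and CIGAR steps listed above, the following sketch aligns a read locally against a reference and encodes the traceback as a CIGAR string. The scoring values, the linear gap penalty, and the function name `smith_waterman` are assumptions made for this example; the implementation in predict_seeds.py may differ.

```python
from itertools import groupby


def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (score, 0-based reference start, CIGAR string).
    """
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best-scoring cell until the score drops to zero.
    i, j = best_pos
    ops = []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if read[i - 1] == ref[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            ops.append("M")  # aligned pair (match or mismatch)
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            ops.append("I")  # base present in the read only
            i -= 1
        else:
            ops.append("D")  # base present in the reference only
            j -= 1
    ops.reverse()

    # Soft-clip read bases outside the local alignment so that the CIGAR
    # accounts for the full read length.
    ops = ["S"] * i + ops + ["S"] * (len(read) - best_pos[0])
    cigar = "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))
    return best, j, cigar


score, ref_start, cigar = smith_waterman("GGACGT", "TTACGTTT")
```

The returned reference start and CIGAR string are the two pieces of information an alignment record in a BAM file needs in addition to the read itself.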
PyPair requires Python 3.x and several libraries, including Biopython, PySAM, and Pandas. It also uses Pickle, which ships with the Python standard library. Install the third-party packages listed in `requirements.txt`.
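A quick way to confirm that the environment is ready is to check that each package can be found before running any scripts. The helper below is a hypothetical convenience, not part of PyPair; it relies on the usual import names (`Bio` for Biopython, `pysam` for PySAM, `pandas` for Pandas).

```python
import importlib.util
import sys

# Display name -> import name. Biopython is imported as `Bio`.
REQUIRED = {"Biopython": "Bio", "PySAM": "pysam", "Pandas": "pandas"}


def missing_packages(required=REQUIRED):
    """Return the display names of packages whose import name cannot be found."""
    return [name for name, module in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("Missing:", ", ".join(missing_packages()) or "none")
```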
The PyPair project is structured as follows:
PyPair/
├── model_weights/ # Trained model weights
├── reference_samples/ # Reference genome samples
├── Validation/ # Validation datasets
├── wandb/ # Weights & Biases logging (optional)
├── encode_decode.py # Encoding/decoding utilities
├── gen_index.py # Index generation script
├── gen_training_data.py # Training data generation script
├── io_processing.py # I/O processing utilities
├── multi_model_training.ipynb # Jupyter notebook for model training
├── predict_seeds.py # Main prediction script
├── readme # Project README
└── requirements.txt # Python package requirements
For issues, questions, or contributions, please open an issue or pull request in the GitHub repository.