Primepore

An analytical tool that detects three types of RNA modification and outputs modification predictions at the single-molecule level.

Introduction

Primepore is a high-performance software tool designed to detect and quantify three RNA modifications (m6A, m5C, and inosine) from Nanopore Direct RNA Sequencing data. The four-stage workflow delivers per-site probabilities, modification proportions, and single-molecule modification status.

Key Features

Simultaneous detection of three RNA modifications (m6A, m5C, inosine)
End-to-end four-stage workflow: Preprocessing, Classification (Transformer), Regression, Clustering
Transformer-based classifier for initial screening of modification signals
Regression stage to predict modification proportions
Clustering stage to output per-molecule modification states
Integration with traditional signal and alignment tools (basecalling, mapping, nanopolish) and ground-truth alignment for feature extraction
Outputs suitable for downstream statistics, visualization, and comparative analyses

Workflow

1. Preprocessing and Raw Data Alignment

Basecalling, read mapping, and event alignment
Ground-truth alignment to the reference
Feature extraction and data integration to prepare inputs for modeling

2. Classification Stage

Transformer-based network for preliminary screening of modification signals
Distinguishes among m6A, m5C, inosine, and unmodified signals using sequence and signal context

3. Regression Stage

Predicts modification proportions at identified sites
Outputs continuous proportion estimates with associated uncertainty

4. Clustering Stage

Outputs per-molecule modification status based on predicted proportions
Groups reads by modification profiles to reveal heterogeneity and epitranscriptome structure
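
To make the link between a site-level proportion and per-molecule status concrete, here is a toy sketch. It is not Primepore's clustering method, which is not described in this README. It assumes each read at a site already has a hypothetical modification score, and it simply labels the highest-scoring reads as modified until the labeled fraction matches the predicted proportion. The function name, the score dictionary, and the rank-based rule are all illustrative assumptions.

```python
from typing import Dict


def assign_molecule_labels(read_scores: Dict[str, float], proportion: float) -> Dict[str, int]:
    """Label the top-scoring reads as modified (1) so that the labeled
    fraction is as close as possible to the site-level proportion.

    Illustrative only: a rank-based rule, not Primepore's clustering stage.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be within [0, 1]")
    n_modified = round(proportion * len(read_scores))
    # Rank reads by descending score; ties are broken by read name for determinism.
    ranked = sorted(read_scores, key=lambda name: (-read_scores[name], name))
    modified = set(ranked[:n_modified])
    return {name: int(name in modified) for name in read_scores}


# Hypothetical scores for four reads covering one site.
scores = {"read_a": 0.91, "read_b": 0.15, "read_c": 0.67, "read_d": 0.08}
labels = assign_molecule_labels(scores, proportion=0.5)
print(labels)
```

With a predicted proportion of 0.5 and four reads, two reads are labeled as modified and two as unmodified.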

Requisites

Data preparation:

Data	Note
fast5 files	containing raw current signals
reference.fa	genome.fa (for Homo sapiens: hg19.fa or hg38.fa)
methylation_rate.csv	methylation-rate ground truth, needed only for training your own models

Environment:

Platform: Linux x86_64
GPU: NVIDIA GPUs
CPUs

Software

Tool	Usage
Guppy	ONT official software to generate FASTQ through basecalling
minimap2	align reads to reference.fa
samtools	BAM file processing
slow5tools	converting (FAST5 <-> SLOW5)
f5c	eventalign, assign current signals to bases

Python modules

Tool	Usage
torch	an open-source Python machine learning library
read5	a Python wrapper to read fast5, slow5/blow5 and pod5 files
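
Before running the pipeline, it can help to confirm that the tools and modules listed above are available. The sketch below uses only the Python standard library. The executable names are taken from the commands used later in this README, and it assumes that the importable module names match the names in the table (`torch`, `read5`); the helper function names are illustrative.

```python
import importlib.util
import shutil
from typing import Iterable, List

# Executable and module names as they appear in this README.
REQUIRED_TOOLS = ["guppy_basecaller", "minimap2", "samtools", "slow5tools", "f5c"]
REQUIRED_MODULES = ["torch", "read5"]


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules: Iterable[str]) -> List[str]:
    """Return the top-level Python modules that cannot be located."""
    return [mod for mod in modules if importlib.util.find_spec(mod) is None]


if __name__ == "__main__":
    tools = missing_tools(REQUIRED_TOOLS)
    modules = missing_modules(REQUIRED_MODULES)
    print("Missing tools:", ", ".join(tools) or "none")
    print("Missing Python modules:", ", ".join(modules) or "none")
```

This only checks that each dependency can be found; it does not verify versions or GPU support.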

Installation

Installation may take several minutes.

git clone https://github.com/darelab2014/Primepore

Getting Started

1. Base calling and alignment

Convert all FAST5 files into a single BLOW5 file (slow5tools)

# convert a directory of fast5 files into BLOW5 files (default compression: zlib+svb-zd)
slow5tools f2s fast5_dir -d blow5_dir
# merge all BLOW5 files in a directory into a single BLOW5 file (default compression: zlib+svb-zd)
slow5tools merge blow5_dir -o file.blow5

ONT guppy basecalling (guppy_basecaller)

# Input: 'fast5_dir'; output: 'output_dir'. Set 'guppy_flowcell' and 'guppy_kit' to the flowcell and kit used for sequencing.
# Optional: use '-x' to basecall on a GPU, replacing 'GPU' with your device (Guppy typically accepts values such as 'auto' or 'cuda:0'); without '-x', the CPU is used.
guppy_basecaller -i fast5_dir -s output_dir --flowcell guppy_flowcell --kit guppy_kit -x GPU -r
# Merge all FASTQ files; the output is 'combined_fastq'
cat output_dir/*.fastq > combined_fastq

Alignment with reference.fa (minimap2)

# Inputs: reference file 'reference.fa' and basecalled FASTQ file 'combined_fastq'. Output: SAM file 'alignment_sam'
minimap2 -ax map-ont reference.fa combined_fastq > alignment_sam

Convert and sort SAM to BAM (samtools)

# The input is sam file 'alignment_sam', the output is sorted bam file 'alignment_sorted_bam'
samtools view -S alignment_sam -b | samtools sort -o alignment_sorted_bam - ; samtools index alignment_sorted_bam

Event alignment (f5c)

# Inputs: FASTQ file 'combined_fastq' and BLOW5 file 'file.blow5'
f5c index combined_fastq --slow5 file.blow5
# Inputs: FASTQ file 'combined_fastq', reference file 'reference.fa', sorted BAM file 'alignment_sorted_bam' and BLOW5 file 'file.blow5'. Output: 'eventalign_output.csv'
f5c eventalign -r combined_fastq -g reference.fa -b alignment_sorted_bam --slow5 file.blow5 --rna --signal-index --print-read-name > eventalign_output.csv
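
To sanity-check the event alignment before moving on, the output can be summarized per read and reference position. The sketch below is an assumption-laden example, not part of Primepore: it assumes the file is tab-delimited (as f5c and nanopolish eventalign output usually is, regardless of the `.csv` name) and that it has the columns `contig`, `position`, `read_name` and `event_level_mean`. Check the header of your own file before relying on these names. The inline data is made up for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_level_per_read_position(handle):
    """Average event_level_mean over all events that share the same
    (contig, position, read_name). Column names are assumed, see above."""
    reader = csv.DictReader(handle, delimiter="\t")
    levels = defaultdict(list)
    for row in reader:
        key = (row["contig"], int(row["position"]), row["read_name"])
        levels[key].append(float(row["event_level_mean"]))
    return {key: mean(values) for key, values in levels.items()}


# Made-up rows with only the columns used here; real files contain more columns.
example = (
    "contig\tposition\treference_kmer\tread_name\tevent_level_mean\n"
    "chr1\t100\tAACCG\tread_1\t100.0\n"
    "chr1\t100\tAACCG\tread_1\t102.0\n"
    "chr1\t101\tACCGT\tread_1\t95.5\n"
    "chr1\t100\tAACCG\tread_2\t110.0\n"
)
result = mean_level_per_read_position(io.StringIO(example))
for key, level in sorted(result.items()):
    print(key, level)
```

For a real run, replace the `io.StringIO` example with `open("eventalign_output.csv", newline="")`. Large files may be better processed in chunks.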

2. Ground truth alignment and data preprocessing

Optional: If you want to retrain the model using your own data, you need to align it with the ground truth.

# Input: your ground-truth file 'your_groundtruth_file.csv'. Output: the processed file 'groundtruth_file' (default: 'Groundtruth.csv')
python ground_truth_process.py -i 'your_groundtruth_file.csv' -o Groundtruth.csv
# Inputs: event alignment file 'eventalign_output.csv', template output folder 'template_output_folder' and ground-truth file 'groundtruth_file'. Output: 'align_label_output_file.feather' (must be path/*.feather)
python align_label.py  -f eventalign_output.csv  -t template_output_folder -o path/align_label_output_file.feather -g groundtruth_file
# Inputs: raw current file 'file.blow5', template output folder 'template_output_folder' and align_label output file 'path/align_label_output_file.feather'. Output: 'align_raw_current_output_file.feather' (must be path/*.feather)
python align_raw_current.py -b file.blow5 -t template_output_folder -a path/align_label_output_file.feather -o path/align_raw_current_output_file.feather

Required: Feature extraction is necessary for both training and inference on your own data.

# Inputs: align_raw_current output file 'align_raw_current_output_file.feather' and template output folder 'template_output_folder'. Output: 'output_feature.feather' (must be path/*.feather)
python feature_extraction.py -a path/align_raw_current_output_file.feather -t template_output_folder -o path/output_feature.feather
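
Because every output in this step must be a `*.feather` path, a small wrapper can catch a wrong extension before any long-running job starts. The sketch below only assembles the three commands shown above as argument lists and validates the output paths; it does not run anything. The script names and flags are copied from this section, while the function names and parameter names are illustrative assumptions.

```python
from typing import List


def require_feather(path: str) -> str:
    """Return the path unchanged, or raise if it does not end in .feather."""
    if not path.endswith(".feather"):
        raise ValueError(f"expected a path ending in .feather, got: {path}")
    return path


def preprocessing_commands(
    eventalign_csv: str,
    blow5: str,
    template_dir: str,
    groundtruth_csv: str,
    label_feather: str,
    raw_feather: str,
    feature_feather: str,
) -> List[List[str]]:
    """Build the align_label, align_raw_current and feature_extraction
    commands from this section, checking output extensions first."""
    for path in (label_feather, raw_feather, feature_feather):
        require_feather(path)
    return [
        ["python", "align_label.py", "-f", eventalign_csv, "-t", template_dir,
         "-o", label_feather, "-g", groundtruth_csv],
        ["python", "align_raw_current.py", "-b", blow5, "-t", template_dir,
         "-a", label_feather, "-o", raw_feather],
        ["python", "feature_extraction.py", "-a", raw_feather, "-t", template_dir,
         "-o", feature_feather],
    ]


commands = preprocessing_commands(
    "eventalign_output.csv", "file.blow5", "template_output_folder", "Groundtruth.csv",
    "path/align_label_output_file.feather",
    "path/align_raw_current_output_file.feather",
    "path/output_feature.feather",
)
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths look right.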

3. Classification model training and inference

Classification model training

# Inputs: feature file 'feature_file' and model save folder 'model_saved_folder'. Optional training parameters: -e (epochs, default 100), -b (batch size, default 512), -d (device, default cuda)
python Classification_model_training.py -f feature_file -m model_saved_folder

Classification model inference

# Inputs: feature file 'feature_file' and model save folder 'model_saved_folder'. Output: inference result folder 'classification_inference_result_folder'. Optional inference parameters: -b (batch size, default 512), -d (device, default cuda)
python Classification_model_inference.py -f feature_file -m model_saved_folder -o classification_inference_result_folder

4. Regression model training and inference

Regression model training

# Inputs: classification inference result file 'classification_inference_result_file', the processed data template folder 'processed_data_template_folder' and model save folder 'model_saved_folder'. Optional training parameters: -e (epochs, default 100), -d (device, default cuda)
python Regression_model_training.py -f classification_inference_result_file -t processed_data_template_folder -m model_saved_folder

Regression model inference

# Inputs: classification inference result file 'classification_inference_result_file' and model save folder 'model_saved_folder'. Output: inference result file 'regression_inference_result_file'. Optional parameter: -d (device, default cuda)
python Regression_model_inference.py -f classification_inference_result_file -m model_saved_folder -o regression_inference_result_file

5. Clustering (single molecule label output)

# Input: regression inference result file 'regression_inference_result_file'. Output: single-molecule result file 'single_molecule_result_file'. Optional parameters: -e (epochs, default 100), -d (device, default cuda)
python Single_molecule_results.py -f regression_inference_result_file -o single_molecule_result_file
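
For inference with existing models, steps 3 to 5 can be chained from one script. The following is a minimal sketch under stated assumptions, not an official driver: it reuses the script names and flags from the commands above, and it takes the classification result file and the two model folders as separate arguments because this README does not say whether they coincide. The function names, parameter names and the dry-run behavior are illustrative.

```python
import shlex
import subprocess
from typing import List


def inference_commands(
    feature_file: str,
    classification_model_dir: str,
    classification_out_dir: str,
    classification_result_file: str,
    regression_model_dir: str,
    regression_result_file: str,
    single_molecule_result_file: str,
    device: str = "cuda",
) -> List[List[str]]:
    """Build the classification, regression and clustering inference commands."""
    return [
        ["python", "Classification_model_inference.py", "-f", feature_file,
         "-m", classification_model_dir, "-o", classification_out_dir, "-d", device],
        ["python", "Regression_model_inference.py", "-f", classification_result_file,
         "-m", regression_model_dir, "-o", regression_result_file, "-d", device],
        ["python", "Single_molecule_results.py", "-f", regression_result_file,
         "-o", single_molecule_result_file],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False.
    check=True stops the chain at the first failing stage."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


cmds = inference_commands(
    "feature_file", "model_saved_folder", "classification_inference_result_folder",
    "classification_inference_result_file", "model_saved_folder",
    "regression_inference_result_file", "single_molecule_result_file",
)
run_all(cmds, dry_run=True)
```

Running with `dry_run=True` first makes it easy to review the exact command lines before committing GPU time.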

Getting help

We appreciate your feedback and questions. You can report errors or suggestions related to Primepore by opening an issue on GitHub.
