Simple pipeline for predicting bacterial base modification from PacBio HiFi sequencing data with kinetics tags.
Currently the output design is to store the output BAMs alongside the input files, with the prefix `fibertools_predict.{input_bam}`. We implement a custom check for whether output already exists, and filter out any inputs which already have output files in the expected location. Skipped inputs are logged by the pipeline. This behaviour can be disabled by setting `--clobber true`.
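For example, for a hypothetical input BAM the expected output location would look like this (paths are illustrative):

```
# input : /data/run1/movie.hifi.bam
# output: /data/run1/fibertools_predict.movie.hifi.bam
ls /data/run1/
# fibertools_predict.movie.hifi.bam  movie.hifi.bam
```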
Currently, the pipeline will:

- Filter out BAMs which seem irrelevant by name (containing `fail`, `unassigned`, `subread`, `scrap`, or `fibertools_predict`), or which have existing output files.
- Filter out any BAMs which do not contain the required kinetics tags (`CHECK_KINETICS`). A way to check a BAM for these tags by hand is sketched below this list.
- Predict 6mA base modification using fibertools (`PREDICT_FIBERTOOLS`).
- Extract modifications to a table using a custom Perl script (`EXTRACT_CALLS`). This currently only extracts modifications with a probability > 240 on the 0-255 ML scale (240/255 ≈ 0.94); this threshold is fixed and cannot be changed. The extraction is done by default, but is optional; disable it by setting `--extract_calls false`.
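If you want to check a BAM by hand, something like the following should work, assuming the HiFi kinetics are stored in the standard `fi`/`fp`/`ri`/`rp` tags (the pipeline's internal check may differ in detail):

```
# Print any kinetics tags present on the first read; no output means none found
samtools view movie.hifi.bam | head -n 1 | tr '\t' '\n' | grep -E '^(fi|fp|ri|rp):'
```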
The default install of fibertools is from conda, and will not support use of the GPU. If you want to use a GPU to improve prediction speed, refer to the installation instructions in the fibertools documentation. Installing fibertools with GPU features requires cmake, FindBin.pm, and git.
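A quick way to confirm those build prerequisites are present (a sanity check only, not part of the pipeline):

```
command -v cmake git
perl -MFindBin -e 'print "FindBin.pm OK\n"'
```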
samtools should be available in the environment you launch the pipeline from.
For our local use, it has so far been fast enough to run fibertools without the GPU, primarily because of time spent pending on the GPU partition. You may find it worthwhile to install with GPU features if your GPUs are less in demand.
This section shows how to run the pipeline using micromamba environments. We will create a micromamba environment with nextflow installed; nextflow will then automatically create the environment required to run the processes. If the machines you run the pipeline on do not have internet access, see the later section on running without internet access.
Run

```
micromamba create -n nextflow nextflow conda samtools
```
We are installing conda within the environment, as nextflow needs the `conda` binary to activate and deactivate environments. samtools is installed because each file is checked for kinetics tags locally, rather than being submitted as a job, and so that check runs in the nextflow environment.
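You can verify the environment has everything nextflow will need (exact version output will vary):

```
micromamba activate nextflow
nextflow -version
conda --version
samtools --version | head -n 1
```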
(Optional). Nextflow can take a local copy of the pipeline to run. If your compute nodes have internet access, this step isn't strictly necessary.

```
nextflow pull apduncan/bm-tk -r main
```

This will pull the most recent commit on the main branch. You could also specify a commit hash, as sketched below.
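For example, to pin the pipeline to an exact revision (the hash below is a placeholder; substitute a real commit from the repository):

```
nextflow pull apduncan/bm-tk -r <commit-hash>
```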
Move to whichever directory you want pipeline logs and configuration to be kept in. Unlike many nextflow pipelines, output files will not be in this directory; output will be in the same location as the input BAMs.

This step isn't necessary if you are in our group; the default should work.
nextflow.config specifies profiles which give details for the submission system. It has defaults which work for our group; if you are using this elsewhere you will need to customise it.
Take a copy of the default config:

```
curl https://raw.githubusercontent.com/apduncan/bm-tk/refs/heads/main/nextflow.config > nextflow.config
```
You can either customise the `nbi_slurm` profile or copy it. If you are also using slurm, it should be enough to specify your partition names in the `queue` fields, as sketched below.
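A minimal sketch of a copied profile (the profile and partition names here are hypothetical; substitute your own):

```
profiles {
    my_slurm {
        conda.useMicromamba = true
        process {
            executor = 'slurm'
            queue = 'your-partition'
            memory = '2GB'
            cpus = 2
        }
    }
}
```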
Activate your nextflow environment

```
micromamba activate nextflow
```

Then run the pipeline

```
nextflow run apduncan/bm-tk \
    -profile nbi_slurm \
    -work-dir /path/to/scratch \
    -with-report \
    -r main \
    --bams "/glob/to/**/find*.bam"
```
Do this on a node where it is okay to start long-running jobs interactively, or put the above in a batch submission script, along the lines of the sketch below.
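A sketch of such a submission script, assuming slurm and that micromamba can be initialised in batch shells (adjust the partition and the micromamba setup to your site):

```
#!/bin/bash
#SBATCH --job-name=bm-tk
#SBATCH --partition=ei-medium
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2

# Make micromamba available in this non-interactive shell
eval "$(micromamba shell hook --shell bash)"
micromamba activate nextflow

nextflow run apduncan/bm-tk \
    -profile nbi_slurm \
    -work-dir /path/to/scratch \
    -with-report \
    -r main \
    --bams "/glob/to/**/find*.bam"
```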
The pipeline should then run and produce your BAMs with predicted methylation.
The main obstacle to running without internet access is that nextflow will not be able to create the micromamba environment. However, we can do that on a node with internet access, then provide the path to the environment.
To create the environment, run

```
curl https://raw.githubusercontent.com/apduncan/bm-tk/refs/heads/main/env.yaml > env.yaml && \
micromamba env create -n bmtk --file env.yaml
```
Find the environment path

```
> micromamba env list | grep bmtk
  bmtk    /home/user/micromamba/envs/bmtk
```
Copy that path into the `conda =` setting of the profile in nextflow.config, e.g. for the nbi_slurm profile:
```
profiles {
    conda {
        conda.enabled = true
        process.conda = "/home/kam24goz/miniforge3/envs/pbbm"
    }
    nbi_slurm {
        conda.useMicromamba = true
        process {
            conda = "/home/user/micromamba/envs/bmtk"
            executor = 'slurm'
            queue = 'ei-medium'
            memory = '2GB'
            cpus = 2
            ...
```
When you submit the nextflow pipeline it should use this environment. Be sure to also put `export NXF_OFFLINE='true'` in your submission scripts, otherwise nextflow will waste a lot of time trying to phone home for updates.
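For example, the start of an offline submission script might look like this (assuming the pipeline and environment were prepared on a node with internet access, as above):

```
export NXF_OFFLINE='true'
micromamba activate nextflow
nextflow run apduncan/bm-tk -profile nbi_slurm -r main --bams "/glob/to/**/find*.bam"
```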