nf-core/ontvar is a comprehensive structural variant (SV) calling, filtering, annotation and consensus generation pipeline for Oxford Nanopore Technologies (ONT) long-read sequencing data.
- Multi-caller SV detection: Sniffles, cuteSV, and Severus for comprehensive variant discovery
- Case-control aware analysis: Support for tumor-normal paired analysis and tumor-only with panel of normals
- Consensus calling: Sample-level caller merging with configurable support thresholds
- Population frequency filtering: Integration with gnomAD and custom population databases
- Comprehensive annotation: AnnotSV provides gene-based and regulatory annotations
- Cohort-level analysis: Multi-sample variant merging and analysis
- Interactive visualizations: Detailed QC plots and summary statistics at each stage
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with `-profile test` before running the workflow on actual data.
The pipeline consists of the following major steps:
- SV Calling: Run Sniffles, cuteSV, and Severus callers on input samples
- Sample Consensus: Merge caller results per sample using Jasmine (caller support filter)
- Population Annotation: Add allele frequency information from gnomAD and from healthy-population databases based on long-read sequencing (using SVDB)
- Sample Filtering: Remove common variants based on population frequencies
- Sample Annotation: Comprehensive AnnotSV annotation of sample variants
- Cohort Merging: Create cohort-wide merged callset using Jasmine
- Cohort Filtering: Apply population frequency filters at cohort level
- Final Annotation: AnnotSV annotation of final cohort callset
- QC & Visualization: Generate summary statistics and plots at each stage
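
The caller-support and population-frequency steps listed above can be pictured as two successive filters. The following is a minimal sketch on toy records, not the pipeline's implementation (which relies on Jasmine and SVDB). `MergedSV`, `passes_filters` and `filter_svs` are hypothetical names, the default thresholds are placeholders, and the inclusive comparisons are an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class MergedSV:
    """Toy stand-in for one merged SV record (not a real VCF parser)."""
    sv_id: str
    callers: set = field(default_factory=set)  # callers that reported this SV
    gnomad_af: float = 0.0                     # population allele frequency (0.0 if absent)


def passes_filters(sv, min_caller_support=2, max_gnomad_af=0.01):
    """Keep an SV seen by enough callers and rare enough in the population.

    Assumption: both comparisons are inclusive; the pipeline may differ.
    """
    enough_support = len(sv.callers) >= min_caller_support
    rare_enough = sv.gnomad_af <= max_gnomad_af
    return enough_support and rare_enough


def filter_svs(svs, **thresholds):
    """Return the subset of records that pass both filters."""
    return [sv for sv in svs if passes_filters(sv, **thresholds)]


calls = [
    MergedSV("sv1", {"sniffles", "cutesv", "severus"}, 0.0),
    MergedSV("sv2", {"sniffles"}, 0.0),             # reported by a single caller
    MergedSV("sv3", {"sniffles", "cutesv"}, 0.25),  # common in the population
]
kept = filter_svs(calls, min_caller_support=2, max_gnomad_af=0.001)
print([sv.sv_id for sv in kept])
```

Raising `min_caller_support` trades sensitivity for precision, while lowering the frequency cutoff removes more variants that are common in the reference populations.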
Each row represents a sample with the following columns:
| Column | Required | Description |
|---|---|---|
| `group_id` | Yes | Sample group identifier for pairing |
| `sample_id` | Yes | Unique ID for each sample |
| `sample_type` | Yes | String indicating if sample is a case or control |
| `input_type` | Yes | Input data type: fastq or bam |
| `input_path` | Yes | Path to FASTQ file/directory or aligned BAM file |
Input Types:
- FASTQ (`input_type: fastq`):
  - Can be a single FASTQ file (`.fastq`, `.fq`, `.fastq.gz`, `.fq.gz`)
  - Can be a directory containing multiple FASTQ files (will be concatenated)
  - Requires `--reference` parameter for alignment with minimap2
- BAM (`input_type: bam`):
  - Must be aligned to the same reference genome specified with `--reference`
  - Must be coordinate-sorted and indexed (`.bam.bai` file should exist)
  - Skips alignment step
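
The input rules above can be checked before launching a run. This is a minimal sketch that only mirrors the rules stated in this section. `collect_fastqs` and `bam_index_exists` are hypothetical helper names and are not part of the pipeline.

```python
from pathlib import Path

# Extensions listed in this section for FASTQ inputs.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def is_fastq(path):
    """True if the file name ends with one of the accepted FASTQ extensions."""
    return path.name.endswith(FASTQ_SUFFIXES)


def collect_fastqs(input_path):
    """Return the FASTQ files for a single file or a directory, sorted by path."""
    path = Path(input_path)
    if path.is_dir():
        files = sorted(p for p in path.iterdir() if p.is_file() and is_fastq(p))
    elif path.is_file() and is_fastq(path):
        files = [path]
    else:
        files = []
    if not files:
        raise ValueError(f"no FASTQ files found at {input_path}")
    return files


def bam_index_exists(bam_path):
    """Check for the '<name>.bam.bai' index next to a BAM file."""
    return Path(str(bam_path) + ".bai").is_file()
```

For example, `collect_fastqs("/path/to/fastq_dir/")` would list the files that get concatenated for a directory input, and `bam_index_exists("/path/to/control.bam")` reports whether the expected index is present.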
Example samplesheet:
group_id,sample_id,sample_type,input_type,input_path
group1,sample1,case,fastq,/path/to/fastq_dir/
group1,control1,control,bam,/path/to/control.bam
group2,sample2,case,fastq,/path/to/reads.fastq.gz
Notes:
- Samples with the same `group_id` are treated as paired (e.g., tumor-normal pairs)
- `sample_type` must be either `case` or `control`
- BAM files should be coordinate-sorted and indexed (.bai file should exist)
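
The samplesheet rules can be illustrated with a small validator that also groups samples by `group_id`. This is a sketch under stated assumptions, not the pipeline's own schema validation. `read_samplesheet` and `group_samples` are hypothetical names, and the uniqueness check on `sample_id` follows from the column description above.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = ["group_id", "sample_id", "sample_type", "input_type", "input_path"]
SAMPLE_TYPES = {"case", "control"}
INPUT_TYPES = {"fastq", "bam"}


def read_samplesheet(handle):
    """Parse and validate samplesheet rows; raise ValueError on the first problem."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows, seen_ids = [], set()
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        if row["sample_type"] not in SAMPLE_TYPES:
            raise ValueError(f"line {line_no}: sample_type must be case or control")
        if row["input_type"] not in INPUT_TYPES:
            raise ValueError(f"line {line_no}: input_type must be fastq or bam")
        if row["sample_id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate sample_id {row['sample_id']}")
        seen_ids.add(row["sample_id"])
        rows.append(row)
    return rows


def group_samples(rows):
    """Map each group_id to its case and control sample IDs."""
    groups = defaultdict(lambda: {"case": [], "control": []})
    for row in rows:
        groups[row["group_id"]][row["sample_type"]].append(row["sample_id"])
    return dict(groups)


SHEET = """group_id,sample_id,sample_type,input_type,input_path
group1,sample1,case,fastq,/path/to/fastq_dir/
group1,control1,control,bam,/path/to/control.bam
group2,sample2,case,fastq,/path/to/reads.fastq.gz
"""
groups = group_samples(read_samplesheet(io.StringIO(SHEET)))
for group_id, members in groups.items():
    mode = "paired" if members["control"] else "case-only"
    print(group_id, mode, members)
```

With the example samplesheet, `group1` contains both a case and a control, while `group2` contains only a case sample.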
Now, you can run the pipeline using:
nextflow run nf-core/ontvar \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR> \
--reference reference.fa

| Parameter | Description | Example |
|---|---|---|
| `--input` | Path to comma-separated sample sheet file | `path/to/samplesheet.csv` |
| `--outdir` | Output directory path | `path/to/outdir` |
| `--reference` | Reference genome FASTA file | `path/to/reference.fa` |
AnnotSV requires annotation resources to function. The pipeline can automatically download these on first run, but it's recommended to download them once and reuse for subsequent runs to save time.
On the first run, AnnotSV will automatically download and install the annotation databases. This can take significant time (30+ minutes depending on internet speed and genome build).
nextflow run nf-core/ontvar \
-profile docker \
--input samplesheet.csv \
--outdir results \
--reference reference.fa
# AnnotSV will auto-download annotations to default location

After the first run, AnnotSV annotations are cached in the output directory. To reuse them for subsequent runs:
nextflow run nf-core/ontvar \
-profile docker \
--input samplesheet.csv \
--outdir results \
--reference reference.fa \
--annotsv_annotations /path/to/annotation/resources

Default location (after first run):
results/AnnotSV_annotations/
Custom location:
You can specify a custom annotation directory with --annotsv_annotations.
If you have pre-downloaded AnnotSV resources from another source, you can point to them directly:
nextflow run nf-core/ontvar \
-profile docker \
--input samplesheet.csv \
--outdir results \
--reference reference.fa \
--annotsv_annotations /path/to/your/AnnotSV/annotations

The pipeline supports the following genome builds for AnnotSV annotation:
- `hg38` (GRCh38) - Default
- `hg37` (GRCh37)
- `mm10` (GRCm38)
- `mm9` (GRCm37)
Specify the build with:
--genome_build GRCh37

The pipeline offers extensive customization options for each step of the analysis. All parameters can be adjusted to fit your specific needs:
SV Caller Parameters: Fine-tune settings for Sniffles, cuteSV, and Severus including minimum mapping quality, SV size thresholds, read support requirements, and more.
Consensus & Filtering Parameters: Adjust caller support thresholds (e.g., require 2 or 3 callers), population frequency cutoffs, overlap ratios for merging, and distance thresholds.
Annotation Parameters: Configure AnnotSV annotation databases, genome builds, output formats, and annotation detail levels.
Database Parameters: Specify custom SVDB population databases, panel of normals files, and AnnotSV annotation paths.
These are configurable via command-line flags or in the nextflow.config file.
Chromosome Filtering
By default, the pipeline retains only the main contigs (chromosomes 1-22, X, Y, and M). This is controlled in the FILTER_CHR module.
Adjust caller support and population frequency thresholds:
# --min_caller_support 3: require support from all 3 callers
# --max_gnomad_af / --max_needlr_af: population frequency cutoffs
nextflow run nf-core/ontvar \
  -profile docker \
  --input samplesheet.csv \
  --outdir results \
  --reference reference.fa \
  --min_caller_support 3 \
  --max_gnomad_af 0.001 \
  --max_needlr_af 0.001

To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.
Generated only when FASTQ inputs are provided. BAM inputs skip this step.
Files:
alignment/
├── sample1/
│ ├── sample1.bam
│ └── sample1.bam.bai
├── sample2/
│ ├── sample2.bam
│ └── sample2.bam.bai
└── ...
Notes:
- Uses minimap2 with preset `lr:hq` (default preset used by the Dorado aligner for ONT reads)
- BAM files are coordinate-sorted and indexed automatically
- For mixed inputs, FASTQ samples are aligned here and BAM samples skip to case-level analysis
- Concatenated FASTQ files (from `cat_fastq`) are not published by default
The pipeline generates outputs organized into case-level and cohort-level directories:
- Individual caller VCFs for each sample
- Subdirectories: `sniffles/`, `cutesv/`, `severus/`
- Summary JSON and count plots
Files:
01_raw_calls/
├── sniffles/
│ ├── SAMPLE1_sniffles.vcf.gz
│ └── SAMPLE2_sniffles.vcf.gz
├── cutesv/
│ ├── SAMPLE1_cutesv.vcf
│ └── SAMPLE2_cutesv.vcf
├── severus/
│ ├── tumor_normal/
│ │ └── SAMPLE1_tn_severus.vcf
│ └── tumor_only/
│ └── SAMPLE2_to_severus.vcf
├── raw_calls_summary.json
├── raw_callers_plot_sv_counts_stacked.png
├── raw_callers_plot_sv_counts_callers.png
└── raw_callers_plot_sv_counts.png
- Sample-level consensus VCFs (filtered by caller support)
- AnnotSV annotations (pre-filtering)
- Summary statistics and plots
Files:
02_caller_merged/
├── SAMPLE1.vcf
├── SAMPLE1.tsv # AnnotSV full annotation
├── SAMPLE1.annotated.tsv # AnnotSV gene-level
├── caller_merged_summary.json
├── consensus_plot_sv_counts_stacked.png
└── consensus_plot_sv_counts.png
- Population frequency filtered VCFs
- Final AnnotSV annotations
- Summary statistics and plots
Files:
03_caller_merged_filtered/
├── SAMPLE1_filtered.vcf
├── SAMPLE1_filtered.tsv
├── SAMPLE1_filtered.annotated.tsv
├── filtered_summary.json
├── filtered_plot_sv_counts_stacked.png
└── filtered_plot_sv_counts.png
Files:
cohort/
├── cohort_annotated.vcf # AnnotSV (pre-AF filtering)
├── cohort_annotated.tsv # AnnotSV (pre-AF filtering)
├── cohort_filtered.vcf # AnnotSV (post-AF filtering)
├── cohort_filtered.tsv # AnnotSV (post-AF filtering)
├── cohort_annotated_summary.json
├── cohort_annotated_sv_counts.png
├── cohort_filtered_summary.json
└── cohort_filtered_sv_counts.png
A comprehensive HTML report combining all QC metrics:
results/multiqc/multiqc_report.html
Each *_summary.json file contains SV counts by:
- Sample
- Caller
- SV type (DEL, INS, DUP, INV, BND, etc.)
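
As a rough illustration of how these counts could be read back, the sketch below tallies SV types per caller. It assumes the nested layout of the example structure in this section (`samples`, then `callers`, then `sv_types`, each with a `count`) and ignores `combined_stats`. The function names are hypothetical, and real summary files may carry additional fields.

```python
import json
from collections import Counter


def counts_by_caller(summary):
    """Return {sample: {caller: Counter(sv_type -> count)}} from a parsed summary."""
    result = {}
    for sample, data in summary.get("samples", {}).items():
        per_caller = {}
        for caller, stats in data.get("callers", {}).items():
            sv_types = stats.get("sv_types", {})
            per_caller[caller] = Counter(
                {sv_type: entry["count"] for sv_type, entry in sv_types.items()}
            )
        result[sample] = per_caller
    return result


def total_by_type(summary):
    """Sum per-caller counts over all samples and callers, keyed by SV type."""
    totals = Counter()
    for per_caller in counts_by_caller(summary).values():
        for counter in per_caller.values():
            totals.update(counter)
    return totals


example = json.loads("""
{
  "analysis_type": "multi_sample",
  "samples": {
    "SAMPLE1": {
      "callers": {
        "sniffles": {"sv_types": {"DEL": {"count": 1234}, "INS": {"count": 567}}}
      }
    }
  }
}
""")
print(total_by_type(example))
```

For a file on disk, `json.load(open(path))` would replace the inline string. Note that summing across callers counts the same SV once per caller that reported it.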
Example structure:
{
"analysis_type": "multi_sample",
"samples": {
"SAMPLE1": {
"callers": {
"sniffles": {
"sv_types": {
"DEL": {"count": 1234},
"INS": {"count": 567}
}
}
},
"combined_stats": {
"sv_types": {
"DEL": {"count": 1500}
}
}
}
}
}

nf-core/ontvar is written and maintained by Manas Sehgal.
This pipeline integrates the following tools:
- minimap2 - Sequence alignment
- Sniffles2 - SV calling from long reads
- cuteSV - Long-read SV detection
- Severus - Somatic SV calling
- Jasmine - SV merging and comparison
- AnnotSV - Structural variant annotation
- SVDB - Structural variant population frequency annotation
- BCFtools - VCF manipulation
- MultiQC - Quality control reporting
If you would like to contribute to this pipeline, please see the contributing guidelines.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.