nf-core/ontvar


Introduction

nf-core/ontvar is a comprehensive structural variant (SV) calling, filtering, annotation and consensus generation pipeline for Oxford Nanopore Technologies (ONT) long-read sequencing data.

Key Features

  • Multi-caller SV detection: Sniffles, cuteSV, and Severus for comprehensive variant discovery
  • Case-control aware analysis: Support for tumor-normal paired analysis and tumor-only with panel of normals
  • Consensus calling: Sample-level caller merging with configurable support thresholds
  • Population frequency filtering: Integration with gnomAD and custom population databases
  • Comprehensive annotation: AnnotSV provides gene-based and regulatory annotations
  • Cohort-level analysis: Multi-sample variant merging and analysis
  • Interactive visualizations: Detailed QC plots and summary statistics at each stage

Workflow Overview

ontvar Workflow

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

The pipeline consists of the following major steps:

  1. SV Calling: Run Sniffles, cuteSV, and Severus callers on input samples
  2. Sample Consensus: Merge caller results per sample using Jasmine (caller support filter)
  3. Population Annotation: Add allele frequency information from gnomAD and long-read-based healthy population databases (using SVDB)
  4. Sample Filtering: Remove common variants based on population frequencies
  5. Sample Annotation: Comprehensive AnnotSV annotation of sample variants
  6. Cohort Merging: Create cohort-wide merged callset using Jasmine
  7. Cohort Filtering: Apply population frequency filters at cohort level
  8. Final Annotation: AnnotSV annotation of final cohort callset
  9. QC & Visualization: Generate summary statistics and plots at each stage

Samplesheet Format

Each row represents a sample with the following columns:

| Column | Required | Description |
| --- | --- | --- |
| group_id | Yes | Sample group identifier for pairing |
| sample_id | Yes | Unique ID for each sample |
| sample_type | Yes | String indicating if the sample is a case or control |
| input_type | Yes | Input data type: fastq or bam |
| input_path | Yes | Path to FASTQ file/directory or aligned BAM file |

Input Types:

  • FASTQ (input_type: fastq):

    • Can be a single FASTQ file (.fastq, .fq, .fastq.gz, .fq.gz)
    • Can be a directory containing multiple FASTQ files (will be concatenated)
    • Requires --reference parameter for alignment with minimap2
  • BAM (input_type: bam):

    • Must be aligned to the same reference genome specified with --reference
    • Must be coordinate-sorted and indexed (.bam.bai file should exist)
    • Skips alignment step

Example samplesheet:

group_id,sample_id,sample_type,input_type,input_path
group1,sample1,case,fastq,/path/to/fastq_dir/
group1,control1,control,bam,/path/to/control.bam
group2,sample2,case,fastq,/path/to/reads.fastq.gz

Notes:

  • Samples with the same group_id are treated as paired (e.g., tumor-normal pairs)
  • sample_type must be either case or control
  • BAM files should be coordinate-sorted and indexed (.bai file should exist)
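The samplesheet rules above can be checked before launching the pipeline. This is a minimal standalone sketch (not part of the pipeline itself; the function name is illustrative):

```python
import csv

REQUIRED_COLUMNS = ["group_id", "sample_id", "sample_type", "input_type", "input_path"]

def validate_samplesheet(path):
    """Check a samplesheet CSV against the rules above: all five columns
    present, sample_type in {case, control}, input_type in {fastq, bam},
    and unique sample_id values. Returns a list of error strings."""
    errors = []
    seen_ids = set()
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]
        for i, row in enumerate(reader, start=2):  # line 1 is the header
            if row["sample_type"] not in ("case", "control"):
                errors.append(f"line {i}: sample_type must be case or control")
            if row["input_type"] not in ("fastq", "bam"):
                errors.append(f"line {i}: input_type must be fastq or bam")
            if row["sample_id"] in seen_ids:
                errors.append(f"line {i}: duplicate sample_id {row['sample_id']}")
            seen_ids.add(row["sample_id"])
    return errors
```

Running this against the example samplesheet above should produce an empty error list.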

Now, you can run the pipeline using:

nextflow run nf-core/ontvar \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR> \
   --reference reference.fa

Required Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| --input | Path to comma-separated sample sheet file | path/to/samplesheet.csv |
| --outdir | Output directory path | path/to/outdir |
| --reference | Reference genome FASTA file | path/to/reference.fa |

AnnotSV Annotation Resources

AnnotSV requires annotation resources to function. The pipeline can automatically download these on first run, but it's recommended to download them once and reuse for subsequent runs to save time.

Automatic Download (First Run)

On the first run, AnnotSV will automatically download and install the annotation databases. This can take significant time (30+ minutes depending on internet speed and genome build).

nextflow run nf-core/ontvar \
   -profile docker \
   --input samplesheet.csv \
   --outdir results \
   --reference reference.fa
   # AnnotSV will auto-download annotations to default location

Reusing Annotation Resources (Recommended)

After the first run, AnnotSV annotations are cached in the output directory. To reuse them for subsequent runs:

nextflow run nf-core/ontvar \
   -profile docker \
   --input samplesheet.csv \
   --outdir results \
   --reference reference.fa \
   --annotsv_annotations /path/to/annotation/resources

Annotation Resource Location

Default location (after first run):

results/AnnotSV_annotations/

Custom location: You can specify a custom annotation directory with --annotsv_annotations.

Pre-downloaded Resources

If you have pre-downloaded AnnotSV resources from another source, you can point to them directly:

nextflow run nf-core/ontvar \
   -profile docker \
   --input samplesheet.csv \
   --outdir results \
   --reference reference.fa \
   --annotsv_annotations /path/to/your/AnnotSV/annotations

Supported Genome Builds

The pipeline supports the following genome builds for AnnotSV annotation:

  • hg38 (GRCh38) - Default
  • hg19 (GRCh37)
  • mm10 (GRCm38)
  • mm9 (NCBI37)

Specify the build with:

--genome_build GRCh37

Customizing Pipeline Parameters

The pipeline offers extensive customization options for each step of the analysis. All parameters can be adjusted to fit your specific needs:

SV Caller Parameters: Fine-tune settings for Sniffles, cuteSV, and Severus including minimum mapping quality, SV size thresholds, read support requirements, and more.

Consensus & Filtering Parameters: Adjust caller support thresholds (e.g., require 2 or 3 callers), population frequency cutoffs, overlap ratios for merging, and distance thresholds.

Annotation Parameters: Configure AnnotSV annotation databases, genome builds, output formats, and annotation detail levels.

Database Parameters: Specify custom SVDB population databases, panel of normals files, and AnnotSV annotation paths.

These are configurable via command-line flags or in the nextflow.config file.
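Jasmine-style merged VCFs typically record per-input support in the INFO column (commonly as SUPP and SUPP_VEC). A minimal sketch of how the caller-support consensus filter works, assuming those field names:

```python
def passes_caller_support(info_field, min_support=2):
    """Parse a VCF INFO string and keep the record only if it was
    reported by at least `min_support` callers. Assumes a Jasmine-style
    SUPP key (number of merged inputs supporting the call)."""
    info = dict(kv.split("=", 1) for kv in info_field.split(";") if "=" in kv)
    return int(info.get("SUPP", 0)) >= min_support

# e.g. a call seen by Sniffles and cuteSV but not Severus:
passes_caller_support("SVTYPE=DEL;SUPP=2;SUPP_VEC=110", min_support=2)  # True
passes_caller_support("SVTYPE=INS;SUPP=1;SUPP_VEC=001", min_support=3)  # False
```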

Chromosome Filtering

By default, the pipeline retains only the main contigs (chr1-22, chrX, chrY, chrM). This is controlled in the FILTER_CHR module.
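The main-contig filter amounts to keeping VCF records whose CHROM column is in an allow-list. A sketch of the idea (contig naming assumes a chr-prefixed reference; the FILTER_CHR module's actual implementation may differ):

```python
# Main contigs kept by default; exact prefixes depend on the reference used.
MAIN_CONTIGS = {f"chr{n}" for n in [*range(1, 23), "X", "Y", "M"]}

def keep_record(vcf_line):
    """Return True for VCF header lines and for records on main contigs."""
    if vcf_line.startswith("#"):
        return True
    return vcf_line.split("\t", 1)[0] in MAIN_CONTIGS
```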

Advanced Usage

Custom Filtering Thresholds

Adjust caller support and population frequency thresholds:

nextflow run nf-core/ontvar \
   -profile docker \
   --input samplesheet.csv \
   --outdir results \
   --reference reference.fa \
   --min_caller_support 3 \
   --max_gnomad_af 0.001 \
   --max_needlr_af 0.001

Here, --min_caller_support 3 requires a variant to be reported by all three callers, while --max_gnomad_af and --max_needlr_af set the population allele frequency cutoffs.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

Preprocessing Outputs (results/alignment/)

Alignment (alignment/)

Generated only when FASTQ inputs are provided. BAM inputs skip this step.

Files:

alignment/
├── sample1/
│   ├── sample1.bam
│   └── sample1.bam.bai
├── sample2/
│   ├── sample2.bam
│   └── sample2.bam.bai
└── ...

Notes:

  • Uses minimap2 with the lr:hq preset (the default preset used by the Dorado aligner for ONT reads)
  • BAM files are coordinate-sorted and indexed automatically
  • For mixed inputs, FASTQ samples are aligned here and BAM samples skip to case-level analysis
  • Concatenated FASTQ files (from cat_fastq) are not published by default
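Conceptually, the alignment step runs minimap2 with the lr:hq preset piped into samtools for coordinate sorting and indexing. A sketch of the command being built (output names are illustrative, not the pipeline's exact ones, and lr:hq requires a recent minimap2 release):

```python
import shlex

def minimap2_align_cmd(reference, fastq, sample, threads=8):
    """Build a shell command string for the alignment step: minimap2
    with the lr:hq preset, piped into samtools sort, then indexed.
    File naming here is an assumption for illustration."""
    bam = shlex.quote(f"{sample}.bam")
    return (
        f"minimap2 -ax lr:hq -t {threads} {shlex.quote(reference)} {shlex.quote(fastq)}"
        f" | samtools sort -@ {threads} -o {bam} -"
        f" && samtools index {bam}"
    )
```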

The pipeline generates outputs organized into case-level and cohort-level directories:

Case-Level Outputs (results/case/)

1. Raw Calls (01_raw_calls/)

  • Individual caller VCFs for each sample
  • Subdirectories: sniffles/, cutesv/, severus/
  • Summary JSON and count plots

Files:

01_raw_calls/
├── sniffles/
│   ├── SAMPLE1_sniffles.vcf.gz
│   └── SAMPLE2_sniffles.vcf.gz
├── cutesv/
│   ├── SAMPLE1_cutesv.vcf
│   └── SAMPLE2_cutesv.vcf
├── severus/
│   ├── tumor_normal/
│   │   └── SAMPLE1_tn_severus.vcf
│   └── tumor_only/
│       └── SAMPLE2_to_severus.vcf
├── raw_calls_summary.json
├── raw_callers_plot_sv_counts_stacked.png
├── raw_callers_plot_sv_counts_callers.png
└── raw_callers_plot_sv_counts.png

2. Caller Merged (02_caller_merged/)

  • Sample-level consensus VCFs (filtered by caller support)
  • AnnotSV annotations (pre-filtering)
  • Summary statistics and plots

Files:

02_caller_merged/
├── SAMPLE1.vcf
├── SAMPLE1.tsv                    # AnnotSV full annotation
├── SAMPLE1.annotated.tsv          # AnnotSV gene-level
├── caller_merged_summary.json
├── consensus_plot_sv_counts_stacked.png
└── consensus_plot_sv_counts.png

3. Caller Merged Filtered (03_caller_merged_filtered/)

  • Population frequency filtered VCFs
  • Final AnnotSV annotations
  • Summary statistics and plots

Files:

03_caller_merged_filtered/
├── SAMPLE1_filtered.vcf
├── SAMPLE1_filtered.tsv
├── SAMPLE1_filtered.annotated.tsv
├── filtered_summary.json
├── filtered_plot_sv_counts_stacked.png
└── filtered_plot_sv_counts.png

Cohort-Level Outputs (results/cohort/)

Files:

cohort/
├── cohort_annotated.vcf                    # AnnotSV (pre-AF filtering)
├── cohort_annotated.tsv                    # AnnotSV (pre-AF filtering)
├── cohort_filtered.vcf                     # AnnotSV (post-AF filtering)
├── cohort_filtered.tsv                     # AnnotSV (post-AF filtering)
├── cohort_annotated_summary.json
├── cohort_annotated_sv_counts.png
├── cohort_filtered_summary.json
└── cohort_filtered_sv_counts.png

MultiQC Report

A comprehensive HTML report combining all QC metrics:

results/multiqc/multiqc_report.html

Summary JSON Format

Each *_summary.json file contains SV counts by:

  • Sample
  • Caller
  • SV type (DEL, INS, DUP, INV, BND, etc.)

Example structure:

{
  "analysis_type": "multi_sample",
  "samples": {
    "SAMPLE1": {
      "callers": {
        "sniffles": {
          "sv_types": {
            "DEL": {"count": 1234},
            "INS": {"count": 567}
          }
        }
      },
      "combined_stats": {
        "sv_types": {
          "DEL": {"count": 1500}
        }
      }
    }
  }
}
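Totals across samples and callers can be pulled out of this structure with a short traversal (load the file with json.load first; the function name is illustrative):

```python
def total_sv_counts(summary):
    """Sum per-caller SV-type counts across all samples in a
    *_summary.json structure like the example above.
    Returns {sv_type: total_count}."""
    totals = {}
    for sample in summary["samples"].values():
        for caller in sample["callers"].values():
            for sv_type, stats in caller["sv_types"].items():
                totals[sv_type] = totals.get(sv_type, 0) + stats["count"]
    return totals
```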

Credits

nf-core/ontvar is written and maintained by Manas Sehgal.

Tools Used

This pipeline integrates the following tools:

  • minimap2 - Sequence alignment
  • Sniffles2 - SV calling from long reads
  • cuteSV - Long-read SV detection
  • Severus - Somatic SV calling
  • Jasmine - SV merging and comparison
  • AnnotSV - Structural variant annotation
  • SVDB - Structural variant population frequency annotation
  • BCFtools - VCF manipulation
  • MultiQC - Quality control reporting

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
