
metassemble

Bioinformatics pipeline for assembly of shotgun metagenomic data using metaSPAdes and MetaQUAST. Includes mapping of each sample's reads to the assembled contigs to produce a table of read abundances.

Overview

This pipeline is written in Snakemake and is designed to automate and control the submission of jobs to the Synergy compute cluster at the University of Calgary. Developed by Alana Schick for the lab of Dr. Laura Sycuro.

Input:

  • Paired-end fastq files for each sample, quality filtered and with host sequences removed (if possible)
  • A single pair of forward and reverse fastq files containing the reads from all samples to be included in the assembly

Output:

  • A metagenomic assembly, located in: output/assembly/scaffolds.fasta
  • A detailed report of the quality of the metagenomic assembly, generated by MetaQUAST, located in: output/metaquast/report.html
  • A table of contigs and counts for each sample, located in: output/counts.txt

Pipeline summary

Example workflow (rulegraph image)

Steps

  1. Metagenome assembly using metaSPAdes. Read error correction is included by default; to disable it, set the error_corr parameter to FALSE in the config.yaml file. See the metaSPAdes paper for details. Specify the files to be used for assembly in the config.yaml file. To combine sequences from multiple samples, use the cat command:
cat /path/to/sequences/*_1.fastq > assembly_1.fastq
cat /path/to/sequences/*_2.fastq > assembly_2.fastq
  2. Evaluate the assembly using MetaQUAST. MetaQUAST uses the Silva 16S rRNA database to identify species content, obtains a set of genomes possibly represented by the assembled sequences, and then uses this set of genomes as references to assess assembly quality. See the MetaQUAST documentation for more details.

  3. Index the reference (i.e. the assembled contigs) using bowtie2 and samtools.

  4. Map each sample to the reference using bowtie2.

  5. Get the count table. Generate a summary text file containing contig_id, length, and the number of reads mapping to that contig in each sample. (A sketch of the kinds of commands steps 3-5 involve follows this list.)
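The exact commands for steps 3-5 are defined by the pipeline's Snakemake rules; the sketch below is only an illustration of typical bowtie2/samtools usage, with placeholder sample names, and its flags and paths may differ from what the pipeline actually runs.

# Illustration only: placeholder file names, not the pipeline's exact invocation.
# Step 3: index the assembled contigs
bowtie2-build output/assembly/scaffolds.fasta output/assembly/scaffolds
samtools faidx output/assembly/scaffolds.fasta

# Step 4: map one sample's reads to the contigs and sort the alignments
bowtie2 -x output/assembly/scaffolds -1 sampleA_R1.fastq -2 sampleA_R2.fastq \
    | samtools sort -o sampleA.sorted.bam
samtools index sampleA.sorted.bam

# Step 5: per-contig counts for this sample (contig, length, mapped and unmapped reads)
samtools idxstats sampleA.sorted.bam > sampleA_counts.txt

In the pipeline, the per-sample counts are combined into the single table at output/counts.txt.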

Installation

To use this pipeline, navigate to your project directory and clone this repository into that directory using the following command:

git clone https://github.com/SycuroLab/metassemble.git metassemble

Note: you need to have conda and Snakemake installed in order to run this pipeline. To install conda, see the conda installation instructions.

To install snakemake using conda, run the following line:

conda install -c bioconda -c conda-forge snakemake

See the snakemake installation webpage for further details.
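Alternatively (optional, and not specific to this pipeline), Snakemake can be installed into its own conda environment rather than your base environment:

# Optional: create and activate a dedicated environment for Snakemake
conda create -n snakemake -c bioconda -c conda-forge snakemake
conda activate snakemake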

Config file

All the parameters required to run this pipeline are specified in a config file written in YAML. Edit the provided example file, config.yaml, with your custom parameters. This is the only file that should be modified before running the pipeline. Make sure to follow the syntax in the example file with respect to when parameters need quotation marks.

Data and list of files

Specify the full path to the directory that contains your data in the config file. You also need a list of sample names, one per line, for the samples you want to run the pipeline on. You can run this pipeline on any number or subset of your samples. Sample names should include everything up to the R1/R2 (or 1/2) part of the raw fastq file names. Specify the path and name of your list in the config file.

If there are many samples, it may be convenient to generate the list by running the following command from your data directory, replacing R1_001.fastq.gz with the general suffix of your files:

ls | grep R1_001.fastq.gz | sed 's/_R1_001.fastq.gz//' > list_files.txt
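For example, given hypothetical input files sampleA_R1_001.fastq.gz, sampleA_R2_001.fastq.gz, sampleB_R1_001.fastq.gz and sampleB_R2_001.fastq.gz, the resulting list_files.txt would contain:

sampleA
sampleB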

Description of other parameters

Parameter    Description
list_files   Full path and name of your sample list.
path         Location of the input files.
forward      Fastq file containing all the forward reads to be used for assembly.
reverse      Fastq file containing all the reverse reads to be used for assembly.
error_corr   If TRUE (default), include read error correction during the metaSPAdes run.
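As an illustration only, a minimal config.yaml might look like the sketch below. The field names come from the table above, the values are placeholders, and the example file shipped with the repository is the authoritative reference for the exact syntax (including when values need quotation marks) and for any additional parameters.

# Illustrative sketch only; values are placeholders.
list_files: "/path/to/project/list_files.txt"
path: "/path/to/sequences/"
forward: "assembly_1.fastq"
reverse: "assembly_2.fastq"
error_corr: TRUE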

Running the pipeline on Synergy

Test the pipeline by running snakemake -np. This command prints out the commands to be run without actually running them.

To run the pipeline on the Synergy compute cluster, enter the following command from the project directory:

snakemake --cluster-config cluster.json --cluster 'bsub -n {cluster.n} -R {cluster.resources} -W {cluster.walllim} -We {cluster.time} -M {cluster.maxmem} -oo {cluster.output} -e {cluster.error}' --jobs 500 --use-conda

The above command submits jobs to Synergy, one for each sample and step of the pipeline. Note: the file cluster.json contains the parameters for the LSF job submission system that Synergy uses. In most cases, this file should not be modified.

Results and log files

Snakemake will create a directory for the results of the pipeline as well as a directory for log files. Log files of each step of the pipeline will be written to the logs directory.

Notes

Choosing k-mer sizes for metaSPAdes.

The maximum k-mer size depends on the read length of your library: generally, for 150 bp libraries use a maximum of 99, and for 250 bp libraries use 127.

NEED TO ADD THIS PARAMETER TO THE CONFIG FILE
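Until that parameter is exposed in config.yaml, the following is only a sketch of how a list of k-mer sizes is passed to metaSPAdes on the command line; the file names are placeholders and this is not the pipeline's exact invocation:

# Sketch only: odd k-mer sizes up to a maximum of 99, suitable for a 150 bp library
metaspades.py -1 assembly_1.fastq -2 assembly_2.fastq -k 21,33,55,77,99 -o output/assembly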

The output of MetaQUAST is extensive. The metrics of particular interest are:

  • Number of contigs
    • A smaller number means you are more likely to have larger contigs.
  • N50
    • The minimum contig length such that contigs of at least that length cover 50 percent of the assembly.
    • Similar to the median contig length, but weighted towards longer contigs.
    • Larger is generally better.
  • Total length
    • A longer total length means more base pairs were incorporated into your assembly.
