
metassemble

Bioinformatics pipeline for assembly of shotgun metagenomic data using metaSPAdes and MetaQUAST. Includes mapping of each sample's reads to the assembled contigs to produce a table of read abundances.

Overview

This pipeline is written in Snakemake and is designed to automate and control the submission of jobs to the Synergy compute cluster at the University of Calgary. Developed by Alana Schick for the lab of Dr. Laura Sycuro.

Input:

  • Paired-end fastq files for each sample, quality filtered and with host sequences removed (if possible)
  • A single pair of forward and reverse fastq files containing the reads from all samples to be included in the assembly

Output:

  • A metagenomic assembly, located in: output/assembly/scaffolds.fasta
  • A detailed report of the quality of the metagenomic assembly, generated by MetaQUAST, located in: output/metaquast/report.html
  • A table of contigs and counts for each sample, located in: output/counts.txt

Pipeline summary

Example workflow (rulegraph image)

Steps

  1. Metagenome assembly using metaSPAdes. Read error correction is included by default; to disable it, set the error_corr parameter to FALSE in the config.yaml file. See the metaSPAdes paper for details. Specify the files to be used for assembly in the config.yaml file. To combine sequences from multiple samples, use the cat command:
cat /path/to/sequences/*_1.fastq > assembly_1.fastq
cat /path/to/sequences/*_2.fastq > assembly_2.fastq
  2. Evaluate the assembly using MetaQUAST. MetaQUAST uses the Silva 16S rRNA database to identify species content, obtains a set of genomes possibly represented by the assembled sequences, and then uses this set of genomes as references to assess assembly quality. See the MetaQUAST documentation for more details.

  3. Index the reference (i.e. the assembled contigs) using bowtie2 and samtools.

  4. Map each sample to the reference using bowtie2.

  5. Get the count table. Generate a summary text file containing contig_id, length, and the number of reads mapping to that contig in each sample. (A sketch of the kinds of commands steps 3-5 involve follows this list.)
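The exact commands for steps 3-5 are defined by the pipeline's Snakemake rules; the sketch below is only an illustration of typical bowtie2/samtools usage, with placeholder sample names, and its flags and paths may differ from what the pipeline actually runs.

# Illustration only: placeholder file names, not the pipeline's exact invocation.
# Step 3: index the assembled contigs
bowtie2-build output/assembly/scaffolds.fasta output/assembly/scaffolds
samtools faidx output/assembly/scaffolds.fasta

# Step 4: map one sample's reads to the contigs and sort the alignments
bowtie2 -x output/assembly/scaffolds -1 sampleA_R1.fastq -2 sampleA_R2.fastq \
    | samtools sort -o sampleA.sorted.bam
samtools index sampleA.sorted.bam

# Step 5: per-contig counts for this sample (contig, length, mapped and unmapped reads)
samtools idxstats sampleA.sorted.bam > sampleA_counts.txt

In the pipeline, the per-sample counts are combined into the single table at output/counts.txt.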

Installation

To use this pipeline, navigate to your project directory and clone this repository into that directory using the following command:

git clone https://github.com/SycuroLab/metassemble.git metassemble

Note: you need to have conda and Snakemake installed in order to run this pipeline. To install conda, see the conda installation instructions.

To install snakemake using conda, run the following line:

conda install -c bioconda -c conda-forge snakemake

See the snakemake installation webpage for further details.
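Alternatively (optional, and not specific to this pipeline), Snakemake can be installed into its own conda environment rather than your base environment:

# Optional: create and activate a dedicated environment for Snakemake
conda create -n snakemake -c bioconda -c conda-forge snakemake
conda activate snakemake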

Config file

All the parameters required to run this pipeline are specified in a config file written in YAML. Edit the provided example file, config.yaml, with your custom parameters. This is the only file that should be modified before running the pipeline. Make sure to follow the syntax in the example file with respect to when parameters need quotation marks.

Data and list of files

Specify the full path to the directory that contains your data in the config file. You also need a list of sample names, one per line, for the samples you want to run the pipeline on. You can run this pipeline on any number or subset of your samples. Sample names should include everything up to the R1/R2 (or 1/2) part of the raw fastq file names. Specify the path and name of your list in the config file.

If there are many samples, it may be convenient to generate the list by running the following command from your data directory, replacing R1_001.fastq.gz with the general suffix of your files:

ls | grep R1_001.fastq.gz | sed 's/_R1_001.fastq.gz//' > list_files.txt
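For example, given hypothetical input files sampleA_R1_001.fastq.gz, sampleA_R2_001.fastq.gz, sampleB_R1_001.fastq.gz and sampleB_R2_001.fastq.gz, the resulting list_files.txt would contain:

sampleA
sampleB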

Description of other parameters

Parameter    Description
list_files   Full path and name of your sample list.
path         Location of the input files.
forward      Fastq file containing all the forward reads to be used for assembly.
reverse      Fastq file containing all the reverse reads to be used for assembly.
error_corr   If TRUE (default), include read error correction during the metaSPAdes run.
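As an illustration only, a minimal config.yaml might look like the sketch below. The field names come from the table above, the values are placeholders, and the example file shipped with the repository is the authoritative reference for the exact syntax (including when values need quotation marks) and for any additional parameters.

# Illustrative sketch only; values are placeholders.
list_files: "/path/to/project/list_files.txt"
path: "/path/to/sequences/"
forward: "assembly_1.fastq"
reverse: "assembly_2.fastq"
error_corr: TRUE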

Running the pipeline on Synergy

Test the pipeline by running snakemake -np. This command prints out the commands to be run without actually running them.

To run the pipeline on the Synergy compute cluster, enter the following command from the project directory:

snakemake --cluster-config cluster.json --cluster 'bsub -n {cluster.n} -R {cluster.resources} -W {cluster.walllim} -We {cluster.time} -M {cluster.maxmem} -oo {cluster.output} -e {cluster.error}' --jobs 500 --use-conda

The above command submits jobs to Synergy, one for each sample and step of the pipeline. Note: the file cluster.json contains the parameters for the LSF job submission system that Synergy uses. In most cases, this file should not be modified.

Results and log files

Snakemake will create a directory for the results of the pipeline as well as a directory for log files. Log files of each step of the pipeline will be written to the logs directory.

Notes

Choosing k-mer sizes for metaSPAdes.

The maximum k-mer size depends on the read length of your library: generally, for 150 bp libraries use a maximum of 99, and for 250 bp libraries use 127.

NEED TO ADD THIS PARAMETER TO THE CONFIG FILE
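Until that parameter is exposed in config.yaml, the following is only a sketch of how a list of k-mer sizes is passed to metaSPAdes on the command line; the file names are placeholders and this is not the pipeline's exact invocation:

# Sketch only: odd k-mer sizes up to a maximum of 99, suitable for a 150 bp library
metaspades.py -1 assembly_1.fastq -2 assembly_2.fastq -k 21,33,55,77,99 -o output/assembly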

The output of MetaQUAST is extensive. The metrics of particular interest are:

  • Number of contigs
    • A smaller number means you are more likely to have larger contigs.
  • N50
    • The minimum contig length such that contigs of at least that length cover 50 percent of the assembly.
    • Similar to the median contig length, but weighted towards longer contigs.
    • Larger is generally better.
  • Total length
    • A longer total length means more base pairs were incorporated into your assembly.
