PyFFAME (Fast and Flexible Allelic MPRA Designer)

A tool for extremely quick and customizable design of multi-allelic MPRAs (massively parallel reporter assays).

You can find PyFFAME's documentation here (WIP)

Dependencies

We suggest running Python >= 3.7.

Python packages

pandas >= 0.24.2
pyfaidx >= 0.5.5.2

Also see the requirements.txt file.

Further required data

A reference genome in FASTA format. For example, the GRCh37 (hg19) reference genome can be found here [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz]. Running the pipeline for the first time generates an index of this .fa file (i.e. a .fai file). This increases running time but only has to be done once.

A list of enzymes and (expanded) restriction sites. The file on which the list provided in this repository is based can be found here [ftp://ftp.neb.com/pub/rebase/itype2.txt]. These enzymes and restriction sites are based on up-to-date REBASE data:

Richard J Roberts, Tamas Vincze, Janos Posfai, Dana Macelis, REBASE--a database for DNA restriction and modification: enzymes, genes and genomes, Nucleic Acids Res, Volume 43, 05 November 2014, Pages D298-D299, https://doi.org/10.1093/nar/gku1046

Installation

Download or clone this repository.

If you wish to set up your own database containing rsID variant info, see the alternative pipeline in the 'db_version' folder of this repository.

Usage

Edit the config file (config.py in this repository) to customize your design.

Alternatively, run the Shiny app (app.R) for an interactive config file creator. This app will ask you for all necessary information and create a PyFFAME-compatible config file.

The following options are available:

Parameter	Description	Value
in_vcf	Path to input file in vcf format (columns: CHROM, POS, ID, REF, ALT)	String (file path) or None
in_sequence	Path to input file containing additional sequences (columns: ID, FEATURE_SEQ)	String (file path) or None
in_barcode	Path to input file (without header) containing sufficient barcodes for the planned design	String (file path)
in_barcode_type	File type of barcode input file (json format or plain text with one barcode per line)	'json' or 'txt'
db_genome	Path to reference genome file	String
de_order	Design order, given as a string containing the letters 'a' to 'e'; see below for a more in-depth explanation	String
de_seq_1	First added sequence	String
de_seq_2	Second added sequence	String
de_seq_3	Third added sequence	String
set_feature_size	Nucleotides to add to each side of the variant (e.g., 85 will yield a total feature size of 171 nt)	Integer
set_all_features	Create all features (ref/alt) or only those for the reference alleles?	1: all; 0: ref only
set_indel_max_length	Maximum length of indels to include (indels exceeding this size will be removed)	Integer
set_indel_features	Only create full-length indel features or create additional features accounting for the size difference? See below for a more in-depth explanation	0: additional; 1: only full-length
set_barcodes_per_feature	Number of barcodes to use per feature	Integer
set_rev_comp	Create reverse complementary versions of all features?	1: yes; 0: no
enz_file_processed	Path to processed enzyme file (columns: enzymes, site, expanded_sites)	String (file path) or None
enz_file	Path to REBASE enzyme file (ignored if processed file used)	String (file path)
enz_used	Enzymes used, comma-delimited (e.g. 'EcoRI, SbfI')	String or None
enz_sites	Expanded cut sites, python dictionary (e.g., ['GAATTC', 'CCTGCAGG'])	Dictionary as string
enz_cumul_cuts	Total restriction sites expected in final feature	Integer
enz_cumul_cuts_bc	Total restriction sites expected in the barcode plus the sequences right before and after it (used to screen the barcodes for excess restriction sites)	Integer
out_format	Output format, json or tab-separated	'json' or 'tsv'
out_output	Path to designed output	String (file path)

If no VCF info is supplied (only additional sequences), no genome is necessary.
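As an illustration of the parameters listed above, a config file might look roughly like the following. This is only a sketch: the parameter names come from the table, but every path, sequence, and number is a placeholder chosen for this example, and the exact string format expected for enz_sites is assumed from the table's own example.

```python
# Sketch of a PyFFAME config file. All values are placeholders for illustration.

# Input files (paths are hypothetical)
in_vcf = 'my_variants.vcf'        # or None if only additional sequences are used
in_sequence = None                # optional file with columns ID, FEATURE_SEQ
in_barcode = 'my_barcodes.json'   # must contain enough barcodes for the design
in_barcode_type = 'json'          # 'json' or 'txt'
db_genome = 'hs37d5.fa'           # reference genome in FASTA format

# Design
de_order = 'abcde'                # see the design order section
de_seq_1 = 'ACGTACGTACGTACGT'     # placeholder sequence
de_seq_2 = 'TGCATGCATGCATGCA'     # placeholder sequence
de_seq_3 = 'GATCGATCGATCGATC'     # placeholder sequence

# Settings
set_feature_size = 85             # 85 nt per side gives 171 nt in total
set_all_features = 1              # 1: all (ref/alt); 0: ref only
set_indel_max_length = 10         # placeholder limit
set_indel_features = 0            # 0: additional; 1: only full-length
set_barcodes_per_feature = 10     # placeholder count
set_rev_comp = 0                  # 1: yes; 0: no

# Enzymes
enz_file_processed = None         # or path to a processed enzyme file
enz_file = 'itype2.txt'           # REBASE file, ignored if a processed file is used
enz_used = 'EcoRI, SbfI'
enz_sites = "['GAATTC', 'CCTGCAGG']"  # format assumed from the table's example
enz_cumul_cuts = 2                # placeholder
enz_cumul_cuts_bc = 0             # placeholder

# Output
out_format = 'tsv'                # 'json' or 'tsv'
out_output = 'designed_features.tsv'
```

With set_feature_size = 85, the variant position plus 85 nt on each side gives the 171 nt total mentioned in the table.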

Note for single nucleotide variants: the pipeline checks whether the reference allele of each variant is concordant with the base observed at this position in the genome. If not, the reverse complementary base is checked for concordance. If this reverse complementary base is identical to the base at the genomic position, the reference and alternative alleles are both used in their reverse complemented form. Otherwise, the original orientation of the alleles is preserved.
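The concordance check described in this note can be sketched as follows. This is not PyFFAME's actual code; the function names are made up for this example, and uppercase single-base alleles are assumed.

```python
# Illustrative sketch of the SNV orientation check (hypothetical helper names).
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return ''.join(COMPLEMENT[base] for base in reversed(seq))


def orient_snv(ref, alt, genome_base):
    """Return (ref, alt) in the orientation that matches the genome, if any."""
    if ref == genome_base:
        # Reference allele already agrees with the genome.
        return ref, alt
    if reverse_complement(ref) == genome_base:
        # The reverse complement agrees: flip both alleles.
        return reverse_complement(ref), reverse_complement(alt)
    # Neither orientation matches: keep the alleles as given.
    return ref, alt


print(orient_snv('A', 'G', 'A'))  # concordant
print(orient_snv('A', 'G', 'T'))  # concordant after reverse complementing
print(orient_snv('A', 'G', 'C'))  # not concordant in either orientation
```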

Further information regarding the design order (parameter de_order):

This parameter facilitates flexible combinations of the designed feature. As mentioned above, the parameter expects a string of length one to five, containing the letters a through e. Each letter corresponds to a certain subsequence of the design as per the following:

a: first added sequence (parameter de_seq_1)
b: created feature (i.e. the genomic sequence based on the input)
c: second added sequence (parameter de_seq_2)
d: barcode (as provided in the in_barcode file parameter)
e: third added sequence (parameter de_seq_3)

An example (and the default) would be 'abcde', resulting in the following final feature: (first added sequence)-(feature)-(second added sequence)-(barcode)-(third added sequence). If a subsequence is not required for a design, the corresponding letter can be omitted.
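To make the letter mapping concrete, here is a minimal sketch of how a de_order string could be turned into a final sequence. The function name is hypothetical, and it is assumed that the subsequences are simply concatenated in the order in which the letters appear.

```python
# Illustrative sketch of assembling a design from de_order (hypothetical helper).
def assemble_feature(de_order, feature, barcode,
                     de_seq_1='', de_seq_2='', de_seq_3=''):
    """Concatenate the subsequences named by the letters in de_order."""
    parts = {
        'a': de_seq_1,  # first added sequence
        'b': feature,   # created genomic feature
        'c': de_seq_2,  # second added sequence
        'd': barcode,   # barcode
        'e': de_seq_3,  # third added sequence
    }
    unknown = set(de_order) - set(parts)
    if unknown:
        raise ValueError('unknown letters in de_order: %s' % sorted(unknown))
    return ''.join(parts[letter] for letter in de_order)


full = assemble_feature('abcde', feature='GGGG', barcode='TTTT',
                        de_seq_1='AA', de_seq_2='CC', de_seq_3='AC')
print(full)  # AAGGGGCCTTTTAC

# Omitting letters drops the corresponding subsequences.
print(assemble_feature('bd', feature='GGGG', barcode='TTTT'))  # GGGGTTTT
```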

Further information regarding features created for insertions and deletions (parameter set_indel_features):

When creating features for insertions and deletions, the resulting genomic sequence for either the reference or the alternative allele will be of a different length than specified in the config file. For example, in the case of a 3-bp deletion, the reference allele sequence might have a length of 171 bp but the resulting alternative allele sequence will have a length of 168 bp. To remedy this, the default behavior of PyFFAME is to create the reference allele sequence, the alternative allele sequence of length 168, and a second alternative allele sequence of length 171 (with padding obtained from the genomic context). Setting this parameter to 1 will only create full-length features, i.e. the first of the two alternative allele sequences (the shorter one) will be discarded. When all possible features are created, 'SHORT' / 'LONG' is added to the unique feature ID to indicate this.
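The short and long variants can be pictured with a toy deletion. The sketch below is not PyFFAME's implementation: the function name is hypothetical, positions are 0-based, only deletions are covered, and the padding is assumed to be taken from the genome to the right of the window (the tool may pad differently).

```python
# Illustrative sketch of SHORT/LONG features for a deletion (toy example).
def deletion_features(genome, pos, ref, alt, flank):
    """Return (ref_seq, alt_short, alt_long) for a deletion at 0-based pos."""
    if genome[pos:pos + len(ref)] != ref:
        raise ValueError('reference allele does not match the genome')
    if not len(alt) < len(ref) <= flank + 1:
        raise ValueError('sketch only covers deletions that fit in the window')
    start, end = pos - flank, pos + flank + 1  # window of 2 * flank + 1 nt
    missing = len(ref) - len(alt)
    if start < 0 or end + missing > len(genome):
        raise ValueError('not enough genomic context')
    ref_seq = genome[start:end]
    # Swap the reference allele for the alternative allele inside the window.
    alt_short = genome[start:pos] + alt + genome[pos + len(ref):end]
    # Pad with the next genomic bases to restore the full length (assumption).
    alt_long = alt_short + genome[end:end + missing]
    return ref_seq, alt_short, alt_long


genome = 'AAAAACCCCCGTCATTTTTGGGGGAAAAA'
ref_seq, alt_short, alt_long = deletion_features(genome, 10, 'GTCA', 'G', 5)
print(ref_seq, len(ref_seq))      # CCCCCGTCATT 11
print(alt_short, len(alt_short))  # CCCCCGTT 8
print(alt_long, len(alt_long))    # CCCCCGTTTTT 11
```

In this toy case the 3-nt deletion shortens the 11-nt window to 8 nt, and the long version is restored to 11 nt, mirroring the 171 / 168 / 171 example above.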

Running the pipeline

Call the script using python3 main.py config.py. Note that the config file has to be located in the directory from which you run the command and requires a .py file extension. The provided example files can be modified according to your design.

The final designed features each contain a unique ID, created in the following way:

[rsID] + [REF / ALT] + [allele number for variants with more than 1 alternative allele] + [SHORT / LONG] + [consecutive number]
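As a rough sketch of this ID scheme: the function below joins the components in the order listed. The function name, the underscore separator, and the choice to skip components that do not apply are assumptions made for this example; the actual IDs produced by PyFFAME may be formatted differently.

```python
# Illustrative sketch of building a feature ID (separator is an assumption).
def build_feature_id(rsid, allele, counter,
                     allele_number=None, size_tag=None, sep='_'):
    """Join the ID components, skipping those that do not apply."""
    parts = [rsid, allele]
    if allele_number is not None:   # only for multi-allelic variants
        parts.append(str(allele_number))
    if size_tag is not None:        # 'SHORT' or 'LONG' for indel features
        parts.append(size_tag)
    parts.append(str(counter))      # consecutive number
    return sep.join(parts)


print(build_feature_id('rs123', 'REF', 3))
print(build_feature_id('rs123', 'ALT', 1, allele_number=2, size_tag='SHORT'))
```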

Created output:

File containing the designed features; file containing features which were removed due to additional enzyme restriction sites; log file.
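The restriction site screen behind the second output file can be thought of as counting site occurrences in each final feature and comparing the total to enz_cumul_cuts. The sketch below uses hypothetical function names, counts overlapping matches on the given strand only, and assumes a feature is removed when it contains more sites than expected.

```python
# Illustrative sketch of screening a feature for restriction sites.
def count_site(sequence, site):
    """Count occurrences of site in sequence, including overlapping ones."""
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count


def total_cuts(sequence, sites):
    """Sum the occurrences of all expanded restriction sites."""
    return sum(count_site(sequence, site) for site in sites)


def keep_feature(sequence, sites, expected_cuts):
    """Keep a feature unless it has more sites than expected (assumption)."""
    return total_cuts(sequence, sites) <= expected_cuts


sites = ['GAATTC', 'CCTGCAGG']  # EcoRI and SbfI, as in the table above
print(keep_feature('AAGAATTCAACCTGCAGGAA', sites, 2))   # True
print(keep_feature('GAATTCGAATTCCCTGCAGG', sites, 2))   # False
```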

Example

Example files (rsID input (for the DB-version), VCF input, additional sequences input) are contained in the example_files folder. The config.py contained in this repository can be used with these files.

Barcodes

Barcodes can be supplied by you. Alternatively, you can use the companion tool PyBarcodes (https://github.com/DNAbased/PyBarcodes) to generate barcodes based on custom parameters.

The following sets of barcodes are already provided in json format:

Length [nt]	Number of barcodes
6	36
7	127
8	351
9	1,215
10	4,217
11	10,807
12	38,718
13	99,853
14	370,911
15	954,072

The following criteria were applied to obtain these barcodes:

Removed barcodes with repeats of more than two identical nucleotides
Removed barcodes without at least one of each nucleotide
Lower GC threshold: 40 %
Upper GC threshold: 60 %
Removed barcodes containing ACA/CAC/GTG/TGT
Minimum Levenshtein distance between barcodes of 3
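The criteria above can be expressed as a small filter. This sketch is not the code used to generate the provided barcode sets (see PyBarcodes for that); the function names are hypothetical, and the GC thresholds are assumed to be inclusive.

```python
# Illustrative sketch of the barcode criteria (hypothetical helper names).
import re

FORBIDDEN_MOTIFS = ('ACA', 'CAC', 'GTG', 'TGT')


def passes_filters(barcode, gc_min=0.40, gc_max=0.60):
    """Check a single barcode against the per-barcode criteria."""
    if re.search(r'(.)\1\1', barcode):   # three or more identical bases in a row
        return False
    if set('ACGT') - set(barcode):       # at least one of each nucleotide
        return False
    gc = (barcode.count('G') + barcode.count('C')) / len(barcode)
    if not gc_min <= gc <= gc_max:       # GC content within 40 % to 60 %
        return False
    return not any(motif in barcode for motif in FORBIDDEN_MOTIFS)


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + (char_a != char_b))) # substitution
        prev = curr
    return prev[-1]


print(passes_filters('ACGTAGCT'))     # True
print(passes_filters('AAACGTCG'))     # False (run of three A)
print(levenshtein('ACGT', 'AGGT'))    # 1
```

A pair of barcodes would then be compatible if levenshtein(a, b) >= 3.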

Benchmarking

We benchmarked PyFFAME and compared it to two other tools for the design of MPRAs, MPRAnator and MPRA Design Tools:

Ilias Georgakopoulos-Soares, Naman Jain, Jesse M Gray, Martin Hemberg, MPRAnator: a web-based tool for the design of massively parallel reporter assay experiments, Bioinformatics, Volume 33, Issue 1, 01 January 2017, Pages 137–138, https://doi.org/10.1093/bioinformatics/btw584

Andrew R Ghazi, Edward S Chen, David M Henke, Namrata Madan, Leonard C Edelstein, Chad A Shaw, Design tools for MPRA experiments, Bioinformatics, Volume 34, Issue 15, 01 August 2018, Pages 2682–2683, https://doi.org/10.1093/bioinformatics/bty150

The relevant information can be found in the associated repository (https://github.com/DNAbased/PyFFAME_benchmark).

Testing (WIP)

python3.7 -m unittest
