Chlamy-Cheminformatics

Implementation of the ECFP algorithm for use in a chemical genomics project in the Jonikas Lab, Stanford Dept. of Plant Biology.

Background

ECFP

The ECFP algorithm was first published by David Rogers and Mathew Hahn as a means of generating numerical fingerprints from the graphical structures of molecules. Since the numerical features of a fingerprint are directly encoded by structural elements of the molecule that fingerprint comes from, cheminformatic algorithms can use them to approximate the structural similarity of two molecules without wasting time on computing subgraph isomorphisms (who wouldn't want to avoid solving millions of NP-complete problems?) A description of the algorithm can be found at http://pubs.acs.org/doi/abs/10.1021/ci100050t.

Why Fingerprinting?

When I worked at the Jonikas lab during the summer of 2016, I was helping out with a chemical genomics project in the green alga Chlamydomonas reinhardtii (aka Chlamy). Chlamy has several metabolic pathways that are of great interest to plant biologists--including a particularly efficient CO2-concentrating mechanism that could dramatically increase crop yield if it were engineered into higher plants--and the lab wanted to characterize these pathways by screening a library of single-gene-knockout Chlamy mutants with a series of drugs and analyzing how different groups of mutants responded to different families of molecules.

As a precursor to this project, I helped screen the effects of over 1000 small molecules on the growth of wild-type Chlamy colonies. These experiments were intended to help us figure out how much of each drug we would need to use in order to get a response out of the algae in the larger mutant screen, but the data also presented us with an opportunity to try and identify particular chemical substructures that were common amongst growth-inhibiting molecules--which would give us hints about how some of these drugs were actually retarding colony growth.

That was our motivation for clustering molecules by structural similarity--and it just happens that ECFPs are great for that purpose.

Now, software packages exist to compute similarity coefficients between molecules using the ECFP algorithm. But none of the vendors we contacted got back to us before we needed the clustering data--so I read up on the Rogers/Hahn paper and took a crack at implementing the algorithm myself.

The Code

The main program takes a series of .mol or .sdf chemcal table files from the target directory and converts them into a series of graphs using a Molecule interface I wrote for this project. It then runs the ECFP algorithm to assign numerical fingerprints to each molecule and, for every unique pair of molecules in the directory, computes a Tanimoto coefficient between the two by comparing the size of the intersection of the two feature sets to the size of their union. The Tanimoto coefficient rates the similarity of the two molecules on a scale from 0 to 1, with 0 representing no shared features between the fingerprints and 1 representing a complete overlap.

The Tanimoto coefficients are output in two formats:

A matrix listing every Tanimoto coefficient generated during the algorithm.
An interaction table for Cytoscape listing only Tanimoto coefficients above a certain similarity threshold. Cytoscape is a network visualization tool that allowed me to convert edges representing high molecular similarity into a human-readable graph. (Examples can be found in the folder visualization.)

ecfp also accepts an niterations parameter; this parameter determines the type of ECFP fingerprint that should be generated. As niterations rises, the individual numerical features in a fingerprint incorporate larger swatches of the molecule and thus give more comprehensive information about its connectivity. Most cheminformatic applications (including mine) use ECFP_4 fingerprints, generated by running the algorithm for two iterations. More information is available in Rogers and Hahn's paper.

A Note to Potential Users

My implementation of ECFP has a couple of limitations that might reduce its usefulness for certain applications:

I do not preprocess molecular structures to identify aromatic bonds before running the ECFP algorithm. This is mostly because doing so would require a complete enumeration and traversal of all cycles in the graph with length greater than 2. There are some interesting algorithmic approaches to this problem -- including some neat vector space methods that construct all cycles from a small cycle basis -- but it wasn't worth doing for my own purposes. In general, this will occasionally skew similarity coefficients between two molecules with aromatic rings represented as alternating single and double bonds, as the two resonance forms of this ring might produce distinct numerical features.
I do not incorporate bond stereochemistry into my fingerprints. However, it would not be difficult to incorporate stereochemical information into the ECFP algorithm as I present it here.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
data		data
lspt45		lspt45
lupt45		lupt45
lvpt45		lvpt45
tests		tests
visualization		visualization
LICENSE.md		LICENSE.md
Makefile		Makefile
README.md		README.md
ecfp.cc		ecfp.cc
ecfp_utils.h		ecfp_utils.h
latca_all.zip		latca_all.zip
molecule.cc		molecule.cc
molecule.h		molecule.h
stringhash.cc		stringhash.cc
stringhash.h		stringhash.h
transfer_latca.py		transfer_latca.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Chlamy-Cheminformatics

Background

ECFP

Why Fingerprinting?

The Code

A Note to Potential Users

About

Uh oh!

Releases

Packages

Languages

License

millimat/Chlamy-Cheminformatics

Folders and files

Latest commit

History

Repository files navigation

Chlamy-Cheminformatics

Background

ECFP

Why Fingerprinting?

The Code

A Note to Potential Users

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages