This script allows detection of enriched composite motifs in genomic sequence data.
- Python 2.7.x, with the following packages installed:
  - pip
  - numpy and scipy (accounted for by the instructions below)
  - MOODS v1.0.2.1 (ditto)
- JASPAR-formatted motifs
- bedtools-derived FASTA DNA sequence file(s)
Newer versions of MOODS and Python are not supported. PRs to add this support, which pass the included tests, would be welcome.
If you have multiple Python versions on your system, please ensure that the
first python and pip in your search path are the Python 2.7
versions. In a typical HPC environment, your module system (e.g. Environment
Modules) should handle this for you.
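As a complement to checking the version by hand, here is a minimal sketch of the same check done from inside the interpreter. The function name is_python27 and the filename mentioned below are made up for this illustration; neither is part of COSMO.

```python
import sys


def is_python27(version_info=sys.version_info):
    """Return True when the given version tuple is some Python 2.7.x release."""
    return tuple(version_info[:2]) == (2, 7)


if __name__ == "__main__":
    # Reports on whichever interpreter was used to launch this file
    print("running Python %d.%d.%d" % tuple(sys.version_info[:3]))
    print("OK for COSMO" if is_python27() else "not Python 2.7")
```

Saved as, say, check_version.py and run with plain python check_version.py, it tells you which interpreter comes first in your search path.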
- Clone the source from GitHub (MOODS v1.0.2.1 is provided as a submodule):
  git clone --recursive https://github.com/weirauchlab/cosmo.git
  cd cosmo
  If you already cloned the repository without reading this section first, no worries, just run this inside the new clone:
  git submodule init && git submodule update
  As an alternative, download the latest release archive from GitHub, then unpack it into a local directory; see the DETAILED INSTALLATION section for instructions on downloading and building MOODS from source.
- If you have Docker or Podman available (substitute podman for docker below):
  docker build . -t cosmo  # be patient, builds Python 2.7 from source!
  docker run --rm -it cosmo cosmo --help
  docker run --rm -it cosmo cosmostats --help
  See the DEVELOPMENT AND TESTING section for more details.
- If you want to use a local Python installation instead, you'll probably need to load a module, or build Python 2.7 from source. If you don't already have a version of pip that works with Python 2.7, install it:
  wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
  python get-pip.py
  Important tip: At this point, stop and make sure that python --version and pip --version both report "2.7" in their output; that is, that the first python and pip in your search path are definitely the Python 2.7 versions.
  To avoid having to install COSMO's dependencies at the system level, use pip to install virtualenv if necessary, then create a Python virtual environment and activate it:
  # in the 'cosmo' subdirectory from 'git clone' above
  python -m virtualenv venv
  . venv/bin/activate
  If you have some other Python 2.7 environment (such as Conda or Environment Modules), you probably know what to do on your own. If you have trouble with this step, try the Docker method described below.
- Next, build the MOODS C library and install COSMO's other Python dependencies:
  make deps
  If you get a "permission denied" error here, follow the steps above to create a virtualenv, activate it, then try again.
- Finally, to make sure everything works, run the make test target in the included Makefile (assumes a Unix environment):
  make test -j4  # run parallel tasks on up to 4 CPU cores
See DETAILED INSTALLATION below if you have any problems with the instructions above or with running the scripts.
If you have MOODS and COSMO's dependencies already
installed, you can just copy cosmo.py and cosmostats.py to a directory in
your shell's search path and call it good. However, there's an install
target in the included Makefile that will handle the details for
you.
If you are on a Unix/Linux system, run make install. The default installation
prefix is /usr/local (with scripts being installed to /usr/local/bin), so
you will likely need to become root with sudo or similar.
A simpler option is to install to your home directory:
make install PREFIX=$HOME/.local
Most Linux distributions already include ~/.local/bin in your search path by
default. You may need to log out and back in again for this to take effect. How
to update your shell's PATH variable is beyond the scope of this document.
If this is successful, you can run cosmo or cosmostats from any directory
on your filesystem, without needing to specify the relative pathnames like
./cosmo.py in the examples below.
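If cosmo is not found after installing, a quick first step is to confirm that the installation directory really is in your search path. Below is a minimal sketch of that check; dir_on_path is a hypothetical helper written for this README, not something COSMO ships.

```python
import os


def dir_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    if path_value is None:
        # Fall back to the PATH of the current process
        path_value = os.environ.get("PATH", "")
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry  # skip empty entries such as a trailing separator
    ]
    return target in entries


if __name__ == "__main__":
    print(dir_on_path("~/.local/bin"))
```

The comparison is purely textual after normalization, so symbolic links that point to the same directory are not detected.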
Windows is not supported by the build method currently used in our
setup.py. However, if you have success building MOODS on Windows
and would like to have a go at getting COSMO working, too, a patch or pull
request would be welcome.
The cosmo.py script does the actual scanning of the FASTA, and
cosmostats.py compiles summary statistics into a file
named stats.tab in your current working directory.
cosmo.py supports the following command-line options:
| Option | Description |
|---|---|
| -fa PATH | path to FASTA sequence file |
| -t | log-odds score threshold (S/Smax) (default is 0.6) |
| -P | (optional) pseudocount for MOODS to use (default is 1) |
| -p PATH | path to JASPAR-format PWMs (default is ./jpwm) |
| -d | maximum allowed distance between motifs (default is 10) |
| -s | boolean flag to dinucleotide shuffle the input sequence |
| -N | background run number |
| -C | boolean flag to save coordinates rather than counts |
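The -t option is described as a threshold on S/Smax. One plausible reading, and it is only an assumption since this README does not spell out the computation, is that a hit is kept when its score S is at least the given fraction of the best score Smax the matrix can produce. The sketch below illustrates that reading on a plain list-of-lists matrix; max_score and absolute_threshold are hypothetical names and do not call MOODS or COSMO.

```python
def max_score(matrix):
    """Best possible score of a log-odds matrix.

    `matrix` has one row per nucleotide (A, C, G, T) and one column per
    motif position; the best score is the sum of the column maxima.
    """
    return sum(max(column) for column in zip(*matrix))


def absolute_threshold(matrix, fraction=0.6):
    """Convert a relative threshold (S/Smax) into an absolute score cutoff."""
    return fraction * max_score(matrix)


if __name__ == "__main__":
    # Toy two-position matrix, made up for illustration
    toy = [
        [1.0, -2.0],   # A
        [0.5, 0.0],    # C
        [-1.0, 2.0],   # G
        [0.0, -0.5],   # T
    ]
    print(max_score(toy), absolute_threshold(toy, 0.5))
```

Under this reading, raising -t toward 1 keeps only hits close to the matrix's consensus sequence.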
COSMO writes counts for stereopairs to the local directory in the file
cosmo.counts.tab. Background scans (with parameters -s and -N <x>) are
placed into sequential files named cosmo.counts.tab.<x>. Coordinates (with
the -C option, explained below) are saved into a BED-formatted file,
cosmo.coords.bed.
Example: scan a FASTA file in the current working directory, with a specific log-odds threshold score and maximum allowed distance between motifs:
# (the defaults are 0.6 and 10, respectively)
./cosmo.py -fa h3k27ac.fa -t 0.75 -d 20
Use -N <number> to start a specific number of background runs.
Use -s to dinucleotide-shuffle the input sequences.
./cosmo.py -fa h3k27ac.fa -s -N 1
./cosmo.py -fa h3k27ac.fa -s -N 2
⋮
./cosmo.py -fa h3k27ac.fa -s -N <n>
For a large number of background runs, this is best accomplished in a 'for' loop in your favorite shell. Assuming Bash or Z shell:
runs=100
for (( i=1; i<=runs; i++ )); do
    ./cosmo.py -fa h3k27ac.fa -s -N $i
done
The -C option produces outputs that are genomic coordinates in BED format,
rather than counts:
./cosmo.py -fa h3k27ac.fa -C
# combine existing 'cosmo.counts.tab*' files into summary stats
./cosmostats.py -N 100
Combined with the example above, for 100 background scans:
runs=100
# assuming Bash or Z shell…
for (( i=1; i<=runs; i++ )); do
    ./cosmo.py -fa h3k27ac.fa -s -N $i
done
./cosmostats.py -N $runs
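Before running cosmostats.py over many background scans, it can help to confirm that every expected counts file is present. The sketch below assumes the runs were numbered 1 through N, as in the loop above, and that the files follow the cosmo.counts.tab.<x> naming described earlier; missing_background_files is a hypothetical helper, not part of COSMO.

```python
import os


def missing_background_files(runs, directory="."):
    """Return the names of expected background count files that do not exist."""
    expected = ["cosmo.counts.tab.%d" % i for i in range(1, runs + 1)]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(directory, name))
    ]


if __name__ == "__main__":
    missing = missing_background_files(100)
    if missing:
        print("still missing: " + ", ".join(missing))
    else:
        print("all background scans are present")
```

An empty list means all N background files were found in the given directory.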
If you have problems with the QUICK START section (e.g. MOODS fails to build), here's a fully-manual installation, spelled out.
Did you forget to git clone --recursive? If you didn't do that, you don't
have the MOODS submodule. Do this:
cd cosmo # if not already there
# if required, create a Python 2.7 virtualenv and activate it
python -m virtualenv venv && source venv/bin/activate
if test -d .git; then
    git submodule init && git submodule update
    make deps
else
    echo "Oops, this isn't a Git repository." >&2
fi
If your Python is in a virtual environment, make sure it is activated,
otherwise the make deps step will not do the right thing.
At this point, you should be able to run ./cosmo.py and get a usage
message (with no Python tracebacks). Skip to "Running on example
FASTA."
If you downloaded a release archive instead of cloning the repository, you will need to download the MOODS sources from GitHub first:
# remove existing 'MOODS' dir; it's where the Git submodule would go
rmdir MOODS
# or 'curl -LOJ' if you don't have 'wget'
wget https://github.com/jhkorhonen/MOODS/archive/v1.0.2.1.zip
unzip v1.0.2.1.zip && rm -i v1.0.2.1.zip
# move the unpacked directory to 'MOODS', where the Makefile expects it
mv MOODS-1.0.2.1 MOODS
Again, if your Python is in a virtual environment, make sure it is
activated. You should be able to make deps at this point, and the
Makefile will guide you through the rest of the steps. But here's the
completely manual way to reproduce what the Makefile does:
pushd MOODS/src
make
cd ../python
python setup.py install
popd
# install NumPy and SciPy (COSMO's dependencies)
pip install -r requirements.txt
At this point, you should be able to run ./cosmo.py and get a usage
message (with no Python tracebacks).
First, unpack the example FASTA file if necessary, and run several background scans (in this example, three):
cd examples
test -f example.fa || gunzip example.fa.gz
# vary these parameters to your liking (see USAGE section, above)
args="-fa example.fa -t 0.6 -d 10 -p jpwm"
# the '&>' syntax assumes Bash or Z shell; it combines stdout and stderr
../cosmo.py $args -s -N 1 &>1.log &
../cosmo.py $args -s -N 2 &>2.log &
../cosmo.py $args -s -N 3 &>3.log &
Run an additional scan with the -C (save coordinates) option, so you can see
what that output looks like:
../cosmo.py $args -C &>coords.log &
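Finally, once the background jobs started above have finished, compute
statistics for the three scans (-N 3) and redirect this output into a file
named stats.tab:
wait  # block until the scans started with '&' have finished
../cosmostats.py -N 3 > stats.tab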
The output stats.tab is tab-delimited, and may be viewed in the terminal,
e.g., with column -t, or opened in a spreadsheet program such as Excel,
Google Sheets, or LibreOffice.
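If you would rather post-process the statistics in a script, a tab-delimited file like this can be read with a few lines of Python. This is a generic sketch: it makes no assumption about which columns cosmostats.py writes, and read_tab_file is a hypothetical helper name.

```python
def read_tab_file(path):
    """Read a tab-delimited text file into a list of rows (lists of strings)."""
    rows = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:  # ignore blank lines
                rows.append(line.split("\t"))
    return rows


if __name__ == "__main__":
    import sys
    # Usage: python read_tab.py stats.tab
    for row in read_tab_file(sys.argv[1]):
        print(row)
```

Every field comes back as a string, so convert numeric columns yourself once you know which ones they are.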
If you define an environment variable named COSMO_PWMDIR, it
becomes the default for the -p / --pwmdir option. Typically, this would be
an absolute path starting at /, but you can get creative.
This can be useful, for example, when used with Environment Modules, to define a system-wide directory containing the JASPAR-formatted matrices for all users.
This variable can also be defined in your login scripts, e.g. your
~/.bash_profile or ~/.profile; note that the variable set by a setenv
statement in a modulefile would still take precedence in that case.
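The general pattern of an environment variable supplying an option's default can be sketched with argparse as follows. This only illustrates the behavior described above, under the assumption that ./jpwm remains the fallback; it is not COSMO's actual argument-parsing code, and build_parser is a hypothetical name.

```python
import argparse
import os


def build_parser(environ=None):
    """Build a parser whose -p/--pwmdir default comes from COSMO_PWMDIR."""
    if environ is None:
        environ = os.environ
    parser = argparse.ArgumentParser(description="env-var default sketch")
    parser.add_argument(
        "-p", "--pwmdir",
        # Use the environment variable if set, otherwise the documented default
        default=environ.get("COSMO_PWMDIR", "./jpwm"),
        help="path to JASPAR-format PWMs",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args().pwmdir)
```

With this arrangement an explicit -p on the command line still wins over the environment variable, which in turn wins over the built-in default.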
To install COSMO as an Environment Modules module, the short answer is:
make module
For members of the Weirauch Lab, this will just do the Right Thing™. To avoid
errors here, purge all your modules, deactivate any virtualenvs, and re-load
python/2.7.18-wrl or a comparable Python 2.7.x module.
For others, this will install the module to /usr/local/modules/cosmo/x.y.z
(where x.y.z is the currently checked-out version of COSMO) and put the
modulefile in /usr/local/modules/modulefiles/cosmo/x.y.z.
For you to be able to module load cosmo, you will need to have run module use /usr/local/modules/modulefiles in your current shell session or login
scripts, or to have added that to your sitewide configuration files, e.g.
/etc/environment-modules/modulespath on Debian/Ubuntu systems.
See the definitions of MODULEDESTROOT and MODULEFILEDEST in the
Makefile for customization options. For example, if you have custom
modules in ~/modules and modulefiles in ~/modules/modulefiles, you can:
make module MODULEDESTROOT=$HOME/modules
Further help with Environment Modules is beyond the scope of this document. See its homepage at https://modules.sf.net for more information.
You can use the included Dockerfile to simplify local development; it builds
a minimal Debian Linux container with GNU Make and the latest release of Python
2.7 inside.
To use it, build the image locally, then bind mount the repository to /src
inside the container before running commands inside it. For example:
docker build . -t cosmo
# make sure it works
docker run --rm -it cosmo cosmo --help
docker run --rm -it cosmo cosmostats --help
Podman will work equally well here. To run tests on the sample data included with the repository, on 4 CPU cores:
docker run --rm -it -v .:/src cosmo make -j4 test
# or with Podman, assuming your local user ID is 1000
podman run --userns=keep-id --rm -it -v .:/src cosmo make -j4 test
Podman runs containers without root privileges, with the container's cosmo
user in a different namespace than the user on the host system, so you need to
add --userns=keep-id to the podman run invocation, or else manually adjust
permissions inside the container.
If you're changing the code, make sure make test passes, or at least you can
figure out the reason why it didn't pass (explain this in your commit
message), before committing to the master/main branch.
For breaking changes (e.g. removing a command-line option or changing the input or output formats in a non-backward-compatible way), you must:
- increment the whole number (major) part of the version in setup.py
- …and git tag vX.Y.Z, where X.Y.Z is the new version number.
See semver.org for more information.
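For reference, a major version bump under Semantic Versioning also resets the minor and patch parts to zero. The sketch below shows that rule for plain X.Y.Z strings; bump_major is a hypothetical helper and does not edit setup.py or create the tag for you.

```python
def bump_major(version):
    """Return the next major version for an 'X.Y.Z' (or 'vX.Y.Z') string."""
    text = version[1:] if version.startswith("v") else version
    parts = text.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError("expected a version like 1.2.3, got %r" % version)
    # Increment the major part; minor and patch start over at zero
    return "%d.0.0" % (int(parts[0]) + 1)


if __name__ == "__main__":
    print(bump_major("1.4.2"))
```

Pre-release or build suffixes (such as 1.0.0-rc1) are deliberately rejected to keep the sketch short.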
- FASTA inputs must have headers in chrN:<start>-<end> format, where N is the chromosome number; the nucleotide sequences must also be on a single line.
- Given FASTA inputs above about 100 MB, COSMO takes a long time to finish.
  - As a workaround, split large FASTAs into multiple files before the >sequence header lines and concatenate the results from COSMO.
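Both limitations above can be handled with a little preprocessing. The sketch below checks the two input rules and groups records into chunks that only ever break at a header line. The function names are hypothetical, and the header pattern is an assumption: it accepts any alphanumeric label after chr (so chrX would pass), which may be more permissive than COSMO itself.

```python
import re

# Assumed header shape: >chrN:<start>-<end>
HEADER_RE = re.compile(r"^>chr[0-9A-Za-z_]+:\d+-\d+$")


def fasta_problems(lines):
    """Return (line_number, message) pairs for lines that break the rules."""
    problems = []
    previous_was_sequence = False
    for number, raw in enumerate(lines, 1):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(">"):
            if not HEADER_RE.match(line):
                problems.append((number, "header is not chrN:<start>-<end>"))
            previous_was_sequence = False
        else:
            if previous_was_sequence:
                problems.append((number, "sequence spans more than one line"))
            previous_was_sequence = True
    return problems


def split_records(lines, records_per_chunk):
    """Group FASTA lines into chunks, breaking only at '>' header lines."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunks, current, count = [], [], 0
    for line in lines:
        if line.startswith(">"):
            if count == records_per_chunk:
                chunks.append(current)
                current, count = [], 0
            count += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can be written to its own file and scanned separately. How the per-file results should be combined afterwards is not specified here, so check that step against your own analysis.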
| Name | Email | Role |
|---|---|---|
| Jeremy Riddell | riddeljr@mail.uc.edu | Primary author |
| Kevin Ernst | kevin.ernst@cchmc.org | Contributor |
| Matthew Weirauch, PhD | matthew.weirauch@cchmc.org | Principal Investigator |
The rights holders are Cincinnati Children's Hospital Medical Center and the contributors.
The software's license is GPLv3, to match that of MOODS. See
LICENSE.txt for details.