Advanced usage
GRAViTy-V2 is a complex, resource-intensive application. Not adhering to best practice principles can lead to unnecessarily long run times and, in some cases, lacklustre results.
It would be intractable to construct databases comprising every known viral genome. For this reason, we advise breaking down queries into several stages at differing levels of granularity: we call this the "pass" system.
Imagine you have an RNA virus sequence of unknown taxonomy and minimal prior knowledge of its provenance. A "first pass" might compare your unknown sequence against representative members of each family in Riboviria. Suppose the results indicate that your sequence is most similar to the representative from Flaviviridae. A sensible "second pass" would then be to re-run GRAViTy-V2, comparing your unknown sequence against a wide variety of genomes from Flaviviridae, which will yield a much more precise taxonomic placement.
There are three predominant reasons for long run times:
- Running GRAViTy-V2 on a low-spec machine. We recommend, as a bare minimum, a 9th-generation or newer Intel CPU with at least 8 cores (or an equivalent AMD/Apple chip) and at least 16 GB RAM.
- Creating runs with very large search spaces. Experiments comparing >1000 genomes will take hours to run, and analyses on very large genomes (e.g. Mamonoviridae) will take longer than equivalent experiments on small genomes. We recommend running workflows in multiple passes (see above).
- User-set parameters can have a huge influence on run times. We strongly recommend users familiarise themselves with the "Input parameter descriptions" section of the Usage Wiki page. In addition, the following parameters should be used with caution:
- NThreads: setting this value to 1 removes parallelism and increases run times. Setting it to a high value increases the memory footprint and can cause overflow into swap space, which also increases run times. We recommend setting it to N/2, where N = number of CPU cores.
- N_Bootstrap: whenever a bootstrap iteration is run, the GOM needs to be recompiled, which can significantly increase run times. Users may opt to reduce the number of bootstrap iterations, or disable bootstrapping entirely ("Bootstrap": false) on runs where it isn't required.
- MutualInformationScoring: This statistic is important in determining the relative feature importance of each PPHMM for influencing each classification. Although interesting, this is not necessary for all workflows and can greatly increase run times in analyses where thousands of PPHMMs have been compiled.
- PPHMMSorting: This will rarely improve performance but can dramatically increase run times (see our article for more information).
- N_AlignmentMerging: see "PPHMMSorting" (above).
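The NThreads and Bootstrap recommendations above can be sketched as a small helper that derives overrides for a runfile. This is a minimal sketch, not part of GRAViTy-V2: it assumes the runfile is a flat JSON object whose keys match the parameter names used on this page, and the function names are hypothetical.

```python
import json
import os


def conservative_overrides(cpu_cores: int) -> dict:
    """Return runfile overrides that favour shorter run times."""
    return {
        # N/2 threads, but never fewer than one.
        "NThreads": max(1, cpu_cores // 2),
        # Disable bootstrapping on runs that do not need it.
        "Bootstrap": False,
    }


def apply_overrides(runfile: dict, overrides: dict) -> dict:
    """Return a copy of the runfile with the overrides applied.

    Assumption: parameters sit at the top level of the runfile.
    """
    updated = dict(runfile)
    updated.update(overrides)
    return updated


# os.cpu_count() reports logical CPUs and may return None, so fall back to 1.
cores = os.cpu_count() or 1
print(json.dumps(conservative_overrides(cores), indent=2))
```

Note that `os.cpu_count()` counts logical CPUs; on machines with simultaneous multithreading this can be twice the number of physical cores, so pass the physical core count explicitly if you know it.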
Usage instructions differ for Docker containers, as this process involves hosting a local server behind an abstraction layer.
- Pull or build, then run a GRAViTy-V2 Docker container (see Installation)
- Check the container status and find the container ID with
$ docker ps
- Exec into the container (with a bash shell) with
$ docker exec -it XXXXXXX bash
where XXXXXXX is your container ID
- Start a bash terminal
$ bash
- Confirm that GRAViTy-V2 is functional by calling the dependency test
$ python3 -m cli.dep_test
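For scripted checks, the dependency test can also be triggered from the host without opening an interactive shell. The sketch below wraps `docker exec` with Python's `subprocess` module. It rests on two assumptions not confirmed here: that the container's default working directory is one from which the `cli` package is importable, and that the dependency test exits with a non-zero code on failure. The function names are hypothetical.

```python
import subprocess


def dep_test_command(container_id: str) -> list[str]:
    """Build a `docker exec` call that runs the dependency test."""
    return ["docker", "exec", container_id, "python3", "-m", "cli.dep_test"]


def run_dep_test(container_id: str) -> bool:
    """Run the dependency test in the container; True if it exits with 0."""
    result = subprocess.run(dep_test_command(container_id), check=False)
    return result.returncode == 0
```

Calling `run_dep_test("XXXXXXX")` with the ID reported by `docker ps` runs the same command as the final step above.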
Detailed guidance on using the GRAViTy-V2 GUI is not provided (hence the CLI is recommended, see below), as the process differs vastly between users. Some will simply be able to use $ docker inspect to find the container's IP address, append :8000/docs to it and open the resulting URL in a browser. Users on WSL and HPC clusters (and possibly also Macs) will have to follow a more convoluted process: please consult your system administrator for guidance.
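For the simple case, the URL can be derived from the JSON that `docker inspect <container ID>` prints. The following is a sketch only: it assumes the container is attached to at least one Docker network that reports an IP address, and the function name is hypothetical.

```python
import json


def docs_url_from_inspect(inspect_output: str, port: int = 8000) -> str:
    """Build the API docs URL from the JSON printed by `docker inspect <id>`."""
    container = json.loads(inspect_output)[0]  # inspect prints a JSON array
    networks = container["NetworkSettings"]["Networks"]
    # Take the first network that reports a non-empty IP address.
    for network in networks.values():
        ip_address = network.get("IPAddress")
        if ip_address:
            return f"http://{ip_address}:{port}/docs"
    raise ValueError("no IP address found in docker inspect output")
```

The output of `docker inspect` can be captured with `subprocess.run(["docker", "inspect", container_id], capture_output=True, text=True).stdout` and passed straight to this function.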
Currently, CLI support extends only to the new_classification_full, dependency_test and example_run endpoints. Expanding CLI support is on our development roadmap. These functions have limited argument parsing and work by reading a configuration file (.json format). It is possible, however, to use any GRAViTy-V2 function from the command line by cURL'ing endpoints.
An example configuration file is included in the repository (./data/eva/example_runfile.json). Users may copy and modify this file, then trigger a run from the command line as follows (replacing path-to-runfile with the path to your own file).
$ python3 -m cli.gravity_new_classification --runfile path-to-runfile
Users may script this CLI entrypoint to do batch jobs.
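As one way to script this, the sketch below runs the CLI entrypoint once per runfile in a directory and records each exit code. It assumes it is launched from a directory where the `cli` package is importable (for example, the repository root); the function names and directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def classification_command(runfile: Path) -> list[str]:
    """Build the CLI call documented above for a single runfile."""
    return [
        "python3", "-m", "cli.gravity_new_classification",
        "--runfile", str(runfile),
    ]


def run_batch(runfile_dir: Path) -> dict[str, int]:
    """Run every *.json runfile in a directory, one after another.

    Returns a mapping of runfile name to process exit code.
    """
    exit_codes = {}
    for runfile in sorted(runfile_dir.glob("*.json")):
        completed = subprocess.run(classification_command(runfile), check=False)
        exit_codes[runfile.name] = completed.returncode
    return exit_codes
```

Runs are executed sequentially on purpose: each run is already parallelised through NThreads, so launching several at once would compete for the same cores and memory.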
The dependency_test and example_run endpoints exist for testing installations and troubleshooting dependency problems. They require no user input other than being called from the command line.
$ python3 -m cli.dep_test
$ python3 -m cli.example_run
Users may interact with any GRAViTy-V2 endpoint programmatically with cURL, either by scripting a payload or by reading in JSON config files. A full API schema to support this is generated automatically by the GRAViTy-V2 API: the easiest way to view it is to start the GRAViTy-V2 API and visit http://127.0.0.1:8000/openapi.json (assuming deployment on localhost).
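To see which endpoints are available before writing a payload, the schema can be read programmatically. The sketch below uses only the Python standard library and relies on the standard OpenAPI layout, in which a top-level `paths` object maps each path to its HTTP operations. The function names are hypothetical, and the base URL assumes deployment on localhost.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs from an OpenAPI schema."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


def fetch_schema(base_url: str = "http://127.0.0.1:8000") -> dict:
    """Download and parse openapi.json from a running GRAViTy-V2 API."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)
```

With the API running, `list_endpoints(fetch_schema())` returns the method and path of every endpoint, which can then be targeted with cURL.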
It is possible to batch automate multiple GRAViTy-V2 runs in this manner. An example script is included in ./dev/automated_batch_run.py
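Independently of the included script, the general idea can be sketched with the Python standard library: read each JSON config file and POST it to an endpoint. This is an illustration under assumptions, not a description of ./dev/automated_batch_run.py. It assumes the chosen endpoint accepts the config file contents as a JSON request body, and the endpoint path must be taken from openapi.json. The function names are hypothetical.

```python
import json
from pathlib import Path
from urllib.request import Request, urlopen


def build_request(base_url: str, endpoint: str, payload: dict) -> Request:
    """Build a JSON POST request for one run (nothing is sent yet)."""
    url = base_url.rstrip("/") + "/" + endpoint.lstrip("/")
    return Request(
        url=url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_runfiles(base_url: str, endpoint: str, runfiles: list[Path]) -> list[int]:
    """POST each runfile in turn and collect the HTTP status codes.

    Assumption: the endpoint takes the runfile contents as its body.
    """
    statuses = []
    for runfile in runfiles:
        payload = json.loads(runfile.read_text())
        with urlopen(build_request(base_url, endpoint, payload)) as response:
            statuses.append(response.status)
    return statuses
```

Requests are submitted one at a time, so each run finishes (or fails with an exception from `urlopen`) before the next begins.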