The code and database described here will allow you to obtain a taxonomic label for your viral genome(s) based on the Unified Human Gut Virome (UHGV) taxonomy.
The UHGV is a comprehensive and deeply annotated database of viral genomes from the human gut microbiome. For more info on the data, see: https://github.com/snayfach/UHGV.
The UHGV-classifier allows a user to assign a quasi-taxonomic label to their sequence, determine its novelty relative to the database, identify characteristics of the nearest viral group, and identify other phylogenetically related viruses in the database.
Install the program using pip, which requires git to be installed (add --user if you don't have root access):
pip install git+https://github.com/snayfach/UHGV-classifier.git
Install external dependencies using conda:
conda install -c bioconda prodigal-gv diamond blast -y
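Before running the classifier, it can help to confirm that the external tools are actually visible on your PATH. The following is a minimal sketch, not part of UHGV-classifier itself. The executable names are assumptions: `prodigal-gv` and `diamond` are taken from the package names above, and `blastn` is the binary named in the classifier's log output (it ships with the bioconda `blast` package). The helper functions are hypothetical.

```python
import shutil

# Executable names assumed to be provided by the conda packages above.
REQUIRED_TOOLS = ("prodigal-gv", "diamond", "blastn")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the executables that could not be located on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external dependencies were found.")
```

If anything is reported as missing, re-check that the conda environment used for the install is the one currently activated.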
View available modules:
uhgv -h
Download and unpack the latest database:
uhgv download .
UHGV-tools: download
[1/5] Checking latest version of database...
[2/5] Downloading 'uhgv-db'...
[3/5] Extracting 'uhgv-db'...
[4/5] Building BLASTN database...
[5/5] Building DIAMOND database...
Run time: 250.6 seconds
Peak mem: 2.18 GB
View command line usage for classify module:
uhgv classify -h
usage: uhgv classify [-h] -i PATH -o PATH -d PATH [-t THREADS] [-c]
options: -h, --help show this help message and exit
required arguments: -i PATH Path to nucleotide seqs
-o PATH Path to output directory
-d PATH Path to database directory
-t THREADS Number of threads to run program with (1)
--continue Continue where program left off
--quiet Suppress logging messages
Download a test dataset of 5 phages from Nishijima et al. using wget:
wget https://raw.githubusercontent.com/snayfach/UHGV-classifier/main/example/viral_sequences.fna -O viral_sequences.fna
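As a quick sanity check on the download, you can count the sequences and their lengths with a few lines of Python. This is an illustrative sketch only: `fasta_lengths` is a hypothetical helper, not part of the classifier, and it assumes a plain (uncompressed) FASTA file.

```python
def fasta_lengths(lines):
    """Return {sequence_id: length_in_bp} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the header text up to the first whitespace as the identifier.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence lines may be wrapped, so add up every line of the record.
            lengths[current] += len(line)
    return lengths
```

Passing an open handle on `viral_sequences.fna` to `fasta_lengths` should return five entries if the file downloaded completely.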
Classify sequences, replacing </path/to/uhgv-db> as appropriate:
uhgv classify -i viral_sequences.fna -o output -d </path/to/uhgv-db> -t 10
UHGV-classify v0.0.1: classify
[1/10] Reading input sequences
[2/10] Reading database sequences
[3/10] Estimating ANI with blastn
[4/10] Identifying genes using prodigal-gv
[5/10] Performing self alignment
[6/10] Aligning proteins to database
[7/10] Calculating amino acid similarity scores
[8/10] Finding top database hits
[9/10] Performing phylogenetic assignment
[10/10] Writing output file(s)
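If you want to run the classify step from a script, for example over several input files, the command above can be assembled and launched with `subprocess`. This is a sketch under the assumption that `uhgv` is on your PATH. It uses only the `-i`, `-o`, `-d`, and `-t` flags shown in the usage text, and the function names are hypothetical.

```python
import subprocess


def build_classify_command(fasta, out_dir, db_dir, threads=1):
    """Build the argument list for `uhgv classify` from the documented flags."""
    return [
        "uhgv", "classify",
        "-i", str(fasta),      # path to nucleotide sequences
        "-o", str(out_dir),    # path to output directory
        "-d", str(db_dir),     # path to database directory
        "-t", str(threads),    # number of threads
    ]


def run_classify(fasta, out_dir, db_dir, threads=1):
    """Run the classifier and raise CalledProcessError if it exits non-zero."""
    cmd = build_classify_command(fasta, out_dir, db_dir, threads)
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.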
There are two main output files:
- output/classify_summary.tsv: information related to classification
- output/taxon_info.tsv: details about the classified taxa (e.g., lifestyle, genome size, host)
Here are field definitions and example values for classify_summary.tsv:
| Field | Description | Example |
|---|---|---|
| genome_id | user genome identifier | 0008_k141_99927 |
| genome_length | length in bp | 96989 |
| genome_num_genes | count of CDS | 106 |
| taxon_id | UHGV taxon identifier | vSUBGEN-22354 |
| class_method | nucleotide or protein based classification | protein |
| class_rank | lowest classified rank | subgenus |
| ani_reference | nearest reference based on ANI | UHGV-0030436 |
| ani_identity | nucleotide identity | 93.65 |
| ani_query_af | % of query covered | 86.6 |
| ani_target_af | % of target covered | 83.58 |
| ani_taxonomy | taxonomy of reference genome | vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988 |
| aai_reference | nearest reference based on AAI | UHGV-0030436 |
| aai_shared_genes | number of proteins aligned | 93 |
| aai_identity | amino acid identity | 89.33 |
| aai_score | normalized, cumulative bitscore | 82.57 |
| aai_taxonomy | taxonomy of reference genome | vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988 |
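The summary table can be loaded with Python's standard `csv` module. The sketch below assumes the file is tab-separated with a header row whose column names match the fields listed above. The mapping from identifier prefixes (such as `vFAM`) to rank names is an assumption inferred from the example values, and the sample row contains only a subset of the columns.

```python
import csv
import io

# Assumed mapping from taxon ID prefix to rank name (inferred from the examples).
RANK_PREFIXES = {
    "vFAM": "family",
    "vSUBFAM": "subfamily",
    "vGENUS": "genus",
    "vSUBGEN": "subgenus",
    "vOTU": "votu",
}


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def split_lineage(lineage):
    """Split a ';'-delimited taxonomy string into a {rank: taxon_id} dict."""
    ranks = {}
    for taxon in lineage.split(";"):
        prefix = taxon.split("-", 1)[0]
        if prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = taxon
    return ranks


# One-row sample built from the example values in the table above.
SAMPLE = (
    "genome_id\ttaxon_id\tclass_rank\tani_identity\tani_taxonomy\n"
    "0008_k141_99927\tvSUBGEN-22354\tsubgenus\t93.65\t"
    "vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988\n"
)

rows = read_tsv(io.StringIO(SAMPLE))
for row in rows:
    lineage = split_lineage(row["ani_taxonomy"])
    print(row["genome_id"], row["class_rank"], lineage.get("genus"))
```

For a real run, replace the in-memory sample with `open("output/classify_summary.tsv", newline="")`.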
Here are field definitions and example values for taxon_info.tsv:
| Field | Description | Example |
|---|---|---|
| genome_id | user genome identifier | 0008_k141_99927 |
| taxon_id | UHGV taxon identifier | vSUBGEN-22354 |
| taxon_lineage | UHGV taxon lineage | vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354 |
| host_lineage | Consensus GTDB host lineage | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella (100.0) |
| ictv_lineage | Consensus ICTV taxon lineage | r__Duplodnaviria;k__Heunggongvirae;p__Uroviricota;c__Caudoviricetes;o__Crassvirales;f__Beta-crassviridae (100.0) |
| lifestyle | Consensus virus lifestyle | virulent (100.0) |
| genome_length_median | median genome length of viruses in lineage | 100566.0 |
| genome_length_iqr | interquartile range of genome length | 100566.0 - 100566.0 |
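Several fields in taxon_info.tsv combine a value with a number in parentheses, and the interquartile range is written as two numbers separated by a dash. The following sketch shows one way to split these into typed values. It assumes the formats match the examples above exactly. The meaning of the parenthesized number is not stated here, so the code simply calls it `support`. Both helper names are hypothetical.

```python
import re

# Matches "<value> (<number>)", e.g. "virulent (100.0)".
_CONSENSUS = re.compile(
    r"^(?P<value>.*\S)\s*\((?P<support>[0-9]+(?:\.[0-9]+)?)\)\s*$"
)


def parse_consensus(field):
    """Split 'value (number)' into (value, number); number is None if absent."""
    text = field.strip()
    match = _CONSENSUS.match(text)
    if match is None:
        return text, None
    return match.group("value"), float(match.group("support"))


def parse_iqr(field):
    """Split 'low - high' into a (low, high) tuple of floats."""
    low, high = (float(part) for part in field.split(" - "))
    return low, high


print(parse_consensus("virulent (100.0)"))
print(parse_iqr("100566.0 - 100566.0"))
```

The same `parse_consensus` call applies to `host_lineage` and `ictv_lineage`, where the returned value is the full semicolon-delimited lineage string.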
If you use the UHGV-classifier in your research, please cite both the software and the underlying publication:
Publication:
A genomic atlas of the human gut virome elucidates genetic factors shaping host interactions
Camargo, A. P., Baltoumas, F. A., Ndela, E. O., Fiamenghi, M. B., Merrill, B. D., Carter, M. M., Pinto, Y., Chakraborty, M., Andreeva, A., Ghiotto, G., Shaw, J., Proal, A. D., Sonnenburg, J. L., Bhatt, A. S., Roux, S., Pavlopoulos, G. A., Nayfach, S., & Kyrpides, N. C. — bioRxiv (2025), DOI: 10.1101/2025.11.01.686033
Software:
Nayfach, S. (2025). UHGV classifier (Version 1.0.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.17418882
