UHGV-classifier: taxonomic classification of human gut viruses

Overview

The code and database described here will allow you to obtain a taxonomic label for your viral genome(s) based on the Unified Human Gut Virome (UHGV) taxonomy.

The UHGV is a comprehensive and deeply annotated database of viral genomes from the human gut microbiome. For more info on the data, see: https://github.com/snayfach/UHGV.

The UHGV-classifier lets you assign a quasi-taxonomic label to a sequence, determine its novelty relative to the database, identify characteristics of the nearest viral group, and find other phylogenetically related viruses in the database.

Installation

Install program using git and pip (add --user if you don't have root access):
pip install git+https://github.com/snayfach/UHGV-classifier.git

Install external dependencies using conda:
conda install -c bioconda prodigal-gv diamond blast -y

View available modules:
uhgv -h

Download and unpack the latest database:
uhgv download .

UHGV-tools: download
[1/5] Checking latest version of database...
[2/5] Downloading 'uhgv-db'...
[3/5] Extracting 'uhgv-db'...
[4/5] Building BLASTN database...
[5/5] Building DIAMOND database...
Run time: 250.6 seconds
Peak mem: 2.18 GB

View command line usage for classify module:
uhgv classify -h

usage: uhgv classify [-h] -i PATH -o PATH -d PATH [-t THREADS] [-c]

options:
  -h, --help    show this help message and exit

required arguments:
  -i PATH       Path to nucleotide seqs
  -o PATH       Path to output directory
  -d PATH       Path to database directory
  -t THREADS    Number of threads to run program with (1)
  --continue    Continue where program left off
  --quiet       Suppress logging messages

Example usage

Download a test dataset of 5 phages from Nishijima et al. using wget:
wget https://raw.githubusercontent.com/snayfach/UHGV-classifier/main/example/viral_sequences.fna -O viral_sequences.fna

Classify sequences, replacing </path/to/uhgv-db> as appropriate:
uhgv classify -i viral_sequences.fna -o output -d </path/to/uhgv-db> -t 10

UHGV-classify v0.0.1: classify
[1/10] Reading input sequences
[2/10] Reading database sequences
[3/10] Estimating ANI with blastn
[4/10] Identifying genes using prodigal-gv
[5/10] Performing self alignment
[6/10] Aligning proteins to database
[7/10] Calculating amino acid similarity scores
[8/10] Finding top database hits
[9/10] Performing phylogenetic assignment
[10/10] Writing output file(s)
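
For scripted or batch runs, the same call can be wrapped in Python. The following is a minimal sketch rather than part of the package: `build_classify_command` and `run_classify` are hypothetical helper names, only the `-i`, `-o`, `-d`, and `-t` flags from the usage text are used, and `uhgv` is assumed to be on your PATH.

```python
import subprocess


def build_classify_command(fasta, out_dir, db_dir, threads=1):
    """Build the argument list for `uhgv classify` (hypothetical helper)."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return [
        "uhgv", "classify",
        "-i", str(fasta),    # nucleotide sequences (FASTA)
        "-o", str(out_dir),  # output directory
        "-d", str(db_dir),   # unpacked uhgv-db directory
        "-t", str(threads),  # number of threads
    ]


def run_classify(fasta, out_dir, db_dir, threads=1):
    """Run the classifier; raises CalledProcessError on a non-zero exit."""
    cmd = build_classify_command(fasta, out_dir, db_dir, threads)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.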

There are two main output files:

  • output/classify_summary.tsv: information related to classification
  • output/taxon_info.tsv: details about the classified taxa (e.g., lifestyle, genome size, host)

Here are field definitions and example values for classify_summary.tsv:

| Field | Description | Example |
| --- | --- | --- |
| genome_id | user genome identifier | 0008_k141_99927 |
| genome_length | length in bp | 96989 |
| genome_num_genes | count of CDS | 106 |
| taxon_id | UHGV taxon identifier | vSUBGEN-22354 |
| class_method | nucleotide or protein based classification | protein |
| class_rank | lowest classified rank | subgenus |
| ani_reference | nearest reference based on ANI | UHGV-0030436 |
| ani_identity | nucleotide identity | 93.65 |
| ani_query_af | % of query covered | 86.6 |
| ani_target_af | % of target covered | 83.58 |
| ani_taxonomy | taxonomy of reference genome | vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988 |
| aai_reference | nearest reference based on AAI | UHGV-0030436 |
| aai_shared_genes | number of proteins aligned | 93 |
| aai_identity | amino acid identity | 89.33 |
| aai_score | normalized, cumulative bitscore | 82.57 |
| aai_taxonomy | taxonomy of reference genome | vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988 |
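
The summary file can be read with Python's standard library. The sketch below assumes the file is tab-separated with a header row whose column names match the fields listed above; `read_summary` and `split_lineage` are hypothetical helper names, not functions provided by the package. It also shows one way to split a semicolon-delimited lineage string into its ranks.

```python
import csv
import io


def read_summary(handle):
    """Yield one dict per row of a tab-separated classify_summary.tsv."""
    yield from csv.DictReader(handle, delimiter="\t")


def split_lineage(lineage):
    """Map rank prefixes (e.g. 'vFAM') to full taxon IDs (e.g. 'vFAM-00050')."""
    ranks = {}
    for taxon in lineage.split(";"):
        taxon = taxon.strip()
        if taxon:
            prefix, _, _ = taxon.partition("-")
            ranks[prefix] = taxon
    return ranks


# Tiny in-memory example built from the example values listed above;
# a real run would open output/classify_summary.tsv instead.
example = io.StringIO(
    "genome_id\tclass_rank\tani_taxonomy\n"
    "0008_k141_99927\tsubgenus\t"
    "vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988\n"
)
for row in read_summary(example):
    ranks = split_lineage(row["ani_taxonomy"])
    print(row["genome_id"], row["class_rank"], ranks["vGENUS"])
```

For a real file, replace the `io.StringIO` object with `open("output/classify_summary.tsv", newline="")`.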

Here are field definitions and example values for taxon_info.tsv:

| Field | Description | Example |
| --- | --- | --- |
| genome_id | user genome identifier | 0008_k141_99927 |
| taxon_id | UHGV taxon identifier | vSUBGEN-22354 |
| taxon_lineage | UHGV taxon lineage | vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354 |
| host_lineage | Consensus GTDB host lineage | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella (100.0) |
| ictv_lineage | Consensus ICTV taxon lineage | r__Duplodnaviria;k__Heunggongvirae;p__Uroviricota;c__Caudoviricetes;o__Crassvirales;f__Beta-crassviridae (100.0) |
| lifestyle | Consensus virus lifestyle | virulent (100.0) |
| genome_length_median | median genome length of viruses in lineage | 100566.0 |
| genome_length_iqr | interquartile range of genome length | 100566.0 - 100566.0 |
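
Several taxon_info.tsv fields pack two values into one string. The sketch below shows one way to unpack them, based only on the example values above. It assumes the number in trailing parentheses is a support percentage for the consensus call (this README does not define it) and that the interquartile range is always written as `low - high`. `parse_consensus` and `parse_iqr` are hypothetical helper names.

```python
import re

# Matches "<value> (<number>)", e.g. "virulent (100.0)".
_CONSENSUS = re.compile(r"^(?P<value>.*\S)\s*\((?P<support>\d+(?:\.\d+)?)\)\s*$")


def parse_consensus(field):
    """Split 'virulent (100.0)' into ('virulent', 100.0).

    Returns (field, None) when no trailing '(number)' is present.
    """
    match = _CONSENSUS.match(field)
    if match is None:
        return field.strip(), None
    return match.group("value"), float(match.group("support"))


def parse_iqr(field):
    """Split '100566.0 - 100566.0' into a (low, high) tuple of floats."""
    low, high = field.split(" - ")
    return float(low), float(high)


lifestyle, support = parse_consensus("virulent (100.0)")
low, high = parse_iqr("100566.0 - 100566.0")
print(lifestyle, support, low, high)
```

Rows from the two output files can then be matched on the shared `genome_id` column.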

Citation

If you use the UHGV-classifier in your research, please cite both the software and the underlying publication:

Publication:

A genomic atlas of the human gut virome elucidates genetic factors shaping host interactions

Camargo, A. P., Baltoumas, F. A., Ndela, E. O., Fiamenghi, M. B., Merrill, B. D., Carter, M. M., Pinto, Y., Chakraborty, M., Andreeva, A., Ghiotto, G., Shaw, J., Proal, A. D., Sonnenburg, J. L., Bhatt, A. S., Roux, S., Pavlopoulos, G. A., Nayfach, S., & Kyrpides, N. C. — bioRxiv (2025), DOI: 10.1101/2025.11.01.686033

Software:

Nayfach, S. (2025). UHGV classifier (Version 1.0.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.17418882
