VirJenDB Curation

Data ingestion

Reformatted dates to the ISO 8601 standard format
Split multi-value fields into separate fields
Corrected typographical errors
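
The first two ingestion steps can be illustrated with a minimal Python sketch. The list of input date formats, the separator characters, and the function names `to_iso8601` and `split_multi_value` are assumptions made for illustration; the formats actually encountered in the source databases are not listed here.

```python
import re
from datetime import datetime

# Hypothetical input formats, tried in order. Month-name parsing (%b)
# assumes an English/C locale.
KNOWN_FORMATS = ("%Y-%m-%d", "%d-%b-%Y", "%d/%m/%Y", "%b-%Y", "%Y")


def to_iso8601(raw):
    """Return an ISO 8601 date string, or None if no known format matches.

    Only the precision present in the input is kept (year, year-month,
    or full date).
    """
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if fmt == "%Y":
            return parsed.strftime("%Y")
        if fmt == "%b-%Y":
            return parsed.strftime("%Y-%m")
        return parsed.date().isoformat()
    return None


def split_multi_value(value, separators=";|"):
    """Split a multi-value field on any of the separator characters."""
    parts = re.split("[" + re.escape(separators) + "]", value)
    return [part.strip() for part in parts if part.strip()]
```

Returning `None` for unparseable dates, rather than guessing, leaves such values available for manual review.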

Host annotations

VirJenDB host annotations are based on two complementary approaches: 1) direct host information provided by the original data sources, which includes both submitter-supplied annotations (e.g., GenBank/RefSeq entries with submitter annotations, PHDaily) and computationally predicted host annotations (e.g., IMG/VR with host predictions from multiple methods); and 2) host assignments inferred from prophages predicted in GenBank microbial genomes with PhiSpy version 4.2.21, where the host is the bacterial genome containing the prophage. See the metadata field descriptions in the schema for more details.

Host taxonomy mapping

The host taxonomy fields were standardized to the GTDB v226 taxonomic framework by mapping the NCBI TaxIds to the GTDB taxonomy names using the GTDB-NCBI taxonomy mapping file available on the Datasets page. Where an NCBI taxon corresponded to more than one GTDB classification, the most frequently represented GTDB classification was selected. For the PhiSpy prophage predictions, host information was obtained by mapping the prophage host genome identifiers to the NCBI TaxIds from the BV-BRC input dataset. From IMG/VR v4.0, we extracted phage sequences with host information available in the GTDB v207 taxonomic scheme, selecting entries tagged as “isolation host” in the associated metadata. The PHD dataset provided host information in both the NCBI and GTDB v220 taxonomic schemes, which we incorporated directly into our dataset. The GTDB taxonomy values were merged into the VirJenDB dataset for 1.47 million phage sequences using the mapped fields, including “Host NCBI TaxID” and “Host NCBI Species Name”.
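
The selection of the most frequently represented GTDB classification per NCBI taxon can be sketched as a simple majority vote. This is a minimal illustration, not the production code: the function name `most_frequent_gtdb`, the input layout (pairs of NCBI TaxId and GTDB name), and the tie-breaking rule are assumptions, and the sample rows are toy data.

```python
from collections import Counter, defaultdict


def most_frequent_gtdb(pairs):
    """Map each NCBI TaxId to its most frequent GTDB classification.

    `pairs` is an iterable of (ncbi_taxid, gtdb_name) tuples, e.g. one
    per genome in a GTDB-NCBI mapping file (assumed layout).
    """
    counts = defaultdict(Counter)
    for taxid, gtdb_name in pairs:
        counts[taxid][gtdb_name] += 1

    mapping = {}
    for taxid, counter in counts.items():
        # Assumption: ties are broken alphabetically so the result is
        # deterministic; the source does not state a tie-breaking rule.
        best_name, _ = min(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        mapping[taxid] = best_name
    return mapping


# Toy rows for illustration only.
rows = [
    (562, "s__Escherichia coli"),
    (562, "s__Escherichia coli"),
    (562, "s__Escherichia fergusonii"),
    (1280, "s__Staphylococcus aureus"),
]
mapping = most_frequent_gtdb(rows)
```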

Sequence deduplication and clustering workflow

Exact duplicate sequences were removed through hash-based dereplication, resulting in the deduplicated, unique sequences dataset available on the Datasets page.
The deduplicated dataset file was split into subfiles of about 1 million sequences, and each subfile was clustered at 95 % similarity using Linclust (MMseqs2 v14.7e284).
The Linclust cluster subfiles were then clustered at 95 % average nucleotide identity (ANI) over 85 % of the query sequence length using Vclust 1.30 (35) with the Leiden algorithm, yielding the Vclust step 1 clusters.
The Vclust step 1 clusters were merged into one file and dereplicated once more using Vclust at 95 % ANI over 85 % of the query sequence length with the Leiden algorithm, yielding the final step 2 vOTU clusters.
The longest sequence in each vOTU cluster was selected as the cluster representative. The one exception is an abnormally long SARS-CoV-2 cluster representative of about 10 million nucleotides, for which the second-longest sequence in the vOTU cluster was manually set as the representative. The current version of the vOTU cluster dataset is available on the Datasets page.
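
The first and last steps of this workflow, hash-based dereplication and selection of the longest sequence as representative, can be sketched in a few lines of Python. This is an illustration under stated assumptions: the hash function (SHA-256), the case-insensitive comparison, and the names `dereplicate` and `pick_representative` are not taken from the VirJenDB pipeline, and the `excluded` argument only mimics the manual override described above.

```python
import hashlib


def dereplicate(records):
    """Keep the first record for each distinct sequence.

    `records` is an iterable of (seq_id, sequence) tuples. Sequences are
    compared case-insensitively via their SHA-256 digest (assumption);
    reverse complements are NOT treated as duplicates in this sketch.
    """
    seen = set()
    unique = []
    for seq_id, seq in records:
        digest = hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((seq_id, seq))
    return unique


def pick_representative(cluster, lengths, excluded=frozenset()):
    """Return the longest member of a cluster, skipping excluded IDs.

    `cluster` is a list of sequence IDs, `lengths` maps ID to sequence
    length, and `excluded` holds IDs manually ruled out as representative.
    """
    candidates = [seq_id for seq_id in cluster if seq_id not in excluded]
    return max(candidates, key=lambda seq_id: lengths[seq_id])
```

Storing only digests keeps memory use proportional to the number of unique sequences rather than their total length, which matters for datasets of this size.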
