Last update: 2026-May-26
VirJenDB v1.0
Sequence Deduplication and vOTU Clustering
VirJenDB generates non-redundant viral sequence sets and vOTUs through a multi-step clustering workflow.
Step 1 — Exact Sequence Deduplication
Exact duplicate sequences are removed through hash-based deduplication to generate a unique sequence dataset.
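As an illustration only, hash-based deduplication can be sketched as below. The hash function (SHA-256), the case normalisation, and the function name `deduplicate` are assumptions made for this example; the source does not state which hash or normalisation VirJenDB uses.

```python
import hashlib


def deduplicate(records):
    """Group identical sequences under the first identifier seen.

    `records` is an iterable of (sequence_id, sequence) pairs.
    Returns a dict mapping each unique representative id to the list
    of ids (itself included) that share exactly the same sequence.
    """
    first_id_by_digest = {}
    members = {}
    for seq_id, seq in records:
        # Assumption: case is normalised before hashing.
        digest = hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        if digest not in first_id_by_digest:
            first_id_by_digest[digest] = seq_id
            members[seq_id] = [seq_id]
        else:
            members[first_id_by_digest[digest]].append(seq_id)
    return members
```

The keys of the returned dictionary correspond to the unique sequence dataset. Storing digests instead of full sequences keeps memory use roughly constant per record, whatever the genome length.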
Step 2 — Preliminary Clustering
Sequences are partitioned into subfiles of approximately one million sequences and clustered at 95% similarity using MMseqs2 Linclust v14.7e284.
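The partitioning part of this step can be sketched as follows. This is a minimal example, assuming the records are read as a stream and split in input order; the helper name `partition` and the fixed chunk size are hypothetical, and the Linclust call itself is not shown.

```python
from itertools import islice


def partition(records, chunk_size=1_000_000):
    """Yield successive lists holding at most `chunk_size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk would be written to its own subfile and clustered independently, which bounds the memory and runtime of any single clustering job.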
Step 3 — vOTU Clustering
Subclusters from Step 2 are clustered with Vclust 1.30 using the following settings:
- 95% ANI
- 85% query coverage
- Leiden algorithm
This produces intermediate vOTU clusters.
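The sketch below shows how the two thresholds above act on pairwise alignment results. It is not Vclust and does not implement the Leiden algorithm: for brevity, it groups sequences by connected components over the pairs that pass both thresholds, which is a simpler stand-in. The tuple layout, the use of fractions instead of percentages, and the function name `cluster_pairs` are assumptions.

```python
def cluster_pairs(ids, pairs, min_ani=0.95, min_qcov=0.85):
    """Group ids linked by pairs that pass both thresholds.

    `pairs` holds (query, reference, ani, query_coverage) tuples with
    ani and query_coverage expressed as fractions in [0, 1].
    Connected components are used here as a stand-in for Leiden.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, ref, ani, qcov in pairs:
        if ani >= min_ani and qcov >= min_qcov:
            root_q, root_r = find(query), find(ref)
            if root_q != root_r:
                parent[root_r] = root_q

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(group) for group in clusters.values())
```

A pair with high ANI but low query coverage does not link two sequences, so a short fragment matching part of a longer genome is not merged with it on identity alone.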
Step 4 — Final vOTU Generation
Intermediate clusters are merged and reclustered to produce final vOTUs.
Representative Sequence Selection
The longest sequence in each vOTU is selected as representative, with one manually curated exception involving an abnormally long SARS-CoV-2 sequence.
Current datasets are available on the Datasets page.
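The representative selection rule can be sketched as below. This is a minimal example, assuming sequence lengths are known per vOTU and that the manual exception is expressed as an exclusion set. The `excluded` parameter, the tie-breaking by identifier, and the function name are hypothetical, since the source does not describe how the exception is encoded.

```python
def pick_representative(lengths, excluded=frozenset()):
    """Return the id of the longest sequence in one vOTU.

    `lengths` maps sequence id to sequence length.
    `excluded` holds ids that curation has ruled out (hypothetical).
    Ties are broken by the smallest id (assumption).
    """
    candidates = {i: n for i, n in lengths.items() if i not in excluded}
    if not candidates:
        # Fall back to all members if everything was excluded.
        candidates = lengths
    return max(sorted(candidates), key=lambda i: candidates[i])
```

For example, `pick_representative({"s1": 29903, "s2": 29870, "s3": 40000}, excluded={"s3"})` returns `"s1"`, where the identifiers and lengths are placeholders and not real accessions.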
Sequence Representation Icons
UQ: Unique representative of identical sequences after deduplication.
CL: Representative sequence selected for a vOTU cluster.
A sequence may display both icons if it is both unique and selected as the cluster representative.
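The icon rule can be sketched as a simple membership check. The function name `icons` and the two input sets are assumptions for illustration.

```python
def icons(seq_id, unique_ids, cluster_representatives):
    """Return the icon labels that apply to one sequence."""
    labels = []
    if seq_id in unique_ids:
        labels.append("UQ")
    if seq_id in cluster_representatives:
        labels.append("CL")
    return labels
```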