Last update: 2026-April-29
VirJenDB v1.0
Sequence Deduplication and vOTU Clustering
VirJenDB generates nonredundant viral sequence sets and viral operational taxonomic units (vOTUs) through a multi-step clustering workflow.
Step 1 — Exact Sequence Dereplication
Exact duplicate sequences are removed through hash-based dereplication to generate a unique sequence dataset.
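As an illustration only, hash-based dereplication can be sketched as follows. The hash function (SHA-256), the case-insensitive comparison, and the choice to keep the first record seen are assumptions made for this example; the actual VirJenDB implementation is not described here, and the function name `dereplicate` is hypothetical.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def dereplicate(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield the first (id, sequence) record seen for each distinct sequence."""
    seen = set()
    for seq_id, seq in records:
        # Assumption: sequences are compared case-insensitively, and
        # reverse complements are NOT treated as duplicates.
        digest = hashlib.sha256(seq.upper().encode("ascii")).digest()
        if digest not in seen:
            seen.add(digest)
            yield seq_id, seq
```

Storing fixed-size digests rather than full sequences keeps memory use proportional to the number of unique sequences, not to their total length.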
Step 2 — Preliminary Clustering
Sequences are partitioned into subfiles of approximately one million sequences, and each subfile is clustered at 95% similarity using MMseqs2 Linclust v14.7e284.
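The partitioning described above can be sketched as a simple chunking generator. This is a minimal example, not the VirJenDB code: the function name `partition` and the fixed chunk size are assumptions, and the clustering of each chunk with MMseqs2 Linclust is not shown.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def partition(items: Iterable[T], chunk_size: int = 1_000_000) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    it = iter(items)
    while True:
        # islice consumes up to chunk_size items from the shared iterator.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk would then be written to its own subfile and clustered independently; the last chunk may be smaller than the others.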
Step 3 — vOTU Clustering
Subclusters are clustered with Vclust 1.30 using:
- 95% ANI
- 85% query coverage
- Leiden algorithm
This produces intermediate vOTU clusters.
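To make the thresholds concrete, the sketch below filters pairwise comparisons down to the edges that would enter a clustering graph. It is an assumption-laden example: the `PairHit` fields are hypothetical and do not mirror Vclust's output columns, values are expressed as fractions, the thresholds are treated as inclusive, and the Leiden community detection step itself is not shown.

```python
from typing import Iterable, List, NamedTuple, Tuple


class PairHit(NamedTuple):
    """One pairwise comparison (hypothetical layout, not Vclust's format)."""
    query: str
    reference: str
    ani: float   # average nucleotide identity as a fraction (0 to 1)
    qcov: float  # fraction of the query covered by the alignment (0 to 1)


def filter_edges(
    hits: Iterable[PairHit],
    min_ani: float = 0.95,
    min_qcov: float = 0.85,
) -> List[Tuple[str, str]]:
    """Keep pairs that meet both thresholds, skipping self-comparisons."""
    return [
        (h.query, h.reference)
        for h in hits
        if h.query != h.reference and h.ani >= min_ani and h.qcov >= min_qcov
    ]
```

The resulting edge list is what a graph-based method such as the Leiden algorithm would take as input to assign sequences to clusters.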
Step 4 — Final vOTU Generation
Intermediate clusters are merged and reclustered to produce final vOTUs.
Representative Sequence Selection
The longest sequence in each vOTU is selected as the representative, with one manually curated exception involving an abnormally long SARS-CoV-2 sequence.
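The selection rule can be sketched as below. This is illustrative only: the function name `pick_representatives`, the input layout, the tie-breaking by sequence ID, and the `overrides` mapping used to model a manually curated exception are all assumptions, not the VirJenDB implementation.

```python
from typing import Dict, Mapping, Optional


def pick_representatives(
    clusters: Mapping[str, Mapping[str, int]],
    overrides: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Map each cluster ID to its representative sequence ID.

    clusters maps cluster ID -> {sequence ID: sequence length}; every
    cluster is assumed to be non-empty.
    overrides maps cluster ID -> manually chosen representative.
    """
    overrides = overrides or {}
    representatives = {}
    for cluster_id, lengths in clusters.items():
        if cluster_id in overrides:
            # A curated choice takes precedence over the length rule.
            representatives[cluster_id] = overrides[cluster_id]
        else:
            # Longest sequence wins; ties go to the smallest sequence ID
            # (tie-breaking is an assumption made for determinism).
            representatives[cluster_id] = min(
                lengths, key=lambda sid: (-lengths[sid], sid)
            )
    return representatives
```

Keeping curated exceptions in a separate mapping leaves the default rule untouched and makes each manual decision explicit and easy to audit.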
Current datasets are available on the Datasets page.