Sequence Deduplication and vOTU Clustering

VirJenDB generates non-redundant viral sequence sets and viral operational taxonomic units (vOTUs) through a multi-step clustering workflow.

Step 1 — Exact Sequence Deduplication

Exact duplicate sequences are removed through hash-based deduplication to generate a unique sequence dataset.
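As a rough illustration of hash-based deduplication, the sketch below keeps the first record seen for each distinct sequence and records which identifiers it absorbed. The function name, the choice of SHA-256, and the upper-casing of sequences before hashing are assumptions made for this example, not details of the VirJenDB pipeline.

```python
import hashlib


def deduplicate(records):
    """Keep the first record seen for each distinct sequence.

    records: iterable of (sequence_id, sequence) pairs.
    Returns (unique, duplicates_of): the kept pairs, and a mapping from each
    kept ID to the IDs of identical sequences that were dropped.
    """
    seen = {}           # digest -> ID of the kept record
    unique = []         # kept (sequence_id, sequence) pairs, in input order
    duplicates_of = {}  # kept ID -> list of dropped duplicate IDs
    for seq_id, seq in records:
        # Assumption: case is normalized so "acgt" and "ACGT" count as identical.
        digest = hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        if digest in seen:
            duplicates_of[seen[digest]].append(seq_id)
        else:
            seen[digest] = seq_id
            unique.append((seq_id, seq))
            duplicates_of[seq_id] = []
    return unique, duplicates_of
```

Storing a fixed-size digest rather than the full sequence keeps memory use roughly constant per record, which matters when sequences are long.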

Step 2 — Preliminary Clustering

Sequences are partitioned into subfiles of approximately one million sequences each and clustered at 95% similarity using MMseqs2 Linclust v14.7e284.
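The partitioning part of this step can be sketched as a simple chunking generator. The function name and the fixed chunk size are assumptions for illustration. The clustering itself is performed by MMseqs2 Linclust and is not reproduced here.

```python
from itertools import islice


def partition(records, chunk_size=1_000_000):
    """Yield consecutive lists of at most chunk_size records.

    The last chunk may be smaller, which is why subfiles hold
    "approximately" chunk_size sequences.
    """
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

For example, list(partition(range(10), chunk_size=4)) produces three chunks of sizes 4, 4, and 2. Each chunk would then be written to its own subfile and clustered separately.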

Step 3 — vOTU Clustering

Subclusters are clustered using:

Vclust 1.30
95% ANI
85% query coverage
Leiden algorithm

This produces intermediate vOTU clusters.
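To show how the two thresholds interact, the sketch below filters pairwise comparisons and builds the similarity graph that a community-detection method such as the Leiden algorithm would then partition. It does not implement Leiden and does not call Vclust. The tuple layout, the function name, and the use of fractions between 0 and 1 are hypothetical and do not reflect Vclust's actual output format.

```python
from collections import defaultdict


def build_similarity_graph(pairs, min_ani=0.95, min_qcov=0.85):
    """Return an undirected adjacency map of sequence pairs passing both thresholds.

    pairs: iterable of (query_id, reference_id, ani, query_coverage),
    with ani and query_coverage given as fractions between 0 and 1
    (a hypothetical layout chosen for this example).
    """
    graph = defaultdict(set)
    for query, reference, ani, qcov in pairs:
        # A pair becomes an edge only if it meets BOTH thresholds.
        if query != reference and ani >= min_ani and qcov >= min_qcov:
            graph[query].add(reference)
            graph[reference].add(query)
    return dict(graph)
```

In this sketch, a pair with 96% ANI but only 50% query coverage is discarded, as is a pair with 95% coverage but 90% ANI.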

Step 4 — Final vOTU Generation

Intermediate clusters are merged and reclustered to produce final vOTUs.

Representative Sequence Selection

The longest sequence in each vOTU is selected as representative, with one manually curated exception involving an abnormally long SARS-CoV-2 sequence.
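The selection rule, including a manual exception, can be sketched as follows. The function name, the data structures, the overrides argument, and the tie-breaking by identifier are assumptions made for this example. The source states only that the longest sequence is chosen and that one exception was curated by hand.

```python
def select_representatives(clusters, lengths, overrides=None):
    """Pick one representative per cluster.

    clusters: mapping of cluster ID -> list of member sequence IDs.
    lengths: mapping of sequence ID -> sequence length.
    overrides: optional mapping of cluster ID -> manually chosen representative.
    """
    overrides = overrides or {}
    representatives = {}
    for cluster_id, members in clusters.items():
        if cluster_id in overrides:
            # A manually curated choice takes precedence over the length rule.
            representatives[cluster_id] = overrides[cluster_id]
        else:
            # Longest member wins; ties go to the smallest ID (an assumption).
            representatives[cluster_id] = min(
                members, key=lambda m: (-lengths[m], m)
            )
    return representatives
```

Keeping manual exceptions in a separate mapping leaves the general rule untouched and makes each curated decision easy to audit.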

Current datasets are available on the Datasets page.

Sequence Representation Icons

UQ: Unique representative of a set of identical sequences after deduplication.

CL: Representative sequence selected for a vOTU cluster.

A sequence may display both icons if it is both unique and selected as the cluster representative.
