VirJenDB

Documentation

Last update: 2026-May-26

VirJenDB v1.0


Sequence Deduplication and vOTU Clustering

VirJenDB generates non-redundant viral sequence sets and viral operational taxonomic units (vOTUs) through a multi-step clustering workflow.

Step 1 — Exact Sequence Deduplication

Exact duplicate sequences are removed through hash-based deduplication to generate a unique sequence dataset.
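The deduplication step above can be illustrated with a minimal sketch. This is not the VirJenDB implementation: the hash function (SHA-256), the case normalization, the rule of keeping the first record seen, and the function name `deduplicate` are all assumptions made for illustration. The source only states that deduplication is hash-based.

```python
import hashlib


def deduplicate(records):
    """Keep one record per distinct sequence (illustrative sketch).

    `records` is an iterable of (sequence_id, sequence) pairs.
    Returns (unique, duplicates), where `unique` lists the kept records
    and `duplicates` maps each kept ID to the IDs of identical sequences
    that were dropped.
    """
    first_seen = {}   # digest -> ID of the record that was kept
    unique = []
    duplicates = {}
    for seq_id, seq in records:
        # Assumption: sequences are compared case-insensitively.
        digest = hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        if digest in first_seen:
            duplicates[first_seen[digest]].append(seq_id)
        else:
            first_seen[digest] = seq_id
            unique.append((seq_id, seq))
            duplicates[seq_id] = []
    return unique, duplicates


records = [("a", "ACGT"), ("b", "acgt"), ("c", "ACGA")]
unique, duplicates = deduplicate(records)
```

Storing only digests rather than full sequences keeps memory use roughly constant per record, which matters when the input contains very long genomes. Note that this sketch treats a sequence and its reverse complement as different; whether VirJenDB does the same is not stated.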

Step 2 — Preliminary Clustering

Sequences are partitioned into subfiles of approximately one million sequences and clustered at:

using:

Step 3 — vOTU Clustering

Subclusters are clustered using:

This produces intermediate vOTU clusters.

Step 4 — Final vOTU Generation

Intermediate clusters are merged and reclustered to produce final vOTUs.

Representative Sequence Selection

The longest sequence in each vOTU is selected as representative, with one manually curated exception involving an abnormally long SARS-CoV-2 sequence.
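The selection rule can be sketched as follows. The data layout, the function name `select_representatives`, the `overrides` parameter (a hypothetical way to encode a manually curated exception), and the tie-breaking rule are assumptions for illustration; the source does not describe how ties or the curated exception are handled internally.

```python
def select_representatives(clusters, lengths, overrides=None):
    """Pick the longest member of each cluster as its representative.

    clusters:  dict mapping cluster ID -> list of member sequence IDs
    lengths:   dict mapping sequence ID -> sequence length
    overrides: optional dict mapping cluster ID -> manually chosen
               representative (hypothetical mechanism for curated cases)
    """
    overrides = overrides or {}
    representatives = {}
    for cluster_id, members in clusters.items():
        if cluster_id in overrides:
            representatives[cluster_id] = overrides[cluster_id]
            continue
        # Longest first; assumption: ties go to the smallest ID so the
        # result is deterministic.
        representatives[cluster_id] = min(
            members, key=lambda m: (-lengths[m], m)
        )
    return representatives


clusters = {"v1": ["s1", "s2", "s3"], "v2": ["s4", "s5"]}
lengths = {"s1": 100, "s2": 300, "s3": 300, "s4": 50, "s5": 9999}
reps = select_representatives(clusters, lengths, overrides={"v2": "s4"})
```

In this toy input, `s5` is far longer than its cluster mate, so the override stands in for the kind of manual correction described above.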

Current datasets are available on the Datasets page.

Sequence Representation Icons

UQ: Unique representative of a set of identical sequences after deduplication.

CL: Representative sequence selected for a vOTU cluster.

A sequence may display both icons if it is both unique and selected as the cluster representative.
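As a small illustration of how the two icons combine, the sketch below derives the icon list for a sequence from two sets of IDs. The function name `icons_for` and the set-based inputs are hypothetical and do not reflect how the VirJenDB interface stores this information.

```python
def icons_for(seq_id, unique_ids, cluster_rep_ids):
    """Return the icons a sequence would display (illustrative sketch).

    unique_ids:      IDs kept as unique representatives after deduplication
    cluster_rep_ids: IDs selected as vOTU cluster representatives
    """
    icons = []
    if seq_id in unique_ids:
        icons.append("UQ")
    if seq_id in cluster_rep_ids:
        icons.append("CL")
    return icons


unique_ids = {"s1", "s2"}
cluster_rep_ids = {"s2"}
```

Here `icons_for("s2", unique_ids, cluster_rep_ids)` yields both icons, while `s1` shows only UQ.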
