VirJenDB

Documentation

Last update: 2026-April-29

VirJenDB v1.0


Sequence Deduplication and vOTU Clustering

VirJenDB generates nonredundant viral sequence sets and viral operational taxonomic units (vOTUs) through a four-step clustering workflow.

Step 1 — Exact Sequence Dereplication

Exact duplicate sequences are removed through hash-based dereplication to generate a unique sequence dataset.
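
As an illustration of hash-based dereplication in general, the following minimal Python sketch keeps the first record seen for each distinct sequence. It is not the VirJenDB implementation: the function name `dereplicate`, the choice of SHA-256, the uppercase normalization, and the decision to treat reverse complements as distinct are all assumptions made for this example.

```python
import hashlib


def dereplicate(records):
    """Remove exact duplicate sequences from (seq_id, sequence) pairs.

    Returns (unique, duplicates):
      unique     - list of (seq_id, sequence), first occurrence of each sequence
      duplicates - dict mapping each kept seq_id to the ids of dropped copies
    """
    seen = {}        # digest -> seq_id of the first record with that sequence
    unique = []
    duplicates = {}
    for seq_id, seq in records:
        # Assumption: case is normalized before hashing, so "acgt" == "ACGT".
        # Assumption: reverse complements are NOT treated as duplicates.
        digest = hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        if digest in seen:
            duplicates[seen[digest]].append(seq_id)
        else:
            seen[digest] = seq_id
            unique.append((seq_id, seq))
            duplicates[seq_id] = []
    return unique, duplicates
```

Storing only the digest rather than the full sequence keeps memory use roughly constant per record, which matters when sequences are long; recording the dropped ids preserves the link from each unique sequence back to its duplicates.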

Step 2 — Preliminary Clustering

Sequences are partitioned into subfiles of approximately one million sequences and clustered at:

using:

Step 3 — vOTU Clustering

Subclusters are clustered using:

This produces intermediate vOTU clusters.

Step 4 — Final vOTU Generation

Intermediate clusters are merged and reclustered to produce final vOTUs.

Representative Sequence Selection

The longest sequence in each vOTU is selected as representative, with one manually curated exception involving an abnormally long SARS-CoV-2 sequence.
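
The selection rule above can be sketched in Python as follows. This is a hypothetical illustration, not VirJenDB code: the function name `select_representatives`, the `overrides` mapping used to express a manually curated exception, and the tie-breaking rule (smallest identifier among equally long sequences) are assumptions, and the identifiers and lengths in the usage example are made up.

```python
def select_representatives(votus, lengths, overrides=None):
    """Pick one representative sequence id per vOTU.

    votus     - dict mapping vOTU id -> list of member sequence ids
    lengths   - dict mapping sequence id -> sequence length
    overrides - optional dict mapping vOTU id -> manually chosen member id
    """
    overrides = overrides or {}
    representatives = {}
    for votu_id, members in votus.items():
        if votu_id in overrides:
            chosen = overrides[votu_id]
            if chosen not in members:
                raise ValueError(f"{chosen} is not a member of {votu_id}")
        else:
            # Longest member wins; ties go to the smallest id (assumption).
            chosen = min(members, key=lambda m: (-lengths[m], m))
        representatives[votu_id] = chosen
    return representatives


# Usage with made-up data: v2 has a curated override instead of its longest member.
example = select_representatives(
    {"v1": ["s1", "s2", "s3"], "v2": ["s4", "s5"]},
    {"s1": 100, "s2": 300, "s3": 300, "s4": 50000, "s5": 29903},
    overrides={"v2": "s5"},
)
```

Keeping curated exceptions in a separate mapping, rather than special-casing them inside the length comparison, makes each manual decision explicit and easy to audit.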

Current datasets are available on the Datasets page.
