VirJenDB

Documentation

Last update: 2026-April-29

VirJenDB v1

This page is up-to-date!

Overview

This page describes how VirJenDB determines whether a viral sequence is labeled as the coding strand.

The strand direction is inferred computationally by comparing the total length of detected ORFs between a sequence and its reverse complement. This information is stored in the coding_strand field and is used during downstream analysis and filtering.

Field summary

This field is populated by a script that determines strand direction directly from sequence data. The algorithm is adapted from a ViralClust snippet.


Processing Pipeline

flowchart

Steps

  1. Retrieve unprocessed sequence IDs
  2. Fetch sequence for each ID
  3. Determine strand direction
  4. Store results (sequence ID and strand classification) in a CSV file
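The four steps above can be sketched as a single loop. This is a minimal illustration, not the VirJenDB implementation: `run_pipeline`, `fetch_sequence`, `is_coding_strand`, and the CSV column names are hypothetical, and the ID source and sequence lookup are passed in as parameters because this page does not describe them.

```python
import csv
from typing import Callable, Iterable


def run_pipeline(
    ids: Iterable[str],
    fetch_sequence: Callable[[str], str],
    is_coding_strand: Callable[[str], bool],
    out_path: str,
) -> int:
    """Classify each sequence and write (sequence_id, coding_strand) rows to a CSV file."""
    written = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Column names are assumptions made for this sketch.
        writer.writerow(["sequence_id", "coding_strand"])
        for seq_id in ids:                      # step 1: unprocessed sequence IDs
            sequence = fetch_sequence(seq_id)   # step 2: fetch the sequence
            label = is_coding_strand(sequence)  # step 3: determine strand direction
            writer.writerow([seq_id, str(label).lower()])  # step 4: store the result
            written += 1
    return written
```

Passing the lookup and the classifier as callables keeps the loop independent of where the sequences are stored and of how the strand is determined.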

Strand Direction Determination

Algorithm Summary

Both the original sequence and its reverse complement are evaluated using the same ORF detection method.

The strand direction is determined by comparing their total ORF length:

Strand direction decision logic

Decision Rule
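A minimal sketch of the comparison, assuming the total ORF length has already been computed for both orientations. The function name `is_coding_strand` and the tie handling (equal totals keep the submitted orientation) are assumptions for illustration; this page does not state how ties are resolved.

```python
def is_coding_strand(forward_orf_total: int, reverse_orf_total: int) -> bool:
    """Return True when the sequence as given carries at least as much ORF length
    as its reverse complement."""
    # Assumption: a tie favours the original orientation.
    return forward_orf_total >= reverse_orf_total


# Illustrative totals only, not taken from real data.
print(is_coding_strand(900, 300))  # True
print(is_coding_strand(300, 900))  # False
```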


Reverse Complement

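This function reverses the sequence and replaces each nucleotide with its complement (A↔T, C↔G). The input is uppercased first, and any character other than A, C, G, or T is returned as N.

def reverse_complement(sequence):
    """Return the reverse complement; non-ACGT characters become 'N'."""
    comp = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    # Uppercase, reverse, then complement each base.
    return ''.join(comp.get(x, 'N') for x in sequence.upper()[::-1])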
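As a quick check of this behaviour, the function can be run on short inputs, including a lowercase sequence with an ambiguous base. The input strings are illustrative only.

```python
def reverse_complement(sequence):
    comp = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(comp.get(x, 'N') for x in sequence.upper()[::-1])


print(reverse_complement("ATGC"))   # GCAT
print(reverse_complement("aacgr"))  # NCGTT (the ambiguity code R becomes N)

# For sequences made only of A, C, G, and T, applying the function twice
# returns the uppercased original.
print(reverse_complement(reverse_complement("ATGC")))  # ATGC
```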

Longest Coding Region Detection

ORF detection process

Concept

ORF Rule

import re
REGEX_ORF = re.compile(r'[^*]{200,}')

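Matches regions of at least 200 consecutive characters that are not `*` (the symbol conventionally used for a stop codon in a translated sequence).

The threshold removes short ORF-like segments.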
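The sketch below shows one way the ORF rule could be turned into a total ORF length for a sequence. It makes several assumptions that this page does not confirm: the regex is applied to a translation of each of the three reading frames, the standard genetic code is used, and the total is measured in amino acids. The names `translate_frame`, `total_orf_length`, and `CODON_TABLE` are hypothetical.

```python
import re
from itertools import product

REGEX_ORF = re.compile(r'[^*]{200,}')  # threshold taken from the ORF rule

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_frame(sequence, frame):
    """Translate one reading frame; codons containing non-ACGT bases become 'X'."""
    seq = sequence.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def total_orf_length(sequence):
    """Sum the lengths (in amino acids) of all stop-free runs of 200+ residues."""
    total = 0
    for frame in range(3):
        protein = translate_frame(sequence, frame)
        total += sum(len(m.group()) for m in REGEX_ORF.finditer(protein))
    return total


long_seq = "ATG" + "GCT" * 250  # illustrative input, no stop codon in any frame
short_seq = "ATGGCT" * 10       # only 20 codons per frame, below the threshold
print(total_orf_length(long_seq))   # 751
print(total_orf_length(short_seq))  # 0
```

To classify a sequence, the same function would be applied to its reverse complement and the two totals compared.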


Example

Original sequence:

Reverse complement:

→ Result: coding_strand = true

Implementation

The full code can be found at github.com/VirJenDB/Curation-module.

Note: This method assumes that the correct strand contains longer open reading frames. While effective for most viral genomes, it may be less reliable for highly compact or non-canonical genomes.
