Strand Direction Detection

Overview

This page describes how VirJenDB determines whether a viral sequence is labeled as the coding strand.

The strand direction is inferred computationally by comparing the total length of detected ORFs between a sequence and its reverse complement. This information is stored in the coding_strand field and is used during downstream analysis and filtering.

Field summary

true: the sequence is likely in the correct coding orientation
false: the reverse complement likely represents the coding strand

This field is populated by a script that determines strand direction directly from sequence data. The algorithm is adapted from a ViralClust snippet.

Processing Pipeline

[Figure: processing pipeline flowchart]

Steps

Retrieve unprocessed sequence IDs
Fetch sequence for each ID
Determine strand direction
Store results (sequence ID and strand classification) in a CSV file
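
The four steps above could be wired together roughly as follows. This is a minimal sketch, not the VirJenDB script: `run_pipeline`, `fetch_sequence`, `classify`, and the CSV column names are hypothetical placeholders, and the toy classifier only stands in for the real strand check.

```python
import csv
import io

def run_pipeline(sequence_ids, fetch_sequence, classify, out_handle):
    """Write one (sequence ID, strand classification) row per ID."""
    writer = csv.writer(out_handle)
    # Column names are assumed for illustration
    writer.writerow(["sequence_id", "coding_strand"])
    for seq_id in sequence_ids:                      # step 1: unprocessed IDs
        sequence = fetch_sequence(seq_id)            # step 2: fetch sequence
        label = "true" if classify(sequence) else "false"  # step 3: classify
        writer.writerow([seq_id, label])             # step 4: store result

# Toy stand-ins for the database lookup and the strand classifier
toy_store = {"seq1": "ATGAAA", "seq2": "TTTCAT"}
buffer = io.StringIO()
run_pipeline(toy_store.keys(), toy_store.get, lambda s: s.startswith("ATG"), buffer)
pipeline_rows = buffer.getvalue().splitlines()
```

Passing the fetch and classification steps in as callables keeps the loop independent of where sequences come from, which makes it easy to test with in-memory data.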

Strand Direction Determination

Algorithm Summary

Both the original sequence and its reverse complement are evaluated using the same ORF detection method.

The strand direction is determined by comparing the total ORF length of:

the original sequence
its reverse complement

[Figure: strand direction decision logic]

Decision Rule

If the total length of detected coding regions in the original sequence is greater than in its reverse complement → true
Otherwise → false
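
The rule reduces to a single strict comparison. The sketch below uses a hypothetical helper name, `is_coding_strand`, and assumes the two totals have already been computed. Because the comparison is strict, a tie falls under "Otherwise" and yields false.

```python
def is_coding_strand(total_orf_original, total_orf_reverse):
    """Return True only if the original sequence has strictly more ORF length."""
    return total_orf_original > total_orf_reverse

clear_case = is_coding_strand(1200, 350)   # original wins
tie_case = is_coding_strand(350, 350)      # equal totals fall under "Otherwise"
```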

Reverse Complement

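This function reverses the sequence and replaces each nucleotide with its complement (A↔T, C↔G). Input is uppercased first, and any character other than A, C, G, or T is replaced with N.

def reverse_complement(sequence):
    # Watson-Crick complements; anything else (N, IUPAC codes) becomes 'N'
    comp = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    # Uppercase, reverse, then complement each base
    return ''.join(comp.get(x, 'N') for x in sequence.upper()[::-1])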
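
A short usage sketch of the function may help show its edge cases. The function is repeated here only so the snippet runs on its own; the input strings are made up for illustration.

```python
def reverse_complement(sequence):
    comp = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(comp.get(x, 'N') for x in sequence.upper()[::-1])

plain = reverse_complement("ATGC")        # 'GCAT'
lower = reverse_complement("atgn")        # 'NCAT' (input is uppercased, N stays N)
ambiguous = reverse_complement("ARG")     # 'CNT'  (IUPAC code R collapses to N)
round_trip = reverse_complement(reverse_complement("ATGC"))  # 'ATGC'
```

Note that the round trip only restores the input when it contains A, C, G, and T exclusively, since ambiguity codes are collapsed to N.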

Longest Coding Region Detection

[Figure: ORF detection process]

Concept

Translate the nucleotide sequence into amino acids in all three forward reading frames
Map codons → amino acids
* = stop codon
X = unknown
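
The translation step might look like the sketch below. It assumes the standard genetic code (NCBI translation table 1), which this page does not state explicitly, and the helper names `translate` and `three_frame_translations` are hypothetical.

```python
from itertools import product

# Standard genetic code (assumed), codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(sequence):
    """Translate complete codons; codons not in the table become 'X'."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3          # drop a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, usable, 3))

def three_frame_translations(sequence):
    """Translate the given strand starting at offsets 0, 1, and 2."""
    return [translate(sequence[frame:]) for frame in range(3)]

frames = three_frame_translations("ATGAAATAGC")   # ['MK*', '*NS', 'EI']
unknown = translate("ATGNNN")                     # 'MX'
```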

ORF Rule

import re
REGEX_ORF = re.compile(r'[^*]{200,}')

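The pattern matches regions of the translated sequence:

without stop codons (*)
with a minimum length of 200 amino acids

The pattern does not require a start codon. The 200-amino-acid threshold excludes short stop-free segments, which are more likely to arise by chance.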
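
To illustrate how the pattern behaves, the sketch below sums the lengths of all matches across a set of already translated frames. Summing over every frame is one reading of "total length of detected ORFs" and is an assumption; the actual script may aggregate differently. `total_orf_length` is a hypothetical helper name and the protein strings are synthetic.

```python
import re

REGEX_ORF = re.compile(r'[^*]{200,}')

def total_orf_length(translations):
    """Sum the lengths (in amino acids) of all stop-free runs of >= 200 residues."""
    return sum(len(orf)
               for protein in translations
               for orf in REGEX_ORF.findall(protein))

synthetic_frames = [
    "M" * 250 + "*" + "K" * 199,   # 250 counts; 199 is below the threshold
    "A" * 200 + "*" + "L" * 300,   # both runs count: 200 + 300
    "MK*",                         # too short to match
]
orf_total = total_orf_length(synthetic_frames)   # 750
```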

Example

Original sequence:

ORF total length: 1200

Reverse complement:

ORF total length: 350

→ Result: coding_strand = true
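
For a runnable counterpart to the example, the sketch below chains the pieces on a synthetic sequence. It is not the VirJenDB implementation: the standard genetic code, the summing of ORF lengths over all three frames, and the names `total_orf_length` and `coding_strand` are assumptions made for illustration. The toy sequence is a repeat of CTAC, chosen because it contains no stop codon in any forward frame while its reverse complement (a repeat of GTAG) hits TAG every fourth codon in every frame.

```python
import re
from itertools import product

REGEX_ORF = re.compile(r'[^*]{200,}')
# Standard genetic code (assumed), codons enumerated in T, C, A, G order
CODON_TABLE = {"".join(c): aa for c, aa in zip(
    product("TCAG", repeat=3),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")}

def reverse_complement(sequence):
    comp = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(comp.get(x, 'N') for x in sequence.upper()[::-1])

def total_orf_length(sequence):
    """Total amino-acid length of ORF matches over the three frames of one strand."""
    seq = sequence.upper()
    total = 0
    for frame in range(3):
        # Walk complete codons only; unknown codons translate to 'X'
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        protein = ''.join(CODON_TABLE.get(c, 'X') for c in codons)
        total += sum(len(orf) for orf in REGEX_ORF.findall(protein))
    return total

def coding_strand(sequence):
    """True if the sequence has strictly more ORF length than its reverse complement."""
    return total_orf_length(sequence) > total_orf_length(reverse_complement(sequence))

toy = "CTAC" * 300                                   # 1200 nt, no forward stops
forward_total = total_orf_length(toy)                # 400 + 399 + 399 = 1198
reverse_total = total_orf_length(reverse_complement(toy))   # 0
toy_label = coding_strand(toy)                       # True
```

Feeding the reverse complement of the toy sequence into `coding_strand` gives False, mirroring the second field value described in the field summary.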

Implementation

The full code is available at github.com/VirJenDB/Curation-module.

Note: This method assumes that the correct strand contains longer open reading frames. While effective for most viral genomes, it may be less reliable for highly compact or non-canonical genomes.
