last update
2026-April-29
VirJenDB
v1
Overview
This page describes how VirJenDB determines whether a viral sequence is labeled as the coding strand.
The strand direction is inferred computationally by comparing the total length of detected ORFs between a sequence and its reverse complement. This information is stored in the coding_strand field and is used during downstream analysis and filtering.
- true: the sequence is likely in the correct coding orientation
- false: the reverse complement likely represents the coding strand
This field is populated by a script that determines strand direction directly from sequence data. The algorithm is adapted from a ViralClust snippet.
Processing Pipeline
Steps
- Retrieve unprocessed sequence IDs
- Fetch sequence for each ID
- Determine strand direction
- Store results (sequence ID and strand classification) in a CSV file
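The four steps above can be sketched as a single loop. This is a minimal illustration, not the VirJenDB script: the function name run_pipeline, the column name sequence_id, and the callables fetch_sequence and is_coding_strand are hypothetical stand-ins for the database lookup and the classifier described below.

```python
import csv
import io
from typing import Callable, Iterable, TextIO


def run_pipeline(
    sequence_ids: Iterable[str],
    fetch_sequence: Callable[[str], str],
    is_coding_strand: Callable[[str], bool],
    handle: TextIO,
) -> int:
    """Classify each sequence and write (sequence_id, coding_strand) rows as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    # Column names are assumptions; only coding_strand is named on this page.
    writer.writerow(["sequence_id", "coding_strand"])
    written = 0
    for seq_id in sequence_ids:
        sequence = fetch_sequence(seq_id)          # step 2: fetch sequence
        label = is_coding_strand(sequence)         # step 3: determine strand
        writer.writerow([seq_id, str(label).lower()])  # step 4: store result
        written += 1
    return written


# Toy stand-ins (hypothetical): an in-memory "database" and a dummy classifier.
toy_sequences = {"seq1": "ATGAAA", "seq2": "TTTCAT"}
buffer = io.StringIO()
run_pipeline(toy_sequences, toy_sequences.__getitem__,
             lambda s: s.startswith("ATG"), buffer)
print(buffer.getvalue())
```

Passing the lookup and the classifier as arguments keeps the loop independent of how sequences are stored; in a real run, handle would be an opened CSV file rather than an in-memory buffer.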
Strand Direction Determination
Algorithm Summary
Both the original sequence and its reverse complement are evaluated using the same ORF detection method.
The strand direction is determined by comparing the total ORF length of:
- the original sequence
- its reverse complement
Decision Rule
- If the total length of detected coding regions in the original sequence is greater than in its reverse complement → true
- Otherwise (including ties) → false
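As a small sketch of the rule above, the comparison can be written as a strict greater-than test. The function name classify_strand is hypothetical, and the totals are illustrative inputs rather than values from real data.

```python
def classify_strand(forward_total: int, reverse_total: int) -> bool:
    """Return True only when the original sequence has strictly more ORF length."""
    return forward_total > reverse_total


print(classify_strand(1200, 350))  # True
print(classify_strand(350, 1200))  # False
print(classify_strand(500, 500))   # False: a tie falls under "Otherwise"
```

Because the comparison is strict, a sequence with no detected ORFs on either strand is also labeled false.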
Reverse Complement
This function reverses the sequence and replaces each nucleotide with its complement (A↔T, C↔G). The input is uppercased first, and any character other than A, C, G, or T becomes N.
def reverse_complement(sequence):
    # Complement lookup; characters outside A/C/G/T map to 'N'.
    comp = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(comp.get(x, 'N') for x in sequence.upper()[::-1])
Longest Coding Region Detection
Concept
- Translate the nucleotide sequence into amino acids in all three forward reading frames
- Map codons → amino acids
- * = stop codon
- X = unknown
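The translation step can be sketched as follows. This assumes the standard genetic code (NCBI translation table 1), which this page does not state explicitly, and it drops a trailing incomplete codon; the actual script may differ on both points. The names CODON_TABLE and translate_frames are illustrative.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
# Assumption: the page does not say which table the real script uses.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_frames(sequence: str) -> list[str]:
    """Translate the three forward reading frames; '*' is stop, 'X' is unknown."""
    seq = sequence.upper()
    proteins = []
    for frame in range(3):
        # Step through complete codons only; a trailing partial codon is dropped.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        proteins.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return proteins


print(translate_frames("ATGAAATAG"))  # ['MK*', '*N', 'EI']
```

Any codon containing a character outside A, C, G, and T (for example N) is not in the table and is therefore translated as X.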
ORF Rule
import re
REGEX_ORF = re.compile(r'[^*]{200,}')
Matches regions:
- without stop codons (*)
- with a minimum length of 200 amino acids
The 200-amino-acid minimum excludes short ORF-like segments from the total.
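A minimal sketch of how the regex could be used to obtain a total ORF length from translated frames. Summing the matches across all frames is an assumption based on the "total ORF length" wording on this page, and the function name total_orf_length is hypothetical.

```python
import re

REGEX_ORF = re.compile(r'[^*]{200,}')


def total_orf_length(proteins):
    """Sum the lengths of all stop-free stretches of at least 200 residues."""
    return sum(
        len(match)
        for protein in proteins
        for match in REGEX_ORF.findall(protein)
    )


# Toy translated frames: stretches of 250, 100 and 200 residues split by stops,
# plus one frame just below the threshold and one empty frame.
frames = ["A" * 250 + "*" + "A" * 100 + "*" + "A" * 200, "A" * 199, ""]
print(total_orf_length(frames))  # 450
```

Note that the pattern only excludes stop codons: it does not require a start codon, and unknown residues (X) count toward the length of a match.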
Example
Original sequence:
- ORF total length: 1200
Reverse complement:
- ORF total length: 350
→ Result: coding_strand = true
Implementation
The full code is available at github.com/VirJenDB/Curation-module
Note: This method assumes that the correct strand contains longer open reading frames. While effective for most viral genomes, it may be less reliable for highly compact or non-canonical genomes.