We have two methods to provide quality scores for orthologue pairs: the Gene Order Conservation (GOC) score and the Whole Genome Alignment (WGA) score.
These methods are completely independent of each other and of the orthology inference itself. The scores can be used to assess how likely it is that an orthologue pair is real.
Gene Order Conservation score
Genes that are descended from the same gene are likely to be part of a block of genes, all in the same order, in both species. Some rearrangements between genes may occur over time, particularly in distantly related species, but it is less likely that an isolated gene which does not share gene neighbours with its inferred orthologue is a real orthologue.
The gene order conservation (GOC) score indicates how many of the four closest neighbours of a gene match between orthologous pairs. It is calculated by following these steps:
- Load all the predicted orthologues for a pair of species
- Separate the orthologues into their respective chromosomes
- Discard any orthologue that is by itself (usually on a scaffold), as these orthologues automatically get a NULL score for having no neighbours
- Order the set of orthologues in each chromosome by their start positions on the chosen reference genome
- For each orthologous pair, fetch the two genes upstream and the two genes downstream
- Check whether these neighbours are also identified as orthologues and are in the same orientation
- Each match scores 25%, so if all four neighbouring genes match, the orthologue gets a GOC score of 100% for this reference genome
- Go back to step 4 and repeat, using the other species as the reference genome
- We now have two GOC scores for each orthologue pair, one per reference genome. We currently report the maximum of the two
For example, if three of the four neighbouring genes are orthologues in conserved order and position, the resulting GOC score is 75.
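The steps above can be sketched in Python. This is a minimal illustration, not the production implementation: the names `Gene`, `goc_for_reference` and `goc_scores` are hypothetical, orthologues are assumed to be one-to-one, and a neighbour is assumed to "match" when its orthologue sits at the same relative position in the other genome (mirrored for pairs on opposite strands) with the same relative orientation. A pair that is isolated in only one genome is assumed to keep the score from the other direction.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    strand: int  # +1 or -1


def _ranks(pairs, side):
    """For one species, map pair index -> (chromosome, rank by start, genes on chromosome)."""
    by_chrom = defaultdict(list)
    for idx, pair in enumerate(pairs):
        by_chrom[pair[side].chrom].append(idx)
    ranks = {}
    for chrom, members in by_chrom.items():
        members.sort(key=lambda i: pairs[i][side].start)
        for rank, idx in enumerate(members):
            ranks[idx] = (chrom, rank, len(members))
    return ranks


def goc_for_reference(pairs, ref):
    """GOC score per pair, using species `ref` (0 or 1) as the reference genome."""
    ref_rank = _ranks(pairs, ref)
    other_rank = _ranks(pairs, 1 - ref)
    ref_at = {(c, r): i for i, (c, r, _) in ref_rank.items()}
    scores = []
    for i, pair in enumerate(pairs):
        chrom, rank, n_on_chrom = ref_rank[i]
        if n_on_chrom == 1:
            scores.append(None)  # lone orthologue: no neighbours, NULL score
            continue
        o_chrom, o_rank, _ = other_rank[i]
        orientation = pair[0].strand * pair[1].strand  # +1 same strand, -1 opposite
        matches = 0
        for offset in (-2, -1, 1, 2):  # two upstream, two downstream
            j = ref_at.get((chrom, rank + offset))
            if j is None:
                continue  # fewer than four neighbours near a chromosome end
            n_chrom, n_rank, _ = other_rank[j]
            same_orientation = pairs[j][0].strand * pairs[j][1].strand == orientation
            # Assumed matching rule (see the lead-in); not taken from the source.
            if n_chrom == o_chrom and n_rank == o_rank + offset * orientation and same_orientation:
                matches += 1
        scores.append(matches * 25)  # each match is worth 25%
    return scores


def goc_scores(pairs) -> list[Optional[int]]:
    """Report the maximum of the two per-reference GOC scores for each pair."""
    result = []
    for a, b in zip(goc_for_reference(pairs, 0), goc_for_reference(pairs, 1)):
        available = [s for s in (a, b) if s is not None]
        result.append(max(available) if available else None)
    return result


# Five orthologue pairs, collinear except that the last orthologue in
# species B lies on a different chromosome.
example = [
    (Gene("a1", "chr1", 100, 1), Gene("b1", "chrX", 100, 1)),
    (Gene("a2", "chr1", 200, 1), Gene("b2", "chrX", 200, 1)),
    (Gene("a3", "chr1", 300, 1), Gene("b3", "chrX", 300, 1)),
    (Gene("a4", "chr1", 400, 1), Gene("b4", "chrX", 400, 1)),
    (Gene("a5", "chr1", 500, 1), Gene("b5", "chrY", 100, 1)),
]
print(goc_scores(example))  # [50, 75, 75, 50, 0]
```

In this toy data set the third pair has three of its four neighbours conserved, so it scores 75, mirroring the example above.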
Whole Genome Alignment score
We assume that genes which are orthologous to each other will fall within genomic regions that can be aligned to one another. Since we calculate pairwise whole genome alignments, we can use these to check the regions surrounding orthologues.
The whole genome alignment (WGA) score measures the coverage of the alignment over the orthologue pair. It is calculated as follows:
- Exon boundaries are fetched for all genes in all species of interest
- The species are paired off and all alignments between each pair are detected. All predicted orthologues between the pair are fetched and processed in batches (default batch size = 10)
- The coverage over each member of the orthologue pair is calculated using every available alignment. Coverage over exons is considered more important than coverage over introns, so a weighted score is generated: it takes the coverage over exons as a base, with bonus points given for coverage over the introns (normalised by the proportion of intronic sequence in the gene)
- An overall score for the homology prediction as a whole is computed. For each alignment, the scores of the two genes are averaged, and the overall score is the maximum of these averages, i.e. we report the average score for the greatest-coverage alignment
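The scoring steps above can be sketched as follows. The text does not give the exact weighting formula, so `weighted_coverage` below is only one plausible form, chosen so that exon coverage is the base and intron coverage adds a bonus scaled by the intronic fraction of the gene. Both function names are hypothetical, and returning `None` when no alignment is available is also an assumption.

```python
from typing import Optional


def weighted_coverage(exon_len: int, intron_len: int,
                      exon_covered: int, intron_covered: int) -> float:
    """Per-gene score (0-100) for one alignment.

    ASSUMED formula: exon coverage is the base; intron coverage contributes
    a bonus, normalised by the proportion of intronic sequence in the gene.
    """
    exon_cov = exon_covered / exon_len
    if intron_len == 0:
        return 100 * exon_cov
    intron_cov = intron_covered / intron_len
    intron_fraction = intron_len / (exon_len + intron_len)
    bonus = (1 - exon_cov) * intron_cov * intron_fraction
    return 100 * (exon_cov + bonus)


def wga_score(per_alignment: list[tuple[float, float]]) -> Optional[float]:
    """Overall score for an orthologue pair.

    `per_alignment` holds one (score_gene_a, score_gene_b) tuple per
    alignment. The two gene scores are averaged for each alignment and
    the best average is reported.
    """
    if not per_alignment:
        return None  # assumption: no alignment means no score
    return max((a + b) / 2 for a, b in per_alignment)
```

For instance, `wga_score([(80.0, 60.0), (90.0, 100.0)])` averages each alignment to 70 and 95 and reports 95, the average for the alignment with the greatest coverage.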
Orthologue pairs may be tagged as high confidence if they meet certain thresholds for these two measures as well as identity. The thresholds we use depend on the most-recent common ancestor of the species pair, according to the table below. The orthologue pair must satisfy all the criteria to be tagged as high confidence.
| Clades | Min. GOC score | Min. WGA score | Min. %identity |