Protein trees
Gene trees aim to represent the evolutionary history of gene families, which evolved from a common ancestor. Reconciliation of the gene tree against the species tree allows us to distinguish duplication and speciation events, thus inferring orthologues and paralogues.
In the simple case of unique orthologous genes, the results agree closely with reciprocal-best-hit approaches. However, the gene tree pipeline is also able to find more complex one-to-many and many-to-many relations. This, for instance, significantly raises the number of Teleost (bony fish) to Mammal orthologues and has even more dramatic effects on Fly/Mammal or Worm/Mammal orthologous gene predictions. Using this approach, we are also able to "time" the duplication events that produce paralogues by identifying the most recent common ancestor (i.e. taxonomy level) for a given internal node of the tree.
Please note that, primarily due to performance limitations, between-species paralogy relationships are not stored in the Compara databases or listed on the website (except in some special cases described in the "Between-species paralogues" section of the homology types documentation). However, these relationships can be identified by examining the gene trees, which are annotated with duplication events.
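The way homology types can be read off an annotated gene tree can be sketched in a few lines of Python. This is a minimal illustration rather than the Compara implementation: the `Node` class, the simplified relation labels, the gene names and the taxon names are all assumptions made for the example. It relies only on the idea described above, namely that a pair of genes is classified by the event (speciation or duplication) tagged on their last common ancestor, and that the taxon of that node "times" the event.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """Hypothetical gene-tree node; not the Compara data model."""
    event: str = "leaf"   # "speciation", "duplication" or "leaf"
    taxon: str = ""       # species for a leaf, ancestral taxon for an internal node
    gene: str = ""        # gene name (leaves only)
    children: list = field(default_factory=list)


def leaves(node):
    """Return all leaf nodes below (or equal to) `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def classify_pairs(node, out=None):
    """Label every gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    # Genes taken from two different children of `node` have `node` as their
    # last common ancestor.
    for left, right in combinations(node.children, 2):
        for a in leaves(left):
            for b in leaves(right):
                if node.event == "speciation":
                    kind = "orthologue"
                elif a.taxon == b.taxon:
                    kind = "within-species paralogue"
                else:
                    kind = "between-species paralogue"
                out[frozenset((a.gene, b.gene))] = (kind, node.taxon)
    for child in node.children:
        classify_pairs(child, out)
    return out


# Illustrative tree: one ancient duplication followed by a speciation in each copy.
def copy_subtree(human_gene, fish_gene):
    return Node("speciation", "Euteleostomi", children=[
        Node(taxon="Homo sapiens", gene=human_gene),
        Node(taxon="Danio rerio", gene=fish_gene),
    ])


root = Node("duplication", "Vertebrata",
            children=[copy_subtree("HSA_A", "DRE_a"), copy_subtree("HSA_B", "DRE_b")])
relations = classify_pairs(root)
for pair, (kind, taxon) in relations.items():
    print(sorted(pair), kind, taxon)
```

In this toy tree the pair HSA_A/DRE_b is a between-species paralogue, the kind of relationship that is generally not stored but can be recovered from the duplication tags.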
Details on tree building
The gene orthology and paralogy prediction pipeline has eight basic steps:
- For each gene from the species in Ensembl, load the Ensembl canonical translation. The canonical translation is currently chosen on the basis of various factors such as conservation, expression, coding sequence length and concordance with other key resources.
- Run an HMM search on the TreeFam HMM library to classify the sequences into their families.
- Cluster the genes that did not have any match into additional families:
  - Run NCBI Blast+ [1] (refined with Smith-Waterman) on every orphaned gene against every other (both self and non-self species). We use version 2.2.30, with the parameters: -seg no -max_hsps_per_subject 1 -use_sw_tback -num_threads 1.
  - Build a sparse graph of gene relations based on Blast e-values and generate clusters using hcluster_sg [2]. Hcluster_sg is given a small amount of information about the phylogeny of the species (namely, a list of outgroup species), which helps it define pertinent clusters for the ingroup species. We use yeast (Saccharomyces cerevisiae) as an outgroup for the main protein trees collection (via the -C option) and the following command line arguments: -m 750 -w 0 -s 0.34 -O. See below for more details on clustering.
 
- Families may be expanded with genes that have been projected from a source gene that is already a member of the given family.
- Large families that would be too complex to analyse are broken down with QuickTree [6] to limit them to 1,500 genes.
- For each cluster (family), build a multiple alignment of the protein sequences. We either combine several aligners, whose outputs are merged into a consensus by M-Coffee [3], or use Mafft [4] when the cluster is too large or when running M-Coffee takes too long. We use version 9.03.r1318 of M-Coffee with the aligner set mafftgins_msa, muscle_msa, kalign_msa, t_coffee_msa, and Mafft version 7.505 with the command line option --auto.
- For each aligned cluster, build a phylogenetic tree with TreeBeST [5] from the CDS back-translation of the protein multiple alignment (i.e. the codon alignment obtained from the original DNA sequences). Reconciling this tree against a species tree yields a rooted tree with internal duplication tags. The species tree is generally inferred from the NCBI taxonomy; for the Plants division, its topology is curated by the Ensembl Plants team. See below for more details on gene tree building.
- From each gene tree, infer gene pairwise relations of orthology and paralogy types.
- A stable ID is assigned to each GeneTree in the main protein trees collection.
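For the graph-building sub-step above, note [2] gives the edge weights as MIN(100, ROUND(-LOG10(evalue)/2)). The following Python sketch transcribes that formula. The function name is hypothetical, and two details are assumptions rather than documented behaviour: an e-value of 0 is mapped to the maximum weight, and exact halves are rounded away from zero (Python's built-in `round` would instead round halves to the nearest even number).

```python
import math


def edge_weight(evalue: float) -> int:
    """Weight of a Blast hit in the sparse graph: MIN(100, ROUND(-LOG10(evalue)/2))."""
    if evalue <= 0.0:
        # Assumption: an e-value reported as 0 gets the maximum weight.
        return 100
    x = -math.log10(evalue) / 2
    # Assumption: halves are rounded away from zero.
    rounded = int(math.copysign(math.floor(abs(x) + 0.5), x))
    return min(100, rounded)


for evalue in (1.0, 1e-10, 1e-50, 1e-300, 0.0):
    print(evalue, edge_weight(evalue))
```

The cap at 100 means that all hits with an e-value of about 1e-200 or smaller are treated as equally strong.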
Clustering
hcluster_sg [2] performs hierarchical clustering under mean distance. It reads an input file that describes the similarity between pairs of sequences, and joins the two nearest nodes at each step. When two nodes are joined, the distances between the joined node and all the other nodes are updated using the mean distance. This procedure is iterated until one of the following three rules applies:
- Do not merge clusters A and B if the total number of edges between A and B is smaller than |A|*|B|/3, where |A| and |B| are the sizes of A and B, respectively. This rule guarantees that each cluster is compact.
- Do not join A to any other cluster if |A| > 500. This rule avoids huge clusters, which would be computationally burdensome for multiple alignment and tree building.
- Do not join A and B if both A and B contain outgroup genes. This rule tries to find ingroup gene families. TreeFam clustering is done with outgroups.
Hcluster_sg also introduces an additional edge-breaking rule: it removes the edge between clusters A and B if the number of edges between A and B is smaller than |A|*|B|/10. This heuristic removes weak relations that are unlikely to be joined at a later step.
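The merge rules and the edge-breaking rule can be written down as two small predicates. This Python sketch is not hcluster_sg's code: the function names and arguments are hypothetical. The size rule is read here as a cap (a cluster that has already grown beyond the limit is not joined further), which is what its stated purpose of avoiding huge clusters implies, and applying it to both clusters symmetrically is an assumption.

```python
def may_merge(size_a, size_b, edges_between,
              a_has_outgroup, b_has_outgroup, max_size=500):
    """Return True if clusters A and B pass all three merge rules."""
    # Rule 1: too few edges between A and B means the merged cluster
    # would not be compact.
    if edges_between < size_a * size_b / 3:
        return False
    # Rule 2 (read as a cap; symmetric use is an assumption): do not grow
    # clusters that are already larger than the limit.
    if size_a > max_size or size_b > max_size:
        return False
    # Rule 3: two clusters that both contain outgroup genes stay separate.
    if a_has_outgroup and b_has_outgroup:
        return False
    return True


def should_break_edge(size_a, size_b, edges_between):
    """Edge-breaking heuristic: drop weak relations between A and B."""
    return edges_between < size_a * size_b / 10


# Two clusters of 3 genes each, fully connected (9 edges), no outgroup genes.
print(may_merge(3, 3, 9, False, False))   # passes all rules
print(may_merge(3, 3, 2, False, False))   # fails rule 1 (2 < 3)
print(should_break_edge(10, 10, 9))       # 9 < 10, so the edge is removed
```

The `max_size` default of 500 mirrors the rule as stated in the text; the pipeline itself passes -m 750 to hcluster_sg.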
As the pipeline has to be completed in time for the Ensembl releases, we limit the size of the clusters to 1,500 genes. For larger clusters, we run Mafft [4] and QuickTree [6]. QuickTree is a very fast and efficient way to build an unrooted phylogenetic tree. We then find the branch that best splits the cluster into two halves. We recursively follow this approach until each sub-cluster is smaller than 1,500 genes.
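The recursive splitting can be illustrated with a short Python sketch. It is a simplified stand-in for the pipeline code: the tree is a nested tuple with gene names as leaves (an arbitrary rooting of the unrooted QuickTree tree, with at least two children per internal node), the function names are hypothetical, and a sub-cluster is accepted once it has at most `max_size` genes. Every proper subtree corresponds to one branch, so the most balanced branch is the subtree whose leaf count is closest to half of the cluster.

```python
def leaf_names(tree):
    """Gene names below `tree` (a leaf is a string, an internal node a tuple)."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaf_names(child)]


def subtrees(tree):
    """Yield every proper subtree; each one stands for one branch of the tree."""
    if isinstance(tree, str):
        return
    for child in tree:
        yield child
        yield from subtrees(child)


def restrict(tree, keep):
    """Copy of `tree` with only the leaves in `keep`, collapsing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (restrict(c, keep) for c in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def split_cluster(tree, max_size):
    """Recursively cut the most balanced branch until clusters are small enough."""
    names = leaf_names(tree)
    if len(names) <= max_size:
        return [sorted(names)]
    half = len(names) / 2
    # Branch whose two sides are closest to equal size.
    best = min(subtrees(tree), key=lambda sub: abs(len(leaf_names(sub)) - half))
    outside = set(names) - set(leaf_names(best))
    return split_cluster(best, max_size) + split_cluster(restrict(tree, outside), max_size)


toy_tree = ((("g1", "g2"), ("g3", "g4")),
            (("g5", "g6"), ("g7", ("g8", "g9"))))
clusters = split_cluster(toy_tree, max_size=3)
print(clusters)
```

Recomputing leaf counts for every subtree makes this sketch quadratic in the number of genes; it is meant to show the idea, not to be efficient.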
Tree building
The CDS back-translated protein alignment (i.e., codon alignment) is used to build five different trees (within TreeBeST [5]):
- a maximum likelihood (ML) tree based on the protein alignment with the WAG model, which takes into account the species tree
- an ML tree built using phyml, based on the codon alignment with the Hasegawa-Kishino-Yano (HKY) model, also taking into account the species tree
- a neighbour-joining (NJ) tree using p-distance, based on the codon alignment
- a NJ tree using dN distance, based on the codon alignment
- a NJ tree using dS distance, based on the codon alignment
For the first two trees (the ML trees), TreeBeST uses a modified version of phyml release 2.4.5 which takes an input species tree and tries to build a gene tree that is consistent with the topology of the species tree. This "species-guided" phyml uses the original phyml tree-search code. However, the objective function maximised during the tree search is multiplied by an extra likelihood factor not found in the original phyml. This factor reflects the number of duplications and losses inferred in a gene tree, given the topology of the species tree. The species-guided phyml still allows the gene tree to have a topology that is inconsistent with the species tree if the alignment strongly supports this. The species tree is based on the NCBI taxonomy tree (subject to some modifications depending on new datasets). For the Plants division, the topology of this species tree is curated by the Ensembl Plants team.
The final tree is made by merging the five trees into one consensus tree using the "tree merging" algorithm. This allows TreeBeST to take advantage of the fact that DNA-based trees are often more accurate for closely related parts of trees and protein-based trees for distant relationships, and that a group of algorithms may outperform others under certain scenarios. The algorithm merges the five input trees simultaneously. The consensus topology contains clades found in any of the input trees, where the clades chosen are those that minimise the number of duplications and losses inferred and have the highest bootstrap support. Branch lengths are estimated for the final consensus tree based on the DNA alignment, using phyml with the HKY model.
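Of the distance measures used for the NJ trees listed above, p-distance is the simplest: the proportion of aligned sites at which two sequences differ. The Python sketch below computes it for two rows of a codon alignment. The function name is hypothetical, and skipping every column in which either sequence has a gap is an assumption; TreeBeST's own handling of gaps and ambiguous bases may differ.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: columns with a gap in either sequence are ignored.
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Toy codon alignment (illustrative sequences, not real data).
alignment = {
    "gene_1": "ATGAAACCC",
    "gene_2": "ATCA-ACCC",
    "gene_3": "ATGAAACCG",
}
names = sorted(alignment)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, p_distance(alignment[first], alignment[second]))
```

A neighbour-joining method would take the full matrix of such pairwise distances as its input.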
Gene tree building in Cultivars
A modified approach is used to generate most protein trees for strain and breed gene-tree datasets [7] [8]. A gene phylogeny is inferred on a cDNA alignment from which problematic columns have been removed using Noisy [9] (version 1.5.12). Gene trees are inferred using RAxML [10] (version 8.2.12) or ExaML [11] (version 3.0.22) from a RAxML parsimony starting tree, all under a GTR+Γ model. If gene-tree inference requires excessive runtime, we fall back to FastTree [12] (version 2.1.8) with the GTR model. TreeBeST is used to infer gene trees from very small or simple gene alignments.
To minimise redundancy, gene trees and homologies may be excluded if they consist entirely of genes that are present in another gene-tree collection.
Notes and References
- Camacho C et al., "BLAST+: architecture and applications." BMC Bioinformatics. 2009 Dec 15;10:421.
 We use the version 2.2.30, with the parameters: -seg no -max_hsps_per_subject 1 -use_sw_tback -num_threads 1
- Li, H et al., Hcluster_sg: hierarchical clustering software for sparse graphs.
 We use yeast (Saccharomyces cerevisiae) as an outgroup for the main protein trees collection (via the -C option) and the following command line arguments: -m 750 -w 0 -s 0.34 -O
 The weights used in the graph are MIN(100, ROUND(-LOG10(evalue)/2))
- M-Coffee: Wallace IM, O'Sullivan O, Higgins DG, Notredame C., "M-Coffee: combining multiple sequence alignment methods with T-Coffee." Nucleic Acids Research. 2006 Mar 23;34(6):1692-9.
 We use the version 9.03.r1318, with the aligner set: mafftgins_msa, muscle_msa, kalign_msa, t_coffee_msa
- Mafft: Katoh K, Standley DM. "MAFFT multiple sequence alignment software version 7: improvements in performance and usability." Mol Biol Evol. 2013 Apr;30(4):772-80.
 We use the version 7.505 with the command line options: --auto
- Li, H et al., Compara-specific version, Original version (TreeBeST was previously known as NJTREE)
- QuickTree: Howe K, Bateman A and Durbin R, "QuickTree: building huge Neighbour-Joining trees of protein sequences." Bioinformatics (Oxford, England). 2002;18(11):1546-7.
 QuickTree builds an unrooted tree and we recursively split the cluster by finding a branch that roughly holds half of the nodes on each side.
- Aken et al., "Ensembl 2017". Nucleic Acids Research. 2017 Jan;45(D1):D635-D642.
- Harrison PW et al., "Ensembl 2024." Nucleic Acids Res, 52(D1):D891-D899.
- Dress et al., "Noisy: identification of problematic columns in multiple sequence alignments." Algorithms for Molecular Biology : AMB. 2008 Jun;3:7.
- Stamatakis A. "RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies." Bioinformatics (Oxford, England). 2014 May;30(9):1312-1313.
- Kozlov et al., "ExaML version 3: a tool for phylogenomic analyses on supercomputers." Bioinformatics (Oxford, England). 2015 Aug;31(15):2577-2579.
- Price et al., "FastTree 2–approximately maximum-likelihood trees for large alignments." PloS ONE. 2010 Mar;5(3):e9490.


 
    
