Gene Orthology/Paralogy prediction method
The gene orthology and paralogy predictions are generated by a pipeline where maximum likelihood phylogenetic gene trees (generated by TreeBeST) play a central role. They aim to represent the evolutionary history of gene families, i.e. genes that diverged from a common ancestor. These gene trees reconciled with their species tree have their internal nodes annotated to distinguish duplication or speciation events.
This methodology was published in EnsemblCompara GeneTrees: Analysis of complete, duplication aware phylogenetic trees in vertebrates. Vilella AJ, Severin J, Ureta-Vidal A, Durbin R, Heng L, Birney E. Genome Research 2008 Nov 4. Please cite this publication if you use the homologies from Ensembl.
There is a clear concordance with reciprocal best approaches in the simple case of unique orthologous genes. However, the gene tree pipeline is able to find more complex one-to-many and many-to-many relations. This, for instance, significantly raises the number of Teleost (bony fish) to Mammal orthologs and has even more dramatic effects on Fly/Mammal or Worm/Mammal orthologous gene predictions. Using this approach, we are also able to "time" duplication events which produce paralogs by identifying the most recent common ancestor (=taxonomy level) for a given internal node of the tree.
The gene orthology and paralogy prediction pipeline has 6 basic steps:
- Load the longest translation of each gene from all species used in Ensembl
- run WUBlastp+SmithWaterman of every gene against every other (both self and non-self species) in a genome-wise manner
- Build a sparse graph of gene relations based on Blast scores and generate clusters using hcluster_sg1
- For each cluster, build a multiple alignment based on the protein sequences using a combination of multiple aligners, consensified by MCoffee2
- For each aligned cluster, build a phylogenetic tree using TreeBeST3 using the CDS back-translation of the protein multiple alignment from the original DNA sequences**. A rooted tree with internal duplication tags is obtained at this stage, reconciling it with the species tree in ensembl-compara/scripts/pipeline/species_tree_njtree.taxon_id.nh (refer to section "Create the species tree file" in ensembl-compara/scripts/pipeline/README-genetree for a more detailed explanation).
- From each gene tree, infer gene pairwise relations of orthology* and paralogy types.
* There is an extra optional step which calculates the dN/dS values for the orthology relationships of closely related pairs of species. We use codeml in the PAML4 package (model=0, NSsites=0) for this (ensembl-compara/scripts/homology/codeml.ctl.hash).
We only calculate these dN/dS values for high-coverage closely-related pairs of species. When the species evolutionary distance is too large, the saturation of the dS values biases the estimated dN/dS ratio. When the dS value of a given ortholog gene pair is too large, we mask out the dN/dS ratio, although the N, S, dN, dS and log-likelihood (lnL) values can still be obtained for all pairs.
** A description of the tree building method in TreeBeST: the CDS back-translated protein alignment (i.e., codon alignment) is used to build 5 different trees:
- a maximum likelihood (ML) tree built using phyml, based on the protein alignment with the WAG model;
- a ML tree built using phyml, based on the codon alignment with the Hasegawa-Kishino-Yano (HKY) model;
- a neighbour-joining (NJ) tree using p-distance, based on the codon alignment;
- a NJ tree using dN distance, based on the codon alignment; and
- a NJ tree using dS distance, based on the codon alignment.
For (1) and (2), TreeBeST uses a modified version of phyml release 2.4.5 which takes an input species tree, and tries to build a gene tree that is consistent with the topology of the species tree. This "species-guided" phyml uses the original phyml tree-search code. However, the objective function maximised during the tree-search is multiplied by an extra likelihood factor not found in the original phyml. This extra likelihood factor reflects the number of duplications and losses inferred in a gene tree, given the topology of the species tree. The species-guided phyml allows the gene tree to have a topology that is inconsistent with the species tree if the alignment strongly supports this. The species tree is based on the NCBI taxonomy tree (subject to some modifications depending on new datasets).
The final tree is made by merging the five trees into one consensus tree using the "tree merging" algorithm. This allows TreeBeST to take advantage of the fact that DNA-based trees often are more accurate for closely related parts of trees and protein-based trees for distant relationships, and that a group of algorithms may outperform others under certain scenarios. The algorithm simultaneously merges the five input trees into a consensus tree. The consensus topology contains clades found in any of the input trees, where the clades chosen are those that minimize the number of duplications and losses inferred, and have the highest bootstrap support. Branch lengths are estimated for the final consensus tree based on the DNA alignment, using phyml with the HKY model.
- any gene pairwise relation where the ancestor node is a
speciation event. We predict several descriptions of orthologues:
- apparent_ortholog_one2one (is a special case, see  below)
- any gene pairwise relation where the ancestor node is a duplication event. We predict several descriptions of paralogues:
- other_paralog (see  below)
- contiguous_gene_split (see  below)
- putative_gene_split (see  below)
A within_species_paralog corresponds to a relation between 2 genes of the *same* species where the ancestor node has been labelled as a duplication node e.g. H2:H2', M2:M2' but does not necessarily mean that the duplication event has occurred in this species only. For example, M2':M3' are also within_species_paralog but the duplication event has occurred in the common ancestor between species H and species M. If H is human and M Mouse, the taxonomy level "times" the duplication event to the ancestor of "Euarchontoglires".  When paralogue genes are too distant to be in the same gene tree, but are still part of a broader "super-family", they are labelled as other_paralog relationships. In this case, the precise taxonomic level of the duplication event is left as undetermined.
A between_species_paralog corresponds to a relation between 2 genes of *different* species where the ancestor node has been labelled as a duplication node e.g. M1:H2 or M1:H3.  Special cases of between_species_paralog can be singled out where they can be characterised as one2one relations. Such cases are then relabelled apparent_ortholog_one2one. These can be the results of real duplications followed by gene losses (as shown in the picture below), but can also be the results of a wrong gene tree topology and wrong duplication node labelling. When the tree topology cannot resolve real orthology pairs, they will remain in the database as between_species_paralogs, reflecting long-distance relations that may be upgraded to orthologies in further versions of the gene tree pipeline, or for example, with improved gene builds.
 A gene split paralog is an artefactual type of paralogy that commonly arises from a fragmented genome assembly or a gene prediction that is poor in supporting evidences (cDNA, ESTs, proteins, etc.). A putative_gene_split is called when there is no overlap between the gene fragments in the same species and they lie in different sequence regions in the assembly (See GeneB1/GeneB2 in the image below). A contiguous_gene_split is called when the two fragments lie close to each other (<1MB) in the same sequence region and the same strand (See GeneA1/GeneA2 below).
The original Ensembl method of calculating these different events relies heavily on the use of a reference species tree; normally the taxonomy tree with dubious internal nodes removed. When working with well defined and distant species e.g. metazoans the taxonomy is more than adequate since the tree used does not have to include distances; it is the topology of the species which is important. However Ensembl Genomes works with species whose taxonomy is confusing and sometimes incorrect. Because of this we have developed a system for estimating the species trees for these problematic clades. To calculate new species trees we:
Find complete clusters of genes from the previous clade homology run release with the following properties
- Every genome in the original run of the homology pipeline is represented
- Each genome is represented once in the cluster (to avoid 1:m relationships & paralogs)
- For each clade blastp each member against the new species
- Select candidate extensions to the clades based upon consensus between all clade members
- From this set of clades select 10 at random
- Concatenate the cDNA of all proteins in the cluster together to produce a concatenated alignment
- Pass this alignment into Leaphy5
- Take the tree with the greatest likelihood as the new species tree
We do endeavour to use the taxonomy and community donated trees as & where they exist and are suitable. In situations where they do not the above methodology is applied.
Taxonomic Node Calls in GeneTrees
Under normal circumstances each node in a gene tree can be represented by a point on the taxonomy. However when we have used our own species trees this is not the case (since they have been used in scenarios where taxonomy is poorly defined). In this situations internal gene tree nodes are then assigned to a nominated taxonomic node by looking at common shared ancestors between the children of that node. The algorithm does mean that when we look at a flat part of the taxonomy many internal nodes will have the same taxonomic node assigned.
Within a division of Ensembl Genomes each genome is used in the homology pipeline. This means within a clade we will attempt to classify each and every relationship between proteins possible. However due to constraints in processing pan-taxonomic compara does not use an exhaustive list of species. The species used are:
- Anolis carolinensis
- Anopheles gambiae
- Apis mellifera
- Arabidopsis thaliana
- Bacillus subtilis subsp. subtilis str. 168
- Borrelia burgdorferi B31
- Buchnera aphidicola str. APS (Acyrthosiphon pisum)
- Caenorhabditis elegans
- Ciona savignyi
- Danio rerio
- Daphnia pulex
- Dictyostelium discoideum
- Drosophila melanogaster
- Emericella nidulans
- Escherichia coli str. K-12 substr. MG1655
- Gallus gallus
- Gasterosteus aculeatus
- Homo sapiens
- Leishmania major
- Monodelphis domestica
- Mus musculus
- Mycobacterium tuberculosis H37Rv
- Neisseria meningitidis Z2491
- Nematostella vectensis
- Neurospora crassa
- Ornithorhynchus anatinus
- Oryza sativa Japonica Group
- Pan troglodytes
- Phaeodactylum tricornutum
- Physcomitrella patens
- Phytophthora infestans T30-4
- Plasmodium falciparum 3D7
- Pongo abelii
- Puccinia graminis f. sp. tritici CRL 75-36-700-3
- Pyrococcus horikoshii OT3
- Saccharomyces cerevisiae
- Schistosoma mansoni
- Schizosaccharomyces pombe
- Staphylococcus aureus subsp. aureus N315
- Streptococcus pneumoniae TIGR4
- Streptococcus pyogenes M1 GAS
- Strongylocentrotus purpuratus
- Trichoplax adhaerens
- Vitis vinifera
- Wolbachia sp. wMel
- Xenopus tropicalis
Notes and References
- Hcluster_sg : hierarchical clustering software for sparse graphs. hcluster_sg performs hierarchical clustering under mean distance. It reads an input file that describes the similarity between two sequences, and groups two nearest nodes at each step. When two nodes are joined, the distance between the joined node and all the other nodes are updated by mean distance. This procedure is iterated until one of the three rules is met: (a) Do not merge cluster A and B if the total number of edges between A and B is smaller than |A|*|B|/3, where |A| and |B| are the sizes of A and B, respectively. This rule guarantees each cluster is compact. (B) Do not join A to any other cluster if |A| < 500. This rule avoids huge clusters which may cause computational burden for multialignment and tree building as well. (C) Do not join A and B if both A and B contain plant genes or both A and B contain Fungi genes. This rule tries to find animal gene families. TreeFam clustering is done with outgroups. Hcluster_sg also introduces an additional edge breaking rule: removes an edge between cluster A and B if the number of edges between A and B is smaller than |A|*|B|/10. This heuristic rule removes weak relations which are quite unlikely to be joined at a later step.
- Wallace IM, O'Sullivan O, Higgins DG, Notredame C. "M-Coffee: combining multiple sequence alignment methods with T-Coffee." Nucleic Acid Research. 2006 Mar 23;34(6):1692-9. The current aligner sets are divided into 3 levels, each alignment is tried with the first, and retried on the second or third on failure. First level: mafftgins_msa, muscle_msa, kalign_msa, t_coffee_msa; second level: mafft_msa, muscle_msa, clustalw_msa, kalign_msa; third level: mafft --auto.
- Li, H et al., TreeBeST: http://treesoft.sourceforge.net/treebest.shtml. Manuscript in preparation. TreeBeST was previously known as NJTREE.
- Yang, Z. "PAML: a program package for phylogenetic analysis by maximum likelihood" Comput Appl Biosci. 1997 Oct; 13(5):555-556.
- Whelan S. "New Approaches to Phylogenetic Tree Search and Their Application to Large Numbers of Protein Alignments" Systematic Biology 2007 56(5):727-740