Super trees

We define a super tree as a structure that links multiple gene trees together. Both our protein-tree and ncRNA-tree resources contain super trees.

They arise in two situations:

  1. a dynamic process to cater for large families,
  2. the handling of some Panther sub-families

In both cases (described below), we infer some homologies across the super tree, so two genes recorded as orthologues or paralogues may belong to different gene trees.

Breaking down large families

Large families that would be too complex to analyse are recursively broken down with QuickTree. The current limits are 1,500 genes for protein trees and 400 genes for ncRNA trees. The split is triggered dynamically, based on the gene count of each family.

The process generates a set of sub-families, each smaller than the required size, and a reconciled binary tree that links them. We apply our standard homology inference to this tree, but only for paralogues, as these super trees generally capture ancient duplication events. We call these paralogues "Ancient paralogues" in the Paralogues table of the web site, and other_paralog in the database and in BioMart.
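The recursive breakdown can be illustrated with a minimal Python sketch. It is not the production pipeline: the input tree stands in for a tree that a tool such as QuickTree would produce, the nested-tuple representation and the names leaves and break_down are hypothetical, and treating the limit as inclusive is an assumption. Reconciliation with the species tree is not shown.

```python
from typing import List, Tuple, Union

# A tree is either a leaf (a gene identifier) or a pair of subtrees.
# This representation is an illustrative assumption, not the real data model.
Tree = Union[str, Tuple["Tree", "Tree"]]


def leaves(tree: Tree) -> List[str]:
    """Return the gene identifiers under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def break_down(tree: Tree, limit: int, sub_families=None):
    """Cut a binary tree into sub-families of at most `limit` genes.

    Returns (super_tree, sub_families). The super tree keeps the binary
    shape of the upper part of the input tree; each of its leaves is an
    index into the sub_families list.
    """
    if sub_families is None:
        sub_families = []
    genes = leaves(tree)
    # Stop splitting once the subtree is small enough (or cannot be split).
    if len(genes) <= limit or isinstance(tree, str):
        sub_families.append(genes)
        return len(sub_families) - 1, sub_families
    left, right = tree
    left_node, _ = break_down(left, limit, sub_families)
    right_node, _ = break_down(right, limit, sub_families)
    return (left_node, right_node), sub_families


# Toy family of 7 genes with a limit of 3 (real limits are far larger).
example = ((("g1", "g2"), "g3"), (("g4", "g5"), ("g6", "g7")))
super_tree, sub_families = break_down(example, limit=3)
print(super_tree)
print(sub_families)
```

In this toy case the left subtree already fits the limit and is kept whole, while the right subtree is split once more, so the super tree links three sub-families.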

Panther sub-families in the HMM library

We classify the protein-coding genes into families using a library of HMMs based on Panther. Following an assessment of family sizes and quality across all eukaryotes, we decided to use the Panther sub-families instead of the families in some cases. In those cases, the HMM library contains only the sub-families (as if they were families), and not their parent family.

For each of these broken-down Panther families, we create a super tree to record the fact that there is a known homology between the sub-families. However, we do not compute the topology of this super tree, so it remains flat (up to 300 nodes). It is therefore not reconciled with the species tree and lacks speciation / duplication annotations. To infer homologies between sub-families, we compare them pairwise. If a pair of sub-families shares at least one species, we record all paralogues between them as Ancient paralogues / other_paralog, as above. Otherwise, we record every pair of genes between them as orthologues, named according to our standard rule for relationship cardinality.
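The pairwise comparison can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual implementation: genes are modelled as (gene_id, species) tuples, the function names are hypothetical, "paralogues between them" is read as same-species gene pairs spanning the two sub-families, and the cardinality labels ortholog_one2one, ortholog_one2many and ortholog_many2many are assumed to be what the standard naming rule produces, counted per species pair.

```python
from collections import Counter
from itertools import product
from typing import List, Tuple

Gene = Tuple[str, str]  # (gene_id, species); a hypothetical representation


def cardinality(n_a: int, n_b: int) -> str:
    """Name an orthology from the gene counts on each side (assumed rule)."""
    if n_a == 1 and n_b == 1:
        return "ortholog_one2one"
    if n_a == 1 or n_b == 1:
        return "ortholog_one2many"
    return "ortholog_many2many"


def compare_sub_families(fam_a: List[Gene], fam_b: List[Gene]) -> List[Tuple[str, str, str]]:
    """Return (gene_a, gene_b, homology_type) for one pair of sub-families."""
    count_a = Counter(species for _, species in fam_a)
    count_b = Counter(species for _, species in fam_b)
    homologies = []
    if set(count_a) & set(count_b):
        # At least one shared species: record only the paralogues,
        # read here as same-species pairs across the two sub-families.
        for (gene_a, sp_a), (gene_b, sp_b) in product(fam_a, fam_b):
            if sp_a == sp_b:
                homologies.append((gene_a, gene_b, "other_paralog"))
    else:
        # No shared species: every cross pair is recorded as an orthologue.
        for (gene_a, sp_a), (gene_b, sp_b) in product(fam_a, fam_b):
            label = cardinality(count_a[sp_a], count_b[sp_b])
            homologies.append((gene_a, gene_b, label))
    return homologies


fam_a = [("a1", "human"), ("a2", "human"), ("a3", "mouse")]
fam_b = [("b1", "zebrafish")]
fam_c = [("c1", "mouse"), ("c2", "chicken")]

no_overlap = compare_sub_families(fam_a, fam_b)
overlap = compare_sub_families(fam_a, fam_c)
print(no_overlap)
print(overlap)
```

Here fam_a and fam_b share no species, so all three cross pairs become orthologues, whereas fam_a and fam_c share mouse, so only the mouse pair is recorded, as other_paralog.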