Annotation of external cross references

The identifiers, names and descriptions of the genes, transcripts and translations in Ensembl Genomes are typically imported from or created in collaboration with the relevant communities for a given species. In addition, external cross references to these objects are automatically created from various other databases as part of the standard release process, as described below.

External cross references are useful for several interrelated purposes:

  1. They can provide gene names and descriptions where no explicit data is readily available from the community.
  2. The imported names, synonyms and descriptions can improve the quality of search results and aid in the biological interpretation of data.
  3. They allow genes, transcripts and translations in Ensembl Genomes to be referenced using all the commonly available names, synonyms and identifiers for the species.
  4. They allow bulk translation of identifiers (using BioMart), facilitating data exchange between heterogeneous bioinformatics services.

Automatically creating and importing external cross references

There are two types of external cross reference (XRef): direct and dependent.

  • A direct XRef is one that can be directly linked to a gene, transcript or translation object in Ensembl Genomes by synonymy or sequence similarity.
  • A dependent XRef is one that is transitively linked to the object via a direct XRef.

For example, the translation for the Arabidopsis thaliana gene AT1G58030 in Ensembl Plants is identical to the UniProtKB/Swiss-Prot protein CAAT2_ARATH, giving us a direct XRef by synonymy. Additionally, CAAT2_ARATH is annotated within UniProtKB with XRefs to more than 20 other databases. A subset of these XRefs (e.g. BT046174 in the European Nucleotide Archive) is additionally imported as 'dependent' XRefs based on the original 'direct' XRef.
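The relationship in this example can be sketched as a small data model. This is only an illustration of the direct/dependent distinction, not the Ensembl Genomes schema: the class names `XRef` and `AnnotatedObject`, their fields, and the label used for the translation are all assumptions made for the example; only the accessions and database names come from the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class XRef:
    source: str                       # external database name
    accession: str                    # identifier within that database
    kind: str                         # "direct" or "dependent"
    parent: Optional["XRef"] = None   # the direct XRef a dependent XRef hangs off


@dataclass
class AnnotatedObject:
    """Hypothetical stand-in for a gene, transcript or translation."""
    label: str
    xrefs: List[XRef] = field(default_factory=list)

    def add_direct(self, source: str, accession: str) -> XRef:
        xref = XRef(source, accession, "direct")
        self.xrefs.append(xref)
        return xref

    def add_dependent(self, parent: XRef, source: str, accession: str) -> XRef:
        # A dependent XRef is only reachable through a direct XRef
        # that is already attached to this object.
        if parent.kind != "direct" or not any(x is parent for x in self.xrefs):
            raise ValueError("a dependent XRef needs a direct XRef on this object")
        xref = XRef(source, accession, "dependent", parent)
        self.xrefs.append(xref)
        return xref


# The label is a placeholder; the text does not give the translation's own ID.
translation = AnnotatedObject("translation of AT1G58030")
direct = translation.add_direct("UniProtKB/Swiss-Prot", "CAAT2_ARATH")
dependent = translation.add_dependent(
    direct, "European Nucleotide Archive", "BT046174"
)

for x in translation.xrefs:
    via = f" via {x.parent.accession}" if x.parent else ""
    print(f"{x.kind:9} {x.source}: {x.accession}{via}")
```

Keeping a reference to the parent on each dependent XRef makes the transitive link explicit: removing or invalidating the direct XRef identifies exactly which dependent XRefs lose their support.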

The process of importing XRefs for a given species consists of loading direct and dependent XRefs from a pre-defined set of sources. Each source is configured to map either directly (by synonymy) or by sequence alignment using exonerate. Sources may be generic, applying equally to all species; taxon-specific; or specific to a single species.

Note that species imported directly from the INSDC archives are processed differently, having direct XRefs for the primary INSDC feature and dependent XRefs from UniProtKB. For more details, see INSDC annotation import.

Ontology annotation

Cross references are also used as the mechanism for adding ontology annotations to genes, transcripts and translations in Ensembl Genomes. Ontology terms, typically but not exclusively from the Gene Ontology, are imported from four distinct sources: the first two are part of the XRef pipeline described above, and the last two are separate:

  1. As dependent XRefs from the UniProtKB source (itself derived from GOA).
  2. From additional community sources, including annotations hosted in GOA but not yet imported into UniProtKB.
  3. Via the InterPro2GO pipeline, using the results of the protein feature annotation pipeline.
  4. By projection of manually annotated terms from a well-annotated species to other related species via one-to-one orthologues (identified using gene trees). Terms with the following evidence codes are projected: IDA, IC, IGI, IMP, IPI, ISS, NAS, ND, RCA, TAS.
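The evidence-code filter in the projection step can be sketched as follows. This is a minimal illustration, not the production pipeline: the record layout, the function name `project_terms`, and the gene identifiers are hypothetical, and the sketch deliberately does not model how projected annotations are labelled afterwards. Only the list of evidence codes is taken from the text.

```python
# Evidence codes listed above as eligible for projection.
PROJECTABLE_EVIDENCE = frozenset(
    {"IDA", "IC", "IGI", "IMP", "IPI", "ISS", "NAS", "ND", "RCA", "TAS"}
)


def project_terms(annotations, orthologues):
    """Copy eligible terms from source genes to their one-to-one orthologues.

    annotations: list of dicts with "gene", "term" and "evidence" keys
                 (hypothetical layout).
    orthologues: dict mapping a source gene to its single target gene;
                 genes without a one-to-one orthologue are simply absent.
    """
    projected = []
    for ann in annotations:
        target = orthologues.get(ann["gene"])
        if target is None:
            continue  # no one-to-one orthologue in the target species
        if ann["evidence"] not in PROJECTABLE_EVIDENCE:
            continue  # e.g. IEA is not on the list, so it is not projected
        projected.append(
            {
                "gene": target,
                "term": ann["term"],
                "projected_from": ann["gene"],
                "source_evidence": ann["evidence"],
            }
        )
    return projected


# Illustrative input only; the gene names are placeholders.
annotations = [
    {"gene": "source_gene_1", "term": "GO:0005634", "evidence": "IDA"},
    {"gene": "source_gene_1", "term": "GO:0005515", "evidence": "IEA"},
    {"gene": "source_gene_2", "term": "GO:0005515", "evidence": "IPI"},
]
orthologues = {"source_gene_1": "target_gene_1"}

projected = project_terms(annotations, orthologues)
for record in projected:
    print(record)
```

In this input only the first annotation is projected: the second has an evidence code outside the list, and the third belongs to a gene with no one-to-one orthologue.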

Ontology annotations come with additional standard information about the underlying evidence supporting the annotation and qualifiers to refine the scope of the annotations.

Common XRef sources

We import XRefs from this list of sources for all our species, unless otherwise specified on the species homepage:

  • UniProtKB
    • GO
    • ArrayExpress
  • InterPro
  • RefSeq
    • EntrezGene
  • UniParc

and, where data is available, from this list of sources:

  • RFAM
  • miRBase
  • UniGene
  • RNAMMER
  • TRNASCAN_SE
  • PHIbase
  • Gramene Pathway

and an additional list of more than 100 species- or taxon-specific sources (see individual species pages for details).

Accessing XRefs

Gene, transcript and translation XRefs are tabulated on the gene and transcript pages under the External references section of the left hand menu. For example, see gene AT3G52430 or transcript AT3G52430.1 for Arabidopsis thaliana. Additionally, XRefs can be retrieved through BioMart or queried via the Perl or REST APIs.
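As a rough sketch of REST access, the snippet below builds a request for the XRefs of a stable ID and groups the returned records by source database. Several things here are assumptions rather than statements from this page: the server address `https://rest.ensembl.org`, the `xrefs/id/:id` endpoint path, and the `dbname` and `primary_id` field names in the JSON response should all be checked against the current REST documentation. The helper names and the sample records are invented for illustration.

```python
import json
from collections import defaultdict
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed server address; confirm against the REST API documentation.
REST_SERVER = "https://rest.ensembl.org"


def xref_url(stable_id, server=REST_SERVER):
    """Build the URL for looking up XRefs by stable ID (assumed endpoint)."""
    return f"{server}/xrefs/id/{quote(stable_id)}?content-type=application/json"


def fetch_xrefs(stable_id, server=REST_SERVER):
    """Fetch XRef records for a stable ID. Requires network access."""
    request = Request(xref_url(stable_id, server),
                      headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def group_by_source(records):
    """Group XRef accessions by source database (assumed field names)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.get("dbname", "unknown")].append(record.get("primary_id"))
    return dict(grouped)


# Offline demonstration with made-up records shaped like the assumed response.
sample_records = [
    {"dbname": "EXAMPLE_DB_A", "primary_id": "ACC1"},
    {"dbname": "EXAMPLE_DB_A", "primary_id": "ACC2"},
    {"dbname": "EXAMPLE_DB_B", "primary_id": "ACC3"},
]
print(xref_url("AT3G52430"))
print(group_by_source(sample_records))
```

To query the live service, call `group_by_source(fetch_xrefs("AT3G52430"))`; the offline sample is used above so that the sketch runs without a network connection.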

Ontology annotations are displayed on dedicated views on the gene and transcript pages under the Ontology section of the left hand menu. For example, see gene AT3G52430 or transcript AT3G52430.1.