Repeat feature annotation

If repeat features are present in the INSDC record when a genome is loaded, they are imported into Ensembl Genomes. For bacterial genomes, this is currently the only source of repeat data. For other divisions, a computational pipeline is also run to annotate three types of repeat:

  • Low-complexity regions (Dust [1])
  • Tandem repeats (TRF [2])
  • Complex repeats (RepeatMasker [3])

Annotating repeats with RepeatMasker requires a repeat library. In most cases, a species-specific library is not available, so the Repbase [4] database of eukaryotic repetitive elements is used. Repeat libraries from the following sources are used and combined where possible:

Viewing and accessing repeat features

By default, repeat features are not displayed in the genome browser. To display them, use the "Configure this page" option. You can view all repeats, or a subset of repeats selected by type.

The repeat annotations can be accessed programmatically using the Ensembl API. See the RepeatFeature and RepeatFeatureAdaptor documentation for further details.
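As an alternative illustration of programmatic access, the sketch below uses the Ensembl REST service rather than the Ensembl API classes named above. It assumes the public server at https://rest.ensembl.org and its overlap/region endpoint with feature=repeat; the species name, the region, and the helper names repeat_overlap_url and fetch_repeats are hypothetical choices made for this example.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: the public Ensembl REST server; adjust if your species is served elsewhere.
REST_SERVER = "https://rest.ensembl.org"


def repeat_overlap_url(species, region, server=REST_SERVER):
    """Build the URL that asks for repeat features overlapping a region.

    `region` uses the form "seq_region:start-end", e.g. "1:1-50000".
    """
    return (
        f"{server}/overlap/region/{quote(species)}/"
        f"{quote(region, safe=':-')}?feature=repeat"
    )


def fetch_repeats(species, region, server=REST_SERVER):
    """Fetch repeat features as a list of dicts (performs a network request)."""
    request = urllib.request.Request(
        repeat_overlap_url(species, region, server),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_repeats("arabidopsis_thaliana", "1:1-50000") would return the parsed JSON response for that region, provided the server hosts that species under that name.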

For Ensembl Plants species only, tandem repeats annotated by the TRF program are excluded when soft-masking and hard-masking the genome sequences.
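To make the masking terms concrete, here is a minimal sketch, not the Ensembl pipeline itself. It follows the common convention that soft-masking writes repeat bases in lowercase and hard-masking replaces them with N. The feature tuples, the source labels "dust" and "trf", and the function names are assumptions made for illustration.

```python
def mask_sequence(sequence, intervals, hard=False):
    """Mask 1-based, inclusive intervals in a DNA sequence.

    Soft-masking lowercases the masked bases; hard-masking replaces them with N.
    """
    chars = list(sequence.upper())
    for start, end in intervals:
        # Clamp to the sequence bounds and convert to 0-based indices.
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


def intervals_for_masking(features, excluded_sources=()):
    """Keep (start, end) for features whose source is not excluded."""
    return [
        (start, end)
        for start, end, source in features
        if source not in excluded_sources
    ]


# Hypothetical repeat features: (start, end, source).
features = [(3, 5, "dust"), (9, 12, "trf")]
sequence = "ACGTACGTACGT"

# Excluding TRF features, as described for Ensembl Plants species.
plant_intervals = intervals_for_masking(features, excluded_sources={"trf"})
soft_masked = mask_sequence(sequence, plant_intervals)
hard_masked = mask_sequence(sequence, plant_intervals, hard=True)

# Masking with every repeat type, for comparison.
all_soft_masked = mask_sequence(sequence, intervals_for_masking(features))

print(soft_masked)      # ACgtaCGTACGT
print(hard_masked)      # ACNNNCGTACGT
print(all_soft_masked)  # ACgtaCGTacgt
```

In the first two results the tandem repeat at positions 9 to 12 is left unmasked, which mirrors the exclusion described above.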

References

  1. Morgulis A et al. (2006) A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol. 13:1028-1040
  2. Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27:573-580
  3. Smit AFA, Hubley R, Green P (1996-2010) RepeatMasker Open-3.0 http://www.repeatmasker.org
  4. Jurka J et al. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 110:462-467