The Ensembl Genomes public MySQL Servers

Ensembl Genomes and the Ensembl software platform use the MySQL relational database management system to store data. MySQL databases are used by the web browser and REST service, and can be used with the Ensembl Perl API or directly with a MySQL client (see below). The schema used by the Ensembl platform are described in the Ensembl API documentation.

MySQL databases are also used by the BioMart data warehouse interface, although we recommend that you use the web interface to access data in BioMart, as the mart schema contains many tables of denormalised data. Data can also be retrieved from BioMart programmatically, using the XML-based martservice.

Public MySQL Servers

Ensembl Genomes operates a MySQL server for public use which contains all databases from the last 10 Ensembl Genomes releases, including those used for the BioMart interface. This server can be used in conjuction with the Ensembl public MySQL servers (though Ensembl Genomes do not currently offer any mirror servers). Details of Ensembl and Ensembl Genomes servers are shown below, and all servers can be accessed using the user 'anonymous' (no password required):

Dataset Server Port
Ensembl Genomes, all databases 4157
Ensembl 5306
Ensembl Mart 5316

Note: Ensembl Genomes and Ensembl MySQL servers are located at different URLs. Ensembl Genomes databases from all five divisions are located on the same server. Ensembl BioMart is on a different server. Not all MySQL instances use the default port, so please ensure that you specify the correct port when trying to connect!

The MySQL server is provided 'as-is', though scheduled downtime will be publicised on the mailing lists and on the Ensembl blog. Please note that processes that use this service excessively to the detriment of other users may be terminated without warning to preserve the service functionality. For intensive use of this server, an alternative is to set up a local MySQL database with copies of Ensembl Genomes data.

The public MySQL server is restricted to data from the most recent 10 releases only. With each subsequent release, data from the oldest release will be removed at release time. The removal of each release will be announced with the intentions for the new release approximately 2 months in advance. MySQL dumps from all previous releases are always available from the Ensembl Genomes FTP site.

Installing local MySQL databases

When creating a Ensembl Genomes mirror, or to use local MySQL databases with the Ensembl Perl API, dumps of MySQL databases can be downloaded from the Ensembl Genomes FTP site. Instructions for loading MySQL dumps onto a local MySQL server can be found in the webcode documentation.

Naming conventions

Most Ensembl Genomes MySQL databases follow a standard naming convention, using the following

  • genus_species is the binomial name (strain may be appended to species)
  • eg_version is the version of Ensembl Genomes for this release
  • ensembl_version is the version of the Ensembl schema/API used
  • assembly_version is the version of the genome assembly
  • division is the name of the Ensembl Genomes division (bacteria, metazoa, fungi, plants, protists; pan_homology is the pan-taxonomic compara database)

The following conventions apply:

  • core databases - <genus_species>_core_<eg_version>_<ensembl_version>_<assembly_version>
  • variation databases - <genus_species>_variation_<eg_version>_<ensembl_version>_<assembly_version>
  • funcgen databases - <genus_species>_funcgen_<eg_version>_<ensembl_version>_<assembly_version>
  • compara databases - ensembl_compara__<eg_version>_<ensembl_version>
  • mart databases - _mart_<eg_version>, _seq_mart_<eg_version>, _snp_mart_<eg_version>

Collection databases

For Ensembl Bacteria, the genomes are loaded in batches of up to 250 into databases matching the pattern bacteria_<db_number>_collection_core_<eg_version>_<ensembl_version>_1 where:

  • db_number is the number representing the individual collection of genomes
  • eg_version is the version of Ensembl Genomes for this release
  • ensembl_version is the version of the Ensembl schema/API used

The terminal number (which usually represents the assembly version) has no significance in this context. For the majority of genomes, there is no significance to which databases contain which genomes, but any new genomes are added to the last database in the list by db_number (or a new database with the next value of db_name if that has more than 250 genomes). However, genomes used in pan compara, which typically are references for key species, are found in collection bacteria_0. For this reason, users are strongly recommended to use the Ensembl Genomes API to access Ensembl Bacteria.

For Ensembl Fungi and Protists, from release 27 onwards, a large proportion of genomes are stored in collection databases in a similar way to bacteria, following the pattern fungi|protists_<collection_name>_collection_<eg_version>_<ensembl_version>_1. The naming convention is above, except the collection name is based on taxonomic grouping e.g. fungi_ascomycota1_collection_core_27_80_1 is the first database containing genomes from Ascomycota.

In either case, please note that SQL statements designed to work on single species MySQL databases may not always work on collection databases. For instance, the gene table in a collection database contains genes for all genomes, so the following statement returns all protein-coding genes for all genomes:

select stable_id from gene where biotype='protein_coding';

To obtain genes for a given species, the following style of query is needed, where the genes are restricted to those coming from a genome with a particular name:

select stable_id from gene join seq_region using (seq_region_id) 
join coord_system using (coord_system_id) join meta using (species_id) 
where meta_key="species.production_name" and meta_value="escherichia_coli_str_k_12_substr_mg1655"
and biotype="protein_coding";

Given this, you are strongly recommended to use the REST service, and Ensembl Perl API where possible.