Ensembl Metadata Perl API

Whilst all Ensembl Genomes databases can be accessed using the standard Ensembl API, Ensembl Bacteria loads up to 250 genomes into a single database, which presents some barriers to easy access. To overcome this, a metadata API is provided to make accessing the data easier.

The Bio::EnsEMBL::LookUp object provides an interface for loading the EnsEMBL Registry with the large numbers of genomes from multiple databases, and allows individual DBAdaptor objects to be retrieved for genomes that match criteria including ENA accession (or other seq_region name), Genome Assembly accession, species name and taxonomy ID.

Once Bio::EnsEMBL::DBAdaptor objects have been returned, they can be used as for any Ensembl species. Alternatively, the now-loaded Bio::EnsEMBL::Registry object can be accessed directly.

Installing the API

First, the standard Ensembl API and its dependencies should be installed. This should be the same version of Ensembl that is used by Ensembl Bacteria.

Secondly, the ensembl-metadata API should be installed from the GitHub repository. You will need to add the modules directory from this package to your PERL5LIB. You may also need to install the CPAN package JSON; please consult your local systems administrator if you are unsure how to do this.
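
A quick way to confirm the installation is to try loading each module and report any that cannot be found. The following is a minimal sketch rather than part of the API: the helper name check_modules is hypothetical, and the list of modules checked is an assumption based on the packages named above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): try to load each
# module by name and record whether it could be loaded from @INC,
# which includes any directories listed in PERL5LIB.
sub check_modules {
  my @modules = @_;
  my %status;
  for my $module (@modules) {
    # convert Some::Module into Some/Module.pm, as require expects
    (my $file = "$module.pm") =~ s{::}{/}g;
    $status{$module} = eval { require $file; 1 } ? 1 : 0;
  }
  return \%status;
}

# Assumed module list: the core Ensembl API, the metadata API and JSON
my $status = check_modules(
  'Bio::EnsEMBL::Registry', 'Bio::EnsEMBL::LookUp', 'JSON' );
for my $module (sort keys %$status) {
  print $module . ": " . ($status->{$module} ? "ok" : "MISSING") . "\n";
}
```

Any module reported as MISSING usually points to a directory that has not yet been added to PERL5LIB, or to a CPAN package that still needs installing.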

Basic use

The default mode for using the API is to use a specialised lookup database on the public MySQL server.

Building a helper

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
my $lookup = Bio::EnsEMBL::LookUp->new();

Once instantiated, the helper can be queried to retrieve Ensembl DBAdaptors in various ways:

Getting DBAdaptors by name

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
my $lookup = Bio::EnsEMBL::LookUp->new();
my $dba = $lookup->get_by_name_exact('escherichia_coli_str_k_12_substr_mg1655');   
my @dbas = @{$lookup->get_all_by_name_pattern('escherichia_coli_.*')};

Getting DBAs by taxonomy

To get all genomes for an organism identified by a given taxonomic node, supply the NCBI taxonomy ID to get_all_by_taxon_id(). For instance, for Streptococcus sanguinis (strain SK36):

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
my $lookup = Bio::EnsEMBL::LookUp->new();
my @dbas = @{$lookup->get_all_by_taxon_id(388919)};

To get all genomes belonging to a branch of the taxonomy, supply the NCBI taxonomy ID of the root node for that branch to get_all_by_taxon_branch(). For instance, to find all genomes from the genus Escherichia:

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
my $lookup = Bio::EnsEMBL::LookUp->new();
my @dbas = @{$lookup->get_all_by_taxon_branch(561)};

Getting DBAs by genomic INSDC accession

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
my $lookup = Bio::EnsEMBL::LookUp->new();
# get_all_by_accession() returns an array reference, like the other get_all_* methods
my ($dba) = @{$lookup->get_all_by_accession("U00096")};

Getting DBAs by Genome Assembly accession

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
my $lookup = Bio::EnsEMBL::LookUp->new();
my $dba = $lookup->get_by_assembly_accession("GCA_000005845.1");

Once obtained, DBAdaptor objects can be used as normal for an Ensembl species, for example:

my $genes = $dba->get_GeneAdaptor()->fetch_all();
print "Found ".scalar @$genes." genes for ".$dba->species()."\n";

Important: when you have finished with a DBAdaptor object for the time being, disconnect it with the following method to avoid running out of connections on the MySQL server being used:

$dba->dbc()->disconnect_if_idle();

Disconnected DBAdaptor objects can be used again without manually reconnecting.
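
When looping over many genomes, the disconnect step is easy to miss, particularly if the per-genome code can throw an exception. The sketch below wraps the pattern in a helper. The name for_each_dba is hypothetical and not part of the API; it relies only on the species() and dbc()->disconnect_if_idle() calls shown in this document.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): run a callback on
# each DBAdaptor in turn and disconnect afterwards, even if the callback
# dies. Returns an array reference of the species names that failed.
sub for_each_dba {
  my ($dbas, $callback) = @_;
  my @failed;
  for my $dba (@$dbas) {
    my $ok  = eval { $callback->($dba); 1 };
    my $err = $@;
    # release the connection whether or not the callback succeeded
    $dba->dbc()->disconnect_if_idle();
    if (!$ok) {
      warn "Skipping " . $dba->species() . ": $err";
      push @failed, $dba->species();
    }
  }
  return \@failed;
}

# Example use, assuming $lookup was built as in the sections above:
#   for_each_dba($lookup->get_all_by_taxon_branch(561), sub {
#     my ($dba) = @_;
#     my $genes = $dba->get_GeneAdaptor()->fetch_all();
#     print $dba->species() . ": " . scalar(@$genes) . " genes\n";
#   });
```

Disconnecting inside the loop, rather than once at the end, keeps the number of open connections low regardless of how many genomes the query returns.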

Advanced use

The registry helper can be instantiated and used in a variety of ways. For instantiation from a local database set, the following code can be used (substitute your own details):

register_dbs(
                 "mysql.mydomain.com", 3306, "myuser",
                 "mypass", "bacteria_[0-9]+_collection_core_17_70_1" );
my $lookup = Bio::EnsEMBL::LookUp::LocalLookUp->new(-CLEAR_CACHE => 1);
# use as required

The ensembl-metadata API can be used in conjunction with the Compara Perl API to access gene family data for Ensembl Bacteria. To find which families a gene belongs to:

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;
print "Building helper\n";
my $helper = Bio::EnsEMBL::LookUp->new();

my $nom = 'escherichia_coli_str_k_12_substr_mg1655';
print "Getting DBA for $nom\n";
# get_by_name_exact() returns a single DBAdaptor, as in the basic example above
my $dba = $helper->get_by_name_exact($nom);

my $gene = $dba->get_GeneAdaptor()->fetch_by_stable_id('b0344');
print "Found gene " . $gene->external_name() . "\n";

# load compara adaptor
my $compara_dba = Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(-HOST => 'mysql-eg-publicsql.ebi.ac.uk', -USER => 'anonymous', -PORT => '4157', -DBNAME => 'ensembl_compara_bacteria_24_77');
# find the corresponding member
my $member = $compara_dba->get_GeneMemberAdaptor()->fetch_by_source_stable_id('ENSEMBLGENE',$gene->stable_id());
# find families involving this member
for my $family (@{$compara_dba->get_FamilyAdaptor()->fetch_all_by_Member($member)}) {
  print "Family ".$family->stable_id()."\n"; 
}

To retrieve the genes belonging to a given family:

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;
print "Building helper\n";
my $helper = Bio::EnsEMBL::LookUp->new();

# load compara adaptor
my $compara_dba = Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(-HOST => 'mysql-eg-publicsql.ebi.ac.uk', -USER => 'anonymous', -PORT => '4157', -DBNAME => 'ensembl_compara_bacteria_24_77');
# fetch the family by its stable ID
my $family = $compara_dba->get_FamilyAdaptor()->fetch_by_stable_id('MF_00395');
print "Family " . $family->stable_id() . "\n";
for my $member (@{$family->get_all_Members()}) {
  my $genome_db = $member->genome_db();
  print $genome_db->name() . "\n";
  my $member_dba = $helper->get_by_name_exact($genome_db->name());
  if (defined $member_dba) {
    my $gene = $member_dba->get_GeneAdaptor()->fetch_by_stable_id($member->gene_member()->stable_id());
    print $member_dba->species() . " " . $gene->external_name . "\n";
    $member_dba->dbc()->disconnect_if_idle();
  }
}

To retrieve the genes belonging to a given family (in this case the HAMAP family for the cytochrome b6-f complex subunit 8), filtering to a specific branch of the taxonomy (in this case from the species Prochlorococcus marinus):

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

print "Building helper\n";
my $helper = Bio::EnsEMBL::LookUp->new();

# find all genomes that are descendants of a specified node to use as a filter
my $taxid = 1219; # Prochlorococcus marinus
print "Finding genomes for " . $taxid . "\n";
my %target_species = map { $_->species() => $_ } @{$helper->get_all_by_taxon_branch($taxid)};

# load compara adaptor
my $compara_dba = Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(-HOST => 'mysql-eg-publicsql.ebi.ac.uk', -USER => 'anonymous', -PORT => '4157', -DBNAME => 'ensembl_compara_bacteria_24_77');
# fetch the family by its stable ID
my $family = $compara_dba->get_FamilyAdaptor()->fetch_by_stable_id('MF_00395');
print "Family " . $family->stable_id() . "\n";
for my $member (@{$family->get_all_Members()}) {
  my $genome_db = $member->genome_db();
  # filter by taxon from the calculated list
  my $member_dba = $target_species{$genome_db->name()};
  if (defined $member_dba) {
     my $gene = $member_dba->get_GeneAdaptor()->fetch_by_stable_id($member->gene_member()->stable_id());
     print $member_dba->species() . " " . $gene->external_name . "\n";
     $member_dba->dbc()->disconnect_if_idle();
  }
}

To retrieve the canonical peptides from genes belonging to a given family:

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;
use Bio::SeqIO;
print "Building helper\n";
my $helper = Bio::EnsEMBL::LookUp->new();

# load compara adaptor
my $compara_dba = Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(-HOST => 'mysql-eg-publicsql.ebi.ac.uk', -USER => 'anonymous', -PORT => '4157', -DBNAME => 'ensembl_compara_bacteria_24_77');

# fetch the family by its stable ID
my $family  = $compara_dba->get_FamilyAdaptor()->fetch_by_stable_id('MF_00395');

# create a file to write to
my $outfile = $family->stable_id() . ".fa";
# the leading ">" tells Bio::SeqIO to open the file for writing
my $seq_out = Bio::SeqIO->new(-file   => ">$outfile",
                              -format => "fasta");
print "Writing family " . $family->stable_id() . " to $outfile\n";

# loop over members
for my $member (@{$family->get_all_Members()}) {
  my $genome_db = $member->genome_db();
  my $member_dba = $helper->get_by_name_exact($genome_db->name());
  if (defined $member_dba) {
    my $gene = $member_dba->get_GeneAdaptor()->fetch_by_stable_id($member->gene_member()->stable_id());
    print "Writing sequence for " . $member->stable_id() . "\n";
    my $s = $gene->canonical_transcript()->translate();
    $seq_out->write_seq($s);
    $member_dba->dbc()->disconnect_if_idle();
  }
}