Ensembl Variant Effect Predictor Annotation sources


Ensembl VEP can use a variety of annotation sources to retrieve the transcript models used to predict consequence types.

  • Cache - a downloadable file containing all transcript models, regulatory features and variant data for a species
  • GFF or GTF - use transcript models defined in a tabix-indexed GFF or GTF file
  • Database - connect to a MySQL database server hosting Ensembl databases

Data from VCF, BED and bigWig files can also be incorporated by Ensembl VEP's Custom annotation feature.

Using a cache is the most efficient way to use Ensembl VEP; we would encourage you to use a cache wherever possible. Caches are easy to download and set up using the installer. Follow the tutorial for a simple guide.


Caches

Using a cache (--cache) is the fastest and most efficient way to use Ensembl VEP, as in most cases only a single initial network connection is made and most data is read from local disk. Use offline mode to eliminate all network connections for speed and/or privacy.

Cache version

We strongly recommend that you download/use the cache version which corresponds to your Ensembl VEP installation,
i.e. cache version 114 should be used with the Ensembl VEP tool version 114.

This is mainly because a new cache is generated for every Ensembl release to reflect that release's data and API updates. Both the data content and the file structure can therefore differ between versions, and a cache may be incompatible with a newer version of the Ensembl VEP tool.
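The recommendation above can be checked before a run. The following is a minimal sketch, assuming the cache directory is named release_assembly (for example 114_GRCh38, as in the default layout under $HOME/.vep). The helper names cache_release and cache_matches_tool are hypothetical, and the tool release is set by hand rather than queried from Ensembl VEP.

```shell
# Extract the release number from a cache directory name such as 114_GRCh38.
cache_release() {
  basename "$1" | sed 's/_.*$//'
}

# Succeed only when the tool release ($1) equals the release of the cache directory ($2).
cache_matches_tool() {
  [ "$(cache_release "$2")" = "$1" ]
}

tool_release=114                                  # assumption: the release of your installation
cache_path="$HOME/.vep/homo_sapiens/114_GRCh38"   # default cache location for human GRCh38

if cache_matches_tool "$tool_release" "$cache_path"; then
  echo "cache and tool are both release $tool_release"
else
  echo "mismatch: tool is $tool_release, cache is $(cache_release "$cache_path")"
fi
```

The check only compares names; it does not inspect the cache contents.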


Downloading caches

Cache files are created for every species for each Ensembl release. They can be automatically downloaded and configured using INSTALL.pl.

If you are interested in RefSeq transcripts you may download an alternate cache file (e.g. homo_sapiens_refseq), or a merged file of RefSeq and Ensembl transcripts (e.g. homo_sapiens_merged); remember to specify --refseq or --merged when running Ensembl VEP to use the relevant cache. See documentation for full details.


Manually downloading caches

It is also simple to download and set up caches without using the installer. By default, Ensembl VEP searches for caches in $HOME/.vep; to use a different directory when running VEP, use --dir_cache.

FTP directories with indexed cache data:

Ensembl: Vertebrates
Ensembl Genomes: Bacteria | Fungi | Metazoa | Plants | Protists

NB: When using Ensembl Genomes caches, you should use the --cache_version option to specify the relevant Ensembl Genomes version number as these differ from the concurrent Ensembl VEP version numbers.
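As an illustration of a manual download, the sketch below prints the commands for fetching the human GRCh38 cache into the default cache directory. The URL pattern and tarball name are assumptions about the layout of the Ensembl vertebrates FTP site; check the FTP directory listing before relying on them. The commands are printed rather than executed so the script is safe to dry-run.

```shell
release=114
species=homo_sapiens
assembly=GRCh38
cache_dir="$HOME/.vep"   # default search location; use --dir_cache for any other directory

# Assumed naming scheme of the indexed cache tarballs on the vertebrates FTP site.
tarball="${species}_vep_${release}_${assembly}.tar.gz"
url="https://ftp.ensembl.org/pub/release-${release}/variation/indexed_vep_cache/${tarball}"

# Print the commands; copy them into a terminal (or remove the echo) to run them.
echo "mkdir -p $cache_dir && cd $cache_dir"
echo "curl -O $url"
echo "tar xzf $tarball"
```

Unpacking in $HOME/.vep yields a species/release_assembly directory that Ensembl VEP finds without further configuration.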

HPRC and alternative assemblies

Ensembl VEP caches are also available for Human Pangenome Reference Consortium (HPRC) data at the Ensembl HPRC data page. See the linked documentation for more information on how to annotate variants on HPRC assemblies.


Data in the cache

The data content of Ensembl VEP caches varies by species. This table shows the contents of the default human cache files in release 114.

Source                    Version (GRCh38)                           Version (GRCh37)
Ensembl database version  114                                        114
Genome assembly           GRCh38.p14                                 GRCh37.p13
MANE Version              v1.4                                       n/a
GENCODE                   48                                         19
RefSeq                    GCF_000001405.40-RS_2023_10                105.20220307
                          (GCF_000001405.40_GRCh38.p14_genomic.gff)  (GCF_000001405.25_GRCh37.p13_genomic.gff)
Regulatory build          1.0                                        1.0
PolyPhen-2                2.2.3                                      2.2.2
SIFT                      6.2.1                                      5.2.2
dbSNP                     156                                        156
COSMIC                    100                                        98
HGMD-PUBLIC               2020.4                                     2020.4
ClinVar                   2024-09                                    2023-06
1000 Genomes              Phase 3 (remapped)                         Phase 3
gnomAD exomes             v4.1                                       v4.1
gnomAD genomes            v4.1                                       v4.1

Limitations of the cache

The cache stores the following information:

  • Transcript location, sequence, exons and other attributes
  • Gene, protein, HGNC and other identifiers for each transcript (where applicable, limitations apply to RefSeq caches)
  • Locations, alleles and frequencies of existing variants (see note below).
  • Regulatory regions
  • Predictions and scores for SIFT, PolyPhen-2

The cache does not store any information pertaining to, and therefore cannot be used for, the following:

  • HGVS names (--hgvs, --hgvsg) - to retrieve these you must additionally point to a FASTA file containing the reference sequence for your species (--fasta)
  • Using HGVS notation as input (--format hgvs)
  • Using variant identifiers as input (--format id)
  • Finding overlapping structural variants (--check_sv)

Enabling one of these options with --cache will cause Ensembl VEP to warn you in its status output with something like the following:

 2011-06-16 16:24:51 - INFO: Database will be accessed when using --hgvs 

Existing variants

Here, existing variants refers to variants that have been loaded into the Ensembl variation database from accessioning resources. For human, the sources are listed in the table above: we load variants from accessioning resources such as dbSNP, COSMIC and HGMD-PUBLIC.

Note that gnomAD is not a variant accessioning body. This means that any gnomAD variants that have not been accessioned will not be available in the cache. For example, gnomAD v4.1 was released in April 2024, but its variants will not be available in the cache until they have been submitted to dbSNP for accessioning and made available in a dbSNP release. If you annotate the variant 5-32100960-ATAAG-A using the 113 cache, you will not get any frequency information, because the variant had not been accessioned at the time of the Ensembl 113 release:

./vep --id "5 32100960 . ATAAG A" --af_gnomadg --af_gnomade --check_existing --cache --cache_version 113 --fasta genome.fa.gz
#Uploaded_variation	Location	Allele	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	Extra
5_32100961_TAAG/-	5:32100961-32100964	-	ENSG00000133401	ENST00000397559	Transcript	splice_donor_variant,non_coding_transcript_exon_variant	95-?	-	-	-	-	-	IMPACT=HIGH;STRAND=1

Alternative: in such cases you can supply a gnomAD VCF file with the --custom option.
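A sketch of what such a --custom argument might look like follows. The VCF file name is hypothetical, and the type=, coords= and fields= keys are assumptions taken from the custom annotation documentation, so verify them there; the file=, short_name= and format= keys follow the form used for GFF/GTF files on this page. The command is printed rather than executed.

```shell
# Hypothetical file name: substitute the bgzipped, tabix-indexed gnomAD VCF you downloaded.
gnomad_vcf="gnomad.genomes.v4.1.sites.chr5.vcf.gz"

# Assumed keys: type=exact matches on identical coordinates, fields=AF reports the AF INFO field.
custom_arg="file=${gnomad_vcf},short_name=gnomADg,format=vcf,type=exact,coords=0,fields=AF"

# Print the full command line for review.
echo "./vep --id '5 32100960 . ATAAG A' --cache --fasta genome.fa.gz --custom $custom_arg"
```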


Data privacy and offline mode

When using the public database servers, Ensembl VEP requests transcript and variation data that overlap the loci in your input file. As such, these coordinates are transmitted over the network to a public server, which may not be appropriate for the analysis of sensitive or private data.

Note

Only the coordinates are transmitted to the server; no other information is sent.

To run in offline mode, which makes no network connections, use the flag --offline.

The limitations described above apply absolutely when using offline mode. For example, if you specify --offline and --format id, Ensembl VEP will report an error and refuse to run:

ERROR: Cannot use ID format in offline mode

All other features, including the ability to use custom annotations and plugins, are accessible in offline mode.
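In a pipeline it can be convenient to catch such combinations before Ensembl VEP is launched. The sketch below is a hypothetical pre-flight helper, not part of Ensembl VEP: it scans an argument list for --offline together with the options listed under the cache limitations (--format id, --format hgvs, --check_sv). It assumes options are written with a space between flag and value.

```shell
# Report options that, per the limitations above, cannot be used with --offline.
offline_conflicts() {
  offline=0
  prev=""
  conflicts=""
  for arg in "$@"; do
    case "$arg" in
      --offline) offline=1 ;;
      --check_sv) conflicts="$conflicts --check_sv" ;;
      id|hgvs)
        # Only a conflict when it is the value of --format.
        if [ "$prev" = "--format" ]; then
          conflicts="$conflicts [--format $arg]"
        fi ;;
    esac
    prev="$arg"
  done
  if [ "$offline" -eq 1 ] && [ -n "$conflicts" ]; then
    echo "not available with --offline:$conflicts"
    return 1
  fi
  return 0
}

# Example: this combination is rejected before vep would be started.
offline_conflicts -i input.vcf --offline --format id || echo "fix the command before running vep"
```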



GFF/GTF files

Ensembl VEP can use transcript annotations defined in GFF or GTF files. The files must be bgzipped and indexed with tabix, and a FASTA file containing the genomic sequence is required in order to generate transcript models. This allows you to annotate variants from any species and assembly for which you have these data.

Your GFF or GTF file must be sorted in chromosomal order. Ensembl VEP does not use header lines so it is safe to remove them.

grep -v "^#" data.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > data.gff.gz
tabix -p gff data.gff.gz
./vep -i input.vcf --gff data.gff.gz --fasta genome.fa.gz
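Before compressing and indexing, the sort order can be sanity-checked. The sketch below is a hypothetical helper that works on an uncompressed GFF or GTF file. It assumes that "chromosomal order" means each chromosome forms one contiguous block with non-decreasing start positions, which is what tabix indexing needs; it does not check end positions.

```shell
# Succeed if records are grouped by chromosome and ordered by start within each group.
gff_is_sorted() {
  awk -F '\t' '
    /^#/ { next }                                  # header and comment lines are ignored
    $1 != chrom {                                  # a new chromosome block starts
      if ($1 in seen) { bad = 1; exit }            # the same chromosome appeared earlier
      seen[$1] = 1; chrom = $1; last = 0
    }
    $4 + 0 < last { bad = 1; exit }                # start position went backwards
    { last = $4 + 0 }
    END { exit bad }
  ' "$1"
}

# Demonstration on a small unsorted file.
demo=$(mktemp)
printf '1\tEnsembl\texon\t1500\t3000\t.\t+\t.\tID=exon2\n' >  "$demo"
printf '1\tEnsembl\texon\t1200\t1300\t.\t+\t.\tID=exon1\n' >> "$demo"
if gff_is_sorted "$demo"; then echo "sorted"; else echo "not sorted: re-run the sort step"; fi
rm -f "$demo"
```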

You may use any number of GFF/GTF files in this way, provided they refer to the same genome. You may also use them in concert with annotations from a cache or database source; annotations are distinguished by the SOURCE field in the output.

  • GFF file

    Example of command line with GFF, using flag --gff :

    ./vep -i input.vcf --cache --gff data.gff.gz --fasta genome.fa.gz

    NOTE: If you wish to customise the name of the GFF as it appears in the SOURCE field and Ensembl VEP output header, use the longer --custom annotation form:

    --custom file=data.gff.gz,short_name=frequency,format=gff
  • GTF file

    Example of command line with GTF, using flag --gtf :

    ./vep -i input.vcf --cache --gtf data.gtf.gz --fasta genome.fa.gz

    NOTE: If you wish to customise the name of the GTF as it appears in the SOURCE field and Ensembl VEP output header, use the longer --custom annotation form:

    --custom file=data.gtf.gz,short_name=frequency,format=gtf

GFF format expectations

Ensembl VEP has been tested on GFF files generated by Ensembl and NCBI (RefSeq). Due to inconsistency in the GFF specification and adherence to it, not all GFF files will be compatible with Ensembl VEP and not all transcript biotypes may be supported. Additionally, Ensembl VEP does not support GFF files with embedded FASTA sequence.


Column "type" (3rd column):

Only a fixed set of entity/feature types is supported by VEP.

Lines of other types will be ignored; if this leads to an incomplete transcript model, the whole transcript model may be discarded. If unsupported types are used you will see a warning like the following:

WARNING: Ignoring 'five_prime_utr' feature_type from Homo_sapiens.GRCh38.111.gtf.gz GFF/GTF file. This feature_type is not supported in Ensembl VEP.

Expected parameters in the 9th column:

  • ID

    Only required for the genes and transcripts entities.

  • parent/Parent

    - Entities in the GFF are expected to be linked using a key named "parent" or "Parent" in the attributes (9th) column of the GFF.
    - Unlinked entities (i.e. those with no parents or children) are discarded.
    - Sibling entities (those that share the same parent) may have overlapping coordinates, e.g. for exon and CDS entities.

  • biotype

    Transcripts require a Sequence Ontology biotype to be defined in order to be used.
    The simplest way to define this is using an attribute named "biotype" on the transcript entity. Other configurations are supported in order for Ensembl VEP to use GFF files from NCBI and other sources.

Here is an example:

##gff-version 3.2.1
##sequence-region 1 1 10000
1 Ensembl gene        1000  5000  . + . ID=gene1;Name=GENE1
1 Ensembl transcript  1100  4900  . + . ID=transcript1;Name=GENE1-001;Parent=gene1;biotype=protein_coding
1 Ensembl exon        1200  1300  . + . ID=exon1;Name=GENE1-001_1;Parent=transcript1
1 Ensembl exon        1500  3000  . + . ID=exon2;Name=GENE1-001_2;Parent=transcript1
1 Ensembl exon        3500  4000  . + . ID=exon3;Name=GENE1-001_3;Parent=transcript1
1 Ensembl CDS         1300  3800  . + . ID=cds1;Name=CDS0001;Parent=transcript1
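Because entities are linked through the Parent attribute, a dangling reference can silently remove features from a transcript model. The sketch below is a hypothetical helper, not part of Ensembl VEP, that lists features in an uncompressed, tab-separated GFF whose Parent names an ID that is not defined anywhere in the file. It only detects missing parents; it does not detect entities that have no children.

```shell
# Print one line per Parent reference that points at an undefined ID.
gff_orphans() {
  awk -F '\t' '
    /^#/ || NF < 9 { next }
    {
      id = ""; parents = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ID=/)        id = substr(attrs[i], 4)
        if (attrs[i] ~ /^[Pp]arent=/) parents = substr(attrs[i], 8)
      }
      if (id != "") defined[id] = 1
      if (parents != "") { k++; type[k] = $3; name[k] = id; par[k] = parents }
    }
    END {
      # Parents may be listed after their children, so resolve them at the end.
      for (j = 1; j <= k; j++) {
        m = split(par[j], p, ",")
        for (i = 1; i <= m; i++)
          if (!(p[i] in defined))
            print type[j] " " name[j] ": Parent " p[i] " is not defined"
      }
    }
  ' "$1"
}

# Demonstration: the exon refers to a transcript that does not exist.
demo=$(mktemp)
printf '1\tEnsembl\tgene\t1000\t5000\t.\t+\t.\tID=gene1;Name=GENE1\n' >  "$demo"
printf '1\tEnsembl\texon\t1200\t1300\t.\t+\t.\tID=exon1;Parent=transcript9\n' >> "$demo"
gff_orphans "$demo"
rm -f "$demo"
```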

GTF format expectations

The following GTF entity types will be extracted:

  • cds (or CDS)
  • stop_codon
  • exon
  • gene
  • transcript

Entities are linked by an attribute named for the parent entity type e.g. exon is linked to transcript by transcript_id, transcript is linked to gene by gene_id.

Transcript biotypes are defined in attributes named "biotype", "transcript_biotype" or "transcript_type". If none of these exist, Ensembl VEP will attempt to interpret the source field (2nd column) of the GTF as the biotype.

Here is an example:

1 Ensembl gene        1000  5000  . + . gene_id "gene1"; gene_name "GENE1";
1 Ensembl transcript  1100  4900  . + . gene_id "gene1"; transcript_id "transcript1"; gene_name "GENE1"; transcript_name "GENE1-001"; transcript_biotype "protein_coding";
1 Ensembl exon        1200  1300  . + . gene_id "gene1"; transcript_id "transcript1"; exon_number "exon1"; exon_id "GENE1-001_1";
1 Ensembl exon        1500  3000  . + . gene_id "gene1"; transcript_id "transcript1"; exon_number "exon2"; exon_id "GENE1-001_2";
1 Ensembl exon        3500  4000  . + . gene_id "gene1"; transcript_id "transcript1"; exon_number "exon3"; exon_id "GENE1-001_3";
1 Ensembl CDS         1300  3800  . + . gene_id "gene1"; transcript_id "transcript1"; exon_number "exon2"; ccds_id "CDS0001";
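The biotype lookup described above can be sketched as follows. This is an illustration of the documented fallback, not Ensembl VEP's own code: the helper name gtf_biotype is hypothetical, and the order in which the three attribute names are tried is an assumption, since the text does not state a precedence. The function takes one tab-separated GTF line.

```shell
# Print the biotype for a GTF line: attribute first, source column (2nd) as a last resort.
gtf_biotype() {
  printf '%s\n' "$1" | awk -F '\t' '
    # Return the quoted value of attribute "key", or an empty string if absent.
    function attr(s, key,    re, v) {
      re = "(^|; *)" key " \"[^\"]*\""
      if (!match(s, re)) return ""
      v = substr(s, RSTART, RLENGTH)
      sub(/"$/, "", v)
      sub(/^[^"]*"/, "", v)
      return v
    }
    {
      b = attr($9, "biotype")
      if (b == "") b = attr($9, "transcript_biotype")
      if (b == "") b = attr($9, "transcript_type")
      if (b == "") b = $2
      print b
    }'
}

line=$(printf '1\tEnsembl\ttranscript\t1100\t4900\t.\t+\t.\tgene_id "gene1"; transcript_id "transcript1"; transcript_biotype "protein_coding";')
gtf_biotype "$line"
```

With a source column such as "Ensembl" and no biotype attribute, the fallback would yield "Ensembl", which is not a usable biotype; setting one of the three attributes explicitly avoids this.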

Chromosome synonyms

If the chromosome names used in your GFF/GTF differ from those used in the FASTA or your input VCF, you may see warnings like this when running VEP:

WARNING: Chromosome 21 not found in annotation sources or synonyms on line 160

To circumvent this you may provide VEP with a synonyms file. A synonym file is included in Ensembl VEP's cache files, so if you have one of these for your species you can use it as follows:

./vep -i input.vcf --cache --gff data.gff.gz --fasta genome.fa.gz --synonyms ~/.vep/homo_sapiens/114_GRCh38/chr_synonyms.txt
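To see in advance which names would need a synonym, the sequence names in the annotation file can be compared with the FASTA headers. The sketch below is a hypothetical helper that works on uncompressed files (pipe through gzip -dc first for compressed ones) and assumes the FASTA record name is the text between ">" and the first whitespace.

```shell
# Print sequence names used in the GFF/GTF ($1) that have no record in the FASTA ($2).
missing_chroms() {
  gff_names=$(mktemp)
  fa_names=$(mktemp)
  grep -v '^#' "$1" | cut -f1 | sort -u > "$gff_names"
  grep '^>' "$2" | sed 's/^>//; s/[[:space:]].*$//' | sort -u > "$fa_names"
  comm -23 "$gff_names" "$fa_names"    # lines unique to the annotation file
  rm -f "$gff_names" "$fa_names"
}

# Demonstration: the annotation uses "21" while the FASTA uses "chr21".
gff=$(mktemp)
fa=$(mktemp)
printf '21\tEnsembl\tgene\t1000\t5000\t.\t+\t.\tID=gene1\n' > "$gff"
printf '>chr21 example record\nACGT\n' > "$fa"
missing_chroms "$gff" "$fa"
rm -f "$gff" "$fa"
```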

Limitations of GFF/GTF files

Using a GFF or GTF file as the gene annotation source limits access to some auxiliary information available when using a cache. Currently most external reference data such as gene symbols, transcript identifiers and protein domains are inaccessible when using only a GFF/GTF file.

Ensembl VEP's flexibility does allow some annotation types to be replaced. The following table illustrates some examples and alternative means to retrieve equivalent data.

Data type                                              Alternative
SIFT and PolyPhen-2 predictions (--sift, --polyphen)   Use the PolyPhen_SIFT plugin
Co-located variants (--check_existing, --af* flags)    A couple of options are available:
                                                         1. Use a VCF with --custom to retrieve variant IDs, frequency and other data
                                                         2. Add --cache to use variants in the cache. *
Regulatory consequences (--regulatory)                 Add --cache to use regulatory features in the cache. *

* Note this will also instruct Ensembl VEP to annotate input variants against transcript models retrieved from the cache as well as those from the GFF/GTF file. It is possible to use --transcript_filter to include only the transcripts from your GFF/GTF file:

./vep -i input.vcf --cache --custom file=data.gff.gz,short_name=myGFF,format=gff --fasta genome.fa.gz --transcript_filter "_source_cache is myGFF"


FASTA files

By pointing Ensembl VEP to a FASTA file (or directory containing several files), it is possible to retrieve reference sequence locally when using --cache or --offline. This enables Ensembl VEP to:

  • Retrieve HGVS notations (--hgvs)
  • Check the reference sequence given in input data (--check_ref)
  • Construct transcript models from a GFF or GTF file without accessing a database (especially useful for performance reasons, or when using data from a species or assembly that is not part of the Ensembl species list)

FASTA files from Ensembl can be set up using the installer; files set up using the installer are automatically detected when using --cache or --offline; you should not need to use --fasta to manually specify them.

The following plugins require the FASTA file to be passed explicitly as a command line argument (i.e. --fasta /VEP_DIR/your_downloaded.fasta):

  • CSN
  • GeneSplicer
  • MaxEntScan

To enable this, Ensembl VEP uses one of two modules:

  • The Bio::DB::HTS Perl XS module with HTSlib. This module uses compiled C code and can access compressed (bgzipped) or uncompressed FASTA files. It is set up by the installer.
  • The Bio::DB::Fasta module. This may be used on systems where installation of the Bio::DB::HTS module has not been possible. It can access only uncompressed FASTA files. It is also set up by the installer and comes as part of the BioPerl package.

The first time you run Ensembl VEP with a specific FASTA file, an index will be built. This can take a few minutes, depending on the size of the FASTA file and the speed of your system. On subsequent runs the index does not need to be rebuilt (if the FASTA file has been modified, Ensembl VEP will force a rebuild of the index).


FASTA FTP directories

Suitable reference FASTA files are available to download from the Ensembl FTP server. See the Downloads page for details.

You should preferably use the installer as described above to fetch these files; manual instructions are provided for reference. In most cases it is best to download the single large "primary_assembly" file for your species. You should use the unmasked (without _rm or _sm in the name) sequences.

Note that Ensembl VEP requires that the file be either unzipped (Bio::DB::Fasta) or unzipped and then recompressed with bgzip (Bio::DB::HTS::Faidx) to run; when unzipped these files can be very large (25GB for human). An example set of commands for setting up the data for human follows:

curl -O https://ftp.ensembl.org/pub/release-114/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa
./vep -i input.vcf --offline --hgvs --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
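A .gz suffix does not reveal whether a file was written by gzip or by bgzip. The sketch below is a hypothetical check based on the BGZF block header as described in the SAM/BAM specification: the gzip magic bytes with the "extra field" flag set, followed by a "BC" subfield at byte offset 12. It assumes "BC" is the first extra subfield, which is how bgzip writes it.

```shell
# Succeed if the file starts with a BGZF block header.
is_bgzf() {
  magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')        # bytes 0-3: 1f 8b 08 04
  extra=$(od -An -tx1 -j12 -N2 "$1" | tr -d ' \n')   # bytes 12-13: 'B' 'C' = 42 43
  [ "$magic" = "1f8b0804" ] && [ "$extra" = "4243" ]
}

# Demonstration on a plain gzip file, which is not usable with Bio::DB::HTS.
plain=$(mktemp)
printf 'ACGT\n' | gzip -c > "$plain"
if is_bgzf "$plain"; then
  echo "BGZF"
else
  echo "plain gzip: decompress and recompress with bgzip"
fi
rm -f "$plain"
```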


Databases

Ensembl VEP can use remote or local database servers to retrieve annotations.

  • Using --cache (without --offline) uses the local cache on disk to fetch most annotations, but allows database connections for some features (see cache limitations)
  • Using --database tells Ensembl VEP to retrieve all annotations from the database. Please only use this for small input files or when using a local database server!

Public database servers

By default, Ensembl VEP is configured to connect to the public MySQL instance at ensembldb.ensembl.org. If you are in the USA (or geographically closer to the east coast of the USA than to the Ensembl data centre in Cambridge, UK), a mirror server is available at useastdb.ensembl.org. To use the mirror, pass the flag --host useastdb.ensembl.org on the command line.

Data for Ensembl Genomes species (e.g. plants, fungi, microbes) is available through a different public MySQL server. The appropriate connection parameters can be loaded automatically with the flag --genomes.

If you have a very small data set (100s of variants), using the public database servers should provide adequate performance. If you have larger data sets, or wish to use Ensembl VEP in a batch manner, consider one of the alternatives below.


Using a local database

It is possible to set up a local MySQL mirror with the databases for your species of interest installed. For instructions on installing a local mirror, see here. You will need a MySQL server that you can connect to from the machine where you will run Ensembl VEP (this can be the same machine). For most annotation functionality, you will only need the Core database (e.g. homo_sapiens_core_114_38) installed. In order to find co-located variants or to use SIFT or PolyPhen-2, it is also necessary to install the relevant variation database (e.g. homo_sapiens_variation_114_38).

Note that unless you have custom data to insert in the database, in most cases it will be much more efficient to use a pre-built cache in place of a local database.

To connect to your mirror, you can either set the connection parameters using --host, --port, --user and --password, or use a registry file. Registry files contain all the connection parameters for your database, as well as any species aliases you wish to set up:

use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "core",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_core_114_38'
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "variation",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_variation_114_38'
);

Bio::EnsEMBL::Registry->add_alias("Homo_sapiens","human");

For more information on the registry and registry files, see here.



Cache - technical information

ADVANCED The cache consists of compressed files containing listrefs of serialised objects. These objects are initially created from the database as if using the Ensembl API normally. In order to reduce the size of the cache and allow the serialisation to occur, some changes are made to the objects before they are dumped to disk. This means that they will not behave in exactly the same way as an object retrieved from the database when writing, for example, a plugin that uses the cache.

The following hash keys are deleted from each transcript object:

  • analysis
  • created_date
  • dbentries : this contains the external references retrieved when calling $transcript->get_all_DBEntries(); hence this call on a cached object will return no entries
  • description
  • display_xref
  • edits_enabled
  • external_db
  • external_display_name
  • external_name
  • external_status
  • is_current
  • modified_date
  • status
  • transcript_mapper : used to convert between genomic, cdna, cds and protein coordinates. A copy of this is cached separately by VEP as

    $transcript->{_variation_effect_feature_cache}->{mapper}

As mentioned above, a special hash key "_variation_effect_feature_cache" is created on the transcript object and used to cache things used by VEP in predicting consequences, things which might otherwise have to be fetched from the database. Some of these are stored in place of equivalent keys that are deleted as described above. The following keys and data are stored:

  • introns : listref of intron objects for the transcript. The adaptor, analysis, dbID, next, prev and seqname keys are stripped from each intron object
  • translateable_seq : as returned by

    $transcript->translateable_seq

  • mapper : transcript mapper as described above
  • peptide : the translated sequence as a string, as returned by

    $transcript->translate->seq

  • protein_features : protein domains for the transcript's translation as returned by

    $transcript->translation->get_all_ProteinFeatures

    Each protein feature is stripped of all keys but: start, end, analysis, hseqname
  • codon_table : the codon table ID used to translate the transcript, as returned by

    $transcript->slice->get_all_Attributes('codon_table')->[0]

  • protein_function_predictions : a hashref containing the keys "sift" and "polyphen"; each one contains a protein function prediction matrix as returned by e.g.

    $protein_function_prediction_matrix_adaptor->fetch_by_analysis_translation_md5('sift', md5_hex($transcript->{_variation_effect_feature_cache}->{peptide}))

Similarly, some further data is cached directly on the transcript object under the following keys:

  • _gene : gene object. This object has all keys but the following deleted: start, end, strand, stable_id
  • _gene_symbol : the gene symbol
  • _ccds : the CCDS identifier for the transcript
  • _refseq : the "NM" RefSeq mRNA identifier for the transcript
  • _protein : the Ensembl stable identifier of the translation
  • _source_cache : the source of the transcript object. Only defined in the merged cache (values: Ensembl, RefSeq) or when using a GFF/GTF file (value: short name or filename)