Ensembl Variant Effect Predictor FAQ

For any questions not covered here, please send an email to the Ensembl developers' mailing list (public) or contact the Ensembl Helpdesk (private). You can also report issues through our public GitHub repositories: use the ensembl-vep repository for general Ensembl VEP issues and the VEP_plugins repository for issues with specific plugins.

General questions

Q: Why has my insertion/deletion variant encoded in VCF disappeared from the output?

Ensembl treats unbalanced variants differently to VCF - your variant hasn't disappeared, it may have just changed slightly! You can solve this by giving your variants a unique identifier in the third column of the VCF file. See here for a full discussion.
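
As a rough illustration of the identifier fix, the Python sketch below fills empty ID fields (the third VCF column) with unique values. The function name assign_vcf_ids and the "var" prefix are hypothetical and not part of Ensembl VEP; the sketch assumes uncompressed, tab-separated VCF lines.

```python
def assign_vcf_ids(lines, prefix="var"):
    """Return VCF lines with every missing ID (column 3) replaced by a unique one.

    Hypothetical helper, not part of Ensembl VEP. Header lines pass through
    unchanged, and IDs already present in the file are never reused.
    """
    lines = list(lines)

    # First pass: collect the IDs that already exist so new ones cannot clash.
    used = set()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            used.add(fields[2])

    # Second pass: fill in "." (or empty) IDs with prefix + running counter.
    out = []
    counter = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 3 or fields[2] not in (".", ""):
            out.append(line)
            continue
        counter += 1
        while f"{prefix}{counter}" in used:
            counter += 1
        fields[2] = f"{prefix}{counter}"
        used.add(fields[2])
        newline = "\n" if line.endswith("\n") else ""
        out.append("\t".join(fields) + newline)
    return out
```

With identifiers like these in place, each output line can be traced back to its input record even when the allele representation differs.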

Q: Why don't I see any co-located variants when using species X?

Not all species have variants, and not all species that do are in the Ensembl variation resource - see this document for details. The --custom option can be used in the command line interface to include more variant sets.

Q: Why do I see multiple known variants mapped to my input variant?

Ensembl VEP compares your input to known variants from the Ensembl variation database. In some cases one input variant can match multiple known variants:

Germline variants from dbSNP and somatic mutations from COSMIC may be found at the same locus.
Some sources, e.g. HGMD, do not provide public access to allele-specific data, so an HGMD variant with unknown alleles may co-locate with one from dbSNP with known alleles.
Multiple alternate alleles from your input may match different variants as they are described in dbSNP.

See here for a full discussion.

Q: Ensembl VEP is not assigning a frequency to my input variant - why?

Ensembl VEP's cache contains frequency data only for specific studies. See here for a full discussion. The --custom option can be used in the command line interface to include more frequency sets.

Q: Why do I see so many lines of output for each variant in my input?

While it would be convenient to have a simple, one-word answer to the question "What is the consequence of this variant?", in reality biology is not this simple! Many genes have more than one transcript, so Ensembl VEP provides a prediction for each transcript that a variant overlaps. Ensembl VEP has options to help select results according to your requirements. The --canonical and --mane options indicate which transcripts are canonical and which belong to the human MANE set, respectively, while --pick, --per_gene, --summary and --most_severe give a summary-level assessment per variant.

Furthermore, several "compound" consequences are also possible. If, for example, a variant falls in the final few bases of an exon, it may be considered to affect a splice site in addition to possibly affecting the coding sequence.

Q: How do I reduce Ensembl VEP's memory requirement?

There are a number of ways to do this:

Ensure your input file is sorted by location. This can greatly reduce memory requirements and runtime.
Consider reducing the buffer size. This reduces the number of variants annotated together in a batch and can be modified in both the command line (--buffer_size) and web interfaces. Reducing the buffer size may increase runtime.
Ensure you are only using the options you need, rather than --everything. Some data-rich options, such as regulatory annotation, have an impact on memory use.
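
The first suggestion above can be sketched in Python as follows. The helper sort_vcf_lines is hypothetical and not part of Ensembl VEP. Its chromosome ordering (numeric names first, then other names alphabetically, ignoring an optional "chr" prefix) is an assumption of this sketch. It also loads every record into memory, so a dedicated sorting tool may be a better fit for very large or compressed files.

```python
def sort_vcf_lines(lines):
    """Return VCF lines with the header first and records sorted by location.

    Hypothetical helper. Assumes uncompressed, tab-separated VCF lines whose
    first two columns are chromosome and position.
    """
    lines = list(lines)
    header = [l for l in lines if l.startswith("#")]
    records = [l for l in lines if l.strip() and not l.startswith("#")]

    def location_key(line):
        chrom, pos = line.split("\t", 2)[:2]
        # Ignore an optional "chr" prefix when deciding the order.
        name = chrom[3:] if chrom.lower().startswith("chr") else chrom
        if name.isdigit():
            return (0, int(name), "", int(pos))   # numeric chromosomes first
        return (1, 0, name, int(pos))             # then X, Y, MT, scaffolds...

    return header + sorted(records, key=location_key)
```

Because Python's sort is stable, records at the same chromosome and position keep their original relative order.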

Q: How do I cite Ensembl VEP?

If you use Ensembl VEP, please cite our latest publication to continue to support Ensembl VEP development.

Ensembl VEP web interface questions

Q: How do I access the web version of the Ensembl Variant Effect Predictor?

You can find the Ensembl VEP web tool on the Tools page.

Q: Why is the output I get for my input file different when I use the Ensembl VEP web and command line interfaces?

Ensure that you are passing equivalent arguments to the command line tool that you are using in the web interface. If you are sure this is still a problem, please report it on the ensembl-dev mailing list.

Q: Is there a tutorial for the web tool?

Yes, see our latest tutorial Annotating and prioritizing genomic variants using the Ensembl Variant Effect Predictor — A tutorial for more information on using the Ensembl VEP web interface.

Ensembl VEP command line tool questions

Q: How can I make Ensembl VEP run faster?

There are a number of factors that influence how fast Ensembl VEP runs. Have a look at our handy guide for tips on improving runtime.

Q: Why am I not seeing the same variant from my input in the output?

Since the Ensembl 110 release, Ensembl VEP by default will minimise the input allele for display in the output. To see the exact allele representation you provided, use the --uploaded_allele option.
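
To give a feel for what a minimised allele looks like, here is a small Python sketch that trims bases shared by the reference and alternate alleles. It is not the code Ensembl VEP uses. The function name minimise_alleles, the trimming order (leading bases first, then trailing bases) and the use of "-" for an empty allele are assumptions made for illustration.

```python
def minimise_alleles(pos, ref, alt):
    """Trim bases shared by ref and alt and adjust the start position.

    Illustrative only; Ensembl VEP's own implementation may differ in detail.
    """
    # Trim identical leading bases, moving the start position forward.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    # Trim identical trailing bases; the start position is unaffected.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Represent an empty allele with "-" (an assumption of this sketch).
    return pos, ref or "-", alt or "-"


# A VCF-style deletion CAT>CT at position 100 reduces to A>- at position 101.
print(minimise_alleles(100, "CAT", "CT"))
```

A single-base substitution such as A>G has nothing to trim and is returned unchanged.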

Q: Why do I see "N" as the reference allele in my HGVS strings?

Q: Why do I get errors related to Sequence.pm?

substr outside of string at /nfs/users/nfs_w/wm2/Perl/ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/Sequence.pm line 511.
Use of uninitialized value $ref_allele in string eq at /nfs/users/nfs_w/wm2/Perl/ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/Sequence.pm line 514.
Use of uninitialized value in concatenation (.) or string at /nfs/users/nfs_w/wm2/Perl/ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/Sequence.pm line 643.

Both of these error types are usually seen when using a FASTA file for retrieving sequence. There are a couple of steps you can take to try to remedy them:

The index alongside the FASTA file can become corrupted. Delete [fastafile].index and re-run Ensembl VEP to regenerate it. By default this file is located in your $HOME/.vep/[species]/[version]_[assembly] directory.
The FASTA file itself may have been corrupted during download. Delete the FASTA file and the index, then re-download (you can use the Ensembl VEP installer to do this).
Older versions of BioPerl (1.2.3 in particular is known to have this problem) cannot properly index large FASTA files. Make sure you are using a later (>=1.6) version of BioPerl. The Ensembl VEP installer installs 1.6.924 for you.

If you still see problems after taking these steps, or if you were not using a FASTA file in the first place, please contact us.

Q: Why are chromosomes not found in annotation sources or synonyms?

WARNING: Chromosome 21 not found in annotation sources or synonyms on line 160

This can occur if the chromosome names differ between your input variant and any annotation source that you are using (cache, database, GFF/GTF file, FASTA file, custom annotation file). To circumvent this you may provide a synonyms file. A synonym file is included in Ensembl VEP's cache files, so if you have one of these for your species you can use it as follows:

./vep -i input.vcf --cache --synonyms ~/.vep/homo_sapiens/114_GRCh38/chr_synonyms.txt

The file consists of lines containing pairs of tab-separated synonyms. Order is not important as synonyms can be used in both "directions".
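
The two-directional lookup described above can be pictured with a short Python sketch. The function load_synonyms is a hypothetical helper, not part of Ensembl VEP, and the example names are illustrative. The sketch simply skips blank or malformed lines, which is an assumption rather than documented behaviour.

```python
def load_synonyms(lines):
    """Build a lookup in which each name maps to the set of its synonyms.

    Hypothetical helper. Each input line is expected to hold exactly two
    tab-separated names; anything else is skipped.
    """
    synonyms = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not all(parts):
            continue  # blank or malformed line
        first, second = parts
        # Register the pair in both directions, so column order is irrelevant.
        synonyms.setdefault(first, set()).add(second)
        synonyms.setdefault(second, set()).add(first)
    return synonyms


# The same pairs written in either column order give the same lookup.
forward = load_synonyms(["chr21\t21\n", "NC_000021.9\t21\n"])
reverse = load_synonyms(["21\tchr21\n", "21\tNC_000021.9\n"])
assert forward == reverse
```

Writing your own file for an unusual naming scheme therefore only requires one line per pair of names.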

Q: Why do I get feature_type warnings from my GFF/GTF file?

WARNING: Ignoring 'five_prime_utr' feature_type from Homo_sapiens.GRCh38.111.gtf.gz GFF/GTF file. This feature_type is not supported in Ensembl VEP.

This can occur if you are using a GFF/GTF file that contains a feature type not supported by Ensembl VEP. Lines with unsupported types are simply ignored. However, if ignoring them leaves a transcript model incomplete, the whole model may be ignored.

Please try to use supported feature types as mentioned here.

Q: Can I get gnomAD exomes and genomes frequencies in Ensembl VEP?

Yes, see this guide.

Q: Why do I have issues connecting to Ensembl databases?

Could not connect to database homo_sapiens_core_63_37 as user anonymous using [DBI:mysql:database=homo_sapiens_core_63_37;host=ensembldb.ensembl.org;port=5306] as a locator:
Unknown MySQL server host 'ensembldb.ensembl.org' (2) at $HOME/src/ensembl/modules/Bio/EnsEMBL/DBSQL/DBConnection.pm line 290.

-------------------- EXCEPTION --------------------
MSG: Could not connect to database homo_sapiens_core_63_37 as user anonymous using [DBI:mysql:database=homo_sapiens_core_63_37;host=ensembldb.ensembl.org;port=5306] as a locator:
Unknown MySQL server host 'ensembldb.ensembl.org' (2)

If you select the database option rather than using a cache, Ensembl VEP will try to connect to the public MySQL server at ensembldb.ensembl.org. Occasionally the server may break connection with your process, which causes this error. This can happen when the server is busy, or due to various network issues. Consider using the caching system. Using a cache and FASTA file is the most efficient way to run Ensembl VEP.

Q: Can I use Ensembl VEP on Windows?

Yes - see the documentation for a few different ways to get Ensembl VEP running on Windows.

Q: Can I use Ensembl VEP with species and assemblies which are not available in Ensembl?

Yes - you can run Ensembl VEP on any species you have data for by providing a custom gene annotation in GFF/GTF format and a genome sequence in FASTA format, like so:

./vep -i input.vcf --gff data.gff.gz --fasta genome.fa.gz

Q: Can I use Ensembl VEP with T2T-CHM13 and other human assemblies?

Yes - you can run Ensembl VEP using Human Pangenome Reference Consortium (HPRC) data by following the instructions on how to use Ensembl VEP with HPRC assemblies.

Q: Can I download all of the SIFT and/or PolyPhen predictions?

The Ensembl Variation database and the human Ensembl VEP cache file contain precalculated SIFT and PolyPhen-2 predictions for every possible amino acid change in every translated protein product in Ensembl. Since these data are huge, we store them in a compressed format.

There are different approaches to download SIFT/PolyPhen-2 data:

Using the PolyPhen_SIFT plugin:

For any species with predictions in our Ensembl databases, the plugin is able to download the predictions data into a local SQLite database for offline use. PolyPhen predictions are only available for human data.
We also provide a downloadable SQLite database containing PolyPhen/SIFT predictions based on Human Pangenome Reference Consortium (HPRC) and GRCh38 assemblies. For more information, refer to Missense deleteriousness predictions in HPRC assemblies.

Using our Perl API:

Fetch a ProteinFunctionPredictionMatrix for your protein of interest and then call its get_prediction() method to get the score for a particular position and amino acid, looping over all possible amino acids for your position.
You would need to work out which peptide position your codon maps to, but there are methods in the TranscriptVariation class that should help you (probably translation_start() and translation_end()).