Variant Effect Predictor Examples and use cases


Example commands

  • Read input from STDIN, output to STDOUT

    ./vep --cache -o stdout
  • Add regulatory region consequences

    ./vep --cache -i variants.txt --regulatory
  • Input file variants.vcf.txt, input file format VCF, add gene symbol identifiers

    ./vep --cache -i variants.vcf.txt --format vcf --symbol
  • Filter out common variants based on 1000 Genomes data

    ./vep --cache -i variants.txt --filter_common
  • Force overwrite of output file variants_output.txt, check for existing co-located variants, output only coding sequence consequences, output HGVS names

    ./vep --cache -i variants.txt -o variants_output.txt --force --check_existing --coding_only --hgvs
  • Run for any species or assembly (even if not part of Ensembl data) by providing your own FASTA file and GFF/GTF annotation

    ./vep -i variants.txt -o variants_output.txt --gff data.gff.gz --fasta genome.fa.gz
  • Specify DB connection parameters in registry file ensembl.registry, add SIFT score and prediction, PolyPhen prediction

    ./vep --database -i variants.txt --registry ensembl.registry --sift b --polyphen p
  • Connect to Ensembl Genomes db server for Arabidopsis thaliana

    ./vep --database -i variants.txt --genomes --species arabidopsis_thaliana
  • Load config from ini file, run in quiet mode

    ./vep --config vep.ini -i variants.txt -q
  • Use cache in /home/vep/mycache/, use gzcat instead of zcat

    ./vep --cache --dir /home/vep/mycache/ -i variants.txt --compress gzcat
  • Add custom position-based phenotype annotation from remote BED file

    ./vep --cache -i variants.vcf --custom file=ftp://ftp.myhost.org/data/phenotypes.bed.gz,short_name=phenotype
  • Use the plugin named MyPlugin, output only the variation name, feature, consequence type and MyPluginOutput fields

    ./vep --cache -i variants.vcf --plugin MyPlugin --fields Uploaded_variation,Feature,Consequence,MyPluginOutput
  • Right align variants before consequence calculation. For more information, see here.

    ./vep --cache -i variants.vcf --shift_3prime 1
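
The --config example above presumes that vep.ini already exists. A minimal sketch of creating one is shown here. It assumes the config format is one whitespace-separated "option value" pair per line, with on/off flags set to 1; the options chosen are illustrative only.

```shell
# Write a small VEP config file (contents are illustrative; adjust to your setup).
# Assumption: each line is "<option name> <value>", and on/off flags take the value 1.
cat > vep.ini <<'EOF'
cache 1
symbol 1
check_existing 1
output_file variants_output.txt
EOF

# Review the file before passing it to VEP with: ./vep --config vep.ini -i variants.txt -q
cat vep.ini
```

Options given on the command line can then be kept to the ones that change between runs, such as the input file.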

gnomAD

gnomAD exome frequency data has been included in VEP's cache files since release 90, replacing ExAC; use --af_gnomade to report it. VEP can also retrieve frequency data from the gnomAD genomes set or from ExAC via its custom annotation functionality.

For the latest gnomAD data, please visit gnomAD downloads.

  1. VEP requires Bio::DB::HTS to read data from tabix-indexed VCFs - see installation instructions
  2. Ensembl's FTP site hosts abridged VCF files for gnomAD and ExAC, additionally remapped to GRCh38 using CrossMap. It is possible for VEP to read these files directly from their remote location, though for optimal performance the VCF and index should be downloaded to a local file system.
  3. Run VEP with the following command (using the GRCh38 input example) to get locations and continental-level allele frequencies:

    ./vep -i examples/homo_sapiens_GRCh38.vcf --cache \
    --custom file=gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz,short_name=gnomADg,format=vcf,type=exact,coords=0,fields=AF_AFR%AF_AMR%AF_ASJ%AF_EAS%AF_FIN%AF_NFE%AF_OTH

    You will then see data under field names as described in the VEP output header:

    ## gnomADg : gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz (exact)
    ## gnomADg_AF_AFR : AF_AFR field from gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz
    ## gnomADg_AF_AMR : AF_AMR field from gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz
    ...
    where the gnomADg field contains the ID (or coordinates if no ID found) of the variant in the VCF file. Any of the fields in the gnomAD file INFO field can be added by appending them to the list in your VEP command.
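
Because the fields value in the command above is a %-separated list, it can be convenient to assemble it from individual field names. The following is a sketch only: it reuses the file and field names from the command above and prints the resulting command instead of running it.

```shell
# Build the value passed to --custom from a list of INFO field names.
# File name and field names are the ones used in the example command above.
vcf="gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz"

# Join the field names with '%' inside a subshell so IFS is unchanged for the rest of the script.
field_list=$(IFS=%; set -- AF_AFR AF_AMR AF_ASJ AF_EAS AF_FIN AF_NFE AF_OTH; printf '%s' "$*")

custom="file=${vcf},short_name=gnomADg,format=vcf,type=exact,coords=0,fields=${field_list}"

# Print the full command for review rather than running it here.
echo "./vep -i examples/homo_sapiens_GRCh38.vcf --cache --custom ${custom}"
```

To request further INFO fields, add their names to the list passed to set.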

Conservation scores

You can use VEP's custom annotation feature to add conservation scores to your output. For example, to add GERP scores, download the bigWig file from the list below, and run VEP with the following flag:

./vep --cache -i example.vcf --custom file=All_hg19_RS.bw,short_name=GERP,format=bigwig

Example conservation score files:

All files are provided by the UCSC Genome Browser. Files for other species are available from their FTP site; be sure to use the file that matches your assembly.


dbNSFP

dbNSFP - "a lightweight database of human nonsynonymous SNPs and their functional predictions" - provides pathogenicity predictions from many tools (including SIFT, PolyPhen, LRT, MutationTaster, FATHMM) across every possible missense substitution in the human proteome. The data is available to download; although VEP cannot use it as downloaded, it is simple to process into a format that the dbNSFP.pm plugin can read.

After downloading the file, you will need to process it so that tabix can index it correctly. This will take a while as the file is very large! Note that you will need the tabix utility in your path to use dbNSFP.

unzip dbNSFP4.0b2a.zip
head -n1 dbNSFP4.0b2a_variant.chr1 > dbNSFP4.0b2a.txt
cat dbNSFP4.0b2a_variant.chr* | grep -v "^#" >> dbNSFP4.0b2a.txt
rm dbNSFP4.0b2a_variant.chr*
bgzip dbNSFP4.0b2a.txt
tabix -s 1 -b 2 -e 2 dbNSFP4.0b2a.txt.gz
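
To see what the head/cat/grep steps achieve without waiting on the full download, the same pattern can be tried on two tiny stand-in files. This is a sketch only; the toy file names and contents are made up, and bgzip and tabix are left out.

```shell
# Toy demonstration of the header-preserving concatenation (file names and contents are made up).
workdir=$(mktemp -d)
printf '#chr\tpos(1-coor)\tref\talt\n1\t100\tA\tG\n' > "$workdir/toy_variant.chr1"
printf '#chr\tpos(1-coor)\tref\talt\n2\t200\tC\tT\n' > "$workdir/toy_variant.chr2"

# Take the header once, then append every data line with the per-file headers dropped.
# The pattern is anchored to the line start so only header lines are removed.
head -n1 "$workdir/toy_variant.chr1" > "$workdir/toy.txt"
cat "$workdir"/toy_variant.chr* | grep -v "^#" >> "$workdir/toy.txt"

# The result should hold one header line followed by the data lines from both files.
cat "$workdir/toy.txt"
header_count=$(grep -c "^#" "$workdir/toy.txt")
line_count=$(wc -l < "$workdir/toy.txt" | tr -d ' ')
echo "headers=${header_count} lines=${line_count}"
```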

Then simply download the dbNSFP VEP plugin and place it either in $HOME/.vep/Plugins/ or a path in your $PERL5LIB. When you run VEP with the plugin, you will need to select some of the columns that you wish to retrieve; to list them run VEP with the plugin and the path to the dbNSFP file and no further parameters:

./vep --cache --force --plugin dbNSFP,dbNSFP4.0b2a.txt.gz
2014-04-04 11:27:05 - Read existing cache info
2014-04-04 11:27:05 - Auto-detected FASTA file in cache directory
2014-04-04 11:27:05 - Checking/creating FASTA index
2014-04-04 11:27:05 - Failed to instantiate plugin dbNSFP: ERROR: No columns selected to fetch. Available columns are:
#chr,pos(1-coor),ref,alt,aaref,aaalt,hg18_pos(1-coor),genename,Uniprot_acc,
Uniprot_id,Uniprot_aapos,Interpro_domain,cds_strand,refcodon,SLR_test_statistic,
codonpos,fold-degenerate,Ancestral_allele,Ensembl_geneid,Ensembl_transcriptid,
...

Note that some of these fields duplicate those produced by the core VEP code (e.g. SIFT, PolyPhen, the 1000 Genomes and ESP frequencies). Because the dbNSFP file covers only missense substitutions, you should enable these through VEP's own options rather than taking them from dbNSFP. Other fields, such as the conservation scores, may be better served by using genome-wide files as described above.

To select fields, just add them as a comma-separated list to your command line:

./vep --cache --force --plugin dbNSFP,dbNSFP4.0b2a.txt.gz,LRT_score,FATHMM_score,MutationTaster_score
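
Column names must match the file exactly, so it can help to check a selection against the available columns before a long run. The sketch below uses a shortened, made-up header in the comma-separated form that the plugin prints in its "Available columns" message; it is not part of the plugin.

```shell
# Check requested column names against a dbNSFP-style column list before running VEP.
# The header is a shortened, made-up stand-in for the plugin's "Available columns" message.
header="#chr,pos(1-coor),ref,alt,LRT_score,MutationTaster_score,FATHMM_score"
requested="LRT_score FATHMM_score MutationTaster_score"

missing=""
for col in $requested; do
    # Wrap both sides in commas so only whole column names match.
    case ",${header#\#}," in
        *",${col},"*) ;;
        *) missing="${missing}${col} " ;;
    esac
done

if [ -n "$missing" ]; then
    echo "Not in header: ${missing}"
else
    echo "All requested columns found"
fi
```

A misspelled name ends up in the "Not in header" message instead of reaching VEP.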

One final point to note is that the dbNSFP scores are frozen on a particular Ensembl release's transcript set; check the readme file on their download site to find out exactly which. While in the majority of cases protein sequences don't change between releases, in some circumstances the protein sequence used by VEP in the latest release may differ from the sequence used to calculate the scores in dbNSFP.


Structural Variants

VEP can be used to annotate structural variants (SV) with their predicted effect on other genomic features. For more information on SV input format, see here.

Prediction process

  • If the INFO keys 'END' or 'SVLEN' are present, the proportion of any overlapping feature covered by the variant is calculated
  • If the SVTYPE or ALT is 'DEL', the variant is tested for feature ablation/truncation
  • If the SVTYPE or ALT is 'DUP', the variant is tested for feature amplification
  • If the SVTYPE or ALT is 'INS' or 'DUP', the variant is tested for feature elongation
  • SVTYPE is used in preference to ALT to derive the variant type of an SV with 'CN*' alleles
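
The type-specific rules above can be restated as a simple lookup. This is only a summary of the list, not VEP's own code, and sv_tests is a hypothetical helper name.

```shell
# Restates the rules listed above as a lookup; this is not VEP's implementation.
# sv_tests is a hypothetical helper name.
sv_tests() {
    case "$1" in
        DEL) echo "ablation/truncation" ;;
        DUP) echo "amplification, elongation" ;;
        INS) echo "elongation" ;;
        *)   echo "no type-specific test listed" ;;
    esac
}

# Print the tests applied to each variant type named in the list.
for svtype in DEL DUP INS; do
    printf '%s\t%s\n' "$svtype" "$(sv_tests "$svtype")"
done
```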

Reported overlaps

  • VEP calculates the length and proportion of each genomic feature overlapped by a structural variant
  • Use the --overlaps option to enable this when using VCF or tab format. (This is reported by default in standard VEP and JSON format.)
  • The keys bp_overlap and percentage_overlap are used in JSON format and OverlapBP and OverlapPC in other formats.
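
As an illustration of what the two overlap values represent, the sketch below computes them for one made-up variant and feature. It assumes 1-based inclusive coordinates and makes no claim about how VEP rounds the percentage.

```shell
# Overlap between a structural variant and a feature, assuming 1-based inclusive coordinates.
# All coordinates are made up for illustration.
sv_start=1500
sv_end=2500
feat_start=1000
feat_end=1999

# The overlapping interval runs from the larger start to the smaller end.
ov_start=$(( sv_start > feat_start ? sv_start : feat_start ))
ov_end=$(( sv_end < feat_end ? sv_end : feat_end ))

if [ "$ov_end" -ge "$ov_start" ]; then
    bp_overlap=$(( ov_end - ov_start + 1 ))
else
    bp_overlap=0
fi

# Percentage of the feature (not of the variant) that is covered.
feat_len=$(( feat_end - feat_start + 1 ))
pct_overlap=$(awk -v o="$bp_overlap" -v l="$feat_len" 'BEGIN { printf "%.2f", 100 * o / l }')

echo "bp_overlap=${bp_overlap} percentage_overlap=${pct_overlap}"
```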

Changing memory requirements

  • By default, VEP does not annotate variants larger than 10 Mb (10,000,000 bp). With the command line tool, use the --max_sv_size option to change this limit.
  • By default, variants are analysed in batches of 5000. Lowering this with the --buffer_size option can reduce memory requirements, especially if your data is sparse. A smaller buffer size is essential when annotating structural variants with regulatory data.
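
Both options can be combined in one run. The values and the input file name below are placeholders, and the sketch assumes --max_sv_size takes a size in base pairs; it prints the command for review rather than running it.

```shell
# Illustrative values only: raise the SV size limit and shrink the buffer.
max_sv_size=50000000   # allow SVs up to 50 Mb (assumes the option takes a size in bp)
buffer_size=500        # hold fewer variants in memory per batch

# structural_variants.vcf is a placeholder input file name.
cmd="./vep --cache -i structural_variants.vcf --regulatory --max_sv_size ${max_sv_size} --buffer_size ${buffer_size}"

# Print the command for review instead of running it here.
echo "$cmd"
```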

Citations and VEP users

VEP is used by many organisations and projects.

Other citations and use cases:

  • VAX is a suite of plugins for VEP that expands its functionality
  • pViz is a visualisation tool for VEP results files
  • McCarthy et al. compare VEP to ANNOVAR
  • Pabinger et al. review variant analysis software, including VEP
  • VEP is used to provide annotation for the ExAC and gnomAD projects