Ensembl Variant Effect Predictor Examples and use cases

Example commands

Read input from STDIN, output to STDOUT
```
./vep --cache -o stdout
```

Add regulatory region consequences

./vep --cache -i variants.txt --regulatory

Input file variants.vcf.txt, input file format VCF, add gene symbol identifiers
```
./vep --cache -i variants.vcf.txt --format vcf --symbol
```
Filter out common variants based on 1000 Genomes data
```
./vep --cache -i variants.txt --filter_common
```
Force overwrite of output file variants_output.txt, check for existing co-located variants, output only coding sequence consequences, output HGVS names
```
./vep --cache -i variants.txt -o variants_output.txt --force --check_existing --coding_only --hgvs
```
Specify DB connection parameters in registry file ensembl.registry, add SIFT score and prediction, PolyPhen prediction
```
./vep --database -i variants.txt --registry ensembl.registry --sift b --polyphen p
```

Connect to Ensembl Genomes db server for Arabidopsis thaliana

./vep --database -i variants.txt --genomes --species arabidopsis_thaliana

Load config from ini file, run in quiet mode
```
./vep --config vep.ini -i variants.txt -q
```
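
As a rough sketch of what such a config file might contain: Ensembl VEP config files are understood to hold whitespace-separated pairs of option name (without the leading dashes) and value, with 1 commonly used for flags that take no value. The file name vep.ini matches the command above; the options chosen here are illustrative only.

```shell
# Write an illustrative vep.ini; each line is "<option> <value>".
# The option names mirror flags used elsewhere on this page.
cat > vep.ini <<'EOF'
cache        1
format       vcf
symbol       1
output_file  variants_output.txt
EOF

# Show what was written
cat vep.ini
```

Options given on the command line are generally expected to take precedence over those in the file, so the file works best for settings shared across runs.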

Use cache in /home/vep/mycache/, use gzcat instead of zcat

./vep --cache --dir /home/vep/mycache/ -i variants.txt --compress gzcat

Add custom position-based phenotype annotation from remote BED file

./vep --cache -i variants.vcf --custom file=ftp://ftp.myhost.org/data/phenotypes.bed.gz,short_name=phenotype

Use the plugin named MyPlugin, output only the variation name, feature, consequence type and MyPluginOutput fields
```
./vep --cache -i variants.vcf --plugin MyPlugin --fields Uploaded_variation,Feature,Consequence,MyPluginOutput
```
Right align variants before consequence calculation. For more information, see here.
```
./vep --cache -i variants.vcf --shift_3prime 1
```
Report uploaded allele before minimisation. For more information, see here.
```
./vep --cache -i variants.vcf --uploaded_allele
```

gnomAD

gnomAD exome frequency data is included in Ensembl VEP's cache files from release 90, replacing ExAC; use --af_gnomade to enable using this data. Ensembl VEP can also retrieve frequency data from the gnomAD genomes set or ExAC via Ensembl VEP's custom annotation functionality.

For the latest gnomAD data, please visit gnomAD downloads.

Ensembl VEP requires Bio::DB::HTS to read data from tabix-indexed VCFs - see installation instructions
The Ensembl FTP site hosts abridged VCF files for gnomAD additionally remapped to GRCh38 using CrossMap. It is possible for Ensembl VEP to read these files directly from their remote location, though for optimal performance the VCF and index should be downloaded to a local file system.
- GRCh38
  - gnomAD genomes (r2.1, remapped with CrossMap): [VCFs and tabix indexes]
  - gnomAD exomes (r2.1, remapped with CrossMap): [VCFs and tabix indexes]
  - ExAC (v0.3, remapped using CrossMap): [VCF] [tabix index]
- GRCh37
  - gnomAD genomes (r2.1): [VCF and tabix indexes]
  - gnomAD exomes (r2.1): [VCF and tabix indexes]
  - ExAC (v0.3): [VCF] [tabix index]

Run Ensembl VEP with the following command (using the GRCh38 input example) to get locations and continental-level allele frequencies:

./vep -i examples/homo_sapiens_GRCh38.vcf --cache \
--custom file=gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz,short_name=gnomADg,format=vcf,type=exact,coords=0,fields=AF_AFR%AF_AMR%AF_ASJ%AF_EAS%AF_FIN%AF_NFE%AF_OTH

You will then see data under field names as described in the Ensembl VEP output header:

## gnomADg : gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz (exact)
## gnomADg_AF_AFR : AF_AFR field from gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz
## gnomADg_AF_AMR : AF_AMR field from gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz
...

where the gnomADg field contains the ID (or coordinates if no ID found) of the variant in the VCF file. Any of the fields in the gnomAD file INFO field can be added by appending them to the list in your Ensembl VEP command.
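
When the list of INFO keys grows, it may be easier to assemble the --custom argument in a small script. The sketch below joins a few of the keys from the command above with the % separator and prints the resulting command instead of running it; the variable names are arbitrary.

```shell
# INFO keys to pull from the gnomAD VCF (all appear in the command above).
# Join them with '%', the separator used inside fields=
fields=$(printf '%s\n' AF_AFR AF_AMR AF_EAS | paste -s -d '%' -)

# File name as used above; adjust to your local copy
vcf=gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz
custom="file=${vcf},short_name=gnomADg,format=vcf,type=exact,coords=0,fields=${fields}"

# Print the command rather than running it, so it can be inspected first
echo "./vep -i examples/homo_sapiens_GRCh38.vcf --cache --custom ${custom}"
```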

Conservation scores

You can use the custom annotation feature to add conservation scores to your output. For example, to add GERP scores, download the bigWig file from the list below, and run Ensembl VEP with the following flag:

./vep --cache -i example.vcf --custom file=All_hg19_RS.bw,short_name=GERP,format=bigwig

Example conservation score files:

Human (GRCh38)

Human (GRCh37)

All files provided by the UCSC genome browser - files for other species are available from their FTP site, though be sure to use the file corresponding to the correct assembly.

dbNSFP

dbNSFP - "a lightweight database of human nonsynonymous SNPs and their functional predictions" - provides pathogenicity predictions from many tools (including SIFT, LRT, MutationTaster, FATHMM) across every possible missense substitution in the human proteome.

Ensembl VEP plugins sometimes require data processed in specific ways as arguments. Any requirements and usage instructions for each plugin can be found in the plugin documentation.

In the case of the dbNSFP.pm plugin, the data needs to be downloaded and then processed into a format that the plugin can use. Note that there are two distinct branches of the files provided for academic and commercial usage; please use the appropriate files for your use case.

After downloading the file, you will need to process it so that tabix can index it correctly. This will take a while as the file is very large! Note that you will need the tabix utility in your path to use dbNSFP.

version=4.5c
unzip dbNSFP${version}.zip
zcat dbNSFP${version}_variant.chr1.gz | head -n1 > h

# GRCh38/hg38 data
zgrep -h -v "^#chr" dbNSFP${version}_variant.chr* | sort -k1,1 -k2,2n - | cat h - | bgzip -c > dbNSFP${version}_grch38.gz
tabix -s 1 -b 2 -e 2 dbNSFP${version}_grch38.gz

# GRCh37/hg19 data
zgrep -h -v "^#chr" dbNSFP${version}_variant.chr* | awk '$8 != "." ' | sort -k8,8 -k9,9n - | cat h - | bgzip -c > dbNSFP${version}_grch37.gz
tabix -s 8 -b 9 -e 9 dbNSFP${version}_grch37.gz
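
The commands above set the header line aside, sort only the data rows, and then put the header back on top. The same pattern can be tried on a tiny invented file before committing to the full-size run; the sketch below leaves out bgzip and tabix so that it needs only standard tools.

```shell
# Toy stand-in for a dbNSFP chromosome file: header line plus unsorted rows.
# Columns and values here are invented for illustration.
workdir=$(mktemp -d)
printf '#chr\tpos(1-based)\tref\talt\n2\t300\tA\tG\n1\t2000\tC\tT\n1\t150\tG\tA\n' > "$workdir/toy.tsv"

# Save the header, sort only the data rows, then put the header back on top
head -n1 "$workdir/toy.tsv" > "$workdir/h"
grep -v '^#chr' "$workdir/toy.tsv" | sort -k1,1 -k2,2n | cat "$workdir/h" - > "$workdir/toy_sorted.tsv"

cat "$workdir/toy_sorted.tsv"
```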

Then simply download the dbNSFP.pm plugin and place it either in $HOME/.vep/Plugins/ or a path in your $PERL5LIB. When you run Ensembl VEP with the plugin, you will need to select some of the columns that you wish to retrieve; to list them run Ensembl VEP with the plugin and the path to the dbNSFP file and no further parameters:

./vep --cache --force --plugin dbNSFP,dbNSFP4.5c_grch38.gz
2014-04-04 11:27:05 - Read existing cache info
2014-04-04 11:27:05 - Auto-detected FASTA file in cache directory
2014-04-04 11:27:05 - Checking/creating FASTA index
2014-04-04 11:27:05 - Failed to instantiate plugin dbNSFP: ERROR: No columns selected to fetch. Available columns are:
#chr,pos(1-coor),ref,alt,aaref,aaalt,hg18_pos(1-coor),genename,Uniprot_acc,
Uniprot_id,Uniprot_aapos,Interpro_domain,cds_strand,refcodon,SLR_test_statistic,
codonpos,fold-degenerate,Ancestral_allele,Ensembl_geneid,Ensembl_transcriptid,
...

Note that some of these fields are replicates of those produced by the core Ensembl VEP code (e.g. SIFT, the 1000 Genomes Project frequencies) - you should use the Ensembl VEP options for frequency information rather than the annotations from dbNSFP as the dbNSFP file covers only missense substitutions. Other fields, such as the conservation scores, may be better served by using genome-wide files as described above.

To select fields, just add them as a comma-separated list to your command line:

./vep --cache --force --plugin dbNSFP,dbNSFP4.5c_grch38.gz,LRT_score,FATHMM_score,MutationTaster_score
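
Before launching a long run, it can be handy to confirm that each requested column name appears in the file header. The sketch below does this against an invented header line; with real data the header would come from the first line of the processed dbNSFP file, and the column names used here are assumptions for illustration.

```shell
# Toy header line; a real one would be read from the processed dbNSFP file.
header=$(printf '#chr\tpos(1-based)\tref\talt\tLRT_score\tFATHMM_score\tMutationTaster_score')

missing=""
for col in LRT_score FATHMM_score NoSuchColumn; do
  # -x matches whole lines only, -F treats the name as a fixed string
  if ! printf '%s\n' "$header" | tr '\t' '\n' | grep -qxF -e "$col"; then
    missing="$missing $col"
  fi
done
echo "missing columns:${missing:- none}"
```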

One final point to note is that the dbNSFP scores are frozen on a particular Ensembl release's transcript set; check the readme file on their download site to find out exactly which. While in the majority of cases protein sequences don't change between releases, in some circumstances the protein sequence used by Ensembl VEP in the latest release may differ from the sequence used to calculate the scores in dbNSFP.

Structural variants

Ensembl VEP can annotate structural variants (SV) with their predicted effect on other genomic features. For more information on SV input format, see here.

Prediction process

If the INFO keys END or SVLEN are present, the proportion of any overlapping feature covered by the variant is calculated
The alternative allele (or SVTYPE in older VCF files) defines the type of structural variant; some types of structural variants are tested for specific consequences:

Structural variant type	Abbreviation	Specific consequences
Insertion	INS	Feature elongation
Deletion	DEL	Feature truncation
Duplication	DUP	Feature amplification/elongation
Inversion	INV	Not tested for any specific consequence
Copy number variation	CNV	Feature amplification/elongation (if copy number is 2) or truncation (if copy number is 0)
Breakpoint variant	BND	Feature truncation

Insertions and deletions

Supports mobile element insertions/deletions, including ALU, HERV, LINE1 and SVA elements
- Currently, mobile element variants are treated as any insertion/deletion

Breakpoint variants

Supports chromosome synonyms in breakends (such as chr4 and NC_000004.12)
Processes single breakends and multiple, comma-separated alternative breakends
Consequences are reported for each breakend; for instance, for a VCF input like 1 7936271 . N N[12:58877476[,N[X:10932343[, it will report the consequences for each of the 3 breakends:
- N[12:58877476[: consequences for the first alternative breakend near chr12:58877476
- N[X:10932343[: consequences for the second alternative breakend near chrX:10932343
- N.: consequences for the reference breakend near chr1:7936271 (represented as detailed in the VCF 4.4 specification, section 5.4.9: Single breakends)
If a breakend does not overlap any reported Ensembl feature (such as a transcript or regulatory region), that breakend is NOT reported in the Ensembl VEP output.
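
To make the example record concrete, the sketch below splits its ALT field into the individual alternative breakends and pulls out each mate position. It is only an illustration of the notation using standard tools, not how Ensembl VEP parses breakends, and it assumes every alternative uses bracketed mate notation.

```shell
# ALT field from the example record above
alt='N[12:58877476[,N[X:10932343['

# One breakend per line, brackets turned into spaces, then keep the chr:pos token
mates=$(printf '%s\n' "$alt" | tr ',' '\n' | tr '[]' '  ' |
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /:/) print $i }')

printf '%s\n' "$mates"
```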

Reported overlaps

Ensembl VEP calculates the length and proportion of each genomic feature overlapped by a structural variant
Use the --overlaps option to enable this when using VCF or tab format. (This is reported by default in standard Ensembl VEP and JSON format.)
The keys bp_overlap and percentage_overlap are used in JSON format and OverlapBP and OverlapPC in other formats.
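
As a worked example of the two numbers involved, the sketch below computes the overlap length and the percentage of a feature covered by a structural variant. It assumes 1-based inclusive coordinates and hypothetical positions, and is not Ensembl VEP's own implementation.

```shell
# Hypothetical 1-based inclusive coordinates: SV 100-199, feature 150-249
overlap=$(awk -v sv_start=100 -v sv_end=199 -v f_start=150 -v f_end=249 'BEGIN {
  ov_start = (sv_start > f_start) ? sv_start : f_start   # later of the two starts
  ov_end   = (sv_end   < f_end)   ? sv_end   : f_end     # earlier of the two ends
  bp = ov_end - ov_start + 1
  if (bp < 0) bp = 0                                     # no overlap at all
  printf "%d %.2f", bp, 100 * bp / (f_end - f_start + 1)
}')

# Prints the overlap in bases followed by the percentage of the feature covered
echo "$overlap"
```

Here the variant covers positions 150 to 199 of the feature, that is 50 of its 100 bases.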

Plugin support

CADD plugin
Conservation plugin
NearestGene plugin
Phenotypes plugin
StructuralVariantOverlap plugin: please note that all features of this plugin have been ported to --custom annotation, with additional improvements
TSSDistance plugin

Changing memory requirements

By default, Ensembl VEP does not annotate variants larger than 10M (10 million bases). If you are using the command-line tool, you can use the --max_sv_size option to modify this.
- This limit is not associated with breakpoint variants: each breakend in a breakpoint variant is analysed by Ensembl VEP as a single base (the alternative sequence is currently ignored).
By default, variants are analysed in batches of 5000. Using the --buffer_size option to reduce this can reduce memory requirements, especially if your data is sparse. A smaller buffer size is essential when annotating structural variants with regulatory data.

Pangenome assemblies

Ensembl VEP can analyse variants in any species or assembly (even if not part of Ensembl data) if you provide your own FASTA file and GFF/GTF annotation:

./vep -i variants.txt -o variants_output.txt --gff data.gff.gz --fasta genome.fa.gz

We also provide data for other assemblies besides those supported in the current Ensembl and Ensembl Genomes sites.

HPRC assemblies

The Human Pangenome Reference Consortium (HPRC) aims to sequence 350 individuals of diverse ancestries, producing a pangenome of 700 haplotypes by the end of 2024. The first publication (A draft human pangenome reference) describes 47 phased, diploid assemblies from a cohort of genetically diverse individuals.

The Ensembl VEP command-line tool (CLI) can annotate and filter variants called against the latest human assemblies, including the telomere-to-telomere assembly of the CHM13 cell line (T2T-CHM13). We have annotated genes on these human assemblies, based on Ensembl/GENCODE 38 genes and transcripts, via a new mapping pipeline as detailed in the Methods section of A draft human pangenome reference. The links to download and visualise the human annotations for HPRC assemblies are summarised in the Ensembl HPRC data page.

Running Ensembl VEP with the human HPRC assemblies

Currently, Ensembl VEP can only be run with HPRC assemblies in offline mode, one assembly at a time. There are two ways to use Ensembl VEP with HPRC assemblies:

Using Ensembl VEP cache with (recommended) FASTA sequence (the most efficient way)
Using GTF annotation with (mandatory) FASTA sequence

In the examples below, we demonstrate annotating variants on T2T-CHM13v2.0 (GCA_009914755.4 assembly). To create a sample VCF for these examples, you can take the first 100 lines from the ClinVar VCF file mapped to T2T-CHM13:

clinvar=ftp://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/vcf/2024_07/clinvar_20240624_GCA_009914755.4.vcf.gz
tabix -h $clinvar 1 | head -n 100 > test.vcf

Ensembl VEP cache

The cache is a downloadable archive containing all transcript models for an assembly; it may also contain regulatory features and variant data.

Let's start by downloading the cache (available for each annotation by clicking on VEP cache in the Ensembl HPRC data page) and extracting it to the default Ensembl VEP directory. In the case of T2T-CHM13:

cd $HOME/.vep
curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/indexed_vep_cache/Homo_sapiens-GCA_009914755.4-2022_10.tar.gz
tar xzf Homo_sapiens-GCA_009914755.4-2022_10.tar.gz

This will create the folder homo_sapiens_gca009914755v4/107_T2T-CHM13v2.0 with the gene data required to run Ensembl VEP. The name of this folder contains relevant information when running Ensembl VEP:

Species: homo_sapiens_gca009914755v4
Cache version: 107
Assembly: T2T-CHM13v2.0
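
If you script around several caches, the three values can be recovered from the folder name with plain shell parameter expansion. This is a small sketch that assumes the species/version_assembly layout shown above; the variable names are arbitrary.

```shell
cache_dir='homo_sapiens_gca009914755v4/107_T2T-CHM13v2.0'

species=${cache_dir%%/*}       # text before the slash
rest=${cache_dir#*/}           # text after the slash
cache_version=${rest%%_*}      # up to the first underscore
assembly=${rest#*_}            # everything after that underscore

# Print the matching options rather than running Ensembl VEP
echo "vep --offline --species $species --cache_version $cache_version  # assembly: $assembly"
```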

As well as molecular consequence predictions, many gene/transcript-based options are supported for HPRC assemblies:

vep -i test.vcf --offline \
    --species homo_sapiens_gca009914755v4 \
    --cache_version 107 \
    --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz \
    --domains --symbol --canonical --protein --biotype --uniprot --variant_class

The cache does not contain other annotations, such as RefSeq transcripts or variant information.

To run Ensembl VEP with the downloaded cache in offline mode, please specify the species (which here includes assembly name) and cache version:

vep -i test.vcf --offline --species homo_sapiens_gca009914755v4 --cache_version 107

FASTA sequence

When using the cache, supplying the reference genomic sequence in a FASTA file is optional, but is required to enable the following options:

Create HGVS notations (--hgvs and --hgvsg)
Check the reference sequence given in input data (--check_ref)

Genomic FASTA files can be found in Ensembl HPRC data page > FTP dumps > ensembl > genome. FASTA files need to be either uncompressed or compressed with bgzip (recommended) to be compatible with Ensembl VEP. For instance, to download a compressed FASTA file, uncompress it and then re-compress it with bgzip:

curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/genome/Homo_sapiens-GCA_009914755.4-softmasked.fa.gz
gzip -d Homo_sapiens-GCA_009914755.4-softmasked.fa.gz
bgzip Homo_sapiens-GCA_009914755.4-softmasked.fa

Afterwards, you can run Ensembl VEP using cache and the --fasta flag:

vep -i test.vcf --offline \
    --species homo_sapiens_gca009914755v4 \
    --cache_version 107 \
    --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz

More information on using FASTA files is available here.

GTF and GFF annotation

As an alternative to using cache files, Ensembl VEP can utilise gene information in appropriately indexed GTF or GFF files. GTF and GFF files can be downloaded from the annotation column in the Ensembl HPRC data page. The data needs to be re-sorted in chromosomal order, compressed with bgzip and indexed with tabix. We present here the example for a GTF file:

curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/geneset/2022_07/Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz
gzip -d Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz
grep -v "#" Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf |\
  sort -k1,1 -k4,4n -k5,5n -t$'\t' |\
  bgzip -c > Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz
tabix Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz
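
Since indexing depends on the rows being in positional order, it can be worth checking the sort order before compressing. The sketch below applies the same sort keys to a few invented GTF-like rows and uses sort -c to confirm the result; it leaves out bgzip and tabix so that it runs with standard tools only.

```shell
tab=$(printf '\t')
workdir=$(mktemp -d)

# Toy GTF-like rows (invented): seqname, source, feature, start, end
printf '2\ttoy\tgene\t500\t900\n1\ttoy\tgene\t3000\t4000\n1\ttoy\tgene\t100\t800\n' > "$workdir/toy.gtf"

# Same sort keys as above: sequence name, then numeric start, then numeric end
sort -t "$tab" -k1,1 -k4,4n -k5,5n "$workdir/toy.gtf" > "$workdir/toy.sorted.gtf"

# sort -c exits non-zero if the file is not already in the requested order
if sort -c -t "$tab" -k1,1 -k4,4n -k5,5n "$workdir/toy.sorted.gtf"; then
  echo "sorted"
fi
```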

FASTA files are always required when running HPRC data with GTF annotation, as the transcript sequences are not available in the GTF files.

Afterwards, you can run Ensembl VEP using the GTF and FASTA files:

vep -i test.vcf \
      --gtf Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz \
      --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz

Check here for more information on using gene annotations in GTF and GFF files.

Missense deleteriousness predictions

Although PolyPhen-2 and SIFT scores are not directly available for alternative assemblies by using --polyphen and --sift, they can be retrieved via the PolyPhen_SIFT plugin.

Using our ProteinFunction pipeline, we ran PolyPhen-2 2.2.3 and SIFT 6.2.1 on the proteome sequences for GRCh38 and all HPRC assemblies (the protein FASTA files indicated in Ensembl HPRC data page) and stored their results in a single SQLite file: homo_sapiens_pangenome_PolyPhen_SIFT_20240502.db.

Pre-computed scores and predictions can be retrieved by downloading this file and using the PolyPhen_SIFT plugin:

curl -O https://ftp.ensembl.org/pub/current_variation/pangenomes/Human/homo_sapiens_pangenome_PolyPhen_SIFT_20240502.db
vep -i test.vcf --offline \
    --species homo_sapiens_gca009914755v4 \
    --cache_version 107 \
    --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz \
    --plugin PolyPhen_SIFT,db=homo_sapiens_pangenome_PolyPhen_SIFT_20240502.db

Matched variant annotations (ClinVar, gnomAD and dbSNP)

We don't have variant data in the caches for the HPRC assemblies, but it can be integrated using the --custom option with data files that use the same assembly coordinates. We have lifted over some key datasets, including ClinVar and gnomAD, to the HPRC assemblies (downloadable from the VCF column in Ensembl HPRC data page).

# Download ClinVar data and respective index (TBI)
curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/vcf/2024_07/clinvar_20240624_GCA_009914755.4.vcf.gz
curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/vcf/2024_07/clinvar_20240624_GCA_009914755.4.vcf.gz.tbi

# Run Ensembl VEP with ClinVar data
vep -i test.vcf --offline \
    --species homo_sapiens_gca009914755v4 --cache_version 107 \
    --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz \
    --custom file=clinvar_20240624_GCA_009914755.4.vcf.gz,short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN

Additional annotations

Ensembl VEP plugins are a simple way to add new functionality to your analysis. Many require data that is only available for GRCh37 or GRCh38, but others, for example those based on gene attributes or on-the-fly analysis, are compatible with the HPRC assemblies.

The following Ensembl VEP plugins are easily compatible with alternative human assemblies:

Plugin	Description	Plugin data	Usage example
Blosum62	Looks up the BLOSUM 62 substitution matrix score for the reference and alternative amino acids predicted for a missense mutation.		`--plugin Blosum62`
DosageSensitivity	Retrieves haploinsufficiency and triplosensitivity probability scores for affected genes (Collins et al., 2022).	`Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz`	`--plugin DosageSensitivity,file=Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz`
Downstream	Predicts downstream effects of a frameshift variant on the protein sequence of a transcript.	Requires a FASTA file provided via the --fasta option	`--plugin Downstream`
Draw	Draws pictures of the transcript model showing the variant location.		`--plugin Draw`
GeneSplicer	Runs GeneSplicer to get splice site predictions.	Binary and training data for GeneSplicer (plugin instructions)	`--plugin GeneSplicer,binary=genesplicer/bin/linux/genesplicer,training=genesplicer/human`
GO	Retrieves Gene Ontology (GO) terms associated with genes (for HPRC assemblies, specifically) using custom GFF annotation containing GO terms.	Ensembl HPRC data page > FTP dumps > ensembl > variation > [date] > gff: `_GO_plugin.gff.gz` `_GO_plugin.gff.gz.tbi`	`--plugin GO,file=homo_sapiens_gca009914755v4_110_VEP_GO_plugin.gff.gz`
HGVSIntronOffset	Returns HGVS intron start and end offsets. To be used with --hgvs option.		`--plugin HGVSIntronOffset`
LoFtool	Provides a rank of genic intolerance and consequent susceptibility to disease based on the ratio of Loss-of-function (LoF) to synonymous mutations for each gene.		`--plugin LoFtool`
MaxEntScan	Runs MaxEntScan to get splice site predictions.	Extracted directory from fordownload.tar.gz	`--plugin MaxEntScan,/path/to/fordownload`
NearestExonJB	Finds the nearest exon junction boundary to a coding sequence variant.		`--plugin NearestExonJB`
NMD	Predicts if a variant allows the transcript to escape nonsense-mediated mRNA decay based on certain rules.		`--plugin NMD`
Phenotypes	Retrieves overlapping phenotype information.	Ensembl HPRC data page > FTP dumps > ensembl > variation > [date] > gff: `_phenotypes_plugin.gvf.gz` `_phenotypes_plugin.gvf.gz.tbi`	`--plugin Phenotypes,file=homo_sapiens_gca009914755v4_110_VEP_phenotypes_plugin.gvf.gz`
pLI	Adds the probability of a gene being loss-of-function intolerant (pLI).		`--plugin pLI`
PolyPhen_SIFT	Retrieves PolyPhen and SIFT predictions from a SQLite database.	`homo_sapiens_pangenome_PolyPhen_SIFT_20240502.db`	`--plugin PolyPhen_SIFT,db=homo_sapiens_pangenome_PolyPhen_SIFT_20240502.db`
ProteinSeqs	Writes two files with the reference and mutated protein sequences of any proteins found with non-synonymous mutations in the input file.		`--plugin ProteinSeqs`
SingleLetterAA	Returns HGVSp string with single amino acid letter codes.		`--plugin SingleLetterAA`
SpliceRegion	Provides more granular predictions of splicing effects.		`--plugin SpliceRegion`
SubsetVCF	Retrieves overlapping records from a given VCF file.	A VCF file	`--plugin SubsetVCF,file=file.vcf.gz,name=myvfc`
TranscriptAnnotator	Annotates variant-transcript pairs based on a given file.	Tab-separated annotation file (plugin instructions)	`--plugin TranscriptAnnotator,file=annotation.txt.gz`
TSSDistance	Calculates the distance from the transcription start site for upstream variants.		`--plugin TSSDistance`

Citations and Ensembl VEP users

Ensembl VEP is used by many organisations and projects:

Ensembl VEP forms a part of Illumina's VariantStudio software
Gemini is a framework for exploring genome variation that uses Ensembl VEP
The DECIPHER project uses Ensembl VEP to aid variant interpretation

Other citations and use cases:

VAX is a suite of plugins that expand Ensembl VEP functionality
pViz is a visualisation tool for Ensembl VEP results files
McCarthy et al compares Ensembl VEP to AnnoVar
Pabinger et al reviews variant analysis software, including Ensembl VEP
Ensembl VEP is used to provide annotation for the gnomAD project
For a fuller list of Ensembl VEP citations see EuropePMC information for our 2010, 2016 and tutorial publications.
