Approximate time: 20 minutes
VEP will add annotations from a number of sources for each variant that we upload. Below is a subset of the most commonly used annotations.
Identifiers: Gene, transcript, protein, etc.
Frequency data: Allele frequency information from multiple public databases: 1000 Genomes, [gnomAD](https://gnomad.broadinstitute.org/), and [ESP](https://evs.gs.washington.edu/EVS/). Allele frequency information helps to determine whether the input variant is common or rare in different geographical populations.
Pathogenicity predictions: Computational predictions of whether a variant will affect protein function. Various algorithms are available (SIFT, PolyPhen2, CADD, etc.).
Disease Association:
Clinical significance and disease association as reported in ClinVar.
ClinVar is a widely used database that aggregates and curates clinical reports of variants with clinical determinations.
The clinical significances reported in VEP range from Benign to Pathogenic and usually have a disease annotation.
Consequence: For each variant, VEP identifies all transcripts in the selected database (Ensembl or RefSeq) that overlap with the variant coordinates. The consequence of the variant with respect to each transcript is then evaluated based on the following diagram.
These consequences are then binned into impact groups: HIGH, MODERATE, LOW, and MODIFIER. For a full mapping of consequence to impact, see the VEP documentation.
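To make the binning concrete, here is a minimal Python sketch of how a consequence term could be looked up in an impact table. The dictionary covers only a handful of terms and is not the full VEP mapping. The ranking used to pick the "most severe" group and the helper name `most_severe_impact` are assumptions of this sketch, not part of VEP.

```python
# Partial, illustrative mapping from consequence terms to impact groups.
# It is NOT exhaustive; see the VEP documentation for the full table.
CONSEQUENCE_IMPACT = {
    "stop_gained": "HIGH",
    "frameshift_variant": "HIGH",
    "splice_donor_variant": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "splice_region_variant": "LOW",
    "intron_variant": "MODIFIER",
    "upstream_gene_variant": "MODIFIER",
    "intergenic_variant": "MODIFIER",
}

# Ordering used to choose the most severe group (an assumption of this sketch).
IMPACT_RANK = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}


def most_severe_impact(consequences):
    """Return the highest-ranked impact among comma-separated consequence terms.

    Terms missing from the partial table are ignored; None is returned if
    no term is recognised.
    """
    terms = [t.strip() for t in consequences.split(",") if t.strip()]
    impacts = [CONSEQUENCE_IMPACT[t] for t in terms if t in CONSEQUENCE_IMPACT]
    if not impacts:
        return None
    return max(impacts, key=IMPACT_RANK.__getitem__)


print(most_severe_impact("missense_variant"))                       # MODERATE
print(most_severe_impact("splice_region_variant,intron_variant"))   # LOW
```

A variant can carry several consequence terms for one transcript, which is why the helper accepts a comma-separated string.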
We’ll run VEP on the VCF that we produced and analyze the variant consequences.
First, we’ll download the VCF from the cluster to our local computer.
Click Files and select Home Directory
.intro-to-ngs/results/na12878.vcf
Download
In a web browser tab, navigate to https://useast.ensembl.org/Tools/VEP. Note that VEP can also be run on the command line on our HPC, which produces a text file (txt or vcf). You are welcome to ask for instructions on running the command-line VEP. For the analysis of a single VCF, the web server is recommended in order to take advantage of its visualization tools.
In the Species section choose Human (Homo sapiens) (should be the default)
In the Input data section choose Or upload file: and navigate to the downloaded file na12878.vcf
Under Transcript database to use select RefSeq transcripts
Click Run
When your job is done, click View Results
<img src="../img/vep_results_1.png" width="900">
Our goal is to identify variants that change the coding sequence.
We can see in the Coding Consequences box on the right that 20% of the variants are missense, which means that they change the coding sequence of the transcript.
Under Filters choose Consequence + is + missense_variant and click Add
You should see 1 row. Here is a subset of the interesting columns:
Location | Allele | Consequence | IMPACT | SYMBOL | BIOTYPE | Amino_acids |
---|---|---|---|---|---|---|
10:94842866-94842866 | G | missense_variant | MODERATE | CYP2C19 | protein_coding | I/V |
Existing_variation | SIFT | PolyPhen | AF | Clinical Significance |
---|---|---|---|---|
rs3758581,CM983294 | tolerated(0.38) | benign(0.05) | 0.9515 | - |
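For readers who later work with VEP's tab-delimited text output instead of the web table, the missense filter used above could be reproduced roughly as follows. This is a sketch under assumptions: the function name `filter_by_consequence` is hypothetical, the sample uses only a few columns (real output has many more), and the second data row is a made-up placeholder included only to show a row being filtered out.

```python
import csv
import io


def filter_by_consequence(vep_text, term="missense_variant"):
    """Return rows (as dicts) whose Consequence field contains `term`.

    Assumes tab-delimited text in which '##' lines are metadata and the
    column header line starts with a single '#'.
    """
    lines = [ln for ln in vep_text.splitlines() if ln and not ln.startswith("##")]
    if not lines:
        return []
    lines[0] = lines[0].lstrip("#")  # turn '#Location...' into a plain header
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    # Consequence may hold several comma-separated terms for one transcript.
    return [row for row in reader if term in row["Consequence"].split(",")]


sample = "\n".join([
    "## metadata line (placeholder)",
    "#Location\tAllele\tConsequence\tIMPACT\tSYMBOL",
    # Row taken from the results table in this lesson:
    "10:94842866-94842866\tG\tmissense_variant\tMODERATE\tCYP2C19",
    # Made-up placeholder row, not real data:
    "1:1000-1000\tA\tintron_variant\tMODIFIER\tGENE_X",
])

for row in filter_by_consequence(sample):
    print(row["Location"], row["SYMBOL"], row["IMPACT"])
```

Splitting the Consequence field on commas avoids accidental substring matches between term names.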
Based on the annotations, one can conclude that this variant is unlikely to cause disease. This is consistent with what we know about NA12878 being a healthy individual.
Though the variant does change the amino acid from I to V, both SIFT and PolyPhen suggest that this change does not alter protein function.
Furthermore, there is no ClinVar report associated with this variant.
Finally, the maximum allele frequency found for this variant in the 1000 Genomes database is 0.95, meaning it is a very common variant and unlikely to be pathogenic.
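The reasoning above combines three lines of evidence: the prediction labels, the ClinVar status, and the allele frequency. The Python sketch below encodes it as a simple teaching heuristic. The function names and the 0.05 frequency cutoff are assumptions chosen for illustration. This is not a clinical classification scheme.

```python
import re


def parse_prediction(value):
    """Split a string like 'tolerated(0.38)' into ('tolerated', 0.38)."""
    match = re.fullmatch(r"\s*([A-Za-z_]+)\((\d*\.?\d+)\)\s*", value)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


def looks_benign(sift, polyphen, allele_freq, clin_sig="", common_af=0.05):
    """Return True when every line of evidence points away from pathogenicity.

    `common_af` is an illustrative cutoff assumed for this sketch.
    """
    sift_pred = parse_prediction(sift)
    poly_pred = parse_prediction(polyphen)
    checks = [
        sift_pred is not None and sift_pred[0].startswith("tolerated"),
        poly_pred is not None and poly_pred[0] == "benign",
        allele_freq >= common_af,
        "pathogenic" not in clin_sig.lower(),  # an empty string means no report
    ]
    return all(checks)


# Values from the CYP2C19 row in the results table above.
print(looks_benign("tolerated(0.38)", "benign(0.05)", 0.9515))  # True
```

Requiring all checks to pass is deliberately conservative: a single conflicting annotation is enough to flag the variant for a closer look.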