Yum, beautiful regulatory variants to spot...

Regul@tionSpotter

Documentation


Analysing a single variant

Input

Single Variant

For single queries in RegulationSpotter, you can use the single query interface. Here, you can enter single variants as shown in our single query tutorial. Simply fill in the chromosome and position of the variant along with the reference and alternative allele. Please note that for InDels you have to use the VCF format, i.e. always start with the last reference base before the variant.
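The VCF convention for InDels mentioned above can be illustrated with a minimal Python sketch. The helper names `deletion_to_vcf` and `insertion_to_vcf` and the toy reference sequence are assumptions made for illustration only; they are not part of RegulationSpotter.

```python
def deletion_to_vcf(reference: str, start: int, length: int):
    """Return (pos, ref, alt) for deleting `length` bases from 1-based `start`.

    The record is anchored at the last reference base before the deletion.
    """
    if start < 2:
        raise ValueError("this sketch needs a reference base before the variant")
    anchor = reference[start - 2]                      # base at position start-1
    deleted = reference[start - 1:start - 1 + length]  # the removed bases
    return start - 1, anchor + deleted, anchor


def insertion_to_vcf(reference: str, after: int, inserted: str):
    """Return (pos, ref, alt) for inserting `inserted` after 1-based `after`."""
    anchor = reference[after - 1]                      # last base before the insertion
    return after, anchor, anchor + inserted


toy_reference = "ACGTTGCA"  # hypothetical sequence, positions 1..8
print(deletion_to_vcf(toy_reference, 4, 2))      # (3, 'GTT', 'G')
print(insertion_to_vcf(toy_reference, 3, "AA"))  # (3, 'G', 'GAA')
```

In both cases the reported position is that of the anchor base, and the reference and alternative alleles both begin with it.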

Output

Clicking the submit button leads you to a view of your results. For intragenic alterations and known disease causing variants, you will be redirected to our conventional MutationTaster output. More information on this can be found in the MutationTaster documentation. Here, we will focus on explanations about the detailed RegulationSpotter output.

Screenshot of a single variant result file

Screenshot of detailed results view of RegulationSpotter analysis of a single variant.

Results

Likely effect of an alteration

RegulationSpotter treats alterations differently depending on whether they are located within a gene or not. For alterations in protein-coding transcripts of genes, it relies on MutationTaster, which classifies an alteration as one of four possible types:

disease causing (ClinVar): known disease mutation listed in ClinVar.
disease causing: predicted by MutationTaster as disease causing.
polymorphism: predicted by MutationTaster as harmless.
polymorphism automatic: known to be harmless from databases.

For more details about the classification process, please refer to our MutationTaster documentation.

Extragenic (extratranscriptic) alterations are assessed by RegulationSpotter directly. The program compiles and combines all the regulatory data and estimates how likely it is that a variant is located in a regulatory region. RegulationSpotter also assesses public data sources such as ClinVar, 1000G and ExAC in order to reliably classify known variants. The possible outcomes are:

disease causing (ClinVar): known disease mutation listed in ClinVar.
functional region (much evidence): Regulatory information is available for the variant's location in several data sources, resulting in a high Region Score which makes it possible in principle to classify this variant as "functional" with high confidence. The confidence was drawn from testing the distribution of Region Scores for a set of known extratranscriptic disease variants and known harmless extratranscriptic variants with a high frequency of homozygous occurrence in the data of 1000G. Due to the available data, RegulationSpotter considers it very likely for the variant to have a regulatory function. The high confidence label is given to variants for which the positive predictive value (PPV) was at least 98% and the negative predictive value (NPV) was below 98%.
functional region: Regulatory information is available for the variant's location which leads to a high Region Score. Due to the available data, RegulationSpotter considers it possible for the variant to have a regulatory function.
non-functional region: Low Region Score which makes it rather unlikely that this variant is located in a regulatory region and plays a functional role. Due to the available data, RegulationSpotter considers the variant to be not located in a regulatory region. For the high confidence label, the same thresholds as for functional variants were used.
non-functional region (much evidence): Very low Region Score which makes it very unlikely that this variant is located in a regulatory region and thus this variant is very likely not functional in terms of gene regulation.
polymorphism: The variant was found in homozygous state in several samples in the data from 1000G or ExAC which is indicative of a harmless polymorphism.
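For readers unfamiliar with the PPV and NPV mentioned above, the following Python sketch shows how the two values are computed from a confusion matrix. It assumes that "functional" is the positive class; the counts in the example are hypothetical and are not RegulationSpotter's evaluation data.

```python
def predictive_values(tp: int, fp: int, tn: int, fn: int):
    """Return (PPV, NPV) from true/false positive and negative counts."""
    ppv = tp / (tp + fp)  # share of "functional" calls that are correct
    npv = tn / (tn + fn)  # share of "non-functional" calls that are correct
    return ppv, npv


# Hypothetical counts, for illustration only
ppv, npv = predictive_values(tp=98, fp=2, tn=90, fn=10)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # PPV = 0.98, NPV = 0.90
```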

The results overview is followed by a summary section summing up the most important annotations for this variant.

Alteration (phys. location)

The alteration on "physical" i.e. chromosomal level (e.g. chr7:91623937_91623938insGGCAAT).

Alteration type

Is either SNV (a single base exchange), an insertion, a deletion or a combination of insertion and deletion.

Alteration region

Extratranscriptic by definition. Extratranscriptic in this context means any position that lies outside of (protein-coding) transcripts, which also includes promoter regions upstream of the gene start (the 5' most TSS).

Known variant

Any known polymorphism(s) or known disease variant that have been found at the position in question. Our database contains all single nucleotide polymorphisms (SNPs) from the NCBI SNP database (dbSNP). If an alteration is located at the same position as a known dbSNP, RegulationSpotter provides the SNP ID (or rs ID) and a link. Please note that there may be differences between your alteration and the alleles in dbSNP.
Moreover, we have stored all variants from the 1000 Genomes Project [1] (1000G) and from the Exome Aggregation Consortium (ExAC) [2]. For 1000G and ExAC, RegulationSpotter provides detailed information about homozygous/heterozygous hits and numbers of allele carriers. If an alteration was found more than 4 times homozygously in 1000G or more than 10 times homozygously in ExAC, it is automatically regarded as a polymorphism.
We also display known disease variants from NCBI ClinVar. If a variant is marked as probable-pathogenic or pathogenic in ClinVar, it is automatically predicted to be disease-causing, i.e. disease causing automatic (the Region Score is calculated and shown nevertheless). We also provide a link to the respective entry in the ClinVar database.
Moreover, we have integrated the public version of the Human Gene Mutation Database (HGMD) [3]. The data includes the positions of the disease mutations and their HGMD ID. The disease alleles are not included so we cannot use HGMD for automatic predictions. Whenever an HGMD public disease mutation is found at the same position as a variant, this will be written in the summary. We also place a direct hyperlink to the mutation in HGMD into the 'dbSNP / 1000G / HGMD(public) / ClinVar' field, so you can check whether the HGMD mutation has the same allele as your variant (and whether the disease matches). Please note that you must be logged in at the HGMD site to make the hyperlink work - access to the public version is free but requires registration.
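The two automatic rules described in this section can be sketched in Python as follows. The function names, the argument names and the exact spelling of the ClinVar labels are assumptions; the thresholds are the ones stated above. The sketch deliberately keeps the two rules separate, because this documentation does not state which rule takes precedence when both apply.

```python
# Assumed spelling of the ClinVar labels named in the text
PATHOGENIC_LABELS = {"pathogenic", "probable-pathogenic"}


def is_automatic_polymorphism(hom_1000g: int, hom_exac: int) -> bool:
    """True if found > 4x homozygously in 1000G or > 10x homozygously in ExAC."""
    return hom_1000g > 4 or hom_exac > 10


def is_automatic_disease_causing(clinvar_label) -> bool:
    """True if ClinVar marks the variant as (probable-)pathogenic."""
    return clinvar_label is not None and clinvar_label.lower() in PATHOGENIC_LABELS


print(is_automatic_polymorphism(hom_1000g=5, hom_exac=0))   # True
print(is_automatic_polymorphism(hom_1000g=4, hom_exac=10))  # False
print(is_automatic_disease_causing("Pathogenic"))           # True
print(is_automatic_disease_causing(None))                   # False
```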

Promoters

This section displays all the promoters annotated for your variant. RegulationSpotter gets this information from various sources:

Ensembl promoters are derived from Ensembl multicell regulatory features [4]. We display the cell lines that were annotated with it. We grouped the cell lines according to their biological properties, which is indicated by the background colour. Features can be either active (black circle) or poised / repressed / inactive (white circle).
FANTOM predictions come from FANTOM5 data (obtained from Ensembl Regulatory build, b37, published in [4])
RegulationSpotter promoters were annotated in our group. They denote promoters as areas lying 500 bp upstream and 50 bp downstream of a transcription start site (TSS) whose status as promoter is (often) supported by typical features such as DNase1 hypersensitivity and H3K4me3 histone marks. This location-based classification around the TSS is used to connect promoters annotated by Ensembl to nearby genes / transcripts, information which is not directly available from Ensembl Regulatory promoter elements with an ENSR ID.
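The location-based promoter definition (500 bp upstream and 50 bp downstream of a TSS) can be sketched in Python as shown here. The function names are hypothetical, and the inclusive boundaries and the strand-aware orientation of "upstream" are assumptions; strand is coded as 1 or -1 as elsewhere in this documentation.

```python
def promoter_window(tss: int, strand: int, upstream: int = 500, downstream: int = 50):
    """Return the inclusive (start, end) promoter window around a TSS."""
    if strand == 1:
        # forward strand: upstream means smaller coordinates
        return tss - upstream, tss + downstream
    if strand == -1:
        # reverse strand: upstream means larger coordinates
        return tss - downstream, tss + upstream
    raise ValueError("strand must be 1 or -1")


def in_promoter(position: int, tss: int, strand: int) -> bool:
    """True if `position` falls inside the promoter window of this TSS."""
    start, end = promoter_window(tss, strand)
    return start <= position <= end


print(promoter_window(10000, 1))    # (9500, 10050)
print(promoter_window(10000, -1))   # (9950, 10500)
print(in_promoter(9700, 10000, 1))  # True
```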

Enhancers

Displays all enhancer annotations found for your variant. Enhancer annotations were obtained from FANTOM5 [5] and VISTA [6].

Epigenetic Marks (RegulationSpotter)

Epigenetic marks (DNase1 hypersensitive sites and H3K4me3 annotations) obtained from Ensembl multicell regulatory features which were found in at least 3 cell lines and overlap with a promoter region. Please keep in mind that the coordinates of these marks may differ from the marks directly taken from the Ensembl Regulatory Build, because we show the overlap between different cell lines. This allows for a sharper annotation than in the Ensembl Regulatory Features Promoter and Promoter flanking region, but is less detailed than the cell-based single track annotations.
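One way to picture why the coordinates can differ from the original tracks is the sweep-line sketch below, which reports only those stretches covered by a mark in at least 3 cell lines. This is an illustration under assumptions (half-open intervals, hypothetical function name and toy data), not a description of how RegulationSpotter actually computes its marks.

```python
def regions_supported_by(intervals_by_cell_line, min_cell_lines=3):
    """Return (start, end) regions covered by at least `min_cell_lines` cell lines.

    Intervals are half-open [start, end); the intervals of a single cell
    line are assumed not to overlap one another.
    """
    events = []
    for intervals in intervals_by_cell_line.values():
        for start, end in intervals:
            events.append((start, 1))   # a mark begins
            events.append((end, -1))    # a mark ends
    # At equal coordinates, process starts before ends so touching marks merge
    events.sort(key=lambda event: (event[0], -event[1]))

    regions, depth, region_start = [], 0, None
    for coordinate, change in events:
        previous = depth
        depth += change
        if previous < min_cell_lines <= depth:
            region_start = coordinate            # support reaches the threshold
        elif depth < min_cell_lines <= previous:
            if coordinate > region_start:        # skip zero-length regions
                regions.append((region_start, coordinate))
            region_start = None
    return regions


# Toy data with hypothetical cell line names and coordinates
marks = {
    "cell_line_A": [(100, 200)],
    "cell_line_B": [(150, 250)],
    "cell_line_C": [(180, 300)],
    "cell_line_D": [(400, 500)],
}
print(regions_supported_by(marks))  # [(180, 200)]
```

The reported region (180 to 200) is narrower than any of the individual marks, which mirrors the "sharper annotation" described above.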

Histone Modifications, Polymerase, Open Chromatin, Transcription Factor Binding Site

We obtained cell-based annotations on histone modifications, polymerase binding sites, open chromatin and transcription factor binding sites (TFBSs) from the Ensembl regulatory build [4].
For each regulatory feature, we display the cell lines that were annotated with it. We grouped the cell lines according to their biological properties, which is indicated by the background colour:

Blood cells: GM12878, GM12865, CMK, GM12891, GM15510, Th2, GM12801, GM12892, GM18507, GM19240, Jurkat, GM12873, GM18526, GM19099, GM19238, GM18951, Th1, GM12874, GM12878-XiMat, GM12864, GM10847, GM12875, NB4, GM19239, GM12872, GM19193, GM18505, K562, K562b, DND-41, Monocytes-CD14+
Bone cells: Osteobl
Brain cells: Medullo
Breast cells: MCF10A-Er-Src, MCF7
Colon cells: Caco-2, HCT116
Embryonic Stem cells: H9ESC, H1ESC, H7ESC
Endothelial cells: HUVEC
Epithelial cells: LHSR, HPAEpiC, HRPEpiC, HCPEpiC, HEEpiC, HAEpiC, HIPEpiC, SAEC, HRE, A549, RPTEC, HRCEpiC, NHBE, HNPCEpiC, HeLa-S3, HMEC
Fetal Membrane cells: Chorion
Gingiva cells: AG09319, HGF
Heart cells: HCF
Kidney cells: HEK 293, HEK293b
Liver cells: HepG2b, HepG2
Lung cells: NHLF, AG04450, IMR90
Monocytes: Monocytes-CD14+
Muscle cells: SKMC, HSMMtube, HSMM
Neuron cells: SKNSHRA, PFSK1, NH-A, SKNMC
Pancreas cells: PanIslets, Panc1
Retina cells: WERIRB1
Skin cells: ProgFib, Melano, BJ, Fibrobl, AG10803, AG04449, NHEK, NHDF-neo, NHDF-Ad, NHDF, AG09309
Not grouped: NTERA-2 cl.D1, DND-41, HCM

Histone Modifications

We used annotations for the following 28 histone modifications from the Ensembl regulatory build:

H2AK5ac	Histone 2A Lysine 5 Acetylation
H2AZ	Histone 2A variant Z
H2BK120ac	Histone 2B Lysine 120 Acetylation
H2BK12ac	Histone 2B Lysine 12 Acetylation
H2BK15ac	Histone 2B Lysine 15 Acetylation
H2BK20ac	Histone 2B Lysine 20 Acetylation
H2BK5ac	Histone 2B Lysine 5 Acetylation
H3K14ac	Histone 3 Lysine 14 Acetylation
H3K18ac	Histone 3 Lysine 18 Acetylation
H3K23ac	Histone 3 Lysine 23 Acetylation
H3K23me2	Histone 3 Lysine 23 Di-Methylation
H3K27ac	Histone 3 Lysine 27 Acetylation
H3K27me3	Histone 3 Lysine 27 Tri-Methylation
H3K36me3	Histone 3 Lysine 36 Tri-Methylation
H3K4ac	Histone 3 Lysine 4 Acetylation
H3K4me1	Histone 3 Lysine 4 Mono-Methylation
H3K4me2	Histone 3 Lysine 4 Di-Methylation
H3K4me3	Histone 3 Lysine 4 Tri-Methylation
H3K56ac	Histone 3 Lysine 56 Acetylation
H3K79me1	Histone 3 Lysine 79 Mono-Methylation
H3K79me2	Histone 3 Lysine 79 Di-Methylation
H3K9ac	Histone 3 Lysine 9 Acetylation
H3K9me1	Histone 3 Lysine 9 Mono-Methylation
H3K9me3	Histone 3 Lysine 9 Tri-Methylation
H4K20me1	Histone 4 Lysine 20 Mono-Methylation
H4K5ac	Histone 4 Lysine 5 Acetylation
H4K8ac	Histone 4 Lysine 8 Acetylation
H4K91ac	Histone 4 Lysine 91 Acetylation

Open Chromatin

For annotation of open chromatin, we used DNase I hypersensitive sites from the Ensembl regulatory build.

Polymerase Binding Sites

Indicates that annotations for Polymerase II and Polymerase III were found for your variant's location.

Transcription Factor Binding Sites

We included the following TFBSs (see list below). TFBSs that are annotated in at least 3 different cell lines are printed in bold. TFBSs can be either confirmed, i.e. found by experimental procedures such as ChIP-seq, or deduced by motif, i.e. the DNA sequence contains the binding motif of a certain TF.

Ap2alpha	Ap2alpha Transcription Factor Binding
Ap2gamma	Ap2gamma Transcription Factor Binding
ATF3	ATF3 Transcription Factor Binding
BAF155	BAF155 Transcription Factor Binding
BAF170	BAF170 Transcription Factor Binding
BATF	BATF Transcription Factor Binding
BCL11A	BCL11A Transcription Factor Binding
BCL3	BCL3 Transcription Factor Binding
BCLAF1	BCLAF1 Transcription Factor Binding
BHLHE40	BHLHE40 Transcription Factor Binding
Brg1	Brg1 Transcription Factor Binding
Cfos	Cfos TF binding
Cjun	Cjun TF binding
Cmyc	Cmyc TF binding
CTCF	CCCTC-binding factor
CTCFL	CTCFL Transcription Factor Binding
E2F1	E2F1 Transcription Factor Binding
E2F4	E2F4 Transcription Factor Binding
E2F6	E2F6 Transcription Factor Binding
EBF1	EBF1 Transcription Factor Binding
Egr1	Egr1 Transcription Factor Binding
ELF1	ELF1 Transcription Factor Binding
ETS1	ETS1 Transcription Factor Binding
FOSL1	FOSL1 Transcription Factor Binding
FOSL2	FOSL2 Transcription Factor Binding
FOXA1	FOXA1 Transcription Factor Binding
FOXA2	FOXA2 Transcription Factor Binding
Gabp	Gabp TF binding
Gata1	Gata1 TF binding
Gata2	Gata2 Transcription Factor Binding
GTF2B	GTF2B Transcription Factor Binding
HDAC2	HDAC2 Transcription Factor Binding
HDAC8	HDAC8 Transcription Factor Binding
HEY1	HEY1 Transcription Factor Binding
HNF4A	HNF4A Transcription Factor Binding
HNF4G	HNF4G Transcription Factor Binding
Ini1	Ini1 Transcription Factor Binding
IRF4	IRF4 Transcription Factor Binding
Junb	Junb Transcription Factor Binding
Jund	Jund TF binding
Max	Max TF binding
MEF2A	MEF2A Transcription Factor Binding
MEF2C	MEF2C Transcription Factor Binding
Nanog	Nanog Transcription Factor Binding
Nfe2	Nfe2 TF binding
NFKB	NFKB Transcription Factor Binding
NR4A1	NR4A1 Transcription Factor Binding
Nrf1	Nrf1 Transcription Factor Binding
Nrsf	Nrsf TF binding
p300	p300 Transcription Factor Binding
Pax5	Pax5 Transcription Factor Binding
Pbx3	Pbx3 Transcription Factor Binding
POU2F2	POU2F2 Transcription Factor Binding
POU5F1	POU5F1 Transcription Factor Binding
PU1	PU1 Transcription Factor Binding
Rad21	Rad21 Transcription Factor Binding
RXRA	RXRA Transcription Factor Binding
SETDB1	SETDB1 Transcription Factor Binding
Sin3Ak20	Sin3Ak20 Transcription Factor Binding
SIX5	SIX5 Transcription Factor Binding
SP1	SP1 Transcription Factor Binding
SP2	SP2 Transcription Factor Binding
Srf	Srf TF binding
TAF1	TAF1 Transcription Factor Binding
TAF7	TAF7 Transcription Factor Binding
Tcf12	Tcf12 Transcription Factor Binding
THAP1	THAP1 Transcription Factor Binding
Tr4	Tr4 Transcription Factor Binding
USF1	USF1 Transcription Factor Binding
XRCC4	XRCC4 Transcription Factor Binding
Yy1	Yy1 Transcription Factor Binding
ZBTB33	ZBTB33 Transcription Factor Binding
ZBTB7A	ZBTB7A Transcription Factor Binding
ZEB1	ZEB1 Transcription Factor Binding
Znf263	Znf263 TF binding
ZNF274	ZNF274 Transcription Factor Binding

In this section, you will also find a link to ePOSSUM, our software for the analysis of transcription factor binding sites.

Genomic Interactions

We integrated data on the interaction of distant genomic elements generated by Hi-C experiments from Rao et al. [7], by 5C experiments for the ENCODE project [8,9] carried out by groups from the University of Massachusetts, and from the 4D Genome database. The 5C (UMass) and Hi-C data were downloaded from NCBI.
For each interaction annotated for your variant, RegulationSpotter displays the gene name and Ensembl gene ID as well as the type of element involved in the interaction: either a promoter or a distant element that interacts with a promoter (and might be an enhancer). It should be noted that due to multiple TSSs of the same gene, a variant can be considered as affecting the promoter of a certain gene or not, depending on which transcript / TSS is under scrutiny.
We only display interactions which were present in at least 3 different cell lines and also list the affected cell lines.
To give you a better understanding of the interaction, RegulationSpotter also displays the interaction as a plot - just try out the link given below in the figure caption.

Interaction plot

Screenshot of an interaction plot. This plot is embedded in the single variant output (example - click on 'show interactions as plot', direct link to the interaction plot).
The image is divided into two parts, which can be resized and scrolled separately to bring together the different elements: the upper part shows the genes or transcripts involved (the display can be changed by clicking on 'show transcripts instead of genes'), while the lower part shows the interacting regions, in one of which the analysed variant is located. The thin red line marks the location of the variant. Interaction elements are depicted as black lines with blue ends; the blue ends represent the genomic elements which were found to interact with each other, e.g. by Hi-C or similar methods. Protein-coding genes or transcripts in the region are shown as red rectangles, while non-protein-coding genes (e.g. pseudogenes) or transcripts (e.g. processed transcripts) are marked with a little green box. You can switch between viewing genes (usually resulting in a condensed picture) or transcripts (extended view). We recommend switching on the transcript view in order to understand the classification of interacting elements as promoter or distant element (e.g. enhancer). Moreover, you will find a link to explore the region in Ensembl. Below the plot you can find a legend explaining the picture.

PhyloP/PhastCons

Indicates the conservation of the alteration site. Data from phyloP [10] and PhastCons [11].
PhastCons and phyloP are both methods to determine the grade of conservation of a given nucleotide. RegulationSpotter uses values which are precomputed and offered by UCSC (please follow the links to phyloP and PhastCons).
phastCons values vary between 0 and 1 and reflect the probability that each nucleotide belongs to a conserved element, based on the multiple alignment of genome sequences of 46 different species (the closer the value is to 1, the more likely it is that the nucleotide is conserved). It considers not just each individual alignment column, but also its flanking columns.
In contrast, phyloP (values between -14 and +6) separately measures conservation at individual columns, ignoring the effects of their neighbors. Moreover, phyloP can not only measure conservation (slower evolution than expected under neutral drift) but also acceleration (faster evolution than expected). Sites predicted to be conserved are assigned positive scores, while sites predicted to be fast-evolving are assigned negative scores.
For deletions, insertions and indels, the phyloP and phastCons values of all affected bases are not summed up. Instead, only one phyloP value and one phastCons value are added to the Region Score:

In case of a deletion: the highest value of all deleted bases
In case of an insertion: the highest value of the two flanking bases
In case of an indel: the highest value of the two flanking bases or the deleted bases
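The selection rules listed above can be sketched in Python as follows. The function name, the argument names and the toy phyloP values are assumptions for illustration; the same selection would be applied once to the phyloP values and once to the phastCons values.

```python
def representative_score(scores, kind, left_flank, right_flank):
    """Pick the single per-base score that enters the Region Score.

    `scores` maps chromosomal position to a phyloP or phastCons value.
    `left_flank` / `right_flank` are the last unchanged base before and the
    first unchanged base after the alteration; any bases strictly between
    them are the deleted ones.
    """
    deleted = [scores[p] for p in range(left_flank + 1, right_flank)]
    if kind == "deletion":
        return max(deleted)
    flanks = [scores[left_flank], scores[right_flank]]
    if kind == "insertion":
        return max(flanks)
    if kind == "indel":
        return max(flanks + deleted)
    raise ValueError("kind must be 'deletion', 'insertion' or 'indel'")


# Hypothetical phyloP values for four neighbouring positions
phylop = {100: 0.2, 101: 3.1, 102: -0.5, 103: 1.4}
print(representative_score(phylop, "deletion", 100, 103))   # 3.1 (bases 101-102 deleted)
print(representative_score(phylop, "insertion", 102, 103))  # 1.4 (flanks 102 and 103)
print(representative_score(phylop, "indel", 101, 103))      # 3.1 (flanks plus base 102)
```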

For more information about phyloP and phastCons, please see the cited papers.

CADD

The CADD [12] value for the respective position. Please be aware that we always display the highest value for a certain position, regardless of the actual variant, which means that the CADD value displayed here might slightly differ from the actual value for the distinct variant stored or displayed elsewhere. Moreover, CADD values are only displayed for informational reasons and are not included in the score. The integrated version is CADD for b37 v1.3.

Chromosome

The chromosome the alteration is located on.

Strand

Is either 1 for the forward strand or -1 for the reverse strand.

Chromosomal position

Gives the last wild-type base before the alteration and the first wild-type base after the alteration in chromosomal sequence context (position relative to the start of the chromosomal reference sequence), e.g. 154,372,337 / 154,372,339 if the altered base is at position 154,372,338.
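The flanking positions can be derived from the affected bases as in the following Python sketch. The function names are hypothetical, and treating an insertion as lying between two adjacent unchanged bases is an assumption that matches the insertion example given under "Alteration (phys. location)".

```python
def flanking_positions(first_affected: int, last_affected: int):
    """Flanks of a substitution or deletion affecting first..last (1-based, inclusive)."""
    return first_affected - 1, last_affected + 1


def insertion_flanks(after: int):
    """Flanks of an insertion placed directly after position `after`."""
    return after, after + 1


def format_flanks(left: int, right: int) -> str:
    """Format the two flanks with thousands separators."""
    return f"{left:,} / {right:,}"


print(format_flanks(*flanking_positions(154372338, 154372338)))  # 154,372,337 / 154,372,339
print(format_flanks(*insertion_flanks(91623937)))                # 91,623,937 / 91,623,938
```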

Original chrDNA sequence snippet

Original DNA sequence with the original nucleotide marked in blue.

Altered chrDNA sequence snippet

Altered DNA sequence with the original nucleotide marked in blue.

Speed

The time that was required for the current analysis.

Contact

In case you discover bugs, have suggestions or questions, please write an e-mail to
Jana Marie Schwarz (jana-marie.schwarz AT charite.de) or to
Dominik Seelow (dominik.seelow AT charite.de).
We also appreciate hearing about your general experiences using RegulationSpotter.

References

[1] 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature 2012 Nov 1. PMID: 23128226

[2] Analysis of protein-coding genetic variation in 60,706 humans. Monkol Lek, Konrad J. Karczewski[…]Exome Aggregation Consortium. Nature volume 536, pages 285–291 (18 August 2016)

[3] The Human Gene Mutation Database: 2008 update. Peter D Stenson, Matthew Mort, Edward V Ball, Katy Howells, Andrew D Phillips, Nick ST Thomas and David N Cooper. Genome Medicine 2009.

[4] Zerbino DR, Wilder SP, Johnson N, Huettemann T, Flicek PR. The Ensembl Regulatory Build. Genome Biology 2015. PMID: 25887522

[5] FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al. A promoter-level mammalian expression atlas. Nature 507, 462-470 (2014).

[6] Visel A, Minovitsky S, Dubchak I, Pennacchio LA. VISTA Enhancer Browser - a database of tissue-specific human enhancers. Nucleic Acids Res. 2007. PMID: 17130149

[7] Rao SS, Huntley MH, Durand NC, Stamenova EK et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 2014. PMID: 25497547

[8] ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).

[9] Sloan, C. A. et al. ENCODE data at the ENCODE portal. Nucleic Acids Res. 44, D726-732 (2016).

[10] Pollard KS, Hubisz MJ, Siepel A. Detection of non-neutral substitution rates on mammalian phylogenies. Genome Res. 2009. PMID: 19858363

[11] Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005. PMID: 16024819

[12] CADD: predicting the deleteriousness of variants throughout the human genome. Philipp Rentzsch, Daniela Witten, Gregory M Cooper, Jay Shendure, Martin Kircher. Nucleic Acids Research 2018.