Washington University Tutorial



NIEHS SNPs Workshop

January 11, 2008

Interactive Tutorial 2: Web Tools for SNP Selection

Goal: This tutorial introduces several websites and tools useful for determining linkage disequilibrium for your gene or region of interest and for tagSNP selection. In this section, you will cover the following topics.

• Linkage disequilibrium (LD) and TagSNPs

o Genome Variation Server (GVS)

o NIEHS SNPs precomputed tagSNPs

• NIEHS SNPs website tools

o Visual Haplotype (VH1)

o GeneSNPs – Haplotype view

• Batch Genome Variation Server

• Haploview

o HapMap Data in Haploview

o EGP Data in Haploview

Part 1. Linkage Disequilibrium using Genome Variation Server

1. Go to the NIEHS SNPs home page at and select ‘Genome Variation Server’ from the left-hand menu under ‘Software’ (or go directly to the Genome Variation Server at ).

2. Click the ‘Gene Name’ button. This will take you to a page on which you should enter FGFR2 in the gene name box, and 2000 in both the upstream and downstream boxes. Then, click the search button.

3. Under the select data set – deselect EGP-Asian-Panel and select EGP-CEPH-Panel. In the set-up parameters for display and analysis change the allele frequency cutoff from 0 to 5, and in Display Results click the green button labeled ‘display linkage disequilibrium.’

4. A new page will appear with the title Select Display Type. This page contains links to a table listing the r2 values between pairs of SNPs identified by rs numbers and a graphical representation of the visual genotypes with a triangular LD plot underneath. On the Select Display Type page, click on ‘open graphical display of linkage disequilibrium,’ and a window will appear with a visual display of the genotypes. The rs numbers of SNPs are listed across the top of the image. These numbers are coded with keys at the bottom of this window: the ‘Variation Labeling Color’ details the color-coding of the numbers according to function, and ‘Variation Labeling Style’ explains the font style, with SNPs in unique regions in bold type and SNPs in repeat regions in normal type. The numbers on the left side of the image represent the sample ID. Each square represents an individual sample’s genotype: homozygous for the common allele (blue), heterozygous (red), homozygous for the rare allele (yellow), undetermined (where no genotypes are available - grey), and conflicting genotypes (which can occur when you merge multiple data sets – black).

5. Using the visual genotype figure and triangle plot, evaluate the LD among SNPs in this gene. First, notice that the highest LD is represented by red squares in the triangle plot (see the LD Min/Max Shading Scale, with the lowest r2 set at 0.1). The square colors in the LD triangle range from blue (cold for LD, r2 = 0.1) to red (hot for LD, r2 = 1.0), with white squares indicating no LD (r2 < 0.1). If you tried to determine which SNPs are in high LD (r2 > 0.9, or red squares in the triangle plot) with rs1219648, which is the fourth SNP from the left, you would not have an easy task. Let’s try one more way to simplify the LD analysis, which will really help you answer this question. Close the graphical display and the Select Display Type page, return to the Genome Variation Server page, and move to the set-up parameters for display and analysis section. Go to the section called Clustering in Graphic Display and select the ‘Cluster SNPs’ option. Now display LD again and, when the ‘Select Display Type’ window appears, click ‘open graphical display of linkage disequilibrium.’ Now determine which SNPs are in high LD with rs1219648. The first thing you will notice is that associated SNPs are now next to one another. It is now much easier to evaluate the small blocks of LD that are present in this gene.

6. To save the visual genotype/pairwise triangle plot image to your computer, right-click on the image and choose ‘save as.’

7. You can also get the data in a table form. Return to the ‘Select Display Type’ window and click ‘open table display of linkage disequilibrium.’ The data will appear as text in a table on a new page. You can also get a text version of the pairwise LD table by returning to the Genome Variation Server page and, under ‘Data Output and Display,’ go to ‘Display SNPs by’ and toggle to text. Click on the ‘display linkage disequilibrium’ button and a savable text-based table will appear.

8. Close both the pairwise LD text table and visual genotype image of LD browser windows.

9. Explore LD in other populations; so far you have been looking at the CEPH population. On the Genome Variation Server page, move up to Select Data Set, deselect EGP CEPH-Panel, and select EGP Yorub-Panel. Leave everything else the same, move down the page to Display Results, and click ‘display linkage disequilibrium.’ On the Select Display Type page, click on ‘open graphical display of linkage disequilibrium.’ Is there more or less LD in the African population samples?

10. There is one more feature of LD analysis with GVS that could be useful. It lets you explore LD in the genome around SNPs identified via association studies, either candidate gene studies or genome-wide studies. For example, the SNP you were analyzing above, rs1219648, is a SNP in the first intron of the FGFR2 gene. This SNP and several others in the first intron of this gene have been identified as the top hits in several genome-wide association studies of breast cancer susceptibility (see Hunter et al 2007 Nature Genetics 39:870). Go back to the opening GVS page where you can select your search. Rather than selecting a gene name, click on dbSNP rs id, and it will take you to the rs ID search. Enter 1219648 and 20000 upstream and downstream and click search. When you get to Select Data Sets, deselect the EGP Asian-Panel and select EGP CEPH-Panel and the HapMap CEU. In the Set-Up Parameters for Display and Analysis, you can leave the merging option set on option A (common samples and combine variation) because these data sets have similar samples. Then, under ‘Display SNP by:’, toggle to the Custom-Text option. Click on Display Linkage Disequilibrium. You will then get a choice of formats; select SNPs paired with rs1219648 and enter 0.70 in the r2 box. You can then select the type of information you will get about the pairs, including function. Scroll down the page and click submit. How many SNPs are in LD with rs1219648? What is their functional status (where do they occur in the gene)? This query provides more information because you are exploring more than one data set. Neither of these data sets is complete, but together they can give you more information across the same samples. This type of query can also be done in a batch mode, which is presented later in this tutorial.
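The r2 values that GVS reports in Part 1 can be illustrated with a short calculation. The sketch below is a minimal Python example, assuming phased haplotypes written as strings with one allele per SNP; GVS itself may estimate r2 differently (for example, directly from unphased genotypes). The function names and the SNP labels in the usage lines are hypothetical.

```python
def r_squared(haplotypes, i, j):
    """Pairwise r^2 between sites i and j from phased haplotype strings."""
    n = len(haplotypes)
    # Use the first allele seen at each site as the reference allele.
    a = haplotypes[0][i]
    b = haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one site is monomorphic in this sample
    d = p_ab - p_a * p_b  # the disequilibrium coefficient D
    return d * d / denom


def snps_in_ld_with(haplotypes, names, target, threshold=0.70):
    """List the SNPs whose r^2 with the target SNP meets the threshold."""
    t = names.index(target)
    return [name for k, name in enumerate(names)
            if k != t and r_squared(haplotypes, t, k) >= threshold]


# Hypothetical data: four chromosomes typed at three SNPs.
haps = ["ACG", "ACT", "GTG", "GTT"]
names = ["snpA", "snpB", "snpC"]
print(snps_in_ld_with(haps, names, "snpA"))
```

In this toy data set snpA and snpB always travel together (r2 = 1), while snpC is independent of snpA (r2 = 0), so only snpB is listed. This mirrors the ‘SNPs paired with’ query in step 10, where 0.70 is the r2 cutoff.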

Part 2. TagSNPs determination using GVS

1. To pick tagSNPs, let’s use another gene that has been linked to breast cancer susceptibility, CASP8. Go back to the opening GVS page. (Hint: Click on GVS: Genome Variation Server in the left hand corner at the top of the page. This will take you to the main GVS page). Once there, click on Gene Name, and in the search box enter CASP8 and enter 20000 in the upstream and downstream boxes, and then click the search button.

2. Under the select data set, deselect the EGP-Asian-Panel and select EGP-CEPH-Panel. This is by far the largest data set available, so it is the best one to use for tagSNP selection, since it will give you the most complete set of tags. In the set-up parameters for display and analysis, change the allele frequency cutoff from 0 to 5, and in Display Results, click the button labeled ‘display tagSNPs.’ The default parameter for selecting tagSNPs is r2 > 0.80. When the ‘Select Data Type’ window opens, click on the graphical display first. How many SNPs meet the allele frequency cutoff (remember to look under the gene name)? Notice that the associated SNPs are now clustered together in this view. The bars over the SNP IDs mark each of the bins (clusters of associated SNPs), and the asterisk indicates the SNP or SNPs that can serve as a tagSNP. SNPs with just an asterisk and no bar are known as singleton SNPs (such a SNP needs to be typed directly because it has no proxy in the data set).

3. Close the graphical window and return to the select data type window and click on ‘open table display of tagSNPs.’

4. The output lists each bin, its average minor allele frequency, the SNPs that can be tagSNPs and, in the last column, the SNPs about which the tagSNPs capture information during an association analysis. Only one tagSNP per bin is required to represent the genetic associations for that bin. Note that in Bin 1 all the SNPs can serve as tagSNPs, and they all capture information equally about each other. However, in Bin 3 only one SNP can capture the associations for the four other SNPs in this cluster.

5. How many tagSNPs should be typed to capture the genetic architecture of this gene (How many bins are there)? Which is the largest bin of SNPs, and how many SNPs are in this bin? Are there alternative tagSNPs for this bin? In this bin do any of the tagSNPs have predicted function? What is its function?

6. How many bins contain only a single SNP (singleton bins) and how many contain multiple SNPs?

7. How many TagSNPs (bins) are needed to capture LD across this gene with a 10% allele frequency cut-off? Go back to the Genome Variation Server page and, in the filtering SNPs section, change the allele frequency cut-off from 5 to 10. Display tagSNPs with this cut-off and click on ‘open table display of tagSNPs’. How many TagSNPs (bins) are needed to capture the SNP patterns in this gene now?

8. Of the singleton bins (a SNP that tags itself), which SNP has the largest minor allele frequency? Remember this SNP would have to be typed to capture the genetic architecture of this gene – most of these are not captured in the current genome-wide formats. So they will miss underlying associations. The current formats capture the largest bins.
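The binning behavior described in Part 2 (bins, tagSNPs within a bin, and singleton bins) can be sketched in a few lines of Python. This is a simplified greedy procedure in the spirit of LDSelect, not the GVS implementation: it assumes a precomputed table of pairwise r2 values and ignores allele frequency filtering and missing data. The function name, the SNP labels, and the r2 values are hypothetical.

```python
def ld_bins(snps, r2, threshold=0.80):
    """Greedy binning. r2 maps frozenset({snp1, snp2}) to a pairwise r^2 value."""
    def linked(a, b):
        return a == b or r2.get(frozenset((a, b)), 0.0) > threshold

    remaining = list(snps)
    bins = []
    while remaining:
        # Seed the bin with the SNP linked to the most unbinned SNPs.
        seed = max(remaining, key=lambda s: sum(linked(s, t) for t in remaining))
        members = [t for t in remaining if linked(seed, t)]
        # A SNP can tag the bin only if it is linked to every member.
        tags = [s for s in members if all(linked(s, t) for t in members)]
        bins.append({"tags": tags, "members": members})
        remaining = [t for t in remaining if t not in members]
    return bins


# Hypothetical r^2 table for four SNPs.
r2 = {
    frozenset(("s1", "s2")): 0.9,
    frozenset(("s2", "s3")): 0.9,
    frozenset(("s1", "s3")): 0.5,
}
for b in ld_bins(["s1", "s2", "s3", "s4"], r2):
    print(b)
```

In this toy example the first bin holds s1, s2, and s3, but only s2 can serve as its tagSNP, because s1 and s3 are not strongly associated with each other. That is the same situation as a bin in which only one SNP captures the others. The SNP s4 forms a singleton bin and would have to be typed directly.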

Part 3. Merging populations using GVS

1. Close the current tagSNP windows and return to the GVS page for CASP8, and, in addition to EGP-CEPH-Panel, also select the EGP-AD-Panel (African American Panel) and the EGP-YORUB- Panel.

2. Scroll down to ‘Merge Samples and Variations’ (under ‘Merging Data Sets’) and toggle to option C, ‘combined samples with combined variations.’ Keep the 10% allele frequency cutoff. Choose ‘display genotypes’ and open the graphical display. How many SNPs are in the combined variation set for these samples? How many samples altogether? (Remember the number of SNPs and samples are below the image and Gene Name.)

3. Close the graphical display and select data type window and return to the GVS page and click ‘display TagSNPs.’ Open the table display of TagSNPs. TagSNPs across multiple populations are chosen by implementing a Multipop version of LD-Select, as described in Howie et al. (Hum Genet. 120: 58-68, 2006 - you should be able to download the .pdf file for free, so yell if you can’t!). The table you have opened contains the information for the two combined populations, including the tagSNP, Function, Unique/Repeat, European Bin Represented, and African Bin Represented. For example, the first tagSNP is rs13402616. It is in the intron of CASP8. It is in a unique sequence, represents African bin 11, and does not have representation in Europeans (i.e., it is an African specific SNP). How many TagSNPs are needed to capture the information in this gene for both populations? (Hint look at the line right above the table.) How many of these TagSNPs capture information from both populations? How many population specific TagSNPs are there? Notice some of the SNPs are in parentheses – these represent alternatives to capture the same information – function and uniqueness of the sequence determine which tagSNP is listed first. Some of the tagSNPs are in brackets. These represent tagSNPs with genotype coverage below the threshold. These tagSNPs could be eliminated but for completeness are included.

4. You can also get a text version of this information. Change ‘Display SNP by’ to ‘text’ and click ‘display TagSNPs.’ There is also a custom format option that allows you to obtain frequency, function, conservation, and flanking sequence to help you order your SNP genotyping assays. Try the custom format. You also have the option of sending in all the alternative tagSNPs for genotyping design. Once the company has scored them for ease of genotyping, you can decide which one to pick as the tagSNP if needed. This is a very useful feature.

Part 4. TagSNP Selection (LDSelect) in NIEHS SNPs

1. Starting at the NIEHS SNPs home page (), under “Gene Targets” (within left-hand navigation menu), click on ‘A-Z Finished Genes Directory.’

2. Choose ‘A’ and then ‘ADH1C’ to access the gene page data for ADH1C.

3. To find the tagSNPs for ADH1C, scroll down the page to the ‘LD Linkage Data’ section.

4. Click on a population for tagSNPs specifically chosen for that population. TagSNPs were chosen for each population from all SNPs regardless of minor allele frequency using the algorithm LDSelect at the default r2 > 0.64.

5. How many tagSNPs (bins) are required to type ADH1C in the European-descent populations? How many tagSNPs must be genotyped directly because they are not contained within a bin with another SNP?

6. Which population sample requires more tagSNPs to represent ADH1C: African-descent or Asian-descent?

Part 5. Using Visual Haplotype (VH1) for Haplotype Analysis in NIEHS SNPs

1. Go to

2. Click on “Visual Haplotypes” in the left-hand navigation menu. This software will display haplotypes. Haplotypes represent the alleles of each SNP assigned to an individual’s chromosomes. Each individual has two chromosomes representing the maternal and paternal chromosomes inherited from his or her parents. The visual haplotype will be twice as long as the visual genotype because now each individual is represented by two rows of data (haplotypes) instead of just one row of data (genotypes). NOTE: Be aware that a few of the genes re-sequenced by NIEHS SNPs are X-linked (males have one X chromosome [haplotype] and females have two X chromosomes).

3. Using the pull-down menu for ‘EGP Finished Gene Phasebase Input File,’ choose the gene FEN1 re-sequenced by NIEHS SNPs.

4. To focus on common genetic variation, we suggest you filter by minor allele frequency for common SNPs. Enter 5 in ‘Rare Allele Percentage (integer, 0 to 50).’

5. To identify the number of haplotypes in your population sample, sort by sample. At ‘Cluster By:’ choose ‘SAMPLE.’

6. Click on ‘Run VH1 on the Web!’

7. You should have an image of the haplotypes in a pop-up window. The numbers at the top of the image represent the SNPs (numbered along a reference sequence used in re-sequencing the gene). The SNPs here are sorted according to samples with the same haplotype. The numbers on the side of the image represent the sample ID. Each square represents an individual sample’s allele: common (blue) and rare (yellow) allele. Each row represents the individual sample’s haplotype, and each individual will have two rows representing the two chromosomes. You can identify the number of common haplotypes manually using VH1.

8. How many haplotypes do you have? (Scroll down – also look at names of samples; hint there are 2 haplotypes for each sample) How many haplotype tagSNPs would you genotype to resolve all common haplotypes?
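The manual counting in step 8 can be mirrored in a short Python sketch. It assumes each haplotype row from the VH1 image has been transcribed as a string, with "0" for the common allele (blue) and "1" for the rare allele (yellow). The function names and the example rows are hypothetical, and the brute-force search over site subsets is only practical for a small number of SNPs.

```python
from collections import Counter
from itertools import combinations


def haplotype_counts(rows):
    """Count how many chromosomes carry each distinct haplotype."""
    return Counter(rows).most_common()


def minimal_tag_sites(rows):
    """Smallest set of site indices that tells all distinct haplotypes apart."""
    distinct = sorted(set(rows))
    n_sites = len(distinct[0])
    for k in range(n_sites + 1):
        for sites in combinations(range(n_sites), k):
            projected = {tuple(h[s] for s in sites) for h in distinct}
            if len(projected) == len(distinct):
                return sites
    return tuple(range(n_sites))


# Hypothetical rows: two chromosomes per sample, three SNPs each.
rows = ["000", "000", "010", "101"]
print(haplotype_counts(rows))
print(minimal_tag_sites(rows))
```

In this toy example there are three distinct haplotypes. No single site separates all three, but the first two sites together do, so two haplotype tagSNPs would be enough.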

Part 6. Where to Find Haplotypes in NIEHS SNPs

1. In addition to VH1, we offer PHASEv2.0 output for each of the genes re-sequenced on the NIEHS SNPs website. Under ‘Gene Targets’ in the left-hand navigation menu, click on ‘A-Z Finished Genes Directory.’

2. Choose ‘F’ and then FEN1.

3. PHASE output is found in the ‘Haplotyping Data’ section of the gene’s web page.

4. Let’s look at the Phase Output file. There are three SNPs listed under ‘Begin List Summary.’ How many haplotypes? What bases are in the haplotypes for each SNP site? What is the least frequent haplotype?

Part 7. Haplotypes in GeneSNPs

1. The GeneSNPs resource at the University of Utah () is linked off of all NIEHS SNPs gene data pages. Navigate to GeneSNPs and click on ‘Open Query’ at the top of the page, enter ADH1C in the Symbol box, then click on the search button.

2. When the entry for the ‘ADH1C’ gene appears, select the link in the ‘Gene Models’ column – ‘UCSC:hg16:4’

3. From the rainbow-colored ‘NIEHS’ drop-down menu select ‘Haplotypes.’ A list of pre-computed PHASE haplotypes with different minor allele frequency cutoffs and exonic SNPs are shown for each population.

4. Scroll down the haplotype sections. The EGP95 has been split into 5 populations: EGP95_AA (African-American), EGP95_AS (Asians), EGP95_AY (Yorubans), PDR95_EU (European-Americans - CEPH), and PDR95_HI (Hispanics). Go to the PDR95_EU and find the row with PHASE output for haplotypes with a 0.10 cutoff and nonsynonymous SNPs not included (nonsyn=NO), then click on ‘view’ in that row.

5. A new window will appear with a visual representation of the haplotypes in this gene for this population and for each polymorphic site (across the top). The haplotypes are color-coded, red for the major allele and yellow for the minor allele. How many different haplotypes are shown? The number of chromosomes carrying these haplotypes can be found to the far left. How many haplotypes are found three or more times in this population? Twenty-two CEPH samples are sequenced in the PDR95. So what should the haplotype count total to? Does it?

Part 8. Batch GVS

GVS also provides ways to batch large jobs. These are the jobs where you wouldn’t want to enter each query independently. For example, you can provide a list of genes, such as an entire pathway of genes, for which you are interested in picking tagSNPs.

1. Go to GVS Batch, or go to the GVS front page, where there is a link to GVS Batch on the left-hand side. Just click and you are there.

2. Under the ‘Input List File’ – There is a link to the information page for GVS Batch. GVS batch can do all of the queries that interactive GVS can do. Click on the link to the information page and read the introduction. You will then move onto the parameters that can be requested.

3. If you move down the page you will see examples that you can download. Go to example #6 and download this onto your desktop. Then go back to the front page of GVS Batch. Browse to find example6.txt and upload it. Enter your e-mail address in the box and then click submit. The Server will send you e-mail when the job is finished. And you can download your file.

4. Open the file you downloaded and examine - the first part of the file is just a repeat of the query. After that is the tagSNP output for the ABO gene. How many tagSNPs (bins) are required to type ABO? How many for VKORC1?

5. You can try lots of other simple modifications of this one file like changing the population from 596, which is our PGA-CEPH samples to population 1409, the HapMap CEU samples.

Here are some helpful population identifiers:

population_id | class       | handle          | local_population_id
--------------+-------------+-----------------+---------------------
1409          | CSHL-HAPMAP | HapMap-CEU      | EUROPE
1410          | CSHL-HAPMAP | HapMap-HCB      | EAST ASIA
1411          | CSHL-HAPMAP | HapMap-JPT      | EAST ASIA
1412          | CSHL-HAPMAP | HapMap-YRI      | WEST AFRICA
1471          | EGP_SNPS    | EGP_YORUB-PANEL |
1472          | EGP_SNPS    | EGP_HISP-PANEL  |
1473          | EGP_SNPS    | EGP_CEPH-PANEL  |
1474          | EGP_SNPS    | EGP_AD-PANEL    |
1475          | EGP_SNPS    | EGP_ASIAN-PANEL |

You can find the list of all populations at
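If you prepare several batch files, it can help to keep the population identifiers in one place. The sketch below is a small Python lookup built only from the identifiers listed in this tutorial; the function name is hypothetical, and it does not read or write the GVS Batch file format.

```python
# population_id -> (class, handle, local_population_id), as listed in this tutorial.
POPULATIONS = {
    1409: ("CSHL-HAPMAP", "HapMap-CEU", "EUROPE"),
    1410: ("CSHL-HAPMAP", "HapMap-HCB", "EAST ASIA"),
    1411: ("CSHL-HAPMAP", "HapMap-JPT", "EAST ASIA"),
    1412: ("CSHL-HAPMAP", "HapMap-YRI", "WEST AFRICA"),
    1471: ("EGP_SNPS", "EGP_YORUB-PANEL", None),
    1472: ("EGP_SNPS", "EGP_HISP-PANEL", None),
    1473: ("EGP_SNPS", "EGP_CEPH-PANEL", None),
    1474: ("EGP_SNPS", "EGP_AD-PANEL", None),
    1475: ("EGP_SNPS", "EGP_ASIAN-PANEL", None),
}


def population_id(handle):
    """Return the population_id for a handle such as 'HapMap-CEU'."""
    for pop_id, (_cls, pop_handle, _local) in POPULATIONS.items():
        if pop_handle == handle:
            return pop_id
    raise KeyError(handle)


print(population_id("HapMap-CEU"))
```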



Haploview

Part 9. Using HapMap Data in Haploview

1. Download and install Haploview from

Install Haploview for your computer’s operating system as directed on the download page. You must also have Java installed on your computer; there is a link to a site to download Java as well. If Haploview does not open after installation, you most likely do not have Java installed. Install Java and try again. Once Haploview is open, you will see two windows, one labeled Haploview 4.0 and one that says Welcome to Haploview. The Welcome to Haploview window is where you enter the data to work with in Haploview. Click the down arrow on the left and select HapMap Download. You need to be connected to the internet. Leave the release at 21 (that is the current release for HapMap data), select chromosome 17, and enter 23105 for the start and 23154 for the end. Click “OK.”

2. The first view of the data is the “Check Markers” window. This provides a nice summary of the marker data, including the name of the markers, genomic position of the markers, observed heterozygosity, predicted heterozygosity, Hardy Weinberg, % of samples successfully genotyped, the number of fully genotyped family trios for each marker, the number of Mendelian inheritance errors, minor allele frequency, and pass/fail quality control for each marker.

3. Haploview offers a graphical view of the LD statistic. Click on the “LD Plot” tab. To view the entire image at once, go to Display at the top of the window, then to LD Zoom, and choose unzoomed. You can change the haplotype block definition by going to “Analysis” and selecting the block definition. The default is the block definition by Gabriel et al. in Science (2002). For this exercise, click on “Analysis,” then define blocks, and choose the “four-gamete rule.”

4. How many blocks are in NOS2A for the CEU population using the four gamete rule? (You can also select Haplotypes to get the answer as well)

5. If you didn’t go to Haplotypes yet go there now. How many haplotypes were identified in Block 2? How many haplotype tagging SNPs were identified? (Go to display options and select ‘Show tags in blocks’. TagSNPs are indicated by triangles above the genotype.)

6. For the minimal set of tagSNPs, go to the “Tagger” tab. You can choose the algorithm used to define the tagSNPs. For this example, choose “pairwise tagging only” (This option is just an implementation of LD-Select which is noted in the Tagger documentation). Then click “Run Tagger.” The results are displayed so that the tagSNPs are shown in one window, the SNPs being tagged in the other window. Using “Tagger” and “pair-wise tagging only,” how many Haploview tagSNPs are in NOS2A for the CEU HapMap data? Does this change if you use the aggressive tagging approach (with combinations of 2 or 2 and 3 markers)?
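The idea behind the four-gamete rule used in steps 3 and 4 can be sketched in Python. For two biallelic SNPs there are four possible two-SNP haplotypes (gametes); seeing all four suggests that recombination has occurred between the SNPs. The sketch below assumes phased haplotype strings and a minimum gamete frequency of 0.01, and it grows blocks with a simple left-to-right scan. Both the cutoff and the scan are assumptions made for illustration and are not taken from Haploview; the function names and data are hypothetical.

```python
from collections import Counter


def shows_four_gametes(haplotypes, i, j, min_freq=0.01):
    """True if all four two-SNP gametes occur at sites i and j."""
    n = len(haplotypes)
    counts = Counter((h[i], h[j]) for h in haplotypes)
    observed = [g for g, c in counts.items() if c / n >= min_freq]
    return len(observed) >= 4


def four_gamete_blocks(haplotypes, min_freq=0.01):
    """Group consecutive sites into blocks with no four-gamete pair inside."""
    n_sites = len(haplotypes[0])
    blocks, current = [], [0]
    for s in range(1, n_sites):
        if any(shows_four_gametes(haplotypes, t, s, min_freq) for t in current):
            blocks.append(current)  # evidence of recombination: close the block
            current = [s]
        else:
            current.append(s)
    blocks.append(current)
    return blocks


# Hypothetical haplotypes over three SNPs ("0" = common, "1" = rare allele).
haps = ["000", "010", "110", "001", "111"]
print(four_gamete_blocks(haps))
```

In this toy example the first two SNPs show only three gametes and stay in one block, while the third SNP shows all four gametes with the first SNP and therefore starts a new block.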

Part 10. Using EGP Data in Haploview

1. Download the EGP data for NOS2A for this exercise from GVS. Go to the main GVS page and select gene name. Enter NOS2A; don’t bother with upstream or downstream, as the gene region is enough as is. In Select Data Set(s), deselect EGP-YORUB-PANEL and select EGP-CEPH-PANEL. Go to the set-up parameters, select Custom-Text under Display SNPs By, and then click on display genotypes. You can now select the format you want. Since you need two files for Haploview (the Haploview genotypes and the Haploview marker info), the simplest thing to do is to select ‘Download a tarball with all formats.’ Submit and download your data. You can also go to and download GVS.haploviewGenotypes.NOS2A.txt and GVS.haploviewMarkers.NOS2A.txt

2. Open Haploview. If Haploview is already open from the previous exercise, under “File,” choose “Open new data.” This time, stay in the Linkage Format tab and browse to the files in the folder you downloaded from GVS. For the Data File, load the Genotypes file, and for the Locus Information File, load the Markers file. Click “OK.”

3. Repeat steps 3 and 4 from “Using HapMap Data in Haploview.” Note the difference between a complete set of common variation data (EGP) and common set of sampled variation data (HapMap). There are significant differences.

4. How many blocks are in NOS2A for the CEU population using the four gamete rule? (You can also select Haplotypes to get the answer as well)

5. How many haplotypes were identified in Block 2? How many haplotype tagging SNPs were identified? (Go to display options and select ‘Show tags in blocks’. TagSNPs are indicated by triangles above the genotype.)

6. How many tagSNPs are identified using pair-wise tagging only in “Tagger” with the EGP data? Does the 2-marker or the 2- and 3-marker option save you many tagSNPs? Also try the 2-marker or the 2- and 3-marker option more than once. Does it give you the same number of SNPs as before? With incomplete data sets you can get slightly different answers. Think about the influence of aggressive tagging on developing marker sets.
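To check the allele frequency cutoff outside of Haploview, the minor allele frequency of each marker can be computed from a linkage-format genotype file. The Python sketch below assumes the usual linkage pedigree layout (six leading columns, then two alleles per marker, with 0 marking a missing allele) and biallelic markers; verify this against the file you downloaded before relying on it. The function name and the example lines are hypothetical, and Haploview’s own Check Markers values may differ (for example, if it restricts the calculation to unrelated individuals).

```python
from collections import Counter


def minor_allele_frequencies(ped_lines):
    """Per-marker minor allele frequency from linkage-format pedigree lines."""
    counts = None
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        alleles = fields[6:]  # assumed: six pedigree columns come first
        n_markers = len(alleles) // 2
        if counts is None:
            counts = [Counter() for _ in range(n_markers)]
        for m in range(n_markers):
            for allele in alleles[2 * m:2 * m + 2]:
                if allele != "0":  # assumed: 0 codes a missing allele
                    counts[m][allele] += 1
    mafs = []
    for c in counts or []:
        total = sum(c.values())
        if total == 0 or len(c) < 2:
            mafs.append(0.0)  # no data, or monomorphic in this sample
        else:
            mafs.append(min(c.values()) / total)
    return mafs


# Hypothetical file contents: three samples typed at two markers.
lines = [
    "F1 S1 0 0 1 0 1 1 2 2",
    "F2 S2 0 0 2 0 1 3 2 4",
    "F3 S3 0 0 1 0 0 0 4 4",
]
print(minor_allele_frequencies(lines))
```

Markers whose value falls below your cutoff (for example, 0.05 for a 5% cutoff) would be the ones removed by the allele frequency filter.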

Answers to Questions

Part 1.

5. SNPs in LD with rs1219648. Ans: rs2981575, rs3135718, rs11200014, rs11379664, rs41302265

9. The Yoruban samples had more sites and a few more small LD blocks but not much more LD over the gene.

10. 14 SNPs are in LD with rs1219648: rs11379664, rs10736303, rs11200014, rs41302265, rs2912780, rs2981579, rs1078806, rs2981578, rs2981575, rs2912774, rs2936870, rs2860197, rs2981582, rs3135718. These are all intronic.

Part 2.

2. 48 SNPs across 22 samples

5. 23 tagSNPs, Bin1, 6 SNPs

Yes, leads to an amino acid substitution in the coding region – a nonsynonymous SNP – rs13006529

6. 11 bins are singletons, 12 bins have multiple SNPs

7. 10 tagSNPs

8. Bin 8 – rs1035140 has a 50% minor allele frequency, Bin 9 and 10 are common SNPs as well and would also need to be typed.

Part 3.

2. 70 SNPs, 49 Samples

3. 39 tagSNPs, 12 tags SNPs capture genetic information in both populations, 27 tagSNPs capture genetic information in only one population.

Part 4.

5. 15 bins or tagSNPs: eight bins with >1 SNP; seven “bins” with only one SNP.

6. African-descent (23 tagSNPs). Asian-descent requires 7 tagSNPs. This gene does follow expectations for the populations being examined: African, the most tags (least LD); European, fewer tagSNPs compared to Africans (more LD); and Asian, fewer tagSNPs than Europeans (and more LD). Also note that there is no frequency cut-off here, so tags for low-frequency SNPs are also included. To filter, use GVS!

Part 5.

8. How many haplotypes? 3

How many haplotype tagSNPs would you genotype to resolve all common haplotypes? 2 – the SNP at 1175 captures one haplotype and either 995 or 5213 SNP captures the other haplotype.

Part 6.

4. 3, GCG, GTG, ACT, Haplotype 2 or GTG.

Part 7.

5. 10, 5, 44, Yes

Part 8.

4. 15 for ABO, 4 for VKORC1

Part 9.

4. 8 Blocks

5. 5 Haplotypes in Block 2; 4 tagSNPs for Block 2

6. Pairwise – 20 tagSNPs capture 42 SNPs, or aggressive tagging using 2 –marker – 14 to 16 tagSNPs capture 42 SNPs, or 2- and 3-marker – 14 to 17 tagSNPs to capture 42 SNPs.

Part 10.

4. 22 blocks

5. 3 haplotypes in Block2, 2 tagSNPs for block 2

6. Pairwise - 51 tagSNPs other options give you 50 tagSNPs not a significant decrease.
