BQ2012 Principal Component Analysis - BioQUEST

Principal Component Analysis

Learning Objectives

After completion of this module, the student will be able to

describe principal component analysis (PCA) in geometric terms

interpret visual representations of PCA: scree plot and biplot

apply PCA to a small data set

research the application of PCA in different knowledge domains

design a research project using microarray data and analysis tools on REMBRANDT

Concepts

Big data

Dimensionality reduction

Principal component analysis (PCA)

Knowledge and Skills

Excel skills: conditional formatting, linear regression, scatter plot, functions

Scree plot, biplot

Coordinate representation of points

Prerequisites

Familiarity with Excel: copy, paste, graphing, sorting

Scatterplot

Correlation

Linear regression

Supporting Articles and Data Sets

Kho, A.T., Q. Zhao, Z. Cai, A.J. Butte, J.Y.H. Kim, S.L. Pomeroy, D.H. Rowitch, and I.S. Kohane. 2004. Conserved mechanisms across development and tumorigenesis revealed by a mouse development perspective of human cancers. Genes & Development 18: 629-640. doi: 10.1101/gad.1182504.

Accessed on the web on May 27, 2012:

Citation: Neuhauser, C. Principal Component Analysis. Created: May 27, 2012 Revisions: Copyright: © 2012 Neuhauser. This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial Share Alike License, which permits unrestricted use, distribution, and reproduction in any medium, and allows others to translate, make remixes, and produce new stories based on this work, provided the original author and source are credited and the new work will carry the same license.


Big Data, Heat Maps, Scatter Plots, and Data Clouds

In this module, we will look at a study by Kho et al. 2004. The abstract of the paper begins with the sentence: "Identification of common mechanisms underlying organ development and primary tumor formation should yield new insights into tumor biology and facilitate the generation of relevant cancer models." (Kho et al. 2004) Their study focuses on a childhood cancer, medulloblastoma, which is a cancer of the central nervous system. As part of the study, the research group analyzed microarray expression data of mouse cerebella during postnatal days 1-60 to identify the genes that were expressed early versus late during development.

This module uses the data in Supplemental Table 1 (MS Excel), which is available at

The complete data set is in the accompanying worksheet. The raw data is in the first sheet, labeled "Raw Data." Except for adding an identifier for each row in Column A, the sheet contains the data from the Supplemental Table 1 (Kho et al. 2004; see above for link). The sheet is protected to avoid accidental changes in the raw data. However, you can still copy data from that sheet into a new sheet.

The study resulted in a large amount of data. We will start with a preliminary exploration to get a better sense of the data. Table 1 shows the data of the first thirteen of 2552 genes from one of the experiments:

Table 1: Expression data of the first thirteen genes from one of the experiments (Kho et al. 2004). Each column holds the Mouse Cereb Signal values for the sample named in the header.

ID    PN1.b    PN3.b    PN5.b    PN7.b   PN10.b   PN15.b   PN21.b   PN30.b   PN60.b
1     -66.0    132.4     87.0     20.8     21.8    932.7    844.4   1188.2   1422.6
2     124.4    248.3    396.3    305.7    218.7     38.7   -220.0   -288.4   -256.0
3   16176.1  16805.4  12578.4  14833.9  14654.7  18062.6  18909.6  16863.5  16036.2
4    6581.7   5088.4   2971.9   5588.1   5155.3   8632.2  10941.5  14002.9  15068.5
5     354.6    603.9    555.5    223.1    532.9   -293.8   -285.9   -605.4   -951.3
6     401.7    674.7    271.2     82.4    178.3   -152.2    -96.7   -152.6   -403.1
7    2580.3   1509.5   2215.7   1949.7   1835.1   1195.9   2102.1   1863.8   2218.0
8     302.5    680.3    633.2    560.8    631.4    268.0    293.6   -312.8   -799.6
9     414.7    388.5    583.8    564.5    244.5     -4.2    -62.0    228.9    251.6
10   1574.6    881.7   1409.7   1492.1    909.1   1319.4    784.4   1113.8   2326.5
11    719.4   1284.9   1252.4   1519.2   1206.4    824.2   1040.2    932.2   1862.7
12   4925.6   4129.7   8435.9   8006.5   5159.7   4131.6   4692.9   2432.1   3743.0
13    719.6    187.8    593.7    197.6    277.2     15.3   -228.4   1327.5   1546.2


Normalizing the Data and Generating a Heat Map

In the second sheet (called HeatMap), we copied the data set from Columns V to AD of the first sheet. The data come from wild-type mouse cerebella during the first 60 days postnatal, and were profiled using Affymetrix Mu11K arrays at nine time points: P1, P3, P5, P7, P10, P15, P21, P30, and P60, indicating the number of days postnatal.

The first step is to normalize the data. We follow Kho et al. (2004): "each of the 2552 genes was individually normalized to mean zero and variance one across P1-P60." We illustrate this on the first gene (see Table 2). The expression data for the nine different time points are listed in Cells A2:I2. To standardize the data, we need to calculate the mean and the standard deviation of the nine data points. To calculate the mean, we use the Excel function AVERAGE(number1, [number2],...). We enter in Cell J2 the expression

=AVERAGE(A2:I2)

To calculate the standard deviation, we use the Excel function STDEV.S(number1, [number2],...). We enter in Cell K2 the expression

=STDEV.S(A2:I2)

To standardize the values, we use the Excel function STANDARDIZE(x, mean, standard_dev). This function has three arguments: the value we want to standardize, the mean, and the standard deviation. The function returns the normalized value of x, that is,

\[ \text{normalized value} = \frac{x - \text{mean}}{\text{standard\_dev}} \]

To standardize the value in Cell A2, we enter in Cell A6

=STANDARDIZE(A2,$J$2,$K$2)

We obtain the value -0.9877, which is the result of the expression (-66.0-509.3)/582.48.

Note that we used absolute references for the mean and standard deviation (indicated by the $ sign before the column letter and row number, respectively), but relative reference for the value we want to standardize. This allows us to drag the cell A6 across so that the remaining cells B6 to I6 are filled with the standardized data.
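For readers who want to check the spreadsheet result outside Excel, the calculation above can be sketched in Python. This is only an illustration, not part of the original worksheet: the variable names are made up here, and `statistics.stdev` is used because it computes the sample standard deviation, which is what STDEV.S returns.

```python
import statistics

# Raw expression values of the first gene at the nine time points (Row 2 of Table 2).
raw = [-66.0, 132.4, 87.0, 20.8, 21.8, 932.7, 844.4, 1188.2, 1422.6]

mean = statistics.mean(raw)   # plays the role of =AVERAGE(A2:I2)
sd = statistics.stdev(raw)    # sample standard deviation, like =STDEV.S(A2:I2)

# Same arithmetic as STANDARDIZE(x, mean, standard_dev): (x - mean) / standard_dev
standardized = [(x - mean) / sd for x in raw]

print(f"mean = {mean:.1f}, standard deviation = {sd:.2f}")
print([round(z, 4) for z in standardized])
```

After standardization the nine values have mean zero and sample variance one, which is the normalization Kho et al. (2004) describe.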


Table 2: The expression profile of the first gene before (Row 2) and after normalization (Row 6). Column letters and row numbers are those of the spreadsheet. Each header in Columns A to I reads in full "<sample> Mouse Cereb Signal"; Column J holds the mean and Column K the standard deviation. Rows 3 and 4 are empty.

Row         A         B         C         D         E         F         G         H         I         J         K
1       PN1.b     PN3.b     PN5.b     PN7.b    PN10.b    PN15.b    PN21.b    PN30.b    PN60.b      Mean Std. Dev.
2       -66.0     132.4      87.0      20.8      21.8     932.7     844.4    1188.2    1422.6     509.3    582.48
5       PN1.b     PN3.b     PN5.b     PN7.b    PN10.b    PN15.b    PN21.b    PN30.b    PN60.b
6     -0.9877   -0.6471   -0.7250   -0.8387   -0.8370    0.7269    0.5753    1.1655    1.5679

To plot the standardized expression profile over time, we generate a table as below:

Time [Days]         1         3         5         7        10        15        21        30        60
Value        -0.98772   -0.6471  -0.72505   -0.8387  -0.83698  0.726858  0.575264  1.165502  1.567922

We then graph the data as a scatterplot with straight lines and markers (Figure 1):

Figure 1: Scatterplot of normalized expression profile for the first gene in the data set.

We now return to the full data set in the second sheet and standardize the remaining gene expression profiles. (Note: Adjust the rows and columns in the formulas according to where the data are.)

To standardize the data in Columns B through L in the worksheet under the HeatMap tab, calculate the mean and the standard deviation of each gene, and enter the results in Columns M and N, respectively. Use the STANDARDIZE function to normalize the data, and enter the results in Columns P through X.


To visualize the up- versus down-regulated genes, use the Conditional Formatting option that is available in the Styles group of the Home ribbon. Choose a Color Scale that formats cells with high values red and cells with low values blue. Now, sort the data from largest to smallest by the first time point of the normalized data (Column P) using Custom Sort. Make sure you sort all columns so that you can keep track of the genes using the numeric ID in Column B. The coloration indicates which genes tend to be expressed early versus late during development.

Scatter Plots and Correlations

Since the data consist of expression profiles that were taken on days that are close together, we expect that the expression profiles from time point to time point are correlated. We use scatter plots to visualize correlations and calculate the correlation among all pairs of time points. Figure 2 shows an example of a scatter plot where each data point represents the expression of a single gene at time points 5 days (horizontal axis) and 7 days (vertical axis). We see that the data are positively correlated (Figure 2).

Figure 2: Scatterplot of the normalized data from time points 5 and 7.

We can calculate the correlation using the Excel function

=CORREL(array1, array2)

where array1 (similarly, array2) is the range of data for a given time point. For instance, we find that the correlation between time points 5 and 7 of the normalized data is 0.66.
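As a rough cross-check of what CORREL computes, the Pearson correlation can be sketched in Python. The sketch below is an illustration under stated assumptions: it uses only the thirteen genes listed in Table 1 rather than all 2552, so its result should not be expected to equal the 0.66 reported above, and the names `standardize`, `correl`, and `time_point` are invented for this example.

```python
import math
import statistics

DAYS = [1, 3, 5, 7, 10, 15, 21, 30, 60]

# Raw signals of the first thirteen genes (rows of Table 1), one column per time point.
RAW = [
    [-66.0, 132.4, 87.0, 20.8, 21.8, 932.7, 844.4, 1188.2, 1422.6],
    [124.4, 248.3, 396.3, 305.7, 218.7, 38.7, -220.0, -288.4, -256.0],
    [16176.1, 16805.4, 12578.4, 14833.9, 14654.7, 18062.6, 18909.6, 16863.5, 16036.2],
    [6581.7, 5088.4, 2971.9, 5588.1, 5155.3, 8632.2, 10941.5, 14002.9, 15068.5],
    [354.6, 603.9, 555.5, 223.1, 532.9, -293.8, -285.9, -605.4, -951.3],
    [401.7, 674.7, 271.2, 82.4, 178.3, -152.2, -96.7, -152.6, -403.1],
    [2580.3, 1509.5, 2215.7, 1949.7, 1835.1, 1195.9, 2102.1, 1863.8, 2218.0],
    [302.5, 680.3, 633.2, 560.8, 631.4, 268.0, 293.6, -312.8, -799.6],
    [414.7, 388.5, 583.8, 564.5, 244.5, -4.2, -62.0, 228.9, 251.6],
    [1574.6, 881.7, 1409.7, 1492.1, 909.1, 1319.4, 784.4, 1113.8, 2326.5],
    [719.4, 1284.9, 1252.4, 1519.2, 1206.4, 824.2, 1040.2, 932.2, 1862.7],
    [4925.6, 4129.7, 8435.9, 8006.5, 5159.7, 4131.6, 4692.9, 2432.1, 3743.0],
    [719.6, 187.8, 593.7, 197.6, 277.2, 15.3, -228.4, 1327.5, 1546.2],
]

def standardize(row):
    """Normalize one gene's profile to mean zero and sample variance one."""
    m = statistics.mean(row)
    s = statistics.stdev(row)
    return [(v - m) / s for v in row]

def correl(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

normalized = [standardize(row) for row in RAW]

def time_point(day):
    """Column of normalized values (one per gene) for the given day."""
    i = DAYS.index(day)
    return [row[i] for row in normalized]

r_5_7 = correl(time_point(5), time_point(7))
print(f"correlation between days 5 and 7 (13 genes only): {r_5_7:.2f}")
```

Note that the standardization runs along each gene (row), while the correlation is taken between two time points (columns) across genes, which mirrors the layout of the HeatMap sheet.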


Exercise 1: Calculate the correlation between pairs of time points for the normalized data. Find the pairs with the largest positive and largest negative correlation and plot each of these pairs of data as a scatter plot. What property of the scatter plot tells you whether the data are positively versus negatively correlated?

Data Clouds

The data were collected at nine time points during Days 1-60 postnatal. We can think of each time point as a single dimension. To visualize the data as a cloud in the temporal space, we would need a 9-dimensional space, one dimension for each time point. Of course, we cannot draw such a cloud. We are restricted to at most three dimensions when plotting clouds. But even a three-dimensional cloud is not easy to interpret, as illustrated in Figure 3.

Figure 3: Data cloud in three spatial dimensions

In the following, we will learn a tool called Principal Component Analysis, which reduces the dimensionality of the data by rotating the coordinate axes in such a way as to maximize the signal and minimize redundancy in the representation. This will allow us to represent high-dimensional data with fewer dimensions while keeping the most important features of the data. Before we explain this further, we will need to review how to represent points in space.


Representing Points in Space

When we represent a point in 2-dimensional space, we give its coordinates relative to a coordinate system. For instance, the red point in the graph below (Figure 4) has coordinates (x,y) in the blue coordinate system and coordinates (u,v) in the black coordinate system. Since in either coordinate system the axes are orthogonal, we can use the Pythagorean Theorem and find that the squared distance r^2 of the point from the common origin satisfies

\[ r^2 = x^2 + y^2 = u^2 + v^2 \]

Figure 4: Representing points in rectangular coordinate systems
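The invariance of r^2 under a change of orthogonal axes can be checked numerically. The following Python sketch is an illustration only: the point (3, 4) and the 30-degree rotation are hypothetical values chosen for this example (deliberately not those of the exercise that follows), and `to_rotated_axes` is a made-up helper name.

```python
import math

def to_rotated_axes(x, y, theta):
    """Coordinates (u, v) of the point (x, y) relative to axes that are
    rotated counterclockwise by theta radians about the shared origin."""
    u = x * math.cos(theta) + y * math.sin(theta)
    v = -x * math.sin(theta) + y * math.cos(theta)
    return u, v

x, y = 3.0, 4.0             # hypothetical point in the x-y system
theta = math.radians(30.0)  # hypothetical angle between the x-axis and the u-axis

u, v = to_rotated_axes(x, y, theta)

r_squared_xy = x ** 2 + y ** 2
r_squared_uv = u ** 2 + v ** 2
print(f"(u, v) = ({u:.4f}, {v:.4f})")
print(f"x^2 + y^2 = {r_squared_xy:.4f}, u^2 + v^2 = {r_squared_uv:.4f}")
```

The coordinates change with the choice of axes, but the squared distance from the origin does not, which is why a rotation of the axes loses no information about the data.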

Exercise 2: In Figure 4, assume that the coordinates of the red point in the x-y coordinate system are (1,2). Assume that the u-axis goes through the point (2,1) in the x-y coordinate system. What are the coordinates of the red point in the u-v coordinate system?


Choosing a Coordinate System

Look at the left and right panels in Figure 5. In the left panel, the data points are normally distributed with mean 0 and variance 1 and they are uncorrelated. In the right panel, the data points are also normally distributed with mean 0 and variance 1, but this time they are correlated.

Figure 5: The data in the left panel are uncorrelated; the data in the right panel are correlated.

Correlation in data introduces redundancy in the sense that knowing the value of one coordinate allows us to make predictions about the other coordinate, and the higher the correlation, the better our prediction will be.
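The link between correlation and predictability can be illustrated with simulated data. The Python sketch below rests on assumptions that are not in the original module: the points are simulated (not the data behind Figure 5), the correlations 0.0, 0.5, and 0.9 and the sample size are arbitrary choices, and the helper names are invented. It predicts the second coordinate from the first with a least-squares line and reports the mean squared prediction error.

```python
import math
import random
import statistics

def simulate_cloud(rho, n, rng):
    """Draw n points (x, y); both coordinates are standard normal
    and their population correlation is rho."""
    points = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * noise
        points.append((x, y))
    return points

def prediction_error(points):
    """Mean squared error when y is predicted from x by a least-squares line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return statistics.mean(residuals)

rng = random.Random(2012)  # fixed seed so the run is repeatable
errors = {rho: prediction_error(simulate_cloud(rho, 5000, rng))
          for rho in (0.0, 0.5, 0.9)}

for rho, err in errors.items():
    print(f"correlation {rho:.1f}: mean squared prediction error {err:.3f}")
```

For standardized data the error should come out close to 1 - rho^2, so the more strongly the coordinates are correlated, the less new information the second coordinate carries once the first is known.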

Figure 6: The rotated data cloud

In Figure 6, we rotated the data cloud from the right panel of Figure 5 so that the maximum variability is aligned with the horizontal axis. Rotating the data cloud is equivalent to rotating the axes. That is, we could have said that we rotated the axes so that the first axis goes through the cloud where the variability is maximal. To express the data points in the new coordinate system, we proceed as in
