


Chapter 5. Principal Components Analysis.


Introduction

PCA is a method to study the structure of the data, with emphasis on determining the patterns of covariances among variables. Thus, PCA is the study of the structure of the variance-covariance matrix. In practical terms, PCA is a method to identify variables or sets of variables that are highly correlated with each other. The results can be used for multiple purposes:

❖ To construct a new set of variables that are linear combinations of the original variables and that contain exactly the same information as the original variables but that are orthogonal to each other.

❖ To identify patterns of multicollinearity in a data set and use the results to address the collinearity problem in multiple linear regression.

❖ To identify variables or factors underlying the original variables that are responsible for the variation in the data.

❖ To find out the effective number of dimensions over which the data set exhibits variation, with the purpose of reducing the number of dimensions of the problem.

❖ To create a few orthogonal variables that contain most of the information in the data and that simplify the identification of groupings in the observations.

PCA is applied to a single group of variables; there is no distinction between explanatory and response variables. In multiple linear regression (MLR), PCA is applied only to the set of X variables, to study multicollinearity.

1 Example 1

Koenigs et al. (1982) used PCA to identify environmental gradients and to relate vegetation gradients to environmental variables. Forty-eight environmental variables, including soil chemical and physical characteristics, were measured at 40 different sites. In addition, over 17 vegetation variables, including % cover, height, and density, were measured at each site. PCA was applied to both the environmental and the vegetation variables.

The first 3 PC's of the vegetation variables explained 73.3% of all the variation in the data. These 3 dimensions were interpreted as the gradients naturally "perceived" by the plant community, and they were well correlated with environmental variables related to soil moisture.

2 Example 2

Jeffers (J.N.R. Jeffers. 1967. Two case-studies on the application of Principal component analysis. Applied Statistics, 16:225-236) applied PCA to a sample of 40 winged aphids on which 19 different morphological characteristics had been measured. The characteristics measured included body length, body width, forewing length, leg length, length of various antennal segments, number of spiracles, etc. Winged aphids are difficult to identify, so the study used PCA to determine the number of distinct taxa present in the sample. Although PCA is not a formal procedure to define the clusters or groups of observations, it simplifies the data so they can be inspected graphically. The first two PC’s explained 85% of the total variance in the correlation matrix. When the 40 samples were plotted on the first two PC’s they formed 4 major groups distributed in an “interesting” S shape. Although 19 traits were measured, the data only contained information equivalent to 2-3 independent traits.

[pic]

Figure 5-1. Use of principal components to facilitate the identification of taxa of aphids.

Model and concept

PCA does not have any model to be tested, although it is assumed that the variables are linearly related. The analysis can be thought of as looking at the same set of data from a different perspective. The perspective is changed by moving the origin of the coordinate system to the centroid of the data and then rotating the axes.

Given a set of p variables (X1, ..., Xp), PCA calculates a set of p linear combinations of the variables (PC1, ..., PCp) such that:

❖ The total variation in the new set of variables or principal components is the same as in the original variables.

❖ The first PC contains as much variance as possible, i.e. as much variance as can be captured along a single axis.

❖ The second PC is orthogonal to the first one (their correlation is 0), and contains as much of the remaining variance as possible.

❖ The third PC is orthogonal to all previous PC's and contains as much of the remaining variance as possible.

❖ Etc.

This procedure is achieved by calculating a matrix of coefficients whose columns are called eigenvectors of the variance-covariance or of the correlation matrix of the data set. Some basic consequences of the procedure are that:

❖ All original variables are involved in the calculation of PC scores (i.e. the location of each observation in the new set of axes formed by the PC's).

❖ The sum of variances of the PC's equals the sum of the variances of the original variables when PCA is based on the variance-covariance matrix, or the sum of the variances of the standardized variables when PCA is based on the correlation matrix.

❖ There are p eigenvalues (p=number of variables in the data), each one associated with one eigenvector and a PC. These eigenvalues are the variances of the data in each PC. Thus, the sum of eigenvalues based on the variance-covariance matrix is equal to the sum of variances of the original variables.

PCA based on the correlation matrix is equivalent to using PCA based on the variance-covariance of the standardized variables. Because standardized variables have variance=1, the sum of eigenvalues is p, the number of variables.

Assumptions and potential problems

1 Normality

For descriptive PCA no specific distribution is assumed. If the variables have a multivariate normal distribution, the results of the analysis are enhanced and tend to be clearer. Normality can be assessed with the Analyze -> Distribution platform in JMP or with PROC UNIVARIATE in SAS. Transformations can be applied to approach normality, as described in Figure 4.6 of Tabachnick and Fidell (1996). Multivariate normality can be assessed by looking at the pairwise scatterplots: if the variables are normal and linearly related, the data will tend to exhibit multivariate normality. Strict testing of multivariate normality can be achieved by calculating the jackknifed squared Mahalanobis distance for each observation and then testing the hypothesis that its distribution is a χ2 distribution with as many degrees of freedom as variables considered in the PCA. This test is very sensitive, so it is recommended that a very low α be used (e.g. 0.001 or 0.0005).
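For the univariate checks, a minimal SAS sketch is shown below (the data set and variable names are taken from the PROC PRINCOMP code presented later in this chapter; any subset of the 14 variables could be inspected the same way):

/* Univariate normality checks: normality tests, histogram, and normal Q-Q plot */
proc univariate data=spartina normal;
   var ph acid zn nh4;
   histogram / normal;
   qqplot    / normal(mu=est sigma=est);
run;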

Open the file spartina.jmp. In JMP, select the ANALYZE -> MULTIVARIATE platform and include all variables in the Response box; then, click OK.

[pic]

[pic]

Note that the two labels, "Mahalanobis" and "Jackknife," both refer to the same statistical distance D. The difference is that the latter is calculated with the jackknife procedure, whereby D for each observation is computed while holding that observation out of the data used to obtain the variance-covariance matrix.

[pic]

[pic]

For the final step, simply regress D2 on the χ2 quantile. The points should fall along a straight line with slope 1 and intercept 0. In the Spartina example, there is at least one outlier that throws off multivariate normality.
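The same check can be scripted in SAS. The sketch below is one possible approach, not part of the original handout: it uses the fact that the ordinary (non-jackknifed) squared Mahalanobis distance equals the sum of the squared standardized PC scores; the data set names scores and mahal and the variable names dsq and chi_q are arbitrary.

/* Squared Mahalanobis distance via standardized PC scores (ordinary, not jackknifed) */
proc princomp data=spartina std out=scores noprint;
   var h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
run;
data mahal;
   set scores;
   dsq = uss(of prin1-prin14);        /* D-squared = sum of squared standardized scores */
run;
proc sort data=mahal; by dsq; run;
data mahal;
   set mahal nobs=n;
   chi_q = cinv((_n_ - 0.5)/n, 14);   /* expected chi-square quantile, 14 df */
run;
proc reg data=mahal;
   model dsq = chi_q;                 /* slope near 1 and intercept near 0 are expected */
run;
quit;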

2 Linearity

PCA assumes (i.e. only accounts for) linear relationships among variables. Lack of linearity works against the ability to “concentrate” variation in few PC’s. Linearity can be examined by looking at the pairwise scatterplots. If two variables are not linearly related, a transformation can be applied. The variable to be transformed should be carefully selected so as not to disrupt its linear relationships with the rest of the variables.

3 Sample size

A potential problem of PCA is that results are not reliable (i.e. they differ from sample to sample) if sample sizes are small. The problem is not as grave as for Factor analysis, and it diminishes as the variables exhibit higher correlations. Although some textbooks recommend 300 cases or observations, this is probably more appropriate for social studies where many variables cannot be measured directly (e.g., intelligence). In agriculture and biology, scientists routinely use data sets with 30 or more observations and the results pass peer review.

4 Outliers

Multivariate outliers can be a major problem in PCA, because just one or a few observations can completely distort the results. As indicated in the Data Screening topic, multivariate outliers can be identified by testing the jackknifed squared Mahalanobis distance. Transformations can help. If outliers remain after transformations, observations can be considered for deletion, but this has to be fully reported, and it has to be understood that elimination of observations just because they do not fit the rest of the data can have negative implications on the correspondence between sample and population. If sample size is very large, then deletion of a few outliers will not severely restrict the applicability of the results.

Geometry of PCA

The principal components are obtained by rotating a set of orthogonal (perpendicular and independent) axes, pivoting at the centroid of the data. First, the direction that maximizes the variance of the scores (the perpendicular projections of the data points on the moving axes) on the first axis is determined, and that axis (PC1) is then fixed in that position. The rotation continues, with PC1 as the axis of rotation, until the second axis maximizes the variance of the scores, which fixes the position of PC2. The procedure continues until PC p-1 is set. Because of the orthogonality, setting PC p-1 also sets the last PC.

Procedure for analysis

1 JMP procedure.

There are two ways to get principal components in JMP: through the Multivariate platform and through the Graph -> Spinning Plot platform. Both give all the numerical information. The Spinning Plot also displays a biplot, which is described below. When using the Spinning Plot, make sure you select Principal Components and not “std Principal Components”; the latter gives the standardized PC scores.

2 SAS code and output

proc princomp data=spartina out=spartpc;
   var h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
run;

In this example, PCA is done on the correlation matrix, which is equivalent to saying that the PC’s were calculated on the basis of the standardized variables. Using the correlation matrix is the default for PROC PRINCOMP and is the most common choice. Alternatively, the analysis can use the covariance matrix, which just centers the data (i.e. all the principal components go through the centroid of the sample). The rationale for choosing the correlation or the covariance matrix for PCA is discussed below.

Each eigenvalue represents the amount of variation from the original sample that is explained by the corresponding PC. In this example PCA was based on the correlations or standardized variables. Each standardized variable has a mean of zero and a variance of 1. Thus the sum of the variances of the original variables is equal to the number of variables, and the first PC accounts for 4.924/14 or 0.3517 of the total sample variance.

The eigenvectors are vectors of coefficients that can be used to get the values of the projections of each observation on each new axis or PC. The logic behind this is just a change of axes: just as the location of an observation can be expressed as a vector of p dimensions (in the example p=14) where the p dimensions are the measured variables, each observation can be expressed as a vector of p PC values or scores. The scores for all PC's for all observations can be saved into a SAS data set by using the OUT=filename option in PROC PRINCOMP. This option creates a SAS file with all the information contained in the file specified by the DATA= option, plus the scores for all observations in all PC's. In JMP, from the Spinning Plot red triangle, select Save Principal Components.

To further the explanation of the eigenvectors, consider the first observation in the Spartina data set. The score for that observation on PC1 can be calculated by multiplying the standardized value for each variable by the corresponding element of the first column of the matrix of eigenvectors and adding all the terms, as shown in Table 1.

Box 4. Eigenvectors

Eigenvectors

PRIN1 PRIN2 PRIN3 PRIN4 PRIN5 PRIN6 PRIN7

H2S -.163637 0.009086 0.231669 0.689722 0.014386 -.419348 0.300094

SAL -.107894 0.017324 0.605727 -.270389 0.508742 0.010076 0.383770

EH7 -.123813 0.225247 0.458251 0.301313 -.166758 0.596651 -.296867

PH -.408217 -.027467 -.282670 0.081726 0.091618 0.191256 0.056897

ACID 0.411680 -.000362 0.204919 -.165831 -.162713 -.024061 0.117085

P 0.273196 -.111277 -.160543 0.199965 0.747115 -.017903 -.336928

K -.033446 0.487887 -.022907 0.043000 -.061998 -.016587 -.067421

CA -.358562 -.180445 -.206595 -.054385 0.206152 0.427579 0.104949

MG 0.079033 0.498653 -.049515 -.036561 0.103793 0.034182 -.044195

NA -.017130 0.470439 0.050575 -.054358 0.239519 -.060440 -.181661

MN 0.277082 -.182164 0.019849 0.483078 0.038899 0.299511 0.124567

ZN 0.404195 0.088823 -.176373 0.150047 -.007768 0.034351 -.072907

CU -.010788 0.391707 -.376740 0.102023 0.063434 0.077993 0.562581

NH4 0.398754 -.025968 -.010607 -.104087 -.005857 0.381686 0.395252

PRIN8 PRIN9 PRIN10 PRIN11 PRIN12 PRIN13 PRIN14

H2S -.073755 0.168302 0.295840 0.222927 -.015407 0.006864 -.079812

SAL 0.100873 -.175066 -.227621 0.088425 -.156210 -.094878 0.089376

EH7 -.312742 -.226136 0.083754 -.023086 0.055421 -.033492 -.023123

PH -.029538 0.023918 0.146959 0.041662 -.331152 0.025938 0.750134

ACID -.152610 0.095416 0.101118 0.344782 0.455459 0.351392 0.477337

P -.398662 0.077828 -.017685 -.034542 0.064822 0.065467 0.014741

K -.115096 0.559085 -.555004 0.217893 -.030301 -.249524 0.072785

CA 0.185889 0.186412 0.073763 0.511310 0.346574 0.079545 -.307040

MG 0.170996 -.011293 0.111582 0.118799 -.397791 0.690127 -.192283

NA 0.449939 0.088170 0.439200 -.216233 0.363391 -.276211 0.143663

MN 0.531706 0.086117 -.361647 -.269913 0.077826 0.172893 0.140813

ZN 0.208525 -.439455 0.014406 0.568635 -.222750 -.396331 0.041311

CU -.277074 -.376706 -.129195 -.192872 0.305087 -.000372 -.043094

NH4 -.145025 0.420100 0.393717 -.130247 -.301510 -.230796 -.117317

Table 1. Example showing how to calculate the PC1 score for observation 1. The values of the original variables are standardized because this PCA was performed on the correlation matrix.

PC11 = -0.164*[-610.00-(-601.78)]/30.70
     + -0.108*[33.00-(30.27)]/3.72
     + -0.124*[-290.00-(-314.40)]/36.96
     + -0.408*[5.00-(4.60)]/1.25
     +  0.412*[2.34-(3.86)]/2.51
     +  0.273*[20.24-(32.30)]/27.59
     + -0.033*[1441.67-(797.62)]/297.60
     + -0.359*[2150.00-(2365.32)]/1718.33
     +  0.079*[5169.05-(3075.11)]/939.41
     + -0.017*[35184.50-(16596.71)]/6882.42
     +  0.277*[14.29-(38.10)]/24.48
     +  0.404*[16.45-(17.88)]/8.28
     + -0.011*[5.02-(3.99)]/1.04
     +  0.399*[59.52-(87.46)]/47.27
     = -1.097

In matrix notation the calculation of PC scores is straightforward: simply multiply the n x p matrix of standardized data Z (standardized data matrix) by the p x p matrix of eigenvectors V to obtain the n x p matrix of scores W:

W = Z V

The calculations in matrix notation are illustrated in the file HOW03.xls.
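The same multiplication can be sketched in SAS/IML (an illustration, not part of the original handout; it assumes SAS/IML 9.22 or later for the MEAN, STD, and CORR functions, and the matrix names Z, V, and W follow the notation above):

proc iml;
   use spartina;
   read all var {h2s sal eh7 ph acid p k ca mg na mn zn cu nh4} into X;
   close spartina;
   n = nrow(X);
   Z = (X - repeat(mean(X), n, 1)) / repeat(std(X), n, 1);  /* standardized data matrix */
   R = corr(X);
   call eigen(lambda, V, R);   /* eigenvalues and eigenvectors of the correlation matrix */
   W = Z * V;                  /* n x p matrix of PC scores */
   /* note: eigenvector signs are arbitrary, so some columns of W may be flipped */
   /* in sign relative to the PROC PRINCOMP output                              */
   print (W[1,1])[label="PC1 score for observation 1"];
quit;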

3 Loadings

"Loadings" are the correlations between each one of the original variables and each one of the principal components. Therefore, there are as many loadings as coefficients in the matrix of eigenvectors. Loadings can be used to interpret the results of the PCA, because a high loading for a variable in a PC indicates that the PC has a strong common component or relationship with the variable. Loadings are interpreted by looking at the set of loadings for each PC and identifying groups that are large and negative, and groups that are large and positive. This is then used to interpret the PC as being an underlying factor that reflects increases in variables with positive loadings, and decreases in variables with negative loadings.

Loadings can be calculated as a function of the eigenvectors and standard deviations of PC's and original variables, or they can be calculated directly by saving the PC scores and correlating them with the original variables.

loading(Xi, PCj) = r(Xi, PCj) = vij * sqrt(λj) / s(Xi), where vij is the eigenvector coefficient for variable Xi in PCj, λj is the eigenvalue of PCj, and s(Xi) is the standard deviation of Xi (equal to 1 when the PCA is based on the correlation matrix).

In the Spartina example, the loading for H2S in PC1 is -0.163637*sqrt(4.92391) = -0.3631. The loading for Mg in PC2 is 0.498653*sqrt(3.69523) = 0.9586.
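A sketch of the second route, using the spartpc data set created by the OUT= option in the earlier PROC PRINCOMP call (the WITH statement limits the output to the first two components):

/* Loadings as correlations between the saved PC scores and the original variables */
proc corr data=spartpc noprob nosimple;
   var h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
   with prin1 prin2;
run;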

4 Interpretation of results

Interpretation of the results depends on the main goal for the PCA. We will consider two main types of goals:

1. Reduction of dimensionality of data set and/or identification of underlying factors.

2. Analysis of collinearity among X variables in a regression problem.

1 Identification of fewer underlying factors

In the first case, the interpretation depends on whether the analysis identifies a clear subset of PC's that explain a large proportion of all of the variance. When this is the case, it is possible to try to interpret the most important PC's as underlying factors.

In the Spartina example, the first two principal components represent two systems of variables that tend to vary together. PC1 is associated with pH, buffer acid, NH4, and Ca, and appears to represent a gradient of acidity-alkalinity. Similar interpretations can be reached for the next two principal components. The interpretation is simplified by using Gabriel's Biplots.

A Gabriel's biplot has two components (hence the name biplot) plotted on the same set of axes, represented by a pair of PC's (typically PC1 vs. PC2): a scatterplot of observations, and a set of vectors that represent the loadings or correlations of each original variable with each PC in the biplot. The first element is obtained by plotting each observation as a dot at its PC1 and PC2 scores. The second element is obtained by drawing, for each variable, a vector from the origin (0,0) to the point (rxpc1, rxpc2), where rxpc1 and rxpc2 are the loadings of the variable X with PC1 and PC2, respectively. Therefore, the Gabriel's biplot has as many points as observations, and as many vectors as variables in the data set. For the Spartina example, there are 45 points and 14 vectors (Figure 1). Groups of vectors that point in about the same (or directly opposite) directions indicate variables that tend to change together.

Note that JMP draws the “rays” or vectors in the Spinning Plot by linking the origin to the PC scores or coordinates of a fictitious point that is at the average of all original variables except for the one it represents, for which it has a value equal to 3 standard deviations. This facilitates viewing the rays and scatter of points together, and it preserves the interpretability of the relative lengths of the vectors, but the vectors no longer have a total length of 1 (over all PC’s). I graphed the loadings for the Spartina example in the HW03.xls file to illustrate this point, with diamonds marking what would be the tips of the rays in a biplot.

Why do the vectors have a length equal to 1.0? Think of the vector for one of the variables, say pH, and imagine it poking through the 14-dimensional space formed by the principal axes or components. The length of the vector is the length of the hypotenuse of a right triangle. By applying the Pythagorean theorem several times, one can calculate the squared length of the vector, which is the sum of the squares of the 14 coordinates. Recall that each coordinate is the correlation between the corresponding PC and pH, because of the way the vector for pH was constructed. Therefore, the square of each coordinate is the R2, or proportion of the variance in pH explained by each PC. Since the PC’s are orthogonal to each other and together they contain all of the variance in the sample, no portion of the variance of pH is explained by more than one component, and the sum of the variance explained by all components equals the total variance in pH. Therefore, the sum of the individual r2’s, which is the squared length of the vector, must equal 1, and so must the length itself.
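In symbols, writing r(X, PCj) for the loading of a variable X on PCj and vX for its vector of 14 loadings, the argument amounts to

\[ \lVert v_X \rVert^2 = \sum_{j=1}^{14} r^2(X, PC_j) = \sum_{j=1}^{14} \frac{\operatorname{Var}(X \text{ explained by } PC_j)}{\operatorname{Var}(X)} = 1 . \]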

The points on the plot help to see how the observations may form different groups or vary along the "gradients" represented by the combination of both PC's. The vectors that have length close to 1 (say >.85) represent variables that have a strong association with the two components, i.e., the two PC's capture a great deal of the variation of the original variable. When the vector length is close to 1, the relationships of that vector and others that are also close to 1 will be accurately displayed in the plot. The direction of the vector shows the sign of the correlations (loadings). Moreover, the angle between any two vectors shows the degree of correlation between the variation of the two variables that is captured on the PC1-PC2 plane (in fact, r = cos [angle]). When vectors tend to form "bundles" they can be interpreted as systems of variables that describe the gradient. For example, Ca, pH, NH4, acid, and Zn form such a system.

[pic]

Figure 1. Gabriel's biplot for the Spartina example. Numbers next to each point indicate the location of the sample.
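As a rough sketch (not JMP's Spinning Plot, and not part of the original course code), a similar biplot could be drawn in SAS with PROC SGPLOT. The sketch assumes a hypothetical data set named loadings, with variables varname, pc1, and pc2 holding each variable's loadings on PC1 and PC2 (for example, assembled from the PROC CORR output shown earlier), and it reuses the spartpc scores created by PROC PRINCOMP:

data biplot;                                /* stack observation scores and ray tips */
   set spartpc(keep=prin1 prin2)
       loadings(rename=(pc1=lx pc2=ly));    /* hypothetical loadings data set        */
run;
proc sgplot data=biplot;
   refline 0 / axis=x;
   refline 0 / axis=y;
   scatter x=prin1 y=prin2;                 /* observations                          */
   vector  x=lx y=ly / datalabel=varname;   /* rays drawn from the origin (0,0)      */
run;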

2 Identification of collinearity

When the PCA is performed on a set of X variables that are considered as explanatory variables for a multiple linear regression problem, the interpretation is different from the above. The main goal in this case is to determine whether there are variables in the set that are almost perfect linear combinations of other variables in the set. These variables have to be identified and considered for deletion.

Eigenvalue   Condition number
  4.924          1.00
  3.695          1.15
  1.607          1.75
  1.335          1.92
  0.692          2.67
  0.501          3.14
  0.385          3.57
  0.381          3.60
  0.166          5.45
  0.143          5.87
  0.087          7.53
  0.045         10.43
  0.030         12.83
  0.010         22.77

Identification of variables that may be causing a collinearity problem is achieved by calculating the Condition Number (CN) for each PC. Keep in mind that collinearity is a problem for MLR, not for PCA; we use PCA to work on the MLR problem.

CNi = sqrt(λ1 / λi), where λ1 is the largest eigenvalue.

Hence, the CN for PCi is the square root of the quotient between the largest eigenvalue and the eigenvalue for the PC under consideration. A CN of 30 or greater identifies a PC implicated in the collinearity. The variables involved can be identified by requesting the COLLINOINT option in PROC REG in SAS, as sketched below.
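A sketch of the corresponding SAS call is shown below; the response name bio is hypothetical and should be replaced by the actual dependent variable of the regression problem:

/* Collinearity diagnostics: intercept-adjusted condition indices and VIF */
proc reg data=spartina;
   model bio = h2s sal eh7 ph acid p k ca mg na mn zn cu nh4 / collinoint vif;
run;
quit;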

Some issues and potential problems in PCA

1 Use of correlation or covariance matrix?

The most typical choice is to use the correlation matrix to perform PCA, because this removes the impact of differences in the units used to express the different variables. When the covariance matrix is used (by choosing Principal Components on Covariance in JMP or by specifying the COV option in the PROC PRINCOMP statement in SAS), the variables that are expressed in units that yield values of large numerical magnitude will tend to dominate the first PC's. A change of units in a variable, say from g to kg, will tend to reduce its contribution to the first PC's. In most cases this is an undesirable artifact, because the results would depend on the units used.
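In SAS, the covariance-based analysis can be requested with the COV option (a sketch; everything else matches the earlier call, and spartpc_cov is an arbitrary output name):

proc princomp data=spartina cov out=spartpc_cov;
   var h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
run;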

A nice example of a situation where the use of the covariance matrix is recommended is given by Lattin et al. (2003), page 112:

[pic]

(From Lattin, J., J. Douglas Carroll, and Paul E. Green. 2003. Analyzing Multivariate Data. Thomson Brooks/Cole.)

2 Interpretation depends on goal.

In a sense, when PCA is performed to identify underlying factors and to reduce the dimensionality of the problem, one hopes to find a high degree of relation among some subgroups of variables. On the other hand, in MLR one hopes to find that all measured X variables will increase our ability to explain Y. In the case of multiple linear regression (MLR), it is desirable to have eigenvalues close to 1, because that indicates little collinearity among the X variables.

3 Interpretability.

One of the main problems with PCA is that oftentimes the new axes identified are difficult to interpret, and they may all involve a combination with a significant "component" of each original variable. There is no formal procedure to interpret PC's or to deal with lack of interpretability. Interpretation can be easier if the problem allows rotation of the PC's to transform them into more understandable “factors.” This type of analysis, closely related to PCA, is called Factor Analysis, and it is widely used in the social sciences.

4 How many PC’s should be retained?

In using PCA for reduction of dimensionality, one must decide how many components to keep for further analyses and explanation. In terms of visual presentation of results, it is very hard to convey results in more than 2 or 3 dimensions.

There are at least three options for deciding how many PC's to retain: inspecting a scree plot, retaining PC's whose eigenvalues exceed a critical value, and retaining enough PC's to account for a critical proportion of the variance in the original data.

1 Scree plot.

The scree plot is a graph of the eigenvalues in decreasing order. The y-axis shows the eigenvalues and the x-axis shows their order. The graph is inspected visually to identify "elbows," and the location of these breaks in the line is used to select a given number of PC's.

[pic]

Figure 5-2. Use of Scree plot to decide how many PC to retain. This choice is subjective, and focuses on finding "breaks" in the continuity of the line. In this case, keeping 3 or 5 PC's would be acceptable choices.
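With ODS Graphics enabled, PROC PRINCOMP can produce the scree plot directly (a sketch; the PLOTS=SCREE option assumes SAS 9.2 or later):

ods graphics on;
proc princomp data=spartina plots=scree;
   var h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
run;
ods graphics off;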

2 Retain if λ>average.

When PCA is based on the correlation matrix, the sum of the eigenvalues equals p, the number of variables. Thus, the average value of the eigenvalues is 1.0. Any PC whose eigenvalue is greater than 1 explains more than the average amount of variation, and can be kept. In the Spartina example, the first four eigenvalues (4.92, 3.70, 1.61, and 1.33) exceed 1, so four PC's would be retained, accounting for about 83% of the total variance.

3 Retain as many as necessary for 80%.

Finally, a sufficient number of PC's can be retained to account for a desired proportion of the total variance in the original data. This proportion can be chosen subjectively.

-----------------------

Box 1: Simple statistics

Principal Component Analysis

45 Observations

14 Variables

Simple Statistics

H2S SAL EH7 PH ACID

Mean -601.7777778 30.26666667 -314.4000000 4.602222222 3.861777778

StD 30.6956385 3.71972629 36.9559935 1.246994366 2.506354913

P K CA MG NA

Mean 32.29688889 797.6228889 2365.318889 3075.109333 16596.71111

StD 27.58669395 297.6023371 1718.327317 939.406676 6882.42337

MN ZN CU NH4

Mean 38.10054667 17.87524000 3.988576667 87.45520000

StD 24.48057096 8.27980582 1.036991704 47.27275022

Box 2: Correlation Matrix.

Correlation Matrix

H2S SAL EH7 PH ACID P K

H2S 1.0000 0.0958 0.3997 0.2735 -.3738 -.1154 0.0690

SAL 0.0958 1.0000 0.3093 -.0513 -.0125 -.1857 -.0206

EH7 0.3997 0.3093 1.0000 0.0940 -.1531 -.3054 0.4226

PH 0.2735 -.0513 0.0940 1.0000 -.9464 -.4014 0.0192

ACID -.3738 -.0125 -.1531 -.9464 1.0000 0.3829 -.0702

P -.1154 -.1857 -.3054 -.4014 0.3829 1.0000 -.2265

K 0.0690 -.0206 0.4226 0.0192 -.0702 -.2265 1.0000

CA 0.0933 0.0880 -.0421 0.8780 -.7911 -.3067 -.2652

MG -.1078 -.0100 0.2985 -.1761 0.1305 -.0632 0.8622

NA -.0038 0.1623 0.3425 -.0377 -.0607 -.1632 0.7921

MN 0.1415 -.2536 -.1113 -.4751 0.4204 0.4954 -.3475

ZN -.2724 -.4208 -.2320 -.7222 0.7147 0.5574 0.0736

CU 0.0127 -.2660 0.0945 0.1814 -.1432 -.0531 0.6931

NH4 -.4262 -.1568 -.2390 -.7460 0.8495 0.4897 -.1176

CA MG NA MN ZN CU NH4

H2S 0.0933 -.1078 -.0038 0.1415 -.2724 0.0127 -.4262

SAL 0.0880 -.0100 0.1623 -.2536 -.4208 -.2660 -.1568

EH7 -.0421 0.2985 0.3425 -.1113 -.2320 0.0945 -.2390

PH 0.8780 -.1761 -.0377 -.4751 -.7222 0.1814 -.7460

ACID -.7911 0.1305 -.0607 0.4204 0.7147 -.1432 0.8495

P -.3067 -.0632 -.1632 0.4954 0.5574 -.0531 0.4897

K -.2652 0.8622 0.7921 -.3475 0.0736 0.6931 -.1176

CA 1.0000 -.4184 -.2482 -.3090 -.6999 -.1122 -.5826

MG -.4184 1.0000 0.8995 -.2194 0.3452 0.7121 0.1082

NA -.2482 0.8995 1.0000 -.3101 0.1170 0.5601 -.1070

MN -.3090 -.2194 -.3101 1.0000 0.6033 -.2335 0.5270

ZN -.6999 0.3452 0.1170 0.6033 1.0000 0.2121 0.7207

CU -.1122 0.7121 0.5601 -.2335 0.2121 1.0000 0.0137

NH4 -.5826 0.1082 -.1070 0.5270 0.7207 0.0137 1.0000

Box 3. Eigenvalues

Eigenvalues of the Correlation Matrix

Eigenvalue Difference Proportion Cumulative

PRIN1 4.92391 1.22868 0.351708 0.35171

PRIN2 3.69523 2.08810 0.263945 0.61565

PRIN3 1.60713 0.27222 0.114795 0.73045

PRIN4 1.33490 0.64330 0.095350 0.82580

PRIN5 0.69160 0.19103 0.049400 0.87520

PRIN6 0.50057 0.11513 0.035755 0.91095

PRIN7 0.38544 0.00467 0.027531 0.93848

PRIN8 0.38077 0.21480 0.027198 0.96568

PRIN9 0.16597 0.02298 0.011855 0.97754

PRIN10 0.14299 0.05613 0.010214 0.98775

PRIN11 0.08687 0.04158 0.006205 0.99395

PRIN12 0.04529 0.01544 0.003235 0.99719

PRIN13 0.02985 0.02036 0.002132 0.99932

PRIN14 0.00949 . 0.000678 1.00000

In the results of the multivariate platform, select Outlier Analysis. This will display the Mahalanobis distance for all points in the dataset. In the picture below, the observations were already sorted by increasing distance, so the plot looks like an ordered string of dots.

Click on the red triangle by Outlier Analysis and select Save Jackknife Distance. This creates a new column in the data table, labeled Jackknife Distance. Create a new column where you calculate the squared Jackknife Mahalanobis distance, which in the next picture is labeled D-sq.

D2 is the variable that should be distributed like a χ2 random variable with p degrees of freedom (p=14 in this case). In order to determine how close the distribution of D2 is to a χ2, it is necessary to create a column of expected χ2 quantiles. For this the data must be sorted in increasing order of D2. The column with the χ2 quantile is created with a formula as indicated in the next figure. Conceptually, the χ2 quantile for a row is the value of a χ2 with p degrees of freedom that is greater than x% of the distribution, where x is the proportion of rows (observations) with lower values of D2.
