Chapter 6

Measures of distance and correlation between variables

In Chapters 4 and 5 we concentrated on distances between the samples of a data matrix, which are usually the rows. We now turn our attention to the variables, usually the columns. Two variables have a pair of values for each sample, and we can consider measures of distance and dissimilarity between these two column vectors. More often, however, we measure the similarity between variables: this can be in the form of correlation coefficients or other measures of association. In this chapter we shall look at the geometric properties of variables, and various measures of correlation between them. In particular we shall look at the geometric concept called a scalar product, which is closely related to the concept of Euclidean distance. The decision about which type of correlation coefficient to use depends on the measurement scales of the variables, as we saw briefly in Chapter 1. Finally, we also consider statistical tests of correlation, introducing the idea of permutation testing.

Contents

The geometry of variables
Correlation coefficient
Scalar product
Distances based on correlation coefficients
Distances between count variables
Distances between categorical variables and between categories
Testing correlations: permutation testing

The geometry of variables

In Exhibits 4.3 and 5.5 of the previous chapters we have been encouraging the notion of samples as points in a multidimensional space. Even though we cannot draw points in more than three dimensions, we can easily extend the mathematical definitions of distance between samples (usually the rows of the data matrix) to any number J of measurements. We now consider the variables, usually the columns of the data matrix, and their sets of observed values across the I samples. In this case a two- or even three-dimensional figure is a rather trivial starting point for our thinking, because having only 2 or 3 samples in a study is ridiculous from a statistical point of view. Nevertheless, in Exhibit 6.1 we have attempted a picture of two variables in the three-dimensional space of a data set with sample size I = 3, because of the high pedagogical value of the diagram.


Exhibit 6.1 Two variables measured in three samples (sites in this case), viewed in three dimensions: (a) original scale; (b) standardized scale, where each set of three values has been centred with respect to its mean and divided by its standard deviation (standardized values as shown). Projections of some points onto the 'floor' of the s2×s3 plane are shown, to assist in understanding the three-dimensional positions of the points.

[Figure: (a) the two variable points in the three-dimensional space of sites s1, s2, s3, on the original scale: depth = [72  75  59], pollution = [4.8  2.8  5.4]; (b) the same two points on the standardized scale.]

Standardized values:

SITE        s1       s2       s3
Pollution   0.343   -1.126    0.784
Depth       0.392    0.745   -1.137


Note that the two variables pollution and depth have been standardized with respect to the mean and standard deviation of this sample of size 3, hence the values here do not coincide with the standardized values in the complete data set, given in Exhibit 4.4. Exhibit 6.2 now shows the triangle formed by the two vectors and the origin O, taken out of the three-dimensional space of Exhibit 6.1(b), and laid flat.

Exhibit 6.2 Triangle of pollution and depth vectors with respect to origin (O) taken out of Exhibit 6.1(b) and laid flat, with lengths of sides indicated.

[Figure: triangle with apex at the origin O; the pollution and depth vectors each have length √2, and the side joining them has length 2.682.]

The standardization has imposed equal lengths on the two vectors, because of the unit variance of each variable j = 1 and 2:

$$\frac{1}{I-1}\sum_{i=1}^{I}(x_{ij}-\bar{x}_j)^2 \;=\; \frac{1}{I-1}\sum_{i=1}^{I}x_{ij}^2 \;=\; 1$$

where the means of the standardized variables are zero and, in this example, I = 3. So the length of each vector, i.e. the square root of the sum of its squared coordinates (see Exhibit 4.3), is:

$$\|\mathbf{x}_j\| \;=\; \sqrt{\sum_{i=1}^{I} x_{ij}^2} \;=\; \sqrt{I-1} \qquad (6.1)$$
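As a numerical check of (6.1), here is a short sketch using NumPy and the three-sample pollution and depth values from Exhibit 6.1(a):

```python
import numpy as np

# Three-sample data from Exhibit 6.1(a)
pollution = np.array([4.8, 2.8, 5.4])
depth = np.array([72.0, 75.0, 59.0])

def standardize(x):
    """Centre with respect to the mean, divide by the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

zp = standardize(pollution)   # approx [ 0.343, -1.126,  0.784]
zd = standardize(depth)       # approx [ 0.392,  0.745, -1.137]

I = len(pollution)
# The length (norm) of each standardized variable equals sqrt(I - 1), as in (6.1)
print(np.linalg.norm(zp), np.sqrt(I - 1))   # both equal sqrt(2) = 1.414...
```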

which is √2 in this example, as indicated in Exhibit 6.2. You can check that the (column) sums of squares of the standardized values in Exhibit 6.1(b) are indeed 2. The length of the third side of the triangle, between the pollution and depth points, is similarly calculated using the Euclidean distance formula (4.4) to be the square root of 7.190, i.e. 2.682 (check this by summing the squared differences between the two columns and taking the square root). Hence we know the three sides of the triangle, so by using the cosine rule (which we all learnt at school) we can calculate the cosine of the angle θ between the vectors:

$$c^2 \;=\; a^2 + b^2 - 2ab\cos(\theta) \qquad (6.2)$$

i.e., 2.682² = 2 + 2 − 2·√2·√2·cos(θ) = 4 − 4cos(θ)

hence, cos(θ) = 1 − ¼ × 7.190 = −0.7975

and the angle is θ = 2.494 radians, or 142.9 degrees.
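The cosine-rule calculation can be reproduced in a few lines; this sketch (using NumPy) standardizes the raw data and recovers the same cosine and angle:

```python
import numpy as np

pollution = np.array([4.8, 2.8, 5.4])
depth = np.array([72.0, 75.0, 59.0])
z = lambda x: (x - x.mean()) / x.std(ddof=1)   # standardize
zp, zd = z(pollution), z(depth)

a, b = np.linalg.norm(zp), np.linalg.norm(zd)  # both sqrt(2)
c = np.linalg.norm(zp - zd)                    # third side, approx 2.682

# Cosine rule (6.2): c^2 = a^2 + b^2 - 2ab cos(theta)
cos_theta = (a**2 + b**2 - c**2) / (2 * a * b)
theta_deg = np.degrees(np.arccos(cos_theta))
print(round(cos_theta, 4), round(theta_deg, 1))   # -0.7975  142.9
```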

6-4

Correlation coefficient

After all that work, so what?! (we hear readers cry...) The punch line is that −0.7975 is actually the correlation coefficient between pollution and depth (in this sample of size 3), so we have illustrated the result that the cosine of the angle between two standardized variables is their correlation. But in the above geometric explanation it is clear that the angle θ, and thus also the correlation cos(θ), does not depend on the lengths of the two vectors which subtend the angle, only on the centering of these vectors. Since we can choose the lengths at will, there is (and will be later) a great advantage if these vectors have length 1. For example, in the cosine rule of (6.2), if a = b = 1, then there is the following simple relationship between the correlation r = cos(θ) and the distance c between the two variable points, irrespective of the sample size:

$$r \;=\; 1 - \tfrac{1}{2}c^2 \qquad (6.3)$$

Standardized variables, whose lengths are equal to √(I − 1) (see (6.1)), can be converted to have length 1 simply by dividing them by √(I − 1), and then we call them unit variables. In our I = 3 example, the unit variables are:

SITE        s1       s2       s3
Pollution   0.242   -0.796    0.554
Depth       0.277    0.527   -0.804

and (6.3) can be easily verified:

−0.7975 = 1 − ½[(0.242 − 0.277)² + (−0.796 − 0.527)² + (0.554 − (−0.804))²]
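The same verification can be done in code; this sketch builds the unit variables by dividing the standardized values by √(I − 1) and then applies relation (6.3):

```python
import numpy as np

pollution = np.array([4.8, 2.8, 5.4])
depth = np.array([72.0, 75.0, 59.0])
I = len(pollution)

def unit_variable(x):
    """Standardize, then divide by sqrt(I - 1) so the vector has length 1."""
    zx = (x - x.mean()) / x.std(ddof=1)
    return zx / np.sqrt(I - 1)

up, ud = unit_variable(pollution), unit_variable(depth)
c = np.linalg.norm(up - ud)   # distance between the two unit variables
r = 1 - 0.5 * c**2            # relation (6.3)
print(round(r, 4))            # -0.7975, the correlation
```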

Scalar product

But there is yet another way of interpreting the correlation coefficient geometrically. Look at what you get when you take the sum of the products of the elements of the unit variables:

(0.242 × 0.277) + (−0.796 × 0.527) + (0.554 × (−0.804)) = −0.7975

Almost miraculously, the correlation coefficient can be calculated by what is called the scalar product:

$$r_{jj'} \;=\; \sum_{i=1}^{I} x_{ij}\, x_{ij'} \qquad (6.4)$$

as long as the xij are the values of the unit variables. In fact, this result is easy to prove, since the definition of the correlation is a sum of products, but we leave details like that to the Theoretical Appendix A.
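This scalar-product route to the correlation is easy to check numerically; in this sketch, `np.dot` computes (6.4) on the unit variables and `np.corrcoef` gives the ordinary correlation for comparison:

```python
import numpy as np

pollution = np.array([4.8, 2.8, 5.4])
depth = np.array([72.0, 75.0, 59.0])
I = len(pollution)

def unit_variable(x):
    zx = (x - x.mean()) / x.std(ddof=1)   # standardize
    return zx / np.sqrt(I - 1)            # rescale to unit length

up, ud = unit_variable(pollution), unit_variable(depth)
r_scalar = np.dot(up, ud)                       # scalar product (6.4)
r_usual = np.corrcoef(pollution, depth)[0, 1]   # ordinary correlation
print(round(r_scalar, 4), round(r_usual, 4))    # both -0.7975
```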

The concept of a scalar product underlies many multivariate techniques which we shall introduce later. It is very much related to the operation of projection, which is crucial later when we project points in high-dimensional spaces onto lower-dimensional ones. As an illustration of this, consider Exhibit 6.3, which depicts the unit variables (with length 1, so they are shorter than those of length √2 = 1.414 in Exhibit 6.2).

Exhibit 6.3 Same triangle as in Exhibit 6.2, but with variables having unit length (i.e., unit variables, obtained by dividing standardized variables by the square root of I − 1, where I is the sample size). The projection of either variable onto the direction defined by the other variable vector gives the value of the correlation, cos(θ). (The origin O is the zero point and the scale is given by the unit length of the variables.)

[Figure: triangle with apex at the origin O; the pollution and depth vectors each have length 1, the side joining them has length 1.896, and the projection of either vector onto the other equals cos(θ) = −0.7975.]

Distances based on correlation coefficients

From a distance point of view, if the variables are expressed as unit variables, with sum of squares equal to 1, then from (6.3) the distance between variables j and j′ (denoted by c in the previous discussion, denoted here by d_jj′) is directly related to the correlation coefficient r_jj′ as follows:

$$d_{jj'} \;=\; \sqrt{2 - 2r_{jj'}} \;=\; \sqrt{2}\sqrt{1 - r_{jj'}} \qquad (6.5)$$

where d_jj′ has a minimum of 0 when r = 1 (i.e., the two variables coincide), a maximum of 2 when r = −1 (i.e., the two variables go in exactly opposite directions), and the value √2 when r = 0 (i.e., the two variables are uncorrelated and are at right angles to each other).
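The mapping from correlation to distance can be written as a one-line function; the three special cases from the text serve as checks in this sketch:

```python
import numpy as np

def corr_distance(r):
    """Distance between two unit variables with correlation r, from (6.3)."""
    return np.sqrt(2.0 - 2.0 * r)

print(corr_distance(1.0))    # 0.0      : the variables coincide
print(corr_distance(-1.0))   # 2.0      : exactly opposite directions
print(corr_distance(0.0))    # 1.414... : sqrt(2), right angles
```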

An inter-variable distance can also be defined in the same way for other types of correlation coefficients and measures of association that lie between ?1 and +1, for example the (Spearman) rank correlation. This so-called nonparametric measure of correlation is the regular correlation coefficient applied to the ranks of the data. In the sample of size 3 in Exhibit 6.1(a) pollution and depth have the following ranks:

SITE        s1   s2   s3
Pollution    2    1    3
Depth        2    3    1
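The Spearman rank correlation is then just the ordinary correlation applied to these ranks. A sketch (the tiny ranking helper assumes no tied values, as in this example):

```python
import numpy as np

def ranks(x):
    """Ranks 1..n of the values in x (assumes no ties, as here)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

pollution_ranks = ranks(np.array([4.8, 2.8, 5.4]))   # [2, 1, 3]
depth_ranks = ranks(np.array([72.0, 75.0, 59.0]))    # [2, 3, 1]

# Spearman rho = Pearson correlation of the ranks
rho = np.corrcoef(pollution_ranks, depth_ranks)[0, 1]
print(rho)   # -1.0: a perfectly decreasing rank relationship
```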
