Chapter 6
Measures of distance and correlation between variables
In Chapters 4 and 5 we concentrated on distances between the samples of a data matrix, which are usually the rows. We now turn our attention to the variables, usually the columns. Two variables have a pair of values for each sample, and we can consider measures of distance and dissimilarity between these two column vectors. More often, however, we measure the similarity between variables: this can be in the form of correlation coefficients or other measures of association. In this chapter we shall look at the geometric properties of variables and various measures of correlation between them. In particular we shall look at the geometric concept called a scalar product, which is closely related to the concept of Euclidean distance. The decision about which type of correlation coefficient to use depends on the measurement scales of the variables, as we already saw briefly in Chapter 1. Finally, we also consider statistical tests of correlation, introducing the idea of permutation testing.
Contents
The geometry of variables
Correlation coefficient
Scalar product
Distances based on correlation coefficients
Distances between count variables
Distances between categorical variables and between categories
Testing correlations: permutation testing
The geometry of variables
In Exhibits 4.3 and 5.5 in the previous chapters we have been encouraging the notion of samples being points in a multidimensional space. Even though we cannot draw points in more than three dimensions, we can easily extend the mathematical definitions of distance to samples (usually the rows of the data matrix) for which we have J measurements for any J. We now consider the variables, which are usually the columns of the data matrix, and their sets of observed values across the I samples. In this case a two-dimensional, even a three-dimensional figure as a starting point for our thinking is rather trivial, because having only 2 or 3 samples in a study is ridiculous from a statistical point of view. Nevertheless, in Exhibit 6.1 we have attempted a picture of two variables in the three-dimensional space of a data set with sample size I = 3, because of the high pedagogical value of the diagram.
Exhibit 6.1 Two variables measured in three samples (sites in this case), viewed in three dimensions: (a) original scale; (b) standardized scale, where each set of three values has been centred with respect to its mean and divided by its standard deviation (standardized values as shown). Projections of some points onto the `floor' of the s2-s3 plane are shown, to assist in understanding the three-dimensional positions of the points.
(a)

[Figure, panel (a): the pollution vector (4.8, 2.8, 5.4) and the depth vector (72, 75, 59) plotted against axes s1, s2 and s3 on their original scales, with projections onto the s2-s3 plane.]
(b)

SITE         s1      s2      s3
Pollution   0.343  -1.126   0.784
Depth       0.392   0.745  -1.137

[Figure, panel (b): the standardized pollution vector (0.343, -1.126, 0.784) and depth vector (0.392, 0.745, -1.137) plotted against axes s1, s2 and s3, with projections onto the s2-s3 plane.]
Note that the two variables pollution and depth have been standardized with respect to the mean and standard deviation of this sample of size 3, hence the values here do not coincide with the standardized values in the complete data set, given in Exhibit 4.4. Exhibit 6.2 now shows the triangle formed by the two vectors and the origin O, taken out of the three-dimensional space of Exhibit 6.1(b), and laid flat.
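As a side note, this standardization of the small sample can be sketched in a few lines of Python (an illustration, not code from the text; the raw values are those read off Exhibit 6.1(a)):

```python
from statistics import mean, stdev

# Raw values for the three sites, read off Exhibit 6.1(a)
pollution = [4.8, 2.8, 5.4]
depth = [72.0, 75.0, 59.0]

def standardize(x):
    # Centre on the sample mean and divide by the sample standard
    # deviation (denominator I - 1), as in the text
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]

print([round(v, 3) for v in standardize(pollution)])  # [0.343, -1.126, 0.784]
print([round(v, 3) for v in standardize(depth)])      # [0.392, 0.745, -1.137]
```

The printed values reproduce the standardized coordinates shown in Exhibit 6.1(b).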
Exhibit 6.2 Triangle of pollution and depth vectors with respect to origin (O) taken out of Exhibit 6.1(b) and laid flat, with lengths of sides indicated.
[Figure: the triangle laid flat, with the origin O as one vertex; the pollution and depth vectors each have length √2, and the side joining their endpoints has length 2.682.]
The standardization has imposed equal lengths on the two vectors, because of the unit variance of each variable j = 1 and 2:

$$\frac{1}{I-1}\sum_{i=1}^{I}(x_{ij}-\bar{x}_j)^2 \;=\; \frac{1}{I-1}\sum_{i=1}^{I}x_{ij}^2 \;=\; 1$$
where the means of the standardized variables are zero and, in this example, I = 3. So the length of each vector, i.e. the square root of the sum of squared coordinates (see Exhibit 4.3), is:

$$\|\mathbf{x}_j\| \;=\; \sqrt{\sum_{i=1}^{I}x_{ij}^2} \;=\; \sqrt{I-1} \qquad (6.1)$$
which is √2 in this example, as indicated in Exhibit 6.2. You can check that the (column) sums of squares of the standardized values in Exhibit 6.1(b) are indeed 2. The length of the third side of the triangle, between the pollution and depth points, is similarly calculated using the Euclidean distance formula (4.4) to be the square root of 7.190, i.e. 2.682 (check by summing the squared differences between the two columns and taking the square root). Hence we know the three sides of the triangle, so by using the cosine rule (which we all learnt at school) we can calculate the cosine of the angle between the vectors:
$$c^2 \;=\; a^2 + b^2 - 2ab\cos(\theta) \qquad (6.2)$$

i.e., $2.682^2 = 2 + 2 - 2\sqrt{2}\sqrt{2}\cos(\theta) = 4 - 4\cos(\theta)$,

hence $\cos(\theta) = 1 - \tfrac{1}{4}(7.190) = -0.7975$,

and the angle is θ = 2.494 radians, or 142.9 degrees.
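The whole cosine-rule calculation can be replayed numerically; the sketch below (illustrative Python, not from the text) uses the rounded standardized values of Exhibit 6.1(b), so it reproduces the figures above to about three decimals:

```python
import math

# Standardized values from Exhibit 6.1(b)
pollution = [0.343, -1.126, 0.784]
depth = [0.392, 0.745, -1.137]

a = math.sqrt(sum(v * v for v in pollution))  # length of pollution vector, approx sqrt(2)
b = math.sqrt(sum(v * v for v in depth))      # length of depth vector, approx sqrt(2)
c = math.sqrt(sum((p - d) ** 2
                  for p, d in zip(pollution, depth)))  # Euclidean distance, approx 2.682

# Cosine rule (6.2): c^2 = a^2 + b^2 - 2ab cos(theta)
cos_theta = (a * a + b * b - c * c) / (2 * a * b)
theta = math.acos(cos_theta)
print(round(cos_theta, 3))            # -0.798
print(round(theta, 3))                # 2.494 radians
print(round(math.degrees(theta), 1))  # 142.9 degrees
```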
Correlation coefficient
After all that work, so what?! (we hear readers cry...) The punch line is that −0.7975 is actually the correlation coefficient between pollution and depth (in this sample of size 3), so we have illustrated the result that the cosine of the angle between two standardized variables is the correlation. But in the above geometric explanation it is clear that the angle θ, and thus also the correlation, which is cos(θ), does not depend on the lengths of the two vectors which subtend the angle, only on the centring of these vectors. Since we can choose the lengths at will, there is (and will be later) a great advantage if these vectors have length 1. For example, in the cosine rule of (6.2), if a = b = 1, then there is the following simple relationship between the correlation r = cos(θ) and the distance c between the two variable points, irrespective of the sample size:
$$r \;=\; 1 - \tfrac{1}{2}c^2 \qquad (6.3)$$
Standardized variables, whose lengths are equal to √(I−1) (see (6.1)), can be converted to have length 1 simply by dividing them by √(I−1), and then we call them unit variables. In our I = 3 example, the unit variables are:
SITE         s1      s2      s3
Pollution   0.242  -0.796   0.554
Depth       0.277   0.527  -0.804
and (6.3) can be easily verified:

$$-0.7975 \;=\; 1 - \tfrac{1}{2}\left[(0.242-0.277)^2 + (-0.796-0.527)^2 + (0.554-(-0.804))^2\right]$$
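This verification of (6.3) can also be done in code; the sketch below rebuilds the unit variables by dividing the standardized values of Exhibit 6.1(b) by √2 (with the rounded inputs the result is -0.798 rather than the more precise -0.7975):

```python
import math

# Unit variables: standardized values of Exhibit 6.1(b) divided by sqrt(I - 1) = sqrt(2)
pollution = [v / math.sqrt(2) for v in (0.343, -1.126, 0.784)]
depth = [v / math.sqrt(2) for v in (0.392, 0.745, -1.137)]

# Squared Euclidean distance between the two unit variables
c2 = sum((p - d) ** 2 for p, d in zip(pollution, depth))
r = 1 - 0.5 * c2  # relation (6.3): r = 1 - c^2/2
print(round(r, 3))  # -0.798
```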
Scalar product
But there is yet another way of interpreting the correlation coefficient geometrically. Look at what you get when you take the sum of the products of the elements of the unit variables:
$$(0.242 \times 0.277) + (-0.796 \times 0.527) + (0.554 \times (-0.804)) \;=\; -0.7975$$
Almost miraculously, the correlation coefficient can be calculated by what is called the scalar product:

$$r_{jj'} \;=\; \sum_{i=1}^{I} x_{ij}\,x_{ij'} \qquad (6.4)$$

as long as the x_ij are the values of the unit variables. In fact, this result is easy to prove, since the definition of the correlation is a sum of products, but we leave details like that to the Theoretical Appendix A.
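As a quick numerical check of the scalar product, the sum of products of the two unit variables tabulated above can be computed directly (illustrative Python):

```python
# Unit variables for the I = 3 example (values tabulated in the text)
pollution_u = [0.242, -0.796, 0.554]
depth_u = [0.277, 0.527, -0.804]

# Scalar product: sum of element-wise products = correlation coefficient
r = sum(p * d for p, d in zip(pollution_u, depth_u))
print(round(r, 3))  # -0.798, i.e. the correlation up to rounding
```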
The concept of a scalar product underlies many multivariate techniques which we shall introduce later. It is very much related to the operation of projection, which is crucial later when we project points in high-dimensional spaces onto lower-dimensional ones. As an illustration of this, consider Exhibit 6.3, which depicts the unit variables (with length 1, so they are shorter than those of length 1.414 in Exhibit 6.2).
Exhibit 6.3 Same triangle as in Exhibit 6.2, but with variables having unit length (i.e., unit variables, obtained by dividing standardized variables by the square root of I−1, where I is the sample size). The projection of either variable onto the direction defined by the other variable vector will give the value of the correlation, cos(θ). (The origin O is the zero point and the scale is given by the unit length of the variables.)
[Figure: the same triangle, now with unit-length pollution and depth vectors; the side joining them has length 1.896, and the projection of either vector onto the other equals the correlation, −0.7975.]
Distances based on correlation coefficients
From a distance point of view, if the variables are expressed as unit variables, with sum of squares equal to 1, then from (6.3) the distance between variables j and j′ (denoted by c in the previous discussion, denoted here by d_jj′) is directly related to the correlation coefficient r_jj′ as follows:

$$d_{jj'} \;=\; \sqrt{2 - 2r_{jj'}} \;=\; \sqrt{2}\sqrt{1 - r_{jj'}} \qquad (6.5)$$

where d_jj′ has a minimum of 0 when r = 1 (i.e., the two variables coincide), a maximum of 2 when r = −1 (i.e., the two variables go in exactly opposite directions), and the value √2 when r = 0 (i.e., the two variables are uncorrelated and are at right angles to each other).
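The distance-correlation relation can be checked at these special cases with a small helper (the function name is ours, not the text's):

```python
import math

def corr_distance(r):
    # Distance between two unit variables with correlation r:
    # d = sqrt(2 - 2r)
    return math.sqrt(2 - 2 * r)

print(corr_distance(1.0))                # 0.0 : the variables coincide
print(corr_distance(-1.0))               # 2.0 : exactly opposite directions
print(round(corr_distance(0.0), 3))      # 1.414 = sqrt(2) : uncorrelated, at right angles
print(round(corr_distance(-0.7975), 3))  # 1.896 : pollution and depth, as in Exhibit 6.3
```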
An inter-variable distance can also be defined in the same way for other types of correlation coefficients and measures of association that lie between −1 and +1, for example the (Spearman) rank correlation. This so-called nonparametric measure of correlation is the regular correlation coefficient applied to the ranks of the data. In the sample of size 3 in Exhibit 6.1(a), pollution and depth have the following ranks:
SITE         s1   s2   s3
Pollution     2    1    3
Depth         2    3    1
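Since the rank correlation is just the ordinary correlation coefficient applied to the ranks, it can be computed from the table above (a sketch; the helper function is ours):

```python
from statistics import mean

# Ranks of pollution and depth across the three sites (from the table above)
pollution_ranks = [2, 1, 3]
depth_ranks = [2, 3, 1]

def pearson(x, y):
    # Ordinary (Pearson) correlation coefficient; applied to ranks
    # it gives the Spearman rank correlation
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

print(pearson(pollution_ranks, depth_ranks))  # -1.0 for this tiny sample
```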