


STAT 101, Module 3: Measuring Associations

Numerical Summaries for Two Quantitative Variables

Examples of Questions about Two Variables:

1) How much does a balloon expand as the temperature rises?

2) How does gas mileage differ for car models with differing weight?

3) How does length of illness differ as the amount of medication differs?

4) How much do life spans differ if the amount of regular exercise differs?

5) How does demand differ as price differs?

6) How does revenue differ as advertising effort differs?

7) How does performance on the SAT differ as the amount of homework differs?

8) How does voter support differ if the amount spent by candidates for political office differs?

A bad habit: the assumption of causality

• We cast the questions above in the form

“How does Y differ if X differs?”

for a reason: We feel tempted to put the questions in the form

“How does X affect Y?”

But this assumes that X does indeed affect Y.

Do we know this? We usually don’t. It is an assumption.

Yet our mind jumps to conclusions about causal effects without evidence. (An extreme case is magical thinking.)

• Q: When do we know that X affects Y?

A1: If a controlled experiment has been performed.

[The notion of a controlled experiment is not simple:

o In the physical sciences it may mean that an apparatus has been used to control X while Y has been measured.

o In the medical sciences it may mean that a clinical trial has been performed, where subjects have been randomly assigned, double-blind, to treatment or placebo (X), and survival or improvement (Y) has been observed.

o In the social sciences it may mean that a lab experiment has been performed where subjects have been exposed to controlled stimuli (X) and responses (Y) have been observed.]

A2: If a strong theory tells us that X affects Y.

[Again, the notion of a strong theory is not simple:

o Most physics theories are strong. The peculiar thing about the physical world is that it behaves so regularly: temperature always affects pressure given fixed volume, opposite charges always attract each other, …

o Some biological theories are strong. For example, the theory of descent of species would be difficult to refute given the known underpinnings from molecular biology.

o In the social sciences there are really no strong theories. Every attempt to state “X affects Y” in generality runs up against circumstances, history, and culture that undermine its generality.]

• Comments on the examples listed at the beginning:

1) How much does a balloon expand as the temperature rises?

Strong theories of physics say that heat expands gas volume.

2) How does gas mileage differ for car models with differing weight?

Physics again: greater weight requires more energy to be moved against friction. But we know that many other factors matter also (technology, aerodynamics, age of car,…).

3) How does length of illness differ as the amount of medication differs?

Whether a medicine shortens an illness has to be established in clinical trials and can only be ascertained in large numbers. If you think that a dose of Tylenol got you over a cold, you really don’t know whether you would have recovered the same way without Tylenol. There is no causal knowledge in individual cases, only in large numbers. Taking a medicine is a gamble, but it may be the most reasonable gamble given current knowledge.

4) How much do life spans differ if the amount of regular exercise differs?

There is a danger of jumping to conclusions in matters of health. In fact, there have been recent examples where the literature reversed itself: some nutritional supplements gained a favorable reputation after empirical studies “showed” that they may prevent cancer or improve cardiovascular health. Examples are beta-carotene and vitamin E. Yet in clinical trials these supplements had no effect. It appears that popping certain supplements is one of the things healthy-living people do, but it actually has nothing to do with their health. Unlike certain supplements, the benefits of moderate regular exercise may be based on much stronger evidence, including clinical trials and our understanding of physiology (a complex of strong theories).

5) How does demand differ as price differs?

This is taught in economics: if price increases, demand decreases. Hence if we accept economics as a strong theory, we have a causal relationship. Yet, exceptions exist: there are luxury products whose demand increases when the price is raised. Even some colleges have experienced increased applications after raising tuition. Finally, economics is often seen not as a causal theory but a normative theory: it says “If human beings were rational optimizers of their utilities (wealth, happiness,…), then they would act as follows:…” But recently Nobel prizes have been given for research that shows how actual humans are not economically rational.

6) How does revenue differ as advertising effort differs?

Everybody agrees that advertising is necessary, yet measuring the effect of advertising is notoriously difficult. If you ever work for the marketing department of a company, do not accept this assignment! Also, any “effect” might not be causal.

7) How does performance on the SAT differ as the amount of homework differs?

Whether more homework improves SAT scores on average we don’t know. More precisely, I don’t know of any controlled experiments in this area, but they may exist.

8) How does voter support differ if the amount spent by candidates for political office differs?

Whether more money makes the campaign of a candidate more successful, we don’t know; it sounds likely, but there exist numerous exceptions of ‘underfinanced underdogs’ who beat their wealthier opponents. Obviously the characteristics of the candidate and the political climate also have something to do with success. Controlled experiments are infeasible because no candidate will agree to spend less than the maximum he can get his hands on. Also, it is legitimate for you to have a moral position on inequities in campaign spending even if you don’t actually know that spending affects campaign success.

What is left for us to learn from data?

• Most data have no strong underlying theory and are not from controlled experiments. Instead, they are

OBSERVATIONAL DATA,

meaning that they have been passively observed, without or with little control of the objects of study.

Go back to the data examples we have seen in Module 1,

Titanic survival, CEO compensation, car mileage, presidential approval ratings, Places Rated climate ratings,

and ask yourself:

▪ Is the dataset experimental or observational?

▪ Is there a strong theory that applies to the data?

• Q: If we can’t assume that the variables of an observational study affect each other, what use is it to analyze them?

A:

PREDICTION!

Even if X does not affect Y, it can be useful to know that high values of X imply higher values of Y on average.

Example: If marketers learn that readers of certain magazines (X) are more likely to buy their product (Y), they will want to know who reads these magazines, even if reading them is not the cause of the preference for certain products. The reason for the association may have to do with lifestyles that match preferences for magazines with preferences for consumer products.

(This example gives us one possible reason for why X and Y may be associated: there is a common factor that affects both.)

• A data example: the association between Height and Weight among 194 male UPenn students.

[pic]

o Does Height “cause” Weight?

Yes: Biology dictates that taller persons have more Weight on average (follows from a strong theory).

o However, history/culture also affects Weight!

50 years ago persons of the same Height would have had less Weight on average.

In the end, the data only says something about how Weight and Height are associated for this particular group of people. Our understanding of what makes Heights and Weights match up as observed is very incomplete.

We must find a way to describe association without referring to causality. Here is a definition that does this:

The ASSOCIATION between X and Y is the pattern in which X and Y values are matched.

Types of associations:

▪ positive vs. negative: up versus down

▪ linear vs. curved (=nonlinear),

▪ continuous versus clustered (=clumped).

In detail:

▪ Positive association: Higher values of Y tend to be matched with higher values of X.

▪ Negative association: Lower values of Y tend to be matched with higher values of X.

▪ Linear association: The plot of Y against X looks straight; drawing a straight line through the plot looks like the right thing to do.

▪ Non-linear association: The plot of Y against X does not look straight; drawing a curve through the plot looks like the right thing to do. (Convex: ‘cup’, concave: ‘cap’)

▪ Continuous association: The plot of Y against X does not look clumped, and drawing a line or curve through the plot looks like the right thing to do.

▪ Clustered association: The plot of Y against X has two or more groups diagonally shifted against each other; the plot looks clumped, but the clumps are neither side by side nor on top of each other: they are shifted up-down and left-right against each other. Drawing a curve or line doesn’t look like the right thing to do.

The plots below show examples:

A.

[pic]

B.

[pic]

C.

[pic]

D.

[pic]

A.: linear and positive

B.: linear and negative

C.: non-linear, convex

D.: clustered and positive

In real data the types of association can be more tentative, messy, and mixed. Examples: (PlacesRated.JMP, CarModels2003-4.JMP)

E.

[pic]

F.

[pic]

E.: linear, positive

F.: non-linear, negative

(weakly curved; ignore top left outlier)

G.

[pic]

H.

[pic]

G.: non-linear, pos., clustered

H.: non-linear or clustered?

(Height60: pos. assoc.?)

Questions we can ask about associations:

• How strongly are two variables associated?

This question is a good one because associations are never exact. Obviously there will be stronger and weaker associations.

• At a given value of X, what is the average value of Y?

This is a valid question: What is, for example, the average Weight of people of Height 5’11?

• For a given difference in X, what is the average difference in Y?

Again, this is a good question: What is the average difference in Weight between people of Heights 5’10 and 5’11, or 5’ and 6’? This will be answered with slopes of fitted lines.

These questions have nothing to do with causality; they only describe association: how X and Y values are paired on average, and how strongly so.

What if there is no association?

• What does it even mean, that X and Y have no association?

It means there is no pattern in how X and Y values are matched. “No pattern” would mean the X and Y values are matched

randomly.

• What does “random” mean?

This is a deep question, too deep to discuss here. We will only try to gain a practical sense of what randomness “looks like”.

• Pseudo-random number generators: There exists a way of “faking randomness” on computers, using algorithms that generate number sequences that have no apparent pattern. Take it as a given that this can be done in a convincing way.

JMP has a collection of pseudo-random number simulators. In particular it has a pseudo-random generator of integer sequences, such as ‘6, 2, 1, 4, 7, 3, 5’. These can be used to shuffle the values of a column as if the values had been thrown into a bag and drawn in random order, lottery-style.

• Random association between X and Y: an illustration.

1) Imagine putting all Height values in a bag, and all Weight values in another bag.

2) Shake both bags so they get mixed well.

3) Then pull out pairs of Height and Weight values, one pair after another, so the Height and Weight values are randomly matched.

What would the result look like? Below is first a plot of the real data, then three plots in which Heights and Weights are (pseudo-)randomly matched to illustrate what “random association” or, equivalently, “no association” would look like.

Plots 2-4 have by construction no association between Height and Weight, even though the values are exactly the same, just shuffled randomly within the columns. Whatever patterns you see in Plots 2-4 are random patterns, random associations, and of no interest. (Note that a single clump/cluster does not make this a “clustered association”; you need two or more clumps.)

The comparison shows quite convincingly, though, that in the actual data Height and Weight are not randomly matched. The positive association seems to be real and not random.

[pic]

[pic]

[pic]

[pic]

To reproduce the above experiment, do the following:

1) Open PennStudentsRandom.JMP

2) Graph > Overlay Plot >

‘HEIGHT’ →X, ‘Random WEIGHT’ →Y, > OK

3) Right-click on ‘Random Weight’ in spreadsheet > Formula > Apply. Click repeatedly on Apply: every time, a new shuffle of the values in ‘Random Weight’ is generated and shown in the overlay plot. [This does not work with Fit Y by X.]

This experiment gives a sense of what we should expect under random or no association. By comparing the actual plot of X and Y, we also get a sense of whether the visible association in the data is real or could be due to chance.
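For readers who want to try the same experiment outside JMP, here is a minimal Python sketch of the shuffle. The height and weight columns are simulated stand-ins, not the actual Penn data; only the random re-matching step is the point:

```python
# Sketch of the shuffling experiment in Python rather than JMP.
# The 194 height/weight values are simulated stand-ins, NOT the real data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
heights = rng.normal(70, 3, 194)                         # inches
weights = 5.5 * heights - 220 + rng.normal(0, 15, 194)   # pounds

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
axes[0].scatter(heights, weights, s=10)
axes[0].set_title("actual pairing")
for ax in axes[1:]:
    # Randomly re-match the weight values to the height values,
    # lottery-style, as in steps 1)-3) above:
    ax.scatter(heights, rng.permutation(weights), s=10)
    ax.set_title("random pairing")
for ax in axes:
    ax.set_xlabel("Height")
axes[0].set_ylabel("Weight")
plt.show()
```

Clicking through re-runs of `rng.permutation` plays the same role as repeatedly clicking Apply in JMP.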

Measuring the Strength of Linear Association:

The Correlation Coefficient of two Variables

• Textbook: Sec. 3.4 (again, ignore the ‘population’ case)

• Task: Finding a measure of two columns x and y such that

1) if the measure is positive, the linear association is positive, and the higher the value, the stronger is the positive linear association;

2) if the measure is negative, the linear association is negative, and the lower the value, the stronger is the negative linear association;

3) if the value is zero, there is no linear association;

Here are examples of artificial data whose x and y have linear associations that range from none to medium to strong to perfect, both positive and negative:

[pic]

What could a measure of linear association look like?

The most common measure, called correlation, is initially not very intuitive. It takes some algebra to make sense of it, in particular the so-called Cauchy–Schwarz inequality.

We start with a definition.

• Definition: Covariance of two variables x and y

cov(x,y) = [ (x1 – x̄)(y1 – ȳ) + … + (xN – x̄)(yN – ȳ) ] / (N – 1)

Algebra and covariance: Here is why people like the variance, and why they introduced the covariance. Like the cross-term xy in the binomial expansion

(x + y)² = x² + y² + 2xy ,

the covariance plays the role of the cross-term for the variance:

s(x+y)² = s(x)² + s(y)² + 2 cov(x,y) .

In fact, this formula is just the above binomial formula, but replicated N times with xi – x̄ and yi – ȳ instead of x and y, summed up, and divided by N–1.
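A short Python check can make both the definition and the variance identity concrete; the data below are invented purely for illustration:

```python
# Covariance computed from its definition, plus a check of the identity
# s(x+y)² = s(x)² + s(y)² + 2 cov(x,y).  The data are invented.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

def cov(x, y):
    # sum of (xi - x̄)(yi - ȳ), divided by N - 1
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

lhs = np.var(x + y, ddof=1)                                  # s(x+y)²
rhs = np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * cov(x, y)
print(np.isclose(lhs, rhs))                       # True
print(np.isclose(cov(x, y), np.cov(x, y)[0, 1]))  # agrees with numpy's cov
```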

• Fact: Cauchy–Schwarz inequality

–s(x) s(y)  ≤  cov(x,y)  ≤  s(x) s(y)

Furthermore:

o The right hand inequality becomes an equality exactly if x and y are in a perfect positive linear association.

o The left hand inequality becomes an equality exactly if x and y are in a perfect negative linear association.

Hence: We could use the covariance as a measure of linear association. But to judge the strength, we’d have to compare with the product of the standard deviations. Inconvenient!
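If you want to convince yourself of the Cauchy–Schwarz bound empirically, a brute-force Python check on made-up data is easy:

```python
# Brute-force check of the Cauchy–Schwarz bound |cov(x,y)| <= s(x) s(y)
# on 1000 random datasets (made-up data, of course).
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    x = rng.normal(size=30)
    y = rng.normal(size=30)
    bound = np.std(x, ddof=1) * np.std(y, ddof=1)
    assert abs(np.cov(x, y)[0, 1]) <= bound + 1e-12
print("bound held in all 1000 trials")
```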

• Definition: Correlation of two variables x and y

cor(x,y) = cov(x,y) / ( s(x) s(y) )

Fundamental and convenient property:

–1  ≤  cor(x,y)  ≤  +1

Moreover:

o cor(x,y) = +1 ↔ x, y in a perfect positive linear association

o cor(x,y) = –1 ↔ x, y in a perfect negative linear association

In the eight plots of artificial data above, the correlations were:

o 0.0, +0.5, +0.9 and +1.0 in the top row,

o 0.0, –0.5, –0.9 and –1.0 in the bottom row.

(The textbook has a series of plots similar to the above plots in Figure 3.7 on p.68 but with different correlations and smaller sample size. The sample size in the above plots was N = 500.)
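A small Python sketch, on artificial data like the plots above, computes the correlation from the definition and confirms the ±1 endpoints for perfect linear associations:

```python
# Correlation as cov(x,y) / (s(x) s(y)), with the ±1 endpoints verified
# on perfect linear associations.  Artificial data, N = 500 as in the plots.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500)

def cor(x, y):
    return np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(round(cor(x,  2.0 * x + 1.0), 6))            # +1.0: perfect positive
print(round(cor(x, -0.5 * x + 3.0), 6))            # -1.0: perfect negative
print(round(cor(x, x + rng.normal(size=500)), 2))  # in between: noisy positive
```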

Comparison of covariance and correlation:

o Because of the possibility of doing algebra, covariance is a theoretically important quantity.

o Because of its convenience as a measure of linear association, correlation is a practically important quantity.

A connection between correlation and covariance:

Standardization or z-Score

• For a variable x, form its z-score with the formula

zi = (xi – x̄) / s(x) .

• Properties:

o z̄ = 0 and s(z) = 1 always.

o z-Scores are always unit-free.

(Standardization could be done with any location measure instead of the mean, and any dispersion measure instead of the standard deviation, with the result that the standardized variable has zero location measure and unit dispersion measure. Similarly, the standardized variable is unit-free. Try this with the median and the IQR!)

• Notation: When we have z-scores of two variables x and y, we denote them zx and zy , respectively.

• Fact: cov(zx, zy) = cor(x,y) .

The correlation is the covariance of the z-scores!

Therefore, the correlation is also unit-free,

whereas the covariance has units (strange ones, though).
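Here is a brief Python verification, on simulated data, that z-scores have mean 0 and standard deviation 1 and that the covariance of the z-scores equals the correlation:

```python
# z-scores have mean 0 and SD 1, and the covariance of the z-scores
# equals the correlation of the original variables.  Simulated data.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(70, 3, size=194)
y = 5 * x + rng.normal(0, 15, size=194)

def zscore(v):
    return (v - v.mean()) / np.std(v, ddof=1)

zx, zy = zscore(x), zscore(y)
print(round(zx.mean(), 10), round(np.std(zx, ddof=1), 10))        # 0.0 1.0
print(np.isclose(np.cov(zx, zy)[0, 1], np.corrcoef(x, y)[0, 1]))  # True
```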

• A use of z-scores in grading: Imagine two midterm exams, both with a max of 20 points, but the first was less difficult and the scores range between 16 and 20, whereas the second had scores between 5 and 20. If added, the second midterm would be perhaps three times as influential as the first midterm. To equalize their influence, some instructors form z-scores before adding them up. This ensures that both have the same dispersion and hence same influence.
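A tiny numerical sketch shows the equalizing effect; the score vectors below are hypothetical, not from any real class:

```python
# Equalizing the influence of two midterms by z-scoring before adding.
# The score vectors are hypothetical.
import numpy as np

mid1 = np.array([16, 17, 18, 19, 20, 18, 17, 19])  # easy exam: narrow range
mid2 = np.array([5, 12, 20, 8, 15, 18, 10, 14])    # hard exam: wide range

def zscore(v):
    return (v - v.mean()) / np.std(v, ddof=1)

raw_total = mid1 + mid2                   # dominated by the spread of mid2
fair_total = zscore(mid1) + zscore(mid2)  # both exams now have SD 1
print(np.std(mid1, ddof=1), np.std(mid2, ddof=1))  # unequal dispersions
print(fair_total.round(2))
```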

Correlation in Practice

Correlation measures linear association. Yet if we calculate the correlation of variables x and y that are non-linearly associated, the correlation will still return a number: it picks up whatever degree of linear association is present in the non-linear pattern.

[pic]

[pic]

[pic]

cor(HEIGHT, WEIGHT) = 0.716      cor(Climate-Terrain, Housing) = 0.386      cor(X, Yclump’) = 0.664

The left-most correlation makes sense because the association looks linear. The middle correlation makes less sense because the association is probably non-linear and clustered. The right-most correlation is also shaky because the association is clearly clustered, even though it is positive.

Correlation in the presence of outliers and grouping

[pic]

In small samples, the correlation can be badly distorted by strong outliers. In the plotted example, the sample size is N = 50, and the outlier lies about a factor of 5 away from the bulk of the data.
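A quick simulation, with made-up data, shows how a single far-out point can manufacture a sizable correlation where essentially none exists:

```python
# One far-out point can manufacture a sizable correlation in a small sample.
# Made-up data, N = 50 as in the plotted example.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=49)
y = rng.normal(size=49)                   # unrelated: correlation near 0
print(round(np.corrcoef(x, y)[0, 1], 2))

x_out = np.append(x, 5.0)                 # add one outlier, far from the bulk
y_out = np.append(y, 5.0)
print(round(np.corrcoef(x_out, y_out)[0, 1], 2))  # pulled clearly positive
```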

[pic]

When data are strongly grouped, the group-internal correlations can be quite different from the global correlation, as in the plotted example.

Correlation Tables

When facing several quantitative variables, it is efficient to obtain the complete table of all pairwise correlations.

Correlations

                  Climate-Terrain   Housing   HlthC-Environ   The_Arts   Crime
Climate-Terrain   1.0000            0.3863    0.2133          0.2270     0.1924
Housing           0.3863            1.0000    0.4530          0.4486     0.1342
HlthC-Environ     0.2133            0.4530    1.0000          0.8658     0.3047
The_Arts          0.2270            0.4486    0.8658          1.0000     0.3895
Crime             0.1924            0.1342    0.3047          0.3895     1.0000

JMP: Analyze > Multivariate Methods > Multivariate

(selection variables) → Y, Columns

> OK
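Outside JMP, the same table and scatterplot matrix can be produced with pandas; the file name PlacesRated.csv below is a hypothetical export of PlacesRated.JMP, and the column names are assumed to match the table above:

```python
# A pandas equivalent of JMP's correlation table and scatterplot matrix.
# "PlacesRated.csv" is a hypothetical export of PlacesRated.JMP; the column
# names are assumed to match the table above.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

cols = ["Climate-Terrain", "Housing", "HlthC-Environ", "The_Arts", "Crime"]
df = pd.read_csv("PlacesRated.csv")[cols]

print(df.corr().round(4))                  # all pairwise correlations
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```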

But we really don’t know how reasonable these correlations are till we see the corresponding scatterplots. JMP, in the same output, produces all of them and shows them as a scatterplot matrix:

Scatterplot Matrix

[pic]

(We removed the “density ellipses” that JMP shows by default: click the tiny red triangle in the top left next to “Scatterplot Matrix” and deselect “Density Ellipses”.)

We see that the correlations are computed on a good deal of non-linear structure that we would not describe with a straight line. Yet the correlations get at least the signs right: all associations are positive, if anything, and the correlations reflect the strength of the associations somewhat.

Scatterplot matrices: Familiarize yourself with the conventions.

• The variable names are in the diagonal boxes.

• The plots in a column all have the same X-axis, named in the diagonal.

• The plots in a row all have the same Y-axis, again named in the diagonal.

Correlation and Ellipses, or “diagonal width”

JMP shows by default density ellipses that have the following properties:

• If the ellipse points upwards, it corresponds to a positive correlation; similarly, a downward pointing ellipse corresponds to a negative correlation.

• The thinner the ellipse, the closer the correlation is to ±1. Conversely, the more circular the ellipse, the closer the correlation is to 0.

Scatterplot Matrix

[pic]

Diagonal Dispersion as a Justification for Correlation

Here is another way to think about correlation, based on the idea of diagonal dispersion:

• First, we must make sense of the term “diagonal”. This is a geometric term that applies to a plot, but in any plot we have chosen vertical and horizontal dimensions in inches for the ranges of X and Y. This means we somehow visually equate a number of vertical units (lb for Weight) with horizontal units (inches for Height). To give numeric meaning to what we do visually, we use z-scores of both the X and the Y variable. Z-scores eliminate units and make all variables look “the same”: same location measure (0) and same dispersion measure (1). Hence from now on we will work with the z-scores zx and zy of X and Y, where standardization is done with the mean as the location measure and the standard deviation as the dispersion measure:

zx = (x – x̄) / s(x) ,    zy = (y – ȳ) / s(y)

These are two new columns formed from the X and Y columns. As an example, see PennStudentsDemo.JMP

• For z-scores, if there is a perfect linear association, it must be ± the identity: zy = a zx + b → zy = ± zx

Why? Because taking the mean of both sides shows b=0, and taking the standard deviation of both sides shows |a|=1.
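Written out as a short derivation (using z̄ = 0 and s(z) = 1 from the properties above):

```latex
% Perfect linear association between z-scores forces z_y = +/- z_x:
\begin{align*}
z_y = a\,z_x + b \;&\Rightarrow\; \bar{z}_y = a\,\bar{z}_x + b
      \;\Rightarrow\; 0 = a \cdot 0 + b \;\Rightarrow\; b = 0 ,\\
z_y = a\,z_x \;&\Rightarrow\; s(z_y) = |a|\, s(z_x)
      \;\Rightarrow\; 1 = |a| \;\Rightarrow\; a = \pm 1 .
\end{align*}
```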

• Observations:

o The values of zx and zy are nearly equal if the values of zy – zx are nearly zero.

Equivalently, the dispersion of zy – zx is nearly zero, hence s(zy – zx) is nearly zero.

o The values of zy and –zx are nearly equal if the values of zy – (–zx) = zy + zx are nearly zero.

Equivalently, the dispersion of zy + zx is nearly zero, hence s(zy+ zx) is nearly zero.

The situation is depicted in the following plot:

[pic]

In the following series of plots,

o the top row shows the dispersion of zy – zx shrinking and the dispersion of zy + zx expanding, while

o the bottom row shows the dispersion of zy – zx expanding and the dispersion of zy + zx shrinking.

[pic]

• The above train of thought suggests that either dispersion could be used as a measure of linear association:

s(zy – zx) or s(zy + zx)

because:

o s(zy – zx) = 0 exactly if zy = zx , and

s(zy – zx) = 2 exactly if zy = –zx .

o s(zy + zx) = 0 exactly if zy = –zx , and

s(zy + zx) = 2 exactly if zy = zx .

Therefore, if we wanted a measure of correlation that is zero for a perfect negative linear association and +2 for a perfect positive linear association, we could use s(zy + zx). However, when we earlier set up the task of constructing a measure of linear association, we asked for negative values for negative association. This could be achieved by using

s(zy + zx) – s(zy – zx),

or better:

(s(zy + zx) – s(zy – zx))/2

because this quantity is +1 when zy = zx , and –1 when zy = –zx . Yet, this choice is not used either. It is the difference of the variances (squared dispersions) that is actually used, up to a factor 4:

cor(x,y) = ( s²(zy + zx) – s²(zy – zx) ) / 4

One confirms easily that

o cor(x,y) = +1 if zy = zx , and

o cor(x,y) = –1 if zy =– zx , and

o cor(x,y) = 0 if s(zy + zx) = s(zy – zx) .

Finally, one also confirms by substituting the definitions of zx and zy that

( s²(zy + zx) – s²(zy – zx) ) / 4 = cov(x,y) / ( s(x) s(y) ) ,

which is the standard formula for the correlation. So we recognize that people like the variance really because of the algebra one can do with it. The alternative without squares tried earlier would work, too, as a measure of linear association, but there is no pretty algebra for it.
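The confirmation uses nothing but the variance identity from the covariance section, together with s(z) = 1 and cov(zx, zy) = cor(x,y):

```latex
% Expanding the two variances with s^2(x+y) = s^2(x) + s^2(y) + 2 cov(x,y):
\begin{align*}
s^2(z_y + z_x) &= s^2(z_y) + s^2(z_x) + 2\,\mathrm{cov}(z_x, z_y)
                = 2 + 2\,\mathrm{cor}(x,y), \\
s^2(z_y - z_x) &= s^2(z_y) + s^2(z_x) - 2\,\mathrm{cov}(z_x, z_y)
                = 2 - 2\,\mathrm{cor}(x,y), \\
\frac{s^2(z_y + z_x) - s^2(z_y - z_x)}{4}
               &= \frac{4\,\mathrm{cor}(x,y)}{4} = \mathrm{cor}(x,y).
\end{align*}
```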

Even though variance is not a measure of dispersion, it is the square of a measure of dispersion. We can still say that correlation measures “diagonal dispersion”, even though this is not exactly right.
