Correlation - DePaul University



LSP 121

Activity 3

Part I - Introduction to Correlation

Learning Goals for this Activity

1. You will see how correlation can help explain relationships.

2. You will begin to understand the difference between correlation and causality.

1.  Using Excel, open the file StateSATs.xls (from the QRC website in the folder OlderData at the bottom of the page) which contains data on average SAT scores and the percent taking the SAT in each state in the US

a. Sort the data to find which states have the highest average SAT scores and which have the lowest. List the three highest states and the three lowest states.

b. Make a XY-scatterplot of the average score and the percentage of students taking the test. Add a linear trendline and find the R-squared value.  Paste this graph into your Word document. Do you think they are correlated? Why/why not?

[pic]

c. Write a short paragraph describing the relationship between average SAT score and percentage student taking the test.  Bonus 3 points: Give a reasonable explanation for why these seemingly puzzling results.

d. How does Illinois compare in average SAT score? In percent taking the SAT?  You should determine the z-scores for Illinois for these two values to answer this.  A reminder on calculating z-score: You need to find the mean and sd of both columns (B and C). Below each column, calculate the mean using ‘=average()’. To determine the standard deviation, the formula is ‘=stdev()’. Then use ‘=normdist()’ formula to determine the specific value using Illinois’ data. Include the values for mean, sd for both columns and the specific values for Illinois in your answer. An important aspect of statistics is interpreting your results in terms of the “real world”. So, make a conjecture why so few Illinois students take the SAT.

2. Using Excel, open the file TVLifeExpectancy.xls, which contains data on life expectancy and the number of TV's per person in selected countries.

a. Make a scatterplot of the data; adjust the scale so that 40 is the minimum on the y-axis (Right-click over the numbers ( Format Axis ( Axis Options) .  Add a linear trendline and find the R-squared value.  Paste this graph into your Word document. Note how adjusting the scale altered the appearance of the graph. Can you suggest why this might have been helpful? (Hint: Think in terms of the appearnce of the trendline and how that appearance suggests or does not suggest correlation

b. Is there a correlation between life expectancy and number of TV's per person?

c. Can we infer from the data that TV's cause longevity?  Can you name some common underlying causes for both longevity and higher rates of televisions per capita

3. Using Excel, open the file Nielsen.xls (in the OlderData file), a file derived from data in the 1994 World Almanac and Book of Facts on the Nielsen TV ratings for the favorite syndicated programs for 1992-93.  Determine the R-squared value for each of the following pairs. Note that you don't have to graph each one.  You can type =RSQ(A7:A26, B7:B26) to calculate the R-squared for the data in columns A7 to A26 and B7 to B26.

a. Women and Men

b. Women and Teenager

c. Women and Children

d. Men and Teenagers

e. Men and Children

f. Teenagers and Children

Can you apply a “real-world” explanation for these data? In a short well-written paragraph, explain what can be concluded from these correlations and absences of correlation.

4. This last exercise will use SPSS for calculating correlation. But before we import an Excel file into SPSS, let’s add a column to our Excel table. Using Excel, open the file CityCrimeRates.xls, which contains data on the crime rates (total and violent) and population.  Add a new column – Population Density – which is calculated by dividing Population by Land Area. Save this Excel file to My Documents or the Desktop. Then run SPSS and import the Excel file. Give the variables appropriate names and change the necessary values from String to Numeric (with appropriate decimal places).

Determine the correlation coefficient for each of the following pairs of variables:

a. Crime Rate and Population

b. Violent Crime Rate and Population

c. Crime Rate and Population Density (Number of people per square mile)

Paste the results of each correlation into your Word document. In a short well-written paragraph, explain what can be concluded from these correlations and absences of correlation.

Part II - Statistics That Deceive: Simpson’s Paradox

1. Air travelers would like their flights to arrive on time. Airlines collect data about on-time arrivals and report them to the Department of Transportation. Here is one month’s data for flight from several western cities for two airlines.

On Time Late

Alaska Airlines 3274 501

America West 6438 787

a. Calculate the percentage of flights that are on time for each airline and the percentage of flights that are late for each airline (Recall: to compute a percentage, divide the part by the whole. So compute the total of 3274 + 501. Then to find the percent on time, divide 3274 by this total. Percentage late is similar: divide 501 by the total, or subtract the percent on time from 100. You can easily do this in a spreadsheet Which airline has the better on-time record? Include these results in your Word document.

Now consider more detailed data which gives the city from which the flight originated:

Alaska Airlines America West

On Time Late On Time Late

Los Angeles 497 62 694 117

Phoenix 221 12 4840 415

San Diego 212 20 383 65

San Francisco 503 102 320 129

Seattle 1841 305 201 61

TOTALS 3274 501 6438 787

b. Enter the above table into Excel and calculate the percentage of flights that were on time and delayed for each airline for each city of origination. (For example, for Los Angeles, add 497 and 62 for the total number of flights. Then to find the percentage of On Time flights, divide 497 by this total. Now which airline has the better on-time record? Paste or type the results into your Word document.

c. Explain why this is an example of Simpson’s paradox, and explain how it comes about in this case.

2. One of the better known real-life examples of Simpson’s Paradox is the lawsuit against the University of California, Berkeley for bias against women applying to graduate school. Below are the (supposed) admission figures for the fall of 1973. Calculate the percentage of admission for both the men and the women.

| |Applicants |Admitted |% Admitted |

|Men |8442 |3714 | |

|Women |4321 |1512 | |

From this data, which sex appears to be the favored sex?

However, further data revealed the following information (actual numbers may be slightly different):

|Major |Men | |Women | |

| |Applicants |Admitted |Applicants |Admitted |

|A |825 |512 |108 |89 |

|B |560 |352 |25 |17 |

|C |325 |120 |593 |210 |

|D |417 |138 |375 |131 |

|E |191 |53 |393 |94 |

|F |272 |16 |341 |24 |

Find the percentages of those admitted for all majors and both sexes. In which majors did the men have a higher percentage of admission? In which majors did the women have a higher percentage? How does this compare with the first table above? How is this an example of Simpson’s Paradox and how does it come about in this example?

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download