Project: Linear Correlation and Regression - Central Oregon Community ...

Project: Linear Correlation and Regression

You may very well have studied linear regression before; I know many instructors discuss it in their classes. If the word "regression" means nothing to you...great! This project will explain it to you from the ground up. For those of you who have seen linear regression before...great! This project will hopefully demystify what is going on when you run the command LinReg(ax+b) on your TI.

Much of the data we deal with in this course are univariate; that is, only one characteristic is measured and studied. For example, we can study the average age of houses in, say, Oklahoma. The one variable? Age. This project will deal with bivariate data, where two characteristics are measured simultaneously. Our main idea is to discover whether or not there is a correlation between these two variables.

A correlation is a measure of how well two variables are related. If two variables are related well, we say they are highly correlated. If not, we say they are not highly correlated. For example, height and weight are well correlated: taller people tend to be heavier than shorter people. The relationship isn't perfect; people of the same height vary in weight, and you can probably think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5'' is less than the average weight of people 5'6'', and their average weight is less than that of people 5'7'', and so forth.

OK...so, let's begin with a straightforward, context-free example. Here are some data points:

x    10     8     13    9     11    14    6     4     12     7     5
y    8.04   6.95  7.58  8.81  8.33  9.96  7.24  4.26  10.84  4.82  5.68

It's often hard to see, when data is presented in this form, whether or not a correlation exists between the variables. To assist us, we'll use a scatter plot (footnote 1)...there it is at right.

Now that we have our scatter plot, we can observe a few things more easily. It seems as though, as x increases, y increases, as well...roughly linearly. Can you (kind of) see the line that goes through them?

This line (y = 0.5x + 3) is called the "least squares regression" line, or "best fit" line (more on how to find it using Excel in a bit, and, if you're interested in why it's called the "least squares regression" line, this will explain: ). It is the line that best defines the data onto which it's superimposed. From this line, you may (if you're careful) make predictions.
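To make the "least squares" idea a bit more concrete, here is a minimal Python sketch (it is not part of the Excel workflow, and the function name `sum_squared_residuals` and the two comparison lines are made up for illustration). It adds up the squared vertical gaps between each data point and a candidate line, so you can compare y = 0.5x + 3 against a couple of nearby lines.

```python
# Data points from the table above.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def sum_squared_residuals(slope, intercept, xs, ys):
    """Total of the squared vertical distances from each point to the line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(xs, ys))


best_fit = sum_squared_residuals(0.5, 3.0, x, y)  # the line quoted above
steeper = sum_squared_residuals(0.6, 3.0, x, y)   # hypothetical alternative
higher = sum_squared_residuals(0.5, 3.5, x, y)    # hypothetical alternative

print(f"y = 0.5x + 3.0 -> {best_fit:.2f}")
print(f"y = 0.6x + 3.0 -> {steeper:.2f}")
print(f"y = 0.5x + 3.5 -> {higher:.2f}")
```

The "best fit" line is the one whose total comes out smallest; any other line you try should give a larger total.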

But, wait just a minute...who says that line should even be there?

I mean, I could have just as easily drawn a parabola, or some other curve, over that data. Who says a line even makes sense?

1 If you need a little tutorial for creating an Excel scatter plot:

Apparently this idea also upset a statistician by the name of Frank Anscombe. Back in 1973, he published a paper titled "Graphs in Statistical Analysis" where he presented four data sets: the one I showed you above, and three more. When you graph all four, with their corresponding "best fit lines", you get this:

Yup...all FOUR of those data sets have the same best fit line. And, you see, that's the problem: Just because we can apply a best fit line to data doesn't mean we should. In our first data set (left to right), the line seems reasonable. In the second, I sure see parabolic data, don't you? In the third, you have almost perfectly linear data with an outlier severely affecting the best fit line; most of the data trends upward slightly left to right, but the best fit line rises much more quickly, influenced by that outlier (the same kind of thing happens in the fourth data set).
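If you'd like to check that claim numerically, here is a rough Python sketch. This handout only lists the first data set; the other three are taken from Anscombe's published quartet, and the helper name `fit_line` is just an illustrative label. It computes the least-squares slope and intercept for each set.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x-values shared by sets 1-3
quartet = {
    "set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                     7.24, 4.26, 10.84, 4.82, 5.68]),
    "set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                     6.13, 3.10, 9.13, 7.26, 4.74]),
    "set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                     6.08, 5.39, 8.15, 6.42, 5.73]),
    "set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
               5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept) in fits.items():
    print(f"{name}: y = {slope:.2f}x + {intercept:.2f}")
```

The numbers agree to two decimal places, which is exactly why the scatter plots matter: the arithmetic alone can't tell these four data sets apart.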

So, what we need is a measure of how linearly correlated data is...and statisticians have given us that with a little measure called r. It's formally called the "Pearson Product Moment Correlation Coefficient" (footnote 2). Here's a look at some different datasets and their corresponding r-values (thanks to Wikipedia):

As you can see, the sign of the r-value indicates the "slope of the data", and, the closer the absolute value of the r-value is to 1, the better the linear fit (positive r's imply a positive slope in the data, and vice versa). So the r-values of 1 and 0.8 show strong linear evidence, but how about the 0.4's? The 0's? Not so much, eh?

We need a decision mechanism...a way, from an r-value, to decide if the data is "linear enough". I'll show you how, with an example. Suppose a pediatrician collects the following data from 11 infants:

Height (in.)        27.75  24.5  25.5  26    25    27.75  26.5  27    26.75  26.75  27.5
Head Girths (in.)   17.1   17.1  17.1  17.3  16.9  17.6   17.3  17.5  17.3   17.5   17.5

She wants to see if there is a linear relation between the infants' heights and their head girths.

She starts (wisely) by constructing a scatterplot of the data, shown at right. Once she's plotted the data, she seems satisfied that the data looks roughly linear; she doesn't see a parabolic curve to the data, or any other nonlinear patterns. But...are they linear enough?

Let's check...open the spreadsheet marked "correg.xlsx", and select the tab marked "infant height wrt head girth". In that sheet, you'll see the data across the top, as well as the scatterplot of the data.

2 Which makes you wonder why they called it "r", eh? Why not the "PPMCC"? Nah...looks too much like "PMRC", and we all know how Rage Against the Machine and Dee Snider feel about them. Why "r", then? Maybe they were pirates. By the way, its computational formula is at right. And no, you never have to do it by hand. Arrrr, mateys...

$$r = \frac{\sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)}{n - 1}$$

You'll also see some orange "decision boxes" that I've set up for you...let's begin: to find r, in cell C5, type =CORREL(C2:M2,C3:M3) in the "fx" bar and then press Enter. Your spreadsheet will kick back the r-value of 0.69.

Next, input your sample size in cell C6. Once you press Enter, your decision will be made in cell C7. In this case, the decision is "yes...the data is linear enough." (footnote 3) This means that, as an infant's height increases, their head girth increases linearly, as well.
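For anyone curious what CORREL and the decision box might be doing, here is a hedged Python sketch. The r computation follows the formula in footnote 2. The decision step is an assumption: the spreadsheet's internals aren't shown in this handout, so the sketch uses the standard two-sided t-test for a correlation coefficient at 95% confidence, with the critical value 2.262 (t for 9 degrees of freedom) hard-coded. The names `pearson_r` and `T_CRIT_DF9` are illustrative.

```python
import math

heights = [27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5]
girths = [17.1, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5]


def pearson_r(xs, ys):
    """Pearson r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in xs) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in ys) / (n - 1))
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(xs, ys)) / (n - 1)


r = pearson_r(heights, girths)
n = len(heights)

# ASSUMPTION: the decision box is modelled as a two-sided t-test on r.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
T_CRIT_DF9 = 2.262  # 95% two-sided critical t for n - 2 = 9 degrees of freedom
linear_enough = abs(t_stat) > T_CRIT_DF9

print(f"r = {r:.2f}, t = {t_stat:.2f}, linear enough: {linear_enough}")
```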

Now, once you've decided that your data is linear enough, it's time to get the equation that best defines your data. Technically, you'd have to find an equation of the form y = mx + b, where

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \quad \text{and} \quad b = \bar{y} - m\bar{x}$$

...but no one does these by hand anymore (footnote 4). Excel, as you saw earlier, saves the day again (this is for Excel 2010; 2014 might look a tad different, but that's what Google is for, right?):

a) Left-click on the scatterplot, then click the "Chart Tools/Layout" tab.

b) Left-click on the selection "Trendline", then, from the bottom of the list of options, click on "More Trendline Options".

c) Make sure that the "Linear" radio button is selected. Everything else can stay the same, but make sure that the box next to "Display Equation on Chart" is selected.

Click OK, and you'll get something like the one at right above (you can make some style changes, too, if you like).
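The same trendline equation can be reproduced outside Excel. Here is a minimal Python sketch that applies the m and b formulas above to the infant data; the variable names are illustrative and not anything from the spreadsheet.

```python
heights = [27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5]
girths = [17.1, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5]

n = len(heights)
sum_x = sum(heights)
sum_y = sum(girths)
sum_xy = sum(x * y for x, y in zip(heights, girths))
sum_x2 = sum(x * x for x in heights)

# Slope: (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# Intercept: mean of y minus slope times mean of x
b = sum_y / n - m * sum_x / n

print(f"y = {m:.4f}x + {b:.3f}")
```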

Now that we have the regression ("best fit") equation, we can (carefully) use it to forecast. For example, suppose we wanted to know what the head girth of an infant will be, given their height is 29 inches. Here's how you could reckon that piece of info...

$$\hat{y} = 13.601'' + 0.1395x = 13.601'' + 0.1395(29'') \approx 17.6''$$

So, an infant with a height of 29 inches should have a head circumference of roughly 17.6 inches.

Did you notice how I called that result ŷ? That was intentional...since it's not a parameter, but a statistic, it should carry a margin of error. ME's for regressions are not discussed in most intro stat courses (footnote 5), but I've shown you how to get the 95% CI for the last example, if you're interested, at right.

If you ever need to calculate ME's for regression equations, you'll be using SPSS, or MiniTab, or some other (better) stat software than Excel. So, in this project, I will not be asking you to forecast...just to get the best-fit equations.

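95% CI for ŷ:

$$\hat{y} \pm ME = \hat{y} \pm t\sqrt{MSE\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x - \bar{x})^2}\right)}, \quad \text{where } MSE = \frac{\sum (y - \hat{y})^2}{n - 2}$$

$$17.6 \pm 2.26\sqrt{0.028\left(\frac{1}{11} + \frac{(29 - 26.45)^2}{11.98}\right)} \approx 17.6 \pm 0.30$$

(17.30", 17.90")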
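For the curious, here is a hedged Python sketch of how such an interval can be computed from the infant data. Assumptions: MSE is taken to be the sum of squared residuals divided by n - 2, and the critical value 2.262 (95%, 9 degrees of freedom) is hard-coded from a t-table rather than looked up by software. All variable names are illustrative.

```python
import math

heights = [27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5]
girths = [17.1, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(girths) / n
sxx = sum((x - x_bar) ** 2 for x in heights)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, girths))

# Least-squares slope and intercept.
m = sxy / sxx
b = y_bar - m * x_bar

# ASSUMPTION: MSE = (sum of squared residuals) / (n - 2).
sse = sum((y - (m * x + b)) ** 2 for x, y in zip(heights, girths))
mse = sse / (n - 2)

T_CRIT_DF9 = 2.262  # 95% two-sided critical t, 9 degrees of freedom
x_new = 29          # height (in.) at which we forecast
y_hat = m * x_new + b
margin = T_CRIT_DF9 * math.sqrt(mse * (1 / n + (x_new - x_bar) ** 2 / sxx))

print(f"{y_hat:.2f} +/- {margin:.2f}")
```

Notice that the margin grows as x_new moves away from the mean height, which is one reason forecasting far outside your data calls for care.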

3 This had better feel like black magic to you. Hang on, we'll get to what it's doing, soon enough...also, when it says the data is "linear enough", Excel is trusting that you have already decided that a line makes sense. More on that in question #5 below...

4 Although, if you're interested, I've placed a justification of these formulas on the enrichment page of the website.

5 Most intro stat courses also treat linear regression as a hypothesis test, which is unnecessarily cumbersome, in my humble opinion. That's why we're doing regression the way we are, and not the way most texts (yours included) do it.

A few closing notes:

1) "Correlation does not imply causation". Just because two variables move together in a predictable fashion does not mean that one is causing the other to do so. There are, at times, other factors (called "lurking", or "confounding" variables) at work causing the correlation you see, but your regression will most likely not identify them. For example, in the baby height vs. head girth work above, we can't say that the baby's increasing height is causing the baby's increasing head girth...most likely, it's due to the fact that all parts of the baby are growing proportionately, and we're only looking at two of them (I'd bet, if you studied their foot lengths, for example, they would correlate with the heights as well).

2) Your dataset should be an SRS (simple random sample) with no outliers. Outliers can wildly affect your regression analyses and pollute your results.

3) More on the r-value:

a. To keep with our conventions in class (and the journals), I based the r-values on 95% confidence (5% significance) in this project, but they can allow for any confidence, in general.

b. A given r-value might imply linearity in a larger data set, but not imply linearity in a smaller one. As your sample size goes up, your correlation coefficient is allowed to be farther from 1, yet still be significant. This allows for the natural variations that will occur in a larger set of data. Here's an example of what I mean:

Both of those data sets are arguably linear, and they carry the same r-value (0.75589). However, the left one, where n = 5, is not strongly linear enough, while the right one (n = 20) is.
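One way to see why the same r can pass at n = 20 but fail at n = 5 is to convert a critical t-value into a critical r-value. The sketch below assumes the decision boxes behave like the standard two-sided t-test on r at 95% confidence (the spreadsheet's actual formula isn't shown in this handout); the critical t-values 3.182 (3 degrees of freedom) and 2.101 (18 degrees of freedom) are hard-coded from a t-table, and `critical_r` is an illustrative name.

```python
import math


def critical_r(t_crit, n):
    """Smallest |r| that counts as significant for a sample of size n."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)


r = 0.75589
# 95% two-sided critical t-values, keyed by sample size (df = n - 2).
T_CRIT = {5: 3.182, 20: 2.101}

results = {}
for n, t_crit in T_CRIT.items():
    threshold = critical_r(t_crit, n)
    results[n] = abs(r) > threshold
    print(f"n = {n}: need |r| > {threshold:.3f}; linear enough: {results[n]}")
```

The bar that |r| has to clear drops as n grows, which is the sense in which a larger sample lets r sit farther from 1 and still be significant.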

c. I don't think I can emphasize this enough...the r-value you find does not tell you "Yeah! Your data's linear!" It can't see your data at all. It's basically a "second-pass" filter: once you've looked at your data, it tells you, "OK...you saw the data, and you think it looks linear. I'll tell you if it's statistically linear enough." It's sort of a conditional probability...like a P-value. It doesn't tell you P(you should use a line)...it tells you

P(you should use a line | you have decided a line will fit the data well).

* * * * * * * Please do this entire project within Excel...download the spreadsheet to your computer, and save it. Upload to Blackboard when you're done.

(20 points: all or none! Read carefully!) For each of the questions 1 through 4:

a. Create the Excel scatter plot for your data, right in the spreadsheet. Make sure your axes have clear labels (with units, if applicable), and make sure your plot has a title. Do each of these first before moving on to part b.

b. If your data appears linear (or not obviously non-linear), complete the orange decision boxes like we did in the example above. In the non-linear one, leave the decision boxes blank.

c. If (and only if) your decision box tells you your data is statistically linear enough (2 of the 3 should), place the best-fit line, with its equation, on your scatter plot. If you correctly find a best-fit line, replace the blank in the question asked in the textbox in each sheet with "increase" or "decrease". Otherwise, leave that box the way it is.

To recap: You should have 4 scatterplots, and 3 sets of decision boxes filled in. You should also only have 2 lines on scatterplots, and 2 "increase/decrease" statements filled in. Be careful!

For question 5: complete the 4 decision boxes shown (regardless of the look of the data), and answer the question in the sheet. (5 points total: 1 for the Anscombe's Quartet decision boxes, 4 for the answered question)

1. (Data Set "rate my professor") I randomly selected 16 of the faculty members from COCC and checked their "Rate My Professor" profiles. On this tab, you'll see the data. I'd like you to see if there is a linear enough relationship between the professor's easiness rating (the x-variable) and their overall rating (the y-variable).

2. (Data Set "cops and offenses") I randomly selected 15 years on record for Bend's police force (size of force given as x), and cross-referenced those years for offenses reported (number of offenses per 1,000 Bendites are given as y). I'd like you to see if there is a linear enough relationship between the number of cops and the number of offenses per 1,000 population. From a study I did a few years back! Hope they don't arrest me for sharing the data.

3. (Data Set "revenue") A stadium's operations office is trying to decide how much to charge for a ticket to a midweek, midday event. With each rise in ticket price comes a greater chance of turning away more folks, but also a rise in revenue (from more expensive tickets) that could offset the lost patrons. Based on randomized survey results, they forecast the amount of revenue (y) as a function of ticket price (x). Does there appear to be a linear enough relationship between the ticket price and revenue?

4. (Data Set "earnings wrt education") The data in this sheet indicate averages gotten from the Bureau of Labor Statistics' 2008 Current Population Survey (the BLS CPS 2008). I'd like you to see if there is a linear enough relationship between the years of schooling (x) and median weekly salary (y).

You may have to scroll over to get to the last data set:

5. (Data Set "Anscombe's Quartet") We'll finish with the datasets to which I alluded on page 2 of this project. Remember...the ones that all had the same best-fit line? Those four data sets are listed in this tab, each next to its corresponding scatter plot. A student carelessly filled in all 4 decision boxes without looking first at the scatterplots (you might recall, from the 2nd page of this project, that, mathematically, these 4 data sets have the same best-fit line...you now see that they also have the same r-value). Answer the question in the sheet (you may have to scroll down to see it).
