Lhsteacher.lexingtonma.org



Correlation: Examining Residuals and Correlation Coefficient

Today, we will be more closely examining best-fit lines. Begin by creating a scatterplot for the data given in the Excel file on the web site. Remember to label your axes appropriately.

Lately, we have been using our calculators to find equations of lines of best fit. Excel can calculate best-fit lines (trendlines) as well. Now, we are going to examine part of the procedure that the calculator and the computer use to find them. Excel can make this process easier for us.

Residuals are the:

[pic]

The goal of finding the best-fit line is to make the numbers in the residual column as small as possible. Since some sets of data can contain hundreds of points, you need a single number that represents the error for the entire data set. Adding the individual residuals does not really help. One prediction that is 10,0000 too high offsets a prediction that is 10,000 too low.

A line of best fit is the graph of the linear equation that shows the relationship between two sets of data most accurately (with the least errors).

When working with a line of best fit we want to assess how well the line is fitting the data. The sum of square residuals (and thus positive numbers) is one way to measure, but the minimum value that this can reach changes depending on the data. Instead, we use what is called the linear correlation coefficient (r), which is a number that is always between –1 and 1.

Note: When calculating r, the linear correlation coefficient, the calculator uses the sum of square residuals. They are related, but we are not going to go into how.

The number r, called the correlation coefficient, is a measurement of how closely the best-fit line fits the data.

• A value of r = 1 or –1 would say the line fits the data perfectly (perfect correlation).

• A value of r that is greater than 0.8 is considered strong correlation

• A value of r that is less than 0.5 is often considered weak correlation

• A value of r = 0 would stand for the worst possible fit (no correlation)

o A positive value of r would mean:

o A negative value of r would mean:

[pic]

Our calculator will tell you r for any best-fit line. To turn this feature on: [pic]

Enter your original Excel data into your calculator and find the line of best fit. You will now notice that the calculator displays r along with the equation.

Equation:

What value of r does your calculator display? What does this number tell you?

The calculator also display r2, the coefficient of determination. This allows you to determine how certain one can be in making predictions from the best-fit line. This is the number that Excel gives you. You can find r by taking the square root of r2.

Now you try it with some data about the weather and the number of visitors to a water park:

|Visitors to a Water Park |

| | | | |

|Temperature (F) |# of Visitors (thousands)|Predicted |Residuals |

|62 |36.8 | | |

|67 |40.9 | | |

|69 |36.9 | | |

|70 |37.1 | | |

|72 |44.3 | | |

|77 |42.3 | | |

|79 |45.8 | | |

|80 |45.4 | | |

|83 |48.4 | | |

|85 |51.5 | | |

|88 |57.7 | | |

|94 |57.3 | | |

|96 |56.1 | | |

|98 |58.9 | | |

What is the best-fit line that Excel gives you?

What is the R2 value that Excel gives you?

What is the value of r?

What would you say about the correlation of the best-fit line to the data?

Homework

|Hours |Yeast Density |

|0 |9.6 |

|1 |18.3 |

|2 |29.0 |

|3 |47.2 |

|4 |71.1 |

|5 |119.1 |

|6 |174.6 |

|7 |257.3 |

| | |

1. The table of data comes from an experiment by Tor Carlson in 1913 on Saccharomyces cerevisiae (a type of yeast). The data show the number of hours elapsed and the number of yeast cells per square unit of area in a Petri dish.

a. Find the balance point of the data and add it to the table.

b. Make a scatterplot of this data below.

[pic]

c. Using your calculator, find the line of best fit. Write the equation and r-value below, and plot the best-fit line on your scatterplot.

d. With a different color pen or highlighter, mark the residuals on your scatterplot.

e. Either in Excel, with your calculator or by hand calculate all of the residuals and label them in your scatterplot. You can either fill out the table below, or print out your Excel work and paste it below.

|Hours |Yeast Density |Predicted Density |Residuals |

|0 |9.6 | | |

|1 |18.3 | | |

|2 |29.0 | | |

|3 |47.2 | | |

|4 |71.1 | | |

|5 |119.1 | | |

|6 |174.6 | | |

|7 |257.3 | | |

f. Look at your r-value. What does this tell you about your best-fit line? (Make sure you address the sign of your r-value and the magnitude of your r-value.)

2. This table gives the mean height in centimeters of boys ages 5 to 13 in the United States. (Source: National Center for Health Statistics)

|Age |Height (cm) |

|5 |109.2 |

|6 |115.7 |

|7 |122.0 |

|8 |128.1 |

|9 |133.7 |

|10 |138.8 |

|11 |143.7 |

|12 |149.3 |

|13 |156.4 |

a. Using your calculator find the best-fit line.

a. What does the slope represent? Does this make sense?

b. What does the y-intercept represent? Does this make sense?

c. What does the data point (5, 109.2) represent? Does this make sense?

d. Predict the height of a 14-year old. Does this make sense?

e. Predict the height of a 30-year old. Does this make sense?

f. Predict how old a 130cm tall boy is.

g. What is the correlation coefficient, r, for your best-fit line?

h. What does r tell you about your best-fit line?

3. Explain how your calculator finds the line of best-fit using residuals.

4. For the scatterplots below estimate the r-value for the lines of best fit.

[pic]

d. e. g.

[pic] [pic] [pic]

h. i.

[pic] [pic]

TEACHER NOTES:

computer day!!

add directions for how to get excel file (highlighted yellow)

-- teach students how to find trendline and display equation – right click on points on scatterpoint, click “add trendline”, go to options and click “display equation” – the equation is generally not moveable – have fun reading it!

OPEN UP SKETCHPAD BEFORE RESIDUALS

- say this is the same data we were just plotting in Excel

- then show the line and have students manipulate it to try to best-fit the equation

- ask students how they are picking the line – get to that we are trying to make the distance between the data points and the line the smallest

- how far off are our data points from the line?

- how can we measure that distance? – get to comparing the actual data value to the predicted data value (what you get when you plug in year into the best-fit line) – this distance is called the residual

GO BACK TO EXCEL

- new column “predicted number of deaths” -- =best fit line equation, but click year box instead of x, then drag down for all the rows

- new column “Residuals (errors)” -- =click on actual box – click on predicted box, then drag down for all the rows

- (if you want, highlight all new cells, right click, “format cell”, on left click on number, switch decimals to 1)

GO BACK TO WORKSHEET

- have students write down definition of residuals and how they calculate it: residuals are the difference between the actual value and the predicted value, or how far our data points are away from the best-fit line or how much error our best-fit line has, calculate by actual data value – predicted value (or y-value of data point – y-value of point on line)

- have students label residuals in diagram with their actual length calculated in excel – start talking about + vs. – and why we want to get rid of signs when summing up error to determine how good the line is

o if you were to add up the residuals what would that give you? Would that be possible? What is wrong with these residuals? (don’t want + and -)

o talk about how summing “residuals” is measuring total error of the line

o then talk about one way to get rid of +/- is to square all the numbers

GO BACK TO SKETCHPAD

- show the sum of squared residuals, play with line again, and show that the line that looks like a best-fit has the smallest sum of squared residuals line

- then show best fit line and see how close the class got, if off change your line to that line and show sum of square residuals still goes down

GO BACK TO EXCEL

- right click on treadline, go to options, click on show equation – compare this to sketchpad and show they are the same

- in a new column “square residuals” – = click on residual box then square it, drag down for all the rows

- in new cell “sum of square residuals” - = sum (highlight all the square residual column)

- compare this number to sum of square residuals in SKETCHPAD when matched line to best-fit line

- This is how the calculator finds the line of best fit – it minimizes the sum of square residuals

BACK TO HANDOUT

- Follow the handout to talk about correlation number

-- HONORS – this lesson will basically be replaced by Michelle’s r^2 lesson

Possible honors extention – residual plots or interpreting residual plots

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download