
Chapter 3.2: Linear Regression – R² and Residuals

Least-Squares Regression Line

The LSRL is a model used to represent a set of ________________ data. Suppose you find the vertical distance from each point in the data to the linear model, then square those distances and find the sum. This is called the ________________________________________________. The Least-Squares Regression Line (LSRL) is the line that ________________ this sum. The equation of the LSRL is ŷ = a + bx.

x represents ________________________________________________.

ŷ represents ________________________________________________.

a represents ________________________________________________.

b represents ________________________________________________.

Given a set of data, you can calculate the LSRL (without using your calculator!). Knowing the correlation makes this task even easier. Use the following formulas:

[pic] [pic] [pic] [pic]

Also, note that: [pic]

Exercise 1: The correlation (r) between the number of wins by American League baseball teams and the average attendance at their home games for the 2006 season is 0.696.

a) What would you predict about the Average Attendance for a team that is 2 standard deviations above average in wins?

b) If a team is 1 standard deviation below average in attendance, what would you predict about the number of games the team has won?

Exercise 2: Find the LSRL given the summary statistics – Tale of 2 Regressions WKS

Coefficient of Determination

The coefficient of determination, also called R², is the square of the _______________. The R² value tells how much of the variation in the response variable is accounted for by the linear regression model. For example, if R² = 1, then _____% of the variability in the response variable is accounted for by the linear model. In other words, the relationship between the two variables is perfectly linear. If R² = 0.95, we can conclude that _____% of the variability in the response variable is accounted for by the linear relationship with the explanatory variable.

1. Given the following set of data, find the equation of the LSRL, then find and interpret both the correlation and the coefficient of determination.

[pic] [pic]

a. LSRL: ________________________________________ (use meaningful variables in your equation rather than x and y, and use proper statistical notation!)

b. Correlation (r-value): ________. A correlation of ________ indicates that there is a _______________, _______________, _______________ relationship between ___________________________ and ______________________ ________.

c. Coefficient of determination (R²): ________. An R² value of ________ indicates that ________% of the _______________ in _____________________________ is accounted for by the _______________ relationship with _________________.

2. A study of class attendance and grades earned among first-year students at a state university showed that in general students who attended a higher percent of their classes earned higher grades. Class attendance explained 16% of the variation in grades among the students. What is the numerical correlation between percent of classes attended and grades earned? __________

Residual Plots

A residual is the difference between the observed y-value and the __________________ y-value for a given x-value.

residual = y − ŷ

The ________________________________________ (SSR) is used to determine the Least-Squares Regression Line for a given set of data.

A ____________________ is a scatterplot which graphs the residuals on the _______________ axis and the values of the explanatory variable on the _______________ axis for each data point, (x, y − ŷ).

The residual plot gives a visual representation of the amount of error in the model. The closer the residuals are to __________, the smaller the error and the more accurate the model.

The LSRL is a good model if the residual plot shows random _______________ relatively close to the horizontal axis (zero). The horizontal axis represents the _______________.

Points in the residual plot that lie directly on the horizontal axis lie directly on the ___________.

Points in the residual plot that lie above the horizontal axis correspond to data points that lie __________ the LSRL, so the model gives an underestimate at those points. Therefore, _______________ residuals represent underestimates.

Points in the residual plot that lie below the horizontal axis correspond to data points that lie __________ the LSRL, so the model gives an overestimate at those points. Therefore, _______________ residuals represent overestimates.

The LSRL is not a good model if the residual plot shows _______________________________.

3. Construct a well-labeled residual plot using the data on jet ski fatalities from #1. What can you conclude about the appropriateness of the linear model based on the residual plot?

Chapter 3.2: Linear Regression – R² and Residuals (KEY)

Least-Squares Regression Line

The LSRL is a model used to represent a set of quantitative data. Suppose you find the vertical distance from each point in the data to the linear model, then square those distances and find the sum. This is called the sum of the squares of the residuals. The Least-Squares Regression Line (LSRL) is the line that minimizes this sum. The equation of the LSRL is ŷ = a + bx.

x represents the explanatory variable (actual data).

ŷ represents the predicted y-value.

a represents the y-intercept.

b represents the slope.

Given a set of data, you can calculate the LSRL (without using your calculator!). Knowing the correlation makes this task even easier. Use the following formulas:

[pic] [pic] [pic] [pic]

Also, note that: [pic]
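
As a check on the hand calculation, here is a minimal Python sketch that builds the LSRL from summary statistics. It assumes the standard formulas slope = r · s_y / s_x and intercept = ȳ − slope · x̄; the handout's own formula images did not survive, so treat these as the presumed intended formulas. The function names and the data values are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

def lsrl_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept, slope) of the LSRL from summary statistics."""
    slope = r * s_y / s_x              # assumed formula: b = r * s_y / s_x
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return intercept, slope

# Made-up data, for illustration only (not from the handout).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
intercept, slope = lsrl_from_summary(mean(xs), mean(ys), stdev(xs), stdev(ys), r)
print(f"r = {r:.3f}")
print(f"predicted y = {intercept:.1f} + {slope:.1f} x")
```

For this made-up data the slope works out to 0.6 and the intercept to 2.2, which matches what a calculator's linear regression would report.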

Exercise 1: The correlation (r) between the number of wins by American League baseball teams and the average attendance at their home games for the 2006 season is 0.696.

a) What would you predict about the Average Attendance for a team that is 2 standard deviations above average in wins?

The predicted average attendance is (0.696)(2) = 1.392 standard deviations above average.

b) If a team is 1 standard deviation below average in attendance, what would you predict about the number of games the team has won?

The predicted number of wins is (0.696)(1) = 0.696 standard deviations below the average number of wins.
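
Both answers to Exercise 1 use the same rule: the predicted z-score of one variable is r times the z-score of the other. The short Python sketch below just automates that arithmetic; the function name is a hypothetical helper, not part of the handout.

```python
def predicted_z(r, z_known):
    """Predicted z-score of one variable, given the z-score of the other."""
    return r * z_known

R_WINS_ATTENDANCE = 0.696  # correlation given in Exercise 1

# a) Team is 2 standard deviations above average in wins.
z_attendance = predicted_z(R_WINS_ATTENDANCE, 2)

# b) Team is 1 standard deviation below average in attendance.
z_wins = predicted_z(R_WINS_ATTENDANCE, -1)

print(f"a) predicted attendance z-score: {z_attendance:.3f}")
print(f"b) predicted wins z-score: {z_wins:.3f}")
```

Because |r| < 1, each prediction is closer to the mean (in standard deviations) than the value it was predicted from.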

Exercise 2: Find the LSRL given the summary statistics – Tale of 2 Regressions WKS

Coefficient of Determination

The coefficient of determination, also called R², is the square of the r-value (correlation). The R² value tells how much of the variation in the response variable is accounted for by the linear regression model. For example, if R² = 1, then 100% of the variability in the response variable is accounted for by the linear model. In other words, the relationship between the two variables is perfectly linear. If R² = 0.95, we can conclude that 95% of the variability in the response variable is accounted for by the linear relationship with the explanatory variable.
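
To see why squaring r gives the "percent of variation accounted for", the following Python sketch computes R² directly as 1 − (sum of squared residuals) / (total sum of squares about ȳ) and compares it with r². This is an illustration under stated assumptions: the data are made up, and the helper names are hypothetical.

```python
from math import sqrt
from statistics import mean

def fit_lsrl(xs, ys):
    """Least-squares intercept and slope (equivalent to b = r * s_y / s_x)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def r_squared_from_residuals(xs, ys):
    """Fraction of the variation in y accounted for by the line."""
    intercept, slope = fit_lsrl(xs, ys)
    y_bar = mean(ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_resid / ss_total

def correlation(xs, ys):
    """Sample correlation r."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r2 = r_squared_from_residuals(xs, ys)
print(f"r^2 = {r ** 2:.3f}, 1 - SSR/SST = {r2:.3f}")
```

The two numbers agree (0.6 for this data), so about 60% of the variation in the made-up y-values is accounted for by the linear relationship with x.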

**** SHOW PPT OF R² EXPLAINED ****

1. Given the following set of data, find the equation of the LSRL, then find and interpret both the correlation and the coefficient of determination.

[pic] [pic]

a. LSRL: fatal = -34.648 + 6.03 (year) (use meaningful variables in your equation rather than x and y, and use proper statistical notation!)

b. Correlation (r-value): 0.938. A correlation of 0.938 indicates that there is a strong, positive, linear relationship between year and number of fatalities.

c. Coefficient of determination (R²): 0.880. An R² value of 0.880 indicates that 88% of the variability in number of fatalities is accounted for by the linear relationship with the year.

**** NOTE: Go over the meaning of slope in the context of the problem. Also explain the formula for the slope (slope = r · s_y / s_x) by showing the Understanding r ppt. ****

2. A study of class attendance and grades earned among first-year students at a state university showed that in general students who attended a higher percent of their classes earned higher grades. Class attendance explained 16% of the variation in grades among the students. What is the numerical correlation between percent of classes attended and grades earned? 0.4

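* Since R² = 0.16, r = ±0.4. Check whether r > 0 or r < 0: students who attended more classes earned higher grades, so the association is positive and r = +0.4.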
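
The step from R² back to r can be sketched in Python as follows. The function name and its boolean parameter are hypothetical; the point is only that the square root gives the magnitude and the direction of the association supplies the sign.

```python
import math

def correlation_from_r_squared(r_squared, association_is_positive):
    """Recover r from R^2; the sign comes from the direction of the association."""
    magnitude = math.sqrt(r_squared)
    return magnitude if association_is_positive else -magnitude

# Question 2: attendance explains 16% of the variation in grades,
# and higher attendance goes with higher grades (positive association).
r = correlation_from_r_squared(0.16, association_is_positive=True)
print(f"r = {r:.1f}")
```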

Residual Plots

A residual is the difference between the observed y-value and the predicted y-value for a given x-value.

residual = y − ŷ

The sum of the squares of the residuals (SSR) is used to determine the Least-Squares Regression Line for a given set of data.
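
To make "the LSRL minimizes the SSR" concrete, the Python sketch below computes the SSR for the least-squares line of a small made-up data set and for two nearby lines. The data, the helper name, and the comparison lines are all assumptions chosen for illustration.

```python
def sum_squared_residuals(intercept, slope, xs, ys):
    """SSR: sum of (observed y - predicted y)^2 over all data points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# The least-squares line for this data is: predicted y = 2.2 + 0.6 x.
ssr_lsrl = sum_squared_residuals(2.2, 0.6, xs, ys)

# Two nearby lines, one with a steeper slope and one with a lower intercept.
ssr_steeper = sum_squared_residuals(2.2, 0.7, xs, ys)
ssr_lower = sum_squared_residuals(2.0, 0.6, xs, ys)

print(f"LSRL:            SSR = {ssr_lsrl:.2f}")
print(f"steeper slope:   SSR = {ssr_steeper:.2f}")
print(f"lower intercept: SSR = {ssr_lower:.2f}")
```

Any change to the slope or intercept of the LSRL makes the SSR larger, which is exactly what "least squares" means.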

A residual plot is a scatterplot which graphs the residuals on the vertical axis and the values of the explanatory variable on the horizontal axis for each data point, (x, y − ŷ).

The residual plot gives a visual representation of the amount of error in the model. The closer the residuals are to zero, the smaller the error and the more accurate the model.

The LSRL is a good model if the residual plot shows random scatter relatively close to the horizontal axis (zero). The horizontal axis represents the LSRL.

Points in the residual plot that lie directly on the horizontal axis lie directly on the LSRL.

Points in the residual plot that lie above the horizontal axis correspond to data points that lie above the LSRL, so the model gives an underestimate at those points. Therefore, positive residuals represent underestimates.

Points in the residual plot that lie below the horizontal axis correspond to data points that lie below the LSRL, so the model gives an overestimate at those points. Therefore, negative residuals represent overestimates.
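
The sign rule can be checked with a short Python sketch that computes residual = observed − predicted for each point and labels the result. The data and the line are made up for illustration (the line is the least-squares fit for this data), and the function name is hypothetical.

```python
def classify_residuals(intercept, slope, xs, ys):
    """Return (x, residual, label) for each point; residual = observed - predicted."""
    results = []
    for x, y in zip(xs, ys):
        residual = y - (intercept + slope * x)
        if residual > 0:
            label = "underestimate"   # point above the line
        elif residual < 0:
            label = "overestimate"    # point below the line
        else:
            label = "exact"           # point on the line
        results.append((x, residual, label))
    return results

# Made-up data and its least-squares line: predicted y = 2.2 + 0.6 x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

rows = classify_residuals(2.2, 0.6, xs, ys)
for x, residual, label in rows:
    print(f"x = {x}: residual = {residual:+.1f} ({label})")
```

The (x, residual) pairs produced here are exactly the points that would be graphed in the residual plot.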

The LSRL is not a good model if the residual plot shows a pattern.
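
What "a pattern" looks like can be illustrated by fitting a line to data that is actually curved. The Python sketch below uses made-up points on the curve y = x² (an assumption chosen for illustration) and prints the sign of each residual in order of x.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Least-squares intercept and slope."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Made-up curved data: y = x^2, which a straight line cannot describe well.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

intercept, slope = fit_lsrl(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
signs = ["+" if res > 0 else "-" if res < 0 else "0" for res in residuals]

print("residuals:", residuals)
print("signs:    ", " ".join(signs))
```

The residuals are positive at both ends and negative in the middle, a U-shaped pattern rather than random scatter, which signals that a linear model is not appropriate for this data.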

3. Construct a well-labeled residual plot using the data on jet ski fatalities from #1. What can you conclude about the appropriateness of the linear model based on the residual plot?

[Figures: Jet Ski Fatalities (1987-1996)]

Since the residual plot does not show any distinct pattern, the linear model is appropriate for the original set of data. That is, number of fatalities can be predicted based on the year using the following linear equation:

fatal = -34.648 + 6.03 (year)
