Stat 112 Review Notes for Chapter 3, Lecture Notes 1-5

1. Simple Linear Regression Model: The simple linear regression model for the mean of $Y$ given $X$ is

$E(Y \mid X) = \beta_0 + \beta_1 X$   (1.1)

where $\beta_1$ = slope = change in the mean of $Y$ for each one-unit change in $X$;

$\beta_0$ = intercept = mean of $Y$ given $X = 0$. The disturbance $e_i$ for the simple linear regression model is the difference between the actual $Y_i$ and the mean of $Y$ given $X$ for observation $i$: $e_i = Y_i - (\beta_0 + \beta_1 X_i)$. In addition to (1.1), the simple linear regression model makes the following assumptions about the disturbances $e_i$ (a short simulation sketch illustrating these assumptions follows the list):

(i) Linearity assumption: $E(e_i \mid X_i) = 0$. This implies that the linear model (1.1) for the mean of $Y$ given $X$ is the correct model for the mean.

(ii) Constant variance assumption: The disturbances $e_i$ are assumed to all have the same variance $\sigma_e^2$.

(iii) Normality assumption: The disturbances $e_i$ are assumed to have a normal distribution.

(iv) Independence assumption: The disturbances $e_i$ are assumed to be independent.
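To make assumptions (i)-(iv) concrete, here is a minimal Python sketch (not from the notes; the course uses JMP, and all numbers here are hypothetical) that simulates data from model (1.1) with disturbances satisfying the four assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1, sigma_e = 2.0, 0.5, 1.0      # hypothetical true intercept, slope, disturbance SD
    x = rng.uniform(0, 10, size=100)           # explanatory variable values
    e = rng.normal(0, sigma_e, size=100)       # disturbances: normal, independent, constant variance (ii)-(iv)
    y = beta0 + beta1 * x + e                  # mean beta0 + beta1*x plus disturbance, so linearity (i) holds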

2. Least Squares Estimates of the Simple Linear Regression Model: Based on a sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, we estimate the slope and intercept by the least squares principle --

we minimize the sum of squared prediction errors in the data, $\sum_{i=1}^{n} \left( Y_i - (b_0 + b_1 X_i) \right)^2$. The least squares estimates of the slope and intercept are the values $b_0$ and $b_1$ that minimize the sum of squared prediction errors (a short numerical sketch follows the list of properties below). Some properties of the least squares estimates are:

(i) Unbiased estimators: The means of the sampling distributions of $b_0$ and $b_1$ are $\beta_0$ and $\beta_1$ respectively.

(ii) Consistent estimators: As the sample size $n$ increases, the probability that $b_0$ and $b_1$ will come close to $\beta_0$ and $\beta_1$ respectively converges to 1.

(iii) Minimum variance estimators: The least squares estimators are the best possible estimators of $\beta_0$ and $\beta_1$ in the sense of having the smallest variance among unbiased estimators.
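A minimal numerical sketch of the least squares estimates (hypothetical data; it uses the standard closed-form solutions $b_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and $b_0 = \bar{Y} - b_1 \bar{X}$ to the minimization above):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical explanatory values
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])    # hypothetical responses

    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)   # least squares slope
    b0 = ybar - b1 * xbar                                            # least squares intercept
    print(b0, b1)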

3. Residuals: The disturbance $e_i$ is the difference between the actual $Y_i$ and the mean of $Y$ given $X_i$: $e_i = Y_i - (\beta_0 + \beta_1 X_i)$. The residual $\hat{e}_i$ is an estimate of the disturbance: $\hat{e}_i = Y_i - (b_0 + b_1 X_i)$.

4. Using the Residuals to Check the Assumptions of the Simple Linear Regression Model: The residual plot is a scatterplot of the $(X_i, \hat{e}_i)$ pairs, i.e., a plot of the $X$ variable versus the residuals. To check the linearity assumption, we check whether the mean of the residuals is approximately zero over each part of the range of $X$. To check the constant variance assumption, we check whether the spread of the residuals remains constant as $X$ varies. To check the normality assumption, we check whether the histogram of the residuals is approximately bell shaped. For now, we will not consider the independence assumption; we will consider it in Section 6.
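A minimal sketch of computing residuals and drawing the residual plot in Python (hypothetical data; in the course the plot would come from JMP):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical data
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])
    b1, b0 = np.polyfit(x, y, 1)               # least squares slope and intercept
    resid = y - (b0 + b1 * x)                  # residuals: actual Y minus fitted value

    plt.scatter(x, resid)                      # residual plot: X versus residuals
    plt.axhline(0, linestyle="--")             # residuals should center on zero across the range of X
    plt.xlabel("X")
    plt.ylabel("Residual")
    plt.show()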

5. Root Mean Square Error: The root mean square error (RMSE) is approximately the average absolute error that is made when using the least squares line $\hat{Y} = b_0 + b_1 X$ to predict $Y$. The RMSE is denoted by $s_e$ in the textbook.
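A minimal sketch of the RMSE computation, assuming the usual simple-regression convention of dividing the sum of squared residuals by $n - 2$ (hypothetical data):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical data
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)
    rmse = np.sqrt(np.sum(resid ** 2) / (len(x) - 2))   # divide by n - 2 for simple regression
    print(rmse)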

6. Confidence Interval for the Slope: The confidence interval for the slope is a range of plausible values for the true slope $\beta_1$ based on the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$. The 95% confidence interval for the slope is $b_1 \pm t_{.975, n-2} \cdot SE(b_1)$, where $SE(b_1)$ is the standard error of the slope, $SE(b_1) = s_e / \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}$. When $n$ is large, the 95% confidence interval for the slope is approximately $b_1 \pm 2 \cdot SE(b_1)$.
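A minimal sketch of the 95% confidence interval for the slope using the formula above (hypothetical data; scipy's t quantile stands in for a t table):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical data
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)
    s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # RMSE
    se_b1 = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))          # standard error of the slope
    tcrit = stats.t.ppf(0.975, df=n - 2)                        # t_{.975, n-2}
    print(b1 - tcrit * se_b1, b1 + tcrit * se_b1)               # 95% confidence interval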

7. Hypothesis Testing for the Slope: To test hypotheses for the slope, we use the t-statistic $t = \dfrac{b_1 - \beta_1^0}{SE(b_1)}$, where $\beta_1^0$ is the hypothesized value of the slope in the null hypotheses detailed below (a short rejection-rule sketch follows the list).

(i) Two-sided test: $H_0: \beta_1 = \beta_1^0$ vs. $H_a: \beta_1 \neq \beta_1^0$. We reject $H_0$ if $t \geq t_{.975, n-2}$ or $t \leq -t_{.975, n-2}$.

(ii) One-sided test I: $H_0: \beta_1 \geq \beta_1^0$ vs. $H_a: \beta_1 < \beta_1^0$. We reject $H_0$ if $t \leq -t_{.95, n-2}$.

(iii) One-sided test II: $H_0: \beta_1 \leq \beta_1^0$ vs. $H_a: \beta_1 > \beta_1^0$. We reject $H_0$ if $t \geq t_{.95, n-2}$.
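A minimal sketch of the t-statistic and the two-sided rejection rule (the slope estimate, standard error, and sample size are hypothetical numbers):

    from scipy import stats

    b1, se_b1, n = 0.78, 0.12, 30      # hypothetical slope estimate, standard error, sample size
    beta1_0 = 0.0                      # hypothesized value of the slope under the null
    t_stat = (b1 - beta1_0) / se_b1    # t-statistic from Section 7
    tcrit = stats.t.ppf(0.975, df=n - 2)
    print(t_stat, tcrit, abs(t_stat) >= tcrit)   # reject H0 in the two-sided test if True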

When $\beta_1^0 = 0$, we can calculate the p-values for these tests using JMP as follows (a small helper sketch follows the list):

(i) Two-sided test: the p-value is Prob>|t|

(ii) One-sided test I: If the t-statistic is negative (i.e., the sign of the t-statistic is in favor of the alternative hypothesis), the p-value is (Prob>|t|)/2. If the t-statistic is positive (i.e., the sign of the t-statistic is in favor of the null hypothesis), the p-value is 1 - (Prob>|t|)/2.

(iii) One-sided test II: If the t-statistic is positive (i.e., the sign of the t-statistic is in favor of the alternative hypothesis), the p-value is (Prob>|t|)/2. If the t-statistic is negative (i.e., the sign of the t-statistic is in favor of the null hypothesis), the p-value is 1 - (Prob>|t|)/2.
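A minimal helper sketch for converting JMP's Prob>|t| into the one-sided p-values described in (ii) and (iii); the function name and the numbers are hypothetical:

    def one_sided_p(prob_gt_abs_t, t_stat, alternative):
        # alternative is "less" (one-sided test I) or "greater" (one-sided test II)
        sign_favors_alt = (t_stat < 0) if alternative == "less" else (t_stat > 0)
        return prob_gt_abs_t / 2 if sign_favors_alt else 1 - prob_gt_abs_t / 2

    print(one_sided_p(0.04, 2.3, "greater"))   # 0.02: sign of t favors the alternative
    print(one_sided_p(0.04, 2.3, "less"))      # 0.98: sign of t favors the null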

8. R Squared: The R squared statistic measures how much of the variability in the response the regression model explains. R squared ranges from 0 to 1, with higher R squared values meaning that the regression model is explaining more of the variability in the response.

$R^2 = \dfrac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}} = 1 - \dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
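A minimal sketch of computing R squared as 1 - (residual sum of squares)/(total sum of squares), with hypothetical data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical data
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])
    b1, b0 = np.polyfit(x, y, 1)
    yhat = b0 + b1 * x
    sse = np.sum((y - yhat) ** 2)              # residual (error) sum of squares
    tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
    print(1 - sse / tss)                       # R squared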

9. Prediction Intervals: The best prediction for the $Y$ of a new observation with $X = X_{new}$ is the estimated mean of $Y$ given $X_{new}$: $\hat{Y}_{new} = b_0 + b_1 X_{new}$.

The 95% prediction interval for the $Y$ of a new observation with $X = X_{new}$ is an interval that will contain the value of $Y_{new}$ approximately 95% of the time. The formula for the prediction interval is:

$\hat{Y}_{new} \pm t_{.975, n-2} \cdot s_{pred}$, where

$\hat{Y}_{new} = b_0 + b_1 X_{new}$;

$s_{pred} = s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(X_{new} - \bar{X})^2}{(n-1) s_X^2}}$;

$s_X^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$.

When n is large (say n>30), the 95% prediction interval is approximately equal to

$\hat{Y}_{new} \pm 2 \cdot s_e$ (the predicted value plus or minus two RMSEs).
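A minimal sketch of the exact 95% prediction interval formula above (hypothetical data; for large n the simpler approximation $\hat{Y}_{new} \pm 2 \cdot s_e$ gives nearly the same answer):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical data
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)
    s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # RMSE

    x_new = 3.5                                                  # hypothetical new X value
    yhat_new = b0 + b1 * x_new                                   # best prediction for the new Y
    s_pred = s_e * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1)))
    tcrit = stats.t.ppf(0.975, df=n - 2)
    print(yhat_new - tcrit * s_pred, yhat_new + tcrit * s_pred)  # 95% prediction interval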

10. Cautions in Interpreting Regression Results:

(i) The regression of $Y$ on $X$ measures the association between $Y$ and $X$. A strong association between $Y$ and $X$ does not necessarily mean that changes in $X$ cause changes in $Y$. A strong association between $Y$ and $X$ could be explained by $Y$ causing changes in $X$ or by there being a lurking variable that is related to both $X$ and $Y$.

(ii) The regression model cannot be relied on to make accurate predictions of $Y$ for values of $X$ that are outside the range of the observed $X$'s, $[X_{min}, X_{max}]$. The prediction intervals for $Y$ at values of $X$ outside the range of the observed $X$'s are also not reliable. Trying to use the regression model to predict $Y$ for an $X$ outside the range of the observed $X$'s is called extrapolation.
