Statistics 203: Introduction to Regression and Analysis of Variance

Multiple Linear Regression: Diagnostics

Jonathan Taylor

- p. 1/16

Today

- Today
- Spline models
- What are the assumptions?
- Problems in the regression function
- Partial residual plot
- Added-variable plot
- Problems with the errors
- Outliers & Influence
- Dropping an observation
- Different residuals
- Crude outlier detection test
- Bonferroni correction for multiple comparisons
- DFFITS
- Cook's distance
- DFBETAS

- Splines + other bases.
- Diagnostics


Spline models


- Splines are piecewise polynomial functions: on each interval between "knots" (t_i, t_{i+1}) the spline f(x) is a polynomial, but the coefficients change from interval to interval.

- Example: cubic spline with knots at t_1 < t_2 < \cdots < t_h:

  f(x) = \sum_{j=0}^{3} \beta_j x^j + \sum_{i=1}^{h} \theta_i (x - t_i)_+^3

  where (x - t_i)_+ = x - t_i if x - t_i \ge 0, and 0 otherwise.

- Here is an example.
- Conditioning problem again: B-splines are used to keep the model subspace the same while making the design less ill-conditioned.
- Other bases one might use: Fourier (sine and cosine waves); wavelets (space/time-localized bases for functions).
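The truncated-power construction above can be sketched in numpy. This is a minimal illustration, not the course's code: the function name, the interval [0, 1], and the knot locations are all made up for the example.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis for a cubic spline:
    columns 1, x, x^2, x^3, then (x - t_i)^3_+ for each knot t_i."""
    x = np.asarray(x, dtype=float)
    cols = [x**j for j in range(4)]                        # global cubic part
    cols += [np.clip(x - t, 0.0, None)**3 for t in knots]  # one extra piece per knot
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 100)
X = cubic_spline_basis(x, knots=[0.25, 0.5, 0.75])
# X is 100 x 7: four polynomial columns plus one truncated-power column per knot.
```

As the slide notes, this basis can be badly conditioned; in practice one fits the same model subspace with a B-spline basis (e.g. scipy.interpolate) instead.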


What are the assumptions?


- What is the full model for a given design matrix X?

  Y_i = \beta_0 + \beta_1 X_{i,1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i

- Errors \varepsilon \sim N(0, \sigma^2 I).
- What can go wrong?
  - The regression function can be wrong: missing predictors, nonlinearity.
  - The assumptions about the errors can be wrong.
  - Outliers & influential observations, both in the predictors and in the observations.
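The full model can be fit by least squares in a few lines of numpy; the design, coefficients, and noise scale below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
# Design matrix with an intercept column plus p - 1 predictors (simulated)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.5, -0.5, 2.0])
y = X @ beta + rng.normal(scale=1.0, size=n)   # errors ~ N(0, sigma^2 I)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)           # unbiased estimate of sigma^2
```

The diagnostics on the following slides are all computed from beta_hat and these residuals.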


Problems in the regression function


- The true regression function may have higher-order nonlinear terms, e.g. X_1^2, or even interactions X_1 \cdot X_2.
- How to fix this? It is difficult in general; we will look at two plots: "added-variable" plots and "partial residual" plots.


Partial residual plot


- For 1 \le j \le p - 1 let

  e_{ij} = e_i + \hat{\beta}_j X_{ij}.

- Can help to determine whether the variance depends on X_j, and to spot outliers.
- If there is a nonlinear trend, it is evidence that a linear term is not sufficient.
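The partial residuals e_{ij} = e_i + \hat{\beta}_j X_{ij} are cheap to compute from an ordinary fit. A minimal numpy sketch on simulated data (the design and the choice j = 1 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Intercept plus two simulated predictors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat              # ordinary residuals e_i

j = 1                                 # predictor of interest
partial_resid = resid + beta_hat[j] * X[:, j]
# Plot partial_resid against X[:, j]: because the residuals are orthogonal
# to X_j, a simple regression of partial_resid on X_j has slope exactly
# beta_hat[j]; curvature or a funnel shape flags nonlinearity or
# non-constant variance.
```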


Added-variable plot


- For 1 \le j \le p - 1, let H_{(j)} be the hat matrix of the design with the j-th predictor deleted. Plot

  (I - H_{(j)}) Y \quad \text{vs.} \quad (I - H_{(j)}) X_j.

- The plot should be linear with slope \hat{\beta}_j. Why? Writing X_{(j)} and \beta_{(j)} for the design and coefficients with the j-th predictor deleted:

  Y = X_{(j)} \beta_{(j)} + \beta_j X_j + \varepsilon
  (I - H_{(j)}) Y = (I - H_{(j)}) X_{(j)} \beta_{(j)} + \beta_j (I - H_{(j)}) X_j + (I - H_{(j)}) \varepsilon
  (I - H_{(j)}) Y = \beta_j (I - H_{(j)}) X_j + (I - H_{(j)}) \varepsilon,

  since (I - H_{(j)}) X_{(j)} = 0: the columns of X_{(j)} lie in the space that H_{(j)} projects onto.

- Also can be helpful for detecting outliers.
- If there is a nonlinear trend, it is evidence that a linear term is not sufficient.
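This identity (an instance of the Frisch–Waugh–Lovell theorem) can be checked numerically. In the sketch below the data are simulated and j = 2 is an arbitrary choice; the slope of the added-variable plot reproduces the full-model coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=n)

j = 2
X_mj = np.delete(X, j, axis=1)         # design with predictor j deleted
H_mj = X_mj @ np.linalg.pinv(X_mj)     # hat matrix H_(j)

ry = y - H_mj @ y                      # (I - H_(j)) Y
rx = X[:, j] - H_mj @ X[:, j]          # (I - H_(j)) X_j

av_slope = (rx @ ry) / (rx @ rx)       # slope through the origin of the AV plot
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# av_slope equals the full-model coefficient beta_hat[j].
```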


Problems with the errors


- Errors may not be normally distributed. We will look at a QQ-plot for a graphical check. This may not affect inference in large samples.
- Variance may not be constant. Transformations can sometimes help correct this. Non-constant variance affects our estimates of SE(\hat{\beta}), which can change t and F statistics substantially!
- Graphical checks for non-constant variance: added-variable plots, partial residual plots, fitted vs. residual plots.
- Errors may not be independent. This can seriously affect our estimates of SE(\hat{\beta}).
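The QQ-plot check can be sketched with the standard library's normal quantile function; the plotting positions (i - 1/2)/n are one common convention, and the residuals here are simulated stand-ins.

```python
import numpy as np
from statistics import NormalDist

def qq_points(resid):
    """Coordinates for a normal QQ-plot of residuals: theoretical
    normal quantiles paired with the sorted sample values."""
    r = np.sort(np.asarray(resid, dtype=float))
    n = len(r)
    probs = (np.arange(1, n + 1) - 0.5) / n            # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, r                                     # plot r against theo

rng = np.random.default_rng(2)
theo, samp = qq_points(rng.normal(size=500))
# For normal errors the points fall close to a straight line;
# heavy tails bend the two ends away from it.
```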

