Stat 521: Notes 2

Reading: Wooldridge, Chapter 4.2.3, 12.8.2

I. One more example of the Delta Method:

Consider again the regression model

$$\log(wage_i) = \beta_0 + \beta_1 educ_i + \varepsilon_i$$

for the NLS data, where we assume Assumption 1 from Notes 1: $\varepsilon_i$ independent of education and $\varepsilon_i \sim N(0, \sigma^2)$.

Suppose we are interested in the average effect on the level of earnings of increasing education levels by one year. That is, for each individual we consider the effect of increasing his or her education level by one year from whatever level it currently is, and then we average these effects over all individuals. In terms of the parameters of the linear regression model $(\beta_0, \beta_1, \sigma^2)$, the parameter of interest is now a much messier function:

$$\theta = \frac{1}{n}\sum_{i=1}^{n}\left[\exp\left(\beta_0 + \beta_1(educ_i + 1) + \frac{\sigma^2}{2}\right) - \exp\left(\beta_0 + \beta_1 educ_i + \frac{\sigma^2}{2}\right)\right],$$

where the $\sigma^2/2$ terms arise because $E[\exp(\varepsilon_i)] = \exp(\sigma^2/2)$ when $\varepsilon_i \sim N(0, \sigma^2)$.

Substituting estimated values for the parameters leads to:

$$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}\left[\exp\left(\hat{\beta}_0 + \hat{\beta}_1(educ_i + 1) + \frac{\hat{\sigma}^2}{2}\right) - \exp\left(\hat{\beta}_0 + \hat{\beta}_1 educ_i + \frac{\hat{\sigma}^2}{2}\right)\right]$$

model2=lm(logwage~educ);

summary(model2);

Call:
lm(formula = logwage ~ educ)

Residuals:
     Min       1Q   Median       3Q      Max
-1.79050 -0.25453  0.01991  0.27309  1.68779

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.040344   0.085044   59.27   <2e-16 ***
...

> lci.theta
         [,1]
[1,] 23.59284
> uci.theta
         [,1]
[1,] 34.95973

Thus, a 95% confidence interval for the average effect on the level of weekly earnings of increasing education levels by one year is ($23.59, $34.96).
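Where do lci.theta and uci.theta come from? The following is a minimal sketch of a delta-method computation that produces such an interval. The function g, the numerical-gradient step, and the approximation Var(sigma2.hat) ≈ 2σ⁴/n under normality are assumptions of this sketch, not necessarily the exact code used in class:

model2=lm(logwage~educ);
b=coef(model2);                      # (beta0.hat, beta1.hat)
sigma2.hat=summary(model2)$sigma^2;  # estimate of sigma^2
n=length(educ);

# theta as a function of the parameters p = (beta0, beta1, sigma^2)
g=function(p){
  mean(exp(p[1]+p[2]*(educ+1)+p[3]/2)-exp(p[1]+p[2]*educ+p[3]/2));
}
p.hat=c(b,sigma2.hat);
theta.hat=g(p.hat);

# Covariance matrix of (beta0.hat, beta1.hat, sigma2.hat): the coefficient
# block comes from vcov(); Var(sigma2.hat) is approximately 2*sigma^4/n
# under normality (an assumption of this sketch)
V=matrix(0,3,3);
V[1:2,1:2]=vcov(model2);
V[3,3]=2*sigma2.hat^2/n;

# Numerical (forward-difference) gradient of g at p.hat
eps=1e-6;
grad=sapply(1:3,function(j){p.j=p.hat;p.j[j]=p.j[j]+eps;(g(p.j)-theta.hat)/eps});

# Delta method: Var(theta.hat) is approximately grad' V grad
se.theta=sqrt(t(grad)%*%V%*%grad);
lci.theta=theta.hat-1.96*se.theta;
uci.theta=theta.hat+1.96*se.theta;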

II. Checking the assumptions of the regression model

Consider the regression model under Assumption 1 from Notes 1:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_K X_{iK} + \varepsilon_i,$$

$i = 1, \ldots, n$.

Assumption 1: $\varepsilon_i$ independent of $X_i = (X_{i1}, \ldots, X_{iK})$ and distributed normally with mean zero and variance $\sigma^2$.

Assumption 1 can be broken down into three key assumptions:

(1) Mean model is correct: $E(\varepsilon_i \mid X_i) = 0$, which ensures that $E(Y_i \mid X_i) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_K X_{iK}$.

(2) Constant variance: The variance of the $\varepsilon_i$'s is the same for all observations and does not depend on $X_i$.

(3) Normality: The $\varepsilon_i$'s are normally distributed.

[There is a fourth key assumption, that the $\varepsilon_i$'s are independent of each other, which we will discuss later.]

The regression model with Assumption 1 holding, i.e., (1)-(3) holding, is called the normal linear regression model.

To check the reasonableness of (1), we construct plots of the residuals $e_i = Y_i - \hat{Y}_i$ versus each covariate in $X_i$ and versus the fitted values $\hat{Y}_i$. If the mean model is correct, the mean of the residuals should be close to zero over each range of $X_i$ and $\hat{Y}_i$. If the mean model is not correct, we can consider transforming covariate(s) or adding polynomials in covariate(s).

To check the reasonableness of (2), we look in the residual plots at whether the spread of the residuals is approximately constant over each range of $X_i$ and $\hat{Y}_i$.

To check the reasonableness of (3), we look at normal quantile-quantile plots of the residuals.

Importance of assumptions: (1) is the most important. If the mean model is not correct, then our inferences about how the covariates affect the mean of Y may be biased. If (1) is satisfied but not (2), the estimates of the mean of Y given the covariates are unbiased, but the standard method of obtaining standard errors and confidence intervals we have been studying will not be valid. We will study methods of inference that are valid even if constant variance (2) does not hold. (3) is the least important in that if we have a sample of size at least 30, the standard method of inference produces approximately correct confidence intervals for the coefficients as long as (1) and (2) hold. However, prediction intervals depend on (3) being valid.
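As a concrete illustration of these checks, here is a minimal sketch on simulated data (the data and all object names below are made up for illustration; the NLS example that follows applies the same ideas to model1):

# Simulated data satisfying (1)-(3), for illustration only
set.seed(1);
x1=rnorm(200);
x2=rnorm(200);
y=1+2*x1-x2+rnorm(200);
fit=lm(y~x1+x2);
e=resid(fit);

# Check (1): residuals vs. a covariate and vs. fitted values;
# the points should scatter evenly around the zero line
plot(x1,e);
abline(0,0);
plot(fitted(fit),e);
abline(0,0);

# Check (2): the vertical spread in the plots above should be roughly
# constant across the range of the x-axis

# Check (3): normal quantile-quantile plot; points falling near the
# reference line support normality
qqnorm(e);
qqline(e);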

Let’s consider again the NLS data.

In Notes 1, we fit the model

$$\log(wage_i) = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 exper_i^2 + \varepsilon_i$$

Is this a correct mean model? Does constant variance hold? Does normality hold?

Checking mean model

model1=lm(logwage~educ+exper+expersq);

# Residual plots

resid.model1=resid(model1);

# Plot residuals vs. experience

plot(exper,resid.model1);

# Draw horizontal line at 0

abline(0,0);

[Figure: residuals from model1 plotted against experience, with a horizontal line at zero]

It is rather hard to tell if there is a pattern in the mean of the residuals.

To better understand the pattern in the mean of the residuals, we will introduce a very useful tool for several purposes: scatterplot smoothers.

III. Scatterplot Smoothers

Scatterplot smoothers are a descriptive tool for estimating the mean of Y given a single covariate X. As an example, the data set prestige.txt contains, from the 1971 Canadian Census, the average income (X) and prestige score (Y) for 102 occupations. We would like to fit a model to predict prestige scores based on average income.

Prestige=read.table("prestige.txt",header=TRUE);

attach(Prestige);

plot(income,prestige)

We would like to approximate E(Y|X). Here is one idea. Suppose we are interested in E(Y|X=8403), where 8403 is the 23rd largest income. We can take the observations whose X's are "close" to 8403 and average their Y's to approximate E(Y|X=8403). It also might make sense to put more weight on observations whose X's are very close to 8403 (e.g., 8450) than on observations whose X's are only somewhat close to 8403 (e.g., 8800).

A common weight function to use is the tricube function:

$$W(z) = \begin{cases} (1 - |z|^3)^3 & \text{for } |z| < 1, \\ 0 & \text{for } |z| \geq 1, \end{cases}$$

where for observation i, $z_i = (x_i - x_0)/h$, where $x_0$ is the value of x we are interested in and h is the half-width of the window of observations that will be averaged over. Over those observations in the window, a weighted local polynomial regression is carried out:

$$y_i = a + b_1(x_i - x_0) + b_2(x_i - x_0)^2 + \cdots + b_p(x_i - x_0)^p + e_i,$$

where $(a, b_1, \ldots, b_p)$ is chosen to minimize the weighted sum of squares over the window:

$$\sum_{i \in \text{window}} w_i \left[y_i - a - b_1(x_i - x_0) - \cdots - b_p(x_i - x_0)^p\right]^2,$$

where $w_i = W(z_i)$; the fitted value of the smoother at $x_0$ is then the estimated intercept, $\hat{y}(x_0) = \hat{a}$. It is typical to adjust h so that each local regression includes a fixed proportion s of the data; then s is called the span of the local regression.
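As a sketch of how this works in code, here is one tricube-weighted local linear fit at a single point (local.fit is a made-up name, not an R built-in; the standard tool, lowess, is used below):

# Illustrative sketch: tricube-weighted local linear fit at x0
local.fit=function(x,y,x0,span=0.5){
  n=length(x);
  d=abs(x-x0);
  h=sort(d)[ceiling(span*n)];          # half-width so window holds span*n points
  z=d/h;
  w=ifelse(z<1,(1-z^3)^3,0);           # tricube weights
  fit=lm(y~I(x-x0),weights=w);         # weighted local linear regression
  coef(fit)[1];                        # fitted value at x0 = estimated intercept
}

# Example: estimate E(prestige | income = 8403) with span 0.5
local.fit(income,prestige,8403);

The lowess function used below implements this same idea, with local linear fits plus some robustness iterations.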

The process of scatterplot smoothing is illustrated in Figure 1, focusing initially on estimating $E(Y|X=x_0)$ at $x_0 = 8403$ (the 23rd largest income), represented in Figure 1 by the vertical solid line:

▪ A window including the 50 nearest neighbors of $x_0$ (i.e., for span $s = 50/102 \approx 0.5$) is shown in Figure 1(a).

▪ The tricube weights for observations in this neighborhood appear in Figure 1(b).

▪ Figure 1(c) shows the locally weighted regression line fit to the data in the neighborhood of $x_0$; the fitted value $\hat{y}(x_0)$ is represented in this graph as a larger solid dot.

▪ Finally, in Figure 1(d), local regressions are estimated for a range of x-values and the fitted values are connected in a curve.

Figure 1(d) is produced by the following R commands:

plot(income,prestige,xlab="Average Income",ylab="Prestige");

lines(lowess(income,prestige,f=.5),lwd=2);

The f=.5 specifies the span. If we do not specify f, R uses the lowess default span of f = 2/3.

plot(income,prestige,xlab="Average Income",ylab="Prestige");

lines(lowess(income,prestige),lwd=2);

[Figure: scatterplot of prestige vs. average income with the lowess curve overlaid]

The estimate of E(Y|X) produced by a scatterplot smoother is called a nonparametric regression estimate of E(Y|X).

IV. Back to Checking Regression Model Assumptions

The scatterplot smoother can be applied to visualize the mean of the residuals over the range of a covariate.

model1=lm(logwage~educ+exper+expersq);

# Residual plots

resid.model1=resid(model1);

# Plot residuals vs. experience

plot(exper,resid.model1);

# Draw horizontal line at 0

abline(0,0);

# Add scatterplot smoother

lines(lowess(exper,resid.model1));

[Figure: residuals from model1 vs. experience, with the zero line and lowess smooth]

From this plot, we see evidence that the residuals have mean close to zero across the range of experience.

We make the same plot for the other covariate, education.

plot(educ,resid.model1);

abline(0,0);

lines(lowess(educ,resid.model1));

[Figure: residuals from model1 vs. education, with the zero line and lowess smooth]

The residuals also appear to have mean zero across the range of education.

The constant variance assumption appears reasonable from the residual plots, but it is somewhat hard to tell. We will discuss ways to obtain correct inferences even if the constant variance assumption does not hold in the next notes.

One additional way of examining the constant variance assumption is to apply the plot() command to the fitted model object, which produces a number of diagnostic plots, including a plot of the residuals vs. the fitted values and a scale-location plot of the square root of the absolute standardized residuals vs. the fitted values.

plot(model1)
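The individual diagnostic plots can also be requested one at a time through the which argument of plot for lm objects (this is standard R behavior, not something specific to these notes):

# Residuals vs. fitted values (helps check the mean model)
plot(model1,which=1);
# Normal quantile-quantile plot of standardized residuals (normality)
plot(model1,which=2);
# Scale-location plot: sqrt(|standardized residuals|) vs. fitted values
# (helps check constant variance)
plot(model1,which=3);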
