Plotting and saving residuals and tted values in regression

Plotting and saving residuals and fitted values in regression

Ken Butler November 1, 2016

1 The basics

Let's begin with our windmill data:

library(tidyverse)

## Loading tidyverse: ggplot2 ## Loading tidyverse: tibble ## Loading tidyverse: tidyr ## Loading tidyverse: readr ## Loading tidyverse: purrr ## Loading tidyverse: dplyr ## Conflicts with tidy packages ---------------------------------------------## filter(): dplyr, stats ## lag(): dplyr, stats

windmill=read.csv("windmill.csv",header=T) head(windmill,n=10)

## wind.velocity DC.output

## 1

5.00 1.582

## 2

6.00 1.822

## 3

3.40 1.057

## 4

2.70 0.500

## 5

10.00 2.236

## 6

9.70 2.386

## 7

9.55 2.294

## 8

3.05 0.558

## 9

8.15 2.166

## 10

6.20 1.866

There were two variables, DC.output and wind.velocity, and we were interested in how we could predict the first from the second.

A scatterplot reveals that the relationship is not a straight line:

1

ggplot(windmill,aes(x=wind.velocity,y=DC.output))+ geom_point()+geom_smooth(se=F)

2.0 1.5

q q q

q

q

q

q

qq q

q

q q

q q

q q

DC.output

q

q

q

q

1.0

q

q

0.5

q

q

4

6

8

10

wind.velocity

It seems to increase but at a decreasing rate.1 What happens if we fit a linear regression anyway?

DC.1=lm(DC.output~wind.velocity,data=windmill) summary(DC.1)

## ## Call: ## lm(formula = DC.output ~ wind.velocity, data = windmill) ## ## Residuals:

1Normally, geom_smooth adds a smooth trend and a grey "envelope" around it that is a confidence band, that is, a confidence interval for y for each x. To omit that, add se=F.

2

##

Min

1Q Median

3Q

Max

## -0.59869 -0.14099 0.06059 0.17262 0.32184

##

## Coefficients:

##

Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.13088 0.12599 1.039

0.31

## wind.velocity 0.24115 0.01905 12.659 7.55e-12 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 0.2361 on 23 degrees of freedom

## Multiple R-squared: 0.8745,Adjusted R-squared: 0.869

## F-statistic: 160.3 on 1 and 23 DF, p-value: 7.546e-12

We saw that the fit appeared to be good (R-squared 87%) and the slope was significantly nonzero: that is, there is a real trend. But how do we know that the straight is (as we suspect) inappropriate and that we can do better with the right kind of curve?

The key is to look at a plot of the residuals against something. Typically the starting point is residuals against fitted values. ggplot provides a nice mechanism to make this plot if that's all you want (that is, you don't want to save them for later):

ggplot(DC.1,aes(x=.fitted,y=.resid))+ geom_point()+geom_smooth(se=F)

3

0.2 0.0 -0.2

q q

q qq

q

q

q

q

q

q

q q

q

q

q

q

q

q

q

q q

q q

-0.4

.resid

-0.6

q

1.0

1.5

2.0

2.5

.fitted

The strategy is to plot the regression object, the thing that you ask for the summary of to look at intercepts, slopes, R-squared and such. This has, hiding in it, things called .fitted and .resid with dots on the front.

As for the actual residual plot: I added a smooth trend this time to show the clear curved pattern in the residuals. This indicates that the actual relationship is a curve, not a straight line.

The other thing we might want to do is to save fitted values or residuals. Here, I want to save the fitted values to plot later. It is easiest for later if I add them to the data frame:

windmill$fit.linear=fitted(DC.1) str(windmill)

## 'data.frame': 25 obs. of 3 variables: ## $ wind.velocity: num 5 6 3.4 2.7 10 9.7 9.55 3.05 8.15 6.2 ... ## $ DC.output : num 1.58 1.82 1.06 0.5 2.24 ... ## $ fit.linear : num 1.337 1.578 0.951 0.782 2.542 ...

4

and we see that they have indeed been added.

2 Fitting curves

We saw two approaches in class to fitting a curve. The first is a general one: fit not a linear model y = a+bx but a quadratic "parabola" model y = a+bx+cx2.2 That goes like this:

DC.2=lm(DC.output~wind.velocity+I(wind.velocity^2),data=windmill) summary(DC.2)

##

## Call:

## lm(formula = DC.output ~ wind.velocity + I(wind.velocity^2),

##

data = windmill)

##

## Residuals:

##

Min

1Q Median

3Q

Max

## -0.26347 -0.02537 0.01264 0.03908 0.19903

##

## Coefficients:

##

Estimate Std. Error t value Pr(>|t|)

## (Intercept)

-1.155898 0.174650 -6.618 1.18e-06 ***

## wind.velocity

0.722936 0.061425 11.769 5.77e-11 ***

## I(wind.velocity^2) -0.038121 0.004797 -7.947 6.59e-08 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 0.1227 on 22 degrees of freedom

## Multiple R-squared: 0.9676,Adjusted R-squared: 0.9646

## F-statistic: 328.3 on 2 and 22 DF, p-value: < 2.2e-16

R-squared is vastly improved (even beyond what we thought was good before), and the squared term is significantly nonzero: that is, it adds something useful to the model, so it should stay.

We should check the residual plot, via what ought to be by now a familiar technique:

ggplot(DC.2,aes(x=.fitted,y=.resid))+ geom_point()+geom_smooth(se=F)

2In class, I may have had a, b and c the other way around.

5

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download