Plotting and saving residuals and tted values in regression
Plotting and saving residuals and fitted values in regression
Ken Butler November 1, 2016
1 The basics
Let's begin with our windmill data:
library(tidyverse)
## Loading tidyverse: ggplot2 ## Loading tidyverse: tibble ## Loading tidyverse: tidyr ## Loading tidyverse: readr ## Loading tidyverse: purrr ## Loading tidyverse: dplyr ## Conflicts with tidy packages ---------------------------------------------## filter(): dplyr, stats ## lag(): dplyr, stats
windmill=read.csv("windmill.csv",header=T) head(windmill,n=10)
## wind.velocity DC.output
## 1
5.00 1.582
## 2
6.00 1.822
## 3
3.40 1.057
## 4
2.70 0.500
## 5
10.00 2.236
## 6
9.70 2.386
## 7
9.55 2.294
## 8
3.05 0.558
## 9
8.15 2.166
## 10
6.20 1.866
There were two variables, DC.output and wind.velocity, and we were interested in how we could predict the first from the second.
A scatterplot reveals that the relationship is not a straight line:
1
ggplot(windmill,aes(x=wind.velocity,y=DC.output))+ geom_point()+geom_smooth(se=F)
2.0 1.5
q q q
q
q
q
q
qq q
q
q q
q q
q q
DC.output
q
q
q
q
1.0
q
q
0.5
q
q
4
6
8
10
wind.velocity
It seems to increase but at a decreasing rate.1 What happens if we fit a linear regression anyway?
DC.1=lm(DC.output~wind.velocity,data=windmill) summary(DC.1)
## ## Call: ## lm(formula = DC.output ~ wind.velocity, data = windmill) ## ## Residuals:
1Normally, geom_smooth adds a smooth trend and a grey "envelope" around it that is a confidence band, that is, a confidence interval for y for each x. To omit that, add se=F.
2
##
Min
1Q Median
3Q
Max
## -0.59869 -0.14099 0.06059 0.17262 0.32184
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13088 0.12599 1.039
0.31
## wind.velocity 0.24115 0.01905 12.659 7.55e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared: 0.8745,Adjusted R-squared: 0.869
## F-statistic: 160.3 on 1 and 23 DF, p-value: 7.546e-12
We saw that the fit appeared to be good (R-squared 87%) and the slope was significantly nonzero: that is, there is a real trend. But how do we know that the straight is (as we suspect) inappropriate and that we can do better with the right kind of curve?
The key is to look at a plot of the residuals against something. Typically the starting point is residuals against fitted values. ggplot provides a nice mechanism to make this plot if that's all you want (that is, you don't want to save them for later):
ggplot(DC.1,aes(x=.fitted,y=.resid))+ geom_point()+geom_smooth(se=F)
3
0.2 0.0 -0.2
q q
q qq
q
q
q
q
q
q
q q
q
q
q
q
q
q
q
q q
q q
-0.4
.resid
-0.6
q
1.0
1.5
2.0
2.5
.fitted
The strategy is to plot the regression object, the thing that you ask for the summary of to look at intercepts, slopes, R-squared and such. This has, hiding in it, things called .fitted and .resid with dots on the front.
As for the actual residual plot: I added a smooth trend this time to show the clear curved pattern in the residuals. This indicates that the actual relationship is a curve, not a straight line.
The other thing we might want to do is to save fitted values or residuals. Here, I want to save the fitted values to plot later. It is easiest for later if I add them to the data frame:
windmill$fit.linear=fitted(DC.1) str(windmill)
## 'data.frame': 25 obs. of 3 variables: ## $ wind.velocity: num 5 6 3.4 2.7 10 9.7 9.55 3.05 8.15 6.2 ... ## $ DC.output : num 1.58 1.82 1.06 0.5 2.24 ... ## $ fit.linear : num 1.337 1.578 0.951 0.782 2.542 ...
4
and we see that they have indeed been added.
2 Fitting curves
We saw two approaches in class to fitting a curve. The first is a general one: fit not a linear model y = a+bx but a quadratic "parabola" model y = a+bx+cx2.2 That goes like this:
DC.2=lm(DC.output~wind.velocity+I(wind.velocity^2),data=windmill) summary(DC.2)
##
## Call:
## lm(formula = DC.output ~ wind.velocity + I(wind.velocity^2),
##
data = windmill)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -0.26347 -0.02537 0.01264 0.03908 0.19903
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
-1.155898 0.174650 -6.618 1.18e-06 ***
## wind.velocity
0.722936 0.061425 11.769 5.77e-11 ***
## I(wind.velocity^2) -0.038121 0.004797 -7.947 6.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1227 on 22 degrees of freedom
## Multiple R-squared: 0.9676,Adjusted R-squared: 0.9646
## F-statistic: 328.3 on 2 and 22 DF, p-value: < 2.2e-16
R-squared is vastly improved (even beyond what we thought was good before), and the squared term is significantly nonzero: that is, it adds something useful to the model, so it should stay.
We should check the residual plot, via what ought to be by now a familiar technique:
ggplot(DC.2,aes(x=.fitted,y=.resid))+ geom_point()+geom_smooth(se=F)
2In class, I may have had a, b and c the other way around.
5
................
................
In order to avoid copyright disputes, this page is only a partial summary.
To fulfill the demand for quickly locating and searching documents.
It is intelligent file search solution for home and business.
Related download
- jitender aswani and jens doerpmund lets talk code
- ggplot remove legend guide
- ë t ëÖë Ì 6 btmë 6
- r handouts 2019 20 data visualization with ggplot2
- tutorial 7 maps 2 of 2
- syntax stata
- plotting and saving residuals and tted values in regression
- advancedquantitativemethods inthestudyofpoliticalbehavior
- graphics with ggplot2
- multipanel plotting in r with base graphics
Related searches
- personal and professional values examples
- real and ideal values sociology
- social and cultural values definition
- lease residuals and money factors
- coefficients in regression analysis
- coefficient in regression model
- coefficients in regression model
- negative coefficient in regression analysis
- pandas find and replace values in columns
- in regression analysis the residuals represent the
- in regression analysis the independent variable is
- t and p values in statistics