Linear Regression using Stata - Princeton University
Linear Regression using Stata
(v. 6.3)
Oscar Torres-Reyna
otorres@princeton.edu
December 2007
Regression: a practical approach (overview)
We use regression to estimate the unknown effect of changing one variable on another (Stock and Watson, 2003, ch. 4). When running a regression we make two assumptions: 1) there is a linear relationship between the two variables (i.e. X and Y), and 2) this relationship is additive (i.e. Y = x1 + x2 + ... + xN). Technically, linear regression estimates how much Y changes when X changes by one unit.

In Stata use the command regress. Type:

regress [dependent variable] [independent variable(s)]
regress y x

In a multivariate setting we type:

regress y x1 x2 x3 ...

Before running a regression it is recommended to have a clear idea of what you are trying to estimate (i.e. which are your outcome and predictor variables). A regression makes sense only if there is a sound theory behind it.
PU/DSS/OTR
Regression: a practical approach (setting)
Example: Are SAT scores higher in states that spend more money on education controlling by other factors?*
- Outcome (Y) variable: SAT scores, variable csat in dataset
- Predictor (X) variables:
  - Per pupil expenditures primary & secondary (expense)
  - % HS graduates taking SAT (percent)
  - Median household income (income)
  - % adults with HS diploma (high)
  - % adults with college degree (college)
  - Region (region)
*Source: Data and examples come from the book Statistics with Stata (updated for version 9) by Lawrence C. Hamilton (chapter 6). Use the file states.dta (educational data for the U.S.).
Regression: variables
It is recommended first to examine the variables in the model to check for possible errors. Type:

use states.dta
describe csat expense percent income high college region
summarize csat expense percent income high college region
. describe csat expense percent income high college region

              storage  display    value
variable name   type   format     label      variable label
--------------------------------------------------------------------------
csat            int    %9.0g                 Mean composite SAT score
expense         int    %9.0g                 Per pupil expenditures prim&sec
percent         byte   %9.0g                 % HS graduates taking SAT
income          double %10.0g                Median household income, $1,000
high            float  %9.0g                 % adults HS diploma
college         float  %9.0g                 % adults college degree
region          byte   %9.0g      region     Geographical region
. summarize csat expense percent income high college region

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        csat |        51     944.098    66.93497        832       1093
     expense |        51    5235.961    1401.155       2960       9259
     percent |        51    35.76471    26.19281          4         81
      income |        51    33.95657    6.423134     23.465     48.618
        high |        51    76.26078    5.588741       64.3       86.6
     college |        51    20.02157     4.16578       12.3       33.3
      region |        50        2.54    1.128662          1          4
Regression: what to look for

Let's run the regression:

regress csat expense, robust

Here csat is the outcome variable (Y), expense is the predictor variable (X), and the robust option requests robust standard errors (to control for heteroskedasticity).

. regress csat expense, robust

Linear regression                               Number of obs =      51
                                                F(  1,    49) =   36.80
                                                Prob > F      =  0.0000
                                                R-squared     =  0.2174
                                                Root MSE      =  59.814

------------------------------------------------------------------------------
             |               Robust
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0222756   .0036719    -6.07   0.000    -.0296547   -.0148966
       _cons |   1060.732   24.35468    43.55   0.000      1011.79   1109.675
------------------------------------------------------------------------------

What to look for:

1. Prob > F is the p-value of the model. It tests whether R2 is different from 0. Usually we need a p-value lower than 0.05 to show a statistically significant relationship between X and Y.

2. R-squared shows the amount of variance of Y explained by X. In this case expense explains 22% of the variance in SAT scores.

3. Adj R2 (not shown here) shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large, Adj R2 is closer to R2. This provides a more honest measure of the association between X and Y.

4. The estimated equation is csat = 1061 - 0.022*expense. For each one-unit increase in expense, SAT scores decrease by 0.022 points.

5. The t-values test the hypothesis that each coefficient is different from 0. To reject this, you need a t-value greater than 1.96 (for 95% confidence). You can get the t-value by dividing the coefficient by its standard error. The t-values also show the relative importance of a variable in the model.

6. Two-tail p-values test the hypothesis that each coefficient is different from 0. To reject this, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, expense is statistically significant in explaining SAT.

7. Root MSE (root mean squared error) is the standard deviation of the residuals of the regression. The closer to zero, the better the fit.
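As a quick sanity check, the t-value and 95% confidence interval in the table can be reproduced by hand from the coefficient and its robust standard error. A minimal Python sketch (the critical t-value for 49 degrees of freedom is taken as given from a t-table, not computed):

```python
# Numbers from the `regress csat expense, robust` output above
coef = -0.0222756   # coefficient on expense
se = 0.0036719      # robust standard error

# t-value = coefficient / standard error
t = coef / se       # about -6.07, matching the t column

# 95% CI = coef +/- t_crit * se; t_crit for 49 df is roughly 2.0096
t_crit = 2.0096
ci_low, ci_high = coef - t_crit * se, coef + t_crit * se
print(round(t, 2), ci_low, ci_high)
```

The interval comes out very close to the [-.0296547, -.0148966] shown in the output.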
Regression: what to look for

Adding the rest of the predictor variables:

regress csat expense percent income high college, robust

The robust option again requests robust standard errors (to control for heteroskedasticity).

. regress csat expense percent income high college, robust

Linear regression                               Number of obs =      51
                                                F(  5,    45) =   50.90
                                                Prob > F      =  0.0000
                                                R-squared     =  0.8243
                                                Root MSE      =  29.571

------------------------------------------------------------------------------
             |               Robust
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0033528    .004781     0.70   0.487    -.0062766    .0129823
     percent |  -2.618177   .2288594   -11.44   0.000    -3.079123    -2.15723
      income |   .1055853   1.207246     0.09   0.931    -2.325933    2.537104
        high |   1.630841    .943318     1.73   0.091    -.2690989    3.530781
     college |   2.030894   2.113792     0.96   0.342    -2.226502     6.28829
       _cons |   851.5649   57.28743    14.86   0.000     736.1821    966.9477
------------------------------------------------------------------------------

What to look for:

1. Prob > F is the p-value of the model. It indicates the reliability of X to predict Y. Usually we need a p-value lower than 0.05 to show a statistically significant relationship between X and Y.

2. R-squared shows the amount of variance of Y explained by X. In this case the model explains 82.43% of the variance in SAT scores.

3. Adj R2 (not shown here) shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large, Adj R2 is closer to R2. This provides a more honest measure of the association between X and Y.

4. The estimated equation is:

csat = 851.56 + 0.003*expense - 2.62*percent + 0.11*income + 1.63*high + 2.03*college

5. The t-values test the hypothesis that each coefficient is different from 0. To reject this, you need a t-value greater than 1.96 (at 0.05 confidence). You can get the t-value by dividing the coefficient by its standard error. The t-values also show the relative importance of a variable in the model. In this case, percent is the most important.

6. Two-tail p-values test the hypothesis that each coefficient is different from 0. To reject this, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, expense, income, and college are not statistically significant in explaining SAT; high is almost significant at 0.10. Percent is the only variable with a significant impact on SAT (its coefficient is different from 0).

7. Root MSE (root mean squared error) is the standard deviation of the residuals of the regression. The closer to zero, the better the fit.
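To see what the fitted equation implies, you can plug values into it by hand. A sketch in Python, using the coefficients above and hypothetical (assumed) predictor values chosen near the sample means:

```python
# Coefficients from the multivariate regression above
intercept = 851.5649
coefs = {"expense": 0.0033528, "percent": -2.618177, "income": 0.1055853,
         "high": 1.630841, "college": 2.030894}

# Hypothetical state: values near the sample means (assumed, not from the data)
state = {"expense": 5236, "percent": 36, "income": 34, "high": 76, "college": 20}

# Fitted value = intercept + sum of coefficient * predictor value
csat_hat = intercept + sum(coefs[k] * state[k] for k in coefs)
print(round(csat_hat, 1))  # close to the sample mean csat of about 944
```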
Regression: using dummy variables/selecting the reference category
If you use categorical variables in your regression, you need to add n-1 dummy variables, where `n' is the number of categories in the variable. In the example below, the variable `industry' has twelve categories (type tab industry, or tab industry, nolabel, to see them).
The easiest way to include a set of dummies in a regression is by using the prefix "i." By default, the first category (or lowest value) is used as reference. For example:
sysuse nlsw88.dta
reg wage hours i.industry, robust
Linear regression                               Number of obs =    2228
                                                F( 12,  2215) =   24.96
                                                Prob > F      =  0.0000
                                                R-squared     =  0.0800
                                                Root MSE      =  5.5454

-----------------------------------------------------------------------------------------
                        |               Robust
                   wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------+----------------------------------------------------------------
                  hours |   .0723658   .0110213     6.57   0.000     .0507526     .093979
                        |
               industry |
                 Mining |   9.328331   7.287849     1.28   0.201    -4.963399    23.62006
           Construction |   1.858089   1.281807     1.45   0.147    -.6555808    4.371759
          Manufacturing |   1.415641    .849571     1.67   0.096    -.2503983    3.081679
 Transport/Comm/Utility |   5.432544    1.03998     5.22   0.000     3.393107    7.471981
 Wholesale/Retail Trade |   .4583809   .8548564     0.54   0.592    -1.218023    2.134785
Finance/Ins/Real Estate |    3.92933   .9934195     3.96   0.000     1.981199    5.877461
    Business/Repair Svc |   1.990151   1.054457     1.89   0.059    -.0776775    4.057979
      Personal Services |  -1.018771   .8439617    -1.21   0.228     -2.67381    .6362679
  Entertainment/Rec Svc |   1.111801   1.192314     0.93   0.351    -1.226369    3.449972
  Professional Services |   2.094988   .8192781     2.56   0.011     .4883548    3.701622
  Public Administration |   3.232405   .8857298     3.65   0.000     1.495457    4.969352
                        |
                  _cons |   3.126629   .8899074     3.51   0.000     1.381489    4.871769
-----------------------------------------------------------------------------------------
To change the reference category to "Professional services" (category number 11) instead of "Ag/Forestry/Fisheries" (category number 1), use the prefix "ib#.", where "#" is the number of the reference category you want to use; in this case, 11.
sysuse nlsw88.dta
reg wage hours ib11.industry, robust
Linear regression                               Number of obs =    2228
                                                F( 12,  2215) =   24.96
                                                Prob > F      =  0.0000
                                                R-squared     =  0.0800
                                                Root MSE      =  5.5454

-----------------------------------------------------------------------------------------
                        |               Robust
                   wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------+----------------------------------------------------------------
                  hours |   .0723658   .0110213     6.57   0.000     .0507526     .093979
                        |
               industry |
  Ag/Forestry/Fisheries |  -2.094988   .8192781    -2.56   0.011    -3.701622   -.4883548
                 Mining |   7.233343   7.245913     1.00   0.318     -6.97615    21.44284
           Construction |  -.2368991   1.011309    -0.23   0.815    -2.220112    1.746314
          Manufacturing |  -.6793477   .3362365    -2.02   0.043    -1.338719    -.019976
 Transport/Comm/Utility |   3.337556   .6861828     4.86   0.000     1.991927    4.683185
 Wholesale/Retail Trade |  -1.636607   .3504059    -4.67   0.000    -2.323766    -.949449
Finance/Ins/Real Estate |   1.834342   .6171526     2.97   0.003     .6240837      3.0446
    Business/Repair Svc |  -.1048377   .7094241    -0.15   0.883    -1.496044    1.286368
      Personal Services |  -3.113759   .3192289    -9.75   0.000    -3.739779    -2.48774
  Entertainment/Rec Svc |   -.983187   .9004471    -1.09   0.275    -2.748996    .7826217
  Public Administration |   1.137416   .4176899     2.72   0.007     .3183117    1.956521
                        |
                  _cons |   5.221617   .4119032    12.68   0.000      4.41386    6.029374
-----------------------------------------------------------------------------------------
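The two parameterizations agree: a category's coefficient under the new reference equals its old coefficient minus the old coefficient of the new reference category. A quick check in Python with the numbers from the two outputs:

```python
# From the first output (reference = Ag/Forestry/Fisheries)
mining_vs_ag = 9.328331    # Mining coefficient
prof_vs_ag = 2.094988      # Professional Services coefficient

# Implied Mining coefficient when Professional Services is the reference
mining_vs_prof = mining_vs_ag - prof_vs_ag
print(round(mining_vs_prof, 6))  # 7.233343, as reported in the second output
```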
The "ib#." option is available since Stata 11 (type help fvvarlist for more options/details). For older Stata versions you need to use "xi:" along with "i." (type help xi for more options/details). For the examples above type (output omitted):

xi: reg wage hours i.industry, robust

char industry[omit]11    /* Use category 11 as the reference */
xi: reg wage hours i.industry, robust

To create dummies as variables, type:

tab industry, gen(industry)

To include all categories by suppressing the constant, type:

reg wage hours bn.industry, robust hascons
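The n-1 dummy coding that "i." performs can be illustrated outside Stata. A minimal Python sketch (shortened category list; make_dummies is a hypothetical helper for illustration, not a Stata command):

```python
def make_dummies(values, categories, reference):
    """Build one 0/1 indicator per non-reference category (n-1 dummies)."""
    kept = [c for c in categories if c != reference]
    return [[1 if v == c else 0 for c in kept] for v in values]

cats = ["Ag/Forestry/Fisheries", "Mining", "Construction"]  # shortened list
rows = make_dummies(["Mining", "Ag/Forestry/Fisheries"], cats,
                    reference="Ag/Forestry/Fisheries")
# The reference category shows up as an all-zero row
print(rows)  # [[1, 0], [0, 0]]
```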
Regression: ANOVA table

If you run the regression without the `robust' option you get the ANOVA table:

xi: regress csat expense percent income high college i.region

      Source |       SS       df       MS            Number of obs =      50
-------------+------------------------------        F(  9,    40) =   70.13
   Model (A) |  200269.84     9  22252.2045 (D)     Prob > F      =  0.0000
Residual (B) | 12691.5396    40   317.28849 (E)     R-squared     =  0.9404
-------------+------------------------------        Adj R-squared =  0.9270
   Total (C) |  212961.38    49  4346.15061 (F)     Root MSE      =  17.813

The header statistics come from the sums of squares:

F = [MSS/(k-1)] / [RSS/(n-k)] = (200269.84/9) / (12691.5396/40)
  = 22252.2045/317.28849 = D/E = 70.13

R2 = MSS/TSS = 1 - sum(ei^2)/sum((yi - ybar)^2)
   = 200269.84/212961.38 = A/C = 0.9404

Adj R2 = 1 - [(n-1)/(n-k)]*(1-R2) = 1 - (49/40)*(1-0.9404)
       = 1 - E/F = 1 - 317.28849/4346.15061 = 0.9270

Root MSE = sqrt(RSS/(n-k)) = sqrt(12691.5396/40) = sqrt(B/40) = 17.813

A = Model Sum of Squares (MSS). The closer to TSS, the better the fit.
B = Residual Sum of Squares (RSS)
C = Total Sum of Squares (TSS)
D = Average Model Sum of Squares = MSS/(k-1), where k = # of parameters (predictors plus the constant)
E = Average Residual Sum of Squares = RSS/(n-k), where n = # of observations
F = Average Total Sum of Squares = TSS/(n-1)

R2 shows the amount of observed variance explained by the model, in this case 94%. The F-statistic, F(9,40), tests whether R2 is different from zero. Root MSE shows the average distance of the estimator from the mean, in this case about 18 points in estimating SAT scores.
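The formulas above can be checked directly from the sums of squares in the ANOVA table. A sketch in Python:

```python
import math

# Sums of squares from the ANOVA table above
MSS = 200269.84    # A: model sum of squares
RSS = 12691.5396   # B: residual sum of squares
TSS = 212961.38    # C: total sum of squares
n, k = 50, 10      # observations; parameters (9 predictors plus the constant)

F = (MSS / (k - 1)) / (RSS / (n - k))       # F-statistic, about 70.13
r2 = MSS / TSS                              # R-squared, about 0.9404
adj_r2 = 1 - (n - 1) / (n - k) * (1 - r2)   # Adj R-squared, about 0.9270
root_mse = math.sqrt(RSS / (n - k))         # Root MSE, about 17.813

print(round(F, 2), round(r2, 4), round(adj_r2, 4), round(root_mse, 3))
```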
Source: Kohler, Ulrich and Frauke Kreuter, Data Analysis Using Stata, 2009.