SAS Simple Linear Regression
SAS Regression Using Dummy Variables and Oneway ANOVA
/*****************************************************************
This example illustrates:
How to create side-by-side boxplots
How to create dummy variables
How to use dummy variables in a linear regression model
How to fit a oneway ANOVA
Procs used:
Proc Means
Proc Boxplot
Proc Reg
Proc Univariate
Proc GLM
Filename: dummy_variables.sas
*******************************************************************/
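The statements below let us use user-defined formats with our permanent SAS data set. FMTSEARCH tells SAS which libraries to search for format catalogs, and NOFMTERR lets processing continue if a format cannot be found.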
OPTIONS FORMCHAR="|----|+|---+=|-/\*";
libname b510 "e:\510\";
options fmtsearch=(WORK b510);
options nofmterr;
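The format definitions themselves are not shown in this handout. As a minimal sketch, assuming ORIGIN is stored as 1, 2, 3 with the labels USA, Europe, Japan (as the data step and output in this example suggest), a user-defined format could be created along these lines. The format name ORIGFMT is hypothetical.

```sas
/* Sketch only: ORIGFMT is a hypothetical name, not the course's actual format. */
/* The code-to-label mapping is inferred from the output in this example.       */
proc format;                 /* with no LIBRARY= option, the format goes to WORK */
  value origfmt 1 = "USA"
                2 = "Europe"
                3 = "Japan";
run;

/* Apply the format for the duration of one procedure */
proc freq data=b510.cars;
  tables origin;
  format origin origfmt.;
run;
```

Because WORK is listed first in FMTSEARCH, a format defined this way would be found before a same-named format in the b510 catalog.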
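We get descriptive statistics for all numeric variables within each level of ORIGIN. Because no VAR statement is given, PROC MEANS summarizes every numeric variable in the data set.
proc means data=b510.cars;
class origin;
run;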
The MEANS Procedure

            N
ORIGIN    Obs    Variable     N           Mean        Std Dev        Minimum        Maximum
-------------------------------------------------------------------------------------------
USA       253    MPG        248     20.1282258      6.3768059     10.0000000     39.0000000
                 ENGINE     253    247.7134387     98.7799678     85.0000000    455.0000000
                 HORSE      249    119.6064257     39.7991647     52.0000000    230.0000000
                 WEIGHT     253        3367.33    788.6117392        1800.00        5140.00
                 ACCEL      253     14.9284585      2.8011159      8.0000000     22.2000000
                 YEAR       253     75.5217391      3.7145843     70.0000000     82.0000000
                 CYLINDER   253      6.2766798      1.6626528      4.0000000      8.0000000

Europe     73    MPG         70     27.8914286      6.7239296     16.2000000     44.3000000
                 ENGINE      73    109.4657534     22.3719083     68.0000000    183.0000000
                 HORSE       71     81.0000000     20.8134572     46.0000000    133.0000000
                 WEIGHT      73        2431.49    490.8836172        1825.00        3820.00
                 ACCEL       73     16.8219178      3.0109175     12.2000000     24.8000000
                 YEAR        73     75.7397260      3.5630332     70.0000000     82.0000000
                 CYLINDER    73      4.1506849      0.4907826      4.0000000      6.0000000

Japan      79    MPG         79     30.4506329      6.0900481     18.0000000     46.6000000
                 ENGINE      79    102.7088608     23.1401260     70.0000000    168.0000000
                 HORSE       79     79.8354430     17.8191991     52.0000000    132.0000000
                 WEIGHT      79        2221.23    320.4972479        1613.00        2930.00
                 ACCEL       79     16.1721519      1.9549370     11.4000000     21.0000000
                 YEAR        79     77.4430380      3.6505947     70.0000000     82.0000000
                 CYLINDER    79      4.1012658      0.5904135      3.0000000      6.0000000
-------------------------------------------------------------------------------------------
We can see that the mean of vehicle miles per gallon (MPG) is lowest for the American cars, intermediate for the European cars, and highest for the Japanese cars.
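To focus on the comparison just described, the summary can be restricted to MPG. The following is a sketch using standard PROC MEANS options; the particular statistics requested and MAXDEC=2 are choices made here for illustration.

```sas
/* Sketch: summarize only MPG within each level of ORIGIN */
proc means data=b510.cars n nmiss mean std min max maxdec=2;
  class origin;   /* one block of statistics per origin */
  var mpg;        /* limit the analysis to MPG */
run;
```

The NMISS column makes it easy to see how many cars in each origin have no MPG value and will therefore drop out of any model for MPG.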
We now look at a side-by-side boxplot of miles per gallon (MPG) for each level of origin. PROC BOXPLOT expects the data to be sorted by the group variable, so we sort by ORIGIN first.
/*Get side-by-side boxplots of MPG for each
vehicle origin*/
proc sort data=b510.cars;
by origin;
run;
goptions device=win target=winprtm;
proc boxplot data=b510.cars;
plot mpg * origin / boxstyle=schematic;
run;
The boxplot shows the pattern of means that we noted in the descriptive statistics. The spread of MPG is similar for the American, European and Japanese cars. The distribution of MPG is somewhat positively skewed for American and European cars, and negatively skewed for Japanese cars. There are some high outliers among the American and European cars. Because ORIGIN is a nominal variable, we should not read the left-to-right pattern as an ordinal relationship: with a different coding for ORIGIN, the groups would appear in a different order and the graph would show a different pattern.
[Figure: side-by-side boxplots of MPG by ORIGIN]
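As an alternative sketch, assuming SAS 9.2 or later where the ODS statistical graphics procedures are available, PROC SGPLOT draws the same kind of side-by-side boxplot and does not require the data to be sorted by ORIGIN first.

```sas
/* Sketch: side-by-side boxplots of MPG by ORIGIN with PROC SGPLOT */
proc sgplot data=b510.cars;
  vbox mpg / category=origin;   /* one box per level of ORIGIN */
run;
```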
Before we can fit a linear regression model with a categorical (in this case, nominal) predictor, we need to create dummy variables to be used in the model. We will create three dummy variables, even though only two of them will be used in the regression model. Each dummy variable will be coded as 0 or 1. A value of 1 indicates that a case is in a given level of origin, and a value of 0 indicates that the case is not in that level of origin.
The dummy variables for ORIGIN are created in the data step below. Note that the output shows that the one car with a missing origin does not have a value for any of the dummy variables. Also note that the frequency tabulations for the three dummy variables show one missing value for each dummy variable.
/*Data step to create dummy variables for each level of ORIGIN*/
data b510.cars2;
set b510.cars;
if origin not=. then do;
American=(origin=1);
European=(origin=2);
Japanese=(origin=3);
end;
run;
proc print data=b510.cars2;
var origin American European Japanese weight;
run;
Obs    ORIGIN    American    European    Japanese    WEIGHT
  1    .                .           .           .       732
  2    USA              1           0           0      1800
  3    USA              1           0           0      1875
  4    USA              1           0           0      1915
  5    USA              1           0           0      1955
. . .
256    Europe           0           1           0      1825
257    Europe           0           1           0      1834
258    Europe           0           1           0      1835
259    Europe           0           1           0      1835
260    Europe           0           1           0      1845
. . .
329    Japan            0           0           1      1649
330    Japan            0           0           1      1755
331    Japan            0           0           1      1760
332    Japan            0           0           1      1773
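The IF ... THEN DO guard in the data step matters because a SAS comparison such as (origin=1) evaluates to 0, not missing, when origin is missing. A minimal sketch on a small made-up data set (the data set name DEMO and the variable names UNGUARDED and GUARDED are hypothetical) illustrates the difference:

```sas
/* Sketch: effect of guarding against a missing ORIGIN */
data demo;
  input origin;
  /* Without the guard: a missing origin produces 0 */
  unguarded = (origin = 1);
  /* With the guard: the dummy stays missing when origin is missing */
  if origin ne . then guarded = (origin = 1);
  datalines;
.
1
2
;
run;

proc print data=demo;
run;
```

For the first record, UNGUARDED is 0 while GUARDED is missing. Without the guard, a car with unknown origin would be coded 0 on every dummy variable and would be treated as a member of the reference category in the regression.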
proc freq data=b510.cars2;
tables origin american european japanese;
run;
The FREQ Procedure

                                   Cumulative    Cumulative
ORIGIN    Frequency     Percent     Frequency       Percent
-----------------------------------------------------------
USA             253       62.47           253         62.47
Europe           73       18.02           326         80.49
Japan            79       19.51           405        100.00

Frequency Missing = 1

                                   Cumulative    Cumulative
American  Frequency     Percent     Frequency       Percent
-----------------------------------------------------------
0               152       37.53           152         37.53
1               253       62.47           405        100.00

Frequency Missing = 1

                                   Cumulative    Cumulative
European  Frequency     Percent     Frequency       Percent
-----------------------------------------------------------
0               332       81.98           332         81.98
1                73       18.02           405        100.00

Frequency Missing = 1

                                   Cumulative    Cumulative
Japanese  Frequency     Percent     Frequency       Percent
-----------------------------------------------------------
0               326       80.49           326         80.49
1                79       19.51           405        100.00

Frequency Missing = 1
We can now fit a regression model to predict MPG from ORIGIN. We will use American cars as the reference category in this model. To do this, we include the dummy variables for European and Japanese cars in the model and leave out the dummy variable for American cars. The coefficients of these two dummy variables represent contrasts between the average MPG for European vs. American cars and for Japanese vs. American cars, respectively. In general, if you have k categories in your categorical variable, you will need to include k-1 dummy variables in the regression model.
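Written out, the model being fit can be sketched as follows. The β symbols are notation introduced here for illustration; they do not appear in the SAS output.

```latex
\begin{align*}
\text{MPG}_i &= \beta_0 + \beta_1\,\text{European}_i + \beta_2\,\text{Japanese}_i + \varepsilon_i \\
E[\text{MPG} \mid \text{American}] &= \beta_0 \\
E[\text{MPG} \mid \text{European}] &= \beta_0 + \beta_1 \\
E[\text{MPG} \mid \text{Japanese}] &= \beta_0 + \beta_2
\end{align*}
```

Because the model contains only the dummy variables, the least-squares estimates should reproduce the group means of MPG reported by PROC MEANS: the intercept estimate is the American mean (20.128), and the two slope estimates are the differences 27.891 - 20.128 = 7.763 (European vs. American) and 30.451 - 20.128 = 10.322 (Japanese vs. American).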
/*Fit a regression model with American cars as the reference category*/
proc reg data=b510.cars2;
model mpg = european japanese;
plot residual.*predicted.;
output out=regdat p=predicted r=residual rstudent=rstudent;
run; quit;
The REG Procedure
Model: MODEL1
Dependent Variable: MPG
Number of Observations Read 406
Number of Observations Used 397
Number of Observations with Missing Values 9
Analysis of Variance

                        Sum of           Mean
Source        DF       Squares         Square    F Value    Pr > F
Model          2    7984.95725     3992.47862      97.97
. . .

Parameter Estimates

                   Parameter     Standard
Variable      DF    Estimate        Error    t Value    Pr > |t|
Intercept      1    20.12823      0.40537      49.65
. . .
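The same comparison can be set up as a oneway ANOVA. The following is a minimal sketch, not necessarily the code used in the full example: with ORIGIN on the CLASS statement, PROC GLM builds the indicator variables internally, so they do not have to be created in a data step.

```sas
/* Sketch: oneway ANOVA of MPG by ORIGIN */
proc glm data=b510.cars2;
  class origin;          /* treat ORIGIN as a categorical factor */
  model mpg = origin;
  means origin;          /* group means of MPG for each level of ORIGIN */
run; quit;
```

The overall F test for ORIGIN in this ANOVA tests the same hypothesis (equal mean MPG across the three origins) as the overall model F test in the dummy-variable regression, so the two F values should agree.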