SAS Regression Using Dummy Variables




/***********************************************

SAS EXAMPLE -- DUMMY VARIABLES IN REGRESSION

FOR BOTH ORDINAL AND NOMINAL

VARIABLES

FILENAME: regress2.sas

************************************************/

/*USE PERMANENT SAS DATA SET CREATED EARLIER*/

libname LABDATA "c:\temp\labdata";

/*CHECK DUMMY VARIABLE CODING*/

proc freq data=labdata.werner;

tables agegrp agedum1 agedum2 agedum3 agedum4;

title "CHECKING DUMMY VARIABLE CODING";

run;

proc means data=labdata.werner;

class agegrp;

var age;

run;

/*Create boxplots of cholesterol for each level of agegrp*/

proc sort data=labdata.werner;

by agegrp;

run;

proc boxplot data=labdata.werner;

plot chol*agegrp / boxstyle=schematic;

title "BOXPLOT TO SHOW RELATIONSHIP BETWEEN AGEGRP AND CHOLESTEROL";

run;

/*MODEL WITH AGE DUMMY VARIABLES*/

proc reg data=labdata.werner;

model chol = agedum2 agedum3 agedum4;

AGEDUM: test agedum2, agedum3, agedum4;

output out=regdat1 p=predict1 r=resid1;

plot residual.*predicted.;

title "MULTIPLE REGRESSION WITH DUMMY VARIABLES FOR AGE";

title2 "PLUS A TEST FOR AGE DUMMY VARIABLES";

title3 "REFERENCE AGE IS AGEGRP 1";

run; quit;

proc univariate data=regdat1 plot normal;

var resid1;

histogram;

qqplot / normal(mu=est sigma=est);

run;

/*SWITCH REFERENCE GROUP FOR AGE DUMMY VARIABLES*/

proc reg data=labdata.werner;

model chol = agedum1 agedum2 agedum3;

AGEDUM: test agedum1, agedum2, agedum3;

title "MULTIPLE REGRESSION WITH DUMMY VARIABLES FOR AGE";

title2 "PLUS A TEST FOR AGE DUMMY VARIABLES";

title3 "REFERENCE AGE IS AGEGRP 4";

run; quit;

/*INCLUDE AGE DUMMY VARIABLES AND CONTINUOUS COVARIATES*/

proc reg data=labdata.werner;

model chol = agedum2 agedum3 agedum4 calc uric alb wt;

AGEDUM: test agedum2, agedum3, agedum4;

title "MULTIPLE REGRESSION WITH DUMMY VARIABLES FOR AGE";

title2 "PLUS OTHER CONTINUOUS COVARIATES";

title3 "REFERENCE AGE IS AGEGRP 1";

run;quit;

/*****************************************************

ANOTHER EXAMPLE USING DUMMY VARIABLES FOR A NOMINAL

CATEGORICAL VARIABLE: SPECIES

******************************************************/

title;

data kanga;

infile "c:\temp\labdata\kanga.dat" lrecl=80;

input sex 1

species 3

basal_l 5-8

occip_l 10-13

palat_l 15-18

palat_w 20-22

nasal_l 24-26

nasal_w 28-30

squam_d 32-34

lacry_w 36-38

zygom_w 40-43

orbit_w 45-47

rostr_w 49-51

occip_d 53-55

crest_w 57-59

foram_w 61-63

mandi_l 65-68

mandi_w 70-72

mandi_d 74-76

ramus_h 78-80;

/*CREATE 0/1 DUMMY VARIABLES FOR SPECIES; IF SPECIES IS MISSING, THE DUMMIES STAY MISSING*/

if species = 0 then species_dum0 = 1;

if species in (1,2) then species_dum0=0;

if species = 1 then species_dum1 = 1;

if species in (0,2) then species_dum1=0;

if species = 2 then species_dum2 = 1;

if species in (0,1) then species_dum2=0;

run;

proc means data=kanga;

class species;

run;

proc sort data=kanga;

by species;

run;

proc boxplot data=kanga;

plot crest_w*species / boxstyle=schematic;

run;

proc reg data=kanga;

model crest_w = species_dum1 species_dum2;

plot residual.*predicted.;

output out=kanga_reg1 p=predict r=resid rstudent=rstudent;

run;quit;

proc univariate data=kanga_reg1 plot normal;

var resid;

histogram;

qqplot / normal(mu=est sigma=est);

run;

*************************************************************************************;

proc freq data=labdata.werner;

tables agegrp agedum1 agedum2 agedum3 agedum4;

title "CHECKING DUMMY VARIABLE CODING";

run;

CHECKING DUMMY VARIABLE CODING

The FREQ Procedure

                                     Cumulative    Cumulative
AGEGRP     Frequency     Percent      Frequency       Percent
-------------------------------------------------------------
     1            44       23.40             44         23.40
     2            46       24.47             90         47.87
     3            50       26.60            140         74.47
     4            48       25.53            188        100.00

                                     Cumulative    Cumulative
AGEDUM1    Frequency     Percent      Frequency       Percent
-------------------------------------------------------------
      0          144       76.60            144         76.60
      1           44       23.40            188        100.00

                                     Cumulative    Cumulative
AGEDUM2    Frequency     Percent      Frequency       Percent
-------------------------------------------------------------
      0          142       75.53            142         75.53
      1           46       24.47            188        100.00

                                     Cumulative    Cumulative
AGEDUM3    Frequency     Percent      Frequency       Percent
-------------------------------------------------------------
      0          138       73.40            138         73.40
      1           50       26.60            188        100.00

                                     Cumulative    Cumulative
AGEDUM4    Frequency     Percent      Frequency       Percent
-------------------------------------------------------------
      0          140       74.47            140         74.47
      1           48       25.53            188        100.00
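
The dummy variables checked above were stored in the permanent data set created earlier; the code that built them is not shown in this handout. The following is a minimal sketch of one way they might be created from AGEGRP. The data set name werner2 is hypothetical, and the existing dummies are dropped first so that a missing AGEGRP leaves all four new dummy variables missing.

```sas
/*SKETCH ONLY: ASSUMES AGEGRP IS CODED 1-4 AS IN THE PROC FREQ OUTPUT*/
/*WERNER2 IS A HYPOTHETICAL WORK DATA SET NAME*/
data werner2;
  set labdata.werner(drop=agedum1-agedum4);
  /*A COMPARISON RETURNS 1 WHEN TRUE AND 0 WHEN FALSE*/
  if not missing(agegrp) then do;
    agedum1 = (agegrp = 1);
    agedum2 = (agegrp = 2);
    agedum3 = (agegrp = 3);
    agedum4 = (agegrp = 4);
  end;
run;

/*LIST EACH AGEGRP-BY-DUMMY COMBINATION, INCLUDING MISSING VALUES*/
proc freq data=werner2;
  tables agegrp*(agedum1 agedum2 agedum3 agedum4) / list missing;
run;
```

In the cross-tabulation, each dummy should equal 1 in exactly one age group and 0 in the other three.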

proc means data=labdata.werner;

class agegrp;

var age;

run;

The MEANS Procedure

Analysis Variable : AGE

            N
AGEGRP    Obs     N          Mean       Std Dev       Minimum       Maximum
---------------------------------------------------------------------------
     1     44    44    21.8181818     1.2440131    19.0000000    24.0000000
     2     46    46    28.0434783     2.0758561    25.0000000    31.0000000
     3     50    50    36.2400000     3.2486104    32.0000000    41.0000000
     4     48    48    47.8333333     4.0070859    42.0000000    55.0000000
---------------------------------------------------------------------------
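
The minimum and maximum ages above suggest that AGEGRP splits AGE at roughly 24, 31 and 41 years. The original cutpoints are not given in this handout, so the sketch below only illustrates how such a grouping could be rebuilt and compared with the stored AGEGRP. The names agecheck and agegrp_new are hypothetical.

```sas
/*SKETCH ONLY: CUTPOINTS 24, 31 AND 41 ARE INFERRED FROM THE OBSERVED*/
/*MINIMUM AND MAXIMUM AGES, NOT TAKEN FROM THE ORIGINAL CODE*/
data agecheck;
  set labdata.werner(keep=age agegrp);
  if missing(age) then agegrp_new = .;
  else if age <= 24 then agegrp_new = 1;
  else if age <= 31 then agegrp_new = 2;
  else if age <= 41 then agegrp_new = 3;
  else agegrp_new = 4;
run;

/*EVERY ROW SHOULD HAVE AGEGRP EQUAL TO AGEGRP_NEW IF THE CUTPOINTS ARE RIGHT*/
proc freq data=agecheck;
  tables agegrp*agegrp_new / list missing;
run;
```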

/*Create boxplots of cholesterol for each level of agegrp*/

proc sort data=labdata.werner;

by agegrp;

run;

proc boxplot data=labdata.werner;

plot chol*agegrp / boxstyle=schematic;

title "BOXPLOT TO SHOW RELATIONSHIP BETWEEN AGEGRP AND CHOLESTEROL";

run;

[Figure: schematic boxplots of CHOL by AGEGRP]

proc reg data=labdata.werner;

model chol = agedum2 agedum3 agedum4;

AGEDUM: test agedum2, agedum3, agedum4;

output out=regdat1 p=predict1 r=resid1;

plot residual.*predicted.;

title "MULTIPLE REGRESSION WITH DUMMY VARIABLES FOR AGE";

title2 "PLUS A TEST FOR AGE DUMMY VARIABLES";

title3 "REFERENCE AGE IS AGEGRP 1";

run; quit;

MULTIPLE REGRESSION WITH DUMMY VARIABLES FOR AGE

PLUS A TEST FOR AGE DUMMY VARIABLES

REFERENCE AGE IS AGEGRP 1

The REG Procedure

Model: MODEL1

Dependent Variable: CHOL

Number of Observations Read 188

Number of Observations Used 187

Number of Observations with Missing Values 1

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 3 38114 12705 7.02 0.0002

Error 183 331383 1810.83492

Corrected Total 186 369497

Root MSE 42.55391 R-Square 0.1032

Dependent Mean 235.15508 Adj R-Sq 0.0884

Coeff Var 18.09610

Parameter Estimates

Parameter Standard

Variable DF Estimate Error t Value Pr > |t|

Intercept 1 218.44186 6.48941 33.66 <.0001

...

Kolmogorov-Smirnov D ... Pr > D >0.1500

Cramer-von Mises W-Sq 0.057998 Pr > W-Sq >0.2500

Anderson-Darling A-Sq 0.511199 Pr > A-Sq 0.2024

Variable: resid1 (Residual)

Stem Leaf # Boxplot

12 3 1 0

10 3 1 |

8 1480 4 |

6 233488902889 12 |

4 00123355889334 14 |

2 22333334446788992344567789 26 +-----+

0 122233333467778888899011123688999 33 | + |

-0 888887554432100998777643322111000 33 *-----*

-2 9988776322111009977777666663221 31 +-----+

-4 97666533221087775321100 23 |

-6 6763210 7 |

-8 7 1 |

-10

-12

-14

-16 7 1 0

----+----+----+----+----+----+---

Multiply Stem.Leaf by 10**+1

Normal Probability Plot

130+ *

| *+

| ****+

70+ ******

| ****+

| ******

10+ ******

| *****

| *******

-50+ *******

| ******++

| *+++++

-110+++

|

|

-170+*

+----+----+----+----+----+----+----+----+----+----+

-2 -1 0 +1 +2

[pic]

[pic]

proc reg data=labdata.werner;

model chol = agedum1 agedum2 agedum3;

AGEDUM: test agedum1, agedum2, agedum3;

title "MULTIPLE REGRESSION WITH DUMMY VARIABLES FOR AGE";

title2 "PLUS A TEST FOR AGE DUMMY VARIABLES";

title3 "REFERENCE AGE IS AGEGRP 4";

run;quit;

MULTIPLE REGRESSION WITH DUMMY VARIABLES FOR AGE

PLUS A TEST FOR AGE DUMMY VARIABLES

REFERENCE AGE IS AGEGRP 4

The REG Procedure

Model: MODEL1

Dependent Variable: CHOL

Number of Observations Read 188

Number of Observations Used 187

Number of Observations with Missing Values 1

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 3 38114 12705 7.02 0.0002

Error 183 331383 1810.83492

Corrected Total 186 369497

Root MSE 42.55391 R-Square 0.1032

Dependent Mean 235.15508 Adj R-Sq 0.0884

Coeff Var 18.09610

Parameter Estimates

Parameter Standard

Variable DF Estimate Error t Value Pr > |t|

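Intercept 1 257.16667 6.14213 41.87 <.0001

...

Source DF Squares Square F Value Pr > F

Model 7 74986 10712 6.47 <.0001

...

Variable DF Estimate Error t Value Pr > |t|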

Intercept 1 -26.00067 67.74900 -0.38 0.7016

AGEDUM2 1 13.64563 8.81812 1.55 0.1236

AGEDUM3 1 20.93826 8.77256 2.39 0.0181

AGEDUM4 1 35.60898 9.03520 3.94 0.0001

CALC 1 23.13575 7.49192 3.09 0.0023

URIC 1 8.62277 2.90619 2.97 0.0034

ALB 1 -4.18820 10.01485 -0.42 0.6763

WT 1 -0.08492 0.16699 -0.51 0.6117

Test AGEDUM Results for Dependent Variable CHOL

Mean

Source DF Square F Value Pr > F

Numerator 3 8892.61133 5.37 0.0015

Denominator 173 1655.47937
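
In the two models that contain only the age dummy variables, the intercept is the mean of CHOL in the reference group, and each dummy coefficient is the difference between that group's mean and the reference mean. This is why the intercept is 218.44186 when AGEGRP 1 is the reference and 257.16667 when AGEGRP 4 is the reference. Their difference, 257.16667 - 218.44186 = 38.72481, is the mean difference in CHOL between AGEGRP 4 and AGEGRP 1. Once the continuous covariates are added, the intercept and dummy coefficients are adjusted for those covariates and no longer equal raw group means. As a sketch, the unadjusted group means can be checked directly:

```sas
/*GROUP MEANS OF CHOL; THE AGEGRP 1 AND AGEGRP 4 MEANS SHOULD MATCH*/
/*THE INTERCEPTS OF THE TWO DUMMY-ONLY MODELS*/
proc means data=labdata.werner n mean std;
  class agegrp;
  var chol;
run;
```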

Kangaroo data set analysis:

proc means data=kanga;

class species;

run;

The MEANS Procedure

N

species Obs Variable N Mean Std Dev Minimum Maximum

------------------------------------------------------------------------------------------

0 51 sex 51 1.4901961 0.5048782 1.0000000 2.0000000

basal_l 51 1499.39 166.8074433 1090.00 1899.00

occip_l 51 1591.76 161.4247302 1195.00 1925.00

palat_l 51 1035.51 125.6861763 740.0000000 1327.00

palat_w 42 264.6190476 30.2323945 208.0000000 332.0000000

nasal_l 51 706.8627451 89.5468636 493.0000000 905.0000000

nasal_w 51 246.7058824 30.0188568 175.0000000 310.0000000

squam_d 50 179.8600000 25.8575454 122.0000000 265.0000000

lacry_w 51 444.5686275 43.3406299 350.0000000 560.0000000

zygom_w 50 868.9800000 75.0516802 673.0000000 1067.00

orbit_w 50 238.7800000 14.8286678 205.0000000 277.0000000

rostr_w 50 269.9600000 36.1024393 185.0000000 350.0000000

occip_d 50 658.5600000 72.2637347 462.0000000 798.0000000

crest_w 51 108.8627451 39.7839262 21.0000000 203.0000000

foram_w 51 98.7843137 15.6809614 67.0000000 137.0000000

mandi_l 49 1265.16 152.4004247 901.0000000 1648.00

mandi_w 51 135.8039216 13.3611670 101.0000000 174.0000000

mandi_d 51 195.0000000 24.4344020 138.0000000 257.0000000

ramus_h 51 686.6274510 76.1227852 476.0000000 880.0000000

species_dum0 51 1.0000000 0 1.0000000 1.0000000

species_dum1 51 0 0 0 0

species_dum2 51 0 0 0 0

1 48 sex 48 1.5208333 0.5048523 1.0000000 2.0000000

basal_l 47 1476.04 171.0401628 1030.00 1893.00

occip_l 46 1563.24 157.2762869 1121.00 1945.00

palat_l 48 1003.88 129.4495096 665.0000000 1315.00

palat_w 35 245.4285714 33.9179683 172.0000000 319.0000000

nasal_l 47 671.3404255 84.2396366 454.0000000 893.0000000

nasal_w 48 229.2916667 29.4545870 141.0000000 292.0000000

squam_d 48 171.8958333 27.6561252 121.0000000 299.0000000

lacry_w 48 438.4791667 45.3557989 303.0000000 547.0000000

zygom_w 48 850.8333333 71.1517208 640.0000000 994.0000000

orbit_w 48 242.0625000 17.2348068 202.0000000 283.0000000

rostr_w 45 270.4888889 37.1064943 193.0000000 368.0000000

occip_d 41 630.6585366 68.1617964 435.0000000 754.0000000

crest_w 48 115.6250000 42.8344786 13.0000000 216.0000000

foram_w 48 95.1875000 13.8466622 60.0000000 126.0000000

mandi_l 46 1227.50 146.4658853 856.0000000 1568.00

mandi_w 48 134.5208333 12.8708637 101.0000000 163.0000000

mandi_d 48 189.5000000 21.6460895 132.0000000 240.0000000

ramus_h 48 683.8333333 74.3051023 473.0000000 824.0000000

species_dum0 48 0 0 0 0

species_dum1 48 1.0000000 0 1.0000000 1.0000000

species_dum2 48 0 0 0 0

2 52 sex 52 1.5000000 0.5048782 1.0000000 2.0000000

basal_l 52 1501.08 180.1735379 1048.00 1848.00

occip_l 50 1525.58 144.9033584 1145.00 1823.00

palat_l 51 1027.45 133.0591318 693.0000000 1276.00

palat_w 49 257.9795918 32.7461765 182.0000000 328.0000000

nasal_l 51 617.6862745 75.4576677 434.0000000 751.0000000

nasal_w 51 224.7254902 27.5151438 167.0000000 287.0000000

squam_d 52 189.7115385 29.8633072 131.0000000 280.0000000

lacry_w 51 440.6666667 46.3170235 311.0000000 535.0000000

zygom_w 52 912.3846154 78.8762132 725.0000000 1090.00

orbit_w 52 237.6923077 18.5139465 190.0000000 290.0000000

rostr_w 50 274.8000000 36.9986210 173.0000000 371.0000000

occip_d 47 663.0851064 61.1046251 481.0000000 770.0000000

crest_w 50 144.4000000 34.2225609 60.0000000 214.0000000

foram_w 52 89.7115385 14.7679257 48.0000000 128.0000000

mandi_l 44 1258.82 156.6364882 880.0000000 1583.00

mandi_w 50 147.1800000 12.5903915 108.0000000 169.0000000

mandi_d 51 204.3137255 21.1740314 152.0000000 271.0000000

ramus_h 52 728.7500000 80.9923441 511.0000000 880.0000000

species_dum0 52 0 0 0 0

species_dum1 52 0 0 0 0

species_dum2 52 2.0000000 0 2.0000000 2.0000000

------------------------------------------------------------------------------------------
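
PROC MEANS by group is an indirect check of the dummy coding. The output above reports a value of 2 for species_dum2 in species 2, whereas the data step listed earlier assigns 1, so the output may have come from a different version of the code. A cross-tabulation, sketched below, is one way to confirm that each dummy variable takes only the values 0 and 1.

```sas
/*LIST EVERY COMBINATION OF SPECIES AND ITS DUMMY VARIABLES*/
proc freq data=kanga;
  tables species*species_dum0*species_dum1*species_dum2 / list missing;
run;
```

With correct coding, the listing has one row per species, and exactly one dummy variable equals 1 in each row.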

proc sort data=kanga;

by species;

run;

proc boxplot data=kanga;

plot crest_w*species / boxstyle=schematic;

run;

[Figure: schematic boxplots of crest_w by species]

proc reg data=kanga;

model crest_w = species_dum1 species_dum2;

plot residual.*predicted.;

output out=kanga_reg1 p=predict r=resid rstudent=rstudent;

run;quit;

The REG Procedure

Model: MODEL1

Dependent Variable: crest_w

Number of Observations Read 151

Number of Observations Used 149

Number of Observations with Missing Values 2

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 2 35702 17851 11.70 <.0001

...

Variable DF Estimate Error t Value Pr > |t|

Intercept 1 108.86275 5.46963 19.90 <.0001

...

Kolmogorov-Smirnov D ... Pr > D >0.1500

Cramer-von Mises W-Sq 0.089723 Pr > W-Sq 0.1557

Anderson-Darling A-Sq 0.54369 Pr > A-Sq 0.1660

Variable: resid (Residual)

Stem Leaf # Boxplot

10 0 1 |

9 34 2 |

8 4 1 |

7 09 2 |

6 0 1 |

5 44556 5 |

4 00123445 8 |

3 224577889 9 |

2 0022233445666888 16 +-----+

1 01112233333466668999 20 | |

0 1345567778999 13 *--+--*

-0 998766665554443210 18 | |

-1 9855420 7 | |

-2 99988877766655322 17 +-----+

-3 98543300 8 |

-4 64310 5 |

-5 421 3 |

-6 866430 6 |

-7 80 2 |

-8 841 3 |

-9 8 1 |

-10 3 1 |

----+----+----+----+

Multiply Stem.Leaf by 10**+1

Variable: resid (Residual)

Normal Probability Plot

105+ *

| * *++

| *+++

| *+

| +**

| +***

| ****

35+ +***

| ****

| ****

| ****

| ****

| **+

| *****

-35+ ***+

| **

| +**

| +****

| ++*

| ++***

|++*

-105+*

+----+----+----+----+----+----+----+----+----+----+

-2 -1 0 +1 +2

[pic]

[pic]
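
As an alternative to coding the dummy variables by hand, a procedure with a CLASS statement can build them from SPECIES. The sketch below assumes a SAS release in which the PROC GLM CLASS statement accepts the REF= option. REF='0' is chosen to match the model above, where species 0 is the omitted reference group.

```sas
/*SKETCH ONLY: LETS SAS BUILD THE INDICATOR COLUMNS FROM SPECIES*/
/*REF='0' MAKES SPECIES 0 THE REFERENCE LEVEL, AS IN THE PROC REG MODEL*/
proc glm data=kanga;
  class species (ref='0');
  model crest_w = species / solution;
run; quit;
```

Because SPECIES is the only effect in the model, its F test should agree with the overall model F test from PROC REG, and the SOLUTION option prints the estimates for species 1 and species 2 relative to species 0.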
