University of Michigan



SPSS Simple Linear Regression, Using Dummy Variables in Regression, Oneway ANOVA, Multiple Regression

We open the Werner.sav SPSS dataset that we created for the last handout. When you open the dataset through the menus, you can select "Paste" to generate the syntax, just as for other commands. Notice that comments are included for each part of the SPSS code to make the code easier to read later.

GET FILE="C:\Users\kwelch\Desktop\b510\werner.sav".

* Compute body mass index (kg/m**2): weight is converted from pounds to kilograms and height from inches to meters.
COMPUTE bodymass = (weight/2.2026)/(height*.0254)**2 .

EXECUTE .

*-----------------------------------------------------------------------------DESCRIPTIVES.

DESCRIPTIVES

VARIABLES=AGE HEIGHT WEIGHT PILL CHOL ALB CALCIUM ACID PAIR logwt hichol bodymass

/STATISTICS=MEAN STDDEV MIN MAX .

In the output from the descriptive statistics below, notice that the Valid N listwise is 180.

[Descriptive Statistics table: mean, standard deviation, minimum, maximum, and valid N for each variable]
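The next table shows the Pearson correlation between CHOL and AGE. A CORRELATIONS command along the following lines would produce it; this is only a sketch (the exact subcommands are an assumption; pairwise deletion is assumed, which is consistent with the differing Ns of 186 and 188 in the table):

*-----------------------------------------------------------------------CORRELATION OF CHOL WITH AGE (sketch).

CORRELATIONS
  /VARIABLES=CHOL AGE
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.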

| | |CHOL |AGE |

|CHOL |Pearson Correlation |1 |.368** |

| |Sig. (2-tailed) | |.000 |

| |N |186 |186 |

|AGE |Pearson Correlation |.368** |1 |

| |Sig. (2-tailed) |.000 | |

| |N |186 |188 |

|**. Correlation is significant at the 0.01 level (2-tailed). |

We now look at a scatterplot to get an idea of the relationship between these two variables. We use Age as the X variable (listed first in the scatterplot statement), and Chol as the Y variable, which is listed second in the scatterplot statement.

*-------------------------------------------------------------------------------SCATTER PLOT.

GRAPH

/SCATTERPLOT(BIVAR)=AGE WITH CHOL

/MISSING=LISTWISE.

[pic: scatterplot of CHOL vs. AGE]

We now fit a simple linear regression model with Cholesterol as the outcome or dependent variable, and Age as the predictor. We use simplified syntax, rather than that supplied automatically by SPSS paste, so we can request specific plots and other items that we wish to examine. We save the predicted values (PRED), residuals (RESID), and the studentized deleted residuals (SDRESID).

*-------------------------------------------------------------------------------SIMPLE LINEAR REGRESSION.

REGRESSION

/DEPENDENT CHOL

/METHOD=ENTER AGE

/SCATTERPLOT=(*SDRESID ,*PRED)

/RESIDUALS HIST(SDRESID) NORM(SDRESID)

/SAVE PRED RESID SDRESID.

Regression

|Variables Entered/Removed(b) |

|Model |Variables Entered |Variables Removed |Method |

|1 |AGE(a) |. |Enter |

|a. All requested variables entered. |

|b. Dependent Variable: CHOL |

|Model Summary(b) |
|Model |R |R Square |Adjusted R Square |Std. Error of the Estimate |
|1 |.368(a) |.135 |.131 |39.680 |

|a. Predictors: (Constant), AGE |

|b. Dependent Variable: CHOL |

[ANOVA table; Dependent Variable: CHOL]

[Coefficients table for the regression of CHOL on AGE]

[Residuals Statistics table]

The histogram of the studentized-deleted residuals appears to be quite symmetric, and the normal p-p plot is reasonable.

[pic: histogram of studentized deleted residuals] [pic: normal P-P plot of studentized deleted residuals] [pic: studentized deleted residuals vs. standardized predicted values]

We now examine a scatterplot of the saved studentized-deleted residuals vs. the predicted values to check for heteroskedasticity. After creating the scatterplot, we include a loess fit to the residuals, just to check for any remaining pattern. These residuals look good. There is no apparent pattern that has been missed by our regression model fit, and the variance of the residuals appears to be relatively constant at all predicted values.

*-------------SCATTERPLOT of SAVED RESIDUALS (Y) BY PREDICTED (X) .

GRAPH

/SCATTERPLOT(BIVAR)=PRE_1 WITH SDR_1

/MISSING=LISTWISE .

[pic: scatterplot of SDR_1 vs. PRE_1 with a loess fit line]

We can get a test of normality for the residuals if we save them. We use the studentized-deleted residuals, called SDR_1 by SPSS. Even though the p-value for the Shapiro-Wilk test is marginally significant, the distribution of these residuals looks quite good for normality based on the normal Q-Q plot. Notice that the Q-Q plot shows deviations from the expected normal line at the endpoints of the distribution; these did not appear in the P-P plot generated by the REGRESSION command because, by the way that plot is constructed, its endpoints are anchored to the normal line.

*-----------------------------------------------------------------------CHECK NORMALITY OF RESIDUALS

(Analyze...Descriptive Statistics...Explore).

EXAMINE VARIABLES=SDR_1

/PLOT BOXPLOT STEMLEAF NPPLOT

/COMPARE GROUP

/STATISTICS DESCRIPTIVES

/CINTERVAL 95

/MISSING LISTWISE

/NOTOTAL.

[Tests of Normality table for SDR_1: Kolmogorov-Smirnov and Shapiro-Wilk statistics. *. This is a lower bound of the true significance.]

[pic: normal Q-Q plot of SDR_1]

Dummy Variable Regression:

We now investigate a model in which we will use age categories to predict cholesterol, rather than the "continuous" version of age. Before we can fit this model using Regression, we must first create the categorical variable, AGECAT, and then create dummy variables for all but the reference level of AGECAT to be used in the linear regression model.

*------------------------------------CREATE AGECAT (Four Categories).

RECODE

AGE

(Lowest thru 24=1) (25 thru 31=2) (32 thru 41=3) (42 thru Highest=4) INTO AGECAT .

EXECUTE .

VALUE LABELS
AGECAT 1 '< 25' 2 '25-31' 3 '32-41' 4 '>=42'.

*------------------------------------------------------------------------------CHECK RECODED VALUES.

FREQUENCIES VARIABLES=AGECAT

/ORDER=ANALYSIS.

[AGECAT frequency table]
[CHOL by AGECAT output]
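One way to create the dummy variables AGEDUM2 through AGEDUM4 (named as in the output below, with AGECAT=1 serving as the reference category) and fit the regression is sketched here; the handout may have used different but equivalent commands (for example, IF or RECODE statements):

*------------------------------------CREATE DUMMY VARIABLES FOR AGECAT (sketch; AGECAT=1 is the reference).
* Each COMPUTE yields 1 when the condition in parentheses is true and 0 otherwise.
COMPUTE agedum2 = (agecat = 2).
COMPUTE agedum3 = (agecat = 3).
COMPUTE agedum4 = (agecat = 4).
EXECUTE .

*------------------------------------REGRESSION OF CHOL ON THE DUMMY VARIABLES (sketch).
REGRESSION
  /DEPENDENT CHOL
  /METHOD=ENTER agedum2 agedum3 agedum4
  /SCATTERPLOT=(*SDRESID ,*PRED)
  /RESIDUALS HIST(SDRESID) NORM(SDRESID).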

|Variables Entered/Removed |
|Model |Variables Entered |Variables Removed |Method |
|1 |AGEDUM4, AGEDUM2, AGEDUM3(a) |. |Enter |

|a. All requested variables entered. |

|Model Summary(b) |
|Model |R |R Square |Adjusted R Square |Std. Error of the Estimate |
|1 |.327(a) |.107 |.092 |40.549 |

|a. Predictors: (Constant), AGEDUM4, AGEDUM2, AGEDUM3 |

|b. Dependent Variable: CHOL |

[ANOVA table; Dependent Variable: CHOL]

The coefficients in the model give the difference in mean cholesterol for each age category, compared to the reference category, AGECAT=1. Notice that cholesterol is estimated to be higher in each age category than in AGECAT=1, but the difference is statistically significant only for AGECAT=3 and AGECAT=4.
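In symbols (notation mine, not from the handout), the fitted model is

$$\widehat{\mathrm{CHOL}} = b_0 + b_2\,\mathrm{AGEDUM2} + b_3\,\mathrm{AGEDUM3} + b_4\,\mathrm{AGEDUM4},$$

so the estimated mean cholesterol for AGECAT=1 is $b_0$, and $b_k$ is the estimated difference in mean cholesterol between AGECAT=$k$ and AGECAT=1.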

[Coefficients table for the dummy variable regression]

Again, we check the residuals for normality, using a histogram of the studentized-deleted residuals, and a normal P-P plot.

[pic: histogram of studentized deleted residuals] [pic: normal P-P plot of studentized deleted residuals]

We plot the studentized deleted residuals vs. the predicted values to check for heteroskedasticity.

Notice in the scatterplot below that the variance of the residuals is similar for each level of AGECAT.

[pic: studentized deleted residuals vs. predicted values for the dummy variable model]

Oneway ANOVA:

We now fit a oneway ANOVA model with AGECAT as a factor, using the UNIANOVA (GLM Univariate) procedure. This model is easier to fit in SPSS than the equivalent model using Regression, because we don't need to set up the dummy variables first. We can also get post-hoc tests to compare the mean of CHOL across the levels of AGECAT.

*---------------------------------------------ONEWAY ANOVA USING AGECAT AS A FACTOR.

UNIANOVA CHOL BY AGECAT

/METHOD=SSTYPE(3)

/INTERCEPT=INCLUDE

/POSTHOC=AGECAT(TUKEY)

/PLOT = PROFILE( AGECAT )

/CRITERIA=ALPHA(0.05)

/DESIGN=AGECAT.
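An equivalent analysis can also be requested with the ONEWAY procedure; a rough sketch (not part of the original handout) that gives the ANOVA table and the same Tukey comparisons is:

*---------------------------------------------EQUIVALENT ONEWAY COMMAND (sketch).
ONEWAY CHOL BY AGECAT
  /STATISTICS=DESCRIPTIVES
  /POSTHOC=TUKEY ALPHA(0.05).

Note that the UNIANOVA syntax above additionally produces the profile plot via the /PLOT subcommand.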

|Between-Subjects Factors |

| | |Value Label |N |

|AGECAT |1.00 |< 25 |43 |

| |2.00 |25-31 |45 |

| |3.00 |32-41 |50 |

| |4.00 |>=42 |48 |

[Tests of Between-Subjects Effects table; Dependent Variable: CHOL]

In the post hoc tests output shown below, we see that the mean of cholesterol is significantly higher for AGECAT=4 than for AGECAT=1, 2, and 3, when we use an alpha of 0.05. The p-values for these differences have been adjusted for multiple comparisons, using the Tukey method.

Post Hoc Tests

AGECAT

[Multiple Comparisons table: Tukey HSD pairwise comparisons of mean CHOL between AGECAT levels. *. The mean difference is significant at the 0.05 level.]

The homogeneous subsets output is another way of displaying the differences among means.

Homogeneous Subsets

|CHOL |

|Tukey HSD(a,b,c) |

|AGECAT |N |Subset 1 |Subset 2 |

|1.00 < 25 |43 |218.44 | |

|2.00 25-31 |45 |231.24 | |

|3.00 32-41 |50 |235.62 |235.62 |

|4.00 >=42 |48 | |257.17 |

|Sig. | |.178 |.055 |

|Means for groups in homogeneous subsets are displayed. |

|Based on observed means. |

|The error term is Mean Square(Error) = 1644.216. |

|a. Uses Harmonic Mean Sample Size = 46.344. |

|b. The group sizes are unequal. The harmonic mean of the group sizes is used. Type I error levels are not guaranteed. |

|c. Alpha = 0.05. |
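The harmonic mean sample size reported in footnote a can be verified from the four group sizes:

$$n_h = \frac{4}{\frac{1}{43}+\frac{1}{45}+\frac{1}{50}+\frac{1}{48}} \approx 46.344.$$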

The profile plot shows the means of each level of AGECAT, joined by lines. This plot is not as helpful as the boxplot that we generated earlier, because it doesn't give an idea of the variability at each level of AGECAT.

[pic: profile plot of mean CHOL at each level of AGECAT]

Multiple Regression:

We now examine a multiple regression problem. The first model we fit will have collinear predictors. We create a new variable, WTALB, that is the sum of WEIGHT and ALB, plus a random normal variate with mean 0 and standard deviation 1. We next check a correlation matrix and create a scatterplot matrix.

*--------------------------------------------CORRELATION MATRIX.

COMPUTE wtalb=weight+alb+RV.Normal(0,1).

EXECUTE.

For the correlation matrix, we use the Missing=Listwise subcommand, so that all cases included in the correlation matrix have complete data on the entire list of variables. This tells us how many observations will be available for a regression model fitted using these variables. The Missing=Pairwise option instead uses all of the observations available for each pair of variables, so the N can differ from pair to pair.

CORRELATIONS

/VARIABLES=CHOL AGE ACID ALB WEIGHT WTALB

/PRINT=TWOTAIL NOSIG

/MISSING=LISTWISE .

[Correlations table for CHOL, AGE, ACID, ALB, WEIGHT, and WTALB. *. Correlation is significant at the 0.05 level (2-tailed). a. Listwise N=181.]
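For comparison, a pairwise version of the same command (a sketch, not part of the original handout) would change only the MISSING subcommand:

CORRELATIONS
  /VARIABLES=CHOL AGE ACID ALB WEIGHT WTALB
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE .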

[pic: scatterplot matrix]

We now fit the multiple linear regression model, with collinearity. Note the high values of VIF for WEIGHT and WTALB (both are > 10, indicating collinearity). In the collinearity diagnostics table you can also see that the eigenvalue for dimension 6 is very small (.0000244), that the condition index for dimension 6 is very high (490.385), and that both WEIGHT and WTALB are reported to have all of their variance (proportion of variance = 1.0) loading on this dimension. This is another indication of severe collinearity.
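As a reminder (a standard definition, not shown in the handout), the variance inflation factor for predictor $j$ is

$$\mathrm{VIF}_j = \frac{1}{1-R_j^2},$$

where $R_j^2$ is the R square obtained by regressing predictor $j$ on all of the other predictors; values above 10 are the usual rule of thumb for serious collinearity.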

*--------------------------------------------MULTIPLE REGRESSION.

REGRESSION

/STATISTICS = DEFAULT COLLIN

/DEPENDENT CHOL

/METHOD=ENTER AGE ACID ALB WEIGHT wtalb.

|Variables Entered/Removed |

|Model |Variables Entered |Variables Removed |Method |

|1 |wtalb, ALB, AGE, ACID, WEIGHT(a) |. |Enter |

|a. All requested variables entered. |

|Model Summary |

|Model |R |R Square |Adjusted R Square |Std. Error of the Estimate |
|1 |.422(a) |.178 |.155 |39.198 |

|a. Predictors: (Constant), wtalb, ALB, AGE, ACID, WEIGHT |

[ANOVA table; Dependent Variable: CHOL]

[Coefficients table, including VIF values]

[Collinearity Diagnostics table]

We now refit the model, but delete the WTALB variable, and check model diagnostics. We check a scatterplot of the leverage vs. the studentized deleted residuals, and get a listing of the 10 largest values of the studentized deleted residuals, leverage, and Cook's distance.

Observations with a large absolute value of the residual are ones that have an unusual value of the response (Y), given the X variables in the model. Cases with a large leverage are extreme in the X space (far from the overall mean of all X variables). The influential cases are those that are high in both leverage and residual. Cook's distance is an overall measure of the influence of a case. If an observation is highly influential, it may change the results of the regression model fit. Influential cases should be checked to see why they are unusual: they may have incorrect values, or they may just be very different for some reason.
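For reference (standard definitions, not shown in the handout), the leverage $h_{ii}$ is the $i$th diagonal element of the hat matrix (SPSS saves and reports the centered leverage, $h_{ii}-1/n$), and Cook's distance combines the size of the residual with the leverage:

$$D_i = \frac{e_i^2}{p\,\hat{\sigma}^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2},$$

where $e_i$ is the residual for case $i$, $p$ is the number of estimated coefficients (including the intercept), and $\hat{\sigma}^2$ is the mean squared error.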

We also request Partial Regression plots with the /Partialplot subcommand. This shows us the relationship between CHOL and each predictor, after adjusting for all other predictors in the model.

*--------------------------------------------MULTIPLE REGRESSION WITH DIAGNOSTICS.

REGRESSION

/STATISTICS=DEFAULT COLLIN

/DEPENDENT CHOL

/METHOD=ENTER AGE ACID ALB WEIGHT

/SCATTERPLOT=(*SDRESID ,*PRED) (*LEVER, *SDRESID)

/PARTIALPLOT

/RESIDUALS = HISTOGRAM(SDRESID) OUTLIERS(SDRESID LEVER COOK).

|Variables Entered/Removed |

|Model |Variables Entered |Variables Removed |Method |

|1 |WEIGHT, ALB, AGE, ACID(a) |. |Enter |

|a. All requested variables entered. |

|Model Summary(b) |
|Model |R |R Square |Adjusted R Square |Std. Error of the Estimate |
|1 |.420(a) |.177 |.158 |39.119 |

|a. Predictors: (Constant), WEIGHT, ALB, AGE, ACID |

|b. Dependent Variable: CHOL |

[ANOVA table; Dependent Variable: CHOL]

[Coefficients table: unstandardized and standardized coefficients, t, and Sig.]

[Collinearity Diagnostics table: eigenvalues, condition indices, and variance proportions]

|Outlier Statistics(a) |

| | |Case Number |Statistic |Sig. F |

|Stud. Deleted Residual |1 |182 |3.234 | |

| |2 |159 |-2.603 | |

| |3 |47 |2.531 | |

| |4 |70 |2.521 | |

| |5 |178 |2.054 | |

| |6 |162 |1.988 | |

| |7 |149 |1.962 | |

| |8 |135 |-1.957 | |

| |9 |144 |1.893 | |

| |10 |60 |1.768 | |

|Cook's Distance |1 |182 |.069 |.997 |

| |2 |60 |.067 |.997 |

| |3 |159 |.053 |.998 |

| |4 |47 |.043 |.999 |

| |5 |70 |.039 |.999 |

| |6 |138 |.032 |.999 |

| |7 |178 |.023 |1.000 |

| |8 |26 |.023 |1.000 |

| |9 |162 |.020 |1.000 |

| |10 |188 |.019 |1.000 |

|Centered Leverage Value |1 |138 |.141 | |

| |2 |142 |.119 | |

| |3 |94 |.095 | |

| |4 |60 |.093 | |

| |5 |6 |.086 | |

| |6 |166 |.083 | |

| |7 |21 |.068 | |

| |8 |129 |.067 | |

| |9 |2 |.061 | |

| |10 |184 |.055 | |

|a. Dependent Variable: CHOL |

[pic: histogram of studentized deleted residuals] [pic: scatterplots of studentized deleted residuals vs. predicted values and leverage vs. studentized deleted residuals] [pic: partial regression plots for AGE, ACID, ALB, and WEIGHT]
