SECTION 2



Chapter 2

Looking at Data – Relationships between Y and X

We will now look at determining whether there is an association between two quantitative variables. To study the relationship we measure both variables on the same unit. For example, we obtain the height and weight for each of 100 adult males.

Two variables are associated if one variable tends to change in a systematic way when the other variable is changed. For example, as height increases weight tends to also increase.

But there is a caution: two variables might appear to be associated but actually both of them might be affected by a third variable called a lurking variable.

For example, as children grow up their shoe size increases and their spelling ability increases. Spelling and shoe size may appear to be associated but actually both are being driven by a third variable, age.

A response variable, also called the dependent or “Y” variable, measures an outcome of a study. An explanatory variable, also called an independent or “X” variable, is thought to “explain” or “cause” these changes.

Example:

The energy required to heat or cool an office building (response variable) depends on the outside temperature (explanatory variable).

Use the following procedure to examine the relationship between two quantitative variables:

1. Graph the data on a scatterplot. A scatterplot is an X-Y plot with the Response Variable on the Y axis and the Explanatory Variable on the X axis.

2. Describe the pattern shown on the scatterplot according to:

Form: Linear or non-linear pattern

Direction: Positive or negative (algebraic slope). Positive means that Y increases as X increases.

Strength: How tightly the data points cluster around a line or curve. The rating is subjective: strong, moderate, or weak.

3. Look for outliers (points which do not fit the pattern).

4. If the data is reasonably linear, get an equation for the best fitting line using least squares regression and get the correlation, a numerical rating of strength for linear associations only.

5. Look at a residual plot to see if the residuals form a random pattern as X increases. We do not want to see a systematic pattern.

6. Look at a normal probability plot of residuals to determine whether the residuals are normally distributed. (Dots sticking close to the 45 degree line indicate normality.)

7. Look at hypothesis tests for the slope and intercept to test whether either could equal 0. Look at confidence intervals for the slope and intercept.

8. If you had an outlier, you should re-work the data without the outlier and comment on the differences in your results.
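The rest of this chapter carries these steps out in SPSS. As a compact side-by-side reference, the same workflow can be sketched in Python. This is only a minimal illustration with placeholder x and y data, and it assumes numpy, scipy, and matplotlib are installed:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Placeholder data: x = explanatory variable, y = response variable
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    # Step 1: scatterplot, response on the Y axis, explanatory on the X axis
    plt.scatter(x, y)
    plt.xlabel("explanatory (x)")
    plt.ylabel("response (y)")

    # Step 4: least-squares line and correlation
    fit = stats.linregress(x, y)
    print(fit.slope, fit.intercept, fit.rvalue)

    # Step 5: residual plot -- we want a random scatter around zero
    residuals = y - (fit.intercept + fit.slope * x)
    plt.figure()
    plt.scatter(x, residuals)
    plt.axhline(0)

    # Step 6: normal probability plot of the residuals
    plt.figure()
    stats.probplot(residuals, plot=plt)

    # Step 7: the reported p-value tests whether the slope could be 0
    print(fit.pvalue)
    plt.show()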

Example:

There is evidence that drinking moderate amounts of wine helps prevent heart attacks. The following table shows the yearly alcohol consumption from wine (liters of alcohol per person) and yearly deaths from heart disease (per 100,000 people) in 19 developed nations.

|Country |Alcohol from wine |Heart disease deaths |
|Australia |2.5 |211 |
|Austria |3.9 |167 |
|Belgium |2.9 |131 |
|Canada |2.4 |191 |
|Denmark |2.9 |220 |
|Finland |0.8 |297 |
|France |9.1 |71 |
|Iceland |0.8 |211 |
|Ireland |0.7 |300 |
|Italy |7.9 |107 |
|Netherlands |1.8 |167 |
|New Zealand |1.9 |266 |
|Norway |0.8 |227 |
|Spain |6.5 |86 |
|Sweden |1.6 |207 |
|Switzerland |5.8 |115 |
|United Kingdom |1.3 |285 |
|United States |1.2 |199 |
|West Germany |2.7 |172 |

SPSS Instructions: First, enter the data in SPSS, one column for the X/independent/explanatory variable data, another column for the Y/dependent/response variable data. Then Graphs > Legacy Dialogs > Scatter/Dot > Simple Scatter, move alcohol to the X Axis box, move deaths to the Y Axis box. Below is the scatterplot:

[Scatterplot: heart disease deaths vs. alcohol from wine]

Form? Linear

Direction? Negative

Strength? Moderate

Causal effect? Cannot determine without a well designed experiment.
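If you want to reproduce this scatterplot outside SPSS, here is a short sketch in Python using the data from the table above (assuming matplotlib is installed):

    import matplotlib.pyplot as plt

    # The 19 nations, in the order listed in the table
    alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
               1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
    deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
              167, 266, 227, 86, 207, 115, 285, 199, 172]

    plt.scatter(alcohol, deaths)
    plt.xlabel("Alcohol from wine")
    plt.ylabel("Heart disease deaths")
    plt.show()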

Scatterplots display the association between two quantitative variables. What if you have a third categorical variable?

To add a categorical variable to a scatterplot, use a different plot color or symbol for each category.

Correlation

The correlation quantifies the direction and strength of the linear relationship between two quantitative variables. Correlation is usually written as r.

Formula for correlation: r = (1/(n-1)) Σ [(xi - xbar)/sx][(yi - ybar)/sy], where sx and sy are the sample standard deviations of x and y.

Correlation is calculated by SPSS and is easily done on a calculator.
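To make the formula concrete, here is a direct implementation in Python, checked against numpy's built-in calculation. The small x and y lists are hypothetical:

    import numpy as np

    def correlation(x, y):
        """r = (1/(n-1)) * sum of the products of the standardized values."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        zx = (x - x.mean()) / x.std(ddof=1)   # standardized x values
        zy = (y - y.mean()) / y.std(ddof=1)   # standardized y values
        return (zx * zy).sum() / (len(x) - 1)

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 6]
    print(correlation(x, y))        # formula above
    print(np.corrcoef(x, y)[0, 1])  # numpy gives the same value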

Properties of correlation:

• The correlation, r, is always between -1 and 1. Values near 0 indicate a weak linear relationship and values close to 1 or -1 indicate a strong linear relationship.

• The sign of the correlation always is the same as the sign of the slope.

• A positive r corresponds to a positive relationship between the variables. A negative r corresponds to a negative relationship between the variables.

• It makes no difference which variable you call x and which you call y. You could reverse X and Y and the correlation would remain unchanged.

• Both variables need to be quantitative to calculate correlation.

• The correlation r does not change if we change the units of measurement of x, y, or both.

• Correlation measures the strength of a linear relationship only.

• Like the mean and standard deviation, the correlation is not resistant. The correlation r is strongly affected by outlying observations: sometimes an outlier increases the correlation, and sometimes it decreases it. Use r with caution when outliers appear in the scatterplot.

• Correlation is not a complete description of two-variable data. You should give the mean and standard deviations of both x and y along with the correlation.
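Two of these properties are easy to verify numerically: reversing X and Y leaves r unchanged, and so does changing units (a hypothetical inches-to-centimeters rescaling of x below):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

    r = np.corrcoef(x, y)[0, 1]
    print(np.isclose(r, np.corrcoef(y, x)[0, 1]))         # swap X and Y
    print(np.isclose(r, np.corrcoef(x * 2.54, y)[0, 1]))  # change units of x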

To determine the correlation within SPSS: Analyze>Regression>Linear, move deaths to the dependent box, move alcohol to the independent box. Click “OK”.

The R in this regression output is actually the ABSOLUTE VALUE OF r, so you need to determine the sign yourself from the direction of the relationship.

The Pearson Correlation gives you the actual r with the correct sign. To get the Pearson Correlation output from SPSS: Analyze > Correlate > Bivariate, move the two variables to the Variables box; Pearson correlation is the default selection; click OK.

Example:

For the Alcohol consumption from Wine and Heart Attacks scatterplot, the correlation is:

[SPSS Model Summary output: R = .843]

SPSS reports the absolute value of the correlation, R = .843 (the negative sign is not shown).

Pearson correlation r = -0.843 shows the correct sign.

Least-Squares Regression:

If a scatterplot shows that the relationship is linear, we can determine the linear equation or regression line which is the best fit for the data. We can use this regression line to predict the value of the response variable y for a given value of the explanatory variable x.

When doing regression, it does matter which variable is on the y axis and which is on the x axis. The predictor or explanatory variable must be on the x axis and the variable being predicted (response variable) must be on the y axis.

Least-Squares Regression fits a straight line through the data points that minimizes the sum of the squared vertical distances between the data points and the straight line.

• Minimizes Σ (y - yhat)², the sum of the squared vertical distances between the data points and the line, where yhat is the predicted value of y.

• Equation of the line is: yhat = a + bx

• Slope of the line is: b = r (sy / sx), where the slope measures the predicted change in the response variable when the explanatory variable is increased by one unit.

• The point (xbar, ybar) always lies on the line. Therefore ybar = a + b(xbar).

• Intercept of the line is: a = ybar - b(xbar), where the intercept is the value of Y when X = 0.
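Notice that the slope and intercept formulas need only five summary numbers: xbar, ybar, sx, sy, and r. As a sketch, here they are applied in Python to the Kalama child-growth summary statistics reported later in this chapter:

    # b = r * (sy / sx) and a = ybar - b * xbar
    r, x_bar, s_x = 0.994, 23.50, 3.606
    y_bar, s_y = 79.850, 2.3024

    b = r * (s_y / s_x)
    a = y_bar - b * x_bar
    print(b, a)  # about 0.635 and 64.93, matching the SPSS coefficients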

Prediction:

We use the regression line (or equation) to predict the response variable y for a specific value of the explanatory variable x.

Extrapolation:

Extrapolation is the use of the regression line for prediction of y for an x value which is far outside the original range of x values studied. Such predictions are questionable because the association between x and y may not remain linear as x is extended beyond the original range.

Example: If you studied a child's height from ages 2-10 years, a straight line would probably fit the data very well, but this line cannot be used to predict the height at age 40 because the association does not remain linear past the age of 16 or 17.

We will finish the Alcohol From Wine vs Heart Disease Deaths example.

Recall, there was a correlation of -.843, showing a moderate association between alcohol consumption from wine and heart disease deaths.

To determine the regression coefficients within SPSS: Analyze>Regression>Linear, move deaths to the dependent box, move alcohol to the independent box. Click “OK”.

[SPSS Coefficients output: Constant = 260.563, alcohol = -22.969]

To add the regression line within SPSS: Double left click on your scatterplot, the chart editor will appear; within the chart editor, right click on any point and select “Add fit line at total”; the regression line will be added within the chart editor; exit the chart editor and the regression line will be added to your original scatterplot.

[Scatterplot of deaths vs. alcohol with the fitted regression line; R² Linear = .71]

The regression equation of the line is:

Y (deaths) = 260.563 – 22.969 (X or alcohol consumption)

r² in regression:

The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

r² measures how successfully the regression explains the response. The closer r² is to 1.0, the better the regression. The value of r² is shown on the scatterplot by SPSS.

r² = .71 in the graph above.

Example:

The mean height and age of children in Kalama, Egypt. (Book, pg 136)

|Age x (months) |Height y (centimeters) |
|18 |76.1 |
|19 |77.0 |
|20 |78.1 |
|21 |78.2 |
|22 |78.8 |
|23 |79.7 |
|24 |79.9 |
|25 |81.1 |
|26 |81.2 |
|27 |81.8 |
|28 |82.8 |
|29 |83.5 |

[Scatterplot: height vs. age for the Kalama children, with fitted regression line]

Model Summary

|Model |R |R Square |Adjusted R Square |Std. Error of the Estimate |
|1 |.994 |.989 |.988 |.2560 |

a Predictors: (Constant), Age of Kalama child (months)

Statistics

| | |Age of Kalama child (months) |Height of Kalama child (cm) |
|N |Valid |12 |12 |
| |Missing |0 |0 |
|Mean | |23.50 |79.850 |
|Std. Deviation | |3.606 |2.3024 |

Coefficients

|Model | |Unstandardized Coefficients B |Std. Error |Standardized Coefficients Beta |t |Sig. |
|1 |(Constant) |64.928 |.508 | |127.709 |.000 |
| |Age of Kalama child (months) |.635 |.021 |.994 |29.665 |.000 |

a Dependent Variable: Height of Kalama child (cm)
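As a check, the same coefficients can be reproduced outside SPSS. A minimal sketch with scipy (assuming it is installed):

    from scipy import stats

    age = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
    height = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
              79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

    fit = stats.linregress(age, height)
    print(fit.intercept, fit.slope, fit.rvalue)
    # about 64.928, 0.635, and 0.994 -- matching the SPSS output above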

Cautions about Regression and Correlation:

Correlation and regression are common statistical tools. There are limitations to their use that we must understand before we can predict outcomes or determine causation.

PRECAUTIONS TO TAKE TO EVALUATE THE REGRESSION:

1. Look at Residuals and Residual Plots

A residual is the difference between an observed value of the response variable and the value predicted by the regression line.

residual = observed value of y – predicted value of y

A residual plot is a graph of the regression residuals plotted against the explanatory variable x. Residual plots help us assess the fit of a regression line. We would like to see a random pattern for the residuals as we sweep across the x axis.

We would also like to see the magnitude of the residuals be approximately constant as we sweep across the x axis. This would indicate that the error in prediction would be approximately constant for all values of x.

There should not be any systematic pattern in the residual plot.

One systematic pattern which sometimes occurs is a funnel or wedge pattern where the residuals increase in magnitude as the x values increase. This would happen if the error in prediction is a certain percent of the x value.

Example:

Do a residual plot for the Alcohol from Wine and Heart disease deaths example.

The least squares regression equation was:

Y (heart disease deaths) = 260.563 – 22.969 (alcohol from wine)

We can also compute a predicted value for any value of X, the explanatory/independent variable, that exists in our data. For example, the predicted value of Y (deaths) when X (alcohol consumption) = 2.5 would be:

Y (deaths) = 260.563 – 22.969 (2.5) = 203.141

We find the residual when X (alcohol consumption) = 2.5 and when Y (deaths) = 211 by:

Residual = Observed value – Predicted value

Residual = 211 - 203.141 = 7.859. Here are all the residuals:

|Alcohol |Deaths |Predicted value |Residual |
|2.5 |211 |203.14146 |7.85854 |
|3.9 |167 |170.98518 |-3.98518 |
|2.9 |131 |193.95395 |-62.95395 |
|2.4 |191 |205.43833 |-14.43833 |
|2.9 |220 |193.95395 |26.04605 |
|0.8 |297 |242.18836 |54.81164 |
|9.1 |71 |51.54759 |19.45241 |
|0.8 |211 |242.18836 |-31.18836 |
|0.7 |300 |244.48524 |55.51476 |
|7.9 |107 |79.11011 |27.88989 |
|1.8 |167 |219.21959 |-52.21959 |
|1.9 |266 |216.92272 |49.07728 |
|0.8 |227 |242.18836 |-15.18836 |
|6.5 |86 |111.26639 |-25.26639 |
|1.6 |207 |223.81335 |-16.81335 |
|5.8 |115 |127.34452 |-12.34452 |
|1.3 |285 |230.70398 |54.29602 |
|1.2 |199 |233.00085 |-34.00085 |
|2.7 |172 |198.54770 |-26.54770 |

We can then plot the X variable vs the residual to get a residual plot:

Within SPSS: To do a residual plot, you need to have the residuals within your data. When you do the regression, ask SPSS to calculate and save your predicted values and your residuals like this: Analyze > Regression > Linear; move the variables to the appropriate boxes; click on the “Save” button, check “Unstandardized” under Predicted Values and check “Unstandardized” under Residuals, then click Continue. These values will then appear as additional columns in the Data view.

Then generate a residual plot of the X variable and the residual value: Graphs > Legacy Dialogs > Scatter/Dot > Simple Scatter, move alcohol to the X Axis box, remove deaths from the Y Axis box, move unstandardized residuals to the Y Axis box, click OK. To add a “0” reference line to the plot, double left click on your residual plot and the chart editor will appear; within the chart editor, right click on any point and select “Add reference line for Y axis”; select the reference line tab and enter “0” as the reference value; click Continue, and the reference line will be added within the chart editor; exit the chart editor and the reference line will be added to your original residual plot.

[Residual plot: residuals vs. alcohol from wine, with a 0 reference line]
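Outside SPSS, the same residual plot can be produced directly from the fitted line. A sketch with scipy and matplotlib, reusing the alcohol and deaths lists shown earlier:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    alcohol = np.array([2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7])
    deaths = np.array([211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                       167, 266, 227, 86, 207, 115, 285, 199, 172])

    fit = stats.linregress(alcohol, deaths)
    residuals = deaths - (fit.intercept + fit.slope * alcohol)  # observed - predicted

    plt.scatter(alcohol, residuals)
    plt.axhline(0)  # the "0" reference line
    plt.xlabel("Alcohol from wine")
    plt.ylabel("Residual")
    plt.show()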

2. Determine if there are lurking variables

A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.

A lurking variable could be a categorical variable that affects one or both of the variables you are interested in. Or a lurking variable could be a change over time.

Example:

Recall the alcohol from wine and heart disease study/example we discussed. Income level was a lurking variable there, a variable that affected both other variables.

3. Ask whether there are striking individual points (outliers or influential observations) in your scatterplot and residual plot.

An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction of a scatterplot have large regression residuals, but outliers in the x direction need not have large residuals.

An observation is influential for a statistical calculation if removing it would markedly change the result (in particular, the slope of the line). The farther a point is from the mean of the X values, xbar, the more influence that point has on the slope of the least-squares regression line.
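A quick numerical check for influence is to refit the line with the suspect point removed and compare slopes. A sketch using the wine data, where France (x = 9.1) lies farthest from xbar:

    import numpy as np
    from scipy import stats

    x = np.array([2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                  1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7])
    y = np.array([211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                  167, 266, 227, 86, 207, 115, 285, 199, 172])

    full = stats.linregress(x, y)
    keep = np.arange(len(x)) != 6          # drop France (index 6, x = 9.1)
    reduced = stats.linregress(x[keep], y[keep])

    # A marked change in slope flags the point as influential
    print(full.slope, reduced.slope)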

Example:

This example will use data that is part of a data set from Dr. T.N.K. Raju, Department of Neonatology, University of Illinois at Chicago.

IMR = Infant Mortality Rate

PQLI = Physical Quality of Life Index (indicator of average wealth)

Data:

|Case |PQLI |IMR |
|1 |17 |110 |
|2 |24 |78 |
|3 |28 |88 |
|4 |29 |135 |
|5 |29 |55 |
|6 |33 |138 |
|7 |33 |79 |
|8 |35 |133 |
|9 |36 |120 |
|10 |36 |92 |
|11 |43 |125 |

Before doing a regression or finding the correlation, it is always important to make a scatterplot of the data. Note: The physical quality of life index could be used to predict infant mortality rate, so PQLI is the explanatory variable and is plotted on the x-axis.

[Scatterplot: IMR vs. PQLI]

Note that the relationship looks weak.

Below is some of the output.

Model Summary

|Model |R |R Square |Adjusted R Square |Std. Error of the Estimate |
|1 |.303 |.092 |-.009 |28.025 |

a Predictors: (Constant), PQLI

The correlation r = 0.303434

The proportion of variation in IMR that is explained by the regression of IMR on PQLI is r² = 0.092072, or about 9.2%.

SPSS will show the best fitting straight line. Double left click on the scatterplot; the chart editor will appear; within the chart editor, right click on any point and select “Add fit line at total”; the regression line will be added within the chart editor; exit the chart editor and the regression line will be added to the scatterplot.

[Scatterplot: IMR vs. PQLI with fitted regression line]

Coefficients

|Model | |Unstandardized Coefficients B |Std. Error |Standardized Coefficients Beta |t |Sig. |
|1 |(Constant) |66.698 |40.787 | |1.635 |.136 |
| |PQLI |1.223 |1.280 |.303 |.955 |.364 |

a Dependent Variable: IMR

The equation of the Least squares regression line is:

IMR = 66.70 + 1.22(PQLI)

Residual plot:

[Residual plot: residuals vs. PQLI]

Thinking about the relationship shown on the graph and the equation, would it make sense that the IMR would go up as the PQLI improved? What could be a potential lurking variable here?

Now let’s look at what happens if we add a categorical variable to the picture.

The new data set is as follows:

|Case |PQLI |IMR |Location |
|1 |17 |110 |rural |
|2 |24 |78 |rural |
|3 |28 |88 |rural |
|4 |29 |135 |urban |
|5 |29 |55 |rural |
|6 |33 |138 |urban |
|7 |33 |79 |rural |
|8 |35 |133 |urban |
|9 |36 |120 |urban |
|10 |36 |92 |rural |
|11 |43 |125 |urban |

If we graph the rural and urban locations on the scatterplot using different plotting colors, we get the following scatterplot and regression line.

[Scatterplot: IMR vs. PQLI with separate fit lines for urban and rural cases]

Coefficients

|Model | |Unstandardized Coefficients B |Std. Error |Standardized Coefficients Beta |t |Sig. |
|1 |(Constant) |162.511 |23.216 | |7.000 |.006 |
| |PQLI |-.918 |.654 |-.630 |-1.403 |.255 |

a Dependent Variable: IMR

b Selecting only cases for which Location = urban

Coefficients

|Model | |Unstandardized Coefficients B |Std. Error |Standardized Coefficients Beta |t |Sig. |
|1 |(Constant) |114.629 |35.111 | |3.265 |.031 |
| |PQLI |-1.112 |1.232 |-.412 |-.903 |.418 |

a Dependent Variable: IMR

b Selecting only cases for which Location = rural

Note that the relationships now both make sense and look a lot stronger.

What this shows us is that location was a lurking categorical variable that caused the whole relationship to reverse. It is always important to look at what the graphs are saying from a practical standpoint and think about what you are doing.
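The split-by-location analysis can also be sketched programmatically: fit one line per category and compare the slopes with the pooled fit. Assuming scipy is available, and using the PQLI/IMR data above:

    from scipy import stats

    pqli = [17, 24, 28, 29, 29, 33, 33, 35, 36, 36, 43]
    imr = [110, 78, 88, 135, 55, 138, 79, 133, 120, 92, 125]
    loc = ["rural", "rural", "rural", "urban", "rural", "urban",
           "rural", "urban", "urban", "rural", "urban"]

    print("pooled", stats.linregress(pqli, imr).slope)   # positive slope

    for group in ("urban", "rural"):
        xs = [x for x, g in zip(pqli, loc) if g == group]
        ys = [y for y, g in zip(imr, loc) if g == group]
        fit = stats.linregress(xs, ys)
        print(group, fit.intercept, fit.slope)           # negative slopes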

Summary:

So far, we have used the following to examine the relationship between two quantitative variables:

1. Graph the data using a scatterplot.

2. For linear patterns, find the correlation, r.

3. Also for linear patterns, find the Least-squares Regression.

4. Then evaluate the model using a residual plot, calculation of r², and a check for outliers and influential observations. Think about lurking variables that might help to explain the results.

5. Remember that even a strong association does not imply causation. To show causation, a well designed and controlled experiment is required.
