Correlations




To this point in our review, we have dealt with single variables, learning how to describe their central tendency and distributions. Correlations deal with the relationship between two or more variables. With this measure, we can answer questions such as: “Are achievement test scores related to Grade Point Averages?”, “Is counselor empathy related to counseling outcome?”, and “Is student toenail length related to success in graduate school?” Correlational studies, however, are limited in that they do not allow us to draw conclusions about causes. No matter how high a correlation we find between two variables, we can only say the two variables are related; we cannot say one is a cause of the other.

To begin, let’s review what we mean by “cause”. When we say one variable has a causal effect on another variable, we are saying that changing the first variable will produce a change in the second variable. We can only make this claim if we have, in fact, changed (manipulated) one variable and observed the effects of this change on a second variable (while holding all other variables constant). Correlational studies do not meet this standard for making causal conclusions. In a correlational study, we simply measure the level of two variables for the same individual and then statistically determine whether knowledge of a person’s score on one variable allows us to predict their score on the second variable better than if we did not have that knowledge. If we find a significant correlation between two variables, there are always three possible causal explanations for why that relationship exists. These three explanations can be summarized as two problems.

1) The Directional Problem – For example, there is a correlation between the birth rate in humans (variable A) and the number of storks nesting in a village (variable B). One possible explanation for this relationship is that A causes B (i.e., high birth rates in humans cause higher nesting rates of storks; perhaps the storks are attracted to babies). An equally good explanation of this correlation is that B causes A (i.e., higher rates of storks cause higher birth rates – perhaps my mother is correct and the storks really do bring babies).

2) The Third Variable Problem – Some outside variable (or set of variables) may affect both A and B in a manner such that they co-vary in a predictable way, but A does not cause B and B does not cause A. In our example, a possible explanation of the relationship between human birth rates and the number of nesting storks may be that both are affected by the success of the previous harvest. The more food available, the more likely storks are to nest in an area, and the more likely humans are to have offspring. So following a good harvest, stork nesting increases and so do human birth rates. They co-vary because they are both affected by a common underlying determinant, but A does not cause B, nor does B cause A.

What correlations do allow us to do is conclude that two variables are (or are not) related. This can be very interesting and valuable information, but it must be used with caution when making prescriptions based on correlational findings.

Dr. Oz said that having loving, monogamous sex regularly keeps you younger. Double the amount of sex you have and you can live longer.

“Being intimate with the people you love is critically important to longevity,” the doctor says. “We’ve got tons of data around it, but the basic rule of thumb is that if the average American has sex once a week, you want to have sex at least twice a week. It increases longevity by about three years. For women, it’s more about quality than quantity. If you don’t have that loving, conjugal relationship with someone you can grow with in life, then you’re not fun and fearless.”

Correlational studies are by definition ex post facto. In a correlational study the researcher measures variables as they naturally occur. They do not do anything to manipulate the variables. Many of the studies that members of this class are conducting are correlational. While there are (at least) two variables, neither one is an independent variable nor a dependent variable.

The Pearson correlation statistic ( r ) provides two separate pieces of information. (1) The sign, negative or positive, tells us the direction of the relationship. If a correlation is positive, it indicates that higher levels on one variable predict higher levels of the second variable (and conversely that lower levels of one variable predict lower levels of the other). A negative correlation indicates that higher levels of one variable predict lower levels of the second variable (and vice versa). When interpreting a correlation it is important to keep in mind that a negative correlation is not a negative result. It is just as informative as a positive correlation.

For example, in my Introduction to Experimental Psychology courses I have found a fairly high positive correlation between class attendance and final grades. The more classes a student attends the higher their final grade tends to be. If instead of measuring the number of classes students attend, I had measured the number of classes they missed, I would find a negative correlation between classes missed and final grades.

(2) The second piece of information a correlation statistic provides is the strength of the relationship -- how accurately we can predict one variable based on the other. The absolute value of a correlation ranges from zero to one. The closer the absolute value of the correlation is to 1, the stronger the relationship; the closer it is to zero, the weaker the relationship.
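These two pieces of information can be made concrete with a short sketch in Python (using NumPy, with hypothetical attendance numbers in the spirit of the example that follows): the same underlying relationship yields an r of equal strength but opposite sign, depending on whether we count classes attended or classes missed.

```python
import numpy as np

attended = np.array([40, 35, 30, 25, 20, 15])   # classes attended (of 45)
grade    = np.array([75, 70, 68, 60, 55, 50])   # final grade (%)
missed   = 45 - attended                        # the same information, reversed

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r_attended = np.corrcoef(attended, grade)[0, 1]
r_missed   = np.corrcoef(missed, grade)[0, 1]

print(round(r_attended, 3))   # strong positive correlation
print(round(r_missed, 3))     # same magnitude, negative sign
```

Because "classes missed" is just a reversed recoding of "classes attended", the two correlations have identical absolute value; only the sign flips.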

One way to represent the correlation between two variables is to graph the relationship between them. On the Y axis (vertical axis) we put the scale for one variable, and on the X axis (horizontal axis) we put the second variable. For now, it really does not matter which variable you put on which axis. Each subject’s scores on the two variables are thus represented as a single data point on this graph, which is called a scatter diagram. On the diagram below, one subject’s scores on the two variables are represented by one data point. Subject 1 attended 40 out of 45 classes and obtained a final grade of 75%.

[pic] Data Points

When we plot data from a sample of scores, the pattern of the scores can tell us something about the relationship between the two variables. Two related measures, when plotted on a graph, will produce a scatterplot that approximates a line. The less the scores deviate from the line, the stronger the correlation. If the line slopes upward, the direction of the correlation is positive (the higher the scores on one axis, the higher the scores on the other axis). If the slope is downward, the direction of the correlation is negative (higher scores on one variable are associated with lower scores on the second variable). This line (called the line of best fit) is the line that fits the data in a manner that minimizes the overall distance between itself and all the data points. In other words, if you drew a series of lines on your scatterplot and then measured the total amount the data points deviated from each line, you would find one line that produces the lowest total deviation – one line that best summarizes the relationship between the two variables, or that best fits the data. For example, in the following figure the line of best fit summarizes the relationship between classes attended and final grade.

[pic] Line of Best Fit
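The idea of the line of best fit can be sketched in Python (hypothetical attendance/grade numbers; NumPy's polyfit with degree 1 finds the slope and intercept that minimize the summed squared vertical deviations, the usual least-squares criterion):

```python
import numpy as np

x = np.array([10, 20, 25, 30, 40, 44], dtype=float)   # classes attended
y = np.array([52, 60, 63, 66, 72, 78], dtype=float)   # final grade (%)

# degree-1 polynomial = a straight line; polyfit returns (slope, intercept)
slope, intercept = np.polyfit(x, y, deg=1)

predicted = slope * x + intercept
residuals = y - predicted   # vertical deviations from the fitted line
print(round(slope, 3), round(intercept, 3))
```

A property of the least-squares line (when an intercept is fitted) is that the residuals sum to zero: the line passes through the "middle" of the data.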

The line of best fit for a perfect correlation will have two characteristics.

1) All the data points will line up perfectly on the line of best fit.

2) If the scores are transformed to z-scores, the slope of the line of best fit will be 45° (negative or positive).

[pic]

For example, this scatterplot depicts a perfect positive correlation.

When two variables are not related, the data points may appear random, or circular. For example, the scatterplot below shows a correlation of +.06 (very low). The line of best fit is either (and equally) a straight vertical or straight horizontal line extending from the mean of each distribution. No matter what score I am given on variable X, my best prediction of the value of variable Y is the mean of the distribution of Y. Knowing the value of X does not provide any usable information.

[pic]

While there are some great uses for scatterplots, which we will discuss, they are not great ways of presenting your results. Instead, we report the correlation statistic.

Significance of a correlation

SPSS will not only tell you the correlation between variables, it will also tell you the p value of the correlation. Like all other tests of significance, this tells us how likely we would be to find this degree of relationship if the two variables are, in fact, not related. For example, suppose I put random numbers between 1 and 100 in one bucket, and another set of numbers between 1 and 100 in a second bucket. I have each of you pull a number from each bucket. If we entered the numbers from bucket 1 as variable 1 and the numbers from bucket 2 as variable 2 into SPSS, we would expect to find a zero correlation between the two. There should be no relationship between the two variables; they should both be random, unrelated events. Even so, it is unlikely we would actually get a correlation of exactly zero. Based on chance alone we could even get a high correlation. When we say a correlation is significant, we are saying that the probability of obtaining a correlation this high or higher, when in fact there is no relationship between the two variables, is less than 5%. SPSS reports 2-tailed probability levels. We will talk more about one- and two-tailed tests when we talk about t-tests, but I will point out here that you should be using a 2-tailed test for your studies. One-tailed tests are only used when we have very strong reasons for predicting the direction of the correlation. None of your studies meets this criterion.
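The two-bucket thought experiment is easy to simulate. The sketch below (plain Python, no libraries) draws 1000 samples of 10 random pairs each and counts how often a "high" correlation appears from pure chance:

```python
import random

random.seed(1)   # fixed seed so the simulation is reproducible

def sample_r(n):
    """Pearson r for n pairs of unrelated random numbers (the two buckets)."""
    xs = [random.randint(1, 100) for _ in range(n)]
    ys = [random.randint(1, 100) for _ in range(n)]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rs = [sample_r(10) for _ in range(1000)]
high = sum(1 for r in rs if abs(r) > .5)   # |r| > .5 despite no real relationship
print(high, "out of 1000 samples had |r| > .5 by chance alone")
```

Even though the two variables are unrelated by construction, a noticeable fraction of small samples produce sizable correlations, which is exactly why a significance test is needed.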

The magnitude of correlation you can obtain by chance is strongly affected by the size of the sample. The more participants you have in your sample, the less likely it is that you will obtain a high correlation by chance alone. If you think about it, if we draw just a few numbers from each of the buckets, it is more likely they will form a pattern due to chance than if we drew many pairs of numbers. The same magnitude of correlation may be significant in a large-N study but non-significant in a small-N study. For example, in a study with 10 participants a correlation of .67 is needed in order for the result to be significant at the p = .05 level. In a study of 100 participants, the correlation need only be .20 to be significant at the same level. This leads to an interesting, but fairly common, phenomenon in correlational studies. Often, researchers are very excited about their data and will run preliminary analyses after getting data from only a few respondents. They often find fairly high correlations. As they run more subjects, they find the correlations tend to drop. On the other hand, even with high correlations on the small data sets, researchers often find the results are not significant, whereas the lower correlations they find with the larger sample are significant. Correlations are not meant to be used on small sample sizes, but they do tend to be fairly stable when sample sizes are larger. This is why I have suggested that those of you who are running correlational studies aim for sample sizes of at least 60.
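The sample-size effect can be checked numerically. The sketch below approximates the two-tailed p value of a correlation with the Fisher z transform (a standard approximation using only the Python standard library; SPSS uses the exact t-based test, which agrees closely at these sample sizes):

```python
import math
from statistics import NormalDist

def fisher_p(r, n):
    """Approximate two-tailed p value for Pearson r with sample size n,
    via the Fisher z transform: z = atanh(r) * sqrt(n - 3)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(fisher_p(.50, 10), 3))    # moderate r, small N: not significant
print(round(fisher_p(.20, 100), 3))   # smaller r, large N: significant
print(round(fisher_p(.67, 10), 3))    # the r needed at N = 10
```

A correlation of .50 with 10 participants is not significant, while a much smaller .20 with 100 participants is, matching the critical values quoted above.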

Statistical significance just means that the correlation is unlikely to be due to chance; it does not mean it is high enough to be important or meaningful, or anything of that sort. So how do we judge the size of a correlation? Different researchers have suggested different values; however, Cohen (1988) has suggested the following guidelines, which seem to be fairly widely used.

Correlations between .10 and .29 are considered small.

Correlations between .30 and .49 are considered medium, and

Correlations of .50 and above are considered large.

The Coefficient of Determination is a measure that can be easily obtained from the correlation statistic that SPSS produces for you. It is simply r squared. It is often more useful to talk about the relationship between your variables in terms of the coefficient of determination. For example, if the correlation between intelligence test scores and GPA were r = .60, squaring it would give you .36 (I know it seems wrong that the square of a number is smaller than the number being squared, but for decimals between 0 and 1, this is true). The coefficient of determination (in our example) indicates that 36% of the variability in Y is “accounted for” by the variability in X. Another way of saying this is that if the correlation between two variables is .60, then they have 36% of their variability “in common”. You may want to use this statistic when discussing the results of your study.
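The calculation itself is a one-liner; using the r = .60 example from the text:

```python
r = 0.60                 # correlation between intelligence scores and GPA
coef_det = r ** 2        # coefficient of determination
print(round(coef_det, 2))   # 0.36 -> 36% of the variability "in common"
```

Note that because r is a decimal between 0 and 1, squaring always shrinks it, so the coefficient of determination is always a more conservative-looking number than r itself.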

Linear and Nonlinear Relationships

One of the assumptions the correlation statistic depends on is that the relationship between the two variables is linear – that is, a straight line best describes the relationship between the variables. Not all relationships are linear. For example, arousal level and performance are not linearly related. Let’s say you have a very low arousal level (you can barely stay awake) and you write an exam. How well do you expect to do? On the other hand, let’s say you are really highly aroused (15 cups of coffee); again, how well do you expect to do? The relationship between arousal and performance is well known: at very high and very low levels of arousal performance is poor, but at middle levels performance is good. The line that best describes this relationship is curved. Looking at the scatterplot of your data should tell you if the relationship is curvilinear or has some other strange shape. If it is, do not despair. There are other ways of analyzing the data, or alternatively there are ways of transforming the data to make the relationship linear. If you find you have a non-linear relationship, let me know and we will discuss the best way to handle it. If you have a non-linear relationship between your variables, it is unlikely you will find a significant correlation. Looking at the distribution below, think about where the line of best fit would fall – a straight line is not a good fit for these data. In fact, looking at the scatterplot we can see that one variable is highly predictable if we know the value of the other variable; the relationship is just not linear, and therefore a Pearson correlation analysis is not appropriate here.

[pic]
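The arousal example can be sketched numerically (hypothetical numbers forming an inverted U, peaking at mid arousal): Pearson r comes out near zero even though performance is perfectly predictable from arousal.

```python
import numpy as np

arousal     = np.arange(1, 10, dtype=float)          # 1 (asleep) .. 9 (wired)
performance = 100 - 5 * (arousal - 5) ** 2           # inverted U, peak at 5

# performance is a deterministic function of arousal, yet...
r = np.corrcoef(arousal, performance)[0, 1]
print(round(r, 3))   # near zero: Pearson r misses the curvilinear relationship
```

Because the relationship is symmetric around the middle arousal level, the positive and negative halves cancel and the linear correlation is essentially zero, which is exactly why a scatterplot inspection matters before trusting r.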

Skewed Distributions

Skewed distributions can also produce invalid correlation statistics. For example, assume I measure the time people take to finish a set of math problems and correlate this with the number of questions they answer correctly. I plot the relationship between time and number of problems correctly answered. I find that students who finish very quickly produce fewer correct answers. Students who took longer to complete the problems tend to score higher – up until about 5 minutes. After 5 minutes, subjects all obtain perfect scores; even if they take more time, they can do no better. This is a ceiling effect, and it is due to the limitations of the test. The relationship I obtained between time and score is not linear.

[pic]

Outliers can also strongly affect a correlation, either increasing it or decreasing it. When conducting a correlational analysis you should always recheck the scores of obvious outliers to make sure they are not errors. Outliers can be identified by looking at the scatterplot.
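A single outlier (for example, a data-entry typo) can swamp r, as this sketch with hypothetical data shows:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)   # clear positive trend

r_clean = np.corrcoef(x, y)[0, 1]

y_typo = y.copy()
y_typo[-1] = 70.0   # suppose 7 was mistyped as 70
r_typo = np.corrcoef(x, y_typo)[0, 1]

print(round(r_clean, 3), round(r_typo, 3))   # one typo changes r substantially
```

One mistyped value out of eight is enough to change the correlation markedly, which is why rechecking obvious outliers against the raw data is worth the minute it takes.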

Restricted Range and Correlations. All other things being equal, the greater the variability among the observations in your distribution, the greater the value of the correlation. For example, if we look at the relationship between height and basketball playing ability in the general population, the correlation would be fairly high. If, however, we only sampled people who were over 6 feet tall, the correlation between these two variables would be much lower. Looking at the diagram below, we can see the correlation in the sub-set is lower than in the total set.

[pic]
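The range-restriction effect can be simulated with made-up height/ability data (the variable names and numbers are hypothetical; the point is that restricting the sample to the tall end shrinks the correlation):

```python
import numpy as np

rng = np.random.default_rng(0)                       # reproducible simulation
height  = rng.uniform(60, 84, size=500)              # heights in inches
ability = height + rng.normal(0, 6, size=500)        # linearly related + noise

r_full = np.corrcoef(height, ability)[0, 1]          # full-range correlation

tall = height > 72                                   # restrict to over 6 feet
r_tall = np.corrcoef(height[tall], ability[tall])[0, 1]

print(round(r_full, 3), round(r_tall, 3))            # restricted r is lower
```

The relationship between the variables has not changed in the subset; only the variability of height has been cut, and with it the observed correlation.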

Lab 3.

For this lab you will need to access the data set sent to you by email. The data set contains 7 variables. The first column is the subject’s ID number. The second column is the subject’s sex (1 = female, 2 = male). The third column contains Math Aptitude Test scores, which students completed during registration week. The fourth column contains the subjects’ grades in their statistics course. The fifth column contains students’ GPAs, and the sixth column contains the number of hours they spend on extracurricular activities per week. The seventh column is a math anxiety score, rated on a 0 to 5 scale in which higher values indicate higher math anxiety. Your first step is to run a set of Pearson correlations for the data set. You should include only the variables that are appropriate for Pearson analysis (remember, they must be continuous variables).

Procedure for calculating Pearson Correlations

1. From the menu at the top of the screen click on Analyze, then click on correlate, then on Bivariate. (Bi means two – so Bivariate means it will look at the relationship between pairs of variables.)

2. Select all the variables you want to correlate with each other and move them into the box marked Variables.

3. Check that the Pearson box and the 2-tailed box are both selected. Also check that the box next to Flag Significant Correlations is selected. This will instruct SPSS to put a single asterisk (*) next to correlations that are significant at p < .05, and a double asterisk (**) next to correlations that are significant at p < .01.

4. Click on the Options button.

For Missing values, click on the Exclude cases pairwise box.

This option instructs SPSS to omit a case from only those analyses involving a variable for which that case is missing a value. If there are missing values, then the N (number of cases) on which each correlation within a set of variables is calculated may differ. If you select Exclude cases listwise, subjects who have any missing data are eliminated from all analyses, even if you have data for that subject for a given pair being correlated.
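The difference between the two options can be illustrated outside SPSS with a small pandas sketch (hypothetical data; pandas' default corr() behaves like pairwise deletion, and dropping incomplete rows first mimics listwise deletion):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "math":  [70, 80, 90, 60, np.nan],   # one subject's math score is missing
    "stats": [65, 78, 88, 58, 90],
    "gpa":   [3.0, 3.5, 3.9, 2.8, 3.8],
})

pairwise = df.corr()            # pairwise: each pair uses all cases it can
listwise = df.dropna().corr()   # listwise: any missing value drops the case

# the stats/gpa correlation uses 5 cases pairwise, but only 4 listwise
n_pairwise = df[["stats", "gpa"]].dropna().shape[0]
n_listwise = df.dropna().shape[0]
print(n_pairwise, n_listwise)
```

With pairwise deletion, the stats/gpa correlation keeps all five subjects because neither of those variables is missing; listwise deletion throws away the subject with the missing math score from every correlation, including that one.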

Under Options you can also obtain the means and standard deviations. This saves you having to use the Descriptives option to obtain these values. Click on Continue and then on OK.

SPSS will produce an output matrix that reports the correlations between each pair of variables. In fact, it gives each possible correlation twice. It will report the correlation between variable A and Variable B, and it will also report the correlation between Variable B and Variable A. Of course, these are the same value, but SPSS prints out the complete matrix. You should notice that SPSS also reports the correlation of each variable with itself. This value is always 1.0 since a variable is a perfect predictor of itself.

Hand in this correlation matrix and write a brief summary of the correlational results. Comment on the significant correlations. Be sure to indicate the direction of each correlation (e.g., the higher the variable A score, the lower the variable B score). Also calculate and report the coefficient of determination for each significant correlation. State what this coefficient tells us.

Next for all pairs of continuous variables, produce a scatterplot.

Procedure for generating a scatterplot.

1. From the menu at the top of the screen click on Graphs, then click on Scatter.

2. Click on Simple. Click on Define.

3. Click on one of the variables and move it to the Y-axis box.

4. Click on the second variable and move it to the X-axis box.

5. Add a title for your scatterplot (e.g., Relationship between Agreeableness and Moral Integrity).

6. Click on Continue and OK.

Looking at the scatterplot allows us to identify outliers.

I have purposely put an outlier in your data set. Find it, fix it (it is a typo), and then reanalyze all the correlations it affected. In your write-up, comment on the effect the outlier had on your original calculations and summarize the new correlations.

As you produce each scatterplot, examine it to see if the relationship appears to be linear, curvilinear or skewed. Find the scatterplot that is an example of a curvilinear relationship. Paste that scatterplot into your lab write-up. Do you think a Pearson correlation is likely to overestimate, underestimate or give a good estimate of the relationship between the two variables?

Calculating Correlations for Subsets of Data. Several of you are looking at the correlations between two variables and asking whether this correlation differs between males and females. You can calculate the correlations for males and for females separately by using the Split File option.

Splitting a Data file

Sometimes it is necessary to split your file and to repeat analyses for groups (for example, for males and females) separately.

Procedure

1. Make sure you have the Data view window open on the screen.

2. Click on the Data menu and choose Split File.

3. Click on Compare Groups and specify the grouping variable (in this case, sex).

4. Click on OK.

For any analysis you perform after this split file procedure, the two groups will be analyzed separately. When you have finished the analysis, you need to go back and turn the Split Files option off.
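The same split-and-repeat idea can be expressed in pandas for comparison (hypothetical scores; grouping by the sex column computes the correlation separately within each group, just as Split File does):

```python
import pandas as pd

df = pd.DataFrame({
    "sex":   [1, 1, 1, 1, 2, 2, 2, 2],          # 1 = female, 2 = male
    "math":  [70, 80, 90, 60, 65, 85, 75, 95],  # math aptitude scores
    "stats": [68, 79, 91, 62, 60, 88, 70, 97],  # statistics grades
})

# one correlation matrix per group, indexed by (sex, variable)
by_sex = df.groupby("sex")[["math", "stats"]].corr()

r_female = by_sex.loc[(1, "math"), "stats"]
r_male   = by_sex.loc[(2, "math"), "stats"]
print(round(r_female, 3), round(r_male, 3))
```

Unlike SPSS's Split File, there is no global state to switch off afterwards; the grouping applies only to this one computation.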

Calculate the correlation between Math Aptitude scores and statistics class grades for males and for females separately. Do not forget to turn the Split File option off when you are done.

You can also select a subset of scores using the Select Cases option from the Data menu. It works much like the Split File option, except that it produces a new column labeled filter_$, which indicates which cases are selected and which are unselected for the analysis. Using this option, select only those cases for which the subject has scored 70 or higher on the math aptitude test. Then recalculate the correlation between Math Aptitude scores and statistics grades for this selected sub-set. Compare your new correlation to the correlation you obtained from the full data set. Comment on the results.
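The Select Cases step has a direct pandas analogue: filter the rows, then recorrelate. The sketch below (hypothetical scores) also shows why the assignment asks you to compare the two correlations; keeping only high math-aptitude cases restricts the range and typically lowers r.

```python
import pandas as pd

df = pd.DataFrame({
    "math":  [55, 62, 70, 75, 82, 88, 93, 99],   # math aptitude
    "stats": [50, 66, 64, 78, 74, 90, 85, 96],   # statistics grade
})

r_full = df["math"].corr(df["stats"])            # full data set

selected = df[df["math"] >= 70]                  # cases scoring 70 or higher
r_selected = selected["math"].corr(selected["stats"])

print(len(selected), round(r_full, 3), round(r_selected, 3))
```

The filtered correlation is smaller than the full-sample one even though the underlying relationship is unchanged, which is the restricted-range effect discussed earlier.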

What to hand in –

1. Correlation matrix – with written summary

2. Report the recalculated correlations for the variable from which you removed the outlier. Describe why each correlation changed, or did not change.

3. The curvilinear scatterplot and your explanation of the appropriateness of a Pearson correlation for describing the relationship between these two variables.

4. Correlation matrices for the split files analysis (Males and Females separately) and comments on these results.

5. Statistics for the selected cases analysis (Math Aptitude scores of 70 or higher) – and comments
