Linear Correlation and Regression



In this lecture, we'll be looking at scatter plots and the inferences you can make from them. There are two topics here: correlation, which deals with describing the relationship between two variables and deciding how strong that relationship is, and regression, which involves using values of one variable to make predictions about the values of the other variable. Today I'm going to talk about these concepts in general and for a specific example. Then you'll get new sets of data and perform correlation and regression analysis on your own.

Two-Variable Data and Scatterplot

Let's say that you have ten people each taking two different medical tests, and you have the scores each person got on each of the tests. Here they are:

Person #   Test #1   Test #2
1          45.2      44.4
2          48.1      52.8
3          43.8      40.5
4          52.1      58.9
5          53.7      57.7
6          42.6      45.4
7          44.2      47.2
8          47        52
9          54.4      61.6
10         50.2      53

We make a scatter plot using the data, putting the score on Test #1 on the x-axis and the score on Test #2 on the y-axis and putting a dot (or some kind of mark) for each person. As before, the variable represented on the x-axis is called the independent variable, but for the purposes of regression it is also referred to as the predictor variable. The variable on the y-axis is called the dependent variable or the response variable. It's important to remember that designating one variable to be the predictor and the other to be the response in no way implies that the value of the predictor variable causes the response variable to take on whatever value it has. That's pretty obvious in the medical test example, where the results of both tests are the consequence of some medical condition, and neither causes the other. The predictor variable might have a causal effect on the response variable, but you can't prove it by demonstrating correlation.

Here's the scatter plot (you can make it in GeoGebra by clicking the button that looks like a scatter plot). So what do we make of this picture? It's obvious that although the points are not on a single straight line, they are pretty close to one. They tend to rise from left to right: the higher the Test #1 score, in general the higher the Test #2 score. We call this a positive correlation. (If the points tended to fall going from left to right, meaning that higher values of one variable tend to go with lower values of the other, we'd have a negative correlation.) If you were to draw the straight line that comes as close as possible to the points, it would also go up from left to right, i.e. have a positive slope. The line we want is called the least-squares best-fit regression line, among other things, and its name reveals what it is that makes it the best fit.

Best-fit Regression Line and Residual

First let's see its equation: $\hat{y} = 1.508x - 21.216$. The slope, referred to as $b_1$, is 1.508, and the y-intercept, referred to as $b_0$, is -21.216. So the line is of the form $\hat{y} = b_1 x + b_0$, which you have already seen in your Algebra classes. The symbol $\hat{y}$ is pronounced "y-hat"; it represents the output of the function when you plug in an x value from the data. You can see that $b_1$ and $b_0$ are statistics, because they describe a sample, in this case the ten pairs of numbers that generated the equation of the regression line. The actual computation of the slope and y-intercept is a bit involved and assumes you have studied multivariate calculus or linear algebra, so we let the software produce it for us, but you can understand the condition that the line fulfills.
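If you want to reproduce the slope and intercept without GeoGebra, here is a minimal sketch in Python (my own illustration, not part of the lecture; it assumes numpy is available). numpy's polyfit computes the same least-squares line from the table above:

    # Sketch only: fit the least-squares line to the Test #1 / Test #2 data.
    import numpy as np

    x = np.array([45.2, 48.1, 43.8, 52.1, 53.7, 42.6, 44.2, 47.0, 54.4, 50.2])  # Test #1
    y = np.array([44.4, 52.8, 40.5, 58.9, 57.7, 45.4, 47.2, 52.0, 61.6, 53.0])  # Test #2

    # polyfit with degree 1 returns the least-squares slope and intercept.
    b1, b0 = np.polyfit(x, y, 1)
    print(f"slope b1 = {b1:.3f}, intercept b0 = {b0:.3f}")  # about 1.508 and -21.216

Running it reproduces the slope of about 1.508 and the y-intercept of about -21.216 quoted above.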
Look at how the line fits in with the scatter diagram: some of the points are above the line, and some are below. In the next picture, I've included the vertical segments between the points and the regression line. The directed length of the vertical segment connecting a point to the regression line is called the residual of the point (or the prediction error). If we use the symbol $y$ to represent the actual y value in the scatterplot and $\hat{y}$ for the predicted y value, then the residual/error can be written as

$e = y - \hat{y}$

It's negative if the point is below the line, and positive if the point is above the line. Instead of using the original y as the vertical axis, if you make another scatter plot using (x, error) as your pairs, you will get a residual plot based on your data and regression line.

If you find all the residuals for a scatter diagram and its regression line, square them, and add up the squares, then the least-squares regression line is exactly that: the line with the smallest such sum. In slightly more mathematical language, we say that the least-squares regression line minimizes the quantity

$\sum (y - \hat{y})^2$

For any other line, if you measured the vertical distances from the points to the line, squared them, and summed the squares, you would get a larger total than you do for this line. This may not seem like such a big deal, but it is the accepted criterion for deciding which line is best, since it's backed up by calculus.

The line's equation is used for predicting how people will score on Test #2 given their score on Test #1. Going back to the person who got 45.2 on Test #1 and 44.4 on Test #2, we can see that this person got a lower than expected score on Test #2, because his or her point is below the line and therefore has a negative residual. We would have predicted Person #1's score on Test #2 to be $1.508(45.2) - 21.216 \approx 46.9$, not 44.4. (This person's residual is about $44.4 - 46.9 = -2.5$.)

Identifying Potential Outliers

In the residual plot, you may have noticed that some of the points are farther away from the x-axis than others, which raises the question of whether they are outliers. It turns out a criterion for this does exist, and we can rely partially on GeoGebra to help identify potential outliers. When you display the scatterplot with the linear regression line in GeoGebra, you have the option to display some of the statistics as well. Most of these statistics have to do with the actual computation of the regression line, but one of them is useful for identifying outliers: SSE, which stands for the sum of squared errors. Here we see that SSE = 51.2. SSE provides a useful measure of how far the points deviate from the line overall, since it's related to the standard deviation of the residuals:

$s = \sqrt{\dfrac{SSE}{n-2}}$

So in our case, the standard deviation is

$s = \sqrt{\dfrac{51.2}{10-2}} = \sqrt{6.4} \approx 2.5$

The rest of the outlier detection is quite similar to what we did in descriptive statistics: we simply look for any points whose residual is more than two standard deviations away from the mean of all the residuals (which is equal to 0, another property of the least-squares regression line). In our example, the usual range of residuals is (-2.5*2, 2.5*2), or (-5, 5). Looking back at the residual plot, although the second point from the left (Person #3) seems to be a bit far from the regression line, its residual of about -4.3 is still within the usual range. Therefore, we conclude that there are no outliers in our data set.
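Here is a small Python sketch (my own illustration; the names and the two-standard-deviation cutoff simply follow the description above) that computes the residuals, SSE, the standard deviation of the residuals, and flags any point outside the usual range:

    # Sketch only: residuals, SSE, and the two-standard-deviation outlier check.
    import numpy as np

    x = np.array([45.2, 48.1, 43.8, 52.1, 53.7, 42.6, 44.2, 47.0, 54.4, 50.2])
    y = np.array([44.4, 52.8, 40.5, 58.9, 57.7, 45.4, 47.2, 52.0, 61.6, 53.0])

    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b1 * x + b0              # predicted Test #2 scores
    residuals = y - y_hat            # e = y - y_hat, negative below the line
    sse = np.sum(residuals ** 2)     # sum of squared errors, about 51.2
    s = np.sqrt(sse / (len(x) - 2))  # standard deviation of the residuals, about 2.5

    # Flag any point whose residual is more than 2 standard deviations from 0.
    outliers = np.abs(residuals) > 2 * s
    print(f"SSE = {sse:.1f}, s = {s:.2f}, outlier indices: {np.where(outliers)[0]}")

For this data set it reports SSE close to 51.2, s close to 2.5, and no outliers, matching the conclusion above.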
Linear Correlation

The question remains whether we can use the Test #1 result as an accurate enough predictor of the Test #2 result that we can dispense altogether with performing Test #2. Of course, that depends partially on how closely we have to measure the result of Test #2 in order for it to be useful, which in turn depends on the nature of Test #2 and what it is we're trying to measure. But another part of the question involves looking at how closely the line fits the points, because it would be possible to have a different set of ten points with an almost identical regression line, but with points much farther away from the line. Here's an example of such a set:

Person #   Test #1   Test #2
1          45.2      38.2
2          48.1      60.2
3          43.8      37.3
4          52.1      65.8
5          53.7      53.6
6          42.6      53.6
7          44.2      53.1
8          47        42.4
9          54.4      66.1
10         50.2      48.5

And here is that set graphed along with its own regression line. Here the points are still pretty close to the line, but definitely not as close as before. We would make essentially the same predictions about the results of Test #2 based on the results of Test #1, yet our predictions would be off by more.

We need a way of measuring how closely the line fits the points, and for this we use the sample correlation coefficient, symbolized by $r$. Like $b_1$ and $b_0$, it's a statistic describing our sample. Its rather complicated formula guarantees that it is a number between -1 and +1. In essence, $r = -1$ if the points all lie on a straight line with a negative slope, and $r = +1$ if all the points lie on a straight line with a positive slope. The closer the absolute value $|r|$ is to 1 (in other words, the closer $r$ is to either -1 or +1), the more closely the points hug the line, and the better the line is for purposes of prediction.

For our original table of results, you can use GeoGebra to calculate the correlation coefficient (please consult the screencasts for the instructions); it comes out to about 0.94. For the second data set pictured above, $r$ is about 0.61. What does this tell us? First, that the line is a better fit for the first set than for the second, because $|r|$ is closer to 1 for the first set. Also, if $b_1$ (the slope of the line) is positive, so is $r$, and if $b_1$ is negative, so is $r$. In other words, $r$ measures both the strength and the direction of the correlation.
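As a quick check of those two numbers, here is a minimal Python sketch (my own illustration, assuming numpy is available; the values of about 0.94 and 0.61 quoted above were computed this way from the two tables):

    # Sketch only: sample correlation coefficients for the two data sets.
    import numpy as np

    test1 = np.array([45.2, 48.1, 43.8, 52.1, 53.7, 42.6, 44.2, 47.0, 54.4, 50.2])
    set1  = np.array([44.4, 52.8, 40.5, 58.9, 57.7, 45.4, 47.2, 52.0, 61.6, 53.0])
    set2  = np.array([38.2, 60.2, 37.3, 65.8, 53.6, 53.6, 53.1, 42.4, 66.1, 48.5])

    r1 = np.corrcoef(test1, set1)[0, 1]   # close to +1: the points hug the line
    r2 = np.corrcoef(test1, set2)[0, 1]   # smaller: similar line, looser fit
    print(f"r (first set)  = {r1:.3f}")
    print(f"r (second set) = {r2:.3f}")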
With an intuitive grasp of what the correlation coefficient is, we can now go back to our claims-testing mode. You may recall that in hypothesis testing, we always use a statistic measured from the sample (say, $\bar{x}$ or $\hat{p}$) to test a claim about a parameter ($\mu$ or $p$, respectively). In the hypothesis testing of correlation, this paradigm remains the same. Since $r$ is calculated from the sample, it is another statistic. The parameter to be tested in this case is the correlation coefficient of the population, which we represent using the Greek letter $\rho$ (you can remember this as the Greek "r").

What will be our null hypothesis this time? Remember that the symbolic form of $H_0$ always contains $=$. In testing correlation, the null hypothesis is always going to be the same:

$H_0: \rho = 0$

What does this mean? Since a correlation coefficient of 0 represents no correlation whatsoever, the null hypothesis simply states that the two variables are NOT correlated. If the sample correlation coefficient is far away from 0, then it provides significant evidence against $H_0$.

Theoretically, the alternative hypothesis can have any of the tails that we discussed previously: left (containing $<$), right (containing $>$), and two-tailed (containing $\neq$). Since we will be using the traditional method in conducting the hypothesis test, we will only use the two-tailed test, for simplicity's sake (otherwise, we would need three tables of critical values, not one). For the problem discussed above, we phrase our alternative hypothesis as follows:

$H_1: \rho \neq 0$

In other words, the alternative hypothesis declares "there is a correlation between the two variables", without specifying which direction the correlation takes.

Testing Correlation by Critical Value

In the traditional method of hypothesis testing, a critical value is needed to compare with the test statistic. If we already have the test statistic ($r$), where does the critical value come from? There are multiple ways to do this, but we will use one that does not require you to convert $r$ to some other test statistic. Instead, we will directly use critical values for $r$ itself. These values are listed in the table of critical values (to save space, I included just the top rows of the table):

Sample Size   α = 0.05   α = 0.01
3             0.997      1.000
4             0.950      0.990
5             0.878      0.959
6             0.811      0.917
7             0.754      0.875
8             0.707      0.834
9             0.666      0.798
10            0.632      0.765
11            0.602      0.735
12            0.576      0.708
13            0.553      0.684
14            0.532      0.661
15            0.514      0.641

Note: If you need the critical value for a very large n, you can also use the Excel spreadsheet on my website, which allows you to use any significance level and sample size.

Suppose we are using a significance level of 0.01. Looking at the row corresponding to n = 10, the table gives us a critical value of 0.765. The last piece of information we need to reach a decision is which tail to use. When using this table of critical values, the test is always right-tailed, which means the critical region is always to the right of the critical value shown in the table.

Comparing our test statistic with the critical value, we see that in our first example we reject the null hypothesis, since 0.94 > 0.765, while in the second example the decision is failure to reject $H_0$, since 0.61 < 0.765. This confirms our earlier impression of these two data sets: while the first set fits the least-squares regression line fairly well, the second one does not. The test scores in the first example show a significant positive correlation, while those in the second example do not. (A sketch of this decision rule in code appears at the end of this section.)

Although only positive critical values are shown in the table, they are also used to test the significance of a negative correlation. The way to do this is simply to look at the absolute value of $r$, $|r|$, and reject $H_0$ if $|r|$ is greater than the critical value listed in the table. For example, if you had 10 pairs of data and $r$ turned out to be -0.828, then the proper decision would be to reject $H_0$ and conclude that there is a significant negative correlation, since $|-0.828| = 0.828 > 0.765$.

Before we leave the critical value table, one last point worth noting is why the critical values get smaller as you increase the number of pairs in your data. Intuitively, this is because it is much easier to get an apparent correlation when the sample size is small. For example, if your data set consists of just 2 points, then you will always get a perfect correlation ($|r| = 1$) as a result. It is for this reason that n = 2 is not even listed in the table: it's practically meaningless to do correlation with such a small set of data. On the other hand, when n is very large, say n = 500, then using the tool mentioned above you can find that it only takes |r| > 0.088 to obtain a significant correlation. So the correlation coefficient should definitely be interpreted together with the sample size. The following graph summarizes how the critical value changes with the sample size for α = 0.05; the formula used to find the critical value will be described in the next section.
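Here is the decision rule written out as a minimal Python sketch (my own illustration; the r values are the approximate ones computed earlier, and 0.765 is the table's critical value for n = 10 and α = 0.01):

    # Sketch only: two-tailed critical-value decision for H0: rho = 0.
    def decide(r, critical_value):
        """Compare |r| with the table's critical value."""
        if abs(r) > critical_value:
            return "reject H0: significant correlation"
        return "fail to reject H0: no significant correlation"

    # n = 10 pairs, alpha = 0.01 -> the table's critical value is 0.765.
    print(decide(0.94, 0.765))    # first data set: reject H0
    print(decide(0.61, 0.765))    # second data set: fail to reject H0
    print(decide(-0.828, 0.765))  # negative r: |r| = 0.828 > 0.765, reject H0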
Testing Correlation by Using the P-value

Although the method described above is fairly straightforward, it does not quite explain where the critical values come from. How would you compute your own critical value if you decided to use a significance level of 0.001? In addition, since we never mentioned the P-value, you may be disappointed, because the decision logic using the P-value was used everywhere else in this course.

The good news is that the underlying test statistic for testing claims about $\rho$ is a familiar one: the Student's t distribution. You can convert $r$ to the t statistic manually using the relationship

$t = \dfrac{r}{\sqrt{\dfrac{1-r^2}{n-2}}}$

with $n - 2$ degrees of freedom and a two-tailed test to find the P-value. Using the same example as above, suppose your sample size is $n = 10$ and the correlation coefficient is about $r = 0.94$. Then the t statistic turns out to be

$t = \dfrac{0.94}{\sqrt{\dfrac{1-0.94^2}{8}}} \approx 7.8$

Using the t calculator (with df = 8) clearly shows that the P-value is almost zero. So we arrive at the same conclusion as with the critical value method.

Knowing the above relationship also allows you to invert it and solve for the critical values of $r$ (as listed in the critical value table) from given critical values of $t$. With a little algebra at the pre-calculus level, we can solve for $r$ from the equation above:

$r = \dfrac{t}{\sqrt{t^2 + n - 2}}$

For example, take $\alpha = 0.01$ and $n = 10$. We first identify the critical value of $t$ (using df = 8), which is about 3.355. Then using this t value, together with $n = 10$, we can solve for the critical value of $r$:

$r = \dfrac{3.355}{\sqrt{3.355^2 + 8}} \approx 0.765$

which matches the value in the table. Similarly, if you ever have to compute some exotic critical value (say for $\alpha = 0.001$ or a very large $n$), you can do the same calculation as long as you have some way of finding the critical value of Student's t (often called an "inverse t distribution function"), and the formula above will give you the critical value for $r$.

The Relationship between Correlation and Regression

So why bother doing a hypothesis test on correlation, if all we care about is making predictions using the regression line? Simply to give you the peace of mind you need before you can say your prediction has any meaning. The workflow of a statistician carrying out regression analysis is roughly this:

1. Test for correlation.
2. If it turns out to be significant, find the regression line.
3. Use new values of x (or y) to predict the other variable by solving the linear equation.

(A short sketch tying these steps together appears below.) This order of reasoning applies to other, more advanced types of regression analysis as well, such as polynomial regression, logarithmic regression, and multiple regression. So the next time you read that someone predicts eating a certain food can extend life by a certain number of years, you may want to check the correlation in their data!
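To make the workflow concrete, here is a minimal end-to-end sketch in Python (my own illustration, assuming numpy and scipy are available; the choice of alpha and the new score to predict from are illustrative). It computes r, converts it to t, finds the two-tailed P-value, and, only if the correlation is significant, fits the regression line and makes a prediction:

    # Sketch only: test for correlation first, then fit and predict.
    import math
    import numpy as np
    from scipy import stats

    # Test #1 and Test #2 scores for the ten people in the first data set.
    x = np.array([45.2, 48.1, 43.8, 52.1, 53.7, 42.6, 44.2, 47.0, 54.4, 50.2])
    y = np.array([44.4, 52.8, 40.5, 58.9, 57.7, 45.4, 47.2, 52.0, 61.6, 53.0])

    # Step 1: test for correlation using the t statistic and its P-value.
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r / math.sqrt((1 - r**2) / (n - 2))
    p_value = 2 * stats.t.sf(abs(t), n - 2)          # two-tailed P-value
    print(f"r = {r:.3f}, t = {t:.2f}, P-value = {p_value:.6f}")

    # Equivalently, the critical value of r comes from the inverse t distribution.
    alpha = 0.01
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)       # about 3.355 for df = 8
    r_crit = t_crit / math.sqrt(t_crit**2 + n - 2)   # about 0.765, as in the table

    # Steps 2 and 3: only if the correlation is significant, fit the line and predict.
    if p_value < alpha:
        b1, b0 = np.polyfit(x, y, 1)
        new_score = 45.2                             # a new Test #1 score to predict from
        print(f"predicted Test #2 score: {b1 * new_score + b0:.1f}")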