Lab Notes – Correlation, Autocorrelation

Objective: Develop a basic understanding of correlation, autocorrelation, and random walks using Statgraphics and simple regression.

Open Statgraphics with an empty data table (don’t open a data file).

Create a random variable. Use the describe/distributions/probability distributions window to generate a sample from the standard normal distribution. When you open this window it defaults to the standard normal distribution. Save the result with the save results button, checking the box to save results for dist. 1 and renaming the variable from RAND1 to X. This places a sample of size 100 into a variable named X.

Using the Window dropdown menu, return to the data table and select the second column. Right-click on the second column and select modify data. Rename the column Y. Right-click on the column again and select generate data. Enter “X” in the text box and click OK.

Relate Y to X using simple regression. Since Y was generated from X with no additional random component, we should expect to see a perfect fit. Does all the data fall on the regression line? Is the R-squared value 100%? Is the correlation coefficient 1.00 (rounded to two places)? Does the slope of the regression line equal 1? Is the constant value 0.0?
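Statgraphics handles the computation here, but the same perfect-fit check can be sketched in Python with NumPy; the arrays `x` and `y` are stand-ins for the X and Y columns, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)  # stand-in for the X column
y = x.copy()                  # Y generated from X, no random component

slope, const = np.polyfit(x, y, 1)  # fit y = slope*x + const
r = np.corrcoef(x, y)[0, 1]         # correlation coefficient
print(round(slope, 2), round(const, 2), round(r, 2))
```

With no noise added, the slope comes out 1, the constant 0, and the correlation 1, exactly as the questions above anticipate.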

Create a second random variable by sampling the standard normal distribution and save it as RAND1. We are going to add RAND1 to Y a little at a time to see the effect on the model. To do this, regenerate the Y column (right-click on the Y column and select generate data), each time adding a bit more of RAND1. Recall that the mean of RAND1 is near zero and its standard deviation is near 1, since we sampled from a standard normal distribution. Use the coefficients in the following table to regenerate Y and investigate the effect of increasing the coefficient of RAND1. This exercise increases the random component of Y and lets you visualize the growing scatter of the data around the regression line as the R-squared value and the correlation between X and Y decrease.

| RAND1 Coef. | R-squared % | Correlation Coef. | Const. | Slope | P value (ANOVA) | P value (slope) |
|-------------|-------------|-------------------|--------|-------|-----------------|-----------------|
| None        | 100         | 1.00              | 0      | 1     | 0.0000          | 0.0000          |
| 0.1         |             |                   |        |       |                 |                 |
| 0.2         |             |                   |        |       |                 |                 |
| 0.5         |             |                   |        |       |                 |                 |
| 1           |             |                   |        |       |                 |                 |
| 2           |             |                   |        |       |                 |                 |
| 5           |             |                   |        |       |                 |                 |

To complete this exercise, simply leave the simple regression window open and use the generate data option in the data table for Y to bring in the RAND1 variable with the coefficients in the table. For example, for the first modification Y = X + RAND1*0.1. When you enter X + RAND1*0.1 and click OK, the simple regression window automatically updates using the new values for Y. You can quickly fill in the table by moving between the data table and the simple regression window and recording the results.

As you increase the random component of Y by increasing the coefficient of RAND1, what happens to the data scatter around the regression line? What happens to the p value for the model (from the ANOVA table)? What happens to the p value for the slope? What happens to the slope? What happens to the constant (intercept)? Even though we are adding a zero-mean random variable to Y, it noticeably disturbs the estimated slope and constant in the regression model, simply because the X dependence gets lost in the noise.
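The trend the table asks you to record can also be sketched numerically. This NumPy approximation of the Statgraphics steps uses an arbitrary seed, so the exact values will differ from your table, but the pattern should match:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
rand1 = rng.standard_normal(100)

r2s = []
for coef in [0.1, 0.2, 0.5, 1, 2, 5]:
    y = x + coef * rand1                # regenerate Y with more noise
    r2 = np.corrcoef(x, y)[0, 1] ** 2   # R-squared of the simple fit
    r2s.append(r2)
    print(coef, round(r2, 3))
```

As the coefficient grows, R-squared drops from near 100% toward zero, mirroring the loss of the X dependence in the noise.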

In the previous section we added RAND1 as an error component in our simple regression model. Notice that this implies we can observe Y and X but not RAND1. Now suppose we can observe not only Y and X, but also RAND1; that is, we can observe RAND1 independently of the regression model. Construct Y from X and RAND1 (use generate data on the Y column and enter X+RAND1) and try relating these three variables using multiple regression. Does this model account for all the variation in Y?
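A least-squares sketch of this multiple regression, with NumPy arrays standing in for the three columns; the column of ones plays the role of the regression constant:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100)
rand1 = rng.standard_normal(100)
y = x + rand1                  # Y built from both observed variables

# Multiple regression of Y on a constant, X, and RAND1
A = np.column_stack([np.ones_like(x), x, rand1])
coef, residuals, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 2))       # intercept near 0, both slopes near 1
```

Because every source of variation in Y is now a regressor, the fit is again perfect: the residuals are essentially zero.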

The point of this exercise is to give you a feeling for how these model parameters are related, and the importance of identifying all the significant sources of variation to produce a sound model.

The following exercise helps you gain familiarity with the concept of a random walk by using Statgraphics to generate two random walks from the samples we already have.

Open the data table from the Window menu. Select an empty column and use modify data to rename the column Xwalk. Use the generate data option, enter Runtot(X), and click OK. The Runtot function takes a column of data and produces its running total. If you think about it, since X is a sample from a random variable, this must produce a random walk. Select another blank column, name it Rwalk, and repeat using Runtot(RAND1) this time. Take a moment to reflect on what we have done: we took two independent samples from the same probability distribution (the standard normal) and created two random walks from them. Now look at the two random walks using Special/time series/Descriptive… Are they the same? Were they generated from the same random process?
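Runtot corresponds to a cumulative sum, so the two walks can be sketched in NumPy (seed and names are illustrative, mirroring the X, RAND1, Xwalk, and Rwalk columns):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(100)
rand1 = rng.standard_normal(100)

xwalk = np.cumsum(x)       # Runtot(X)
rwalk = np.cumsum(rand1)   # Runtot(RAND1)

# Same random process (cumulated standard-normal steps), yet the paths differ
print(xwalk[-1], rwalk[-1])
```

Plotting the two series shows visibly different paths even though they come from the identical generating process, which is the point of the questions above.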

Look at the autocorrelation graph for your random walks. Does the maximum correlation occur with a lag of 1? Given the definition of a random walk does this make sense?
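You can check the lag-1 behavior numerically as well. The `autocorr` helper below is a hypothetical sample-ACF implementation, not a Statgraphics function:

```python
import numpy as np

rng = np.random.default_rng(4)
walk = np.cumsum(rng.standard_normal(200))  # a random walk

def autocorr(z, lag):
    # sample autocorrelation of z at the given lag
    z = z - z.mean()
    return float(np.dot(z[:-lag], z[lag:]) / np.dot(z, z))

acf = [autocorr(walk, k) for k in range(1, 11)]
print([round(a, 2) for a in acf])
```

Each step of a random walk is the previous value plus an independent increment, so neighboring values are nearly identical and the autocorrelation peaks at lag 1, then tapers off slowly.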

Now, to close the loop, do a simple regression with DIFF(Xwalk) as the dependent variable and X as the independent variable. What is the correlation coefficient? What is the slope? What is the constant? Does the DIFF function produce the random steps that comprise the random walk? Repeat using DIFF(Rwalk) as the dependent variable and RAND1 as the independent variable.
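A sketch of this closing regression, using `np.diff` as a stand-in for DIFF; note that `np.diff` drops the first value, so the result pairs with `x[1:]`:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(100)
xwalk = np.cumsum(x)

steps = np.diff(xwalk)      # DIFF(Xwalk); one value shorter than Xwalk
# Differencing undoes the running total, recovering the original steps
slope, const = np.polyfit(x[1:], steps, 1)
print(round(slope, 2), round(const, 2))  # slope near 1, const near 0
```

Since differencing exactly inverts the running total, the regression is a perfect fit: DIFF does recover the random steps that comprise the walk.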

Remember: in modeling variation, you may sometimes wish to use the change in a variable rather than the variable itself. The DIFF function can be a useful tool for taking slowly varying data and extracting the more volatile changes.
