Third Handout



Third Handout

How would you tell someone what the predicted values are, and how to plot them?

The residuals or errors are the differences between the actual y-values and the predicted y-values. On the spreadsheet, make a column of residuals by:

1. In F3, enter the label Residuals

2. In F4, enter the formula = C3 – E3

3. Copy and paste this formula into F5 to F19.

Statistics textbooks try to emphasize the formula: Data = Fit + Error. This is not a complicated equation, since Error is defined as Data – Fit. So your data, in column C, is the sum of your fitted values in column E and the Errors in Column F.

A model is considered a good model when the errors, or residuals, do not have a pattern to them. If there is a pattern, then perhaps a better model will reduce the magnitudes of the errors. The linear model for Gas Used versus Heating Degree Days was good for two reasons. The errors, in column F on our worksheet, do not have a pattern, and the errors are “small.” Can we quantify what we mean by small?

R-squared, Total Variation, Predicted Variation, and Residual Variation

One of the main ideas in statistics is to study and quantify variation, the variability of data. The Gas used to heat a home varies from month to month. One measure of the variability of that variable is its standard deviation. Compute the standard deviation of the Gas data, and while we’re considering this, let’s compute the means and standard deviations of both variables in GasVsTemp.

1. Go below the note in the Data sheet to cell A22 and enter the label Mean

2. In cell A23 enter the label Std Devn

3. Compute these values using the Average and Stdev functions. Note: Once you’ve done it for column B, copy and paste the formulas to column C.

The standard deviation of the y data is defined as: [pic] , where [pic] is the mean of all the y-values, and n is the number of y-values. The term in the numerator of the radical is called the total variation in y, the sum of the squared deviations of y from its mean. We could compute this total variation by “undoing” the formula for the standard deviation, but our goal, right now, is not quick computation.

1. Compute the squared deviation for each of the y-values as follows

2. In H3, enter the label Squares of Deviations of Gas from Mean Gas

3. In H4, type the formula =(C4 - $C$22)^2. This is the square of the difference between C4, the first y-value, and the mean of all the y-values.

4. Copy and paste that formula for all the y-values.

5. Add up all these squared deviations and put the total next to the sum of the squares of the residuals, in cell H20

6. Verify that this is the same result as you would get by undoing the formula for the standard deviation, namely (n - 1) ( (Std Devn of y’s)^2, by entering =15 * c23^2 into cell H21.

This number, 170. ..., is a measure of the variability in the average daily gas used from month to month. But most of this variability can be predicted by the linear (straight-line) model for Gas used versus Heating Degree Days. Recall that in Column E, we computed the predicted gas usage based on the explanatory variable (x), Degree Days. These numbers vary, of course, from row to row, since x varies from row to row. How much of the variation in the response variable, y, or Gas used, is predicted by x. Find the total variation in the predicted values of y, the data in column E:

1. In cell I3, type the label Squares of Deviations of Predicted Gas from Mean Gas

2. First check that the mean of the predicted values is the same as mean of the original data; this is always the case, even for “bad fits.”

3. In I4, compute the squared deviation of the predicted value in E4 from the overall mean predicted value. You could copy that formula over, and note that it didn’t copy quite correctly, since there is an additional column between C4 and E4; or just type it =(E4 - $C$22)^2

4. Copy and paste that formula appropriately

5. Then sum these numbers.

We now have three sources of variation, as measured by sums of squares of deviations from means. (Incidentally, the sum of the squares of the residuals is the same as the sum of the squares of the deviations of the residuals from their mean, since the residual mean is always zero.) Is there any relation between these numbers? Yes. The total variation is the sum of the predicted variation and the error variation. Is this always the case? Yes, because the vector of residuals is always orthogonal (perpendicular) to the vector of predicted values. Check it out as follows:

1. In J3 place the label, Product of Predicted Gas Deviations with Residuals

2. In J4 through J19, enter the product of deviations of the predicted values and residuals:

1. The deviations are E4 - $C$22

2. The residuals are in F4

3. So enter = (E4 - $C$22) * F4 into J4.

3. Copy and paste appropriately into J5 through J19

4. Add these up. We should get zero.

5.

So the formulas are:

Data = Fit + Error

y = [pic] + e (Same equation with typical symbols for data, fit, and error)

[pic]

[pic]

[pic], but this last term always sums to 0, so

[pic]

Once again, and these remarks only apply to the students who have had some exposure to orthogonal vectors and dot products, the error vector and the prediction vector are orthogonal. So Pythagoras’ theorem applies, and the squared length of the prediction vector + the squared length of the error vector = the squared length of the data vector.

However, all should try to grasp the result.

Total variation = Predicted Variation + Residual Variation.

Finally, the fraction of the total variation in the data, that can be explained by our (linear) model is Predicted Variation divided by Total Variation. Compute this. On the spreadsheet, enter into I22 (say) the formula = I20/H20. Does this value look at all familiar? Probably not, but go back to the Chart sheet and look at the value of R2. It is 0.9906, exactly the same as the ratio of predicted variation to the total variation. In words that everyone may be able to understand, the linear model using temperature explains or predicts more than 99% of the variation in Gas used. The residual or unexplained variation is less than 1% of the total variation.

The Correlation Coefficient

A natural question is, “Why is this fraction of explained variation called R2 ?” The answer is a surprising theorem. The usual symbol for the correlation coefficient is R (??), or r, and this value measures the direction and strength of the linear relation between two variables. Correlation coefficients close to +1 or -1 indicate a high degree of linear relation. (If negative, then the slope is negative.) Compute the correlation coefficient between Gas used and Degree days. Choose an empty cell, say H23, and paste the correlation between these two variables into that cell. In H24, find the square of this value. Ta ta da. The theorem is that the square of the correlation coefficient is identical to the ratio of explained variation to total variation. (The mathematically inclined may try to prove this theorem for extra credit, but it’s not trivial.)

Unfortunately, some people with a little knowledge of statistics mistakenly believe that the correlation coefficient captures all of the relationship between two variables. It does not. The correlation coefficient does measure the degree to which two variables are linearly related, but may fail (miserably) to capture any other relationship. Any thorough study of the relation between two variables begins with a plot of the data.

Plot! Plot!! Plot!!! Plot!!!! Plot!!!! YOUR DATA

After saving the work you’ve done on this dataset, open once more the file relating Fuel used versus Speed driven. Recall that the graph is quite nonlinear. However, if we had been so foolish as not to plot the data, we might misinterpret the correlation coefficient, -0.17, and the resulting paltry value of R2, .0295, to conclude that speed driven accounts for only 3% of the variation in gasoline used, even when all the data come from the same automobile. There is a correct statement here with the conclusion about 3%, but the word linear must be used. There is very little linear relation between speed and fuel consumption, but there is a highly nonlinear relation that you might guess: Fuel consumption is lowest when an automobile is driven at a moderate speed. In fact, we saw that just a second-degree polynomial fitting fuel consumption to speed will explain about 80% of the variation in fuel consumption.

The Standards Data

A few months ago, twenty gels were prepared and measured by Ms. Virginia Hutchins to find and quantify the relation between Molecular weight and relative mobility. She and John Handy will describe their methods for determining the distances the proteins migrated. I was given the data on a spreadsheet that is now called OriginalStandards. Retrieve that data; store your retrieved copy on your floppy disk with some similar but unique name that personalizes it.

There is only one sheet now called Original. To keep things clean, if the dataset is not too large, I often do all my calculations on a separate sheet. Also, notice that the data for a particular gel are stored in rows; that’s OK, but the default Excel expects the data to be in columns (default choice), so let us make a copy of this data interchanging roles and columns. (It is also true that I like my data in columns, and statistical software seems to have that expectation also.)

The Paste Special Commands

There are many special pasting functions that Excel provides. Two that we will use are Paste Transpose and Paste Values. Those of you who have studied matrices know that the transpose of a matrix is the same matrix with rows and columns interchanged. My first step in analyzing this dataset was to make a transposed copy of this data on a new sheet.

1. Highlight all the data from A1 to K21

2. Click Edit > Copy

3. Click the Sheet1 tab at the bottom of the workbook.

4. We need columns 1 and 2 for the molecular weights and log molecular weights, and perhaps a title row; in any case, activate C2 on this new blank sheet.

5. Click Edit > Paste Special (Note: Since this command may not have been used recently, it may not be available unless you wait, or click the arrows at the bottom of the menu.)

6. In the lower right hand corner of the dialog box is a box for Transpose. Make sure that box is checked

7. Click OK.

So far, so good, I hope. There may be many sheets; it is a good idea to name the sheets to help us remember what’s where. To rename Sheet1, double click on the Sheet1 tab. Type the name (I chose Data) of the sheet “over” the highlighted sheet. “Enter” or clicking anywhere on the spreadsheet ends the sheet name editing.

The standards are used to calculate or estimate the weights of unknowns. One way (of the many ways) is to find some model that is approximately linear; then use the standards and their guaranteed known molecular weights to find the best linear relation for the model. Then look at the mobility (or relative mobility) of the unknowns, on the same gel, and the just-fitted model to find the molecular weight of the unknowns. In that effort, we need a good model for the relative mobility versus the molecular weights. A model that has been proposed is that relative mobility is linear in log Molecular weight.

First, we need to convert all of the mobility data in the Data sheet to relative mobility. We also need to bring in the molecular weight data. To keep things orderly, copy all the available information into a new sheet as follows:

1. Highlight the molecular weights on the Original sheet in H28 through H36.

2. Copy and paste them onto the blank Sheet2, at cell A4; note that only the upper left corner of a block of pasted cells needs to be selected.

3. In A3 label these data Molecular weight

4. In B3 type the label log Molecular weight

5. In B4, enter the formula =log(A4)

6. Copy and paste this formula into A5:A12

7. Make a copy of all the data in C2 to W12

1. Make the sheet Data active by clicking on it

2. Make C2 active (Click C2)

3. Highlight C2:W12 Hold the shift key down while hitting End Arrow Down and End Arrow Right (or use the mouse)

4. Click Edit > Copy

5. Click Sheet 2

6. Make C2 active

7. Click Edit > Paste

1. Change the mobility data to relative mobility

1. Activate D4 (which has the value 1.10)

2. Enter an = sign

3. Click on Data sheet

4. Click on D4 (Note the formula in the Editing window)

5. Enter the division sign / (Note the formula in the Editing window)

6. Click on D3 (Note the formula in the Editing window)

7. Hit the Enter key. The spreadsheet will return to Sheet2, cell D5, with the value 0.16 in D4 and the formula, =Data!D4/Data!D3 in cell D4. This is the way Excel refers to cells on another sheet; if there is no SheetName! before D4, then the same sheet is assumed.)

8. That’s the formula we want, except that the Dye-front measurement should be the denominator for all calculations in this column. We need to fix (make absolute) the value 3 in that formula. Click on D4.

9. Move the cursor between the D and the 3 in the formula entry window right above the column headings.

10. Enter a $ sign

11. Enter (Push the Enter key)

1. Copy and paste this formula into all the other values from D4 to W12.

1. Activate D4

2. Click Edit > Copy

3. Highlight D4 to W12

4. Click Edit > Paste

This is a good time to save what you’ve done so far. The upper left hand corner of your new data should look like

| | | | | |

|cm band 1 |0.16 |0.13 |0.12 |0.12 |

| | | | | |

|cm band 2 |0.27 |0.25 |0.24 |0.24 |

| | | | | |

|cm band 3 |0.31 |0.29 |0.28 |0.28 |

| | | | | |

|cm band 4 |0.38 |0.36 |0.36 |0.36 |

| | | | | |

|cm band 5 |0.49 |0.46 |0.48 |0.47 |

Give this sheet a name; I am using RelativeMobility.

Graphs with Many Curves

The data for horizontal axis for all of these graphs is in column B, and is not adjacent to the data for the vertical axis. We will have to enter our ranges somewhat manually as follows. While we could copy and paste the two columns together, another way is preferable in the long run. For technical reasons that may become apparent, we need to move or eliminate the unnecessary line containing the dye-front data on the RelativeMobility sheet. In case we need it again, remember that it still exists in another or two sheets.

1. Highlight the data from C2 to W2 (This is the Gel ID row)

2. Click Edit > Cut (Not Copy, but Cut)

3. Activate cell C3

4. Click Edit > Paste

1. Click the Chart Wizard button

2. Choose xy-scatter with smooth connecting lines and Next

3. The dialog box asks for the data range. Click the Series tab.

4. Click the Add button

5. For name choose D3

6. For x-values choose (by point and click method or typing) B4:B12

7. For y-values choose D4:D12

8. Make it a new sheet.

Now edit the x and y axes so that more of the graph is relevant. In particular, change the minimum on the x-axis to something 3.8 or less, and the minimum on the y-axis to 0.1 so that the graph is stretched.

1. On the chart sheet, double click anywhere on the values on the x-axis.

2. On the dialog box click the scale tab

3. Change the minimum value

4. Do something similar if you like for the y-axis.

Once we have the graph set up the way we want it, it is easy to use the Chart options to add more data. With the graph sheet active,

1. Click Chart > Add data

2. Notice that the instructions allow you to name the series by including it at the top of the column (or left of a row); click the point and click red arrow.

3. Click the RelativeMobility sheet.

4. Let us add 4 graphs to the sheets by choosing E3:H12.

5. Click the read arrow again, and finish up.

Data Snooping

At this point, we could add more data series to the graphs, but a pattern seems to be emerging. What are these graphs telling you/us? Brainstorm. Does the linear model we’re using seem adequate?

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download