MA 1024 - Worcester Polytechnic Institute



Biology and Chemistry Project

Regression Analysis

Introduction

Laboratory science often involves running an experiment, gathering data and trying to make a generalization based upon the data.

One common goal is to demonstrate a relationship between two variables from gathered data. For example, we might have a drug which we hope reduces cholesterol. Data gathered might be of the form $(x_i, y_i)$, where $x_i$ is the dosage given and $y_i$ is the LDL level. We might then plot the data and see if LDL decreases with dosage. If it seems to, we would then want to establish a degree of confidence in our result and also establish an equation that models the situation and lets us make predictions.

This kind of data analysis can be done with Regression Analysis.

The goals of this project are to:

• show the basic concepts behind regression analysis and how they relate to basic calculus

• have the student develop the technology skills needed to apply these concepts

Part One: Fundamental Notions

Suppose we have n data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ gathered experimentally which, when plotted, are somewhat linear (they follow a straight line to some extent). For example, the data [10,280], [12,270], [15,265], [20,240], [24,220], [30,220] looks like

[pic]

We wish to fit a straight line to this data so that we can make predictions with it. Here there is little enough data that we can do a decent job by hand. As a guess, I take y = 320 - 4x. Putting my line and the data on the same graph, we have

[pic]

This came out pretty well. The data was pretty linear and there wasn't much of it. We wish to use calculus to do the best possible job at this, no matter how much data there is. So we assume we have an arbitrary straight line, y = mx + b, and we wish to pick m and b so as to best fit the data (since m and b define the straight line). This means we need some way to compare two different straight lines.

While there are many ways to do this, we pick the following. At each data point, we measure the vertical distance by which the straight line misses the point. We then square this quantity and add up the squares for all n data points.

Why do we square them? So there are no + and – values canceling one another out. All misses, above and below the points, count.

This amounts to defining the following expression for the total error, E:

$$E = (m x_1 + b - y_1)^2 + (m x_2 + b - y_2)^2 + \cdots + (m x_n + b - y_n)^2$$

For the straight line y = 320 - 4x and the 6 data points above, the total error is

$$E = (280 - 280)^2 + (272 - 270)^2 + (260 - 265)^2 + (240 - 240)^2 + (224 - 220)^2 + (200 - 220)^2 = 445$$
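The arithmetic above can be checked with a short script. This is a minimal sketch in Python (the handout itself relies on TI calculators and Excel); the function name `total_error` is made up for illustration.

```python
# Data from Part One: (dosage, LDL level) pairs.
data = [(10, 280), (12, 270), (15, 265), (20, 240), (24, 220), (30, 220)]


def total_error(m, b, points):
    """Sum of the squared vertical misses of the line y = m*x + b."""
    return sum((m * x + b - y) ** 2 for x, y in points)


# The hand-picked guess y = 320 - 4x.
guess_error = total_error(-4, 320, data)
print(guess_error)  # 445
```

Trying other values of m and b in `total_error` is a quick way to compare candidate lines before bringing in calculus.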

The value of 445, all by itself, is pretty much meaningless. Compared with the values for other lines, it may be useful. Of course, if it came out to 0 we would know we had perfectly linear data and had matched it! If we can come up with another line which results in a smaller value of E, that is good. With calculus, we can find the line that results in the smallest E of all possible lines.

The total error E can be written more concisely using summation notation, as in Calculus II and III:

$$E = \sum_{i=1}^{n} (m x_i + b - y_i)^2$$

The problem is now much more mathematical. We are trying to find values for m and b so that E is a minimum. This is the best fit straight line. Again, the variables here are m and b, not x and y! The x's and y's are data points and actual numbers, as opposed to variables.

How do we do this? The minimum will occur where the derivatives of E with respect to the variables m and b are both equal to 0. That's the starting point. To do these derivatives correctly, we need two rules from Calculus I:

• the derivative of a sum is the sum of the derivatives

• the power/chain rule

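All of this results in

$$\frac{\partial E}{\partial m} = \sum_{i=1}^{n} 2\,(m x_i + b - y_i)\,x_i$$

and

$$\frac{\partial E}{\partial b} = \sum_{i=1}^{n} 2\,(m x_i + b - y_i)$$

with these being set to 0 to find the critical point.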
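To see how the two rules combine, it may help to differentiate a single term of the sum first. The following is a sketch of that step, using the same symbols as in E and treating $x_i$ and $y_i$ as constants:

```latex
\frac{\partial}{\partial m}\,(m x_i + b - y_i)^2 = 2\,(m x_i + b - y_i)\cdot x_i,
\qquad
\frac{\partial}{\partial b}\,(m x_i + b - y_i)^2 = 2\,(m x_i + b - y_i)\cdot 1
```

The power/chain rule supplies the factor of 2 and the inner derivative ($x_i$ or 1); the sum rule then says the derivative of E is the sum of these terms over i = 1, ..., n.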

Problem 1: Using algebra, show that these may be rearranged to

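$$m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i$$

$$m \sum_{i=1}^{n} x_i + n\,b = \sum_{i=1}^{n} y_i$$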

(you will need to do the algebra carefully to get here but it can be done!!)

Problem 2:

a) Neatly plot the data points (1,4), (2,10), (3,10), (5,19), (6,23), (8,24).

b) For the straight line y = 2x + 2, compute the total error E.

c) Using this data, set up the two equations from Problem 1.

Now since all x's and y's are known as well as n, we have two equations and two unknowns, m and b.

Furthermore the equations are linear so they are easy to solve by elimination, for example.
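One way to carry out the elimination in code is sketched below, assuming the two equations take the standard least-squares form written in the docstring. It is applied to the drug data from the start of Part One rather than to the Problem 2 data; the name `least_squares_line` is illustrative only.

```python
def least_squares_line(points):
    """Solve the two linear equations for slope m and intercept b.

    Assumed form of the equations:
        m * sum(x^2) + b * sum(x) = sum(x*y)
        m * sum(x)   + b * n      = sum(y)
    """
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    # Eliminate b: n times the first equation minus sum(x) times the second.
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    m = (n * sxy - sx * sy) / denom
    # Back-substitute into the second equation to get b.
    b = (sy - m * sx) / n
    return m, b


ldl_data = [(10, 280), (12, 270), (15, 265), (20, 240), (24, 220), (30, 220)]
m, b = least_squares_line(ldl_data)
print(round(m, 4), round(b, 2))  # -3.3019 310.25
```

The same function can be pointed at any list of (x, y) pairs, which makes it a convenient way to check hand calculations.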

Problem 3:

a) Solve your equations for m and b

b) Compute the total error E for the line y = mx + b using the values you just got for m and b

c) Plot the data, the line y = 2x + 2 and the line y = mx + b on the same plot. Comment.

What all of this amounts to is finding the slope and intercept that result in the smallest possible error, E. The resulting line is often called the "Least Squares straight line". In practice, no one does this by hand as you did above. The usual options are TI calculators or the Excel spreadsheet, as well as statistical packages such as SPSS. This leads us to

Part Two – Technology

First of all you need to make a decision: are you going to use your TI calculator or the Excel spreadsheet? Your choice.

If you are going to use the TI, you must get the manual. "I don't have the manual" you respond. This is common – you need to go to the TI website and download a copy. Start your hunt at:

If you are going to use Excel, you need to see if it is configured to perform Regression Analysis. Do this by firing up Excel. On the top of it are various Menus ( File, Edit, View, Insert, Format, Tools, Data, Window, Help …). Look under Tools and see if there is a Data Analysis entry. If there is then you are all set. If there is not then you need to install the Data Analysis tools. See CCC about this.

In either case we need to develop the skills to use the technology. The first thing you want to do is check that the best-fit straight line for data taken from a straight line is that same line (think about this). So, for the sake of discussion, let's take y = 2x + 3 and generate some data. For example

{ (1,5), (2,7), (3,9), (5,13) }.

So enter this in. If you are using Excel, put it into two adjacent columns. Have your software perform the Regression Analysis. You should find a slope of 2 and an intercept of 3. These may appear either as "m and b" or as "slope and intercept".

From Excel you get an output like this:

SUMMARY OUTPUT

Regression Statistics
| Multiple R        | 1 |
| R Square          | 1 |
| Adjusted R Square | 1 |
| Standard Error    | 0 |
| Observations      | 4 |

ANOVA
|            | df | SS | MS | F     | Significance F |
| Regression | 1  | 35 | 35 | #NUM! | #NUM!          |
| Residual   | 2  | 0  | 0  |       |                |
| Total      | 3  | 35 |    |       |                |

|              | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
| Intercept    | 3            | 0              | 65535  | #NUM!   | 3         | 3         | 3           | 3           |
| X Variable 1 | 2            | 0              | 65535  | #NUM!   | 2         | 2         | 2           | 2           |

A lot of this is of no value to you right now but of interest are 4 things: R Square, Observations, Intercept and X Variable 1. We will talk about R Square momentarily but the others should be consistent with our data.

Comment on Excel: it's easy to get the X and Y values backwards.

Now let's go back to our original data. Analyzing this with technology, we get

m = -3.3019,  b = 310.25,  $R^2 = 0.929$

This tells us the best possible fit is y = -3.3019x + 310.25. Physically, this means that with no medication one would expect an LDL level of about 310, and for each unit of dosage given the LDL drops by about 3.3 units. This is the best estimate we can make. The value of $R^2$, which is close to 1, tells us the original data was quite linear. (R cannot be bigger than 1 or less than -1, so $R^2$ lies between 0 and 1.)

A Second Example

Suppose our data is: [10,200],[14,230],[15,180],[20,300],[25,200],[30,320]

which plots to

[pic]

The relationship is not nearly so clear. We can still apply technology to it with the following results:

slope: 4.82,  intercept: 146.7,  $R^2 = 0.38$

so the best fit is y = 4.82x + 146.7. Superimposing this line on a plot of the data, we get the following:

[pic]

What is of note here is the graph taken together with $R^2$. The value of .38 is relatively low and suggests that we should not be so confident about the relationship. In some texts, R is called the correlation coefficient; it measures the strength of the linear relationship between X and Y.

An important point is: you can perform linear regression analysis on ANY data and generate a straight line. That action does not mean the relationship was linear or that doing the fit was a good idea. The value of R (or $R^2$) provides some help with this decision. What is acceptable varies considerably from discipline to discipline. In the world of social science, .38 is pretty good. In the world of experimental physics, it is pretty poor.
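The handout does not say how $R^2$ is computed. Assuming the usual definition for a least-squares line, namely 1 minus the ratio of the minimized total error E to the sum of squared deviations of the y values from their mean, the two examples can be compared with the sketch below. The function names are illustrative.

```python
def fit_line(points):
    """Least-squares slope and intercept, written in terms of deviations from the means."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def r_squared(points):
    """R^2 = 1 - (minimized total error E) / (total variation of y about its mean)."""
    m, b = fit_line(points)
    mean_y = sum(y for _, y in points) / len(points)
    residual = sum((m * x + b - y) ** 2 for x, y in points)  # E for the best-fit line
    total = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - residual / total


first = [(10, 280), (12, 270), (15, 265), (20, 240), (24, 220), (30, 220)]
second = [(10, 200), (14, 230), (15, 180), (20, 300), (25, 200), (30, 320)]
print(round(r_squared(first), 3), round(r_squared(second), 3))
```

Run as written, this should give roughly 0.929 for the first data set and 0.386 for the second, the latter being the value the text quotes as .38.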

Part Three: Exponential Fits

It would be nice if all related data had linear relationships. However, science and nature are not so kind. Exponential relationships arise in situations such as the cooling of an object (see Newton's Law of Cooling, for example), the growth of bacteria, and the decay, in aggregate, of unstable isotopes.

It is hard to tell visually whether data is exponential. Exponential data either increases or decreases very rapidly, but so do a lot of other functions.

Consider the data { (0.5, 1.1), (1.0, 1.22), (1.5, 1.35), (2.0, 1.5), (2.5, 1.65), (3.0, 1.82), (3.5, 2.00), (4.0, 2.22), (4.5, 2.46), (5.0, 2.72) }. If we plot it, it appears as

[pic]

Does it look linear? Perhaps. Quadratic (a parabola)? Perhaps. Exponential? Hard to tell.

Some Simple Algebra

If x and y have a linear relationship then they satisfy an equation of the form y = mx + b.   (1)

If they have an exponential relationship then they satisfy an equation of the form $y = Ae^{kx}$.   (2)

Now if we take the natural log of both sides of equation (2) we have

$$\ln(y) = \ln(Ae^{kx})$$

Using the property of natural logs that ln(UV) = ln(U) + ln(V) we have

$$\ln(y) = \ln(A) + \ln(e^{kx})$$

Further, since logs and exponentials are inverses of one another, $\ln(e^{kx}) = kx$, so we now have

$$\ln(y) = \ln(A) + kx$$

Now if A is a constant, so is ln(A). This last equation may be viewed in a valuable geometric way:

if y and x have an exponential relationship ($y = Ae^{kx}$), then the graph of ln(y) vs x is linear.

A linear graph is easy to see. It is easy to check – the slope between any pair of points should be about the same. We have lots of software for working with linear data fits. This is good!

We can try this out on the data used earlier in this part. Taking the natural log of each y coordinate (rounded to one decimal place), we have

{ (0.5, 0.1), (1.0, 0.2), (1.5, 0.3), (2.0, 0.4), (2.5, 0.5), (3.0, 0.6), (3.5, 0.7), (4.0, 0.8), (4.5, 0.9), (5.0, 1.0) }.

This is clearly linear!! The slope between any two points (using rise/run) is .2. With a little high school algebra, the intercept is 0, so the equation fitting ln(y) vs x is

ln(y) = .2x + 0
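The slope check can also be run on the unrounded logarithms. The sketch below, in Python, computes rise over run for ln(y) between each pair of neighbouring points in the data from the start of this part; the variable names are illustrative.

```python
import math

growth = [(0.5, 1.1), (1.0, 1.22), (1.5, 1.35), (2.0, 1.5), (2.5, 1.65),
          (3.0, 1.82), (3.5, 2.00), (4.0, 2.22), (4.5, 2.46), (5.0, 2.72)]

# Slope (rise over run) of ln(y) between each pair of neighbouring points.
slopes = [
    (math.log(y2) - math.log(y1)) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(growth, growth[1:])
]
print([round(s, 2) for s in slopes])
```

Because the y values were themselves rounded to two or three digits, the slopes are not exactly equal, but each should land within about 0.02 of 0.2.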

We can get y in terms of x by exponentiating both sides:

$$e^{\ln(y)} = e^{0.2x}$$

Now logs and exponentials are inverses, so this becomes $y = e^{0.2x}$.

(Had the intercept b not been 0 then, from basic algebra, $e^{0.2x + b} = e^{0.2x} \cdot e^{b}$, and $e^{b}$ would become A as above.)

To summarize this and apply it:

• to tell if data is exponential, plot ln(y) vs x and see if it is linear

• to fit an exponential to it, fit a straight line to ln(y) vs x and then exponentiate it
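The two-step recipe above can be sketched in Python as follows. The least-squares line is fitted to the pairs (x, ln y) and the intercept is then exponentiated to recover A; the function name `fit_exponential` is made up for illustration, and the data is the example set from earlier in this part.

```python
import math


def fit_exponential(points):
    """Fit y = A * exp(k * x) by fitting a least-squares line to (x, ln y)."""
    logged = [(x, math.log(y)) for x, y in points]  # requires every y > 0
    n = len(logged)
    mean_x = sum(x for x, _ in logged) / n
    mean_ln = sum(v for _, v in logged) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in logged)
    sxv = sum((x - mean_x) * (v - mean_ln) for x, v in logged)
    k = sxv / sxx                # slope of the ln(y) vs x line
    ln_a = mean_ln - k * mean_x  # intercept of that line, equal to ln(A)
    return math.exp(ln_a), k


growth = [(0.5, 1.1), (1.0, 1.22), (1.5, 1.35), (2.0, 1.5), (2.5, 1.65),
          (3.0, 1.82), (3.5, 2.00), (4.0, 2.22), (4.5, 2.46), (5.0, 2.72)]
A, k = fit_exponential(growth)
print(round(A, 2), round(k, 2))  # 1.0 0.2
```

This agrees with the hand result $y = e^{0.2x}$. Note that the method minimizes the squared error in ln(y), not in y itself, so it is not guaranteed to match a direct least-squares fit of the exponential.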

Problem Set 4

Fit an exponential to the following data:

| t (min) | 0  | 1  | 2  | 3  | 4  | 5    | 6  | 7  | 8    | 9    | 10 | 11   | 12 | 13   |
| T (°C)  | 87 | 85 | 84 | 82 | 81 | 80.5 | 80 | 79 | 78.0 | 77.5 | 77 | 76.0 | 75 | 74.0 |

This data came from collecting temperature vs time for a cup of very hot water. Time t is in minutes and temperature T is in degrees Celsius.

Use it to predict how long it will take for the water to cool to 50 degrees.
