Regression Analysis



The statistical comparisons you make when you compare means, as in the “t” test, or variances, as in the “F” test, are univariate (that is, each observation consists of a single value). You can imagine situations where two factors change together. Often one is related to the other, either by a cause-effect relationship or because they “co-vary” under the influence of some third factor. For instance, the growth rate of a population of micro-organisms may vary in relation to the amount of food they receive (this could be a cause-effect relationship), or the growth rate and protein content might vary together, perhaps both influenced by a third factor (food).

One way of examining these bivariate situations is by using regression models. Chapter 14 in Sokal & Rohlf (1981) gives a good introduction to this sort of analysis. This handout simply tries to give you a brief outline of the basics of linear regression. It is not meant to provide a theoretical grounding in regression models.

If you examine the scattergram of pod length vs. number of seeds (note that this is a continuous variable vs. a discrete variable) you will probably notice a tendency for longer pods to have more seeds. This is not surprising, but suppose someone were to ask you to be a bit more explicit about this relationship. HOW does seed number increase as a function of pod length? In other words, is there a more explicit way to describe the observation that longer pods tend to have more seeds?

Another question your pesky friend might ask: is the relationship suggested by your scattergram significant? That is, is the slope of the regression line significantly different from zero?

The simplest regression model is that of a straight line described by the equation:

Y = a + bX

Where Y is the value on the Y axis (the dependent variable, e.g. seed number) and X is the corresponding value on the X axis (the independent variable, e.g. pod length). [NOTE: sometimes the decision of what is independent and what is dependent is clear; other times it is not.] In this equation “b” is the slope of the line (“rise over run”, or $\Delta Y / \Delta X$), and “a” is the Y intercept (i.e. the value of Y when X = 0).

To determine the slope of the regression line you need the following values (normally you would use a computer or a calculator, but the first time try it by hand so you get a feel for how your data produce the result you get).

n  the number of (pairs of) observations

$\sum X$  the sum of the X values (pod lengths)

$(\sum X)^2$  simply the above value squared

$\sum (X^2)$  the sum of the individual squared X values

$\sum Y$  the sum of the Y values (the number of seeds)

$(\sum Y)^2$  the above value squared

$\sum (Y^2)$  the sum of the individual squared Y values

$\sum XY$  the sum of the n terms obtained by multiplying each X by its corresponding Y
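The raw sums listed above are easy to check by machine once you have tried them by hand. The following is a minimal Python sketch; the five pod measurements and all variable names are invented for illustration and are not data from the course.

```python
# Hypothetical data, invented for illustration: five pods.
pod_length = [4, 5, 6, 7, 8]   # X values, pod length (e.g. in cm)
seed_count = [3, 5, 4, 7, 6]   # Y values, number of seeds per pod

n = len(pod_length)                          # number of (X, Y) pairs

sum_x = sum(pod_length)                      # sum of the X values
sum_x_all_squared = sum_x ** 2               # (sum of X) squared
sum_x_sq = sum(x * x for x in pod_length)    # sum of the squared X values

sum_y = sum(seed_count)                      # sum of the Y values
sum_y_all_squared = sum_y ** 2               # (sum of Y) squared
sum_y_sq = sum(y * y for y in seed_count)    # sum of the squared Y values

# Sum of the n products of each X with its corresponding Y.
sum_xy = sum(x * y for x, y in zip(pod_length, seed_count))

print(n, sum_x, sum_x_all_squared, sum_x_sq)
print(sum_y, sum_y_all_squared, sum_y_sq, sum_xy)
```

Note the difference between squaring the sum and summing the squares: mixing the two up is the most common slip in the hand calculation.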

From these values you can now calculate:

$\bar{X}$  the mean of the X values (the average pod length in each sample)

$\bar{Y}$  the mean of the Y values (the average number of seeds in each sample)

$\sum x^2$  the sum of the squared deviations of each of the X values from the mean, i.e.

$$\sum x^2 = \sum (X - \bar{X})^2 = \sum (X^2) - \frac{(\sum X)^2}{n}$$

$\sum y^2$  the sum of the squared deviations of each of the Y values from the mean $\bar{Y}$ (calculated as above)

and finally the sum of the products of the X and Y deviations:

$$\sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$
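To see that the shortcut formulas give the same answer as summing the deviations directly, you can compute both. This is a small Python sketch under the assumption of made-up data; the variable names (`ss_x`, `ss_y`, `sp_xy`) are chosen here for illustration only.

```python
# Hypothetical data, invented for illustration.
pod_length = [4, 5, 6, 7, 8]   # X values
seed_count = [3, 5, 4, 7, 6]   # Y values

n = len(pod_length)
mean_x = sum(pod_length) / n   # mean of the X values
mean_y = sum(seed_count) / n   # mean of the Y values

# Definitional form: sum of squared deviations of X from its mean.
ss_x_definition = sum((x - mean_x) ** 2 for x in pod_length)

# Computational (shortcut) forms built from the raw sums.
ss_x = sum(x * x for x in pod_length) - sum(pod_length) ** 2 / n
ss_y = sum(y * y for y in seed_count) - sum(seed_count) ** 2 / n
sp_xy = (sum(x * y for x, y in zip(pod_length, seed_count))
         - sum(pod_length) * sum(seed_count) / n)

print(mean_x, mean_y)
print(ss_x_definition, ss_x, ss_y, sp_xy)
```

The two versions of the X sum of squares should agree apart from rounding error, which is a useful check on a hand calculation.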

Now you can determine:

The regression coefficient i.e. the slope of the line:

$$b = \frac{\sum xy}{\sum x^2}$$

and the y intercept:

$$a = \bar{Y} - b\bar{X}$$
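Putting the pieces together, the slope and intercept follow in two lines. The sketch below is one possible Python version, not a prescribed method; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared X deviations and sum of products of deviations,
    # both from the shortcut formulas.
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = sp_xy / ss_x            # slope (regression coefficient)
    a = mean_y - b * mean_x     # Y intercept
    return a, b


# Hypothetical data, invented for illustration.
pod_length = [4, 5, 6, 7, 8]
seed_count = [3, 5, 4, 7, 6]

a, b = fit_line(pod_length, seed_count)
print(f"Y = {a:.2f} + {b:.2f}X")   # prints: Y = 0.20 + 0.80X
```

With these invented numbers each extra unit of pod length goes with 0.8 more seeds on average. The function fails with a division by zero if all X values are identical, since no slope can be estimated in that case.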

In the real world you will probably never do all these calculations. Any scientific calculator, computer spreadsheet, or statistics program will give you the final values for slope and intercept without any intermediate steps on your part. Nevertheless, it is a very good idea to go through the calculations at least once in your life (why not in 439L?) to see what’s going on with the numerical manipulations. Plus, if your calculator battery dies you won’t be stuck!

To test the significance of the regression line (i.e. is its slope different from 0? A slope of 0 means the values of the Y’s are not a function of the values of their corresponding X’s) you need a measure of the variability of both variates and of their relationship. That measure is the standard error of the slope:

$$S_b = \sqrt{\frac{\sum y^2 - \frac{(\sum xy)^2}{\sum x^2}}{n - 2} \cdot \frac{1}{\sum x^2}}$$

Using this value, a “t” value for the test that your slope is equal to zero can be obtained:

$$t_s = \frac{b - 0}{S_b}$$

You use zero here because you are interested in the null hypothesis that your slope is not different from zero. If you had another hypothesis, that value would be used instead. Can you think of a situation where you might have another null hypothesis?
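The standard error of the slope and the t value can be computed from the same sums of squares and products. The following Python sketch is one way to do it under the assumptions stated in the comments; the function name `slope_t_statistic`, the parameter `hypothesized_slope`, and the data are invented for illustration.

```python
import math


def slope_t_statistic(xs, ys, hypothesized_slope=0.0):
    """Return (b, s_b, t_s) for a simple linear regression of ys on xs.

    Assumes more than two pairs and at least two distinct X values.
    """
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

    b = sp_xy / ss_x                              # slope
    unexplained_ss = ss_y - sp_xy ** 2 / ss_x     # variation in Y not explained by X
    s_b = math.sqrt(unexplained_ss / (n - 2) * (1 / ss_x))  # standard error of b
    t_s = (b - hypothesized_slope) / s_b          # t value for the null hypothesis
    return b, s_b, t_s


# Hypothetical data, invented for illustration.
pod_length = [4, 5, 6, 7, 8]
seed_count = [3, 5, 4, 7, 6]

b, s_b, t_s = slope_t_statistic(pod_length, seed_count)
print(f"b = {b:.3f}, S_b = {s_b:.3f}, t_s = {t_s:.3f}")
```

Passing a different `hypothesized_slope` tests the slope against that value instead of zero.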

Now just look up the t value you got in your t table, using n − 2 degrees of freedom, to see if it is so large that you consider the hypothesis that the slope of your regression is zero to be unlikely. Remember that the slope may be positive or negative, so it is the size of t, not its sign, that matters.
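The table lookup can also be sketched in code. The example below is a rough illustration, not a replacement for a full t table: it assumes a two-tailed test at the 0.05 level with n − 2 degrees of freedom, holds only a few critical values copied from a standard t table, and uses hypothetical names (`T_CRITICAL_05`, `slope_is_significant`) and made-up t values.

```python
# Two-tailed critical values of t at the 0.05 level, keyed by degrees
# of freedom. Only a few rows of a standard t table are included here.
T_CRITICAL_05 = {
    3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
    7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def slope_is_significant(t_s, n, table=T_CRITICAL_05):
    """Return True if |t_s| exceeds the tabled critical value.

    Assumes n - 2 degrees of freedom; raises KeyError if that row
    is not in the (deliberately short) table above.
    """
    df = n - 2
    # The sign of t only reflects the direction of the slope,
    # so the absolute value is compared with the table.
    return abs(t_s) > table[df]


# Made-up t values for illustration.
print(slope_is_significant(2.309, n=5))    # False: 2.309 < 3.182 (df = 3)
print(slope_is_significant(-4.1, n=10))    # True: 4.1 > 2.306 (df = 8)
```

The first call shows that an apparently clear trend can fail to reach significance when there are only a handful of observations.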

Sokal, R. R. & F. J. Rohlf. 1981. Biometry. 2nd ed. W. H. Freeman & Co., San Francisco.
