


Page 17

Correlation and Regression

Correlation and regression are used to explore the relationship between two or more variables. The correlation coefficient r is a measure of the linear relationship between paired variables x and y. For data (x_1, y_1), ..., (x_n, y_n), it is a statistic calculated using the formula

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}

The correlation coefficient satisfies -1 ≤ r ≤ 1. If y is a linear function of x, then r = 1 if the slope is positive and r = -1 if it is negative. We emphasize that r is a measure of linear relationship, not of functional relationship in general.
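For readers who want to compute r directly, here is a minimal Python sketch of the formula above, using only the standard library (the function name pearson_r is our own):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data (xs, ys)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Numerator: sum of cross products of deviations from the means.
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```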

Example. Here is a small dataset:

   x       y
-3.0   0.00000
-2.5   1.65831
-2.0   2.23607
-1.5   2.59808
-1.0   2.82843
-0.5   2.95804
 0.0   3.00000
 0.5   2.95804
 1.0   2.82843
 1.5   2.59808
 2.0   2.23607
 2.5   1.65831
 3.0   0.00000

Mean of y = 2.11983

Correlations: x, y

Pearson correlation r of x and y = -0.000; P-Value = 1.000

You see the relationship, of course: x² + y² = 9.

A more concrete interpretation of r will be given later when we discuss regression. For the moment, view high values of |r| as indicating that the points (x, y) lie nearly on a straight line, while low values indicate that no obvious line passes close to all the points.
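Applying the pearson_r sketch above to this dataset reproduces the Minitab result: the points satisfy x² + y² = 9 exactly, yet r = 0, because the relationship has no linear component:

```python
x = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [(9 - xi ** 2) ** 0.5 for xi in x]   # upper half of the circle x^2 + y^2 = 9

print(pearson_r(x, y))   # 0.0: a perfect functional, but not linear, relationship
```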

Page 18

Astronomy Example.

We use the data from Mukherjee, Feigelson, Babu, et al., “Three Types of Gamma-Ray Bursts” (The Astrophysical Journal, 508, pp. 314-327, 1998), in which there are 11 variables, including two measures of burst duration, t50 and t90 (the times within which 50% and 90% of the flux arrives). This dataset will be used to illustrate concepts in correlation and regression.

One would expect, from the context of the example, that the variables t50 and t90 ought to be strongly related. Here is output from Minitab showing the value of r:

Correlations: t50, t90

Pearson correlation of t50 and t90 = 0.868

P-Value = 0.000

Regression Analysis: t90 versus t50

The regression equation is

t90 = 10.1 + 1.65 t50

Predictor      Coef   SE Coef      T      P
Constant     10.106     1.066   9.48  0.000
t50         1.64553   0.03333  49.37  0.000

S = 26.8058   R-Sq = 75.3%   R-Sq(adj) = 75.3%

Analysis of Variance

Source            DF       SS       MS        F      P
Regression         1  1751506  1751506  2437.55  0.000
Residual Error   800   574841      719
Total            801  2326347
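The same fit can be reproduced in Python with statsmodels. This is a sketch, not the analysis actually run in these notes; the file name grb.csv and the column names t50 and t90 are assumptions about how the 802 bursts (Total DF = 801 above) might be stored:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file holding the burst durations from the Mukherjee et al. data.
data = pd.read_csv("grb.csv")

X = sm.add_constant(data["t50"])   # adds the intercept column
model = sm.OLS(data["t90"], X).fit()
print(model.summary())             # coefficients, SE, t, p, R-sq, and the F test
```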

Page 19

Unusual Observations (Partial list!)

Obs  t50      t90      Fit  SE Fit  Residual  St Resid
  2   69  208.576  123.002   2.031    85.574     3.20R
  5  306  430.016  514.242   9.768   -84.226    -3.37RX
  7   30  381.248   58.866   1.070   322.382    12.04R
 10   95  182.016  166.813   2.847    15.203     0.57 X
 20  204  292.736  345.530   6.375   -52.794    -2.03RX
 26   61  221.184  111.207   1.823   109.977     4.11R
 44   30  123.392   58.761   1.069    64.631     2.41R
 77   58  179.840  104.783   1.713    75.057     2.81R
127   19  110.464   41.911   0.959    68.553     2.56R
130  108  158.080  187.665   3.248   -29.585    -1.11 X

R denotes an observation with a large standardized residual.

X denotes an observation whose X value gives it large influence.
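The R and X flags correspond to standardized residuals exceeding about 2 in absolute value and to unusually high-leverage x values. Continuing the statsmodels sketch above, these diagnostics can be recovered as follows (the leverage cutoff 3p/n is a common rule of thumb, not taken from the source):

```python
influence = model.get_influence()
std_resid = influence.resid_studentized_internal   # standardized residuals (St Resid)
leverage = influence.hat_matrix_diag               # leverage values h_ii

r_flags = abs(std_resid) > 2                       # "R": large standardized residual
x_flags = leverage > 3 * X.shape[1] / len(X)       # "X": high-leverage x value
```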

[Plot: Residuals vs Fits for t90]

The plot shows nonconstant variance and some very unusual standardized residuals, which suggests transforming both x and y. We will discuss transformations later.
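A residuals-versus-fits plot of the kind described can be drawn with matplotlib, again continuing the sketch above:

```python
import matplotlib.pyplot as plt

plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="gray", linewidth=1)   # reference line at residual = 0
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.title("Residuals vs Fits for t90")
plt.show()
```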

Transform by taking logs of both variables.

Correlations: log(t50), log(t90)

Pearson correlation of log(t50) and log(t90) = 0.975

P-Value = 0.000
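The transformation itself is one line in Python, continuing the earlier sketch. We assume base-10 logarithms; the choice of base affects only the intercept of the fit below, not r or the slope, since both variables are transformed:

```python
import numpy as np

log_t50 = np.log10(data["t50"])
log_t90 = np.log10(data["t90"])

print(np.corrcoef(log_t50, log_t90)[0, 1])   # about 0.975

X_log = sm.add_constant(log_t50)
log_model = sm.OLS(log_t90, X_log).fit()     # log(t90) = 0.413 + 0.984 log(t50)
```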

Page 20

Regression Analysis: log(t90) versus log(t50)

The regression equation is

log(t90) = 0.413 + 0.984 log(t50)

Predictor      Coef    SE Coef       T      P
Constant   0.413395   0.008372   49.38  0.000
log(t50)   0.984235   0.007863  125.17  0.000

S = 0.206054   R-Sq = 95.1%   R-Sq(adj) = 95.1%

Analysis of Variance

Source            DF      SS      MS         F      P
Regression         1  665.19  665.19  15666.85  0.000
Residual Error   800   33.97    0.04
Total            801  699.15

Unusual Observations (Partial list)

Obs  log(t50)  log(t90)       Fit   SE Fit  Residual  St Resid
  7      1.47   2.58121   1.86195  0.01040   0.71925     3.50R
 12      0.83   1.70600   1.22772  0.00765   0.47828     2.32R
 47     -1.85  -1.46852  -1.41125  0.02008  -0.05727    -0.28 X
 56     -1.72  -1.17393  -1.28072  0.01911   0.10679     0.52 X
 60      0.70   1.52799   1.10066  0.00740   0.42733     2.08R
 95      0.76   1.79607   1.16183  0.00750   0.63425     3.08R
125     -0.89  -0.04769  -0.46532  0.01332   0.41763     2.03R
141     -0.11   0.82321   0.30056  0.00885   0.52265     2.54R
156     -1.06  -0.10018  -0.63037  0.01445   0.53019     2.58R
168      0.20   1.50775   0.61430  0.00771   0.89345     4.34R
175      0.97   2.23188   1.36569  0.00806   0.86619     4.21R
176      0.20   1.26102   0.61430  0.00771   0.64673     3.14R
210      1.02   2.08812   1.41832  0.00825   0.66980     3.25R
215      0.44   1.38710   0.84611  0.00731   0.54099     2.63R
221     -0.72  -0.71670  -0.29201  0.01219  -0.42469    -2.06R
255      0.17   1.10285   0.57866  0.00780   0.52419     2.55R
272      1.45   2.26194   1.84404  0.01030

R denotes an observation with a large standardized residual.

X denotes an observation whose X value gives it large influence.

Page 21

[Plot: residuals vs. fitted values, from Minitab's 'Four in One' plot]

There are still some unusual observations, but the plots look much better.

Before leaving this example, we note the following relationship involving FITS1, the column of fitted values that Minitab stores from the regression above:

Correlations: log t90, FITS1

Pearson correlation r of log t90 and FITS1 = 0.975

P-Value = 0.000

If you look back to the bottom of page 19, you will see that the correlation between log t90 and log t50 is also 0.975. That is, the correlation between the fitted (predicted) values and the observations y is the same as the correlation between the two variables x = log t50 and y = log t90. This is no accident: in simple linear regression the fitted values ŷ = a + bx are a linear function of x, so corr(y, ŷ) = ±corr(y, x); equivalently, R-Sq is the square of r.
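A quick numerical check of this fact, using the log-scale fit from the sketch above:

```python
fits = log_model.fittedvalues
print(np.corrcoef(log_t90, fits)[0, 1])       # 0.975, same as corr(log t50, log t90)
print(np.corrcoef(log_t90, fits)[0, 1] ** 2)  # 0.951, the R-Sq in the output above
```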

Brief Overview of Forthcoming Topics:

• We will first look at general linear regression model analysis in matrix terms.

• We will discuss some model assumptions and how they can be examined.

• Then we will present an example with five predictors and some techniques for model fitting.

• We will then discuss generalized linear models, including logistic and Poisson regression.
