BENEDICTINE UNIVERSITY



Essentials--Linear Regression and Correlation

Major purpose in business: forecasting

In order for forecasting to be possible, the future must, in some way, be like the past.

Forecasting methods seek to identify relationships from the past, and use them to

predict the future (assuming that the identified relationship will persist).

Finding relationships is a way of identifying dependencies.

Dependent variable--one to be predicted

Independent variable--one used to make the prediction

Types of regression

Based on the number of independent variables

Simple regression--one predictor or independent variable (x)

E.g. y = a + bx

Multiple regression--two or more predictor or independent variables (x1, x2, . . . ,xn)

E.g. y = a + bx1 + cx2 + dx3 + ex4

Based on the type of regression line

Linear: y = a + bx (a = y-intercept; b = slope)

or y = mx + b (b = y-intercept; m = slope)

or y = β0 + β1x (β0 = y-intercept; β1 = slope)

Slope is the coefficient (multiplier) of x, no matter what symbol is used or where

it appears in the equation.

Slope is the change in y for a one-unit change in x.

Usually regarded as the single most important result in regression, because it

describes the nature of the relationship between y and x.

In multiple regression, each independent variable has its own slope and its own measure of correlation.

Intercept is the other value, also known as the "constant".

Intercept is the value of y when x = 0.
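The slope and intercept of the least-squares line can be computed directly from data. A minimal sketch in Python/NumPy (the data values below are made up for illustration; they are not from this outline):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Least-squares slope: change in y per one-unit change in x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: value of y' when x = 0
a = y.mean() - b * x.mean()

print(f"y' = {a:.3f} + {b:.3f}x")
print("prediction at x = 6:", a + b * 6)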

Non-linear (curved): exponential, e.g. y = ab^x or y = 35(1.06)^x

logarithmic, e.g. y = a log x or 3.2 log x

power, e.g. y = ax^b or 60(x)^5

trigonometric, e.g. y = a sin x or 3.7 sin x

etc.

Over a restricted range (relevant range) a curve can be approximated with a straight line

Based on the nature of the suspected relationship between y and x

Causal regression: x may be an actual cause of y, or x may be related to something

else that is a cause of y

Time series regression--popular in business and economics

Time is the independent (x) variable, used to substitute for the actual causes of y.

In time series, it is often better to use less historical data rather than more.

The future is likely to be more like the recent past than the more distant past.

With less data x is closer to x-bar (see below).

Correlation--the degree of "relatedness" between dependent and independent variables

Types of correlation

positive: dependent variable increases as the independent variable increases

negative: dependent variable decreases as the independent variable increases

none: no apparent relationship between dependent variable and independent variable

Measures of correlation

Coefficient of non-determination, k²--always positive--range, 0 to 1

If there is perfect correlation, k² is equal to zero.

If there is no correlation, k² is equal to one.

Coefficient of determination, r², equal to 1 - k²--always positive--range, 0 to 1

If there is perfect correlation, r² is equal to one.

If there is no correlation, r² is equal to zero.

Correlation coefficient, r, the square root of r²--positive or negative, depending on

the type of correlation--range -1 to +1

Note: ρ (rho) and ρ² are the population parameters corresponding to r and r²
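A minimal sketch of these three measures in Python/NumPy (same kind of made-up illustrative data as the earlier sketch):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient, range -1 to +1
r2 = r ** 2                   # coefficient of determination, range 0 to 1
k2 = 1 - r2                   # coefficient of non-determination, range 0 to 1

print(f"r = {r:.4f}, r-squared = {r2:.4f}, k-squared = {k2:.4f}")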

Correlation and causation

The presence of correlation does not, in itself, prove that x causes y.

Three things necessary to prove causation

Statistically significant correlation between the effect, y, and the alleged cause, x.

Alleged cause, x, must be present before or at the same time as the effect, y.

Explanation must be found as to how x causes y.

Prediction errors--four standard errors (sampling standard deviations)

Standard error of the slope, σb

Measure of uncertainty regarding the slope of the regression line

Used to find confidence interval for the slope: β = b ± tt·σb (tt = table-t value)

Note: β is the population slope, estimated by b.

Standard error of the intercept, σa

Measure of uncertainty regarding the intercept of the regression line

Used to find confidence interval for the intercept: α = a ± tt·σa

Note: α is the population intercept, estimated by a.

Standard error of estimate, σd and standard error of prediction, σpred

Measures of uncertainty regarding predictions

Used in finding confidence interval for predictions: y = y' ± tt·σpred

Predictions have the least uncertainty when the value of x is near x-bar.

Standard error of the correlation coefficient, σr

Measure of uncertainty regarding the correlation coefficient
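A minimal sketch of the standard errors of the slope, intercept, estimate, and prediction (σr is omitted), computed from the usual textbook formulas in Python (NumPy/SciPy) with made-up illustrative data; the table-t comes from the t distribution with n - 2 degrees of freedom:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

se_est = np.sqrt(np.sum(resid ** 2) / (n - 2))         # standard error of estimate
se_b = se_est / np.sqrt(Sxx)                           # standard error of the slope
se_a = se_est * np.sqrt(1 / n + x.mean() ** 2 / Sxx)   # standard error of the intercept

x0 = 6.0                                               # value of x to forecast at
se_pred = se_est * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)  # standard error of prediction
# Note: se_pred is smallest when x0 is near x-bar, as stated above.

tt = stats.t.ppf(0.975, df=n - 2)                      # table-t for a 95% interval
print("slope interval:     ", b - tt * se_b, "to", b + tt * se_b)
print("prediction interval:", (a + b * x0) - tt * se_pred, "to", (a + b * x0) + tt * se_pred)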

Types of variation in regression

Initial or original variation

Sum of the squared deviations between the data y-values and the mean of the

y-values -- Σ(y - y-bar)²

Residual variation

Sum of the squared deviations between the data y-values and the predicted

y-values -- Σ(y - y')²

Removed or explained variation

Initial variation minus residual variation

k² is the ratio of residual variation to original variation, Σ(y - y')² / Σ(y - y-bar)².

r² is the ratio of removed variation to original variation.
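A minimal sketch of this variation breakdown in Python/NumPy (made-up illustrative data), confirming that k² and r² are the two ratios just described and that they sum to one:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_pred = a + b * x

original = np.sum((y - y.mean()) ** 2)   # initial variation, Σ(y - y-bar)²
residual = np.sum((y - y_pred) ** 2)     # residual variation, Σ(y - y')²
removed = original - residual            # removed (explained) variation

k2 = residual / original
r2 = removed / original
print(f"k-squared = {k2:.4f}, r-squared = {r2:.4f}, sum = {k2 + r2:.4f}")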

Hypothesis testing in regression

Ho: No correlation (relationship) between y and x.

ρ = 0 or ρ² = 0 or β = 0

Ha: Correlation between y and x (two-sided)

Positive correlation between y and x (one-sided)

Negative correlation between y and x (one-sided)

Reject Ho if tc ≥ tt (when n is small) or if zc ≥ zt (when n is large).

When n is small, df = (n-2)

Reject Ho if p ≤ α (hypothesis-test α, not intercept α)

If Ho is not rejected, there is no statistically significant correlation between x and y.

The regression equation should not be used--just use y-bar to predict y, or don't

make a prediction at all.
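A minimal sketch of this test in Python (NumPy/SciPy), using the standard t-statistic tc = r·sqrt((n - 2)/(1 - r²)) and the same made-up illustrative data as before:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n = len(x)

r = np.corrcoef(x, y)[0, 1]
tc = r * np.sqrt((n - 2) / (1 - r ** 2))   # calculated-t
tt = stats.t.ppf(0.975, df=n - 2)          # table-t, two-sided test at alpha = 0.05
p = 2 * stats.t.sf(abs(tc), df=n - 2)      # two-sided p-value

print(f"tc = {tc:.3f}, tt = {tt:.3f}, p = {p:.4f}")
print("Reject Ho" if p <= 0.05 else "Do not reject Ho")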

Exponential regression (not in the textbook)

Linear vs. exponential growth

Simple interest--example of linear growth

Interest is paid only on the initial deposit

E.g. $1,000 deposited today at 5% is worth $1,000 + $50(x) after x years.

$1,000 is the intercept (value of y today, when x = 0).

$50 is the slope (the change in y each year, which is 5% of $1,000).

The slope, $50, is constant.

Compound interest--example of exponential growth

Interest paid not only on the initial deposit, but also on previously-earned interest.

E.g. $1,000 deposited today at 5% is worth $1,000(1.05)^x after x years

$1,000 is the intercept (value of y today, when x = 0)

1.05 is the growth factor (b), which is equal to 1 + the growth rate (r)

b = 1 + r and r = b - 1

In the above example r = 0.05 (5%) and b = 1.05

The slope is not constant, but increases as x increases.
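A minimal sketch in Python contrasting the two growth patterns, using the outline's own numbers ($1,000 deposited at 5%):

# Simple (linear) growth: constant slope of $50 per year
def simple_value(x):
    return 1000 + 50 * x

# Compound (exponential) growth: constant growth factor of 1.05 per year
def compound_value(x):
    return 1000 * 1.05 ** x

for years in (0, 1, 10, 30):
    print(years, round(simple_value(years), 2), round(compound_value(years), 2))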

Exponential equation: y'exp = a(b)^x

a = y-intercept; b = compound growth factor

Growth rate r = b - 1, and compound growth factor b = 1 + r

"b" values compared:

Linear: y = a + b(x)

b < 0 negative correlation

b = 0 no correlation (y = intercept a, regardless of value of x)

b > 0 positive correlation

Exponential: y = a(b)^x

b < 1 negative correlation

b = 1 no correlation (y = intercept a, regardless of value of x)

b > 1 positive correlation

Exponential regression computations

Procedure is based on the fact that if y is an exponential function of x, then ln y

(or log y) is a linear function of x

That is, if y = a(b)^x, then ln y = a' + b'(x) or log y = a'' + b''(x).

(The three "a" and "b" values in the above equations are different.)

Procedure

Transform the y-values into the lns (or logs) of the y-values.

Math review

The logarithm of a number is the power to which a base number must

be raised in order to give the original number

Natural logarithms use the number e (2.718281828...) as the base.

ln 25 is 3.218876 because e^3.218876 is 25

ln 100 is 4.605170 because e^4.605170 is 100

Common logarithms use the number 10 as the base

log 25 is 1.397940 because 10^1.397940 is 25

log 100 is 2 because 10^2 is 100

Perform linear regression analysis on the lns (or logs) of the y-values.

Result is a linear equation for predicting the ln (or log) of y

ln y' = a' + b'x or log y' = a'' + b''x

Determine a and b values in y' = a(b)^x

a is the inverse ln of a' (or the inverse log of a'')

b is the inverse ln of b' (or the inverse log of b'')

Inverse ln of z = e^z (or inverse log of z = 10^z)
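A minimal sketch of this procedure in Python/NumPy. The data are made-up values that roughly follow y = 35(1.06)^x; the transform, linear fit, and inverse-ln steps are exactly those listed above:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([35.2, 37.0, 39.4, 41.5, 44.3, 46.8])

ln_y = np.log(y)                                   # transform the y-values into their lns
b_prime = np.sum((x - x.mean()) * (ln_y - ln_y.mean())) / np.sum((x - x.mean()) ** 2)
a_prime = ln_y.mean() - b_prime * x.mean()         # linear fit: ln y' = a' + b'x

a = np.exp(a_prime)   # inverse ln of a' -> intercept of y' = a(b)^x
b = np.exp(b_prime)   # inverse ln of b' -> compound growth factor
print(f"y' = {a:.2f} * {b:.4f}^x   (growth rate r = {b - 1:.2%})")
print("forecast at x = 10:", round(a * b ** 10, 1))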

Confidence intervals in exponential forecasting

Intervals are first computed for ln (or log) of y', then are converted to LCL and UCL

values using inverse lns (or logs)

Two-point regression--linear and exponential--quick forecasts (see examples at end of outline)

Linear

Slope (b) is the difference between y-values divided by the difference between

x-values.

Let y-axis be located at the first x-value (let first x-value correspond to zero

on the x-axis).

Intercept (a) is then the first y-value.

Equation y' = a + bx can then be written and used to make forecasts

Exponential

Growth factor (b) is the ratio of the y-values raised to the 1/n power, where n is the difference between x-values.

Let y-axis be located at the first x-value (let first x-value correspond to zero

on the x-axis).

Intercept (a) is then the first y-value.

Equation y' = a(b)^x can then be written and used to make forecasts

Confidence intervals cannot be computed for two-point forecasts.

Multiple Regression

More than one independent variable

Linear form: y' = a + bx1 + cx2 + dx3 + . . . (a coefficient for each variable)

Partial correlation coefficients and partial coefficients of determination

r1, r2, r3, . . . and r1², r2², r3², . . .
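A minimal sketch of multiple linear regression in Python/NumPy, solving y' = a + b·x1 + c·x2 by least squares (the data are made-up illustrative values):

import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([5.1, 6.0, 9.2, 9.8, 13.1, 13.7])

X = np.column_stack([np.ones_like(x1), x1, x2])   # column of 1s produces the intercept a
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b, c = coeffs                                  # each independent variable gets its own slope

print(f"y' = {a:.3f} + {b:.3f}*x1 + {c:.3f}*x2")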

Terminology--explain each of the following:

forecasting (basic concept), dependent variable, independent variable, simple regression, multiple regression, linear regression, intercept, slope, non-linear regression, exponential regression, causal regression, time-series regression, correlation, positive correlation, negative correlation, k², coefficient of non-determination, r², coefficient of determination, r, correlation coefficient, causation, standard error of the slope, standard error of the intercept, standard error of estimate, standard error of prediction, standard error of the correlation coefficient, initial or original variation, residual variation, removed or explained variation, null hypothesis in regression, alternate hypotheses in regression, simple interest, compound interest, compound growth factor, growth rate, transformation, logarithm, natural logarithm, common logarithm, inverse logarithm, two-point regression, multiple regression, partial correlation, cross-products, degrees of freedom, table-t, calculated-t, signal-to-noise ratio

Skills and Procedures

• perform linear regression using the TI-83 and the spreadsheet, including predictions, error factors, hypothesis tests, and evaluation of the degree of correlation

• perform exponential regression using the TI-83 and the spreadsheet, including predictions, error factors, hypothesis tests, and evaluation of the degree of correlation

• interpret, in nonmathematical terms, the intercept and slope in linear regression

• interpret, in nonmathematical terms, the intercept and growth factor in exponential regression

• interpret the coefficients of nondetermination and determination in linear and exponential regression

Concepts

• describe “intercept” as nonmathematically as possible

• describe “slope” as nonmathematically as possible

• describe “compound growth factor” as nonmathematically as possible

• explain the difference between simple regression and multiple regression

• explain the significance of the “sum of the squared deviations between the data points and their mean”

• explain the significance of the “sum of the squared deviations between the data points and the regression line”

• describe the relationship between the “coefficient of nondetermination” and the two items immediately above

• describe the relationship between the “coefficient of nondetermination” and the “coefficient of determination”

• identify the difference between linear growth and exponential growth in terms of what is constant in each case

• explain why the demonstrated correlation between smoking and lung cancer does not prove that smoking causes lung cancer

• describe the relationship among the three types of variation: “original,” “residual,” and “explained” (or “removed”)

• explain the relationship between regression hypothesis-test results and the ability (advisability) to make predictions

• in exponential growth, describe the relationship between the compound growth factor and the growth rate

• describe how a regression line, straight or exponential, may be fitted between two data points

If the Ho is rejected:

The p-value is 0.05 or less. The null hypothesis (no correlation) is therefore rejected. The correlation between __________ and __________ in the sample is statistically significant at the 0.05 level. In the population, the variables are probably correlated.

If the Ho is not rejected:

The p-value is greater than 0.05. The null hypothesis (no correlation) is therefore not rejected. The correlation between __________ and __________ in the sample is not statistically significant at the 0.05 level. In the population, the variables could be uncorrelated.

Two-point regression examples: A city’s population was 234,000 in 1995, and 683,000 in 2005.

What are the growth rates and forecasts for 2010?

Linear: The b-value is (683,000 - 234,000) / 10 = 44,900 people per year.

Equation is y’ = 234,000 + 44,900(x)

Forecast for 2010 is y’ = 234,000 + 44,900(15) = 907,500.

Exponential: The b-value is (683,000 / 234,000) ^ (1/10) = 1.113065 or 11.31% annual growth.

Equation is y’ = 234,000 * 1.113065 ^ x

Forecast for 2010 is y’ = 234,000 * 1.113065 ^ 15 = 1,166,872.
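A minimal sketch in Python reproducing the two-point calculations above (1995 corresponds to x = 0, 2005 to x = 10, and 2010 to x = 15):

y1, y2 = 234_000, 683_000
n = 10                                   # difference between the two x-values

# Linear: slope is the difference in y divided by the difference in x
b_lin = (y2 - y1) / n
print("linear 2010 forecast:", y1 + b_lin * 15)          # 907,500

# Exponential: growth factor is the ratio of y-values raised to the 1/n power
b_exp = (y2 / y1) ** (1 / n)
print("growth factor:", round(b_exp, 6))                 # about 1.113065
print("exponential 2010 forecast:", round(y1 * b_exp ** 15))   # about 1,166,872 (small differences are rounding)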
