BENEDICTINE UNIVERSITY
Essentials--Linear Regression and Correlation
Major purpose in business: forecasting
In order for forecasting to be possible, the future must, in some way, be like the past.
Forecasting methods seek to identify relationships from the past, and use them to
predict the future (assuming that the identified relationship will persist).
Finding relationships is a way of identifying dependencies.
Dependent variable--one to be predicted
Independent variable--one used to make the prediction
Types of regression
Based on the number of independent variables
Simple regression--one predictor or independent variable (x)
E.g. y = a + bx
Multiple regression--two or more predictor or independent variables (x1, x2, . . . ,xn)
E.g. y = a + bx1 + cx2 + dx3 + ex4
Based on the type of regression line
Linear: y = a + bx: a = y-intercept; b = slope
or y = mx + b: b = y-intercept; m = slope
or y = β0 + β1 x: β0 = y-intercept; β1 = slope
Slope is the coefficient (multiplier) of x, no matter what symbol is used or where
it appears in the equation.
Slope is the change in y for a one-unit change in x.
Usually regarded as the single most important result in regression, because it
describes the nature of the relationship between y and x.
In multiple regression, each independent variable has its own slope and its own measure of correlation.
Intercept is the other value in the equation (the term not multiplied by x), also known as the "constant".
Intercept is the value of y when x = 0.
Non-linear (curved): exponential e.g. y = ab^x or y = 35(1.06)^x
logarithmic e.g. y = a log x or y = 3.2 log x
power e.g. y = ax^b or y = 60x^5
trigonometric e.g. y = a sin x or y = 3.7 sin x
etc.
Over a restricted range (relevant range) a curve can be approximated with a straight line
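A minimal Python sketch of how the intercept (a) and slope (b) of a simple linear regression might be computed. The outline does not name a computation method (the course uses the TI-83 and a spreadsheet), so the least-squares formulas, the function name fit_line, and the sample data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point (x-bar, y-bar).
    a = y_bar - b * x_bar
    return a, b


# Hypothetical sample data (not from the outline).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"y' = {a:.2f} + {b:.2f}x")
```

For this sample the sketch gives a = 2.2 and b = 0.6: y is 2.2 when x = 0, and y rises by 0.6 for each one-unit increase in x.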
Based on the nature of the suspected relationship between y and x
Causal regression: x may be an actual cause of y, or x may be related to something
else that is a cause of y
Time series regression--popular in business and economics
Time is the independent (x) variable, used to substitute for the actual causes of y.
In time series, it is often better to use less historical data rather than more.
The future is likely to be more like the recent past than the more distant past.
With less data, the x-value being forecast is closer to x-bar, which reduces the uncertainty of the prediction (see prediction errors below).
Correlation--the degree of "relatedness" between dependent and independent variables
Types of correlation
positive: dependent variable increases as the independent variable increases
negative: dependent variable decreases as the independent variable increases
none: no apparent relationship between dependent variable and independent variable
Measures of correlation
Coefficient of non-determination, k^2--never negative--range, 0 to 1
If there is perfect correlation, k^2 is equal to zero.
If there is no correlation, k^2 is equal to one.
Coefficient of determination, r^2, equal to 1 - k^2--never negative--range, 0 to 1
If there is perfect correlation, r^2 is equal to one.
If there is no correlation, r^2 is equal to zero.
Correlation coefficient, r, the square root of r^2--positive or negative, depending on
the type of correlation--range -1 to +1
Note: ρ (rho) and ρ^2 are the population parameters corresponding to r and r^2
Correlation and causation
The presence of correlation does not, in itself, prove that x causes y.
Three things necessary to prove causation
Statistically significant correlation between the effect, y, and the alleged cause, x.
Alleged cause, x, must be present before or at the same time as the effect, y.
Explanation must be found as to how x causes y.
Prediction errors--four standard errors (sampling standard deviations)
Standard error of the slope, σ_b
Measure of uncertainty regarding the slope of the regression line
Used to find confidence interval for the slope: β = b ± t_t(σ_b), where t_t is table-t
Note: β is the population slope, estimated by b.
Standard error of the intercept, σ_a
Measure of uncertainty regarding the intercept of the regression line
Used to find confidence interval for the intercept: α = a ± t_t(σ_a)
Note: α is the population intercept, estimated by a.
Standard error of estimate, σ_d, and standard error of prediction, σ_pred
Measures of uncertainty regarding predictions
Used in finding confidence interval for predictions: y = y' ± t_t(σ_pred)
Predictions have the least uncertainty when the value of x is near x-bar.
Standard error of the correlation coefficient, σ_r
Measure of uncertainty regarding the correlation coefficient
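The four standard errors can be sketched in Python as follows. The outline gives no formulas for them, so the expressions below are the usual textbook ones for simple linear regression and are an assumption about what the course intends. The function name, dictionary keys, and sample data are hypothetical.

```python
import math


def standard_errors(xs, ys, x0):
    """Standard errors for a simple linear regression; x0 is the x-value to predict at."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    b = sxy / sxx
    a = y_bar - b * x_bar
    residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - residual / syy

    se_estimate = math.sqrt(residual / (n - 2))  # standard error of estimate
    return {
        "estimate": se_estimate,
        "slope": se_estimate / math.sqrt(sxx),
        "intercept": se_estimate * math.sqrt(1 / n + x_bar ** 2 / sxx),
        # Grows as x0 moves away from x-bar.
        "prediction": se_estimate * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx),
        "r": math.sqrt((1 - r2) / (n - 2)),
    }


# Hypothetical sample data (not from the outline); x-bar is 3.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
near = standard_errors(xs, ys, x0=3)  # prediction at x-bar
far = standard_errors(xs, ys, x0=6)   # prediction beyond the data
print(near["prediction"], far["prediction"])
```

With this sample the standard error of prediction is smaller at x = 3 (x-bar) than at x = 6, which illustrates why predictions are least uncertain near x-bar.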
Types of variation in regression
Initial or original variation
Sum of the squared deviations between the data y-values and the mean of the
y-values -- Σ(y - ybar)^2
Residual variation
Sum of the squared deviations between the data y-values and the predicted
y-values -- Σ(y - y')^2
Removed or explained variation
Initial variation minus residual variation
k^2 is the ratio of residual variation to original variation, Σ(y - y')^2 / Σ(y - ybar)^2.
r^2 is the ratio of removed variation to original variation.
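The three types of variation and the two ratios can be worked through in a short Python sketch. The function name, the sample data, and the regression line y' = 2.2 + 0.6x used to generate the predicted values are hypothetical and serve only to illustrate the definitions above.

```python
def variation_summary(ys, predicted):
    """Return initial, residual, and explained variation plus k2 and r2."""
    y_bar = sum(ys) / len(ys)
    initial = sum((y - y_bar) ** 2 for y in ys)                     # about the mean
    residual = sum((y - yp) ** 2 for y, yp in zip(ys, predicted))   # about the line
    explained = initial - residual
    return {
        "initial": initial,
        "residual": residual,
        "explained": explained,
        "k2": residual / initial,   # coefficient of non-determination
        "r2": explained / initial,  # coefficient of determination
    }


# Hypothetical data and fitted line (not from the outline).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
predicted = [2.2 + 0.6 * x for x in xs]
summary = variation_summary(ys, predicted)
print(summary)
```

Here the initial variation is 6, the residual variation is 2.4, and the explained variation is 3.6, so k2 is 0.4 and r2 is 0.6; the two ratios sum to one.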
Hypothesis testing in regression
Ho: No correlation (relationship) between y and x.
ρ = 0 or ρ^2 = 0 or β = 0
Ha: Correlation between y and x (two-sided)
Positive correlation between y and x (one-sided)
Negative correlation between y and x (one-sided)
Reject Ho if |t_c| ≥ t_t (when n is small) or if |z_c| ≥ z_t (when n is large), where c = calculated and t = table.
When n is small, df = (n-2)
Reject Ho if p ≤ α (hypothesis-test α, not intercept α)
If Ho is not rejected, there is no statistically significant correlation between x and y.
The regression equation should not be used--just use y-bar to predict y, or don't
make a prediction at all.
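A small Python sketch of the small-sample test. The outline does not give a formula for calculated-t, so the standard statistic for testing ρ = 0 is assumed here. The sample values (n = 5, r^2 = 0.6) are hypothetical, and the table-t value 3.182 is the two-sided 0.05 entry for df = 3 from a standard t table.

```python
import math


def calculated_t(r, n):
    """Calculated t for Ho: rho = 0, with df = n - 2 (standard formula, assumed)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical sample results (not from the outline).
n = 5
r = math.sqrt(0.6)
t_calc = calculated_t(r, n)

# Table-t for a two-sided test at alpha = 0.05 with df = n - 2 = 3.
t_table = 3.182
reject_h0 = abs(t_calc) >= t_table
print(f"calculated t = {t_calc:.3f}, table t = {t_table}, reject Ho: {reject_h0}")
```

In this example calculated-t is about 2.12, below the table value, so Ho is not rejected. Even though r^2 is 0.6, five data points are too few to call the correlation significant, and y-bar would be the safer prediction.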
Exponential regression (not in the textbook)
Linear vs. exponential growth
Simple interest--example of linear growth
Interest is paid only on the initial deposit
E.g. $1,000 deposited today at 5% is worth $1,000 + $50(x) after x years.
$1,000 is the intercept (value of y today, when x = 0).
$50 is the slope (change in y each year (5% of $1,000)).
The slope, $50, is constant.
Compound interest--example of exponential growth
Interest paid not only on the initial deposit, but also on previously-earned interest.
E.g. $1,000 deposited today at 5% is worth $1,000(1.05)^x after x years
$1,000 is the intercept (value of y today, when x = 0)
1.05 is the growth factor (b), which is equal to 1 + the growth rate (r)
b = 1+ r and r = b - 1
In the above example r = 0.05 (5%) and b = 1.05
The slope is not constant, but increases as x increases.
Exponential equation: y'_exp = a(b)^x
a = y-intercept; b = compound growth factor
Growth rate r = b - 1, and compound growth factor b = 1+ r
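The $1,000 at 5% example can be checked with a short Python sketch that compares linear growth (simple interest) with exponential growth (compound interest). The function names are hypothetical.

```python
def simple_interest_value(principal, rate, years):
    """Linear growth: y = a + b*x, with a = principal and b = principal * rate."""
    return principal + principal * rate * years


def compound_interest_value(principal, rate, years):
    """Exponential growth: y = a * b**x, with a = principal and b = 1 + rate."""
    return principal * (1 + rate) ** years


for x in (0, 1, 10, 20):
    linear = simple_interest_value(1000, 0.05, x)
    exponential = compound_interest_value(1000, 0.05, x)
    print(f"x = {x:2d}: simple = {linear:9.2f}, compound = {exponential:9.2f}")

linear_10 = simple_interest_value(1000, 0.05, 10)
exponential_10 = compound_interest_value(1000, 0.05, 10)
```

Both start at the same intercept of $1,000. After 10 years simple interest gives $1,500, while compound interest gives about $1,628.89, and the gap keeps widening because the compound slope grows each year.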
"b" values compared:
Linear: y = a + b(x)
b < 0 negative correlation
b = 0 no correlation (y = intercept a, regardless of value of x)
b > 0 positive correlation
Exponential: y = a(b)^x
0 < b < 1 negative correlation
b = 1 no correlation (y = intercept a, regardless of value of x)
b > 1 positive correlation
Exponential regression computations
Procedure is based on the fact that if y is an exponential function of x, then ln y
(or log y) is a linear function of x
That is, if y = a(b)^x, then ln y = a' + b'(x) or log y = a'' + b''(x).
(The three "a" and "b" values in the above equations are different.)
Procedure
Transform the y-values into the lns (or logs) of the y-values.
Math review
The logarithm of a number is the power to which a base number must
be raised in order to give the original number
Natural logarithms use the number e (2.718281828...) as the base.
ln 25 is 3.218876 because e^3.218876 is 25
ln 100 is 4.605170 because e^4.605170 is 100
Common logarithms use the number 10 as the base.
log 25 is 1.397940 because 10^1.397940 is 25
log 100 is 2 because 10^2 is 100
Perform linear regression analysis on the lns (or logs) of the y-values.
Result is a linear equation for predicting the ln (or log) of y
ln y' = a'+b'x or log y' = a''+b''x
Determine a and b values in y' = a(b)^x
a is the inverse ln of a' (or the inverse log of a'')
b is the inverse ln of b' (or the inverse log of b'')
Inverse ln of z = e^z (or inverse log of z = 10^z)
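The three-step procedure above can be sketched in Python using natural logs. The helper fit_line uses least-squares formulas, which the outline does not spell out and which are assumed here. The function names and the sample data (generated from y = 100(1.1)^x so the answer is known in advance) are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y' = a + b*x (assumed method)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def fit_exponential(xs, ys):
    """Return (a, b) for y' = a * b**x via linear regression on ln y."""
    ln_ys = [math.log(y) for y in ys]            # transform y-values to lns
    a_prime, b_prime = fit_line(xs, ln_ys)        # ln y' = a' + b'x
    return math.exp(a_prime), math.exp(b_prime)   # inverse ln of a' and b'


# Hypothetical data that grows exactly 10% per period from 100.
xs = [0, 1, 2, 3, 4]
ys = [100 * 1.1 ** x for x in xs]
a, b = fit_exponential(xs, ys)
print(f"y' = {a:.2f} * {b:.4f}^x, growth rate = {b - 1:.2%}")
```

Because the sample data is exactly exponential, the sketch recovers a = 100 and b = 1.1 (a 10% growth rate); real data would give a fitted line through the lns with some residual variation.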
Confidence intervals in exponential forecasting
Intervals are first computed for ln (or log) of y', then are converted to LCL and UCL
values using inverse lns (or logs)
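A minimal Python sketch of the conversion step, assuming natural logs. The function name and all input numbers (a predicted value of 1,000, a table-t of 2.0, and a standard error of prediction of 0.1 in ln units) are hypothetical placeholders, not results from the course.

```python
import math


def exponential_interval(ln_prediction, t_table, se_pred_ln):
    """Compute the interval in ln units, then convert to LCL and UCL with e^z."""
    margin = t_table * se_pred_ln
    lcl = math.exp(ln_prediction - margin)
    ucl = math.exp(ln_prediction + margin)
    return lcl, ucl


# Hypothetical inputs: ln of a predicted y' of 1,000, table-t 2.0, sigma_pred 0.1.
ln_prediction = math.log(1000)
lcl, ucl = exponential_interval(ln_prediction, t_table=2.0, se_pred_ln=0.1)
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}")
```

The interval is symmetric in ln units but not after conversion: here the UCL (about 1,221.40) is farther above 1,000 than the LCL (about 818.73) is below it.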
Two-point regression--linear and exponential--quick forecasts (see examples at end of outline)
Linear
Slope (b) is the difference between y-values divided by the difference between
x-values.
Let y-axis be located at the first x-value (let first x-value correspond to zero
on the x-axis).
Intercept (a) is then the first y-value.
Equation y' = a + bx can then be written and used to make forecasts
Exponential
Growth factor (b) is the ratio of the y-values (second divided by first) raised to the 1/n power, where n is the difference between x-values.
Let y-axis be located at the first x-value (let first x-value correspond to zero
on the x-axis).
Intercept (a) is then the first y-value.
Equation y' = ab^x can then be written and used to make forecasts
Confidence intervals cannot be computed for two-point forecasts.
Multiple Regression
More than one independent variable
Linear form: y' = a + bx1 + cx2 + dx3 + . . . (a coefficient for each variable)
Partial correlation coefficients and partial coefficients of determination
r_1, r_2, r_3, . . . and r_1^2, r_2^2, r_3^2, . . .
Terminology--explain each of the following:
forecasting (basic concept), dependent variable, independent variable, simple regression, multiple regression, linear regression, intercept, slope, non-linear regression, exponential regression, causal regression, time-series regression, correlation, positive correlation, negative correlation, k^2, coefficient of non-determination, r^2, coefficient of determination, r, correlation coefficient, causation, standard error of the slope, standard error of the intercept, standard error of estimate, standard error of prediction, standard error of the correlation coefficient, initial or original variation, residual variation, removed or explained variation, null hypothesis in regression, alternate hypotheses in regression, simple interest, compound interest, compound growth factor, growth rate, transformation, logarithm, natural logarithm, common logarithm, inverse logarithm, two-point regression, partial correlation, cross-products, degrees of freedom, table-t, calculated-t, signal-to-noise ratio
Skills and Procedures
• perform linear regression using the TI-83 and the spreadsheet, including predictions, error factors, hypothesis tests, and evaluation of the degree of correlation
• perform exponential regression using the TI-83 and the spreadsheet, including predictions, error factors, hypothesis tests, and evaluation of the degree of correlation
• interpret, in nonmathematical terms, the intercept and slope in linear regression
• interpret, in nonmathematical terms, the intercept and growth factor in exponential regression
• interpret the coefficients of nondetermination and determination in linear and exponential regression
Concepts
• describe “intercept” as nonmathematically as possible
• describe “slope” as nonmathematically as possible
• describe “compound growth factor” as nonmathematically as possible
• explain the difference between simple regression and multiple regression
• explain the significance of the “sum of the squared deviations between the data points and their mean”
• explain the significance of the “sum of the squared deviations between the data points and the regression line”
• describe the relationship between the “coefficient of nondetermination” and the two items immediately above
• describe the relationship between the “coefficient of nondetermination” and the “coefficient of determination”
• identify the difference between linear growth and exponential growth in terms of what is constant in each case
• explain why the demonstrated correlation between smoking and lung cancer does not prove that smoking causes lung cancer
• describe the relationship among the three types of variation: “original,” “residual,” and “explained” (or “removed”)
• explain the relationship between regression hypothesis-test results and the ability (advisability) to make predictions
• in exponential growth, describe the relationship between the compound growth factor and the growth rate
• describe how a regression line, straight or exponential, may be fitted between two data points
If the Ho is rejected:
The p-value is 0.05 or less. The null hypothesis (no correlation) is therefore rejected. The correlation between __________ and __________ in the sample is statistically significant at the 0.05 level. In the population, the variables are probably correlated.
If the Ho is not rejected:
The p-value is greater than 0.05. The null hypothesis (no correlation) is therefore not rejected. The correlation between __________ and __________ in the sample is not statistically significant at the 0.05 level. In the population, the variables could be uncorrelated.
Two-point regression examples: A city’s population was 234,000 in 1995, and 683,000 in 2005.
What are the growth rates and forecasts for 2010?
Linear: The b-value is (683,000 - 234,000) / 10 = 44,900 people per year.
Equation is y' = 234,000 + 44,900(x), where x is the number of years after 1995.
Forecast for 2010 is y' = 234,000 + 44,900(15) = 907,500.
Exponential: The b-value is (683,000 / 234,000) ^ (1/10) = 1.113065 or 11.31% annual growth.
Equation is y' = 234,000 * 1.113065 ^ x, where x is the number of years after 1995.
Forecast for 2010 is y' = 234,000 * 1.113065 ^ 15 = 1,166,872.
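The city population example can be reproduced with a short Python sketch of the two-point procedure. The function names are hypothetical; the numbers are the ones given in the example, with x counted in years after 1995.

```python
def two_point_linear(x1, y1, x2, y2):
    """Return (a, b) for y' = a + b*x, with x measured from the first point."""
    b = (y2 - y1) / (x2 - x1)   # difference in y over difference in x
    a = y1                      # first y-value is the intercept
    return a, b


def two_point_exponential(x1, y1, x2, y2):
    """Return (a, b) for y' = a * b**x, with x measured from the first point."""
    n = x2 - x1
    b = (y2 / y1) ** (1 / n)    # ratio of y-values raised to the 1/n power
    a = y1
    return a, b


a_lin, b_lin = two_point_linear(1995, 234_000, 2005, 683_000)
a_exp, b_exp = two_point_exponential(1995, 234_000, 2005, 683_000)

x = 2010 - 1995  # forecast year expressed as years after the first point
forecast_lin = a_lin + b_lin * x
forecast_exp = a_exp * b_exp ** x
print(f"linear: b = {b_lin:,.0f} per year, forecast = {forecast_lin:,.0f}")
print(f"exponential: b = {b_exp:.6f}, forecast = {forecast_exp:,.0f}")
```

The two methods agree at the two known points but diverge afterward: the linear forecast adds the same 44,900 people each year, while the exponential forecast multiplies by the same growth factor each year.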