Chapter 2. Simple Linear Regression.

Basic statistical definitions and concepts.

1 Random variables and distributional properties.

Let Y be a random variable with probability density function f(Y). The probability density function (pdf) describes the probability that the variable takes values in any small interval (Y, Y+dY), which is f(Y)dY. The mean or expectation of Y is defined as:

$$E\{Y\} = \int_{-\infty}^{\infty} Y \, f(Y) \, dY$$

The expected value of a linear function of random variables is equal to the same linear function of the expected values of the variables.

The variance of Y is defined on the basis of the definition of expected value:

$$\sigma^2\{Y\} = E\{(Y - E\{Y\})^2\}$$

This means that the variance of Y (for which there are various notations) is the expected value of the square of the difference between Y and the expected value of Y.

Covariance is defined for two random variables Y and Z as:

$$\sigma\{Y, Z\} = E\{(Y - E\{Y\})(Z - E\{Z\})\}$$

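Thus, the variance of a random variable is equal to its covariance with itself. The variance of a linear combination of random variables is the sum of all pairwise covariances, including that of each variable with itself, each multiplied by the product of the corresponding coefficients. In this case the equation is clearer:

$$\sigma^2\Big\{\sum_{i} a_i Y_i\Big\} = \sum_{i} \sum_{j} a_i a_j \, \sigma\{Y_i, Y_j\}$$

where the a_i are the coefficients and σ{Yi, Yj} is the covariance between Yi and Yj (the variance of Yi when i = j).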

Random variables are said to be independent when the probability of any pair of values (Yi,Zi) is equal to the product of the probability of a value Yi (regardless of Z) times the probability of a value Zi (regardless of Y). Independence implies that the covariance and the correlation between the variables are zero. The converse (zero correlation implies independence) does not hold in general; it does hold when Y and Z are jointly normally distributed.
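The rule for the variance of a linear combination can be checked with a few lines of arithmetic. The sketch below is only an illustration: the covariance matrix and the coefficients are made-up values, and the function name var_linear_combination is hypothetical.

```python
# Hypothetical population covariance matrix for two variables Y and Z:
# var(Y) = 4, var(Z) = 9, cov(Y, Z) = 1.5 (made-up values).
cov = [[4.0, 1.5],
       [1.5, 9.0]]
a = [2.0, -3.0]  # coefficients of the linear combination 2Y - 3Z


def var_linear_combination(coefs, cov):
    """Sum of a_i * a_j * cov(Y_i, Y_j) over all pairs, including i == j."""
    k = len(coefs)
    return sum(coefs[i] * coefs[j] * cov[i][j]
               for i in range(k) for j in range(k))


var_combo = var_linear_combination(a, cov)

# The same calculation written out term by term for two variables.
var_by_hand = (a[0] ** 2 * cov[0][0] + a[1] ** 2 * cov[1][1]
               + 2 * a[0] * a[1] * cov[0][1])

# Correlation: covariance scaled by the product of the standard deviations.
rho = cov[0][1] / (cov[0][0] ** 0.5 * cov[1][1] ** 0.5)

# With zero covariance (as for independent variables) only the
# variance terms remain.
cov_zero = [[4.0, 0.0],
            [0.0, 9.0]]
var_zero_cov = var_linear_combination(a, cov_zero)

print(var_combo, var_by_hand, rho, var_zero_cov)  # 79.0 79.0 0.25 97.0
```

Note that a positive covariance combined with coefficients of opposite sign reduces the variance of the combination relative to the zero-covariance case.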

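The correlation between two random variables is their covariance divided or scaled by the product of their standard deviations. Equivalently, the correlation is the covariance between the standardized forms of the random variables:

$$\rho\{Y, Z\} = \frac{\sigma\{Y, Z\}}{\sigma\{Y\} \, \sigma\{Z\}} = \sigma\left\{ \frac{Y - E\{Y\}}{\sigma\{Y\}}, \frac{Z - E\{Z\}}{\sigma\{Z\}} \right\}$$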

All of these definitions are for population parameters. For the estimates of these parameters on the basis of samples, we use the "hat" notation or regular letters such as s, s2, sxy, rxy, etc. Equations to estimate parameters are the usual computational formulas and are typically based on minimization of variance and bias. Keep in mind that parameters have set values and no variance, whereas estimated parameters are random variables with distributions that depend on the distribution of the original population, on the manner in which the population was sampled, and on the way in which the parameter estimation was obtained. Once again, we typically use random sampling (all elements of population have the same probability of being selected) and use linear (e.g. average as estimator of mean, b1 as estimator of β1) and quadratic (s2 as estimator of variance) functions of the observations.

2 Logic behind hypothesis testing.

Statistical hypotheses are statements about populations, typically about the values of one or more parameters or calculations based on the parameters of the population. Hypothesis testing is a means of evaluating how probable the observed data would be if the statements were correct, and of transforming the continuous probabilities into discrete decisions. As an arbitrary rule accepted by most people, statements under which results as extreme as those observed have a probability smaller than 5% are described as “rejected.” It is wise to keep in mind that there is nothing special about the 5% value other than the fact that most people agree on using it.

Hypotheses can be “rejected” or “not rejected,” but this is not to be taken as disproving or proving the statements. Even the proper use of statistical methods will result in some wrong decisions. Nothing can be done to prevent this. Often, we never find out whether the decisions were actually wrong or right.

There are two types of errors in hypothesis testing: rejection of a correct hypothesis is called error type I and its probability is α; non-rejection of an incorrect hypothesis is called error type II and it has a probability β. “Power,” the probability of rejecting a hypothesis given that it is incorrect, is 1 − β. Power depends on α, sample size, variance, and size of the effect to be detected.
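The dependence of power on α, sample size, variance, and effect size can be illustrated with the simplest possible case. The sketch below assumes a one-sided z-test for a mean with known standard deviation, which is a choice made here for convenience and not a test discussed in these notes; the function name and the numeric values are hypothetical.

```python
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 against mu = mu0 + delta.

    Assumes delta >= 0, known sigma, and a sample of n independent
    normal observations.
    """
    z = NormalDist()                      # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha)         # rejection threshold under H0
    shift = delta * n ** 0.5 / sigma      # effect size in standard-error units
    return 1 - z.cdf(z_crit - shift)      # P(reject | true mean is mu0 + delta)


# With no effect, the rejection probability is just alpha (error type I).
print(power_one_sided_z(delta=0.0, sigma=2.0, n=10))

# Power grows with sample size, effect size, and alpha,
# and shrinks when the variance increases.
for n in (5, 10, 20, 40):
    print(n, round(power_one_sided_z(delta=1.0, sigma=2.0, n=n), 3))
```

The same qualitative behavior holds for t and F tests, although the calculations then involve noncentral distributions.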

When normality, independence, and homogeneity of variance of the random variables (typically the εi’s) can be assumed, we can make statistical inferences by using known distributions. For example, the sum of the squares of independent standard normal variables has a χ2 distribution with degrees of freedom (which is also the expected value of the χ2) equal to the number of independent standard normal variables added. The variance estimated for a normal distribution based on a sample is thus, after scaling by its degrees of freedom and the true variance, a χ2 random variable; and this distribution is used to make inferences about the unknown variance. Similarly, it can be demonstrated that the calculations used to obtain t and F values should in fact lead to random variables with t and F distributions, provided that the assumptions and null hypothesis are correct. In general, the popularity of the common parametric statistical tests we use comes from the fact that, under the assumptions of normality, one can analytically derive the distributions of the random variables resulting from doing the usual calculations to obtain estimated parameters.

The logic behind any typical parametric statistical test (say an F test) is thus as follows. Suppose that you performed an analysis of variance to determine if two treatments give different results. The null hypothesis is that the two means are identical. The assumptions are normality with equal variances and independence of errors. Further, suppose that you find a very large value of F that is “highly significant” (say F=40). The brief version of the logic is just to say "F is significant, thus the means are different." The long version is:

The probability of

(The model assumptions being true, AND

The null hypothesis being correct, AND

Obtaining a large value of F)

Is extremely low THEREFORE,

Because the assumptions are met (I checked them), and I did in fact obtain a large F, I will reject the validity of the null hypothesis.

In other words, the F value, and thus the data obtained and more extreme values of the data, are very unlikely if H0 is true. Therefore, either one observed an unlikely event, or the null hypothesis is false. Note that although the decision is formally correct because the method was applied correctly, it can still be factually incorrect if in reality the hypothesis is true (error type I with probability α). One can visualize this by imagining that your professor flips a coin 10 times and obtains heads every single time, after placing an abundant wager on getting all heads. Your choices upon witnessing such coin-flipping prowess are to (a) make no comments, (b) state that your professor cheated (i.e. used a coin that is not “fair”), or (c) state that your professor is one lucky devil. Unless you obtain the coin for further inspection you will be left with those choices, and the choice will be yours.
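The coin example can be put in numbers. The short calculation below assumes the null hypothesis that the coin is fair and the flips are independent; the variable names are arbitrary.

```python
from fractions import Fraction

# Under H0 (fair coin, independent flips), each flip is heads with
# probability 1/2, so ten heads in a row has probability (1/2)^10.
p_all_heads = Fraction(1, 2) ** 10

alpha = 0.05  # conventional threshold discussed above

print(p_all_heads)                  # 1/1024
print(float(p_all_heads))           # 0.0009765625
print(float(p_all_heads) < alpha)   # True: "fair coin" would be rejected
```

An outcome this unlikely under a fair coin would lead to rejection at the 5% level, yet a fair coin still produces it about once in a thousand attempts, which is exactly the error type I described above.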

Simple Linear Regression.

1 Regression

The name comes from the first application of the method by Sir Francis Galton, who studied the height of children as a function of the height of their parents. He observed that the heights of children appear to "regress" to the average for the group. He considered this to be a regression to mediocrity.

2 Uses of regression.

Regression analysis serves three main purposes that in practice tend to overlap:

□ Description and understanding of the relationship between Y and X. For example, one may be interested in determining whether temperature affects growth rate of plants and how much growth rate changes per degree of temperature. The estimated parameters have to make sense mechanistically or biologically. We are interested in the value of parameters and in the potential effects of X on Y.

□ Control or management of a process. In the example above, the relation between growth rate and temperature may be studied in a greenhouse so that temperature can be adjusted to approach a desired growth rate.

□ Prediction of the value of an unmeasured variable on the basis of measurement of the X variables. In the example above, the effects of temperature on growth rate can be modeled through regression analysis and the model can be applied to predict how much growth should be expected in greenhouses where temperature is measured but growth is not measured. For purely predictive purposes, we are not concerned with the mechanistic interpretation of parameters. We want to predict Y with the greatest precision.

Although these uses can overlap, the success in using regression for each one can differ for a given data set. For example, a model and data set can be great for prediction, but very poor for description and understanding. This is the case in multiple linear regression when the X variables are highly correlated, a situation called "collinearity" or "multicollinearity" that we will explore in detail in later chapters.

3 Model and Assumptions.

The SLR model can be stated in regular notation as:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where the subscript i identifies each of the n observations, β0 and β1 are the parameters (intercept and slope, respectively), Xi is a constant of known value for each observation, and each εi is a random variable with mean 0, variance σ2, and uncorrelated with any other εi. Note that there are as many error random variables as observations.

If one adds the assumption that these random variables are all normally distributed, then it is possible to use the traditional methods for inference based on the normal and related distributions. However, the equations that describe the variance of the estimated parameters and their expectations are valid regardless of the shape of the distribution of errors. Likewise, the estimation of parameters by minimization of the sum of squares of the errors (SSE) is unbiased and has minimum variance among all linear unbiased estimators, regardless of the distribution of errors.

4 Estimation of parameters.

1 OLS: minimization of SSE. Normal equations.

Parameters are estimated so as to minimize the sum of squares of errors (SSE) or deviations around the linear model. This is known as Ordinary Least Squares. Minimization of SSE is achieved by application of calculus. Use of calculus is not strictly necessary, as one could get as close to the correct solution as desired simply by trial and error. However, use of calculus saves time. SSE is minimized by finding the values of the estimated parameters for which the partial derivatives of SSE with respect to each parameter are zero. This calculation leads to a set of two equations called Normal Equations. In multiple linear regression there will be as many equations as there are parameters to be estimated. The normal equations are important because they are the basis for Path Analysis, which we will study in the second half of the course.

Normal Equations:

$$\sum Y_i = n b_0 + b_1 \sum X_i$$

$$\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2$$

The estimated parameters b0 and b1 are obtained by solving the simultaneous equations for b0 and b1. As a result of the calculations, the sum of all residuals or errors is always zero; the sum of the fitted values Yhat equals the sum of observed values Y; and the line goes through the point (Xbar, Ybar).
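As a minimal sketch, the two normal equations for SLR can be solved directly as a 2 x 2 linear system. The data below are made up for illustration, and the variable names are not taken from the course files.

```python
# Hypothetical data (made-up values).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(X)

# Sums that appear in the normal equations.
sum_x = sum(X)
sum_y = sum(Y)
sum_xx = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Normal equations:
#   sum_y  = n * b0     + b1 * sum_x
#   sum_xy = b0 * sum_x + b1 * sum_xx
# Solved here with Cramer's rule.
det = n * sum_xx - sum_x * sum_x
b0 = (sum_y * sum_xx - sum_x * sum_xy) / det
b1 = (n * sum_xy - sum_x * sum_y) / det

fitted = [b0 + b1 * x for x in X]
residuals = [y - f for y, f in zip(Y, fitted)]

print(b0, b1)
print(sum(residuals))               # zero up to rounding error
print(sum(fitted), sum_y)           # equal up to rounding error
print(b0 + b1 * sum_x / n, sum_y / n)  # line passes through (Xbar, Ybar)
```

The three printed checks correspond to the three consequences of the normal equations listed above.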

2 Regression coefficients.

The slope is calculated with the following equation.

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$

This equation can also be used to show that b1 is a linear function of the observed values of Y. This fact is used to derive the variance and expected value for b1. The intercept is calculated based on the fact that the line goes through the point defined by the averages of X and Y.

$$b_0 = \bar{Y} - b_1 \bar{X}$$
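The statement that b1 is a linear function of the observed Y values can be made concrete by writing the slope as a weighted sum of the Yi, with weights k_i that depend only on the X values. The sketch below uses made-up data; the name k for the weights is an assumption introduced here.

```python
# Hypothetical data (made-up values).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(X)

xbar = sum(X) / n
ybar = sum(Y) / n
sxx = sum((x - xbar) ** 2 for x in X)

# Slope from the usual formula, intercept from the point (Xbar, Ybar).
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
b0 = ybar - b1 * xbar

# Weights that depend only on X: k_i = (X_i - Xbar) / sum (X_i - Xbar)^2.
k = [(x - xbar) / sxx for x in X]

# The slope as a linear combination of the observed Y values.
b1_linear = sum(ki * y for ki, y in zip(k, Y))

print(b1, b1_linear)                         # same value
print(sum(k))                                # 0: the weights sum to zero
print(sum(ki * x for ki, x in zip(k, X)))    # 1
print(sum(ki ** 2 for ki in k), 1.0 / sxx)   # equal: basis of the variance of b1
```

Because the Yi are uncorrelated with common variance σ2, the variance of this linear combination is σ2 times the sum of the squared weights, which is σ2 divided by the sum of squared deviations of X.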


Units: in most cases Y, X, the β’s, and the estimated parameters are quantities. This means that they are composed of numbers and units. If the units or the numbers are omitted, a great deal of ambiguity is introduced in the analysis.

Consider an example in which per capita population growth rate of aphids is regressed on population density. Per capita growth rate (Y) is the number of descendants produced per individual per year. It is measured as individuals yr-1 individual-1. Population density (X) is measured in individuals m-2. The figure without units or numbers only shows that Y declines with increasing X, but there is no indication of the range of values, and the results cannot be compared with other studies. Moreover, the regression will always yield some values for the estimated parameters, but the use of these values in population models can be correct only if one knows the units of the parameters. Units for β0 are individuals yr-1 individual-1, whereas the units for β1 are yr-1 m2 individual-1.

3 Estimated variances.

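Because b1 is a linear combination of the Yi’s, it is also a normally distributed variable when the Yi’s are normally distributed:

$$b_1 \sim N(\beta_1, \sigma^2\{b_1\})$$

$$\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$

The estimated variance is, therefore:

$$s^2\{b_1\} = \frac{MSE}{\sum (X_i - \bar{X})^2}$$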

The MSE is the SSE divided by the number of degrees of freedom. In general, for any model, the number of degrees of freedom of the error equals the number of observations minus the number of estimated parameters. In this case the estimated parameters for the model are the slope and the intercept.

$$MSE = \frac{SSE}{n - 2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}$$


5 Confidence intervals.

Confidence interval for slope:

If εi ~ N ind (0, σ2), then

$$\frac{b_1 - \beta_1}{s\{b_1\}} \sim t(n-2)$$

The interval (CI) that, on average, will contain the true value of the parameter 100(1-α)% of the times it is calculated is:

$$b_1 \pm t(1-\alpha/2;\, n-2)\, s\{b_1\}$$
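A confidence interval for the slope can be sketched numerically as follows. The data are made up, and the critical value 3.182 is the tabulated 0.975 quantile of the t distribution with 3 degrees of freedom, hard-coded here as an assumption because the Python standard library has no t distribution; with other sample sizes or confidence levels it must be replaced.

```python
# Hypothetical data (made-up values), n = 5 so the error has n - 2 = 3 df.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(X)

xbar = sum(X) / n
ybar = sum(Y) / n
sxx = sum((x - xbar) ** 2 for x in X)

b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
b0 = ybar - b1 * xbar

# SSE and MSE: the error df is n minus the two estimated parameters.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
mse = sse / (n - 2)

# Estimated standard error of the slope.
s_b1 = (mse / sxx) ** 0.5

# Assumed table value: t(0.975; 3 df) = 3.182, for a 95% interval.
t_crit = 3.182

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(b1, s_b1)
print(lower, upper)
```

If the interval excludes zero, the hypothesis that the slope is zero would be rejected at the corresponding α.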


6 Prediction of expected and individual value for a given Xh.

Typically, we are interested in making a prediction for, or estimating the expected value of the response variable Y for a given value of the predictor variable X. We calculate a confidence interval for the prediction called “Y hat, given that X=Xh”.

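A confidence interval for E{Yh} is estimated by using

$$\hat{Y}_h = b_0 + b_1 X_h = \bar{Y} + b_1 (X_h - \bar{X})$$

Because Ybar and b1 are uncorrelated (and independent when the errors are normally distributed) we can write:

$$\sigma^2\{\hat{Y}_h\} = \sigma^2\{\bar{Y}\} + (X_h - \bar{X})^2 \, \sigma^2\{b_1\} = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$$

$$s^2\{\hat{Y}_h\} = MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$$

The term 1/n factors in the variance due to unknown E{Ybar}, whereas the second term in the brackets reflects the variance due to unknown E{b1}. Inspection of this equation indicates that the variance of the predicted expected Y increases with the square of the distance from Xh to the average of X.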

The confidence interval for the prediction is calculated in the usual fashion:

$$\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\, s\{\hat{Y}_h\}$$

In certain cases one may be interested in calculating a CI for the result of individual observations, instead of the expected value of repeated observations. This is called a “prediction for a new observation”. The expectation is the same as for Yhat at X=Xh, but the variance is greater, because one must add the deviations of individual observations from the mean.

$$\sigma^2\{pred\} = \sigma^2 + \sigma^2\{\hat{Y}_h\}$$

$$s^2\{pred\} = MSE \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$$

The number 1 in the brackets reflects the uncertainty about the value of a random sample of size 1, even when it is taken from a distribution of known mean. The rest of the terms reflect the uncertainty about the value of the mean of the population, which is given by the uncertainty about the intercept and slope, as explained above.
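The two variances, for the estimated mean response and for a new observation, can be compared side by side. The sketch below uses made-up data, and the function names are hypothetical.

```python
def slr_fit(X, Y):
    """Least-squares fit; returns b0, b1, MSE, Xbar and sum of (X - Xbar)^2."""
    n = len(X)
    xbar = sum(X) / n
    ybar = sum(Y) / n
    sxx = sum((x - xbar) ** 2 for x in X)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
    return b0, b1, sse / (n - 2), xbar, sxx


def var_mean_response(xh, n, mse, xbar, sxx):
    """Estimated variance of Yhat at X = xh (estimate of E{Yh})."""
    return mse * (1.0 / n + (xh - xbar) ** 2 / sxx)


def var_new_observation(xh, n, mse, xbar, sxx):
    """Estimated variance for predicting one new observation at X = xh."""
    return mse + var_mean_response(xh, n, mse, xbar, sxx)


# Hypothetical data (made-up values).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, mse, xbar, sxx = slr_fit(X, Y)
n = len(X)

# Both variances are smallest at Xh = Xbar and grow with (Xh - Xbar)^2.
for xh in (3.0, 4.0, 5.0, 8.0):
    print(xh,
          var_mean_response(xh, n, mse, xbar, sxx),
          var_new_observation(xh, n, mse, xbar, sxx))
```

At every Xh the two variances differ by exactly MSE, the contribution of the "1" in the brackets.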

7 ANOVA table

The total variation or SS of Y around its mean (SSTO) can be partitioned into two components: the variation explained by the model or sum of squares of the regression (SSR), and the unexplained variation or sum of squares of the errors (SSE). Similarly, the total deviation or difference between each observed value of Y and the average for Y can be partitioned into a deviation between the regression and the average Y, and a deviation between the observed value and the predicted value.

$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$

$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2, \qquad SSTO = SSR + SSE$$

The coefficient of determination r2 is the proportion of all variation represented by the regression sum of squares, and it is a general indicator of the adequacy of the model. However, the complete adequacy of the model cannot be inferred just on the basis of r2, because models that are clearly non-linear can yield large coefficients of determination. In SLR, the coefficient of determination equals the square of the correlation coefficient.
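The partition of the sums of squares and the two ways of obtaining r2 in SLR can be verified on a small data set. The data below are made up for illustration.

```python
# Hypothetical data (made-up values).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(X)

xbar = sum(X) / n
ybar = sum(Y) / n
sxx = sum((x - xbar) ** 2 for x in X)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * x for x in X]

# Sums of squares for the ANOVA table.
ssto = sum((y - ybar) ** 2 for y in Y)                  # total, n - 1 df
ssr = sum((f - ybar) ** 2 for f in fitted)              # regression, 1 df
sse = sum((y - f) ** 2 for y, f in zip(Y, fitted))      # error, n - 2 df

# Coefficient of determination, and the squared correlation between X and Y.
r2 = ssr / ssto
r_xy = sxy / (sxx * ssto) ** 0.5

# F statistic for the regression: MSR / MSE.
f_stat = (ssr / 1) / (sse / (n - 2))

print(ssto, ssr + sse)   # equal up to rounding error
print(r2, r_xy ** 2)     # equal up to rounding error
print(f_stat)
```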

1 Degrees of freedom.

What are the degrees of freedom? Why are they called “degrees of freedom?” Degrees of freedom can be understood at least in two ways, with and without linear algebra. David Lane (2001) provides the following explanation in his HyperStat site:

“Estimates of parameters can be based upon different amounts of information. The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom (df). In general, the degrees of freedom of an estimate is equal to the number of independent scores that go into the estimate minus the number of parameters estimated as intermediate steps in the estimation of the parameter itself. For example, if the variance, σ2 , is to be estimated from a random sample of N independent scores, then the degrees of freedom is equal to the number of independent scores (N) minus the number of parameters estimated as intermediate steps (one, μ is estimated by M) and is therefore equal to N-1.”

For most students, degrees of freedom are like the cashiers at the grocery store: you see them several times a week and they help you fulfill some of your needs, but you do not really know them. The problem with teaching the details of degrees of freedom is that the concept is strongly dependent on linear algebra and hyperspaces, which are not in the statistical toolbox of most students. We will use a somewhat operational explanation of the concept of degrees of freedom (df).

Degrees of freedom is a number associated with a sum of squares (SS). The value of this number is equal to the total number of independent observations (more correctly, the number of independent random variables) used to calculate the sum of squares minus the total number of independent parameters estimated and used to calculate the SS. This operational definition is easiest to apply to the SS of the residuals (SSE) for any model. For example, in a completely randomized design with n observations and 4 fixed treatments the SSE has n terms in the summation, and one uses 4 estimated parameters, one for the mean of each treatment. Therefore, dfe=n-4. When calculating the SS of the model, in this case there are four random variables (the averages for the 4 treatments), and one estimation for the overall mean. Therefore, df of treatments is 3. In the case of regression, where there are no discrete treatments but continuously varying predictors, it is easier to calculate the df for the model as the difference between the df for the total SS (n-1) and the dfe. For a variety of explanations of the concept of degrees of freedom visit Dr. C. H. Yu’s site and scroll to the last part of the document.

Finally, the name “degrees of freedom” comes from the fact that df is the number of dimensions in which a vector (whose squared length is the SS under consideration) can roam “freely.” Degrees of freedom are the dimensions of the space on which the observation vector is projected. For this to make more sense, consider a sample of 10 (X,Y) pairs and the regression of Y on X. Visualize each pair or observation as one dimension. The sample forms a 10-dimensional space. The vector of ten Y values is a line in that space, as is the vector of X’s. The regression consists of projecting 10-D vector Y on 2-D space spanned by X and the unit vector. The component of vector Y that is not contained in that projection is perpendicular (orthogonal) to the model space and exists in the other 8 dimensions.

8 Complete example with Excel

The file xmpl_PfertParSim.xls contains a simulated dataset to explore the concepts of distribution of samples and estimated parameters. This example shows the application of SLR in a situation with replicate values of Y for each X. The data are simulated and the true model is known to be quadratic, although it can be made effectively linear by specifying a coefficient identical to zero for the quadratic term (b2=0). Excel is used to obtain random data sets that meet all assumptions of SLR, except for the linearity of the model when b2 is not zero. An ANOVA table is calculated including terms for the SSLOF and pure error. The worksheet can be easily modified to perform Monte Carlo analysis to simulate the distribution of estimated parameters. This example can be used together with the applets available at the Interactive Regression Simulations web site. The Excel spreadsheet can show you the details of what happens in each sampling event simulated by the web site. The web site allows you to easily obtain histograms for observed distributions of estimated parameters and other statistics based on hundreds of samples. It is recommended that you explore the example in the Excel file by recalculating the sheet a few times and examining the formulas. Then, move to the web site and set the parameters to the same values used in the spreadsheet.

In order to use the xmpl_PfertParSim.xls file for this example, make sure that cell M4 has a zero in it. The columns in the spreadsheet are as follows:

A. P added: numbers of units of phosphorus added to the crop.

B. Yield: number of units of crop yield observed in a sample.

C. Ybar: overall average for crop yield over all observations in the sample.

D. Ybari: average crop yield within each level of P applied.

E. Yhat: estimated crop yield based on the linear regression of Y on X.

F. E{Y|X}: expected value of crop yield for each level of X based on known slope and intercept.

G. Lofit: Lack of fit; difference between Ybari and Yhat.

H. Puree: pure error; difference between observed yield and Ybari.

I. Totale: total error; difference between observed yield and Yhat.

J. Truee: difference between the observed yield and E{Y|X}; true total error.

1 True model

The true model is known in this simulation. This differs from most real situations in which we do not know the real model, so the shape of the model (function used) is selected a priori and the parameters are estimated by least squares. Because we know the model, we can obtain and study the observed distributions of the estimated parameters.

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$$

The true model is a polynomial of degree smaller than or equal to 2. For this example, the coefficient for the quadratic term is set to 0, so the true model is effectively linear.

2 First point: ei is not εi

The first concept illustrated by this example is that the residual ei, calculated as the difference between the observed and the estimated value, is not the true realization of the random variable εi. The residual is calculated using the estimated expected value of Y instead of the true value, which in a real situation is unknown. Compare the Totale column with the Truee column and the sums of their values. Recalculate the sheet several times while keeping your eyes on these columns. Note that the sum of the realizations of the true error is not necessarily zero, whereas the sum of Totale is always zero, because of the way we estimate the parameters by minimization of the sums of squares of e. As an exercise, guess the result of averaging the sums of Truee for many samples (say 20) and then check what actually happens.
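The same comparison can be sketched outside the spreadsheet. The code below does not reproduce xmpl_PfertParSim.xls: the true parameters, the X values, the error standard deviation, and the random seed are all arbitrary assumptions chosen for illustration, and the names true_err and resid simply mirror the Truee and Totale columns.

```python
import random

rng = random.Random(42)  # fixed seed so that the sample is reproducible

# Assumed true model (hypothetical values): Y = 10 + 2 X + error, sd = 3.
beta0, beta1, sigma = 10.0, 2.0, 3.0
X = [float(x) for x in range(1, 11)]
n = len(X)

# Realizations of the true errors, normally unknown to the analyst.
true_err = [rng.gauss(0.0, sigma) for _ in X]
Y = [beta0 + beta1 * x + e for x, e in zip(X, true_err)]

# Least-squares estimates from this one sample.
xbar = sum(X) / n
ybar = sum(Y) / n
sxx = sum((x - xbar) ** 2 for x in X)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
b0 = ybar - b1 * xbar

# Residuals: deviations from the estimated line, not from the true line.
resid = [y - (b0 + b1 * x) for x, y in zip(X, Y)]

print("sum of true errors:", sum(true_err))   # generally not zero
print("sum of residuals:  ", sum(resid))      # zero up to rounding error
print("SS of true errors: ", sum(e ** 2 for e in true_err))
print("SS of residuals:   ", sum(e ** 2 for e in resid))
```

The sum of squares of the residuals can never exceed that of the true errors, because the true line is just one of the candidate lines over which least squares minimizes.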

3 Second point: the estimated parameters are correlated random variables

Perform a few simulations and copy the values of b0 and b1 to an area of the worksheet that is not used. For this, recalculate the sheet to get a new random sample, select the range D31:E31, Edit Copy, select a blank cell with 20 empty spaces below it and Edit Paste Special… Values. Repeat the procedure more than 20 times. (You can write a macro to do this for you). After you have more than 20 pairs b0-b1, do a scatter plot of b0 vs. b1. You should observe that they seem to be correlated. Is the correlation positive or negative? Could it have the opposite sign?

Once you understand where the replicates of the estimated parameters come from, you can use the following web site to do more numerous simulations quickly and to seriously study the correlation between parameters. As an exercise, follow the link and click on “Histograms of slope and intercept.” Study the effects of the settings, particularly sample size, on the histograms. Make sure you understand the difference between Sample size and Number of samples. Sample size is the number of points “measured” in each sample simulated. Number of samples is the number of times that the sampling is simulated. You will obtain as many values for each estimated parameter as the number of samples. After you explore the histograms, select the option to study the correlation between parameters and try to obtain correlations that are both negative and positive by changing the parameters of the true model and the variance of the errors.
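For readers who prefer a script to a spreadsheet macro, the repeated-sampling procedure can be sketched as below. This is not the code behind the spreadsheet or the web site: the function names, the true parameter values, the X values, and the seed are assumptions made for illustration, with "sample size" being the length of X and "number of samples" the argument n_samples.

```python
import random


def fit_slr(X, Y):
    """Return the least-squares estimates (b0, b1)."""
    n = len(X)
    xbar = sum(X) / n
    ybar = sum(Y) / n
    sxx = sum((x - xbar) ** 2 for x in X)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
    return ybar - b1 * xbar, b1


def simulate_estimates(X, beta0, beta1, sigma, n_samples, seed=1):
    """Simulate n_samples data sets at fixed X and fit each one."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_samples):
        Y = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in X]
        pairs.append(fit_slr(X, Y))
    return pairs


def pearson(u, v):
    """Sample correlation coefficient between two lists."""
    n = len(u)
    ubar = sum(u) / n
    vbar = sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


# Hypothetical settings: sample size 10, 500 samples.
X = [float(x) for x in range(1, 11)]
pairs = simulate_estimates(X, beta0=10.0, beta1=2.0, sigma=3.0, n_samples=500)
b0_values = [p[0] for p in pairs]
b1_values = [p[1] for p in pairs]

print("mean b0:", sum(b0_values) / len(b0_values))
print("mean b1:", sum(b1_values) / len(b1_values))
print("correlation between b0 and b1:", pearson(b0_values, b1_values))
```

Changing the true parameters, the error standard deviation, or the X values (for example, shifting all X values by a constant) and rerunning the script is a quick way to explore what controls the sign and strength of the correlation.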


