California State University, Northridge



College of Engineering and Computer Science

Computer Science Department

Computer Science 106

Computing in Engineering and Science

Spring 2006    Class number: 11672    Instructor: Larry Caretto

Programming Exercise Eight (and last)

Objective

This assignment provides an example in the use of one-dimensional arrays and introduces the concept of regression analysis, which is used to estimate a relationship between two variables.

Mathematical Background

If several measurements are made on pairs of experimental data {(xi,yi), i = 1,...,N}, we can use a technique, known as regression analysis, to determine an approximate equation of a straight line that gives a best fit to the data. The equation of this best-fit line is written as follows.

ŷ = a + b x

In this equation, we use the symbol ŷ instead of y to indicate that the predicted value found from the equation, ŷ = a + b x, is an approximate result. For a given data point, (xi,yi), the value of yi represents the actual data, and we would obtain the predicted value of y at the point x = xi from the equation ŷi = a + b xi. The difference between the measured and predicted values is |yi - ŷi|.

[Figure: scatter plot of data points with the fitted regression line]

In the chart, the data points are indicated by the small ellipses. The coordinates of a typical data point are shown by the dotted lines indicating the coordinates xi and yi. The solid line is the fitted regression line, ŷ = a + b x. The point where the dotted line at x = xi crosses the regression line has the coordinates (xi,ŷi). In this particular example the value of ŷi is less than the value of yi. There is a large scatter of data points about the regression line in this example.

The example plot above might represent calibration data for an instrument. The x values would denote the instrument reading and the y values would indicate the true value of the quantity being measured. Once the calibration tests were completed, it would be useful to have a simple equation relating the instrument reading (x) to the actual quantity being measured (y).

In addition to finding the values of a and b that give the best-fit line, we would also like to have some measure of how well the line fits the data. Two different goodness-of-fit measures, the standard error and the coefficient of determination, are presented below in the equations section.

Equations used

The equations used to calculate a and b can be found by an analysis that minimizes the sum of the squared differences between the actual data points, yi, and the fitted points, ŷi = a + b xi. The results of this analysis are shown below. The equations to compute the intercept, a, and the slope, b, in terms of the entire set of data, {xi,yi}, use the following definitions of mean values:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$$

With these definitions, the slope, b, and the intercept, a, are found as follows.

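$$b = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \qquad a = \bar{y} - b\,\bar{x}$$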
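The slope and intercept calculation can be sketched as follows. This is a minimal illustration, not the downloaded course program: it assumes the usual least-squares result (slope equal to the sum of products of deviations from the means divided by the sum of squared x deviations, intercept equal to the mean of y minus the slope times the mean of x). Python and the function name `fit_line` are choices made here for illustration only.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y_hat = a + b*x.

    fit_line is a hypothetical helper name, not part of the course program.
    """
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of products and squares of deviations from the means
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b = s_xy / s_xx          # slope
    a = y_mean - b * x_mean  # intercept
    return a, b


# The five-point test case from Task One of this exercise
x = [510, 533, 603, 670, 750]
y = [1.3, 0.1, 1.5, 1.8, 3.9]
a, b = fit_line(x, y)
print(f"a = {a:.5f}, b = {b:.7f}")
```

The printed values can be compared with the a and b listed in the Task One test table.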

A statistical estimate of the variability can be found from the difference between the actual data points yi and the estimated value ŷi = a + b xi. This measure, which is called the standard error and has the symbol sy|x, is defined as follows:

$$s_{y|x} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - 2}}$$
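A minimal sketch of the standard error calculation follows. It assumes the common definition in which the sum of squared residuals is divided by N - 2 (two degrees of freedom are used up by a and b) before taking the square root. The function name `standard_error` is hypothetical, and the coefficients a and b are passed in rather than computed here.

```python
import math


def standard_error(x, y, a, b):
    """Standard error of the fit y_hat = a + b*x, assuming an N - 2 divisor."""
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three (x, y) pairs of equal length")
    # Sum of squared differences between measured and predicted y values
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))


# Small hand-checkable case: three points lie on y = x, one is off by 2,
# so the sum of squared residuals is 4 and the result is sqrt(4 / 2).
s = standard_error([1, 2, 3, 4], [1, 2, 3, 6], a=0.0, b=1.0)
print(s)
```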

Another measure, called the R2 value or the coefficient of determination, is considered to be a measure of the amount of variation in the data that is explained by the regression equation. An R2 value of zero means that the regression cannot explain any of the variation in y; an R2 value of one means that all the variation in y can be explained by the regression equation. The value of R2 is computed from the following equation:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
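The R2 calculation can be sketched in the same way. This example assumes the form R2 = 1 - (sum of squared residuals) / (sum of squared deviations of y from its mean), which is one standard way to write it; the handout's own equation may be written in an equivalent form. The function name `r_squared` is hypothetical.

```python
def r_squared(x, y, a, b):
    """Fraction of the variation in y explained by y_hat = a + b*x."""
    n = len(y)
    if n < 2 or n != len(x):
        raise ValueError("need at least two (x, y) pairs of equal length")
    y_mean = sum(y) / n
    # Variation left unexplained by the regression line
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation of y about its mean
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("all y values are identical; R2 is undefined")
    return 1.0 - ss_res / ss_tot


# Task One test data, using the a and b values listed in its results table
x = [510, 533, 603, 670, 750]
y = [1.3, 0.1, 1.5, 1.8, 3.9]
r2 = r_squared(x, y, a=-5.77566, b=0.0122238)
print(f"R2 = {r2:.4f}")
```

The result can be checked against the R2 value in the Task One test table.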

Task One

You can use a previously written program for this task. Download the program file from the exercise page on the course web site. Review that program and see how the various functions are used to enter array data and do calculations with array data in loops. Note that the program determines the number of data points (N in the equations above) by reading the data. The user is not required to count the data and input a value for N. The program has summary output to the screen and detailed output of a, b, sy|x, R2, and a table of xi, yi, and ŷi.
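The idea of letting the program count the data while reading it can be sketched as below. This is not the downloaded program: the file layout (one whitespace-separated "x y" pair per line) and the function name `read_pairs` are assumptions made for illustration, so check the actual input statements before preparing a data file. An in-memory stream stands in for a file so that the sketch is self-contained.

```python
import io


def read_pairs(stream):
    """Read 'x y' pairs, one per line, until the input is exhausted.

    The one-pair-per-line layout is an assumption, not the course format.
    """
    x, y = [], []
    for line in stream:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected two values per line, got: {line!r}")
        x.append(float(fields[0]))
        y.append(float(fields[1]))
    return x, y


# Stand-in for an input file; N is found from the data, not typed by the user
sample = io.StringIO("510 1.3\n533 0.1\n603 1.5\n")
x, y = read_pairs(sample)
n = len(x)
print(n, x, y)
```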

Prepare a data file for the test case below. Review the input statements to see how you should prepare this file. Run the program with your test data file to make sure you are using the program correctly by matching the results below.

Test Data and Results for Linear Regression

| xi | 510 | 533 | 603 | 670 | 750 |
| yi | 1.3 | 0.1 | 1.5 | 1.8 | 3.9 |

Results: a = -5.77566; b = 0.0122238; R2 = 0.768457

Copy the output file from the test data set in the table above to your submission file. Do not copy the code or the full output file from the downloaded data set to the submission file.

Task Two

Download the data file for this exercise from the course web site. This data file has several pairs of (xi, yi) data points. In this task you will obtain some overall statistics (1) for the entire data file, (2) for the (xi, yi) data points in which xi ≥ 1000, and (3) for the (xi, yi) data points in which xi < 1000.

You can use some of the code from task one for this task. You do not have to keep the same function structure used for task one. However, you should be able to use the function that reads data from an input data file with no changes.
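One way to organize the subset statistics is sketched below, under several assumptions: the helper names `split_by_x` and `summarize` are hypothetical, the data values are made up for illustration (they are not from the course data file), and the standard deviation is the sample form with an N - 1 divisor, which may or may not be the form intended for this exercise.

```python
import statistics


def split_by_x(x, y, threshold=1000):
    """Split (x, y) pairs into those with x >= threshold and x < threshold."""
    pairs = list(zip(x, y))
    high = [(xi, yi) for xi, yi in pairs if xi >= threshold]
    low = [(xi, yi) for xi, yi in pairs if xi < threshold]
    return high, low


def summarize(values):
    """Count, mean, sample standard deviation, minimum and maximum."""
    if not values:
        raise ValueError("cannot summarize an empty data set")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        # Sample standard deviation (N - 1 divisor) needs two or more points
        "stdev": statistics.stdev(values) if len(values) > 1 else float("nan"),
        "min": min(values),
        "max": max(values),
    }


# Made-up illustration data, not the course data file
x = [400, 1200, 950, 1500]
y = [1.0, 3.0, 2.0, 5.0]
high, low = split_by_x(x, y)
high_x_stats = summarize([xi for xi, _ in high])
low_y_stats = summarize([yi for _, yi in low])
print(high_x_stats)
print(low_y_stats)
```

Keeping each y value paired with its x value during the split ensures that the y statistics for a subset use only the points whose x values satisfy the condition.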

The program you write for this task should compute and print out the results listed below for the data in the data file that you download:

• The count, mean value and standard deviation of all xi data.

• The count, mean value and standard deviation of all yi data.

• The maximum and minimum values of xi and yi for the full set of data.

• The count, mean value and standard deviation of the subset of xi data for which xi ≥ 1000.

• The count, mean value and standard deviation of the subset of yi data for which the corresponding value of xi ≥ 1000.

• The maximum and minimum values of xi and yi for the subset of data in which the value of xi ≥ 1000.

• The count, mean value and standard deviation of the set of xi data for which xi < 1000.

• The count, mean value and standard deviation of the set of yi data for which the corresponding value of xi < 1000.

• The maximum and minimum values of xi and yi for the set of data in which the corresponding values of xi ................