Using Clustering to Make Prediction Intervals for Neural Networks

Claus Benjaminsen

ECE 539, December 21, 2005

Abstract

This project describes a way of estimating prediction intervals using clustering. The method is developed and implemented in Matlab, and tested on a small 2D dataset. Its performance is compared with the standard method of making prediction intervals using several identical neural networks, and with a baseline that simply applies equal-sized intervals to all predictions. The results show that the clustering method can indeed be used to estimate prediction intervals, and that its performance can even exceed that of the standard method.

Table of Contents

Abstract

Table of Contents

Introduction

Motivation

Approach

Data set

How the models are set up

Implementation

Results

Comparing the 3 methods

Testing with different values of the bias and slack parameters

Discussion

Conclusion

References

Introduction

Neural networks are widely used nowadays to solve all kinds of problems related to classification and regression. A special group of problems has to do with prediction: estimating the output of an unknown function, given a set of new inputs. This could for instance be forecasting the total overnight snowfall using today's weather data, or predicting how many hats of a certain brand will be sold this Christmas, given the money spent on marketing and the average temperature in December. Both examples are regression problems, in that the output is a numerical value, and both are typical problems that could be solved by neural networks. Taking the first example, a trained neural network might predict that the total snowfall will be 2 inches. This estimate might be right or wrong, close to the actual snowfall or far away. Sometimes the neural network will be very accurate, and sometimes it won't, but for a person who wants to plan when to get up tomorrow in order to make it to work on time, the estimate alone might not be sufficient information. For him it could be more relevant to know the maximum possible snowfall, so that he can take precautions and get up early if there is a chance of more than, say, 5 inches of snow.

This kind of inference has to do with prediction intervals. Instead of predicting only a point estimate of the actual snowfall, an interval covering the possible range of snowfall can be estimated. These intervals can be given in many forms, for instance as a max and min value, possibly with a percentage of confidence, or as the mean and variance of a Gaussian distribution. In any case, the output is supplemented with a lot of extra information and can be used to determine how precise the estimated output value is (the width of the prediction interval), and what the possible max and min values of the actual output are (the limits of the prediction interval).

The standard way of estimating prediction intervals with neural networks is to train multiple identical networks and use the variance of their predicted outputs to estimate a prediction interval for each output. This method normally makes some assumptions about the noise added to the unknown function, and can run into problems in certain situations. Therefore this project investigates the possibility of implementing interval prediction with the help of clustering.

The idea is that if the noise in the unknown function is input dependent, inputs close together might be subject to the same underlying noise function, and hence their outputs might be about equally hard to predict. This can be exploited to assign prediction intervals of equal size to similar inputs, and in that way give a good estimate of the possible range of output values.

Motivation

The motivation for this project is to test the possibility of using clustering in interval prediction. In many real-world problems prediction intervals can be a big help in making prediction-based decisions, so a good method of estimating them could be used extensively. Moreover, in many situations the standard method of training several identical neural networks is infeasible: the model easily gets very big, and training several networks on a large dataset can take a long time. The proposed method only requires training one neural network and clustering the training data, which will normally be much faster than training many neural networks.

Approach

In this project the main focus has been on implementing and testing the performance of interval prediction using a neural network and clustering. To evaluate the performance, two other methods of making prediction intervals have been included. The first is the baseline, which consists of simply applying equally large prediction intervals to all predictions. The second is the standard way of implementing prediction intervals using neural networks; in this project it uses 10 identical neural networks, trained on the same data, to estimate the variance of each predicted output, and from these variances estimates the corresponding prediction intervals.

Data set

For this project I have chosen to use easily accessible data from a neural network competition[1] set up on the internet. The reason is twofold. First, it meant that I didn't have to spend a lot of time collecting data and transforming it into a format that can easily be processed. Secondly, the competition the data comes from was what inspired this project: it focuses exactly on making good prediction intervals using neural networks, so the datasets from the competition are very relevant for testing the implementations in this project.

The competition has 4 different datasets:

• Synthetic – a small synthetic dataset

• Precip – a large dataset consisting of precipitation data

• Temp – a large dataset with temperature data

• SO2 – a medium-sized dataset of various measurements for predicting SO2 levels

I have only used the Synthetic dataset to develop the methods, as it is 2D (one input and one output) and therefore easy to work with and plot. It consists of 256 points divided into two files, one containing all the x-coordinates (synthetic_train.inputs) and one with the corresponding y-coordinates (synthetic_train.targets). A plot of this dataset is shown below in figure 1.

[pic]

Figure 1. A Matlab plot of the Synthetic dataset used in this project.

How the models are set up

When training a neural network it is often desirable that the feature vectors (input part) of the training/testing data are scaled into a certain range. This both gives the same weight to different features and makes it easier for the neural network to adapt to the inputs. Also, since the output of a neural network is computed by a nonlinear function with a limited range of output values, the targets (output values of the training/test set) need to be scaled into this range in order for the network to be able to make predictions. Therefore the Synthetic dataset is randomized and then scaled before it is divided into a training set and a testing set.
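As an illustration, this preprocessing step could look like the following Matlab sketch (the variable names and the load-based file reading are assumptions for illustration, not the actual code of IntervalNN; the ranges are those used in the results section):

    % Load the competition files: one column of inputs, one of targets.
    x = load('synthetic_train.inputs');    % 256 x 1 input values
    y = load('synthetic_train.targets');   % 256 x 1 target values

    % Linearly scale inputs to [-5, 5] and targets to [0.2, 0.8].
    xs = -5  + 10  * (x - min(x)) / (max(x) - min(x));
    ys = 0.2 + 0.6 * (y - min(y)) / (max(y) - min(y));

    % Randomize the sample order and split off 100 samples for testing.
    idx    = randperm(numel(xs));
    xtest  = xs(idx(1:100));    ytest  = ys(idx(1:100));
    xtrain = xs(idx(101:end));  ytrain = ys(idx(101:end));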

Using only the training set, 10 identical neural networks are trained using back-propagation with small random perturbations of the weights, and the weights giving the lowest training error are stored. Then both the training feature vectors and the test feature vectors are fed to all 10 networks, and the mean and variance of each predicted output are calculated from the 10 individual predictions. The mean values serve as the predicted point estimates of the targets. The training error is calculated as the sum of the distances between these predicted point estimates and the training targets. This value gives the minimum total sum of prediction intervals needed to cover all the training data for the found point predictions. Since all prediction intervals in this project are symmetric, the sum of the prediction intervals is calculated as the sum of the half interval lengths.
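A minimal sketch of these ensemble statistics, assuming the 10 networks' predictions have been collected column-wise in matrices Ptrain (156 x 10) and Ptest (100 x 10) — an assumed layout, not necessarily that of IntervalNN:

    % Point estimates are the per-sample means over the 10 networks,
    % and the spread is the per-sample variance across the networks.
    yhat_train = mean(Ptrain, 2);
    var_train  = var(Ptrain, 0, 2);
    yhat_test  = mean(Ptest, 2);
    var_test   = var(Ptest, 0, 2);

    % Per-sample training error = half-interval needed to cover that target;
    % its sum is the total interval "budget" shared by all three methods.
    err_train = abs(ytrain - yhat_train);
    E_total   = sum(err_train);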

Given the training error plus an optional slack parameter (e_slag in the code), the baseline prediction intervals are formed as equal-sized intervals whose total sum equals the training error. The standard method (from now on referred to as the variance method) scales the variance of each prediction and adds an optional bias term, to make the sum of the variances (prediction intervals) equal to the training error.
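The report does not spell out the exact formulas, so the following Matlab lines are one plausible reading, with e_slag taken as a fractional slack on the budget and bias as a fraction in [0, 1] (both zero by default); they continue from the variables of the previous sketch:

    n      = numel(err_train);
    budget = (1 + e_slag) * E_total;    % total sum of half-intervals available

    % Baseline: the same half-interval for every sample.
    r_base = (budget / n) * ones(n, 1);

    % Variance method: scale the per-sample variances, plus an equal-share
    % bias term, so the half-intervals still sum to the budget.
    r_var = (1 - bias) * budget / sum(var_train) * var_train ...
          + bias * budget / n;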

Finally the clustering method is applied, in which a k-means clustering algorithm is run on the input feature vectors to find a number of cluster centers. The membership of each training feature vector to these cluster centers is then found, and the mean of the training errors of all the feature vectors in a given cluster is assigned to that cluster center. These mean errors are then scaled, and possibly added to a bias term, so that their total sum equals the total training error; thereby they can serve as prediction intervals for the training data.
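A sketch of this step, using a small hand-rolled Lloyd's loop in place of cluster.m and kmeantest.m so it runs without any toolbox; it continues from the variables above, the fixed iteration count is an assumption, and k = 52 is the value picked in the results below:

    k = 52;                                    % number of cluster centers
    c = xtrain(randperm(numel(xtrain), k));    % 1 x k initial centers on data points
    for it = 1:100                             % Lloyd's algorithm for 1-D inputs
        [~, m] = min(abs(xtrain - c), [], 2);  % nearest-center membership
        for j = 1:k
            if any(m == j), c(j) = mean(xtrain(m == j)); end
        end
    end

    % Mean training error per cluster ...
    e_clus = zeros(k, 1);
    for j = 1:k
        if any(m == j), e_clus(j) = mean(err_train(m == j)); end
    end

    % ... scaled (plus the optional equal-share bias) to sum to the budget.
    scale_c = (1 - bias) * budget / sum(e_clus(m));
    r_clus  = scale_c * e_clus(m) + bias * budget / n;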

When calculating the prediction intervals for the test data, the baseline method simply assigns the size of the training prediction intervals to all test prediction intervals. The variance method scales the variances of the test predictions by the same factor as the training set variances. Finally, the clustering method first determines the cluster membership of the given test feature vector, and then assigns to it the mean error associated with the found cluster center. This mean error is then scaled by the same factor as during training, and the resulting value defines the test prediction interval.
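Continuing the sketch, the test intervals reuse the centers, mean errors and scale factors found on the training data:

    % Baseline: every test sample gets the equal training half-interval.
    r_base_test = (budget / n) * ones(numel(xtest), 1);

    % Variance method: test variances scaled by the training scale factor.
    r_var_test = (1 - bias) * budget / sum(var_train) * var_test ...
               + bias * budget / n;

    % Clustering method: nearest training cluster, then that cluster's
    % scaled mean error becomes the test half-interval.
    [~, mtest]  = min(abs(xtest - c), [], 2);
    r_clus_test = scale_c * e_clus(mtest) + bias * budget / n;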

The performance measures used in this project are the number of targets that fall within the prediction intervals, and a cost function defined as the mean squared distance from the edge of the prediction interval to the corresponding target, taken over all targets that fall outside their prediction intervals.
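In Matlab these two measures could be computed as below (shown for the clustering method's test intervals from the sketch above; taking the cost as zero when no target falls outside is my own convention):

    % Coverage: number of targets falling inside their prediction intervals.
    inside = abs(ytest - yhat_test) <= r_clus_test;
    c_test = sum(inside);

    % Cost: mean squared distance from the interval edge to each target
    % that falls outside its interval.
    miss = ~inside;
    if any(miss)
        cost_test = mean((abs(ytest(miss) - yhat_test(miss)) - r_clus_test(miss)).^2);
    else
        cost_test = 0;
    end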

Implementation

The implementation of the methods described above is done in Matlab, and makes extensive use of programs developed by professor Yu Hen Hu[2], or modifications of these programs. The main file is IntervalNN, which sets up all the parameters for training the 10 neural networks, scales the data and divides it into a training set and a test set. It then calls NNfunction, a modified version of bp.m, which takes care of training a neural network on the given training data and returns the found weights and the training data predictions. Given the weights, the test set predictions are calculated, and this procedure is repeated 10 times. Then the means and variances of the training and test predictions are determined, along with the training and test errors. Finally all the values are saved in IntNNvar.mat and the program c_test is called.

In c_test the prediction intervals for the baseline method and the variance method are calculated, and the performance of these intervals on both the training and test data is found. Then the cluster centers of the training feature vectors are found using the function cluster.m, a modified version of clusterdemo.m by professor Yu Hen Hu, and the membership of each training feature vector is determined using kmeantest.m, also developed by professor Yu Hen Hu. The prediction intervals are then formed and the training performance evaluated, after which the same is done for the test set. The whole scheme is run multiple times for different numbers of cluster centers, so the performance can be compared.
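This selection over cluster counts could be sketched as follows; cluster_intervals and interval_cost are hypothetical stand-ins for the work done inside c_test.m, not actual functions from the project:

    best_cost = inf;
    for k = 4:4:100
        % Hypothetical helper: build training/test intervals for k centers.
        [r_tr, r_te] = cluster_intervals(xtrain, err_train, xtest, k);
        % Hypothetical helper: evaluate the test cost function.
        cost_k = interval_cost(ytest, yhat_test, r_te);
        if cost_k < best_cost
            best_cost = cost_k;
            best_k    = k;    % number of centers with minimum test cost
        end
    end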

Finally, disp_performance.m can be run. It uses the prediction intervals given by the number of cluster centers that gives the best performance in terms of the minimum cost function on the test data. These prediction intervals are plotted along with the training data and training predictions, and the same is done for the other two methods. Similar plots are made for the test data.

Another small program, test_bias.m, can also be run. It tests the performance of each of the three methods when the bias term in the prediction intervals and the training error slack parameter are changed. The bias term controls how much of the total available prediction interval is divided equally among all individual prediction intervals (the rest is divided by scaling the prediction intervals). The slack parameter makes the total sum of available prediction interval larger or smaller than the total training error. For each new combination of the slack parameter and the bias value, c_test2.m is called and the performance results are recorded. c_test2.m is a very slightly modified version of c_test.m, made to be called from test_bias.m.
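An illustrative sweep in the spirit of test_bias.m, shown for the variance method on the test set; the grids of slack and bias values are assumptions, not the actual ones used, and the variables continue from the earlier sketches:

    slacks = -0.4:0.2:0.4;    % assumed grid of slack values
    biases = 0:0.25:1;        % assumed grid of bias values
    cov = zeros(numel(slacks), numel(biases));
    for i = 1:numel(slacks)
        for j = 1:numel(biases)
            b = (1 + slacks(i)) * E_total;    % adjusted interval budget
            r = (1 - biases(j)) * b / sum(var_train) * var_test ...
              + biases(j) * b / n;
            % Record test coverage for this (slack, bias) combination.
            cov(i, j) = sum(abs(ytest - yhat_test) <= r);
        end
    end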

Results

I have run the programs described above multiple times with different parameters, and here I present some of the results. The training and test feature vectors were scaled to the range -5 to 5, and the targets to the range 0.2 to 0.8. Out of the 256 samples in the Synthetic dataset, 100 were randomly picked out for testing. The 10 neural networks were set up with 3 layers (1 input, 1 hidden and 1 output layer) and 5 neurons in the hidden layer. Sigmoidal activation functions were used for all hidden neurons, and a hyperbolic tangent function for the output neuron. The step size alpha was set to 0.1, the momentum term to 0.8 and the epoch size to 20. The stopping criterion was no improvement in training error for 200 consecutive iterations.

Comparing the 3 methods

After obtaining the estimated point predictions I evaluated the prediction interval performance using n = 4, 8, …, 100 cluster centers. The best result in terms of the minimum cost function on the test data was picked out, and the training and test data, along with the point predictions and prediction intervals, are shown in the figures below for each of the 3 methods.

[pic][pic]

Figures 2 and 3. Left: training set and prediction intervals using clustering with 52 cluster centers. Right: test set and corresponding prediction intervals using the same cluster centers.

[pic][pic]

Figures 4 and 5. Left: training set and prediction intervals using the variance method. Right: test set and prediction intervals using the variance method.

[pic][pic]

Figures 6 and 7. Left: training set and prediction intervals using the baseline method. Right: test set and prediction intervals using the baseline method.

From the plots it can be seen how the clustering and variance methods try to fit the sizes of the prediction intervals to the easy and difficult parts of the data. For the clustering method, the training areas (in the feature vector space) that give large prediction intervals also give large prediction intervals for the test data. The same is to some extent true for the variance method.

Looking at the corresponding performance data for this experiment gives an easier way of comparing the 3 methods. The performance measures are shown in the table below (c_ = number of targets inside the prediction intervals, cost_ = value of the cost function; _clus, _var and _basis denote the clustering, variance and baseline methods, and _test denotes test-set results).

| n_centers | c_clus | c_test_clus | cost_clus | cost_test_clus |
| 4 | 98 | 61 | 0.001842 | 0.001679 |
| 12 | 88 | 58 | 0.001702 | 0.001421 |
| 16 | 83 | 54 | 0.001430 | 0.001480 |
| 20 | 86 | 57 | 0.001372 | 0.001415 |
| 24 | 84 | 56 | 0.001387 | 0.001506 |
| 28 | 89 | 58 | 0.001092 | 0.001620 |
| 32 | 89 | 56 | 0.001112 | 0.001322 |
| 36 | 82 | 56 | 0.001105 | 0.001595 |
| 40 | 86 | 55 | 0.001067 | 0.001314 |
| 44 | 89 | 57 | 0.000986 | 0.001682 |
| 48 | 90 | 52 | 0.001038 | 0.001345 |
| 52 | 89 | 55 | 0.000960 | 0.001277 |
| 60 | 89 | 50 | 0.000898 | 0.001845 |
| 64 | 76 | 54 | 0.000908 | 0.001887 |
| 68 | 94 | 53 | 0.000851 | 0.002278 |
| 72 | 97 | 55 | 0.000807 | 0.002155 |
| 76 | 96 | 56 | 0.000866 | 0.002276 |
| 80 | 66 | 52 | 0.000503 | 0.002347 |
| 84 | 101 | 55 | 0.000525 | 0.002298 |
| 88 | 105 | 54 | 0.000524 | 0.002301 |
| 92 | 110 | 55 | 0.000513 | 0.002278 |
| 96 | 110 | 52 | 0.000472 | 0.002425 |
| 100 | 111 | … | … | … |

Reference values for the same experiment:

| c_var | 79 | c_basis | 108 |
| c_test_var | 56 | c_test_basis | 62 |
| cost_var | 0.002516 | cost_basis | 0.003075 |
| cost_test_var | 0.004856 | cost_test_basis | 0.002414 |
| e_slag (slack) | 0 | bias | 0 |
| Number of training targets | 156 | Number of test targets | 100 |

Testing with different values of the bias and slack parameters

[The table of results from test_bias.m was largely lost in extraction; one surviving row of min_c_test values reads: 40, 53, 62, 69, 76, 83, 87, 89, 91, 93.]

min_c_test = number of test targets inside the prediction intervals (for the clustering method the result with the minimum test cost function, cost_test_clus, is given).

From the results in the table above it can be seen that the bigger the slack parameter, the more test targets fall inside the prediction intervals, as expected. The results also show that the bias term has a big influence on this measure of performance: with the right bias term, both the clustering method and the variance method outperform the baseline method for all values of the slack parameter. The best results are obtained with the clustering method, which, for nearly all values of the slack parameter, performs better than both of the other methods when the right bias value is used.

Discussion

The above results show that clustering can be used together with neural networks to estimate prediction intervals in regression problems. At least it worked on this problem, but is this problem a special case, or will it work in general? The method doesn't really require anything in order to be applied to a given regression problem, so the question is whether it will in general give results as good as those above. The idea behind the method is that similar feature vectors are often equally difficult to predict, which can be the result of a lot of noise or high nonlinearity in the given area of the feature space. Since it is this property the method exploits, its performance might decrease a lot on problems where this doesn't apply. If the noise is uniform, which is often assumed in regression problems, the clustering method should still perform reasonably well if a low number of cluster centers is used. If the number of cluster centers is too high, the method will start to react to individual output errors, and a single sample with, by chance, a relatively high noise value might make the prediction intervals of similar inputs too large.

The biggest problem with this method is that it requires new input feature vectors to "look" like the feature vectors it was trained on. If one of the input features is time, the new feature vectors might never "look" like the old ones, and the model will not perform very well. In such cases it might be possible to omit the features that behave like this and perform the clustering only on the remaining features. This has not been tested, though, and it is hard to predict what implications it would have on the performance of the method.

Another type of problem for this method relates to the size of the input space and the number of training samples. If both become large, a very high number of cluster centers will be needed to cover the feature space at a certain resolution. This can make the method impractical: finding many cluster centers can require a lot of computation, the centers need to be stored along with their associated errors, and determining the membership of each new input can become a heavy process. Unfortunately I haven't had time to test the clustering method on larger datasets, so I can't conclude how big this problem will be.

Conclusion

As shown in the results section, the clustering method developed in this project can be used for making good prediction intervals. With the right values of the 3 important parameters (number of cluster centers, bias and slack), it can give prediction intervals with very good performance, even compared to standard methods. I didn't find a good way of determining these parameters in this project; they were simply found by repeated trials, and I have no good solution to this problem. It is not very different from trying to determine the right structure for a neural network, and perhaps a cross-validation scheme could be used to determine the right parameters.

The dataset used in this project is very well suited for this method of using clustering to make prediction intervals. Since I didn't have time to evaluate the performance on other datasets, it is still a question how well the method works in general. Also, as brought up in the discussion section, there might be problems related to the size of the feature space, the number of training vectors, and test vectors that don't "look" like any of the training vectors. These issues need to be considered when deciding whether the method is applicable to a given problem.

References

Haykin, S., "Neural Networks: A Comprehensive Foundation," 2nd edition, Prentice Hall, Upper Saddle River, New Jersey, 1999.

Papadopoulos, G., Edwards, P. J., Murray, A. F., "Confidence Estimation Methods for Neural Networks: A Practical Comparison," IEEE Transactions on Neural Networks, vol. 12, no. 6, Nov. 2001, pp. 1278-1287.

Carney, J. G., Cunningham, P., Bhagwan, U., "Confidence and Prediction Intervals for Neural Network Ensembles," Proceedings of IJCNN '99, International Joint Conference on Neural Networks, vol. 2, July 1999, pp. 1215-1218.

Predictive Uncertainty in Environmental Modeling Competition.



-----------------------

[1] Predictive Uncertainty in Environmental Modeling Competition



[2] These programs can be found on the class webpage:

[3] This is controlled by the slack parameter c_slag, which is zero in the above experiment.
