9 - Advanced Methods: Neural Networks



Rob Hyndman (with Deppa edits)
April 12, 2021

Table of Contents
9.1 Introduction
9.2 Neural Network Autoregression (NNAR)
Example 9.1 - Annual Canadian Lynx Trappings (1821-1934)
Example 9.2 - U.S. Monthly Retail Clothing Sales
9.3 Prediction Intervals for NNAR Models

9.1 Introduction

Artificial neural networks are forecasting methods based on simple mathematical models of the brain. They allow complex nonlinear relationships between the response variable and its predictors.

A neural network can be thought of as a network of "neurons" which are organised in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing "hidden neurons".

The simplest networks contain no hidden layers and are equivalent to linear regressions. The figure below shows the neural network version of a linear regression with four predictors. The coefficients attached to these predictors are called "weights". The forecasts are obtained by a linear combination of the inputs. The weights are selected in the neural network framework using a "learning algorithm" that minimizes a "cost function" such as the MSE. Of course, in this simple example, we could instead use linear regression, which is a much more efficient method of training the model.

Once we add an intermediate layer with hidden neurons, the neural network becomes nonlinear. A simple example is shown in the figure below.

This is known as a multilayer feed-forward network, where each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are combined using a weighted linear combination, and the result is then modified by a nonlinear function before being output.
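To make the "no hidden layer" case concrete, the sketch below computes the output of such a network in base R: an intercept plus a weighted linear combination of four inputs, exactly as in linear regression. All numbers are made-up values for illustration only.

``` r
# A network with no hidden layer: the forecast is just an intercept plus a
# weighted linear combination of the four inputs -- i.e., a linear regression.
# All numbers below are made-up values for illustration only.
x <- c(1.2, 0.5, -0.3, 2.0)   # four predictor (input) values
w <- c(0.4, -0.1, 0.7, 0.2)   # "weights" (regression coefficients)
b <- 1.5                      # intercept
yhat <- b + sum(w * x)        # the network's output
yhat                          # 2.12
```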
For example, the inputs into hidden neuron j in the figure above are combined linearly to give

$$z_j = b_j + \sum_{k=1}^{4} w_{k,j} x_k, \qquad j = 1, 2, 3.$$

When using a feed-forward neural network to forecast a time series, the inputs can be previously observed values of the time series $y_t$. This is known as neural network autoregression. For example, we might use the previous four values to forecast the next value of the time series, i.e. the inputs would be $y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4}$. The hidden nodes would then be given by

$$z_j = b_j + \sum_{k=1}^{4} w_{k,j} y_{t-k}.$$

In the hidden layer, an activation function is applied to the hidden nodes. The activation function is a nonlinear function, such as the sigmoid/logistic function

$$\phi(z) = \frac{1}{1 + e^{-z}},$$

that is applied to give the input to the next layer of the neural network. Application of this function tends to reduce the effect of extreme input values, thus making the entire network more robust to the effect of outliers.

The output layer of the neural network is another linear combination of the hidden nodes after the activation function $\phi(z)$ has been applied,

$$y_t = a + \sum_{j=1}^{3} h_j \phi(z_j)$$

or, in terms of the inputs $y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4}$,

$$y_t = a + \sum_{j=1}^{3} h_j \, \phi\!\left(b_j + \sum_{k=1}^{4} w_{k,j} y_{t-k}\right).$$

The parameters or weights $a, h_1, h_2, h_3, b_1, b_2, b_3, w_{1,1}, w_{1,2}, \ldots, w_{4,3}$ are "learned" from the data. The weights to be estimated are generally restricted to prevent them from becoming too large. The tuning parameter that restricts the weights is known as the "decay parameter", and is often set equal to 0.1.

The weights are "learned" by first choosing random initial values, which are then updated using the observed data or time series. Because the weights are chosen randomly at the start, there is an element of randomness in the predictions produced by a neural network. Therefore, the network is usually trained several times using different random starting points, and the results are then averaged.
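The equations above can be traced through numerically. The base R sketch below performs the forward pass of a small network with four lagged inputs and three hidden nodes; all weights are made-up values chosen for illustration, not learned from data.

``` r
# Forward pass of a feed-forward network with four lagged inputs and three
# hidden nodes, following the equations above. All weights are made-up
# illustrative values; a fitted network would learn them from the data.
phi <- function(z) 1 / (1 + exp(-z))      # sigmoid/logistic activation

y_lags <- c(5.1, 4.8, 4.2, 3.9)           # y_{t-1}, ..., y_{t-4}
W <- matrix(c( 0.20, -0.10,  0.30,  0.05, # column 1: w_{k,1}
              -0.30,  0.20,  0.10, -0.05, # column 2: w_{k,2}
               0.10,  0.10, -0.20,  0.30),# column 3: w_{k,3}
            nrow = 4)
b <- c(0.1, -0.2, 0.3)                    # hidden-node biases b_j
h <- c(0.5, 0.4, -0.3)                    # output-layer weights h_j
a <- 2.0                                  # output intercept

z    <- b + as.vector(t(W) %*% y_lags)    # z_j = b_j + sum_k w_{k,j} y_{t-k}
yhat <- a + sum(h * phi(z))               # y_t = a + sum_j h_j phi(z_j)
```

Note that `phi()` squashes each $z_j$ into (0, 1), which is how extreme inputs get their influence damped before reaching the output layer.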
The number of hidden layers, and the number of nodes in each hidden layer, must be specified in advance. Cross-validation methods can be used to help choose "optimal" values for the model complexity.

9.2 Neural Network Autoregression (NNAR)

With time series data, lagged values of the time series can be used as inputs to a neural network, just as we used lagged values in linear autoregression models, AR(p). We call this a neural network autoregression or NNAR model.

We will only consider feed-forward neural networks with one hidden layer, and we use the notation NNAR(p, k) to indicate there are p lagged inputs $(y_{t-1}, \ldots, y_{t-p})$ and k nodes in the one hidden layer. For example, the NNAR(4,3) model shown above is a neural network with the last four observations $(y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4})$ as inputs for forecasting the output $y_t$, and with three neurons in the hidden layer. An NNAR(p,0) model is equivalent to an ARIMA(p,0,0) model, but without the restrictions on the parameters that ensure stationarity, i.e.

$$y_t = a + \sum_{j=1}^{p} w_j y_{t-j} + \varepsilon_t.$$

With seasonal data, it is useful to also add the last observed values from the same season as inputs. For example, if $y_t$ is the value of the time series in July, we would include $y_{t-12}$, the value of the time series the previous July, as an input. We could also include the July value from two years prior, $y_{t-2 \times 12}$. In general, we could use up to P of the previous seasonal values as inputs, i.e. $y_{t-m}, y_{t-2m}, \ldots, y_{t-Pm}$. The notation for a neural network model with seasonal inputs is NNAR(p, P, k)$_m$. This model has inputs $(y_{t-1}, \ldots, y_{t-p}, y_{t-m}, \ldots, y_{t-Pm})$ and k neurons in the hidden layer. An NNAR(p, P, 0)$_m$ model is equivalent to an ARIMA(p,0,0)(P,0,0)$_m$ model but without parameter constraints to ensure stationarity.

The R function nnetar in the fpp2 library will fit an NNAR(p, P, k)$_m$ model to a time series and can be used to forecast future values of the time series. Generally we need to specify the values of p, P, and k.
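To make the NNAR(p, P, k)$_m$ input structure concrete, the base R sketch below assembles the p lagged inputs and P seasonal-lag inputs for a monthly series. Here `nnar_inputs` is a hypothetical helper written for illustration only, not a function from the fpp2 or forecast packages.

``` r
# Assemble the inputs for forecasting y_t with a seasonal NNAR(p, P, k)_m:
# the last p values plus the last P same-season values.
# nnar_inputs() is a hypothetical helper for illustration only.
nnar_inputs <- function(y, t, p, P, m) {
  c(y[t - (1:p)],       # y_{t-1}, ..., y_{t-p}
    y[t - m * (1:P)])   # y_{t-m}, ..., y_{t-Pm}
}

y <- 1:48                                 # stand-in for 4 years of monthly data
nnar_inputs(y, t = 48, p = 3, P = 2, m = 12)
# the inputs are y_{47}, y_{46}, y_{45}, y_{36}, y_{24}
```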
If we do not specify a choice for the number of hidden nodes k, the nnetar function will use $k = (p + P + 1)/2$, rounded to the nearest integer. When it comes to forecasting, the network is applied iteratively. For forecasting one step ahead, we simply use the historical inputs. For forecasting two steps ahead, however, we first have to perform a one-step-ahead forecast to obtain $\hat{y}_{T+1}$, and then treat this forecast as a historical value to obtain the next forecast $\hat{y}_{T+2}$. This process is repeated iteratively in order to obtain an h-step-ahead forecast $\hat{y}_{T+h}$.

Example 9.1 - Annual Canadian Lynx Trappings (1821-1934)

These data are contained in the time series lynx in the fpp2 library and represent the annual number of lynx trapped in the MacKenzie River district of northwest Canada, 1821-1934.

``` r
require(fpp2)
lynx
## Time Series:
## Start = 1821 
## End = 1934 
## Frequency = 1 
##   [1]  269  321  585  871 1475 2821 3928 5943 4950 2577  523   98  184  279
##  [15]  409 2285 2685 3409 1824  409  151   45   68  213  546 1033 2129 2536
##  [29]  957  361  377  225  360  731 1638 2725 2871 2119  684  299  236  245
##  [43]  552 1623 3311 6721 4254  687  255  473  358  784 1594 1676 2251 1426
##  [57]  756  299  201  229  469  736 2042 2811 4431 2511  389   73   39   49
##  [71]   59  188  377 1292 4031 3495  587  105  153  387  758 1307 3465 6991
##  [85] 6313 3794 1836  345  382  808 1388 2713 3800 3091 2985 3790  674   81
##  [99]   80  108  229  399 1132 2432 3574 2935 1537  529  485  662 1000 1590
## [113] 2657 3396
autoplot(lynx) + xlab("Year") + ylab("# of Lynx Trapped") +
  ggtitle("Annual Canadian Lynx Trappings (1821-1934)")
ggtsdisplay(lynx)
```

These data appear to be seasonal; however, the peaks and troughs do not occur with a constant frequency m. Furthermore, these are annual counts, so the behavior we are seeing would be characterized as cyclic, NOT seasonal. Previous methods used for forecasting these data have provided disappointing results.
We will now consider using a neural network autoregression to make forecasts for these data.

``` r
nn1 = nnetar(lynx, p=4, P=0, size=2)
nn1
## Series: lynx 
## Model:  NNAR(4,2) 
## Call:   nnetar(y = lynx, p = 4, P = 0, size = 2)
## 
## Average of 20 networks, each of which is
## a 4-2-1 network with 13 weights
## options were - linear output units 
## 
## sigma^2 estimated as 383318
checkresiduals(nn1)
fc1 = forecast(nn1, h=10)
autoplot(fc1) + xlab("Year") + ylab("# of Lynx Trapped") +
  ggtitle("Ann. Lynx Trappings with 10-year Forecast")
fc1.PI = forecast(nn1, h=10, PI=T)
autoplot(fc1.PI) + xlab("Year") + ylab("# of Lynx Trapped") +
  ggtitle("Ann. Lynx Trappings with 10-year Forecast")
fc1.PI
##      Point Forecast      Lo 80    Hi 80      Lo 95    Hi 95
## 1935      2907.2278 2125.07675 3728.513  1695.2361 4145.367
## 1936      1909.2219  450.37959 2930.004  -187.5973 3542.427
## 1937       730.5034 -233.93089 1991.378  -632.0561 2593.131
## 1938       365.4910 -514.30428 1435.584  -916.4843 1949.392
## 1939       343.2421 -547.16490 1661.087  -998.2225 2633.209
## 1940       392.4115 -385.76835 2782.020  -926.1815 4366.650
## 1941       576.9847 -201.64167 4190.033  -750.3288 5533.513
## 1942      1028.5505   47.15389 5088.298  -618.5742 5833.899
## 1943      1926.8945  355.78051 5274.270  -429.3078 6004.633
## 1944      2846.0631  149.25182 5200.074  -431.9167 5891.553
autoplot(lynx) + autolayer(fitted(nn1))
```

This fit seems good, but is it overfitting?
We can assess this by performing cross-validation using a training/test set approach.

``` r
length(lynx)
## [1] 114
lynx.test = tail(lynx, 14)
lynx.train = head(lynx, 100)
nn1 = nnetar(lynx.train, p=4, P=0, size=2)
fc1 = forecast(nn1, h=14)
accuracy(fc1, lynx.test)
##                        ME     RMSE      MAE       MPE     MAPE      MASE
## Training set  -0.06060326 651.9310 444.7607 -52.00776 71.45400 0.5215682
## Test set     -44.52337270 777.9544 627.6824 -17.39808 67.34963 0.7360794
##                    ACF1 Theil's U
## Training set 0.03168095        NA
## Test set     0.77868943  2.469678
autoplot(fc1) + autolayer(lynx.test, series="Actual")
autoplot(lynx.test) + autolayer(fc1, series="Forecast")
```

The forecast does not look too bad for these p and k choices. We should note that there was a fair amount of trial-and-error in arriving at these choices.

Example 9.2 - U.S. Monthly Retail Clothing Sales

This time series contains monthly retail sales in millions of dollars from 1992 to 2018. We will examine the performance of neural networks for this strongly seasonal time series.

``` r
Cloth = read.csv("(millions%20of%20dollars%20-%201992%20to%20present).csv", header=T)
head(Cloth)
##       DATE Clothing
## 1 1/1/1992     6938
## 2 2/1/1992     7524
## 3 3/1/1992     8475
## 4 4/1/1992     9401
## 5 5/1/1992     9558
## 6 6/1/1992     9182
ClothSales = ts(Cloth$Clothing, start=1992, frequency=12)
ClothSales = window(ClothSales, start=2010)
autoplot(ClothSales) + xlab("Year") + ylab("Millions $") +
  ggtitle("Monthly Retail Clothing Sales (January 2010 - January 2018)")
nn1 = nnetar(ClothSales, p=3, P=1, size=3)
nn1
## Series: ClothSales 
## Model:  NNAR(3,1,3)[12] 
## Call:   nnetar(y = ClothSales, p = 3, P = 1, size = 3)
## 
## Average of 20 networks, each of which is
## a 4-3-1 network with 19 weights
## options were - linear output units 
## 
## sigma^2 estimated as 160768
checkresiduals(nn1)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
fc = forecast(nn1, h=24, PI=T)
autoplot(fc) + xlab("Year") + ylab("Millions $") +
  ggtitle("Clothing Sales with Forecast from NNAR(3,1,3)")
```

Again we will consider
a training/test set approach for cross-validating this model.

``` r
length(ClothSales)
## [1] 97
Cloth.test = tail(ClothSales, 12)
Cloth.train = head(ClothSales, length(ClothSales) - 12)
nn1 = nnetar(Cloth.train, p=4, P=2, size=3, repeats=30)
fc = forecast(nn1, h=12)
autoplot(fc) + autolayer(Cloth.test, series="Actual")
accuracy(fc, Cloth.test)
##                     ME     RMSE      MAE         MPE     MAPE      MASE
## Training set  2.387537 292.3890 235.8457 -0.01828514 1.199766 0.3345914
## Test set     70.795970 515.7302 373.8511  0.36457895 1.642776 0.5303780
##                      ACF1 Theil's U
## Training set -0.004939969        NA
## Test set     -0.532074350   0.11077
Cloth.hw = hw(Cloth.train, h=12)
accuracy(Cloth.hw, Cloth.test)
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set  -72.59406 486.1052 382.5669 -0.4345672 1.993428 0.5427429
## Test set     -257.59198 626.3150 507.7408 -1.5735583 2.430074 0.7203257
##                    ACF1 Theil's U
## Training set -0.1410691        NA
## Test set      0.1022929 0.1073411
Cloth.ets = ets(Cloth.train)
fc.ets = forecast(Cloth.ets, h=12)
accuracy(fc.ets, Cloth.test)
##                     ME     RMSE      MAE         MPE     MAPE      MASE
## Training set -30.99871 361.8720 297.2967 -0.21897668 1.509087 0.4217712
## Test set      40.83965 456.6476 361.3441  0.03397949 1.688414 0.5126344
##                     ACF1  Theil's U
## Training set -0.08154105         NA
## Test set      0.11982844 0.08476344
Cloth.ARIMA = auto.arima(Cloth.train, approximation=F, stepwise=F)
fc.ARIMA = forecast(Cloth.ARIMA, h=12)
accuracy(fc.ARIMA, Cloth.test)
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set  -9.430964 465.1664 335.4993 -0.1801416 1.669091 0.4759687
## Test set    -352.250883 623.3325 499.7388 -1.7601474 2.391499 0.7089734
##                     ACF1 Theil's U
## Training set -0.02641904        NA
## Test set     -0.03493485 0.1122988
```

Here we can see that the NNAR(4,2,3) model outperforms both the Holt-Winters' and ARIMA forecasts on the test set, though the ETS forecasts have slightly smaller test-set RMSE and MAE (the NNAR model does best on MAPE). As with the first example, a bit of trial-and-error led to these choices for p, P, and k.
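The test-set error measures reported by accuracy() can be reproduced by hand, which makes clear exactly what is being compared across the models above. The actual and forecast values below are toy numbers used only for illustration.

``` r
# Hand computation of the error measures that accuracy() reports.
# Toy actual and forecast values for illustration only.
actual <- c(100, 120,  90, 110)
fcst   <- c( 98, 125,  85, 112)
err    <- actual - fcst
RMSE   <- sqrt(mean(err^2))              # root mean squared error
MAE    <- mean(abs(err))                 # mean absolute error
MAPE   <- mean(abs(err / actual)) * 100  # mean absolute percentage error
round(c(RMSE = RMSE, MAE = MAE, MAPE = MAPE), 4)
```

Note that MAPE is scale-free, which is why it can rank models differently than RMSE or MAE, as happens in the comparison above.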
Undoubtedly there are several other reasonable choices for these parameters, some of which might outperform the ones used here.

9.3 Prediction Intervals for NNAR Models

Unlike most of the methods considered in this book, neural networks are not based on a well-defined stochastic model, and so it is not straightforward to derive prediction intervals for the resultant forecasts. However, we can still compute prediction intervals using simulation, where future sample paths are generated using bootstrapped residuals.

The neural network fit to the clothing sales time series can be written as

$$y_t = f(\mathbf{y}_{t-1}) + \varepsilon_t$$

where $\mathbf{y}_{t-1} = (y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4}, y_{t-12}, y_{t-24})'$ is a vector containing lagged values of the series used as inputs to the neural network, and f is a neural network with a single hidden layer containing three nodes. The error series $\varepsilon_t$ is assumed to have mean 0 and constant variance $\sigma^2$, and may additionally be assumed to be normally distributed.

We can simulate future sample paths of this model iteratively, by randomly generating a value for $\varepsilon_t$, either from a normal distribution or by resampling from the historical values of the forecast errors/residuals $e_t$. So if $\varepsilon^*_{T+1}$ is a value drawn randomly from the distribution of errors observed up to time T, then we can generate a future value of the time series at time T+1 by using our neural network model with this random error added:

$$y^*_{T+1} = f(\mathbf{y}_T) + \varepsilon^*_{T+1}.$$

Setting $\mathbf{y}^*_{T+1} = (y^*_{T+1}, y_T, y_{T-1}, y_{T-2}, y_{T+2-12}, y_{T+2-24})'$, we can repeat the process to obtain

$$y^*_{T+2} = f(\mathbf{y}^*_{T+1}) + \varepsilon^*_{T+2}.$$

In this way, we can iteratively simulate a future sample path for the time series.
By doing this a large number of times we build up knowledge of the distribution of all future values of the time series based upon our neural network.

Below is the code to simulate the next two years of monthly retail clothing sales using the neural network developed for these data above.

``` r
nn1 = nnetar(ClothSales, p=4, P=2, size=3)
sim <- ts(matrix(0, nrow=24L, ncol=10L),
          start=end(ClothSales)[1L], frequency=12)
for(i in seq(10)) {
  sim[, i] = simulate(nn1, nsim=24L)
}
autoplot(ClothSales) + autolayer(sim) + ylab("Millions $") + xlab("Year") +
  ggtitle("US Monthly Retail Clothing Sales with Simulations")
```

If we repeated this, say, 1,000 times rather than 10, we could obtain approximate 80% prediction intervals at each future time point by taking the 100th smallest and 100th largest values at that time point to form the interval. This is what the forecast function does when no closed form for the standard error exists, which is the case with neural network time series models. In order to obtain the intervals you need to specify PI=T inside the forecast function.

``` r
fc = forecast(nn1, PI=T, h=24)
autoplot(fc) + xlab("Year") + ylab("Millions $") +
  ggtitle("NNAR(4,2,3) Forecasts for US Monthly Clothing Sales")
fc
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Feb 2018       18212.21 17756.22 18594.28 17552.35 18813.99
## Mar 2018       21195.07 20743.64 21620.13 20489.09 21785.76
## Apr 2018       21129.60 20684.76 21609.43 20455.80 21835.61
## May 2018       21729.56 21296.52 22166.66 21063.57 22423.53
## Jun 2018       20599.20 20151.06 20979.32 19882.57 21211.21
## Jul 2018       20894.61 20493.33 21330.28 20258.48 21501.33
## Aug 2018       22912.79 22514.75 23352.22 22284.91 23550.92
## Sep 2018       20066.96 19624.29 20473.24 19430.67 20718.65
## Oct 2018       20589.31 20155.21 20998.70 19951.98 21224.12
## Nov 2018       24819.00 24405.45 25241.20 24157.10 25423.81
## Dec 2018       34689.23 34215.06 35094.38 34011.75 35384.47
## Jan 2019       16279.17 15859.89 16684.65 15645.63 16903.26
## Feb 2019       18245.81 17836.40 18643.77 17620.74 18871.49
## Mar 2019       21312.21 20831.71 21774.95 20570.38 22088.85
## Apr 2019       21424.81 20814.72 22075.59 20474.70 22390.13
## May 2019       21671.40 21127.21 22226.06 20797.97 22532.53
## Jun 2019       20751.82 20207.35 21271.94 19866.38 21579.52
## Jul 2019       21121.38 20549.76 21649.24 20296.22 21946.62
## Aug 2019       23124.44 22602.22 23684.74 22281.31 23972.73
## Sep 2019       20179.21 19651.15 20749.15 19394.58 21005.15
## Oct 2019       20622.57 20108.48 21104.08 19873.39 21361.69
## Nov 2019       25113.07 24601.81 25688.37 24329.36 25910.92
## Dec 2019       34756.92 34264.02 35155.79 34009.95 35468.04
## Jan 2020       16411.64 15924.31 16935.81 15661.01 17205.30
```
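The percentile calculation described above can be sketched directly in base R. The simulated values below are toy normal draws standing in for 1,000 simulated sample paths evaluated at a single future time point.

``` r
# Approximate an 80% prediction interval at one horizon from simulated
# sample paths: take the 100th smallest and 100th largest of 1,000 values
# (i.e., the 10th and 90th percentiles).
set.seed(1)
sims <- rnorm(1000, mean = 20000, sd = 500)  # toy stand-in for simulate() output
lo80 <- sort(sims)[100]                      # 100th smallest value
hi80 <- sort(sims)[901]                      # 100th largest value
c(lo80 = lo80, hi80 = hi80)
```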
