Technology - Winona State University



Ch 2 – Statistical Background for Forecasting – Computing in RInstalling RR is distributed by the “Comprehensive R Archive Network (CRAN)” – , the current version is R 2.15.2 and it is available for both Windows and MAC OS. Bringing Data from JMP to RI will distribute datasets via the course1 website in JMP format. It is easy to bring data from JMP into R by first saving a JMP dataset in a Text Export Format and then reading it into R. Before saving a dataset from JMP in text format, it best to rename the columns in the JMP dataset using single word variable names, without using any special characters such as dollar signs and parentheses. I demonstrate this process below for using the beverage sales JMP datafile U.S. Beverage Sales.JMP.Notice all of the variable/column names are simple one word descriptors, thus we are ready to export it to text format for reading in R. We first save this dataset in text format as shown below using the Save As… option from the File pull-down menu.Here I am saving the text format file “US Beverage Sales.txt” in a folder called Dataset for R in my folder for this course on my laptop. Next I will start R and read this text file into R using the command read.table()as shown below. > BevSales = read.table(file.choose(),header=T,sep=",")The file.choose() option tells R I want to use my standard browser for finding the file to read into R. The header=T option names that the first line of the text file contains the variable names and the sep=”,” option tells R that the columns in the text file are separated by a comma. This text format file is the same as “.csv” extension that is commonly used in Excel.If the command is successful you should receive a new prompt in R “>”. Datasets in R are saved as what are called dataframes. You can see the first few rows of the dataframe by using the command head(). > head(BevSales) Time Month Year Date Sales1 1 1 1992 01/01/1992 35192 2 2 1992 02/01/1992 38033 3 3 1992 03/01/1992 43324 4 4 1992 04/01/1992 42515 5 5 1992 05/01/1992 46616 6 6 1992 06/01/1992 4811To begin working with a data frame in R, it is best to attach() it, which will then allow you refer to the columns of dataframe by name. When you are done working with a dataframe in R you should always detach() it. This is important because if you were to open another time series dataframe with Time as a variable name, R will be confused as to which Time variable you intend to use. > attach(BevSales)> names(BevSales)[1] "Time" "Month" "Year" "Date" "Sales"The command names() used above will display the column/variable names of a dataframe. The Sales variable in the BevSales dataframe is the time series {yt} and it can be plotted versus an Index (i.e. time) by using the command plot().> plot(Sales,type="l") type=”l” means use a line to connect data points.> plot(Sales,type="b") type=”b” means plot both the data points and line.> plot(Sales,type="p") type=”p” means plot the data points only.If we want to use the variable Time on the x-axis we could use plot() as shown below. The title() can be used to a main title on the resulting plot.> plot(Time,Sales,type="b")> title(main="Time Series Plot of U.S. Beverages Sales")To see a help page for any command in R you can use the command ? followed immediately by the function name you want to see the help page on. Help pages will be open in a browser and are in HTML format.> ?plotA portion of the plot() help documentation is shown below.Some self-explanatory basic summary statistics commands for a numeric variable are demonstrated below.> summary(Sales) 5-number summary with mean Min. 1st Qu. Median Mean 3rd Qu. Max. 3519 4660 5268 5239 5750 7270 > mean(Sales) mean of time series values[1] 5238.661> var(Sales) variance of time series values[1] 615603.6> sd(Sales) standard deviation of time series values[1] 784.6041If there are month and year variables in a time series dataframe we can construct comparative boxplots by month and year as shown below. > boxplot(split(Sales,Month),xlab="Month",ylab="US Beverage Sales")> title(main="Comparative Boxplots of Monthly Beverage Sales")> boxplot(split(Sales,Year),xlab="Year",ylab="US Beverage Sales")> title(main="Comparative Boxplots of Yearly Beverage Sales")By using the up-arrow key you can cycle through previous commands that you have executed. This can be really helpful if you wish to execute a command again without typing it from scratch, or if you want to make small changes to an earlier command. In the latter case, you can then use the left/right arrow keys to move the cursor to a point where you wish to make changes. I used this approach in the commands above to change Month to Year in appropriate places in order to get the comparative boxplots by Year after I had entered the commands to create the monthly comparisons. 145786357997100To obtain basic univariate plots of the data, i.e. histogram, boxplot, symmetry plot, and a normal quantile plot I wrote a function called Statplot(). > Statplot(Sales)Smoothing Time Series in RThere are several functions in base R and some from time series related packages in R that can be used to a smooth to a time series plot. In the examples below we will examine several different methods, some which are unique to R.On pages 22-25 of the text several simple methods of smoothing a time series are discussed, namely moving averages and running median smoothers. We will first examine how to use these types of smoothers in R.Moving Averages (centered)A centered moving average at time t of a time series yt has the formMt=12s+1i=-SSyt-iThe command filter() in R can be used to compute a centered moving average of this form to smooth a time series. We can then add the resulting smooth to a time series plot.For example to take a centered moving average of length N = 5 (and hence S = 2), i.e. Mt= yt-2+yt-1+yt+yt+1+yt+25=15yt-2+15yt-1+15yt+15yt+1+15yt+2we use the following command sequence in R.> plot(Sales,type="p",xlab="Time",ylab="Beverage Sales", main="US Beverages Sales with Centered Moving Average (S=2)")> ma5 = filter(Sales,filter=rep(1/5,5)) compute Mt with 1/5 weights> lines(ma5,lty=1,col="blue") add solid (lty=1) blue (col=”blue”) line to an existing plotAdding two more centered moving averages with larger “windows” or “spans” we can see can see the estimates lose the seasonality and move closer toward estimating the long term trend.> ma20 = filter(Sales,filter=rep(1/20,20))> lines(ma20,lty=2,col="red")> ma50 = filter(Sales,filter=rep(1/50,50))> lines(ma50,lty=3,col="black")The Hanning filter discussed on page 23 of the text uses the following weighted moving average to create the smooth,Mt=14yt-1+12yt+14yt+1which we can easily implement in R using the filter() command. We will actually write our own function to do this called hann().> hann = function(x) {filter(x,filter=c(1/4,1/2,1/4))}> plot(Sales,xlab="Time",ylab="Beverage Sales",main="Hanning Filter of Beverage Sales")> lines(hann(Sales),lty=1,col="blue",lwd=2)Running median smoothers are more robust to outliers in a time series, and as a result are generally preferable to moving average smoothers. The command smooth() in R will perform a variety of median-based smoothers in R.> ?smoothSome examples of using running median-based smoothers are shown below.> plot(Sales,xlab="Time",ylab="Beverage Sales",main="US Beverage Sales with Running Median Smoothers")> sm3 = smooth(Sales,kind="3")> lines(sm3,lty=2,col="blue")> plot(Sales,xlab="Time",ylab="Beverage Sales",main="US Beverage Sales with Running Median Smoothers")> sm3RSS = smooth(Sales,kind="3RSS")> lines(sm3RSS,lty=2,col="black",lwd=2) double the thickness (lwd = 2)We can perform kernel or local regression smoothing in R as we did in JMP by using the command lowess() in R, which stands for local regression.> plot(Sales,xlab="Time",ylab="Beverage Sales",main="US Beverage Sales with Local Regression Smoother (f = 2/3)")> lines(lowess(Sales,f=2/3),lty=1,col="blue",lwd=1.5)Choosing much smaller values for the fraction of observations (f) used will allow the smoother to pick up more of the seasonal/cyclical behavior in the time series.> plot(Sales,xlab="Time",ylab="Beverage Sales",main="US Beverage Sales with Local Regression Smoother (f = .05)")> lines(lowess(Sales,f=.05),lty=2,col="black",lwd=2)> plot(Sales,xlab="Time",ylab="Beverage Sales",main="US Beverage Sales with Local Regression Smoother (f = .025)")> lines(lowess(Sales,f=.025),lty=2,col="black",lwd=2)Autocovariance and Autocorrelation in RThe function acf(x,lag.max) in R will form the necessary lagged time series and calculate the autocorrelation for lags up to lag.max. R will automatically choose the number lags to compute, but it does not use K = T/4 as JMP does. > detach(BevSale) assuming this data frame is still attached.> attach(ChemVisc) attach the chemical viscosity time seriesWe will calculate the autocorrelation function (ACF) for the viscosity series and assign it to an object called visc.acf. We can then plot the ACF function and print the autocorrelations as follows:> visc.acf = acf(Viscosity)> plot(visc.acf)By default R uses SE(rk)≈1/T thus the confidence bands are ±2T≈.20 for this time series as T = 100. To match those used by JMP we can specify a difference confidence interval type in the call to plot function.> plot(visc.acf,ci.type="ma")To see the actual estimated autocorrelations we can use the function print().> print(visc.acf)Autocorrelations of series ‘Viscosity’, by lag 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1.000 0.784 0.628 0.492 0.363 0.305 0.209 0.164 0.145 0.104 0.067 0.004 -0.077 -0.052 0.021 15 16 17 18 19 20 0.073 0.071 0.001 -0.057 -0.123 -0.181Data Transformations in RAfter saving the Liquor Sales.JMP file in text export format for R we can read this series into R as outlined on pages 1 & 2 of this handout.> LiquorSales = read.table(file.choose(),header=T,sep=",")> names(LiquorSales)[1] "Time" "Month" "Year" "Sales"> head(LiquorSales) Time Month Year Sales1 1 1 1980 4802 2 2 1980 4673 3 3 1980 5144 4 4 1980 5055 5 5 1980 5346 6 6 1980 546> attach(LiquorSales)> plot(Time,Sales,xlab="Time",ylab="Liquor Sales",main="Liquor Sales by Month")> lines(hann(Sales),lty=2,col="blue",lwd=2)> lines(smooth(Sales),lty=3,col="red",lwd=2)Here we can see the effect of outliers on the running mean based smoother vs. the running median based smoother. Clearly December deserves special consideration in the modeling process for this time series. The constant variation of the time series is evident, thus we might consider a log transformation of the monthly liquor sales to stabilize the variance.> logSales = log(Sales) log() is the natural logarithm by default!Plot the time series in log scale and add some running smoothers to the plot.> plot(Time,logSales,xlab="Time",ylab="ln(Liquor Sales)", main="ln(Liquor Sales) by Month")> lines(smooth(logSales),lty=3,col="red",lwd=2)> lines(hann(logSales),lty=2,col="blue",lwd=2)The legend is added to the plot using the legend() command. > legend(0,7.7,legend=c("Hanning Smooth","Running Median"),col=c("blue","red"),lty=2:3)Coordinates of upper left hand corner of the legendText for the legend box, in this case two smooths were used: Hanning Smooth and Running Median.Colors used to plot the smooths associated with the text in the legend argument.Line type (lty) of the smooths associated with the text in the legend argument.Seasonal Decomposition of a Time Series in RAs discussed in the Chapter 2 JMP handout it is sometimes interesting to break a time series up into long term trend + seasonal trend + error components. In JMP that process involved successively fitting models to extract these components and saving the fitted/predicted values and residuals from the fits to the data table to build the decomposition “manually”. In R there are two functions decompose & stl that complete this process automatically for a properly formatted time series. The function ts()takes information about the starting time, ending time, and periodicity/frequency. You must convert a time series into a time series object using the ts() command before running decompose or stl on a time series. > detach(LiquorSales)> attach(BevSales)> names(BevSales)[1] "Time" "Month" "Year" "Date" "Sales"Display the first six observations in the data frame BevSales> head(BevSales) Time Month Year Date Sales1 1 1 1992 01/01/1992 35192 2 2 1992 02/01/1992 38033 3 3 1992 03/01/1992 43324 4 4 1992 04/01/1992 42515 5 5 1992 05/01/1992 46616 6 6 1992 06/01/1992 4811Display the last six observations in the data frame BevSales> tail(BevSales) Time Month Year Date Sales175 175 7 2006 07/01/2006 6505176 176 8 2006 08/01/2006 7039177 177 9 2006 09/01/2006 6440178 178 10 2006 10/01/2006 6446179 179 11 2006 11/01/2006 6717180 180 12 2006 12/01/2006 6320> Sales.ts = ts(Sales,start=1992,frequency=12)> Sales.ts Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec1992 3519 3803 4332 4251 4661 4811 4448 4451 4343 4067 4001 39341993 3652 3768 4082 4101 4628 4898 4476 4728 4458 4004 4095 40561994 3641 3966 4417 4367 4821 5190 4638 4904 4528 4383 4339 43271995 3856 4072 4563 4561 4984 5316 4843 5383 4889 4681 4466 44631996 4217 4322 4779 4988 5383 5591 5322 5404 5106 4871 4977 47061997 4193 4460 4956 5022 5408 5565 5360 5490 5286 5257 5002 48971998 4577 4764 5052 5251 5558 5931 5476 5603 5425 5177 4792 47761999 4450 4659 5043 5233 5423 5814 5339 5474 5278 5184 4975 47512000 4600 4718 5218 5336 5665 5900 5330 5626 5512 5293 5143 48422001 4627 4981 5321 5290 6002 5811 5671 6102 5482 5429 5356 51672002 4608 4889 5352 5441 5970 5750 5670 5860 5449 5401 5240 52292003 4770 5006 5518 5576 6160 6121 5900 5994 5841 5832 5505 55732004 5331 5355 6057 6055 6771 6669 6375 6666 6383 6118 5927 57502005 5122 5398 5817 6163 6763 6835 6678 6821 6421 6338 6265 62912006 5540 5822 6318 6268 7270 7096 6505 7039 6440 6446 6717 6320The ts() command creates a time series starting with the first month of 1992 (start = 1992) with sales reported monthly (frequency = 12).As stated above, the functions decompose() & stl() expect that the time series has been stored in this format.> Sales.decomposed = decompose(Sales.ts)> plot(Sales.decomposed)Using the function stl() requires that we specify that we expect a periodic/seasonal component in our time series as shown below.> Sales.stl = stl(Sales.ts,s.window="periodic")> plot(Sales.stl)The only benefit of the stl() approach is that allows for more fine tuning or control of the smoothing process used to extract the different components. I will not delve into these in this class however and decompose() should be the default approach.As a second example we will again consider the liquor sales data.> detach(BevSales)> attach(LiquorSales)> names(LiquorSales)[1] "Time" "Month" "Year" "Sales"> ts.plot(Time,Sales,xlab=”Time”,ylab=”Liquor Sales”)As with the beverage sales time series we examined above, we can construct comparative boxplots to compare liquor sales across month and year. We first use a command to set up two comparative boxplots in the same plot, 1 row consisting of 2 plots.> par(mfrow=c(1,2))> boxplot(split(Sales,Month),xlab="Month",ylab="US Liquor Sales")> title(main="Comparative Boxplots of Monthly Liquor Sales")> boxplot(split(Sales,Year),xlab="Year",ylab="US Liquor Sales")> title(main="Comparative Boxplots of Yearly Liquor Sales”)> par(mfrow=c(1,1)) reset the graphics window to one plot per page afterwards.Next we create a time series object using the ts() command.> head(LiquorSales) Time Month Year Sales1 1 1 1980 480 series starts January, 19802 2 2 1980 4673 3 3 1980 5144 4 4 1980 5055 5 5 1980 5346 6 6 1980 546> Liquor.ts = ts(Sales,start=1980,frequency=12)Etc…Now use the decompose() function to break the time series into the different components.The assumption of constant variation appears to be violated. Assuming the same seasonal pattern for each year appears wrong. The residual or error variation is large at both ends, over estimating the magnitude of the seasonal highs and lows in the early years and under estimating the magnitude in the latter years of the series. This non-constant variation is also evidenced in the time series plot and the comparative boxplots across year above.When a time series has nonconstant variation, we often times assume a multiplicative model or equivalently take the natural logarithm of the response (see same example in Ch. 2 JMP Handout).> Liquor.mult = decompose(Liquor.ts,type="multiplicative")> plot(Liquor.mult)> logSales = log(Liquor.ts)> Liquor.log = decompose(logSales)> plot(Liquor.log)Regression Models for Seasonal Time Series (see Sec 2.4.2 pgs. 41-42)We return again to the U.S. beverage sales time series and consider fitting the regression model below (see also pg. 38 of Ch. 2 JMP handout).Eyt=Eyt=βo+ β1t+β2t2+β3t3+β4sin2πt12+β5cos2πt12> detach(PharmSales)> attach(BevSales)> names(BevSales)[1] "Time" "Month" "Year" "Date" "Sales"> t2 = Time^2> t3 = Time^3> sint = sin(2*pi*Time/12)> cost = cos(2*pi*Time/12)> BevSales.lm = lm(Sales~Time+t2+t3+sint+cost)> Bevfit = fitted(BevSales.lm) extract fitted values from regression model> plot(Time,Sales,type="p")> lines(Time,Bevfit,col="blue",lwd=2)> et = resid(BevSales.lm) extract residuals from fit and call them et> ts.plot(et)> acf(et)Still appears to be some lag 6- and lag 12-month autocorrelation the model is not effectively picking up.Lags and Differences in RThe function lag.plot()will construct a series of scatterplots of ytvs. yt-k up to a specified number of lags. For example, for the liquor sales time series we were just working with we can construct a series of these plots up to a lag k = 12 using the following command:> lag.plot(Liquor.ts,lags=12)We could also have used the log time series instead. These plots give a nice visual of the degree of autocorrelation, but it definitely does not replace the ACF plot for this purpose. The two could be used in conjunction with each other to give a nice visual representation of the degree of autocorrelation. This time series is clearly nonstationary so the degree of autocorrelation is very high. Below is a similar analysis for the stationary pharmaceutical sales time series used in the Ch. 2 JMP handout. The data is imported as > detach(LiquorSales)> PharmSales = read.table(file.choose(),header=T,sep=",")> names(PharmSales)[1] "Time" "Sales"> acf(Sales)Lag plot for pharmaceutical sales time series (lags up to 8)There is clearly no autocorrelation present.Differencing in RThe command to construct differences in R is diff()and it takes a time series and the desired lag-d difference as arguments. Using the U.S. beverage sales data again, we first try to remove the seasonality in the beverage sales by using a lag-12 difference. > deltaY12 = diff(Sales,12)> plot(deltaY12,xlab="Time",ylab="Differenced Sales",main="Lag-1 Differenced Beverages Sales",type="l")We can then use a lag-1 difference to capture any additional structure in these data.> deltaY1 = diff(deltaY12,1)> ts.plot(deltaY1)An ACF plot for this seasonally and lag-1 difference time series is shown on the following page. The series appears to be stationary.> acf(deltaY1) ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download