5 - Review of Regression (Sections 5.1 - 5.4)



Rob Hyndman (with Deppa edits)
February 19, 2021

Table of Contents

5 - Linear Regression Models
  5.1 - The Simple Linear Regression Model
    Fitting the SLR Model
    Regression Assumptions
    Plot of Regression Assumptions
    Example 5.1 - Growth Rates of Personal Consumption and Personal Income in the US
  5.2 - Multiple Linear Regression (MLR)
    Example 5.1 - Growth Rates of Personal Consumption and Personal Income in the US (cont'd)
    Example 5.2 - U.S. Monthly Beverage Sales
  5.4 - Some Useful Predictors/Terms for Forecasting with MLR
    Trend
    Seasonal Dummy Variables
    Fourier Series
    Example 5.3 - Australian Beer Quarterly Production
    Intervention Terms
    Trading Days
    Example 5.4 - Total Monthly Sales (Fastenal)

5 - Linear Regression Models

5.1 - The Simple Linear Regression Model

In simple linear regression we are interested in modeling E(Y|X). "Simple" refers to the fact that we are using a single predictor or explanatory variable X to model the mean of the response Y. Typically we think of SLR as fitting the model

E(Y|X) = βo + β1X

to a sample (x1,y1), (x2,y2), …, (xn,yn) from the population of interest. This is an overly narrow view of SLR, as the following models are also SLR models:

E(Y|X) = βo + β1log(X)

E(Y|X) = βo + β1X²

These models allow the relationship between the mean of Y and X to be curvilinear. Also, in terms of variation, we assume that Var(Y|X) = constant = σ².

Fitting the SLR Model

The sample-based model when fitting the theoretical or population model E(Y|X) = βo + β1X is given by

yi = βo + β1xi + ei,   i = 1,…,n.

In forecasting and working with time series, the response Y is more properly denoted Yt and the observed values are yt, t = 1,…,T. In our first use of regression for forecasting purposes the predictor X will typically be time t itself. For example, we might fit a line to our time series:

E(Y|t) = βo + β1t,   t = 1,…,T

This model is unlikely to work well for most time series, so we will consider adding more time-based terms to it. We will examine the types of time-based terms we can add to improve our forecast model in Section 5.4.

Sometimes, however, we might try to relate one time series yt to another time series xt. In our first example below, this will be the case. Our sample-based model in this situation becomes

yt = βo + β1xt + et,   t = 1,…,T.

Before fitting our first SLR model we will examine the assumptions we make when fitting a regression model.

Regression Assumptions

In fitting a simple linear regression model using OLS we generally make the following assumptions:

1. The model for E(Y|X) is correct. If we are fitting a line to model the conditional expectation this becomes E(Y|X) = βo + β1X.

2. Var(Y|X) = constant = σ².

3. The observations (xt, yt) are independent. This also says that the random errors (et) are independent. This assumption is typically (almost always) violated when X = t = "time". We can check it using tests for significant autocorrelation like the Breusch-Godfrey test or the Durbin-Watson test. These tests are alternatives to the other Portmanteau tests covered in Section 3.3 - Residual Diagnostics.
We can also examine time series and ACF plots of the residuals to assess this assumption graphically.

4. For purposes of statistical inference, i.e. CIs and p-values derived using the t- and F-distributions, we assume normality. Specifically, we assume the conditional distribution of Y|X is normal.

Putting assumptions 1, 2, and 4 together, we are assuming

Yt|(Xt = xt) ~ N(βo + β1xt, σ²)

or equivalently,

et ~ N(0, σ²).

We can refer to the diagram linked below to visualize these assumptions.

Plot of Regression Assumptions

We check these assumptions by examining plots of the residuals et = yt − ŷt from a regression model fit to the time series yt.

Example 5.1 - Growth Rates of Personal Consumption and Personal Income in the US

The dataset uschange in the fpp2 library contains percentage changes in quarterly personal consumption expenditure, personal disposable income, production, savings, and the unemployment rate for the US, 1970 to 2016. These data are also available on FRED. They are currently stored as a multivariate time series object containing five series. For the purposes of developing regression models with these data, we will convert this dataset to a data frame.

require(fpp2)
require(GGally)
autoplot(uschange, facet = T)

USchange = as.data.frame(uschange)
names(USchange)

## [1] "Consumption"  "Income"       "Production"   "Savings"     
## [5] "Unemployment"

# View(USchange)
ggpairs(USchange)

We will first consider fitting a SLR model using Y = percent change in consumption and X = percent change in income. As we have converted these time series to standard numeric variables (i.e. we have removed the time structure - year & quarter), we can use the lm command to fit the model. If we leave them as time series objects, we use the command tslm to fit models.

mod1 = lm(Consumption ~ Income, data = USchange)
plot(Consumption ~ Income, data = USchange)
abline(mod1, lwd = 2, col = "blue")

The plot above shows a scatterplot of consumption (Y) vs. income (X) with the fitted line from the SLR model added to the plot. The residuals et are the vertical distances between the data points in the scatterplot and the fitted line. By examining plots of the residuals we can check the model assumptions. The two commands we will use to do this are plot(name of model) and checkresiduals(name of model).

par(mfrow = c(2, 2))
plot(mod1)
par(mfrow = c(1, 1))

checkresiduals(mod1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 27.584, df = 10, p-value = 0.002104

Results of plot(mod1)

The first plot is of the residuals et vs. the fitted values ŷt. This plot should look like a shapeless blob with no discernible patterns. If it shows evidence of nonlinear patterns, then our specified model is not adequate. If the residuals tend to get larger as the fitted values increase, then we have evidence that the constant variation assumption is violated.

The second plot in the top row is a normal quantile plot of the residuals et. If the normality assumption is satisfied for our regression model, these points should fall along the reference line.

The third plot (first plot in the second row) is a plot of the square roots of the absolute standardized residuals vs. the fitted values ŷt. This plot should NOT have any trend (increasing or decreasing) if the constant variation assumption is satisfied.

The fourth and final plot is a plot of the standardized residuals vs. the leverages ht. The standardized residuals generally range between −3 and 3, with roughly 95% falling between −2 and 2. Standardized residuals outside of this range could be considered outliers, i.e. cases poorly fit by our model. The leverages are a measure of "weirdness" of the x-values.
Points that are exceptional in any way are labeled with their observation number in this plot. The contours added to this plot show values of Cook's distance, which is a measure of observation influence. Points between 0.5 and 1 on this measure are troublesome, and points beyond 1 are deemed unduly influential. Influence measures how much our fitted model would change if these points were not included in the regression.

Q: What do these plots suggest about the assumptions for this model?

Results of checkresiduals(mod1)

The three plots created by checkresiduals are specific to situations where the response is a time series. The first plot shows a time series plot of the residuals et. Ideally this plot will look like white noise, i.e. et ~ N(0, σ²). Any evidence of structure in this plot (e.g. autocorrelation, seasonal/cyclic patterns) indicates a violation of the independence assumption.

The second plot is an ACF plot of the residuals. We are looking for an ACF consistent with the white noise assumption. Significant autocorrelations indicate non-independence of the residuals and suggest that we are missing important structure in our time series.

The final plot is a histogram of the residuals, which ideally will be consistent with a normal distribution.

Q2: What do these plots suggest about the residuals from this model?

Regression Model Summary

Ignoring any potential assumption violations for this fitted model, we now examine summary statistics for the fitted model.

summary(mod1)

## 
## Call:
## lm(formula = Consumption ~ Income, data = USchange)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40845 -0.31816  0.02558  0.29978  1.45157 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.54510    0.05569   9.789  < 2e-16 ***
## Income       0.28060    0.04744   5.915 1.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6026 on 185 degrees of freedom
## Multiple R-squared:  0.159,  Adjusted R-squared:  0.1545 
## F-statistic: 34.98 on 1 and 185 DF,  p-value: 1.577e-08

The model summary gives us the following:

- The 5-number summary of the residuals et.
- The estimated parameters (β̂j) along with their estimated standard errors (SE(β̂j)).
- t-statistics for testing whether each parameter in the population is significantly different from 0 (i.e. NH: βj = 0 vs. AH: βj ≠ 0).
- p-values associated with the t-statistics. Terms that are significant in the model are highlighted using the p-value nomenclature below the table. Here we conclude both the population intercept and slope are significantly different from 0 (p < .0001).
- The RMSE (σ̂), the error degrees of freedom (df = T − k, where k = the number of parameters in the model), the R-square (R²), and the adjusted R-square (R²adj), which essentially is the R² penalized by the number of parameters in our model.
- An overall F-test of whether or not the regression model explains a significant amount of variation in the response above and beyond the simple model E(Y|X) = βo, which has ŷt = ȳ (i.e. the mean forecast method). This will almost always show significance unless the model you are considering is complete garbage.

The estimated regression equation is given by

ŷt = β̂o + β̂1xt

which is

ŷt = .545 + .281xt

Estimated Intercept (β̂o): The estimated percent change in US consumption is .545% when the percent change in income is 0.

Estimated Slope (β̂1): The estimated percent change in US consumption increases by .281% for each 1 percentage point increase in the percent change in US income.
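As a quick illustration of how the estimated equation is used, we can feed hypothetical income growth rates to the fitted model and obtain point predictions with 95% prediction intervals. This is a minimal sketch; the values in new.income are made up purely for illustration.

## Hypothetical quarterly income growth rates (illustrative values only)
new.income = data.frame(Income = c(-1, 0, 1, 2))

## Point predictions and 95% prediction intervals from the fitted SLR model
predict(mod1, newdata = new.income, interval = "prediction", level = 0.95)

Each row returns ŷt = .545 + .281xt along with an interval reflecting both the uncertainty in the estimated coefficients and the residual variation.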
5.2 - Multiple Linear Regression (MLR)

In multiple linear regression we have multiple non-constant terms in our model. For example, if we want k predictors or terms in our population model we would have

E(Y|X1,…,Xk) = βo + β1X1 + … + βkXk

In this course our response Y is a time series {yt}, so our sample model is given by

yt = βo + β1x1,t + β2x2,t + … + βkxk,t + et,   t = 1,…,T.

This model also assumes that the k predictors/terms in our model are time dependent, hence the subscript t. This is generally the case when developing linear models for forecasting a time series.

Example 5.1 - Growth Rates of Personal Consumption and Personal Income in the US (cont'd)

Consider now fitting a model using percent change in consumption as the response and the percent changes in income, production, savings, and unemployment as the predictors.

ggpairs(USchange)

mod2 = lm(Consumption ~ Income + Production + Savings + Unemployment, data = USchange)
par(mfrow = c(2, 2))
plot(mod2)
par(mfrow = c(1, 1))

checkresiduals(mod2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 17.853, df = 10, p-value = 0.05749

summary(mod2)

## 
## Call:
## lm(formula = Consumption ~ Income + Production + Savings + Unemployment, 
##     data = USchange)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88296 -0.17638 -0.03679  0.15251  1.20553 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.26729    0.03721   7.184 1.68e-11 ***
## Income        0.71449    0.04219  16.934  < 2e-16 ***
## Production    0.04589    0.02588   1.773   0.0778 .  
## Savings      -0.04527    0.00278 -16.287  < 2e-16 ***
## Unemployment -0.20477    0.10550  -1.941   0.0538 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3286 on 182 degrees of freedom
## Multiple R-squared:  0.754,  Adjusted R-squared:  0.7486 
## F-statistic: 139.5 on 4 and 182 DF,  p-value: < 2.2e-16

Q1: Do plots of the residuals suggest any model assumption violations?

Q2: Are all of the predictors in the MLR significant?

Q3: What percent of the variation in the consumption time series is explained by our MLR model?

As the percent change in production in the US is NOT significant in our full MLR model, we will consider dropping this term from our regression model.

mod3 = lm(Consumption ~ Income + Savings + Unemployment, data = USchange)
summary(mod3)

## 
## Call:
## lm(formula = Consumption ~ Income + Savings + Unemployment, data = USchange)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82491 -0.17737 -0.02716  0.14406  1.25913 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.281017   0.036607   7.677 9.47e-13 ***
## Income        0.730497   0.041455  17.622  < 2e-16 ***
## Savings      -0.045990   0.002766 -16.629  < 2e-16 ***
## Unemployment -0.341346   0.072526  -4.707 4.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3305 on 183 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.7456 
## F-statistic: 182.7 on 3 and 183 DF,  p-value: < 2.2e-16

Q1: How does this model compare to the previous model with all the terms included?

Q2: Which predictor, percent change in Income, Savings, or Unemployment, has the strongest adjusted relationship?

As this appears to be our final model, we should examine its residuals again, but we will not do so here.

Example 5.2 - U.S. Monthly Beverage Sales

In this example we consider developing a model for monthly beverage sales in the US. What makes this situation different from the previous one is that we do not have other time series that we want to relate to beverage sales.
Thus our regression model for forecasting will only use terms based upon time t.

Beverage = read.csv(file="")
names(Beverage)

## [1] "Time"  "Month" "Year"  "Date"  "Sales"

head(Beverage)

##   Time Month Year     Date Sales
## 1    1     1 1992 1/1/1992  3519
## 2    2     2 1992 2/1/1992  3803
## 3    3     3 1992 3/1/1992  4332
## 4    4     4 1992 4/1/1992  4251
## 5    5     5 1992 5/1/1992  4661
## 6    6     6 1992 6/1/1992  4811

BevSales = ts(Beverage$Sales, start = 1992, frequency = 12)
BevSales = BevSales/monthdays(BevSales)
autoplot(BevSales) + ggtitle("Monthly US Beverage Sales") + xlab("Year") + ylab("Beverage Sales")

length(BevSales)

## [1] 180

Q: Given the plot above, how might we go about building a regression model that can be used to forecast beverage sales using only terms based upon time (t)?

5.4 - Some Useful Predictors/Terms for Forecasting with MLR

There are several very useful predictors that occur frequently when using regression for time series data. The tslm function in the forecast/fpp2 library has some built-in functionality for easily including the terms discussed below. These terms are used either to specify the long-term trend in a time series or to model its seasonal/cyclic behavior. If a time series {yt} does not contain long-term and/or seasonal behavior, then the terms discussed below will not be useful in developing a forecast model.

Trend

It is very common for time series data to be trending, i.e. to have an apparent long-term trend. A linear trend can be modelled by simply using the model

yt = βo + β1t + et,   t = 1,…,T.

A quadratic trend can be modelled using the model

yt = βo + β1t + β2t² + et,   t = 1,…,T.

A cubic trend could be modelled by adding β3t³ to the above model, and so on.

A trend variable can be specified in the tslm function using the trend predictor. If we want to specify a polynomial trend, we can use poly(trend, d) to denote the trend component, where d is the degree of the polynomial. In most situations d = 2 or d = 3. We will now try to extract the long-term trend from the U.S. beverage sales time series.

autoplot(BevSales) + ggtitle("Monthly US Beverage Sales") + xlab("Year") + ylab("Beverage Sales")

## Form time sequence (t) and fit SLR model
t = as.numeric(1:length(BevSales))
t

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
## [103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## [120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
## [137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
## [154] 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
## [171] 171 172 173 174 175 176 177 178 179 180

## Use tslm() to fit a linear model when the response is a time series object
bevmod1 = tslm(BevSales ~ t)
plot(t, BevSales)
abline(bevmod1)

## Model summary
summary(bevmod1)

## 
## Call:
## tslm(formula = BevSales ~ t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.149  -7.637  -0.299   8.980  30.750 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 134.58024    2.04074   65.95   <2e-16 ***
## t             0.41499    0.01956   21.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.63 on 178 degrees of freedom
## Multiple R-squared:  0.7167, Adjusted R-squared:  0.7151 
## F-statistic: 450.3 on 1 and 178 DF,  p-value: < 2.2e-16

## Plot residuals
checkresiduals(bevmod1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 152.84, df = 24, p-value < 2.2e-16

An alternative to creating the time variable t is to use trend inside the call to the tslm function.

bevmod1.2 = tslm(BevSales ~ trend)
summary(bevmod1.2)

## 
## Call:
## tslm(formula = BevSales ~ trend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.149  -7.637  -0.299   8.980  30.750 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 134.58024    2.04074   65.95   <2e-16 ***
## trend         0.41499    0.01956   21.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.63 on 178 degrees of freedom
## Multiple R-squared:  0.7167, Adjusted R-squared:  0.7151 
## F-statistic: 450.3 on 1 and 178 DF,  p-value: < 2.2e-16

Notice that either approach to fitting the long-term trend model gives exactly the same results.

Q1: Do these models adequately capture the long-term trend in this time series?

## Try a quadratic and cubic polynomial in time (t)
bevmod2 = tslm(BevSales ~ poly(trend, 2))
summary(bevmod2)

## 
## Call:
## tslm(formula = BevSales ~ poly(trend, 2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.562  -7.622   0.064   8.494  32.307 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      172.137      1.013 169.941   <2e-16 ***
## poly(trend, 2)1  289.301     13.590  21.288   <2e-16 ***
## poly(trend, 2)2   19.819     13.590   1.458    0.147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.59 on 177 degrees of freedom
## Multiple R-squared:  0.7201, Adjusted R-squared:  0.7169 
## F-statistic: 227.7 on 2 and 177 DF,  p-value: < 2.2e-16

checkresiduals(bevmod2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 152.13, df = 24, p-value < 2.2e-16

bevmod3 = tslm(BevSales ~ poly(trend, 3))
checkresiduals(bevmod3)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 152.54, df = 24, p-value < 2.2e-16

The cubic polynomial trend seems a bit better than the quadratic.

Seasonal Dummy Variables

We will now try to model the seasonal behavior of this time series. One option is to use dummy variables to represent the month. A dummy variable is a variable that takes on two values: 1 if a condition is met and 0 if it is not. As this time series is monthly (i.e. frequency = 12), we create dummy variables for 11 of the 12 months.
Here we will use January as the month without a dummy variable. For example, for August we would have

d8,t = 1 if Month = August; 0 if Month ≠ August.

Thus the August dummy variable takes on a value of 1 if the month in question is August and 0 if it is not.

Month      d2,t  d3,t  ...  d12,t
January      0     0   ...    0
February     1     0   ...    0
March        0     1   ...    0
...
December     0     0   ...    1
January      0     0   ...    0
etc.

Notice that all of the dummy variables are 0 for January.

If we add these terms to our cubic long-term trend model above, we have the multiple linear regression model below.

yt = βo + β1t + β2t² + β3t³ + M2d2,t + M3d3,t + M4d4,t + … + M12d12,t + et

The season argument supplied to a model using the tslm command automatically creates dummies for all but one of the frequency levels for use in the model.

bevmod4 = tslm(BevSales ~ poly(trend, 3) + season)
summary(bevmod4)

## 
## Call:
## tslm(formula = BevSales ~ poly(trend, 3) + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6894  -3.5073  -0.2272   3.2039  12.7256 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      146.213      1.301 112.387  < 2e-16 ***
## poly(trend, 3)1  287.750      5.042  57.071  < 2e-16 ***
## poly(trend, 3)2   20.457      5.031   4.066 7.38e-05 ***
## poly(trend, 3)3   42.150      5.057   8.335 2.88e-14 ***
## season2           21.185      1.837  11.532  < 2e-16 ***
## season3           20.767      1.837  11.304  < 2e-16 ***
## season4           28.169      1.837  15.331  < 2e-16 ***
## season5           38.349      1.838  20.868  < 2e-16 ***
## season6           48.042      1.838  26.136  < 2e-16 ***
## season7           29.953      1.839  16.290  < 2e-16 ***
## season8           37.004      1.839  20.119  < 2e-16 ***
## season9           32.176      1.840  17.487  < 2e-16 ***
## season10          20.797      1.841  11.298  < 2e-16 ***
## season11          22.177      1.842  12.041  < 2e-16 ***
## season12          12.464      1.843   6.764 2.22e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.031 on 165 degrees of freedom
## Multiple R-squared:  0.9642, Adjusted R-squared:  0.9612 
## F-statistic: 317.8 on 14 and 165 DF,  p-value: < 2.2e-16

Before examining the residuals we will consider how to interpret the parameter estimates for the month dummy terms in our model. Consider the estimated coefficient for June (M6 = 48.04), which is the adjustment we make to our beverage sales forecast if the forecast period is June. It also represents an estimate of how much higher beverage sales are in June compared to January, the reference month (all dummies equal to zero).

Examining the residuals from this model, we find the following.

checkresiduals(bevmod4)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 70.441, df = 24, p-value = 1.874e-06

## Try modeling season before trend
bevmod5 = tslm(BevSales ~ season)
checkresiduals(bevmod5)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 171.26, df = 24, p-value < 2.2e-16

## Now we can see the long-term trend we want to approximate
temp = tslm(BevSales ~ trend + season)
summary(temp)

## 
## Call:
## tslm(formula = BevSales ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1738  -4.1825  -0.0461   4.0747  18.5953 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.084e+02  1.762e+00  61.525  < 2e-16 ***
## trend       4.124e-01  8.867e-03  46.505  < 2e-16 ***
## season2     2.127e+01  2.252e+00   9.445  < 2e-16 ***
## season3     2.094e+01  2.252e+00   9.299  < 2e-16 ***
## season4     2.843e+01  2.252e+00  12.624  < 2e-16 ***
## season5     3.870e+01  2.252e+00  17.183  < 2e-16 ***
## season6     4.849e+01  2.253e+00  21.525  < 2e-16 ***
## season7     3.049e+01  2.253e+00  13.534  < 2e-16 ***
## season8     3.763e+01  2.253e+00  16.704  < 2e-16 ***
## season9     3.290e+01  2.253e+00  14.601  < 2e-16 ***
## season10    2.162e+01  2.254e+00   9.593  < 2e-16 ***
## season11    2.310e+01  2.254e+00  10.247  < 2e-16 ***
## season12    1.348e+01  2.254e+00   5.981 1.31e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.168 on 167 degrees of freedom
## Multiple R-squared:  0.9456, Adjusted R-squared:  0.9417 
## F-statistic: 241.9 on 12 and 167 DF,  p-value: < 2.2e-16

There is an increase in sales of .4124 units per month on average. On average, February has sales 21.27 units higher than January, March has sales 20.94 units higher than January, etc.

Fourier Series

Another way to model seasonal or cyclic behavior in a time series is to use Fourier series terms. Fourier terms are named for Jean-Baptiste Fourier, a French mathematician born in the 1700s, who showed that a series of sine and cosine terms of the right frequencies can approximate any periodic function. (FYI: his name is among the 72 scientists on the Eiffel Tower.) We should be able to use them to model seasonal patterns in a time series.

If m is the seasonal period or frequency, then the first few Fourier terms are given by

x1,t = sin(2πt/m), x2,t = cos(2πt/m), x3,t = sin(4πt/m), x4,t = cos(4πt/m), x5,t = sin(6πt/m), x6,t = cos(6πt/m),

and so on. If we have monthly seasonality (frequency m = 12) and we use the first 11 of these predictor variables, then we will get exactly the same forecasts as using 11 dummy variables. Below are some examples of what the sine and cosine terms look like with m = 12.

t = seq(0, 100, .08333)
y1 = sin(2*pi*t/12)
y2 = cos(2*pi*t/12)
y3 = sin(4*pi*t/12)
y4 = cos(4*pi*t/12)
y5 = sin(6*pi*t/12)
y6 = cos(6*pi*t/12)
par(mfrow=c(3,2))
plot(t, y1, type="l", main="sin(2*pi*t/12)")
plot(t, y2, type="l", main="cos(2*pi*t/12)")
plot(t, y3, type="l", main="sin(4*pi*t/12)")
plot(t, y4, type="l", main="cos(4*pi*t/12)")
plot(t, y5, type="l", main="sin(6*pi*t/12)")
plot(t, y6, type="l", main="cos(6*pi*t/12)")
par(mfrow=c(1,1))

plot(t, .1*y1 + .2*y2 - .2*y3 + .1*y4 - .3*y5 - .1*y6, ylab="Combination", type="l", main="Combination of Fourier Terms")

bevmod6 = tslm(BevSales ~ poly(trend, 3) + fourier(BevSales, K = 6))
summary(bevmod6)

## 
## Call:
## tslm(formula = BevSales ~ poly(trend, 3) + fourier(BevSales, 
##     K = 6))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6894  -3.5073  -0.2272   3.2039  12.7256 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   172.1369     0.3750 459.069  < 2e-16 ***
## poly(trend, 3)1               287.7499     5.0420  57.071  < 2e-16 ***
## poly(trend, 3)2                20.4568     5.0308   4.066 7.38e-05 ***
## poly(trend, 3)3                42.1495     5.0567   8.335 2.88e-14 ***
## fourier(BevSales, K = 6)S1-12  -4.2692     0.5326  -8.016 1.90e-13 ***
## fourier(BevSales, K = 6)C1-12 -14.5197     0.5305 -27.372  < 2e-16 ***
## fourier(BevSales, K = 6)S2-12  -3.0817     0.5308  -5.806 3.20e-08 ***
## fourier(BevSales, K = 6)C2-12  -0.1292     0.5305  -0.244 0.807799    
## fourier(BevSales, K = 6)S3-12  -0.3953     0.5305  -0.745 0.457211    
## fourier(BevSales, K = 6)C3-12  -2.0644     0.5305  -3.892 0.000144 ***
## fourier(BevSales, K = 6)S4-12  -5.7441     0.5303 -10.831  < 2e-16 ***
## fourier(BevSales, K = 6)C4-12   2.4388     0.5305   4.598 8.47e-06 ***
## fourier(BevSales, K = 6)S5-12  -1.8305     0.5303  -3.452 0.000707 ***
## fourier(BevSales, K = 6)C5-12  -1.2046     0.5305  -2.271 0.024449 *  
## fourier(BevSales, K = 6)C6-12   2.0199     0.3750   5.386 2.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.031 on 165 degrees of freedom
## Multiple R-squared:  0.9642, Adjusted R-squared:  0.9612 
## F-statistic: 317.8 on 14 and 165 DF,  p-value: < 2.2e-16

Notice that while the estimated coefficients are different from the model with monthly dummy variables, the R² and residual summary statistics are exactly the same for models bevmod4 and bevmod6. The forecasts, which we will not consider now, are also exactly the same for the two models.

The advantage of using Fourier terms is that we can often use fewer predictor variables than we need with dummy variables, especially when m is large. For example, if we have weekly data then m = 52, and the dummy variable approach would require 51 dummy variables! With Fourier series terms we can probably model any seasonal or cyclic behavior in the time series with far fewer terms. For short seasonal periods, such as with quarterly data, there is little advantage in using Fourier series over seasonal dummy variables.

In R, these Fourier terms are produced using the fourier function. The first argument to fourier just allows it to identify the seasonal period m and the length of the predictors to return. The second argument, K, specifies how many pairs of sin and cos terms to include. The maximum number allowed is K = m/2, where m is the seasonal period. Because we have used the maximum here (K = 6), the results are identical to those obtained when using seasonal dummy variables.

If only the first two Fourier terms (x1,t, x2,t) are used, the seasonal pattern will follow a simple sine wave. A regression model containing Fourier terms is often called a harmonic regression because the successive Fourier terms represent harmonics of the first two Fourier terms.

Example 5.3 - Australian Beer Quarterly Production

In this example we will again examine the quarterly beer production in Australia (in megalitres, ML). We will plot the time series and then apply what we have just examined in terms of trend (long-term and seasonal) in developing a regression model that can be used to forecast beer production.

require(fpp2)
summary(ausbeer)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   213.0   379.2   422.0   415.4   465.8   599.0

beersub = window(ausbeer, start = 1992)
autoplot(beersub) + xlab("Year") + ylab("Beer Production (ML)")

There does not appear to be much of a long-term trend in this time series, although it does appear to be gradually decreasing.
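Before modelling the seasonality, it can help to view each quarter as its own subseries. The sketch below is purely exploratory and uses the ggsubseriesplot function from the forecast package (loaded with fpp2).

## Seasonal subseries plot: one panel per quarter, with quarterly means overlaid
ggsubseriesplot(beersub) + ylab("Beer Production (ML)") + ggtitle("Seasonal Subseries Plot: Quarterly Beer Production")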
A strong seasonal pattern is immediately evident, so we will use quarterly dummy variables for all but the first quarter (Q1).

mod1 = tslm(beersub ~ trend + season)
summary(mod1)

## 
## Call:
## tslm(formula = beersub ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.903  -7.599  -0.459   7.991  21.789 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 441.80044    3.73353 118.333  < 2e-16 ***
## trend        -0.34027    0.06657  -5.111 2.73e-06 ***
## season2     -34.65973    3.96832  -8.734 9.10e-13 ***
## season3     -17.82164    4.02249  -4.430 3.45e-05 ***
## season4      72.79641    4.02305  18.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared:  0.9243, Adjusted R-squared:  0.9199 
## F-statistic: 210.7 on 4 and 69 DF,  p-value: < 2.2e-16

The R² = .9243, which means 92.43% of the variation in the beer production time series is explained by our model. There is a downward trend of 0.34 ML per quarter. On average, the second quarter has production 34.7 megalitres lower than the first quarter, the third quarter has production 17.8 megalitres lower than the first quarter, and the fourth quarter has production 72.8 megalitres higher than the first quarter.

fit.mod1 = fitted(mod1)
autoplot(beersub) + xlab("Year") + autolayer(fit.mod1, series="Fitted Model - trend + season", color="blue")

checkresiduals(mod1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals from Linear regression model
## LM test = 9.3083, df = 8, p-value = 0.317

This model appears to fit the time series quite well. The residuals appear to be white noise, so there is little time structure left in the series that could be useful for forecasting purposes.

Rather than use quarterly dummy variables, we now consider models using Fourier series terms.

mod2 = tslm(beersub ~ trend + fourier(beersub, K = 1))
summary(mod2)

## 
## Call:
## tslm(formula = beersub ~ trend + fourier(beersub, K = 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.314 -13.687  -1.053  12.419  35.244 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 446.3047     4.4198 100.979  < 2e-16 ***
## trend                        -0.3249     0.1024  -3.173  0.00224 ** 
## fourier(beersub, K = 1)S1-4   8.5329     3.0939   2.758  0.00741 ** 
## fourier(beersub, K = 1)C1-4  53.3502     3.0939  17.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.82 on 70 degrees of freedom
## Multiple R-squared:  0.8182, Adjusted R-squared:  0.8104 
## F-statistic: 105 on 3 and 70 DF,  p-value: < 2.2e-16

checkresiduals(mod2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals from Linear regression model
## LM test = 42.928, df = 8, p-value = 9.063e-07

mod3 = tslm(beersub ~ trend + fourier(beersub, K = 2))
summary(mod3)

## 
## Call:
## tslm(formula = beersub ~ trend + fourier(beersub, K = 2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.903  -7.599  -0.459   7.991  21.789 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 446.87920    2.87321 155.533  < 2e-16 ***
## trend                        -0.34027    0.06657  -5.111 2.73e-06 ***
## fourier(beersub, K = 2)S1-4   8.91082    2.01125   4.430 3.45e-05 ***
## fourier(beersub, K = 2)C1-4  53.72807    2.01125  26.714  < 2e-16 ***
## fourier(beersub, K = 2)C2-4  13.98958    1.42256   9.834 9.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared:  0.9243, Adjusted R-squared:  0.9199 
## F-statistic: 210.7 on 4 and 69 DF,  p-value: < 2.2e-16

checkresiduals(mod3)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals from Linear regression model
## LM test = 9.3083, df = 8, p-value = 0.317

autoplot(beersub) + xlab("Year") + ylab("Beer Production (ML)") + ggtitle("Quarterly Beer Production") + autolayer(fitted(mod2), series="K = 1 Fourier Terms") + autolayer(fitted(mod3), series="K = 2 Fourier Terms") + guides(colour=guide_legend(title="Fourier Series Models"))

Notice that mod3, using K = m/2 = 4/2 = 2 Fourier terms, is identical to the quarterly dummy variable model above (e.g. both have R² = .9243). The K = 1 model, which contains only one sine term and one cosine term, misses badly on the production troughs.

In Section 5.8 we discuss how we can also model other nonlinear trends.

Intervention Terms

It is often necessary to model interventions that may have affected the variable to be forecast. For example, competitor activity, advertising expenditure, industrial action, and so on can all have an effect.

When the effect lasts only for one period, we can use a spike variable. This is a dummy variable which takes value one in the period of the intervention and zero elsewhere. A spike variable is equivalent to a dummy variable for handling an outlier.

Other interventions have an immediate and permanent effect. If an intervention causes a level shift (i.e., the value of the series changes suddenly and permanently from the time of intervention), then we use a step variable. A step variable takes value zero before the intervention and one from the time of intervention onward.

Another form of permanent effect is a change of slope. Here the intervention is handled using a piecewise linear trend: a trend that bends at the time of intervention and hence is nonlinear. We will discuss this in Section 5.8.

Example 5.4 - Monthly Average Sales Per Business Day (Fastenal)

Fastenal = read.csv("(2004-2013).csv")
names(Fastenal)

## [1] "Time"             "Month"            "Month.Num"       
## [4] "Year"             "NumBDays"         "AvSalesPD"       
## [7] "Total.Sales"      "Total.Fastner"    "Total.Nonfastner"

TotSales = Fastenal$Total.Sales
TotSales = ts(TotSales, start = 2004, frequency = 12)
TotSales = TotSales/1000000
AvgSalesDay = TotSales/Fastenal$NumBDays
autoplot(AvgSalesDay) + xlab("Year") + ylab("Avg Sales Per Business Day (millions)")

It appears that the time series took a major shock around the start of 2009 and then continued trending upward from there.
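To see the level shift more clearly, we can zoom in on the years surrounding the shock. This is a quick exploratory sketch; the window endpoints are arbitrary choices.

## Zoom in on the years around the apparent 2009 shock
autoplot(window(AvgSalesDay, start = 2007, end = 2011)) + xlab("Year") + ylab("Avg Sales Per Business Day (millions)")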
An intervention dummy term that takes on a value of 1 after 2009 and 0 before could potentially handle this feature in our time series, i.e.

d2009,t = 1 if t > 2009; 0 if t ≤ 2009.

In the R code below we form the intervention term and then use it to model this time series.

t = time(AvgSalesDay)
t

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2004 2004.000 2004.083 2004.167 2004.250 2004.333 2004.417 2004.500
## 2005 2005.000 2005.083 2005.167 2005.250 2005.333 2005.417 2005.500
## 2006 2006.000 2006.083 2006.167 2006.250 2006.333 2006.417 2006.500
## 2007 2007.000 2007.083 2007.167 2007.250 2007.333 2007.417 2007.500
## 2008 2008.000 2008.083 2008.167 2008.250 2008.333 2008.417 2008.500
## 2009 2009.000 2009.083 2009.167 2009.250 2009.333 2009.417 2009.500
## 2010 2010.000 2010.083 2010.167 2010.250 2010.333 2010.417 2010.500
## 2011 2011.000 2011.083 2011.167 2011.250 2011.333 2011.417 2011.500
## 2012 2012.000 2012.083 2012.167 2012.250 2012.333 2012.417 2012.500
## 2013 2013.000 2013.083 2013.167 2013.250 2013.333 2013.417 2013.500
##           Aug      Sep      Oct      Nov      Dec
## 2004 2004.583 2004.667 2004.750 2004.833 2004.917
## 2005 2005.583 2005.667 2005.750 2005.833 2005.917
## 2006 2006.583 2006.667 2006.750 2006.833 2006.917
## 2007 2007.583 2007.667 2007.750 2007.833 2007.917
## 2008 2008.583 2008.667 2008.750 2008.833 2008.917
## 2009 2009.583 2009.667 2009.750 2009.833 2009.917
## 2010 2010.583 2010.667 2010.750 2010.833 2010.917
## 2011 2011.583 2011.667 2011.750 2011.833 2011.917
## 2012 2012.583 2012.667 2012.750 2012.833 2012.917
## 2013 2013.583 2013.667 2013.750 2013.833 2013.917

d2009 = as.numeric(t > 2009)
d2009

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

td = ts(d2009, start = 2004)

# No shift
fast.mod0 = tslm(AvgSalesDay ~ trend + season)
summary(fast.mod0)

## 
## Call:
## tslm(formula = AvgSalesDay ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7329 -0.3189  0.1807  0.4858  1.0286 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.318742   0.256926  16.809   <2e-16 ***
## trend        0.059361   0.001954  30.372   <2e-16 ***
## season2     -0.002059   0.330024  -0.006   0.9950    
## season3      0.265279   0.330041   0.804   0.4233    
## season4      0.186757   0.330070   0.566   0.5727    
## season5      0.306577   0.330111   0.929   0.3551    
## season6      0.458594   0.330163   1.389   0.1677    
## season7      0.192444   0.330226   0.583   0.5613    
## season8      0.419758   0.330302   1.271   0.2065    
## season9      0.633968   0.330388   1.919   0.0577 .  
## season10     0.433746   0.330486   1.312   0.1922    
## season11    -0.142484   0.330596  -0.431   0.6673    
## season12    -0.716150   0.330718  -2.165   0.0326 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7379 on 107 degrees of freedom
## Multiple R-squared:  0.8989, Adjusted R-squared:  0.8876 
## F-statistic: 79.27 on 12 and 107 DF,  p-value: < 2.2e-16

checkresiduals(fast.mod0)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 115.28, df = 24, p-value = 6.678e-14

# Shift only, no changes in trend or seasonal pattern
fast.mod1 = tslm(AvgSalesDay ~ trend + td + season)
summary(fast.mod1)

## 
## Call:
## tslm(formula = AvgSalesDay ~ trend + td + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.69920 -0.16692  0.02304  0.18605  0.67554 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.607945   0.136701  26.393  < 2e-16 ***
## trend        0.089980   0.002007  44.836  < 2e-16 ***
## td          -2.433156   0.138574 -17.558  < 2e-16 ***
## season2      0.210638   0.168155   1.253 0.213093    
## season3      0.447356   0.168047   2.662 0.008975 ** 
## season4      0.338214   0.167963   2.014 0.046583 *  
## season5      0.427416   0.167903   2.546 0.012347 *  
## season6      0.548814   0.167867   3.269 0.001454 ** 
## season7      0.252044   0.167855   1.502 0.136185    
## season8      0.448739   0.167867   2.673 0.008702 ** 
## season9      0.632329   0.167903   3.766 0.000273 ***
## season10     0.401488   0.167963   2.390 0.018598 *  
## season11    -0.205361   0.168047  -1.222 0.224402    
## season12    -0.809646   0.168155  -4.815 4.92e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.375 on 106 degrees of freedom
## Multiple R-squared:  0.9741, Adjusted R-squared:  0.971 
## F-statistic: 307.1 on 13 and 106 DF,  p-value: < 2.2e-16

# Shift and trend change after intervention - no change in seasonal pattern
fast.mod2 = tslm(AvgSalesDay ~ trend*td + season)
summary(fast.mod2)

## 
## Call:
## tslm(formula = AvgSalesDay ~ trend * td + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35700 -0.15728 -0.00371  0.16385  0.63783 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.961557   0.120477  32.882  < 2e-16 ***
## trend        0.078574   0.002228  35.265  < 2e-16 ***
## td          -3.937620   0.230027 -17.118  < 2e-16 ***
## season2      0.213737   0.136364   1.567 0.120030    
## season3      0.449836   0.136277   3.301 0.001317 ** 
## season4      0.340074   0.136208   2.497 0.014089 *  
## season5      0.428655   0.136160   3.148 0.002139 ** 
## season6      0.549433   0.136130   4.036 0.000103 ***
## season7      0.252044   0.136121   1.852 0.066890 .  
## season8      0.448119   0.136130   3.292 0.001356 ** 
## season9      0.631089   0.136160   4.635 1.03e-05 ***
## season10     0.399628   0.136208   2.934 0.004111 ** 
## season11    -0.207841   0.136277  -1.525 0.130231    
## season12    -0.812746   0.136364  -5.960 3.41e-08 ***
## trend:td     0.024054   0.003209   7.496 2.19e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3041 on 105 degrees of freedom
## Multiple R-squared:  0.9831, Adjusted R-squared:  0.9809 
## F-statistic: 437.6 on 14 and 105 DF,  p-value: < 2.2e-16

# Shift, trend, and seasonal pattern change after 2009
fast.mod3 = tslm(AvgSalesDay ~ trend*td*season - trend:td:season - trend:season)
summary(fast.mod3)

## 
## Call:
## tslm(formula = AvgSalesDay ~ trend * td * season - trend:td:season - 
##     trend:season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26289 -0.15615  0.02302  0.13448  0.60532 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.895098   0.146694  26.553  < 2e-16 ***
## trend        0.078120   0.002313  33.769  < 2e-16 ***
## td          -3.796127   0.308574 -12.302  < 2e-16 ***
## season2      0.247407   0.190161   1.301 0.196426    
## season3      0.430885   0.190035   2.267 0.025656 *  
## season4      0.408567   0.189936   2.151 0.034034 *  
## season5      0.482849   0.189866   2.543 0.012620 *  
## season6      0.611788   0.189823   3.223 0.001745 ** 
## season7      0.407616   0.189809   2.148 0.034326 *  
## season8      0.521845   0.189823   2.749 0.007166 ** 
## season9      0.715122   0.189866   3.766 0.000289 ***
## season10     0.529715   0.189936   2.789 0.006402 ** 
## season11    -0.063630   0.190035  -0.335 0.738495    
## season12    -0.617889   0.190161  -3.249 0.001606 ** 
## trend:td     0.025009   0.003359   7.445 4.59e-11 ***
## td:season2  -0.107348   0.283770  -0.378 0.706065    
## td:season3  -0.002158   0.283591  -0.008 0.993945    
## td:season4  -0.177094   0.283451  -0.625 0.533630    
## td:season5  -0.148545   0.283352  -0.524 0.601345    
## td:season6  -0.164915   0.283292  -0.582 0.561867    
## td:season7  -0.351400   0.283272  -1.241 0.217877    
## td:season8  -0.187757   0.283292  -0.663 0.509101    
## td:season9  -0.208420   0.283352  -0.736 0.463836    
## td:season10 -0.300577   0.283451  -1.060 0.291671    
## td:season11 -0.328874   0.283591  -1.160 0.249118    
## td:season12 -0.430216   0.283770  -1.516 0.132856    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3135 on 94 degrees of freedom
## Multiple R-squared:  0.984,  Adjusted R-squared:  0.9797 
## F-statistic: 230.8 on 25 and 94 DF,  p-value: < 2.2e-16

checkresiduals(fast.mod3)

## 
##  Breusch-Godfrey test for serial correlation of order up to 29
## 
## data:  Residuals from Linear regression model
## LM test = 71.993, df = 29, p-value = 1.611e-05

# Compare fits visually
autoplot(AvgSalesDay) + autolayer(fitted(fast.mod0), series="0 - No Intervention") + autolayer(fitted(fast.mod1), series="1 - Constant Shift") + autolayer(fitted(fast.mod2), series="2 - Constant Trend Shift") + autolayer(fitted(fast.mod3), series="3 - Constant Trend Season Shift") + guides(colour=guide_legend(title="Intervention Models"))

Trading Days

The number of trading or business days in a month can vary considerably and can have a substantial effect on sales data. To allow for this, the number of trading days in each month can be included as a predictor.

For monthly or quarterly data, the bizdays function will compute the number of trading days in each period.

Example 5.4 - Total Monthly Sales (Fastenal)

In the example under Intervention Terms above we used the average sales per business day as the time series. This was possible because that information was provided in the data from Fastenal. What if we did not have that information, but rather only had total monthly sales? We can use the bizdays function to compute the number of business days for each month in the time series and then either adjust total sales by dividing by the number of business days or use the number of business days as a term in a regression model for total sales.
Both approaches are covered in the examples below.

autoplot(TotSales) + xlab("Year") + ylab("Total Monthly Sales (millions)") + ggtitle("Fastenal Monthly Sales in Millions")

bizdays(TotSales)

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2004  20  19  23  21  20  22  21  22  21  21  21  22
## 2005  20  19  22  21  21  22  20  23  21  21  21  21
## 2006  20  19  23  19  22  22  20  23  20  22  21  20
## 2007  21  19  22  20  22  21  21  23  19  23  21  20
## 2008  21  20  20  22  21  21  22  21  21  23  19  22
## 2009  20  19  22  21  20  22  22  21  21  22  20  22
## 2010  19  19  23  21  20  22  21  22  21  21  21  22
## 2011  20  19  23  20  21  22  20  23  21  21  21  21
## 2012  20  20  22  20  22  21  21  23  19  23  21  20
## 2013  21  19  20  22  22  20  22  22  20  23  20  21

SalesPerDay = TotSales/bizdays(TotSales)
AvgSalesDay - SalesPerDay

##             Jan        Feb        Mar        Apr        May        Jun
## 2004 -0.2032815 -0.2242063  0.0000000 -0.2122769  0.0000000  0.0000000
## 2005 -0.2551539 -0.2765568 -0.2429371  0.0000000  0.0000000  0.0000000
## 2006 -0.3134015 -0.3327599  0.0000000 -0.3484252  0.0000000  0.0000000
## 2007 -0.3349384 -0.3700269  0.0000000 -0.3661413  0.0000000  0.0000000
## 2008 -0.3825220 -0.4000355 -0.4268739  0.0000000  0.0000000  0.0000000
## 2009 -0.3698775 -0.3809679  0.0000000 -0.3226258  0.0000000  0.0000000
## 2010 -0.3942158 -0.3933064  0.0000000 -0.3774943  0.0000000  0.0000000
## 2011 -0.4356265 -0.4646287  0.0000000 -0.4760653  0.0000000  0.0000000
## 2012 -0.5275084 -0.5284515  0.0000000 -0.5591049  0.0000000  0.0000000
## 2013 -0.5324904 -0.6017229 -0.5919146  0.0000000  0.0000000  0.0000000
##             Jul        Aug        Sep        Oct        Nov        Dec
## 2004  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.2069625
## 2005  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
## 2006  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
## 2007  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.3598762
## 2008  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.6545397
## 2009  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.2973546
## 2010  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.3526784
## 2011  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
## 2012  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.5137589
## 2013  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.5155893

Looking at the differences between our previously used series AvgSalesDay and SalesPerDay (the latter created using the bizdays function), we see that they are for the most part 0. The discrepancies are due either to incorrect business day reporting by Fastenal or to inaccuracies in the bizdays function.

TotSales.sub = window(TotSales, start = 2010)
mod1 = tslm(TotSales.sub ~ trend + bizdays(TotSales.sub) + season)
summary(mod1)

## 
## Call:
## tslm(formula = TotSales.sub ~ trend + bizdays(TotSales.sub) + 
##     season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9106  -5.1460  -0.3977   5.5350  16.1946 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.42755   30.70335  -0.014  0.98897    
## trend                  2.09305    0.08875  23.583  < 2e-16 ***
## bizdays(TotSales.sub)  8.03710    1.51891   5.291 7.19e-06 ***
## season2               -1.99262    5.94435  -0.335  0.73952    
## season3                3.03994    6.57948   0.462  0.64700    
## season4                3.83268    5.94964   0.644  0.52378    
## season5                1.18682    6.14496   0.193  0.84800    
## season6                2.19165    6.15072   0.356  0.72380    
## season7               -5.34382    6.05147  -0.883  0.38340    
## season8                4.09065    6.98820   0.585  0.56217    
## season9                1.32628    5.88880   0.225  0.82316    
## season10               1.49135    6.62541   0.225  0.82325    
## season11             -17.72634    6.00958  -2.950  0.00572 ** 
## season12             -38.35976    6.10654  -6.282 3.72e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.25 on 34 degrees of freedom
## Multiple R-squared:  0.9552, Adjusted R-squared:  0.938 
## F-statistic: 55.74 on 13 and 34 DF,  p-value: < 2.2e-16

checkresiduals(mod1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 17
## 
## data:  Residuals from Linear regression model
## LM test = 42.075, df = 17, p-value = 0.0006541

mod2 = tslm(TotSales.sub ~ poly(trend, 2) + bizdays(TotSales.sub) + season)
summary(mod2)

## 
## Call:
## tslm(formula = TotSales.sub ~ poly(trend, 2) + bizdays(TotSales.sub) + 
##     season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2625 -2.0466 -0.4959  2.8004  8.9212 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            61.3495    16.3176   3.760 0.000662 ***
## poly(trend, 2)1       200.8892     4.5228  44.417  < 2e-16 ***
## poly(trend, 2)2       -41.1639     4.3978  -9.360 8.26e-11 ***
## bizdays(TotSales.sub)   7.5440     0.8082   9.335 8.82e-11 ***
## season2                -2.7088     3.1570  -0.858 0.397063    
## season3                 3.4027     3.4935   0.974 0.337134    
## season4                 3.3713     3.1593   1.067 0.293671    
## season5                 0.8334     3.2628   0.255 0.799977    
## season6                 1.7690     3.2660   0.542 0.591704    
## season7                -5.8898     3.2135  -1.833 0.075864 .  
## season8                 4.3536     3.7104   1.173 0.249051    
## season9                 0.6183     3.1275   0.198 0.844493    
## season10                1.8541     3.5179   0.527 0.601678    
## season11              -17.7029     3.1907  -5.548 3.66e-06 ***
## season12              -37.8666     3.2426 -11.678 2.91e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.38 on 33 degrees of freedom
## Multiple R-squared:  0.9877, Adjusted R-squared:  0.9825 
## F-statistic: 189.9 on 14 and 33 DF,  p-value: < 2.2e-16

checkresiduals(mod2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 18
## 
## data:  Residuals from Linear regression model
## LM test = 28.73, df = 18, p-value = 0.05178

autoplot(TotSales.sub) + xlab("Year") + ylab("Total Monthly Sales (millions)") + autolayer(fitted(mod1), series="Linear Trend") + autolayer(fitted(mod2), series="Quadratic Trend") + guides(colour=guide_legend(title="Two Models with Business Days Terms"))

For forecasting purposes it would definitely be better to adjust the total sales time series by dividing by the number of business days, rather than use the number of business days as a term in our regression model. The reason is that in order to make forecasts with the regression approach we would need to supply the number of business days for each month in our forecast horizon. While this may be easy to calculate, it is harder to implement in R.
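A minimal sketch of the adjustment approach: fit a trend-plus-season model to the business-day-adjusted series, forecast on the per-day scale, and rescale by the business-day counts in the forecast horizon. The counts in future.days below are placeholders that would come from a calendar; everything else uses objects defined above.

adj.mod = tslm(AvgSalesDay ~ trend + season)

## Forecast 12 months ahead on the per-business-day scale
fc = forecast(adj.mod, h = 12)

## Placeholder business-day counts for the 12 months in the forecast horizon
future.days = c(21, 20, 21, 22, 21, 21, 22, 21, 21, 23, 19, 22)

## Rescale the point forecasts back to total monthly sales (millions)
fc$mean * future.days

The same rescaling could be applied to the interval bounds stored in fc$lower and fc$upper.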