


Steve Dutton CVEN 6833

12/15/05 Term Project

Denver’s Airborne Particulate Matter:

Diurnal Patterns and Forecasting Potential

Introduction

Since the 1970 amendments to the Clean Air Act, all large metropolitan areas around the country have been monitoring airborne contaminants in an effort to demonstrate compliance with EPA regulations. Denver is no exception, and there is a wealth of information on the average levels of criteria pollutants over the last three decades. One of these criteria pollutants is particulate matter, which has three classifications: 1) total suspended particulates (TSP), 2) particulate matter less than 10 µm in diameter (PM10) and 3) particulate matter less than 2.5 µm in diameter (PM2.5).

Recent research has suggested correlations between PM2.5 and various adverse health outcomes (Cohen, et al., 2005; Krewski, et al., 2005a; Krewski, et al., 2005b). As a result, PM2.5 has received a lot of attention in recent years. In Denver, there have been daily measurements of PM2.5 since 1999. These measurements are well suited for regulatory purposes, but often fall short when used for research purposes because 1) the data consist mostly of 24-hour composites, 2) there is a sizeable amount of missing data, and 3) some of the sites only report every 3rd or even every 6th day.

In an effort to augment the available regulatory data, a Tapered Element Oscillating Microbalance (TEOM) was operated a few miles east of downtown Denver near the intersection of Colorado Blvd. and Colfax Ave. This monitor ran nearly continuously from January 1, 2001 through December 31, 2003. The TEOM was configured to monitor PM2.5 at a controlled sample temperature of 50 degrees Celsius. In addition, it provided measurements of temperature and barometric pressure.

From a research standpoint, the TEOM offers a new level of insight into the particulate matter in Denver since it is a real-time monitor. This allows the previously available 24-hour composite data to be disaggregated into hourly measurements of PM2.5. As a result, one can identify important diurnal patterns in PM2.5.

The purpose of this report is to identify the diurnal patterns in Denver’s PM2.5 between 2001 and 2003 and to provide a simple predictive model for PM2.5. Correlations among the variables are examined, a spectral analysis is performed, and two nonparametric time series models are evaluated. The better of these two models is then used in a predictive manner, with the results compared to actual PM2.5 measurements.

Methods

I collected the TEOM data used in this project as part of an EPA-funded study at National Jewish Medical & Research Center in Denver (EPA grant R825702). The TEOM was operated with a PM2.5 sharp-cut cyclone and a constant-temperature inlet maintained at 50 degrees Celsius. In that study, twenty-four hour averages of the TEOM data were used to look for daily correlations between PM2.5 and asthma (Rabinovitch, et al., 2004) and between PM2.5 and chronic obstructive pulmonary disease (Silkoff, et al., 2005). The true advantage of the TEOM over other 24-hour composite monitors, however, is its ability to discern diurnal patterns in PM through its real-time monitoring capability. This real-time data (aggregated into hourly averages) was utilized in this project.

Data

Missing data

Approximately once every month, the TEOM was shut down for regular cleaning and maintenance. This task took one to two hours to complete, resulting in occasional missing data points. Furthermore, there were a few times during the three-year period when the instrument was shut down for longer periods for annual calibration. In total, approximately 1% of the hourly data is missing. The following analyses require a complete data set, so the missing data needed to be filled in.

If 4 or fewer consecutive hours of data were missing, a linear interpolation between the closest values in time was used to fill in the missing PM2.5, temperature and pressure data. If more than 4 consecutive hours were missing, each missing point was filled in with the average of the previous day’s and the next day’s readings at the same hour. This procedure helps preserve the diurnal pattern in the data. Histograms of the data before and after filling were compared to verify that no obvious bias was being introduced.
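A minimal R sketch of this gap-filling rule is given below. The function name is hypothetical and the sketch assumes no gap touches the very beginning or end of the record; the actual filling code is not reproduced in the appendices.

fill_gaps=function(x,maxlin=4){
  r=rle(is.na(x)) #runs of missing/non-missing values
  ends=cumsum(r$lengths)
  starts=ends-r$lengths+1
  for(g in which(r$values)){ #loop through the gaps
    i1=starts[g]; i2=ends[g]
    if(r$lengths[g]<=maxlin){ #short gap: linear interpolation in time
      x[i1:i2]=approx(c(i1-1,i2+1),x[c(i1-1,i2+1)],xout=i1:i2)$y
    } else { #long gap: average of the same hour on the previous & next day
      x[i1:i2]=(x[(i1:i2)-24]+x[(i1:i2)+24])/2
    }
  }
  x
}

hourlypm=fill_gaps(hourlypm) #the same rule is applied to temperature & pressure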

Time series plots and histograms of the data

The first step in the analysis was to plot time series and histograms of the PM2.5, temperature and pressure data provided by the instrument. Figure 1 contains the time series plots of the hourly data and Figure 2 contains time series plots of the 24-hour aggregated data.

The most dominant feature in the PM2.5 plots is a large episode near the end of June, 2002, the result of the Hayman fire plume sweeping over Denver. Also visible are several high-pollution days during the winter months resulting from atmospheric inversions, which limit vertical mixing and concentrate pollutants close to the ground. The temperature and pressure time series show the expected seasonal behavior.

Histograms of PM2.5, temperature and pressure are displayed in Figure 3. The first histogram shows the long tail of the PM2.5 distribution caused by infrequent, short-term events such as the Hayman fire or wintertime inversions. The second histogram shows a log-transformation of the PM2.5 data with a superimposed parametric normal curve created by using the mean and standard deviation of the data. The shape of this histogram suggests that PM2.5 is quite well described by a log-normal distribution. The remaining two histograms are for temperature and pressure with normal distributions superimposed in a similar manner. These histograms suggest that the data is not quite normally distributed and that both the temperature and pressure may contain some bimodality in their distributions.

It is clear from these histograms that any assumption of normality for the raw PM2.5 data would produce biased results. Rather than log-transforming the data and assuming the findings will hold back in the non-transformed space, I have chosen to work with the original data and utilize nonparametric methods, which do not require the assumption of normality.

[pic]

Figure 1: Time series plots of PM2.5, Temperature and Pressure recorded hourly by the TEOM.

[pic]

Figure 2: Time series plots of 24-hour aggregate PM2.5, Temperature and Pressure.

[pic]

Figure 3: Histograms of PM2.5, log-transformed PM2.5, temperature and pressure with superimposed parametric normal N(μ,σ²) distributions.

Spectral Analysis

A spectral analysis was performed on the hourly PM2.5, temperature and pressure data in an effort to identify the primary diurnal patterns in the data. Figure 4 contains the periodograms of hourly PM2.5, temperature and pressure with parametric 95% confidence intervals. To help determine which peaks are significant, spectra from 500 simulations of an AR(1) model fit to the data were computed, and the 95th percentile of these spectra is plotted in gray. Any peak above this gray AR(1) spectrum is considered significant relative to an ordinary AR(1) process.

The pressure periodogram shows no high-frequency peaks; all of the spectral density lies at the lower frequencies corresponding to seasonality and long-term changes. The temperature periodogram shows the usual diurnal cycle with a peak at 1 cycle per day, along with some spectral density at lower frequencies corresponding to the seasonal cycle.

[pic]

Figure 4: Periodograms of hourly PM2.5, temperature and pressure shown in black with parametric 95% confidence intervals in red. The gray spectrum is the 95th percentile of 500 simulated spectra from an AR(1) model fit to the data, included to help identify significant peaks.

The PM2.5 periodogram in Figure 4 displays a prominent peak at a frequency of 2 cycles per day. This peak is significant relative to the 95th percentile of the AR(1) model simulations and likely reflects traffic-generated PM2.5, with the twice-daily cycle coming from the morning and evening rush hours. To confirm this, the periodogram was recalculated by day of the week, and corresponding boxplots of the hourly time series were generated in Figure 5. It is clear from the periodogram that the spectral peak at 2 cycles per day is present during the weekdays, when people are commuting, and largely absent on Sunday, when most people are home. The rush-hour PM2.5 is also smaller and less well defined on the weekends than on the weekdays, as can be seen in the time series boxplots.

[pic] [pic]

Figure 5: On the left are periodograms of hourly PM2.5 shown in black with parametric 95% confidence intervals in red for each day of the week. The gray spectra are the 95th percentiles of 500 simulated spectra from AR(1) models fit to the data, included to help identify significant peaks. On the right are boxplots of the observed data as a function of hour of the day for each corresponding day of the week.

Correlation Plots

To determine the relationships among the variables and the forecasting capabilities of this dataset, correlations between several of the variables were analyzed. The goal is to forecast PM2.5 using some combination of the temperature, pressure or previous PM2.5 data. To determine what possibilities exist, the following correlations were examined:

• daily average PM2.5 vs. daily average temperature

• daily average PM2.5 vs. daily average pressure

• daily average PM2.5 vs. mean-removed daily average temperature x pressure

Scatterplots of the above relationships were generated for daily averages over the entire 3-year period and then for the individual months to look for seasonal trends. The scatterplots with local polynomial fits are shown in Figures 6-9.

[pic]

Figure 6: Scatterplots of average daily PM2.5 with average daily temperature, pressure, and mean removed pressure x temperature for the entire dataset.

[pic]

Figure 7: Scatterplots of average daily PM2.5 with average daily temperature disaggregated by month.

[pic]

Figure 8: Scatterplots of average daily PM2.5 with average daily pressure disaggregated by month.

[pic]

Figure 9: Scatterplots of mean removed average daily PM2.5 with mean removed average daily pressure x temperature disaggregated by month.

The first two plots in Figure 6 suggest that there is no simple relationship between PM2.5 and the two meteorological variables. The third plot in Figure 6 was included to look for a possible joint dependence on pressure and temperature. Given the dispersed nature of the points in the scatterplot, however, there appears to be no simple joint relationship either. The monthly plots show some interesting patterns for certain months. For example, in Figure 7 there is a general downward trend between daily average PM2.5 and temperature for November – March and an opposite upward trend for April, May, June and September; July and August show no trend. The downward trend during the colder months could be explained by some degree of atmospheric inversion resulting in both higher concentrations of PM2.5 and lower ground-level temperatures. The upward trend during the summer months could result from increased photochemical activity on hot days, leading to increased formation of secondary aerosols. There do not appear to be any obvious trends between PM2.5 and pressure or pressure x temperature on a monthly basis in Figures 8 and 9.

Going one step further, correlations with daily average PM2.5 were investigated by splitting the temperature and pressure into daytime and nighttime averages. The theory is that a stronger correlation may exist with nighttime meteorology, since the variability caused by photochemistry and many anthropogenic sources has been removed. Scatterplots for the overall data and for the data split into months are included in Figures 10-14 for the following comparisons (a short code sketch of the day/night splitting follows the list):

• daily average PM2.5 vs. daytime average temperature & pressure

• daily average PM2.5 vs. nighttime average temperature & pressure
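As a sketch of this day/night splitting (assuming the 0-23 hour coding used in the appendices and the daytime definition of 6:00 am – 6:00 pm, giving 12 daytime and 12 nighttime hours per day; variable names follow Appendix C):

dayidx=hour>=6&hour<18 #daytime hours
daytemp=apply(matrix(hourlytemp[dayidx],ncol=12,byrow=T),1,mean) #daily daytime averages
nighttemp=apply(matrix(hourlytemp[!dayidx],ncol=12,byrow=T),1,mean)
daypres=apply(matrix(hourlypres[dayidx],ncol=12,byrow=T),1,mean)
nightpres=apply(matrix(hourlypres[!dayidx],ncol=12,byrow=T),1,mean)
yvar=dailypm/24 #average daily PM2.5
plot(nighttemp,yvar,xlab="Ave Nighttime Temp (deg C)",ylab="Ave Daily PM2.5 (ug/m3)")
lines(locfit(yvar~nighttemp),col="red") #local polynomial fit, as in Appendix C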

[pic]

Figure 10: Scatterplots of average daily PM2.5 with average daytime and nighttime temperature and pressure.

[pic]

Figure 11: Scatterplots of average daily PM2.5 with average daytime temperature disaggregated by month.

[pic]

Figure 12: Scatterplots of average daily PM2.5 with average nighttime temperature disaggregated by month.

[pic]

Figure 13: Scatterplots of average daily PM2.5 with average daytime pressure disaggregated by month.

[pic]

Figure 14: Scatterplots of average daily PM2.5 with average nighttime pressure disaggregated by month.

No new trends appear in Figures 10-14 when the data are restricted to daytime and nighttime domains. This once again suggests that the relationship between temperature, pressure and PM2.5 is complex, and that a simple predictive model using just the meteorological data available from the TEOM is not likely to produce valuable results. A predictive time series model based on past observations of PM2.5 alone, however, has some potential, and this leads into the development of two nonparametric time series models in the next section.

Time Series Models

Nonparametric seasonal k-nn model

The first time series model used to describe the PM2.5 data was a nonparametric seasonal k-nearest neighbor (k-nn) model. In this case, the “seasons” are the hours of the day. Simulations from this approach preserve the seasonality of the original data because the nearest neighbors sampled for a given hour of the day are restricted to historic measurements observed during that particular hour. The simplest form of this model uses a lag-1 structure in the k-nn sampling. Higher-order lags can be added to the k-nn framework at the cost of increased computation time.
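The core sampling step can be distilled as follows (a simplified, hypothetical version of the Appendix E code, where xdata[,h] holds the lag-1 predecessor and ydata[,h] the current value for each historic day at hour h):

knn_next=function(zprev,h,xdata,ydata){
  d=abs(xdata[,h]-zprev) #lag-1 distances to the current state at hour h
  K=round(sqrt(length(d))) #K=sqrt(n) thumb rule
  nbrs=order(d)[1:K] #indices of the K nearest neighbors
  w=(1/(1:K))/sum(1/(1:K)) #decreasing kernel weights
  ydata[sample(nbrs,1,prob=w),h] #resample the successor of the chosen neighbor
}

Repeated application of this step, advancing h through the hours of the day, generates a simulated hourly series.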

To help reduce computation time, only one full year of hourly data from 2001 was utilized in the following simulations and a lag-1 structure was used for the k-nn sampling. Twenty simulations using this model were run to test the ability of the model to reproduce basic statistics and probability density functions (PDFs) of the original data.

Figure 15 contains boxplots of the statistics computed for each hour of the day in black along with the observed data in red. It can be seen that the model faithfully reproduces the mean, median, standard deviation, skew, max, min and lag-1 correlation. The model falls short, however, in reproducing the higher-order lag correlations, as expected since only a lag-1 structure was incorporated into the k-nn framework. Overall, this model does a good job reproducing the basic statistics.

Figure 16 shows boxplots of the PDFs for each hour in black generated in the simulations along with the observed PDF for each hour in red. Once again, it is clear that the model is faithfully reproducing the distribution of the original data. Also worth noting is the similarity in distribution of each hour of the day. This was an unexpected result since daytime and nighttime sources of PM2.5 are quite different.

Nonparametric temporal disaggregation model

The second time series model investigated was a nonparametric temporal disaggregation model. In this type of model, the aggregate data is fit to a chosen simple time series model, and the disaggregated data is then obtained by nonparametric methods which utilize past observations of the disaggregated data. In this framework, the summability criterion is also upheld: the sum of the corresponding disaggregated values equals the aggregate value.
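One common construction of such a disaggregation step is sketched below; this is an illustration of the general approach rather than the exact scheme implemented here. A simulated daily total is matched to historic days with similar totals, and a sampled day’s hourly profile is rescaled so the 24 hours sum to the simulated total (proportional rescaling is shown for simplicity; an additive adjustment, which can produce negative values as discussed below, is another option):

disag_day=function(dsim,dailypm,disagpm){ #dailypm: historic totals; disagpm: days x 24
  K=round(sqrt(length(dailypm))) #K=sqrt(n) thumb rule
  nbrs=order(abs(dailypm-dsim))[1:K] #historic days with the closest totals
  w=(1/(1:K))/sum(1/(1:K)) #k-nn kernel weights
  j=sample(nbrs,1,prob=w) #sample one neighbor day
  dsim*disagpm[j,]/sum(disagpm[j,]) #its diurnal profile, rescaled to sum to dsim
}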

For this case, the disaggregated data is the hourly PM2.5 and the aggregate data is the daily PM2.5. A simple best-fit autoregressive model was fit to the daily PM2.5, and the disaggregated hourly data were then generated from it. Once again, one full year of hourly data from 2001 was utilized in this model, and 100 simulations were run to test the ability of the model to reproduce basic statistics and PDFs of the original data.

[pic]

Figure 15: Boxplots of several basic statistics for each hour of the day computed from 20 simulations using the nonparametric seasonal k-nn model compared to the observed data plotted in red.

[pic]

Figure 16: Boxplots of the hourly probability density functions (PDFs) computed from 20 simulations using the nonparametric seasonal k-nn model compared to the observed PDFs plotted in red.

[pic]

Figure 17: Boxplots of several basic statistics for each hour of the day computed from 100 simulations using the nonparametric temporal disaggregation model compared to the observed data plotted in red.

[pic]

Figure 18: Boxplots of the hourly probability density functions (PDFs) computed from 100 simulations using the nonparametric temporal disaggregation model compared to the observed PDFs plotted in red.

Figure 17 contains boxplots of the statistics computed by the nonparametric temporal disaggregation model for each hour of the day in black along with the observed data in red. This model is able to capture the mean, median, standard deviation and all of the lag correlations, with the exception of the wrap-around correlations around midnight. This is to be expected since reproduction of the wrap-around correlations is not incorporated into the disaggregation model framework. The skew and min are not as well reproduced by the disaggregation model as they were with the seasonal k-nn model. This is likely because the disaggregation model has no non-negativity constraint: the simulations contain many negative values, which result in a different skew and minimum than the observed data, which has no negative values. The model produces between 5 and 10% negative values, which could have a significant effect on the skew. Options to reduce this effect would be to re-sample the negative values until a positive value is obtained, replace them with zero, or simply discard them.

Figure 18 shows boxplots of the PDFs for each hour in black generated in the simulations along with the observed PDF for each hour in red. Overall, the model is reproducing the distribution of the original data quite well. However, there are some discrepancies in the peak height, especially during the morning rush-hour period from 5:00 – 9:00 am.

PM2.5 Forecasting

The seasonal k-nn model was used to create a simple 3-hour advance forecast for PM2.5. This type of model could be helpful in providing advance warnings to the public during unexpected high-PM2.5 episodes, or it could be used in a clinical research setting where a prediction of upcoming PM2.5 exposures is desired. One example would be a clinical trial looking at relationships between PM2.5 and asthma in which invasive lung-function tests are desired during periods of high (or low) PM2.5. A forecast would give advance warning, allowing the researcher to conduct the tests during the narrow window of desired PM2.5 concentration.

The seasonal k-nn model was chosen over the temporal disaggregation model because it better reproduces the skew, min and max and better fits the observed hourly PDFs. Also, the seasonal k-nn model does not produce unrealistic negative values. The model analyzed previously used a lag-1 structure due to computation-time constraints. To create a 3-hour forecast, this model was expanded to include up to lag-3 in the k-nn framework. To improve the performance of the forecast, the model was restricted to daytime hours (6:00 am – 6:00 pm), weekdays (Mon – Fri), one season (Jan – Mar), and 3 years (2001-2003). The forecast period had the same restrictions but utilized observed data from January, 2004, which was not included in the model development.

For each hour in the forecast period, a 1-hour, 2-hour and 3-hour advance forecast was generated. This was repeated 100 times to create an ensemble of forecasts. The median value of the ensemble was then compared to the observed data at the appropriate hours to determine the model performance. Figure 19 contains scatterplots of the observed versus the forecast PM2.5 (mean removed) and the Pearson R2 value for each of the 1-hour, 2-hour and 3-hour advance forecasts.
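The ensemble summary and skill score amount to the following sketch (fcst is a hypothetical nsim x nhours matrix of simulated forecasts and obs the matching observations):

med=apply(fcst,2,median) #ensemble median forecast at each hour
r2=cor(med,obs)^2 #Pearson R2 (unchanged by mean removal)
plot(obs-mean(obs),med-mean(med),xlab="Observed PM2.5 (mean removed)",ylab="Forecast PM2.5 (mean removed)")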

[pic]

Figure 19: Forecast vs. observed PM2.5 (mean removed) for a 1, 2 and 3-hour advance forecast.

As expected, the performance of the forecast (as measured by the R2 value) decreases as the time horizon increases. For a 1-hour forecast, R2 = 0.57, whereas for a 3-hour forecast the value decreases to R2 = 0.12. Several observed high-PM2.5 events were not predicted by even the 1-hour advance forecast. If there was a sudden jump in PM2.5 on a time-frame of less than 1 hour, the k-nn model would not be able to predict it in a 1-hour forecast. There are more of these missed events in the 2-hour and 3-hour forecasts, as one would expect. This shows that even a 1-hour forecast is sometimes difficult to get right when sudden events are a possibility.

Many options exist for increasing the performance and utility of this forecast model. First, increasing the amount of data in the model should improve its forecasting ability; at the moment, 3 full years of data is all that is available. Second, increasing the lag structure of the model would allow for forecasts on a longer time horizon. Third, creating a separate model for each day of the week should improve performance. As it stands, I limited the model to the weekdays, recognizing the difference between weekdays and weekends. However, there is a significant difference among the weekdays themselves, as can be seen from a close inspection of Figure 5. Therefore, creating a separate forecast model for each day of the week should help reduce the variability stemming from the day of the week. The downside to this approach is that there would be less data to sample from for each model, but perhaps a longer season or even the entire year could be used for each of the weekday models.

Conclusions

The hourly TEOM data has proved to be a useful tool for analyzing diurnal patterns in PM2.5. The dominant peak in the periodogram for PM2.5 occurred at 2 cycles per day which corresponds to the twice daily rush-hour. This peak is most prominent during the weekdays and mostly disappears on Sunday. Therefore, it can be concluded that traffic has a significant impact on the overall PM2.5 in Denver.

Numerous correlations between daily PM2.5 and the meteorological parameters provided by the instrument (temperature and pressure) were investigated, and no obvious trends or patterns were observed. The hope was to establish a simple forecast model in which the well-established meteorological forecast could be used to predict daily average PM2.5; the disaggregation model could then be utilized to forecast the hourly PM2.5 levels. A model such as this could have many useful applications, but it seems the relationship between PM2.5 and meteorology is too complicated to be captured using only temperature and pressure. I anticipate that adding the inversion height would improve this forecast model, but unfortunately that data is not available on a daily basis for Denver, especially in the form of a forecast.

A different type of univariate forecast was attempted instead, using only historic hourly measurements of PM2.5. For this forecast, a nonparametric seasonal k-nn model was used. The ability of this model to reproduce basic hourly statistics and the overall PDFs was demonstrated with a lag-1 structure in the k-nn framework. For forecasting purposes, this model was expanded to include up to lag-3 and restricted to daytime weekdays in January, February and March of 2001-2003. A 3-hour forecast was then generated for daytime weekdays in January, 2004, and the results were compared to observed data to evaluate the model’s performance.

The statistic used to test performance was the Pearson R2 between the observed data and the forecast data. This was done for a 1-hour, 2-hour and 3-hour forecast. As expected, the scatter in the data increased as the forecast horizon increased. Overall, the model did a reasonable job of predicting PM2.5 one hour in advance, with an R2 of 0.57. Several options for improving the model were discussed; the most promising is to create separate models for each day of the week. This would help reduce day-of-week variability, which is substantial given the unique spectra and time series plots for each day of the week (see Figure 5).

If a threshold forecast for PM2.5 is desired instead of an exact forecast, the model is expected to perform much better. This might be the case when one wants to predict whether PM2.5 will exceed the standard or some threshold suspected to result in adverse health effects. The model created in this project could easily be utilized for this type of forecasting.
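For example, an exceedance probability can be read directly off the forecast ensemble (fcst as above; the 35 ug/m3 threshold is purely illustrative):

p_exceed=apply(fcst>35,2,mean) #fraction of ensemble members above the threshold
warn=p_exceed>0.5 #flag hours where an exceedance is more likely than not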

References

1) Cohen AJ, Anderson HR, Ostro B, Pandey KD, Krzyzanowski M, Kunzli N, Gutschmidt K, Pope A, Romieu I, Samet JM, Smith K. The global burden of disease due to outdoor air pollution. J Toxicol Environ Health A 68(13-14): 1301-1307, 2005.

2) Krewski D, Burnett RT, Goldberg M, Hoover K, Siemiatycki J, Abrahamowicz M, White W. Reanalysis of the Harvard Six Cities Study, Part I: Validation and replication. Inhal Toxicol 17(7-8): 335-342, 2005.

3) Krewski D, Burnett R, Jerrett M, Pope CA, Rainham D, Calle E, Thurston G, Thun M. Mortality and long-term exposure to ambient air pollution: Ongoing analyses based on the American Cancer Society cohort. J Toxicol Environ Health A 68(13-14): 1093-1109, 2005.

4) Rabinovitch N, Zhang LN, Murphy JR, Vedal S, Dutton SJ, Gelfand EW. Effects of wintertime ambient air pollutants on asthma exacerbations in urban minority children with moderate to severe disease. J Allergy Clin Immunol 114(5): 1131-1137, 2004.

5) Silkoff PE, Zhang LN, Dutton SJ, Langmack EL, Vedal S, Murphy J, Make B. Winter air pollution and disease parameters in advanced chronic obstructive pulmonary disease panels residing in Denver, Colorado. J Allergy Clin Immunol 115(2): 337-344, 2005.

Appendix A: R code used to create time series plots and histograms

#The following R code is used to create basic plots of the TEOM data

#The data are entered in "data.txt"

#clear registers, load libraries & source functions

rm(list=ls(all=TRUE)) #clear registers

#load data

rawdata=read.table("data.txt",colClasses=list("character","numeric","numeric","numeric","numeric","numeric","character","numeric","numeric","numeric"),sep="\t",fill=T,header=T,na.strings="NA",col.names=c("date","year","month","day","hour","weekday","inst_time","pm2.5","t","p"))

data=cbind(rawdata[,2],rawdata[,3],rawdata[,4],rawdata[,5],rawdata[,6],rawdata[,8],rawdata[,9],rawdata[,10])

#restrict temporal domain of data

data=data[data[,1]==2001,byrow=T] #restrict data by year (plot x-axes may need correcting)

# data=data[data[,2]==1,byrow=T] #restrict data by month (plot x-axes may need correcting)

# data=data[data[,5]==7,byrow=T] #restrict data by weekday (plot x-axes may need correcting)

#define variables

year=data[,1] #year

month=data[,2] #month

day=data[,3] #day

hour=data[,4] #hour

weekday=data[,5] #weekday (1-Monday)

hourlypm=data[,6] #pm2.5 concentration (ug/m3)

hourlytemp=data[,7] #temperature (deg C)

hourlypres=data[,8] #pressure (mmHg)

index=1:(length(year)) #data index

hourlyN=length(index) #number of hourly data points

M=24 #number of hours per aggregated daily measurement

#plotting parameters

bmp(filename="proj_1_%01d.bmp",width=800,height=800,pointsize = 12, bg = "white", res = NA)

xticks=c(1,2161,4345,6553,8761,10921,13105,15313,17521,19681,21865,24073,26281)

xlabels=c("Jan 01","Apr 01","Jul 01","Oct 01","Jan 02","Apr 02","Jul 02","Oct 02","Jan 03","Apr 03","Jul 03","Oct 03","Jan 04")

#adjust for negative values in hourly pm data

dailyN=hourlyN/24 #number of days of data

j=1

for(i in 1:dailyN){

adj=min(hourlypm[j:(j+23)],na.rm=T) #minimum pm value in day i

if(adj>0){adj=0} #make no adjustment if no negative values

hourlypm[j:(j+23)]=hourlypm[j:(j+23)]-adj #make adjustment to remove negative values

j=j+24

}

#create disaggregated data matrices

disagpm=matrix(hourlypm,ncol=24,byrow=T) #pm2.5 data matrix (ndaily x M)

disagtemp=matrix(hourlytemp,ncol=24,byrow=T) #temperature data matrix (ndaily x M)

disagpres=matrix(hourlypres,ncol=24,byrow=T) #pressure data matrix (ndaily x M)

#create daily totals (the sum of the hourly concentrations...to get an average, divide by 24)

dailyindex=index[hour==0] #daily data index

dailyN=length(dailyindex) #number of daily data points

dailyweekday=weekday[hour==0] #daily weekday

dailypm=apply(disagpm,1,sum,na.rm=T)

dailytemp=apply(disagtemp,1,sum,na.rm=T)

dailypres=apply(disagpres,1,sum,na.rm=T)

#plots of data

par(mfrow=c(3,1)) #set plotting parameters

plot(index,hourlypm,main="PM2.5 Hourly Time Series",xlab="",ylab="PM2.5 (ug/m3)",xaxt="n")

axis(1,at=xticks,labels=xlabels,tick=T,las=3)

plot(index,hourlytemp,main="Temperature Hourly Time Series",xlab="",ylab="Temp (C)",xaxt="n")

axis(1,at=xticks,labels=xlabels,tick=T,las=3)

plot(index,hourlypres,main="Pressure Hourly Time Series",xlab="",ylab="Press (mmHg)",xaxt="n")

axis(1,at=xticks,labels=xlabels,tick=T,las=3)

par(mfrow=c(3,1)) #set plotting parameters

plot(dailyindex,dailypm/24,main="PM2.5 Daily Time Series",xlab="",ylab="PM2.5 (ug/m3)",xaxt="n")

axis(1,at=xticks,labels=xlabels,tick=T,las=3)

plot(dailyindex,dailytemp/24,main="Temperature Daily Time Series",xlab="",ylab="Temp (C)",xaxt="n")

axis(1,at=xticks,labels=xlabels,tick=T,las=3)

plot(dailyindex,dailypres/24,main="Pressure Daily Time Series",xlab="",ylab="Press (mmHg)",xaxt="n")

axis(1,at=xticks,labels=xlabels,tick=T,las=3)

par(mfrow=c(4,1)) #set plotting parameters

hist(hourlypm,breaks=40,probability=T,main="PM2.5 Hourly Histogram",xlab="PM2.5 (ug/m3)",ylab="Probability")

box(bty="l")

hist(log(hourlypm),breaks=30,probability=T,main="Ln(PM2.5) Hourly Histogram",xlab="Ln(PM2.5 (ug/m3))",ylab="Probability")

box(bty="l")

m=mean(log(hourlypm[hourlypm>0]),na.rm=T)

s=sd(log(hourlypm[hourlypm>0]),na.rm=T)

x=seq(min(log(hourlypm[hourlypm>0])),max(log(hourlypm[hourlypm>0])),length=100)

n=dnorm(x,mean=m,sd=s)

lines(x,n,col="blue")

hist(hourlytemp,breaks=20,probability=T,main="Temperature Hourly Histogram",xlab="Temp (C)",ylab="Probability")

box(bty="l")

m=mean(hourlytemp,na.rm=T)

s=sd(hourlytemp,na.rm=T)

x=seq(min(hourlytemp),max(hourlytemp),length=100)

n=dnorm(x,mean=m,sd=s)

lines(x,n,col="blue")

hist(hourlypres,breaks=20,probability=T,main="Pressure Hourly Histogram",xlab="Press (mmHg)",ylab="Probability")

box(bty="l")

m=mean(hourlypres,na.rm=T)

s=sd(hourlypres,na.rm=T)

x=seq(min(hourlypres),max(hourlypres),length=100)

n=dnorm(x,mean=m,sd=s)

lines(x,n,col="blue")

dev.off() #turn off plotting

writeLines("created file(s) proj_1_x.bmp")

Appendix B: R code used to perform the spectral analysis

#The following R code is used to do a spectral analysis on the hourly PM2.5 TEOM data

#The data are entered in "data.txt"

#clear registers, load libraries & source functions

rm(list=ls(all=TRUE)) #clear registers

library(sm) #load library sm for pdf estimation

source("skew.r") #source skew function

#load data

rawdata=read.table("data.txt",colClasses=list("character","numeric","numeric","numeric","numeric","numeric","character","numeric","numeric","numeric"),sep="\t",fill=T,header=T,na.strings="NA",col.names=c("date","year","month","day","hour","weekday","inst_time","pm2.5","t","p"))

data=cbind(rawdata[,2],rawdata[,3],rawdata[,4],rawdata[,5],rawdata[,6],rawdata[,8],rawdata[,9],rawdata[,10])

#restrict temporal domain of data

data=data[data[,1]==2001,byrow=T] #restrict data by year (plot x-axes may need correcting)

# data=data[data[,2]==1,byrow=T] #restrict data by month (plot x-axes may need correcting)

# data=data[data[,5]==7,byrow=T] #restrict data by weekday (plot x-axes may need correcting)

#make an equal representation from each day of the week by cutting off excess few days at the end of the timeseries

rowmax=floor(length(data[,1])/(24*7))*(24*7)

data=data[1:rowmax,]

ndays=length(data[,1])/24

nweeks=ndays/7

#define variables

year=data[,1] #year

month=data[,2] #month

day=data[,3] #day

hour=data[,4] #hour

weekday=data[,5] #weekday (1-Monday)

hourlypm=data[,6] #pm2.5 concentration (ug/m3)

hourlytemp=data[,7] #temperature (deg C)

hourlypres=data[,8] #pressure (mmHg)

index=1:(length(year)) #data index

hourlyN=length(index) #number of hourly data points

M=24 #number of hours per aggregated daily measurement

#plotting parameters

bmp(filename="proj_2_%01d.bmp",width=800,height=800,pointsize = 12, bg = "white", res = NA)

xticks=c(1,2161,4345,6553,8761,10921,13105,15313,17521,19681,21865,24073,26281)

xlabels=c("Jan 01","Apr 01","Jul 01","Oct 01","Jan 02","Apr 02","Jul 02","Oct 02","Jan 03","Apr 03","Jul 03","Oct 03","Jan 04")

weekdaylabels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")

#adjust for negative values in hourly pm data

dailyN=hourlyN/24 #number of days of data

j=1

for(i in 1:dailyN){

adj=min(hourlypm[j:(j+23)],na.rm=T) #minimum pm value in day i

if(adj>0){adj=0} #make no adjustment if no negative values

hourlypm[j:(j+23)]=hourlypm[j:(j+23)]-adj #make adjustment to remove negative values

j=j+24

}

#create disaggregated data matrices

disagpm=matrix(hourlypm,ncol=24,byrow=T) #pm2.5 data matrix (ndaily x M)

disagtemp=matrix(hourlytemp,ncol=24,byrow=T) #temperature data matrix (ndaily x M)

disagpres=matrix(hourlypres,ncol=24,byrow=T) #pressure data matrix (ndaily x M)

#create daily totals (the sum of the hourly concentrations...to get an average, divide by 24)

dailyindex=index[hour==0] #daily data index

dailyN=length(dailyindex) #number of daily data points

dailyweekday=weekday[hour==0] #daily weekday

dailypm=apply(disagpm,1,sum,na.rm=T) #daily pm (sum, not average)

dailytemp=apply(disagtemp,1,sum,na.rm=T) #daily temperature (sum, not average)

dailypres=apply(disagpres,1,sum,na.rm=T) #daily pres (sum, not average)

#create daily normalized data

# zdisagpm=disagpm-apply(disagpm,1,mean) #daily mean removed disaggregated pm data

# zdisagtemp=disagtemp-apply(disagtemp,1,mean) #daily mean removed disaggregated temp data

# zdisagpres=disagpres-apply(disagpres,1,mean) #daily mean removed disaggregated pres data

# zhourlypm=array(t(zdisagpm)) #daily mean removed pm data

# zhourlytemp=array(t(zdisagtemp)) #daily mean removed temp data

# zhourlypres=array(t(zdisagpres)) #daily mean removed press data

#create weekday data

weekdaypm=matrix(NA,ncol=7,nrow=(rowmax/7)) #pm data split into weekday (empty for now)

disagweekdaypm=array(NA,c(nweeks,24,7)) #disaggregated pm weekday data (empty for now)

for(i in 1:7){

weekdaypm[,i]=hourlypm[weekday==i] #pm data split into weekday (column 1 = Mon...)

disagweekdaypm[,,i]=matrix(weekdaypm[,i],ncol=24,byrow=T) #disaggregated pm weekday data (week x hours x weekday)

}

#calculate periodogram I(wp) using acf with Parzen smoothing & plot with confidence intervals

#All PM, Temperature & Pressure

par(mfrow=c(3,1))

for(k in 1:3){

if(k==1){ #PM

x=hourlypm

title="PM2.5"

}

if(k==2){ #Temp

x=hourlytemp

title="Temperature"

}

if(k==3){ #Pres

x=hourlypres

title="Pressure"

}

N=length(x)

c=acf(x,type="covariance",lag.max=(N-1),plot=F)$acf #autocovariance

freqs=2*pi*(1:round((N/2)))/N #fourier frequencies

plotfreqs=freqs/(2*pi*(1/24)) #cycles per day for plots

p=1:round((N/2)) #index of frequencies (1...nfreqs)

t=1:N #index of data points (1...N)

nfreqs=length(freqs) #number of fourier frequencies

M=2*round(sqrt(N)) #thumb rule for smoothing

M=2*round(1.5*sqrt(N)) #slightly less smoothing than thumb rule

w=1:M #smoothing weight function

w[1:(M/2)]=1-(6*((w[1:(M/2)]/M)^2))+ (6*((w[1:(M/2)]/M)^3)) #Parzen smoothing weight function (0 for i > M)

w[((M/2)+1):M]=2*((1-(w[((M/2)+1):M]/M))^3)

w0=1

pgram=1:nfreqs #periodogram (empty for now)

for(i in 1:nfreqs){

pgram[i]=((w0*c[1])+(2*sum(w*c[2:(M+1)]*cos(freqs[i]*(1:M)))))/pi #periodogram

}

#Calculate parametric 95% confidence intervals

degf=3.709*N/M

lowfac=degf/qchisq(0.975,degf)

upfac=degf/qchisq(0.025,degf)

spfft.l=pgram*lowfac

spfft.u=pgram*upfac

#Compare to AR(1) model to test significance

nrep=100

z=x-mean(x)

pgramar=array(NA,dim=c(nfreqs,nrep))

for(j in 1:nrep){

arimaout=arima(z,order=c(1,0,0)) #AR(1) fit to data

arimasim=arima.sim(n=N,list(ar=c(arimaout$coef[1]),ma=0),sd=sqrt(arimaout$sigma2))+mean(x)

car=acf(arimasim,type="covariance",lag.max=(N-1),plot=F)$acf #autocovariance for AR(1) model

for(i in 1:nfreqs){

pgramar[i,j]=(car[1]+(2*sum(car[2:N]*cos(freqs[i]*(1:(N-1))))))/pi #calculate I(wp) periodogram

}

}

ar.u=rep(NA,nfreqs)

for(i in 1:nfreqs){

ar.u[i]=quantile(pgramar[i,],probs=.95)

}

#Create plot

yrange=range(pgram,spfft.u,spfft.l)

plot(plotfreqs,ar.u,ylim=yrange,xlab="frequency (cycles/day)",ylab="Periodogram",type="l",col="gray")

abline(v=1,lty=2,col="blue")

abline(v=2,lty=2,col="blue")

segments(plotfreqs,spfft.l,plotfreqs,spfft.u,col="red")

lines(plotfreqs,pgram,lwd=2)

title(title)

text(quantile(plotfreqs,0.65),0.90*max(pgram),paste("Parzen Smoothing, M =",M,"(N =",N,")"))

text(quantile(plotfreqs,0.65),0.75*max(pgram),"parametric 95% confidence intervals in red")

text(quantile(plotfreqs,0.65),0.60*max(pgram),"95th percentile of AR(1) spectra simulations in gray")

}

#Weekday PM

par(mfrow=c(4,2))

for(wd in 1:7){ #loop through weekdays

x=weekdaypm[,wd]

title=paste(weekdaylabels[wd],"PM2.5")

N=length(x)

c=acf(x,type="covariance",lag.max=(N-1),plot=F)$acf #autocovariance

freqs=2*pi*(1:round((N/2)))/N #fourier frequencies

plotfreqs=freqs/(2*pi*(1/24)) #cycles per day for plots

p=1:round((N/2)) #index of frequencies (1...nfreqs)

t=1:N #index of data points (1...N)

nfreqs=length(freqs) #number of fourier frequencies

M=2*round(sqrt(N)) #thumb rule for smoothing

M=2*round(3*sqrt(N)) #slightly less smoothing than thumb rule

w=1:M #smoothing weight function

w[1:(M/2)]=1-(6*((w[1:(M/2)]/M)^2))+ (6*((w[1:(M/2)]/M)^3)) #Parzen smoothing weight function (0 for i > M)

w[((M/2)+1):M]=2*((1-(w[((M/2)+1):M]/M))^3)

w0=1

pgram=1:nfreqs

for(i in 1:nfreqs){

pgram[i]=((w0*c[1])+(2*sum(w*c[2:(M+1)]*cos(freqs[i]*(1:M)))))/pi

}

#Calculate parametric 95% confidence intervals

degf=3.709*N/M

lowfac=degf/qchisq(0.975,degf)

upfac=degf/qchisq(0.025,degf)

spfft.l=pgram*lowfac

spfft.u=pgram*upfac

#Create plot

yrange=range(pgram,spfft.u,spfft.l)

xrange=c(0,6)

plot(plotfreqs,pgram,xlim=xrange,ylim=yrange,xlab="frequency (cycles/day)",ylab="Periodogram",type="l")

abline(v=1,lty=2,col="blue")

abline(v=2,lty=2,col="blue")

segments(plotfreqs,spfft.l,plotfreqs,spfft.u,col="red")

lines(plotfreqs,pgram,lwd=2)

title(title)

}

#Make weekday diurnal pattern boxplots

par(mfrow=c(4,2))

for(wd in 1:7){ #loop through weekdays

temp=disagweekdaypm[,,wd]

title=paste(weekdaylabels[wd],"PM2.5")

n=length(temp[,1])

bxout=boxplot(split(temp,col(temp)), plot=F, cex=1.0)

bxout$names=c("0:00","1:00","2:00","3:00","4:00","5:00","6:00","7:00","8:00","9:00","10:00","11:00","12:00","13:00","14:00","15:00","16:00","17:00","18:00","19:00","20:00","21:00","22:00","23:00")

bx=bxp(bxout,ylim=c(0,45),xlab="",ylab="PM2.5(ug/m3)",cex=1.00,las=3,outline=F)

title(title)

lines(1:24,apply(disagweekdaypm[,,wd],2,median),lwd=2,col="black")

}

dev.off() #turn off plotting

writeLines("created file(s) proj_2_x.bmp")

Appendix C: R code used to perform the correlation analysis

#The following R code is used to investigate relationships between daily PM2.5 TEOM data & meteorology data

#The data are entered in "data.txt"

#clear registers, load libraries & source functions

rm(list=ls(all=TRUE)) #clear registers

library(locfit) #load library locfit for local polynomials fitting

#load data

rawdata=read.table("data.txt",colClasses=list("character","numeric","numeric","numeric","numeric","numeric","character","numeric","numeric","numeric"),sep="\t",fill=T,header=T,na.strings="NA",col.names=c("date","year","month","day","hour","weekday","inst_time","pm2.5","t","p"))

data=cbind(rawdata[,2],rawdata[,3],rawdata[,4],rawdata[,5],rawdata[,6],rawdata[,8],rawdata[,9],rawdata[,10])

#restrict temporal domain of data

data=data[data[,1]==2001|data[,1]==2002|data[,1]==2003,byrow=T] #restrict data by year (plot x-axes may need correcting)

# data=data[data[,2]>=1&data[,2]<=12,byrow=T] #restrict data by month (plot x-axes may need correcting)

# data=data[data[,5]==7,byrow=T] #restrict data by weekday (plot x-axes may need correcting)

#define variables

year=data[,1] #year

month=data[,2] #month

day=data[,3] #day

hour=data[,4] #hour

weekday=data[,5] #weekday (1-Monday)

hourlypm=data[,6] #pm2.5 concentration (ug/m3)

hourlytemp=data[,7] #temperature (deg C)

hourlypres=data[,8] #pressure (mmHg)

index=1:(length(year)) #data index

hourlyN=length(index) #number of hourly data points

M=24 #number of hours per aggregated daily measurement

#adjust for negative values in hourly pm data

dailyN=hourlyN/24 #number of days of data

j=1

for(i in 1:dailyN){

adj=min(hourlypm[j:(j+23)],na.rm=T) #minimum pm value in day i

if(adj>0){adj=0} #make no adjustment if no negative values

hourlypm[j:(j+23)]=hourlypm[j:(j+23)]-adj #make adjustment to remove negative values

j=j+24

}

#remove Hayman fire data by using average PM level on day before & after

for(i in 1:24){

j=i+12576

hourlypm[j]=mean(c(hourlypm[(j-24)],hourlypm[(j+24)]))

j=i+12792

hourlypm[j]=mean(c(hourlypm[(j-24)],hourlypm[(j+24)]))

}

#create disaggregated data matrices

disagpm=matrix(hourlypm,ncol=24,byrow=T) #pm2.5 data matrix (ndaily x M)

disagtemp=matrix(hourlytemp,ncol=24,byrow=T) #temperature data matrix (ndaily x M)

disagpres=matrix(hourlypres,ncol=24,byrow=T) #pressure data matrix (ndaily x M)

#create daily totals (the sum of the hourly concentrations...to get an average, divide by 24)

dailyindex=index[hour==0] #daily data index

dailyN=length(dailyindex) #number of daily data points

dailyweekday=weekday[hour==0] #daily weekday

dailypm=apply(disagpm,1,sum,na.rm=T)

dailytemp=apply(disagtemp,1,sum,na.rm=T)

dailypres=apply(disagpres,1,sum,na.rm=T)

#plot correlations

bmp(filename="proj_3a_%01d.bmp",width=900,height=300,pointsize = 12, bg = "white", res = NA)

par(mfrow=c(1,3))

title="Scatterplot with Locfit" #daily average PM2.5 vs temperature (all data)

xvar=dailytemp/24

xtitle="Ave Daily Temperature (deg C)"

yvar=dailypm/24

ytitle="Ave Daily PM2.5 (ug/m3)"

plot(xvar,yvar,xlab=xtitle,ylab=ytitle,main=title)

lines(locfit(yvar~xvar),col="red")

title="Scatterplot with Locfit" #daily average PM2.5 vs pressure (all data)

xvar=dailypres/24

xtitle="Ave Daily Pressure (mmHg)"

yvar=dailypm/24

ytitle="Ave Daily PM2.5 (ug/m3)"

plot(xvar,yvar,xlab=xtitle,ylab=ytitle,main=title)

lines(locfit(yvar~xvar),col="red")

title="Scatterplot with Locfit" #daily average PM2.5 vs pressure x temperature (all data)

xvar=(dailypres-mean(dailypres))/24*(dailytemp-mean(dailytemp))/24

xtitle="Mean Removed Pressure x Temperature"

yvar=(dailypm-mean(dailypm))/24

ytitle="Mean Removed PM2.5"

plot(xvar,yvar,xlab=xtitle,ylab=ytitle,main=title)

lines(locfit(yvar~xvar),col="red")

dev.off()

bmp(filename="proj_3b_%01d.bmp",width=600,height=600,pointsize = 12, bg = "white", res = NA)

par(mfrow=c(2,2))

title=paste("Scatterplot with Locfit") #daily PM2.5 vs daytime temperature average (all data)

xvar=hourlytemp[hour>=6&hour<18] #daytime hours (6:00 am - 6:00 pm)

xvar=apply(matrix(xvar,ncol=12,byrow=T),1,mean) #daily daytime average temperature

xtitle="Ave Daytime Temperature (deg C)"

yvar=dailypm/24

ytitle="Ave Daily PM2.5 (ug/m3)"

plot(xvar,yvar,xlab=xtitle,ylab=ytitle,main=title)

lines(locfit(yvar~xvar),col="red")

#(the corresponding nighttime temperature and daytime/nighttime pressure panels

#follow the same pattern, with nighttime hours selected by hour<6|hour>=18)

dev.off()

writeLines("created file(s) proj_3a_x.bmp & proj_3b_x.bmp")

Appendix D: R code used to fit an ARMA model to the daily PM2.5 data

#The following R code is used to fit ARMA models of varying order to the daily PM2.5 TEOM data

#The data are entered in "data.txt"

#clear registers, load libraries & source functions

rm(list=ls(all=TRUE)) #clear registers

#load data

rawdata=read.table("data.txt",colClasses=list("character","numeric","numeric","numeric","numeric","numeric","character","numeric","numeric","numeric"),sep="\t",fill=T,header=T,na.strings="NA",col.names=c("date","year","month","day","hour","weekday","inst_time","pm2.5","t","p"))

data=cbind(rawdata[,2],rawdata[,3],rawdata[,4],rawdata[,5],rawdata[,6],rawdata[,8],rawdata[,9],rawdata[,10])

#restrict temporal domain of data

data=data[data[,1]==2001,byrow=T] #restrict data by year (plot x-axes may need correcting)

#define variables

year=data[,1] #year

month=data[,2] #month

day=data[,3] #day

hour=data[,4] #hour

weekday=data[,5] #weekday (1-Monday)

hourlypm=data[,6] #pm2.5 concentration (ug/m3)

hourlytemp=data[,7] #temperature (deg C)

hourlypres=data[,8] #pressure (mmHg)

index=1:(length(year)) #data index

hourlyN=length(index) #number of hourly data points

M=24 #number of hours per aggregated daily measurement

#plotting parameters

bmp(filename="proj_4_%01d.bmp",width=800,height=800,pointsize = 12, bg = "white", res = NA)

xticks=c(1,2161,4345,6553,8761,10921,13105,15313,17521,19681,21865,24073,26281)

xlabels=c("Jan 01","Apr 01","Jul 01","Oct 01","Jan 02","Apr 02","Jul 02","Oct 02","Jan 03","Apr 03","Jul 03","Oct 03","Jan 04")

#adjust for negative values in hourly pm data

dailyN=hourlyN/24 #number of days of data

j=1

for(i in 1:dailyN){

adj=min(hourlypm[j:(j+23)],na.rm=T) #minimum pm value in day i

if(adj>0){adj=0} #make no adjustment if no negative values

hourlypm[j:(j+23)]=hourlypm[j:(j+23)]-adj #make adjustment to remove negative values

j=j+24

}

#create disaggregated data matrices

disagpm=matrix(hourlypm,ncol=24,byrow=T) #pm2.5 data matrix (ndaily x M)

disagtemp=matrix(hourlytemp,ncol=24,byrow=T) #temperature data matrix (ndaily x M)

disagpres=matrix(hourlypres,ncol=24,byrow=T) #pressure data matrix (ndaily x M)

#create daily totals (the sum of the hourly concentrations...to get an average, divide by 24)

dailyindex=index[hour==0] #daily data index

dailyN=length(dailyindex) #number of daily data points

dailyweekday=weekday[hour==0] #daily weekday

dailypm=apply(disagpm,1,sum,na.rm=T)

dailytemp=apply(disagtemp,1,sum,na.rm=T)

dailypres=apply(disagpres,1,sum,na.rm=T)

#Loop through different orders in ARIMA model & calculate AIC & GCV

z=dailypm-mean(dailypm) #mean removed daily pm

pmax=14

qmax=14

modelmatrix=matrix(NA,ncol=3,nrow=((pmax)*(qmax))) #matrix of model parameters & AIC results (empty for now)

i=0

for(p in 5:pmax){ #loop through AR(p)

for(q in 1:qmax){ #loop through MA(q)

i=i+1

m=p+q+2 #number of model parameters

arimaout=arima(z,order=c(p,0,q)) #ARIMA calculation

aic=arimaout$aic #AIC calculated in R

modelmatrix[i,]=c(p,q,aic) #matrix of model parameters & AIC results

}

}

modelmatrix=t(data.frame(t(modelmatrix),row.names=c("p","q","aic")))

#Pick optimum model

bestp=modelmatrix[order(modelmatrix[,3])[1],1] #best p for AR

bestq=modelmatrix[order(modelmatrix[,3])[1],2] #best q for MA

# bestp=7

# bestq=3

#Re-run optimum model

arimaout=arima(z,order=c(bestp,0,bestq))

aic=arimaout$aic

res=arimaout$residuals #residuals

coef=arimaout$coef #coefficients

errvar=var(res) #error variance

m=bestp+bestq+2 #number of model parameters

#Plot data & optimum model

par(mfrow=c(1,1)) #set plotting parameters

plot(dailyindex,dailypm,main="PM2.5 Daily Time Series",xlab="",ylab="PM2.5 (ug/m3)",xaxt="n")

axis(1,at=xticks,labels=xlabels,tick=T,las=3)

lines(dailyindex,dailypm-res)

text(mean(dailyindex),0.9*max(dailypm),paste("ARMA(",bestp,",",bestq,")"))

dev.off() #turn off plotting

writeLines("created file(s) proj_4_x.bmp")

Appendix E: R code used for the nonparametric seasonal k-nn model

#The following R code is used to run a "seasonal" nonparametric k-nn model on hourly PM2.5 TEOM data

#The data are entered in "data.txt"

#clear registers, load libraries & source functions

rm(list=ls(all=TRUE)) #clear registers

library(sm) #load library sm for pdf estimation

source("skew.r") #source skew function

#load data

rawdata=read.table("data.txt",colClasses=list("character","numeric","numeric","numeric","numeric","numeric","character","numeric","numeric","numeric"),sep="\t",fill=T,header=T,na.strings="NA",col.names=c("date","year","month","day","hour","weekday","inst_time","pm2.5","t","p"))

data=cbind(rawdata[,2],rawdata[,3],rawdata[,4],rawdata[,5],rawdata[,6],rawdata[,8],rawdata[,9],rawdata[,10])

#restrict temporal domain of data

data=data[data[,1]==2001,byrow=T] #restrict data by year (plot x-axes may need correcting)

# data=data[data[,2]>=1&data[,2]<=12,byrow=T] #restrict data by month (plot x-axes may need correcting)

# data=data[data[,5]==7,byrow=T] #restrict data by weekday (plot x-axes may need correcting)

#define variables

year=data[,1] #year

month=data[,2] #month

day=data[,3] #day

hour=data[,4] #hour

weekday=data[,5] #weekday (1-Monday)

hourlypm=data[,6] #pm2.5 concentration (ug/m3)

hourlytemp=data[,7] #temperature (deg C)

hourlypres=data[,8] #pressure (mmHg)

index=1:(length(year)) #data index

hourlyN=length(index) #number of hourly data points

M=24 #number of hours per aggregated daily measurement

#plotting parameters

bmp(filename="proj_5_%01d.bmp",width=800,height=800,pointsize = 12, bg = "white", res = NA)

#adjust for negative values in hourly pm data

dailyN=hourlyN/24 #number of days of data

j=1

for(i in 1:dailyN){

adj=min(hourlypm[j:(j+23)],na.rm=T) #minimum pm value in day i

if(adj>0){adj=0} #make no adjustment if no negative values

hourlypm[j:(j+23)]=hourlypm[j:(j+23)]-adj #make adjustment to remove negative values

j=j+24

}

#create disaggregated data matrices

disagpm=matrix(hourlypm,ncol=24,byrow=T) #pm2.5 data matrix (ndaily x M)

disagtemp=matrix(hourlytemp,ncol=24,byrow=T) #temperature data matrix (ndaily x M)

disagpres=matrix(hourlypres,ncol=24,byrow=T) #pressure data matrix (ndaily x M)

#create daily totals (the sum of the hourly concentrations...to get an average, divide by 24)

dailyindex=index[hour==0] #daily data index

dailyN=length(dailyindex) #number of daily data points

dailyweekday=weekday[hour==0] #daily weekday

dailypm=apply(disagpm,1,sum,na.rm=T)

dailytemp=apply(disagtemp,1,sum,na.rm=T)

dailypres=apply(disagpres,1,sum,na.rm=T)

#Establish parameters for simulations & statistical output

nsim=100 #number of simulations (must be > 1)

lag=4 #lag to investigate for k-nn

hr_mean=array(NA,c(nsim+1,24)) #hourly stats; first row reserved for the historical data (empty for now)

hr_median=array(NA,c(nsim+1,24))

hr_stdev=array(NA,c(nsim+1,24))

hr_skew=array(NA,c(nsim+1,24))

hr_max=array(NA,c(nsim+1,24))

hr_min=array(NA,c(nsim+1,24))

hr_cor1=array(NA,c(nsim+1,24))

hr_cor2=array(NA,c(nsim+1,24))

hr_cor3=array(NA,c(nsim+1,24))

hr_cor4=array(NA,c(nsim+1,24))

da_mean=array(NA,(nsim+1)) #daily stats; first row reserved for the historical data (empty for now)

da_median=array(NA,(nsim+1))

da_stdev=array(NA,(nsim+1))

da_skew=array(NA,(nsim+1))

da_max=array(NA,(nsim+1))

da_min=array(NA,(nsim+1))

da_cor1=array(NA,(nsim+1))

da_cor2=array(NA,(nsim+1))

da_cor3=array(NA,(nsim+1))

da_cor4=array(NA,(nsim+1))

oa_mean=array(NA,(nsim+1)) #overall stats; first row reserved for the historical data (empty for now)

oa_median=array(NA,(nsim+1))

oa_stdev=array(NA,(nsim+1))

oa_skew=array(NA,(nsim+1))

oa_max=array(NA,(nsim+1))

oa_min=array(NA,(nsim+1))

oa_cor1=array(NA,(nsim+1))

oa_cor2=array(NA,(nsim+1))

oa_cor3=array(NA,(nsim+1))

oa_cor4=array(NA,(nsim+1))

alldisagpm=array(NA,c((nsim+1),dim(disagpm))) #hourly pdfs (empty for now)

#Calculate observed data statistics and place them in the first row of the stats matrices

k1=1 #enter values into first row

hr_mean[k1,]=apply(disagpm,2,mean,na.rm=T) #hourly stats

hr_median[k1,]=apply(disagpm,2,median,na.rm=T)

hr_stdev[k1,]=apply(disagpm,2,sd,na.rm=T)

hr_skew[k1,]=apply(disagpm,2,skew)

hr_max[k1,]=apply(disagpm,2,max,na.rm=T)

hr_min[k1,]=apply(disagpm,2,min,na.rm=T)

hr_cor1[k1,1]=cor(disagpm[2:dailyN,1],disagpm[1:(dailyN-1),24],use="complete.obs")

for(k in 2:24){hr_cor1[k1,k]=cor(disagpm[,k],disagpm[,(k-1)],use="complete.obs")}

hr_cor2[k1,1]=cor(disagpm[2:dailyN,1],disagpm[1:(dailyN-1),23],use="complete.obs")

hr_cor2[k1,2]=cor(disagpm[2:dailyN,2],disagpm[1:(dailyN-1),24],use="complete.obs")

for(k in 3:24){hr_cor2[k1,k]=cor(disagpm[,k],disagpm[,(k-2)],use="complete.obs")}

hr_cor3[k1,1]=cor(disagpm[2:dailyN,1],disagpm[1:(dailyN-1),22],use="complete.obs")

hr_cor3[k1,2]=cor(disagpm[2:dailyN,2],disagpm[1:(dailyN-1),23],use="complete.obs")

hr_cor3[k1,3]=cor(disagpm[2:dailyN,3],disagpm[1:(dailyN-1),24],use="complete.obs")

for(k in 4:24){hr_cor3[k1,k]=cor(disagpm[,k],disagpm[,(k-3)],use="complete.obs")}

hr_cor4[k1,1]=cor(disagpm[2:dailyN,1],disagpm[1:(dailyN-1),21],use="complete.obs")

hr_cor4[k1,2]=cor(disagpm[2:dailyN,2],disagpm[1:(dailyN-1),22],use="complete.obs")

hr_cor4[k1,3]=cor(disagpm[2:dailyN,3],disagpm[1:(dailyN-1),23],use="complete.obs")

hr_cor4[k1,4]=cor(disagpm[2:dailyN,4],disagpm[1:(dailyN-1),24],use="complete.obs")

for(k in 5:24){hr_cor4[k1,k]=cor(disagpm[,k],disagpm[,(k-4)],use="complete.obs")}

da_mean[k1]=mean(dailypm,na.rm=T) #daily stats

da_median[k1]=median(dailypm,na.rm=T)

da_stdev[k1]=sd(dailypm,na.rm=T)

da_skew[k1]=skew(dailypm)

da_max[k1]=max(dailypm,na.rm=T)

da_min[k1]=min(dailypm,na.rm=T)

da_cor1[k1]=cor(dailypm[2:dailyN],dailypm[1:(dailyN-1)],use="complete.obs")

da_cor2[k1]=cor(dailypm[3:dailyN],dailypm[1:(dailyN-2)],use="complete.obs")

da_cor3[k1]=cor(dailypm[4:dailyN],dailypm[1:(dailyN-3)],use="complete.obs")

da_cor4[k1]=cor(dailypm[5:dailyN],dailypm[1:(dailyN-4)],use="complete.obs")

oa_mean[k1]=mean(hourlypm,na.rm=T) #overall stats

oa_median[k1]=median(hourlypm,na.rm=T)

oa_stdev[k1]=sd(hourlypm,na.rm=T)

oa_skew[k1]=skew(hourlypm)

oa_max[k1]=max(hourlypm,na.rm=T)

oa_min[k1]=min(hourlypm,na.rm=T)

oa_cor1[k1]=cor(hourlypm[2:hourlyN],hourlypm[1:(hourlyN-1)],use="complete.obs")

oa_cor2[k1]=cor(hourlypm[3:hourlyN],hourlypm[1:(hourlyN-2)],use="complete.obs")

oa_cor3[k1]=cor(hourlypm[4:hourlyN],hourlypm[1:(hourlyN-3)],use="complete.obs")

oa_cor4[k1]=cor(hourlypm[5:hourlyN],hourlypm[1:(hourlyN-4)],use="complete.obs")

alldisagpm[k1,,]=disagpm #hourly pdfs

#Generate lagged data for each entry

z=hourlypm-mean(hourlypm) #mean removed time series data

ndata=dailyN-ceiling(lag/24) #number of available data points (including lags) to sample from

xdata=array(NA,dim=c(ndata,lag,24)) #lagged X array (ndata days x lag x hours)

ydata=array(NA,dim=c(ndata,1,24)) #lagged Y array (ndata days x 1 x hours)

for(hr in 1:24){ #loop through hours

for(dy in 1:ndata){ #loop through days of data for that hour

tmax=(dy+ceiling(lag/24)-1)*24+hr

tmin=tmax-lag

xdata[dy,,hr]=z[tmin:(tmax-1)] #lagged X matrix (ndata days, t-lag,...,t-1, hour)

ydata[dy,,hr]=z[tmax] #lagged Y matrix (ndata days, t, hour)

}

}

#Determine number of nearest neighbors & establish weight function

K=round(sqrt(ndata)) #number of nearest neighbors (K=sqrt(N))

W=1:K

W=1/W

W=W/sum(W)

W=cumsum(W) #weight function

#Create simulations

zsim=rep(NA,length(z)) #simulated time series (empty for now)

hr=0 #hour indicator

disagpmsim=matrix(NA,nrow=dim(disagpm)[1],ncol=dim(disagpm)[2]) #matrix for disaggregated simulation data (empty for now)

for(isim in 1:nsim){ #loop through simulations

rday=round(runif(1,(dailyN-ndata+1),dailyN)) #pick a random starting day

rhour=(rday-1)*24+1 #0:00 position in the time series z for the random starting day

zp=z[(rhour-lag):(rhour-1)] #randomly chosen 0:00 from z matrix

for(j in 1:hourlyN){ #loop through hours to create simulated time series

hr=hr+1 #increase the hour indicator by 1 hour

if(hr>24){hr=1} #reset hour indicator back to 1 (0:00) after 24 (23:00)

ztemp=rbind(zp,as.matrix(xdata[,,hr])) #combine zp with the values in xdata corresponding to the appropriate hour

zdist=order(as.matrix(dist(ztemp))[1,2:(ndata+1)]) #calculate the distances to zp

temp1=runif(1,0,1) #uniform random number between 0 and 1 to select neighbor

temp2=c(temp1,W) #place uniform random number in weight matrix

temp3=rank(temp2) #pick out rank of uniform random number

jj=zdist[temp3[1]] #pick out distance of randomly selected neighbor

zsim[j]=ydata[jj,,hr] #enter value of selected y-value into simulation matrix

if(lag==1){zp=zsim[j]} #next value of zp for lag=1

if(lag>1){zp=c(zp[2:lag],zsim[j])} #next value of zp for lag>1

}

hourlypmsim=zsim+mean(hourlypm) #add back in mean to create simulation time series

disagpmsim=t(matrix(hourlypmsim,nrow=24)) #disaggregated simulation matrix

dailypmsim=rowSums(disagpmsim) #daily pm simulation

k1=isim+1

hr_mean[k1,]=apply(disagpmsim,2,mean,na.rm=T) #hourly stats

hr_median[k1,]=apply(disagpmsim,2,median,na.rm=T)

hr_stdev[k1,]=apply(disagpmsim,2,sd,na.rm=T)

hr_skew[k1,]=apply(disagpmsim,2,skew)

hr_max[k1,]=apply(disagpmsim,2,max,na.rm=T)

hr_min[k1,]=apply(disagpmsim,2,min,na.rm=T)

hr_cor1[k1,1]=cor(disagpmsim[2:dailyN,1],disagpmsim[1:(dailyN-1),24],use="complete.obs")

for(k in 2:24){hr_cor1[k1,k]=cor(disagpmsim[,k],disagpmsim[,(k-1)],use="complete.obs")}

hr_cor2[k1,1]=cor(disagpmsim[2:dailyN,1],disagpmsim[1:(dailyN-1),23],use="complete.obs")

hr_cor2[k1,2]=cor(disagpmsim[2:dailyN,2],disagpmsim[1:(dailyN-1),24],use="complete.obs")

for(k in 3:24){hr_cor2[k1,k]=cor(disagpmsim[,k],disagpmsim[,(k-2)],use="complete.obs")}

hr_cor3[k1,1]=cor(disagpmsim[2:dailyN,1],disagpmsim[1:(dailyN-1),22],use="complete.obs")

hr_cor3[k1,2]=cor(disagpmsim[2:dailyN,2],disagpmsim[1:(dailyN-1),23],use="complete.obs")

hr_cor3[k1,3]=cor(disagpmsim[2:dailyN,3],disagpmsim[1:(dailyN-1),24],use="complete.obs")

for(k in 4:24){hr_cor3[k1,k]=cor(disagpmsim[,k],disagpmsim[,(k-3)],use="complete.obs")}

hr_cor4[k1,1]=cor(disagpmsim[2:dailyN,1],disagpmsim[1:(dailyN-1),21],use="complete.obs")

hr_cor4[k1,2]=cor(disagpmsim[2:dailyN,2],disagpmsim[1:(dailyN-1),22],use="complete.obs")

hr_cor4[k1,3]=cor(disagpmsim[2:dailyN,3],disagpmsim[1:(dailyN-1),23],use="complete.obs")

hr_cor4[k1,4]=cor(disagpmsim[2:dailyN,4],disagpmsim[1:(dailyN-1),24],use="complete.obs")

for(k in 5:24){hr_cor4[k1,k]=cor(disagpmsim[,k],disagpmsim[,(k-4)],use="complete.obs")}

da_mean[k1]=mean(dailypmsim,na.rm=T) #daily stats

da_median[k1]=median(dailypmsim,na.rm=T)

da_stdev[k1]=sd(dailypmsim,na.rm=T)

da_skew[k1]=skew(dailypmsim)

da_max[k1]=max(dailypmsim,na.rm=T)

da_min[k1]=min(dailypmsim,na.rm=T)

da_cor1[k1]=cor(dailypmsim[2:dailyN],dailypmsim[1:(dailyN-1)],use="complete.obs")

da_cor2[k1]=cor(dailypmsim[3:dailyN],dailypmsim[1:(dailyN-2)],use="complete.obs")

da_cor3[k1]=cor(dailypmsim[4:dailyN],dailypmsim[1:(dailyN-3)],use="complete.obs")

da_cor4[k1]=cor(dailypmsim[5:dailyN],dailypmsim[1:(dailyN-4)],use="complete.obs")

oa_mean[k1]=mean(hourlypmsim,na.rm=T) #overall stats

oa_median[k1]=median(hourlypmsim,na.rm=T)

oa_stdev[k1]=sd(hourlypmsim,na.rm=T)

oa_skew[k1]=skew(hourlypmsim)

oa_max[k1]=max(hourlypmsim,na.rm=T)

oa_min[k1]=min(hourlypmsim,na.rm=T)

oa_cor1[k1]=cor(hourlypmsim[2:hourlyN],hourlypmsim[1:(hourlyN-1)],use="complete.obs")

oa_cor2[k1]=cor(hourlypmsim[3:hourlyN],hourlypmsim[1:(hourlyN-2)],use="complete.obs")

oa_cor3[k1]=cor(hourlypmsim[4:hourlyN],hourlypmsim[1:(hourlyN-3)],use="complete.obs")

oa_cor4[k1]=cor(hourlypmsim[5:hourlyN],hourlypmsim[1:(hourlyN-4)],use="complete.obs")

alldisagpm[k1,,]=disagpmsim #hourly pdfs

}

#Boxplots of hourly, daily & overall statistics

par(mfrow=c(5,2)) #set plotting parameters

for(stat in 1:10){ #loop through stats

if(stat==1){temp=cbind(hr_mean,da_mean/M,oa_mean)}

if(stat==2){temp=cbind(hr_median,da_median/M,oa_median)}

if(stat==3){temp=cbind(hr_stdev,da_stdev/M,oa_stdev)}

if(stat==4){temp=cbind(hr_skew,da_skew,oa_skew)}

if(stat==5){temp=cbind(hr_max,da_max/M,oa_max)}

if(stat==6){temp=cbind(hr_min,da_min/M,oa_min)}

if(stat==7){temp=cbind(hr_cor1,da_cor1,oa_cor1)}

if(stat==8){temp=cbind(hr_cor2,da_cor2,oa_cor2)}

if(stat==9){temp=cbind(hr_cor3,da_cor3,oa_cor3)}

if(stat==10){temp=cbind(hr_cor4,da_cor4,oa_cor4)}

title=paste("PM2.5",c("Mean","Median","Standard Deviation","Skew","Max","Min","Lag-1 Correlation","Lag-2 Correlation","Lag-3 Correlation","Lag-4 Correlation"))[stat]

n=length(temp[,1])

tempsim=temp[2:n,]

bxout=boxplot(split(tempsim,col(tempsim)), plot=F, cex=1.0)

bxout$names=c("0:00","1:00","2:00","3:00","4:00","5:00","6:00","7:00","8:00","9:00","10:00","11:00","12:00","13:00","14:00","15:00","16:00","17:00","18:00","19:00","20:00","21:00","22:00","23:00","Daily","Overall")

bx=bxp(bxout,ylim=range(temp,na.rm=T),xlab="",ylab="",cex=1.00,las=3)

points(bx,temp[1,],pch=16,col="red")

lines(bx[1:24],temp[1,1:24],pch=16,col="red")

title(title)

}

dev.off() #turn off plotting

writeLines("created file(s) proj_5_x.bmp")

Appendix F: R code used for the nonparametric temporal disaggregation model

#The following R code is used to run a nonparametric temporal disaggregation model on hourly PM2.5 TEOM data

#The data are entered in "data.txt"

#clear registers, load libraries & source functions

rm(list=ls(all=TRUE)) #clear registers

library(sm) #load library sm for pdf estimation

source("skew.r") #source skew function

#load data

rawdata=read.table("data.txt",colClasses=list("character","numeric","numeric","numeric","numeric","numeric","character","numeric","numeric","numeric"),sep="\t",fill=T,header=T,na.strings="NA",col.names=c("date","year","month","day","hour","weekday","inst_time","pm2.5","t","p"))

data=cbind(rawdata[,2],rawdata[,3],rawdata[,4],rawdata[,5],rawdata[,6],rawdata[,8],rawdata[,9],rawdata[,10])

#restrict temporal domain of data

data=data[data[,1]==2001,] #restrict data by year (plot x-axes may need correcting)

# data=data[data[,2]==1,] #restrict data by month (plot x-axes may need correcting)

# data=data[data[,5]==7,] #restrict data by weekday (plot x-axes may need correcting)

#define variables

year=data[,1] #year

month=data[,2] #month

day=data[,3] #day

hour=data[,4] #hour

weekday=data[,5] #weekday (1-Monday)

hourlypm=data[,6] #pm2.5 concentration (ug/m3)

hourlytemp=data[,7] #temperature (deg C)

hourlypres=data[,8] #pressure (mmHg)

index=1:(length(year)) #data index

hourlyN=length(index) #number of hourly data points

M=24 #number of hours per aggregated daily measurement

#plotting parameters

# bmp(filename="proj_6_%01d.bmp",width=800,height=800,pointsize = 12, bg = "white", res = NA)

xticks=c(1,2161,4345,6553,8761,10921,13105,15313,17521,19681,21865,24073,26281)

xlabels=c("Jan 01","Apr 01","Jul 01","Oct 01","Jan 02","Apr 02","Jul 02","Oct 02","Jan 03","Apr 03","Jul 03","Oct 03","Jan 04")

#adjust for negative values in hourly pm data

dailyN=hourlyN/24 #number of days of data

j=1

for(i in 1:dailyN){

adj=min(hourlypm[j:(j+23)],na.rm=T) #minimum pm value in day i

if(adj>0){adj=0} #make no adjustment if no negative values

hourlypm[j:(j+23)]=hourlypm[j:(j+23)]-adj #make adjustment to remove negative values

j=j+24

}
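#the loop above shifts each day so that its minimum hourly value is non-negative;
#an equivalent vectorized form (a sketch, not part of the original analysis):
# adj=pmin(apply(matrix(hourlypm,ncol=24,byrow=T),1,min,na.rm=T),0)
# hourlypm=hourlypm-rep(adj,each=24)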

#create disaggregated data matrices

disagpm=matrix(hourlypm,ncol=24,byrow=T) #pm2.5 data matrix (ndaily x M)

disagtemp=matrix(hourlytemp,ncol=24,byrow=T) #temperature data matrix (ndaily x M)

disagpres=matrix(hourlypres,ncol=24,byrow=T) #pressure data matrix (ndaily x M)

#create daily totals (the sum of the hourly concentrations...to get an average, divide by 24)

dailyindex=index[hour==0] #daily data index

dailyN=length(dailyindex) #number of daily data points

dailyweekday=weekday[hour==0] #daily weekday

dailypm=apply(disagpm,1,sum,na.rm=T)

dailytemp=apply(disagtemp,1,sum,na.rm=T)

dailypres=apply(disagpres,1,sum,na.rm=T)

#Establish parameters for simulations & statistical output

nsim=100 #number of simulations (must be > 1)

hr_mean=array(NA,c(nsim+1,24)) #hourly stats; first row reserved for the historical data (empty for now)

hr_median=array(NA,c(nsim+1,24))

hr_stdev=array(NA,c(nsim+1,24))

hr_skew=array(NA,c(nsim+1,24))

hr_max=array(NA,c(nsim+1,24))

hr_min=array(NA,c(nsim+1,24))

hr_cor1=array(NA,c(nsim+1,24))

hr_cor2=array(NA,c(nsim+1,24))

hr_cor3=array(NA,c(nsim+1,24))

hr_cor4=array(NA,c(nsim+1,24))

da_mean=array(NA,(nsim+1)) #daily stats; first row reserved for the historical data (empty for now)

da_median=array(NA,(nsim+1))

da_stdev=array(NA,(nsim+1))

da_skew=array(NA,(nsim+1))

da_max=array(NA,(nsim+1))

da_min=array(NA,(nsim+1))

da_cor1=array(NA,(nsim+1))

da_cor2=array(NA,(nsim+1))

da_cor3=array(NA,(nsim+1))

da_cor4=array(NA,(nsim+1))

oa_mean=array(NA,(nsim+1)) #overall stats; first row reserved for the historical data (empty for now)

oa_median=array(NA,(nsim+1))

oa_stdev=array(NA,(nsim+1))

oa_skew=array(NA,(nsim+1))

oa_max=array(NA,(nsim+1))

oa_min=array(NA,(nsim+1))

oa_cor1=array(NA,(nsim+1))

oa_cor2=array(NA,(nsim+1))

oa_cor3=array(NA,(nsim+1))

oa_cor4=array(NA,(nsim+1))

alldisagpm=array(NA,c((nsim+1),dim(disagpm))) #hourly pdfs (empty for now)

#Calculate observed data statistics and place them in the first row of the stats matrices

k1=1 #enter values into first row

hr_mean[k1,]=apply(disagpm,2,mean,na.rm=T) #hourly stats

hr_median[k1,]=apply(disagpm,2,median,na.rm=T)

hr_stdev[k1,]=apply(disagpm,2,sd,na.rm=T)

hr_skew[k1,]=apply(disagpm,2,skew)

hr_max[k1,]=apply(disagpm,2,max,na.rm=T)

hr_min[k1,]=apply(disagpm,2,min,na.rm=T)

hr_cor1[k1,1]=cor(disagpm[2:dailyN,1],disagpm[1:(dailyN-1),24],use="complete.obs")

for(k in 2:24){hr_cor1[k1,k]=cor(disagpm[,k],disagpm[,(k-1)],use="complete.obs")}

hr_cor2[k1,1]=cor(disagpm[2:dailyN,1],disagpm[1:(dailyN-1),23],use="complete.obs")

hr_cor2[k1,2]=cor(disagpm[2:dailyN,2],disagpm[1:(dailyN-1),24],use="complete.obs")

for(k in 3:24){hr_cor2[k1,k]=cor(disagpm[,k],disagpm[,(k-2)],use="complete.obs")}

hr_cor3[k1,1]=cor(disagpm[2:dailyN,1],disagpm[1:(dailyN-1),22],use="complete.obs")

hr_cor3[k1,2]=cor(disagpm[2:dailyN,2],disagpm[1:(dailyN-1),23],use="complete.obs")

hr_cor3[k1,3]=cor(disagpm[2:dailyN,3],disagpm[1:(dailyN-1),24],use="complete.obs")

for(k in 4:24){hr_cor3[k1,k]=cor(disagpm[,k],disagpm[,(k-3)],use="complete.obs")}

hr_cor4[k1,1]=cor(disagpm[2:dailyN,1],disagpm[1:(dailyN-1),21],use="complete.obs")

hr_cor4[k1,2]=cor(disagpm[2:dailyN,2],disagpm[1:(dailyN-1),22],use="complete.obs")

hr_cor4[k1,3]=cor(disagpm[2:dailyN,3],disagpm[1:(dailyN-1),23],use="complete.obs")

hr_cor4[k1,4]=cor(disagpm[2:dailyN,4],disagpm[1:(dailyN-1),24],use="complete.obs")

for(k in 5:24){hr_cor4[k1,k]=cor(disagpm[,k],disagpm[,(k-4)],use="complete.obs")}

da_mean[k1]=mean(dailypm,na.rm=T) #daily stats

da_median[k1]=median(dailypm,na.rm=T)

da_stdev[k1]=sd(dailypm,na.rm=T)

da_skew[k1]=skew(dailypm)

da_max[k1]=max(dailypm,na.rm=T)

da_min[k1]=min(dailypm,na.rm=T)

da_cor1[k1]=cor(dailypm[2:dailyN],dailypm[1:(dailyN-1)],use="complete.obs")

da_cor2[k1]=cor(dailypm[3:dailyN],dailypm[1:(dailyN-2)],use="complete.obs")

da_cor3[k1]=cor(dailypm[4:dailyN],dailypm[1:(dailyN-3)],use="complete.obs")

da_cor4[k1]=cor(dailypm[5:dailyN],dailypm[1:(dailyN-4)],use="complete.obs")

oa_mean[k1]=mean(hourlypm,na.rm=T) #overall stats

oa_median[k1]=median(hourlypm,na.rm=T)

oa_stdev[k1]=sd(hourlypm,na.rm=T)

oa_skew[k1]=skew(hourlypm)

oa_max[k1]=max(hourlypm,na.rm=T)

oa_min[k1]=min(hourlypm,na.rm=T)

oa_cor1[k1]=cor(hourlypm[2:hourlyN],hourlypm[1:(hourlyN-1)],use="complete.obs")

oa_cor2[k1]=cor(hourlypm[3:hourlyN],hourlypm[1:(hourlyN-2)],use="complete.obs")

oa_cor3[k1]=cor(hourlypm[4:hourlyN],hourlypm[1:(hourlyN-3)],use="complete.obs")

oa_cor4[k1]=cor(hourlypm[5:hourlyN],hourlypm[1:(hourlyN-4)],use="complete.obs")

alldisagpm[k1,,]=disagpm #hourly pdfs

#Calculate rotation (R) matrix for nonparametric disaggregation model

Imat=diag(1,M,M) #identity matrix

Rmat=matrix(0,ncol=M,nrow=M) #R matrix (empty for now)

Rmat[,M]=1/sqrt(M) #last column in R matrix = 1/sqrt(M)

for(i in (M-1):1){ #apply Gram-Schmidt to make R orthonormal

j1=i+1

yv=rep(0,M)

for (j in j1:M){

yv=yv+(Rmat[,j]%*%Imat[,i])*Rmat[,j]

}

xv=Imat[,i]-yv

xv=xv/sqrt(sum(xv*xv)) #normalization

Rmat[,i]=xv

}

Rmatcheck1=round(Rmat%*%t(Rmat)) #verify that R is orthogonal (should be the identity matrix; note R%*%solve(R) is trivially the identity and checks nothing)

Rmatcheck2=sum(Rmatcheck1) #verify that R is normalized (should equal M)

Ymat=disagpm%*%Rmat #projected data matrix
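#note: because the last column of Rmat is the constant 1/sqrt(M), the Mth coordinate
#of each projected row equals that day's total divided by sqrt(M); fixing this
#coordinate to xp/sqrt(M) in the back-transform below is what forces the simulated
#hourly values to sum exactly to the simulated daily value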

#Determine number of nearest neighbors & establish weight function

K=round(sqrt(dailyN)) #number of nearest neighbors (K=sqrt(dailyN))

W=1:K

W=1/W

W=W/sum(W)

W=cumsum(W) #weight function
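#note: W is the cumulative form of the discrete kernel w(j)=(1/j)/sum(1/i), i=1..K,
#so dropping a uniform random number into W (as done below) selects the jth nearest
#neighbor with probability proportional to 1/j, favoring closer neighbors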

#Create simulations and calculate basic statistics for each

disagpmsim=matrix(NA,nrow=dim(disagpm)[1],ncol=dim(disagpm)[2]) #matrix for disaggregated simulation data (empty for now)

for(isim in 1:nsim){ #loop through simulations

arout=ar(dailypm-mean(dailypm)) #AR model on daily data

dailypmsim=arima.sim(n=dailyN,list(ar=c(arout$ar)),sd=sqrt(arout$var.pred))+mean(dailypm)
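#note: ar() selects the AR order by AIC; since the fit depends only on the observed
#daily series, it is identical on every pass and could be moved outside this loop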

for(dy in 1:dailyN){ #loop through days in AR model and get disaggregated values

xp=dailypmsim[dy] #value in daily simulation under investigation

xtemp=c(xp,dailypm) #combine xp with the daily data

xdist=order(as.matrix(dist(xtemp))[1,2:(dailyN+1)]) #order neighbors by distance to xp

temp1=runif(1,0,1) #uniform random number between 0 and 1 to select neighbor

temp2=c(temp1,W) #place uniform random number in weight matrix

temp3=rank(temp2) #pick out rank of uniform random number

distnn=xdist[temp3[1]] #pick out index of randomly selected neighbor

U=Ymat[distnn,1:(M-1)] #row in U matrix

Y=c(U,xp/sqrt(M)) #row in Y matrix

disagpmsim[dy,]=Rmat%*%Y #transform back to get row in X matrix

}

hourlypmsim=array(t(disagpmsim)) #simulated time series of PM2.5

dailypmsimcheck=rowSums(disagpmsim) #should match the simulated daily data above due to the summability criterion
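#illustrative sanity check (not part of the original run): the row sums should
#reproduce the simulated daily values to within rounding error
# stopifnot(max(abs(dailypmsimcheck-dailypmsim))<1e-8)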

k1=isim+1

hr_mean[k1,]=apply(disagpmsim,2,mean,na.rm=T) #hourly stats

hr_median[k1,]=apply(disagpmsim,2,median,na.rm=T)

hr_stdev[k1,]=apply(disagpmsim,2,sd,na.rm=T)

hr_skew[k1,]=apply(disagpmsim,2,skew)

hr_max[k1,]=apply(disagpmsim,2,max,na.rm=T)

hr_min[k1,]=apply(disagpmsim,2,min,na.rm=T)

hr_cor1[k1,1]=cor(disagpmsim[2:dailyN,1],disagpmsim[1:(dailyN-1),24],use="complete.obs")

for(k in 2:24){hr_cor1[k1,k]=cor(disagpmsim[,k],disagpmsim[,(k-1)],use="complete.obs")}

hr_cor2[k1,1]=cor(disagpmsim[2:dailyN,1],disagpmsim[1:(dailyN-1),23],use="complete.obs")

hr_cor2[k1,2]=cor(disagpmsim[2:dailyN,2],disagpmsim[1:(dailyN-1),24],use="complete.obs")

for(k in 3:24){hr_cor2[k1,k]=cor(disagpmsim[,k],disagpmsim[,(k-2)],use="complete.obs")}

hr_cor3[k1,1]=cor(disagpmsim[2:dailyN,1],disagpmsim[1:(dailyN-1),22],use="complete.obs")

hr_cor3[k1,2]=cor(disagpmsim[2:dailyN,2],disagpmsim[1:(dailyN-1),23],use="complete.obs")

hr_cor3[k1,3]=cor(disagpmsim[2:dailyN,3],disagpmsim[1:(dailyN-1),24],use="complete.obs")

for(k in 4:24){hr_cor3[k1,k]=cor(disagpmsim[,k],disagpmsim[,(k-3)],use="complete.obs")}

hr_cor4[k1,1]=cor(disagpmsim[2:dailyN,1],disagpmsim[1:(dailyN-1),21],use="complete.obs")

hr_cor4[k1,2]=cor(disagpmsim[2:dailyN,2],disagpmsim[1:(dailyN-1),22],use="complete.obs")

hr_cor4[k1,3]=cor(disagpmsim[2:dailyN,3],disagpmsim[1:(dailyN-1),23],use="complete.obs")

hr_cor4[k1,4]=cor(disagpmsim[2:dailyN,4],disagpmsim[1:(dailyN-1),24],use="complete.obs")

for(k in 5:24){hr_cor4[k1,k]=cor(disagpmsim[,k],disagpmsim[,(k-4)],use="complete.obs")}

da_mean[k1]=mean(dailypmsim,na.rm=T) #daily stats

da_median[k1]=median(dailypmsim,na.rm=T)

da_stdev[k1]=sd(dailypmsim,na.rm=T)

da_skew[k1]=skew(dailypmsim)

da_max[k1]=max(dailypmsim,na.rm=T)

da_min[k1]=min(dailypmsim,na.rm=T)

da_cor1[k1]=cor(dailypmsim[2:dailyN],dailypmsim[1:(dailyN-1)],use="complete.obs")

da_cor2[k1]=cor(dailypmsim[3:dailyN],dailypmsim[1:(dailyN-2)],use="complete.obs")

da_cor3[k1]=cor(dailypmsim[4:dailyN],dailypmsim[1:(dailyN-3)],use="complete.obs")

da_cor4[k1]=cor(dailypmsim[5:dailyN],dailypmsim[1:(dailyN-4)],use="complete.obs")

oa_mean[k1]=mean(hourlypmsim,na.rm=T) #overall stats

oa_median[k1]=median(hourlypmsim,na.rm=T)

oa_stdev[k1]=sd(hourlypmsim,na.rm=T)

oa_skew[k1]=skew(hourlypmsim)

oa_max[k1]=max(hourlypmsim,na.rm=T)

oa_min[k1]=min(hourlypmsim,na.rm=T)

oa_cor1[k1]=cor(hourlypmsim[2:hourlyN],hourlypmsim[1:(hourlyN-1)],use="complete.obs")

oa_cor2[k1]=cor(hourlypmsim[3:hourlyN],hourlypmsim[1:(hourlyN-2)],use="complete.obs")

oa_cor3[k1]=cor(hourlypmsim[4:hourlyN],hourlypmsim[1:(hourlyN-3)],use="complete.obs")

oa_cor4[k1]=cor(hourlypmsim[5:hourlyN],hourlypmsim[1:(hourlyN-4)],use="complete.obs")

alldisagpm[k1,,]=disagpmsim #hourly pdfs

}

#Boxplots of hourly, daily & overall statistics

par(mfrow=c(5,2)) #set plotting parameters

for(stat in 1:10){ #loop through stats

if(stat==1){temp=cbind(hr_mean,da_mean/M,oa_mean)}

if(stat==2){temp=cbind(hr_median,da_median/M,oa_median)}

if(stat==3){temp=cbind(hr_stdev,da_stdev/M,oa_stdev)}

if(stat==4){temp=cbind(hr_skew,da_skew,oa_skew)}

if(stat==5){temp=cbind(hr_max,da_max/M,oa_max)}

if(stat==6){temp=cbind(hr_min,da_min/M,oa_min)}

if(stat==7){temp=cbind(hr_cor1,da_cor1,oa_cor1)}

if(stat==8){temp=cbind(hr_cor2,da_cor2,oa_cor2)}

if(stat==9){temp=cbind(hr_cor3,da_cor3,oa_cor3)}

if(stat==10){temp=cbind(hr_cor4,da_cor4,oa_cor4)}

title=paste("PM2.5",c("Mean","Median","Standard Deviation","Skew","Max","Min","Lag-1 Correlation","Lag-2 Correlation","Lag-3 Correlation","Lag-4 Correlation"))[stat]

n=length(temp[,1])

tempsim=temp[2:n,]

bxout=boxplot(split(tempsim,col(tempsim)), plot=F, cex=1.0)

bxout$names=c("0:00","1:00","2:00","3:00","4:00","5:00","6:00","7:00","8:00","9:00","10:00","11:00","12:00","13:00","14:00","15:00","16:00","17:00","18:00","19:00","20:00","21:00","22:00","23:00","Daily","Overall")

bx=bxp(bxout,ylim=range(temp,na.rm=T),xlab="",ylab="",cex=1.00,las=3)

points(bx,temp[1,],pch=16,col="red")

lines(bx[1:24],temp[1,1:24],pch=16,col="red")

title(title)

}

#Boxplots of hourly pdfs

par(mfrow=c(4,6),mar=c(0,0,2,0)) #set plotting parameters

for(ihour in 1:24){ #loop through hours in the day

onehour=alldisagpm[,,ihour]

xmin=min(onehour) #minimum value across the historical data and simulations

xmax=max(onehour) #maximum value across the historical data and simulations

delta=(xmax-xmin)/97

xeval=seq(xmin-delta,xmax+delta,delta) #evaluation points for the pdf

hourlypdf=matrix(nrow=(nsim+1),ncol=length(xeval)) #estimated pdf for the original data and simulations (empty for now)

for(i in 1:(nsim+1)){ #loop through data & simulations

xdata=onehour[i,] #data to fit pdf to

denout=sm.density(xdata,eval.points=xeval,model="none",display="none") #fit kernel density estimate

hourlypdf[i,]=denout$estimate #estimated pdf

}
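#each row of hourlypdf now holds a kernel density estimate evaluated on the common
#grid xeval, so the simulated pdfs can be box-plotted pointwise below and compared
#against the historical pdf stored in row 1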

temp=hourlypdf[2:(nsim+1),]

bxout=boxplot(split(temp,col(temp)), plot=F, cex=1.0)

bxout$names=rep("",length(bxout$names))

bx=bxp(bxout,ylim=range(hourlypdf),xlab="",ylab="",xaxt="n",yaxt="n",cex=1.00)

points(bx,hourlypdf[1,],pch=16,col="red")

lines(bx,hourlypdf[1,],pch=16,col="red")

title=c("0:00","1:00","2:00","3:00","4:00","5:00","6:00","7:00","8:00","9:00","10:00","11:00","12:00","13:00","14:00","15:00","16:00","17:00","18:00","19:00","20:00","21:00","22:00","23:00")[ihour]

title(main=title)

}

# dev.off() #turn off plotting

# writeLines("created file(s) proj_6_x.bmp")

Appendix G: R code used for the nonparametric seasonal k-nn forecast model

#The following R code is used to forecast from a "seasonal" nonparametric k-nn model on hourly PM2.5 TEOM data

#The data are entered in "data.txt"
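#the model forecasts each daytime hour from the preceding lag hours of the same day
#by k-nearest-neighbor resampling; ensembles of nsim traces are summarized by their
#median to produce 1-, 2- and 3-hour-ahead forecasts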

#clear registers, load libraries & source functions

rm(list=ls(all=TRUE)) #clear registers

#load data

rawdata=read.table("data.txt",colClasses=list("character","numeric","numeric","numeric","numeric","numeric","character","numeric","numeric","numeric"),sep="\t",fill=T,header=T,na.strings="NA",col.names=c("date","year","month","day","hour","weekday","inst_time","pm2.5","t","p"))

data=cbind(rawdata[,2],rawdata[,3],rawdata[,4],rawdata[,5],rawdata[,6],rawdata[,8],rawdata[,9],rawdata[,10])

#restrict temporal domain of data

datafor=data[data[,1]==2004&data[,2]==1&data[,5]<6,] #restrict forecast data to weekdays in January 2004 (weekday condition reconstructed)

data=data[data[,1]<2004&data[,2]==1&data[,5]<6,] #restrict historical data to January weekdays of 2001-2003 (reconstructed; the "seasonal" model trains on the forecast month)

#define variables (this block reconstructed from the code that follows)

hourlypm=data[,6] #pm2.5 concentration (ug/m3)

hourlypmfor=datafor[,6] #pm2.5 concentration to be forecast (ug/m3)

hourlyN=length(hourlypm) #number of hourly data points

hourlyNfor=length(hourlypmfor) #number of hourly forecast data points

dailyN=hourlyN/24 #number of days of data

dailyNfor=hourlyNfor/24 #number of days of forecast data

#plotting parameters

bmp(filename="proj_7_%01d.bmp",width=800,height=800,pointsize=12,bg="white",res=NA) #reconstructed to match dev.off() and the file name reported below

nsim=100 #number of simulations (must be > 1; value reconstructed)

lag=3 #lag to investigate for k-nn (must be < 7 since days aren't continuous)

z=hourlypm-mean(hourlypm) #mean removed time series data

zfor=hourlypmfor-mean(hourlypmfor) #mean removed time series data (to be forecast)

zdisag=matrix(z,ncol=24,byrow=T) #mean removed disaggregated data matrix

zdisagfor=matrix(zfor,ncol=24,byrow=T) #mean removed disaggregated data matrix (to be forecast)

xdata=array(NA,dim=c(dailyN,12,lag)) #lagged X array (dailyN days x 12 hours x lag)

ydata=array(NA,dim=c(dailyN,12,1)) #lagged Y array (dailyN days x 12 hours x 1)

for(dy in 1:dailyN){ #loop through days of data

for(hr in 1:12){ #loop through daytime hours

xdata[dy,hr,]=zdisag[dy,(hr+6-lag):(hr+6-1)] #lagged X matrix (dailyN days, t-lag,...,t-1, hour)

ydata[dy,hr,]=zdisag[dy,(hr+6)] #lagged Y matrix (dailyN days, t, hour)

}

}
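#example: with lag=3 and hr=1, the predictors are the mean-removed PM2.5 values at
#3:00, 4:00 and 5:00 (columns 4:6 of zdisag) and the response is the 6:00 value
#(column 7), the first daytime hour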

#Determine number of nearest neighbors & establish weight function

K=round(sqrt(dailyN)) #number of nearest neighbors (K=sqrt(dailyN))

W=1:K

W=1/W

W=W/sum(W)

W=cumsum(W) #weight function

#Create ensembles

zsim=matrix(NA,nrow=(12*dailyNfor),ncol=lag) #forecast data (empty for now)

zsimtemp=matrix(NA,nrow=nsim,ncol=lag) #ensemble member forecasts (empty for now)

i=0

for(dy in 1:dailyNfor){ #loop through days in forecast

for(hr in 1:12){ #loop through hours

i=i+1

for(isim in 1:nsim){ #loop through simulations

zp=zdisagfor[dy,(hr+6-lag):(hr+6-1)] #observed lag values to forecast from

for(ilag in 1:lag){ #loop through lags
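#multi-step forecasts are built iteratively: the value simulated at each step is
#appended to zp at the bottom of this loop, so the 2- and 3-hour forecasts condition
#on earlier simulated hours rather than on observations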

ztemp=rbind(zp,as.matrix(xdata[,hr,])) #combine zp with the values in xdata corresponding to the appropriate hour

zdist=order(as.matrix(dist(ztemp))[1,2:(dailyN+1)]) #order neighbors by distance to zp

temp1=runif(1,0,1) #uniform random number between 0 and 1 to select neighbor

temp2=c(temp1,W) #place uniform random number in weight matrix

temp3=rank(temp2) #pick out rank of uniform random number

jj=zdist[temp3[1]] #pick out index of randomly selected neighbor

zsimtemp[isim,ilag]=ydata[jj,hr,] #enter value of selected y-value into simulation matrix

zp=c(zp[2:lag],zsimtemp[isim,ilag]) #next value of zp to forecast from

}

}

zsim[i,]=apply(zsimtemp,2,median) #forecast daytime data including projections up to lag

}

}

zobs=matrix(NA,nrow=(12*dailyNfor),ncol=lag) #observed data being forecast (empty for now)

for(ilag in 1:lag){

zdisagtemp=matrix(zdisagfor[,(6+ilag):(17+ilag)],ncol=12) #mean removed disaggregated daytime data matrix (to be forecast)

zobs[,ilag]=array(t(zdisagtemp)) #mean removed time series daytime data (to be forecast)

}
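#zobs[,ilag] is the observed value ilag hours beyond the last predictor hour,
#aligned row-for-row with zsim[,ilag] for the scatterplots below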

#Plot results

par(mfrow=c(1,3))

plot(zobs[,1],zsim[,1],xlab="Observed Mean Removed PM2.5",ylab="Forecast Mean Removed PM2.5",main="1-hour Forecast")

text(-8,0.9*max(zsim[,1]),paste("R-squared =",round(cor(zobs[,1],zsim[,1])^2,2)),pos=4)

plot(zobs[,2],zsim[,2],xlab="Observed Mean Removed PM2.5",ylab="Forecast Mean Removed PM2.5",main="2-hour Forecast")

text(-8,0.9*max(zsim[,2]),paste("R-squared =",round(cor(zobs[,2],zsim[,2])^2,2)),pos=4)

plot(zobs[,3],zsim[,3],xlab="Observed Mean Removed PM2.5",ylab="Forecast Mean Removed PM2.5",main="3-hour Forecast")

text(-8,0.9*max(zsim[,3]),paste("R-squared =",round(cor(zobs[,3],zsim[,3])^2,2)),pos=4)

dev.off() #turn off plotting

writeLines("created file(s) proj_7_x.bmp")
