Predicting Influenza Propagation using Network Inference and ...

Predicting Influenza Propagation using Network Inference and Machine Learning

Mike Chrzanowski (mc2711), Jerome Ku (jeromeku), Foon Wang Pong (ppong) Group 6

1. Introduction

Seasonal influenza epidemics are a major public health issue, leading to millions of days of lost productivity and hundreds of thousands of deaths every year [1]. Epidemic containment is an increasingly complex issue in an ever-more connected world, necessitating means for better tracking and prediction of flu outbreaks.

The goal of our project is to predict influenza propagation in the United States. Being able to predict the spread of influenza would allow the government to take preventive measures to limit infection and anticipate the need for large-scale vaccination. To that end, we tracked the spread of influenza through hospitals in the United States between the years 2009 to 2011 as a means for predicting the spread of influenza in the general population. We trained a classifier that can accurately predict whether a hospital in the United States will observe cases of influenza in a given month by extracting features from hospital networks we constructed.

There is extensive prior work in the area of modeling epidemics [8,9]. The focus of these studies, however, is on explaining the underlying dynamics of disease spread and secondarily on prediction. Our work, in contrast, is focused primarily on prediction.

Our main contribution is the use of novel network-derived features in aiding prediction. Several networks were built from both empirical and inferred network properties. These networks included a transportation network built from ground and air transportation features, where hospitals are nodes and edges are based on Euclidean distance and flight volume between airports in proximity of hospitals. Additionally, we used network inference algorithms to derive a network from hospital inpatient diagnosis records where edges were estimated based on the the spread of disease from one hospital to another. Furthermore, to capture macro-level properties of influenza propagation, we segmented the United States into population clusters and monitored the spread of influenza within these clusters. Finally, we included non-network features such as Google Flu Trend index data as well seasonality to further improve prediction accuracy.

2. Related Work

There is extensive literature on the statistical modeling of epidemics [8,9]. The majority of the work focuses on large systems of differential equations to describe the dynamical evolution of disease spread through simulations. Most commonly, these are variants / extensions of the susceptible, infection, and removed (SIR) model.

Central to these studies is the concept of a contact network, in which nodes represent individual hosts and the edges represent potential paths of infection. From the contact network, a transmission network is realized, which represents the actual graph of disease spread.

Network structure can be estimate through various means: collection of host data, known transportation networks [7], patients in a hospital [5], or inferred from past transmission data [8]. Moreover, algorithms used for social network analysis have proven useful in studying epidemic spread [10].

The focus of these statistical models is on describing the mechanisms giving rise to the complex phenomena of epidemics. While the construction of descriptive models is a useful step in prediction, it is not always a necessary one. In fact, as has been shown through Google Flu Trends, purely data-driven methodologies with minimal

1

domain understanding can yield meaningfully predictive models. For example, in [11], the authors were able to pre-empt the US government's flu tracker by building simple linear models based on large-scale query data.

In our work, we combine both data-driven and domain-informed features with the overarching goal of prediction. We utilize both empirical contact networks and inferred transmission networks in crafting features. More precisely, we created a hospital contact network based on travel distance and flight volume. We also inferred a hospital transmission network based on historical propagation rates. From these networks, we derived features which along with other predictors of disease spread, were then fed into classifiers for predicting likelihood of a hospital being infected.

3. Data

3.1 NIS Dataset

We use the Healthcare Cost and Utilization Project (HCUP) National Inpatient Sample (NIS) from 2009 - 2011 as our primary data source. This data contains all discharge data from a 20% sample of all hospitals in the US, representing approximately 8 million hospital stays per year. Each medical record in the dataset treats a patient record as a unit (i.e., the same patient will have a new record for each visit) and includes ICD9 diagnosis codes, procedure codes, length of stay and additional information regarding the hospital the patient stayed at. The NIS data contains additional information including addresses, the number of beds, and also hospital type (i.e., teaching hospital). Below are summary statistics for the dataset:

Year

2009

2010

Total hospitals

1,050

1,051

Total reported cases of influenza

30,532

5,614

Table 1. NIS dataset statistics

3.2 Supplementary Datasets

2011 1,049 15,734

In addition to the NIS dataset, several supplementary datasets were required to create features.

Transportation / Distance: While we would have liked to use Google Maps' Distance API to estimate distances between hospitals, this was not possible due to usage limits and the large number of requests required to compute all pairwise distances between hospitals in our dataset. Instead, as we will discuss in 4.1.1.1.1, we calculated the distance between zip codes and used that as an approximation for the distances between hospitals. To do this, we needed the 2010 Census Zip Code dataset from the United States Census Bureau to get geolocation coordinates for every zip code in the United States, which we then used to calculate distances.

We also constructed a Flight network where hospitals are nodes and where the edges capture the amount of air travel between the airports nearby pairs of hospitals. Constructing these edges required not only access to the number of flights being undertaken between all airports in the country but also geolocation data mapping airports to their latitude and longitude coordinates. These data are extracted from the 2009 ASA Data Expo dataset. [26]

Population and other: To create population-level features, we partitioned the US into clusters based on population. Doing this required having population data for each zip code in the US, which is provided in the 2010 Census Zip Code dataset as well. We used publicly available Google Flu Search Trend data as an additional data source in engineering features.

2

4. Methodology

4.1 Prediction Model

We want to predict whether a hospital in the United States will diagnose at least 1 patient with influenza in a given month given features generated using data from previous months. Influenza is typically transmitted through direct contact or through the air. Many studies suggest that ground and air transportation networks generally provide a good approximation for modeling epidemic spread.

In our study, we constructed three networks based on both empirical data and inferred edges. Our networks try to model the mobilization of people in the US. The empirical networks were based on measures of "distance" -physical proximity and airport activity -- and the inferred network was based on Rodriguez et al's network inference algorithm. Properties of each of these networks were then used as features in our prediction model. We also cluster hospitals using a similarity metric based on the population of the zip code that each hospital resides in and then use the diagnosis records of all hospitals in a cluster to aid in prediction. Finally, we incorporate non-network features such as the season of the month as well as Google Flu trend data for that month.

In the following section, we will delve into details about our generated networks, the features for our prediction model, the machine learning algorithms, the criteria we use to evaluate the quality of a model's fit to the data, and details regarding hyperparameter selection.

4.1.1 Features

4.1.1.1 Networks Features

4.1.1.1.1 Euclidean Distance Network

The Euclidean distance network measures the physical distance between hospitals. It captures the intuition that hospitals located near other hospitals that have seen influenza cases are more likely to see infections in the future.

Ideally we would have liked to have used distances between hospitals as reported by a service like Google's Distance API to generate this network. However, the usage limits on the service combined with the sheer number of hospitals in our dataset prevented us from doing so.

Instead, we estimated distances based on zip codes. To do so, we mapped each hospital to the zip code to which it belongs. Next, we use United States Census Bureau information to get the geo-location coordinates of that particular zip code in question. Given the geo-location coordinates of two hospitals, we can now easily calculate the distance between them (assuming the United States is a flat plane). We select, for each hospital, the nearest E hospitals, and we generate undirected edges between these E pairs. For each of these E closest hospitals for a given hospital for a given month and a given year, we encode the number of recorded diagnoses of influenza in the previous month at those E hospitals as features.

4.1.1.1.2 Air Transportation Network

These edges are designed to measure the level of air travel activity between two hospitals so as to capture the idea that influenza tends to propagate between areas located near busy airplane hubs. Hospitals close to airports that have many flights between them will spread influenza to each other. For every hospital in our network, we find all airports within a 70 mile radius of it (about an hour of driving distance). For each pair of hospitals we want to determine if they are near different airports that are connected by flights. The weight of this edge is then the number of flights between these two airports. If the hospitals are near multiple airports, then the airport pair is chosen that has the highest level of air travel activity. For each hospital, we generate F undirected edges between the hospital pairs that exhibit the highest level of air travel activity between themselves. As in the Euclidean network, for each of of the F hospitals, we encode as features the number of recorded diagnoses of influenza in the previous month.

3

4.1.1.1.work Inferred From Disease Cascades

In order to infer the transmission network, we utilize observed rates of infection at a hospital level. More specifically, we infer the structure of the underlying network based on the diffusion of various contagions. Using the HCUP CCS diagnosis code mapping, we identified 4 relevant categories in our dataset: bacterial, viral, influenza and tuberculosis. The diagnosis codes in these categories are mostly airborne diseases and can spread easily. Each category contains many diagnoses codes, see Table 2 for summary.

Category

Number of ICD9 disease codes in this category

Total Number of diagnoses in 2009

Total Number of diagnoses in 2010

Total Number of diagnoses in 2011

Tuberculosis 426

Bacterial infection 107

Viral Infection 140

586,801

439,729

363,798

629,627

444,593

385,997

742,613

487,082

439,352

Table 2. CCS disease category statistics.

Influenza 15

30,532 5,614 15,734

Note that each ICD9 code is composed of a 3-digit disease prefix and then a billable-code suffix. For each 3-digit prefix of the diagnosis codes, and for every year of data, we generate a cascade of hospitals by looking at the earliest time in the calendar year when a hospital reported a diagnosis of the code in question. The year and month is used as the timestamp for the hospital in this cascade. Given multiple cascades, we can infer the underlying network based on Rodriguez and Leskovec's algorithm [9].

In particular, we want to find a network that maximizes the likelihood of the given cascades. Assuming the

cascades are independent of each other, we can write this as:

G = argmaxGP(C| G) = argmaxG P(c|G)

cC

where

G is the estimated network

G is the underlying network

C is the set of cascades that are provided as input

c is a single cascade in C

Note that there are some caveats regarding the usage of the inferred network in our prediction model. In particular, we are concerned about using future data to predict past events. It would be a mistake to include diagnosis data from, say, December 2010, in the cascades that produce an inference network which is then used as features for prediction of samples in February 2010. Therefore, for every month in our dataset, we generate a new inference graph built from cascades that use data up to time t-1. When training on a (sample, label) pair (xh(t), yh(t)) at time t, we only use features from the inference network generated from cascades utilizing data up to time t-1.

To get a sense of the amount of data involved, the following table summarizes the statistics about the cascades we use for June and December of each year of data we have:

Month, Year

Number of Cascades Used

Average Number of Hospitals (i.e., Events) per Cascade

4

June, 2009 December, 2009 June, 2010 December, 2010 June, 2011 December, 2011

55

205

56

260

110

234

112

255

167

240

167

255

Table 3. Cascade statistics.

Given the inference network, we want to use its edges as features in our prediction model. In the Euclidean and air transport network, we simply pick the E and F closest hospitals as features. However, the edges in the inference network are unweighted, so there is no intuitive way to choosing the "best" edges. Therefore, given a hospital h, we want to compute the average cases of influenza diagnosed in its neighboring hospitals in the previous month as a feature in our prediction model.

4.1.1.1.4 Population Zone Clustering

The networks described in 4.1.1.1 to 4.1.1.3 primarily capture pairwise similarities between hospitals. In order to capture community-level influenza activity, we used spectral clustering to cluster zip codes in the United States into zones based on the populations of adjacent zip codes. Formally, given a number of clusters C, we define first compute the Laplacian matrix L as follows:

L = D-W Where:

Dii = degree(zipcodei) = nj=01{distance_in_miles(zipcodei, zipcodej) < 70} Wij = 1{distance_in_miles(zipcodei, zipcodej) < 70} (1 / (population(zipcodei) - population(zipcodej) )2) 1{X} is the identity function

We then extract the C generalized eigenvectors corresponding to the lowest C eigenvalues of L. Let U Rnxc be the matrix where the ith generalized eigenvector of L is the ith column. We then cluster the rows i=1..n of U by running k-means with C centroids and assign zipcodei to the cluster we assigned the ith row of U.

To extract features from this clustering for use in our prediction, we perform the following calculation. For a given hospital h at time t, we sum up the number of influenza diagnoses at time t - 1 in all hospitals within the same cluster as h and use that as a feature for h.

4.1.1.2 Non-Network Features

The literature tells us that influenza is more likely to occur during certain months of a year [25]. Therefore, we added three variables to account for seasonality: whether the current month is a spring month, a summer month, or a autumn month. There has also been extensive research correlating Google's National Flu Index data to the number of influenza diagnoses in the country at large [3]. So, for any given month and year, we average the weekly Google Flu Index values for the previous month and use it as a feature for all hospitals for this month. Note that these features are very general and, actually, identical for all hospitals at a given time t: this is because the focus of our project is on the use of network features to aid classification.

4.1.1.3 Feature Vector for Classification

Each sample in our dataset is denoted as xh(t) where h is a particular hospital and t corresponds to a particular time, which is a particular month m and year y combination. xh(t) is composed of the following E + F + 1 + 6

5

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download