### Study overview

The proposed approach for detecting the aberrations of ILI outbreaks in a study area consists of two stages. In the first stage, we used negative binomial regression models to fit daily outpatient and ED ILI visits during 2004–2007, and selected significant predictors for the three study areas in northern, central and southern Taiwan, separately. The residual of observed number of visits on the last day of each observed series of visits was further standardized. In the second stage, we monitored these daily standardized Pearson residuals for any aberrations in 2008 and 2009. Then, we evaluated the performance of the proposed approach by the empirical health insurance claims data in 2009, when novel H1N1 pandemic flu outbreaks occurred. The detected aberrations were compared with the weekly influenza virus isolation rates. In addition, we compared the widely used surveillance method, modified CUSUM methods applied to both observed visits and Pearson residuals, with the proposed approach using simulated data.

### Data source

The three study areas selected in this study are shown on the map (Figure 1). Each study area consists of several populous districts surrounding a major weather station. The registered residents of the three study areas in northern, central and southern Taiwan in 2009 numbered 2,356,205, 1,330,913 and 1,120,944, respectively. Daily ILI visits from 2004 to 2009 were obtained from the NHI research database of the National Health Research Institute. This study was approved by the institutional review board (IRB) of Academia Sinica (IRB#: AS-IRB01-12117). The database we used was all stripped of identifying information, and thus informed consent was not needed. Daily meteorological data were downloaded from the Data Bank for Atmospheric Research maintained by National Applied Research Laboratories (http://dbar.ttfri.narl.org.tw). Weekly influenza virus isolation rates were calculated from a laboratory surveillance database maintained by Taiwan’s CDC.

### ILI definition and the first occurrence of ILI cases

The NHI research database, also called the hospital-based health insurance claims database, is publicly available to researchers in Taiwan. There are four major levels of the health care system in Taiwan, including medical centers (Level 4), regional hospitals (Level 3), area hospitals (Level 2) and primary health care (i.e. clinics, Level 1). In order to focus on the influenza epidemics at the community level, we restrict the data of outpatient and ED visits to the lower levels of the health care system. For the outpatient visits, only area hospitals and primary health care were included. For the ED visits, only regional hospitals and area hospitals were included.

The ILI cases in this study were determined as those diagnosed with 29 ILI-related ICD-9 codes, which were the definition in the ESSENCE system (Electronic Surveillance System for the Early Notification of Community-based Epidemics, United States) [24,25]. In this study, we proposed analyzing daily series of new ILI cases. If a case had repeated ILI visits within 14 days, only the first visit was counted. The daily series of new ILI cases was easily created because, in the NHI research database, each patient has a scrambled unique identification code and dates of clinical visits. The reason for using 14 days as an observational window was that the influenza incubation period is about 1–4 days, and the virus shedding occurs from the day before symptoms begin through 5–10 days after illness onset [26]. The sum of these two periods has a maximum of around 14 days. Therefore, we defined multiple ILI visits within 14 days as being in the same infection course, and counted only the first occurrence.

### Negative binomial regression model and application to empirical data

For the daily counts of outpatient or ED ILI visits in a study area, we proposed use of a negative binomial regression model to better account for over-dispersion, which was observed in our exploratory data analysis. We utilized the “glm.nb” function under open-source software, an R [27] package named “MASS”, to implement the negative binomial regression [28]. The regression model for estimating expected visits of the last day of the observed series was constructed by fitting the daily series of visits and covariates observed in the three years before the last day. In practice, the regression models needed to be updated for each day and each study area. To simplify evaluation of the method, we first used the data of the years 2004–2007 in the three study areas to identify influential covariates for the regression models. The models with the selected covariates were then repeatedly fitted to daily ILI visits observed each day during 2008–2009 and three years before that day.

During the variable selection stage, the significance level was set at 0.05. The weather factor used here was whether the temperature was equal to or below 14°C, which is the official definition of a continental cold air mass by the Central Weather Bureau in Taiwan (http://www.cwb.gov.tw/V7e/knowledge/encyclopedia/me003.htm). In addition, the temporal dependence was also considered by including the new ILI cases observed exactly one week earlier. Our exploratory data analyses showed day of the week, Chinese New Year, national holidays, typhoon days off, the day following national holidays and typhoon days off may have influences on the observed visits. They are mainly related to the closure of the hospitals and clinics. The seasonality of influenza epidemic has often been modeled with harmonic terms. We have found that these terms of sine and cosine functions sometimes were not good enough to capture the seasonal pattern. In this study, we proposed an alternative seasonal term called moving month-of-the-year time-dependent variable, which is the difference between medians of visits in the past 30 days and visits in the past 365 days for seasonal adjustment in the model. This time-dependent variable is determined dynamically by the observed data. It can be used for adjusting usual season outbreaks. If a factor or variable was statistically significant in an area, it was selected into the regression model as a covariate. Chinese New Year and national holidays, including national public holidays of Taiwan and typhoon days off, were obtained from the Directorate-General of Personnel Administration, Executive Yuan, Taiwan (http://www.dgpa.gov.tw/). The days of the week had clear effects on the clinic visits; for example, there were very small numbers of outpatient visits on Sundays, and more outpatient visits on Mondays and ED visits on Saturdays and Sundays, while both outpatient and ED visits were stable from Tuesday to Friday. Once the covariates were fixed, we used the repeatedly fitted negative binomial regression models to estimate the expected visits for each day in 2008 and 2009 in each area.

Specifically, for estimating expected number of visits on day *t*, we have observed daily ILI outpatient or ED visits and the vector of selected covariates, denoted by *y*_{
i
} and *X*_{
i
} for *i* = *t*, *t* − 1, …, *t* − *T*_{0}. We fixed *T*_{0} = 365 × 3 in the real data analysis. The mean and variance of *y*_{
i
} for the negative binomial model are denoted as *E*(*y*_{
i
}) = *μ*_{
i
} and \( Var\left({y}_i\right)={\mu}_i+\kappa {\mu}_i^2 \), where the constant *κ* is called a dispersion parameter. The mean equation of the negative binomial regression model is often given by \( {\mu}_i= \exp \left({X}_i^T\beta \right) \), where *β* is a vector of coefficients. Let \( {\widehat{\beta}}_t \) and \( {\widehat{\kappa}}_t \) be the estimated vector of coefficients and dispersion parameter from the fitted model using the series of *y*_{
i
} and *X*_{
i
} for *i* = *t* − 1, …, *t* − *T*_{0}. The expected number of visits for day *t* is estimated by \( {\widehat{y}}_t= \exp \left({X}_t^T{\widehat{\beta}}_t\right) \). The variance of estimate is given by \( Var\left({\widehat{y}}_t\right)={\widehat{y}}_t+{\widehat{\kappa}}_t{\widehat{y}}_t^2 \). The Pearson residuals were denoted as \( {R}_t=\left({y}_t-{\widehat{y}}_t\right)/\sqrt{Var\left({\widehat{y}}_t\right)} \) for *t* = 1, 2, …, *n*, from the fitted negative binomial regression models on *n* consecutive days.

### Aberration detection rule

Since the sample size *T*_{0} is often large, the standardized Pearson residuals could be assumed to be approximately distributed normally with mean 0 and variance 1. Therefore, we proposed a simple rule by directly monitoring the series of Pearson residuals. When a Pearson residual was larger than the 100(1 − *α*)^{th} percentile of the standard normal distribution, denoted by *z*_{1 − α}, a signal was issued for the day to report a possible aberrant outbreak. We suggest choosing *α* = 0.025, that is, set the threshold *z*_{0.975} = 1.96 for simplicity. If the series of visits were well fitted by the negative binomial models, the false alarm rate would be around 2.5%. In practice, we should expect a false alarm rate slightly larger than 2.5% because the fitted models were usually not perfect. We may choose a smaller *α* if a false alarm rate is a major concern. However, the identification of an epidemic may be delayed or even undetected. For comparison with the CUSUM method in this study, we name our proposed approach SPR due to use of Standardized Pearson Residuals *R*_{
t
} as the deviation statistic for outbreak detection and threshold *z*_{1 − α}.

### Simulation study

After applying the proposed approach to the empirical data, the performance of the approach was still hard to evaluate due to the unknown daily virus isolation in a small area. Thus, we designed a simulation study to mimic outbreaks and compared the performance of the proposed SPR approach with the popular modified CUSUM. First, time series of counts on *T* = 760 calendar days, denoted by *w*_{1}, …, *w*_{
T
}, were generated from negative binomial distributions with mean *μ*_{
i
} = exp(5 + 0.2*x*_{1i} + *x*_{2i}) and dispersion parameter *κ*_{
i
} = 0.2/*μ*_{
i
} so that variance of *w*_{
i
} equals 1.2*μ*_{
i
} for the *i*^{th} day. The covariates *x*_{1i} and *x*_{2i} were generated to represent the seasonal pattern and day-of-the-week profile, respectively. The consecutive sets of 30 elements (assumed to be 30 days per month) *x*_{1i} were generated from normal distributions with means of 2, 2, 2, 1, 0, −1, −2, −2,-2, −1, 0, and 1 for the 12 months, and standard deviation 0.1, to reflect more visits in winter and fewer visits in summer. The 7 elements of *x*_{2i} were generated from normal distributions with means 0.1, 2, 1.5, 1.5, 1.5, 1.5, and 1 and standard deviation 0.1 for each week repeatedly to mimic the much fewer visits on Sunday and more on Monday observed from the empirical data.

We then assumed there is an epidemic period starting on day 601 and lasting for 40 days. The daily new cases were determined by \( {o}_i=\theta \times sd\left({w}_i\right)\times \exp \left[1-\frac{{\left(i-621\right)}^2}{400}\right] \) for the epidemic period, where *θ* is a fixed parameter of signal-to-noise ratio, and \( sd\left({w}_i\right)=\sqrt{1.2{\mu}_i} \) is the standard deviation of *w*_{
i
}. The final number of visits *y*_{
i
} is *w*_{
i
} plus the integer part of *o*_{
i
} for the epidemic period, and *w*_{
i
} for the other regular days.

We considered three signal-to-noise ratios *θ* = 1, 3, 5. To get an insight of how strong the signals appear in the generated data, we give three daily series of visits simulated from the models with the three signal-to-noise ratios in Additional file 1, Additional file 2, Additional file 3. The patterns of the simulated daily visits look like what we might observe in the real world. The signals were hardly seen in the series of counts with weak signal-to-noise ratio *θ* = 1 but were very clear with *θ* = 5. We then used the models to simulate 1000 data sets for each *θ* value. For each data set, negative binomial regression models were fitted to a series of *T*_{0} = 360 days before each day starting from the 361^{th} to the 760^{th} day. The model included an intercept and 11 dummy variables representing months 2 to 12 and 6 dummy variables for Monday to Saturday as covariates. Note that we have simply used 11 dummy variables for describing seasonal pattern, which are not the true seasonal pattern used for generating the observations. This is to mimic the fact that we often don’t have perfect models in practice. We considered two thresholds with *α* = 0.025 and *α* = 0.005 in the simulation study with expectation of false alarm rates a little higher than 2.5% and 0.5%, respectively. For performance evaluation, we defined a measure of days to detection as the outbreak issuing day during the outbreak period minus 601, the true outbreak day. If no outbreak warning was issued during the epidemic period, event undetected is counted once. False alarm rate is defined as the percentage of signals issued on the remaining 360 regular days. We summarized the three performance measures from the 1000 simulations in the following. The mean days to detection is calculated by the average of days to detection on those simulated data for which all methods had successfully issued an alarm during the epidemic period. The probability of the epidemic going undetected is the number of times the epidemic was undetected divided by 1000. The mean false alarm rate is the average of the 1000 false alarm rates for each method.

The modified CUSUM method implemented in the Early Aberration Reporting System (EARS) is given by the following formula [29]. The deviation statistic is calculated on the observed counts and is denoted by \( {C}_3(t)={\displaystyle \sum_{i=t}^{t-2} \max \left[0,{C}_2(i)-1\right]} \), where \( {C}_2(i)=\frac{y_i-\overline{y}}{s} \), the sample mean \( \overline{y}=\frac{1}{k}{\displaystyle \sum_{j=i-3}^{i-k-2}{y}_j} \), and sample variance \( {s}^2=\frac{1}{k-1}{\displaystyle \sum_{j=i-3}^{i-k-2}{\left({y}_j-\overline{y}\right)}^2} \) for some *k* < (*t* − 3). We chose *k* = 7 as suggested in EARS. It signals on day *t* when the deviation statistic *C*_{3}(*t*) is larger than a threshold, the value of which is often difficult to determine. It has been suggested that deviation statistic *C*_{3}(*t*) was better calculated on model-based residuals. For a fair comparison in the simulation study, we also considered replacing original observations *y*_{
i
} with the Pearson residuals *R*_{
i
} from the same fitted negative binomial regression models in the above C3 algorithm. For the simulation study, thresholds of 1.28 and 2.88 were chosen for the CUSUM methods applied to observations and Pearson residuals, respectively, to keep the false alarm rates higher than those of the proposed SPR methods.