Generalized additive models (GAMs) are a form of non-parametric/semi-parametric regression that can analyze binary or continuous outcome data while simultaneously adjusting for covariates [1]. The methods we propose use smooth terms to model space over overlapping periods of time, while covariates are controlled for parametrically to minimize data requirements. We use a loess smooth, a locally weighted regression smoother that adapts to changes in data density, which is particularly useful when working with population-based data [1]. GAMs may exhibit biased behavior at the edges of the data, and a benefit of a loess smooth is its tri-cube weight function, which down-weights points far from the target point and therefore suggests smaller edge effects than for nearest-neighbor smoothers with the same span. The amount of smoothing performed by loess is determined by minimizing the Akaike Information Criterion (AIC), which approximates deviance-based cross-validation by penalizing a model's average deviance by its number of degrees of freedom. Given an emergency department (ED) dataset of longitude, latitude, day of the ED visit, case status, and the covariate age, we use longitude and latitude to model space and day to model time. Case status is the health outcome of interest and can be any illness or symptom presented for emergency care.
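The AIC-based span selection can be sketched as follows. This is a minimal illustration, not the authors' R code: the candidate spans, deviance values, and effective degrees of freedom are hypothetical placeholders for statistics that would come from refitting the loess-smoothed GAM at each candidate span.

```python
# Sketch: choose the loess span minimizing AIC = deviance + 2 * effective df.
# The fit statistics below are hypothetical; in practice each entry would come
# from refitting the GAM at that span.

def aic(deviance, edf):
    """AIC as model deviance penalized by effective degrees of freedom."""
    return deviance + 2.0 * edf

# span -> (deviance, effective degrees of freedom); hypothetical values
candidate_fits = {
    0.05: (980.4, 40.1),
    0.10: (995.2, 21.3),
    0.30: (1012.7, 8.6),
    0.50: (1025.9, 5.2),
}

# small spans fit closely but pay a large df penalty; pick the minimum-AIC span
optimal_span = min(candidate_fits, key=lambda s: aic(*candidate_fits[s]))
print(optimal_span)
```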
We first examined the ED dataset temporally using a univariate smooth (S) of time (t) measured in days

ln[odds(t)] = S(t) + γ′z   (1)
where the left-hand side is the log of the illness odds at time (t), z is a vector of covariates (in our example, age), and γ is a vector of parameters. We used the AIC to select the statistically optimal degree of smoothing, also referred to as the optimal span size, which represents the proportion of the data used in the smooth window [1]. We then divided the ED dataset into smaller data frames of overlapping time periods based on the optimal span size from the temporal analysis [2]. By dividing the dataset into data frames of overlapping time spans, and then performing a spatial analysis within each of those time spans, we essentially smoothed over time. The optimal length of each data frame is simply the size of the smoothing window from the temporal model (1). Figure 1 illustrates how a hypothetical dataset is divided into data frames of overlapping time spans. For a span of 0.1 (or 10% of the data points), an ED dataset of 10,000 records will have an optimal length of 1,000 records in each data frame, the first data frame consisting of records 1 to 1,000 (Figure 1a).
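The division into overlapping data frames can be sketched in a few lines; the dataset size and span below are the hypothetical values from the text, not fixed parameters of the method.

```python
# Sketch: the length of each time-span data frame equals the temporal smoothing
# window from model (1) applied to the full dataset (hypothetical sizes).

n_records = 10_000   # records in the ED dataset, ordered by visit day
span = 0.1           # optimal span from the temporal model (1)

frame_length = round(span * n_records)           # records per data frame
first_frame = list(range(1, frame_length + 1))   # record numbers 1..1000

print(frame_length, first_frame[0], first_frame[-1])
```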
For the purposes of understanding illness patterns, data frames should be spaced far enough apart in time so that new cases have occurred but with sufficient overlap of data for smoothing purposes. To determine the optimal time interval between subsequent data frames, we use the optimal span for model (1) of the first data frame. The optimal interval is calculated as the number of days corresponding to the nth record, where n is the length of the optimal smoothing window for the first data frame. For the hypothetical first data frame with 1,000 records, if we determine the smoothing window to be 0.3 (or 30% of the data points), then the day corresponding to record 300 (30% of 1,000 records) is the optimal interval. The optimal interval is then added to the median day of the first data frame (i.e., the day corresponding to the median record) to arrive at the median day of the second data frame. Again, for the hypothetical first data frame, if the days corresponding to records 300 and 500 (the median record) are day 25 and day 35, then the median day for the second data frame is day 35 plus an interval of 25 days, or day 60 (Figure 1b). The second data frame is created by adding records from days before and after the median day of the second data frame until the optimal length is reached. We repeat the process until the last day of a data frame is also the last day in the ED dataset.
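The spacing arithmetic above, using the text's hypothetical numbers (a 1,000-record first frame, a within-frame span of 0.3, record 300 falling on day 25, and the median record 500 falling on day 35), reduces to a short calculation; `day_of_record` is an illustrative stand-in for looking up a record's visit day.

```python
# Sketch of the spacing rule between successive time-span data frames,
# using the hypothetical numbers from the text.

frame_length = 1000        # records in the first data frame
within_frame_span = 0.3    # optimal span from model (1) fit to that frame

# hypothetical lookup: record number -> day of the ED visit
day_of_record = {300: 25, 500: 35}

interval_record = round(within_frame_span * frame_length)  # record 300
optimal_interval = day_of_record[interval_record]          # day 25
median_day_frame1 = day_of_record[frame_length // 2]       # median record 500 -> day 35

# median day of the next frame = median day of this frame + optimal interval
median_day_frame2 = median_day_frame1 + optimal_interval
print(median_day_frame2)  # day 60
```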
We then examined spatial variation within each time-span data frame using a bivariate smooth (S) of longitude (x1) and latitude (x2)

ln[odds(x1, x2)] = S(x1, x2) + γ′z   (2)

where the left-hand side is the log of the illness odds at location (x1, x2), z is a vector of covariates (in our example, age), and γ is a vector of parameters [4]. We then used the AIC to select the statistically optimal degree of spatial smoothing. We created a rectangular grid of points covering the study area, using the minimum and maximum longitude and latitude coordinates of the ED dataset as its dimensions. We used ArcGIS to clip grid points lying outside the outline map of the study area, in areas where people cannot live (e.g., national parks), and in areas of low population density along the edges of the dataset. Because GAMs may exhibit biased behavior at the edges of the data [1], we do not make predictions from our spatial models in areas of low population density along the geographic edges of the study area [3]. We used model (2) to estimate the adjusted log odds at each grid point on the study area map. The same prediction grid was used for all spatial analyses, and all analyses were adjusted for age.
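Building the rectangular prediction grid can be sketched as below; the bounding-box coordinates and the 50 × 50 resolution are hypothetical, and clipping to the study area outline would then be done in GIS software as described above.

```python
# Sketch: a rectangular grid of prediction points spanning the study area's
# bounding box. Coordinates and resolution are hypothetical placeholders.

def make_grid(lon_min, lon_max, lat_min, lat_max, n=50):
    """Return n x n (longitude, latitude) points covering the bounding box."""
    lon_step = (lon_max - lon_min) / (n - 1)
    lat_step = (lat_max - lat_min) / (n - 1)
    return [(lon_min + i * lon_step, lat_min + j * lat_step)
            for i in range(n) for j in range(n)]

# hypothetical bounding box roughly around Cape Cod
grid = make_grid(-70.8, -69.9, 41.5, 42.1)
print(len(grid))  # 2500 points before clipping
```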
GAMs also provide a framework for hypothesis testing. For each time-span data frame, we first tested the global null hypothesis that case status does not depend on the smooth term, using the difference in deviance between model (2) with and without the smooth term. We estimated the distribution of the global statistic under the null hypothesis using a permutation test [4]: for each permuted dataset, we ran the GAM using the optimal span of the data frame and computed the deviance statistic. Our previous work suggested the global test may have an inflated type I error rate when applied with a significance cut-off of 0.05; a cut-off of 0.025 provided an appropriately sized test [18]. Because a separate spatial analysis is performed on each time-span data frame, we applied an even more conservative cut-off of 0.005 as a screening tool for possibly meaningful associations, to account for the multiple analyses. Despite this adjustment for multiplicity, our results should be considered hypothesis generating rather than hypothesis confirming. We describe results as “significant” if the associated p-values are less than 0.005, but acknowledge that some results may still be due to chance. If the global deviance test indicated that the map was unlikely to be flat, we next located areas of the map that exhibited clusters of unusually high or low odds of illness. We examined point-wise departures from the null hypothesis of a flat surface using the same set of permutations used to calculate the global statistics. Once we had a distribution of log odds at every grid point, points that ranked in the lower or upper 2.5% of the point-wise permutation distributions were identified as areas of significantly decreased odds (“cold spots”) or increased odds (“hot spots”).
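The permutation machinery can be sketched with a stand-in statistic. This is not the authors' deviance test: a simple difference in group means replaces the deviance difference from refitting model (2), the data are simulated with an obvious case–control separation, and the seed and 999 permutations are illustrative choices. In the actual method, the same permuted fits would also supply the point-wise distributions used to rank hot and cold spots.

```python
import random

# Sketch: permute case status, recompute a statistic on each permuted dataset,
# and rank the observed statistic against the permutation distribution.
random.seed(1)

def statistic(labels, values):
    """Stand-in for the deviance statistic: absolute difference in means."""
    cases = [v for l, v in zip(labels, values) if l == 1]
    controls = [v for l, v in zip(labels, values) if l == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

# simulated data with a clear case-control separation
labels = [1] * 50 + [0] * 50
values = [random.gauss(1.5, 1.0) for _ in range(50)] + \
         [random.gauss(0.0, 1.0) for _ in range(50)]

observed = statistic(labels, values)

null_stats = []
for _ in range(999):
    shuffled = labels[:]
    random.shuffle(shuffled)                     # permute case status
    null_stats.append(statistic(shuffled, values))

# p-value: proportion of permutations at least as extreme as observed
p_value = (1 + sum(s >= observed for s in null_stats)) / (1 + 999)
print(p_value)
```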
The resulting log odds from our spatial analyses were converted to odds ratios (ORs) using the whole study population as the reference: the predicted odds were divided by the odds from the reduced model omitting the smooth term. Adjusted ORs and pointwise permutation test ranks were exported from R [19] into ArcGIS [20] for mapping. R code and simulated data are freely available [21]. The results for each time-span data frame were displayed using the same dark-blue-to-dark-red continuous color scale and the same range of odds ratios to make the maps visually comparable. We denoted cold and hot spots on the maps with black contour lines created from the pointwise permutation test ranks. Maps were saved as image files and used to create a storyboard in Windows Movie Maker [22]. Each map plays for 0.5 seconds before transitioning to the next, and the resulting movie shows how illness risk varies over space and time.
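The OR conversion is a simple exponentiated difference of log odds; the predicted values below are hypothetical placeholders for model output.

```python
import math

# Sketch: convert predicted log odds to odds ratios with the whole study
# population as the reference, by dividing each grid point's predicted odds
# by the odds from the reduced model (no smooth term). Values are hypothetical.

log_odds_full = [-0.9, -0.4, 0.1, 0.6]   # model (2) predictions at grid points
log_odds_reduced = -0.4                  # reduced model omitting the smooth term

# odds ratio = exp(log odds) / exp(reference log odds) = exp(difference)
odds_ratios = [math.exp(lo - log_odds_reduced) for lo in log_odds_full]
print([round(o, 3) for o in odds_ratios])
```

A grid point whose full-model log odds equal the reduced-model log odds gets an OR of exactly 1, which is what anchors the blue-to-red color scale at the population average.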
To illustrate our methods with a real-world application, we used daily ED data from the Cape Cod region of Massachusetts, collected by the Institute for Health Metrics from three Cape Cod hospitals. Each record in the dataset includes the geocoded patient billing address (longitude and latitude), the date of the emergency department visit, the patient’s age, and the syndrome group of the patient’s complaint [5]. We converted date to a continuous measure of time by setting the first record to day 1 and numbering subsequent days by the elapsed time between records. We used the optimal span sizes for the temporal analysis of the entire dataset (0.03) and of the first time-span data frame (0.25) to divide the dataset into a series of overlapping time-span data frames. For this application, we examined the space-time distribution of gastrointestinal illness in Cape Cod using the proposed methods [23]. Patients who presented with respiratory illnesses were chosen as the control group: respiratory illnesses displayed a consistent temporal pattern during the study period, with the expected increase in winter, and a geographic distribution comparable to population distributions from census data. When controls are appropriately sampled from the population giving rise to the cases, the case–control ratio (illness odds) in a subset of the area should be proportional to the incidence rate [16]. The IRB of the Boston University Medical Campus approved this research.
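The date-to-day conversion can be sketched as follows; the visit dates are hypothetical, and ties (multiple visits on one day) map to the same day number.

```python
from datetime import date

# Sketch: convert visit dates to a continuous day measure, with the first
# record set to day 1. Dates are hypothetical examples.

visit_dates = [date(2020, 1, 15), date(2020, 1, 15),
               date(2020, 1, 20), date(2020, 2, 1)]

first = min(visit_dates)
days = [(d - first).days + 1 for d in visit_dates]
print(days)  # [1, 1, 6, 18]
```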