Spatial autocorrelation analysis of health care hotspots in Taiwan in 2006

Background Spatial analytical techniques and models are often used in epidemiology to identify spatial anomalies (hotspots) in disease regions. These analytical approaches can be used not only to identify the location of such hotspots, but also to characterize their spatial patterns. Methods In this study, we utilize spatial autocorrelation methodologies, including Global Moran's I and Local Getis-Ord statistics, to describe and map spatial clusters, and the areas in which these are situated, for the 20 leading causes of death in Taiwan. In addition, we fit a logistic regression model to test for similarity and dissimilarity by gender. Results Gender is compared in an effort to identify common spatial risks. The mean found by local spatial autocorrelation analysis is utilized to identify spatial cluster patterns. There is naturally great interest in discovering the relationship between the leading causes of death and well-documented spatial risk factors. For example, in Taiwan, we found that the geographical distribution of clusters with a high prevalence of tuberculosis closely corresponds to the location of aboriginal townships. Conclusions Cluster mapping helps to clarify issues such as the spatial aspects of both internal and external correlations for leading health care events. This is of great aid in assessing spatial risk factors, which in turn facilitates the planning of the most advantageous types of health care policies and the implementation of effective health care services.


Background
The Taiwan National Health Insurance (NHI) program was implemented in 1995. The coverage rate of the program has increased from 92.41% in 1995 to more than 96.16% in 2000. Coverage further increased to 98% after the inclusion of active military forces in 2001. At the beginning of 2004, NHI data related to medical care, such as the leading causes of death, were reclassified and reprocessed in relation to smaller units or areas (e.g., precincts or townships rather than the country as a whole). Regional data from the statistical analysis system (SAS) program are announced publicly by the NHI in regular annual reports (e.g., NHI, 2006) [1]. These reports provide an accurate and reliable data source to help investigators explore health care issues in Taiwan.
In the study of spatially-related objects or characteristics, one first describes the regional characteristics that differentiate areas one from another, and then proceeds with the analysis of spatial interrelations [2]. Common spatial techniques used in health research include disease mapping, clustering techniques, diffusion studies, identification of risk factors through map comparisons and regression analysis [3]. Spatial clustering techniques are important for statistical consideration, and form the beginning steps in the development of models for predicting disease risk sites. Disease risk sites are, specifically, areas located close to one another that tend to share similar disease risk factors, because they share similar environments and are also often connected by the spread of communicable disease via vectors or host dispersal [4]. Cuzick and Edwards (1990) proposed three general methodological approaches that can be utilized for the detection of clustering: the first is based on cell counts; the second on autocorrelative adjacencies of cells with high counts; and the third based on determining the distance between events [5]. Numerical methods have been extensively adopted for spatial cluster detection in health research and epidemiology, especially for the processing of areal data. The analytical approaches include the following: join-count statistics [6]; Ohno statistics [7]; Poisson statistics [8]; Global Moran's I [9][10][11]; Global Geary's C [9][10][11]; General Getis-Ord's G [12]; Local Moran's I [13]; and Local Gi(d) and Gi*(d) [12][13][14]. Spatial autocorrelation statistics such as the Moran's I and Geary's C methods are global, in the sense that they estimate the overall degree of spatial autocorrelation in a dataset. The possibility of spatial heterogeneity suggests that the estimated degree of autocorrelation may vary significantly across geo-space. 
Local spatial autocorrelation statistics provide estimates which are disaggregated to the level of the spatial analysis units, allowing assessment of dependency relationships in different areas. Local Gi(d) and Gi*(d) statistics can be used to make autocorrelation comparisons in different neighborhoods. A global average is used to help identify local regions of strong autocorrelation. Local versions of the Moran's I and Geary's C statistics are also available.
In this study, we develop a method for ascertaining the spatial clustering associated with the 20 leading health care events, based on medical care data collected by the Taiwan NHI agency. We also investigate potential spatial risks which contribute to these health care events and redefine epidemiologic and spatially referenced data.

Study area
The study area includes the main island of Taiwan only (excluding all islets), comprising more than 22 million inhabitants in the year 2000, living in an area of 36,000 km². There are a total of 349 local administrative government areas, which include 5 main urban areas, 2 secondary urban areas, 187 rural townships, and 29 aboriginal townships (Figure 1). According to a bulletin from the Ministry of Interior issued in 1996, urban areas are regions having at least one metropolitan center and can include neighboring cities and townships which share socioeconomic activities. Main urban areas are defined as those with a population larger than one million, specifically, Taipei-Keelung, Kaohsiung, Taichung-Changhua, Jhongli-Taoyuan and Tainan. Secondary urban areas are defined as those with a residential population ranging from 0.3 to 1 million (for example, Hsinchu and Chiayi).

Data collection and management
The data were collected from contractual medical care institutions, which in this study means institutions where the NHI covers the costs of prescription medicines and treatment at outpatient clinics. Such facilities accumulate detailed databases of medical costs for inpatient care. Outpatient cases were classified according to disease codes, as defined in the 1975 edition of "The International Classification of Diseases, 9th Revision, Clinical Modification" (hereafter, ICD 9 CM). Criteria for refining the data were first established. Some data were excluded from the final statistical data set, for example cases in which the disease defied code classification or the ID numbers were mismatched. Disease codes were classified by gender and age. Cases with the same ID number but different diseases were counted as different instances [1].
Demographic information was provided by the Ministry of Interior [15]. The smallest administrative units coded for examination of the various disease cases or health care events were precincts and townships. Age-adjusted standard prevalence rates were then calculated by direct adjustment, using the world population in 2000 as the standard population [16]. The results showed the leading causes of death for males and females in each township.
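The direct adjustment described above can be illustrated with a minimal sketch. The function name, the three age bands, and all counts and weights below are hypothetical and chosen only for illustration; they are not the study's data or the actual world standard population shares.

```python
def direct_age_standardized_rate(cases, population, standard_weights, per=100_000):
    """Directly standardized rate: weighted sum of age-specific rates.

    cases[k] / population[k] is the age-specific rate in age band k, and
    standard_weights[k] is the share of the standard population in band k.
    """
    if not (len(cases) == len(population) == len(standard_weights)):
        raise ValueError("all inputs need one entry per age band")
    total_weight = sum(standard_weights)  # normalize in case weights are counts
    rate = 0.0
    for c, n, w in zip(cases, population, standard_weights):
        rate += (c / n) * (w / total_weight)
    return rate * per


# Hypothetical township with three age bands (illustrative numbers only).
cases = [10, 30, 60]
population = [5000, 3000, 2000]
standard_weights = [0.6, 0.3, 0.1]  # made-up standard population shares

crude_rate = sum(cases) / sum(population) * 100_000
adjusted_rate = direct_age_standardized_rate(cases, population, standard_weights)
print(round(crude_rate, 1), round(adjusted_rate, 1))
```

In this made-up township the adjusted rate is lower than the crude rate because the standard population gives less weight to the oldest band, where the age-specific rate is highest.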

Global Moran's I statistic
The global spatial autocorrelation statistical method was used to measure the correlation among neighboring observations, and to find the patterns and levels of spatial clustering among neighboring districts [17]. The Moran's I statistic, which is similar to the Pearson correlation coefficient [18], is calculated by

$$I = \frac{N}{S_0} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - u)(x_j - u)}{\sum_{i}(x_i - u)^2}$$

where $N$ is the number of districts; $w_{ij}$ is the element in the spatial weight matrix corresponding to the observation pair $i, j$; and $x_i$ and $x_j$ are observations for areas $i$ and $j$ with mean $u$ and $S_0 = \sum_{i}\sum_{j} w_{ij}$. The weights are row-standardized, so that $\sum_{j} w_{ij} = 1$ for each district $i$. The first step in the spatial autocorrelation analysis is to construct a spatial weight matrix that contains information about the neighborhood structure for each location. Adjacency is defined as immediately neighboring administrative districts, inclusive of the district itself. Non-neighboring administrative districts are given a weight of zero.

Figure 1. Map of urban areas and aboriginal townships in the study area.
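As a rough illustration of how the global statistic behaves, the following sketch computes Moran's I for a toy arrangement of four districts in a row. The helper names and the data are hypothetical. For simplicity the sketch assumes the common convention of a zero diagonal (a district is not its own neighbor), which may differ from the neighborhood definition used in this study.

```python
def row_standardize(weights):
    """Scale each row of a weight matrix so that it sums to 1 (rows of zeros stay zero)."""
    result = []
    for row in weights:
        total = sum(row)
        result.append([w / total if total else 0.0 for w in row])
    return result


def morans_i(values, weights):
    """Global Moran's I = (N / S0) * sum_ij w_ij (x_i - u)(x_j - u) / sum_i (x_i - u)^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


# Hypothetical example: four districts in a row, each adjacent to the next.
n = 4
binary = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
w = row_standardize(binary)

smooth_gradient = morans_i([1, 2, 3, 4], w)   # similar values sit next to each other
checkerboard = morans_i([1, 4, 1, 4], w)      # neighbors always differ
print(smooth_gradient, checkerboard)
```

Positive values indicate that similar rates tend to be adjacent, and negative values indicate that adjacent districts tend to differ.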

Determining spatial weights/connectivity matrices
Spatial contiguity for polygons is the property of sharing a common boundary or vertex. Contiguity analysis is an important method for assessing unusual features in the connectivity distribution [13,19]. The Queen's measure of contiguity incorporates both the Rook (shared boundary) and Bishop (shared vertex) relationships into a single measure [19].
The administrative districts considered in this study are highly irregular in both shape and size. We compare the first order Queen's polygon contiguity method and a distance-based method, to choose the most appropriate method for quantifying the spatial weights matrix for analysis of the connectivity distributions between neighbors. Figure 2 shows the results of both the distance-based and the first order Queen's contiguity analysis for the administrative district boundaries. When the distance-based method is used there is a larger percentage of contiguity connection between neighbors (greater than 15), whereas the maximum value for the first order Queen's contiguity is 10. The differences between the distance-based contiguity and the first order Queen's contiguity methods are obvious. The connectivity distribution results obtained with the latter highlight the marked parities in connectivity. Based on the results of the connectivity distribution, we construct a first order Queen's polygon contiguity weight file for districts which share common boundaries and vertices. The spatial weights/connectivity matrices are utilized in the following local Gi*(d) calculations.

Figure 2. Results of the analysis of the connectivity distributions of neighboring administrative district boundaries in Taiwan. Panels: Distance-Based Contiguity (at least one neighbor); First order Queen's Contiguity.
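The difference between Rook and Queen's contiguity, and the idea of a connectivity distribution (the number of neighbors per unit), can be sketched on a regular grid. This is only a stand-in: the districts in this study are irregular polygons, and the function name and the 3 x 3 grid are assumptions made for illustration.

```python
def grid_contiguity(n_rows, n_cols, queen=True):
    """Binary first order contiguity matrix for the cells of a regular grid.

    queen=True  -> cells sharing an edge or a corner are neighbors (Queen).
    queen=False -> only cells sharing an edge are neighbors (Rook).
    """
    n = n_rows * n_cols
    w = [[0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr, dc) == (0, 0):
                        continue  # a cell is not listed as its own neighbor here
                    if not queen and abs(dr) + abs(dc) != 1:
                        continue  # Rook excludes diagonal (corner-only) contacts
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        w[i][rr * n_cols + cc] = 1
    return w


queen_w = grid_contiguity(3, 3, queen=True)
rook_w = grid_contiguity(3, 3, queen=False)

# Connectivity distribution: number of neighbors of each cell.
queen_counts = [sum(row) for row in queen_w]
rook_counts = [sum(row) for row in rook_w]
print(queen_counts)
print(rook_counts)
```

On the grid, corner, edge, and center cells have 3, 5, and 8 Queen neighbors but only 2, 3, and 4 Rook neighbors, which is the kind of per-unit neighbor count that a connectivity histogram summarizes.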

Local Gi*(d) statistic
The local Gi*(d) statistic (local G-statistic) is used to test the statistical significance of local clusters (as related to the 20 leading causes of death), and to determine the spatial extent of these clusters [12,14]. The local G-statistic is useful for identifying individual members of local clusters by determining the spatial dependence and relative magnitude between an observation and neighboring observations [20]. The local G-statistic can be written as follows [12,21,22]:

$$G_i^*(d) = \frac{\sum_{j} w_{ij}(d)\,x_j - W_i\,\bar{x}}{s\,\left\{\left[N S_{1i} - W_i^2\right] / (N - 1)\right\}^{1/2}}$$

where $x$ is a measure of the prevalence rate of each leading cause of death event within a given polygon (i.e., each administrative district); $w_{ij}$ is a spatial weight that defines neighboring administrative districts $j$ to $i$; $W_i$ is the sum of the weights $w_{ij}$; $S_{1i} = \sum_{j} w_{ij}^2$; $N$ is the number of districts; and $\bar{x}$ and $s$ are the mean and standard deviation of $x$ over all districts.
Developing the spatial weights $w_{ij}$ is the first step in calculating Gi*(d). The spatial weight matrix includes $w_{ij} = 1$. In this study, adjacency is defined using a first order Queen's polygon contiguity weight file which has been constructed based on the districts that share common boundaries and vertices.
Non-neighboring administrative districts are given a weight of zero. The neighbors of an administrative district are defined as those with which the administrative district shares a boundary or vertex. A simple 0/1 matrix is formed, where 1 indicates that the municipalities have a common border or vertex, and 0 otherwise [21,23].
The local G-statistic includes the value at i itself in the calculation. Assuming that Gi*(d) is approximately normally distributed [12], the output of Gi*(d) can be treated as a standard normal variate with an associated probability from the z-score distribution [24]. Z-scores beyond ±1.96 indicate significant spatial clustering at the 95 percent level of a two-tailed normal distribution, but only positively significant clusters (z-score greater than +1.96) are mapped.
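A minimal sketch of the local G-statistic and the hot spot rule described above is given below. It assumes the standardized form of Gi*(d) with a binary 0/1 contiguity matrix and a self-weight of 1, and it uses a hypothetical row of six districts with made-up rates; none of the names or numbers come from the study.

```python
import math


def local_g_star(values, weights):
    """Gi*(d) z-scores for every district, with district i included in its own neighborhood.

    z_i = (sum_j w_ij x_j - W_i * mean) / (s * sqrt((N * S1_i - W_i^2) / (N - 1)))
    where s is the standard deviation of the values over all N districts.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        row = list(weights[i])
        row[i] = 1  # the starred statistic counts district i itself
        w_sum = sum(row)
        s1 = sum(w * w for w in row)
        numerator = sum(w * x for w, x in zip(row, values)) - w_sum * mean
        denominator = s * math.sqrt((n * s1 - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Hypothetical example: six districts in a row, each adjacent to the next.
rates = [1, 1, 1, 1, 10, 10]
n = len(rates)
adjacency = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

z_scores = local_g_star(rates, adjacency)
hot_spots = [i for i, z in enumerate(z_scores) if z > 1.96]  # positive clusters only
print([round(z, 3) for z in z_scores], hot_spots)
```

Only districts whose neighborhood sum is high relative to the global mean exceed the +1.96 threshold. Note that the denominator is zero if a district is adjacent to every other district, so a production implementation would need to guard against that case.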

Logistic regression analysis
Similarities between the spatial distribution patterns for males and females are displayed on maps. In addition to mapping, logistic regression is also performed. The binary response indicates whether there is significant autocorrelation between administrative districts or areas: the correlation is classed as higher if the absolute value of the z-score of the local G-statistic is larger than 1.96, and as lower otherwise. Gender is considered as an explanatory variable in the logistic regression model. Thus the model is expressed as

$$\log\left[\frac{\Pr(\text{Higher correlation})}{\Pr(\text{Lower correlation})}\right] = \beta_0 + \beta_1 \times \text{Gender}$$

where $\beta_0$ and $\beta_1$ are the logistic regression coefficients of the model. Pr(Higher correlation) and Pr(Lower correlation) denote the "Higher" and "Lower" correlation probabilities, respectively. Computation is performed in R (version 2.8.1).
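Because the model above has a single binary explanatory variable, its maximum likelihood estimates can be written in closed form from a 2 x 2 table of counts. The following sketch uses that shortcut in Python rather than the R fit reported in the study; the function name, the 0/1 coding of gender, and the district counts are hypothetical.

```python
import math


def logit_binary_covariate(high0, low0, high1, low1):
    """Logistic regression of a binary response on one binary covariate.

    high0/low0: districts with higher/lower correlation when the covariate is 0.
    high1/low1: the same counts when the covariate is 1.
    Returns (beta0, beta1, standard error of beta1, two-sided Wald p-value).
    All four counts must be positive.
    """
    beta0 = math.log(high0 / low0)           # log odds in the reference group
    beta1 = math.log(high1 / low1) - beta0   # log odds ratio between groups
    se = math.sqrt(1 / high0 + 1 / low0 + 1 / high1 + 1 / low1)
    z = beta1 / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return beta0, beta1, se, p_value


# Hypothetical counts: identical cluster membership for both genders.
_, b1_same, _, p_same = logit_binary_covariate(40, 309, 40, 309)

# Hypothetical counts: clearly different cluster membership by gender.
_, b1_diff, _, p_diff = logit_binary_covariate(20, 80, 50, 50)

print(round(b1_same, 3), round(p_same, 3))
print(round(b1_diff, 3), p_diff < 0.05)
```

A p-value above 0.05 for the gender coefficient corresponds to no statistically significant dissimilarity between the male and female patterns for that cause of death.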

Results
The z-score outcomes as calculated by the Gi*(d) statistic are categorized as clusters or non-clusters at the 5% significance level. This is followed by cross tabulation with the top 20 leading health problems. All results are summarized in Figure 6.
Furthermore, we find that there is no statistically significant dissimilarity (p-value > 0.05) between the spatial distribution patterns for males and females for fifteen out of twenty spatial clusters. We do find dissimilarities for five: cerebrovascular disease; heart disease; nephritis, nephrotic syndrome and nephrosis; ulcers of stomach and duodenum; and certain conditions originating in the perinatal period. All results are shown in Table 2.

Discussion
Nearby locations are likely to possess similar attributes. In other words, everything is related to everything else, but nearby things are more closely related than distant things [25]. In epidemiology, a cluster becomes apparent when a number of health events occur close together in space and/or time.
The evaluation of spatial distributions as a measure of disease risk may provide etiological insights [26]. Spatial autocorrelation is defined as the relation between the values of a single variable at different locations. This relation is attributable to the geographic arrangement of areal units on a map and can be used to identify the degree of spatial clustering [27,28]. In this study, the local G-statistic is used to measure the degree of spatial clustering and map the geographic patterns of the areal units. A spatial cluster of a leading cause of death (also called a hot spot) is defined as a district with a z-score larger than 1.96. In epidemiology, hot spots are of interest because of their correlation to etiology. For this reason, we indicate the hot spots of the 20 leading causes of death, as obtained from our analysis, and identify their spatial locations. Information about spatial location is useful for detecting risk factors from a spatial viewpoint. A more detailed survey of these identified hot spots may reveal important clues as to risk factors for these diseases.
To appropriately use public health data aggregated according to irregular administrative districts, it is important to decide how the local neighborhood (as defined by the spatial weights matrix) is specified for the local measures of spatial autocorrelation. In general, spatial autocorrelation may be strongest between the nearest neighbors. As the neighborhood is enlarged, this autocorrelation weakens [29]. Formal guidance for choosing a proper spatial weight matrix has not yet been developed [30,31]. Therefore, the spatial weight matrix in this study was chosen after a comparison of the connectivity distributions of neighbors obtained with the distance-based contiguity and the first order Queen's contiguity methods. However, the sensitivity of the results to different spatial weight matrices still needs to be assessed in further studies.
The modifiable areal unit problem (MAUP) is a phenomenon whereby different results are obtained from analysis of the same data, grouped into different sets of areal units. The MAUP can be subdivided into two separate effects that usually occur simultaneously during the analysis of aggregated data. The scale effect causes variation in statistical results given different levels of aggregation. In other words, association between variables depends on the size of the areal units for which data are reported. Generally, correlation increases as the size of the areal unit increases. The zone effect describes variation in correlation statistics caused by the regrouping of data into different configurations but with the same scale. These effects occur because spatial processes generating the observed data may exist at scales and for particular areal units that may be reflected more or less accurately by the boundaries in use [32]. Studies of the MAUP based on empirical data provide evidence that it is not possible to define an ideal single census geography that captures all the processes for all variables [32]. Furthermore, the internal composition of the areal units may not be homogeneous, particularly for disease distribution. Further to this, Matisziw et al. (2008) suggested that downscaling the spatial structure of polygonal units should provide valuable information on the spatial distribution of disease [34].
This is the first study of the spatial distribution of the 20 leading health problems in Taiwan. There have been few previous ecological studies related to health care issues and their correlation to risk factors in Taiwan, although malignant neoplasms and tuberculosis have been documented and are discussed briefly below. We hope that this study of the spatial clustering of Taiwan's leading health issues can provide help for the study of spatial epidemiology.
Residents along the southwestern and northeastern coasts of Taiwan drank well water contaminated with a high concentration of arsenic before the establishment of the public water system [35]. Residents in these areas were found to have an increased risk of malignant neoplasms, including cancers of the liver, nasal cavity, lung, skin, bladder and kidney, for both males and females, as well as prostate cancer in males [36,37]. Although well water was no longer used for drinking or cooking after the mid-1970s, there were still significantly increased risks of urinary cancers [38,39] and lung cancer [39,40] in the arseniasis-endemic areas of southwestern and northeastern Taiwan.
Our results showed clusters for malignant neoplasms in these arseniasis-endemic areas, but also revealed a new malignant neoplasm cluster (for females) in the northern coastal region of Taiwan. This is worthy of more investigation in the future.
According to data from the Center for Disease Control in Taiwan, there is a four-fold higher incidence of tuberculosis in aboriginal portions of the population than in people of Han ethnicity (Hans) [41]. Environmental factors such as hygiene, income, and social behavior (e.g., alcoholism) have been blamed for the prevalence of tuberculosis in aboriginal populations. Genetic variations in NRAMP 1 may also affect susceptibility to and increase the risk of tuberculosis in Taiwanese aboriginals [42]. Here we calculate tuberculosis clusters for males and females by utilizing the local G-statistic. The results show clear spatial clustering in Taiwanese aboriginal townships. Thus, our observations support the results obtained in previous studies. In addition, the hypertensive disease cluster, which may also be closely correlated to mountainous and aboriginal townships, is worthy of attention. Further study is needed to clarify the strength of the relationship between aboriginal populations and hypertensive disease clusters. A more detailed survey of hypertensive disease may reveal valuable findings on how its risk factors differ between populations (four main populations are distributed in Taiwan).
The z-scores for the local G-statistic are analyzed using the logistic regression model, and the results for the various leading health problems are compared by gender. The test results show statistically significant differences between males and females for five health care problems in Taiwan in the year 2006, but not for the other fifteen. In other words, for these fifteen the null hypothesis of no difference by gender is not rejected, which suggests that common spatial factor(s) may act on both sexes.