Spatial analytical techniques and models are often used in epidemiology to identify spatial anomalies (hotspots) in disease regions. These analytical approaches can be used to not only identify the location of such hotspots, but also their spatial patterns.
In this study, we utilize spatial autocorrelation methodologies, including Global Moran's I and Local Getis-Ord statistics, to describe and map spatial clusters, and areas in which these are situated, for the 20 leading causes of death in Taiwan. In addition, we use the fit to a logistic regression model to test the characteristics of similarity and dissimilarity by gender.
Gender is compared in efforts to formulate the common spatial risk. The mean found by local spatial autocorrelation analysis is utilized to identify spatial cluster patterns. There is naturally great interest in discovering the relationship between the leading causes of death and well-documented spatial risk factors. For example, in Taiwan, we found the geographical distribution of clusters where there is a prevalence of tuberculosis to closely correspond to the location of aboriginal townships.
Cluster mapping helps to clarify issues such as the spatial aspects of both internal and external correlations for leading health care events. This is of great aid in assessing spatial risk factors, which in turn facilitates the planning of the most advantageous types of health care policies and implementation of effective health care services.
The Taiwan National Health Insurance (NHI) program was implemented in 1995. The coverage rate of the program has increased from 92.41% in 1995 to more than 96.16% in 2000. Coverage further increased to 98% after the inclusion of active military forces in 2001. At the beginning of 2004, NHI data related to medical care, such as the leading causes of death, were reclassified and reprocessed in relation to smaller units or areas (e.g., precincts or townships rather than the country as a whole). Regional data from the statistical analysis system (SAS) program are announced publicly by the NHI in regular annual reports (e.g., NHI, 2006) . These reports provide an accurate and reliable data source to help investigators explore health care issues in Taiwan.
In the study of spatially-related objects or characteristics, one first describes the regional characteristics that differentiate areas one from another, and then proceeds with the analysis of spatial interrelations . Common spatial techniques used in health research include disease mapping, clustering techniques, diffusion studies, identification of risk factors through map comparisons and regression analysis . Spatial clustering techniques are important for statistical consideration, and form the beginning steps in the development of models for predicting disease risk sites. Disease risk sites are, specifically, areas located close to one another that tend to share similar disease risk factors, because they share similar environments and are also often connected by the spread of communicable disease via vectors or host dispersal .
Cuzick and Edwards (1990) proposed three general methodological approaches that can be utilized for the detection of clustering: the first is based on cell counts; the second on autocorrelative adjacencies of cells with high counts; and the third based on determining the distance between events . Numerical methods have been extensively adopted for spatial cluster detection in health research and epidemiology, especially for the processing of areal data. The analytical approaches include the following: join-count statistics ; Ohno statistics ; Poisson statistics ; Global Moran's I [9–11]; Global Geary's C [9–11]; General Getis-Ord's G ; Local Moran's I ; and Local Gi(d) and Gi*(d) [12–14]. Spatial autocorrelation statistics such as the Moran's I and Geary's C methods are global, in the sense that they estimate the overall degree of spatial autocorrelation in a dataset. The possibility of spatial heterogeneity suggests that the estimated degree of autocorrelation may vary significantly across geo-space. Local spatial autocorrelation statistics provide estimates which are disaggregated to the level of the spatial analysis units, allowing assessment of dependency relationships in different areas. Local Gi(d) and Gi*(d) statistics can be used to make autocorrelation comparisons in different neighborhoods. A global average is used to help identify local regions of strong autocorrelation. Local version of the Moran's I and Geary's C statistics are also available.
In this study, we develop a method for ascertaining the spatial clustering associated with the 20 leading health care events, based on medical care data collected by the Taiwan NHI agency. Furthermore we also investigate potential spatial risks which contribute to these health care events and redefine epidemiologic and spatially referenced data.
The study area includes the main island of Taiwan only (excluding all islets), comprising more than 22 million inhabitants in the year 2000, living in an area of 36,000 km2. There are a total of 349 local administrative government areas, which include 5 main urban areas, 2 secondary urban areas, 187 rural townships, and 29 aboriginal townships (Figure 1). According to a bulletin from the Ministry of Interior issued in 1996, urban areas are regions having at least one metropolitan center and can include neighboring cities and townships which share socioeconomic activities. Main urban areas are defined as those with a population larger than one million, specifically, Taipei-Keelung, Kaohsiung, Taichung-Changhua, Jhongli-Taoyuan and Tainan. Secondary urban areas are defined as those with a residential population ranging from 0.3 to 1 million (for example, Hsinchu and Chiayi).
Data collection and management
The data were collected from contractual medical care institutions, which in this study, means institutions where the NHI covers the costs of prescription medicines and treatment at outpatient clinics. Such facilities accumulate detailed databases of medical costs for inpatient care. The number of outpatient cases were classified in relation to disease codes, as defined in the 1975 edition of "The International Classification of Diseases, 9th Revision, Clinical Modification" (hereafter, ICD 9 CM). Criteria for refining the data were first established. Some data were not included in the final statistical data set. For example, cases where patients suffer from diseases which defy code classification, mismatched ID numbers, and so on. Disease codes were classified by gender and age. Cases with the same ID numbers but different diseases were counted as different instances .
Medical care data obtained from the NHI, 2006 report were examined, and the prevalence rates of the 20 leading causes of death calculated. Diseases classifications (made according to the International Classification of Disease, Injuries, and Causes of Death, 1975) are indicated in parentheses. They include the following: malignant neoplasms (ICD 08-14); cerebrovascular disease (ICD 29); heart disease (ICD 250, 251, 27, and 28* which includes a partial listing of ICD 420-429); diabetes mellitus (ICD 181); accidents and adverse side effects (ICD E47-E53); pneumonia (ICD 321); chronic liver disease and cirrhosis (ICD 347); nephritis, nephritic syndrome and nephrosis (ICD 350); suicide (ICD E54); hypertensive disease (ICD 26); bronchitis, emphysema and asthma (ICD 323); septicaemia (ICD 038); tuberculosis (ICD 02); ulcers of the stomach and duodenum (ICD 341); certain conditions originating in the perinatal period (ICD 45); congenital anomalies (ICD 44); anaemias (ICD 200); homicide (ICD E55); meningitis (ICD 220); and protein-calorie malnutrition (ICD 192).
Demographic information was provided by the Ministry of Interior . The smallest administrative units coded for examination of the various diseases cases or health care events were precincts and townships. Age-adjusted standard prevalence rates, a direct adjustment using the world population in 2000 as the standard population , was then calculated, the results showed the leading causes of death for males and females in each township.
Global Moran's I statistic
The global spatial autocorrelation statistical method was used to measure the correlation among neighboring observations, to find the patterns and the levels of spatial clustering among neighboring districts . The Moran's I statistic, which is similar to the Pearson correlation coefficient , is calculated by
where N is the number of districts; wijis the element in the spatial weight matrix corresponding to the observation pair i, j; and xiand xjare observations for areas i and j with mean u and
Since the weights are row-standardized Σwij= 1, the first step in the spatial autocorrelation analysis is to construct a spatial weight matrix that contains information about the neighborhood structure for each location. Adjacency is defined as immediately neighboring administrative districts, inclusive of the district itself. Non-neighboring administrative districts are given a weight of zero.
Determining spatial weights/connectivity matrices
Spatial contiguity for polygons is the property of sharing a common boundary or vertex. Contiguity analysis is an important method for assessing unusual features in the connectivity distribution [13, 19]. The Queen's measure of contiguity can be utilized to make up for spatial contiguity by incorporating both the Rook and Bishop relationships into a single measure .
The administrative districts considered in this study are highly irregular in both shape and size. We compare the first order queen polygon continuity method and a distance-based method, to choose the most appropriate method for quantifying the spatial weights matrix for analysis of the connectivity distributions between neighbors. Figure 2 shows the results of both the distance-based and the first order Queen's contiguity analysis for the administrative district boundaries. When the distance-based method is used there is a larger percentage of contiguity connection between neighbors (greater than 15); whereas the maximum value for the first order Queen's contiguity is 10. The differences between the distance-based contiguity and the first order Queen's contiguity methods are obvious. The connectivity distribution results obtained with the latter highlights the marked parities in connectivity. Based on the results of the connectivity distribution, we construct a first order queen polygon contiguity weight file for districts which share common boundaries and vertices. The spatial weights/connectivity matrices are utilized in the following local G*(d) calculations.
The local Gi*(d) statistic (local G-statistic) is used to test the statistical significance of local clusters (as related to the 20 leading causes of death), and to determine the spatial extent of these clusters [12, 14]. The local G-statistic is useful for identifying individual members of local clusters by determining the spatial dependence and relative magnitude between an observation and neighboring observations . The local G-statistic can be written as follows [12, 21, 22]:
where x is a measure of the prevalence rate of each leading cause of death event within a given polygon (i.e., each administrative district); wijis a spatial weight that defines neighboring administrative districts j to i; Wiis the sum of the weights wij, .
Developing the spatial weights wijis the first step to calculating Gi*(d). The spatial weight matrix includes wij= 1. In this study, adjacency is defined using a first order queen polygon continuity weight file which has been constructed based on the districts that share common boundaries and vertices.
Non-neighboring administrative districts are given a weight of zero. The neighbors of an administrative district are defined as those with which the administrative district shares a boundary. A simple 0/1 matrix is formed, where 1 indicates that the municipalities having a common border or vertex; 0 otherwise [21, 23].
The local G-statistic includes the value in the calculation at i. Assuming that Gi*(d) is approximately normally distributed , the output of Gi*(d) can be calculated as a standard normal variant with an associated probability from the z-score distribution . Clusters with a 95 percent significance level from a two-tailed normal distribution indicate significant clustering spatially, but only positively significant clusters (the z-score value greater than +1.96) are mapped.
Logistic regression analysis
Similarities between spatial distribution patterns for males and females are displayed. In addition to mapping, logistic regression is also performed. The binary response indicates whether there is significant autocorrelation between administrative districts or areas. There is higher correlation if the absolute value of the z-score of the local G-statistics is larger than 1.96; lower correlation otherwise. Gender is considered as an explanatory variable in the logistic regression model. Thus the model is expressed as
where β0 and β1 are the logistic regression coefficients of the model. Pr(Higher correlation) and Pr(Lower correlation) denote the "Higher" and "Lower" correlation probabilities, respectively. Computation is performed with the R-language (R 2.8.1).
The results of the calculation of the global autocorrelation statistics for the top 20 leading cause of death events in the year 2006 in Taiwan are summarized in Table 1. The results of the global Moran's tests for all cases related to the leading causes of death (for both males and females) are statistically significant (z-score greater than 1.96) and indicate spatial heterogeneity.
The spatial clusters (hotspots) as obtained from the local Gi*(d) statistic for the top 20 leading health care problems for both males and females in Taiwan in 2006 are shown in Figures 3, 4 and 5.
The z-score outcomes as calculated by the Gi*(d) statistic are categorized as clusters or non-clusters, at the 5% significance level. This is followed by cross tabulation with the top 20 leading health problems. All results are summarized in Figure 6.
Furthermore, we find that there is no statistically significant dissimilarity (p-value > 0.05) between the spatial distribution patterns for males and females for fifteen out of twenty spatial clusters. We do find dissimilarities for cerebrovascular disease, heart disease, nephritis, nephritic syndrome and nephrosis, ulcers of stomach and duodenum, and certain conditions originating in the perinatal period. All results are shown in Table 2.
Nearby locations are likely to possess similar attributes. In other words, everything is related to everything else, and nearby things are more closely related to nearby things than to distant things . In epidemiology, a cluster becomes apparent when a number of health events occur which are situated close together in space and/or time. The evaluation of spatial distributions as a measure of disease risk may provide etiological insights . Spatial autocorrelation is defined as the relation between the values of a single variable. This relation is attributable to the geographic arrangement of areal units on a map and can be used to identify the degree of spatial clustering [27, 28]. In this study, the local G-statistic is used to measure the degree of spatial clustering and map the geographic patterns of the areal units. Spatial clustering of the leading cause of death (also called a hot spot) is defined as when we obtain z-score values larger than 1.96. In epidemiology, hot spots are considered interesting because of their correlation to etiology. For this reason, we indicate the hot spots of 20 leading cause of death, as obtained from our analysis, and identify their spatial locations. Information about spatial location is useful for detecting risk factors from a spatial viewpoint. A more detailed survey of these identified hot spots may reveal important clues as to risk factors for these diseases.
To appropriately use public health data aggregated according to irregular administrative districts it is important to decide on the local measures of spatial autocorrelation for the specification of local neighborhood (as defined by the spatial weights matrix). In general, the spatial autocorrelation may be the strongest between the nearest neighbors. As the neighborhoods increase in number, this autocorrelation weakens . A formal guidance for choosing a proper spatial weight matrix has not yet been developed [30, 31]. Therefore, the proper spatial weight matrix is chosen after a comparison of the connectivity distributions of neighbors obtained with the distance-based contiguity and the first order Queen's contiguity methods. However, an evaluation of the sensitivity of the results to the different spatial weight matrices still needs to be developed and assessed for further studies.
The modifiable areal unit problem (MAUP) is a phenomenon whereby different results are obtained from analysis of the same data, grouped into different sets of areal units. The MAUP can be subdivided into two separate effects that usually occur simultaneously during the analysis of aggregated data. The scale effect causes variation in statistical results given different levels of aggregation. In other words, association between variables depends on the size of the areal units for which data are reported. Generally, correlation increases as the size of the areal unit increases. The zone effect describes variation in correlation statistics caused by the regrouping of data into different configurations but with the same scale. These effects occur because spatial processes generating the observed data may exist at scales and for particular areal units that may be reflected more or less accurately by the boundaries in use . Studies of the MAUP based on empirical data provide only limited insight of the inability to control relationships between multiple spatial variables. Data simulation is necessary to control over various properties of individual level data. Simulation studies, such as those by Swift et al. (2008), have demonstrated that the spatial support of variables can affect the magnitude of the ecological bias caused by spatial data aggregation . Manley et al. (2006) concluded that MAUP is not really a problem but in constrast, a resource. Data at different scales can help us identify processes operating at different scales. It is clear that it is not possible to define an ideal single census geography that captures all the processes for all variables . Furthermore, the internal composition of the areal units may not be homogeneous, particularly for disease distribution. Further to this, Matisziw et al. (2008) suggested that downscaling the spatial structure of polygonal units sould provide valuable information on the spatial distribution of disease .
This is the first study of the spatial distribution of the 20 leading health problems in Taiwan. There have been few previous ecological studies related to health care issues and their correlation to risk factors in Taiwan, although malignant neoplasms and tuberculosis have been documented and are discussed briefly below. We hope that this study of the spatial clustering of Taiwan's leading health issues can provide help for the study of spatial epidemiology.
Residents along the southwestern and northeastern coasts of Taiwan drank well water contaminated with a high concentration of arsenic before the establishment of the public water system . Residents in these areas were found to have an increased risk of malignant neoplasms, including cancers of the liver, nasal cavity, lung, skin, bladder and kidney, for both males and females, as well as prostate cancer in males [36, 37]. Although well water was no longer used for drinking or cooking after the mid-1970s, there was still significantly increased risks of urinary cancers [38, 39] and lung cancer [39, 40] in the arseniasis-endemic areas of southwestern and northeastern Taiwan. Our results, showed clusters for malignant neoplasms in these arseniasis-endemic areas, but also did reveal a new carcinogen clustering (for females) in the northen coastal region of Taiwan. This is worthy of more investigation in the future.
According to data from the Center for Disease Control in Taiwan, there is a four-fold higher incidence of tuberculosis in aboriginal portions of the population than in people of Han ethnicity (Hans) . Environmental factors such as hygiene, income, and social behavior (e.g., alcoholism) have been blamed for the prevalence of tuberculosis in aboriginal populations. Genetic variations in NRAMP 1 may also affect susceptibility to and increase the risk of tuberculosis in Taiwanese aboriginals . Here we calculate tuberculosis clusters for males and females by utilizing the local G-statistic. The results show clear spatial clustering in Taiwanese aboriginal townships. Thus, our observations support the results obtained in previous studies. In addition, the hypertensive disease cluster, also possibly closely correlated to mountainous and aboriginal townships, is also worthy of attention. The strength of the relationship between aboriginal populations and hypertensive disease clusters needs further study to clarify. A more detailed survey of hypertensive disease may reveal valuable findings in terms of the risk factors between populations (four main populations are distributed in Taiwan) and hypertensive disease.
The z-scores for the local G-statistic are calculated using the logistic regression model. The results for various leading health problems and gender are compared. The test results show statistically significant differences for five health care problems in Taiwan in the year 2006, but another fifteen cases which are not, on the other hand. In other words, the null hypothesis is accepted. The accepted null hypothesis results indicate that the common spatial factor(s) may interact with both sexes.
Spatial autocorrelation calculation is useful for cluster mapping of regional health care problems. Cluster mapping helps to clarify issues such as the spatial aspects of both internal and external correlations of leading health care events. This helps planners to assess spatial risk factors, and to ascertain what would be the most advantageous types of health care policies for the planning and implementation of health care services. These issues can greatly affect the performance and effectiveness of health care services and also provide a clear outline for helping us to better understand the results in depth.
National Health Insurance: Statistical Annual Report of Medical Care 2006. 2008, Taipei: National Health Insurance, Republic of China (Taiwan)
Douven W, Scholten HJ: Spatial analysis in health research. The Added Value of Geographical Information Systems in Public and Environmental Health. Edited by: de Lepper MJC, Scholten HJ, Stern RM. 1995, Boston, MA: Kluwer Academic Press, 117-133.
Getis A, Morrison AC, Gray K, Scott TW: Characteristics of the spatial pattern of the dengue vector, Aedes aegypti, in Iquitos, Peru. American Journal of Tropical Medicine and Hygiene. 2003, 69: 494-505.
Wu J, Wang J, Meng B, Chen G, Pang L, Song X, Zhang K, Zhang T, Zheng X: Exploratory spatial data analysis for the identification of risk factors to birth defects. BMC Public Health. 2004, 4: 23-10.1186/1471-2458-4-23.
Swift A, Liu L, Uber J: Reducing MAUP bias of correlation statistics between water quality and GI illness. Computers, Environment and Urban Systems. 2008, 32: 134-148. 10.1016/j.compenvurbsys.2008.01.002.
Chiou HY, Chiou ST, Hsu YH, Chou YL, Tseng CH, Wei ML, Chen CJ: Incidence of transitional cell carcinoma and arsenic in drinking water: a following-up study of 8,102 residents in an arseniasis-endemic area in northeastern Taiwan. American Journal of Epidemiology. 2001, 153: 411-418. 10.1093/aje/153.5.411.
Chen CL, Hsu LI, Chiou HY, Hsueh YM, Chen SY, Wu MM, Chen CJ, Blackfoot Disease Study Group: Ingested Arsenic, Cigarette Smoking, and Lung Cancer Risk: a follow-up study in arseniasis-endemic areas in Taiwan. JAMA. 2004, 292: 2984-2990. 10.1001/jama.292.24.2984.
Hsu YH, Chen CW, Sun HS, Jou R, Lee JJ, Su IJ: Association of NRAMP 1 gene polymorphism with susceptibility to tuberculosis in Taiwanese aboriginals. J Formos Med Assoc. 2006, 105: 363-369. 10.1016/S0929-6646(09)60131-5.
The authors wish to thank the referees for their perceptive comments on an earlier draft, which have led to considerable improvements in the paper. This study was supported by grant NSC96-2415-H-156-003-MY2 from the National Science Council, Executive Yuan, Republic of China (Taiwan). The authors are most grateful for the kind assistance of Prof. Melissa Norton, Editor-in-Chief of the BMC series, and the constructive suggestions from anonymous referees all of which has led to the making of several corrections and suggestions that have greatly aided us in improving the presentation of this paper.
Authors and Affiliations
College of Liberal and General Education, Aletheia University, Republic of China, Taiwan
Department of Tourism, Aletheia University, Republic of China, Taiwan
Graduate institute of Geography, National Taiwan University, Republic of China, Taiwan
Department of Statistics and Actuarial Science, Aletheia University, Republic of China, Taiwan
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.