### Study area

Heshun county (Fig. 1) is located at the Tai Hang mountain area of the Shanxi province and consists of 326 administrative villages with a total area of 2,250 km^{2}. Most of the people in this county are farmers and their living environment seldom changes. There is no large-scale human immigration in the region's history. Remarkably, most kinds of birth defects designated by the WHO (World Health Organization) are found in Heshun, and among them the defects linked to NTD are the predominant ones [20]. Among the 7880 births in Heshun during 1998-2005, 187 of them suffered from NTD. The inherited and congenital causes of birth defects are similar among the region's population. Nevertheless, these causes explain only a small fraction of all NTD cases.

### Spatial random field theory

Let the geographical distribution of the epidemic attribute (number of NTD cases) be represented mathematically by the spatial random field (SRF), *Y*
_{
s
}, in the sense of Christakos [21]. The vector **s** = (*s*
_{1}, *s*
_{2}) denotes the location of the attribute, where *s*
_{1}, *s*
_{2} are the associated spatial coordinates of the location. Also, let *m*
_{
s
}= *E* [*Y*
_{
s
}] be a non-random quantity that represents the average value of all possible SRF realizations at the location **s**, where the *E* [·] denotes stochastic expectation. The SRF formulation properly represents the fact that NTD prevalence shows significant variation in terms of geographic location (within regions and between countries [[5, 22], and [23]]. Two quantitative expressions of prevalence rate can be defined in terms of the SRF, as follows.

The observed population rate (OPR) of the NTD cases *Y*
_{
s
}over an area ℜ is defined as

where **s** varies within ℜ, and |ℜ| denotes the total number of births in the area. The *r*
_{ℜ} is a random quantity, i.e. even when considering the same area, one may get different results if the *r*
_{ℜ} is computed over different SRF realizations. The superpopulation rate (SPR), also called the stochastic rate, of the NTD cases *Y*
_{
s
}over an area ℜ is defined as

where *f*
_{
y, s
}is the probability density function of *Y*
_{
s
}, and ψ_{
s
}denotes a realization of *Y*
_{
s
}at s (for the underlying mathematical details the readers are referred to Christakos and Hristopulos [24]).

The OPR is directly observable and expresses the "here-and-now" crude disease rate, which makes *r*
_{ℜ} a useful study parameter when the objectives include the study of infectious disease outbreaks and the assessment of emergency health services. The SPR, on the other hand, expresses an essential property of epidemiological phenomena [25], which makes *m*
_{ℜ} a useful tool in the study of the relationship between an epidemic and its determinants. The SPR is rarely observed directly, but it can be approximated in terms of the available observations and by incorporating neighboring samples in a Bayesian context [26]. Accordingly, *m*
_{ℜ} will be the prime focus of the present study, which means that in the following the term "NTD rate" refers to SPR values.

### Prevalence rates

In order to investigate possible NTD determinants, the prevalence rates need to be estimated across space. In this study, determinants were considered at the village scale (Fig. 2a). The annual number of births, which is the denominator of the mathematical rate formula, was small and it varied highly at the annual scale (e.g., some villages had not a single birth in consecutive years). Since the NTDs constitute small probability events, the nominator is also expected to be small and highly variable annually. As a result, while the crude NTD rate exhibits high variability (simply due to the small sample size available), it does not reflect the essential attributes of the disease in a village. For example, for a village with a number of births equal to 2 and an NTD number equal to 1 during 2 years the calculated NTD rate is 0.5, which can not be the true risk of the disease, because the crude rate obtained from short-term observations within a small area does not reflect the true features of the long-term interaction between residents and negative environmental factors [26]. In addition to collecting data about NTD cases (Fig. 2c) and number of births (Fig. 2b) during as long time-periods as possible, one also needs to stabilize the estimated NTD rates, which is why the Rushton's circle method [27] was used in this study.

In spatial analysis one assumes that events are spatially correlated (dependent); see [27–29]. Accordingly, we introduced a regular lattice of grid points to serve as the spatial centers of the disease distribution. Then we constructed a series of circles that covered the entire Heshun area. Since the average NTD rates were calculated within each one of these circle using the available data, the circles were called "NTD_rate circle" (meaning that each circle had its own NTD rate). The socio-economic activities of Heshun residents usually take place between 6.2 and 9.3 Kms [20], which is why we set the radius of these circles to 3 Kms. Moreover, in order to cover all village points in the Heshun county, the least distance between any two centers should be 4.2 (3) Kms.

The Rushton circle [27] and the hierarchical Bayesian model [26] are efficient tools to correct the instability of the estimated rate for small probability events by borrowing strength from neighbors. Nevertheless, we still doubt that the number of unreliable values in the neighbors may play a more negative than positive role in estimation. Therefore, we set an artificial threshold of five live born of any village; villages bigger than this were included in the NTD prevalence rate calculation, then the values were used as sample in the subsequent statistical inference. Theoretically, there may exist a balance between (a) using larger samples to reduce the estimation variance [30, 31], but at a risk of negative impact due to some highly unreliable values among the sample, and (b) using reliable but small samples in the estimation of a regional attribute. The matter deserves further investigation in the future.

In this study, all 270 villages that have a number of births equal or higher than 5 ("≥ 5 " rule) were used to predict the "NTD_rate circle". Another 56 villages that did not satisfy the " ≥ 5 " rule were left to be predicted. The method used to predict the NTD rate across space involved the assumption that the villages had the same rate as the "NTD_rate circle" to which they belonged. It should be noticed that the "NTD_rate circles" may overlap each other when the distance between two centers is less than 6 Kms. Therefore, the villages to be estimated might not be included only in one circle, in which case the NTD rates of these villages were taken to be equal to the average values of the "NTD_rate circles" to which they belonged. Fig. 3a represents the village points of the Heshun county; solid points are the villages with reasonable NTD rates and hole points are those left to be adjusted. Prediction performance was assessed by means of the average root variance (ARV), in which case it was found that prediction improves when the distance between center points reduces (Fig. 3b). This finding was attributed to the fact that an increasing number of points may be included in some "NTD_rate circles" and, as a consequence, some of these points may be included in an increasing number of "NTD_rate circles". Interestingly, this effect did not increase significantly when the distance between the center points remained below 2 Kms. Therefore, we chose the 2 Kms as the distance between two center points for NTD rate prediction purposes; the corresponding ARV was 0.5347, which is an acceptable level reflecting the overall uncertainty of prevalence rate prediction (Fig. 3).

### Determinants of NTD and their proxies

Three prime types of factors are suspected to cause NTD: (1) environment (physical, social, economic etc.); (2) hereditary (genetic, pre-existing conditions etc.); and (3) synthetic (interaction between (1) and (2)). Recent studies show that most NTD cases are the result of environment-gene interactions [32, 33]. The NTD etiology includes not only physical, chemical and biological agents, but also social and cultural determinants, where the latter affect body's immunity through their impact on psychological and mental health. Many of the direct NTD determinants are difficult or expensive to access. Actually, the direct etiological NTD factors act through at least four geographical layers, which are easier implementable in terms of a geographical information system, GIS [34, 35]. In particular, these layers may be grouped are as follows:

Physical NTD determinants that are spatially distributed. Potential NTD hazards include surface and subsurface water contaminated by insufficiently oxygenized ancient geological media; also, radiation emissions from certain rocks or along faults [36, 37].

Man-made pollution that is spatially distributed. Hazards of this kind include pesticides and chemical fertilizes spread over crop fields. Also, polluted air and water emission from workshops and electromagnetic radiation in the workplace [38–41].

Nutrition processes that are spatially distributed. For example, nutrition strongly depends on spatially varying residential income. Hence, it is usually proportional to the GDP that is regularly surveyed across space and published in the government's annual statistics/census reports [42].

Heredity and habits that are spatially distributed. Ethnic groups have specific genetically transmitted habits and behavioral patterns (e.g., related to food consumption), some of which are hazardous to health [43]. Health determinants may be detected when the disease cases and the ethnic characteristics share similar spatial patterns; for example, when the shape and size of spatial disease clusters are consistent with these of the citizens' daily activities, it could suggest that heredity is relevant to the regional NTD [20].

The explicit physical and human geographical proxies of the NTD determinants are collected: elevation, accessibility (e.g., road buffer), geological background (fault buffer), water conditions (e.g., river buffer), per-capita income (per-capita net income), medical conditions (e.g., number of doctors), crop yield (e.g., vegetable and fruit production), agricultural chemical exposure (fertilizer and pesticide use), land cover, lithology, watershed and soil conditions in every village. The socioeconomic factors were measured in terms of averaged annual levels during the period 1998-2005. Fumonisins in maze or other grains could be an important NTD factor [44, 45]. However, the north of China where our pilot study was conducted is very dry during throughout the year, so the climate is not suitable for Fumonisins growth; and we have not found any report on Fumonisins in Shanxi province. In addition, the study area is hilly and is not the main maize production area, so we did not test for Fumonisins in this study.

### Inference approach

A main objective of the present study is to identify possible NTD determinants. Fig. 4 illustrates a conceptual framework that involves the implicit direct NTD determinants *Z*
_{
s
}and their explicit geographical proxies *X*
_{
s
}. The latter are inserted into a GIS, which is then regressed with the NTD rate *Y*
_{
s
}by means of the GWR technique [46]. Finally, the results are interpreted in terms of the determinants *Z*
_{
s
}according to the conceptual framework of Fig. 4. Mathematically, the attributes *X*
_{
s
}, *Y*
_{
s
}and *Z*
_{
s
}are represented in terms of the SRF theory, see above. In symbolic terms, we seek to calculate the conditionals (*Z*
_{
s
}|*Y*
_{
s
}) and (*X*
_{
s
}| *Y*
_{
s
}), which logically infer the direct NTD determinant *Z*
_{
s
}given the NTD rate *Y*
_{
s
}and the proxies *X*
_{
s
}given the NTD rate *Y*
_{
s
}, respectively. In terms of Bayesian inference we can write [26]

where the (*Y*
_{
s
}|*X*
_{
s
}) is estimated by means of GWR, the (*X*
_{
s
}) is known from GIS, (*Y*
_{
s
}) is known from the corresponding survey, and (*Z*
_{
s
}| *X*
_{
s
}) is calculated on the basis of physical and human geographical processes. The Bayesian equations above allow logistic inference even when not all included variables and relationships are computable. In other words, the logical framework of Fig. 4 offers a valid means for the interpretation of NTD results in the Heshun county, quantitatively and qualitatively.

When implementing the GWR technique, categorical variables (including land cover, lithology, watershed and soil conditions) should be distinguished from explanatory variables. Typically, the former variables appear in terms of the Ordinary Least Squares (OLS) technique by introducing dummy variables. However, this would result in what is technically termed a "sever model design" in GWR analysis (i.e., an explanatory variable is perfectly collinear with the intercept, especially when a village and all its neighbors have the same values for one or more explanatory variables). As a consequence, we only regressed non-categorical variables and NTD rates using GWR. The categorical variables were checked by comparing the spatial patterns of the GWR outputs vs. those of the categorical variables.

### Local regression by GWR

In standard regression applications, the elastic coefficients of the NTD factors are assumed to be constant over space, which is the case of "one model fits all". The GWR technique, however, properly extends the traditional regression framework by allowing local rather than global parameters to be estimated. In this study, therefore, we implemented the GWR technique to identify local relationships between NTD and environmental factors. The GWR software tool used was GWR3.0 [47]; the ArcGIS9.0i was also used to map the variables and the numerical results obtained.

Note that the basic idea of GWR is that observations near a specified point **s**
_{
i
}have more influence in the prediction of the disease parameters associated with i than do observations farther away from **s**
_{
i
}. Accordingly, data close to **s**
_{
i
}are weighted more than data that are farther away from **s**
_{
i
}, in which case the geographical weights of observed data as far as prediction at point **s**
_{
i
}is concerned are as follows

where *w*
_{
ij
}represents the weight of any datum at point **s**
_{
j
}(*j* = 1, 2, ..., *n*) on the calibration of the prediction model at point **s**
_{
i
}. Normally, each *w*
_{
ij
}is a continuous function of *d*
_{
ij
}, the distance between **s**
_{
i
}and **s**
_{
j
}(i.e. *d*
_{
ij
}= |**s**
_{
i
}- **s**
_{
j
}|). One possible choice is , where *h* is called the bandwidth, and its specification depends on the situation under consideration.

The village centroids in the Heshun county are distributed unevenly: some are densely distributed, whereas some others are sparsely distributed. This means that local regression may rely on relatively few data points in areas where these points are sparsely distributed. To address this potential problem we used a spatially adaptive weighting technique, which involves the experimental calculation of the bandwidth rather than assigning it directly in the GWR context. The bandwidths are relatively small in areas where the data points are densely distributed and they are rather large in areas where the data points are sparely distributed. Better results, measured in terms of the global *R*
^{2} of GWR, were obtained when more points were involved in bandwidth calculation.