Neighborhood clustering of non-communicable diseases: results from a community-based study in Northern Tanzania

Background: In order to begin to address the burden of non-communicable diseases (NCDs) in sub-Saharan Africa, high quality community-based epidemiological studies from the region are urgently needed. Cluster-designed sampling methods may be most efficient, but designing such studies requires assumptions about the clustering of the outcomes of interest. Currently, few studies from sub-Saharan Africa have been published that describe the clustering of NCDs. Therefore, we report the neighborhood clustering of several NCDs from a community-based study in Northern Tanzania.
Methods: We conducted a cluster-designed cross-sectional household survey between January and June 2014. We used a three-stage cluster probability sampling method to select thirty-seven sampling areas from twenty-nine neighborhood clusters, stratified by urban and rural. Households were then randomly selected from each of the sampling areas, and eligible participants were tested for chronic kidney disease (CKD), glucose impairment including diabetes, hypertension, and obesity as part of the CKD-AFRiKA study. We used linear mixed models to explore clustering across each of the sampling units, and we estimated absolute-agreement intra-cluster correlation (ICC) coefficients (ρ) for the neighborhood clusters.
Results: We enrolled 481 participants from 346 urban and rural households. Neighborhood cluster sizes ranged from 6 to 49 participants (median: 13.0; 25th–75th percentiles: 9–21). Clustering varied across neighborhoods and differed by urban or rural setting. Among NCDs, hypertension (ρ = 0.075) exhibited the strongest clustering within neighborhoods followed by CKD (ρ = 0.044), obesity (ρ = 0.040), and glucose impairment (ρ = 0.039).
Conclusion: The neighborhood clustering was substantial enough to contribute to a design effect for NCD outcomes including hypertension, CKD, obesity, and glucose impairment, and it may also highlight NCD risk factors that vary by setting. These results may help inform the design of future community-based studies or randomized controlled trials examining NCDs in the region, particularly those that use cluster-sampling methods.


Background
Non-communicable diseases (NCDs) are a growing global epidemic that disproportionately affects low- and middle-income countries (LMICs) [1]. In sub-Saharan Africa, they are now one of the leading causes of death among adults, and in order to begin to address this burden, high quality community-based epidemiological studies from the region are urgently needed [2][3][4][5]. Additionally, outcomes-related research either through observational cohort studies or randomized-controlled trials (RCTs) will be an important component of the public health response moving forward [6].
Nonetheless, many challenges exist in carrying out these studies. Poor infrastructure and a lack of resources in many of the sub-Saharan African countries limit rigorous studies, in part due to inadequate methodological capabilities. Physical addresses, phonebooks, and reliable census data are often unavailable for many populations in the region, which means that representative community-based samples often require labor-intensive, prospective household surveys. In this context, cluster-designed sampling methods offer an efficient, practical, and cost-effective means of obtaining a representative sample from the population of interest [7,8].
However, studies that use cluster sampling methods require extra considerations in their design and analyses, and cluster-designed studies in sub-Saharan Africa continue to inadequately address many of these considerations [9]. Because study participants or households are drawn from clusters, which serve as the primary sampling unit, they can demonstrate more homogeneity than would otherwise be expected from a simple, random sample. For NCDs, similar lifestyles, environmental risks, economic stress, and genetic backgrounds may all increase homogeneity within clusters, and consequently, this increased homogeneity within clusters, or intra-cluster correlation (ICC), can significantly affect the precision of population parameter estimates [10,11]. The ICC is typically quantified by the ICC coefficient, and although the ICC coefficient can be calculated post hoc during the analysis stage, this method may not be preferable or ethical in many sub-Saharan African settings due to cost and limited resources. Accounting for the design effect beforehand allows for more accurate estimations of sample size, budget requirements, and logistical needs; however, for NCD-related research, few ICCs have been reported in the region [9,10].
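As a brief illustration of how the ICC coefficient enters study design, the widely used design effect for clusters of equal (or average) size can be written as below. The symbols m (participants per cluster), n (total sample size), and n_eff (effective sample size) are notation introduced here for illustration and do not appear in the original text.

```latex
\mathrm{DEFF} = 1 + (m - 1)\,\rho,
\qquad
n_{\mathrm{eff}} = \frac{n}{\mathrm{DEFF}}
```

For example, under these assumptions a study with m = 20 participants per cluster and ρ = 0.05 would have DEFF = 1 + 19 × 0.05 = 1.95, so it would need nearly twice as many participants as a simple random sample to reach the same precision.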
The Comprehensive Kidney Disease Assessment for Risk Factors, epidemiology, Knowledge, and Attitudes (CKD-AFRiKA) study is an ongoing project in northern Tanzania with the goal of understanding and addressing the health burden of chronic kidney disease (CKD) and CKD-related NCDs. As part of the study, we conducted a cluster-designed, community-based epidemiologic survey. In the design stage, we were unable to identify any comparable ICCs for health outcomes related to CKD or CKD-related NCDs, and we had to extrapolate them from data derived from high-income settings. To fill this gap, we report here the observed intra-cluster correlations for multiple NCD-related factors from a community-based, sub-Saharan African setting [12].

Ethics, consent, and permissions
The study protocol was approved by the Duke University Institutional Review Board (#Pro00040784), the Kilimanjaro Christian Medical College Ethics Committee (EC#502), and the National Institute for Medical Research in Tanzania. Written informed consent (by signature or thumbprint) was obtained from all participants.

Study setting
We conducted a stratified, cluster-designed cross-sectional household survey between January and June 2014 in the Kilimanjaro Region of Tanzania, which has an adult population of more than 900,000 people [13,14]. The region comprises seven districts, and our study was conducted in two of these districts, Moshi Urban and Moshi Rural, which served as strata for our sampling scheme. Within these districts, there are 21 and 31 administrative wards respectively that range in size from 1500 to 25,000 people. Each ward is then further sub-divided into neighborhoods (also known as streets). Neighborhoods are the most basic governmental administrative unit in Tanzania, and they range in population size from 500 to 5000 people. The 65 urban neighborhoods have a median population size of 2000 people and a median area of 0.50 km². The 165 rural neighborhoods have a median population size of 2200 people and a median area of 4.00 km². In total, there are 230 neighborhoods/streets in the Moshi Urban and Moshi Rural districts [14].

Sampling methods
We used a three-stage cluster probability sampling method stratified by urban and rural. We used a random-number generator to select twenty-nine neighborhoods within the Moshi Urban and Moshi Rural districts. We based the random neighborhood selection on probability proportional to size sampling according to the 2012 national census [14]. From the twenty-nine neighborhoods, we then randomly selected the starting point for each sampling area (37 in total) using geographic coordinates randomly generated by ArcGIS v10.2.2 (Environmental Systems Research Institute, Redlands, CA). From the randomly selected geographic point, we then chose households based on a coin-flip and die-rolling technique (Appendix 1). All non-pregnant adults (age ≥ 18 years old) living in the selected households were recruited. A neighborhood cluster, therefore, included a group of individuals living in geographically-related households within the boundaries of an administrative neighborhood.
We targeted an enrollment between 15 and 25 participants per sampling area based on the requirements of the CKD-AFRiKA study. The total sample size was designed to estimate the community prevalence of CKD with a precision of 5 % when accounting for the cluster-design effect, assuming a CKD prevalence up to 20 % and an ICC coefficient of 0.05. To reduce non-response rates, we attempted a minimum of two additional visits during off-hours (evenings and weekends) and multiple phone calls using mobile phone numbers.
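The first sampling stage can be illustrated with a minimal sketch of probability-proportional-to-size (PPS) selection. This is not the study's actual procedure: it assumes a systematic PPS scheme, and the neighborhood names, population figures, and the function name `pps_systematic_sample` are hypothetical.

```python
import random


def pps_systematic_sample(sizes, n_select, rng):
    """Systematic PPS selection (an assumed scheme, for illustration only).

    sizes    : dict mapping unit name -> population size
    n_select : number of units to draw
    rng      : random.Random instance, so the draw is reproducible
    A unit larger than the sampling interval could be selected more than once.
    """
    total = sum(sizes.values())
    interval = total / n_select
    start = rng.random() * interval  # random start in [0, interval)
    targets = [start + i * interval for i in range(n_select)]

    selected = []
    cumulative = 0.0
    idx = 0
    for name, size in sizes.items():
        cumulative += size
        # Every target that falls inside this unit's cumulative range selects it.
        while idx < n_select and targets[idx] < cumulative:
            selected.append(name)
            idx += 1
    return selected


# Hypothetical neighborhood populations (not census data).
example_sizes = {
    "N1": 500, "N2": 1200, "N3": 2000, "N4": 2200,
    "N5": 3100, "N6": 4000, "N7": 1500, "N8": 900,
}
chosen = pps_systematic_sample(example_sizes, 3, random.Random(2014))
print(chosen)
```

Larger neighborhoods occupy a wider slice of the cumulative population range, so they are proportionally more likely to contain one of the equally spaced targets.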
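To make the role of the design effect in such a calculation concrete, the following sketch combines the usual sample size formula for a proportion with the design effect 1 + (m − 1)ρ. The cluster size of 20 (the midpoint of the 15 to 25 target) and the 95 % confidence multiplier z = 1.96 are assumptions made here; the text does not state the exact inputs or formula that were used.

```python
import math


def design_effect(mean_cluster_size, icc):
    """Design effect for clusters of (average) size m: 1 + (m - 1) * rho."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


def srs_sample_size(prevalence, precision, z=1.96):
    """Sample size for estimating a proportion under simple random sampling."""
    return (z ** 2) * prevalence * (1.0 - prevalence) / precision ** 2


def clustered_sample_size(prevalence, precision, mean_cluster_size, icc, z=1.96):
    """Inflate the simple-random-sample size by the design effect and round up."""
    n_srs = srs_sample_size(prevalence, precision, z)
    return math.ceil(n_srs * design_effect(mean_cluster_size, icc))


# Prevalence 20 %, precision 5 %, ICC 0.05 (from the text); m = 20 is assumed.
print(clustered_sample_size(0.20, 0.05, 20, 0.05))
```

Because the required sample size scales linearly with the design effect, even a small change in the assumed ICC coefficient can shift the enrollment target noticeably.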

Data collection
Our data collection methods have been previously described in detail [12]. In brief, participants were tested for CKD and CKD-related conditions including diabetes and hypertension, and anthropometric data (including height, weight, and body mass index) were recorded for each participant.
CKD was defined as the presence of albuminuria (≥30 mg/dL; confirmed by repeat assessment) and/or a reduction in the estimated glomerular filtration rate (eGFR) ≤60 ml/min/1.73 m² according to the Modification of Diet in Renal Disease equation without the race factor [15]. Hypertension was defined as a single blood pressure measurement of greater than 160/100 mmHg, a two-time average measurement of greater than 140/90 mmHg, or the ongoing use of anti-hypertensive medications. Glucose impairment was defined as an HbA1C ≥6.0 % in the presence or absence of ongoing treatment with anti-hyperglycemic medications. Diabetes mellitus was defined as an HbA1C level ≥7.0 % or current known use of anti-hyperglycemic medications for the purpose of treating diabetes. Participants with an HbA1C between 6.0 % and 6.9 % in the absence of treatment with anti-hyperglycemic medications were considered to have pre-diabetes. Overweight was defined as a body mass index (BMI) greater than 25 kg/m² and obesity was defined as a BMI greater than 30 kg/m².

Data analysis
We used STATA version 13 (STATA Corp., College Station, TX) for all data analyses. Continuous variables were summarized by the mean and standard deviation (SD) or median and inter-quartile range (IQR). Categorical variables were summarized using counts and percentages. To address potential non-response bias, mean and prevalence estimates were sample-adjusted using age- and gender-weights based on the 2012 urban and rural district-level census data [14]. To estimate the level of clustering in health outcome variables at the household level, the sampling area level, and the neighborhood cluster level, we first fitted a mixed-effects model with separate random intercepts for neighborhood, sampling area, and household for each of the outcomes of interest. In these models, after accounting for neighborhood, very little clustering (<15 %) remained at the sampling area level and household level, indicating that most of the variation in these outcomes was explained at the individual and neighborhood cluster levels. As such, we estimated the ICC for the neighborhood clusters only.
To estimate the absolute-agreement ICC coefficient for neighborhood clusters (ρ) we used a one-way, random effects analysis of variance (ANOVA) estimator which has been shown to perform well for both binary and continuous outcomes across a wide range of ρ and cluster sizes [16][17][18][19]. These estimations were performed in STATA using the 'loneway' command which uses the F statistic to calculate ρ as described by Hayes and Moulton. Although alternative estimators are available for binary outcomes, given that the ANOVA estimator has been shown to perform well for binary outcomes, we chose to present all estimates based on the common, easily implementable approach as described above [17,18].
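The one-way ANOVA estimator described above can be sketched as follows. This is a textbook version for unequal cluster sizes, written for illustration; it is not a reproduction of the internals of the 'loneway' command, and the function name `anova_icc` and the example data are hypothetical. Negative estimates are truncated at zero, as in the reporting approach used in this study.

```python
def anova_icc(clusters):
    """One-way random-effects ANOVA estimator of the ICC coefficient.

    clusters : list of lists, one inner list of outcomes per cluster
               (0/1 values for a binary outcome, or continuous values).
    Returns rho, truncated at zero.
    """
    k = len(clusters)                      # number of clusters
    sizes = [len(c) for c in clusters]     # cluster sizes m_i
    n_total = sum(sizes)
    grand_mean = sum(sum(c) for c in clusters) / n_total
    means = [sum(c) / len(c) for c in clusters]

    # Between- and within-cluster sums of squares.
    ss_between = sum(m * (mu - grand_mean) ** 2 for m, mu in zip(sizes, means))
    ss_within = sum((y - mu) ** 2 for c, mu in zip(clusters, means) for y in c)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Adjusted average cluster size for unequal cluster sizes.
    m0 = (n_total - sum(m * m for m in sizes) / n_total) / (k - 1)

    denominator = ms_between + (m0 - 1.0) * ms_within
    if denominator == 0:
        return 0.0  # no variation at all in the outcome
    rho = (ms_between - ms_within) / denominator
    return max(0.0, rho)


# Hypothetical continuous outcome in two clusters of three participants.
print(anova_icc([[1, 2, 3], [2, 3, 4]]))
```

With equal cluster sizes, m0 reduces to the common cluster size, and the estimator is the familiar (MSB − MSW) / (MSB + (m − 1) MSW).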
We calculated ρ for the social characteristics, self-reported medical histories, physical and laboratory measurements, and measured health outcomes. Negative values were truncated at zero, and our reporting of ρ is in accordance with the guidelines suggested by Campbell et al. [20].
Variance estimation was based on asymptotic theory, as implemented in the 'loneway' command, which accommodates differing cluster sizes. The 95 % confidence intervals for each ICC coefficient were derived from the asymptotic standard error, which has been shown to provide good coverage probabilities for a wide range of parameter combinations including clusters, cluster sizes, and ρ [18,21,22]. Confidence intervals with negative values were truncated at zero.

Results
Between January 2014 and June 2014, we enrolled 481 participants from 346 households from a total of 37 sampling areas (30 urban and 7 rural) within 29 neighborhoods (23 urban and 6 rural) (Table 1). These 29 neighborhoods were located within 18 wards (13 urban and 5 rural). The mean age was 46.9 years (SD 15.1). The household non-response rate was 15.0 %. Men (p < 0.001) and adults 18–39 years old (p = 0.001) were more likely to be non-responders. The median neighborhood cluster size was 13.0 participants (IQR 9–21), and neighborhood cluster size ranged from 6 to 49 participants (Appendix 2).
Among the NCDs, neighborhood clustering varied with ρ ranging from 0.000 to 0.075. Hypertension (ρ = 0.075) exhibited the strongest clustering within neighborhoods followed by CKD (ρ = 0.044), obesity (ρ = 0.040), and glucose impairment (ρ = 0.039) (Fig. 1). Among those with glucose impairment, neighborhood clustering was stronger for pre-diabetes (ρ = 0.031) than for diabetes (ρ = 0.000). Neighborhood clustering for physical and laboratory measurements paralleled the NCD outcomes. Both systolic (ρ = 0.064) and diastolic (ρ = 0.056) blood pressures exhibited strong neighborhood clustering. Clustering for albuminuria was modest (ρ = 0.038), but it accounted for most of the neighborhood clustering observed for CKD when compared to serum creatinine or eGFR measurements. Similar to obesity and glucose impairment, clustering of BMI was stronger in urban neighborhoods (ρ = 0.049) while clustering of HbA1C was stronger in rural neighborhoods (ρ = 0.025).

Discussion
In northern Tanzania, prevalence of NCDs, including hypertension, CKD, obesity, and glucose impairment, exhibited clustering by neighborhood. This clustering varied across urban and rural settings, and for NCD prevalence, it was strongest for hypertension and CKD. Based on the ICC coefficients that we observed, cluster-designed studies examining NCDs in the region should account for the design effect on precision or variance caused by clustering. In a region where the NCD burden is quickly growing, these results will be valuable in designing such studies, including cluster RCTs [5,12,23]. The urban and rural differences in neighborhood clustering of NCDs may highlight important environmental and lifestyle risk factors for the development of hypertension, glucose impairment, obesity, and CKD. The neighborhood clustering for hypertension and glucose impairment was most pronounced in the rural settings, where families tend to remain more environmentally clustered, share meals, and work in similar agricultural jobs, which may all contribute to the development of such NCDs that are known to be highly associated with lifestyle [24][25][26]. On the other hand, obesity and CKD were most clustered in the urban neighborhoods. For obesity, this urban clustering highlights the importance that urban lifestyles, which may be clustered within neighborhoods on the basis of socioeconomic status, transportation, or occupation, play in the development of obesity.

Table abbreviations: SD, standard deviation; ρ, intra-cluster correlation coefficient; COPD, chronic obstructive pulmonary disease; SBP, systolic blood pressure; DBP, diastolic blood pressure; BMI, body mass index; eGFR, estimated glomerular filtration rate; MDRD, modification of diet in renal disease equation (without the race factor) for eGFR; CKD-EPI, CKD epidemiology collaboration equation (without the race factor) for eGFR. a Too few positive events/outcomes were observed in these categories. b Age- and gender-weighted estimates.

In the context of CKD, living in an urban setting has been shown to be a significant risk factor, yet specific etiologies associated with the urban environment remain unknown [12]. The clustering of CKD within urban neighborhoods that we observed may be important in highlighting causes of CKD, and it further stresses that public health efforts targeting CKD must take a broad approach that includes urban planning with sanitation improvement, safe drinking water, pollution reduction, and infection control. Among all measured variables, ongoing alcohol use, hypertension, a self-reported history of hypertension, and a self-reported history of HIV were most highly correlated among cluster-sampled individuals, and the latter two variables may reflect an increased awareness and/or prevalence of these conditions within certain neighborhoods. In northern Tanzania, alcohol is commonly homemade and shared among households which may in part explain the significant clustering that we observed.
To our knowledge, this is the first community-based, household-level survey to report on the neighborhood clustering of NCDs in East Africa. As such, these are the first ICC coefficients reported for hypertension, CKD, obesity, and glucose impairment in the region, and compared with reports of ICC coefficients in high-income countries, there are substantial differences in several of the physical and laboratory variables [27][28][29]. Because we also measured clustering in both urban and rural settings, we were able to demonstrate important differences which may help inform future studies examining the demographic transition of NCDs in sub-Saharan Africa, where rapid urbanization is occurring [30].
Despite these strengths, we also noted a few limitations. Caution must be taken when applying these estimates to other populations and settings. Although the paucity of data currently available for NCD-related measurements and outcomes may make these results useful to researchers more broadly across the region, differences in prevalence and risk factors for NCDs, particularly those that are geographically or environmentally based, mean that even NCDs can cluster at different rates within villages, neighborhoods, or households. Additionally, although we used sample-balancing approaches to address potential non-response bias, the effect of participant non-response upon these estimates is not fully known. Finally, some results, such as self-reported medical history, rely upon the subjective response of individual participants, and as such, they may be prone to recall or response bias.

Conclusion
In conclusion, we have reported on the observed neighborhood clustering for several NCDs from a community-based study in northern Tanzania. The neighborhood clustering, which varied by urban or rural setting, was substantial enough to contribute to a design effect for NCD outcomes including hypertension, CKD, obesity, and glucose impairment, and it may also highlight NCD risk factors that vary by setting. These results may help inform the design of future community-based studies or randomized controlled trials examining NCDs in the region, particularly those that use cluster-sampling methods.

Fig. 1 Neighborhood clustering of non-communicable diseases in northern Tanzania. Intra-cluster correlation coefficients, presented by prevalence, for CKD, obesity, glucose impairment, and hypertension

Appendix 1: Standard Operating Protocol (SOP) for household selection

Purpose
To provide a reproducible, systematic, and random method of selecting households for sampling.

Definitions
- Cluster = randomly, pre-selected geographic location that includes multiple households for sampling
- Dwelling = A free-standing building that is covered by a roof. Buildings that share a foundation or appear to share a foundation should be considered as one dwelling.
- Household = Persons residing within a dwelling whose food is prepared by the same person(s)
- ID = Unique Identification Number that is assigned to each participant and each household.
- Household ID = Two digit Unique Identification Number contained in the study ID number that is assigned to each household.
- Eligible Individuals: Adults over the age of 18 who are not pregnant. Ex-pats or Temporary Residents should be excluded unless they are FULL citizens who reside in Tanzania full time (i.e. more than 9 months out of every year).

Overview
- Cluster site identification
- Household identification
- Household selection process

Process
1. Cluster site identification: the starting point from which household selection will occur has been identified based on random GPS coordinates.
2. The dwelling physically closest to the starting point will be approached first.

Household Selection Process
a. The first dwelling should be approached:
   i. If that dwelling fulfills the definition of a household then assign it a household ID and assess the eligibility of the household adults according to the enrollment protocol.
   ii. If that dwelling does NOT fulfill the definition of a household then move on to the next dwelling.
   iii. Unless the dwelling is clearly marked as a business, shop, or restaurant then the field surveyors should assume that it could be a household. They should then approach to confirm. (Remember that sometimes people who own shops also live in the back; if in any doubt then they should always approach to confirm).
b. To identify the next dwelling to approach for sampling, the following methods should be used:
   i. The field surveyor will stand with his/her back to the main entrance of the first dwelling.
   ii. Flip a coin.
   iii. If the coin lands on TAILS then proceed to your LEFT. If the coin lands on HEADS then proceed to your RIGHT.
   iv. Next, roll the die to determine which house to approach. The numbers on the die represent which house number (in sequential order according to physical distance to the front door) will be chosen.
   v. If the surveyor comes to an intersection or dead-end before reaching the house number on the die, then flip the coin again to determine the continuing direction. Again, TAILS will be LEFT and HEADS will be RIGHT.
   vi. In instances where there is only one physical direction to go, then proceed in that direction.
   vii. If a dwelling repeats, then repeat the coin-flip and die process.
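The coin-flip and die-rolling steps above can be sketched as a small simulation. This is a deliberately simplified model, not part of the SOP: it assumes all dwellings sit along a single street indexed 0 to n − 1, with no intersections, and it treats the end of the street as a dead-end where the only available direction is back the way the surveyor came. The function names and the street layout are hypothetical.

```python
import random


def next_dwelling(current, n_dwellings, visited, rng, max_tries=100):
    """Pick the next dwelling on a hypothetical single street (indices 0..n-1)."""
    if n_dwellings < 7:
        raise ValueError("this simplified model assumes at least 7 dwellings")
    for _ in range(max_tries):
        coin = rng.choice(["TAILS", "HEADS"])
        direction = -1 if coin == "TAILS" else 1   # TAILS = left, HEADS = right
        steps = rng.randint(1, 6)                  # die roll: which house to approach
        candidate = current + direction * steps
        # Simplified dead-end rule: at the end of the street only one direction
        # remains, so the remaining houses are counted going back.
        if candidate < 0:
            candidate = -candidate
        elif candidate > n_dwellings - 1:
            candidate = 2 * (n_dwellings - 1) - candidate
        # If the dwelling repeats, repeat the coin flip and die roll.
        if candidate not in visited:
            return candidate
    raise RuntimeError("could not find an unvisited dwelling")


def select_households(start, n_dwellings, n_households, seed):
    """Return the sequence of dwellings approached, beginning at the start point."""
    rng = random.Random(seed)
    route = [start]
    while len(route) < n_households:
        route.append(next_dwelling(route[-1], n_dwellings, route, rng))
    return route


print(select_households(start=10, n_dwellings=30, n_households=5, seed=1))
```

In the field, intersections and irregular layouts add branching that this one-dimensional model ignores; the sketch only shows how the coin and die jointly determine direction and distance, and how repeats trigger a re-draw.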

Protocol for Gated Houses
a. House with a Gatekeeper
   i. First, contact the gatekeeper to explain our intentions. If agreeable, he may allow entry.
   ii. If not agreeable to entry, then leave a study overview pamphlet along with our contact information.
   iii. Arrange a follow-up time to see if the owners have expressed interest.
b. Closed Gate House
   i. If a gatekeeper is present then proceed as above.
   ii. If no gatekeeper and no way to contact the household members, then record as nonresponse.
   iii. Two additional visits, including one off-hours visit (i.e. evening or weekend day), should be attempted according to the follow-up protocol.
c. Open Gate
   i. First, ensure that there is no gatekeeper.