Rescaling and small area estimation of behavioral risk survey guided by social vulnerability data
BMC Public Health volume 23, Article number: 184 (2023)
Abstract
Background
Local governments and other public health entities often need population health measures at the county or subcounty level for activities such as resource allocation and targeting public health interventions. Information collected via national surveys alone cannot fill these needs. We propose a novel, two-step method for rescaling health survey data and creating small area estimates (SAEs) of smoking rates using a Behavioral Risk Factor Surveillance System survey administered in 2015 to participants living in Allegheny County, Pennsylvania, USA.
Methods
The first step consisted of a spatial microsimulation to rescale the location of survey respondents from zip codes to tracts based on census population distributions by age, sex, race, and education. The rescaling allowed us, in the second step, to utilize available census tract-specific ancillary data on social vulnerability for small area estimation of local health risk using an area-level version of a logistic linear mixed model. To demonstrate this new two-step algorithm, we estimated the ever-smoking rate for the census tracts of Allegheny County.
Results
The ever-smoking rate was above 70% for two census tracts to the southeast of the city of Pittsburgh. Several tracts in the southern and eastern sections of Pittsburgh also had relatively high (> 65%) ever-smoking rates.
Conclusions
These SAEs may be used in local public health efforts to target interventions and educational resources aimed at reducing cigarette smoking. Further, our new two-step methodology may be extended to small area estimation for other locations and health outcomes.
Introduction
Disaggregation of behavioral risk data to finer geographical scales can provide key insights into many public health challenges. Researchers have noted, for example, high prevalence of cigarette smoking has continued among specific subpopulations in the United States (U.S.), many of whom are known to be vulnerable [1]. While tobacco smoking has declined considerably in the U.S. over the past decades, an estimated 13.7% of U.S. adults still smoke cigarettes, and it is the leading cause of preventable disease, disability, and death [2]. Cigarette smoking has been linked to many cardiovascular and respiratory diseases, such as chronic obstructive pulmonary disease (COPD), and is the leading risk factor for lung cancer development [3, 4]. Smoking cessation reduces the risk for these adverse health outcomes and can add as much as a decade to life expectancy [2]. A systematic combination of routinely collected health survey data with measures of local vulnerability can identify the neighborhoods with high smoking rates to allow better targeting of smoking cessation interventions, as well as those who may be experiencing disparities in outcomes of such programs.
National health surveys, such as the Behavioral Risk Factor Surveillance System (BRFSS) [5], are crucial tools for monitoring population trends in smoking and other high-risk, health-related behaviors at the country or state level. However, local governments and other public health entities often need these population health measures at the county or subcounty level for activities such as resource allocation and targeting public health interventions. National surveys alone cannot fill these needs, often due to limited coverage of small geographic areas. Further, the small sample sizes of such surveys, when restricted to local populations, make estimation of the variables of interest difficult, and possibly unreliable, below the state level. To address this issue, various small area estimation techniques have been proposed to downscale national or state health survey data and generate small area estimates (SAEs) that are deemed more reliable in terms of providing insights into health conditions and health-related risk behaviors that are specific to local populations [6]. A handful of prior studies have sought to produce SAEs based on BRFSS data, including for risk behaviors like smoking [7,8,9,10,11], health outcomes like COPD [12, 13], and other factors [14, 15].
In this study, our objective is to address the problem of estimating subcounty level behavioral risks, such as smoking rates, by leveraging auxiliary data that generally exist for local populations but not necessarily on the desired spatial scale. To this end, we introduce a new two-step algorithm for rescaling survey data and generating SAEs of the variable of interest. The term “small area” is used to describe a domain for which the sample size is not large enough to allow sufficiently precise direct survey estimation. Indirect SAE methods often depend on the availability of population-level auxiliary information related to the variable of interest [6]. In the first step, we use microsimulation for spatial “side-scaling” of the survey data from the original unit of area (e.g., zip code-level) to a different unit of area (e.g., census tract-level). While uncertainty in this step may lead to the loss of some data points, it can make valuable auxiliary information, in the form of social vulnerability data, available at the rescaled level. In the second step, such local population-level auxiliary data are used to inform the model for small area estimation, which, in this study, is done for every census tract (or simply “tract”). Working at the tract level also helps us avoid the use of zip codes, whose boundaries may (and often do) change over time. Further, we include additional steps to decide whether to incorporate the design of the survey in our model, and we provide multiple model diagnostics. We demonstrate the methodology by estimating the tract-level ever-smoking rates of Allegheny County in western Pennsylvania.
Data and Methods
The University of Pittsburgh Institutional Review Board approved this study (STUDY19040081).
Local BRFSS survey
The Allegheny County Health Department modeled its local BRFSS survey after the national survey, but the county raised its own funds for the survey and added many of its own questions. This county survey was administered via telephone to a random sample of adults 18 years and older who resided in Allegheny County in 2015. These methods have been described previously [16]; briefly, probability-based sampling via random-digit dialing was conducted within the universes of all possible landline and cellular telephone numbers, 1.4 and 1.8 million total possible numbers, respectively. Six percent of possible landline and 4% of cellular telephone numbers in the county were sampled, with a total of 9032 interviews secured. Consent for participation was obtained at the beginning of the call [16]. For the present study, we obtained these as de-identified data, with personal identifying information masked by codes. We excluded 74 survey respondents with likely erroneous ages (< 18 years old) and 122 respondents with missing zip codes, leaving 8836 respondents in 105 zip code-defined areas for the spatial microsimulation. Survey demographic variables were recategorized as necessary to harmonize with key census variables: sex (male or female), age (18–24, 25–34, 35–44, 45–64, ≥ 65 years), race (white, black, other), and education (less than high school, high school graduate, college 1 to 3 years (some college or technical school), and college graduate or higher). The sociodemographic profiles of the survey respondents are summarized in the Supplementary Table S1. For the present study, we used each respondent’s ever-smoking status for tobacco in the form of cigarettes (not including e-cigarettes or further categories). For further details, see [16].
American community survey
The spatial microsimulation (Step 1 below) requires census population margins by demographic factors to assign survey respondents to probable tracts. While the National Census takes place once every 10 years, the American Community Survey (ACS) is a nationwide survey that collects economic, housing, and demographic data every year [17]. We obtained 2015 tract-level population estimates from the ACS to correspond to the year of our BRFSS survey.
Social vulnerability data
The U.S. Centers for Disease Control and Prevention’s (CDC) Social Vulnerability Index (SVI) was originally computed to help public health officials and emergency response planners identify the most vulnerable communities that will require support during a hazardous event. The SVI ranks tracts on 15 social factors and further pools them into four summary themes: socioeconomic, household composition and disability, minority status and language, and housing type and transportation. It also provides an overall SVI [18].
Spatial microsimulation
Step 1 of our two-step algorithm is a microsimulation to assign survey respondents to tracts using the approach of combinatorial optimization (CO). Details regarding this step are available in the Supplementary Methods. In short, this procedure involves the selection of an optimal combination of households from an existing survey dataset that best fit published small-area census tabulations [19]. We conducted the spatial microsimulation using the simPop package in R (version 4.0.2), an open-source data synthesizer that can be used to allocate populations from larger to smaller geographic areas [20]. Survey respondents were rescaled from zip codes to tracts based on census population marginals by age, sex, race, and education.
After the study population was initially distributed to census tracts, a post-calibration procedure (calibPop) was performed to refine the distribution to tracts based on known census population marginals for age, sex, race, and education. This procedure implements CO based on simulated annealing to conduct an iterative search for a near-optimal combination of households to populate the geographic areas. As this is a probabilistic step, a degree of randomness is involved in the household selection and the results will be slightly different for each run. Thus, the microsimulation was run for N = 100 iterations for each respondent r. In each iteration, r is assigned to at most one tract within their zip code, which is known from the BRFSS survey data. Further, no single census table containing a population breakdown by all four demographic variables of interest was available. We therefore repeated the microsimulation for each of the following three combinations of marginals: {age, sex, race}; {age, sex, education}; and {sex, race, education}.
Then, we spatially assign to each respondent r the tract that has (i) the strongest assignment among (ii) the least inconsistent of all tracts assigned to r by microsimulation. Let Max(r, d) and Min(r, d) be the largest and the smallest number of assignments of a tract d to r, out of N = 100 microsimulation runs of r, across the three combinations of marginals stated above. For each r, we retain the tracts with Incons(r, d_{j}) = Max(r, d_{j}) − Min(r, d_{j}) < δ and sort them into a sequence {d_{(i)}}_{r} in increasing order of Incons(r, d_{j}). Then r is assigned to the first tract in the sorted sequence {d_{(i)}}_{r} for which Max(r, d_{(i)}) ≥ μ. The threshold values μ = 40 and δ = 50 were selected based on the empirical distributions of Max and Incons to include a majority of respondents in the final assignments. If no tract met these criteria for a survey respondent, then that person was considered “unassigned” and excluded from Step 2.
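The assignment rule above can be illustrated with a minimal Python sketch. It is not the study's implementation (which used R); the function name `assign_tract`, the input layout, and the tie-breaking rule (prefer the larger Max when two tracts have equal Incons) are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def assign_tract(counts: Dict[str, List[int]],
                 mu: int = 40, delta: int = 50) -> Optional[str]:
    """Pick a tract for one respondent r.

    counts maps a tract id to the number of times (out of N = 100 runs)
    the tract was assigned to r under each of the three marginal
    combinations. Returns None when r stays "unassigned".
    """
    candidates = []
    for tract, c in counts.items():
        incons = max(c) - min(c)          # Incons(r, d) = Max - Min
        if incons < delta:                # keep only consistent tracts
            # Tie-break on larger Max, then tract id (an assumption).
            candidates.append((incons, -max(c), tract))
    candidates.sort()                     # increasing order of Incons
    for incons, neg_max, tract in candidates:
        if -neg_max >= mu:                # first tract with Max >= mu
            return tract
    return None


# Hypothetical counts for one respondent whose zip code spans three tracts.
example = {"A": [90, 20, 85], "B": [55, 50, 60], "C": [10, 12, 8]}
chosen = assign_tract(example)
```

In this toy input, tract A is dropped because its counts disagree too strongly across the three marginal combinations, and tract C is consistent but too weakly assigned, so the rule selects tract B.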
Small area estimation
In Step 2, we used the rescaled microdata from Step 1 for small area estimation of ever-smoking rates for all tracts in Allegheny County. Two types of variables are used for SAE analysis. The first is the variable of interest drawn from the survey, i.e., ever-smoking, which is binary at the individual level and corresponds to whether a person had ever smoked or not. The parameter of interest was the proportion of ever smokers within each census tract (for the 458 tracts of Allegheny County).
The second type consists of the tract-level auxiliary variables (or covariates). We used as available covariates four theme-wise summary SVI variables defined as (i) Socioeconomic: RPL_THEME1, (ii) Household Composition & Disability: RPL_THEME2, (iii) Minority Status & Language: RPL_THEME3, and (iv) Housing Type & Transportation: RPL_THEME4. These values are given as percentile rankings.
A generalized linear model between tract-specific sample (unweighted) proportions of smoking and the set of four auxiliary variables (RPL_THEME1–4) was fitted for choosing the appropriate auxiliary variables. This model was fitted using the glm function in R, specifying the family as “binomial” and the tract-specific sample size as the weight. The primary purpose was to build a good explanatory and predictive model based on the available auxiliary data. Finally, two auxiliary variables, RPL_THEME1 (Socioeconomic) and RPL_THEME3 (Minority Status & Language), which were statistically significant predictors in the model, were identified for use in subsequent SAE analysis.
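For readers who want to see what such a fit computes, the following is a minimal pure-Python sketch of a binomial GLM with a logit link, fitted by Newton-Raphson on tract-level counts. It is a stand-in for the R glm call described above, not the study's code; the function names and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_binomial_glm(X: Sequence[Sequence[float]], successes: Sequence[float],
                     trials: Sequence[float], n_iter: int = 25) -> List[float]:
    """Maximum likelihood fit of logit(pi_d) = x_d' beta on grouped counts."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y, n in zip(X, successes, trials):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(k):
                grad[i] += x[i] * (y - n * p)            # score
                for j in range(k):
                    hess[i][j] += n * p * (1 - p) * x[i] * x[j]  # information
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Hypothetical tracts: intercept plus one SVI percentile rank in [0, 1].
X_toy = [[1.0, 0.1], [1.0, 0.4], [1.0, 0.7], [1.0, 0.9]]
beta_toy = fit_binomial_glm(X_toy, successes=[8, 12, 20, 14], trials=[30, 30, 40, 25])
```

Passing counts and trials directly is equivalent to supplying sample proportions with the sample sizes as weights, which is how the text describes the glm call.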
The final model, including the covariates RPL_THEME1 and RPL_THEME3, was then used to produce tract-level estimates of ever-smoking rates. The tract-specific direct survey estimates of smoking rates were defined as follows. Let y_{di} denote the variable of interest for person i in tract d (d = 1, …, D). In particular, y_{di} is a binary variable that takes the value 1 if person i in tract d has ever smoked and 0 otherwise. Here, D is the total number of tracts in the study population, where D_{1} and D_{2} are the number of tracts with and without sample data, respectively, such that D_{1} + D_{2} = D. The aim is to estimate the proportion of ever smokers, \({P}_d={N}_d^{-1}{\sum}_{i=1}^{N_d}{y}_{di}\), in tract d, where N_{d} is the population size of tract d. Let w_{di} be the survey weight for person i in tract d. The direct estimator (denoted by Direct) for P_{d} is \({\hat{p}}_d^{Direct}={\left({\sum}_{i=1}^{n_d}{w}_{di}\right)}^{-1}{\sum}_{i=1}^{n_d}{w}_{di}{y}_{di}\), with the estimate of variance of the Direct estimator given by \(v\left({\hat{p}}_d^{Direct}\right)\approx {\left({\sum}_{i=1}^{n_d}{w}_{di}\right)}^{-2}{\sum}_{i=1}^{n_d}{w}_{di}\left({w}_{di}-1\right){\left({y}_{di}-{\hat{p}}_d^{Direct}\right)}^2\), where n_{d} is the sample size for tract d.
If simple random sampling (SRS) is used for survey data collection, \({\hat{p}}_d^{Direct}={p}_d={\left({n}_d\right)}^{-1}{\sum}_{i=1}^{n_d}{y}_{di}\) is the simple sample proportion and \(v\left({\hat{p}}_d^{Direct}\right)\approx {\left({n}_d\right)}^{-1}{p}_d\left(1-{p}_d\right)\), where \({y}_d={\sum}_{i=1}^{n_d}{y}_{di}\) denotes the sample count in tract d. If the sampling design is informative, this SRS-based version of Direct may be biased.
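The two estimators above translate directly into code. The sketch below is illustrative only; the function names and the example responses and weights are hypothetical.

```python
from typing import Sequence, Tuple


def direct_estimate(y: Sequence[int], w: Sequence[float]) -> Tuple[float, float]:
    """Survey-weighted direct estimate of a tract proportion and its variance."""
    sw = sum(w)
    p = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # v ~ (sum w)^-2 * sum w (w - 1) (y - p)^2
    v = sum(wi * (wi - 1.0) * (yi - p) ** 2 for wi, yi in zip(w, y)) / sw ** 2
    return p, v


def srs_estimate(y: Sequence[int]) -> Tuple[float, float]:
    """Unweighted (SRS) sample proportion and its variance p (1 - p) / n."""
    n = len(y)
    p = sum(y) / n
    return p, p * (1.0 - p) / n


# Hypothetical tract with four respondents (1 = ever smoked).
y_toy = [1, 0, 1, 1]
w_toy = [1.5, 3.0, 2.0, 1.5]
p_w, v_w = direct_estimate(y_toy, w_toy)
p_u, v_u = srs_estimate(y_toy)
```

Here the respondent who never smoked carries the largest weight, so the weighted estimate falls below the unweighted sample proportion; this kind of gap is what signals an informative design.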
Let u_{d} denote the tract-specific random effects that capture the dissimilarities between the tracts. If we ignore the sampling design, the sample count y_{d} in tract d can be assumed to follow a binomial distribution with parameters n_{d} and π_{d}, i.e., y_{d} | u_{d} ∼ Bin(n_{d}, π_{d}); d = 1, …, D_{1}. This leads to E(y_{d} | u_{d}) = n_{d}π_{d}. Let x_{d} be the k-vector of covariates for tract d available from secondary data sources. Following previous work by study team members [21, 22] as well as others [23,24,25], the aggregate-level version of the logistic linear mixed model (LLMM) linking the probability π_{d} with the covariates x_{d} is expressed as
\[\mathrm{logit}\left({\pi}_d\right)=\ln \left\{{\pi}_d{\left(1-{\pi}_d\right)}^{-1}\right\}={\eta}_d={\boldsymbol{x}}_d^T\boldsymbol{\beta} +{u}_d, \qquad (1)\]
with \({\pi}_d=\mathit{\exp}\left({\boldsymbol{x}}_d^T\boldsymbol{\beta} +{u}_d\right){\left\{1+\mathit{\exp}\left({\boldsymbol{x}}_d^T\boldsymbol{\beta} +{u}_d\right)\right\}}^{-1}\). Here β is the k-vector of regression coefficients and u_{d} is assumed to be independent and normally distributed with mean zero and variance \({\sigma}_u^2\).
Assuming N_{d} ≫ n_{d}, an empirical plug-in predictor (EPP) of the smoking proportion in tract d is given by
\[{\hat{p}}_d^{EPP}=\mathit{\exp}\left({\boldsymbol{x}}_d^T\hat{\boldsymbol{\beta}}+{\hat{u}}_d\right){\left\{1+\mathit{\exp}\left({\boldsymbol{x}}_d^T\hat{\boldsymbol{\beta}}+{\hat{u}}_d\right)\right\}}^{-1}. \qquad (2)\]
To compute the small area estimates by eq. (2), the unknown parameters β and \(\boldsymbol{u}={\left({u}_1,\dots, {u}_{D_1}\right)}^T\) are estimated using an iterative procedure that combines penalized quasi-likelihood estimation of β and u with restricted maximum likelihood (REML) estimation of \({\sigma}_u^2\). For tracts with no sample data (n_{d} = 0), the synthetic-type predictor of the smoking proportion in tract d is given by
\[{\hat{p}}_d^{SYN}=\mathit{\exp}\left({\boldsymbol{x}}_d^T\hat{\boldsymbol{\beta}}\right){\left\{1+\mathit{\exp}\left({\boldsymbol{x}}_d^T\hat{\boldsymbol{\beta}}\right)\right\}}^{-1}. \qquad (3)\]
The mean squared error (MSE) estimation of the small area predictors (2) and (3) follows Chandra et al. (2019) [21].
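As an illustration of how the plug-in and synthetic-type predictors differ, the sketch below applies the inverse logit to a fitted linear predictor with and without an estimated tract effect. The slopes 0.824 and −0.633 are taken from the generalized linear model reported in the Results and are used here only as plausible magnitudes; the intercept, the covariate values, and the random-effect estimate are hypothetical, and the function names are not from the study.

```python
import math
from typing import Sequence


def inv_logit(eta: float) -> float:
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def linear_predictor(x: Sequence[float], beta: Sequence[float]) -> float:
    return sum(xi * bi for xi, bi in zip(x, beta))


def epp(x: Sequence[float], beta: Sequence[float], u_hat: float) -> float:
    """Plug-in predictor for a sampled tract: uses the estimated tract effect."""
    return inv_logit(linear_predictor(x, beta) + u_hat)


def synthetic(x: Sequence[float], beta: Sequence[float]) -> float:
    """Synthetic-type predictor for a non-sample tract: no tract effect."""
    return inv_logit(linear_predictor(x, beta))


# x = (1, RPL_THEME1, RPL_THEME3); the intercept -0.3 is a made-up value.
beta_hat = [-0.3, 0.824, -0.633]
x_tract = [1.0, 0.80, 0.25]
p_sampled = epp(x_tract, beta_hat, u_hat=0.15)   # hypothetical u_hat
p_nonsample = synthetic(x_tract, beta_hat)
```

A non-sample tract has no estimated random effect to borrow, so its prediction rests entirely on the covariates, which is consistent with the larger standard errors reported for such tracts.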
To determine whether the sampling design used in survey data collection must be incorporated for valid inference about the population, we compute the effective sample sizes and the effective sample counts for the sample data, as described previously [21]. Use of effective sample size rather than the actual sample size allows for the varying information in each area under complex sampling. Following previous work, we use the effective sample sizes in place of observed sample sizes to incorporate the sampling design [22, 26].
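The source does not spell out the formulas, which are given in [21]. As a hedged illustration, the sketch below uses one common definition, assumed here rather than taken from the text: the design effect is the ratio of the design-based variance to the SRS variance, the effective sample size is the observed sample size divided by the design effect, and the effective sample count is the effective sample size times the direct estimate. The function name and inputs are hypothetical.

```python
from typing import Tuple


def effective_sample(p_direct: float, v_direct: float, n: int) -> Tuple[float, float]:
    """Effective sample size and count under an ASSUMED Kish-type definition.

    deff  = v_direct / v_srs, with v_srs = p (1 - p) / n
    n_eff = n / deff
    y_eff = n_eff * p_direct
    """
    if not 0.0 < p_direct < 1.0 or v_direct <= 0.0:
        raise ValueError("deff is undefined when p is 0 or 1, or the variance is zero")
    v_srs = p_direct * (1.0 - p_direct) / n
    deff = v_direct / v_srs
    n_eff = n / deff
    return n_eff, n_eff * p_direct


# Hypothetical tract: 100 respondents, design-based variance twice the SRS one.
n_eff, y_eff = effective_sample(p_direct=0.5, v_direct=0.005, n=100)
```

Under this definition, a design effect above 1 shrinks the sample size that the binomial model is allowed to treat as independent information.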
Diagnostic measures
These measures are used for examining the assumptions of the underlying models and assessing the empirical performance of the EPP method. Generally, two types of such measures are suggested and commonly employed in SAE applications: (i) the model diagnostics, and (ii) the diagnostics for the small area estimates. The main purpose of model diagnostics is to verify the distributional assumptions of the underlying small area model, i.e., how well this working model performs when it is fitted to the survey data. The other diagnostics are used to validate the reliability of the model-based small area estimates.
In the LLMM, eq. (1), the random tract-specific effects are assumed to have a normal distribution with mean zero and fixed variance. If the model assumptions are satisfied, then the tract-level random effects (or residuals) are expected to be randomly distributed around, and not significantly different from, the line y = 0. From eq. (1), the area-level random effects (or residuals) are defined as \({\hat{u}}_d={\hat{\eta}}_d-{\boldsymbol{x}}_d^T\hat{\boldsymbol{\beta}};\ d=1,\dots, D\). To examine the normality assumption, we used the histogram (left plot), the normal probability (q-q) plot (center plot), and the distribution of the tract-level residuals (right plot), as shown in Supplementary Fig. F1. The Shapiro-Wilk test (implemented using the shapiro.test() function in R) was also used to examine the normality of the tract random effects. The value of the Shapiro-Wilk test statistic was 0.984 with 285 degrees of freedom (p-value = 0.002). Although a p-value this small formally indicates a departure from normality, the test is sensitive to minor deviations in samples of this size, and the statistic is close to 1. The tract-level residuals appear to be randomly distributed around zero, and the histogram and q-q plot suggest that the tract random effects are approximately normally distributed (Supplementary Fig. F1).
Further, a set of diagnostics described previously [27, 28] is also considered for assessing the validity and reliability of the tract-wise estimates generated by the EPP method. Here, we used four commonly used measures that address these requirements: a bias diagnostic, a goodness of fit test, a percent coefficient of variation diagnostic, and a 95% confidence interval diagnostic. The first two diagnostics examine the validity, and the last two assess the reliability or improved precision, of the model-based small area estimates. In addition, we implemented a calibration diagnostic where the model-based estimates are aggregated to a higher level and compared with direct survey estimates at this level. Here, direct estimates DIR \(\left({\hat{p}}_d^{Direct}\right)\) are defined as the survey-weighted direct estimates. We compute bias (Bias) and average relative difference (RE) between the direct \(\left({\hat{p}}_d^{Direct}\right)\) and the EPP \(\left({\hat{p}}_d^{EPP}\right)\) estimates as: \(Bias={D}_1^{-1}{\sum}_{d=1}^{D_1}{\hat{p}}_d^{Direct}-{D}_1^{-1}\left({\sum}_{d=1}^{D_1}{\hat{p}}_d^{EPP}\right)\), and \(RE={D}_1^{-1}{\sum}_{d=1}^{D_1}\left(\frac{{\hat{p}}_d^{Direct}-{\hat{p}}_d^{EPP}}{{\hat{p}}_d^{Direct}}\right)\), respectively. The calculated Bias is due to our model-based estimation step and not the randomization used in Step 1.
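The Bias and RE diagnostics defined above can be computed as in the following minimal sketch. The function name and the example values are hypothetical; the sketch also assumes every direct estimate is nonzero, since RE divides by it, and the source does not say how zero direct estimates were handled.

```python
from typing import Sequence, Tuple


def bias_and_re(direct: Sequence[float], epp: Sequence[float]) -> Tuple[float, float]:
    """Bias and average relative difference over the D_1 sampled tracts."""
    if len(direct) != len(epp) or not direct:
        raise ValueError("direct and epp must be non-empty and of equal length")
    d1 = len(direct)
    bias = sum(direct) / d1 - sum(epp) / d1
    # Assumes each direct estimate is nonzero (handling of zeros is not
    # described in the source).
    re = sum((d - e) / d for d, e in zip(direct, epp)) / d1
    return bias, re


# Hypothetical direct and model-based estimates for three sampled tracts.
bias_toy, re_toy = bias_and_re([0.50, 0.40, 0.60], [0.45, 0.44, 0.57])
```

Values of both measures near zero indicate that the model-based estimates track the direct estimates on average, which is what the bias diagnostic is meant to check.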
Results
Out of the 8836 survey respondents used for the microsimulation in Step 1, 5901 (i.e., more than two-thirds) received a final tract assignment (Supplementary Fig. F2). In general, proportions of groups by education, race, and sex across the five age categories were similar between the 2015 census and our microsimulated datasets (Supplementary Fig. F3). Out of a total of 468 Allegheny County tracts in the survey data, we had 286 tracts with samples, and the rest were out of sample. In the sample data, the sample count (i.e., number of ever-smokers in the sample) was 4517. For this study, auxiliary variables were available for only 458 tracts (285 with and 173 without sample data). Therefore, further analysis considered only these 458 tracts for estimating the ever-smoking rate using SAE. At this stage, the survey data had a total sample size of 5892 respondents and a sample count of 2689 (Table 1).
Across tracts, the sample size ranged from one to 160 with an average of 21. The average sample count was nine per tract, with a range of zero to 71. About 32% (91 out of 285) of total tracts had samples of fewer than five people (Fig. 1). Because the majority of the points lie to the right of the diagonal line (Supplementary Fig. F4a), the effective sample size is smaller than the observed sample size for most tracts. Similarly, in most cases, the effective sample counts are smaller than the observed sample counts (Supplementary Fig. F4b). Supplementary Fig. F5 shows that the unweighted direct estimates underestimate the number of ever-smokers, which indicates that the sampling design is indeed informative, when compared to simple random sampling (SRS), in such tracts. Hence, the sampling weights cannot be ignored in our SAE analysis (Table 1).
We fitted generalized linear models between unweighted proportions of smoking and the four SVI themes to choose the appropriate auxiliary variables. The two auxiliary variables RPL_THEME1 and RPL_THEME3 were significant predictors of the ever-smoking rate, with an Akaike Information Criterion (AIC) value of 1205.5 (Table 2). Further, the association with ever-smoking was positive for RPL_THEME1 (coefficient: 0.824, p < 0.001) and negative for RPL_THEME3 (coefficient: −0.633, p < 0.001). The null deviance of the model was 532 with 284 degrees of freedom, and adding RPL_THEME1 and RPL_THEME3 to the model reduced the residual deviance to 478 with a loss of two degrees of freedom. Using these covariates, the tract-level SAEs, and the corresponding standard errors, were computed. The results are shown in the Supplementary Table S2, along with census tract-specific sociodemographic variables described in the Supplementary Table S3. The excluded tracts are shown in the Supplementary Table S4.
To validate our results, we compared our tract-level SAEs of ever-smoking rates with the corresponding estimates from a previous study [8] for the groups of years 1991–1995, 1996–2000, 2001–2005, and 2006–2010. The two sets of estimates showed positive, significant correlations (correlation coefficients: ~ 0.51, p < 0.001) (Fig. 2). However, our rate estimates ranged from 20 to 72%, whereas these prior estimates had a narrower spread (~ 10–40%). In our analysis, the tracts with the highest estimated ever-smoking rate, slightly over 70%, were located southeast of the city of Pittsburgh. Other tracts with relatively high rates (> 65%) were located within neighborhoods in the southern (Hazelwood, Arlington, Carrick) and eastern (East Hills) sections. There was also a cluster of tracts with relatively high rates to the west of Pittsburgh (Fig. 3a). As expected, the standard errors of the SAEs are higher in non-sample tracts (Fig. 3b). Distributions were similar between tracts in the city of Pittsburgh versus outside of Pittsburgh, although the SAEs for non-city tracts had slightly more spread (Supplementary Fig. F6).
Discussion
In this study, we have developed a new two-step algorithm for rescaling behavioral survey data and modeling the prevalence of small area-level behaviors. Health surveys, including the BRFSS and others, often do not provide spatial resolution below the state or county level. The local BRFSS survey administered in Allegheny County did collect zip code of residence, but without tract assignments, linkage with informative, ancillary data sources, such as the SVI, is difficult. Our microsimulation step allowed us to distribute survey respondents to tracts within the study area in a way that reflected the known sociodemographic composition of the tracts. While not every survey respondent received a final tract assignment, we gained in spatial resolution for the others. Such disaggregation of population data at an informative spatial scale can mitigate statistical bias that may appear in the form of the modifiable areal unit problem (MAUP) [29].
According to the most recent Surgeon General’s report, 13.7% of U.S. adults smoke [2]. Although the adult smoking rate in Allegheny County decreased from 23% in 2009–2010 to 19% in 2017 [2], this still exceeds the national rate. Racial disparities also persist in the county, both for smoking and smoking-related health outcomes. African Americans are both more likely to smoke (30% versus 17% of whites) and have rates of lung cancer 15–30% higher than whites [14]. The SAEs of smoking rates demonstrated in this study, and their use of tract-specific (socioeconomic, and minority and language-based) vulnerability covariates in the estimation, could inform local smoking cessation interventions to further decrease smoking rates in the county, particularly for high-risk groups such as those with higher levels of poverty or unemployment. In addition, lower socioeconomic communities face greater burdens of environmental pollution [30], further compounding their risks for cancer and other diseases.
Past applications of SAE on BRFSS data, e.g., Zhang et al. (2014) [12], were based on fitting a unit level logistic linear mixed model to BRFSS data and then drawing 1000 random samples from their estimated conditional distributions using the fitted model parameters, and thus, generating a sample of 1000 SAEs for each small area defined in the study [10]. The efficacy of the generated SAEs is therefore highly dependent upon the fitted model. The SAE method under an area level, logistic linear mixed model applied in this paper is a widely used approach if the model covariates are only available in aggregate form. It has a simple and closed form expression and, therefore, national statistical agencies (e.g., Office for National Statistics, Australian Bureau of Statistics, etc.) often prefer it.
Yet, the present study has some limitations. The spatial rescaling in Step 1, which gains the ability to include insightful covariates, comes at the potential cost of some loss of samples. In the probabilistic CO procedure of Step 1, a degree of randomness is involved in the spatial assignments [19]. Given the current methodological limitations in estimating the uncertainty in the results of microsimulation, algorithms such as CO could introduce bias into the small area estimates. This is, however, a more general problem which needs to be addressed in future work. In Step 2, as one would expect, standard errors were higher among non-sample compared to sample tracts. Caution should be used in interpreting the SAE results in these non-sample tracts. We do not have reliable, direct-estimate data to validate our SAE census tract results, although they correlate significantly with those from past studies. Finally, while these tract-level estimates may be used to target smoking cessation interventions or help identify high-risk communities for smoking and related health outcomes, they cannot be used to draw inferences about smoking habits of specific individuals residing in the small areas.
Conclusion
We proposed a two-step algorithm for rescaling survey data to more granular geographic levels for which ancillary data may be available, in order to produce locally relevant estimates of health-related risk behaviors at these levels. We used smoking rates in Allegheny County, PA, both as a case study to demonstrate the algorithm and to create tract-level estimates that may be used in local public health interventions or additional studies. Future work could leverage the methods described here for other health surveys, locations, diseases, and health-related behaviors.
Availability of data and materials
The datasets generated and analyzed during the current study are not publicly available but are available from the corresponding author on reasonable request.
Abbreviations
ACS: American Community Survey
AIC: Akaike Information Criterion
BRFSS: Behavioral Risk Factor Surveillance System
CO: Combinatorial Optimization
COPD: Chronic Obstructive Pulmonary Disease
EPP: Empirical Plug-In Predictor
LLMM: Logistic Linear Mixed Model
MAUP: Modifiable Areal Unit Problem
MSE: Mean Squared Error
SVI: Social Vulnerability Index
U.S. E.P.A.: United States Environmental Protection Agency
References
Drope J, Liber AC, Cahn Z, Stoklosa M, Kennedy R, Clifford, et al. Who’s still smoking? Disparities in adult cigarette smoking prevalence in the United States. CA Cancer J Clin. 2018;68(2):106–15 Available from: https://onlinelibrary.wiley.com/doi/full/10.3322/caac.21444. [cited 2022 Jul 1].
Smoking Cessation: A Report of the Surgeon General – Key Findings [Internet]. [cited 2021 May 1]. Available from: https://www.hhs.gov/surgeongeneral/reportsandpublications/tobacco/2020cessationsgrfactsheetkeyfindings/index.html
Hylkema MN, Sterk PJ, de Boer WI, Postma DS. Tobacco use in relation to COPD and asthma. Eur Respir J. 2007;29(3):438–45 Available from: https://erj.ersjournals.com/content/29/3/438d. [cited 2022 Jul 6].
Pulmonary Diseases. In: How Tobacco Smoke Causes Disease: The Biology and Behavioral Basis for Smoking-Attributable Disease: A Report of the Surgeon General. Centers for Disease Control and Prevention (US); National Center for Chronic Disease Prevention and Health Promotion (US); Office on Smoking and Health (US). Atlanta (GA): Centers for Disease Control and Prevention (US) [Internet]. 2010 [cited 2022 Jul 1]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK53021/
Behavioral Risk Factor Surveillance System [Internet]. [cited 2021 May 1]. Available from: https://www.cdc.gov/brfss/index.html
Rao JNK, Molina I. Small area estimation. 2nd ed: Wiley Series in Survey Methodology; 2015.
Liu B, Parsons V, Feuer EJ, Pan Q, Town M, Raghunathan TE, et al. Small area estimation of Cancer risk factors and screening behaviors in US counties by combining two large National Health Surveys. Prev Chronic Dis. 2019;16(8):E119.
Ortega Hinojosa AM, Davies MM, Jarjour S, Burnett RT, Mann JK, Hughes E, et al. Developing small-area predictions for smoking and obesity prevalence in the United States for use in environmental public health tracking. Environ Res. 2014;1(134):435–52.
Song L, Mercer L, Wakefield J, Laurent A, Solet D. Using small-area estimation to calculate the prevalence of smoking by subcounty geographic areas in King County, Washington, behavioral risk factor surveillance system, 2009–2013. Prev Chronic Dis. 2019;13(5):E59.
Wang Y, Holt JB, Zhang X, Lu H, Shah SN, Dooley DP, et al. Comparison of methods for estimating prevalence of chronic diseases and health behaviors for small geographic areas: Boston validation study, 2013. Prev Chronic Dis. 2019;14:E99.
PLACES: Local Data for Better Health [Internet]. Centers for Disease Control and Prevention. [cited 2021 May 1]. Available from: https://www.cdc.gov/places/index.html
Zhang X, Holt JB, Lu H, Wheaton AG, Ford ES, Greenlund KJ, et al. Multilevel regression and poststratification for small-area estimation of population health outcomes: a case study of chronic obstructive pulmonary disease prevalence using the behavioral risk factor surveillance system. Am J Epidemiol. 2014;179(8):1025–33. Available from: https://academic.oup.com/aje/article/179/8/1025/109078. [cited 2022 Jul 1].
Zhang X, Holt JB, Yun S, Lu H, Greenlund KJ, Croft JB. Validation of multilevel regression and poststratification methodology for small area estimation of health indicators from the behavioral risk factor surveillance system. Am J Epidemiol. 2015;182(2):127–37. Available from: https://pubmed.ncbi.nlm.nih.gov/25957312/. [cited 2022 Jul 1].
Monaghan A, Jones L, Brink LA, Hacker K. Comparison of census tract-level chronic disease prevalence estimates from 500 cities and local health claims data. J Public Health Manag Pract. 2022;28(1). Available from: https://pubmed.ncbi.nlm.nih.gov/32332486/. [cited 2022 Jul 1].
Pierannunzi C, Xu F, Wallace RC, Garvin W, Greenlund KJ, Bartoli W, et al. A methodological approach to small area estimation for the behavioral risk factor surveillance system. Prev Chronic Dis. 2019;13(7):E91.
Hacker K, Brink L, Jones L, Monroe C. Results from the 2015–16 Allegheny County health survey (ACHS): measuring the health of adult residents. Pittsburgh: Allegheny County Health Department; 2017. Available from: https://www.alleghenycounty.us/HealthDepartment/Resources/DataandReporting/ChronicDiseaseEpidemiology/AlleghenyCountyCommunityHealthAssessment.aspx
American Community Survey (ACS) [Internet]. United States Census Bureau. [cited 2021 May 1]. Available from: https://www.census.gov/programssurveys/acs
CDC/ATSDR Social Vulnerability Index (SVI) [Internet]. Centers for Disease Control and Prevention. [cited 2021 Mar 1]. Available from: https://www.atsdr.cdc.gov/placeandhealth/svi/index.html
Williamson P. Combinatorial optimisation. In: Whitworth A, editor. Evaluations and improvements in small area estimation methodologies: National Centre for Research Methods [Internet]. National Centre for Research Methods; 2013. p. 11–3. Available from: https://eprints.ncrm.ac.uk/id/eprint/3210/.
Templ M, Meindl B, Kowarik A, Dupriez O. Simulation of Synthetic Complex Data: The R Package simPop. J Stat Softw. 2017;79:1–38 Available from: https://www.jstatsoft.org/index.php/jss/article/view/v079i10. [cited 2022 Jul 6].
Chandra H, Chambers R, Salvati N. Small area estimation of survey weighted counts under aggregated level spatial model. Surv Methodol. 2019;45(1):31–59 Available from: https://www150.statcan.gc.ca/n1/pub/12001x/2019001/article/00006eng.htm. [cited 2022 Jul 1].
Anjoy P, Chandra H, Aditya K. Spatial hierarchical Bayes small area model for disaggregated level crop acreage estimation. Indian J Agric Sci. 2020;90(9):1780–5 Available from: www.agrilmktg.org.
Burgard JP, Dörr P. Survey-weighted generalized linear mixed models. Research Papers in Economics, No. 1/18. Universität Trier, Fachbereich IV – Volkswirtschaftslehre, Trier; 2018.
Hobza T, Morales D. Empirical best prediction under unit-level logit mixed models. J Off Stat. 2016;32(3):661–92.
Jiang J. Empirical best prediction for small-area inference based on generalized linear mixed models. J Stat Plan Inference. 2003;111(1–2):117–27.
Korn E, Graubard B. Confidence Intervals for Proportions with Small Expected Number of Positive Counts Estimated from Survey Data. Surv Methodol. 1998;24:193–201 Available from: https://www150.statcan.gc.ca/n1/pub/12001x/1998002/article/4356eng.pdf. [cited 2022 Jul 1].
Chandra H, Salvati N, Sud UC. Disaggregate-level estimates of indebtedness in the state of Uttar Pradesh in India: an application of small-area estimation technique. J Appl Stat. 2011;38(11):2413–32. Available from: https://www.tandfonline.com/doi/abs/10.1080/02664763.2011.559202. [cited 2022 Jul 1].
Brown G, Chambers R, Heady P, et al. Evaluation of small area estimation methods – an application to unemployment estimates from the UK LFS. In: Proceedings: symposium 2001, achieving data quality in a statistical agency: a methodological perspective, Statistics Canada; 2001.
Openshaw S. The modifiable areal unit problem. Norwich: Geo Books; 1984.
Hajat A, Hsia C, O’Neill MS. Socioeconomic disparities and air pollution exposure: a global review. Curr Environ Health Rep. 2015;2(4):440–50. Available from: https://pubmed.ncbi.nlm.nih.gov/26381684/. [cited 2022 Jul 1].
Acknowledgements
The co-authors dedicate the article to the memory of Dr. Hukum Chandra (1972–2021).
Funding
This work was supported by grant P30CA047904 from the UPMC Hillman Developmental Funding Program. The authors declare that they have no conflicts of interest.
Author information
Contributions
SLS conducted the microsimulation and wrote the manuscript. HC created the small area estimates and co-wrote the manuscript methods. SG and RG provided support for the analysis and for the creation of figures and tables. LLB, LBR, DOW and JMY provided extensive edits and suggestions to improve the manuscript. SP conceived the project, guided the development of the methodology, contributed to the statistical analysis, and co-wrote the manuscript. The authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
All experimental protocols were approved by The University of Pittsburgh Institutional Review Board (STUDY19040081). All methods were carried out in accordance with relevant guidelines and regulations. Informed consent has been obtained from all participants.
Consent for publication
Not applicable.
Competing interests
The authors have no competing interests as defined by BMC, or other interests that might be perceived to influence the results and/or discussion reported in this paper.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Fig. F1. Histograms (left plot), normal Q-Q plots (center plot) and distributions of the tract-level residuals (right plot).
Fig. F2. Scatterplot of Incons (Max − Min) versus Max values for each of N = 8836 survey respondents due to spatial assignments in three sets of 100 microsimulations. Empirically, the dotted lines show the most inclusive thresholds at Max ≥ 40 and Incons < 50. The resulting included assignments are shown as red dots.
Fig. F3. Barplots comparing the 2015 Census data (C) and microsimulation results (M), with paired bars that show the proportions of each category of (a) sex, (b) race, and (c) education across 5 groups of increasing age.
Fig. F4. Tract-wise effective sample size vs. observed sample size (a), and effective sample count vs. observed sample count (b).
Fig. F5. Tract-wise survey-weighted vs. unweighted direct estimates of smoking rates.
Fig. F6. Box plots comparing small area estimates of smoking rates and standard errors between the city of Pittsburgh and non-Pittsburgh tracts.
Table S1. The sociodemographic profiles of the BRFSS survey respondents (Allegheny County, PA, 2015–2016). The data are from Table 8 in the report, ‘Results from the 2015–2016 Allegheny County Health Survey: Measuring the Health of Adult Residents’, K. Hacker, et al. (2017), Allegheny County Health Department, Pittsburgh, PA (Reference 16, main paper).
Table S2. Tract-specific small area estimates and associated variables for Allegheny County, PA.
Table S3. Data dictionary for the variables in Table S2.
Table S4. Tracts removed from small area analysis.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Stacy, S.L., Chandra, H., Guha, S. et al. Rescaling and small area estimation of behavioral risk survey guided by social vulnerability data. BMC Public Health 23, 184 (2023). https://doi.org/10.1186/s12889-022-14970-4
DOI: https://doi.org/10.1186/s12889-022-14970-4
Keywords
 Behavioral risk factor
 BRFSS
 Smoking
 Spatial microsimulation
 Small Area Estimation