Subnational estimation of modern contraceptive prevalence in five sub-Saharan African countries: a Bayesian hierarchical approach

Background: Global monitoring efforts have relied on national estimates of the modern contraceptive prevalence rate (mCPR) for many low-income countries. However, most contraceptive delivery programs are implemented by health departments at lower administrative levels, reflecting a persisting gap between the availability of and need for subnational mCPR estimates.

Methods: Using woman-level data from multiple semi-annual national survey rounds conducted between 2013 and 2016 in five sub-Saharan African countries (Burkina Faso, Ethiopia, Ghana, Kenya, and Uganda) by the Performance Monitoring and Accountability 2020 (PMA2020) project, we propose a Bayesian hierarchical model with a standard set of covariates and temporally correlated random effects to estimate the level and trend of mCPR for first-level administrative divisions in each country.

Results: There is considerable narrowing of the uncertainty interval (UI) around the model-based estimates, compared to the estimates based directly on the survey data. We find substantial variation in the estimated subnational mCPRs. Uganda, for example, shows a gain in mCPR of 6.4 percentage points (95% UI: 4.5–8.3) based on model estimates of 20.9% (19.6–22.2) in mid-2014 and 27.3% (26.0–28.8) in mid-2016, with change across 10 regions ranging from −0.6 points in Karamoja to 9.4 points in the Central 2 region. The lower bound of the UI of the change over four rounds was above 0 in 6 regions. Similar upward trends are observed for most regions in the other four countries, and there is noticeable within-country geographic variation.

Conclusions: Reliable subnational estimates of mCPR empower health departments in evidence-based policy making. Despite nationally increasing mCPRs, regional disparities exist within countries, suggesting uneven contraceptive access. Raising investments in disadvantaged areas may be warranted to increase equity in access to modern contraceptive methods.
Electronic supplementary material The online version of this article (10.1186/s12889-019-6545-3) contains supplementary material, which is available to authorized users.


Introduction
We provide mid-level documentation of our analytic approaches to small area estimation (SAE). For background and technical details on Bayesian methods, see Banerjee et al. (2014); Carlin and Louis (2009); Diggle (2014); Gelman et al. (2013). See the main manuscript for details on the analysis goals, data, and results.

Woman-level, Bernoulli Modeling
Because some covariates vary over women within an EA, modeling must be Bernoulli (0/1 outcome) at the woman-specific level, with estimates 'rolled up' to the EA level. To fix ideas, the following is for a single survey wave, so the subscript t of Section 3 is omitted. Notation:

• k = 1, ..., K: EA index.
• i = 1, ..., n_k: indexes women in EA k.
• Y_ik: 0/1 indicator of woman i in EA k (not using)/(using) birth control (or whatever other binary outcome is relevant).
• P̂_k = Y_+k/n_k: the direct (unadjusted) estimate for EA k, where Y_+k = Σ_i Y_ik.
• X_ik: regressors for woman i in EA k, a 1 × q row vector including the intercept. Note that some covariates may be EA-specific, but it is best to retain the (i, k) subscript for all covariates.
• P_ik = P(X_ik β + u_k): the true, underlying woman-specific probability for woman i in EA k, conditional on [X_ik, β, U_k = u_k], where P(·) is the inverse logit ('expit').
• The 'average logistic', Avelogistic_k = (1/n_k) Σ_{i=1}^{n_k} P(X_ik β + U_k), integrated over the posterior distribution of β and over the prior distribution of the U_k. The Avelogistic plays the role of a standard logistic regression fit, but brings in uncertainty in the slopes (frequentists should do this too!) and also integrates over the prior distribution of the U_k.
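As a concrete sketch of the roll-up, the following illustrative Python (not the analysis code; the data, draws, and function names are invented for illustration) computes Avelogistic_k from posterior draws of β and prior draws of U_k:

```python
import math
import random

def expit(x):
    # inverse logit: P(x) = e^x / (1 + e^x)
    return 1.0 / (1.0 + math.exp(-x))

def avelogistic(X_k, beta_draws, tau, rng):
    # Average logistic for one EA: average expit(X_ik beta + U_k) over women,
    # integrating over posterior draws of beta and the N(0, tau^2) prior on U_k.
    vals = []
    for beta in beta_draws:
        u_k = rng.gauss(0.0, tau)            # draw U_k from its prior
        probs = [expit(sum(x * b for x, b in zip(row, beta)) + u_k)
                 for row in X_k]             # woman-specific probabilities
        vals.append(sum(probs) / len(probs))
    return sum(vals) / len(vals)             # average over the draws

rng = random.Random(1)
X_k = [[1.0, 0.2], [1.0, -0.4], [1.0, 0.9]]  # 3 women: intercept + 1 covariate
beta_draws = [[-0.5 + 0.1 * rng.gauss(0, 1), 0.8] for _ in range(2000)]
print(round(avelogistic(X_k, beta_draws, tau=0.3, rng=rng), 3))
```

With real output, `beta_draws` would be the saved post-burn-in draws and tau would itself vary over draws.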

Using the MCMC samples
Both the population parameters and the EA-specific MCMC outputs are relevant to in- and out-of-sample inferences and predictions. Most programs, including BUGS and rstan, provide some summaries of monitored features, for example their mean, median, quantiles, etc. But, by monitoring and saving all relevant values, one has access to the full joint distribution of all quantities, a distribution that includes all uncertainties.

With ν = 1, ..., M indexing MCMC post-burn-in samples, pooled over chains, the following are available and need to be saved: β^(ν), the U_k^(ν), the variance parameters, and the P_+k^(ν) = (1/n_k) Σ_{i=1}^{n_k} P(X_ik β^(ν) + U_k^(ν)). Of course, the X_k are also available.

For each k, the following EA-specific summaries using P_+k^(1), ..., P_+k^(M) are of primary interest (of course, others can be computed):
• The full posterior distribution: the histogram or a smoothed density.
• The sample mean, P̄_+k = (1/M) Σ_ν P_+k^(ν).
• SD(P_+k | data): the sample standard deviation of the P_+k^(ν). Note that this is the SD of the estimate, so it is its SE. (Warning: using the SE for CIs is not recommended!)
• Percentile-based CI: the 2.5th, 50th and 97.5th percentiles, with the 50th being a 'point estimate' and the (2.5th, 97.5th) producing a CI. Consider using the format 2.5 | 50 | 97.5 (see Louis and Zeger, 2008).
• Moment-based CI: P̄_+k ± 1.96 × SD(P_+k) (not recommended!).

Since the posterior distribution of a P_+k can be highly skewed, the percentile approach is recommended. If you want moment-based intervals, do compute them on the logit scale: compute logits of the P_+k^(ν), do the analysis, and then invert ('expit') the endpoints. Better still, use the percentile-based CI.

• The posterior distribution of the U_k: for each EA k, the 2.5th, 50th and 97.5th percentiles, with the 50th being a 'point estimate' and the (2.5th, 97.5th) producing a CI.
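A minimal sketch of the recommended percentile-based summaries, in illustrative Python (the draws are simulated stand-ins for saved MCMC output):

```python
import random

def percentile(draws, q):
    # simple sample percentile from the sorted draws
    s = sorted(draws)
    idx = min(len(s) - 1, max(0, int(q * len(s))))
    return s[idx]

def summarize(draws):
    # point estimate = median; 95% interval = (2.5th, 97.5th) percentiles
    return (percentile(draws, 0.025), percentile(draws, 0.5), percentile(draws, 0.975))

rng = random.Random(7)
# stand-in for M post-burn-in draws of P_+k: a skewed distribution on (0, 1)
draws = [rng.betavariate(2, 8) for _ in range(4000)]
lo, med, hi = summarize(draws)
print(lo < med < hi)   # the interval brackets the point estimate
```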
3 The first-order autoregressive (AR1) model

We provide an overview; the relevant literature is needed to fill in the details. The index t denotes 'wave' and we focus on the U_kt. The complete model also includes the fixed effects X_ikt β (a more general model would allow a t index on β, so β_t). As for all regression models, the implicit assumption is that the unconditional mean structure is modeled by the fixed effects and that the U_kt are residuals with marginal mean 0. We focus on the first-order autoregressive (AR1) model, starting with a version that allows a wave-specific, cross-sectional variance (τ²_t) and then specializing to τ²_t ≡ τ². Several other time-series models are candidates for inducing longitudinal association among the U_kt; we outline these in Section 3.5.

The AR1 model
• U_kt are the EA- and wave-specific random effects, k = EA, t = wave.
• The AR1 model induces correlation ρ^s, with ρ ∈ [0, 1), for Us that are s time units apart. For equally spaced time increments, ρ^s = cor(U_kt, U_k(t+s)).
• Gaussian prior on the U_kt: U_k1 ~ N(0, τ²_1), with subsequent waves generated by the AR1 recursion (the conditional distributions below).
• Options for the prior on the τ²_t (independent for each t): Inverse Gamma; Uniform over some interval; in rstan, 'flat'.
• Options for the prior on ρ (for AR1 and other AR models, ρ should be restricted to [0, 1)).
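To make the AR1 behavior concrete, the following illustrative Python (a sketch, not the fitting code) simulates the stationary recursion U_1 ~ N(0, τ²), U_t = ρ U_(t−1) + τ(1 − ρ²)^{1/2} e_t and checks that the lag-1 correlation is close to ρ:

```python
import random

def simulate_ar1(T, rho, tau, rng):
    # stationary AR1: U_1 ~ N(0, tau^2); U_t = rho*U_{t-1} + tau*sqrt(1-rho^2)*e_t
    u = [rng.gauss(0.0, tau)]
    sd_innov = tau * (1.0 - rho * rho) ** 0.5
    for _ in range(T - 1):
        u.append(rho * u[-1] + rng.gauss(0.0, sd_innov))
    return u

rng = random.Random(0)
rho, tau = 0.6, 1.0
paths = [simulate_ar1(4, rho, tau, rng) for _ in range(50000)]
# empirical correlation between waves 1 and 2 should be close to rho
num = sum(p[0] * p[1] for p in paths) / len(paths)
den = tau * tau
print(round(num / den, 2))
```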

AR1 conditional distributions
We present the conditional distributions for a general set of τ²_t and for the special case τ²_t ≡ τ².

Longitudinal conditional distribution for a general τ²_t

Here is the distribution conditional on all previous U-values:

[U_kt | U_k1, ..., U_k(t−1)] ~ N( ρ (τ_t/τ_(t−1)) U_k(t−1), τ²_t (1 − ρ²) ).    (3)

Notes:
• Even though we condition on all of the prior Us, only the most recent is used to compute the conditional mean.
• Setting ρ = 0 unlinks the Us over time, so that there is no 'learning' from wave to wave.

3.3 The marginal distribution when τ_t ≡ τ

Taking three waves and τ_t ≡ τ, the marginal distribution of U = (U_k1, U_k2, U_k3) with equal time-spacing is

U ~ N(0, Σ),  Σ = τ² ×
  [ 1    ρ    ρ²
    ρ    1    ρ
    ρ²   ρ    1 ].    (4)

The correlations decrease exponentially fast with time-separation. Note that, as in the foregoing equations, the conditional distributions use only the most proximal Us.

General conditional distribution when τ_t ≡ τ

The AR1 structure isn't restricted to longitudinal relations; it depends on 'neighbors.' For example, using the covariance matrix in equation (4), we have

[U_2 | U_1, U_3] ~ N( ρ(U_1 + U_3)/(1 + ρ²), τ²(1 − ρ²)/(1 + ρ²) ).

There is automatic conditioning on the two neighbors, U_1 and U_3. Doing so reduces the variance more than conditioning on U_1 alone (no surprise!). More generally, with the AR1 structure, the conditional distribution of a U given all of the others depends only on its immediate neighbors.
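The two-neighbor conditional variance can be checked numerically by standard Gaussian conditioning on the covariance matrix in equation (4); the following illustrative Python does the 2 × 2 inversion by hand:

```python
# verify the AR1 'two-neighbor' conditional variance by Gaussian conditioning
rho, tau2 = 0.6, 1.0
# covariance of (U1, U2, U3) under AR1, equation (4)
S = [[1, rho, rho**2],
     [rho, 1, rho],
     [rho**2, rho, 1]]
S = [[tau2 * v for v in row] for row in S]
# condition U2 on (U1, U3): var = S22 - S21 S11^{-1} S12  (2x2 inverse by hand)
a, b = S[0][0], S[0][2]            # Var(U1), Cov(U1, U3)
det = a * a - b * b
inv = [[a / det, -b / det], [-b / det, a / det]]
s21 = [S[1][0], S[1][2]]           # Cov(U2, (U1, U3))
quad = sum(s21[i] * inv[i][j] * s21[j] for i in range(2) for j in range(2))
cond_var = S[1][1] - quad
closed_form = tau2 * (1 - rho**2) / (1 + rho**2)
print(abs(cond_var - closed_form) < 1e-12)   # True: matches the closed form
```

Conditioning on both neighbors gives a smaller variance than the one-neighbor value τ²(1 − ρ²), as the text notes.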

Bringing in the fixed effects
All of the foregoing is for the residuals U_kt. With covariates, let θ_ikt = logit(P_ikt) and write

θ_ikt = X_ikt β + U_kt
U_k1 ~ N(0, τ²)
U_kt = ρ U_k(t−1) + τ (1 − ρ²)^{1/2} e_kt, t ≥ 2,

with e_kt ~ N(0, 1), independent of the Us and the other es. Note that the 'autoregression' is the same as in equation (3); it operates on the residuals.

Extensions
The AR1 model is a subset of the far more general class of ARIMA(p, d, q) (autoregressive integrated moving average) models. The ARMA(p, q) models are a subset of these, with the following representation (for a single k), where the e's are all iid with mean 0:

U_t = φ_1 U_(t−1) + ... + φ_p U_(t−p) + e_t + θ_1 e_(t−1) + ... + θ_q e_(t−q).

We don't need this degree of flexibility and also don't have a sufficient number of waves to support much more than p + q ≤ 2. We'll probably stay with AR1 forever, but it is interesting to see the relation between the ARMA(1,0) and ARMA(0,1) covariance matrices.

Equation (4) is for the ARMA(1,0) model; the covariance matrix for the ARMA(0,1) model is

τ² ×
  [ 1   ρ   0
    ρ   1   ρ
    0   ρ   1 ],

and more generally the diagonal is 1, the first super- and sub-diagonals are ρ, and the entries are 0 elsewhere. In this model, correlation is confined to adjacent waves; it is 0 for any greater time-separation. An ARMA(1,1) model allows the lag-1 correlation and the rate of geometric decay with time-separation to be specified separately.
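A small illustrative Python sketch contrasting the two correlation structures (assuming the direct ρ-parameterizations above):

```python
def ar1_corr(T, rho):
    # ARMA(1,0): correlation rho^|t-s| decays geometrically
    return [[rho ** abs(t - s) for s in range(T)] for t in range(T)]

def ma1_corr(T, rho):
    # ARMA(0,1): correlation rho for adjacent waves, 0 beyond lag 1
    return [[1.0 if t == s else (rho if abs(t - s) == 1 else 0.0)
             for s in range(T)] for t in range(T)]

A, M = ar1_corr(4, 0.5), ma1_corr(4, 0.5)
print(A[0][2], M[0][2])   # lag-2 correlation: 0.25 under AR1, 0.0 under MA(1)
```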

Out of Sample Prediction
We consider general predictions and also those focused only on assessing fit of the regression model. The predictive distribution captures the full uncertainty in a 'future direct estimate' by including uncertainty in the predictive model and in the observed data conditional on that model. The standard Bayes estimates that condition on all observed data are as specified in Section 2; none of what follows changes them.

Out-of-sample prediction entails using a model informed by 'training data' to generate the full predictive, possibly joint, distribution for 'out-of-sample' units. In our context these are a subset of EAs identified by a list of (k, t) subscripts. The following assumes that a program is available that accommodates 'NA' to indicate a missing direct estimate, or that the model can be programmed to treat missing data items as 'parameters', and that the M post-burn-in imputations for the associated EA_kt can be captured. If neither approach is available, imputations need to be programmed 'by hand' (see Section ?? for a basic example). For complex models for the dependency of the U_kt (for example, a spatial model), that is quite challenging and most assuredly not recommended.
Note that in what follows we use Y_kt as shorthand for Y_+kt, P_kt for P_+kt, etc.

The method
The high-level method is very straightforward.
Step 1: Define the indicator I_kt = 0 if EA (k, t) is in the training sample and I_kt = 1 if it is held out for prediction.

Step 2: If I_kt = 1, put 'NA' for the direct estimate, or provide the appropriate code that treats the direct estimate as a parameter.

Step 3: Run the model, and retain the M post-burn-in draws for all unknowns, including the imputed 'direct estimates' for EAs with I_kt = 1.

Step 4: The draws for an EA with I_kt = 1 are the predictive distribution for it. In our application, denote them by Ỹ_kt^(ν), so the predicted prevalences are P̃_kt^(ν) = Ỹ_kt^(ν)/n_kt, where the tilde (˜) denotes an imputed rather than an observed value.

• The (k, t)-specific mean, P̃_kt^(•) = (1/M) Σ_ν P̃_kt^(ν), gives the point-estimate prediction.
• The interval with endpoints at the 2.5th and 97.5th percentiles gives the 95% prediction interval.
• If the direct estimate P̂_kt is available (but not used in estimating the model), then one can compute the traditional (observed − predicted)/SD standardized residual, and also the more appropriate Z-value computed from the inverse Gaussian (probit) transform of the percentile location, along with other diagnostics (see Section 5).

Notes
• Of course, I_kt ≡ 1 for any EA for which we don't have the direct estimate. For EAs that do have a direct estimate, we can choose to declare I_kt = 1 to obtain the out-of-sample, full predictive distribution.
• There needs to be sufficient information in the training data (the EAs with I_kt = 0) to support fitting the specified model. And, even if the model is estimable, a posterior with high uncertainty for the β or the τ will produce broad predictions.
• The model must be specified to support predictions. For example, if you want to use waves (1, 2, 3) to predict wave 4 and you want to allow for a wave-specific intercept, you need a way to trend the wave (1, 2, 3) intercepts to wave 4. The base case of 'no change' is a single column of 1s in the X-matrix (µ_1 = µ_2 = µ_3 = µ_4). A linear trend is produced by two columns in the X-matrix, a column of 1s and a column (0, 1, 2, 3), producing µ_t = β_0 + β_1(t − 1), etc. Wave-specific intercepts use the full 4 degrees of freedom, the most directly interpretable version being to suppress the overall intercept and include 4 columns in the design matrix, with the t-th column having a 1 in the t-th location and 0s elsewhere.
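The intercept-trend options in the last note can be sketched as design-matrix columns; illustrative Python (the function name is ours):

```python
# design columns for wave-level intercept trends (waves 1..n_waves)
def intercept_columns(n_waves, kind):
    # 'constant':  single column of 1s      -> mu_1 = ... = mu_n
    # 'linear':    1s plus (0, 1, 2, ...)   -> mu_t = b0 + b1*(t - 1)
    # 'saturated': one 0/1 indicator column per wave
    if kind == "constant":
        return [[1] for _ in range(n_waves)]
    if kind == "linear":
        return [[1, t] for t in range(n_waves)]
    if kind == "saturated":
        return [[1 if s == t else 0 for s in range(n_waves)]
                for t in range(n_waves)]
    raise ValueError(kind)

print(intercept_columns(4, "linear"))   # [[1, 0], [1, 1], [1, 2], [1, 3]]
```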

Diagnostics
The full predictive distribution supports a wide variety of additional fit and performance assessments. If the modeling is correct, or at least reasonably so, then the ensemble P̃_kt^(ν), ν = 1, ..., M is an accurate depiction of the location, spread, shape, etc. of the full predictive distribution. If not, then the direct estimates P̂_kt will not look like draws from their respective computed predictive distributions. One measure of this departure: the collection of percentile locations will depart from U(0, 1), and so the inverse Gaussian (probit) transform will depart from a N(0, 1) distribution. For model diagnostics, the following should only be used for (k, t) pairs with I_kt = 1.

Prediction mean, variance and SD
Mean: the prediction mean is

Avelogistic_kt = P̃_kt^(•) = (1/M) Σ_ν P̃_kt^(ν) (the predicted prevalence).    (7)

Note 1: For EAs with I_kt = 1, Avelogistic_kt is the general version of the 'Avelogistic_k' of Section 2: the predicted value that conditions on information other than the direct estimate. Therefore, this general definition of Avelogistic is the appropriate X-axis in assessing fit of the (logistic) regression model coupled with the assumed model for association amongst the U-values and the τ_t. The full predictive distribution is appropriate for evaluating a wide variety of out-of-sample predictions, for example wave-4 direct estimates using wave (1, 2, 3) training data along with the Xs for wave 4 and a joint-distribution assumption on the Us (e.g., AR1). Performance can be compared for different sets of Xs, different assumptions on the relations among the Us, and among the τ_t.

Note 2: Avelogistic_kt for an EA with I_kt = 1 mixes over the conditional distribution [U_kt | U_ks, s ≠ t].
(a) For example, if the model being fit specifies that the U_kt are completely independent, then Avelogistic_kt for an EA with I_kt = 1 mixes over the prior distribution for that U_kt.
(b) If the model being fit specifies association among the Us (e.g., is spatial or autoregressive), then the posterior distribution 'learns' from the other U-values.

Note 3: For EAs with I_kt = 0, the posterior mean E_kt = (1/M) Σ_ν P_kt^(ν) conditions, in addition to the other conditioning, on the EA-specific direct estimate. The collection P_kt^(ν), ν = 1, ..., M provides the full posterior distribution. E_kt should not be used as the X-axis in a residual plot, but it is the Bayes posterior-mean estimate for EA k in wave t, and is the standard point estimate for comparing EAs, coloring maps, etc. The full distribution should be used for CIs (in Bayes-speak, 'credible intervals'), and the lengths of these can be used to color maps, etc.
Note 4: The full predictive distribution supports point estimates other than the predictive mean.
For example, in some applications the predictive median is more appropriate and in this case 'mean' should be replaced by 'median' in the foregoing Notes. More generally, pick your favorite one number summary (e.g., the 10% trimmed mean) and use it!

Mean, variance, SD of a residual
Equations (7) and (8) are used to compute the residual and the standardized residual; these should only be used for the (k, t) pairs with I_kt = 1 and, of course, they depend on availability of the direct estimate. The sample variance of the P̃_kt^(ν) is

V_kt = (1/M) Σ_ν (P̃_kt^(ν) − P̃_kt^(•))², with SD_kt = V_kt^{1/2}.    (8)

The residual and standardized residual are

R̂_kt = P̂_kt − P̃_kt^(•),  R*_kt = R̂_kt / SD_kt.    (9)

The residuals R̂_kt and R*_kt in equation (9) are based on the observed direct estimate (P̂_kt) and so measure discrepancy from the assumed model, with R*_kt calibrated by the standard deviation of the predictive distribution. If the direct estimates have close to a Gaussian distribution, then the R*_kt can be used to make residual plots, histograms, boxplots, Q-Q plots, etc. However, if the direct estimates are not close to Gaussian, use the percentile approach described in Section 5.3. In any case, don't use the R̂_kt for any diagnostics, because they haven't been calibrated by their standard deviations.

Using the full predictive distribution
The formulas in display (9) measure deviation in standard-deviation units, but other measures, less closely tied to the Gaussian distribution, are available. The following are effective diagnostics, but any computation using the ensemble that targets fit is 'legal.' The following should only be computed for (k, t) with I_kt = 1!

1. Find the percentile location of P̂_kt amongst the P̃_kt^(ν); denote it by ζ_kt and use, as the standardized residual,

R‡_kt = Φ^{-1}(ζ_kt).

To compute the percentile location it is important to move away from 0 and 1 and to account for ties. So compute

ζ_kt = [ #{P̃_kt^(ν) < P̂_kt} + ½(1 + #{P̃_kt^(ν) = P̂_kt}) ] / (M + 1).

Note that this ratio is strictly greater than 0 and strictly less than 1. Also, if all of the MCMC draws equal P̂_kt, then ζ_kt = ½ and the residual is 0, as it should be. If the predictive distribution is exactly Gaussian, these residuals are identical to the R*_kt, and in general they are less dependent on the Gaussian assumption.

For example, if the predictive distribution were a single binomial (not our case!), here are comparisons of R* and R‡ when the direct estimate is 0. The formulas are

R* = −p / [p(1 − p)/n]^{1/2},  R‡ = Φ^{-1}( (1 − p)^n / 2 ).

Of special note is that for small values of p, R‡ > R*; as p increases, the relation reverses for n = 25 (Table 1), but not for n = 5 (Table 2). Similar relations hold for smaller values of n. The R‡ residuals are more appropriate in that they pay attention to the details of the distribution. This benefit also applies when the approach is applied to the full predictive distribution, for example when producing residual Q-Q plots.

p      0.005   0.01   0.05   0.10   0.50
R*     −0.35  −0.50  −1.15  −1.67  −5.00
R‡     −0.15  −0.28  −1.09  −1.80  −5.54

Table 1: Residuals when the predictive distribution is Binomial(n = 25, p) and the direct estimate is 0. This is not our exact situation, because our full predictive distribution is composed of a sum of not necessarily identically distributed Bernoulli variates and it also includes uncertainty in the probability p.

p      0.005   0.01   0.05   0.10   0.50
R*     −0.16  −0.22  −0.51  −0.75  −2.24
R‡     −0.03  −0.06  −0.29  −0.54  −2.15

Table 2: As Table 1, but with a Binomial(n = 5, p) predictive distribution.

2. Box plots: and other outlier diagnostics using the R*_kt or R‡_kt.

3. Residual plots: with either R*_kt or R‡_kt on the Y-axis and the appropriate Avelogistic_kt on the X-axis. Importantly, these X-axis values should come from an MCMC run with I_kt = 1 (see Note 1 below equation (7)).

4. Q-Q plots: of the R*_kt or the R‡_kt against a Gaussian (normal) reference. This will be a good diagnostic but, because P̂_kt and the predictive distributions are computed in part from sums of 0/1 variables, even under the null hypothesis the distribution won't be exactly N(0, 1).
   i. Equivalently, a Q-Q plot of the one-sided P-values, computed using the R*_kt or directly using the ζ_kt, against a U(0, 1) reference.

5. Chi-square goodness of fit:

X² = Σ_{(k,t): I_kt = 1} (R*_kt)² when using (Observed − Predicted)/SD, or X²_‡ = Σ_{(k,t): I_kt = 1} (R‡_kt)² when using the percentile approach.

The exact df needs to be determined and will depend on whether the assessment is in- or out-of-sample, and on the correlation structure assumed for the U_kt. It is surely no greater than K, and for the AR1 or a spatial model considerably smaller.
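The percentile-location residual of item 1, including the tie adjustment and the closed-form Binomial comparison behind Tables 1 and 2, can be sketched in illustrative Python (the tie formula is our reconstruction, chosen to satisfy the properties stated in item 1):

```python
from statistics import NormalDist

def zeta(draws, p_hat):
    # tie-adjusted percentile location, strictly inside (0, 1)
    less = sum(d < p_hat for d in draws)
    ties = sum(d == p_hat for d in draws)
    return (less + 0.5 * (1 + ties)) / (len(draws) + 1)

def r_ddagger(draws, p_hat):
    # percentile-based standardized residual: probit of the percentile location
    return NormalDist().inv_cdf(zeta(draws, p_hat))

# all draws equal to the direct estimate -> zeta = 1/2, residual 0
print(r_ddagger([0.3] * 99, 0.3))        # 0.0

# closed-form check: predictive Binomial(n, p), direct estimate 0
def table_row(n, p):
    r_star = -p / (p * (1 - p) / n) ** 0.5           # (obs - mean) / SD
    z = 0.5 * (1 - p) ** n                           # P(X < 0) + (1/2) P(X = 0)
    return round(r_star, 2), round(NormalDist().inv_cdf(z), 2)

print(table_row(25, 0.10))   # (-1.67, -1.8), matching Table 1
```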

Additional summaries
Out-of-sample residuals are central to assessing the performance of a model, but they shouldn't be the only components of a report or an evaluation. Here are a few others, with no intention of providing a complete list.

Shrinkage plots
In addition to the residual plot in Section 5.3, two 'shrinkage plots' are informative.
Direct estimate to Bayes:
• Direct estimates plotted horizontally, sufficiently far above the X-axis.
• A whisker for each, proportional to the length of the 95%, binomial likelihood-based CI using the numerator and denominator of the direct estimate.
• Bayes estimates plotted horizontally on the X-axis.
• Lines connecting the Direct and the Bayes estimates.

(Direct − Avelogistic) to (Bayes − Avelogistic):
• (Direct − Avelogistic) plotted horizontally, sufficiently far above the X-axis.
• A whisker for each, proportional to the length of the 95%, binomial likelihood-based CI using the numerator and denominator of the direct estimate.
• (Bayes − Avelogistic) plotted horizontally on the X-axis.
• Lines connecting the two.
• Note: unlike in the Gaussian/Gaussian model, the signs of (Direct − Avelogistic) and (Bayes − Avelogistic) can differ; a line can cross 0. Crossing can occur when Avelogistic is sufficiently far from 0.5 and the Direct estimate is sufficiently close to Avelogistic.

The degree of shrinkage
The degree of shrinkage for EA kt is governed by the between-EA variance (τ²) relative to the variance of the direct estimate; it's all relative. For example, if τ² is small relative to the variance of a direct estimate (more properly, if the posterior distribution of τ² has most of its mass far below the variance of the direct estimate), then the regression model will get a lot of weight, even if the variance of the direct estimate is small. On the other hand, if the posterior distribution of τ² has most of its mass far above the variance of the direct estimate, then the regression model will get relatively little weight.
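The original display for the degree of shrinkage is not recoverable here; as an illustration only, the familiar Gaussian/Gaussian approximation gives weight B_kt = σ²_kt/(σ²_kt + τ²) to the regression fit, where σ²_kt is the sampling variance of the direct estimate. A sketch in Python (names and numbers are ours):

```python
# illustrative shrinkage weight on the regression (model) estimate,
# B = sigma2 / (sigma2 + tau2): the Gaussian/Gaussian approximation
def shrinkage_weight(sigma2_direct, tau2):
    return sigma2_direct / (sigma2_direct + tau2)

def shrink(direct, model_fit, sigma2_direct, tau2):
    # compromise estimate: pulled from the direct estimate toward the model fit
    b = shrinkage_weight(sigma2_direct, tau2)
    return (1 - b) * direct + b * model_fit

# noisy direct estimate (large sigma2) -> heavy shrinkage toward the model
print(round(shrink(0.40, 0.20, sigma2_direct=0.04, tau2=0.01), 3))
# precise direct estimate (small sigma2) -> little shrinkage
print(round(shrink(0.40, 0.20, sigma2_direct=0.001, tau2=0.01), 3))
```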

Reduction in uncertainty
Because the SD isn't the best summary for binomial and other non-Gaussian data, it is far better to compare the length of the properly computed, exact 95% CI associated with the direct estimate with the length of the 95% probability-content Bayes credible interval (provided by the MCMC output). Their ratio gives a good indication of the improved stability conferred by the Bayesian model. If the lengths of the CIs based on the direct estimates and the Bayes estimates are similar, then there has been no 'Bayes advantage' and, unless the direct estimates are all very stable (in which case there is no reason to stabilize them), it is worth looking for additional covariates (or transforms of, or interactions among, current covariates) that have predictive power and thereby shift the posterior distribution of τ² closer to 0. If there are no such covariates, then so be it; we need to live with what we have. I stress that comparisons should be on CI length because, while for Gaussian data it's equivalent to comparing SDs, for count data, especially when an estimate is near 0, they aren't equivalent.

Relatedly, as indicated in Section 5.4.1, use the exact 95% CI length for whiskers: for our shrinkage plots it is better to set the whiskers proportional to the length of the exact 95% CI associated with the direct estimate rather than proportional to the SD.

Model criticism
As in evaluating a standard (non-Bayesian) regression or logistic regression, we don't expect that the residuals will all be very close to 0, but for a good model we do expect that the standardized residuals will look reasonable relative to random variables that have mean 0 and variance 1 (not necessarily Gaussian). Ditto for the Bayesian approach; with the percentile method for residuals, the Zs should be close to Gaussian.

As with standard diagnostics, patterns matter as much as magnitude, and plotting standardized residuals (ideally the percentile-based ones) versus the relevant Avelogistic values is a good way to identify model lack of fit that might be reduced by including additional covariates, including carefully chosen interactions based on currently included covariates. The issues here are essentially identical to those in traditional modeling.

Traditional models use AIC, BIC, adjusted R², and other one-number summaries to assess fit. In addition, Bayesian models fit via MCMC support the DIC, which is interpreted in a manner similar to AIC and BIC.

Aggregation Diagnostics
Comparing aggregated posterior mean or median estimates to the direct estimate associated with the aggregated regions can help diagnose model inadequacy. Aggregation needs to be sufficient so that the aggregated direct estimates and the aggregated Bayes estimates are stable and can be trusted. Subject to this requirement, any aggregation is 'fair game' with the most spatially logical being to aggregate nested domains (e.g., EAs aggregated to regions, regions to the country). These comparisons can be 'decorated' with uncertainty estimates (see Section 5), but with sufficient aggregation, uncertainty will be relatively small.

Country-level aggregation
Recall that we use the shorthand Y_kt = Σ_{i=1}^{n_kt} Y_ikt, etc. Aggregating EAs to the country level is straightforward. For a fit assessment, compute a weighted average of the draws, producing

P_wt^(ν) = Σ_k w_kt P_kt^(ν),

and see where P̂_wt = Σ_k w_kt P̂_kt falls in that distribution. The w_kt need to be specified; use w_kt = n_kt/n_+t if the n_kt are proportional to the population size (i.e., the number of eligible women) of EA k; otherwise use weights based on the true population sizes. Section 6.2 provides additional details. This computation is equivalent to comparing P̂_wt to the aggregated E_wt = Σ_k w_kt E_kt. It produces a single number, but can be helpful.
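A sketch of the country-level aggregation check in illustrative Python (all data are simulated stand-ins; in practice the draws come from the MCMC output and the direct estimates from the survey):

```python
import random

rng = random.Random(3)
K, M = 5, 2000
n = [40, 60, 25, 80, 30]                      # EA sample sizes
w = [nk / sum(n) for nk in n]                 # weights proportional to size
true_p = [0.15, 0.25, 0.10, 0.30, 0.20]
# stand-in posterior draws of P_kt for each EA, centered near the truth
draws = [[min(max(rng.gauss(true_p[k], 0.03), 0.0), 1.0) for _ in range(M)]
         for k in range(K)]
# aggregate each MCMC draw to the 'country' level
agg = [sum(w[k] * draws[k][v] for k in range(K)) for v in range(M)]
p_hat_w = sum(w[k] * true_p[k] for k in range(K))  # weighted direct estimate (illustrative)
loc = sum(a < p_hat_w for a in agg) / M            # percentile location of the direct
print(0.0 < loc < 1.0)
```

An extreme percentile location (near 0 or 1) would flag an aggregate discrepancy between the model and the direct estimates.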

General aggregation
Let A indicate the EAs to be aggregated. That is, A is a list of subscripts {k_1, k_2, ..., k_|A|}, where |A| is the number of subscripts in A. Then compute

P_At^(ν) = Σ_{k∈A} w_kt P_kt^(ν) / Σ_{k∈A} w_kt,  P̂_At = Σ_{k∈A} w_kt P̂_kt / Σ_{k∈A} w_kt,

and use these as in Section 6.1. Compute the foregoing for a collection of As that partition the EA space (e.g., regions); look at patterns, etc. Any partition can be used, ideally motivated by substantive considerations such as aggregated urban and aggregated rural. The collection of these aggregations can be very helpful in diagnosing model inadequacies, especially when aggregation is sufficient so that the aggregated direct estimates are stable.

Benchmarking
The goal is to 'roll up' to a target prevalence, for example the directly estimated country prevalence. If the modeling is reasonably good, the rolled-up (usually weighted by EA sample size) Bayesian estimates should come close to the target. If not, either the target is inappropriate or the modeling wasn't very good. In any case, estimates can be adjusted (benchmarked) to produce the match. There are a variety of ways to force the Bayes estimates (the posterior means) to benchmark; this is the subject of a forthcoming section. Suffice it for now to say that there are two views on forcing a benchmark:
• Force benchmarking so that the estimates are 'face-valid' to stakeholders.
• Don't force benchmarking; notable discrepancies indicate model inadequacy and these should be remedied.
The most naive approach, and definitely not recommended, is to rake the estimates by applying a common factor to the EA-specific estimates (Bayes or otherwise). For example, if the rolled-up Bayes estimates are 1% higher than the target, divide each of them by 1.01 to guarantee the match. More appropriate is to optimize predictions subject to a (linear) benchmarking constraint (see Bell et al., 2013), replacing 'EA-specific estimates are the mean of the posterior distribution; they minimize posterior squared-error loss' by 'EA-specific estimates minimize posterior squared-error loss, subject to the linear constraint that they roll up to the target.' Though the foregoing is very appropriate for benchmarking posterior means, it can't deal with non-standard goals such as ranks, and it's not clear what benchmarking would mean in that context. However, recent work I'm doing with Beka Steorts (Duke) embeds the benchmarking in the full posterior distribution so that any quantity computed from it will be 'benchmarked.' Work on this idea continues.

Covariates

FP media exposure: whether she heard a family planning (FP) message on radio/TV or saw one in print in the past 12 months.

Fertility intention: whether she desires her next pregnancy 24 months or later.

Number of live births.

Distance: distance (km, log-transformed) to the nearest facility providing three or more modern contraceptive methods.

Covariate selection
Effective prediction depends on striking a balance between bias and stability. A saturated model can reduce bias but at the cost or risk of unstable predictions; a prediction based on too few covariates, or their transforms, is relatively stable but at the cost or risk of increased bias. This is particularly true in our study given sampling and measurement errors in the covariates. Our selection criteria for covariates are based on: (1)

Accounting for survey weight
While there are a variety of approaches for accommodating survey sampling weights in a frequentist analysis, including reciprocal propensity weighting (e.g., the Horvitz–Thompson approach) and case-specific propensity as a covariate, the latter is most directly implementable in a Markov chain Monte Carlo context [2, 3]. The goal is to build a model wherein the sampling process is 'ignorable', i.e., one in which the analysis includes all variables that affect the probability of a person being included in the sample, thereby accommodating the weights when estimating the fixed effects.
We evaluated the impact of including the sampling weight as a model covariate. Including both linear and quadratic terms did not appreciably change the population estimates and predictions compared to non-inclusion. Therefore, in the spirit of parsimony, our estimates are based on models that do not include the sampling weight as a covariate. We account for the sampling weight in the post-estimation aggregation from the individual level to EA-level, regional, and national estimates.

Model checking and assessment
We use several methodological approaches to assess the predictive performance of our model. Within-sample assessments are more optimistic than out-of-sample ones because the former use the same data for fitting and evaluation. However, the mCPR outcome in this study does not have a Gaussian distribution, and we therefore developed a new diagnostic measure, the Z-value. It indicates the percentile location of the direct estimate amongst its predictive distribution from the BHM. In general, the Z-value is less dependent on the Gaussian assumption, and it is identical to the standardized residual if the predictive distribution is exactly Gaussian. In this study the Z-value is more appropriate than the standardized residual because the mCPR values for many small areas are very low and could be far from Gaussian.
See webappendix (pp) for additional details.

[Figure: each data point is an area-round (e.g., region or county 1 in round 1; region or county 1 in round 2).]

Z-value vs Avelogistic plots
The data points represent area-rounds because the model uses area-level random effects; the data points in the right panel represent specific EA-rounds because our study interest is regional estimates. The lack of a pattern in the left panel's graphs indicates that our model performs equally well across low-, middle- and high-mCPR areas.

Priors for model parameters
The following priors are used in the study; this information has also been added to the appendix.

[Shrinkage plots:] Shrinkage of the direct-minus-avelogistic residual toward the Bayesian-minus-avelogistic residual indicates the extent to which the Bayesian estimates achieve a balance between information from women's direct reports of contraceptive use (i.e., the direct estimates) and pure model-based prediction (i.e., the avelogistic estimates). The balance is largely determined by the accuracy of the direct estimates: estimates with longer whiskers tend to shrink more and have larger variation in random effects. The plots show that our BHMs achieve a balance between direct and pure model-based estimates.