Estimation and probabilistic projection of levels and trends in the sex ratio at birth in seven provinces of Nepal from 1980 to 2050: a Bayesian modeling approach

Background The sex ratio at birth (SRB; ratio of male to female births) in Nepal has been reported around the normal level on the national level. However, the national SRB could mask the disparity within the country. Given the demographic and cultural heterogeneities in Nepal, it is crucial to model Nepal SRB on the subnational level. Prior studies on subnational SRB in Nepal are mostly based on reporting observed values from surveys and census, and no study has provided probabilistic projections. We aim to estimate and project SRB for the seven provinces of Nepal from 1980 to 2050 using a Bayesian modeling approach. Methods We compiled an extensive database on provincial SRB of Nepal, consisting 2001, 2006, 2011, and 2016 Nepal Demographic and Health Surveys and 2011 Census. We adopted a Bayesian hierarchical time series model to estimate and project the provincial SRB, with a focus on modelling the potential SRB imbalance. Results In 2016, the highest SRB is estimated in Province 5 (Lumbini Pradesh) at 1.102, corresponding to 110.2 male births per 100 female births, with a 95% credible interval (1.044, 1.127) and the lowest SRB is in Province 2 at 1.053 (1.035, 1.109). The SRB imbalance probabilities in all provinces are generally low and vary from 16% in Province 2 to 81% in Province 5 (Lumbini Pradesh). SRB imbalances are estimated to have begun at the earliest in 2001 in Province 5 (Lumbini Pradesh) with a 95% credible interval (1992, 2022) and the latest in 2017 (1998, 2040) in Province 2. We project SRB in all provinces to begin converging back to the national baseline in the mid-2030s. By 2050, the SRBs in all provinces are projected to be around the SRB baseline level. Conclusions Our findings imply that the majority of provinces in Nepal have low risks of SRB imbalance for the period 1980–2016. However, we identify a few provinces with higher probabilities of having SRB inflation. The projected SRB is an important illustration of potential future prenatal sex discrimination and shows the need to monitor SRB in provinces with higher possibilities of SRB imbalance. Supplementary Information The online version contains supplementary material available at 10.1186/s12889-022-12693-0.

summarizes the matching of the 75 Nepal districts to provinces. The Nawalparasi district is splitted into Province 4 and 5. We classify it in Province 5 because it is geographically closer. Rukum district is split into Province 5 and 6. In NDHS 2016, it is considered as in Province 6 and we follow this classification in our study.  Table 1: Nepal districts and provinces matching. The red numbers at the beginning of each cell refers to the number of districts belong to each province. Currently, four province-level parliaments have decided the names of their provinces. In this study, we use Province 1 to 7 to label the seven provinces and provide the province names if available.

Sampling errors for DHS data
NDHS provide individual-level data with the full birth history for each women at reproductive age interviewed during the survey fieldwork period. We calculate the sampling error for log-transformed SRB for NDHS data series using the jackknife method [1,2,3]. For a certain NDHS, let U denote the total number of clusters. The u-th partial prediction of SRB is given by: where n indexes the live births in each state-survey-year, N is the total number of live births. x n is the sex for the n-th live birth. d n is the cluster number for the n-th live birth. w n is the sampling weight for the n-th live birth. I n (·) = 1 if the condition inside brackets is true and I n (·) = 0 otherwise. The u-th pseudo-value estimate of the SRB on log-scale is: The sampling variance is: For NDHS data, the annual log-transformed SRB observations are merged such that the coefficient of variation (CV) for log-transformed SRB is below 0.05 or the merged period reached 10 years [4]. For a certain NDHS data series, let {t n , t n−1 , · · · , t 1 } be the years with recorded births from recent to past. The merge starts from the most recent year t n : Merging process for NDHS data 1: for t ∈ {t n , t n−1 , · · · , t 1 } do 2: if t = t n then 3: Compute σ as explained in above. Compute CV= σ/ log(r) * u 4: if CV < 0.05 or t n − t n−1 > 1 or t n+1 − t n = 10 then 5: stop and move to the previous time point 6: else 7: Repeat step 3-5 based on births from t n and t n−1 The above procedures of computing sampling error and merging observation periods are performed for each NDHS.

TFR Data by Nepal Province
We compile the total fertility rate (TFR) by Nepal Province from NDHS 2001, 2006, 2011 and 2016 using the R-package DHS.rates [5,6].
For the NDHS 2001, we adjust the microdata for all-women factor since the survey only included evermarried females and those who never got married by the time of the survey interview were not included [7]. We extract from the microdata of the "Household Member Recode" for NDHS 2001. The adjustment is done by multiplying the sample weight of each responded female by the all-women factor AW p,a for Province p and her age a by the time the survey was conducted.
Specifically, in each Nepal Province p ∈ {1, · · · , 7} (by matching the districts to province based on Table 1) for each reproductive age a ∈ {15, · · · , 49}, we compute the total number of female household members A p,a for province p at age a: where H p is the total number of household members in Nepal Province p, v p,h is the sex of the h-th household member in Province p, y p,h is whether the household member slept at home the day before the survey interview conducted and age p,h is the age of the household member. Then we compute the number of ever-married female E p,a in Nepal Province p at age a: E p,a = Hp h=1 I(v p,h = female)I(y p,h = sleep at home)I(age p,h = a)I(m p,h = ever married), where m p,h is the marital status and we select those household members who are ever-married. When there is no ever-married women for a certain age a in Province p, i.e. when E p,a = 0, we merge the number of ever-married women with the number of one year older until the merged number is positive. Hence, Subsequently, we merge A p,a using the same merged ages that we did for E p,a in order to obtain the corresponding A * p,a . Finally, the all-woman adjustment factor for Province p female age a is: The TFR for the period beyond 2016 is based on the medium fertility scenario of population projections of 753 municipalities of Nepal, which was an update to the earlier projection [8]. We then aggregate the municipality births and women's years of exposure by province to generate the provincial TFR. Table 2 summarizes the notations and indexes used in this paper. N (µ, σ 2 ) refers to a normal distribution with mean µ and variance σ 2 . t 3 (µ, σ 2 ) refers to a Student-t distribution with degrees of freedom 3, mean µ and variance σ 2 . U(a, b) denotes a continuous uniform distribution with lower and upper bounds at a and b respectively.

Model for Sex Ratio at Birth by Nepal Province
The model is largely based on the model described in [9,12]. In this study, we made a few modifications in the model to better address the data quality and availability of provincial SRB in Nepal. The outcome of interest Θ p,t , the SRB in Nepal Province p in year t is modeled as: b = 1.049 is the SRB baseline level for the entire Nepal. The Nepal SRB baseline b is estimated based on national SRB observations from Nepal before reference year 1970 [10,11]. Φ p,t follows an AR(1) times series model on the log scale to capture the natural fluctuations of SRB within each province over time. ρ = 0.9 and σ = 0.004 based on previous study [10,11]. For more detailed motivations of the setup for Φ p,t , please refer to [9,10]. The province-year-specific multiplier for capturing the natural fluctuation in SRB around the national baseline b for Nepal Province p in year t. α p,t

Symbol
The SRB imbalance for Nepal Province p in year t. t 0p The start year of the SRB inflation for Nepal Province p. δ p The indicator of the presence (δ p = 1) or absence (δ p = 0) of SRB inflation in Nepal Province p. ξ p The maximum level of the SRB inflation for Nepal Province p. λ 1p The period length for the stage of increase of the sex ratio transition for Nepal Province p. λ 2p The period length for the stage of stagnation of the sex ratio transition for Nepal Province p. λ 3p The period length for the stage of decrease back to national SRB baseline of the sex ratio transition for Nepal Province p.

Known Quantities r i
The i-th SRB observation.
The sampling error for the i-th SRB observation as computed in Section 1.2.
The year in which the total fertility rate (TFR) in Nepal Province p declines to 2.6 children per woman [9]. b The baseline level of SRB for the whole Nepal [10], where b = 1.049. ρ The autoregressive indicator for Φ p,t , where ρ = 0.9 [10,11]. σ The SD of distortion parameter for Φ p,t , where σ = 0.004 [10,11]. Instead of estimating ρ and σ , we use the estimated values from prior study [10,11]. The rationale for this model setup is that ρ and σ reflect the global experience of the temporal fluctuation under natural circumstances. These parameter were estimated based on all available SRB observations without risk of having SRB imbalance before reference year 1970 in [10,11]. The studies were based on an extensive national SRB database and the estimates of the parameters ρ and σ are robust. For Nepal, there are no observed provincial SRB before 1970 (the year in which the sex selection technology started to become available). Hence, all the SRB observations by Nepal province are at risk of SRB imbalance and these data points are considered not suitable to estimate ρ and σ .
δ p is the binary identifier of the sex ratio transition, following a Bernoulli distribution: The logit transformation ensures that the probability parameter π p lies in the interval [0, 1]. The logittransformed π p follows a hierarchical normal distribution with a global mean and variance µ π and σ 2 π . α p,t refers to the province-specific SRB imbalance process, and is modeled by a trapezoid function to represent the increase, stagnation, and decrease of the transition stages: The start year of SRB inflation t 0p incorporates the fertility squeeze effect by using the data of total fertility rate (TFR). t 0p is modeled with a Student-t distribution with degrees of freedom 3 with mean at x p which indicates the year in which the TFR in Nepal Province p declines to 2.6 [9]. We use the TFR 2.6 to determine the mean of the distribution for t 0p because this is the estimated TFR for which Nepal may start the sex ratio transition [9]. The low degrees of freedom for the Student-t distribution is needed to capture higher uncertainty at the tails for the distribution of t 0p . Please refer to [9] for more details on the model setups for α p,t .

Data quality model
where i indexes all the SRB observations across provinces over time. r i is assumed to follow a normal distribution on the log scale with mean at log(Θ p[i],t[i] ) (explained as above) and variance at σ 2 i : σ 2 i is the sampling error variance for log(r i ) which reflects the uncertainty associated with log-scaled SRB observations due to survey sampling design. σ 2 i is known and are calculated as explained in Section 1.2. We impute the sampling error variance for observations from 2011 Census as the median sampling error variance of all the other observations.

Priors
Informative priors Informative priors are assigned to province-level parameters related to the sex ratio transition: the maximum level of SRB inflation ξ p , the period lengths for the stages of increase, stagnation and decrease as λ 1p , λ 2p and λ 3p respectively. The means of the prior distributions are from the systematic study [9] which modeled the sex ratio transition for multiple countries including Nepal. The standard deviations of the prior distribution is set such that the coefficient of variation (CV; defined as the ratio of mean to standard deviation) is 0.1. The informative priors could assist in modeling the sex ratio transition on the province level in Nepal by making use of the corresponding information on the national level. The usage of such informative priors is especially helpful in modeling subnational SRB imbalance in countries with great demographic heterogeneity [13,14]. For p ∈ {1, · · · , 7}: where µ ξ = 0.06, µ λ1 = 11.9, µ λ2 = 7.6, µ λ3 = 16.2, and CV= 0.1.
Vague priors Vague priors are assigned to parameters related to the indicator for detecting the existence of a sex ratio transition and the SD of start year: where U(a, b) indicates a continuous uniform distribution with lower and upper bounds at a and b respectively.

Statistical Computing
We obtained posterior samples of all the model parameters and hyper parameters using a Markov chain Monte Carlo (MCMC) algorithm, implemented in the open source softwares R 3.6.1 [15] and JAGS 4.3.0 [16], using R-packages R2jags [17] and rjags [18]. Results were obtained from 10 chains with a total number of 5,000 iterations in each chain, while the first 1,000 iterations were discarded as burn-in. After discarding burn-in iterations and proper thinning, the final posterior sample size for each parameter by combining all chains is 25,000. Convergence of the MCMC algorithm and the sufficiency of the number of samples obtained were checked through visual inspection of trace plots and convergence diagnostics of Gelman and Rubin [19], implemented in the coda R-package [20].

Model Validation
We assess the inflation model performance via two approaches: 1) out-of-sample validation; and 2) oneprovince simulation.

Out-of-Sample Validation
We leave out 20% of the data points since the data collection year 2016 instead of reference year, which has been used in assessing model performance for demographic indicators largely based on survey data [21,22,23,24]. After leaving out data, we fit the model to the training data set, and obtain point estimates and credible intervals that would have been constructed based on available data set in the survey year selected. We calculate median errors and median absolute errors for the left-out observations, where errors are defined as: e j = r j − r j , where r j refers to the posterior median of the predictive distribution based on the training data set for the j-th left-out observation r j . Coverage is given by For the point estimates based on full data set and training data set, errors for the true level of SRB are defined as e(Θ) p,t = Θ p,t − Θ p,t , where Θ p,t is the posterior median for Province p in year t based on the full data set, and Θ p,t is the posterior median for the same province-year based on the training data set. Similarly, the error for the sex ratio transition process with probability is defined as e(αδ) p,t = α p,t δ p − α p,t δ p . Coverage is computed in a similar manner as for the left-out observations, based on the lower and upper bounds of the 95% credible interval of Θ p,t from the training data set.

One-Province Simulation
We assess the inflation model performance by one-province simulation. For each of the 7 Nepal Provinces, we consider all data points as test data and simulate the SRB using the posterior samples of the global parameters from the sex ratio transition model (obtained using the full dataset).
The g-th simulated SRB Θ p are simulated to refer to a "new" province, without taking into account any province-specific data, following the model specification for these parameters. α p,t and δ p are simulated using the posterior samples of all parameters and hyper-parameters related to them. After generating the simulated values, we calculate the same set of results as described in Section 3.1 on out-of-sample validation. Table 3 summarizes the results related to the left-out SRB observations for the out-of-sample validation exercise and the one-country simulation. Median errors and median absolute errors are very close to zero for left-out observations. The coverage of 95% and 80% prediction intervals are symmetrical and more conservative than expected. The wider-than-expected prediction interval for left-out observations are mainly due to the greater uncertainty associated in more recent observations. The proportions of observations that fall below the prediction intervals constructed based on the one-country simulation are reasonable, given that the average number of observations fall below falling outside their respective bounds is at most 1.1. Table 4 shows results for the comparison between model estimates obtained based on the full dataset and based on the training set for the out-of-sample validation exercise. We look at the model estimates for the true SRB Θ p,t and the inflation process with country-specific probability δ p α p,t . Median errors and the median absolute errors are close to zero.

Validation and Simulation Results
In summary, the validation results indicate reasonably good calibrations and predicting power of the inflation model with conservative credible intervals.    Table 4: Validation results for estimates based on training set. Error is defined as the differences between a model estimate (i.e. Θ p,t or δ p α p,t ) based on full dataset and training set. The proportions refer to the proportions (%) of countries in which the median estimates based on the full dataset fall below or above their respective 95% and 80% credible intervals based on the training set.

Analysis for start year distribution
We assessed the choice of distribution of the start year of SRB inflation (Section 2.2, Eq. 11) and its effect on the province-specific start year parameter t 0p and final outcome of interest SRB Θ p,t . We conducted an alternative approach to estimate the start year parameter t 0p : where U(a, b) indicates a continuous uniform distribution with lower and upper bounds at a and b respectively. Comparing with Eq. 11, instead of fixing the degrees of freedom of the Student-t distribution at 3 (i.e. ν = 3), we assign a uniform prior to it.
The start year parameters t 0p are not sensitive to the choice of ν as shown in Table 5. The median estimates of t 0p are the same in most provinces between the two model assumptions on ν. The 95% credible intervals are also very similar to each other for all provinces.
The final outcome of interest SRB Θ p,t is also not sensitive to the choice of ν, as shown in Table 6. The average difference in Θ p,t over the full period 1980-2050 between ν = 3 and ν ∼ U(1, 50) are not statistically significantly different from zero across all provinces. The median differences are very close to zero as well. The same conclusion can be drawn for the estimation period 1980-2016 and the projection period 2016-2050. Figure 1 illustrates the comparison of the model fittings for SRB for the seven provinces based on ν = 3 and ν ∼ U (1, 50). The median SRB and 95% credible intervals are almost identical for most provinces. Most of the observable differences (none is statistically significantly different from zero) in the median SRB are during the projection period after 2016.

SRB in Province 7
Sex Ratio at Birth

Analysis for sex ratio transition variances
We assessed the effect of the sex ratio transition variances, by setting the CV value (explained in Section 2.4, Eq. 13-Eq. 16), on the estimates and 95% credible bounds for the outcome of interest Θ p,t and parameters that are related to the sex ratio transition. Table 7 shows the average difference in Θ p,t , the SRB, over the period of 1980-2050 between other model assumptions (i.e. CV= 0.2 and CV= 0.5) and the reference level (i.e. the final model setting with CV= 0.1). Across all seven provinces with p ∈ {1, ..., 7}, all differences are close to zero and none of the differences are statistically significantly different from zero. Similarly, when looking at the differences in Θ p,t during the estimation period 1980-2016 in Table 8, and the differences during the projection period 2016-2050 in Table 9, they are all close to zero and none of the differences are statistically significantly different from zero. Figure 2 illustrates the SRB model results based on different CV values assigned to the informative priors (explained in Section 2.4). The median estimates and projections of SRB are very similar when CV is 0.1 and 0.2. When CV= 0.5, the projected sex ratio transition is milder and the projected SRB is less deviated from the national baseline for all provinces except for Province 5. The deviation become more pronounced after 2016, which is the start of the projection period. As the CV increases, the credible intervals become wider. Table 10, 11, 12, and 13 present the model results of the province-specific parameters ξ p (maximum SRB inflation), and λ 1p , λ 2p , λ 3p (period lengths of increase, stagnation, and decrease of sex ratio transition process) based on different CV's used in their prior distributions. In general, the median estimates for all parameters based on different CV's are very similar. The 95% credible intervals of these parameters, however, increase as the CV increases. This is mainly because few data points of Nepal provincial SRB indicate a past or ongoing sex ratio transition. Consequently, the model suggest that the sex ratio transition process will mostly happen in the projection period.

Analysis for time series model
We assessed the model choice for Φ p,t , the province-year-specific factor that captures the year-by-year natural fluctuation within each province. We used AR(1) to model Φ p,t (Eq. 2-Eq. 4). As explained in Section 2.2, we fixed AR(1) parameters. Hence, in order to fully assess the model choice for Φ p,t , we compare the Φ p,t based on AR(1) and AR(2) by modeling all parameters related to the time series model. The AR(1) model with ρ and σ not fixed: The AR(2) model for Φ p,t with ρ 1 , ρ 2 , and σ not fixed: where for causality the roots of the AR(2) polynomial function g(z) = 1 − ρ 1 z − ρ 2 z 2 all lie within the unit circle. Figure 3 illustrates the modeling fittings for Φ p,t , based on (i) AR(1) model and fixing ρ and σ ; (ii) AR(1) model and not fixing ρ and σ ; and (iii) AR(2) model and not fixing ρ 1 , ρ 2 , and σ . In general, the model results for Φ p,t based on settings of (ii) and (iii) are very similar in both median estimates and 95% credible intervals. Hence, given that Φ p,t is not sensitive to the choice of AR(1) or AR (2), if all parameters related to Φ p,t were estimated, we use the simpler model AR(1).

SRB in Province 7
Sex Ratio at Birth AR(1); fix ρ,σ ε AR(1); model ρ,σ ε AR(2); model ρ 1 ,ρ 2 ,σ ε Figure 3: Φ p,t model results based on AR(1) and AR (2). The median and 95% credible intervals of the province-specific SRB are in curves and shades. Figure 4: SRB estimates and projections by Nepal Province, 1980-2050. The red line and shades are the median and 95% credible intervals of the province-specific SRB. The green horizontal line refers to the SRB baseline for the whole Nepal at 1.049 [10]. SRB observations are displayed with dots and observations are connected with lines when obtained from the same source. Shaded areas around observation series represent the sampling variability in the series (quantified by two times the sampling standard errors). The blue squared dots are total fertility rate (TFR) extracted from NDHS. The median estimates of start year and end year of SRB inflation are indicated by vertical lines. The TFR value in the start year is shown.  The end of Figure 4.