Study area and data
The study focused on Malawi and used the standard and nationally representative 2010 Malawi Demographic and Health Survey (MDHS) data. The MDHS data was downloaded from the DHS website (http://www.measuredhs.com/login.cfm) after being granted permission. The sampling design was a two stage cluster design with stratification. The primary sampling units were the enumeration areas (EAs), and the secondary sampling units were the households. EAs were stratified in terms of rural and urban. A total of 849 EAs were sampled with 158 in urban areas and 691 in rural areas. A representative total sample of 27345 households was selected for the 2010 MDHS survey. Data collection was by questionnaires. There were three questionnaires, women, men and household questionnaire. Households that were successfully interviewed were 24825, yielding a response rate of 98%. Eligible women that were successfully interviewed were 23020, yielding response rate of 97%. Eligible men that were successfully interviewed were 7175, yielding a response rate of 92%. The data set that was used in this study was child record data set which was based on women and household questionnaire. The child record data set had a total of 19967 children records. The following exclusion criteria based on 2010 MDHS report [3] and MDHS guide to statistics [27] was used to have the final sample for children. Children whose mothers were not listed in the household questionnaire were not included. All children records where haemoglobin level was missing were dropped. The missing covariate values were left unremoved. The final sample size of children was thus 4177.
Data management in terms of extracting and generation of variables from child record data set was done in STATA version 12. Data variables used in this study were based on the variables used in previous studies on childhood anaemia. Response variable in the extracted data set was child anaemia status based on the categorization of child altitude adjusted haemoglobin level. Child anaemia status was a binary variable based on the cut off point of 11Hb. Children whose haemoglobin level was less than 11Hb were taken as anaemic and not anaemic otherwise. The cut off point used in classifying child anaemia into two categories was based on 2010 MDHS report. The covariates in the generated data set were mother education level, family wealth index, child cough, child fever, receiving vitamin A, mother anaemia status, stunting, wasting, underweight, child birth weight, child birth order, house hold size, child age in months, mother age in years, whether child ate meat in previous one month or not, breast feeding in months and district of the child. Child age in months, mother age in years and breast feeding in months were continuous covariates. Stunting, wasting and underweight were based on categorization of height for age, weight for height, and weight for age z-scores respectively using z-score −2 as cut off point. District of the child was labelled s
i
ϵ(1, 2, 3,.., S) where the label was corresponding to label on the map.
Statistical analysis
Univariate logistic regression was performed in STATA statistical software, version 12 to select potential factors of childhood anaemia. Covariates that were associated with anaemia at significance level of 20% were incorporated in the multiple regression models. The significant level of 20% rather than 5% was used in selecting covariates for multiple regression analysis so as to allow more potential covariates to be selected. Two way cross tabulation was then performed in STATA statistical software, version 12 to find percentage distribution of childhood anaemia per district and per covariate categories. Percentages were weighted using the sampling weight to ensure representative sample. The two way cross tabulation with Pearson chi-square (\( {\mathcal{X}}^2 \)) test was used to compare groups of categorical variables.
Four multiple logistic models were then fitted using R2BayesX package in software R using child anaemia status as a response. More formally, considering child anaemia status being binary, in this case child anaemia status being distributed as Bernoulli (p
ij
) where p
ij
is the probability of child j being anaemic in location i, the following models were fitted.
Model 1: \( \mathrm{logit}\left({p}_{ij}\right)={w}_i^T\gamma \)
Model 2: \( \mathrm{logit}\left({p}_{ij}\right)={w}_i^T\gamma +{f}_1\left({x}_{i1}\right)+{f}_2\left({x}_{i2}\right)+\dots +{f}_p\left({x}_{ip}\right) \)
Model 3: \( \mathrm{logit}\left({p}_{ij}\right)={w}_i^T\gamma +{f}_{spat}\left({s}_i\right) \)
Model 4: \( \mathrm{logit}\left({p}_{ij}\right)={w}_i^T\gamma +{f}_1\left({x}_{i1}\right)+{f}_2\left({x}_{i2}\right)+\dots +{f}_p\left({x}_{ip}\right)+{f}_{spat}\left({s}_i\right) \)
Model 1 was a fixed effects variable model where all variables, categorical and continuous were modelled as fixed effects. In Model 2, categorical variables were modelled as fixed effects and continuous variables were modelled non parametrically by smooth function f
j
s. In Model 3 all covariates were modelled as fixed effects and district of the child was modelled as a spatial effect. Model 4 was an extension of Model 2 by including a spatial component. In the models, the smooth functions f
j
were specified as Bayesian splines. According to [28], this assumes approximating f
j
by polynomial splines of degree l defined at equally spaced knots \( {x}_j^{min}={\zeta}_{j0},{\zeta}_{j1}, \dots, {\zeta}_{js}={x}_j^{max} \) which are within the domain of the covariate x
j
. The Bayesian spline can be written as a linear combination of d = s + l basis functions, B
m
, that is,
$$ {f}_j\left({x}_j\right)\kern0.5em =\kern0.5em {\displaystyle {\sum}_{m=1}^d{\varepsilon}_{jm}{B}_m\left({x}_j\right)} $$
(1)
Now Bayesian estimation of the penalized spline (1) is equivalent in estimating model parameters ε
j
= (εj,1, εj,2, … , εj,m) where first or second order random walk priors for the regression coefficients are assigned. A first order random walk prior for equidistant knots is given by: εj,m = εj,m − 1 + uj,m where m = 2, 3, …, d, and a second order random walk prior for equidistant knots is given by: εj,m = 2εj,m − 1 + εj,m − 2 + uj,m where m = 3, 4, …, d and \( {u}_{j.m}\sim N\left(0,\ {\tau}_j^2\right) \) are random errors. The spatial effect was modelled by the tensor product of two dimensional spline defined as
$$ {f}_{spat}\left({x}_1,{x}_2\right)={\displaystyle {\sum}_i^k{\displaystyle {\sum}_j^k}}{B}_{spat,\ ij}{B}_{1i}\left({x}_1\right){B}_{2j}\left({x}_2\right) $$
(2)
where (x1, x2) refers to the coordinates of the location of the data point, latitude and longitude, or location centroids based on the map. The prior for Bspat, ij = (Bspat,11, Bspat,12, …, Bspat,kk) is based on spatial smoothness priors common in spatial statistics (see [29]). The most commonly used prior specification based on the four nearest neighbours is defined as:
$$ {B}_{spat,\ ij}\Big|.\sim N\left({B}_{spat,i-1\ j}+{B}_{spat,i+1, j}+{B}_{spat,i,j-1}+{B}_{spat,i,j+1},\frac{\tau_{ij}^2}{4}\right) $$
for i, j = 2, …, k − 1 with appropriate changes for corners and edges. Since model estimation was by empirical Bayesian method, all variance parameters were treated as unknown constants that were estimated by restricted maximum likelihood (REML) method and hence their priors were not given. The fixed effects were assigned diffuse priors. An advantage of the empirical Bayesian inference over full Bayesian inference is that questions about the convergence of MCMC samples or sensitivity on hyper parameters do not arise [30]. Further more, a comparison of full Bayesian and empirical Bayesian approach in a simulation study, has shown empirical Bayesian approach yielding somewhat better point estimates, especially for Bernoulli distributed responses (see [31]).