Global measurement of intimate partner violence to monitor Sustainable Development Goal 5

Background: One third of women experience intimate partner violence (IPV) and potential sequelae. Sustainable Development Goal (SDG) 5.2, to eliminate violence against women, including IPV, compels states to monitor such violence. We conducted the first global measurement-invariance assessment of standardised item sets for IPV.
Methods: Demographic and Health Surveys (DHS) from 36 Lower-/Middle-Income Countries (LMICs) administering 18 IPV items during 2012–2018 were included. Analyses were performed separately for two item sets: lifetime physical IPV (seven items) and controlling behaviours (five items). We performed country-specific exploratory and confirmatory factor analyses (EFA/CFA). Datasets meeting benchmarks for acceptable item loadings and model-fit statistics were included in multiple-group CFA (MGCFA) to test for exact measurement invariance. Based on findings, alignment optimization (AO) was performed to assess approximate measurement invariance (< 25% of model parameters non-invariant). For each item set, national rankings based on AO-derived scores and on prevalence estimates were compared. AO-derived scores were correlated with type-specific IPV prevalences to assess correspondence.
Results: National rates of physical IPV (5.6–50.5%) and controlling behaviour (25.9–84.7%) varied. For each item set, item loadings and model-fit statistics were adequate in country-specific, unidimensional EFAs and CFAs. Both unidimensional constructs lacked exact invariance in MGCFA but achieved approximate invariance in AO analysis (12.3% of model parameters for physical IPV and 6.7% for controlling behaviour non-invariant). For both item sets, national rankings based on AO-derived scores were distributed similarly to rankings based on prevalence. However, estimates often were not significantly different cross-nationally, precluding national-level comparisons regardless of estimation strategy. Three physical-IPV items (slap, twist, choke) and two controlling-behaviour items (meet female friends; contact with family) warrant cognitive testing to improve their psychometric properties. Correlations of AO-derived scores for physical IPV (0.48–0.66) and controlling behaviours (0.49–0.87) with prevalences of lifetime physical, sexual, and psychological IPV as well as controlling behaviour varied.
Conclusions: Seven DHS lifetime physical-IPV items and five DHS controlling-behaviour items were approximately invariant across 36 LMICs spanning five world regions, such that cross-national comparisons of factor means are reasonable. Measurement-invariance testing over time will inform their utility to monitor SDG5.2.1; cross-national, cross-time measurement-invariance testing of improved sexual and psychological IPV item sets is needed.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12889-022-12822-9.

Given the health, social, and economic costs of IPV, United Nations' bodies, treaties, and declarations have called for better statistics on the nature, prevalence, causes, and consequences of violence against women as a basis for its elimination [11]. This pressure led, in 2015, to Sustainable Development Goal (SDG) 5.2, which urges governments to "eliminate all forms of violence against all women and girls in public and private... " [12]. Widespread endorsement of SDG5.2 compels national governments to measure and to report rates of violence against women, including IPV (SDG5.2.1).
The decades leading up to SDG5.2 saw marked growth in the number of IPV prevalence surveys. These surveys relied on diverse scales and data-collection approaches [13], from small-scale, localised research, to large multicountry studies [14,15], and ongoing surveillance of IPV in multipurpose national surveys. No gold standard exists for data collection on IPV, but the Centers for Disease Control and Prevention [16], World Health Organization (WHO) [17], and Demographic and Health Surveys (DHS) [18] have agreed best practices. These practices include direct inquiry about acts experienced within a clear timeframe; the use of multiple, behaviourally-specific questions to capture reported experiences of specific acts of IPV; reliance on appropriately trained interviewers; and support for respondents and interviewers [11].
The DHS domestic violence module (DVM) is the most commonly administered module that follows these best practices to measure IPV at the national level in lower- and middle-income countries (LMICs). The DHS is a flagship project of the United States Agency for International Development (USAID), which has invested several hundred million dollars in data collection since 1984 [19] and is a critical source of population and health data for LMICs [20]. The DHS DVM is optional; however, by the end of 2020, 65 countries had administered it at least once, and 39 countries had administered it more than once (range: 1-9 times) [18], documenting large differences in national IPV prevalence [1].
While the DHS is used to inform policies, prevention efforts, and response interventions, the DHS DVM has not undergone a rigorous psychometric assessment. It, therefore, is unknown whether questions in the module are measurement invariant across countries on a global scale, a critical precondition for national comparisons. Research by members of this team on DHS questions about the acceptability of IPV showed modest non-comparability of prevalence estimates across countries due to module-design factors, such as slight differences across surveys in the number, wordings, and introductory framing of the questions [21]. If not identified and accounted for, areas of non-comparability may distort estimated differences in national IPV prevalence [21], with potential implications for national policies and the allocation of resources for prevention and response [22]. Addressing this knowledge gap now is critical, since the number of countries monitoring IPV will only increase with SDG5.2.
The objective of this paper is to perform the first comprehensive, global psychometric assessment of items developed to measure IPV in the DHS DVM. Using 36 national surveys conducted in LMICs during 2012-2018, we focused our main analysis on the item sets designed to measure lifetime physical IPV (seven items) and controlling behaviours (five items). The larger numbers of items in both sets made them more likely to be content valid, and violence researchers consider the physical IPV items to be more behaviourally specific and reliable than the psychological or sexual IPV items [23]. Our use of data from the DHS, the most geographically diverse source of nationally-representative data on IPV using similarly worded questions, enables us to make evidence-based recommendations that are global in scope across LMICs. Our findings inform next steps in a global research agenda to improve measures of IPV to monitor SDG5.2.1.

Eligibility and sample
The DHS are multipurpose surveys administered to large, nationally-representative samples of households and randomly selected women of reproductive age (typically 15-49 years) in interviewed households. The DHS routinely collect data on women's and children's health. They use internationally recognised guidelines for survey methodology and for the ethical collection of data, including data on violence against women and girls (VAWG) [15,24].
Keywords: Alignment optimization, Controlling behaviours, Cross-national, Measurement invariance testing, Physical intimate partner violence, Sustainable development goal 5
Eligible countries had completed a DHS between 2012 and 2018 (inclusive) and had administered the same 18 items measuring physical, sexual, or psychological IPV and controlling behaviours. Based on these criteria, the sample for this analysis included 36 DHS conducted in 36 LMICs (according to the World Bank classification system) and spanning five world regions.
Included DHS represented countries in Sub-Saharan Africa (22 countries), followed by countries in South and Southeast Asia (nine countries), Central Asia (two countries), North Africa/West Asia (two countries), and finally Latin America and the Caribbean (one country) (Table 1). Although a select sample, included DHS were conducted in demographically diverse national populations. For example, countries in the sample ranged widely in population size, from 516,000 people in the Maldives in its survey year to 1.35 billion in India in its survey year. Countries in the sample also ranged widely on the GINI index of income inequality retrieved for 2009-2018, from lower income inequality in the Kyrgyz Republic (GINI = 27.4) to higher inequality in Namibia (GINI = 59.1). Countries also ranged in gross national income per capita for 2018, from USD 280 in Burundi to USD 9310 in the Maldives, and in median grades of schooling completed for women of reproductive age in each DHS, from 3.0 in Nepal to 10.7 in the Philippines. Gender differences in the law, measured using the World Bank index on Women, Business, and the Law, ranged from 28.8 for Afghanistan to 86.9 for Zimbabwe, with higher scores indicating greater gender parity under the law (Supplemental Table S1). Finally, basic survey conditions varied somewhat across included DHS. The average survey team size ranged from 3 to 10 members. The number of training days ranged from 19 to 42, and the average interview duration ranged from 20 to 90 min, with a majority of DHS reporting an average duration of 30-60 min.

Data on IPV
The IPV-related questions in the DHS DVM [18] originated from the Revised Conflict Tactics Scales [15], a standardised instrument designed to capture behaviourally based acts of IPV ranging in severity from jealousy or anger for talking to other men, to pushing or shoving, to the threat or actual use of a weapon. The DHS DVM has evolved to resemble more closely the instrument used by the WHO [17]. Specifically, the module includes three items to assess acts of psychological IPV, seven items to assess acts of physical IPV, three items to assess acts of sexual IPV, and five items to assess acts of male controlling behaviour. The occurrence of IPV is measured as the woman's self-report of experiencing each IPV item: 1) ever in the lifetime of her referent relationship, and if yes, 2) with a standardised frequency in the 12 months before interview. Women's reported experience of five controlling behaviours is measured without a specific timeframe or frequency. All items assess IPV in relation to the woman's most recent spouse or partner. Supplemental Table S2 provides standard item wordings in English for each IPV item. Initial data exploration suggested that fewer than 2% of women in any included DHS sample had missing data on any single IPV item, and only 0.02% of all women (n = 65) across all 36 DHS had missing data on all IPV items.
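As a rough illustration of the dichotomous item coding described above, the sketch below derives a lifetime indicator for one IPV type from a woman's item responses and then computes a weighted prevalence. It assumes that a woman counts as having experienced the IPV type if she reports any item in the set, and that prevalence is a sampling-weighted mean. The function names, the use of None for a missing response, and the example values are hypothetical and are not taken from the DHS recode files.

```python
from typing import List, Optional, Sequence


def ever_experienced(items: Sequence[Optional[int]]) -> Optional[int]:
    """Return 1 if any observed item equals 1, 0 if all observed items are 0,
    and None if every item is missing (assumed handling, for illustration)."""
    observed = [v for v in items if v is not None]
    if not observed:
        return None
    return 1 if any(v == 1 for v in observed) else 0


def weighted_prevalence(indicators: Sequence[Optional[int]],
                        weights: Sequence[float]) -> float:
    """Weighted percentage of women with indicator == 1, skipping missing."""
    pairs = [(y, w) for y, w in zip(indicators, weights) if y is not None]
    total_weight = sum(w for _, w in pairs)
    return 100.0 * sum(y * w for y, w in pairs) / total_weight


# Hypothetical responses to seven physical-IPV items for three women.
women: List[List[Optional[int]]] = [
    [0, 0, 1, 0, 0, 0, 0],   # reports one act
    [0, 0, 0, 0, 0, 0, 0],   # reports no acts
    [None] * 7,              # missing on all items
]
weights = [1.0, 3.0, 2.0]    # hypothetical sampling weights

indicators = [ever_experienced(w) for w in women]
prevalence = weighted_prevalence(indicators, weights)
```

With these made-up inputs, the third woman is excluded as fully missing and the weighted prevalence is computed over the remaining two.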

Statistical analysis
We used Stata [25] for data processing and descriptive analyses and Mplus [26] for all other analyses. The main statistical analyses involved four major steps. As a first step, we conducted descriptive analyses to understand country-specific missingness and prevalence for each IPV item and item-specific prevalence ranges across included countries. As a second step, we performed 36 country-specific factor analyses to explore and then to confirm the dimensionality of each IPV item set, the magnitudes of item loadings, and overall model fit. For each country, the exploratory factor analysis (EFA) model was considered adequate if: item loadings were 0.35 or greater; model-fit statistics met recommended benchmarks (the root mean square error of approximation (RMSEA) was about 0.08 or lower, and the comparative fit index (CFI) and Tucker-Lewis index (TLI) were about 0.95 or higher); and the results fit with theory [27]. We then conducted country-specific confirmatory factor analyses (CFA), including countries that met the abovementioned model-fit criteria in the EFA. We used the same criteria for the item loadings and model-fit statistics to assess the adequacy of the fit of all CFA models. The EFA and CFA used the mean- and variance-adjusted weighted least squares estimator, which was appropriate for dichotomous responses (1 = [ever] experienced, 0 = did not [ever] experience the IPV item). The approach used pairwise deletion to handle missing data [28].
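The adequacy benchmarks above can be expressed as a simple screening rule. The following is a minimal sketch, not the authors' code: it treats the "about 0.08" and "about 0.95" benchmarks as hard cut-offs, which is a simplifying assumption, and it cannot encode the additional requirement that results fit with theory. The class name, function name, and example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FitResult:
    loadings: List[float]  # standardised item loadings for one country
    rmsea: float           # root mean square error of approximation
    cfi: float             # comparative fit index
    tli: float             # Tucker-Lewis index


def is_adequate(fit: FitResult,
                min_loading: float = 0.35,
                max_rmsea: float = 0.08,
                min_index: float = 0.95) -> bool:
    """Apply the loading and model-fit benchmarks as hard thresholds."""
    loadings_ok = all(value >= min_loading for value in fit.loadings)
    fit_ok = (fit.rmsea <= max_rmsea
              and fit.cfi >= min_index
              and fit.tli >= min_index)
    return loadings_ok and fit_ok


# Hypothetical country results, for illustration only.
good = FitResult(loadings=[0.72, 0.81, 0.66, 0.90, 0.85, 0.79, 0.88],
                 rmsea=0.03, cfi=0.99, tli=0.98)
weak = FitResult(loadings=[0.72, 0.30, 0.66, 0.90, 0.85, 0.79, 0.88],
                 rmsea=0.03, cfi=0.99, tli=0.98)
```

In this sketch, a country would move on to the CFA (and later to the MGCFA) only if `is_adequate` returns True for its model.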
As a third step, for national datasets that exhibited adequacy with respect to item loadings and model-fit statistics, we considered two approaches to assess the cross-national measurement invariance of the models confirmed in country-specific CFAs. Initially, we performed multiple-group CFA (MGCFA) to test for exact measurement invariance. When using this approach, small measurement differences are assumed to be exactly zero [29]. Following this approach, we tested sequentially for configural invariance, or equivalence of the factor structure across countries; then metric invariance, or equivalence of the factor loadings across countries; and then scalar invariance, or equivalence of the factor loadings and thresholds (or intercepts) across countries. Configural invariance implies that the dimensional structure of the latent IPV factor is equivalent across countries, although the item loadings and intercepts are free to vary across countries; whereas configural non-invariance implies that the latent IPV factor has a different dimensional structure across countries. Metric invariance implies that each IPV item contributes to the latent IPV construct to a similar degree across countries. Conversely, metric non-invariance implies that at least one IPV item is related differently to the latent IPV construct across countries. Scalar invariance implies that the factorial scores are comparable across countries. Conversely, scalar non-invariance may indicate potential measurement bias and suggests that larger forces, such as cultural norms, may influence systematically how different populations respond to IPV items in ways that are unrelated to the latent IPV construct. We used Maximum Likelihood estimation, which is appropriate for dichotomous responses and allowed us to test separately for metric and scalar invariance [30].
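The nested sequence of invariance models described above can be written compactly. The notation below is an illustrative sketch rather than the authors' specification: it assumes a logistic link for the dichotomous items, with i indexing women, j items, and g countries, and uses the assumed symbols η for the latent IPV factor, λ for loadings, and τ for thresholds.

```latex
\begin{align*}
\operatorname{logit} P(y_{ijg}=1 \mid \eta_{ig}) &= -\tau_{jg} + \lambda_{jg}\,\eta_{ig}
  && \text{configural: } \lambda_{jg},\ \tau_{jg} \text{ free in every country } g \\
\lambda_{jg} &= \lambda_{j}
  && \text{metric: loadings equal across countries} \\
\lambda_{jg} = \lambda_{j}, \quad \tau_{jg} &= \tau_{j}
  && \text{scalar: loadings and thresholds equal across countries}
\end{align*}
```

Each model adds equality constraints to the previous one, so a significant loss of fit from configural to metric, or from metric to scalar, signals non-invariance at that level.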
In the exact invariance-testing framework, evidence of metric or scalar non-invariance leaves three analytical options: (1) investigate the source of the non-invariance by sequentially releasing or adding loading or intercept constraints and retesting the models until partial measurement invariance is achieved, (2) omit IPV items with non-invariant loadings or intercepts and retest the sequential invariance models, or (3) assume that the IPV construct is noninvariant and discontinue exact invariance testing. Given the large number of countries and small number of IPV items per set, we did not consider options (1) or (2) to be advisable.
Instead, as a fourth step, based on findings from the MGCFA, we used alignment optimization (AO) to assess approximate measurement invariance of the IPV items across countries. According to users of AO methods, the restriction of equal model parameters required by MGCFA may be overly strict, especially when many groups or time points are involved in the comparison (e.g., Davidov et al. [31]). The approximate measurement invariance approach allows, instead, for differences in these model parameters across groups by finding an optimal model with the minimal amount of measurement non-invariance. In the first step of AO [32], MGCFA was used to confirm cross-national configural invariance of the IPV factor model. In the second step of AO, if configural invariance was achieved, the factor means and variances of all but the reference group, which were fixed to 0 and 1, were estimated to minimise the total amount of non-invariance across all parameters. The quality of the alignment result, then, was determined by the percentage of loading and intercept parameters that displayed non-invariance. As a guide, a limit of 25% of non-invariant parameters or less indicated trustworthy results [33]. For higher percentages, a Monte Carlo simulation is advised to assess the quality of the results [33]. Monte Carlo simulations are based on the correlation between the population factor means and the estimated alignment factor means, computed over groups and averaged over replications. Correlations of at least 0.98 produce reliable factor means [33]. Like MGCFA, AO employed maximum likelihood estimation, which used all available data, assuming data were missing at random [28,33].
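To make the idea of minimising total non-invariance more concrete, the sketch below computes a simplified loss for one parameter type (for example, loadings) across groups. It follows the general form of the loss described in the alignment literature: pairwise differences between groups, passed through a component loss function and weighted by group sizes. The small constant `eps`, the function names, and the example values are assumptions for illustration. The full method sums such terms over both loadings and intercepts (or thresholds) and searches over the factor means and variances, which this sketch does not do.

```python
import math
from itertools import combinations
from typing import Dict, List


def component_loss(difference: float, eps: float = 0.01) -> float:
    """Smooth approximation to sqrt(|difference|); eps is an assumed constant."""
    return math.sqrt(math.sqrt(difference * difference + eps))


def alignment_loss(params: Dict[str, List[float]],
                   sizes: Dict[str, int],
                   eps: float = 0.01) -> float:
    """Sum the weighted component loss over all group pairs and items.

    params maps each group to its per-item parameter values;
    sizes maps each group to its sample size.
    """
    total = 0.0
    for g1, g2 in combinations(sorted(params), 2):
        weight = math.sqrt(sizes[g1] * sizes[g2])
        for a, b in zip(params[g1], params[g2]):
            total += weight * component_loss(a - b, eps)
    return total


# Hypothetical loadings for two items in three countries.
sizes = {"A": 100, "B": 100, "C": 100}
equal = {"A": [0.8, 0.7], "B": [0.8, 0.7], "C": [0.8, 0.7]}
differing = {"A": [0.8, 0.7], "B": [0.8, 0.7], "C": [1.4, 0.7]}

loss_equal = alignment_loss(equal, sizes)
loss_differing = alignment_loss(differing, sizes)
```

Because the component loss grows slowly for large differences, a solution with a few large non-invariant parameters and many near-identical ones tends to be preferred over one with many medium-sized differences.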
At the initial stages of analysis, we attempted to follow the above steps including the following IPV item sets: (1) four item sets (physical IPV, sexual IPV, psychological IPV, controlling behaviours) to assess the invariance of a four-dimensional IPV model, (2) three item sets (physical IPV, sexual IPV, controlling behaviours) to assess the invariance of a three-dimensional IPV model, and (3) two item sets (physical IPV and either sexual IPV or controlling behaviours) to assess the invariance of a bidimensional IPV model. We encountered challenges completing all analytical steps for these models (Supplemental File S1), which we discuss in the Limitations section of the Discussion with recommendations for future research. To address these challenges, we applied the above analytical steps to assess the invariance of unidimensional IPV models for item sets that arguably were more behaviourally based and/or had more content validity because they included more items [34]. Accordingly, the analyses presented in the body of this paper assessed separately the measurement invariance of the seven physical-IPV items and the five controlling-behaviour items.

Conventional prevalence estimates of IPV
Estimates for lifetime IPV were generally high but ranged widely across sample countries (Table 2). Reported lifetime experience of physical IPV ranged from 5.6% in Comoros to 50.5% in Afghanistan. Reported lifetime experience of sexual IPV ranged from 1.1% in Armenia to 25.5% in the DRC. Reported lifetime experience of psychological IPV ranged from 6.4% in Comoros to 50.8% in Afghanistan, and reported experiences of controlling behaviours ranged from 25.9% in Cambodia to 84.7% in Gabon.
Reported prior-year prevalences of IPV, by type, also are presented in Table 2. In general, the lower item-specific prevalences for prior-year IPV, by type, made invariance testing with these measures more difficult (Supplemental File S1; results available on request).

Results from country-specific exploratory and confirmatory factor analyses
Tables 3 and 4 present the results for country-specific EFAs and CFAs for lifetime physical IPV (Table 3) and for controlling behaviours (Table 4). Thus, in country-specific EFAs and CFAs, unidimensional models for the seven physical-IPV items and the five controlling-behaviour items had reasonable fits with the data across all countries. The country-specific loadings for each item and the ranges of estimated item loadings across countries are reported in Supplemental Tables S3a and S3b.

Multiple-group CFA results: assessment of exact measurement invariance
Table 5 presents results for the MGCFAs for physical IPV (Panel 1) and controlling behaviours (Panel 2), across all 36 included countries. For the physical-IPV unidimensional model, the metric and configural models differed significantly (at p < 0.001), as did the scalar and metric models (at p < 0.001). Based on the test statistics and their proposed benchmarks, metric invariance across countries was not achieved. Similarly, for the controlling-behaviour unidimensional model, the metric and configural models differed significantly (at p < 0.001), as did the scalar and metric models (at p < 0.001). Based on the test statistics and their proposed benchmarks, metric invariance across countries was not achieved for this model either.

Alignment optimization results: assessment of approximate measurement invariance
Given the lack of exact measurement invariance based on the MGCFA results, Table 6 presents the results based on alignment optimization, in which we assessed approximate measurement invariance separately for the physical-IPV items (Panel 1) and the controlling-behaviour items (Panel 2). For physical IPV, 55 (or 21.8% of) estimated thresholds, eight (or 2.8% of) estimated loadings, and 12.3% of all parameter estimates were measurement non-invariant (Table 3). The items 'slap', 'choke', and 'twist' had a low degree of threshold invariance, and the item 'choke' had a low degree of loading invariance (see low R² values in Table 6, Panel 1). For controlling behaviours, 21 (or 11.7% of) estimated thresholds, three (or 1.7% of) estimated loadings, and 6.7% of all parameter estimates were measurement non-invariant (Table 4). All items had a reasonable degree of threshold invariance; however, the items 'meet your female friends' and 'contact with your family' had a low degree of loading invariance (see low R² values in Table 6, Panel 2). Again, a guideline of 25% or fewer total non-invariant parameter estimates is recommended for trustworthy latent mean estimates and their comparison across groups. Overall, results suggested that the DHS item sets for physical IPV and controlling behaviours exhibited approximate measurement invariance across the 36 countries and allowed acceptable alignment performance.
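The percentages reported for controlling behaviours can be reproduced with simple arithmetic, as sketched below. The sketch assumes that each parameter type has one estimate per item per country (five items across 36 countries, so 180 thresholds and 180 loadings); this assumption matches the reported values but is not stated explicitly in the text. The function name is hypothetical.

```python
from typing import Dict


def noninvariance_shares(n_items: int, n_groups: int,
                         thresholds_flagged: int,
                         loadings_flagged: int) -> Dict[str, float]:
    """Percentage of non-invariant thresholds, loadings, and all parameters."""
    per_type = n_items * n_groups  # one threshold and one loading per item per group
    flagged_total = thresholds_flagged + loadings_flagged
    return {
        "thresholds": 100.0 * thresholds_flagged / per_type,
        "loadings": 100.0 * loadings_flagged / per_type,
        "overall": 100.0 * flagged_total / (2 * per_type),
    }


# Controlling behaviours: 21 thresholds and 3 loadings flagged as non-invariant.
shares = noninvariance_shares(n_items=5, n_groups=36,
                              thresholds_flagged=21, loadings_flagged=3)
within_guideline = shares["overall"] <= 25.0
```

Rounded to one decimal place, the shares come to 11.7% of thresholds, 1.7% of loadings, and 6.7% overall, well under the 25% guideline.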

Country rankings on level of physical IPV based on AO-estimates and standard prevalence
For illustration, Fig. 1 compares country rankings on level of lifetime physical IPV based on AO-derived scores versus conventional prevalence estimates. (Full country-ranking results for physical IPV and analogous results for controlling behaviours are available on request.) The physical IPV scores are factor means derived from the final AO factor model, which presumes that observed items reflect a latent physical IPV construct. The prevalence estimates are based on aggregates of the observed responses to physical IPV items using mean estimation with adjustment for sampling. Uncertainties in both sets of estimates are reflected in 99.9% confidence intervals to account for multiple comparisons. As shown in Fig. 1, the distributions of country rankings based on AO-derived scores and prevalence estimates suggested some country-level differences; however, a Wilcoxon matched-pairs signed-rank test supported no significant difference in country rankings. Both sets of estimates exhibited a high degree of clustering. For example, in comparing countries using AO-derived scores, 12 clusters emerged, wherein country estimates did not differ significantly from one another. In comparing countries by conventional estimates of prevalence and associated confidence limits, three major clusters emerged: countries ranked 1-12, those ranked 13-30, and those ranked 31-36.
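The role of the 99.9% confidence intervals in forming clusters can be illustrated with a short sketch. It assumes normal-approximation intervals and treats two countries as not significantly different when their intervals overlap; both are simplifying assumptions, and the estimates and standard errors are made up for illustration rather than taken from Fig. 1.

```python
from statistics import NormalDist
from typing import Tuple


def conf_interval(estimate: float, std_error: float,
                  level: float = 0.999) -> Tuple[float, float]:
    """Two-sided normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 3.29 for 99.9%
    return estimate - z * std_error, estimate + z * std_error


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical prevalence estimates (proportions) with standard errors.
country_x = conf_interval(0.30, 0.01)
country_y = conf_interval(0.32, 0.01)
country_z = conf_interval(0.40, 0.01)

same_cluster_xy = intervals_overlap(country_x, country_y)
same_cluster_xz = intervals_overlap(country_x, country_z)
```

With intervals this wide, neighbouring countries in a ranking often overlap, which is one way to see why individual rank positions carry little information.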

Convergent validity of AO-derived scores for physical IPV and controlling behaviour with IPV prevalences
As expected, AO-derived scores for physical IPV and for controlling behaviours were positively correlated with prevalence estimates for all four types of IPV, providing evidence of convergent validity (Fig. S1).

Summary of findings
This is the first cross-national analysis to assess the measurement invariance of seven standard physical-IPV items and five standard controlling-behaviour items from the DVM administered in 36 DHS across 36 LMICs during 2012-2018. Included countries spanned five world regions and had populations that varied in size, schooling attainment for women, income inequality, and degree of gender equity in the national legal environment.
Elements of survey administration related to team size, number of training days, and average interview duration also differed across countries. In separate (unidimensional) analyses, both item sets exhibited good country-specific measurement properties for all 36 LMICs. Although neither item set met the criteria for metric or scalar invariance, both item sets did meet the criteria for approximate invariance across all 36 LMICs. The distributions of country rankings, based on AO-derived scores and conventional prevalence estimates, were similar for physical IPV and for controlling behaviours. However, both AO-derived scores and prevalence estimates often were highly clustered and not significantly different, suggesting that individual country rankings were not interpretable using either set of estimates for either type of IPV.

Limitations and strengths
Findings are limited to the seven physical-IPV items and five controlling-behaviour items included in this analysis. As such, findings cannot be extrapolated to different physical IPV items, different controlling-behaviour items, or other item sets intended to capture other types of IPV. This limitation is important, given the challenges we encountered when attempting to undertake the same analytical steps for other combinations of IPV item sets (Supplemental File S1). These analytical challenges may be attributable to a variety of issues. First, the conceptualizations of psychological IPV [34][35][36] and sexual IPV [34] remain under-developed, especially in research undertaken in LMIC settings. Second, the sexual IPV and psychological IPV item sets used in the DHS each included only three items capturing a narrow subset of behaviours. The sexual IPV items, for example, captured only "forced" sex acts and excluded acts that occur when the victim is unable to consent [34]. Small item sets that lack content validity may miss acts that contribute importantly to the latent construct across countries. Third, low item-specific prevalences (especially for sexual IPV) have been noted as a concern for efforts to validate measures of IPV [37]. In our case, low item prevalences prevented model convergence during the Monte Carlo simulation stage of some AO analyses. Underestimates of IPV present ongoing challenges to the accurate measurement of IPV in LMICs [38]. For the DHS DVM, some of this low prevalence may have arisen because the DVM is implemented at the end of a sometimes long, multi-purpose survey (Table 1), when respondents and interviewers may be fatigued. Fourth, the less behaviourally-based and more subjective nature of the sexual IPV items (e.g., "physically forced") and psychological IPV items (e.g., "humiliated") may be a source of non-invariance, as such wording may be interpreted differently across languages and contexts.
Finally, some differences in survey administration across countries in this analysis (e.g., team size; training duration; interview duration) may have contributed to our inability to establish exact invariance for items in the analysis. The DVM typically is administered at or toward the end of the women's interview; therefore, the inclusion of more, sensitive, or different modules earlier in the interview may have framed the DVM in ways that affected its measurement invariance across countries. Findings also are limited to this non-representative set of LMICs and for the period of analysis (2012-2018). Nevertheless, the establishment of approximate measurement invariance for seven physical-IPV items and five controlling-behaviour items across highly diverse LMICs spanning five world regions suggests the utility of these item sets to compare countries on these dimensions of IPV. These results support their use to monitor SDG5.2.1.