Reliability, validity and measurement invariance of the Simplified Medication Adherence Questionnaire (SMAQ) among HIV-positive women in Ethiopia: a quasi-experimental study

Background Adherence to antiretroviral therapy is critical to the achievement of the third target of the UNAIDS Fast-Track Initiative goals of 2020–2030. Reliable, valid and accurate measurement of adherence is important for correct assessment of adherence and in predicting the efficacy of ART. The Simplified Medication Adherence Questionnaire is a six-item scale which assesses the perception of persons living with HIV about their adherence to ART. Despite recent widespread use, its measurement properties have yet to be carefully documented beyond the original study in Spain. The objective of this paper was to conduct internal consistency reliability, concurrent validity and measurement invariance tests for the SMAQ. Methods HIV-positive women who were receiving ART services from 51 service providers in two sub-cities of Addis Ababa, Ethiopia completed the SMAQ in an HIV treatment referral network study between 2011 and 2012. Two cross-sections of 402 and 524 female patients of reproductive age, respectively, from the two sub-cities were randomly selected and interviewed at baseline and follow-up. We used Cronbach’s coefficient alpha (α) to assess internal consistency reliability, Pearson product-moment correlation (r) to assess concurrent validity and multiple-group confirmatory factor analysis to analyze factorial structure and measurement invariance of the SMAQ. Results All participants were female, with a mean age of 33 years (median: 34 years; range: 18–45 years). Cronbach’s alphas for the six items of the SMAQ were 0.66, 0.68, 0.75 and 0.75 for T1 control, T1 intervention, T2 control, and T2 intervention groups, respectively. Pearson correlation coefficients were 0.78, 0.49, 0.52, 0.48, 0.76 and 0.80 for items 1 to 6, respectively, for T1 compared with T2. We found invariance for factor loadings, observed item intercepts and factor variances, also known as strong measurement invariance, when we compared latent adherence levels between and across patient-groups.
Conclusions Our results show that the six-item SMAQ scale has adequate reliability and validity indices for this sample, in addition to being invariant across comparison groups. The findings of this study strengthen the evidence in support of the increasing use of SMAQ by interventionists and researchers to examine, pool and compare adherence scores across groups and time periods.


Background
According to the United Nations AIDS Program (2019), by the end of 2018 nearly 38 million people were living with HIV/AIDS, of whom 23 million were on antiretroviral therapy (ART) [1]. At the same time, 63% of the nearly 700 thousand adults living with HIV in Ethiopia were women, and annual new infections among young women aged 15-24 years were more than double those among young men (5800 compared with 2000) [2]. HIV treatment using ART can improve functionality and decrease mortality, but lapses in adherence may render treatment permanently ineffective, for example, due to drug resistance [3]. The WHO has defined adherence as "the extent to which a person's behavior – taking medication, following a diet, and executing lifestyle changes, corresponds with agreed recommendations from a healthcare provider" [3]. Non-adherent patients have higher mortality rates than adherent ones with similar CD4+ counts, and adherence is the critical determinant of survival among persons living with HIV [4][5][6]. Non-adherence is also associated with poor health outcomes, increased healthcare costs and poor patient safety, due to increased risk of dependence, relapses and toxicity, among others [7]. Adherence is reported to be a major challenge in healthcare, estimated at 50% in high-income countries and even lower in some low- and medium-income countries [7]. Adherence is also critical to the achievement of the third target of the UNAIDS Fast-Track Initiative goals of 2020-2030, in which 90-95% of people with HIV are diagnosed with it, 90-95% of the diagnosed receive ART, and 90-95% of those on ART achieve viral suppression [8][9][10].
In Ethiopia, treatment adherence and retention were estimated to be on average 51-85% and 70% among those who had been initiated on ART, respectively [11]. In addition, a meta-analysis of 27 studies conducted in 12 sub-Saharan African countries (not including Ethiopia) found average adherence rates of 77% among study participants who were on ART [12]. Further, in the same meta-analysis, the authors reported average adherence of 55% among patients who participated in 24 studies in the United States and Canada [12]. In the literature, studies comparing adherence rates by sex of participants in Ethiopia are scant, but Molla et al. (2018) found that women had 1.22 times higher odds of adherence to ART than men [13].
Accurate measurement of adherence is important for correct assessment of health outcomes and in predicting the efficacy of ART [7]. Non-adherence compromises treatment efficacy, and without accurate treatment efficacy data, adherence rates necessary for planning and evaluation cannot be achieved [7]. Further, accurate measurement of adherence is required for effective and efficient treatment planning, and for ensuring that changes in health outcomes can be attributed to recommended regimens. In addition, decisions to change recommendations, medications, and communication style to promote patient participation depend on valid and reliable measurement of the adherence construct [7].
Medication adherence has been measured using several methods, including: direct measures, measures involving secondary database analysis, measures involving electronic medication packaging (EMP) devices, pill count and measures involving clinician assessments and self-report [14]. However, there is no "gold standard" for measurement of adherence, and each method has advantages and disadvantages [14,15]. For example, the WHO reported that there are challenges in measurement of the adherence construct even when more objective methods are used [7]. The report cited challenges including: counting inaccuracies using the "remaining dosage units" method; the inability to capture important information such as timing of dosage and pattern of missed dosage; the high cost of medication event monitoring systems (MEMS); the inability to tell whether patients actually use their medicine when they are removed from the bottle; difficulties faced when an individual acquires medication at multiple pharmacies; and inaccurate and incomplete records using the prescription refills method [7].
Self-reports include measures such as patient-kept diaries, patient interviews, and questionnaires and scales; they tend to overestimate adherence behavior compared with other methods [15]. Despite their limitations, self-reports can significantly predict clinical outcomes and produce actionable information for patients and providers [15]. They are also cheaper, noninvasive and easier to administer compared with other methods [15]. Some examples of self-report questionnaires and scales for general use include: Adherence Estimator, Adherence to Refills and Medication Scale (ARMS), Brief Medication Questionnaire (BMQ), Medical Outcomes Study (MOS), Medication Adherence Scale (MAS), Medication Management Instrument for Deficiencies in the Elderly (MedMaIDE), Medical Adherence Measure, Morisky Adherence Questionnaire 4 item (MAQ) and the Morisky Adherence Questionnaire 8 item (MAQ) [14][15][16]. In addition, there are self-report questionnaires and scales specific to measurement of adherence to ART, for example: AIDS Clinical Trials Group (ACTG) Adherence Questionnaire, Community Programs for Clinical Research on AIDS (CPCRA), Antiretroviral Medication Self-Report, Self-Rating Scale Item (SRSI), Self-Reported Adherence (SERAD) Questionnaire, Self-Reported Questionnaire Assessing Adherence to Antiretroviral Medication, Simplified Medication Adherence Questionnaire (SMAQ) and Visual Analog Scale (VAS), among others [14][15][16].
There is a dearth of literature on use of standardized scales to measure medication adherence among people on ART in Ethiopia. A systematic review of 15 ART adherence studies in Ethiopia reported that 60% of the studies used self-reports; other methods included caregiver reports, unannounced pill counts, pharmacy refill records, medication event monitoring systems, viral load measurement, CD4 count and record review [17]. Some studies have reported challenges associated with various methods of assessing adherence among users of ART in Ethiopia. Biressaw et al. (2013) found a discrepancy in adherence levels estimated by caregiver reports and unannounced home-based pill counts. They found that adherence estimated from unannounced pill counts was unacceptably low, but comparable to that of Medication Event Monitoring System reported by other studies [18]. Amberbir et al. (2008) and Markos, Worku & Davey (2008) also reported using self-reports to assess adherence to ART among HIV-positive individuals in Ethiopia. Both studies reported that self-reports overstated adherence levels compared to unannounced pill count due to social desirability bias, in addition to being susceptible to recall bias [19,20]. The authors reported that despite their limitations, self-reports and pill counts are widely used in Ethiopia because they are cheaper and easy to implement [17]. Self-reports have also been found to correlate with viral load and clinical outcomes [17].
According to the WHO, standardized multi-item scales such as SMAQ that assess specific behaviors relating to medication recommendations may be better predictors of adherence than simple yes/no responses [7]. The underlying logic is that each indicator when used on its own may be insufficient to capture the construct, but when these indicators are combined, they represent a valid composite measure of the underlying construct of interest [42]. While standardized scales have potential advantages in understanding perceptions about adherence, literature assessing psychometric properties including reliability, validity and measurement invariance (MI) of different scales in diverse settings is sparse. In addition, standardized scales are often used with populations that may be quite different from the one in which the scales were originally validated [42]. Also, there is a natural desire to make group comparisons and conclusions about effects of interventions on the mean scale scores of expected patient outcomes [43]. However, such comparisons are justified only to the extent that these comparisons approximate differences of means on the theoretical true score of the relevant constructs, and when the means are generated from data collected using questionnaires and scales exhibiting acceptable levels of reliability and validity [15,43]. Further, even when standardized scales are used, inferences and conclusions about observed mean differences are dependent on the between-group equivalence of the underlying measurement model [43]. However, an investigator's ability to assess true differences between groups or across time can be hindered by measurement errors, which can limit the ability to make accurate meaningful comparisons when determining program impacts [42].
Measurement invariance is a statistical criterion that is used to assess the extent to which a standardized scale measures the same construct in each group and at each time point studied [43]. It provides a way to assess whether respondents interpreted measures conceptually similarly across groups and time and whether participation in an intervention altered the conceptual frame of reference against which a group responded to an indicator over time [42]. Measurement invariance requires that any two persons with the same level of the latent construct should obtain the same expected score on the indicators used to measure the underlying construct, regardless of the group they are in [44]. Assessment of MI helps in determining if a scale functions equivalently for all groups defined by factors such as gender, age, education, mother tongue, socioeconomic status, regional background, among others [44]. Demonstrating that a scale has MI allows an investigator to make valid comparison of construct scores such as means that yield meaningful interpretations and substantive inferences [45].
To improve clinical research on ART adherence in this population, properties of measuring instruments, such as reliability, validity, and MI must be analyzed. While the importance of reliability and validity for assessing a self-report instrument is well-understood, measurement invariance is increasingly being evaluated for valid comparisons of levels of latent outcomes to be made. Despite increasing frequency of use of the SMAQ in assessment of adherence to antiretroviral therapy, to date no study has assessed its MI and other psychometric properties such as reliability and validity in sub-Saharan Africa. Using data from a pre-post quasi-experimental evaluation study of a HIV/AIDS intervention among HIV-positive women of reproductive age in Ethiopia, hereinafter referred to as the parent study (pre refers to before intervention assessment or T1, whereas post refers to post intervention assessment or T2) [46,47], this paper assesses the internal consistency reliability, concurrent and factorial validity, and MI for the SMAQ in this setting. These analyses build upon the parent study and add to the sparse literature about the validity of SMAQ as a HIV/AIDS treatment adherence measure.

Parent study
Data for this paper were obtained from a parent study conducted by MEASURE Evaluation that was funded by the United States Agency for International Development between March 2011 and December 2012. MEASURE Evaluation conducts studies globally on innovative public health interventions that have high potential positive impact on target (sub)populations and present high potential returns on investment. The parent study sought to assess the effect of a quasi-experimental organizational network intervention on HIV testing, ART initiation and adherence. The intervention was an organizational referral network improvement initiative to increase access to and quality of health services. The study followed the treatment and referral experiences of 926 HIV-positive women 18-49 years of age who were receiving ART and other healthcare services from provider agencies in an intervention or control site. Additional details of the parent study are reported in Appendix 1, and its findings have been published elsewhere [46,47].

Client interviews
The MEASURE Evaluation team enrolled clients, using random selection, from one large home-based care service provider that operated in both sites [46,47]. Following a quasi-experimental design strategy, HIV-positive women were interviewed in two cross-sections, one (T1) prior to and the other (T2) 18 months following the network intervention. At T1, 402 clients were interviewed: 210 at the intervention site and 192 from the control site; at T2, 524 clients were interviewed: 268 from the intervention site and 256 from the control site. At both times, after voluntary verbal consent, clients were asked about personal and household-level demographic characteristics, HIV treatment status and medication adherence. Although some clients may have been interviewed at both T1 and T2, participation in T2 interviews was not conditional on T1 participation. There was no way to know whether a participant in T2 also participated in T1, because at both times the research team randomly sampled clients from agency caseload lists. Consequently, we analyzed the samples as if they were independent of each other. For demographic characteristics of participants: we used one-way ANOVA with Bonferroni correction to compare mean age; chi square test to compare response categories of levels of education, marital status and SMAQ items; and Kruskal Wallis test to compare mean income per week across groups. In addition, we used pairwise correlation to assess correlations between demographic variables and items of the SMAQ. The parent study was reviewed and approved by the Office of Human Research Ethics at the University of North Carolina at Chapel Hill IRB Number 11-0282, the Office of Research Ethics at FHI 360 and the Addis Ababa City Administration Health Bureau in Ethiopia [46,47].

Measurement of ART adherence
To assess adherence, participants who reported using ART were asked six questions from a standardized scale known as the Simplified Medication Adherence Questionnaire (SMAQ) [21], as follows: "1. Do you ever forget to take your medicine?", "2. Are you careless at times about taking your medicine?", "3. Sometimes if you feel worse, do you stop taking your medicines?", "4. Thinking about the last week, how often have you not taken your medicine?", "5. Did you not take any of your medicine over the past weekend?" and "6. Over the past three months, how many days have you not taken any medicine at all?" Adherence was scored as a "no" response to questions 1, 2, 3 and 5, zero response for question 4 and any response less than 2 for question 6. The six questions/items constituted the unidimensional model for measurement of adherence. The six questions assess three components of adherence to ART: intentional (question 3), unintentional (questions 1 and 2) and frequency or quantity (questions 4, 5 and 6). Intentional non-adherence refers to when a patient deliberately decides not to take their medication because of various reasons, for example feeling worse. Unintentional non-adherence, by contrast, occurs when a patient wishes to adhere to medication but is prevented by some reason, for example, forgetfulness [48]. Questions 4 to 6 assess various aspects of frequency of non-adherence. An experienced Amharic-English speaker translated the questionnaire and then it was back-translated by an Ethiopian survey coordinator.
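The scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not the parent study's code: the field names are hypothetical, question 4 is assumed to be recorded as a count of missed occasions, and the overall classification (adherent only if all six items receive the adherent response) is an assumption about how the item-level rule is combined.

```python
from dataclasses import dataclass


@dataclass
class SmaqResponse:
    """One participant's answers to the six SMAQ items (hypothetical field names)."""
    forgets: bool                # Q1: ever forgets to take medicine
    careless: bool               # Q2: careless at times about taking medicine
    stops_when_worse: bool       # Q3: stops taking medicine when feeling worse
    missed_last_week: int        # Q4: assumed coding, number of times medicine was not taken
    missed_weekend: bool         # Q5: did not take any medicine over the past weekend
    days_missed_3_months: int    # Q6: days without any medicine in the past three months


def item_scores(r: SmaqResponse) -> list[int]:
    """Return six 0/1 indicators; 1 marks the adherent response for that item."""
    return [
        int(not r.forgets),               # "no" to Q1
        int(not r.careless),              # "no" to Q2
        int(not r.stops_when_worse),      # "no" to Q3
        int(r.missed_last_week == 0),     # zero response to Q4
        int(not r.missed_weekend),        # "no" to Q5
        int(r.days_missed_3_months < 2),  # fewer than 2 days for Q6
    ]


def is_adherent(r: SmaqResponse) -> bool:
    """Assumed overall rule: adherent only if every item is answered adherently."""
    return all(item_scores(r))
```

The six 0/1 indicators returned by `item_scores` correspond to the dichotomous item responses that a reliability or factor analysis of the scale would take as input.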

Reliability and validity
Prior to conducting MI tests, we assessed SMAQ's reliability and validity in an Ethiopian context. Reliability denotes the ability of a scale to produce consistent results when completed under similar conditions, whereas validity denotes the extent to which a scale measures the construct it is supposed to. Reliability is analogous to the scale's precision, whereas validity is analogous to its accuracy.

Internal consistency reliability and concurrent and factorial validity
We conducted an internal consistency reliability test of the SMAQ data from Ethiopia using Cronbach's alpha (α). This index measures internal consistency reliability of both items and the construct being measured [49,50]; in this case, how closely related the six items of the SMAQ were as a set in measuring the adherence construct. Values of α in the range of 0.6 to 0.8 (0.6 ≤ α ≤ 0.8) are considered adequate, while 0.8 or higher is considered a high value of internal consistency [51]. We used Pearson product-moment correlation coefficients (r) to assess concurrent validity of the domain scores at T1 and T2. In this context, concurrent validity represents the extent to which item scores at T1 related to those of the same scale administered to women at T2 [52]. Criteria for concurrent validity were based on directionality of expected relationships of the six items between the two times and strength of the observed correlation coefficient. The Pearson product-moment correlation coefficient has a range of −1 to +1 between two sets of scores, and coefficients close to 1 in absolute value indicate high concurrent validity [52]. Based on thresholds from previous studies, correlation coefficients less than or equal to 0.25 suggest a weak relationship; those between 0.25 and 0.50, a moderate relationship; those between 0.50 and 0.75, a strong relationship; and values greater than 0.75, a very strong correlation [53,54]. Finally, we used model fit indices and statistical significance of factor loadings to assess factorial validity. Factorial validity is one of several distinct but inter-related elements of construct validity. A strong correlation between a set of indicators and a latent construct indicates factorial validity [55].
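As an illustration of the indices described above, the following minimal Python sketch computes Cronbach's α from a respondents-by-items matrix and applies the interpretation thresholds quoted in the text. It is not the authors' analysis code; the function names are hypothetical, sample variances are used throughout, and the handling of the boundary values (0.8 for α, and 0.50 and 0.75 for r) is an assumption, since the text leaves them ambiguous.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])
    item_columns = list(zip(*rows))                      # one tuple per item
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in rows])     # variance of summed scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


def describe_alpha(alpha):
    """Label alpha using the thresholds in the text (0.8 itself treated as high: assumption)."""
    if alpha >= 0.8:
        return "high"
    if alpha >= 0.6:
        return "adequate"
    return "below the adequate range"


def describe_correlation(r):
    """Label |r| using the thresholds in the text (upper bounds inclusive: assumption)."""
    size = abs(r)
    if size <= 0.25:
        return "weak"
    if size <= 0.50:
        return "moderate"
    if size <= 0.75:
        return "strong"
    return "very strong"
```

For example, `describe_alpha(0.66)` falls in the adequate band and `describe_correlation(0.78)` in the very strong band under these assumed cut-offs.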

Measurement model
We measured adherence with six factor indicators corresponding to the six SMAQ items. In Figs. 1, 2, 3, 4, the latent factor of adherence is represented by the circular shape. The arrows represent factor loadings, which are direct effects of each adherence indicator on the latent construct of adherence. We report summary statistics, factor loadings and model fit indices for specific models including chi-square values, root mean square error of approximation (RMSEA) values, comparative fit indices (CFI)/Tucker-Lewis indices and the final estimated measurement models. A significant chi-square test indicates a poor model fit, but this may also be due to moderate discrepancies in normality of data and large (n > 200) sample size [56]. Therefore, we used other model fit indices to supplement the chi-square test in determining the model that best fit the data. The RMSEA is a measure of the estimated discrepancy between the population and model implied covariance matrices per degree of freedom [43]. Values of RMSEA less than 0.05 indicate close model fit whereas values up to 0.08 reflect adequate fit. The CFI varies from 0 to 1, representing extremely weak to perfect fit, respectively, and a value of 0.95 is considered to represent adequate fit [43].
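For readers unfamiliar with these indices, commonly cited textbook definitions are sketched below. The notation is an assumption introduced here and does not come from the study: $\chi^2_M$ and $df_M$ denote the chi-square statistic and degrees of freedom of the fitted model, $\chi^2_B$ and $df_B$ those of a baseline (independence) model, and $N$ the sample size. Software packages may apply variants of these formulas, for example for multiple-group models or for estimators such as WLSMV, so reported values need not match a hand calculation.

```latex
\mathrm{RMSEA} = \sqrt{\frac{\max\left(\chi^2_M - df_M,\ 0\right)}{df_M\,(N - 1)}},
\qquad
\mathrm{CFI} = 1 - \frac{\max\left(\chi^2_M - df_M,\ 0\right)}{\max\left(\chi^2_M - df_M,\ \chi^2_B - df_B,\ 0\right)}
```

Under these definitions, RMSEA approaches 0 as the model chi-square approaches its degrees of freedom, and CFI approaches 1 as the fitted model improves on the baseline model.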

Measurement invariance test
Measurement invariance testing is based on the overall assumption that comparison between groups is important, and the presence or absence of differences between groups has some meaningful implications [45]. We tested for the levels of invariance based upon the following assumptions: (1) the measure of interest, that is adherence to ART, was perceptually based; adherence comprises multiple manifest indicators (i.e., multiple items of the SMAQ); (2) the six items of SMAQ are combined additively to operationalize the underlying construct; (3) evidence exists of the SMAQ's psychometric soundness beyond the preliminary stages of scale development, i.e. evidence exists of the SMAQ's psychometric soundness in a Spanish sample, but it has yet to be demonstrated for Ethiopia or other sub-Saharan Africa location; (4) the four study groups are independent of each other: T1 control, T1 intervention, T2 control and T2 intervention; and (5) the common factor model for describing relationships among items of the SMAQ holds across groups [45].
Following the independent groups assumption, we applied a multiple-group confirmatory factor analysis (CFA) to test three levels of invariance: configural, weak factorial and strong factorial [45]. Multiple-group CFA allowed us to simultaneously test four group-specific latent adherence factor models using robust weighted least squares (WLSMV). We fit models for each group/time and evaluated sample differences with a chi-square test. We used WLSMV to conduct chi-square difference testing because adherence indicators were categorical and non-normally distributed. A significant chi-square difference value indicated that constraining the parameters of the nested model significantly worsened the fit of the model, which indicated measurement non-invariance, thereby sustaining the unconstrained or less constrained model. A non-significant chi-square difference indicated that constraining the parameters of the nested model did not significantly worsen the fit of the model, which indicated MI of the parameters constrained to be equal in the nested model. We did not estimate the next restrictive model if the result was significant, as it suggested that the next level of parameter restriction would have significant differences with the previous model. We used [57] and Stata 12 [58] to conduct data analysis. Additional details of steps for invariance testing can be found in Appendix 2.

Results
Study participants were 926 female clients who were 18-45 years of age and receiving HIV care from a home-based care organization in each sub-city. Their overall mean age was 33 years (across the four groups: 33.06-33.74; median: 34 years). Participants in the two intervention groups had a significantly higher mean age (34 years) compared to the control group (32 years) at T1 (F(3, 922) = 4.67, p < 0.01). A significantly higher proportion of participants were married across all groups (χ² (df = 3) = 8.70, p = 0.03), with only one third of the participants living with their sexual partner across groups. There were significant differences in categories of levels of education across study groups (χ² (df = 21) = 60.66, p < 0.01); nearly one quarter of all participants had no formal education, and only 15% had post-primary education. There were significant differences in mean weekly income across groups (χ² (df = 3) = 143.24, p < 0.01), and among the 80% who reported their weekly income, the average income was US$4 in 2011 dollars (range: US$0-72) (Table 1).

Correlations between adherence measurement items of the SMAQ
Participant responses to the six SMAQ items/questions on adherence are presented in Table 2. Chi-square tests showed no differences in response categories of the SMAQ across the four groups (χ² (df = 3) = 0.58-6.64, p > 0.05). Correlations between the six items in the initial assessment ranged from −0.09 to 0.95 (see Table 3). Question five, "did you not take any of your medicine over the last weekend", was not significantly correlated (correlation coefficient = −0.09) with question three, "sometimes if you feel worse, do you stop taking your medicines?", in the T1 intervention group. This was not expected, as all indicators of a construct are expected to have significant positive correlations with each other. Due to the negative correlation coefficient and its nonsignificance, we considered removing question five from our analysis, but sensitivity analysis with and without this item showed no differences in results for measurement invariance tests. Therefore, we included it in order to present results for the full SMAQ scale as it was originally designed and validated in Spain. Further, the demographic variables had very weak correlations with SMAQ items (0.05 < r < 0.105), and thus we did not include demographic variables in the multiple-group confirmatory factor analysis.

Internal consistency reliability and concurrent and factorial validity
The six items of the SMAQ exhibited adequate or moderately strong internal consistency reliability in measuring the latent construct of adherence at T1, T2 and in the full sample. The Cronbach's α was 0.66 and 0.68 at T1 for control and intervention groups respectively; 0.75 at T2 for both groups and 0.72 for the full sample. Concurrent validity of the six-item SMAQ exhibited moderate to excellent positive correlations for the full sample between T1 and T2. Pearson correlation coefficients of items 1, 5 and 6 were excellent for T1 compared to T2 (item 1 = 0.78, item 5 = 0.76, item 6 = 0.80); similarly, the correlation for item 3 was good (item 3 = 0.52) and those for item 2 and item 4 were moderate (item 2 = 0.49, item 4 = 0.48) (see Table 4). The good model fit indices and significant factor loadings across the four groups indicate factorial validity, and that the scale performed equally well when T1 control, T1 intervention, T2 control and T2 intervention were all compared using multiple-group confirmatory factor analysis (see Tables 5 and 6). Factor loadings in the final strong factorial model were all statistically significant (p < 0.05) and ranged from 0.26 to 1.18 (see Table 6). Positive and significant factor loadings suggest that the construct of adherence significantly and positively influenced all the measures generated by the six items of SMAQ.

Measurement invariance test
A chi-square difference test between configural and weak factorial models was significant (chi-square difference = 34.79 (DF = 15) p < 0.01). Other model fit indices were comparable between the two models, which suggested that the weak factorial model had a better fit for the data. The next chi-square test between the weak and strong factorial models found no significant difference (chi-square difference = 13.36 (DF = 15) p = 0.57). In addition, the RMSEA statistic reduced by 0.01 to 0.06 and other model fit indices were comparable with those of the weak factorial invariance model. Therefore, the strong factorial invariance model had the best fit for the data and was accepted as the final model (see Tables 5 and 6). The final estimated measurement models for the strong factorial invariance are presented in Figs. 1, 2, 3, 4. Factor loading estimates for the models are shown in Table 6.
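The p-values attached to the two difference statistics above can be checked with a short calculation. The sketch below is not the authors' procedure: it simply refers each reported difference value to a chi-square distribution with the reported degrees of freedom, using a standard-library series expansion of the regularized incomplete gamma function. The function name `chi2_sf` is hypothetical. Under WLSMV, the difference statistic itself is usually produced by the estimation software rather than by subtracting two model chi-squares, so only the final look-up step is illustrated.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom.

    Computed as 1 - P(a, z) with a = df / 2 and z = x / 2, where P is the
    regularized lower incomplete gamma function evaluated by its power series:
    P(a, z) = z^a * exp(-z) / Gamma(a) * sum_{n>=0} z^n / (a * (a+1) * ... * (a+n)).
    """
    if x <= 0:
        return 1.0
    a = df / 2.0
    z = x / 2.0
    term = 1.0 / a          # n = 0 term of the series
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)  # ratio of consecutive series terms
        total += term
        n += 1
    log_prefix = a * math.log(z) - z - math.lgamma(a)
    lower = math.exp(log_prefix) * total
    return max(0.0, 1.0 - lower)


# Difference statistics and degrees of freedom as reported in the text.
p_configural_vs_weak = chi2_sf(34.79, 15)
p_weak_vs_strong = chi2_sf(13.36, 15)
```

Evaluating the two calls should give values in line with the reported p < 0.01 for the configural versus weak comparison and p = 0.57 for the weak versus strong comparison.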

Discussion
The purpose of this study was to assess the psychometric and related measurement properties of the six-item SMAQ using data from a quasi-experimental parent study of HIV-positive women of reproductive age in Ethiopia. Our findings indicate that the six-item SMAQ demonstrated adequate internal consistency reliability, suggesting that the six items in the questionnaire reflect the same latent construct of adherence to antiretroviral therapy. In addition, concurrent validity of the scale was moderate to excellent based on correlations between the item responses at T1 compared with T2. Further, model fit indices and significant factor loadings demonstrated factorial validity, which suggests construct validity as well.
In addition, we documented strong factorial invariance across the four independent study groups, suggesting  that the SMAQ questions/items were being interpreted in an equivalent manner across groups. This finding suggests that the SMAQ performs equally well across samples and operationalizes group-specific differences in an invariant manner across groups. An important implication of this finding is that adherence scores obtained using SMAQ from the four study groups can be compared pre-and post-intervention for policy or intervention purposes [37]. Taken together, these findings affirm that the six-item SMAQ is a valid measure of adherence to ART in this sample of women with HIV/AIDS in Ethiopia. Our findings add confidence for researchers and interventionists interested in using the SMAQ to assess adherence to ART in this setting.
We found one negative but nonsignificant correlation between the six indicators of the SMAQ suggesting that a five-item version might be more efficient [59]. However, our findings showed no differences in measurement invariance tests when question five was included or excluded. It is possible that the lack of correlation was caused by a data entry error, but we were unable to verify this possibility. More likely, it was due to the magnitude of question five's correlation with question three being too small to impact the results. In addition, question five was strongly and positively correlated with other items of the scale, and all its factor loadings were positive and significant. Thus, we maintained the integrity of the original sixitem SMAQ scale in our final analyses. In comparison with the validation study in Spain, the mean age of patients in the present study was slightly lower (33 years versus 36 years). All participants in our study were female compared to 28% in the Spanish study. The Cronbach's α in the present study was lower for T1, but comparable for T2 and for the full study sample (α = 0.75) [21]. Estimating concurrent and factorial validity was a quick way to validate our SMAQ data, although predictive validity would be a more powerful criterion for future studies predicting SMAQ scores in relation to ART interventions where the counts of HIV ribonucleic acid (RNA) or CD4 T lymphocytes (CD4 cells), for example, are available.
The SMAQ has several advantages for field studies-it is short and easier to administer, which makes identification of non-adherent patients and intervening quicker at crowded health service facilities associated with severe personnel shortages and long waiting times [60]. Conversely, collection of HIV RNA or CD4 counts requires much more time, financial and workforce recourses which were limited in the study setting. Other studies have used data from two cross sections to assess validity of standardized scales [52]. Our study complemented the need for assessment of validity by testing for measurement invariance of the SMAQ and found it to be invariant across groups and time, suggesting that the six items are relevant for measurement of the latent factor of adherence to ART.
The Morisky Scale and variations of the adult AIDS Clinical Trials Group (ACTG) are also used to assess self-reports of adherence [15,61,62]. The SMAQ is a modified version of the original four-item Morisky Scale, which has since been modified and validated as an eight-item scale [61,63]. However, the Morisky Scale has been validated and is more commonly used in hypertension patients and general purpose adherence studies and interventions, compared to the SMAQ and ACTG which have been validated with persons living with HIV [62,64]. Compared to the SMAQ, the ACTG scale is longer and would require much more time and higher costs to administer and collate participant responses into actionable insights that can guide quick adherence improvement interventions. Self-reports of adherence seek various types of information from respondents, including medication-taking behavior and barriers and beliefs associated with adherence [64]. The choice of a self-report measure depends on the goal of the study or intervention. The SMAQ seeks information about medication-taking behavior and barriers to adherence [64]. The parent study assessed level of adherence with the goal of improving adherence by reducing ART access barriers by increasing the number of access points or service providers. In this way, the SMAQ was more appropriate to the goals of the parent study than the ACTG. Researchers and interventionists with similar needs and goals may consider using the SMAQ in their studies.
Initiation, retention and adherence to ART positively influence quality of life among persons living with HIV [65–69], are required for viral suppression, and are critical to the achievement of the UNAIDS 90-90-90 goals. However, initiation and retention on ART are only meaningful to the extent to which users of ART can adhere to the regimen [70]. Also, recent studies have shown that adherence to ART can be a successful HIV prevention strategy [8,71,72]. Improving adherence may be challenging or impossible without our ability to measure it reliably, validly and consistently across groups of individuals, which makes efforts to improve measurement methods and tools an important contribution for public health. A study of strategies to improve adherence to ART in low-resource settings reported that adherence measurement was required for optimal targeting and tailoring of interventions [73]. The present study moves the field forward by presenting reliability, validity and invariance test statistics for SMAQ from a sub-Saharan Africa setting where such HIV research is scant, yet the potential need for such measurement is greatest because sub-Saharan Africa bears the greatest HIV/AIDS burden. According to the WHO, nearly one in every 25 adults is living with HIV in Africa, accounting for nearly two-thirds of the global total [74]. Providing evidentiary measurement properties for SMAQ increases practitioners' confidence in using SMAQ, which increases its adoption in assessment of adherence.
Although we found strong invariance for the six items of the SMAQ, it is worthwhile to note that adherence is a dynamic behavior which may change over time, even without intervention. Thus, invariance can be expected for SMAQ items that assess intentional non-adherence across time, such as question three of the SMAQ: "Sometimes if you feel worse, do you stop taking your medicines?", because such items are embedded in a patient's beliefs and self-construct and are therefore more robust to behavior change. Conversely, the SMAQ also has a component of unintentional non-adherence due to forgetfulness, assessed by questions one and two: "Do you forget to take your medicine?" and "Are you careless at times about taking your medicine?" The unintentional non-adherence component may be prone to random variability, which may not be captured by invariance testing of the six items of the SMAQ together, but by testing invariance for each item using longitudinal data. Thus, attribution of changes in adherence to specific components of the SMAQ as intentional or unintentional was not possible in the present study, because of the independent cross-section design. This is an important area for future studies in which researchers may be able to identify modifiable items of non-adherence measured by the SMAQ so as to appropriately intervene to improve adherence, as was demonstrated by Mora et al. (2011) in their assessment of non-adherence among asthma patients using the Medication Adherence Report Scale (MARS-A10) [75].
Several limitations of our study should be noted. Although we treated the samples as independent, they may not be truly independent because some participants may have participated at both T1 and T2 interviews. This limitation may manifest in repeated questions where social desirability bias is also a limitation. However, the cross-sectional design of the parent study mitigated this tendency. In addition, statistical tests showed group differences in demographic characteristics. The design limited the use of multilevel multigroup CFA, as suggested by Kim and colleagues [76]. Ethical considerations and operational logistics were also considered in the design. The taxonomy for describing adherence to medications now suggests that results from baseline and follow-up can only be compared if the patient was already on treatment at least 3 months prior to baseline [77]. However, the taxonomy was not in place at the time of data collection. Challenges associated with diagnosis and treatment initiation records in the study settings would also limit application of the taxonomy. Further, the lack of data on clinical methods of measuring adherence, such as an HIV RNA test (a test that checks for RNA genetic material from the virus in a sample of blood) [78,79] or CD4 counts (the number of CD4 T lymphocytes, a type of white blood cell, in a sample of blood, which is used to monitor an individual's response to ART) [80], limited our ability to assess the predictive validity of the SMAQ with these data. This is an important agenda for future research.

Conclusions
This is the first study to assess reliability, validity and measurement invariance of SMAQ in the sub-Saharan Africa region, using pre- and post-intervention data from two treatment referral networks. The findings show that the SMAQ is sufficiently reliable and valid to be used for HIV-positive Ethiopian women of reproductive age who are on ART. In addition, the findings demonstrate that comparisons across groups are possible in the study sample and in the future, and are unlikely to be affected by differences in response styles, interpretations of indicators, time lapse and socio-economic factors [81]. Further research is warranted to determine whether the measurement properties of SMAQ reported here would hold among participants from other countries, for males and females, of different age groups, from various regions, and of various socio-economic statuses.

Details of the methodology of parent study by MEASURE Evaluation researchers
Data for the present study are from a quasi-experimental referral network study conducted by MEASURE Evaluation (https://www.measureevaluation.org). Kirkos was the site of an intervention aimed at improving ties among providers. Providers were identified using snowball sampling, beginning with well-known service providers as the seeds. All service providers that offered, and could refer clients to one another for, HIV care and treatment support and family planning services to HIV-positive women of ages 18-49 were included. Saturation was reached when nominated organizational representatives, who were also the interviewees for the study, named service providers that had already been named by others. Ultimately, 25 providers in Kirkos and 26 in Kolfe-Keranyo were included in the study. The data included provider characteristics and linkages among providers. To obtain referral network data, T1 (baseline) and T2 (follow-up) interviews were conducted with nominated key informants in each provider organization about HIV services and the nature and types of referrals offered, provider characteristics, collaborations, joint programs, and linkages with other providers.

The intervention
A referral network strengthening intervention was implemented in Kirkos where 21 of the 25 providers were represented in at least one of the meetings. Kirkos was selected for the intervention after T1 results showed lower network density there compared with Kolfe-Keranyo. The intervention consisted of a series of three 2-day educational meetings held 2 months apart at different times after T1 data collection. During the meetings, participants learned about strategies for client referral, collaboration, joint programming, and partnerships. They also learned about services offered by other facilities in the network. Service directories, listing contact information and services offered, were developed and shared, with each participating provider receiving at least one directory. No intervention was implemented in Kolfe-Keranyo, the control network.

Characteristics of service providers
Service providers in the two networks were owned and operated by the Ethiopian government or various types of nongovernmental bodies. The government owned and operated 10 out of the 51 service providers, whereas nongovernmental organizations (NGOs), faith-based providers (FBOs) or private individuals owned and operated the remainder. Kirkos had significantly fewer (three) government owned and operated providers compared with Kolfe-Keranyo, which had seven. There was a significant increase in the number of providers self-identifying as NGOs in Kirkos from five at T1 to 14 at T2. Conversely, Kolfe-Keranyo experienced a reduction in the number of NGOs from eight at T1 to five at T2.

Appendix 2
Invariance testing
An invariant measurement model has equal factor loadings, intercepts, factor variances and residual variances across groups. We applied MI testing methods described by Bollen (1989) and Widaman & Reise (1997) [82,83]. The most basic form of invariance is configural or pattern invariance, followed by weak, strong and strict factorial invariance, in that order. Measurement invariance was assessed by testing the invariance of measurement parameters, including factor loadings, intercepts and factor variances, across the four groups [82,83]. We tested for MI by consecutively imposing additional constraints on successive levels of invariance. For the configural invariance test, we conducted independent group-specific CFAs with no modifications to assess overall fit of the model within each study group. Next, we equated factor loadings and fixed mean adherence to zero to test for weak factorial invariance. A factor loading is the strength of the linear relation between a latent factor and each of the six items of the scale [43,82]. Finally, we tested for strong invariance by fixing factor means for adherence in one group and constraining factor loadings, intercepts and factor variances to be equal, whereas residual variances were freely estimated. Strong invariance justifies across-time and between-group comparisons [42,43,82]. We did not test for strict invariance because residual variances would vary from group to group even if all the groups were from one population.
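The nested constraints described in this appendix can be summarized in symbols. The notation below is a sketch rather than the authors' own: the symbols (τ, λ, ξ, ε, κ, φ, θ) are assumed for illustration, the linear form treats the items as continuous (the text also mentions thresholds, which a categorical-indicator model would use instead), the fixed reference-group mean is assumed to be zero, and identification constraints such as fixing one loading are omitted.

```latex
% One-factor model for item j (j = 1,...,6) in group g (g = 1,...,4)
x_{jg} = \tau_{jg} + \lambda_{jg}\,\xi_{g} + \varepsilon_{jg},
\qquad E[\xi_{g}] = \kappa_{g},\quad
\operatorname{Var}(\xi_{g}) = \phi_{g},\quad
\operatorname{Var}(\varepsilon_{jg}) = \theta_{jg}

% Model-implied item means and covariance matrix in group g
E[x_{jg}] = \tau_{jg} + \lambda_{jg}\,\kappa_{g},
\qquad
\Sigma_{g} = \phi_{g}\,\lambda_{g}\lambda_{g}^{\top} + \Theta_{g}

% Nested constraints, as described in this appendix
\begin{aligned}
\text{Configural:}\quad & \kappa_{g} = 0;\ \lambda_{jg},\ \tau_{jg},\ \phi_{g},\ \theta_{jg}\ \text{free} \\
\text{Weak (metric):}\quad & \lambda_{jg} = \lambda_{j};\ \kappa_{g} = 0 \\
\text{Strong (scalar):}\quad & \lambda_{jg} = \lambda_{j},\ \tau_{jg} = \tau_{j},\ \phi_{g} = \phi;\
  \kappa_{1} = 0,\ \kappa_{g}\ \text{free for } g > 1 \\
\text{Strict:}\quad & \text{strong constraints plus } \theta_{jg} = \theta_{j}
\end{aligned}
```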

1) Configural invariance
Under configural invariance, we were interested in testing whether the same indicators measured the construct of interest across multiple groups [42]. The model structure requires that the same item must be an indicator of the latent factor in each group, but the magnitude of the factor loadings, intercepts, variances and covariances may differ across groups [43]. This is the baseline model because we assumed the same pattern of fixed and free parameters across groups, but no equality constraints were imposed. Therefore, we can compare it with more restrictive models. To estimate this model, the means of adherence were fixed to zero in all groups, but factor loadings, intercepts/thresholds, factor and residual variances were free to vary across groups.

2) Metric, weak factorial or factor loading invariance
This level of invariance requires that in addition to the constructs being measured by the same indicators as in configural invariance, factor loadings of those indicators must be equivalent across multiple groups [42]. A factor loading is the strength of the linear relation between a latent factor and each of the associated indicators [43,82]. Under this type of invariance, factor loadings are constrained to be equal across groups, but no other equality constraints are imposed. Differences in this model imply differences across groups in how the latent variable affects its indicators [43]. Factor means cannot be compared across groups because of differences in the origin of the scale [43]. To estimate this model, means of adherence were fixed to zero in all groups, factor loadings were constrained to be equal, and the other parameters were freely estimated across groups.

3) Scalar, strong factorial or intercept invariance
This level of invariance justifies across-time and between-group comparisons [42]. It builds upon weak factorial, factor loading or metric invariance by requiring that the indicator intercepts also be equivalent across multiple groups [42]. An intercept is the origin or starting value of the adherence scale that the factor is based upon. To assess strong factorial invariance, factor loadings and intercepts were constrained to be equal across groups. This condition is also called "scalar invariance." This level of invariance is required to make valid comparisons of means of latent factors across groups [43]. The model implies that any differences in means of the indicators are attributable to a difference in means of the latent variable. Thus, differences in covariances among indicators and in means of indicators are attributable to group differences in covariances and means on latent variables [82]. Under this invariance model, factor means for adherence were fixed in one group and free in the others; factor loadings, intercepts/thresholds and factor variances were constrained to be equal, whereas residual variances were freely estimated.
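The claim that, under strong invariance, differences in indicator means are attributable to the latent mean can be checked numerically. The following is a minimal sketch, assuming a linear one-factor model with continuous indicators; all parameter values, the function name `implied_moments` and the variable names are hypothetical and are not estimates from this study.

```python
import numpy as np


def implied_moments(loadings, intercepts, factor_mean, factor_var, residual_vars):
    """Model-implied item means and covariance matrix for a one-factor model."""
    lam = np.asarray(loadings, dtype=float)
    tau = np.asarray(intercepts, dtype=float)
    theta = np.diag(np.asarray(residual_vars, dtype=float))
    mean = tau + lam * factor_mean                   # item means
    cov = factor_var * np.outer(lam, lam) + theta    # item covariance matrix
    return mean, cov


# Hypothetical parameters shared by both groups under strong invariance:
# equal loadings, equal intercepts and equal factor variance.
LOADINGS = [0.6, 0.5, 0.7, 0.4, 0.5, 0.6]
INTERCEPTS = [0.2, 0.3, 0.1, 0.4, 0.3, 0.2]
FACTOR_VAR = 1.0
LATENT_MEAN_GAP = 0.3  # latent mean of the comparison group; reference fixed at 0

# Residual variances are left free, so they differ between the two groups.
mean_ref, cov_ref = implied_moments(LOADINGS, INTERCEPTS, 0.0, FACTOR_VAR, [0.5] * 6)
mean_cmp, cov_cmp = implied_moments(
    LOADINGS, INTERCEPTS, LATENT_MEAN_GAP, FACTOR_VAR, [0.6] * 6
)

# Each item's mean difference equals its loading times the latent mean gap.
mean_gap = mean_cmp - mean_ref
print(mean_gap)
```

In this sketch the two groups' covariance matrices differ only on the diagonal, which reflects the freely estimated residual variances; the shared loadings and factor variance leave the off-diagonal covariances identical.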

4) Strict factorial or residual invariance
This is the final criterion of invariance testing [42,43]. There are two levels of strict factorial or residual invariance. The first level is invariance of factor variances, which represents overall error in prediction of the latent construct. The second level is invariance in error term estimates of individual indicator variables. Strict invariance testing assesses whether residual errors are equivalent across multiple groups. A residual is defined as the uniqueness or measurement error associated with each measured indicator. This type of invariance extends strong factorial invariance such that residual variances are also constrained to be equal, in addition to factor loadings, intercepts and factor variances. Under this type of invariance, group differences are solely due to group differences in latent variables. This type of invariance likely does not hold in practice, because residual variances would vary from group to group even if all the groups were from one population. Therefore, we did not estimate models for this type of invariance.