External validity in healthy public policy: application of the RE-AIM tool to the field of housing improvement

Background: Researchers and publishers have called for improved reporting of external validity items and for testing of existing tools designed to assess reporting of items relevant to external validity. Few tools are available and most of this work has been done within the field of health promotion.
Methods: We tested a tool, developed by Green & Glasgow, for assessing reporting of external validity items on 39 studies assessing the health impacts of housing improvement. The tool was adapted to the topic area and criteria were developed to define the level of reporting, e.g. “some extent”. Each study was assessed by two reviewers.
Results: The tool was applicable to the studies but some items required considerable editing to facilitate agreement between the two reviewers. Levels of reporting of the 17 external validity items were low (mean 6). The most commonly reported items related to outcomes. Details of the intervention were poorly reported. Study characteristics were not associated with variation in reporting.
Conclusions: The Green & Glasgow tool was useful to assess reporting of external validity items but required tailoring to the topic area. In some public health evaluations the hypothesised impact is dependent on the intervention effecting change, e.g. improving socio-economic conditions. In such studies data confirming the function of the intervention may be as important as details of the components and implementation of the intervention.


Background
Improving the use of research findings in policy and practice requires, among other things, clear reporting of external validity items [1][2][3]. External validity, sometimes referred to as generalisability, means the extent to which causal inferences reported in one study can be applied to different populations, settings, treatments and outcomes [4,5]. For interventions this requires clear reporting of population characteristics (including setting and reach of the intervention), details of the intervention (including implementation, and adaptation to local settings), outcomes, and sustainability of the intervention and impacts [6,7]. Improved reporting of these items can assist readers in judging the applicability and relevance of study findings to their own situation. It has been argued that improved reporting of external validity items may improve the usefulness and the appropriate use of research findings, as well as potentially contributing to improved quality of available evidence [2,3].
Despite an emerging acknowledgement by publishers of the importance of external validity, there is little clear guidance on what should be reported to facilitate judgements about external validity of study findings [6]. Within the health field much effort has been devoted to the development of tools to assess internal validity, but far less to external validity. One tool which articulates the required external validity items has been developed by members of the RE-AIM team [8]. This tool is informed by Cronbach et al.'s work on generalisability theory and the related UTOS elements [9]: Units (e.g. individual patients, moderator variables, sub-populations), Treatments (variations in treatment delivery or modality), Occasions (e.g. patterns of maintenance or relapse over time in response to treatments), and Settings (e.g. medical clinics, worksites, schools in which the intervention is being implemented and evaluated). The RE-AIM checklist was developed for public health interventions, specifically within the field of health promotion. While acknowledging the limitations of developing a standard tool, the authors of this checklist have called for piloting to test and refine available tools [6]. The tool has been applied to studies of health promotion interventions [10,11] but, to our knowledge, has not been tested in the field of healthy public policy, that is, non-health sector interventions in areas such as education, welfare, housing and transport.
We used a recent systematic review of the health impacts of housing improvements [12] to investigate levels of reporting of external validity items using the RE-AIM tool [8]. This brief report presents the level of reporting and reflects on how reporting of external validity might need to be adapted for use in the broader field of healthy public policy.

Methods
The Green & Glasgow tool was tailored to meet the characteristics of the studies being assessed. This required rewording some of the original questions to improve clarity, and adapting some details of the original questions to be more appropriate to a group of housing studies. A set of criteria to assess the extent to which each external validity item had been reported was developed by both authors: "large extent", "some extent", "unclear", "not at all" or "not applicable (N/A)". A summary of the items assessed is provided in Table 1. The full version of the tool with the criteria for assessment is available as Additional file 1.
Thirty-nine intervention studies which had assessed the health impacts of housing improvement and were included in an earlier systematic review [12] were assessed independently by two reviewers for the extent of reporting of the external validity items detailed in the tool, using the criteria described above. Disagreements on the assessment between the two reviewers were resolved by discussion and, where disagreements persisted, the questions or assessment criteria (large extent, some extent etc.) were further clarified. Only the studies and related papers included in the published review were included in this assessment; this comprised the key paper for each study (n = 39) plus a further 29 publications linked with the included studies. Authors were not contacted to obtain further information on external validity items.
Data were extracted and entered into a Microsoft Access database. A summary score for the level of reporting was calculated for each domain (Reach; Implementation; Outcomes; Maintenance). The codes for each item were converted into a numeric value ("large extent" or "some extent" = 1; "unclear", "not at all" or "N/A" = 0). A subtotal score for each domain and a total score were calculated for each study. The score indicating the level of reporting for each of the external validity items, and summary scores for the domains, were tabulated along with key study characteristics identified in the original systematic review. These characteristics included intervention type, context, study design, and overall assessment of internal validity (for details of the internal validity assessment and definitions of the study intervention categories see the full systematic review [12]).
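The scoring rule described above can be illustrated with a short Python sketch. This is not the authors' database procedure: the function name, the item-to-domain mapping and the example ratings are all hypothetical and serve only to show the arithmetic. The actual allocation of the 17 items to domains is given in Table 1.

```python
from collections import defaultdict

# Codes counted as "reported" (score 1); every other code scores 0.
REPORTED = {"large extent", "some extent"}

# Hypothetical item-to-domain mapping, for illustration only.
# The real allocation of items to domains is in Table 1.
ITEM_DOMAIN = {
    "1": "Reach", "2": "Reach",
    "5": "Implementation",
    "9": "Outcomes",
    "15": "Maintenance", "16a": "Maintenance",
}


def score_study(codes):
    """Return (domain subtotals, total score) for one study's item codes."""
    subtotals = defaultdict(int)
    for item, code in codes.items():
        subtotals[ITEM_DOMAIN[item]] += 1 if code in REPORTED else 0
    return dict(subtotals), sum(subtotals.values())


# Invented ratings for a single study.
example = {
    "1": "not at all", "2": "some extent",
    "5": "unclear",
    "9": "large extent",
    "15": "N/A", "16a": "some extent",
}
subtotals, total = score_study(example)
```

Under this rule "unclear", "not at all" and "N/A" all score 0, so the summary scores show how many items were reported but do not distinguish between the reasons an item was not reported.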

Application of Green & Glasgow tool
There was considerable disagreement between the two reviewers, and substantial iteration was required to clarify the meaning and purpose of some of the external validity items (Table 1 & Additional file 1). Three items were particularly difficult to clarify (2, 3 & 11) and were reworked to relate to descriptions of the study population or setting and eligibility for the intervention. Eight items were rephrased for clarification and/or inclusion of terms or issues relevant to the field. Item 16 was split into two. Five items (5, 6, 13, 14 & 15) remained unchanged from the original tool developed by Green & Glasgow [8]. Following these edits there was improved agreement between the reviewers, but some disagreements persisted and were resolved by discussion. The two items with greatest disagreement were 2 & 3, where half or more of the assessments differed (50% and 68.8% respectively). Levels of agreement were highest for items 5, 6, 15 and 16.
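As an illustration of how a per-item disagreement figure of this kind can be derived, the sketch below computes the percentage of assessments on which two reviewers assigned different codes. The paper does not describe the exact calculation used, so this is an assumption; the function name and the ratings are invented for the example.

```python
def percent_disagreement(ratings_a, ratings_b):
    """Percentage of studies on which two reviewers gave different codes."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    differing = sum(a != b for a, b in zip(ratings_a, ratings_b))
    return 100 * differing / len(ratings_a)


# Invented ratings for one item across four studies.
reviewer_1 = ["some extent", "unclear", "not at all", "large extent"]
reviewer_2 = ["some extent", "not at all", "not at all", "some extent"]
disagreement = percent_disagreement(reviewer_1, reviewer_2)
```

This simple proportion treats any difference in code as a disagreement and makes no correction for chance agreement.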

Reporting of external validity in housing improvement studies
Reporting of external validity items was low across the studies (Table 2 & Additional file 2); overall 35.3% of items were reported (mean 6, range 2-9, median 6). Within each external validity domain (Reach, Implementation, Outcomes & Maintenance) few studies reported more than half the items either "to some extent" or "to a large extent". The "outcomes" domain had the highest level of reporting among the studies (mean items reported 49.8%); the "intervention" domain was the most poorly reported (mean items reported 29.0%). No item was universally reported. Items 9 & 16a were most commonly reported. Three items were not reported in any study: items 1, 14 and 15 (Table 2). There was little variation in the number of reported items between intervention type, location or date of study. Studies with the highest internal validity grade (A) reported slightly more external validity items than grade B or C studies (mean number of external validity items reported by Internal Validity Grade A/B/C: 6.7/5.36/5.4).
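The relationship between the mean count and the overall percentage can be checked with a small calculation. The sketch assumes that the percentage is the mean number of items reported divided by the 17 items in the adapted tool; the paper does not state the denominator explicitly, and the names used are illustrative.

```python
N_ITEMS = 17  # items in the adapted tool (item 16 split into 16a and 16b)


def share_reported(items_reported, n_items=N_ITEMS):
    """Items reported as a percentage of all items, to one decimal place."""
    return round(100 * items_reported / n_items, 1)


mean_share = share_reported(6)  # mean of 6 items reported
low_share = share_reported(2)   # lowest-scoring study in the range 2-9
high_share = share_reported(9)  # highest-scoring study in the range 2-9
```

On this assumption, a mean of 6 items out of 17 corresponds to the overall figure of 35.3%.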

Discussion
Following adaptation and development of detailed assessment criteria relevant to the studies being assessed, the external validity tool was successfully applied to studies of housing improvement drawing on the primary paper and associated papers available at the time of the original review [12]. Reporting of external validity items was low overall (median 35.6%) and across individual domains in the tool. This is comparable to the level of reporting in a group of studies of childhood obesity prevention (median 34.5%) [10].
The studies we assessed represented a broad range of study designs, contexts, interventions, and other aspects of study quality, and included both published and unpublished studies. There was no suggestion of a link between study characteristics and reporting of external validity. The apparent link between internal and external validity reporting may be explained by the overlap in assessed items, specifically attrition and sample [...]

Table 1 (excerpt). Item 15: Are data reported on the sustainability (or reinvention or evolution) of programme implementation and intervention, at least 12 months after the formal evaluation?

[...] [10]. While faithful replication of a novel intervention may depend on detailed reporting of intervention components and implementation [13], this may be less important for a well-established intervention, such as housing improvements. Moreover, for complex social interventions, such as housing improvements, data confirming intervention function may be of more value than the details of intervention form [14]. Data on changes effected by the intervention, such as improved warmth, may be used to refine generalisable theories regarding tackling socio-economic determinants of health even where the specific intervention may not be widely generalisable. Where there is evidence to support the theory that changes in an intermediate outcome can lead to changes in health, selection and implementation of appropriate and effective interventions to improve a named socio-economic outcome, such as warmth, may be made locally. This issue is of particular relevance to interventions where the hypothesised health impacts are dependent on the intervention affecting an intermediate variable, for example healthy public policy interventions tackling socio-economic determinants of health.
There is little doubt that reporting of external validity items needs improving. However, in agreement with Green et al., development of a standard tool may not be appropriate [6]. In our study there was poor agreement between the two assessors in the interpretation of the tool. In response it was necessary to amend and clarify the meaning of items to allow appropriate application of the tool to this group of studies. Specifically, aspects of maintenance and reach require tailoring to the intervention, and reporting of differential effects using sub-group analysis will inevitably be limited where studies are typically small. Where a well-established intervention like housing improvement is being evaluated, items assessing reporting of population details (namely items 2 & 3) may require editing to clarify whether they relate to reporting of the target population and context for the study or for the wider intervention. We chose to focus on the study population. In addition, criteria to indicate the extent of reporting were developed to reflect issues pertinent to our particular group of studies. The use of graded criteria improves the sensitivity and interpretation of the tool beyond the previous version, which is restricted to a binary assessment [10,11].