Assessment of fidelity in individual level behaviour change interventions promoting physical activity among adults: a systematic review

Background: Behaviour change interventions that promote physical activity have major implications for health and well-being. Measuring intervention fidelity is crucial in determining the extent to which an intervention is delivered as intended, therefore increasing scientific confidence about effectiveness. However, we lack a clear overview of how well intervention fidelity is typically assessed in physical activity trials.
Methods: A systematic literature search was conducted to identify peer-reviewed physical activity promotion trials that explicitly measured intervention fidelity. Methods used to assess intervention fidelity were categorised, narratively synthesised and critiqued using assessment criteria from the NIH Behaviour Change Consortium (BCC) Treatment Fidelity Framework (design, training, delivery, receipt and enactment).
Results: Twenty-eight articles reporting on twenty-one studies used a wide variety of approaches to measure intervention fidelity. Delivery was the most common domain of intervention fidelity measured. Approaches used to measure fidelity across all domains varied from researcher coding of observational data (using checklists or scales) to participant self-report measures. There was considerable heterogeneity of methodological approaches to data collection with respect to instruments used, attention to psychometric properties, rater selection, observational method and sampling strategies.
Conclusions: In the field of physical activity interventions, fidelity measurement is highly heterogeneous both conceptually and methodologically. Clearer articulation of the core domains of intervention fidelity, along with appropriate measurement approaches for each domain, is needed to improve the methodological quality of fidelity assessment in physical activity interventions. Recommendations are provided on how this situation can be improved.
Electronic supplementary material The online version of this article (10.1186/s12889-017-4778-6) contains supplementary material, which is available to authorized users.


Background
Lack of physical activity is a key risk factor for mortality, associated with approximately 5.3 million deaths per year worldwide [1]. Regular maintenance of physical activity can also have significant implications for physical and mental health, including reduced risk of depression [2], reduced risk of cardiovascular disease, and weight loss [3], and physical activity promotion is considered to be the "best buy" for public health [4]. In the UK, guidance from the Chief Medical Officer currently recommends at least 150 min of moderate to vigorous physical activity (MVPA) per week. Despite this, national levels of MVPA in the UK are low, with only 39% of men and 29% of women achieving this target [5].
Individual level (one to one and group-based) behavioural interventions are a key strategy for increasing physical activity; however, there is considerable variation in their reported effectiveness [6-8]. This may be because behavioural interventions for physical activity are often complex (with many interacting factors), and are therefore challenging to design and implement [9]. These interacting factors moderate and mediate study outcomes, and include theoretical mechanisms (e.g. motivation or confidence) and contextual factors (e.g. participant demographics) [10]. Another key moderator of study outcomes is intervention fidelity: the extent to which a behavioural intervention was designed, implemented and received as intended [11,12].
Inadequate attention to the assessment of intervention fidelity can increase the risk of type 1 and type 2 errors and result in spurious conclusions about intervention effectiveness [11]. As well as allowing more accurate judgements about effectiveness [13], assessing fidelity can also facilitate easier replication and implementation of behavioural interventions in real world settings [9]. The UK Medical Research Council (MRC) guidelines emphasise the importance of fidelity assessment when interpreting outcomes [14,15]. One framework designed specifically for individual level behaviour change interventions was developed by the NIH Behaviour Change Consortium (BCC). The BCC conceptualised fidelity across five core domains: Study Design, Provider Training, Intervention Delivery, Intervention Receipt and Enactment. Study Design is concerned with whether a study adequately tests its hypotheses in relation to its underlying theoretical and clinical processes. Provider Training involves standardizing training between providers and ensuring they are trained to clear criteria and monitored over time. Intervention Delivery involves assessing and monitoring differentiation (differences between the intervention and any comparison treatments), competency (skills set of provider), and adherence (delivery of intended components). Intervention Receipt refers to whether the intervention was understood and 'received' by participants, and Enactment refers to whether participants used intervention related skills in day to day settings [11,13,16]. The NIH BCC framework provides guidance for the assessment, enhancement and monitoring of fidelity; however, the focus of the present review is on assessment. Focussing on assessment is important to ensure that proposed strategies to enhance fidelity (e.g. those recommended by the NIH BCC) have indeed been successful (e.g. did the provision of a treatment manual result in adequate adherence to treatment components?), and it also facilitates accurate monitoring of fidelity over time.
If the assessment of intervention fidelity is important, then agreement on what constitutes fidelity in physical activity interventions is clearly needed. In addition, recommendations for good practice could help to reduce the risk of bias when assessing fidelity [17,18]. Despite this, reviews investigating fidelity assessment in health behaviour research [16,19], self-management [20,21], mental health [22], school based drug abuse prevention [23] and physical activity [24,25] have revealed that there is considerable variability in the conceptualisation and measurement of intervention fidelity. For example, a review of diabetes self-management interventions reported that intervention fidelity was assessed inconsistently, using a range of different concepts, including adherence to intervention content, duration, coverage and quality of programme delivery. There was also heterogeneity in measurement, with a variety of approaches such as participant self-report, researcher observation, and provider self-report [20].
A variety of ways to conceptualise [11,13,16,26] and measure [22,27] fidelity in behavioural interventions have been suggested. Previous studies have also reviewed the theoretical basis of physical activity counselling interventions and the competency level of the interventionists [24], and highlighted the importance of assessing fidelity in physical activity interventions based on motivational theories [25]. However, to the best of our knowledge, a review of whether and how fidelity has been assessed (using the NIH BCC framework) in physical activity interventions, along with an appraisal of the quality of these approaches and their association with outcomes, is lacking. An overview of this field would provide intervention developers with a foundation to improve fidelity assessment of their own interventions, and provide researchers and reviewers with a means to assess the extent to which reported intervention processes are a) delivered and b) responsible for study outcomes.
The current review has four key aims: first, to identify and summarise (using the NIH BCC framework) how behavioural interventions to promote physical activity in adults have conceptualised and measured fidelity; second, to summarise the reported results of fidelity assessments; third, to summarise any reported associations between fidelity and other intervention outcomes; and fourth, to critically appraise the methodological approaches identified.

Searches
A search of the databases PsycINFO, PsycARTICLES, MEDLINE, Embase, and Google Scholar was undertaken in March 2017 for all studies published in English up to that date. Searches were carried out on titles, abstracts and keywords using proximity and wildcard operators to maximise the range of potential studies. Search terms for intervention fidelity (Additional file 1) were informed by previous reviews [16,20,28] and consisted of synonyms for intervention fidelity (e.g. treatment integrity) combined with those for 'exercise' (e.g. physical activity). Additional searches were carried out by citation searching of included papers.

Study inclusion/exclusion criteria
Retrieved studies were included based on the following criteria: (1) mentioning fidelity (or a related term) in the title, abstract or keywords, either as a main focus of the study or as a nested study (e.g. as an analysis conducted within a trial or feasibility study); (2) individual level behavioural interventions [29] designed to increase any type of physical activity; (3) study focussed only on physical activity and no other behaviours (e.g. diet, smoking); (4) study involving adults aged 18 or over; (5) peer reviewed publications in English published up to March 2017 (no time frame imposed); (6) RCTs, observational studies, pre-post studies, case-controlled or other quasi-experimental studies. Comparison groups could include usual care, no intervention or other interventions, as the present study was only interested in the main intervention group. Studies were excluded if the intervention consisted of structured exercise alone or behavioural support plus structured physical activity (e.g. exercise classes) (Additional file 2).

Study selection
All titles and abstracts were screened by the lead author (JL) with 10% independently screened by another coauthor (RC). All full texts were also screened independently by JL and RC. Inter-rater reliability was calculated using the AC1 statistic [30] and disagreements were resolved by discussion and, if necessary, mediated by a third author (CG).
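To illustrate how a chance-corrected agreement statistic of this kind can be computed, the following is a minimal sketch of the AC1 coefficient (commonly attributed to Gwet) for two raters making nominal screening decisions. The function name `gwet_ac1` and the example screening decisions are hypothetical and are not taken from the review; the sketch assumes every record was rated by both raters.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b):
    """AC1 agreement coefficient for two raters and nominal categories.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and
    p_e = sum_k pi_k * (1 - pi_k) / (Q - 1), with pi_k the mean proportion of
    items placed in category k by the two raters and Q the number of categories.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    q = len(categories)
    # Observed proportion of items on which the raters agree.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    if q < 2:
        return 1.0  # only one category was ever used, so agreement is trivial
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement based on the average category prevalence.
    p_e = sum(
        pi * (1 - pi)
        for pi in ((counts_a[c] + counts_b[c]) / (2 * n) for c in categories)
    ) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Hypothetical screening of 100 records: both raters exclude 96, both include 2,
# and they disagree on 2 (one each way).
rater_1 = ["exclude"] * 96 + ["include"] * 2 + ["include", "exclude"]
rater_2 = ["exclude"] * 96 + ["include"] * 2 + ["exclude", "include"]
ac1 = gwet_ac1(rater_1, rater_2)
```

With highly skewed categories such as these (most records excluded), AC1 stays close to the raw agreement, which is one reason it is sometimes preferred for screening decisions.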

Data extraction and synthesis
Data extraction and synthesis were guided by previous recommendations on the conduct of narrative synthesis in systematic reviews. Narrative synthesis is an approach used in systematic reviews to textually summarise findings from the included studies [31]. Characteristics of the main intervention study (i.e. study design, population, intervention, and physical activity outcome) were extracted in addition to fidelity data. Borrelli (2011) provides the latest iteration of the NIH BCC treatment fidelity checklist [11,13,16] (now referred to as the 'treatment fidelity assessment and implementation plan'). Items pertaining only to fidelity assessment (within the domains of design, training, delivery, receipt and enactment) were used to organise and summarise the descriptions of author attempts at assessing fidelity, methods used to collect data and fidelity outcomes (Table 1). Not all items from the treatment fidelity assessment and implementation plan were used, as some referred to aspects not related to assessment (e.g. use of a treatment manual). Although important, these items relate to 'enhancing' as opposed to 'assessing' fidelity. NVivo (version 11) was used to organise the data.
As no specific criteria for appraising the quality of fidelity assessment exist, studies were critically appraised based on criteria suggested in previous studies to represent good practice when assessing fidelity [11,13,16,17,22]. The more general recommendations made in the NIH BCC papers were used by the study authors to create a checklist. This aimed to provide a sense of the 'quality' of the application of each method. The checklist items assessed the presence or absence of good practice methods for each study and confirmed how robust the fidelity measures were (Table 1). There were eleven criteria overall, with two for design, one for training, six for delivery, one for receipt, one for enactment and one applied to all domains. A previous checklist (i.e. the treatment fidelity assessment and implementation plan) has been developed to quantify the extent to which studies have assessed, monitored and enhanced fidelity according to the five domains of the BCC framework [11,13,16]. However, this checklist was not appropriate for appraising the quality of the assessment measures, as it did not include more specific items relating to the methodological quality of the fidelity measures themselves (e.g. method used to collect fidelity data). Data from all included studies were independently extracted and appraised by two authors and compared. Discrepancies were resolved by discussion with a third author.

Included studies
As highlighted in the PRISMA flow chart (Fig. 1), the search identified 11,464 records. Once duplicates were removed, 8262 records remained. After title and abstract screening, 47 full texts were examined further, with 28 articles describing 21 physical activity interventions included in the review. Inter-rater reliability for titles and abstracts was excellent (AC1 = 0.99) but poor for full texts (AC1 = 0.23). However, further discussion revealed a systematic difference in the way that one inclusion criterion was being applied (focussing on physical activity as an outcome rather than as a focus of the intervention). After clarification, the full text screening yielded perfect agreement (AC1 = 1).

Measurement of intervention Fidelity
Overall, 66 approaches to measuring fidelity were identified across the 21 studies, with 52 approaches measuring delivery fidelity, eight measuring enactment, four measuring receipt, two measuring training fidelity and no approaches assessing design fidelity. Table 2 provides an overview of the fidelity measures identified. It is important to note that many studies contained multiple concepts or measurement approaches.

Table 1 Fidelity assessment items (a) and quality criteria (b) by BCC domain

Design
  Quality criteria (b):
  - Prior to study implementation, investigators, and optimally a protocol review group or panel of experts, should review their protocols or treatment manuals to ensure that the active ingredients of the intervention are fully operationalized
  - The degree to which the measures reflect the hypothesized theoretical constructs and mechanisms of action should be assessed

Training providers
  Assessment items (a):
  - Assess provider skills acquisition
  - Assess and monitor provider skills maintenance
  Quality criteria (b):
  - Ensure providers are trained to a well-defined, a priori performance criterion. Provider role-plays with standardized patients should be evaluated for both adherence to treatment components and adherence to process (e.g., interactional style)

Delivery of treatment
  Assessment items (a):
  - Assess if provider adhered to intervention plan, or in the case of computer delivered intervention, method to assess participants' contact with information
  - Assess non-specific treatment effects
  - Assess whether or not the active ingredients were delivered
  - Assess whether or not proscribed components were delivered (e.g. components that were unnecessary or unhelpful)
  Quality criteria (b):
  - Adherence to treatment components and competence to deliver the treatment in the manner specified
  - Direct observation evaluated according to criteria developed a priori
  - Raters of the audiotapes or videotapes should be skilled in treatment delivery as well as in more subtle aspects of the intervention and the treatment manual
  - Raters of the audiotapes or videotapes should be independent of the study
  - Raters of the audiotapes or videotapes should be blind to treatment assignment, participant progress and outcomes, and provider identity
  - Interrater reliability of raters of the audiotapes or videotapes should be conducted

Receipt of treatment
  Assessment items (a):
  - Assess degree to which participants understood intervention
  - Assess participants' ability to perform the intervention skills
  Quality criteria (b):
  - Assessment of treatment receipt involves verifying the participants' understanding of the information provided in the treatment and verifying that they can use the skills and recommendations discussed. This could include written verification (pre-post-tests), using audio visuals (repeat information orally and visually), and behavioural strategies (role-plays skills with feedback)

Enactment of treatment skills
  Assessment items (a):
  - Assess participant performance of intervention skills in setting in which the intervention is applied
  Quality criteria (b):
  - Objective observation to determine if participants were using behaviour change techniques in relevant day to day settings

All
  Quality criteria (b):
  - Psychometric properties

a Only items relating to fidelity assessment were taken from the Treatment fidelity assessment and implementation plan
b Quality criteria informed by general recommendations made by [11,13,16,17,22]

Design and training
No studies reported assessing design fidelity (the extent to which the intervention content reflected the underlying theory or logic model) and only two studies reported assessing training fidelity (the level of provider competence to deliver the intended intervention content before delivery).
Training fidelity was assessed in one study by measuring provider competence using a 20-point checklist to assess whether or not providers adhered to the intervention protocol during practice sessions [46]. In the other study, it was measured using an eight-item self-report scale to assess perceived provider confidence to deliver the intended intervention content [42]. One study reported an increase in provider confidence (training fidelity) in delivery of the intervention [42]; the other study did not report the fidelity outcome, but stated that a minimum level of competence was required before delivering the intervention [46].

Delivery (human provider)
Twenty studies measured delivery by human providers. Approaches included using self-report by providers to measure the presence or absence [38,42,43,52], frequency [36], or delivery quality [43] of intervention components, or using observation by researchers to assess the presence or absence [33,48,51], frequency [34,37,38,48,53], or delivery quality [43] of intervention components. Two studies reported measuring the provider's satisfaction with his or her own intervention delivery [50,52] and two studies reported measuring the participants' satisfaction with intervention delivery [36,39]. Four studies reported measuring researcher observed rating of provider competence/spirit (i.e. the interpersonal skills of the provider) [34,37,48,53] and three studies reported using researcher assessment of the number of proscribed behaviours used (e.g. arguing/giving advice without permission) [34,37,48]. Seven studies assessed the treatment dose delivered (i.e. length of time or number of sessions) [33,34,36,43,48,52,53]. One study assessed provider adherence to an intervention script, although it was not clear how this was measured [49]. Data relating to delivery were obtained using provider [37,38,42,43,50,52] and participant self-report [32,36,39-41,44], as well as audio recordings [34,37,38,43,48,53], video recordings [51], and direct observations [33,49] by researchers. Provider interviews [52] were also used. Approaches to sampling varied, with some studies taking a sample of sessions from the trial population [34,37,38,48] and others sampling the whole trial population [32,35,36,38,50,52].
Table 2 Overview of the fidelity measures identified (excerpt)

Training providers (2)
  Assess provider skills acquisition (2)
    Provider confidence to deliver intervention (1): Provider self-report (1) [46]
    Provider competence to deliver intervention (1): Assessment of provider (1) [42]
  Assess and monitor provider skills maintenance (0)

Delivery of treatment (51)
  Assess if provider adhered to intervention plan, or in the case of computer delivered intervention, method to assess participants' contact with information (28)
    Number of email messages delivered (1): Researcher observation (1) [32]
    Number of email messages/intervention materials read/received (3): Participant self-report
    Percentage of intervention script adhered to (1): Researcher observation (1) [49]
  Assess non-specific treatment effects (6)
    Rating of provider spirit/competence (4): Audio observation (4) [34,37,48,53]
    Participant rating of provider support (2): Provider self-report (2) [36,39]
  Assess whether or not the active ingredients were delivered (15)
    Rating of intervention components delivered (2): Audio observation (1) [43]; Provider self-report (1) [43]
    Checklist of intervention components delivered (7): Researcher observation (2) [33,51]; Audio observation (1) [48]

All studies using observational methods opted to apply coding procedures to a selected sample of recordings (or transcripts), rather than using data from all possible intervention sessions. Overall, of the studies reporting fidelity outcomes, many reported adequate intervention delivery [32,34-37,52]. One study found adequate levels of fidelity for a checklist of intervention components (>70%) and a rating of competence measured by researchers. For the same study, less than adequate fidelity was found for a motivational interviewing (MI) intervention measured by trained researchers using a ratio of the number of 'adherent' vs 'not-adherent' intervention components [48]. Two studies contrasted delivery assessments made using different methods. One reported low levels of delivery fidelity using objective rating of audio transcripts (with 44% of intervention components delivered as intended), but with provider self-reports indicating high delivery fidelity (100% of intervention components delivered as intended) [38]. Another study found that provider self-assessment scores were higher than those assigned by an independent rater [43].
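The contrast between observer-coded and self-reported delivery can be made concrete with a small sketch. The code below computes the percentage of intended components coded as delivered from a presence/absence checklist, and lists the components a provider reported delivering that an observer did not code. The function names, component names and values are all hypothetical and purely illustrative; they are not data from any included study.

```python
def percent_delivered(checklist):
    """Percentage of intended components coded as delivered.

    `checklist` maps each intended component name to True (delivered as
    intended) or False (not delivered).
    """
    if not checklist:
        raise ValueError("checklist must contain at least one component")
    return 100.0 * sum(checklist.values()) / len(checklist)


def overreported(self_report, observed):
    """Components the provider reported delivering that the observer did not code."""
    return sorted(
        name for name, done in self_report.items()
        if done and not observed.get(name, False)
    )


# Hypothetical ten-component session checklist (illustrative values only).
components = [f"component_{i}" for i in range(1, 11)]
observer = {name: i < 6 for i, name in enumerate(components)}  # observer codes 6 of 10
provider = {name: i < 9 for i, name in enumerate(components)}  # provider reports 9 of 10

observer_pct = percent_delivered(observer)
provider_pct = percent_delivered(provider)
gap = overreported(provider, observer)
```

Listing the over-reported components, rather than only comparing two percentages, shows where the two data sources diverge, which may be more useful when deciding how to refine provider training or the self-report instrument.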

Receipt
There were three approaches to assessing intervention receipt: participant demonstration of knowledge or skills acquired [49], perceived understanding of intervention content [50], and participant confidence (self-efficacy) to perform skills taught by the intervention [38,50]. Approaches to sampling for assessment of intervention receipt included sampling the whole trial population at the end of each session and ten days later [49] and assessing the whole trial population at the end of the trial [38,50]. One study reported a small increase (from 86.2% to 89.4%) in perceived ability to carry out intervention skills [50]. Another study found that 73.5% to 100% of participants achieved the learning objectives of any given session [49].

Enactment
Enactment measures included participant self-reports or automatic tracking of using the intervention materials (e.g. log books, worksheets, pedometers) [32,41,48,53] or of using specified intervention techniques (e.g. action planning/self-monitoring) [35,38,50]. One study also measured participant use of self-monitoring using electronically recorded data from the intervention website [35]. All studies that measured enactment collected data from the whole trial population. Studies that assessed enactment reported a range from 35.3% to 60% [32,41] of participants regularly using intended intervention techniques post intervention. One study found a nonsignificant increase in the self-reported use of action planning techniques from pre to post intervention (4.6 to 6.8 on a 9 point scale) [50] and one study reported that all participants used intervention techniques as intended at 6 and 12 months follow up [38]. Finally, another study reported 71.9% of participants using intervention materials (accelerometers) [49].

Intervention Fidelity in relation to physical activity (and other study outcomes)

Delivery in relation to physical activity
Only three studies assessed the relationship between delivery fidelity and physical activity. One study found a positive association between MI fidelity (counts of adherent and non-adherent intervention components, and spirit of delivery) and objectively measured total energy expenditure (TEE) (p = 0.027) [37]. Two studies found no significant relationship between the number of intervention components delivered (based on coding of audio observations) and levels of self-reported and objective physical activity [38,48].

Delivery in relation to other outcomes
Only one study looked at the relationship between number of intervention components delivered and participant confidence in using intervention strategies, intention to be physically active and affective attitude towards physical activity [38].

Critical appraisal of intervention Fidelity measurement practices
A summary of critical appraisal criteria for each dimension of the BCC framework, and the number of studies meeting the criteria can be seen in Table 3.
No studies assessed design fidelity, so by default none met these criteria.
Of the two studies that assessed training, only one met this criterion [46], by having providers complete a practice role play and marking it against a checklist of intervention techniques. Of the 20 studies that assessed delivery fidelity, only four measured adherence to treatment components and competence to deliver the treatment in the manner specified [34,37,48,53]. This was done using the Motivational Interviewing Treatment Integrity Scale (MITI), which measures both usage of MI techniques and 'spirit' of delivery (i.e. use of a person-led, empathy-building interactive style). This is important to ensure effects are due to treatment rather than to different interactional styles [13]. Some form of direct observation to evaluate delivery against a priori criteria (e.g. using audio/video tapes) was reported in eight studies [33,34,37,38,43,48,49,51], enhancing the reliability of the data. Of the eight studies that used direct observation, credible data supporting the competence of the raters used to assess delivery fidelity (e.g. previous training in the intervention) were evident in six. Such evidence included mentions of having expertise in health behaviour change [51], being trained in MI [34,37,48] and being the 'intervention director' [36,53]. Rater independence was only present in three studies [37,38,53], where external raters who were not otherwise involved in the study were employed to provide a more objective rating of fidelity. Evidence of rater blinding to providers, participants or outcomes was not reported in any of the studies, and interrater reliability was only reported for three studies, one reporting a Cohen's kappa of 0.60 [51] and the other two reporting percentage agreement scores ranging from 75% to 100% [38,43]. No studies met all six of these criteria and only two studies met at least four out of the six (11%).
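Because the studies above reported interrater reliability either as Cohen's kappa or as raw percentage agreement, it may help to see how the two differ on the same ratings. The following is a minimal sketch for two raters coding whether a component was delivered (1) or not (0); the function names and the ten example ratings are hypothetical and not drawn from any included study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Raw percentage of items on which two raters give the same code."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return 100.0 * sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement corrected for chance using each rater's marginals."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b) / 100.0
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if the raters coded independently with their own base rates.
    p_e = sum(counts_a[c] * counts_b[c] for c in set(rater_a) | set(rater_b)) / n**2
    if p_e == 1.0:
        return 1.0  # both raters used a single identical code throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for ten intervention components (1 = delivered, 0 = not).
rater_1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
rater_2 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
```

In this illustrative case the raters agree on 80% of components, yet kappa is noticeably lower because part of that agreement would be expected by chance; this is one reason percentage agreement alone can overstate rater consistency.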
Of the three studies that measured receipt, only one made use of knowledge tests, providing multiple choice tests as well as free text questions [49]. Finally, of the seven studies that measured enactment, only one met the criterion of 'objective observation to determine if participants were using behaviour change techniques in relevant day to day settings', by counting the number of times steps were logged by participants on a website [35]. Across all 21 identified studies, the psychometric properties of instruments used to measure any type of intervention fidelity were only reported for six studies. This involved reporting internal consistencies [36,50] or intraclass correlations [48], or referencing the use of previously validated and reliable scales [34,37,48,53].

Summary of findings
This review systematically identified and summarised the range of concepts and methods used to assess intervention fidelity in interventions to increase physical activity and critically appraised the methods used. Only twenty-eight articles reporting twenty-one studies were identified which had explicitly examined intervention fidelity, suggesting an overall lack of attention to this issue in the field.
A range of different ways to assess intervention fidelity were identified, with delivery of intervention components being the most frequent. The concepts measured often deviated from those identified by the BCC Framework [13]. For example, there was a lack of clear distinction between fidelity of training and fidelity of delivery and no studies assessed every aspect of fidelity.
A wide range of approaches were used to measure fidelity, with data collection measures ranging from researcher coding of observational data (using checklists or scales) to participant self-report measures, to simple counting of sessions attended. A mixture of provider self-report and audio observation were most common for delivery and participant self-report was most common for receipt and enactment. However, there was an overall lack of methodological rigour in the approaches used for data collection (e.g. lack of attention to psychometrics and use of untrained, potentially biased raters) when appraised against a priori quality criteria for fidelity assessment.

Relation to other literature/interpretations
The lack of attention, consistency and rigour in the conceptualisation and measurement of fidelity in physical activity interventions found in this review confirms previous findings. For example, a recent scoping review found that only 5% of published articles addressed the issue of fidelity in motivational physical activity interventions [25]. This also resembles findings in other behavioural domains [18,20,22]. For instance, in a review of fidelity in diabetes self-management interventions, only fifteen studies were identified that assessed intervention fidelity, with delivery adherence again being the most frequent concept assessed [20]. In contrast, a review of fidelity in after-school programmes to promote behavioural and academic outcomes identified 55 studies [55]. However, the review of after-school programmes included strategies used to maintain fidelity (e.g. use of an intervention manual) and, on further examination, only 29% (n = 16) of the included studies actually measured fidelity outcomes. Possible reasons for the lack of attention to fidelity assessment could be a lack of journal space, or a lack of definitive guidance requiring the reporting of fidelity data. The recent development of checklists such as the Template for Intervention Description and Replication (TIDieR) [14] and the Medical Research Council (MRC) guidance on process evaluations [15] may help to improve this situation in the future.

Table 3 Summary of critical appraisal criteria for each dimension of the BCC framework and the number of studies meeting each criterion

Design (0)
  - Prior to study implementation, investigators, and optimally a protocol review group or panel of experts, should review their protocols or treatment manuals to ensure that the active ingredients of the intervention are fully operationalized: N/A
  - The degree to which the measures reflect the hypothesized theoretical constructs and mechanisms of action should be assessed: N/A

Training (2)
  - Ensure providers are trained to a well-defined, a priori performance criterion. Provider role-plays with standardized patients should be evaluated for both adherence to treatment components and adherence to process (e.g., interactional style): 1 [46]

Delivery (20)
  - Adherence to treatment components and competence to deliver the treatment in the manner specified: 4 [34,37,48,53]
  - Direct observation evaluated according to criteria developed a priori: 8 [33,34,37,38,43,48,49,51]
  - Raters of the audiotapes or videotapes should be skilled in treatment delivery as well as in more subtle aspects of the intervention and the treatment manual: 6 [34,36,37,48,51,53]
  - Raters of the audiotapes or videotapes should be independent of the study: 3 [37,38,53]
  - Raters of the audiotapes or videotapes should be blind to treatment assignment, participant progress and outcomes, and provider identity: 0
  - Interrater reliability of raters of the audiotapes or videotapes should be conducted: 3 [38,43,51]

Receipt (3)
  - Assessment of treatment receipt involves verifying the participants' understanding of the information provided in the treatment and verifying that they can use the skills and recommendations discussed. This could include written verification (pre-post-tests), using audio visuals (repeat information orally and visually), and behavioural strategies (role-plays skills with feedback): 1 [49]

Enactment (7)
  - Objective observation to determine if participants were using behaviour change techniques in relevant day to day settings: 1 [35]

All (21)
  - Psychometric properties: 6 [34,36,37,48,50,53]
Another key finding from this review was the lack of attention to the quality of measures used to assess fidelity in physical activity interventions. Where researchers applied checklists or rating scales to session recordings or live session observation, it was often unclear whether the raters were skilled and unbiased. Many studies appeared to assume that the use of researchers (who were not necessarily involved in intervention development) was sufficient to evidence rater competence. However, this does not distinguish between junior and experienced researchers and so may introduce a high risk of error. This could also have implications for the validity and reliability of the findings, as scoring with the MITI coding tool has been shown to be more reliable for coders with higher levels of experience than for those with lower levels [56]. There is a further methodological tension here, as those in the best position to rate competence and adherence are arguably the intervention developers themselves or the providers (due to the training received in all of the treatment components). However, developers and providers directly involved in the project may have a vested interest in demonstrating high quality of delivery [27], and skilled independent raters may be more difficult to find for novel interventions. A solution to this dilemma might be to use experts with a wide range of expertise in intervention design or delivery, supported by clear definitions and instructions on the intervention components and their intended use (perhaps with examples of good practice provided by the developers). The use of multiple raters and checking of inter-rater reliability could also help to reduce the risk of bias further.
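As an illustration of the inter-rater reliability check mentioned above, the following minimal Python sketch computes Cohen's kappa for two raters who coded the same checklist items as delivered (1) or not delivered (0). The function name `cohens_kappa` and the ratings are hypothetical and are not drawn from any of the reviewed studies; kappa is used here only as one common chance-corrected agreement statistic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes for ten checklist items (1 = delivered, 0 = not delivered).
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.52"
```

In this made-up example the raters agree on 8 of 10 items (80%), but because both code most items as delivered, chance agreement is 58%, so kappa is noticeably lower than raw percentage agreement.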
Only a few studies used objective data collection methods. This may be problematic as objective measures showed poor convergent validity with self-report measures of intervention fidelity in some studies [38,43]. Factors responsible for this could include social desirability or a lack of sophistication, accuracy or reliability of the measures used. For example, provider self-report measures typically assessed broad concepts such as a global appreciation of delivery [50], whereas researcher observations assessed more finely detailed concepts such as the number of specific intervention components delivered [34]. Finally, the lack of attention to the psychometric properties of measurement tools used to assess fidelity also increases the potential for bias due to unknown validity and reliability [57]. This lack of psychometric integrity could be due in part to resource constraints, as novel interventions often require new instruments to be developed. Hence, a balance must be found between scientific rigour and pragmatism.
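One simple way to examine convergent validity of the kind discussed above is to correlate provider self-report scores with observer-coded scores for the same sessions. The sketch below is a minimal illustration under stated assumptions: the function name `pearson_r`, the 1 to 5 scoring scale and the session scores are all hypothetical, and Pearson's r is chosen only for simplicity (a rank-based coefficient may suit ordinal rating scales better).

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical fidelity scores (1-5) for the same six sessions.
provider_self_report = [5, 5, 4, 5, 4, 5]
observer_coded = [3, 5, 2, 4, 4, 2]
r = pearson_r(provider_self_report, observer_coded)
print(f"r = {r:.2f}")
```

A low or near-zero coefficient in such a comparison would be consistent with the poor convergence reported above, although it would not by itself show which of the two measures is in error.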
The importance of assessing intervention fidelity for the interpretation of the findings of intervention effectiveness studies is increasingly recognised [13,22,58]. However, the current findings suggest that the conceptualisation and measurement of intervention fidelity remain heterogeneous in the field of physical activity promotion research [20,22]. Methodological checklists (e.g. Cochrane) exist to evaluate the quality and rigour of reporting of randomised trials [59]. However, there is currently no such tool for intervention fidelity measurement. Despite this, as demonstrated by this review, the Behaviour Change Consortium [11,13,60] has provided a useful basis for categorisation, planning and critical appraisal of intervention fidelity assessments. It is worth noting that studies that used motivational interviewing had access to validated tools with instructions regarding observation method and rater characteristics (e.g. MITI, BECCI). This provides a good example of what might be possible for future development of fidelity assessment methods in the wider field of physical activity interventions. Further research could combine existing tools such as the Behaviour Change taxonomy [61] and the BCC Framework [13] to construct intervention fidelity checklist items for each behaviour change technique in the intervention and to check their reliability and validity. However, a focus on delivery style as well as content is also important [62].

Strengths and limitations
Although previous studies have looked at fidelity to theory [63] and use of fidelity-enhancing strategies [24], to the best of our knowledge this is the first systematic review to specifically identify intervention fidelity assessment methods for physical activity interventions and critique them. Our systematic approach highlighted key conceptual and methodological gaps in current practice. There are, however, several limitations of this review that should be acknowledged. First, effectiveness studies may not have reported intervention fidelity in their titles, abstracts or keywords, as implementation is rarely a key focus of intervention studies [31]. This means that the search strategies employed here may have missed some relevant studies. However, this review was concerned with understanding how studies typically assessed fidelity and the methodological implications of current practice, rather than aiming to provide a comprehensive overview of the field. It has previously been pointed out that studies that do not mention fidelity in their titles, abstracts or keywords most likely do not consider it a significant focus [25]. As such, we are confident that the search strategy employed yielded a representative sample of studies in which the issue of fidelity was given significant consideration. Second, although the BCC Framework [13] was used for structuring the analysis, it is worth noting that other conceptual frameworks of intervention fidelity exist, which may have highlighted slightly different issues. For instance, Carroll et al.'s (2007) Conceptual Framework for Implementation Fidelity [26] includes content (active ingredients), coverage (reach of the intervention), frequency (number of sessions), duration (time taken), complexity (difficulty), quality of delivery (competence), facilitation strategies (strategies to enhance delivery) and participant responsiveness (participant receipt).
Although there is much overlap with the BCC Framework, there are some minor differences. The BCC Framework was used here as it conceptualises intervention fidelity across key stages from design to implementation that are not all covered by Carroll et al.'s (2007) framework. Third, due to the lack of consensus on reporting standards for fidelity assessment, the level of description of the methods and measures used was generally poor, so ascertaining the provenance and reliability of relevant information was challenging [31]. To mitigate this, a second coder double-checked the extracted data, and companion papers were sourced and included. Finally, the appraisal criteria used were developed for the purposes of this study based on a combination of existing approaches (Table 1), as definitive criteria do not appear to exist in the literature. As such, there may be other important appraisal criteria that were not considered here. However, we hope this will at least provide a building block for further development.

Implications/recommendations
Based on this review, some key recommendations are proposed. Firstly, clearer conceptualisation of fidelity is needed when researchers plan and conduct fidelity analyses. This could be achieved by applying structures such as the BCC Framework. Secondly, researchers should clearly report all aspects of fidelity assessment measures (e.g. observational methods, rater attributes, and sampling procedures) as these can have implications for the likely risk of bias. Thirdly, clearer guidelines are needed on fidelity measurement, including consideration of data collection, sampling, measurement validity and reliability, minimising the effects of rater bias and other methodological issues. These could then act as an adjunct to existing checklists such as TIDieR. Fourthly, a possible approach for critically appraising fidelity assessment methods in behavioural interventions has been proposed in this review and may provide a useful template for future studies. Finally, researchers should acknowledge the inherent strengths and weaknesses of their assessment methods when reporting and interpreting their intervention fidelity outcomes.

Conclusion
This review highlights new directions for research to improve the rigour and replicability of behavioural interventions for promoting physical activity by enhancing the assessment of intervention fidelity. The conceptualisation and measurement of fidelity in behavioural interventions for physical activity are wide-ranging and of variable quality. Further work is needed to generate a more definitive understanding of the key concepts and best practice methods for conducting fidelity assessments of physical activity (and other behavioural) interventions.

Additional files
Additional file 1: Search terms with results.