Translation, adaptation and validation of the American short form Patient Activation Measure (PAM13) in a Danish version

Background The Patient Activation Measure (PAM) is a measure that assesses patient knowledge, skill, and confidence for self-management. This study validates the Danish translation of the 13-item Patient Activation Measure (PAM13) in a Danish population with dysglycaemia. Methods A total of 358 people with screen-detected dysglycaemia participating in a primary care health education study responded to PAM13. The PAM13 was translated into Danish by a standardised forward-backward translation. Data quality was assessed by mean, median, item response, missing values, floor and ceiling effects, internal consistency (Cronbach's alpha and average inter-item correlation) and item-rest correlations. Scale properties were assessed by Rasch Rating Scale models. Results The item response was high with a small number of missing values (0.8–4.2%). The floor effect was small (range 0.6–3.6%), but the ceiling effect was above 15% for all items (range 18.6–62.7%). The α-coefficient was 0.89 and the average inter-item correlation 0.38. The Danish version formed a unidimensional, probabilistic Guttman-like scale explaining 43.2% of the variance. We did, however, find a different item sequence compared to the original scale. Conclusion A Danish version of PAM13 with acceptable validity and reliability is now available. Further development should focus on single items, response categories in relation to ceiling effects and further validation of reproducibility and responsiveness.


Background
Several initiatives have been taken to develop evidence-based activities to improve care for chronic conditions and, as in the Chronic Care Model, collaborative care and patient activation are cornerstones [1,2]. Evaluations of these initiatives are essential for further research and development. Although patient activation has been a central concept of chronic care for decades, there is a general lack of clarity regarding the definition of "activation", and consequently a lack of adequate assessment tools.
In 2004, Hibbard et al. defined the concept of being "activated" and developed the 22-item Patient Activation Measure (PAM) and a 13-item short form (PAM13) [3,4]. They identified four elements (knowledge, skills, confidence and behaviours) critical for coping with a chronic illness, and suggested four stages of activation that patients go through on their way to becoming fully activated in managing their own health [4,5]. Studies indicate that the PAM predicts self-management behaviours, including healthy behaviours, disease-specific behaviours and "attitude to health system" behaviour [6,7]. The PAM was formulated in two versions, targeting people with or without chronic disease, with few semantic differences. Psychometrically, the PAM was evaluated to be a unidimensional, Guttman-like scale [3]. The PAM13 version for the chronically ill was used in this validation study (Table 1).
The aims of this study were to translate and adapt the original American PAM13 into a Danish version and to report data quality and psychometric scale properties in a Danish population with dysglycaemia.

Translation and adaptation of the PAM13
A systematic approach to translation and adaptation was conducted as recommended by WHO [8]. It involves five steps: forward translation, expert panel discussion, backward translation, a pre-test with cognitive briefing, and consensus on the final version. Two independent translators with Danish as their mother tongue translated the original version of PAM13 from American English to Danish. These two translators comprised the expert panel together with three experts in chronic care and measurement development. The panel reconciled the forward translations into a single translation by identifying and resolving inadequate expressions or concepts. Back-translation was done as a quality control step to ensure that the original meaning of the concepts was retained. The back-translations were conducted by two independent translators with American English as their mother tongue and without prior knowledge of the questionnaire.
A pre-test investigated the level of comprehensibility and cognitive equivalence of the translation [9,10] among 12 patients with newly diagnosed type 2 diabetes from a local diabetes outpatient clinic. The patients filled in the questionnaire at home and participated in a focus group interview on the following day. The interviewer (first author) facilitated a cognitive briefing on general comprehensibility followed by a review of each question. The participants were asked to think out loud, highlight problems and express their attitude to the question. The final version of the questionnaire was agreed on by the expert panel, and a simple and acceptable language was ensured in accordance with the WHO guidelines [8].

Participants
The Danish version of PAM13 was sent to 467 participants in "The Ready to Act" health education randomised controlled study (296 in the intervention and 171 in the control group) at the 12-month follow-up [11]. Participants were between 43 and 75 years old and had been diagnosed with different aspects of dysglycaemia (Impaired Fasting Glucose, Impaired Glucose Tolerance and Type 2 diabetes) by a step-wise screening procedure in general practice within the last five years [12,13]. Participants received a mail-administered PAM13 as part of a larger 16-page 1-year follow-up questionnaire on the psychological and behavioural outcomes of the "Ready to Act" study. A reminder including a new questionnaire was sent to participants who did not respond within three weeks.

Ethics
Ethical approval of the study was obtained from the local Science Ethics Committee of Aarhus County, Denmark (protocol no: 20000183). All participants gave informed consent. The Danish Data Surveillance Authority permitted the collection and storing of data (journal no: 2000-41-0042).

Scoring
Each item had five response categories with scores from 1 to 5: (1) strongly disagree, (2) disagree, (3) agree, (4) agree strongly and (5) not applicable. Only PAM questionnaires with answers to seven or more items were included in the analyses. If included questionnaires had missing observations, these observations were omitted from the analysis, but not the corresponding persons or items. The raw scores were transformed into natural logarithms to achieve a better expression of the relative distances between the scores [14]. Further, items were calibrated from the logit metric to a user-friendly 0 to 100 metric (0 = lowest activation level, 100 = highest activation) [15] to compare the Danish results to the original data.
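The inclusion rule can be illustrated with a minimal sketch. It assumes that the scored categories are coded 1 to 4 and that anything else (a missing answer or a "not applicable" marker) counts as unanswered; the function names and the example respondent are hypothetical and not taken from the study.

```python
VALID_SCORES = {1, 2, 3, 4}   # assumed codes for the scored categories
MIN_ANSWERED = 7              # inclusion threshold stated in the text


def scored_answers(row):
    """Keep scored answers; map anything else (missing, 'not applicable') to None."""
    return [r if r in VALID_SCORES else None for r in row]


def is_included(row):
    """True when at least MIN_ANSWERED of the items carry a scored answer."""
    return sum(a is not None for a in scored_answers(row)) >= MIN_ANSWERED


# Hypothetical respondent: 8 scored answers, 3 "not applicable" ("NA"), 2 missing.
example = [3, 4, 4, 3, "NA", 2, 3, None, "NA", 3, "NA", None, 4]
print(is_included(example))  # True: 8 scored answers
```

Unanswered items are kept as None rather than dropping the respondent, which mirrors the rule that missing observations, but not the corresponding persons or items, were omitted.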

Analyses and statistical methods
The psychometric elements of the PAM13 Danish version were examined in two parts.
First, we assessed the data quality, internal consistency and correlations between items and the sum of the other items. Data quality was assessed in terms of mean for each item with standard deviation, median, percentage of missing data, number of "not applicable" answers and extent of ceiling and floor effects. Floor and ceiling effects between 1% and 15% were defined as optimal [16]. Internal consistency was assessed using Cronbach's alpha and average inter-item correlation. We defined an alpha of 0.80 as the lowest acceptable value [17-19]. In contrast to alpha, the average inter-item correlation is independent of the number of items and sample size when measuring internal consistency. We aimed at an average inter-item correlation between 0.15 and 0.50 [19].
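As an illustration of these data quality measures, the following sketch computes Cronbach's alpha, the average inter-item correlation and the floor and ceiling shares with NumPy. It assumes a persons-by-items array of complete scored answers with 1 as the lowest and 4 as the highest scored category; it is not the software used in the study, and the function names are hypothetical.

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for a persons x items array of complete responses."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1).sum()
    total_variance = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances / total_variance)


def average_inter_item_correlation(scores):
    """Mean of the off-diagonal Pearson correlations between items."""
    r = np.corrcoef(np.asarray(scores, dtype=float), rowvar=False)
    k = r.shape[0]
    return (r.sum() - np.trace(r)) / (k * (k - 1))


def standardized_alpha(k, mean_r):
    """Standardized alpha implied by k items with average correlation mean_r."""
    return k * mean_r / (1.0 + (k - 1) * mean_r)


def floor_ceiling(item_scores, lowest=1, highest=4):
    """Share of answers in the lowest and in the highest scored category."""
    x = np.asarray(item_scores)
    return float((x == lowest).mean()), float((x == highest).mean())
```

The `standardized_alpha` helper shows why alpha depends on the number of items while the average inter-item correlation does not: for a fixed average correlation, alpha rises as items are added.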
We assessed whether each item had a high correlation with the sum score of the rest of the scale (internal item convergence), which is assumed in a unidimensional scale [17]. Correlations were fixed at a minimum of 0.60 to reflect a high level of internal convergence [20].
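The item-rest correlation can be sketched as follows, again assuming a persons-by-items array of complete scored answers. The helper names are hypothetical; the 0.60 minimum is the one stated above.

```python
import numpy as np


def item_rest_correlations(scores):
    """Correlate each item with the sum of all other items."""
    x = np.asarray(scores, dtype=float)
    total = x.sum(axis=1)
    return [
        float(np.corrcoef(x[:, j], total - x[:, j])[0, 1])
        for j in range(x.shape[1])
    ]


def items_below_minimum(correlations, minimum=0.60):
    """1-based item numbers whose item-rest correlation is below the minimum."""
    return [j + 1 for j, r in enumerate(correlations) if r < minimum]
```

Subtracting the item from the total before correlating avoids inflating the correlation with the item's own contribution to the sum score.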
Secondly, we used the Rasch Rating Scale Model [21,22] to investigate whether the scale was unidimensional, which is a prerequisite for the summation of the items [23]. The following criteria for the Rasch model were investigated: item statistics, person and item reliability, rating scale diagnostics, factorial test of residuals and differential item functioning.
In the Rasch analysis, person and item scores were used to calibrate items on a logit scale where the midpoint of the scale is 0. Items at one end of the scale are "easier"/"less severe" and items at the other end are more "difficult"/ "more severe". In the current analysis, items with a positive calibration were those indicating a high level of patient activation (more difficult to achieve). From the Rasch model, we reported reliability and separation index for persons and items, and item statistics for measure order. Reliability expresses the reproducibility of the relative measure. A high reliability indicates that in all probability, persons (or items) with high measures actually do have higher measures than persons (or items) estimated with low measures. Winsteps [24] computes upper (Model) and lower (Real) boundary values for reliability. The true reliability can be found between these boundaries. Person reliability of 0.9 means that the scale may discriminate the sample into 3-4 levels, 0.8 into 2-3 levels and 0.5 into 1-2 levels. High item reliability merely indicates that the sample is big enough to precisely locate the items on the latent variable. We compared Rasch person reliability for subgroups to the original data [4].
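The link between reliability and the number of distinguishable levels can be made concrete with the standard Rasch relationships between reliability R, the separation index G = sqrt(R / (1 - R)) and the number of strata (4G + 1) / 3. The sketch below assumes these textbook formulas; it does not reproduce Winsteps output.

```python
import math


def separation_index(reliability):
    """Separation G implied by a reliability coefficient R (0 <= R < 1)."""
    return math.sqrt(reliability / (1.0 - reliability))


def strata(reliability):
    """Number of statistically distinct strata, (4G + 1) / 3."""
    return (4.0 * separation_index(reliability) + 1.0) / 3.0


for r in (0.9, 0.8, 0.5):
    print(r, round(separation_index(r), 2), round(strata(r), 2))
```

For reliabilities of 0.9, 0.8 and 0.5 the separation index is about 3, 2 and 1, and the strata are about 4.3, 3.0 and 1.7, which roughly brackets the 3-4, 2-3 and 1-2 levels quoted above.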
An important characteristic of a high-quality scale is a good overall separation of persons and items assessed with the scale. The separation index is an estimate of the spread or separation of persons (or items) on the measured dimension. The separation index should be at least 2, indicating that the measure separated persons, items or both into at least two distinct groups [14]. Individual items that are at least 0.15 logits apart represent individual strata [25]. Otherwise one item is not distinctly separate from the next.
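The check on distances between neighbouring items can be sketched as below. The item measures in the example are invented for illustration; only the 0.15 logit criterion comes from the text.

```python
def adjacent_gaps(item_measures, min_gap=0.15):
    """Order items by logit measure and report the gap to the next item.

    item_measures maps an item label to its measure in logits.
    Returns (easier item, harder item, gap, gap >= min_gap) tuples.
    """
    ordered = sorted(item_measures.items(), key=lambda kv: kv[1])
    report = []
    for (a, measure_a), (b, measure_b) in zip(ordered, ordered[1:]):
        gap = measure_b - measure_a
        report.append((a, b, round(gap, 3), gap >= min_gap))
    return report


# Hypothetical measures for three items (not study data).
print(adjacent_gaps({"item 1": -1.0, "item 2": -0.9, "item 3": 0.0}))
```

A scale with 13 items yields 12 such adjacent gaps.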
Two item fit mean square (MNSQ) statistics (infit and outfit) were computed to check whether the items fitted the expected model. MNSQ determines how well each item contributes to defining a single underlying construct (unidimensionality). Infit is more sensitive to misfitting responses to items closest to the person's ability level, while outfit is more sensitive to misfitting items that are farther away. If the data fit the Rasch model, the fit statistics should be between 0.6 and 1.4 [26].
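A minimal sketch of the two statistics is given below. It assumes that the expected scores and model variances for every person-item pair have already been obtained from a fitted Rasch model (for example exported from the analysis software); the function names are hypothetical.

```python
import numpy as np


def item_fit(observed, expected, variance):
    """Infit and outfit mean squares per item from persons x items arrays.

    outfit: unweighted mean of squared standardized residuals.
    infit:  information-weighted, sum of squared residuals / sum of variances.
    """
    obs, exp, var = (np.asarray(a, dtype=float) for a in (observed, expected, variance))
    squared_residuals = (obs - exp) ** 2
    outfit = (squared_residuals / var).mean(axis=0)
    infit = squared_residuals.sum(axis=0) / var.sum(axis=0)
    return infit, outfit


def within_fit_range(values, low=0.6, high=1.4):
    """True for each value inside the acceptable range stated in the text."""
    return [bool(low <= v <= high) for v in values]
```

Because the outfit divides each squared residual by its own variance before averaging, unexpected answers to items far from a person's level (where the variance is small) weigh heavily; the infit pools residuals and variances first, so well-targeted responses dominate.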
The assumption is that the use of response categories for each item reflects the way people answer the items that are close to each other. However, this is only true if the distances between each response category are similar. The step measure (Rasch-Andrich threshold) is a calibrated measure of the transition between response categories. The thresholds are expected to increase monotonically. If not, the response categories do not reflect a reasonable interval on the latent variable, and consequently indicate substantial problems with the category definitions. Thresholds should increase by at least 1.4 logits, to show distinction between categories, but not more than 5.0 logits, to avoid large gaps in the variable [14].
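The threshold criteria can be expressed as a short check. The threshold values in the example are invented; the 1.4 and 5.0 logit limits are those stated above.

```python
def check_thresholds(thresholds, min_step=1.4, max_step=5.0):
    """Check ordering and spacing of Rasch-Andrich thresholds (in logits)."""
    steps = [b - a for a, b in zip(thresholds, thresholds[1:])]
    return {
        "monotonic": all(s > 0 for s in steps),
        "steps": steps,
        "in_range": all(min_step <= s <= max_step for s in steps),
    }


# Hypothetical thresholds for a four-category rating scale (not study data).
print(check_thresholds([-3.0, -1.0, 4.0]))
```

Four scored categories have three thresholds and therefore two increases to inspect.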
Local independence of items was tested using Principal Components Analysis (PCA) on the Rasch item measure residuals. The purpose of PCA of residuals is to analyse the amount of unexplained variance and whether this unexplained variance indicates that there may be more than one dimension. Simulation studies indicate that even Rasch-conforming data produce residual factors with eigenvalues up to 2.0. Thus, if there is more than one contrast (factors) in the residuals, there may be a second dimension. Contrasts in the Rasch analysis of residuals contradict unidimensionality [24].
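The idea behind the residual analysis can be sketched by taking the eigenvalues of the correlation matrix of standardized residuals. This is a simplified illustration under the assumption that a complete persons-by-items matrix of standardized residuals is available; the exact procedure in the analysis software may differ, and the demo data are random numbers, not study data.

```python
import numpy as np


def residual_contrast_eigenvalues(std_residuals):
    """Eigenvalues (largest first) of the item correlation matrix of residuals."""
    r = np.corrcoef(np.asarray(std_residuals, dtype=float), rowvar=False)
    return np.sort(np.linalg.eigvalsh(r))[::-1]


# Demo with pure noise: 200 hypothetical persons, 5 hypothetical items.
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 5))
eigenvalues = residual_contrast_eigenvalues(demo)
print(eigenvalues)
```

The eigenvalues sum to the number of items, so an eigenvalue can be read as the number of items' worth of residual variance carried by that contrast.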
As the last part of the Rasch analysis, we assessed differential item functioning (DIF) by estimating item parameters separately by groups of participants (sex, age groups, diagnosis, education, self-rated health and randomisation group). The scale should work uniformly, irrespective of the group assessed. The criterion used for the DIF analysis was a DIF contrast >0.50. Differences were tested with t-tests; to correct for multiple comparisons by the Bonferroni method, each p-value was multiplied by the number of DIF tests for the variable in question and compared with a significance level of 0.05.
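The DIF decision rule can be sketched as follows. It assumes that the DIF contrasts and uncorrected p-values are already available from the Rasch software; the example values are invented for illustration and are not study results.

```python
def flag_dif(results, n_tests, min_contrast=0.50, alpha=0.05):
    """Flag items with a DIF contrast above min_contrast in absolute size
    and a Bonferroni-corrected p-value below alpha.

    results: iterable of (item, dif_contrast, uncorrected_p) tuples.
    n_tests: number of DIF tests carried out for the grouping variable.
    """
    flagged = []
    for item, contrast, p in results:
        p_corrected = min(1.0, p * n_tests)  # Bonferroni correction
        if abs(contrast) > min_contrast and p_corrected < alpha:
            flagged.append((item, contrast, p_corrected))
    return flagged


# Hypothetical output for one grouping variable with 13 item-level tests.
example = [(1, 1.35, 0.002), (5, 0.30, 0.001), (10, -0.65, 0.02)]
print(flag_dif(example, n_tests=13))
```

In this invented example only item 1 is flagged: item 5 has too small a contrast, and item 10 loses significance once its p-value is multiplied by 13.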

Translation and adaptation
The two translations from American English to Danish agreed on most items. Different Danish words were used, but they were semantically equivalent. A few conceptual discrepancies were identified; for example, "health care", "medical care" and "treatment" had slightly different meanings when translated directly into Danish (items 5, 7 and 9). The two translated versions were reconciled into a single translation relevant for Danish terminology at expert panel meetings between the translators and the research group.
The first back-translation included all items; the second back-translation included four items (items 2, 4, 7 and 13). The emphasis in the back-translation was on conceptual and cultural equivalence rather than linguistic equivalence, as suggested by WHO [8]. We recognised a few general problems when comparing the two back-translations with the original version [4] and the first Danish draft: as in the forward translation, we had difficulty translating health service terminology, partly because of organisational differences, and partly because of a lack of specific Danish words for health care, illness and disease.
In the pre-test, the participants found all thirteen items relevant for measuring activation. The participants found the introductory wordings easy to understand; in addition, they considered the response categories exhaustive and exclusive. The participants found that the word "treatment" directed their attention to medication rather than diet or exercise. As half of the participants did not receive medication, they suggested that the elaborated term "treatment (e.g. medicine, diet and exercise)" be used in the introductory wordings and in items two and seven. The expert group incorporated the results from the briefing process in the draft version, and proofreaders corrected the spelling and grammar.

Participants
A total of 358 of 467 (76.7%) returned the questionnaire. Excluding "not applicable" answers, 344 had answered at least 50% of the items (>6 items) in PAM13. The mean age of the participants was 62.3 years (s.d.: 7.1), 44.8% were female, 60.5% had type 2 diabetes and 39.5% had a prediabetic condition. The respondents had been diagnosed within the last five years (median 2 years, interquartiles 0-4).

Data quality
The item response was high with few missing answers (0.8-4.2%) (Table 2). The response category "not applicable" was used by 0.6-18.4% of responders. In five of the items (4, 9, 11, 12 and 13) this category represented more than 10% of the answers. For all items, the distribution of answers was left-skewed with a small floor effect (range 0.6-3.6%) and a ceiling effect larger than 15% (range 18.6-62.7%). Cronbach's alpha was 0.89 and the average inter-item correlation 0.38. Item-rest correlations (Table 2) ranged from 0.48 to 0.65 and were below 0.60 for six items (1, 4, 5, 6, 11 and 13).

Item statistics
The item infit and outfit mean square statistics ranged from 0.67 to 1.34, all within the acceptable range (Table 3). Separation distances of at least 0.15 logits were identified for nine of the 12 separations between items, but not for separations between items 2 and 3, items 10 and 9 and items 9 and 5 (Table 3). The calibrated 0-100 scale covered the range from 33.3 to 57.5.

Person and item reliability
The overall Rasch person reliability for the Danish 13-item measure was between 0.83 (real) and 0.85 (model). Item reliability was 0.99. The separation index for persons was 2.24 and for items 8.37. Table 4 shows the person reliability statistics for subgroups in the Danish population compared with the American data. Person reliability in the subgroups was between 0.54 (real) and 0.92 (model). Some subgroups had a reliability below 0.80 (excellent self-rated health and age group 75 years or above).

Rating scale diagnostics
The summary of measured steps is displayed in Table 5. This table shows the category label, observed counts, average measures, infit and outfit MNSQ, and step measures on the PAM13 scale. The category "Strongly disagree" was used in 1% of all answers whereas "agree" was used in 58% of all answers. However, both the average measure and the thresholds increased monotonically across the rating scale. The increase of the thresholds ranged from 1.4-4.2, which was within the targeted range.

Factorial test of residuals
PCA of item measure residuals revealed one dimension. A total of 43.2% of the variance in the data was explained by the measures; with a perfect model fit, this was expected to be 43.1%. The eigenvalue of the first PCA contrast was 2.5, which corresponded to 11% of the variance in the data.

Differential item functioning
No significant DIF was found in subgroups of self-rated health, diagnosis or randomisation groups. Items 1 and 2 were easier to endorse for highly educated persons compared to persons with short education (p-values Bonferroni-corrected) (DIF contrast = 1.35, p = 0.027 and DIF contrast = 1.22, p = 0.031). Item 10 was easier to endorse for men (DIF contrast = -0.65, p = 0.049) and item 13 was easier to endorse for persons between 65 and 74 years

Discussion
We found it possible to make a standardised translation and adaptation of the original PAM13. The forward-backward translation was successfully conducted and the few discovered conceptual differences were primarily due to differences in health care systems. These findings are supported by the fact that, for example, the reliabilities in subgroups were comparable with the American version.
The psychometric assessment of the Danish version replicated to a great extent the findings from the original version [4], showing similar data quality and internal consistency. We found that the items had a different order. In the original version, the items are arranged progressively in order of difficulty to reflect the developmental continuum of patient activation in an American population with chronic diseases. However, this was not confirmed by the results in this study, meaning that this population simply found some questions easier to answer compared with an American population. This could be due to the specific population of people with screen-detected newly diagnosed

Table notes: Only questionnaires with at least 7 items answered were included. * The extreme age groups were 45-54 and 75-84 in the American version. Education is categorised slightly differently in the American version: high school or less, some college and college graduate+.

[5] is to be of significance in the Danish version. Further, the person reliability did not indicate an ability to separate four levels at all. This may be due to differences between the Danish and the American populations, but as mentioned, the psychometric results in many instances replicate the American findings.
The investigation of the scale properties in general showed that PAM13 may be regarded as a unidimensional scale performing as a Guttman sum-scale. This was particularly true for the reliability measures, measures of unidimensionality and aspects of the response categories. We found, for example, that the scale could distinguish between 2-3 levels as the person reliability was between 0.8 and 0.9, and the infit and outfit statistics indicated that each item could be regarded as part of one dimension. The high reliability estimates at the person level indicate that the scale is appropriate on an individual basis to diagnose activation and individualise plans for future health care as suggested [5].
However, we noted some possible problems with ceiling effect, potentially irrelevant items, and important aspects for responsiveness and separation difficulties at three points in the scale. Most PAM13 items demonstrated a ceiling effect and Items 1 to 3 had more than 50% of the answers in "agree strongly". This percentage suggests that the response categories do not cover relevant answers for the study population. Caution must be taken in future studies if ceiling effects are common in Danish populations. The high ceiling effect may be a problem if PAM13 is to be used for measuring change over time (e.g. in randomised studies) because of low responsiveness.
The five items with more than 10% answers in "not applicable" indicate that the scale cannot be used for all types of patients with chronic diseases and a revision of these items might be necessary.
On three points, there seemed to be no additional information when answering the next item in the scale (no separation). This means that 2-3 items can be omitted from the scale as a simple sum-scale. Further research will clarify the items to be omitted.
Although we may conclude that the Rasch analysis supports the PAM13 as a unidimensional sum-scale, some results do, however, indicate a need for improvement. We noted that six items had low correlations with the sum of the rest of the items, which indicates that they may not be absolutely true to one dimension. In addition, the test for other dimensions (PCA) revealed a possible additional factor. However, criteria for deciding whether there are two or more dimensions and when a deviation becomes a dimension have yet to be established. To the best of our knowledge, the rule of thumb [24] is that a variance explained by the measures that is four times greater than the variance explained by the additional factor, together with a size of the components of less than three, is good. Our analysis therefore does not indicate more than one dimension.
When testing for differential item functioning, most items did not show DIF in subgroups. However, for the items revealing DIF (items 1, 2, 10 and 13), possible explanations can be suggested. Items 1 and 2 may appeal to educated people, as they concern being responsible and taking action. Gender seemed to play a role for item 10, which was endorsed by men more often than women, possibly reflecting less demanding goals for their lifestyle changes. The stronger endorsement of item 13 among patients aged 65-74 may be explained by more experience of maintaining their lifestyle even during stress.
The activation score in the Danish version covered the range from 33.3 to 57.5, which is wider than the range of 38.6-53.0 for the American data [4]. However, this may not be enough to detect changes in underlying behaviour in studies, and in particular clinically relevant changes, which subsequently have to be tested.

Strengths and limitations
The systematic translation approach was a strength in this study. Translation has no best practice as yet [9,28] and in particular, the value of back-translation has recently been questioned [29]. In our study, the backward translation procedure contributed new perspectives on the cultural differences in the health care concepts.
We obtained a high response rate with 74% answering more than half of the items. This may minimise the risk of selection bias. The sample size of 344 persons was sufficient for this validation study as a minimum of 300 respondents is recommended to replicate structural analyses [19]. The mean square statistics used in the Rasch analysis are relatively insensitive to sample size for polytomous data [30].
The fact that PAM13 was delivered as part of a larger questionnaire at the 12-month follow-up of a health education intervention study might have affected the actual score level, but it is unlikely to have changed the scale properties. The number of missing values may have been higher than in the baseline questionnaire due to the respondents being fatigued by the questionnaire.
The rather heterogeneous group of patients may be regarded as a weakness in many instances. However, when assessing scale properties, the use of a population representing many levels of activation is an advantage. A population screen-detected with dysglycaemia represents merely one of a range of chronic conditions.

Conclusion
A Danish version of PAM13 measuring the latent variable of patient activation in chronic care is now available, although further development is recommended before use in daily practice. The PAM13 questionnaire was translated and adapted into Danish in a sample with screen-detected dysglycaemia, showing initial reasonably good validity and reliability.
The Danish version formed a unidimensional, Guttman-like scale. The order of the items differed compared with the American version and therefore the suggested four activation stages in the American version were not relevant. Our findings show that the Danish PAM13 has promising psychometric properties, indicating that further validation in other populations with chronic diseases is warranted. However, special attention to discrimination and responsiveness is required to be able to use the score as a screening tool for tailored interventions. These studies have to be carried out before we have a much-requested, fully evidence-based activation measure for use in Danish chronic care intervention studies and in daily practice.