Cross-cultural adaptation into German
The adaptation of the Dutch WRFQ 2.0 into German followed the six-stage approach proposed by Beaton et al. [8]. The prefinal version was tested with a sample of 40 individuals (30 patients presenting psychosomatic symptoms and 10 persons without symptoms), who also participated in cognitive interviews exploring issues such as content validity, wording, and logical structure of the items. Consequently, some items were slightly adjusted to German language usage.
Respondents were asked to assess the extent to which they have had difficulties meeting the work demands due to physical or mental health issues in the last 4 weeks (prior to completing the survey).
The 27 items were answered on a five-point Likert scale with the options 0 = difficult all of the time (calculated as 100%), 1 = difficult most of the time, 2 = difficult half of the time (50%), 3 = difficult some of the time, and 4 = difficult none of the time (0%). Each item also had the option ‘does not apply to my job’.
Data analysis
Missing values resulting from the response ‘does not apply to my job’ were imputed with a hot-deck algorithm in the software R for the subsequent analyses.
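As a rough illustration of the idea behind hot-deck imputation, the following Python sketch replaces each missing response with the observed response of a randomly drawn donor on the same item. This is a simplified assumption, not the procedure actually run in R: the specific hot-deck variant, the donor selection rule, and the function name `hot_deck_impute` are hypothetical.

```python
import random


def hot_deck_impute(rows, rng=None):
    """Fill missing item responses (None) with a randomly drawn donor value.

    rows: list of respondents, each a list of item responses (0-4 or None).
    Donors are all respondents with an observed value on the same item
    (simple random hot deck; an illustrative assumption).
    """
    rng = rng or random.Random()
    n_items = len(rows[0])
    imputed = [list(row) for row in rows]  # copy, leave the input untouched
    for j in range(n_items):
        donors = [row[j] for row in rows if row[j] is not None]
        if not donors:
            raise ValueError(f"item {j} has no observed responses")
        for row in imputed:
            if row[j] is None:
                row[j] = rng.choice(donors)
    return imputed


# Hypothetical data: three respondents, two items, None = 'does not apply'
data = [[4, None], [3, 2], [None, 2]]
completed = hot_deck_impute(data, rng=random.Random(0))
```

Observed responses stay unchanged; only the `None` entries are filled from the pool of observed values for that item.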
For scale construction, the item scores were summed with IBM SPSS 26, divided by the number of items, and multiplied by 25 to obtain percentages between 0 (difficult all the time) and 100 (difficult none of the time). Thresholds of significance were set at p ≤ .05. Details of the cross-cultural adaptation are part of a doctoral thesis [9].
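The scoring rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the SPSS syntax used in the study; the function name `wrfq_score` is hypothetical, and complete (already imputed) responses coded 0 to 4 are assumed.

```python
def wrfq_score(item_responses):
    """Convert item responses (each coded 0-4) into a 0-100 scale score.

    Score = sum of items / number of items * 25, so that 0 means
    'difficult all the time' and 100 means 'difficult none of the time'.
    """
    if not item_responses:
        raise ValueError("at least one item response is required")
    if any(not 0 <= r <= 4 for r in item_responses):
        raise ValueError("item responses must lie between 0 and 4")
    return sum(item_responses) / len(item_responses) * 25


# Hypothetical respondents answering all 27 items identically
score_no_difficulty = wrfq_score([4] * 27)    # 100.0
score_half_the_time = wrfq_score([2] * 27)    # 50.0
```

The same function applies to a subscale by passing only the items of that subscale.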
Design and sample
The sample was obtained from volunteers of a custom online panel (www.respondi.com) in Germany in 2018. Inclusion criteria for the online survey were an age of 18–64 years, having worked more than 12 hours per week in the 4 weeks prior to study participation, and adequate reading comprehension skills in German. Individuals on parental leave, retirees, and self-employed persons were excluded. Participants received small monetary incentives (T0: 1.50 €, follow-ups: 1 €).
We targeted a sample size of about 600 respondents for the cross-sectional survey at T0 to have a sufficient number of employees in the subsequent multivariate subgroup analyses. This sample size was considered appropriate for the construct validation because it exceeds the recommended rule of thumb of 10 cases per item of the WRFQ, i.e., n = 270 for 27 items [10]. To conduct reproducibility and responsiveness analyses, two follow-up measurements were performed at 1 week (T1) and 3 months (T2) after the baseline measurement at T0. For the T1 and T2 follow-ups, we targeted the participation of 50 and 100 individuals, respectively. To ensure stable conditions, we again checked the inclusion and exclusion criteria mentioned above. The usability of the online survey was pretested among five employees.
Since the main purpose of the WRFQ is to measure the extent to which workers experience difficulties in meeting work demands given a certain level of health, it was important to sample employees from different occupational settings. Therefore, equiproportional quota sampling was defined based on the following three occupational categories: 1. blue-collar workers (e.g., workers in the manufacturing and processing industry, and craft professions), 2. gray-collar workers (e.g., health care, support and medical assistance occupations, service professions in the areas of facility management, caretakers, cleaning and security services, warehouse, and trade), and 3. white-collar workers (e.g., social workers, clerks, and other professionals working in offices).
Instrument validation
The investigation of the measurement properties of the German WRFQ followed the COSMIN criteria [11] and consisted of the analysis of structural, convergent, and discriminant validity, floor and ceiling effects, internal consistency, reproducibility, and responsiveness. We aimed to replicate the Dutch validation study with no further development of the instrument. We therefore used the same methods as the working group of Abma et al. [5].
Structural validity
An exploratory factor analysis was carried out by principal component analysis with eigenvalue criterion and varimax rotation. The factor structure was defined by taking into account only items with loadings > 0.4 [12].
Convergent and discriminant validity
The following constructs and instruments were used for the convergent validity analysis: productivity assessed with the Endicott Work Productivity Scale (EWPS [13]), overall work ability with the single item derived from the Work Ability Index (WAI; ‘Assuming that the highest work ability you have ever had is 10, how would you rate your current work ability?’, 0 = absolutely unable to work to 10 = best work ability [14]), decision latitude and job demands with the Job Content Questionnaire (JCQ [15]), and general health with the respective single item derived from the 12-item Short Form Survey of General Health (SF-12) [16]. Convergent validity was determined by assessing the extent to which the strength of the correlations (Pearson or Spearman rho) of the WRFQ with similar constructs agreed with a set of pre-defined hypotheses. High discriminant validity was indicated by low correlations with non-related constructs. Correlations were classified as either small (0.15 ≤ r < 0.25), moderate (0.25 ≤ r < 0.35), or large (0.35 ≤ r) [17].
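The classification of correlation strength can be expressed as a small Python helper. This is a sketch under two assumptions that the text does not state: the absolute value of the coefficient is classified, and coefficients below 0.15 receive the hypothetical label "below small". The function name is also hypothetical.

```python
def classify_correlation(r):
    """Classify a correlation coefficient using the cut-offs from [17].

    Assumptions (not from the source): the sign is ignored, and values
    below 0.15 are labelled 'below small'.
    """
    size = abs(r)
    if size >= 0.35:
        return "large"
    if size >= 0.25:
        return "moderate"
    if size >= 0.15:
        return "small"
    return "below small"


# Hypothetical coefficients
label_a = classify_correlation(0.30)   # "moderate"
label_b = classify_correlation(0.18)   # "small"
```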
The hypotheses of the convergent validity (no. H1–3) and discriminant validity (no. H4 and H5) analyses were: A high WRFQ total scale value correlates …
- H1: … with a high work productivity value (EWPS scale; moderately to highly).
- H2: … with a good self-reported general health value (SF-12 item General health; moderately).
- H3: … with a good overall work ability (WAI item; moderately).
- H4: … with a high decision latitude (JCQ subscale; weakly).
- H5: … with low psychological job demands (JCQ subscale; weakly).
Both convergent and discriminant validity, measured by Spearman's correlation coefficient, are considered acceptable if at least 75% of the hypotheses are confirmed [10].
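The 75% rule can be illustrated with a short Python sketch. The function name and the example outcomes are hypothetical; the sketch assumes that all five hypotheses are counted together, in which case at least four of them must be confirmed.

```python
def validity_acceptable(confirmed, threshold=0.75):
    """Return True if the share of confirmed hypotheses reaches the threshold.

    confirmed: list of booleans, one per hypothesis (True = confirmed).
    """
    if not confirmed:
        raise ValueError("at least one hypothesis is required")
    return sum(confirmed) / len(confirmed) >= threshold


# Hypothetical outcomes for H1-H5
four_of_five = validity_acceptable([True, True, True, True, False])    # 80%
three_of_five = validity_acceptable([True, True, True, False, False])  # 60%
```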
Floor and ceiling effects of scales
Floor and ceiling effects of a scale were considered present if more than 15% of the responses were at the lowest or highest attainable scores of the scale, respectively [10].
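A minimal Python sketch of this check is given below. It assumes scale scores on the 0 to 100 range described for the WRFQ; the function name and the example scores are hypothetical.

```python
def floor_ceiling_effects(scores, lowest=0.0, highest=100.0, cutoff=0.15):
    """Flag floor and ceiling effects for a list of scale scores.

    An effect is flagged when more than `cutoff` (15%) of the respondents
    have the lowest or the highest attainable score.
    """
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    floor_share = sum(s == lowest for s in scores) / n
    ceiling_share = sum(s == highest for s in scores) / n
    return {"floor": floor_share > cutoff, "ceiling": ceiling_share > cutoff}


# Hypothetical sample: 2 of 10 respondents (20%) reach the maximum score
effects = floor_ceiling_effects([100.0] * 2 + [50.0] * 8)
```

In this example a ceiling effect is flagged and a floor effect is not.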
Internal consistency
The reliability of the items was analyzed by assessing Cronbach’s α, the intraclass correlation coefficient (ICC), and the inter-item and item-to-total correlations of the scales. Cronbach’s α and ICC values greater than 0.7 are considered appropriate for group comparisons [18]. Inter-item and item-to-total correlations were considered appropriate if they fell within the ranges of 0.2 to 0.8 and 0.3 to 0.9, respectively [19].
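For readers unfamiliar with the statistic, Cronbach’s α for k items is k/(k − 1) × (1 − sum of item variances / variance of the total score). The following Python sketch implements this textbook formula with sample variances; it is not the software routine used in the study, and the function name and data are hypothetical.

```python
import statistics


def cronbach_alpha(rows):
    """Compute Cronbach's alpha from respondent-by-item data.

    rows: list of respondents, each a list of k item scores (no missing data).
    Uses sample variances (n - 1 in the denominator).
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    item_variances = [statistics.variance([row[j] for row in rows])
                      for j in range(k)]
    total_variance = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: three items that agree perfectly across respondents
alpha_perfect = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
```

With perfectly consistent items, as in this toy example, α reaches its maximum of 1.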
Reproducibility
Reproducibility was assessed with the ICC and was considered acceptable at the group and individual level for ICC > 0.7 and ICC > 0.9, respectively [18]. Additionally, the standard error of measurement (SEM) was calculated as SDdiff/√2, where SDdiff is the standard deviation of the differences between the two measurements.
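The SEM formula can be sketched as follows in Python. The sketch assumes paired scores from the same respondents at two time points and uses the sample standard deviation of the differences; the function name and the scores are hypothetical.

```python
import math
import statistics


def sem_from_differences(first, second):
    """Standard error of measurement: SD of the paired differences / sqrt(2)."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    differences = [b - a for a, b in zip(first, second)]
    return statistics.stdev(differences) / math.sqrt(2)


# Hypothetical test-retest scores of three respondents
sem = sem_from_differences([50.0, 60.0, 70.0], [52.0, 60.0, 68.0])
```

Here the differences are 2, 0, and −2, their standard deviation is 2, and the SEM is therefore 2/√2 ≈ 1.41 scale points.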
Responsiveness
The sensitivity of the instrument to measure changes between T0 and T2 was evaluated by comparing the mean changes of the WRFQ and of the overall work ability (global item). In addition, the responses to two additional items at T2, the so-called global perceived effect (GPE) items, which measure the extent to which respondents perceived changes in their mental and physical work ability since baseline (e.g., ‘to what extent has your work ability changed regarding the mental demands at work in the last 3 months?’, 1 = much better, 5 = much worse) were examined.
The mean change of the WRFQ scores was estimated for the total scale and subscales by calculating the mean differences between T0 and T2 and the respective standard deviations (SDs). The standardized response mean (SRM; ratio between the mean change score and its SD) was calculated for all scores (WRFQ total and subscales). Furthermore, the WRFQ mean changes were correlated with mean changes of work ability and the respective GPE items using Spearman's rho.
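The SRM calculation can be sketched in Python as follows. The labels use the effect size categories applied in this study [20]; classifying the absolute value of the SRM is an assumption, and the function names and scores are hypothetical.

```python
import statistics


def standardized_response_mean(baseline, follow_up):
    """SRM: mean of the change scores divided by their standard deviation."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    changes = [b - a for a, b in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(changes)


def classify_srm(srm):
    """Label an SRM; the sign is ignored (assumption, not from the source)."""
    size = abs(srm)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "trivial"


# Hypothetical T0 and T2 scores of four respondents
srm = standardized_response_mean([50.0, 60.0, 70.0, 80.0],
                                 [55.0, 62.0, 78.0, 85.0])
srm_label = classify_srm(srm)
```

In this toy example the change scores are 5, 2, 8, and 5, giving a mean change of 5, an SD of √6, and an SRM of about 2.04.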
SRM effect size categories were defined as < 0.2 (trivial), ≥ 0.2 to < 0.5 (small), ≥ 0.5 to < 0.8 (moderate), and ≥ 0.8 (large) [20]. An at least moderate correlation between the WRFQ measurement change and the change of work ability between T0 and T2 was expected, as well as stable responses in a large part of the sample. On the basis of this set of change measures, the following hypothesis was formulated: