Burnout is related to huge costs, for both individuals and organizations and is recognized as an occupational disease or work-related disorder in many European countries. Given that burnout is a major problem it is important to measure the levels of burnout in a valid and reliable way.
The Burnout Assessment Tool (BAT) is a newly developed self-report questionnaire to measure burnout. So far, studies concerning the psychometric properties of the original version of the Burnout Assessment Tool (BAT) including 23 items show promising results and suggest that the instrument can be used in many different settings.
For various reasons there is a need of a shorter instrument. For example, burnout questionnaires are typically included in employee surveys to evaluate psychosocial risk-factors, which according to the European Occupational Safety and Health Framework Directive, should be carried out in organizations on a regular basis. The aims of this paper are to develop a shorter version of the BAT, including only 12 items (BAT12) and to evaluate its construct validity and differential item functioning regarding age, gender and country.
Using data from representative samples of working populations in the Netherlands and Belgium (Flanders) a shorter version of the BAT was developed by combining quantitative (Rasch analysis) and qualitative approaches (item content analysis and expert judgements). Construct validity of the new BAT12 was evaluated by means of Rasch analysis.
In an iterative procedure, deleting one item from each subscale at each step, a short version of the BAT – BAT12 was developed. The BAT12 fulfils the measurement criteria according to the Rasch model after accounting for local dependency between items within each subscale. The four subscales can be combined into a single burnout score.
The new BAT12 developed in the present study maintains the breath of item content of the original version of the BAT. The new BAT12 has sound psychometric properties. The scale works invariantly for older and younger, women and men and across two countries. A shorter version of the BAT is timesaving compared to the BAT23 and can be used in e.g. employee surveys.
Burnout is a metaphor that refers to the loss of mental energy, or more specifically to a work-related state of mental exhaustion [1, 2]. Since its introduction, at the end of the 1970s it has been studied extensively and a recent overview showed convincingly that burnout is related to huge costs, both for the individual as well as organizations . For instance, burnout is associated with poor physical and mental health of employees such as, cardio-vascular disease, anxiety and depression, musculoskeletal disorders, insomnia, and psychosomatic complaints. For organizations, burnout leads to high replacement costs due to turnover, sickness absence and work incapacitation, but also to poor business outcomes in terms of job performance, occupational safety, service quality, and productivity. Not surprisingly, burnout is recognized as an occupational disease or work-related disorder in many European countries, including Belgium and the Netherlands . According to the European Occupational Safety and Health Framework Directive (89/391/EEC-OSH) employers must evaluate all the risks to the safety and health of their workers. In order to comply with this directive a valid and reliable instrument should be available that can be used to measure employee’s burnout levels.
To date, the Maslach Burnout Inventory (MBI)  is almost universally used to measure burnout. It is estimated that in 88% of all scientific papers on burnout, the MBI is the instrument of choice . The MBI builds on the definition of burnout as a syndrome of emotional exhaustion, depersonalization and reduced personal accomplishment , later denoted as exhaustion, cynicism and lack of professional efficacy, respectively .
In fact, because of the dominance of the MBI, burnout is what the MBI measures, and vice versa. Needless to say, that this circularity of concept and assessment is undesirable because it hinders fresh and innovative research that increases our understanding of burnout. Moreover, the MBI has been developed as a research instrument and not as a tool to assess levels of burnout in organizations. Even more importantly, over the years a number of conceptual as well as technical concerns have been raised against the use of the MBI. For instance, it was argued that rather than a constituting element, poor professional efficacy should be considered a consequence of burnout . Conversely, it was maintained that impaired cognitive functioning is wrongly not taken into account as an indicator of burnout in the MBI . Psychometrically speaking the MBI has been criticized – amongst others – for: (1) skewed answering patterns that may affect its reliability; (2) reversing positively worded items for evaluating a negative psychological state; (3) producing three different subscale scores instead of a single, composite burnout score .
Hence, recently an alternative self-report questionnaire has been developed that effectively addresses these conceptual and technical issues; the Burnout Assessment Tool (BAT)  (see also, www.burnoutassessmenttool.be). The BAT includes four subscales: exhaustion (i.e., extreme tiredness); mental distance (i.e., psychological withdrawal); cognitive and emotional impairment (i.e., reduced ability to regulate one’s cognitive and emotional processes, respectively). So far, the psychometric features of the BAT seem encouraging. For instance, using nationally representative samples of seven different countries, De Beer et al.  showed that the postulated second-order factor structure with all four BAT subscales loading on one general, composite score was invariant across countries. In a similar vein, a rigorous Rasch analysis attested the one-dimensionality of the BAT , suggesting that a single score can be used to assess the employee’s level of burnout. These results confirm that, contrary to the MBI, the BAT conceives burnout as a syndrome that consists of set of related symptoms that refer to one underlying psychological condition. Furthermore, multitrait-multimethod analyses showed convergent and discriminant validity of the BAT vis-à-vis other burnout measures, such as the MBI [10, 13]. That is, although the BAT partially overlaps with other burnout instruments, its distinctiveness was established as well. Finally, results suggest that burnout as assessed by the BAT can be discriminated from other, related aspects of employee well-being such as work engagement, job boredom and workaholism [10, 13].
Typically, burnout questionnaires are included in employee surveys to evaluate psychosocial risk-factors, which according to the European Occupational Safety and Health Framework Directive should be carried out in organizations on a regular basis. Because employee surveys tend to be rather long and employers usually impose time constraints for surveying employees during their work time, there is increasing pressure on researchers to develop valid, reliable, yet short measures without redundant items . Such concise measures also reduce participant’s fatigue, frustration, and the likelihood of refusing to participate because the survey is perceived to be too long and time consuming .
Shorter instruments are requested for both academic research purposes and for practical purposes such as employee surveys as they may be more efficient from a practical point of view. Construction and use of short scales bring together theoretical, statistical, and practical perspectives. Removing items from psychometrically sound instruments goes against the traditional psychometric view that many items are needed for reliable and valid measurement  and might result in lower reliability and poorer construct coverage. With the growing need and use of short questionnaires, there is also a growing literature regarding scale-shortening strategies and their consequences [17,18,19,20,21,22,23,24,25,26,27]. The specific number of items included or required in a questionnaire is of course contextual and depends on the complexity of the construct of interest. Therefore, general recommendations regarding the number of items in a questionnaire are perhaps not always useful.
Different strategies for scale-shortening and choosing the number of items, are available and include both quantitative (data-driven) and qualitative (theoretical driven) strategies. One data-driven strategy is to select items that maximize the internal consistency coefficient, which sometime might be associated with the risk of narrowing the construct coverage. The Rasch analysis is another strategy. If a set of items fits a Rasch model, then any subset of these items would also fit the model. As a consequence, the selection of the items can be very easy, given that there is no local dependency and/or differential item functioning. In a previous study  we have shown that the BAT23 fits the Rasch model, after accounting for local dependency by combining the items within each subscale into four testlets. However, due to local dependency it was not possible to choose any subset of items out of the original 23 items.
In order to shorten the BAT, we used a combination of a quantitative (Rasch analysis) as well as qualitative approach (item content analysis and expert judgement). The Rasch measurement model, usually referred to as Rasch analysis, belongs to the item response theory or modern test theory as opposed to the classical test theory. The Rasch analysis can be used for shortening scales as it provides information about each item in several ways and in that way helps in selecting items that improve the short scale’s accuracy [17, 18, 28]. Besides a statistics-driven strategy, a qualitative analysis based on item content analysis was employed as well as judgmental strategy. This will guide the selection of items on the basis of expert judgment on which items best cover the construct of interest, and with the goal to preserve the same content validity as in the original version of the BAT.
An equal number of items in each subscale is recommended when a questionnaire is used as self-assessment and the respondent is also scoring and interpreting the test results on its own . In that way the scoring of the test would be less complicated and more transparent to the test user. To fulfil these requirements, the minimum number of items within each subscale was set to three. The choice of three items was a matter of balance as three out of the four BAT subscales consist of five items each. Reducing to four instead of five items is not really time saving for the respondents and two items per subscale are perhaps not enough to ensure appropriate construct coverage within each subscale.
The current paper has four aims: (1) to shorten the original 23-item version of the BAT to a version that includes only 12 items, 3 items for each subscale, using a combination of content and Rasch analysis; (2) to evaluate construct validity of the BAT12 using Rasch analysis; (3) to evaluate whether the items of the BAT12—like those of the BAT23—can be combined into a single burnout score; (4) to evaluate possible differential item functioning of the BAT12 regarding gender, age and country.
Study design and population
Analysis was done using the original Dutch version of the BAT23 and data from two representative samples of the working populations in terms of age, gender and industry in the Netherlands and Flanders, Belgium, respectively (n = 1500 from each country). Prior to filling out the questionnaire, all participants were informed about the purpose of the study, that participation was voluntary, that they could stop at any moment if they wished to do so, that questions could be directed to a contact person (name and email address provided) and that complaints could be filed with the ethical committee (email address provided). Moreover, participants declared that they agreed with these terms by clicking on “next”. This study was approved by the ethical committee Sociaal Maatschappelijke Ethische Commissie (Social and Societal Ethics Committee (SMEC) of KU Leuven) (https://www.kuleuven.be/english/research/ethics/committees/smec) on June 16, 2016 (reference number: G-2016 06 2027). All the methods were in accordance with the declaration of Helsinki or in accordance with relevant national/institutional guidelines.
Details about the sampling procedure and sample characteristics are described in previous studies [10, 12] as well as in the BAT test-manual . Complete cases on all items were obtained for n = 2978 (NL = 1500, FL = 1478) and these were considered for analyses.
In this study differential item functioning (DIF) was evaluated for age, gender and country. Given the same level of the latent trait (i.e. burnout), the scale should function identically for all comparable groups. In the presence of DIF, comparable groups (e.g., women and men) score differently on a specific item, even though they have the similar levels of burnout. An equal number of cases within each of the compared groups is recommended to ensure that if there was DIF, one group does not dominate in the estimates of parameters . Given that the data material was large, it was possible to adapt a cross-validation strategy (i.e. analyze data twice using two random subsamples) to check the robustness of the results. Therefore, the total sample was divided into 4 homogenous strata of men/NL, men/FL, women/NL and women/FL. Then a random sample of 200 respondents from each stratum was drawn twice, resulting in two subsamples of 800 individuals. Age was divided by the median. Median age in the two samples was 41.
The burnout assessment tool (BAT)
The Burnout Assessment Tool (BAT) is a self-report a questionnaire consisting of 23 items (see Additional file 1) that includes four dimensions: exhaustion (EX; 8 items), mental distance (MD; 5 items), cognitive impairment (CI; 5 items) and emotional impairment (EI; 5 items). All items are expressed as statements with five frequency-based response categories (1 = never, 2 = rarely, 3 = sometimes, 4 = often, 5 = always). Detailed information about the development of the BAT is provided elsewhere  as well as in the BAT test-manual . A previous study showed that the BAT23 has sound psychometric properties and satisfies the requirements for the measurement criteria according to the Rasch model when subscales are used instead of the individual items . The BAT score also works invariantly for women and men, younger and older respondents, and across both countries.
The shortening of the scale was done by combining quantitative (statistics-driven) and qualitative strategies.
The qualitative strategy incorporated item content analysis, also known as a subject matter analysis , as well as expert judgement strategy. Subject matter analysis classifies items into five categories: (1) no problems with the item, (2) wording errors, (3) wording similar to one or more other items, (4) item measures the same characteristic as one or more other items, and (5) item is an unclear measure of the construct. All items were analyzed by the first author. Then all authors read and commented the results and adjustments were made until all authors agreed on the classifications for all items.
The statistical part of the analysis was based on Rasch analysis. Short introductions to Rasch analysis can be found elsewhere [32,33,34] and a comprehensive overview of the statistical theory of Rasch models is presented in a textbook . The goal of the Rasch analysis is to evaluate whether the observed data satisfy the assumptions of the Rasch model, in which case the questionnaire has solid psychometric properties. The advantage of the Rasch model over classical test theory approaches such as factor analysis, is that the normal score distribution of items is not required.
The following four item fit indicators were examined: (1) the item’s ability to discriminate (based on item fit residuals expected to range between ± 2.5 and χ2 statistic); (2) appropriateness of the response categories (threshold ordering); (3) response independence relative to other items (residual correlations); (4) and the absence of differential item functioning (DIF) for age, gender and country. Absence of DIF means that given the same level of burnout, items should function similarly for all comparable groups (women and men, older and younger age, NL and FL). Any residual correlation between the items 0.2 above the average observed correlation is indicative of response dependency .
Besides item fit, the overall fit to the Rasch model was evaluated at each step by means of summary fit statistics: the item-trait interaction statistic (non-significant χ2 statistic), and person and item fit residuals (expected values around zero mean and SD of 1). The internal consistency and the power of the scale to discriminate among respondents were evaluated with the Person Separation Index (PSI). The PSI is similar to Cronbach’s alpha with a range of 0 to 1. Dimensionality of the scale was tested by Smith’s test of unidimensionality . For this test, first a principal component analysis (PCA) on residuals is performed. Next, items loading positively and negatively on the first principal component are used to obtain an independent person estimate. In the next step, independent t-tests for differences among these estimates for each person were performed . Less than 5% of such tests being outside the range of ± 1.96 support the unidimensionality of the scale. A 95% binomial confidence interval of proportions  was used to show that the lower limit of the observed proportion is below the 5% level .
Analyses were performed independently on the two random samples, each consisting of 800 participants. A sample size of 800 is sufficient to yield a high degree of precision . All analyses were done in RUMM2030 . The item’s fit was analyzed using the partial credit model for polytomous cases . To control for the large number of comparisons, the significance level was set at 0.01 and Bonferroni adjusted.
When local dependency was detected, a method of combining correlated items into testlets was applied as suggested by Marais and colleagues [42,43,44]. According to this method, correlated items are combined into one or more testlets (preferably based on theoretical reasoning) and the data are re-analyzed using testlets instead of individual items. The testlets’ model fit was compared with the fit obtained from the BAT12 analysis. The latent correlation among the subscales was also calculated, as well as the proportion of the non-error common variance accounted for when the testlets were added to constitute a total score (also known as explained common variance) [42, 45, 46].
DIF was evaluated using ANOVA on standardized residuals which enables separate estimations of misfit along the latent trait, uniform and non-uniform DIF. Uniform DIF implies a systematic difference in the response to an item that is consistent across the entire range of the latent trait of burnout. Non-uniform DIF means an interaction and implies that the magnitude of DIF is not consistent across the continuum of latent trait. The distinction between real and artificial DIF, and the magnitude and the impact of DIF were investigated by the methods recommended by Andrich and Hagquist [30, 47, 48]. In addition to formal tests, DIF was also evaluated graphically by means of the item characteristic curve.
Targeting (distribution of the persons and the items on a common logit scale) was evaluated visually by the person-item-threshold graph. Targeting is an aspect of how well the BAT12 items are targeted for severity levels of burnout as reported by the respondents. This is important for the precision of the person estimates. In case of good fit to the Rasch model, person estimates from the Rasch analysis, which are logits, can be transformed into a convenient range (in this case values in the range 1 to 5), henceforth referred to as metric score .
The shortening procedure
Items were eliminated iteratively, one at a time from each subscale, re-running the Rasch analysis after each step and comparing summary fit statistics. The first analysis included all 23 items (BAT23). Thereafter, one item from each subscale was eliminated at each step and the process was repeated until three items remained in the subscales MD, CI and EI and six items remained in the EX- subscale (BAT19 and BAT15). The procedure then continued by eliminating one item from the EX-subscale at each step until three items remained in this subscale as well (BAT14, BAT13 and BAT12). As a result, the final model (BAT12) included the same number of items within each subscale. Lastly, the BAT12 was fitted to the Rasch model and evaluated using summary and item fit statistics.
Items that were judged to perform poorly based on one or more of the item fit indicators and/or results of the subject matter analysis were selected as candidates for elimination at each step. In cases when the decision about item elimination could not be reached based on item fit statistics and/or subject matter analysis, two additional criteria were used. One criterion was to investigate item and threshold locations (positionings) on a latent trait in order to maximize the spread of the items across the latent burnout continuum. Thresholds partition the latent continuum of each item into ordered categories and are the points between any two adjacent categories in which the conditional probability of either response is equally likely. The other criterion was to evaluate the meaning and the content of the item and keep items that were judged important theoretically, to ensure proper content validity of each subscale.
Results, sample 1
Subject matter analysis
All BAT items were judged using categories (1) to (5) described above. The BAT is a newly developed questionnaire and the conceptualization, construction as well as the items formulation were described in detail in a recently published paper  and in the BAT manual . All items were considered as relevant for measuring the burnout construct and none of the 23 items were classified into category (5; ‘item is an unclear measure of the construct’). Furthermore, none of the items were classified into category (3; ‘wording similar to one or more other items’).
Six of the eight items (items EX2 through EX7) were categorized as (1) ‘no problems with the item’. The contents of item EX1 (At work, I feel mentally exhausted) and item EX8 (At the end of my working day, I feel mentally exhausted and drained) are not identical, but partly overlap as both items address aspects of mental exhaustion. They were therefore classified as (4) ‘item measures same characteristic as one or more other items’.
Items MD1 (I struggle to find any enthusiasm for my work) and MD4 (I feel indifferent about my job) could be classified as (4) – items that measure the same characteristic –but in opposite directions, expressed in positive (enthusiasm) and negative (indifferent) terms, respectively. On the other hand, one could also argue that these are actually different aspects of mental distance and thus the items could alternatively be classified as (1) no problem with the item. The remaining items MD2, MD3 and MD5 were classified as (1) ‘no problems with the item’.
Items CI2, CI3 and CI5 were classified (1) ‘no problems with the item’. Items CI1 (At work, I have trouble staying focused) and CI4 (When I’m working, I have trouble concentrating) could be classified (4) ‘item measures same characteristic as one or more other items’.
Items EI2 through EI4 were classified as (1) ‘no problems with the item’. Items EI1 (At work, I feel unable to control my emotions) and EI5 (At work, I may overact unintentionally) were classified as related but also different and therefore both classified as (1) ‘no problems with the item’. EI1 is more general than EI5 as it includes not only one’s behavior (i.e., acting) but also one’s feeling, which not necessarily need to be acted out.
The Rasch analysis
An initial Rasch analysis included all 23 items (BAT23). In the next two steps (analyses BAT19 and BAT15), four items were deleted at each step (one item from each subscale). At this stage of the analysis, the mental distance, and the cognitive and emotional impairment subscales were reduced to three items each, while the exhaustion subscale still had six items. Thus, the next goal was to further reduce the number of exhaustion items from six to three, deleting one item at each step (analyses BAT14, BAT13 and BAT12).
Overall fit statistics for each step are shown in Table 1. As seen in the Table 1, a deviation from the Rasch model was observed in the BAT23 analysis with a significant χ2 statistic. The values of item and person residuals means and SDs were higher than the expected value of 0 and 1 respectively. The PSI values was high (0.95) and the Smith’s test indicated problems with unidimensionality.
Compared to the first analysis, the value of the χ2 statistic dropped somewhat in the BAT19 analysis, but it was still high and significant. The mean and SD of the person and item fit residuals changed only marginally as well as the value of PSI, compared to the initial BAT23 analysis. The percentage of significant t-tests used for Smith’s test of unidimensionality decreased from 21 to 17 but is still very high compared to the expected value of 5, given local independence. The value of the χ2 statistic decreased with every step. For BAT13 and BAT12 the p-value approached the cut-off value of 0.01, so the fit to the model improved considerably. As expected, with decreasing the number of items, the PSI values also decreased at each step.
As mentioned above, four item fit indicators were evaluated at each step and used as criteria for deletions of items (i.e., threshold ordering, item fit residuals, residual correlations between items and DIF). All item fit statistics for the BAT23 analysis were published in a previous study  and therefore not repeated here. Item fit residuals for analyses BAT19 through BAT12 are shown in Table 2. The complete residual correlation tables for each analysis can be found in Additional file 2. DIF tables for age, gender and country are found in Additional file 3. The results for each subscale are presented below. In summary, item fit statistics identified several items with item fit residuals outside the predefined range of ± 2.5, high residual correlations between items within each subscale and some DIF issues. High residual correlations between item pairs belonging to different subscales were not observed in any of the analyses, indicating that item overlap is not an issue. All items had ordered thresholds in all analyses (BAT23, BAT19, BAT15, BAT14, BAT13, BAT12).
The shortening of the BAT
Among the eight exhaustion items, item fit residuals outside the predefined range of ± 2.5 were observed for items EX2, EX7 and EX8 (BAT23 analysis). DIF for age and gender was observed for item EX8. None of the items showed DIF between the two countries. Residual correlations indicated high values between several exhaustion items. Item EX8 was chosen as candidate for elimination due to misfit according to the item fit statistic, the high residual correlation with other items and DIF issues. According to the item content analysis, items EX8 and EX1 were classified into (4) ‘the item measures the same characteristic as on or more other items’. Tellingly, these items also had the highest observed residual correlations. Since both analytical strategies were pointing at the same item, the decision was made to eliminate item EX8.
In the next step (BAT19 analysis) items EX2, EX6 and E7 had problems with item fit residuals. For item EX7 misfit due to the item fit residual was also statistically significant. No DIF issues were found. The critical value for residual correlations in this analysis was 0.15, and several pairs of items exceeded that value: EX1-EX3 0.16, EX1-EX4 0.23, EX3-EX4 0.17, EX3-EX5 0.20, EX4-EX7 0.18. The decision was to discard EX7 (When I exert myself at work, I get tired quicker than normal) in the next step.
In the BAT15a analysis, only item EX2 had a high item fit residual -3.2 (Table 2). Residual correlations above the critical value (in the analysis > 0.13) were found for item pairs EX1-EX4 0.19, EX3-EX5 0.17 and EX3-EX4 0.14. Although the correlation between EX1 and EX4 was somewhat higher than the critical value, both items were kept in the analysis after item content evaluation, because they covered the mental and physical aspects of exhaustion, respectively (EX1: At work, I feel mentally exhausted and EX4: At work, I feel physically exhausted). Items EX3 (After a day at work, I find it hard to recover my energy) and EX5 (When I get up in the morning, I lack the energy to start a new day at work) had the lowest locations and are kept for that reason (Fig. 3). Thus, the decision here was to eliminate EX2. Evaluating the content of item EX2 (Everything I do at work requires a great deal of effort) strengthened the decision because it was judged that this aspect does not necessarily need to imply exhaustion.
In the next step (BAT14 analysis) all exhaustion items had item fit residuals within the predefined range and showed no DIF regarding age, gender or country. Residual correlations above the value of 0.13 were indicative of local dependence and were found for items EX1-EX4 0.20, EX3-EX5 0.18 and EX3-EX4 0.15. In the previous step, the decision was made to keep both items EX1 and EX4; so possible candidates for elimination at this stage were items EX3 and EX5. The fit of these items was not bad according to the item fit statistics, even though the residual correlations were slightly higher than expected. Both items were similar in terms of locations. Thus, the decision on which item to delete in the next step was based on content analysis. Item EX5 (When I get up in the morning, I lack the energy to start a new day at work) was chosen for elimination since lacking energy in the morning does not necessarily imply that the cause of energy lack is work-related, whereas the content of EX3 (After a day at work, I find it hard to recover my energy) is work-related.
In the last step (BAT13 analysis) item EX3 had an item fit residual of 2.9. None of the remaining four EX items showed any signs of DIF. Once again, the highest residual correlation was found for the pair EX1-EX4 0.20. Above the critical value of 0.12 were also item pairs EX1-EX3 0.15 and EX3-EX4 0.17. Based on the item fit residual and the residual correlations with the other items, EX3 was noted as possible candidate for exclusion. Moreover, although no misfit was found regarding the item fit indicators, item EX6 (I want to be active at work, but somehow, I am unable to manage) was also considered as a candidate for removal based on item threshold locations, presented in Additional file 4 (BAT13). Item EX6 had a somewhat higher location than items EX1, EX3 and EX4, which had the lowest locations of all 13 items. The decision was to keep EX3 and exclude EX6 in the final step, arriving at twelve items to improve the coverage of items across the latent trait.
In the BAT23 analysis, items MD1, MD2 and MD3 showed misfit according to the item fit residual and MD2 had a significant item χ2 statistic. DIF for gender was observed for item MD4 (women scored lower than men) and for country for item MD2 (FL lower than NL). High residual correlations were observed for many pairs of items. The decision was taken to delete MD2due to misfit on multiple indicators.
In the next step (BAT19 analysis) misfit according to item fit residuals was found for items MD1 and MD3. DIF for gender was observed for item MD4 (women rated lower than men). The highest residual correlation (0.34) was observed between MD1 (I struggle to find any enthusiasm for my work) and MD4 (I feel indifferent about my job). MD4 was selected for removal, based on DIF issues and the highest correlations with the other items and the subject matter analysis.
In the BAT23 analysis, item CI2 had a high negative item fit residual DIF for country was noted for item CI3 (NL rated consistently lower than FL, given the same burnout level). High residual correlations were found for all item pairs. The highest correlation was found between items CI2 (At work I struggle to think clearly) and CI4 (When I’m working, I have trouble concentrating). Although the item CI4 was also a possible candidate according to the item content analysis, the decision was made to remove the item CI2 in the next step due to misfit to multiple item fit indicators.
All CI items have good fit according to the item fit statistics in the second step (BAT19 analysis). DIF between countries was found for item CI3 only (NL rated lower than FL). Residual correlations between all pairs were higher than expected under the condition of local independency. The highest correlation (0.43) was found between items CI1 (At work, I have trouble staying focused) and CI4 (When I’m working, I have trouble concentrating), and 0.34 between items CI3 (I am forgetful and distracted at work) and CI5 (I make mistakes in my work because I have my mind on other things), respectively. The decision here was to delete CI3 in the next step because the DIF and the correlation with CI5.
In the initial analysis (BAT23) the emotional impairment items EI3, EI4 and EI5 had high item fit residuals. Item EI4 also had problems with class intervals and DIF for country. High residual correlations were found between all pairs of EI items, except for EI2-EI3, which was just below the cut-off of > 0.16. The two highest correlations were observed for the item pairs EI2-EI4 and EI1-EI4 respectively. Based on these findings EI4 was removed.
In the next step (BAT19 analysis) items EI3 and EI5 had high item fit residuals. No DIF issues were observed. The residual correlations between all item pairs were above 0.15, except for the pair EI2-EI3, which was 0.15. The highest residual correlation was 0.41 between EI1 (At work, I feel unable to control my emotions) and EI2 (I do not recognize myself in the way I react emotionally at work) followed by 0.39 between EI-EI5 and 0.33 between EI3-EI5. Item EI5 was also suggested as a candidate for deletion according to the subject matter analysis. It was not an easy decision, so both EI3 and EI5 were tested as candidates to discard in the next step, one at a time. Therefore, in the next step, two combinations of BAT15 items were tested (BAT15a – EI3 deleted and BAT15b – EI5 deleted).
Item fit residuals from both analyses are shown in Table 2. In the BAT15a analysis item EI5 had a high item fit residual and in the BAT15b analysis it was item EI3, which showed a significant χ2 statistic. No DIF issues were noted and residual correlations from the two analyses were comparable in pattern and magnitude (see Additional file 2 and Additional file 3). The summary fit statistics from the two analyses were also comparable, except for the χ2 statistic, which was lower for BAT15a (Table 1). Thus, based on the significant item fit residual for EI3 and the lower χ2 statistic in the BAT15a, the decision was to exclude EI3 and to keep EI5 in the emotional impairment subscale.
Before taking a final decision, item and threshold locations (positionings) on a latent trait were also examined. In Additional file 4 item and threshold locations from both analyses are presented. The same information is visualized in Fig. 1, were items from the BAT15a are rank-ordered from bottom to top, indicating increasing severity according to their locations on the latent burnout continuum (shown at the bottom of the plot ranging from -4 to 4 on a logit scale, with higher values indicating higher levels of burnout). All three EI items are found at the bottom of the list, meaning that they had the highest locations, while the EX items are found at the top. Looking at item thresholds, the intended increasing levels of severity across the response categories (1 = never to 5 = always) is reflected in the data for all items; hence the model expected Guttman structure is confirmed. The thresholds’ locations for the EI items are higher than, for example, the EX items, meaning that the highest response categories for the EI items are endorsed at higher levels of the latent burnout trait, compared to the corresponding categories for the EX items. Corresponding results from the BAT15b analysis are shown in Additional file 3 (BAT15b) and Fig. 2. The results displayed in Fig. 2 show that in terms of locations, EI3 is more similar to the EX items than the other two EI items. Therefore, the decision was to keep EI5 instead of EI3, in order to improve the spread of item locations across the latent trait.
The construct validity of the BAT12
The final step was to test the fit of the BAT12 to the Rasch model. The new BAT12 included three items from each subscale, namely: EX1, EX3, EX4, MD1, MD3, MD5, CI1, CI4, CI5 and EI1, EI2, EI5 (see Additional File 1 for item formulations).
Compared to the initial BAT23 analysis, the item and person fit residuals were at approximately the same levels and were higher than the expected zero and one values, respectively (Table 1). The value of the χ2 statistic has decreased to approximately 160, with a p-value that was still significant. Item MD3 showed DIF for gender but was not further investigated due to problems with local dependency. Problems with local dependency were indicated by the test for unidimensionality, where the lower confidence bound was > 5% and further confirmed by the item residual correlations. As expected, due to the reduction in items and problems with local dependency, the PSI value decreased from 0.95 to 0.91. High item fit residuals were found for items EX3, MD1, MD3 and EI5 (Table 2).
The pattern of residual correlations mapped to the underlying conceptual structure of the BAT and its four subscales: exhaustion, mental distance, cognitive and emotional impairment (see correlation table in Additional file 2, BAT12). Consequently, to account for the local dependency, the items were grouped into subscale-based testlets, and the analysis was re-run (Table 1, BAT12 4 testlets). The testlets analysis resulted in a good fit to the Rasch model according to the overall fit statistics. Although the number of items was almost halved, the targeting of the BAT12 (Fig. 3) and BAT23  were similar. As seen in Fig. 3, showing the person-item distribution, there is a group of participants with very low burnout levels (-2 on logit scale) and these are lower levels of burnout than measured by the items. The same information is also illustrated by the person mean -1.04 (SD 1.045) compared to the item mean, which is constrained to 0. Thus, the targeting was not optimal, but still acceptable.
The average latent correlation between the four testlets was 0.71, explained common variance was 93%, which is further evidence that the responses on the four subscales can be summarized into a single score.
DIF for gender was found for testlets EX and MD and DIF for country for testlet CC. In the next step, the variable with the highest F-value e.g. MD testlet was split for gender which resulted in a disappearance of gender DIF for the exhaustion testlet and indicated artificial DIF. Artificial DIF is further confirmed by the non-significant difference between the MD location values for women and men in the DIF resolved analysis (0.03 and -0.07 respectively, p-value 0.03). The same procedure was repeated for country DIF and cognitive impairment, also showing artificial DIF (results not shown). The conclusion was that no adjustments for DIF were needed.
Results, sample 2
The process of shortening from BAT23 to BAT12 was repeated using sample 2, by way of cross-validation. The shortening of the scale resulted in the selection of the same BAT12 items (EX1, EX4, EX5, MD1, MD3, MD5, CI1, CI4, CI5 and EI1, EI2, EI5). In each step, items were chosen for elimination based on the same the criteria and logical reasoning as for sample 1, and therefore not explained in detail here. (Details about this analysis can be obtained upon request from the first author.)
The summary fit statistics and item fit residuals for each step are shown in Tables 3 and 4, respectively. The residual correlations and DIF tables are shown in Additional file 2 and Additional file 3 respectively. As in the sample 1 analyses, the residual correlations mapped into the BATs four subscales and high residual correlations between item pairs belonging to different subscales were not observed in any of the analyses. All items had ordered response categories in all analyses.
The average latent correlation between the four testlets (0.67) explained 92% of the variance of the BAT12.
Ordinal-to-interval conversion table
Person scores from the Rasch analysis are used to transform the mean values of the BAT12 which are ordinal scores, into metric, interval-level scores. This was possible given the good fit of the BAT12 to the Rasch model after accounting for local dependency. Person scores can take both negative and positive values since they are situated on a logit scale, which can be hard to interpret compared to the original 1–5 range. For that reason, the logit person scores are linearly transformed into 1–5 interval scores. In Table 5 we provided interval scores in both logit units and in a 1–5 range, allowing users of the BAT to convert the ordinal mean score into interval-level (metric) scores. The total sample (n = 2,978) was used for the score calculation which increases the precision of the scores.
The aim of this study was fourfold. Combining quantitative (Rasch analysis) and qualitative approaches (subject matter analysis and expert judgement), the original 23-item version of the BAT (BAT23) was shortened to 12 items (BAT12), which was the first aim. Using the Rasch analysis, the BAT12 construct validity consisting of the four subscales was evaluated in the second aim. The results showed that the BAT12 has good psychometric properties after adjusting for local dependency between the items within each subscale. The BAT12 fulfils the criteria required by the Rasch measurement model when subscales are used instead of item scores. Thus, the BAT12 quantifies a latent trait of burnout. The results show that the items of the BAT12, like those of the BAT23, reflect the scoring structure indicated by the developers of the scale and the BAT’s four subscales can be summarized into a single burnout score (third aim). The BAT12 works in the same way (invariantly) for women and men, younger and older and across both countries, meaning that the fourth aim was also achieved. Finally, given good fit to the Rasch model, the mean scores of the BAT12 have been transformed into interval metric scores, which allows the use of parametric statistical techniques.
Shortening of the scale
Shortening of questionnaires in a proper way and testing different aspects of validity is an important, yet often neglected issue . For time saving reasons, survey researchers are often forced to develop the shorter scales themselves, which gives rise to several problems. First, psychometric properties of these newly developed instruments are not always tested or available for a larger audience. Secondly, several short versions might be developed, which hinders comparisons between different populations and over time. Thirdly, in some cases the psychometric properties of the original version of the instrument could also be questioned.
The advantage of the BAT12 established in the present study, is that the original version of the BAT was developed as an alternative self-report questionnaire that effectively addressed the conceptual and technical difficulties associated with the MBI, the most commonly used instrument to measure burnout. Incorporating the results of the intensive burnout literature during the past four decades, along with clinical experiences, an updated burnout definition was presented . This resulted in a new measure: the Burnout Assessment Tool (BAT). The unique feature of this newly developed measure is that the BAT has a strong conceptual basis, promising psychometric properties and that the instrument has been anchored to be used globally .
In a literature review regarding scale-shortening strategies, Kruyen et al. highlighted that researchers often aim to maximize the reliability coefficient . Cronbach’s alpha is probably the most reported reliability coefficient of internal consistency, and the shortening strategies are often aiming to achieve a coefficient alpha above a certain value as prescribed by rules of thumbs. This is of course not unproblematic. Strategies involving maximizing the alpha coefficient result in selecting items that are highly correlated, excluding lowly correlating Items. Highly correlated items are however often similar in content. This way of selecting items tends to narrow construct coverage. Moreover, items that are similar in content might cause a bias, when respondents noticing these similarities purposively (but incorrectly) match the responses from similar items to ensure response consistency . In contrast, the indication of local dependency among items, i.e., high residual correlation between items, was one of the criteria for the elimination of items in this study. The strength of this study is that the focus was not on optimizing the reliability coefficient, but to ensure broad construct coverage. Moreover, the number of items (three in each subscale) was predefined and not decided based on the value of the reliability coefficient.
Psychometric properties of the BAT12
As mentioned, the goal of the Rasch analysis is to evaluate the fit of the data to the Rasch model, which implies testing of several assumptions. If a scale works properly, estimates of thresholds need to be ordered. Thresholds are partitioning the latent continuum of burnout into ordered categories and the increasing levels of burnout should be consistent with the response ordering of the items. All BAT12 items had ordered thresholds, meaning that the increasing level of burnout severity across the categories was reflected in the data for all BAT12 items. In other words, respondents are using the item response categories (from never to always) as intended by the developers.
Evaluation of local dependency and unidimensionality are crucial steps in the evaluation of scales. Residual correlations between the items revealed that the items clustered within the predefined four subscales in the initial analysis including 12 separate items. Thus, local dependency was present, and the Smith’s test indicated multidimensionality. These results are consistent with previous studies on the BAT23 and with the theoretical conceptualisation of burnout as a syndrome [10,11,12].
Although the results were expected and made perfect sense theoretically, local dependency was still a problem from a measurement viewpoint . We have accounted for local dependency by combining the items from each subscale into four testlets [42,43,44]. When items within each subscale were combined into four testlets, fit to the Rasch model was achieved (BAT12 4 testlets analysis). The high explained common variance and average latent correlation between the four testlets indicated a strong general factor, which is a prerequisite for combining the items into a single burnout score. Thus, the BAT12 quantifies a latent trait of burnout in the same way as the BAT23. In other words, burnout is illustrated as a syndrome with four interrelated symptoms that all refer to one underlying mental state. The responses of the four subscales can thus be summarized into a single burnout score.
Finally, evaluation of DIF is an equally important step in a scale validation process. In a frame of reference that includes different groups, it is important to investigate whether the scale works invariantly across these different groups and thus can be used for comparison between the groups. In this study we have instigated DIF for gender, age and country and found that no adjustments for DIF were needed. Thus, as was the case for the BAT23, also the BAT12 works invariantly for women and men, older and younger and for participants from both countries.
Strengths and limitations
One strength of the present study is that the data come from large, representative samples of the working population in the Netherlands and Flanders (Belgium). On the other hand, when using chi-square statistics in statistical inference, large sample sizes could result in bias, as even minor levels of misfit become statistically significant. We tried to solve this issue by selecting two random samples from the dataset with 800 participants each, which were still large enough to perform the statistical analyses with good precision. In addition, we were able to cross-validate results in both samples. If there had been major problems with the scale, these would have emerged in both subsamples.
The focus on the content coverage of the BAT12 was mentioned as a strength. The combination of different approaches is another strength. Various item indicators and positioning of the items along the burnout continuum were considered as part of the statistical strategy. Through subject matter analysis the items were classified into predefined categories and items that were judged to be similar in content were considered as candidates for deletion. To further ensure the broad construct coverage, the meaning and content of the items were evaluated by experts (i.e., the developers of the BAT) and items that were judged theoretically important were kept in the final selection. The decision on which items to include in a short version of an instrument should not be solely made on psychometric properties, but also needs to incorporate strong theoretical considerations .
As mentioned above, previous studies regarding different validity aspects of the BAT23 showed promising results. A limitation is that these validity results do not automatically transfer to the short version of BAT12 but need to be investigated in forthcoming studies. Another limitation is that the participants in this study have answered all 23 items. As recommended in the literature , the psychometric results should be further validated in a new study using the BAT12 items only. To the best of our knowledge, so far the psychometric properties of the BAT12 were evaluated in two studies, both with promising results [51, 52], but more studies are needed.
Besides the psychometric properties of the short scale, another crucial issue is the scale’s invariance property under the reduced number of items, i.e., whether the same conclusion about the latent construct (burnout) can be drawn irrespective of whether the short or long version is used. This is especially important for individual assessments as the shortening of a scale affects the accuracy and precision of the scores. In this study the PSI was 0.82 and 0.81 in the sample 1 and sample 2 respectively and 0.82 in the total sample. These values were somewhat lower compared to the BAT23 the four testlets analysis (0.85)  but still high enough to allow comparison of the BAT respondents with high precision on both group and individual levels. However, future studies of the BAT12 should further focus on this issue.
Practical use of the BAT12
The intended use of the (short) scale needs to be specified and discussed, e.g. whether scales are intended to be used for research, screening, clinical assessments etc.[26, 27]. A shorter version of the BAT is timesaving compared to the BAT23 and can be used in e.g. employee surveys for screening purposes. Shorter scales are typically used in larger survey assessment in organizations, as many variables need to be measured in order to comply with the legal requirements to perform a work-related psycho-social risk analysis. Longer scales are less practical in that regard and could be used for a more detailed assessment of individuals, needed for their clinical follow-up. The construction of a scale and its evaluation requires samples that match the intended target population. In this study, data from representative samples of the working population in the Netherlands and Flanders (Belgium) was used. As regards the diagnostic or clinical use of the BAT23, studies so far analyzed and reported clinical cut-off values . The results of the present study indicate good precision when measuring burnout complaints, suggesting that individual assessment of burnout can also be done using the BAT12. However, we recommend to additionally perform studies on patients to assess the classification consistency of the BAT12, and to establish clinical cut-off values.
Given the good fit to the Rasch model and acceptable targeting, an ordinal-to-interval-conversion table is reported in this study. For the users of the BAT12 we recommend the use of metric scores instead of mean scores. In this way better precision can be obtained. The reason is that the mean scores are calculated using ordinal response categories. They are thus not equidistant across the entire continuum that is being measured. This means that the increase of one unit does not imply the same burnout magnitude along the entire burnout continuum. This problem is more pronounced toward both ends of the scale and perhaps not that serious in the middle of the scale. Interestingly, it is toward the upper end of the scale that we would expect to find people at high risk of burnout. This is obviously a well-known issue that is true for many ordinal scales and not in any way unique for the BAT .
Regarding the practical implications for the total burnout score, we recommend the users of the BAT12 to first calculate the mean scores for each person and then translate these into metric scores using the conversion table. Metric score variables can be used in further analyses (e.g., calculation of population average levels). The conversion table is valid for complete answers only (no missing values are allowed).
Using data from representative samples of working populations in the Netherlands and Flanders (Belgium) and combining quantitative and qualitative approaches, a new shorter version of the BAT (Burnout Assessment Tool) – BAT12 -is developed and tested psychometrically. The new BAT12 developed in the present study maintains the breath of item content of the original version of the BAT. The shorter version of the BAT has sound psychometric properties. The BAT12 fulfils the measurement criteria according to the Rasch model when the four subscales are combined into a single burnout score. The scale works invariantly for women and men, older and younger and across both countries. A shorter version of the BAT is timesaving compared to the BAT23 and can be used in e.g., employee surveys.
Availability of data and materials
All data generated or analysed during this study are included in this published article [and its supplementary information files].
Analysis of variance
Burnout Assessment Tool
Differential item functioning
The Maslach Burnout Inventory
Principal component analysis
Person Separation Index
Schaufeli WB, Leiter MP, Maslach C. Burnout: 35 years of research and practice. Career Dev Int. 2009;14(3):204–20.
Schaufeli WB. Burnout: Acritical overview. In: Lapierre LM, Cooper C, editors. Cambridge companion to organizational stress and well-being. Cambridge University Press. In press.
Lastovkova A, Carder M, Rasmussen HM, Sjoberg L, Groene GJ, Sauni R, Vevoda J, Vevodova S, Lasfargues G, Svartengren M, et al. Burnout syndrome as an occupational disease in the European Union: an exploratory study. Ind Health. 2018;56(2):160–5.
Maslach C, Leiter MP, Jackson SE. Maslach Burnout Inventory Manual, 4th ed. Palo Alto: Mind Garden, Inc.; 2017.
Boudreau RA, Boudreau WF, Mauthe-Kaddoura AJ. From 57 for 57: a bibliography of burnout cita-tions. Poster presented at the 17th Conference of the European Association of Work and Organizational Psychology (EAWOP). Oslo; 2015.
Maslach C, Jackson SE. The Measurement of Experienced Burnout. J Occup Behav. 1981;2(2):99–113.
de Beer LT, Schaufeli WB, De Witte H, Hakanen JJ, Shimazu A, Glaser J, Seubert C, Bosak J, Sinval J, Rudnev M. Measurement Invariance of the Burnout Assessment Tool (BAT) Across Seven Cross-National Representative Samples. Int J Environ Res Public Health. 2020;17(15):5604.
Nielsen T, Kreiner S: Improving Items That Do Not Fit the Rasch Model. In: Rasch Models in Health. edn. Edited by Christensen KB, Kreiner S, Mesbah M; 2013: 317–334.
Tennant A, Conaghan PG. The Rasch measurement model in rheumatology: What is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis Care Res. 2007;57(8):1358–62.
Vinueza-Solórzano AM, Portalanza-Chavarría CA, de Freitas CPP, Schaufeli WB, De Witte H, Hutz CS, Souza Vazquez AC. The Ecuadorian Version of the Burnout Assessment Tool (BAT): Adaptation and Validation. Int J Environ Res Public Health. 2021;18(13):7121.
We would like to thank all members of the international BAT research consortium (see www.burnoutassessmenttool.be) for their inspiring input into the BAT project.
Open access funding provided by University of Gothenburg. Onderzoeksraad: KU Leuven: BOFZAP-14/001 and C3-project C32/15/003 – “Development and validation of a questionnaire to assess burnout” financed by KU Leuven, Belgium.
Authors and Affiliations
Institute of Stress Medicine, Region Västra Götaland, Carl Skottsbergs gata 22 B, 413 19, Göteborg, Sweden
School of Public Health and Community Medicine, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
Department of Psychology, Utrecht University, Utrecht, The Netherlands
Research Unit Occupational & Organizational Psychology and Professional Learning, Leuven, KU, Belgium
Wilmar Schaufeli & Hans De Witte
Optentia Research Unit, North-West University, Potchefstroom, South Africa
EH performed all analyses and wrote a first draft of the manuscript. WS was a major contributor in writing the background. All the authors helped to revise the manuscript. All the authors read and approved the final manuscript.
Prior to filling out the questionnaire, all participants were informed about the purpose of the study, that participation was voluntary, that they could stop at any moment if they wished to do so, that questions could be directed to a contact person (name and email address provided) and that complaints could be filed with the ethical committee (email address provided). Moreover, participants declared that they agreed with these terms by clicking on “next”. This study was approved by the ethical committee (Social and Societal Ethics Committee (SMEC) of KU Leuven) (https://www.kuleuven.be/english/research/ethics/committees/smec) on June 16, 2016 (reference number: G-2016 06 2027). All the methods were in accordance with the declaration of Helsinki or in accordance with relevant national/institutional guidelines.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Hadžibajramović, E., Schaufeli, W. & De Witte, H. Shortening of the Burnout Assessment Tool (BAT)—from 23 to 12 items using content and Rasch analysis.
BMC Public Health22, 560 (2022). https://doi.org/10.1186/s12889-022-12946-y