The validity of the SF-36 in an Australian National Household Survey: demonstrating the applicability of the Household Income and Labour Dynamics in Australia (HILDA) Survey to examination of health inequalities

Background The SF-36 is one of the most widely used self-completion measures of health status. The inclusion of the SF-36 in the first Australian national household panel survey, the Household, Income and Labour Dynamics in Australia (HILDA) Survey, provides an opportunity to investigate health inequalities. In this analysis we establish the psychometric properties and criterion validity of the SF-36 HILDA Survey data and examine scale profiles across a range of measures of socio-economic circumstance. Methods Data from 13,055 respondents who completed the first wave of the HILDA Survey were analysed to determine the psychometric properties of the SF-36 and the relationship of the SF-36 scales to other measures of health, disability, social functioning and demographic characteristics. Results Results of principal components analysis were similar to previous Australian and international reports. Survey scales demonstrated convergent and divergent validity, and different markers of social status demonstrated unique patterns of outcomes across the scales. Conclusion Results demonstrated the validity of the SF-36 data collected during the first wave of the HILDA Survey and support its use in research examining health inequalities and population health characteristics in Australia.


Background
While much health research focuses on objective outcome measures such as mortality or morbidity defined through clinical assessment, there is an increasing emphasis on self-reported measures of health status and health-related quality of life. Self-reported measures of health status have been included in epidemiological and community-based survey research. Their use reflects the importance of considering the patients' point of view and the multidimensional nature of health [1][2][3].
The Medical Outcomes Study Short Form (SF-36) is one of the most widely used, self-completion measures of health status. It was developed to meet the psychometric standards necessary for group comparisons, to enable profiling of functional health and well-being, and to quantify disease burden [3]. It comprises 36 items of which all but one are used to measure eight important health concepts that are frequently examined through health surveys. These eight concepts or scales are: Physical Functioning; Role-Physical (interference with work or other daily activities due to physical health); Bodily Pain; General Health; Vitality; Social Functioning (interference with normal social activities); Role-Emotional (interference with work or other daily activities due to emotional problems); and Mental Health (symptoms associated with anxiety and depression and measures of positive affect). In addition, the eight scales yield two summary scales of health, relating to physical (the Physical Component Summary: PCS) and mental (the Mental Component Summary: MCS) functioning and well-being.
The SF-36 was first adapted for use in Australia in 1992, as part of the International Quality of Life Assessment (IQOLA) Project [4]. Previous research has demonstrated the validity of the SF-36 for use by Australian respondents using samples from Canberra and New South Wales [2,4]. This has involved assessment of the psychometric properties of the Australian form of the SF-36, evaluation of internal consistency and reliability, and demonstration of content and construct validity. There are considerable Australian data on the SF-36 from large national samples. In 1995, a subset of National Health Survey respondents (around 18,800 adults) completed the SF-36 and the Australian Bureau of Statistics published Australian population norms [5]. The SF-36 was also included in the Women's Health Australia survey, with data collected from a sample of around 41,500 women aged 18-22, 45-49, and 70-74 [6].
In 2001, the SF-36 was included in the first wave of the Household, Income and Labour Dynamics in Australia (HILDA) Survey. This is an Australia-wide survey of approximately 7,680 households, comprising around 14,000 people aged 15 and over. The HILDA Survey is the first longitudinal household survey in Australia and is designed to provide a sound evidence base to support research and analysis of income, labour market and family dynamics. As such, it is a critical resource for social policy development. The inclusion of the SF-36 in the HILDA Survey will enable investigation of the interaction between social, economic and health measures.
There is an extensive body of research demonstrating socio-economic inequalities in the distribution of physical and mental health problems [7][8][9]. The SF-36 has been utilised in research on health inequalities and this research has shown that the SF-36 scales are differentially associated with markers of socio-economic circumstance [10][11][12][13]. A focus of our research has been on the nature of disadvantage and social exclusion associated with welfare receipt and specifically the association between welfare receipt and mental health problems [14]. The HILDA Survey provides a valuable dataset with which to investigate this research topic. As such, it is critical first to ascertain the validity of the data within the HILDA Survey. Further, demonstrating the validity of the SF-36 scales collected through the survey is critical for other researchers and policy analysts who will utilise the HILDA dataset.
The aim of this paper is to evaluate the psychometric properties and criterion validity of the HILDA SF-36 data. We have used the manuscript by Sanson-Fisher and Perkins [4] as a framework for the analyses reported in the first section. This follows the standard IQOLA validation procedures [15]. The analysis examines the reliability and validity of the eight SF-36 scales and the PCS and MCS scales. We then compare the SF-36 results obtained from the HILDA Survey with other Australian estimates to assess the representativeness of the data. We also evaluate the criterion validity of the HILDA Survey SF-36 data. We do this by 1) looking for convergent validity, in which scales measuring similar or related constructs demonstrate a positive association, and divergent validity whereby unrelated scales and measures are not associated; and 2) examining the profile of SF-36 results across a range of measures associated with health, disability and social functioning, demographic characteristics as well as a focus on a number of measures of socio-economic circumstances.

Data source
Data are from the first wave of the HILDA Survey (Release 1.0), a nationally representative household panel survey. The HILDA Survey was funded by the Australian Department of Family and Community Services and managed by a consortium led by the Melbourne Institute of Applied Economic and Social Research at the University of Melbourne. The survey utilised a multi-stage sampling approach (sampling households within Census Collection Districts) and was stratified by State and part-of-State.
Four survey instruments were included in Wave 1. A Household Form and Household Questionnaire were completed during a personal interview with one adult member of each household. The Person Questionnaire, also administered during the personal interview, was conducted with all adult household members. Finally, the Self-Completion Questionnaire (SCQ) was provided to all respondents to the Person Questionnaire and was collected at a later date or returned by post. The SF-36 was included as the first element in the SCQ. Fieldwork for wave one of the HILDA Survey was conducted between August 2001 and January 2002.
A total of 7,682 households responded to the survey (a household response rate of 66 per cent). Within these households, there were 15,127 eligible adults. Of this group 13,969 (92%) completed the Person Questionnaire and 13,055 (86%) completed and returned an identifiable SCQ.
There are some differences between the characteristics of respondents to the HILDA Survey and population estimates from the Australian Bureau of Statistics. However, these discrepancies are not large enough to discredit the data and the differences in rates of response across both sex and location are corrected by applying population weights [16].

Analysis
The first set of analyses evaluated the reliability and validity of the SF-36 scales in the HILDA Survey. Given that these analyses were concerned with the internal structure of the data rather than representativeness, we ignored the clustered and stratified nature of the data and did not take population weights into account. These analyses were conducted using SPSS version 11.7.
Item-internal consistency assessed the extent to which each item measures what its associated scale measures. A correlation of 0.40 (corrected for overlap) or greater demonstrates adequate item-internal consistency [17]. Item-discriminant validity was also assessed. This is demonstrated when an item correlates significantly more strongly with the scale it contributes to than with any other scale. We use the multitrait/multi-item correlation matrix approach in which the correlation of each item with each scale is examined [18]. In this approach, the correlation between each item and its own scale is corrected for overlap: that is, the scale is calculated without the specific item in question to avoid inflating the correlation. The extent to which item-scale correlations within a scale were equal was also assessed, as was the approximate equality of item means and standard deviations.
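The correction for overlap described above can be illustrated with a minimal Python sketch. It is not the SPSS procedure used in the analysis; the function names and the toy responses are hypothetical, and the only value taken from the text is the 0.40 criterion.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def corrected_item_scale_correlations(items):
    """Correlate each item with the sum of the OTHER items in its scale.

    items: dict mapping item name -> list of responses (one per respondent).
    Leaving the item out of the scale total is the "correction for overlap".
    """
    names = list(items)
    n = len(items[names[0]])
    result = {}
    for name in names:
        rest = [
            sum(items[other][i] for other in names if other != name)
            for i in range(n)
        ]
        result[name] = pearson(items[name], rest)
    return result


# Hypothetical responses for a three-item scale (not HILDA data).
toy_scale = {
    "item_a": [1, 2, 3, 4, 5, 3],
    "item_b": [2, 2, 3, 5, 5, 3],
    "item_c": [1, 3, 3, 4, 4, 2],
}
correlations = corrected_item_scale_correlations(toy_scale)
# Flag items meeting the 0.40 criterion for item-internal consistency.
adequate = {name: r >= 0.40 for name, r in correlations.items()}
```

Item-discriminant validity would be checked the same way, by correlating each item with the totals of the other scales (no correction is needed there, since the item does not contribute to those totals).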
To assess the internal consistency of scale scores, Cronbach's alpha coefficients were assessed, with a criterion of 0.7 used to define adequate internal consistency. Descriptive statistics for the eight SF-36 scales and the two summary scales were calculated, including the mean, median, range, standard deviation, skewness, kurtosis, and the per cent scoring at the lowest value (floor) and the highest value (ceiling).
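As a hedged illustration of the two quantities named above, the sketch below computes Cronbach's alpha from item-level responses and the per cent of respondents at the floor and ceiling of a 0 to 100 scale. The function names and toy data are hypothetical; only the 0.7 criterion comes from the text.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items (each a list of responses).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(items)
    n = len(items[0])
    totals = [sum(item[i] for item in items) for i in range(n)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def floor_ceiling(scores, lowest=0, highest=100):
    """Per cent of respondents scoring at the lowest and highest values."""
    n = len(scores)
    floor_pct = 100 * sum(s == lowest for s in scores) / n
    ceiling_pct = 100 * sum(s == highest for s in scores) / n
    return floor_pct, ceiling_pct


# Hypothetical three-item scale and hypothetical scale scores (not HILDA data).
toy_items = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 3, 5, 5, 3],
    [1, 3, 3, 4, 4, 2],
]
alpha = cronbach_alpha(toy_items)
meets_criterion = alpha >= 0.7
floor_pct, ceiling_pct = floor_ceiling([0, 50, 100, 100])
```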
Construct validity of the SF-36 in the HILDA Survey was also evaluated using the principal components method of factor analysis. Results were compared with the factor structure obtained from other analyses of the SF-36 scales.
The second set of analyses also examined a form of construct validity, assessing the extent to which scores on the SF-36 scales were associated with other criteria. We included in our analysis a number of measures of health and disability, ratings of satisfaction with life and health, and measures of stressful employment conditions and persistent feelings of loneliness. In addition to examining the profiles across the eight SF-36 scales, we also examined the relationship between these criterion measures and the PCS scale and the MCS scale, which were calculated according to the standard procedure outlined by Ware, Kosinski and Keller [19].
For these analyses it was critical to take account of the complex survey design. We therefore utilised the svy procedures of STATA to account for the clustered and stratified nature of the data, and to facilitate the use of weights to overcome differential response rates and to replicate Australian population parameters. The HILDA Survey dataset contains person-level weights which, when applied to individual survey respondents, adjust for the unequal probabilities of selection and completion of the Person Questionnaire. However, these weights do not adjust for the further attrition associated with the SCQ. This is critical as our analysis is focused upon measures from this questionnaire. We therefore conducted a logistic regression analysis to predict completion of the SCQ using the same predictors used to derive the HILDA person-level weights (geographic location, labour-force status, sex, age, number of adults in household, number of children in household, marital status, English language ability, and dwelling type) [20]. The probabilities of responding derived from this analysis were used to adjust the person-level weights. Utilising these adjusted weights produced accurate population estimates from the respondents who completed the SCQ (similar to the estimates for Person Questionnaire respondents using the original person-level weights).
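The text does not state the exact formula used to adjust the weights. One common approach, shown here purely as an assumption, is to divide each SCQ respondent's person-level weight by the predicted probability of SCQ completion from the logistic regression. The function name and the example values are hypothetical.

```python
def adjust_weights(person_weights, scq_response_probs):
    """Inverse-probability adjustment of person-level weights (assumed method).

    person_weights: original weights for respondents who returned the SCQ.
    scq_response_probs: predicted probability of SCQ completion for the same
        respondents, e.g. fitted values from a logistic regression.
    """
    adjusted = []
    for weight, prob in zip(person_weights, scq_response_probs):
        if not 0 < prob <= 1:
            raise ValueError("response probability must be in (0, 1]")
        # Respondents who were less likely to respond stand in for more
        # non-respondents, so their weight is scaled up.
        adjusted.append(weight / prob)
    return adjusted


# Hypothetical weights and predicted probabilities (not HILDA values).
adjusted_weights = adjust_weights([1000.0, 1200.0], [0.8, 0.5])
```

Under this assumed scheme, a respondent with a predicted completion probability of 0.5 has their weight doubled.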
We also examine the correspondence between SF-36 results from the HILDA Survey and data from another, large-scale Australian survey to assess the representativeness of the data. We contrast the means for each of the SF-36 scales from the HILDA Survey with those obtained by the Australian Bureau of Statistics through the National Health Survey conducted in 1995 [5]. To facilitate evaluation of potential differences between means, two sample t-test statistics were calculated using a pooled variance approach.
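A two-sample t statistic with pooled variance can be computed directly from published means, standard deviations and sample sizes. The sketch below shows the calculation; the function name is hypothetical and the input figures are invented for illustration, not taken from either survey.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic using a pooled variance estimate.

    Returns (t, degrees_of_freedom).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Invented summary statistics for illustration only.
t_stat, dof = pooled_t(60.0, 20.0, 100, 64.0, 20.0, 100)
```

With samples as large as those in the two surveys, even small differences in means yield large t statistics, which is why the magnitude of a difference also needs to be considered.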

Demographic, socio-economic and health measures
Based on previous analysis of the SF-36, we expected results across the eight scales to differ according to subgroups identified by demographic and socio-economic characteristics, including age, sex, marital status, educational attainment, housing tenure and employment status. We also examined differences associated with receipt of welfare payments, a focus of our research endeavours [14]. The analyses also include a range of measures related to health and life circumstances drawn from other scales and instruments included in the HILDA Survey.

Long-term health condition
One question from the Person Questionnaire asked respondents if they experienced a disability or health condition that had lasted, or was likely to last 6 months or more, and restricted their everyday activities. A showcard listing examples of health conditions, impairments and disabilities was presented as a prompt for respondents. Responses were either Yes or No.

Limited ability to work
Those identifying a disability or condition were asked to rate whether this limited the type or amount of work they were able to undertake. Response categories were No, Yes and Can Do Nothing.

Satisfaction with health
In a series of questions from the Person Questionnaire, respondents were asked to rate their satisfaction with a range of life circumstances using an eleven-point scale with descriptive anchors at either end (0 = totally dissatisfied; 10 = totally satisfied). Responses were categorised as either dissatisfied (if the rating was 4 or less) or satisfied/mixed (if the rating given was at the midpoint 5 or higher).

Satisfaction with life
Respondents were also asked, all things considered, how satisfied they were with their life. Responses were categorised as above.

Job stress
The SCQ included a series of statements about current job characteristics, to which respondents rated their level of agreement or disagreement using a seven-point scale anchored at either end (1 = strongly disagree; 7 = strongly agree). One item asked respondents to rate whether they feared that the amount of job stress would make them physically ill. Respondents were categorised as experiencing significant job stress if they agreed with this statement (a rating of 5 or greater).

Social isolation
The final measure was also drawn from the SCQ. Using the same scale as the job stress measure, respondents were presented with a series of statements about their level of social support. Respondents were categorised as lonely if they agreed (rating of 5 or above) with the statement "I often feel very lonely".
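The cut-points used to dichotomise the satisfaction, job stress and loneliness ratings described above can be written out as a small Python sketch. The function names are hypothetical; the thresholds are those stated in the text.

```python
def satisfaction_category(rating):
    """Categorise a 0-10 satisfaction rating: 4 or less is dissatisfied."""
    if not 0 <= rating <= 10:
        raise ValueError("satisfaction ratings run from 0 to 10")
    return "dissatisfied" if rating <= 4 else "satisfied/mixed"


def agrees(rating):
    """True if a 1-7 agreement rating indicates agreement (5 or greater).

    Applied to the job stress item and to "I often feel very lonely".
    """
    if not 1 <= rating <= 7:
        raise ValueError("agreement ratings run from 1 to 7")
    return rating >= 5


midpoint_category = satisfaction_category(5)
job_stress_flag = agrees(5)
```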
Based on previous research using the SF-36, we anticipate that increasing age will be associated with poorer physical health, and that women will demonstrate poorer health than men as measured on all SF-36 scales with the possible exception of the General Health scale [5]. Those respondents who are divorced or separated are expected to demonstrate poorer health than those currently married or in a de facto relationship, with the greatest differences observed on the scales related to social functioning and mental health [21]. For all of the socio-economic measures, we expect to find a socio-economic gradient for all SF-36 scales, though the pattern may differ for different measures. For example, Sullivan and Karlsson found that educational level was more strongly associated with the measures of physical health whereas employment status was more strongly associated with mental health. Also based on results such as those reported by Sullivan and Karlsson, it was expected that reported experience of a long-term health condition, limited ability to work and satisfaction with health would be more strongly associated with SF-36 scales related to physical health, while job stress, social isolation (loneliness) and satisfaction with life would be more strongly associated with the scales loading on the mental health factor.

Sample statistics
Of the 13,055 respondents with identifiable SCQs, 53 percent were female. The age of respondents ranged from 15 to age 90 or greater (reported age was capped at 90 in the survey data), with an average age of 43 years (SD = 17.25).
A total of 11,264 respondents (86.3%) completed all SF-36 items, with a further 1,326 (10.1%) having 5 or fewer missing items. Overall, the mean number of missing items per person was 0.69, and the median was 0. Details of non-responses/missing values for each item are presented in Figure A1 of the Additional file. The rate of missing values was below 3.5 percent for all items. The highest rate of missing values was evident for items from the Physical Functioning scale, followed by the Role Physical, General Health and Role Emotional scales. Scales were calculated using standard SF-36 scoring procedures, whereby missing values were replaced by scale means where valid responses were available for at least half of the scale items. Therefore, the number of respondents with valid scale scores ranged from 12,686 for the Role Emotional scale to 13,031 for the Social Functioning scale. Analyses were conducted with the maximum number of respondents possible, and the sample size therefore varies across the scales.

Tests of scaling assumptions
Table 1 demonstrates that the item-scale correlations within each scale were moderate to strong (see the item-internal consistency column). Indeed, all item-scale correlations were greater than the recommended correlation of 0.40 for adequate item-internal consistency [17]. The widest range of item-scale correlations was observed for the General Health scale (0.47-0.77), with the weakest item-scale correlations evident for items 11a and 11c (see Table A1 in the Additional file). Nonetheless, the item-scale correlations were reasonably similar within each scale.
In order to assess item-discriminant validity, items that make up a scale were correlated with other scale constructs. Item-discriminant validity is demonstrated when an item correlates higher with its own scale than with other scales. While there was some overlap in correlations between items from one scale and other scales (compare the range of item correlations for the item-internal consistency and item-discriminant validity columns in Table 1), these were relatively minor. At the level of the individual items, it is apparent that all items were more strongly correlated with their own scale than with other scales, and that this difference was statistically significant for all but one item (see Table A1). Thus, [2,4].

Descriptive statistics for scales
In accordance with the standard scoring procedures [19], the eight scales of the SF-36 were constructed by aggregating 35 of the 36 individual items (the excluded item measures self-reported health transition). Each of the 35 items contributed to one scale, with each scale comprising 2 to 10 items. The range of scores possible on each of the eight scales was from 0 to 100, with 100 representing optimal functioning as measured by the SF-36.
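The scoring logic can be sketched in Python as follows. This is a simplified illustration, not the full scoring algorithm in the manual [19]: it assumes the items of a scale have already been recoded so that higher values mean better health and that they share the same response range, and it applies the half-scale rule for missing items mentioned under Sample statistics. The function name and example responses are hypothetical.

```python
def score_scale(responses, item_min, item_max):
    """Score one scale on the 0-100 metric, or return None if not scorable.

    responses: recoded item values for one respondent, with None for missing.
    item_min, item_max: lowest and highest possible value of each item
        (assumed equal across the items of the scale).
    """
    answered = [r for r in responses if r is not None]
    # Half-scale rule: at least half of the items must have valid responses.
    if len(answered) * 2 < len(responses):
        return None
    # Replacing each missing item with the respondent's mean of answered
    # items gives the same total as scoring from that mean directly.
    person_mean = sum(answered) / len(answered)
    return (person_mean - item_min) / (item_max - item_min) * 100


best_possible = score_scale([3, 3, 3], 1, 3)
one_missing = score_scale([1, None, 3, 2], 1, 3)
too_many_missing = score_scale([None, None, 2], 1, 3)
```

A respondent answering every item at the top category scores 100, and a respondent with more than half of the items missing receives no score for that scale.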
The (unweighted) means for the eight scales ranged from 60.87 (Vitality) to 83.19 (Physical Functioning; see Table 2). All scales were found to be negatively skewed with the Physical Functioning, Role Emotional, Role Physical and Social Functioning scales moderately skewed. Ceiling effects were high for the Role Emotional (73.3%) and Role Physical (68.8%) scales and moderate for the Social Functioning (49.4%), Physical Functioning (33.8%) and Bodily Pain (32.9%) scales. The greatest prevalence of floor effects was observed for the Role Physical (12.4%) and Role Emotional (9.9%) scales. Minimal ceiling and floor effects were observed for the Vitality, Mental Health and General Health scales. This pattern is similar to that obtained with previous assessment of the SF-36 [4]. Descriptive statistics for the two summary scores (the Physical and Mental Component Summary scores) are also consistent with previous findings.

Principal Components Analysis
Principal Components Analysis was conducted to examine the underlying structure of the SF-36. The analysis supported a two-factor solution: two factors had eigenvalues greater than one, and the two-factor solution was also consistent with the pattern of results evident in the scree plot. The two factors accounted for 69% of the total variance in the eight SF-36 scales, with the first factor accounting for 56% of the variance and the second for an additional 13%. The total variance accounted for by the two-factor solution was similar to that found in other studies. For example, in a ten-country comparison of the factor structure of the SF-36, Ware and colleagues report that the two-factor solution accounted for between 66 and 72 percent of total variance [22]. Previous Australian analyses also found that the two factors explained around 70 percent of total variance [4,6].
Table 3 shows the correlations between the SF-36 scales and the rotated components. We used the varimax method to obtain orthogonal factors. Across the eight scales, the proportion of variance within each scale explained by the two-factor solution (the communalities) ranged from 0.57 to 0.82. Again, this is consistent with previous international studies [22]. In most international comparisons (though see [23,24] using Taiwanese and Japanese populations), the factor loadings associated with General Health were stronger for the physical health factor. We also found this pattern of results, though the relationship observed between the General Health scale and the mental health factor was at the upper end of the range of international data. The opposite pattern is generally demonstrated for the Vitality scale, with it loading most strongly on the mental health factor. One previous Australian study [4] found that the Vitality scale loaded more strongly on the physical health factor. The current results are more consistent with the body of international evidence.
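The reported percentages can be related to the eigenvalues and communalities with some simple arithmetic. The sketch below assumes the analysis was run on the correlation matrix of the eight scales (so that total variance equals 8), which the eigenvalue-greater-than-one criterion suggests but the text does not state. The eigenvalues shown are approximations back-calculated from the rounded percentages, not reported values; the symbols $\lambda_k$ (eigenvalue of component $k$), $a_{jk}$ (loading of scale $j$ on rotated component $k$) and $h_j^2$ (communality of scale $j$) are introduced here for illustration.

```latex
\begin{align*}
\text{proportion of variance for component } k &= \frac{\lambda_k}{p}, \qquad p = 8,\\
\lambda_1 &\approx 0.56 \times 8 = 4.48, \qquad \lambda_2 \approx 0.13 \times 8 = 1.04,\\
h_j^2 &= a_{j1}^2 + a_{j2}^2, \qquad \sum_{j=1}^{8} h_j^2 = \lambda_1 + \lambda_2,\\
\overline{h^2} &= \frac{\lambda_1 + \lambda_2}{8} \approx 0.69 .
\end{align*}
```

Both approximate eigenvalues exceed one, and the implied mean communality of about 0.69 sits within the reported range of 0.57 to 0.82, so the figures are internally consistent under this assumption.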

Normative comparisons
Table 4 presents population estimates of the mean and standard deviation for each of the SF-36 scales and the Physical and Mental Component Summary scores. These results take account of the complex survey design and population weights. Also presented are the corresponding results from the ABS 1995 Health Survey. The results obtained are broadly consistent with the previous Australian national data. The means are of a similar magnitude. Direct comparison of the means, assessed using independent t tests (see [6]), indicated that the differences observed were statistically significant (at the .01 level) for six of the eight scales. The mean scores from the HILDA Survey were consistently lower than those obtained in the ABS Health Survey. However, the greatest differences observed on the individual scales (VT and SF) were less than four points and Ware et al. [19] suggest that a difference of 5 points or more indicates clinical or social meaningfulness.
While relatively small, the significant differences in mean SF-36 scale scores from these two national surveys are intriguing. The most marked differences were on scales related to mental health, while the scales most strongly related to physical health (the Physical Functioning and Role Physical scales) were those where no difference between surveys was observed. It may be that the differences observed reflect different methodological approaches adopted in the two surveys, or could be due to non-sampling error. Alternatively, it could reflect a real change in the Australian population between 1995 and 2001, with poorer reported mental health occurring in more recent times. This may be the case, as changes were restricted to the measures related to mental health and therefore are not simply a general change in response bias.
There is evidence which corroborates the apparent increased prevalence of mental health problems in the community. Comparison of the 1997 Australian Survey of Mental Health and Wellbeing and the 2001 National Health Survey, both of which included the 10 item Kessler Psychological Distress Scale, showed the rates of substantial psychological distress had increased by over one percent [25]. In the UK, comparison of the results from the 1993 and 2000 Psychiatric Morbidity Surveys found that, while there was no difference in the overall rate of neurotic disorders amongst adults, there was a significant increase in the prevalence among men [26]. Further investigation of this finding and possible explanations (e.g. increasing mental disorders, reduced stigma and increased likelihood of disclosure) is warranted.

Comparison of groups
The data presented in Table 5 show the profile of mean SF-36 scores across a range of demographic and socio-economic characteristics. There were small but significant sex differences across most of the SF-36 scales and summary measures. Females consistently demonstrated slightly poorer health than males on all measures apart from the General Health scale, including the physical and mental summary scores. This pattern of results precisely replicates the pattern found in the previous ABS National Health Survey [5].
Age effects were also consistent with expectations. While those in older age groups reported poorer levels of health across most SF-36 scales, the decline was most pronounced in the measures related to physical health (Physical Functioning, Bodily Pain). In contrast to this pattern, however, both the Mental Health scale and Mental Component Summary Score showed improved mental health with increasing age, apart from a slight decline evident in the oldest (75+) age group. This is consistent with epidemiological survey results using a variety of mental health measures [27]. Health differences associated with current marital status (comparing partnered respondents with those who were separated or divorced) were evident across all scales. The greatest differences were observed in the Social Functioning and Role Emotional scales.
The analysis showed that health was associated with a range of measures of social status; however, the profile across the SF-36 scales differed for the different measures. Consideration of educational attainment showed that those respondents who had less than secondary education had significantly poorer physical health, most evident on the Physical Functioning scale. Current unemployment (compared to full or part-time employment) was significantly associated with poorer health on a number of scales, but particularly the measures loading on the mental health factor (Social Functioning, Role Emotional and Mental Health). This, too, is consistent with the extensive unemployment literature [28]. Housing tenure (rental housing
While Australian welfare payments are universally available, eligibility is highly targeted. Therefore, welfare receipt appears to be a good proxy for poor financial circumstances. In addition, a subset of the welfare population are those with severe disabilities that prevent work. Consistent with this, we found that, amongst working age respondents, receipt of an Australian income support payment was strongly associated with both poorer physical and mental health.
The final set of comparisons examined the association between SF-36 scales and respondents' health and social circumstances as measured by other scales and items included in the HILDA Survey (Table 6). Consistent with expectations, respondents who reported a long-term health condition or disability demonstrated lower mean scores on all SF-36 scales, but particularly so on the scales loading on physical health (Physical Functioning, Role Physical, Bodily Pain and General Health). A similar pattern of associations was observed when examining the circumstances of those with health conditions that impacted on their ability to work. Greater limitation of work ability was associated with poorer physical health as measured by the SF-36 scales.
Reported satisfaction with health was most strongly associated with General Health and physical health measures more generally, while overall satisfaction with one's life was more highly associated with the SF-36 scales related to mental health.
The final two criterion measures (job stress and loneliness) were expected to be most strongly associated with poorer mental health outcomes. Consistent with expecta-

Conclusions
This analysis has demonstrated the validity of the SF-36 data collected through the first wave of the HILDA Survey.
The eight scales were shown to be psychometrically sound, with good internal consistency, discriminant validity and high reliability. The results supported the underlying two-factor structure, with factors related to physical and mental health. The pattern of factor loadings, variance explained, and the profile and magnitude of the scale means were consistent with previous Australian and international results. There was, however, an indication that the estimated population means on several of the SF-36 scales, particularly on scales loading on the mental health factor, were lower in the HILDA Survey than previous Australian national data collected through the National Health Survey. Self-report measures of health status can be subject to a range of biases, such as reporting bias or cohort effects. They may also be influenced by contextual factors, such as the setting in which the measure is conducted. Further consideration of these different data sources, and an examination of possible methodological and other differences, is warranted.
With respect to the relationship between SF-36 scales and external criteria, we demonstrated a pattern of results consistent with expectations and previous research. We examined measures having differential effects on physical and mental health (e.g., disability and health satisfaction vs job stress, loneliness, and life satisfaction). The relationship between demographic characteristics (age, sex, marital status) and SF-36 scales was consistent with previous data. Finally, we focused on health inequalities, using a range of markers of social status and examining their relationship with the SF-36 scales. While each marker of social status was associated with poorer health, the dimensions of health related to each social variable differed. Whereas educational attainment was most strongly associated with poorer physical health, housing tenure and employment status were related to mental health scales. The measure of welfare receipt, which is associated with economic hardship, was strongly associated with both poorer physical and mental health.
The application of the probability weights available with the HILDA Survey data ensures the accuracy of the point estimates of the population parameters, such as the mean, by correcting for non-response and design characteristics. However, it is also important to recognise the impact of the survey design (clustering and stratification) on the calculation of standard errors. As the sampling units in the HILDA Survey are Census Collection Districts and, within that, households, the data violate assumptions of independence and should not be treated as a simple random sample drawn from the population. We expect more similarity between individuals within Census Collection Districts and households than between those drawn at random from the population, and if these characteristics of the survey design are not taken into account the estimates of standard error will overstate accuracy.
Calculation of the design effect provides an indication of the increase in error associated with the complex survey design, representing the ratio of the variance for the complex design to that obtained assuming a simple random sample. These measures range from 1.  21. The design effects were greater for the scales assessing physical health than for those loading on mental health. This suggests physical health more strongly clusters within households and areas than does mental health, perhaps as a consequence of the clustering of age (design effect of 3.8) and the strong relationship between age and physical health and disability. Note that the design effects are not evident in the population estimates in Table 4 as these figures were corrected to provide an estimate of the population mean and population standard deviation. Nonetheless, the design effects highlight the need to utilise appropriate statistical techniques in analysis of the HILDA Survey data.
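The definition of the design effect given above translates directly into a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, the example standard errors are invented, and the only figure taken from the text is the design effect of 3.8 reported for age.

```python
from math import sqrt


def design_effect(se_complex, se_srs):
    """Ratio of the variance under the complex design to the variance
    obtained assuming a simple random sample (variances = squared SEs)."""
    return (se_complex / se_srs) ** 2


def se_inflation(deff):
    """Factor by which a simple-random-sample standard error must be
    multiplied to reflect the complex design."""
    return sqrt(deff)


# Invented standard errors for illustration only.
example_deff = design_effect(0.3, 0.2)
# Design effect reported in the text for age.
age_inflation = se_inflation(3.8)
```

A design effect of 3.8 corresponds to standard errors roughly 1.95 times larger than a simple random sample calculation would suggest, which illustrates why ignoring the design overstates accuracy.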
This analysis has provided evidence that the SF-36 data collected in the HILDA Survey are valid and supports their use as a general outcome measure of physical and mental health status. It also suggests that the results obtained using the SF-36 in the HILDA Survey can be interpreted by reference to published SF-36 normative data and comparison with previous research findings. Most critically, from our perspective, the analysis provides initial support for the proposition that the SF-36 scales and component summary scores are related in a meaningful and interpretable manner to measures of social status. This supports the use of the HILDA Survey data in our analysis of health inequalities and the effects of social exclusion in Australia.