The development and validation of an instrument to measure the quality of health research reports in the lay media

Zeraatkar, Dena; Obeda, Michael; Ginsberg, Jeffrey S.; Hirsh, Jack

doi:10.1186/s12889-017-4259-y

Research article
Open access
Published: 20 April 2017

Dena Zeraatkar¹,
Michael Obeda²,
Jeffrey S. Ginsberg³ &
…
Jack Hirsh³

BMC Public Health volume 17, Article number: 343 (2017)

Abstract

Background

The media serves as an important link between medical research, as reported in scholarly sources, and the public and has the potential to act as a powerful tool to improve public health. However, concerns about the reliability of health research reports have been raised. Tools to monitor the quality of health research reporting in the media are needed to identify areas of weakness in health research reporting and to subsequently work towards the efficient use of the lay media as a public health tool through which the public’s health behaviors can be improved.

Methods

We developed the Quality Index for health-related Media Reports (QIMR) as a tool to monitor the quality of health research reports in the lay media. The tool was developed according to themes generated from interviews with health journalists and researchers. Item and domain characteristics and scale reliability were assessed. The scale was correlated with a global quality assessment score and media report word count to provide evidence towards its construct validity.

Results

The items and domains of the QIMR demonstrated acceptable validity and reliability. Items from the ‘validity’ domain were positively skewed, suggesting a possible floor effect. These items were not eliminated due to acceptable content and face validity. QIMR total scores produced a strong correlation with raters’ global assessment and a moderate correlation with media report word count, providing evidence towards the construct validity of the instrument.

Conclusions

The results of this investigation indicate that QIMR can adequately measure the quality of health research reports, with acceptable reliability and validity.

Background

The public obtains medical information from television, newspapers, and online sources. The media provides an important link between medical research, as reported in academic sources, and the lay public. Seeking health information in the media has become increasingly common, as evidenced by survey data indicating that nearly a quarter of Canadians used the internet to obtain health information in the year 2000 [1], a rate which has most likely grown over the past decade and a half. Recently, increased attention has been paid to the medical content published in the lay media. Initiatives include Health News Review (HealthNewsReview.org), a website which critically reviews medical stories in the U.S. news; health website accreditation programs, such as the Health On the Net Foundation (www.hon.ch/) and URAC (www.urac.org/), which provide quality accreditation for websites following an application and screening procedure; and MEDCIRCLE and MEDCERTAIN (www.medcircle.org/), which aim to develop technologies to guide consumers towards trustworthy health information [2].

Good quality reporting of health research in the lay media can help set a productive health policy agenda, increase society’s collective awareness of pressing health problems, and positively influence the public’s day-to-day health behaviors (e.g., undergoing preventative screening, pursuing a healthy diet and exercise, and promoting smoking cessation) [3–5]. Despite the valuable potential of the media as a public health tool, concerns have been raised about its reliability [1, 6–12]. For example, Haneef and colleagues (2015) identified at least one example of spin or misrepresentation of study findings in 88% of American, British, and Canadian media reports on studies of medical interventions. Given the impact health-related media reports can have, ensuring that they are of high quality is of critical importance. Misinformed readers may have heightened concerns or expectations about medical interventions which may lengthen, multiply, or complicate medical consultations and generate inappropriate health behaviors and requests for medical treatment, thereby increasing healthcare spending. Although the importance of reliable health journalism is well recognized, no empirical investigation has yet been undertaken to evaluate the quality of health research reporting in the Canadian media. Furthermore, initiatives such as Health On the Net Foundation are not aimed specifically at evaluating information about health research. This may be partly due to the paucity of available measurement tools, which has potentially stalled needed assessment and surveillance. To address this issue, we developed the Quality Index for health-related Media Reports (QIMR; Additional file 1). The objective of this paper is to describe the development and preliminary validation of the QIMR for evaluating the quality of health research reports published in the Canadian media.

Methods

This study was undertaken in two phases. First, the QIMR was developed through literature searches and consultation with key experts. The reliability and validity of the QIMR were subsequently tested with a sample of media reports. The development and testing process of the QIMR is presented in Fig. 1. This study was exempt from ethics review by the Hamilton Integrated Research Ethics Board.

Literature search

A literature search was undertaken to identify existing instruments for the evaluation of the quality of health-related media reports. MEDLINE and EMBASE were searched through the Ovid interface using key terms related to health education, patient education, and journalism. No restriction was placed on publication year. The search yielded one instrument: Oxman et al.’s (1993) Index of Scientific Quality (ISQ) for health-related news reports [13]. Given the limited yield, the search was expanded to also include instruments developed for the assessment of health information outside of the media. Two additional instruments were identified: Ensuring Quality Information for Patients (EQIP) and DISCERN [14, 15]. The three instruments were assessed for face and content validity.

A number of items on the ISQ were deemed to be irrelevant to the reliable communication of health research in the media. For example, when the ISQ was tested with a sample of media reports, item 1, which asks whether it is clear to whom the information presented in the media report applies, was found to be irrelevant in most cases. Discussion with key informants, such as journalists and researchers, also revealed item 5, which queries whether the media report communicated a clear and well-founded assessment of the precision of estimates, to be irrelevant. Key informants expressed that precision and confidence intervals were topics beyond the scope of what could be reasonably expected of a lay audience with no background in health research to appreciate. It was suggested that it may be sufficient for health journalists to acknowledge the possibility of false positive findings in qualitative terms as opposed to quantifying the likelihood. Finally, items of the ISQ are not organized within domains, thus limiting its capacity to identify specific areas of strength and weakness in health research reports. Unlike the ISQ, the EQIP and DISCERN were developed to evaluate the quality of health information written for patients and thus demonstrate poor content relevance to health research reporting.

To empirically evaluate the quality of health research reporting in the Canadian media, a new measure was developed that specifically addressed the content validity problems of the ISQ, EQIP, and DISCERN.

Item generation

To guide item generation, informal, semi-structured interviews were conducted with key informants, which included Canadian health journalists and medical researchers from the departments of Medicine and Clinical Epidemiology & Biostatistics at McMaster University. All health journalists held positions at Canadian newspapers and specialized in science and health reporting. All medical researchers were either full-time medical or epidemiology professors or medical doctors pursuing further graduate education in epidemiology at the doctoral level. Key informants were asked to describe high and poor quality health research reporting. They were also asked to provide examples of high and poor quality media reports and explain their strengths and shortcomings. A total of 13 interviews were conducted with six health reporters and seven health researchers, at which point it was believed that content saturation was reached. Content generated from these interviews was organized into five themes (background, sources, results, context, validity), which were then used to guide item generation.

Forty-two candidate items were generated by a working group of three researchers (DZ, JSG, JH) who were not part of the group of key informants consulted in the previous step. Although the ISQ, EQIP, and DISCERN were found to have poor content relevance overall, the few items within each instrument that were found to be relevant to assessing the quality of health research reports were also included in this pool of items. Using items from previously developed tools is advantageous in that these items have already undergone some degree of validation and have demonstrated acceptable psychometric performance [13–15]. Items were phrased as descriptive statements, which raters can endorse to varying degrees, depending on the extent to which the statements apply to the media report being evaluated.

To ensure content coverage, throughout the item generation process, items were mapped onto a matrix with each row representing an item and each column representing one of five themes that emerged from interviews. At least five items were generated for each of the five themes (range: 5-13).

A final global item, which queries the overall quality of the media report and is to be interpreted independently of the overall score, was also developed. Such a global item is useful in cases where the quality of a media report is more nuanced than can be captured by defined criteria [16]. This is occasionally seen with assessments of competence where global rating scales can demonstrate superior reliability, compared to lengthier scales.

Item reduction

It was decided that the instrument should include approximately 15 to 20 items. This length reduces the burden on respondents, who may be required to evaluate a large number of media reports for research or monitoring purposes, while still allowing the scale to discriminate adequately between media reports of varying quality and to demonstrate acceptable reliability. Additionally, the number of items in similar instruments, such as the EQIP and DISCERN, falls within this range.

Items generated by the investigators were reviewed with a group of five key informants (three health researchers and two health journalists who did not participate in the item generation process) who were instructed to review items for clarity and relevance. Items found to be unclear or irrelevant were excluded. To ensure that all themes were adequately covered, items were again mapped onto a matrix. Each domain included at least three items.

Formatting and response options

To facilitate ease of administration, items and domains were ordered according to the likely order of presentation of corresponding elements in media reports required for their assessment. A seven-point adjectival scale was used for all items. For each response option, an adjective which matched the item stem was selected and all response options were numbered. A seven-point scale was selected to maximize reliability and sensitivity and to compensate for potential end-aversion bias, while also avoiding cognitive overload in prospective respondents [17–19]. A seven-point scale has also performed well for other quality assessment instruments. For example, the Appraisal of Guidelines for Research & Evaluation II (AGREE II) instrument for assessing the quality of clinical practice guidelines also utilizes a seven-point scale and has demonstrated acceptable reliability [20].

Scoring

The QIMR is scored by adding the numbers corresponding to each endorsed response option together. Scores can be reported for each domain and/or totaled across all domains. We encourage prospective users to report scores as a percentage of the maximum possible attainable score in a domain or the full scale to aid interpretability.
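The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the domain names used as dictionary keys, the example ratings, and in particular the assumption that the seven response options are numbered 0 to 6 are ours; the QIMR manual defines the actual numbering.

```python
def qimr_percent(responses, max_option=6):
    """Return a summed score as a percentage of the maximum attainable score.

    responses: numbers of the endorsed response options, one per item.
    max_option: highest option number (ASSUMED to be 6, i.e. options 0-6).
    """
    if not responses:
        raise ValueError("at least one item response is required")
    if any(r < 0 or r > max_option for r in responses):
        raise ValueError("response outside the assumed option range")
    return 100.0 * sum(responses) / (max_option * len(responses))


# Hypothetical ratings of one media report on two three-item domains.
domain_responses = {
    "background": [4, 5, 3],
    "validity": [1, 0, 2],
}

# Score each domain separately, then the full scale across all items.
per_domain = {name: qimr_percent(items) for name, items in domain_responses.items()}
all_items = [r for items in domain_responses.values() for r in items]
total = qimr_percent(all_items)
```

Reporting both `per_domain` and `total` mirrors the option of scoring by domain and/or across the full scale.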

Sample

The objective of the sampling strategy was to obtain a sample of media reports of variable quality, reflective of the content to which the public is regularly exposed. The selection of news sources which would yield artificially high variation in quality and inflated reliability coefficients was avoided. A purposive sample of four Canadian news sources (The Toronto Star, National Post, The Hamilton Spectator, and Winnipeg Sun) of varying rates of circulation was selected. Factiva (https://global.factiva.com) was used to search the four news sources for health research reports according to the search strategy presented in Additional file 2, developed with assistance from a social science research librarian. The search was conducted on March 9, 2016. Two investigators (DZ and MO) selected the seven most recent media reports published by each news source that met the predefined inclusion criteria. Media reports that focused predominantly on a health research study were included (i.e., half or more of the word count was dedicated to the reporting of a health-related research study). Letters to the editor, media reports on research findings not published in peer-reviewed journals, and media reports which discussed findings from more than two research studies were excluded. Conflicts between investigators regarding the inclusion status of a media report were resolved by discussion.

Rating the quality of media reports

Two investigators (DZ and MO) independently evaluated the quality of the included media reports using the QIMR and the accompanying manual.

Statistical analysis

Statistical analyses were performed using SPSS Version 21 (Boston, Massachusetts, 2012) and G-String Version IV (http://fhs.mcmaster.ca/perd/index.html).

Item and domain characteristics

Item and domain characteristics, including measures of central tendency (i.e., means, medians), dispersion (i.e., standard deviations), and internal consistency (i.e., Cronbach’s alpha coefficients) were computed. Corrected item-total correlations were also calculated by correlating item scores with the total QIMR score to evaluate the scale for unidimensionality.
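As a rough illustration of two of these statistics, the sketch below computes Cronbach’s alpha and a corrected item-total correlation (each item correlated with the sum of the remaining items) using only the Python standard library. The analyses in this study were run in SPSS; the function names and the small data set here are hypothetical.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(rows):
    """rows: one list of item scores per media report."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def corrected_item_total(rows, i):
    """Correlate item i with the sum of all other items."""
    item = [row[i] for row in rows]
    rest = [sum(row) - row[i] for row in rows]
    return pearson(item, rest)


# Hypothetical scores: five media reports rated on a three-item domain.
rows = [
    [2, 3, 1],
    [4, 4, 3],
    [5, 6, 4],
    [3, 2, 2],
    [6, 5, 5],
]
alpha = cronbach_alpha(rows)
r_item1 = corrected_item_total(rows, 1)
```

Removing the item from the total before correlating avoids inflating the correlation, which matters most for short domains such as these.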

Scale reliability

To test the reliability of the QIMR, Generalizability theory (G theory) was used [21]. In contrast to classical test theory, G theory has the advantage of allowing assessment of the relative contribution of variance across multiple sources, referred to as facets of generalization, simultaneously, thus producing more accurate reliability estimates. G studies yield relative (i.e., norm-referenced) and absolute (i.e., criterion-referenced) G coefficients, which can be interpreted similarly to reliability coefficients. To compute G coefficients, facets of generalization are designated as either fixed or random. Random facets contribute to the error term in the calculation of the reliability coefficient and fixed facets do not. In this investigation, the object of differentiation was specified as media reports. Three facets of generalization were identified: item nested within domain, domain, and rater.

The following generalizability coefficients were calculated: an internal consistency generalizability coefficient, computed by holding rater and domain fixed and designating item as a random facet of generalization; an inter-domain generalizability coefficient, computed by holding raters and items fixed and designating domain as a random facet of generalization; three inter-rater generalizability coefficients, computed by varying the number of raters, holding items and domains fixed, and designating rater as a random facet of generalization; and an overall mixed generalizability coefficient, computed by designating all facets as random. Relative inter-item and inter-domain generalizability coefficients were interpreted under the assumption that items and domains will not vary between administrations of the QIMR, whereas absolute inter-rater generalizability coefficients were interpreted under the assumption that the raters used to generate the data are a random sample of all possible raters and that systematic differences between raters are important in interpreting QIMR scores. The overall generalizability coefficient was calculated as a mixed coefficient to retain the systematic effects of raters and remove the systematic effects of items and domains. Traditionally, for non-clinical or high-stake measurement, reliability coefficients below 0.5 are seen as unreliable, measures between 0.5 and 0.7 are modest, and reliability coefficients above 0.7 indicate satisfactory reliability [22]. This standard was used to interpret reliability and generalizability coefficients with the caveat that these criteria are less stringent for generalizability coefficients, which take into account multiple sources of variation simultaneously.
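The distinction between relative and absolute coefficients can be illustrated with a deliberately simplified design. The sketch below assumes a single random facet (raters) fully crossed with media reports, which is much simpler than the item-within-domain, domain, and rater design analyzed here with G-String. The function names and variance components are hypothetical.

```python
def relative_g(var_report, var_residual, n_raters):
    """Relative (norm-referenced) G coefficient for a reports x raters design.

    Only the report-by-rater interaction (confounded with residual error)
    counts as error, because it changes the rank ordering of reports.
    """
    return var_report / (var_report + var_residual / n_raters)


def absolute_g(var_report, var_rater, var_residual, n_raters):
    """Absolute (criterion-referenced) G coefficient for the same design.

    The rater main effect (systematic leniency or severity) is added
    to the error term.
    """
    error = (var_rater + var_residual) / n_raters
    return var_report / (var_report + error)


# Hypothetical variance components, NOT the values reported in Table 5.
var_report, var_rater, var_residual = 4.0, 1.0, 2.0

rel = relative_g(var_report, var_residual, n_raters=2)
abs_ = absolute_g(var_report, var_rater, var_residual, n_raters=2)
```

With any non-zero rater variance the absolute coefficient is lower than the relative one, which is why treating systematic rater differences as error gives the more conservative estimate.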

Construct validation

Pearson correlation coefficients were used to test two convergent construct validity hypotheses: that the QIMR total score would correlate with raters’ global impression and with media report word count. Although the QIMR is completed on an adjectival scale, scores generated from Likert or adjectival scales generally demonstrate interval properties, and parametric tests are robust to violations of the interval assumption [23]. Hence, Pearson correlation coefficients were thought to be appropriate.

Sample size

Given that reliability coefficients are not substantially impacted by sample size and the absence of existing guidelines for conducting sample size calculations for G studies [16], a sample size of 28 media reports was pragmatically chosen. Assuming a strong correlation of 0.8 between global and QIMR score, we estimated that 28 media reports would provide more than sufficient power at the 0.05 alpha level. Assuming a moderate correlation of 0.5 between word count and QIMR score, a sample size of 28 media reports would provide 80% power to detect a statistically significant correlation at the 0.05 alpha level.
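The power figures above can be approximately checked with the Fisher z transformation. The sketch below is one common approximation for a two-sided test of a zero correlation, not necessarily the method used for the original calculation; the function names are ours.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def correlation_power(r, n, z_crit=1.959964):
    """Approximate two-sided power to detect a true correlation r with n pairs.

    Uses the Fisher z approximation: atanh(sample r) is roughly normal with
    mean atanh(r) and standard deviation 1 / sqrt(n - 3).
    z_crit is the two-sided critical value for alpha = 0.05.
    """
    shift = math.atanh(r) * math.sqrt(n - 3)
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)


power_moderate = correlation_power(0.5, 28)  # word count hypothesis
power_strong = correlation_power(0.8, 28)    # global rating hypothesis
```

Under this approximation, a correlation of 0.5 with 28 media reports gives power a little under 80%, and a correlation of 0.8 gives power above 99%, broadly in line with the figures stated above.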

Results

Media report characteristics

Characteristics of the 28 included media reports and the four included news sources are presented in Tables 1 and 2. Media reports described studies from a range of medical specialties. The three most commonly reported specialties were public health (5; 18%), neonatology and pediatrics (5; 18%), and infectious disease (4; 14%). The mean word count of included media reports was 587.21 (SD 290.49) and ranged from 127 to 1175 words. The mean QIMR rating was 50.86% (SD 15.64%).

Table 1 Media report characteristics

Table 2 News source characteristics

Item characteristics

Item characteristics are presented in Table 3. Responses for most items appear to be approximately normally distributed, as indicated by similar mean and median statistics. All item standard deviations exceed one, indicating variability in responses. The response options endorsed appear to span the full scale, as illustrated by the minimum and maximum values. Endorsement frequencies of response options for all items were individually examined. No response option for any of the items produced an endorsement frequency greater than 75%. These results suggest that all items are adequately differentiating between media reports.

Table 3 Item statistics

Most items appear to have acceptable item-total correlation values, between 0.3 and 0.7 [16]. Four items were poorly correlated with total scale scores: one item in the ‘background’ domain assessing use of jargon, one item in the ‘sources’ domain assessing the identification of the organizational and financial affiliations of the study, and two items in the ‘validity’ domain assessing whether the appropriateness of the study methodology and strengths and weaknesses of the study are adequately discussed.

Domain characteristics

Domain statistics are presented in Table 4. The mean domain scores ranged from 15.06% to 66.41% for the ‘validity’ and ‘results’ domains, respectively. Standard deviation statistics ranged from 2.26 points (12.56%) to 4.44 points (24.67%) for the ‘results’ and ‘context’ domains, respectively.

Table 4 Domain Statistics

The ‘background’, ‘sources’, and ‘context’ domains produced less than ideal Cronbach’s alpha coefficients, ranging from 0.49 for the ‘sources’ domain to 0.58 for the ‘background’ domain. Cronbach’s alpha values for the remaining two domains were above 0.70.

Generalizability study

Variance components from the G study are presented in Table 5. The three largest sources of variance were media reports (38.85%), domains (29.21%), and the interaction between media reports and raters (11.26%).

Table 5 Variance components generated from the Generalizability study

Generalizability coefficients are presented in Table 6. The internal consistency of the scale was estimated at 0.87, inter-domain reliability was 0.72, and inter-rater reliability between two raters was 0.68. Averaging QIMR ratings over three raters increased the inter-rater reliability to 0.76. The overall generalizability coefficient for the scale was 0.54.
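The gain from adding a third rater can be reproduced approximately with the Spearman-Brown formula, which has the same form as a decision-study projection when rater is the only random facet. Whether the reported value was obtained this way is not stated, so the sketch below is only a consistency check; the function names are ours.

```python
def single_rater_reliability(r_k, k):
    """Invert Spearman-Brown: reliability of one rater given the k-rater value."""
    return r_k / (k - (k - 1) * r_k)


def spearman_brown(r_1, k):
    """Reliability of the average of k raters given single-rater reliability."""
    return k * r_1 / (1 + (k - 1) * r_1)


# Start from the reported two-rater coefficient of 0.68.
r1 = single_rater_reliability(0.68, 2)
r3 = spearman_brown(r1, 3)
```

Here `r3` rounds to 0.76, which agrees with the three-rater coefficient reported above, and `r1` (about 0.52) indicates the reliability that would be expected from a single rater.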

Table 6 Generalizability coefficients generated from G study

Construct validity

Table 7 presents the results of the construct validity hypotheses tested. QIMR scores were highly correlated with raters’ global impression and moderately correlated with word count. The relationships of QIMR scores with global ratings and with word count are presented in Figs. 2 and 3, respectively.

Table 7 Construct validation results (Pearson correlation coefficients)

Discussion

The results from the preliminary reliability and validity testing suggest that the QIMR demonstrates adequate reliability and validity for use by health researchers to evaluate the quality of health research reports in the lay media.