Can the validity of a cohort be improved by reweighting based on register data? Evidence from the Swedish MDC study
BMC Public Health volume 20, Article number: 1918 (2020)
In any study with voluntary participation, self-selection risks leading to invalid conclusions. If the determinants of selection are observed, it is however possible to restore the parameters of interest by reweighting the sample to match the population, but this approach has seldom been applied in epidemiological research.
We reweighted the Malmö Diet and Cancer (MDC) study based on population register data on background variables, including socio-demographics and hospital admissions for both participants and the background population. Following individuals from baseline in 1991–1996 and at most until 2016, we studied mortality (all-cause, cancer, and CVD), incidences (cancer and CVD), and associations between these outcomes and background variables. Results from the unweighted and reweighted participant sample were compared with those from the background population.
Mortality was substantially lower in participants than in the background population, but reweighting the sample did little to bring the numbers closer to those in the background population. For incidences and associations, numbers were generally similar between participants and the background population even without reweighting, rendering reweighting unnecessary.
Reweighting samples based on an extensive range of sociodemographic characteristics and previous hospitalizations does not necessarily yield results that are valid for the population as a whole. In the case of MDC, there appear to be important factors related to both mortality and selection into the study that are not observable in registry data, making it difficult to obtain accurate numbers on population mortality based on cohort participants. These issues seem less relevant for incidences and associations, however. Overall, our results suggest that representativeness must be judged on a case-by-case basis.
Selective participation is a concern in most cohort studies. In particular, conclusions about prevalences, incidences, and associations may not generalize from the study sample to the population as a whole when participants are not representative of the full population [1,2,3,4,5,6,7,8,9]. Furthermore, the interpretation of associations as causal effects may be hampered [6, 10, 11]. While non-response and refusal to enroll in scientific studies have become increasing problems in both social science and medicine, few epidemiological studies have so far taken measures to correct for the implied biases.
In many ways, selection into a cohort resembles selection into an exposure, i.e., the phenomenon that can give rise to confounding. There is a recent theoretical literature on how representativeness can be restored by reweighting the cohort, particularly by using the method of inverse probability of participation weighting (IPPW) [12,13,14,15], a method with similarities to propensity score (PS) weighting to deal with confounding. To apply IPPW or similar methods, however, data on background variables for both participants and non-participants are required. Since such data have typically not been available for non-participants, there have so far been limited opportunities to evaluate the practical consequences of reweighting for representativeness.
Summarizing previous evidence on cohort participation in epidemiology, Galea and Tracy [16] noted that women, married individuals, and individuals of higher socioeconomic status were more likely to participate. Other personal characteristics such as age or ethnicity have sometimes also been linked to participation, but with no consistent pattern across studies. In any case, given the strong relationship between socioeconomic status and health [17,18,19], one may often expect individuals with poor health to be underrepresented in studies. In a previous article based on another Swedish cohort study, where it was possible to compare prevalences in the sample with those in the population, we showed that reweighting based on sociodemographic register data worked very well and sometimes appeared sufficient to correct for selection into the cohort [7].
In this article, we considered selection into the Malmö Diet and Cancer (MDC) study, a cohort study conducted in Southern Sweden with a baseline examination including a self-administered questionnaire, anthropometric measuring, and blood sampling [20]. Recruitment to this cohort took place between 1991 and 1996, with a participation rate of about 40%. As has been shown previously, participants and non-participants differed in terms of cancer incidence and mortality, both before and during recruitment, and during a short follow-up period [21]. In the present study, we investigated the possibility of using reweighting to correct for selective participation in a long-term follow-up, considering outcomes such as mortality, disease incidences, and associations with these outcomes, using weights based on socio-demographics and disease history. As we had access to data on the same outcomes also in the background population, we were able to evaluate to what extent the reweighting method was able to improve representativeness of the study cohort.
We reweighted the sample with a method in line with previous literature on IPPW [7, 12,13,14]. A special feature of our data, however, was that it came in two separate sets: one for the participants and one for the full background population, with no linkage or indicator of who in the background population was a participant. As a result, the standard IPPW method could not be applied. Instead, we used an approach similar to that applied in studies of transportability, where the aim is normally to produce estimates that are valid for an entirely different population than the background population [12, 22]. As we have described in a previous article, this approach can also be used to achieve generalizability in situations like ours, where data on participants and the background population come in separate datasets [23].
As a first step, we combined the participant sample and the background population datasets into one dataset, where each participant thus appeared twice (but could only be identified as a participant once). A binary variable was created to indicate membership in the participant sample, and a logistic regression was then applied to predict this membership based on the background characteristics, including socio-demographics and disease history. As we have shown, predicted odds for belonging to the participant sample could then be interpreted as predicted probabilities of actually being a participant [23]. Sampling weights were calculated for the participants as the multiplicative inverses of the predicted probabilities of actually being a participant.
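The stacking step can be illustrated with a minimal Python sketch. It assumes a single categorical background variable and a saturated model, so that the fitted probabilities equal simple cell proportions; the study itself used a logistic regression on many register variables. The function name, stratum labels, and counts are hypothetical.

```python
from collections import Counter


def inverse_odds_weights(participants, background):
    """Return one weight per stratum from stacked participant/background data.

    Both arguments are lists of stratum labels, one label per record.
    The background list contains the participants too, but unlabeled.
    """
    n_part = Counter(participants)
    n_pop = Counter(background)
    weights = {}
    for stratum, n in n_part.items():
        # Probability of carrying the participant label in the stacked data
        p_stacked = n / (n + n_pop[stratum])
        # The odds n / N equal the participation probability in this stratum
        odds = p_stacked / (1 - p_stacked)
        weights[stratum] = 1 / odds
    return weights


# Hypothetical example: participation is 60% with high and 20% with low education
participants = ["high_edu"] * 60 + ["low_edu"] * 20
background = ["high_edu"] * 100 + ["low_edu"] * 100
w = inverse_odds_weights(participants, background)

# Weighted share of low education among participants after reweighting
weighted_low = 20 * w["low_edu"]
weighted_total = 60 * w["high_edu"] + 20 * w["low_edu"]
low_share = weighted_low / weighted_total
```

In this toy setting the reweighted share of low education matches the background population, which is what reweighting guarantees only for the variables that enter the model.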
The distributions of background characteristics in the background population were compared with those in the participant sample and with those in the reweighted participant sample. (In principle, background characteristics in the reweighted participant sample should resemble those in the background population, as the reweighting was made exactly based on these.) Subsequent mortality and incidence of hospitalization were then compared across the background population, participants, and reweighted participants. Additional analyses were stratified on quintiles of the estimated participation probabilities, allowing us to examine if differences between participants and individuals in the background population may have been concentrated, for example, to those with a low propensity to participate. Furthermore, we used Cox regressions to estimate associations between outcomes and background variables, and to compare these across the background population, participants, and reweighted participants.
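As an illustration of how rates in the unweighted and reweighted participant samples can be compared, the following Python sketch computes events per 10,000 person-years with optional sampling weights. The function name and all numbers are hypothetical and are not taken from the study data.

```python
def rate_per_10000(events, person_years, weights=None):
    """Weighted events per 10,000 person-years.

    events: 0/1 indicator per individual
    person_years: follow-up time per individual
    weights: sampling weight per individual (defaults to 1 for everyone)
    """
    if weights is None:
        weights = [1.0] * len(events)
    weighted_events = sum(w * e for w, e in zip(weights, events))
    weighted_time = sum(w * t for w, t in zip(weights, person_years))
    return 10_000 * weighted_events / weighted_time


# Hypothetical participants: the two with events also have larger weights
events = [0, 0, 1, 1]
person_years = [20.0, 20.0, 10.0, 5.0]
weights = [1.5, 1.5, 4.0, 4.0]

unweighted_rate = rate_per_10000(events, person_years)
reweighted_rate = rate_per_10000(events, person_years, weights)
```

Here the reweighted rate is higher than the unweighted one because individuals with a low estimated participation probability, who receive larger weights, are the ones with events.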
To evaluate the ability of the logistic regression model to predict participation based on the background variables, the area under the ROC curve was calculated. We also visually examined the estimated participation probabilities, separately for participants and non-participants. In these two analyses, duplicates of participants were removed by sorting the data according to the predicted probabilities of participating, and for each predicted probability omitting the same number of individuals from the background population as the number of individuals in the participant sample. Analyses were performed using Stata 15.1 (StataCorp) and SPSS Statistics 25 (IBM).
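The duplicate-removal rule and the area under the ROC curve can be sketched in Python as follows. This is an illustrative reading of the procedure described above, not the authors' Stata or SPSS code; the function names and scores are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) definition.

```python
from collections import Counter


def remove_participant_duplicates(background_scores, participant_scores):
    """For each predicted probability, drop as many background records
    as there are participants with that same predicted probability."""
    remaining = Counter(background_scores) - Counter(participant_scores)
    return sorted(remaining.elements(), reverse=True)


def roc_auc(scores_pos, scores_neg):
    """Probability that a random positive scores higher than a random
    negative, counting ties as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted participation probabilities
participant_scores = [0.6, 0.5, 0.3]
background_scores = [0.6, 0.5, 0.5, 0.4, 0.3, 0.2, 0.2]

non_participant_scores = remove_participant_duplicates(
    background_scores, participant_scores
)
auc = roc_auc(participant_scores, non_participant_scores)
```

The pairwise loop is quadratic in the number of records, so for data of the size used in the study a rank-based implementation would be preferable.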
Our full background population consisted of all men (born 1923–1945) and women (born 1923–1950) who lived in Malmö between January 1, 1991, and September 30, 1996. This population, comprising 74,103 individuals, essentially corresponded to those who were invited to participate in the MDC study. (In practice, some were never invited because of death, migration, or other issues.) The participant sample comprised 28,098 individuals. All participants in MDC provided written informed consent at enrollment.
Data on socio-demographics were retrieved from Statistics Sweden and spanned the years 1990 to 2016. These data included year of birth, sex, civil status, country of birth (grouped), migration events, and an array of socioeconomic information, such as the highest level of education and income from different sources. Moreover, we retrieved data from the National Board of Health and Welfare, including the Patient Register, covering all hospitalizations and associated diagnoses from 1987 to 2016, and the Cause of Death Register, from which we obtained data on deaths and causes of deaths between 1990 and 2016. The reason for retrieving hospitalizations specifically from 1987 was that the Patient Register reached national coverage in this year, and we wished to account for hospitalizations during at least a few years prior to baseline.
For participants in MDC, we made use of background data on socio-demographics in the year prior to enrollment (or in the same year if no data was available in the previous year, which could occur if the individual had lived abroad). Hospitalizations were divided into groups based on the International Classification of Diseases (ICD), version 9 or 10: neoplasms (ICD codes 140–239/C00-D48), diabetes (ICD codes 250/E10-E14), mental and behavioral disorders (ICD codes 290–319/F00-F99), diseases of the circulatory system (ICD codes 390–459/I00-I99), diseases of the respiratory system (ICD codes 460–519/J00-J99), and diseases of the digestive system (ICD codes 520–579/K00-K93). Binary indicators were created to measure if the individual had had at least one hospitalization for these types of diagnoses between 1987 and enrollment.
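The grouping of hospitalizations into binary indicators can be sketched in Python as below. The code ranges are those listed above; the function names, the explicit `version` argument, the short group labels, and the choice to count only admissions strictly before the enrollment date are assumptions made for illustration.

```python
from datetime import date

ICD9_GROUPS = {
    "neoplasms": (140, 239),
    "diabetes": (250, 250),
    "mental": (290, 319),
    "circulatory": (390, 459),
    "respiratory": (460, 519),
    "digestive": (520, 579),
}
ICD10_GROUPS = {
    "neoplasms": ("C00", "D48"),
    "diabetes": ("E10", "E14"),
    "mental": ("F00", "F99"),
    "circulatory": ("I00", "I99"),
    "respiratory": ("J00", "J99"),
    "digestive": ("K00", "K93"),
}


def classify(code, version):
    """Map a diagnosis code to one of the six groups, or None."""
    category = code[:3].upper()
    if version == 9:
        if not category.isdigit():
            return None  # non-numeric ICD-9 codes fall outside the six groups
        number = int(category)
        for group, (low, high) in ICD9_GROUPS.items():
            if low <= number <= high:
                return group
        return None
    for group, (low, high) in ICD10_GROUPS.items():
        # String comparison orders letter-plus-two-digit categories correctly
        if low <= category <= high:
            return group
    return None


def history_indicators(admissions, enrollment, start=date(1987, 1, 1)):
    """admissions: iterable of (admission_date, code, icd_version)."""
    flags = {group: False for group in ICD9_GROUPS}
    for admitted, code, version in admissions:
        group = classify(code, version)
        if group is not None and start <= admitted < enrollment:
            flags[group] = True
    return flags


# Hypothetical admission history for one individual enrolled in 1994
admissions = [
    (date(1989, 3, 2), "410", 9),   # myocardial infarction before enrollment
    (date(1998, 5, 9), "E11", 10),  # diabetes after enrollment, ignored
]
flags = history_indicators(admissions, enrollment=date(1994, 7, 1))
```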
There was no information on when individuals in the background population were invited to participate in the MDC study. We therefore assigned “imaginary” dates of enrollment to individuals in the background population, where the calendar date was always set to July 1 and the enrollment year was drawn from the birth-year-specific distribution of enrollment years observed among participants. Individuals in the background population who had moved, died, or for other reasons lacked information on sociodemographic variables around the time of imaginary enrollment were excluded, reducing the background population to 71,447 individuals. Among the 28,098 participants, two were excluded because there was no information on sociodemographic variables around the time of enrollment.
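A minimal Python sketch of the assignment of imaginary enrollment dates is given below. Drawing uniformly from the enrollment years observed among participants with the same birth year is equivalent to drawing from the birth-year-specific empirical distribution. The function name, the seed, and the example records are hypothetical, and the sketch assumes that every birth year in the background population also occurs among the participants.

```python
import random
from collections import defaultdict
from datetime import date


def assign_imaginary_enrollment(background_birth_years, participants, seed=0):
    """participants: iterable of (birth_year, enrollment_year) pairs.

    Returns one imaginary enrollment date (always July 1) per
    background individual.
    """
    rng = random.Random(seed)
    years_by_birth = defaultdict(list)
    for birth_year, enrollment_year in participants:
        years_by_birth[birth_year].append(enrollment_year)

    dates = []
    for birth_year in background_birth_years:
        # Raises IndexError if no participant shares this birth year
        year = rng.choice(years_by_birth[birth_year])
        dates.append(date(year, 7, 1))
    return dates


# Hypothetical records
participants = [(1930, 1992), (1930, 1992), (1930, 1995), (1945, 1996)]
background_birth_years = [1930, 1930, 1945, 1930]
imaginary = assign_imaginary_enrollment(background_birth_years, participants)
```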
Outcomes examined included mortality and incidence of disease. We considered all-cause mortality, but also the two most common causes of death: deaths due to cardiovascular disease (CVD) and cancer. Furthermore, we considered incidences of CVD and cancer. Following previous studies on MDC [24], CVD mortality was defined by ICD codes 390–459/I00–99 whereas incident CVD was defined as the occurrence of either a coronary event (a fatal or nonfatal myocardial infarction, 410/I21, or a death due to ischemic heart disease, 414/414/I22/I23/I25) or a fatal or nonfatal stroke (430/431/434/436/I60/I61/I63/I64), whichever came first. Incident cancer was conventionally defined by ICD codes 140–209/C00–99.
Individuals who had not yet experienced the particular outcome of interest were followed from the time of baseline examination (or imaginary enrollment) in MDC, and contributed with person-time until the first event of interest occurred, or until death or emigration; at most until the end of 2016.
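The follow-up rule can be written as a short Python sketch. The function name is hypothetical, and converting days to years by dividing by 365.25 is a simplifying assumption.

```python
from datetime import date

END_OF_FOLLOW_UP = date(2016, 12, 31)


def follow_up(baseline, event=None, death=None, emigration=None,
              end=END_OF_FOLLOW_UP):
    """Return (person-years, event indicator) for one individual.

    Follow-up stops at the first of: event of interest, death,
    emigration, or the administrative end of follow-up.
    """
    candidates = [d for d in (event, death, emigration, end) if d is not None]
    stop = min(candidates)
    # A fatal event has the same date as death and still counts as an event
    had_event = event is not None and event == stop
    years = (stop - baseline).days / 365.25
    return years, had_event


# Hypothetical individuals
years_a, event_a = follow_up(date(1993, 7, 1), event=date(2000, 7, 1))
years_b, event_b = follow_up(date(1993, 7, 1), emigration=date(1998, 7, 1),
                             event=date(2005, 1, 1))
years_c, event_c = follow_up(date(1993, 7, 1))
```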
Table 1 displays background characteristics, reported separately for the full background population and for the participants. It also includes the same descriptive statistics for the reweighted participant sample. Compared to the background population, individuals in the unweighted participant sample were more likely to be aged 56–64, female, born in Sweden, married, have more than primary education, be employed, and have higher income. The differences were mainly noticeable for country of birth and the different aspects of socioeconomic status. For disease history, the difference was mainly that participants were less likely to have had a hospitalization for a mental or behavioral disorder.
Reweighting the participant sample produced distributions of background characteristics that throughout resembled those in the background population. The joint ability of the background characteristics to classify individuals as participants was, however, relatively modest (area under the ROC curve = 66.6%). Disease history contributed very little to the classifying ability (omitting previous hospitalizations only reduced the area under the ROC curve to 66.2%). Figure 1 shows the distribution of estimated participation probabilities, separately for participants and non-participants. While the distributions of the estimated probabilities overlapped, the central tendency was clearly higher among participants.
In Table 2, we display mortality and disease incidences in the background population, the participant sample, and the reweighted participant sample. Mortality was clearly higher in the background population than among participants, but less so for cancer mortality than for all-cause and CVD mortality. Reweighting eliminated only small shares of the gaps: 14% for all-cause mortality and 21% for CVD mortality. For cancer mortality, reweighting made virtually no difference.
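One natural way to express the share of a gap eliminated by reweighting is the change in the participant rate divided by the original gap to the background population. The Python sketch below assumes this definition, which is not spelled out in the text, and uses hypothetical rates rather than the values in Table 2.

```python
def share_of_gap_eliminated(background, unweighted, reweighted):
    """Fraction of the participant/background gap closed by reweighting.

    A value of 0 means no change, 1 means the gap is fully closed, and
    a negative value means the gap widened.
    """
    return (reweighted - unweighted) / (background - unweighted)


# Hypothetical rates per 10,000 person-years
share = share_of_gap_eliminated(background=200.0, unweighted=150.0,
                                reweighted=157.0)
widened = share_of_gap_eliminated(background=200.0, unweighted=195.0,
                                  reweighted=193.0)
```

With these made-up numbers, `share` is 0.14 and `widened` is negative, the latter corresponding to a case where reweighting moves the estimate away from the population value.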
For disease incidences, the numbers were quite similar across the background population and participants even without reweighting. In the case of CVD, the existing gap was narrowed somewhat further by the reweighting; for cancer, the gap instead increased somewhat.
Since there is a definitional overlap between incidence and mortality, we also calculated incidences based only on nonfatal events (i.e., only hospitalizations). As it turned out, these incidences were virtually identical across participants and the background population even without reweighting (Table A1, supplement).
Table 2 (as well as Table A1, supplement) also shows numbers stratified on participation propensity quintiles. Across almost all the quintiles and outcomes, participants had more favorable outcomes than the background population. The finding reflects the limited success of the reweighting method: Shifting the distribution of the sample away from those with a high propensity to participate makes only a small difference since, throughout, there are unobserved factors that contribute to more favorable outcomes in participants.
In Fig. 2, we show Kaplan-Meier (KM) plots for the five outcomes considered, with separate lines representing cumulative incidences for the background population, the participant sample, and the reweighted participant sample. For all mortality outcomes, the graphs suggest that discrepancies in survival between participants and the background population appeared more or less immediately, and widened continuously over time. Throughout the time period, reweighting had only small effects. For disease incidences, numbers were quite similar for the background population, the participant sample, and the reweighted participant sample throughout the time period, especially so for cancer incidence.
In Table 3, we report associations (fully adjusted models) between all-cause mortality and the background variables, separately for the background population, the participant sample, and the reweighted participant sample. The corresponding results for cause-specific mortality and incidences are provided in Tables A2-A5 in the supplement. There were strong associations between the outcomes and background variables, especially age but also gender, socioeconomic status, and, in some cases, disease history.
Associations were generally similar across the background population, the participants, and the reweighted participants, with ratios of hazard ratios tending to be close to 1. However, the association between age and all-cause mortality as well as between age and CVD-related outcomes was larger in the participants than in the background population, and the association between mental illness and cancer-related outcomes was larger in the participants than in the background population. Throughout, it can be seen that the reweighting made virtually no difference for the associations.
Selective participation is generally a concern for studies based on voluntary participation. While, in principle, sampling methods such as quota sampling could be used to obtain cohorts that are similar to the full population with respect to observed background variables, these methods are rarely used in practice, implying that selective participation must be accounted for retrospectively. The MDC study is one example of a cohort with voluntary participation, and has been used in more than 100 published articles, typically without any reference to the lack of representativeness. The MDC study is also part of the European Prospective Investigation into Cancer and Nutrition Cohort (EPIC), one of the world’s largest cohort studies, with more than half a million participants. Assessing whether results based on these studies are representative of the underlying populations should therefore be of crucial interest. In this article, we set out to examine selection into the MDC study, the discrepancies in mortality, incidences, and associations across MDC participants and the background population, and to what extent reweighting the cohort with respect to observed background characteristics from registers allowed for the elimination of the discrepancies.
We found that the distributions of the background characteristics differed across MDC participants and the background population – not least in that socioeconomic status was higher among participants. Nevertheless, despite the high level of detailed background information, many determinants of selection into the cohort appeared to remain unobserved, as the ability of the background characteristics to discriminate between participants and non-participants was limited. Hence, for outcomes influenced by the same unobserved background characteristics that influenced selection, the reweighting method should have limited success.
While there were pronounced differences in mortality across participants and the background population, we found that reweighting based on observed background characteristics did little to narrow these gaps. Disease incidences and associations were more similar across participants and the background population even before reweighting. For CVD incidence, reweighting closed the existing gap somewhat, whereas for cancer incidence the gap actually increased. For associations, reweighting generally had little effect.
The particular finding that associations tended to be similar across participants and the background population, even without reweighting, is in line with several previous studies. For example, in the Norwegian Mother and Child Cohort Study and in the Danish National Birth Cohort, associations between risk factors and birth outcomes were generally similar across participating women and the full background populations of women giving birth, despite marked differences between participating women and the background populations in terms of background variables such as age and family status [4, 5]. Similarly, in a recent study based on another Swedish cohort (the Scania Public Health Cohort Study; SPHC), we documented that associations between different variables and mortality as well as drug purchases were relatively similar across participants, drop-outs, and the total cohort, although drop-out as such was strongly related to several variables including age, country of birth, smoking, socioeconomic status, and mortality [9]. In some contrast, however, another recent study compared participants in the UK Biobank with participants in the more representative Health Survey for England/Scottish Health Surveys (HSE-SHS) and found that the associations between CVD mortality and several risk factors (e.g., gender, glycated hemoglobin, and self-reported CVD) were substantially different across the two cohorts. Whether these discrepancies in associations could have been mitigated by reweighting based on observable characteristics was not examined.
Risk scores, such as the Framingham risk score for 10-year risk of CVD [26], are often estimated based on cohorts where self-selection may be an issue. If it is generally the case that associations between CVD and its risk factors in participant samples are similar to the corresponding associations in the background population, these risk scores may provide accurate conclusions about the relative risks across different population groups. As we have seen, however, the levels of outcomes may still vary across participants and background population, and a risk score calculated based on a self-selected cohort may therefore underestimate (or overestimate) risks for everyone.
In conclusion, the results of the present study suggest that reweighting a health cohort with respect to sociodemographic and disease history variables that are commonly available in population registers is not necessarily sufficient to accurately estimate population-level outcomes, such as mortality. While this finding is somewhat discouraging, not least in light of the substantial and expanding literature on reweighting and generalizability [12,13,14,15, 22, 27], it should be emphasized that our results varied across the outcomes considered. For mortality and incidences, reweighting was at least somewhat helpful. Researchers using MDC data may thus be able to improve their validity to some extent by applying our weights, which we make available. On the other hand, several parameters were similar across participants and the background population already to begin with. We conclude that representativeness must be judged on a case-by-case basis. Future research should examine the potential benefits of reweighting in other contexts, not least where associations differ markedly between unweighted participants and the background population.
Availability of data and materials
The database used in this study is closed but researchers with an ethical approval from the Swedish Ethical Review Authority may contact the first author A.N. to gain access. We received access to the data via the register holders (the Malmö Diet and Cancer study at Lund University, Statistics Sweden, and the National Board of Health and Welfare in Sweden) after an ethical approval by the Regional Ethics Review Board in Lund.
ICD: International Classification of Diseases
IPPW: Inverse probability of participation weighting
MDC: Malmö Diet and Cancer
Hara M, Sasaki S, Sobue T, Yamamoto S, Tsugane S. Comparison of cause-specific mortality between respondents and non-respondents in a population-based prospective study: ten-year follow-up of JPHC study cohort I. Japan Public Health Center. J Clin Epidemiol. 2002;55:150–6.
Mattila VM, Parkkari J, Rimpelä A. Adolescent survey non-response and later risk of death. A prospective cohort study of 78,609 persons with 11-year follow-up. BMC Public Health. 2007;7:87.
Ferrie JE, et al. Non-response to baseline, non-response to follow-up and mortality in the Whitehall II cohort. Int J Epidemiol. 2009;38:831–7.
Nilsen RM, et al. Self-selection and bias in a large prospective pregnancy cohort in Norway. Paediatr Perinat Epidemiol. 2009;23:597–608.
Nohr EA, Liew Z. How to investigate and adjust for selection bias in cohort studies. Acta Obs Gynecol Scand. 2018;97:407–16.
Biele G, et al. Bias from self selection and loss to follow-up in prospective cohort studies. Eur J Epidemiol. 2019;34:927–38.
Bonander C, Nilsson A, Björk J, Bergström GML, Strömberg U. Participation weighting based on sociodemographic register data improved external validity in a population-based cohort study. J Clin Epidemiol. 2019;108:54–63.
Batty GD, Gale CR, Kivimäki M, Deary IJ, Bell S. Comparison of risk factor associations in UK biobank against representative, general population based studies with conventional response rates: prospective cohort study and individual participant meta-analysis. BMJ. 2020;368:m131.
Canivet C, Nilsson A, Björk J, Maghaddassi M, Östergren PO. Assessment of selection bias due to dropouts in the followup of the Scania public health cohort. Scand J Pub Heal. 2020. https://journals.sagepub.com/doi/full/10.1177/1403494820919544.
Hernan MA, Hernández-Diaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15:615–25.
Richiardi L, et al. Baseline selection on a collider: a ubiquitous mechanism occurring in both representative and selected cohort studies. J Epidemiol Community Heal. 2019;73:475–80.
Cole SR, Stuart E. Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. Am J Epidemiol. 2010;172:107–15.
Lesko CR, et al. Generalizing study results: a potential outcomes perspective. Epidemiology. 2017;28:553–61.
Buchanan AL, et al. Generalizing evidence from randomized trials using inverse probability of sampling weights. J R Stat Soc Ser A Stat Soc. 2018;181:1193–209.
Westreich D, Edwards JK, Lesko CR, Cole SR. Target validity and the hierarchy of study designs. Am J Epidemiol. 2019;188:438–43.
Galea S, Tracy M. Participation rates in epidemiologic studies. Ann Epidemiol. 2007;9:643–53.
Lynch JW, Kaplan GA, Salonen JT. Why do poor people behave poorly? Variation in adult health behaviours and psychosocial characteristics by stages of the socioeconomic lifecourse. Soc Sci Med. 1997;44:809–19.
Adler NE, Ostrove JM. Socioeconomic status and health: what we know and what we don’t. Ann N Y Acad Sci. 1999;896:3–15.
Marmot M. Multilevel approaches to understanding social determinants. In: Berkman L, Kawachi I, editors. Social Epidemiology. Oxford: Oxford University Press; 1999. p. 349–67.
Berglund G, Elmstähl S, Janzon L, Larsson SA. The Malmo diet and Cancer study. Design and feasibility. J Intern Med. 1993;233:45–51.
Manjer J, et al. The Malmo diet and cancer study: representativity, cancer incidence and mortality in participants and non-participants. Eur J Cancer Prev. 2001;10:489–99.
Westreich D, Edwards JK, Lesko CR, et al. Transportability of trial results using inverse odds of sampling weights. Am J Epidemiol. 2017;186:1010–4.
Bonander C, Nilsson A, Bergström GML, Björk J, Strömberg U. Correcting for selective participation in cohort studies using auxiliary register data without identification of non-participants. Scand J Pub Heal. 2020. https://pubmed.ncbi.nlm.nih.gov/31826719/.
Svensson-Färbom P, et al. Cystatin C identifies cardiovascular risk better than creatinine-based estimates of glomerular filtration in middle-aged individuals without a history of cardiovascular disease. J Intern Med. 2013;275:506–21.
Royston P, Sauerbrei W. A new measure of prognostic separation in survival data. Stat Med. 2004;23:723–48.
D’Agostino R, et al. General cardiovascular risk profile for use in primary care: the Framingham heart study. Circulation. 2008;117:743–53.
Stuart E, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. J R Stat Soc Ser A Stat Soc. 2011;174:369–86.
The authors would like to thank Anders Dahlin for providing the data from MDC as well as information about the cohort and variables.
This work was supported by Forskningsrådet för hälsa, arbetsliv och välfärd (FORTE) [grant number 2017–00414 to U.S.]. The funder has played no role in the design of the study or in the collection, analysis, or interpretation of the data. Open Access funding provided by Lund University.
Ethics approval and consent to participate
The project has been approved by the Regional Ethics Review Board in Lund (Dnr: 2017/846). Two data sources were used: 1. Data from the MDC cohort and 2. National register data. Written informed consent has been obtained from all participants in the MDC cohort. Consent was however not obtained from individuals only appearing in the national register data. As stated by the Swedish law of Research Ethics, national register data can be used for research purposes without informed consent, given that approval has been obtained from an ethics board. The Regional Ethics Review Board in Lund granted us approval to use national register data without informing individuals appearing in the data and without their consent under the condition that information about the project was published at www.lupop.lu.se and that individuals were given the opportunity to be removed from the study by contacting co-author J.B. The request by the ethics board was obeyed.
Consent for publication
Table A1 Incidences only including hospitalizations, not deaths (events per 10,000 person-years), in the background population and in the participant sample, before and after reweighting; the follow-up begins at baseline in 1991–1996 and ends at an event, death, emigration, or at the latest at the end of 2016.
Table A2 Associations between background characteristics and CVD mortality: results from multivariable Cox regression models; the follow-up begins at baseline in 1991–1996 and ends at an event, death, emigration, or at the latest at the end of 2016.
Table A3 Associations between background characteristics and cancer mortality: results from multivariable Cox regression models; the follow-up begins at baseline in 1991–1996 and ends at an event, death, emigration, or at the latest at the end of 2016.
Table A4 Associations between background characteristics and CVD incidence: results from multivariable Cox regression models; the follow-up begins at baseline in 1991–1996 and ends at an event, death, emigration, or at the latest at the end of 2016.
Table A5 Associations between background characteristics and cancer incidence: results from multivariable Cox regression models; the follow-up begins at baseline in 1991–1996 and ends at an event, death, emigration, or at the latest at the end of 2016.
Cite this article
Nilsson, A., Bonander, C., Strömberg, U. et al. Can the validity of a cohort be improved by reweighting based on register data? Evidence from the Swedish MDC study. BMC Public Health 20, 1918 (2020). https://doi.org/10.1186/s12889-020-10004-z