Data analysis
We reweighted the sample with a method in line with previous literature on IPPW [7, 12, 13, 14]. A special feature of our data, however, was that it came in two separate sets, one for the participants and one for the full background population, with no linkage or indicator of who in the background population was a participant. As a result, the standard IPPW method could not be applied. Instead, we used an approach similar to that applied in studies of transportability, where the aim is normally to produce estimates that are valid for an entirely different population than the background population [12, 22]. As we have described in a previous article, this approach can also be used to achieve generalizability in situations like ours, where data on participants and the background population come in separate datasets [23].
As a first step, we combined the participant sample and the background population datasets into one dataset, where each participant thus appeared twice (but could only be identified as a participant once). A binary variable was created to indicate membership in the participant sample, and a logistic regression was then applied to predict this membership based on the background characteristics, including socio-demographics and disease history. As we have shown, predicted odds for belonging to the participant sample could then be interpreted as predicted probabilities of actually being a participant [23]. Sampling weights were calculated for the participants as the multiplicative inverses of the predicted probabilities of actually being a participant.
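The stacking step and the interpretation of odds as participation probabilities can be illustrated with a minimal sketch. It assumes a single categorical background covariate, in which case a saturated logistic regression reduces to stratum-wise proportions; the function name `stacked_weights` and the toy counts are hypothetical, and the actual analysis used a multivariable logistic regression.

```python
from collections import Counter

def stacked_weights(participant_x, background_x):
    """Return {stratum: (odds, weight)} estimated from the stacked data.

    participant_x: covariate value for each record in the participant sample.
    background_x: covariate value for each record in the background
    population (participants are included here too, but unidentified).
    """
    n_part = Counter(participant_x)
    n_back = Counter(background_x)
    result = {}
    for stratum, n_p in n_part.items():
        n_b = n_back[stratum]  # assumed > 0: every participant is in the background
        # Fitted probability of the sample indicator in the stacked dataset
        p_stacked = n_p / (n_p + n_b)
        # The odds simplify to n_p / n_b, the participation share in the stratum
        odds = p_stacked / (1 - p_stacked)
        result[stratum] = (odds, 1 / odds)
    return result

# Toy data: 60 of 100 highly educated and 20 of 100 less educated participate
participants = ["high_edu"] * 60 + ["low_edu"] * 20
background = ["high_edu"] * 100 + ["low_edu"] * 100
weights = stacked_weights(participants, background)
```

In this toy example the odds recover the stratum-specific participation shares (0.6 and 0.2), and the resulting weights scale each participant stratum back up to its size in the background population.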
The distributions of background characteristics in the background population were compared with those in the participant sample and with those in the reweighted participant sample. (In principle, background characteristics in the reweighted participant sample should resemble those in the background population, as the reweighting was based on exactly these characteristics.) Subsequent mortality and incidence of hospitalization were then compared across the background population, participants, and reweighted participants. Additional analyses were stratified on quintiles of the estimated participation probabilities, allowing us to examine whether differences between participants and individuals in the background population were concentrated, for example, among those with a low propensity to participate. Furthermore, we used Cox regressions to estimate associations between outcomes and background variables, and to compare these across the background population, participants, and reweighted participants.
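The comparison of reweighted distributions and the quintile stratification could be sketched as follows. This is an illustration under simplifying assumptions, not the code used in the study: the helper names and toy data are hypothetical, and ties in the estimated probabilities are broken by input order.

```python
def weighted_prevalence(indicator, weights):
    """Weighted share of records with indicator == 1."""
    total = sum(weights)
    return sum(i * w for i, w in zip(indicator, weights)) / total

def quintile_labels(probabilities):
    """Assign each record a quintile (1-5) by rank of its estimated probability."""
    n = len(probabilities)
    order = sorted(range(n), key=lambda i: probabilities[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * 5 // n + 1
    return labels

# Toy participant sample: 60 highly educated (participation probability 0.6)
# and 20 less educated (participation probability 0.2)
low_edu = [0] * 60 + [1] * 20
probs = [0.6] * 60 + [0.2] * 20
ipw = [1 / p for p in probs]

unweighted = sum(low_edu) / len(low_edu)       # share in the raw sample
reweighted = weighted_prevalence(low_edu, ipw)  # share after reweighting
quintiles = quintile_labels([0.1, 0.9, 0.5, 0.3, 0.7])
```

Here the unweighted share of less educated participants is 0.25, whereas the reweighted share is 0.5, which is the share that generated the toy participation probabilities.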
To evaluate the ability of the logistic regression model to predict participation based on the background variables, the area under the ROC curve was calculated. We also visually examined the estimated participation probabilities, separately for participants and non-participants. In these two analyses, duplicates of participants were removed by sorting the data according to the predicted probabilities of participating, and for each predicted probability omitting the same number of individuals from the background population as the number of individuals in the participant sample. Analyses were performed using Stata 15.1 (StataCorp) and SPSS Statistics 25 (IBM).
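The removal of duplicates before computing the area under the ROC curve might look like the sketch below. It assumes that a participant and their unidentified duplicate in the background population receive exactly the same predicted probability (as they share covariate values), and it uses a simple pairwise AUC that would be too slow for data of the size analyzed here; function names and toy values are hypothetical.

```python
from collections import Counter

def drop_duplicate_participants(part_probs, back_probs):
    """Remove one background record per participant with the same predicted
    probability; the remaining records are treated as non-participants."""
    to_drop = Counter(part_probs)
    non_participants = []
    for p in sorted(back_probs):
        if to_drop[p] > 0:
            to_drop[p] -= 1  # treat this record as a participant's duplicate
        else:
            non_participants.append(p)
    return non_participants

def auc(pos, neg):
    """Probability that a random participant has a higher predicted
    probability than a random non-participant (ties count as 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

part_probs = [0.2, 0.6, 0.6]
back_probs = [0.2, 0.2, 0.2, 0.6, 0.6, 0.6, 0.4]
non_part = drop_duplicate_participants(part_probs, back_probs)
area = auc(part_probs, non_part)
```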
Data
Our full background population consisted of all men (born 1923–1945) and women (born 1923–1950) who lived in Malmö between January 1, 1991, and September 30, 1996. This population, comprising 74,103 individuals, essentially corresponded to those who were invited to participate in the MDC study. (In practice, some were never invited because of death, migration, or other issues.) The participant sample comprised 28,098 individuals. All participants in MDC provided written informed consent at enrollment.
Data on socio-demographics were retrieved from Statistics Sweden and spanned the years 1990 to 2016. These data included year of birth, sex, civil status, country of birth (grouped), migration events, and an array of socioeconomic information, such as the highest level of education and income from different sources. Moreover, we retrieved data from the National Board of Health and Welfare, including the Patient Register, covering all hospitalizations and associated diagnoses from 1987 to 2016, and the Cause of Death Register, from which we obtained data on deaths and causes of deaths between 1990 and 2016. The reason for retrieving hospitalizations specifically from 1987 was that the Patient Register reached national coverage in this year, and we wished to account for hospitalizations during at least a few years prior to baseline.
For participants in MDC, we made use of background data on socio-demographics in the year prior to enrollment (or in the same year if no data was available in the previous year, which could occur if the individual had lived abroad). Hospitalizations were divided into groups based on the International Classification of Diseases (ICD), version 9 or 10: neoplasms (ICD codes 140–239/C00-D48), diabetes (ICD codes 250/E10-E14), mental and behavioral disorders (ICD codes 290–319/F00-F99), diseases of the circulatory system (ICD codes 390–459/I00-I99), diseases of the respiratory system (ICD codes 460–519/J00-J99), and diseases of the digestive system (ICD codes 520–579/K00-K93). Binary indicators were created to measure if the individual had had at least one hospitalization for these types of diagnoses between 1987 and enrollment.
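As an illustration of how the diagnosis groups and binary indicators could be derived, the sketch below maps three-character ICD prefixes to the groups listed above. It assumes that each hospitalization record carries a date, a code string, and a known ICD version, and that hospitalizations on the enrollment date itself are not counted; these are assumptions for the example, and the function names are hypothetical.

```python
from datetime import date

ICD9_RANGES = {
    "neoplasms": (140, 239), "diabetes": (250, 250), "mental": (290, 319),
    "circulatory": (390, 459), "respiratory": (460, 519), "digestive": (520, 579),
}
ICD10_RANGES = {
    "neoplasms": ("C00", "D48"), "diabetes": ("E10", "E14"), "mental": ("F00", "F99"),
    "circulatory": ("I00", "I99"), "respiratory": ("J00", "J99"), "digestive": ("K00", "K93"),
}

def icd_group(code, version):
    """Return the diagnosis group for an ICD code, or None if outside all groups."""
    prefix = code.strip().upper()[:3]
    if len(prefix) < 3:
        return None
    if version == 9 and prefix.isdigit():
        for group, (lo, hi) in ICD9_RANGES.items():
            if lo <= int(prefix) <= hi:
                return group
    elif version == 10:
        # Three-character ICD-10 prefixes (letter + two digits) sort lexicographically
        for group, (lo, hi) in ICD10_RANGES.items():
            if lo <= prefix <= hi:
                return group
    return None

def prior_hospitalization_indicators(records, enrollment, start=date(1987, 1, 1)):
    """records: iterable of (admission_date, icd_code, icd_version)."""
    indicators = {group: 0 for group in ICD9_RANGES}
    for admitted, code, version in records:
        # Assumption: only admissions strictly before enrollment are counted
        if start <= admitted < enrollment:
            group = icd_group(code, version)
            if group is not None:
                indicators[group] = 1
    return indicators

records = [(date(1990, 3, 1), "250", 9), (date(1998, 5, 1), "I21", 10)]
flags = prior_hospitalization_indicators(records, date(1993, 7, 1))
```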
There was no information on when individuals in the background population were invited to participate in the MDC study. We therefore assigned “imaginary” dates of enrollment to individuals in the background population, where the calendar date was always set to July 1 and the enrollment year was drawn from the birth-year-specific distribution of enrollment years observed among participants. Individuals in the background population who had moved, died, or for other reasons lacked information on sociodemographic variables around the time of imaginary enrollment were excluded, reducing the background population to 71,447 individuals. Among the 28,098 participants, two were excluded because there was no information on sociodemographic variables around the time of enrollment.
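The assignment of imaginary enrollment dates can be sketched as follows. Drawing uniformly from the list of enrollment years observed among participants born in a given year reproduces the birth-year-specific empirical distribution. The function name, the fixed seed, and the handling of birth years without any participants (returned as `None`) are assumptions made for this example.

```python
import random
from collections import defaultdict
from datetime import date

def assign_imaginary_enrollment(participants, background_birth_years, seed=0):
    """participants: list of (birth_year, enrollment_year) pairs.

    Returns one imaginary enrollment date (always July 1) per background
    record, or None if no participant shares that birth year.
    """
    rng = random.Random(seed)
    years_by_birth = defaultdict(list)
    for birth_year, enroll_year in participants:
        years_by_birth[birth_year].append(enroll_year)

    dates = []
    for birth_year in background_birth_years:
        observed = years_by_birth.get(birth_year)
        if not observed:
            dates.append(None)
            continue
        # A uniform draw from the observed list follows the empirical distribution
        dates.append(date(rng.choice(observed), 7, 1))
    return dates

participants = [(1930, 1992), (1930, 1992), (1930, 1994), (1940, 1995)]
imaginary = assign_imaginary_enrollment(participants, [1930, 1930, 1940, 1925])
```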
Outcomes examined included mortality and incidence of disease. We considered all-cause mortality, but also the two most common causes of death: deaths due to cardiovascular disease (CVD) and cancer. Furthermore, we considered incidences of CVD and cancer. Following previous studies on MDC [24], CVD mortality was defined by ICD codes 390–459/I00–99 whereas incident CVD was defined as the occurrence of either a coronary event (a fatal or nonfatal myocardial infarction, 410/I21, or a death due to ischemic heart disease, 414/414/I22/I23/I25) or a fatal or nonfatal stroke (430/431/434/436/I60/I61/I63/I64), whichever came first. Incident cancer was conventionally defined by ICD codes 140–209/C00–99.
Individuals who had not yet experienced the particular outcome of interest were followed from the time of baseline examination (or imaginary enrollment) in MDC and contributed person-time until the first event of interest occurred, or until death or emigration, but no later than the end of 2016.
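The follow-up rule can be expressed in a short sketch. It assumes date-level information for each individual, measures person-time in years of 365.25 days, and treats December 31, 2016, as the administrative end of follow-up; the function name and these conventions are illustrative assumptions rather than details taken from the analysis.

```python
from datetime import date

STUDY_END = date(2016, 12, 31)  # assumed administrative end of follow-up

def follow_up(baseline, event=None, death=None, emigration=None, end=STUDY_END):
    """Return (person_years, had_event), or None if not at risk at baseline."""
    if event is not None and event <= baseline:
        return None  # outcome already experienced: excluded from this analysis
    # Follow-up stops at the earliest of event, death, emigration, or study end
    exit_date = min(d for d in (event, death, emigration, end) if d is not None)
    # The event counts only if it is what ended follow-up
    had_event = event is not None and event == exit_date
    person_years = (exit_date - baseline).days / 365.25
    return person_years, had_event

with_event = follow_up(date(1993, 7, 1), event=date(2000, 7, 1))
censored = follow_up(date(1993, 7, 1), death=date(1995, 7, 1))
prevalent = follow_up(date(1993, 7, 1), event=date(1990, 1, 1))
```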