Accurate and reliable PA assessment is essential for epidemiologists, exercise scientists, clinicians, and behavioral researchers. Recently, objective measures of PA, such as accelerometers, have dominated the public health literature. Objective devices may assess current behavior well; however, they cannot provide information about PA from the distant past. To understand the influence of historical PA on chronic disease risk, it is important to know whether PA can be reasonably recalled over the long term and to assess demographic, social, and behavioral factors that may affect recall.
Results of our comparison of PA recalled over 15 years to original reports varied by both the type of survey question and the type of information obtained. Respondents were able to reproduce categorical ratings of overall PA level (e.g., ‘very active’) reasonably well, particularly for a well-defined and significant time such as during HS (percent agreement >50% and Kappa = 0.43). As displayed in Table 2, percent agreement and Kappa were more modest for periods that were likely to be less memorable, such as before HS or the year before entry into the CARDIA study. It is also possible that HS activity was recalled well because it is a period that may be recalled and reported more often than less well-defined periods. Reports of whether respondents participated in specific activities also showed reasonable agreement (see Table 3), particularly for vigorous-intensity activities (percent agreement 64–79% and Kappa 0.28–0.48). However, in some cases, high agreement may be due to the low participation rate for these activities (e.g., bowling/golf), so that one agreement category (No/No) is very large. As a result, it is not possible to determine which specific activities can be recalled best. When more quantification was involved, such as estimating the number of participation months for an activity, agreement was reduced by two factors: a propensity to exclude activities in the Historical report, and response clustering at 0 and 12 months of participation, as shown in Figures 1 and 2. It appeared that long-term recall led to a loss of time resolution, with a tendency toward all-or-none estimation, and perhaps some “splitting the difference” by estimating a value of 6 months.
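The point about a large No/No category can be illustrated with a minimal sketch. The counts below are hypothetical and are not CARDIA data; the function name `agreement_and_kappa` is likewise an illustrative choice. The sketch computes percent agreement and Cohen's kappa for two 2x2 tables of paired yes/no reports with identical percent agreement, one for a rarely reported activity and one for a commonly reported activity.

```python
def agreement_and_kappa(yes_yes, yes_no, no_yes, no_no):
    """Return (percent agreement, Cohen's kappa) for paired yes/no reports.

    Arguments are cell counts of a 2x2 table: first report x second report.
    """
    n = yes_yes + yes_no + no_yes + no_no
    observed = (yes_yes + no_no) / n

    # Agreement expected by chance, from the marginal "yes" proportions.
    p_yes_first = (yes_yes + yes_no) / n
    p_yes_second = (yes_yes + no_yes) / n
    expected = (p_yes_first * p_yes_second
                + (1 - p_yes_first) * (1 - p_yes_second))

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts for illustration only (not CARDIA data).
rare = agreement_and_kappa(yes_yes=5, yes_no=10, no_yes=10, no_no=75)
common = agreement_and_kappa(yes_yes=40, yes_no=10, no_yes=10, no_no=40)

print(f"rare activity:   agreement={rare[0]:.2f}, kappa={rare[1]:.2f}")
print(f"common activity: agreement={common[0]:.2f}, kappa={common[1]:.2f}")
```

In this hypothetical case both tables show 80% agreement, yet kappa is about 0.22 for the rare activity and 0.60 for the common one, because most of the agreement in the first table is the No/No cell that chance alone would largely produce.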
Overall, our data showed that PA recall consistency over fifteen years among young and middle-aged adults was modest, but comparable to studies of similar and longer duration [14–19]. However, an important observation is that even when overall agreement was reasonably good for long-term recall studies of this type (such as the correlation of 0.50 for the vigorous activity score), error at the individual level was quite large. As shown in Figure 3, for a vigorous-intensity score of 500 at Baseline, the Historical scores ranged from 0 to 1500. This substantial error at the individual level will likely reduce the researcher’s ability to detect relationships between historical activity and outcomes within individuals. However, on the positive side, in contrast to Falkner et al., historical recall in this study appeared to reflect actual recall, rather than current activity. With the exception of the Moderate Activity Score, for which agreement of Historical scores with Current score was similar to agreement with Baseline score, Historical reports were more similar to Baseline than they were to Current reports of PA.
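As a rough illustration of why a correlation of 0.50 still leaves large individual-level error, consider a simple linear relationship between the two reports. This is an assumption made here for illustration; the activity scores themselves are likely skewed, so the numbers are indicative only. Under that assumption, the share of variance in one report explained by the other is r squared, and the spread of individual scores around the predicted value is sqrt(1 - r^2) times the overall standard deviation. The function names below are hypothetical.

```python
import math


def variance_explained(r):
    """Share of variance in one report accounted for by the other."""
    return r ** 2


def residual_sd_fraction(r):
    """Residual SD around the regression line, as a fraction of the overall SD."""
    return math.sqrt(1.0 - r ** 2)


r = 0.50  # correlation reported for the vigorous activity score
print(variance_explained(r))              # 0.25
print(round(residual_sd_fraction(r), 3))  # 0.866
```

Under these assumptions, a correlation of 0.50 accounts for only a quarter of the variance, and the scatter of individual recalled scores around the predicted value is still about 87% of the overall spread.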
The large CARDIA cohort allowed examination of demographic and other predictors of reporting discrepancy. Generally, the Historical score was higher than Baseline, particularly for vigorous-intensity scores. These results indicate over-reporting of recalled physical activity and are consistent with previous studies [16, 18]. Demographic characteristics including BMI, race, and education were significantly associated with discrepancies in recall. A race-by-education interaction reflected the unexpected finding that over-reporting increased with higher levels of education among black participants. Our results also indicated an interaction of sex and race with the Baseline activity level that highlighted over-reporting by men, particularly black men. These results indicate that demographic factors need to be taken into consideration when pursuing studies of physical activity recall, and specific examination by subgroups should be considered. However, in our study, the recall discrepancy was also a function of Baseline activity scores, with the most active participants at Baseline likely to produce Historical activity scores lower than Baseline. Because true activity at the recalled time period is usually not available, it may be difficult to account for this source of error in studies that use recalled activity.
Overall, the current study adds to a growing body of research on long-term PA recall. These studies are important, as researchers increasingly use historical and lifetime measures to examine exposure to PA over the life course in relation to health outcomes [5, 6, 8, 16]. A major limitation of methodological studies of long-term PA recall instruments is that validity can rarely be established due to the lack of criterion measures at the period(s) being reported. That leaves reliability or reproducibility of reports as the primary indicator of instrument quality. Reliability has generally not been examined relative to the actual period of interest. Many studies look at reproducibility of lifetime reports over 3–10 week periods [4, 8, 21], or up to one year. These studies show that the reproducibility of self-reported lifetime activity recalled over short periods ranges from r = 0.53 to 0.85. Studies of specific activities recalled over longer periods (10–36 years) have shown weaker associations; correlations and kappa statistics range from 0.09 to 0.52 [14–19]. However, recall accuracy may have been affected by participants’ age, which varied from middle to older age.
This study has several notable strengths. Participants recalled activity over a long period of time (15 years) with the same instrument that was originally used. Other studies have used different instruments at two points in time, which has made interpretation and comparison of findings difficult. The CARDIA cohort provided a large and diverse sample that allowed examination of factors related to the difference between Historical and Baseline reports. The current study focused on early adulthood and included questions about activity during adolescence. The Historical recall in CARDIA therefore provides relevant data for studies of PA exposure in early life and later health outcomes.
There were also limitations to our study. We cannot determine whether PA recall reflected what participants were actually doing at the time of the Baseline exam, only whether it matched what they reported doing. There are also potential limitations of generalizability. This study relied upon a single questionnaire in a cohort of black and white young and middle-aged adults. Long-term reproducibility is, at least in part, a function of the reliability of the questionnaire. The CARDIA questionnaire has been shown to have good reliability over a two-week retest (r ~ 0.80). It is not clear how well our results will generalize to other ethnic groups, older age groups, or to studies that use different questionnaires. Additionally, we cannot estimate the effect of participant dropout from CARDIA between Years 0 and 15.
Nevertheless, our results provide an important step in understanding historical PA recall and have implications for future studies. For example, for investigators assembling retrospective cohort studies in which PA is used as an exposure variable, our data suggest that individuals do well at classifying their activity level with categorical questions, particularly for memorable life periods. Categorical responses can be important indicators; for example, a five-category single-item general health question has been shown to be related to health outcomes. Seeking more quantitative precision in Historical recalls, on the other hand, may not be productive; our data showed little precision in participants’ ability to recall the number of months in a year that they took part in a given activity. These results provide important information when considering the kinds of questions that may be reasonably asked. In future research, investigators may want to consider whether the additional participant burden of asking about details regarding specific components of activity, such as duration and frequency, adds predictive value.