Calibration of the Global Physical Activity Questionnaire to accelerometry-measured physical activity and sedentary behavior

Background: Self-report questionnaires are a valuable method of physical activity measurement in public health research; however, accuracy is often lacking. The purpose of this study is to improve the validity of the Global Physical Activity Questionnaire (GPAQ) by calibrating it to 7 days of accelerometer-measured physical activity and sedentary behavior.
Methods: Participants (n = 108) wore an ActiGraph GT9X Link on their non-dominant wrist for 7 days. Following the accelerometer wear period, participants completed the Global Physical Activity Questionnaire by telephone with a research assistant. Data were split into training and testing samples, and multivariable linear regression models were built using functions of the GPAQ self-report data to predict ActiGraph-measured physical activity and sedentary behavior. Models were evaluated with the testing sample and an independent validation sample (n = 120) using Mean Squared Prediction Errors.
Results: The prediction models utilized self-reported sedentary behavior and moderate- and vigorous-intensity physical activity scores from the questionnaire, and participant age. Transformations of each variable, as well as break point analysis, were considered. Prediction errors were reduced by 77.7–80.6% for sedentary behavior and 61.3–98.6% for physical activity by using the multivariable linear regression models over raw questionnaire scores.
Conclusions: This research demonstrates the utility of calibrating self-report questionnaire data to objective measures to improve estimates of physical activity and sedentary behavior. It provides an understanding of the divide between objective and subjective measures, and provides a means to utilize the two methods as a unified measure.


Background
Physical activity (PA) is important as an exposure and outcome variable in public health research. Although objective measurement of PA using accelerometry is widely used in public health research, its costs, complexities, and participant burden preclude its use in all studies. As such, self-report measures of PA, including questionnaires, remain a valuable method of PA measurement that can provide unique information about domains in addition to estimates of volume of PA. Furthermore, self-report measures of PA allow for comparison of longitudinal data where objective measures were not utilized in all measurement periods. When compared to accelerometry, the most commonly used self-report measures of PA, e.g., the International Physical Activity Questionnaire, the Global Physical Activity Questionnaire (GPAQ), and the Seven-Day Physical Activity Recall, show low to moderate validity of 0.17-0.41 [1][2][3][4]. Clearly, there is a need to improve the accuracy of PA self-reports. For these reasons, many public health research studies utilize both objective and self-report measures of PA, such as the National Health and Nutrition Examination Survey, the Women's Health Study, and the Iowa Bone Development Study [5][6][7].
The GPAQ was developed by the World Health Organization (WHO) for PA surveillance, including the assessment of PA trends over time [8]. The GPAQ is available in nine languages, and has been utilized as a part of the WHO STEPwise Approach to Non-Communicable Chronic Disease Risk Factor Surveillance in over 100 countries [9]. The GPAQ collects information about PA in three domains: work, travel, and recreation, as well as average time per day spent in sedentary behavior (i.e., sitting and reclining) [10]. It is scored in minutes per day (min/d) to provide meaningful behavioral units, rather than a unit-less questionnaire score that is difficult to interpret, e.g., the Godin-Shephard Leisure-Time Exercise Questionnaire [11]. The first version of the GPAQ was validated in multiple populations, including adults from Bangladesh, Brazil, China, Ethiopia, India, Indonesia, Japan, Portugal, South Africa, the United States, and Vietnam. The GPAQ had low to moderate validity for total PA when compared to pedometers (r = 0.31-0.39) [8,12,13] and accelerometers (r = 0.20-0.34) [14]. Validity appeared lower for vigorous-intensity PA (r = 0.23-0.26). However, the first version of the GPAQ showed good test-retest reliability (r = 0.67-0.81) [8,12]. This suggests that while the GPAQ may not be appropriate for cross-sectional studies due to its low to moderate validity, it may be a good tool for assessing changes in PA over time.
The GPAQ was revised to improve wording and shorten its length. Herrmann and colleagues examined the second version's validity by comparing it to PA measured using the ActiGraph accelerometer and the Yamax pedometer. In adults, the GPAQ had a low level of agreement with the ActiGraph for moderate- and vigorous-intensity PA (MVPA) (r = 0.26) and a moderate level of agreement with the Yamax pedometer for steps/day (r = 0.39). Associations were negative for sedentary behavior, with a non-significant level of agreement between the GPAQ and ActiGraph (r = −0.12). However, once again, the GPAQ showed good short-term reliability, with coefficients ranging from r = 0.83-0.96 [2]. Cleland and colleagues assessed the validity of the second version of the GPAQ in adults living in the United Kingdom, and reported a moderate level of agreement between the GPAQ and ActiGraph measured PA (r = 0.48). The GPAQ had a moderate but non-significant agreement with the ActiGraph for change over time in MVPA (r = 0.52) and a non-significant negative agreement for change over time in sedentary behavior (r = −0.02). These researchers also found a negative bias with the GPAQ, indicating that individuals who were more active were more likely to over-report PA [3].
Recently, Saint-Maurice and colleagues demonstrated a calibration strategy to improve interpretability of the Physical Activity Questionnaire for Children and Physical Activity Questionnaire for Adolescents. They simultaneously measured PA in children and adolescents with questionnaires and accelerometers, then used multivariable linear regression to calibrate the relationship between questionnaire scores and ActiGraph measured MVPA [15]. Welk and colleagues developed a similar method to adjust for error in the 24-Hour Physical Activity Recall (PAR) in adults, by simultaneously measuring PA with the 24-Hour PAR and the SenseWear Armband Mini [16]. The 24-Hour PAR queries contextual PA information like that of the GPAQ (e.g., domain); however, it takes on average 20 min to complete, and inquires about the previous 24 h. These studies showed the feasibility of calibrating self-reported PA data to objectively measured PA, and the ability to improve their accuracy and usability. The current study used a similar strategy to provide a unique way to calibrate and compare past and present self-reported PA measures with objective measures, which are the current standard for PA measurement. It also allowed for better understanding of the direction and level of disconnect between self-reported and objectively measured PA. The purpose of this study was to improve the validity of the GPAQ for measuring PA and sedentary behavior by leveraging all information from the self-reported data in a prediction model trained on accelerometer measures in a random sample of mainly white and Latino adults, and to further test this calibration in an independent cohort of white, African American, and Latino adults.

Participants
Data for this study were collected between November 2015 and May 2016 as a part of baseline data collection for Active Ottumwa (AO). AO is a community-based participatory research project designed to increase the PA of all residents of Ottumwa, Iowa through the activation of Lay Health Advisors as Physical Activity Leaders. Ottumwa is a city in southeast Iowa with a population around 25,000, of which 12% are Hispanic or Latino. Ottumwa is largely a processing and manufacturing community [17].
Participants of AO are a random sample of Ottumwa residents (n = 139), who were recruited via random digit dialing. The study recruited additional Latino residents via Respondent Driven Sampling [18]. Eligibility requirements included (1) being at least 18 years of age; (2) having lived in Ottumwa for at least 6 months; (3) planning to stay in Ottumwa for at least 2 years; and (4) passing physical functioning screening questions modified from NHANES [19].

Procedures
Individuals who were interested in participation attended a baseline data collection visit at the AO office with a bilingual (English-Spanish) data collector. Participants completed a health survey, and anthropometric measurements. Upon completion of the office visit, participants were fitted with an ActiGraph GT9X Link (ActiGraph LLC, Pensacola, FL) accelerometer, which served as the criterion measure of PA. The ActiGraph GT9X Link is a small (3.5 × 3.5 × 1 cm), lightweight (14 g), and waterproof tri-axial accelerometer. Its primary accelerometer has a dynamic range of ± 8 gravitational units, and was initialized to collect data at a sampling rate of 80 Hz. The accelerometer was attached to the participants' non-dominant wrist using a hospital band, and was worn continuously for 7 days, including at night. Participants were instructed to remove the accelerometer 8 days following their office visit, and return it via pre-paid mail.
On the accelerometer removal day, a research assistant contacted the participants and administered the GPAQ over the phone. The GPAQ asked participants to recall their PA and sedentary behavior over the past 7 days, to coincide with the time the accelerometer was worn. The GPAQ was designed to fit the specific population being assessed through the use of tailored examples. Within each PA domain, participants were asked to report time spent in moderate-intensity and vigorous-intensity PA. Participants were also queried about sedentary behavior by reporting time spent sitting or reclining. The GPAQ was scored by calculating the average time per day spent in each activity domain and intensity.

Independent sample validation
The GPAQ calibration models were tested on an independent sample representing a dissimilar population. The validation sample data were collected between December 2014 and September 2016 as a part of baseline data for the Living Well Together (LWT) study in Polk County, Iowa. Participants of the LWT study were low income, urban, obese adults (BMI ≥ 30 kg/m²) who were the primary caregiver of a child six to 12 years old, and were drawn from an overall study population that was 29% African American, 25% Latino, and 41% white.
Similar data collection procedures were followed for the LWT study, as for AO. Participants completed baseline data collection and were fitted with an ActiGraph GT9X Link accelerometer on their non-dominant wrist using a hospital band. Participants wore the accelerometer for seven consecutive days, including nights, and were asked to remove it on the eighth day following their baseline data collection visit. A research team member contacted the participants via telephone on the eighth day to complete the GPAQ, where they were queried about their PA over the preceding 7 days.

Data processing
For both samples, upon receipt of the accelerometer by the research team, data were downloaded and screened for wear-time and monitor malfunction. Participant sleep time was manually extracted using the Physical Activity and Health Outcomes Laboratory sleep algorithm (University of Iowa, Iowa City, IA; details available from corresponding author). Sedentary behavior (min/d) was extracted using the Staudenmayer decision tree algorithm [20] programmed into R software (version 3.3.1). Moderate- and vigorous-intensity PA (min/d) were calculated using the Hildebrand algorithm [21] in a Visual Basic analysis program developed at the University of Iowa (R. Paulos, University of Iowa, Iowa City, IA). Accelerometry data were matched with the corresponding GPAQ data for analysis. Participants were excluded if they had fewer than four valid days of accelerometer data collection. A valid day was defined as having at least 10 h of awake data collection.
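The wear-time inclusion rule above can be illustrated with a minimal sketch. It assumes that awake wear time has already been summarized as minutes per day for each participant; the function names, participant identifiers, and example values are hypothetical and are not taken from the study's processing software.

```python
# Thresholds taken from the text; everything else is illustrative.
MIN_AWAKE_MINUTES = 10 * 60  # a valid day needs at least 10 h of awake wear
MIN_VALID_DAYS = 4           # participants need at least four valid days


def valid_days(awake_minutes_per_day):
    """Return the awake-wear minutes of the days that meet the 10 h threshold."""
    return [m for m in awake_minutes_per_day if m >= MIN_AWAKE_MINUTES]


def screen_participants(wear_log):
    """Keep participants with enough valid days.

    wear_log maps a participant id to a list of awake wear minutes, one per day.
    The result maps each retained participant to their valid days only.
    """
    return {
        pid: valid_days(days)
        for pid, days in wear_log.items()
        if len(valid_days(days)) >= MIN_VALID_DAYS
    }


# Made-up example data, not study data.
example_log = {
    "p01": [980, 1010, 450, 995, 1002, 940, 300],  # five valid days: retained
    "p02": [610, 580, 590, 620, 400, 630, 500],    # three valid days: excluded
}
kept = screen_participants(example_log)
```

Daily outcome averages (min/d) would then be computed over the retained valid days only.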

Statistical analyses
The analytical goal was to take the information included in the self-reported GPAQ and construct prediction models using each of the outcomes from the accelerometry data as the response variable. To this end, four multiple linear regression models were trained corresponding to sedentary behavior, moderate-intensity PA, vigorous-intensity PA, and MVPA (min/d) as measured by ActiGraph. Covariates for each of the four models included gender, age, and self-reported min/d of sedentary behavior, moderate-intensity PA, vigorous-intensity PA, and the sum of moderate- and vigorous-intensity PA corresponding to the GPAQ scores. Due to the lack of a clear linear relationship between these variables and the ActiGraph data (correlations between ActiGraph and corresponding GPAQ values ranged from 0.04 to 0.19), several transformations of each variable were considered, namely log, square root, linear, and quadratic relationships. Importantly, our purpose here is not to estimate and make inference on the relationship between the GPAQ self-report outcomes and the corresponding outcomes as measured by accelerometry, but rather to take the self-report outcomes and some easy-to-measure demographic variables and make accurate predictions of the outcomes as would be measured by accelerometry.
In addition to the four covariate transformations mentioned above, break point analysis was also considered. Break point analysis entails estimating from the data a change point in the covariate space where the linear relationship between the response and the covariate changes [22]. That is, at a break point the intercept and the slope for the covariate under consideration change. This method works by adding a binary indicator covariate which equals one if the variable under consideration is greater than its break point and zero otherwise; the interaction of this binary indicator with the variable itself is also considered. For each continuous covariate and for each of the four response variables, a bivariate break point analysis was performed and the ensuing non-linear relationship was considered to be the fifth possible covariate transformation.
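As an illustration of the indicator-plus-interaction formulation described above, the following is a minimal sketch of a single break point fit. It uses a plain grid search over candidate break points that minimizes the residual sum of squares; this is an assumption made for illustration and is not claimed to be the estimation procedure used by the strucchange package. The function names, the min_side parameter, and the synthetic data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, rss)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    a = mean_y - b * mean_x
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss


def fit_breakpoint(xs, ys, min_side=5):
    """Grid-search one break point c by minimizing the residual sum of squares.

    The model is y = b0 + b1*x + d0*I(x > c) + d1*I(x > c)*x. Because the
    indicator and its interaction free both intercept and slope above c,
    this is equivalent to fitting a separate line on each side of c.
    """
    best = None
    for c in sorted(set(xs)):
        low = [(x, y) for x, y in zip(xs, ys) if x <= c]
        high = [(x, y) for x, y in zip(xs, ys) if x > c]
        if len(low) < min_side or len(high) < min_side:
            continue  # keep enough observations on each side of the break
        a_lo, b_lo, rss_lo = fit_line(*zip(*low))
        a_hi, b_hi, rss_hi = fit_line(*zip(*high))
        total = rss_lo + rss_hi
        if best is None or total < best["rss"]:
            best = {
                "breakpoint": c,
                "intercept": a_lo,
                "slope": b_lo,
                "indicator_coef": a_hi - a_lo,    # shift in intercept above c
                "interaction_coef": b_hi - b_lo,  # change in slope above c
                "rss": total,
            }
    return best


# Synthetic, noise-free illustration (not study data): the line changes at x = 60.
x = [float(v) for v in range(101)]
y = [10 + 0.5 * v if v <= 60 else 80 - 0.2 * v for v in x]
fit = fit_breakpoint(x, y)
```

In this sketch the indicator coefficient captures the change in intercept and the interaction coefficient captures the change in slope above the estimated break point, mirroring the two terms described in the text.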
AO data were split into training (80%) and testing (20%) samples. The training sample was used to create the multiple linear regression models. Akaike Information Criterion (AIC) was used to perform model selection on the models created based on the training sample. Specifically, for each of the four responses, an exhaustive search over the 2 × 6⁵ (= 15,552) candidate models was performed, and for each model fit the AIC was computed, thus giving a measure of predictive ability for each of the 15,552 models for each of the four outcomes [23]. The model with the best AIC on the training data was selected for each outcome and subsequently evaluated on the out-of-sample data. Mean Squared Prediction Errors (MSPE) were calculated for the AO testing and LWT independent validation samples. Percent reductions in MSPE were evaluated to measure error relative to the raw GPAQ estimates of sedentary behavior and PA.
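The size of the candidate set can be reproduced with a short sketch. It rests on one reading of the text, stated here as an assumption: gender is either included or excluded (2 options), and each of the five continuous covariates is either excluded or entered under one of the five transformations (6 options each). The covariate labels are hypothetical, and the AIC helper uses the standard Gaussian linear model form with the error variance counted as a parameter; the study's own R code is not shown in the source.

```python
import math
from itertools import product

# Hypothetical labels for the five continuous covariates.
CONTINUOUS = ["age", "gpaq_sedentary", "gpaq_moderate", "gpaq_vigorous", "gpaq_mvpa"]
# Assumed options per covariate: left out, or one of the five transformations.
FORMS = ["excluded", "log", "sqrt", "linear", "quadratic", "breakpoint"]
GENDER = [False, True]  # gender left out or included


def candidate_models():
    """Yield (include_gender, {covariate: form}) for every candidate specification."""
    for include_gender, forms in product(GENDER, product(FORMS, repeat=len(CONTINUOUS))):
        yield include_gender, dict(zip(CONTINUOUS, forms))


def gaussian_aic(rss, n, n_coef):
    """AIC of a Gaussian linear model computed from its residual sum of squares.

    n is the number of observations and n_coef the number of regression
    coefficients; the error variance counts as one additional parameter.
    """
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * log_lik + 2 * (n_coef + 1)


n_models = sum(1 for _ in candidate_models())  # 2 * 6**5 specifications

# Toy comparison with made-up numbers: a small drop in RSS does not
# offset the penalty for three extra coefficients, so the simpler fit wins.
aic_simple = gaussian_aic(rss=5000.0, n=86, n_coef=3)
aic_complex = gaussian_aic(rss=4950.0, n=86, n_coef=6)
```

In a full search, each specification would be fit to the training sample, its AIC computed, and the specification with the lowest AIC retained for each outcome.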
In addition to the analysis described above, for comparative purposes we implemented a simple outcome-matching model only utilizing gender, age, and the matching GPAQ outcome, e.g., using gender, age, and self-reported sedentary behavior to predict ActiGraph measured sedentary behavior. All analyses were performed using R software (version 3.3.1). Break point analyses were performed using the strucchange package [24].
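The error comparison used throughout the evaluation can be sketched as follows. This is a minimal illustration assuming that percent reduction is computed as the relative decrease in MSPE when moving from raw GPAQ values to model predictions; the function names and numbers are hypothetical and are not study data.

```python
def mspe(observed, predicted):
    """Mean squared prediction error between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def percent_reduction(mspe_raw, mspe_model):
    """Percent reduction in MSPE achieved by the model relative to raw scores."""
    return 100.0 * (mspe_raw - mspe_model) / mspe_raw


# Made-up MVPA values in min/d for four hypothetical participants.
actigraph = [30.0, 45.0, 20.0, 60.0]    # criterion measure
raw_gpaq = [60.0, 90.0, 10.0, 120.0]    # uncalibrated self-report
calibrated = [35.0, 40.0, 25.0, 50.0]   # model predictions

mspe_raw = mspe(actigraph, raw_gpaq)
mspe_model = mspe(actigraph, calibrated)
reduction = percent_reduction(mspe_raw, mspe_model)
```

The same two functions apply unchanged to the testing sample and to the independent validation sample, since both are evaluated out of sample.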

Participants
The AO sample used in our analysis included 108 participants (n = 74 females; n = 34 males). Average age of participants was 49.4 years (range: 19.8-68.7 years). The LWT independent validation sample included 120 participants (n = 111 females; n = 9 males). Average age of participants was 36.5 years (range: 23.4-62.3 years). Participants wore the ActiGraph accelerometer for an average of 7 days (range: 4-10 days) and had an average awake wear time of 984 min/d (16.4 h/d). Table 1 shows the descriptive statistics of the testing, training, and independent validation samples. Table 2 shows the breakdown of GPAQ reported MVPA by domain. The table demonstrates the vast differences in how each sample accrued daily MVPA. The AO sample accumulated the majority of MVPA from work-related activity (66.4%), while the LWT sample accumulated the majority of MVPA from recreation (50.0%). Additionally, the LWT sample had more MVPA from travel than the AO sample.

Calibration
We fit multiple linear regression models to the AO training set, using gender, age, GPAQ reported sedentary behavior, GPAQ reported moderate-intensity PA, GPAQ reported vigorous-intensity PA, and GPAQ reported MVPA as covariates to predict accelerometer measured sedentary behavior and PA. Figure 1 shows the scatterplots of the self-reported GPAQ outcomes and the corresponding accelerometry outcomes. This figure shows that there is a weak association between the self-report and accelerometry data. Table 3 corroborates this by giving the proportion of variance explained (R²) for the outcome-matching model using only gender, age, and the matching GPAQ outcome […] Figure 2 shows the improvement in predictive value of the GPAQ using the multiple linear regression models detailed in Table 4 […] raw GPAQ values closer to the truth (i.e., closer to the y = x line), thus showing the vast improvement in predictive accuracy of sedentary behavior and PA when using the multiple linear regression models over the raw GPAQ scores. Figure 3 shows these same plots but without the GPAQ values included. Despite the dramatic reduction in error, the predictions are still rather noisy. For the model predicting sedentary behavior, there also seems to be an issue with obtaining relatively few unique predicted values. Both issues imply that further work in this area would be beneficial.

Discussion
Calibrating PA questionnaires to objective measures of PA is a useful way to denoise these self-reported data. Additionally, this method can be utilized to compare self-reported PA data with objective measures, as well as to better understand how these two measures can be used together. The current study demonstrates the feasibility of this approach by calibrating the GPAQ to 7 days of ActiGraph data. The multiple linear regression models dramatically reduced prediction errors in both the AO testing sample and the LWT independent validation sample. Raw GPAQ scores overestimated MVPA by 31.6 min/d, and underestimated sedentary behavior by 204.3 min/d in the AO sample using the ActiGraph as ground truth. In contrast, the model-adjusted mean GPAQ scores were within 2.1 min/d and 8.4 min/d of the ActiGraph measured mean MVPA and sedentary behavior, respectively. It is important to note that these reductions in error were seen in out-of-sample predictions, meaning the models were applied to samples separate from those on which they were built. Additionally, using an independent validation approach we were able to show the model's utility in a dissimilar population, who under-reported PA rather than over-reported it, as the AO population did. The two populations also accumulated PA in differing domains; while the AO sample accumulated 66.4% of their MVPA from work-related activity, the LWT sample only accumulated 36.8% from work. The majority of the LWT sample's MVPA came from recreational activity (50.0%), and they also accumulated substantially more travel-related MVPA than the AO sample (13.2% for LWT versus 4.8% for AO). The LWT population has lower income and education levels than the AO population, and they are 42.4% White, 23.5% African American, and 23.5% Hispanic or Latino, while the AO population is 82.4% White, 3.7% African American, and 13.0% Hispanic or Latino. This provides some assurance of the calibration model's accuracy for new studies on dissimilar populations.
Adding age and performing data transformations on the self-reported GPAQ values improved PA prediction. Models utilized only data readily available from the questionnaire to improve usability in other studies where more thorough examinations, including anthropometric and objective PA data, are not available. The potential application of this calibration model includes improved predictions of PA in large sample surveillance data, longitudinal studies, and pre-/post-measurements where objective measures of PA are not feasible.
Strengths of the current study include the use of a population-wide random sample, as well as the use of data splitting to validate the models. Additionally, a major strength of the study lies in the ability to evaluate the predictive ability of our models on an independent sample from a disparate population with differing racial and socio-demographic make-up. The study utilizes one of the most commonly used PA questionnaires in current practice, which increases its usability for other researchers. Limitations of the study include a mostly white sample, although an effort was made to oversample the Latino population in Ottumwa. The study sample is also mostly female, particularly in the independent validation sample. Finally, the training data set was necessarily small, potentially limiting the range of PA observed to be smaller than that which might be observed in a larger study.

Funding
This work was funded by the Centers for Disease Control and Prevention Cooperative Agreement Number 1 U48 DP005021-01, and the National Institutes of Health through grant 5R01 HL119882. The funding agencies were not involved in the study design, collection, analysis and interpretation of data, or in writing the manuscript.

Availability of data and materials
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Authors' contributions
KM was a contributor in acquiring and interpreting data, and writing and revising the manuscript. MCG was a contributor in the analysis and interpretation of data, and writing and revising the manuscript. SF was a contributor in acquiring data, and writing and revising the manuscript. HL was a contributor in conception and design, and critically revising the manuscript for intellectual content. KJ was a contributor in conception and design, interpretation of data, and writing and revising the manuscript. BB was a contributor in conception and design, and critically revising the manuscript for intellectual content. DS was a contributor in conception and design, analysis and interpretation of data, and writing and revising the manuscript. All authors provided final approval of the manuscript and agree to be accountable for all aspects of the work.

Ethics approval and consent to participate
The Institutional Review Board at the University of Iowa approved the study protocols and informed consent form. The IRB reference number is 201309708. All participants provided written informed consent prior to participating in study activities.

Consent for publication
Not applicable.