Use of SOPARC to assess physical activity in parks: can race/ethnicity, observation settings, and target area conditions affect reliability?

Abstract
Background: Since its introduction in 2006, the SOPARC protocol has become a fundamental tool to quantify park visitor behaviors and characteristics. We tested SOPARC reliability when assessing race/ethnicity, physical activity, contextual conditions at the time of observation, and conditions of target areas to understand its utility when trying to account for individual characteristics of users.
Methods: We used 4,725 SOPARC observations completed simultaneously by two independent observers to evaluate intraclass correlation and agreement rate between the two observers when trying to assess gender, age group, race/ethnicity, and level of physical activity of urban park users. Observations were made in 20 New York City parks during spring and summer 2017 as part of the PARC 3 project.
Results: Observers counted 25,765 park users with high interobserver reliability (ICC=.94; Agr=.75). Reliability scores were negatively affected by the population being observed, the intensity of physical activity, and the conditions and settings of the target area at the time of observation. Specific challenges were found when assessing the combination of physical activity and race/ethnicity.
Conclusions: SOPARC training should aim to improve reliability when assessing concurrent measures: physical activity, race/ethnicity, age, and gender. Similarly, observing active areas and areas that can be congested with people requires more observation practice hours.



Background
Research has long demonstrated how physical activity and recreation in public open spaces are positively associated with health outcomes (1,2). Systematic observation studies in spaces such as parks and playgrounds, however, can be challenging because of the lack of objective measures to quantify visitor characteristics and behavior. Accurate assessment of large numbers of physically-active park users poses additional challenges due to the constant dynamic movement of the park users.
Since its introduction in 2006, the SOPARC protocol (Systematic Observation of Play and Recreation in Communities) (3) has provided an answer to those challenges.
SOPARC is based on momentary sampling through periodic and systematic scans of delimited target areas. It provides a consistent method to count large groups of people while they are taking part in often highly-dynamic activities without placing a burden on participants (4). The development of SOPARC and its rapid adoption by researchers allowed a better understanding of the context in which physical activity occurs while being able to measure the behavior of large groups of people. Its development has allowed evaluation to move beyond individual measurements and self-reported physical activity.
Previous research shows that the interrater reliability of SOPARC observers is improved by a period of training and practice (5). Intraclass reliability when counting park users using SOPARC, for instance, has been previously reported as ranging between 0.94 and 1.0 by Chow et al. (6), as consistently greater than 0.8 by Cohen et al. (7), and as over 0.9 by Banda et al. (8). Reliability values have been slightly lower when observers have tried to assess a combination of age and gender; age and physical activity; or age and ethnicity. Floyd et al., (9) Similarly, while the majority of studies using SOPARC report some kind of general reliability measure, no effort has been made to understand how contextual variables at the time of the observation may affect reliability scores. Factors such as the time of day when the observation is taken, the day of the week, or first versus last rounds of SOPARC observation might determine observer fatigue, potentially resulting in variability in attentiveness and lower interobserver reliability. The number of people in the target area can make it difficult to accurately assess factors such as race/ethnicity and physical activity due to the increased visual complexity of the target area. The presence of organized activities, both formal and informal, typically entails not only more people in the target area but can also generate distractions, making the task more difficult for the observer. Finally, the type of activity setting that is being observed might also have implications for observers.
Observing a basketball court, for example, might prove harder than observing a swing set; the intricate design of some playgrounds can generate blind spots for the observers; and assessing the race/ethnicity of park users in large areas such as baseball fields can prove difficult due to distance, masks or other related paraphernalia. Even after subdividing the target areas as recommended by the SOPARC protocol, some of these problems may persist.
In order for SOPARC to be a useful tool, it is not possible simply to sidestep these issues. For example, given the known racial disparities in physical activity (11,12) and in access to park settings and outdoor opportunities (13) in the US, being able to use the SOPARC protocol not only to assess the use of space in parks and playgrounds but also to gather individual user characteristics was of prime importance for the PARC 3 project, despite the fact that interrater reliability is likely to be negatively affected by the addition of more stratifying variables. Because of that, after introducing some small modifications to the original form, we aimed to evaluate SOPARC reliability and its capacity to accurately assess physical activity levels among different racial-ethnic population groups.
Specifically, this study aimed to understand how reliability in observations of park users may be affected by (1) (14)). Parks were located in low-income census blocks with high prevalence of Latino or Asian populations (15). To account for individual characteristics of users and to be able to link them, modifications were made to the original SOPARC form, which proposes gender to be an anchor variable for each scan and age group, physical activity, and race/ethnicity to be observed separately.
The adapted forms (see Figure S1) required each scan to use both gender and age group as anchors, therefore generating 10 scans per target area per round (two genders by five age groups). At each scan, race/ethnicity and physical activity were observed for the anchored gender and age group. For example, females 0-4 years old were observed first: at the time of the scan, all females apparently 0-4 years old were assessed for race/ethnicity and for physical activity level. Six observers were trained over two days for a total of nine hours, including lecture and practical field training. The traditional SOPARC protocol was used, where the areas were scanned from left to right for each gender and age group combination. Observed variables included gender (male; female), age groups (children 0-4 years old; children 5-10; teens 11-19; adults 20-64; older adults 65+), race/ethnicity (White; Black; Asian; Latino; Other/unsure), and physical activity levels (sedentary; moderate/walking; vigorous). All parks were visited and SOPARC administered on two weekdays and two weekend days in both the spring and the summer, for one-hour periods (3pm, 4:30pm, and 6pm during spring; 10am and 6pm during summer). Each observation consisted of 4 rounds within an hour (one every 15 minutes), and each round included a total of 10 scans for each target area. The target areas (n=167), which had been previously identified and geolocated by research team members, included playgrounds, swings, water features, sports courts, and fields. Each target area was observed by a pair of observers during a one-hour period, with rounds starting every 15 minutes (e.g., 3:00, 3:15, 3:30, 3:45), between April and June and in July-August 2017. All study procedures were approved by the North Carolina State University Institutional Review Board (IRB#9376).

Reliability measures
A wide range of statistical scores and metrics are available to measure reliability when observing park use.
Most previous SOPARC studies have used either one or a combination of intraclass correlation and interobserver agreement (5). The intraclass correlation coefficient (ICC) is the most widely used metric to measure observer agreement (5,16). ICC is an estimate of the relative magnitude of variation between multiple assessments of the same observation as compared with variation across observations (17,18). Scores of 0.75 or higher are taken to mean excellent agreement beyond chance (19,20). Interobserver agreement, which evaluates the degree of concordance between assessments of two or more observers, measures the proportion of occasions on which individuals gave the same score (18). Interobserver agreement values over 70% are typically considered high (21). Agreement values, however, are subject to any shared systematic biases the observers might have, which might result in poor agreement even in cases with high intraclass correlation.
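The two metrics can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than the study's method: the source does not state which ICC form was used, so a one-way random-effects ICC for two observers is shown; the function names and the sample counts are hypothetical; and the tolerance rule (a 10% discrepancy allowed once either observer counts 11 or more people, as used in this study for crowded areas) is taken relative to the larger of the two counts.

```python
from statistics import mean


def icc_oneway(pairs):
    """One-way random-effects ICC for two observers (assumed form).

    pairs: list of (count_observer_a, count_observer_b), one per scan.
    """
    n, k = len(pairs), 2
    grand = mean(x for p in pairs for x in p)
    row_means = [mean(p) for p in pairs]
    # Variation across observations (between target-area scans).
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Variation between the two assessments of the same observation.
    ss_within = sum((x - m) ** 2 for p, m in zip(pairs, row_means) for x in p)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def agrees(a, b, crowd_min=11, tolerance_pct=10):
    """Exact agreement, relaxed to a percentage tolerance in crowded areas.

    Assumption: the tolerance is measured against the larger count.
    """
    if max(a, b) >= crowd_min:
        return 100 * abs(a - b) <= tolerance_pct * max(a, b)
    return a == b


def agreement_rate(pairs):
    """Proportion of paired observations counted as agreement."""
    return sum(agrees(a, b) for a, b in pairs) / len(pairs)


# Hypothetical paired counts for five scans.
sample = [(3, 3), (12, 11), (20, 17), (1, 2), (5, 5)]
print(f"ICC = {icc_oneway(sample):.3f}")
print(f"Agreement = {agreement_rate(sample):.0%}")
```

With these made-up counts the ICC is roughly 0.98 while only three of the five pairs count as agreement, which shows how the two metrics can diverge on the same data.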
Because interobserver agreement is increasingly hard to achieve in crowded park areas, we followed McKenzie et al. (3) and used a variant of the overall proportion of agreement indicator: in areas where at least one observer counted 11 or more people, a 10% discrepancy between observers was allowed and still counted as agreement. We used both ICC and interobserver agreement to analyze SOPARC reliability when assessing the number of park users stratified by race/ethnicity and physical activity in specific park and contextual conditions. Reliability scores are not meaningful when areas are empty, as they will falsely represent high agreement scores. Therefore, empty areas were excluded from the final database for the reliability analysis (n=1335 observations excluded) following At this point, however, there might still be areas with at least one person but no Asians, and thus agreement might still be slightly inflated due to the presence of zeros. One could select only those pairs of observations where at least one Asian person was counted by either of the two observers (n=1837), in which case we find that the pair of observers agreed 47% of the time on the exact number of Asians present in the target area. This restrictive method, however, is highly limiting in terms of sample size, especially when trying to compare reliability scores in terms of race/ethnicity or physical activity. Because most SOPARC analyses either have low sample sizes (4,22) or have only a subsample of observations actually collected in pairs (4,6,23), this most restrictive method would be difficult to replicate in future studies. Because of this, we chose to use all observations made in nonempty areas (n=3390) as our base sample.

Statistical Analysis
We first analyzed reliability scores of pairs of observers assessing race/ethnicity and physical activity independently (Table 1).
We used paired-samples t-tests to identify significant differences when trying to assess each race/ethnicity category and each physical activity intensity. We also used paired-samples t-tests to measure how reliability scores changed when trying to identify combined traits such as the number of people of a single race/ethnicity engaging in a particular physical activity intensity (Table 2). In addition, Chi-square tests were used to compare how target area settings and contextual variables at the time of the observation were associated with interobserver agreement variance (Table 3). Contextual variables included the type of day (weekends or weekdays), time of day when the observation was taken (10-11am; 3-4pm; 4:30-5:30pm; 6-7pm), and round number (1, 2, 3, or 4). Variables concerning the conditions of the target area included the number of people present in the target area (average between the assessments of the two observers), type of activity setting (playground, basketball, baseball or handball court, swing set, water feature, or other), presence of organized activities (formal or informal), and the presence or absence of shade. We used a binary logistic regression to estimate the odds of achieving agreement under the different target area characteristics. Finally, in Table 4

Results
A total of 25,765 park users were counted during the observations covered by this analysis. Adequate reliability was recorded when observing race/ethnicity and physical activity separately, both when using all observations (n=4725; ICC=0.941) and only observations made in nonempty areas (n=3390; ICC=0.915) (Table 1). In nonempty areas, the agreement was high for estimating the number of Asian people at the park (ICC=0.922; Agreement=71.3%), and low for estimating the number of people engaging in vigorous physical activity (ICC=0.517; Agreement=54.2%).
Table 1 notes:
1 Observations in areas where at least one of the two observers counted one or more people.
2 Average of the counts reported by the two observers.
3 Percent of times where both observers counted the exact same number of people in the target area. In areas with 11 or more people, a <10% discrepancy is still considered as agreement.
Paired-samples t-tests were used to evaluate whether there were significant differences in the rate of agreement between observers when assessing race/ethnicity and physical activity (Table 2). Observers agreed more often when counting Asian Americans than Latinos (t=23.2, p<0.001) or African Americans (t=11.41, p<0.001), and they agreed less often when counting African Americans than Latinos (t=-11.36, p<0.001). In terms of physical activity, observers agreed more often when assessing sedentary activity than moderate activity (t=13.32, p<0.001), but less often than when assessing vigorous activity (t=-4.85, p<0.001). When trying to assess both race/ethnicity and physical activity jointly, the largest discrepancies were found when comparing the reliability achieved when counting Asian Americans' vigorous activity with the reliability achieved when counting Latinos' vigorous activity (t=5.538, p<0.001).
When contextual variables and settings of the target area were taken into account (Table 3), we found a decrease in the rate of agreement when more than five people were observed in the target area. There were, however, no major differences in interobserver reliability when observing areas with 5-15 people as compared with more than 15 people. While the day of the week or round of SOPARC did not appear to affect the rate of interobserver agreement, observing in the morning seemed to be less reliable than observing at any other time of the day. In terms of conditions of the target area, having organized activities did not compromise the reliability of the observation. Interobserver agreement, however, was slightly lower when observing completely shaded or completely sunny areas, and also when observing basketball courts and playgrounds. The adjusted agreement rates, which account for the characteristics of each target area, confirm that agreement is harder to achieve in areas with more than 5 people, and easier to achieve in swing areas in comparison with basketball courts.
Finally, Table 4  Interestingly, the reliability of SOPARC observations did not significantly change with the rounds of observation. In the case of informally organized activities, observers had better reliability when counting Asian people in areas with informally organized activities than in areas without organized activities (sedentary: 79% vs 73%; moderate: 73% vs 66%; vigorous: 83% vs 78%). When counting Latinos and African Americans, however, reliability was always worse when assessing organized activities, compared with areas with nonorganized activities.

Discussion
Despite most SOPARC studies reporting high reliability and a protocol that has been widely accepted and adopted in outdoor recreation research (6,7,22,24,25), for the study. Employing local community members in SOPARC observations has been reported as a way to help overcome some of these issues (33), although it should be noted that past studies did not find evidence of an association between observers' demographic characteristics and better identification of race/ethnicity traits (31,34). Regarding contextual characteristics at the time of the observation that potentially affect reliability scores, an important finding of this study is that interobserver agreement did not decay with each additional round of observation.
The SOPARC protocol with the modified format seems to be adequate in this respect, as the quality of observations is not impacted by the amount of time spent observing or by observer fatigue. Other than that, the most important contextual variables affecting interobserver reliability were the number of people present in the target area and the type of target area that was being observed. Observers achieved very high agreement when observing areas with five or fewer people. The type of target area, for its part, also significantly affected reliability. While swing areas recorded very high interobserver reliability, basketball courts and playgrounds reached low agreement rates close to 60%.
These low reliability statistics in specific areas of the park can be partially explained by a combination of some race/ethnicities being more difficult to assess than others, some physical activity levels being harder to identify, and the fact that areas with more people might be harder to assess. African American males, for instance, are more frequently observed engaging in vigorous physical activity in basketball courts (15). The combination of more people in the target area engaging in a highly dynamic activity can lower interobserver reliability. Similarly, complex cognitive tasks such as identifying race/ethnicity and physical activity at the same time can work well in calm and well-defined areas such as swings, but can prove tricky in intricate areas such as playgrounds.
Trying to assess a hard-to-measure target, such as physical activity categories that are hard to distinguish, in the context of a difficult-to-observe target area can push reliability values well below acceptable thresholds. This variability was also described by Chung-Do et al. (35), who reported agreement rates as high as 0.94 for sedentary girls and as low as 0.44 for vigorous boys.
Our results suggest that observers might benefit from subdividing the target areas whenever more than 5 people are present, or in the case that more moderate activities are taking place. Modifying the SOPARC form to anchor observations based on race/ethnicity and age instead of gender and physical activity could also improve reliability since race and physical activity are the more difficult-to-observe variables. While some technological assistance such as iSOPARC can be valuable to streamline coding and data management (10), if future SOPARC studies want to keep increasing the amount of information gathered per scan, they should consider a technological change or accept the drawbacks of lower reliability scores. In the future, if the appropriate permits are obtained, researchers might want to start using pictures or video cameras that can provide static assessments of the conditions of the park area and its park users, which can later be examined more thoroughly by human researchers or by machine learning.

Conclusion
Observational protocols such as SOPARC are key tools to understand park use and to accurately assess physical activity. This study set out to understand how the reliability of SOPARC observations can be affected by trying to link individual characteristics such as race/ethnicity and physical activity. At the same time, we were interested in testing how contextual variables at the time of the observation and the settings and conditions of the observed target area could also affect SOPARC reliability. Results suggest that SOPARC is an excellent tool for assessing population-level physical activity, but its limitations arise when trying to link observations by race, age, gender and physical activity at the same time. Similarly, active areas and areas that can gather more people may benefit from simultaneous observation by multiple raters to allow more stable estimates.

Supplementary Material
Modified SOPARC coding sheet used in this study, which allows gender and age observations to be linked with race/ethnicity and physical activity.

Declarations
Ethics approval and consent to participate