An iterative, multi-step approach was taken to meet the purpose of the current study. Prior to formal data collection, item development took place over an extended period using multiple expert working groups. This process resulted in the Great Recess Framework – Observational Tool (GRF-OT), which measures the context in which recess takes place and the behaviors that manifest within this context. The measurement model of the GRF-OT was then tested, followed by reliability and stability testing of the tool. The methods and procedures used to accomplish these goals are described below.
Item development
The GRF-OT was developed over several iterations through expert working groups and field testing driven by Playworks Education Energized (www.playworks.org). During the first iteration, a national team of recess researchers and practitioners developed a series of indicators thought to support physical activity and positive social development during recess. These indicators were derived from previous research [20, 29] as well as decades of professional practice. This team included three Playworks program directors, two Playworks Pro Trainers, the Playworks National Evaluation Director, and the National Director of Quality Programs, with input from researchers working in this area (see [20, 29]). The initial items were then field tested by Playworks program directors at recess sessions across the United States. Because users felt the initial items were not user friendly and did not follow a logical pattern for observation, a second working group of four different Playworks program directors, two different Playworks Pro Trainers, the Playworks National Evaluation Director, and the National Director of Quality Programs was formed to improve these processes. Based on this group's experiences in the field, higher-order domains of safety, engagement, and empowerment were identified, and items corresponding to each of these domains were subsequently created by the team. These items were then placed on a 3-point scale and sent to an external researcher with publications in this area (i.e., [29]) for critical review and modification. This scale was piloted before being sent to a second group of external researchers [WVM, MBS] for critical review and further modification. At this point, the items on the GRF-OT were moved from a 3-point to a 4-point response scale. Items were also modified for operational definition clarity and ease of use. The results of these processes can be found in the 24-item version of the GRF-OT (see Additional file 1).
Procedures
Recess is often defined, and implemented, differently across various sectors but generally refers to discretionary breaks children have during the school day. For the purposes of the current study, recess observations took place during scheduled recess breaks immediately before or after the lunch period, typically lasting 15–30 min. Schools maintained variable schedules, with some schools sending all students outside at once, while others rotated sessions with different children and different supervisors (e.g., only first through third graders at recess one, followed by only fourth and fifth graders at recess two). Outcome assessors arrived at the outdoor playground approximately 15 min before the scheduled recess session to complete a walkthrough of the playground and take any notes about the built environment. Outcome assessors then observed the entire recess session, taking notes on each item throughout the process. In all cases, the recess environment was completely visible to the outcome assessor, and outcome assessors were trained to move throughout the playground in a discreet manner in an effort to observe patterns of interaction and behavior. Final scoring of each item was completed immediately after the recess session and took into account the aggregate patterns of behavior throughout the duration of the session.
Validity
To test the measurement validity of the GRF-OT, data were collected at 649 individual school-based recess periods during the fall of 2016. These recess sessions spanned 495 schools across 22 urban, or metropolitan, areas in the United States of America. A list of data collection locations is available upon request.
Reliability
Eight graduate students were recruited as outcome assessors. These outcome assessors had no prior experience with observational data collection of recess, facilitation of school-based recess, or teaching in an elementary school. Thus, outcome assessors were novices with respect to school-based recess observational assessment. Outcome assessors were introduced to the items on the GRF-OT and their operational definitions, and were trained in the scoring procedures. Each item was discussed in a series of workshops that allowed the outcome assessors to ask clarifying questions regarding scoring procedures. After initial discussions of each GRF-OT item, outcome assessors were instructed to read through a GRF-OT scoring manual created by the lead investigator. This training manual included each GRF-OT item, its operational definition and examples, and its scoring criteria, as well as corresponding photos and videos to enhance the training process. Pilot observations were then conducted, followed by debriefing sessions to ensure clarity in the scoring instructions. Pilot data were not used in any analyses.
Data were then collected by two independent outcome assessors, blinded to one another’s scores, at 162 recess sessions. To ensure blinding of scores, data were entered by an independent staff member uninvolved in data collection. The 162 recess sessions took place across 9 schools, and data were collected at each school over a two-week period. In total, first grade students were observed in 47 sessions, second grade students in 52 sessions, third grade students in 51 sessions, fourth grade students in 52 sessions, fifth grade students in 52 sessions, and sixth grade students in 23 sessions.
Data analysis
Validity
To examine the measurement model of the GRF-OT, exploratory structural equation modeling (ESEM) was used in Mplus version 7.4 [30]. In considering alternative data analysis strategies, exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) were ruled out because EFA structures are often not supported by subsequent CFA [31], and CFA does not inherently permit correlated variance structures (i.e., conceptual overlap between similar items). Given that CFA requires each item to load on only one factor, statisticians [32] have offered ESEM as a more flexible and realistic approach to model development. Moreover, inter-related constructs are consistent with the authors' a priori conceptualization of recess domains, as safety, engagement, and student empowerment at recess are theorized to be inter-related. With the ESEM method, items were free to cross-load on multiple factors, rotation of the factor matrix was possible, and the researchers were able to calculate the goodness-of-fit statistics typically associated with CFA [33, 34]. Additionally, this procedure allows for step-wise evaluation of the GRF-OT using multiple statistical criteria.
Decisions about the most appropriate model were made using the Chi-square (χ2) statistic, the Root Mean Square Error of Approximation (RMSEA), the Comparative Fit Index (CFI), the Tucker-Lewis Index (TLI), and the Standardized Root Mean Square Residual (SRMR). Generally, χ2 should be non-significant (p > .05, indicating model-to-data fit). However, the meaningfulness of this "absolute criterion" has been widely debated, as it is sensitive to sample size and model complexity. Psychometricians have therefore recommended that multiple fit indices be included in the model evaluation process. Values at or above .95 for CFI and TLI, and values below .08 for SRMR and RMSEA, have been used to indicate acceptable model fit, whereas values ≥ .98 and < .06, respectively, are preferred [35, 36]. There are no gold standards in model evaluation, but a conservative stance was adopted at this early stage of model development.
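The cutoffs above can be expressed as a simple decision rule. The sketch below is illustrative only: the function name and the returned labels are our own and are not output by any SEM software.

```python
def evaluate_fit(cfi, tli, rmsea, srmr):
    """Classify model fit using the cutoffs adopted in this study.

    CFI/TLI >= .95 with RMSEA/SRMR < .08 -> "acceptable";
    CFI/TLI >= .98 with RMSEA/SRMR < .06 -> "preferred";
    otherwise -> "poor". Labels are illustrative, not standard output.
    """
    if cfi >= 0.98 and tli >= 0.98 and rmsea < 0.06 and srmr < 0.06:
        return "preferred"
    if cfi >= 0.95 and tli >= 0.95 and rmsea < 0.08 and srmr < 0.08:
        return "acceptable"
    return "poor"

# Hypothetical model with CFI = .97, TLI = .96, RMSEA = .05, SRMR = .04
print(evaluate_fit(0.97, 0.96, 0.05, 0.04))  # acceptable
```

Because all four criteria must be met simultaneously, this rule embodies the conservative stance described above: a model is flagged as soon as any single index falls short.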
Reliability
To examine the reliability of the GRF-OT, weighted Kappa scores were calculated to examine the relative consistency of scoring between outcome assessors on each individual item. Kappa scores of 0.80–1.00 are considered excellent agreement; 0.60–0.79, good agreement; 0.40–0.59, moderate agreement; 0.20–0.39, fair agreement; and scores below 0.20, poor agreement. In addition to individual item reliability, the inter-rater reliability of each sub-scale identified in the ESEM analysis, as well as the total GRF-OT, was examined by calculating an intra-class correlation coefficient (ICC) using a 2-way random effects model.
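The weighted-kappa computation and the agreement bands above can be sketched in a few lines. This is a minimal pure-Python illustration of linearly weighted Cohen's kappa, not the authors' software; function names and all score data are hypothetical, and the 4-category default mirrors the GRF-OT's 4-point response scale.

```python
from collections import Counter

def weighted_kappa(rater1, rater2, n_categories=4):
    """Linearly weighted Cohen's kappa for two raters on an ordinal scale.

    rater1/rater2: equal-length lists of integer scores in 0..n_categories-1.
    The weight for an (i, j) disagreement is |i - j|, so near-misses on a
    4-point scale are penalized less than large discrepancies.
    """
    n = len(rater1)
    observed = Counter(zip(rater1, rater2))      # joint score counts
    marg1, marg2 = Counter(rater1), Counter(rater2)  # marginal counts
    # Observed weighted disagreement vs. that expected by chance
    num = sum(abs(i - j) * observed[(i, j)] / n
              for i in range(n_categories) for j in range(n_categories))
    den = sum(abs(i - j) * (marg1[i] / n) * (marg2[j] / n)
              for i in range(n_categories) for j in range(n_categories))
    return 1.0 - num / den

def interpret_kappa(k):
    """Map a kappa value onto the agreement bands used in this study."""
    if k >= 0.8: return "excellent"
    if k >= 0.6: return "good"
    if k >= 0.4: return "moderate"
    if k >= 0.2: return "fair"
    return "poor"
```

For example, two assessors who agree on every session score kappa = 1.0 ("excellent"), while chance-corrected partial agreement yields proportionally lower values.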
In addition to inter-rater reliability, the test-retest reliability of the GRF-OT was examined. Data were averaged across three recess sessions in one week and compared to data averaged across three recess sessions at the same school, during the same time period, the following week. An ICC using a 2-way random effects model was used to compare the three-day average in week one against the three-day average in week two. The procedure was then replicated by computing a two-day average in week one (day 1 and day 2) and comparing it to a two-day average from week two (day 4 and day 5). A minimal detectable change (MDC) was calculated for both the two-day and the three-day averages. The MDC is a practically important value that allows users to assess actual change in recess, as opposed to expected variability. The MDC is thought to be the change needed to ensure the recess climate differs from a previous observation. The following formula was used to calculate the MDC at a 95% confidence level:
$$ \mathrm{MDC} = 1.96 \times 1.414 \times \left(\mathrm{SD} \times \sqrt{1-\mathrm{ICC}}\right) $$
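The two quantities in this formula can be computed as follows. This is a minimal sketch, not the authors' analysis code: it assumes a Shrout-and-Fleiss-style two-way random effects ICC (single measures, absolute agreement), uses 1.414 as the approximation of √2 exactly as in the formula above, and all school data values are hypothetical.

```python
import math

def icc_2way_random(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.

    ratings: one row per school, one column per occasion (e.g., the week-one
    and week-two averages), with the same number of columns in every row.
    """
    n, k = len(ratings), len(ratings[0])        # schools, occasions
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # subjects
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # occasions
    sse = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                                # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def mdc95(sd, icc):
    """Minimal detectable change at 95% confidence, per the formula above."""
    return 1.96 * 1.414 * (sd * math.sqrt(1 - icc))

# Hypothetical week-one vs. week-two three-day averages for four schools
weekly = [[2.1, 2.3], [3.0, 2.8], [1.5, 1.6], [2.7, 2.9]]
icc = icc_2way_random(weekly)
change_threshold = mdc95(sd=0.5, icc=icc)
```

An observed change between two recess observations would then be interpreted as real only if it exceeds `change_threshold`; smaller differences fall within expected measurement variability.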