Portuguese Physical Literacy Assessment Questionnaire (PPLA-Q) for adolescents (15–18 years) from grades 10–12: development, content validation and pilot testing

Background
The Portuguese Physical Literacy Assessment (PPLA) is a novel tool to assess high-school students’ (grade 10–12; 15–18 years) Physical Literacy (PL) in Physical Education (PE), inspired by the four domains of the Australian Physical Literacy Framework (APLF) and the Portuguese PE syllabus. This paper describes the development, content validation, and pilot testing of the PPLA-Questionnaire (PPLA-Q), one of two instruments in the PPLA, comprised of modules to assess the psychological, social, and part of the cognitive domain of PL.

Methods
Development was supported by previous work, analysis of the APLF, and literature review. We iteratively gathered evidence on content validity through two rounds of qualitative and quantitative expert validation (n = 11); three rounds of cognitive interviews with high-school students (n = 12); and multiple instances of expert advisor input. A pilot study in two grade 10 classes (n = 41) assessed feasibility, preliminary reliability, item difficulty and discrimination.

Results
Initial versions of the PPLA-Q gathered evidence in favor of adequate content validity at item level: most items had an Item-Content Validity Index ≥ .78 and Cohen’s κ ≥ .76. At module level, S-CVI/Ave and S-CVI/UA were .87/.60, .98/.93 and .96/.84 for the cognitive, psychological, and social modules, respectively. Through the pilot study, we found evidence for feasibility, preliminary subscale and item reliability, difficulty, and discrimination. Items were reviewed through qualitative methods until saturation. The current PPLA-Q consists of 3 modules: cognitive (knowledge test with 10 items), psychological (46 Likert-type items) and social (43 Likert-type items).

Conclusion
Results of this study provide evidence for content validity, feasibility within the PE setting and preliminary reliability of the PPLA-Q as an instrument to assess the psychological, social, and part of the cognitive domain of PL in grade 10 to 12 adolescents. Further validation and development are needed to establish construct validity and reliability, and to study the PPLA-Q’s integration with the PPLA-Observation (an instrument in development to assess the remaining domains of PL) within the PPLA framework.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12889-021-12230-5.

in modern pedagogy have been the works of Margaret Whitehead [2][3][4], which conceptualized PL as the motivation, confidence, physical competence, understanding and knowledge to maintain physical activity throughout the life course.
Notwithstanding its lifelong development, sowing the seeds of PL during school-age seems critical, as participation in early childhood might predict adherence to active lifestyles throughout life [5,6], counteracting the rising levels of physical inactivity observed in adolescents and adults [7,8]. In this line, PL is argued as the main outcome of quality physical education (PE) in schools [9], since it provides a privileged environment -mandatory, free and qualified -for learning the life skills and values needed for active and global citizenship [10]; as well as being the only opportunity to participate and learn from PA for some school-aged children and adolescents [11]. Thus, many authors have underlined the need to operationalize this concept in school curricula and educational policies [12][13][14].
Despite a general consensus on the ultimate goal of PL (sustained lifelong PA participation [15,16]), its proposed conceptualization and constituent elements differ across sources [17][18][19]. These range from philosophically driven conceptualizations, like Whitehead's original PL proposition [2], rooted in the philosophical tenets of monism, phenomenology, and existentialism, to diametrical conceptualizations focusing solely on one of its aspects (e.g., fundamental movement skills) [20]. Although recognized as a rich theoretical concept, the former might lack the pragmatism needed to be implemented in practice [17], while the latter might deviate from the holistic nature of PL, compromising crucial elements like pleasure and enjoyment in taking part in PA [21]. As such, a middle-ground compromise might offer a tenable solution: providing clear and measurable outcomes, while honoring most of the philosophically driven premises that define the concept [20,22]. To this end, a team of Australia-based researchers developed the Australian Physical Literacy Framework (APLF) [1], a research-based, integrated model of PL in the physical, cognitive, psychological and social domains with 30 different elements, novel in recognizing the contribution that PL might make to cultural and social participation. It provides a clear focus on a learning continuum, inspired by the Structure of Observed Learning Outcomes taxonomy [23], designed to include individuals in different stages of their PL journey: from their first steps (pre-foundational) to higher stages of proficiency (transfer & empowerment) [24,25].

Physical literacy assessment
Given evaluation's essential role in PL implementation and practice [12], a few assessment instruments have been developed under diverse conceptual models [18,26,27]. Of these, the most prolific research-wise have been the Canadian Assessment of Physical Literacy (CAPL) [28,29] and the Physical Literacy Assessment for Youth (PLAY) [30]. The CAPL is comprised of standardized assessments developed for children from 8 to 12 years [31] (with preliminary testing done in 12 to 16 year-olds [32]), to assess daily behavior, physical competence, motivation and confidence, and knowledge and understanding. The PLAY tools have been developed to assess children from 7 years up (with recommendations mainly targeted at the 7-12-year range), and are comprised of measures of motor competence, comprehension, and confidence. Both tools integrate observational procedures and self-report, and feature overall good feasibility in PE [27], but lack options for older adolescents (15-18 years), a critical age range in Portugal which presents lower levels of PA [33][34][35], making them a priority target in the Portuguese PE setting.

Portuguese physical education and PL
The Portuguese PE national syllabus (PPES) was designed under Crum's socio-critical conception of PE, contemplating integrated learning in the motor, cognitive, affective and social domains, to empower students to engage in significant PA, and actively participate in the movement culture throughout their lives [36]; expanding beyond a restricted and instrumental participation in PA [37]. Although the initial development of this syllabus slightly predates Whitehead's influential works on PL [2], it implicitly aligns with the latter's ontological and epistemological premises. Akin to a phenomenological and existentialist perspective [38], it advocates pedagogical practices of differentiation, allowing a high degree of flexibility towards the achievement of curricular goals, recognizing that each individual enjoys and values different forms of movement; while using assessment as a tool to motivate and identify where every student should work to improve, in line with strategies proposed both by PL [38] and assessment specialists [39].
The PPES distinguishes three learning areas: 1) Physical Activities, 2) Health-Related Fitness, 3) Knowledge. In the first area, it advocates the participation in a wide range of physical activities (sport-based team and individual activities, rhythmic and expressive activities, nature exploration activities, and traditional games), enabling students to choose from an eclectic array of physical activities throughout their life. In each of these activities, student progress is charted through 3 levels of competency -introductory, intermediate, and advanced -integrating 1) mastery of specific movement skills, 2) cognitive skills related to tactical decision, 3) knowledge and application of activity rules and 4) prosocial behavior during said activity [40]. This multilateral learning through participation in physical activities is supported by the development of health-related fitness, and the knowledge and skills needed to lead a healthy lifestyle through personal significant PA (second and third areas of the PPES, respectively).
Despite having common points with most PL definitions and models, the PPES curricular and pedagogical choices align more closely with the Australian proposal previously presented, since the latter explicitly includes the social domain as an integral part of the PL development, as well as elements pertaining to tactical and rules learning. Also, the APLF maps all development through the usage of a modified continuum based on the Structure of Observed Learning Outcomes taxonomy [23], which recognizes that learning might differ not only in quantity (i.e., being less or more skilled/knowledgeable) but in qualitative state as well (i.e., going from a descriptive, surface knowledge to a relational understanding of a skill/knowledge); a principle mirrored in the three levels of competency in the PPES.
Considering these specificities of the PPES design and implementation, none of the presented PL assessments provide a complete picture of learning in all four domains; nor were they designed for older adolescents. As such, we developed an instrumental system, the Portuguese Physical Literacy Assessment (PPLA), to address this gap and use PE as a privileged means for PL development in Portugal.

The Portuguese physical literacy assessment (PPLA)
The PPLA was designed to provide a detailed and feasible assessment of each student's PL journey, and to inform pedagogical decisions (at local, regional, and national level) towards a more meaningful and targeted environment to promote PL learning of grade 10-12 (15-18 years) adolescents. The PPLA (Fig. 1) is based on the PPES and integrates assessment in the four domains of the APLF, using two instruments: a) the PPLA-Observation (PPLA-O), and b) the PPLA-Questionnaire (PPLA-Q). The PPLA-O (still in development) uses observational data collected by the teachers during regular PE classes (competency levels in the different physical activities, and physical fitness levels using standardized protocols) to assess the physical domain and part of the cognitive domain (Rules and Tactics elements). The PPLA-Q, which is the focus of this article, uses a knowledge test (with multiple-choice questions) and self-report (Likert-type scales) to assess the psychological and social domains and the remaining part of the cognitive domain (Content Knowledge element). Both instruments were designed to be applied together to provide a holistic picture of each student's PL journey. The PPLA (Fig. 1), following the APLF conceptualization of a learning continuum, summarizes its five development levels (for each element of the four PL domains) into two learning levels: Foundation and Mastery. This simpler structure still captures the qualitative change in the learning experience, separating surface learning from deep learning, while providing a more parsimonious and feasible instrument.
The Foundation level represents the initial development of each element, building affective, cognitive, psychomotor and social structures that enable participation in movement and physical activities, albeit in an isolated, instrumental or externally focused manner (i.e., to obtain benefits/rewards, or conform to the norm) -akin to the Unistructural and Multistructural levels of the Structure of Observed Learning Outcomes taxonomy, and the foundational levels of Bloom's Revised Affective Taxonomy [41].
The Mastery level represents a deeper development of the element, invoking metacognitive processes, relational understanding, or internalized behaviors (i.e., integrated into the individual's sense of self) regarding participating in movement and physical activities, derived from the Relational and Extended Abstract levels of the Structure of Observed Learning Outcomes taxonomy, and higher levels in Bloom's affective taxonomy.
As such, based on previous construct studies of PL [29,42] and the structure implied by the APLF, we hypothesize a hierarchical measurement model, with PL conceptualized as a fourth-order formative construct (Fig. 1) composed of its four domains (third-order formative constructs). Each domain is then formatively composed of several elements (second-order formative constructs), in turn composed of two first-order constructs, each reflectively measured by a set of manifest indicators (i.e., items).
The distinction between formative constructs (i.e., composites) and reflective constructs (i.e., factors) is important here. While the latter assume that items (or lower-order constructs) are interchangeable, since they measure the same underlying trait (i.e., are unidimensional), and thus are expected to covary, the former assume the opposite: that their composing items are not interchangeable and are not expected to covary, so that omitting or deleting an item changes the essence of the construct being measured [43][44][45].
Based on this conceptual framework, in a series of studies, we sought to develop the PPLA-Questionnaire (PPLA-Q), an instrument comprised of modules to assess grade 10-12 adolescents' psychological, social and part of cognitive domains of PL; and gather evidence for its content validity, feasibility within PE setting, preliminary reliability, item difficulty and discrimination.

Studies overview
The development of the Portuguese Physical Literacy Assessment Questionnaire (PPLA-Q) entailed a series of studies ( Fig. 2), based on a multiple phase design [46][47][48], inspired by the psychological, social and cognitive domains of the PL model proposed in the APLF [1,49], and by the Portuguese PE syllabus [50][51][52].
All the work was done in Portugal, as part of the doctoral project of the lead author, approved by the Ethics Council of Faculty of Human Kinetics, as well as the Portuguese Directorate-General of Education. All methods were performed in accordance with the relevant guidelines and regulations.
PPLA initial development was based on previous work done in the Erasmus+ Sport Project: PhyLit -Physical Literacy (590844-EPP-1-2017-1-UK-SPO-SSCP, January-December 2018), where a panel of experts selected -among the 30 proposed by the APLF -relevant elements for developing and advocating PL as an essential competence for European citizenship, based on a literature review of existing conceptualizations [19].
Initial development for each of the three modules of PPLA-Q entailed domain identification and item generation; followed by an iterative process to gather judgmental evidence on content validity that included: two rounds of qualitative and quantitative expert validation; three rounds of cognitive interviews with high-school students; and multiple instances of expert advisor input. We also conducted a pilot study to assess feasibility of the questionnaire in PE and collect preliminary data on reliability and construct validity.

Domain identification
Based on literature review, we established a theoretical framework for each of the eight elements in the psychological (Motivation, Confidence, Emotional Regulation, and Physical Regulation) and social domains (Culture & Society, Ethics, Collaboration, and Relationships) ( Table 1). The literature review conducted by Dudley and colleagues [49], in the report preceding the creation of the APLF, was used as starting point to identify established and relevant theories for each element in the literature of motor development, physical education and/or physical activity. Then, constructs with higher conceptual proximity were chosen -caring to minimize overlap -, mapped to the two-level framework, and operational definitions derived from the APLF.
For the Cognitive Domain, we conducted a content analysis of the Portuguese PE syllabus (PPES) to identify key learning objectives coherent with the Content Knowledge, Tactics and Rules elements of the APLF. In this process, to ensure adequate content representation, we subdivided the Content Knowledge element into different content themes (Nutrition, Body Composition, Training Methods, Safety & Risk, PA Benefits); each was then mapped to the two-level framework and its operational definition derived from the PPES ( Table 2).
Since tactical behaviors and adherence to rules (i.e., as a participant, and as a referee or judge) are better assessed through direct observation of the student's behavior during PE, we chose to include the Tactics and Rules elements alongside the assessment of the physical domain (in the PPLA-O). As such, these elements will not be further discussed here, despite being an integral part of the Cognitive domain.

Item generation

Psychological and social modules
Items in the Psychological and Social domains were developed to conform to self-report measurement using Likert-type scaling, given its adequacy and versatility to measure attitudes, beliefs and self-perceived abilities [67,68]. An initial goal was set to generate a 5-item subscale per learning level (two subscales per element, four elements per module). This was a compromise between the size of the resulting questionnaire and a larger initial item pool, which provides margin for eliminating poorly performing items during testing [67,69] down to four per subscale, the recommended number to calculate reliability and further test measurement models [70].
In an effort to use psychometrically sound items as a reference for item generation [71], a non-systematic literature review was conducted using the ERIC, Google Scholar, Scopus and ProQuest databases to identify a first round of eligible articles for each element, which were then used to refine further searches for articles. In these, we selected published and validated scales or subscales (in English or Portuguese), amply used in PE, sport, or PA contexts, and sampled items that adhered to each level's operational definitions (Table 1). When several items overlapped in content, those with higher item loading were selected.
After permission for adaptation was granted by each scale's lead author, sampled items were used as reference to generate items in Portuguese, based on the examples provided by the APLF, and technical recommendations available in the literature [67-69, 72, 73]. When suitable reference scales were not available, or failed to achieve full content representation for the element or level, items were generated according to the previous literature review.
All items used a consistent 5-point unipolar response scale, to maximize reliability and validity [73,74]. Response points were fully labelled, using both numeric and verbal labels (0 = Not at all; 1 = Slightly; 2 = Moderately; 3 = Quite a lot; 4 = Totally), measuring students' identification with each of the statements ("How much do the following statements describe you?").

Cognitive domain
For their suitability to test cognitive ability and knowledge [68], and ease of application, multiple-choice questions were generated for each content theme and level (10 items), according to technical advice presented by the literature [75,76], and by an educational assessment expert (PhD holder with extensive experience as a PE and graduate-level college professor, as well as an employee in the Portuguese Institute for Educational Assessment).
Throughout the process in all modules, the lead author acted as item generator, while remaining authors acted as co-validators to ensure preliminary content validity.

Content validity
Content validity pertains to the extent to which a set of items represents the intended construct [67]. It requires evidence of content relevance, representativeness, and technical quality, assessed through evaluation by experts and population judges [47]. As such, we led an iterative process with multiple rounds [77], collecting both qualitative and quantitative evidence from both parties.

Cognitive interviews
Cognitive interviewing is a qualitative method to assess whether a survey fulfills its intended purpose, through interview of selected individuals before, during and after pretesting [78]. In our study, cognitive interviews were conducted in three rounds, in two different high schools in Lisbon (one with a dominantly higher socioeconomic status population, and another with a lower socioeconomic status population), involving students of the target age-group (15-18 years), through different phases of development of the PPLA-Q. Before participation, informed consent was provided by all students and their legal guardians. All interviews were conducted by the lead author during PE classes and recorded. Initial interviews were more extensive (i.e., more content, less depth), while the later ones were progressively more intensive (i.e., narrower content, higher depth). This strategy balanced gross evaluation (e.g., format, conceptual breadth) in earlier phases with fine-tuning (e.g., wording, syntax) in later ones.
In February 2020, in each high school, a cognitive interview was conducted with a group of two students from grade 10 (aged 15) and another with two grade 11 students (aged 17). We sought to diversify these groups by 1) including, in each, a female and a male with different PE competency levels (according to their teacher); and 2) including students from different majors: one group from a Science, Technology, Engineering and Math major, the other from a Humanities and Arts major. Students were asked to fill in a draft version of the PPLA-Q, marking any items with ambiguous or unclear wording. Afterwards, an interview was conducted to probe for comprehension of items, focusing on the ones marked by students. Students were asked to verbally express their understanding of each and paraphrase it in their own words. They were also questioned about general issues of the questionnaire (i.e., length and structure, layout, ease of reading, rating scales, comprehension of instructions and item stems). Average duration was 45 min.
In December 2020, a second round of individual cognitive interviews was conducted immediately after pilot testing (version 0.4 of PPLA-Q) with two students from grade 10 (1 female, 1 male, both aged 15) from a Humanities major class. Here, students who posed abundant questions during the questionnaire application were selected to better study the clarity of the items. Given time constraints of the project, this round enlisted fewer students than initially warranted. Students were asked about their comprehension of selected items: those which were the target of most of the students' questions during pilot testing, as well as those previously revised. Average duration was 17 min.
In January 2021, a third round of individual cognitive interviews was conducted with six different students from the same grade 10 Humanities class recruited for the previous round (3 females, 3 males, mean age = 14.8 years). These were selected to span PE competency levels as widely as possible (as reported by the teacher). They were asked about their comprehension of all items changed from version 0.4 to version 0.5. Average duration was 15 min.

Table 1

Psychological Domain

Confidence
Theoretical framework: Psychological need satisfaction - Perceived competence [57]
APLF definition: A belief in self-worth and ability to perform in movement and physical activity [1]
Instrument used as reference: Psychological Need Satisfaction in Exercise Scale (PNSE) [56]
Foundation: Beliefs of self-worth and ability (5 items)
Mastery: Beliefs of self-worth and ability in challenging contexts (5 items)

Emotional Regulation
Theoretical framework: Emotional Intelligence [58]
APLF definition: Ability to manage emotions and resulting behaviors in relation to movement and physical activity [1]
Instrument used as reference: Wong and Law's Emotional Intelligence Scale (WLEIS) [59]
Foundation: Awareness of own emotions and others' (5 items)
Mastery: Emotional regulation and control (5 items)

Physical Regulation
Theoretical framework: NA
APLF definition: Recognizing and managing physical signals such as pain, fatigue and exertion [1]
Instrument used as reference: NA
Foundation: Awareness of physical signals (5 items)
Mastery: Regulation and management of physical signals (5 items)

Social Domain

Culture & Society
Theoretical framework: Sport Education [60]
APLF definition: Appreciation of cultural values which exist within groups, organizations and communities [1]
Instrument used as reference: NA
Foundation: Participation in sport's cultural phenomena (5 items)
Mastery: Valuing participation in sport's cultural phenomena and encouragement of others to do so (5 items)

Ethics
Theoretical framework: Moral development [61,62]
APLF definition: Moral principles that govern a person's behavior, relating to fairness and justice, inclusion, equity, integrity, and respect [1]
Instrument used as reference: Fair Play Questionnaire in Physical Education (FPQ-PE) [63]
Foundation: Respect for basic moral and ethical principles in physical activity contexts (fair-play) (5 items)
Mastery: Autonomy and empowerment of others in respecting moral and ethical principles in physical activity contexts (fair-play) (5 items)

Collaboration
Theoretical framework: Personal and Social Responsibility [64]
APLF definition: Social skills for successful interaction with others, including: communication, cooperation, leadership and conflict resolution [1]
Instrument used as reference: Personal and Social Responsibility Questionnaire (PSRQ) [65]
Foundation: Respect and cooperation with others
Mastery: Caring and leading others to success

Relationships
Theoretical framework: Psychological need satisfaction - Perceived Relatedness [57]
APLF definition: Building and maintaining respectful relationships that enable a person to interact effectively with others [1]
Instrument used as reference: Psychological Need Satisfaction in Exercise Scale (PNSE) [56]
Foundation: Interaction and relatedness with others
Mastery: Management and maintaining relationships with others

Evaluation by experts
Among the many methods available, Content Validity Index (CVI) and Cohen's coefficient kappa (κ) for interrater agreement were used to systematically assess expert consensus on content validity of an instrument [47,79].
Given the different subject matter of each module, expert selection was stratified per module to allow for more useful inferences. We intended to collect evidence from 6 experts (following recommendations of 5 [80]) with relevant scientific and professional background in each of the questionnaire's domains (i.e., psychology of physical activities/sport; sociology of sport; educational assessment/curriculum development), and ideally with experience in instrument development [81]. According to their expertise, each expert was invited to participate either (a) in all 3 modules (n = 3); (b) in 2 modules (n = 1) or (c) in a single module (n = 11).
Experts were invited through an email presenting the project's goals and explaining the motives for selection, containing (1) instructions for intended contribution, (2) a draft version of PPLA-Q, and (3) a spreadsheet file. Operational definitions for each construct were also provided -as content validity is inextricably linked to the definition of constructs under examination [67]. In the spreadsheet file, experts were asked to: (1) rate each item on its relevance ("How important is the item to assess the targeted construct?") and clarity ("Is the wording of the item clear?"), (2) provide suggestions for item improvement, (3) provide suggestions on questionnaire structure, instructions, and rating scale. Both relevance and clarity were assessed with a 4-point Likert-type Scale [80]. For relevance the rating options were: 1 = not relevant, 2 = somewhat relevant, 3 = quite relevant, and 4 = very relevant [79]. For clarity, the options were: 1 = not clear, 2 = item needs revision, 3 = clear, but needs minor revision, 4 = very clear [82]. During analysis, both ratings were collapsed into two dichotomous categories ("content invalid" and "not clear" for ratings of 1 and 2, and "content valid" and "clear" for ratings of 3 and 4, respectively) [80].
Of the invited experts, the actual first-round expert sample (n = 10) consisted of 2 global experts (3 modules), 1 expert rating 2 modules, and 7 experts rating a single module. Another expert provided solely qualitative feedback (i.e., suggestions of improvements for item and questionnaire structure) on 2 of the modules, with no quantitative ratings. We had minimal missing data, with no bearing on calculations, since all calculations were adjusted for the total number of raters of each item. Further characteristics of the participating experts are summed up in Additional file 1.
All calculations used RStudio [83] with R version 4.0.2 [84]. CVI was computed both at item level (I-CVI) and module level (S-CVI/Ave and S-CVI/UA). Polit & Beck [85] argue that, given the diverse uses of CVI in the literature, one should make their calculations explicit. We computed I-CVI as the proportion of experts rating each item as content valid. S-CVI/Ave was computed as the average of I-CVI for each module, while S-CVI/UA was computed as the proportion of items with I-CVI = 1 (i.e., universal agreement) for each module.
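The three indices defined above can be illustrated with a minimal Python sketch (the study itself used R; this is not the authors' code). The function names and the ratings are hypothetical, and missing ratings are assumed to be dropped so that each I-CVI is adjusted for the number of experts who actually rated the item.

```python
from typing import Dict, List, Optional, Tuple

Rating = Optional[int]  # 1-4 relevance rating, or None when missing


def i_cvi(ratings: List[Rating]) -> float:
    """Proportion of non-missing ratings that are 3 or 4 (content valid)."""
    valid = [r for r in ratings if r is not None]
    return sum(1 for r in valid if r >= 3) / len(valid)


def scale_cvi(items: Dict[str, List[Rating]]) -> Tuple[Dict[str, float], float, float]:
    """Return (I-CVI per item, S-CVI/Ave, S-CVI/UA) for one module."""
    i_cvis = {name: i_cvi(r) for name, r in items.items()}
    # S-CVI/Ave: mean of the item-level indices
    ave = sum(i_cvis.values()) / len(i_cvis)
    # S-CVI/UA: share of items on which all raters agree (I-CVI = 1)
    ua = sum(1 for v in i_cvis.values() if v == 1.0) / len(i_cvis)
    return i_cvis, ave, ua


# Hypothetical ratings from six experts (not data from the study)
ratings = {
    "item_1": [4, 4, 3, 4, 3, 4],
    "item_2": [4, 3, 2, 4, 3, 4],
    "item_3": [3, 4, 4, None, 4, 3],  # one missing rating
    "item_4": [2, 2, 3, 4, 1, 3],
}
i_cvis, ave, ua = scale_cvi(ratings)
print(i_cvis, round(ave, 2), ua)
```

In this made-up example two of the four items reach universal agreement, so S-CVI/UA is lower than S-CVI/Ave, which is why the former is described as the more stringent index.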
Many authors have criticized drawing content validity evidence based solely on CVI, given its susceptibility to chance agreement. They propose that Cohen's kappa [86], a statistic which accounts for the possibility of chance agreement among experts, be used alongside CVI [79]. For this purpose, kappa (κ) was computed for each item using Fleiss's modified version for multiple raters [77,87]:

\[ \kappa = \frac{P_a - P_c}{1 - P_c} \]

where $P_a$ (proportion of agreement) = I-CVI for the item, and $P_c$ (probability of chance agreement) is computed for a random binomial variable with one outcome:

\[ P_c = \frac{N!}{A!\,(N - A)!} \times 0.5^{N} \]

with $N$ = number of experts and $A$ = number of experts rating the item as content valid.
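As an illustration of this chance-corrected statistic, the sketch below assumes the usual formulation from the content validity literature, in which chance agreement is the binomial probability of exactly A of N experts agreeing with a probability of .5 each, and κ = (I-CVI − P_c) / (1 − P_c). Function names are hypothetical, and the code is not taken from the authors' analysis scripts.

```python
from math import comb


def chance_agreement(n_experts: int, n_agree: int) -> float:
    """P_c: binomial probability that exactly n_agree of n_experts
    rate the item as content valid purely by chance (p = .5)."""
    return comb(n_experts, n_agree) * 0.5 ** n_experts


def modified_kappa(n_experts: int, n_agree: int) -> float:
    """Kappa adjusting the I-CVI (P_a) for chance agreement (P_c)."""
    p_a = n_agree / n_experts  # I-CVI of the item
    p_c = chance_agreement(n_experts, n_agree)
    return (p_a - p_c) / (1 - p_c)


# Five of six experts rate an item as content valid
print(round(modified_kappa(6, 5), 2))  # 0.82
```

With six raters, one dissenting expert lowers I-CVI to .83 and κ to about .82; with full agreement both equal 1.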
For item clarity, an identical procedure was used to calculate a proportion of agreement (akin to I-CVI) and a κ statistic for each item. Clarity was rated separately because the usual application of the Content Validity Index (CVI) pertains to a global evaluation of the item [77], which might hide some crucial aspects of the item's quality by confounding the conceptual relevance of the item with the clarity of its wording.
We used κ to inform item-level decisions, evaluating item relevance as fair (.40 to .59), good (.60 to .74) and excellent (> .74); κ lower than .40 prompted elimination of the item [87,88]. For clarity, the threshold was raised to discriminate items needing minor revisions and ensure higher clarity throughout: we evaluated items as clear (κ > .74) and as needing revision (κ < .74).
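These decision rules can be written as a small lookup, sketched here in Python. The label "excellent" for κ above .74 follows the commonly used convention and is an assumption, as is treating κ exactly equal to .74 as "needs revision" for clarity; the function names are hypothetical.

```python
def rate_relevance(kappa: float) -> str:
    """Item-level relevance decision from the kappa statistic."""
    if kappa < 0.40:
        return "eliminate"
    if kappa < 0.60:
        return "fair"
    if kappa <= 0.74:
        return "good"
    return "excellent"  # assumed label for kappa > .74


def rate_clarity(kappa: float) -> str:
    """Stricter rule for wording: only kappa > .74 counts as clear."""
    # Assumption: the boundary value .74 itself is sent to revision
    return "clear" if kappa > 0.74 else "needs revision"


for k in (0.35, 0.56, 0.70, 0.82):
    print(k, rate_relevance(k), rate_clarity(k))
```

Under these rules, an item can be retained on relevance grounds (e.g., rated "good") while still being flagged for rewording.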
Scale level decisions were informed by S-CVI. We used literature recommendation of .80 as an adequate level of agreement for the more stringent S-CVI/UA [81], and .90 for S-CVI/Ave [89].
In the second round of expert evaluation, the same procedures were followed to gather evidence of content validity on the revised Culture & Society scale (version 0.3), targeting a lower number of experts (n = 3, 2 of which participated in the previous round), due to time constraints in the project schedule.

Pilot testing
Pilot testing, or pretesting constitutes an opportunity to (1) test the application of items in development to a representative sample of target population [90]; (2) gather feasibility evidence to plan a larger scale study [91]; and (3) gather data for preliminary item analysis and estimates of reliability [92].
Although no clear-cut standard is available for sample size of pilot tests, Hertzog [91] suggests a sample size of 40 individuals for estimating preliminary data on reliability and item discrimination. As such, we pilot tested version 0.4 of PPLA-Q with a sample of 41 grade 10 students (down from an initial pool of 58 students who received the informed consent), from two classes, one in each of the aforementioned Lisbon schools (n school1 = 19, n school2 = 22): one with a higher socioeconomic status population, another with a lower socioeconomic status population, as attested in each school's pedagogical project. This sample was composed of 29 females (71%) and had an average age of 15 (0.4) years. All students provided an informed consent signed by themselves and their legal guardian.
PPLA-Q was self-administered, in pen and paper format, both in PE gym and classroom settings (to test the settings expected for future application), in the presence of the lead author. Students were instructed to raise any questions regarding the questionnaire's instructions, items, or rating scales. Application was timed to calculate average completion time; attrition rate was calculated as the percentage of students completing the study, among those who received the informed consent.

Preliminary item analysis
Psychological and social modules Given the novel status of any construct validation under the APLF model, as well as the complexity and high number of constructs under analysis, we chose to conduct preliminary item analysis using the partial least squares structural equation modelling (PLS-SEM) framework [93]. No a priori power analysis was conducted, since our goal was to gather only rough insights into the statistical behavior of the measurement model. Despite this, our sample size approximated the rule of thumb of 10 times the maximum number of indicators per construct [44].
Prior to calculation, data were scanned for suspicious response patterns, and items P1 to P5 were reverse-scored, since they refer to controlled motivation and were thus expected to load negatively on the second-order motivation construct. Missing data were below the 5% threshold for every indicator (i.e., item), under which circumstances PLS-SEM is robust [44]. SmartPLS 3.2 [94] was used to calculate Cronbach's α, composite reliability and outer loadings (factor weighting scheme, with 300 iterations and a stop criterion of 1 × 10⁻⁷) using a Hierarchical Component Model (reflective-formative) for each of the modules, with the repeated-indicator approach [44,95].
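The data preparation steps above (reverse scoring, screening for missing data and suspicious response patterns) might be sketched as follows. The 1 to 5 rating scale, the example responses and all function names are assumptions made for illustration; they are not taken from the PPLA-Q itself.

```python
def reverse_score(value, scale_min=1, scale_max=5):
    """Reverse a Likert response; a missing response (None) stays missing.

    The 1-5 range is an assumed scale, not the one used in the PPLA-Q.
    """
    if value is None:
        return None
    return scale_min + scale_max - value


def missing_rate(responses):
    """Proportion of missing (None) responses for one indicator."""
    return sum(r is None for r in responses) / len(responses)


def is_straight_lined(answers):
    """True when a respondent gave the same answer to every answered item."""
    answered = [a for a in answers if a is not None]
    return len(answered) > 1 and len(set(answered)) == 1


# Hypothetical responses: one list per item, one entry per student.
data = {
    "P1": [1, 2, None, 4, 5],
    "P6": [5, 4, 4, 2, 1],
}
REVERSED_ITEMS = {"P1"}  # the text reverse-scores items P1 to P5

scored = {
    item: [reverse_score(v) for v in values] if item in REVERSED_ITEMS else values
    for item, values in data.items()
}
# Flag indicators that reach the 5% missing-data threshold mentioned in the text.
flagged = [item for item, values in data.items() if missing_rate(values) >= 0.05]
print(scored["P1"], flagged)
```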
For interpretation, we followed Hair et al.'s [44] advice of using both α and composite reliability as lower-bound and upper-bound estimates of reliability, respectively. α was deemed acceptable at .70 [96,97], while composite reliability was deemed acceptable at .60 [44]. As for indicator reliability (outer loadings), values of .70 were deemed acceptable [44].
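For readers less familiar with these two estimates, the sketch below computes them from their standard formulas: α from item and total-score variances, and composite reliability from standardized outer loadings. It is an illustration with hypothetical data, not a reproduction of the SmartPLS computation.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of responses per item,
    with respondents in the same order in every list."""
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]
    item_variance = sum(variance(responses) for responses in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


def composite_reliability(loadings):
    """Composite reliability from standardized outer loadings."""
    squared_sum = sum(loadings) ** 2
    error = sum(1 - loading ** 2 for loading in loadings)
    return squared_sum / (squared_sum + error)


def is_acceptable(alpha, rho_c):
    """Thresholds stated in the text: alpha of .70 and composite reliability of .60."""
    return alpha >= 0.70 and rho_c >= 0.60


# Hypothetical subscale with three items answered by five students.
subscale = [[4, 5, 3, 2, 4], [4, 4, 3, 2, 5], [5, 5, 2, 3, 4]]
alpha = cronbach_alpha(subscale)
rho_c = composite_reliability([0.82, 0.75, 0.70])  # hypothetical loadings
print(round(alpha, 2), round(rho_c, 2), is_acceptable(alpha, rho_c))
```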
Cognitive module In order to gather preliminary evidence on construct validity for items in the cognitive module, we analyzed each item's difficulty index and discrimination index, and performed a distractor analysis [75,89] under the Classical Test Theory framework.
We had missing data for one student who did not complete this module. Items were scored using the CTT package [98] in RStudio [83] with R version 4.0.2 [84]; we used dichotomous scoring (i.e., 0 for incorrect and 1 for correct answers), and multiple-selection items were considered correct if all correct options were selected. Difficulty and discrimination (gULI) indexes, as well as the distractor analysis (proportion of responses in each distractor), were calculated with the shinyItemAnalysis package [99].
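A simplified illustration of the scoring rule and of the two indices follows. It is a sketch in Python rather than the R packages used in the study; it assumes that a multiple-selection item is correct only when the selected options match the key exactly, and it approximates the upper-lower index by comparing the top and bottom thirds of students ranked by total score. All names and data are hypothetical.

```python
def score_item(selected, key):
    """1 when the selected options match the key exactly, else 0.
    Treating extra selected options as incorrect is an assumption."""
    return int(set(selected) == set(key))


def difficulty(scores, item):
    """Proportion of students answering the item correctly (higher = easier)."""
    return sum(row[item] for row in scores) / len(scores)


def upper_lower_index(scores, item, groups=3):
    """Difference in proportion correct between the highest- and lowest-scoring
    groups. Ties in total score are broken by input order (simplification)."""
    size = len(scores) // groups
    if size == 0:
        raise ValueError("not enough students to form groups")
    ranked = sorted(scores, key=sum)
    lower, upper = ranked[:size], ranked[-size:]
    p_lower = sum(row[item] for row in lower) / size
    p_upper = sum(row[item] for row in upper) / size
    return p_upper - p_lower


# Hypothetical 0/1 score matrix: one row per student, one column per item.
matrix = [[0, 0], [0, 0], [1, 0], [0, 1], [1, 1], [1, 1]]
print(difficulty(matrix, 0), upper_lower_index(matrix, 0))
print(score_item({"A", "C"}, {"A", "C"}), score_item({"A"}, {"A", "C"}))
```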

Results
The following sections are organized chronologically, so as to provide the reader with a detailed view of the different development phases and refinements that the PPLA-Q went through. In the Discussion section, we summarize and discuss these results according to their overarching goal (e.g., content validity).

Domain identification

Psychological domain
Motivation Self-Determination Theory (SDT) [53] has been extensively researched in exercise and physical activity contexts [103]. One of its mini-theories, Organismic Integration Theory [104], posits a continuum of different behavioral regulations varying according to their degree of self-determination. Among these, external and introjected regulations are posited as more controlled (i.e., less autonomous) forms of extrinsic motivation, while identified, integrated and intrinsic regulations are posited as more autonomous forms of motivation. More autonomous forms have shown positive associations with increased participation in PA [105], and with positive experiences in PE [106]. We placed controlled forms of motivation in the foundational level, and more autonomous forms in the mastery level, following a two-factor structure proposed in previous research [107].
Confidence Multiple self-concept constructs in the literature center around the belief in one's abilities to perform in PA settings; of these, (perceived) competence and self-efficacy seem to be determinants of participation in PA in children and adolescents [105,108]. Although conceptualized under different frameworks -perceived competence in the SDT tradition (as a basic psychological need driving motivation), and self-efficacy as the main construct of Social-Cognitive Theory (SCT) [109] -studies have called for their integration, since they stem from the same concept of human agency [110], and might share a common core [111]. As such, we integrated perceived competence -given its centrality to SDT, and similarity to task self-efficacy -in the foundation level, and barrier self-efficacy [112] (i.e., belief in one's ability under challenging conditions) in the mastery level.
Emotional regulation Self-regulation is a broad concept that entails the individual's capacity to override and alter their behavior towards a standard or goal [113].
When referring to the affective domain, the construct of Emotional Intelligence (i.e., ability to perceive and regulate emotion) [58,114] has gained visible traction in research. It has been linked to PA participation, both as an outcome and as a predictor [115]. Among its many conceptualizations, we chose to adapt the factorial approach of Wong and Law's Emotional Intelligence Scale [59], mapping emotional evaluation (own and interpersonal) to the foundation level, and use and regulation of emotions to the mastery level.
Physical regulation Although we failed to identify a PA-specific construct that dealt with the APLF's idea of regulating physiological signals and effort during PA (analogous to emotional regulation), we found it related to other affective constructs such as activity pacing (i.e., regulation of activity level towards an adaptive goal) [116] and coping (i.e., behavioral and cognitive efforts to manage internal and external demands during stressful situations) [117]. The latter has been researched mainly in performance-oriented settings, and has shown a positive association with sport commitment in adolescents [118]. As such, we integrated this concept in a structure identical to that of Emotional Regulation: perception of changes in the body during exercise in the foundational level, and regulation of effort in the mastery level.

Social domain
Culture & Society The Culture & Society element is defined in the APLF as the appreciation of values present within communities of PA practice; however, we argue that its operationalization deals with cultural tolerance and cultural intelligence [119], rather than with the specific participation in and appreciation of the cultural phenomenon of sport and PA. As such, we based this construct on Siedentop's call for symbolic attributes like values, rituals and traditions to be an integral part of PL [60]. This ritualistic facet manifests through the use of specific attire, jargon, and participation in select behaviors and habits [120], as well as through displays of fandom and sport fan passion [121]. All these further contribute to feelings of affiliation and membership in a collective identity [122]; and although literature linking these phenomena to participation in PA is sparse, it is plausible that they might play a mediator role in increasing perceived relatedness [123] and emotional regulation, particularly in anxiety-inducing settings [124]. We chose to map participation in cultural behaviors to the foundational level, while the mastery level represents a more involved stance in these (i.e., valuing and encouraging participation).
Ethics Fair play is an integral part of modern sport as its major ethical system, coherent with universal values [125,126]. PE plays a critical role in teaching this "inner morality of sport", which surpasses simple adherence to rules, and includes following unwritten rules and moral codes [126]. Internalization of these moral codes is concomitant with mature stages of moral development, which are known antecedents of prosocial behavior [61] (i.e., acts involving care for the welfare of others) [127], and might also increase intrinsic motivation in PA settings [128,129]. We chose to use Gibbs' [61] model of moral development which, based on Kohlberg's work [62], identifies two main levels in standard moral development: immature (i.e., a pragmatic, instrumental sense of morality, mapped to the foundational levels) and mature (i.e., based on social values and empathy, mapped to the mastery levels).
Collaboration Personal and social responsibility are the main focus of Hellison's [64] Teaching Personal and Social Responsibility (TPSR) model for developing prosocial behavior, providing a way to address the holistic development of students in PE, and to equip them with life skills for active citizenship through five levels: (1) Respect for the rights and feelings of others, (2) Effort and cooperation, (3) Self-direction, (4) Caring and helping others, (5) Transfer outside the gym. Evidence shows an association of its application with many positive emotional, psychological, and social outcomes (e.g., self-efficacy, self-regulation, caring, conflict resolution) [130]. It is also suggested that students' level of personal and social responsibility is associated with intrinsic motivation in PE [65]. To avoid overlap between personal responsibility and other elements tapping into similar concepts (i.e., Ethics, Emotional and Physical regulation), we mapped TPSR's "Respect" level into the foundational level, and "Caring and Helping" into the mastery one, based on the work of Li et al. [65].
Relationships Relatedness (i.e., perceived connection with others) is another of the basic psychological needs posited to drive motivation according to SDT. Despite its theoretical relevance, evidence has shown little to no direct association between relatedness and participation in PA, in both general [103,105] and PE contexts [131]. However, some authors [103,132] suggest that this might be due to relatedness being highly context-dependent (i.e., affected by prevalence of solitary exercise, or lack of connection with classmates), and thus not captured in its entirety in the researched contexts. This idea is further reinforced by evidence of peer support being associated with PA practice [133] and positive outcomes in PE [106], and acting as a mediator of other relevant outcomes such as effort [134] and enjoyment [132]. In our model, akin to Collaboration, we mapped a reactive role in relationships to the foundational level, while the mastery level presupposes an active role in relationship development.

Cognitive domain
Content knowledge Few studies have examined the relationship between knowledge regarding PA, and outcomes in PE contexts (either affective, social, or behavioral). However, there is evidence of positive association of knowledge of PA guidelines [66] and health benefits, both with PA participation in young adults [135,136], and physical fitness [137]. Similarly, awareness of health risks related to inactivity might predict PA participation in adults [138] and adolescents [139]. A consensus among aforementioned studies seems to be that knowledge of these contents is consistently low, with similar evidence in Portugal: both in PE setting [140] and in young adults [141].

Content validity

Version one (v0.1): cognitive interviews
All students (n = 4) described the questionnaire as having an adequate layout and length, as well as clear directions for filling it in. Their understanding of item stems and rating scales in the psychological and social modules matched our intention, with equivalent conceptual distance between rating scale options. The response options in the cognitive module were deemed intuitive, given students' familiarity with multiple-choice items. Item content was mostly clear for all students, with some difficulties arising in discerning the meaning of many items in the Culture & Society scale; they suggested adding examples to clarify concepts like "cultural diversity" and "traditional physical activities".
We found a quality issue with the cognitive module item C6 (i.e., doping's impact on health and fair play): during think-aloud responses, it became evident that students could extrapolate the correct response without pertinent knowledge, due to the implausibility of the distractors. Based on students' comments, changes were made to the questionnaire: we added examples for the mentioned concepts and improved the plausibility of C6's distractors.

Version two (v0.2): expert evaluation -1st round
To quantitatively assess the relevance and clarity of each item, a panel of subject matter experts was asked to rate each item on a 4-point Likert-type scale. Ten experts in total participated in this round; of these, 6, 5 and 4 experts rated the cognitive, psychological, and social modules, respectively. Based on their ratings, CVI (I-CVI, S-CVI/Ave, and S-CVI/UA) and κ were calculated.
Item relevance Item CVI ranged from .33 to 1 (cf. Additional file 2): 1 item had a CVI of .33, 3 items had a CVI of .5, 10 items had a CVI of .75, 6 items had a CVI of .8 and the remaining 70 items a CVI of 1.
κ ranged from .13 to 1, with 86 items (96%) considered either excellent (76 items) or good (10 items) (Table 3); four items were flagged for elimination: one in the cognitive module (C2, Nutrition) and three in the social module (S3 in the Culture & Society scale, and S30 and S34 in the Relationships scale) (Table 3).
Scale relevance The psychological and social modules showed adequate content validity, with a S-CVI/Ave of .98 and .90 respectively [89]; while the cognitive module failed to reach the proposed adequacy threshold of .90, with an S-CVI/Ave of .87.
Questionnaire refinement Based on both qualitative and quantitative evidence from experts, two items were eliminated from the Relationships scale. This evidence also prompted a major revision of the Culture & Society scale to increase the S-CVI/Ave and S-CVI/UA of the social module to acceptable levels, informed by consultation with a subject matter expert and one of the APLF's authors.
The cognitive module was restructured, as most experts commented on quality issues regarding (1) implausibility of distractors, (2) syntax and (3) structure. None of the items were eliminated, as this would compromise content representation and the two-level framework of the module. Although the module did not reach the desired thresholds for S-CVI (Ave and UA), we chose not to submit it to a formal second round of expert evaluation, given that all κ values (relevance) were excellent (> .74), save for item C2. Instead, we consulted with an assessment expert to restructure item C2 and improve the clarity of items C6 and C8, with no changes content-wise.

Version three (v0.3): 2nd round results for Culture & Society element
We asked 3 experts to participate in a second round of evaluation of the Culture & Society scale, given the depth of its restructuring. The same procedures and calculations from the 1st round were applied. All items in the revised Culture & Society scale obtained an I-CVI and κ of 1, indicating absolute agreement on item relevance (Additional file 2). As such, the S-CVI/UA of the social module increased to .84, entering an acceptable range [81].
Proportion of agreement on clarity ranged from .33 to 1 (Additional file 2): 1 item with .33, 4 items with .67 and the remaining 4 with 1; κ ranged from −.07 to 1, with 5 items considered clear and 5 prompted for revision.
Questionnaire refinement Five items in the Culture & Society scale with clarity κ lower than .74 were revised (S3 to S5, S7 and S9), and S6 was eliminated, since experts' comments pointed to it being more representative of general cultural tolerance than of adherence to sport's culture.

Version four (v0.4): pilot testing & cognitive interviews

Feasibility
Of the 58 students who received the informed consent form, 41 completed the PPLA-Q, resulting in an attrition rate of about 30%. These 41 students (71% female) were in grade 10 at two different schools, in two different majors (19 students from a Science, Technology, Engineering and Math course, 22 from a Humanistic course), with a mean age of 15 (0.4) years.
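As a quick arithmetic check of the reported figure, attrition can be computed as the share of consented students who did not complete the questionnaire. The function name below is hypothetical.

```python
def attrition_rate(consented, completed):
    """Share of students who received the consent form but did not complete the study."""
    return (consented - completed) / consented


# Figures reported in the text: 58 students received the form, 41 completed.
print(f"{attrition_rate(58, 41):.1%}")
```

With these figures, 17 of 58 students did not complete the questionnaire, which rounds to the "about 30%" stated above.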
Completion time was gathered to assess the questionnaire's feasibility during PE classes. Average completion time was 27 (7) minutes (n = 34, with the remaining 7 students failing to record their start and end times). Questionnaire application in the gym allowed for ample space between students, which restricted talking; however, application in a crowded classroom led students to share ideas about the items and their correct option(s) (in the cognitive module). No response errors or suspicious response patterns (e.g., straight or diagonal lining, or alternating poles) were identified in the responses [44].

Preliminary reliability (psychological and social modules)
Preliminary reliability for each subscale, as well as each item's outer loading (indicator reliability) on its intended construct are summarized in Table 4.
Ten of the subscales (63%) attained acceptable reliability according to both α and composite reliability (≥ .70 and ≥ .60, respectively). Of the remaining 6, 2 subscales attained acceptable values only in the upper-bound estimate (i.e., composite reliability), and the Ethics element had the lowest reliability on its two subscales. We noticed a departure from the expected behavior of α and composite reliability (i.e., α being lower than composite reliability) in the Motivation foundation and Physical Regulation foundation subscales.
We found that 42 items (56%) had acceptable individual item reliability (outer loading > .70). Eleven items had unexpected negative loadings, as they were intended to relate positively with their constructs; these were, however, mostly negatively worded items found in the Motivation, Physical Regulation and Ethics foundation-level subscales.
Table 5 summarizes the preliminary item analysis of the cognitive module of the PPLA-Q. We found a mismatch between the intended complexity of the item and its difficulty in 2 of the 5 content groups (i.e., foundational items being answered incorrectly more often than mastery items for the same content), as well as an overall low success rate in foundational items. Additionally, average difficulty of the items in the module was .50, representing a more difficult test than ideal for maximizing discrimination (.70 to .74 for a test with four-, five- and six-option multiple-choice items [101]). Notwithstanding, 6 items showed good or very good discrimination between lower-knowledge and higher-knowledge students (D > .30). Distractor analysis revealed that 16 distractors (57%) were low-functioning (i.e., ≤ 10% of total responses for the item); these were mostly in easier items.
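The distractor criterion mentioned above (a distractor chosen in 10% or fewer of the responses to its item) can be illustrated with a short sketch. The option labels, responses and function names are hypothetical, and the example covers single-selection items only.

```python
from collections import Counter


def distractor_proportions(responses, options, key):
    """Proportion of responses falling on each incorrect option of one item."""
    counts = Counter(responses)
    return {opt: counts[opt] / len(responses) for opt in options if opt != key}


def low_functioning(responses, options, key, threshold=0.10):
    """Distractors selected in `threshold` or less of the responses."""
    proportions = distractor_proportions(responses, options, key)
    return [opt for opt, p in proportions.items() if p <= threshold]


# Hypothetical item with key "A", answered by ten students.
answers = ["A"] * 6 + ["B"] * 3 + ["C"]
print(distractor_proportions(answers, "ABCD", "A"))
print(low_functioning(answers, "ABCD", "A"))
```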

Cognitive interviews -2nd round
Further individual cognitive interviews (n = 2) were conducted to probe students' understanding of changes made to the items in the last 3 versions of the PPLA-Q, as well as of items which raised frequent requests for clarification during pilot application. Interviewed students showed good comprehension of the items. Additionally, a minor change was suggested in one of the distractors of the item pertaining to basic safety procedures during PA (C5): replacing "Always drink water" with "Drink water regularly".

Questionnaire refinement
Results of preliminary reliability analysis prompted a detailed analysis of every item and subscale in the psychological and social modules. Based on this, negatively stated items were changed into positively stated ones to improve comprehension, and subsequently, validity and reliability. Minor changes were made to item stems as well, to improve clarity. Additionally, 11 global assessment items (e.g., "I'm motivated to practice PA") were introduced into the psychological and social modules to allow for convergent validity assessment through redundancy analysis of the second, third and fourth-order formative constructs [44,142] in further stages of PPLA-Q development. Of these, 8 targeted each of the elements, 2 targeted the general psychological and social domains, and 1 targeted general PL. Their content followed the respective operational definition stated by the APLF (see Table 1), while adhering to the same structure and rating scale as the remaining items.
Informed by the results of the preliminary item analysis, items in the cognitive module were revised to better conform to the expected difficulty levels (i.e., mastery items harder than foundation ones). We revised low functioning distractors, to make them more plausible to students. C6 was modified from a single selection multiple-choice to cloze-type item, with no changes to intended outcome. All revisions in this module were made in consultation with a subject matter expert to ensure technical adequacy and content validity.
Before the next iteration of cognitive interviews, all items were co-validated by non-generating authors to guarantee that clarity was improved, and content validity was left unchanged.

Version five (v0.5): cognitive interviews
To assess the clarity of items changed between version four (v0.4) and five (v0.5) of the PPLA-Q, 6 students were interviewed. Most Likert-type items were clear and consistent with their intended meaning, except those regarding justice (e.g., "I try to be just"), which led to interpretations related to collaboration and teamwork, instead of the intended meaning regarding ethics/fair play. In the revised cloze-type item in the cognitive module (C6), one of the students failed to respond to the item according to instructions (i.e., filled in the spaces, instead of circling the options that would fill each space), revealing a need to clarify its instructions. Based on these data, we fine-tuned all pertaining items. Similarly, informed by the pilot test, we created two different versions of the cognitive module by mirroring the arrangement of options (in the second version, A became D and B became C), hoping to discourage students from sharing their answers during application and to reduce subsequent measurement error.
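The mirroring of options can be sketched as follows for a four-option item. The example also remaps the answer key so that both versions are scored consistently; the function name, option texts and key are hypothetical.

```python
def mirror_item(options, key):
    """Reverse the order of an item's options and remap its answer key.

    `options` lists the option texts in order (A, B, C, ...);
    `key` is the set of correct option letters in the original version.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]
    mapping = dict(zip(letters, reversed(letters)))  # A<->D, B<->C for four options
    mirrored_options = options[::-1]
    mirrored_key = {mapping[letter] for letter in key}
    return mirrored_options, mirrored_key


# Hypothetical item whose correct option is A in the first version.
options_v1 = ["option w", "option x", "option y", "option z"]
options_v2, key_v2 = mirror_item(options_v1, {"A"})
print(options_v2, key_v2)
```

Keeping the key remapping next to the option reversal avoids maintaining two separate answer keys by hand.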

Discussion
This article followed the development, content validation, and pilot testing of the first of two instruments that comprise the Portuguese Physical Literacy Assessment (PPLA): the PPLA-Questionnaire (PPLA-Q), which assesses the psychological, social and part of the cognitive domains of PL, inspired by the Australian Physical Literacy Framework (APLF) and the Portuguese PE syllabus (PPES). Its primary target population is high-school students (grade 10 through 12) in the PE context. Older adolescents are a critical intervention group, especially in Portugal, given that they have lower PA levels than their younger peers [33][34][35] and will cease to have mandatory and free access to professional guidance in PA and movement, eventually becoming dependent on their PL to participate in meaningful PA and further advance on their journey.

Content validity
We gathered evidence on content validity using an iterative process with experts in each subject matter domain (i.e., for each of the modules) and with the target population. The number of experts per module was considered acceptable and ranged from 4 to 6. Although literature recommends 5 to 7 experts to rate content validity [47], a minimum of 3 is acceptable for content areas in which expert recruitment might prove difficult [80], as we argue was the case in this study, given constraints imposed by the COVID-19 pandemic. The PPLA-Q showed evidence for adequate content validity at item level, which improved throughout multiple revisions. In version 0.2, using κ, 96% of the items were rated as good or excellent (κ ≥ .60) [87,88] regarding relevance, and 84% were considered clear (κ > .74). Module-wise, an S-CVI/Ave of .90 is considered adequate [89], a cut-off that decreases to .80 for S-CVI/UA, given that it requires universal agreement between all raters [81]. While the psychological module attained an adequate S-CVI on both accounts (.98/.93), the social module did so only on S-CVI/Ave (.90/.68), and the cognitive module failed to achieve both standards (.87/.60); further analysis identified that most items with lower I-CVI in the two latter modules were those generated without a conceptual reference to an existing instrument (i.e., Culture & Society). Qualitative suggestions from the experts and advisors augmented quantitative data, targeting concepts in need of rewording or clarification. We then revised and eliminated items to improve content validity across all modules. A targeted revision of the Culture & Society scale increased the overall social module's S-CVI to .90/.84 (Ave/UA) in a second round of expert evaluation aimed solely at it (version 0.3).
Multiple rounds of qualitative cognitive interviews were conducted until saturation was achieved (i.e., no new suggestions emerged) [143] with a heterogeneous sample of high-school students (n = 12), using different versions of the PPLA-Q. These informed improvements in item wording and syntax, to effectively target the intended concepts and reduce ambiguity. During initial stages, students noted a lack of clarity in abstract concepts like those from the Culture & Society (values, rituals, and traditions of sport/PA) and Ethics (justice, honesty, fair play) scales; notwithstanding evidence that iterative revisions clarified these items, further validation efforts should scrutinize their performance. Similarly, despite obtaining evidence for item-level content validity (except for item C2), and subsequent reviews in consultation with a test and assessment expert based on the qualitative comments of experts and students, we advise further quantitative scrutiny of the cognitive module to establish its module-wise content validity.

Feasibility
Average completion time for the PPLA-Q was 27 min. Although this might impose a substantial burden upon respondents, the diversity of constructs and items used throughout the different modules might have effectively reduced it. Notwithstanding, refinement of the subscales in the modules during the next steps in development is expected to reduce this time and further improve feasibility.
We had no response errors and low levels of missing data during pilot testing, which might stem from students' routine exposure to questionnaires using the same item formats (i.e., multiple-choice items and Likert-type scales). We also argue that application of the questionnaire during PE class, with the lead investigator present to clarify any question, might have played a determinant role in this. In one of the application settings (i.e., the classroom), students' urge to copy or share their answers with colleagues was noticeable, especially in the cognitive module. The similarity of this module to the usual summative evaluation instruments used in school settings might partially explain this occurrence; nonetheless, we expect that future use of the two differently arranged versions of the cognitive module (i.e., mirrored distractors) might reduce this.
We experienced a high rate of attrition (≈30%). Constraints imposed by the COVID-19 pandemic might have reduced the number of students completing the questionnaire: both by reducing their willingness to participate, as well as the possibility to be present during application (due to prophylactic lockdown). This number shall inform the sample size calculations in further phases of development, as it is expected that these conditions might endure during next phases.

Preliminary reliability and item analysis
Results of reliability analysis in the psychological and social modules established preliminary evidence of adequate reliability in 10 out of 16 subscales (α > .70 and composite reliability > .60) [97]. Analysis of item reliability highlighted items that were contributing negatively to subscale reliability (outer loading <.70) of the remaining 6 subscales: Upon careful inspection, most of these were negatively worded. Although the use of negative wording might filter out unwarranted responding patterns (e.g. acquiescence), they have the potential to confuse students and compromise validity and reliability [67] by, for example, creating an artificial subconstruct within the intended subscale [144]. As such, these items were altered and then tested for comprehension during subsequent cognitive interviews, with positive results. Further reliability testing is warranted with a bigger sample size, to gather more definite evidence on the reliability of these subscales.
Regarding item analysis of the cognitive module, item difficulty ranged from .10 (very hard) to .95 (very easy), with an average difficulty of .50. Initial evaluation of its 10 items identified 6 good or very good discriminating items (D > .30) [100,145] (i.e., capable of differentiating knowledge levels among students).
We expected items designed for the mastery level (i.e., pertinent to deeper learning) to be more difficult than those in the foundation level, within the same content; however, pilot data do not fully support this idea, as foundational items were more difficult than their mastery counterparts in 2 content pairs (C5 & C6, C7 & C8). We identified low-functioning (non-plausible) distractors in the mastery-level items C6 and C8, which increased the likelihood of a correct answer even without full knowledge of the content. Conversely, C5 and C7 (foundation) had characteristics which inflated their difficulty: one of the intended "correct" options of C5 (a multiple-selection item about safety during PA) contained absolute language ("[one should] hydrate during all the duration of the activity"), steering respondents away from it; while C7 measured factual knowledge of the recommendations for PA in children and adults, which has been previously shown to be low among adolescents [140] and young adults [136,141]. A similar phenomenon emerged with C9, which asked respondents to select the Body Mass Index calculation formula: although students might be familiar with the concept, they might not recall its formula. Informed by these data, distractors were thoroughly revised.
We acknowledge that, although the methods used here to preliminarily assess the quality of the items followed the Classical Test Theory framework, Item Response Theory and Rasch models might play a role in further validation efforts, since they expressly integrate the notion of item difficulty (as well as other possible parameters like discrimination and guessing) into the calculation of students' scores [43]; this would allow precise student scoring along the learning continuum posited in the development of the PPLA. These were not used in this pilot study, given their requirement of larger sample sizes [146].
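For reference, the role of these parameters can be seen in the standard forms of the models. In the Rasch model, the probability that student j answers item i correctly depends only on the student's ability θ_j and the item's difficulty b_i; the three-parameter logistic model adds a discrimination parameter a_i and a guessing parameter c_i. These are the textbook formulations, shown here for illustration and not fitted to the pilot data.

\[
P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}
\]

\[
P(X_{ij} = 1 \mid \theta_j, a_i, b_i, c_i) = c_i + (1 - c_i)\,\frac{\exp\bigl(a_i(\theta_j - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_j - b_i)\bigr)}
\]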
The PPLA as a whole is intended to assess the integrated physical, cognitive, psychological, and social variables that are posited to underpin PL, both to direct pedagogical action at local, regional and national levels in providing a PL-supporting environment, and to inform self-directed changes by the students. Even though it pertains to attitudes, skills and knowledge applied in general PA settings, further adaptation is warranted if it is to be applied to younger students and/or outside of PE. Moreover, we argue that although culture might play a defining role in the representation of PL, as stated by Whitehead [4], and although the PPLA-Q was designed with this peculiarity in mind, most of its indicators (i.e., items) might be easily adapted to other cultural contexts.

Strengths and limitations
To our knowledge, this study is the first report of content validity for a measurement instrument of PL designed for grade 10 to 12 adolescents. The content of the tool was inspired by the APLF and the PPES, and informed by previous decisions of a consortium of experts during a European project (PhyLit). Its development used an iterative process of content validation, drawing on both subject matter experts in each knowledge domain (i.e., cognitive, psychological, and social) and the target population, resulting in many revisions to improve its clarity and validity.
Although great care was taken to create a heterogeneous sample for the cognitive interviews and pilot test, all participants were nonetheless drawn from a convenience sample from Lisbon's metropolitan area. Similarly, we could not reach our goal of 6 experts participating in every module. Arguably, the COVID-19 pandemic might have had an overarching effect on experts' availability to participate in the project and on students' participation rate, through the previously discussed constraints. However, we did not collect enough information to extrapolate specific causes for attrition, which could have provided additional insights to prepare future studies and further improve feasibility.
Given that only preliminary testing was done regarding reliability and construct validity, further work is warranted and is currently ongoing to establish evidence in this regard, with a statistically adequate sample size.
The PPLA inherits the complex nomological network of the APLF; as such, some theoretical constructs underwent adjustments in order to be fully integrated into the same model. Further robust construct validation therefore needs to ensure adequate dimensionality of each chosen construct, as well as the accuracy, validity, and practical usefulness of the learning continuum posited through the foundation and mastery levels. Further studies should also evaluate the PPLA-Q's integration with the PPLA-O (in development) to provide a holistic, integrated assessment, as warranted.
Similarly, this effort might allow for refinement of the instrument, contributing to a shorter, more parsimonious version and further improving its feasibility in PE contexts. As the PPLA-Q currently targets only older adolescents, future adaptation to younger age ranges might provide a clearer picture of PL development throughout the school-age years.

Conclusion
This study details the iterative development process of the PPLA-Q as an instrument to assess the psychological, social, and part of the cognitive domain of PL in grade 10 to 12 adolescents (15-18 years). It also provides evidence for adequate content validity at item level, and, except for the cognitive module, at module level. The instrument was improved through multiple rounds of expert and target-population consultation. It has also shown good feasibility within PE settings, and gathered preliminary evidence in favor of its reliability for application in older adolescents. Further validation efforts are needed to reinforce these conclusions, establish evidence of construct validity, and study the PPLA-Q's integration with the PPLA-O (an instrument in development to assess the remaining domains of PL) within the PPLA framework to provide feedback to support older adolescents in their PL journey.