Design and structure of Bull-M
The test was originally designed by four experts in social sciences and community health (ARJ, AWM, OEDV and RPHT), supported by seven teachers with over five years of experience in bullying (Bull) management at schools with a high incidence of violence and drug abuse in Ciudad Juarez (Chihuahua, Mexico). Focus groups and in-depth interviews were conducted to identify the most common Bull behaviors observed in or out of school. Ten questions (items) were included in the final test (Additional files 1 and 2): five (items 1 to 5) about twelve Bull behaviors (representations) occurring in or out of school, four (items 6 to 9) on the student’s and/or peers’ participation in bullying acts, and one (item 10) on several somatic consequences in bullying victims (“In the last four weeks, how often have you had a stomachache, headache, loss of appetite, or problems sleeping?”) [12, 26, 27]. To record the frequency of each situation experienced, or of the student’s involvement in it, a 5-point Likert scale was used: never, rarely, sometimes, often, and every day. For validation purposes only, this frequency was coded on a scale of 0–5. Lastly, although the test was designed and applied in Spanish (Additional file 1), an English version of Bull-M is also provided (Additional file 2).
A preliminary 9-item version of Bull-M, in which the items appeared in a different order from the final instrument, was administered to 20 students (13–15 years old) to verify that it was understandable and to estimate response time. Although all participants found it clear, some students appeared intimidated by the ordering of the items. It was therefore decided to begin with a different question (“How often do your classmates allow or invite you to participate in their games, school activities, or extracurricular activities?”) as the initial item (Additional files 1 and 2). Before the test was answered, the general instructions and the anonymous nature of Bull-M were explained to each student individually in order to reinforce their confidence. The test took 10–15 minutes and was administered by two collaborators (ARJ, OEDV) collectively within the classroom while the teacher was not present.
Survey
Bull-M was administered from February to May 2011 to 400 students (60% male; 13.4 ± 1.1 years) from three of the 25 high schools located on the outskirts of Ciudad Juarez, zones with a long history of violence and poverty (bullying elicitors). Although selected at random, the sample was not representative of the population of all high schools in Ciudad Juarez. Two hundred participants were further selected for a second application (test-retest reliability) 14 days after the first. Informed consent for participation was obtained from each parent, from school authorities, and from each participant. The protocol was approved by the Ethics Committee of the Autonomous University of Chihuahua (UACH).
Validation
Four criteria were chosen to validate Bull-M: a) content validity (CV; expert judges), b) reliability [Cronbach’s alpha (CA) and test-retest], c) construct validity [principal component analysis (PCA), confirmatory factor analysis (CFA), and goodness of fit (GF)], and d) convergent validity (Bull-M vs. Bull-S). Their characteristics, rationale, procedures, and the statistics used are described below:
Content validity (CV)
In the social sciences, CV (also known as logical validity) demonstrates the degree to which a written test fulfills its purpose. It is generally the first validation criterion applied because it examines the overall comprehensibility of the test, and it is usually performed by experts on the phenomenon under study (judge validation), who evaluate the design characteristics of the test as a whole and of each individual item. In this study, CV was assessed by twelve judges (RPHT, AWM, and ten others) using the Expert Judgment Validity Test (EJVT; Additional file 3). The EJVT comprises 19 items designed to evaluate the following characteristics of Bull-M: content, size, order, accuracy, and answering format. Each item was rated on a 3-point Likert scale scored as poorly (0), fairly (1), or sufficiently (2), and the average score (AS) for each item was calculated (mean ± SD). Any item with an AS < 1.5 was considered for redesign.
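The scoring rule above can be sketched in a few lines of Python; the judge ratings below are hypothetical illustrative data, not values from the study:

```python
from statistics import mean, stdev

# Hypothetical ratings from 12 judges for three EJVT items,
# coded poorly = 0, fairly = 1, sufficiently = 2.
ratings = {
    "item_01_content": [2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2],
    "item_02_size":    [1, 1, 2, 1, 0, 1, 1, 2, 1, 1, 0, 1],
    "item_03_order":   [2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2],
}

THRESHOLD = 1.5  # items with an average score below this are flagged for redesign

for item, scores in ratings.items():
    avg = mean(scores)
    sd = stdev(scores)
    flag = "consider redesign" if avg < THRESHOLD else "ok"
    print(f"{item}: AS = {avg:.2f} +/- {sd:.2f} -> {flag}")
```

With these hypothetical ratings, only `item_02_size` falls below the 1.5 cutoff and would be flagged.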
Reliability
This type of validation is required for adjusting a test and helps establish its most suitable format. Many statistical indicators are used for this purpose, but the two most common are Cronbach’s alpha, which evaluates the consistency of results across items within a test, and test-retest, which evaluates the degree to which test scores are consistent from one application to the next. Here, a Cronbach’s alpha of 0.70–0.80, 0.81–0.90, and >0.90 was considered acceptable, good, and excellent, respectively [28]. Test-retest reliability was evaluated in 200 students over a 14-day interval; agreement between item scores on the first (test) and second (retest) applications was evaluated with the Spearman correlation (rs).
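Both reliability indices can be computed directly from their definitions. The sketch below is illustrative only — the scores are hypothetical, not data from the study — and implements Cronbach’s alpha as α = k/(k−1)·(1 − Σσ²ᵢ/σ²ₜ) and the Spearman correlation as the Pearson correlation of average ranks:

```python
from statistics import pvariance, mean

def cronbach_alpha(items):
    """items: list of k lists, each holding one item's scores across respondents."""
    k = len(items)
    item_vars = sum(pvariance(col) for col in items)      # sum of item variances
    totals = [sum(scores) for scores in zip(*items)]      # per-respondent total score
    return k / (k - 1) * (1 - item_vars / pvariance(totals))

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of average ranks (handles ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1                                    # extend block of tied values
            avg_rank = (i + j) / 2 + 1                    # average rank, 1-based
            for t in range(i, j + 1):
                r[order[t]] = avg_rank
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical data: 3 items answered by 6 respondents (0-4 Likert codes).
items = [[0, 1, 2, 3, 4, 4],
         [1, 1, 2, 2, 4, 3],
         [0, 2, 1, 3, 3, 4]]
print(f"alpha = {cronbach_alpha(items):.2f}")

# Hypothetical test-retest scores for one item across 6 respondents.
test_scores = [0, 1, 2, 3, 4, 4]
retest_scores = [0, 1, 1, 3, 4, 4]
print(f"rs = {spearman_rho(test_scores, retest_scores):.2f}")
```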
Construct validity
Construct validity was assessed by principal component analysis (PCA) and confirmatory factor analysis (CFA), the latter through a structural equation model. PCA has two objectives: a) to reduce the number of items in a written test while retaining the variability of the data, and b) to identify hidden patterns (components or factors) and classify items according to their contribution to the final test score. PCA is generally followed by CFA, whose main objective is to test whether all items fit a hypothesized measurement model.
PCA and CFA were performed with samples of 200 and 198 participants, respectively. The factor structure was determined by Varimax rotation, retaining components with eigenvalues ≥ 1.0 (Kaiser-Guttman criterion) [29]. To ensure an adequate representation of the variables, only items whose communality (the proportion of their variance explained by the factors) was ≥ 0.45 were included. To evaluate sampling adequacy and the sphericity of the data, the Kaiser-Meyer-Olkin (KMO) [30] and Bartlett [31] tests were applied. Finally, the following goodness-of-fit indicators were calculated: root mean square error of approximation (RMSEA), goodness of fit index (GFI), adjusted goodness of fit index (AGFI), comparative fit index (CFI), and normed fit index (NFI). The structural equation model was analyzed with Amos 16.0 (Amos Development Corporation, USA), and all other analyses were performed with PASW Statistics 18.0.
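Two of the decision rules above can be sketched concretely: the Kaiser-Guttman retention rule and the communality cutoff, plus the standard RMSEA formula, RMSEA = √(max(χ² − df, 0) / (df·(n − 1))). All numbers below (eigenvalues, communalities, χ², df) are hypothetical placeholders, not results from the study:

```python
import math

# Hypothetical eigenvalues from a PCA of a 10-item correlation matrix
# (for a correlation matrix, the eigenvalues sum to the number of items).
eigenvalues = [3.8, 1.6, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.4, 0.3]

# Kaiser-Guttman criterion: retain components with eigenvalue >= 1.0.
retained = [ev for ev in eigenvalues if ev >= 1.0]
explained = sum(retained) / sum(eigenvalues)  # proportion of variance retained
print(f"components retained: {len(retained)} ({explained:.0%} of variance)")

# Communality filter: keep only items whose variance is adequately
# represented by the retained factors (h^2 >= 0.45). Values are hypothetical.
communalities = {"item_03": 0.62, "item_07": 0.41, "item_09": 0.55}
kept = [item for item, h2 in communalities.items() if h2 >= 0.45]
print(f"items kept: {kept}")

# RMSEA from a CFA chi-square statistic (hypothetical chi2 and df; n = 198
# matches the CFA subsample size mentioned in the text).
def rmsea(chi2, df, n):
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

print(f"RMSEA = {rmsea(chi2=61.0, df=34, n=198):.3f}")
```

In practice the rotation and model fitting were done with PASW and Amos as stated above; the sketch only makes the retention and fit formulas explicit.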
Convergent validity
Convergent validity refers to the degree to which two measures of constructs that theoretically should be related (e.g., two tests) are in fact related. Here, 100 participants from the second application (retest) of Bull-M were also invited to answer Bull-S. Bull-S is mainly used to detect Bull roles (victim, aggressor, and victim-aggressor) [9, 26]; however, one of its items (item #13) addresses the frequency with which Bull acts occur (“How often do the aggressions occur?”) on a 4-point Likert scale (every day, twice a week, rarely, and never) [32]. Pooled frequencies (%) from the “bullying other” subscale (items 6–9) of Bull-M were then compared with those from item #13 of Bull-S, after aligning the two frequency scales [sometimes + often (Bull-M) = twice a week (Bull-S)].
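The scale alignment described above amounts to a category merge before comparing percentages. The sketch below uses hypothetical response counts (not data from the study); only the merging rule — sometimes + often on Bull-M pooled as “twice a week” on Bull-S — comes from the text:

```python
from collections import Counter

# Hypothetical response counts on the Bull-M "bullying other" subscale
# (5-point scale), for 400 responses in total.
bull_m = Counter({"never": 260, "rarely": 70, "sometimes": 40,
                  "often": 20, "every day": 10})

# Map Bull-M categories onto the 4-point Bull-S scale:
# "sometimes" and "often" are pooled as "twice a week".
MERGE = {
    "never": "never",
    "rarely": "rarely",
    "sometimes": "twice a week",
    "often": "twice a week",
    "every day": "every day",
}

aligned = Counter()
for category, count in bull_m.items():
    aligned[MERGE[category]] += count

total = sum(aligned.values())
for category in ("never", "rarely", "twice a week", "every day"):
    print(f"{category}: {aligned[category] / total:.1%}")
```

The resulting 4-category percentages can then be compared directly with the distribution of answers to Bull-S item #13.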