Simple and rationale-providing SMS reminders to promote accelerometer use: a within-trial randomised trial comparing persuasive messages

Background
Literature on persuasion suggests compliance increases when requests are accompanied by a reason (i.e. the “because-heuristic”). The reliability of outcomes in physical activity research is dependent on sufficient accelerometer wear-time. This study tested whether SMS reminders, especially those that provided a rationale, are associated with increased accelerometer wear-time.
Methods
We conducted a within-trial partially randomised controlled trial during baseline data collection in a school-based physical activity intervention trial. Of 375 participants (mean age = 18.1), 280 (75%) opted to receive daily SMS reminders to wear their accelerometers. These 280 participants were then randomised to receive either succinct reminders or reminders including a rationale. Data was analyzed across groups using both frequentist and Bayesian methods.
Results
No differences in total accelerometer wear minutes were detected between the succinct reminder group (Mdn = 4909, IQR = 3429–5857) and the rationale group (Mdn = 4808, IQR = 3571–5743); W = 8860, p = 0.65, CI95 = [−280.90, 447.20]. Similarly, we found no differences in wear time between participants receiving SMS reminders (Mdn = 4859, IQR = 3527–5808) and those not receiving them (Mdn = 5067, IQR = 3201–5885); W = 10,642.5, p = 0.77, CI95 = [−424.20, 305.30]. Bayesian ANOVA favored a model of equal wear-time means, over one of unequal means, by a Bayes Factor of 12.05. Accumulated days of valid accelerometer wear data did not differ either. Equivalence testing indicated rejection of effects more extreme than a Cohen’s d (standardised mean difference) of approximately ±0.3.
Conclusions
This study casts doubt on the effectiveness of using the because-heuristic via SMS messaging to promote accelerometer wear time among youth. The because-heuristic might be limited to face-to-face communication and situations where no intention for or commitment to the behavior has yet been made. Other explanations for null effects include non-reading of messages, and reminder messages undermining the self-reminding strategies which would occur naturally in the absence of reminders.
Trial registration
DRKS, DRKS00007721. Registered 14.04.2015. Retrospectively registered.
Electronic supplementary material
The online version of this article (10.1186/s12889-018-6121-2) contains supplementary material, which is available to authorized users.


Compliance with accelerometer wear instructions
Reliable and valid assessment is necessary when evaluating whether public health policies or interventions change physical activity (PA) levels in the target group. Little consensus exists about what to measure, when, with what and for how long in PA research [1,2]. While an inability of individuals to accurately remember their past PA and social desirability are clear problems with self-reported PA measures [3], objective measurements of PA (e.g. pedometers and accelerometers) have issues too. Zhuang et al. [4] found that missing accelerometry data was more common in 15- to 17-year-olds than among younger participants, especially during weekends (Sundays in particular), with missing data occurring increasingly from the first recording day to the last. This exemplifies a key issue in measurement: the proportion of an individual's day or week captured by the measure. An extreme example would be an individual who only wears the measurement device when undertaking PA. Thus, some guidelines suggest that a person should wear an accelerometer for a minimum of 10 h daily for at least 4 days in a 7-day measurement period in order to obtain an accurate reading of PA [1,2]. Participants' compliance with instructions on wearing the accelerometer is clearly very important in obtaining accurate PA measurements [5].
Research on enhancing accelerometer instruction compliance rates is rare [2,6], particularly among older adolescents. One strategy has been monetary incentives contingent on proper wear-time [7]. Sallis et al. [8] used an alternative strategy, asking participants to re-wear the accelerometer if they had not worn it for at least 5 valid days (> 10 valid hours of data) or a minimum of 66 valid hours across 7 days.
Barak et al. [9] suggest that new opportunities to promote compliance, such as text messaging (SMS; Short Messaging Service), may be more reliable and effective than traditional methods, such as written or verbal wear instructions by the investigator. Zhuang et al. [4], too, recommend SMS reminders. Toftager et al. [10] used SMS reminders to increase compliance but did not report effects or acceptability. In a self-selected Irish sample of adolescents [11], daily SMS reminders were associated with putting on the accelerometer in the morning, but not with increased overall compliance (defined as valid days of data or minutes of non-wear). The study did not report levels of wear or effects of the reminders. The discrepancy between remembering to put on the device and actually wearing it for a sufficient amount of time indicates that these may be separate behaviors.

Compliance and the 'because-heuristic'
Since the classic "Xerox machine study" by Langer, Blank and Chanowitz [12], providing reasons for compliance has been discussed in the social influence literature. The study indicated that placebic or pseudo-reasons [13] ("Excuse me, I have 5 pages. May I use the xerox machine, because I have to make copies?"; 93% compliance) could result in similar compliance rates as actual reasons ("[…] because I'm in a rush?"; 94% compliance) compared to the request only condition ("Excuse me, I have 5 pages. May I use the xerox machine"; 60% compliance). Pratkanis (2007) identified "placebic reasons" in his index of social influence tactics, but called for further research into the subject. Less careful are Cialdini, Goldstein and Martin [14], who tout the "unique motivational influence of the word because", basing their claims on the importance of reasoning in social influence. To this day, the xerox machine study remains cited in the press as an example of the power of the word 'because' [15][16][17][18].
"A well-known principle of human behavior says that when we ask someone to do us a favor we will be more successful if we provide a reason. People simply like to have reasons for what they do." [19]
Following the terminology used by Key, Edlund, Sagarin and Bizer [20], the phenomenon of increased compliance by providing reasons is referred to as "the because-heuristic." Let us accordingly define the naïve because-heuristic as "reasons increase compliance." In the Langer, Blank and Chanowitz study 1, this effect of reasons increasing compliance was only found when the confederate asked for 'a small favor' (five instead of ten pages, translating to effect sizes of d = 0.87 and d = 0.13, respectively) [12]. Still, the results in general, as well as their implications, have been questioned [21,22]. A study by Folkes suggests that, instead of the size of the request, the effect is moderated by controllability [21]. Pooling Folkes' reason conditions results in an effect size of d = −0.026, which speaks against the quote above and indicates that the "power of reasons" effect is malleable, at the least.
To our knowledge, only one published direct replication of the Langer, Blank and Chanowitz study 1 exists [20]. The main effect of the study replicated (d = 0.67 for placebic over no reason and d = 0.69 for real over no reason conditions), although over 20% (34 out of 163) of the participants needed to be excluded for various reasons. Lack of published replication studies, of course, is not new in the field of psychology [23].
In a conceptual replication of the phenomenon, in small request conditions, reasons (either placebic or real) increased compliance by an equivalent of d = 0.43 (calculated from Table 1 of [24]) when including their additional persuasion group and d = 0.22 when excluding it. Another conceptual replication [25] found d = 0.15 for requests perceived as small, and d = 0.21 for requests perceived as large (as calculated from Figure 3 of [25]).
These studies seem to temper earlier claims for the power of reasons in increasing compliance. In contrast to the naïve because-heuristic, let us define the weak because-heuristic as "reasons increase compliance, but only if the perceived favour is small".
This study will investigate the effects of the because heuristic on compliance with the physical activity measurement procedures in the context of baseline measurements of a large school-based intervention.

The Let's Move It cluster randomized trial
Inadequate PA predicts increased morbidity and mortality in people of low socioeconomic status (SES) [26], with SES differences in PA emerging already in adolescence [27]. Finnish vocational school students are less physically active than those in high school [28]. The Let's Move It intervention aimed to increase PA and decrease sedentary behaviors in older adolescents in vocational schools.
The current study was conducted as a sub-study of the cluster randomised effectiveness evaluation trial of the Let's Move It intervention [29]. In a preceding feasibility study [30], participants' accelerometer wear times were suboptimal; 47% (18/38) of baseline participants reached the cutoff of 10 h per day for at least 4 days, 63% (17/27) for the first and 75% (9/12) for the second follow-up. A frequently cited explanation for not wearing the accelerometer was forgetting to put on the device.

Aims and hypotheses
In this within-trial study, we investigate SMS-reminder strategies to improve the duration of accelerometer wear time. The literature cited previously led us to hypothesise that reminders would increase accelerometer wear time and that citing reasons would amplify the effect. In addition to daily wear hours, we are interested in the number of days our participants provide valid activity data (i.e. days of ≥10 h of activity data). The target behavior is thus twofold: 1) putting on the accelerometer in the morning for as many days as possible, 2) wearing the accelerometer for as long as possible in the waking hours each day. In this study, two main research questions are posited:
1. Are SMS-reminders associated with greater accelerometer wear times?
The current study investigated this by comparing the compliance rates across a) participants who opted to receive SMS reminders to wear their accelerometer, and b) participants who opted not to receive the reminders (non-randomised control group). If forgetting is an important reason for non-compliance, in the absence of intervening factors, reminders should increase compliance.
Statistical hypothesis H1: Those who receive SMS reminders will have higher accelerometer wear times than those who do not.
2. Does offering reasons to comply affect accelerometer wear time?
If reasons increase compliance, SMS reminders containing reasons to wear an accelerometer should lead to greater compliance.
Statistical hypothesis H2: Those who receive reasons in the SMS reminders have more minutes of accelerometer wear and more days of valid data (≥10 h of activity) than those who do not receive reminders containing a reason.
An additional research question, on whether providing reasons to comply with accelerometer wear increases trial retention, is omitted here. These null results are reported in [31].

Methods
The design of this study was a within-trial, outcome-assessor blinded, partially randomised controlled trial (RCT). In addition to the randomised experiment between two message types, quasi-experimental data were acquired from a self-selected opt-out arm (see Fig. 1). This study was conducted during the baseline assessment of the first two recruitment waves (out of six) of the Let's Move It cluster-randomised controlled trial [29]. This article is based on unpublished work available at https://osf.io/89mhu/. Additional information on methods and results, as well as all analysis code, can be found in the supplementary website at https://git.io/vNl8X (permalink provided in [32]).

Participants and sampling procedures
To be included in the study, the participants had to fulfill inclusion criteria of the Let's Move It study [29] and had to have consented to the accelerometry measurements: all were at least 16 years old and were vocational school students. The reminder arms consisted of the participants who opted in to receive reminders for accelerometer wear.
During baseline recruitment of the first two recruitment waves of the Let's Move It trial, students in two vocational schools were approached during class and informed about their school's participation in the study. After the invitation to participate in the main trial and collection of signed informed consent forms, those who consented were given an online questionnaire to complete. Details of trial procedures are reported in the protocol [29].
After 1-3 days, research assistants gave the participants a waist-worn accelerometer (Hookie AM 20, Traxmeet Ltd., Espoo, Finland) and instructed them on how to wear it for a duration of seven consecutive days (including the day of receiving the device). The used Hookie accelerometer is a tri-axial accelerometer that collects data at 100 Hz sampling rate without preprocessing. The measurement range of the accelerometer is ±16 g and the resolution is 4 mg (milligravity). The Hookie accelerometer employs the same tri-axial acceleration sensor component (ADXL345; Analog Devices, Norwood MA) that is used in widely used research-grade accelerometers [33]. The validation of the Hookie accelerometer has been reported in both children [34] and adults [35] in studies comparing analysis of raw acceleration data from different accelerometers.
When participants received the accelerometers, they were asked whether they would like to receive SMS messages to help them remember to put it on every morning. Those who consented to the messages were subsequently randomised to one of two message conditions, and those who opted not to receive the reminders were treated as a self-selected control arm.
After 7 days, participants returned their devices to research assistants and were asked to fill out a short questionnaire assessing process measures (see Additional file 1: Appendix S1 and Additional file 2: Appendix S2).

Random assignment
Participants were assigned to the reason and succinct arms after they were recruited. The first author extracted the phone numbers from the list and used R code to generate as many random numbers as there were new participants. The vector of random numbers was then assigned to the participants. Participants with a number equal to or smaller than the median of the vector were allocated to the reason condition. Others were allocated to the succinct condition. Research assistants working in the field were blind to group allocation. Recruitment and randomisation took place on the same day, and restrictions such as blocking or stratification were not used.
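The allocation rule described above can be sketched as follows. This is a minimal illustration in Python, not the R code used in the study (which is available in the supplementary website); the function name `allocate`, the `seed` argument and the participant identifiers are hypothetical.

```python
import random
import statistics


def allocate(participant_ids, seed=None):
    """Median split on random draws: 'reason' if draw <= median, else 'succinct'.

    `seed` is a hypothetical convenience for reproducibility, not part of
    the procedure described in the text.
    """
    rng = random.Random(seed)
    # One random number per newly recruited participant.
    draws = [rng.random() for _ in participant_ids]
    cutoff = statistics.median(draws)
    return {
        pid: ("reason" if draw <= cutoff else "succinct")
        for pid, draw in zip(participant_ids, draws)
    }


# Hypothetical batch of 11 participants (identifiers are made up).
ids = [f"P{i:03d}" for i in range(1, 12)]
allocation = allocate(ids, seed=1)
counts = {arm: list(allocation.values()).count(arm) for arm in ("reason", "succinct")}
```

Because draws equal to the median are allocated to the reason condition, a batch with an odd number of participants places the extra participant in the reason arm, while an even batch splits equally.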
Recruitment took place in two waves, alongside the recruitment of the main trial. In order to increase the rates of participants opting in for the reminders, the recruitment prompt was slightly modified for the second wave. The research assistants presented the SMS reminders as the default option, and asked whether this is acceptable to the participants.
Random assignment was not visible to the participants and the research assistants did not mention that different kinds of messages were going to be sent. The statistician who analysed the raw accelerometer data was blind to group assignment.

Interventions
An important issue regarding the current study was to avoid tampering with the effects of the main trial. In other words, it should not affect main trial outcome measures in any way other than through increased data quality. Care was taken to formulate the SMS messages to not pressure participants or provoke changes in main trial outcome measures such as PA. We altered a previous procedure [11] by varying the message content slightly each day to reduce habituation, which we expected to increase the chances of the message being read in both arms.
The two arms received different message content.
a) Succinct reminder condition: 1. a greeting, 2. a reminder, 3. a thank you
b) Reminder and reason: 1. a greeting, 2. a reason beginning with "Because…", followed up with a reminder, 3. a thank you
Messages are presented in detail in Table 1 below. We sent the messages using an SMS Gateway device MT-SF100-G-EU (MultiModem iSMS Server 1-port) by Multi-Tech Systems (http://www.multitech.com/brands/multimodem-isms). We used a manufacturer-designed guided user interface for the first recruitment wave and a custom interface designed by a local service provider for the second wave.

Registration and deviations from registered plan
The study plan was reviewed by the Ethics Committee for Gynaecology and Obstetrics, Pediatrics and Psychiatry of the Hospital District of Helsinki and Uusimaa (decision number 367/13/03/03/2014).
Official public registration in the German Clinical Trials Register (DRKS-ID: DRKS00007721) was completed 3 months after recruitment of the first wave had been initiated, but before data was available. Pre-registration (before starting data collection) failed due to lack of available resources at the time.
The original plan was to establish the additive effect of messages containing a reason and those not containing one over a no-message condition during the baseline measurement of the first batch. With the sample size we expected (n = 140), we would have had over 95% power to detect an effect of d = 0.6 (slightly smaller than the one discovered in the Langer, Blank and Chanowitz replication [20]). We had then planned to pit the more successful message type against a third message in the second wave. Instead of going forward with the plan of using a third message, we made the decision to gather another wave of participants with the same message types after the data from the first wave was analysed. This was because, contrary to our expectations, no difference between the two messages was detected. This is important to note, as it means we can no longer rely on a long-term error rate of 5% [36] and, as p-values depend on the sampling distribution, default p-values from common statistical programs no longer apply [37].
To address the issue of inadequate reporting in the sciences [38], the current report complies with the Consolidated Standards of Reporting Trials (CONSORT) statement [39]. Contributor roles are clarified in Additional file 3: Appendix S3, according to a taxonomy for this purpose [40].

Primary outcome measures
Primary outcome measures were 1) accelerometer wear time minutes and 2) days with ≥10 h of valid accelerometer data. As this trial was conducted within a larger trial, several other measures were collected and are listed in the Let's Move It protocol [29]. The main trial used a

Implementation assessment measures
A one-page questionnaire (Additional file 1: Appendix S1; translation in Additional file 2: Appendix S2) was used to gain additional insight into the reception of the messages. Self-reported message receipt. As we could not gather objective log data on the number of messages opened, we asked participants to assess on how many mornings they had opened and read the SMS. Response options were: Not on a single morning, On 1 morning, On 2-3 mornings, On 4-5 mornings and Every morning.
Manipulation and contamination check. As participants were randomised individually, as opposed to clusters at school class level, discussing the SMS messages with their classmates could have led to students finding out that not everyone received the same messages, and perhaps also reveal the study hypotheses. We attempted to gauge the extent of this by asking them how often they had discussed the messages with peers. Response options were: Not once, Once, 2-3 times, 4-5 times and More often.
Acceptability of SMS message content was assessed by asking the participants how much they agreed with the statement "I was satisfied with the content of the messages". Response options again had a 5-point scale: Completely disagree, Somewhat disagree, Do not agree nor disagree, Somewhat agree and Completely agree.

Statistical analyses
All non-Bayesian analyses were conducted using RStudio running R [41,42]. Plots were drawn using R packages 'ggplot2' [43] and 'yarrr' [44]. Distributions between the reason and succinct groups in the implementation assessment questions were compared using the chi-square test.
Accelerometer wear times were analysed using bootstrapping methods. A 95% bootstrap confidence interval for a mean can be acquired by resampling observed data to simulate a sampling distribution, and obtaining the values for the 2.5th and 97.5th percentiles of the resampled means [45]. A kernel density plot, a bootstrap confidence interval and a bootstrap test of equivalence were produced using R package 'sm' [46] for differences of distributions of the two reminder arms. Wilcoxon rank sum test with continuity correction was used to compare medians between groups.
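As an illustration of the percentile bootstrap described above, the following minimal Python sketch resamples a data set with replacement and reads off the 2.5th and 97.5th percentiles of the resampled means. It is not the 'sm' routine used in the analyses; the function name `bootstrap_mean_ci`, the number of resamples and the wear-time values are assumptions made for the example.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, level=0.95, seed=None):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - level
    lower_idx = int((alpha / 2) * (n_resamples - 1))
    upper_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return means[lower_idx], means[upper_idx]


# Made-up weekly wear-time minutes, for illustration only (not study data).
wear_minutes = [4200, 5100, 3900, 6000, 4700, 5300, 3600, 4900, 5500, 4400]
low, high = bootstrap_mean_ci(wear_minutes, seed=42)
```

The interval bounds depend on the random seed and the number of resamples, so repeated runs without a fixed seed will give slightly different limits.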
ANOVA for equivalence of means between the two reminder groups and the no-reminder group, as well as its illustration, was performed using R package 'userfriendlyscience' [47]. Additionally, a MANOVA with wear time minutes and wear days with valid data as dependent variables, and SMS group as an independent variable, was used to test robustness of results.
A 95% Bayesian Highest Density Interval (HDI) [44] of the means of valid wear days was plotted using R package 'yarrr'. HDI refers to the most likely population parameter values (here: means) given the data; information which is not delivered by frequentist confidence intervals [48,49].

Bayes factors
Due to our sampling methods (e.g. decision to collect more data was based on observed data), traditional frequentist statistics faced limitations. Thus, we also calculated Bayes Factors [50][51][52] for our main outcome measures. A Bayes factor BF01 is essentially the ratio of two likelihoods, answering questions such as "How many times more likely are the data under the null hypothesis, compared to a specific alternative hypothesis?". We used the R package BayesFactor [53]; for comparing means, this package assigns the alternative hypothesis a Cauchy prior. We used a prior scale of 0.3, in accordance with common effects in health psychological research [54]. This reflects a prior belief that 50% of the effects lie between d = −0.3 and 0.3. For contingency tables, priors are described in Jamil et al. [55]. The minimum value of the prior concentration is 1, and an increase reflects the belief that the distribution of observations in the given categories under H1 is relatively more similar to H0. Additional information on inference using Bayes Factors, and prior robustness checks, are found in the supplementary website (https://git.io/vNl8X).
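The interpretation of the prior scale can be checked directly from the Cauchy distribution function: for a Cauchy prior centred on zero with scale r, the probability mass between −r and r is (2/π)·arctan(1) = 0.5. The short Python sketch below illustrates this; the helper names are hypothetical and the code is independent of the BayesFactor package.

```python
import math


def cauchy_cdf(x, scale):
    """Cumulative distribution function of a zero-centred Cauchy distribution."""
    return 0.5 + math.atan(x / scale) / math.pi


def prior_mass_within(bound, scale):
    """Prior probability that the effect size lies in [-bound, bound]."""
    return cauchy_cdf(bound, scale) - cauchy_cdf(-bound, scale)


scale = 0.3
# Half of the prior mass lies within one scale unit of zero.
mass_within_scale = prior_mass_within(0.3, scale)
# The heavy tails still leave some mass beyond a large effect such as d = 0.8.
mass_within_large = prior_mass_within(0.8, scale)
```

A smaller scale concentrates the prior on small effects, which is why the choice of scale matters for the resulting Bayes Factor and why robustness checks over a range of scales are useful.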

Equivalence testing
In the frequentist statistical paradigm, support for the null hypothesis is indicated by the practice of equivalence testing [56]. For a difference between means, one essentially first establishes a region of equivalence to zero, then conducts and combines two t-tests. The first one tests whether the effect is higher than the lower bound (in our case, − 0.3), and the other tests whether the effect is smaller than the higher bound (in our case, 0.3). The tests were conducted using R package "TOSTER" [57].
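The two one-sided tests procedure described above can be sketched as follows. This is an illustrative Python version that uses a normal approximation in place of the t-tests performed by TOSTER, so p-values will differ slightly in small samples; the function name `tost_two_groups` and the example data are assumptions.

```python
import math
from statistics import NormalDist, fmean, stdev


def tost_two_groups(x, y, low_d=-0.3, high_d=0.3, alpha=0.05):
    """Two one-sided tests for equivalence of two means (normal approximation).

    Equivalence bounds are given in Cohen's d units and converted to raw
    units with the pooled standard deviation.
    """
    nx, ny = len(x), len(y)
    diff = fmean(x) - fmean(y)
    pooled_sd = math.sqrt(
        ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    )
    se = pooled_sd * math.sqrt(1 / nx + 1 / ny)
    z = NormalDist()
    # Test 1: is the difference larger than the lower bound?
    p_lower = 1 - z.cdf((diff - low_d * pooled_sd) / se)
    # Test 2: is the difference smaller than the upper bound?
    p_upper = z.cdf((diff - high_d * pooled_sd) / se)
    return {
        "diff": diff,
        "p_lower": p_lower,
        "p_upper": p_upper,
        # Equivalence is declared only if both one-sided tests reject.
        "equivalent": max(p_lower, p_upper) < alpha,
    }


# Artificial data: two identical groups, and one group shifted by about one SD.
group_a = [float(i) for i in range(200)]
same = tost_two_groups(group_a, list(group_a))
shifted = tost_two_groups(group_a, [v + 60.0 for v in group_a])
```

Because both one-sided tests must reject at the chosen alpha, the procedure corresponds to checking whether a 90% confidence interval for the difference falls entirely within the equivalence bounds.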
We did not conduct multi-level analyses to account for the intra-class correlation of 0.09 for total accelerometer wear time. Heterogeneity analysis is presented in the supplementary website (https://git.io/vNl8X) file under "Heterogeneity among clusters".
Using standard deviations estimated from feasibility study [30] data, we determined a practically significant effect size for wear time hours to be d = 0.42, enough to bring a person from 9.5 h of daily data to reach the cutoff of 10 h. For our purposes, we decided to consider effect sizes between −0.3 and 0.3 as equivalent to zero. Additional details are presented in the supplementary website under "Statistical power".
Analysis regarding statistical power is presented in Fig. 2, holding alpha constant at 0.05 and sample size at achieved levels. As seen from the figure, we had 90% power to discover an effect of size d = 0.39, 80% to detect d = 0.3, 60% to detect d = 0.27 and 40% to discover an effect of d = 0.21. Thus, type 2 error probabilities were small for effects near our defined minimal effect size of interest, but high for small effects.
We also evaluated Type S and Type M error probabilities [58], and the v-statistic [59]. The analysis is presented in the supplementary website (https://git.io/vNl8X). In brief, our design was relatively well-equipped to handle medium-sized effects, but is subject to considerable bias under small effects.

Descriptive data
A participant flow diagram presented in Fig. 1 indicates how the messages were sent to almost all participants as intended.
Of the 375 participants consenting to accelerometer measurements as part of the main trial, 95 opted out of receiving reminders and an additional 7 did not receive messages due to technical difficulties. In the end, the SMS messages with reasons were sent to 138 and the succinct messages to 135 participants. Consent rate for reminders was 54% (101 out of 186) in the first wave and 95% for the second wave (179 out of 189). Table 2 shows the sample characteristics for the baseline data.

Implementation and process measures
Manipulation and contamination check, as well as satisfaction with the messages and discussing their content, are presented in the supplement. In brief, we did not detect differences across any groups, with Bayes Factors indicating strong support for the null hypotheses. As shown in Fig. 3, 74.9% of respondents reported having opened and read the SMS on at least four mornings. Discussing the content of the messages with peers was not common; 91.1% answered having done so never or just once (Fig. 4).
Open comments did not reveal unforeseen negative effects. In addition, 13% (9 out of 70) of participants who answered the question explicitly added that remembering to wear the device was due to receiving the messages.

Wear times

Wear time minutes
Accelerometer wear times did not indicate meaningful differences between groups (see Fig. 5). Bootstrap tests of equal densities indicated no differences in total wear time minutes between the two message types (p = 0.28), nor between those who received and did not receive messages (p = 0.35). Wilcoxon rank sum test showed no differences in distributions between message groups (W = 8860, p = 0.647); see also the violin plots in Fig. 6.

Valid measurement days
Figure 7 shows densities and spread of valid measurement days by group. Visual inspection of the HDIs suggests that the population means are equivalent. Differences between the distributions of measurement days with > 10 h of data were not detected between the reason and succinct groups, χ²(7) = 7.893, p = 0.342. A Bayesian contingency tables test provided BF01 = 6.96 (Poisson sampling, prior concentration = 1.0; prior robustness test depicts a concave function where, as concentration approaches 2, BF01 approaches 22.97).
Differences were not detected in valid wear day distributions between participants for whom reminders were sent, and for whom they were not: χ²(7) = 8.344, p = 0.303. A Bayesian contingency tables test provided BF01 = 34.79 (Poisson sampling, prior concentration = 1.0; the robustness function is concave as before: as concentration approaches 2, BF01 approaches 93.50).
Again, in equivalence tests, the mean difference between message types (−0.07 days, 90% CI [−0.47; 0.33]) was statistically significantly larger than d = −0.3 and smaller than d = 0.3. The mean difference between reminder and opt-out groups (−0.18 days, 90% CI [−0.60; 0.24]) was statistically significantly smaller than d = 0.3, but we could not reject the hypothesis that the effect was at least as extreme as d = −0.3 (i.e. at or below the lower bound).
A MANOVA with both total wear time minutes and valid wear days as dependent variables neither detected

Dose dependence
If reading of messages is linearly related to wear time, an upward moving slope in means would have been expected. The dose dependence curve (Fig. 8) is flat, showing no support for such a relationship between messages and wear time.

Discussion
In an attempt to improve measurement of physical activity and sedentary behaviour, both key public health issues, this study evaluated the effects of two interventions to increase accelerometer wear times during the first two recruitment waves of the Let's Move It trial. Specifically, it tested the effects of the because-heuristic on accelerometer wear time in older adolescents. We did not detect increased wear times among participants who received a reason in their daily SMS reminders, nor did we detect different wear times between those receiving the reminder messages and those who opted out. In all cases, null models were supported over those with small-to-medium sized effects (see supplementary website (https://git.io/vNl8X) sections "Interpreting Bayes Factors" and "Bayesian ANOVA" for details). As it is neither logically nor statistically appropriate to conclude the absence of an effect from a non-significant hypothesis test [60,61], we hope the analyses contribute to a long-overdue inferential development in the field. Our main results are in line with those of Belton et al. [11], in which reminders did not increase wear time, even though we attempted to improve on the earlier studies by not sending exactly the same message every day.
We do not have data on whether the reminders caused our participants to put on the accelerometer more often, in spite of not increasing wear time, as found in [11].
Although the xerox machine study [12] has been highly publicised for 30 years, the contextual framework of the effect remains unclear. Thus, many possible reasons could explain the null results obtained in this study, including the impersonal nature of SMS communication (as compared to face-to-face interaction), the source of the information, being unable to complete the requested task, and several other factors varying in plausibility: demographic factors, the target behaviour, contextualised cognitive processes and so forth. Accordingly, the effect of reasons on this particular behavior, given our context and delivery method, has proved smaller than what would be considered minimally interesting (although participants did attribute remembering to wear their accelerometer to receipt of the reminders), and possibly zero. Thus, the naïve because-heuristic does not receive support in the current study. We cannot draw conclusions regarding the weaker claim of reasons affecting only tasks which are easy to carry out, due to design and sample size considerations.
The flat dose-dependence curve can indicate several things, including the possibility that text messages do not affect wear times. Attributing remembering to getting reminded could be a case of a post hoc reasoning error [62]. Another possibility is that the messages could have had a small effect, but opening and reading the message provided no additional benefit. For example, the participant could have looked at the preview of the message on the cell phone screen and remembered without reading the whole message.
As there were no differences between the SMS and no-SMS arms, this effect may have been masked by selection bias, with those people who expect to experience problems with remembering, opting in to receive SMS reminders. As consent was almost fully dependent on the recruitment prompt, an additional assumption is needed that the two recruitment waves differ qualitatively (on an unobserved confounder). So, for example, the second wave may have consisted of more compliant participants or the potential interactions with the first wave participants might have made the opinion of the study more favorable. Thirdly, the effect of reminders may not have been linear, or only a small dose is needed to form a habit, and thus achieve maximal effect. This explanation requires the same assumptions as the one described above. Fourth, the flat curve may also be caused by unreliable measurement: dose should be operationalized in a way not dependent on self-report. Finally, it is possible that receiving reminders causes an undermining of one's own responsibility, so that those who receive reminders relinquish control and do not carry out the remembering techniques (e.g. placing of the accelerometer in a conspicuous place as a prompt to put it on) they would have, in the absence of reminders.
It may be that daily accelerometer wear is not determined by heuristic/automatic processes but is instead governed by more reflective reasoning processes. In this case, the reminders should have provided justifications and rationales that are truly important for this target group. We do not have any evidence about what thoughts and connotations our reminder content evoked in the youth's minds, or whether it was counterproductive. Finally, it is possible that participants who had agreed to take part in the accelerometer data collection had already made the reflective decision and proceeded to an "implemental mindset", where persuasion messages are less relevant, as speculated in [63].

Limitations and strengths
There are a number of ways this study could have been improved on.

Opening and reading the messages (manipulation success)
The number of participants who opened and read the messages was assessed with a questionnaire instead of objective log data. This self-report measure (as well as the other post-intervention questionnaire items) was only a non-validated single item, thus probably far from optimal in terms of reliability. We had no reliable way to verify at which times the messages were received or whether they were opened at all. Anecdotal evidence indicated that the messages were too late for some students (i.e. they had already left the house and forgotten the accelerometer when receiving the message). On the other hand, we deemed that sending the messages too early might pose an acceptability issue. The SMS queue in the gateway device presented a difficulty: a larger number of message recipients heavily affected the variation in delivery times, making the last messages in the queue arrive late for some students. During the second recruitment wave, the time of initiating the send process was changed to be 45 min earlier (06:15 instead of 07:00), but we do not have data on the effect of this change.
Fig. 8 Self-reported opening and reading of messages. Y-axis is total wear time. Boxes represent 95% HDIs for the means, solid lines connect means and dashed lines connect medians. Participants who opted out of reminders are aggregated with those who indicated not having opened the messages even once. Participants who received messages, but did not answer the question on message reading, are excluded
We attempted to alleviate the effects of not opening the messages by starting each message with the word "because", so that the message preview would render it visible on many devices even when not opened. Unfortunately, we did not have access to a gateway system that could have sent e.g. MMS messages, where a small picture could have been added, thus providing log data on how many times the picture was downloaded.

Contamination effects and masking the different message conditions
Participants may have found out their group allocation when discussing the messages with peers. This would require the discussion to have been about the nuances of message content and assumes that the participants were intrigued enough to spend time on making such inferences in the first place, an assumption perhaps not warranted. It is unclear how discovering one's SMS arm would have affected the results, but the possibility of confounding cannot be excluded. Randomising the groups by clusters could have helped to avoid this, but would have led to a reduction in statistical power. Still, the participants mostly reported not having discussed the messages with peers.

Sampling plan
The stopping rule for data collection was not defined in advance. The decision to collect another wave of participants with the same design was made when it became apparent that the messages did not have the strong impact we had anticipated. This leads to uninformative p-values in terms of error control [64], whereas Bayesian analyses are not as crucially affected by stopping rules ([65], but see also [66]).
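The loss of error control under a data-dependent stopping rule can be illustrated with a small simulation. The sketch below is not the analysis used in this study: it assumes normally distributed outcomes with known unit variance, a simple two-sided z-test, and hypothetical wave sizes. Under a true null effect, a second wave is added whenever the first test is not significant, and the proportion of simulated studies with at least one significant result is returned.

```python
import math
import random


def two_sided_p(sample_a, sample_b):
    """Two-sided z-test p-value for a difference in means (known unit variance assumed)."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    z = diff / math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_per_wave=50, n_sims=4000, alpha=0.05, seed=1):
    """Share of null-effect studies declared significant when a second wave
    is collected only after a non-significant first look.
    All parameter values are illustrative, not taken from the study."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Wave 1: both groups drawn from the same distribution (no true effect).
        a = [rng.gauss(0, 1) for _ in range(n_per_wave)]
        b = [rng.gauss(0, 1) for _ in range(n_per_wave)]
        if two_sided_p(a, b) < alpha:
            hits += 1
            continue
        # Wave 2: collected only because the first look was not significant.
        a += [rng.gauss(0, 1) for _ in range(n_per_wave)]
        b += [rng.gauss(0, 1) for _ in range(n_per_wave)]
        if two_sided_p(a, b) < alpha:
            hits += 1
    return hits / n_sims


print(false_positive_rate())
```

With two looks at the data, the simulated false-positive rate exceeds the nominal alpha of 0.05, which is why p-values obtained without a pre-specified stopping rule are hard to interpret in terms of error control.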

Lack of a randomised no-SMS control group
In order to avoid distortion of main trial outcomes (e.g. increased PA), care had to be taken in this within-trial RCT. The risk of sabotage due to disappointment of being allocated to a no-SMS control group was deemed too high, and thus participants were not randomised into a no-SMS group. This, in turn, lessens the strength of conclusions based on wear times between the participants receiving the reminder and those not receiving one. People who know they do not need a reminder may have thus ended up self-selecting to the no-SMS group.
This presumes that teenagers studying in a vocational school have the capacity to make accurate predictions about their future self-regulation capabilities in an unfamiliar task (putting on an accelerometer). On the other hand, as described, the wording of the recruitment prompt was slightly modified from wave 1 to wave 2, and consent to reminders was increased from 53% (85 out of 97) to 95% (176 out of 186), whereas wear times did not differ. Thus, strong selection effects seem unlikely. Although this indicates that opting out was more a result of the recruitment procedure than knowledge of not needing the reminders, future research should aim to randomise when feasible.
One way to address this problem would have been an n-of-1 design, where each day is randomised to one of the three message conditions. With this design, one should be careful not to leave learning effects undetected, as participants could habituate to reminders and then forget to wear the device on days when no reminder arrives.
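As a minimal sketch of how such day-level randomisation could be generated, the following example assigns each day to one condition using permuted blocks of three. The condition labels, the number of days and the use of participant IDs as seeds are assumptions made for illustration; none of them is taken from the trial protocol.

```python
import random

# Hypothetical labels for the three message conditions.
CONDITIONS = ("no_sms", "succinct", "rationale")


def daily_schedule(n_days, seed):
    """Return one condition per day, randomised in permuted blocks of three."""
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_days:
        block = list(CONDITIONS)
        rng.shuffle(block)  # each block contains every condition exactly once
        schedule.extend(block)
    return schedule[:n_days]  # the last block is truncated if n_days is not a multiple of 3


# One independent schedule per participant; IDs double as seeds here.
schedules = {pid: daily_schedule(n_days=9, seed=pid) for pid in range(1, 4)}
for pid, days in schedules.items():
    print(pid, days)
```

Blocking keeps the conditions balanced over the measurement period, so that a time trend in wear (for example, declining compliance towards the end of the week) is less likely to be confounded with condition. It does not by itself address carry-over or habituation effects.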

Message content and size of request
The intervention was not piloted, nor was extensive testing of its component parts done, which may have affected the results. The pre-testing of the message content was limited, too, and we thus do not have data on whether our participants considered the messages persuasive. This could be important theoretically, especially if the request size was considered large and our reasons were perceived as placebic or near-placebic. However, this might not be an issue in the first place, as participants had already agreed to wear the accelerometer as part of the trial. Message content (as explicated in hypothesis H2) may not play a role at all, if the real reason for non-compliance is e.g. leaving the house in a rush. In such a case, though, we would still expect those who are reminded to have increased wear times compared to those who are not reminded (hypothesis H1).

Pre-registration
In this paper, we attempted to answer the call for more stringent methodology by pre-registration. Optimally, this would have been done prior to beginning data collection. For cases where it is done later, it has been proposed that analyses should be considered exploratory [67], especially in the presence of researcher degrees of freedom or data-dependent analysis decisions [68], and that p-values can be rendered meaningless. In our case, this mistake turned out to be inconsequential. We used Bayes factors to avoid claiming findings based on p-values alone, as recently warned against by the American Statistical Association [69]. Other approaches we used to address the replicability problem were transparent reporting and open data.

Implications for practice
Our results, in line with some other studies (e.g. [11]), indicate that researchers should not expect simple reminders to have strong effects on accelerometer wear times among youth. Also, despite previous strong claims, the because-heuristic in this context lacks the strength attributed to it in the popular literature. When considering using SMS reminders for youth, we suggest ensuring that remembering, rather than other determinants (such as social norms or motivation), plausibly plays the key role in compliance for the behavior and target group in question. Participants' coping skills and attention span may act as a ceiling to the potential effect of the reminder in situations where the target behavior cannot immediately be carried out, so suitability of SMS reminders could be assessed in these respects as well.

Implications for future research
To an extent, the findings here apply to situations where cost-effective reminders can potentially improve compliance. These areas may range from medication adherence [70] to sunscreen use [71]. An interesting hypothesis to test would be whether reminders actually reduce active coping strategies that people use spontaneously; this could partly explain some null findings in the literature on technical reminder systems [72]. Second, the delivery of the reminders should optimally be objectively trackable, in order to make firm conclusions about the independent effects of delivery and receipt. Third, the context (including timing and location) where the participant receives the reminder is likely to be important, as well as the coping behaviour of the control group. It may also be worthwhile to gauge whether altering the frequency of reminders affects the target behavior [70], or whether the system can be made to adapt to the users and their environments [73]. Lastly, it might be worthwhile to investigate whether personally meaningful persuasive arguments work better than vague and general ones (e.g. contributing to science), which were used here in order to minimise the risk of participants changing their activity behaviour instead of merely the wear time behaviour. As the literature presented earlier suggests, any reasons should be enough for heuristic decision making, whereas good reasons may be needed for more reflective decisions. Further theorising and additional measures to test hypotheses based on dual process models may be fruitful, but we encourage researchers to see [74][75][76][77][78][79], and also to consider a wider perspective from complexity and systems theory [80][81][82][83], which have recently been applied in public health [84][85][86].

Conclusion
In this research, we have found evidence against the assumed superiority of the naïve because-heuristic, that is, the idea that providing reasons in simple compliance requests has a general persuasive effect on behaviour. By using Bayesian methods and equivalence testing, we were able to claim evidence of no effect for the because-heuristic in this setting. Likewise, sending SMS reminders was not associated with improved accelerometer wear times. Although we did not randomise the no-SMS group, the changed recruitment procedure plausibly accounts for the majority of the selection effect, and a more potent explanation for the lack of differing wear times is that participants reached the ceiling of their ability to wear the device in the absence of very high motivation.
Our design had several limitations, which should be improved upon in future research. All in all, we remain pessimistic of the efficacy of the naïve because-heuristic and of simple reminders, even if they have a potent effect in participants' perceptions.
We conclude that despite strong claims, there is reason to consider the study of the because-heuristic a degenerating research programme [87], although there may be some contexts where the technique works as intended. Efforts to increase accelerometer wear time among participants may benefit from a design using the intervention mapping approach [88], including a plausible theoretical framework.

Additional files
Additional file 1: Appendix S1. Post-SMS questionnaire (Finnish).

Abbreviations
BF: Bayes Factor; LMI: Let's Move It intervention to increase physical activity and decrease sedentary behavior in older adolescents; RCT: Randomised controlled trial; SMS: Short Message Service, also known as "text messaging"