Completeness and accuracy of crash outcome data in a cohort of cyclists: a validation study

Background: Bicycling, despite its health and other benefits, raises safety concerns for many people. However, reliable information on bicycle crash injury is scarce as current statistics rely on a single official database of limited quality. This paper evaluated the completeness and accuracy of crash data collected from multiple sources in a prospective cohort study involving cyclists.
Methods: The study recruited 2438 adult cyclists from New Zealand's largest mass cycling event in November 2006 and another 190 in 2008, and obtained data regarding bicycle crashes that were attended by medical personnel or the police and occurred between the date of recruitment and 30 June 2011, through linkage to insurance claims, hospital discharges, mortality records and police reports. The quality of the linked data was assessed by capture-recapture methods and by comparison with self-reported injury data collected in a follow-up survey.
Results: Of the 2590 cyclists who were resident in New Zealand at recruitment, 855 experienced 1336 crashes, of which 755 occurred on public roads and 120 involved a collision with a motor vehicle, during a median follow-up of 4.6 years. Log-linear models estimated that the linked data were 73.7% (95% CI: 68.0%-78.7%) complete with negligible differences between on- and off-road crashes. The data were 83.3% (95% CI: 78.9%-87.6%) complete for collisions. Agreement with the self-reported data was moderate (kappa: 0.55) and varied by personal factors, cycling exposure and confidence in recalling crash events. If self-reports were considered as the gold standard, the linked data had 63.1% sensitivity and 93.5% specificity for all crashes and 40.0% sensitivity and 99.9% specificity for collisions.
Conclusions: Routinely collected databases substantially underestimate the frequency of bicycle crashes. Self-reported crash data are also incomplete and inconsistent. It is necessary to improve the quality of individual data sources as well as record linkage techniques so that all available data sources can be used reliably.


Background
Regular cycling provides health and other benefits [1][2][3][4]. However, in New Zealand, using a bicycle is not an attractive mode of travel for many people [5] and accounts for only 2% of total travel time [6]. Cycling is becoming more popular as a sport but just over one-fifth of adults reported participating in either road cycling or mountain biking at least once over twelve months in a recent national survey [7].
For many people, safety concerns are one of the major barriers to riding a bicycle [8,9]. For each million hours that were spent cycling on New Zealand roads, according to the official statistics, 29 deaths or injuries resulted from collisions with a motor vehicle [10] and 31 injuries resulted in death or hospital inpatient treatment [11]. Furthermore, almost as many bicycle crashes occurred off-road [12].
However, current statistics typically refer to a single official data source, most commonly police crash reports and less frequently hospital records. These data sources are known to disproportionately undercount bicycle crashes [13][14][15]. This is not surprising as many bicycle crashes do not come to the attention of the police or medical personnel, and this undercount amounted to 70% or more of self-reported crashes in overseas studies [16,17]. While self-reports can provide information on unreported crashes, their validity may be questionable also, for example, due to nonresponse [18], failure to recall [19] and the influence of socially desirable responses [20]. For all these reasons, it has been proposed that "unattended" bicycle crash injuries are excluded when developing indicators of injury incidence [21].
Even for the crashes that were attended medically or by the police, routinely collected databases may not be complete [13,15] and accurate [22]. Moreover, as the crash data are usually collected for specific administrative purposes, each data source typically captures only a fraction of all crashes [14]. Therefore, using multiple data sources through record linkage may provide a broader, more complete and truer picture of injury, at a relatively low cost.
This paper evaluated the completeness and accuracy of bicycle crash data collected by self-report and by record linkage drawing on four national, routinely collected databases.

Design, setting and participants
The Taupo Bicycle Study is a prospective cohort study of cyclists designed to examine factors associated with regular cycling and injury risk. The sampling frame comprised cyclists, aged 16 years and over, who enrolled online in the Lake Taupo Cycle Challenge. This is New Zealand's largest mass cycling event, which is held each November and attracts about 10,000 cyclists. Participants have varying degrees of cycling experience, ranging from competitive sports cyclists and experienced social riders to relative novices of all ages.
Recruitment was undertaken at the time of the 2006 event for the majority of participants, as described, in detail, elsewhere [23]. In brief, email invitations, containing a hyperlink to an information page describing the study, were sent to 5653 participants who provided their email addresses at registration for the event. Those who agreed to take part in the study were taken to a page containing a web questionnaire and asked about demographic characteristics, general cycling activity, previous crash experience and use of injury preventive measures. The questionnaire was completed and submitted by 2438 cyclists (43.1% response rate). Another 190 cyclists were recruited from the 2008 event by including a short description about the study in the event newsletter. Ethical approval was obtained from the University of Auckland Human Participants' Ethics Committee.

Crash outcome data
Crash outcome data were collected through record linkage to insurance claims, hospital discharges, mortality records and police reports, covering the period from the date of recruitment to 30 June 2011. Record linkage was undertaken by the data custodians using name, gender, date of birth and address as identifiers. All participants consented to the linkage of their data to these databases. In addition, a follow-up survey was conducted in December 2009.

Insurance claims
In New Zealand, the Accident Compensation Corporation (ACC) provides personal injury cover for all residents and temporary visitors to New Zealand no matter who is at fault. The claims database is a major source of information on relatively minor injuries with over 80% of the claims related to primary care (e.g., GPs, emergency room treatment) only [24].
Approval for record linkage was obtained from the ACC Research Ethics Committee. A probabilistic linkage followed by a clerical review was undertaken and all claims for bicycle crashes were extracted. The data extracted contain nature and mechanism of injury, health service utilisation and out of hospital cost. Crashes that occurred on public roads and crashes that involved a collision with a motor vehicle were identified from relevant variables as well as from the free text field describing the crash.

Hospital discharge and mortality data
These databases are maintained by the Ministry of Health's Information Directorate. The National Minimum Dataset (NMDS) contains information about inpatients and day patients discharged after a minimum stay of three hours from all public hospitals and over 90% of private hospitals in New Zealand [25,26]. The Mortality Collection contains information about all deaths registered in the country [27].
Participant data were matched to a National Health Index (NHI) number, a unique identifier assigned to every person who uses health and disability support services in New Zealand. An electronic match was made where possible, followed by two stages of manual matching for participants who could not be linked electronically. Of 2590 participants who were resident in New Zealand at recruitment, 99.0% were successfully matched. All hospital discharges and deaths due to injuries or other health conditions were extracted.
The hospital discharge data contain diagnoses and diagnostic and therapeutic procedures undertaken in each hospital visit, which are coded under ICD-10-AM. Cycle crashes were identified using the E codes V10-V19; those that occurred on public roads were identified using the E codes V10-V18.3-9, V19.4-6, V19.9; and those that involved a collision with a motor vehicle were identified using the E codes V12-V14, V19.0-2 and V19.4-6. Readmissions were identified as described previously [28] and excluded.
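
The code ranges above can be read as a simple classification rule. The sketch below is one possible reading, not the study's actual extraction logic: it assumes that "V10-V18.3-9" means categories V10 to V18 with a fourth character of 3 to 9, and that any characters after the fourth are ignored. The function name `classify_cycle_code` is hypothetical.

```python
import re

# Matches cyclist external-cause codes V10-V19 with an optional fourth
# character, e.g. "V13.4" or "V134". Further digits are ignored (assumption).
_CYCLE_CODE = re.compile(r"^V(1[0-9])(?:\.?(\d)\d*)?$")


def classify_cycle_code(code):
    """Return on-road and motor-vehicle-collision flags for a code.

    Returns None when the code is not in V10-V19 (not a cycle crash).
    """
    match = _CYCLE_CODE.match(code.strip().upper())
    if match is None:
        return None
    category = int(match.group(1))
    fourth = int(match.group(2)) if match.group(2) else None

    if category <= 18:
        # V10-V18: fourth character 3-9 is read as "on a public road".
        on_road = fourth is not None and 3 <= fourth <= 9
        # V12-V14 are collisions with a motor vehicle.
        collision = 12 <= category <= 14
    else:
        # V19: .4-.6 and .9 on road; .0-.2 and .4-.6 motor vehicle collisions.
        on_road = fourth in (4, 5, 6, 9)
        collision = fourth in (0, 1, 2, 4, 5, 6)
    return {"on_road": on_road, "motor_vehicle_collision": collision}


if __name__ == "__main__":
    for example in ("V13.4", "V18.0", "V19.9", "V19.1", "W01"):
        print(example, classify_cycle_code(example))
```

A code with no fourth character is treated here as neither on-road nor (for V19) a collision; a real extraction would need an explicit rule for such records.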
The mortality data contain the underlying cause of death which is coded under ICD-10-AM and is also described in free text fields. However, the coroners' reports on the cause of injury death were available only up to 31 December 2008. All deaths due to a bicycle crash were identified from the available data.

Police reports
In New Zealand, it is mandatory that any fatal or injury crash involving a collision with a motor vehicle on a public road be reported to the police. A Traffic Crash Report is then completed and sent to the New Zealand Transport Agency, where the data are entered into the Crash Analysis System (CAS) database. A deterministic linkage followed by a clerical review was undertaken and all bicycle collisions were extracted. The linked data contain location, time and circumstances of the crash, and severity of injury.

Follow-up survey
The survey was conducted in December 2009 using a web questionnaire. The questions asked included: the total number of bicycle crashes experienced during the preceding year, the number of crashes for which claims were lodged with ACC, the number of crashes requiring hospital admission, and the number of crashes that were reported to the police. The participants were also asked to indicate the degree of confidence they had regarding the accuracy of their answers to each question using a five-point scale (very unsure, quite unsure, about 50/50, quite sure, very sure). This confidence rating has been shown to be a useful indicator of recall accuracy for physical activity measures [29].
A total of 1537 participants (58.5%) completed the questionnaire, of whom 70 reported not cycling in the preceding year.

Analyses
A capture-recapture analysis was undertaken to estimate the number of crashes that had occurred which were not identified through record linkage. In addition, the linked data were compared with the self-reported data collected in the follow-up survey.

Capture-recapture analysis
Capture-recapture methods were originally developed to estimate the size of an animal population, based on proportions of animals that were captured, marked, released and recaptured in two or more random samples. The procedure assumes closure of the population, mark integrity, independence of the samples and equal probability of being captured in each sample [30]. Since then, similar methods have been applied in epidemiological studies [31].
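
To make the underlying idea concrete, the sketch below shows the classical two-sample estimators (Lincoln-Petersen and Chapman's small-sample correction). This is an illustration only: the present study used log-linear models across several databases rather than a two-sample estimator, and the counts in the example are invented.

```python
def lincoln_petersen(n1, n2, m):
    """Two-sample estimate of total population size.

    n1: number captured in the first sample
    n2: number captured in the second sample
    m:  number captured in both samples (must be > 0)
    """
    if m <= 0 or m > min(n1, n2):
        raise ValueError("m must satisfy 0 < m <= min(n1, n2)")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's nearly unbiased variant; also defined when m == 0."""
    if m < 0 or m > min(n1, n2):
        raise ValueError("m must satisfy 0 <= m <= min(n1, n2)")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


if __name__ == "__main__":
    # Invented counts: 200 records in source A, 150 in source B, 60 in both.
    print(lincoln_petersen(200, 150, 60))
    print(chapman(200, 150, 60))
```

Both estimators rely on the samples being independent, which is the assumption that log-linear models with interaction terms are designed to relax.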
For this analysis, the study sample was restricted to the 2590 participants who were resident in New Zealand at recruitment. For each participant, bicycle crashes identified from the different databases were matched based on the date of crash allowing for a two-day difference. Log-linear models were used to estimate missing crashes, taking into account possible associations across the databases. The models were fitted to the incomplete multiway contingency table with one missing cell corresponding to absence in all databases. The strength of evidence for each model was assessed using Akaike's Information Criterion (AIC) and its weight. Based on the model averaged estimate and unconditional standard error, the frequency for the missing cell and its 95% confidence interval (CI) were calculated. Analyses were undertaken for bicycle crashes in general, and also for the specific categories of on-road crashes and crashes involving a collision with a motor vehicle.
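
The model-averaging step can be sketched as follows. The code computes Akaike weights from AIC values, a weighted average of the per-model estimates of the missing cell, and an unconditional standard error in the form given by Burnham and Anderson. The log-scale confidence interval is an assumption made here for illustration, and the model inputs in the example are invented, not taken from Table 4.

```python
import math


def akaike_weights(aics):
    """Convert AIC values into weights that sum to one."""
    best = min(aics)
    relative = [math.exp(-0.5 * (aic - best)) for aic in aics]
    total = sum(relative)
    return [r / total for r in relative]


def model_average(estimates, std_errors, aics):
    """Return the model-averaged estimate and its unconditional SE."""
    weights = akaike_weights(aics)
    average = sum(w * e for w, e in zip(weights, estimates))
    # Each model contributes its own variance plus its squared distance
    # from the averaged estimate (model selection uncertainty).
    unconditional_se = sum(
        w * math.sqrt(se ** 2 + (e - average) ** 2)
        for w, e, se in zip(weights, estimates, std_errors)
    )
    return average, unconditional_se


def log_scale_ci(estimate, std_error, z=1.96):
    """Approximate 95% CI on the log scale (an assumption of this sketch)."""
    factor = math.exp(z * std_error / estimate)
    return estimate / factor, estimate * factor


if __name__ == "__main__":
    # Invented values for three candidate models.
    estimates = [450.0, 500.0, 520.0]
    std_errors = [60.0, 70.0, 65.0]
    aics = [100.0, 102.0, 110.0]
    average, se = model_average(estimates, std_errors, aics)
    print(average, se, log_scale_ci(average, se))
```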

Comparison with self-reports
This analysis was based on the 1456 participants who completed the follow-up questionnaire and reported cycling in the preceding year. As some participants may have experienced more than one crash during the specified period, the exact crash date was not asked in the questionnaire. As such, it was not possible to match the linked and self-reported data for each crash identified in the source databases. Instead, agreement was assessed on a person-to-person basis for each database as well as for the combined data. Agreement was established (1) if a participant reported at least one bicycle crash that required medical attention (that is, involved a claim lodged with ACC or required an admission to hospital) or was reported to the police in the preceding year, and the linked data also showed at least one bicycle crash during the same period, or (2) if such a crash had not been experienced in the preceding year according to both the self-reported and linked data.
Cohen's kappa coefficients were used to determine the degree of agreement. In addition, the sensitivity, specificity and predictive values of the linked data were calculated, assuming that self-reports were the gold standard. Analyses were undertaken for all crashes as well as those involving a collision with a motor vehicle. In addition, subgroup analyses were performed for all crashes to examine differences in agreement by participants' demographic characteristics, amount of cycling, pre-existing medical conditions (heart attack, stroke, cancer, diabetes or high blood pressure) and confidence in recall.
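
These measures all derive from a 2x2 table of linked data against self-reports. The sketch below shows the arithmetic, treating self-reports as the reference; the function name and the counts in the example are hypothetical and do not come from Table 5.

```python
def agreement_stats(both, linked_only, self_only, neither):
    """Kappa and diagnostic measures for linked data vs. self-reports.

    both:        crash in linked data and in self-report (true positive)
    linked_only: crash in linked data only (false positive)
    self_only:   crash in self-report only (false negative)
    neither:     no crash in either source (true negative)
    """
    n = both + linked_only + self_only + neither
    observed = (both + neither) / n
    # Agreement expected by chance from the marginal totals.
    expected = (
        (both + linked_only) * (both + self_only)
        + (self_only + neither) * (linked_only + neither)
    ) / n ** 2
    return {
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": both / (both + self_only),
        "specificity": neither / (neither + linked_only),
        "ppv": both / (both + linked_only),
        "npv": neither / (neither + self_only),
    }


if __name__ == "__main__":
    # Hypothetical counts for 200 participants.
    for name, value in agreement_stats(40, 10, 20, 130).items():
        print(f"{name}: {value:.3f}")
```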

Results
The average age of the participants was 44.0 years (SD 10.4) and 72.4% were males (Table 1). About half the sample were university graduates (53.9%) and lived in the least deprived neighbourhoods (49.9%), and 77.7% lived in main urban areas. On average, participants cycled 5.7 hours a week (SD 3.7; Quartile Range 5).

Bicycle crashes reported at the follow-up survey
Of the 1456 participants who completed the follow-up questionnaire and reported cycling in the preceding year, 432 reported experiencing one or more crashes in the preceding year (Table 2). There were a total of 784 self-reported crashes, of which 57.4% occurred on the road and 17.9% involved a collision with a motor vehicle. Based on the respondent reports, 29.1% of all crashes involved a claim lodged with ACC, 3.7% required hospital admission and 6.5% were reported to the police. A higher proportion of collisions involved medical or police attention with 35.0% resulting in claims to ACC, 7.1% requiring hospital admission and 32.9% being reported to the police.

Bicycle crashes identified through record linkage
During a median follow-up of 4.6 years, only one death occurred due to a bicycle crash. As this fatal crash was recorded in both the Mortality Collection and NMDS databases, the former was excluded in further analysis.
Of the 2590 participants, 855 experienced 1336 bicycle crashes recorded in one or more databases, of which 755 (56.5%) occurred on public roads and 120 (9.0%) involved a collision with a motor vehicle. Only 18 crashes that involved a collision with a motor vehicle were identified from all databases (Table 3).

Completeness of the linked data
As no crashes identified in both the NMDS and CAS databases were found to be missing in the ACC database, the models containing both interaction terms ACC*NMDS and ACC*CAS were excluded. Table 4 shows model-based estimates and unconditional standard errors from the remaining six models. From these data, it was estimated that 477 crashes in general (95% CI: 362-629), 258 on-road crashes (95% CI: 197-338) and 24 collisions (95% CI: 17-32) were missing from all databases. That is, the completeness of the linked data was 73.7% (95% CI: 68.0-78.7%) for all crashes, 74.5% (95% CI: 69.1-79.3%) for on-road crashes, and 83.3% (95% CI: 78.9-87.6%) for collisions.
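
The completeness figures follow directly from the observed and estimated missing counts: completeness is the observed count divided by the sum of the observed and missing counts, with the interval bounds obtained by substituting the bounds of the missing-count interval. A minimal sketch using the counts reported above (the function name is our own):

```python
def completeness(observed, missing):
    """Share of all crashes (observed plus estimated missing) that were observed."""
    return observed / (observed + missing)


# (observed, missing estimate, missing lower bound, missing upper bound)
REPORTED = {
    "all crashes": (1336, 477, 362, 629),
    "on-road crashes": (755, 258, 197, 338),
    "collisions": (120, 24, 17, 32),
}

if __name__ == "__main__":
    for label, (obs, miss, miss_lo, miss_hi) in REPORTED.items():
        point = 100 * completeness(obs, miss)
        # More missing crashes means lower completeness, so the bounds swap.
        lower = 100 * completeness(obs, miss_hi)
        upper = 100 * completeness(obs, miss_lo)
        print(f"{label}: {point:.1f}% ({lower:.1f}-{upper:.1f}%)")
```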

Agreement between the linked and self-reported data
There was moderate agreement (kappa 0.55) between the linked and self-reported data for all crashes as well as crashes involving collisions, with the highest level of agreement observed with the claims data (Table 5). For 4.7% of participants who reported at least one crash (that required medical attention or was reported to the police) in the preceding year, there was no crash record in the linked data. In contrast, for 5.6% of participants who did not report a crash, one or more crashes were recorded in the linked data. This disagreement was less pronounced for collisions.
When self-reports were considered as the gold standard, the linked data for all crashes had 63.1% sensitivity, 93.5% specificity, 59.0% positive predictive value (PPV) and 94.5% negative predictive value (NPV). The sensitivity was counter-intuitively lower but the specificity and predictive values were higher for collisions. There were variations in agreement by participants' demographic characteristics, amount of cycling, pre-existing health conditions and confidence in recalling crash events (Table 6). A higher level of agreement was associated with being younger, male and Māori, having a higher level of education, spending less time cycling, not having pre-existing medical conditions, being more socioeconomically deprived and having a higher degree of confidence regarding the accuracy of recall.

Main findings
Our findings revealed a substantial underestimation of bicycle crashes in administrative databases. The capture-recapture models estimated that the linked data were 73.7% complete for all crashes with negligible differences between on- and off-road crashes. The linked data were 83.3% complete for collisions. In comparison with self-reports, the linked data had 63.1% sensitivity, 93.5% specificity, 59.0% PPV and 94.5% NPV for all crashes and 40.0% sensitivity, 99.9% specificity, 91.7% PPV and 97.7% NPV for collisions. Agreement between the linked and self-reported data varied across individual data sources and by participants' demographic characteristics, amount of cycling, pre-existing medical conditions and recall confidence.

Strengths and limitations
The bicycle crash data collected in this prospective cohort study were obtained through record linkage to four routinely collected databases. This resource efficient method of data collection was designed to minimise potential biases associated with loss to follow-up [32]. This also provided a unique opportunity to evaluate the completeness of bicycle crash records across the spectrum of severity. To the best of our knowledge, this is the first study to compare official vs. self-reported data on bicycle crashes. However, some limitations need attention.
In our capture-recapture analysis, all underlying assumptions may not be completely satisfied. First, the assumption that the study population is closed may be violated by death or emigration of some participants, thereby underestimating the findings [33]. However, such underestimation may not be substantial as only six deaths were identified from the Mortality Collection database and only 23 participants provided an overseas address at the follow-up survey. Moreover, ACC support is available to New Zealand residents if they return home with an injury sustained during an overseas trip of up to six months (or longer if they are travelling on business and paying income tax).
Second, the assumption that each individual has equal probability to be captured in each database may be violated if the probability differs by crash, personal, social and health service factors [21,34].
Third, the assumption that there are no lost marks between databases (mark integrity) may be violated if ascertainment of relevant cases is affected by inaccuracies in coding of bicycle crash data in each data source [22,25,35]. Miscoding may have resulted in failure to identify some bicycle crashes, thereby underestimating the capture-recapture counts [36]. This may account for the counter-intuitive finding of a lower sensitivity for collision crashes compared to all crashes. It is possible that some collisions were miscoded as 'cyclist only' crashes as observed previously in the UK [37]. Case ascertainment may also be affected by the quality of record linkage. Although the match rate by NHI was high (99%), mistakes may have occurred during extraction of bicycle crashes from each data source as a conservative approach was used to minimise false matches. While this served as a sensible strategy to estimate unbiased risk ratios in our subsequent analyses [38,39], it may have underestimated the capture-recapture counts [36].
In addition, the self-reported data, although used as the gold standard in this study, may not be accurate. Inaccuracies in recall or provision of socially desirable responses may have resulted in under- or over-reporting of bicycle crashes. Cyclists generally experience frequent minor crashes, which could make recall of crash experiences during a specified period difficult. In previous research, the injury rates were significantly underestimated if the recall periods were two months or more [19] and the ability to recall was influenced by number, type and severity of injuries, and time elapsed since the injury event [40,41]. Over-reporting, as observed in relation to motor vehicle crashes [42], is also likely as some reported crashes may have occurred prior to the specified recall period. Moreover, near misses or evasion crashes may have been reported as collisions with a motor vehicle. This may be another explanation for the counter-intuitively lower sensitivity for collisions compared to all crashes. While previous studies reported negative associations between self-reported motor vehicle crashes and social desirability scales [20,43], little is known about how this bias might impact self-reported bicycle crashes.

Interpretation
Our findings extend the existing literature and inform future attempts to estimate the burden and risk of bicycle-related injuries. As in previous research [16,17], our findings show that at most 30% of self-reported bicycle crashes were attended by medical personnel or  [44][45][46]. A New Zealand study found that only 22% of hospital-reported bicycle crashes and 54% of those involving a collision with a motor vehicle appeared in police reports [14]. In this study, 13% of hospital reported crashes and 64% of collisions were linkable to police records whereas 39% of police reported crashes and 43% of collisions were linkable to hospital records. Very few studies have estimated the completeness of combined databases. In a US study, hospital and police records, if combined, were 80% complete for automobile vs. child bicyclist collisions [44]. However, this level of completeness could be much lower if minor injuries were also considered. In this study, only 12% of bicycle crashes and 43% of collisions extracted from the linked data were recorded in hospital or police databases. To our knowledge, no other studies have assessed the completeness of individual or combined databases for relatively minor injuries.
Even though multiple data sources were used to capture a spectrum of injuries, our capture-recapture counts may still be underestimates given the limitations mentioned above. This is evident in comparisons with the self-reported data where the sensitivity of the linked data was lower than the completeness of data as estimated from the capture-recapture methods. If potential over-reporting is taken into account, however, the actual completeness of the linked data may lie between the two extremes, that is, between 63% and 74% for all crashes and between 40% and 83% for collisions.
In this study, agreement between the self-reported and official data was at most moderate although a higher level of agreement was observed in relation to motor vehicle crashes and unintentional injuries [47]. This may be because, compared to motor vehicle crashes, bicycle crashes occur more frequently and many are less severe, making them less likely to be recalled or coded properly. Our findings suggest that confidence ratings may be a useful tool in assessing the quality of recalled crash data as observed in previous research [29]. There were also variations in agreement by participants' personal factors, in accordance with earlier research on motor vehicle crashes [48].

Conclusions
Bicycle crash data collected from different sources were incomplete and inaccurate. This underscores the need to consider and account for potential biases due to outcome misclassification in our subsequent analyses as well as in other similar studies. Our findings also emphasise the need to improve the quality of individual data sources, to develop comprehensive record linkage techniques, and to enhance the validity and reliability of self-reported information so that all available data sources can be used reliably in our future attempts to capture a complete picture of important injuries.