A data quality assessment to inform hypertension surveillance using primary care electronic medical record data from Alberta, Canada

Background Hypertension is a common chronic condition affecting nearly a quarter of Canadians. Hypertension surveillance in Canada typically relies on administrative data and/or national surveys. Routinely captured data from primary care electronic medical records (EMRs) are a complementary source for chronic disease surveillance, with longitudinal patient-level details such as sociodemographics, blood pressure, weight, prescribed medications, and behavioural risk factors. As EMR data are generated from patient care and administrative tasks, assessing data quality is essential before using them for secondary purposes. This study evaluated the quality of primary care EMR data from one province in Canada within the context of hypertension surveillance. Methods We conducted a cross-sectional, descriptive study using primary care EMR data collected by two practice-based research networks in Alberta, Canada. There were 48,377 adults identified with hypertension from 53 clinics as of June 2018. Summary statistics were used to examine the quality of data elements considered relevant for hypertension surveillance. Results Patient year of birth and sex were complete, but other sociodemographic information (ethnicity, occupation, education) was largely incomplete and highly variable. Height, weight, body mass index and blood pressure were complete for most patients (over 90%), but a small proportion of outlying values indicated that data inaccuracies were present. Most patients had a relevant laboratory test present (e.g. blood glucose/glycated hemoglobin, lipid profile), though a very small proportion of values were outside a biologically plausible range. Details of prescribed antihypertensive medication, such as start date, strength, dose, and frequency, were mostly complete. Nearly 80% of patients had a smoking status recorded, though only 66% had useful information (i.e. categorized as current, past, or never), and less than half had their alcohol use described; information related to amount, frequency or duration was not available. Conclusions Blood pressure and prescribed medications in primary care EMR data demonstrated good completeness and plausibility, and contribute valuable information for hypertension epidemiology and surveillance. Other clinical, laboratory, and sociodemographic variables should be used carefully due to variable completeness and suspected data errors. Additional strategies to improve these data at the point of entry and after data extraction (e.g. statistical methods) are required.


Background
Hypertension is a common chronic condition, affecting more than one in five Canadians, and is associated with an increased risk of cardiovascular disease and mortality, as well as considerable economic and societal costs [1]. Monitoring the incidence and prevalence of hypertension over time is an important part of surveillance systems and public health activities. In Canada, administrative databases, which include in-patient hospital discharges and physician billing claims, are often used to report hypertension prevalence estimates, as in the Canadian Chronic Disease Surveillance System (CCDSS) [2]. While administrative sources provide population-level data for those who have encountered the healthcare system, they lack clinical details that are essential for better understanding the patient context and disease severity, including blood pressure (BP), body mass index (BMI), and lifestyle risk factors. Physical measures surveys are another commonly used source, as they obtain directly measured BP coupled with health-related interviews, as achieved by the Canadian Health Measures Survey (CHMS) [3]. However, these surveys are costly to maintain, response rates are often low, and the cross-sectional design does not allow for longitudinal follow-up.
A contemporary approach to hypertension surveillance is to use the clinically generated, detailed data from electronic medical records (EMRs), particularly from primary care settings where chronic conditions are largely diagnosed and managed [4,5]. EMR adoption among Canadian family physicians is growing, with an estimated 83% using EMRs in practice to some degree in 2018 [6]. Additionally, linkages between primary care EMR and administrative data can further enhance surveillance opportunities by providing a more complete perspective of disease manifestation and current management practices. Because EMR data are recorded to support individual patient care and administrative tasks, they may not be produced with the same standardization and rigor as research data; as such, some concern exists about their re-use for secondary purposes [7]. Therefore, investigations into data quality are necessary to determine whether the data are 'fit for purpose'. Previous studies evaluating the quality of primary care EMR data in Canada have typically reported on a limited aspect of quality (e.g. completeness) or a limited set of data elements [8-10], or have assessed quality more broadly without focusing on a specific context for use [11,12]. The objective of this study was to comprehensively assess the quality of primary care EMR data in Alberta, Canada within the context of hypertension.

Data source
The Canadian Primary Care Sentinel Surveillance Network (CPCSSN) is a collaboration of eleven practice-based research networks (PBRNs) across Canada that manage the extraction, cleaning and processing of de-identified EMR data from primary care settings [13]. At present, over 1200 primary care providers and 1.8 million patients contribute data from eight provinces and territories [14]. National CPCSSN data have previously been used to report on the epidemiology of many conditions in primary care, such as hypertension [5], diabetes [15], depression [16], osteoarthritis [17], dementia [18], chronic obstructive pulmonary disease [19], and others. The CPCSSN organization and its data extraction and processing have been described elsewhere [13,20]. This data quality assessment used primary care EMR data obtained by the two PBRNs in the province of Alberta: the Northern and Southern Alberta Primary Care Research Networks (NAPCReN and SAPCReN, respectively). Because healthcare in Canada is organized and delivered separately within each province or territory, only one province (Alberta) was chosen for the data quality assessment in order to minimize variation in the data due to interprovincial differences such as healthcare delivery and practice, drug coverage, health information legislation, EMR uptake and extent of use, types of EMR systems available, and many other factors [21,22].
In Alberta, there were 323 providers (mostly family physicians with a small proportion of nurse practitioners and community pediatricians) participating from 53 primary care practices. This represents slightly over 5% of the total number of family physicians in Alberta [23]. As of June 2018, de-identified EMR data were extracted from 397,518 patients in total; this reflected approximately 9.2% of Alberta's general population of 4.3 million people [24]. The CPCSSN data have previously been found to overrepresent older adults and women [25], but this is typical of primary care populations.
Currently, CPCSSN in Alberta extracts from five distinct EMR systems: Wolf, Med Access, Practice Solutions Suite, Accuro and Healthquest. The earliest (or 'start') date of information in the CPCSSN database varies by clinic and by patient, depending on when a clinic first implemented its EMR system, as well as when the patient first attended the clinic.

Patient sample
Adult patients (18 years and older) who had at least one primary care encounter in the previous two years (July 1, 2016 to June 30, 2018) were included, in order to establish an 'active' patient population. Any patient who was recorded as 'deceased' or 'inactive' in the EMR was excluded, as were any patients or providers who had explicitly requested to opt out of the CPCSSN database. The data quality assessment focused specifically on patients with hypertension who were identified using a CPCSSN-developed definition [26]. The hypertension definition consisted of a combination of International Classification of Diseases, Ninth Revision (ICD-9) codes (401, 402, 403, 404, 405) and medications located throughout the EMR: a minimum of two physician billing codes within two years, or any occurrence of a diagnosis in the Problem List/Profile, or a prescription for an anti-hypertensive medication (with medication criteria alone being insufficient if other specific diagnoses exist, such as heart failure or diabetes) [26]. The definition was validated using chart reviews as the reference standard and demonstrated good sensitivity (84.9%) and specificity (93.5%) [26].
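To make the logic of the case definition concrete, the following Python sketch encodes a simplified reading of the criteria described above. It is an illustration only and not the validated CPCSSN algorithm [26]: the function names, the input structures, the 730-day approximation of 'within two years', and the single boolean standing in for the excluding diagnoses are all assumptions introduced here.

```python
from datetime import date

# ICD-9 categories named in the text; matching by prefix is an assumption.
HTN_ICD9_PREFIXES = ("401", "402", "403", "404", "405")
TWO_YEARS_DAYS = 730  # assumption: "within two years" approximated as 730 days


def is_htn_code(icd9):
    """True if an ICD-9 code falls under 401-405 (e.g. '401' or '401.9')."""
    return icd9.strip().startswith(HTN_ICD9_PREFIXES)


def two_billings_within_two_years(billing_dates):
    """True if any two hypertension billing dates are at most 730 days apart."""
    ordered = sorted(billing_dates)
    # After sorting, the closest pair of dates is always an adjacent pair.
    return any((b - a).days <= TWO_YEARS_DAYS for a, b in zip(ordered, ordered[1:]))


def meets_htn_definition(billings, problem_list, has_antihypertensive_rx, has_excluding_dx):
    """Simplified, hypothetical reading of the hypertension case definition.

    billings: list of (icd9_code, date) tuples from billing records
    problem_list: list of ICD-9 codes from the Problem List/Profile
    has_antihypertensive_rx: True if any anti-hypertensive prescription exists
    has_excluding_dx: True if another diagnosis (e.g. heart failure, diabetes)
        makes the medication criterion insufficient on its own
    """
    htn_billing_dates = [d for code, d in billings if is_htn_code(code)]
    if two_billings_within_two_years(htn_billing_dates):
        return True
    if any(is_htn_code(code) for code in problem_list):
        return True
    return has_antihypertensive_rx and not has_excluding_dx
```

For example, a patient whose only evidence is an anti-hypertensive prescription would be counted unless an excluding diagnosis is also present.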

Data quality assessment
The data quality assessment was a cross-sectional, descriptive evaluation guided by reporting recommendations for distributed data networks [27]. Data elements were selected based on their potential use and relevance for hypertension surveillance, as well as availability in the CPCSSN data. These included: patient demographics; physical examinations (weight, height, body mass index [BMI], and systolic and diastolic blood pressure); laboratory values (high density lipoprotein [HDL] cholesterol, low density lipoprotein [LDL] cholesterol, total cholesterol, triglycerides, fasting blood glucose, glycated hemoglobin [HbA1C]); anti-hypertensive medications (defined using categories of the relevant groups of Anatomical Therapeutic Chemical [ATC] codes: C02*, C03*, C07*, C08*, C09*); and risk factor records for smoking and alcohol use. Only the CPCSSN-processed/coded values were used, as these are typically the data elements that are accessible from CPCSSN for secondary purposes. A full description of all data elements can be found in the CPCSSN Data Dictionary online [14].
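The ATC-based medication grouping can be illustrated with a short Python sketch that matches a coded medication to one of the five ATC groups listed above by its three-character prefix. The group labels follow the WHO ATC second-level names; the function name and the prefix-matching approach are assumptions made for illustration, not a description of the CPCSSN coding process.

```python
# Second-level ATC groups named in the text, with their WHO labels.
ATC_ANTIHYPERTENSIVE_GROUPS = {
    "C02": "antihypertensives",
    "C03": "diuretics",
    "C07": "beta blocking agents",
    "C08": "calcium channel blockers",
    "C09": "agents acting on the renin-angiotensin system",
}


def atc_group(atc_code):
    """Return the anti-hypertensive group for an ATC code, or None if not in scope."""
    prefix = atc_code.strip().upper()[:3]
    return ATC_ANTIHYPERTENSIVE_GROUPS.get(prefix)
```

Under this sketch, a code such as C09AA05 falls in the renin-angiotensin group, while a code outside the five groups returns None and would be left out of the medication summaries.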
Summary statistics were reported for continuous variables, which included range, mean, and median. Proportions (restricted to the three most frequent values) and number of unique values were described for categorical variables. Missingness was reported as a proportion of patients without a recorded data element (e.g. height) or record (e.g. medication, smoking); missingness of specific items within a record was also reported (e.g. dose in medication record). Data completeness was also represented visually by clinic and EMR type.
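The descriptive measures above are simple to compute. The sketch below shows one possible way to do so in Python, assuming each patient is represented as a dictionary of data elements; the function names and the record layout are hypothetical, and the original analysis was carried out in RStudio rather than with this code.

```python
from collections import Counter
from statistics import mean, median


def missing_proportion(patients, field):
    """Share of patients with no recorded value for `field` (absent, None or empty)."""
    missing = sum(1 for p in patients if p.get(field) in (None, ""))
    return missing / len(patients)


def summarize_continuous(values):
    """Range, mean and median for a continuous variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


def summarize_categorical(values, n_top=3):
    """Number of unique values and proportions of the n_top most frequent ones."""
    counts = Counter(values)
    total = len(values)
    top = [(value, count / total) for value, count in counts.most_common(n_top)]
    return {"unique": len(counts), "top": top}
```

Applying `missing_proportion` separately to the patients of each clinic or EMR system would give the inputs for a completeness display by clinic and EMR type.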
Several temporal aspects of the data were examined: the proportion of patients who had at least one physical exam measurement or laboratory value documented in the previous year (July 1, 2017 to June 30, 2018) was reported, in addition to the proportion of risk factor (i.e. smoking and alcohol) and medication records that contained a stop/end date prior to the start date. Patient-level weight values over time were explored by plotting the difference between subsequent weight measurements against the length of time (days) between them, for individuals with at least two weight measurements.
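The two date-based checks described above can be sketched in Python as follows. The record layout (dictionaries with 'start' and 'stop' keys, and a mapping from patient identifier to measurement dates) and the function names are assumptions for illustration; records missing either date are not treated as inconsistent in this sketch.

```python
from datetime import date


def stop_before_start_proportion(records):
    """Proportion of records whose stop/end date precedes the start date."""
    flagged = sum(
        1
        for r in records
        if r.get("start") and r.get("stop") and r["stop"] < r["start"]
    )
    return flagged / len(records)


def measured_in_window_proportion(dates_by_patient, window_start, window_end):
    """Proportion of patients with at least one measurement inside the window."""
    hits = sum(
        1
        for dates in dates_by_patient.values()
        if any(window_start <= d <= window_end for d in dates)
    )
    return hits / len(dates_by_patient)
```

For the 'previous year' check in this study, the window would run from `date(2017, 7, 1)` to `date(2018, 6, 30)`.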
External validity was evaluated by comparing the most recent crude hypertension prevalence estimates from three national population-level sources: administrative data from the Canadian Chronic Disease Surveillance System (CCDSS), consisting of physician billing claims, hospitalizations and prescription drug records [28]; the Canadian Health Measures Survey (CHMS), which defines hypertension based on standardized, direct BP measurements and health-related interviews [29]; and self-reported high BP from the Canadian Community Health Survey (CCHS) [30]. Hypertension prevalence estimates from the national CPCSSN data [5] were also used as a comparison to the regional-level (Alberta) data.
RStudio version 1.1.456 was used for the analysis, which was conducted in 2019. This study was approved by the University of Calgary's Conjoint Health Research Ethics Board (REB17-1825) and the University of Alberta's Health Research Ethics Board (Pro00079372).

Results
In the CPCSSN data for Alberta, there were 205,364 adult patients who had at least one primary care encounter in the previous two years; of these, 48,377 patients were identified with hypertension and were not labelled 'inactive' at the practice or deceased. Patients in the hypertension sample had a median of 8.0 years (IQR 7) of information in their record. Figure 1 provides a visual summary of the completeness of data for patient demographics, physical measurements, and smoking status by each of the 53 clinics and 5 EMR systems included in the data quality assessment. The data element characterization in Tables 1, 2, 3 and 4 provides a more in-depth examination of the quality of hypertension-related variables.

Patient demographics
Birth year was complete for all patients, as was sex (with the exception of two patients). However, all other socio-demographic information on patients was mostly incomplete (Table 1). For those who had some information recorded in ethnicity, occupation, or education fields, the data were highly inconsistent; for instance, over 3500 unique entries were recorded for occupation and more than 75 distinct entries were found for ethnicity.

Height, weight, BMI
Approximately 10% of patients were missing a height or BMI value and even fewer patients were missing weight (Table 2). Males had a median of four measurements for height, weight, or BMI and females had five, with these measurements showing a skewed distribution. From the lower and upper ranges of the height, weight and BMI values, it appears that data errors are present. For example, female weight values ranged from 1.8 to 477 kg, which is biologically unlikely. When plotting patient-level height and weight values (Fig. 2), points located outside the main cluster visually identify specific data errors. For instance, the vertical line of points approaching 0 on the x (weight) axis might indicate a data entry error (e.g. weight entered as 10 instead of 100) or swapped height and weight values (e.g. height in metres entered in the weight field). Another observable area of atypical points was between 150 and 200 on the x (weight) axis, which potentially represents height and weight values that were entered in the wrong fields (e.g. weight = 175 and height = 100 recorded instead of weight = 100 kg and height = 175 cm). Figure 3 investigates possible errors in weight values using successive patient-level weight measurements for those who had at least two weight values recorded in their EMR (n = 39,202). Changes in an individual's weight would be expected to show more variability as the time between measurements increases (e.g. patient weight recorded 10 days apart should have minimal difference, whereas weight measurements taken several years apart might show a more significant change). Two peaks centred around 100 and −100 on the y-axis emerged as potentially problematic data: in a relatively short time period between measurements, the difference between successive weight measurements was approximately 100 kg for patients clustered around those two peaks. This likely represents inconsistencies in the unit of measurement (e.g. kilograms versus pounds) for subsequent weight measurements for a given individual. However, the extent of the problem was not substantial: the majority of weight values (94.8%) occurred within two standard deviations of the central peak (mean −0.29), and at least one potential data error (i.e. outside two standard deviations) at any time was detected in the records of 18.4% of patients.
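The successive-difference check on weight can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function names and the (date, weight) input layout are hypothetical, and using the population standard deviation of the pooled differences is one reasonable reading of 'two standard deviations'.

```python
from datetime import date
from statistics import mean, pstdev


def successive_differences(measurements):
    """(days between, kg difference) for consecutive weights of one patient.

    measurements: list of (date, weight_kg) tuples in any order.
    """
    ordered = sorted(measurements)
    return [
        ((d2 - d1).days, w2 - w1)
        for (d1, w1), (d2, w2) in zip(ordered, ordered[1:])
    ]


def flag_outlying(differences, n_sd=2.0):
    """Flag differences more than n_sd standard deviations from the pooled mean."""
    centre = mean(differences)
    spread = pstdev(differences)
    return [abs(d - centre) > n_sd * spread for d in differences]
```

A patient whose record alternates between kilograms and pounds would produce a large jump over a short interval, which this kind of flag is intended to surface for review rather than to correct automatically.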

Blood pressure
BP measurements were well recorded in terms of completeness (99%) and the majority of patients (85%) had at least one measurement recorded in the previous year (Table 2). However, BP values at the minimum and maximum ends of the range may indicate data errors (Table 2). These values could be biologically possible, but would be very unlikely in an outpatient setting; for instance, a systolic BP of 52 might indicate shock and a systolic BP of 290 would be an emergency event. In addition, CPCSSN sets limits on BP values when processing the raw EMR data (50-300 mmHg for sBP; 20-200 mmHg for dBP), so the reported range may underestimate the true range of values in the source EMR.
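The effect of such processing limits can be illustrated with a small Python sketch that separates readings into those inside and outside the stated bounds. Treating the bounds as inclusive, and dropping a reading when either component is out of range, are assumptions made here; the source does not describe how CPCSSN handles boundary or partially valid readings.

```python
# Limits stated in the text (mmHg); inclusive comparison is an assumption.
SBP_LIMITS = (50, 300)
DBP_LIMITS = (20, 200)


def within_limits(value, limits):
    """True if value lies inside the (low, high) limits, bounds included."""
    low, high = limits
    return low <= value <= high


def split_bp_readings(readings):
    """Split (sbp, dbp) pairs into in-range and out-of-range lists."""
    kept, dropped = [], []
    for sbp, dbp in readings:
        if within_limits(sbp, SBP_LIMITS) and within_limits(dbp, DBP_LIMITS):
            kept.append((sbp, dbp))
        else:
            dropped.append((sbp, dbp))
    return kept, dropped
```

Reporting the size of the out-of-range list alongside the summary statistics would show how much of the source variation the limits remove.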
Male patients had a median of 16 total BP measurements recorded in their EMR and females had slightly more (median = 18). A small proportion of patients had large numbers of annual BP measurements; for instance, 4.2% of females and 4.0% of males were above the 95th percentile for number of BP measurements (greater than 10) in 2017 (data not shown).

Laboratory values
Of the laboratory tests measuring blood glucose, HbA1C values were present in the EMR more often than fasting glucose (88% versus 79% of patients), and more patients had an HbA1C test result in the previous year compared to the fasting glucose test (Table 2).
The lipid values included in this assessment (LDL, HDL, total cholesterol, triglycerides) were available for the majority of patients in this cohort (at least 91%, varying by lab type), with a median of 4-5 values for each patient in the EMR (Table 2). Female patients were observed to have slightly fewer lipid values present in their EMR compared to male patients (Table 2).
For all types of lab results, the upper and lower limits were unlikely to be seen in an outpatient setting (i.e. primary care) and many values were beyond a biologically plausible range (e.g. HDL and LDL lower value = 0). This points to likely data errors at the upper and lower ends of the range of values; however, these affected only a very small proportion of lab values.

Hypertensive medications
The vast majority of males (92%) and females (93%) with hypertension had at least one recorded anti-hypertensive prescription, with a median of six anti-hypertensive medication prescriptions per person (Table 3). The medication records themselves were fairly complete; all records contained a start date and most contained a stop date, strength, dose, frequency, duration, and count. Drug Identification Number (DIN) and 'reason for medication' were mostly incomplete, with DIN missing in over half of medication records and 'reason' missing in over three-quarters of records.

Smoking and alcohol status
Within the Risk Factor section in the EMR, nearly 80% of patients had a smoking status recorded, with 'Unknown' and 'Never' as the most frequently recorded categories (Table 4). However, after excluding the uninformative 'Unknown' smoking status, a total of 31,976 patients (66.1%) and 68,110 records remained across three categories: 'Current', 'Past' or 'Never' (data not shown). Males and females had a similar number of smoking records per person (median = 1; mean = 3). All start and end dates were missing from the records. More males than females had their alcohol use recorded (47% and 40%, respectively) and these records were primarily for 'Current' users (Table 4), indicating that alcohol use is likely recorded differentially between users and non-users. Patients had a mean of 2 alcohol records in their EMR (median of 1) and no records contained start or end dates.
Of note, a 'Date Created' field exists for both smoking and alcohol records. This field indicates when the record was created in the EMR system but does not necessarily correspond to the start of the risk behaviour. 'Date Created' was present in 80.6% of smoking records and 76.6% of alcohol use records.

External validity
The overall crude estimate for Alberta-specific hypertension prevalence in the CPCSSN data (23.6%) was similar to the 2014-15 physical measures survey (CHMS) (23.3%) and was also comparable to the national CPCSSN estimate (22.8%) (

Discussion
This paper describes the quality of primary care EMR data in Alberta within the context of utilization for hypertension surveillance and epidemiology. Overall, there was observable variability due to the type of EMR system, between clinics, and among the data elements themselves. As this assessment focused on patients with hypertension, it was not surprising to see blood pressures and prescribed medication records that were largely complete and contained minimal outliers; these data constitute a particularly valuable contribution for surveillance purposes, given that BP and prescribing information are not available in administrative data or are limited (i.e. cross-sectional) in survey data. Although these data cannot confirm whether a patient has filled their prescription or is adherent, the information within the medication records is relatively complete and can be used to approximate persistence/adherence, for example, by calculating medication possession ratio or using similar methods [31].
The select laboratory values were present in the EMR of the majority of patients in this cohort, with the exception of fasting blood glucose. This aligns with current clinical guidelines recommending routine testing of lipids and blood glucose/glycated hemoglobin for individuals with hypertension [32]. Although most laboratory test results in Alberta are imported directly into the EMR from the community lab provider, data quality issues were still present, albeit to a very small degree. The observed range of values demonstrated upper and lower limits that are not likely in an outpatient setting and some that were biologically implausible (e.g. 0 mmol/L for LDL and HDL; 43 mmol/L for fasting glucose). These errors may have been introduced during the import of lab results to the EMR or during the CPCSSN processing to convert different units of measurement to a standard unit (e.g. mmol/mol to % for HbA1C).
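As an illustration of the kind of unit conversion mentioned above, the following Python sketch converts HbA1C between IFCC units (mmol/mol) and NGSP units (%) using the published IFCC-NGSP master equation, NGSP = 0.09148 × IFCC + 2.152. The source does not state which equation or code CPCSSN applies, so the function names and the use of this particular equation are assumptions.

```python
# Coefficients of the IFCC-NGSP master equation for HbA1c.
SLOPE = 0.09148
INTERCEPT = 2.152


def ifcc_to_ngsp(hba1c_mmol_mol):
    """Convert HbA1c from IFCC units (mmol/mol) to NGSP units (%)."""
    return SLOPE * hba1c_mmol_mol + INTERCEPT


def ngsp_to_ifcc(hba1c_percent):
    """Convert HbA1c from NGSP units (%) back to IFCC units (mmol/mol)."""
    return (hba1c_percent - INTERCEPT) / SLOPE
```

A value of 53 mmol/mol corresponds to about 7.0% under this equation. A result stored in one unit but labelled as the other would be converted to an implausible value, which is one way a conversion step could introduce the out-of-range results described above.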
Other information, such as sociodemographic, height, weight, and risk factor data, was more inconsistent and less complete. Although achieving 100% completeness for all data elements may not be realistic, it is not unreasonable to aim for near complete information for these data elements at the point of care. Cardiovascular disease guidelines suggest that smoking status should be updated on a regular basis and, given that screening is often risk based, information about alcohol use, height, weight, BMI, and ethnicity is particularly important to document for a hypertensive cohort [32]. However, distinguishing between data that are missing due to inadequate data entry and data that are missing because they were not extracted is difficult. One significant challenge when addressing poor data quality is determining the source of the issue. For instance, missing data may be due to the unavailability of these fields in certain EMR systems (in which case, missingness will always exist); patients might not be asked about specific topics, such as alcohol use or ethnicity, or they may decline to answer; lastly, the CPCSSN processes may omit extraction from certain fields of the EMR either deliberately (e.g. identifiable fields or physician notes) or unintentionally (e.g. if an EMR system upgrade changes the names of data elements, which would subsequently affect the CPCSSN extraction code). Identifying true inaccuracies in the data is similarly problematic; this may be possible for some data elements through a chart review, with particular attention to the detailed physician notes and scanned documents (i.e. specialist letters, diagnostic imaging) that are not currently captured in the CPCSSN data. However, this is a time-intensive method and the structured EMR fields are likely to contain the same errors and omissions as the CPCSSN data. Beyond this, confirming with or measuring patients directly to verify data elements in the EMR would most accurately reveal true data errors, but this method is also the least feasible.
Therefore, the most appropriate strategies for preventing and mitigating EMR data quality issues should be multifaceted and involve a variety of settings. CPCSSN has largely taken a post-extraction analytic approach to data improvement. This includes extensive cleaning and coding algorithms, the development and validation of case definitions for various conditions that are made available as part of the database [26,33-35], and exploration of more advanced techniques like natural language processing [36] and machine learning [35]. As an example, CPCSSN is currently developing a pattern-matching algorithm that aims to enhance the completeness and accuracy of smoking records. In the raw or original EMR data, some additional information related to smoking, such as frequency of tobacco use and quantity of tobacco units consumed (e.g. cigarettes/cigars, packs), is present but primarily in unstructured, lengthy text strings that are not useful for analysis, may also contain identifiable patient information, and are therefore not currently available to researchers. The pattern-matching algorithm is designed to extract only smoking-related information from the free text and categorize the record into a defined smoking status, leading to more available coded data for researchers to access. A number of other strategies have been shown to improve the completeness and accuracy of EMR data. Some occur at the practice level, such as employing a dedicated data entry clerk [37] or providing data quality audit and feedback reports to clinicians [38]. Other initiatives require more substantial resources and uptake, such as mandated national EMR content standards [39], developing EMR interfaces that are easier to navigate and contain more structured fields, and promoting financial or other incentives for 'meaningful EMR use' [40].
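To give a sense of what pattern matching on free-text smoking notes can look like, the Python sketch below assigns a note to 'Never', 'Past', 'Current' or 'Unknown' using a few regular expressions. It is a toy illustration only: the phrases, their ordering, and the function name are assumptions introduced here, and the sketch does not represent the algorithm under development at CPCSSN.

```python
import re

# Hypothetical phrase lists. Order matters: negations and past-use phrases are
# checked before the generic "smoker"/"smokes" patterns.
SMOKING_PATTERNS = [
    ("Never", re.compile(
        r"\b(never smoked|never smoker|non-?smoker|does not smoke|denies smoking)\b",
        re.IGNORECASE)),
    ("Past", re.compile(
        r"\b(ex-?smoker|former smoker|quit|stopped smoking)\b",
        re.IGNORECASE)),
    ("Current", re.compile(
        r"\b(current smoker|smokes|smoker|\d+(\.\d+)?\s*(ppd|packs?|cig(arette)?s?))\b",
        re.IGNORECASE)),
]


def classify_smoking(text):
    """Return the first matching smoking status for a free-text note."""
    for status, pattern in SMOKING_PATTERNS:
        if pattern.search(text):
            return status
    return "Unknown"
```

A note such as 'Ex-smoker, quit 2010' would be coded as 'Past', while text with no recognizable phrase stays 'Unknown'. Any real implementation would need validation against chart review, since short phrase lists like these miss many ways clinicians describe tobacco use.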
In the future, routine linkage to other data sources, like administrative health data, could enhance quality by providing a mechanism to verify certain aspects of EMR data and expand the breadth of information about individual patients throughout the broader healthcare system.

Limitations
First, this paper provides a quality assessment of select CPCSSN data elements deemed to be important for hypertension surveillance or research, but it was not possible to examine and report on all variables contained in the CPCSSN database in a single manuscript, nor was it possible to examine discrete cardiovascular outcomes related to hypertension (for example, hospitalization for myocardial infarction), as this information is usually contained in other databases external to CPCSSN or captured in the EMR in an inaccessible format (e.g. PDF document, free text notes). It was also not feasible to quantify the true accuracy of data elements, other than appraising the plausibility of values through descriptive means. Second, during the CPCSSN processing and data transformation stages for physical exam measurements and some lab types, restrictions are introduced for out-of-bounds values and thus, the summary statistics presented in this paper may not reflect the full variation of values originating from the source EMR. In addition, any changes or improvements made to the CPCSSN processing may result in slight differences in the CPCSSN EMR database between each extraction cycle. Third, although the CPCSSN definition for hypertension demonstrated high sensitivity and specificity, a potential for misclassification still exists. Misclassification may have led to an underestimate of the number of patients with hypertension or produced a patient sample that is biased towards a greater severity of illness. Lastly, the quality was described specifically for CPCSSN data from Alberta and within the context of hypertension. This is not a population-level data source and only constitutes a sample of participating providers and patients who have sought care. Thus, the overall findings may not be representative of the wider Alberta population or of other provinces or territories that participate in CPCSSN, and may also differ in other disease-based contexts. However, CPCSSN has developed uniform data extraction, processing, and standardization methods across the country, which may allow other regional networks to carry out the same data quality assessment for comparison.

Conclusion
Primary care EMR data are a valuable data source for hypertension surveillance and epidemiology. The high-quality and longitudinal blood pressure and prescribed antihypertensive medication data are particularly useful, as these types of data are not found in traditional administrative databases. Other data elements, such as sociodemographics, physical examination values, laboratory results, and risk factor information, exhibited variation in quality. These data elements may be less useful in their current state but offer promising value in the future once data quality issues can be addressed through additional pre- or post-extraction solutions.