Study population
We used data from The Health Improvement Network (THIN), a large database of anonymised electronic primary care records from 559 general practices across the UK. We identified for analysis all individuals aged 16 years or over who had been registered with a THIN practice throughout 2011 and whose record indicated that they were a current smoker. Current smoking was defined as a Read code indicative of current smoking recorded during 2011 or, for those with no smoking status recorded in 2011, recorded since 1 October 2009. Since the Quality and Outcomes Framework (QOF) requires general practitioners (GPs) to record the smoking status of their patients at least every 27 months, all individuals should have had a smoking status recorded in this timeframe [10].
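The smoking-status rule above can be sketched in Python as follows. This is a minimal illustration and not the extraction code used in the study: the record layout, the status labels ("current", "ex", "never") and the function names are hypothetical, and the treatment of patients with conflicting codes within 2011 is an assumption, since the text does not specify it.

```python
from datetime import date

STUDY_START = date(2011, 1, 1)
STUDY_END = date(2011, 12, 31)
LOOKBACK_START = date(2009, 10, 1)  # 27 months before the end of 2011


def registered_throughout_2011(reg_start, reg_end):
    """Return True if registration covers the whole of 2011.

    reg_end is None when the patient is still registered.
    """
    return reg_start <= STUDY_START and (reg_end is None or reg_end >= STUDY_END)


def is_current_smoker(smoking_records):
    """Classify a patient from (record_date, status) pairs.

    status is a hypothetical label ("current", "ex", "never") assumed to
    have been derived from Read codes beforehand.
    """
    records = list(smoking_records)
    in_2011 = [s for d, s in records if STUDY_START <= d <= STUDY_END]
    if in_2011:
        # Assumption: any current-smoking code in 2011 qualifies.
        return "current" in in_2011
    # No status recorded in 2011: fall back to codes since 1 October 2009.
    lookback = [s for d, s in records if LOOKBACK_START <= d < STUDY_START]
    return "current" in lookback
```

Under these assumptions, a patient whose only smoking record is a current-smoking code from 2010 is included, whereas a patient with an ex-smoker code in 2011 and no current-smoking code that year is not.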
For each of these patients, we identified whether they had received a prescription for varenicline during 2011, using drug codes from the British National Formulary (BNF). We also extracted data on age (categorised as 16 to 30, 31 to 45, 46 to 60 and 61 or over), gender, heavy smoking (coded as >20 cigarettes/day), weekly alcohol consumption above recommended levels (>14 units/week for women and >21 units/week for men) and obesity (body mass index >30). We extracted data on medical conditions based on the clinical disease indicators in the QOF, including asthma, atrial fibrillation, cancer, coronary heart disease (CHD), chronic kidney disease (CKD), chronic obstructive pulmonary disease (COPD), dementia, anxiety or depression, diabetes, epilepsy, heart failure, hypertension, learning disability, psychotic disorders, stroke or transient ischaemic attack (TIA) and thyroid disorders. Any diagnosis made before the end of 2011 was included.
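The categorisation of these characteristics can be illustrated with a short Python sketch. The function names and the "F"/"M" sex coding are hypothetical, and the sketch assumes that numeric values (cigarettes per day, units per week, body mass index) are available, whereas in practice such characteristics may be recorded directly as codes.

```python
def age_band(age):
    """Map age in years to the four bands used in the analysis."""
    if age < 16:
        raise ValueError("study population is restricted to ages 16 and over")
    if age <= 30:
        return "16 to 30"
    if age <= 45:
        return "31 to 45"
    if age <= 60:
        return "46 to 60"
    return "61 or over"


def is_heavy_smoker(cigarettes_per_day):
    """Heavy smoking: more than 20 cigarettes per day."""
    return cigarettes_per_day > 20


def over_alcohol_limit(units_per_week, sex):
    """Above recommended levels: >14 units/week (women), >21 units/week (men).

    sex is a hypothetical coding: "F" for women, "M" for men.
    """
    limit = 14 if sex == "F" else 21
    return units_per_week > limit


def is_obese(bmi):
    """Obesity: body mass index above 30."""
    return bmi > 30
```

Each function yields a single categorical or binary item per patient, which is the form required for the rule mining described in the Analysis section.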
Ethics statement
Ethical approval for the use of THIN data for this study was obtained from the THIN Scientific Review Committee (reference number 10–022A).
Analysis
We used association rule mining (ARM) to discover the combined characteristics of groups of patients who did, or did not, receive a varenicline prescription. ARM is a well-established data mining technique that has been widely applied in many areas. ARM identifies rules of the form A ⇒ B, where A is a set of exposures and B is a set of outcomes, and A and B have no items in common. We used ARM to identify rules for sets of characteristics (A) of patients who did, and did not, receive a prescription for varenicline during 2011 (B), carrying out two separate analyses for these two outcomes (receiving varenicline/NOT receiving varenicline). ARM rules are produced to meet minimum constraints on the level of ‘support’ for the rule, that is, the proportion or number of the study population, in this case current smokers, with the characteristics in set A. Since we wanted to identify numerically important exposure groups while ensuring that each of the patient groups represented by the disease indicators could potentially be included in our rules, we set a minimum exposure group size of 500. The rules were listed in order of ‘confidence’, that is, the probability of B given A, or the probability of receiving, or not receiving, a prescription given that the patient has each of the characteristics in set A. A more detailed description of our ARM approach is provided in our previous study, which investigated the characteristics of smokers who received any smoking cessation prescription in primary care [9]. We used the open-source data mining software WEKA [11] to undertake the ARM analyses.
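To make the definitions of support and confidence concrete, the following Python sketch enumerates candidate exposure sets by brute force, keeps those whose group size meets the minimum, and ranks them by confidence. It is an illustration only and does not reproduce the WEKA implementation used in the study; the function name, the data layout and the cap on the number of characteristics per rule (max_items) are assumptions.

```python
from itertools import combinations


def mine_rules(patients, outcome, min_group_size, max_items=3):
    """Rank exposure sets A by confidence for a fixed outcome B.

    patients: list of (characteristics, received_varenicline) pairs, where
        characteristics is a set of strings.
    outcome: True to mine rules for receiving varenicline, False for not
        receiving it (the two analyses are run separately).
    min_group_size: minimum number of patients having every item in A.
    max_items: hypothetical cap on the size of A, to keep enumeration small.
    """
    items = sorted(set().union(*(chars for chars, _ in patients)))
    rules = []
    for k in range(1, max_items + 1):
        for antecedent in combinations(items, k):
            required = set(antecedent)
            # Support: patients with all the characteristics in A.
            group = [got for chars, got in patients if required <= chars]
            if len(group) < min_group_size:
                continue
            # Confidence: proportion of that group with outcome B.
            confidence = sum(1 for got in group if got == outcome) / len(group)
            rules.append((antecedent, len(group), confidence))
    # List rules in descending order of confidence.
    rules.sort(key=lambda rule: rule[2], reverse=True)
    return rules


# Toy data for illustration only; not drawn from THIN.
toy = [
    ({"COPD", "age 46 to 60"}, True),
    ({"COPD"}, True),
    ({"COPD", "diabetes"}, False),
    ({"diabetes"}, False),
]
prescribed_rules = mine_rules(toy, outcome=True, min_group_size=2)
not_prescribed_rules = mine_rules(toy, outcome=False, min_group_size=2)
```

In the toy data, the exposure set {COPD} has a group size of 3 and a confidence of 2/3 for receiving a prescription, while {diabetes} has a group size of 2 and a confidence of 1 for not receiving one. Dedicated ARM algorithms such as Apriori avoid exhaustive enumeration by pruning any exposure set that extends one already below the minimum support.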