Can cluster analyses of linked healthcare data identify unique population segments in a general practice-registered population?

Background Population segmentation is useful for understanding the health needs of populations. Expert-driven segmentation is a traditional approach which involves subjective decisions on how to segment data, with no agreed best practice. The limitations of this approach are theoretically overcome by more data-driven approaches such as utilisation-based cluster analysis. Previous explorations of using utilisation-based cluster analysis for segmentation have demonstrated feasibility but were limited in potential usefulness for local service planning. This study explores the potential for practical application of using utilisation-based cluster analyses to segment a local General Practice-registered population in the South Wales Valleys.
Methods Primary and secondary care datasets were linked to create a database of 79,607 patients including socio-demographic variables, morbidities, care utilisation, cost and risk factor information. We undertook utilisation-based cluster analysis, using k-means methodology to group the population into segments with distinct healthcare utilisation patterns based on seven utilisation variables: elective inpatient admissions, non-elective inpatient admissions, outpatient first & follow-up attendances, Emergency Department visits, GP practice visits and prescriptions. We analysed segments post-hoc to understand their morbidity, risk and demographic profiles.
Results Ten population segments were identified which had distinct profiles of healthcare use, morbidity, demographic characteristics and risk attributes. Although half of the study population were in segments characterised as ‘low need’ populations, there was heterogeneity in this group with respect to variables relevant to service planning – e.g. settings in which care was mostly consumed. Significant and complex healthcare need was a feature across age groups and was driven more by deprivation and behavioural risk factors than by age and functional limitation.
Conclusions This analysis shows that utilisation-based cluster analysis of linked primary and secondary healthcare use data for a local GP-registered population can segment the population into distinct groups with unique health and care needs, providing useful intelligence to inform local population health service planning and care delivery. This segmentation approach can offer a detailed understanding of the health and care priorities of population groups, potentially supporting the integration of health and care, reducing fragmentation of healthcare and reducing healthcare costs in the population.


Background
Globally, health care systems are increasingly interested in population health. In many developed countries, improvements in life expectancy have slowed or stalled and health inequalities are increasing [1]. Population health as an approach seeks to improve physical and mental health outcomes, promote wellbeing and reduce health inequalities across whole populations. Growing interest in population health is possibly due to recognition of the challenges facing health care systems: rising costs, ageing populations, unhealthy lifestyle choices and deepening poverty in society [2]. These challenges lend themselves to the explanatory and interventional models inherent in the population health approach. At the core of this approach is the goal of improving health outcomes for whole populations, not just for those seeking care, while paying attention to the distribution of those outcomes within the population [3].
One of the key pillars of population health is person-centred integration of health and care systems, a reflection of the need to reduce fragmentation of care around the growing numbers of patients with multiple long-term conditions [1]. Person-centred care is, however, not feasible if, in population health policy terms, it implies developing care pathways unique to every individual in the population [4]. Population segmentation, which involves grouping populations on the similarity of one or more proxies of health needs, potentially allows definition of population groups for whom integrated health and care interventions across the continuum of care can be tailored [5].
Two broad approaches to population segmentation have evolved in recent years. In traditional (or expert-driven) approaches, a population is segmented on a priori, expert-defined criteria informed by literature review and consensus [6]. For example, the Suicide and Self Harm Prevention Strategy for Wales 2015-2020 highlights the need to focus preventative efforts towards men aged 15-44 years [7]. In England, the London Health Commission segmented the population of London based on morbidity and age group [8]. This approach is limited by the lack of generally agreed ways of: (i) knowing the number of natural clusters in the population, and (ii) determining the variables on which to base segmentation. Furthermore, grouping populations on criteria such as age and morbidity does not accurately reflect actual use of health and care services.
More recently, population segmentation based on health and care utilisation has gained recognition as an alternative. This data-driven segmentation approach potentially generates detailed insight into the needs of populations using a variety of analytical methods applied to large integrated datasets from various health and care settings [9]. A recent study exploring this approach was limited by failure to include data on use of A&E care [10]. In addition, it was based on a random selection of General Practice-registered patients across England and therefore did not reflect local patterns, a critical component of local health and care planning and service delivery.
This study therefore set out to explore the potential for using utilisation-based cluster analyses to segment a local GP-registered (and geographically defined) population in the South Wales Valleys. This was done in two sequential steps: first, assessing whether utilisation-based cluster analyses could identify clusters of patients in the population based on healthcare utilisation parameters and, secondly, undertaking detailed profiling of the utilisation-based segments to indicate their healthcare needs [11].

Data
We created a pseudonymised integrated dataset by linking, at a patient level, primary and secondary care data for a population of about 80,000 people registered with General Practices in one geographical locality of the South Wales Valleys. For each patient, we identified seven healthcare utilisation variables on which cluster analyses were based: elective inpatient admissions, non-elective inpatient admissions, outpatient first attendances, outpatient follow-up attendances, A&E attendances, GP visits (specifically those for which a General Practitioner was seen) and count of distinct drugs used in the year. We selected these seven utilisation variables because they reflected different types of healthcare providers and use of health care resources across different parts of a health care system [10]. Five of these variables have previously been identified as suitable for data-driven, utilisation-based segmentation across healthcare providers without overlap [10]. The number of outpatient attendances was further broken down into first attendance and follow-up attendance, and the number of A&E attendances was included, as we felt that these offered additional understanding of the healthcare needs of our population. We also included data on patient characteristics such as long-term condition (LTC) diagnoses, age, deprivation, smoking status, cost and scores for risk of emergency admission in the next 12 months.
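As an illustration of how linked event records might be reduced to the seven per-patient utilisation counts described above, the following is a minimal sketch only. The variable labels, the record layout (tuples of patient identifier and event type) and the function name `build_utilisation_table` are assumptions made for the example and are not taken from the study dataset.

```python
from collections import defaultdict

# The seven utilisation variables named in the text. The string labels are
# hypothetical and are not field names from the study dataset.
UTILISATION_VARS = (
    "elective_admissions",
    "non_elective_admissions",
    "outpatient_first",
    "outpatient_follow_up",
    "ae_attendances",
    "gp_visits",
    "distinct_drugs",
)


def build_utilisation_table(patient_ids, events, prescriptions):
    """Return {patient_id: {variable: count}} for every registered patient.

    events: iterable of (patient_id, event_type) pairs, where event_type is
        one of the first six labels in UTILISATION_VARS.
    prescriptions: iterable of (patient_id, drug_name) pairs.
    """
    # Start every registered patient at zero so that non-users are retained.
    table = {pid: dict.fromkeys(UTILISATION_VARS, 0) for pid in patient_ids}
    for pid, event_type in events:
        table[pid][event_type] += 1  # KeyError flags an unlinked record

    # Prescribing is a count of distinct drugs, so repeats collapse in a set.
    drugs = defaultdict(set)
    for pid, drug_name in prescriptions:
        drugs[pid].add(drug_name)
    for pid, names in drugs.items():
        table[pid]["distinct_drugs"] = len(names)
    return table


# Toy records, invented for illustration only.
registered = ["p1", "p2", "p3"]
events = [("p1", "gp_visits"), ("p1", "gp_visits"),
          ("p1", "ae_attendances"), ("p2", "elective_admissions")]
prescriptions = [("p1", "drug_a"), ("p1", "drug_a"), ("p1", "drug_b")]
table = build_utilisation_table(registered, events, prescriptions)
```

Initialising every registered patient at zero matters in a whole-population analysis, because patients who use no care in the year would otherwise drop out of the table.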

Cluster analyses
We carried out two types of cluster analysis on the dataset sequentially. We first conducted hierarchical cluster analysis, which allows identification of the optimal number of clusters in the population because readily available stopping rules mean it does not require a priori selection of the number of clusters [12]. Given that hierarchical cluster methods are sensitive to outliers and are generally not suitable for larger datasets [13], we followed this with k-means non-hierarchical cluster analysis using a Euclidean distance. This method is efficient and can handle large datasets [14].
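For readers unfamiliar with k-means, the sketch below shows the basic procedure (Lloyd's algorithm) with Euclidean distance in plain Python. It is an illustration under stated assumptions and is not the Stata routine used in this study: the function name `kmeans`, the `init`, `max_iter` and `seed` parameters, and the toy points are all invented for the example.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Lloyd's algorithm with Euclidean distance.

    Returns (labels, centroids). If init is None, k observations are drawn
    at random as starting centroids (an assumption of this sketch).
    """
    if init is None:
        init = random.Random(seed).sample(list(points), k)
    centroids = [list(c) for c in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups, with one starting centroid taken from each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2, init=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each pass costs one distance calculation per point per centroid, which is why the method scales to large datasets where hierarchical clustering, needing all pairwise distances, does not.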
Hierarchical cluster analysis was conducted by selecting 10 random population subsets of size 3000 and calculating the pseudo F-statistic defined by Calinski and Harabasz [15]. This approach assessed cluster tightness for an increasing number of clusters (2 to 20) by comparing the mean sum of squares between groups to that within groups. The pattern, a gentle decline in the pseudo F-statistic seen almost consistently across the 10 subsets, did not clearly suggest an optimal number of clusters. The Duda and Hart Je(2)/Je(1) index [16] was then calculated. This used the within-cluster sum of squared distances from the mean to compare the present cluster to a potential further split. The suggested rule of thumb for deciding on the number of clusters is to look for a clustering solution with a high Duda-Hart index and a corresponding low pseudo T-squared value, with high pseudo T-squared values on either side [17]. Using this method, we determined that the optimal number of clusters was approximately 10. K-means analysis of the entire study population was then performed to create the final 10 clusters.
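The pseudo F-statistic described above is the ratio of the between-group mean square to the within-group mean square, F = [B / (k - 1)] / [W / (n - k)], where B and W are the between-cluster and within-cluster sums of squares, k is the number of clusters and n the number of observations. The following is a minimal sketch of that calculation for a given set of cluster labels; the function name `pseudo_f` and the toy data are assumptions for illustration, and the study itself used Stata.

```python
def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo F for a given assignment of points to clusters."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need more than one cluster and fewer clusters than points")

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand_mean = mean(points)
    between = 0.0  # B: sum of squares between cluster means and the grand mean
    within = 0.0   # W: sum of squares of points around their own cluster mean
    for members in clusters.values():
        centre = mean(members)
        between += len(members) * sq_dist(centre, grand_mean)
        within += sum(sq_dist(p, centre) for p in members)
    # Raises ZeroDivisionError if every point sits exactly on its cluster mean.
    return (between / (k - 1)) / (within / (n - k))


# One-dimensional toy data with two obvious groups.
toy = [(0.0,), (2.0,), (10.0,), (12.0,)]
tight = pseudo_f(toy, [0, 0, 1, 1])   # B = 100, W = 4, so F = 100 / 2 = 50
loose = pseudo_f(toy, [0, 1, 0, 1])   # B = 4, W = 100, so F = 4 / 50 = 0.08
```

Larger values indicate tighter, better-separated clusters, so when the statistic is computed for each candidate number of clusters, a clear peak would suggest the optimal number. A gentle decline without a peak, as reported here, gives no such signal.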
All clustering was done on standardised versions of the seven healthcare utilisation variables, derived by subtracting the mean of each variable and dividing by its standard deviation. This ensured that each variable carried equal weight in the determination of the "distance" used by the various clustering methods.
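The standardisation step can be sketched as follows. This is an illustration only: the function name `standardise` and the example values are invented, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which form was used.

```python
import statistics


def standardise(columns):
    """Convert each variable to z-scores: (value - mean) / standard deviation.

    columns: dict mapping a variable name to its list of values.
    Assumes the sample standard deviation; a constant column would raise
    ZeroDivisionError because its standard deviation is zero.
    """
    standardised = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        standardised[name] = [(v - mean) / sd for v in values]
    return standardised


# Two invented variables on very different scales.
raw = {"gp_visits": [2, 4, 6], "distinct_drugs": [0, 10, 20]}
z = standardise(raw)
print(z["gp_visits"])       # [-1.0, 0.0, 1.0]
print(z["distinct_drugs"])  # [-1.0, 0.0, 1.0]
```

Without this step, a variable with a wide numeric range (such as a prescription count) would dominate the Euclidean distance over a variable with a narrow range (such as inpatient admissions).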
All cluster analysis was done in Stata 15 [18].

Statistical analyses and cluster profiling
The clusters (hereafter referred to as segments) were then assessed and profiled on the average of the healthcare utilisation variables, as well as other characteristics such as the prevalence of LTCs, age, deprivation and risk of emergency hospital admission in the next 12 months. The statistical analyses sought to determine whether there were statistically significant differences across the segments in each characteristic. For the mean counts of healthcare utilisation variables and number of LTCs, we used a Kruskal-Wallis test of differences across segments, as these variables did not meet Normality assumptions. For age and risk of emergency admission score, an ANOVA test for difference of means was used. For the proportions of the population who were smokers and who were in the most deprived population quintiles, as well as for segment prevalence of LTCs, we calculated chi-squared tests for proportions. The variables which differed significantly in the statistical tests of difference were then explored pair-wise between segments using Mann-Whitney U tests (for the non-Normal continuous variables), Student t-tests (for the Normal continuous variables), and z-tests (for the categorical variables). We adjusted the significance level of 0.05 for the pair-wise tests using the Bonferroni method to account for the multiple testing involved in comparing the segments.
The determinants of healthcare need and healthcare complexity differ [19,20]. Therefore, in profiling segments, we applied a rule of thumb that distinguished these, defining 'high need' segments as ones fulfilling either of two criteria: (i) mean activity (count) more than 100% above the mean for the study population in any care setting, or (ii) mean activity (count) more than 20% above the mean for the study population in 4 or more care settings. Segments were identified as 'high complexity' if they had mean activity (count) higher than the mean for the study population in 4 or more care settings.
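The need and complexity rule of thumb can be expressed compactly in code. The sketch below is one possible reading of it, assuming that each of the seven utilisation variables counts as one care setting; the function names, setting labels and all numbers are invented for illustration and are not study results.

```python
def count_above(segment_means, population_means, factor):
    """Number of care settings where the segment mean exceeds factor x population mean."""
    return sum(
        segment_means[setting] > factor * population_means[setting]
        for setting in population_means
    )


def profile_segment(segment_means, population_means):
    """Apply the rule of thumb and return (need label, complexity label)."""
    high_need = (
        # (i) more than 100% above the population mean in any setting
        count_above(segment_means, population_means, 2.0) >= 1
        # (ii) more than 20% above the population mean in 4 or more settings
        or count_above(segment_means, population_means, 1.2) >= 4
    )
    # Above the population mean in 4 or more settings
    high_complexity = count_above(segment_means, population_means, 1.0) >= 4
    return (
        "high need" if high_need else "low need",
        "high complexity" if high_complexity else "low complexity",
    )


# Hypothetical population means per setting (not taken from the study).
population = {
    "elective": 0.2, "non_elective": 0.1, "outpatient_first": 0.5,
    "outpatient_follow_up": 1.0, "ae": 0.4, "gp": 4.0, "drugs": 3.0,
}
# Heavy use in one setting only, below average elsewhere.
single_setting = {s: 0.5 * m for s, m in population.items()}
single_setting["elective"] = 0.8
# Moderately raised use in every setting.
multi_setting = {s: 1.5 * m for s, m in population.items()}
# Below-average use everywhere.
low_use = {s: 0.5 * m for s, m in population.items()}

print(profile_segment(single_setting, population))  # ('high need', 'low complexity')
print(profile_segment(multi_setting, population))   # ('high need', 'high complexity')
print(profile_segment(low_use, population))         # ('low need', 'low complexity')
```

Comparing against a multiple of the population mean, rather than dividing by it, avoids a division by zero for any setting with no recorded activity.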

Results
The study population included 79,607 patients (50.1% female) with an average age of 41.4 years. All patients were registered with the General Practices in the Rhondda locality of Cwm Taf Morgannwg in the South Wales Valleys. K-means cluster analysis produced ten segments based on healthcare utilisation patterns across diverse settings of health care provision (Table 1). All seven healthcare utilisation variables were statistically different across the segments, reflecting the central aim of cluster analysis, which is to maximise the distance between clusters on the clustering variables. In addition, the non-clustering variables (patient characteristics and most LTC prevalence rates) were also found to differ significantly, demonstrating that each segment was largely unique.
There was significant deprivation in the population, with 85% of people living in the two most deprived national quintiles. The prevalence of current smoking in the study population was 21.6%, a figure consistent with rates reported for the region [21]. The average number of LTCs per person was 1.32 but this ranged from 0.6 to 6.5 across segments, highlighting the tendency toward multimorbidity in this population. The commonest LTCs in the study population were asthma (11%), depression (10.3%), diabetes (7.1%) and hypertension (18.9%). These rates were again largely consistent with those reported for the same population [22].

Profiling the population segments
For each segment, specific attributes are presented in comparison with the average for the study population (Table 2 and Fig. 1).
Although Segments 1, 3, 5, 8 and 10 were characterised broadly as 'low need, low complexity' segments, there were notable differences in their profiles. Segments 1 (mean age 36 years) and 10 (mean age 30 years) were on average young adults with 0-1 LTC, whose healthcare utilisation profiles were low in all the healthcare settings assessed. In addition, about half of the study population, despite the generally high levels of deprivation, were in Segment 1, with few and low-complexity healthcare needs. Segments 3 and 5 were on average young and middle-aged adults with 1 or 2 LTCs. For these segments, no specific LTCs were dominant in terms of prevalence. Segment 3 may include individuals with suboptimal control of LTCs, resulting in higher-than-average use of non-elective inpatient care and consequent high per capita cost, although this did not reach the 'high need' threshold of our rule of thumb. Despite their similarities with Segment 3, Segment 5 patients have a much lower per capita cost profile. This probably reflects the impact of lower-than-average use of non-elective inpatient care associated with having fewer LTCs (1.2, cf. 2.1 in Segment 3) that are also probably better managed through appropriate outpatient care and prescribing. The Segment 8 population is an older adult population (mean age 60 years) with an average of 3 LTCs per person, predominantly ambulatory care sensitive (ACS) conditions: asthma, diabetes, hypertension and ischaemic heart disease (IHD). Their higher-than-average use of prescribing possibly reflects success in ACS condition management.
Segments 2 and 7 were characterised broadly as 'high need, low complexity' segments. Segment 2 patients were older adults (mean age 54 years) with particularly high use of elective inpatient care. Although they have 2-3 LTCs on average, there is no dominant LTC that might be driving elective inpatient care use. As elective care is the standard route for many common operations [23], elective surgery may account for the high per capita cost from elective inpatient care use in this segment. Segment 7 patients similarly have a high per capita cost attributable to high utilisation of care in one setting, in this case outpatient follow-up visits. The dominance of bipolar disorder and schizophrenia in this segment suggests these mental health disorders may be driving outpatient follow-up care use. COPD, diabetes and hypertension were relevant LTCs that may have contributed to use of non-elective inpatient care and prescribing in this segment. Segment 7 patients also had the highest number of maternity bed days during the year of all segments.
Segments 4, 6 and 9 are the 'high need, high complexity' segments in this population and their diversity in age (47, 59 and 82 years, respectively) and high per capita cost underscore the fact that significant and complex healthcare need is a feature across age groups in this population. Segment 4 makes up only 1% of the population but accounted for 10% of total healthcare expenditure. It has the highest proportion of people living in the 2 most deprived quintiles, the highest cost-to-population ratio and is one of three segments with the highest prevalence rates of current smoking. High healthcare consumption is consistent across all settings and

Discussion
A central goal of population segmentation is to identify population subgroups that are homogeneous enough in terms of healthcare needs to enable tailoring of integrated health and care for them [24]. This study demonstrates that cluster analyses of linked healthcare data can identify distinct segments of care users in a local General Practice-registered population. Further profiling of the segments in the study population established that they had unique demographic and morbidity attributes that could potentially support planners and providers of health and care services in responding more accurately to the needs and priorities of each segment [4]. 'Low need' population groups tend not to be prioritised by health systems as they make relatively little demand on services. They may therefore not be deemed 'impactable' but proactively availing them of preventative services is critical to healthcare sustainability. To enable tailoring of preventative care to 'low need' populations, further differentiation across this often-sizeable segment of the population is necessary. Although we found that nearly 88% of our study population were in 'low need' segments, our approach to segmentation demonstrated the heterogeneity in this group with respect to age, morbidity, per capita cost of care, settings in which care was mostly used and prevalence of relevant risk factors. While Segments 1 and 10 might benefit from targeted and universal preventative initiatives to support them in staying healthy and non-care-seeking, Segment 3 patients could be targeted with hospital-based preventative services, such as smoking cessation [25], and improved LTC management to reduce emergency hospital admissions. 
The high prevalence of current smoking in Segment 8 patients, combined with prevalent ACS conditions and a higher-than-average use of prescribing, indicates they could be an ideal segment for integrated approaches involving active ACS condition management, lifestyle risk modification and medication reviews.
The 'high need, low complexity' Segments have high per capita cost incurred in a restricted setting (elective inpatient care in Segment 2 and outpatient follow-up care in Segment 7). For these Segments, understanding and de-escalating need quickly is key. The Segment 7 population has a relatively high prevalence of mental health disorders, but outpatient visits during the 30 days after a mental health hospital discharge are reported to be associated with a lower hospital readmission risk [26]. Consequently, improving care in this population segment may require alternative closer-to-home models of specialist follow-up care rather than reducing specialist follow-up per se. In addition, given the high prevalence of current smoking in this population segment, integrating smoking cessation treatment into mental health care, rather than referring patients to specialist smoking cessation treatment, could yield greater quit success [27]. For Segment 2 patients, who had the longest elective inpatient spells and higher-than-average smoking prevalence, considering the rationing of elective surgery for procedures of limited clinical value is justifiable on prognostic grounds, although a strong evidence base would be needed to conclusively establish the rationale [28].
Perhaps the most widely studied population segments are the 'high need, high complexity' Segments 4, 6 and 9. These Segments, which together accounted for 5% of this population and over 27% of total healthcare expenditure in the year, had the highest average number of LTCs per person. Although they are often referred to as 'high need, high cost' populations, their high cost consumption is probably driven by the fragmentation of care associated with the complexity of their need [29]. Consequently, a key objective for these segments should be to de-escalate need and reduce fragmentation of care by targeting integrated care management and other resources to them.
The question of how to address need and reduce use and cost of care in such high need populations merits consideration of the local determinants of need. One factor thought to drive healthcare consumption patterns of adults with multiple LTCs is the presence of functional impairment [30]. In this population however, we observed higher care consumption volumes and cost in the younger and fitter Segment 4 population compared to the older and frailer Segment 9 population. While both segments were very similar in terms of multimorbidity, their notable differences were in age (45.5 vs. 81 years), degree of functional impairment (proportion who were moderately-severely frail 28.3% vs. 66%), deprivation (88.6% vs. 78.5% in the two most deprived national quintiles) and smoking prevalence (27.2% vs. 12.8%). Despite their older age and greater functional limitation, Segment 9 patients had a lower per capita cost of care than Segment 4 patients, underlining the importance of deprivation and behavioural risk factors in driving care use (and, by extension, indicating need). This finding is consistent with those reported in other general [31] and disease-based populations [32] and implies that interventions aimed at Segment 4 patients should necessarily incorporate behaviour change support and access to broader social initiatives tackling poverty. Segment 9 on the other hand could benefit from anticipatory care planning involving both case identification and proactive intervention to reduce hospitalisation [33]. For both Segments, local health and care systems could pursue complex case management programs incorporated into or superimposed on traditional primary care systems or create specialised clinics for these Segments delivered by a multidisciplinary team offering enhanced care coordination and other support [34]. 
The relative merits of either approach should be explored through diverse lenses, not least of which would be capacity to engage local general practitioners and patients as well as size of potential benefit.
The approach set out in this study potentially offers a quantitative evidence base for local population health planning and delivery [35]. As segmentation processes are most useful when they iterate between quantitative and qualitative data sources [36], adding relevant qualitative social and place context to this quantitative intelligence is desirable. As a potential complement to traditional health needs assessments [37], which may lack granularity and responsiveness, this whole-population approach allows useful insight into expressed need and offers a measure of insight into non-care-seeking populations in whom unmet need may be present.
Segmenting a heterogeneous population into discrete and relatively homogeneous groups with similar healthcare needs can enable the development of integrated health and care systems that are more targeted and efficient [38]. Systems that successfully achieve integration of health and care demonstrate specific attributes, chiefly (i) a focus on segments of their population with the highest need for care, and (ii) a change in core delivery processes to enable multidisciplinary teams to work around patients [39]. Segmentation and stratification of risk allow the identification of such high-risk populations, and the detailed profiling of the segments based on proxies of health need potentially engages multidisciplinary teams.
Integrating health and care around segments of the population potentially tackles fragmentation of care and represents a basis for bridging the chasm in healthcare quality and outcomes often experienced by populations [4]. Achieving improved outcomes at lower cost per capita is the essence of Value-Based healthcare, which depends on reliable and consistent measurement of both outcomes and cost of care in a population. The potential role of population segmentation in Value-Based healthcare is evident in the fact that measurement of outcomes only works if outcomes are measured for people with similar needs.
There are potentially many other datasets which could offer greater insight into the needs of these population segments if integrated in future, for example, data from social care. Health and care policy promoting integration around patients and populations must therefore offer enabling legislative and technical environments to facilitate routine integration of datasets from diverse settings of health and care provision as well as social and demographic information.
This study has potential limitations worth highlighting. The healthcare utilisation variables were derived from an integrated primary and secondary care dataset, which carries some limitations. For example, primary healthcare utilisation was identified using Read codes, which are known to be prone to variation in use and to overlap between codes. In some instances, proxy measures were used where data variables were not available. For example, GP appointment data were not available, so GP practice encounters that resulted in a diagnosis Read code were taken as a proxy. Of the General Practices in the Rhondda locality, one practice did not wish to participate in the study. This practice constituted 10.3% of the GP-registered population in the locality, and an element of bias may have been introduced if its practice population was significantly different from the study population. This study compared traditional segmentation with utilisation-based cluster analysis but did not look at other segmentation methodologies such as prescribed binning criteria or decision trees. There are examples in the scientific literature where these methods have achieved a greater reduction in variance than clustering using k-means methodology [40].
Finally, it is worth placing the findings of this study in the context of the degree of general deprivation in the population. Despite 60% of the population (Segments 1 and 10) using relatively few healthcare resources, the 12-month risk of emergency admission in those low-utilisation segments was 8%, a much higher rate than the 3% reported for a similar low-utilisation segment in a randomly selected population in England [10]. The implications of our findings for local healthcare policy and planning may therefore differ if the population were more diverse in respect of levels of deprivation.

Conclusion
Cluster analysis of linked primary and secondary healthcare use data for a local GP-registered population can segment the population into distinct groups with unique health and care needs. Despite some potential limitations, this approach yields valuable intelligence to inform local service planning and at the same time offers great potential for further research into its use in informing preventative, holistic health and social care.