Application of Data Mining Technology in Risk Prediction of Metabolic Syndrome in Oil Workers

Background. The prevalence of metabolic syndrome continues to rise sharply worldwide, seriously threatening people's health. In this paper, three risk prediction models for metabolic syndrome in oil workers were established, and the optimal model was identified through comparison. The optimal model can be used to identify people at high risk of metabolic syndrome as early as possible, to predict their risk, and to persuade them to change adverse lifestyles so as to slow down and reduce the incidence of metabolic syndrome.
Methods. A total of 1,468 workers from an oil company who participated in occupational health physical examinations from April 2017 to October 2018 were included in this study. We established a Logistic regression model, a random forest model and a convolutional neural network model, and compared their prediction performance according to the F1 score, sensitivity, accuracy and other indicators.
Results. In the training set, the accuracy of the three models was 83.45%, 94.21% and 86.34%, the sensitivity was 78.47%, 94.62% and 81.30%, the F1 score was 0.79, 0.93 and 0.83, and the area under the ROC curve was 0.894, 0.987 and 0.935, respectively. In the test set, the accuracy was 76.72%, 80.66% and 78.69%, the sensitivity was 70.00%, 77.50% and 68.33%, the F1 score was 0.70, 0.76 and 0.71, and the area under the ROC curve was 0.797, 0.861 and 0.855, respectively.
Conclusions. The prediction performance of the random forest model was better than that of the other models. The model has higher application value, can better predict the risk of metabolic syndrome in oil workers, and provides a theoretical basis for the health management of oil workers.


Introduction
Metabolic syndrome (MetS) refers to the clustering of multiple metabolic risk factors in the body, including obesity, impaired glucose regulation, dyslipidemia and hypertension. MetS is a group of complex clinical syndromes based on insulin resistance. The literature shows that metabolic syndrome increases the risk of cardiovascular disease, type 2 diabetes and chronic kidney disease [1][2][3]. With social and economic development and changes in people's lifestyles, the prevalence of metabolic syndrome has increased year by year and brought a heavy economic burden, making it an important health issue of common concern worldwide.
At present, the definition and diagnostic criteria of metabolic syndrome have not been completely unified. In 1998, WHO officially named the "metabolic syndrome" and proposed corresponding diagnostic criteria for the first time [4]. Over the following decade, the diagnostic criteria underwent many changes and revisions, including the 2001 third report of the National Cholesterol Education Program Adult Treatment Panel (NCEP ATP III), the Chinese Diabetes Society (CDS) diagnostic criteria in 2004, and the International Diabetes Federation (IDF) diagnostic criteria in 2005. In 2009, the American Heart Association (AHA), the International Diabetes Federation, the National Heart, Lung, and Blood Institute and other institutions jointly proposed a tentative unified standard [5][6][7][8]. According to a large body of epidemiological data, the global prevalence of MetS is about 30% [9]. Based on national health and nutrition survey data from 2007-2014, Doosup Shin found that the prevalence of MetS among American adults had reached 34.3% (according to the revised NCEP ATP III diagnostic criteria) [10]. In South Korea, according to the same diagnostic criteria, the prevalence of metabolic syndrome in adults from 2009 to 2013 was as high as 30.52% [11]. In China, in 2010, Jieli Lu et al. [12] analyzed reported data on 97,098 adults and estimated the prevalence of MetS at 33.9% (according to the NCEP ATP III diagnostic criteria). In 2015, Ting Liu analyzed the prevalence of MetS among 34,025 residents of Jilin province and found a prevalence of 32.5% (according to IDF diagnostic criteria) [13]. In 2016, Ri Li et al. [14] conducted a meta-analysis showing that the prevalence of MetS in subjects over 15 years old was 24.5% (according to IDF diagnostic criteria). Although the diagnostic criteria are not uniform, it is undeniable that metabolic syndrome has become one of the chronic diseases with high incidence in China and worldwide.
Data mining refers to extracting hidden information and knowledge with potential research value from large data sets, and is often used in the medical field, where data volumes are large and updated quickly. Among data mining methods, classification algorithms have attracted wide attention and application in recent years. These algorithms take the risk factors affecting the occurrence of a disease as inputs, and use statistical methods and computer algorithms to build a predictive model of disease risk. The constructed model is used to predict the probability that a population or an individual will develop a given disease, and thus provides a theoretical basis for personal health management and corresponding preventive measures [15]. At present, Logistic regression, Cox regression, BP neural network, decision tree, support vector machine and other models are most commonly used to construct metabolic syndrome risk models at home and abroad [16][17][18]. These models can be used to identify groups at high risk of MetS, persuade them to change unhealthy lifestyles, and reduce and slow the occurrence and development of the disease. Among them [19][20][21], Logistic regression and Cox regression, as traditional statistical modeling methods, are widely used and have strong explanatory power. However, Cox regression is generally used for survival analysis data, which requires two outcome variables at the same time (survival time and event status) and places relatively strict requirements on the data. The decision tree model is easy to visualize, but is prone to overfitting and generalizes poorly. The random forest model is a classifier composed of multiple decision trees, which improves on the weak generalization ability of a single decision tree and balances the error on unbalanced data.
As a kind of artificial neural network model, the BP neural network is fault-tolerant to some extent, but it often becomes trapped in local minima, learns slowly, and is prone to overfitting. In the convolutional neural network model, the local receptive field and weight sharing of the convolution kernel reduce the computational complexity, and the model offers high accuracy and good generalization ability. Due to regional and cultural differences, the effects of existing models vary, and mature and accurate metabolic syndrome risk prediction systems have not yet been established at home or abroad. Moreover, most of the models established so far assess disease risk in the general population and ignore the special group of occupational populations.
As an important part of China's non-renewable energy industry, the petroleum industry still accounts for a large proportion of the national economy. Oil workers are also a major labor force in China's secondary industry. Their health affects the development of China's economy to a certain extent and deserves more attention. Oil workers are exposed to high temperature, noise, shift work and other harmful occupational factors, as well as a variety of adverse lifestyles caused by occupational stress, all of which can increase the incidence of metabolic syndrome. For this special occupational group, risk prediction models built for the general population are not suitable, so it is necessary to establish a dedicated risk prediction model of metabolic syndrome in order to achieve early detection, diagnosis and treatment, and to protect the health of oil workers. In this study, workers of an oil company were selected as the research subjects to construct the traditional Logistic regression model and the random forest model, and to introduce the convolutional neural network model, which has been widely discussed in recent years. The prediction performance of the models was compared to find the optimal model, which provides a theoretical basis for the health management of this special occupational group.

Research content
Uniformly trained personnel conducted a one-to-one questionnaire survey of the oil workers to collect the following information:
General situation: gender, age, education, income status, marital status, etc.
Lifestyle: smoking, drinking, diet and physical exercise.
History of personal and family diseases: hyperglycemia, hypertension, hyperlipidemia, etc.
Working conditions: shifts, exposure to high temperature, noise and other harmful factors.
Physical examination: height, weight, blood pressure and waist circumference, etc.
Venous blood was drawn from the study subjects in the early morning after fasting for 12 hours, and biochemical indicators such as fasting blood glucose, high-density lipoprotein and triglyceride were measured using the Dirion CS-1200 automatic biochemical analyzer (China Changchun Dirion Medical Technology Company). According to the diagnostic criteria [8], metabolic syndrome is diagnosed if three or more of the following five indicators are met:
Central obesity: for Chinese people, waist circumference ≥ 85 cm (male) or ≥ 80 cm (female).
Elevated blood glucose: FBG ≥ 5.6 mmol/L, or previously diagnosed diabetes under treatment.
Elevated triglycerides: TG ≥ 1.7 mmol/L, or previously diagnosed hypertriglyceridemia under treatment.
Elevated blood pressure: systolic/diastolic blood pressure ≥ 130/85 mmHg, or previously diagnosed hypertension under treatment.
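The "three or more of five indicators" rule can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the record fields and function names are hypothetical, blood pressure ≥ 130/85 mmHg is read here as "systolic ≥ 130 or diastolic ≥ 85" (an assumption), and because the HDL cut-off is not listed in this section it is passed in as a precomputed flag rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class Exam:
    """One worker's examination record (all field names are hypothetical)."""
    male: bool
    waist_cm: float
    fbg_mmol_l: float
    tg_mmol_l: float
    sbp_mmhg: float
    dbp_mmhg: float
    hdl_abnormal: bool = False          # HDL threshold not given above: supply the flag
    treated_diabetes: bool = False
    treated_high_tg: bool = False
    treated_hypertension: bool = False


def abnormal_components(e: Exam) -> list:
    """Return the names of the diagnostic indicators that this record meets."""
    waist_cutoff = 85.0 if e.male else 80.0
    checks = {
        "central_obesity": e.waist_cm >= waist_cutoff,
        "elevated_glucose": e.fbg_mmol_l >= 5.6 or e.treated_diabetes,
        "elevated_tg": e.tg_mmol_l >= 1.7 or e.treated_high_tg,
        # Assumption: either pressure reaching its threshold counts.
        "elevated_bp": e.sbp_mmhg >= 130 or e.dbp_mmhg >= 85 or e.treated_hypertension,
        "abnormal_hdl": e.hdl_abnormal,
    }
    return [name for name, met in checks.items() if met]


def has_mets(e: Exam) -> bool:
    """MetS is diagnosed when three or more of the five indicators are met."""
    return len(abnormal_components(e)) >= 3
```

For example, a man with a waist circumference of 90 cm, FBG of 5.8 mmol/L and blood pressure of 135/80 mmHg meets three indicators and would be classified as having MetS under this sketch.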

Quality control
The investigators took up their posts only after unified training. The collected questionnaires were checked on the spot and entered in duplicate with double checking, and questionnaires with input errors were checked a third time to ensure the accuracy of the collected data. The same instruments were used for all physical examinations and laboratory tests, and the biochemical indicators were tested with the same kit at the North China Petroleum Underground Hospital.

Statistical methods
The CscrMainUI system, developed by a scientific research company, was used to scan and input the questionnaires and to establish a database. IBM SPSS 19.0 was used for statistical analysis. Normally distributed measurement data were expressed as $\bar{x} \pm s$, and the t test was used for comparison between groups. Non-normally distributed measurement data were expressed as M (P25, P75), and the rank sum test was used for comparison between groups. Count data were expressed as rates, and the Pearson $\chi^2$ test was used for comparison between groups. Unconditional binary logistic regression was used for multivariate analysis. The criterion for introducing an independent variable was P ≤ 0.05, and the test level was α = 0.05 (two-sided).

Hardware and software platform
The sample data were partitioned into a training set (60%), a validation set (20%) and a test set (20%). The Logistic regression model (using the forced entry method) and the random forest model were constructed with SPSS Modeler 18.0 (number of base classifiers 100, sample size of the data set used by each base classifier 100, maximum number of nodes 10000, maximum tree depth 10, and minimum size 5). The convolutional neural network model was constructed with PyTorch (4*4 input matrix; 3*3 convolution kernel with stride 1 and padding 1; 2*2 max pooling with stride 1; fully connected layer with 144 inputs and 2 outputs). ROC curves were drawn with MedCalc and the areas under the curves were compared.
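The stated network dimensions can be checked with a little shape arithmetic. The sketch below is not the authors' code: it uses the standard output-size formula for convolution and pooling layers, and it assumes that the 144 inputs of the fully connected layer are the flattened pooled feature maps. Under that assumption the text implies 16 convolution channels, a figure that is inferred here and is not stated in the paper.

```python
def out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# 4*4 input, 3*3 convolution, stride 1, padding 1 keeps the map at 4*4.
conv_side = out_size(4, kernel=3, stride=1, padding=1)

# 2*2 max pooling with stride 1 shrinks each side by one, giving 3*3.
pool_side = out_size(conv_side, kernel=2, stride=1)

# Each pooled feature map contributes pool_side * pool_side values after flattening.
values_per_channel = pool_side * pool_side

# Assumption: the fully connected layer's 144 inputs are all flattened maps.
implied_channels = 144 // values_per_channel

print(conv_side, pool_side, values_per_channel, implied_channels)
```

In PyTorch terms this would correspond to something like a `Conv2d` layer with 16 output channels followed by `MaxPool2d(2, stride=1)` and `Linear(144, 2)`, although the activation function and any other layers remain unspecified in the text.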

General situation
Of the 1,468 oil workers, 1,105 were men, with an age of 43 (38, 48) years, and 363 were women, with an age of 44 (42, 47) years, expressed as M (P25, P75). The prevalence of metabolic syndrome among the petroleum workers was 40.67%. The rate of central obesity was 56.81%, the rate of abnormal blood glucose was 49.39%, the rate of abnormal triglyceride was 32.90%, the rate of abnormal HDL was 19.28%, and the rate of abnormal blood pressure was 55.99% (Fig. 1).

Independent variable screening
Single factor analyses were performed on the basic conditions, diet and lifestyle, occupational exposure factors and laboratory tests of the 1,468 oil workers. The results showed statistically significant differences in age, gender, body mass index (BMI), marital status, family history of hypertension, family history of diabetes mellitus, salt intake, meat intake, smoking status, drinking status, shift work, occupational heat, noise, hemoglobin, uric acid (UA), alanine transaminase (ALT), etc. (P < 0.05), as shown in Table 1 to Table 4.

Collinearity diagnosis
Collinearity was diagnosed using the bivariate correlation coefficient r, the tolerance and the variance inflation factor (VIF). The largest absolute correlation coefficient |r| was 0.31, below 0.5, as shown in Table 7. The minimum tolerance was 0.844, much higher than 0.1, and the maximum variance inflation factor was 1.185, less than 5, as shown in Table 8. These results indicate that there is no serious multicollinearity among the screened independent variables.
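The two collinearity statistics reported here are reciprocals of each other, which offers a quick consistency check. The sketch below uses the standard definitions (tolerance = 1 − R², VIF = 1 / tolerance, where R² comes from regressing one predictor on the others); the function names are illustrative and do not come from SPSS.

```python
def tolerance_from_r_squared(r_squared: float) -> float:
    """Tolerance of a predictor, given the R^2 of regressing it on the other predictors."""
    return 1.0 - r_squared


def vif_from_tolerance(tolerance: float) -> float:
    """Variance inflation factor is the reciprocal of the tolerance."""
    return 1.0 / tolerance


# The smallest reported tolerance should correspond to the largest reported VIF.
max_vif = vif_from_tolerance(0.844)
print(round(max_vif, 3))  # 1.185, matching the reported maximum VIF
```

A tolerance of 0.844 means that only about 15.6% of the variance of that predictor is explained by the other predictors, which is why the VIF stays close to 1.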

Comparison of predictive performance of metabolic syndrome risk prediction models
In the training set, the accuracy, sensitivity, specificity, F1 score, Youden's index, positive likelihood ratio, Kappa index, positive predictive value and negative predictive value of the random forest model were all higher than those of the Logistic regression model and the convolutional neural network model. The area under the ROC curve (AUC) of the random forest model was larger than that of the Logistic regression model and the convolutional neural network model, and the difference was statistically significant (P < 0.001). See Table 11 and Fig. 2.
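All of the indicators named above can be derived from a 2×2 confusion matrix. The following sketch shows the standard formulas. The counts used in the example are hypothetical and are not taken from the paper's tables; they were simply chosen so that accuracy, sensitivity and F1 come out close to the values reported for the random forest model in the test set.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Common diagnostic indicators computed from confusion-matrix counts."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # positive predictive value (precision)
    npv = tn / (tn + fn)                  # negative predictive value
    accuracy = (tp + tn) / n
    f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and recall
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "youden": sensitivity + specificity - 1,
        # Undefined when specificity is exactly 1 (no false positives).
        "positive_lr": sensitivity / (1 - specificity),
        "kappa": (accuracy - expected) / (1 - expected),
        "ppv": ppv,
        "npv": npv,
    }


# Hypothetical counts for illustration only.
m = binary_metrics(tp=93, fp=32, fn=27, tn=153)
print({name: round(value, 4) for name, value in m.items()})
```

Because sensitivity and specificity trade off against each other, Youden's index and the ROC curve are useful summaries when the models are compared across thresholds.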
In the validation set, the accuracy, sensitivity, specificity, F1 score and other indexes of the three models were all relatively high. To further examine the relationship between sensitivity and specificity, and to judge whether the models were overfitted and robust, ROC curves were plotted and AUC values were calculated. The curves of the Logistic regression model, the random forest model and the convolutional neural network model were nearly identical, with no statistically significant difference (P > 0.05). The areas under the curve (AUC) were 0.875, 0.878 and 0.872, respectively. See Tables 10 and 11 and Fig. 3.
In the test set, the accuracy, sensitivity, F1 score, Youden's index, Kappa index and negative predictive value of the random forest model were the highest. The specificity, positive likelihood ratio and positive predictive value of the convolutional neural network model were the highest, but its sensitivity and negative predictive value were the lowest. The area under the ROC curve (AUC) of the random forest model was larger than that of the Logistic regression model and the convolutional neural network model. In pairwise comparisons of the AUC, the difference between the Logistic regression model and the random forest model was statistically significant (Z = 2.806, P = 0.005), the difference between the Logistic regression model and the convolutional neural network model was statistically significant (Z = 2.352, P = 0.019), and the difference between the random forest model and the convolutional neural network model was not statistically significant (Z = 0.320, P = 0.749). See Table 11 and Fig. 4.
[25]. The established models have advantages in local applicability because of differences in region, population and input variables.
The results of this study showed that the prevalence of MetS in workers of an oil company was 40.67%, higher than the average level among Chinese adults [12][13][14]. Ranked from high to low, the prevalence of the five diagnostic components of metabolic syndrome was: central obesity, abnormal blood pressure, abnormal blood glucose, abnormal triglyceride, and abnormal high-density lipoprotein. This pattern is related to the generally good living conditions, the dietary habits, and the irregular work and rest schedules of oil workers. Independent variable screening found that age, income, BMI, family history of diabetes, salt intake and physical exercise were all influencing factors of metabolic syndrome, which is consistent with previous research results [26][27]. Unlike the general population, oil workers spend long periods in a special occupational environment.
A high temperature environment keeps the body's circulatory system in a long-term state of stress, resulting in decreased elasticity of the blood vessel wall, increased blood viscosity, and increased blood pressure. In addition, studies have shown that exposure to high temperature can affect insulin hemodynamics, resulting in insulin resistance [28][29]. Harmony between biological rhythm and natural rhythm is the basis of normal physiological activity. Irregular shift work disrupts the circadian rhythm of the human body, disturbing nutrients and related hormones and thus leading to glucose and lipid metabolism disorders and energy imbalance [30]. Furthermore, night shift workers lack sleep, and unhealthy lifestyle behaviors such as smoking, drinking and irregular diet become much more common among them, all of which drive the occurrence of metabolic syndrome [31]. In this study, UA and ALT were found to be risk factors for MetS. Related studies have shown that UA increases the risk of MetS by increasing insulin resistance, and that increased ALT in the blood might cause fat accumulation in the liver. Mandana Khalili et al. found that patients with MetS had a higher level of hepatic steatosis, and that elevated ALT was correlated with MetS [32][33].
In this study, a Logistic regression model, a random forest model and a convolutional neural network model were established and their prediction performance was compared. The random forest model was found to be suitable as a prediction model of MetS risk in oil workers. In the training set, the prediction performance of the random forest model was higher than that of the Logistic regression model and the convolutional neural network model, and the difference was statistically significant. However, the specificity of the random forest model in the test set was slightly lower than that of the convolutional neural network model, and the difference was not statistically significant. In general, the training performance of a model is directly proportional to its testing performance. The discrepancy observed here may be due, on the one hand, to the limited sample size, which was not large enough and led to poor model performance. On the other hand, the instability of the network, the setting of parameters and the selection of input variables may affect the prediction performance of the model. In addition, although the specificity of the convolutional neural network model was high in the test set, its sensitivity was too low. For a model that predicts the risk of metabolic syndrome in petroleum workers, higher sensitivity is more suitable for the early detection of patients, so that the model can play a real role in early detection, early diagnosis and early treatment, namely secondary prevention of the disease. As a machine learning algorithm that has emerged in recent years, the random forest model [34][35] is a highly flexible classifier containing multiple decision trees. The random forest model overcomes the shortcomings of the single decision tree algorithm and adopts random sampling to enhance generalization ability.
Proposed by Yann LeCun of New York University in 1988, the convolutional neural network model is the first truly successful deep learning method using a multi-layer hierarchical network. It comprises an input layer, hidden layers (convolutional layer, pooling layer and fully connected layer) and an output layer, which effectively reduces the number of network parameters and significantly reduces the computational complexity. Previously, convolutional neural networks were mainly used for image, language and medical imaging processing. In recent years, they have also been used to predict the risk of various diseases [36-38]. However, the prediction performance of CNNs varies across diseases, possibly because model construction needs further improvement and there is no unified standard yet. At the same time, a certain amount of data is required for model training. The Logistic regression model is a traditional statistical modeling method that is widely used in the field of risk factor screening and disease prediction. It is convenient to use and the meaning of its parameters is clear, but it cannot solve nonlinear problems and its application conditions are strict. The required sample size increases with the number of input variables, and the predictive power decreases when the data do not meet the requirements [39].

Conclusions
Three risk prediction models (Logistic regression model, random forest model and convolutional neural network model) for the occurrence of metabolic syndrome in petroleum workers were established and compared. The random forest model performed well on accuracy, sensitivity, specificity and other indicators in both the training set and the test set, and showed high robustness. This indicates that the random forest model can predict the risk of metabolic syndrome in oil workers more accurately. It can support health education for employees at high risk of metabolic syndrome and the formulation of corresponding prevention strategies, so as to improve the allocation of national medical and health resources and the distribution of health services.

Ethics approval and consent to participate
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. The study was approved by the Ethics Committee of North China University of Science and Technology (No. 16040). Informed consent was obtained from all individual participants included in the study.

Consent for publication
Not applicable.

Availability of data and materials
The data that support the findings of this study are available from the Institute of Basic Medicine, Chinese Academy of Medical Sciences, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of the Institute of Basic Medicine, Chinese Academy of Medical Sciences.

[Figure: ROC curves of the three predictive models in the test set]