Model of genetic and environmental factors associated with type 2 diabetes mellitus in a Chinese Han population

Background Type 2 diabetes mellitus (T2DM) is a metabolic disorder that causes high morbidity and mortality through complications such as renal failure, amputations, cardiovascular disease, and cerebrovascular events. Methods We collected medical reports, lifestyle details, and blood samples from participants and genotyped single-nucleotide polymorphisms (SNPs) using the polymerase chain reaction-ligase detection reaction method. A follow-up visit was conducted in August 2016 to determine the incidence of T2DM among the 2113 eligible people. To explore which genes and environmental factors are associated with T2DM in a Chinese Han population, we used elastic net to build a model intended to explain which variables are strongly associated with T2DM rather than to predict its occurrence. Results The genotype of rs964184 under the additive model, together with a history of hypertension, regular intake of meat, and waist circumference, increased the risk of T2DM (adjusted OR = 2.38, p = 0.042; adjusted OR = 3.31, p < 0.001; adjusted OR = 1.05, p < 0.001). The TT genotype of rs12654264 under the additive and recessive models, the CC genotype of rs2065412 under the additive and dominant models, and the TT genotype of rs4149336 under the additive and dominant models, together with degree of education and regular exercise, reduced the risk of T2DM (adjusted OR = 0.46, p = 0.017; adjusted OR = 0.53, p = 0.021; adjusted OR = 0.59, p = 0.021; adjusted OR = 0.57, p = 0.01; adjusted OR = 0.59, p = 0.021; adjusted OR = 0.57, p = 0.01; adjusted OR = 0.50, p = 0.007; adjusted OR = 0.80, p = 0.032). Conclusion We identified a set of SNPs and environmental factors associated with T2DM: rs5805 in SLC12A3, rs12654264 in HMGCR, rs2065412 and rs4149336 in ABCA1, rs964184 in ZPR1, waistline, degree of education, exercise frequency, hypertension, and meat intake. Although there was no interaction between these variables, people with two risk factors had a higher risk of T2DM than those with only one. These results provide a theoretical basis for screening genes and other risk factors to prevent T2DM.


Background
As a global public health issue causing significant morbidity and mortality, type 2 diabetes mellitus (T2DM) affects more than 380 million people worldwide [1,2]. The International Diabetes Federation has estimated that the number of individuals with diabetes will increase from 240 million in 2007 to 642 million in 2040 [3,4]. In China, because of scientific and technological advances as well as socioeconomic development, the number of patients with diabetes is predicted to increase from 20.8 million in 2000 to 42.3 million in 2030 [5,6]. China has the largest number of people with diabetes, with 92.4 million adults currently affected [7,8]. T2DM accounts for approximately 90% of all diabetes cases, with an overall prevalence of 9.1% of the population. T2DM occurs mainly when the body becomes unable to use insulin effectively and pancreatic β cells fail to compensate for the increased insulin demand, leading to a loss of glucose homeostasis [2,9]. Over time, poor glycemic control affects the blood vessels and nerves, accelerating the development and progression of neuropathies, micro- and macrovascular complications, and premature death [9,10].
Most cases of T2DM are closely related to genetic and environmental risk factors [11,12] and their interactions [13]. Previous genome-wide association studies [14-16] have identified numerous genetic polymorphisms and rare genetic variants associated with slight or significant effects on T2DM, suggesting that the disease results from complex interactions between genetic mechanisms and environmental factors. For instance, Zhang et al. [17] found a close relationship between the SLC12A3 gene and T2DM, and showed that a T allele in this gene had a modestly unfavorable impact on lipid levels. Ference et al. [18] showed that the genetic variants of the HMGCR gene are associated with T2DM. Ergen et al. [19] suggested ABCA1 polymorphism as a genetic marker of T2DM. Fumitaka et al. [20] identified the genetic susceptibility of patients with a novel common variant of rs964184 in ZPR1 to T2DM.
In addition to genetic predisposition, epidemiological risk factors play crucial roles in T2DM, such as gender differences, body mass index (BMI, weight in kilograms divided by the square of height in meters), lifestyle (e.g., smoking and alcohol consumption), and interactions between various factors [11-13, 21, 22].
We comprehensively analyzed the potential interactions between genes, physiological indices, biochemical indicators, and behavioral factors and T2DM. We constructed a model by elastic net that included genes and other environmental factors to identify variables strongly associated with T2DM rather than to predict the occurrence of T2DM.

Subjects
A total of 2323 subjects who underwent physical examination at a community health service center from April 2013 to July 2013 were selected by cluster random sampling from 4 towns and townships in a district of Ningbo, Zhejiang Province. All subjects had to meet the following criteria: (1) permanent resident aged more than 40 years; (2) Han ethnicity; (3) no consanguineous relationship; (4) no diagnosis of T2DM as of April 2013 and no severe liver or kidney disease, malignant tumor, or infectious disease. We collected individual medical reports, lifestyle details, and blood samples and genotyped single-nucleotide polymorphisms (SNPs) using the polymerase chain reaction-ligase detection reaction method. Interviews were performed in August 2016 to determine the incidence of T2DM among the subjects. A total of 2113 people qualified for the study. T2DM was diagnosed based on World Health Organization guidelines [23]. The case group included 100 patients diagnosed with T2DM between April 2013 and August 2016. Those who had not developed T2DM by August 2016 formed the control group. The study was approved by the Medical Ethics Committee of Hangzhou Normal University (No. 2013020), and all participants signed informed consent forms. The study design is shown in Fig. 1.

Demographic information and epidemiological investigation
Demographic variables consisted mainly of basic characteristics such as age, sex, and education level, together with lifestyle information such as smoking and drinking behavior. The main lifestyle variables were defined as follows. (1) Diet: "drink milk" and "drink soymilk" were defined as maintaining a certain amount of milk or soymilk intake every day, whereas "no milk intake habit" was defined as "not drinking". An average intake of fried food of less than once per week was defined as "no fried food", and eating fewer than one sweet treat per week was defined as "not eating sweets". (2) Smoking: smoking behavior was defined as smoking at least one cigarette per day for at least 1 year. (3) Drinking: drinking behavior was defined as drinking white wine ≥50 g, red wine ≥150 g, or beer ≥500 g on average every day for 1 year or more. (4) Physical activity: "sedentary" was defined as little physical activity, such as desk work (e.g., secretaries); "light physical activity" as office work, repair of electrical clocks and watches, sales work, hotel services, chemical laboratory operations, lecturing, etc.; "moderate physical activity" as students' daily activities, motor vehicle driving, electrical installation, lathe operation, metal cutting, etc.; and "heavy physical activity" as non-mechanized agricultural labor, steelmaking, dancing, sports, loading and unloading cargo, mining, construction work, etc.
Blood samples were collected from the antecubital vein after the subjects had fasted for ≥8 h. Part of the collected samples was used to examine biochemical indicators such as serum lipid levels, whereas the other part was transferred into a test tube containing anticoagulant solution to extract DNA.

Isolation of genomic DNA
Genomic DNA was extracted from the blood cells using a standard phenol/chloroform extraction method, centrifuged, and stored at −80 °C. All genomic DNA samples were analyzed by electrophoresis. DNA was extracted using Tiangen Blood Genomic DNA extraction kits (Tiangen Biotech, Beijing, China) and sent to Shanghai Jierui Biological Engineering Co., Ltd., for genotyping by the polymerase chain reaction (PCR)-ligase detection reaction (LDR) method (Generay Biotech Company, Shanghai, China). This procedure is described in detail in our previous work [24]. For quality control, we randomly selected 10% of the samples for regenotyping, and the concordance was 100%.

SNP selection and genotyping
Peripheral venous blood samples were collected from the study subjects to evaluate four physiological indicators of blood lipids (TC, TG, HDL-C, and LDL-C) and gene locus information. SNPs were identified mainly by searching the PubMed, Kyoto Encyclopedia of Genes and Genomes, and GeneCards databases. The specific screening process was as follows: (1) literature related to gene polymorphisms, lipid levels, and atherosclerosis was searched in NCBI PubMed, and SNPs were screened; (2) GeneView information for the relevant SNPs was obtained from the GeneCards and NCBI databases, and SNPs that were missense mutations or located in the 3′ untranslated region (3′ UTR), 5′ UTR, or transcription factor-binding sites were selected; (3) the minor allele frequency (MAF) of each SNP in the Chinese population was obtained from the International HapMap Project database, and SNPs with MAF greater than 0.05 were retained; (4) Haploview software was used to conduct linkage disequilibrium analysis on all selected sites, and tag SNPs were selected using r² ≥ 0.8 as the criterion.
This process identified 103 SNPs, including those in SLC12A3, HMGCR, ABCA1, and KCNJ1, among others. Information regarding all SNP loci is shown in Table S1.
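Step (3) of the screening can be illustrated with a minimal Python sketch. The study took MAF values from HapMap rather than computing them; the sketch below only shows how an MAF is derived from genotype counts and how the 0.05 threshold is applied. The SNP names and counts are hypothetical and are not data from the study.

```python
def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Return the minor allele frequency of a biallelic SNP from genotype counts."""
    n_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if n_alleles == 0:
        raise ValueError("no genotyped samples")
    # Each heterozygote carries one alternate allele, each alt homozygote two.
    alt_freq = (2 * n_hom_alt + n_het) / n_alleles
    return min(alt_freq, 1.0 - alt_freq)


# Hypothetical genotype counts (ref/ref, ref/alt, alt/alt); illustrative only.
candidates = {
    "snp_common": (60, 35, 5),
    "snp_rare": (97, 3, 0),
}

MAF_THRESHOLD = 0.05  # threshold stated in the screening step
kept = [
    name
    for name, counts in candidates.items()
    if minor_allele_frequency(*counts) > MAF_THRESHOLD
]
print(kept)
```

In practice the reference-panel frequency for the relevant population would replace the counts used here.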

Statistical analysis
Statistical analysis was conducted with SPSS 24.0 software (SPSS, Inc., Chicago, IL, USA) and RStudio (version 1.1.456; RStudio: Integrated development environment for R, Boston, MA, USA; http://www.rstudio.org/) using the glmnet package [25]. Elastic net regularization was used for feature selection; it performs variable selection automatically and shrinks the model to reduce overfitting and the effect of correlation among covariates [26]. This technique has been shown to be superior to other methods of analysis when the set of features is much larger than the number of cases [27]. The chi-squared test and Fisher exact test (for categorical variables) and the t test and Wilcoxon rank sum test (for continuous variables) were used to evaluate demographic characteristics and SNP genotypes. Odds ratios (ORs) and 95% confidence intervals (CIs) from logistic regression analysis were used to estimate the associations between variables (such as genetic models and lifestyles) and the risk of T2DM. A logistic regression model based on feature selection among the 102 SNPs and a model based on SNP and lifestyle features were developed separately using elastic net. A gene-score was calculated for each person from the 5 SNPs selected by elastic net, weighted by their respective coefficients. The gene-score was then combined with 31 environmental variables, and 6 variables with nonzero coefficients, including the gene-score, were selected by elastic net. Finally, receiver operating characteristic (ROC) curves were plotted to assess the performance of the models. Following Knol [28], we used Excel to calculate the relative excess risk due to interaction (RERI), ORs, and 95% CIs. Haploview, plink, and g-plink were used to calculate the p values for Hardy-Weinberg equilibrium. In all analyses, p values < 0.05 were considered statistically significant.
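As an illustration of how an OR and its 95% CI are obtained, the following Python sketch computes an unadjusted OR with a Wald interval from a 2×2 table. This is a simplification: the study estimated ORs by logistic regression (including adjusted ORs), which the sketch does not reproduce. For a single binary exposure, the unadjusted logistic-regression OR coincides with the 2×2 table OR. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.959964):
    """Unadjusted odds ratio and Wald 95% CI from a 2x2 table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study.
or_, lo, hi = odds_ratio_wald_ci(a=40, b=400, c=60, d=1600)
print(f"OR = {or_:.2f}, 95% CI = {lo:.2f}-{hi:.2f}")
```

An interval that excludes 1 corresponds to p < 0.05 for the Wald test of the same coefficient.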
The purpose of this study was not to establish a model with good performance in predicting T2DM, but rather to explain T2DM through a relatively meaningful model, such as which SNP or environmental factors are likely to cause the disease.
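For reference, the penalized objective that the glmnet package minimizes for a binary outcome can be written as follows. This is the standard glmnet parameterization, not a formula given in the study; the mixing parameter α used by the authors is not reported, so it is left symbolic here.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\left[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right) \right]
+ \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
      + \alpha\,\lVert\beta\rVert_1\right]
```

Here y_i ∈ {0, 1} indicates T2DM status, x_i is the feature vector of subject i, λ ≥ 0 controls the overall amount of shrinkage, and α ∈ [0, 1] balances the ridge (α = 0) and lasso (α = 1) penalties. The L1 term is what sets some coefficients exactly to zero and thereby performs feature selection.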

General characteristics
The subjects included 2163 randomly selected men and women: 54% of the subjects were female and 46% were male. A summary of their demographic characteristics such as age, sex, BMI, weight, HDL-C, LDL-C, TC, and TG is shown in Table 1. There were significant differences in age, weight, BMI, waistline, SBP, DBP, TG, LDL-C, degree of education, and exercise frequency between the case and control groups (p < 0.05) (Table 1). All studied SNPs in the control subjects were in Hardy-Weinberg equilibrium (p > 0.05). The MAF of each SNP was more than 5% to ensure that this study had sufficient statistical power (Table S1).
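The Hardy-Weinberg check can be sketched in Python as a one-degree-of-freedom chi-squared goodness-of-fit test. This is an assumption about the form of the test: the study used Haploview, plink, and g-plink, which may apply an exact test instead. The genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-squared test (1 df) of Hardy-Weinberg equilibrium for a biallelic SNP."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical control-group counts, illustrative only.
chi2_ok, p_ok = hwe_chi_square(36, 48, 16)    # counts that match HWE proportions
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)   # no heterozygotes at all
print(p_ok > 0.05, p_bad > 0.05)
```

A SNP would be treated as consistent with equilibrium when the p value exceeds 0.05, matching the criterion stated above.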

Gene-based model: SNPs associated with T2DM
Abbreviations: IQR, interquartile range; SD, standard deviation; BMI, body mass index; TC, total cholesterol; TG, triglyceride; HDL-C, high-density lipoprotein cholesterol; LDL-C, low-density lipoprotein cholesterol; SBP, systolic blood pressure; DBP, diastolic blood pressure.
Elastic net penalization allows for variable selection by shrinking the coefficients of the variables not related to the response to zero. Thus, variables with non-zero coefficients are considered as important predictors. Selection of the shrinkage parameter (lambda) for the elastic net model was performed by 20 repetitions of 10-fold cross-validation. The one-standard-error rule was used.
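The one-standard-error rule can be sketched in Python as follows. In glmnet, `cv.glmnet` reports both `lambda.min` and `lambda.1se`; the sketch reimplements only the selection step on a hypothetical table of cross-validation results and does not fit any model. The lambda grid, mean errors, and standard errors are invented for illustration.

```python
def lambda_one_se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambda_min minimizes the mean CV error; lambda_1se is the largest lambda
    (strongest penalty) whose mean error is within one standard error of it.
    """
    if not (len(lambdas) == len(cv_mean) == len(cv_se)) or not lambdas:
        raise ValueError("inputs must be non-empty and of equal length")
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return lambdas[i_min], max(eligible)


# Hypothetical cross-validation summary (e.g., mean deviance per lambda).
lambdas = [0.10, 0.05, 0.02, 0.01, 0.005]
cv_mean = [0.400, 0.370, 0.355, 0.350, 0.362]
cv_se = [0.010, 0.010, 0.010, 0.010, 0.010]

lam_min, lam_1se = lambda_one_se(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Choosing the larger lambda trades a small loss in cross-validated fit for a sparser model, which suits an explanatory analysis with few retained variables.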
Using this value as the minimum lambda value resulted in 5 variables being included in the prognostic model. The initial 102 SNPs were thus reduced to 5 potential predictors in 2163 people; these were the features with nonzero coefficients in the elastic net model (Model A). The 5 SNPs were rs5805 in SLC12A3, rs12654264 in HMGCR, rs2065412 and rs4149336 in ABCA1, and rs964184 in ZPR1. The area under the ROC curve for model A was 0.63 (Fig. 2). Figure 2 shows the ROC curves generated for each model. The black line represents model A, which was generated from SNP features using elastic net regression. Table 2 shows the associations of the 5 SNPs and environmental factors with T2DM, examined under each genetic model. Without adjustment, the recessive model of rs12654264 and the dominant models of rs2065412 and rs4149336 were significantly associated with T2DM (Table 2). In the additive models, the TT genotype of rs12654264 and CT genotype of rs4149336 were associated with a reduced risk of T2DM (unadjusted OR = 0.45, 95%CI = 0. 24

All covariance-based model
Because model A focused only on the influence of genes on T2DM, we built model B, which included genetic characteristics together with physiological, biochemical, and lifestyle indicators, to identify factors related to T2DM. After the 102 SNPs were reduced to 5 potential predictors, the 5 SNPs entered the gene-score formula obtained from the elastic net. A gene-score was calculated for every person as a linear combination of the selected features weighted by their respective coefficients. The gene-score was combined with 31 lifestyle variables, and 6 variables with nonzero coefficients, including the gene-score, were selected by elastic net (Model B). The red line in Fig. 2 represents model B, generated from the gene-score and lifestyle features using the same technique. The area under the ROC curve for model B was 0.71 (Fig. 2). The 6 variables were gene-score, hypertension, meat intake, waistline, education degree, and exercise frequency (Table S2 and Table 3).
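The gene-score and the area under the ROC curve can be sketched in Python as follows. The sketch assumes additive genotype coding (0, 1, or 2 copies of an allele), which the text does not state, and it uses hypothetical SNP names, coefficients, and subjects rather than the fitted values from the study.

```python
def gene_score(genotypes, coefficients):
    """Linear combination of genotype codes weighted by model coefficients."""
    return sum(coefficients[snp] * genotypes[snp] for snp in coefficients)


def auc(scores, labels):
    """Area under the ROC curve: probability that a random case outscores
    a random control, counting ties as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical nonzero coefficients for two placeholder SNPs.
coefficients = {"snp_a": 0.30, "snp_b": -0.20}

# Hypothetical subjects: (genotype codes, T2DM status with 1 = case).
subjects = [
    ({"snp_a": 2, "snp_b": 0}, 1),
    ({"snp_a": 1, "snp_b": 0}, 1),
    ({"snp_a": 0, "snp_b": 1}, 0),
    ({"snp_a": 1, "snp_b": 2}, 0),
    ({"snp_a": 1, "snp_b": 0}, 0),
]

scores = [gene_score(g, coefficients) for g, _ in subjects]
labels = [y for _, y in subjects]
model_auc = auc(scores, labels)
print(round(model_auc, 3))
```

Collapsing the selected SNPs into a single score lets the genetic component enter the second elastic net as one variable alongside the lifestyle variables.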

Interactions between gene polymorphism and other covariance estimators for the risk of T2DM
Because the variables in the model may interact, we explored these interactions further, guided by an extensive literature survey. We also examined the joint effects of different kinds of factors, for example, whether individuals with a higher genetic risk have a higher baseline risk of developing the disease than individuals with a similar healthy lifestyle but a lower genetic risk.
Table note: Adjusted for waist circumference, history of hypertension, meat intake, degree of education, and exercise. "/" means "versus"; "+" means "and".
hypertension (OR = 3.60, 95%CI = 1.84-7.04, p < 0.001); in rs12654264, within the AT + AA stratum, people with a history of hypertension were at a higher risk of T2DM than those without a history of hypertension (OR = 2.68, 95%CI = 1.58-4.56, p < 0.001); in rs2065412, within the hypertension stratum, subjects carrying the TT genotype had a higher risk of T2DM than those carrying the CC + CT genotype (OR = 1.94, 95%CI = 1.08-3.48, p = 0.026); in rs4149336, within the hypertension stratum, subjects carrying the CC genotype had a higher risk of T2DM than those carrying TT + CT, and within the CC genotype stratum, people with a history of hypertension were at a higher risk of T2DM (OR = 1.27, 95%CI = 1.00-1.61, p = 0.049; OR = 2.95, 95% CI = 1.53-5.68, p = 0.001) (Table 4). Table 5 shows the effect of the interaction between meat intake, exercise frequency, dyslipidemia, and hypertension on T2DM. For meat intake, compared with people without hypertension who eat white meat less than three times per week, those with hypertension who eat meat were at a higher risk of T2DM regardless of the

Discussion
To construct the model, 133 candidate features were reduced to 7 potential predictors by examining the predictor-outcome association and shrinking the regression coefficients with the elastic net method. This method is not only superior to choosing predictors based on the strength of their univariable association with the outcome [27-29] but also enables the panel of selected features to be combined into a model. Thus, the model, which makes use of easily accessible metrics, can serve as a more convenient biomarker for explaining T2DM. T2DM is a complex disorder, and several genes have been implicated in its etiology and evolution. The identification of risk alleles is useful because, if the genes involved and their functions are known, this information can be used to develop prevention, treatment, prognosis prediction, and/or curative methods for the disease. In the gene-based model, we examined the influence of genetic polymorphisms in four genes (SLC12A3, HMGCR, ABCA1, ZPR1) on T2DM through elastic net screening. Our data demonstrated that rs5805 in SLC12A3, rs12654264 in HMGCR, rs2065412 and rs4149336 in ABCA1, and rs964184 in ZPR1 were significantly associated with T2DM.
We found that the minor allele ("C") of rs5805 in SLC12A3 was associated with a reduced risk of T2DM in the Chinese population. SLC12A3, located on 16q13, encodes a thiazide-sensitive Na+-Cl− cotransporter that mediates reabsorption of Na+ and Cl− in the renal distal convoluted tubule and is expressed specifically in the kidneys [30]. Studies of SLC12A3 suggested that its genetic variants and rare mutations impact the development of hypertension and T2DM and/or nephropathy in Asian populations [31-33], which is consistent with the results of our study.
We found that variants in HMGCR were associated with the risk of diabetes. People carrying the TT genotype of rs12654264 are at a reduced risk of T2DM. Past studies have shown that HMGCR variants are associated with obesity or its subphenotypes, such as weight, BMI, or waist circumference [34-36]. Thus, the effect of HMGCR variants on the risk of diabetes is likely mediated by weight gain.
ABCA1 plays an important role in cholesterol metabolism, particularly for HDL-C [37]. Previous investigations have shown that the ABCA1 gene may influence cardiovascular risk in the general population [38]. In addition, the ABCA1 R230C polymorphism may play an important role in maintaining glucose-mediated insulin secretion and, in turn, has been associated with a 4-fold increase in the occurrence of diabetes [39]. Few studies have examined the role of ABCA1 polymorphisms (rs2065412 and rs4149336) in diabetes. We found a significantly higher frequency of both the T allele and genotype in the control group than in patients, indicating that the T allele is a protective factor against diabetes mellitus.
ZPR1 is located approximately 1.6 kb upstream of the APOA5-A4-C3-A1 gene complex. We found that rs964184 of ZPR1 was significantly associated with T2DM in Chinese individuals. This is consistent with the results of previous studies [40,41]. rs964184 is in the intron region of ZPR1 at chromosome 11q23.3. ZPR1 is an essential regulatory protein for cell proliferation and signal transduction and may have multiple physiological functions [41,42].
Multiple environmental risk factors, including gender, personal fitness status, weight, other physical conditions, and their interactions, can modulate serum lipid profiles, in addition to the effects of genetic background [13,43,44]. In the present study, demographic characteristics and lifestyle factors of the participants, including waistline, education degree, exercise frequency, hypertension, and meat intake, influenced T2DM. This has been confirmed in previous studies [11-13, 43, 44].
Epidemiological experts have suggested that quantitative interactions in the additive model are best suited for assessing the importance of interactions [26]. The RERI, along with its p value and 95% CI, was determined in this study. The RERI is generally considered the standard measure of additive interaction in case-control studies. We explored the interactions of gene-lifestyle factors, gene-biochemical indicators, and certain lifestyle factors with the risk of T2DM. Although the interactions between these indices were not statistically significant, carriers of the risk alleles of these SNPs who also had a history of hypertension or dyslipidemia were at a high risk of disease.
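The RERI point estimate can be sketched in Python from three odds ratios that share one reference group (neither factor present). The study computed RERI, its p value, and its CI in Excel following Knol; the sketch covers only the point estimate, and the odds ratios are hypothetical.

```python
def reri(or_both, or_a_only, or_b_only):
    """Relative excess risk due to interaction on the additive scale.

    All three odds ratios must use the same reference category:
    subjects exposed to neither factor A nor factor B.
    """
    return or_both - or_a_only - or_b_only + 1.0


# Hypothetical odds ratios, illustrative only.
no_interaction = reri(or_both=2.5, or_a_only=2.0, or_b_only=1.5)
positive_interaction = reri(or_both=3.0, or_a_only=2.0, or_b_only=1.5)
print(no_interaction, positive_interaction)
```

A RERI of 0 means the joint effect equals the sum of the separate excess risks; a value above 0 suggests a more-than-additive joint effect, and a CI that includes 0 corresponds to a non-significant interaction.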
This study had some limitations. First, our model was designed to explain the relationship between variables and disease and not to predict the risk of T2DM, and thus the model was not tested in new populations. Second, most responses related to lifestyles were obtained through questioning of the patients, and thus, there may have been recall bias. Finally, the conclusions may only be applicable to people in southern China. Studies in multiple regions and different populations using a randomized, large-scale, long-term design are needed.

Conclusions
In conclusion, the model which we built showed that four SNPs and 5 variance-covariance estimators were associated with T2DM in people in southern China. These results will provide a theoretical basis for gene and risk factor screening to prevent T2DM.
Additional file 1. The following are available online at www.mdpi.com/ xxx/s1: Table S1. Information on the 107 SNPs. Table S2. Elastic net regularisation feature selection for gene-score and lifestyles.