Development and validation of a deep learning model for predicting postoperative survival of patients with gastric cancer

Background Deep learning (DL), a specialized form of machine learning (ML), is valuable for forecasting survival in various diseases. Its clinical applicability in real-world patients with gastric cancer (GC) has yet to be extensively validated. Methods A combined cohort of 11,414 GC patients from the Surveillance, Epidemiology and End Results (SEER) database and 2,846 patients from a Chinese dataset were utilized. The internal validation of different algorithms, including DL model, traditional ML models, and American Joint Committee on Cancer (AJCC) stage model, was conducted by training and testing sets on the SEER database, followed by external validation on the Chinese dataset. The performance of the algorithms was assessed using the area under the receiver operating characteristic curve, decision curve, and calibration curve. Results DL model demonstrated superior performance in terms of the area under the curve (AUC) at 1, 3, and, 5 years post-surgery across both datasets, surpassing other ML models and AJCC stage model, with AUCs of 0.77, 0.80, and 0.82 in the SEER dataset and 0.77, 0.76, and 0.75 in the Chinese dataset, respectively. Furthermore, decision curve analysis revealed that the DL model yielded greater net gains at 3 years than other ML models and AJCC stage model, and calibration plots at 3 years indicated a favorable level of consistency between the ML and actual observations during external validation. Conclusions DL-based model was established to accurately predict the survival rate of postoperative patients with GC. Supplementary Information The online version contains supplementary material available at 10.1186/s12889-024-18221-6.


Background
Gastric cancer (GC) is one of the common malignant tumors, and surgical resection is still the only option to cure early GC and the main treatment for GC [1].Even if patients with GC have undergone radical surgery, there are still many factors affecting their survival and disease progression, including clinical as well as pathological factors, such as stage, histological type, depth of infiltration and lymph node and distant metastasis [2][3][4][5].Therefore, accurate prediction of postoperative survival rate is crucial for both patients and healthcare institutions.GC is a heterogeneous, multifactorial disease, and the variability of these multiple factors and the complexity of GC make treatment and survival prediction extremely difficult [6].Currently, clinicians usually assess survival based on the American Joint Committee on Cancer (AJCC) stage combined with their own medical experience, overlooking the role of other survival-influencing factors [7].The staging system is widely used and effective in guiding treatment decisions for GC.However, it fails to consider various factors such as sex, age, tumor size, and histopathological type, all of which can significantly influence survival prognosis.Additionally, traditional methods, such as Cox regression, in survival analysis encounter limitations including the requirement for proportional hazards and the assumption of linearity in continuous variables.These constraints may restrict their applicability in complex scenarios.However, deep learning-based prognostic models represent a significant advancement as they effectively address these issues.They can handle non-proportional hazards and model non-linear relationships between variables and outcomes, rendering them more versatile and accurate for survival prediction across diverse clinical settings.
Machine learning (ML) excels in acquiring information from high-dimensional, complex data, learning automatically and making predictions in supervised or unsupervised mode, and plays a major role in disease prognosis [8].Compared to the AJCC stage model, ML predictive models may be more appropriate for clinical settings to guide clinical decision making.To the best of our knowledge, there is a lack of effective predictive models for the correlation of several clinical factors with the prognosis of GC patients after surgery.Deep learning (DL), a special kind of ML model that includes multiple neural networks, can handle more complex information.The DL method, compared to traditional ML models including the multitask logistic regression and random forest models, has many advantages.First, DL can learn complex patterns and representations from large datasets, resulting in superior performance compared to traditional ML algorithms.Second, DL algorithm can scale effectively with large amounts of data.Third, DL model has the ability to automatically learn and extract relevant features from raw data.At last, DL models can leverage pre-trained models on large datasets, allowing for transfer learning.This approach enables the application of existing knowledge from one domain to another, even with limited labeled data, thereby reducing the need for extensive training data.Some studies have utilized DL models for analysis in surgical oncology research [9][10][11].However, most of these studies have focused on diagnostic applications, such as automated quantification of radiographic images, digital histopathologic image interpretation, or biomarker analysis [12][13][14][15].To our knowledge, there are limited published studies utilizing DL models for prognostic prediction in surgical oncology, particularly in the field of GC.Therefore, DL-based survival analysis offers insights into survival prediction after GC surgery.
The Surveillance, Epidemiology and End Results (SEER) database, established by the National Cancer Institute, is a comprehensive cancer registry with well-developed and regularly updated data that provides a wealth of information on patient clinical characteristics, treatment, and survival data [16].This study aims to extract information about postoperative GC patients from the SEER database, construct a GC survival prediction model by DL algorithm, and evaluate the accuracy of the constructed model using information collected from real-world GC patients to analyze the factors influencing the probability of GC survival and the 5-year survival status to provide decision support for clinical treatment and prognosis of GC.

Data collection and patient characteristics
The current research is a retrospective study that relies on data from the SEER database, which is maintained by the National Cancer Institute.The SEER database collects information on cancer incidence and survival rates from 18 cancer registries, covering approximately 27.8% of the US population.This study also includes patients diagnosed with gastric cancer between 2016 and 2020 at the Affiliated Cancer Hospital of Zhengzhou University, Henan Cancer Hospital, forming a Chinese dataset.All procedures involving human participants in this study followed the ethical standards established by the institutional and/or national research committee, as well as the 1964 Helsinki declaration and its subsequent amendments or equivalent ethical standards.Due to the retrospective nature of the study, informed consent from patients was not required.
In this study, individuals who were diagnosed with primary GC and met the following criteria were included: being above 18 years of age and having undergone surgical intervention.The identification of eligible patients was based on the definition of the primary tumor location, which encompassed the following categories: C16.0-Cardia, C16.1-Fundus of stomach; C16.2-Body of stomach; C16.3-Gastric antrum; C16.4-Pylorus; C16.5-Lesser curvature of stomach; C16.6-Greater curvature of stomach; C16.8-Overlapping lesion of stomach; C16.9-Stomach.Individuals with unknown age or survival duration were excluded from the study.For the purpose of analysis, important patient attributes were collected, which included the following information: age at diagnosis (in years), sex, tumor location, histology, grade, AJCC stage, tumor size (in millimeters), the number of examined lymph nodes (LN), the number of positive LNs, radiation treatment, and chemotherapy.

Multi-task logistic regression (MTLR)
The MTLR model presents a novel approach to survival analysis, extending traditional methods by directly modeling the survival function across multiple time intervals.The model accommodates the time-varying effects of covariates, allowing for a more nuanced understanding of risk factors.MTLR captures these dynamics, offering several advantages over Cox's proportional hazards and Aalen's additive models, which are traditional staples of survival analysis.Unlike these models, MTLR can handle non-proportional hazards and non-linear effects, resulting in improved predictive accuracy and flexibility.
A simplified view of the mathematical formulation behind MTLR is shown below: Assume we divide the survival time into N discrete intervals, [t 0 , t 1 ), [t 1 , t 2 ), . . ., [t N−1 , t N ], where t 0 = 0 and t N = ∞ (or some maximum follow-up time).For each interval i, MTLR models the probability that the event of interest (e.g., death) occurs within that interval, given that it has not occurred before t i .
The probability of the event occurring in the i th interval, given covariates X, is modeled as: where: • T is the survival time, • X represents the covariates (or features) of the patient, • βi is the coefficient vector for interval i, • N is the number of intervals.
Random Survival Forests (RSF) represent another innovative extension of Random Forests specifically adapted for survival analysis.RSF does not assume a specific form for the underlying hazard function, making it adaptable for various datasets.By aggregating predictions from multiple decision trees built on various subsamples of the data, RSF enhances prediction accuracy and robustness.
For a new observation, the survival function is estimated by aggregating predictions from all the trees in the forest.
In mathematical terms, if we denote S i (t) as the survival function estimated by the i th tree in the forest for time t, and N as the total number of trees, the overall survival function S(t) for an observation is given by: The DeepSurv model in PySurvival is a deep learningbased approach to survival analysis, renowned for its ability to capture complex, non-linear relationships in the data.DeepSurv extends traditional survival analysis models by using neural networks, allowing for more flexible and potentially more accurate modeling of survival data, especially when dealing with high-dimensional and complex datasets.DeepSurv utilizes a neural network to model the hazard function.The hazard function in the context of Deep-Surv can be expressed as: where: • h(t|X) is the hazard function at time t given covariates X. • h 0 (t) is the baseline hazard function, which is typically left unspecified.• exp(g(X, θ))is a non-linear function represented by the neural network, with X being the input covariates and θ representing the network's parameters.• The neural network g(X,θ) learns complex relationships between the input covariates and the log-risk of the event occurring.
Our decision to employ DL, MTLR, and RF into our study, alongside a comparison with the traditional TNM staging system, stemmed from our aim to explore a spectrum of machine learning approaches that address the unique challenges posed by survival data.Each method was selected based on its specific strengths in addressing different aspects of survival analysis: • DL was chosen for its unparalleled capability in modeling complex, non-linear relationships within high-dimensional datasets.
• MTLR was employed for its innovative approach to capturing time-varying effects of covariates across multiple time intervals.• RSF was utilized for its effectiveness in handling censored data and reducing overfitting, leveraging the ensemble strength of Random Forests tailored to survival analysis.
Various ML algorithms were employed in this study.We implemented a two-stage validation process for our survival analysis models.Initially, we conducted internal validation using the SEER database, where the data was randomly split into two portions: 60% for model training and 40% for validation.This partitioning enabled us to develop and subsequently evaluate the model within the same dataset.We employed a grid search approach combined with the C-index to select parameters for our survival analysis model.This method entails exploring a predefined set of parameter combinations, training the model on a training dataset, and then evaluating its performance on internal validation dataset using the C-index.This process systematically identifies the optimal parameters that enhance the model's predictive accuracy by ranking survival times effectively.The results of the grid search are provided in the supplementary materials.For external validation, we utilized an independent dataset from China, allowing us to assess the model's performance in a different patient population.This comprehensive approach ensured a thorough evaluation of our models across diverse clinical contexts [17].The ML algorithms that were tested in this study encompassed DL, MTLR and RF.The accuracy of these ML models was compared with TNM stage.In order to evaluate the performance of the model, various metrics were computed, including the area under the receiver operating characteristic curve [18].Area Under the Curve (AUC) is a performance measure that remains unaffected by specific thresholds and provides a comprehensive evaluation of the model's performance.
The AUC values range from 0.5 to 1.0, where 0.5 represents random chance and 1.0 represents perfect classification.Additionally, the calibration of the model, which compares predicted outcomes to observed outcomes, was evaluated through visual examination of calibration plots.Decision curve analysis was performed to calculate the clinical net benefit for each prediction model.The net benefit measures the advantages gained by using the model's predictions to guide decision-making.The net benefit of employing these strategies was compared to the models that rely on prognosis-based interventions, meaning interventions based on a predicted risk exceeding a specific threshold.

Statistical analysis
Categorical variables were compared for differences using the chi-square test, while the non-parametric Mann-Whitney test was used to compare differences between continuous variables.Univariate survival analysis was conducted using the Kaplan-Meier method, and the log-rank test was employed to compare the survival rates among different subgroups.The ML algorithms were trained and tested using Pysurvival, while SAS 9.4 was used for conducting survival analysis, chi-square tests, and Mann-Whitney tests.A significance level of P < 0.05 was considered statistically significant, using a two-sided test.

Study population and baseline characteristics
The number of identified patients was 2,846 in the Chinese and 11,414 in the SEER registry dataset, with median ages at diagnosis of 60.8 and 65.3 years, respectively (60.75 ± 10.07 vs. 65.32 ± 13.58 years, P < 0.01).The proportion of male patients in the Chinese dataset was higher than that in the SEER dataset (75.54% vs. 62.18%,P < 0.01).Patients in the Chinese dataset were more likely to have tumors located in overlapping locations (24.49% vs. 7.23%, P < 0.01).However, patients in the SEER dataset were more likely to have tumors located in the gastric antrum (21.34% vs. 15.39%,P < 0.01).Most patients in the Chinese dataset had grade II (55.48%) and T4 (51.65%), while patients in the SEER dataset had grade III and IV (60.55%) and T3 (36.00%).Patients in the Chinese dataset received less radiation (3.62% vs. 34.33%)and more chemotherapy (75.38% vs. 55.68%)than those in the SEER dataset.Among 11 variables included in this study, all parameters showed significant difference between the SEER and Chinese datasets (Table 1).

Patient prognosis analysis
We conducted univariate analyses of overall survival in patients from the datasets to evaluate the impact of all variables on GC prognosis (Fig. 1).The results of the univariate analyses of overall survival indicate that poor prognosis was associated with age (patients aged 60-69 years, 70-79 years, and over 80 years, compared to patients younger than 60 years), pathological type (signet-ring cell carcinoma compared to others), grade (grades II, III, IV compared to grade I), and TNM stage in both the SEER and Chinese dataset.There was a significant association between patients with different tumor sites.In the SEER database, patients with tumors in the greater curve (C16.6-Greatercurvature of stomach) were associated the poorest prognosis.However, in the Chinese dataset, patients with tumors in the stomach (C16.9-Stomach) were associated with the poorest prognosis (Figs. 2 and 3).

Performance of ML method
Correlation matrix showed a cluster of interconnected variables among the patients with GC (Fig. 4).
The ROC curves in Fig. 5 demonstrate the performance of our ML models on the validation and external testing sets, compared to TNM staging.Our model showed consistent performance between the validation and external testing sets, and outperformed TNM staging in predicting overall survival with a higher AUC value.Deep learning was found to have the highest AUC value 1, 3, and 5 years after surgery (from 0.75 to 0.82).Multitask logistic regression and random forest also showed a good AUC for predicting mortality (from 0.74 to 0.82).The random forest algorithm showed a relatively poorer predicting capability (only 0.70) compared to the other two models when predicting the prognosis in external validation 5 years after surgery (0.74 and 0.75, respectively).
We next evaluated the performance of the models using decision curve analysis, to account for the impact on treatment decision making (e.g.surveillance vs. radical treatment).The heterogeneous profile of the patient population renders a uniform treatment strategy (treat all or no patients) inferior to strategies informed by any one of the four models (Fig. 6).When performing internal validation of the models, we found that TNM stage provided the least benefit compared to the other models, while deep learning yielded the highest net benefit.However, the gain from the models was more than zero only at 3 years in the external validation.The gain from TNM stage was much less than the other three models and the difference is even greater when compared with deep learning model.
The calibration plots for 1-, 3-, and 5-year survival in the internal validation cohort demonstrated favorable consistency between the ML prediction and actual observation (Fig. 7).However, only the calibration plots at 3 years showed favorable consistency in the external validation between the ML prediction and actual observation.

Discussion
The current authoritative AJCC stage for predicting patient prognosis is hindered by widely varying clinical outcomes, despite similar staging and treatment regimens, suggesting that AJCC staging does not provide sufficient prognostic information [19,20].ML models represent a new approach that can improve prognosis prediction.Due to the autonomy-driven nature of ML, its ability to continuously add data as the patient progresses through treatment and visualize the prognosis of patients after GC is irreplaceable by all current predictive models [21].Better prediction of prognosis, especially in the context of limited healthcare resources, allows us to identify patients with possible recurrence early and thus  prevent them effectively, and has an impact on key clinical outcomes (e.g.mortality) [8,22,23].Previous studies have identified prediction models and algorithms that surpass classical models such as the AJCC TNM staging system [24,25].In this study, we developed and tested ML-trained prognostic models for predicting 5-year postoperative GC survival.In contrast to the TNM stage prediction model, our ML-based models used 11 clinical features, including the AJCC TNM stage, to achieve optimal modeling.Their inherent data-driven capability allows multiple variables to be automatically combined to obtain high a AUC value and good calibration from previously undetected factors.DL-based models, compared to traditional machine learning models including multitask logistic regression and random forest models, can more comprehensively reveal the underlying nonlinear relationships in data and handle more complex neural network models [9,12].A previous study constructed ML models, including the Boruta algorithm, neural network, support vector machine, and random forest, to accurately assess the overall survival of patients after GC surgery; however, no deep learning model was established for validation and comparison [26].Our study introduced the DL-based model to predict survival rates and showed that the DL model achieved better performance compared to that of the AJCC TNM stage system and traditional ML-based models in individualized estimation of survival in GC patients.In recent years, DL models have successfully been applied for analyzing clinical, imaging, and genetic data [12][13][14][15].However, the application of DL in predicting postoperative prognosis for GC patients has been limited.Zeng et al. developed a DL algorithm to predict survival rates for GC patients.However, the DL model was validated only using the SEER database and lacked validation with real-world data [17,27].In this study, a total of 11,414 patients with GC from the SEER database were included for internal validation and randomly divided into the training and testing groups at a 3:2 ratio.In addition, 2,846 GC patients from Henan Cancer Hospital were included for external validation to develop and test ML-trained prognostic models.Our study showed that DL-based models yielded more reliable predictions of survival compared to those by traditional ML and the AJCC stage models.These findings indicate the potential of DL models to significantly enhance individualized survival estimation for GC patients.
In our study, we found that the net gains of each model decision curve at 1, 3, and 5 years in the training set are larger than that in the test cohort, with multi-task logistic regression and DL slightly higher than TNM staging.In the test set, the net gain of each model was lower at 1 and 5 years, while at 3 years, the DL and multi-task logistic regression models were much more effective than the TNM staging.Moreover, the calibration curves showed that the training model fit was better than the test model for each year.In the test set, the accuracy of 3-year survival prediction was higher than that of 1-, 5-year survival prediction.The 1-year survival prediction probability was higher than the actual probability, and the 5-year survival prediction probability was lower than the actual probability.There are several reasons for this phenomenon: (1) the mortality rate of tumor patients is relatively low at 1 year and relatively high at 5 years, which makes the model identification more difficult.(2) The high accuracy of the training set and the low accuracy of the validation set may be due to the differences in the genetic background, race, and treatment of patients in the training set and the validation set.Methods such as transfer learning can be considered in subsequent studies to increase the accuracy in the validation set [28].
A limitation of this analysis is that the clinical prognostic model we developed does not incorporate any previously identified molecular markers that potentially influence patient prognosis [29][30][31].Including these markers in our predictive model will enhance its reliability.Moreover, this was a retrospective study with some selection bias.A large prospective study is needed to validate the developed model to obtain more generalized and robust predictive validity of DL.Lastly, The algorithms we selected indeed established methods within the field of machine learning and have been extensively applied across diverse datasets in various domains.However, our intention in utilizing these well-established methodologies was to harness their proven strengths in handling complex, high-dimensional survival data, particularly in assessing their efficacy within a survival analysis context mirroring real-world clinical scenarios.The novelty of our work lies not in the introduction of new algorithms but in the comprehensive application and comparative analysis of these methods against the traditional TNM staging system within the specific domain of survival analysis.
The DL-based model constructed in this study achieves the first validation of the SEER database combined with the Chinese population database on internal and external data, which can more accurately assess the prognosis of postoperative GC patients and helps to provide accurate and personalized treatment for postoperative GC patients in a clinical setting.

Conclusions
We constructed a DL-based model using data from both the SEER and Chinese population databases for the first time and subjected it to internal and external validation to predict the prognosis of postoperative GC patients.Our findings suggest that the DL-based model accurately predicts the survival rate of postoperative patients with GC.

Fig. 1
Fig. 1 Comparison of the 5-year overall survival of patients in the SEER and Chinese datasets.SEER: Surveillance, Epidemiology, and End Results

Fig. 2
Fig. 2 Kaplan-Meier survival analyses for overall survival of stomach cancer patients in the SEER dataset.(A): Comparison of the 5-year overall survival according to sex.(B): Comparison of the 5-year overall survival according to age.(C): Comparison of the 5-year overall survival according to different grades.(D): Comparison of the 5-year overall survival among patients of different races.(E): Comparison of the 5-year overall survival according to different T stages.(F): Comparison of the 5-year overall survival according to different N stage.(G): Comparison of the 5-year overall survival according to different M stage.(H): Comparison of the 5-year overall survival among patients undergoing radiotherapy.(I): Comparison of the 5-year overall survival among patients undergoing chemotherapy.SEER: Surveillance, Epidemiology, and End Results

Fig. 3 Fig. 4 Fig. 5
Fig. 3 Kaplan-Meier survival analyses for overall survival of stomach cancer patients in the Chinese dataset.(A): Comparison of the 5-year overall survival according to sex.(B): Comparison of the 5-year overall survival according to age.(C): Comparison of the 5-year overall survival according to different grades.(D): Comparison of the 5-year overall survival among patients of different races.(E): Comparison of the 5-year overall survival according to different T stages.(F): Comparison of the 5-year overall survival according to different N stage.(G): Comparison of the 5-year overall survival according to different M stage.(H): Comparison of the 5-year overall survival among patients undergoing radiotherapy.(I): Comparison of the 5-year overall survival among patients undergoing chemotherapy

Fig. 6
Fig. 6 The decision curve for predicting patient survival.The clinical net benefit for each prediction model is calculated across a range of risk threshold probabilities.Clinical net benefit is defined as the minimum probability of disease at which further intervention would be warranted.(A) 1 year after surgery in the SEER dataset (B) and 1 year after surgery in the Chinese dataset; (C) 3 years after surgery in the SEER dataset (D) and 3 years after surgery in the Chinese dataset; (E) 5 years after surgery in the SEER dataset (F) and 5 years after surgery in the Chinese dataset.SEER: Surveillance, Epidemiology, and End Results

Fig. 7
Fig. 7 Calibration curves for predicting patient survival.Calibration curves for 1-, 3-, and 5-year overall survival in the SEER and Chinese datasets.By plotting overall survival on x-axis and the actual observed overall survival on y-axis, the calibration of the model can be evaluated.A line closer to 45 degrees indicates better calibration, suggesting that the predicted probabilities closely match actual outcomes.(A) 1 year after surgery in the SEER dataset (B) and 1 year after surgery in the Chinese dataset; (C) 3 years after surgery in the SEER dataset (D) and 3 years after surgery in the Chinese dataset; (E) 5 years after surgery in the SEER dataset (F) and 5 years after surgery in the Chinese dataset.SEER: Surveillance, Epidemiology, and End Results

Table 1
Characteristics of included patients in Chinese and SEER dataset