Data source
The data of gross industrial production come from Tianjin Bureau of Statistics. The case data of pneumoconiosis in this study were collected from the follow-up survey of occupational pneumoconiosis patients in China ‘s National Programme of Action for the Prevention and Treatment of Pneumoconiosis. The basic information of pneumoconiosis patients in 2005 and before was obtained by the epidemiological survey data of pneumoconiosis, and the data of occupational pneumoconiosis cases reported from 2006 to 2019 was obtained by the occupational disease reporting system. A total of 10,694 pneumoconiosis patients were included in the study.
The basic information of pneumoconiosis patients such as gender, age, survival, region, industry classification, dust exposure time, pneumoconiosis type, stage, diagnosis date, death date and other information were collected.
DALY calculation
DALY can be defined as the total loss of healthy life years from onset to death [27], which consists of Years of Life Lost (YLLs) due to premature mortality and Years Lived with Disability (YLDs) due to disability [11]. The basic formula for calculating DALY in terms of specific disease could be expressed as
Several social preference values should be considered in the calculation of DALY, such as the disability weight between 0 and 1, the larger the value indicates the more loss of health life. The age weight is used to distinguish the relative life value of different age groups [28] and the time discount rate to distinguish the relative value of health life loss occurs in different periods. However, there have always been debates on whether or not the social preference values adopted are suitable and/or justifiable. We use the simplified DALY calculation method commonly used by WHO, which ignores the age weight and time discount, as shown in Eq. (2) and Eq. (3), respectively [10]:
$$YLD=I\times D\times L$$
(3)
where, N: number of premature deaths caused by a specific disease; L: standard life expectancy loss for each death in Eq. (2) or average duration of disease in Eq. (3); I: number of disabilities caused by a specific disease; and D: disability weight [29].
ARIMA model
ARIMA model has two parts: autoregressive (AR) and moving average (MA). In general, the model is expressed as ARIMA (p, d, q), p means the order of auto-regression, d means the order of difference and q means the order of moving average [30]. ARIMA needs to transform the non-stationary time-series into a stationary time-series, and then a model is established by regression of the lag value of the dependent variable and the present value and lag value of the random error term. The basic idea is to regard the data formed by the predicted object over time as a random sequence, describe the autocorrelation in the sequence with the corresponding mathematical model, and predict the future value by using the potential relationship between the past value and the present value of the sequence. The three main steps of establishing ARIMA time-series model are as follows: (1) Data preprocessing, observing the time-series diagram, autocorrelation analysis diagram and using the Augmented Dickey-Fuller (ADF) unit-root test to estimate whether the time-series is stable. If the sequence is a non-stationary sequence, the corresponding difference is used to smooth the sequence, and white noise test is carried out to test whether the difference sequence is white noise sequence; (2) Model identification, order determination and model parameter estimation. Autocorrelation Function (ACF) graph and Partial Autocorrelation (PACF) graph are used to estimate parameters, and the optimal model types and parameters can be screened by combining Akaike information criterion (AIC) and Bayesian information criterion (BIC), usually with the lowest AIC or BIC values [31]; (3) The Q-Q plots are used to test whether the residuals of the model meet the independent normal distribution, and the white noise analysis of the residuals is used to diagnose and test the optimal model. Finally, the better fitting model is used to predict [32].
DNN model
A DNN is an extension of an artificial neural network (ANN) with multiple hidden layers using a supervised learning technique called back propagation. The feedforward neural network consists of an input layer, an output layer and one or more hidden layers. In addition to the input nodes, each node uses a nonlinear activation function. If the number of hidden layers is more than one then it qualifies the term “deep”, so it is called deep neural network [33]. As shown in Fig. 1, the neurons in each layer of DNN use the following equation to calculate the function σ and activation function f(σ).(Eq. (4), Eq. (5)) [19].
$$\sigma : Sum=w\bullet x+b$$
(4)
$$y:f\left(\sigma \right)=f\left(w\bullet x+b\right)$$
(5)
where b is the bias; x is the input; y is the output; w is the weight; σ is the calculation function; f(σ) is the activation function.
LSTM model
LSTM is a machine learning algorithm with recursive neural network structure, which aims to avoid long-term dependency problems by remembering historical information [34].. According to the defined parameters and algorithms, LSTM neural network adds three gates structure to control the state of memory cells in each neuron: the input gate, the output gate and the forget gate (Fig. 2), all of which are controlled by the Sigmoid unit (0,1) [35].
The first forgetting gate ft is used to control the historical information last stored by the hidden layer node in the last time (Fig. 3):
$${f}_t=\sigma \left(\ {W}_f\left[\ {h}_{t-1},{x}_t\right]+{b}_f\right)$$
(6)
where ft is the forget gate; σ is the sigmoid function; Wx is the weight for the respective gate neurons; xt is the input and ht − 1 is the output of the hidden layer at the previous time; bf is the bias for the respective gate.
The input gate it is used to processes xt and ht − 1 in the current cell state (Fig. 4).
$${i}_t=\sigma\ \left(\ {W}_i\ \left[\ {h}_{t-1},{x}_t\right]+{b}_f\right)$$
(7)
The output gate ot is used to control the output of the currently hidden layer node (Fig. 5).
$${o}_t=\upsigma \left(\ {W}_0\ \left[\ {h}_{t-1},{x}_t\right]+{b}_0\right)$$
(8)
The expression of the current input unit state \({\overset{\sim }{C}}_t\) =tanh (Wc· ht − 1 +Wc· xt + bc); The current unit state is the last unit state multiplied by the element to the forgetting gate, plus the current input unit state multiplied by the element to the input gate: Ct = ft ∗ Ct − 1 + \({i}_t\ast {\overset{\sim }{C}}_t\); Final output of LSTM model: ht = ot *tanh (Ct). Where Ct represents the cell states at time t, \({\overset{\sim }{C}}_t\) is the candidate for cell state; tanh is the hyperbolic tangent function [22, 26, 36].
Model comparison
Three performance metrics including root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) were used to compare and evaluate the fitting and prediction accuracy of the three models. The smaller the values of the three metrics, the better the prediction effect. MAE is the simplest measure of fitting and prediction accuracy that determines the average prediction error. MAPE is the mean value of unsigned percentage error, which can solve the problem of distinguishing large error from small error, but it may underestimate the rare error. The root mean square error is extremely sensitive to rare errors by amplifying the prediction error, which can better reflect the accuracy of the prediction results. The specific calculation formulas are as follows [37]:
$$\textrm{RMSE}=\sqrt{\frac{1}{n}{\sum}_{i=1}^n\left({\hat{y}}_i-{y}_i\right)2}$$
(9)
$$\textrm{MAE}=\frac{1}{n}{\sum}_{i=1}^n\mid {\hat{y}}_i-{y}_i\mid$$
(10)
$$\textrm{MAPE}=\frac{100\%}{n}{\sum}_{i=1}^n\mid \frac{{\hat{y}}_i-{y}_i}{y_i}\mid$$
(11)
Where \({\hat{y}}_i\) is the predicted value, yi is the actual value, and n is the number of predicted data.
Statistical analysis
Excel 2019 software was used to establish a database, and the World Health Organization disease burden Excel template was used to calculate the DALY of pneumoconiosis. In our study, spearman correlation analysis was used to explore the correlation between variables. Restricted cubic splines (RCS) were used to study the nonlinear relationship between DALY caused by pneumoconiosis and the number of patients, the average age of onset, the average dust exposure time and the gross industrial production. These analytical methods were performed using R4.2.0.
The data from 1990 to 2016 were used as the training set, and 2017–2021 were used as the testing set to establish the prediction model. Python 3.9.5 was used to establish ARIMA model, multivariate LSTM model and DNN model. ARIMA model was mainly realized by statsmodels library, LSTM model and DNN model were mainly constructed based on PyTorch framework library of Anaconda environment. In this study, the statistical significance level of all hypothesis tests was set to 0.05.