Research on hand, foot and mouth disease incidence forecasting using hybrid model in mainland China

Zhao, Daren; Zhang, Huiwu; Zhang, Ruihua; He, Sizhang

doi:10.1186/s12889-023-15543-9

Research
Open access
Published: 31 March 2023


BMC Public Health volume 23, Article number: 619 (2023)


Abstract

Background

This study aimed to construct a more accurate model to forecast the incidence of hand, foot, and mouth disease (HFMD) in mainland China from January 2008 to December 2019 and to provide a reference for the surveillance and early warning of HFMD.

Methods

We collected data on the incidence of HFMD in mainland China between January 2008 and December 2019. The SARIMA model and the SARIMA-BPNN and SARIMA-PSO-BPNN hybrid models were used to predict the incidence of HFMD. Prediction performance was compared using the mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), mean absolute percentage error (MAPE), and correlation analysis.

Results

The incidence of HFMD in mainland China from January 2008 to December 2019 showed a fluctuating downward trend with clear seasonality and periodicity. The optimal SARIMA model was SARIMA(1,0,1)(2,1,2)_[12], with Akaike information criterion (AIC) and Schwarz Bayesian information criterion (BIC) values of 638.72 and 661.02, respectively. The optimal SARIMA-BPNN hybrid model was a 3-layer BPNN with 1, 10, and 1 nodes in the input, hidden, and output layers, respectively, and its R-squared, MAE, and RMSE values were 0.78, 3.30, and 4.15, respectively.

For the optimal SARIMA-PSO-BPNN hybrid model, the number of particles is 10, the acceleration coefficients c1 and c2 are both 1, the inertia weight is 1, the probability of change is 0.95, and the values of R-squared, MAE, and RMSE are 0.86, 2.89, and 3.57, respectively.

Conclusions

Compared with the SARIMA and SARIMA-BPNN hybrid models, the SARIMA-PSO-BPNN model can effectively forecast the change in observed HFMD incidence, which can serve as a reference for the prevention and control of HFMD.


Background

Hand, foot, and mouth disease (HFMD) is an acute infectious disease caused by enteroviruses such as enterovirus 71 (EV71) and coxsackievirus A16 (Cox A16); it occurs worldwide and is prevalent among children under five years of age [1,2,3]. HFMD is transmitted primarily via the gastrointestinal and respiratory tracts and through close contact, and can occur throughout the year [4]. Most children with HFMD have mild symptoms, but a small percentage of infected individuals develop severe disease [5, 6]. HFMD is a self-limiting disease that mainly manifests as fever and a vesicular rash on the hands, feet, and mouth [7]. A few children develop complications, such as myocarditis, pulmonary edema, and meningoencephalitis [7, 8]. HFMD has been reported in most regions of the world, especially in the Asia–Pacific and Western Pacific regions [3, 9]. An estimated 96,900 disability-adjusted life years (DALYs) per year are attributable to HFMD in some countries in East and Southeast Asia, where the disease imposes a more severe economic burden [10].

HFMD is not only a public health issue of global concern but has also become a widespread and typical infectious disease in mainland China. Several large outbreaks of HFMD occurred in China in 2007 and early 2008; HFMD was therefore added to the list of notifiable category C infectious diseases on May 2, 2008 [11]. Since HFMD came under category C management in the Chinese Communicable Diseases Control Law, the numbers of HFMD cases and deaths have ranked at the top of the list of legally reported infectious diseases in mainland China [12]. An average of approximately 2 million cases have been reported each year in the 31 provinces and municipalities of mainland China [13]. HFMD also imposes a substantial economic burden in mainland China. A study on the economic burden of HFMD in mainland China showed that the average per capita treatment cost was 600–1,000 RMB for mild outpatient cases, 3,000–5,000 RMB for general inpatient cases, and 15,000–25,000 RMB for severe cases (without considering the impact on social productivity) [14]. Moreover, the direct economic burden of all severe HFMD cases in Jiangsu Province, China, was estimated at RMB 16.64 million during 2017–2018 [15]. Therefore, prevention and control of HFMD continues to be an important public health issue in mainland China.

Early surveillance and warning of HFMD are high priorities. If the government and related departments can effectively monitor HFMD and provide accurate early warnings, they will be able to respond in advance and obtain information for the proper allocation of medical resources [16]. Therefore, strengthening the surveillance and prediction of HFMD epidemiological trends in China is important for implementing effective preventive and control measures. Exploring approaches to enhance early monitoring and warning capabilities has become an urgent priority for improving China’s public health system.

Many scholars have conducted extensive research on predicting the incidence of HFMD. Because HFMD incidence shows obvious seasonal and periodic characteristics, most current studies have used the traditional time-series auto-regressive integrated moving average (ARIMA) model for forecasting. Although the ARIMA model has performed well in predicting the incidence of HFMD [17, 18], it cannot fully mine the nonlinear information in seasonal infectious disease data [19]. Some studies have used machine learning models to predict the incidence of HFMD [20,21,22], but these models may not explain the nonlinear functions within the time-series data in practice [23]. Moreover, a few studies have combined traditional time-series ARIMA models with machine learning models to develop hybrid models with better prediction performance [16, 23, 24]. However, these hybrid models only combine the advantages of the two models, and their parameters may be insufficiently optimized. Therefore, the prediction performance of these models needs to be further improved.

To overcome the shortcomings of a single SARIMA for nonlinear information processing and the hybrid model with insufficiently optimized parameters, in the present study, we first proposed a SARIMA-PSO-BPNN hybrid model for forecasting the incidence of HFMD between January 2008 and December 2019 in mainland China. We constructed the SARIMA and SARIMA-BPNN hybrid models based on the data characteristics of HFMD incidence in mainland China and optimized the SARIMA-BPNN hybrid model using the Particle Swarm Optimization (PSO) algorithm. Predictions from the SARIMA-PSO-BPNN hybrid model can serve as an information reference for the surveillance and early warning of HFMD in mainland China.

Methods

Data source

Data on monthly HFMD incidence from January 2008 to December 2018 in 31 provinces and municipalities in mainland China were obtained from the China Public Health Science Data Center website (https://www.phsciencedata.cn/Share/index.jsp). The total number of HFMD cases from January to December 2019 was obtained from the National Health Commission of the People’s Republic of China’s website (http://www.nhc.gov.cn/jkj/pgzdt/new_list.shtml). The overall population size in 2019 was obtained from the Chinese Statistical Yearbook (http://www.stats.gov.cn/tjsj/ndsj/2021/indexch.htm). The average population per year was calculated as the mean of the populations at the beginning and end of the year.

A total of 144 monthly observations of HFMD incidence in mainland China from 2008 to 2019 were included in this study. We divided the HFMD incidence data into training and test sets. HFMD incidence data from January 2008 to December 2018 were used as the training set to construct the models, and data from January to December 2019 were used as the test set to evaluate the generalization capability of the models.
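The chronological split described above can be sketched in a few lines. This is only an illustration using month labels; the function name `chronological_split` is hypothetical and no incidence values are reproduced here.

```python
# Month labels for January 2008 .. December 2019 (12 years x 12 months).
months = [(year, month) for year in range(2008, 2020) for month in range(1, 13)]


def chronological_split(labels, first_test_year):
    """Split (year, month) labels into training and test sets by year.

    Everything before `first_test_year` is training data; the rest is test data.
    """
    train = [m for m in labels if m[0] < first_test_year]
    test = [m for m in labels if m[0] >= first_test_year]
    return train, test


train_months, test_months = chronological_split(months, 2019)
print(len(months), len(train_months), len(test_months))  # 144 132 12
```

Keeping the split chronological, rather than random, means the 2019 test months are never seen while the models are fitted.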

SARIMA model

The auto-regressive integrated moving average (ARIMA) model is a well-known time-series forecasting method proposed by Box and Jenkins in the early 1970s, also known as the Box-Jenkins model [25]. If the time series contains significant seasonal characteristics, the model can be extended to a seasonal ARIMA (SARIMA) model. The SARIMA model is denoted SARIMA(p, d, q)(P, D, Q)_s and can be expressed as [26, 27]:

$$\nabla^{d} \nabla_{s}^{D} Y_{t} = \frac{\theta_{q}(B)\,\Theta_{Q}(B^{s})}{\phi_{p}(B)\,\Phi_{P}(B^{s})}\varepsilon_{t}$$

(1)

$$\phi_{p}(B) = 1 - \phi_{1}B - \phi_{2}B^{2} - \phi_{3}B^{3} - \cdots - \phi_{p}B^{p}$$

(2)

$$\theta_{q}(B) = 1 - \theta_{1}B - \theta_{2}B^{2} - \theta_{3}B^{3} - \cdots - \theta_{q}B^{q}$$

(3)

$$\Phi_{P}(B^{s}) = 1 - \Phi_{1}B^{s} - \Phi_{2}B^{2s} - \Phi_{3}B^{3s} - \cdots - \Phi_{P}B^{Ps}$$

(4)

$$\Theta_{Q}(B^{s}) = 1 - \Theta_{1}B^{s} - \Theta_{2}B^{2s} - \Theta_{3}B^{3s} - \cdots - \Theta_{Q}B^{Qs}$$

(5)

where p, q, P, and Q denote the auto-regressive order, the moving average order, the seasonal auto-regressive order, and the seasonal moving average order, respectively. D and d denote the degrees of seasonal and trend differencing, respectively, and s denotes the length of the seasonal period. B is the backward shift operator, Y_t is the HFMD incidence time series at time t, and ε_t is the residual of the HFMD incidence time series. $\phi_{p}(B)$ is the auto-regressive coefficient polynomial of order p, $\theta_{q}(B)$ is the moving average coefficient polynomial of order q, $\Phi_{P}(B^{s})$ is the seasonal auto-regressive polynomial of order P, and $\Theta_{Q}(B^{s})$ is the seasonal moving average polynomial of order Q.
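The left-hand side of Eq. (1) applies d ordinary differences and D seasonal differences of period s to the series. A minimal sketch of that operation follows; the helper names `difference` and `sarima_difference` and the synthetic series are assumptions made for illustration, not the authors' R code.

```python
def difference(series, lag=1):
    """Return the lag-`lag` difference: y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series, d=0, D=1, s=12):
    """Apply D seasonal differences of period s, then d ordinary differences."""
    out = list(series)
    for _ in range(D):
        out = difference(out, lag=s)
    for _ in range(d):
        out = difference(out, lag=1)
    return out


# Synthetic monthly series (not HFMD data): linear trend plus a fixed 12-month pattern.
pattern = [1, 2, 4, 8, 12, 11, 7, 5, 6, 5, 3, 2]
series = [0.5 * t + pattern[t % 12] for t in range(36)]

# One seasonal difference (d=0, D=1, s=12) removes the repeating pattern
# and leaves the constant yearly increase of the trend, 0.5 * 12 = 6.
stationary = sarima_difference(series, d=0, D=1, s=12)
```

Each seasonal difference shortens the series by s points, so 36 synthetic months yield 24 differenced values here.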

Four major steps are involved in the construction of the SARIMA model [19, 28]. The first step is to determine whether the time series is stationary. In general, the stationarity of a time series is assessed by plotting the original time series or by using tests such as the Augmented Dickey-Fuller (ADF) test. If the time series is non-stationary, it must be converted into a stationary time series using differencing or a logarithmic transformation. The second step is to identify the parameters of the SARIMA model. Possible values of p, q, P, and Q are initially determined from plots of the auto-correlation function (ACF) and partial auto-correlation function (PACF), and candidate SARIMA models are then fitted using these values. The third step is model diagnosis. The residuals are tested using the Ljung-Box Q test, and the statistical significance of the model parameters is assessed using t-tests and p-values. The fourth step is to select the optimal model from the candidates, namely the model with white-noise residuals and the lowest AIC and BIC values.
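To make the diagnostic step more concrete, the sketch below computes the sample autocorrelation and the Ljung-Box Q statistic, Q = n(n + 2) Σ ρ_k² / (n − k), for a residual series. The function names are hypothetical and the residuals are made-up numbers, not values from this study. Under the white-noise hypothesis, Q is usually compared with a chi-square distribution whose degrees of freedom equal the number of lags minus the number of fitted ARMA parameters; that comparison is left out because the Python standard library has no chi-square distribution.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Illustrative residuals only (not from the fitted SARIMA model).
residuals = [0.4, -0.1, 0.3, -0.5, 0.2, 0.1, -0.3, 0.0, 0.2, -0.2, 0.1, -0.2]
q_stat = ljung_box_q(residuals, max_lag=6)
```

A small Q relative to the chi-square critical value suggests the residuals contain no remaining autocorrelation, which is what "white-noise residuals" means in the fourth step.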

BPNN model

A backward propagation neural network (BPNN) is a multilayer feedforward neural network with output results using forward propagation and errors using backward propagation [29]. The main working principle of the BPNN is to use machine learning to continuously iterate the training model, calculate the error between the actual and expected output values based on the minimum mean squared error criterion, and adjust the weights and thresholds of each layer of the network using the gradient descent strategy to minimize the error [30]. A classical BPNN is a 3-layer neural network consisting of an input layer, hidden layer, and output layer with fully interconnected neurons between adjacent layers and unconnected neurons within the same layer [21].

There are three main steps in BPNN modeling [21, 31]: (1) initialization of the network and setting of network parameters, (2) normalization of the original data, dividing the training and test sets of the data, and back-propagation of the associated error calculation and adjustment of thresholds and weights, and (3) inverse normalization of the data to obtain the predicted values. The basic structure of a BPNN is shown in Fig. 1.

As shown in Fig. 1, we set up a 3-layer neural network with an input layer, hidden layer, and output layer. Assume that the input vector is A = [X₁, X₂, X₃, ..., X_m], the hidden layer output vector is H = [H₁, H₂, H₃, ..., H_h], the prediction output vector is O = [O₁, O₂, O₃, ..., O_n], and the actual output vector is Y = [Y₁, Y₂, Y₃, ..., Y_n].

The hidden layer output (H_j) is expressed as [32, 33]:

$$H_{j} = f\left( \sum\limits_{i = 1}^{m} W_{ij} X_{i} - \theta_{j} \right), \quad j = 1,2,...,h$$

(6)

where m is the number of input nodes, h is the number of hidden nodes, X_i is the i-th input, W_ij is the connection weight from input layer unit i to hidden layer unit j, θ_j is the threshold of hidden layer unit j, and f is the excitation function:

$$f(x) = \frac{1}{1 + \exp ( - x)}$$

(7)

The prediction output (O_k) is expressed as:

$$O_{k} = \sum\limits_{j = 1}^{h} H_{j} W_{jk} - \theta_{k}, \quad k = 1,2,...,n$$

(8)

where n is the number of output nodes, W_jk is the connection weight from hidden layer unit j to output layer unit k, and θ_k is the threshold of output layer unit k.
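The forward pass in Eqs. (6) to (8) can be sketched as below, assuming the usual weighted-sum form in which each weight multiplies its input before the threshold is subtracted. The function `forward` and the weight values are hypothetical; the 1-10-1 layout mirrors the network size reported in the abstract, but the numbers are not the trained weights.

```python
import math


def sigmoid(x):
    """Excitation function of Eq. (7)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_in, theta_hidden, w_out, theta_out):
    """Forward pass of a 3-layer network.

    x            : list of inputs
    w_in[i][j]   : weight from input i to hidden unit j
    theta_hidden : thresholds of the hidden units
    w_out[j][k]  : weight from hidden unit j to output k
    theta_out    : thresholds of the output units
    """
    hidden = [
        sigmoid(sum(w_in[i][j] * x[i] for i in range(len(x))) - theta_hidden[j])
        for j in range(len(theta_hidden))
    ]
    outputs = [
        sum(hidden[j] * w_out[j][k] for j in range(len(hidden))) - theta_out[k]
        for k in range(len(theta_out))
    ]
    return hidden, outputs


# A 1-10-1 network with placeholder weights (illustrative values only).
w_in = [[0.0] * 10]           # one input feeding ten hidden units
theta_hidden = [0.0] * 10
w_out = [[1.0] for _ in range(10)]
theta_out = [0.0]
hidden, outputs = forward([0.3], w_in, theta_hidden, w_out, theta_out)
```

With all input weights and thresholds at zero, every hidden unit outputs sigmoid(0) = 0.5, so the single output is the sum of ten such values. Training would replace these placeholders with weights adjusted by back-propagation.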

The prediction error e_k is then expressed as:

$${\text{e}}_{{\text{k}}} = Y_{k} - O_{k}$$

(9)

where O_k is the prediction output, and Y_k is the actual output.

The number of hidden layer nodes was calculated using Eq. (10).

$$h = \sqrt {m + n} + a$$

(10)

where h is the number of hidden layer nodes, m and n are the numbers of input layer nodes and output layer nodes, respectively, and a is an adjustment constant between 1 and 10.
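As a quick worked example of Eq. (10), the sketch below lists the hidden-layer sizes the rule suggests for a network with one input node and one output node. Rounding to the nearest integer is an assumption made here, since the equation itself yields a non-integer value; the function name is hypothetical.

```python
import math


def hidden_node_candidates(m, n, a_values=range(1, 11)):
    """Candidate hidden-layer sizes from h = sqrt(m + n) + a, rounded to integers."""
    return [round(math.sqrt(m + n) + a) for a in a_values]


# One input node and one output node, a = 1..10.
candidates = hidden_node_candidates(1, 1)
print(candidates)  # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

The rule only narrows the search range; the final size is still chosen by comparing the fitted networks.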

SARIMA-BPNN hybrid model

The SARIMA-BPNN hybrid model was developed in a manner similar to the BPNN model. The modeling steps of the SARIMA-BPNN hybrid model are as follows: (1) The optimal SARIMA model was constructed using the HFMD incidence data from January 2008 to December 2018 in mainland China. (2) A 3-layer BPNN model was constructed using the predicted values from the optimal SARIMA model as the input variable and the observed values of HFMD incidence as the output variable. (3) Following the modeling steps of the BPNN model, the data were divided into a training set (70% of the data) and a test set (30% of the data) and then normalized before constructing the model. (4) The trained BPNN was then used for simulation, and the simulated data were back-normalized to obtain the predicted values. The optimal SARIMA-BPNN hybrid model was the one with the largest R-squared value and the lowest MAE and RMSE values.
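Steps (3) and (4) above can be illustrated with a small normalization and splitting sketch. Several details are assumptions: the text does not state the normalization range or whether the 70/30 split is random or ordered, so this example scales to [0, 1] and splits in order. All function names and the sample pairs are hypothetical.

```python
def fit_min_max(values):
    """Return the (min, max) pair used for scaling."""
    return min(values), max(values)


def normalize(values, lo, hi):
    """Scale values linearly so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(values, lo, hi):
    """Invert `normalize` to recover values on the original scale."""
    return [v * (hi - lo) + lo for v in values]


def split_70_30(pairs):
    """Split (input, target) pairs in order into 70% training and 30% test."""
    cut = int(len(pairs) * 0.7)
    return pairs[:cut], pairs[cut:]


# Illustrative (SARIMA prediction, observed incidence) pairs, not study data.
pairs = [(float(i), float(i) + 0.5) for i in range(132)]
train_pairs, test_pairs = split_70_30(pairs)

targets = [t for _, t in train_pairs]
lo, hi = fit_min_max(targets)
scaled = normalize(targets, lo, hi)
restored = denormalize(scaled, lo, hi)
```

Fitting the scaling bounds on the training portion only, as done here, avoids leaking information from the test portion into the model.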

SARIMA-PSO-BPNN hybrid model

Particle Swarm Optimization (PSO) is a heuristic search technique with simple implementation, high global search capability, and superior performance [34]. PSO simulates the foraging behavior of birds and is used to solve optimization problems [35]. PSO was introduced into the BPNN model to accelerate the convergence of the traditional BPNN algorithm. The main modeling steps of the SARIMA-PSO-BPNN hybrid model are as follows [36]: (1) The parameters of the PSO algorithm were set based on the established SARIMA-BPNN hybrid model. The population size, variable range, inertia weight, learning factors, and maximum number of iterations were determined by trial and error. (2) The fitness value of each particle was calculated using Eq. (11) [36]:

$$fitness = \frac{1}{1 + E}$$

(11)

where E is the training error precision.

The velocity and position of each particle are then updated according to Eq. (12).

$$\begin{aligned} v_{i}(t + 1) &= \omega v_{i}(t) + c_{1} r_{1} \left( pbest_{i}(t) - x_{i}(t) \right) + c_{2} r_{2} \left( gbest(t) - x_{i}(t) \right), \\ x_{i}(t + 1) &= x_{i}(t) + v_{i}(t + 1) \end{aligned}$$

(12)

where pbest_i is the best position found so far by particle i and gbest is the best position found so far by the swarm; x_i is the position vector; v_i is the velocity vector; ω is the inertia weight; c₁ and c₂ are the learning factors; and r₁ and r₂ are random values between 0 and 1. (3) The PSO-optimized weights and thresholds were substituted into the BPNN. The neural network optimized using PSO was trained using training samples until the error requirement was satisfied. Finally, an optimal SARIMA-PSO-BPNN hybrid model was constructed.
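A minimal PSO loop built on the fitness in Eq. (11) and the standard velocity and position updates might look like the sketch below. It is not the authors' MATLAB implementation: the function `pso_minimize`, the default values of `w`, `c1`, and `c2`, and the toy error function are all assumptions chosen for illustration. In the hybrid model, a particle would instead encode the BPNN weights and thresholds and the error would be the network's training error.

```python
import random


def pso_minimize(error_fn, dim, n_particles=10, iterations=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a position with low error; fitness = 1 / (1 + error)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    def fitness(p):
        return 1.0 / (1.0 + error_fn(p))

    # Personal bests start at the initial positions.
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull to personal best + pull to swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


def toy_error(p):
    """Squared distance to an arbitrary target point (stand-in for training error)."""
    target = [0.3, -0.2]
    return sum((a - b) ** 2 for a, b in zip(p, target))


best_position, best_fitness = pso_minimize(toy_error, dim=2)
```

Because the fitness is 1 / (1 + E), it lies in (0, 1] and approaches 1 as the error approaches zero, so maximizing fitness is equivalent to minimizing error.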

Evaluation of prediction performance

The mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), mean absolute percentage error (MAPE), and correlation analyses were applied to comprehensively evaluate the prediction performance of the SARIMA model and the SARIMA-BPNN and SARIMA-PSO-BPNN hybrid models. The smaller the values of MAE, MSE, RMSE, and MAPE, the better the prediction performance of the model [37]. These indicators are expressed as follows:

$${\text{MAE}} = \frac{{\sum\limits_{t = 1}^{n} {\left| {X_{t} - {\hat{\text{X}}}_{t} } \right|} }}{n}$$

(13)

$${\text{MSE}} = \frac{1}{n}\sum\limits_{t = 1}^{n} {(X_{t} - \hat{X}_{t} )^{2} }$$

(14)

$${\text{RMSE}} = \sqrt {\frac{{\sum\limits_{t = 1}^{n} {(X_{t} - {\hat{\text{X}}}_{t} )^{2} } }}{n}}$$

(15)

$${\text{MAPE}} = \frac{1}{n}\sum\limits_{t = 1}^{n} {\left| {\frac{{X_{t} - {\hat{\text{X}}}_{t} }}{{X_{t} }}} \right|}$$

(16)

where ${\hat{\text{X}}}_{t}$ is the predicted value, $X_{t}$ is the observed value of the monthly HFMD incidence, and n is the sequence sample size.
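Eqs. (13) to (16) translate directly into code. The sketch below follows the formulas as written, so MAPE is returned as a fraction rather than a percentage; the observed and predicted values are made-up numbers used only to exercise the functions.

```python
import math


def mae(observed, predicted):
    """Mean absolute error, Eq. (13)."""
    return sum(abs(x - p) for x, p in zip(observed, predicted)) / len(observed)


def mse(observed, predicted):
    """Mean squared error, Eq. (14)."""
    return sum((x - p) ** 2 for x, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error, Eq. (15)."""
    return math.sqrt(mse(observed, predicted))


def mape(observed, predicted):
    """Mean absolute percentage error as a fraction, Eq. (16).

    Observed values must be non-zero.
    """
    return sum(abs((x - p) / x) for x, p in zip(observed, predicted)) / len(observed)


# Illustrative values only (not results from this study).
observed = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 44.0]
scores = {
    "MAE": mae(observed, predicted),
    "MSE": mse(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAPE": mape(observed, predicted),
}
```

MAE and RMSE are in the units of the incidence data, while MAPE is scale-free, which is why the metrics are reported together.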

Pearson’s and Spearman’s correlation coefficients were used to test the correlation between the predicted values of each model and observed values. A correlation coefficient with an absolute value closer to 1 indicates a stronger correlation between two variables [38, 39]. The strength of the correlation was evaluated [38, 39] as shown in Table 1.

Table 1 Correlation strength judgment
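The two correlation coefficients used here can be computed as in the sketch below. Spearman's coefficient is obtained as Pearson's coefficient of the ranks, with tied values given their average rank; the function names and example values are illustrative, not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1, with ties assigned the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# A monotonic but nonlinear relationship (illustrative values only).
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]
r_pearson = pearson(observed, predicted)
r_spearman = spearman(observed, predicted)
```

In this example Spearman's coefficient is exactly 1 because the ordering is preserved, while Pearson's coefficient is slightly below 1 because the relationship is not linear. Reporting both therefore distinguishes monotonic agreement from strictly linear agreement.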


Data analysis

R software (version 4.1.1) was used to construct the SARIMA model, and MATLAB software (Version R2020b, MathWorks, Natick, MA, USA) was used to construct the SARIMA-BPNN and SARIMA-PSO-BPNN hybrid models. The level of significance was set at p < 0.05.

Results

General description

As shown in Fig. 2, the HFMD incidence time series showed clear seasonal and periodic patterns. The HFMD incidence rate increased at an average monthly rate of 2.56% from January 2008 to December 2019, with a peak between May and June and a trough between January and February each year.