Forecasting the incidence of mumps in Chongqing based on a SARIMA model

Background Mumps is classified as a class C infectious disease in China, and the Chongqing area has one of the highest incidence rates in the country. We aimed to establish a prediction model for mumps in Chongqing and analyze its seasonality, which is important for risk analysis and allocation of resources in the health sector. Methods Data on the incidence of mumps from January 2004 to December 2018 were obtained from the Chongqing Municipal Bureau of Disease Control and Prevention. The incidence of mumps from 2004 to 2017 was fitted using a seasonal autoregressive integrated moving average (SARIMA) model. The root mean square error (RMSE) and mean absolute percentage error (MAPE) were used to compare the goodness of fit of the models. The 2018 incidence data were used for validation. Results From 2004 to 2018, a total of 159,181 cases of mumps (93,655 males and 65,526 females) were reported in Chongqing, with significantly more males than females. The age group of 0–19 years accounted for 92.41% of all reported cases, and students made up the largest proportion (62.83%), followed by scattered children and children in kindergarten. SARIMA(2, 1, 1) × (0, 1, 1)12 was the best-fitting model, with RMSE and MAPE of 0.9950 and 39.8396%, respectively. Conclusion The incidence of mumps in Chongqing has an obvious seasonal trend, and the SARIMA(2, 1, 1) × (0, 1, 1)12 model can predict the incidence of mumps well. The SARIMA model of time series analysis is a feasible and simple method for predicting mumps in Chongqing.


Background
Mumps is a disease caused by infection with the mumps virus. It is a vaccine-preventable infectious disease [1], and the main population affected is children and adolescents [2]. The typical clinical manifestation of mumps virus infection is pain and swelling of the parotid gland, but the virus may also affect various other tissues and organs [3]. It can cause serious complications, such as encephalitis, meningitis, orchitis, myocarditis, pancreatitis, and nephritis [3,4]. Patients generally recover spontaneously within a few weeks of infection, but the disease can have long-term consequences, such as seizures, cerebral palsy, hydrocephalus and deafness [4,5]. Mumps is a global epidemic [6], with outbreaks occurring in several regions, such as Ireland [7], Nebraska [8], and Arkansas [9]. The incidence of mumps in China is high [10].
In 2018, China had the highest number of cases in the world (259,071 cases), followed by Nepal (29,614 cases) and Burkina Faso (26,982 cases) [11]. From 2004 to 2018, China reported 4,272,368 cases, and the average incidence was 21.44 per 100,000 per year [12]. In 1990, mumps was included in the management of class C infectious diseases (class C infectious diseases are known as surveillance and management infectious diseases, including filariasis, hydatid disease, leprosy, influenza, mumps, epidemic and endemic typhus, rubella, acute hemorrhagic conjunctivitis, hand, foot and mouth disease, and infectious diarrhoeal diseases other than amoebic dysentery, typhoid and paratyphoid, etc.) [13]. Mumps-containing vaccines were included in the expanded national immunization program in 2008 [14]. From 2014 to 2016, the reported incidence of mumps declined nationwide, but it rose again in 2017 and 2018 [12]. Mumps is highly contagious and often causes outbreaks in schools, nurseries and other collective units, seriously disrupting normal teaching. Mumps is one of the important public health problems that endanger the physical and mental health of children and adolescents in China [15]. Thus, understanding the epidemic pattern and predicting the epidemic trend of mumps is crucial for risk analysis and health resource allocation in the health sector.
Time series analysis quantitatively predicts the future trend of a disease from historical data and time variables. It is a quantitative analysis method that does not consider the influence of complex external factors [16]. The ARIMA model is one of the most representative and widely used models in time series prediction [17]; the method is simple and requires only endogenous variables, with no exogenous variables. In epidemiological studies, the ARIMA model has been applied to many diseases, such as malaria [18], tuberculosis [19,20], dengue [21] and others [22,23]. To the best of our knowledge, Chongqing is one of the areas with the highest incidence rates in China [24]. Furthermore, no previous research has been conducted to predict the incidence of mumps in Chongqing. To address these gaps, this paper is the first to establish a time series analysis of the mumps incidence data for short-term prediction in Chongqing. We developed a seasonal autoregressive integrated moving average (SARIMA) model to forecast the incidence of mumps. The SARIMA model extends the ARIMA model with a seasonal component, which is generally handled by seasonal differencing [25]. The results of this study may be useful for predicting mumps epidemics and offer reference information for mumps control and intervention in Chongqing.

Data collection
In this study, mumps data from 2004 to 2018 were collected from the Chongqing CDC. All cases in each region were verified through clinical and laboratory diagnosis and reported to the CDC by the health department. The reported data include the sex, occupation, age and region of each patient. The data were collected from each region and submitted to the Chongqing CDC.

SARIMA model construction
SARIMA is an extension of the ARIMA model that accounts for the seasonal characteristics of a time series [26]. The general structure of the SARIMA model is expressed as SARIMA(p, d, q) × (P, D, Q)s, and its standard form is as follows [27]:
$$\phi_p(B)\,\Phi_P(B^s)\,(1-B)^d\,(1-B^s)^D x_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t$$
In the above equation, $B$ represents the backward shift operator, $\phi_p(B)$ and $\theta_q(B)$ are the nonseasonal autoregressive and moving average polynomials, $\Phi_P(B^s)$ and $\Theta_Q(B^s)$ are their seasonal counterparts, $\varepsilon_t$ denotes the residual at time t with zero mean and constant variance, and $x_t$ is the observed value at time t (t = 1, 2, …, k). In SARIMA(p, d, q) × (P, D, Q)s, s is the length of the seasonal period, and p, P, d, D, q and Q are the autoregressive order, seasonal autoregressive order, number of differences, number of seasonal differences, moving average order and seasonal moving average order, respectively [27]. The values of the six parameters in the SARIMA model are determined from the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the sequence. The Akaike information criterion (AIC) and Schwarz Bayesian criterion (BIC) are two indexes for model selection. A smaller AIC indicates a better-fitting model [26].
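As a minimal sketch of what the differencing orders d and D do, the Python snippet below applies a natural logarithm, one regular difference (d = 1) and one seasonal difference (D = 1, s = 12), which is the transformation used later in this study. The function names and the synthetic series are illustrative assumptions, not the authors' code or data (the study itself used R).

```python
import math


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both terms exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def log_diff_seasonal(series, season=12):
    """Apply ln, then (1 - B), then (1 - B^season) to a strictly positive series."""
    logged = [math.log(x) for x in series]
    return difference(difference(logged, 1), season)


# Hypothetical placeholder data (NOT the study's incidence values):
# a fixed 12-month pattern multiplied by a smooth exponential trend.
pattern = [1.0, 0.8, 1.1, 1.5, 1.8, 1.9, 1.6, 1.0, 0.9, 1.0, 1.2, 1.3]
series = [pattern[t % 12] * (1.01 ** t) for t in range(168)]

stationary = log_diff_seasonal(series)

# 168 training months lose 1 value to the regular difference
# and 12 more to the seasonal difference.
print(len(stationary))
```

Because the synthetic series contains only a deterministic trend and a fixed seasonal pattern, both are removed and the transformed values are numerically zero; a real incidence series would leave a stationary residual to be modelled by the AR and MA terms.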
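Finally, two indexes were used to compare the fitting effect. The formulas for RMSE and MAPE are [28]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(x_t - \hat{x}_t\right)^2}$$
$$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\frac{\left|x_t - \hat{x}_t\right|}{x_t} \times 100\%$$
In the above equations, $x_t$ is the actual incidence value, $\hat{x}_t$ is the estimated incidence value, and n is the number of months for forecasting. The lower the RMSE and MAPE values, the better the fit.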
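The two error measures can be sketched in a few lines of Python. This is an illustrative implementation of the standard definitions, with hypothetical input values that are not taken from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(actual)


# Hypothetical monthly incidence values, for illustration only.
actual = [2.0, 4.0, 5.0]
predicted = [1.0, 5.0, 5.0]

print(rmse(actual, predicted))
print(mape(actual, predicted))
```

Note that MAPE divides by the observed value, so months with very low incidence can dominate the average even when the absolute error is small.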

Statistical analysis
Firstly, we used Excel 2010 to conduct a descriptive analysis of mumps in Chongqing from 2004 to 2018 and to describe the sex, age, and occupational distribution of cases. Secondly, we performed a time series analysis of the mumps incidence sequence. The "stl" function in R 3.5.0 was used to decompose the sequence into its seasonal and trend components. Finally, a SARIMA model was established in R 3.5.0 to predict the incidence of mumps. The incidence of mumps from 2004 to 2017 was used as the training dataset to fit the SARIMA model, which was then used to predict the incidence of mumps in 2018 and to verify the predictive performance. A value of p < 0.05 was considered statistically significant.
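The training and validation split described above can be sketched as follows. The study's analysis was done in R; this Python snippet is only an illustration, and the helper name `split_by_year` and the placeholder series are assumptions rather than the authors' code or data.

```python
# Month labels for January 2004 through December 2018 (15 years x 12 months).
months = [(year, month) for year in range(2004, 2019) for month in range(1, 13)]

# Hypothetical placeholder series; the real monthly incidence would go here.
incidence = [float(i) for i in range(len(months))]


def split_by_year(labels, values, last_training_year):
    """Split a monthly series into training data (up to and including
    last_training_year) and hold-out data (all later months)."""
    train = [v for (y, _), v in zip(labels, values) if y <= last_training_year]
    holdout = [v for (y, _), v in zip(labels, values) if y > last_training_year]
    return train, holdout


train, holdout = split_by_year(months, incidence, 2017)

print(len(train), len(holdout))
```

Splitting by calendar year, rather than at random, preserves the time order, so the 2018 values are never seen during fitting.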

Results
The monthly incidence of mumps is presented in Fig. 1, which shows that incidence was low in February and peaked from April to July. Table 1 shows the five regions with the highest number of mumps cases in Chongqing over the past 15 years. Although the top-ranked regions changed over time, cases were mainly concentrated in the northeast and west of Chongqing. Table 2 shows that 159,181 mumps cases were reported in Chongqing over the 15 years from 2004 to 2018. The cases included 93,655 males and 65,526 females (male-to-female ratio of 1.43:1). Mumps occurred most commonly between the ages of 0 and 19, and the 0-19 age group accounted for 92.41% (n = 147,094) of all reported cases. Students made up the largest proportion of cases, at 62.83% (n = 100,013), followed by scattered children and children in kindergarten.
Figure 2 shows the sequence diagram of mumps incidence; the sequence is nonstationary. The monthly incidence of mumps in Chongqing showed a slightly decreasing trend and a seasonal pattern. Figure 3 shows the decomposition of the sequence, which has obvious seasonality. After a natural logarithm transformation of the original data, a first-order difference and a seasonal difference with a period of 12 were applied to remove nonstationarity. The differenced sequence appeared stationary (Fig. 4), and the augmented Dickey-Fuller (ADF) test confirmed that it was stationary (p < 0.05). Figure 5 shows that the ACF and PACF of the sequence both tailed off. Considering that the values of p, q, P and Q generally do not exceed 2, orders from 0 to 2 were tried. Only five models passed both the model test and the parameter test: SARIMA(1, 1, 1) × (1, 1, 0)12, SARIMA(0, 1, 2) × (1, 1, 0)12, SARIMA(1, 1, 1) × (0, 1, 1)12, SARIMA(2, 1, 1) × (0, 1, 1)12, and SARIMA(1, 1, 2) × (0, 1, 1)12. The AIC and BIC values and the two error indicators of the five models are compared in Table 3.
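To illustrate how AIC and BIC rank candidate models, the following Python sketch computes both criteria from a log-likelihood and a parameter count. The log-likelihoods, parameter counts and model labels are hypothetical placeholders and do not reproduce Table 3; counting the innovation variance as a parameter is likewise an assumed convention.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Schwarz Bayesian criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_lik


# Effective sample size: 168 training months minus 1 (regular difference)
# minus 12 (seasonal difference with period 12).
n_eff = 14 * 12 - 1 - 12

# Hypothetical placeholders, NOT values from the study.
# k counts ARMA coefficients plus the innovation variance (assumed convention).
candidates = {
    "model_A": {"log_lik": -52.0, "k": 4},
    "model_B": {"log_lik": -50.5, "k": 5},
}

scores = {
    name: (aic(c["log_lik"], c["k"]), bic(c["log_lik"], c["k"], n_eff))
    for name, c in candidates.items()
}

best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(best_by_aic, best_by_bic)
```

With these placeholder numbers the two criteria disagree, because BIC penalizes the extra parameter more heavily than AIC; this is one reason to report both alongside out-of-sample errors such as RMSE and MAPE.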
The SARIMA(2, 1, 1) × (0, 1, 1)12 model was selected as the best one. Table 4 shows the estimates and standard errors of the model parameters and their corresponding significance values; the model equation follows from these estimates. The SARIMA(2, 1, 1) × (0, 1, 1)12 model was used to forecast the incidence of mumps in 2018. Table 5 shows the predicted values; the RMSE and MAPE were 0.9950 and 39.8396%, respectively. The actual monthly incidence and the incidence fitted by the SARIMA model are shown in Fig. 6. Figure 6 and Table 5 show that the trend and epidemic pattern of the predicted incidence are close to those of the actual incidence of mumps.
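Because the model was fitted to log-transformed data, forecasts and their intervals have to be transformed back before they are compared with the observed incidence. The Python sketch below shows one common way to do this and to check whether the intervals cover the observations. It assumes normally distributed forecast errors on the log scale and uses hypothetical numbers; it is not the authors' procedure or output.

```python
import math


def back_transform_interval(log_mean, log_se, z=1.96):
    """Map a log-scale forecast and its standard error to (lower, point, upper)
    on the original scale. The point value exp(log_mean) is the median of the
    implied log-normal forecast distribution, not its mean."""
    lower = math.exp(log_mean - z * log_se)
    point = math.exp(log_mean)
    upper = math.exp(log_mean + z * log_se)
    return lower, point, upper


def all_covered(actual, intervals):
    """True if every observed value lies within its (lower, point, upper) interval."""
    return all(lo <= a <= hi for a, (lo, _, hi) in zip(actual, intervals))


# Hypothetical log-scale forecasts (mean, standard error), for illustration only.
log_forecasts = [(0.0, 0.5), (0.7, 0.6)]
intervals = [back_transform_interval(m, se) for m, se in log_forecasts]

# Hypothetical observed values on the original scale.
observed = [1.2, 1.5]
print(all_covered(observed, intervals))
```

Intervals obtained this way are asymmetric around the point forecast, which is why a wide upper bound can coexist with a fairly large MAPE.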

Discussion
In this study, we found that the annual incidence of mumps decreased significantly from 2004 to 2007 and increased from 2007 to 2011. The lowest incidence was in 2007 (17.5463/100,000) and the highest was in 2011 (59.1471/100,000). Among the 159,181 reported cases, there were 1.43 times as many males as females. These findings are consistent with the relevant literature [29]. In terms of age and occupation, the largest proportions of mumps cases were in the 0-19 age group (92.41%) and among students (62.83%), indicating that children and students are the main targets for protection. In addition, this study shows that the western and northeastern parts of Chongqing are high-incidence areas (Table 1). Therefore, the government should strengthen mumps prevention and control measures in key areas and populations.
We decomposed the mumps incidence sequence and found an obvious trend and seasonality: the monthly incidence of mumps was low in February and peaked from April to July, which is consistent with previous studies [10,30]. Therefore, interventions are necessary to reduce the spread of infectious diseases in public transportation from April to July.
The results of this study indicate that SARIMA(2, 1, 1) × (0, 1, 1)12 is the best predictive model. Figure 6 shows that the 95% CIs of the forecast values contain all of the observed data, indicating a good match between the observed and fitted values. The incidence of mumps in 2018 peaked from April to June and showed a decreasing trend after June, owing to reduced contact among students during the summer vacation. Thus far, in terms of infectious diseases, the SARIMA model has performed well in predicting hand, foot, and mouth disease [31] and tuberculosis [32]. This study is the first to use the SARIMA model to analyze the incidence of mumps in Chongqing. The SARIMA(2, 1, 1) × (0, 1, 1)12 model reflects the incidence of mumps in Chongqing well and has a good short-term predictive effect.
Fig. 4 Sequence diagram after a 1-step difference and seasonal difference with a period of 12
Compared with the ARIMA model, the SARIMA model adds a seasonal effect and is suitable for analyzing sequences with obvious seasonality and periodicity [33], and most epidemiological data are seasonal and periodic [32,34-36]. Compared with other time series analysis methods, the SARIMA model generally uses the logarithm transformation, differencing and seasonal differencing [37,38] to make the series stationary, without the need for complicated conversion or variable substitution [39,40]. In addition, related research shows that among various time series analysis methods, ARIMA is a useful tool for interpreting surveillance data for disease prevention and control [41,42].
This study has several limitations. We made only short-term predictions, and the model did not account for the various factors that affect the incidence of mumps; these factors should be considered to establish a long-term and stable prediction model. In future research, hybrid models, such as the SARIMA-NAR hybrid model [43], SARIMA-NARNNX hybrid model [44], and SARIMA-NARX hybrid model [45], can be used to analyze or predict diseases.

Conclusions
The results of this study suggest that applying SARIMA time series models to forecast the incidence of mumps is feasible. The confidence intervals of the predicted values contain all the actual values. Thus, the SARIMA(2, 1, 1) × (0, 1, 1)12 model may be used to predict the incidence of mumps. The short-term prediction of mumps is effective, which is helpful for the evaluation of prevention or control measures. Meanwhile, timely and effective countermeasures can be adopted for epidemic peaks that may occur. For instance, the government can strengthen publicity on mumps prevention and control knowledge to improve the public's awareness of the disease.