Spatiotemporal characteristics and the epidemiology of tuberculosis in China from 2004 to 2017 by the nationwide surveillance system

Background China has always been one of the countries with the most serious Tuberculosis epidemic in the world. Our study was to observe the Spatial-temporal characteristics and the epidemiology of Tuberculosis in China from 2004 to 2017 with Joinpoint regression analysis, Seasonal Autoregressive integrated moving average (SARIMA) model, geographic cluster, and multivariate time series model. Methods The data of TB from January 2004 to December 2017 were obtained from the notifiable infectious disease reporting system supplied by the Chinese Center for Disease Control and Prevention. The incidence trend of TB was observed by the Joinpoint regression analysis. The Seasonal autoregressive integrated moving average (SARIMA) model was used to predict the monthly incidence. Geographic clusters was employed to analyze the spatial autocorrelation. The relative importance component of TB was detected by the multivariate time series model. Results We included 13,991,850 TB cases from January 2004 to December 2017, with a yearly average morbidity of 999,417 cases. The final selected model was the 0 Joinpoint model (P = 0.0001) with an annual average percent change (AAPC) of − 3.3 (95% CI: − 4.3 to − 2.2, P < 0.001). A seasonality was observed across the 14 years, and the seasonal peaks were in January and March every year. The best SARIMA model was (0, 1, 1) X (0, 1, 1)12 which can be written as (1-B) (1-B12) Xt = (1–0.42349B) (1–0.43338B12) εt, with a minimum AIC (880.5) and SBC (886.4). The predicted value and the original incidence data of 2017 were well matched. The MSE, RMSE, MAE, and MAPE of the modelling performance were 201.76, 14.2, 8.4 and 0.06, respectively. The provinces with a high incidence were located in the northwest (Xinjiang, Tibet) and south (Guangxi, Guizhou, Hainan) of China. The hotspot of TB transmission was mainly located at southern region of China from 2004 to 2008, including Hainan, Guangxi, Guizhou, and Chongqing, which disappeared in the later years. The autoregressive component had a leading role in the incidence of TB which accounted for 81.5–84.5% of the patients on average. The endemic component was about twice as large in the western provinces as the average while the spatial-temporal component was less important there. Most of the high incidences (> 70 cases per 100,000) were influenced by the autoregressive component for the past 14 years. Conclusion In a word, China still has a high TB incidence. However, the incidence rate of TB was significantly decreasing from 2004 to 2017 in China. Seasonal peaks were in January and March every year. Obvious geographical clusters were observed in Tibet and Xinjiang Province. The relative importance component of TB driving transmission was distinguished from the multivariate time series model. For every provinces over the past 14 years, the autoregressive component played a leading role in the incidence of TB which need us to enhance the early protective implementation.

(Continued from previous page) average while the spatial-temporal component was less important there. Most of the high incidences (> 70 cases per 100,000) were influenced by the autoregressive component for the past 14 years. Conclusion: In a word, China still has a high TB incidence. However, the incidence rate of TB was significantly decreasing from 2004 to 2017 in China. Seasonal peaks were in January and March every year. Obvious geographical clusters were observed in Tibet and Xinjiang Province. The relative importance component of TB driving transmission was distinguished from the multivariate time series model. For every provinces over the past 14 years, the autoregressive component played a leading role in the incidence of TB which need us to enhance the early protective implementation.
Keywords: Tuberculosis, Spatial-temporal, Epidemiology, Multivariate time series model Background Tuberculosis (TB) continues to challenge the international community. It is estimated that there are about 1.7 billion people with potential TB infection, accounting for 23% of the world's population, are at risk of developing TB disease during their lifetime [1]. Moreover, the global burden was estimated by the World Health Organization (WHO) at 10.0 million incident cases in 2017. It is also one of the top 10 causes of death which caused an estimated 1.6 million deaths in 2017, and has killed more people than other infectious diseases in the past few decades [1,2].
China has always been one of the countries with the most serious Tuberculosis epidemic in the world [3][4][5][6]. There were 866,000 patients with infection of TB in China, 2018 [7]. Due to the continuous attention to public health and increasing investment in resources, China's Tuberculosis epidemic has significantly improved in recent years. However, due to the large number of people infected with TB, the epidemic situation of Tuberculosis is still not optimistic, so further long-term research on the incidence of it in China is needed.
Currently, China has conducted five national epidemiological investigations to find the epidemiological characteristics of Tuberculosis. However, the spatiotemporal distributions of Tuberculosis cannot be evaluated continuously, and the survey was unable to measure other important indicators of the severity of the epidemic. The mathematical models may help us better understand the epidemiological characteristics of Tuberculosis. Some of the studies mainly focused on the seasonality impact on the transmission of Tuberculosis [5,8,9], while others focused on the spatial distributions [10,11]. There is no model that assesses the spatiotemporal characteristics and the epidemiology of Tuberculosis among the whole population in China over 14 years.
The aim of this study was to observe the Spatialtemporal characteristics and the epidemiology of Tuberculosis in China from 2004 to 2017. The incidence trend of the TB was observed by the Joinpoint regression analysis. The Seasonal autoregressive integrated moving average (SARIMA) model was used to predict the monthly incidence. Geographic clusters was employed to analyze the spatial autocorrelation. The relative importance component of TB was detected by the multivariate time series model. These models additively divided TB risks into spatiotemporal, autoregressive, and endemic components.

Methods
The data collection Tuberculosis incidence data were extracted from the Chinese Center for Disease Control and Prevention (http://www.phsciencedata.cn/Share/edtShareNew.jsp?id= 39208) in 31 provinces of China from 2004 to 2017. The data were aggregated to 168 monthly counts across the 14 years. Population data came from the website of the statistical yearbook of the National Bureau of Statistics (http:// www.stats.gov.cn/tjsj/ndsj/). The population size was easy to find in the website, and it represented the average population each year.

Joinpoint regression
From 2004 to 2017, the continuous change of the TB incidence trend was analyzed using Joinpoint software. The grid search method was applied to find significant trends, and multiple permutation tests were applied to detect the Joinpoint points for each trend [12][13][14]. The overall time trend was calculated by the annual average rate of change (AAPC). If the final model was 0 Joinpoint model, the average percent change (APC) was considered equal to AAPC. We used the Joinpoint regression model to find the long-term trend of the TB incidence.

Time-series estimation
The SARIMA model was used to predict the future trends in many disease incidences [13,[15][16][17]. In our study, A SARIMA model was applied to predict the incidence of TB epidemics in China. The SARIMA model can be written as the form of (p, d, q) (P, D, Q) [s], which P, D, and Q indicate seasonal SAR terms, seasonal differencing, and seasonal SMA terms, respectively; p, d, and q indicate non-seasonal AR terms, non-seasonal differencing, and non-seasonal MA, respectively; s indicated the seasonal period (s = 12 in our study).
The construction of the SARIMA model can be divided into the following steps. First, an augmented Dickey-Fuller (ADF) test was performed to test the stationary status of time series. Second, model parameters (p, d, q, P, D, and Q) were determined by autocorrelation function (ACF) plot, partial autocorrelation function (PACF) plot, and inverse autocorrelation function (IACF) plot. An alternative SARIMA model was constructed by transforming the parameters of model. Lastly, the Akaike information criterion (AIC) and Schwartz Bayesian Criterion (SBC) were used to determine the fitness of different SARIMA models. An optimal model was considered to have the lowest AIC and SBC values, and the residuals of the final model were tested by the Box-Ljung test to know whether they were time independent. The mean square error (MSE), mean absolute percentage error (MAPE), mean absolute error (MAE), and root mean square error (RMSE) were used to see the predictive validity of the models. We use year 2004-2016 to construct the SARIMA model, and year 2017 to testify the forecast of the model. The SARIMA model is used to forecast the short-term incidence of TB to testify the accuracy of model. We also decompose the monthly data into the overall trend, seasonal trend, and random noise with a goal to identify the truly long-term trend.

Spatial autocorrelation analysis
Spatial analysis was used to identify the clustering regions and observe geographic variation [18,19]. Global Moran's I of reported TB cases was computed to detect the spatial clustering pattern. A Moran's I value is between − 1 and 1, whereas the value near 1 means positive spatial autocorrelation, the value near − 1 means negative spatial autocorrelation, and 0 means random distribution. Local Moran's Index was calculated and a hotspot analysis was performed to determine the location of clusters. Local Moran's Index was applied to determine the spatial autocorrelation, which detects some spatial clusters with similar adjacent features and exception values. When the incidences rate had similar low values or high values, these areas were deemed as having positive autocorrelation (low-low or high-high autocorrelation). If not, they were defined as having a negative autocorrelation (low-high or high-low autocorrelation) [10].

The multivariate time series model
A multivariate time-series model for disease counts Y i,t during periods t = 1, T from units i = 1, I was first established by Held et al [20] and was extended and applied in some papers [21][22][23]. The Y i,t denoted the number of TB cases which were considered to be a negative binomial distribution Yit|Y i,t-1~N egBin (u it , ψ), with an additively decomposed mean: Where ψ is an over-dispersed parameter that the conditional variance of Y i,t is μ it ð1 þ ψu it Þ . υ it ℯ it is the endemic component, and the autoregressive component λ it Y i,t-1 reflects the patient numbers at previous time. The spatiotemporal component ϕ it ∑j ≠ iwjiYj, t − 1 reflects the transmission among different units. Each parameter υ it, λ it , and ϕ it follow the form of log-linear: and b i (ϕ) are random effects accounting for heterogeneity among different regions. The endemic υ it contains a sinusoidal frequency wave (w s = 2π/12 for monthly data), and S is the seasonal parameters. The population fraction e i can be used as a multiplicative offset for the regional specific measure for the incidence of infectious disease.
The weights ω ji describe the transmission from district j to district I. Considering that most regions are very large, higher-order neighbourhood are not that relevant as we only constructed our model with first-order neighbourhood. The score rule of the Dawid-Sebastiani score ("dss") was applied to identify the optimal model with random effects. The optimal model corresponds to lower scores with better predictions [22,24]. All the multivariate time analysis used the R package Surveillance.

Statistical analysis
The incidence trend of TB from 2004 to 2017 was observed by the Joinpoint software (version 4.7.0.0). The Seasonal autoregressive integrated moving average (SARIMA) model was used to predict the monthly incidence of TB by SAS9.4 (SAS Institute Inc., Cary, NC). Geographic clusters was employed to analyze the spatial autocorrelation with ArcGIS software (version 10.2, ESRI Inc.; Redlands, CA, USA). The relative importance component of TB was detected by the multivariate time series model with R software (version 3.6.0, package = surveillance). P value < 0.05 was considered as statistically significant for all the tests.

Time trends, seasonal characteristics of the TB incidence
We included 13,991,850 TB cases from January 2004 to December 2017, with a yearly average morbidity of 999, 417 cases. A fluctuant reduction was seen from 74.57 (/100,000) cases in 2004 to 60.08 (100,000) cases in 2017, with the highest incidence of 96.30 (100,000) cases in 2005. The final model was the 0 Joinpoint model (P = 0.18). The annual average percent change (AAPC) was − 3.3 (95% CI: − 4.3 to − 2.2, P < 0.001) from 2004 to 2017, indicating a downward trend in the TB incidence (Fig. 1).
The occurrence of TB with obvious seasonality was observed in the past 14 years (Fig. 2), and the seasonal cycle kept on fluctuating within 12 months. There were two incidence peaks in January and March every year, with a burst From December of the previous year to January of the following year.
The null hypothesis of white noise was strongly rejected with the results of the white noise test (χ 2 = 131.98, DF = 6, P < 0.0001), which can extract some useful information from the time series. Although the null hypothesis was significant (Tau = − 3.91, P = 0.003, lag = 1) for the augmented Dickey-Fuller (ADF) test, we should make a seasonal difference taking account of the fluctuation of the incidence figure. We performed a seasonal differencing to make sure that the transformed TB incidence was stationary (Tau = − 7.6, P < 0.0001, lag = 1) to better construct the SARIMA model (Fig. 3). Based on the figures of PACF, ACF, and IACF, the best ARIMA model was (0, 1, 1) X (0, 1, 1) 12 which can be written as (1-B) (1-B 12 ) X t = (1-0.42349B) (1-0.43338B 12 ) ε t , with a minimum AIC (880.5) and SBC (886.4). There was no significant correlation between residuals (lag = 6, χ 2 = 3.65, DF = 3, P = 0.45), and the residual was a white noise. We then did an incidence forecast of 2017 shown in Fig. 2, the predicted and actual incidence were shown in Table 1. The predicted value and the original incidence data of 2017 were well matched. The mean square error (MSE), mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE) of the modelling performance were 201.76, 0.06, 14.2, and 8.4 respectively. The time series can divide into three components: seasonal effect, trend curve, and irregular noise. The seasonal effect refers to the fluctuations of the trend that is reproduced in a similar way every year, the trend curve is the long-term movement of the time series, and the irregular noise is the surplus component after trend curve and seasonal effect are removed. After eliminating the influence of seasonal effect and irregular noise on TB, the incidence curve of TB became smoother (Fig. 2), and it was found that the trend of the incidence from 2004 to 2016 was gradually decreasing.

Spatial clustering distribution and geographic characteristics
The TB cases were reported in every province of China from 2004 to 2017, with the lowest incidence of 19.52(/  (Fig. 4). The provinces with a high incidence were located in the northwest (Xinjiang, Tibet) and south (Guangxi, Guizhou, Hainan) of China. Based on the global autocorrelation analysis, the distribution of TB was spatially correlated from 2004 to 2017 ( Table 2). The Moran's index range from 0.28 to 0.36, and had the highest index in 2011(Moran's index = 0.36, Z-score = 5.51, P < 0.001). According to the local Moran's I autocorrelation results, it was found that there were totally 35 high-high clusters and 1 high-low cluster from 2004 to 2017 (Table 3), with 4, 3, 3, 3, 5, 2, 2, 2, 2, 2, 2, 2, 2, and 2 clusters each year. The hotspot of TB transmission was mainly located at southern region of China from 2004 to 2008, including Hainan, Guangxi, Guizhou, and Chongqing, which disappeared in the later years. It should be noted that the center of the high-high clusters moved from the East to the Northwest (Xinjiang and Tibet) after 2008, and Tibet was a high-low cluster in 2008 (Fig. 5).

Multivariate time series analysis
Two models following negative binomial distribution and the Poisson distribution constructed by the monthly data from 2004 to 2017were built in the first step, and the AIC of the two models were 72,247.32 and 260, 511.37, which meant the better distribution of the model would be the negative binomial distribution. Second, we included the random effects of the model, and found that the random effects model (0.20) introduced by DSS rule was better than the negative binomial distribution model (2.06). Considering that most regions are very large, higher-order neighbourhood are not that relevant when we only construct the model with first-order neighbourhood.
In order to classify the spatial-temporal effect of the TB, the relative importance of the model components by province, with an average of 14 years is shown in Fig. 6. The autoregressive component had a leading role in the incidence of TB which accounted for 81.5-84.5% of the Fig. 2 The actual and seasonal-adjusted incidence of TB in China, from January 2004 to December 2017 at monthly intervals. The blue line is the original incidence, the red line is the seasonal-adjusted incidence, and the green line is the trend line patients across all provinces on average (Fig. 6b). The endemic component was about twice as large in the western provinces as the average while the spatialtemporal component was less important there ( Fig. 6a/ c). It should be noted that some economic circles, such as the Yangtze River Delta economic circle (Zhejiang, Jiangsu, and shanghai), Pearl River Delta economic circle (Guangxi and Guangdong), Bohai Economic Rim (Hebei, Tianjin, Beijing, and Shanxi) and Hanjiang ecological economic belt (Henan and Hubei), had higher proportions of the spatial-temporal component (especially in Beijing), whereas there was very little spatial correlation in the western provinces.
An intuitive method to quantify the relative contributions of the high incidence regions (> 70 cases per 100, 000 persons over 14 years) of the three components is provided by Fig. 7. In general, most of the high incidences were mainly affected by the autoregressive component for the past 14 years. There was clear seasonality with two incidence peaks in January and March every year, with a burst From December of the previous year to January of the following year. Guangxi, Heilongjiang, Hubei, Guangdong and Hainan were partly affected by the spatial-temporal component, while the rest of the high incidence provinces had nearly no associations with the spatial-temporal effect.

Discussion
According to our research, there were 13,991,850 TB cases from January 2004 to December 2017, with a yearly average morbidity of 999,417 cases which was a huge burden for the public health of China. Understanding the epidemiology   [25,26], which was also found in other countries [27,28]. Consistent with previous research [9], we found two peaks in January and March every year for TB incidence in China, with close numbers in these two peaks. The low number of confirmed cases in February may probably attribute to the Chinese traditional Spring Festival  holiday. The average time from disease onset to confirmation of the diagnosis was 72 days when some infected persons develop active Tuberculosis, and patients were most likely to be diagnosed 2-3 months after symptom onset [8]. So, we should enhance patient control and the prevention of susceptible population in the autumn and winter, and the detection of TB in spring.
For a long time, the hotspots were distributed in the northwest areas such as Xinjiang Province and Tibet. Xinjiang Province has been at a high incidence level in the 14 years, while the incidence in Tibet increased since 2012. Except for 2004, 2010, and 2014, Guizhou Province has been at a high incidence level in the later 11 years. Some provinces such as Hainan, Guangxi, and Chongqing were at a high incidence level before 2009, but have been at a low level since 2009. More attention is needed in these high incidence areas, especially in Xinjiang, Tibet and Guizhou, which may need more financial assistance. It should be noted that some High-High spatial autocorrelation including Hainan, Guizhou, Guangxi, and Chongqing Province have disappeared since 2009, while Xinjiang Province and Tibet have become new H-H regional areas since 2009. The possible explanation is the unbalanced economic development in these areas [29]. Some studies [30][31][32] have demonstrated that there has positive correlations between the poverty level of regions, families or individuals and the incidence of Tuberculosis. At the average level of the province component over the 14 years, autoregressive components dominated all . The HH is the high-high spatial autocorrelation, the HL is the high-low spatial autocorrelation, the LH is the low-high spatial autocorrelation, and the LL is the low-low spatial autocorrelation in Beijing but become infected with TB in their hometowns should stay at home before anti-Tuberculosis treatment and maintain the treatment for a couple of weeks, avoiding going to public places or having close contact with others. We also did an analysis for the provinces with a high incidence (> 70 cases per 100,000 over 14 years) of the three components. For the autoregressive component which dominated all the high Fig. 7 Fitted components in the multivariate time series model for the 12 counties with more than 70 cases during the past fourteen years. The black dots represent the monthly counts of incidence, the light grey area shows the endemic component, the blue area shows the autoregressive component, and the yellow area corresponds to the spatiotemporal component incidence provinces, early protective implementation 2-3 months ahead of the peak could help us reduce the number of TB patients [8]. For the endemic parts, most infected patients could be explained by living conditions, ecological and climatological changes, and socioeconomic activities. Active treatment for TB patients and cutting off the pathway of transmission may be the most effective way to prevent TB [8,33]. Another important method is increasing the public awareness, especially among old people and children, and enhancing their physical exercise, immunity, and general hygiene. In addition, the spatial-temporal component can also affect the transmission of TB. Guangxi and Guangdong Provinces, which are in the south-east coastal area, were partly influenced by the spatial-temporal component, indicating that these regions may have imported TB from adjacent country with high incidence such as Philippines [34,35] or the neighbouring province Guizhou. Alarmingly, although there was no clear evidence that Tibet and Xinjiang had a high value of spatial-temporal component, we still need to pay attention to transmission from India [33,34] which was ranked first in global TB patients.
Our study had several limitation. First, the monthly data from 2004 to 2017 did not collect some risk factors including socioeconomic status, climatic factors, gender, age, and human activities. The relationship between the incidence of TB and these factors was still unknown. These factors should be included in the future studies in order to get an accurate multivariate time series model. Second, we included TB patients reported from the passive surveillance system which inevitably underestimated the total number of TB cases. Further researches could consider the level of reporting, including some subclinical and mild individuals not accessing healthcare. Lastly, the level of diagnosis in some provinces can lead to an underestimation of the TB incidence. We should think over the diagnostic level in the future studies to correct the incidence.

Conclusion
In conclusion, China still has a high TB incidence. However, the incidence rate of TB was significantly decreasing from 2004 to 2017 in China. Seasonal peaks were in January and March every year, with a burst From December of the previous year to January. Obvious geographical clusters were observed in Tibet and Xinjiang Province. The relative importance component of TB driving transmission was distinguished from the multivariate time series model. For every provinces over the past 14 years, the autoregressive component played a leading role in the incidence of TB which need us to enhance the early protective implementation.