Modeling the effect of age on quantiles of the incubation period distribution of COVID-19

Background The novel coronavirus SARS-CoV-2, which causes coronavirus disease 2019 (COVID-19), has seriously affected many aspects of social life throughout the world since the first case of pneumonia with unknown etiology was identified in Wuhan, Hubei province in China in December 2019. The incubation period distribution is key to the prevention and control of COVID-19. This study aimed to investigate the conditional distribution of the incubation period of COVID-19 given the age of infected cases and to estimate its corresponding quantiles from the information of 2172 confirmed cases from 29 provinces outside Hubei in China. Methods We collected data on the infection dates, onset dates, and ages of the confirmed cases through February 16th, 2020. All the data were downloaded from the official websites of the health commission. As the epidemic was still ongoing at the time we collected data, the observations were subject to biased sampling. To address this issue, we developed a new maximum likelihood method, which enables us to comprehensively study the effect of age on the incubation period. Results Based on the collected data, we found that the conditional quantiles of the incubation period distribution of COVID-19 vary by age. In detail, the high conditional quantiles of people in the middle age group are shorter than those of others, while the low quantiles did not show the same differences. We estimated that the 0.95-th quantile for people in the age group 23∼55 is less than 15 days. Conclusions Since the conditional quantiles vary across age, we may take more precise measures for people of different ages. For example, we may consider an age-dependent quarantine duration in practice, rather than a uniform 14-day quarantine period. Notably, we may need to extend the current quarantine duration for people aged 0∼22 and over 55, because the related 0.95-th quantiles are much greater than 14 days.
Supplementary Information The online version contains supplementary material available at (10.1186/s12889-021-11761-1).


Background
In December 2019, some cases of pneumonia with unknown etiology were identified in Wuhan, Hubei province in China. After being investigated by the National Coronavirus Research Group, this pneumonia was identified as caused by a new coronavirus (2019-nCoV). The World Health Organization (WHO) has named this disease COVID-19, standing for "2019 coronavirus disease" [1].
It turns out that the novel coronavirus, like SARS-CoV, belongs to the coronaviruses of the order Nidovirales and is the seventh coronavirus known to infect humans [2], but COVID-19 has a shorter serial interval than SARS [3] and higher transmissibility than MERS in the Middle East countries [4]. It is highly infectious [5] and contagious even during the incubation period [6]. It can cause severe symptoms or even death [7]. The novel coronavirus not only threatens cities in China [8], but has also spread rapidly worldwide. Hence, it is important to take the necessary measures to prevent and control it as quickly as possible.
In prevention and control efforts, it is well known that the incubation period distribution plays an important role. Knowledge of this distribution can help to model the size of the epidemic mathematically [8], to predict when an outbreak will occur, and to determine the efficacy of medical interventions [9]. The pioneering work on deriving the incubation period distribution was conducted by Philip Sartwell in 1950 [10]. After that, the lognormal distribution was widely used to model the incubation period distribution of infectious diseases, and many authors studied the incubation period distributions of various other diseases. Some other distributions, e.g., the Gamma distribution and the Weibull distribution, were also suggested to fit the observed incubation periods; see for example [9, 11-13].
In the literature, Li et al. [14] first studied the incubation period distribution of COVID-19 based on the first 10 cases observed in Hubei province in China. Relying on their estimate, Li et al. [14] suggested a 14-day medical observation period or quarantine for exposed persons. Guan et al. [15] reported a median incubation period of 3.0 days (range 0 to 24.0 days) for 1099 patients from 552 hospitals in 31 provinces/provincial municipalities through January 29th, 2020. Backer et al. [16] updated this distribution based on the reported travel histories and symptom onset dates of 88 travelers from Wuhan with confirmed 2019-nCoV infection in the early outbreak phase, and estimated that the 97.5th percentile of the incubation period distribution of COVID-19 is 11.1 days. Linton et al. [17] further considered the biased sampling issue and found that the estimated 0.95-th quantile is greater than 14 days.
However, none of the literature above investigates how the incubation period of COVID-19 is distributed over people of different ages. Based on 2172 confirmed cases collected outside Hubei province in China, a simple ANOVA indicates that the age of confirmed cases has a significant effect on the incubation period of COVID-19 [18]. This motivates us to estimate the incubation period distribution conditional on age. Note that the collected data are subject to biased sampling because the COVID-19 epidemic was still ongoing in China on February 16th, 2020. The current study differs from [19-22], which investigated the relationship between age and the incubation period of AIDS but did not address the biased sampling issue.
In this study, we developed the conditional quantiles model of the incubation period of COVID-19 on the age of infected cases and provided the estimating method in detail. The main results were calculated based on the collected data, and the conclusion was presented accordingly.

Methods
In this section, we provide a summary of the collected data, and introduce the estimation method according to the main characteristics of the collected data.
The data set was taken from the websites of the health commission and the daily public reports on COVID-19 in 29 provinces outside Hubei province through February 16th, 2020. It consists of 2172 confirmed cases with four variables, i.e., gender, age, onset time, and infection time. The incubation period is calculated by using the formula "Incubation Period = Onset date − Infection date + 1". The unit of time is the day throughout this paper. Among these 2172 cases, there are more cases from Zhejiang, Henan and Anhui than from the other provinces because of the large number of confirmed cases in these provinces. An additional file shows the details of the data (see Additional file 1). Figure 1 reports the scatter plot of the incubation period of COVID-19 versus the age of confirmed cases.
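As a small illustration of the formula above, the following Python sketch computes the incubation period from two calendar dates. The function name and the example dates are hypothetical and are not taken from the data set.

```python
from datetime import date


def incubation_period(infection: date, onset: date) -> int:
    """Incubation period in days: onset date - infection date + 1."""
    if onset < infection:
        raise ValueError("onset date precedes infection date")
    return (onset - infection).days + 1


# Hypothetical case: infected on 20 January 2020, onset on 25 January 2020.
print(incubation_period(date(2020, 1, 20), date(2020, 1, 25)))
```

With the "+ 1" convention, a case whose onset falls on the day of infection has an incubation period of one day rather than zero.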
We conduct a preliminary one-way ANOVA study on the incubation period of COVID-19 over four age groups, i.e., 0∼17, 18∼40, 41∼65, and over 65, and find that the age of confirmed cases has a significant effect on the incubation period. Hence, we further investigate the incubation period distribution of COVID-19 conditional on age as follows.
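The one-way ANOVA mentioned above compares the variation between group means with the variation within groups. The sketch below computes the F statistic with the Python standard library only; the four small groups are made-up numbers standing in for the four age groups, not the study data, and the p-value step (which needs the F distribution) is left out.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared deviations from each group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up incubation periods (days) for four hypothetical age groups.
toy_groups = [[8, 10, 12], [4, 5, 6], [5, 6, 7], [9, 11, 13]]
print(one_way_anova_f(toy_groups))
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom indicates that the group means differ.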
Note that the Weibull distribution fits the data set well, and the mean incubation period varies over people of different ages. We propose to model the relationship between the true incubation period, say $T$, and the age, say $X$, through the conditional distribution $G(t \mid \lambda(X), \eta)$ given $X$. The related density function is specified as follows:

$$g(t \mid \lambda(X), \eta) = \frac{\eta}{\lambda(X)} \left(\frac{t}{\lambda(X)}\right)^{\eta - 1} \exp\left\{-\left(\frac{t}{\lambda(X)}\right)^{\eta}\right\} I(t > 0),$$

where $I(\cdot)$ denotes the indicator function, $\eta > 0$, and $\lambda(X) = \lambda_3(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 > 0$.

The reasons for using this form of conditional distribution are as follows: (i) the conditional mean of $T$ takes the form $E(T \mid X) = \lambda(X)\,\Gamma(1 + 1/\eta)$, which implies that the age $X$ affects $E(T \mid X)$ through $\lambda(X)$; (ii) $\lambda_3(x)$ is flexible enough to characterize the trend of $E(T \mid X)$ over $X$. Note that $\lambda_3(x)$ includes $\beta_0$, $\beta_0 + \beta_1 x$, and $\beta_0 + \beta_1 x + \beta_2 x^2$ as special cases. Here $\Gamma(\cdot)$ denotes the gamma function.

Furthermore, the empirical result shown in Fig. 2 indicates that one may model the distribution of the age $X$ by a normal distribution. Write its density as $\varphi(x; \mu, \sigma^2)$. Then, a natural idea is to estimate the conditional distribution based on $\{T_j, X_j\}_{j=1}^{m}$ by maximizing the likelihood function

$$L_T(\beta, \eta, \mu, \sigma^2) = \prod_{j=1}^{m} g(T_j \mid \lambda(X_j), \eta)\, \varphi(X_j; \mu, \sigma^2), \tag{1}$$

where $\beta = (\beta_0, \beta_1, \beta_2, \beta_3)$.

Unfortunately, the incubation period of some infected cases cannot be fully observed while the COVID-19 epidemic is still ongoing. The observed incubation period of COVID-19, say $Y$, is subject to biased sampling. That is, $Y$ observed at some fixed time $t^*$ is not the same as the true incubation period $T$. This is because, for a case infected at time $t_S$, we can observe the incubation period $Y = T$ at time $t^*$ only if $T \in (0, \Delta]$, where $\Delta = t^* - t_S$. This implies that the distribution of $Y$ is in fact the conditional distribution of $T$ given the random event $\{T \in (0, \Delta]\}$. That is, we have

$$F(y \mid X) = P(T \le y \mid T \in (0, \Delta], X) = \frac{G(y \mid \lambda(X), \eta)}{G(\Delta \mid \lambda(X), \eta)}, \quad 0 < y \le \Delta. \tag{2}$$

Denote the collected samples as $\{(Y_i, X_i, \Delta_i)\}_{i=1}^{n}$, where the $Y_i$'s denote the observed incubation periods, the $X_i$'s the ages, and $\Delta_i = t^* - t_{S,i}$ the difference between the infection time of the $i$-th case and the observation time $t^*$, i.e., February 16th, 2020.
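The quantities in this paragraph can be sketched numerically. The Python snippet below assumes the Weibull form with scale λ(x) and shape η (so that the conditional mean is λ(x)Γ(1 + 1/η)), and treats the observed incubation period as the true one restricted to the window t* − t_S. The function names and all parameter values are hypothetical choices for illustration, not estimates from the paper.

```python
import math


def lam(x, beta):
    """Cubic scale function b0 + b1*x + b2*x^2 + b3*x^3."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * x + b2 * x ** 2 + b3 * x ** 3


def weibull_cdf(t, scale, shape):
    """Weibull distribution function with the given scale and shape."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0


def conditional_mean(x, beta, shape):
    """E(T | X = x) = lam(x) * Gamma(1 + 1/shape)."""
    return lam(x, beta) * math.gamma(1.0 + 1.0 / shape)


def truncated_cdf(y, window, scale, shape):
    """P(T <= y | 0 < T <= window): distribution of the observed period."""
    if y <= 0:
        return 0.0
    if y >= window:
        return 1.0
    return weibull_cdf(y, scale, shape) / weibull_cdf(window, scale, shape)


# Hypothetical parameters, for illustration only.
beta = (10.0, -0.1, 0.002, 0.0)
shape = 2.0
scale_40 = lam(40, beta)
print(conditional_mean(40, beta, shape))
print(weibull_cdf(5.0, scale_40, shape), truncated_cdf(5.0, 10.0, scale_40, shape))
```

For any y inside the window, the truncated distribution function is at least as large as the untruncated one, which is why observed incubation periods tend to look shorter than the true ones.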
Since the number of infected cases did not grow exponentially through February 16th, 2020 (see Fig. 3), it is unreasonable to reuse the likelihood function developed in [17] in this paper. Fortunately, cases were infected almost every day throughout the data collection period. Hence, it is reasonable to assume that the differences $\Delta_i = t^* - t_{S,i}$ are non-random. Furthermore, note that $(Y_i, X_i)$ are independent of the number of cases infected on each day $t_{S,i}$. Then, after obtaining $F(y \mid X)$ in (2), we propose to use the likelihood function (3), denoted by $\mathcal{L}_Y(\beta, \eta, \mu, \sigma^2)$. Note that $E \log(L_T(\beta, \eta, \mu, \sigma^2)) = E \log(\mathcal{L}_Y(\beta, \eta, \mu, \sigma^2))$ when all samples involved are independent, an assumption that is mild because all observations in this paper were collected nationwide. It is easy to check that the maximum likelihood estimator based on (3) is asymptotically the same as that based on (1), which is consistent and asymptotically normal under some general conditions.
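To illustrate why the observation window matters for likelihood-based estimation, the sketch below compares a naive Weibull log-likelihood with a standard right-truncated one, in which each density term is divided by the probability of falling inside the window. This truncated form is an assumption made here for illustration and is not claimed to be the likelihood (3). The data, the fixed shape, and the grid search are all hypothetical simplifications.

```python
import math


def weibull_logpdf(t, scale, shape):
    """Log density of the Weibull distribution at t > 0."""
    z = t / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape


def weibull_cdf(t, scale, shape):
    return 1.0 - math.exp(-((t / scale) ** shape))


def naive_loglik(ys, scale, shape):
    """Log-likelihood that ignores the observation windows."""
    return sum(weibull_logpdf(y, scale, shape) for y in ys)


def truncated_loglik(ys, windows, scale, shape):
    """Sum of log[g(y_i) / G(window_i)], assuming 0 < y_i <= window_i."""
    return sum(
        weibull_logpdf(y, scale, shape) - math.log(weibull_cdf(w, scale, shape))
        for y, w in zip(ys, windows)
    )


# Made-up observed periods (days) and windows t* - t_S (days).
ys = [3, 5, 6, 8, 4, 7, 9, 5]
windows = [10, 12, 9, 15, 6, 10, 11, 8]
shape = 2.0  # held fixed to keep the search one-dimensional
grid = [4.0 + 0.1 * i for i in range(100)]
naive_scale = max(grid, key=lambda s: naive_loglik(ys, s, shape))
trunc_scale = max(grid, key=lambda s: truncated_loglik(ys, windows, s, shape))
print(naive_scale, trunc_scale)
```

Because the correction term grows with the scale, the truncation-aware estimate of the scale is never smaller than the naive one, which matches the intuition that ignoring the window underestimates the incubation period.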
, which are computationally difficult. Fortunately, it holds that

Since the estimation of $\mu$ and $\sigma^2$ is trivial based on $\{X_i\}_{i=1}^{n}$, we focus on how to estimate $\beta$ and $\eta$ in the sequel. To handle the integral values, we propose to use the EM algorithm as follows. Suppose that we have obtained some initial estimators $\hat{\beta}^{(0)}$, $\hat{\eta}^{(0)}$, which may be easily computed by treating the $Y_i$'s as if they were unbiased.
If it is true, return $\hat{\beta} = \hat{\beta}^{(k+1)}$ and $\hat{\eta} = \hat{\eta}^{(k+1)}$; otherwise, repeat the E-step and M-step until convergence is achieved.
We implemented this algorithm in R, relying on the optimization function constrOptim(). The implementation runs quite fast, and convergence is usually achieved within several iterations. An additional file shows the code in detail (see Additional file 2).

Results
Based on the information of 2172 confirmed cases, we computed the estimated parameters by using the implementation of the EM algorithm mentioned above; see Table 1. It is worth mentioning that $\hat{\beta}_3 = -1.1 \times 10^{-6}$ is very small, which implies that the cubic form $\lambda_3(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$ is flexible enough to characterize the trend of the conditional mean $E(T \mid X)$ over $X$. We do not need to assume a higher-order polynomial for $\lambda(x)$.
Using the results in Table 1, we obtained the conditional 0.05, 0.25, 0.5, 0.75, 0.9, and 0.95-th quantiles of the incubation period distribution of COVID-19 given age; see Fig. 4 for details. Figure 4 indicates that the quantiles for middle-aged people tend to be smaller than those for the others. In particular, the estimated 0.95-th conditional quantile of children and the elderly is clearly greater than that of the middle-aged. In more detail, we list the 0.95-th conditional quantiles of the incubation period distribution of COVID-19 at different ages in Table 2. Table 2 indicates that the 0.95-th conditional quantiles of people in the age group 23∼55 lie between 14 and 15 days, shorter than those of the other groups. We also list the number of cases in each group and the corresponding proportions. It turns out that the infected cases aged 23∼55 account for more than 70% of the total collected cases. Further, note that we collected 136 cases whose incubation periods are greater than 14 days. We provide the distribution of these cases over all age groups. The age group 23∼55 has the smallest proportion of such cases, only 5.11%, while the proportions in the other age groups far exceed 5%.
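Under a Weibull model with scale λ(x) and shape η, the conditional τ-th quantile has the closed form λ(x)(−log(1 − τ))^(1/η), obtained by inverting the distribution function. The Python sketch below evaluates this formula at a few ages. The coefficients and the shape are hypothetical values chosen only to produce a U-shaped curve; they are not the estimates in Table 1.

```python
import math


def lam(x, beta):
    """Cubic scale function b0 + b1*x + b2*x^2 + b3*x^3."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * x + b2 * x ** 2 + b3 * x ** 3


def weibull_quantile(tau, scale, shape):
    """tau-th quantile: solves 1 - exp(-(t/scale)^shape) = tau for t."""
    return scale * (-math.log(1.0 - tau)) ** (1.0 / shape)


# HYPOTHETICAL parameters (not Table 1), chosen to give a U-shape in age.
beta = (12.0, -0.2, 0.0025, 0.0)
shape = 2.0
for age in (10, 40, 70):
    q95 = weibull_quantile(0.95, lam(age, beta), shape)
    print(age, round(q95, 2))
```

Since the quantile is proportional to λ(x) for every τ, the age pattern in λ(x) carries over to all quantiles, with larger absolute differences at higher quantile levels.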
Furthermore, to verify the results above, we divide the observed incubation periods into three groups by age: 0 to 25 years old, 26 to 60 years old, and over 60 years old, according to World Population Prospects: the 2019 Revision. Then we fit the Weibull distribution in each group by setting λ(x) = β 0 . Figure 5 shows that people aged 0 to 25 or over 60 are more likely to have a longer incubation period than people aged 26 to 60. Moreover, the right panel shows the fitted Weibull distribution function of the incubation period of COVID-19 in the three age groups. The 0.95 quantile of people under 26 and over 60 is clearly greater than that of people aged 26 to 60. This roughly coincides with the results reported in Fig. 4, and hence indicates that the conditional distribution considered above can characterize the relationship between age and the true incubation period of COVID-19.

Discussion
Our estimation of the conditional quantiles indicates that the incubation period of COVID-19 varies depending on the age of the infected cases. Precisely, the incubation period of the young and the old tends to be longer than that of the middle-aged people.
This finding appears to be supported by immunological theory. Human immunity refers to the sensitivity of the immune system in response to infection. During the incubation period, the host's immune system has not yet been activated and the body has not begun to show symptoms, so the virus can use this period to replicate extensively. In many situations, the more responsive the host's immune system is to the infection, the shorter the incubation period tends to be. Since human immunity is weak at the beginning of life, improves with age, and declines in old age [23,24], it is perhaps not surprising that the incubation periods of young and old cases are longer than those of middle-aged cases.
Currently, the quarantine duration is fixed at 14 days. It does not take other factors, e.g., age, into account. Hence, our results may be helpful for disease control and prevention efforts, because they enable us to take more precise measures. For example, a personalized quarantine duration can be adopted for individuals of different ages. In particular, people aged 23∼55 play important roles in daily life and are a significant part of the labor force. Besides, they account for the largest proportion of the population. A relatively short quarantine duration for them can not only reduce the burden on medical staff but also benefit socio-economic development. In contrast, the conditional quantiles for ages 0∼22 and over 55 are much greater than 14 days. We may need to extend the quarantine duration for people of these ages. Such an extension may help prevention while having a limited impact on socio-economic development.
It is worth mentioning that there exist other statistical approaches to characterizing the conditional quantiles of the incubation period over age. One is to first model the relationship between $T$ and $X$ by the following linear model:

$$T = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon,$$

and then use the technique of quantile regression to estimate the unknown parameters $\beta_i$, $i = 0, 1, 2, 3$. However, note that the true incubation period $T$ cannot be fully observed, and the observed incubation period $Y$ is stochastically smaller than $T$. The estimated quantiles may therefore suffer from problems such as underestimation.
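Quantile regression estimates the coefficients by minimizing the so-called check (pinball) loss rather than squared error. The Python sketch below writes out this objective for a cubic polynomial in age and shows, on made-up numbers, that with an intercept only the minimizer at level 0.5 is the sample median. The function names and data are hypothetical, and no correction for biased sampling is included.

```python
def check_loss(residual, tau):
    """Pinball loss: residual * (tau - 1{residual < 0})."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def quantile_objective(ts, xs, beta, tau):
    """Total check loss of the cubic fit b0 + b1*x + b2*x^2 + b3*x^3."""
    total = 0.0
    for t, x in zip(ts, xs):
        fitted = beta[0] + beta[1] * x + beta[2] * x ** 2 + beta[3] * x ** 3
        total += check_loss(t - fitted, tau)
    return total


# Made-up incubation periods (days) and ages; intercept-only search at tau = 0.5.
ts = [2, 4, 5, 7, 13]
xs = [30, 45, 52, 18, 71]
best = min(ts, key=lambda c: quantile_objective(ts, xs, (c, 0.0, 0.0, 0.0), 0.5))
print(best)
```

Because the loss weights positive and negative residuals asymmetrically, systematically shortened observations pull the fitted quantiles downward, which is the underestimation concern raised above.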
In fact, we also report in Fig. 6 the result of the ordinary quantile regression mentioned above. Figure 6 shows that the regression quantiles follow a pattern similar to the conditional quantiles reported in Fig. 4. Nevertheless, we also note that the 0.25-th and 0.05-th quantile curves cross each other when the age is greater than 80. It is difficult to give a reasonable explanation for this phenomenon, which may be caused by the biased sampling issue. Hence, we did not use the regression quantiles to analyze the current data, although they provide results similar to the conditional quantiles.

Conclusion
In this paper, a model of the age effect on quantiles of the incubation period distribution of COVID-19 was proposed to explore the influence of age on the incubation period of COVID-19. Based on the collected data, our model showed that the incubation period of COVID-19 varies with the age of the infected cases. Specifically, the incubation period of the young and the old tends to be longer than that of middle-aged people. These findings enable us to take more precise measures rather than fixed ones and thus may be helpful for disease control and prevention efforts. For example, a personalized quarantine duration, namely shorter for middle-aged people and longer for the young and the old, can be adopted for individuals of different ages, instead of a fixed 14 days. Because people aged 23∼55 are a major part of the labor force and account for the largest proportion of the population, such measures may help prevention while having a limited impact on socio-economic development.
There are two major contributions in this paper. First, a relatively comprehensive description of the age effect on the incubation period of COVID-19 is provided. By modeling the conditional quantiles of the incubation period distribution given age, we can obtain any quantile of the incubation period of COVID-19 for people of a given age. In contrast to methods that divide people into different age groups, this offers more information for selecting a more adequate quarantine period. Second, a reasonable likelihood function (3) was proposed to tackle the biased sampling problem when modeling the conditional distribution of the incubation period given age. The proposed likelihood function can largely eliminate the undesirable consequences of biased sampling and lead to a better estimation of the incubation period distribution of COVID-19.
We are aware that other researchers also discussed the relationship between age and the incubation period of COVID-19 while our manuscript was under submission. Some of them indicate that age has no significant effect on the incubation period of COVID-19 [25,26], or that the length of the incubation period and age are positively correlated [27], while others report that age does have an effect [28-30]. However, most existing studies on the effect of age are summary analyses that depend on a few discrete age subgroups. Besides, few of them addressed the biased sampling issue. In contrast, our current study takes age into account as a covariate. Hence, it can serve a purpose beyond such summary studies.
Furthermore, although 2172 confirmed cases were included in the study, the data only covered confirmed cases through February 16th, 2020 in 29 provinces outside Hubei province in China. Since COVID-19 has spread worldwide, the effect pattern of age on the incubation period in different countries and regions needs further verification. The effect of other factors, e.g., climatic conditions and air pollution status, on the incubation period of COVID-19 is also of great interest and worthy of further study. However, it is beyond the scope of the current paper, and we will pursue it in the future.