Data sources and data extraction
Data on individual COVID-19 cases and their source transmission clusters were obtained from publicly available sources, mainly the websites of provincial and municipal health commissions in China and the Chinese Center for Disease Control and Prevention (China CDC), or through internet searches using the Chinese keywords (“coronavirus” OR “pneumonia”) AND (province and city names). For each identified COVID-19 case with clear epidemiological survey information, we extracted basic demographic characteristics (age, sex, type of residence, city of residence), starting and ending dates of probable exposure, date of symptom onset (fever, respiratory symptoms, myalgia, etc.), date of diagnosis, date of discharge, and infection route (case contact in a public place or workplace, travel to Hubei Province, and/or household contact). The related epidemiological clusters were identified, and their epidemiological data were likewise extracted where available. Two researchers independently reviewed the information on each case and entered the data into a standardized reporting sheet to establish a database. Discrepancies were resolved by discussion between the two researchers, facilitated by a third senior researcher, to reach a consensus [14, 15].
Individual data on occupation, underlying diseases, and clinical severity were additionally collected from epidemiological investigation reports of COVID-19 cases provided by China CDC. These data were matched to the publicly available dataset by city, age, sex, reporting date, and other overlapping variables. The pooled de-identified data were used for the subsequent analysis. Suspected cases and asymptomatic cases were excluded from the current study. Cases reported in Hubei Province and cases imported from abroad were also excluded because detailed exposure information was lacking.
Definitions of key variables
(1) Incubation period: for each case i, let \({T}_i^E\) and \({T}_i^S\) be the exposure (infection) and symptom onset dates, respectively. The incubation period is then \({V}_i^{Inc}={T}_i^S-{T}_i^E\). The exact exposure date is usually not directly observed but is bounded by an interval, i.e., \({L}_i\le {T}_i^E\le {U}_i\), so the incubation period is bounded by \({T}_i^S-{U}_i\le {V}_i^{Inc}\le {T}_i^S-{L}_i\). We took the earliest onset of clinical symptoms as the date of symptom onset. The exposure window was determined in one of two ways. First, if a patient had a history of travel to Hubei Province before symptom onset, the starting and ending dates of exposure (\({L}_i\) and \({U}_i\)) were set as the dates of arrival in and departure from Hubei Province, respectively. Second, if a patient was exposed to (a) a confirmed COVID-19 patient, (b) a person who resided in or had traveled to Hubei Province, or (c) a person with known contact with a confirmed COVID-19 case, the starting and ending dates of exposure (\({L}_i\) and \({U}_i\)) were set as the first and last contact dates, respectively.
(2) Cluster: case clusters in our crowdsourced data were identified through contact tracing. All cases determined to be in close contact with each other were defined as a case cluster.
(3) Primary and secondary cases: we used the earliest symptom onset date in each case cluster as the baseline (day 0). A local case with symptom onset on day 0 or 1, or an imported case with symptom onset on days 0–3, was considered a primary case; otherwise, the case was considered a secondary case. All cases that did not belong to any cluster were treated as primary cases.
(4) Transmissibility: the number of secondary cases infected by the primary case within a cluster was used as a measure of the transmissibility of the primary case. If there were multiple primary cases within a cluster, they were treated as having jointly infected the secondary cases in the cluster.
(5) Type of residence: the permanent residence of the case, classified as urban or rural.
(6) Epidemic phase: cases were assigned to the phase “Before Level I response was employed” or “After Level I response was employed” based on their onset date; the date of the Level I response differed across provinces.
(7) Geographical location: Northern, Southern, and Central China were defined according to the latitude of the city where the case was reported. Northern China refers to cities north of 35°N, Central China to cities between 30°N and 35°N, and Southern China to cities south of 30°N.
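The definitions of the incubation interval and of primary versus secondary cases can be illustrated with a minimal Python sketch. This is not the code used in the study (which was carried out in R); the function names, the dictionary keys `onset` and `imported`, and the clamping of the lower bound at zero are assumptions made for illustration.

```python
from datetime import date


def incubation_bounds(onset, exposure_start, exposure_end):
    """Return (shortest, longest) possible incubation period in days.

    Implements T_S - U <= V_inc <= T_S - L for an exposure window [L, U].
    """
    shortest = (onset - exposure_end).days
    longest = (onset - exposure_start).days
    # Assumption: exposure cannot follow symptom onset, so clamp at 0.
    return max(shortest, 0), longest


def classify_cluster(cases):
    """Label each case in one cluster as 'primary' or 'secondary'.

    `cases` is a list of dicts with hypothetical keys 'onset' (date)
    and 'imported' (bool). Day 0 is the earliest onset in the cluster.
    """
    day0 = min(c["onset"] for c in cases)
    labels = []
    for c in cases:
        day = (c["onset"] - day0).days
        # Imported cases: days 0-3 are primary; local cases: days 0-1.
        window = 3 if c["imported"] else 1
        labels.append("primary" if day <= window else "secondary")
    return labels


bounds = incubation_bounds(date(2020, 2, 1), date(2020, 1, 20), date(2020, 1, 25))
labels = classify_cluster([
    {"onset": date(2020, 1, 20), "imported": False},
    {"onset": date(2020, 1, 21), "imported": False},
    {"onset": date(2020, 1, 23), "imported": True},
    {"onset": date(2020, 1, 25), "imported": False},
])
```

In this example, a case exposed between 20 and 25 January with onset on 1 February has an incubation period between 7 and 12 days, and only the last case in the cluster (onset on day 5) is labeled secondary.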
Estimation of incubation period
For local cases and imported cases for whom both the lower and upper bounds of the incubation period were available, we separately fitted four commonly used distributions for the incubation period (Weibull, Gamma, Loglogistic, and Lognormal) using the R package ‘fitdistrplus’. In addition, the cases were stratified by age, duration from onset to discharge, and infection route, and the incubation period was estimated separately for each stratum. The best-fitting distribution was selected by AIC (Akaike’s Information Criterion) and was used to calculate the median incubation period and its 95% confidence interval. Based on this distribution, we estimated for each case the conditional probability that the incubation period exceeded 14 days given the lower and upper bounds of that case's incubation period, \(P(t>14\mid t>{t}_{lower},t<{t}_{upper})\). We used this probability to randomly classify each case (including interval-censored and right-censored cases) into a prolonged incubation period group (>14 days) or a normal incubation period group (≤14 days). We repeated this process 10,000 times; a case classified into the prolonged group more than 5000 times was assigned to the prolonged incubation period group, and otherwise to the normal incubation period group.
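The conditional probability and the repeated random classification can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the study's R code: the Weibull form and its shape and scale values are hypothetical placeholders rather than fitted estimates, and the function names are invented here.

```python
import math
import random


def weibull_cdf(t, shape, scale):
    """CDF of a Weibull distribution; 0 for t <= 0, 1 at infinity."""
    if t <= 0:
        return 0.0
    if math.isinf(t):
        return 1.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def prob_prolonged(lower, upper, cdf, cutoff=14.0):
    """P(t > cutoff | lower < t < upper) under the fitted CDF.

    Pass upper=math.inf for a right-censored case.
    """
    total = cdf(upper) - cdf(lower)
    if total <= 0:
        raise ValueError("interval has zero probability under the distribution")
    above = cdf(upper) - cdf(max(lower, cutoff))
    return max(above, 0.0) / total


def classify_by_vote(p, draws=10_000, rng=None):
    """Draw `draws` Bernoulli(p) labels; 'prolonged' if more than half agree."""
    rng = rng or random.Random(0)
    hits = sum(rng.random() < p for _ in range(draws))
    return "prolonged" if hits > draws // 2 else "normal"


# Hypothetical parameters, for illustration only (not estimates from the data).
def fitted_cdf(t):
    return weibull_cdf(t, shape=2.0, scale=8.0)


p_interval = prob_prolonged(10, 18, fitted_cdf)      # interval-censored case
p_right = prob_prolonged(12, math.inf, fitted_cdf)   # right-censored case
```

A case whose upper bound is at most 14 days gets probability 0, and a case whose lower bound is at least 14 days gets probability 1. With 10,000 draws, the majority vote will in practice agree with the simpler rule of checking whether the conditional probability exceeds 0.5, except when the probability is very close to 0.5.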
Statistical analysis
The baseline characteristics, epidemiological information, and clinical phenotype were compared between local COVID-19 cases with an incubation period of ≤14 days and those with an incubation period of >14 days. Pearson’s Chi-square test was used for categorical variables, and Fisher’s exact test was used when more than 20% of the cells of an R×C contingency table had expected frequencies < 5. The Wilcoxon rank-sum test was used to compare continuous variables between the two groups of patients. The changing patterns of the incubation period were profiled over four epidemic periods, by different case characteristics.
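The rule for choosing between the two tests for categorical variables can be made concrete with a short Python sketch. It only computes expected frequencies and applies the 20% rule; the function names and the example tables are hypothetical, and the tests themselves would still be run in a statistics package.

```python
def expected_counts(table):
    """Expected cell counts under independence for an R x C table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def choose_test(table, threshold=5.0, max_share=0.20):
    """Return 'fisher' if more than 20% of cells have expected count < 5."""
    cells = [e for row in expected_counts(table) for e in row]
    small = sum(e < threshold for e in cells)
    return "fisher" if small / len(cells) > max_share else "chi-square"


# Hypothetical 2 x 2 tables (rows: incubation group, columns: a binary trait).
dense = [[50, 60], [40, 45]]
sparse = [[2, 30], [3, 40]]
choice_dense = choose_test(dense)
choice_sparse = choose_test(sparse)
```

For the sparse table, two of the four expected counts fall below 5, so the rule selects Fisher's exact test.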
An accelerated failure time (AFT) model assuming a Gamma distribution for the incubation period was applied to evaluate the impact of patients’ characteristics on the length of the incubation period. The AFT model was implemented using the “survreg” function in the R package “survival” [16]. In the “survreg” function, the parameters (baseline shape and scale parameters and covariate coefficients) were estimated by maximum likelihood. This model allowed us to analyze the associations between an interval-censored response variable and the explanatory variables.
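For orientation, an AFT model is commonly written in log-linear form, and an interval-censored observation contributes the probability that the incubation period falls between its bounds. The expression below is a generic sketch of that structure, not a formula taken from the study: the covariate vector \({x}_i\), coefficients \(\theta\), scale \(\sigma\), error term \({\varepsilon}_i\), and conditional distribution function \(F\) are notation assumed here.

$$\ln {V}_i^{Inc}={x}_i^T\theta +\sigma {\varepsilon}_i,\qquad L\left(\theta, \sigma \right)=\prod_i\left[F\left({T}_i^S-{L}_i\mid {x}_i\right)-F\left({T}_i^S-{U}_i\mid {x}_i\right)\right]$$

Under this form, \(\exp \left({\theta}_j\right)\) is read as the multiplicative change in the incubation period associated with a one-unit increase in the j-th covariate.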
A logistic regression model was used to evaluate the association between incubation period and clinical severity, with sex, age, geographical location, occupation, type of residence, and underlying diseases included as covariates. Attribute value frequency (AVF) and Z-score were used to identify outliers, and the influence of these outliers on the model was evaluated [17,18,19].
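The two outlier scores can be sketched in a few lines of Python. This is a minimal illustration assuming the usual definitions: the AVF score of a record is the mean frequency of its categorical attribute values (lower scores suggest outliers), and the Z-score is the distance from the mean in standard deviations. The function names and example records are hypothetical, and the cutoffs used in the study are not stated in the text.

```python
from collections import Counter
from statistics import mean, stdev


def avf_scores(rows):
    """AVF score per record: mean frequency of its categorical values.

    `rows` is a list of equal-length tuples of categorical values.
    """
    counts = [Counter(col) for col in zip(*rows)]
    return [mean([counts[j][v] for j, v in enumerate(row)]) for row in rows]


def z_scores(values):
    """Z-score of each value using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical records: (sex, type of residence).
records = [("M", "urban"), ("M", "urban"), ("F", "urban"), ("M", "rural")]
avf = avf_scores(records)
z = z_scores([1, 2, 3])
```

Records carrying rare attribute values (here the third and fourth) receive the lowest AVF scores and would be the first candidates for review.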
To evaluate the impact of the primary case’s characteristics on the transmissibility of COVID-19, we fitted a modified Poisson regression model with the epidemiological cluster as the unit of analysis. The number of secondary cases in a cluster was used as the dependent variable, and the characteristics of the primary cases, mainly comprising incubation period (normal or prolonged), sex, age, geographical location, type of residence, underlying diseases, epidemic phase, and clinical severity, were used as explanatory variables. If a case did not report an epidemiological association with any other confirmed case, a transmissibility of “0” was assigned. To reduce the bias caused by incomplete epidemiological survey information, we assigned a transmissibility of “0” only to cases reported from cities with high-quality epidemiological surveys (defined here as cities in which more than 40% of all cases had an identified association with other confirmed cases).
Since there could be more than one primary case in a cluster, we modified the ordinary Poisson regression model so that multiple primary cases jointly infect the secondary cases in a cluster, and constructed a new log-likelihood function as follows:
$$\ln L\left(\beta \right)=\sum_{k=1}^n\left[{y}_k\ln \left(\sum_{t=1}^m\exp \left({X}_{kt}^T\beta \right)\right)-\sum_{t=1}^m\exp \left({X}_{kt}^T\beta \right)-\ln \left({y}_k!\right)\right]$$
where n is the number of clusters, m is the number of primary cases in a cluster, \({X}_{kt}\) is the vector of characteristics of the t-th primary case in cluster k, and \({y}_k\) is the number of secondary cases in cluster k. The regression coefficients \(\beta\) of this model were estimated by maximum likelihood.
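The log-likelihood above can be evaluated directly, as in the following minimal Python sketch. It is an illustration, not the study's R implementation: the function name and the data layout (a list of pairs holding the secondary-case count and one covariate vector per primary case) are assumptions, and the optimization step is omitted.

```python
import math


def cluster_loglik(beta, clusters):
    """Log-likelihood of the modified Poisson model.

    `clusters` is a list of (y_k, X_k) pairs, where y_k is the number of
    secondary cases in cluster k and X_k is a list of covariate vectors,
    one per primary case in that cluster.
    """
    total = 0.0
    for y, primaries in clusters:
        # Cluster mean: sum over primary cases of exp(x^T beta).
        mu = sum(
            math.exp(sum(b * x for b, x in zip(beta, xs))) for xs in primaries
        )
        # lgamma(y + 1) equals ln(y!).
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total


# One primary case: reduces to the ordinary Poisson log-likelihood.
single = cluster_loglik([math.log(2.0)], [(3, [[1.0]])])
# Two primary cases with zero coefficients: cluster mean is 1 + 1 = 2.
joint = cluster_loglik([0.0, 0.0], [(0, [[1.0, 0.0], [1.0, 1.0]])])
```

With a single primary case per cluster the expression reduces to the ordinary Poisson log-likelihood, which gives a simple check on any implementation. In practice the function would be passed, with its sign flipped, to a numerical optimizer to obtain the coefficient estimates.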
To assess multicollinearity among the model predictors, the variance inflation factor (VIF) of each variable was calculated. All VIFs in our models were lower than 2, indicating very low multicollinearity (Supplementary Table 1). All analyses were performed using R software (version 3.6.3, R Foundation for Statistical Computing, Vienna, Austria).