To estimate cumulative cases, we used inputs on Lyme disease incidence and treatment failure rates commonly reported in the peer-reviewed literature. We base our simulations on the technique presented by Crouch et al. [11] and utilize the probability distribution function from classical statistics that most closely represents the type of data used in each step of the simulation, i.e. the binomial distribution for “yes/no” data and the negative binomial distribution for overdispersed count data. We also use a simple deterministic approach as a check for the simulations. The six settings represent three scenarios for LD incidence and two treatment failure rates within each scenario. While any one of the settings may currently be considered more realistic, more research is being conducted that may change our understanding. At that time, an improved framework could be developed that incorporates all uncertainty into one simulation.
Input data sources
Duration of epidemic
The first cases of Lyme disease in the United States were diagnosed in the late 1970s in the state of Connecticut [12]. LD has emerged since then, with 96% of diagnosed and reported cases found in 14 states, according to the US CDC.
Yearly incidence of new Lyme infections
Based on direct surveillance reporting to state health departments, there have been approximately 30,000 confirmed cases per year in the United States. However, two recent publications by US CDC researchers suggest that the actual number of new infections is much higher. Hinckley et al. (2014) estimated about 288,000 new infections in 2008 (range 240,000-444,000) based on surveys of seven national commercial labs that performed Lyme disease testing. Due to insensitivity in the diagnostic tests currently used by mainstream medical authorities, incidence estimates based solely on these tests are likely to significantly undercount the numbers of infected. Nelson et al. (2015) used data from a health insurance claims database to estimate there have been approximately 329,000 incident Lyme diagnoses per year during 2005–2010 (range 296,000-376,000). These publications show that reliance on surveillance, reported by local and regional public health authorities results in significant under-reporting.
Age and gender of newly infected individuals
The US CDC has estimated the age and gender distribution of new Lyme disease diagnoses in the United States. The distribution is available online at http://www.cdc.gov/lyme/stats/graphs.html (accessed June 6, 2016). This distribution shows two peaks, one in childhood (ages 5–9) that is higher among boys, and one at ages 45–55 with similar distribution by gender.
Treatment failure
Recent studies have shown that treatment failure rates may range from 10 to 20% [13,14,15]. Given the variability of treatment failure due to regional, geographical differences, socioeconomic factors, co-morbidities, treatment delays, and non-standardized treatment protocols, we chose to encompass both extremes of this range, basing our estimations on either 10 or 20%.
Survival
Since LD is rarely reported as a cause of death [16], we assume death rates for those with PTLD are identical to those of the general US population. Death rates for the general US population are available online at the US CDC website (accessed June 6, 2016) by gender, year (1980–2014) and 10-year age groups (except for age 0–4) [17, 18].
Distributional assumptions for simulation
Incidence of Lyme disease infections
Mean: The simulations were set up to evaluate three arguably plausible assumptions about disease incidence, since the exact growth rate of the US LD epidemic is not known. All scenarios assumed 0 cases prior to 1981 and allowed linear growth between years for which estimates are available. Scenario A represents LD cases captured for surveillance purposes and assumed linear growth from 0 cases in 1980 to 34,449 cases in 2005, and remained stable with 34,449 annual cases from 2005 onward. These values derive from surveillance data published by the US CDC [19]. Scenario B similarly assumed 0 cases in 1980, linear growth to 329,000 cases in 2005 and then a stable 329,000 cases in years from 2005 onward. Scenario C is the same as scenario B to 2005, but allowed continued linear growth thereafter. Use of linear growth in our predictions is conservative over exponential growth; the latter is a potentially realistic option in an expanding epidemic, due to rapid growth of the vector population and tick habitats [20].
Variability: While the mean (or expected number) of new infections was presented as input, the simulations allowed the actual number to vary stochastically using the negative binomial distribution with the variance set to mu + mu2/size, where mu is the mean and size is an overdispersion parameter. This allowed uncertainty to increase with a higher expected number of new infections. The dispersion parameter was estimated by fitting overdispersed generalized linear negative binomial regression models to the number of confirmed cases in the CDC surveillance data from 1997 to 2017, using linear growth over time as the only independent variable [19]. We fit two regressions due to the assumption of Scenario A of linear growth from 1997 to 2005 with constant incidence thereafter and because the CDC employed two different reporting criteria before and after 2008. The dispersion parameter estimated from the 1997–2005 data was 112, and the dispersion parameter estimated from the 2008–2017 data was 127. To be consistent with both models, the dispersion parameter, size, was set to 120 in our simulations.
Age and gender of new infections
Mean: The mean age and gender of new infections was assumed to follow the distribution provided by the CDC and remain stable for the duration of the epidemic.
Variability: Each new infection was assigned to a 5-year age group and gender using a multinomial distribution parametrized using the distribution of the CDC data. Each new infection was randomly assigned an age in years assuming a uniform distribution. After drawing a uniform [0, 1] deviate we assigned an age as follows: if the uniform deviate was [0–.2), we placed the individual in the lowest age in the age group, if it was [0.2, 0.4), the second lowest was used, etc.
Progression to PTLD
Mean: We assumed rates of treatment failure to be 10% or 20%.
Variability: With a wide range of reported rates for transitioning to PTLD from the literature, our study assumed these estimates were based on a relatively small sample size of approximately 500 individuals. To model the uncertainty, we drew a probability from a beta distribution for each year with variability determined by a sample size of 500. The parameters of the beta distribution for a mean probability, p, were alpha = p*500 and beta = (1-p)*500. For a failure rate of 10%, p = 0.10, the mean is 10%, and the 2 standard deviation (SD) range is 7.4 to 12.8%; for a failure rate of 20%, p = 0.20, the mean is 20%, and the 2 SD range is 16.4 to 23.6%.
Survival
Mean: While nine cases of fatal Lyme carditis have been recognized in national surveillance data from 1985 to 2018 in the US [21], a review of death reports and death certificates in the US from 1999 to 2003 cited Lyme disease as a rare cause of death [16, 22]. Survival rates for patients with PTLD were, therefore, assumed to be the same as the general US population and survival rates after 2014 were assumed to be as in 2014.
Variability: Survival to the next year was simulated by drawing a binomial random variable for each case of PTLD, parameterized using the CDC survival rates. Our non-parametric survival model differs from Crouch et al., who fit a semi-parametric survival model to survival data specific to their disease population [11].
Simulation algorithm
The simulations estimate Nth-year prevalence as described by Crouch et al. [11]. Nth-year prevalence is the sum of cases identified in the N-1 years prior to the index date that survived to the index date plus the new cases from the current year. Therefore, 1-year prevalence is simply the number of new PTLD patients each year, and 2-year prevalence is the number of cases from the prior year that survived to the current year plus the current year’s cohort, etc. The total number of PTLD cases at an index date of 2016 is the sum of surviving cases since 1981, or the 36-year prevalence, and the total number of cases in 2020 is the 40-year prevalence.
A total of 6 simulations were performed, one for each of the three incidence scenarios at the two PTLD progression rates (3*2 = 6). For each simulation, we performed 5000 runs and provide the median and the 2.5th and 97.5th percentiles as coverage intervals (CI) of the results [23]. All analyses were performed in R version 3.5.1.
For each run, we did the following:
Step 1. Drew the number of incident PTLD cases per year by age and gender, specifically
-
a)
Drew the number of new infections each year from 1981 to 2020
-
b)
Drew a probability of progressing to PTLD for each year from 1980 to 2020
-
c)
Using the probabilities in (b), randomly assigned each new infection as recovered or PTLD
-
d)
Randomly assigned each PTLD case a 5-year age and gender category
-
e)
Randomly assigned an age in years to each case
Step 2. Allowed the cohorts to age for the duration of the epidemic, specifically
-
a)
Determined survivors to the following year using age, gender, and year-specific survival rates
-
b)
Increased the age of survivors by one year
-
c)
Added the survivors and updated as the N + 1-year prevalence
-
d)
Repeated for the duration of the epidemic and appropriately tallied the results
Step 3. Steps 1 and 2 were repeated 5000 times and the results were stored for analysis.
Deterministic estimate of prevalence
As a check of the validity of the simulation, the expected prevalence of PTLD cases was also estimated using a deterministic approach. For each year of the epidemic, new infections were expected to follow scenario A, B or C exactly, have an age and gender distribution precisely equal to that presented by the CDC report, exactly 10 or 20% of new cases transitioned to PTLD, and death occurred at 80 years of age. The deterministic values should have approximated the mean from the simulations, but did not have associated variability.