Early prediction of median survival among a large AIDS surveillance cohort

Background For individuals with AIDS, data exist relatively soon after diagnosis to allow estimation of "early" survival quantiles (e.g., the 0.10, 0.15, 0.20 and 0.30 quantiles, etc.). Many years of additional observation must elapse before median survival, a summary measure of survival, can be estimated accurately. In this study, a new approach to predict AIDS median survival is presented and its accuracy tested using AIDS surveillance data.
Methods The data consisted of 96,373 individuals who were reported to the HIV/AIDS Reporting System of the California Department of Health Services Office of AIDS as of December 31, 1996. We defined cohorts based on quarter year of diagnosis (e.g., the "931" cohort consists of individuals diagnosed with AIDS in the first quarter of 1993). We used early quantiles (estimated using the Inverse Probability of Censoring Weighted estimator) of the survival distribution to estimate median survival by assuming a linear relationship between the earlier quantiles and median survival. From this model, median survival was predicted for cohorts for which a median could not be estimated empirically from the available data. This prediction was compared with the actual medians observed when using updated survival data reported at least five years later.
Results Using the 0.15 quantile as the predictor and the data available as of December 31, 1996, we were able to predict the median survival of four cohorts (933, 934, 941, and 942) to be 34, 34, 31, and 29 months. Without this approach, there were insufficient data with which to make any estimate of median survival. The actual median survival of these four cohorts (using data as of December 31, 2001) was found to be 32, 40, 46, and 80 months, suggesting that the accuracy for this approach requires a minimum of three years to elapse from diagnosis to the time an accurate prediction can be made.
Conclusion The results of this study suggest that early and accurate prediction of median survival time after AIDS diagnosis may be possible using early quantiles of the survival distribution. The methodology did not seem to work well during a period of significant change in survival as observed with highly active antiretroviral treatment, but results suggest that it may work well in a time of more gradual improvement in survival.


Background
Since the beginning of the AIDS epidemic, the prediction of trends in survival after an AIDS diagnosis has been important for planning health care services and for monitoring the impact of the epidemic. Temporal associations between improved survival and the introduction of expanded treatment options provide population-based evidence that there may be beneficial treatment effects long before these hypotheses can be tested formally. In a time when health care resources are limited and health priorities must be established, it is crucial to project the short-term mortality after AIDS for future planning of health care resources [1].
Temporal trends and improvements in survival with AIDS were reported early in the epidemic even before the introduction of advances in therapy [2]. Shortly after the introduction of zidovudine therapy, temporal trends in survival were (eventually) noted using surveillance data [3]. Other registry-based studies investigated the relationship between survival following an AIDS diagnosis and calendar date of diagnosis [4,5]. These studies consistently showed marked improvements in AIDS survival after the introduction of zidovudine therapy and Pneumocystis carinii pneumonia prophylaxis. More recently, the introduction of highly active antiretroviral therapy (HAART) has renewed the idea of examining trends in survival after an AIDS diagnosis in order to study both the short- and long-term effects of these new drugs on HIV-related morbidity and mortality.
In this study, a new approach was implemented to predict AIDS survival and test its accuracy using AIDS surveillance data. The purposes of this study were: (1) to determine the earliest quantile (such as 0.10, 0.15, or 0.20) of the survival distribution that can be used to predict accurately a cohort's subsequently observed median survival, and (2) to estimate the survival quantiles using the Inverse Probability of Censoring Weighted (IPCW) estimator [6] in order to improve the prediction methodology in the common situation with registry data in which death is subject to delays in reporting [7].
For cohorts of individuals who have recently been diagnosed with AIDS, data exist for "early" survival quantiles (such as the 0.10, 0.15, 0.20 quantiles, etc.) but many years of additional observation must elapse before later quantiles, such as the median (0.50 quantile), can be estimated with accuracy. Assuming a linear relationship between the early survival quantile and the median survival, an early prediction of the median value for a cohort's eventual survival distribution is compared to the actual or true median value for the cohort. If the predicted median is accurate, then early estimation of AIDS survival is possible and will be of great benefit to health care planners developing strategies and financing for the health care needs of these patients. Additionally, such an accurate, early prediction methodology could be extended to other large, population-based surveillance systems where survival prediction is a major goal.

California AIDS surveillance data
The California Department of Health Services, Office of AIDS (OA), in cooperation with the Centers for Disease Control and Prevention (CDC), maintains a registry of all reports of AIDS cases in California. This registry, the HIV/AIDS Reporting System (HARS), contains demographic, risk factor and limited clinical information on each reported case. A HARS data set as of December 31, 2001 was used to obtain four variables: the dates of AIDS diagnosis (month and year only; the day was assumed to be 15 in order to calculate a date), the dates of death (if reported), the dates the deaths were reported to the CDC, and the date each case was entered into the registry. Dates of death are updated periodically by local city and county health departments and by OA using the California Death Registry and the National Death Index.
The State of California Health and Human Services Agency Committee for the Protection of Human Subjects and the Committee for the Protection of Human Subjects at the University of California at Berkeley approved the use of these data for this purpose.

Identification of cohorts of AIDS patients as of December 31, 1996
The date of AIDS diagnosis is the date of the first condition that would allow a person to be classified as having AIDS under the 1993 change in the AIDS case definition [8]. This definition was retroactively applied to cases diagnosed prior to 1993. Cases were grouped into cohorts defined by the calendar quarter of their AIDS diagnosis. For example, a person diagnosed in November of 1992 (i.e., the fourth quarter of 1992) was classified into the "924" cohort. All AIDS cases diagnosed according to the 1993 change in the AIDS case definition and entered into the HARS Registry between January 1, 1983 and December 31, 1996 were eligible to be included in the study.
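The cohort labelling rule can be illustrated with a minimal sketch. The function name `cohort_label` is hypothetical and introduced here only for illustration; the rule itself (two-digit year followed by the quarter number) is taken from the examples in the text.

```python
from datetime import date


def cohort_label(diagnosis: date) -> str:
    """Return the quarter-year cohort label for a diagnosis date.

    The label is the two-digit year followed by the calendar quarter,
    so November 1992 (fourth quarter of 1992) maps to "924".
    """
    quarter = (diagnosis.month - 1) // 3 + 1
    return f"{diagnosis.year % 100:02d}{quarter}"


# Day 15 is used because only month and year of diagnosis are recorded.
print(cohort_label(date(1992, 11, 15)))  # 924
print(cohort_label(date(1993, 2, 15)))   # 931
```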

Determination of survival from AIDS diagnosis to death
CDC receives information from California's HIV/AIDS Registry on a monthly basis. For all newly-reported deaths, the date on which the death was first reported to the CDC is recorded. In order to re-create the death information that would have been available to any investigator as of December 31, 1996, death dates were included only if they were reported on or before this date. Survival time was defined as the time elapsed from the date of the AIDS diagnosis until death from all causes, or until December 31, 1996, the date of analysis for the study.
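A minimal sketch of this rule follows. The function names and the conversion of days to months (an average month of 30.4375 days) are assumptions made for illustration; the source does not state how elapsed time was converted to months.

```python
from datetime import date
from typing import Optional, Tuple

ANALYSIS_DATE = date(1996, 12, 31)
DAYS_PER_MONTH = 30.4375  # assumed average month length, not from the source


def observed_survival(
    diag_year: int,
    diag_month: int,
    death: Optional[date],
    death_reported: Optional[date],
    analysis_date: date = ANALYSIS_DATE,
) -> Tuple[float, bool]:
    """Return (survival time in months, death known at the analysis date).

    A death counts only if it was reported on or before the analysis date;
    otherwise the case is treated as alive and followed to the analysis date.
    """
    diagnosis = date(diag_year, diag_month, 15)  # day assumed to be 15
    known_dead = (
        death is not None
        and death_reported is not None
        and death_reported <= analysis_date
    )
    end = death if known_dead else analysis_date
    return (end - diagnosis).days / DAYS_PER_MONTH, known_dead


# A death in June 1996 that was first reported in February 1997 is not yet
# visible on December 31, 1996, so the case is followed to the analysis date.
print(observed_survival(1995, 1, date(1996, 6, 15), date(1997, 2, 1)))
```

The example illustrates why reporting delay matters: the case appears to be alive at the analysis date even though the death had already occurred.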

The Inverse Probability of Censoring Weighted estimator
For many sources of registry-based data, there is a delay between the recording of vital status and its availability for analysis. In such situations, the analyst may assume mistakenly that those who are not yet known to have died are still alive when, in fact, some of these individuals may have died but the deaths have not yet been reported to the registry. The use of the Kaplan-Meier (K-M) estimator to estimate survival in this situation has been shown to be inconsistent and to yield biased results [9]. Following the approach of Robins and Rotnitzky [10], van der Laan and Hubbard [6] and Hubbard et al. [11] proposed a simple inverse probability of censoring weighted estimator to account for this delay in vital status information and this estimator was applied in this study.
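As a hedged sketch, in notation introduced here for illustration and not necessarily that of [6] or [11], a simple IPCW estimator of this kind can be written as follows. Let $T_i$ be the survival time of case $i$, $R_i$ the delay in reporting the death, $V_i = T_i + R_i$ the time from diagnosis at which the death is reported, and $C_i$ the censoring time (time from diagnosis to the date of analysis). A death is observed only if $\Delta_i = I(V_i \le C_i) = 1$. Assuming that $C$ is independent of $(T, R)$ and that $\bar{G}(v) = P(C \ge v)$ is known or can be estimated, each observed death is weighted by the inverse of its probability of having been reported in time:

```latex
\hat{F}(t) = \frac{1}{n} \sum_{i=1}^{n} \frac{\Delta_i \, I(T_i \le t)}{\bar{G}(V_i)},
\qquad
\hat{S}(t) = 1 - \hat{F}(t),
\qquad
\hat{t}_p = \inf\{\, t : \hat{F}(t) \ge p \,\}.
```

Under these assumptions, $E[\Delta \, I(T \le t) / \bar{G}(V)] = P(T \le t)$, which is what makes the weighting remove the bias from unreported deaths. The weights are only defined where $\bar{G}(V_i) > 0$, which is the support condition discussed in the following paragraphs.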
The study sample consists of 56 cohorts of individuals with AIDS defined by the quarter year of diagnosis. Since the censoring date (the date of analysis) is December 31, 1996, individuals diagnosed with AIDS in the 951 cohort who survived the entire period can only have censoring times equal to 23 months (for those diagnosed in January), 22 months (for those diagnosed in February), or 21 months (for those diagnosed in March). One possible concern with using the IPCW estimator to estimate the survival distribution is that the estimator may perform poorly if the censoring distribution has all of its weight on a small set of times, as observed with these data. If there are subjects for whom the reporting time is greater than the support of possible censoring times, the IPCW estimator may be biased [11]. In order to account for this, artificial censoring was used to augment the estimator.
Each case was assigned a new, uniformly distributed censoring time from 0 months to the maximum censoring time according to the cohort to which each case belonged. For example, the individuals diagnosed in the first quarter of 1995, the 951 cohort, were each assigned randomly a censoring time from a uniform distribution ranging from 0 months to 23 months, the maximum censoring time for this cohort. Similarly, the members of the 941 cohort were each assigned a censoring time from a uniform distribution from 0 months to 35 months, the members of the 931 cohort from 0 months to 47 months, and so forth. The censoring time for each individual was taken to be the minimum of this new censoring time or the original censoring time defined as the time elapsed from their date of diagnosis to December 31, 1996. By doing so, an artificial censoring distribution is created with more uniform mass over the possible times of death for each of the cohorts.
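The artificial censoring step can be sketched as below. The helper names are hypothetical, and the rule for the maximum censoring time (whole months from the first month of the diagnosis quarter to December 1996) is inferred from the three examples given in the text rather than stated by the source.

```python
import random

ANALYSIS_YEAR = 1996  # date of analysis is December 31, 1996


def max_censoring_months(cohort: str) -> int:
    """Maximum censoring time in months for a cohort label such as "951".

    Counted from the first month of the diagnosis quarter to December of
    the analysis year (an inferred convention).
    """
    year = 1900 + int(cohort[:2])
    first_month = 3 * (int(cohort[2]) - 1) + 1
    return (ANALYSIS_YEAR - year) * 12 + (12 - first_month)


def artificial_censoring_time(
    cohort: str, original_censoring: float, rng: random.Random
) -> float:
    """Draw a uniform censoring time and keep the smaller of new and original."""
    new_time = rng.uniform(0, max_censoring_months(cohort))
    return min(new_time, original_censoring)


rng = random.Random(0)  # fixed seed so the illustration is reproducible
for cohort in ("951", "941", "931"):
    print(cohort, max_censoring_months(cohort))
print(artificial_censoring_time("951", 21.0, rng))
```

After this step, a death would count as observed only if it was reported before the case's new censoring time, so the survival times and death indicators need to be re-derived from the artificially censored data.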
The reason to artificially censor the data arises from the type of censoring distribution encountered in these data. Specifically, subjects are enrolled within a narrow window of time (three months) for each cohort and all subjects are censored at the same chronological time. Thus, the censoring distribution has all of its mass over a three-month period. The consequence of this is the potentially high variability in the IPCW estimator for quantiles within the support of censoring. By artificially censoring the data, censoring is "spread" over a larger interval, which reduces the variability of estimates of survival at later quantiles. The cost is that the variability of survival estimates of earlier quantiles is increased by censoring originally uncensored observations.

Prediction of median estimates of survival
Since our goal was to use early survival experiences to predict later survival, we examined the relation between the early quantiles (i.e., 0.10, 0.15, 0.20, and 0.30 quantiles) of the survival distribution and the 0.50 quantile. Assuming a linear relationship, we fitted a linear model and calculated predicted medians by entering the observed early quantile into the fitted model. Assuming a linear model implies that our method works only so long as there is the same systematic shift in the survival distribution over time. That is, if the early quantile increases over time for a particular cohort, our method works only if the later quantile increases as well.
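The prediction step can be sketched as an ordinary least-squares line fitted to cohorts for which both quantiles are observed, then applied to a cohort for which only the early quantile is available. The function names and all numeric values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not study data, and the source does not state how the model was fitted.

```python
import numpy as np


def fit_line(early: np.ndarray, median: np.ndarray) -> tuple:
    """Least-squares fit of median = intercept + slope * early quantile."""
    slope, intercept = np.polyfit(early, median, deg=1)
    return float(intercept), float(slope)


def predict_median(early_quantile: float, intercept: float, slope: float) -> float:
    """Predicted median survival (months) from an observed early quantile."""
    return intercept + slope * early_quantile


# Hypothetical cohorts with both quantiles observed (months).
early = np.array([2.0, 3.0, 4.0, 5.0])
median = np.array([10.0, 13.0, 16.0, 19.0])  # exactly 4 + 3 * early

intercept, slope = fit_line(early, median)

# Hypothetical recent cohort: only the early quantile (6 months) is observed.
print(predict_median(6.0, intercept, slope))  # close to 22
```

Because the fitted line is extrapolated to recent cohorts, the prediction is only as good as the assumption that the historical relationship between the two quantiles still holds.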

The "true" quantiles of the survival distribution
The "true" quantiles (i.e., the best possible estimate of the quantiles) of the survival distribution were assumed to be the quantiles of survival estimated empirically from the data using the IPCW estimator as of December 31, 2001. This provided at least an additional five years of observation after the date of analysis upon which the early predictions were made. In order to assess the performance of the prediction method, the predicted median estimates using our method were compared to these "true" medians (i.e., observed median estimates using data as of December 31, 2001) for the study sample.

Deaths in HARS
The justification for using five years of follow-up as providing the "true" survival estimates (i.e., the length of follow-up necessary for a cohort until the quantiles are "stable" and the "true" quantiles are achieved for a particular cohort diagnosed with AIDS) is based upon empirical data. Using the raw data as of December 31, 2001 (with no artificial censoring imposed), the cumulative numbers of deaths over ten years of follow-up for four cohorts were determined (Table 1). Among the deaths that were known to occur after ten years of follow up for the cohort of individuals diagnosed in 854 (n = 309 deaths reported as of December 31, 1995 among n = 317 individuals identified as part of this cohort), 83.6% of the cohort were known to have died within four years and 88.3% were reported within 5 years. On average, 80% or more of cohorts were known to have died within four years of the identification of the cohort. These results give empiric evidence that the "true" quantiles of survival are those which are observed five years after identification of the cohort and were our basis for our decision to derive the "true" estimates using data from December 2001.

Study sample using data as of December 31, 1996
There were 96,754 AIDS cases diagnosed from 1978 through 1996 and entered into the HARS database on or before December 31, 1996. Of those, 84 cases (0.1%) were excluded because they were reported as having negative survival times (n = 1) or negative reporting times (n = 83).
After excluding 297 cases (0.31%) diagnosed prior to January 1, 1983 due to small sample sizes for each of these cohorts, 96,373 (99.6%) of all AIDS cases diagnosed and entered by December 31, 1996 were included in the analysis. Figure 1 shows the 0.10, 0.15, 0.20, and the 0.50 quantiles for cohorts estimated using the database as it would have existed on December 31, 1996 (plotted on a log scale).

Survival quantiles according to the Inverse Probability of Censoring Weighted estimator
The median estimate appears to be increasing beginning with the 864 cohort and again with the 904 cohort. There are 11 cohorts (933 through 954 and 962) for which the 0.15 quantile could be estimated and the 0.50 quantile could not be estimated using the IPCW estimator. The differences between the predicted median and the "true" median increase as the cohorts get closer to the censoring date, i.e., for cohorts 934, 941, and 942. A scatterplot of the observed 0.15 and 0.50 quantiles is given in Figure 2.
Comparisons of the predicted medians and the true medians using other early quantiles of the survival distribution as predictors in separate linear regression models are shown graphically in Figure 3 (using the 0.10 and 0.15 quantiles as predictors) and Figure 4 (using the 0.20 and 0.30 quantiles as predictors).
A closer look at the predicted median survival estimates according to the IPCW estimator (Table 2) shows that this technique overestimated the median for earlier cohorts (suggesting a steeper linear slope) and underestimated the median for later cohorts (suggesting a less steep slope). This would suggest that the relation between the 0.15 quantile and the 0.50 quantile, while assumed to be linear, is changing over time. Thus, predicted median survival estimates based on a model with an interaction between the 0.15 quantile and calendar time were evaluated (Table 3). The inclusion of an interaction term with time yielded a predicted median that was closer to the truth (in comparison to the model without an interaction term) for three cohorts for which a median could not be observed at the time the prediction was made (933, 941, and 942).
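One way to write such an interaction model is sketched below. The notation is introduced here for illustration, and the inclusion of a main effect for calendar time is an assumption, since the exact specification is not given in the text. For cohort $c$ with 0.15 quantile $Q_{0.15,c}$, median $Q_{0.50,c}$, and calendar time of diagnosis $t_c$:

```latex
Q_{0.50,c} = \beta_0 + \beta_1 \, Q_{0.15,c} + \beta_2 \, t_c
             + \beta_3 \, Q_{0.15,c} \, t_c + \varepsilon_c
```

In this form the slope relating the early quantile to the median is $\beta_1 + \beta_3 t_c$, so it is allowed to drift with calendar time; setting $\beta_2 = \beta_3 = 0$ recovers the simple linear model.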

Predicted median estimates based on other dates of analyses
We also examined the performance of our prediction method using two dates of analysis other than December 31, 1996. Predicted median estimates for data as of December 31, 1992 and December 31, 1994 are presented in Additional Files 3 and 4, respectively.

Discussion
For cohorts of individuals who have been diagnosed recently with AIDS, data exist relatively soon after diagnosis for estimating "early" survival quantiles (such as the 0.10, 0.15, 0.20 quantiles, etc.) but many years of additional observation must elapse before later quantiles can be estimated accurately. The purpose of this study was to determine if median survival could be predicted accurately using earlier quantiles of survival distributions provided by AIDS surveillance data. Our approach for predicting median survival consisted of two components: (1) the estimation of quantiles of the survival distribution using the IPCW estimator, and (2) the use of a linear model to reflect the relationship between the early quantile and the later quantile. Such an approach would allow early information to predict later unobserved survival patterns in order to accurately identify changes in population-based survival years before such changes are observed. If accurate estimation could be achieved, this approach could offer a method for researchers to estimate the expected survival distribution after AIDS diagnosis (or after many conditions for which surveillance databases are maintained, such as cancer). This approach enables accurate predictions of changes in survival among HIV-infected individuals like that observed in 1987 [3][4][5][12] and more notably with the advent of the use of highly active antiretroviral therapies [13][14][15][16].
In this study, the 0.15 quantile of the survival distribution predicted accurately the median survival for cohorts diag[...]tor) and without the use of our methodology, at least six months of additional follow-up would be required to observe any median survival estimate for the 933 cohort and at least 45 months until the predicted median of 34 months (as predicted by our methodology) would be observed for this cohort. This demonstrated that our methodology not only provides an accurate estimate of median survival, but also provides that estimate long before traditional approaches would allow.
The results of this study suggest that our methodology yields an accurate prediction of median survival for cohorts diagnosed at least three years earlier than the date when the prediction is made. For example, the difference between the predicted and the true median survival was ≤ 6 months for the cohorts 933 and 934 but greater than 6 months for the cohorts 941 and 942 using data as of December 31, 1996. The "true" median estimate (using data as of December 31, 2001) according to the IPCW estimator for the 942 cohort was estimated to be 80 months. This may indeed be an early estimate of the median survival for this cohort and, as more data for this cohort become available, this median estimate may decrease over time like that observed with the K-M estimator (Additional File 1). This early estimate would greatly affect our assessment of the accuracy of our predictions since we are using this estimate as the "true" median survival. As a result, if one applies this methodology now in the second quarter of 2006, we could only expect to be able to predict with accuracy the median survival for cohorts that were [...]
The ability of the method, however, to accurately predict median survival in the short term based on historical data is greatly influenced by three factors: (1) the variability of the estimates of the various quantiles of the survival distribution, (2) the assumption that the relation between the early quantile and the later quantile (median) can be represented by a linear model, and (3) the assumption that this relation will remain the same in the short term.
In addition, delays in the reporting of death can bias estimates of survival if one assumes that a case is alive if a death date has not been reported. In this study, the use of the IPCW estimator is an attempt to mitigate the potential for bias in estimating survival after an AIDS diagnosis due to reporting delays of death. This estimator adjusts the estimates of survival for a given cohort to account for delays in death reports, provided that the distribution of the delays in death reporting is known. The bias introduced by failing to account for delays in death reporting when estimating survival after an AIDS diagnosis has already been established [11].
For simplicity, the prediction method assumed that the early quantile (such as the 0.15 quantile), a representation of the "early" survival experience, was linearly related to the median of the survival estimate, the "later" survival experience. Higher-dimensional models were explored but did not improve the predictive ability (data not shown). In assuming a linear relationship and extrapolating the observed relationship to the future, an additional assumption made was that this relationship would remain as observed in the past, at least in the short term. Validation from the more recent cohorts (i.e., the 933 and 934 cohorts) confirms that the linear model accurately predicts median survival in the short term, but may not perform well for all cohorts (e.g., the 941 and 942 cohorts). Assuming that HAART first became available with the approval of saquinavir (hard-gel) in December 1995, the 941 and 942 cohorts would have been introduced to HAART earlier after diagnosis (21 months for the 941 cohort and 18 months for the 942 cohort) in comparison to the 933 and 934 cohorts (27 months and 24 months, respectively). When comparing survival across different cohorts diagnosed over time, we would expect the later cohorts to demonstrate a shift in survival, thus violating any observed linear relationships between earlier quantiles and later quantiles observed in the past.
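The intervals quoted above are consistent with counting whole months from the last month of each diagnosis quarter to December 1995. That convention is inferred from the numbers and is not stated in the text; the function name below is hypothetical.

```python
def months_from_quarter_end(cohort: str, year: int, month: int) -> int:
    """Whole months from the last month of a cohort's quarter to (year, month)."""
    cohort_year = 1900 + int(cohort[:2])
    last_month = 3 * int(cohort[2])  # quarter 1 -> March, quarter 4 -> December
    return (year - cohort_year) * 12 + (month - last_month)


# Months from diagnosis to the assumed availability of HAART (December 1995).
for cohort in ("933", "934", "941", "942"):
    print(cohort, months_from_quarter_end(cohort, 1995, 12))
# 933 27
# 934 24
# 941 21
# 942 18
```

Cases diagnosed earlier in a quarter would have waited up to two months longer than these figures.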

Conclusion
This investigation suggests that this approach to survival estimation accurately predicted subsequent survival experience observed in two of these cohorts (the 933 and 934 cohorts). It is notable that the technique did not perform well during a period of rapid increase in AIDS survival that is not unlike the presently observed increases in survival influenced by current advances in therapy. However, the performance of the methodology before the introduction of HAART suggests that this methodology may work well in a time of more gradual improvement in survival with antiretroviral treatment. This technique may also have application in other areas of research (e.g., cancer surveillance) where population changes in survival have been observed, and it should be validated using additional data.