California AIDS surveillance data
The California Department of Health Services, Office of AIDS (OA), in cooperation with the Centers for Disease Control and Prevention (CDC), maintains a registry of all reports of AIDS cases in California. This registry, the HIV/AIDS Reporting System (HARS), contains demographic, risk factor and limited clinical information on each reported case. A HARS data set as of December 31, 2001 was used to obtain four variables for each case: the date of AIDS diagnosis (month and year only; the day was assumed to be the 15th in order to calculate a date), the date of death (if reported), the date the death was reported to the CDC, and the date the case was entered into the registry. Dates of death are updated periodically by local city and county health departments and by OA using the California Death Registry and the National Death Index.
The State of California Health and Human Services Agency Committee for the Protection of Human Subjects and the Committee for the Protection of Human Subjects at the University of California at Berkeley approved the use of these data for this purpose.
Identification of cohorts of AIDS patients as of December 31, 1996
The date of AIDS diagnosis is the date of the first condition that would allow a person to be classified as having AIDS under the 1993 change in the AIDS case definition [8]. This definition was retroactively applied to cases diagnosed prior to 1993. Cases were grouped into cohorts defined by the calendar quarter of their AIDS diagnosis. For example, a person diagnosed in November of 1992 (i.e., the fourth quarter of 1992) was classified into the "924" cohort. All AIDS cases diagnosed according to the 1993 change in the AIDS case definition and entered into the HARS Registry between January 1, 1983 and December 31, 1996 were eligible to be included in the study.
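The cohort labelling rule can be illustrated with a minimal sketch. The function name `cohort_label` is hypothetical and not part of HARS; the sketch only assumes that the label is the two-digit year followed by the quarter number, as in the "924" example.

```python
from datetime import date


def cohort_label(diagnosis: date) -> str:
    """Return a cohort label such as '924' (fourth quarter of 1992).

    Hypothetical helper: two-digit year followed by the calendar quarter.
    """
    quarter = (diagnosis.month - 1) // 3 + 1  # months 1-3 -> 1, ..., 10-12 -> 4
    return f"{diagnosis.year % 100:02d}{quarter}"


# Day of diagnosis is assumed to be the 15th, as described for the HARS data.
print(cohort_label(date(1992, 11, 15)))
```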
Determination of survival from AIDS diagnosis to death
The CDC receives information from California's HIV/AIDS Registry on a monthly basis. For all newly reported deaths, the date on which the death was first reported to the CDC is recorded. In order to re-create the death information that would have been available to any investigator as of December 31, 1996, death dates were included only if they were reported on or before this date. Survival time was defined as the time elapsed from the date of the AIDS diagnosis until death from any cause, or until December 31, 1996, the date of analysis for the study.
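A minimal sketch of this rule follows. The function and argument names are hypothetical, and measuring survival in days is an assumption made for illustration only.

```python
from datetime import date
from typing import Optional, Tuple

ANALYSIS_DATE = date(1996, 12, 31)


def observed_survival(
    diagnosis: date,
    death: Optional[date],
    death_reported: Optional[date],
    cutoff: date = ANALYSIS_DATE,
) -> Tuple[int, bool]:
    """Return (survival time in days, death known at cutoff).

    A death counts only if it had been reported on or before the cutoff;
    otherwise the case is treated as alive and followed to the cutoff.
    """
    known_dead = (
        death is not None
        and death_reported is not None
        and death_reported <= cutoff
    )
    end = death if known_dead else cutoff
    return (end - diagnosis).days, known_dead
```

A death that occurred before December 31, 1996 but was reported afterwards is therefore treated as unobserved, which is the reporting-delay problem the estimator in the next section is meant to address.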
The Inverse Probability of Censoring Weighted estimator
For many sources of registry-based data, there is a delay between the recording of vital status and its availability for analysis. In such situations, the analyst may mistakenly assume that those who are not yet known to have died are still alive when, in fact, some of these individuals have died but their deaths have not yet been reported to the registry. The Kaplan-Meier (K-M) estimator of survival has been shown to be inconsistent and to yield biased results in this situation [9]. Following the approach of Robins and Rotnitzky [10], van der Laan and Hubbard [6] and Hubbard et al. [11] proposed a simple inverse probability of censoring weighted (IPCW) estimator to account for this delay in vital status information. This estimator was applied in this study.
The study sample consists of 56 cohorts of individuals with AIDS defined by the quarter year of diagnosis. Since the censoring date (the date of analysis) is December 31, 1996, individuals diagnosed with AIDS in the 951 cohort who survived the entire period can only have censoring times equal to 23 months (for those diagnosed in January), 22 months (for those diagnosed in February), or 21 months (for those diagnosed in March). One possible concern with using the IPCW estimator to estimate the survival distribution is that the estimator may perform poorly if the censoring distribution has all of its weight on a small set of times, as is the case with these data. If there are subjects for whom the reporting time is greater than the support of possible censoring times, the IPCW estimator may be biased [11]. In order to account for this, artificial censoring was used to augment the estimator.
Each case was assigned a new, uniformly distributed censoring time ranging from 0 months to the maximum censoring time for the cohort to which the case belonged. For example, the individuals diagnosed in the first quarter of 1995, the 951 cohort, were each randomly assigned a censoring time from a uniform distribution ranging from 0 months to 23 months, the maximum censoring time for this cohort. Similarly, the members of the 941 cohort were each assigned a censoring time from a uniform distribution from 0 months to 35 months, the members of the 931 cohort from 0 months to 47 months, and so forth. The censoring time for each individual was taken to be the minimum of this new censoring time and the original censoring time, defined as the time elapsed from the date of diagnosis to December 31, 1996. By doing so, an artificial censoring distribution is created with more uniform mass over the possible times of death for each of the cohorts.
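The artificial censoring step might be sketched as follows. All function names are hypothetical. Two assumptions are made for illustration: time is counted in whole calendar months (which reproduces the 23, 35 and 47 month maxima quoted above), and the uniform draw is continuous, since the text does not say whether it was continuous or discrete.

```python
import random
from datetime import date

ANALYSIS_DATE = date(1996, 12, 31)


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (an assumed convention)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def max_censoring_months(year: int, quarter: int,
                         cutoff: date = ANALYSIS_DATE) -> int:
    """Longest possible follow-up for a cohort: first month of the quarter to cutoff."""
    first_month = 3 * (quarter - 1) + 1
    return months_between(date(year, first_month, 1), cutoff)


def artificial_censoring_time(diagnosis: date, rng: random.Random,
                              cutoff: date = ANALYSIS_DATE) -> float:
    """Minimum of a uniform draw on [0, cohort maximum] and the original censoring time."""
    quarter = (diagnosis.month - 1) // 3 + 1
    upper = max_censoring_months(diagnosis.year, quarter, cutoff)
    original = months_between(diagnosis, cutoff)
    drawn = rng.uniform(0, upper)  # assumption: continuous uniform draw
    return min(drawn, original)


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
print(max_censoring_months(1995, 1))                      # 951 cohort
print(artificial_censoring_time(date(1995, 3, 15), rng))  # March 1995 diagnosis
```

Taking the minimum means that no case is followed beyond December 31, 1996, while a case whose draw falls below its original censoring time is censored earlier than it would otherwise have been.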
The reason for artificially censoring the data arises from the type of censoring distribution encountered in these data. Specifically, subjects are enrolled within a narrow window of time (three months) for each cohort and all subjects are censored at the same chronological time. Thus, the censoring distribution has all of its mass over a three-month period. The consequence of this is potentially high variability in the IPCW estimator for quantiles within the support of censoring. By artificially censoring the data, censoring is "spread" over a larger interval, which reduces the variability of estimates of survival at later quantiles. The cost is that the variability of survival estimates at earlier quantiles is increased by censoring originally uncensored observations.
Prediction of median estimates of survival
Since our goal was to use early survival experiences to predict later survival, we examined the relation between the early quantiles (i.e., 0.10, 0.15, 0.20, and 0.30 quantiles) of the survival distribution and the 0.50 quantile. Assuming a linear relationship, predicted median estimates were calculated by estimating the linear model and entering the observed early quantile into the fitted model. The assumption of a linear model implies that our method works only so long as there is the same systematic shift in the survival distribution over time. That is, if the early quantile increases over time for a particular cohort, our method works only if the later quantile increases as well.
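As an illustration of the prediction step, the sketch below fits a straight line relating an early quantile to the median across cohorts and then predicts a median from a newly observed early quantile. It assumes ordinary least squares, which the text does not specify, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_median(early_quantile: float, intercept: float, slope: float) -> float:
    """Predicted 0.50 quantile for a cohort given its observed early quantile."""
    return intercept + slope * early_quantile
```

Here `x` would hold one early quantile (for example, the 0.10 quantile of survival time) for each cohort whose median is already observable, and `y` the corresponding medians. The fitted line is then applied to a later cohort for which only the early quantile has been observed.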
The "true" quantiles of the survival distribution
The "true" quantiles (i.e., the best possible estimate of the quantiles) of the survival distribution were assumed to be the quantiles of survival estimated empirically from the data using the IPCW estimator as of December 31, 2001. This provided at least an additional five years of observation after the date of analysis upon which the early predictions were made. In order to assess the performance of the prediction method, the predicted median estimates using our method were compared to these "true" medians (i.e., observed median estimates using data as of December 31, 2001) for the study sample.