
Treating loss-to-follow-up as a missing data problem: a case study using a longitudinal cohort of HIV-infected patients in Haiti



HIV programs are often assessed by the proportion of patients who are alive and retained in care; however, some patients are categorized as lost to follow-up (LTF) and have unknown vital status. LTF is not an outcome but a mixed category of patients who have undocumented death, transfer, or disengagement from care. Estimating vital status (dead versus alive) among this category is critical for survival analyses and program evaluation.


We used three methods to estimate survival in the cohort and to ascertain factors associated with death among the first cohort of HIV positive patients to receive antiretroviral therapy in Haiti: complete case (CC) (drops missing), Inverse Probability Weights (IPW) (uses tracking data) and Multiple Imputation with Chained Equations (MICE) (imputes missing data). Logistic regression was used to calculate odds ratios and 95% confidence intervals for adjusted models for death at 10 years. The logistic regression models controlled for sex, age, severe poverty (living on <$1 USD per day), Port-au-Prince residence and baseline clinical characteristics of weight, CD4, WHO stage and tuberculosis diagnosis.


Age, severe poverty, baseline weight and WHO stage were statistically significant predictors of AIDS related mortality across all models. Gender was only statistically significant in the MICE model but had at least a 10% difference in odds ratios across all models.


Each of these methods had different assumptions and differed in the number of observations included due to how missing values were addressed. We found MICE to be most robust in predicting survival status as it allowed us to impute missing data so that we had the maximum number of observations to perform regression analyses. MICE also provides a complementary alternative for estimating survival among patients with unassigned vital status. Additionally, the results were easier to interpret, less likely to be biased and provided an alternative to a problem that is often commented upon in the extant literature.



HIV programs are often assessed by the proportion of patients who are alive and retained in care, which has direct consequences for funding and programmatic services offered [1, 2]. However, among individuals who initiate antiretroviral treatment (ART), the reported rate of lost to follow up ranges from 5 to 53% [1, 3,4,5,6,7,8,9,10]. Clinically, these LTF patients are at risk for adverse outcomes such as medication resistance, transmission to others, lack of care, or at best, incomplete medical records when they transfer care to another clinic [1, 6, 7]. Programmatically, lost to follow up leads to underestimates of retention which could be mis-interpreted as under-performance on program outcomes [1, 5, 6, 11].

The category of lost to follow-up (LTF) is not a homogeneous outcome—e.g., “dead” or “alive”—but rather a heterogeneous category of three disparate health states: undocumented deaths, undocumented or silent transfers to another source of HIV care, or alive and complete disengagement from HIV care [12,13,14]. Alive and being retained in care is synonymous with the proportion of patients who are neither dead nor LTF. The fact that LTF is part of the definition makes this outcome complex and problematic.

In reality, LTF is a marker for missing data on vital status. We argue that LTF should not be treated as a legitimate outcome category because its meaning can easily change over time and across sites. For example, patients who silently transfer to another provider, move domiciles or die outside of a healthcare facility could all be classified as LTF. Thus, studying predictors of LTF should be avoided. Instead, LTF should be considered a missing data problem that needs to be solved. We present a unique application of MICE to impute both missing outcome (vital status) and missing covariates, simultaneously, using a large longitudinal cohort of patients from Haiti who were treated for HIV infection, and compare the results with MICE to the more traditional analytic methods of using complete cases and inverse probability weights. We also evaluated associations that were predictive of death using three different methods: complete case, inverse probability weights and multiple imputation with chained equations.

Statistical methods for handling missing vital status

In the HIV literature, for studies assessing predictors of mortality/survival, the most common methods of dealing with LTF are complete case analysis, survival models that censor those LTF, and tracing with inverse probability weights [10, 15,16,17,18,19,20,21,22]. But there are other methods, including simple imputation, multiple imputation, and Bayesian analysis [15]. Each method has different underlying assumptions about the missing data.

Complete case analysis

Complete case analysis omits observations with missing data in multivariable analyses. It is the default behavior of most statistical software programs, which apply it automatically. Because only complete observations are used, sample size is decreased, statistical power is compromised, and study results are often biased [10, 16].
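As a minimal illustration of the case-wise deletion described above, the following Python sketch drops every record that has any missing field. The records and field names are hypothetical and are not taken from the study data.

```python
# Hypothetical patient records; None marks a missing value.
records = [
    {"id": 1, "cd4": 120, "weight": 55.0, "dead": 1},
    {"id": 2, "cd4": None, "weight": 61.5, "dead": 0},    # missing covariate
    {"id": 3, "cd4": 210, "weight": 48.0, "dead": None},  # LTF: missing outcome
    {"id": 4, "cd4": 95, "weight": None, "dead": 1},      # missing covariate
    {"id": 5, "cd4": 300, "weight": 70.2, "dead": 0},
]


def complete_cases(rows):
    """Keep only rows in which no field is missing."""
    return [r for r in rows if all(v is not None for v in r.values())]


cc = complete_cases(records)
n_dropped = len(records) - len(cc)
print(f"{len(cc)} of {len(records)} records retained; {n_dropped} dropped")
```

Even with only one missing field per record, more than half of this toy sample is lost, which is the loss of sample size and power noted above.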

Kaplan Meier survival analysis

Kaplan Meier analysis assumes that lost to follow up is unrelated to mortality. To state this another way, patients who are censored due to LTF have the same probability of survival as those who are not lost to follow up [23]. However, one cannot verify the Kaplan Meier assumption without more information. From the extant literature, studies have traced patients who are categorized as lost and found that between 12 and 87% were dead [24]. With this wide range in mind, it is impossible to say if LTF is associated with higher mortality, lower mortality, or if there is no association. Employing this method, patients who are LTF are censored at a time point typically defined by the date when vital status was last verified. It is often used for analyzing HIV cohort data because all cases can be included, at least for the duration that they were followed before being lost.
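The censoring mechanism described above can be sketched with a small product-limit (Kaplan Meier) estimator in Python. This is an illustrative sketch under the usual convention that deaths at a tied time are counted before censoring; the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored (e.g. LTF)
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        # Censored patients leave the risk set without changing survival.
        at_risk -= removed
    return curve


# Hypothetical follow-up in years; event 0 means censored (LTF or end of study).
times = [1, 2, 2, 3, 5, 6, 8, 10, 10, 10]
events = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"year {t}: estimated survival {s:.3f}")
```

Censored patients contribute to the risk set only until their last verified contact, so the estimate is unbiased only if their later mortality matches that of patients who remained under observation.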

Inverse probability weights from tracing

Inverse probability weights (IPW) offer another general method for dealing with missing data [17,18,19,20,21, 25]. In the HIV literature, they are often used in conjunction with tracing data. This approach involves using physical or contact tracing to determine the true vital status among a sample of those LTF [20,21,22, 25]. Then, assuming this sample is representative of all LTF, tracing data is used to apply weights to the subjects with no missing outcome data, so that the weighted analysis provides less biased results, compared to the biased results when using (unweighted) complete cases. The results of the tracing are used to calculate the inverse probability of being a complete case (given the unique set of patient characteristics, including predictors and outcomes), which is used to weight each of the complete cases [20,21,22, 25]. This method assumes that those who are unsuccessfully traced have a mortality that can be accurately estimated from those successfully traced.

For example, consider a simple analysis to assess whether gender predicts mortality. Among 100 women, 50 are documented dead and 50 are documented alive; among 100 men, 20 are documented dead, 20 are documented alive, and 60 are LTF. A “complete case” analysis suggests that men and women have the same risk of dying (RR = 1), since 50% of the men with documented vital status died and 50% of the women died. However, suppose all 60 of the men LTF were successfully traced and found to be dead. For women who died, all were complete cases, so the IPW is the inverse of the probability of being a complete case, or 1/1.0, or 1. For women who did not die, all were also complete cases, so the IPW is also 1/1.0. For men who were alive, all were complete cases, so their IPW is also 1/1.0. But for men who died (n = 80: 20 complete case deaths and 60 traced deaths), the probability of being a complete case was 20/(20 + 60) = 0.25, and therefore the IPW is 1/0.25, or 4. If we apply these weights in an IPW analysis, giving complete case men who died 4 times the weight of any other complete case, then the average mortality among men is 20 × 4/(20 + 20 × 4) = 80%, and the risk of dying among men compared to women is 80/50 = 1.6.

Note: If only a fraction (f) of the LTF get traced, then each of the traced cases is weighted by the inverse probability of being traced, that is, by 1/f.
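The arithmetic of the worked example can be reproduced with a short Python sketch. The counts are those of the example (100 women, 100 men, with all 60 men LTF traced and found dead); the dictionary layout and function names are illustrative choices only.

```python
# Documented (complete case) counts from the worked example.
documented = {
    ("women", "dead"): 50, ("women", "alive"): 50,
    ("men", "dead"): 20, ("men", "alive"): 20,
}
# Tracing result assumed in the example: all 60 men LTF were found to be dead.
traced = {("men", "dead"): 60}


def ipw(group):
    """Inverse of the probability of being a complete case within a group."""
    n_complete = documented[group]
    n_total = n_complete + traced.get(group, 0)
    return n_total / n_complete


def weighted_mortality(sex):
    """Mortality among complete cases after applying the weights."""
    dead = documented[(sex, "dead")] * ipw((sex, "dead"))
    alive = documented[(sex, "alive")] * ipw((sex, "alive"))
    return dead / (dead + alive)


weight_men_dead = ipw(("men", "dead"))
risk_men = weighted_mortality("men")
risk_women = weighted_mortality("women")
risk_ratio = risk_men / risk_women
print(weight_men_dead, risk_men, risk_women, round(risk_ratio, 2))
```

Only the complete cases enter the weighted analysis; the traced patients are used solely to determine how much weight each complete case should carry.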

However, the performance of the IPW model is dependent on the methods used to track patients. In resource-limited settings, tracing is difficult, costly, and often unsuccessful. In our case study, Haiti does not have a unique national identification number for its citizens, making it difficult to track patients across various health systems or to verify vital status by referencing a current national death registry [3].

Multiple imputation with chained equations (MICE)

Multiple Imputation with Chained Equations (MICE) is a less commonly used method for estimating the vital status of those LTF. Although MICE is commonly used to impute missing covariate (predictor) data, [10, 26, 27] it can also be used to impute missing outcome data [26, 27]. MICE is optimal when less than 30% of a variable’s data are missing and when subjects with missing data are only randomly different (“missing at random”) from those subjects who share an identical set of patient characteristics, or covariate values [28,29,30,31]. However, to our knowledge, no articles in the extant HIV literature have reported results after imputing both the outcome and covariates simultaneously.

The aim of this analysis is to present the application of MICE to impute both missing outcome (vital status) and missing covariates, simultaneously, using a large longitudinal cohort of patients from Haiti who were treated for HIV infection, and compare the results with MICE to the more traditional methods of using complete cases, survival analysis and inverse probability weights. Specifically, we compare adjusted logistic regression models for factors associated with death using complete case, IPW and MICE.


Study population

The study population is a cohort of 910 individuals age 13 years or older who initiated antiretroviral therapy (ART) for HIV according to international guidelines between March 2003 and April 2004 in Haiti [32, 33]. The cohort was followed for ten years through 2015. Details of this cohort are described in previous publications [32, 33].

Clinical measurements and outcomes

Clinical characteristics available from routinely documented data included body weight, CD4+ cell count (CD4), WHO stage, and diagnosis of tuberculosis. Sociodemographic data included age, sex, severe poverty, and residence within the city of Port au Prince. Severe poverty was defined as living on less than one United States dollar per day. Date of death and transfer were documented in the medical record. Lost to follow-up was defined as no documented death or transfer and no clinical visit or pharmacy pick-up during the last 180 days of the 10-year follow-up. Patients who were classified as LTF were traced by clinic staff at the time of their 10-year anniversary to ascertain vital status.

Missing data

The frequency of missing data at baseline was 3% for weight, 12% for CD4 count, and 12% for vital status at 10 years of follow-up. The 71 subjects who were documented to have transferred their care to another clinic (8%) were assumed to be alive at 10 years.

Multiple imputation with chained equations (MICE)

Data were assumed to be missing at random; i.e. considered only randomly different from other subjects that share the same pattern of values for the non-missing variables. MICE was used to impute all missing values, whether for missing covariates, such as CD4 count and weight, or for missing vital status (LTF) at 10 years of follow-up. We used Stata’s implementation of MICE, which allows the imputation of various types of variables (categorical, ordinal, or continuous) in chained equations using a semi-Bayesian approach [30]. In this study, CD4 and baseline weight were continuous variables and vital status was a dichotomous variable. Results from multivariable fractional polynomial models on complete case data indicated that CD4 is best represented as a cubic function and baseline weight is best represented as a squared function. These transformations were included in the multiple imputation model. Equations were created to impute missing values and were composed of all variables used in the fully adjusted models [30]. Predictive mean matching using 5 nearest neighbors was used to impute CD4 and baseline weight [34,35,36,37]. Twenty imputations were computed based on current guidelines in the literature [30]. Various diagnostic measures were performed to check the fitness of the generated datasets. Specifically, proportions were calculated to assess imputed values of categorical variables and continuous variables were assessed using trace plots [38]. The Stata command midiagplots was used to assess the imputed datasets [38].
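To make the idea of predictive mean matching with 5 nearest neighbors concrete, the following Python sketch imputes one continuous variable from one predictor. It is a deliberately simplified illustration, not the Stata procedure used in the study: it fits a single least-squares line, does not draw regression parameters from their posterior distribution, and does not cycle through several incomplete variables as chained equations do. All data values are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def pmm_impute(x, y, k=5, rng=None):
    """Fill missing y values (None) by predictive mean matching on x."""
    rng = rng or random.Random(0)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_line([o[0] for o in obs], [o[1] for o in obs])
    # Predicted value paired with the observed value for every donor candidate.
    pred_obs = [(a + b * xi, yi) for xi, yi in obs]
    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)
            continue
        target = a + b * xi
        # Donors: the k observed cases whose predicted value is closest.
        donors = sorted(pred_obs, key=lambda p: abs(p[0] - target))[:k]
        # Impute the observed value of a randomly chosen donor.
        filled.append(rng.choice(donors)[1])
    return filled


# Hypothetical data: baseline weight (kg) and CD4 count; None marks missing CD4.
weight = [45, 50, 52, 55, 58, 60, 63, 65, 70, 72]
cd4 = [80, 110, None, 130, 150, 160, None, 190, 220, 240]

# Twenty imputed versions of the CD4 variable, one per random seed.
imputations = [pmm_impute(weight, cd4, k=5, rng=random.Random(m)) for m in range(20)]
```

Because each imputed value is borrowed from an observed donor, imputed CD4 counts always fall within the range of values actually seen in the data, which is one reason the method is often preferred for skewed clinical variables.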

Classification and regression trees (CART)

Classification and regression trees were utilized to ascertain if any interaction should be incorporated into the multiple imputation [39]. Classification trees, in contrast to traditional statistical models, are especially useful for assessing for interactions when there are significant amounts of missing data [40]. After building the tree and pruning it using the R command cptable, no statistically significant interactions were found [41].

Statistical analysis

Kaplan Meier

Survival estimates were calculated using Kaplan Meier analyses and a Kaplan Meier curve was generated. Time from enrollment to death or end of study censor (ten years after enrollment with a maximum date of June 26, 2014) was calculated. Participants who were classified as LTF were censored at their last visit to the clinic.

Inverse probability weights from tracing

In September 2013, staff attempted to contact all 156 patients who were classified as LTF, using telephone and home visits. Results of this tracing method were used to create inverse probability weights (IPW) that were applied to cases with similar covariates and known vital status.

Multiple imputation with chained equations

The mi suite of commands in Stata was used to perform analyses on the multiply imputed datasets. Stata’s mi suite follows Rubin’s rules for combining results across imputed datasets [42].
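Rubin’s rules for combining a single coefficient across imputed datasets can be sketched as follows. The function is a minimal Python illustration, not Stata’s implementation, and the log odds ratios and standard errors are hypothetical values chosen for the example.

```python
import math


def pool_rubin(estimates, variances):
    """Combine one parameter across M imputed datasets using Rubin's rules.

    estimates : point estimate from each imputed dataset (e.g. log odds ratio)
    variances : squared standard error from each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled estimate
    w = sum(variances) / m                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = w + (1 + 1 / m) * b                                  # total variance
    return q_bar, math.sqrt(t)


# Hypothetical log odds ratios and standard errors from 5 imputed datasets.
log_ors = [0.55, 0.60, 0.58, 0.62, 0.57]
ses = [0.17, 0.18, 0.17, 0.18, 0.17]

pooled_log_or, pooled_se = pool_rubin(log_ors, [s ** 2 for s in ses])
pooled_or = math.exp(pooled_log_or)
print(f"pooled OR = {pooled_or:.2f}, pooled SE (log scale) = {pooled_se:.3f}")
```

The between-imputation term is what carries the extra uncertainty due to missing data into the pooled standard error; confidence intervals are then usually built on a t distribution with Rubin’s degrees of freedom, which this sketch omits.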

Logistic regression

For each predictor (covariate), logistic regression models were created to calculate odds ratios and 95% confidence intervals for being dead after 10 years of follow-up (univariable models). Additionally, we created multivariable (fully adjusted) models that included all clinical and sociodemographic variables. Although age, weight, and CD4 count were measured as continuous variables, when reporting the results of the logistic regression models, we describe the effects of a 10-year age difference, 10-kg weight difference, and 100-cell difference in CD4 count.
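Reporting the effect of a 10-unit or 100-unit difference for a continuous predictor amounts to multiplying the per-unit log odds coefficient by that number of units before exponentiating. A minimal Python sketch, using a hypothetical per-year odds ratio rather than any estimate from this study:

```python
import math


def rescale_or(or_per_unit, units):
    """Odds ratio for a difference of `units`, given the odds ratio per 1 unit."""
    beta = math.log(or_per_unit)      # per-unit log odds coefficient
    return math.exp(units * beta)


# Hypothetical example: an odds ratio of 1.02 per year of age.
or_per_year = 1.02
or_per_decade = rescale_or(or_per_year, 10)
print(f"OR per 10 years = {or_per_decade:.3f}")
```

The same rescaling applies to the lower and upper confidence limits, so statistical significance is unchanged by the choice of unit.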

Sensitivity analysis — Multiple imputation then deletion

As a sensitivity analysis, we performed multiple imputation of all missing data, followed by deletion of all cases with missing outcomes. In this method, both the outcome and covariates are imputed; after the imputed datasets are created, observations in which the outcome was imputed are deleted before the same univariable and multivariable models are run [43]. This method has been reported to lead to more efficient estimates and narrower confidence intervals [43].
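The deletion step of this sensitivity analysis can be sketched in a few lines of Python. The sketch assumes the imputed datasets already exist and that a flag records which patients originally had a missing outcome; the data layout and values are hypothetical.

```python
def impute_then_delete(imputed_datasets, outcome_was_missing):
    """Drop, from every imputed dataset, rows whose outcome was imputed.

    Imputed covariate values in the remaining rows are kept.
    """
    return [
        [row for row, missing in zip(dataset, outcome_was_missing) if not missing]
        for dataset in imputed_datasets
    ]


# Hypothetical example: 2 imputed datasets of 4 patients; rows are (cd4, dead).
outcome_was_missing = [False, True, False, False]  # patient 2 was LTF
imputed_datasets = [
    [(120, 1), (200, 0), (95, 1), (300, 0)],
    [(120, 1), (200, 1), (110, 1), (300, 0)],  # patient 3 had an imputed CD4
]

analysis_sets = impute_then_delete(imputed_datasets, outcome_was_missing)
print([len(d) for d in analysis_sets])
```

Patients with imputed covariates but an observed outcome remain in the analysis, so the approach still recovers the cases that a complete case analysis would discard for missing covariates alone.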

All analyses were performed using Stata version 13 and R version 3.4.2 (Additional file 1).

Ethics and consent to participate

The institutional review boards at GHESKIO and at Weill Medical College of Cornell University approved this analysis.


Outcome tracing

Among the 156 patients who were categorized as LTF, the clinical team was able to trace and find 45 (29%). Of the 45 patients successfully traced, 37 (82%) were found to be alive and 8 (18%) had died prior to 10 years of follow-up. Based on the 18% risk of death among those successfully traced, we assume that 18% of the 156 LTF (n = 28) were dead at 10 years and the remainder were alive. Since the probability of being known alive at 10 years among all patients who were actually alive (known alive plus number estimated to be alive among LTF by the tracing method) is 0.79, then the IPW for all those subjects who are known alive is 1/0.79.
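The tracing arithmetic above can be retraced in Python. The counts 156, 45 and 8 are taken from the text. The count of 482 patients known to be alive is an assumption made for this sketch: it is the number reported as alive and engaged in care at 10 years, and it is used here only because it reproduces the stated probability of 0.79; the text does not say which count was actually used.

```python
n_ltf = 156          # patients categorized as LTF
n_traced = 45        # successfully traced
n_traced_dead = 8    # traced and found to have died

p_dead_traced = n_traced_dead / n_traced      # about 0.18
est_dead_ltf = round(p_dead_traced * n_ltf)   # estimated deaths among all LTF
est_alive_ltf = n_ltf - est_dead_ltf          # estimated alive among all LTF

# ASSUMPTION (not stated in the text): 482 patients were known to be alive.
known_alive = 482
p_known_alive = known_alive / (known_alive + est_alive_ltf)
weight_known_alive = 1 / p_known_alive

print(f"risk of death among traced: {p_dead_traced:.2f}")
print(f"estimated deaths among LTF: {est_dead_ltf}")
print(f"P(known alive | alive) = {p_known_alive:.2f}, weight = {weight_known_alive:.2f}")
```

The result depends entirely on the traced 29% being representative of all patients LTF, which is the key assumption of the tracing-based approach.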

Missing data/ diagnostics of the multiple imputation

Convergence was achieved when MICE was performed. To assess the results of the multiple imputation, kernel density and trace plots were constructed. The kernel density plots for the imputed values of CD4 and weight are shown in Fig. 1 for the first 5 imputed datasets. The means and interquartile ranges for CD4 and weight are similar to the observed non-missing observations in the dataset (Table 1). Figure 2 displays the trace plots for the twenty imputed datasets. These plots show no discernable pattern, which is the result expected of a well-executed multiple imputation.

Fig. 1

Kernel Density plots for imputed CD4 and Weight for first five imputed datasets. Panel a shows the plots specific to the imputation of the CD4 variable. Panel b shows the plots specific to the imputation of the Weight variable

Table 1 Comparison of CD4+ and weight using Complete Case, Imputation and Imputation then Deletion

Fig. 2

Trace Plots of imputed data across 20 imputed datasets

Probability of death after 10 years of follow-up using Kaplan Meier, IPW and MICE: A comparison

At 10 years, and accounting for the tracing efforts described above, 53% of patients (N = 482) were alive and engaged in care, 27% (N = 246) were confirmed dead, 12% (N = 111) were LTF, and 8% (N = 71) had transferred to another clinic for care. Survival was ascertained to be 71% (95% CI: 68–74%) by Kaplan Meier, 63% (95% CI: 59–67%) by IPW, and 67% (95% CI: 64–71%) by MICE (Fig. 3) [44].

Fig. 3

Survival estimates from Kaplan-Meier, IPW, and MICE

Predictors of death using complete case, IPW and MICE: A comparison

The weighted sample used for the IPW analysis had 111 fewer observations (N = 799) than the MICE dataset, which included all observations (N = 910), because any subject with missing covariate data was dropped. The complete case model had the fewest observations (N = 735) because any case with any missing value was dropped from the analysis.

Table 2 displays the logistic regression results for each individual predictor of death using three types of models: complete case (CC), inverse probability weighting (IPW) and multiple imputation with chained equations (MICE). Severe poverty was statistically significant across all models and the odds ratio had an approximate 20% difference between CC and IPW (CC OR = 1.78, IPW OR = 1.59, MICE OR = 1.74). WHO stage and baseline weight were statistically significant across all models and had similar odds ratios from the three methods (Table 2). CD4 had a similar point estimate across all 3 models (CC OR = 0.86, IPW OR = 0.86, MICE OR = 0.85). However, the point estimate was not statistically significant in the CC model. Age was slightly different across all three models (CC OR = 1.17, IPW OR = 1.26, MICE OR = 1.20). Similar to CD4, age was not statistically significant for CC. Baseline tuberculosis was statistically significant across all models and had a slight variation in the point estimates (CC OR = 1.97, IPW OR = 1.92, MICE OR = 1.98). Gender and residence were not statistically significant in any model.

Table 2 Comparing Predictors of Death using Inverse Probability Weighting from Tracing and Imputation

Although the beta coefficients in univariable (single predictor) analysis have similar point estimates regardless of method, differences are seen among the point estimates in multivariable models. Table 3 displays results from multivariable logistic regression models using the three methods. Severe poverty was statistically significant across all models and the odds ratio had an approximate 10% difference between CC and MICE (CC OR = 1.63, IPW OR = 1.64, MICE OR = 1.80). Similarly, WHO stage was statistically significant across all models and had an approximate 15% difference between the odds ratios from the CC models (OR = 1.50) compared to the MICE model (OR = 1.76). Age and baseline weight were statistically significant across all the models with a slight variation in the point estimates and 95% confidence intervals. Female gender was found to be protective for death across all three models; however, it was statistically significant only in the MICE model (OR 0.62; 95% CI: 0.44–0.87) and there was about a 10% difference between the IPW and the MICE models’ odds ratios. Baseline tuberculosis infection was associated with a higher odds of death across the three models; however, it was statistically significant only in the complete case model (OR 1.83; 95% CI: 1.05–3.20). Additionally, there was an approximate 20% difference between the odds ratios of the CC and the MICE models for baseline tuberculosis infection. Port au Prince residence and CD4 were not statistically significant in any of the three models.

Table 3 Comparing Predictors of Death using Inverse Probability Weighting and Imputation in adjusted models

Sensitivity analysis

Results from the sensitivity analysis were very similar to the results from the MICE models for univariable and multivariable models. For severe poverty and baseline weight, with the multivariable model only, the 95% confidence intervals were slightly narrower for the MICE with deletion compared to the MICE models without deleting cases with imputed outcomes (Table 2).


Among the first cohort of HIV patients who initiated antiretroviral therapy in Haiti from 2003 to 2014, we aimed to find associations that were predictive of death using three different methods: complete case, inverse probability weights and multiple imputation with chained equations. These three procedures have different assumptions and differed in the number of observations included in the adjusted model due to how missing values for covariates were addressed. Although the point estimates were similar across the three models, for statistically significant factors we found as much as a 20% difference in odds ratio values. For statistically significant factors, such as severe poverty and WHO stage, the odds ratios in the MICE models were farther away from the null compared to the CC and IPW models. Severe poverty was a statistically significant predictor of death in the MICE model (OR 1.80; 95% CI: 1.28–2.52). In a similar cohort from the same clinic in Haiti, income was associated with a higher odds of attrition (OR 1.65; 95% CI: 1.25–2.19) [45]. Additionally, these estimates are similar to those from an intensive contact tracing program performed in Malawi on HIV positive patients, which found about 70% of people who were initially categorized as LTF were alive and 30% were dead [13].

Worldwide, LTF rates for patients who have been on ART for at least one year range from 5 to 53% [1, 3,4,5,6,7,8,9,10]. Patient characteristics associated with becoming LTF include being clinically ill, as measured by CD4 count or WHO symptom staging, low socioeconomic status, and concern for stigma, as well as structural factors such as transportation issues [3, 7,8,9,10, 45,46,47]. Several studies have reported high rates of re-engagement in care by patients who were previously labeled as LTF [3, 4, 7, 8, 11, 45]. A study in South Africa found that up to 50% of patients who disengaged from care will re-engage within 3 years, including care received at a hospital or emergency department visit [7]. In the small number of contemporary studies that were able to determine the true status of LTF patients, most patients had transferred care to clinics closer to their home or to newer clinics that provide different services or, alternatively, were alive and not engaged in care [3, 4, 7, 11, 45]. Forster et al. found a strong correlation between clinic LTF rates and rates of missing data for patient characteristics [1]. Ideally, a formal tracking system that “follows” patients when they receive care at other institutions would be an optimal way to track silent transfers; however, this is still in development in most countries [3, 4, 7, 10, 48]. Given these findings that most LTF patients are actually alive, our method of imputing LTF status and missing covariates at the same time is a cost-effective way to estimate true mortality and to study risk factors for HIV.

Each of the methods described in this article has different assumptions for LTF, as well as limitations and strengths (Table 4). For complete case analysis, the loss of statistical power from automatically excluding observations that have missing information is a concern for many researchers [15, 29]. This automatic exclusion leaves room for bias depending on the types and patterns of missingness [28, 29]. Many HIV studies have found that the underlying assumption that LTF is unrelated to mortality is incorrect, and that survival estimates and associations with death based on it are therefore biased [17, 21, 25, 49]. Clinicians report that patients who were LTF in the early 2000s were often later found to be dead, whereas LTF participants in more contemporary cohorts are more likely to be alive [13, 22, 25, 50, 51].

Table 4 Assumptions, Limitations, Strengths and Biases between different methods of analysis

IPW from tracing data has several limitations. IPW from tracing techniques assume that the traced participants are a representative sample of all LTF. With this assumption in mind, a random sample of LTF participants is selected for tracing [13, 20, 21, 52,53,54,55]. In this cohort, tracing was attempted on all participants who were LTF and was performed with telephone and in-person follow up. Additionally, in this cohort, tracing was done at the end of the 10 year follow-up period, and those who were more recently lost were more likely to be found compared to those lost at the beginning of the follow-up period. Another limitation, inherent in most IPW analyses, is the non-inclusion of several observations because of automatic case-wise deletion by the analysis software due to missing data. With this in mind, estimates might be biased and a loss of statistical power might occur when utilizing this method [22, 56].

Unlike IPW, MICE is able to use all the observations in a dataset by imputing the missing values, resulting in robust results. However, it too has assumptions and is prone to limitations. One major assumption is that the risk of death among patients who are LTF is constant over time. This may not be the case as mortality is known to be highest in early periods after ART initiation and decreases over time [33, 34, 45, 57]. Additionally, MICE relies on a good prediction model and requires data to be missing at random (MAR) [29, 31]. Although MAR is difficult to ascertain, recent publications have explored the application of MICE in non-MAR situations and found that a small amount of bias might be present in the results. However, compared to the other methods, the small amount of bias that might be present is offset by the gains of using all observations present in the dataset and the robust standard errors calculated by the procedure [29, 30, 58]. Several studies have incorporated MICE as a method to estimate associations due to attrition or lost to follow up in longitudinal studies [59, 60]. Regardless of the method used, one must diligently explore patterns of missingness before performing any analyses [10, 25, 28,29,30,31]. We believe that, despite some limitations with MICE, the benefits of using all available data and the subsequent calculation of robust standard errors outweigh the limitations. Therefore, the approach of imputing both the outcome and covariates seems better than more traditional methods.

Although we describe a statistical approach to approximating survival rates, implementation research is needed to determine the effectiveness and scalability of interventions to keep patients engaged in care and to return them into care [3, 44, 45, 48]. HIV programs should consider including sensitivity analyses or other methods for estimating the vital status among those categorized as lost, as traditional methods, such as CC, IPW, Kaplan Meier and Cox proportional hazards models, do not consider that patients who are lost re-engage in care. The multiple imputation method that we describe in this paper provides an estimate that is closer to the actual outcome rates. Further research is needed to test this method in other countries and HIV programs to see if it provides outcome estimates close to actual rates.


In the last ten years, there has been an increase in the number of journal articles citing multiple imputation as a method used for filling in missing values or as a secondary analysis [53, 61, 62]. MICE might be a cost efficient mathematical alternative that can be employed in resource limited settings such as Haiti to impute outcome status estimates for program evaluation to estimate survival. However, data should be evaluated for patterns of missingness. Currently, MICE is underutilized in public health research—especially of HIV-infected cohorts. Because the benefits of MICE outweigh the potential for erroneous use, we encourage the use of MICE among our HIV research colleagues.


95% CI: 95% Confidence Interval

ART: Antiretroviral Therapy

CC: Complete Case

CD4: CD4+ cells

GHESKIO: Haitian Group for the Study of Kaposi’s Sarcoma and Opportunistic Infections

HIV: Human Immunodeficiency Virus

IPW: Inverse Probability Weight

LTF: Lost to Follow Up

MAR: Missing at Random

MICE: Multiple Imputation with Chained Equations

OR: Odds Ratio

RR: Risk Ratio

WHO: World Health Organization



Acknowledgements

The authors would like to thank the wonderful staff and patients at GHESKIO and Weill Cornell Medical College. Additionally, we would like to thank Kevin J. Pain for his assistance with the literature search.


Funding

Supported by grants from the National Institutes of Health (AI098627, TW009337, and TW010062) and the President’s Emergency Plan for AIDS Relief, Centers for Disease Control and Prevention (GGH000545).

Availability of data and materials

Datasets analyzed during the current study are not publicly available as they contain personalized health information from the GHESKIO clinic. Data are available from the co-authors upon reasonable request.

Author information




DPJ wrote the manuscript and analyzed the data. MU and AE contributed to the data analysis and provided methodological support. MM and SP collected data and provided clinical support and guidance for interpretation of the data. DWF and JP provided funding, acquired regulatory approval for the project, provided access to the dataset, and assisted with programmatic support and clinical interpretation of the data. All authors (DPJ, MU, AE, MM, SP, DWF, and JP) reviewed and edited the manuscript.

Corresponding author

Correspondence to Deanna P. Jannat-Khah.

Ethics declarations

Ethics approval and consent to participate

The institutional review boards at GHESKIO and at Weill Medical College of Cornell University approved this analysis.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Data analysis using R. A supplementary file that describes how to download the free statistical software R and RStudio. It also lists the R packages used for this analysis and websites that can be consulted for help with R. (DOCX 12 kb)

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Jannat-Khah, D.P., Unterbrink, M., McNairy, M. et al. Treating loss-to-follow-up as a missing data problem: a case study using a longitudinal cohort of HIV-infected patients in Haiti. BMC Public Health 18, 1269 (2018).


Keywords

  • HIV
  • AIDS
  • Kaplan Meier
  • Complete case
  • Multiple imputation
  • Inverse probability weights