
Machine learning algorithms using national registry data to predict loss to follow-up during tuberculosis treatment



Identifying patients at increased risk of loss to follow-up (LTFU) is key to developing strategies to optimize the clinical management of tuberculosis (TB). The use of national registry data in prediction models may be a useful tool to inform healthcare workers about risk of LTFU. Here we developed a score to predict the risk of LTFU during anti-TB treatment (ATT) in a nationwide cohort of cases using clinical data reported to the Brazilian Notifiable Disease Information System (SINAN).


We performed a retrospective study of all TB cases reported to SINAN between 2015 and 2022, excluding children (< 18 years old), vulnerable groups, and drug-resistant TB cases. The score used only data available before treatment initiation. We trained and internally validated three prediction scoring systems based on Logistic Regression, Random Forest, and Light Gradient Boosting. Before fitting the models, we split the data into training (~ 80%) and test (~ 20%) sets and then compared model metrics on the test set.


Of the 243,726 cases included, 41,373 experienced LTFU, whereas 202,353 were successfully treated. The groups differed in several clinical and sociodemographic characteristics. Directly observed treatment (DOT) was unevenly distributed between the groups, with lower prevalence among those who were LTFU. Three models were developed to predict LTFU using 8 features (prior TB, drug use, alcohol use, tobacco use, age, sex, HIV infection, and schooling level) with different score composition approaches. These prediction scoring systems exhibited an area under the curve (AUC) ranging between 0.71 and 0.72. The Light Gradient Boosting technique gave the best prediction performance when weighing specificity and sensitivity. A user-friendly web calculator app was developed to facilitate implementation.


Our nationwide risk score predicts the risk of LTFU during ATT in Brazilian adults prior to treatment commencement utilizing schooling level, sex, age, prior TB status, and substance use (drug, alcohol, and/or tobacco). This is a potential tool to assist in decision-making strategies to guide resource allocation, DOT indications, and improve TB treatment adherence.



Despite the widespread availability of curative treatment of tuberculosis (TB), this disease remains a major plague of humanity, accounting for more than one million deaths annually [1]. Global treatment success is still below the targets established by the World Health Organization (WHO) [2, 3], especially in low- and middle-income countries (LMIC) such as Brazil [4].

Current WHO treatment recommendations for drug-susceptible TB include six months of a combination of antibiotics [3]. Such long treatment is associated with an increased risk of loss to follow-up (LTFU) and may lead to adverse drug reactions [2]. Early identification, at the moment of diagnosis, of patients at high risk of LTFU based on clinical and sociodemographic characteristics is key to providing personalized care, which may involve directly observed treatment (DOT), and to supporting decision-making strategies that mitigate losses in the cascade of care. Notably, the Brazilian Ministry of Health recommends DOT for all TB cases, yet fewer than 50% of reported cases actually receive DOT. Reliable and accurate prediction tools [4] are therefore necessary, especially where limited resources require prioritization of intensive case management in settings with a high-middle TB disease burden.

Brazil is among the countries with the highest number of TB cases in the world, despite the fact that it follows the WHO’s standardized TB treatment recommendations. Importantly, the cascade of care in Brazil for drug-sensitive TB is composed of 3 steps: (1) mandatory reporting of TB cases to the Notifiable Diseases Information System (SINAN) [5, 6]; (2) a six-month treatment regimen, usually in fixed-dose combination (FDC) [7]; and (3) treatment-associated outcomes are reported in the SINAN database. Thus, this is a significant source of data that could be explored to develop prediction models for LTFU during anti-TB treatment (ATT).

Therefore, we aimed to develop a web-based prediction model for LTFU among pulmonary TB treatment cases in Brazil at the baseline consultation, utilizing secondary data elements readily available at diagnosis. Importantly, we developed a model that could be used by both the Brazilian government and clinicians as a readily available web-based tool for decision-making to achieve higher rates of TB treatment success.

Materials and methods

Ethics statement

All data accessed in this study were obtained from a publicly available platform and pre-processed by the Brazilian Ministry of Health. This processing verified the data for consistency, duplicate registration, and completeness, following the instructions set by Resolution Number 466/12 on Research Ethics of the National Health Council, Brazil. There was no identifiable information in the databases and thus the study was exempt from approval by ethics committees.

Study population

We performed a retrospective analysis of de-identified data from pulmonary TB cases reported to the Brazilian Notifiable Diseases Information System (SINAN).

SINAN is a centralized system for the notification of transmissible diseases, including TB. Data stored in SINAN are maintained by the Brazilian Ministry of Health, specifically by DATASUS (the Information Technology Department of the Brazilian Unified Health System), and can be accessed through a file transfer protocol [6].

We included all individuals aged 18 years or older who were notified in SINAN with pulmonary TB from 2015 to 2022. We excluded any patient who: (i) was diagnosed with TB postmortem; (ii) belonged to a special population (i.e., homeless people, people deprived of liberty, pregnant women, immigrants, and health workers); (iii) had TB resistant to any drug (rifampin, isoniazid, pyrazinamide, or ethambutol); (iv) had an outcome other than cure or LTFU; or (v) had PTB together with ≥ 1 EPTB site (Fig. 1). Vulnerable populations were removed because they present a different pattern of risk of illness and LTFU than the general population.
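The inclusion and exclusion rules above can be expressed as a single eligibility predicate. The following is a minimal sketch in plain Python; the record type and every field name are hypothetical and do not correspond to actual SINAN column names.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    # All field names are hypothetical; real SINAN exports use different columns.
    age: int
    postmortem_diagnosis: bool
    special_population: bool   # homeless, deprived of liberty, pregnant, immigrant, health worker
    drug_resistant: bool       # resistant to any of the listed drugs
    outcome: str               # e.g. "cure", "ltfu", "death", "transfer"
    extrapulmonary_sites: int  # number of EPTB sites reported alongside PTB


def is_eligible(n: Notification) -> bool:
    """Apply the inclusion and exclusion criteria described in the text."""
    return (
        n.age >= 18
        and not n.postmortem_diagnosis
        and not n.special_population
        and not n.drug_resistant
        and n.outcome in {"cure", "ltfu"}
        and n.extrapulmonary_sites == 0
    )


records = [
    Notification(34, False, False, False, "cure", 0),
    Notification(16, False, False, False, "cure", 0),   # under 18
    Notification(45, False, True, False, "ltfu", 0),    # special population
    Notification(52, False, False, False, "death", 0),  # outcome other than cure/LTFU
    Notification(29, False, False, False, "ltfu", 0),
    Notification(61, False, False, False, "cure", 1),   # PTB plus one EPTB site
]
cohort = [n for n in records if is_eligible(n)]
print(len(cohort), "of", len(records), "notifications retained")
```

Keeping the criteria in one function makes it straightforward to count how many notifications each rule removes, which is the information a flowchart such as Fig. 1 reports.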

Fig. 1

Flowchart demonstrating the study population. Abbreviations: TB: tuberculosis; LTFU: loss to follow-up; EPTB: extrapulmonary tuberculosis; PTB: pulmonary tuberculosis; HCU: health care unit.

Variable definitions

The age variable was categorized using the following bins: children/teenagers (0, 18], young adults (18, 35], adults (35, 50], senior adults (50, 65], and elderly (> 80 years old). The remaining variables were defined as follows. Biological sex: female or male; HIV infection: presence of an HIV diagnosis (self-reported); alcohol consumption: ever use of alcohol; tobacco use: ever smoking tobacco; drug use: ever use of drugs (including marijuana, cocaine, heroin, or crack); race: self-reported race/ethnicity, subdivided into Non-White (including “Yellow”, “Black”, “Pardo”, which defines mixed-race ancestry in Latin America [European, Indigenous, and African], and Indigenous) and White; DOT: implementation of directly observed therapy; schooling: self-reported years of schooling; abnormal chest X-ray: thoracic radiographic result indicative of TB; TB drug susceptibility test: susceptible to all first-line drugs or resistant to any drug; smear grade: positive, negative, or not performed; prior TB: patient-reported history of TB treatment. Comorbidities such as diabetes and mental illness were classified according to their presence or absence at the time of TB diagnosis (self-reported). This stratification followed the criteria adopted by the Brazilian Ministry of Health to report TB data [8].

Data analyses

We divided the data analysis into seven steps: (i) descriptive analyses, (ii) undersampling, (iii) data splitting, (iv) feature elimination, (v) hyperparameter tuning, (vi) model evaluation, and (vii) model building. For the descriptive analysis, we used the median and interquartile range (IQR) to describe continuous variables and absolute and relative frequencies to describe categorical variables. Because our data were imbalanced (i.e., ~3 cures for 1 LTFU), we undersampled the most frequent class [9]. The resulting data set therefore had equal proportions of each outcome (i.e., 1 cure for 1 LTFU), and we then split it into training and test sets [10]. The training set comprised 70% of the total data, whereas 30% was kept for model evaluation. To reduce data dimensionality, we used Recursive Feature Elimination with Cross-Validation (RFECV) [11]. We selected Random Forest (RF) as the estimator and used it in a 10-fold stratified cross-validation, then selected the minimum number of variables that led to the highest model accuracy following the elbow rule. To find the best set of hyperparameters, we used a grid search: for each model (i.e., Logistic Regression, Random Forest, and Light Gradient Boosting [12, 13]) we created a grid of parameters and evaluated every combination on the training set. To select the best algorithm, we applied each model with its best combination of parameters to the test set and evaluated AUC, accuracy, sensitivity, and specificity [14, 15]. To understand feature importance and the contribution of each feature to each outcome at a global and local level, we used Shapley values [16, 17]. The last step consisted of retraining the model using the whole data set. All code is provided in a public GitHub repository.
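The undersampling and splitting steps can be illustrated with a short, dependency-free sketch. It is not the authors' code: the helper names, the random seeds, and the toy class counts are assumptions made for illustration, and the 70/30 split follows the proportion stated in this section.

```python
import random
from collections import Counter


def undersample(labels, seed=0):
    """Return sorted indices keeping every minority case and an equal-sized
    random subset of the majority class (1:1 outcome ratio)."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


def stratified_split(indices, labels, test_fraction=0.30, seed=0):
    """Split indices into train and test sets while preserving the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels[i] for i in indices)):
        members = [i for i in indices if labels[i] == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy outcome vector: 1 = LTFU, 0 = cure (counts are illustrative only).
labels = [1] * 100 + [0] * 300
balanced = undersample(labels)
train_idx, test_idx = stratified_split(balanced, labels)

print(Counter(labels[i] for i in balanced))
print(len(train_idx), "training cases,", len(test_idx), "test cases")
```

Undersampling before splitting, as described here, means the test set is also balanced; metrics such as accuracy should be read with that in mind.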


Results

Characteristics of the overall study population

Between 2015 and 2022, 743,823 TB cases were notified in SINAN. Of these, 243,726 were included in the final study population, with 500,097 (~ 67%) notifications removed according to our exclusion criteria (Fig. 1). The selected population comprised 202,353 cases that experienced cure and 41,373 that experienced LTFU (Fig. 1). At the time of TB diagnosis, the LTFU group was younger (median age: LTFU 37.1 vs. Cure 42.1 years), more often self-identified as non-white (LTFU 72.8% vs. Cure 65.3%), had less schooling (≥ 12 years: LTFU 2.33% vs. Cure 6.38%), and had a higher prevalence of HIV infection (LTFU 13.2% vs. Cure 5.99%) and prior TB (LTFU 32.4% vs. Cure 10.8%). The LTFU group also presented a higher prevalence of all the consumption habits evaluated: alcohol use (LTFU 29.0% vs. Cure 16.1%), tobacco use (LTFU 35.0% vs. Cure 21.9%), and drug use (LTFU 28.6% vs. Cure 9.12%). Interestingly, diabetes was less prevalent in the LTFU group (LTFU 5.67% vs. Cure 10.9%). Notably, DOT was more prevalent in the cure group (LTFU 21.2% vs. Cure 41.4%). All the evaluated characteristics differed significantly between the groups (Table 1).

Table 1 Characteristics of the overall population of the study

Comparing machine learning algorithms to predict LTFU

We initiated our model development with 13 variables, of which 8 were selected as the most informative by our RFECV approach (Fig. 2): (i) schooling, (ii) sex, (iii) prior TB, (iv) HIV infection, (v) alcohol use, (vi) drug use, (vii) tobacco use, and (viii) age. To predict which patients are more likely to experience LTFU, we fitted three models using these variables, each with its own tuned hyperparameters. The logistic regression model performed best with strong regularization (C = 0.01). The RF model performed best with a maximum depth of 8, meaning each decision tree may split down to eight levels, and an ensemble of 500 trees whose outputs are combined into the final prediction. The Light Gradient Boosting model performed best with a maximum tree depth of 4, 500 trees (number of estimators), and a learning rate of 0.01.
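The feature elimination and tuning steps could look roughly like the sketch below, which uses scikit-learn on synthetic data. It is an illustration under stated assumptions rather than the study's pipeline: the data come from `make_classification`, the forest is kept small for speed, only the logistic regression grid is shown, and Light Gradient Boosting is omitted so the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the registry data: 13 candidate features, binary outcome.
X, y = make_classification(n_samples=400, n_features=13, n_informative=8,
                           n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Recursive feature elimination with a Random Forest estimator and
# 10-fold stratified cross-validation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=1, cv=cv, scoring="accuracy", min_features_to_select=1)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Grid search over the regularization strength of a logistic regression,
# evaluated on the training set only.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5, scoring="roc_auc")
grid.fit(X_train_sel, y_train)

print("features kept:", selector.n_features_)
print("best C:", grid.best_params_["C"])
print("test AUC:", round(grid.score(X_test_sel, y_test), 3))
```

Note that `RFECV` keeps the feature count with the best mean cross-validated score, whereas the text describes choosing the smallest count by an elbow rule; reproducing that choice would require inspecting `selector.cv_results_` manually.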

Fig. 2

Recursive feature elimination. The x-axis indicates the number of features used by the model, and the y-axis indicates the AUC achieved during cross-validation

The next phase consisted of evaluating the three models (using the parameters described above) on the test set. In this case, we found that classifiers presented similar results (Supplementary Table S1).
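For reference, the four test-set metrics named in the methods can be computed from predicted probabilities as in the following sketch. The labels and scores are made-up values, and the 0.5 decision threshold is an assumption; the source does not state which threshold was used.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, and specificity at a fixed probability threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # share of LTFU cases flagged
        "specificity": tn / (tn + fp),  # share of cured cases not flagged
    }


def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive case is scored higher
    than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = LTFU, 0 = cure) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

metrics = classification_metrics(y_true, y_score)
auc = roc_auc(y_true, y_score)
print(metrics, auc)
```

Unlike accuracy, sensitivity, and specificity, the AUC does not depend on the threshold, which is why it is convenient for comparing classifiers with similar performance.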

According to our calibration plot, Light Gradient Boosting presented the best result, since its predicted probability of LTFU most closely matched the observed frequency of the positive class (Supplementary Fig. S1). Random Forest presented the worst result: its predicted probabilities underestimated the real likelihood of the positive class. Thus, based on all these results, we decided to use Light Gradient Boosting to construct our predictive model (Fig. 3). We used SHAP values to attribute each model prediction to the contributing features, which offers insight into feature importance and interactions and helps interpret complex models. According to our model, prior TB was the most important feature: a patient with prior TB had an increased likelihood of LTFU. Another important feature was drug use: patients who reported drug use had a higher probability of LTFU during ATT (Fig. 4).
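To make the idea of Shapley-based attribution concrete, the sketch below computes exact Shapley values for a toy three-feature logistic model by averaging marginal contributions over all feature orderings. The weights, intercept, and feature subset are hypothetical and are not the study's fitted values; the study's model is tree-based, and SHAP implementations for trees use a faster algorithm than this brute-force enumeration.

```python
from itertools import permutations
from math import exp

FEATURES = ["prior_tb", "drug_use", "hiv"]
# Hypothetical log-odds weights for a toy risk model; NOT the study's fitted values.
WEIGHTS = {"prior_tb": 1.2, "drug_use": 0.9, "hiv": 0.5}
INTERCEPT = -1.5


def risk(x):
    """Toy logistic model returning a probability of LTFU."""
    z = INTERCEPT + sum(WEIGHTS[f] * x[f] for f in FEATURES)
    return 1.0 / (1.0 + exp(-z))


def shapley_values(x, baseline):
    """Exact Shapley values: the average marginal contribution of each feature
    over all orderings, with absent features held at their baseline value."""
    contrib = {f: 0.0 for f in FEATURES}
    orderings = list(permutations(FEATURES))
    for order in orderings:
        current = dict(baseline)
        previous = risk(current)
        for f in order:
            current[f] = x[f]
            value = risk(current)
            contrib[f] += value - previous
            previous = value
    return {f: total / len(orderings) for f, total in contrib.items()}


patient = {"prior_tb": 1, "drug_use": 1, "hiv": 0}
baseline = {"prior_tb": 0, "drug_use": 0, "hiv": 0}
phi = shapley_values(patient, baseline)

for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
print("sum of contributions:", round(sum(phi.values()), 3))
print("risk difference:", round(risk(patient) - risk(baseline), 3))
```

The contributions add up to the difference between the patient's predicted risk and the baseline risk, which is the property that lets per-feature values be read as a decomposition of a single prediction.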

Fig. 3

Receiver operating characteristic (ROC) curve for prediction of LTFU based on data available in SINAN using three different machine learning algorithms

Fig. 4

Feature importance computed using SHAP values on the test set, and the relationship between feature value and treatment outcome. Blue dots indicate Cure and, for categorical features, the value No. Red dots indicate LTFU patients and, for categorical features, the value Yes


Discussion

In this study of pulmonary TB cases reported to SINAN in Brazil, we developed a risk score that effectively stratified, before treatment initiation, those TB cases at higher risk of LTFU during ATT. Our score used data from 8 features, all of which come from the case notification form and are publicly available. These features include clinical and epidemiologic information that can be collected by health professionals before treatment initiation and that predicted LTFU independently of other characteristics. The use of this risk score could provide crucial information to target specific patients from diagnosis onward and improve successful ATT completion, potentially facilitating achievement of the WHO target of 90% treatment success [18].

Importantly, in our study, 14.5% of the total population experienced LTFU, which represents an important public health problem because of the risk of M. tuberculosis transmission and the potential emergence of drug-resistant strains [19]. Moreover, the rate of DOT in the group that experienced LTFU was significantly lower than in the cure group. Detecting these patients at the beginning of TB treatment might therefore help clinicians prioritize DOT and help the Brazilian national TB program define its target populations.

Our probabilistic score was developed using clinical and sociodemographic data readily collected in most clinical care settings, even in resource-limited settings. Among the variables selected, prior TB, consumption habits (alcohol, tobacco, or drug use), age (adult and elderly), biological sex, HIV infection, and schooling level were the risk factors that most contributed to an LTFU during TB treatment. Some of these characteristics have been explored and linked to unfavorable TB treatment outcomes through the relationship with poor therapy adherence, LTFU, and treatment discontinuation [20,21,22,23,24,25,26,27]. It is important to highlight that our study identified history of prior TB as the variable with the most significant impact on the model’s ability to predict LTFU. This finding is consistent with extensive literature, which attributes this impact to a mix of psychological factors, barriers to healthcare access, social conditions, and stigma [28,29,30,31]. Additionally, a study using the SINAN database highlighted that a history of previous treatment abandonment is the primary risk factor for LTFU in new treatment cycles, underlining the importance of past treatment adherence in predicting and managing future outcomes [32].

In a previous study, a similar score was developed to predict unfavorable anti-TB treatment outcomes in people living with diabetes in China, although it relied on clinical and radiologic data [23]. Another study from Mexico developed an algorithm to predict mortality, failure, and drug resistance in newly diagnosed TB patients using clinical features and laboratory tests [27]. In contrast, our score can be applied to patients with or without diabetes and uses only clinical information, without the need for laboratory data or radiographic exams.

While exploring data from the RePORT-Brazil consortium, we have previously reported a clinical prediction model for unfavorable pulmonary TB treatment outcomes [20]. That score utilized information that was not readily available in SINAN, thus we found it difficult to translate to the nationwide TB program in Brazil. The present study intended to create a score that could be employed in all settings, especially in those with limited resources, which could certainly help guide interventions at the moment of diagnosis, before starting treatment in a large country such as Brazil.

Our risk model had several limitations. First, the study utilized nationwide public data, in which several features had missing values and were subject to a wide range of demographic and regional discrepancies. Second, most comorbidities and clinical characteristics were self-reported, which may introduce misclassification bias. Third, the study included only pulmonary TB cases and consequently may not apply to extrapulmonary or disseminated TB. Finally, we excluded vulnerable populations, and exclusions amounted to more than 50% of all reported cases, which limits use of the score to populations similar to those included in our study. We suggest that future scores include more clinical data, physical examination findings, and socioeconomic conditions to improve accuracy and extend applicability in clinical practice.

Despite the limitations, to the best of our knowledge, this is the first prognostic score model developed in South America using only clinical and epidemiologic data from disease notification forms, obtained before therapy initiation, with relatively accurate prediction. The resulting model is parsimonious and should be utilized by clinicians through a nomogram or web application, assisting in TB care and potentially improving successful completion of ATT among pulmonary TB patients.

Data availability

All data accessed in this study were obtained from a publicly available platform and pre-processed by the Brazilian Ministry of Health. All data generated and/or analyzed during the current study are available in the GitHub repository.


  1. WHO. Global tuberculosis report 2023 [Internet]. [cited 2023 Nov 28].

  2. Rapid communication: key changes to the treatment of drug-resistant tuberculosis [Internet]. [cited 2023 Dec 4].

  3. WHO consolidated guidelines on tuberculosis. Module 4: treatment: drug-susceptible tuberculosis treatment [Internet]. [cited 2023 Dec 4].

  4. The World Bank Group. The World Bank In Brazil [Internet]. World Bank. [cited 2023 Dec 4].

  5. Campos T. Manual SINAN – Normas e Rotinas 2a edição – Portal da Vigilância em Saúde [Internet]. 2018 [cited 2023 Nov 28].

  6. Rocha MS, Bartholomay P, Cavalcante MV, et al. Notifiable diseases Information System (SINAN): main features of tuberculosis notification and data analysis. Epidemiol Serv Saude. 2020;29(1):e2019017.


  7. BRASIL. Manual de Recomendações para o Controle da Tuberculose no Brasil [Internet]. 2023.

  8. Boletim Epidemiológico de Tuberculose. – 2022 | Departamento de Doenças de Condições Crônicas e Infecções Sexualmente Transmissíveis [Internet]. [cited 2023 Mar 15].

  9. Tanha J, Abdi Y, Samadi N, Razzaghi N, Asadpour M. Boosting methods for multi-class imbalanced data classification: an experimental review. J Big Data. 2020;7(1):70.


  10. Pedregosa F, Varoquaux G, Gramfort A et al. Scikit-learn: Machine Learning in Python [Internet]. arXiv; 2018 [cited 2023 Mar 20].

  11. Misra P, Singh A. Improving the Classification Accuracy using Recursive Feature Elimination with Cross-Validation. 2020 [cited 2024 Mar 21].

  12. Ke G, Meng Q, Finley T et al. LightGBM: a highly efficient gradient boosting decision tree. Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. pp. 3149–3157.

  13. Ferreira AJ, Figueiredo MAT. Boosting Algorithms: A Review of Methods, Theory, and Applications. In: Zhang C, Ma Y, editors. Ensemble Machine Learning: Methods and Applications [Internet]. New York, NY: Springer; 2012 [cited 2023 Dec 4]. pp. 35–85.

  14. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128–38.


  15. Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett. 2006;27(8):861–74.


  16. Lundberg SM, Erion G, Chen H, et al. From local explanations to Global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):56–67.


  17. Explainable AI: from black box to glass box [Internet]. [cited 2023 Mar 20].

  18. Stop TB Partnership. Global Plan to End TB 2018–2022: The Paradigm Shift [Internet]. 2023.

  19. Walker IF, Shi O, Hicks JP et al. Analysis of loss to follow-up in 4099 multidrug-resistant pulmonary tuberculosis patients. Eur Respir J. 2019; 54(1).

  20. Novel stepwise approach to assess representativeness of a large multicenter observational cohort of tuberculosis patients: the example of RePORT Brazil. International Journal of Infectious Diseases [Internet]. [cited 2023 Dec 4].

  21. Clinical Prediction Model for Unsuccessful Pulmonary Tuberculosis Treatment Outcomes. Clinical Infectious Diseases [Internet]. [cited 2023 Dec 4].

  22. Mendelsohn SC, Fiore-Gartland A, Awany D, et al. Clinical predictors of pulmonary tuberculosis among South African adults with HIV. EClinicalMedicine. 2022;45:101328.


  23. Singano V, Kip E, Ching’ani W, Chiwaula L. Tuberculosis treatment outcomes among prisoners and general population in Zomba, Malawi. BMC Public Health. 2020;20(1):700.


  24. Unsuccessful TB treatment outcomes with a focus on HIV co-infected cases: a cross-sectional retrospective record review in a high-burdened province of South Africa. BMC Health Services Research [Internet]. [cited 2023 Dec 4].

  25. Systematic review of prediction models for pulmonary tuberculosis treatment outcomes in adults. BMJ Open [Internet]. [cited 2023 Dec 4].

  26. The impact of alcohol use on tuberculosis treatment outcomes: a s… Ingenta Connect [Internet]. [cited 2023 Dec 4].

  27. You N, Pan H, Zeng Y, et al. A risk score for prediction of poor treatment outcomes among tuberculosis patients with diagnosed diabetes mellitus from eastern China. Sci Rep Nat Publishing Group. 2021;11(1):11219.


  28. Caminero JA. Multidrug-resistant tuberculosis: epidemiology, risk factors and case finding [State of the art series. Drug-resistant tuberculosis. Edited by C-Y. Chiang. Number 4 in the series]. The International Journal of Tuberculosis and Lung Disease. 2010; 14(4):382–390.

  29. Abubakar I, Lipman M. Reducing loss to follow-up during treatment for drug-resistant tuberculosis. European Respiratory Journal [Internet]. European Respiratory Society; 2019 [cited 2024 Mar 21]; 53(1).

  30. Soedarsono S, Mertaniasih NM, Kusmiati T, et al. Determinant factors for loss to follow-up in drug-resistant tuberculosis patients: the importance of psycho-social and economic aspects. BMC Pulm Med. 2021;21(1):360.


  31. Jiang Y, Chen J, Ying M, et al. Factors associated with loss to follow-up before and after treatment initiation among patients with tuberculosis: a 5-year observation in China. Front Med (Lausanne). 2023;10:1136094.


  32. Barreto-Duarte B, Villalva-Serra K, Miguez-Pinto JP, et al. Retreatment and Antituberculosis Therapy Outcomes in Brazil between 2015 and 2022: A Nationwide Study of Disease Registry Data [Internet]. Rochester, NY; 2023 [cited 2024 Mar 21].



The authors thank the study participants and the teams of the clinical and laboratory platforms of RePORT Brazil. A special thanks to Elze Leite (FIOCRUZ, Salvador, Brazil).


This work was supported by the Intramural Research Program of the Fundação Oswaldo Cruz (B.B.A.); the Departamento de Ciência e Tecnologia (DECIT) - Secretaria de Ciência e Tecnologia (SCTIE) – Ministério da Saúde (MS), Brazil [25029.000507/2013-07 to V.C.R.]; the National Institutes of Allergy and Infectious Diseases [U01-AI069923 to T.R.S, MSR, ALK, TRS, BBA, and MCS]; and the Programa Inova FIOCRUZ/Edital Inovação Amazônia (Fiocruz, FAPEAM and FAPERO to MR). MAP and B.B.D received a fellowship from Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Finance code: 001). B.B.A, A.L.K., M.C.S., and V.C.R. are senior investigators of CNPq/Ministry of Science Technology. All authors have read and agreed to the submitted version of the manuscript. The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.

Author information

Authors and Affiliations



Conceptualization, T.R.S.,BB-D, M.C.F., M.C.S., V.C.R., and B.B.A.; Data curation, A.T.L.Q, K.B.B, V.R.S, M.A-P. and B.B.A.; Investigation, L.S., M.M.S.R., B.B-D., J.R.L.S., A.L.K., S.C., B.S.G-R, C.F.D, V.C.R., T.R.S., M.C.S., and B.B.A.; Formal analysis, A.T.L.Q, M.M.S.R and B.B.A.; Funding acquisition, J.R.L.S., A.L.K., S.C., V.C.R., T.R.S., M.C.S., and B.B.A.; Methodology, L.S., B.B-D M.B.A., M.A-P., and B.B.A.; Project administration, M.C.F., T.R.S., and B.B.A.; Resources, T.R.S., and B.B.A.; Software, A.T.L.Q, M.M.S.R, and B.B.A.; Supervision, A.T.L.Q., T.R.S., and B.B.A.; Writing—original draft, M.M.S.R, B.B-D and B.B.A.; Writing—review and editing, all authors.

Corresponding authors

Correspondence to Moreno M. S. Rodrigues or Bruno B. Andrade.

Ethics declarations

Ethics approval and consent to participate

All data accessed in this study were obtained from a publicly available platform following the instructions set by Resolution Number 466/12 on Research Ethics of the National Health Council, Brazil. There was no identifiable information in the databases and thus the study was exempt from approval by ethics committees.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.


The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Rodrigues, M.M.S., Barreto-Duarte, B., Vinhaes, C.L. et al. Machine learning algorithms using national registry data to predict loss to follow-up during tuberculosis treatment. BMC Public Health 24, 1385 (2024).

