How much do the physician review and InterVA model agree in determining causes of death? a comparative analysis of deaths in rural Ethiopia

Background Despite it is costly, slow and non-reproducible process, physician review (PR) is a commonly used method to interpret verbal autopsy data. However, there is a growing interest to adapt a new automated and internally consistent method called InterVA. This study evaluated the level of agreement in determining causes of death between PR and the InterVA model. Methods Verbal autopsy data for 434 cases collected between September 2009 and November 2012, were interpreted using both PR and the InterVA model. Cohen’s kappa statistic (κ) was used to compare the level of chance corrected case-by-case agreement in the diagnosis reached by the PR and InterVA model. Results Both methods gave comparable cause specific mortality fractions of communicable diseases (36.6 % by PR and 36.2 % by the model), non-communicable diseases (31.1 % by PR and 38.2 % by the model) and accidents/injuries (12.9 % by PR and 10.1 % by the model). The level of case-by-case chance corrected concordance between the two methods was 0.33 (95 % CI for κ = 0.29–0.34). The highest and lowest agreements were seen for accidents/injuries and non-communicable diseases; with κ = 0.75 and κ = 0.37, respectively. Conclusion If the InterVA were used in place of the existing PR process, the overall diagnosis would be fairly similar. The methods had better agreement in important public health diseases like; TB, perinatal causes, and pneumonia/sepsis; and lower in cardiovascular diseases and neoplasms. Therefore, both methods need to be validated against a gold-standard diagnosis of death. Electronic supplementary material The online version of this article (doi:10.1186/s12889-015-2032-7) contains supplementary material, which is available to authorized users.


Background
Information that comes from accurate and complete records of deathswho died of whatis a huge resource for evidence-based health planning and development [1]. However, most deaths in low-income countries remain unregistered due to the absence of civil registration system [1][2][3]. In countries, where deaths are not routinely recorded and classified by cause, verbal autopsy (VA) has become an alternative technique [4][5][6]. Verbal autopsy covers the entire process of interviewing close caregivers about the circumstances proceeding to death [4,7,8]. Verbal autopsy data are collected by trained lay interviewers, and then interpreted into a probable cause of death [2,4,7,8]. Either of the following methods can be applied to derive the cause of death from VA data: physician review, physician review using an algorithm, computer algorithms and InterVA model [2,9].
Physician review (PR) is the most commonly used method of interpreting VA data [4,9]. According to this method, two independent physicians review VA questionnaires to assign probable cause of death and corresponding International Classification of Diseases (ICD) code [9,10]. However, the diagnosis reached by physicians can vary depending on their training, experience and knowledge of local epidemiology; which in turn limit the internal consistency and comparability of findings [9,10]. Physician review is also costly and time-taking. A single verbal autopsy review may take up to half an hour of physicians' time; competing the time required for patient care [3,9,10]. Despite all its shortcomings, numerous validation studies have demonstrated PR to be capable of producing reasonably valid cause of death information, and is used in many surveillance systems [2,5,11].
Recently, an automated method of interpreting verbal autopsy, called InterVA, has been developed and used [1,4,7,10]. It calculates the probability of a set of causes of death, given the presence of indicators (circumstances, signs, and symptoms) reported in VA interviews [10,12]. Although statistical modeling of this sort may not reflect the subjective subtleties of physicians' review; the InterVA is faster, cheaper and internally consistent [4,9,10]. Therefore, there is an increasing interest to shift from PR, which is widely used in several research centers and surveillance systems, in to the automated InterVA approach [2,4,5,9,12]. However, the reliability of the diagnosis that can be reached by using these methods is not sufficiently studied. According to prior studies which compared PR and InterVA, the level of agreement vary from low (κ = 0.27) to moderate (κ = 0.42-0.48) [10,[13][14][15].
The Kilite Awlaelo Health and Demographic Surveillance System (KA-HDSS), located in northern Ethiopia, has been using PR method to determine cause of death since September 2009. We used data from the KA-HDSS to measure the agreement in diagnosis between physician review and the computer-based InterVA-4 model.

Study setting and population
This study was undertaken in the KA-HDSS, which is a longitudinal population-based surveillance system established in 2009. It operates in ten contiguous Kebelles (smallest administrative unit with an average population of 5000). The KA-HDSS is a member of the INDEPTH Network [16]. Details of the surveillance system, study population, and operating procedures have been published previously [2,[17][18][19]. The surveillance started with a baseline population of 66,453 individuals living in 14,453 households. Data on core health and demographic events (pregnancy status, birth, death, marital status change, and migrations) and verbal autopsy are collected during regular house-to-house visits.

Verbal autopsy tool and the interview process
Death, to any member of the KA-HDSS cohort, was identified by trained full time data collectors during a regular household visit. Adult relative who was a caregiver during the terminal illness was interviewed using the standard VA questionnaire adapted from the World Health Organization (WHO), INDEPTH Network, and Sample Vital Registration with Verbal Autopsy (SAVVY) [16,20,21]. The VA tool was translated into Tigrigna, the local language of the study area, and rechecked again for its consistency with the original version. Three separate questionnaires for the three age groups: neonate, post-neonate and children (29 days to 15 years) and adults (>15 years) were used.

Interpretation of VA data into probable cause of death
The same verbal autopsy data were interpreted in to probable cause of death (CoD) using physician review and the InterVA-4 model. In the case of PR, two physicians independently reviewed the completed VA questionnaires to assign underlying CoD using the International Classification of Diseases (ICD) coding [22]. Agreement in diagnosis between the two physicians was checked by the research team. Physicians listed up to three underlying CoD, ranked according to their chance to be a CoD in a specific subject. Cause of death was established only if the two physicians coded similar diagnosis for the case. When disagreement in diagnosis existed, the case was reviewed by a third tie-breaker physician. Final diagnosis was then assigned based on the agreement between the third and any of the two physicians. Yet, if three of the physicians assigned different diagnosis for the same case, the diagnosis was considered as "undetermined".
The InterVA-4 model (version 4.02) was used to interpret VA data into probable cause(s) of death. As described by Byass et al. [7], the model is based on Bayes' theorem, which calculates the probability of a set of CoD given the presence of indicators (circumstances, signs, and symptoms), reported in VA interviews [7,10]. The InterVA, which is a freely available package (http://www.interva.net/), requires extraction of the defined set of indicators from the VA questionnaire of each case in to a text files in the comma separated variable (.csv) format [23]. This dataset is then processed to generate a summary of as many as three possible causes of death with their corresponding likelihood [2,12]. Before running the model, the InterVA requires adjusting for the incidence of malaria and HIV/AIDS in the study population as "high" or "low". This would conceptually mean a physician's knowledge of whether his/her current case comes from a setting where malaria is more or less likely, irrespective of particular symptoms [23]. National morbidity data from Ethiopia were widely available in the form of prevalence rate, which was used to set the level of indicators in this study. The prevalence of malaria and HIV/AIDS in Ethiopia are estimated to be 1 % and 1-1.9 %, respectively [24,25]. Therefore, in this study, we set the prevalence of both malaria and HIV/AIDS to be "low". The current report is based on VA data for deaths from September 2009 to November 2012. In the present study, the comparison between the two methods was made based on the most probable CoD per case, assigned by both methods rather than all three possible causes.
Sample size was calculated using the sample size for the kappa-statistic of interrater agreement, as implemented for Stata [26]. Based on previous studies, we assumed that physicians would identify the most common outcome (infectious diseases) in 58 % of the cases and the InterVA model in 61 % of the cases [10]. Besides, existing estimates on the concordance between the two methods range between 0.27 and 0.80 [10,27]. Thus, in the present study, on average we expected a concordance of 0.5. With an absolute precision of 0.1 and significance level 0.05, this study requires a minimum sample size of 332 cases [27]. However, there were 434 cases that have both PR and InterVA based COD, and all of them were included in the present study.

Data analysis
Probable CoD for the 434 deaths was attributed to 47 and 67 possible causes by the InterVA model and PR, respectively. To make direct case-by-case comparison, causes of death which were specific to either of the methods were re-categorized into broader categories common to both methods. For example, physicians coded more specific causes like "eclampsia" and "antepartum" and "post-partum hemorrhage", while the InterVA has only one category of "maternity-related deaths". In this case they were grouped in to "maternal causes". Finally, 19 CoD common to both methods were prepared for the analysis (see Additional file 1). The PR process use ICD-10 coding, while the InterVA use "WHO VA cause of death code". Thus, to facilitate the comparison, outputs of the InterVA model were re-coded in to equivalent ICD-10 code.
Cause specific mortality fraction (CSMF) was used to describe the proportion of cases attributed to probable CoD. Chance corrected concordance between the physician review and InterVA was measured using Cohen's kappa coefficient with corresponding 95 % confidence interval (CI). Data analyses were carried out using Stata version 11.2 for Windows.

Ethical statement
The study protocol was approved by the Ethiopian Science and Technology Agency (IERC-0030). Ethical approval, with a reference number ERC 0432/2014, was also obtained from the Health Research Ethics Review Committee (HRERC) of Mekelle University. Informed verbal consent was obtained from an eligible respondent. This consent procedure was stated in the proposal which was approved by the ethical review committee. To keep confidentiality, data containing personal identifiers were not shared with third party.

Results
Verbal autopsy data for a total of 434 deaths were interpreted using physician review and InterVA. The median age at death was 52 years (SD = 32.5, IQR = 62). People aged 65 years and above, and children accounted for the largest and smallest proportions accounting for 43.3 and 3.9 % of the total cases, respectively ( Table 1). We also compared the frequency of causes of death attributed to specific causes by each method. Figure 1 compares CSMF generated for both methods. Tuberculosis (TB) was the leading cause of death according to the InterVA (13.8 %), and the third commonest cause by PR (12.7 %). Similarly, comparable proportions of deaths were attributed to accidents/injuries, and cardiovascular causes

Agreement in diagnosis between physician review and InterVA model
In this study, the overall agreement between the InterVA and PR on the broad causes of death was 56.5 % (κ = 0.40, 95 % CI: 0.39-0.45) [ Table 3]. However, on specific causes of death level (19 causes), the two methods agreed on 165 cases (38.0 %) (κ = 0.33, 95 % CI: 0.29-0.34). The level of agreement has notably varied by cause of death. The highest and lowest agreement were shown for accidents/injuries and NCDs; with κ = 0.75 and κ = 0.37, respectively. Similarly, there was a moderate agreement for perinatal causes (κ = 0.59), TB (κ = 0.59); but not for cardiovascular causes (κ = 0.23) and neoplasms (κ = 0.27). The two methods had a better agreement in children than adults and the old age group.

Discussion
This study compared the level of case-by-case agreement in diagnosis between the InterVA model and PR methods. In general, the CSMF for the major causes of death were comparable in both methods and also consistent with previous findings [2,11,13]. However, the overall case-by-case agreement in diagnosis lies within the fair range of agreement [28]. The level of agreement has varied by causes of death, age and sex of the deceased, ranging from fair (κ = 0.23 for cardiovascular diseases) to substantial level agreement (κ = 0.75 for accidents/injuries). The proportions of deaths attributed to communicable causes by both methods were similar and consistent to  the existing knowledge of the burden of communicable diseases in Ethiopia [2,11]. Both methods were similar in attributing the proportion of TB, pneumonia/sepsis, acute infections, malaria and HIV/AIDS. In a similar study in Kenya, the two methods comparably attributed pneumonia/sepsis, TB, malaria and meningitis [10]. However, according to other studies, the InterVA overestimated TB than the physician review [10,13,15]. The comparability of both methods in the number of times they diagnosed HIV/AIDS was inconsistent. In some studies, the InterVA diagnosed HIV/AIDS more frequently than physician review [10], while less frequently than PR [13] in other study. This discrepancy may be related to misclassification of HIV/AIDS and TB, which is reported in several studies [10,22,29,30]. In our study, both methods attributed NCDs comparably and the magnitude of the estimate accords with previous findings [31][32][33]. Similarly, cardiovascular causes of death were comparably estimated by both methods, which was also reported in another study [15]. The consistency of both methods in estimating deaths attributed to accidents/injuries shown in our study concurs with that of other studies [10,15]. This may be related to the clarity of the indicators (signs and symptoms) reported for accidents and injuries than other causes.
In our study, the overall chance corrected agreement, at broad and specific causes of death categories, falls between 0.21 and 0.40, which is considered as a fair agreement [28]. The case-by-case agreement at specific cause of death level was higher than a similar study in Kenya (κ = 0.27) and lower than another findings from Ethiopia (κ = 0.49) and Kenya (κ =0.42) [10,13,15]. As reported in a similar study, the level of agreement was better in younger ages than the older age groups [34]. This could be explained in terms of the difference in epidemiology of causes of diseases across age groups. Older age groups experience multiple illness conditions with overlapping symptomatic nature than younger groups [34].
Findings from several [10,[13][14][15], but not all [26], studies show that the concordance level between the PR and InterVA is insufficient. A comparative study in Northwest Ethiopia, which included 408 adult deaths, measured a concordance level of 0.49 on broad CoD level [13]. Even much lower levels of agreement were also reported from the African Population and Health Research Center (k = 0.27) [10] and Kilifi Health Demographic Surveillance System (k = 0.32) [15], both in Kenya, which did similar comparison. On the other hand, finding from a recent multi-center study, which used data from Health and Demographic Surveillance systems, and Health and Demographic Surveys, showed an almost perfect level of agreement, reporting overall concordance correlation coefficient of 0.83 [26].
In the present study, inference about validity of either of the methods can not be made in the absence of a goldstandard diagnosis. However, validation studies which simultaneously evaluated PR and InterVA methods against hospital certified deaths, showed that the PR performs better than the InterVA model [14,15]. A validation study which compared both the InterVA and PR methods against hospital CoD revealed that the level of agreement between InterVA and hospital CoD (κ = 0.32) was lower than the agreement between physician review and hospital CoD (k = 0.52). In addition, in another study which evaluated the PR and InterVA using clinical diagnostic gold standards in a sample of 12,542 verbal autopsy cases, the PR has shown a better performance than the InterVA, across all age-groups [14].
Discrepancy in the diagnosis between these two methods may not be unexpected, though further investigation is needed to explain the variation. Nevertheless, according to previous studies the discordance in diagnosis was related to a variation on how the two methods process and use the verbal autopsy data. The InterVA uses the data from the closed ended questions only, while the PR involve extensive use of the open ended narrative part of the VA data [3,9,10,14]. In addition, the InterVA use a probability matrix to process the indicators in the verbal autopsy data, while the PR is based on expert judgment [7,9]. In addition to the minimal effort it requires, the InterVA has a comparative advantage of being completely internally consistent that enables producing comparable outputs. In contrary, PR is labor intensive, and prone to interobserver variation. However, it has also some benefits. As a part of their routine clinical practice, reviewer physicians treat patients who come from the same population where the VA cases come from. This gives reviewer physicians a chance to correlate the signs and symptoms used to describe illness in the specific community with the actual illness confirmed through clinical investigations. Although, such prior knowledge can affect the possibility of coding less prevalent causes [9], it may help the PR process to be a robust on CoD which are common in the community.
The present study has the following limitations. The two methods were compared in the absence of a goldstandard diagnosis. As a result, it was possible to conclude about the validity of the methods. Although the study included more cases than the minimum sample size required, it was not sufficient when it comes to comparing sub-groups or rare causes of death.

Conclusion
In summary, this study reported an overall low chance corrected agreement in probable cause of death between PR and InterVA. The level of agreement varies across different categories of causes of death, and age of the deceased. The agreement ranged from moderate to substantial for important public health diseases like TB, perinatal causes, pneumonia/sepsis, and accidents and injuries; while the agreement for NCDs, especially for cardiovascular causes and neoplasms was low. Both methods showed a relatively better agreement in underfive children and adults aged 15-45, while they least agreed for cases aged 45 and above years. Therefore, if the InterVA were used in place of the PR process, the overall diagnosis would be fairly similar.