This study compared the level of case-by-case agreement in diagnosis between the InterVA model and the PR method. In general, the CSMFs for the major causes of death were comparable between the two methods and consistent with previous findings [2, 11, 13]. However, the overall case-by-case agreement in diagnosis fell within the fair range. The level of agreement varied by cause of death and by the age and sex of the deceased, ranging from fair (κ = 0.23 for cardiovascular diseases) to substantial (κ = 0.75 for accidents/injuries).
The proportions of deaths attributed to communicable causes by the two methods were similar and consistent with existing knowledge of the burden of communicable diseases in Ethiopia [2, 11]. The methods attributed similar proportions of deaths to TB, pneumonia/sepsis, acute infections, malaria and HIV/AIDS. In a similar study in Kenya, the two methods attributed comparable proportions to pneumonia/sepsis, TB, malaria and meningitis. According to other studies, however, InterVA attributed more deaths to TB than physician review did [10, 13, 15]. Findings on how often the two methods diagnose HIV/AIDS are inconsistent: in some studies InterVA diagnosed HIV/AIDS more frequently than PR, whereas in another it did so less frequently. This discrepancy may be related to misclassification between HIV/AIDS and TB, which has been reported in several studies [10, 22, 29, 30].
In our study, both methods attributed NCDs comparably, and the magnitude of the estimate accords with previous findings [31–33]. Similarly, cardiovascular causes of death were estimated comparably by both methods, as was also reported in another study. The consistency of the two methods in estimating deaths attributed to accidents/injuries concurs with other studies [10, 15]. This may be because the indicators (signs and symptoms) reported for accidents and injuries are clearer than those reported for other causes.
In our study, the overall chance-corrected agreement, for both broad and specific cause-of-death categories, fell between 0.21 and 0.40, which is considered fair agreement. The case-by-case agreement at the specific cause-of-death level was higher than in a similar study in Kenya (κ = 0.27) and lower than in other findings from Ethiopia (κ = 0.49) and Kenya (κ = 0.42) [10, 13, 15]. As reported in a similar study, agreement was better in younger than in older age groups. This could be explained by differences in the epidemiology of causes of death across age groups: older people more often experience multiple conditions with overlapping symptoms than younger people do.
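The chance-corrected agreement discussed above can be illustrated with a minimal sketch of Cohen's kappa. The cause labels and the ten paired assignments below are hypothetical and are not data from this study. The descriptive bands follow the commonly used Landis and Koch scale, which is assumed here because it matches the 0.21 to 0.40 "fair" range quoted in the text.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters assigning one label per case."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of cases where both methods give the same cause.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each method's marginal distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both methods always assign the same single cause; kappa is undefined.
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Descriptive band on the Landis and Koch scale (assumed convention)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical causes assigned to the same ten deaths by each method.
interva = ["TB", "TB", "HIV", "CVD", "CVD", "Injury", "Injury", "Malaria", "TB", "CVD"]
pr = ["TB", "HIV", "HIV", "CVD", "TB", "Injury", "Injury", "Malaria", "TB", "Malaria"]

kappa = cohen_kappa(interva, pr)
print(f"kappa = {kappa:.3f} ({agreement_band(kappa)})")
```

Because kappa subtracts the agreement expected from the marginal cause distributions, two methods can produce similar CSMFs at the population level and still show only fair agreement case by case.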
Findings from several studies [10, 13–15], but not all, show that the level of concordance between PR and InterVA is insufficient. A comparative study in Northwest Ethiopia, which included 408 adult deaths, measured a concordance of 0.49 at the broad CoD level. Even lower levels of agreement were reported from the African Population and Health Research Center (κ = 0.27) and the Kilifi Health Demographic Surveillance System (κ = 0.32), both in Kenya, which made similar comparisons. On the other hand, findings from a recent multi-center study, which used data from Health and Demographic Surveillance systems and Health and Demographic Surveys, showed an almost perfect level of agreement, with an overall concordance correlation coefficient of 0.83.
In the present study, no inference about the validity of either method can be made in the absence of a gold-standard diagnosis. However, validation studies that simultaneously evaluated the PR and InterVA methods against hospital-certified deaths showed that PR performs better than the InterVA model [14, 15]. A validation study that compared both methods against hospital CoD found that the agreement between InterVA and hospital CoD (κ = 0.32) was lower than the agreement between physician review and hospital CoD (κ = 0.52). In addition, in another study that evaluated PR and InterVA against clinical diagnostic gold standards in a sample of 12,542 verbal autopsy cases, PR performed better than InterVA across all age groups.
Discrepancy in diagnosis between these two methods is not unexpected, though further investigation is needed to explain the variation. According to previous studies, the discordance is related to differences in how the two methods process and use the verbal autopsy data. InterVA uses data from the closed-ended questions only, while PR makes extensive use of the open-ended narrative part of the VA data [3, 9, 10, 14]. In addition, InterVA uses a probability matrix to process the indicators in the verbal autopsy data, while PR is based on expert judgment [7, 9].
In addition to the minimal effort it requires, InterVA has the comparative advantage of being completely internally consistent, which enables it to produce comparable outputs. In contrast, PR is labor intensive and prone to inter-observer variation. However, it also has some benefits. As part of their routine clinical practice, reviewing physicians treat patients from the same population that the VA cases come from. This gives them a chance to correlate the signs and symptoms used to describe illness in the specific community with the actual illness confirmed through clinical investigation. Although such prior knowledge can affect the likelihood of coding less prevalent causes, it may make the PR process more robust for CoDs that are common in the community.
The present study has the following limitations. The two methods were compared in the absence of a gold-standard diagnosis. As a result, it was not possible to draw conclusions about the validity of either method. Although the study included more cases than the minimum required sample size, the sample was not sufficient for comparing sub-groups or rare causes of death.