
Table 1 Systematic solutions for improving data quality and quantity from DHIS2, with a special focus on strategies for the analysis phase

From: A roadmap for using DHIS2 data to track progress in key health indicators in the Global South: experience from sub-Saharan Africa

Each row of the table lists an issue, its implication, and mitigation strategies (in general and at the analysis level).

Time period 1: during data collection “on the ground”

Issue: Reporting varies by facility, especially between public and private facilities [18]

Implication: It is difficult to accurately capture the regional incidence of cases when not all facilities are reporting. Aggregation of data at the regional and national level is meaningless if reporting completeness differs across regions.

Mitigation:

- Various strategies to incentivize accurate reporting through performance-based funding (without linking funding amounts to case numbers) have been used successfully [19], although sustainability is a problem

- Increase built-in functionality to capture and report data quality issues

- Automate manual data collection, including integration with electronic medical records where possible

- Improve training of staff responsible for monthly data entry and validation, using inexpensive digital learning platforms [20, 21]

On the analysis level

- Instead of using data aggregated at the regional level, use reporting by individual facilities. Approaches such as k-means clustering have been applied to reporting patterns to identify high- and low-performing health facilities [22]; a minimal sketch follows below
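The facility-level clustering mentioned above can be prototyped with standard tools. The following is a minimal sketch, not the exact method of [22]: it assumes an analyst-built pandas DataFrame with illustrative columns (facility_id, month, reports_expected, reports_received), derives per-facility reporting-rate features, and groups facilities with k-means.

```python
# Minimal sketch: clustering facilities by monthly reporting patterns.
# Column names (facility_id, month, reports_expected, reports_received) are
# illustrative and do not correspond to a specific DHIS2 export format.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_facilities(monthly: pd.DataFrame, n_clusters: int = 3) -> pd.DataFrame:
    # One row per facility: mean reporting rate and its month-to-month variability
    features = (
        monthly.assign(rate=monthly["reports_received"] / monthly["reports_expected"])
        .groupby("facility_id")["rate"]
        .agg(mean_rate="mean", rate_sd="std")
        .fillna(0.0)
    )
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(features)
    )
    return features.assign(cluster=labels).reset_index()
```

In practice the feature set (e.g. timeliness, completeness by indicator) and the number of clusters would be chosen with local experts and checked against known facility performance.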

Issue: Denominator data not always available

Implication: Calculating population-level rates is challenging; increases in case counts over time can be difficult to interpret without knowing changes in the local population.

Mitigation:

- As part of routine data collection, estimates of the catchment-area population served by each health facility could be updated yearly by local experts

On the analysis level

- When using aggregated data, census estimates could be used

- Other datasets (e.g. WorldPop) could be used where the local population size is difficult to estimate; a minimal sketch follows below
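As a complement to the denominator strategies above, here is a minimal sketch of how an external population estimate (e.g. district totals derived from WorldPop or census projections) could be merged with case counts to compute rates. All column names and numbers are illustrative.

```python
# Minimal sketch: incidence per 1,000 population using an external denominator.
# Column names (district, year, cases, population) and all numbers are illustrative.
import pandas as pd

def incidence_per_1000(cases: pd.DataFrame, population: pd.DataFrame) -> pd.DataFrame:
    merged = cases.merge(population, on=["district", "year"], how="left", validate="many_to_one")
    merged["incidence_per_1000"] = 1000 * merged["cases"] / merged["population"]
    return merged

cases = pd.DataFrame({"district": ["A", "B"], "year": [2022, 2022], "cases": [120, 45]})
population = pd.DataFrame({"district": ["A", "B"], "year": [2022, 2022], "population": [25_000, 9_000]})
print(incidence_per_1000(cases, population))
```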

Time periods 2–3: during the data entry and upload process

Issue: Large differences in quality between indicators

Implication: Generally, every indicator comes with its own set of specific issues and definitions, often determined by programmatic priorities; a single unified approach to cleaning and analyzing all indicators is not possible.

Mitigation:

- Streamline the number of required indicators to the most important ones and employ a dedicated DHIS2 staff member trained in data collection and reporting

- Automate manual data collection, including integration with electronic medical records where possible

- Use data validation rules during data entry

- Apply automatic outlier detection functions (see the sketch after this list)

On the analysis level

- Identify and analyze only the highest-quality indicators in a standardized process [23]
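The outlier-detection step can also be approximated outside DHIS2 (DHIS2 itself offers outlier analysis functions). The sketch below is a simple, assumption-laden example using a robust z-score (median absolute deviation) on one facility's monthly series; the threshold and example data are made up.

```python
# Minimal sketch: flag outlying monthly values with a robust z-score (MAD-based).
# The threshold (3.5) and the example series are illustrative only.
import pandas as pd

def flag_outliers(values: pd.Series, threshold: float = 3.5) -> pd.Series:
    median = values.median()
    mad = (values - median).abs().median()
    if mad == 0:
        return pd.Series(False, index=values.index)
    robust_z = 0.6745 * (values - median) / mad
    return robust_z.abs() > threshold

monthly_cases = pd.Series([14, 12, 15, 13, 140, 11, 16], name="cases")
print(monthly_cases[flag_outliers(monthly_cases)])  # flags the implausible spike (140)
```

Flagged values should be confirmed with the facility rather than removed automatically, since genuine outbreaks also produce spikes.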

Issue: Reports of “zero cases” and missing values are recorded in the same way (at the monthly level)

Implication: We cannot distinguish zero reported cases from missing values, which introduces bias.

Mitigation:

- Adjust the DHIS2 software to require explicit zero or “unknown” entries for a report to be considered complete

On the analysis level

- Reporting rates could be used to determine whether a report was filed; if so, empty cells could be assumed to represent zero reported cases (see the sketch after this list)

- Compare paper registry data with DHIS2 data (if available to the analyst)

- Use imputation methods at the monthly level if zero cases are rare
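A minimal sketch of the zero-versus-missing logic and a simple imputation is shown below. It assumes an analyst-built monthly table with illustrative columns (month, report_filed, cases); linear interpolation is just one possible imputation choice and is only reasonable when true zeroes are rare.

```python
# Minimal sketch: treat empty cells as zero when a report was filed, keep them
# missing otherwise, then interpolate the remaining gaps. Columns are illustrative.
import numpy as np
import pandas as pd

def resolve_zeros_and_impute(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values("month").copy()
    df["cases_resolved"] = (
        df["cases"].where(df["cases"].notna(), 0)   # empty cell on a filed report -> 0
        .where(df["report_filed"], other=np.nan)    # no report filed -> still missing
    )
    df["cases_imputed"] = df["cases_resolved"].interpolate(limit_direction="both")
    return df

df = pd.DataFrame({
    "month": ["2023-01", "2023-02", "2023-03", "2023-04"],
    "report_filed": [True, True, False, True],
    "cases": [12, None, None, 18],
})
print(resolve_zeros_and_impute(df))
```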

Issue: Total attendance by facility is sometimes reported through a different process than the cases on the reporting sheets (i.e. not through the monthly tally sheet). Monthly reports may be missing total attendance but still report cases of individual conditions. Similarly, the recording of first attendance and re-attendance can differ. How attendance numbers are collected is often only sparsely documented.

Implication: The missingness of total attendance may differ from that of monthly case counts and can lead to the wrong denominator for rates; an analyst looking only at yearly data can miss this.

Mitigation:

- Make the reporting of attendance a standard part of the monthly report

On the analysis level

- Explore the magnitude of the issue (using monthly-level data)

- Impute total attendance (at the monthly level; see the sketch after this list)

- Explore other reliable denominators instead of cases/attendance (see above)

- Acquire registry-level data (at the facility/health program level)
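The two analysis-level steps above (exploring the magnitude of missing attendance and imputing it monthly) could look roughly like the sketch below; the imputation rule (facility-specific median for the same calendar month) and all column names are assumptions for illustration.

```python
# Minimal sketch: quantify how often attendance is missing while cases are present,
# then fill gaps with the facility's median attendance for that calendar month.
# Column names (facility_id, month, cases, attendance) are illustrative.
import pandas as pd

def summarize_and_impute_attendance(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["calendar_month"] = pd.to_datetime(df["month"]).dt.month
    share = (df["attendance"].isna() & df["cases"].notna()).mean()
    print(f"Facility-months with cases but no attendance: {share:.1%}")
    df["attendance_imputed"] = df["attendance"].fillna(
        df.groupby(["facility_id", "calendar_month"])["attendance"].transform("median")
    )
    return df
```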

Issue: Limited documentation

Implication: A lack of exact definitions of variables affects interpretation, comparability across regions, and generalization.

Mitigation:

- Responsible units (e.g., ministries of health) should define and document key variables together with local staff

Time period 4: correcting for data limitations at the analysis phase

Issue: Extracting monthly facility-level data for large areas (e.g. country-wide) for more complex analyses in external statistical software is very resource-intensive and often not practical

Implication: This limits cleaning and analysis at the monthly level; for example, the degree of missing data from individual health facilities cannot be seen from annual case counts alone.

Mitigation:

- Use monthly-level data for smaller areas only; imputation can then be used to extend the results to larger areas if necessary (see the extraction sketch after this list)

- At the end of each year, create finalized datasets of the most important monthly indicators for analysts to use
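Where a smaller monthly extract is needed, it can be pulled directly from the DHIS2 analytics Web API rather than via a country-wide export. The sketch below uses a placeholder URL, credentials, and UIDs; the exact dimension syntax and organisation-unit levels should be checked against the instance's API documentation.

```python
# Minimal sketch: monthly values for one indicator across the facilities of one
# district via the DHIS2 analytics API. URL, credentials, and UIDs are placeholders.
import requests

BASE_URL = "https://dhis2.example.org/api"   # placeholder instance
INDICATOR_UID = "Uvn6LCg7dVU"                # placeholder data element/indicator UID
DISTRICT_UID = "O6uvpzGd5pu"                 # placeholder organisation unit UID

response = requests.get(
    f"{BASE_URL}/analytics.json",
    params={
        "dimension": [
            f"dx:{INDICATOR_UID}",
            "pe:LAST_12_MONTHS",
            f"ou:{DISTRICT_UID};LEVEL-4",    # assumes facilities sit at level 4
        ],
        "displayProperty": "NAME",
    },
    auth=("username", "password"),           # placeholder credentials
    timeout=60,
)
response.raise_for_status()
rows = response.json().get("rows", [])       # one row per dx/ou/pe combination with its value
print(f"Retrieved {len(rows)} facility-month values")
```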

Issue: Data are available with different periodicity (e.g. some monthly, some quarterly, some yearly)

Implication: Quality assessment across all indicators is only possible at the coarsest level (yearly), often because the appropriate population denominator is only available from yearly census data.

Mitigation:

- First assess quality for all indicators at the yearly level

- Then, where available, re-extract data at the monthly level for individual indicators

Issue: Data on some indicators are available only from higher-level health facilities and can therefore only be analyzed at the regional level (e.g. C-sections are often performed only at hospitals)

Implication: Some indicators are reported at the health facility level, others at the district level; this can make it difficult to compare indicators or assess data quality.

Mitigation:

- Discuss with local experts to understand where services are provided

- Assess quality of these indicators only at the level of the health facility where they are provided

Issue: Reporting may vary over time and across regions: changes over time and region are multi-causal (e.g. a decrease in malaria incidence may be due to decreases in reporting and/or a new malaria prevention program)

Implication: An increase or decrease in cases can be due to artefacts such as changes in reporting rather than an actual change in case numbers.

Mitigation:

- Consult local experts to interpret any type of unexpected changes observed in data (e.g., spikes due to outbreaks; public health programs that increased reporting)

- Adjust estimates of incidence by incorporating measures of local health-seeking behavior reported in other representative surveys (e.g., the Demographic and Health Survey can be used to determine how many women, on average, accessed prenatal care in a region) [24]; a minimal adjustment sketch follows below

- Assess data quality by year (reporting rates)
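One common form of the survey-based adjustment mentioned above divides reported cases by reporting completeness and by the fraction of cases that seek care at a facility (estimated from a survey such as the DHS). The sketch below is illustrative only; it is not the specific method of [24], and all numbers are made up.

```python
# Minimal sketch: adjust a reported incidence rate for reporting completeness and
# facility care-seeking. All inputs below are made-up illustrative values.
def adjusted_incidence_per_1000(reported_cases: float, population: float,
                                reporting_completeness: float,
                                care_seeking_fraction: float) -> float:
    estimated_cases = reported_cases / (reporting_completeness * care_seeking_fraction)
    return 1000 * estimated_cases / population

# 400 reported cases, 50,000 people, 80% of monthly reports received,
# and survey data suggesting 60% of cases attend a facility -> ~16.7 per 1,000
print(adjusted_incidence_per_1000(400, 50_000, 0.8, 0.6))
```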

Issue: The extraction process introduces variability in the data (e.g., technical errors can produce different outputs for the same query)

Implication: The quality of indicators and results is driven by the extraction process.

Mitigation:

- Standardize the data cleaning process using semi-automatic procedures such as those described in [23]; a simple consistency check between repeated extractions is sketched below
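A quick way to catch extraction variability is to repeat the same query and verify that the two exports are identical before analysis proceeds. The sketch below hashes sorted CSV exports; the file names are placeholders.

```python
# Minimal sketch: compare two extractions of the same query via a checksum.
# File names are placeholders for repeated exports of the identical query.
import hashlib
import pandas as pd

def extract_checksum(path: str) -> str:
    df = pd.read_csv(path)
    # Sort rows so the checksum is insensitive to row order in the export
    df = df.sort_values(list(df.columns)).reset_index(drop=True)
    return hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()

first = extract_checksum("extract_run1.csv")
second = extract_checksum("extract_run2.csv")
print("Extractions match" if first == second else "Extractions differ; investigate before analysis")
```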

Issue: Diagnostic capabilities vary across facilities and regions (e.g. availability of PCR testing)

Implication: Underreporting in areas/facilities with lower diagnostic capabilities may lead to spurious associations derived from the data.

Mitigation:

- Consult with local experts on each outcome/indicator, especially those requiring advanced diagnostic capabilities

- Quantify the extent of the issue

- Use existing databases on health facility capability, such as the Service Availability and Readiness Assessment (SARA), to understand differing diagnostic capacity

Issue: Data are aggregated at the district level, although they are generated by point-based health facilities

Implication: Modifiable Areal Unit Problem (MAUP) effects, a spatial-statistical problem that arises when point-based data are aggregated into districts [25].

Mitigation:

- Use facility-level data where possible (a minimal illustration of the aggregation effect follows below)

- Focus on local spatial regression rather than global

- Overlay aggregation method for disease mapping [26]
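To make the MAUP concern concrete, the toy example below aggregates the same four facilities under two different, hypothetical district boundaries and obtains different area-level pictures; all numbers are invented for illustration.

```python
# Minimal sketch: the same facility-level data yield different area-level rates
# depending on the zoning used for aggregation (Modifiable Areal Unit Problem).
import pandas as pd

facilities = pd.DataFrame({
    "facility": ["F1", "F2", "F3", "F4"],
    "cases": [10, 80, 15, 5],
    "attendance": [200, 400, 300, 100],
    "zoning_a": ["North", "North", "South", "South"],   # hypothetical boundaries A
    "zoning_b": ["East", "West", "West", "East"],       # hypothetical boundaries B
})

def area_rates(df: pd.DataFrame, zoning: str) -> pd.Series:
    grouped = df.groupby(zoning)[["cases", "attendance"]].sum()
    return 100 * grouped["cases"] / grouped["attendance"]

print(area_rates(facilities, "zoning_a"))  # North 15.0%, South 5.0%
print(area_rates(facilities, "zoning_b"))  # East 5.0%, West ~13.6% for the same facilities
```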