Chapter 15. Selecting data sources and addressing data gaps

Overview

Health inequality monitoring requires data, which may be obtained from a variety of different sources. The selection of a data source or sources with sufficient and adequate data for health inequality monitoring requires consideration of data availability (what sources exist, what data do they contain and how easily can they be accessed?); data quality (what quality standards do the data meet?); and data suitability (how well do the data serve the purpose of a particular inequality monitoring application?). Data may be available from multiple potential sources, but there are other cases where sources do not contain all relevant data or the data are of poor quality or are otherwise inadequate. In some cases, data sources may be lacking altogether, indicating that efforts and investments may be needed to develop them.

Building on Chapters 12–14, which overview the main data sources used for health inequality monitoring, this chapter considers how to select data sources for health inequality monitoring. It provides strategies and insights for assessing data availability, weighing the strengths and limitations of data sources, selecting from among multiple data sources, and navigating situations of low data availability. It introduces techniques such as data source mapping and data source linking, with examples and tools to support these practices.

Assessing data availability

To streamline the assessment of data availability, it is helpful to clarify the general parameters for a specific application of health inequality monitoring – for example, to identify which populations, health topics and/or dimensions of inequality will be the focus of inequality monitoring (see Chapter 3), noting that the specifics of monitoring may be determined or revised based on the availability of data.

Taking an inventory of existing data sources covering the relevant population, health topic or inequality dimension provides insights into the range of existing sources that may be relevant for inequality monitoring. The data source mapping process described below is a systematic approach to cataloguing the contents and characteristics of existing data sources. The common data sources for health inequality monitoring (see Chapters 12–14) provide a solid starting point.

Data source mapping

Data source mapping is an exercise that helps to assess data availability for health inequality monitoring. It is a process by which data sources are systematically enlisted and their properties detailed. Each source is appraised for its suitability for inequality monitoring (1). By creating an inventory of available data, the exercise helps determine where data for health inequality monitoring can be sourced. It also facilitates cross-referencing between data sources. Data source mapping can expose gaps that indicate where additional data collection is required or where means for enabling data links may be introduced.

Data source mapping involves four main steps and a possible fifth step. Creating sequential and cross-referenced tables for each of the steps facilitates the process. Quality assessments of relevant data sources should accompany the data source mapping exercise. Considerations related to data quality can be recorded as notes in the corresponding sections of the tables or as separate documentation (see Chapter 11).

The first step is to create a list of available data sources, specifying the source type (e.g. census, administrative source, household survey, civil registration and vital statistics (CRVS) system), the name of the source, and the year(s) of data collection. Alongside each source, other pertinent information can be noted, such as the frequency of data collection, data representativeness and data accessibility.

The second step involves assessing the dimensions of inequality data contained in each source. This can be done by creating a second table that is an expanded version of the first table, making note of the available dimensions of inequality contained within each source, and the year for which data are available. Quality considerations specific to the corresponding inequality dimension should be recorded. For easy reference during the next steps of data source mapping, each source can be listed in a separate row and assigned a unique data source reference number.

The third step involves assessing the availability of data about health indicators of interest. Health indicators can be listed as rows. The unique data source reference numbers from the second table are indicated for sources that contain corresponding data about the health indicator. If needed, the indicator definition can be noted for future reference. If the definition of an indicator differs between data sources, the indicators should be listed as separate rows. Data quality considerations pertaining to the health indicator should be noted.

The fourth step is to integrate information from the second and third steps. This indicates the data sources that contain both health indicators and dimensions of inequality data. The health indicators remain as rows, and the dimensions of inequality are listed as columns. The unique data source reference numbers are then translated from the second and third tables to show where data are available.

A fifth step entails identifying possibilities for linking data using common identifiers across sources.

The steps of data source mapping, and in particular the type of additional details extracted, can be tailored to reflect the specific application of inequality monitoring. A series of worksheets for data source mapping is available online (Box 15.1). For an example of the application of data source mapping techniques in Indonesia, see Annex 7.

BOX 15.1. Table templates for data source mapping and data source linking

WHO has developed worksheet templates for data source mapping and data source linking. The worksheets allow users to input and organize information about available data sources. They are available online and can be downloaded and modified for use (2).

Selecting among multiple data sources

In some cases, there may be multiple data source options that cover the relevant health indicator or inequality dimension data requirements for the population of interest. These two streams of data for inequality monitoring do not necessarily need to be sourced from a single data source. Indeed, data about health and data about inequality dimensions can be linked across different sources using common identifiers (see Data source linking below).

Situations may arise where data about the same or similar health indicators or inequality dimensions are available in multiple sources. To determine which source is most suitable, the strengths of each data source should be weighed against its limitations. The suitability of a data source can be evaluated in three ways, considering its inherent purpose and attributes, the general data quality, and its relevance to a specific application of inequality monitoring.

The major types of data sources used in health inequality monitoring have different inherent purposes, reflected in their design and the type of information that they collect (see Chapters 12-14). For example, multicountry household health surveys are appropriate for expansive assessments of the state of inequality in the topic covered by the survey, including comparisons across countries. CRVS systems collect a more limited scope of data and are better suited for monitoring certain inequalities related to birth and death indicators within a country (although linking CRVS data with census or institution-based sources may open opportunities for monitoring of other indicators).

Understanding the inherent purpose of the data source (i.e. what the data source is designed to do) allows for assessment of quality (i.e. how well that data source fulfils its stated purpose). General data quality considerations relate to the relevance, credibility, accuracy, timeliness, methodological soundness, accessibility, completeness and reliability of the data (see Chapter 11). For example, the purpose of an institution-based recordkeeping system may be to record detailed health information and basic demographic information for individuals visiting a clinic: a system that relies on non-standardized paper forms is likely to have lower reliability, methodological soundness and accessibility than a system using standardized electronic forms (see Chapter 13).

Taking into account the considerations above, the overarching scope and purpose of a particular application of inequality monitoring help to define what is deemed a “suitable” data source. There will be variability across different inequality monitoring applications in terms of how recent the data need to be; whether data across multiple time points are required; whether combining data across several years is required to have a large enough sample size for disaggregation (see Chapter 17); the frequency at which the data should be collected; and what population (or populations, if benchmarking is a requirement) should be covered by the data source. For example, monitoring inequalities during the peak of a pandemic requires data covering the entire affected population that are produced quickly and are available on a frequent basis (e.g. daily or weekly) without delay.

On a practical note, additional considerations pertain to data accessibility. It is important to consider the requirements to gain permission to access the data, and whether they can reasonably be met within the timeframe of monitoring. Additionally, consider whether the data are available in a format that is compatible with analysis software, and whether they require “cleaning” to remove values that are incorrect, unreliable, duplicated, outlying or otherwise unsuitable.

Strategies for navigating data scarcity

For certain applications of inequality monitoring, data may be limited or lacking altogether. The following subsections discuss strategies for navigating such situations. If there are no data sources that contain information about both streams of data for health inequality monitoring, but each stream of data is available from a different source, there may still be options to enable monitoring to proceed. Data source linking may be possible if all sources contain a common identifier, such as a personal identity number or postal code. In some cases, empirical measures may not be available, and modelling may be done to yield estimates for population subgroups. Parallel efforts to advocate for expanded and improved data collection may also be warranted.

Ultimately, however, if the necessary data are not available or the quality of available data is deemed too poor, health inequality monitoring cannot proceed as planned. The focus of monitoring may need to be adjusted to accommodate the data that are currently available, selecting health indicators and dimensions of inequality accordingly. Limited data availability for inequality monitoring should be flagged as an area for attention during the planning and design of subsequent data collection activities.

Data source linking involves merging information about health indicators and dimensions of inequality from different data sources using a common identifier.

Data source linking

Data source linking is a useful option to pursue if information about health indicators and dimensions of inequality is available in two or more different data sources. Linking data from multiple sources is possible only if there are common identifiers across sources. Identifiers may be at the individual level (e.g. personal identity numbers) or the small-area level (e.g. postal codes). These identifiers serve as common tags and are used to link sources. Exact matching relies on a unique identifier variable in all sources. Probability matching is an indirect means of linking data across data sources using multiple matching variables. This technique relies on the probabilities of agreement and disagreement for a greater number of matching variables to link between records (3).

Data source linking for the assessment of inequalities is a common practice across many settings. Data source linking at the individual level is particularly feasible in countries where each citizen has a unique identification number for different administrative and health purposes (e.g. filing taxes, registering civil events such as births, marriages and deaths, enrolling in health insurance, making social insurance contributions and claims, being admitted to hospital, or filling medical prescriptions). For example, income derived from tax registers has been linked to death records to assess economic-related inequality in life expectancy (4). Access to and use of individual data and their record linkages are strictly governed and protected, and – if use is permitted – subject to access, privacy and ethical standards.

Data linking using small-area common identifiers may be part of ecological analyses, where health and dimension of inequality information pertains to a small-area level, such as a district, census tract or neighbourhood. For example, deprivation indices combine multiple streams of socioeconomic information into a score for the small area, which can then be linked with health indicator data to assess inequality (5). Additional examples of linking between data sources are provided in Box 15.2.

BOX 15.2. Examples of data source linking

The following examples illustrate linking data sources using individual and small-area identifiers for inequality analyses. A study among adults in Canada measured inequalities in preventable mortality between populations that reported different sexual orientations. It also explored whether the association between sexual orientation and preventable mortality was mediated by one or more variables related to social and material resources. To conduct the analysis, data were linked at the individual level using personal health numbers between the Canadian Community Health Survey and the Canadian Human Mortality Database (which contains information about all deaths registered through vital statistics registries). The results showed that LGBTQI+ people experience more preventable deaths than heterosexual people, and that higher mortality could be explained by differences in marital status and income (6).

A study in Buenos Aires, Argentina explored inequalities in life expectancy related to neighbourhoods and neighbourhood-level socioeconomic factors. The study obtained georeferenced data from death registry records in the vital registration system, which were aggregated at the neighbourhood level (i.e. census units consisting of about 5000 households). Socioeconomic data for census units were obtained from the 2010 census. Analyses were conducted for men and women. The study reported gaps in life expectancy between neighbourhoods in both sexes. For both men and women, differences in life expectancy were reported between the areas with the highest and lowest socioeconomic status (7).

Assessments of the possibility for data linking can be conducted as an extension of the data source mapping exercise described earlier in this chapter. Drawing on the tables created in the first four steps of data source mapping, a fifth table can be created, listing all relevant common identifiers, as well as the unique data source number (imported from Step 2) of sources that contain information about each corresponding identifier. (See Box 15.1 for information about worksheet templates and Annex 7 for an example of data source mapping, including data source linking, in Indonesia.)

After exploring whether data linking is possible, there are a few additional considerations. If proceeding with data linking, experienced data managers or analysts may be consulted to assist with merging data sources using statistical software. It is important to adhere to data protection protocols to ensure confidentiality is maintained, especially if working with individual-level data.

Modelled estimates

Modelling exercises help to fill in data gaps in time or place using existing data sources. Modelling is particularly useful when the availability of reliable empirical measures is restricted, or when relevant information is dispersed across multiple data sources. For example, modelling may be a useful way to generate estimates for health outcomes associated with stigma or for outcomes that are illegal or have a fluctuating legal status, such as HIV status, abortion, use of drugs or alcohol consumption. When modelling yields estimates for population subgroups, the data may be useful for inequality monitoring. Modelled estimates are derived using statistical methods, drawing from available data sources and expert opinions and accounting for known or suspected biases. Estimates should be documented following the Guidelines for Accurate and Transparent Health Estimates Reporting (GATHER) (see Chapter 23) (8).

Global monitoring that relies on modelled estimates can provide a useful general assessment of global and regional inequalities and trends. For example, the WHO State of inequality: HIV, tuberculosis and malaria report included modelled annual estimates of HIV and tuberculosis incidence and mortality and certain HIV testing and treatment indicators to explore sex-related inequality (9). The generation of estimates, however, is subject to several limitations and assumptions (10). Therefore, modelled estimates are less ideal to inform country programme planning and implementation, which ideally should be guided by high-quality empirical data.

Advocating for expanded and improved data collection

Existing sources that contain data of poor quality require strengthening. For example, groups that experience discrimination or exclusion are sometimes not captured adequately in data collection efforts. Depending on the setting, data sources may exclude people living in remote or conflict-affected areas, people engaged in sex work, people who use drugs, people in prisons and other closed settings, migrants and refugees living outside camps, and people for whom aspects of their lives are criminalized, undocumented or hidden.

To explore and expose gaps in the data, reports may choose to publish incomplete data, with sufficient caveats, highlighting the need for improvements. If reporting poor-quality or insufficient data, quantifying the level of uncertainty in the results can help to ensure transparency around the limitations in the data. Overall, efforts to advocate for expanded and improved data collection can drive changes that open new possibilities for inequality monitoring. This is an opportunity for engagement with stakeholders to ensure data collection practices are inclusive, culturally appropriate and respectful (see Chapter 4).

References

1. Hosseinpoor AR, Nambiar D, Suparmi, Kusumawardani N. Data source mapping: an essential step for health inequality monitoring. Glob Health Action. 2018;11(Suppl. 1):1456743. doi:10.1080/16549716.2018.14567.

2. Data source mapping templates. Geneva: World Health Organization; 2023 (https://www.who.int/data/inequality-monitor/tools-resources, accessed 15 May 2024).

3. Blakely T, Salmond C. Probabilistic record linkage and a method to calculate the positive predictive value. Int J Epidemiol. 2002;31(6):1246–1252. doi:10.1093/ije/31.6.1246.

4. Chetty R, Stepner M, Abraham S, Lin S, Scuderi B, Turner N, et al. The association between income and life expectancy in the United States, 2001–2014. JAMA. 2016;315(16):1750. doi:10.1001/jama.2016.4226.

5. Rea M, Tabor D. Health state life expectancies by national deprivation deciles, England: 2018 to 2020. Newport, United Kingdom: United Kingdom Statistics Authority Office for National Statistics; 2022 (https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/healthinequalities/bulletins/healthstatelifeexpectanciesbyindexofmultipledeprivationimd/2018to2020, accessed 15 May 2024).

6. Salway T, Rich AJ, Ferlatte O, Gesink D, Ross LE, Bränström R, et al. Preventable mortality among sexual minority Canadians. SSM Popul Health. 2022;20:101276. doi:10.1016/j.ssmph.2022.101276.

7. Trotta A, Bilal U, Acharya B, Quick H, Moore K, Perner SM, et al. Spatial inequities in life expectancy in small areas of Buenos Aires, Argentina 2015–2017. J Urban Health. 2023;100(3):577–590. doi:10.1007/s11524-023-00730-1.

8. Guidelines for accurate and transparent health estimates reporting. Geneva: World Health Organization (https://www.who.int/data/gather/statement, accessed 29 May 2024).

9. State of inequality: HIV, tuberculosis and malaria. Geneva: World Health Organization; 2021 (https://iris.who.int/handle/10665/350198, accessed 5 June 2024).

10. Victora CG, Boerma T. Inequalities in child mortality: real data or modelled estimates? Lancet Glob Health. 2018;6(5):e477–e478. doi:10.1016/S2214-109X(18)30109-8.