Indicators of data quality: review and requirements from the perspective of networked medical research
Indikatoren zur Datenqualität: Stand und Anforderungen aus Sicht der vernetzten medizinischen Forschung
Published: July 9, 2019
Data quality is of the highest importance for quantitative medical research. A common set of indicators for data quality is needed to cope with future challenges in data management for biomedical informatics. A guideline for adaptive data management was developed in 2006, which offers indicators for data quality organized in three categories: integrity, organization, and trueness. The guideline was revised bottom-up in 2014 by extending its content with standards from a cancer registry, a cohort, and a data repository in Germany. In parallel, a systematic literature review identified indicators of data quality published since 2005, using Medline as the literature database. In its second version, the guideline differentiates 51 indicators (integrity: 30, organization: 15, trueness: 6). The literature review identified 34 indicators in 31 articles. Comparing both sets revealed a lack of indicators in the literature addressing the organizational aspects of data sets. Conversely, indicators useful for data sets used in health care practice, such as timeliness, were missing from the guideline’s set. The comparison is a first step towards a common set of indicators. Beyond an agreed denomination of the indicators, this set should offer operational definitions that support reliable application by different parties to different data sets. Furthermore, a systematic organization of the indicators would foster an appropriate selection of individual indicators according to specific use cases.
Data are the treasure of quantitative research. Strenuous efforts are undertaken to obtain high-quality data [1]. Metadata are defined, data acquisition is standardized, data collection is supported by plausibility checks, data quality is reported to study sites, and recorded data are compared to the original ones, to mention only some of the available methods for achieving high data quality. However, those methods are only beneficial if their success is checked. Moreover, those methods could be tailored to the level of the assessed data quality [2].
Data are increasingly used beyond their original context, for example data from the electronic patient record in clinical trials [3]. In that case, an assessment of the data quality is needed to decide whether the data are appropriate to answer a specific research question [4]. The use of indicators or key performance measures is an established methodology in health care to assess quality [5]. Results that are closer to a predefined goal or closer to an optimum indicate better quality. The use of quality indicators is now becoming accepted for the assessment of data quality as well. For example, cancer registries have a long tradition of calculating measures such as case completeness, data completeness, and validity [6].
There is a strong emphasis in data quality research on synthesizing a conceptual framework covering terminological and ontological aspects. Wang and Strong distinguished four dimensions of data quality through a systematic approach: intrinsic data quality, contextual data quality, representational data quality, and accessibility data quality [7]. Fifteen indicators were assigned to one dimension each and briefly described by a single sentence. However, this work was not elaborated with health care in mind. Botsis et al. reduced data quality to the aspects of incompleteness (i.e. missing information), inconsistency (i.e. information mismatch), and inaccuracy [8]. Weiskopf and Weng proposed completeness, correctness, concordance, plausibility, and currency as dimensions [9]. They defined currency as “a relevant representation of the patient state at a given point in time”. Recency and timeliness were listed as related terms. Furthermore, Weiskopf and Weng extended the perspective of data quality dimensions with seven data quality assessment methods, such as data source agreement. Kahn et al. condensed the top-level dimensions of Weiskopf and Weng into the data quality categories conformance, completeness, and plausibility, related to the data quality assessment contexts verification and validation [10].
More than ten years ago, a group within the TMF – Technology, Methods, and Infrastructure for Networked Medical Research, an umbrella organization for networked medical research in Germany, developed a guideline for an adaptive management of data quality [2], [11]. This work started with the aim of supporting the practice of data management, in contrast to the conceptual approaches introduced before. The methodology applied was influenced by quality research, in particular health care quality research. It rests on three premises:
- quality can be described on Donabedian’s levels of structures, processes, and outcomes [12],
- quality is measured using indicators [5], and
- continuous quality improvement is driven by the quality circle of Deming [13].
Central to the guideline are, first, the measurement of data quality using a set of indicators and, second, the adaptation of source data verification and feedback to the level of data quality that the indicator results reveal. The first version of the guideline, published in 2006, included 24 indicators organized in three categories: plausibility (10 indicators), organization (7), and trueness (7) [11]. Plausibility referred to Donabedian’s level of structures, organization to the level of processes, and trueness to the level of outcomes. The indicators were identified based on a systematic literature review. Due to the focus of the TMF members, the guideline particularly addressed the needs of cohorts and registries.
A different approach was applied in the revision of the guideline [14]. On the one hand, the list of indicators was evaluated and extended bottom-up, making use of real-world examples of cohorts, data repositories, and registries. On the other hand, the systematic literature review concerning data quality was updated in parallel. The objective of the current study was to compare both results in order to find gaps that could be closed in future work and to identify consensus that could help to establish a consistent and unambiguous matrix of indicators of data quality.
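The adaptive principle can be illustrated with a minimal sketch. The function name `sdv_sampling_fraction`, the cut-offs, and the sampling fractions are assumptions made purely for illustration; they are not taken from the guideline, whose actual rules are not reproduced here.

```python
def sdv_sampling_fraction(indicator_results, targets):
    """Return the share of records to verify against source documents.

    Both arguments map indicator names to rates in [0, 1], where a higher
    rate means better quality. All cut-offs below are hypothetical.
    """
    if not targets:
        raise ValueError("at least one indicator target is required")
    # Count the indicators whose measured rate reaches its target;
    # an indicator without a result is treated as having missed it.
    met = sum(
        1
        for name, target in targets.items()
        if indicator_results.get(name, 0.0) >= target
    )
    share_met = met / len(targets)
    if share_met >= 0.9:
        return 0.05  # good data quality: verify a small sample only
    if share_met >= 0.7:
        return 0.20  # moderate data quality: verify a larger sample
    return 1.0       # poor data quality: verify every record


# Hypothetical indicator names and values.
targets = {"completeness": 0.95, "range_conformance": 0.98}
good = sdv_sampling_fraction({"completeness": 0.97, "range_conformance": 0.99}, targets)
poor = sdv_sampling_fraction({"completeness": 0.80, "range_conformance": 0.99}, targets)
```

With these assumed cut-offs, a data set that meets all targets is verified by sampling, whereas a data set that misses half of its targets falls back to full source data verification.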
Material and methods
Guideline revision
A bottom-up approach was applied in order to evaluate and update the list of quality indicators [14]. Projects were identified that are suited as proxies for cohorts, registries, and data repositories representing the main types of quantitative research of the TMF members. The measures for data quality used by those projects were collected and mapped onto the list of indicators defined in the first version of the guideline. Measures missing in the guideline were added based on a consensual decision by the study participants.
The proxies were as follows.
- One statutory epidemiological cancer registry participated as proxy for registries, being an active member of the Association of Population-based Cancer Registries in Germany (GEKID).
- The Study of Health in Pomerania (SHIP) participated as a proxy for cohort studies. SHIP is a large population-based epidemiological study in the region of Western Pomerania, Germany.
- The Open European Nephrology Science Center (OpEN.SC) represented data repositories. Data repositories collect data from a wide range of studies without a predefined research question. The data are then provided to third parties.
Systematic literature review
The literature review followed a standardized approach according to the recommendations on Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [15]. Medline was used as the literature database. Citations from 2005 to March 2013 in English and German were included. The queries covered the following terms in several combinations: clinical trial, cohort, data accuracy, data collection, data quality, feedback, fraud, medical registry, quality assessment, quality control, registries, and source data verification. The selection of the relevant literature was conducted in two steps and controlled by an overlapping evaluation between three raters. Decisions on questionable citations were made by consensus. The indicators from the systematic literature review were mapped to the guideline’s list of indicators. Both denominations and definitions were used for the mapping.
Quality indicators proposed by the guideline
The second version of the guideline was expanded to 51 quality indicators organized in three categories: integrity (formerly named plausibility, 30 indicators), organization (15), and trueness (6) [16]. For the first time, an indicator addressing the quality of metadata was included. This indicator belongs to the category integrity. In the second version, the structured description of each indicator was supplemented by information about the appropriate context. Three possibilities were differentiated for that context:
- an indicator can be calculated for an individual record,
- an indicator can be calculated for an individual observational unit,
- an indicator can be calculated for a complete data set.
The third context, the complete data set, is the traditional one in the application of indicators. Table 1 [Tab. 1] shows the list of all 51 indicators. Each quality indicator is defined in a structured format dating back to recommendations of the Joint Commission on Accreditation of Healthcare Organizations (JCAHO) [5] using the following attributes: name, description, definition of terms, identifier, type of indicator (structure, process or outcome), literature references, context (see above), alternative definitions, comments, numerator, denominator, subcategories, method of calculation, interpretation of results, predictors and confounders (cf. Attachment 1 [Attach. 1] for an example).
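How a numerator and a denominator combine with the three contexts can be shown in a short sketch. It models only a subset of the attributes listed above; the class `Indicator`, the identifier `EX-1`, the data elements, and the example records are hypothetical and do not correspond to any indicator in Table 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[object]]


@dataclass
class Indicator:
    """Hypothetical subset of the structured indicator format."""
    identifier: str
    name: str
    indicator_type: str  # "structure", "process" or "outcome"
    numerator: Callable[[List[Record]], int]
    denominator: Callable[[List[Record]], int]

    def rate(self, records: List[Record]) -> Optional[float]:
        """Numerator divided by denominator; None if the denominator is zero."""
        d = self.denominator(records)
        return self.numerator(records) / d if d else None


MANDATORY = ("birth_year", "sex", "diagnosis")  # assumed mandatory data elements


def count_missing(records: List[Record]) -> int:
    # Number of mandatory data elements without a value.
    return sum(1 for r in records for f in MANDATORY if r.get(f) is None)


def count_expected(records: List[Record]) -> int:
    # Number of mandatory data elements that should carry a value.
    return len(records) * len(MANDATORY)


missing_rate = Indicator(
    "EX-1", "Missing values in mandatory data elements", "structure",
    count_missing, count_expected,
)

records = [
    {"unit": "P1", "birth_year": 1950, "sex": "f", "diagnosis": None},
    {"unit": "P1", "birth_year": 1950, "sex": "f", "diagnosis": "C50"},
    {"unit": "P2", "birth_year": None, "sex": None, "diagnosis": "C61"},
]

# The same definition applied in the three contexts.
per_record = missing_rate.rate(records[:1])                                # one record
per_unit = missing_rate.rate([r for r in records if r["unit"] == "P1"])   # one observational unit
per_data_set = missing_rate.rate(records)                                  # complete data set
```

Only the set of records passed to `rate` changes between the contexts, which is one possible reading of why the context can be stored as a single attribute of the indicator description.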
Quality indicators identified in the literature review
The systematic literature review yielded 39 articles concerned with either indicators of data quality, feedback about data quality, or source data verification [7], [9], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53]. Thirty-one of the 39 articles included information about 34 different quality indicators. Table 2 [Tab. 2] shows the list of the quality indicators along with a reference to the indicators listed in Table 1 [Tab. 1]. Ten indicators from the guideline could not be mapped to any of the indicators mentioned in the literature (20% of the 51 indicators). Four of those ten indicators had been introduced by representatives of the cancer registry, two by representatives of SHIP, and one by representatives of the data repository from OpEN.SC.
Thirteen indicators mentioned in the literature were not addressed in the guideline (38% of the 34 indicators): accessibility, appropriate amount of data, availability, believability, contextualization, granularity, inaccuracy, policy relevance, predictive value, relevancy, responsiveness of data items, spatial stability, and timeliness. Combining both sets, 64 indicators were available.
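The reported shares and the size of the combined set follow directly from the four counts given in the text, as the following arithmetic check shows. The variable names are illustrative only.

```python
guideline_total = 51       # indicators in the second version of the guideline
literature_total = 34      # indicators found in the literature review
guideline_unmatched = 10   # guideline indicators without a literature counterpart
literature_unmatched = 13  # literature indicators without a guideline counterpart

# Shares as reported in the text, rounded to whole percent.
share_guideline = round(100 * guideline_unmatched / guideline_total)     # 10 / 51
share_literature = round(100 * literature_unmatched / literature_total)  # 13 / 34

# Literature indicators that could be mapped add nothing new, so the
# combined set is the guideline list plus the unmatched literature indicators.
literature_matched = literature_total - literature_unmatched
combined = guideline_total + literature_unmatched

print(share_guideline, share_literature, literature_matched, combined)
# prints: 20 38 21 64
```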
The second version of the TMF guideline for the management of data quality in cohorts and registries offers 51 indicators for the assessment of data quality. A systematic literature review revealed 34 quality indicators used in 31 articles. Three indicators were mentioned ten times or more: accuracy, comprehensiveness and correctness. The mapping between both sets of indicators failed for ten of the TMF indicators (1031, 1032, 1033, 1034, 1037, 1038, 1040, 1041, 1042, and 1047). Nine out of those ten indicators are defined in the category organization. There seems to be a lack of awareness in the literature of the importance of measures related to the organization of cohorts and registries. Furthermore, seven of the missing indicators in the category organization were applied in the data management of real data sets. Indicator TMF-1047 “Compliance with operating procedures” is missing not only in the reviewed literature but also in the conceptual frameworks introduced above [7], [9], [10]. This neglects the requirement that empirical research comply with predefined procedures, e.g. the timeline of follow-ups defined in the study protocol. One might assume that past work on data quality focused on the data itself and neglected, to some extent, the importance of process-related issues.
Thirteen out of the 34 indicators from the literature remained without a corresponding TMF indicator. Twelve were less frequently mentioned in the literature, with only one or two citations. Timeliness was mentioned six times. Timeliness is an important issue particularly in diagnosis and treatment as well as for reimbursement, but it is of minor importance for research purposes. However, to offer a comprehensive set of quality indicators, timeliness should be added. The other twelve overlap to some extent with other indicators proposed in the literature. For example, there is an unclear relationship between accuracy, believability, correctness, and validity. Proposals have been made in the literature to clarify the definitions [54]. However, the differentiation is still unclear. Contextualization, policy relevance, responsiveness of data items, and spatial stability extend the list of TMF indicators as well as the health care related conceptual frameworks mentioned before. These indicators could be assigned to the dimension contextual data quality defined by Wang and Strong, who already stated that contextual data quality “was not explicitly recognized in the data quality literature” [7]. Possibly, this dimension conflicts with the paradigm of empirical research and health care to define the tasks first and collect the required data second. Under that paradigm, the usefulness of the data is guaranteed by the predefined usage. However, in view of an increasing use of already existing data, contextual data quality might gain importance in the future [55], [56], [57].
Some shortcomings have to be mentioned concerning the list of quality indicators in Table 1 [Tab. 1]. The granularity of the indicator denominations and definitions varies, with broad measures such as concordance on the one hand and particular measures such as the rate of Death Certificate Only cases (DCO rate) on the other hand. The hierarchical organization is an attempt to address those differences. However, this solution is still suboptimal. Terms like “data element” and “value” are not always precise enough to represent the content of the indicator by its denomination. Therefore, the structured description offered in the long version of the guideline is essential to understand the meaning of an indicator.
The list of indicators for data quality derived in the presented project (cf. Table 1 [Tab. 1]) covers many of the concepts used in the literature. It combines different perspectives, all relevant for data quality:
- the perspective of data management responsible for data collection and data control,
- the perspective of data users, being unable to influence the process of data acquisition, and
- the perspective of process owners, defining the host projects and studies.
Therefore, that list could be the starting point for a harmonization of indicators of data quality, which is urgently needed given the variety and sometimes incompatibility of the measures mentioned in the literature. A next step will be to offer a synthesis of both lists presented here along with precise definitions. This could be a valuable mission for standardization organizations that deal with data in health care and health care research. This work should take into account proposals for a formal definition of indicators [58]. Formal definitions would enable an automatic application of indicators to data sets, for example by offering a syntax for statistical software based on standards for metadata [59].
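What an automatic application driven by metadata might look like is sketched below. The metadata layout, the data elements, and the function `inadmissible_value_rate` are assumptions for illustration and do not follow any particular metadata standard or the formal definitions proposed in [58], [59].

```python
# Hypothetical metadata: admissible values per data element.
METADATA = {
    "systolic_bp": {"min": 50, "max": 300},
    "sex": {"permitted": {"f", "m", "d"}},
}


def violates(value, rule):
    """True if a non-missing value is not admissible under its metadata rule."""
    if "permitted" in rule:
        return value not in rule["permitted"]
    return not (rule["min"] <= value <= rule["max"])


def inadmissible_value_rate(records, metadata):
    """Share of recorded values that violate the metadata.

    Missing values are skipped here; they would be covered by a
    separate completeness indicator.
    """
    checked = 0
    violations = 0
    for record in records:
        for element, rule in metadata.items():
            value = record.get(element)
            if value is None:
                continue
            checked += 1
            if violates(value, rule):
                violations += 1
    return violations / checked if checked else None


records = [
    {"systolic_bp": 120, "sex": "f"},
    {"systolic_bp": 900, "sex": "x"},   # both values inadmissible
    {"systolic_bp": None, "sex": "m"},  # missing value is not counted
]
rate = inadmissible_value_rate(records, METADATA)  # 2 violations in 5 checked values
```

Because the rules live in the metadata rather than in the code, the same function could in principle be applied unchanged to a different data set that ships with its own metadata.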
