gms | German Medical Science

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH)

08.09. - 13.09.2024, Dresden

Multicentric Outlier Detection Algorithm for Healthcare Laboratory Data

Meeting Abstract

  • Vladimir Belov - Helios IT Service GmbH, Berlin, Germany
  • Clara von Münchow-Pohl - Helios Kliniken GmbH, Berlin, Germany
  • Madeleine Kittner - Helios IT Service GmbH, Berlin, Germany
  • Sarah Löser - Helios IT Service GmbH, Berlin, Germany
  • Peter Martens - Helios IT Service GmbH, Berlin, Germany
  • Sebastian Ortleb - Helios Kliniken GmbH, Berlin, Germany
  • Markus Bockhacker - Helios Kliniken GmbH, Berlin, Germany

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 414

doi: 10.3205/24gmds158, urn:nbn:de:0183-24gmds1585

Published: 6 September 2024

© 2024 Belov et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Precise laboratory results are critical in healthcare, as they often provide the initial insights into a patient's medical condition and guide subsequent diagnostic and treatment decisions [1]. Technical errors in laboratory results can lead to false diagnoses and suboptimal treatment strategies. It is therefore crucial to differentiate between implausible values resulting from technical errors and realistic values that reflect true clinical conditions. Manually setting clinically plausible ranges is impractical in the Big Data domain, so automated and scalable solutions are required to ensure consistency across diverse laboratory datasets. Using multicentric clinical data enhances outlier detection because it makes it possible to pinpoint site-specific variances, which indicate a greater likelihood of unrealistic laboratory results. Our multicentric outlier detection algorithm systematically evaluates laboratory results from 68 medical sites, standardized according to LOINC (Logical Observation Identifiers Names and Codes), and creates robust reference ranges across a large set of laboratory results.

Methods: We developed an outlier detection algorithm that considers statistics from individual sites as well as statistics aggregated across all sites. Specifically, we extract the minimum and maximum values from each site and calculate the median M of these values across all sites. We then define the lower bound of the range as M minus the 33rd percentile and the upper bound as M plus the 66th percentile. This approach makes the ranges less sensitive to sites prone to producing extreme values, while still yielding ranges sensitive enough to capture all plausible values. Values outside the ranges are flagged as outliers. The approach was validated using data from 68 sites, encompassing a total of 1630 LOINC laboratory codes. We excluded LOINC tests represented at fewer than three sites, leaving 530 codes for analysis. The algorithm was developed and tested within a PySpark environment.
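The range construction described in the Methods can be sketched as follows. This is an illustrative plain-Python sketch under stated assumptions, not the authors' PySpark implementation. The abstract does not specify which distribution the percentiles are taken from; the sketch assumes that the lower bound is the median of the per-site minima minus the 33rd percentile of those minima, and that the upper bound is the median of the per-site maxima plus the 66th percentile of those maxima. The function names, the record layout (LOINC code, site, value) and the percentile interpolation method are likewise hypothetical.

```python
from collections import defaultdict
from statistics import median, quantiles

MIN_SITES = 3  # the abstract excludes codes represented at fewer than three sites


def percentile(values, p):
    """Return the p-th percentile (1..99) with linear interpolation."""
    return quantiles(values, n=100, method="inclusive")[p - 1]


def plausible_ranges(records):
    """Compute one (lower, upper) range per LOINC code.

    records: list of (loinc_code, site_id, value) tuples (assumed layout).
    """
    # Group the values by code and by site.
    per_site = defaultdict(lambda: defaultdict(list))
    for code, site, value in records:
        per_site[code][site].append(value)

    ranges = {}
    for code, sites in per_site.items():
        if len(sites) < MIN_SITES:
            continue  # too few sites to aggregate across
        site_minima = [min(values) for values in sites.values()]
        site_maxima = [max(values) for values in sites.values()]
        # ASSUMPTION: percentiles are taken over the per-site extremes;
        # the abstract leaves this detail open.
        lower = median(site_minima) - percentile(site_minima, 33)
        upper = median(site_maxima) + percentile(site_maxima, 66)
        ranges[code] = (lower, upper)
    return ranges


def flag_outliers(records, ranges):
    """Return the records whose value lies outside the range of their code."""
    flagged = []
    for code, site, value in records:
        if code in ranges:
            lower, upper = ranges[code]
            if value < lower or value > upper:
                flagged.append((code, site, value))
    return flagged
```

Under this reading, a single site that reports an extreme maximum shifts the median and the 66th percentile of the site maxima only slightly, so the upper bound stays close to what the remaining sites suggest and the extreme value itself is flagged.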

Results: The samples provided enabled us to validate the effectiveness of our novel outlier detection algorithm by manually evaluating the computed boundaries. All LOINC laboratory tests were categorized into two groups based on their value distributions: 1) those where only positive values are valid, and 2) those where both positive and negative values are possible. Our algorithm successfully distinguished between implausible and real values for both groups of laboratory tests.

Discussion: While numerous studies have explored various outlier detection algorithms for laboratory results [2,3], our novel algorithm distinctively incorporates both the distribution of laboratory results from each site and the aggregate data. This dual approach reduces sensitivity to extreme values from sites prone to producing unrealistic values. Consequently, our method provides a general solution that can be applied to each site. Furthermore, the computational efficiency of our algorithm allows immediate implementation in a real-world setting.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Miriovsky BJ, Shulman LN, Abernethy AP. Importance of health information technology, electronic health records, and continuously aggregating data to comparative effectiveness research and learning health care. J Clin Oncol. 2012;30(34):4243-8. DOI: 10.1200/JCO.2012.42.8011
2. Monjas AM, Ruiz DR, Pérez-Rey D, Palchuk M. Automatic Outlier Detection in Laboratory Result Distributions Within a Real World Data Network. Stud Health Technol Inform. 2023;302:88-92. DOI: 10.3233/SHTI230070
3. Estiri H, Klann JG, Murphy SN. A Clustering Approach for Detecting Implausible Observation Values in Electronic Health Records Data. BMC Med Inform Decis Mak. 2019;19:142. DOI: 10.1186/s12911-019-0852-6