gms | German Medical Science

20. Jahrestagung des Deutschen Netzwerks Evidenzbasierte Medizin e. V.

Deutsches Netzwerk Evidenzbasierte Medizin e. V.

21. - 23.03.2019, Berlin

Predicting health care expenditures with statistical learning methods under increasing data complexity – an application using German private health insurance data

Meeting Abstract


  • Jan Dyczmons - Deutsches Diabetes-Zentrum, Institut für Versorgungsforschung und Gesundheitsökonomie, Germany

EbM und Digitale Transformation in der Medizin. 20. Jahrestagung des Deutschen Netzwerks Evidenzbasierte Medizin. Berlin, 21.-23.03.2019. Düsseldorf: German Medical Science GMS Publishing House; 2019. Doc19ebmS1-V1-05

doi: 10.3205/19ebm005, urn:nbn:de:0183-19ebm0055

Published: March 20, 2019

© 2019 Dyczmons.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.



Background/research question: How to capitalize on the recent trend toward “Big Data” collection is currently one of the main challenges in predictive analytics. This study compares two popular machine-learning methods (gradient boosting machines and lasso regression) with two traditional methods (ordinary least squares regression and regression trees) in terms of prediction quality under increasing data complexity. To analyze the interdependence of prediction quality and data complexity, the study aims to show

1. whether recent machine-learning methods have an advantage over traditional methods in terms of prediction quality as information complexity increases, and
2. whether an increase in information complexity leads to better prediction results.

Methods: Individual claims data from a large German private health insurer are used, which include the costs and the diagnosed disease of every claim that each insured person submitted during a seven-year period. Using the current year’s aggregated medical expenditure per disease, each individual’s total cost for the next year is predicted. Diseases are coded according to the International Classification of Diseases (ICD), with 1,669 distinct diseases encoded in the data. Because of the hierarchical structure of the ICD, these diseases can also be aggregated into 237 categories or 22 chapters, which are interpreted as different levels of complexity. This setting allows an analysis of how different prediction methods handle the same information at different levels of complexity.
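The study design described above can be sketched as follows. All numbers, feature scales, and the scikit-learn model choices here are illustrative assumptions for synthetic data, not the study's actual pipeline or data: per-person disease-level costs serve as features, a coarser level is obtained by summing costs within (hypothetical) categories, and all four method families are fitted at both levels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for claims data: current-year cost per person per disease
# code (sparse, since most people do not claim for most diseases).
n_persons, n_codes = 500, 100   # the real data have 1,669 ICD codes
X_codes = rng.gamma(1.0, 200.0, (n_persons, n_codes)) \
    * (rng.random((n_persons, n_codes)) < 0.05)

# Coarser complexity level: sum code-level costs within hypothetical
# categories of 10 codes each (mimicking the ICD hierarchy).
X_categories = X_codes.reshape(n_persons, 10, -1).sum(axis=2)

# Next-year total cost: driven by a few diseases plus noise (an assumption).
beta = np.zeros(n_codes)
beta[:5] = 1.5
y = X_codes @ beta + rng.normal(0.0, 100.0, n_persons)

models = {
    "OLS": LinearRegression(),
    "Lasso": Lasso(alpha=10.0, max_iter=5000),
    "Tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "GBM": GradientBoostingRegressor(random_state=0),
}

# Fit every method at every complexity level and compare test-set R^2.
r2 = {}
for level, X in [("categories", X_categories), ("codes", X_codes)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    for name, model in models.items():
        r2[(level, name)] = r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
        print(f"{level:10s} {name:6s} R^2 = {r2[(level, name)]:.3f}")
```

Because the category features are exact sums of the code features, any difference in scores between the two levels reflects only how each method handles the finer-grained representation of the same information.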

Results: The results do not show a clear dominance of any machine-learning method when the 22 chapters or 237 categories are used. Overall prediction quality declines for all four methods when the 1,669 variables of the ICD-code level are used. However, whereas the machine-learning methods display only a small decline, the traditional methods fail completely. The estimated machine-learning models at the ICD-code level discard about 95% of the variables.
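The variable discarding reported here is a built-in property of the lasso penalty, which sets uninformative coefficients exactly to zero. A minimal sketch on synthetic data (the dimensions, penalty strength, and effect sizes are illustrative assumptions) shows how the fraction of discarded variables can be read off the fitted coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Many candidate disease variables, but only a few truly drive next-year cost.
n, p = 400, 1000
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:20] = 2.0                     # 20 informative variables (an assumption)
y = X @ beta + rng.normal(size=n)

# The L1 penalty shrinks uninformative coefficients exactly to zero.
lasso = Lasso(alpha=0.5, max_iter=5000).fit(X, y)
dropped = float(np.mean(lasso.coef_ == 0.0))
print(f"fraction of variables discarded: {dropped:.2%}")
```

The same logic applies at the study's ICD-code level: with 1,669 candidate variables, a penalized model can keep the few informative ones while an unpenalized OLS fit must estimate a coefficient for every variable.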

Conclusions: The results indicate that under increasing data complexity there is a trade-off between the gain from additional information and the ability to process that information. Thus, the ability to discard uninformative variables may be essential for achieving good predictions in “Big Data” applications.