Incremental Machine Learning using Distributed Data Processing Techniques for Malaria Data Across Multiple Online Sources
Published: 15 September 2023
Introduction: Data on severe infectious diseases such as malaria are collected continuously by each country. The resulting data are highly fragmented and must be integrated before differences at the country level or across larger regions can be studied. Data analysis often follows a centralized approach that integrates all relevant data into a new data source, but distributed analysis approaches are also available. The Personal Health Train (PHT) [1] keeps the data where they were collected and sends the algorithm to the data for the intended analysis. This covers both publicly available and private data sources, such as data integration centers at University Medical Centers [2], without giving the scientist direct access to the data.
This work proposes an incremental machine learning approach that trains regression models for time-series prediction on distributed malaria data.
Methods: The malaria data set comprises infection incidences from 107 countries across different continents for the years 2000 to 2018. In our setup, it is distributed over three sources according to the WHO region attribute. Data source 1 (Leipzig University) holds the malaria data for the Eastern Mediterranean and Africa, and source 2 (University of Cologne) holds the data for the Americas and Europe. Source 3 (Southeast Asia and Western Pacific) is web-based and uses the De-NBI cloud, whereas sources 1 and 2 are on-premise solutions. A PHT Station is attached to each source, which makes it possible to include its data in the analysis. We trained a prediction model incrementally on these data sets: the model is built at the first station and fine-tuned at each subsequent station, and the final version is obtained at the last station.
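The station-by-station training described above can be illustrated with a minimal sketch. It is not the authors' implementation: a small linear model trained by stochastic gradient descent stands in for the recurrent networks, the three "stations" hold synthetic data, and all names (`make_station`, `sgd_epochs`, `mse`) are hypothetical. The point is only the control flow: the model state leaves each station and is fine-tuned at the next one, while the data never move.

```python
import random

def sgd_epochs(weights, samples, lr=0.05, epochs=200):
    """Continue training a linear model y ~ w.x + b, starting from `weights`."""
    w, b = list(weights[0]), weights[1]
    for _ in range(epochs):
        for x, y in samples:
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return (w, b)

def mse(weights, samples):
    """Mean squared error of the linear model on the given samples."""
    w, b = weights
    errors = [sum(wi * xi for wi, xi in zip(w, x)) + b - y for x, y in samples]
    return sum(e * e for e in errors) / len(errors)

def make_station(seed, n=30, window=3):
    """Synthetic stand-in for one station's data (assumption, not real data).

    Each sample pairs a window of normalized values with the value to predict.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = [rng.random() for _ in range(window)]
        y = 0.2 * x[0] + 0.3 * x[1] + 0.5 * x[2] + rng.gauss(0, 0.01)
        samples.append((x, y))
    return samples

# Three stations; only the model state is passed from one to the next.
stations = [make_station(seed) for seed in (1, 2, 3)]

model = ([0.0, 0.0, 0.0], 0.0)   # built at the first station
losses = []
for station_data in stations:
    model = sgd_epochs(model, station_data)   # fine-tune on local data only
    losses.append(mse(model, station_data))

final_model = model              # final version, obtained at the last station
```

In a PHT setting, the loop body would run inside each station, and only the serialized model would travel between stations. The order in which stations are visited can affect the result, which is consistent with the direction-dependent scores reported in the results.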
Experiment: The study implemented time-series regression models based on recurrent neural networks: Simple Recurrent Neural Networks, Long Short-Term Memory (LSTM) [3], and Gated Recurrent Units (GRU) [4]. The data were normalized with the min-max technique to accelerate training. The models were trained on data from 2000 to 2017, and the 2018 data served as the label to be predicted. The models were evaluated with 3-fold cross-validation, and three evaluation measures (RMSE, MAE, and R²) were used to compare predicted with observed values. All three models gave satisfactory results for predicting the number of cases, with GRU generally outperforming the other two.
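As an illustration of the preprocessing and evaluation steps, the following sketch shows min-max normalization, the split into input years (2000 to 2017) and label (2018), and the three evaluation measures. It rests on assumptions: the abstract does not state whether normalization was applied per country or over the whole data set (per series is assumed here), and the series values and function names are invented for the example.

```python
import math

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant series: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination; undefined if all observed values are equal."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical yearly case counts for one country, 2000 to 2018 (19 values).
years = list(range(2000, 2019))
series = [100 + 10 * i for i in range(len(years))]

scaled = min_max_scale(series)
inputs, label = scaled[:-1], scaled[-1]   # 2000-2017 as input, 2018 as label

# Comparing observed and predicted values (both invented for illustration).
observed = [0.0, 0.5, 1.0]
predicted = [0.1, 0.5, 0.9]
scores = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "R2": r2(observed, predicted),
}
```

Because the measures are computed on normalized values, RMSE and MAE are expressed on the [0, 1] scale, which matches the magnitude of the scores reported in the results.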
Results and discussion: The experiment showed that African countries had the highest case numbers and variances, and that models trained on data from these countries also produced accurate predictions for other countries. Incremental models trained on data from Stations 1 and 2 had the best fit for Station 3, with an RMSE of 0.01 and an R² of 0.99 for the LSTM model. We also found that the LSTM model trained on data from Stations 2 and 3 had difficulty identifying patterns that allow accurate prediction of case numbers for Station 1 (which contains the data from African countries), resulting in an RMSE of 0.11 and an R² of 0.67. In contrast, the GRU model had an RMSE of 0.06 and an R² of 0.91.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Welten S, Mou Y, Neumann L, Jaberansary M, Ucer Y, Kirsten T, Decker S, Beyan O. A Privacy-Preserving Distributed Analytics Platform for Health Care Data. Methods Inf Med. 2022;61(S 01):e1-e11. DOI: 10.1055/s-0041-1740564
2. Maia M, Jaberansary M, Ucer Y, Beyan O, Kirsten T. Providing Publicly Available Medical Data Access under FAIR Principles for Distributed Analysis. In: Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie, editor. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). 21.-25.08.2022. Düsseldorf: GMS; 2022. DocAbstr. 203. DOI: 10.3205/22gmds071
3. Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997;9(8):1735–1780. DOI: 10.1162/neco.1997.9.8.1735
4. Chung J, Gulcehre C, Cho K, Bengio Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In: Twenty-eighth Conference on Neural Information Processing Systems. 2014. DOI: 10.48550/arXiv.1412.3555