gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Longitudinal data clustering using k-mean trajectories and missing data prediction using Recurrent Neural Network

Meeting Abstract

  • Halimu N. Haliduola - University of Munich (LMU), Munich, Germany
  • Frank Bretz - Novartis Pharma AG, Basel, Switzerland
  • Ulrich Mansmann - Universität München, München, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 7

doi: 10.3205/20gmds253, urn:nbn:de:0183-20gmds2531

Published: February 26, 2021

© 2021 Haliduola et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: In clinical trials, missing data are data that would be meaningful for the analysis of a given estimand but were not collected (ICH E9 addendum, 2019). If not handled properly, missing data reduce the effective sample size and therefore the statistical power of the analysis. In addition, patients who drop out may have extreme values, e.g., when treatment failure leads to dropout. Losing these patients can therefore lead to an underestimate of variability and hence to an overly narrow confidence interval for the treatment effect. Missing data may also bias the study result if patients with missing data are excluded from the analysis, e.g., if the unobserved measurements include a higher proportion of poor outcomes, or if missing values are more likely in one treatment arm because it is less effective than the other. Finally, missing data may affect the external validity of the study outcome, i.e., the representativeness of the study sample in relation to the target population.
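The bias mechanism described in the Background can be illustrated with a small simulation. This is a minimal sketch and not taken from the study: the sample size, the true effect of 1.0, the cutoff and the dropout probabilities are all hypothetical values chosen for illustration, and `drop_mnar` is a helper name introduced here.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)
n = 20000            # patients per arm (hypothetical)
true_effect = 1.0    # true mean difference, treated minus control (hypothetical)

# Higher values are better outcomes; the control arm is less effective.
control = [rng.gauss(0.0, 1.0) for _ in range(n)]
treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]


def drop_mnar(outcomes, rng, cutoff=0.0, p_poor=0.6, p_good=0.1):
    """Return only the outcomes that remain observed.

    Dropout depends on the outcome itself (MNAR): values below `cutoff`
    (poor outcomes) go missing with probability `p_poor`, all other
    values with probability `p_good`. All three values are assumptions.
    """
    return [y for y in outcomes
            if rng.random() >= (p_poor if y < cutoff else p_good)]


obs_control = drop_mnar(control, rng)
obs_treated = drop_mnar(treated, rng)

full_effect = mean(treated) - mean(control)
cc_effect = mean(obs_treated) - mean(obs_control)  # complete-case estimate

print(f"observed: control {len(obs_control)}/{n}, treated {len(obs_treated)}/{n}")
print(f"effect, full data:      {full_effect:.3f}")
print(f"effect, complete cases: {cc_effect:.3f}")
print(f"control SD, full vs observed: {stdev(control):.3f} vs {stdev(obs_control):.3f}")
```

Under these assumptions the less effective arm loses more patients, and the complete-case estimate falls below the full-data estimate. The direction and size of the bias depend entirely on the assumed dropout mechanism.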

Methods: A machine-learning-based framework for predicting missing data was developed on simulated data, with a focus on Missing Not at Random (MNAR) data. Techniques for imbalanced data were applied, e.g., stratified k-fold cross-validation and oversampling of the minority class. To make these methods applicable to longitudinal continuous data, the data were first clustered via k-mean trajectories. A Recurrent Neural Network (RNN) was used to model the longitudinal data. An RNN can learn to retain the parts of its past state that are relevant for predicting future outcomes. This allows it to exhibit temporal dynamic behavior over a time sequence, which makes it an appropriate tool for longitudinal clinical data. Several RNN architectures and hyperparameter settings were evaluated, and the final model was selected based on the bias-variance tradeoff. Several types of input were considered: the initial state of the RNN (baseline value), static inputs (patient-level information) and dynamic inputs (variables that change over time). To improve prediction accuracy and to account for the uncertainty of the predictions, bootstrap aggregating (bagging) was applied.
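One possible reading of the preprocessing steps is sketched below: trajectories are clustered with k-means, and the resulting cluster labels then serve as classes for stratified folds and minority oversampling. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how cluster labels were used, and the toy data, the function names and the deterministic farthest-point initialisation are all choices made for this sketch.

```python
import random
from collections import Counter, defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_trajectories(trajs, k, n_iter=100):
    """k-means on trajectories, each treated as one vector of visit values."""
    # Farthest-point initialisation: deterministic, chosen for this sketch.
    centers = [list(trajs[0])]
    while len(centers) < k:
        far = max(trajs, key=lambda t: min(sq_dist(t, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(n_iter):
        # Assignment step: each trajectory joins its nearest centre.
        new_labels = [min(range(k), key=lambda c: sq_dist(t, centers[c]))
                      for t in trajs]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centre becomes the visit-wise mean of its members.
        for c in range(k):
            members = [t for t, lab in zip(trajs, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def stratified_folds(labels, n_folds, seed=0):
    """Split indices into folds that preserve the cluster proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(n_folds)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % n_folds].append(i)
    return folds


def oversample(indices, labels, seed=0):
    """Resample smaller clusters with replacement up to the largest cluster."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(v) for v in by_class.values())
    out = []
    for idx in by_class.values():
        out.extend(idx)
        out.extend(rng.choices(idx, k=target - len(idx)))
    return out


# Toy data (hypothetical): 40 stable and 10 worsening patients, 5 visits each.
rng = random.Random(42)
visits = range(5)
stable = [[rng.gauss(0.0, 0.3) for _ in visits] for _ in range(40)]
worsening = [[2.0 * t + rng.gauss(0.0, 0.3) for t in visits] for _ in range(10)]
trajs = stable + worsening

labels, centers = kmeans_trajectories(trajs, k=2)
folds = stratified_folds(labels, n_folds=5)
train = [i for f in folds[1:] for i in f]   # fold 0 held out for validation
balanced = oversample(train, labels)        # oversample the training part only

print("cluster sizes:", Counter(labels))
print("training classes after oversampling:", Counter(labels[i] for i in balanced))
```

Oversampling is applied only to the training folds here, so that duplicated minority trajectories do not leak into the held-out fold. In practice a library implementation (for example a stratified splitter and a random oversampler) would likely replace these hand-written helpers.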

Results: In all simulation scenarios, the proposed method gave the estimate closest to the true treatment effect among the methods compared. The predictions were evaluated at the individual patient level by visualizing the efficacy profiles, and at the overall population level by comparing the treatment effects estimated with the different methods.

Conclusion: Overall, the proposed method provided plausible individual predictions for both Missing Completely at Random (MCAR) and MNAR data and reduced the bias in treatment effect estimation caused by missing data.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.