Article
Longitudinal data clustering using k-mean trajectories and missing data prediction using Recurrent Neural Network
Published: 26 February 2021
Background: In clinical trials, missing data are data that would be meaningful for the analysis of a given estimand but were not collected (ICH E9 addendum, 2019). If not handled properly, missing data reduce the effective sample size and therefore lower the statistical power of the analysis. In addition, patients who drop out of the trial may have extreme values, e.g., when treatment failure leads to dropout. Losing these patients can therefore lead to an underestimate of variability and hence an artificially narrow confidence interval for the treatment effect. Missing data may also bias the study result if patients with missing data are excluded from the analysis, e.g., if the unobserved measurements contain a higher proportion of poor outcomes, or if missing values are more likely in one treatment arm because it is less effective than the other. Finally, missing data may affect the external validity of the study outcome, i.e., the representativeness of the study sample in relation to the target population.
Methods: A machine learning based framework for predicting missing data was developed on simulated data, with a focus on data that are Missing Not at Random (MNAR). Techniques for imbalanced data were applied, e.g., stratified k-fold cross-validation and oversampling of the minority class. To apply these methods to longitudinal continuous data, the trajectories were first clustered via k-means. A Recurrent Neural Network (RNN) was used to model the longitudinal data. An RNN can learn to retain states from the past that are relevant for predicting future outcomes. This allows it to exhibit temporal dynamic behavior over a time sequence, which makes it an appropriate tool for longitudinal clinical data. Different RNN architectures and hyperparameter settings were explored, and the optimal model was selected by considering the bias-variance tradeoff. Several types of input data were considered: the initial state of the RNN (baseline value), static inputs (patient-level information) and dynamic inputs (variables that change over time). To improve prediction accuracy and to account for prediction uncertainty, bootstrap aggregating (bagging) was applied.
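The clustering step can be illustrated with a minimal sketch. The abstract does not state which implementation or distance measure was used, so the following assumes complete trajectories of equal length, squared Euclidean distance between trajectories, and plain Lloyd-style k-means. The function names (`sq_dist`, `mean_trajectory`, `kmeans_trajectories`) and the toy data are hypothetical and only serve the illustration. The resulting cluster labels could then act as classes for stratified cross-validation or oversampling, although the abstract does not spell out that link.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_trajectory(trajs):
    """Pointwise (visit-by-visit) mean of a non-empty list of trajectories."""
    n = len(trajs)
    return [sum(vals) / n for vals in zip(*trajs)]


def kmeans_trajectories(trajs, k, n_iter=100, seed=0):
    """Cluster trajectories with plain k-means (assumed setup, see lead-in).

    Returns (labels, centroids), where each centroid is a mean trajectory.
    """
    rng = random.Random(seed)
    # Initialise centroids with k distinct observed trajectories.
    centroids = [list(t) for t in rng.sample(trajs, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each trajectory joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(t, centroids[c])) for t in trajs
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [t for t, lab in zip(trajs, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_trajectory(members)
    return labels, centroids


# Toy data (invented): one row per patient, one column per visit.
trajs = [
    [10.0, 8.0, 6.0, 4.0],     # improving over time
    [11.0, 9.0, 6.5, 4.5],     # improving over time
    [9.5, 7.5, 5.5, 3.5],      # improving over time
    [10.0, 10.0, 9.5, 10.0],   # roughly flat
    [10.5, 10.0, 10.5, 10.0],  # roughly flat
    [9.5, 10.0, 9.5, 9.5],     # roughly flat
]
labels, centroids = kmeans_trajectories(trajs, k=2)
print(labels)
```

In this toy example the three improving patients end up in one cluster and the three flat patients in the other. Real trial data would additionally require a strategy for incomplete trajectories, which this sketch deliberately leaves out.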
Results: Across all simulation scenarios, the proposed method gave the best estimate of the true treatment effect. The predictions were evaluated at the individual patient level by visualizing the efficacy profiles, and at the overall population level by comparing the treatment effect estimated using different methods.
Conclusion: Overall, the proposed method provided plausible individual predictions for both Missing Completely at Random (MCAR) and MNAR data and reduced the bias due to missing data in the treatment effect estimate.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.