gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

6-9 September 2020, Berlin (online conference)

Tall data subsampling and outlier detection via optimal experimental design

Meeting Abstract

  • Radoslav Harman - Comenius University Bratislava, Bratislava, Slovakia
  • Samuel Rosa - Comenius University Bratislava, Bratislava, Slovakia

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 452

doi: 10.3205/20gmds005, urn:nbn:de:0183-20gmds0058

Published: 26 February 2021

© 2021 Harman et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

A dataset is called “tall” if it contains a very large number of observations (say, a million rows of the data matrix) while each observation has only a moderate dimension (the data matrix has at most a few tens of columns). In this talk, we propose a method that uses the minimum-volume enclosing ellipsoid (MVEE) to construct an information-based subsample of a tall dataset and to identify its outliers. The method alternates between (i) computing the MVEE of an auxiliary subset of the original data with the REX algorithm [1] for D-optimal design and (ii) removing redundant data points, identified from the “partial” MVEEs by the method of [2].
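To make the role of the MVEE concrete, the following minimal sketch computes the MVEE of a small synthetic dataset and lists the points lying close to its boundary. It is not the proposed method: it uses Khachiyan's classical first-order iteration in place of REX [1], and it does not implement the deletion rule of [2]. The function names, the synthetic data, and the 0.95 boundary threshold are assumptions chosen for illustration only. The sketch relies on the known equivalence between the MVEE problem and D-optimal design on the points lifted by a constant coordinate.

```python
import numpy as np


def mvee(points, tol=1e-3, max_iter=100_000):
    """Approximate MVEE of the rows of `points` via Khachiyan's iteration.

    Returns design weights u (n,), the center c (d,), and a shape matrix A
    (d, d) such that (x - c)^T A (x - c) <= 1 + O(tol) for every row x.
    """
    n, d = points.shape
    lifted = np.hstack([points, np.ones((n, 1))])  # rows (x_i, 1)
    u = np.full(n, 1.0 / n)                        # uniform starting design
    for _ in range(max_iter):
        info = lifted.T @ (lifted * u[:, None])    # information matrix
        # Variance function g_i = q_i^T info^{-1} q_i for each lifted point.
        g = np.einsum("ij,ij->i", lifted @ np.linalg.inv(info), lifted)
        j = int(np.argmax(g))
        if g[j] <= (d + 1) * (1 + tol):            # equivalence-theorem check
            break
        step = (g[j] - d - 1) / ((d + 1) * (g[j] - 1))
        u *= 1 - step                              # shrink all weights
        u[j] += step                               # move mass to the worst point
    center = points.T @ u
    cov = points.T @ (points * u[:, None]) - np.outer(center, center)
    shape = np.linalg.inv(cov) / d
    return u, center, shape


def ellipsoid_distance(points, center, shape):
    """Squared ellipsoidal distance (x - c)^T A (x - c) for each row x."""
    diff = points - center
    return np.einsum("ij,jk,ik->i", diff, shape, diff)


# Synthetic example: a Gaussian cloud plus one planted, far-away point.
rng = np.random.default_rng(0)
cloud = rng.standard_normal((2000, 2))
data = np.vstack([cloud, [[50.0, 50.0]]])

u, center, shape = mvee(data)
dist = ellipsoid_distance(data, center, shape)

# Illustrative threshold (an assumption, not the rule of [2]): points with
# distance near 1 touch the ellipsoid and are natural outlier candidates.
near_boundary = np.flatnonzero(dist >= 0.95)
print(len(near_boundary), "of", len(data), "points lie near the MVEE boundary")
```

Only the points touching the ellipsoid determine it, so every other point could be dropped without changing the MVEE; this is the property that makes the removal of redundant points in step (ii) possible.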

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Harman R, Filová L, Richtárik P. A Randomized Exchange Algorithm for Computing Optimal Approximate Designs of Experiments. Journal of the American Statistical Association. 2020;115:348-361.
2. Harman R, Pronzato L. Improvements on removing non-optimal support points in D-optimum design algorithms. Statistics & Probability Letters. 2007;77:90-94.