Tall data subsampling and outlier detection via optimal experimental design
Published: 26 February 2021
Text
A dataset is called “tall” if it contains a very large number of observations (say, a million rows of the data matrix) while each observation has only a moderate dimension (the data matrix has at most a few tens of columns). In this talk, we propose a method that uses the minimum-volume enclosing ellipsoid (MVEE) to construct an information-based subsample and to identify outliers of a tall dataset. Our method alternates between (i) using the REX algorithm [1] for D-optimal design to calculate the MVEE of an auxiliary subset of the original data and (ii) removing redundant data points based on the “partial” MVEEs determined by the method of [2].
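As an illustration of the alternation described above, the following Python sketch computes approximate D-optimal design weights (whose information matrix defines an enclosing ellipsoid) and prunes points that cannot lie on the boundary of the MVEE. It is not the authors' implementation: a plain multiplicative weight update stands in for the REX algorithm of [1], the deletion threshold is the inequality of [2] as we understand it and should be checked against that paper, and all function names and parameters (`subsample_by_mvee`, `n_rounds`, `iters_per_round`) are hypothetical. For simplicity it also works on the full dataset rather than on an auxiliary subset.

```python
import numpy as np


def variance_function(Z, w):
    """Return d_i = z_i^T M(w)^{-1} z_i, where M(w) = sum_i w_i z_i z_i^T."""
    M = Z.T @ (w[:, None] * Z)
    return np.einsum("ij,ij->i", Z @ np.linalg.inv(M), Z)


def deletion_threshold(m, eps):
    """Lower bound on d_i for support points of a D-optimal design.

    Assumed form of the bound in [2], with m + eps the current maximum of
    the variance function: m * (1 + eps/2 - sqrt(eps * (4 + eps - 4/m)) / 2).
    """
    return m * (1 + eps / 2 - np.sqrt(eps * (4 + eps - 4 / m)) / 2)


def subsample_by_mvee(X, n_rounds=20, iters_per_round=50):
    """Alternate weight updates and point deletion (illustrative only)."""
    n, d = X.shape
    # Lift each point x to z = (1, x), so that a D-optimal design on the
    # lifted points corresponds to the MVEE of the original points.
    Z = np.hstack([np.ones((n, 1)), X])
    m = d + 1
    keep = np.arange(n)          # indices of points still in play
    w = np.full(n, 1.0 / n)      # start from the uniform design
    for _ in range(n_rounds):
        # Stand-in for REX: multiplicative update w_i <- w_i * d_i / m.
        for _ in range(iters_per_round):
            w *= variance_function(Z[keep], w) / m
        dvals = variance_function(Z[keep], w)
        eps = max(dvals.max() - m, 0.0)
        # Points below the threshold cannot support the optimal design.
        alive = dvals >= deletion_threshold(m, eps)
        keep = keep[alive]
        w = w[alive] / w[alive].sum()  # renormalise the remaining weights
    return keep, w
```

Under these assumptions, `keep` indexes the retained points and `w` holds their design weights. Because the deleted points cannot support the D-optimal design, the retained set has the same MVEE as the full data.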
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Harman R, Filová L, Richtárik P. A Randomized Exchange Algorithm for Computing Optimal Approximate Designs of Experiments. Journal of the American Statistical Association. 2020;115:348-361.
2. Harman R, Pronzato L. Improvements on removing non-optimal support points in D-optimum design algorithms. Statistics & Probability Letters. 2007;77:90-94.