gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Optimal subsample selection in big datasets

Meeting Abstract

  • Chiara Tommasi - University of Milan, Milan, Italy
  • Laura Deldossi - Università Cattolica del Sacro Cuore, Milan, Italy

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 323

doi: 10.3205/20gmds004, urn:nbn:de:0183-20gmds0043

Published: February 26, 2021

© 2021 Tommasi et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

Big Data are huge quantities of data that are usually accrued automatically. Because of their large size, achieving inferential goals can be computationally difficult; hence the idea of selecting a subsample of the Big Dataset.

To accomplish this goal, the Big Dataset is conceived of as a finite population, even though it is not a planned observation of objects. We consider the model-based survey approach, and we make inferences about the parameters of the model that generates the Big Dataset. To form a subsample of data, we apply the theory of optimal design instead of considering the most commonly used sampling schemes.

We propose a purposive selection strategy, called the “Optimal Design Based” (ODB) method, which consists of two steps. First, we identify the “most informative” values of the explanatory variables according to an optimality criterion (these optimal “theoretical” values are not necessarily present in the observed Big Dataset). Then, we select the observations from the full dataset that are closest to these “theoretical” optimal values. Hence, this “optimal-sampling” approach enables us to select the most “informative” observations from the Big Dataset. In addition, we borrow the concept of “design efficiency” from optimal design theory as a tool to measure the quality of the Big Dataset and of the selected subsamples in terms of their per-unit information.
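The two steps and the efficiency measure can be illustrated with a minimal sketch. The abstract does not specify a model or an optimality criterion, so everything here is an assumption made for illustration: a simple linear regression in one explanatory variable, D-optimality as the criterion (for which the optimal design on an interval puts weight 1/2 on each endpoint), simulated data, and the hypothetical function names `info_matrix`, `odb_subsample` and `d_efficiency`. The D-efficiency is computed as (det M_sub / det M_opt)^(1/p), where M is the per-unit information matrix and p is the number of model parameters.

```python
import numpy as np

def info_matrix(x):
    """Per-unit information matrix X'X / n for the model E[y] = b0 + b1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    return X.T @ X / len(x)

def odb_subsample(x, n_sub):
    """Return indices of a subsample chosen by a two-step, ODB-style rule."""
    # Step 1 (assumption): for simple linear regression on [min(x), max(x)],
    # the D-optimal design puts weight 1/2 on each endpoint of the interval.
    support = np.array([x.min(), x.max()])
    weights = np.array([0.5, 0.5])

    # Step 2: for each theoretical point, pick the nearest observations
    # that have not been selected yet.
    available = np.ones(len(x), dtype=bool)
    chosen = []
    for point, w in zip(support, weights):
        k = int(round(w * n_sub))  # for odd n_sub the total may differ by one
        candidates = np.flatnonzero(available)
        order = np.argsort(np.abs(x[candidates] - point), kind="stable")
        picked = candidates[order[:k]]
        available[picked] = False
        chosen.extend(picked.tolist())
    return np.array(chosen)

def d_efficiency(x_sub, a, b):
    """D-efficiency of x_sub relative to the D-optimal design on [a, b]."""
    m_opt = info_matrix(np.array([a, b], dtype=float))
    p = m_opt.shape[0]  # number of model parameters
    ratio = np.linalg.det(info_matrix(x_sub)) / np.linalg.det(m_opt)
    return ratio ** (1.0 / p)

# Simulated "big dataset" (illustrative only, not data from the abstract).
rng = np.random.default_rng(0)
x_full = rng.normal(size=100_000)
a, b = x_full.min(), x_full.max()

idx = odb_subsample(x_full, n_sub=1_000)
eff_odb = d_efficiency(x_full[idx], a, b)
eff_random = d_efficiency(rng.choice(x_full, size=1_000, replace=False), a, b)
eff_full = d_efficiency(x_full, a, b)

print("D-efficiency, ODB-style subsample:", eff_odb)
print("D-efficiency, random subsample:   ", eff_random)
print("D-efficiency, full dataset:       ", eff_full)
```

Under these assumptions, the efficiency lies between 0 and 1 and equals 1 only when all the mass sits on the two endpoints, so it gives a per-unit comparison between the full dataset, a random subsample and the purposively selected one. Other models or criteria would change step 1 but not the overall structure.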

Indeed, the connection between sampling and experimental design had already been explored by Wynn [1], [2] and Fedorov [3], among others.

In addition, subsampling from a big dataset has already been studied by Ma and Sun [4], Drovandi et al. [5] and Wang et al. [6], among others.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Wynn HP. Minimax purposive survey sampling design. Journal of the American Statistical Association. 1977;72(359):655-657.
2. Wynn HP. Optimum submeasures with applications to finite population sampling. In: Gupta S, Berger J, editors. Statistical Decision Theory and Related Topics III. Proceedings 3rd Purdue Symposium. Vol. 2. New York: Academic Press; 1982.
3. Fedorov V. Optimal design with bounded density: optimization algorithms of the exchange type. Journal of Statistical Planning and Inference. 1989;22(1):1-13.
4. Ma P, Sun X. Leveraging for big data regression. Wiley Interdisciplinary Reviews: Computational Statistics. 2015;7(1):70-76.
5. Drovandi CC, Holmes CC, McGree JM, Mengersen K, Richardson S, Ryan EG. Principles of experimental design for big data analysis. Statistical Science. 2017;32(3):385-404.
6. Wang H, Yang M, Stufken J. Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association. 2019;114(525):393-405.