gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Extracting interpretable patterns from omics data with deep generative models under sample size constraints

Meeting Abstract

  • Moritz Hess - Universitätsklinikum Freiburg, Medizinische Fakultät, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
  • Stefan Lenz - Universitätsklinikum Freiburg, Medizinische Fakultät, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
  • Harald Binder - Universitätsklinikum Freiburg, Medizinische Fakultät, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 359

doi: 10.3205/20gmds374, urn:nbn:de:0183-20gmds3742

Published: February 26, 2021

© 2021 Hess et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: Deep generative models such as deep Boltzmann machines (DBMs) or variational autoencoders (VAEs) are now frequently applied to omics data, such as single-cell gene expression data. These models learn non-linear dependencies in the data and allow for extracting compact latent representations that enable researchers, for example, to study the differentiation of cells based on a few dimensions. While these representations can easily be used for better visual inspection of the data, it is challenging to draw conclusions about the underlying biological processes, since this requires an approach that identifies how the learnt latent representations relate to the observed variables. Recently, we developed such an approach. Based on categorized synthetic data sampled from trained generative models, the observed variables that form joint patterns with the learnt latent representation are extracted with log-linear models. Combinations of the states of the extracted variables can then easily be interpreted as a pattern that is characteristic, e.g., of a given cell state. Since omics data usually comprise many more variables than independent data points, i.e. samples, extracting patterns from such data is challenging. Here we demonstrate how patterns can be extracted when the ratio of observed variables to samples is unfavorable.
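
The role of the log-linear models can be illustrated with a deliberately simplified sketch. It assumes that the synthetic data have already been sampled and categorized, and it screens each observed variable separately by the deviance (G²) of the independence log-linear model for its two-way table with a categorized latent variable. The function names, the toy data and the restriction to pairwise tables are assumptions made for illustration; the approach described in this abstract may use richer log-linear models and a different selection procedure.

```python
import math
from collections import Counter


def g2_independence(x, y):
    """Deviance (G^2) of the independence log-linear model for the
    two-way contingency table of the categorical sequences x and y."""
    n = len(x)
    joint = Counter(zip(x, y))
    row = Counter(x)
    col = Counter(y)
    g2 = 0.0
    for (a, b), observed in joint.items():
        # Fitted count under independence: row total * column total / n
        expected = row[a] * col[b] / n
        g2 += 2.0 * observed * math.log(observed / expected)
    return g2


def rank_variables(latent_states, observed_states):
    """Rank observed variables by how strongly they depart from
    independence with the categorized latent variable."""
    scores = {
        name: g2_independence(latent_states, values)
        for name, values in observed_states.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical categorized synthetic samples (assumed, not from the study)
latent = [0, 0, 0, 0, 1, 1, 1, 1]
genes = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1],  # follows the latent state
    "gene_b": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the latent state
}
ranking = rank_variables(latent, genes)
```

In this toy example, `gene_a` receives a large deviance and would be retained as part of a pattern, whereas `gene_b` has a deviance of zero and would be dropped.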

Methods: To extract interpretable patterns from high-dimensional omics data with limited sample size, we employ an approach for partitioned training of generative models, which we successfully applied to single nucleotide polymorphism (SNP) data [1]. In brief, groups of observed variables are identified based on their coarse correlation structure. For each of these groups, a separate generative model is trained. After training, samples drawn from the models can be analyzed either jointly, using a single log-linear model, or separately, using multiple log-linear models.
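
The partitioning step can be sketched as follows. This is a minimal illustration, not the procedure of [1]: it assumes that variables are grouped as connected components of a graph that links two variables whenever their absolute Pearson correlation reaches a chosen threshold. The function names, the threshold and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # treat a constant variable as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def partition_variables(columns, threshold):
    """Group variables into connected components of the graph that links
    two variables whenever |r| >= threshold (simple union-find)."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical expression values for four genes across five samples
columns = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 11],
    "g3": [5, 1, 4, 2, 3],
    "g4": [10, 2, 8, 4, 6],
}
groups = partition_variables(columns, threshold=0.8)
```

Each resulting group would then receive its own generative model. The threshold here is only a stand-in for whatever cut-point rule is actually used to split the variables.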

Results: Using single-cell gene expression data as an example, we demonstrate that the information-carrying genes, e.g. cell type-specific marker genes, can be extracted using the log-linear modeling approach described above. We then show how partitioned training allows the log-linear modeling approach to be employed when there are more observed variables than independent training samples. We further evaluate how the selection of cut-points used for partitioning the variables into groups affects the performance of the log-linear modeling approach.

Conclusion: We propose an approach that allows for extracting interpretable patterns from omics data. We also demonstrate how this approach can be employed in the frequently encountered scenario in which there are more observed variables than independent data points.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Hess M, Lenz S, Blätte TJ, Bullinger L, Binder H. Partitioned learning of deep Boltzmann machines for SNP data. Bioinformatics. 2017;33:3173–3180.