Article
Extracting interpretable patterns from omics data with deep generative models under sample size constraints
Published: February 26, 2021
Text
Background: Deep generative models such as deep Boltzmann machines (DBMs) or variational autoencoders (VAEs) are now frequently applied to omics data, such as single-cell gene expression data. These models learn non-linear dependencies in the data and allow compact latent representations to be extracted, which enable researchers, e.g., to study the differentiation of cells based on a few dimensions. While these representations can easily be used for better visual inspection of the data, it is challenging to draw conclusions about the underlying biological processes, since this requires an approach that identifies how the learnt latent representations relate to the observed variables. Recently, we developed such an approach. Based on categorized synthetic data sampled from trained generative models, the observed variables that form joint patterns with the learnt latent representation are extracted with log-linear models. Combinations of the states of the extracted variables can then easily be interpreted as a pattern that is characteristic, e.g., of a given cell state. Since omics data usually contain many more variables than independent data points, i.e. samples, extracting patterns from such data is challenging. Here we demonstrate how patterns can be extracted when the ratio of observed variables to samples is unfavorable.
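To make the role of the log-linear models more concrete, the following is a minimal sketch, not the procedure used in this work. It assumes that both the synthetic observed variables and the latent state have been categorized to binary values. It then screens each variable with a two-way likelihood-ratio test, comparing the independence log-linear model with the saturated one. The function names, the toy data and the 95% critical value are illustrative assumptions; the actual approach may involve higher-order interactions and a different selection strategy.

```python
import math


def g2_independence(table):
    """Likelihood-ratio statistic G^2 for a 2x2 contingency table.

    Compares the independence log-linear model (main effects only)
    with the saturated model (main effects plus interaction).
    """
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][c] + table[1][c] for c in range(2)]
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            observed = table[r][c]
            expected = row_sums[r] * col_sums[c] / total
            if observed > 0:  # empty cells contribute nothing to G^2
                g2 += 2.0 * observed * math.log(observed / expected)
    return g2


def screen_variables(samples, latent, critical_value=3.841):
    """Return (index, G^2) pairs for variables associated with the latent state.

    `critical_value` defaults to the 95% quantile of the chi-squared
    distribution with 1 degree of freedom (an illustrative choice).
    """
    selected = []
    for j in range(len(samples[0])):
        table = [[0, 0], [0, 0]]
        for row, z in zip(samples, latent):
            table[z][row[j]] += 1
        g2 = g2_independence(table)
        if g2 > critical_value:
            selected.append((j, g2))
    # Strongest associations first
    return sorted(selected, key=lambda item: -item[1])


# Hypothetical categorized synthetic data: 40 samples, 3 binary variables.
latent = [0] * 20 + [1] * 20
samples = [
    [z,                         # variable 0: perfect marker of the latent state
     i % 2,                     # variable 1: unrelated to the latent state
     z if i % 5 else 1 - z]     # variable 2: marker with a few flipped values
    for i, z in enumerate(latent)
]

selected = screen_variables(samples, latent)
selected_indices = [j for j, _ in selected]
```

In this toy setting, the variables that follow the latent state are retained and the unrelated one is dropped; the states of the retained variables can then be read as a pattern for each latent state.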
Methods: To extract interpretable patterns from high-dimensional omics data with limited sample size, we employ an approach for partitioned training of generative models, which we successfully applied to single nucleotide polymorphism (SNP) data [1]. In brief, groups of observed variables are identified based on their coarse correlation structure. For each of these groups, a separate generative model is trained. After training, samples drawn from the models can be analyzed either jointly, using a single log-linear model, or separately, using multiple log-linear models.
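The partitioning step can be illustrated with a small sketch. It assumes that the variables are already ordered so that correlated variables are adjacent, and it places a cut-point wherever the absolute correlation between neighbouring variables drops below a threshold. The ordering assumption, the threshold value and the function names are hypothetical and only serve to show the idea of grouping by coarse correlation structure; they are not taken from the method in [1].

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def partition_by_cut_points(columns, threshold=0.3):
    """Split ordered variables into contiguous groups of indices.

    A cut-point is placed between neighbouring variables whose absolute
    correlation falls below `threshold` (a hypothetical tuning parameter).
    """
    groups = [[0]]
    for j in range(1, len(columns)):
        if abs(pearson(columns[j - 1], columns[j])) < threshold:
            groups.append([j])      # weak link: start a new group
        else:
            groups[-1].append(j)    # strong link: extend the current group
    return groups


# Toy data: two blocks of three variables, each block driven by its own signal.
random.seed(0)
n_samples = 48
z1 = [1.0 if (i // 2) % 2 == 0 else -1.0 for i in range(n_samples)]
z2 = [1.0 if i % 2 == 0 else -1.0 for i in range(n_samples)]  # orthogonal to z1
columns = (
    [[z + 0.1 * random.gauss(0, 1) for z in z1] for _ in range(3)]
    + [[z + 0.1 * random.gauss(0, 1) for z in z2] for _ in range(3)]
)

groups = partition_by_cut_points(columns, threshold=0.3)
```

Each resulting group would then receive its own generative model, so that every model sees far fewer variables than the full data set. Varying `threshold` moves the cut-points and thereby changes the group sizes.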
Results: Using single-cell gene expression data as an example, we demonstrate that the information-carrying genes, e.g. cell type-specific marker genes, can be extracted using the log-linear modeling approach described above. We then show how partitioned training allows the log-linear modeling approach to be employed when there are more observed variables than independent training samples. We further evaluate how the selection of cut-points used for partitioning the variables into groups affects the performance of the log-linear modeling approach.
Conclusion: We propose an approach that allows interpretable patterns to be extracted from omics data. We also demonstrate how this approach can be employed in the frequently encountered scenario in which there are more observed variables than independent data points.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Hess M, Lenz S, Blätte TJ, Bullinger L, Binder H. Partitioned learning of deep Boltzmann machines for SNP data. Bioinformatics. 2017;33:3173–3180.