gms | German Medical Science

MAINZ//2011: 56. GMDS-Jahrestagung und 6. DGEpi-Jahrestagung

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V.
Deutsche Gesellschaft für Epidemiologie e. V.

26. - 29.09.2011 in Mainz

Searching for Biomarkers of Pluripotent Stem Cells

Meeting Abstract

  • Lena Scheubert - Universität Osnabrück, Osnabrück
  • Rainer Schmidt - Universität, Rostock
  • Mitja Lustrek - Universität, Rostock
  • Dirk Repsilber - Leibniz Institut, Dummerstorf
  • Georg Fuellen - Universität Rostock, Rostock

Mainz//2011. 56. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds), 6. Jahrestagung der Deutschen Gesellschaft für Epidemiologie (DGEpi). Mainz, 26.-29.09.2011. Düsseldorf: German Medical Science GMS Publishing House; 2011. Doc11gmds028

DOI: 10.3205/11gmds028, URN: urn:nbn:de:0183-11gmds0284

Published: September 20, 2011

© 2011 Scheubert et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License. You are free to share, that is, to copy, distribute and transmit the work, provided the original author and source are credited.



Introduction: Pluripotent stem cells can self-renew and differentiate into all adult cell types. For the cellular phenotype commonly called “pluripotent stem cell” and for its more heterogeneous counterpart, the “differentiated” or “non-pluripotent” cell, we sought to answer the following research questions:

  • Which genes are differentially expressed, i.e. expressed more or less strongly in the pluripotent state than in the differentiated one?
  • Which sets of genes enable the best distinction of a pluripotent state from a differentiated one, considering their (differential) expression?
  • Which small sets of genes still enable a good distinction of the two states?

Material and Methods: We obtained gene expression data from the GEO database [1] by taking samples from experiments related to pluripotency in mouse. In selecting samples, we aimed for a large data set, correct class labels, and a wide variety of phenotypes. Classification was performed with the Weka machine learning suite [2]. The following machine learning algorithms were used: Naive Bayes, C4.5 decision trees (J48 in Weka), Random Forest, Nearest Neighbor (IBk in Weka), and SVM (SMO in Weka). Furthermore, we propose a combination of Genetic Algorithms and SVMs to find small gene sets that still allow sufficiently accurate classification.
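The idea of wrapping a genetic algorithm around a classifier to search for small gene sets can be illustrated with a minimal sketch. It is not the authors' implementation: to stay free of external libraries, a nearest-centroid classifier stands in for the SVM, and the fitness function, the size penalty, truncation selection, one-point crossover, and all names and parameter values are assumptions chosen for illustration.

```python
import random

def centroid_loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen genes."""
    if not genes:
        return 0.0
    correct = 0
    for i in range(len(X)):
        best_label, best_dist = None, float("inf")
        for label in sorted(set(y)):
            # Class centroid computed without the held-out sample i.
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            if not rows:
                continue
            centroid = [sum(r[g] for r in rows) / len(rows) for g in genes]
            dist = sum((X[i][g] - c) ** 2 for g, c in zip(genes, centroid))
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == y[i]
    return correct / len(X)

def fitness(mask, X, y, size_penalty=0.01):
    """Accuracy minus a small (assumed) penalty per selected gene."""
    genes = [g for g, bit in enumerate(mask) if bit]
    return centroid_loo_accuracy(X, y, genes) - size_penalty * len(genes)

def ga_select(X, y, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Return the gene indices of the fittest mask found by a simple GA."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    pop = [[rng.random() < 0.5 for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, X, y), reverse=True)
        parents = pop[: pop_size // 2]        # truncation selection: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)   # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation.
            child = [(not bit) if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda m: fitness(m, X, y))
    return [g for g, bit in enumerate(best) if bit]
```

Here `X` is a list of samples (one list of expression values per sample) and `y` the matching list of class labels. The size penalty is what pushes the search towards small gene sets; an SVM could replace the stand-in classifier inside `fitness` without changing the surrounding search.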

To evaluate the biological relevance of our results, we applied gene set enrichment analysis using the hypergeometric distribution [3]. Since we assumed that the genes selected by our feature selection methods might be over-represented in gene sets that can be directly associated with the pluripotent status of cells, we compared our selected genes with several pluripotency-related networks and pathways.
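An over-representation test based on the hypergeometric distribution can be sketched as follows. The function and parameter names are illustrative assumptions, and the sketch computes only a single one-sided p-value; it does not reproduce the tool cited in [3] or any multiple-testing correction.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_universe genes, of which n_in_set belong to the reference gene set."""
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

A small p-value indicates that the selected genes overlap with the reference set (for example a pluripotency-related pathway) more often than expected by chance.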

Results and Discussion: Our results show that, on the basis of gene expression data, the distinction between a pluripotent and a differentiated (non-pluripotent) cell state can be learned with cross-validated accuracies of nearly 100%. Furthermore, our analysis provides evidence that classification also performs very well with the small gene sets selected by our combination of Genetic Algorithms and SVMs.
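How a cross-validated accuracy is estimated can be shown with a minimal sketch using a 1-nearest-neighbor classifier. The number of folds, the unstratified split, the distance measure, and all names are assumptions; the abstract does not state the exact cross-validation setup used in Weka.

```python
import random

def nearest_neighbor_predict(train_X, train_y, x):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]

def cross_validated_accuracy(X, y, k=10, seed=0):
    """Plain (unstratified) k-fold cross-validation accuracy of a 1-NN classifier."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds covering all samples
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        # Each sample is predicted only by a model that never saw it.
        correct += sum(
            nearest_neighbor_predict(train_X, train_y, X[i]) == y[i] for i in fold
        )
    return correct / len(X)
```

Because every sample is classified by a model trained without it, the resulting accuracy estimates performance on unseen samples rather than on the training data.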

Our literature investigation revealed that most of the genes we found are related to pluripotency, even though many of them are not well-known pluripotency genes. Furthermore, our enrichment analyses show that many of the selected genes are implicated in pluripotency and are included in networks describing pluripotency.


[1] Barrett T, et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 2009;37:885-890.
[2] Hall M, et al. The WEKA data mining software: An update. SIGKDD Explorations. 2009;11(1):10-18.
[3] Backes C, et al. GeneTrail-advanced gene set enrichment analysis. Nucleic Acids Res. 2007;35:186-192.