gms | German Medical Science

GMDS 2014: 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

07. - 10.09.2014, Göttingen

Biomarkers for pluripotent stem cells in mice

Meeting Abstract


  • R. Schmidt - Universitätsmedizin Rostock, Rostock

GMDS 2014. 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Göttingen, 07.-10.09.2014. Düsseldorf: German Medical Science GMS Publishing House; 2014. DocAbstr. 93

doi: 10.3205/14gmds083, urn:nbn:de:0183-14gmds0834

Published: September 4, 2014

© 2014 Schmidt.
This is an Open Access article distributed under the terms of the Creative Commons license (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.de). It may be reproduced, distributed, and made publicly available, provided that the author and source are credited.


Introduction: Classifying high-level phenotypes from high-throughput gene-level data is a fundamental task in bioinformatics. Analyzing the corresponding sets of important features improves our understanding of the genotype-phenotype map and delivers basic insights into the biology underlying a particular phenotype. For the cellular phenotype commonly called “pluripotent stem cell” and its counterpart, the “differentiated” or “non-pluripotent” cell, we set out to collect gene expression data from GEO (the Gene Expression Omnibus database [1]). Gene expression data are among the most abundant molecular data, and they are still very close to the true genotype of the (static) genome. They may therefore indicate which genes are responsible for the phenotype we wish to understand. The following questions are of interest:

a) Which genes are differentially expressed?
b) Which sets of genes enable the best distinction of the pluripotent state from the differentiated one, considering their (differential) expression?
c) Which small sets of genes still enable a good distinction of the two states?

A univariate statistical testing approach (often involving normalization / regularization), combined with inspection of the “fold change”, is the standard way to answer question a) [2]. Answers to question b), based on machine learning and variable selection, should yield a comprehensive description of the molecular basis of pluripotency. Answers to question c) help define small sets of the best biomarkers for pluripotency. Such “minimal-best” machine learning approaches have gained popularity in recent years, in particular in the search for cancer biomarkers.

Material and Methods: Samples from experiments (data series) related to pluripotency in mice were taken from the GEO database. To obtain a large data set, we collected gene expression data from many different GEO series. The positively labeled samples are gene expression data from pluripotent stem cells and the inner cell mass (ICM) of the embryo, whereas the negatively labeled samples come from a wide range of differentiated cells and tissues.

For classification, the following machine learning algorithms were used within the Weka environment [3]: Naïve Bayes, C4.5 decision trees (J48 in Weka), Random Forest, Nearest Neighbor (IBk in Weka), and Support Vector Machines (SMO in Weka). All parameters were kept at Weka’s default values. Classification performance was evaluated by three-fold cross-validation: classifiers were trained on two thirds of the data set and tested on the remaining third. Additionally, the usual inner 10-fold cross-validation was performed within Weka. Since the goal was not only to distinguish pluripotent from non-pluripotent cells but also to obtain lists of the genes most important for pluripotency, we additionally applied three feature selection methods: information gain, random forest, and a genetic algorithm.
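The three-fold evaluation scheme described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Weka workflow used in the study: the function names, the toy expression values, the stratified fold assignment, and the 1-nearest-neighbor classifier with Euclidean distance are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def nearest_neighbor_predict(train_x, train_y, sample):
    """Return the label of the closest training sample (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], sample))
    return train_y[best]

def cross_validated_accuracy(x, y, k=3, seed=0):
    """Train on k-1 folds, test on the held-out fold, pool results over all folds."""
    correct = 0
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct += sum(
            nearest_neighbor_predict(train_x, train_y, x[i]) == y[i]
            for i in test_idx
        )
    return correct / len(y)

# Toy data invented for illustration: two "genes", six samples per class.
toy_x = [[9.1, 1.0], [8.7, 1.4], [9.4, 0.8], [8.9, 1.1], [9.0, 1.3], [9.2, 0.9],
         [2.1, 6.0], [1.8, 6.3], [2.4, 5.7], [2.0, 6.1], [2.2, 5.9], [1.9, 6.2]]
toy_y = ["pluripotent"] * 6 + ["differentiated"] * 6

print(cross_validated_accuracy(toy_x, toy_y, k=3))
```

With k=3, each classifier is trained on two thirds of the samples and tested on the remaining third, as in the abstract. Any gene filtering or feature selection would have to be repeated inside each training split to keep the test fold unseen.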

Results: We compared five algorithms chosen to represent different approaches to machine learning: naïve Bayes, C4.5 decision trees, random forest, nearest neighbor, and SVM (with two kernels, Gaussian and linear). The algorithms were tested on the full set of 20,668 genes and on filtered sets of 5,000 and 1,000 genes. Filtering was done by applying a two-sample t-test for samples with unequal variances to the difference in mean expression of each gene in the respective training set; the resulting p-values were corrected based on the concept of the false discovery rate.
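The filtering step can be illustrated with a short sketch. It assumes that the unequal-variance test is Welch's t-test and that the false discovery rate correction follows the Benjamini-Hochberg procedure; the abstract names neither explicitly, and the function names and example numbers below are hypothetical. Converting the t statistic into a p-value needs the CDF of the t distribution, which the Python standard library does not provide, so that step is omitted here.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean, group a
    vb = variance(b) / len(b)  # squared standard error of the mean, group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical expression values of one gene in the two classes.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t_stat, 4), round(dof, 4))

# Hypothetical raw p-values for four genes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(p, 4) for p in adjusted])

# Keep the n genes with the smallest adjusted p-values (n = 2 here).
top_genes = sorted(range(len(adjusted)), key=lambda i: adjusted[i])[:2]
print(top_genes)
```

Ranking genes by adjusted p-value and keeping the top n is one plausible way to arrive at fixed-size sets such as 5,000 or 1,000 genes. The abstract does not state the exact selection rule.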

Most of the applied algorithms classify very well, some even perfectly. For the three best-performing algorithms (SVM with both kernels and nearest neighbor), classification with 5,000 genes is better than with all 20,668 genes, and classification with 1,000 genes is better than with 5,000. Searching for pluripotency biomarkers in the 1,000-gene data set is therefore preferable, because many irrelevant genes have already been eliminated.

Feature selection is a machine learning technique that reduces the set of features to the most relevant ones and, in our case, also identifies biomarkers with potentially important roles in pluripotency. Three feature selection methods were used. The first, information gain, measures how much information about a class one gains by knowing the value of a feature. It considers each feature on its own. The second method is feature importance computed by random forest. The importance is obtained by randomly permuting the values of a feature and measuring the resulting decrease in classification accuracy. While this method still evaluates single features, it measures their importance as part of a classifier that uses other features as well. The third method is feature selection with a genetic algorithm. Here, the set of features is optimized, with classification accuracy as the main part of the fitness function. This method evaluates whole sets of features. We listed the top 20 most important genes selected from the 1,000-gene data set by each of the three feature selection methods. The lists produced by random forest and information gain are quite similar, whereas the list produced by the genetic algorithm is different.
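As an illustration of the first method, the sketch below computes the information gain of a single gene with respect to the class label. It assumes that the continuous expression values are binarized at a fixed threshold, which is a simplification chosen for the example; the abstract does not say how expression values were discretized, and the function names, thresholds, and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Reduction in class entropy after splitting samples at values <= threshold."""
    low = [lab for val, lab in zip(values, labels) if val <= threshold]
    high = [lab for val, lab in zip(values, labels) if val > threshold]
    # Weighted entropy that remains once the side of the split is known.
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (low, high) if part
    )
    return entropy(labels) - remainder

# Hypothetical data: four samples, two classes, two genes.
labels = ["pluripotent", "pluripotent", "differentiated", "differentiated"]
gene_a = [9.0, 8.0, 2.0, 1.0]  # separates the classes perfectly
gene_b = [5.0, 1.0, 5.0, 1.0]  # unrelated to the class

print(information_gain(gene_a, labels, threshold=5.0))
print(information_gain(gene_b, labels, threshold=3.0))
```

Because each gene is scored on its own, this measure cannot detect genes that are informative only in combination. That is the limitation the random forest and genetic algorithm methods address in different ways.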

Discussion: Machine learning classification algorithms can learn to distinguish pluripotent from non-pluripotent cell states on the basis of gene expression data, with accuracies reaching nearly 100%. Furthermore, we listed genes selected by three feature selection methods as potential biomarkers.


References

1. Barrett T, et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 2009;37:D885-90.
2. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001;98(9):5116-21.
3. Hall M, et al. The WEKA data mining software: An update. SIGKDD Explorations. 2009;11(1):10-8.