50. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds)
12. Jahrestagung der Deutschen Arbeitsgemeinschaft für Epidemiologie (dae)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie
Deutsche Arbeitsgemeinschaft für Epidemiologie

12. bis 15.09.2005, Freiburg im Breisgau

Knowledge Based Analysis of Microarray Gene Expression Data

Meeting Abstract

  • Thomas Karopka - Universität Rostock, Rostock
  • Änne Glass - Universität Rostock, Rostock

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. Deutsche Arbeitsgemeinschaft für Epidemiologie. 50. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds), 12. Jahrestagung der Deutschen Arbeitsgemeinschaft für Epidemiologie. Freiburg im Breisgau, 12.-15.09.2005. Düsseldorf, Köln: German Medical Science; 2005. Doc05gmds064

The electronic version of this article is the complete one and can be found online at:

Published: September 8, 2005

© 2005 Karopka et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License. You are free to Share (to copy, distribute and transmit the work), provided the original author and source are credited.




The sequencing of the human genome and the genomes of other organisms has led to an unprecedented flood of data in the biomedical field. High-throughput technologies such as microarray gene expression measurements have further increased the amount of data produced. Methods for interpreting these datasets range from clustering and network reconstruction to literature-based approaches [1], [2]. An alternative approach is the integration of available knowledge in the form of biological pathways. This method was first discussed by Fellenberg and Mewes [3] and further developed by Zien et al. [4]. Biological pathways are key concepts for understanding and interpreting processes in an organism. Many such pathways exist only as textbook graphics. Others are available in structured form in a database (e.g. as XML). To allow the integration of pathways from these heterogeneous sources, we have extended our microarray analysis suite gEn0m [5] with an automatic conversion of pathways from databases and with a simple pathway editor for integrating new pathways based on a textbook graphic.

Several other tools exist for the combined analysis of pathways and expression data, such as KEGG [6], MetaCyc [7], ViMAc [8], MAPPFinder [9], Pathway Editor [10] and Pathway Processor [11]. However, these applications focus on metabolic pathways (MetaCyc, ViMAc, Pathway Processor) and/or on yeast data. In addition, these tools only map the expression data to the pathways and do not score the pathways to identify the most affected ones. The methods presented here allow candidate pathways to be ranked.

Material and Methods

One of the prerequisites for the analysis proposed here is the integration of the pathway data into the database. Two options exist for this task:

  • Integration of pathways in structured form from other databases. At the time of writing we have integrated 314 KEGG pathways, 5342 pathways from GenMAPP and 596 pathways from BioCarta. Adding new pathways from these databases is straightforward. KEGG files (XML or HTML) and GenMAPP files can be converted into the internal file format required for the automatic analysis of pathways. Maps from BioCarta can be downloaded and integrated automatically upon user initiation.
  • Integration of pathway images. New pathways are integrated using the Map Editor. The pathway diagram may be in BMP, JPEG or GIF format. After the diagram has been loaded, each gene has to be connected to the expression data by overlaying its symbol in the diagram with a graphical object. This object is used to display the expression level in the corresponding microarray experiment. With this method, nearly every form of pathway can be integrated into the analysis process.

Mapping is currently implemented for all Affymetrix GeneChip® arrays for human, mouse and rat.

Two alternative methods for linking expression data to graphical objects are implemented: mapping by accession number and mapping by gene name. Because gene names are ambiguous, the accession number is the preferred way of mapping. However, for many pathways this information is not available within the diagram. We therefore use gene name mapping for the automatic conversion if the accession number is not available. Given the large number of pathways (>6000), manual editing of every pathway is not feasible. Often, however, the user is interested in only a limited number of pathways. For these pathways the user should edit the map file and supply the accession numbers to obtain a higher accuracy.
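The mapping rule described above (accession number first, gene name only as a fallback) can be illustrated with a minimal Python sketch. This is not the gEn0m implementation: the node layout, the lookup tables, the probe and accession identifiers, and the case-insensitive name matching are all assumptions made for illustration.

```python
from typing import Dict, Optional

def map_pathway_gene(node: Dict[str, str],
                     by_accession: Dict[str, str],
                     by_name: Dict[str, str]) -> Optional[str]:
    """Return the chip probe ID for a pathway node, or None if it cannot be mapped."""
    accession = node.get("accession")
    if accession:
        # Preferred route: accession numbers are unambiguous.
        return by_accession.get(accession)
    name = node.get("gene_name")
    if name:
        # Fallback route: gene names (matched case-insensitively, an assumption).
        return by_name.get(name.upper())
    return None

# Hypothetical lookup tables built from a chip annotation file.
by_accession = {"ACC0001": "probe_A"}
by_name = {"TNF": "probe_A", "IL6": "probe_B"}

# Hypothetical pathway nodes: with accession, with name only, and unmappable.
nodes = [
    {"gene_name": "TNF", "accession": "ACC0001"},
    {"gene_name": "Il6"},
    {"gene_name": "XYZ1"},
]
mapped = [map_pathway_gene(n, by_accession, by_name) for n in nodes]
```

In this sketch a node that carries an accession number is never resolved by name, which mirrors the stated preference for the less ambiguous identifier.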

Mapping gene expression data to pathways allows the data to be visualised in a biologically meaningful context. The gene expression level is colour coded, i.e. red for up-regulation and green for down-regulation. For the genes that were mapped, further information can be displayed by clicking on the gene symbol. This method of linking gene expression data and pathways is easy to adapt to other technologies such as cDNA arrays. The only information the user has to supply is a file containing chip IDs mapped to accession numbers or gene names.
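As a rough illustration of the colour coding, the following Python sketch assigns a colour to a gene from its expression change. Only the red/green convention comes from the text; the use of a log2 ratio, the threshold of 1.0 and the grey colour for unchanged genes are assumptions.

```python
def expression_colour(log2_ratio: float, threshold: float = 1.0) -> str:
    """Map an expression change to a display colour.

    red   = up-regulated, green = down-regulated (as in the text);
    grey  = below the (assumed) threshold in either direction.
    """
    if log2_ratio >= threshold:
        return "red"
    if log2_ratio <= -threshold:
        return "green"
    return "grey"

# Hypothetical log2 ratios for three genes.
colours = [expression_colour(x) for x in (2.3, -1.5, 0.2)]
```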

Before the pathways are scored, a hit list has to be generated that contains, for each pathway, the number of genes that are matched. The hit ratio consists of two numbers calculated for each pathway: the first is the number of hits, the second is the total number of genes in the pathway. In this way the system allows pathways to be ranked according to the number of affected genes.
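A hit list of this kind can be sketched in a few lines of Python. The sketch assumes that a "hit" is a pathway gene that also appears in the set of affected (e.g. differentially expressed) genes of an experiment; the pathway names and gene identifiers are invented.

```python
from typing import Dict, Set, Tuple

def build_hit_list(pathways: Dict[str, Set[str]],
                   affected: Set[str]) -> Dict[str, Tuple[int, int]]:
    """For each pathway return (hits, total genes in the pathway)."""
    return {name: (len(genes & affected), len(genes))
            for name, genes in pathways.items()}

# Hypothetical pathways and affected genes.
pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4"},
    "pathway_B": {"g2", "g5"},
}
affected = {"g2", "g3", "g9"}

hit_list = build_hit_list(pathways, affected)
# pathway_A has 2 hits out of 4 genes, pathway_B has 1 hit out of 2 genes.
```

Keeping both numbers rather than a single percentage preserves the difference between, say, 1 of 2 and 10 of 20 matched genes.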

To assist the automatic identification of the most affected pathways we use a filter and a compare function. The filter allows the user to set a cut-off value; only pathways that exceed the cut-off are presented. The cut-off may be specified either as a number of hits or as a hit ratio [%]. The compare function ranks the pathways with respect to the number of hits. Furthermore, it is possible to compare several experiments: for each experiment a hit list has to be generated and saved, and the saved lists can afterwards be compared using the compare function.
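The filter and compare steps could look roughly like the following Python sketch. The function names, the hit-list layout (pathway name mapped to hits and total genes) and the inclusive comparison against the cut-off are assumptions, not the gEn0m interface.

```python
from typing import Dict, List, Optional, Tuple

HitList = Dict[str, Tuple[int, int]]  # pathway -> (hits, total genes)

def filter_hits(hits: HitList,
                min_hits: Optional[int] = None,
                min_ratio_percent: Optional[float] = None) -> HitList:
    """Keep pathways that reach the cut-off, given as hits and/or hit ratio [%]."""
    kept = {}
    for name, (n_hits, n_genes) in hits.items():
        ratio = 100.0 * n_hits / n_genes if n_genes else 0.0
        if min_hits is not None and n_hits < min_hits:
            continue
        if min_ratio_percent is not None and ratio < min_ratio_percent:
            continue
        kept[name] = (n_hits, n_genes)
    return kept

def rank_by_hits(hits: HitList) -> List[str]:
    """Pathway names sorted by descending number of hits."""
    return sorted(hits, key=lambda name: hits[name][0], reverse=True)

def compare_experiments(saved: Dict[str, HitList]) -> Dict[str, Dict[str, int]]:
    """Side-by-side hit counts per pathway across saved hit lists."""
    names = sorted({n for hit_list in saved.values() for n in hit_list})
    return {n: {exp: hit_list.get(n, (0, 0))[0] for exp, hit_list in saved.items()}
            for n in names}

# Hypothetical hit lists.
hits = {"pathway_A": (2, 4), "pathway_B": (1, 2), "pathway_C": (0, 5)}
by_count = filter_hits(hits, min_hits=2)
by_ratio = filter_hits(hits, min_ratio_percent=50.0)
ranking = rank_by_hits(hits)
table = compare_experiments({
    "exp1": {"pathway_A": (2, 4)},
    "exp2": {"pathway_A": (3, 4), "pathway_B": (1, 2)},
})
```

A pathway missing from one saved list is counted as having zero hits in that experiment, which is a simplification chosen for this sketch.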


With the implemented analysis procedure it is possible to view gene expression data in the context of biological processes, i.e. pathways. Using the scoring system it is possible to filter out the most affected pathways and to sort the pathways according to the number of differentially expressed genes. Another use case is the application of the analysis method to time series experiments, where it is possible to filter out early or late regulated pathways. We conducted an analysis of gene expression data from time series experiments in the context of experimental autoimmune encephalomyelitis as well as in the context of pancreatic stellate cells. In both experiments it was possible to narrow the more than 6000 pathways in the database down to 20-30 candidate pathways.
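One way to separate early from late regulated pathways in a time series is to record the first time point at which a pathway reaches the hit cut-off. The Python sketch below illustrates this idea only; the abstract does not specify how the distinction is made, and the time labels, hit counts and cut-off are invented.

```python
from typing import Dict, List, Optional, Tuple

def first_affected(hits_over_time: List[Tuple[str, int]],
                   min_hits: int) -> Optional[str]:
    """Return the first time label at which hits reach min_hits, else None.

    hits_over_time must be in chronological order.
    """
    for label, hits in hits_over_time:
        if hits >= min_hits:
            return label
    return None

# Hypothetical hit counts per pathway at three time points.
series = {
    "pathway_A": [("t1", 3), ("t2", 4), ("t3", 1)],
    "pathway_B": [("t1", 0), ("t2", 1), ("t3", 5)],
    "pathway_C": [("t1", 0), ("t2", 0), ("t3", 1)],
}
onset = {name: first_affected(points, min_hits=3)
         for name, points in series.items()}
# pathway_A responds at t1 (early), pathway_B at t3 (late),
# pathway_C never reaches the cut-off.
```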


The interpretation of gene expression data in the context of biological pathways is a promising approach. However, the approach presented here is only a first step. Only a small fraction of the biological processes can be measured at the transcriptional level as presented here. The next step should therefore integrate knowledge about the function of the entities in the pathway. If we know whether a gene acts as a suppressor or as an activator in the pathway under consideration, we can carry out plausibility checks. A promising standard for this next step is BioPAX, a community-driven effort to create a standard exchange format for pathway data. BioPAX is based on the Web Ontology Language OWL DL, a knowledge representation language based on description logics. Several pathway databases such as BIND, BioCyc, Reactome and KEGG will support this exchange format. In future work we will implement an interface for the integration of BioPAX data.


This project is funded by the Ministry for Education of the German federal state of Mecklenburg-Vorpommern (M/V) with European Regional Development Funds, EFRE 0400210/2004.


[1] Shatkay, H., Edwards, S., Wilbur, W. J., and Boguski, M., 2000, Genes, themes and microarrays: using information retrieval for large-scale gene analysis. 8, 317-28.
[2] Quackenbush, J., 2001, Computational analysis of microarray data. Nat Rev Genet, 2 (6), 418-427.
[3] Fellenberg, M. and Mewes, H. W., 1999, Interpreting Clusters of Gene Expression Profiles in Terms of Metabolic Pathways. German Conference on Bioinformatics. 185-187.
[4] Zien, A., Kuffner, R., Zimmer, R., and Lengauer, T., 2000, Analysis of gene expression data with pathway scores. International Conference Intelligent Systems in Molecular Biology, UNITED STATES. 407-417.
[5] Bansemer, S., Scheel, T., Glass, Ä., and Karopka, T., 2003, gEn0M - Software. European Conference on Computational Biology 2, Paris. 2-7261-1257. 397-398.
[6] Kanehisa, M. and Goto, S., 2000, KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research (Online), 28 (1), 27-30.
[7] Karp, P. D., Riley, M., Paley, S. M., and Pellegrini-Toole, A., 2002, The MetaCyc Database. Nucleic Acids Research (Online), 30 (1), 59-61.
[8] Luyf, A. C., de Gast, J., and van Kampen, A. H., 2002, Visualizing metabolic activity on a genome-wide scale. Bioinformatics (Oxford, England), 18 (6), 813-818.
[9] Doniger, S. W., Salomonis, N., Dahlquist, K. D., Vranizan, K., Lawlor, S. C., and Conklin, B. R., 2003, MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biology, 4 (1), R7.
[10] Trost, E., Hackl, H., Maurer, M., and Trajanoski, Z., 2003, Java editor for biological pathways. Bioinformatics (Oxford, England), 19 (6), 786-787.
[11] Grosu, P., Townsend, J. P., Hartl, D. L., and Cavalieri, D., 2002, Pathway Processor: a tool for integrating whole-genome expression results into metabolic networks. Genome Research, 12 (7), 1121-1126.