GOALIE was constructed to analyze the results of (microarray) clustering experiments arranged along a timeline.
Microarray technologies today constitute a popular approach for characterizing cellular transcriptional states by simultaneously measuring the mRNA abundance of many thousands of genes. Gene expression levels (absolute or relative), measured while the cell is subjected to a particular ambient condition, can be readily studied by contemporary statistical methods, visualization techniques, and data mining algorithms. Typically, statistical and data-mining analysis methods draw the biologist's attention to targeted sets of genes, e.g., those that vary in a well correlated manner [ESB+98], are under similar regulatory control [SSR+03], or have consistent functional annotation or ontological categorizations. Yet the full dataset, most of which is abandoned by these techniques, contains a richer and more detailed picture. For instance, how does a cell marshal resources and respond to a given stress? To obtain such global and dynamic perspectives on transcription states, we must bring together quantitative analysis of microarray datasets with formal models for characterizing the temporal evolution of biological processes.
A formal way to reason about a dynamical system is to encode its properties into the vernacular of temporal logic. Temporal logics are traditionally defined in terms of Kripke structures (V, E, P) [CGP99], which can also be interpreted as a "semantic support" for hybrid systems. Here (V, E) is a directed graph having the reachable states of the system as vertices and state transitions of the system as edges. For instance, in a classic cell-cycle example, there would be six states: M, G1(I), G1(II), S, G2 and G0. P is a labeling of the states of the system with properties that hold in each state. To obtain a Kripke structure from a reachability graph, one first needs to fix a set of atomic propositions AP, which denote the properties of individual states. For instance, we can define a proposition p to be "cell size large enough for division". p is hence not true in states M, G1(I), and G0. It, however, becomes true in G1(II). Once we have defined a vocabulary of such propositions, we replace the state symbols (M, G1(I), etc.) with the set of atomic propositions that hold in that state. This is a map P from the set of states to the power set of AP. The resulting labeled graph is a Kripke structure.
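The cell-cycle example can be written down as a small data structure. The following is a minimal sketch, not part of GOALIE: the six states and the proposition p come from the text, while the transitions in `E`, the truth of p in S and G2, and the helper names `successors` and `holds_EF` are assumptions made purely for illustration.

```python
# Vertices V: the six cell-cycle states named in the text.
V = {"M", "G1(I)", "G1(II)", "S", "G2", "G0"}

# Edges E: ASSUMED transitions, chosen only to make the example concrete.
E = {
    ("M", "G1(I)"),
    ("G1(I)", "G1(II)"),
    ("G1(I)", "G0"),
    ("G0", "G1(I)"),
    ("G1(II)", "S"),
    ("S", "G2"),
    ("G2", "M"),
}

# Atomic propositions AP. p: "cell size large enough for division".
AP = {"p"}

# Labeling P: a map from each state to the subset of AP that holds there.
P = {
    "M": set(),
    "G1(I)": set(),
    "G0": set(),
    "G1(II)": {"p"},
    "S": {"p"},   # assumption: the text does not say whether p holds in S
    "G2": {"p"},  # assumption: the text does not say whether p holds in G2
}


def successors(state):
    """Return the states reachable from `state` in one transition."""
    return {target for (source, target) in E if source == state}


def holds_EF(prop, start):
    """Check whether `prop` holds in `start` or in some state reachable from it.

    This corresponds to the temporal-logic formula "EF prop" and is
    implemented as a plain depth-first search over the graph (V, E).
    """
    seen, stack = set(), [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        if prop in P[state]:
            return True
        stack.extend(successors(state))
    return False
```

Under these assumed transitions, `holds_EF("p", "G0")` asks whether a quiescent cell can eventually reach a state in which it is large enough to divide.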
In general, how do we obtain Kripke structures in the first place? Typically, only well understood model systems or experimental conditions afford such formal definitions. Notice that this is a problem of both defining a state transition diagram and providing a labeling for the states using a vocabulary. By bringing together our prior work in redescription mining [RKM+04] and "model checking" algorithms for systems biology [APUM03], we present a novel approach to automatically infer Kripke structures from time-course microarray datasets, one that presents global and dynamic perspectives of transcriptional states.
A redescription is a shift of vocabulary, or a different way of communicating a given aspect of information. Redescription mining is a technique to find sets (here, of genes) that afford multiple definitions. The inputs to redescription mining are the universal set of open reading frames (ORFs) in a given organism, and various subsets (called descriptors) defined over this universal set. These subsets could be based on prior biological knowledge or defined by the outputs of algorithms operating on gene expression data. An example descriptor is "genes involved in glucose biosynthesis." The goal of redescription mining is to connect these diverse vocabularies by relating set-theoretic constructs formed over the descriptors.
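As a rough illustration of what "sets that afford multiple definitions" means, the sketch below compares descriptors pairwise and reports those that pick out nearly the same genes. It is a simplification under stated assumptions: the Jaccard coefficient is used as the similarity measure (the text does not name one), only single descriptors are compared rather than unions, intersections, or differences of descriptors, and the ORF identifiers and descriptor contents are invented.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical descriptors over a hypothetical universe of ORFs.
descriptors = {
    "glucose biosynthesis": {"orf1", "orf2", "orf3", "orf4"},
    "induced under heat shock": {"orf2", "orf3", "orf4", "orf5"},
    "cell wall organization": {"orf6", "orf7"},
}


def redescriptions(descs, threshold):
    """Return (name_a, name_b, score) for descriptor pairs that overlap strongly.

    A pair is reported when its Jaccard coefficient is at least `threshold`,
    i.e. when the two vocabularies describe approximately the same gene set.
    """
    names = sorted(descs)
    found = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(descs[first], descs[second])
            if score >= threshold:
                found.append((first, second, score))
    return found
```

With the invented data above and a threshold of 0.5, the only pair reported is "glucose biosynthesis" and "induced under heat shock", which share three of five ORFs.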
Basic to such redescription analysis would be an algorithm to reveal a hidden Kripke model (HKM), composed of a set of hidden states or possible worlds, transitions among the states, and a labeling of the states with logical propositions. At first glance, this may appear to be a variation of the classical Hidden Markov Model (HMM), a popular approach among bioinformaticists, but there are several basic differences. There are no obvious emission alphabets that can be observed; rather, true logical propositions from a universe of discourse must be inferred or redescribed. No system architecture can be assumed a priori, and the transitions themselves must be inferred from the structure and the semantics of the possible worlds. Once the HKM has been inferred, however, it is expected to be equally powerful in discovering invariants, predicting dynamic properties of unannotated genes, and predicting the behavior of a cell, or even an organ or an organism, at a system level under various conditions. The underlying theoretical questions are deep and challenging.
To see how we can use redescriptions to infer Kripke structures, consider one vocabulary based on expression levels in given time points/intervals and another vocabulary based on the GO biological process taxonomy. Redescription in this scenario is equivalent to labeling time-dependent expression clusters (states) with atomic symbols based on Gene Ontology (GO) categories (propositions). To obtain state transitions, we perform redescription again, but this time helping to connect states defined over one time slice to states defined in the neighboring (successive) time slice. Essentially, we have used descriptors defined in a propositional temporal logic and performed redescriptions both within and across intervals of time. By subsequently piecing together these redescriptions into a Kripke-structure model, we obtain a global picture of the temporal nature of the underlying biological processes. This approach effectively integrates our earlier work on model-checking methods [APUM03] with the data-driven emphasis of redescriptions.
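The two redescription steps can be sketched in a few lines. This is an illustrative outline under stated assumptions and not GOALIE's actual algorithm: clusters, gene names, and GO categories are invented, the Jaccard coefficient with a fixed threshold stands in for whatever criterion GOALIE applies, and the function names `label_states` and `link_states` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical expression clusters, one dict per successive time slice.
slices = [
    {"t0_c0": {"g1", "g2", "g3"}, "t0_c1": {"g4", "g5"}},
    {"t1_c0": {"g1", "g2"}, "t1_c1": {"g3", "g4", "g5"}},
]

# Hypothetical GO biological-process categories and their member genes.
go_terms = {
    "cell cycle": {"g1", "g2", "g3"},
    "stress response": {"g4", "g5"},
}


def label_states(slices, go_terms, threshold):
    """Redescribe within a time slice: label each cluster (state) with the
    GO categories (propositions) whose gene sets it approximately matches."""
    labels = {}
    for clusters in slices:
        for name, genes in clusters.items():
            labels[name] = {
                term for term, members in go_terms.items()
                if jaccard(genes, members) >= threshold
            }
    return labels


def link_states(slices, threshold):
    """Redescribe across time slices: add a transition from a cluster to a
    cluster in the next slice when their gene sets approximately match."""
    edges = set()
    for current, following in zip(slices, slices[1:]):
        for src, src_genes in current.items():
            for dst, dst_genes in following.items():
                if jaccard(src_genes, dst_genes) >= threshold:
                    edges.add((src, dst))
    return edges
```

Taken together, the cluster names, the output of `link_states`, and the output of `label_states` play the roles of V, E, and P in the Kripke structure (V, E, P) described earlier.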
The main GOALIE interface
Another sub-view with comparison of cluster plots.
GOALIE is a simple application that can be downloaded and stored anywhere on disk. For the time being we do not provide anything more.
However, before using GOALIE there are some prerequisites that need to be met. GOALIE relies on the presence of an ODBC connection to an instance of the GO database. The GO database itself requires MySQL. Just download the appropriate tools and datasets and make sure that you have an ODBC connection to the GO database.
Please read the detailed installation instructions and the INSTALLATION file that comes in the downloadable package for a step-by-step description of the procedure.
GOALIE has been tested on a number of examples. The well-known Yeast Cell Cycle example by Spellman et al. has been the GOALIE showcase for some time.
A preliminary manual describing how to use GOALIE is available.
The following references are pertinent to GOALIE.