


TITLES, AUTHORS, ABSTRACTS FOR ISMB02

Paper01, Presentation02

Title:

Mining viral protease data to extract cleavage knowledge

Authors:

Ajit Narayanan, Xikun Wu and Z. Rong Yang

Abstract:

Motivation: The motivation is to identify, through machine learning techniques, specific patterns in HIV and HCV viral polyprotein amino acid residues where viral protease cleaves the polyprotein as it leaves the ribosome. An understanding of viral protease specificity may help the development of future anti-viral drugs involving protease inhibitors by identifying specific features of protease activity for further experimental investigation. While viral sequence information is growing at a fast rate, there is still comparatively little understanding of how viral polyproteins are cut into their functional unit lengths. The aim of the work reported here is to investigate whether it is possible to generalise from known cleavage sites to unknown cleavage sites for two specific viruses - HIV and HCV. An understanding of proteolytic activity for specific viruses will contribute to our understanding of viral protease function in general, thereby leading to a greater understanding of protease families and their substrate characteristics.

Results: Our results show that artificial neural networks and symbolic learning techniques (See5) capture some fundamental and new substrate attributes, but neural networks outperform their symbolic counterpart.

Availability: Publicly available software was used (Stuttgart Neural Network Simulator - , and See5 - ). The datasets (HIV, HCV) used for See5 are available at:

Contact: a.narayanan@ex.ac.uk, z.r.yang@ex.ac.uk

Paper02, Presentation03

Title:

The metric space of proteins - comparative study of clustering algorithms

Authors:

Ori Sasson, Nati Linial, Michal Linial

Abstract:

Motivation: A large fraction of biological research concentrates on individual proteins and on small families of proteins. One of the current major challenges in bioinformatics is to extend our knowledge also to very large sets of proteins. Several major projects have tackled this problem. Such undertakings usually start with a process that clusters all known proteins or large subsets of this space. Some work in this area is carried out automatically, while other attempts incorporate expert advice and annotation.

Results: We propose a novel technique that automatically clusters protein sequences. We consider all proteins in SWISSPROT, and carry out an all-against-all BLAST similarity test among them. With this similarity measure in hand we proceed to perform a continuous bottom-up clustering process by applying alternative rules for merging clusters. The outcome of this clustering process is a classification of the input proteins into a hierarchy of clusters of varying degrees of granularity. Here we compare the clusters that result from alternative merging rules, and validate the results against InterPro.

Our preliminary results show that clusters that are consistent with several rather than a single merging rule tend to comply with InterPro annotation. This is an affirmation of the view that the protein space consists of families that differ markedly in their evolutionary conservation.

Availability: The outcome of these investigations can be viewed in an interactive Web site at .

Supplementary information: Biological examples for comparing the performance of the different algorithms used for classification are presented in .

Contact: ori@cs.huji.ac.il

Paper03, Presentation04

Title:

Authors:

Nicholas Steffen, Scott Murphy, Lorenzo Tolleri, Wesley Hatfield, Richard Lathrop

Abstract:

Motivation: Direct recognition, or direct readout, of DNA bases by a DNA-binding protein involves amino acids that interact directly with features specific to each base. Experimental evidence also shows that in many cases the protein achieves partial sequence specificity by indirect recognition, i.e., by recognizing structural properties of the DNA. (1) Could threading a DNA sequence onto a crystal structure of bound DNA help explain the indirect recognition component of sequence specificity? (2) Might the resulting pure-structure computational motif manifest itself in familiar sequence-based computational motifs?

Results: The starting structure motif was a crystal structure of DNA bound to the integration host factor protein (IHF) of E. coli. IHF is known to exhibit both direct and indirect recognition of its binding sites. (1) Threading DNA sequences onto the crystal structure showed statistically significant partial separation of 60 IHF binding sites from random and intragenic sequences and was positively correlated with binding affinity. (2) The crystal structure was shown to be equivalent to a linear Markov network, and so, to a joint probability distribution over sequences, computable in linear time. It was transformed algorithmically into several common pure-sequence representations, including (a) small sets of short exact strings, (b) weight matrices, (c) consensus regular patterns, (d) multiple sequence alignments, and (e) phylogenetic trees. In all cases the pure-sequence motifs retained statistically significant partial separation of the IHF binding sites from random and intragenic sequences. Most exhibited positive correlation with binding affinity. The multiple alignment showed some conserved columns, and the phylogenetic tree partially mixed low-energy sequences with IHF binding sites but separated high-energy sequences. The conclusion is that deformation energy explains part of indirect recognition, which explains part of IHF sequence-specific binding.

Availability: Code and data on request.

Contact: Nick Steffen for code (nsteffen@uci.edu) and Lorenzo Tolleri for data (Tolleri@chiron.it).

Paper04, Presentation05

Title:

Beyond tandem repeats: complex pattern structures and distant regions of similarity

Authors:

Amy M. Hauth, Deborah A. Joseph

Abstract:

Motivation: Tandem repeats (TRs) are associated with human disease, play a role in evolution and are important in regulatory processes. Despite their importance, locating and characterizing these patterns within anonymous DNA sequences remains a challenge. In part, the difficulty is due to imperfect conservation of patterns and complex pattern structures. We study recognition algorithms for two complex pattern structures: variable length tandem repeats (VLTRs) and multi-period tandem repeats (MPTRs).

Results: We extend previous algorithmic research to a class of regular tandem repeats (RegTRs). We formally define RegTRs, as well as two important subclasses: VLTRs and MPTRs. We present algorithms for identification of TRs in these classes. Furthermore, our algorithms identify degenerate VLTRs and MPTRs: repeats containing substitutions, insertions and deletions. To illustrate our work, we present the results of our analysis for two difficult regions in cattle and human, which reflect practical occurrences of these subclasses in GenBank sequence data. In addition, we show the applicability of our algorithmic techniques for identifying Alu sequences, gene clusters and other distant regions of similarity. We illustrate this with an example from yeast chromosome I.

Availability: Algorithms can be accessed at .

Contact: Amy M. Hauth (kryder@cs.wisc.edu, 608-831-2164) or Deborah A. Joseph (joseph@cs.wisc.edu, 608-262-8022), FAX: 608-262-9777.

Paper05, Presentation06

Title:

The POPPs: Clustering and Searching Using Peptide Probability Profiles

Author:

Michael J. Wise

Abstract:

POPPs is a suite of inter-related software tools that allow the user to discover what is statistically "unusual" in the composition of an unknown protein, to automatically cluster proteins into families based on peptide composition, and to search for related proteins based on peptide composition. Statistically based peptide composition provides a view of proteins that is, to some extent, orthogonal to that provided by sequence. In a test study, the POPPs suite is able to regroup into their families sets of approximately 100 randomised Pfam protein domains. The POPPs suite is also used to explore the diverse set of late embryogenesis abundant (LEA) proteins.

Availability: Contact the author.

Contact: mw263@cam.ac.uk

Paper06, Presentation08

Title:

A sequence profile based HMM for predicting and discriminating beta barrel membrane proteins

Authors:

Pier Luigi Martelli, Piero Fariselli, Anders Krogh, Rita Casadio

Abstract:

Motivation: Membrane proteins are an abundant and functionally relevant subset of proteins, putatively comprising from about 15% up to 30% of the proteomes of fully sequenced organisms. These estimates are mainly computed on the basis of sequence comparison and membrane protein prediction. It is therefore urgent to develop methods capable of selecting membrane proteins, especially outer membrane proteins, which are barely taken into consideration when proteome-wide analysis is performed. This will also help protein annotation when no homologous sequence is found in the database. The outer membrane proteins solved so far at atomic resolution interact with the external membrane of bacteria through a characteristic beta barrel structure comprising different even numbers of beta strands (beta barrel membrane proteins). In this they differ from the proteins of the cytoplasmic membrane, which are endowed with alpha helix bundles (all alpha membrane proteins), and they therefore need specialised predictors.

Results: We develop an HMM that predicts the topology of beta barrel membrane proteins using evolutionary information as input. The model is cyclic with six types of states: two for the beta strand transmembrane core, one for the beta strand cap on either side of the membrane, one for the inner loop, one for the outer loop and one for the globular domain state in the middle of each loop. The development of a specific input for the HMM based on multiple sequence alignment is novel. The per-residue accuracy of the model is 82% when a jack-knife procedure is adopted. With a model optimisation method using a dynamic programming algorithm, seven topological models out of the twelve proteins included in the testing set are also correctly predicted. When used as a discriminator, the model is rather selective. At a fixed probability value, it retains 84% of a non-redundant set comprising 145 sequences of well-annotated outer membrane proteins. Concomitantly, it correctly rejects 90% of a set of globular proteins including about 1200 chains with low sequence identity (<30%), and 90% of a set of all alpha membrane proteins including 188 chains.

Availability: The program will be available on request from the authors.

Contact: gigi@lipid.biocomp.unibo.it, biocomp.unibo.it

Paper07, Presentation09

Title:

Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA

Authors:

Christopher Bystroff, Yu Shao

Abstract:

Motivation: The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and co-workers has been merged with the I-SITES library of sequence-structure motifs and the HMMSTR model for local structure in proteins, to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction. Meeting reasonable service goals required improvements in efficiency, in particular for the ROSETTA algorithm.

Results: The new server was used for blind predictions of 40 protein sequences as part of the CASP4 blind structure prediction experiment. The results for 31 of those predictions are presented here. 61% of the residues overall were found in topologically correct predictions, which are defined as fragments of 30 residues or more with a root-mean-square deviation in superimposed alpha carbons of less than 6Å. HMMSTR 3-state secondary structure predictions were 73% correct overall. Tertiary structure predictions did not improve the accuracy of secondary structure prediction.

Availability: The server is accessible through the web at . Programs are available upon request for academics. Licensing agreements are available for commercial interests.

Supplementary information:

Contacts: bystrc@rpi.edu, shaoy@rpi.edu

Paper08, Presentation10

Title:

Prediction of Contact Maps by GIOHMMs and Recurrent Neural Networks Using Lateral Propagation From All Four Cardinal Corners

Authors:

Gianluca Pollastri, Pierre Baldi

Abstract:

Motivation: Accurate prediction of protein contact maps is an important step in computational structural proteomics. Because contact maps provide a translation and rotation invariant topological representation of a protein, they can be used as a fundamental intermediary step in protein structure prediction.

Results: We develop a new set of flexible machine learning architectures for the prediction of contact maps, as well as other information processing and pattern recognition tasks. The architectures can be viewed as recurrent neural network parameterizations of a class of Bayesian networks we call generalized input-output HMMs. For the specific case of contact maps, contextual information is propagated laterally through four hidden planes, one for each cardinal corner. We show that these architectures can be trained from examples and yield contact map predictors that outperform previously reported methods. While several extensions and improvements are in progress, the current version can accurately predict 60.5% of contacts at a distance cutoff of 8Å and 45% of distant contacts at 10Å, for proteins of length up to 300.
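
For readers unfamiliar with the representation, the Python sketch below shows how a binary contact map is typically derived from backbone coordinates. The 8Å cutoff follows the abstract; the choice of C-alpha atoms and the toy coordinates are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def contact_map(ca_coords, cutoff=8.0):
        """Binary contact map from an (N, 3) array of C-alpha coordinates.

        Entry (i, j) is 1 when residues i and j lie within `cutoff`
        angstroms of each other (diagonal entries are trivially 1). The
        map is symmetric and invariant to rotation and translation of
        the structure.
        """
        diff = ca_coords[:, None, :] - ca_coords[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        return (dist <= cutoff).astype(int)

    # Toy example: four residues along a line, 4 angstroms apart.
    coords = np.array([[0.0, 0, 0], [4.0, 0, 0], [8.0, 0, 0], [12.0, 0, 0]])
    print(contact_map(coords))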

Availability and Contact: The contact map predictor will be made available through as part of an existing suite of proteomics predictors.

Email: {gpollast,pfbaldi}@ics.uci.edu

Paper09, Presentation11

Title:

Rate4Site: An Algorithmic Tool for the Identification of Functional Regions on Proteins by Surface Mapping of Evolutionary Determinants within Their Homologues

Authors:

Tal Pupko, Rachel Bell, Itay Mayrose, Fabian Glaser, Nir Ben-Tal

Abstract:

Motivation: A number of proteins of known three-dimensional (3D) structure but as yet unknown function exist, and in light of the recent progress in structure determination methodology this number is likely to increase rapidly. A novel method is presented here, "Rate4Site", which maps the rate of evolution among homologous proteins onto the molecular surface of one of the homologues whose 3D structure is known. Functionally important regions correspond to surface patches of slowly evolving residues.

Results: Rate4Site estimates the rate of evolution of amino acid sites using the maximum likelihood (ML) principle. The ML estimate of the rates considers the topology and branch lengths of the phylogenetic tree, as well as the underlying stochastic process. To demonstrate its potency, we study the Src SH2 domain. Like previously established methods, Rate4Site detected the SH2 peptide-binding groove. Interestingly, it also detected inter-domain interactions between the SH2 domain and the rest of the Src protein that other methods failed to detect.

Availability: Rate4Site can be downloaded at:

Contact: tal@ism.ac.jp; rebell@ashtoret.tau.ac.il; fabian@ashtoret.tau.ac.il; bental@ashtoret.tau.ac.il

Supplementary Information: A multiple sequence alignment of homologous domains from the SH2 protein family, the corresponding phylogenetic tree, and additional examples are available at

Paper10, Presentation12

Title:

Inferring sub-cellular localization through automated lexical analysis

Authors:

Rajesh Nair, Burkhard Rost

Abstract:

Motivation: The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about sub-cellular localization is available for only a few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns sub-cellular localization. With the rapid growth in sequence data, the biochemical characterisation of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available.

Results: The method reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for less than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8000 new annotations of sub-cellular localization for these eukaryotes.

Availability: Annotations of localization for eukaryotes at: .

Contact: rost@columbia.edu

Paper11, Presentation14

Title:

Support vector regression applied to the determination of the developmental age of a Drosophila embryo from its segmentation gene expression patterns

Authors:

Ekaterina Myasnikova, Anastassia Samsonova, John Reinitz, Maria Samsonova

Abstract:

Motivation: In this paper we address the problem of the determination of developmental age of an embryo from its segmentation gene expression patterns in Drosophila.

Results: By applying support vector regression we have developed a fast method for the automated staging of an embryo on the basis of its gene expression pattern. Support vector regression is a statistical method for creating regression functions of arbitrary type from a set of training data. The training set is composed of embryos for which the precise developmental age was determined by measuring the degree of membrane invagination. Testing the quality of regression on the training set showed good prediction accuracy. The optimal regression function was then used to predict the gene-expression-based age of embryos for which the precise age had not been measured by membrane morphology. Moreover, we show that the same prediction accuracy can be achieved when the dimensionality of the feature vector is reduced by applying factor analysis. The data reduction allowed us to avoid over-fitting and to increase the efficiency of the algorithm.

Availability: This software may be obtained from the authors.

Contact: samson@fn.csa.ru

Paper12, Presentation15

Title:

Variance stabilization applied to microarray data calibration and to the quantification of differential expression

Authors:

Wolfgang Huber, Anja von Heydebreck, Holger Sueltmann, Annemarie Poustka, Martin Vingron

Abstract:

We introduce a statistical model for microarray gene expression data that comprises data calibration, the quantification of differential expression, and the quantification of measurement error. In particular, we derive a transformation h for intensity measurements, and a difference statistic Δh whose variance is approximately constant along the whole intensity range. This forms a basis for statistical inference from microarray data, and provides a rational data pre-processing strategy for multivariate analyses. For the transformation h, the parametric form h(x) = arsinh(a + bx) is derived from a model of the variance-versus-mean dependence for microarray intensity data, using the method of variance stabilizing transformations. For large intensities, h coincides with the logarithmic transformation, and Δh with the log-ratio. The parameters of h, together with those of the calibration between experiments, are estimated with a robust variant of maximum-likelihood estimation. We demonstrate our approach on data sets from different experimental platforms, including two-color cDNA arrays and a series of Affymetrix oligonucleotide arrays.
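
The transformation itself is easy to apply once its parameters are known. The minimal Python sketch below implements h and the difference statistic Δh as stated above; the parameter values a and b are placeholders, since in practice they (and the between-experiment calibration) are estimated from the data by the robust maximum-likelihood procedure the abstract describes.

    import numpy as np

    def h(x, a=0.0, b=1.0):
        """Variance-stabilizing transformation h(x) = arsinh(a + b*x)."""
        return np.arcsinh(a + b * x)

    def delta_h(x1, x2, a=0.0, b=1.0):
        """Difference statistic Δh; for large intensities it approaches
        the log-ratio log(x1/x2)."""
        return h(x1, a, b) - h(x2, a, b)

    x1, x2 = 1e4, 5e3
    print(delta_h(x1, x2))   # ~0.693 = log(2) at high intensities
    print(np.log(x1 / x2))   # log-ratio for comparison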

Availability: Software is freely available for academic use as an R package at

Contact: w.huber@dkfz.de

Paper13, Presentation16

Title:

Binary Tree-Structured Vector Quantization Approach to Clustering and Visualizing Microarray Data

Authors:

M. Sultan, D. Wigle, CA Cumbaa, M. Maziarz, Janice Glasgow, M.-S. Tsao and Igor Jurisica

Abstract:

Motivation: With the increasing number of gene expression databases, the need for more powerful analysis and visualization tools is growing. Many techniques have successfully been applied to unravel latent similarities among genes and/or experiments. Most of the current systems for microarray data analysis use statistical methods, hierarchical clustering, self-organizing maps, support vector machines, or k-means clustering to organize genes or experiments into "meaningful" groups. Without prior explicit bias, almost all of these clustering methods applied to gene expression data not only produce different results but may also produce clusters with little or no biological relevance. Of these methods, agglomerative hierarchical clustering has been the most widely applied, although many limitations have been identified.

Results: Starting with a systematic comparison of the underlying theories behind clustering approaches, we have devised a technique that combines tree-structured vector quantization and partitive k-means clustering (BTSVQ). This hybrid technique has revealed clinically relevant clusters in three large publicly available data sets. In contrast to existing systems, our approach is less sensitive to data preprocessing and data normalization. In addition, the clustering results produced by the technique have strong similarities to those of self-organizing maps (SOMs). We discuss the advantages and the mathematical reasoning behind our approach.

Availability: The BTSVQ system is implemented in Matlab R12 using the SOM toolbox for the visualization and preprocessing of the data. BTSVQ is available for non-commercial use.

Contact: ij@uhnres.utoronto.ca

Paper14, Presentation17

Title:

A Variance-Stabilizing Transformation for Gene-Expression Microarray Data

Authors:

Blythe Durbin, Johanna Hardin, Douglas Hawkins, David Rocke

Abstract:

Motivation: Standard statistical techniques often assume that data are normally distributed, with constant variance not depending on the mean of the data. Data that violate these assumptions can often be brought in line with the assumptions by application of a transformation. Gene-expression microarray data have a complicated error structure, with a variance that changes with the mean in a non-linear fashion. Log transformations, which are often applied to microarray data, can inflate the variance of observations near background.

Results: We introduce a transformation that stabilizes the variance of microarray data across the full range of expression. Simulation studies also suggest that this transformation approximately symmetrizes microarray data.

Contact: bpdurbin@wald.ucdavis.edu

Paper15, Presentation18

Title:

Linking Gene Expression Data with Patient Survival Times Using Partial Least Squares

Authors:

Peter Park, Lu Tian, Isaac Kohane

Abstract:

There is an increasing need to link the large amount of genotypic data, gathered using microarrays for example, with various phenotypic data from patients. The classification problem, in which gene expression data serve as predictors and a class-label phenotype as the binary outcome variable, has been examined extensively, but there has been less emphasis on dealing with other types of phenotypic data. In particular, patient survival times with censoring are often not used directly as a response variable due to the complications that arise from censoring.

We show that the issues involving censored data can be circumvented by reformulating the problem as a standard Poisson regression problem. The procedure for solving the transformed problem is a combination of two approaches: partial least squares, a regression technique that is especially effective when there is severe collinearity due to a large number of predictors, and generalized linear regression, which extends standard linear regression to deal with various types of response variables. The linear combinations of the original variables identified by the method are highly correlated with the patient survival times and at the same time account for the variability in the covariates. The algorithm is fast, as it does not involve any matrix decompositions in the iterations. We apply our method to data sets from lung carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness.

Contact: peter_park@harvard.edu

Paper16, Presentation20

Title:

Microarray Synthesis through Multiple-Use PCR Primer Design

Authors:

Rohan Fernandes, Steven Skiena

Abstract:

A substantial percentage of the expense in constructing full-genome spotted microarrays comes from the cost of synthesizing the PCR primers to amplify the desired DNA. We propose a computationally-based method to substantially reduce this cost. Historically, PCR primers are designed so that each primer occurs uniquely in the genome. This condition is unnecessarily strong for selective amplification, since only the primer pair associated with each amplification need be unique. We demonstrate that careful design in a genome-level amplification project permits us to save the cost of several thousand primers over conventional approaches.

Contact: {skiena, rohan}@cs.sunysb.edu

Paper17, Presentation21

Title:

Discovering Statistically Significant Biclusters in Gene Expression Data

Authors:

Amos Tanay, Roded Sharan, Ron Shamir

Abstract:

In gene expression data, a bicluster is a subset of the genes exhibiting consistent patterns over a subset of the conditions. We propose a new method to detect significant biclusters in large expression datasets. Our approach is graph-theoretic, coupled with statistical modeling of the data. Under plausible assumptions, our algorithm is polynomial and is guaranteed to find the most significant biclusters. We tested our method on a collection of yeast expression profiles and on a human cancer dataset. Cross-validation results show high specificity in assigning function to genes based on their biclusters, and we are able to annotate in this way 196 uncharacterized yeast genes. We also demonstrate how the biclusters lead to detecting new concrete biological associations. In cancer data we are able to detect and relate finer tissue types than was previously possible. We also show that the method outperforms the biclustering algorithm of Cheng and Church (2000).

Contact: {amos,roded,rshamir}@tau.ac.il.

Availability: cs.tau.ac.il/~rshamir/biclust.html.

Paper18, Presentation22

Title:

Co-Clustering of Biological Networks and Gene Expression Data

Authors:

Daniel Hanisch, Alexander Zien, Ralf Zimmer, Thomas Lengauer

Abstract:

Motivation: Large scale gene expression data are often analyzed by clustering genes based on gene expression data alone, even though a priori knowledge in the form of biological networks is available. The use of this additional information promises to improve exploratory analysis considerably.

Results: We propose to construct a distance function which combines information from expression data and biological networks. Based on this function, we compute a joint clustering of genes and vertices of the network. This general approach is elaborated for metabolic networks. We define a graph distance function on such networks and combine it with a correlation-based distance function for gene expression measurements. A hierarchical clustering and an associated statistical measure is computed to arrive at a reasonable number of clusters. Our method is validated using expression data of the yeast diauxic shift. The resulting clusters are easily interpretable in terms of the biochemical network and the gene expression data and suggest that our method is able to automatically identify processes that are relevant under the measured conditions.
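
As a hedged illustration of the general idea, not the authors' exact definitions, the Python sketch below combines a precomputed network distance matrix with a correlation-based expression distance through a simple weighted sum; the mixing weight lam and the rescaling to [0, 1] are assumptions. The resulting matrix can then be fed to any hierarchical clustering routine.

    import numpy as np

    def combined_distance(expr, net_dist, lam=0.5):
        """Combine expression and network information into one distance.

        expr     : (genes, conditions) expression matrix
        net_dist : (genes, genes) precomputed network (graph) distance
        lam      : mixing weight between the two sources (assumed form)
        """
        # Correlation-based distance: 1 - Pearson correlation of profiles.
        d_expr = 1.0 - np.corrcoef(expr)
        # Rescale both matrices to [0, 1] so the weights are comparable.
        d_expr = d_expr / d_expr.max()
        d_net = net_dist / net_dist.max()
        return lam * d_expr + (1.0 - lam) * d_net

    rng = np.random.default_rng(1)
    expr = rng.normal(size=(5, 12))            # 5 genes, 12 conditions
    net = rng.integers(1, 6, size=(5, 5)).astype(float)
    net = (net + net.T) / 2.0                  # symmetrize toy distances
    np.fill_diagonal(net, 0.0)
    print(combined_distance(expr, net))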

Contact: Daniel.Hanisch@scai.fhg.de

Paper19, Presentation23

Title:

Statistical process control for large scale microarray experiments

Authors:

Fabian Model, Thomas Koenig, Christian Piepenbrock, Peter Adorjan

Abstract:

Motivation: Maintaining and controlling data quality is a key problem in large scale microarray studies. In particular, systematic changes in experimental conditions across multiple chips can seriously affect quality and even lead to false biological conclusions. Traditionally, the influence of these effects can only be minimized by expensive repeated measurements, because a detailed understanding of all process-relevant parameters seems impossible.

Results: We introduce a novel method for microarray process control that estimates quality solely from the distribution of the actual measurements, without requiring repeated experiments. A robust version of principal component analysis detects single outlier microarrays and thereby enables the use of techniques from multivariate statistical process control. In particular, the T^2 control chart reliably tracks undesired changes in process-relevant parameters. This can be used to improve the microarray process itself, limits necessary repetitions to affected samples only, and therefore maintains quality in a cost-effective way. We demonstrate the power of the approach on three large sets of DNA methylation microarray data.
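
The T^2 control chart mentioned above has a standard form. The Python sketch below computes Hotelling's T^2 statistic for each chip from PCA scores; for simplicity it uses ordinary rather than robust principal component analysis, and the number of components is an arbitrary choice.

    import numpy as np

    def t_squared(X, n_components=3):
        """Hotelling's T^2 statistic for each row (microarray) of X.

        X is (arrays, features); each array's scores on the leading
        principal components are scaled by the component variances and
        summed. Large values flag outlier chips.
        """
        Xc = X - X.mean(axis=0)
        # SVD-based PCA; rows of Vt are principal directions.
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        scores = Xc @ Vt[:n_components].T             # (arrays, k)
        var = (s[:n_components] ** 2) / (len(X) - 1)  # component variances
        return (scores ** 2 / var).sum(axis=1)

    X = np.random.default_rng(0).normal(size=(20, 50))
    print(t_squared(X))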

Contact: Fabian.Model@

Paper20, Presentation24

Title:

Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays

Authors:

John Tobler, Mike Molla, Jude Shavlik

Abstract:

Motivation: Microarrays are a fast and cost-effective method of performing thousands of DNA hybridization experiments simultaneously. DNA probes are typically used to measure the expression level of specific genes. Because probes greatly vary in the quality of their hybridizations, choosing good probes is a difficult task. If one could accurately choose probes that are likely to hybridize well, then fewer probes would be needed to represent each gene in a gene-expression microarray, and, hence, more genes could be placed on an array of a given physical size. Our goal is to empirically evaluate how successfully three standard machine-learning algorithms - naïve Bayes, decision trees, and artificial neural networks - can be applied to the task of predicting good probes. Fortunately it is relatively easy to get training examples for such a learning task: place various probes on a gene chip, add a sample where the corresponding genes are highly expressed, and then record how well each probe measures the presence of its corresponding gene. With such training examples, it is possible that an accurate predictor of probe quality can be learned.

Results: Two of the learning algorithms we investigate - naïve Bayes and neural networks - learn to predict probe quality surprisingly well. For example, among the top ten predicted probes for a given gene not used for training, on average about five rank in the top 2.5% of that gene's hundreds of possible probes. Decision-tree induction and the simple approach of using predicted melting temperature to rank probes perform significantly worse than these two algorithms. The features we use to represent probes are very easily computed, and the time taken to score each candidate probe after training is minor. Training the naïve Bayes algorithm takes very little time, and while it takes over 10 times as long to train a neural network, that time is still not substantial (on the order of a few hours on a desktop workstation). We also report on the information contained in the features we use to describe the probes. We find the fraction of cytosine in the probe to be the most informative feature. We also find, not surprisingly, that the nucleotides in the middle of the probe's sequence are more informative than those at the ends of the sequence.
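
The features referred to are straightforward sequence statistics. The Python sketch below computes a few features of the kind described, including the cytosine fraction singled out above; the exact feature set used in the paper is not reproduced here.

    def probe_features(probe):
        """Simple per-probe features of the kind used for learning.

        Returns base-composition fractions plus the identity of the
        central nucleotide (positions near the middle are reported
        above as more informative than the ends). Illustrative only.
        """
        n = len(probe)
        feats = {base: probe.count(base) / n for base in "ACGT"}
        feats["middle_base"] = probe[n // 2]
        return feats

    print(probe_features("ACGTACGTACGTACGTACGTACGT"))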

Contact: molla@cs.wisc.edu

Paper21, Presentation26

Title:

The Degenerate Primer Design Problem

Authors:

Chaim Linhart, Ron Shamir

Abstract:

A PCR primer sequence is called degenerate if some of its positions have several possible bases. The degeneracy of the primer is the number of unique sequence combinations it contains. We study the problem of designing a pair of primers with prescribed degeneracy that match a maximum number of given input sequences. Such problems occur when studying a family of genes that is known only in part, or is known in a related species. We prove that various simplified versions of the problem are hard, show the polynomiality of some restricted cases, and develop approximation algorithms for one variant. Based on these algorithms, we implemented a program called HYDEN for designing highly-degenerate primers for a set of genomic sequences. We report on the success of the program in an experimental scheme for identifying all human olfactory receptor (OR) genes. In that project, HYDEN was used to design primers with degeneracies up to 10^10 that amplified with high specificity many novel genes of that family, tripling the number of OR genes known at the time.
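
The degeneracy defined above is simply the product, over primer positions, of the number of bases allowed at each position. A minimal Python sketch, using the standard IUPAC nucleotide codes (the example primer is made up):

    from math import prod

    # Standard IUPAC nucleotide codes and the bases they stand for.
    IUPAC = {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
        "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
        "H": "ACT", "V": "ACG", "N": "ACGT",
    }

    def degeneracy(primer):
        """Number of distinct plain-base sequences the primer matches."""
        return prod(len(IUPAC[c]) for c in primer.upper())

    print(degeneracy("ATGNNYR"))  # 1*1*1*4*4*2*2 = 64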

Availability: Available on request from the authors

Contact: {chaiml,rshamir}@tau.ac.il

Paper22, Presentation27

Title:

Splicing Graphs and EST Assembly Problem

Authors:

Steffen Heber, Max Alekseyev, Sing-Hoi Sze, Haixu Tang, Pavel Pevzner

Abstract:

Motivation: The traditional approach to annotate alternative splicing is to investigate every splicing variant of the gene in a case-by-case fashion. This approach, while useful, has some serious shortcomings. Recent studies indicate that alternative splicing is more frequent than previously thought and some genes may produce tens of thousands of different transcripts. A list of alternatively spliced variants for such genes would be difficult to build and hard to analyze. Moreover, such a list does not show the relationships between different transcripts and does not show the overall structure of all transcripts. A better approach would be to represent all splicing variants for a given gene in a way that captures the relationships between different splicing variants.

Results: We introduce the notion of the splicing graph, which is a natural and convenient representation of all splicing variants. The key difference from existing approaches is that we abandon the linear (sequence) representation of each transcript and substitute it with a graph representation in which each transcript corresponds to a path in the graph. We further design an algorithm to assemble EST reads into the splicing graph, rather than assembling them into each splicing variant in a case-by-case fashion.
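
A minimal Python sketch of the data structure (not the authors' EST assembly algorithm): if each transcript is given as an ordered list of exon identifiers, the splicing graph has exons as vertices and a directed edge for every consecutive exon pair, so each transcript becomes a path and shared structure is stored only once.

    from collections import defaultdict

    def splicing_graph(transcripts):
        """Build a splicing graph from transcripts given as exon lists.

        Vertices are exon identifiers; a directed edge joins every pair
        of exons that appear consecutively in some transcript. Each
        input transcript is then a path in the resulting graph.
        """
        edges = defaultdict(set)
        for tx in transcripts:
            for a, b in zip(tx, tx[1:]):
                edges[a].add(b)
        return dict(edges)

    # Two splice variants sharing exons e1 and e4 (hypothetical gene).
    print(splicing_graph([["e1", "e2", "e4"], ["e1", "e3", "e4"]]))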

Availability:

Contact: sheber@ucsd.edu

Paper23, Presentation28

Title:

Exact Genetic Linkage Computations for General Pedigrees

Authors:

Ma'ayan Fishelson, Dan Geiger

Abstract:

Motivation: Genetic linkage analysis is a useful statistical tool for mapping disease genes and for associating functionality of genes with their location on the chromosome. There is a need for a program that computes multipoint likelihood on general pedigrees with many markers and that also deals with two-locus disease models.

Results: In this paper we present algorithms for performing exact multipoint likelihood calculations on general pedigrees with a large number of highly polymorphic markers, taking into account a variety of disease models. We have implemented these algorithms in a new computer program called SUPERLINK, which outperforms leading linkage software with regard to functionality, speed, memory requirements and extensibility.

Availability: SUPERLINK is available at

Contact: {fmaayan, dang}@cs.technion.ac.il

Paper24, Presentation29

Title:

Assigning Probes into A Small Number of Pools Separable by Electrophoresis

Authors:

Teemu Kivioja, Mikko Arvas, Kari Kataja, Merja Penttilä, Hans Söderlund, Esko Ukkonen

Abstract:

Motivation: Measuring transcriptional expression levels (transcriptional profiling) has become one of the most important methods in functional genomics. Still, new measuring methods are needed to obtain more reliable, quantitative data about transcription on a genomic scale. In this paper we concentrate on certain computational optimization problems arising in the design of one such novel method. From a computational point of view, the key feature of the new method is that the hybridized probes are distinguished from each other based on their different sizes. Therefore the probes have to be assigned into pools such that the probes in the same pool have unique sizes, different enough from each other. Identification of the expressed RNA is given by probe pool and probe size, while quantification is given by the label of the probe, e.g. fluorescence intensity.

Results: We show how to computationally find the probes and assign them into pools for a whole genome such that i) each gene has a specific probe suitable for amplification and hybridization, and ii) the expression level measurement can be done in a minimal number of pools separable by electrophoresis, in order to minimize the total experimental cost of the measurement. Our main result is a polynomial-time approximation algorithm for assigning the probes into pools. We demonstrate the feasibility of the procedure by selecting probes for the yeast genome and assigning them into fewer than 100 pools. The probe sequences and their assignment into pools are available for academic research on request from the authors.
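
To make the pooling constraint concrete, the Python sketch below uses a greedy first-fit heuristic rather than the approximation algorithm of the paper: probes, represented by their lengths, are placed into the first pool in which all existing probes differ in size by at least some minimum resolvable gap. Both the gap value and the example lengths are made up.

    def assign_pools(lengths, min_gap=4):
        """First-fit assignment of probe lengths to electrophoresis pools.

        Two probes may share a pool only if their lengths differ by at
        least `min_gap` bases, so that their bands remain separable.
        Greedy heuristic for illustration only; the paper gives an
        approximation algorithm with guarantees.
        """
        pools = []  # each pool is a list of probe lengths
        for length in sorted(lengths):
            for pool in pools:
                if all(abs(length - other) >= min_gap for other in pool):
                    pool.append(length)
                    break
            else:
                pools.append([length])
        return pools

    print(assign_pools([20, 22, 24, 28, 30, 33, 37], min_gap=4))
    # -> [[20, 24, 28, 33, 37], [22, 30]]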

Contact: Teemu.Kivioja@cs.Helsinki.FI

Paper25, Presentation30

Title:

Representing Genetic Sequence Data for Pharmacogenomics: An Evolutionary Approach Using Ontological and Relational Models

Authors:

Daniel Rubin, Farhad Shafa, Diane Oliver, Michael Hewett, Teri Klein, Russ Altman

Abstract:

Motivation: The information model chosen to store biological data affects the types of queries possible, database performance, and difficulty in updating that information model. Genetic sequence data for pharmacogenetics studies can be complex, and the best information model to use may change over time. As experimental and analytical methods change, and as biological knowledge advances, the data storage requirements and types of queries needed may also change.

Results: We developed a model for genetic sequence and polymorphism data, and used XML schema to specify the elements and attributes required for this model. We implemented this model as an ontology in a frame-based representation and as a relational model in a database system. We collected genetic data from two pharmacogenetics resequencing studies, and formulated queries useful for analyzing these data. We compared the ontology and relational models in terms of query complexity, performance, and difficulty in changing the information model. Our results demonstrate benefits of evolving the schema for storing pharmacogenetics data: ontologies perform well in early design stages as the information model changes rapidly and simplify query formulation, while relational models offer improved query speed once the information model and types of queries needed stabilize.

Availability: Our ontology and relational models are available at .

Contact: rubin@smi.stanford.edu, russ.altman@stanford.edu, help@

Paper26, Presentation32

Title:

Evaluating Functional Network Inference Using Simulations of Complex Biological Systems

Authors:

V. Anne Smith, Erich D. Jarvis, Alexander J. Hartemink

Abstract:

Motivation: Although many network inference algorithms have been presented in the bioinformatics literature, no suitable approach has been formulated for evaluating their effectiveness at recovering models of complex biological systems from limited data. To overcome this limitation, we propose an approach to evaluate network inference algorithms according to their ability to recover a complex functional network from biologically reasonable simulated data.

Results: We designed a simulator to generate data from a complex biological system at multiple levels of organization: behavior, neural anatomy, brain electrophysiology, and gene expression of songbirds. About 90% of the simulated variables are unregulated by other variables in the system and are included simply as distracters. We sampled the simulated data at intervals as one would sample from a biological system in practice, and then used the sampled data to evaluate the effectiveness of an algorithm we developed for functional network inference. We found that our algorithm is highly effective at recovering the functional network structure of the simulated system, including the irrelevance of unregulated variables, from sampled data alone. To assess the reproducibility of these results, we tested our inference algorithm on 50 separately simulated sets of data, and it consistently recovered the complex functional network structure underlying the simulated data almost perfectly. To our knowledge, this is the first approach for evaluating the effectiveness of functional network inference algorithms at recovering models from limited data. Our simulation approach also enables researchers to design, a priori, experiments and data-collection protocols that are amenable to functional network inference.

Availability: Source code and simulated data are available upon request.

Contact: amink@cs.duke.edu, asmith@neuro.duke.edu, jarvis@neuro.duke.edu

Paper27, Presentation33

Title:

The Pathway Tools Software

Authors:

Peter Karp, Suzanne Paley, Pedro Romero

Abstract:

Motivation: Bioinformatics requires reusable software tools for creating model-organism databases (MODs).

Results: The Pathway Tools is a reusable, production-quality software environment for creating a type of MOD called a Pathway/Genome Database (PGDB). A PGDB such as EcoCyc (see ) integrates our evolving understanding of the genes, proteins, metabolic network, and genetic network of an organism. This paper provides an overview of the four main components of the Pathway Tools: the PathoLogic component supports creation of new PGDBs from the annotated genome of an organism; the Pathway/Genome Navigator provides query, visualization, and Web-publishing services for PGDBs; the Pathway/Genome Editors support interactive updating of PGDBs; and the Pathway Tools ontology defines the schema of PGDBs. The Pathway Tools uses the Ocelot object database system to provide data management services for PGDBs. The Pathway Tools has been used to build PGDBs for 13 organisms, both within SRI and by external users.

Availability: The software is freely available to academics and is available for a fee to commercial institutions. Contact ptools-support@ai. for information on obtaining the software.

Contact: pkarp@ai.

Paper28, Presentation34

Title:

Discovering regulatory and signaling circuits in molecular interaction networks

Authors:

Trey Ideker, Owen Ozier, Benno Schwikowski, Andrew Siegel

Abstract:

Motivation: In model organisms such as yeast, large databases of protein-protein and protein-DNA interactions have become an extremely important resource for the study of protein function, evolution, and gene regulatory dynamics. In this paper we demonstrate that by integrating these interactions with widely-available mRNA expression data, it is possible to generate concrete hypotheses for the underlying mechanisms governing the observed changes in gene expression. To perform this integration systematically and at large scale, we introduce an approach for screening a molecular interaction network to identify active subnetworks, i.e., connected regions of the network that show significant changes in expression over particular subsets of conditions. The method we present here combines a rigorous statistical measure for scoring subnetworks with a search algorithm for identifying subnetworks with high score.
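
One common way to make such a subnetwork score concrete, offered here as a hedged Python sketch rather than the paper's exact statistic, is to convert each gene's significance of expression change to a z-score and aggregate over the k genes of a candidate subnetwork as z_A = (sum of z_i)/sqrt(k), so that subnetworks of different sizes are comparable under the null.

    from math import sqrt
    from statistics import NormalDist

    def subnetwork_score(pvalues):
        """Aggregate z-score for a candidate subnetwork (sketch).

        Each gene's p-value for significant expression change is
        converted to a z-score with the inverse normal CDF; the sum is
        divided by sqrt(k) so that, under the null, the aggregate score
        of a k-gene subnetwork is again standard normal.
        """
        nd = NormalDist()
        z = [nd.inv_cdf(1.0 - p) for p in pvalues]
        return sum(z) / sqrt(len(z))

    # Three genes with strong expression changes score highly together.
    print(subnetwork_score([0.001, 0.01, 0.02]))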

Results: We evaluated our procedure on a small interaction network of 332 genes and a large network of 4160 genes containing all 7462 protein-protein and protein-DNA interactions in the yeast public databases. In the case of the small network, we identified five significant subnetworks that covered 41 out of 77 (53%) of all significant changes in expression. Both network analyses returned several top-scoring subnetworks with good correspondence to known regulatory mechanisms in the literature. These results demonstrate how large-scale genomic approaches may be used to uncover signaling and regulatory pathways in a systematic, integrative fashion.

Availability: The methods presented in this paper are implemented in the Cytoscape software package which is available to the academic community at .

Contact: trey@wi.mit.edu

Paper29, Presentation35

Title:

Modelling Pathways in E. coli from Time Series Expression Profiles

Authors:

Irene Ong, David Page, Jeremy Glasner

Abstract:

Motivation: Cells continuously reprogram their gene expression network as they move through the cell cycle or sense changes in their environment. In order to understand the regulation of cells, time series expression profiles provide a more complete picture than single time point expression profiles. Few analysis techniques, however, are well suited to modeling such time series data.

Results: We describe an approach that naturally handles time series data, with the capability of modeling causality, feedback loops, and environmental or hidden variables, using a dynamic Bayesian network. We also present a novel way of combining prior biological knowledge and current observations to improve the quality of analysis and to model interactions between sets of genes rather than individual genes. Our approach is evaluated on time series expression data measured in response to physiological changes that affect tryptophan metabolism in E. coli. Results indicate that this approach is capable of finding correlations between sets of related genes.

Contact: ong@cs.wisc.edu

Paper30, Presentation36

Title:

Of Truth and Pathways: Chasing Bits of Information through Myriads of Articles

Authors:

Michael Krauthammer, Carol Friedman, George Hripcsak, Ivan Iossifov, Andrey Rzhetsky

Abstract:

Knowledge of interactions between molecules in living cells is indispensable for theoretical analysis and practical applications in modern genomics and molecular biology. Building such networks relies on the assumption that the correct molecular interactions are known or can be identified by reading a few research articles. However, this assumption does not necessarily hold, as truth is rather an emerging property based on many potentially conflicting facts. This paper explores the processes of knowledge generation and publishing in the molecular biology literature using modeling and analysis of real molecular interaction data. The data analyzed in this article were automatically extracted from 50,000 research articles in molecular biology by GeneWays, a computer system containing a natural language processing module. The paper indicates that the truthfulness of statements is associated in the minds of scientists with the relative importance (connectedness) of the substances under study, revealing a potential selection bias in the reporting of research results. Aiming at understanding the statistical properties of the life cycle of biological facts reported in research articles, we formulate a stochastic model describing the generation and propagation of knowledge about molecular interactions through scientific publications. We hope that in the future such a model can be useful for automatically producing consensus views of molecular interaction data.

Contact: ar345@columbia.edu

Paper31, Presentation37

Title:

A Fast and Robust Method to Infer and Characterize an Active Regulator Set for Molecular Pathways

Authors:

Dana Pe'er, Aviv Regev, Amos Tanay

Abstract:

Regulatory relations between genes are an important component of molecular pathways. Here, we devise a novel global method that uses a set of gene expression profiles to find a small set of relevant active regulators, identifies the genes that they regulate, and automatically annotates them. We show that our algorithm is capable of handling a large number of genes in a short time and is robust to a wide range of parameters. We apply our method to a combined dataset of S. cerevisiae expression profiles, and validate the resulting model of regulation by cross-validation and extensive biological analysis of the selected regulators and their derived annotations.

Paper32, Presentation39

Title:

Marginalized Kernels for Biological Sequences

Authors:

Koji Tsuda, Taishin Kin, Kiyoshi Asai

Abstract:

Motivation: Kernel methods such as support vector machines require a kernel function between objects to be defined a priori. Several approaches have been proposed for deriving kernels from probability distributions, e.g. the Fisher kernel. However, a general methodology for designing kernels is not fully developed.

Results: We propose a reasonable way of designing a kernel when objects are generated from latent variable models (e.g. HMMs). First of all, a joint kernel is designed for complete data, which include both visible and hidden variables. Then a marginalized kernel for visible data is obtained by taking the expectation with respect to the hidden variables. We show that the Fisher kernel is a special case of marginalized kernels, which gives another viewpoint on Fisher kernel theory. Although our approach can be applied to any object, we particularly derive several marginalized kernels useful for biological sequences (e.g. DNA and proteins). The effectiveness of marginalized kernels is illustrated in the task of classifying bacterial gyrase subunit B (gyrB) amino acid sequences.
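
As a concrete instance of the construction, the Python sketch below implements a marginalized count kernel of the kind described: the joint kernel counts (symbol, hidden state) co-occurrences, and marginalization replaces hard counts with posterior-weighted ones. The per-position posteriors would in practice come from the forward-backward algorithm on the HMM; here they are supplied by hand.

    from collections import defaultdict

    def marginalized_count_kernel(x, post_x, y, post_y):
        """Marginalized count kernel between two sequences.

        x, y     : symbol sequences (e.g. amino acids)
        post_x/y : per-position dicts of hidden-state posteriors,
                   e.g. post_x[t][s] = P(state = s | x) at position t
                   (normally obtained by forward-backward on the HMM)
        """
        def counts(seq, post):
            # Posterior-weighted (symbol, state) co-occurrence counts.
            c = defaultdict(float)
            for sym, dist in zip(seq, post):
                for state, p in dist.items():
                    c[(sym, state)] += p
            return c

        cx, cy = counts(x, post_x), counts(y, post_y)
        return sum(cx[k] * cy[k] for k in cx.keys() & cy.keys())

    # Toy two-state example with hand-specified posteriors.
    px = [{"s0": 0.9, "s1": 0.1}, {"s0": 0.2, "s1": 0.8}]
    py = [{"s0": 0.7, "s1": 0.3}, {"s0": 0.3, "s1": 0.7}]
    print(marginalized_count_kernel("AC", px, "AC", py))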

Contact: koji.tsuda@aist.go.jp

Paper33, Presentation40

Title:

A tree kernel to analyze phylogenetic profiles

Authors:

Jean-Philippe Vert

Abstract:

Motivation: The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every fully sequenced genome. Because proteins that participate in a common structural complex or metabolic pathway are likely to evolve in a correlated fashion, the phylogenetic profiles of such proteins are often "similar" or at least "related" to each other. The question we address in this paper is the following: how to measure the "similarity" between two profiles, in an evolutionarily relevant way, in order to develop efficient function prediction methods?

Results: We show how the profiles can be mapped to a high-dimensional vector space which incorporates evolutionarily relevant information, and we provide an algorithm to compute efficiently the inner product in that space, which we call the tree kernel. The tree kernel can be used by any kernel-based analysis method for classification or data mining of phylogenetic profiles. As an application a Support Vector Machine (SVM) trained to predict the functional class of a gene from its phylogenetic profile is shown to perform better with the tree kernel than with a naive kernel that does not include any information about the phylogenetic relationships among species. Moreover a kernel principal component analysis (KPCA) of the phylogenetic profiles illustrates the sensitivity of the tree kernel to evolutionarily relevant variations.

Availability: All data and software used are freely and publicly available upon request.

Contact: Jean-Philippe.Vert@

Paper34, Presentation41

Title:

Statistically Based Postprocessing of Phylogenetic Analysis by Clustering

Authors:

Cara Stockham, Li-San Wang, Tandy Warnow

Abstract:

Motivation: Phylogenetic analyses often produce thousands of candidate trees. Biologists resolve the conflict by computing the consensus of these trees. Single-tree consensus methods, used as postprocessing, can be unsatisfactory due to their inherent limitations.

Results: In this paper we present an alternative approach by using clustering algorithms on the set of candidate trees. We propose bicriterion problems, in particular using the concept of information loss, and new consensus trees called characteristic trees that minimize the information loss. Our empirical study using four biological datasets shows that our approach provides a significant improvement in the information content, while adding only a small amount of complexity. Furthermore, the consensus trees we obtain for each of our large clusters are more resolved than the single-tree consensus trees. We also provide some initial progress on theoretical questions that arise in this context.

Availability: Software available upon request from the authors. The agglomerative clustering is implemented using Matlab (MathWorks, 2000) with the Statistics Toolbox. The Robinson-Foulds distance matrices and the strict consensus trees are computed using PAUP (Swofford, 2001) and Daniel Huson's tree library on Intel Pentium workstations running Debian Linux.

Contact: lisan@cs.utexas.edu, telephone: +1 (512) 232-7455, fax: +1 (512) 471-8885.

Supplementary Information:

Paper35, Presentation42

Title:

Efficiently Detecting Polymorphisms During The Fragment Assembly Process

Authors:

Daniel Fasulo, Aaron Halpern, Ian Dew, Clark Mobarry

Abstract:

Motivation: Current genomic sequence assemblers assume that the input data is derived from a single, homogeneous source. However, recent whole-genome shotgun sequencing projects have violated this assumption, resulting in input fragments covering the same region of the genome whose sequences differ due to polymorphic variation in the population. While single-nucleotide polymorphisms (SNPs) do not pose a significant problem to state-of-the-art assembly methods, these methods do not handle insertion/deletion (indel) polymorphisms of more than a few bases.

Results: This paper describes an efficient method for detecting sequence discrepancies due to polymorphism that avoids resorting to the global use of more costly, less stringent affine sequence alignments. Instead, the algorithm uses graph-based methods to determine the small set of fragments involved in each polymorphism and performs more sophisticated alignments only among fragments in that set. Results from the incorporation of this method into the Celera Assembler are reported for the D. melanogaster, H. sapiens, and M. musculus genomes.

Availability: The method described herein does not constitute a stand-alone software application, but is laid out in sufficient detail to be implemented as a component of any genomic sequence assembler.

Contact: daniel.fasulo@

Paper36, Presentation43

Title:

Multiple Genome Rearrangement: A General Approach via the Evolutionary Genome Graph

Authors:

Dmitry Korkin, Lev Goldfarb

Abstract:

Motivation: In spite of the well-known fact that genome rearrangements should be viewed in the light of the evolutionary relationships within and between the species involved, no formal underlying framework based on evolutionary considerations has been proposed for treating the questions arising in the area. If such a framework is provided, all the basic questions in the area can be posed in a biologically more appropriate and useful form: e.g., the similarity between two genomes can then be computed via the nearest ancestor, rather than "directly", ignoring the evolutionary connections.

Results: We outline an evolution-based general framework for answering questions related to multiple genome rearrangement. In the proposed model, the evolutionary genome graph (EG-graph) encapsulates the evolutionary history of a genome family. For the set of all EG-graphs, we introduce a family of similarity measures, each defined via a set of genome transformations associated with a particular EG-graph. Given a set of genomes, and restricting ourselves to transpositions, an algorithm for constructing an EG-graph is presented. We also present experimental results in the form of an EG-graph for a set of concrete genomes (for several species). This EG-graph turns out to be very close to the corresponding known phylogenetic tree.

Contact: dkorkin@unb.ca

Paper37, Presentation44

Title:

Efficient multiple genome alignment

Authors:

Michael Hoehl, Stefan Kurtz, Enno Ohlebusch

Abstract:

Motivation: To allow a direct comparison of the genomic DNA sequences of sufficiently similar organisms, there is an urgent need for software tools that can align more than two genomic sequences.

Results: We developed new algorithms and a software tool "Multiple Genome Aligner" (MGA for short) that efficiently computes multiple genome alignments of large, closely related DNA sequences. For example, it can align 85% of the complete genomes of six human adenoviruses (average length 35,305 bp) in 159 seconds. An alignment of 74% of the complete genomes of three strains of E. coli (lengths: 5,528,445; 5,498,450; and 4,639,221 bp) is produced in 30 minutes.

Availability: The software MGA is available free of charge for non-commercial research institutions. For details see

Contact: {kurtz, enno}@techfak.uni-bielefeld.de

Paper38, Presentation46

Title:

PSEUDOVIEWER: Automatic Visualization of RNA Pseudoknots

Authors:

Kyungsook Han, Yujin Lee, Wootaek Kim

Abstract:

Motivation: Several algorithms have been developed for drawing RNA secondary structures; however, none of these can be used to draw RNA pseudoknot structures. In graph-theoretic terms, a drawing of an RNA secondary structure is a tree, whereas a drawing of an RNA pseudoknot is a graph with inner cycles within the pseudoknot as well as possible outer cycles formed between the pseudoknot and other structural elements. Thus, RNA pseudoknots are more difficult to visualize than RNA secondary structures. Since no automatic method for drawing RNA pseudoknots exists, visualizing them relies on a significant amount of manual work and does not yield satisfactory results. The task of visualizing RNA pseudoknots by hand becomes more challenging as their size and complexity increase.

Results: We have developed a new representation and an algorithm for drawing H-type pseudoknots together with RNA secondary structures. Compared to existing representations of H-type pseudoknots, the new representation ensures uniform and clear drawings with no edge crossings for all H-type pseudoknots. To the best of our knowledge, this is the first algorithm for automatically drawing RNA pseudoknots with RNA secondary structures. The algorithm has been implemented in a Java program, which can be executed on any computing system. Experimental results demonstrate that the algorithm generates aesthetically pleasing drawings of all H-type pseudoknots. The results also show that the drawings have high readability, enabling the user to quickly and easily recognize the whole RNA structure as well as the pseudoknots themselves.
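
The distinction drawn above between tree-like secondary structures and pseudoknots can be made concrete with a small check for crossing base pairs (an illustrative sketch, not part of PSEUDOVIEWER): two pairs (i, j) and (k, l) with i < k < j < l cross, and any crossing forces a cycle into the drawing.

    from itertools import combinations

    def crossing_pairs(pairs):
        # Return every pair of base pairs that cross; a non-empty result
        # means the structure contains a pseudoknot.
        crossings = []
        for (i, j), (k, l) in combinations(sorted(pairs), 2):
            if i < k < j < l:  # neither nested nor side by side
                crossings.append(((i, j), (k, l)))
        return crossings

    # Nested stems: no crossings, so the drawing is tree-like.
    print(crossing_pairs([(1, 10), (2, 9), (3, 8)]))  # []
    # H-type pseudoknot: the two stems cross.
    print(crossing_pairs([(1, 10), (5, 15)]))         # [((1, 10), (5, 15))]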

Availability: Available on request from the corresponding author.

Contact: khan@inha.ac.kr

Paper39, Presentation47

Title:

A powerful non-homology method for the prediction of operons in prokaryotes

Authors:

Gabriel Moreno-Hagelsieb, Julio Collado-Vides

Abstract:

Motivation: The prediction of the transcription-unit organization of genomes is an important clue in the inference of functional relationships between genes, the interpretation and evaluation of transcriptome experiments, and the overall inference of the regulatory networks governing the expression of genes in response to the environment. Though several methods have been devised to predict operons, most require extensive characterization of the genome analyzed. Log-likelihoods derived from intergenic distance distributions work surprisingly well to predict operons in Escherichia coli and are available for any genome as soon as its gene set is predicted.

Results: Here we provide evidence that the very same method is applicable to any prokaryotic genome. First, the method is equally effective when evaluated using a collection of experimentally known operons of Bacillus subtilis. Second, operons in most if not all prokaryotes appear to share the same tendency to keep short distances between their genes, the most frequent distances being overlaps of four and one base pairs. The universality of this structural feature allows us to predict the organization of transcription units in all prokaryotes. Third, predicted operons contain a higher proportion of genes with related phylogenetic profiles and conservation of adjacency than predicted borders of transcription units.
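
A minimal sketch of the distance log-likelihood idea (hypothetical training counts, not the authors' code): bin the intergenic distances observed between known same-operon gene pairs and between pairs at transcription-unit borders, and score each bin by the log ratio of the two frequencies.

    import math
    from collections import Counter

    def distance_log_likelihoods(same_operon_dists, border_dists, bin_size=10):
        # Log-likelihood per distance bin: log P(d | same operon) / P(d | border).
        # Negative distances are overlaps, e.g. the -4 and -1 bp noted above.
        def binned(dists):
            counts = Counter(d // bin_size for d in dists)
            total = sum(counts.values())
            return {b: (n + 1) / (total + len(counts))  # add-one smoothing
                    for b, n in counts.items()}
        p_op, p_bd = binned(same_operon_dists), binned(border_dists)
        eps = 1e-6
        return {b: math.log(p_op.get(b, eps) / p_bd.get(b, eps))
                for b in set(p_op) | set(p_bd)}

    # Hypothetical training data: same-operon pairs sit close or overlap,
    # border pairs lie far apart; positive scores mark likely operon pairs.
    ll = distance_log_likelihoods([-4, -1, -1, 5, 20], [150, 300, 80, 450])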

Contact: moreno@cifn.unam.mx

Supplementary information: Additional materials and graphs are available at: .

Paper40, Presentation48

Title:

Identifying Operons and Untranslated Regions of Transcripts Using Escherichia coli RNA Expression Analysis

Authors:

Brian Tjaden, David Haynor, Sergey Stolyar, Carsten Rosenow, Eugene Kolker

Abstract:

Microarrays traditionally have been used to assay the transcript expression of coding regions of genes. Here, we use Escherichia coli oligonucleotide microarrays to assay transcript expression of both open reading frames (ORFs) and intergenic regions. We then use hidden Markov models to analyze these expression data and estimate the transcription boundaries of genes. This approach allows us to identify 5′ untranslated regions (UTRs) of transcripts as well as genes that are likely to be operon members. The operon elements we identify correspond to documented operons with 99% specificity and 63% sensitivity. Similarly, we find that our 5′ UTR results coincide accurately with experimentally verified promoter regions for most genes.
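
A toy version of such a boundary-calling HMM (two states with hypothetical Gaussian-style emissions; not the authors' model) can be run with Viterbi decoding over ordered probe intensities:

    import math

    def viterbi_boundaries(intensities, stay=0.95, transcribed_mean=7.0):
        # Two states: "T" (transcribed) expects high probe intensity,
        # "N" (not transcribed) expects intensity near zero.
        def emit(state, x):
            mu = transcribed_mean if state == "T" else 0.0
            return -0.5 * (x - mu) ** 2  # crude Gaussian log-likelihood
        states = ("T", "N")
        log_stay, log_switch = math.log(stay), math.log(1 - stay)
        v = {s: math.log(0.5) + emit(s, intensities[0]) for s in states}
        back = []
        for x in intensities[1:]:
            nv, ptr = {}, {}
            for s in states:
                scores = {p: v[p] + (log_stay if p == s else log_switch)
                          for p in states}
                best = max(scores, key=scores.get)
                nv[s] = scores[best] + emit(s, x)
                ptr[s] = best
            back.append(ptr)
            v = nv
        s = max(v, key=v.get)
        path = [s]
        for ptr in reversed(back):
            s = ptr[s]
            path.append(s)
        return path[::-1]  # maximal runs of "T" are putative transcripts

    print(viterbi_boundaries([0.1, 0.3, 6.8, 7.2, 7.0, 0.2]))
    # ['N', 'N', 'T', 'T', 'T', 'N']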

Contact: tjaden@cs.washington.edu

Paper41, Presentation49

Title:

Detecting Recombination with MCMC

Authors:

Dirk Husmeier, Grainne McGuire

Abstract:

Motivation: We present a statistical method for detecting recombination, whose objective is to accurately locate the recombinant breakpoints in DNA sequence alignments of small numbers of taxa (4 or 5). Our approach explicitly models the sequence of phylogenetic tree topologies along a multiple sequence alignment. Inference under this model is done in a Bayesian way, using Markov chain Monte Carlo (MCMC). The algorithm returns the site-dependent posterior probability of each tree topology, which is used for detecting recombinant regions and locating their breakpoints.
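
As a toy illustration of this Bayesian machinery (not the BARCE implementation; all quantities below are hypothetical), a Metropolis-Hastings sampler over a single breakpoint, given per-site log-likelihoods under two candidate topologies, already yields a site-dependent posterior:

    import math, random

    def sample_breakpoint(loglik, n_iter=10000, seed=0):
        # loglik[t][i]: log-likelihood of alignment column i under topology t;
        # topology 0 applies left of breakpoint b, topology 1 to its right.
        rng = random.Random(seed)
        n = len(loglik[0])
        def joint(b):
            return sum(loglik[0][:b]) + sum(loglik[1][b:])
        b = n // 2
        counts = [0] * (n + 1)
        for _ in range(n_iter):
            prop = min(n, max(0, b + rng.choice((-1, 1))))  # local proposal
            if math.log(rng.random()) < joint(prop) - joint(b):
                b = prop  # Metropolis-Hastings acceptance
            counts[b] += 1
        return [c / n_iter for c in counts]  # posterior over breakpoints

    # Columns 0-4 favour topology 0, columns 5-9 favour topology 1;
    # the posterior mass concentrates at breakpoint position 5.
    ll0 = [-1.0] * 5 + [-3.0] * 5
    ll1 = [-3.0] * 5 + [-1.0] * 5
    print(sample_breakpoint([ll0, ll1]))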

Results: The method was tested on one synthetic and three real DNA sequence alignments, where it was found to outperform the established detection methods PLATO, RECPARS, and TOPAL.

Availability: The algorithm has been implemented in the C++ program package BARCE, which is freely available from

Contact: dirk@bioss.ac.uk

Paper42, Presentation50

Title:

Finding Composite Regulatory Patterns in DNA Sequences

Authors:

Eleazar Eskin, Pavel Pevzner

Abstract:

Motivation: Pattern discovery in unaligned DNA sequences is a fundamental problem in computational biology with important applications in finding regulatory signals. Current approaches to pattern discovery focus on monad patterns that correspond to relatively short contiguous strings. However, many of the actual regulatory signals are composite patterns: groups of monad patterns that occur near each other. A difficulty in discovering composite patterns is that one or both of the component monad patterns in the group may be "too weak". Since traditional monad-based motif-finding algorithms usually output one (or a few) high-scoring patterns, they often fail to find composite regulatory signals consisting of weak monad parts. In this paper, we present the MITRA (MIsmatch TRee Algorithm) approach for discovering composite signals. We demonstrate that MITRA performs well for both monad and composite patterns by presenting experiments on biological and synthetic data.
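
For illustration only, here is a brute-force version of the underlying search (MITRA's contribution is to prune this exponential enumeration with a mismatch tree; the parameters below are hypothetical):

    from itertools import product

    def occurrences(pattern, seq, d):
        # Start positions where pattern occurs within Hamming distance d.
        l = len(pattern)
        return [i for i in range(len(seq) - l + 1)
                if sum(a != b for a, b in zip(pattern, seq[i:i + l])) <= d]

    def find_composites(seqs, l=3, d=1, min_gap=2, max_gap=6):
        # Enumerate every pair of (l, d) monads and keep those that co-occur
        # at a bounded gap in all sequences; infeasible for realistic l,
        # which is exactly what the mismatch tree pruning addresses.
        monads = ["".join(p) for p in product("ACGT", repeat=l)]
        hits = []
        for p1, p2 in product(monads, repeat=2):
            if all(any(min_gap <= j - (i + l) <= max_gap
                       for i in occurrences(p1, s, d)
                       for j in occurrences(p2, s, d))
                   for s in seqs):
                hits.append((p1, p2))
        return hits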

Availability: MITRA is available at

Contact: eeskin@cs.columbia.edu
