THE FLORIDA STATE UNIVERSITY



THE FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

DEVELOPING A BIOINFORMATICS UTILITY BELT TO ELIMINATE SEARCH REDUNDANCY FROM THE EVER-GROWING DATABASES

By

MISHA TAYLOR

A Thesis submitted to the

Department of Computer Science

in partial fulfillment of the

requirements for the degree of

Master of Science

Degree Awarded:

Spring Semester, 2003

Copyright © 2003

Misha Taylor

All Rights Reserved

The members of the Committee approve the thesis of Misha Taylor defended on April 1, 2003.

______________________________

Robert van Engelen

Professor Directing Thesis

______________________________

David Swofford

Committee Member

______________________________

Theodore Baker

Committee Member

______________________________

Steven M. Thompson

Committee Member

The Office of Graduate Studies has verified and approved the above named committee members

THIS THESIS IS DEDICATED TO MY DEAR FRIEND JANE SINAGUB, WITHOUT WHOSE SUPPORT THIS WORK COULD NOT HAVE BEEN PRODUCED. WE’VE BEEN THROUGH SOME TOUGH TIMES, BUT I’M GLAD THAT WE WERE ABLE TO MAKE IT THROUGH THE YEARS TOGETHER AS FRIENDS. I LOVE YOU DEARLY JANE, AND I’M HAPPY THAT YOU DECIDED TO LET ME PLAY A BIT PART IN YOUR LIFE.

To Paul, “Brother” Max, and “Ma Petite” Flora, you’re as much as my family as Jane and I love you just as much.

Also, thanks to Deena Westbrook for keeping me company after bioinformatics class and entertained with chitchat while writing my thesis. It made the process a lot easier and more fun than it should have been.

ACKNOWLEDGMENTS

STEVEN THOMPSON TAUGHT ME MORE ABOUT BIOINFORMATICS IN SIX MONTHS THAN I COULD HAVE LEARNED ON MY OWN IN SIX YEARS. I AM STILL AMAZED THAT HE WAS WILLING TO TAKE ON A COMPUTER SCIENCE STUDENT WITH ABSOLUTELY NO BACKGROUND IN BIOLOGY WHATSOEVER. HE TOOK ME OUT TO DINNER AND SLOWLY INTRODUCED ME TO THE FIELD OF BIOINFORMATICS. ALL THIS WITHOUT ME BEING ENROLLED IN HIS CLASSES OR BEING ONE OF HIS STUDENTS, BUT JUST WANTING TO KNOW MORE ABOUT HIS WORLD. STEVE IS ONE OF THE MOST EXTRAORDINARY TEACHERS I’VE EVER MET. I ALSO APPRECIATE HIS PATIENCE IN RE-EXPLAINING BIOLOGICAL CONCEPTS TO ME TWO OR THREE TIMES AS IT WAS VERY DIFFICULT FOR ME TO START ACQUIRING A “BIOLOGICAL MINDSET”.

Thanks to Theresa Thompson for helping me understand the “tao of Steve”, when things were getting difficult at one point, and also getting to the bottom of the mystery of the pink pants. Her advice on literature and politics was also very helpful and insightful.

Robert van Engelen was my major professor throughout this process. I appreciate his guidance and advice, and his willingness to work with an interdisciplinary project. I also would like to thank my original advisor, Ernest McDuffie, who sadly left The Florida State University before I was able to complete my original thesis.

I also appreciate having taken a semester of Dave Swofford’s bioinformatics class while completing this thesis. I think it really helped to give me a solid grounding in some of the fundamental algorithms. He is a very knowledgeable and accomplished instructor.

TABLE OF CONTENTS

LIST OF TABLES VII

List of Figures viii

Abstract ix

INTRODUCTION 1

BIOLOGICAL DATABASES 3

Biological Databases are massive 4

Biological data can be noisy and redundant, with unclear features 5

Another Artificial Source of Redundancy 6

A REAL-WORLD PROBLEM 7

Process of the human expert 8

Sorting out redundancies 8

Taxonomy Analysis 10

SIMILARITY SEARCHES 12

BLAST Output 12

Part 1 – Overview of the Query 12

Part 2 – Descriptions of each significant alignment. 13

Part 3 – Pairwise Alignments 13

Part 4 – Statistical Summary 14

Evolution 15

Three fundamental algorithm types 18

Dot plots 18

The need for quantitative methods 21

Pairwise sequence alignment 24

Global Alignment 28

Local Alignments 32

Multiple sequence alignment 32

GCG WISCONSIN PACKAGE 34

MUMS AND SUFFIX TREES 41

MUMs 42

Longest Increasing Subsubsequence 47

CONSENSUS ALGORITHMS 48

Clustering in MUMmer 2.1 49

The Union-Find Algorithm 49

IMPLEMENTATION RECOMMENTATIONS: C++ OR JAVA? 52

NINJA 53

The Cost of Being Object-Oriented 54

Further adventures of the Rice Research Team 55

CONCLUSION 57

EPILOGUE 61

REFERENCES 63

BIOGRAPHICAL SKETCH 67

List of Tables

TABLE 1 - DOT PLOT EXAMPLE 18

Table 2 - The 4 DNA Nucleotide Sequences and Their Official Codes 23

Table 3 - The 20 Amino Acids and Their Official Codes 23

List of Figures

FIGURE 1 - GROWTH OF GENBANK 5

Figure 2 - Align ESTs to cDNAs 9

Figure 3 - Example of a sequence region which contains a gene model 10

Figure 4 - BLAST Output, Part 1 - Overview of the Query 12

Figure 5 - BLAST Output, Part 2 - Description of each significant alignment 13

Figure 6 - BLAST Output, Part 3 – Pairwise Alignments 14

Figure 7 - BLAST Output, Part 4 - Statistical Summary 15

Figure 8 - Origin of genes having a similar sequence[1] 16

Figure 9 - Dot plot window example 1 19

Figure 10 - Dot plot window example 2 20

Figure 11 - Dot plot of two paralogues 21

Figure 12 - Key observation for dynamic programming 26

Figure 13 - Dynamic programming always looks back to previous diagonal 27

Figure 14 - Recurrence relation 28

Figure 15 - Global alignment, part 1 29

Figure 16 - Global alignment, part 2 30

Figure 17 - Global alignment, part 3 30

Figure 18 - Global alignment, part 4 31

Figure 19 - Global alignment, part 5 31

Figure 20 - Suffix tree 44

Figure 21 - Aligning Genome A and Genome B after locating MUMs 45

Figure 22 - Crossing anchors 46

Figure 23 - Aligned anchors 46

Figure 24 - Clustering two-dimensional data 48

Abstract

BIOLOGICAL DATABASES ARE GROWING AT AN EXPONENTIAL RATE. DESIGNING ALGORITHMS TO DEAL WITH THE INHERENT REDUNDANCY IN THESE DATABASES WHICH CAN COPE WITH THE OVERWHELMING AMOUNT OF DATA RETURNED FROM SIMILARITY SEARCHES IS AN ACTIVE AREA OF RESEARCH.

This paper presents an overview of a real-world problem related to biological database searching, outlining how a human expert solves this problem. Then, several bioinformatics approaches are presented from the literature, forming a “utility belt” which might be used to solve the problem computationally.

INTRODUCTION

BIOINFORMATICS TOOLS ARE BASICALLY A SET OF COMPUTER ALGORITHMS AND PROCEDURES FOR ANALYZING AND SIFTING THROUGH BIOLOGICAL DATA[1]. THE FIELD OF BIOINFORMATICS HAS BEEN AROUND FOR ABOUT 20 OR 30 YEARS. BIOLOGISTS, REALIZING JUST HOW VALUABLE A TOOL COMPUTERS WERE, STARTED APPLYING COMPUTER TECHNOLOGY TO THEIR SCIENTIFIC ENDEAVORS JUST AS SOON AS COMPUTERS WERE AVAILABLE TO THE GENERAL PUBLIC. ACCORDING TO STANFORD PROFESSOR, MARK A. MUSEN[2], IT IS ONLY THE TERM BIOINFORMATICS THAT IS RELATIVELY NEW, NOT THE ACTUAL FIELD ITSELF.

The bio-prefix of the word should be fairly straightforward to understand. Bioinformatics deals with biology and biological data.

It is the –informatics suffix that is more interesting. The word informatics is derived from the French word, informatique, which can be translated into English as “data processing”[3].

Why is an uncommon word like informatics used in the term bioinformatics? Mark Musen[2] claims that it is in the European tradition to incorporate elements of data processing into the fundamental computer science curriculum, whereas in the United States it is not. Instead, if a student wishes to pursue a serious study of “information, its structure, its acquisition, and its use,” that student historically enrolled in a business program, or more recently, an information sciences program. Apparently in Europe, it is more common that the equivalent of information sciences is integrated into the computer science curriculum and not as a separate field of study or departmental entity.

A key difference between the informatics perspective and the traditional computer science perspective is focus. In informatics, the information and knowledge is of central focus, whereas in computer science, the focus is mostly on algorithms and writing programs[2]. From an informatics perspective “you can’t do anything without the data – knowledge is power,” whereas from a computer science perspective one might say “you can’t do anything without a program – after all, it is the code actually performing the action.” Neither statement is invalid, depending on your perspective.

In fact, bioinformatics incorporates both data and algorithms. The term used for the field is a reminder that the biology and the data are paramount.

That is a brief explanation of the word bioinformatics and the etymology of the term. As mentioned earlier, it is a fairly new term for the field.

So, our task, should we choose to accept it, is to do the research necessary in order to prepare for the development of a bioinformatics utility belt. This virtual bioinformatics utility belt, explained herein in this document, will help us tackle the information overload from the ever-growing biological databases.

Biological databases are filled with redundant sequence information for various reasons, which will be covered in more detail in the next chapter. When searches are performed against these databases, the result sets also contain redundant “hits”. In this document, it is our goal to explore the nature of this redundancy problem more fully, putting together a list of possible solutions and approaches to address this issue – the utility belt. We won’t be able to offer a solution to the redundancy problem here, but rather it is our goal to offer a starting point for a potential solution.

CHAPTER 1

BIOLOGICAL DATABASES

WHILE ONE CAN ARGUE THAT EVEN IN THE INTERNET AGE, MOST NEW BIOLOGICAL INFORMATION IS STILL PUBLISHED AS TEXT IN THE FORM OF JOURNAL ARTICLES, PAPERS AND ANNOTATIONS[4] – NOTHING IS GOING TO KEEP THE ACADEMIC ENGINE AWAY FROM THE “PUBLISH OR PERISH” MANTRA ANYTIME SOON – THERE HAS BEEN AN EXPLOSION IN THE AMOUNT OF BIOLOGICAL DATA “PUBLISHED” TO COMPUTER DATABASES OVER THE PAST 20 YEARS. RESEARCHERS ROUTINELY PUBLISH THEIR BIOLOGICAL FINDINGS TO INTERNET DATABASES SUCH AS GENBANK, SWISS-PROT, PFAM, AND SMART.

GenBank is one of the largest and oldest biological databases. It contains all publicly available DNA sequences[5]. Obviously, DNA sequences are important to biologists because DNA sequences contain the unique instructions that indicate how to create a particular organism. DNA consists of varying sequences (typically very long sequences of thousands or millions of molecules) out of 4 possible molecules called “bases” or “nucleotides”[6].

SWISS-PROT is another one of the older biological databases. It contains “annotated protein sequences,” including shape and sequence information. Proteins are important because they are responsible for performing most of the functions of life. Information about the three-dimensional (3D) shape of a protein is important, because shape determines the function of a protein within an organism. Proteins are sequences of 20 possible amino acids. As far as biologists know, proteins with identical sequences fold into the same three-dimensional shape.

The Pfam and SMART databases are newer, smaller, more specialized protein databases, classifying and categorizing proteins into families and domains. They contain much more detailed information, complementing the SWISS-PROT protein database.

GenBank, SWISS-PROT, Pfram, and SMART are just a few selected examples of popular biological databases. There are many others available that are also widely used. Biologists rarely perform research without referring to these biological databases. Even if results are published in the form of a report or paper, most journals require that researchers post their findings to at least one of these databases. In addition, most of these database efforts are funded by government-sponsored grants, so many of these databases are also publicly-accessible, including all of the databases mentioned above[5].

Biological Databases are massive

One characteristic of these biological databases is that they are massive. As of August 2002, there are approximately 18,197,000 sequence records in GenBank[7]. Release 40.44 of SWISS-PROT contains 122,214 annotated protein sequence records comprising 44,864,044 amino acids on February 22, 2003[8]. Pfam contains 3,735 alignments and 3-D models for protein families, and the SMART databases contains 500 domains encompassing more than 54,000 proteins[5].

Also, biological databases are rapidly growing. For example, GenBank, one of the oldest and largest biological databases, is doubling is size every 15 months[5]. The following diagram shows the almost exponential growth rate of GenBank from 1982 to 2002. The number of bases has been plotted against the release date[9].

[pic]

Figure 1 - Growth of GenBank

These databases are enormous. Any researcher wanting to search through this data would have to scan through tens of thousands to millions of records to find data relevant to their research. At times, this can seem like looking for a needle in a haystack.

Biological data can be noisy and redundant, with unclear features

Another characteristic of biological data is that it can be fuzzy. Any quantitative measure of a biological system is merely an approximation. Biological researchers use a wide variety of procedures to collect their data, and the results are typically an “average value of several independent experiments”[10].

In addition to being fuzzy, biological data also tends to have a lot of noise in it. While some of this noise is due to experimental error, this noise can also be produced by nature itself in the form of mutations. As an organism evolves over time, physical changes get expressed in its genes and chromosomes. A gene is a specific sequence of DNA that produces a protein. A chromosome is a collection of genes and is the basic unit of how traits get passed from parents to children[11]. Two genes are matched based on their sequence similarity, not on the exact sequence, given that one must account for evolution in an organism, and the overall “fuzziness” in this whole process.

Another issue that contributes to noise in the data is that biologists do not yet understand the relevance of many components in DNA sequences and protein sequences. For example, certain sections in a DNA sequence code for certain proteins – these are called exons. Other components do not seem to affect an organism’s protein-making machinery in any way that biologists can understand. These sections are known as “spacer DNA” or “noncoding DNA”. Spacer DNA within a gene are also known as introns[12]. While looking for a sequence match, one must take into account only the exons (the coding regions) in a sequence. In humans, only 3% of a genetic sequence may consist of exons. As with mutations, this means that sequence matching involves similarity and inexact matching techniques rather than exact techniques[6].

Another Artificial Source of Redundancy

In addition to the previously mentioned sources of redundancy, there is another artificial source of redundancy, due to the re-entry of the duplicate sequences in the database added with newer and more advanced sequencing techniques. While effort is made to purge old, duplicate entries from databases, this process is not perfect. Because experimental error is involved, the older, shorter copies might not “exhibit the same quality” as the longer version, and it may be difficult to identify an older sequence as a duplicate entry. As the technology improves to gather longer nucleotide sequences, there may be multiple “copies” of a gene – a longer version, and perhaps several other older, shorter copies that were sequenced earlier. Thus, there is more redundancy and noise in biological databases as a result[13].

CHAPTER 3

A REAL-WORLD PROBLEM

IN ORDER TO PROVIDE FOCUS FOR OUR RESEARCH IN THE DEVELOPMENT OF A BIOINFORMATICS UTILITY BELT, WE HAVE BEEN PROVIDED A REAL-WORLD BIOLOGICAL PROBLEM BY STEVEN M. THOMPSON, A COLLEAGUE AT THE FLORIDA STATE UNIVERSITY SCHOOL OF COMPUTATIONAL SCIENCE & INFORMATION TECHNOLOGY. WE WON’T BE ABLE TO SOLVE THE PROBLEM HERE. RATHER, WE NEED IT TO INVESTIGATE THE REQUIREMENTS OF THE UTILITY BELT. IN ADDITION, HAVING A REAL-WORLD PROBLEM OR APPLICATION AREA IN MIND WHEN ONE PERFORMS RESEARCH CAN HELP MAINTAIN FOCUS.

Having some idea of the real-world constraints and strains and stresses of a real-world application makes the research take into account some of the noisy aspects of the real-world problems in bioinformatics.

The real-world problem is this – with the exponential growth in modern biological databases (which are doubling every 15 months), many real-world database searches return too much data to deal with at one time. Several computer algorithms have been developed to return results from biological databases quickly and effectively, but still, biological researchers are required to further sift through the results of a search manually, sorting out redundancies because these algorithms don’t do this automatically.

Our aim is to develop a computer program which automates the process of sorting out these search result redundancies, producing smaller, more manageable result sets. The ultimate goal would be to develop a tool which is as good at sorting out redundancies as a human expert, but a tool which can make the list of possibilities smaller in a short period time would suffice, as this process is time-consuming for a human to perform.

Our goal is to help, providing a bioinformatics utility belt to accomplish this task.

Process of the human expert

In order to get useful information from the results of a search performed on a biological database, a human expert will often first go through a list of sequences returned and perform two processes in parallel:

1. Sort out redundancies

2. Taxonomy analysis

While performing these two processes, the human expert distills a subset of the resultant sequences, those that are redundant or duplicate, down to just one sequence. By iteratively going through this process, the list of candidate sequences gets progressively smaller.

This process is somewhat analogous to the contig assembly process in which short sequences are assembled together to form a long, unambiguous sequence that does not overlap on DNA sequence machines. This process is just on a much larger scale than contig assembly. Instead of forming one unambiguous sequence from a bunch of shorter sequences, any kind of sequence in the DNA database is fair game.

If one performs these two processes in sequence, sorting out redundancies and taxonomy analysis, the order of what should be done should the reverse of what is above. One should first perform taxonomy analysis, then sort out redundancies, since taxonomy analysis will reduce the size of the sequence list more quickly.

Sorting out redundancies

Let us delve into the “sorting out redundancies” step in a little more detail. For the types of searches that this problem deals with, the resultant search will usually contain three basic types of sequence data:

1. Genomic sequence – Full sequences with introns (non-coding), exons (coding), and “nongenic” (non-coding) information.

2. cDNA – Complementary DNA from mRNA, roughly represents the coding components of a genomic sequence. May be partial or complete.

3. EST – Expressed sequence tag. Fairly short cDNA sequences that do not cover the entire coding sequence of a gene. Used for quick identification of genes. ESTs are fairly error prone, because an EST is generated in a single pass, however it “allows [for] gene RNA products to be observed more directly than [with] genomic sequencing”[14].

The human expert performs the following steps to sort out redundancies:

1. Cluster analysis – Align ESTs to cDNAs to genomic sequences (as shown in the figure below).

2. Splice out introns.

3. Use the consensus.

[pic]

Figure 2 - Align ESTs to cDNAs

The most difficult problem in redundancy analysis is aligning cDNA to genomic sequences. Since genomic sequences contain introns (non-coding regions) and spacer sequences, it can be difficult to determine whether or not a genomic sequence and a cDNA sequence match up.

The following figure explains some of the symbols used in Figure 2 in more detail. The diagram shows a sequence region which contains a gene model from the Sanger Institute’s WormBase[13]. “Exons are shown as block boxes linked by introns”[13]. The portion of the sequence that codes for a protein is shaded in red. As shown above in Figure 2 and in this diagram, coordinate start/stop “spans” for sequence features are marked off as horizontal bars. These are indicated by the lettered features: [A] sequence, [B] exon, [C] CDS coding exon, and [D] introns.

[pic]

Figure 3 - Example of a sequence region which contains a gene model

Taxonomy Analysis

In order to perform taxonomy analysis, the human expert could do the following:

1. Use a text-based taxonomy tool, like GCG’s LookUp program, a European Molecular Biology Laboratory (EMBL) Sequence Retrieval System (SRS) derivative, “to sort out the desired taxonomic level from the similarity search output” or from the databases initially.

2. Homologous genes are identified.

A basic issue with taxonomy analysis is determining which homologues to compare, the paralogues or orthologues. Paralogues are duplicated genes within the same species (e.g. alpha and beta hemoglobin in human beings). Orthologues are the same gene between different species (e.g. alpha hemoglobin in humans and chimpanzees).

It is up to the biologist to decide whether or not to use both. Often, if the final goal is to ascertain molecular phylogenies, it is best only to use one or the other set depending on the question being asked – organismal taxonomy versus gene phylogeny.

A good first-pass approximation of the analysis can be executed by performing the following for a particular sequence:

1. Estimate orthologues – most similar sequences from different taxonomies

2. Estimate paralogues – most similar sequences from same taxonomy

CHAPTER 4

SIMILARITY SEARCHES

THE PROCESS OF SEARCHING FOR MATCHES AGAINST A PARTICULAR GENETIC SEQUENCE IN A BIOLOGICAL DATABASE (OR DATABASES) IS CALLED A SIMILARITY SEARCH. ONE MIGHT THINK THAT THIS IS SIMILAR TO THE PROCESS OF PERFORMING A SEARCH ON THE WORLD-WIDE-WEB WITH AN INTERNET SEARCH ENGINE LIKE GOOGLE[15]. IN SOME FASHION, AN INPUT SEQUENCE ON WHICH TO SEARCH IS ENTERED, A BIOINFORMATICS TOOL WILL THEN PERFORM A SEARCH AGAINST ALL THE REQUESTED BIOLOGICAL DATABASES, RETURNING BACK A LIST OF “HITS” – TYPICALLY BIOLOGICAL SEQUENCES AS RESULTS FROM THE SEARCH.

However, the information returned from a similarity search is very different from an Internet search, since you are dealing with sequence data and not textual data. According to Bioinformatics for Dummies, the information returned from a similarity search includes:

• The Expectation value (E-Value), which tells you how likely it is that the similarity between your sequence and a database sequence is due to chance

• The length of the segment similarity between the two sequences

• The patterns of amino acid conversion

• The number of insertion/deletions[13]

BLAST Output

Excerpts from the output of a similarity search using BLAST, a popular bioinformatics database searching algorithm[16], follows:

Part 1 – Overview of the Query

Query: = gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577 (162 letters)

Database: Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR 437,713 sequences; 134,605,311 total letters

Figure 4 - BLAST Output, Part 1 - Overview of the Query

The first item in the similarity search output is an overview of the query. A summary of the query sequence is presented in the summary. In this example, it is the protein MJ0577, which has 162 letters.

Part 2 – Descriptions of each significant alignment.

1. sp|Q57997|Y577_METJA PROTEIN MJ0577 >gi|2128018|pir||A64372... 314 2e-85

2. pdb|1MJH| Structure-Based Assignment Of The Biochemical F... 272 1e-72

3. dbj|BAA29916| (AP000003) 170aa long hypothetical protein [P... 107 6e-23

4. sp|Q57951|Y531_METJA HYPOTHETICAL PROTEIN MJ0531 >gi|212801... 91 4e-18

5. gi|2622094 (AE000872) conserved protein [Methanobacterium t... 85 4e-16

6. gi|2621993 (AE000865) conserved protein [Methanobacterium t... 81 4e-15

7. gi|2621194 (AE000803) conserved protein [Methanobacterium t... 80 7e-15

8. gi|2622163 (AE000877) conserved protein [Methanobacterium t... 79 2e-14

9. sp|P42297|YXIE_BACSU HYPOTHETICAL 15.9 KD PROTEIN IN BGLH-W... 76 1e-13

10. sp|Q50777|YB54_METTM HYPOTHETICAL 16.1 KD PROTEIN IN MTR RE... 66 2e-10

11. gi|2648791 (AE000981) conserved hypothetical protein [Archa... 65 3e-10

12. gi|2648610 (AE000970) conserved hypothetical protein [Archa... 64 5e-10

Figure 5 - BLAST Output, Part 2 - Description of each significant alignment

The overview of the query is followed by a listing of descriptions of each significant alignment. This is usually a very long listing for most query sequences, given the amount of redundancy in the database. For this query sequence, the list had to be truncated to make the figure brief. Each entry in the list contains the sequence database code, its name, the alignment score, and an expectation or E-value.

Alignment scores are not a sufficient indicator of how similar two sequences are and can vary between the different searching algorithms. It would be similar to giving a number for a person’s height without the units. Expectation scores are used to measure the sequence similarity. An E-value is “the expected number of sequences of score >= S that would be found by random chance.”[17] This value has absolutely no biological significance, it is just used for sequence similarity and comparison. “Low values of E indicate that the search result was unlikely to have been obtained by random chance, and thus is likely to bear an evolutionary relationship to the query sequence. E values of 10-3 and below are often consider indicator of statistically significant results.”[17]

Part 3 – Pairwise Alignments

sp|Q57997|Y577_METJA MJ0577 - Methanococcus jannaschii >gi|5107801|pdb|1MJH|A

Chain A, Structure-Based Assignment Of The Biochemical

Function Of Hypothetical Protein Mj0577: A Test Case Of

Structural Genomics >gi|5107802|pdb|1MJH|B Chain B,

Structure-Based Assignment Of The Biochemical Function

Of Hypothetical Protein Mj0577: A Test Case Of

Structural Genomics >gi|1591284 (U67506) conserved

hypothetical protein [Methanococcus jannaschii]

Length = 162

Score = 314 bits (796), Expect = 2e-85

Identities = 162/162 (100%), Positives = 162/162 (100%)

Query: 1 MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA 60

MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA

Sbjct: 1 MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA 60

Query: 61 GLNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEG 120

GLNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEG

Sbjct: 61 GLNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEG 120

Query: 121 VDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS 162

VDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS

Sbjct: 121 VDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS 162

pdb|1MJH| Structure-Based Assignment Of The Biochemical Function Of

Hypothetical Protein Mj0577: A Test Case Of Structural

Genomics

Length = 287

Score = 272 bits (687), Expect = 1e-72

Identities = 145/161 (90%), Positives = 145/161 (90%), Gaps = 16/161 (9%)

Query: 2 SVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAG 61

SVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIK

Sbjct: 143 SVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIK------------- 189

Query: 62 LNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGV 121

SVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGV

Sbjct: 190 ---SVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGV 246

Query: 122 DIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS 162

DIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS

Sbjct: 247 DIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS 287

Figure 6 - BLAST Output, Part 3 – Pairwise Alignments

Following the descriptions of each significant alignment are the actual pairwise alignments used to generate the individual alignment scores against the query sequence and the database entry. As with Part 2, this output has been abbreviated for clarity.

Part 4 – Statistical Summary

1. Database: Non-redundant GenBank CDS

translations+PDB+SwissProt+SPupdate+PIR

2. Posted date: Feb 26, 2000 10:08 PM

3. Number of letters in database: 142,135,17

Number of sequences in database: 461,162

4. Lambda K H

0.313 0.135 0.349

Gapped Lambda K H

0.270 0.0470 0.230

5. Matrix: BLOSUM62

6. Gap Penalties: Existence: 11, Extension: 1

7. Number of Hits to DB: 39862250

Number of Sequences: 461162

Number of extensions: 1595704

Number of successful extensions: 8417

Number of sequences better than 1.0: 86

Number of HSP's better than 1.0 without gapping: 57

Number of HSP's successfully gapped in prelim test: 29

Number of HSP's that attempted gapping in prelim test: 8293

Number of HSP's gapped (non-prelim): 121

8. length of query: 162

length of database: 142,135,178

effective HSP length: 60

effective length of query: 102

effective length of database: 114,465,458

effective search space: 11675476716

effective search space used: 11675476716

9. T: 11

A: 40

X1: 16 ( 7.4 bits)

X2: 38 (14.8 bits)

X3: 64 (24.9 bits)

S1: 42 (21.9 bits)

S2: 75 (33.6 bits)

Figure 7 - BLAST Output, Part 4 - Statistical Summary

Finally the BLAST output concludes with statistical details on the search.

Evolution

It should seem obvious why you might want to perform a similarity search against textual information on the Internet – when you do not know the exact information you are looking for, but you would like to look for information that is related-to or similar to your search term. After all, to a computer, a genetic sequence is just a string of characters. If a similarity search were performed on biological databases in the exact same manner search engines worked with documents and text, what would happen? Actually it wouldn’t work at all, due to the nature of biological data.

In biology, sequence similarity means a lot more than just mere comparison of the characters in the sequence – how many characters are the same and how many are different, and so forth. Why? Because of evolution. Over time, as a species reproduces and produces offspring, their genetic sequences can be changed by:

• Insertions

• Deletions

• Substitutions

[pic]

Figure 8 - Origin of genes having a similar sequence[1]

This might mean that say, a great-great-great-grandparent and a grandchild might be related genetically, but many things might have potentially happened to change their genetic sequence between the generations so that if one sequenced their DNA it wouldn’t be a perfect match. We would still say that they are similar, if we could determine somehow all the possible insertions, deletions, and/or substitutions that might have happened along the way. Note that in the great-great-great-great-grandparent to grandchild relationship, chances are low that there is great sequence divergence, but perhaps if this great-great-grandparent was the proverbial Eve and the grandchild was a modern-day human, it might be a different story. And definitely if we talk about a more evolutionary time scale, like millions of years, then there is enough time changes of this sort to happen.

Another note about similarity – biologists like to distinguish between quantitative and qualitative similarity. Quantitative similarity is just called similarity. Typically some sort of index is used to quantify the “likeness” of two or more genetic sequences.

Biologists have other special terms for similarity that can only be inferred, but cannot be observed or derived quantitatively – homology and homoplasy. Homology is “the presence of a similar feature due to inheritance from a common ancestor”[18]. Homoplasy is “the shared presence of a similar feature for reasons other than ancestry, [perhaps due to] convergence, reversal, [or] parallelism”[18].

Note that a human expert used “homologues” – sequences that exhibit the property of homology. Homologues are sequences that are similar because of a common ancestor, inferred after performing “taxonomy analysis”. A basic issue with taxonomy analysis is determining which homologues to compare: the paralogues or orthologues”.

Taxonomy is “the science of classifying living organisms specifically named categorized based on shared characteristics and natural relationships”. “Orthologue sequences are the same sequences that can be found in different species. Paralogue sequences are similar sequences within the same genome that have evolved through gene duplication, or the same sequence found in the same species”.

When you are trying to assess similarity after performing a biological database search, it is a very important matter to decide whether or not you care whether or not your definition of similarity involves just a single species or not. An example of this is performing a search for the alpha and beta hemoglobin in human beings vs. the alpha and beta hemoglobin in human beings and chimpanzees.

Searching in biological databases involves not only simple lookup and searching methods, but also more sophisticated methods that take into account the ancestry and possible evolution of a sequence in order to find matches in a database.

Three fundamental algorithm types

There are three fundamental types of algorithms used in most biological sequence analysis methods: dot plots, pairwise sequence alignment, and multiple sequence alignment. You’ll find them mentioned over and over again in bioinformatics literature. These types of algorithms are the fundamental building blocks for assembling more sophisticated searching algorithms, so understanding them is essential.

Dot plots

Dot plots are also known as dot matrix plots. It is a qualitative method for comparing two sequences which can take into account complex things like rearrangements or repeated sequences, since it relies on visual perception and interpretation.

A two-dimensional matrix is used in a dot plot. The first sequence is placed along the top row, the second sequence is placed along the first column. Then, for every row and column in the matrix, a dot is placed in the row and column where there is a match between the elements in each sequence. Diagonal lines, or diagonal patterns of dots, indicate matching areas between the two sequences.

Here is an example of a dot plot comparing the two strings “I LOVE DOT PLOTS” and “I HATE DOT PLOTS”:

Table 1 - Dot plot example

| |

|# |1-Letter Code |Nucleotide Name |

|1 |A |Adenine |

|2 |C |Cytosine |

|3 |G |Guanine |

|4 |T |Thymine |

Table 3 - The 20 Amino Acids and Their Official Codes

|The 20 Amino Acids and Their Official Codes |

|# |1-Letter Code |3-Letter Code Name |

|1 |A |Ala Alanine |

|2 |R |Arg Arginine |

|3 |N |Asn Asparagine |

|4 |D |Asp Aspartic acid |

|5 |C |Cys Cysteine |

|6 |Q |Gln Glutamine |

|7 |E |Glu Glutamic acid |

|8 |G |Gly Glycine |

|9 |H |His Histidine |

|10 |I |Ile Isoleucine |

|11 |L |Leu Leucine |

|12 |K |Lys Lysine |

|13 |M |Met Methionine |

|14 |F |Phe Phenylalanine |

|15 |P |Pro Proline |

|16 |S |Ser Serine |

|17 |T |Thr Threonine |

|18 |W |Trp Tryptophan |

|19 |Y |Tyr Tyrosine |

|20 |V |Val Valine |

So a lot of complex biochemistry can be abstracted away to just sequences and strings. However, one doesn’t have to take into account all the complexity of a molecule while performing sequence analysis – just the information contained within – one of the most fascinating discoveries made in the 20th century by Watson, Crick and Hershey[12].

Pairwise sequence alignment

First, let us talk about pairwise sequence alignment. In pairwise sequence alignment, only two sequences are compared at a time. Here is a statement of the problem, according to Russ B. Altman[20]:

Given:

• 2 sequences

• scoring system for evaluating match (or mismatch) of two characters

• penalty function for gaps in sequences

Produce an optimal pairing of sequences that retains the order of characters in each sequence, perhaps introducing gaps, such that the total score is optimal.

To calculate the score, one iterates through the pair of sequences comparing each pair of characters, adding up scores. In the simplest case, without gaps, the alignment score is:

[pic]match score if seq1i = seq2i / mismatch score if seq1i ≠seq2i

When gaps are introduced, then there is just one extension to the score – a gap penalty:

[pic]match score if seq1i = seq2i / mismatch score if seq1i ≠seq2i / gap penalty if seqli is gap or seq2i is gap

Some examples follow. I’m going to follow Professor’s Altman’s lead and sometimes use the English alphabet for examples rather than using DNA or protein sequences. Sequences like “TTTGACAC” or “RKVA—GMAKPNM” are confusing. So for any English examples, our alphabet will be the 26 letters in the English language, plus the ‘ ‘ (space) character and gaps will be represented by the “#” (pound) symbol.

Keep in mind that real-world examples will use DNA and protein sequences. When DNA and protein sequences are used, only the 4 letter DNA alphabet or the 20 letter protein alphabet is permissible, and gap characters are usually represented by a ‘-‘ character (hyphen). Note: the 4-letter DNA alphabet sometimes permits many algorithmic shortcuts that would not be possible if biology actually used 26 letters like the English language.

So here is an example of a pairwise sequence alignment:

[pic]

Is the alignment in the previous example optimal? Well, one way that we could quantitatively prove this is to try all the possible alignments and see which alignment has a maximum score. Notice the one thing which makes this problem hard is the introduction of gaps anywhere in the string/sequence. This means that a naïve approach where one generates a sequence with gaps in every position and of every length, is similar to a dot plot which has a computational complexity of O(N2).

This notion of gaps comes directly from biology. Recall that sequences can be similar, even if there may have been any number of insertions, deletions or transpositions between one or more nucleotides. The alignment algorithms must take this biological notion into account. Well, it is more than a “notion” – it is a biological fact.

Dynamic programming[23] is similar to the concept of “divide-and-conquer”. When a problem is too difficult or large to solve by itself, break it down into subproblems, solve those problems individually (or recursively), then combine all the individual solutions to solve the original, larger problem – simple enough.

Dynamic programming is just a bit of a twist on divide-and-conquer in that the subproblems cannot be solved independently – the “subproblems share subsubproblems”. In order to “solve every subproblem only once” dynamic programming algorithms keep track of the temporary subproblem solution in a temporary “traceback area” or table, so it won’t have to go through “the work of recomputing the answer every time the subproblem is encountered”.

Dynamic programming algorithms are typically used for optimization problems like our pairwise alignment problem, where we are trying to find the alignment(s) which produce an optimal score. The dynamic programming approach will handle multiple solutions, if there is more than one optimal score. There is a good description of dynamic programming in the Algorithm Design Manual: “[it] considers all possible decisions and always selects the one[s] that proves to be the best. By storing the consequences of all possible decisions to date and using this information in a systematic way, the total amount of work is minimized.”[24]

So how do you break this pairwise alignment problem down into subproblems?

Well, first a key observation is “the score of the best possible alignment that ends at a given

pair of positions (i..j) in the two sequences is the score of the best alignment previous to those two positions PLUS the score for alignment those two positions.”[20]

[pic]

Figure 12 - Key observation for dynamic programming

Another way to look at the problem is to go back to dot plot diagrams, and note the following:

[pic]

Figure 13 - Dynamic programming always looks back to previous diagonal

So this is exactly what we do when implementing a pairwise algorithm using dynamic programming. The overall approach is the following:

1. Fill out a matrix with the first sequence along the top row and the second sequence along the first column (just as with a dot pot). This matrix holds all the best possible scores for the alignments. You’ll also keep a second set of pointers (or a second matrix) called a “trace back” to keep track of the dependencies of the scores.

2. Find the best score in the entire matrix. The best score depends on whether or not you are doing a “global” or “local” alignment. More details will be given later.

3. Go back through the trace back matrix to reconstruct position-by-position the elements of sequences in the alignment, including gaps.

To figure out the scores in the matrix we use something called a mathematical recurrence relation, which looks something like this:

[pic]

Figure 14 - Recurrence relation

Global Alignment

If we compare the two sequences in their entirety and do not distinguish whether or not gaps are located at the ends of one or both sequences, then a global alignment is being performed. Since a global alignment is the “best alignment of the entirety of two sequences,” the best score will be in the final row or final column in the matrix, depending on the algorithm.

One relatively simple algorithm for performing a pairwise global alignment of two sequences is the Needleman and Wunsch algorithm[25]. In fact, Needleman and Wunsch were the very first to apply the concept of dynamic programming to the problem of sequence alignment.

We demonstrate an example using the Needleman and Wunsch algorithm for performing a global alignment. Note that we still use the three general steps outlined above. In this case, the two sequences are AATCG and AGGC, where the gap cost is -1, the match bonus is +1, and the mismatch score is 0.

As outlined in the previous section, first we must create and fill a matrix with the first sequence along the top row and the second sequence along the first column. In the Needleman and Wunsch algorithm, the first row and column of numbers are initially filled with multiples of the gap cost, as shown in the following figure:

[pic]

Figure 15 - Global alignment, part 1

Next, we continue to fill out the matrix using the recurrence relation, making a note of the best score in the matrix. BestScore[i,j] = BestScore[ ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download