


Data mining of electrostatic interactions

between

amino acids in coiled-coil proteins

using

the Stable Coil Algorithm

By

Ankur S. Deshmukh

A project submitted to the Faculty of Graduate School of the

University of Colorado at Colorado Springs

in partial fulfillment of the

requirements for the degree of

Master of Science

Department of Computer Science

2008

This project for the Master of Science degree

by

Ankur Deshmukh

has been approved for the

Department of Computer Science by

Approved by Date

__________________________________

Advisor: Dr. Jugal Kalita

__________________________________

Committee member: Dr. Edward Chow

__________________________________

Committee member: Dr. Robert Hodges

date__________

TABLE OF CONTENTS

Chapter 1

INTRODUCTION

Chapter 2

BACKGROUND RESEARCH

2.1 Background Research in understanding coiled-coils

2.1.1 UNDERSTANDING PROTEIN STRUCTURE

2.1.1.1. PRIMARY STRUCTURE

2.1.1.2. SECONDARY STRUCTURE

2.1.1.3. TERTIARY STRUCTURE

2.1.1.4. QUATERNARY STRUCTURE

2.1.2 COILED-COILS

2.2 Background Research in understanding coiled-coil prediction algorithms

2.2.1. COILS ALGORITHM

2.2.2. PAIRCOILS ALGORITHM

2.2.3. SOCKET ALGORITHM

2.2.4. 2ZIP ALGORITHM - IDENTIFYING LEUCINE ZIPPERS

2.2.5. STABLE INPUT ALGORITHM

Chapter 3

STABLE COIL ALGORITHM

3.1 STABLE COIL ALGORITHM: PART I

3.2 STABLE COIL ALGORITHM: PART II

3.3 Cluster patterns in coiled-coils

Chapter 4

PROJECT ARCHITECTURE

4.1 DATABASE ARCHITECTURE

4.1.1 STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE 1

4.1.1.1 PROTEIN TABLE

4.1.1.2 COILED-COIL TABLE

4.1.1.3 PROTEIN COIL TABLE

4.1.2 STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE 2

4.1.2.1 SALT RESIDUES LOOKUP TABLE

4.1.2.2 SALT BRIDGE TABLE

4.1.2.3 COILED-COIL HEPTAD TABLE

4.1.2.4 HEPTAD SALT TABLE

4.1.2.5 SCRAPE FILE TABLE

4.1.3 MATERIALIZED VIEWS

4.1.3.1 AMINO ACID OCCURRENCES

4.1.3.2 COIL LENGTH vs. CLUSTER PER COIL

4.2 Perl Code Design

4.3 WEBSITE ARCHITECTURE

4.3.1 INDEX PAGE

4.3.2 PROTEIN SEARCH PAGE

4.3.3 COILED-COIL SEARCH PAGE

4.3.4 COILED-COIL MOTIF SEARCHING

4.3.5 COIL HEPTAD AND SALT BRIDGE SEARCH

4.3.6 GENERATED REPORTS

Chapter 5

RESULTS

Chapter 6

CONCLUSION

Chapter 7

REFERENCES

APPENDIX A: CREATING MATERIALIZED VIEWS IN MYSQL

APPENDIX B: SQL QUERIES FOR CREATING MATERIALIZED VIEW

APPENDIX C: INSTALLATION AND PERFORMANCE OF THE PROJECT

LIST OF FIGURES

Figure 2-1: The structure of part of a DNA double helix. Figure obtained from [23].

Figure 2-2: A general structure of α-amino acid, with the amino group on the left and the carboxyl group on the right. Figure obtained from [1].

Figure 2-3: A condensation reaction between two α-amino acids resulting in a peptide bond. Figure obtained from [1].

Figure 2-4: Phi and Psi angles

Figure 2-5: Hydrogen bonding between amino acids in the proteins. Figure obtained from [25].

Figure 2-6: A depiction of α-helix, the most commonly occurring protein structure in coiled-coils. Figure obtained from [1].

Figure 2-7: A depiction of β-sheet, in anti-parallel and parallel formation. Figure obtained from [2].

Figure 2-8: Protein structure, from primary to quaternary. Figure obtained from [26].

Figure 2-9: Classic example of Coiled-coil GCN4 leucine zipper. Figure obtained from [1].

Figure 2-10: Positions of amino acids in the coiled-coil. Figure obtained from [3].

Figure 2-11: Cross-sectional view of a two-stranded coiled-coil. Hydrophobic and Electrostatic Interactions between two stranded α-helical coiled-coils formed by the homodimerization of 35-residue polypeptide chains. Adapted from [7].

Figure 4-1: E-R Diagram detailing the relationship between tblProtein and tblCoiledCoil

Figure 4-2: Structure of Protein Table (tblProtein)

Figure 4-3: Structure of Coiled-coil Table (tblCoiledCoil)

Figure 4-4: Structure of Protein Coil Table (tblProteinCoil)

Figure 4-5: E-R Diagram detailing the relationship between tblSaltBridge and tblSplitHeptadCoils

Figure 4-6: Structure of Salt Residues Lookup Table (tblSaltResiduesLookup)

Figure 4-7: Structure of Salt Bridge Table (tblSaltBridge)

Figure 4-8: Structure of Coiled-coil Heptad Table (tblSplitHeptadCoil)

Figure 4-9: Structure of Heptad Salt Bridge Table (tblHeptadSalt)

Figure 4-10: Structure of Scrape File table (tblScrapeFile)

Figure 4-11: Materialized view of Amino Acid Occurrences (matview_AminoAcidOcurrences)

Figure 4-12: Materialized view of Coiled-coil Length vs. the Cluster Count (matview_CoilClusterCount)

Figure 4-13: Process Flow Diagram for the Stable Coil Algorithm

Figure 4-14: Index Page of the Stable Coil website

Figure 4-15: Protein Related Search Page

Figure 4-16: Coiled-coil Related Search Page

Figure 4-17: Coiled-coil Motif Search web page

Figure 4-18: Coiled Heptad and Salt Bridge Search Page

Figure 5-1: Coiled-coils Count vs. Coiled-coil Length

Figure 5-2: Location of occurrence of amino acid within the coiled-coil when the amino acid is at heptad offset a.

Figure 5-3: Location of occurrence of amino acid within the coiled-coil when the amino acid is at heptad offset d.

Figure 5-4: Normalized Value of Destabilizing Clusters in Coiled-Coils of particular length. Results obtained by dividing the total number of coiled-coils with de-clusters by the total number of de-clusters.

Figure 5-5: Normalized Value of Stabilizing Clusters in Coiled-Coils of particular length. Results obtained by dividing the total number of coiled-coils with clusters by the total number of clusters.

Figure 5-6: Distribution of Destabilizing Cluster of Length 3 with respect to the Coiled-Coil length

Figure 5-7: Distribution of Destabilizing Cluster of Length 4 with respect to the Coiled-Coil length

Figure 5-8: Distribution of Destabilizing Cluster of Length 5 with respect to the Coiled-Coil length

Figure 5-9: Distribution of Destabilizing Cluster of Length 6 with respect to the Coiled-Coil length

Figure 5-10: Distribution of Destabilizing Cluster of Length 7+ with respect to the Coiled-Coil length

Figure 5-11: Distribution of Stabilizing Cluster of Length 3 with respect to the Coiled-Coil length

Figure 5-12: Distribution of Stabilizing Cluster of Length 4 with respect to the Coiled-Coil length

Figure 5-13: Distribution of Stabilizing Cluster of Length 5 with respect to the Coiled-Coil length

Figure 5-14: Distribution of Stabilizing Cluster of Length 6 with respect to the Coiled-Coil length

Figure 5-15: Distribution of Stabilizing Cluster of Length 7+ with respect to the Coiled-Coil length

Figure 5-16: Relationship of amino acids in offset a to an i to i + 3 salt bridge

Figure 5-17: Relationship of amino acids in offset a to an i to i + 4 salt bridge

Figure 5-18: Relationship of amino acids in offset a to an i to i’ + 5 salt bridge

Figure 5-19: Relationship of amino acids in offset d to an i to i + 3 salt bridge

Figure 5-20: Relationship of amino acids in offset d to an i to i + 4 salt bridge

Figure 5-21: Relationship of amino acids in offset d to an i to i’ + 5 salt bridge

LIST OF TABLES

Table 2-1: Table of standard amino acid abbreviations and side chain properties

Table 3-1: Helical Propensity and Stability Values of the 20 standard amino acids at various positions in the heptad

Table 3-2: Coiled-coil Sequence starting at offset a

Table 3-3: Coiled-coil Sequence starting at offset b

Table 3-4: An aggregation of stability values 42 amino acids at a time

Table 3-5: Determining the presence of a coiled-coil in the protein sequence

Table 3-6: Determining the presence of a cluster (stabilizing or de-stabilizing) in the coiled-coil sequence

Table 4-1: List of Salt Bridges which provide i to i + 3, i to i + 4 and i to i’ + 5 electrostatic interactions

Table 4-2: Search Parameters used on the Protein Related Search Page

Table 4-3: Search parameters used on the Coiled-coil Related Search Page

Table 4-4: Search parameters used on the Coiled-coil Motif Search Page

Table 4-5: Search parameters used on the Coil Heptad and Salt Bridge Search Page

Table 5-1: Top 10 amino acid pairs occurring in heptad offsets a and d which form the hydrophobic core

Table 5-2: Top 10 amino acid pairs occurring in heptad offsets d and e

Table 5-3: Top 10 amino acid pairs occurring in heptad offsets g and e usually associated with electrostatic attraction i to i’ + 5

Table 5-4: Top 10 amino acid pairs occurring in heptad offsets ‘e’ and g usually associated with electrostatic attraction i to i’ + 2

Table 5-5: Top 10 amino acid pairs occurring in heptad offsets g and a

Table 5-6: Top 30 Amino Acid Pair Occurrences in Coiled-coils

Table 5-7: Top 30 frequently occurring amino acids in the Stable Coil Database

Chapter 1. INTRODUCTION

The sequencing of the human genome, as well as the genomes of many other species, has introduced a wide array of research fields within the discipline of molecular biology. One of the fastest growing fields is Proteomics, the study of proteins, protein structures, and the functions these proteins perform. Various research facilities are dedicated to this field of study, including the Peptide Chemistry Lab of Dr. Robert Hodges at the University of Colorado Health Sciences Center (UCHSC). The primary focus of this group of researchers is to understand the factors that affect the stability of proteins in general and the coiled-coil oligomerization domain in particular. These factors include, but are not limited to, the hydrophobic and hydrophilic interactions and the intrachain and interchain electrostatic interactions between the amino acids present in these coiled-coils. The ability to determine coiled-coil stability will greatly facilitate the prediction of coiled-coils in protein structures and will advance protein design. Because coiled-coils are the most commonly occurring oligomerization domain in nature, understanding the interactions within them can advance the study of proteomics as a whole.

The protein data available today is not only voluminous but also complex. In order to interpret results in a timely and inexpensive manner, it is necessary to create prediction algorithms, which act as precursors to the lab experiments. This project uses such an algorithm to explore two of the primary areas of interest being studied at UCHSC. First, this project uses an established prediction algorithm, the Stable Coil Algorithm, to determine the existence of coiled-coils programmatically, eliminating the need for any human intervention. The second part of the project revolves around finding out what kinds of interactions occur within coiled-coils. Researchers have proposed that hydrophobic and electrostatic interactions are the primary forces that contribute to the stability of the coiled-coil; hence, it is necessary to gather as much information as possible about these forces. Important areas of study include efforts to determine how the location of an amino acid in the heptad sequence affects coiled-coil stability and which amino acids hinder or aid that stability. In order to accomplish these goals, this project presents researchers with tools to efficiently study coiled-coil stability. These tools revolve around a revamped Stable Coil database.

The first rendition of the Stable Coil database used the Stable Coil Algorithm to predict the presence of coiled-coils [4]. However, the database had become corrupted. Furthermore, the database did not recognize updates to the raw data available on the ExPASy[1] server. These issues, combined with poor query performance and the absence of error logging, made it necessary for this project to recreate the Stable Coil database as well as the Perl programs involved in data collection. The resulting database allows researchers to designate new sources of raw data for collection; the associated Perl programs then process the new data automatically. In addition to redesigning the original database, this project provides additional tables to facilitate research into the various factors affecting coiled-coil stability. This database is freely available via the Stable Coil web interface at .

This website provides users with three basic search functionalities that can help them understand the various aspects of coiled-coil formations within proteins and the electrostatic interactions within those coiled-coils. A fourth functionality allows for complex motif searching within coiled-coils, thus providing users with information on the types and frequency of amino acid residues occurring within coiled-coils. In addition, the website presents users with fourteen unique reports, each of which provides a different insight into the realm of coiled-coils. These reports return results ranging from cluster distribution across coiled-coil lengths to lists of amino acids frequently found in and around coiled-coils. We hope this project will provide researchers with an efficient way to determine the presence of coiled-coils in proteins and act as a learning tool to help users understand the varying complexities of coiled-coil structures in proteins.

Chapter 2. BACKGROUND RESEARCH

2.1 Background Research in understanding coiled-coils

Deoxyribonucleic acid (DNA) contains the genetic instructions used in the development and functioning of all known living organisms. As the main role of DNA is long-term storage of information, it is often likened to a blueprint repository that is used to construct cell components such as proteins and RNA molecules. DNA is a double helix consisting of two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds. These two strands run in opposite directions to each other and are therefore called anti-parallel. Attached to each sugar is one of the four molecules known as the bases, which encode the genetic information. This information is interpreted using the genetic code, which specifies the sequence of amino acids within a protein.

[pic]

Figure 2-1: The structure of part of a DNA double helix. Figure obtained from [23].

The process by which genetic information is decoded from the DNA and converted to a protein is known as Protein Biosynthesis. Protein Biosynthesis is a multi-step process consisting of two major steps: Transcription and Translation.

◆ Transcription is the process of synthesizing RNA under the direction of DNA. Both nucleic acid sequences use the same language, and the information is simply transcribed or copied from one molecule to the other. The DNA sequence is enzymatically copied by RNA polymerase to produce a complementary nucleotide RNA strand, called messenger RNA (mRNA), which then carries a genetic message from the DNA to the protein-synthesizing machinery of the cell.

◆ During translation, the mRNA sequence is used as a guide to synthesize a chain of amino acids into a protein sequence. During this process, the mRNA is decoded using specific genetic instructions. A transfer RNA (tRNA), which is a small RNA, then transfers a specific amino acid to the growing polypeptide chain which is being catalyzed at the ribosomal site of protein synthesis.

◆ During and after protein synthesis, amino acid chains often fold to assume the tertiary and quaternary structures commonly associated with proteins. This process is known as protein folding. Many proteins undergo post-translational modifications, which extend the range of a protein’s functions by attaching it to other biochemical functional groups or by formation of disulfide bridges.
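The transcription and translation steps described above can be sketched in a few lines of Python. This is only an illustration: the template strand is a hypothetical example, the function names are assumptions, and the codon table is deliberately limited to a handful of entries from the standard genetic code rather than all 64 codons.

```python
# Base-pairing rules for copying a DNA template strand into mRNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Partial codon table (illustrative subset of the standard genetic code).
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "CUG": "L",  # leucine
    "AAA": "K",  # lysine
    "GAA": "E",  # glutamic acid
    "UAA": "*",  # stop codon
}

def transcribe(template_3_to_5):
    """Return the mRNA (5'->3') complementary to a template strand read 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())

def translate(mrna):
    """Read the mRNA codon by codon until a stop codon or the end of the message."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)

template = "TACGACTTTCTTATT"  # hypothetical template strand, written 3'->5'
mrna = transcribe(template)
protein = translate(mrna)
print(mrna, protein)
```

A codon missing from the partial table raises a `KeyError`, which is acceptable for a sketch but would need the full table in practice.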

The various types of protein structures and sub-structures formed play important roles in how a protein will function. In order to understand the functions a protein performs at a molecular level, it is necessary to understand the three dimensional protein structure. This constitutes the field of Proteomics. Researchers employ techniques such as X-ray crystallography or NMR spectroscopy to determine the structure of proteins.

2.1.1 UNDERSTANDING PROTEIN STRUCTURE

Proteins are an important class of macromolecules present in all biological organisms. All proteins are polymers of the 20 standard α-amino acids listed in Table 2-1. Proteins fold into one or more specific spatial conformations, driven by a number of non-covalent interactions such as hydrogen bonding, ionic interactions, Van der Waals forces and hydrophobic packing, so as to be able to perform their biological functions. In order to decipher protein folding, researchers require a fundamental understanding of the stability contributions of non-covalent stabilizing and destabilizing interactions. These interactions not only guide the initial hydrophobic collapse of a protein in an aqueous environment, but also provide the basis on which a protein assumes its overall structure. In biochemistry, protein structure is classified into four levels of hierarchy. These levels range from a simple linear arrangement of amino acids to complex aggregate structures, and are described in detail below:

2.1.1.1. PRIMARY STRUCTURE

A protein’s primary structure is the linear sequence of amino acids. Most protein databases represent a protein in this linear sequence, creating a list of the amino acids which constitute each protein. The sequence of the amino acids is unique to the protein and defines the structure and the function of the protein. In its elemental form, each amino acid in a protein is a four-part molecule starting with an amine group (-NH2), also known as the N-terminus, and ending with a carboxyl group (-COOH), known as the C-terminus. In between these termini lies the α-Carbon atom (Cα). The Cα is bonded to an R-group and a hydrogen atom. Counting of residues always starts at the N-terminal end, which is the end where the amino group is not involved in a peptide bond.

[pic]

Figure 2-2: A general structure of α-amino acid, with the amino group on the left and the carboxyl group on the right. Figure obtained from [1].

Two amino acids are joined by a peptide bond in a condensation reaction. By repeating this process over multiple amino acids, long chains can be generated. This reaction is catalyzed by ribosomes in the translation process. During the formation of the peptide bond, the OH of the carboxyl group of the first amino acid combines with an H of the amine group of the second amino acid to form water. Once the bond is formed, the two joined amino acids have only one free amine group or N-terminus and one free carboxyl group or C-terminus, as shown in Figure 2-3.

[pic]

Figure 2-3: A condensation reaction between two α-amino acids resulting in a peptide bond. Figure obtained from [1].

The three dimensional structure of the protein is controlled by the dihedral angles around the two backbone bonds of each Cα atom. The phi angle (φ) is the angle of rotation about the bond between the backbone nitrogen of the amine group and the Cα atom. The psi angle (ψ) is the angle of rotation about the bond between the Cα atom and the carbonyl carbon of the same amino acid. Figure 2-4 illustrates the phi and the psi angles.

[pic]

Figure 2-4: Phi and Psi angles
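A dihedral angle such as phi or psi is computed from the coordinates of four consecutive backbone atoms. For phi these are the carbonyl carbon of the previous residue, N, Cα, and the carbonyl carbon of the current residue; for psi they are N, Cα, the carbonyl carbon, and the N of the next residue. The following is a minimal sketch using only the Python standard library. The function name and the test coordinates are illustrative assumptions, not real protein data.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond (hypothetical helper)."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)  # central bond
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    b1_length = math.sqrt(_dot(b1, b1))
    y = _dot(_cross(n1, n2), b1) / b1_length
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Illustrative coordinates: four points with the last one rotated a quarter turn.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

With this formulation a cis arrangement of the four atoms gives 0 degrees and a trans arrangement gives a magnitude of 180 degrees.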

The R group in the amino acid is called the side chain. A side chain can vary from a single hydrogen atom in glycine through a methyl group in alanine to a large heterocyclic group in tryptophan [1]. The type and number of side chains in a protein influence its structure. The side chain determines whether the amino acid will be hydrophobic or hydrophilic, polar or non-polar. Table 2-1 lists the 20 standard amino acids and their relative polarity.

|Amino Acid Name |One Letter Code |Three Letter Code |Category |

|Alanine |A |Ala |Non-polar Amino Acids (hydrophobic) |

|Cysteine |C |Cys |Polar Amino Acids (hydrophilic) |

|Aspartic Acid |D |Asp |Electrically Charged (negative and hydrophilic) |

|Glutamic Acid |E |Glu |Electrically Charged (negative and hydrophilic) |

|Phenylalanine |F |Phe |Non-polar Amino Acids (hydrophobic) |

|Glycine |G |Gly |Non-polar Amino Acids (hydrophobic) |

|Histidine |H |His |Electrically Charged (positive and hydrophilic) |

|Isoleucine |I |Ile |Non-polar Amino Acids (hydrophobic) |

|Lysine |K |Lys |Electrically Charged (positive and hydrophilic) |

|Leucine |L |Leu |Non-polar Amino Acids (hydrophobic) |

|Methionine |M |Met |Non-polar Amino Acids (hydrophobic) |

|Asparagine |N |Asn |Polar Amino Acids (hydrophilic) |

|Proline |P |Pro |Non-polar Amino Acids (hydrophobic) |

|Glutamine |Q |Gln |Polar Amino Acids (hydrophilic) |

|Arginine |R |Arg |Electrically Charged (positive and hydrophilic) |

|Serine |S |Ser |Polar Amino Acids (hydrophilic) |

|Threonine |T |Thr |Polar Amino Acids (hydrophilic) |

|Valine |V |Val |Non-polar Amino Acids (hydrophobic) |

|Tryptophan |W |Trp |Non-polar Amino Acids (hydrophobic) |

|Tyrosine |Y |Tyr |Polar Amino Acids (hydrophilic) |

|Unknown |X |UNK |Unknown Protein |

Table 2-1: Table of standard amino acid abbreviations and side chain properties
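For programmatic work, Table 2-1 can be expressed as a simple lookup from one-letter code to side chain category. The sketch below assumes short category labels ("non-polar", "polar", "negative", "positive") and hypothetical function names; the groupings themselves are taken directly from the table.

```python
from collections import Counter

# One-letter codes grouped by the categories of Table 2-1.
CODES_BY_CATEGORY = {
    "non-polar": "AFGILMPVW",
    "polar": "CNQSTY",
    "negative": "DE",
    "positive": "HKR",
}

# Invert the grouping into a per-residue lookup.
CATEGORY_BY_CODE = {
    code: category
    for category, codes in CODES_BY_CATEGORY.items()
    for code in codes
}

def classify(sequence):
    """Return the Table 2-1 category of each residue; 'unknown' for X or other codes."""
    return [CATEGORY_BY_CODE.get(residue, "unknown") for residue in sequence.upper()]

def composition(sequence):
    """Count how many residues of a sequence fall into each category."""
    return Counter(classify(sequence))

print(composition("MKQLEDK"))  # hypothetical sequence fragment
```

Any code outside the 20 standard amino acids, including the X placeholder in the table, is reported as "unknown" rather than raising an error.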

Hydrophobic amino acids tend to be non-polar and do not form hydrogen bonds with water or with ionic groups. Water is electrically polarized and is therefore able to form hydrogen bonds with itself. Because hydrophobic amino acids are not electrically polarized, water excludes them in favor of bonding with itself. This is true for all polar solvents, and it is this effect that causes the hydrophobic interaction. To prevent destabilization, hydrophobic amino acids tend to be buried in the center of the protein, away from the surrounding aqueous solution. For similar reasons, hydrophilic amino acids occur on the protein surface. The hydrophilic residues can be polar or electrically charged. The charge is positive if the side chain is basic and negative if the side chain is acidic. The attractions between oppositely charged side chains are also known as ionic bonds. The distribution of hydrophobic and hydrophilic amino acids in the protein determines the tertiary structure of the protein, and their location on the outer surface of the protein influences the quaternary structure by reducing the collective surface area and therefore the amount of water that can influence the protein structure.

Besides these amino acid characteristics, Van der Waals forces and hydrogen bonding also help determine the protein structure. Van der Waals forces are the attractive and repulsive forces between atoms, molecules, and surfaces. They differ from covalent or ionic bonds in that they are caused by the fluctuating polarizations of nearby particles. Hydrogen bonding is another intermolecular force that affects protein structure, characterized by the presence of a hydrogen atom in the intermolecular bond. This hydrogen is covalently bound to one molecule, the proton donor, and is attracted to an electronegative atom of the other, the proton acceptor. Figure 2-5 depicts hydrogen bond formation in a water dimer.

[pic]

Figure 2-5: Hydrogen bonding between amino acids in the proteins. Figure obtained from [25]

In this figure, the water molecule on the right is the proton donor, while the water molecule on the left is the proton acceptor. The donated hydrogen atom is covalently bonded to an electronegative atom, oxygen in this case. The result of this bonding is a dimer held together by relatively large dipole-dipole forces. In a protein, hydrogen bonding interactions contribute to the secondary structure.

2.1.1.2. SECONDARY STRUCTURE

Due to the interactions between the chemical groups in amino acids, mediated by hydrogen bonds, a few characteristic patterns occur within folded proteins. These recurring shapes describe the secondary structure of a protein. Their repeated occurrence renders a protein stable. In 1983, Kabsch and Sander [8] compiled a listing of the secondary structures found in proteins with a known 3D structure. The DSSP (Dictionary of Protein Secondary Structure) code they proposed is frequently used to describe secondary protein structures with single letter codes. The most commonly occurring secondary structures in a protein include, but are not limited to, the α-helix, the β-sheet, and the β-turn.

The α-helix is the most commonly occurring secondary structure in a protein. It is a right-handed coil formation resembling a spring, in which every backbone N-H group donates a hydrogen bond to the backbone C=O group of the amino acid four residues behind it (i + 4 to i hydrogen bonding). Each amino acid corresponds to a 100° turn in the helix. This means that the α-helix has 3.6 residues per turn. For example, a helix 36 amino acids long would form 10 turns. A coiled α-helix exhibits tight packing of bonds, leaving almost no free space in the helix. The amino acid side chains are on the outside of the helix, pointing roughly downwards.
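The helix arithmetic above can be checked with a short sketch. The 100° rotation per residue and the i + 4 to i hydrogen bonding pattern come from the text; the function names and the 1-based residue numbering are assumptions made for illustration.

```python
DEGREES_PER_RESIDUE = 100.0                      # rotation contributed by each residue
RESIDUES_PER_TURN = 360.0 / DEGREES_PER_RESIDUE  # 3.6 residues per full turn

def helix_turns(n_residues):
    """Number of turns formed by an ideal alpha-helix of the given length."""
    return n_residues / RESIDUES_PER_TURN

def backbone_hydrogen_bonds(n_residues):
    """(donor, acceptor) residue numbers, 1-based: the N-H of residue i + 4
    donates a hydrogen bond to the C=O of residue i."""
    return [(i + 4, i) for i in range(1, n_residues - 3)]

print(helix_turns(36))
print(backbone_hydrogen_bonds(6))
```

A helix shorter than five residues yields no i + 4 to i hydrogen bonds under this pattern, so the second function returns an empty list.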

[pic]

Figure 2-6: A depiction of α-helix, the most commonly occurring protein structure in coiled-coils. Figure obtained from [1].

The β-sheet is another form of protein secondary structure, formed by the assembly of β-strands connected laterally by three or more hydrogen bonds into a generally twisted, pleated sheet [2]. In other words, a β-sheet is an extended conformation of amino acids in a zig-zag manner. In a β-sheet, hydrogen bonding occurs between C=O and N-H groups of two or more β-strands. This is in contrast to the α-helix, where all hydrogen bonds involve the same element of the secondary structure. These hydrogen bonds can occur among adjacent β-strands in anti-parallel, parallel, or mixed arrangements. In an anti-parallel arrangement, the successive β-strands run in opposite directions; thus, the C-terminus of one β-strand is adjacent to the N-terminus of the next β-strand. In a parallel arrangement, all the N-termini of these strands are oriented in the same direction. An individual strand may also exhibit a mixed hydrogen bonding pattern, with a parallel strand on one side and an anti-parallel strand on the other. These structures are depicted in Figure 2-7.

[pic][pic]

Figure 2-7: A depiction of β-sheet, in anti-parallel and parallel formation. Figure obtained from [2].

The third type of secondary structure, the β-turn, is characterized by hydrogen bonds in which the acceptor, meaning the main chain carbonyl oxygen (C=O), and the donor, meaning the main chain amine group (N-H), are separated by three residues (i to i + 3 hydrogen bonding). Turns are important secondary structures in proteins and occur abundantly on the surface of the protein molecule. They are distinguished by the hydrogen bonding in the i, i + 1, i + 2, and i + 3 residues. Helical regions are excluded from this definition, while turns between β-strands form a special class of turns known as the hairpin [10]. A β-hairpin connects two hydrogen-bonded anti-parallel β-strands. Turns can also connect two regular secondary structure elements that do not interact with each other; these are known as diverging turns.

Amino acids vary in their ability to form secondary structures. Proline and Glycine, which are known as helix breakers, have unusual conformational properties and are commonly found in turns. The amino acids that most commonly adopt helical conformations include Methionine, Alanine, Leucine, Glutamate, and Lysine. The larger amino acids, in contrast, prefer to adopt a β-sheet conformation.

2.1.1.3. TERTIARY STRUCTURE

The tertiary structure is the three dimensional arrangement of a protein, which arises from the variety of side chains carried by its amino acids. The tertiary structure of a protein is largely determined by the sequence of amino acids in the protein and the interactions that occur among their side chains. As a result of these side chain interactions, the protein may have a number of folds, bends, and loops, thus assuming its final three dimensional structure.

There are four types of side chain bonding interactions: disulfide bonds, hydrogen bonding, salt bridges, and non-polar hydrophobic bonding. Disulfide bonds are the only covalent bonds and are formed during oxidation of the sulfhydryl groups on Cysteine (C). The hydrogen bonding between side chains occurs mainly between two alcohols, between an alcohol and an acid, or between two acids. Salt bridges are ionic interactions, resulting from the neutralization of an acid and an amine on the side chains. Any combination of the various acid and amine groups in the side chains will have this interaction. The salt bridges contribute towards the strengthening of the helix. The hydrophobic interactions are the most important factors contributing to the stability of the protein. As discussed in the section on primary structure, these interactions follow the simple solubility rule that like dissolves like. The hydrophobic components will repel water or any polar solvent, in turn associating strongly with other hydrophobic elements. In many cases this causes the hydrophobic side chains to be buried in the center of the protein and the hydrophilic residues to be exposed on the surface of the protein.

2.1.1.4. QUATERNARY STRUCTURE

Many large proteins consist of multiple polypeptide chains, sometimes known as protein subunits. In addition to the tertiary structure of these subunits, these large proteins also possess a quaternary structure. These large proteins are, in essence, polymers of their subunits. Common examples of proteins with quaternary structure are hemoglobin and DNA polymerase. Changes in the quaternary structure can occur through conformational changes in the underlying subunits or through changes in the orientation of the subunits relative to each other. The forces that affect the tertiary structure of the protein also affect the quaternary structure. The different protein structures discussed above are pictorially represented in Figure 2-8.

[pic]

Figure 2-8: Protein structure, from primary to quaternary. Figure obtained from [26].

Now that we understand the different levels of hierarchy in the structural formation of a protein, we can better understand coiled-coils, which form the basis of this Master’s Project. The Stable Coil database is built for predicting the α-helical regions that are able to form α-helical coiled-coil motifs, and here we take a deeper look into coiled-coils and the importance of studying them.

2.1.2 COILED-COILS

Many proteins that contain coiled-coils are involved in important biological functions. Kinesin is a protein which transports cellular components within cells, while myosin is a fundamental protein used in muscle contraction. Both of these proteins perform these functions due to the ability of the coiled-coil to uncoil, allowing the unattached heads to move.

A coiled-coil is a structural motif in which two or more α-helices are coiled together like strands of a rope. α-helical structures are abundant in proteins. This project focuses on what is perhaps one of the most commonly occurring dimerization motifs in nature, the two-stranded α-helical coiled-coil. This structure consists of two amphipathic, right-handed α-helices that adopt a left-handed supercoil analogous to a two-stranded rope, where the non-polar face of the first α-helix is continually adjacent to that of the other helix [16], as shown in Figure 2-9.

[pic]

Figure 2-9: Classic example of Coiled-coil GCN4 leucine zipper. Figure obtained from [1].

The two-stranded coiled-coil is an ideal model for coiled-coil studies because of its rod-like structure, which makes protein folding a one dimensional problem, thereby removing much of the complexity found in globular proteins. Coiled-coils are characterized by hydrophobic amino acids spaced alternately three and four residues apart within their sequence. They are distinguished by a heptad repeat denoted abcdefg, in which positions a and d are occupied by the hydrophobic amino acids responsible for the formation and stability of the coiled-coil. Shown below is an example of a coiled-coil alongside its heptad repeat.

[pic]

Figure 2-10: Positions of amino acids in the coiled-coil. Figure obtained from [3].

The hydrophobic residues occur at positions a, d, a’, and d’ and are indicated in red. These patterns repeat every 3.5 residues in the side chain; thus it takes less than two full heptads for the coiled-coil to turn twice, as indicated in Figure 2-10. The hydrophobic residues are buried in the center, away from the surrounding aqueous solutions, while the hydrophilic residues are exposed to the surface. These hydrophobic interactions provide stability to the coiled-coil by aiding in inter- and intra-helical interactions.

Over the years, various researchers, including [7] [12] [18], have shown that not only the hydrophobic heptad repeats but also the inter-helical and intra-helical electrostatic interactions between amino acids contribute to the formation and stability of the coiled-coil structure. A schematic representation of a two-stranded α-helical coiled-coil, with all the hydrophobic and electrostatic interactions, is shown below in Figure 2-11.

[pic]

Figure 2-11: Cross-sectional view of a two-stranded coiled-coil. Hydrophobic and Electrostatic Interactions between two stranded α-helical coiled-coils formed by the homodimerization of 35-residue polypeptide chains. Adapted from [7].

Figure 2-11 uses the letters a to g and a’ to g’ to designate the positions of the heptad repeat. As discussed earlier, the hydrophobic residues interact at a and a’ and at d and d’, as indicated by open arrows. Electrostatic interactions can occur between b and e (b’ and e’), which are intrachain i to i + 3 interactions; between e and b (e’ and b’), which are intrachain i to i + 4 interactions (dashed arrows); or between g and e’ (g’ and e), which are interchain i to i’ + 5 interactions (solid arrows) [7]. These interactions can be attractions between the amino acid residues (salt residues) at these positions, or they can be repulsions, which respectively add to or subtract from the overall stability of the coiled-coil.
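The positional arithmetic behind these pairings can be checked with a short sketch. The helper name `partner_position` is hypothetical and is not part of the project's code; it simply walks forward along the heptad abcdefg by a given number of residues, wrapping around after g.

```python
HEPTAD = "abcdefg"

def partner_position(position, step):
    """Return the heptad position reached `step` residues after `position`."""
    return HEPTAD[(HEPTAD.index(position) + step) % 7]

# i to i + 3 (intrachain): b pairs with e
print(partner_position("b", 3))
# i to i + 4 (intrachain): e pairs with b of the next heptad
print(partner_position("e", 4))
# i to i' + 5 (interchain): g pairs with e' on the opposite helix
print(partner_position("g", 5))
```

The modulo operation is what allows an interaction to start in one heptad and end in the next.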

Coiled-coil prediction is an important goal pursued in bioinformatics and theoretical chemistry. Its aim is the prediction of the three-dimensional structure of proteins from their amino acid sequences, sometimes including additional relevant information such as the structures of related proteins. In other words, the goal is to predict a protein's tertiary/quaternary structure from its primary structure. A number of algorithms have been created to predict coiled-coils. Most of the current algorithms created use a statistical approach, in which they compare newly discovered proteins to existing ones and determine the probability of a coiled-coil being present. The University of Colorado at Colorado Springs in conjunction with the Department of Biochemistry and Molecular Genetics at University of Colorado Health Sciences Center has built a protein database that depicts proteins that contain coiled-coil motifs and their stability clusters as determined by the Stable Coil Algorithm [3][4]. This algorithm is based on the stability of the structure determined by the amino acids present in the protein sequence and the structural position of those amino acids. The algorithms mentioned above are described in detail in the following section.

2.2 Background Research in understanding coiled-coil prediction algorithms

From previous discussions, it can be concluded that it is not only essential to determine the existence of coiled-coils in proteins; it is also essential to determine how the stability of the coiled-coil can be affected by certain amino acid residues at certain positions along the coiled-coil. Traditionally, the three dimensional structure of a coiled-coil has been determined by X-ray crystallography and NMR spectroscopy. Not only are these methods very expensive but they can also be very time consuming. Furthermore, it is highly improbable for a single group of researchers to apply these methods to all the naturally occurring proteins known to man. A better way to approach this problem is through the use of predictive algorithms, which provide the researchers with answers to questions like: Which proteins are more likely to contain coiled-coils? Which proteins are more/less stable due to the presence/absence of a hydrophobic residue or an electrostatic attraction/repulsion, etc.?

Protein structure analysis was thus born out of the desire to determine protein characteristics without conducting laboratory experiments or using crystallography. Processes based on protein statistics and past experiments were generalized to create methods and algorithms that provide insights into a given protein’s structure and/or stability. This chapter presents some of the predictive algorithms created to catalog coiled-coils present in proteins.

2.2.1. COILS ALGORITHM

The COILS Algorithm [21], an enhanced version of the Lupas Algorithm [20], was developed by Andrei Lupas in 1996. COILS is a program that compares a given amino acid sequence to a database of known parallel two-stranded coiled-coils. The comparison yields a similarity score, which is then compared with the distribution of scores in coiled-coil proteins. From this, the program calculates the probability that the sequence will adopt a coiled-coil conformation.

The similarity scores are calculated by comparing against two different matrices:

◆ MTK is a matrix derived from the sequences of myosins, tropomyosins and keratins (intermediate filaments type I and II).

◆ MTIDK is a new matrix derived from myosins, paramyosins, tropomyosins, intermediate filaments types I - V, desmosomal proteins, and kinesins, calculated by weighing the residue frequencies of different protein families.

Although using the MTIDK matrix results in a 20-30% drop in the generation of false-positives in the prediction algorithm, the results are still biased towards hydrophobic, hydrophilic charged residues. The program produces a fair amount of statistical noise as the window width decreases.

2.2.2. PAIRCOILS ALGORITHM

PAIRCOILS [15] classifies coiled-coils using a statistical approach and utilizes a matrix similar to the MTIDK matrix used in the COILS Algorithm. The matrix used in PAIRCOILS contains all known coiled-coil sequences, extracted from the GENpept database [13]. Instead of comparing the entire sequence to the database, as is the case with COILS, PAIRCOILS determines conditional probabilities that two amino acids are found in any two heptad positions. These frequencies are then normalized and used to determine the probability that a certain pair of amino acids appears at a given heptad repeat. The probability cut off determines how stringently the data will be scrutinized in detecting the existence of a coiled-coil domain.

Although the PAIRCOILS Algorithm successfully predicts coiled-coils and reduces the number of false positives by using a scoring method based on “pairwise probabilities”, it is marred by the same problem as COILS: a large amount of statistical noise is present in the data as the probability cut off increases.

The PAIRCOILS Algorithm was extended to become what is known as the MULTICOIL Algorithm [22], in order to identify three-stranded coiled-coils as well. The accuracy of the statistical results was limited, as the MULTICOIL Algorithm was run against a small subset of proteins.

2.2.3. SOCKET ALGORITHM

The SOCKET [19] program finds the Knobs-into-Holes mode of packing between alpha-helices which is characteristic of coiled-coils. It unambiguously defines the beginning and end of coiled-coil motifs in protein structures and assigns a heptad register to the sequence.

Specifically, the purposes of SOCKET are:

◆ To objectively and unambiguously define the location of a coiled-coil motif in a protein structure, so that its sequence can be used to test new coiled-coil prediction algorithms and benchmark existing ones.

◆ To automatically collect statistics on frequencies of amino acids at each of the heptad positions (abcdefg) of the sequence/structure motif. Such data are useful for training computer programs that predict coiled-coils from primary structure, and for providing insights into new design rules.

◆ To highlight unusual assemblies of alpha-helices that go beyond the traditional coiled-coil. Again, it is hoped that design principles founded on knobs-into-holes packing between alpha-helices will enable us to create novel and useful protein assemblies.

2.2.4. 2ZIP ALGORITHM - IDENTIFYING LEUCINE ZIPPERS

In order to implement the 2ZIP Algorithm, the TRESPASSER Algorithm is first used to extract from the SWISS-PROT [10] database only those sequences that contain annotated leucine zippers, leucine-like zippers, and non-leucine zippers. TRESPASSER is the algorithm of choice for this extraction, as it has been reported to predict leucine zippers with high reliability. Once this extraction is complete, the 2ZIP Algorithm [6] is used to distinguish the two general classes of leucine zipper, strict and relaxed. The strict zipper is distinguished by the occurrence of at least five leucine residues at four heptad repeats. A relaxed zipper occurs where, in any of the five positions, Leu is replaced with Met, Val or Ile.
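The strict and relaxed definitions can be illustrated with a pair of regular expressions. This is only a sketch of the two definitions as worded above, not the published 2ZIP implementation. It assumes that "five leucine residues at four heptad repeats" means five positions spaced exactly seven residues apart, and that a relaxed zipper allows Met, Val or Ile at any of those five positions. The function name `classify_zipper` is hypothetical.

```python
import re

# Five Leu residues, each separated from the next by six arbitrary residues.
STRICT = re.compile(r"L(?:.{6}L){4}")
# Same spacing, but each of the five positions may be Leu, Met, Val or Ile.
RELAXED = re.compile(r"[LMVI](?:.{6}[LMVI]){4}")

def classify_zipper(sequence):
    """Classify a one-letter amino acid sequence as 'strict', 'relaxed' or 'none'."""
    if STRICT.search(sequence):
        return "strict"
    if RELAXED.search(sequence):
        return "relaxed"
    return "none"

# Leu at positions 0, 7, 14, 21 and 28
print(classify_zipper("LAAAAAA" * 4 + "L"))
# The third repeat position holds Val instead of Leu
print(classify_zipper("LAAAAAA" * 2 + "VAAAAAA" + "LAAAAAA" + "L"))
```

Checking the strict pattern first matters because every strict zipper also satisfies the relaxed pattern.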

The results of the 2ZIP Algorithm show that the annotated proteins in the SWISS-PROT database do not really follow a strict or relaxed definition of the leucine zipper, as had been hypothesized due to the generation of a lot of false-positives. However, the algorithm does demonstrate, based on the appearance of the leucine zippers in DNA binding basic region (bZIP) and helix-loop-helix (bHLH-ZIP), both of which have coiled-coil characteristics, that the presence of a leucine zipper is the hallmark of the coiled-coil itself rather than the leucine repeat.

2.2.5. STABLE INPUT ALGORITHM

All coiled-coil prediction algorithms mentioned thus far are based on statistical probability. The Stable Input Algorithm [3] is the first algorithm created to determine the presence of coiled-coils using the experimentally determined stability and helical propensity values of various amino acids present in the protein sequence. This algorithm also provides stability clusters of amino acids based on the varying amount of residues at a and d positions in the coiled-coil. The SWISS-PROT database was used as the source data for this algorithm. This algorithm is the precursor to the Stable Coil Algorithm implemented in this project.

Once coiled-coils are extracted from the SWISS-PROT proteins, the Stable Input Algorithm uses a windowing function over which to calculate the relative stability of the coiled-coil. When researchers tested this algorithm using window widths of 7 and 11, their results yielded some interesting observations concerning the stability of coiled-coils. According to these results, hydrophobic amino acids occupy the hydrophobic a and d positions 65% of the time on average for the SWISS-PROT dataset. As each hydrophobic core is added to the sequence length, the number of hydrophobic clusters decreases by a factor of 2, while the number of non-hydrophobic clusters decreases by a factor of 8. Also, the cluster frequency decreases as the heptad length increases. The Stable Input Algorithm does not evaluate the intermediate positions in the coiled-coil as strictly as it does the start and end positions, with the result that clusters are missed about 70% of the time. Researchers also found it difficult to compare results from different sequences or to perform quantitative queries, as the results were not stored in a database. Furthermore, they found that the algorithm is more susceptible to false positives; this can be attributed to the shortness of the window lengths and to the method by which the stability values were assigned to each amino acid.

It is interesting to observe that although all of these algorithms predict coiled-coils in proteins, either by using statistical approaches or by using stability values, none of them stores its results in a way that allows users to perform customized searches. All of the above algorithms require the user to enter a protein sequence or a file in a certain format to produce the desired results. Not only is this approach inconvenient, it is also time consuming, particularly if users want to run a large set of data.

The initial emphasis of this project is to retrieve additional information from the SWISS-PROT database concerning the electrostatic interactions among the amino acids within these coiled-coils. The scope of this project also entails helping researchers study in detail the role that various amino acid residues play in the hydrophobic core of the coiled-coils. Finally, the project will try to improve the performance, accuracy, and user friendliness of the first rendition of the Stable Coil Algorithm, described in the next chapter. Hence, this project was undertaken to provide the researchers at UCHSC with a bigger, more readily accessible dataset of coiled-coils. The architecture and implementation of the database and the website are described in detail later in this document.

Chapter 3

STABLE COIL ALGORITHM

The researchers at UCHSC have used a model protein, consisting of two identical 38-residue polypeptide chains covalently linked at their N-termini via a disulfide bridge, to determine the effects that substituting different amino acids in a coiled-coil sequence may have on coiled-coil stability. This work forms the basis for the design of new coiled-coil structures, allows a better understanding of the structural relationships between amino acids in a protein sequence, and provides impetus for the design of new algorithms to predict the presence of coiled-coils within native protein sequences. The study of the coiled-coil domain has a number of advantages. These advantages are best detailed by [5]:

◆ Abundant motif in proteins

◆ Only one type of secondary structure is present, i.e., the α-helix

◆ Only two interacting α-helices are required to introduce tertiary and quaternary structure

◆ Diversity in length makes it an ideal system to test predictions

◆ All non-covalent interactions that stabilize the three-dimensional structure of the proteins are found in the coiled-coil domain

◆ Experimentally easy to analyze structure and stability.

To understand proteins and the functions they perform, it is necessary to predict the occurrence of a coiled-coil before performing expensive and time-consuming experiments. Hence the researchers at UCHSC have experimentally derived stability values for the twenty amino acids in their different heptad positions, as described in Table 3-1.

|Amino Acid Name |One Letter Code |Three Letter Code |Stability Value at Offset A |Stability Value at Offset D |Stability Value at Other Positions |

|Alanine |

|Heptad Offset Position |a |b |c |d |e |f |g |

|Sequence Of Amino Acids |M |D |Y |L |D |L |G |

|Stability Values |2.96 |0.116 |0.237 |3.7 |0.116 |0.446 |0.000 |

Table 3-2: Coiled-coil Sequence starting at offset a

|Heptad Offset Position |b |c |d |e |f |g |a |

|Sequence Of Amino Acids |M |D |Y |L |D |L |G |

|Stability Values |0.369 |0.116 |2.500 |0.446 |0.264 |0.446 |0.000 |

Table 3-3: Coiled-coil Sequence starting at offset b

To detect the presence of a coiled-coil, we use two experimentally determined values: a cutoff value of 38 and a window length of 42. Researchers at UCHSC found experimentally that these values give the best coiled-coil predictions. The next step is to calculate seven scoring arrays by aggregating the stability arrays, and this is where the window length is used: the aggregation is performed over 42 residues at a time. If fewer than 42 residues remain, we simply aggregate the values up to the end of the sequence. This provides us with seven arrays, known as the scoring arrays (score_array).

Example:

|Heptad Offset Position |a |b |c |d |e |f |g |

|Sequence Of Amino Acids |M |D |Y |L |D |L |G |

|Stability Values |2.96 |0.116 |0.237 |3.7 |0.116 |0.446 |0.000 |

|Scoring Arrays |7.575 |4.615 |4.499 |4.262 |0.562 |0.446 |0.000 |

Table 3-4: An aggregation of stability values 42 amino acids at a time
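The aggregation in Table 3-4 can be reproduced with a forward-looking window sum. The sketch below is an illustration, not the project's Perl code. The function name `build_score_array` is hypothetical, and the rounding to three decimals is an assumption made only to match the precision shown in the table.

```python
def build_score_array(stability, window=42):
    """Sum each stability value with the values that follow it, up to
    `window` residues in total; near the end of the sequence the sum
    simply runs to the last residue."""
    return [round(sum(stability[i:i + window]), 3)
            for i in range(len(stability))]

# Stability values for M D Y L D L G starting at offset a (Table 3-2)
stability_a = [2.96, 0.116, 0.237, 3.7, 0.116, 0.446, 0.000]
print(build_score_array(stability_a))
```

Because this seven-residue example is shorter than the window, every entry is the sum of all values from that position to the end of the sequence, which gives the Scoring Arrays row of Table 3-4.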

The second experimentally determined value, the cutoff of 38, is used here. For each amino acid position in the protein sequence, if the aggregate scoring value is greater than or equal to 38, we mark that position as 1; otherwise we mark it as 0, thus generating a marker_array. Then, we look for a run of 42 or more consecutive 1’s in the marker_array. If we find such a run, we predict the presence of a coiled-coil, with the starting location of the run giving the starting heptad offset of the coiled-coil. Only coiled-coils with 42 or more residues are considered for this project, as researchers at UCHSC are interested in cluster patterns found in large coiled-coils.

Example:

|Heptad Offset Position |a |b |c |d |e |f |g… |

|Sequence Of Amino Acids |M |D |Y |L |D |L |G… |

|Scoring Arrays |52.75 |50.756 |40.756 |38.254 |37.656 |….... |…… |

|Coiled-coil Arrays |1 |1 |1 |1 |0 |…… |…… |

Table 3-5: Determining the presence of a coiled-coil in the protein sequence
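The thresholding and run detection described above can be sketched as follows. The function names `build_marker_array` and `find_coiled_coils` are hypothetical. The run is reported as a half-open (start, end) index pair, which is an assumption about representation rather than the format used by the project.

```python
def build_marker_array(scores, cutoff=38.0):
    """Mark each position 1 if its score reaches the cutoff, else 0."""
    return [1 if s >= cutoff else 0 for s in scores]

def find_coiled_coils(markers, min_length=42):
    """Return (start, end) index pairs, end exclusive, for every run of
    at least `min_length` consecutive 1's."""
    runs, start = [], None
    # A trailing 0 is appended so that a run reaching the end is closed.
    for i, mark in enumerate(list(markers) + [0]):
        if mark == 1 and start is None:
            start = i
        elif mark != 1 and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs

# First five scores from Table 3-5
print(build_marker_array([52.75, 50.756, 40.756, 38.254, 37.656]))
```

For the five scores listed in Table 3-5 this reproduces the Coiled-coil Arrays row: the first four positions are marked 1 and the fifth, at 37.656, is marked 0.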

3.3 Cluster patterns in coiled-coils

Once we have predicted the occurrence of a coiled-coil, the presence of a cluster can be determined from the particular hydrophobic residues occurring at the a and d positions. If one of the hydrophobic amino acids Phenylalanine, Isoleucine, Leucine, Methionine, Valine, or Tyrosine is found in the a or d position, then the cluster sequence gets 1; otherwise it gets 0.

Example:

|Heptad Offset Position |d |
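The marking rule for the a and d positions can be sketched as shown below. The function name `cluster_sequence` is hypothetical, and the sketch assumes that the cluster sequence holds one entry per a or d position, in sequence order; the project's actual representation may differ.

```python
HEPTAD = "abcdefg"
# Phe, Ile, Leu, Met, Val, Tyr in one-letter code
CLUSTER_HYDROPHOBES = set("FILMVY")

def cluster_sequence(coil, start_offset):
    """Return a list of 1/0 marks, one for each a or d position in `coil`."""
    first = HEPTAD.index(start_offset)
    marks = []
    for pos, residue in enumerate(coil):
        offset = HEPTAD[(first + pos) % 7]
        if offset in "ad":
            marks.append(1 if residue in CLUSTER_HYDROPHOBES else 0)
    return marks

# M D Y L D L G starting at offset a: M sits at a, L sits at d
print(cluster_sequence("MDYLDLG", "a"))
# The same residues starting at offset b: Y sits at d, G sits at a
print(cluster_sequence("MDYLDLG", "b"))
```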

|Attractions |Repulsions |

|Lys / Glu |Lys / Lys |

|Glu / Lys |Lys / Arg |

|Lys / Asp |Arg / Arg |

|Asp / Lys |Arg / Lys |

|Arg / Glu |Glu / Glu |

|Glu / Arg |Glu / Asp |

|Arg / Asp |Asp / Glu |

|Asp / Arg |Asp / Asp |

Table 4-1: Amino acid electrostatic interactions which the researchers at

UCHSC are interested in

This table provides information on the position of the salt bridge in the coiled-coil, as well as the start and end heptad offsets of the salt bridge. These residues are searched for using a regular expression parser.

Using the salt bridge table, the researchers are able to answer questions such as what is the total number of Lys/Glu i to i + 3 salt bridges and where are these salt bridges distributed within the coiled-coil?
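A regular-expression search of this kind might look like the following sketch. It is written in Python rather than the project's Perl, and the function name `find_salt_bridges`, its parameters, and the tuple layout are hypothetical. A lookahead is used so that overlapping salt bridges are not skipped, and the returned values correspond loosely to the start location, end location, start offset, end offset and matched residues stored in the salt bridge table.

```python
import re

HEPTAD = "abcdefg"

def find_salt_bridges(coil, first, second, spacing, coil_start_offset="a"):
    """Find every `first` ... `second` pair exactly `spacing` residues apart.

    Returns tuples of (start_loc, end_loc, start_off, end_off, match),
    with 0-based locations relative to the start of the coiled-coil.
    """
    pattern = re.compile(
        rf"(?=({re.escape(first)}.{{{spacing - 1}}}{re.escape(second)}))"
    )
    base = HEPTAD.index(coil_start_offset)
    bridges = []
    for m in pattern.finditer(coil):
        start, end = m.start(), m.start() + spacing
        bridges.append((start, end,
                        HEPTAD[(base + start) % 7],
                        HEPTAD[(base + end) % 7],
                        m.group(1)))
    return bridges

# Lys/Glu i to i + 3 salt bridges in a short made-up sequence
for bridge in find_salt_bridges("AKLAEKAAE", "K", "E", 3):
    print(bridge)
```

In this made-up sequence the first match runs from offset b to offset e, which is the intrachain i to i + 3 pairing described in Chapter 2.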


As of July 7th 2008, there are 1,017,241 salt bridges present in the 141,204 unique coiled-coils. The table structure is listed in Figure 4-7 below.

[pic]

Figure 4-7: Structure of Salt Bridge Table (tblSaltBridge)

The columns specific to this table are described in detail here:

• SaltBridgeID – The SaltBridgeID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table. The column also has a referential integrity to tblHeptadSalt.SaltBridgeID, which is the bridge table that connects the salt bridges with coiled-coil heptads.

• SaltResidueID – The salt residue id is looked up against tblSaltResiduesLookup. This id indicates the type of salt bridge, for example whether it is an “i to i + 3” salt bridge between salt residues Lys and Glu. This column is foreign keyed to tblSaltResiduesLookup.SaltResidueID. Currently there are 48 different attraction and repulsion salt residues in the lookup table.

• CoilID – This field is foreign keyed to the tblCoiledCoil.CoilID. This field will give us which salt bridges are located in which coiled-coils.

• SaltBridgeMatch – This field refers to all the residues contained within the said salt bridge. These intermediate amino acids can help the researchers determine which residues occur most commonly in a particular type of salt bridge.

• SaltStartLoc – This field refers to the start location of the salt bridge in the coiled-coil.

• SaltEndLoc – This field refers to the end location of the salt bridge in the coiled-coil.

• SaltStartOff – This field refers to the starting offset of the salt bridge.

• SaltEndOff – This field refers to the ending offset of the salt bridge. The offset fields help us identify what salt residues commonly occur at what heptad offsets in a coiled-coil.

4.1.2.3 COILED-COIL HEPTAD TABLE

The coiled-coil heptad table is generated by splitting the coiled-coils into their respective heptad sequences. A heptad is defined here as the sequence of offsets g,a,b,c,d,e,f. The heptad starts with g in order to capture all i to i + 5 interactions within a given heptad of a coiled-coil. The table contains the residues occurring at each of these offsets for each heptad of the coiled-coils in tblCoiledCoil. This table helps build queries that determine whether certain residues or certain pairs of residues occur more frequently than others.
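The splitting step can be sketched as below. The function name `split_into_heptads` is hypothetical. How the project treats incomplete heptads at the ends of a coiled-coil is not stated here, so this sketch simply keeps them as partial entries with the missing offsets absent.

```python
HEPTAD = "abcdefg"

def split_into_heptads(coil, start_offset):
    """Split `coil` into g-a-b-c-d-e-f heptads.

    Each heptad is a dict mapping an offset letter to the residue found
    there; a new heptad begins whenever a g position is reached.
    """
    first = HEPTAD.index(start_offset)
    heptads, current = [], {}
    for pos, residue in enumerate(coil):
        offset = HEPTAD[(first + pos) % 7]
        if offset == "g" and current:
            heptads.append(current)
            current = {}
        current[offset] = residue
    if current:
        heptads.append(current)
    return heptads

for heptad in split_into_heptads("MDYLDLGAAAAAAA", "a"):
    print("".join(heptad.get(offset, "-") for offset in "gabcdef"))
```

Starting each heptad at g keeps a g residue and the e residue five positions later in the same row, so an i to i + 5 pair can be read from a single record.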

As of June 7th 2008, there are 1,186,214 heptads for 141,204 unique coiled-coils in the database. The table structure is listed in Figure 4-8 below.

[pic]

Figure 4-8: Structure of Coiled-coil Heptad Table (tblSplitHeptadCoil)

The columns specific to this table are described in detail here:

• HeptadOffsetID – The HeptadOffsetId is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table. The column also has a referential integrity to the tblHeptadSalt.HeptadOffsetID, which is the bridge table to connect the coiled-coil heptads with the corresponding salt bridges.

• CoilID – This field is foreign keyed to tblCoiledCoil.CoilID. This field will tell us which heptads belong to which coiled-coils.

• OffsetG, OffsetA, OffsetB, OffsetC, OffsetD, OffsetE, and OffsetF – These fields store the amino acid residues of the coiled-coils at the corresponding offsets.

• HeptadStartLoc – This field refers to the start location of the corresponding heptad in the coiled-coil.

• HeptadEndLoc – This field refers to the end location of the corresponding heptad in the coiled-coil.

• HeptadStartOff – This field refers to the starting offset of the heptad.

• HeptadEndOff – This field refers to the ending offset of the heptad.

4.1.2.4 HEPTAD SALT TABLE

The table tblHeptadSalt is a bridge table between the tblSaltBridge and tblSplitHeptadCoils. It stores the salt bridge IDs and heptad offset IDs, where the salt bridge is present in the heptad offset for the given coiled-coil. The table does not store any salt bridges that overlap two heptads.

As of July 7th 2008, there are 549,440 unique coiled-coils in the database. The table structure is listed in Figure 4-9 below.

[pic]

Figure 4-9: Structure of Heptad Salt Bridge Table (tblHeptadSalt)

The columns specific to this table are described in detail here:

• HeptadSaltID – The heptad salt id is an auto increment column. New ids are created every time a row is inserted. This column is also the primary key for this table.

• HeptadOffsetID – This field is foreign keyed to the tblSplitHeptadCoils.HeptadOffsetID. This ID is used to indicate all the heptads which contain salt bridges.

• SaltBridgeID – This field is foreign keyed to the tblSaltBridge.SaltBridgeID.

4.1.2.5 SCRAPE FILE TABLE

Finally, there is a table that drives the data scrape process (tblScrapeFile). It contains information about the location of the file on an FTP server, the last time the file was updated on the source, and the size of the file. This information is used to check whether the file has been changed on the host and, if so, to retrieve it. Once the new version of the file has been retrieved, the file size and the mod date are updated, so as to store the most current attributes of the file.

As of July 7th 2008, there are 3 entries in the tblScrapeFile in the database. The first entry is used to retrieve the SWISS-PROT file which gets updated on the source site monthly. The second entry is used to retrieve the TREMBL file. The final entry is used to retrieve the updates to the SWISS-PROT database. If the researchers would like to add more datasets, they simply need to add an entry to this table. The only factor they must take into account is the file format of the dataset. In order for the data in the new file to successfully load into the database, it must be of the same format as the SWISS-PROT file, a format in which protein sequences are commonly represented. The table structure is listed in Figure 4-10 below.

[pic]

Figure 4-10: Structure of Scrape File table (tblScrapeFile)

The columns specific to this table are described in detail here:

• ScrapeID – The ScrapeID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table.

• FtpSiteUrl – This field stores the main URL to the ftp site which hosts the data. The Perl program has been designed specifically to get data from a FTP server as most protein distribution files are large in size and are almost always accessible via FTP.

• FtpDirPath – This field refers to the directory path of the file on the FTP server.

• FtpFileName – This field is the actual file name retrieved from the FTP site.

• LocalDirPath – This field refers to the directory path on the local machine where we plan to save the data obtained from the FTP site.

• LastModDate – This field refers to the last time the file on the website was modified. If this date and date on the file do not match, Perl programs mark the file as changed and scrape the new version of the file.

• SizeInBytes – This field refers to the file size in bytes as retrieved from the file during the most recent scrape. If this file size and the size of the file on the FTP site do not match, the Perl programs mark the file as changed and scrape the new version of the file.

• PullWeeklyFlag – This field acts as a flag which tells the main scrape program whether or not to scrape a file every week. A Perl program in turn calls a stored procedure, which changes the status of the PullWeeklyFlag field based on the last time a file was scraped. The field has just two values ‘Y’ or ‘N’.

• ScrapeFileFlag – This field indicates whether or not the scrape program requires a restart. Because the Perl programs are scraping huge files, it is entirely possible that the data transfer might end before finishing the complete download. If this happens and the researchers are trying to retrieve multiple files, this field determines which was the last file successfully scraped. Every time the program starts it looks at which files it needs to be retrieved and marks the ScrapeFileFlag field as ‘N’. Once the scrape of a particular file is completed, the ScrapeFileFlag field is marked as ‘Y’.

4.1.3 MATERIALIZED VIEWS

To provide faster access from the user interface, the database includes materialized views. Like standard views, materialized views are defined by select queries. A materialized view, however, takes a different approach, in that the query result is cached as a concrete table that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of some data being potentially out of date. It is most useful in data warehousing scenarios, where frequent queries of the actual base tables can be extremely expensive. In addition, because the view is manifested as a real table, anything that can be done to a real table can be done to the view, most importantly building indexes on any column, which enables drastic speedups in query time. In a normal view, it is typically only possible to exploit indexes on columns that come directly from (or have a mapping to) indexed columns in the base tables. MySQL 5.0, however, does not support materialized views. Hence a new approach was designed to create tables that are automatically updated, just like a materialized view.

MySQL provides a “Create table as” (CTAS) syntax which allows users to create tables using a select statement. This property was utilized to create a procedure which takes a select statement and a table name as inputs and creates a table on the fly. This procedure also assigns primary keys and any indexes if specified. The details of the stored procedure that creates these materialized views are described in Appendix A. The updates to these materialized views are scheduled using the UNIX crontab. Also for each of these queries it takes from 1.2 seconds to 5 seconds to refresh, which provides for a minimal down time, if any.
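The general idea of emulating a materialized view with a create-table-as-select statement can be sketched as follows. This is not the stored procedure from Appendix A. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that the example is self-contained, and the function name `refresh_matview`, the table name `matview_demo`, and the two-column version of tblSplitHeptadCoils are all hypothetical simplifications.

```python
import sqlite3

def refresh_matview(conn, table_name, select_sql, index_columns=()):
    """Rebuild `table_name` from `select_sql`, then index the given columns."""
    if not table_name.isidentifier():
        raise ValueError("table name must be a plain identifier")
    cur = conn.cursor()
    # Dropping the table also drops its old indexes.
    cur.execute(f"DROP TABLE IF EXISTS {table_name}")
    cur.execute(f"CREATE TABLE {table_name} AS {select_sql}")
    for column in index_columns:
        cur.execute(
            f"CREATE INDEX idx_{table_name}_{column} ON {table_name} ({column})"
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblSplitHeptadCoils (OffsetA TEXT, OffsetD TEXT)")
conn.executemany("INSERT INTO tblSplitHeptadCoils VALUES (?, ?)",
                 [("L", "L"), ("L", "L"), ("I", "L")])
refresh_matview(
    conn, "matview_demo",
    "SELECT OffsetA, OffsetD, COUNT(*) AS Occurrences "
    "FROM tblSplitHeptadCoils GROUP BY OffsetA, OffsetD",
    index_columns=("OffsetA",),
)
rows = conn.execute(
    "SELECT OffsetA, OffsetD, Occurrences FROM matview_demo "
    "ORDER BY Occurrences DESC"
).fetchall()
print(rows)
```

Dropping and rebuilding the whole table is the simplest refresh strategy and suits a scheduled job; the table is briefly unavailable while it is rebuilt, which matches the short down time noted above.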

The materialized views were created to answer questions such as: what roles do different amino acid residues play in the hydrophobic core (a and d positions), and what is the frequency of occurrence of pairs of residues in the coiled-coils? Some of the important materialized views are described here in detail.

4.1.3.1 AMINO ACID OCCURRENCES

This materialized view provides the frequency of occurrence of a pair of amino acids at a certain heptad offset in a coiled-coil. This view, in turn, allows users to find out which amino acid pair occurs most frequently at a given heptad position and which occur less frequently. This table is based on a select query performed on the tblSplitHeptadCoils.

As of July 7th 2008, there are 2,010 distinct amino acid pair occurrences out of a total possible 2,800. The most commonly occurring pair of amino acids is ‘L-L’ at offsets a and d respectively, which occurs 57,854 times. The table structure is listed in Figure 4-11 below.

[pic]

Figure 4-11: Materialized view of Amino Acid Occurrences (matview_AminoAcidOcurrences)

The columns specific to this materialized view (table) are described in detail here:

• Amino Acid Pair – This column defines distinct pairs of amino acids which are found within the coiled-coils in the database.

• Offset Location 1 – This field stores the heptad offset of the first amino acid for the specific coiled-coil.

• Offset Location 2 – This field stores the heptad offset of the second amino acid for the amino acid pair we have found.

• Offset Pair Occurrence– This field stores the number of occurrences of the amino acid pair in the different heptad offsets, (a, b, c, d, e, f and g).

4.1.3.2 COIL LENGTH vs. CLUSTER PER COIL

This materialized view provides data on the frequency of occurrence of coiled-coils of a certain length. In addition, it provides the normalized value of the number of clusters occurring in coiled-coils of a given length. This view allows users to see how cluster distribution varies as coil length increases. The coils have been divided into seven subgroups by coiled-coil length:

1. Coiled-coils with coil length less than 50 amino acids

2. Coiled-coils with coil length between 50 and 59 amino acids

3. Coiled-coils with coil length between 60 and 69 amino acids

4. Coiled-coils with coil length between 70 and 79 amino acids

5. Coiled-coils with coil length between 80 and 89 amino acids

6. Coiled-coils with coil length between 90 and 99 amino acids

7. Coiled-coils with coil length greater than 100 amino acids

The table structure is listed in Figure 4-12 below.

[pic]

Figure 4-12: Materialized view of Coiled-coil Length vs. the Cluster Count (matview_CoilClusterCount)

The columns specific to this materialized view (table) are described in detail here:

• Coiled-coils By Length – This field splits coiled-coils into different categories by length.

• Coiled-coil Count – This field provides information on the number of coiled-coils in each group once they have been split by length.

• Stabilizing clusters per coil – This field stores a normalized value of the Total Stabilizing Clusters by the Total Number of Coiled-coils.

• Stabilizing Clusters per coil in Coiled-coils with Clusters– This field stores a normalized value of the Total Stabilizing Clusters by the Total Number of Coiled-coils that actually contain stabilizing clusters.

• Destabilizing clusters per coil – This field stores a normalized value of the Total Destabilizing Clusters by the Total Number of Coiled-coils.

• Destabilizing Clusters per coil in Coiled-coils with Clusters– This field stores a normalized value of the Total Destabilizing Clusters by the Total Number of Coiled-coils which actually contain destabilizing clusters.
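The grouping and the two normalizations can be sketched as below for stabilizing clusters; destabilizing clusters would be handled in the same way. The function names `length_group` and `clusters_per_coil` and the group labels are hypothetical. The sketch also assumes that a coiled-coil of exactly 100 amino acids belongs to the last group, which the list above leaves open.

```python
def length_group(coil_length):
    """Map a coiled-coil length to one of the seven length groups."""
    if coil_length < 50:
        return "<50"
    if coil_length >= 100:  # assumption: exactly 100 falls in the last group
        return "100+"
    low = (coil_length // 10) * 10
    return f"{low}-{low + 9}"

def clusters_per_coil(coils):
    """`coils` is an iterable of (length, stabilizing_cluster_count) pairs."""
    totals = {}
    for length, clusters in coils:
        count, total, with_clusters = totals.get(length_group(length), (0, 0, 0))
        totals[length_group(length)] = (
            count + 1, total + clusters, with_clusters + (1 if clusters else 0)
        )
    return {
        group: {
            "coil_count": count,
            "per_coil": total / count,
            "per_coil_with_clusters": total / with_c if with_c else 0.0,
        }
        for group, (count, total, with_c) in totals.items()
    }

print(clusters_per_coil([(55, 2), (58, 0), (120, 3)]))
```

The second normalization divides by only those coiled-coils that contain at least one cluster, so it is never smaller than the first.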

These are a couple of the materialized views created to accelerate the execution of the searches. The details of their creation, including the SQL statements used, are covered in Appendix B.

4.2 Perl Code Design

One of the most important and complex parts of the Stable Coil project was loading the data from the source files. The raw data for the scrape is obtained from the ExPASy (Expert Protein Analysis System)[2] server. The ExPASy database is an open source database developed to help researchers by providing the latest annotated protein sequences. The entire database is reposted monthly, while updates for proteins sequenced as a result of various genome projects are added weekly. This database can be downloaded at . The database is available in XML and DAT formats and can be downloaded compressed or uncompressed. The updates to the protein sequences are available at in the DAT format. This project uses the DAT format in order to keep the data type consistent across the entire process. The main database is currently 2.9 gigabytes in size. The weekly updates range from 30 to 40 megabytes.
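To give a sense of what reading a DAT file involves, here is a minimal Python sketch (the project's own loader is written in Perl). It assumes the files follow the Swiss-Prot style flat-file layout: two-letter line codes, an `ID` line opening each entry, an `SQ` header carrying the length and molecular weight, indented sequence lines, and `//` closing the entry. The function name and the dictionary keys are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator

# Header of the sequence block, e.g. "SQ   SEQUENCE   12 AA;  1300 MW;  ... CRC64;"
SQ_LINE = re.compile(r"^SQ\s+SEQUENCE\s+(\d+) AA;\s+(\d+) MW;")


def parse_dat(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per entry with its name, molecular weight and sequence."""
    entry = None
    in_sequence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("ID   "):
            # Second token of the ID line is the entry name.
            entry = {"id": line.split()[1], "mw": None, "sequence": ""}
            in_sequence = False
        elif entry is None:
            continue                      # skip anything before the first ID line
        elif line.startswith("SQ   "):
            match = SQ_LINE.match(line)
            if match:
                entry["mw"] = int(match.group(2))
            in_sequence = True
        elif line.startswith("//"):
            yield entry                   # end of entry
            entry = None
            in_sequence = False
        elif in_sequence:
            # Sequence lines are blocks of residues separated by spaces.
            entry["sequence"] += line.replace(" ", "")
```

Because the function is a generator over lines, it can be fed an open file handle directly, so a multi-gigabyte file never has to be held in memory at once.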

Four Perl programs are used to retrieve source data from the website, parse the data using the Stable Coil Algorithm, and load the data into the MySQL database. They are as follows:

1. StableCoil_Algorithm_setpullweekly_call.pl

2. StableCoil_Algorithm_scrape_parse_load.pl

3. StableCoil_Algorithm_saltresidues_heptadoffsets_extract.pl

4. Older_Files_Archiving_Removal.pl

The first program sets the PULLWEEKLYFLAG in the tblScrapeFile table, depending on how long it has been since the last scrape of the file. For the update files the PULLWEEKLYFLAG is set every seven days, while for the entire database it is set every 180 days. The second program is the main program that actually scrapes and loads the data. It first checks whether the PULLWEEKLYFLAG has been set for the specified file. If it is set, the program compares the LASTMODDATE and SIZEINBYTES in tblScrapeFile against the modification date and size of the file on the FTP site; if they differ, the file is scraped. At this time, the LASTMODDATE and SIZEINBYTES for the file are also updated. Once the file has been scraped, the sequences, their molecular weights, their creation dates, and any other relevant information are extracted from it. This information is then run through the Stable Coil Algorithm, which predicts the presence and location of coiled-coils in the protein sequences. If a protein sequence is shorter than 42 amino acids, the program ignores it, as the researchers are interested in coiled-coils with 42 or more amino acids.
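The decision logic described above can be sketched in Python as follows (the actual programs are written in Perl and work against the MySQL table). The record class, its `last_scrape` and `is_full_database` fields, and the function names are hypothetical. The sketch also assumes that a difference in either the modification date or the size is enough to trigger a scrape.

```python
from dataclasses import dataclass
from datetime import date, timedelta

UPDATE_INTERVAL_DAYS = 7      # weekly update files
FULL_INTERVAL_DAYS = 180      # entire database
MIN_SEQUENCE_LENGTH = 42      # shorter sequences are ignored


@dataclass
class ScrapeFile:
    """Hypothetical in-memory stand-in for one tblScrapeFile row."""
    is_full_database: bool
    last_scrape: date
    last_mod_date: date       # LASTMODDATE
    size_in_bytes: int        # SIZEINBYTES
    pull_weekly_flag: bool = False   # PULLWEEKLYFLAG


def set_pull_flag(rec: ScrapeFile, today: date) -> None:
    """First program: raise the flag once the interval has elapsed."""
    days = FULL_INTERVAL_DAYS if rec.is_full_database else UPDATE_INTERVAL_DAYS
    if today - rec.last_scrape >= timedelta(days=days):
        rec.pull_weekly_flag = True


def should_scrape(rec: ScrapeFile, remote_mod_date: date, remote_size: int) -> bool:
    """Second program: scrape only if flagged and the remote file changed."""
    if not rec.pull_weekly_flag:
        return False
    # Assumption: a change in either value counts as "they differ".
    return rec.last_mod_date != remote_mod_date or rec.size_in_bytes != remote_size


def record_scrape(rec: ScrapeFile, remote_mod_date: date, remote_size: int) -> None:
    """Store the new modification date and size after a scrape."""
    rec.last_mod_date = remote_mod_date
    rec.size_in_bytes = remote_size


def long_enough(sequence: str) -> bool:
    """Keep only sequences with 42 or more amino acids."""
    return len(sequence) >= MIN_SEQUENCE_LENGTH
```

Comparing both the date and the size is a cheap way to avoid downloading a large file that has not changed since the previous run.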

Next, the program checks whether the coiled-coil already exists in the tblCoiledCoil table. If it does, the program retrieves the CoilID for that coiled-coil. If it does not, the program inserts a new record into the table and retrieves its CoilID. Similarly, the program either inserts a new protein sequence into tblProtein or retrieves the existing ProteinID, depending on whether the protein already exists in the database. The next step is to determine the clusters in the coiled-coils using Perl’s pattern matching operators. The pattern matches are done as follows.

Cluster3: /(?:(? ................
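The look-up-or-insert step used for tblCoiledCoil (and likewise for tblProtein) can be sketched in Python as follows. This is an illustration only: it uses an in-memory SQLite database in place of the project's MySQL database, and the `CoilSequence` column name and the function name are hypothetical.

```python
import sqlite3


def get_or_create_coil_id(conn: sqlite3.Connection, coil_sequence: str) -> int:
    """Return the CoilID for a coiled-coil, inserting a new row if needed."""
    row = conn.execute(
        "SELECT CoilID FROM tblCoiledCoil WHERE CoilSequence = ?",
        (coil_sequence,),
    ).fetchone()
    if row is not None:
        return row[0]                     # coil already stored
    cur = conn.execute(
        "INSERT INTO tblCoiledCoil (CoilSequence) VALUES (?)",
        (coil_sequence,),
    )
    conn.commit()
    return cur.lastrowid                  # id of the newly inserted row


# SQLite stands in for MySQL here; the schema is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblCoiledCoil ("
    "CoilID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "CoilSequence TEXT UNIQUE NOT NULL)"
)
```

Calling the function twice with the same sequence returns the same CoilID, so a coiled-coil shared by several proteins is stored only once.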