Bioinformatics Protocol - University of Arizona - Arizona biological research center

MCB 415

Bioinformatics Exercises

University of Arizona

Department of Molecular and Cellular Biology

Tucson, Arizona 85721

Email: fetax@email.arizona.edu

Developed by: Dr. Frans Tax, Kimberly Finnell

Table of Contents

Introduction

A Primer of Important Things to Know When Analyzing DNA

1. DNA Sequence

Figure 1: DNA Polarity

Figure 2: Gene Colinearity

2. Protein Structure

Figure 3: Primary, Secondary, Tertiary, and Quaternary Structure

Figure 4: IGF-1 Protein Structure

3. Protein Function

A. Biochemical Functions

Figure 5: Location of Plasma Membranes

Table 1: Protein Functions in the Cell

B. Genetic Functions

Types of Bioinformatics Programs

1. Open Reading Frame Database

Table 2: Codon Table

2. ExPaSy

3. PROSITE

4. BLOCKS

5. BLAST

Growth Factors

Gene 1 (IGF-2 gene)

Figure 6: IGF-2 Growth Regulation

Figure 7: Membrane Receptor/Ligand Dependent Tyrosine Kinase

Deletions in the IGF-2 gene

Gene 2 (Phytosulfokine gene)

Figure 8: Processing of the Phytosulfokine gene

Figure 9: Nucleotide and Amino Acid Sequence of AtPSK2 and AtPSK3

Figure 10: PSK Amino Acid Sequence Indicating Intron Location

Figure 11: Arabidopsis thaliana plant

Deletions in the Phytosulfokine gene

Database Procedures IGF-2 gene

Step 1 DNA sequence and PubMed

Step 2 ORF

Step 3 ExPaSy and PROSITE (domains)

Step 4 BLAST

Database Procedures Phytosulfokine gene

Step 1 DNA Sequence and PubMed

Step 2 ORF

Step 3 ExPaSy and PROSITE (domains)

Step 4 BLAST

List of other databases

Other applications for DNA analysis

Glossary

References

MCB415

Fall 2007

Bioinformatics is the science of managing and analyzing biological data using advanced computing programs.23 In this exercise our goal is to analyze the function and structure of a protein of interest by analyzing a DNA sequence. First, we will take a DNA sequence and determine its protein-coding capability. This is done by looking up the DNA sequence in a public database provided by the National Center for Biotechnology Information (NCBI). Second, we will use the ORF (open reading frame) Finder to find a subset of the sequenced piece of DNA that begins with an initiation codon (ATG, encoding methionine) and ends with a nonsense (stop) codon. These ORFs have the potential to encode our proteins of interest. Next, the PROSITE database will determine which protein family and domain this sequence fits into. Based on the information obtained, protein function can be analyzed. Finally, the DNA sequence will be compared to similar sequences using BLAST. Along the way, we will show you how sequence records are linked to the biological sciences literature database (PubMed). The NCBI sequence database can be linked to PubMed to find data concerning the specific function of a protein. PubMed has numerous citations and abstracts from published literature referenced by genetic sequence records.

You will be working with the two following growth factors.

Insulin-like Growth Factor (IGF)1

NM_000612 Insulin-like human growth factor gene (this is the NCBI nucleotide entry)

Phytosulfokine (PSK)1

NM_127851 Arabidopsis thaliana phytosulfokine-related gene (this is the NCBI nucleotide entry)

Both of these growth factors are small, secreted proteins or peptides that activate cell-surface receptors.

A Primer of Important things to know when analyzing DNA.

1. DNA sequence – The two DNA strands in each of the chromosomes of an organism are antiparallel to each other. This means they have opposite chemical polarity: their sugar-phosphate backbones run in opposite directions, as shown in figure 1 below. The 5’ end of a strand is the end at which the 5th carbon of the sugar ring carries a phosphate group, and the 3’ end is the end at which the 3rd carbon of the sugar ring carries a free hydroxyl group. When DNA is copied into RNA, synthesis always occurs in the 5’ to 3’ direction. This is because each new nucleotide is joined through its 5’ phosphate to the free 3’ hydroxyl at the end of the growing chain; the 5’ end has no free hydroxyl to which a nucleotide can be added. For some genes the top DNA strand encodes the mRNA, and other genes are encoded by the bottom strand.

Figure 1: DNA polarity 2

[pic]
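The antiparallel arrangement described above can be illustrated with a few lines of Python. This is only a minimal sketch: the nine-base sequence is made up for illustration, and the function name `reverse_complement` is our own choice, not part of any tool used in this exercise.

```python
# Translation table that swaps each base for its pairing partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written in the 5' to 3' direction."""
    # Complement every base, then reverse so the result also reads 5' to 3'.
    return seq.translate(COMPLEMENT)[::-1]


top_strand = "ATGGCCTAA"  # hypothetical top strand, written 5' to 3'
bottom_strand = reverse_complement(top_strand)
print(bottom_strand)  # TTAGGCCAT
```

Because both strands are conventionally written 5’ to 3’, the bottom strand has to be reversed as well as complemented. Applying the function twice returns the original sequence.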

After we have the base sequence of the gene, we can determine the specific order of amino acids in the protein. Because a gene and its protein are colinear, the linear nucleotide sequence is translated into the linear amino acid sequence of the protein. This concept is called gene-protein colinearity.

Figure 2: Gene Colinearity

[pic]
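Gene-protein colinearity can be sketched in Python by reading a coding sequence three bases at a time. This is an illustrative sketch that assumes the standard genetic code and a sequence containing only A, C, G, and T. The short example sequence is hypothetical; it was chosen so that its codons spell the peptide MGIP.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering of each base.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding strand (5' to 3'), stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # "*" marks a stop (nonsense) codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGGAATCCCATAA"))  # MGIP
```

Each amino acid in the output corresponds, in order, to one codon in the input, which is what colinearity means.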

2. Protein structure – There are four levels of organization in the structure of a protein. The primary structure of the protein is the amino acid sequence. Sometimes a pattern of identical or similar amino acids with a particular spacing is a signature of a conserved protein domain. The secondary structure is formed by alpha helices and beta sheets. The tertiary structure is the three-dimensional organization of a polypeptide chain. The final level is the quaternary structure, which is formed by a complex of more than one polypeptide chain.26 Looking at a protein’s three-dimensional structure can help determine its function. For example, knowledge of the active sites of enzymes and the ligand binding sites of receptors will help show protein function. A protein domain is a discrete portion of the protein’s three-dimensional structure that is assumed to fold independently of the rest of the protein and to possess its own function. In some cases, when the domain is separated from the rest of the protein it can still function. A domain usually contains between 40 and 350 amino acids.26 Some proteins have similar three-dimensional structures but different amino acid sequences.

Figure 3: Primary, Secondary, Tertiary, and Quaternary protein structure.27

[pic]

Figure 4: IGF-1 protein three-dimensional structure 3

[pic]

3. Protein function

A. Biochemical functions – The biochemical function of a protein may include catalytic and structural roles. One specialized function is in how cells respond to extracellular signals. For a given signaling pathway there can be components localized within cell membranes, attached to the cytosolic surface of the membrane, within the cytosol, or in the nucleus. Some membrane proteins serve as a structural link between the intracellular and extracellular surfaces, others help to transfer molecules or ions in and out of the cell, some help to catalyze reactions, and others provide structural links between the extracellular matrix and the cell’s cytoskeleton.4

Figure 5: Location of membrane proteins 4

[pic]

Table 1: Protein Functions in the Cell. This table represents some of the various roles proteins can play within and outside of the cell.

|Protein Function |Protein Type |Protein Example |
|Signal Transduction and Metabolism: These proteins detect, amplify, and integrate extracellular signals and initiate responses such as gene expression.7 |Integral membrane proteins |Hormones and receptors. For example: phytosulfokine peptide growth factor and its LRR (leucine-rich repeat) receptor kinase. |
|Fibrous and Structural: Serves as the biological glue for keeping structures in place.7 |Cytoskeleton proteins |Intracellular: actin filaments and microtubules. Extracellular: collagen, fibrin, and keratin. |
|Mechanical: Proteins that do mechanical work.7 |Contractile muscle proteins |Myosin and actin |
|Immune Response: The antibody immune response draws on several hundred genes to initiate an immune attack.5 |Antibodies are proteins that protect the body against antigens. |An antibody protein is called an immunoglobulin. For example: the IgG antigen binding site is composed of 108 amino acids.5 |
|Nucleic Acid Binding: These proteins take part in DNA regulatory processes. They regulate DNA packing, replication, and transcription. |Regulatory proteins are involved in recognizing specific DNA sequences. |These include the helix-turn-helix motif, the leucine zipper motif, and the zinc finger motif. |
|Catalyst: Increases the rate of a reaction.7 |Enzymes are proteins that act as biological catalysts. |Ligases are a class of enzymes that catalyze the joining of two molecules. |
|Storage and Transport: Storage proteins serve as reservoirs for amino acids and have good nutritional value for us. Transport proteins move substances against a diffusion gradient.6 |Storage proteins provide essential nutrients. Transport proteins transport glucose, fatty acids, and gases. |Ovalbumin (egg white) acts as a storage protein. Hemoglobin acts as a gas (O2) transporter in the cell.6 |

B. Genetic Functions - The genetic function is defined by the role of the protein in the organism. When the gene is altered by mutation, alterations in the organism’s phenotype may become apparent. For example, in one case study a 15-year-old boy who was homozygous for a deletion of the IGF-1 gene had delayed intrauterine growth, sensorineural deafness, and mild mental retardation. The effects of losing the IGF-1 gene indicate that this gene is necessary for fetal growth and development, including central nervous system development.8

4. ORF (open reading frame)

An ORF is a section of a sequenced piece of DNA or cDNA that begins with an initiation codon (ATG, encoding methionine) and ends with a nonsense codon (TAA, TAG, or TGA). The ORF is defined by the placement of start and stop codons. These correspond to the sites on the mRNA where translation starts and stops. The region between the start and stop codons determines the protein sequence. The following table shows amino acids and their corresponding codons.

|Table 2: Reverse codon table. This table shows the 20 amino acids used in proteins, |

|together with the mRNA codons that code for them. 9 |

|Ala |GCU, GCC, GCA, GCG |Leu |UUA, UUG, CUU, CUC, CUA, CUG |

|Arg |CGU, CGC, CGA, CGG, AGA, AGG |Lys |AAA, AAG |

|Asn |AAU, AAC |Met |AUG |

|Asp |GAU, GAC |Phe |UUU, UUC |

|Cys |UGU, UGC |Pro |CCU, CCC, CCA, CCG |

|Gln |CAA, CAG |Ser |UCU, UCC, UCA, UCG, AGU, AGC |

|Glu |GAA, GAG |Thr |ACU, ACC, ACA, ACG |

|Gly |GGU, GGC, GGA, GGG |Trp |UGG |

|His |CAU, CAC |Tyr |UAU, UAC |

|Ile |AUU, AUC, AUA |Val |GUU, GUC, GUA, GUG |

|START |AUG, AUU*, GUG* |STOP |UAG, UGA, UAA |

Note: The GUG and AUU start codons are for Prokaryotes only.28

Each region of a DNA sequence has six possible reading frames to choose from: three on the upper strand (+1, +2, and +3) and three on the lower strand (-1, -2, and -3). Generally, since most proteins are large, the longest open reading frame found is assumed to be the correct reading frame.
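The idea of scanning all six reading frames for the longest ORF can be sketched as follows. This is an illustrative sketch, not a reimplementation of the NCBI ORF Finder: it assumes the standard genetic code, a sequence containing only A, C, G, and T, and ORFs that run from an ATG to the next in-frame stop codon. Frames are labelled +1 to +3 for the upper strand and -1 to -3 for the lower strand, and the test sequence is hypothetical.

```python
import re
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frames(dna):
    """Yield (frame label, translation with '*' for stops) for all six frames."""
    dna = dna.upper()
    lower = dna.translate(COMPLEMENT)[::-1]  # lower strand, read 5' to 3'
    for sign, strand in (("+", dna), ("-", lower)):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            yield f"{sign}{offset + 1}", "".join(CODON_TABLE[c] for c in codons)


def longest_orf(dna):
    """Return (frame label, peptide) for the longest Met-to-stop ORF found."""
    best = ("", "")
    for frame, protein in six_frames(dna):
        # An ORF starts at M and runs up to, but not including, a stop ('*').
        for match in re.finditer(r"M[^*]*(?=\*)", protein):
            if len(match.group()) > len(best[1]):
                best = (frame, match.group())
    return best


print(longest_orf("CCATGGGAATCCCATAAGG"))  # ('+3', 'MGIP')
```

In this toy example the only complete ORF lies in frame +3, so shifting the reading frame by one or two bases changes which protein, if any, is predicted.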

Why do we pick the longest open reading frame? Let’s think about why a large open reading frame is significant. In a random stretch of DNA, 3 of the 64 possible codons are stop codons. Thus, you would predict a stop codon about every 21 codons (64/3). Therefore, an open reading frame of more than 80 codons is unlikely to occur by chance and has a good chance of encoding a protein. Of course this is more useful when analyzing large proteins.
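The arithmetic behind this argument can be checked with a short calculation. The sketch below assumes a simplified model that is not stated in the protocol: all 64 codons are equally likely and successive codons are independent. Real genomes depart from this, so the numbers are only a rough guide.

```python
STOP_CODONS = 3
TOTAL_CODONS = 64

# Chance that any one random codon is a stop codon.
p_stop = STOP_CODONS / TOTAL_CODONS

# Average spacing between stop codons in random sequence.
expected_spacing = 1 / p_stop


def prob_no_stop(n_codons):
    """Probability that n consecutive random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons


print(round(expected_spacing, 1))    # 21.3
print(round(prob_no_stop(80), 3))    # 0.021
```

Under these assumptions only about 2% of random 80-codon stretches stay open, which is why a long ORF is treated as evidence of a real coding region.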

Types of Bioinformatics tools

1. ORF Finder (Open Reading Frame Finder). At the NCBI web site we will be using the ORF Finder, a graphical analysis tool that will help you choose the most appropriate reading frame. The resulting amino acid sequence can be used to search against other databases for protein analysis.

2. ExPASy (Expert Protein Analysis System) allows one to translate a DNA sequence into a protein sequence and predicted structure.12

3. PROSITE is a database at the ExPASy web site that scans a new protein sequence and identifies which protein family and domain it belongs to. When the protein sequence does not have overall similarity with other proteins in the database, the system uses sequence patterns (motifs) within the protein to suggest its function. PROSITE contains a database of known protein domains.12 Protein domains that belong to a particular family of proteins share functional similarities and are derived from a common ancestor. These similarities may include an amino acid sequence and a three-dimensional conformation that resemble those of the other family members.11

4. BLAST (Basic Local Alignment Search Tool) compares your sequence with the sequences in nucleotide and protein databases. BLAST can be used to analyze functional and evolutionary relationships between sequences as well as to identify gene families. There are five different types of BLAST programs.

a. blastp compares an amino acid query sequence against a protein sequence database.

b. blastn compares a nucleotide query sequence against a nucleotide sequence database.

c. blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

d. tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

e. tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.10
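The five programs listed above differ only in the type of query, the type of database, and which side is translated. As a memory aid, that can be written as a small lookup table. The names `BlastProgram`, `BLAST_PROGRAMS`, and `programs_for` are our own and are not part of BLAST itself.

```python
from typing import NamedTuple


class BlastProgram(NamedTuple):
    query: str        # type of sequence you paste in
    database: str     # type of sequences being searched
    translated: str   # which side is translated in six frames


BLAST_PROGRAMS = {
    "blastp": BlastProgram("protein", "protein", "none"),
    "blastn": BlastProgram("nucleotide", "nucleotide", "none"),
    "blastx": BlastProgram("nucleotide", "protein", "query"),
    "tblastn": BlastProgram("protein", "nucleotide", "database"),
    "tblastx": BlastProgram("nucleotide", "nucleotide", "both"),
}


def programs_for(query, database):
    """List the programs that accept this query type and database type."""
    return sorted(
        name
        for name, program in BLAST_PROGRAMS.items()
        if program.query == query and program.database == database
    )


print(programs_for("nucleotide", "protein"))     # ['blastx']
print(programs_for("nucleotide", "nucleotide"))  # ['blastn', 'tblastx']
```

For a cDNA query such as the ones in this exercise, the table shows that blastn and tblastx search nucleotide databases, while blastx searches protein databases.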

How might you use a BLAST search for something other than identifying a protein for which you have partial sequence information?

Answer: BLAST also identifies other sequences that are homologous with the protein.

The proteins that we will be looking at are from a group of proteins called growth factors.

Growth Factors

Growth Factors are small proteins which bind specific transmembrane receptors that control cell growth and differentiation. Growth factors signal from the outside to initiate a response within cells. We will look at two different examples, one in animals (IGF) and one in plants (PSK).13, 14

IGF (Insulin-like growth factor)

IGF is a protein produced primarily in the liver in response to stimulation by growth hormone (GH). IGF provides the best indicator of growth hormone levels, and optimal levels are linked to healthy bone, heart, thyroid, skin, and nervous system. Although the liver is the main source, IGF is also produced by many other tissues within the body, such as the heart, lung, kidney, pancreas, spleen, small intestines, large intestines, ovaries, placenta, testes, brain, bone, and pituitary.

Figure 6: IGF growth regulation 15

[pic]

IGF-2 (also known as somatomedin A) plays a role in cell metabolism, growth, and differentiation.13 IGF-2 signals mainly through the IGF-1 receptor, which has intrinsic tyrosine kinase activity. Receptor tyrosine kinases are transmembrane receptors with hormone-dependent enzymatic activity. The receptor binds an extracellular chemical signal, causing a conformational change in the receptor. The intracellular domain of the receptor becomes an activated tyrosine kinase that can phosphorylate itself (autophosphorylation) or other proteins.16, 17 This cascade of cellular events causes target genes to become activated. Cellular growth and differentiation are very tightly regulated, as you will see below.

Figure 7: Membrane receptor/ligand-dependent tyrosine kinase 16

[pic]

PSK (phytosulfokine)

Plants also have growth factors and receptors, but they are a little different from IGF-1 and the IGF-1 receptor. The phytosulfokine gene that we will be looking at encodes a peptide growth factor. PSK is a 5 amino acid peptide cleaved from an 87 amino acid precursor protein. The phytosulfokine receptor kinase is a receptor with serine/threonine protein kinase activity instead of tyrosine kinase activity. We know that the phytosulfokine receptor kinase binds phytosulfokine, but the roles of this hormone and receptor in plants are not yet completely known. A recent report indicates that phytosulfokine may determine the timing of senescence. Treatment of plant cells grown in culture with phytosulfokine results in cell differentiation, organogenesis, and embryogenesis.19 Remember that many plant cell types can be cultured so that entire plants can be regenerated. Phytosulfokine enhances many of these culture growth events. Phytosulfokine genes may also be involved in enhancing chlorophyll synthesis under high nighttime temperature conditions (Yamakawa et al., 1999).19

Figure 8: Processing of PSK 19

[pic]

PSK is cleaved from the C terminus of a preproprotein precursor. PSK becomes chemically active after sulfation occurs. The precursor passes through the secretory pathway, where the signal sequence is removed and where the sulfate groups are attached to tyrosine residues. This induces intracellular growth activity in the plant.20 The following diagram illustrates the cDNA nucleotide sequence of the AtPSK2 and AtPSK3 genes, with the significant features of each gene marked. Because it is harder to identify the gene of interest in a genomic sequence, in this exercise we will be working with cDNA.

Figure 9: Nucleotide and amino acid sequence of AtPSK2 (A) and AtPSK3 (B) gene 20

[pic]

In this nucleotide and amino acid sequence of AtPSK2 (A) and AtPSK3 (B) cDNA, the positions of introns are indicated with a down arrow, repeats are shown with a horizontal arrow, polyadenylation areas are boxed, N-termini are underlined, and the PSK sequence is underscored with double lines.

Figure 10: Amino Acid sequence indicating intron.

MANVSALLTIALLLCSTLMCTARPEPAISISITTAADPCNMEKK IEGKLDDMHMVDENCGADDEDCLMRRTLVAHTDYIYTQKKKHP

E = Glutamic acid (Glu), encoded by the codons GAA or GAG
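The position of the mature peptide within the precursor shown in Figure 10 can be checked with a short script. One assumption comes from outside this protocol: the five-residue PSK backbone is taken to be Tyr-Ile-Tyr-Thr-Gln (YIYTQ), as reported in the phytosulfokine literature. The precursor sequence is copied from Figure 10 with the internal space removed.

```python
# Precursor sequence from Figure 10 (the space in the figure is dropped).
PRECURSOR = (
    "MANVSALLTIALLLCSTLMCTARPEPAISISITTAADPCNMEKK"
    "IEGKLDDMHMVDENCGADDEDCLMRRTLVAHTDYIYTQKKKHP"
)

# Assumed backbone of the mature PSK peptide (from the literature, not this protocol).
PSK = "YIYTQ"

start = PRECURSOR.find(PSK)  # 0-based index, or -1 if the peptide is absent

print(len(PRECURSOR))               # 87
print(start + 1, start + len(PSK))  # 78 82
```

The peptide occupies residues 78 to 82 of the 87-residue precursor, which agrees with the statement that PSK is cleaved from near the C terminus.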

Our PSK gene comes from Arabidopsis thaliana, a member of the mustard (Brassicaceae) family. It has become a model organism for studying gene function in plants because it has a small, fully sequenced genome and a rapid growth cycle, and because there are good genetic and molecular tools for looking at gene function.

Figure 11: Arabidopsis thaliana plant 21

[pic]

Deletions in the phytosulfokine gene

We do not know the consequences of PSK loss / mutation in plants. Mutants in one of three biochemically identified receptor kinases show defects in the timing of senescence.

Database Procedures

IGF-2

Step 1

DNA sequence

Analyzing nucleotide sequence using PubMed literature database: The National Center for Biotechnology Information website holds databases for accessing and analyzing DNA sequences. NCBI was created by the United States Congress in 1988 as part of the Human Genome Project to develop information systems and help the biological research community.

1. Open the web site home page (), select NUCLEOTIDE from the menu next to the search button, enter the code NM_000612 for the IGF-2 gene, and press GO.

2. A single record shows up. Click on the NM_000612 link and review the information.

3. A long list of citations is followed by the protein and DNA sequence of this gene.

4. To obtain more information about the function of this gene you can click on one of the journal articles from PubMed.
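The same record can also be requested without the web form. The sketch below is an optional alternative that is not part of the original protocol: it assumes NCBI's public E-utilities `efetch` service and builds the request URL for a nucleotide accession. The helper names `efetch_url` and `fetch_record` are our own, and `fetch_record` needs a network connection, so it is defined but not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service (assumed, not from the protocol).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for one record in the nucleotide database."""
    params = {
        "db": "nuccore",       # NCBI nucleotide database
        "id": accession,       # e.g. NM_000612 or NM_127851
        "rettype": rettype,    # "fasta" for sequence only, "gb" for the full record
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def fetch_record(accession, rettype="fasta"):
    """Download the record as text (requires internet access)."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


print(efetch_url("NM_000612"))
```

Requesting `rettype="gb"` should return the same kind of flat-file record that the web page displays, including the citations and the sequence.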

Questions

After reviewing this data answer the following questions.

1. From the information given you should be able to determine how many base pairs are in this nucleotide sequence. How many amino acids are in the predicted protein? What would the rest of the parts of this sequence be called?

2. What was the source of this DNA sequence, i.e., what experiments were done before the DNA could be sequenced?

3. Which organism is this protein from?

4. Based on the title of the journal article you chose determine what function this gene may have.

5. What is the phenotype of mutations in IGF-2?

Copy the DNA sequence exactly as shown (numbers and spaces included), so it can be analyzed with other programs.
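The web tools used in this exercise accept the sequence with its position numbers and spaces. If you want to analyze the same sequence with your own scripts, it usually helps to strip them first. The sketch below is a minimal example; the two-line snippet is made up to imitate the layout of a sequence record and is not taken from NM_000612.

```python
import re


def clean_sequence(raw):
    """Remove position numbers, spaces, and line breaks; return uppercase letters."""
    return re.sub(r"[^A-Za-z]", "", raw).upper()


# Hypothetical snippet laid out like the sequence block of a nucleotide record.
raw = """
        1 atgggaatcc catggggaag
       21 tcgatgctgg
"""

print(clean_sequence(raw))  # ATGGGAATCCCATGGGGAAGTCGATGCTGG
```

The cleaned string contains only bases, so its length equals the number of bases that were copied.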

Step 2

ORFs

Find an open reading frame sequence that codes for the IGF-2 protein. Translation and open reading frames can be interpreted with the NCBI ORF Finder.

1. Paste your DNA sequence in the lower big box and click on ‘ORF find’.

2. Usually I would have you click on the largest open reading frame overall, but this gene encodes a fairly small ORF. So instead click on the largest open reading frame (blue box) in the plus 3 reading frame. The predicted protein will be below the 6 ORFs. The largest ORFs are to the right. The one you click on turns pink.

Remember!! The region between the start and stop codons determines the protein sequence.

Use the same DNA sequence as before, the one from the Nucleotide entry.

Questions

Find the largest ORF in reading frame 3.

1. How many base pairs are in the frame?

2. How many amino acids are in the frame?

3. Which nucleotides define this protein sequence reading frame?

Step 3

Domains

Next we will use the ExPaSy website to translate the cDNA sequence into a protein. Then, with PROSITE we can look at which family of proteins your sequence fits into. PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs.

1. Go to the ExPaSy website to analyze your protein.

2. Look in the box labeled ‘tools and software packages’ (upper right, second open circle) and find ‘DNA ->Protein’ . A new page will open, click on ‘translate’.

3. Paste your sequence in the translate box and click the ‘translate sequence’ button. The software will translate your nucleotide sequence into a protein sequence.

4. Six frames will show up (three 5’3’ and another three of 3’5’). Click on the 5’ 3’ frame 3.

5. Hint: the correct ORF begins with MGIP. Now follow the directions at the top of the page and make a virtual Swiss-Prot database file entry.

6. Now click on ‘Scan Prosite’ at the bottom of the page. You will see your virtual sequence in the box to be scanned. Go down to the bottom and start the scan. If you don’t get anything, use the back button on the browser and uncheck the box that says “Exclude motifs with a high probability of occurrence”. Rescan.
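PROSITE describes many of its motifs as patterns, which behave much like regular expressions. The sketch below converts a common subset of the PROSITE pattern notation into a Python regular expression: `x` for any residue, `[ ]` for allowed residues, `{ }` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` or `>` for the ends of the sequence. The pattern and the two test sequences are hypothetical and are not real PROSITE entries.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (common subset only) to a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")  # sequence ends
        element = element.replace("x", ".")                    # any residue
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        parts.append(element)
    return "".join(parts)


pattern = "C-x(2)-[ST]-{P}-Y"   # hypothetical pattern for illustration
regex = prosite_to_regex(pattern)
print(regex)  # C.{2}[ST][^P]Y

print(re.search(regex, "MKCAGSAYLL").group())  # CAGSAY
print(re.search(regex, "MKCAGSPYLL"))          # None
```

The second sequence fails only because a proline sits at the position where the pattern forbids it, which shows how a single residue can decide whether a motif is reported.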

Questions

1. Based on the ORF you chose how many amino acids are in the protein identified by PROSITE?

2. What is the best protein match identified by PROSITE?

3. What are some of the other protein matches found by PROSITE?

4. Based on this similarity, what is the function of this protein?

This is one of the places where one has to make judgement calls. Keep in mind this is a vertebrate gene so use some common sense in looking at the results. Where would IGF-2 be found in the cell? Use this to eliminate some other possibilities.

Step 4

BLAST

BLAST (basic local alignment search tool) will compare your DNA sequence with other sequences in the database. For this example, let’s see what the closest human gene is, so keep the database on the human genome. There are five different types of BLAST programs. For a protein database use the blastp and blastx programs and for a nucleotide database use the blastn, tblastn, and tblastx programs.

1. Go to the NCBI web site and click on ‘BLAST’.

2. In the ‘nucleotide’ box, click on ‘nucleotide blast’, and paste your sequence (from step 2, see above) in the box.

3. Next, use the “human genome plus transcript” for the box marked database and press ‘blast’.

4. For kicks, press on the genome viewer button. There you will see the region of the genome that best matches. (One of the ways in which a genome sequence can help with mapping genes to a location.)

5. This database shows a color key for alignment scores. The matches are color coded red being the closest match and black being the furthest from a match.

6. Scroll down the page to alignments and observe the values given. One of the values you see is gaps. Gaps in an alignment are caused by insertions and deletions, while substitutions show up as mismatched positions. Both kinds of difference affect the final sequence alignment and lower the overall alignment score.

7. While analyzing the results you will notice a set of data beside your sequence comparisons. The ‘E value’ (expectation value) is the number of different alignments with scores equivalent to or better than S (score of alignment) that are expected to occur in a database search by chance. The E value describes the random background noise that exists for matches between sequences. The lower the E value, the more significant the match.
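To get a feel for how the E value depends on the score and on the size of the search, the standard relation from BLAST statistics can be used: E = m × n × 2^(-S'), where S' is the bit score, m is the query length, and n is the total length of the database. This formula is not given in the protocol itself, and BLAST uses adjusted "effective" lengths, so the sketch below only shows the trend. The lengths in the example are illustrative.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score and the size of the search space."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustrative search space: a 1,000-base query against 3 billion bases.
for bit_score in (30, 50, 100):
    e_value = expect_value(bit_score, 1_000, 3_000_000_000)
    print(bit_score, f"{e_value:.2e}")
```

Every extra bit of score halves the E value, while doubling the database size doubles it. This is why the same alignment can look more or less significant depending on which database is searched.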

Questions

1. Write the best matches.

2. What is the E value for your matches?

3. What are the genes?

4. Are there any gaps in your chosen sequence?

5. Using the genomic sequences that you are comparing to cDNA sequences, can you think of a way to make a map of the gene based on the numbers from the contig that was sequenced? Draw a picture of what the gene looks like.

6. What is the level of difference in the nucleotide sequence and why might it be there?

Do this same search using the blastx program (go back to the blast page, in the ‘translated’ box click on ‘blastx’ and paste the genomic sequence in the box). This time use the non-redundant database.

1. Give the top 5 best matches.

2. What is the E value for your matches?

3. From what organisms are the ten most similar sequences? Why do you think that is?

4. How long is the best match and is it similar to your match?

PSK

Step 1

DNA sequence

Analyzing nucleotide sequence using PubMed literature database. The National Center for Biotechnology Information website holds databases for accessing and analyzing DNA sequences. NCBI was created by the United States Congress in 1988 as part of the Human Genome Project to develop information systems and help the biological research community.

1. Open the web site home page (), click the search button arrow on ‘NUCLEOTIDE’, enter the code NM_127851 for the phytosulfokine 2 (PSK 2) gene, and press GO.

2. A single record shows up; click on the NM_127851 link and review the information.

3. To obtain more information about the function of this gene you can go to the black search tool bar, click on ‘PubMed’ (first left) and type in your organism and the gene. PubMed will have a list of journal articles about your gene of interest.

4. Click on one of the journal articles from PubMed.

Questions

From the information given you should be able to determine the following:

1. How many base pairs are in this nucleotide sequence.

2. What was the source of this DNA sequence, i.e., what was done before the DNA could be sequenced?

3. Which organism is this protein from?

4. Based on the abstract of the journal article you chose determine what function this gene may have.

5. From your article what is the expected phenotype of a mutant in the phytosulfokine receptor?

Copy the DNA sequence exactly as shown (numbers and spaces included), so it can be analyzed with other programs.

Step 2

ORFs

Choose an open reading frame sequence that codes for the PSK protein. Translation and open reading frames can be interpreted with the NCBI ORF Finder.

1. Paste your DNA sequence in the lower big box and click on ‘ORF find’.

2. Click on the largest reading frame.

Copy the DNA sequence exactly as shown (numbers and spaces included), so it can be analyzed with other programs.

Remember!! The region between the start and stop codons determines the protein sequence.

Questions

Find the largest ORF.

1. How many base pairs are in the frame?

2. How many amino acids are in the frame?

3. Which nucleotides define this protein sequence (ORF) reading frame?

Step 3

Domains

Next we will use the ExPaSy website to translate the cDNA sequence into a protein. Then, with PROSITE we can look at which family of proteins your sequence fits into. PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs.

1. Go to the ExPaSy website to analyze your protein.

2. Look in the box labeled ‘tools and software packages’, find ‘DNA -> Protein’, and click on ‘translate’.

3. Paste your sequence in the translate box and click the ‘translate sequence’ button.

4. Six frames will show up (three 5’3’ and another three of 3’5’). Click on one of the frame titles and follow the instructions on the next page.

5. If you think this is the best open reading frame, copy it and go back to the ExPaSy home page. If you want to choose a different ORF, go back, re-choose, and then go back to the ExPaSy home page.

6. In the ‘Database’ box on the left click on PROSITE, paste your sequence in the box ‘Tools for PROSITE’, and press ‘quick scan’.

Questions

1. Based on the ORF you chose how many amino acids are in the protein identified by PROSITE?

2. How many amino acids are in the signal peptide region?

3. How many amino acids are in the PSK protein?

4. What is the best protein match identified by PROSITE? Why might this be hard to find? Go back and see how many of the amino acids in the PSK gene are in the active peptide.

Step 4

Sequence Comparison

BLAST (basic local alignment search tool) will compare your DNA sequence with other sequences in the database. There are five different types of BLAST programs. For a protein database use the blastp and blastx programs and for a nucleotide database use the blastn, tblastn, and tblastx programs.

1. Go to the NCBI web site and click on ‘BLAST’.

2. Click in the nucleotide box, click on ‘blastn’, and paste your sequence (from step 2, see above) in the box.

3. Next, use the ‘nucleotide collection’ for the box marked database and press ‘blast’ followed in the next page by ‘format’. This database shows a color key for alignment scores. The matches are color coded red being the closest match and black being the furthest from a match.

4. Scroll down the page to alignments and observe the values given. One of the values you see is gaps. Gaps in an alignment are caused by insertions and deletions, while substitutions show up as mismatched positions. Both kinds of difference affect the final sequence alignment and lower the overall alignment score.

5. While analyzing the results you will notice a set of data beside your sequence comparisons. The ‘E value’ (expectation value) is the number of different alignments with scores equivalent to or better than S (score of alignment) that are expected to occur in a database search by chance. The E value describes the random background noise that exists for matches between sequences. The lower the E value, the more significant the match.
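The effect of mismatches and gaps on an alignment score can be seen by scoring a small alignment by hand. The sketch below uses a simple scheme with one fixed penalty per gap position. The scoring values (+1, -1, -2) and the two aligned rows are made up for illustration and are not BLAST's actual defaults.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    result = {"score": 0, "matches": 0, "mismatches": 0, "gaps": 0}
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":      # insertion or deletion
            result["gaps"] += 1
            result["score"] += gap
        elif a == b:                  # identical bases
            result["matches"] += 1
            result["score"] += match
        else:                         # substitution
            result["mismatches"] += 1
            result["score"] += mismatch
    return result


# Hypothetical alignment: one substitution (G/A) and one gap.
print(score_alignment("ATGGC-TAA", "ATGACCTAA"))
# {'score': 4, 'matches': 7, 'mismatches': 1, 'gaps': 1}
```

A perfect nine-base match would score 9 under this scheme, so a single substitution and a single gap together cost five points. BLAST reports the same kinds of counts (identities and gaps) for each alignment.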

Questions

1. Give the top 5 best matches.

2. What is the E value for your matches?

3. What are the genes and from what tissue are they derived? Try the third entry first. You may have to click on a few links to figure this out. Don’t give up.

4. Which sequence is identical to the phytosulfokine gene?

5. Are there any gaps in your chosen sequence?

Do this same search using the blastx program (go back to the blast page, in the ‘translated’ box click on ‘blastx’ and follow the rest of instruction 2 and 3 above).

1. Give the top 5 best matches.

2. What is the E value for your matches?

3. What are the genes and from what organism are they derived?

4. How long is the best match and is it similar to your match?

5. Which program gives a more accurate gene comparison, blastn or blastx?

The following is a list of other databases used for DNA analysis.

Sources of sequence data

GenBank at the National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland (nucleotides and proteins)

Protein Information Resource (PIR) database at the National Biomedical Research Foundation, Washington D.C.

The Sequence Retrieval System (SRS) at the European Bioinformatics Institute

Sources of protein structure data

RCSB Protein Data Bank: rcsb.org

SWISS-PROT protein knowledge base

BioMagResBank:

MMDB:

Wake Forest University

Other applications for DNA analysis may include: DNA and protein sequence patterns and similarities, pair-wise comparisons of two sequences, alignment of multiple sequences, primer search programs, promoter and gene prediction, gene structure prediction, gene and protein domain function, tools used for biomedical research, and tools used for evolutionary analysis of molecular sequence data.

Glossary

Bioinformatics - The science of managing and analyzing biological data using advanced computing devices. [25]

BLAST (Basic Local Alignment Search Tool) - A sequence comparison algorithm, optimized for speed, that is used to search sequence databases for optimal local alignments to a query. [25]

E value (expectation value) - The number of different alignments with scores equivalent to or better than S (the score of the alignment) that are expected to occur in a database search by chance. The E value describes the random background noise that exists for matches between sequences. The lower the E value, the more significant the match. [25]

EST (expressed sequence tag) - A tiny portion of an entire gene that can be used to help identify unknown genes and to map their positions within a genome. [25]

Domain - A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function. [25]

Multiple Sequence Alignment - An alignment of three or more sequences with gaps inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column.

ORF (open reading frame) - A section of a sequenced piece of DNA that begins with an initiation codon (ATG, which encodes methionine) and ends with a nonsense (stop) codon. All ORFs have the potential to encode a polypeptide or protein. [25]

Query - The input sequence with which all of the entries in a database are to be compared. [25]

Protein Kinase - An enzyme that modifies other proteins by chemically adding phosphate groups to them (phosphorylation). This usually results in a functional change of the target protein (substrate) by changing its enzyme activity, cellular location, or association with other proteins.

Protein Motifs - Short amino acid sequences that often give the protein a specific function. The presence of specific motifs can give a researcher clues to the function of a protein, and may be valuable in designing experiments to test the function or activity of a protein. [29]

Taxonomy - The theory and practice of describing, naming, and classifying plants and animals.

Raw score - The score of an alignment, S, calculated as the sum of substitution and gap scores. [25]
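The ORF entry in the glossary above can be illustrated with a short search routine. This is a minimal sketch, assuming the standard stop codons TAA, TAG, and TGA and scanning only the three forward reading frames; the function name `find_orfs` and the `min_codons` cutoff are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find ORFs (ATG ... stop) in the three forward reading frames.

    Returns a list of (start, end, frame) tuples, where dna[start:end]
    is the ORF including its stop codon and frame is 1, 2, or 3.
    min_codons counts every codon in the ORF, stop codon included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None
    return orfs


if __name__ == "__main__":
    sequence = "CCATGGCCTAAGG"
    for start, end, frame in find_orfs(sequence):
        print(f"frame {frame}: {sequence[start:end]}")
```

A complete search would also scan the reverse complement of the sequence, since a gene can lie on either strand.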

References

1. Bioinformatics Database. National Center for Biotechnology Information. Available at: . Accessed July 2005.

2. Molecular Structure of DNA. Andrew Hughes. Available at: . Accessed July 2005.

3. IGF-1 Structural Image. CSIRO. Available at: csiro.au/images/general/IGF-1.jpg. Accessed July 2005.

4. Membrane Proteins. J. Walshaw, Birkbeck College. Available at: . Accessed July 2005.

5. Immune System and Antibodies. Young-Tae Chang, New York University. Available at: . Accessed July 2005.

6. Functions of Proteins. Robert J. Huskey, University of Virginia. Available at: . Accessed July 2005.

7. Protein. Wikipedia Encyclopedia. . Accessed July 2005.

8. IGF-1 gene deletion causing intrauterine growth retardation and severe short stature. Woods KA, Camacho-Hubner C, Bartner D, Clark AJ, Savage MO, University Department of Pediatrics, John Radcliffe Hospital, Oxford, UK. Available at: PMID: 9401537 [PubMed - indexed for MEDLINE]. Accessed July 2005.

9. Genetic Code. Encyclopedia. Available at: wiki/co/Codon. Accessed July 2005.

10. Blast Query Tutorial. NCBI. Available at: . Accessed July 2005.

11. Protein, Pattern, Motif, and Domain Databases. Sean Eddy, Cold Spring Harbor. Available at: . Accessed July 2005.

12. PROSITE. Swiss Institute of Bioinformatics, Centre Medical Universitaire. Available at: . Accessed July 2005.

13. Growth Factors. Michael W. King, IU School of Medicine. Available at: . Accessed July 2005.

14. Signal Transduction and General Cell Signaling. Gabriel Fenteany, The Virtual Library of Biochemistry and Cell Biology. Available at: . Accessed July 2005.

15. IGF-1 Pathway Image. Obesity Meds and Research News. Available at: . Accessed July 2005.

16. Signal Transduction at Cell Membranes: Protein Kinases/Phosphatases. Dr. Jakobowski, St. John’s University. Available at: . Accessed July 2005.

17. Insulin-like Growth Factor-1. My Vitanet. Available at: . Accessed July 2005.

18. Diversity of Arabidopsis Genes Encoding Precursors for Phytosulfokine, a Peptide Growth Factor. Heping Yang, Yoshikatsu Matsubayashi, Kenzo Nakamura, and Youji Sakagami, Graduate School of Bio-Agricultural Sciences, Nagoya University, Japan. Available at: . Accessed July 2005.

19. Proteolytic Processing of Phytosulfokine (PSK). Clarence A. Ryan, Gregory Pierce, Justin Scheer, and Daniel S. Moura, Institute of Biological Chemistry, Washington State University. Available at: . Accessed July 2005.

20. Nucleotide and Amino Acid Sequences of AtPSK2 and AtPSK3 genes Figure 1. Heping Yang, Yoshikatsu Matsubayashi, Kenzo Nakamura, and Youji Sakagami, Graduate School of Bio-Agricultural Sciences, Nagoya University, Japan. Available at: . Accessed July 2005.

21. Arabidopsis thaliana. Bennett Laboratory, University of Nottingham, Loughborough, UK. Available at: nottingham.ac.uk/bennett-lab/Arabidopsis. Accessed July 2005.

22. Insulin-like Growth Factor. Art to Science, Vol. 19, No., 2000. Available at: . Accessed July 2005.

23. Bioinformatics. Dr. Jacquelyn Fetrow, Dr. Jennifer Burg, Dr. Tim Miller, Wake Forest University. Available at: . Accessed July 2005.

24. Phytosulfokine: a key factor regulating cellular de-differentiation and re-differentiation in plants. Ligand receptor pairs in plant peptide signaling. Yoshikatsu Matsubayashi, Graduate School of Bio-Agricultural Sciences, Nagoya University, Japan. Available at: . Accessed July 2005.

25. NCBI Glossary. National Center for Biotechnology Information. Available at: . Accessed July 2005.

26. Molecular Biology of the Cell. Alberts, Johnson, Lewis, Raff, Roberts, and Walter. 4th Edition, Copyright 2002. Pages 140-143.

27. Protein Structure Image. E.D. Hirsch, Jr., Joseph F. Kett, and James Trefil, Copyright 2002 by Houghton Mifflin Company. Available at: topic/protein-structure. Accessed July 2005.

28. Start Codons GUG and AUU. Wikipedia. Available at: . Accessed July 2005.

29. Protein Motifs. © 2005 The Board of Regents of the University of Wisconsin System. Available at: . Accessed July 2005.
