Nifty Bioinformatics Assignments



Nifty Bioinformatics Assignments

By Dean Zeller

Kent State University

Summer, 2007

dzeller@cs.kent.edu



Table of Contents

Assignment 1 – DNA Modeling 1

Assignment 2 – Evolution Models I (Binary Evolution Trees) 3

Assignment 3 – Evolution Models II (Incremental K-Leaf Root) 6

Assignment 4 – DNA Pattern Matching Statistics 8

Assignment 5 – DNA Visualization 13

Assignment 6 – Longest Common Subsequence 17

Introduction

The following is a collection of assignments relating to bioinformatics. All assignments have been written in the past two years and were tested in an actual classroom environment. Bioinformatics itself is an involved and complex field, combining concepts of biology, mathematics, and computer science. However, the introductory topics of bioinformatics are not difficult, and need not require a college level mentality. The prerequisite knowledge for these assignments are purposely low, and are intended to “spark” potential interest within students.

Copyright notice

These assignments and accompanying materials are copyright ©2007 Dean Zeller are all available at the author’s web site. Permission of these assignments is granted for the copy and use for personal or educational purposes. Please contact the author with questions, comments, or feedback.

Assignment 1 – DNA Modeling

Dean Zeller Due: _________________

C&I 47330 10 points

Fall, 2006

Objective The student will create structurally accurate models of DNA out of pipe cleaners.

Materials pipe cleaners, scissors, 3x5 card (optional), hole punch (optional)

Level 0 – no experience required

Topics

This assignment introduces students to concepts of modeling using DNA as the modeled system. It is appropriate for any age over 5, is challenging, and contains of breadth of useful skills in a wide variety of areas. This assignment is suitable for any mathematics, science, or art class. No prior experience or knowledge is required to complete the assignment.

Background: Modeling

Chemists, mathematicians, and many other scientific professions use models as visual representations of structures. One important aspect of modeling is structural accuracy, i.e. the extent to which it represents key elements of the system it is modeling. In this assignment, the main element to represent about DNA is the bonding of AT and CG nucleotide pairs. As you increase in skills, there are other elements that can be added to increase accuracy.

When crafting a model, structural integrity (or stability) is important for the model to keep its shape. It cannot fall apart while being analyzed or modified. There are many materials used in modeling of this nature. For this assignment, the modeling material is pipe cleaners because of their ability to bend and keep their shape. The instability of a pipe cleaner model is in how the pieces are put together. Pieces can be connected through a variety of twisting, wrapping, and coiling. Doing this correctly takes skill and practice.

Instructions

Follow the steps below to create a single DNA strand of 10 nucleotides. Create additional strands as time permits. See the poster for picture diagrams of the instructions.

Step 0: Setup

A. Select colors for DNA stalks and nucleotides (A, T, C, G).

B. Cut the nucleotide pipe cleaners in fourths.

C. Put the nucleotides into AT and CG pairs.

Step 1: Create nucleotides

Use these steps to create ten (10) nucleotide pairs.

A. Select two pipe cleaner pieces of the appropriate colors to form a nucleotide pair (AT or CG).

B. Put in a V-shape with a small overlap (1/4”). Choose a long end and the corresponding short end on the same side of the V.

C. Coil the long end three or four revolutions around the other short end, covering it completely. There should be some extra left over to attach to a stalk.

D. Bend the remaining long and short ends 90° to form a T-shape.

E. Repeat step 1C with the remaining long and short ends.

Step 2: Connect Nucleotides to Stalks

Attach nucleotides to stalks to form a “ladder”

A. Attach one side of nucleotide to stalk by wrapping extra length around stalk twice (once on the left, and once on the right). Wrap any remaining length around itself.

B. Repeat for other nucleotides, spreading evenly across the stalk.

C. Attach opposite stalk in similar way. Leave enough room at ends to connect to other DNA strands.

Step 3: …and do the Twist

In “ladder” form, the DNA strand may look inconsistent in size. When twisted into a double-helix form, the inconsistencies tend to work themselves out.

Further Activities

• Create a model of an actual DNA sequence.

• Longer strands: strands of length 100 can be completed in about two hours, once the procedure is learned and practiced.

• In groups, use an “assembly line” approach to create a very long DNA strand.



• Class contests: longest DNA strand, strongest DNA strand





Grading

You will be graded on the following criteria:

Accuracy Correct representation of AT/CG pairs in the model

Stability Structural integrity of the model

Effort Quantity, length, and quality of models created

Extra credit will be given for the following:

• Use a 3x5 card to label the model with the appropriate model nucleotide characters (both sides of the DNA strand).

• Connect models together to create longer models

Assignment 2 – Evolution Models I (Binary Evolution Trees)

Dean Zeller Due: _________________

CS10051 10 points

Spring, 2006

Objective The student will use a graphics package to create diagrams of binary evolution trees.

Level 1 – use of a graphics package and/or word processor required

Readings Read, R.C., and R.J. Wilson, An Atlas of Graphs, Oxford Science Publications, 1999.

Background: Evolution Trees

This assignment deals with the cutting-edge topic of bioinformatics. It is a complex field of graph theory with applications to mathematics, computer science, and genetics. An evolutionary tree (or phylogeny) is a tree-structure that demonstrates evolution of species over time. A tree consists of vertices (nodes) connected by edges (links). In evolution, a node represents a point in which a species population “splits” into two genetically different species. Once a species splits, the two species created are genetically unable to produce offspring. The leaves of the tree represent the extant (non-extinct) species.

Background: Isomorphism

Two trees are isomorphic if they contain the same structure and are different only through symmetry of a node. In order to simplify the problem, isomorphic trees are considered the same for purposes of this assignment.

[pic][pic][pic][pic]

[pic] [pic]

Background: Classification labels

In order to easily name trees, they are given a text description called a classification label. This allows a text label instead of a visual picture to represent a tree structure. A good classification system has a unique name for each tree. For this assignment, the classification system is simply a listing of the number of nodes at each level. Trees A, B, C, and D above have the label (1242), indicating the first level has one node, the second has two nodes, the third has four nodes, and the fourth has two. Since the four trees are isomorphic, the same label represents all four trees. Ultimately, this classification system is incomplete. While simple to understand, it does not provide unique names for each tree at the higher levels. Trees E and F are not isomorphic, and thus are given separate labels (1244a and 1244b). This system will suffice for now, but can get confusing as the size of the trees increase.

Background: Assumptions

Introductory study of phylogenies makes another assumption that greatly simplifies the problem. For purposes of this assignment, the only non-leaf structure possible are nodes with exactly two offspring, representing a point in time in which a population “splits” into two populations. This is called a binary evolution tree. A node with a single offspring is called a redundant node and does not significantly add to the tree structure. Nodes with three or more children can be isomorphically approximated, and thus can be ignored at this stage with only a minimal loss of information. While all nodes in evolution trees are unique species, at this point only the leaf nodes need to be considered.

Task 1 – Draw Given Trees (5 points)

Given below are thirteen evolution trees. This represents the complete set of all non-isomorphic evolution trees of up to six leaves (11 nodes). Use a graphics package to recreate these trees. Give the classification for the tree and label its leaves with successive letters (a, b, c, etc…) Your design style may differ from the trees below, but the structure must be correct.

[pic][pic][pic][pic][pic][pic][pic][pic][pic][pic][pic][pic][pic]

Task 2 – Generate Trees (5 points)

Use a graphics package to create the eleven isomorphically unique evolutionary trees of 7 leaves (13 nodes). The following labels are the classifications for the unique trees: 1222222, 122224, 122242, 122422, 12244a, 12244b, 124222, 12424, 12442a, 12442b, and 1246.

Grading:

You will be graded on the following criteria:

Accuracy Correctly drawing the evolutionary trees, attention to detail.

Creativity Style, appearance, and consistency of the trees

Extra credit will be given for including the following:

• Create the sixteen 8-leaf trees: 12222222, 1222224, 1222242, 1222422, 1224222, 1242222, 124224, 124242, 124422a, 124422b, 12444aa, 12444ab, 12444ba, 12444bb, 12462a, 12462b, 1248.

• Create the 9 leaf (17 node) non-isomorphic trees and classifications.

• Create the trees without the redundant node and isomorphic approximation assumptions, allowing for any number of offspring from a node. This will exponentially increase the number of possible trees.

Graphs Diagram Drawing Guidelines

Creating well-drawn graphs is an art form in itself. A diagram is a visual representation, and care should be taken to make sure diagrams are organized, readable, and easily understood. When using a graphics package to draw graphs, it is imperative that students use a consistent and visually pleasing style. Sloppy graph diagrams stick out like a sore thumb and can completely ruin a professional presentation. Follow the below guidelines for drawing your graphs. See the figures for examples of following and violating these guidelines.

Vertices All vertex markers should be the same size and shape. If necessary, different markers can represent types of vertices.

Labels It should be clear which vertex the label references. Use a reasonable location scheme for the labels. Vertex labels should not cover any vertex, edge, or other label. A logical labeling sequence may help in understanding the graph structure.

Edges All edges should be straight lines of the same thickness and connect to the center of the vertex markers. Edge highlighting can be done with thicker or grayed lines. Planar graphs should be drawn without any crossing edges. Curved edges are generally avoided, but may be used for artistic effect.

Graph The graph shape should be pleasing to the eye. Use a regular polygon or other meaningful shape whenever possible. Physical distances between vertices should be relatively consistent. Use the alignment and distribution tool within the graphics package.

[pic][pic][pic][pic]

Assignment 3 – Evolution Models II (Incremental k-Leaf Root)

Dean Zeller Due: Wednesday, March 22nd by 9:00 pm

CS10051 10 points

Spring, 2006

Objective The student will build on assignment 2 to create diagrams of the incremental k-leaf root evolution model.

Level 1 – use of a graphics package and/or word processor required

Readings Read, R.C., and R.J. Wilson, An Atlas of Graphs, Oxford Science Publications, 1999.

Background: k-leaf root, cliques

As an alternate to phylogenies, evolution can be modeled by a series of incremental graphs called k-leaf roots that contain visual information indicating the distances between species. The tree leaves serve as the graph vertices. Edges within the graph indicate the distance between the two species is no more than a specified k-value in a corresponding tree. A clique is defined as a subset of vertices such that all members are connected to all other members within the clique. An incremental k-leaf root model is essentially a series of (possibly overlapping) cliques. In the diagrams below, the corresponding cliques are labeled below the graphs.

Instructions

1. Use a word processor to create a table similar to the example below.

a. Set your document Page Setup to Landscape to give more space widthwise.

b. The first cell in each row should contain the phylogenies from assignment 2.

c. The column header should contain increasing k-values from two (2) up to the number of leaves in the tree.

2. Within each row…

a. Draw the k-leaf root for increasing k-values. Start at k=2 and end when the graph is fully connected. Draw a left arrow (() for graphs that show no change from the previous k-value.

b. Indicate the cliques created within each distance graph.

3. Create a phylogeny with at least 10 leaves and the corresponding incremental k-leaf root evolution model.

4. Review the graphs drawn so far. Indicate which distance graphs you think contain the most useful information for geneticists.

Grading

You will be graded on the following criteria:

Accuracy Correctly drawing the evolutionary trees.

Creativity Style, appearance, and consistency of the graphs

Extra credit will be given for any of the following:

• Create more complex phylogenies for the report.

|2 leaves (3 vertices) |

|phylogeny |k = 2 |k = 3 |k = 4 |k = 5 |

|[pic] |[pic] |[pic] |[pic] |[pic] |

|3 leaves (5 vertices) |

|phylogeny |k = 2 |k = 3 |k = 4 |k = 5 |

|[pic] |[pic] |[pic] |[pic] |[pic] |

|4 leaves (7 vertices) |

|phylogeny |k = 2 |k = 3 |k = 4 |k = 5 |

|[pic] |[pic] |[pic] |[pic] |[pic] |

|[pic] |[pic] |[pic] |[pic] |[pic] |

|5 leaves (9 vertices) |

|phylogeny |k = 2 |k = 3 |k = 4 |k = 5 |

|[pic] |[pic] |[pic] |[pic] |[pic] |

|[pic] |[pic] |[pic] |[pic] |[pic] |

Assignment 4 – DNA Pattern Matching Statistics

Dean Zeller Due: _________________

CS10061 10 points

Spring, 2007

Objective The student will create a Python program to calculate pattern matching statistics on DNA sequences.

Level 2 – understanding and use of the Python programming language

Background: Bioinformatics

DNA can be represented digitally by long strings of A, C, G, and T. In this assignment, you will develop a program to perform statistics on a given DNA sequence.

Definitions

Sequence A long string of characters.

DNA Sequence A long string of A’s, C’s, G’s, and T’s representing a DNA strand.

Pattern A short string of characters to be searched within a sequence

Concepts Introduced

Strings as character arrays

# can access each character in a string individually

name = "Dean Zeller"

print name[0], name[1], name[2], name[3]

String length

# len(string) returns the length of a string in characters

print len(name)

for i in range(len(name)):

print name[i],

File input

# input from file instead of from the user

inputfile = open("input.txt", "r")

contents = inputfile.read() # contents contains entire file

inputfile.close()

File output

# output to file instead of to the IDLE window

outputfile = open("output.txt", "w")

outputfile.write("This is my output. 4-3-2-1-Wheee! ")



# when finished writing, close the file

outputfile.close()

While loops

# similar to if, but continues looping until condition is false

response = raw_input("Would you like to do something? ")

while response.upper()=="Y":

print "Okay, do something! "

response = raw_input("Would you like to do it again? ")

Current Programs

In DNAstats.py, user input consists of a filename containing a sequence of DNA. The output is a report of statistics for short patterns (i.e. number of A’s, number of C’s, etc.). It then allows the user to search for patterns of any length within the given sequence. RandomDNA.py is a program to generate synthetic DNA randomly for testing purposes. Read, understand, and test these programs before starting on the tasks.

Tasks

DNAstats.py contains the framework for a program to statistically analyze DNA sequences. Implement at least three of the tasks described below. More can be completed for extra credit.

1. Implement all two-character patterns (16) in the statistics report.

2. Implement all three-character patterns (64) in the statistics report.

3. Put the entire main program into a while loop to allow the user to run statistics on multiple files.

4. Write the statistics report and patterns searched to an output file. Prompt the user for the name of the output file.

5. Use the program to find long patterns that occur more than once, short patterns that do not occur, and relationships between the patterns.

6. Find actual DNA sequences on the web and run them through the program. Research some meaningful patterns and look for them in some DNA sequences.

7. Use the program to create the data for a table similar to table 1. You may use Microsoft Excel to enter the values and calculate the percentages.

8. Repeat all analyses on the reverse of the string. To do this, create a new string that is the characters of the old string in reverse order and send through the analysis methods.

9. Create a loop to cycle through all possible patterns. This is not an easy task, but if someone is up to a challenge, see me for help.

10. Modify the randomDNA.py program to create random or weighted sequences. Compare your analysis with actual DNA files.

11. Own idea: write up a task idea of your own and implement within the program. Get your idea approved first by your instructor.

Documentation

Your instructor originally wrote this code, as noted in the documentation. Create documentation blocks for any new functions you create, and list yourself as the author.

Turning in your assignment

1. Print a copy of your DNAstats program. If you made significant modifications to the randomDNA program, print that as well.

2. Print a test your program with the given input files and others that you create.

3. Write a report briefly describing the tasks completed.

4. Informally demonstrate the tasks completed to the class

Grading

You will be graded on the following criteria:

Quantity Variety of tasks correctly implemented

Readability Documentation indicating the lines of code created and modified

Testing Testing your program on all input files

Creativity Use new methods to solve the problem

Extra Credit

Extra points will be given for including the following features:

Extra quantity Implement more than three tasks

Report/Analysis Include a formal report and spreadsheet of your analysis

|Table 1: DNA Statistics |

|  |A |C |G |T |  |

  |Pat |# |% |Pat |# |% |Pat |# |% |Pat |# |% |total |avg # |avg % | |A |AA |21756 |35% |AC |10882 |26% |AG |9815 |28% |AT |20100 |31% |62553 |15638 |30% | |C |CA |13824 |22% |CC |9404 |22% |CG |5999 |17% |CT |13304 |20% |42531 |10633 |20% | |G |GA |9802 |16% |GC |9117 |21% |GG |6102 |18% |GT |9423 |14% |34444 |8611 |17% | |T |TA |17170 |27% |TC |13128 |31% |TG |12528 |36% |TT |22839 |35% |65665 |16416 |32% | |total |  |62552 |  |  |42531 |  |  |34444 |  |  |65666 |  |  |  |  | |avg # |  |15638 |  |  |10633 |  |  |8611 |  |  |16417 |  |  | |  | |avg % |  |  |25% |  |  |25% |  |  |25% |  |  |25% |  |  |  | |

DNAstats.py

########################################################################

# #

# DNA Statistics 1.00 #

# Written by: Dean Zeller #

# #

# This program was written for demonstration purposes for CS10061 #

# (Introduction to Programming) at Kent State University. #

# (C) 2007 Dean Zeller #

# #

# This program performs pattern matching statistics on DNA #

# sequences. #

# #

########################################################################

########################################################################

# #

# dnaTools #

# This object contains various methods to perform pattern-mathing #

# statistics on a given DNA sequence. #

# #

########################################################################

class dnaTools(object):

####################################################################

# printTitle #

# Print the title centered in a space width characters wide #

####################################################################

def printTitle(self, title, width):

print "+" + "-"*(width-2) + "+"

print "|" + title.center(width-2) + "|"

print "+" + "-"*(width-2) + "+"

####################################################################

# removeErrors #

# Return DNA sequence with non-ACTG characters removed. #

####################################################################

def removeErrors(self,seq):

newSeq = ""

for i in range(len(seq)):

if seq[i]=="A" or seq[i]=="C" or seq[i]=="G" or seq[i]=="T":

newSeq = newSeq + seq[i]

return newSeq

####################################################################

# stats #

# Perform basic statistics on a given DNA sequence #

####################################################################

def stats(self, seq):

numA=0

numC=0

numG=0

numT=0

numX=0 # number of errors

numAT=0

numCG=0

numCAT=0

numACT=0

for i in range(0,len(seq)): # 1-character patterns

if seq[i]=='A':

numA = numA+1

elif seq[i]=='C':

numC = numC+1

elif seq[i]=='G':

numG = numG+1

elif seq[i]=='T':

numT = numT+1

else:

numX = numX+1

for i in range(1,len(seq)): # 2-character patterns

if seq[i-1]=='A' and seq[i]=='T':

numAT = numAT+1

elif seq[i-1]=='C' and seq[i]=='G':

numCG = numCG+1

for i in range(2,len(seq)): # 3-character patterns

if seq[i-2]=='C' and seq[i-1]=='A' and seq[i]=='T':

numCAT = numCAT+1

elif seq[i-2]=='A' and seq[i-1]=='C' and seq[i]=='T':

numACT = numACT+1

w=5

self.printTitle("Statistics Report",80)

if numX > 0:

print "WARNING -- There were",numX,"errors in the sequence."

print "A:".rjust(w), str(numA).rjust(w), ' '.rjust(w),

print "C:".rjust(w), str(numC).rjust(w), ' '.rjust(w),

print "G:".rjust(w), str(numG).rjust(w), ' '.rjust(w),

print "T:".rjust(w), str(numT).rjust(w)

print "AT:".rjust(w), str(numAT).rjust(w), ' '.rjust(w),

print "CG:".rjust(w), str(numCG).rjust(w), ' '.rjust(w)

print "CAT:".rjust(w), str(numCAT).rjust(w), ' '.rjust(w),

print "ACT:".rjust(w), str(numACT).rjust(w), ' '.rjust(w)

print

####################################################################

# matchSimplePattern #

# Count the number of occurrances of pat (pattern) within #

# seq (sequence) using a simple string-matching algorithm. #

####################################################################

def matchSimplePattern(self, pat, seq):

numPat=0

for i in range(len(seq)):

if seq[i:i+len(pat)]==pat:

numPat += 1

print numPat

########################################################################

# main program #

# Demonstrate the methods defined in the dnaTools object. #

########################################################################

# get input file

inputFileName = raw_input("Enter DNA Sequence filename: ")

dnaFile = open(inputFileName,"r")

dnaSequence = dnaFile.read()

dnaFile.close()

# set up tools

D = dnaTools()

# run stats on original sequence

D.stats(dnaSequence)

# run stats on fixed sequence

dnaFixed = D.removeErrors(dnaSequence)

D.stats(dnaFixed)

# allow user to search for patterns

while True:

pattern = raw_input("Enter search pattern (blank to exit): ")

if pattern=="":

break

D.matchSimplePattern(pattern.upper(),dnaFixed)

print "Thanks, and have a nice day."

randomDNA.py

import random

size = 10000

# Equal probability

sequence = ""

for i in range(size):

r = random.randrange(1,5)

if r==1:

sequence += "A"

elif r==2:

sequence += "C"

elif r==3:

sequence += "G"

else:

sequence += "T"

print

print size,"character random DNA sequence:"

print sequence

outputfile = open("random.txt","w")

outputfile.write(sequence)

outputfile.close()

# Weighted probability

sequence=""

weightA = 10

weightC = 15

weightG = 5

weightT = 13

total = weightA + weightC + weightG + weightT

for i in range(10000):

r = random.randrange(0,total)

if r < weightA:

sequence += "A"

elif r < weightA + weightC:

sequence += "C"

elif r < weightA + weightC + weightG:

sequence += "G"

else:

sequence += "T"

print

print size,"character weighted DNA sequence:"

print "weightA =",weightA,weightA*1.0/total

print "weightC =",weightC,weightC*1.0/total

print "weightG =",weightG,weightG*1.0/total

print "weightT =",weightT,weightT*1.0/total

print sequence

outputfile = open("weighted.txt","w")

outputfile.write(sequence)

outputfile.close()

Assignment 5 – DNA Visualization

Dean Zeller Due: _________________

CS10061 10 points

Spring, 2007

Objective The student will write a Python program to implement the chaos-game representation of DNA.

Level 2 – understanding and use of Python with the Tkinter graphics library

Readings H. Joel Jeffrey (1990). “Chaos game representation of gene structure.” Nucleic Acids Research, Vol 18, No. 8, pp 2163-2170.

Background: Chaos Game Representation (CGR)

It is difficult for humans to look at a bunch of numbers and find meaningful patterns. It is far more effective to provide a visual represention the statistics. Bar charts, pie charts, and line graphs are used to represent data in a visual manner. The chaos game representation is just one of many DNA visualization algorithms. Analyzing the data in this format can show some characteristics about the DNA structure. DNA sequences can be found at the National Center for Biotechnology Information website at . Interesting fractal patterns can be created from contrived examples of DNA. For example, a random sequence will fill the square uniformly, but a random sequence without any A’s will create the famous Sierpinski triangle fractal.

Input

The user must provide parameters for the drawing of the visualization. Collect these values from the user before execution. You may create more parameters as needed.

Filename: the file containing the DNA sequence

Interval size: how often to pause execution

Dot size: the radius of the dot created at each point

Output

Follow the CGR algorithm to draw the appropriate dots for a given DNA sequence. Use the Interval Size variable to pause the drawing process for the user.

Testing

Test your program on a wide range of actual DNA sequences, random sequences, and repeated patterns. Compare the visualization to the report generated, and make note of any patterns found.

Tasks

Implement at least three of the following tasks using your chaos automata visualization:

1. Implement the chaos-automata algorithm using Python.

2. Create visualizations for actual DNA sequences for organisms.

3. Use the randomDNA.py program to generate DNA files with specific characteristics to create different artistic patterns.

4. Create guidelines similar to the diagrams above to indicate the different quadrants.

5. Allow the user to specify the characters represented for each corner.

6. Change the color of the dot every so often. Or allow the user to change the color at every interval.

7. Own idea: write up a task idea of your own and implement within the program. Get your idea approved first by your instructor.

Documentation

This assignment builds on the previous assignment. Create documentation blocks for any new functions you create, and list yourself as the author.

Turning in your assignment

1. Print a copy of your CGR program. If you made significant modifications to the randomDNA program, print that as well.

2. Print at least three interesting patterns generated by your program.

3. Write a report briefly describing the tasks completed.

Grading

You will be graded on the following criteria:

Quantity Variety of tasks correctly implemented

Readability Documentation indicating the lines of code created and modified

Testing Testing your program on a variety of input files

Creativity Use new methods to solve the problem

Extra credit will be given for including the following features:

Extra quantity Implement more than three tasks

Report/Analysis Include a formal report and spreadsheet of your analysis

CGR algorithm

This assignment combines concepts from the previous assignment on DNA sequencing and material from the graphics assignments. This assignment will implement the CGR algorithm to visualize the input DNA. The algorithm is actually quite simple, once the graphics procedures are understood.

Step 0: Create the DNA Square.

Based on parameters from the user, create a square within the canvas where all points will be drawn. You will need to know the left, top, bottom, and right positions within the square.

Step 1: Initial Point

Create two variables, x and y, denoting the point to draw. The initial point should be the center of the square. Note: you do not draw the initial point; it just serves as a starting point.

x = (left + right)/2

y = (left + right)/2

Step 2: Draw dots

For each character in the DNA sequence

a) Calculate the next point to draw, which is halfway between the current point and the appropriate corner for the letter. (A: top-left, C: top-right, G: bottom-right, T: bottom-left)

b) Draw the current point

Pseudocode

The following is pseudocode for the CGR algorithm. Your job is to implement the algorithm using Python.

x = (left + right)/2

y = (top + bottom)/2

for i each character in seq

if seq[i]==’A’

x = (x+left)/2

y = (y+top)/2

else if seq[i]==’C’

x = (x+right)/2

y = (y+top)/2

else if seq[i]==’G’

x = (x+right)/2

y = (y+bottom)/2

else if seq[i]==’T’

x = (x+left)/2

y = (y+bottom)/2

drawDot(x,y)

Example: Sequence: “ATAGCCTGTGA”

[pic][pic][pic][pic][pic][pic][pic][pic][pic][pic][pic][pic]

Analysis

Consider the diagrams below. The first contains additional guidelines for different size patterns. The final diagram is the chaos-automata result for the sequence ATAGCCTGTGA. Note the correspondence between the final pattern.

[pic][pic][pic][pic][pic][pic]

Assignment 6 – Longest Common Subsequence

Dean Zeller Due: _________________

CS33001 10 points

Fall, 2006

Objective The student will write a C++ program to implement and test the dynamic programming approach to the longest common subsequence problem.

Level 2 – understanding and use of the C++ programming language

Topics covered

This assignment is a review of your skills as a programmer using the string and two-dimensional array data structures.

Background: Bioinformatics

DNA can be represented digitally by long strings of A, C, G, and T. One of the most simplistic tests to determine genetic similarity is the longest subsequence of nucleotides common to both species. Determining this similarity value is of great use to bioinformatic scientists to determine position on an evolutionary tree.

It is a common problem in computer science to determine the longest common substring of two strings, particularly useful in web search engines. In bioinformatics, the largest subsequence of two strings is far more useful (and difficult) to determine. It is simple to write a brute-force program to solve the problem. However, DNA strings can be millions of characters long, and thus a brute-force method could take hours (or even years) to give results for real-world data. This assignment implements the dynamic programming method of efficiently solving the longest common subsequence problem in O(n2) time.

Program Requirements

Input Prompt the user to enter two strings, representing the DNA sequence of two species. Alternately, the strings can be read in from a file.

Error checking Before executing the algorithm, check each character in both strings to ensure only acceptable DNA letters are entered. Do not run the algorithm on input with erroneous data. In the event of an error, indicate how many characters are illegal, and recreate the string with the illegal letters highlighted (example: AGCTSACT ( agctSact).

Output The program should output the dynamic subsequence table and the length of the longest subsequence. The actual subsequence string can be printed for extra credit.

Program Architecture

Functions Use functions to modularize your code. The following is a suggestion on how to organize your functions.

functions.h header file for functions

functions.cpp implementation of the functions

main.cpp main program implementation that calls the functions

Documentation Each function must include a documentation block describing the input, output, and method of solving the problem. Create documentation for the main program describing the program overall. Use proper indentation and self-descriptive variable names.

Grading

You will be graded on the following criteria:

Accuracy Correctly implementing and using the LCS algorithm

Readability Documentation, descriptive variable names, and indenting

Testing Ensuring the code works for a variety of inputs

Creativity Using new methods to solve the problem

Extra credit will be given for including the following features:

File input Allow the user to enter a filename with the two input strings

Subsequence Print the longest common subsequence in addition to the length

Brute-force In addition to the dynamic programming method described in this assignment, also implement a brute-force solution to the problem. Create a short report that describes the program duration for a variety of input lengths.

Turning in your assignment

1. Print all C++ and header files.

2. Print a test of your program with input sets given below.

Algorithm Pseudocode

Given below is the pseudocode to solve the problem. You are to implement the pseudocode in object-oriented C++ functions.

Step 1: initialize variables

string s = ‘ ‘ + string1

string t = ‘ ‘ + string2

Twidth = s.length

Theight = t.length

int T[Twidth][Theight]

for (i=0; i ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download