Intro-bio 102 Lab #5 Perl Programming for Pattern Matching




There are 9 exercises in total:

USE PERL PROGRAM regex.pl

1. Scrabble in English

2. “Scrabble” in DNA Sequences

3. Direct Repeats

4. Direct Repeats in DNA Sequences

5. Mirror Repeats also called palindromes in English (English Word Play)

6. Mirror Repeats in DNA Sequences

USE PERL PROGRAM book_search.pl

7. Pattern Matching in Large Texts.

USE PERL PROGRAM omic_search.pl

8. Pattern Matching for Protein Sequences

9. Pattern Matching for Protein Sequences in DNA

There is more if you finish early and are still enjoying yourself!

Note: If you run a Perl program and it runs on and on ... and on and on, you’ll need to HALT the program.

Hold down the keys: Option-Apple-Escape, select the TextWrangler application, and click "Force Quit".

In your lab notebook, keep a record of the regexes that you try. Keep track of whether or not they worked the way you expected them to. Write down your errors as diligently as your successes; the errors are what you will learn from. Whenever you see gray highlighted text, you should be writing in your lab notebook. At the end of lab, you will trade notebooks to learn from each other’s efforts.

__ Log on to your computer using your own name and password.

__ Connect to the courses server and open the week1_Perl_lab folder in Behavior-Renn.

__ Drag the folder regexPlay onto your desktop.

__ Open this folder and you should see the following files:

|anagram.pl |Perl program to use regex to find anagrams |

|book_search.pl |Perl program to use regex to return full paragraphs |

|ecoli_K12_genes.fasta |DNA coding sequence for predicted E.coli proteins in FASTA format |

|ecoli_K12_proteins.fasta |amino acid sequence for predicted E.coli proteins. |

|ecoli.txt |7-mers found in E.coli genome |

|english.txt |English dictionary (list of words one per line) |

|omic_search.pl |Perl program to use regex to return genes with specified pattern |

|origin.txt |Full text of Darwin’s “Origin of species” |

|regex.pl |Perl program to use regex to find patterns |

Perl is already installed on the intro-bio lab computers for you.

__ Open the application TextWrangler™ from the Dock (Gold “W”).

__ From the FILE menu in TextWrangler, select Open (or use ⌘O) and open the Perl program regex.pl from the folder on your desktop. The Perl program will appear in your programming environment. It will be color coded according to Perl syntax. The text that you will work with is the color-coded pink text under the words “ENTER HERE”. Notice that the “#” sign is used to frame the code and to include comments.

__ You should also see a “Documents Drawer” on the right. If not, go to the View menu and select Show Documents Drawer.

__ Before making any changes, make your own version of this script by saving it as regex_yourname.pl. From the File menu select Save As, give the file its new name, and check that you are saving it in the regexPlay folder on the desktop.

__ If you do not see line numbers next to the code use the Edit menu to open Text Options and check the box under Display: for Line Numbers.

This is the only section you will be changing:

######################################################################

# ==========ENTER HERE============

my $regex = 'q[^u]';

my $filename = "english.txt";

# my $filename = "ecoli.txt";

######################################################################

(Note: The word “my” is Perl’s way of declaring a variable; here each variable is also given its starting value on the same line. It is necessary here.)

To write a new regex, you will change the pink text between the single quotes.

Note: you must keep the single quotes!
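Why do the single quotes matter? The short sketch below is one way to see it. It is an illustration only (the variable names $single and $double are made up for this example and do not appear in regex.pl). In Perl, text in single quotes is kept exactly as typed, while text in double quotes is processed first, so a backslash code such as \1 is changed before the regex is ever used.

```perl
use strict;
use warnings;

# Single quotes keep the backslash, so the regex sees the two characters \ and 1.
my $single = '(.)\1';

# Double quotes convert \1 into one invisible control character first.
my $double = "(.)\1";

printf "single-quoted pattern has %d characters\n", length $single;
printf "double-quoted pattern has %d characters\n", length $double;

for my $word (qw(book cat)) {
    my $hit_single = $word =~ /$single/ ? 'match' : 'no match';
    my $hit_double = $word =~ /$double/ ? 'match' : 'no match';
    print "$word: single-quoted => $hit_single, double-quoted => $hit_double\n";
}
```

With the single-quoted pattern, “book” matches because of its doubled letter. With the double-quoted pattern nothing matches, because the pattern is now looking for a control character that no word contains.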

The initial regex supplied in this Perl program means:

“match all words that contain a ‘q’ that is followed by a letter that is not ‘u’.”

____You must also choose the file to search:

either the English dictionary (english.txt)

or the list of 7-mers in E.coli (ecoli.txt).

You will make this selection by putting a comment (#) symbol in front of the line that you do not want to use.

The initial program is set to search the file "english.txt"; thus there is a # to “comment out” (shut off) the "ecoli.txt" file.
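If you are curious what a program might do with these two settings, here is a minimal sketch. It is not the code inside regex.pl, which you have not looked at yet; it is an assumption about the general approach. To keep the sketch self-contained, a short made-up list (@lines) stands in for the lines of english.txt.

```perl
use strict;
use warnings;

my $regex = 'q[^u]';

# Stand-in for the lines of english.txt (an assumption; the real file is much longer).
my @lines = ("Iraqi\n", "queen\n", "Qatar\n", "quiz\n", "antique\n");

my @matches;
for my $line (@lines) {
    chomp(my $word = $line);                      # drop the trailing newline
    push @matches, $word if $word =~ /$regex/i;   # /i ignores upper vs. lower case
}

# Print a numbered list followed by the total, similar to the lab output.
my $n = 0;
printf "(%d) %s\n", ++$n, $_ for @matches;
print scalar(@matches), " total\n";
```

Each line is tested against the regex once, and only the lines that match are kept and counted.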

Note: After you change the regex, you must SAVE your changes to the file before you run the program again. Your program is NOT saved if it has a dark diamond next to the title in the Documents panel. You can save your Perl program with ⌘S or by using the pull-down menu from File.

____To run your program from TextWrangler, use the #! menu and select Run

(this combination of symbols is called hashbang or she-bang).

Your program will run, and a new file with your results will open.

You will see a new document name added to the Documents panel called “Unix Script Output.” This file is not currently saved, and you do not need to save it. Subsequent runs will be added below, and it may be a very long file by the end of the day. You should record all of your coding attempts in your lab notebook as you work. When your attempted code fails to give the expected results, try to figure out why and correct it. Record this learning process in your lab notebook.

___ Run your program now before you change anything in the code.

How many words contain a ‘q’ that is followed by a letter that is not ‘u’? (Write your findings in your lab notebook.)

____To Return to the script, click on the document icon to the left of your saved regex_yourname.pl in the document drawer, or use the back-arrow button

[Screenshot: output of the initial program]

Work your way through the rest of the handout.

Always start at the top of each left hand page.

This is a demonstration using the English dictionary and English letters. Each page then contains a “going back for more” that you must solve, and record your solution in your lab notebook. You are asked to be creative using the regular expressions that you have learned on each page.

Then, move to the facing page, the right hand page.

This is a demonstration using DNA or Protein code rather than English.

Again, there is an example followed by one or more assigned problems, and an opportunity to be creative.

The answers are available from your instructor.

Working with your fellow students is encouraged.

Please ask for and offer help to each other before looking up the answer.

Scrabble in English

Question: Are there any words in an English dictionary where the letter ‘q’ is not

immediately followed by the letter ‘u’?

Pattern to match expressed in English: Search for two adjacent characters: the letter ‘q’

immediately followed by any character that is not the letter ‘u’.

|[…] |match any one of the characters in the set |

|[^…] |match any one character that is not in the set |

|{n} |match the preceding character or set exactly n times |

|{n,} |match the preceding character or set at least n times |

|{x,y} |match the preceding character or set at least x and not more than y times |

| |these can be combined |

|[…]{x,y} |match any of the characters in the set at least x and not more than y times |

| | |a vertical bar separates alternatives |

| |this is OR in Boolean logic. |

regex = ‘q[^u]’

(1) Iraqi

(2) Iraqis

(3) Qatar

Note: In these facing page examples, we are assuming that the Perl program is instructing each regex to ignore the case of letters. You need not worry about upper vs. lower case in your regex. In the example above, “q” is really handling both upper and lower case ‘q’.

This is part of the work that the Perl program is doing for you. After you play with regex, we can look at the code if you are interested.
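The table entries above can be tried out on a handful of words before you run them on the whole dictionary. The sketch below is only an illustration: the word list and the three patterns are made up for this example and are deliberately different from the assigned problems.

```perl
use strict;
use warnings;

my @words = qw(beautiful queue rhythm lazy box banana);

# Three example patterns: a set with {n}, a negated set with {n,}, and alternatives with |
my @patterns = ('[aeiou]{3}', '[^aeiou]{4,}', 'x|z');

my %hits;
for my $regex (@patterns) {
    $hits{$regex} = [ grep { /$regex/i } @words ];
    printf "%-14s => %s\n", $regex, join(' ', @{ $hits{$regex} });
}
```

Here [aeiou]{3} asks for three vowels in a row, [^aeiou]{4,} asks for four or more non-vowels in a row, and x|z asks for either an ‘x’ or a ‘z’.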

Going back for more:

Select regex_yourname.pl in the Documents Drawer to return to your Perl script. Edit your new regex in pink text between the single quotes, and then save your changes before running the script from the hash-bang menu #! . If you forget to save, it will run the previous regex.

(1a) All words that contain the three-letter string “ghi”.

Note: Do not confuse this with the request to find words with any one of the letters ‘g’, ‘h’, or ‘i’. [ghi] is not the regex you want. [ghi] finds words with one of the letters in the set.

(1b) All words that contain “fin” or “phin”.

(1c) All words with ‘yz’ that is not immediately followed by an ‘e’ or an ‘i’.

(1d) All words with.... (make up your own).

“Scrabble” in DNA Sequences

Our context for DNA sequences is a file with one DNA motif (or word) per line. The file contains 16,384 7-mers; each line holds one DNA sequence, or motif, of length seven (7), where a motif is defined as “a pattern with putative biological meaning”. We collected the 7-mers from NCBI’s publicly available DNA sequence for the bacterium E. coli.

Change the file that will be searched by inserting and deleting the hash sign # as shown.

# my $filename = "english.txt";

my $filename = "ecoli.txt";

Question: Are there any 7-mers in E. coli where the nucleotide ‘G’ (guanine) is not

immediately followed by the nucleotide ‘T’ (thymine)?

Search for two adjacent nucleotides: guanine ‘G’ followed by any nucleotide that is not thymine ‘T’.

regex = ‘G[^T]’

TAAAGAA

ACGTGCC

GATATTT … 11267 total

The program is checking All Motifs for a G followed by a letter that is not T

TCAGTGT no

GTTCACG no

TAAAGAA yes

AGTAGTG no

CTTTTTT no

ACGTGCC yes

ACTCATT no

etc
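The yes/no walk-through above can be reproduced with a few lines of Perl. This is a sketch for illustration, not the code in regex.pl; the seven motifs are the ones listed above, typed in by hand rather than read from ecoli.txt.

```perl
use strict;
use warnings;

my $regex  = 'G[^T]';
my @motifs = qw(TCAGTGT GTTCACG TAAAGAA AGTAGTG CTTTTTT ACGTGCC ACTCATT);

my %answer;
for my $motif (@motifs) {
    # A G at the very end of a motif has no letter after it, so it cannot match.
    $answer{$motif} = $motif =~ /$regex/i ? 'yes' : 'no';
    print "$motif $answer{$motif}\n";
}
```

Notice GTTCACG: its last G is not followed by a T, but it is not followed by anything else either, and [^T] still requires one character to be present.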

Going back for more:

(2a) All 7-mers that contain GTGAC.

(2b) All 7-mers that contain the dimer GC, where this dimer is immediately followed by a letter that is neither G nor C.

(2c) All 7-mers where ... (write your own)?

Direct Repeats

Question: Are there any words in which a sequence of three letters is repeated elsewhere in the word?

Note: Don’t forget to switch back to the English text.

Search for words where a sequence of three letters is repeated. This pattern might appear in the middle of a longer word.

|. |Match any character except a new line |

|(.) | Match any character and remember it |

|.* |Match any character zero or more times |

|.+ |Match any character one or more times |

|.? |Match any character zero or one time |

|\1 |Recall the character remembered by the 1st set of parentheses |

|\2 |Recall the character remembered by the 2nd set of parentheses |

|\3 |Recall the character remembered by the 3rd set of parentheses |

regex = ‘(.)(.)(.).*\1\2\3’

(1) acclimatization

(2) agglutinating

(3) aggressiveness

(4) Albuquerque

(5) alfalfa

(6) allegorically

(7) amalgamate

(8) amalgamated

(9) amalgamates

(10) amalgamating … 305 total
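To see which three letters were remembered, you can ask Perl to report them. The sketch below is an illustration only (it is not part of regex.pl). After a successful match, Perl stores what each set of parentheses remembered in the variables $1, $2 and $3. Three of the test words come from the list above; “regex” is added as a word that should not match.

```perl
use strict;
use warnings;

my $regex = '(.)(.)(.).*\1\2\3';

my %repeat;
for my $word (qw(alfalfa Albuquerque amalgamate regex)) {
    if ($word =~ /$regex/i) {
        $repeat{$word} = "$1$2$3";    # the three remembered letters
        print "$word: '$repeat{$word}' occurs again later in the word\n";
    }
    else {
        print "$word: no repeated three-letter sequence\n";
    }
}
```

Perl tries each starting position from left to right, so the letters reported are the first three-letter sequence that is found again later in the word.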

Going back for more:

(3a) Are there any words in which a sequence of two letters is directly repeated at least

three times in the word?

(3b) Are there any words ... (write your own)?

(Note: “directly” refers to “in the same order”. It is different from “immediately” in that it allows additional characters to come between the two copies of the repeated motif.) (How would you change the regex to require an immediately repeated 3-letter motif?)

Direct Repeats in DNA Sequences

Question: Are there any 7-mers in the DNA sequence of E.coli where a sequence of three nucleotides is repeated elsewhere in the motif?

Search for motifs where a sequence of three nucleotides is repeated.

regex = ‘(.)(.)(.).*\1\2\3’

(1) AAAAAAA

(2) AAAAAAC

(3) AAAAAAG

(4) AAAAAAT

(5) AAACAAA

(6) AAACAAC

(7) AAAGAAA

(8) AAAGAAG

(9) AAATAAA

(10) AAATAAT … 700 total
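The total above can be checked without the data file, under one assumption: that ecoli.txt holds every possible 7-mer (16,384 is exactly 4 x 4 x 4 x 4 x 4 x 4 x 4). The sketch below builds all of them from the four letters and counts how many match. It is an illustration, not the lab program.

```perl
use strict;
use warnings;

my @bases = qw(A C G T);

# Build every possible 7-mer by adding one base at a time to each shorter sequence.
my @kmers = ('');
for my $position (1 .. 7) {
    my @longer;
    for my $prefix (@kmers) {
        push @longer, $prefix . $_ for @bases;
    }
    @kmers = @longer;
}

my $regex = '(.)(.)(.).*\1\2\3';
my $count = grep { /$regex/ } @kmers;   # in scalar context, grep returns a count

print scalar(@kmers), " 7-mers generated, $count contain a repeated triplet\n";
```

In a sequence of only seven letters there is room for the two triplets plus at most one letter between or around them, which is why so few of the 16,384 motifs match.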

Going back for more:

(4a) Are there any motifs in which a sequence of two nucleotides is repeated at least three

times in the motif?

(4b) Are there any motifs ... (write your own)

Mirror Repeats (also called palindromes in English)

Question: Are there any words whose first three letters reappear, in reverse order, as the last three letters of the word?

Search for words that begin with a sequence of three letters and end with the reverse of that sequence.

regex = ‘^(.)(.)(.).*\3\2\1$’

(1) despised

(2) detected

(3) deteriorated

(4) detested

(5) foolproof

(6) Greenberg

(7) Hannah

(8) redder

(9) reviver

(10) revolver

(11) rotator

remember:

|^ab |Match ‘ab’ at the beginning of a word |

|xyz$ |Match ‘xyz’ at the end of a word |

|\2 |Recall the character remembered by the 2nd set of parentheses |
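The effect of ^ and $ is easiest to see by running the same letters with and without them. The sketch below is an illustration with a made-up word list and made-up patterns (it does not use the mirror-repeat regex, so the assigned problems are left for you).

```perl
use strict;
use warnings;

my @words = qw(redder detected foolproof hundred);

my %hits;
for my $regex ('re', '^re', 'ed$') {
    $hits{$regex} = [ grep { /$regex/i } @words ];
    printf "%-4s => %s\n", $regex, join(' ', @{ $hits{$regex} });
}
```

The plain pattern re finds “re” anywhere in a word, ^re only at the beginning, and ed$ only at the end.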

Going back for more:

(5a) Are there any words in which three consecutive letters anywhere in a word are followed by the reverse of those letters anywhere in the word? (Hint: you do not need ^ (start of word) and $ (end of word) for this).

(Note: the following two examples combine both direct and mirror patterns)

(5b) Are there any words in which a pair of letters is followed by a direct repeat of that pair and then by two occurrences of the reverse of the pair? (Each of the pairs can be zero or more letters apart), e.g., “AB...AB...BA...BA”

(5c) Are there any words in which this pattern occurs twice: a pair of letters followed by the reverse of that pair? (Each of the pairs can be zero or more letters apart).

Mirror Repeats in DNA Sequences

Question: Are there any motifs whose first three nucleotides reappear, in reverse order, as the last three nucleotides of the motif?

Search for motifs where the reverse of the initial three nucleotides is repeated at the end of the motif.

regex = ‘^(.)(.)(.).*\3\2\1$’

(1) AAAAAAA

(2) AAACAAA

(3) AAAGAAA

(4) AAATAAA

(5) AACACAA

(6) AACCCAA

(7) AACGCAA

(8) AACTCAA

(9) AAGAGAA

(10) AAGCGAA … 256 total
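Why 256? In a 7-mer, the first three letters and the last three letters leave exactly one letter in the middle. There are 4 x 4 x 4 = 64 choices for the first three letters, 4 choices for the middle letter, and the last three letters are then fixed, giving 64 x 4 = 256. The sketch below checks this reasoning by building the mirror motifs directly and testing each one against the regex. It assumes, as the file size of 16,384 suggests, that every possible 7-mer is present in ecoli.txt.

```perl
use strict;
use warnings;

my @bases = qw(A C G T);
my $regex = '^(.)(.)(.).*\3\2\1$';

my @mirrors;
for my $x (@bases) {
    for my $y (@bases) {
        for my $z (@bases) {
            for my $mid (@bases) {
                # first three letters, one middle letter, then the first three reversed
                my $motif = "$x$y$z$mid$z$y$x";
                die "unexpected non-match: $motif" unless $motif =~ /$regex/;
                push @mirrors, $motif;
            }
        }
    }
}

print scalar(@mirrors), " mirror motifs; the first five are @mirrors[0 .. 4]\n";
```

Because the letters are taken in alphabetical order, the motifs come out in the same order as the list above.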

remember:

|^ATG |Match ‘ATG’ at the beginning of a motif |

|TAC$ |Match ‘TAC’ at the end of a motif |

|(.) |Match any character and remember it |

|\2 |Recall the character remembered by the 2nd set of parentheses |

Going back for more:

(6a) Are there any motifs in which three consecutive nucleotides anywhere in a motif are

followed by the reverse of those nucleotides anywhere in the motif?

(Hint: you do not need ^ (start of motif) and $ (end of motif) for this).

(Note: the following two examples combine both direct and mirror patterns)

(6b) Are there any motifs in which a pair of adjacent nucleotides is followed by a direct repeat of that pair and then by two occurrences of the reverse of the pair? (Each of the pairs can be zero or more nucleotides apart), e.g., “GT...GT...TG...TG”.

(6c) Are there any motifs in which this pattern occurs twice: a pair of adjacent nucleotides followed by the reverse of that pair? (Each of the pairs can be zero or more nucleotides apart).

Pattern Matching in Large Texts.

Perl can be a powerful tool for searching and manipulating larger text documents.

We have downloaded the full text of The Origin of Species by means of Natural Selection, 6th Edition by Charles Darwin.

You will write regexes to find interesting words in this text.

Choose words or concepts from Bob Kaplan’s lectures first semester.

In TextWrangler close (⌘W) the regex.pl script and open (⌘O) book_search.pl

This script will return the entire paragraph that contains your interesting word so you can learn how Charles Darwin used these terms.
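How can a program return a whole paragraph instead of a single line? One common approach in Perl is “paragraph mode”, where the program reads everything up to the next blank line in one step. The sketch below illustrates that idea. It is an assumption about the general technique, not a copy of book_search.pl, and the short text in it is made up for the example (it is not from Darwin).

```perl
use strict;
use warnings;

# Stand-in text (an assumption); paragraphs are separated by blank lines.
my $text = <<'END';
The finches differ from island
to island in the shape of the beak.

Tortoises also vary between islands.

Beak shape tracks the available food.
END

my $regex = 'beak';

# Read from the string as if it were a file.
open my $fh, '<', \$text or die "cannot open in-memory text: $!";

my @found;
{
    local $/ = "";    # paragraph mode: each read returns one whole paragraph
    while (my $paragraph = <$fh>) {
        push @found, $paragraph if $paragraph =~ /$regex/i;
    }
}
close $fh;

print scalar(@found), " matching paragraph(s)\n\n";
print @found;
```

The first paragraph is returned in full even though the matching word is on its second line.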

Search the text origin.txt

You could search any other downloaded text by changing line #13 in the program.

(my $filename = "origin.txt"; #enter the text file name here)

(9a) How many times do you think Darwin used the word “evolution” in his text?

Write your guess in your lab notebook.

(9b) Write a regex to find the correct answer.

This would be easy to do in a simple text program like Word using the Find tool, but now try to find any term related to evolution (evolve, evolution, evolved).

(9c) Write a regex that will find any term related to evolution.

(9d) How many are there?

(9e) Look carefully. Did you get words you weren’t expecting (“revolved”, “revolution”)? Look back at your answer to 9b now as well.

(9f) Write a regex that will avoid these words. (Hint: \s means “white space” and \b means “word boundary”.)

Several special characters can be helpful regex tools in this type of search in Perl.

|\n |new line character |

|\t |tab |

|\s |whitespace |

|\S |non-whitespace |

|\d |digit |

|\D |non-digit |

|\b |word boundary |

|\B |not word boundary |
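The word boundary \b is the least obvious entry in this table, so here is a small sketch of what it does. The sentence and the search word are made up for this example. The same three letters are searched for four ways, and the number of matches is counted each time.

```perl
use strict;
use warnings;

my $text = 'The cat sat near the catalog while we concatenate files; a bobcat watched.';

my %count;
for my $regex ('cat', '\bcat', 'cat\b', '\bcat\b') {
    my @found = $text =~ /$regex/gi;    # /g collects every match in the text
    $count{$regex} = scalar @found;
    printf "%-10s matches %d time(s)\n", $regex, $count{$regex};
}
```

A \b in front requires the match to start a word, a \b behind requires it to end a word, and using both leaves only the stand-alone word.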

If you are finding it difficult to find your word in the full printed paragraph, comment out line 52 and use line 53 by removing that comment mark.

52 print @paragraph; # and print the paragraph

53 # print "$match\n"; # will print only the line that matched

Pattern Matching for Protein Sequences

Remember Janis Shampay’s lectures last semester?

Proteins are a string of amino acids of which there are 20. Each amino acid has a three letter abbreviation but also a one letter abbreviation used for writing protein sequences.

* G - Glycine (Gly)

* P - Proline (Pro)

* A - Alanine (Ala)

* V - Valine (Val)

* L - Leucine (Leu)

* I - Isoleucine (Ile)

* M - Methionine (Met)

* C - Cysteine (Cys)

* F - Phenylalanine (Phe)

* Y - Tyrosine (Tyr)

* W - Tryptophan (Trp)

* H - Histidine (His)

* K - Lysine (Lys)

* R - Arginine (Arg)

* Q - Glutamine (Gln)

* N - Asparagine (Asn)

* E - Glutamic Acid (Glu)

* D - Aspartic Acid (Asp)

* S - Serine (Ser)

* T - Threonine (Thr)
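A list like the one above is easy to store in a Perl “hash”, which pairs each one-letter code with its three-letter abbreviation. The sketch below is an illustration only; the names %three_letter and spell_out are made up for this example and are not part of the lab programs.

```perl
use strict;
use warnings;

# One-letter code => three-letter abbreviation, copied from the list above.
my %three_letter = (
    G => 'Gly', P => 'Pro', A => 'Ala', V => 'Val', L => 'Leu',
    I => 'Ile', M => 'Met', C => 'Cys', F => 'Phe', Y => 'Tyr',
    W => 'Trp', H => 'His', K => 'Lys', R => 'Arg', Q => 'Gln',
    N => 'Asn', E => 'Glu', D => 'Asp', S => 'Ser', T => 'Thr',
);

# Spell out a protein sequence one letter at a time.
sub spell_out {
    my ($protein) = @_;
    my @names;
    for my $letter (split //, uc $protein) {
        # letters that are not amino acid codes are shown as ???
        push @names, exists $three_letter{$letter} ? $three_letter{$letter} : '???';
    }
    return join '-', @names;
}

print spell_out('MALEK'), "\n";
```

For the sequence MALEK this prints Met-Ala-Leu-Glu-Lys.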

What amino acid sequence would be represented by the protein sequence GENE?

In TextWrangler close (⌘W) book_search.pl; open (⌘O) omic_search.pl

This program will return either DNA or protein sequence for genes. It is expecting a FASTA formatted file, rather than a book.

Use the file E_coli_K12_proteins.fasta by commenting out the appropriate line.

If you want to see only one matching line of protein code, comment out line 43 and use 44.

(10a) Write a regex to find how many proteins in E.coli include the protein code GENE.

(10b) Write a regex to find how many proteins in E.coli include the amino acids DARWIN in any order.

(10c) Write a regex to find proteins in E.coli that include …. (make up your own).

(10d) Is your name part of the E. coli proteome? (If you don’t know what the word proteome means, look it up quickly in Wikipedia.)

Regular expressions can be used to search for secondary structure.

Different amino acids have different chemical characteristics. The secondary structure of a protein depends on the number and spacing of these different chemical characteristics. Different amino-acid sequences have different propensities for forming α-helical structure. Methionine, alanine, leucine, glutamic acid, and lysine (“MALEK” in the amino-acid 1-letter codes) all have high helix-forming propensities. Proline (“P”) tends to break or kink helices. However, proline is often seen as the first residue of a helix.

(10e) Write a regex for an amino acid pattern that is likely to form a long (10 amino acid) alpha helix secondary structure.

There are many web-based tools that search for complex protein or DNA patterns such as Neuropeptide cleavage predictors (see appendix for an example). These tools rely on programming in languages like Perl or Python. One advantage of learning to write your own programs is that you can search multiple sequences (or whole genomes as you have done today). You do not need an internet connection, you do not have to wait for results, and you can format the output to be compatible with your own files. You can adjust the parameters. Often it is possible to obtain the program code that is used by these web-based tools, and you can then adjust specific parameters to meet your own needs.

Pattern Matching for Protein Sequences in DNA

Each amino acid is encoded by 3 letters in the genomic DNA (translated via mRNA). These three bases are called a “codon”. Remember that some of the amino acids can be encoded by more than one codon, so the code is called “degenerate”.

Do you remember the Universal Genetic Code? Of course not, nobody does.

[Figure: table of the universal genetic code]

Use the Perl script omic_search.pl

Use the file E_coli_K12_genes.fasta by commenting out the appropriate file line.

A regular expression for alanine is

regex = ‘gc[acgt]’

A regular expression for arginine is a bit tougher.

Remember that the vertical bar symbol | means OR.

regex = ‘(cg[acgt])|(ag[ag])’
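You can convince yourself that these two regexes are right by testing them on individual codons. The sketch below is an illustration only (the names %codon_regex and @codons are made up for this example). It tests the four alanine codons, the six arginine codons, and two serine codons (agc and agt) that should match neither regex.

```perl
use strict;
use warnings;

my %codon_regex = (
    alanine  => 'gc[acgt]',
    arginine => '(cg[acgt])|(ag[ag])',
);

my @codons = qw(gca gcc gcg gct cga cgc cgg cgt aga agg agc agt);

my %matched;
for my $amino_acid (sort keys %codon_regex) {
    my $regex = $codon_regex{$amino_acid};
    # ^(?: ... )$ makes the regex account for the whole codon, start to end
    $matched{$amino_acid} = [ grep { /^(?:$regex)$/i } @codons ];
    printf "%-8s => %s\n", $amino_acid, join(' ', @{ $matched{$amino_acid} });
}
```

The serine codons agc and agt are left out of both lists, because ag[ag] only allows an ‘a’ or a ‘g’ in the third position.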

(11a) Write a regular expression that will search the genomic sequence for the possible amino acid sequence GENE.

(11b) Do you find this sequence the same number of times in the DNA as you do when you search the Protein file for GENE? If not, what is the reason?

(11c) Write a regular expression to find a Lysine rich region (3 or more in a row).

Or make it harder: find a region where every other amino acid is a Lysine.

Think about how you could specify which amino acids are allowed to be interspersed with the Lysines.

(11d) Why might a researcher be interested in looking for secondary structure in a given DNA sequence rather than looking directly at an amino acid sequence?

(11e) Why might a researcher want to write their own program rather than using a web-based tool?

When you have finished

__The answer key is available in lab if you wish to check your own work.

__There is an additional exercise on finding anagrams if you are having fun, have finished early and want to try more difficult regular expressions.

__If you want to save your output file, use the Save As command, give it a name and save it to your Home Server. This file contains only the output, not the regular expressions you used to get what you were looking for. Your lab notebook is the real record of your efforts.

__Before you leave, find another student who has also finished the exercises. Trade notebooks with this student. Complete the post-lab evaluation form while you learn what creative regular expressions your classmate has attempted.

__Take your own lab notebook home with you.

__Leave your post-lab evaluation form with Carey or Ned.

Appendix

This lab is based on Exercise 3 from “Programming in PERL for Biology”, a text by Mark LeBlanc and Betsey Dexter Dyer from Wheaton College. The full text will be published later in 2007 and is recommended for any student wishing to go further with this work. Until that book is published, I recommend Beginning Perl for Bioinformatics from O'Reilly & Associates, as it addresses the needs of biologists who want to learn Perl programming.

• If you want to install Perl on your own computer; it comes installed with Mac OS X or it can be downloaded for free from

• If you want to install TextWrangler™ on your own computer, it is available as a free download. TextWrangler is a freeware programming environment from Bare Bones Software, the makers of BBEdit. TextWrangler™ is a “lighter” version of BBEdit®, the industry standard Perl programming environment for Mac OS X.

• If you want to learn more Perl, there are several books available as well as several online courses. Perl is open source language and many scripts and modules can be freely found and downloaded from the web. One good online tutorial that I have seen is available at

• The english.txt dictionary and 7mer text that you used in class can be downloaded from

• The full text of “On the Origin of Species” (and other works) can be downloaded for free from Project Gutenberg. There are more than 20,000 free books for download from Project Gutenberg. Project Gutenberg is the first and largest single collection of free electronic books, or eBooks. It was founded by Michael Hart, who invented eBooks in 1971.

• Whole genomes, collections of genes, protein coding regions, upstream regulatory regions, etc. can be downloaded from many sources. For this lab, the DNA coding sequence file and the protein file for E. coli strain K12 were downloaded from The Institute for Genomic Research, Comprehensive Microbial Resource. C. elegans genomes can be downloaded from BioMart if you want to play with larger genomes.

• NeuroPred is a web application designed to predict cleavage sites at basic amino acid locations in neuropeptide precursor sequences. The user can study one amino acid sequence or multiple sequences simultaneously, selecting from several prediction models and optional, user-defined functions.

FASTA format

In bioinformatics, FASTA format is a text-based format for representing either nucleic acid sequences or protein sequences, in which base pairs or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like Perl.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
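The description above translates almost directly into code. The sketch below is one possible way to read FASTA records and search them with a regex; it is an illustration under stated assumptions, not the code in omic_search.pl. The three records (seq1, seq2, seq3) and their sequences are made up for this example.

```perl
use strict;
use warnings;

# Made-up FASTA data: a ">" header line, then one or more lines of sequence.
my $fasta = <<'END';
>seq1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
>seq2 another made-up record
MALEKPLLNA
>seq3 third made-up record
MSSKKRRW
END

my (%sequence, @order, $id);
for my $line (split /\n/, $fasta) {
    if ($line =~ /^>(\S+)/) {        # header: the identifier follows ">"
        $id = $1;
        push @order, $id;
        $sequence{$id} = '';
    }
    elsif (defined $id) {
        $line =~ s/\s+//g;           # remove any stray spaces
        $sequence{$id} .= $line;     # join wrapped lines into one sequence
    }
}

my $regex = 'K{2,}';                 # example pattern: two or more lysines in a row
my @hits  = grep { $sequence{$_} =~ /$regex/i } @order;

print "Records matching $regex: @hits\n";
```

Joining the wrapped lines before searching matters: a pattern that happens to span a line break in the file would otherwise be missed.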

[Screenshot of the TextWrangler window, with these labels: pink regex text; Documents Drawer (open diamond = saved file, black diamond = unsaved file, bold text = working file); yellow highlighted text at cursor; counts of words returned; back-arrow button]
