MSU - Department of Computer Science



Perl for Science Majors
Designed to support CMPS 1023V .61

Image from National Human Genome Research Institute

By Richard Simpson and Tina Johnson
Midwestern State University
2010

CONTENTS
What is Perl and DOS
Installing and using Notepad++
Installing and running Perl
Variables and Data Types in Perl
Input/Output
Mathematical operators
Simple Programs
The IF statement and Logical operators
The WHILE statement
File Input/Output
Regular Expressions
Searching Text files using RE's
More Arrays in Perl
Hashes and their uses

What is Perl

Perl (not PERL) is a programming language developed by Larry Wall in 1987. Although it was originally intended as a scripting language for UNIX systems, it has evolved into a widely used programming language for text processing. This is not to say that it cannot handle numerical applications, but rather that it has features that make the processing of text files relatively easy. Computer scientists refer to a language that can be used to solve almost any problem as a general-purpose language, which is what Perl is. Besides its applications in graphics programming, system administration, database systems, the Web and networks, the field of Bioinformatics has embraced the language as its preferred (or at least its most popular) programming tool for DNA and other text processing. This little book will restrict its focus to applications in Biology, Chemistry, Physics, Mathematics and Geology, which are the fields that the majority of science students major in.

We will be working this semester in an environment that is probably unfamiliar to you. This environment, which is our interface to the operating system, is called command line DOS. It looks very much like the command line interface found in the Unix and Linux operating systems, so what you learn here will help you when and if you work on those systems. DOS (Disk Operating System) has been around since 1980 or so. It was developed to work with the new desktop microcomputers that hit the shelves about that time.
The version that Bill Gates sold was called MS-DOS. Its popularity, and that of the Windows operating systems that followed, is what made Microsoft and its co-founder Bill Gates so wealthy. The fact that DOS is command line based means that the user interacts with the OS by entering commands on a line, one at a time. There is no mouse or GUI (graphical user interface) with icons and so on to click on. It's all done through the keyboard. Although the original DOS was displayed on the entire screen, modern DOS is normally run within a window. To open this DOS window, click on the Start button of Windows, select Run from the menu, enter cmd in the popup that opens, and click OK. You should get a DOS window containing a line such as

    c:\Documents and Settings\richard.simpson>

This line is called the prompt. It indicates the directory that DOS is presently accessing, also known as the present working directory (pwd). In order to display the actual contents of this directory, just type in the command dir.

The directory contents and the initial directory will most likely be different on your computer. Let's look at what dir displays on one laptop. Note the line

    06/20/2010 12:04 PM 11,149 gsview32.ini

in the display. This line gives the date and time stamp of the file gsview32.ini as well as its size in bytes. The date is updated each time the file is modified. Other lines such as

    02/02/2009 11:46 AM <DIR> mydir

refer to directories, as indicated by the <DIR> field.

As you first learn to use DOS it might be wise to hide your mouse so it is not within easy reach. Remember, you are not supposed to use the mouse while interacting with the DOS interface. Of course, if you need to work in another window you can use the mouse to click on some other part of the Windows GUI in the background.

So what can we do in this command line window? The basic process is to type a command, press Enter to see its effect, and do this over and over again.
Although there are quite a few commands you can use, as given in the DOS command appendix, we will look at a few really useful ones here. As we discuss these commands don't forget that pwd is shorthand for the present working directory.

There is one command that allows the user to move around the directory tree. This command is called cd (change directory) and can be used in several ways, as shown by the following examples.

    cd ..                      change to the parent dir (the .. is shorthand for the parent)
    cd species                 change to the species dir of the pwd (note that there is no slash)
    cd \bin                    change to the bin directory of the root
    cd \Simpson\Files\Perl\    change to the indicated directory if it exists within the dir tree

Each time you execute one of the above you should probably follow it with a dir command to see the contents of the new directory. The command cd species shown above will only work if a species directory is displayed when the dir command is executed, i.e. species is a dir in the pwd. As an aside, if you want to change drives (for example from C: to D:, where D: is your thumb drive) just type D: at the prompt without the cd.

The creation and deletion of directories is straightforward as well. Just type mkdir dir_name to create a new dir within the pwd. You can create as many directories and subdirectories as you like with this method. If you want to create a subdirectory in, say, the directory FilesList, you must make FilesList the pwd before you execute mkdir.

Another useful command is the type command (as in type file_name). This command is used to display the contents of text files (those made from ASCII codes only) that you see within the pwd. It will not work properly on .exe files, .doc files, .pdf files and many others. If you don't believe me, just type a .exe file and see what you get.
If things go crazy while attempting something like this, just press Ctrl-C to kill the confusion and get back to the prompt.

The final set of commands that we will discuss here is the move and copy commands. The move command lets you move a file from one directory to another, with the original being deleted. The copy command does the same but keeps BOTH copies. In order to give some examples, assume that we have the tree displayed in homework 1.3 below and that the pwd is Exams. We will only discuss copy since move is similar. The command that would copy A.txt to the Exams directory (i.e. the pwd) would be

    copy "\Text files\Letters\A.txt" .        (Note: the . is shorthand for the pwd)

In this case the entire path \Text files\Letters\A.txt, starting at the root, was used to select the file we want to copy. The quotes are needed because the directory name contains a blank. You could also have given the full path of the receiving directory, as in

    copy "\Text files\Letters\A.txt" "\Text files\Exams"

Assuming that we are in Letters, here is a command that will copy a file to its parent directory.

    copy b.txt ..        (Remember that .. always represents the parent directory of the pwd)

There are many variations of this command. What do you think this means?

    copy ..\Letters\A.txt .        (assuming that we are in Exams)

You start where you are and back up one directory, then go down into Letters to retrieve A.txt, copying it to the pwd. (GOT that?)

Homework

1.1 Open a DOS window and note the pwd. Remember its path is displayed in the prompt. Now move to the root by running cd \ and note that the new prompt should be C:\> Now move around the directory tree by executing dir and cd commands: cd name to go down into a directory and cd .. to back up. Move around the tree until you become comfortable with this process.

1.2 Insert a thumb drive (AKA geek stick) (AKA computer sticky thingy from a recent movie) into one of the USB ports. Use one that has no important information you want to keep. Move DOS to this drive by entering the drive letter at the prompt.
For example, if D is the drive letter that the OS assigned to the thumb drive, then just type D: at the prompt and press return. After running dir to see what's on the drive, delete all the files and directories on the drive, one at a time. Note that if you try to delete a directory that is not empty, DOS will inform you of this. If so, change to that dir and delete the files in it first, and then back out, via cd .. , to the parent again. Now you can delete it.

1.3 Starting with an empty thumb drive, build the following dir tree. The rectangles are directories and the circles are files. Go to drive C, find a small .exe file and copy it to the bin directory as shown in the tree. Also copy a .doc (or .docx) file from drive C to your Exams directory. Within the Letters directory run Notepad at the command line and create two files, each with a sentence or two of data. Now go to the root directory and run the tree command (i.e. type tree and press return). You should see this tree drawn on its side.

1.4 Starting with the above tree do the following:
a) copy the a.txt file to the Perl programs dir
b) move the .exe file to the root dir and to the Other files directory.
c) copy the B.txt file to Text files.

Installing and using Notepad++

In order to write and run Perl programs, there are two programs that will need to be downloaded and installed on your home system. The computers in our class and in many of the labs here at MSU already have these tools loaded for your convenience.

The first tool we need to download is called Notepad++. This is a FREE text editor that we will use to write Perl programs. Many of you are already familiar with Notepad, which comes with Windows and was used in a previous homework. Notepad++ is a greatly improved upgrade. You may download the software from the Notepad++ website. If the link goes away, just Google notepad++ and you will be given a variety of download sites that you can use. On the Notepad++ site, just click on the Download tab and then click Download the current version.
Select the Installer.exe version. This will grab the most recent release. Release 6.1.8 is the current one at the time of this writing. By the time you read this the release number may well have increased. No worries, just click on the most recent offering. Once the installer is downloaded you may execute it. This should install Notepad++. When you first run this little editor, don't be discouraged by all the options on the screen; we will use only a few of them.

Notepad++ basically works just like any other editor that you have worked with. You type in some text and then save the file after giving it a unique name. Check out the installation by typing in a short Perl program and saving it under the name check.pl, where the pl extension tells us that this is a Perl program.

In order to get the editor to display instructions using colored syntax, you might need to click on the Language tab and select Perl. This editor has been designed to work with a lot of different languages. In this case you will note that comments (everything after a # symbol) are colored green and the print instructions are colored blue. Other things such as variable names will be colored differently. This is a great help to those trying to read a Perl program.

In order to save the code, click on File and then Save As and select the correct directory. Normally in class we will use D:\Students or something similar. I don't have a drive D: at the moment so I am using drive C:. Use drive D: in the lab during class. After saving the file, note the name at the top of the window. C:\Students\check.pl gives you both the directory the file is in and its name. Saving to a jump drive is also an option.

We will assume that you have had enough experience with editors such as Microsoft Word to use the above applications. If you have any problems please ask the Lab assistant or the Instructor.
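A test program along the lines of check.pl only needs a comment and a print statement or two. The following is a minimal sketch; the wording of the messages and the variable name $greeting are assumptions made for illustration, not the original listing.

```perl
# check.pl - a tiny program used to confirm that the editor and Perl work.
# Everything after a # symbol is a comment and is ignored by Perl.

$greeting = "Perl is installed and working.\n";   # hypothetical message text

print "Hello from check.pl\n";   # write one line to the DOS window
print $greeting;                 # write the stored message
```

Once Perl is installed, a program saved this way is typically run from the DOS prompt by changing to its directory and typing perl check.pl.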
In order to run the above program we will need to install Perl, which is what we will do in the next section.

Installing and running Perl

There are many versions of Perl on the web, but we will be using Active Perl in the labs in class. On its website you can download either the regular 32 bit version or the 64 bit one. If you have Windows 7 (64 bit) or Windows XP 64 then you can download the 64 bit version. If you don't have a clue what you have, just download the 32 bit version (i.e. x86) and it should work in either case. The file you will download is a Microsoft Windows Installer package (note the .msi extension). It should be named something like ActivePerl-5-12.2.1202-MSWin32-x86-293621.msi if you are downloading the 32 bit (x86) version. The version number may well have changed by the time you read this. Save the file and run it. It will lead you through the installation process. If you want to know whether you have a 64 bit system, just open Computer and click System properties at the top of the window. The system type will indicate a 64 bit Operating System.

Variables and Data Types in Perl

A variable is a reference to (or name of) a unique memory location that is used to temporarily store information. Unlike many other languages, variables in Perl do not have to be declared before they are used in a program. Variables in Perl can be scalars, arrays, or hashes. Hashes will be discussed later in this document. Scalars and arrays are discussed below.

Scalars

A scalar variable holds a single piece of information, such as a number, a character, or a group of characters (referred to as a string). A scalar is the simplest data type in Perl. A scalar variable name in Perl must begin with a dollar sign, followed by a letter or underscore, and then (optionally) followed by one or more letters, digits, or underscores. Perl is case sensitive; $this_variable is not the same as $This_Variable.
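The case-sensitivity rule can be seen in a short sketch. The values 10 and 20 and the variable $total are assumptions chosen only for illustration.

```perl
# Two different variables: their names differ only in capitalization.
$this_variable = 10;
$This_Variable = 20;

# Each name refers to its own memory location, so both values survive.
$total = $this_variable + $This_Variable;   # 10 + 20

print "this_variable holds $this_variable\n";
print "This_Variable holds $This_Variable\n";
print "Their total is $total\n";
```

If the two names referred to the same variable, the second assignment would have overwritten the first and the total would be 40 rather than 30.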
As with any other programming language, a good variable name should give an indication of the variable's purpose. For example, the variable $sum is preferred over $s because it is more descriptive. The following examples demonstrate the use of scalar variables within the context of an assignment statement. In Perl an assignment statement is an instruction that contains a variable on the left, followed by an equal sign (=), followed by either a value or an expression. The meaning of this is quite simple. The value of the expression on the right of the equal sign is stored in the variable on the left. Do not get this confused. The variable you are modifying is ALWAYS on the left. Capisce?! All instructions are terminated by a semicolon (;). Make sure you understand that an assignment statement is NOT an equation, i.e. one of those things you studied in algebra. It is an action, not a statement of a relationship! Although a variable that has never been assigned behaves as zero when it is used as a number, it is good practice for the programmer to perform the initialization explicitly. This is shown in the first example below. The following examples should suffice.

    $sum = 0;                              # Start the variable out at zero.
    $university = 'Midwestern State';      # Stores a string into the variable $university
    $pick = 'B';                           # Stores a single character B into the variable $pick
    $num1 = 52;                            # Stores 52 into the variable $num1
    $pi = 3.1415926535;                    # Initialize pi to 10 decimal places
    $total = $subtotal + $tax_amount;      # The sum on the right is stored in $total
    $avogadro = 6.022E+23;                 # You can even use scientific notation
    print "I attend $university University \n";    # see note below

The last example illustrates an important point in Perl: a double-quoted string is variable interpolated. This means that the variable name is replaced with its current value when it is printed.
In the example above, the following would be printed: I attend Midwestern State University, assuming that $university holds the string literal value 'Midwestern State'. Single quoted strings in a print statement will not perform variable interpolation. In other words print 'Hello $name' will print Hello $name as is.

Arrays

Scalar values are limited to a single piece of data. Many times it is helpful to store a collection of data in a single data structure that can be manipulated as a single unit. An array is a data type that holds a list of scalar values. Arrays are indexed with integer values beginning with zero. Use the @ symbol followed by the array name to refer to an entire array. To refer to an individual element in an array, use a dollar sign, followed by the name of the array, followed by the element number in brackets. For example @words could be the name of an array of words. The first element in the array is called $words[0], the second is called $words[1] and so on. An array can be visualized as shown below. Here we have an array called @taxonomy that contains the well-known taxonomic categories.

    index        0         1         2         3         4
    @taxonomy    Kingdom   Phylum    Class     Order     Family

The statement:

    $taxonomy[2] = $taxonomy[0];    # copies the value in the 0th slot to slot 2

would produce:

    index        0         1         2         3         4
    @taxonomy    Kingdom   Phylum    Kingdom   Order     Family

The statement:

    @taxonomycopy = @taxonomy;

would produce (assuming the original array):

    index            0         1         2         3         4
    @taxonomy        Kingdom   Phylum    Class     Order     Family
    @taxonomycopy    Kingdom   Phylum    Class     Order     Family

Array variables can easily be initialized by using an assignment statement as well. Here are several examples using the required syntax.

    @numbers = (1,2,3,4,5,6,7);                           # all numbers
    @names = ('Harry', 'Bob', 'Tom', 'John', 'Bill');     # all strings (names)
    @mixed = ('One', 2, 'Three', 4, 'Five');              # we can mix strings and numbers
    @bases = ("A", "C", "G", "T");                        # Remember your DNA?

From these it should be clear that $numbers[4] is 5, $names[0] is Harry and $mixed[1] is 2.
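The array operations described above can be tried out as one short program. This is a minimal sketch that reuses the @taxonomy and @taxonomycopy names from the text; the messages in the print statements are illustrative.

```perl
# Build the taxonomy array; the slots are numbered 0 through 4.
@taxonomy = ('Kingdom', 'Phylum', 'Class', 'Order', 'Family');

# Copy the whole array BEFORE changing anything in the original.
@taxonomycopy = @taxonomy;

# Overwrite slot 2 of the original with the value held in slot 0.
$taxonomy[2] = $taxonomy[0];

print "Slot 2 of the original is now $taxonomy[2]\n";
print "Slot 2 of the copy is still $taxonomycopy[2]\n";
print "The last slot of the original holds $taxonomy[4]\n";
```

The copy is a separate array, so changing an element of @taxonomy afterwards leaves @taxonomycopy untouched.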
Input/Output

Perl programs, at least the ones we will write, can communicate with the outside world in basically two ways: standard I/O (Input/Output) or file I/O. The first method, referred to as standard I/O, either reads data typed in at the keyboard (<STDIN>) or writes data to the DOS screen (STDOUT). Programs that operate in this fashion are said to be interactive. Probably the simplest program is one that writes output to the screen using a print instruction. The syntax for the print instruction is

    print data_type, data_type, ..., data_type;

Here data_type can be any of the usual types such as variables, strings or arrays. It can also be an expression that results in one of the data types. Here are some simple examples of printing strings. The associated comment (anything after #) explains the output. The commas are required if you have multiple types.

    print "Hello world";            # This writes Hello world to the screen leaving the cursor at the end.
    print "Hello world\n";          # This writes Hello world to the screen moving the cursor to the next line.
    print "Hello", " world\n";      # This writes Hello world using two separated types in the print statement.

In all of the above cases we are printing a string, the only differences being that the second print includes the \n formatting character and the third splits the string into two pieces. You can think of \n as being a carriage return (or in our case the Enter key). In other words, any time a print instruction encounters this character within a string, the output moves to the start of the next line. If we execute the following little program

    print "If knowledge can create problems,\n It is not through ignorance\n";
    print "that we can solve them.\nIsaac Asimov\n";

we see this:

    If knowledge can create problems,
     It is not through ignorance
    that we can solve them.
    Isaac Asimov

The output is heavily controlled by the careful placement of \n's throughout the strings. Another data_type is of course variables. A variable can be printed in two ways.
The following sequence of instructions shows several examples of printing variables together with supporting strings. Look very closely at the output and the associated syntax used to create that output.

    $a=10;     # Assign the value 10 to $a

    # Here we are printing a string, then a number, and finally a string
    print "The answer is ",$a,"\n";                # the output is The answer is 10

    # The following just prints a single string. Since the string is double quoted the enclosed
    # variable is variable interpolated as discussed earlier.
    print "This is another way to print $a\n";     # the output is This is another way to print 10

    # The following is a print of a string using single quotes. Here NO interpolation occurs!!!
    print 'This is another way to print $a\n';     # the output is This is another way to print $a\n

From the above you can see that you can print a variable embedded in a double quoted string or by itself on the print line, separated by commas.

It is also possible to read numbers or strings from the keyboard (i.e. <STDIN>). In order to do this we normally use two instructions. The first instruction is called the prompt. It is just a print statement telling the user what he/she is supposed to type in. The second instruction is an assignment statement that retrieves the typed in value and stores it into a variable of your choice. Here is an example.

    print "Enter a number between 1 and 10:";
    $selection = <STDIN>;

The second instruction contains <STDIN>, which stands for standard input and is synonymous with the keyboard. The program actually stops at this instruction and waits for the user (i.e. you) to type something in and hit Enter. The main thing you have to remember here is that everything you type goes into the variable, INCLUDING the Enter key (i.e. \n). Every variable read this way will have the \n on the end. This is quite often a pain and generally one wants to remove it. It can easily be done using the chomp command.
The following examples should demonstrate its usage.

    print "Enter your name:";
    $name = <STDIN>;
    chomp($name);     # removes the \n from the end.
    print "Enter your age:";
    $age = <STDIN>;
    chomp($age);      # you need to remove the \n from both numbers and strings.

Homework Problems

5.1 Give the output of the following program EXACTLY.

    $a=23;
    print "The answer is $a\n";
    print 'The answer is $a\n';     # note the single quotes
    print "I have called this principle,\n by which each slight variation,\n";
    print "if useful, is preserved, by the term of Natural Selection.\n";     # by Charles Darwin

5.2 Give the output of the following program exactly. Note the first variable is not chomped. What would happen if it were?

    print "Enter a number :";
    $num = <STDIN>;
    print "Enter another number :";
    $val = <STDIN>;
    chomp($val);
    print "The second number was $val and the first number was $num. Interesting huh!";

Mathematical operators

Perl has a large number of built-in operators for performing operations on numbers. Perl considers all numbers as real (i.e. floating point numbers). As mentioned previously, a variable can be initialized with these values. Here $PI and $cube10 are initialized.

    $PI = 3.1415926535;     # a real number with decimal
    $cube10 = 1000;         # an integer value

Real numbers and integers can be processed using the standard operators +, -, *, / and can be mixed with little worry. In other words you can add reals (decimals) to integers and it will work properly. These operations can be applied to any combination of numeric literals (things like 3.14 and 2) and variables, while using parentheses in the usual way. When writing expressions for the right hand side of an assignment statement one must always pay attention to order of operations. This is exactly the same order of operations that you learned in elementary school. Remember you were told to multiply and divide before you add and subtract. If you write an expression such as $a + $b*$c, it will be evaluated according to these rules, i.e.
the product $b*$c will be done first and then $a will be added on. For example, the commands in the following program are acceptable and perform the obvious operations. Note that this is a complete program that executes from top to bottom.

    # A simple program of operational processing.
    $PI = 3.1416;
    $r = 5;
    $TwoPi = 2*$PI;
    $Area = $PI*$r*$r;
    $Circum = $TwoPi * $r;
    print " A circle of radius $r has a circumference of $Circum and an area of $Area.\n";

Output is:

     A circle of radius 5 has a circumference of 31.416 and an area of 78.54.

There are quite a few other operators in Perl, some of which work with numbers and others that work on strings. The first we will look at is the exponentiation operator **. It can be used with whole or real numbers.

    $Area = $PI * $r**2;            # This is of course pi times r squared.
    $c = ($a**2+$b**2)**.5;         # Remember the Pythagorean theorem?
    $c = sqrt($a**2+$b**2);         # Same thing but using the function sqrt()

The sqrt() used above is called a function. In fact, in this case it is a built-in function. This means that it is already in Perl, ready for us to use. When you use a function like this it is referred to as calling the function. The call is replaced by the value that results. For example, if you have sqrt(4) in a program, it will be replaced by 2. Although appendix B contains a list of functions that may be useful in this class, there are in fact many more that can be used. You can even download entire libraries that are specially designed for a specific area, such as BioPerl.

Another operator that is very useful is the modulo operator %. This returns the remainder that occurs when you divide one number by another.

    $y = 231;
    $w = $y % 3;     # sets $w to 0
    $z = $y % 2;     # sets $z to 1

Remember when you first learned to divide? Before you worked with decimals, you divided a large number by a small number, and if it divided evenly then you obtained a 0 remainder. If it did not divide evenly you obtained a nonzero remainder. The % operator returns this remainder.
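The remainders quoted above can be checked by hand and by program. The sketch below repeats the two cases from the text and adds a third; the divisor 5 and the variable names $by3, $by2 and $by5 are assumptions chosen only for illustration.

```perl
$y = 231;

$by3 = $y % 3;    # 231 = 3*77 + 0, so the remainder is 0
$by2 = $y % 2;    # 231 = 2*115 + 1, so the remainder is 1
$by5 = $y % 5;    # 231 = 5*46 + 1, so the remainder is 1

print "Dividing $y by 3 leaves a remainder of $by3\n";
print "Dividing $y by 2 leaves a remainder of $by2\n";
print "Dividing $y by 5 leaves a remainder of $by5\n";
```

A remainder of 0 is the signal that the division came out evenly.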
You can use this to determine if a number is even (i.e. if $num%2 is 0), as well as in some other applications which we will run into. Can you think of a way to get the quotient instead of the remainder?

There is a very useful operator for strings called concatenation. It is a single period (.). It is used to connect two strings together, thus creating a longer string. For example

    $Name = "Richard "."Simpson";        # creates the single string "Richard Simpson"
    $genus = "Homo";
    $species = "Sapiens";
    $comment = $genus." ".$species;      # Note a blank was added in between the two words.
    $len = length($comment);             # length is another useful function. Here $len is set to 12!

Homework 6

6.1 What value is assigned to the variables on the left, assuming $x = 5, $y = 3 and $n = "Hello"?
a) $a = 3+$x**2-4*$x;
b) $m = $x*$x+sqrt($x*20)+ 75/$x;
c) $sentence = $n." World"." "." How are you doing?";
d) $t = length($sentence);
e) $d = 100%$x + 27%$x;
f) $c = sqrt($x**2+$y**2);

6.2 How do you determine if a number is even? Is a multiple of 10?

Simple Programs

In this section we will look at some complete programs. All of these programs are straight line programs, where each instruction is executed just once, one after another. It is very important that you think about the execution of a Perl program in this dynamic way. The instructions execute one after another, modifying and storing the variable values at each step. The first example is a simple program that reads in the radius of a circle (entered at the keyboard by the person running the program) and then prints out the diameter, circumference and area of the circle as calculated from the entered value. Note that no mention is made of the unit of measure. Is it inches, feet, meters?
Since the formulas apply to all of these, it was not necessary.

    print "Enter the radius of a circle:";
    $radius = <STDIN>;     # Get the radius from the keyboard
    $diameter = 2 * $radius;
    $circumference = 2 * 3.14159 * $radius;
    $area = 3.14159 * $radius ** 2;
    print "The radius is $radius\n";
    print "The diameter is $diameter\n";
    print "The circumference is $circumference\n";
    print "The area is $area\n";

The output for the above program, assuming we enter 3 for the radius, is given below. (The blank line appears because $radius was not chomped and still carries its \n.)

    Enter the radius of a circle:3
    The radius is 3

    The diameter is 6
    The circumference is 18.84954
    The area is 28.27431

If we run the program again and enter 10 instead we obtain the following output.

    Enter the radius of a circle:10
    The radius is 10

    The diameter is 20
    The circumference is 62.8318
    The area is 314.159

The above program is indicative of most Perl programs. It is a program that uses I/O (Input and Output). Data is requested from the user, which is then processed, and the resulting information is printed out for the user to read. Let's look at another example that works on strings, as opposed to the numbers processed in the previous example.

    print "Enter your last name:";
    $last = <STDIN>;
    chomp ($last);     # Remove the CR (carriage return, i.e. the Enter key)
    print "Enter your first name:";
    $first = <STDIN>;
    chomp ($first);
    $name = $first . " " . $last;     # The . is the concatenation operator!
    print "Hello $name it is nice to see you.\n";

and its output.

    Enter your last name:Darwin
    Enter your first name:Charles
    Hello Charles Darwin it is nice to see you.

This program has two features that need discussing. First, we use the chomp function. It is a simple command that removes the CR from an inputted word. Not doing so has a tendency to foul up our printouts. If you do not remove it, every time you print the string an automatic CR will be printed immediately after it, something that you do not normally want. The next feature of note is the period (.) that we see in the next to last line.
Recall that this is the concatenation operator for strings. It allows us to connect strings together (in this case three) in order to create a single string. The command

    $name = $first . " " . $last;

connects $first, a blank " " and $last into one string, placing the result in the variable $name. The blank is necessary so that CharlesDarwin will be split by a separating blank. It would be instructive for you to run the above program without the chomps, and then again without the inserted blank, to see the result. This will help you with debugging when you make errors in the future. If you forget to use a chomp sometime in the future, the error in the printout will hopefully remind you of the cause.

Exercises
(Formulas that you do not remember can be found on Wikipedia! Also, before you accept your results, check them by hand. You may have your order of operations wrong in your program.)

7.1 Write a Perl program that will request a temperature in Celsius and calculate and print out the temperature in Fahrenheit. Do it again, but in this case ask for Fahrenheit and convert it to Celsius.

7.2 Write a Perl program that will request the three coefficients of a quadratic \( y = ax^2 + bx + c \) as well as the value for x, and then have it print out the resulting value for y.

7.3 Recall Einstein's mass energy equation \( E = mc^2 \). Write a program that will read in the mass of an object and have it print out the energy that that mass represents. The variable c is the speed of light in meters/sec. Look up its value on the internet and use scientific notation when typing its value into your program.

7.4 The well known formula \( a^2 + b^2 = c^2 \) is often referred to as the Pythagorean Theorem. Write a program that will ask for b and c and have it give you a. You will need to use the built-in square root function in this program. It is possible for bad things to happen when you run this program. Explain.

7.5 At home you have power sockets all over the place. Most of them run at 120 volts.
You may recall that your electric dryer and electric range run at 240 volts. Write a program that will calculate how much it costs to run a 100 watt light bulb for 7 straight days. Have your program request the wattage of the bulb and the cost per kWh (kilowatt hour). What do we pay here in WF per kWh? The method of calculation is as follows.

    wattage x hours used ÷ 1000 x price per kWh = cost of electricity

7.6 Write a program that will read in the lengths of the three sides of a rectangular prism (box). Have your program print out the volume and surface area with an associated comment.

7.7 A "molecular clock" is a gene that evolves at a steady rate and is present in many related species. The percent similarity of this gene between any pair of species is given by the number of base positions in the gene that are the same between the two species. The time that has passed since the point when two species diverged varies approximately with the percent difference between the two; that is, the time since divergence of two species is given by

    (100 - X% sequence similarity) / (% change / years)

Write a program that reads in the sequence similarity, percent change and number of years. Have the program print out the time since divergence.

The IF statement and Logical operators

Straight line programs as discussed in the previous section are quite limited. Although many simple problems can be solved this way, the majority cannot. For example, suppose we want to solve a quadratic equation using the quadratic formula. Recall that the solutions of the equation \( ax^2 + bx + c = 0 \) are given by

\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

This formula creates several problems for us. First is the plus/minus. It is really shorthand for the following two solutions, which need to be calculated separately.

\[ x = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \qquad x = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \]

The second issue concerns the radical (i.e. the square root). You may recall that we cannot calculate the square root of a negative number (without using complex values).
If we execute the command sqrt(-3) in Perl, an error will occur that will kill our program. We must make sure that the value under the radical (aka the discriminant) is not negative before we attempt to take its root. One of the instruction types in Perl that allows us to check this value is called the IF statement. This statement allows us to change the order in which instructions are executed in the program, a process called Flow Control. In its simplest form the IF statement looks like the following.

    if (conditional test) {
        Instruction 1
        Instruction 2
        . . .
        Instruction n
    }
    Next instruction
    Etc.

The flow of control works as follows. If the conditional test is TRUE, then Instruction 1 thru Instruction n are executed, followed by the Next instruction. If the conditional test is FALSE, then the instructions enclosed in the braces {…} are SKIPPED and the Next instruction is executed. All this of course depends on the conditional test.

There are many kinds of conditional tests. Probably the most used are comparisons between numbers, or between numbers and variables containing numbers. Here is an example list of numerical comparisons.

    Numerical comparison    Meaning
    3 > 0                   is always true since 3 is always greater than 0
    $x > 10                 is true if the value in $x is greater than 10
    $y <= 100               is true if the value in $y is less than or equal to 100
    $num != $value          is true if $num is not equal to $value
    $count == $y+1          is true if the value in $count is equal to the value in $y + 1
    $count%2 == 0           is true if $count is an even number

In each of the above comparisons we use a special operator to define the specific comparison we desire. Here is the list of operators that can be used to compare numbers, together with the corresponding operators for comparing strings.

    Numerical operator    Meaning                  String operator
    >                     Greater than             gt
    >=                    Greater than or equal    ge
    <                     Less than                lt
    <=                    Less than or equal       le
    !=                    Not equal                ne
    ==                    Equal                    eq

It is very important that you notice the double = signs when comparing numbers.
A single = sign used in an instruction such as $num=1 has an entirely different meaning (semantics) than $num==1. The former is an assignment statement that puts a 1 into the variable $num, whereas the latter is a comparison that is either TRUE or FALSE. The value of $num is NOT modified in the comparison case. Weird things happen when you forget to use the double = in a comparison. Also pay particular attention to the operators that must be used with strings. You cannot compare strings using the numerical operators. Capisce?

True and false have actual values. Although 0 is defined to be false and anything that is not zero is true, we normally consider 1 to be the representation for true. Consider the following code segment.

    if (1) {
        print " This will always print";
    }

Since the value in the parentheses is 1, i.e. TRUE, the print statement will always execute. A comparison can be thought of as an operation that returns (is replaced by) either a 1 for true or a false value that counts as 0. The comparison $n<10 in the following command sequence

    if ($n < 10) {
        print " This will always print out if the value of $n is less than 10";
    }

is replaced by a true or false value depending on the value of $n.

There is another version of the IF statement that includes an ELSE section. Its format is as follows.

    if (conditional test) {
        Instruction 1
        Instruction 2
        …                  # True part
        Instruction n
    } else {
        Instruction a
        Instruction b
        …                  # False part
        Instruction z
    }
    # Continuing code

Instructions 1 thru n are executed if the test is true; otherwise Instructions a thru z are executed. In both of the above forms it is important to note the braces { and }. These must be used to delimit both the true and false sections. Leaving them out will result in a syntax error. In order to make it easier to see whether or not all the braces are in their proper place, it is advisable to develop a style or form and adhere to it. The above is a reasonable form. Note where the braces are with respect to how they line up.
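The difference between = and ==, and the layout of an IF-ELSE, can be tried out with a small sketch like the one below. The variable names $num, $message and $is_small are made up for illustration, and the use strict, use warnings and my lines are common Perl practice that this text has not introduced; they are assumptions here, not part of the course programs.

```perl
use strict;
use warnings;

my $num = 7;              # a single = assigns: $num now holds 7
my $message;

if ($num == 1) {          # a double == only compares; $num is not changed
    $message = "num is one";
} else {
    $message = "num is not one";
}
print "$message\n";

# A comparison is itself a value: 1 when the test is true.
my $is_small = ($num < 10);
print "The test \$num < 10 gave $is_small\n";
```

With $num set to 7 the else part runs, and the second print shows the 1 that stands for true. Changing the 7 to a 1 should send the program through the true part instead.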
Readability is enhanced by indenting the instructions enclosed within a pair of braces.

To clarify the IF statement, let's continue looking at the quadratic formula. The main thing we need to do in the program is to check the value under the radical. If it is negative we will just print out that there are no real solutions. So here we go.

    # This program evaluates the quadratic formula
    print "Enter a:";
    chomp($a=<STDIN>);    # Read in the a coefficient. Note the two instructions in one!
    print "Enter b:";
    chomp($b=<STDIN>);    # Read in the b coefficient
    print "Enter c:";
    chomp($c=<STDIN>);    # Read in the c coefficient

    # Determine the discriminant
    $disc = $b**2-4*$a*$c;
    if ($disc<0) {
        print "There are no real solutions where a=$a, b=$b and c=$c\n";
        print "This is because the discriminant is $disc\n";
        exit;             # This little instruction causes the program to exit.
    }
    $x1 = (-$b+sqrt($disc))/(2*$a);
    $x2 = (-$b-sqrt($disc))/(2*$a);
    print "There are two real solutions and they are $x1 and $x2\n";

Running the above program for a=1, b=2 and c=3 gives

    Enter a:1
    Enter b:2
    Enter c:3
    There are no real solutions where a=1, b=2 and c=3
    This is because the discriminant is -8

Here is another example that has real values.

    Enter a:2
    Enter b:5
    Enter c:2
    There are two real solutions and they are -0.5 and -2

From the above example you can see the utility of the IF statement. It allows the program to execute different instructions depending on the result of each comparison. One must be very careful when writing programs that have one or more IF statements, because the path a program takes thru the code depends on the values returned by the comparisons. A program that has a lot of IF-ELSE statements has a very large number of different paths that the execution might take. This is even clearer when one realizes that IF statements can be nested within other IF statements. Programs can become very complicated indeed.

Let's look at another example, one that contains nested IF's.
Suppose we want to request a string from the user (keyboard) and print out whether the length of the string is 5, less than 5 or more than 5. To do this problem we will need to use the length( ) function. This function returns the length of the string that it is applied to. In other words, if $str="Hello" then length($str) will return 5.

    print "Enter a string:";
    chomp($str=<STDIN>);
    $len = length($str);
    if ($len == 5) {
        print "The length of the string is equal to 5\n";
    } else {
        if ($len < 5) {
            print "The length of the string is less than 5\n";
        } else {
            print "The length of the string is greater than 5\n";
        }
    }
    # Continuing code

Note that the inner IF-ELSE statement in the above program is nested within the else part of the enclosing IF. Follow the path thru the program for all three cases and convince yourself that this indeed works. In working with this program you have probably already noticed that when typing braces within Notepad++, matching braces are highlighted in red as soon as the second matching brace is typed. This helps you make sure every brace matches up properly. This is a cool feature, so use it to your advantage.

The next example is a simple program that reads in three numbers and then prints out the largest. There are really two ways to do this. The first way, which we shall explain at this point, uses an extra variable to hold the largest so far. Let's look at it.

    # Determine the largest of three numbers
    print "Enter the first number:";
    chomp($n1=<STDIN>);
    print "Enter the second number:";
    chomp($n2=<STDIN>);
    print "Enter the third number:";
    chomp($n3=<STDIN>);
    $max = $n1;                        # $n1 is the max so far
    if ($n2>$max) { $max = $n2; }      # if $n2 is bigger make it max
    if ($n3>$max) { $max = $n3; }      # if $n3 is bigger make it max
    print "The largest value of all three is $max\n";

The main point to note in the above program is the use of the variable $max. This variable is used to contain the largest value we have seen so far.
Every time a new number is checked against this variable, the contents of $max will be replaced by the new number if it is indeed larger. If the program were written using only IF statements, without the use of an extra variable such as $max, its structure becomes a little more complicated. See Exercise 8.2.

Our last example involves strings. Here we are to write a program that asks the user 3 DNA related questions, obtains the answers from the user, indicates Right or Wrong for each case and then prints out the percentage correct. For the sake of simplicity the questions are hard coded.

    # A three question quiz on DNA
    $num_correct=0;
    print "What year was the structure of DNA discovered? ";
    $ans =<STDIN>; chomp ($ans);
    if ($ans == 1953) {            # Note the == for comparing numbers
        print "Awesome, you are correct!\n";
        $num_correct = $num_correct + 1;
    } else {
        print "I am sorry. You must be really ignorant!\n";
    }

    print "\nThere was a woman who also deserves credit for the discovery\n";
    print "of the structure of DNA. What was her last name? ";
    $ans =<STDIN>; chomp ($ans);
    if ($ans eq "Franklin") {      # Rosalind 1920-1958. Note the eq for comparing strings!
        print "Awesome, you are correct!\n";
        $num_correct = $num_correct + 1;
    } else {
        print "Wrong, Wrong ! You must be really stupid!\n";
    }

    print "\nWhat is the structure of the DNA molecule ";
    $ans =<STDIN>; chomp ($ans);
    if ($ans eq "double helix") {  # Another eq here!
        print "Awesome, you are correct!\n";
        $num_correct = $num_correct + 1;
    } else {
        print "What an idiot! Can't you do anything right?\n";
    }

    $per = $num_correct/3*100;
    print "You were $per percent correct!\n";

The above program is straightforward. Each question section is structured the same as the others. The main things to note of course are the use of eq and == as commented in the program.
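Why it matters which of eq and == is used can be seen in a short sketch such as the following. The variables $numeric_test and $string_test are invented for this illustration, and the use strict, use warnings, no warnings and my lines are assumptions taken from common Perl practice rather than something used in the quiz program.

```perl
use strict;
use warnings;
no warnings 'numeric';    # keep Perl from complaining that the text is not a number

my $ans = "Watson";       # a wrong answer to the second quiz question

# == turns both sides into numbers first; text with no leading digits becomes 0
my $numeric_test = "false";
if ($ans == "Franklin") { $numeric_test = "true"; }

# eq compares the characters themselves
my $string_test = "false";
if ($ans eq "Franklin") { $string_test = "true"; }

print "Using ==: $numeric_test\n";
print "Using eq: $string_test\n";
```

The == test comes out true even though the two names differ, because both sides are treated as the number 0. The eq test correctly comes out false, which is why the quiz uses eq for the name and == only for the year.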
The other thing to note is the counting variable $num_correct, which is incremented every time the user gets an answer correct.

It turns out that there are many instances where we need to offer several options (say in a menu) and process the one that is selected. We can do this with the previous IF statement by nesting, i.e. an if within an if within an if and so on. This becomes complicated because of all the braces that are required. In order to simplify the coding of these cases the designers of Perl created a special instruction called elsif. It is mainly used to select from a list of options entered at the keyboard. The format is as follows.

    if (first test) {
        # instructions to do if the first test is true
    } elsif (second test) {
        # instructions to do if the second test is true
    } elsif (third test) {
        # instructions to do if the third test is true
    } else {
        # instructions to do if none of the above are true
    }

Although the above shows only 3 tests, there can be as many as desired. The last part should be a plain else and be the catch-all for anything that is not caught by the previous cases. This sequence is quite handy and is used very often.

A nice example where this is used is in menu systems. In these cases a user is requested to enter a selection from a menu listing possible options. The following is a program that will convert either base 8 or base 16 numbers to base 10 (decimal). It prints a menu on the screen and the user selects either a or b. The program also uses two built-in functions, hex() and oct(), to do the conversions for us.

    print "\n Menu options for acceptable conversions\n";
    print " a. Base 8(octal) to decimal\n";
    print " b. Base 16(hexadecimal) to decimal \n";
    print " Enter the requested conversion, (a or b) :";
    $selection = <STDIN>;
    chomp($selection);    # get rid of the \n or the following test won't work!
    if ($selection eq "a") {           # convert the octal value
        print " Enter the octal number :";
        chomp($oct = <STDIN>);
        $result = oct($oct);
    } elsif ($selection eq "b") {      # convert the hexadecimal value
        print " Enter the hexadecimal number :";
        chomp($hex = <STDIN>);
        $result = hex($hex);
    } else {                           # value entered is incorrect, i.e. it's not a or b
        die "\nERROR :please enter a or b only!\n\n";
    }
    print " The value in decimal is $result\n";

Here is an example run where the user requests a hexadecimal conversion.

    Menu options for acceptable conversions
     a. Base 8(octal) to decimal
     b. Base 16(hexadecimal) to decimal
     Enter the requested conversion, (a or b) :b
     Enter the hexadecimal number :3C5
     The value in decimal is 965

Several of the exercises below will give you practice with using if and elsif instructions. Proper use of these instructions is critical to Perl programming since they define the intended logic of the programmer.

Exercises

8.1 Copy and paste the DNA program from the previous page into Notepad++ and create two additional DNA questions, one that has a numerical answer and one that has a string answer. Test and run.

8.2 Write a program that reads in three numbers and prints out the largest. Use only the nesting of multiple IF statements to do this. Do not use any support variables such as $max. Run and test every possible scenario, i.e. the first is largest, the second is largest, the third is largest, etc. How many relative data input scenarios are there?

8.3 Convert the above program so that it reads in last names instead of numbers. We are now ordering alphabetically, so print out the name that occurs deeper in the alphabet. Test carefully.

8.4* Write a program that reads in three numbers and prints out the numbers in numerical order from small to large. Use only nested IF statements to accomplish this.
Draw a comparison tree on a piece of paper to help you keep the logic straight.

8.5 Write a program that reads in 5 exam grades (0-100) and have it print out each grade and the letter grade associated with it. Also have the program print out the average and the number of students that passed (the number of students that made above 59). You will need an extra variable or two to accomplish this. Do this problem in steps. First read in the numbers and print the average. After this is working, have the program print the letter grade for the first exam, etc. Incremental development is really a nice way to develop programs. Try it, you will like it!

8.6 (Use the elsif in this problem.) Write a program that will first request two numbers from the user. Then have the program display a menu of the four different operations (add, subtract, multiply and divide). Then perform the selected operation and print the result. Here is what an example run should look like.

    Enter the first number: 4
    Enter the second number: 5
    Menu options for operations
     a. Add
     b. Subtract
     c. Multiply
     d. Divide
    Enter the requested operation: b
    The answer is -1

8.7 Write a program that converts meters to either kilometers, centimeters or millimeters. Use a menu to obtain the user's choice.

8.8 Write a program that will either convert Celsius to Fahrenheit or Fahrenheit to Celsius, depending on the user's choice from a menu.

8.9 HARDY-WEINBERG EQUILIBRIUM: Used to determine if the allele frequency in a population is changing.

    p^2 + 2pq + q^2 = 1 and p + q = 1

    p   = frequency of the dominant allele in the population
    q   = frequency of the recessive allele in the population
    p^2 = percentage of homozygous dominant individuals
    q^2 = percentage of homozygous recessive individuals
    2pq = percentage of heterozygous individuals

Write a program that will read in p and q and have it print out whether or not the population is in equilibrium. Just how close to 1 do you need to be?
Make a decision.

The While Statement

In this section we add a new statement that greatly increases the power of our programs. This statement is called the WHILE statement and it allows us to run sections of our program repeatedly. This is called iteration and is heavily used in almost every program on earth. The basic structure of a WHILE is as follows.

    # previous instructions
    while (conditional test) {
        Instruction 1;
        Instruction 2;
        . . .
        Instruction n;
    }
    # subsequent instructions

When the above while is encountered for the first time the conditional test is performed. If it is true then instructions 1 thru n are executed. At this point the conditional test is performed again. If it is still true then instructions 1 thru n are executed again. This process is repeated over and over until the test is false, at which time the path of execution continues with the subsequent instructions. Loops can be executed thousands of times, making this a very economical way to write code. Let's look at the following simple example.

    $n = 1;
    while ($n <= 5) {        # n less than or equal to 5
        print "$n ";
        $n = $n + 1;         # increment $n
        sleep(0);            # pause here zero seconds
    }
    print "\nDone\n";        # Throw in a couple of new lines and a Done

The program initializes $n to 1 in preparation for the upcoming loop. Then the while is encountered and the test $n <= 5 is performed. It is true, so 1 is printed and $n is set to 2. Back at the top of the loop the test $n <= 5 is done again. It is still true since 2 <= 5, so 2 is printed. Continuing, we obtain the following output.

    1 2 3 4 5
    Done

Running the program after changing the 5 to a 10 will print 1 thru 10, and so on.

It is interesting to slow the program down some using the sleep command. Change the sleep(0) to sleep(1) and the program will pause for 1 second every time it hits this instruction. This can be quite educational since it allows you to slow down programs so you can actually see the execution flow.

What do you think would happen if we inadvertently left out the $n=$n+1 command?
Hmmm.

This is called an infinite loop. It will just keep on running since $n never changes and consequently never gets larger than the limit, 5 in this case. Remove the instruction and run it. Of course, if you do so I am sure that you will want to know how to kill (stop) the program. Just hit Ctrl-C and it should stop.

There are basically three ways to read in data. The first way, reading from the keyboard (i.e. STDIN), we have already examined. Here we will look at another, very convenient method. In this technique the data is added on to the end of the program code after the __END__. Note the double underscores on each side. Here is an example.

    # Read data from the end of the file
    # The <DATA> file handle refers to the data after __END__
    while ($val=<DATA>) {
        $sum = $sum + $val;
        $n = $n + 1;
    }
    $ave = $sum/$n;
    print "There are $n values with an average of $ave\n";
    # The next line starts the data section that the above program will read from
    __END__
    34
    45
    62
    12
    81
    78
    66

OUTPUT

    There are 7 values with an average of 54

The thing to note is that each time $val=<DATA> is executed an entire line is read (a line is all text up to AND INCLUDING the \n). This implies that we should place one data value per line. There are ways to extract multiple values from a single input line, but we will leave that technique for a later section.

Let's look at another example.
Suppose that we would like to read in <DATA> values until the end of the file and print out the largest and smallest values in the list.

    # Read data and print the largest and smallest.
    # The <DATA> file handle refers to the data after __END__
    # We first read in one value and call it the smallest and also the largest
    $val = <DATA>;
    chomp($val);          # remove the \n before the value is saved
    $smallest = $val;     # This is the smallest we have seen so far
    $largest = $val;      # This is the largest we have seen so far
    while ($val=<DATA>) { # keep on reading and looping until we run out of numbers
        chomp($val);
        if ($val < $smallest) { $smallest = $val; }
        if ($val > $largest)  { $largest = $val; }
    }
    print "The largest is $largest and the smallest is $smallest\n";
    __END__
    34
    45
    62
    12
    81
    78
    66

Homework

In each of the following programs read your data from a DATA list after the __END__ statement. Assume that the data is 1 item per line.

9.1 Write a program that will read in the included data and have the code print out the number of data values that are greater than 40.

9.2 Write a program that will

File Input/Output

The third location that we can read data from is a file. In order to do this programmers use a file handle that is linked to an actual file. For example, suppose the file info.txt contains a list of numbers, one per line. We can access this file from a Perl program by first creating a file handle name, say FILEHDL, and linking it to the file info.txt using the open command as follows.

    open(FILEHDL, "info.txt");

Once this is done, the usual $var = <FILEHDL> will read a line from the file info.txt and place it into the variable $var. The previous example that averages data values can be easily converted to read from a file.
Assume we have a file named info.txt that contains the previous values, one per line.

    # Read data from a file
    # The <FILEHDL> file handle refers to the data in info.txt
    open(FILEHDL, "info.txt");
    while ($val=<FILEHDL>) {
        $sum = $sum + $val;
        $n = $n + 1;
    }
    $ave = $sum/$n;
    print "There are $n values with an average of $ave\n";

OUTPUT

    There are 7 values with an average of 54

Let's now look at a program that reads in a text file that contains a book and prints the book to the screen. In order to have a book to read you first need to download one from the web. There is a wonderful site for doing this, so let's download a really interesting book by Alfred Russel Wallace entitled The Malay Archipelago Volume I. (of II.). Name the file ARWMalayV1.txt. You may recall that Wallace developed a theory of natural selection at the same time that Darwin did. He and Darwin were good friends, and in fact Wallace was one of Darwin's pall bearers.

The following program will dump (print) this file to the screen whilst counting the number of lines.

    open(FILE, "ARWMalayV1.txt");
    while ($line = <FILE>) {
        print $line;                      # no CR is needed here since the $line already has one.
        $line_count = $line_count + 1;    # just in case we want to know the number of lines.
    }

Each time the above while iterates, a new line from the file is read and printed to the screen.

A programmer, more often than not, needs to write the output to a file instead of the screen. Perl allows the opening and creation of a file in one command. For example the command

    open(FILEOUT, ">newfile.txt");

will create a new file named newfile.txt and define FILEOUT as its file handle. The following program will write the numbers 1 thru 10 to a new file named numbers.txt and close it.

    open(FILEOUT, ">numbers.txt");    # note the > sign. This is required for output files.
    $n = 1;
    while ($n <= 10) {
        print FILEOUT "$n \n";        # writes the number to the file.
        print "$n \n";                # prints the number to the screen
        $n++;
    }
    close FILEOUT;

Enter the above program (or copy and paste) into Notepad++ and save. Open a DOS window in the directory where the program is to run and note the contents of this directory. Now run the program and note again the contents of this directory. You should see a new file, named numbers.txt, that has been created by the above program. Look inside this file by running the DOS type command. You should see the numbers 1 thru 10 displayed one per line. If you do not see this file make sure you are in the correct directory. The file will be created in the directory that the Perl program is run from.

When opening a file that should exist on the drive there is a special die command that can be used to kill the program in case the file does not exist. It is common practice to use this command since the input file name may be misspelled, the file may be in the wrong directory, or the file may never have been created. Here is an example of the syntax for the die as used in conjunction with the open.

    open(FILEIN, "data.txt") or die "Input file does not exist!\n";

If the file data.txt exists then execution falls thru to the subsequent instruction; if not, the program will print Input file does not exist! and exit. This is a safety trap that should always be used when opening files.

Perl is a programming language that was designed from the get-go to process strings. One string type that we may well have interest in is one that contains a DNA sequence. Recall that a DNA strand is made up of the four nucleotides adenine, cytosine, guanine and thymine. These are usually represented by the letters a, c, g and t respectively. A DNA snippet such as acggtattcgttaaaccgt can be processed by Perl if it is stored in a string variable, such as $dna.

    $dna = "acggtattcgttaaaccgt";

In order to facilitate string processing Perl has a built-in function called substr() that allows one to access parts (substrings) of the string.
A substring of $dna is any contiguous subsequence containing one or more characters of the string. For example ggtatt is a substring of $dna that starts at position 2 of the $dna string. The first letter 'a' is at position 0, the second letter 'c' is at position 1 and so on. The Perl command

    $ss = substr($dna, 2, 6);

will load the variable $ss with "ggtatt". The first parameter in the substr function is the string to be accessed, the second parameter is the starting position and the third parameter is the number of characters to extract (or access). So the above command starts at position 2 (the third character) of $dna and copies 6 characters, which are subsequently placed in $ss. The variable $dna is unharmed by this function.

Here is an example program that will print the nucleotides from the $dna string, one per line. Although we would not normally do this, it does indicate how the command works.

    $dna = "acggtattcgttaaaccgt";
    $pos = 0;
    while ($pos < length($dna)) {
        $nucleotide = substr($dna, $pos, 1);
        print "$nucleotide\n";
        $pos++;               # go to the next character.
    }

Copy and paste this into Notepad++ and run it. Note: when copying and pasting from Word, or a browser for that matter, your double quotes may be incorrect (the wrong ones). Retype them within Notepad++ if this occurs.

With the substring function you can perform a large number of DNA processing tasks, such as counting the number of each nucleotide type. As a simple example let's count the number of c's that occur in $dna.

    $dna = "acggtattcgttaaaccgt";
    $pos = 0;
    while ($pos < length($dna)) {
        $nucleotide = substr($dna, $pos, 1);
        if ($nucleotide eq 'c') { $count_c++; }    # count each c we pass by
        $pos++;               # go to the next character.
    }
    print "There are $count_c cytosine nucleotides in the dna strand.\n";

Now that we are getting into processing DNA we need a web site to download from.
We will use the National Center for Biotechnology Information (NCBI) site. It is a huge web site, containing a large number of databases maintained by NIH (National Institutes of Health). We will restrict ourselves initially to the GenBank database. Some of the following homework uses data from this site, as will many of the remaining sections of this document.

Homework

10.1 Create a file entitled data.txt using Notepad++ and enter a few lines of data. Now write a program that will read this file and copy it over to a new file whose name is data.dat. After the program has run, check the contents of data.dat to make sure it is the same as data.txt.

10.2 Write a program called copy.pl that will request the user to enter two file names; the first is called the from file and the second is called the to file. The from file should already exist. The program should open the from file and copy the contents to the to file. If the from file does not exist then a message should be sent to the user indicating so, and then the program should die.

10.3 Write a Perl program that will add up the numbers from 1 to 100 and print out the result.

10.3 Write a program that will request the user to enter two numbers and add up the integers that occur from the smaller to the larger inclusive. Your program should handle integer inputs where the first number entered is the largest and the second number is the smallest, and vice versa.

10.4 Go to the NCBI web site, download a copy of Mammoth mitochondrial DNA and save it in the file mam.txt. Open this file with Notepad++ and delete all text that is not DNA, leaving only the base characters a, c, t and g. Resave. Now write a Perl program that reads in the mam.txt file and counts the number of a's, c's, g's and t's, printing out the number of each together with the percentage of each within the total DNA sequence.

10.5 Write a Perl program that will print the reverse complement of a DNA strand contained in $dna.
For comparison purposes print out both $dna and its reverse complement on two different lines. Recall that the reverse complement is the DNA sequence that matches this sequence on the other strand of the DNA helix. All you need to do here is reverse the string, converting c's to g's and t's to a's (and vice versa). For example, the reverse complement of acgggaggacg is cgtcctcccgt. By the way, there is a reverse function, reverse($str), that will reverse the string parameter $str when its result is assigned to a variable, as in $rev = reverse($str). You can use this if you would like.

Regular Expressions

One of the features that make Perl so useful to people who work in the fields of biology and chemistry is its ability to process large text files rapidly. This feature is greatly enhanced by a built-in pattern recognition system called regular expressions (RE). Regular expressions are extremely powerful and allow the programmer to search a string for virtually any data pattern of interest. A pattern is enclosed in slashes. For example, if the programmer would like to look for the string "good", the format /good/ is used, where the pattern in this case is just the four letters good. As an example suppose that the string variable $s is initialized to "Now is the time for all good men to come to the aid of their country" and a short program is required to print out whether or not $s contains the substring good. This can be done in the following way.

    if ($s =~ /good/) {
        print "The substring good is in the string.\n";
    } else {
        print "The substring good is not in the string.\n";
    }

The operator used in the above if, =~, compares $s with the pattern given between slashes (i.e. /pattern/) and returns true if the pattern is found and false if not. The output for the above program is

    The substring good is in the string.

There are a variety of methods for defining patterns in Perl. The simplest is the single character or string pattern. Here the programmer places the sequence of characters that is to be searched for in the pattern.
For example if we want to search for Bothriolepis then we use /Bothriolepis/. The RE /Devonian period/ could be used to search for the word pair Devonian period. Note that RE's are case sensitive, and the space is also a character and consequently must be matched.

There are many times when patterns will contain don't cares. If the programmer wants to specify a don't care, it is done by inserting a . (period!) in the pattern. For example the pattern /A.G/ could be used to match AIG, AUG, A;G, AEG, A4G, etc.

A character class is a list of characters between a set of brackets. One and only one of these characters needs to be present at the corresponding part of the string for the pattern to match. For example [aeiou] would match a single vowel. [0123456789], or equivalently [0-9], will match a single digit. If we are interested in matching only letters, either caps or not, we would use [a-zA-Z]. The class [a-zA-Z0-9] matches any alphanumeric character. Due to the frequent use of some of these classes the following special predefined character class abbreviations were created.

    \d   is equivalent to [0-9]
    \w   is equivalent to [a-zA-Z0-9_]
    \s   matches white space, [ \r\t\n\f], i.e. the space, return, tab, newline and form-feed characters
    \b   matches a word boundary, i.e. the edge of a word next to a space, period, comma, start of string, etc.

It may be the case that you would like to match everything but small vowels. The ^ negation operator is used to do this within a character class definition. In the vowel case we would write [^aeiou] to match anything that is not a lowercase vowel, or [^0-9] to match anything that is not a digit. Suppose that we would like to match an upper case A followed by anything except x, y or z. Recall that [^xyz] is anything except x, y or z, so /A[^xyz]/ will match Aq or Ab but not Ax etc. The match for a string is the first pattern found that matches the RE.
If there are several matches of different length then it will match the longest one.

Suppose that we download Charles Darwin's On the Origin of Species (1st ed.) from the web and name it oots.txt. The following program will search for the word evolution in this document and display the appropriate result. Note that we are looking for a line that has either Evolution or evolution embedded within it. Since \b is placed only at the front of the RE it is possible that we might match something like evolutionary. Do this exercise and see what you find out.

    # Here we open the file and read one line at a time and check for the word evolution.
    open(FILE, "oots.txt") or die "Input file does not exist!\n";
    $ct = 0;
    # Every time the following line is executed the next line in the file
    # is loaded into $line
    while ($line = <FILE>) {
        $ct = $ct + 1;
        # Recall the \b in the following regular expression matches on a word boundary (space etc)
        if ($line =~ /\b[Ee]volution/) {
            print "Evolution is in the Origin at line number $ct\n";
            exit;
        }
    }
    print "Evolution is not in the Origin\n";

If we are interested in matching something at the beginning of a string we use a ^ as well. There is no ambiguity, for here the ^ sits between /'s and not inside []'s. This is referred to as context sensitive, i.e. the meaning of ^ is determined by its surrounding context. As an example /^Joseph Hooker/ would match any string that begins with Joseph Hooker, such as "Joseph Hooker was a good friend of Charles Darwin", but not "It is well known that Joseph Hooker was a friend of Darwin". Similarly we use a $ to indicate the end of the string. As an example /office$/ would match "Who got Einstein's office" but not "Where was the office of Einstein".

Multipliers

Another feature of regular expressions is the use of multipliers. These are special characters that allow the specification of multiple instances of a character or pattern. For example we use an * to represent zero or more copies of the immediately preceding character. For example the DNA pattern /a*cgt/ would match aaacgt or cgt or acgt etc.
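The anchors ^ and $ and the * multiplier described so far can be checked with a minimal sketch like the one below. The counters $starts, $ends and $matches are invented names for this illustration, and the use strict, use warnings and my lines are assumptions from common Perl practice rather than part of the course programs.

```perl
use strict;
use warnings;

my $s = "Joseph Hooker was a good friend of Charles Darwin";

my $starts = 0;
my $ends   = 0;
if ($s =~ /^Joseph Hooker/) { $starts = 1; }    # ^ ties the pattern to the start of $s
if ($s =~ /Joseph Hooker$/) { $ends   = 1; }    # $ ties the pattern to the end of $s

print "Starts with Joseph Hooker: $starts\n";
print "Ends with Joseph Hooker: $ends\n";

# a* means zero or more a's, so each of these strands matches /a*cgt/
my $matches = 0;
if ("aaacgt" =~ /a*cgt/) { $matches++; }
if ("acgt"   =~ /a*cgt/) { $matches++; }
if ("cgt"    =~ /a*cgt/) { $matches++; }

print "Strands matched by /a*cgt/: $matches\n";
```

The string begins with Joseph Hooker but does not end with it, so only the first test succeeds, and all three strands are counted because the * allows the a to be missing entirely.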
If we want one or more copies we use a + instead of the *. So the pattern /RA+T/ would match RAT or RAAT or RAAAT etc. RT is not matched in the + case. A ? is used if we want to have one or no copies of the preceding character. A pattern such as /Haa?t/ would match Haat or Hat and nothing else. What do you think /fo+ba?r/ matches? It matches an f followed by one or more o’s, followed by a b, then by an optional a, and finally an r. How about /^\s*$/? This is an often used RE since it matches blank lines. Why? Note that these operators will match the longest string that they can find. For example /t[A-Za-z ]+d/ will start at the first t it can and run through letters and blanks to the last d it can reach, even though there may be an earlier d. How to see exactly what it matches is the subject of a later section.

Alternation Operators

The symbol | is used to match exactly one of a set of alternatives. For example if we want to match a or b or c we would write /a|b|c/. For single characters you probably should use /[abc]/, which is the same thing. This operator really is useful if you want to match certain words, i.e. /rat|mouse/ would match rat or mouse but nothing else. What does /a|b*/ match? What about /(a|b)*/? Here the parens define a precedence grouping. Precedence rules are as follows, with the parentheses being at the top.

Name                      Representation
Parentheses               ( )
Multipliers               + * ? {m,n}
Sequence and anchoring    abc ^ $ \b \B
Alternation               |

Recall that \b matches a word boundary, i.e. /\bHi\b/ would match Hi but not High. \B matches a non word boundary. Here are some examples that may help you see the pattern.

Regular Expression     Example matches
/ab?c/                 abc, ac
/^0x[0-9A-F]+$/        0x4FA, 0xFFF8
/abc*/                 ab, abcccccccc
/(abc)+/               abc, abcabcabc
/(a|b)(c|d)/           ac, ad, bc, bd
/(song|blue)bird/      songbird, bluebird
/ab{2,4}c/             abbc, abbbc
/100\s*mk/             100 mk, There are 100 mk
/[yY][eE][sS]/         yes, YES, Yes, YeS

Suppose we have a file that contains a lot of ordered pairs and would like to extract them from the file. Recall that an ordered pair looks like (number,number).
For example (2,3), (-45,23) and (-1,-4) are ordered pairs. The first thing we need to do is construct the regular expression. We need to match a ‘(‘ then an integer, then a comma, then another integer and finally a ‘)’. Don’t forget that each number can be negative. This can be accomplished with this expression.

/\(-?\d+,-?\d+\)/

Recall that parentheses are used as special characters in regular expressions. If we want to actually search for a paren we must place a \ in front of it. The characters ( and ) are regular expression grouping symbols while \( and \) are just the characters ‘(‘ and ‘)’. Capisce? Let’s write a Perl program that counts every ordered pair of the above format in a file, along with the number of lines read. Pay particular attention to the regular expression in the following application.

open(FILE,"pairs.txt")||die "Could not open file: pairs.txt\n";
$c=0;
$cttotal=0;
while($line=<FILE>){
   # For every line we will do the following
   while($line =~/\(-?\d+,-?\d+\)/g) {$c+=1;}
   $cttotal++;
}
print "Number of ordered pairs=$c\n";
print "Total number of lines=$cttotal\n";

Suppose the above program is run on the following data set, aka pairs.txt.

This is a file of ordered pairs
(2,3), (4,5), (-2,34), (65,2)
and (-2,-3) order pair as well as this one (-234,342)
and this one ( -3, 3)
How about that.

The output would be

Number of ordered pairs=6
Total number of lines=6

Why do we count only 6 ordered pairs when it’s clear there are 7? Look closely at the last one. Notice anything different? This point was not matched (i.e. counted) because there are spaces in front of each number and we DID NOT allow that case in the above regular expression. If we wanted to allow up to two blanks, then placing \s{0,2} in front of each -? will allow it to find and count ( -3, 3)!

Searching text files using regular expressions

The previous section spent some time explaining how to search text files. Since this is such an important topic we will expand on this concept considerably. Why?
Because this is what you will probably do most. There are many file formats that scientists run up against. We already looked at NCBI files so let’s continue with these and learn to extract a variety of information from them. First go to the NCBI web site and download the FM866397 file on Neanderthal mitochondrial DNA. Name the file ncbi_dna.txt. Here is what it looks like.

LOCUS       FM866397                 367 bp    DNA     linear   PRI 06-NOV-2009
DEFINITION  Homo sapiens neanderthalensis mitochondrial D-loop hypervariable
            region 1, isolate Sidron 1351e.
ACCESSION   FM866397
VERSION     FM866397.1  GI:262527002
KEYWORDS    .
SOURCE      mitochondrion Homo sapiens neanderthalensis
  ORGANISM  Homo sapiens neanderthalensis
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
REFERENCE   1
  AUTHORS   Briggs,A.W., Good,J.M., Green,R.E., Krause,J., Maricic,T.,
  TITLE     Targeted sequence capture and analysis of multiple Neandertal
            mitochondrial genomes
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 367)
  AUTHORS   Briggs,A.W.
  TITLE     Direct Submission
  JOURNAL   Submitted (05-NOV-2008) Briggs A.W., Human Evolutionary Genetics,
            MPI-EVA, Deutscher Platz 6, Leipzig, 04103, GERMANY
FEATURES             Location/Qualifiers
     source          1..367
                     /organism="Homo sapiens neanderthalensis"
     D-loop          1..367
                     /note="hypervariable region 1"
ORIGIN
        1 gggagcagat ttgggtacca cccaagtatt gactcaccca tcagcaaccg ctatgtattt
       61 cgtacattac tgccagccac catgaatatt gtacagtacc ataattactt gactacctgc
      121 agtacataaa aacctaatcc acatcaaacc ccccccccca tgcttacaag caagcacagc
      181 aatcaacctt caactgtcat acatcaacta caactccaaa gacgccctta cacccactag
      241 gatatcaaca aacctaccca cccttgacag tacatagcac ataaagtcat ttaccgtaca
      301 tagcacatta cagtcaaatc ccttctcgcc cccatggatg acccccctca gataggggtc
      361 ccttgat
//

Although the above is a short example, all of these files have the same format. The NCBI web site defines each of the subsections such as LOCUS, FEATURES and ORIGIN etc. For our purposes here we see, just by observation, that the actual DNA string begins after the word ORIGIN.
Our first program will read in this file and extract only the DNA that occurs after the word ORIGIN and then write that out to a file. We will remove spaces and numbers after ORIGIN but keep the CR’s (i.e. \n). This program uses a flag variable called $inseq to tell the program when we pass the ORIGIN line. Before we see ORIGIN its value is 0 and after it is 1. Read the code very carefully and see if you can follow the logic.

open (FILE,"ncbi_dna.txt") or die "File not there";
open(OUT,">rawdna.txt");   # Don't forget the > sign for output files.
$inseq = 0;                # we have not seen the ORIGIN line yet
while ($line=<FILE>) {
   if ($line=~/^ORIGIN/){
      $inseq = 1;   # when we pass the ORIGIN line turn on the inseq flag
   }
   elsif($line=~/^\/\/\n/) {   # look for the last line with a //
      last;   # make this the last loop in the while
   }
   elsif ($inseq == 1){   # We are in the DNA section
      $line =~ s/[0-9]//g;    # remove any digits
      $line =~ s/[\t ]//g;    # remove tabs and blanks leaving \n
      # What would happen if we used $line =~ s/\s//g; instead?
      print OUT "$line";
   }
}
close(OUT);

The contents of rawdna.txt as generated by the above program are

gggagcagatttgggtaccacccaagtattgactcacccatcagcaaccgctatgtattt
cgtacattactgccagccaccatgaatattgtacagtaccataattacttgactacctgc
agtacataaaaacctaatccacatcaaaccccccccccccatgcttacaagcaagcacagc
aatcaaccttcaactgtcatacatcaactacaactccaaagacgcccttacacccactag
gatatcaacaaacctacccacccttgacagtacatagcacataaagtcatttaccgtaca
tagcacattacagtcaaatcccttctcgcccccatggatgacccccctcagataggggtc
ccttgat

Here is another short example from the National Center for Biotechnology Information (NCBI). It is a gene from the extinct Tasmanian wolf (Thylacinus cynocephalus). You can see it is quite similar to the previous Neanderthal mitochondria file. If you look closely you see that the entire name for this animal is found immediately after the word ORGANISM and stops with the word REFERENCE. The first line contains the genus and species and the remaining lines the complete taxonomic name. This will be true of all the NCBI DNA files.
LOCUS       EU091365                 388 bp    DNA     linear   MAM 31-DEC-2008
DEFINITION  Thylacinus cynocephalus interphotoreceptor binding protein gene,
            partial cds.
ACCESSION   EU091365
VERSION     EU091365.1  GI:158668312
KEYWORDS    .
SOURCE      Thylacinus cynocephalus (Tasmanian wolf)
  ORGANISM  Thylacinus cynocephalus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Metatheria; Dasyuromorphia; Thylacinidae; Thylacinus.
REFERENCE   1  (bases 1 to 388)
  AUTHORS   Westerman,M., Young,J. and Krajewski,C.
  TITLE     Molecular relationships of species of Pseudantechinus,
            Parantechinus and Dasykaluta (Marsupialia: Dasyuridae)
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 388)
  AUTHORS   Westerman,M., Young,J. and Krajewski,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-AUG-2007) Genetics, Latrobe University, Bundoora,
            Melbourne, Victoria 3086, Australia
FEATURES             Location/Qualifiers
     source          1..388
                     /organism="Thylacinus cynocephalus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9275"
     mRNA            <1..>388
                     /product="interphotoreceptor binding protein"
     CDS             <1..>388
                     /note="IRBP"
                     /codon_start=1
                     /product="interphotoreceptor binding protein"
                     /protein_id="ABW76674.1"
                     /db_xref="GI:158668313"
                     /translation="STSKAPQHDSKFTNATQEELLALFQQIIKYQVLEGNVGYLRVDY
                     IPGREMIEEVGEFLVNDIWKKVMETSSLVLDLQHSSGGEVSGIPFVISYLHQGDILLH
                     VDTIYDRPSNTTTEIWTLPQVLGERYS"
ORIGIN
        1 agcacctcca aggctcctca gcacgactcc aaattcacca atgccactca ggaagagcta
       61 ctcgccttat tccagcaaat aatcaagtac caggtactgg agggtaacgt cggttaccta
      121 agagtggact acatccctgg ccgggagatg atagaggaag ttggggagtt cctggtgaat
      181 gacatctgga agaaggtcat ggagacctcc tctctcgtgt tggatctcca gcacagcagc
      241 ggaggtgaag tttcaggaat cccctttgtc atttcctacc tccaccaggg ggatatcctg
      301 ctccacgtag acaccattta cgaccggcca tcaaacacca ctactgagat ctggaccctg
      361 ccccaggtgc tgggggagag gtacagtg
//

Let’s write a program that will read the above file and extract the genus and species from the data.
Our first attempt is this.

open (FILE,"tasmwolf.txt") or die "File not there";
while ($line=<FILE>) {
   if ($line=~/ORGANISM/){
      print $line;
   }
}

Note that all it really does is print out the line that contains the word ORGANISM, which includes the genus and species. If this is all we need then cool; otherwise a little more work might be required. Suppose we need just the genus and species in separate variables. This shouldn’t be too hard since it’s a part of the line we just printed. A little fancy RE work will do this for us. All we really need to do here is change the $line=~/ORGANISM/ regular expression to something like the following

$line=~/ORGANISM\s+(\w+)\s(\w+)/

This will match a line that has this sequence of characters: ORGANISM, spaces, a word, a space, a word, and then the remaining text until the \n.

You may note the extra parentheses around the two \w+’s. The RE will work just fine without these, but by including them it is possible to get from the RE what matched them. The matched text for each RE subsection that is surrounded by parentheses is automatically placed in a variable for us to access. The matched text for the first parenthesized piece is placed in the variable $1, the second in $2 and so on. This gives us a way of knowing what matches what, a very useful feature of Perl. Here is the updated version of the above program and its output.

open (FILE,"tasmwolf.txt") or die "File not there";
while ($line=<FILE>) {
   if ($line=~/ORGANISM\s+(\w+)\s(\w+)/){
      print $line;
      print "$1\n$2\n";   # print genus on one line and species on the other.
   }
}

Output

  ORGANISM  Thylacinus cynocephalus
Thylacinus
cynocephalus

Running the above program on the Neanderthal file ncbi_dna.txt gives this output

  ORGANISM  Homo sapiens neanderthalensis
Homo
sapiens

Note that it basically grabbed the first two words after ORGANISM and printed them. The subspecies name neanderthalensis was not processed. If we want this as well we would have to include another (\w+) in the RE sequence.
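The capture variables $1, $2 and $3 can be tried on both ORGANISM lines without opening any file. The following is a minimal sketch of the "add another (\w+)" variation just mentioned. The sub name organism_names and the use of my are assumptions made only for this illustration, and the two test strings are typed in by hand to look like the ORGANISM lines of the two GenBank files.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the text): uses three capture groups to
# pull the genus, species and subspecies out of an ORGANISM line.
# Returns an empty list when the line does not contain three names.
sub organism_names {
    my ($line) = @_;
    if ($line =~ /ORGANISM\s+(\w+)\s+(\w+)\s+(\w+)/) {
        return ($1, $2, $3);   # genus, species, subspecies
    }
    return ();
}

# A line with three names: all three groups match.
my @neanderthal = organism_names("  ORGANISM  Homo sapiens neanderthalensis\n");
print "Neanderthal names: @neanderthal\n";

# A line with only two names: the third (\w+) has nothing to match,
# so the whole regular expression fails and no names come back.
my @wolf = organism_names("  ORGANISM  Thylacinus cynocephalus\n");
print "Tasmanian wolf names found: ", scalar(@wolf), "\n";
```

The second call returns no names at all, not even the genus and species, because a regular expression either matches as a whole or fails as a whole.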
This of course would not work with organisms that have only two names on this line. If we want our program to work with any file, whether it contains two or three names here, another technique will be required. This will be discussed in the section on arrays where the problem becomes quite easy.

In general data mining of these text files is quite easy using RE’s. For a more interesting example we will work on a much larger file, say the complete mitochondrion genome for the ornate kangaroo tick. Download it using the NCBI Reference Sequence NC_005963.1 and call it kangtick.txt. Browse the file carefully. Here is a section of the FEATURES portion of this file.

FEATURES             Location/Qualifiers
     source          1..14740
                     /organism="Amblyomma triguttatum"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /isolate="SB1"
                     /db_xref="taxon:65637"
                     /sex="male"
     tRNA            1..62
                     /product="tRNA-Met"
                     /anticodon=(pos:31..33,aa:Met)
     gene            64..1029
                     /gene="ND2"
                     /db_xref="GeneID:2866202"
     CDS             64..1029
                     /gene="ND2"
                     /codon_start=1
                     /transl_table=5
                     /product="NADH dehydrogenase subunit 2"
                     /protein_id="YP_044778.1"
                     /db_xref="GI:49619213"
                     /db_xref="GeneID:2866202"
                     /translation="MNFNILMKWLILMTIMISMSVNSWFIFWMMMEMNLMFFIPILNK
                     QKMTNSNSMITYFVIQSFSSTIFIMMAILNFITYFYMFKILMIISIMIKLAIIPFHFW
                     LISISEMIEFNSLFFILSLQKFIPLFILSKFNSQFMIMFALASAILGSLSAMNSKMLK
                     KMLIFSSISHQGWMIMLIMMKSNFWISYLLIYSIMIYKVTSLMKMFKFNYISEFFNYN
                     KNSLSKISLIMMMMSLSGMPPFMGFTLKIISIIILLTYFNFSIIILILSSMLNIYFYL
                     NSIQSFFLLNLIKFKKMIMKTYMFKNMILNFNIFMIIFLFNLMIF"
     tRNA            1031..1089
                     /product="tRNA-Trp"
                     /anticodon=(pos:1060..1062,aa:Trp)
     tRNA            complement(1090..1151)
                     /product="tRNA-Tyr"
                     /anticodon=(pos:complement(1119..1121),aa:Tyr)
     gene            1160..2683
                     /gene="COX1"
                     /db_xref="GeneID:2866206"
     CDS             1160..2683
                     /gene="COX1"
                     /codon_start=1
                     /transl_table=5

The names on the left such as gene and CDS are annotation names. This file contains the entire mitochondrial sequence for this organism, which has quite a few genes labeled as such.
The line

     gene            64..1029

indicates that the authors think there is a gene that goes from nucleotide 64 up to nucleotide 1029. If you look through the file you will see there is another gene that occurs from 1160 to 2683 and so on. Rather than look through this file by hand it is possible to write a small program that goes through the file and prints out all the genes. All we really need to do is print out each line that starts with the word gene. Note there are other lines that have the word gene in them that we DO NOT want, so our RE needs to be carefully designed.

open(FILE,"kangtick.txt");
$ct=0;
while($line = <FILE>){
   # Look for the word gene that has spaces before and after.
   # Try it without the \s's and see what happens.
   if($line =~ /^\s+gene\s/){
      print "$line";
      $ct++;   #<- counts the number of genes that we find
   }
}
print "There are $ct genes in the mitochondria of this animal\n";
close(FILE);

The RE /^\s+gene\s/ matches a line that starts out with a bunch of white space followed by the word gene followed by a space. This makes sure we will not grab any of the other cases. Here is its output.

     gene            64..1029
     gene            1160..2683
     gene            2688..3361
     gene            3489..3653
     gene            3647..4309
     gene            4314..5091
     gene            5154..5493
     gene            complement(5748..6749)
     gene            complement(9281..10935)
     gene            complement(10998..12325)
     gene            complement(12319..12594)
     gene            12731..13159
     gene            13164..14239
There are 13 genes in the mitochondria of this animal

From this we can see that there are 13 genes, four of which are on the complement strand going from right to left. Now we are finally doing some real data mining. If you are interested in the meaning of the individual sections of this GenBank file, see the NCBI sample GenBank record, which contains an example file for the organism Saccharomyces cerevisiae; the meanings of the annotation labels are at the bottom of that document.

Homework
12.1

Using Arrays in Perl

Although arrays were introduced early in this document they were not used in processing of the previous examples.
We will correct that omission at this point but before doing so let’s do a little review. Recall that an array is really a named list of elements kept in a variable that has a @ prefix instead of a $. A variable such as @words contains a list of words while a variable such as $word would contain only one word. Recall that we access an array using subscripting. If we would like to print the first word in the array @words we would do something like

print $words[0];

Note that we use the $ prefix when we reference a single item in the array @words. We can initialize an array variable by assigning it an explicitly written list (aka array literal). An array literal is nothing more than a list of elements separated by commas and enclosed in parentheses. The elements may be strings or numbers and may be mixed in a list. For example (1,2,3) is a numeric array literal, ("genes","dna","complement") is a string array literal, and ("Stangl","Fred", 7, 8.5) is a mixed example. The empty array is represented by (). Since numeric lists are often used there is a shorthand that makes creation of these easy. For example (1..5) is shorthand for (1,2,3,4,5) and (2..6,10,12) is shorthand for (2,3,4,5,6,10,12).

Variables may also be used in the initialization of a literal list. If $x=5 and $y=10 then the list ($x,2,3,$y) defines the literal (5,2,3,10).

An array variable is a variable that holds lists as defined by the above literals. Its name looks like a normal variable name except it starts with the @ symbol instead of the $. Examples include @list, @people and @Species.

The easiest way to initialize an array is by assigning a list (array literal) to it.
The following is a list of examples that demonstrate this.

@list = (1,2,3,4,5);
@People = ("Bob","Tom","Sally");
@list2 = @list;   # you can copy one list into another.
@Species = qw(Sapiens Canus Bothriolepus);   # qw is a simple operator that allows you to not have to type the quotes.

The individual values of @list can be accessed using subscripts starting at 0. When you do this you use a $ instead of the @ symbol. For example

$list[0] is the value 1
$People[2] is "Sally"

Here is a simple Perl program that initializes an array and prints out the values one per line.

@list=(1..10);
$i=0;
while($i<=9){
   print "$list[$i]\n";
   $i=$i+1;
}

Here is another example

@words = ("Home", "went", "House","Bill");
print "$words[3] $words[1] $words[0]\n";

which prints the string "Bill went Home".

The size of an array (i.e. its length) can be easily obtained by just assigning the array to a scalar variable. For example $size=@words will assign 4 to the variable $size. We say we are using the array in scalar context when we do this.

There is a really interesting way that an array can be loaded. It is possible to load an array with every line of an entire file. If <FILE> is a file handle for some opened file then the command

@lines = <FILE>;   # This is called slurping the file.

will copy the ENTIRE file into the array @lines one line at a time. Each slot in the array is a string that contains the associated line in the file. Recall that lines are determined by where the \n’s are. Let’s look at some examples that make this clear.
The first example is a rewriting of the previous gene search program.

open(FILE,"mammoth.gb");
@linearray = <FILE>;   #<- loads the array with the lines in the file (slurps it)
$len = @linearray;     # Assigning an array to a scalar variable gets the length of the array
$ct=0;
for($i=0;$i<$len;$i++){
   if($linearray[$i] =~ /^\s+gene\s/){
      print $linearray[$i],"\n";
      $ct++;
   }
}
print "There are $ct genes in the mitochondria of this animal\n";
close(FILE);

Note that this loads the entire file into the array @linearray and then loops through the array printing out the matched lines. Although this is rather cool you must take note that the ENTIRE file is loaded into memory (within the array). Some files are quite large and hence may use up most if not all of your memory. In cases such as these it is advisable to process the file a line at a time as we did earlier.

A simpler way of processing the array is by using the foreach construct. Its semantics should be clear from this example.

open(FILE,"kangtick.txt");
@linearray = <FILE>;   #<- slurp it up
$ct=0;
# Each line in the array is loaded into $line and then processed by the following loop
foreach $line (@linearray){
   if($line =~ /^\s+gene\s/){
      print $line,"\n";   # prints only those lines that have the word gene in them.
      $ct++;
   }
}
print "There are $ct genes in the mitochondria of this animal\n";
close(FILE);

The foreach loop construct is very handy for processing every slot in an array. If you need only to process a few specific ones then the array subscripting method probably is required.

There is a very cool (and useful) command called split that will take a string of words and break it up into its individual elements. An array will be used to hold these elements. The split operation is written using a regular expression for its first parameter and a string as its second. The regular expression defines the pattern that separates the string into its individual elements. For example $line="Now:is:the:time" is a string whose words are separated by :'s.
The split command split(/:/,$line) will return the list ("Now","is","the","time") and this list can be assigned to the array variable @list by the command

@list=split(/:/,$line);

Note that if the words are separated by blanks (which is more usual) then the command would be @list=split(/ /,$line) where the : is replaced with a blank. If the words are separated by multiple blanks and other white space (tabs, \n etc) then we could use @list=split(/\s+/,$line).

So what can we do with this? Suppose that Dr. Shipley had a class of 25 students that went out in the field and collected data that involved counting species in different counties. He gave out the specification for typing this information into an ASCII file. This specification requested that each species be typed in using the following format, one species per line.

genus species population location

After all 25 students did this it was discovered that the format was incorrect for the database program being used. This program was designed to read text files in the format

species genus : location : population

Note that the genus and species are reversed and that colons are required instead of spaces in the latter two locations. Mean Dr. Shipley told everyone to just retype the files since each student had only to retype 500 different species lines. Rather than follow this advice student Mr. Awesome decided to write a short Perl program that would do the conversion. He gave the program to the other students in the class and consequently became a Hero. Here is the program that he wrote. Read it carefully.

#This program will reformat a data file. The original file
#looks like the following
#   genus species population location(county)
# and should look like
#   species genus : location : population
#There are several ways to do this.
# This method uses arrays and split
open(FILE,"data.txt")|| die "Sorry, I could not open the file data.txt";   # for input
open(OUT,">newformat.txt");   # for output
while($line=<FILE>){   #<--- Remember that this reads one line at a time.
   @list=split(/\s+/,$line);   # split every line into its individual blank separated words
   print OUT "$list[1] $list[0] : $list[3] : $list[2]\n";   # write it to a file
   print "$list[1] $list[0] : $list[3] : $list[2]\n";       # and to the screen as well.
}
close(OUT);

The above program will convert the following data, which is contained in data.txt

Equus caballus 232 Wichita
Equus asinus 221 Ford
Phascolarctos cinereus 23 Crocket
Myrmecobius fasciatus 456 Harris
Tamiasciurus douglasii 18 Smith

and write it to the file newformat.txt in the following format.

caballus Equus : Wichita : 232
asinus Equus : Ford : 221
cinereus Phascolarctos : Crocket : 23
fasciatus Myrmecobius : Harris : 456
douglasii Tamiasciurus : Smith : 18

Do you or do you not think that this is a lot easier than retyping everything in? Split has many uses and one solves a problem previously discussed. Recall that we were trying to read in the names of an organism from a GenBank file and that the name appeared after the word ORGANISM. Our problem was that there may be three names (genus, species and subspecies) or two names (genus and species only). If the ORGANISM line has been read into the variable $line then here is a way to handle either case.

@names = split(' ',$line);   # the special pattern ' ' splits on white space and ignores leading blanks
shift(@names);               # throw away the first element, which is the word ORGANISM
$ct = @names;                # Get the length of this array.
foreach $n (@names){
   print "$n\n";   # print one name per line.
}

The above code will print out either the two names or the three names following the word ORGANISM, depending on how long the array @names is. If you need to print only some of these words use subscripting.

Homework
13.1 Download the 3NYU.pdb Mycoplasma genitalium MG289 file from the PDB web site. This is information about a specific binding protein that occurs in the bacterium Mycoplasma genitalium. It is a very large file.
Your job is to write a program that will count a variety of things. Count and print the number of lines that begin with each of the following words: ATOM, CONECT, HELIX, REMARK, HETATM and SEQRES. Print out the percentage of the total number of lines that each of these consume. Here is an example output for a different molecule. Tabs were used in the print statement to get the percentages to line up. Also if you cannot get the numbers to truncate to the indicated values don’t worry. There are a variety of ways to do this. HINT: There is a function called int() that will truncate a fractional number to an integer by dropping the decimal part.

Out of a total of 1692 lines we have
REMARK lines :432     Percentage is 25.5
HELIX lines :6        Percentage is 0.3
ATOM lines :1087      Percentage is 64.2
CONECT lines :7       Percentage is 0.4
SEQRES lines :12      Percentage is 0.7
HETATM lines :73      Percentage is 4.3

13.2 Use the same molecule as in problem 13.1 for this exercise. If you study the file section that starts with the word ATOM you will notice that the last character on the line is a letter C, O, N, S etc. These represent the name of each atom in the molecule. Carbon is C, oxygen is O, nitrogen is N, sulfur is S and so on. Your program is to process this file and print out the number of C’s, O’s and N’s that occur in the molecule. HINT: Note that each letter is at the end of the line. It is followed by some white space and then a \n.

Hashes and their uses

Important Web Reference Sites
Protein Data Bank (PDB)
Project Gutenberg
National Center for Biotechnology Information (NCBI)

Appendix A: Perl instructions
Appendix B: Functions