Part 1: Regular Expressions (regex)

Regular expressions are combinations of characters that represent a particular pattern. For example, as we've seen before, \n represents a newline. For more details, see Wiki.

Common regular expressions and character classes

Regular Expressions

Pattern   Description
^         Matches beginning of line
$         Matches end of line
.         Matches any single character except newline
a|b       Matches either a or b
(re)      Groups regular expressions and remembers matched text
\w        Matches word characters
\W        Matches nonword characters
\s        Matches whitespace. Equivalent to [ \t\n\r\f\v] for ASCII text
\S        Matches nonwhitespace
\d        Matches digits
\D        Matches nondigits
\A        Matches beginning of string
\Z        Matches end of string
\b        Matches word boundaries when outside brackets (inside brackets, \b matches a backspace character)
\B        Matches nonword boundaries
\n        Matches newline
\t        Matches tab
\\t       Matches a literal backslash followed by t; the extra backslash negates the special meaning of \t

Repetition

Pattern   Description
+         1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
*         0 or more occurrences of the pattern to its left
?         0 or 1 occurrences of the pattern to its left
{n}       Exactly n occurrences of the pattern to its left

Character Classes

Pattern   Description
[Aa]t     Matches "At" or "at"
A[Tt]     Matches "AT" or "At"
[TU]      Matches "T" or "U"
[0-9]     Matches any digit
[a-z]     Matches any lowercase ASCII letter
[A-Z]     Matches any uppercase ASCII letter
[^UT]     Matches anything other than "T" or "U"
[^0-9]    Matches anything other than a digit

See Tutorials Point for additional expressions.
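As a quick illustration of how a few of the expressions above behave in Python, the sketch below applies them using the re module, which is introduced in the next section. The sample string text is made up for this example.

```python
import re

# Hypothetical sample string, made up for illustration
text = "At 9am the cat sat at gate 42"

# [Aa]t matches "At" or "at" wherever it occurs, including inside words
at_matches = re.findall(r'[Aa]t', text)

# \d+ matches runs of one or more digits
numbers = re.findall(r'\d+', text)

# \b restricts the match to whole words: exactly "At" or "at"
whole_words = re.findall(r'\b[Aa]t\b', text)

# ^ anchors the pattern to the beginning of the line
starts_with_at = re.search(r'^At', text) is not None

print(at_matches)      # ['At', 'at', 'at', 'at', 'at']
print(numbers)         # ['9', '42']
print(whole_words)     # ['At', 'at']
print(starts_with_at)  # True
```

The patterns are written as raw strings (r'...') so that Python passes the backslashes through to the regex engine unchanged.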

The re module

Before we can use regular expressions in Python, we have to import the re module:

In [1]: import re

The re.search() method

The re.search(pattern, subject) method is used to search for patterns. It scans the subject string and returns the first location where the pattern matches. It's similar to the Unix command grep:

In [2]: seq = 'GTGAGATGGAGTAGTC'
        re.search('ATG.*', seq)

Out[2]: <re.Match object; span=(5, 16), match='ATGGAGTAGTC'>

. is a wildcard matching any single character except newline. * means to match the previous pattern 0 or more times, so .* will match any run of characters that doesn't contain a newline.

To get the matched text, we append .group() to the result of re.search():

In [3]: seq = 'GTGAGATGGAGTAGTC'
        re.search('ATG.*', seq).group()

Out[3]: 'ATGGAGTAGTC'

To get the index at which the match starts, we append .start():

In [4]: seq = 'GTGAGATGGAGTAGTC'
        re.search('ATG.*', seq).start()

Out[4]: 5

To get the index for the end of the match, use .end() :

In [5]: seq = 'GTGAGATGGAGTAGTC'
        re.search('ATG.*', seq).end()

Out[5]: 16

Note that the end index is actually the index just after the match.
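Because the end index points just past the match, it lines up with Python's slice notation. The short sketch below, using the same sequence as above, checks that slicing with .start() and .end() recovers the matched text, and shows .span(), which returns both indices together.

```python
import re

seq = 'GTGAGATGGAGTAGTC'
match = re.search('ATG.*', seq)

# Python slices exclude the end index, so this recovers exactly the matched text
matched_by_slice = seq[match.start():match.end()]

# span() returns the (start, end) pair as a tuple
span = match.span()

print(matched_by_slice == match.group())  # True
print(span)                               # (5, 16)
```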

Let's look at another approach:

In [7]: seq = 'GTGAGATGGAGTAGTC'
        match = re.search('ATG.*TAG', seq)  # get the match object
        sequence = match.group()            # get the matching sequence
        start = match.start()               # get the index for the start of the match
        end = match.end()                   # get the index for the end of the match + 1
        print(f'{sequence}: {start}-{end-1}')

ATGGAGTAG: 5-13

If a pattern is found, re.search() returns a match object, which evaluates to True in a condition; otherwise it returns None, which evaluates to False:

In [10]: seq = 'GTGAGATGGAGTAGTC'
         if re.search('ATG.*', seq):
             print('Match found')

Match found
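When nothing matches, re.search() returns None, so chaining .group() or .start() directly onto the call raises an AttributeError. A minimal sketch of guarding against that is shown below; find_start_codon is a hypothetical helper name used only for illustration.

```python
import re

def find_start_codon(sequence):
    """Return the index of the first ATG, or None if the sequence has none.

    find_start_codon is a hypothetical helper, not part of the re module.
    """
    match = re.search('ATG', sequence)
    if match is None:
        # No match: avoid calling .start() on None
        return None
    return match.start()

print(find_start_codon('GTGAGATGGAGTAGTC'))  # 5
print(find_start_codon('CCCCCC'))            # None
```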

To identify if one of several possible patterns is matched, use pattern1|pattern2 :

In [12]: seq1 = 'TATGCATTGA'
         seq2 = 'UAUGCAUUGA'
         re.search('ATG.*|AUG.*', seq2).group()

Out[12]: 'AUGCAUUGA'

Sometimes we need to group patterns:

In [13]: seq = 'GTGAGATGGAGTAGTC'
         re.search('ATG.*(TGA|TAA|TAG)', seq).group()

Out[13]: 'ATGGAGTAG'

Sub-patterns can be delimited with (). Unless escaped as \( and \), parentheses are treated as grouping operators and are not matched as literal characters.
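The text matched by a group can be retrieved by passing the group number to .group(). The sketch below reuses the sequence from the previous cell; the string used for the escaped-parentheses case is made up for illustration.

```python
import re

seq = 'GTGAGATGGAGTAGTC'
match = re.search('ATG.*(TGA|TAA|TAG)', seq)

whole_match = match.group(0)  # the entire match, same as match.group()
stop_codon = match.group(1)   # text matched by the first ( ) group

# Escaped parentheses match literal "(" and ")" characters
literal = re.search(r'\(re\)', 'the (re) group').group()

print(whole_match)  # ATGGAGTAG
print(stop_codon)   # TAG
print(literal)      # (re)
```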

Let's identify the longest open reading frame based on the first start codon encountered in the sequence below:

In [16]: seq = 'TTGCCCTGAAGTAATCATGCCCTGAGCTTACACTATCACTACACTATGATCCCC'
         def longest_orf(sequence):
             return re.search('ATG([ACTG]{3})*(TGA|TAA|TAG)', sequence).group()

         longest_orf(seq)

Out[16]: 'ATGCCCTGAGCTTACACTATCACTACACTATGA'

By default, pattern matching is greedy, meaning it will match as much of the subject as possible. If we want a pattern to match as little as possible, so that the shortest match is found, we can append ? to * .

In [18]: seq = 'TTGCCCTGAAGTAATCATGCCCTGAGCTTACACTATCACTACACTATGATCCCC'
         def shortest_orf(sequence):
             return re.search('ATG([ACTG]{3})*?(TGA|TAA|TAG)', sequence).group()

         shortest_orf(seq)

Out[18]: 'ATGCCCTGA'
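The difference between greedy and non-greedy matching is easiest to see on a short string with two possible end points. The sketch below uses a made-up sequence containing two TAG codons; it ignores reading frame and is only meant to contrast .* with .*? .

```python
import re

# Hypothetical sequence with two TAG codons, made up for illustration
seq = 'ATGAAATAGCCCTAG'

# Greedy: .* extends as far as possible, so the match ends at the last TAG
greedy = re.search('ATG.*TAG', seq).group()

# Non-greedy: .*? extends as little as possible, so the match ends at the first TAG
non_greedy = re.search('ATG.*?TAG', seq).group()

print(greedy)      # ATGAAATAGCCCTAG
print(non_greedy)  # ATGAAATAG
```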

The re.findall() method

The re.findall(pattern, subject) method finds all non-overlapping pattern matches and returns them in list form:

In [21]: seq = 'GTCCCCCCCCAGCCCCCAGGTCCCCCAGCGTCCCCCCCCCCCCCCCCCCCAG'
         re.findall('GTC*', seq)

Out[21]: ['GTCCCCCCCC', 'GTCCCCC', 'GTCCCCCCCCCCCCCCCCCCC']

If () are included in the pattern, only the part of each match enclosed in () will be returned (and sometimes that's all you want):

In [22]: seq = 'GTCCCACCCCAGCCCCCAGGTCCGCCAGCGTCCCCCCTTTTCCCCCCCCCCAG'
         re.findall('GTC(.*)', seq)

Out[22]: ['CCACCCCAGCCCCCAGGTCCGCCAGCGTCCCCCCTTTTCCCCCCCCCCAG']
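What re.findall() returns depends on how many groups the pattern contains. The sketch below, using a made-up sequence, compares one capturing group, a non-capturing group written as (?:...), and two capturing groups.

```python
import re

# Hypothetical sequence, made up for illustration
seq = 'ATGCCCTGAGGATGAAATAG'

# One capturing group: only the text of that group is returned for each match
one_group = re.findall('ATG.{3}(TGA|TAA|TAG)', seq)

# Non-capturing group (?:...): groups the alternatives but the full match is returned
full_matches = re.findall('ATG.{3}(?:TGA|TAA|TAG)', seq)

# Two capturing groups: each match is returned as a tuple of the groups
two_groups = re.findall('ATG(.{3})(TGA|TAA|TAG)', seq)

print(one_group)     # ['TGA', 'TAG']
print(full_matches)  # ['ATGCCCTGA', 'ATGAAATAG']
print(two_groups)    # [('CCC', 'TGA'), ('AAA', 'TAG')]
```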

The re.sub() method

For doing substitutions using regular expressions, there's the re.sub(regex, replacement, subject) method.

It's common to need to keep track of part of the pattern that was matched and include it in the substitution. To do this, we enclose that part of the pattern in parentheses: (sub_pattern). You can have multiple sub_patterns enclosed within (), and each gets its own reference, numbered from left to right: \1, \2, ... The text matched by the first sub_pattern can then be referenced in the replacement with \1.

Let's extract gene coordinates from a line of a gff file ( chr#:start-end ):

In [24]: gff = 'Chr1\tTAIR10\texon\t4486\t4605\t.\t+\t.\tParent=AT1G01010.1\n'
         coordinates = re.sub('Chr([0-9]*)\t.*\t.*\t([0-9]*)\t([0-9]*)\t.*\n', r'\1:\2-\3', gff)
         print(coordinates)

1:4486-4605

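Note that backslashes have a special meaning both in regex and in ordinary Python strings. Python's raw string notation, r'string', stops Python from interpreting the backslashes, so that a reference such as \1 reaches the regex engine unchanged.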

To escape special characters that might be part of the pattern we're interested in, we use \ . In an ordinary (non-raw) string, this means writing \\ to get a single literal backslash. For example:

In [25]: gff = 'Chr1\tTAIR10\texon\t4486\t4605\t.\t+\t.\tParent=AT1G01010.1\n'
         coordinates = re.sub('Chr([0-9]*)\t.*\t.*\t([0-9]*)\t([0-9]*)\t.*\n', '\\1:\\2-\\3', gff)
         print(coordinates)

1:4486-4605
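To see why the raw string or the doubled backslash is needed, it can help to look at what Python actually stores for each spelling. The sketch below compares the three forms and then uses backreferences in a small substitution; the 'start-end' string is made up for illustration.

```python
import re

# In an ordinary string literal, '\1' is an octal escape: one character with code 1
ordinary = '\1'
# In a raw string, the backslash is kept: two characters, "\" and "1"
raw = r'\1'
# Doubling the backslash in an ordinary string gives the same two characters
doubled = '\\1'

print(len(ordinary), len(raw), raw == doubled)  # 1 2 True

# Swap two captured groups using backreferences in the replacement
swapped = re.sub(r'(\w+)-(\w+)', r'\2-\1', 'start-end')
print(swapped)  # end-start
```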

Part 2: Command Line Arguments

Up until now, to get user-supplied values into a program, we've relied on the input() function. We'll look at a couple of different ways of passing arguments from the command line, starting with sys.argv. sys.argv is a list that contains the arguments passed to Python via the command line; its first element, sys.argv[0], is the name of the script.

Copy the following code into a file:

import sys  # we first have to import the sys module

def main():
    argument = sys.argv[1]  # the second item in sys.argv (sys.argv[0] is the script name)
    command_line_arguments(argument)

def command_line_arguments(arg1):
    """ Print argument passed from the command line """
    print(f"Command line argument: {arg1}")

if __name__ == '__main__':
    main()

We can pass multiple arguments from the command line and each will be stored in the sys.argv list.

Copy the following code into a file ( print_args.py ) and execute it from the command line: python print_args.py a b c

import sys

def main():
    argument1, argument2, argument3 = sys.argv[1:]  # a list slice
    command_line_arguments(argument1, argument2, argument3)

def command_line_arguments(arg1, arg2, arg3):
    """ Print three arguments passed from the command line """
    print(f"Command line arguments: {arg1}, {arg2}, {arg3}")

if __name__ == '__main__':
    main()

What happens if the arguments are integers?
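One way to explore this question without leaving the interpreter is to simulate the list that sys.argv would hold. Everything in sys.argv is a string, so numeric arguments have to be converted explicitly. In the sketch below, add_arguments and the script name add_args.py are hypothetical, chosen only for illustration.

```python
def add_arguments(argv):
    """Sum the command-line arguments after converting them to int.

    add_arguments is a hypothetical helper; argv is expected to look like
    sys.argv, with the script name in position 0.
    """
    numbers = [int(arg) for arg in argv[1:]]
    return sum(numbers)

# Simulates running: python add_args.py 1 2 3
simulated_argv = ['add_args.py', '1', '2', '3']

# Without conversion, + concatenates the strings
concatenated = simulated_argv[1] + simulated_argv[2]
# With conversion, the values are added as integers
total = add_arguments(simulated_argv)

print(concatenated)  # 12
print(total)         # 6
```

In a real script, the call would be add_arguments(sys.argv) after importing sys.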

A more versatile approach, and the recommended one, is to use argparse.

Copy the following code into a text file ( concat_seqs.py ) and execute it from the command line:

python concat_seqs.py -a CCCC -b GGGG

def main():
    # Assign the values returned from arg_parse to variables
    seq1, seq2 = arg_parse()
    print(cat(seq1, seq2))

def arg_parse():
    """ Parse command line arguments """
    import argparse
    # First we need to create an instance of ArgumentParser which we will call parser:
    parser = argparse.ArgumentParser()
    # The add_argument() method is called for each argument:
    # We provide two versions of each argument:
    # -a is the shorthand, --sequence1 is the longhand
    # We can specify a help message describing the argument with help="message"
    # To require an argument, we use required=True
    parser.add_argument('-a', '--sequence1', required=True, help="first sequence")
    parser.add_argument('-b', '--sequence2', required=True, help="second sequence")
    # The parse_args() method parses the arguments
    args = parser.parse_args()
    #print('args:', args)
    # Here, we'll return the arguments as a tuple
    return args.sequence1, args.sequence2

def cat(seq1, seq2):
    """ Concatenate two sequences """
    return seq1+seq2

if __name__ == '__main__':
    main()
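argparse can also convert arguments to other types and supply defaults for optional ones. The sketch below is a separate, minimal example rather than part of concat_seqs.py: the option names -s/--sequence and -n/--copies are made up, and a list is passed to parse_args() so the parser can be tried without a command line.

```python
import argparse

def build_parser():
    """Create a parser with one required string and one optional integer.

    build_parser and the option names are hypothetical, chosen for illustration.
    """
    parser = argparse.ArgumentParser(description="Repeat a sequence")
    parser.add_argument('-s', '--sequence', required=True,
                        help="sequence to repeat")
    # type=int converts the argument from a string; default is used if -n is omitted
    parser.add_argument('-n', '--copies', type=int, default=2,
                        help="number of copies (default: 2)")
    return parser

# Passing a list to parse_args() stands in for the command line:
# python repeat_seq.py -s ACGT -n 3
args = build_parser().parse_args(['-s', 'ACGT', '-n', '3'])
repeated = args.sequence * args.copies

print(repeated)  # ACGTACGTACGT
```

Calling parse_args() with no arguments, as in concat_seqs.py, reads the real command line instead.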

Part 3: Subprocesses
