Python: Regular Expressions - University of Cambridge

Bruce Beckles
Bob Dowling
University Computing Service
Scientific Computing Support e-mail address: scientific-computing@ucs.cam.ac.uk

Welcome to the University Computing Service's "Python: Regular Expressions" course. The official UCS e-mail address for all scientific computing support queries, including any questions about this course, is:

scientific-computing@ucs.cam.ac.uk

This course:

basic regular expressions

getting Python to use them

Before we start, let's specify just what is and isn't in this course. This course is a very simple, beginner's course on regular expressions. It mostly covers how to get Python to use them. There is an on-line introduction called the Python "Regular Expression HowTo" at:

and the formal Python documentation at

There is a good book on regular expressions in the O'Reilly series called "Mastering Regular Expressions" by Jeffrey E. F. Friedl. Be sure to get the third edition (or later) as its author has added a lot of useful information since the second edition. There are details of this book at:

There is also a Wikipedia page on regular expressions, which has useful information buried within it and a further set of references at the end:



A regular expression is a "pattern" describing some text:

"a series of digits"

\d+

"a lower case letter followed by some digits"

[a-z]\d+

"a mixture of characters except for new line, followed by a full stop and one or more letters or numbers"

.+\.\w+

A regular expression is simply some means to write down a pattern describing some text. (There is a formal mathematical definition but we're not bothering with that here. What the computing world calls regular expressions and what the strict mathematical grammarians call regular expressions are slightly different things.)

For example, we might like to say "a series of digits" or "a single lower case letter followed by some digits". There are terms in regular expression language for all of these concepts.
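
As a preview, the three patterns from the slide can be tried out with Python's standard re module, which the course introduces properly later on. This is only a rough sketch: the sample strings and the helper name matching_descriptions are invented for illustration and do not come from the course.

```python
import re

# The three patterns from the slide, written as raw strings so that
# Python leaves the backslashes alone.
patterns = {
    "a series of digits": r"\d+",
    "a lower case letter followed by some digits": r"[a-z]\d+",
    "some characters, a full stop, then letters or numbers": r".+\.\w+",
}

# Hypothetical sample text, invented for this illustration.
samples = ["12345", "x42", "names.txt", "hello"]

def matching_descriptions(text):
    """Return the descriptions of every pattern found somewhere in text."""
    return [desc for desc, pat in patterns.items() if re.search(pat, text)]

for text in samples:
    print(text, "->", matching_descriptions(text))
```

Note that "x42" satisfies two of the descriptions at once, because it contains a series of digits as well as a lower case letter followed by digits. Strictly speaking, \w matches the underscore as well as letters and numbers.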

A regular expression is a "pattern" describing some text:

\d+

[a-z]\d+

.+\.\w+

Isn't this just gibberish?

The language of regular expressions

We will cover what this means in a few slides' time. We will start with a "trivial" regular expression, however, which simply matches a fixed bit of text.

Classic regular expression filter

for each line in a file :               (Python idiom)
    does the line match a pattern?      (how can we tell?)
    if it does, output something        (what?)

"Hey! Something matched!"
The line that matched
The bit of the line that matched

This is a course on using regular expressions from Python, so before we introduce even our most trivial expression we should look at how Python drives the regular expression system.

Our basic script for this course will run through a file, a line at a time, and compare the line against some regular expression. If the line matches the regular expression the script will output something. That "something" might be just a notice that it happened (or a line number, or a count of lines matched, etc.) or it might be the line itself. Finally, it might be just the bit of the line that matched.

Programs like this, which produce a line of output if a line of input matches some condition and no line of output if it doesn't, are called "filters".
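
The three kinds of output described above can be sketched in a few lines of Python. This is an illustration only, assuming the standard re module's search() function (covered later in the course); the function name report and the input lines are made up for the example.

```python
import re

def report(lines, pattern):
    """For each matching line, return (line number, line, matched text)."""
    results = []
    for number, line in enumerate(lines, start=1):
        match = re.search(pattern, line)
        if match:
            results.append((number, line, match.group()))
    return results

# Hypothetical input lines, invented for this illustration.
lines = ["order 66 shipped", "no numbers here", "room 101"]

for number, line, matched in report(lines, r"\d+"):
    print("Hey! Something matched on line", number)  # just a notice
    print(line)                                      # the whole line
    print(matched)                                   # only the bit that matched
```

The second line produces no output at all, which is what makes this a filter.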

Task: Look for "Fred" in a list of names

names.txt:

Alice
Bob
Charlotte
Derek
Ermintrude
Fred
Freda
Frederick
Felicity
...

freds.txt:

Fred
Freda
Frederick

So we will start with a script that looks for the fixed text "Fred" in the file names.txt. For each line that matches, the line is printed. For each line that doesn't, nothing is printed.

c.f. grep

$ grep 'Fred' < names.txt
Fred
Freda
Frederick
$

(Don't panic if you're not a Unix user.)

This is equivalent to the traditional Unix command, grep. Don't panic if you're not a Unix user. This is a Python course, not a Unix one.
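
For readers who would rather see the same behaviour in Python, here is a minimal sketch working on the names from the slide held in a list rather than a file. The helper name grep is our own choice for the illustration, and re.search() is assumed from the standard re module.

```python
import re

# The names shown on the slide (the slide's list continues beyond these).
names = ["Alice", "Bob", "Charlotte", "Derek", "Ermintrude",
         "Fred", "Freda", "Frederick", "Felicity"]

def grep(pattern, lines):
    """Return the lines in which pattern is found, like a tiny grep."""
    return [line for line in lines if re.search(pattern, line)]

print(grep("Fred", names))  # prints ['Fred', 'Freda', 'Frederick']
```

The fixed text may appear anywhere in the line, which is why "Freda" and "Frederick" are selected along with "Fred" itself.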

Skeleton Python script data flow

import sys                                # for input & output
import regular expression module
define pattern
set up regular expression

for line in sys.stdin:                    # read in the lines one at a time
    compare line to regular expression
    if regular expression matches:
        sys.stdout.write(line)            # write out the matching lines

So we will start with the outline of a Python script and review the non-regular expression lines first.

Because we are using standard input and standard output, we will import the sys module to give us sys.stdin and sys.stdout.

We will process the file a line at a time. The Python object sys.stdin corresponds to the standard input of the program and if we use it like a list, as we do here, then it behaves like the list of lines in the file. So the Python "for line in sys.stdin:" sets up a for loop running through a line at a time, setting the variable line to be one line of the file after another as the loop repeats. The loop ends when there are no more lines in the file to read.
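
The loop can be tried without piping anything into the program. In the sketch below an io.StringIO object stands in for sys.stdin; that substitution is an assumption made purely so the example runs on its own, but both objects are iterated in the same way.

```python
import io

# io.StringIO stands in for sys.stdin here so that the example runs
# without any input being piped in; both are iterated the same way.
fake_stdin = io.StringIO("Alice\nBob\nCharlotte\n")

seen = []
for line in fake_stdin:
    # Each pass through the loop sets line to the next line of the input.
    seen.append(line)

# The loop has ended because there were no more lines to read.
print(len(seen), "lines read")
print(seen)
```

Printing the list shows that each line arrives with its end-of-line character still attached.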

The if statement simply looks at the results of the comparison to see if it was a successful comparison for this particular value of line or not.

The sys.stdout.write() line in the script simply prints the line. We could just use print but we will use sys.stdout for symmetry with sys.stdin.
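
There is one practical difference between the two worth knowing about. A line read from a file still ends with its new line character, and write() outputs exactly what it is given, whereas print adds a new line of its own. The sketch below assumes Python 3, where print is a function, and captures the output in an io.StringIO buffer so that it can be inspected; the two helper names are invented for the illustration.

```python
import io

line = "Fred\n"   # a line as read from a file keeps its new line character

def written_by_write(text):
    """Return exactly what write() sends to an output stream."""
    buffer = io.StringIO()
    buffer.write(text)
    return buffer.getvalue()

def written_by_print(text):
    """Return exactly what print() sends to an output stream."""
    buffer = io.StringIO()
    print(text, file=buffer)
    return buffer.getvalue()

print(repr(written_by_write(line)))   # 'Fred\n'
print(repr(written_by_print(line)))   # 'Fred\n\n', i.e. an extra blank line
```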

The pseudo-script on the slide contains all the non-regular-expression code required. What we have to do now is to fill in the rest: the regular expression components.
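
To give an idea of where this is heading, here is one possible way the gaps might be filled in. It is a sketch under assumptions rather than the course's own finished script: it uses re.compile() and the search() method from the standard re module, and the names regexp and filter_lines are illustrative choices. The loop is wrapped in a function and demonstrated on in-memory data so that the example runs without any input file.

```python
import io
import re
import sys

pattern = "Fred"                  # define pattern
regexp = re.compile(pattern)      # set up regular expression

def filter_lines(lines, out):
    """Write to out every line in which the regular expression is found."""
    for line in lines:            # read in the lines one at a time
        if regexp.search(line):   # compare line to regular expression
            out.write(line)       # write out the matching lines

# Demonstration on in-memory data, invented for this illustration.
# In a real filter the call would be filter_lines(sys.stdin, sys.stdout).
sample = io.StringIO("Alice\nFred\nFreda\nFelicity\n")
filter_lines(sample, sys.stdout)
```

Run as it stands, this prints the lines "Fred" and "Freda" and nothing else.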
