Pattern Example 1 - Tom Kleen



Searching: Regular Expressions

We have been manipulating strings for most of the semester. We have used slicing and built-in string methods.

The task of searching strings and extracting text from them is so common that Python has a very powerful module for regular expressions (re) that handles many of these tasks quite elegantly.

Regular expressions perform a task called pattern matching. We provide a pattern and the re module will find strings that match our pattern.

The following examples use either the file mbox.txt or mbox-short.txt.

Pattern Example 1

Problem: Find all occurrences of the word "From".

Pattern: "From"

Code:

    import re
    file = open('mbox-short.txt')
    for line in file:
        line = line.rstrip()
        pattern = "From"
        pattern_regex = re.compile(pattern)
        result = pattern_regex.findall(line)
        print(result)

Pattern Example 2

Problem: Find all occurrences of the word "From" at the beginning of a line.

Pattern: "^From"

Code:

    import re
    file = open('mbox-short.txt')
    for line in file:
        line = line.rstrip()
        pattern = "^From"
        pattern_regex = re.compile(pattern)
        result = pattern_regex.findall(line)
        print(result)

The caret symbol (^) means the following text must be at the beginning of a line.

Pattern Example 3

Problem: Find all lines that begin with the letter "F" followed by any two characters followed by the letter "m" followed by ":".

Pattern: "^F..m:"

Code:

    import re
    file = open('mbox-short.txt')
    for line in file:
        line = line.rstrip()
        pattern = "^F..m:"
        pattern_regex = re.compile(pattern)
        result = pattern_regex.findall(line)
        print(result)

The period character (.)
matches any single character.

Pattern Example 4

Problem: Find any line that begins with "From:" and that has the "@" character anywhere on the rest of the line.

Pattern: "^From:.+@"

Code:

    import re
    file = open('mbox-short.txt')
    for line in file:
        line = line.rstrip()
        pattern = "^From:.+@"
        pattern_regex = re.compile(pattern)
        result = pattern_regex.findall(line)
        print(result)

The plus sign (+) following a character means one or more of whatever precedes it. If it follows a period, it means one or more of anything.

Example of a matching line:

    From: stephen.marquard@uct.ac.za

Pattern Example 5

Problem: Find email addresses on a line. Email addresses have the form: letters, digits, and periods, followed by @, followed by letters, digits, and periods.

Our pattern will have "@" in the middle. A shortcut for a single non-whitespace character is "\S". So to get more than one, use "\S+".

Pattern: "\S+@\S+"

Code:

    import re
    line = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
    line = line.rstrip()
    # Raw string (r"...") so that Python does not treat \S as a string escape
    pattern = r"\S+@\S+"
    pattern_regex = re.compile(pattern)
    result = pattern_regex.findall(line)
    print(result)

The output is:

    ['csev@umich.edu', 'cwen@iupui.edu']

If we run the program on mbox-small.txt, we get:

    ['wagnermr@iupui.edu']
    ['cwen@iupui.edu']
    ['<postmaster@collab.>']
    ['<200801032122.m03LMFo4005148@nakamura.uits.iupui.edu>']
    ['<source@collab.>;']
    ['<source@collab.>;']
    ['<source@collab.>;']
    ['apache@localhost)']
    ['source@collab.;']

Pattern Example 6

Problem: Note that we get angle brackets around some of the email addresses. Let's make our filter more specific than just "any non-whitespace character".
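To see the angle-bracket problem in isolation, here is a minimal sketch that runs the "\S+@\S+" pattern on a single header-style line. The line and the address someone@example.com are made up for illustration and do not come from the mbox files.

```python
import re

# Hypothetical header line; the address is invented for this illustration.
line = "Return-Path: <someone@example.com>"

# \S matches ANY non-whitespace character, so "<" and ">" are swept into the match.
result = re.findall(r"\S+@\S+", line)
print(result)  # ['<someone@example.com>']
```

Because "<" and ">" are not whitespace, they count as part of the match, which is why the pattern needs to be tightened.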
Let's specify that the string must begin with a letter or a digit and end with a letter.

Pattern: "[a-zA-Z0-9]\S*@\S*[a-zA-Z]"

Interpretation:

    a-z    any character from lowercase "a" to lowercase "z"
    A-Z    any character from uppercase "A" to uppercase "Z"
    0-9    any character from "0" to "9"
    []     match any one of the characters listed inside the brackets
    *      repeat the single character to the left 0 or more times
    +      repeat the single character to the left 1 or more times

Extracting Data

Problem: Suppose we want to find numbers on lines that start with the string "X-", such as:

    X-DSPAM-Confidence: 0.8475
    X-DSPAM-Probability: 0.0000

We don't just want any floating-point numbers from any lines. We only want to extract numbers from lines that have the above syntax.

Pattern: We can construct the following regular expression to select the lines:

    ^X-.*: [0-9.]+

Interpretation:

    ^          the pattern must be at the beginning of a line
    X-         a capital "X" followed by a hyphen ("-")
    .*         any number of characters
    ": "       a colon followed by a space
    [0-9.]+    1 or more digits or periods

Code:

    import re
    file = open('mbox-short.txt')
    for line in file:
        line = line.rstrip()
        pattern = "^X-.*: [0-9.]+"
        pattern_regex = re.compile(pattern)
        result = pattern_regex.findall(line)
        print(result)

Extracting Data

Problem: This will produce a list containing the whole matching text of each line. But what if we only want the number?
Use () to indicate the part you want to extract:

Pattern: '^X\S*: ([0-9.]+)'

Code:

    import re
    file = open('mbox-short.txt')
    for line in file:
        line = line.rstrip()
        # The parentheses mark the part that findall() returns
        pattern = r"^X\S*: ([0-9.]+)"
        pattern_regex = re.compile(pattern)
        result = pattern_regex.findall(line)
        print(result)

Example

As another example of this technique, if you look at the file there are a number of lines of the form:

    Details: :

If we wanted to extract all of the revision numbers (the integer number at the end of these lines) using the same technique as above, we could write the following program:

Pattern: ".*&rev=([0-9]+)"

Code:

    import re
    s = "Details: "
    pattern = ".*&rev=([0-9]+)"
    pattern_regex = re.compile(pattern)
    result = pattern_regex.findall(s)
    print(result)

Things to try:

Match phone numbers: 712-279-5411, (712) 279-5411
Match Social Security numbers: 999-99-9999
Match IP addresses: 1.1.1.1, 255.255.255.255

More

    import re

    # Saving the text in a variable
    sample = '''Game of Thrones is an fantasy drama television series
    created by David Benioff and D. B. Weiss. It is an adaptation
    of A Song of Ice and Fire, George R. R. Martin series of fantasy
    novels, the first of which is A Game of Thrones. It is filmed in
    Belfast and elsewhere in the United Kingdom, Canada, Croatia,
    Iceland, Malta, Morocco, Spain, and the United States. The series
    premiered on HBO in the United States on April 17, 2011, and its
    seventh season ended on August 27, 2017. The series will conclude
    with its eighth season premiering in 2019.'''

    # Extract all characters
    pattern = " put your pattern here"
    pattern_regex = re.compile(pattern)
    result = pattern_regex.findall(sample)
    print(result)

    # Extract all words
    # Extract all numbers
    # Extract all dates
    # Extract all dates and separate the components
    # Reformat dates as dd-month-year
    # Extract all words that start with a vowel
    # Extract all phone numbers and reformat them as (nnn) nnn-nnnn
    sample = "7125551111,7125552222,7125553333,7125554444"
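As a starting point for the last exercise, here is one possible sketch (not the only approach) that extends the parentheses technique to several groups at once. It assumes every phone number in the sample is exactly ten digits, and it uses the repetition count {3}, which is not covered above: it means "exactly three of whatever precedes it". The names phone_regex, parts, and formatted are made up for this illustration.

```python
import re

sample = "7125551111,7125552222,7125553333,7125554444"

# Three capture groups: area code, exchange, and the last four digits.
phone_regex = re.compile(r"([0-9]{3})([0-9]{3})([0-9]{4})")

# With more than one group, findall() returns one tuple per match.
parts = phone_regex.findall(sample)

# Rebuild each number in the (nnn) nnn-nnnn layout.
formatted = [f"({area}) {exchange}-{number}" for area, exchange, number in parts]
print(formatted)
# ['(712) 555-1111', '(712) 555-2222', '(712) 555-3333', '(712) 555-4444']
```

Each tuple in parts holds the three captured pieces of one number, so the same idea should carry over to the "separate the components" exercise for dates.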