Basic Text Processing


Regular Expressions

Regular expressions

A formal language for specifying text strings.

How can we search for any of these?
  woodchuck
  woodchucks
  Woodchuck
  Woodchucks

Regular Expressions: Disjunctions

Letters inside square brackets []

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit

Ranges [A-Z]

Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
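The bracket patterns above can be tried out with Python's standard `re` module. This is a minimal sketch, not part of the original slides; the helper name `first_match` and the sample sentence are illustrative assumptions.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Bracket disjunction: accept either case for the first letter.
woodchuck = re.compile(r"[wW]oodchuck")
print(woodchuck.findall("A Woodchuck met two woodchucks."))
# ['Woodchuck', 'woodchuck']

# Ranges: re.search scans left to right and stops at the first hit.
print(first_match(r"[A-Z]", "Drenched Blossoms"))                # D
print(first_match(r"[a-z]", "my beans were impatient"))          # m
print(first_match(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))  # 1
```

Each call returns only the first character that satisfies the class, which is the character a slide would underline in the example column.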

Regular Expressions: Negation in Disjunction

Negations [^Ss]

Caret means negation only when first in []

Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither 'S' nor 's'         I have no exquisite reason"
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
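A short Python sketch of the negated classes, assuming the standard `re` module (the helper `first_match` is an illustrative name, not from the slides). One flavor-specific caveat: in Python an unescaped `^` outside brackets is always an anchor, so the literal "a caret b" pattern has to be written `a\^b` there.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# A caret as the first character inside [] negates the class.
print(first_match(r"[^A-Z]", "Oyfn pripetchik"))            # y
print(first_match(r"[^Ss]", "I have no exquisite reason"))  # I

# A caret later inside [] is an ordinary character in the class.
print(first_match(r"[^e^]", "Look here"))                   # L

# Outside brackets, Python reads ^ as an anchor, so escape it
# to match a literal caret.
print(first_match(r"a\^b", "Look up a^b now"))              # a^b
print(first_match(r"a^b", "Look up a^b now"))               # None
```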

Regular Expressions: More Disjunction

Woodchuck is another name for groundhog!
The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck          woodchuck
yours|mine                   yours
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck    Woodchuck
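The pipe can be combined with bracket classes, as the last pattern shows. A minimal Python sketch follows; the sample sentence and variable names are made up for illustration.

```python
import re

text = "Woodchuck is another name for groundhog! A woodchuck is a Groundhog."

# Plain alternation is case-sensitive, so capitalized forms are missed.
lower_only = re.findall(r"groundhog|woodchuck", text)
print(lower_only)   # ['groundhog', 'woodchuck']

# Brackets inside each alternative pick up both capitalizations.
either_case = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)
print(either_case)  # ['Woodchuck', 'groundhog', 'woodchuck', 'Groundhog']

# For single characters, a|b|c finds the same matches as [abc].
print(re.findall(r"a|b|c", "cabbage") == re.findall(r"[abc]", "cabbage"))  # True
```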

Regular Expressions: ? * + .

Pattern    Matches                        Examples
colou?r    Optional previous char         color colour
oo*h!      0 or more of previous char     oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char     oh! ooh! oooh! ooooh!
baa+                                      baa baaa baaaa baaaaa
beg.n                                     begin begun beg3n

Kleene *, Kleene + (named after Stephen C Kleene)
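To check which strings each of these patterns accepts, one option is Python's `re.fullmatch`, which requires the whole string to match. This is a sketch under that assumption; the helper `accepted` and the rejected candidates (`colouur`, `h!`, `ba`, `begn`) are illustrative additions.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# ? : the previous character is optional (zero or one u).
print(accepted(r"colou?r", ["color", "colour", "colouur"]))
# ['color', 'colour']

# * : zero or more of the previous character, so oo* needs at least one o.
print(accepted(r"oo*h!", ["h!", "oh!", "ooh!", "ooooh!"]))
# ['oh!', 'ooh!', 'ooooh!']

# + : one or more of the previous character; o+ accepts the same strings as oo*.
print(accepted(r"o+h!", ["h!", "oh!", "ooh!", "ooooh!"]))
# ['oh!', 'ooh!', 'ooooh!']

# baa+ needs "ba" followed by at least one more a.
print(accepted(r"baa+", ["ba", "baa", "baaaaa"]))
# ['baa', 'baaaaa']

# . : exactly one character of any kind (other than a newline).
print(accepted(r"beg.n", ["begin", "begun", "beg3n", "begn"]))
# ['begin', 'begun', 'beg3n']
```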

Regular Expressions: Anchors ^ $

Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1 "Hello"
\.$           The end.
.$            The end? The end!
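The anchors can be exercised the same way in Python. This is a sketch using the standard `re` module; the helper `first_match` is an illustrative name, and the caret plays two roles here: anchor outside brackets, negation as the first character inside them.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# ^ outside brackets anchors the match to the start of the string.
print(first_match(r"^[A-Z]", "Palo Alto"))      # P
print(first_match(r"^[^A-Za-z]", "1"))          # 1
print(first_match(r"^[^A-Za-z]", '"Hello"'))    # "
print(first_match(r"^[A-Z]", "palo Alto"))      # None

# $ anchors to the end; \. is a literal period, bare . is any character.
print(first_match(r"\.$", "The end."))          # .
print(first_match(r"\.$", "The end?"))          # None
print(first_match(r".$", "The end?"))           # ?
print(first_match(r".$", "The end!"))           # !
```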

Example

Find me all instances of the word "the" in a text.

the
    Misses capitalized examples
[tT]he
    Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
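The three attempts can be compared on a small sample in Python. This is a sketch, and the sentence is made up for illustration. It also exposes one limitation of the last pattern: because it requires a non-letter on both sides, it cannot match a "The" at the very start (or a "the" at the very end) of the text.

```python
import re

text = "The other one read the theology book."

# Attempt 1: misses the capitalized "The", and hits inside other/theology.
plain = re.findall(r"the", text)
print(plain)    # ['the', 'the', 'the']

# Attempt 2: finds "The", but still hits inside other and theology.
cased = re.findall(r"[tT]he", text)
print(cased)    # ['The', 'the', 'the', 'the']

# Attempt 3: requires a non-letter on each side. The match includes those
# neighbors, and the sentence-initial "The" is missed because nothing
# precedes it.
bounded = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)
print(bounded)  # [' the ']
```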
