Basic Text Processing

Regular Expressions

Regular expressions

A formal language for specifying text strings
How can we search for any of these?
    woodchuck
    woodchucks
    Woodchuck
    Woodchucks

Regular Expressions: Disjunctions

Letters inside square brackets []

Pattern          Matches
[wW]oodchuck     Woodchuck, woodchuck
[1234567890]     Any digit

Ranges [A-Z]

Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
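The bracket and range patterns above can be tried out with a minimal sketch in Python's standard `re` module. The helper name `first_match` and the test sentence for `[wW]oodchuck` are assumptions made for illustration; the other strings come from the examples above.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# [wW]oodchuck accepts either capitalization of the first letter.
woodchucks = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")

# Ranges: the first upper case letter, lower case letter, and digit found.
first_upper = first_match(r"[A-Z]", "Drenched Blossoms")                # 'D'
first_lower = first_match(r"[a-z]", "my beans were impatient")          # 'm'
first_digit = first_match(r"[0-9]", "Chapter 1: Down the Rabbit Hole")  # '1'

print(woodchucks, first_upper, first_lower, first_digit)
```

`re.search` returns only the first match, which mirrors the single highlighted character in each example.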

Regular Expressions: Negation in Disjunction

Negations [^Ss]
Caret means negation only when first in []

Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither 'S' nor 's'         I have no exquisite reason
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
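A minimal sketch of the negated classes above, using Python's standard `re` module (the helper name `first_match` is an assumption for illustration). One caveat: Python's `re` always treats a bare `^` outside brackets as an anchor, so the pattern `a^b` as written finds nothing there. The sketch escapes it as `a\^b` to match the literal text.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

not_upper = first_match(r"[^A-Z]", "Oyfn pripetchik")        # 'y'
not_s = first_match(r"[^Ss]", "I have no exquisite reason")  # 'I'
not_e_or_caret = first_match(r"[^e^]", "Look here")          # 'L'

# A caret that is not first inside [] is an ordinary character in the class.
caret_in_class = first_match(r"[e^]", "a^b")                 # '^'

# Outside brackets, Python reads ^ as an anchor; escape it for a literal caret.
unescaped = first_match(r"a^b", "Look up a^b now")           # None in Python
escaped = first_match(r"a\^b", "Look up a^b now")            # 'a^b'

print(not_upper, not_s, not_e_or_caret, caret_in_class, unescaped, escaped)
```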

Regular Expressions: More Disjunction

Woodchuck is another name for groundhog!
The pipe | for disjunction

Pattern                        Matches
groundhog|woodchuck            woodchuck
yours|mine                     yours
a|b|c                          = [abc]
[gG]roundhog|[Ww]oodchuck      Woodchuck
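The pipe can be sketched in Python's `re` module as follows. The combined pattern `[gG]roundhog|[Ww]oodchuck` and the test string `"a cab"` are assumptions chosen for illustration, showing how brackets and the pipe work together.

```python
import re

text = "Woodchuck is another name for groundhog!"

# Lower case alternatives only: the capitalized Woodchuck is missed.
lower_only = re.findall(r"groundhog|woodchuck", text)

# Brackets inside each alternative accept both capitalizations.
either_case = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)

# For single characters, a|b|c and [abc] accept exactly the same matches.
same_matches = re.findall(r"a|b|c", "a cab") == re.findall(r"[abc]", "a cab")

print(lower_only)    # ['groundhog']
print(either_case)   # ['Woodchuck', 'groundhog']
print(same_matches)  # True
```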

Regular Expressions: ? * + .

Pattern    Matches                        Example
colou?r    Optional previous char         color colour
oo*h!      0 or more of previous char     oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char     oh! ooh! oooh! ooooh!
baa+       1 or more of previous char     baa baaa baaaa baaaaa
beg.n      Any single char                begin begun beg3n

Kleene *, Kleene + (named after Stephen C Kleene)
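A minimal sketch of the four operators in Python's `re` module. The helper `full_matches`, the patterns `oo*h!` and `o+h!`, and the non-matching test words (`colouur`, `h!`, `ba`, `begn`) are assumptions added for illustration.

```python
import re

def full_matches(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

# ? makes the previous character optional.
optional = full_matches(r"colou?r", ["color", "colour", "colouur"])

# * allows zero or more of the previous character (Kleene star).
star = full_matches(r"oo*h!", ["h!", "oh!", "ooh!", "oooh!"])

# + requires one or more of the previous character (Kleene plus).
plus = full_matches(r"o+h!", ["h!", "oh!", "ooh!", "oooh!"])
sheep = full_matches(r"baa+", ["ba", "baa", "baaa", "baaaa"])

# . matches any single character, but there must be exactly one.
wildcard = full_matches(r"beg.n", ["begin", "begun", "beg3n", "begn"])

print(optional)  # ['color', 'colour']
print(star)      # ['oh!', 'ooh!', 'oooh!']
print(plus)      # ['oh!', 'ooh!', 'oooh!']
print(sheep)     # ['baa', 'baaa', 'baaaa']
print(wildcard)  # ['begin', 'begun', 'beg3n']
```

`re.fullmatch` is used rather than `re.search` so that a word counts only when the whole string fits the pattern, which makes the rejected cases visible.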

Regular Expressions: Anchors ^ $

Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1   "Hello"
\.$           The end.
.$            The end?   The end!
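The anchors can be checked with a short sketch in Python's `re` module. The helper name `first_match` and the counter-example `Hello` (no leading quote) are assumptions for illustration. Note that `^` means "start of string" outside brackets and "not" as the first character inside them.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# ^[A-Z]: an upper case letter at the very start of the string.
starts_upper = first_match(r"^[A-Z]", "Palo Alto")

# ^[^A-Za-z]: a non-letter at the very start of the string.
starts_nonletter = [first_match(r"^[^A-Za-z]", s) for s in ['1', '"Hello"', 'Hello']]

# \.$ needs a literal period at the end; .$ accepts any final character.
ends_period = [first_match(r"\.$", s) for s in ["The end.", "The end?"]]
ends_any = [first_match(r".$", s) for s in ["The end?", "The end!"]]

print(starts_upper)      # P
print(starts_nonletter)  # ['1', '"', None]
print(ends_period)       # ['.', None]
print(ends_any)          # ['?', '!']
```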


Find me all instances of the word "the" in a text.

the
    Misses capitalized examples
[tT]he
    Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
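The three attempts can be compared on one sentence with a minimal Python sketch. The sample sentence is an assumption made for illustration. The final pattern, using the word-boundary escape `\b`, is not part of the example above; it is included only as a hedged alternative because the third pattern requires a non-letter character on both sides and so misses a "the" at the very start or end of the text.

```python
import re

# Hypothetical sample text, chosen to exercise every failure case.
text = "The other theology book is on the table."

# Attempt 1: misses the capitalized "The", and also hits other/theology.
plain = re.findall(r"the", text)

# Attempt 2: catches "The" but still returns pieces of other and theology.
either_case = re.findall(r"[tT]he", text)

# Attempt 3: needs a non-letter on both sides, so the sentence-initial
# "The" is missed, and the neighbours are included in the match.
non_letters = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)

# Alternative (an assumption, not from the example): word boundaries.
boundaries = re.findall(r"\b[tT]he\b", text)

print(plain)        # ['the', 'the', 'the']
print(either_case)  # ['The', 'the', 'the', 'the']
print(non_letters)  # [' the ']
print(boundaries)   # ['The', 'the']
```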


