Regular Expressions: The Complete Tutorial

Jan Goyvaerts

Copyright © 2006, 2007 Jan Goyvaerts. All rights reserved.

Last updated July 2007.

No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from the author.

This book is published exclusively at

Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. The information is provided on an "as is" basis. The author and the publisher shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book.

Table of Contents

Tutorial

1. Regular Expression Tutorial
2. Literal Characters
3. First Look at How a Regex Engine Works Internally
4. Character Classes or Character Sets
5. The Dot Matches (Almost) Any Character
6. Start of String and End of String Anchors
7. Word Boundaries
8. Alternation with The Vertical Bar or Pipe Symbol
9. Optional Items
10. Repetition with Star and Plus
11. Use Round Brackets for Grouping
12. Named Capturing Groups
13. Unicode Regular Expressions
14. Regex Matching Modes
15. Possessive Quantifiers
16. Atomic Grouping
17. Lookahead and Lookbehind Zero-Width Assertions
18. Testing The Same Part of a String for More Than One Requirement
19. Continuing at The End of The Previous Match
20. If-Then-Else Conditionals in Regular Expressions
21. XML Schema Character Classes
22. POSIX Bracket Expressions
23. Adding Comments to Regular Expressions
24. Free-Spacing Regular Expressions

Examples

1. Sample Regular Expressions
2. Matching Floating Point Numbers with a Regular Expression
3. How to Find or Validate an Email Address
4. Matching a Valid Date
5. Matching Whole Lines of Text
6. Deleting Duplicate Lines From a File
8. Find Two Words Near Each Other
9. Runaway Regular Expressions: Catastrophic Backtracking
10. Repeating a Capturing Group vs. Capturing a Repeated Group

Tools & Languages

1. Specialized Tools and Utilities for Working with Regular Expressions
2. Using Regular Expressions with Delphi for .NET and Win32
3. EditPad Pro: Convenient Text Editor with Full Regular Expression Support
4. What Is grep?
5. Using Regular Expressions in Java
6. Java Demo Application using Regular Expressions
7. Using Regular Expressions with JavaScript and ECMAScript
8. JavaScript RegExp Example: Regular Expression Tester
9. MySQL Regular Expressions with The REGEXP Operator
10. Using Regular Expressions with The Microsoft .NET Framework
11. C# Demo Application
12. Oracle Database 10g Regular Expressions
13. The PCRE Open Source Regex Library
14. Perl's Rich Support for Regular Expressions
15. PHP Provides Three Sets of Regular Expression Functions
16. POSIX Basic Regular Expressions
17. PostgreSQL Has Three Regular Expression Flavors
18. PowerGREP: Taking grep Beyond The Command Line
19. Python's re Module
20. How to Use Regular Expressions in REALbasic
21. RegexBuddy: Your Perfect Companion for Working with Regular Expressions
22. Using Regular Expressions with Ruby
23. Tcl Has Three Regular Expression Flavors
24. VBScript's Regular Expression Support
25. VBScript RegExp Example: Regular Expression Tester
26. How to Use Regular Expressions in Visual Basic
27. XML Schema Regular Expressions

Reference

1. Basic Syntax Reference
2. Advanced Syntax Reference
3. Unicode Syntax Reference
4. Syntax Reference for Specific Regex Flavors
5. Regular Expression Flavor Comparison
6. Replacement Text Reference

Introduction

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is «.*\.txt».
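As a rough illustration of that equivalence, the sketch below filters a list of file names once with the wildcard and once with the regex. It uses Python's standard fnmatch and re modules, which are a choice made for this example rather than anything the book prescribes, and the file names are invented sample data.

```python
import fnmatch
import re

# Hypothetical file names, invented for this illustration.
filenames = ["notes.txt", "report.pdf", "todo.txt", "txt"]

# Wildcard notation: * stands for any run of characters.
wildcard_hits = [name for name in filenames if fnmatch.fnmatchcase(name, "*.txt")]

# Regex notation: .* stands for any run of characters, \. for a literal dot.
# fullmatch requires the whole name to fit the pattern, as a wildcard does.
regex_hits = [name for name in filenames if re.fullmatch(r".*\.txt", name)]

print(wildcard_hits)
print(regex_hits)
```

Both lists contain the same two names. The bare name "txt" is rejected by both notations because it has no literal dot before the extension.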

But you can do much more with regular expressions. In a text editor like EditPad Pro or a specialized text processing tool like PowerGREP, you could use the regular expression «\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b» to search for an email address. Any email address, to be exact. A programmer can use a very similar regular expression (replace the first \b with ^ and the last one with $) to check whether the user entered a properly formatted email address, in just one line of code, whether that code is written in Perl, PHP, Java, a .NET language or a multitude of other languages.

Complete Regular Expression Tutorial

Do not worry if the above example or the quick start make little sense to you. Any non-trivial regex looks daunting to anybody not familiar with them. But with just a bit of experience, you will soon be able to craft your own regular expressions like you have never done anything else. The tutorial in this book explains everything bit by bit.

This tutorial is quite unique because it not only explains the regex syntax, but also describes in detail how the regex engine actually goes about its work. You will learn quite a lot, even if you have already been using regular expressions for some time. This will help you to understand quickly why a particular regex does not do what you initially expected, saving you lots of guesswork and head scratching when writing more complex regexes.

Applications & Languages That Support Regexes

There are many software applications and programming languages that support regular expressions. If you are a programmer, you can save yourself lots of time and effort. You can often accomplish with a single regular expression in one or a few lines of code what would otherwise take dozens or hundreds.

Not Only for Programmers

If you are not a programmer, you can use regular expressions in many situations just as well. They will make finding information a lot easier. You can use them in powerful search and replace operations to quickly make changes across large numbers of files. A simple example is «gr[ae]y», which will find both spellings of the word grey in one operation, instead of two. There are many text editors and search and replace tools with decent regex support.
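For readers who do script, the same search and replace can be sketched in a few lines. The example uses Python's re module as an illustrative choice, and the sample sentence is made up.

```python
import re

# Made-up sample text containing both spellings.
text = "The gray cat sat beside the grey dog."

# One pattern finds both spellings in a single pass.
spellings = re.findall(r"gr[ae]y", text)

# The same pattern drives a search and replace that normalizes the spelling.
normalized = re.sub(r"gr[ae]y", "grey", text)

print(spellings)
print(normalized)
```

The character class [ae] accepts exactly one character, either a or e, so the pattern matches "gray" and "grey" but not "graey".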

Part 1

Tutorial

1. Regular Expression Tutorial

In this tutorial, I will teach you all you need to know to be able to craft powerful time-saving regular expressions. I will start with the most basic concepts, so that you can follow this tutorial even if you know nothing at all about regular expressions yet.

But I will not stop there. I will also explain how a regular expression engine works on the inside, and alert you to the consequences. This will help you to understand quickly why a particular regex does not do what you initially expected. It will save you lots of guesswork and head scratching when you need to write more complex regexes.

What Regular Expressions Are Exactly - Terminology

Basically, a regular expression is a pattern describing a certain amount of text. Their name comes from the mathematical theory on which they are based. But we will not dig into that. Since most people, including myself, are too lazy to type, you will usually find the name abbreviated to regex or regexp. I prefer regex, because it is easy to pronounce the plural "regexes". In this book, regular expressions are printed between guillemets: «regex». They clearly separate the pattern from the surrounding text and punctuation.

This first example is actually a perfectly valid regex. It is the most basic pattern, simply matching the literal text „regex". A "match" is the piece of text, or sequence of bytes or characters, that the pattern was found to correspond to by the regex processing software. Matches are indicated by double quotation marks, with the left one at the base of the line.
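To make the idea of a "match" concrete, here is a minimal sketch that applies the literal pattern to a short string. Python's re module and the sample sentence are assumptions of this example, not something taken from the book.

```python
import re

# Made-up subject string for the illustration.
subject = "This is a regex tutorial"

# The pattern consists only of literal characters.
match = re.search(r"regex", subject)

matched_text = match.group()   # the piece of text the pattern corresponds to
position = match.span()        # (start, end) offsets of that text in the subject

print(matched_text, position)
```

The match object carries both the matched text and its location in the string, which is what a text editor uses to highlight a search hit.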

«\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b» is a more complex pattern. It describes a series of letters, digits, dots, underscores, percentage signs, plus signs and hyphens, followed by an at sign, followed by another series of letters, digits, dots and hyphens, finally followed by a single dot and between two and four letters. In other words: this pattern describes an email address.

With the above regular expression pattern, you can search through a text file to find email addresses, or verify if a given string looks like an email address. In this tutorial, I will use the term "string" to indicate the text that I am applying the regular expression to. I will indicate strings using regular double quotes. The term "string" or "character string" is used by programmers to indicate a sequence of characters. In practice, you can use regular expressions with whatever data you can access using the application or programming language you are working with.
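Both uses of the email pattern, searching and verifying, can be sketched as follows. This is an illustration under stated assumptions: it uses Python's re module, it turns on case-insensitive matching because the pattern itself only lists uppercase letters, and the addresses are invented ones on the reserved example.com and example.org domains. For verification, the word boundaries are replaced with the ^ and $ anchors so that the whole string must fit the pattern.

```python
import re

# Core of the email pattern, without the surrounding boundaries or anchors.
EMAIL_CORE = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# Searching: word boundaries let the pattern match inside a longer text.
# IGNORECASE is an assumption of this sketch; the pattern lists only A-Z.
find_email = re.compile(r"\b" + EMAIL_CORE + r"\b", re.IGNORECASE)

# Verifying: ^ and $ require the entire string to look like an email address.
validate_email = re.compile(r"^" + EMAIL_CORE + r"$", re.IGNORECASE)

# Invented sample text and inputs.
text = "Contact john@example.com or jane.doe@mail.example.org today."
found = find_email.findall(text)

looks_valid = validate_email.search("john@example.com") is not None
looks_invalid = validate_email.search("not an email address") is not None

print(found)
print(looks_valid, looks_invalid)
```

Keeping the core pattern in one constant means the search and verification variants cannot drift apart when the pattern is adjusted.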

Different Regular Expression Engines

A regular expression "engine" is a piece of software that can process regular expressions, trying to match the pattern to the given string. Usually, the engine is part of a larger application and you do not access the engine directly. Rather, the application will invoke it for you when needed, making sure the right regular expression is applied to the right file or data.

As usual in the software world, different regular expression engines are not fully compatible with each other. It is not possible to describe every kind of engine and regular expression syntax (or "flavor") in this tutorial. I will focus on the regex flavor used by Perl 5, for the simple reason that this regex flavor is the most popular
