RegExp Tutorial



Regular Expression Tutorial

Version 1.1.0.0

Rev A

How to Contact Us

|OSIsoft, Inc. |Worldwide Offices |

|777 Davis St., Suite 250 |OSIsoft Australia |

|San Leandro, CA 94577 USA |Perth, Australia |

| |Auckland, New Zealand |

|Telephone |OSI Software GmbH |

|(01) 510-297-5800 (main phone) |Altenstadt, Germany |

|(01) 510-357-8136 (fax) |OSI Software Asia Pte Ltd. |

|(01) 510-297-5828 (support phone) |Singapore |

| |OSIsoft Canada ULC |

|techsupport@ |Montreal, Canada  |

| |OSIsoft, Inc. Representative Office |

|Houston, TX |Shanghai, People’s Republic of China  |

|Johnson City, TN |OSIsoft Japan KK |

|Mayfield Heights, OH |Tokyo, Japan  |

|Phoenix, AZ |OSIsoft Mexico S. De R.L. De C.V. |

|Savannah, GA |Mexico City, Mexico  |

|Seattle, WA | |

|Yardley, PA | |

|Sales Outlets and Distributors |

|Brazil |South America/Caribbean |

|Middle East/North Africa |Southeast Asia |

|Republic of South Africa |South Korea |

|Russia/Central Asia |Taiwan |

| |

|WWW. |

|OSIsoft, Inc. is the owner of the following trademarks and registered trademarks: PI System, PI ProcessBook, Sequencia, |

|Sigmafine, gRecipe, sRecipe, and RLINK. All terms mentioned in this book that are known to be trademarks or service marks |

|have been appropriately capitalized. Any trademark that appears in this book that is not owned by OSIsoft, Inc. is the |

|property of its owner and use herein in no way indicates an endorsement, recommendation, or warranty of such party’s |

|products or any affiliation with such party of any kind. |

| |

|RESTRICTED RIGHTS LEGEND |

|Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the |

|Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 |

|Unpublished – rights reserved under the copyright laws of the United States. |

| |

|© 2002-2007 OSIsoft, Inc. RegExpTutorial.doc |

Table of Contents

Introduction

RegExp Tester

The Basics

Wildcards

Sets

Escaped Characters

Position Characters

Repeats

Substitutions

Whole Text Substitution

Reordering Text

Example Searches

Revision History

Introduction

This document is intended to help users of PI interfaces that make use of Regular Expressions.

Regular expressions (RegExp) are a long-established tool for searching text and making substitutions. The central idea is matching a generic pattern that the user supplies against the specific text that is given. A very simple form of pattern matching is used in Windows all the time: the wildcard character. If you open a command prompt and issue the command dir c:\winnt\system32\*.dll, you’ll get a list of all the files whose full file name, including the path, starts with c:\winnt\system32\, has any amount of text after that, and ends with .dll. Here c:\winnt\system32\*.dll is the pattern, and all the files that are returned are matches.

RegExp Tester

You may find the following discussion easier to understand if you follow along using the RegExp Tester program. This utility lets you enter the text to search, a search pattern, and a substitution pattern. It then performs the search and replace in the same way as any product that uses the RegExp implementation built into Internet Explorer.

The Search Text field is where you enter the text you want to search. The Search pattern field is where you enter the pattern you want to find in that text. The Substitution pattern field is where you can enter a pattern that will be substituted for the search results. If you just want to perform a search without a replace, leave this field blank. Press the Execute button to perform the search and replace.

The Basics

This section will show you the basics of RegExp using the RegExp object built into Internet Explorer. The product you use may or may not be based on the implementation of Regular Expressions built into Internet Explorer.

First, here is a simple example:

Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal.

This is the first sentence of Lincoln’s Gettysburg address. The main parameters to a RegExp search are the text being searched (the text above), and the pattern we are trying to find. Let’s first use a pattern of fathers. The RegExp engine would match the eighth word in the sentence, fathers. Pretty simple; we searched for fathers and found it.

Now, let’s try to match the word apple. The RegExp engine would return a blank, because the word apple does not appear in the text.
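The two searches above can be sketched in code. This is a minimal illustration that assumes Python's re module as a stand-in for the RegExp engine discussed here; the helper name first_match is made up for this sketch and is not part of any product.

```python
import re

text = ("Four score and seven years ago, our fathers brought forth upon this "
        "continent a new nation: conceived in liberty, and dedicated to the "
        "proposition that all men are created equal.")


def first_match(pattern, searched_text):
    """Return the first match of pattern, or an empty string if there is none."""
    found = re.search(pattern, searched_text)
    return found.group(0) if found else ""


found_word = first_match("fathers", text)   # the word is present
missing_word = first_match("apple", text)   # the word is absent, so blank

print(repr(found_word))
print(repr(missing_word))
```

Returning an empty string for a failed search mirrors the blank result described above; Python itself reports a failed search as None.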

Wildcards

Now, to add a little complexity, let’s try to find some text using wildcards. The wildcard character is the period (.). The period represents any single character. So if we were to try to search for d.dicated, the search would return the word dedicated.

In this search, the period was allowed to be any character, so d.dicated matched dedicated, because the letter e certainly counts as “any character”. A search for ur....r. would return ur score from Four score: the u and the r come from the last half of the word Four, the four periods match the space, the s, the c, and the o, the next r matches the r in score, and the last wildcard matches the e at the end of score. Note that the space between Four and score counts as a character, so it is matched by the wildcard. A search for . on its own would return the F at the beginning of the sentence. Searches return only the first match, and since the wildcard can match any character, the F is the first match found.
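The wildcard searches above can be reproduced with a short sketch. It assumes Python's re module behaves like the engine described here for these patterns; the variable names are illustrative only.

```python
import re

text = ("Four score and seven years ago, our fathers brought forth upon this "
        "continent a new nation: conceived in liberty, and dedicated to the "
        "proposition that all men are created equal.")

# One period stands in for exactly one character.
single_wildcard = re.search(r"d.dicated", text).group(0)

# Four periods consume the space, s, c and o; the last period consumes the e.
several_wildcards = re.search(r"ur....r.", text).group(0)

# A lone period matches the very first character of the text.
lone_wildcard = re.search(r".", text).group(0)

print(single_wildcard)
print(several_wildcards)
print(lone_wildcard)
```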

There are several specialized wildcard characters you can use in specifying a pattern.

|Wildcard |What it matches |

|\s |Whitespace (space, tab, and line-break characters) |

|\w |Word characters (digits, letters, and “_”) |

|\d |Digits |

Searching the example text above for the pattern ..r\sf. would return the sequence our fa.

The first two wildcards match any characters, o and u in this case. The r matches with the r in our. The \s matches the space between our and fathers. The f matches the f at the beginning of fathers. The final period matches the a in fathers. If the word sheriff had somehow appeared in the sentence, the sub-word heriff would not have matched. The i between the r and the f does not match the whitespace wildcard.

The word and digit wildcards are similar. A \w will only match a digit, letter, or the underscore character. A \d will only match a digit.

Capitalizing one of these wildcard characters instructs the search to match the opposite of its lowercase counterpart. So a search for the pattern ..r\Sf. in the Gettysburg Address sentence would not match our fa, because the \S instructs RegExp to match anything other than whitespace, and the character between our and fathers is a space. It would match heriff in the word sheriff, because the i does match the non-whitespace wildcard \S.
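The behavior of \s, \S, \w, and \d can be checked with a small sketch. It assumes Python's re module as a stand-in for the engine described here; the sample strings other than the Gettysburg sentence are invented for illustration.

```python
import re

text = ("Four score and seven years ago, our fathers brought forth upon this "
        "continent a new nation: conceived in liberty, and dedicated to the "
        "proposition that all men are created equal.")

# \s requires whitespace between the r and the f.
with_whitespace = re.search(r"..r\sf.", text).group(0)
sheriff_with_whitespace = re.search(r"..r\sf.", "sheriff")      # no match

# \S requires anything except whitespace in the same position.
sheriff_without_whitespace = re.search(r"..r\Sf.", "sheriff").group(0)
sentence_without_whitespace = re.search(r"..r\Sf.", text)       # no match

# \d matches digits only; \w matches letters, digits and the underscore.
two_digits = re.search(r"\d\d", "wind-25 mph").group(0)
four_word_chars = re.search(r"\w\w\w\w", "a_b2 !").group(0)

print(with_whitespace, sheriff_without_whitespace, two_digits, four_word_chars)
```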

Sets

It is possible to specify sets of characters to match in a search. For example, to search for the pattern vowel-space-vowel in the Gettysburg Address sentence, you would need to use a set. There is no vowel wildcard character. So the set of the letters a, e, i, o, and u would constitute the set of vowels. A set is represented using brackets ([ and ]). The pattern vowel-space-vowel would look like: [aeiou]\s[aeiou]. This search’s results would be e a from the words score and.

The e matches with the set [aeiou], the space matches with the whitespace wildcard \s, and the a matches with the set [aeiou].

The characters in the set can be excluded by adding the caret (^) just inside the opening bracket. For example, to search the sentence for the pattern not-a-vowel-space-vowel, you could use the pattern [^aeiou]\s[aeiou]. This would match s a from years ago. The letter s is in the set “anything but a, e, i, o, or u”, the space matches the \s, and the a matches the set “a, e, i, o, or u”. Note that [^aeiou] will match anything other than a, e, i, o, or u. This includes digits, whitespace, and punctuation.

Another modification on sets is the range character. A hyphen will indicate a range in the set context. For example, the range [a-h] will match any lowercase letter between a and h, inclusive. The range [a-mo-z] will match any lowercase letter except the lowercase n. When determining if one character is between two others, ASCII representations are used. For example, the space character is represented in ASCII by the number 32. Uppercase A is 65, and lowercase a is 97. So the range [A-a] would include B (ASCII 66), ^ (ASCII 94), but not b (ASCII 98).

You can combine the not-in-set character and the range character. To search for anything other than the uppercase letters, you could specify [^A-Z]. This will match anything other than A through Z.
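The set, not-in-set, and range patterns from this section can be tried out as follows. This is a sketch that assumes Python's re module treats sets the same way as the engine described here; the short test strings are invented for illustration.

```python
import re

text = ("Four score and seven years ago, our fathers brought forth upon this "
        "continent a new nation: conceived in liberty, and dedicated to the "
        "proposition that all men are created equal.")

# Vowel, whitespace, vowel.
vowel_space_vowel = re.search(r"[aeiou]\s[aeiou]", text).group(0)

# Anything except a lowercase vowel, whitespace, vowel.
other_space_vowel = re.search(r"[^aeiou]\s[aeiou]", text).group(0)

# Ranges follow character codes: [A-a] spans codes 65 through 97.
in_upper_to_lower_a = [c for c in "B^b" if re.fullmatch(r"[A-a]", c)]

# [a-mo-z] is every lowercase letter except n.
n_is_excluded = re.fullmatch(r"[a-mo-z]", "n") is None
m_is_included = re.fullmatch(r"[a-mo-z]", "m") is not None

# [^A-Z] finds the first character that is not an uppercase letter.
first_non_uppercase = re.search(r"[^A-Z]", "ABCdE").group(0)

print(vowel_space_vowel, other_space_vowel, in_upper_to_lower_a, first_non_uppercase)
```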

Escaped Characters

Some characters hold special meaning in RegExp pattern matching. For example, the brackets delimit set definitions, and the caret indicates a not-in-set declaration. For that reason, if you actually need to search for a bracket, a caret, or any of the other special characters, you’ll need to “escape” the character by putting a backslash (\) in front of it.

The following table shows escape characters:

|Literal Character |Escaped Character |

|. (period) |\. |

|* (asterisk) |\* |

|+ (plus sign) |\+ |

|? (question mark) |\? |

|| (pipe character) |\| |

|\ (backslash) |\\ |

|^ (caret) |\^ |

|$ (dollar sign) |\$ |

|( (left parenthesis) |\( |

|) (right parenthesis) |\) |

|[ (left bracket) |\[ |

|] (right bracket) |\] |

|{ (left curly brace) |\{ |

|} (right curly brace) |\} |

|New Line (LF) |\n |

|Carriage Return (CR) |\r |

|Tab (HT) |\t |

|Vertical Tab (VT) |\v |

|Form Feed (FF, page break) |\f |
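The effect of escaping can be seen in a short sketch. It assumes Python's re module as a stand-in for the engine described here, and the sample line is invented. The re.escape helper shown at the end is a Python convenience and is not claimed to exist in other RegExp implementations.

```python
import re

line = "Total (net): $3.50 [estimated]"

# Escaped characters are matched literally.
price = re.search(r"\$3\.50", line).group(0)
note = re.search(r"\[estimated\]", line).group(0)

# Unescaped, the dollar sign means end of line, so this can never match.
unescaped_dollar = re.search(r"$3.50", line)

# Unescaped, the period is a wildcard and accepts any character.
loose = re.search(r"3.50", "3x50").group(0)
strict = re.search(r"3\.50", "3x50")

# Python-specific helper: escape every special character in a literal string.
literal = "c:\\winnt\\system32\\*.dll"
escaped_hit = re.search(re.escape(literal), "dir " + literal).group(0)

print(price, note, loose, escaped_hit)
```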

Position Characters

There are three patterns that do not match a character, but a position. These three are start of line (^), end of line ($), and word boundary (\b).

Start of Line (^)

The start-of-line pattern, the caret, anchors a match to the beginning of a line. For example, to match the word February only if it appears at the beginning of a line, the search pattern would be ^February. Suppose you want to find a date that is always in February, written with the full month first, then the day, then a comma, then the year, and that appears at the beginning of a line. Your pattern could look like: ^February \d\d, \d\d\d\d. The ^ will match the beginning of the line, February will match itself, the next two \d will match the day of the month (assuming a 2-digit day), and the final four \d will match the year.

Do not confuse the caret in this context with the caret in the set context, which is the not-in-set character. That caret will always be inside brackets.

End of Line ($)

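The end-of-line pattern, the dollar sign, works in exactly the same way, except that it anchors the match to the end of the line instead of the beginning. For example, the pattern 2002$ matches 2002 only when it appears at the end of a line.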

Boundary (\b)

The word boundary pattern (\b) works in a similar way. A word is defined as a series of letters, numbers, and the underscore. \b would match anywhere there is a word break (beginning or end). Searching the text this is a sentence for \bsent would return sent. Note that the space is not included. \b does not match the space; it matches the spot between the space and the next word.
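The three position patterns can be exercised with the sketch below. It assumes Python's re module as a stand-in for the engine described here, and the sample lines are invented. Whether ^ and $ also match at line breaks inside a longer text depends on the engine's settings; in Python that requires the re.MULTILINE flag, which is a Python detail and an assumption relative to this tutorial.

```python
import re

date_pattern = r"^February \d\d, \d\d\d\d"

# The caret only lets the date match at the beginning.
at_start = re.search(date_pattern, "February 10, 2002 was cold").group(0)
in_middle = re.search(date_pattern, "It was cold on February 10, 2002")

# The dollar sign only lets the year match at the end.
year_at_end = re.search(r"\d\d\d\d$", "It was cold on February 10, 2002").group(0)

# \b matches the position at a word break, not the space itself.
word_start = re.search(r"\bsent", "this is a sentence").group(0)
inside_word = re.search(r"\bsent", "absent")

# Python treats the whole text as one line unless re.MULTILINE is given.
two_lines = "Weather report\nFebruary 10, 2002"
single_line_mode = re.search(date_pattern, two_lines)
multi_line_mode = re.search(date_pattern, two_lines, re.MULTILINE).group(0)

print(at_start, year_at_end, word_start, multi_line_mode)
```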

Repeats

You can modify your search to look for repeating characters. In the simplest repeating pattern, use curly braces ({ and }) after a pattern to search for that pattern repeated exactly N times. For example, there are two ways to search Look at that! for two o’s: you could use the pattern oo, or you could use the pattern o{2}. The pattern o{2} will match two consecutive o’s. A slightly more complicated example is to search for any word that has 4 characters, where the middle two are o’s. For example, search You are a fool! for such a pattern. The search pattern would be \b\wo{2}\w\b. The first \b indicates a word boundary. Then, the \w indicates any word character (letter, number, or underscore). The o{2} indicates two o’s. The second \w indicates another word character. The final \b indicates another word boundary. The word fool would match this search pattern.

A variation on matching exactly N times is matching N to M times. To search for a pattern N to M times, use the following notation: {N,M}. In the previous example, change the pattern to \b\wo{2,4}\w\b. The word fool will still match, but foool and fooool would also match, because the o is repeated between 2 and 4 times in each of those words. fol and foooool would not match, because the o only appears once and 5 times, respectively, in each word.

Another variation on matching a pattern exactly N times is matching a pattern at least N times. This is denoted by putting a comma after N: {N,}. \b\wo{2,}\w\b would match fool, foool, fooool, foooool, etc. Any word that has some word character, at least two o’s, and one more character will match this pattern.
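The three curly-brace forms can be compared side by side. This sketch assumes Python's re module handles {N}, {N,M}, and {N,} the same way as the engine described here; the helper name matching_words is made up for illustration.

```python
import re

words = ["fol", "fool", "foool", "fooool", "foooool"]


def matching_words(pattern):
    """Return the words in which the pattern is found."""
    return [word for word in words if re.search(pattern, word)]


exactly_two = matching_words(r"\b\wo{2}\w\b")      # {N}: exactly 2 o's
two_to_four = matching_words(r"\b\wo{2,4}\w\b")    # {N,M}: 2 to 4 o's
at_least_two = matching_words(r"\b\wo{2,}\w\b")    # {N,}: 2 or more o's

double_o = re.search(r"o{2}", "Look at that!").group(0)
four_char_word = re.search(r"\b\wo{2}\w\b", "You are a fool!").group(0)

print(exactly_two)
print(two_to_four)
print(at_least_two)
print(double_o, four_char_word)
```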

The * character will match the preceding pattern 0 or more times. This is very useful in matching an unknown number of characters. The pattern fo*l will match any part of the searched text that has an f, any number of o’s (including zero), and an l. fl, fol, fool, etc. would all match the pattern. Combining the asterisk with the period (any character) is an extremely useful way of finding unknown data. For example, if there is a sentence that starts with Today’s temperature and ends in a number and then a period, and you want to extract this whole sentence out of a paragraph or a page, the pattern Today’s temperature.*\d\. would be the pattern to search for. The beginning of the pattern is the phrase Today’s temperature. The period denotes any character and the following asterisk indicates that we’re looking for any character any number of times. The \d and \. indicate that we want the pattern to end at a digit and a period. This pattern would match the following sentences:

• Today’s temperature in the San Leandro area will be 55.

• Today’s temperature in Oakland and the East Bay was supposed to be 67.
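The temperature search above can be sketched as follows, assuming Python's re module as a stand-in for the engine described here. The code uses a plain apostrophe in Today's, and the surrounding paragraph text is invented to show the sentence being pulled out of a longer passage.

```python
import re

pattern = r"Today's temperature.*\d\."

sentences = [
    "Today's temperature in the San Leandro area will be 55.",
    "Today's temperature in Oakland and the East Bay was supposed to be 67.",
]

# Each sentence is matched from its first word through its closing period.
matched_whole_sentence = [
    re.search(pattern, sentence).group(0) == sentence for sentence in sentences
]

# The same pattern extracts the sentence from a longer paragraph.
paragraph = "Good morning. " + sentences[0] + " Bring a jacket."
extracted = re.search(pattern, paragraph).group(0)

print(matched_whole_sentence)
print(extracted)
```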

There is another important property of the asterisk pattern. By default, the search results will return the longest match that fits the pattern. So if you search the text This is the end of the line! for the pattern This.*n, your search results would be This is the end of the lin. To match This is the en, add a question mark after the asterisk. So the pattern This.*?n would return This is the en.
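The difference between the longest and shortest match can be checked with a two-line sketch, assuming Python's re module treats the question mark after the asterisk the same way as the engine described here.

```python
import re

line = "This is the end of the line!"

# By default .* takes as much text as it can before the final n.
longest = re.search(r"This.*n", line).group(0)

# With the question mark, .*? stops at the first n it can reach.
shortest = re.search(r"This.*?n", line).group(0)

print(longest)
print(shortest)
```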

The .* combination is very powerful and will be discussed further in the Substitutions section. * is equivalent to {0,}.

The plus ( + ) character is identical to the asterisk ( * ) character, except that the plus indicates that we want the search to match the preceding pattern at least once. It is equivalent to {1,}.

The question mark ( ? ) character indicates that we want the search to match the preceding pattern once or none at all. It is equivalent to {0,1}.
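The equivalences between *, +, ? and their curly-brace forms can be confirmed with a short sketch. It assumes Python's re module as a stand-in for the engine described here; the helper name whole_word_matches is made up, and it uses re.fullmatch so that each word must match the pattern from start to end.

```python
import re

words = ["fl", "fol", "fool"]


def whole_word_matches(pattern):
    """Return the words that match the pattern in their entirety."""
    return [word for word in words if re.fullmatch(pattern, word)]


zero_or_more = whole_word_matches(r"fo*l")   # same as fo{0,}l
one_or_more = whole_word_matches(r"fo+l")    # same as fo{1,}l
zero_or_one = whole_word_matches(r"fo?l")    # same as fo{0,1}l

print(zero_or_more)
print(one_or_more)
print(zero_or_one)
```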

Substitutions

Searching for a pattern is useful, but the search pattern often has to contain more detail than you want to keep in the result. For example, suppose the following line appeared as a search target:

Forecast: temperature-76 degrees, wind-25 mph, humidity-60%.

If the goal is to extract the wind speed from this line, the search might look only for digits, as with the pattern \d+. This pattern searches for one or more digits in a row. However, this search would return the temperature, not the wind speed, because the temperature is matched first. The pattern might instead include the word wind, then a hyphen, and then the actual wind speed digits: wind-\d+. However, there is a problem with this as well. This search would return wind-25, which cannot be converted to an integer because it includes the word wind. This is where substitution is required.

Whole Text Substitution

Substitution replaces the entire matched text with whatever you specify. The match can be replaced with hard-coded text, although this is rarely useful. For example, a search of the weather forecast above for the pattern wind-\d+ could be replaced with the number 60. It would take your match, wind-25, replace it with 60, and keep 60 as the final substitution result. A much more useful application of substitution is using parts of the search result to replace the entire search result.

In this example, we want to replace wind-25 with just the number 25, which is part of the search result. To do this, first modify the search pattern by adding parentheses around the part of the search pattern to use as the replacement. In this case, put parentheses around the \d+ part of the search pattern so it becomes wind-(\d+). This does not affect what the search matches. Technically, the \d+ is marked as a “group”. Now, in the substitution pattern, the search result wind-25 is to be replaced with the contents of the group, 25. To indicate the contents of a group from the search pattern in the substitution pattern, use a dollar sign ($) followed by the number of the group. In this case, we only have one group, so that number is 1. The substitution pattern would be $1.

Here is a summary of this example. The text being searched is:

Forecast: temperature-76 degrees, wind-25 mph, humidity-60%.

The search pattern is wind-(\d+). The substitution pattern is $1. This searches the text for the search pattern and finds wind-25. The digits have been joined as a group in the search pattern using the parentheses, so they can be referenced later in the substitution pattern. The substitution pattern $1 means that the entire search result is replaced with the contents of the first group in the search pattern. So 25 would replace wind-25 as the final search and substitution result.
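The wind speed example can be sketched in code. This assumes Python's re module as a stand-in for the engine described here. One difference to be aware of: Python writes a group reference in a substitution pattern as \1 rather than $1, so the sketch uses \1 where this tutorial uses $1.

```python
import re

forecast = "Forecast: temperature-76 degrees, wind-25 mph, humidity-60%."

# Digits alone find the temperature, because it comes first.
first_digits = re.search(r"\d+", forecast).group(0)

# Adding the word wind finds the right place but keeps unwanted text.
without_group = re.search(r"wind-\d+", forecast).group(0)

# Grouping the digits lets the substitution keep only group 1.
found = re.search(r"wind-(\d+)", forecast)
substituted = found.expand(r"\1")   # Python spelling of the $1 pattern
wind_speed = int(substituted)       # now converts cleanly to an integer

print(first_digits, without_group, substituted, wind_speed)
```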

Reordering Text

Another interesting use for substitution is to reorder text. Suppose a system has its locale set to use United States date ordering (month/day/year), but the data being read uses European date ordering (day/month/year). The date parser will misread the date if you only search for it and do not reorder it. Substitution provides a way to switch the day and the month.

For this example, search this text:

Forecast for 24/2/2002: Partly cloudy, highs in the mid 30’s.

The search is only for the date for now. The original search pattern may be \d{1,2}/\d{1,2}/\d{4}. This will search for one or two digits, a forward slash (/), one or two more digits, another forward slash, and then exactly 4 digits. The result will be 24/2/2002. However, the result needs to be reordered so the month comes first. The search needs to be changed to make groups out of all three numbers. The search pattern will now be (\d{1,2})/(\d{1,2})/(\d{4}). This search pattern creates three groups, the first one with one or two digits, the second group with one or two digits, and the third group with 4 digits. Substitution can now be used to put this text back together in the desired order, the second group to go first, a slash, the first group, another slash, and then the third group. The substitution pattern is $2/$1/$3. The results of the substitution are 2/24/2002.
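The date reordering can be sketched as follows, again assuming Python's re module as a stand-in for the engine described here. Because Python writes group references as \1 rather than $1, the sketch includes a small hypothetical helper, to_python_template, that converts the $-style substitution pattern used in this tutorial. The helper is invented for illustration and only handles single-digit group numbers.

```python
import re


def to_python_template(substitution):
    """Rewrite $1-style group references as Python's backslash style."""
    return re.sub(r"\$(\d)", r"\\\1", substitution)


text = "Forecast for 24/2/2002: Partly cloudy, highs in the mid 30's."

# Without groups the date is found but cannot be rearranged.
date_only = re.search(r"\d{1,2}/\d{1,2}/\d{4}", text).group(0)

# With three groups, the substitution pattern puts the month first.
found = re.search(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
reordered = found.expand(to_python_template("$2/$1/$3"))

print(date_only)
print(reordered)
```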

Example Searches

Ex. 1: Search and Replace

|Text to be searched |Weather data for February 10, 2002 |

|Search pattern |Weather data for (.*) |

|Substitution pattern |$1 |

|Results |February 10, 2002 |

|Comments |We’re grouping the date part together here and using that as the substitution pattern. |

| |This gets rid of the “Weather data for” part of the text. |

Ex. 2a: Search and Replace

|Text to be searched |Weather data for 2/10/2002 |

|Search pattern |Weather data for (.*) |

|Substitution pattern |$1 |

|Results |2/10/2002 |

|Comments |This is the hard way to do this one. Since the date here is numeric, we can just search |

| |for the digits and not worry about the letters. |

Ex. 2b: Search

|Text to be searched |Weather data for 2/10/2002 |

|Search pattern |\d+/\d+/\d+ |

|Substitution pattern | |

|Results |2/10/2002 |

|Comments |This searches for at least one digit, a slash, at least one digit, a slash, and at least|

| |one digit. 2/10/2002 is the only part of the original text that matches that pattern. |

Ex. 3: Search and Replace

|Text to be searched |February 10 (Sunday), 2002 |

|Search pattern |(.*?)\s(\d*?)\s\(.*?\),\s(\d+) |

|Substitution pattern |$1 $2, $3 |

|Results |February 10, 2002 |

|Comments |This one is complicated in that we want to get rid of something in the middle. So we |

| |just make sure to mark the month, day, and year with the group modifier (parentheses). |

| |Here is how this pattern matches: |

| |(.*?) – Non-greedy match up to the space. We want to make sure it only matches up to the|

| |first space. Group 1. |

| |\s – The first space. |

| |(\d*?) – Non-greedy search for digits after the first space but before the next space. |

| |Group 2. |

| |\s – The next space. |

| |\( – The opening parenthesis before “Sunday”. |

| |.*? – Non-greedy search up to the closing parenthesis. |

| |\) – The closing parenthesis after “Sunday”. |

| |,\s – The comma and space after the parentheses. |

| |(\d+) – One or more digits after the last comma and space. A non-greedy (\d*?) here would match zero digits, because nothing follows it in the pattern. Group 3. |

Ex. 4: Search and Replace

|Text to be searched |Temperature: 0 °C (32 °F) |

|Search pattern |\((\d*?) °F\) |

|Substitution pattern |$1 |

|Results |32 |

|Comments |This searches for a left parenthesis, 0 or more digits, a space, a degrees sign, an F, |

| |and a right parenthesis. The digits are grouped as group 1. The substitution pattern |

| |says to replace the whole matched text with the contents of group 1, which is 32. |
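The example searches above can be replayed with the sketch below. It assumes Python's re module as a stand-in for the engine described here; the helpers to_python_template and search_and_replace are hypothetical names invented for this illustration. Example 3 is run with (\d+) as its last group, because a non-greedy quantifier with nothing after it in the pattern matches as few characters as possible, which here means no digits at all; the final line of the sketch shows that effect.

```python
import re


def to_python_template(substitution):
    """Rewrite $1-style group references as Python's backslash style."""
    return re.sub(r"\$(\d)", r"\\\1", substitution)


def search_and_replace(text, pattern, substitution=""):
    """Return the first match, rewritten by the substitution pattern if given."""
    found = re.search(pattern, text)
    if found is None:
        return ""
    if not substitution:
        return found.group(0)
    return found.expand(to_python_template(substitution))


ex1 = search_and_replace("Weather data for February 10, 2002",
                         r"Weather data for (.*)", "$1")
ex2b = search_and_replace("Weather data for 2/10/2002", r"\d+/\d+/\d+")
ex3 = search_and_replace("February 10 (Sunday), 2002",
                         r"(.*?)\s(\d*?)\s\(.*?\),\s(\d+)", "$1 $2, $3")
ex4 = search_and_replace("Temperature: 0 °C (32 °F)", r"\((\d*?) °F\)", "$1")

# A non-greedy final group captures nothing, so the year is lost.
ex3_lazy_tail = search_and_replace("February 10 (Sunday), 2002",
                                   r"(.*?)\s(\d*?)\s\(.*?\),\s(\d*?)", "$1 $2, $3")

print(ex1, ex2b, ex3, ex4, repr(ex3_lazy_tail), sep="\n")
```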

Revision History

|Date |Author |Comments |

|04-Feb-02 |LNG |Initial 1.0.0 release. |

|01-Apr-04 |LNG |Version 1.1.0: modified to be generic enough to use with any |

| | |product. |

|06-Apr-04 |CG |Version 1.1.0.0: reformatted search and result text; removed |

| | |reference to html; fixed pages number; fixed headers & footers |

|1-May-2007 |MKelly |Version 1.1.0.0 Rev A; Updated the How to Contact Us page. |
