Operating Systems Technology



System Administration

Course Notes #4

Regular Expressions

A regular expression is a string of characters that describes a pattern; any string that fits the pattern is said to match it. As a simple example, the regular expression [01]+ describes every string that consists only of 0s and 1s (that is, the binary numbers, such as 000, 101, 11111, 0, 1010101010, etc). Regular expressions are important to understand because, as a system administrator, you may want to perform operations on a set of files that match a given pattern. For instance, you have a directory of images and you want to delete all of the TIFF files. However, the people who downloaded the files weren’t paying particular attention to the spelling of TIFF and so there are files that end with .tiff, .TIFF, .tif, .Tiff, etc. If you want to delete all of the TIFF files with one operation, you need to specify a pattern that covers every spelling. Many commands accept patterns as arguments, including ls, grep, sed and awk, and patterns also appear in shell scripts. (Strictly speaking, the patterns given to ls are shell wildcards, a simpler relative of regular expressions that the shell expands before ls runs; these notes treat the two together.) Here we examine regular expressions and how to use them, limiting ourselves to ls and grep. The next set of notes covers sed and awk, and at the end of the course, you will study shell scripts.

NOTE: you may want to work through some of these examples. You can experiment with ls on files in your home directory, and you can download some of the example files to test out from ~foxr (addresses.txt, computers.txt, states.txt).

Metacharacters

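Linux syntax for representing regular expressions (these special characters are called metacharacters):

* -- match 0 or more occurrences of the preceding character or pattern (in an ls wildcard, * by itself matches 0 or more of any characters)

. -- match exactly one character (any character)

+ -- match 1 or more occurrences of the preceding character or pattern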

^ -- match text at the beginning of the line

$ -- match text at the end of the line

[chars] -- match any of the chars in [ ]

[^chars] -- match any characters except those listed in [ ]

\ -- escapes the meaning of the character that follows (can be used when trying to match a character that is itself a metacharacter, such as looking for * or $ in the text)

| -- separates choices (a list of ORed choices)

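{n} -- matches exactly n occurrences of the preceding pattern

{n,} -- matches at least n occurrences of the preceding pattern

{n,m} -- matches at least n occurrences but not more than m occurrences of the preceding pattern

Note: egrep (grep -E) understands {…} as written, while plain grep needs the braces escaped as \{…\}. In either case, put the pattern in single quotes so that the shell does not try to interpret the braces itself (bash treats an unquoted {4,6} as a list to expand), as in egrep '[0-9]{4,6}' or grep '[0-9]\{4,6\}' to find 4-6 digits in a row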

Example Patterns

^[0-9] – matches anything that starts with a digit

b.d – matches anything with a b, then any one character, then a d (such as bad, bed, bid, bcd, but not bd)

[a-z] – matches any lower case letter

[a-zA-Z] – matches any letter

[a-z]\{1,3\} – matches any string of 1 to 3 letters (ask, eee, q, be, bet, etc)

[tT][iI][fF]\{1,2\} – matches a t or T followed by an i or I followed by 1 or 2 f or Fs

[a-z]{3}|[A-Z]{3} – matches any three letters as long as all three are lower case only or upper case only (the | requires egrep, where the braces are written without backslashes)
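A few of the patterns above can be tried without any data file by piping sample words into grep. This is a minimal sketch: the sample words and variable names are made up for illustration, and the interval is written in the basic grep form with escaped braces.

```shell
# Made-up sample words, one per line.
words=$(printf '%s\n' bad bd 7up TIFF tif abc ABC)

# b.d needs exactly one character between b and d, so "bd" does not match.
bdot=$(printf '%s\n' "$words" | grep 'b.d')

# ^[0-9] keeps only the lines that start with a digit.
digit=$(printf '%s\n' "$words" | grep '^[0-9]')

# t, i, then one or two f's, each in either case (basic syntax: \{1,2\}).
tif=$(printf '%s\n' "$words" | grep '[tT][iI][fF]\{1,2\}')

echo "$bdot"
echo "$digit"
echo "$tif"
```

Quoting each pattern keeps the shell from touching the brackets and braces before grep sees them.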

Using Regular Expressions in ls

We will use regular expressions with other programs, supplying the regular expression as an argument to limit what the program will work with. For instance, we might choose to use a regular expression with ls to only list files that have specific patterns in their names. You have already done this with * or *.*, but now we can be more restrictive. Here are some examples:

ls *[0-9].* -- list only files that have a digit before their period (foo1.gif, arg2.txt, etc)

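ls *.[tT][iI][fF] – list files that end with a three-letter tif extension in any mix of case, including .tif, .TIF, .Tif, etc; add a second pattern, as in ls *.[tT][iI][fF] *.[tT][iI][fF][fF], to also catch .tiff, .Tiff, .tiFF, etc

You are limited with the patterns that you supply to ls, because they are shell wildcards rather than full regular expressions. The above examples will work, but you cannot, for instance, use ^, { }, + or $ with their regular expression meanings.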
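The TIFF example from the introduction can be sketched in a scratch directory. The file names below are hypothetical, and the directory is created with mktemp so that nothing in your home directory is touched.

```shell
# Scratch directory holding made-up file names.
dir=$(mktemp -d)
touch "$dir/a.tif" "$dir/b.TIF" "$dir/c.Tiff" "$dir/d.tiff" "$dir/e.gif"

# Three-letter extension, any mix of upper and lower case.
three=$(cd "$dir" && ls *.[tT][iI][fF])

# A second pattern adds the four-letter spelling.
all=$(cd "$dir" && ls *.[tT][iI][fF] *.[tT][iI][fF][fF])

echo "$all"
```

The same two patterns could be handed to rm instead of ls once the listing shows only the files you expect.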

Using Regular Expressions in grep

We use a wide variety of the regular expression metacharacters with grep. The grep program searches a text file for a pattern and displays every line that contains a match. By default grep accepts only the basic set of metacharacters, but there is an option, -E, to permit the extended set (which includes +, | and { }). Rather than typing -E, you can use the name egrep, which is equivalent to grep -E, so we will just use egrep.

As a simple example, imagine that you have a file of people’s names and addresses, where every name is combined with an address on one line and each person’s name/address is on a separate line. You want to find all of the people who live in Kentucky. The grep program will let you specify the pattern (KY or Kentucky) and then display the matching lines. Assume the file is addresses.txt; then the command is egrep 'KY|Kentucky' addresses.txt. The quotes are needed because the shell would otherwise treat the | symbol as a pipe (if you were not using the or symbol | then you would not need the quotes in this example).

Let’s consider an example of a file that stores information about various office computers. The file contains each office number, the person’s last name, and the type of computer. If the person has multiple computers, then the types are separated by commas, as in ST 314, Fox, Unix, PC. You want to find everyone who is running a unix/linux computer. You could use this command (grep is case sensitive, so this finds only the lower case spellings):

egrep 'unix|linux' computers.txt

Notice that the two expressions, unix and linux, have some similarities. We could also rewrite the command as follows:

egrep 'n[iu]x' computers.txt

This would match linux and unix (both contain an n, then an i or a u, then an x) but could also match other platforms like minix. But if you know that the only computers are linux, unix, mac and pc, this is safe.
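The two commands can be compared on a small stand-in file. The contents below are invented for illustration and are not the real computers.txt; grep -E is used as the equivalent of egrep, and -i is added so that the capitalized Unix also matches.

```shell
# Made-up stand-in for computers.txt (office, last name, computer types).
f=$(mktemp)
cat > "$f" <<'EOF'
ST 314, Fox, Unix, PC
ST 215, Smith, Mac
BH 12, Jones, linux
ST 330, Lee, PC
EOF

# Alternation needs the extended syntax; -i ignores case.
alt=$(grep -E -i 'unix|linux' "$f")

# The shorter pattern: an n, then an i or a u, then an x.
short=$(grep -i 'n[iu]x' "$f")

echo "$alt"
```

On this sample both commands select the same two lines, the ones for Fox and Jones.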

If you only want to find people who have PCs, then the command is simply grep PC computers.txt. Notice that egrep is not needed here because we do not need the | metacharacter. If PC may be upper or lower case, you could use grep '[Pp][Cc]' computers.txt where the pattern means upper or lower case p [Pp] followed by upper or lower case c [Cc]. You can also avoid the awkwardness of upper/lower case by using -i with grep/egrep.

If you want to exclude lines that contain a string, you might try [^…] to indicate the characters to exclude. However, it isn’t as simple as saying grep '[^PC]' computers.txt. This statement says “match any line that contains a character that is not a P or a C”, which means that unless a line has nothing but P’s and C’s, it will match. A second attempt is grep '[^P][^C]' computers.txt, but this won’t do it either: it matches any line that contains some pair of adjacent characters where the first is not a P and the second is not a C, and nearly every line contains such a pair whether or not it also contains PC. We could try grep '[^P][^C]$' computers.txt, which looks only at the last two characters of each line, but some lines list multiple computers and PC may or may not be at the end of such a line. Finally, we can’t do grep '^PC' computers.txt because outside of [ ] the ^ means “lines that start with”, and none of the lines of the file start with PC. Therefore, to find lines that do not contain PC, we must use the -v option, as in grep -v PC computers.txt. The -v option means “invert the match”, so that grep PC computers.txt gives us one set of lines from the file and grep -v PC computers.txt gives us all of the lines that didn’t match.
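The difference between a negated character class and -v can be seen on a three-line sample. The file contents are made up for illustration, and -c is used only to count the matching lines.

```shell
# Made-up sample lines in the office, name, computers layout.
f=$(mktemp)
cat > "$f" <<'EOF'
ST 314, Fox, Unix, PC
ST 215, Smith, Mac
ST 330, Lee, PC
EOF

# [^PC] matches any line with at least one character other than P or C,
# which is every line of this sample.
wrong=$(grep -c '[^PC]' "$f")

# -v inverts the match: only lines with no PC anywhere are kept.
right=$(grep -v 'PC' "$f")

echo "$wrong"
echo "$right"
```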

To find all lines that do not end in PC, use:

egrep -v 'PC$' computers.txt

The pattern egrep '[^P][^C]$' computers.txt comes close, but it also leaves out lines that end in only one of the two letters in its position (for instance, a line ending in AC or PX), so -v is the safer choice.

Note that egrep '[^PC]$' computers.txt means all lines whose last character is neither a P nor a C; it does not test for the string PC.

Ok, let’s find everyone who is on the 3rd floor. All offices are numbered xyz where x represents the floor number. The basement contains offices that are numbered yz instead.

egrep '3[0-9]{2}' computers.txt

This example says match on any line that has a 3 followed by two digits. Plain grep does not recognize unescaped { }, so we use egrep, which does. Without the quotes, {2} happens to work because the shell leaves a single value in braces alone, but it would try to expand {2,3} itself, so it is simplest to always quote the pattern. We could have also written this instruction as grep 3[0-9][0-9] computers.txt (or egrep).
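The three ways of writing the third-floor pattern can be checked against each other. This is a sketch on invented sample lines, with grep -E standing in for egrep.

```shell
# Made-up sample; ST 12 is a two-digit basement office.
f=$(mktemp)
cat > "$f" <<'EOF'
ST 314, Fox, Unix, PC
ST 215, Smith, Mac
ST 12, Jones, linux
ST 330, Lee, PC
EOF

ere=$(grep -E '3[0-9]{2}' "$f")    # extended syntax: plain braces
bre=$(grep '3[0-9]\{2\}' "$f")     # basic syntax: escaped braces
plain=$(grep '3[0-9][0-9]' "$f")   # no interval at all

echo "$ere"
```

All three select the lines for offices 314 and 330 on this sample.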

Here are some more complicated patterns using some of the files mentioned earlier.

grep ',.*,.*,' computers.txt – finds anyone with at least 2 computers in their office (that is, any line that has at least 3 commas)

grep -v ^ST computers.txt – finds anyone who is not in the building ST

grep ^[^S] computers.txt – finds anyone whose building name does not start with an S

egrep '[aA]pt|#' addresses.txt – finds anyone with either apt, Apt or # in their address

egrep '4[012][0-9]{3}' addresses.txt – finds zip codes that start with 40xxx, 41xxx or 42xxx. Why do you suppose the pattern ends with [0-9]{3}?

egrep ', [0-9]{3,4}, ' addresses.txt – finds all lines where the street address has 3 or 4 consecutive digits in it (this would match, for instance, a line with 911 or 3871 but not 47 or 11911). Notice if we used {3,5}, it would not match any zip codes. Why not?

egrep '^[A-Z][a-z]{2} ' addresses.txt – finds people whose first name is exactly 3 letters. ^ means to start at the beginning of the line, find an upper case letter followed by two lower case letters followed by a space. Without the quotes we could not accomplish this, as the space after {2} would be taken as just a separator between arguments and not as part of the regular expression.

grep '\.' addresses.txt – finds addresses that have a . somewhere in the line (as in Ft. Thomas) – without the \ grep would only look for . which is a metacharacter that matches any single character and therefore would match every non-blank line of the addresses file

grep ', [0-9][0-9] ' addresses.txt – finds anyone whose street number consists of exactly 2 digits

egrep '^$' addresses.txt – finds any blank lines

egrep -v '^$' addresses.txt – finds all non-blank lines (-v inverts the blank-line match above; a pattern such as '[^(^$)]' does not do this, because inside [ ] the characters ( ^ $ ) lose their special meaning and the expression simply matches any one character that is not one of them)

egrep 'i.o' states.txt – matches any line that has an i, followed by any character, followed by an o, such as New Mexico and Illinois.
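Three of the patterns in this list (the escaped period, blank lines and non-blank lines) can be checked with line counts. A minimal sketch: the two address lines are invented, their layout is an assumption rather than the real format of addresses.txt, and -c makes grep print a count instead of the lines.

```shell
# Two made-up address lines separated by one empty line.
f=$(mktemp)
printf '%s\n' 'Fox, 12 Main St., Ft. Thomas, KY 41075' \
              '' \
              'Smith, 911 Oak Ave, Dayton, OH 45402' > "$f"

dot=$(grep -c '\.' "$f")          # lines containing a literal period
blank=$(grep -c '^$' "$f")        # empty lines
nonblank=$(grep -v -c '^$' "$f")  # everything except empty lines

echo "$dot $blank $nonblank"
```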

Because ls restricts how you use patterns, you can combine ls and grep to search directories using full regular expressions. You do this by piping the results of ls to grep and using a regular expression in grep. Recall that ls, when piped, writes the items in the directory as individual lines of text, so grep then searches the lines for any that match the given regular expression, and returns only the lines that have a match. As an example, imagine that you want to find all files that do not start with the letters a, e, i, o or u. The patterns given to ls have no ^ anchor, even though the proper regular expression would be '^[^aeiou]'. So instead, you pipe the list of file names from ls to grep, where you can then use '^[^aeiou]' as in

ls | grep '^[^aeiou]'

Another example is ls | grep -v 'oo' which will return any file that does not have 'oo' in its name.

By using ls -l and piping to grep, you can search for files with particular attributes, such as a specific pattern of permissions (e.g., you want to find all files whose owner has rwx permission, or all files that were last modified in 2006). To find all files whose owner has rwx, you would use

ls -l | grep '^-rwx'

This will provide a long listing of all entries that start with -rwx (where the - means “regular file” as opposed to a directory). What would happen if you remove the ^ from this? The pattern could then match anywhere in the line, so, although unlikely, a directory with permissions drwx---rwx would match.
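The ls and grep combinations above can be sketched in a scratch directory. The file names are hypothetical, and chmod 700 is used only to give one file rwx permission for its owner.

```shell
# Scratch directory with made-up file names.
dir=$(mktemp -d)
touch "$dir/apple.txt" "$dir/book.txt" "$dir/index.html" "$dir/zoo.txt"
chmod 700 "$dir/zoo.txt"   # owner rwx, nothing for group or others

# Names that do not start with a lower case vowel.
novowel=$(ls "$dir" | grep '^[^aeiou]')

# Names that do not contain "oo".
no_oo=$(ls "$dir" | grep -v 'oo')

# Count of regular files whose owner has rwx.
rwx=$(ls -l "$dir" | grep -c '^-rwx')

echo "$novowel"
```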

Other File Manipulation Programs

Linux offers a number of other programs that can display, modify or otherwise examine files. Some of the more useful ones are listed below. We will also examine two powerful programs, sed and awk, separately.

sort – sorts the items in the input, line by line. sort -r performs a sort in descending order.

uniq – removes repeated lines that are adjacent to each other, leaving one copy of each; because duplicates must be adjacent, it is usually used in conjunction with sort, as in sort foo | uniq > sorted_foo

head – displays the first 10 lines of a file

tail – displays the last 10 lines of a file

diff – compares two files

wc – displays the number of characters (bytes), words and lines of a text file

cmp – compares two files and reports the byte position and line number of the first mismatch found (if any).

aspell – a spell checker for text files using the installed linux dictionary. To spell check a file, use aspell -c filename. The program is fairly easy to use (note that our version of linux does not have ispell, as covered in the textbook).
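A few of these programs can be tried together on a small file. This is a sketch with made-up contents; tr is used only to strip the padding that some versions of wc put around the count.

```shell
# Made-up file with repeated lines.
f=$(mktemp)
printf '%s\n' pear apple pear fig apple > "$f"

# uniq only collapses adjacent duplicates, so sort comes first.
sorted=$(sort "$f" | uniq)

# Number of lines in the file.
lines=$(wc -l < "$f" | tr -d ' ')

# First line of a descending sort.
first=$(sort -r "$f" | head -n 1)

echo "$sorted"
```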
