George Mason University



Chapter 14Searching and SortingHello!The main topics for this chapter are searching and sorting data. We'll limit our discussion to arrays for searching and sorting, so that our data are linear (it's much simpler to discuss the ideas of searching and sorting, but we could also search and sort a tree structure of data, for instance).?Getting Hyped for Search and SortOnce we stop doing silly homework assignments and try to do a truly necessary computation with our programming?skillz, we often find that we’ll apply a single concept a whole lot more—maybe a class structure with 30+ classes, maybe a program with menus inside of menus, maybe reading much more complex files to do computations; in short, we do something a lot.??When we get into complex problems that involve other sub-problems, such as sorting results to show to a human, or finding results (as in virtually any time you run a search engine, search for items in some online store, etc.), the store of information available which needs to be operated on can be quite large.??So large, in fact, that simple, correct, ways of manipulating data simply aren’t feasible.??Would you think Google was so great if it took a week to find your search for "Cliff Notes Hemmingway", which you entered in the search box the night before a reading quiz???What if it took minutes to connect a telephone call???Would that be okay??Probably not.??So, although we present some pretty simple algorithms for sorting and searching, keep in mind that we are only showing quite simple sorts; there is quite intensive study into what kind of algorithms perform best in which situations. That said,?we still want to be able to manage smaller sets of data, even if our small projects or toy online database of phonebook entries can’t be macro-scaled to millions of entries without betraying a bit of naiveté.??So let’s get sorting, and let’s start searching!?Getting StartedGrab the two files, SearchSortComplete.java, and SearchSortIncomplete.java, to your machine and open them. You will be implementing some of the method 'stubs' throughout the chapter in the *Incomplete version, but you will have the *Complete version for testing and comparison's sake. This lab will be slightly more about observing code than writing it: we maybe saw some of these examples in class, and we'll just learn about a couple more algorithms today.There are a few things to notice about this class.First of all, scroll down to the bottom and notice there a few 'helper' functions—printArray,?getChoice?(two versions),?printMenu, and?displaySearchResult.??This just helps us to organize our basic testing and exploring harness. In the main method, using these methods makes the code in the main method much shorter, and more descriptive.??Instead of seeing a loop, and some sc.nextFoo() stuff, we see?getChoice; instead of a lot of print statements, we see?printMenu.Notice the two pairs of functions—the two overloadings of binarySearch, and the two overloadings of quickSort.??The versions with more parameters are called by the others: the simpler ones (fewer parameters) are called "driver functions".??All they do is take a request, make a start-up call to the 'real' method with any initial values for the additional parameters, and pass any results through.??We could just as well have put the extra parameters (usually a zero or an x.length-1 or something) in our call from the main method, but driver methods are one more way we can abstract details away from our code.Notice the class variable debug;?We?make this variable at the very beginning, and then anywhere we want to print things out in testing mode, we conditionally test if debug==true.??Taking a quick look inside a method like?bubbleSort?will exemplify this. Maybe using the debugger is a more effective way to view the state, but if you want to take print-style debugging to the next level, this is one approach.?Basic Searching: Linear SearchA linear search does just that—it starts at the beginning, and searches straight through until it either finds a match or finds itself with no more possible matches left.??This is similar to what we’ve already seen; the method gets (a reference to) an array and a key value to seek, it steps straight through the array until it either finds a match (and returns the index where it was found) or finds out there were no matches (and returns -1 to signify no match found—no valid index is negative).Your Turn!You should be comfortable writing a linear search algorithm by now, so try implementing the body of linearSearch without looking at your notes or the solution.Test your code: noting the default array values created in main, run the program and plug in a key to search for. Search for values that are:in the arraynot in the arrayin the array multiple times→ what would be some good unit tests for these? It's a good chance to practice your testing skills if you've got the time. Compare your code against the solution.Basic Sorting: Bubble SortBubble Sort runs a double loop. The inner loop compares every single adjacent pairing of two elements (walking down the array: indexes 0 and 1, then 1&2, 2&3, 3&4, etc), and swapping them if they are unsorted relative to each other.??Because of the regular pattern of access, no matter where the largest element is, it will get compared (and swapped) with every single smaller value to its right, ensuring that the largest value?has?been placed at the far right.Each time we perform that swapping-traversal, one more almost-as-large value is guaranteed to have bubbled up to its correct location; if we have an array of length n, and we make (n-1) traversals, then we know n elements are in order. Thus all the elements are in order. So the outer loop must run as many as (n-1) times in the worst case.Bubble Sort is the simplest sort; we might bash it in class, because it does far more swaps than necessary.??Notice that, in sorting a particular spot, we might swap many of the values into the sorting spot, if we keep on finding smaller values to the right.Your Turn!Try implementing bubble sort on your own, without looking at the solution.Using the completed code, turn our debugging print statements on (set the class variable debug equal to true); run bubble sort and see how it does a lot of swaps.??Reset the array, and compare to Insertion Sort (using the complete code's version so that there's an implementation there).Now, try placing different values (and more of them) in the array—specifically, in?decreasing?order.?Now see how inefficient?bubbleSort?can be!Going Further. We can make many improvements to the provided bubble sort implementation:Make the inner for-loop smarter: if we've completed k traversals, we know the last k elements are sorted; let's end not when i is less than the array length, but when it has entered the "already sorted" zone.Change the outer for-loop to a while loop. When should it quit? When no swaps were necessary during the last walk. (Use a boolean variable to track this). More Efficient Searching: Binary SearchBinary Search could be called the "phonebook" search or the "dictionary" search. It requires the values to be in sorted order. It continually finds the middle of a range that could contain the key (if it is in the array at all), and then either finds the key at that mid-point or calls the method again on the smaller range (to the left or right of the inspected middle location) which is now known to be the only fit location for the key. Binary search relies on the data being sorted – otherwise, it won't work.Your Turn!Going Further: Implement and test your own binary search, trying not to look at the solution (or class notes) in the process. This lends well to a recursive definition, but does not have to be.Your Turn!Try running this algorithm just like linear search,?without?sorting first.??Notice you can get some buggy results—search for a ‘6’ in the array originally listed, and the program crashes!A binary search needs sorted data. Let’s fix the example:?Your Turn!Add a class variable, a?boolean?named isSorted.??(This must be static; why?)Go ahead and initialize it to false right there, on the same line.??(Reminder: we must always initialize on the same line for class variables—things at the class level don’t need an object, so we can't rely on some constructor call to instantiate it).Now, in the?binarySearch?method, at the very beginning, add an if-statement that checks if?isSorted?is?false,?and if so run a sort algorithm and set?isSorted?to true (take your pick of the various sorting methods in our program).We don’t have any other searches in our lab that require a sorted array, so we don’t have to fix this anywhere else.???However, if we copy over the old array, we need to reset this variable to false.??Go to the main method, and look for the option that resets the array to the copy.??Immediately following that code, set?isSorted?to false.??Now, our?binarySearch?will conditionally sort the information, if necessary, and won’t spend all that time if it’s already sorted!??This is especially good because many sorting algorithms are pathologically slow when working on mostly- or completely-sorted data.??Always sorting first would be an expensive waste of time!??(Slightly) More Efficient Searching: Insertion SortInsertion Sort works a bit more like we might actually think about sorting a list.? The goal is to first sort a small portion of the list at the front; then, we successively consider one more spot in the list, and keep sliding it leftward (forwards) until it's in sorted place. We now have a slightly longer beginning part of the list that is sorted. We keep performing this action on the next unsorted spot, gradually expanding our sorted portion of the list to consume more and more unsorted elements, until the entire list is sorted.The first iteration, we make sure the first element by itself is sorted. Well, yes, of course it is. So really, the first step is to add one more unsorted location to our already sorted list of length one, meaning we consider the first two items. If necessary, swap them.Your Turn!Try implementing insertion sort. Notice that we still will likely have two for-loops: one inner one that handles inserting the next item into our sorted portion of the array, and one outer loop that guides us to adding each unsorted element to the sorted portion, one at a time.Okay, one way or another, let's now take a look at the solution:Notice that the inner loop counts down through indexes, starting as high as the outer loop tells us the next unsorted location is. If you had an array drawn on paper, try tracing your finger over indexes that correspond to j's values.After traversing down through our sorted array and swapping our unsorted element as far down as necessary, this additional value is now in its sorted place. Once it no longer needs to be swapped, it won't have to go any further so we can quit the inner loop.More Efficient Sorting: Merge SortMerge Sort is a different sort of beast; It continually breaks up the array into two halves, sorts both half (using Merge Sort itself), and then?zips?them together—basically, it interleaves the values from the two halves back into one bigger, sorted array.Merge Sort is a recursive algorithm, as implemented.??This means we have to check for base cases first, and then complete the inductive (recursive) cases.Your Turn!If you're up for the challenge, try implementing merge sort!First, create two sub-lists by splitting the list in half (make two new arrays).Then, recursively call merge sort on each.Then, recreate a list of the original length by always copying in the smallest value from either list that hasn't yet been included. (Relying on their being sorted makes this simple, though you might have a many-else-ifs block of code).Okay, let's discuss the provided solution. It uses arrays instead of ArrayLists (which Snyder's in-class examples showed), so that you can compare the two implementations.This sort first creates two halves from the given array.??Then, it calls?mergeSort?on both halves: we now have two smaller sorted arrays, and we just need to combine them into one larger sorted array. Look at the code for that zipping-together portion, and notice it basically checks if the value should come from the next in the first half, the next in the second half, or the smallest of the next in each.The reason this works is that lists with only one element in them get returned—they’re already sorted—and from this base case, we use the zipping portion to merge two sorted sub-arrays together.??So, what this algorithm really does is break the array up into individual spots, then merge them all back together, two arrays at a time, until we get the entire set of values back in one array.Note that, especially with the creation of so many arrays all over the place, this is not going to be a very efficient implementation: it uses up a lot of extra space storing all these 'copies' of the array in various levels of separation. An in-place sort (which reuses the current locations in the array instead of creating copies of those values somewhere else) would be much more space-efficient, though it would probably mean more work shuffling values around in that limited space.?Your Turn!Try running this sort, as well as in debug mode.??It prints a lot, but you can sort of see the two steps—splitting and merging—occurring, just by the width of the left and right arrays being used.?One Last Sort: Quick SortQuick Sort functions in an even ‘smarter’ way.??It chooses a pivot value, and then moves anything smaller than it to the left of it, and anything larger than it to the right of it.??Since the pivot is now correctly between all values it should be, we then quick-sort the left and right portions with a (recursive) call to Quick Sort again.If you are brave, then it is your turn.Try to implement quickSort. It might also be easiest to use ArrayLists, because the number of values less than (or greater than) the pivot isn't known before you step through the list. You can choose any value in the list as your pivot value; simple choices include the value at the first, middle, or last index. (Choosing the middle one is actually not too bad an uneducated choice). Use recursion to sort the sub-lists to the left and right of the pivot value, once you've placed them all on the correct side of the pivot value (and have figured out where the pivot value belongs, for that matter).Looking at the implementation.This sort has a driver?function (explained in the beginning of this lab):??it builds up the first call to the more-parameters overloaded version of quickSort, which does the real work.??Again, this is just to abstract out the extra parameters.??Why should we bother to include parameters whose values we can always deduce for the initial call???Also, if we called?quickSort?with arbitrary alternate values for the left and right, we’d only sort that portion.??Partially sorted is close but not particularly useful.Your Turn!Try turning on debugging to see how calls to quickSort operate.What happens when two values are identical?How does quickSort fare when the list is already sorted? When it's sorted the wrong way? (Try starting with values such as {8,7,6,5,4,3,2,1}).Closing Thoughts on Search & SortThat's it for searching and sorting for us! Hopefully you now understand what the purpose of searching and sorting is, at least for linear data structures like arrays. In later classes and projects, you will have larger and more complex data structures, but the ideas of searching and sorting will constantly show up: searching through those structures for specific parts (values); trying to organize unsorted data; or, even trying to maintain data in a sorted fashion while adding elements to it one at a time.We've looked at just a couple of ways to search through data. A linear search was very simple, and needed nothing more than a way to inspect every location in some sequential order. But if we knew we had ordered data, we could do much better with binary search: each inspection allows us to throw out half the remaining search space each time.We considered a few more ways to sort data, with increasing complexity and different approaches. Bubble sort is very simple, but has pretty poor performance (e.g., it performs many more swaps on average than the other sorts). Insertion sort keeps growing a sorted list at the front of the list, repeatedly sliding the next unsorted value into our sorted portion until it fits. It performs a bit better than bubble sort. Next we considered merge sort, a divide-and-conquer algorithm, which can easily be defined by recursion (but doesn't have to be). It successively splits the list down into sub-lists until they are of size 1 (sorted by definition), and then merges them back together, relying on the simplicity of combining two already-sorted lists. Lastly, we looked at quick sort, which chooses a pivot value, puts all the smaller-than-pivot values on the left, all the larger-than-pivot values on the right, and the pivot between them; it then recursively quicksorts the left and right, and is done. These versions of sorting had different behaviors – some used lots of space (merge sort), some worked in place (the other three). Some performed better on nearly-sorted data than others.There are many ways to improve upon the searching and sorting that we've introduced here. What can you find out about these two ubiquitous programming tasks?Regular ExpressionsWe will step through our own tutorial, but if you'd like a more thorough introduction, you can follow through this link: and test out things as they are introduced there instead of this portion of our lab dedicated to regular expressions.Regular expressions are an extremely powerful mechanism for more nuanced searching through strings than a simple search for specific substrings. A regular expression defines a pattern that we can search for within a given string. We might ask whether a particular string entirely and exactly matches a pattern, or we might instead ask whether a pattern can be found to match some portion (substring) of the string. We might also use a regular expression to extract substrings from a string, or to replace portions of a string with some replacement, whenever the pattern matches in that string. In short, whenever we find ourselves wanting to perform string manipulations, chances are that regular expressions are available as a way to express what we want to occur.Regular expressions show up all over the place in programming and computer science – not just in Java. One common use of regular expressions is to search for a file on your computer (e.g., using the egrep UNIX command).A regular expression is simply a way to represent structure within a string of symbols—for instance, identifying what makes a valid phone number, identifying if a particular word is included, et cetera.A regular expression defines a set of strings. That set of strings is all the strings that comply with the regular expression; all strings which do not comply are not in the set.We can match things directly by simply having them there. The regular expression Cat defines a set of strings with exactly one element, the string "Cat" (notice the sensitivity to case—it does not match "cat"). If we want to match other characteristics, we start developing a set of symbols that imply things like "at least this many of those", "any one of these", "anything not of this group", and so on. Let’s define some of those now.This does not cover all of regular expressions in Java – it is just an introduction.pattern/symbolmeaningany non-special charmatches itself.matches any single char (except newline in some circumstances)*repetition: matches zero or more of the preceding thing+repetition: matches one or more of the preceding thing?repetition: matches zero or one of the preceding thing{7}repetition: matches exactly 7 of the preceding thing {3,6}repetition: matches from 3 to 6 (inclusive) of the preceding thing{7,}repetition: matches 7 or more of the preceding thing(pattern)parentheses group a pattern.pattern1 | pattern2selection: matches either pattern1 or pattern2[aeiou]character class: matches any single character listed in [ ]'s[a-z]matches any single character in ASCII range a to z.[^stuff]matches any single char not in stuff[main[nested]]matches any single char in the main or nested character class[main&&[nested]]matches any single char that is in both main and nested^anchor: 'matches' beginning of input$anchor: 'matches' end of input\b \Bmatches a word boundary (\B matches a non-word boundary)\d \w \smatches a single char of a: digit, word, whitespace. \D \W \Smatches a single char that is NOT a digit, word, whitespace.For testing cases, repeatedly modify the value of regex in the following code:public class TestLabRegex {// basic matching examplepublic static void main1 (String[] args) {Scanner sc = new Scanner(System.in); String str = ""; String regex = "k.t+y"; System.out.println("\nPattern: "+regex+"\n");while (!str.equals("quit")) { System.out.println("next input please: "); str = sc.nextLine(); System.out.println("\n\nstring: \""+str+"\"\nPattern: " +regex+"\nmatches: "+str.matches(regex)+"\n"); } } }Your Turn!For each of the following regular expressions, think of some Strings that do match it, and Strings that do not match it. Test it with the code that calls the matches method on a String.(way-)+back machinex?y+z*(a|b)*(something|) to doabc+(abc)+(do|re|mi|fa|sol|la|ti|do)+When designing your own regular expression, you should always ask yourself two questions: (1) Does it accept enough Strings? (2) Does it accept too many Strings?Your Turn!Create regular expressions that exactly match the set of Strings described in each example.any number of letters, followed by seventy exclamation pointsone of the Pac-Man ghosts: inky, binky, pinky, and clyde (try to be concise)zero to twenty a's, followed by an h.write a regular expression that matches one of the following smileys: :) :( :?) 8-Pvalid Java identifiersa valid Java int representation (just allow base 10 representations, so no leading 0's)any valid int[] initializer, such as {1,4,2,5} or {100,101,-4}Let's review some of those with more descriptive examples. We will underline a regular expression to help mark the boundaries without resorting to the specific String representations, which have a couple of twists related to escape characters.Character Class: To say that we want any of a batch of characters, but just one of them, we surround all of them with square brackets [ ] , with no space in between. The regular expression [aeiou]t matches "at", "et", "it", "ot", and "ut". We always choose exactly one of the options in the brackets: not zero, not many.To say that we want to match any character except what’s listed in a character class, we use the ^ sign at the front of them all. The regular expression [^aeiou]t matches all consonant-followed-by-t strings: "bt", "ct", "dt", "ft"; but it also matches non-characters; "5t", "(t", and more would also match.Since typing things like [456][456][456] can get tedious, we can specify how many of a particular pattern we want, by following the brackets with curly braces indicating how many. That could be reduced to [456]{3}, indicating we want exactly three digits (all either a 4, 5, or 6). If we want at least n repetitions but no more than m repetitions, we represent this as in [456]{n,m}. A specific example would be [456]{1,3}, which matches either a one, two, or three digit number (each digit being a 4, 5, or 6). Thus, the regular expression [45]{1,2} matches "4", "5", "44", "45", "54", and "55". If we have a minimum bound but no upper bound, we can leave that part out—[45]{1,} matches numbers with at least one digit, of only 4’s and 5’s. If we wanted up to some number of matches, we use a zero for the lower bound: e.g., [456]{0,4}.We can match ranges of values by a dash; [0-9] matches any single digit; [a-z] matches any single lowercase letter; [A-Za-z] matches any letter regardless of case, and so on. Notice this only works inside brackets: 0-9 matches the string "0-9". Also, [1-10] matches "1" or "0", it does not match "1","2","3",…,"8","9","10".If we want to allow a pattern to repeat any amount of times, we place an asterisk after it: a*t matches "t", "at", "aat", "aaat", etc. This represents zero or more a’s followed by exactly one t. A close cousin to the star is the plus symbol +, used to represent one or more repetitions of the pattern. a+t matches "at", "aat", etc., but does not match "t".To choose between one sub-pattern and another, we use the vertical bar | to represent union of the sets created by each of them. So (a|parthe)non matches "anon" and "parthenon".If we want a range of values minus a few particular values, we can use && to intersect the two expressions: [a-z&&[^abc]] is all lowercase letters except a, b, c. Be sure that you use the inner [ ]'s! If ^ doesn't appear as the very first character of a character class, it is not a special character and just matches itself.When we want to group parts of a pattern together, we can use parentheses to separate them. For instance, if we want to match as many repetitions of the prefix "sub" at the front of the word "saharan", we could use parentheses as follows: (sub)*saharan matches "saharan", "subsaharan", subsubsaharan", etc.We can match any character at all with a period (.), except a newline character (\n). So .t matches "at", "bt", …, "At", …, "1t", …, "$t", "_t", … . Oftentimes we mean a more restricted set like [a-zA-Z]. Spaces are matched by leaving in the space directly; spaces matter in a pattern.a lot does not match "alot"; it only matches "a lot".Your Turn!write a character class that matches a single one of the following letters: ceiknop (pick one)…write a character class that matches a single upper-case non-vowel in the second half of the alphabet.write a character class that matches any single digit character(this is just \d being defined manually).write a regular expression using a character class to represent a phone number in the format of 555-555-5555. bonus: disallow any String that starts with 911.write a regular expression for length-3 Strings that don't match your initials in any position. (if you have more or fewer initials, adapt it to yourself of course!)Escaping Special CharactersWhat if we wanted to match a square bracket, or a parenthesis, or a period? All characters which get ‘used up’ by controlling how we match characters can still be included, by placing a back-slash in front of them to indicate that we want the actual character, and not the function it usually provides in a regular expression. So \[[0-9]\] matches [0], [1], [2], … [9]. This also works for others: \*\+\.\, matches the string "*+.,". This can be done with all the special characters.Representing Regular Expressions in JavaOne last thing to note, which can really throw a wrench in the gears: Java itself needs backslashes to represent a quote sign; so Strings are already using the backslash and the quote sign ". In order to keep the two separate, it gets a little messy: to represent the regular expression "[\d]" as a String (notice the pattern is matching some quote characters), it is written:String str = "\"[\\d]\"";What happened?! We had to place a backslash in front of the first quote of the regular expression to indicate to Java that it was part of the String and not the end of the String; we had to put an extra backslash in front of the backslash (which was itself in front of d) to indicate that it was not for Java to figure out what control character was being used, but rather that it should leave the slash in as part of the regular expression; the second quote of the regular expression is like the first. The short version of handling this is that all backslashes and quotes get a backslash added in front of them when we go from a regular expression on paper to a regular expression stored as a String in-between the quotes when we construct it.How to deal with this issue: first, just write out your regular expression, not worrying about Java. Perhaps in a comment if you want to record it in your code. Then, character-for-character, represent them in a Java String. Given the bizarre regex abc"\**\bshe\B\\++". we can represent it character for character: a is just "a"; same for b and c. Then, the very next character is \. Don't worry that it is logically part of \*, because we just attack one character at a time. \ becomes "\\" when we put it in a String. Next, ** is just "**". \ is "\\". she is "she". \ is "\\", and B is "B". \ is "\\", again the second \ is "\\". ++ becomes "++", " becomes "\"", and lastly . is just ".". Let's put it all together:abc"\**\bshe\B\\++". → "abc\"\\**\\bshe\\B\\\\++\"."Your Turn!With the testing code above, explore uses of all the rest of the patterns and symbols shown above. If you work with a Scanner, reading input from the user, though, the notion of "end of input" won't be the same as "end of line", and $ won't behave quite the way we'd like for simple explorations. (This would require "embedded flags", but that is advanced enough that we won't cover them in this introductory lab).We next want to get practice with the basic pattern matching capabilities that were added to the String class. We now have two new methods, with the following signatures:public boolean matches(String regex) {…}public String replaceAll(String regex, String replacement){…}matchesWe will learn how to use the first method, matches. First, we need to create a String to use with this method. In practice, this is often part of a document or user-entered information – something to be checked for resemblance to some known pattern.String str = "kitty";Next, we can check if it matches some regular expression, calling the matches method on it with a single parameter, the regular expression:String regex = "ki[t]{2}y";boolean b = str.matches(regex);b should be true after this runs. What if we want to find out if a String simply contains some pattern in it, and we don’t care where in the String that might be? Consider this:String str = "cat on a hot tin roof";boolean b = str.matches(".*hot.*");This will return true as long as there are zero or more characters before h, o, t, and then zero or more characters. Test this with different values than what is in str above. Try it at the start of a word, at the end of a word, and with things in between the h, o, and t.Consider trying to match phone numbers with this pattern:[0-9]{3}-[0-9]{3}-[0-9]{4}Test a few phone numbers. Suppose, though, that you were asking someone to enter in their phone number and you didn’t want to be too strict on the input format; let’s assume we also want to allow phone numbers like (555)-334-2190, (555)334-2190, (555) 334 2190, or any other patterns you can think of. How can we handle this? We have the parentheses to group patterns, and the vertical bar | to say "choose exactly one side of what’s on either side of me". So we can make the patterns, ‘or’ them all together, and get a very flexible regular expression for phone numbers.Your Turn!Try testing your program with two different kinds of phone number patterns first, and then adding others.replaceAllRemember the signature from above, which takes a regular expression String, a replacement string, and returns the String created by replacing all matches:public String replaceAll(String regex, String replacement){…}Suppose we want to not just find out if something matches, but we want to replace the occurrence with something else; perhaps you wrote a form letter, and need to replace all instances of "John" with "Jack". Consider the following:String str = "Please excuse John from school yesterday. John was ill," +"and so I, John’s father, made John stay home.\n\t\tJohn, Sr.";We can get this letter for Jack by just getting the String returned by replacing all "John"s with "Jack"s:String jacksLetter = str.replaceAll("\\bJohn\\b", "Jack");A few things to notice: str’s value was not changed; jacksLetter operates on a copy of str. Also, "John" is a pretty tame pattern to match—that’s where we could match things like our telephone number example from above, replacing all of them with just the local area phone numbers, for instance. Why do we have the \b word boundaries at the edge? What would happen if we left them out, and tried to replace every occurrence of "a" with "an"? Try this:String str = "Once upon a time there was a funny guy who ate"+" an apple with an aardvark.";String aanstr = str.replaceAll("a", "an");System.out.println(aanstr);And that’s why we have the word boundary delimiter: we care about the surrounding environment of the "a" matches that should be replaced; only the ones surrounded by word boundaries should be replaced.Your Turn!replace every occurrence of the word "he" with "she", and also "him" with "her" from some given input. Be sure to not accidentally match portions of other words such as there becoming tshere.replace every digit with an X. (Pretend you're making a safe version of a bank receipt where you don't want to print the account number).Capture GroupsAs we write our regular expressions, we can use parentheses to group things. If we have a match, wouldn't it be nice to extract the portions of our String that matched each specific parenthesized group? We can do this, but we first need to understand how to use the Pattern class (for "compiling" a String that we want to use as a regular expression into some actual internal representation of a regular expression), and how to use the Matcher class (for using a Pattern on specific Strings and seeing if matches are possible).More Efficient Pattern MatchingAlthough we are capable of writing virtually all the pattern matching we’ll ever want with this, it can get a bit tedious, creating Strings, calling matches on the regular expression and the string again and again; also, though it is definitely not a focus in this class, it takes significant time to figure out what a regular expression means before it can be applied, and so if we were to apply the same pattern numerous times (say, in a loop), it would be wasteful to re-figure out what that regular expression is every single usage. So we can create something to handle a particular regular expression and keep it around, just asking it if a String matches its pattern.Enter the classes Pattern and Matcher. We can make an object of the class Pattern to store the understanding of a regular expression:// need to import java.util.regex.*;Pattern p = pile( regex ) ;(Notice that we didn't call the constructor, we called a static method of the Pattern class). Then, we can make an object of the class Matcher, which has an associated String with it:Matcher m = p.matcher ( str ) ;Then, when we want to know if str matches regex, we call the matches method on m (notice there’s no parameter now):boolean b = m.matches();In the grand scheme of things (such as in a loop), this would be prudent as follows: instead of using the matches method with a String,while (…) {String str =…;boolean b = str.matches ( "bar");…}We can instead pull the compilation of the regular expression out of the loop:Pattern p = pile (regex);while (…) {String str = …;Matcher m = p.matcher(str);boolean b = m.matches();}Even though it looks like more code, it turns out that when we call matches on a String, it’s just going to create a Pattern, create a Matcher, and call the matches method of the Matcher object and then throw the two objects away and just yield the result; even though we see more code, there’s still less work going on.Finally, we can discuss our capture groups! If we have a Matcher object and we call matches, and we successfully find a match, then we can next call the group method and retrieve any specific capture group by number, starting with 1 for the leftmost open parenthesis, and numbering upwards for each found open parenthesis:public String group (int groupNum) Here is an example:// group numbers: 1 2 3Pattern p = pile("(c(a|u)r)tai(n|l)");Matcher m = p.matcher("curtail"); if (m.matches()){String first = m.group(1); String second = m.group(2); String third = m.group(3); System.out.println("saw "+second+" inside of " + first +", all in front of "+third+"."); } else { System.out.println("no match. If called, group(#) would throw " +"an IllegalStateException."); }If you call group(0), you get the entire match. You can also call group(), an overloaded version that is equivalent to group(0).Your Turn!Match a phone number in some format, but do so using the Pattern and Matcher class. Now, remove the first three numbers (the area code), and print back just the rest.If the nth capture group was part of some repetition, only the last of the matches is remembered (and returned from the corresponding call to group(n)). Write a regular expression that matches any number of "yes" or "no" strings, followed by "final answer." Print back what the final answer was (the last yes or no).Adventures in One-Liner LandThough we like to keep lines of code "short" for ease of reading, sometimes fewer lines is also considered easy to read. Whenever we put in a String literal, like "puppy", what actually happens is that Java makes a new String object with that contents and then uses it right there. So we could match a literal String to a regular expression in a really-hardwired way, declaring no variables to hold the String or Pattern:boolean b = ("puppy").matches("p[aeiou][p]{2}y");Similarly, and perhaps more likely, if we were getting str from some computation or method call, we could put that in the first set of parentheses:boolean b = (sc.nextLine()).matches("p[aeiou][p]{2,5}y");Beyond the LabNow you’re a pattern-matching pro. Sort of. But there is so much more than we've done here! This was just the simplest of introductions. There are more complex patterns, strategies for how greedy or hesitant a pattern is to consume as much or as little as possible when finding a match, and many, many more pre-defined character classes. See here () for more character classes, and see here () for a much more in-depth tutorial on regular expressions. ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download