Answer Questions following the Introduction to Hashing:



In-Class Exercise Answer Questions following the Introduction to Hashing:Hashing (HashMap)“A hash table is a data structure that is used to store keys/value pairs. It uses a hash function to compute an index into an array in which an element will be inserted or searched. By using a good hash function, hashing can work well. Under reasonable assumptions, the average time required to find an element in a hash table is O(1).”Used in:Encryption – for authenticationHash a digital signature, get the value associated with the digital signature,and both are sent separately to receiver. The receiver then uses the same hash function on the signature, gets the value associated with that signature, compares the messages. If the same - authenticationDatabase accessNames->phone numbersAuthor->articles (or books)Topic->articlesusernames->passwordsSocial Security Number->everything about youZip codes->regionsTranslationAnagrams (how?)I think I have used this data structure more than any other data structure we’ve learned – it comes in handy!I’ve used for:Word – surrounding text (use a key word to find other words that tend to associate with the key word)Sounds – basic sound and wave files associated with that soundQuotes-documents to find what document a quote occurs in quicklyEtc.Idea: MOSTLY FOR FINDING QUICKLYHashing is designed largely for finding keys quickly. We know we could find the xth value in an array in O(1).So hashing involves an array of a certain size. It involves taking a key (which is either part or all of the data we’re trying to store in a way that can be found quickly) and converting it quickly to a number that is an index in the array - this is the hash functionWith hashing, the keys are not put in any order. We cannot find the smallest, the largest, the median, etc without converting from a hash map to a different data structure (or doing a lot of looping). If we hash correctly, we can, however, find whether an item is in a database very quickly.Usually a key is associated with a value or set of values. In other words, with our student example, the key was the student’s id, but the values include the student’s first and last name, their address, their year, their major, their gpa, etc. None of the values have to be unique within the database (2 students can even have the same name, let alone the same major), but the key we use for finding this information must be unique.Key: student idValues we want associated with the key: student’s last name, first name, major, year, address, gpa, etc.There are 3 parts to dealing with hashing:Coming up with a good hashing functionMust be quickly computableMust lead to indices that are evenly distributed throughout the arrayMust be consistent (always compute to the same index) A good hash function may depend on the type of data you’re dealing withComing up with a good array size Last step of every hashing function is to mod by the arraysize (why?)So the arraysize is a significant influence on the goodness of the hashing functionDealing with the inevitable collisionsMost hash functions have collisions. What should we do when two keys hash to the same index in an array?Since Good Array Size is critical to the good hashing function, we’ll start with that:Array Size:LAST STEP IN EVERY HASH FUNCTION IS TO MOD WITH ARRAYSIZE!!!!!! So arraysize matters.Prime numbers are bestYou want an array size that has space for about 2 times the number of keys So a prime number that is greater than 2x the number of keys is a good numberHashing Functions:Again, must be quick to calculate, should evenly distribute keys throughout array, and must consistently map a key to the same index.Many, many hashing functions:Often dependent on the type of data you’re dealing with.For keys that are numbers:Could add the digits in a number and mod by array sizeCould just take any number in the key and mod by the array sizeRemember – any pattern or trend in the numbers could lead to uneven distribution of indicesSome improvements on hashing functions:Power hash: take the integer, take each digit in the integer to the power of its place in ascending order: E.g., 324 = 3^1 + 2^2 + 4^3 = 3+4+12 = 19 % arraysizeWorks best with smaller numbers…Middle r: square int, (possibly convert to binary), and use the middle r bits or numbers (then mod by array size)E.g., 442=1936, maybe take the middle 2 numbers, so you’d have 93 % arraysizeFolding: divide the number to equal sized pieces and add the pieces (again, often done with binary representation of numbers) (then mod by arraysize) (e.g., 328-444-2870 becomes (32 +84 + 44+28+70 )%arraysizeMANY hashing functions involve shifting bits…For keys that are strings:Could just add up asci representation of characters (abcd = 97+98+99+100 )Problems: anagrams, MANY more short words than long words (so many words would end up in the beginning of the hash array)Could just use the first 3 charactersMaybe s[0] + s[1]*271+s[2]*272Easy to calculate, but an awful lot of words have the same prefixBetter:Use all N characters of string as an N-digit base-b numberChoose b to be prime number i.e., b = 13, 17, 19len = string.lengthh = 0;for i = len-1; i >0; i-- {h = (19*h + (int)string[i]) % ArrayLength}Even with really good hashing functions and a really good array size, you’ will almost always end up with collisions!Dealing with collisionsChaining:Make your hash array be an array of linked listsInsert every collision into the linked listEvery time 2 keys map to the same index, just insert the second one at the end of the linked list at that indexWorst case: every key ends up in the same indexLinked list is n in length for n keysREALLY bad hash functionLinear probingWhen you have a second key hashing to the same index as a first key, keep “probing” at the next index(es) in the array until you find an empty spotInsert at that indexSo if a second key hashes to index 32 and there’s already a value there, look at index 33, then 34, then 35, etc. When you get to the end of the array, loop back to index 0Watch out for infinite loops! (when will that happen?)Causes “clustering”When a bunch of keys are clustered together in one area in the arrayThis slows the whole finding of the key quickly process downThe more clustering, the more there’s a tendency to clusterSo we probably want to try to avoid thisQuadratic probingAn attempt to avoid clusteringAfter a collision, instead of looking at the next index (h(k) + 1), look at indices quadratically (h(k) + 12, then 22, then 32, etc, so look at h(k)+1, then +4, then +9, etc. Again, loop back to 0 when get to end of arrayHelps most when a lot of keys cluster to the same areaDoesn’t help when a lot keys hash to the same index And still end up having areas of clusteringPseudo-random probingAn alternative to quadratic probing Idea: can’t randomly choose next index after a collision – could never find the key againBut we could generate a list of random numbers and use that list to determine the next indexE.g., {0, 8, 3, 4, 7, 2, 9, 6, 1, 5}index = h(k) + randomarr[0]If collision, index = h(k) + randomarr[1],Then h(k)+ randomarr[2], etc.VERY quick to calculate!Maybe a bit better than quadratic probing when it comes to clustering, but you pretty much have the same problems overallDouble hashingHave 2 different hashing functions. (there are MANY hashing functions, and you can always make your own)If the first hash function on a key gives an index that is already taken, try the second hashing function on the key.The second hashing function can be something simple like h(k) + 1 + ct*(k mod (m)) where ct is the number of times you’ve had a collision with k and m is a prime number less than the size of the array (7 is nice). K is the numeric equivalent of the key (so in a string, it might just be adding up the asci values of the letters, or even a=1, b=2, etc.)We have to add 1 because if k mod m is 0, we could end up in an infinite loopE.g., arraysize = 11, m = 7h2(k) = i+(k mod(m-1))h2(k) = i+(k mod(m)h0(55) = 55%11 = 0h0(66) = 66%11 = 0 XH2((66) =(1+k%(M))) = 1 + (66%7) = 4 P2(66) = 0 + 1*4 = 4%11 = 4h0(11) = 11%11 = 0 XH2(11) = 1+k%(M))) = 1 + (11%7) =5P2(11) = 0 + 1*5 = 5%11 = 5h0(88) = 88%11 = 0XH2(88) = 1+k%(M))) = 1 + (88%7) =5P2(88) = 0 + 1*5 = 5%11 = 5XP3(88) = 0 + 2*5 = 10%11 = 10Double hashing helps with the problem of many keys hashing to the same first value – in theory the 2 hashing functions should be different enough that the chances of both hashing to the same index with the same key should be smallBack to array size:Goal – find things quickly (in 0(1) if possible)REALLY large array sizesfew collisions (get close to 0(1))But a lot of extra spaceSmaller array sizes Many more collisionsBut less extra spaceIdeal – about 2x number of keys (up to next prime number)Load factor:What happens if number of keys is changing?Load factor = number of keys/ array sizeWant this to be less than 70%If the load factor gets over 70%, we want to:Create a new array – double in size of old array, then up to next prime numberRehash all the keys in the old array into the new arrayEqually, if the load factor gets below 25%, Halve the array size, up to the next primeRe-hash all the keys in the old arrayRehashing – 0(n) - not goodDon’t want to do this too oftenHashing Problems:Given the following keys: 71,81,75,89,29,99,79,72And a hash function of key%arraysizeShow the hash array of size 10. How many collisions?Show the hash array of size 11. How many collisions?What would be the ideal array size?Show the hash array with that size. How many collisions?Why must every hash function’s last step be to mod by the arraysize?Why should you not use a random number in your hash function?Given the following set of numbers and an array size of 10 (yep, I know, BAD arraysize!):32,25,17,8,22,42,91Show the hash array using PowerhashingMiddle r (if the number is odd in length, take the middle 3, even take the middle 2)Folding (assume the pieces are only 1 in size)What would be a better array size?Given the following set of strings: hash, map, heap, stack, code, tree, list and an array of size 10 (yep, still a BAD arraysize!)Show the hash array using:adding asci valuesfirst 3 characters method the better methodWith Collisions handled by chaining, what would be the worst case for finding a particular key, and when would that occur?With Collisions handled by linear probing, why do we want to put as a condition, while (ct < arraysize) {What is the best case for finding a particular key in the hash table using double hashing?Why is deleting keys a problem using linear/quadratic/pseudo-random probing?With a perfect hashing function, and a set T consisting of {t1,t2,t3,…tn} that has already been hashed into a hash array, how long would it take to tell whether set S consisting of {s1,s2,s3,…sm} is a subset of set T?Hash the following words into the hash map below using chaining, assuming each character has the corresponding numerical value, and the hash function is: word[0]+word[1]+word[2]+…word[wordlength-1]%arraysize1234567891011121314151617181920212223242526abcdefghijklmnopqrstuvwxyzWords to hash: cat, dog, bird, ant, bug, bat, foxArray:0123456789Hash the following words into the hash map below using linear probing, assuming the same hash function as above (same list of words to hash):Words to hash: cat, dog, bird, ant, bug, bat, foxArray:0123456789Given the following array with the following keys in it, using a hash function of key%arraysize and an array of size 10 (I KNOW – A Horrible array size!) and linear probing, {32, 24,4,12,18,65,17}01234567893212244651718What is the load factor?Show what should happen if you add the key 10 to the set of keys (show the resulting new array)How many collisions?What is hashing used primarily for?If Finished, work on the lab ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download

To fulfill the demand for quickly locating and searching documents.

It is intelligent file search solution for home and business.

Literature Lottery

Related searches