
[RA] Ch23 Hash Tables

Reading Assignment: All of Programming Chapter 23 Hash Tables

Hash Table Basics

To find where a key belongs in the array:

  1. Apply a hashing function to the key, which converts it to an unsigned integer.
  2. Compute $H \bmod N$, where $H$ is the result of applying the hashing function to the key, and $N$ is the size of the array.
  3. Use the resulting value to index into the array (a minimal sketch of these steps appears below).
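
As a minimal sketch of steps 2 and 3 (it assumes some hashing function hash already exists for step 1; the name bucketIndex is just for illustration):

// Minimal sketch of steps 2 and 3: given the hash value H of a key and a table of N buckets,
// reduce H to a valid index into the array.
unsigned int bucketIndex(unsigned int H, unsigned int N) {
  return H % N;   // step 2: H mod N is always in the range [0, N-1]
}
// Step 3: table[bucketIndex(hash(key), N)] would then be the slot for `key`.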

A better approach is to pass a template parameter for how to hash the keys. This template parameter can name a class whose function call operator is overloaded to take a const Key reference, and return an unsigned int, and does so by performing the desired hash operation. The hash table can then make an instance of this object, and use it as if it were a function to hash keys as needed:

template<typename Key, typename Value, typename Hasher>
class HashMap {
private:
  Hasher hash;
public:
  void add(const Key &k, const Value &v) {
    unsigned int h = hash(k);
    // ...everything else elided
  }
};

Collision Resolution

What happens if two keys hash to the same index? Our hash table needs to be able to deal with this situation, which is called a collision, but in such a way as to maintain its $O(1)$ behavior.

Chaining

One way that we can deal with collisions is to use chaining: we maintain an array of linked lists, where each node in a linked list holds a key/value pair. For example, the start of an implementation of a Map might look like this:

template<typename Key, typename Value, typename Hasher>
class HashMap {
private:
  Hasher hash;
  vector<list<pair<Key, Value> > > * table;
public:
  void add(const Key & k, const Value & v) {
    unsigned int h = hash(k);
    h = h % table->size();
    (*table)[h].push_front(pair<Key, Value>(k, v));
  }
};

Note that we might want to first remove any existing mapping for k before adding a new one. Most other operations we might want to perform (finding an item, removing an item, etc.) basically work out to hashing the key, then performing the corresponding operation on the appropriate linked list, as in the sketch below.
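
For example, a lookup under chaining might look roughly like this as another member of the HashMap above (a sketch only; the lookup name, the iterator-based search, and the choice to return a pointer are illustrative, not taken from the text):

// Sketch: find the value mapped to k, or NULL if k is not present.
Value * lookup(const Key & k) {
  unsigned int h = hash(k) % table->size();
  list<pair<Key, Value> > & bucket = (*table)[h];
  for (typename list<pair<Key, Value> >::iterator it = bucket.begin();
       it != bucket.end();
       ++it) {
    if (it->first == k) {
      return &it->second;   // found a matching key in this bucket's chain
    }
  }
  return NULL;              // key not present
}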

Open Addressing

With open addressing, when a collision occurs, we instead seek a nearby array index which is not in use. The simplest form of open addressing is linear probing, in which indices are tried sequentially until an open one is found.
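
A rough sketch of insertion with linear probing (the Slot type, its occupied flag, and the assumption that the table is never completely full are all illustrative choices, not from the text):

#include <vector>
#include <string>

// Illustrative slot type: a key/value pair plus a flag saying whether the slot is in use.
struct Slot {
  bool occupied;
  std::string key;
  int value;
  Slot() : occupied(false), value(0) {}
};

// Linear probing: starting at h % size, try each successive index until an empty slot is found.
void insertLinear(std::vector<Slot> & table, unsigned int h, const std::string & k, int v) {
  unsigned int index = h % table.size();
  while (table[index].occupied) {
    index = (index + 1) % table.size();  // step by 1, wrapping around the end of the array
  }
  table[index].occupied = true;
  table[index].key = k;
  table[index].value = v;
}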

A different way to do open addressing is to use quadratic probing. The general principle is the same, but instead of moving by a constant step size each time, we increase the step each time. The motivation behind quadratic probing is that, if the data is clustered in one area of the table, the algorithm will more quickly find its way out of that cluster.
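
A sketch of how the probe sequence differs under quadratic probing (the i*i offset shown here is one common choice; the text does not prescribe a particular formula):

// Quadratic probing: the i-th probe looks at (h + i*i) % size,
// so the gap between successive probes grows (1, 3, 5, ...) instead of staying constant.
unsigned int quadraticProbe(unsigned int h, unsigned int i, unsigned int size) {
  return (h + i * i) % size;
}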

Hashing Functions

We will say that a hash function is valid if

  1. It is purely a function of its input, meaning that if we call hash(x), we are guaranteed to get the same value on subsequent calls to hash(x) unless we modify x.
  2. For any two objects a and b that we consider equivalent, hash(a) == hash(b) evaluates to true.

The second criterion we will consider is how “good” a valid hash function is.
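
As a hypothetical illustration of the second validity requirement (not an example from the text): if we treat strings as equivalent regardless of case, then the hash must also ignore case, or two “equal” keys could land in different buckets.

#include <string>
#include <cctype>

// If "Cat" and "cat" are considered equivalent keys, a valid hash must give them the same value,
// so we lowercase each character before folding it into the answer.
unsigned caseInsensitiveHash(const std::string &str) {
  unsigned ans = 0;
  for (std::string::const_iterator it = str.begin(); it != str.end(); ++it) {
    ans = ans * 29 + std::tolower(static_cast<unsigned char>(*it));
  }
  return ans;
}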

A Bad Hash Function for Strings

We might (incorrectly) think the following function is decently good:

unsigned hash(const std::string &str) {
  unsigned ans = 0;
  for (std::string::const_iterator it = str.begin();
       it != str.end();
       ++it) {
    ans += *it;   // add each character's value to the running total
  }
  return ans;
}

However, this hash function will return the same value for strings which are permutations of each other (like "bat" and "tab").
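
For example, using the ASCII values 'b' = 98, 'a' = 97, and 't' = 116:

hash("bat") = 98 + 97 + 116 = 311 = 116 + 97 + 98 = hash("tab")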

A Better Hash Function for Strings

We can do much better if we make our hash function slightly more sophisticated:

unsigned hash(const std::string &str) {
  unsigned ans = 0;
  for (std::string::const_iterator it = str.begin();
       it != str.end();
       ++it) {
    ans = ans * 29 + *it;   // multiply the running answer by a prime before adding each character
  }
  return ans;
}

Notice that the only difference between our improved hash function and our original hash function is that we multiply the existing answer by 29 before we add the current character. The particular value of 29 works well because it is prime and larger than 26: the strings we are hashing are primarily composed of lowercase letters, so each character typically takes one of 26 possible values.
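
Working through the same two strings shows that the improved function now distinguishes them:

hash("bat") = (98 * 29 + 97) * 29 + 116 = 85347
hash("tab") = (116 * 29 + 97) * 29 + 98 = 100467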

There is also another, subtler reason why the second hash function works better: it has a larger range. The first hash function produces values in the range 0-2621, while the second hash function produces values in the range 0-4294962258, which is close to the maximum value of an unsigned int.

If you need to design hash functions for your own objects, there are a few basic principles that make for a good starting point. First, your hash function should incorporate all parts of the object; do not try to make it faster by examining only a small piece of the object. While such an approach may make the hash function itself a bit faster, it is likely to degrade the overall performance of your data structure: a few extra additions are faster than searching through hundreds of colliding pieces of data. Second, try to combine the data in ways such that permutations and alterations create different results. We saw an example of this principle in our two functions: multiplying by a prime and adding produced much better results than simply adding. Third, whatever you do, be sure to test it and see if it is actually good; you would be surprised how many seemingly good ideas turn out to be bad in practice.
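
As a hedged sketch of the first two principles (the Person type and its fields are made up for illustration), a hash for a multi-field object might fold every field in with prime multiplications:

#include <string>

// Hypothetical object with several fields, all of which should contribute to the hash.
struct Person {
  std::string name;
  unsigned int birthYear;
  unsigned int zipCode;
};

unsigned hashPerson(const Person &p) {
  unsigned ans = 0;
  // Fold in every character of the name, as in the string hash above.
  for (std::string::const_iterator it = p.name.begin(); it != p.name.end(); ++it) {
    ans = ans * 29 + *it;
  }
  // Keep combining with prime multiplications so that rearranging the fields changes the result.
  ans = ans * 31 + p.birthYear;
  ans = ans * 31 + p.zipCode;
  return ans;
}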

Cryptographic Hash Functions

There is an entirely different class of hash functions which we might use for security-sensitive purposes. The important difference in these hash functions is that they are based on cryptographic principles, and finding values which collide is quite difficult. We will note that such functions generate hash values much larger than 32 bits, generally at least 128 bits.

Hash Function Objects

We can package our improved string hash as a function object, suitable for use as the Hasher template parameter described earlier:

class StringHasher {
private:
  const int multiplyBy;
public:
  StringHasher() : multiplyBy(29) {}
  StringHasher(int m) : multiplyBy(m) {}

  unsigned operator()(const std::string &str) const {
    unsigned ans = 0;
    for (std::string::const_iterator it = str.begin();
         it != str.end();
         ++it) {
      ans = ans * multiplyBy + *it;
    }
    return ans;
  }
};

If you only need to hash strings, using the built-in std::hash is a great option: http://www.cplusplus.com/reference/functional/hash/
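
Both are used the same way, as objects that are called like functions. A brief usage sketch (it assumes the StringHasher class defined above; the HashMap instantiation in the comment assumes the earlier template):

#include <functional>
#include <iostream>
#include <string>

int main(void) {
  StringHasher myHasher;                     // our hand-written function object
  std::hash<std::string> builtinHasher;      // the standard library's hasher

  std::cout << myHasher("hello") << "\n";
  std::cout << builtinHasher("hello") << "\n";
  // A HashMap from earlier could be instantiated as: HashMap<std::string, int, StringHasher>
  return 0;
}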

Rehashing

Hash tables only work efficiently if there are more buckets than elements stored in the table; otherwise, the probability of collisions increases, and performance degrades. If one knows exactly how many elements will be placed in the table a priori, then one can size the table appropriately. However, it is much more common to need to resize the table as the number of elements in it grows, a process called rehashing.

The first aspect of rehashing a hash table is when to do so. The metric of interest for this decision is the load factor: the ratio of the number of elements actually stored in the table to the number of buckets.
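
As a sketch of that decision (the 0.75 threshold here is a common but arbitrary choice, not one prescribed by the text):

// Sketch: decide whether to rehash based on the load factor (elements / buckets).
bool needsRehash(unsigned int numElements, unsigned int numBuckets) {
  double loadFactor = (double) numElements / (double) numBuckets;
  return loadFactor > 0.75;   // illustrative threshold
}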

The second aspect of rehashing a hash table is how to do so. The first step is to allocate a larger array; even if we are using a vector, we cannot just reuse/grow the existing one. We will want the new array to be roughly twice as large as the existing one, so that we can amortize the cost of the rehash operation over many additional insertions.
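
A sketch of the rehash itself for the chaining table from earlier (it assumes the same table and hash members; exactly doubling the size is an illustrative choice):

// Sketch: allocate a new array of buckets roughly twice as large,
// then re-insert every existing key/value pair at its new index.
void rehash(void) {
  vector<list<pair<Key, Value> > > * newTable =
    new vector<list<pair<Key, Value> > >(table->size() * 2);
  for (size_t i = 0; i < table->size(); i++) {
    list<pair<Key, Value> > & bucket = (*table)[i];
    for (typename list<pair<Key, Value> >::iterator it = bucket.begin();
         it != bucket.end();
         ++it) {
      unsigned int h = hash(it->first) % newTable->size();   // indices change because the size changed
      (*newTable)[h].push_front(*it);
    }
  }
  delete table;      // free the old array
  table = newTable;  // start using the larger one
}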

Hash Table Sizing

The choice of exactly doubling the size versus using a prime size presents a bit of a trade-off. If we keep our hash table’s size as a power of 2, we can perform the “mod” operation much more efficiently. However, if we use a prime size, then we reduce the chances of introducing extra collisions during the modulus operation if our hash function creates patterns in the numbers.
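
The efficiency claim comes from the fact that, when the size is a power of two, the modulus can be computed with a bitwise AND (a small illustration; both expressions compute the same bucket index):

// When size is a power of 2, h % size equals h & (size - 1),
// and the bitwise AND is typically cheaper than a division.
unsigned int indexPowerOfTwo(unsigned int h, unsigned int size) {
  return h & (size - 1);   // same result as h % size when size is a power of 2
}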