Hashing

Hashing is used extensively in applications
and deserves recognition
as one of the cleverer inventions
of computer science.

— E.S. Roberts, Programming Abstractions in C

This chapter uses a small counting problem to introduce a data structure known as a hash table.  This structure is responsible for accelerating many algorithms that depend on searching, inserting, and deleting items in a table.

A counting problem

Suppose we are given a sequence of positive integers on the standard input stdin.  These integers will be called keys.  The keys are in arbitrary order and there may be many repeated keys.  Consider now the following counting problem:

find the number of occurrences of each key.

For example, the number of occurrences (also known as the frequency) of each key in the sequence  4998 9886 1933 1435 9886 1435 9886 7233 4998 7233 1435 1435 1004  is given by the following table:

key        1004  1435  1933  4998  7233  9886
frequency     1     4     1     2     2     3

Our counting problem has an important requirement:  the counting must be done online, that is, each key must be processed as soon as it is received from the input stream, so that, at each step, we have the counting result relative to the part of the sequence seen so far.

The performance of any algorithm for this problem will be measured by the time taken to count one key, i.e., to process the current element of the sequence. Ideally, we would like this time to be constant, that is, independent of the number of keys already seen (as well as of the number of distinct keys already seen). But we shall be forced to be content with something less than this ideal.

Four solutions.  We shall discuss four different algorithms for the problem. The first two are very simple but slow. The next two are much faster because they use the hashing technique. Though simple, the first two algorithms provide an important introduction to the next two.

The algorithms are discussed in terms of two parameters: N and R.  Parameter  N  is the length of the input stream, that is, the number of keys in the input sequence.  The value of N can be very large (millions or even billions), but the number of distinct keys is typically much smaller.

Parameter  R  is a number greater than any key.  In the example above, we can take R to be 10000.  The set  0..R-1  is the universe of keys.  Typically, most keys in the universe do not occur in the input stream.

Exercises 1

  1. Criticize the following tentative algorithm for the counting problem:  store the sequence of keys in an array, then sort the array, and finally count the number of occurrences of each key as you scan the sorted array.

Algorithm 1: direct addressing

We begin with an algorithm known as direct addressing. Though very simple, this algorithm contains the seed of the hashing technique.

Let's assume that the keys are of type int and that R keys fit comfortably in the (random access) memory of our machine.  We can then use a table  tb[0..R-1]  to record the number of occurrences of each key.

int *tb;
tb = malloc (R * sizeof (int));

The direct addressing algorithm initializes the array tb with zeros and iterates the following routine: read a key k from the input stream and execute the following function:

void count (int k) {
   tb[k]++;
}

After each execution of this function, for each k in 0..R-1, the value of tb[k] will be the number of occurrences of k in the part of the sequence seen up to this moment.
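For concreteness, the whole algorithm can be wired together as in the sketch below. The driver count_all and the fixed value R = 10000 are our assumptions for illustration; in a real program the keys would be read from stdin one at a time.

```c
#include <stdlib.h>

enum { R = 10000 };             /* assumption: every key lies in 0..R-1 */

static int *tb;                 /* tb[k] = occurrences of key k seen so far */

void count (int k) {
   tb[k]++;
}

/* Driver (ours, for illustration): counts every key of a[0..n-1] online,
   one call to count per key.  calloc already fills tb with zeros. */
void count_all (const int a[], int n) {
   tb = calloc (R, sizeof (int));
   if (tb == NULL) exit (EXIT_FAILURE);
   for (int i = 0; i < n; i++)
      count (a[i]);
}
```

After count_all runs on the example sequence of the previous section, tb[1435] is 4 and tb[9886] is 3, in agreement with the frequency table above.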

Performance.  The amount of time consumed by each execution of count is constant, that is, it does not depend on R and does not depend on the number of keys already processed. (In particular, it does not depend on N.)  Hence, the direct addressing algorithm is very fast.

But the algorithm is only practical if R is small and known explicitly in advance. Moreover, the algorithm can waste a lot of space in memory. For example, if R is 10000 and the sequence contains only 1000 distinct keys, then 90% of the table tb will be idle.

Exercises 2

  1. Performance test.  Write and test a program that uses direct addressing to solve the counting problem. Your program should print a report containing the following information: the length N of the input, the number of distinct keys, the most frequent key, and its frequency. Use as input the files randInt1K.txt, randInt10K.txt, randInt100K.txt, and randInt1M.txt. These files contain 1 thousand, 10 thousand, 100 thousand, and 1 million random keys, all between 0 and 9999.  Time your program using the clock function of the time.h library.  Does the result agree with the theory?

Algorithm 2: linked list

Our second algorithm stores the frequencies of keys in a linked list. The cells of the list have the following structure:

typedef struct record cell;
struct record {
   int   key, occur; 
   cell *next;
};

If p is the address of a cell then  p->occur  will be the number of occurrences of the key  p->key.  If p and q are addresses of two different cells then p->key will be different from q->key.  The linked list will be pointed to by the global variable tb:

cell *tb;

The counting algorithm initializes tb with NULL and iterates the following routine: read a key k from the input stream and call the following function to count the key:

void count (int k) {
   cell *p;
   p = tb;
   while (p != NULL && p->key != k)
      p = p->next;
   if (p != NULL) 
      p->occur += 1;
   else {
      p = malloc (sizeof (cell));
      p->key = k;
      p->occur = 1;
      p->next = tb;       
      tb = p;       
   }
}

Performance.  In the worst case, each execution of count consumes time proportional to the number of distinct keys already read.  Therefore, the execution of count may become slower and slower as the input stream is read.  If all the keys are distinct, the last executions of count may consume time proportional to N.

Even in the average case, typical of practical applications, the count function may consume time proportional to one half of the number of distinct keys already read, which is not very good.

On the happy side, the linked list algorithm does not waste space, since the number of cells in the list is no greater than the number of distinct keys.
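For completeness, here is a self-contained sketch of the linked-list algorithm. The count function is the one given above; the query helper occurrences is ours, added only to make the state of the list easy to inspect.

```c
#include <stdlib.h>

typedef struct record cell;
struct record {
   int   key, occur;
   cell *next;
};

static cell *tb = NULL;         /* head of the list; one cell per distinct key */

void count (int k) {            /* same as in the text */
   cell *p = tb;
   while (p != NULL && p->key != k)
      p = p->next;
   if (p != NULL)
      p->occur += 1;
   else {
      p = malloc (sizeof (cell));
      p->key = k;
      p->occur = 1;
      p->next = tb;
      tb = p;
   }
}

/* Query helper (ours, for illustration): occurrences of k so far. */
int occurrences (int k) {
   for (cell *p = tb; p != NULL; p = p->next)
      if (p->key == k) return p->occur;
   return 0;
}
```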

Exercises 3

  1. List in increasing order.  Rewrite the count function above so that the keys are stored in increasing order. Estimate the performance. Is it worth keeping the list in increasing order?
  2. Performance test.  Repeat the tests suggested in one of the exercises above, this time using a linked list to store the frequencies.
  3. Array of cells.  Redo the discussion in the previous section using an array of cells instead of a linked list. Resize the array when needed. What is the time consumption of this version?  Now keep the array in increasing order of keys and repeat the analysis.

Hash tables and hash functions

The next algorithms for the counting problem will combine the good characteristics of the two previous algorithms. But before going into this, we must introduce the concept of hashing.  We begin by establishing some terminology and describing the ideas in an abstract manner; concrete implementations will be given in subsequent sections.

We assume that the frequencies are stored in a table tb[0 .. M-1]. The parameter M is known as the modulus. The value of M and the exact nature of the elements of tb will remain unspecified for now. But you should imagine that, somehow, each element of tb records the number of occurrences of some key.  We say that tb is a hash table (or dispersion table).  The size M of the table is usually much smaller than the size R of the universe of keys. Hence, a typical element of tb will have to take care of two or more keys.

To find the position in the table corresponding to a key k, we need to convert k into an index between 0 and M-1.  Any function that does the conversion, taking the universe 0..R-1 of the keys into the set 0..M-1 of indices, is a hash function.  In this chapter, we shall indicate by

hash (k, M)

a call to a hash function for a key k.  The number hash (k, M) is the hash code of key k.  A very popular hash function takes each key k to  k % M,  that is, to the remainder of the division of k by M.  If M is 100, for example, then k % M consists of the last two decimal digits of k.
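As a minimal sketch, the remainder-of-division hash function can be written as follows:

```c
/* Remainder-of-division hash: maps the universe 0..R-1 into 0..M-1. */
int hash (int k, int M) {
   return k % M;
}
```

For example, hash (4998, 100) is 98, the last two decimal digits of 4998, while hash (4998, 97) is 51.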

If a hash function takes two keys to the same index, we have a collision.  If M is smaller than R (and, even more so, if M is smaller than the number of distinct keys), then collisions are inevitable.  If M is 100, for example, the function remainder-on-division-by-M will make all the keys that have the same last two decimal digits collide.

A good hash function must spread, i.e., disperse, the keys well over the set 0..M-1 of indices.  A function that takes every key to an even index, for example, is not good. A function that depends on only a few digits of the key is also not good.  Unfortunately, no hash function is good for all the sets of keys extracted from the universe 0..R-1.  To begin dealing with this difficulty,

M  should be a prime number,

since this tends to reduce the number of collisions.  See, for example, the set of keys in the first column of the following table and consider the division of each key by the nonprime 100 and then by the prime 97. (This example was copied from the book by Sedgewick and Wayne.)  Observe that there are more collisions in the first case:

    k   k%100    k%97
  212      12      18
  618      18      36
  302       2      11
  940      40      67
  702       2      23
  704       4      25
  612      12      30
  606       6      24
  772      72      93
  510      10      25
  423      23      35
  650      50      68
  317      17      26
  907       7      34
  507       7      22
  304       4      13
  714      14      35
  857      57      81
  801       1      25
  900       0      27
  413      13      25
  701       1      22
  418      18      30
  601       1      19

In general, finding a good hash function is more of an art than a science.

Now that we have taken care of spreading the keys over the interval 0..M-1, we must invent a way to resolve the collisions, that is, a way to store all the keys that the hash function sends to the same position of the hash table. The following sections describe two approaches to this question.

Exercises 4

  1. Why prime modulus?  Suppose that k and M are divisible by an integer d.  Show that k % M is also divisible by d.  (This exercise provides a small indication of the advantages of a prime M.)
  2. Let c be the number ⌈R/M⌉, i.e., the ceiling of R/M.  Consider the hash function that associates the floor of k/c (that is, the result of the integer division of k by c) to each key k.  For example, if R is 10^5 and M is 10^2 then c is 10^3 and therefore k/c is given by the first two decimal digits of k.  Discuss the quality of this hash function.
  3. The birthday paradox.  Apply the hash function remainder-on-division-by-M to a sequence of random keys. After how many keys does the first collision occur? Make experiments, with various values of M, to find the moment of the first collision.  (According to classical probability theory, the first collision occurs after approximately √(πM/2) keys, where π is the number 3.14159… )

Algorithm 3: hashing with chaining

The hashing technique has two ingredients: a hash function and a mechanism for resolving collisions. The previous section discussed the first ingredient; this section takes care of the second.

The collisions in a hash table can be resolved by means of linked lists: all the keys that have a given hash code are stored in a linked list (in arbitrary order).  The cells of the lists are equal to those of algorithm 2:

typedef struct record cell;
struct record {
   int   key, occur; 
   cell *next;
};

The hash table tb[0..M-1] is implemented as an array and the elements of the array are pointers to the linked lists:

cell **tb;
tb = malloc (M * sizeof (cell *));

For each index h, the linked list tb[h] will contain all the keys that have hash code h.

The resulting counting algorithm is known as hashing with chaining and can be seen as a combination of the algorithms 1 and 2 discussed above.  The algorithm initializes all the elements of the array tb with NULL and iterates the following routine: read a key k from the input stream and then run the following function:

void count (int k) {
   int h = hash (k, M);
   cell *p = tb[h]; 
   while (p != NULL && p->key != k)
      p = p->next;
   if (p != NULL) 
      p->occur += 1;
   else {
      p = malloc (sizeof (cell));
      p->key = k;
      p->occur = 1;
      p->next = tb[h];       
      tb[h] = p;       
   }
}

Performance.  In the worst case, the hash function takes all the keys to the same position in the hash table and therefore all the keys end up in the same linked list. In this case, the performance is no better than that of algorithm 2:  each execution of count consumes time proportional to the number of keys already read from the input stream.

In the average case, typical of practical applications, the performance of count is much better. If the function hash spreads the keys well over the set 0..M-1, all the linked lists will have nearly the same length and we can expect the time consumption of count to be bounded by something proportional to

n / M ,

where n is the number of keys read so far.  If M is 997, for example, we can expect the function to be nearly 1000 times faster than that of algorithm 2.  We should, of course, choose M to be large enough for the M lists to be short (say a few dozen elements) but not so large that many of the lists will be empty.
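The pieces of this section fit together as in the self-contained sketch below. The modulus M = 13 and the query helper occurrences are our choices for illustration, not fixed by the text.

```c
#include <stdlib.h>

enum { M = 13 };                /* assumption: a small prime modulus for this demo */

typedef struct record cell;
struct record {
   int   key, occur;
   cell *next;
};

static cell *tb[M];             /* M chains; globals start out as NULL */

int hash (int k, int m) {
   return k % m;
}

void count (int k) {            /* hashing with chaining, as in the text */
   int h = hash (k, M);
   cell *p = tb[h];
   while (p != NULL && p->key != k)
      p = p->next;
   if (p != NULL)
      p->occur += 1;
   else {
      p = malloc (sizeof (cell));
      p->key = k;
      p->occur = 1;
      p->next = tb[h];
      tb[h] = p;
   }
}

/* Query helper (ours, for illustration): occurrences of k so far. */
int occurrences (int k) {
   for (cell *p = tb[hash (k, M)]; p != NULL; p = p->next)
      if (p->key == k) return p->occur;
   return 0;
}
```

Running count on the example sequence of the first section reproduces the frequency table given there.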

Exercises 5

  1. Solve the counting problem for the sequence of keys  17 21 19 4 26 30 37  using hashing with chaining. The hash table must have size 13 and the hash function must be the remainder on division by 13.  Draw a figure showing the final state of the hash table.
  2. Suppose that the length N of the input stream is approximately 50000.  Choose a good value for the size M of the hash table.
  3. Performance test.  Repeat the tests suggested in an exercise above, this time using a hash table with collisions resolved by chaining.  Experiment with different values (prime and nonprime) for the size M of the hash table.  Calculate the average and the standard deviation of the lengths of the linked lists.

Algorithm 4: hashing with linear probing

This section describes a second way of resolving collisions in a hash table:  all the keys that collide are stored in the table itself.

The elements of the hash table tb[0..M-1] are cells that have only two fields:

typedef struct record cell;
struct record {
   int key, occur; 
};
cell *tb;
tb = malloc (M * sizeof (cell));

As the counting proceeds, some of the cells of the table tb will be vacant while others will be occupied.  In the vacant cells, the key field will have value -1. In the occupied cells, the key will belong to 0..R-1 and occur will be the corresponding number of occurrences.  If a cell tb[h] is vacant, we can guarantee that no key in the part of the input stream already seen has hash code equal to h. But if tb[h] is occupied, we cannot conclude that the hash code of tb[h].key is equal to h.

Each key k of the input stream will be counted in the following manner.  Let h be the hash code of k. If the cell tb[h] is vacant or has key equal to k, the contents of the cell will be updated.  Else (i.e., if tb[h] is occupied and tb[h].key != k), the algorithm looks for the next cell in tb that is vacant or has key equal to k.

The implementation of these ideas leads to the hashing with linear probing algorithm. The algorithm begins by setting key = -1 and occur = 0 in every cell. Then, it iterates the following routine: read a key k from the input stream and call the following function:

void count (int k) {
   int c, probe, h;
   h = hash (k, M);
   for (probe = 0; probe < M; probe++) {
      c = tb[h].key;
      if (c == -1 || c == k) break; // *
      h = (h + 1) % M;
   }
   if (probe >= M) 
      exit (EXIT_FAILURE);
   if (c == -1) 
      tb[h].key = k;
   tb[h].occur++;
}

The function makes at most M probes, i.e., attempts to find a good cell (see the line marked * in the code).  The search fails only if the table tb is full. In that case, the execution of the function is aborted.  (It would have been better to resize the table tb and continue working.)

Suppose, for example, that the size M of the hash table is 10 (we are using a nonprime number to make the example simpler).  Suppose also that hash (k, M) is defined as k % M.  If the input stream is

333 336 1333 333 7777 446 556 999

then the final state of the hash table tb[0..9] will be as follows:

  key   occur
  999       1
   -1       0
   -1       0
  333       2
 1333       1
   -1       0
  336       1
 7777       1
  446       1
  556       1
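The trace above can be reproduced mechanically. The sketch below packages the count function of this section with M = 10 and hash (k, M) defined as k % M; the init helper is ours, added to mark every cell vacant before counting starts.

```c
#include <stdlib.h>

enum { M = 10 };                /* nonprime, exactly as in the example above */

typedef struct record cell;
struct record {
   int key, occur;
};

static cell tb[M];

/* Initialization helper (ours): mark every cell vacant. */
void init (void) {
   for (int h = 0; h < M; h++) {
      tb[h].key = -1;
      tb[h].occur = 0;
   }
}

void count (int k) {            /* linear probing, with hash (k, M) = k % M */
   int c, probe, h = k % M;
   for (probe = 0; probe < M; probe++) {
      c = tb[h].key;
      if (c == -1 || c == k) break;
      h = (h + 1) % M;
   }
   if (probe >= M)
      exit (EXIT_FAILURE);      /* table full */
   if (c == -1)
      tb[h].key = k;
   tb[h].occur++;
}
```

After init and one call to count per key of the input stream 333 336 1333 333 7777 446 556 999, the array tb matches the table shown above: tb[0] holds 999, tb[3] holds 333 with two occurrences, and tb[1], tb[2], tb[5] remain vacant.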

Performance.  In the worst case, the hash function takes all the keys to the same position of the hash table and therefore the keys occupy consecutive cells of the table.  In this case, the performance is no better than that of algorithm 2:  each execution of count consumes time proportional to the number of keys already read from the input stream.

In the average case, typical of practical applications, the performance of count is much better.  If more than half of the cells of the hash table are vacant (as would happen if we used resizing) and the hash function spreads the keys well over the set 0..M-1, an execution of the count function will not need to do more than a few probes to find a good cell. So, the time consumption of one execution of count will be practically independent of the number of keys already read.

Exercises 6

  1. Solve the counting problem for the sequence of keys  17 21 19 4 26 30 37  using hashing with linear probing.  The hash table must have size 13 and the hash function must be the remainder on division by 13.  Draw a figure of the final state of the hash table.
  2. Proof of correctness.  Consider the count function in the hashing with linear probing given above.  Suppose that we have c == -1 in some iteration. Prove that there is no cell tb[h] such that tb[h].key == k.
  3. Resizing of the hash table.  The execution of the count function given above is aborted if the hash table is full. Write a better version, one that resizes the table by choosing a new value for M that is approximately twice the previous value, then by allocating a new table tb, and finally by reinserting all the keys into the new table.
  4. Performance test.  Repeat the tests suggested in one of the exercises above, this time using a hash table with collisions resolved by linear probing.  Experiment with different values (prime and nonprime) for the size M of the table.  Calculate the greatest number of probes that count makes in order to find the desired position in the hash table.

Exercises 7

  1. Duplicates filter.  Write a program to read a text file containing (the decimal representations of) positive integers, delete the repeated numbers, and save the resulting file. Do not change the relative order of the numbers. Your program must behave like a filter, that is, it must produce the output as the input file is being read (rather than producing the output after the whole input has been read and processed).
  2. Find the first unique (i.e., not repeated) number in a long text file containing integer numbers.

Hashing of strings

In many applications, the keys are strings (especially ASCII strings) rather than numbers. How to build a hash table in this case?  We could, for example, use a hash table of size 256 and a hash function that takes each string to the numerical value of its first byte.  (All the strings that begin with 'a', for example, would be taken to the position 97 of the table.)  But this would do a poor spreading of the set of keys over the interval 0..255.

A general way of dealing with string keys involves two steps:  first, convert the string into an integer number; second, submit this number to a hash function.  For an obvious conversion, treat each string as an integer in base 256.  The string  "abcd",  for example, is converted into the number  97×256^3 + 98×256^2 + 99×256^1 + 100×256^0,  which is equal to 1633837924.  This kind of conversion can easily produce numbers greater than INT_MAX, and therefore lead to arithmetic overflow.  If the calculations are done with int variables, the result can be strictly negative, and that would be disastrous.  To mitigate this, we can use unsigned variables, thereby making sure that the result is between 0 and UINT_MAX even if an overflow occurs:

typedef char *string;

unsigned convert (string s) {
   unsigned h = 0;
   for (int i = 0; s[i] != '\0'; i++) 
      h = h * 256 + s[i];
   return h;
}

To submit the result of the conversion to the hash function, just do  hash (convert (s), M).  If hash is the remainder on division by M, just do  convert(s) % M.
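The arithmetic can be checked directly. The block below reproduces convert exactly as given above, so that the value 1633837924 for "abcd", and its remainder on division by the prime 101, can be verified:

```c
typedef char *string;

unsigned convert (string s) {   /* reproduced from the text: base-256 value of s */
   unsigned h = 0;
   for (int i = 0; s[i] != '\0'; i++)
      h = h * 256 + s[i];
   return h;
}
```

For the string "abcd", convert returns 1633837924, and convert ("abcd") % 101 is 11.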

A good alternative is to do the conversion interleaved with the calculation of the hash function. The result may not be identical to convert(s) % M, but will do the job of spreading the keys over the set 0..M-1:

int string_hash (string s, int M) {
   unsigned h = 0;
   for (int i = 0; s[i] != '\0'; i++) 
      h = (h * 256 + s[i]) % M;
   return h;
}

If the string is "abcd" and M is 101, for example, the index computed by string_hash is 11:

( 0 * 256 +  97) % 101 = 97
(97 * 256 +  98) % 101 = 84
(84 * 256 +  99) % 101 = 90
(90 * 256 + 100) % 101 = 11

The use of base 256 is not mandatory; we can use any other base. For reasons that are not too obvious, it is better to use a prime number, say 31, for the base.
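A sketch of the base-31 variant follows; the name string_hash31 is ours, for illustration. It is the same interleaved computation as string_hash above, with 256 replaced by the prime 31.

```c
typedef char *string;

/* Interleaved hash with the prime base 31 instead of 256.
   The name string_hash31 is an assumption of this sketch. */
int string_hash31 (string s, int M) {
   unsigned h = 0;
   for (int i = 0; s[i] != '\0'; i++)
      h = (h * 31 + s[i]) % M;
   return h;
}
```

For the string "abcd" and M equal to 101, this variant computes the index 100 (rather than the 11 computed with base 256); both bases spread the keys over 0..M-1, which is all that matters.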

Exercises 8

  1. Suppose that the keys are strings and M is 256.  Consider the hash function that associates with each key its first byte.  Discuss the quality of this hash function.
  2. Suppose we replace 256 by 1 in function convert. Show that all the permutations of the string s will collide.
  3. In the function convert, why not replace the two middle lines by the following?
    unsigned h = s[0];
    for (i = 1; s[i] != '\0'; i++) 
    
  4. The frequency of words in a book.  Write a program to count the number of occurrences of each word in a (digital) book. (You may use The Secret Adversary, by Agatha Christie, for tests.)