Understanding hashmap fundamentals

Maps are a core data type in most modern programming languages, but not every developer understands that they represent a tradeoff between worst-case performance, storage overhead, and usability.

What to expect

This article discusses maps, more precisely hashmaps, and the fundamental implementation approaches behind them. We will focus on concepts and generalized logic instead of implementation details, using Go code to aid comprehension.

Modern cutting-edge implementations like Swiss tables differ considerably in practice, but build on the core principles and tradeoffs explained in this guide.

What is a (hash)map?

You can think of a hashmap as an "array with strings as indexes". This is achieved by combining an internal array for data storage with a hash algorithm that turns the string keys into integers usable as array indexes. The array elements are typically referred to as "buckets" or "slots" in the context of hashmap implementations. We will use "slots" for the rest of this guide.

Key hashing

The hashing algorithm is a key component of hashmaps, with many different options available. Maps benefit most from hashes that have relatively low collision rates and good random distribution of keys.

Arguably the simplest hash with practical use in hashmaps is DJB2:

func djb2_hash(key string) uint {
  var hash uint = 5381
  for _, char := range key {
    hash = hash*33 + uint(char)
  }
  return hash
}

The algorithm simply starts with 5381 as the initial hash value, then loops over all the characters in the key. In each loop iteration, the current hash value is multiplied by 33 and the ordinal (numeric) value of the character is added to it.

For simplistic maps this works fine, but more advanced implementations rely on algorithms like FNV-1a, SipHash, or MurmurHash for better hash distribution and performance.

Slot derivation

A hash alone is not enough for the map, as it cannot be used directly as a slot (array index), since it may be much larger than the maximum capacity of the array. To solve this, we need a second function called a "slot derivation function" that turns the numeric hash into a number smaller than the size of the array.

The simplest derivation function is to use modulo:

func derive_slot(hash uint, map_capacity uint) uint {
  return hash % map_capacity
}

Since the remainder of dividing any number by the map's capacity (internal array length) is always smaller than the capacity itself, the result is safe to use as an array index.

Modern hashmaps often replace modulo with much cheaper slot derivation based on bit masking, which requires restricting map capacity growth to powers of 2.
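As a sketch (the function name derive_slot_masked is invented for illustration): when the capacity is guaranteed to be a power of 2, "hash % capacity" is equivalent to keeping only the low bits of the hash, which avoids a comparatively expensive integer division:

```go
// Sketch: when map_capacity is a power of 2 (8, 16, 32, ...),
// subtracting 1 produces a mask of all low bits (e.g. 16-1 = 0b1111),
// and masking the hash gives the same result as hash % map_capacity.
func derive_slot_masked(hash uint, map_capacity uint) uint {
  return hash & (map_capacity - 1)
}
```

This is why many implementations only ever double their capacity: the power-of-2 restriction is the price of the cheaper derivation.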

Slot collisions and map capacity

No matter what hash algorithm and slot derivation technique is used, they all will inevitably produce collisions (two different keys producing the same slot number). There are two popular approaches to dealing with collisions, but a collision will always degrade map performance; the approaches just make different tradeoffs.

Because of this, most hashmaps grow dynamically when hitting a "load factor" (think percentage of slots/array indexes used) of around 70%. Note that what grows is the capacity (the physical size of the underlying array), not the number of elements in the map.

While growing map capacity does create more slots and reduce potential collisions, it also requires recomputing the slot numbers of all values: the slot derivation function takes map capacity into account, so every slot position changes when the capacity does. If possible, store the hash result for each key alongside its value, so only the slot derivation needs to be recomputed instead of every hash (at the cost of additional storage overhead).
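A hypothetical sketch of this growth step, assuming each element caches its hash (the names stored_element and regrow are invented; collision handling during re-insertion is omitted for brevity):

```go
type stored_element struct {
  hash  uint // cached hash, so growth only re-derives the slot
  key   string
  value string
}

// regrow moves every element into a larger array, re-deriving each
// slot from the cached hash instead of re-hashing the key. A real
// map would also apply its collision strategy here.
func regrow(old []*stored_element, new_capacity uint) []*stored_element {
  grown := make([]*stored_element, new_capacity)
  for _, element := range old {
    if element == nil {
      continue // unused slot, nothing to move
    }
    grown[element.hash%new_capacity] = element
  }
  return grown
}
```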

Handling collisions with doubly-linked lists

The oldest solution to handling slot collisions in maps is to turn array elements into doubly linked lists, a technique also called "chaining". Each map element consists of pointers to the previous and next elements in the linked list, as well as a key and a value:

type Map_Element struct {
  previous *Map_Element
  key      string
  value    *string
  next     *Map_Element
}

Note that value can be a pointer to anything; we chose *string as a placeholder.

When looking up values, you would first compute the slot to look at, then loop through all elements of the linked list. For each element, check if the key matches the one you are looking for; if it does, return and stop. If it doesn't, check the next element in the list. If you arrive at the end of the linked list (next points to nothing), the element is not in the map (yet).
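This lookup walk can be sketched as follows (lookup_chained is an invented name; the struct is repeated from above so the snippet stands alone):

```go
// Repeated from above so this snippet compiles on its own.
type Map_Element struct {
  previous *Map_Element
  key      string
  value    *string
  next     *Map_Element
}

// lookup_chained walks the chain stored in the given slot until it
// finds the key or runs off the end of the list.
func lookup_chained(slots []*Map_Element, slot uint, key string) *string {
  for element := slots[slot]; element != nil; element = element.next {
    if element.key == key {
      return element.value // found: stop and return the value
    }
  }
  return nil // end of chain: the key is not in the map
}
```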


Using linked lists makes it very easy to add or remove elements by simply updating the next and previous pointers, and a chained map can survive exceeding its initial array capacity (at the cost of increasingly degraded performance).

On the other hand, chaining adds significant storage overhead (2 additional pointers per element), and chasing pointers in linked lists has poor cache locality (pointers can be anywhere in memory), so frequent RAM fetches reduce read performance.


Doubly linked lists were used by implementations like Java's first HashMap and are still common in hand-rolled implementations for C projects, where simplicity is preferred over performance (since maps are almost never the performance bottleneck).

Handling collisions with open addressing

A more performance-oriented approach is called open addressing (or probing): on a collision, it simply tries the next available slots. When looking up a key, the code first tries the "correct" slot; if that slot doesn't contain the key and isn't empty, it tries the following slots in order until it either finds the key or an unused slot. Relying on unused slots like this will obviously degrade performance drastically at higher load factors, making capacity growth more important for this type of map.
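A minimal sketch of such a lookup, assuming simple linear probing and invented names (probe_slot, lookup_probing):

```go
type probe_slot struct {
  used  bool // false means this slot was never written
  key   string
  value string
}

// lookup_probing starts at the derived slot and walks forward
// (wrapping around the array) until it finds the key or hits a
// never-used slot, which proves the key is absent.
func lookup_probing(slots []probe_slot, start uint, key string) (string, bool) {
  capacity := uint(len(slots))
  for i := uint(0); i < capacity; i++ {
    slot := (start + i) % capacity
    if !slots[slot].used {
      return "", false // empty slot reached: key is absent
    }
    if slots[slot].key == key {
      return slots[slot].value, true
    }
  }
  return "", false // table is full and the key is not present
}
```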

Since open addressing relies on the state of neighboring slots during collisions, it cannot simply "empty" a slot when deleting its map element. Instead, it has to place a "tombstone" value in that slot, indicating that the slot is no longer used but previously was, so lookup logic stays intact.

Tombstone values also have an important side effect: open addressing maps cannot reduce their load. If 5 elements are written and then deleted, the map still occupies 5 slots in the underlying array, making it unsuitable for delete-heavy workloads like short-lived caches. Only a full rehash, often combined with capacity growth, can purge tombstone values from the array and free memory. This also means open addressing hashmaps often need to grow sooner than their linked-list alternatives.
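A sketch of tombstone-based deletion, using a three-state slot and invented names (ts_slot, delete_key):

```go
// A slot is in one of three states: never used, occupied, or a
// tombstone (deleted, but probing must continue past it).
const (
  slot_empty = iota
  slot_occupied
  slot_tombstone
)

type ts_slot struct {
  state int
  key   string
  value string
}

// delete_key probes exactly like a lookup, then marks the matching
// slot as a tombstone instead of emptying it, so later lookups still
// probe past this position.
func delete_key(slots []ts_slot, start uint, key string) bool {
  capacity := uint(len(slots))
  for i := uint(0); i < capacity; i++ {
    slot := (start + i) % capacity
    if slots[slot].state == slot_empty {
      return false // empty slot reached: key was never in the map
    }
    if slots[slot].state == slot_occupied && slots[slot].key == key {
      slots[slot] = ts_slot{state: slot_tombstone}
      return true
    }
  }
  return false
}
```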


Open addressing is simply a different tradeoff from chaining: it focuses on lookup performance through better cache locality and the removal of pointer chasing, at the cost of worse performance at high load or under frequent deletes.

There are other variations built on this foundation that make different tradeoffs, most notably "Robin Hood hashing". When inserting an element that collides, this variant checks how far each element is from its "ideal" slot and gives the slot to the one with the greater distance, in an attempt to reduce the maximum probe length during lookups. This trades some write performance for improved lookup speed.
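A rough sketch of Robin Hood insertion (the names rh_slot and insert_robin_hood are invented; updating an already-present key is omitted for brevity):

```go
type rh_slot struct {
  used  bool
  hash  uint // cached hash, used to recompute the ideal slot
  key   string
  value string
}

// insert_robin_hood probes forward from the ideal slot. Whenever the
// resident element is closer to its own ideal slot ("richer") than
// the element being inserted, the two swap, and the displaced
// element continues probing instead.
func insert_robin_hood(slots []rh_slot, hash uint, key string, value string) bool {
  capacity := uint(len(slots))
  inserting := rh_slot{used: true, hash: hash, key: key, value: value}
  slot := hash % capacity
  distance := uint(0) // how far "inserting" is from its ideal slot
  for i := uint(0); i < capacity; i++ {
    if !slots[slot].used {
      slots[slot] = inserting
      return true
    }
    // How far is the resident element from its own ideal slot?
    resident_distance := (slot + capacity - slots[slot].hash%capacity) % capacity
    if resident_distance < distance {
      // The "rich" resident gives up its slot and probes onward.
      slots[slot], inserting = inserting, slots[slot]
      distance = resident_distance
    }
    slot = (slot + 1) % capacity
    distance++
  }
  return false // table is full
}
```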

High-performance implementations

Modern hashmap implementations use more sophisticated variations of the aforementioned concepts, most recently favoring designs similar to Swiss tables.

They are still based on the foundation of open addressing, but are optimized for minimal RAM fetches and use SIMD processor instructions to compare multiple slots in parallel, significantly speeding up lookups in practice.

While these implementations are often 1.5x - 3x faster than basic open addressing, almost no real program is bound by hashmap speed, so the real-world performance gain is usually unnoticeable.
