A hash table, also known as a hash map or dictionary, is a fundamental data structure used to store and retrieve data efficiently. At its core, a hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
In Python, dictionaries serve as hash tables. They are one of the most useful built-in data types, providing fast lookup, insert, delete, and search functionalities. However, it can be instructive for Python developers to implement a hash table from scratch to better understand how they work under the hood.
This guide will walk through the key concepts and techniques for creating a basic hash table in Python. We will cover the following topics:
Table of Contents
Open Table of Contents
Hash Functions
A hash function is used to map a key to an index in the hash table’s array of buckets. It should be deterministic, meaning the same key always maps to the same bucket index. The hash function also needs to uniformly distribute keys to avoid collisions where two keys hash to the same index.
Here is a simple hash function in Python that maps a string key to an integer bucket index:
def hash_function(key):
hash_code = 0
for char in key:
hash_code += ord(char)
return hash_code % array_size
This iterates through the characters of the key and generates a hash code by summing their ASCII ordinal values. The hash code is then modulo the size of the hash table array to fit within the available slots.
While simple, this function is far from ideal, as it is prone to collisions for similar keys. We will cover more robust hash functions later when discussing handling collisions.
Hash Table Class Definition
Here is an overview of the HashTable
class which implements the hash table:
class HashTable:
def __init__(self, size):
self.array_size = size
self.array = [None] * size
def put(self, key, value):
# Insert key-value pair
def get(self, key):
# Retrieve value by key
def remove(self, key):
# Remove key-value pair
def hash_function(self, key):
# Generate hash code
The table is initialized as an array of a fixed size. The put(), get(), and remove() methods handle the basic hash table operations. A hash function is defined to map keys to array indices.
Now let’s implement each of these methods.
The Put Operation
The put() method inserts a new key-value pair into the hash table, or updates the value if the key already exists. Here is an implementation:
def put(self, key, value):
index = self.hash_function(key)
if self.array[index] is None:
self.array[index] = HashTableEntry(key, value)
else:
entry = self.array[index]
entry.value = value
entry.key = key
To insert, it first computes the hash code of the key using the hash function, which gives the index in the array where the entry should be stored.
If that slot is empty, a new HashTableEntry
object is created containing the key-value pair. Otherwise, if a collision occurs, the existing entry is overwritten with the new key-value pair.
The HashTableEntry
class looks like:
class HashTableEntry:
def __init__(self, key, value):
self.key = key
self.value = value
This stores each entry as an object containing the key and associated value.
The Get Operation
To retrieve the value for a given key, the get() method first hashes the key to find the target index, then returns the value in the entry at that index:
def get(self, key):
index = self.hash_function(key)
if self.array[index] is None:
raise KeyError()
entry = self.array[index]
if entry.key == key:
return entry.value
raise KeyError()
It hashes the key, then checks if a valid entry exists at that index. If so, it returns the entry’s value if the keys match. Otherwise, it raises a KeyError
.
This handles the case where the key does not exist in the table or a collision occurred with a different key.
The Remove Operation
To remove a key-value pair, the remove() method hashes the key to find the index, then sets that slot to None
:
def remove(self, key):
index = self.hash_function(key)
if self.array[index] is None:
raise KeyError()
entry = self.array[index]
if entry.key == key:
self.array[index] = None
else:
raise KeyError()
Again this checks that the entry exists and the keys match before removing.
Now that the basic operations are implemented, next we need to improve how collisions are handled.
Handling Collisions
The simple hash function shown initially is likely to cause collisions where two different keys hash to the same index. This needs to be handled by either avoiding collisions or resolving them when they occur.
Some common strategies include:
Chaining - With chaining, each bucket stores a linked list of entries that hash to that index. New entries are appended to the list:
def put(self, key, value):
index = self.hash_function(key)
if self.array[index] is None:
self.array[index] = LinkedList()
self.array[index].append(HashTableEntry(key, value))
Linear Probing - When a collision occurs, linear probing scans the array sequentially until it finds an empty slot to insert the new entry:
def put(self, key, value):
index = self.hash_function(key)
while self.array[index] is not None:
index = (index + 1) % array_size
self.array[index] = HashTableEntry(key, value)
Double Hashing - With double hashing, a second hash function is used to determine the offset from the initial index to probe for an open slot:
def put(self, key, value):
index = self.hash_function_1(key)
increment = self.hash_function_2(key)
while self.array[index] is not None:
index = (index + increment) % array_size
self.array[index] = HashTableEntry(key, value)
There are tradeoffs between these techniques in terms of lookup speed, memory overhead, and implementation complexity. Linear probing tends to be simpler to implement in Python.
Load Factor and Rehashing
As the hash table fills up, the likelihood of collisions increases, reducing lookup performance. The load factor gives the ratio of entries to available slots in the underlying array.
Once the load factor exceeds a threshold, such as 0.7, the array needs to be resized to a larger capacity, and all the entries rehashed into their new indices. This is known as rehashing:
def rehash(self):
old_array = self.array
old_capacity = len(old_array)
new_capacity = 2 * old_capacity
self.array = [None] * new_capacity
for entry in old_array:
if entry is not None:
index = self.hash_function(entry.key)
self.array[index] = entry
This creates a new array double the size, rehashes all entries into it, then replaces the old array.
Rehashing should occur whenever the load factor gets too high or the performance of operations like put() declines. This maintains the hash table’s efficiency.
Time and Space Complexity
The time complexity for lookups with a good hash function is O(1) on average. In the worst case, it could approach O(n) if there are many collisions.
Inserts take O(1) time on average but O(n) worst case if many probing attempts are needed during rehashing. Deletes are O(1) since the key is hashed and index accessed directly.
In terms of space complexity, the underlying array requires O(n) additional memory where n is the capacity needed to maintain the desired load factor.
Choosing a low load factor trades memory utilization for faster lookups and inserts. There is a time-memory tradeoff to balance based on the application requirements.
Example Usage
Here is an example of using the hash table to store student names and ID numbers:
hash_table = HashTable(10)
hash_table.put("Jane Doe", "ID1234")
hash_table.put("John Smith", "ID4567")
print(hash_table.get("Jane Doe"))
# Prints ID1234
hash_table.remove("John Smith")
print(hash_table.get("John Smith"))
# Throws KeyError
This demonstrates inserting key-value pairs, retrieving values, and removing entries.
The hash table can scale to handle large datasets efficiently. It is useful for tasks like:
- Uniquely identifying objects by key
- Retrieving records by a lookup key
- Caching/memoization
- Database indexing
- Sets and dictionaries
Summary
In this guide, we implemented a basic hash table in Python with put(), get(), and remove() methods by:
- Using a hash function to map keys to array indices
- Handling collisions with techniques like chaining or linear probing
- Dynamically resizing and rehashing when the load factor gets too high
- Achieving O(1) lookup time on average along with O(n) memory overhead
Hash tables provide fast access and are ubiquitous data structures for many programming applications. Building one from scratch helps reinforce how they function and can be manipulated to optimize for speed or memory usage based on system constraints.