Implementing a Hash Table from Scratch in Python: A How-To Guide

A hash table, also known as a hash map or dictionary, is a fundamental data structure used to store and retrieve data efficiently. At its core, a hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

In Python, dictionaries serve as hash tables. They are one of the most useful built-in data types, providing fast lookup, insert, delete, and search functionalities. However, it can be instructive for Python developers to implement a hash table from scratch to better understand how they work under the hood.

This guide will walk through the key concepts and techniques for creating a basic hash table in Python. We will cover the following topics:

Open Table of Contents

Hash Functions
Hash Table Class Definition
The Put Operation
The Get Operation
The Remove Operation
Handling Collisions
Load Factor and Rehashing
Time and Space Complexity
Example Usage
Summary

Hash Functions

A hash function is used to map a key to an index in the hash table’s array of buckets. It should be deterministic, meaning the same key always maps to the same bucket index. The hash function also needs to uniformly distribute keys to avoid collisions where two keys hash to the same index.

Here is a simple hash function in Python that maps a string key to an integer bucket index:

def hash_function(key):
    hash_code = 0
    for char in key:
        hash_code += ord(char)
    return hash_code % array_size

This iterates through the characters of the key and generates a hash code by summing their ASCII ordinal values. The hash code is then modulo the size of the hash table array to fit within the available slots.

While simple, this function is far from ideal, as it is prone to collisions for similar keys. We will cover more robust hash functions later when discussing handling collisions.

Hash Table Class Definition

Here is an overview of the HashTable class which implements the hash table:

class HashTable:

    def __init__(self, size):
        self.array_size = size
        self.array = [None] * size

    def put(self, key, value):
        # Insert key-value pair

    def get(self, key):
        # Retrieve value by key

    def remove(self, key):
        # Remove key-value pair

    def hash_function(self, key):
        # Generate hash code

The table is initialized as an array of a fixed size. The put(), get(), and remove() methods handle the basic hash table operations. A hash function is defined to map keys to array indices.

Now let’s implement each of these methods.

The Put Operation

The put() method inserts a new key-value pair into the hash table, or updates the value if the key already exists. Here is an implementation:

def put(self, key, value):

    index = self.hash_function(key)

    if self.array[index] is None:
        self.array[index] = HashTableEntry(key, value)
    else:
        entry = self.array[index]
        entry.value = value
        entry.key = key

To insert, it first computes the hash code of the key using the hash function, which gives the index in the array where the entry should be stored.

If that slot is empty, a new HashTableEntry object is created containing the key-value pair. Otherwise, if a collision occurs, the existing entry is overwritten with the new key-value pair.

The HashTableEntry class looks like:

class HashTableEntry:

    def __init__(self, key, value):
        self.key = key
        self.value = value

This stores each entry as an object containing the key and associated value.

The Get Operation

To retrieve the value for a given key, the get() method first hashes the key to find the target index, then returns the value in the entry at that index:

def get(self, key):

    index = self.hash_function(key)

    if self.array[index] is None:
        raise KeyError()

    entry = self.array[index]

    if entry.key == key:
        return entry.value

    raise KeyError()

It hashes the key, then checks if a valid entry exists at that index. If so, it returns the entry’s value if the keys match. Otherwise, it raises a KeyError.

This handles the case where the key does not exist in the table or a collision occurred with a different key.

The Remove Operation

To remove a key-value pair, the remove() method hashes the key to find the index, then sets that slot to None:

def remove(self, key):

    index = self.hash_function(key)

    if self.array[index] is None:
        raise KeyError()

    entry = self.array[index]

    if entry.key == key:
        self.array[index] = None
    else:
        raise KeyError()

Again this checks that the entry exists and the keys match before removing.

Now that the basic operations are implemented, next we need to improve how collisions are handled.

Handling Collisions

The simple hash function shown initially is likely to cause collisions where two different keys hash to the same index. This needs to be handled by either avoiding collisions or resolving them when they occur.

Some common strategies include:

Chaining - With chaining, each bucket stores a linked list of entries that hash to that index. New entries are appended to the list:

def put(self, key, value):

    index = self.hash_function(key)

    if self.array[index] is None:
        self.array[index] = LinkedList()

    self.array[index].append(HashTableEntry(key, value))

Linear Probing - When a collision occurs, linear probing scans the array sequentially until it finds an empty slot to insert the new entry:

def put(self, key, value):

    index = self.hash_function(key)

    while self.array[index] is not None:
        index = (index + 1) % array_size

    self.array[index] = HashTableEntry(key, value)

Double Hashing - With double hashing, a second hash function is used to determine the offset from the initial index to probe for an open slot:

def put(self, key, value):

    index = self.hash_function_1(key)
    increment = self.hash_function_2(key)

    while self.array[index] is not None:
        index = (index + increment) % array_size

    self.array[index] = HashTableEntry(key, value)

There are tradeoffs between these techniques in terms of lookup speed, memory overhead, and implementation complexity. Linear probing tends to be simpler to implement in Python.

Load Factor and Rehashing

As the hash table fills up, the likelihood of collisions increases, reducing lookup performance. The load factor gives the ratio of entries to available slots in the underlying array.

Once the load factor exceeds a threshold, such as 0.7, the array needs to be resized to a larger capacity, and all the entries rehashed into their new indices. This is known as rehashing:

def rehash(self):

    old_array = self.array
    old_capacity = len(old_array)
    new_capacity = 2 * old_capacity

    self.array = [None] * new_capacity

    for entry in old_array:
        if entry is not None:
            index = self.hash_function(entry.key)
            self.array[index] = entry

This creates a new array double the size, rehashes all entries into it, then replaces the old array.

Rehashing should occur whenever the load factor gets too high or the performance of operations like put() declines. This maintains the hash table’s efficiency.

Time and Space Complexity

The time complexity for lookups with a good hash function is O(1) on average. In the worst case, it could approach O(n) if there are many collisions.

Inserts take O(1) time on average but O(n) worst case if many probing attempts are needed during rehashing. Deletes are O(1) since the key is hashed and index accessed directly.

In terms of space complexity, the underlying array requires O(n) additional memory where n is the capacity needed to maintain the desired load factor.

Choosing a low load factor trades memory utilization for faster lookups and inserts. There is a time-memory tradeoff to balance based on the application requirements.

Example Usage

Here is an example of using the hash table to store student names and ID numbers:

hash_table = HashTable(10)

hash_table.put("Jane Doe", "ID1234")
hash_table.put("John Smith", "ID4567")

print(hash_table.get("Jane Doe"))
# Prints ID1234

hash_table.remove("John Smith")

print(hash_table.get("John Smith"))
# Throws KeyError

This demonstrates inserting key-value pairs, retrieving values, and removing entries.

The hash table can scale to handle large datasets efficiently. It is useful for tasks like:

Uniquely identifying objects by key
Retrieving records by a lookup key
Caching/memoization
Database indexing
Sets and dictionaries

Summary

In this guide, we implemented a basic hash table in Python with put(), get(), and remove() methods by:

Using a hash function to map keys to array indices
Handling collisions with techniques like chaining or linear probing
Dynamically resizing and rehashing when the load factor gets too high
Achieving O(1) lookup time on average along with O(n) memory overhead

Hash tables provide fast access and are ubiquitous data structures for many programming applications. Building one from scratch helps reinforce how they function and can be manipulated to optimize for speed or memory usage based on system constraints.