Skip to content

Implementing Trie with Auto-Completion in Python - Step-by-Step Guide

Updated: at 04:34 AM

A trie, also called a prefix tree, is a useful tree data structure used to store associative data arrays that map keys to values in a very efficient way for search and insert operations. In this comprehensive guide, we will explore how to implement a trie data structure in Python and extend it to provide auto-completion suggestions based on a given prefix.

Table of Contents

Open Table of Contents

Overview of Tries

A trie is a tree structure where each node represents a key. The keys are typically strings, such as words or IP addresses. The tree is structured such that all descendant nodes of a node share a common prefix. This allows efficient key lookups and insertions.

Some key properties of tries:

The trie data structure provides faster lookup time compared to binary search trees. Lookup, insert and delete operations take O(L) time where L represents the length of the key. This is much faster than a binary search tree’s O(log n) time.

Tries are extremely useful for storing dictionaries and providing auto-complete suggestions. By traversing the trie Till the last known key prefix, we can efficiently find all keys with that prefix, making auto-complete easy to implement.

Implementing a Basic Trie in Python

Let’s walk through a simple implementation of a trie data structure in Python focused on the core functionality.

We will need to implement the following:

First, we define the TrieNode class:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.end_of_word = False

Each node stores a children dictionary to map to its child nodes, and a end_of_word boolean flag to indicate if the node represents the end of a key.

Next, the main Trie class:

class Trie:
    def __init__(self):
        self.root = TrieNode()

We initialize the trie with just the root node. Now, for the insert method:

    def insert(self, key):
        curr_node = self.root

        for char in key:
            if char not in curr_node.children:
                curr_node.children[char] = TrieNode()
            curr_node = curr_node.children[char]

        curr_node.end_of_word = True

We traverse the trie comparing each character in the key to the child nodes. If a child node does not exist for the current character, we create a new TrieNode for it. Finally, we mark the last node as the end of the word.

The search method is similar:

    def search(self, key):
        curr_node = self.root

        for char in key:
            if char not in curr_node.children:
                return False

            curr_node = curr_node.children[char]

        return curr_node.end_of_word

We traverse the trie till the last character, returning False if we cannot find the key at any point. The key exists only if the last node is marked as end of word.

Finally, for prefix based search:

    def starts_with(self, prefix):
        curr_node = self.root
        result = []

        for char in prefix:
            if char not in curr_node.children:
                return result

            curr_node = curr_node.children[char]

        self._find_words(curr_node, prefix, result)
        return result

    def _find_words(self, curr_node, prefix, result):
        if curr_node.end_of_word:
            result.append(prefix)

        for child in curr_node.children:
            self._find_words(curr_node.children[child], prefix + child, result)

We traverse till the end of the provided prefix, get the last node, then recursively search through its children to find all words with the prefix.

And this wraps up the basic functionality of a trie in Python! With just insert, search and starts_with, we can dynamically build a trie and use it for efficient lookups.

Trie Node Optimization with Default Dict

Our current implementation uses a simple dictionary to store child nodes in each TrieNode. This can be optimized using Python’s DefaultDict from the collections module.

DefaultDict allows us to initialize a dictionary that creates any missing keys automatically with a default value. We can define the default value as a new TrieNode:

from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.end_of_word = False

Now children will automatically create new nodes whenever we try to access a key that doesn’t exist. This simplifies the insert and search logic, reducing code repetition.

Adding Auto-Complete Functionality

Now that we have a working trie implementation, let’s extend it to provide auto-complete suggestions based on a given prefix.

We will need two methods:

  1. get_words_for_prefix - Returns all complete words for a prefix
  2. auto_complete - Return auto-complete suggestions for incomplete words

get_words_for_prefix leverages the starts_with logic we already implemented:

def get_words_for_prefix(self, prefix):
    words = self.starts_with(prefix)
    return words

For auto_complete, we traverse the trie till the end of the given prefix, then fetch child nodes to gather complete word suggestions:

def auto_complete(self, prefix):
    curr_node = self.root
    for char in prefix:
        if char not in curr_node.children:
            return []

        curr_node = curr_node.children[char]

    suggestions = []
    self._auto_complete(curr_node, prefix, suggestions)
    return suggestions

def _auto_complete(self, curr_node, prefix, suggestions):
    if curr_node.end_of_word:
        suggestions.append(prefix)

    for child in curr_node.children:
        self._auto_complete(curr_node.children[child], prefix + child, suggestions)

And with that, our trie now supports auto-completion! We can build the trie with a dictionary, then efficiently provide suggestions for partial words based on the characters entered so far.

Here is an example usage:

words = ["hello", "helix", "bye", "cat"]

trie = Trie()
for word in words:
    trie.insert(word)

print(trie.auto_complete("hel"))
# ['hello', 'helix']

Some key points about using tries for auto-complete:

Sample Application: Auto-Completing Code Functions

Let’s demonstrate a practical use case of our auto-completing trie by using it to provide code function suggestions.

We will:

  1. Build a trie from a corpus of Python functions
  2. Handle auto-completing based on the typed prefix

First, we extract a corpus of common Python functions:

import re

corpus = open('python_functions.txt').read()

functions = re.findall(r'def (\w+)', corpus)

Next, build a trie and insert all functions:

trie = Trie()
for func in functions:
    trie.insert(func)

Now we can split the text entered by the user into tokens:

text = "import math\nmath.f"
tokens = text.split()
prefix = tokens[-1]

Pass the prefix to auto-complete and show suggestions:

suggestions = trie.auto_complete(prefix)

print("Suggestions for", prefix)
for sug in suggestions:
    print(sug)

# Suggestions for f
# factorial
# floor
# fabs

The user can then select a suggestion to auto-complete their code!

This demonstrates how we can build functional auto-completion fairly easily using tries in Python.

Trie Optimization and Variants

There are further optimizations and variants of the trie data structure that improve performance in different use cases:

Compressed Trie

A compressed trie saves storage by merging child nodes with just one child. This reduces memory usage while retaining lookup speed.

Sorted Trie

Maintain child nodes in lexicographic order. This allows faster traversal and lookup.

Concurrent Trie

Allow concurrent read/write access from multiple threads via locking and synchronization. Useful in multi-threaded environments.

HAT Trie

The Highest Ancestor with Two children Trie maintains pointers to highest ancestor with two children. This reduces traversal steps for unsuccessful searches.

B Trie

A space-optimized trie where child nodes are stored together in an array. Provides high cache utilization.

Ternary Search Trie (TST)

Uses 3-way branching (compared to binary). Insert and search ops are simplified to single comparisons at each level.

There are various ways the basic trie can be optimized and modified for different use cases. Picking the right variant can lead to substantial performance improvements.

Analysis of Trie Performance

The trie provides fast O(L) time complexity for search, insert and delete where L is the length of the key. This is much faster than a binary search tree.

Some key analysis points:

So in summary, tries provide fast search, insert and delete operations by sacrificing memory consumption. The O(L) time complexity can be orders of magnitude faster compared to a binary search tree. This makes tries very useful for real-time applications like auto-complete.

Conclusion

In this guide, we went over how to implement the trie data structure in Python and extend it to provide auto-complete suggestions.

Key points:

There are many ways to optimize tries further such as compressed or concurrent variants. Picking the right trie implementation can lead to substantial performance gains for your application.

I hope this comprehensive guide provided you with a solid understanding of implementing and using tries for auto-completion in Python! Let me know if you have any other questions.