Implementing Trie with Auto-Completion in Python - Step-by-Step Guide

A trie, also called a prefix tree, is a useful tree data structure used to store associative data arrays that map keys to values in a very efficient way for search and insert operations. In this comprehensive guide, we will explore how to implement a trie data structure in Python and extend it to provide auto-completion suggestions based on a given prefix.

Open Table of Contents

Overview of Tries
Implementing a Basic Trie in Python
Trie Node Optimization with Default Dict
Adding Auto-Complete Functionality
Sample Application: Auto-Completing Code Functions
Trie Optimization and Variants
Analysis of Trie Performance
Conclusion

Overview of Tries

A trie is a tree structure where each node represents a key. The keys are typically strings, such as words or IP addresses. The tree is structured such that all descendant nodes of a node share a common prefix. This allows efficient key lookups and insertions.

Some key properties of tries:

The root node is the starting point for all keys
Each key is associated with a node obtained by traversing the tree from root to leaf, following the key’s sequence of characters
Keys that share prefixes will share initial nodes up to the diversion point
All children of a node share a common prefix of the associated keys

The trie data structure provides faster lookup time compared to binary search trees. Lookup, insert and delete operations take O(L) time where L represents the length of the key. This is much faster than a binary search tree’s O(log n) time.

Tries are extremely useful for storing dictionaries and providing auto-complete suggestions. By traversing the trie Till the last known key prefix, we can efficiently find all keys with that prefix, making auto-complete easy to implement.

Implementing a Basic Trie in Python

Let’s walk through a simple implementation of a trie data structure in Python focused on the core functionality.

We will need to implement the following:

TrieNode - Represents each node in the trie
Trie - The trie class itself
insert - To insert a key into the trie
search - To search for a key in the trie
starts_with - Find keys with a given prefix

First, we define the TrieNode class:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.end_of_word = False

Each node stores a children dictionary to map to its child nodes, and a end_of_word boolean flag to indicate if the node represents the end of a key.

Next, the main Trie class:

class Trie:
    def __init__(self):
        self.root = TrieNode()

We initialize the trie with just the root node. Now, for the insert method:

    def insert(self, key):
        curr_node = self.root

        for char in key:
            if char not in curr_node.children:
                curr_node.children[char] = TrieNode()
            curr_node = curr_node.children[char]

        curr_node.end_of_word = True

We traverse the trie comparing each character in the key to the child nodes. If a child node does not exist for the current character, we create a new TrieNode for it. Finally, we mark the last node as the end of the word.

The search method is similar:

    def search(self, key):
        curr_node = self.root

        for char in key:
            if char not in curr_node.children:
                return False

            curr_node = curr_node.children[char]

        return curr_node.end_of_word

We traverse the trie till the last character, returning False if we cannot find the key at any point. The key exists only if the last node is marked as end of word.

Finally, for prefix based search:

    def starts_with(self, prefix):
        curr_node = self.root
        result = []

        for char in prefix:
            if char not in curr_node.children:
                return result

            curr_node = curr_node.children[char]

        self._find_words(curr_node, prefix, result)
        return result

    def _find_words(self, curr_node, prefix, result):
        if curr_node.end_of_word:
            result.append(prefix)

        for child in curr_node.children:
            self._find_words(curr_node.children[child], prefix + child, result)

We traverse till the end of the provided prefix, get the last node, then recursively search through its children to find all words with the prefix.

And this wraps up the basic functionality of a trie in Python! With just insert, search and starts_with, we can dynamically build a trie and use it for efficient lookups.

Trie Node Optimization with Default Dict

Our current implementation uses a simple dictionary to store child nodes in each TrieNode. This can be optimized using Python’s DefaultDict from the collections module.

DefaultDict allows us to initialize a dictionary that creates any missing keys automatically with a default value. We can define the default value as a new TrieNode:

from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.end_of_word = False

Now children will automatically create new nodes whenever we try to access a key that doesn’t exist. This simplifies the insert and search logic, reducing code repetition.

Adding Auto-Complete Functionality

Now that we have a working trie implementation, let’s extend it to provide auto-complete suggestions based on a given prefix.

We will need two methods:

get_words_for_prefix - Returns all complete words for a prefix
auto_complete - Return auto-complete suggestions for incomplete words

get_words_for_prefix leverages the starts_with logic we already implemented:

def get_words_for_prefix(self, prefix):
    words = self.starts_with(prefix)
    return words

For auto_complete, we traverse the trie till the end of the given prefix, then fetch child nodes to gather complete word suggestions:

def auto_complete(self, prefix):
    curr_node = self.root
    for char in prefix:
        if char not in curr_node.children:
            return []

        curr_node = curr_node.children[char]

    suggestions = []
    self._auto_complete(curr_node, prefix, suggestions)
    return suggestions

def _auto_complete(self, curr_node, prefix, suggestions):
    if curr_node.end_of_word:
        suggestions.append(prefix)

    for child in curr_node.children:
        self._auto_complete(curr_node.children[child], prefix + child, suggestions)

And with that, our trie now supports auto-completion! We can build the trie with a dictionary, then efficiently provide suggestions for partial words based on the characters entered so far.

Here is an example usage:

words = ["hello", "helix", "bye", "cat"]

trie = Trie()
for word in words:
    trie.insert(word)

print(trie.auto_complete("hel"))
# ['hello', 'helix']

Some key points about using tries for auto-complete:

Performance improves compared to checking all keys since we only traverse and search a portion of the trie
Works well for textual data like code or natural languages
Provides fast and incremental suggestions as user types each character
Can be further optimized by sorting sibling nodes alphabetically

Sample Application: Auto-Completing Code Functions

Let’s demonstrate a practical use case of our auto-completing trie by using it to provide code function suggestions.

We will:

Build a trie from a corpus of Python functions
Handle auto-completing based on the typed prefix

First, we extract a corpus of common Python functions:

import re

corpus = open('python_functions.txt').read()

functions = re.findall(r'def (\w+)', corpus)

Next, build a trie and insert all functions:

trie = Trie()
for func in functions:
    trie.insert(func)

Now we can split the text entered by the user into tokens:

text = "import math\nmath.f"
tokens = text.split()
prefix = tokens[-1]

Pass the prefix to auto-complete and show suggestions:

suggestions = trie.auto_complete(prefix)

print("Suggestions for", prefix)
for sug in suggestions:
    print(sug)

# Suggestions for f
# factorial
# floor
# fabs

The user can then select a suggestion to auto-complete their code!

This demonstrates how we can build functional auto-completion fairly easily using tries in Python.

Trie Optimization and Variants

There are further optimizations and variants of the trie data structure that improve performance in different use cases:

Compressed Trie

A compressed trie saves storage by merging child nodes with just one child. This reduces memory usage while retaining lookup speed.

Sorted Trie

Maintain child nodes in lexicographic order. This allows faster traversal and lookup.

Concurrent Trie

Allow concurrent read/write access from multiple threads via locking and synchronization. Useful in multi-threaded environments.

HAT Trie

The Highest Ancestor with Two children Trie maintains pointers to highest ancestor with two children. This reduces traversal steps for unsuccessful searches.

B Trie

A space-optimized trie where child nodes are stored together in an array. Provides high cache utilization.

Ternary Search Trie (TST)

Uses 3-way branching (compared to binary). Insert and search ops are simplified to single comparisons at each level.

There are various ways the basic trie can be optimized and modified for different use cases. Picking the right variant can lead to substantial performance improvements.

Analysis of Trie Performance

The trie provides fast O(L) time complexity for search, insert and delete where L is the length of the key. This is much faster than a binary search tree.

Some key analysis points:

Search: O(L) since in the worst case we traverse all L characters of the key, where L is length of search key.
Insert: Also O(L), as we may have to traverse and create L nodes till we insert a key of length L.
Delete: Again O(L) since removing a key requires walking its L nodes.
Prefix search: O(P) where P is length of prefix, since we only traverse nodes till end of prefix.
Memory: O(N*L) where N is number of keys and L average key length. Requires memory for inner nodes and leaf nodes.
Auto-complete: O(P + S) where P is prefix length and S is number of suggestions. We traverse P nodes for prefix, then iterate children for suggestions.

So in summary, tries provide fast search, insert and delete operations by sacrificing memory consumption. The O(L) time complexity can be orders of magnitude faster compared to a binary search tree. This makes tries very useful for real-time applications like auto-complete.

Conclusion

In this guide, we went over how to implement the trie data structure in Python and extend it to provide auto-complete suggestions.

Key points:

A trie is an efficient tree structure for storing associative data arrays and providing fast lookups
We can implement a trie in Python using dictionaries to store child nodes
Tries allow fast O(L) search, insert and delete operations where L is key length
They can be extended to provide fast prefix-based auto-complete suggestions
Tries are extremely useful for storing dictionaries, phone directories, and providing word or code suggestions

There are many ways to optimize tries further such as compressed or concurrent variants. Picking the right trie implementation can lead to substantial performance gains for your application.

I hope this comprehensive guide provided you with a solid understanding of implementing and using tries for auto-completion in Python! Let me know if you have any other questions.