A trie, also called a prefix tree, is a useful tree data structure used to store associative data arrays that map keys to values in a very efficient way for search and insert operations. In this comprehensive guide, we will explore how to implement a trie data structure in Python and extend it to provide auto-completion suggestions based on a given prefix.
Table of Contents
Open Table of Contents
Overview of Tries
A trie is a tree structure where each node represents a key. The keys are typically strings, such as words or IP addresses. The tree is structured such that all descendant nodes of a node share a common prefix. This allows efficient key lookups and insertions.
Some key properties of tries:
- The root node is the starting point for all keys
- Each key is associated with a node obtained by traversing the tree from root to leaf, following the key’s sequence of characters
- Keys that share prefixes will share initial nodes up to the diversion point
- All children of a node share a common prefix of the associated keys
The trie data structure provides faster lookup time compared to binary search trees. Lookup, insert and delete operations take O(L) time where L represents the length of the key. This is much faster than a binary search tree’s O(log n) time.
Tries are extremely useful for storing dictionaries and providing auto-complete suggestions. By traversing the trie Till the last known key prefix, we can efficiently find all keys with that prefix, making auto-complete easy to implement.
Implementing a Basic Trie in Python
Let’s walk through a simple implementation of a trie data structure in Python focused on the core functionality.
We will need to implement the following:
TrieNode
- Represents each node in the trieTrie
- The trie class itselfinsert
- To insert a key into the triesearch
- To search for a key in the triestarts_with
- Find keys with a given prefix
First, we define the TrieNode
class:
class TrieNode:
def __init__(self):
self.children = {}
self.end_of_word = False
Each node stores a children
dictionary to map to its child nodes, and a end_of_word
boolean flag to indicate if the node represents the end of a key.
Next, the main Trie
class:
class Trie:
def __init__(self):
self.root = TrieNode()
We initialize the trie with just the root node. Now, for the insert method:
def insert(self, key):
curr_node = self.root
for char in key:
if char not in curr_node.children:
curr_node.children[char] = TrieNode()
curr_node = curr_node.children[char]
curr_node.end_of_word = True
We traverse the trie comparing each character in the key to the child nodes. If a child node does not exist for the current character, we create a new TrieNode
for it. Finally, we mark the last node as the end of the word.
The search method is similar:
def search(self, key):
curr_node = self.root
for char in key:
if char not in curr_node.children:
return False
curr_node = curr_node.children[char]
return curr_node.end_of_word
We traverse the trie till the last character, returning False if we cannot find the key at any point. The key exists only if the last node is marked as end of word.
Finally, for prefix based search:
def starts_with(self, prefix):
curr_node = self.root
result = []
for char in prefix:
if char not in curr_node.children:
return result
curr_node = curr_node.children[char]
self._find_words(curr_node, prefix, result)
return result
def _find_words(self, curr_node, prefix, result):
if curr_node.end_of_word:
result.append(prefix)
for child in curr_node.children:
self._find_words(curr_node.children[child], prefix + child, result)
We traverse till the end of the provided prefix, get the last node, then recursively search through its children to find all words with the prefix.
And this wraps up the basic functionality of a trie in Python! With just insert, search and starts_with, we can dynamically build a trie and use it for efficient lookups.
Trie Node Optimization with Default Dict
Our current implementation uses a simple dictionary to store child nodes in each TrieNode
. This can be optimized using Python’s DefaultDict
from the collections module.
DefaultDict
allows us to initialize a dictionary that creates any missing keys automatically with a default value. We can define the default value as a new TrieNode
:
from collections import defaultdict
class TrieNode:
def __init__(self):
self.children = defaultdict(TrieNode)
self.end_of_word = False
Now children
will automatically create new nodes whenever we try to access a key that doesn’t exist. This simplifies the insert and search logic, reducing code repetition.
Adding Auto-Complete Functionality
Now that we have a working trie implementation, let’s extend it to provide auto-complete suggestions based on a given prefix.
We will need two methods:
get_words_for_prefix
- Returns all complete words for a prefixauto_complete
- Return auto-complete suggestions for incomplete words
get_words_for_prefix
leverages the starts_with
logic we already implemented:
def get_words_for_prefix(self, prefix):
words = self.starts_with(prefix)
return words
For auto_complete
, we traverse the trie till the end of the given prefix, then fetch child nodes to gather complete word suggestions:
def auto_complete(self, prefix):
curr_node = self.root
for char in prefix:
if char not in curr_node.children:
return []
curr_node = curr_node.children[char]
suggestions = []
self._auto_complete(curr_node, prefix, suggestions)
return suggestions
def _auto_complete(self, curr_node, prefix, suggestions):
if curr_node.end_of_word:
suggestions.append(prefix)
for child in curr_node.children:
self._auto_complete(curr_node.children[child], prefix + child, suggestions)
And with that, our trie now supports auto-completion! We can build the trie with a dictionary, then efficiently provide suggestions for partial words based on the characters entered so far.
Here is an example usage:
words = ["hello", "helix", "bye", "cat"]
trie = Trie()
for word in words:
trie.insert(word)
print(trie.auto_complete("hel"))
# ['hello', 'helix']
Some key points about using tries for auto-complete:
- Performance improves compared to checking all keys since we only traverse and search a portion of the trie
- Works well for textual data like code or natural languages
- Provides fast and incremental suggestions as user types each character
- Can be further optimized by sorting sibling nodes alphabetically
Sample Application: Auto-Completing Code Functions
Let’s demonstrate a practical use case of our auto-completing trie by using it to provide code function suggestions.
We will:
- Build a trie from a corpus of Python functions
- Handle auto-completing based on the typed prefix
First, we extract a corpus of common Python functions:
import re
corpus = open('python_functions.txt').read()
functions = re.findall(r'def (\w+)', corpus)
Next, build a trie and insert all functions:
trie = Trie()
for func in functions:
trie.insert(func)
Now we can split the text entered by the user into tokens:
text = "import math\nmath.f"
tokens = text.split()
prefix = tokens[-1]
Pass the prefix to auto-complete and show suggestions:
suggestions = trie.auto_complete(prefix)
print("Suggestions for", prefix)
for sug in suggestions:
print(sug)
# Suggestions for f
# factorial
# floor
# fabs
The user can then select a suggestion to auto-complete their code!
This demonstrates how we can build functional auto-completion fairly easily using tries in Python.
Trie Optimization and Variants
There are further optimizations and variants of the trie data structure that improve performance in different use cases:
Compressed Trie
A compressed trie saves storage by merging child nodes with just one child. This reduces memory usage while retaining lookup speed.
Sorted Trie
Maintain child nodes in lexicographic order. This allows faster traversal and lookup.
Concurrent Trie
Allow concurrent read/write access from multiple threads via locking and synchronization. Useful in multi-threaded environments.
HAT Trie
The Highest Ancestor with Two children Trie maintains pointers to highest ancestor with two children. This reduces traversal steps for unsuccessful searches.
B Trie
A space-optimized trie where child nodes are stored together in an array. Provides high cache utilization.
Ternary Search Trie (TST)
Uses 3-way branching (compared to binary). Insert and search ops are simplified to single comparisons at each level.
There are various ways the basic trie can be optimized and modified for different use cases. Picking the right variant can lead to substantial performance improvements.
Analysis of Trie Performance
The trie provides fast O(L) time complexity for search, insert and delete where L is the length of the key. This is much faster than a binary search tree.
Some key analysis points:
-
Search: O(L) since in the worst case we traverse all L characters of the key, where L is length of search key.
-
Insert: Also O(L), as we may have to traverse and create L nodes till we insert a key of length L.
-
Delete: Again O(L) since removing a key requires walking its L nodes.
-
Prefix search: O(P) where P is length of prefix, since we only traverse nodes till end of prefix.
-
Memory: O(N*L) where N is number of keys and L average key length. Requires memory for inner nodes and leaf nodes.
-
Auto-complete: O(P + S) where P is prefix length and S is number of suggestions. We traverse P nodes for prefix, then iterate children for suggestions.
So in summary, tries provide fast search, insert and delete operations by sacrificing memory consumption. The O(L) time complexity can be orders of magnitude faster compared to a binary search tree. This makes tries very useful for real-time applications like auto-complete.
Conclusion
In this guide, we went over how to implement the trie data structure in Python and extend it to provide auto-complete suggestions.
Key points:
- A trie is an efficient tree structure for storing associative data arrays and providing fast lookups
- We can implement a trie in Python using dictionaries to store child nodes
- Tries allow fast O(L) search, insert and delete operations where L is key length
- They can be extended to provide fast prefix-based auto-complete suggestions
- Tries are extremely useful for storing dictionaries, phone directories, and providing word or code suggestions
There are many ways to optimize tries further such as compressed or concurrent variants. Picking the right trie implementation can lead to substantial performance gains for your application.
I hope this comprehensive guide provided you with a solid understanding of implementing and using tries for auto-completion in Python! Let me know if you have any other questions.