Topological sorting is an algorithm for arranging the nodes of a directed acyclic graph (DAG) in a special linear order where for each directed edge from node A to node B, node A appears before node B in the ordering. The topological sort order is not necessarily unique, there can be multiple valid topological orderings for a given DAG.
Topological sorting has many applications in scheduling, data science, computer science, and other fields. Some key use cases include:
- Scheduling tasks that have dependencies on each other
- Ordering compilation tasks to compile source code files
- Data science workflows to run models with dependencies
- Finding instruction ordering in dependency graphs for processors
- Network routing algorithms
In this comprehensive Python guide, we will cover the following topics:
Table of Contents
Open Table of Contents
Understanding Directed Acyclic Graphs
A Directed Acyclic Graph (DAG) is a directed graph that contains no cycles. That means it is not possible to start from a node and follow a sequence of edges that loops back to the same node again.
Some key properties of DAGs:
- The edges have a specific direction associated with them
- There are no loops (acyclic)
- It can have one or more starting nodes that have no incoming edges
- It can have one or more ending nodes that have no outgoing edges
For a DAG, a topological ordering or sort is possible such that for every directed edge from node A to node B, node A appears before node B in the ordering.
Here is an example DAG:
# Example DAG
graph = {
'A': ['D'],
'B': ['D'],
'C': ['A', 'B'],
'D': ['G', 'H'],
'E': ['F'],
'F': ['H'],
'G': [],
'H': []
}
This DAG has 8 nodes (A to H). The edges indicate dependencies, for example, node C depends on nodes A and B. Node D depends on A, B, and C indirectly. G and H are ending nodes with no further dependencies.
There can be different valid topological orderings for this graph, such as:
[G, H, D, A, B, C, E, F]
or
[H, G, D, B, A, C, F, E]
But a node will always appear after its dependencies in a valid topological ordering.
Topological Sort Algorithms
There are two common algorithms to find a topological ordering for a DAG:
Kahn’s Algorithm using Indegree
Kahn’s algorithm works by calculating the indegree for each node, then starts with nodes that have an indegree of 0, and removes their edges one by one.
The steps are:
- Initialize a queue Q to store nodes with indegree 0
- Calculate indegree for each node and add nodes with indegree 0 to Q
- While Q is not empty:
- Remove a node n from Q
- Decrease indegree for all its neighboring nodes
- Add any new nodes with indegree 0 to Q
- If graph has edges left, it has a cycle (not DAG)
- Else print topological sort order using the removed nodes
Depth First Search (DFS) Approach
The DFS approach does a depth-first traversal of the graph, tracking visited nodes and prioritizing nodes that have dependencies.
The steps are:
- Initialize empty list for topological order
- Perform DFS traversal:
- Mark node as visited
- Recursively visit its dependencies that are unvisited
- After recursive calls, add node to topological order
- Print topological order list after DFS
DFS ensures that a node only gets added after its dependencies have been added first. This produces the topological ordering.
Next, let’s look at Python code to implement these algorithms.
Implementing Topological Sort in Python
We will implement both Kahn’s algorithm and the DFS approach in Python to find topological order.
Let’s use the following sample DAG for testing the implementations:
# Sample DAG
graph = {
'A': set(['D', 'F']),
'B': set(['D', 'E']),
'C': set(['D', 'F']),
'D': set(['H']),
'E': set(['F', 'H']),
'F': set([]),
'H': set([])
}
Kahn’s Algorithm
Here is the Python implementation of Kahn’s algorithm using indegree:
from collections import defaultdict
def topological_sort_kahn(graph):
indegree = defaultdict(int) # Track indegrees
queue = [] #Initialize queue
# Calculate indegrees
for node in graph:
for neighbour in graph[node]:
indegree[neighbour] += 1
# Add nodes with 0 indegree to queue
for node in graph:
if indegree[node] == 0:
queue.append(node)
topological_order = []
# Process until queue is empty
while queue:
# Remove node from queue and add to topological order
node = queue.pop(0)
topological_order.append(node)
# Reduce indegree for its neighbors
for neighbour in graph[node]:
indegree[neighbour] -= 1
# Add new 0 indegree nodes to queue
if indegree[neighbour] == 0:
queue.append(neighbour)
if len(topological_order) != len(graph):
print("Cycle exists")
print(topological_order)
topological_sort_kahn(graph)
Output:
['H', 'F', 'E', 'D', 'A', 'B', 'C']
This follows Kahn’s algorithm steps:
- Track indegrees of each node
- Add 0 indegree nodes to queue (H, F)
- Reduce indegree on removal:
- H removed - E now has 0 indegree
- F removed - E added to queue
- E removed - D has 0 indegree
- etc.
- topological_order contains valid order meeting DAG dependencies
Depth First Search Approach
Here is the Python DFS implementation for topological sorting:
visited = set() # Track visited nodes
topological_order = [] # Store topological order
def topological_sort_dfs(graph, node):
# Mark node as visited
visited.add(node)
# Recursively call DFS on its neighbors
for neighbor in graph[node]:
if neighbor not in visited:
topological_sort_dfs(graph, neighbor)
# Add node to topological order
topological_order.append(node)
for node in graph:
if node not in visited:
topological_sort_dfs(graph, node)
print(topological_order)
Output:
['H', 'F', 'D', 'E', 'B', 'A', 'C']
This follows the DFS steps:
- Mark node as visited and recursively visit unvisited dependencies
- After recursive calls end, add node to topological order
- Only add node after its dependencies have been added first
The DFS approach directly enforces the topological ordering based on graph structure.
Applications and Examples
Topological sort has many applications in real-world systems. Some examples:
Instruction Scheduling:
- Compile source code files based on their dependencies
- Schedule CPU instruction execution based on data dependencies
Package Management:
- Install system packages in dependency order
- Resolve conflicts between packages based on versions
Data Science Pipelines:
- Schedule workflows of interdependent data processing tasks
- Run models and feature engineering steps in appropriate order
Network Routing:
- Assign IP addresses based on router dependencies
- Broadcast packets along a dependency graph of routers
Project Scheduling:
- Model task dependencies for large projects and schedule tasks appropriately
- Update schedule when task dependencies change
Let’s look at a project scheduling example.
Consider a project with the following task dependencies:
A -> C
B -> C
C -> D, E
D -> F
E -> F
F -> G
We can model this as a DAG:
project_graph = {
'A': ['C'],
'B': ['C'],
'C': ['D','E'],
'D': ['F'],
'E': ['F'],
'F': ['G'],
'G': []
}
Run topological sort on this graph to get a valid task schedule order:
Topological order: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
This schedule meets all the task dependencies.
Complexity Analysis
The complexity of Kahn’s algorithm is O(V+E) where V is number of nodes and E is number of edges.
This is because it does one O(V) sweep to calculate indegrees, and O(E) work to reduce indegrees of neighbors.
The DFS approach is also O(V+E) since it visits all nodes and edges in the recursion.
Overall, both algorithms are efficient and linear time complexity for sparse graphs.
Common Errors and Debugging
Here are some common errors and how to debug them when implementing topological sort:
-
Cycle detection: If the graph has a cycle, topological ordering is impossible. Use Kahn’s algorithm and check for edges left to detect this.
-
Unvisited nodes: The final ordering may be missing nodes if DFS did not visit all nodes. Track visited nodes and call DFS recursively starting from each node.
-
Stack overflow: The DFS recursion may exceed max depth for large graphs. Use a loop instead of recursion to avoid hitting the stack limit.
-
Ordering issues: The algorithm may produce an incorrect order if dependencies are not handled properly. Double check that each node only gets added after its dependencies.
-
Input issues: The graph data structure passed in may not properly represent the dependencies. Print and visually validate the graph before running topological sort.
-
Edge cases: Watch out for empty graphs or graphs with islands to avoid index errors. Check for nodes with no edges.
Carefully testing the algorithms on different sample DAGs and validating the output ordering is key to debugging issues.
Conclusion
This guide covered the topology sort technique in detail, explaining the algorithms, implementations, complexity analysis, and applications using Python.
Key points:
- Topological ordering arranges nodes in a directed acyclic graph linearly based on dependencies
- Kahn’s algorithm uses indegree calculation and is efficient at O(V+E)
- DFS approach recursively visits unvisited nodes first to enforce ordering
- Useful for scheduling dependent tasks, routing, packaging, and data science workflows
- Watch out for cycles, unvisited nodes, stack overflows, and edge cases
You should now have a solid grasp of topological sorting and be able to leverage it to schedule and optimize real-world systems with dependency constraints using Python.
The techniques discussed can be applied in data science, DevOps, networking, and many other domains where dependency ordering is required. This establishes a key foundation for working with directed acyclic graphs and workflows.