Multi-indexing and hierarchical data structures are advanced techniques in Pandas that allow you to work with complex, multi-dimensional data in an organized and efficient way. A multi-index essentially enables multiple (two or more) index levels on an axis, usually the rows. This allows entries in a DataFrame to be accessed via multiple keys rather than a single key.
Hierarchical data structures take multi-indexing a step further by letting you create tree-like relationships between items using parent-child links. This is useful for representing data that has natural hierarchical relationships like geographical location or organizational structure.
In this comprehensive guide, we will learn how to:
- Create multi-index DataFrames using multiple columns or arrays as row indexes
- Query and slice multi-indexed data efficiently using methods like
.xs()
- Reorder, sort, reset and rename multi-index levels
- Convert multi-index columns to a hierarchical index
- Merge & join datasets with multi-indexes
- Perform aggregations on hierarchical data
- Handle real-world data modeling use cases using these advanced structures
We will illustrate the key concepts with easy-to-follow examples using Pandas, Python’s powerful data analysis library. Let’s get started!
Creating a Multi-Indexed DataFrame
A multi-index can be created by passing multiple columns or arrays during DataFrame construction:
import pandas as pd
data = {
"Name": ["John", "Mary", "Steve", "Sarah"],
"Subject": ["Math", "Science", "Math", "Science"],
"Score": [90, 95, 80, 85]
}
df = pd.DataFrame(data)
# Create multi-index from columns
# Pass arrays as row indexes
df_mi = pd.DataFrame(data, index=[["Q1", "Q1", "Q2", "Q2"], ["A", "B", "C", "D"]])
print(df_mi)
Name Subject Score
Q1 A John Math 90
B Mary Science 95
Q2 C Steve Math 80
D Sarah Science 85
Here we passed two column arrays as row indexes to create a multi-indexed DataFrame with a 2-level index of [Quarter, Student ID]. The innermost array forms the outermost index level.
Alternatively, we can convert existing columns to MultiIndex:
# Convert columns to MultiIndex
df_mi = df.set_index(['Subject', 'Name'])
print(df_mi)
Score
Subject Name
Math John 90
Steve 80
Science Mary 95
Sarah 85
Multi-indexes allow efficient querying and slicing as we’ll see next.
Querying Multi-Indexed Data
We can query multi-indexed data using multiple indexers separated with commas. For example:
# Query specific row by multiple indexers
row = df_mi.loc[('Math', 'John')]
print(row)
Score 90
Name: (Math, John), dtype: int64
The order of indexers must match the actual index order.
We can also select rows using Pandas .xs() method by level name or position. .xs()
selects data at a particular level of the MultiIndex:
# Select by index name
row = df_mi.xs('John', level='Name')
# Select by index position
row = df_mi.xs(1, level=1)
For column selection, we slice with just the required columns separately after indexing rows:
# Select specific columns after row selection
df_mi.loc['Math'].xs('John')[['Score']]
We can query multiple rows and columns together:
# Multi-row & column selection
df_mi.loc[(['Math','Science'], ['John', 'Mary']), ['Score']]
The real power of multi-indexing comes from partially specifying the multi-indexers to slice the data at specific index levels.
For example, we can select all rows for ‘Math’ subject as:
df_mi.loc['Math']
Or extract all students with ‘Science’ as subject:
df_mi.loc[:, 'Science']
The :
slices across all elements at that index level. We can mix and match partial and specific indices:
# Mix of partial and specific row selection
df_mi.loc[(['Math'], ['John', 'Steve']), ['Score']]
In essence, multi-indexing allows us to store and slice multi-dimensional data in a tabular structure efficiently.
Handling Missing Data in MultiIndexes
In real-world data, we often encounter missing (NaN) values in MultiIndex DataFrames. For example:
df_mi.loc[('Science', 'Sarah'), 'Score'] = np.nan
print(df_mi)
Score
Subject Name
Math John 90.0
Steve 80.0
Science Mary 95.0
Sarah NaN
To filter out rows with NaN values:
df_mi = df_mi.dropna()
Alternatively, we can fillna() to replace NaNs with a value:
df_mi = df_mi.fillna(0)
Handling missing data properly is important before further analysis.
Reordering, Sorting and Renaming Index Levels
The index levels can be reordered using .reorder_levels()
:
# Reorder index levels
df_mi = df_mi.reorder_levels(['Name', 'Subject'], axis=0)
df_mi
This changes the outermost index to Name
. Reordering can help organize the data logically.
We can also sort the index lexicographically:
# Sort index lexicographically
df_mi = df_mi.sort_index()
Sorting arranges the index entries in a standard order for improved readability.
To reset the index or remove it entirely:
# Reset index - generates new integer index
df_mi = df_mi.reset_index()
# Remove index completely
df_mi = df_mi.reset_index(drop=True)
Resetting the index is useful when we need to convert back to a default integer index after manipulating the hierarchical index.
Finally, we can rename the index levels using .rename_axis()
:
# Rename index levels
df_mi = df_mi.rename_axis(index={'Subject': 'Course'})
Proper ordering, sorting and naming of index levels is crucial for readability and usability.
Converting Columns to Hierarchical Index
We can convert columns into hierarchical indexes to create nested categories.
For instance:
data = {
'Year': [2019, 2020],
'City': ['London','Toronto'],
'Population': [8900000, 6000000],
'GDP': [450000, 480000]
}
# Make sure 'Year' column is integer type
data['Year'] = data['Year'].astype(int)
df = pd.DataFrame(data)
# Convert columns to MultiIndex
df.columns = pd.MultiIndex.from_tuples([
('Stats', 'Population'),
('Stats', 'GDP')])
print(df)
Stats
Population GDP
0 2019 8900000 450000
1 2020 6000000 480000
Here ‘Stats’ becomes the outer high-level category encompassing the two detailed metrics - Population and GDP.
We can query hierarchical indexes by category (Stats) or leaf node (Population) directly:
df['Stats'] # or
df['Stats']['Population']
Later we’ll see how hierarchical indexes assist in grouped operations.
Joining/Merging Multi-Indexed Datasets
Multi-indexes allow straightforward merging or joining of datasets having one or more matching indexes.
For example:
df1 = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
}).set_index(['A'])
df2 = pd.DataFrame({
'A': [1, 2],
'C': [5, 6]
}).set_index(['A'])
df = df1.merge(df2, left_index=True, right_index=True)
print(df)
B C
A
1 3 5
2 4 6
Here df1
and df2
were merged on index A
producing a multi-indexed result set.
We can also join by multiple common indexes:
df3 = pd.DataFrame({
'A': [1, 2],
'B': [7, 8]
}).set_index(['A', 'B'])
df = df1.merge(df3, left_index=True, right_index=True)
print(df)
Any shared indexes when set as the left_index
and right_index
will merge two DataFrames like a database join.
Aggregating Hierarchical Data
A common operation on hierarchical data is aggregation across a specific index level. The groupby() method in Pandas groups data by one or more indexes allowing fast data aggregation.
For example, let’s analyze sales data by product categories:
data = {
'Category': ['Electronics', 'Electronics', 'Clothing', 'Home & Kitchen', 'Clothing'],
'Product': ['TV', 'Smart Phone', 'T-Shirt', 'Furniture', 'Trousers'],
'Revenue': [5000, 4000, 1000, 8000, 600]
}
df = pd.DataFrame(data)
# Set category as index
df = df.set_index('Category')
# Groupby index and aggregate
df.groupby(level=0).aggregate({'Revenue': 'sum'})
Revenue
Category
Clothing 1600
Electronics 9000
Home & Kitchen 8000
Here we were able to quickly sum revenue by product category using the set index.
We can also apply multiple aggregation functions together:
df.groupby(level=0).agg({'Revenue': ['sum', 'mean', 'max']})
This shows the flexibility of aggregations on hierarchical data.
Real World Use Cases
Let’s take a look at some real-world use cases where multi-indexing and hierarchical data structures unlock Pandas’ true potential for data analysis.
Modeling Geographical Data
Geo-spatial data can be modeled using a multi-index with location hierarchies like country, region, state, city acting as index levels:
locations = [
('Asia', 'Southern', 'India', 'Hyderabad'),
('Asia', 'Southern', 'India', 'Bangalore'),
('Asia', 'Eastern', 'China', 'Shanghai'),
('North America', 'Northern', 'USA', 'New York')
]
populations = [
12500000, 11500000, 26000000, 8500000
]
city_df = pd.DataFrame(populations, index=pd.MultiIndex.from_tuples(locations),
columns=['Population'])
print(city_df)
Population
Asia Southern India Hyderabad 12500000
Bangalore 11500000
Eastern China Shanghai 26000000
North America Northern USA New York 8500000
Here, multi-indexing allows us to model the geographical hierarchy in a tidy structure suitable for slicing:
city_df.loc[(['Asia', 'Southern', 'India'], :), 'Population']
# or
city_df.loc[('Asia', 'Southern'):, 'Population']
Time Series Analysis
Time series data with a DateTimeIndex can be converted into a multi-index with date parts as levels for handy analysis and visualization:
dates = pd.date_range('2019-01-01', '2019-12-31')
data = np.random.rand(len(dates))
df = pd.DataFrame(data, index=dates, columns=['Value'])
df.index = df.index.to_period('M')
# Convert to multi-index
df.index = [df.index.year, df.index.month]
print(df)
Value
2019 1 0.126918
2 0.526506
3 0.029268
...
2019 10 0.554134
11 0.710316
12 0.073417
Now the data is structured by year and month for intuitive querying:
df.loc[2019] # All months in 2019
df.loc[(2019, 12), :] # Dec 2019 data
Multiple time series can also be merged easily with date indexes.
Analysis Using Categorical Data
Data science datasets often contain categorical features like gender, country, education level etc.
We can model them using a category index:
data = {
'Name': ['Alice', 'Bob', 'Claire', 'Dylan'],
'Gender': ['F', 'M', 'F', 'M'],
'Age': [25, 32, 28, 40],
'Salary': [50000, 60000, 55000, 75000]
}
df = pd.DataFrame(data).set_index('Gender')
print(df)
Age Name Salary
F 25 Alice 50000
M 32 Bob 60000
F 28 Claire 55000
M 40 Dylan 75000
Now rows can be selected by category value:
df.loc['F'] # All females
We can even have multiple categories as a multi-index:
df = df.set_index(['Gender', 'Age'], drop=True)
df.sort_index()
Hierarchical indexing provides the foundation for pivot tables, cross tabs and analytical cubes frequently used in data science.
Conclusion
In summary, multi-indexing and hierarchical data structures in Pandas empower efficient data access and analysis. They allow handling higher dimensional data packed within a 2D DataFrame intuitively.
The key benefits are:
- Modeling complex data relationships like hierarchies
- Enabling efficient multi-dimensional querying and aggregation
- Straightforward merging of datasets on multiple indexes
- Flexible reshaping using advanced indexing
With this comprehensive guide, you should now feel comfortable applying these techniques to handle real-world data modeling and analysis challenges in Python.