Skip to content

Pandas: Multiple Grouping Levels in Python

Updated: at 05:26 AM

Pandas is a popular Python library used for data analysis and manipulation. One of Pandas’ most powerful features is the ability to group data by one or more columns and perform aggregate functions on the grouped data.

Grouping data allows you to get a high-level summary of the data and make comparisons across groups. For example, you may want to group sales data by country and product category to analyze total sales by product in each country.

By default, Pandas groups data by a single column. However, Pandas also provides the ability to group by multiple columns or grouping “levels”. Grouping by multiple levels allows for more complex data analysis and insights by enabling you to create hierarchical groups and subgroups within your data.

In this comprehensive guide, you will learn:

Table of Contents

Open Table of Contents

Grouping Basics with Pandas

Before diving into multi-level grouping, let’s first go over the basics of grouping data in Pandas using the groupby() method.

The groupby() method splits the data into groups based on the values in one or more columns. It returns a DataFrameGroupBy object which contains the grouped data and enables aggregate functions to be performed on each group.

Here is the basic syntax for grouping with groupby():

df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False)

The by parameter specifies the column name(s) to group the data by.

For example:

import pandas as pd

data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
        'Product':['Maps','Search','Excel','Word','Ads','FB App'],
        'Sales':[200,400,300,600,100,150]}

df = pd.DataFrame(data)

# Group by Company
df_grouped = df.groupby('Company')

This groups the data by the Company column, creating a group for each unique company name.

Once the data is grouped, aggregation functions like sum(), mean(), count(), etc can be applied on the groups.

For example, to get the sum of sales for each company:

df_grouped['Sales'].sum()

Output:

Company
FB      250
GOOG    600
MSFT    900
Name: Sales, dtype: int64

This provides the total sales for each company, allowing us to analyze the data by company.

Now that we’ve covered the basics, let’s look at techniques for grouping by multiple columns or levels.

Grouping by Multiple Columns

To group by multiple columns, simply pass a list of column names to the groupby() method:

df_grouped = df.groupby(['Company', 'Product'])

This will first group by Company, then by Product within each company group.

For example, let’s group the sample data by Company and Product:

df_grouped = df.groupby(['Company', 'Product'])
df_grouped['Sales'].sum()

Output:

Company  Product
FB       Ads           100
          FB App        150
GOOG     Maps          200
          Search        400
MSFT     Excel         300
          Word          600
Name: Sales, dtype: int64

This provides sales breakdown by product in each company, allowing for more complex grouping and analysis.

We can also group by 3 or more columns by simply adding more column names to the list:

df_grouped = df.groupby(['Company', 'Product', 'Sales'])

When grouping by multiple columns, the leftmost column in the list is used as the outermost grouping key. Each subsequent column further breaks down the groups into smaller subgroups.

Working with GroupBy Objects

When grouping DataFrames using groupby(), a GroupBy object is returned. This object contains the grouped data and enables us to analyze the groups.

Here are some common operations that can be performed on GroupBy objects:

Aggregate Methods

Methods like sum(), mean(), count(), min(), and max() can be used to get aggregate statistics on the grouped data.

For example:

df_grouped = df.groupby('Company')

# Get sum of sales by company
df_grouped['Sales'].sum()

# Get mean of sales by company
df_grouped['Sales'].mean()

Transformation

The transform() method can be used to perform a function on each group and return the transformed data aligned with the original DataFrame.

For example, to calculate the z-score for each group:

df['Sales_zscore'] = df_grouped['Sales'].transform(lambda x: (x - x.mean())/x.std())

Filtering

The filter() method keeps groups based on a condition. For example:

df_grouped.filter(lambda x: len(x) > 2)

This keeps groups with more than 2 rows.

Iteration

You can also iterate through the groups in a GroupBy object using for:

for name, group in df_grouped:
  print(name)
  print(group)

This prints each group name and the rows in that group.

Now let’s explore some more advanced techniques for multi-level grouping in Pandas.

Grouping on Multi-Indexes

Pandas has special support for grouping on multi-indexed DataFrames.

A multi-indexed DataFrame is created by setting multiple columns as index levels using set_index().

For example:

df = df.set_index(['Company', 'Product'])

This creates a multi-index with Company as the outer index level and Product as the inner level.

When grouping a multi-indexed DataFrame, we pass the level numbers or names instead of column names:

df_grouped = df.groupby(level=0) # Group by outer level index
df_grouped = df.groupby(level='Company') # Group by index name

Grouping by inner index level:

df_grouped = df.groupby(level=1) # Group by Product

We can also group by multiple levels:

df_grouped = df.groupby(level=[0,1]) # Group by Company and Product

The ability to directly access the index levels for grouping enables convenient multi-level grouping on hierarchical indexes.

Analyzing Multi-Level Grouped Data

Once our data is grouped, we can use aggregate methods to analyze and summarize the multi-level grouped data.

For example, to summarize sales by product within each company:

df_grouped = df.groupby(level=[0,1])['Sales'].sum()

This returns a multi-indexed Series with sales by product for each company.

We can also group by level and pass the level name to analyze a specific index:

# Sum of sales by product
df.groupby(level='Product')['Sales'].sum()

# Count of sales by company
df.groupby(level='Company')['Sales'].count()

Pivoting Multi-Level Data

We can also pivot multi-indexed grouped data into a DataFrame using unstack():

df_grouped = df.groupby(level=[0,1])['Sales'].sum().unstack()

This pivots the 2-level grouped data to move the inner index level columns into a column index, allowing us to analyze the data in a pivot table format.

Flatten Hierarchy withxlwith MultiIndex

To analyze multi-indexed grouped data with aggregate methods, it can be useful to flatten the mutli-index down to a single level index.

The groupby(level=<level>).sum() methods keeps the hierarchy.

To flatten down to a single index, add .reset_index() after grouping:

# Flatten hierarchy
df_grouped = df.groupby(level=[0,1])['Sales'].sum().reset_index()

Now the grouped data will have a simple flat index, with the grouping columns moved into regular columns.

Renaming MultiIndex Levels

When working with multi-indexed grouped data, it can be useful to rename the index levels for greater clarity and conciseness.

The rename_axis() method can be used to rename one or more index levels.

For example:

df_grouped = df.groupby(level=[0,1])['Sales'].sum()

df_grouped = df_grouped.rename_axis(['Company_Name', 'Prod_Name'])

We can also pass a dictionary to rename specific levels:

df_grouped.rename_axis({'Company': 'Firm', 'Product' : 'Item'})

Renaming the levels helps make multi-indexed outputs more readable and understandable for further analysis.

Sorting MultiIndex Levels

When grouping by multiple columns, we may want to sort the groups in the output by one of the grouping levels.

The sort_index() method can be used to sort the index levels.

For example, to sort multi-level grouped output by the outer index:

df_grouped = df_grouped.sort_index(level=0)

This sorts alphabetically by the outer Company level while maintaining the inner grouping.

We can also sort by inner levels:

df_grouped.sort_index(level=1) # Sort by Product

And pass a list to sort by multiple levels:

df_grouped.sort_index(level=[0,1]) # Company, then Product

Sorting produces more organized outputs and can reveal insights by ordering the data in a logical way.

Use Cases for Multi-Level Grouping

Here are some examples of useful applications for multi-level grouping in data analysis:

Sales Analysis

Group sales data by customer, product, region, etc to analyze sales by product within each customer segment.

Stock Market Data

Group stock price data by sector and company to analyze performance trends within each sector.

Demographic Analysis

Group census data by state, city, age groups to compare demographic subgroups.

User Behavior Analysis

Analyze clicks, visits, or engagement by country, browser, operating system, and other attributes.

Marketing Campaigns

Group response rates by different marketing channels, campaigns, demographics to optimize spending.

Multi-Experiment Analysis

Group experiment results by experiment type, iteration, and parameters to analyze iterations.

Hierarchical Business Data

Group sales by business unit, region, product line, and other hierarchical company attributes.

Multi-level grouping enables powerful analytic capabilities for these types of complex real-world datasets.

Conclusion

Pandas grouping is a versatile tool for aggregated data analysis. Enabling multiple grouping levels takes Pandas to the next level, opening the door to advanced analytics.

In this guide, we covered:

Multi-level grouping unlocks the full potential of Pandas for exploring complex datasets. By mastering these techniques, you can uncover deeper insights and make important discoveries in your data. The methods discussed provide a solid toolkit for advanced analysis and manipulations using this key Python library.

For further learning, refer to the official Pandas documentation and continue practicing multi-level grouping on your own datasets to improve your skills.