Skip to content

Pandas: Filtering Groups in Python

Updated: at 05:39 AM

Pandas is a popular open-source Python library used for data analysis and manipulation. One of Pandas’ key features is the ability to group and aggregate datasets by one or more columns. After grouping data, we can apply various operations like filtering, transformations, and statistical analysis on each group separately.

Filtering groups in Pandas enables focused analysis on subsets of data that meet specific conditions. It allows us to narrow down large datasets and work with relevant groups only. Filtering groups by row or column values is a common task in data preparation and cleaning.

This comprehensive guide will demonstrate how to filter Pandas groups using different methods. We will cover:

Table of Contents

Open Table of Contents

Prerequisites

To follow this Pandas group filtering guide, you should have:

We will start with simple filtering methods before moving on to more advanced techniques. Let’s get started!

Importing Pandas and Sample Data

We import Pandas and NumPy and create a sample DataFrame data to demonstrate filtering of groups:

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'A', 'B', 'C'],
    ' Revenue': [500, 400, 300, 100, 200, 150],
    'Margin': [0.5, 0.3, 0.1, 0.2, 0.25, 0.15]
})

Output:

   Product  Revenue  Margin
0        A      500     0.5
1        B      400     0.3
2        C      300     0.1
3        A      100     0.2
4        B      200     0.25
5        C      150     0.15

Let’s group this sample data by the ‘Product’ column before we apply filters.

Grouping Data with .groupby()

We can group the DataFrame by one or more columns using Pandas’ .groupby() method:

grouped = data.groupby('Product')

This groups data by the unique values in the ‘Product’ column, returning a Pandas GroupBy object.

We can see the groups created:

grouped.groups

# {'A': [0, 3], 'B': [1, 4], 'C': [2, 5]}

The GroupBy object contains the original DataFrame divided into groups based on ‘Product’. Now we can filter on these groups.

Basic Group Filtering with .filter()

Pandas provides a .filter() method on grouped data to filter groups based on a function or condition.

.filter() keeps groups where the function evaluates to True and discards groups where False.

Let’s filter for groups with mean Revenue greater than 250:

revenue_filter = grouped.filter(lambda x: x['Revenue'].mean() > 250)
print(revenue_filter)
   Product  Revenue  Margin
0        A      500     0.5
1        B      400     0.3
2        C      300     0.1

Here we passed a simple lambda function to .filter() that calculates mean Revenue per group and keeps groups having mean > 250.

This filtered our data keeping only groups ‘A’, ‘B’ and ‘C’. The group with product ‘A’ had a lower mean Revenue but was kept as other groups in its category passed the filter.

.filter() works on entire groups and includes/excludes all rows of a group based on the filter condition.

Filtering Based on Group Properties

Beyond a custom function, Pandas has built-in group properties that can be used to filter groups.

For example, we can filter for groups having total Revenue greater than 500:

revenue_filter = grouped.filter(lambda x: x['Revenue'].sum() > 500)
print(revenue_filter)
  Product  Revenue  Margin
0       A      500     0.5
1       B      400     0.3

Here we passed a function to .filter() that sums the Revenue per group and keeps only those having total > 500.

Some other useful group properties are:

These properties allow filtering groups based on aggregated metrics without writing custom aggregation.

Query String Filters on Grouped Data

Pandas allows filtering groups using a query string syntax similar to filtering DataFrames.

We can pass a query expression to .filter():

import numpy as np

query_filter = grouped.filter('mean(Revenue) > 300')
print(query_filter)
  Product  Revenue  Margin
0       A      500     0.5
1       B      400     0.3
2       C      300     0.1

The query uses the mean() aggregation function and keeps groups having mean Revenue greater than 300.

This is a simpler and more expressive alternative to custom lambda functions. Query strings also work with multiple conditions:

grouped.filter('(Revenue > 200) & (Margin > 0.2)')

The ampersand & allows combining multiple query conditions.

Boolean Indexing on Grouped Data

Another method is to create a Boolean mask based on a condition and use it to slice the grouped data.

mask = (grouped.Revenue.transform('mean') > 300) & (grouped.Margin.transform('mean') > 0.2)

filtered = grouped[mask]

Here we calculated the mean Revenue and Margin for each group, compared against thresholds, and took the intersection using & to construct a Boolean mask.

This mask was used to index the groups and return only those meeting the criteria.

Benefits of Boolean indexing:

Using Custom Functions to Filter Groups

For more complex filtering logic, custom functions provide full control over group processing.

Let’s filter groups where total Revenue is above average:

def above_avg_revenue(group):
    avg = group['Revenue'].mean()
    return group['Revenue'].sum() > avg

filtered = grouped.filter(above_avg_revenue)

We defined a function accepting the group DataFrame as input, calculated the mean Revenue across groups, and returned a Boolean if group Revenue sum is greater.

Passing this function to .filter() will retain only groups that satisfy this custom criteria.

Some key pointers when using custom functions:

Filtering Hierarchical Groups and MultiIndexes

For hierarchical groups based on multiple columns, we can use a Pandas MultiIndex with .filter().

data = pd.DataFrame({
    'Dept': ['Sales', 'Sales', 'Tech', 'Tech'],
    'Product': ['A', 'B', 'C', 'D'],
    'Revenue': [100, 200, 300, 250]
})

grouped = data.groupby(['Dept', 'Product'])

This creates a MultiIndex with ‘Dept’ as the first level and ‘Product’ as second.

We can filter these hierarchical groups similar to before:

grouped.filter(lambda x: x['Revenue'].sum() > 150)
         Dept Product  Revenue
1      Sales       B      200
2       Tech       C      300
3       Tech       D      250

MultiIndex allows flexible grouping and filtering across multiple dimensions of data.

Conclusion

This guide covered key Pandas techniques to filter grouped data:

Filtering grouped data enables us to focus on relevant subsets and derive insights more efficiently.

Combining filtering with other Pandas transformations like apply, transform, and aggregate unlocks the full potential of exploring and analyzing segmented data.

I hope this guide provided you a comprehensive overview of filtering groups in Pandas. Please feel free to provide any feedback or suggestions to improve it further.