Skip to content

A Comprehensive Guide to Aggregate Statistics in Pandas

Updated: at 03:37 AM

Pandas is a popular Python library for data analysis and manipulation. It provides highly optimized performance and productivity using DataFrames to organize and analyze data. One of Pandas’ most useful features is the ability to calculate aggregate statistics on DataFrame columns, allowing data scientists to understand the distributions and central tendencies of their data. This comprehensive guide will examine how to find the count, sum, mean, and median of data in Pandas DataFrames using multiple methods.

Table of Contents

Open Table of Contents

Overview of Pandas Aggregate Statistics

Aggregate statistics provide summary information about a dataset. They allow us to understand overall properties and trends without needing to analyze each individual data point. Some common aggregate statistics calculated in data analysis include:

Pandas includes several implementations for each of these aggregate statistics. The subsections below will demonstrate multiple approaches to find these metrics on DataFrame columns using Pandas, along with detailed explanations and example code.

Importing Pandas and Sample Data

To begin using Pandas aggregate functions, we first need to import Pandas and load some sample data:

import pandas as pd

data = {'Score': [15, 18, 11, 19, 10, 16, 18, 19, 12, 14]}
df = pd.DataFrame(data)

This creates a Pandas DataFrame df with one column named ‘Score’ containing some integer values. We will use this simple dataset to illustrate the various aggregation methods below.

Count of Non-Null Values

Finding the count of non-NaN values in a Pandas Series or DataFrame column is a common aggregation task. This allows us to understand the amount of valid data we have to work with in a column. Pandas provides multiple approaches to count values:

1. .count() Method

The easiest way to get the count in Pandas is using the .count() method on a Series or DataFrame:

# 10

This returns the number of non-NaN values in the ‘Score’ column.

2. df.agg('count')

Using the .agg() method on a DataFrame allows aggregation functions to be applied to each column. Passing 'count' will return counts per column:


# Score    10
# dtype: int64

3. Value Counts on Entire DataFrame

To count all valid values across every column, use df.count(). This sums the non-NaN values across the entire DataFrame:

# Score    10
# dtype: int64

Calculating Column Sum

Finding the sum of values in a column is useful to understand the overall magnitude of the data. Pandas provides several approaches to get column sums:

1. .sum() Method

The easiest way to get the sum is using the .sum() method.

# 164

2. df.agg('sum')

We can also pass 'sum' as the aggregation function to .agg() to get sums by column:


# Score    164
# dtype: int64

3. .sum(axis='columns')

Setting axis='columns' sums across each row instead of down columns:


0    15
1    18
2    11
3    19
4    10
5    16
6    18
7    19
8    12
9    14
dtype: int64

Calculating the Mean

The mean, or average, is found by dividing the sum of values by the count. Pandas provides a few options for finding the mean:

1. .mean() Method

The easiest way to get the mean is using the .mean() method:

# 16.4

2. df.agg('mean')

Passing 'mean' as the aggregation function to .agg() will return means by column:


# Score    16.4
# dtype: float64

3. .sum() and .count()

We can manually calculate the mean ourselves by getting the sum and count, then dividing:

total = df['Score'].sum()
num_items = df['Score'].count()
mean = total / num_items

# mean = 164 / 10
# mean = 16.4

This allows full control over the calculation.

Finding the Median Value

The median is the middle value of a sorted dataset - a robust measure of central tendency. Pandas has a few methods to find the median:

1. .median() Method

The .median() method will return the median directly:

# 16.5

2. .quantile(0.5)

Since the median represents the 50th percentile, we can also use .quantile() to calculate it:

# 16.5

3. .agg('median')

Passing 'median' as the aggregation method to .agg() returns the median per column:


# Score    16.5
# dtype: float64

4. Manual Calculation

For full control, we can manually sort the column using .sort_values(), then grab the middle index:

values = df['Score'].sort_values()
median_index = int(len(values) / 2)
median = values.iloc[median_index]

# median = 16.5

This allows handling any edge cases yourself.


This guide covered several methods to calculate count, sum, mean, and median aggregate statistics on Pandas DataFrame columns. The .sum(), .mean(), .median(), and .count() methods provide simple access to these metrics. For more control, we can use manual calculations, .agg(), .quantile(), and more. Calculating aggregates allows us to better understand our data during exploration and analysis.

The examples demonstrated on a simple one-column Pandas DataFrame can be extended across larger, real-world datasets with many columns and data types. Pandas’ speed and flexibility with aggregations makes it a popular choice for data science and analysis. For even more functionality, pandas can be used alongside other Python data tools like NumPy and SciPy.