Pandas is a popular Python library for data analysis and manipulation. It provides highly optimized performance and productivity using DataFrames to organize and analyze data. One of Pandas’ most useful features is the ability to calculate aggregate statistics on DataFrame columns, allowing data scientists to understand the distributions and central tendencies of their data. This comprehensive guide will examine how to find the count, sum, mean, and median of data in Pandas DataFrames using multiple methods.
Table of Contents
Open Table of Contents
Overview of Pandas Aggregate Statistics
Aggregate statistics provide summary information about a dataset. They allow us to understand overall properties and trends without needing to analyze each individual data point. Some common aggregate statistics calculated in data analysis include:
- Count: The number of non-NaN values.
- Sum: The total of all values.
- Mean: The arithmetic average value.
- Median: The middle value of a sorted dataset.
Pandas includes several implementations for each of these aggregate statistics. The subsections below will demonstrate multiple approaches to find these metrics on DataFrame columns using Pandas, along with detailed explanations and example code.
Importing Pandas and Sample Data
To begin using Pandas aggregate functions, we first need to import Pandas and load some sample data:
import pandas as pd
data = {'Score': [15, 18, 11, 19, 10, 16, 18, 19, 12, 14]}
df = pd.DataFrame(data)
This creates a Pandas DataFrame df
with one column named ‘Score’ containing some integer values. We will use this simple dataset to illustrate the various aggregation methods below.
Count of Non-Null Values
Finding the count of non-NaN values in a Pandas Series or DataFrame column is a common aggregation task. This allows us to understand the amount of valid data we have to work with in a column. Pandas provides multiple approaches to count values:
1. .count()
Method
The easiest way to get the count in Pandas is using the .count()
method on a Series or DataFrame:
df['Score'].count()
# 10
This returns the number of non-NaN values in the ‘Score’ column.
2. df.agg('count')
Using the .agg()
method on a DataFrame allows aggregation functions to be applied to each column. Passing 'count'
will return counts per column:
df.agg('count')
# Score 10
# dtype: int64
3. Value Counts on Entire DataFrame
To count all valid values across every column, use df.count()
. This sums the non-NaN values across the entire DataFrame:
df.count()
# Score 10
# dtype: int64
Calculating Column Sum
Finding the sum of values in a column is useful to understand the overall magnitude of the data. Pandas provides several approaches to get column sums:
1. .sum()
Method
The easiest way to get the sum is using the .sum()
method.
df['Score'].sum()
# 164
2. df.agg('sum')
We can also pass 'sum'
as the aggregation function to .agg()
to get sums by column:
df.agg('sum')
# Score 164
# dtype: int64
3. .sum(axis='columns')
Setting axis='columns'
sums across each row instead of down columns:
df.sum(axis='columns')
0 15
1 18
2 11
3 19
4 10
5 16
6 18
7 19
8 12
9 14
dtype: int64
Calculating the Mean
The mean, or average, is found by dividing the sum of values by the count. Pandas provides a few options for finding the mean:
1. .mean()
Method
The easiest way to get the mean is using the .mean()
method:
df['Score'].mean()
# 16.4
2. df.agg('mean')
Passing 'mean'
as the aggregation function to .agg()
will return means by column:
df.agg('mean')
# Score 16.4
# dtype: float64
3. .sum()
and .count()
We can manually calculate the mean ourselves by getting the sum and count, then dividing:
total = df['Score'].sum()
num_items = df['Score'].count()
mean = total / num_items
# mean = 164 / 10
# mean = 16.4
This allows full control over the calculation.
Finding the Median Value
The median is the middle value of a sorted dataset - a robust measure of central tendency. Pandas has a few methods to find the median:
1. .median()
Method
The .median()
method will return the median directly:
df['Score'].median()
# 16.5
2. .quantile(0.5)
Since the median represents the 50th percentile, we can also use .quantile()
to calculate it:
df['Score'].quantile(0.5)
# 16.5
3. .agg('median')
Passing 'median'
as the aggregation method to .agg()
returns the median per column:
df.agg('median')
# Score 16.5
# dtype: float64
4. Manual Calculation
For full control, we can manually sort the column using .sort_values()
, then grab the middle index:
values = df['Score'].sort_values()
median_index = int(len(values) / 2)
median = values.iloc[median_index]
# median = 16.5
This allows handling any edge cases yourself.
Conclusion
This guide covered several methods to calculate count, sum, mean, and median aggregate statistics on Pandas DataFrame columns. The .sum()
, .mean()
, .median()
, and .count()
methods provide simple access to these metrics. For more control, we can use manual calculations, .agg()
, .quantile()
, and more. Calculating aggregates allows us to better understand our data during exploration and analysis.
The examples demonstrated on a simple one-column Pandas DataFrame can be extended across larger, real-world datasets with many columns and data types. Pandas’ speed and flexibility with aggregations makes it a popular choice for data science and analysis. For even more functionality, pandas can be used alongside other Python data tools like NumPy and SciPy.