
An In-Depth Guide to NumPy Aggregations in Python

Updated: at 05:12 AM

NumPy is a fundamental Python package for scientific computing and data analysis. It provides an efficient implementation of multidimensional arrays and matrices, along with a large library of high-level mathematical functions that operate on them.

One of the most widely used features of NumPy is its aggregation functions, which summarize ndarray objects by applying statistical operations to the array elements. These include sum, mean, median, minimum, maximum, and standard deviation.

In this guide, we provide an overview of using NumPy aggregations in Python: the core functions, axis-wise application, weighted aggregates, boolean arrays, ufunc reductions, performance compared to Python built-ins, and use with pandas.

Overview of NumPy Aggregations

NumPy aggregates allow you to condense arrays into useful summary statistics with a single method call. This enables concise data exploration and analysis.

The main aggregation functions covered in this guide are np.sum, np.mean, np.median, np.min, np.max, and np.std.

These work on both single-dimensional and multi-dimensional arrays. Additional related functions like np.prod, np.cumsum, etc. are also available.

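By default (axis=None), an aggregation is computed over all elements of the array and returns a single value. Passing an axis argument computes it along that axis instead. The same applies to numpy.all and numpy.any, which test whether all or any of the elements evaluate to True.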
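To make the role of the axis argument concrete, here is a minimal sketch. The array name grid and its values are made up for illustration; the calls themselves are standard NumPy functions.

```python
import numpy as np

# Hypothetical sample data: 2 rows, 3 columns
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

total = np.sum(grid)              # axis=None (the default): reduce over every element
col_sums = np.sum(grid, axis=0)   # collapse the rows: one value per column
row_sums = np.sum(grid, axis=1)   # collapse the columns: one value per row

# np.any and np.all accept the same axis argument
any_gt5_per_row = np.any(grid > 5, axis=1)
all_positive = np.all(grid > 0)

print(total)            # 21
print(col_sums)         # [5 7 9]
print(row_sums)         # [ 6 15]
print(any_gt5_per_row)  # [False  True]
print(all_positive)     # True
```

A handy way to remember it: the axis you pass is the one that disappears from the result's shape.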

Importing NumPy

Before using NumPy aggregates, NumPy needs to be imported:

import numpy as np

The convention is to import NumPy with np as the alias.

Creating NumPy Arrays

The aggregates are applied to NumPy arrays. Let’s create a sample 1D array:

arr = np.array([5, 2, 9, 10, 15])

For multi-dimensional data, arrays of higher rank are used. For example:

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

The aggregates work consistently on arrays of any shape or size.
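As a quick illustration of that consistency, the sketch below applies the same sum to a 3D array. The name cube and its shape are assumptions chosen only for this example.

```python
import numpy as np

# Hypothetical 3D array: 2 blocks, each with 3 rows and 4 columns
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)    # 3
print(cube.shape)   # (2, 3, 4)

# A tuple of axes reduces over several dimensions at once
per_block = cube.sum(axis=(1, 2))
print(per_block)    # [ 66 210]

# Reducing over axis 0 leaves the remaining (3, 4) shape
across_blocks = cube.sum(axis=0)
print(across_blocks.shape)  # (3, 4)

print(cube.sum())   # 276
```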

NumPy Sum

The np.sum function adds up the elements of an array. For 1D arrays:

print(np.sum(arr))

# Output: 41

For 2D arrays, it sums along a particular axis:

print(np.sum(arr_2d, axis=0))

# Output: [5 7 9]

print(np.sum(arr_2d, axis=1))

# Output: [ 6 15]

The first computes column-wise sums while the second calculates row-wise sums.

We can also compute the total sum over all array elements:

print(np.sum(arr_2d))

# Output: 21
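One practical use of axis-wise sums is normalizing each row so it adds up to 1. The sketch below assumes a made-up array named counts and uses the keepdims option of sum, which keeps the reduced axis with length 1 so the result broadcasts back against the original array.

```python
import numpy as np

# Hypothetical counts: 2 rows, 3 columns
counts = np.array([[1, 2, 3],
                   [4, 5, 6]])

# keepdims=True gives shape (2, 1) instead of (2,)
row_totals = counts.sum(axis=1, keepdims=True)
print(row_totals.shape)   # (2, 1)

# Broadcasting divides every row by its own total
fractions = counts / row_totals

# Each row of fractions now sums to 1 (up to floating-point rounding)
print(np.allclose(fractions.sum(axis=1), 1.0))   # True
```

Without keepdims=True, row_totals would have shape (2,), which does not broadcast against a (2, 3) array, so the division would raise an error.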

NumPy Mean

The np.mean aggregate calculates the arithmetic mean or average:

print(np.mean(arr))

# Output: 8.2

For 2D arrays:

print(np.mean(arr_2d, axis=0))

# Output: [2.5 3.5 4.5]

print(np.mean(arr_2d, axis=1))

# Output: [2. 5.]

With axis=0 we get the mean of each column; with axis=1, the mean of each row.

The overall mean is given by:

print(np.mean(arr_2d))

# Output: 3.5

NumPy Median

The median or middle value of the data is obtained using np.median:

print(np.median(arr))

# Output: 9.0

For 2D arrays:

print(np.median(arr_2d, axis=0))

# Output: [2.5 3.5 4.5]

print(np.median(arr_2d, axis=1))

# Output: [2. 5.]

Again, axis=0 gives the median of each column and axis=1 the median of each row.

The overall median is:

print(np.median(arr_2d))

# Output: 3.5
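The median is often preferred over the mean when the data contains outliers. The following sketch illustrates this with the same values as the 1D example plus one made-up extreme value; the variable names are chosen for this example only.

```python
import numpy as np

values = np.array([5, 2, 9, 10, 15])

# Append a hypothetical outlier
with_outlier = np.append(values, 1000)

print(np.mean(values), np.median(values))
# 8.2 9.0

print(np.mean(with_outlier), np.median(with_outlier))
# 173.5 9.5
```

The mean jumps from 8.2 to 173.5, while the median barely moves. Note that for an even number of elements, np.median averages the two middle values (here 9 and 10).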

NumPy Minimum and Maximum

np.min and np.max return the minimum and maximum elements:

print(np.min(arr))

# Output: 2

print(np.max(arr))

# Output: 15

For 2D arrays:

print(np.min(arr_2d, axis=0))

# Output: [1 2 3]

print(np.max(arr_2d, axis=1))

# Output: [3 6]

Here axis=0 gives the minimum of each column, and axis=1 gives the maximum of each row.

The overall extrema are given by:

print(np.min(arr_2d))

# Output: 1

print(np.max(arr_2d))

# Output: 6
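Often we want to know where the extreme value sits, not just what it is. The sketch below uses np.argmin, np.argmax and np.unravel_index for that; the array name table is an assumption made for illustration.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Without an axis, argmax returns an index into the flattened array
flat_pos = np.argmax(table)
print(flat_pos)   # 5

# Convert the flat index back into (row, column) coordinates
row, col = np.unravel_index(flat_pos, table.shape)
print(row, col)   # 1 2

# With an axis, we get one position per row or column
print(np.argmin(table, axis=1))   # [0 0]

# The range of the data is simply max minus min
value_range = np.max(table) - np.min(table)
print(value_range)   # 5
```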

NumPy Standard Deviation

The np.std function computes the standard deviation, a measure of how dispersed the data is around its mean:

print(np.std(arr))

# Output: 4.45 (rounded)

Applied to 2D arrays:

print(np.std(arr_2d, axis=0))

# Output: [1.5 1.5 1.5]

print(np.std(arr_2d, axis=1))

# Output: [0.82 0.82] (rounded)

We get the standard deviation of each column (axis=0) and of each row (axis=1).

The overall standard deviation is:

print(np.std(arr_2d))

# Output: 1.71 (rounded)

By default, NumPy calculates the population standard deviation (ddof=0), which divides by N. To compute the sample standard deviation, which divides by N - 1, we pass ddof=1:

print(np.std(arr, ddof=1))

# Output: 4.97 (rounded)
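To see what the ddof parameter does, it helps to compute the standard deviation by hand. This is a minimal sketch using the same five values as the 1D example; the variable names are introduced here for illustration.

```python
import numpy as np

data = np.array([5, 2, 9, 10, 15])

# Squared deviations from the mean
deviations = data - data.mean()
sum_sq = np.sum(deviations ** 2)   # about 98.8

# ddof ("delta degrees of freedom") is subtracted from N in the divisor
population_std = np.sqrt(sum_sq / data.size)        # divisor N, matches ddof=0
sample_std = np.sqrt(sum_sq / (data.size - 1))      # divisor N - 1, matches ddof=1

# The manual results agree with np.std
assert np.isclose(population_std, np.std(data))
assert np.isclose(sample_std, np.std(data, ddof=1))

print(round(population_std, 2), round(sample_std, 2))
# 4.45 4.97
```

The sample version is always slightly larger, and the difference shrinks as the array grows.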

Weighted Aggregations

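The np.average function accepts a weights parameter for computing a weighted average:

arr = np.array([1, 2, 3, 4])
weights = np.array([0.2, 0.3, 0.1, 0.4])

print(np.average(arr, weights=weights))

# Output: 2.7 (up to floating-point rounding)

This computes the weighted average, i.e. the sum of arr * weights divided by the sum of the weights. Note that np.sum, np.mean and np.std do not take a weights argument; a weighted sum can be written as np.sum(arr * weights).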
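The weighted average can be reproduced by hand, and the same idea extends to a weighted standard deviation. In the sketch below, the weighted variance is defined as the weighted average of squared deviations from the weighted mean; this is one common definition and an assumption of this example, not a built-in NumPy function.

```python
import numpy as np

values = np.array([1, 2, 3, 4])
weights = np.array([0.2, 0.3, 0.1, 0.4])

# Weighted mean: sum of value * weight, divided by the total weight
weighted_sum = np.sum(values * weights)
weighted_mean = weighted_sum / np.sum(weights)
assert np.isclose(weighted_mean, np.average(values, weights=weights))

# Weighted variance and standard deviation (assumed definition, see above)
weighted_var = np.average((values - weighted_mean) ** 2, weights=weights)
weighted_std = np.sqrt(weighted_var)

print(round(weighted_mean, 2))  # 2.7
print(round(weighted_var, 2))   # 1.41
```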

Aggregates for Boolean Arrays

NumPy aggregates also work on boolean arrays, treating True as 1 and False as 0:

bool_arr = np.array([True, False, True])

print(np.sum(bool_arr))

# Output: 2

print(np.mean(bool_arr))

# Output: 0.6666666666666666

This allows aggregations directly on boolean masks.
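A typical use is counting, or computing the fraction of, elements that satisfy a condition. The sketch below assumes a made-up array named readings and a threshold of 8 chosen only for illustration.

```python
import numpy as np

readings = np.array([5, 2, 9, 10, 15])

# Comparison produces a boolean mask
mask = readings > 8
print(mask)                    # [False False  True  True  True]

print(np.sum(mask))            # 3  (how many readings exceed 8)
print(np.mean(mask))           # 0.6  (fraction of readings exceeding 8)
print(np.count_nonzero(mask))  # 3  (same count, dedicated function)

# The mask can also select the matching elements before aggregating
selected_mean = readings[mask].mean()   # mean of 9, 10 and 15
```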

Aggregating With reduce

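Every NumPy ufunc has a reduce method that applies the operation repeatedly along an axis until a single result remains:

arr = np.arange(5)

print(np.add.reduce(arr))

# Output: 10

This adds up all elements of the array, equivalent to np.sum(arr). We can reduce with other ufuncs as well, for example np.multiply.reduce, np.minimum.reduce and np.maximum.reduce. For running (cumulative) results, use the accumulate method instead.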
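The difference between reduce and accumulate is easiest to see side by side: reduce collapses the array to a single value, while accumulate keeps every intermediate result. The arrays below are made up for this sketch.

```python
import numpy as np

seq = np.arange(1, 6)   # [1 2 3 4 5]

# reduce: one final value
print(np.add.reduce(seq))        # 15
print(np.multiply.reduce(seq))   # 120

# accumulate: running results, same length as the input
print(np.add.accumulate(seq))    # [ 1  3  6 10 15]

# Running maximum of a hypothetical series
series = np.array([3, 1, 4, 1, 5])
running_max = np.maximum.accumulate(series)
print(running_max)               # [3 3 4 4 5]
```

np.add.accumulate gives the same result as np.cumsum, and np.multiply.accumulate the same as np.cumprod.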

Comparison to Built-in sum() and min()/max()

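NumPy aggregates are typically much faster than the built-in Python functions on NumPy arrays, because the loop over the elements runs in compiled code:

import time

arr_large = np.random.rand(1000000)

s = time.perf_counter()
res = sum(arr_large)  # built-in sum iterates element by element in Python
print(time.perf_counter() - s)

# Output: 0.08 (timings vary by machine)

s = time.perf_counter()
res = np.sum(arr_large)  # vectorized sum
print(time.perf_counter() - s)

# Output: 0.0001 (timings vary by machine)

So prefer NumPy aggregations when working with NumPy arrays.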
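A single timing can be noisy, so for a more reliable comparison we can repeat each call with the standard-library timeit module. This is a rough sketch: the array size and repeat count are arbitrary choices, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np

# Seeded generator so the data is reproducible
rng = np.random.default_rng(0)
samples = rng.random(100_000)

# Total time for 5 calls of each variant
builtin_time = timeit.timeit(lambda: sum(samples), number=5)
numpy_time = timeit.timeit(lambda: np.sum(samples), number=5)

print(f"built-in sum: {builtin_time:.4f} s")
print(f"np.sum:       {numpy_time:.4f} s")

# Both approaches agree on the result up to floating-point rounding
assert np.isclose(sum(samples), np.sum(samples))
```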

Aggregations on Pandas Dataframes

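The columns of a pandas DataFrame can be aggregated via the .agg() method by passing NumPy functions. Note that recent pandas releases may emit a FutureWarning for this form and suggest the string alias "sum" instead: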

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

print(df.agg(np.sum))

# Output:
# A     6
# B    15
# dtype: int64

NumPy aggregates thus integrate cleanly into a Pandas workflow.
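The .agg() method also accepts a list of aggregations, or a dictionary mapping columns to aggregations. The sketch below uses string aliases such as "sum" and "mean"; the DataFrame name frame is an assumption made for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Several aggregations at once: one row per aggregation, one column per field
summary = frame.agg(["sum", "mean", "max"])
print(summary)

# A different aggregation per column
per_column = frame.agg({"A": "sum", "B": "mean"})
print(per_column)

# The results agree with applying NumPy directly to the underlying arrays
assert summary.loc["sum", "A"] == np.sum(frame["A"].to_numpy())
assert summary.loc["mean", "B"] == np.mean(frame["B"].to_numpy())
```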

Conclusion

In this guide, we explored how to use NumPy aggregations including np.sum, np.mean, np.median, np.min, np.max and np.std on 1D and 2D arrays. We looked at their usage, axis-wise application, weighted aggregations and performance compared to Python built-ins. Finally, we saw how NumPy aggregates can be applied to Pandas dataframes.

NumPy aggregation methods are essential for summarizing and understanding your dataset. They condense arrays into useful statistics that form the basis for analysis and visualization. With the simple examples discussed here, you should be able to start applying NumPy aggregates to real-world data manipulation and exploration tasks.