NumPy is a fundamental Python package for scientific computing and data analysis. It provides efficient implementation of multidimensional arrays and matrices along with a vast library of high-level mathematical functions to operate on these arrays.
One of the most widely used features of NumPy is aggregation: functions that summarize ndarray objects by applying statistical operations to the array elements. These include sum, mean, median, minimum, maximum, and standard deviation.
In this guide, we will provide a comprehensive overview of using NumPy aggregations in Python. We will cover the following topics:
Table of Contents
- Overview of NumPy Aggregations
- Importing NumPy
- Creating NumPy Arrays
- NumPy Sum
- NumPy Mean
- NumPy Median
- NumPy Minimum and Maximum
- NumPy Standard Deviation
- Weighted Aggregations
- Aggregates for Boolean Arrays
- Accumulate Aggregates With reduce
- Comparison to Built-in sum() and min()/max()
- Aggregations on Pandas Dataframes
- Conclusion
Overview of NumPy Aggregations
NumPy aggregates allow you to condense arrays into useful summary statistics with a single function call. This enables concise data exploration and analysis.
The main aggregation functions provided by NumPy are:
- np.sum: calculates the sum of array elements.
- np.mean: computes the arithmetic mean (average).
- np.median: finds the median (middle value) of the data.
- np.min: gets the minimum element of the array.
- np.max: returns the maximum element.
- np.std: calculates the standard deviation.
These work on both single-dimensional and multi-dimensional arrays. Additional related functions like np.prod, np.cumsum, etc. are also available.
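By default (axis=None), each aggregation is computed over the whole array and returns a single value. Passing an axis argument aggregates along that axis instead. The boolean reductions numpy.all and numpy.any follow the same rule: they test the entire array unless an axis is given.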
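As a minimal sketch of how the axis argument changes the result, the following uses a small sample array (the name grid is only for this illustration):

```python
import numpy as np

# Hypothetical 2x3 sample array used only for this illustration.
grid = np.array([[1, 2, 3], [4, 5, 6]])

total = np.sum(grid)             # axis=None (default): one value for the whole array
col_sums = np.sum(grid, axis=0)  # collapse the rows: one value per column
row_sums = np.sum(grid, axis=1)  # collapse the columns: one value per row

# np.any and np.all accept the same axis argument.
any_large = np.any(grid > 5)               # is any element greater than 5?
rows_all_small = np.all(grid < 4, axis=1)  # per row: is every element below 4?

print(total)           # 21
print(col_sums)        # [5 7 9]
print(row_sums)        # [ 6 15]
print(any_large)       # True
print(rows_all_small)  # [ True False]
```

A useful way to read axis is "the dimension that disappears": axis=0 removes the row dimension and leaves one result per column.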
Importing NumPy
Before using NumPy aggregates, NumPy needs to be imported:
import numpy as np
The convention is to import NumPy with np as the alias.
Creating NumPy Arrays
The aggregates are applied to NumPy arrays. Let’s create a sample 1D array:
arr = np.array([5, 2, 9, 10, 15])
For multi-dimensional data, arrays of higher rank are used. For example:
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
The aggregates work consistently on arrays of any shape or size.
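Before aggregating along an axis, it can help to check how many axes an array has. A short sketch using the two sample arrays from this section:

```python
import numpy as np

arr = np.array([5, 2, 9, 10, 15])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

# ndim is the number of axes; shape gives the length of each axis.
print(arr.ndim, arr.shape)        # 1 (5,)
print(arr_2d.ndim, arr_2d.shape)  # 2 (2, 3)
```

Valid axis values for an aggregation run from 0 to ndim - 1, so arr_2d accepts axis=0 and axis=1.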
NumPy Sum
The np.sum function adds up the elements of an array. For 1D arrays:
print(np.sum(arr))
# Output: 41
For 2D arrays, it sums along a particular axis:
print(np.sum(arr_2d, axis=0))
# Output: [5 7 9]
print(np.sum(arr_2d, axis=1))
# Output: [ 6 15]
The first computes column-wise sums while the second calculates row-wise sums.
We can also compute the total sum over all array elements:
print(np.sum(arr_2d))
# Output: 21
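Axis-wise sums are often used to normalize data. The sketch below assumes we want each row of arr_2d expressed as shares of its row total. It uses the keepdims parameter of np.sum so the result broadcasts against the original array:

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

# keepdims=True keeps the reduced axis with length 1, so the result has
# shape (2, 1) instead of (2,) and broadcasts against the (2, 3) array.
row_totals = np.sum(arr_2d, axis=1, keepdims=True)
row_shares = arr_2d / row_totals

print(row_totals.shape)           # (2, 1)
print(np.sum(row_shares, axis=1)) # each row now sums to 1, up to rounding
```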
NumPy Mean
The np.mean aggregate calculates the arithmetic mean or average:
print(np.mean(arr))
# Output: 8.2
For 2D arrays:
print(np.mean(arr_2d, axis=0))
# Output: [2.5 3.5 4.5]
print(np.mean(arr_2d, axis=1))
# Output: [2. 5.]
With axis=0 we get the mean of each column; with axis=1 we get the mean of each row.
The overall mean is given by:
print(np.mean(arr_2d))
# Output: 3.5
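The mean is just the sum divided by the number of elements. A quick check of that relationship on the sample arrays:

```python
import numpy as np

arr = np.array([5, 2, 9, 10, 15])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

manual_mean = np.sum(arr) / arr.size  # 41 / 5

# Column means: column sums divided by the number of rows.
manual_col_means = np.sum(arr_2d, axis=0) / arr_2d.shape[0]

print(manual_mean)                                             # 8.2
print(np.allclose(manual_col_means, np.mean(arr_2d, axis=0)))  # True
print(np.mean(arr).dtype)  # float64, even though arr holds integers
```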
NumPy Median
The median or middle value of the data is obtained using np.median:
print(np.median(arr))
# Output: 9.0
For 2D arrays:
print(np.median(arr_2d, axis=0))
# Output: [2.5 3.5 4.5]
print(np.median(arr_2d, axis=1))
# Output: [2. 5.]
Medians along rows and columns are computed.
The overall median is:
print(np.median(arr_2d))
# Output: 3.5
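To see where these values come from, the median can be reproduced by sorting: with an odd number of elements it is the middle one, and with an even number it is the average of the two middle ones. A minimal sketch (the names odd and even are only for illustration):

```python
import numpy as np

odd = np.array([5, 2, 9, 10, 15])   # five elements
even = np.array([1, 2, 3, 4, 5, 6]) # six elements, as in the flattened arr_2d

sorted_odd = np.sort(odd)
middle_odd = sorted_odd[len(sorted_odd) // 2]  # the single middle value

sorted_even = np.sort(even)
mid = len(sorted_even) // 2
middle_even = (sorted_even[mid - 1] + sorted_even[mid]) / 2  # (3 + 4) / 2

print(middle_odd, np.median(odd))    # 9 9.0
print(middle_even, np.median(even))  # 3.5 3.5
```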
NumPy Minimum and Maximum
np.min and np.max return the minimum and maximum elements:
print(np.min(arr))
# Output: 2
print(np.max(arr))
# Output: 15
For 2D arrays:
print(np.min(arr_2d, axis=0))
# Output: [1 2 3]
print(np.max(arr_2d, axis=1))
# Output: [3 6]
Axis-wise minimums and maximums are computed.
The overall extrema are given by:
print(np.min(arr_2d))
# Output: 1
print(np.max(arr_2d))
# Output: 6
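When the position of an extreme value matters more than the value itself, the related functions np.argmin and np.argmax return indices instead. A short sketch on the same sample arrays:

```python
import numpy as np

arr = np.array([5, 2, 9, 10, 15])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(np.argmin(arr), np.argmax(arr))  # 1 4

# Without an axis, argmax returns an index into the flattened array;
# np.unravel_index converts it back to a (row, column) pair.
flat_index = np.argmax(arr_2d)
row, col = np.unravel_index(flat_index, arr_2d.shape)
print(int(row), int(col))              # 1 2

print(np.argmax(arr_2d, axis=1))       # [2 2]
```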
NumPy Standard Deviation
The standard deviation, computed with np.std, indicates how dispersed the data is:
print(np.std(arr))
# Output: 4.45 (rounded)
Applied to 2D arrays:
print(np.std(arr_2d, axis=0))
# Output: [1.5 1.5 1.5]
print(np.std(arr_2d, axis=1))
# Output: [0.82 0.82] (rounded)
We get standard deviations for each column and row.
The overall standard deviation is:
print(np.std(arr_2d))
# Output: 1.71 (rounded)
By default (ddof=0), NumPy calculates the population standard deviation, dividing by N. To compute the sample standard deviation, which divides by N - 1, we pass ddof=1:
print(np.std(arr, ddof=1))
# Output: 4.97 (rounded)
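The effect of ddof is easiest to see by computing both versions by hand. The sketch below divides the sum of squared deviations by N for the population value and by N - 1 for the sample value, then compares each with np.std:

```python
import numpy as np

arr = np.array([5, 2, 9, 10, 15])

deviations = arr - np.mean(arr)
sum_of_squares = np.sum(deviations ** 2)

population_std = np.sqrt(sum_of_squares / arr.size)    # divide by N
sample_std = np.sqrt(sum_of_squares / (arr.size - 1))  # divide by N - 1

print(np.isclose(population_std, np.std(arr)))      # True: ddof=0 is the default
print(np.isclose(sample_std, np.std(arr, ddof=1)))  # True
print(round(float(population_std), 2), round(float(sample_std), 2))  # 4.45 4.97
```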
Weighted Aggregations
We can compute a weighted average by passing a weights argument to np.average:
arr = np.array([1, 2, 3, 4])
weights = np.array([0.2, 0.3, 0.1, 0.4])
print(np.average(arr, weights=weights))
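# Output: 2.7 (up to floating-point rounding)
This computes the weighted average. Note that np.sum, np.mean and np.std do not accept a weights argument; a weighted sum can be computed directly as np.sum(arr * weights).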
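As a sketch of what the weighted average does under the hood, and of how a weighted standard deviation could be built from it, the example below uses the same numbers. The weighted standard deviation here is one common population-style definition, chosen as an assumption for illustration; NumPy has no built-in function for it.

```python
import numpy as np

values = np.array([1, 2, 3, 4])
weights = np.array([0.2, 0.3, 0.1, 0.4])

# Weighted mean: sum of value * weight, divided by the sum of the weights.
weighted_sum = np.sum(values * weights)
weighted_mean = weighted_sum / np.sum(weights)

# Assumed definition: weighted average of squared deviations from the
# weighted mean, then the square root.
weighted_var = np.average((values - weighted_mean) ** 2, weights=weights)
weighted_std = np.sqrt(weighted_var)

print(np.isclose(weighted_mean, np.average(values, weights=weights)))  # True
print(round(float(weighted_mean), 2))  # 2.7
print(round(float(weighted_std), 2))   # 1.19
```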
Aggregates for Boolean Arrays
NumPy aggregates also work on boolean arrays, treating True as 1 and False as 0:
bool_arr = np.array([True, False, True])
print(np.sum(bool_arr))
# Output: 2
print(np.mean(bool_arr))
# Output: 0.6666666666666666
This allows aggregations directly on boolean masks.
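In practice the boolean array usually comes from a comparison. A minimal sketch that counts and summarizes the elements above a threshold (the value 8 is an arbitrary choice for illustration):

```python
import numpy as np

arr = np.array([5, 2, 9, 10, 15])

mask = arr > 8  # [False False  True  True  True]

count_above = np.sum(mask)     # how many elements exceed 8
share_above = np.mean(mask)    # fraction of elements that exceed 8
sum_above = np.sum(arr[mask])  # sum of only the selected elements

print(count_above)             # 3
print(share_above)             # 0.6
print(sum_above)               # 34
print(np.count_nonzero(mask))  # 3, same count via a dedicated function
```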
Accumulate Aggregates With reduce
The reduce method of a NumPy ufunc, such as np.add.reduce, applies the ufunc repeatedly along an axis (axis 0 by default) until a single value remains:
arr = np.arange(5)
print(np.add.reduce(arr))
# Output: 10
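This returns the total of the array, the same value np.sum computes. Other ufuncs reduce the same way: np.multiply.reduce gives the product, np.minimum.reduce the minimum, and np.maximum.reduce the maximum. For running (cumulative) results, use the accumulate method instead, for example np.add.accumulate.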
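The difference between reducing to one value and keeping the running results can be sketched as follows (the sample values are chosen only for illustration):

```python
import numpy as np

arr = np.arange(1, 6)  # [1 2 3 4 5]

total = np.add.reduce(arr)             # one value: 1 + 2 + 3 + 4 + 5
running = np.add.accumulate(arr)       # every intermediate sum
product = np.multiply.reduce(arr)      # 1 * 2 * 3 * 4 * 5
largest = np.maximum.reduce(np.array([3, 7, 2]))

print(total)    # 15
print(running)  # [ 1  3  6 10 15]
print(product)  # 120
print(largest)  # 7
print(np.array_equal(running, np.cumsum(arr)))  # True
```

np.cumsum gives the same running totals as np.add.accumulate and is usually the more readable choice.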
Comparison to Built-in sum() and min()/max()
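NumPy aggregates are typically much faster than Python's built-in functions on NumPy arrays, because the loop runs in compiled code rather than in the interpreter:
import time
arr_large = np.random.rand(1000000)
s = time.perf_counter()
res = sum(arr_large)
print(time.perf_counter() - s)
# Output: 0.08 (seconds; exact value varies by machine)
s = time.perf_counter()
res = np.sum(arr_large)
print(time.perf_counter() - s)
# Output: 0.0001 (seconds; exact value varies by machine)
So prefer NumPy aggregations when working with NumPy arrays.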
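A single timing is noisy, so a somewhat more robust sketch uses the standard library's timeit module and keeps the best of a few runs. The array size and repeat count are arbitrary choices, and the measured times depend on the machine, so no expected timings are shown:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)    # seeded generator for reproducible data
arr_large = rng.random(1_000_000)

# Run each summation once per repeat and keep the fastest of three runs.
builtin_time = min(timeit.repeat(lambda: sum(arr_large), number=1, repeat=3))
numpy_time = min(timeit.repeat(lambda: np.sum(arr_large), number=1, repeat=3))

print(f"built-in sum: {builtin_time:.4f} s")
print(f"np.sum:       {numpy_time:.4f} s")

# Both approaches agree on the result up to floating-point rounding.
builtin_total = sum(arr_large)
numpy_total = np.sum(arr_large)
print(np.isclose(builtin_total, numpy_total))  # True
```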
Aggregations on Pandas Dataframes
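Pandas DataFrame columns can be aggregated with the .agg() method by passing a NumPy function:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.agg(np.sum))
# Output:
# A     6
# B    15
# dtype: int64
Passing the string 'sum' gives the same result, and some recent pandas versions may emit a FutureWarning when given the NumPy callable.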
NumPy aggregates thus integrate cleanly into a Pandas workflow.
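Several aggregations can be requested at once by passing a list of names to .agg(). The sketch below assumes pandas is installed and cross-checks the column sums against NumPy applied to the underlying array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# One row per aggregation, one column per DataFrame column.
summary = df.agg(['sum', 'mean', 'max'])
print(summary)

# The same column sums computed by NumPy on the raw values.
col_sums = np.sum(df.to_numpy(), axis=0)
print(col_sums)  # [ 6 15]
```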
Conclusion
In this guide, we explored how to use NumPy aggregations including np.sum, np.mean, np.median, np.min, np.max and np.std on 1D and 2D arrays. We looked at their usage, axis-wise application, weighted aggregations and performance compared to Python built-ins. Finally, we saw how NumPy aggregates can be applied to Pandas dataframes.
NumPy aggregation methods are essential for summarizing and understanding your dataset. They condense arrays into useful statistics that form the basis for analysis and visualization. With the simple examples discussed here, you should be able to start applying NumPy aggregates to real-world data manipulation and exploration tasks.