NumPy is a fundamental Python package for scientific computing and data analysis. It provides an efficient implementation of multidimensional arrays and matrices, along with a large collection of high-level mathematical functions and operators that work on these arrays. NumPy is extremely useful for performing mathematical, statistical, and logical operations on arrays efficiently without writing explicit loops.
This comprehensive guide will provide an overview of NumPy and how to leverage its capabilities for mathematical and statistical computations in Python. We will cover the key features of NumPy arrays, vectorization, broadcasting, universal functions (ufuncs), aggregation, masking, sorting, random number generation, linear algebra, statistics, and more. Code examples are provided to illustrate the functionality.
Table of Contents
- Introduction
- Importing NumPy
- Creating NumPy Arrays
- Array Indexing and Slicing
- Broadcasted Operations
- Universal Array Functions
- Array Aggregations
- Mathematical and Statistical Functions
- Linear Algebra
- Sorting Arrays
- Masked Arrays
- Reshaping and Transposing Arrays
- Reading and Writing Array Data
- Conclusions
Introduction
NumPy aims to provide an efficient multidimensional array and matrix manipulation facility for Python while remaining interoperable with Python's built-in sequences such as lists and tuples. Some of the key features of NumPy include:
- N-dimensional array object ndarray with flexible indexing capabilities
- Broadcasting functions and vectorization of mathematical operations
- Standard mathematical functions for operations on arrays
- Tools for reading/writing array data to disk and working with memory-mapped files
- Linear algebra, random number generation, and FFT capabilities
- Useful aggregation and statistics methods
The ndarray provided by NumPy forms the central data structure for many other Python scientific computing packages like SciPy, Matplotlib, Pandas, scikit-learn, TensorFlow, and more. Understanding NumPy arrays and mathematical operations is essential for effective data analysis and machine learning with Python.
Let’s explore the essential NumPy capabilities for performing mathematical and statistical computations on arrays.
Importing NumPy
To start using NumPy, we first need to import the numpy package:
import numpy as np
The conventional alias np is used for the numpy module to make the code more concise.
Creating NumPy Arrays
The fundamental object of NumPy is the ndarray, a homogeneous multidimensional array. These arrays are fixed-size, and a newly created array stores its elements in one contiguous block of memory. We can create new arrays from lists or tuples using the np.array() function:
vector = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])
The array's dtype (data type) is inferred from the input data but can also be explicitly specified:
int_array = np.array([1, 2, 3], dtype=np.int32)
float_array = np.array([1.1, 2.2, 3.3], dtype=np.float64)
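As a small illustration of dtype inference, the sketch below shows how mixing integers and floats changes the inferred type, and how astype() converts afterwards. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])        # one float promotes the whole array

print(ints.dtype.kind)               # i (a signed integer; the exact width depends on the platform)
print(mixed.dtype)                   # float64

# astype() returns a converted copy; the fractional part is truncated here
truncated = mixed.astype(np.int32)   # [1 2 3]
```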
Useful array creation functions like zeros(), ones(), full(), arange(), linspace(), etc. are also provided for generating arrays populated with specific values.
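The creation functions named above can be sketched as follows. The shapes and values are arbitrary choices for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))            # 2x3 array of 0.0
ones = np.ones(4, dtype=np.int64)   # four integer ones
filled = np.full((2, 2), 7)         # 2x2 array filled with 7
steps = np.arange(0, 10, 2)         # [0 2 4 6 8]; the stop value is excluded
grid = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced points from 0 to 1; the stop value is included
```

Note the difference between the last two: arange() takes a step size, while linspace() takes the number of points.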
Multi-dimensional arrays can be created by passing in nested Python structures like lists of lists. The dimensions and shape of an array can be accessed through its ndim and shape attributes:
three_d_array = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(three_d_array.ndim) # 3
print(three_d_array.shape) # (2, 2, 2)
Array Indexing and Slicing
NumPy arrays facilitate flexible indexing and slicing with basic and advanced indexing capabilities. We can access elements at specific indices, obtain sections and subsets of the array, and assign new values.
Basic slicing syntax is similar to Python lists:
array = np.array([1, 2, 3, 4, 5])
# Get first 3 elements
array[:3]
# Get last 3 elements
array[2:]
Individual elements can be accessed via integer indices:
array[0] # 1
array[2] # 3
NumPy also supports multidimensional slicing, strided slicing, integer array (fancy) indexing, boolean indexing, and more:
two_d_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Get the bottom-right 2x2 sub-array
two_d_array[1:3, 1:3]
# Integer array (fancy) indexing to extract the main diagonal
two_d_array[[0,1,2], [0,1,2]]
# Boolean indexing
two_d_array[two_d_array > 2]
Assigning new values via indexing modifies the array in place:
array[0] = 9 # Change first element to 9
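In-place assignment also works with slices and boolean masks. The sketch below uses an illustrative array named values, and additionally shows that a basic slice is a view onto the original data, so writing through it changes the original array.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

values[1:3] = 0            # slice assignment      -> [ 1  0  0  4  5]
values[values > 3] = -1    # boolean-mask assignment -> [ 1  0  0 -1 -1]

window = values[:2]        # basic slicing returns a view, not a copy
window[0] = 99             # so this also changes values[0]
print(values)              # [99  0  0 -1 -1]
```

If an independent array is needed, call .copy() on the slice before modifying it.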
Broadcasted Operations
When an operation combines NumPy arrays of different shapes, the smaller array is broadcast across the larger one, provided their shapes are compatible. This allows vectorized operations without explicit looping.
For example, adding a scalar value to an ndarray:
array = np.array([[1, 2], [3, 4]])
array + 5
# [[6 7]
# [8 9]]
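The scalar value 5 is broadcast and added to each element. The same mechanism applies to any element-wise operation between a scalar or lower-dimensional array and a larger array, as long as the shapes are compatible: comparing the shapes from the last dimension backward, each pair of sizes must be equal or one of them must be 1.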
Arrays of the same shape are simply combined element by element:
array1 = np.array([1, 2, 3])
array2 = np.array([0, 2, 4])
array1 + array2
# [1 4 7]
No broadcasting is needed here because the shapes already match. When the shapes differ but are compatible, the smaller array's dimensions are conceptually stretched to fit the larger array, eliminating the need to loop over elements.
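Broadcasting between arrays whose shapes actually differ can be sketched as follows. The names matrix, row, and column are illustrative; the shapes are chosen so that each pair is compatible.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

plus_row = matrix + row               # row is reused for every row of matrix
# [[11 22 33]
#  [14 25 36]]

plus_column = matrix + column         # column is reused for every column of matrix
# [[101 102 103]
#  [204 205 206]]
```

An array of shape (2,) could not be added to matrix directly, because its last dimension (2) neither equals 3 nor is 1; NumPy raises a ValueError in that case.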
Universal Array Functions
NumPy provides a large set of vectorized universal array functions called ufuncs that perform element-wise operations on arrays. This allows efficient mathematical operations without Python loops.
For example:
array = np.array([1, 2, 3, 4])
np.sqrt(array) # Square root of each element
np.exp(array) # Exponential of each element
np.sin(array) # Sine of each element
These work with scalars or multiple array arguments:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
np.maximum(x, y) # Element-wise maximum
# [4 5 6]
NumPy provides ufuncs for arithmetic, comparison, trigonometric, exponential and logarithmic, bitwise, and other element-wise operations.
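To make the "no Python loops" point concrete, the sketch below computes the same result twice: once with an explicit loop and once with a ufunc. The input values are illustrative.

```python
import math
import numpy as np

values = np.array([1.0, 4.0, 9.0, 16.0])

# Explicit Python loop: one math.sqrt call per element
looped = np.array([math.sqrt(v) for v in values])

# Ufunc: a single call that runs the loop in compiled code
vectorized = np.sqrt(values)

print(np.array_equal(looped, vectorized))  # True
```

Both versions give the same numbers; the ufunc version is the idiomatic one and is typically much faster on large arrays.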
Array Aggregations
NumPy has built-in functions to compute aggregations over array elements like sum(), mean(), std(), var(), min(), max(), etc.
For example:
array = np.array([1, 3, 4, 7, 5])
array.mean() # 4.0
array.std() # 2.0
array.min() # 1
array.max() # 7
These can also be applied along specific axes of multidimensional arrays:
two_d_array = np.array([[1, 3],
[5, 7]])
two_d_array.sum(axis=0) # [6 10]
two_d_array.min(axis=1) # [1 5]
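The axis argument names the dimension that is collapsed. A short sketch, using an illustrative scores array with samples in rows and features in columns:

```python
import numpy as np

scores = np.array([[1.0, 10.0],
                   [3.0, 30.0],
                   [5.0, 50.0]])      # 3 rows (samples), 2 columns (features)

col_means = scores.mean(axis=0)       # collapse the rows    -> [ 3. 30.]
row_sums = scores.sum(axis=1)         # collapse the columns -> [11. 33. 55.]

# keepdims=True keeps the reduced axis with length 1, so the result
# broadcasts cleanly against the original array
centered = scores - scores.mean(axis=0, keepdims=True)
# [[ -2. -20.]
#  [  0.   0.]
#  [  2.  20.]]
```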
Mathematical and Statistical Functions
In addition to the aggregations above, NumPy has a large library of vectorized mathematical and statistical functions that operate on entire arrays:
x = np.arange(1, 6) # [1 2 3 4 5]; starts at 1 so that log(x) is finite
y = np.array([2, 4, 5, 4, 5])
np.power(x, 3) # x^3
np.square(x) # x^2
np.log(x) # ln(x)
np.median(x) # 3.0
np.corrcoef(x, y) # 2x2 correlation matrix of x and y
These provide efficient implementations of commonly used mathematical formulas, norms, products, regression, etc. without explicit loops.
NumPy's random module provides various distributions and methods for random sampling, which are useful for simulations and probabilistic modeling:
from numpy import random
samples = random.normal(size=1000) # Gaussian
random.binomial(n=10, p=0.5, size=10) # Binomial
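For reproducible results, the same sampling can be done through a seeded Generator object created with np.random.default_rng(), which is the interface the NumPy documentation recommends for new code. This is a sketch: the seed value 42 and the variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value is arbitrary

normal_samples = rng.normal(loc=0.0, scale=1.0, size=1000)   # Gaussian
coin_flips = rng.binomial(n=10, p=0.5, size=10)              # Binomial
dice = rng.integers(low=1, high=7, size=5)                   # integers 1..6; high is exclusive
```

Creating a second generator with the same seed reproduces exactly the same sequence of samples.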
Linear Algebra
NumPy has a linalg module for linear algebra operations on arrays. Together with the product functions in the top-level numpy namespace, this includes functions for:
- Solving systems of linear equations
- Matrix and vector products (dot, inner, outer, etc.)
- Matrix decompositions like Cholesky, Eigenvalue, SVD
- Matrix inverse, determinants, norms and other transformations
For example:
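import numpy.linalg as linalg
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
# dot() lives in the top-level numpy namespace, not in linalg
dot_product = np.dot(x, y) # Standard matrix product, same as x @ y
# eig() returns a pair: the eigenvalues and the matrix of eigenvectors
eigenvalues, eigenvectors = linalg.eig(x)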
This makes NumPy very useful for applied linear algebra.
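Solving a system of linear equations, the first item in the list above, can be sketched like this. The system itself is a made-up example.

```python
import numpy as np

# Solve   x + 2y = 5
#        3x + 4y = 11
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([5.0, 11.0])

solution = np.linalg.solve(A, b)       # approximately [1. 2.]
print(np.allclose(A @ solution, b))    # True

determinant = np.linalg.det(A)         # approximately -2.0
inverse = np.linalg.inv(A)             # A @ inverse is approximately the identity
```

Using solve() is generally preferable to computing inv(A) @ b, since it avoids forming the inverse explicitly.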
Sorting Arrays
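NumPy arrays can be sorted along specified axes. The np.sort() function returns a sorted copy, the ndarray.sort() method sorts the array in place, and np.argsort() returns the indices that would sort the array: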
unsorted_array = np.array([3, 1, 2])
sorted_array = np.sort(unsorted_array)
# [1 2 3]
# Get array indices that would sort an array
sort_indices = np.argsort(unsorted_array)
For 2D arrays, we can sort along either axis:
two_d_array = np.array([[5, 2], [4, 1]])
sorted_down_columns = np.sort(two_d_array, axis=0) # sorts the values within each column
# [[4 1]
# [5 2]]
sorted_across_rows = np.sort(two_d_array, axis=1) # sorts the values within each row
# [[2 5]
# [1 4]]
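A common use of argsort() is to reorder one array by the values of another. The sketch below uses two made-up parallel arrays, names and ages, and also contrasts the copying and in-place forms of sorting.

```python
import numpy as np

names = np.array(["carol", "alice", "bob"])
ages = np.array([35, 28, 41])

order = np.argsort(ages)              # [1 0 2]
names_by_age = names[order]           # ['alice' 'carol' 'bob']

# Descending order: reverse the ascending index array
oldest_first = names[order[::-1]]     # ['bob' 'carol' 'alice']

sorted_copy = np.sort(ages)           # new array; ages itself is unchanged
ages.sort()                           # in place: ages is now [28 35 41]
```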
Masked Arrays
Masked arrays provide a way to handle missing or invalid data in NumPy. Masks can be applied to hide values in computations where needed.
We create masked arrays using np.ma.masked_array():
data = np.array([1, 2, 3, -999, 4])
masked = np.ma.masked_array(data, mask=[0, 0, 0, 1, 0])
print(masked)
# [1 2 3 -- 4]
The masked value is ignored in computations:
print(masked.mean()) # 2.5
print(masked.sum()) # 10
We can access the underlying data with masked.data and the boolean mask with masked.mask.
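Rather than writing the mask by hand, it can be derived from the data. The sketch below assumes -999 is a sentinel for missing readings, as in the example above; the variable names are illustrative.

```python
import numpy as np

readings = np.array([1.0, 2.0, 3.0, -999.0, 4.0])

# Mask every element equal to the sentinel value
clean = np.ma.masked_equal(readings, -999.0)
# Or build the mask from an arbitrary condition
also_clean = np.ma.masked_where(readings < 0, readings)

print(clean.mean())           # 2.5
print(clean.count())          # 4 (number of unmasked values)

filled = clean.filled(0.0)    # plain ndarray with the masked slot replaced by 0.0
```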
Reshaping and Transposing Arrays
The shape of arrays can be modified, usually without copying any data, using reshape() and newaxis:
array = np.array([1, 2, 3, 4])
array.reshape(2, 2)
# [[1 2]
# [3 4]]
array[np.newaxis, :] # Adds new axis
# [[1 2 3 4]]
transpose() switches index order to permute axes:
array = np.arange(6).reshape(2, 3)
array.transpose()
# [[0 3]
# [1 4]
# [2 5]]
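Two details are worth a short sketch: passing -1 lets reshape() infer one dimension, and for a contiguous array the reshaped result is a view that shares memory with the original. The names flat and grid are illustrative.

```python
import numpy as np

flat = np.arange(6)              # [0 1 2 3 4 5]
grid = flat.reshape(2, -1)       # -1 lets NumPy infer the size: shape (2, 3)

# For a contiguous array like this one, reshape returns a view
grid[0, 0] = 100
print(flat[0])                   # 100

print(grid.T.shape)              # (3, 2); .T is shorthand for transpose()
column = flat[:, np.newaxis]     # shape (6, 1)
```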
Reading and Writing Array Data
NumPy provides utilities to read and write array data to disk, either in efficient binary formats or as plain text. This can be done with:
- np.save() and np.load() for the .npy format
- np.savez() and np.load() for zipped .npz archives holding several arrays
- np.loadtxt() and np.savetxt() for text files
Large arrays can be mapped to files on disk with np.memmap without fully loading them into memory.
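A round trip through each of these formats can be sketched as below. The file names are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile
import numpy as np

data = np.arange(6, dtype=np.float64).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    # Single array in binary .npy format
    npy_path = os.path.join(tmp, "data.npy")
    np.save(npy_path, data)
    from_npy = np.load(npy_path)

    # Several named arrays in one .npz archive
    npz_path = os.path.join(tmp, "bundle.npz")
    np.savez(npz_path, first=data, second=data * 2)
    with np.load(npz_path) as bundle:
        second = bundle["second"]

    # Human-readable text, here comma-separated
    txt_path = os.path.join(tmp, "data.csv")
    np.savetxt(txt_path, data, delimiter=",")
    from_txt = np.loadtxt(txt_path, delimiter=",")
```

The binary formats preserve dtype and shape exactly, while text files are easier to inspect and to exchange with other tools.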
Conclusions
The NumPy package enables efficient mathematical and statistical computations on arrays in Python without explicit for loops. Key capabilities include:
- Multidimensional arrays with broadcasting
- Vectorized universal functions
- Aggregations, sorting, masking and transformations
- Linear algebra, random sampling, and more
NumPy is fundamental for building mathematical and scientific applications with Python. Using its array-oriented computing tools can help optimize code and achieve orders-of-magnitude speedups over explicit Python loops. This guide provided an overview of the core functionality; refer to the official NumPy documentation and resources for more details.