
Introduction to NumPy for Numeric Data Processing in Python

Updated: at 03:23 AM

NumPy is a fundamental package for numeric computing in Python. It provides powerful N-dimensional array objects and tools for working with these arrays efficiently and productively. This comprehensive guide will introduce you to the key features of NumPy and how to leverage its capabilities for numeric data processing in Python.

Overview of NumPy

NumPy (Numerical Python) is an open source Python library that provides a multi-dimensional array object called ndarray, derived objects such as masked arrays, and a collection of routines for fast operations on arrays.

NumPy arrays provide a grid-like structure for storing homogeneous data and are faster and more compact than Python lists. NumPy also offers concise syntax for common operations on array elements, such as arithmetic, slicing, broadcasting, aggregations, and comparisons.

Common uses of NumPy include scientific computing, data analysis, and machine learning.

To use NumPy, you need to install it via pip first:

pip install numpy

Now let’s explore some of the main features of NumPy in detail.

The NumPy N-dimensional Array Object

The foundation of NumPy is the ndarray object for multi-dimensional arrays. These arrays are fixed size and contain elements of the same type.

To create a simple 1D array:

import numpy as np

arr = np.array([1, 2, 3])
print(arr)

# Output: [1 2 3]

The array dimensions describe the shape of the array. We can inspect the shape like this:

print(arr.shape)

# Output: (3,)

This array has one axis with 3 elements. For a 2D array with 3 rows and 2 columns:

arr_2d = np.array([[1, 2], [3, 4], [5, 6]])
print(arr_2d)

# Output:
# [[1 2]
#  [3 4]
#  [5 6]]

print(arr_2d.shape)

# Output: (3, 2)

We can also define the data type of the array elements like this:

float_arr = np.array([1.1, 2.2, 3.5], dtype=np.float32)

NumPy supports common data types like float, int, bool, string, datetime64, etc.
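As a minimal sketch of how the element type behaves in practice (the variable names below are illustrative and not from the original text), this shows NumPy choosing one common dtype for mixed input and converting explicitly with astype:

```python
import numpy as np

# Mixed ints and floats are upcast to a single common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# astype returns a converted copy; float -> int truncates toward zero.
as_int = np.array([1.9, -2.7, 3.5]).astype(np.int32)
print(as_int)  # [ 1 -2  3]

# itemsize is the number of bytes used per element.
small = np.array([1, 2, 3], dtype=np.float32)
print(small.itemsize)  # 4
```

Choosing a smaller dtype such as float32 halves the memory per element compared with the default float64, at the cost of precision.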

Array Creation Functions

NumPy provides various functions to create new arrays from a shape, a fill value, or a numeric range, without needing existing data.

np.zeros and np.ones

Create arrays filled with zeros or ones (float64 by default):

np.zeros((2, 3))

# Output:
# array([[0., 0., 0.],
#        [0., 0., 0.]])

np.ones((3, 4))

# Output:
# array([[1., 1., 1., 1.],
#        [1., 1., 1., 1.],
#        [1., 1., 1., 1.]])

np.full

Create a constant array:

np.full((3, 3), 7)

# Output:
# array([[7, 7, 7],
#        [7, 7, 7],
#        [7, 7, 7]])

np.arange

Return evenly spaced values within the half-open interval [start, stop), using a given step:

np.arange(5, 20, 2)

# Output: array([ 5,  7,  9, 11, 13, 15, 17, 19])

np.linspace

Return num evenly spaced samples over a specified interval (the endpoint is included by default):

np.linspace(0, 10, 5)

# Output: array([ 0. ,  2.5,  5. ,  7.5, 10. ])

There are many other helper routines like np.random.rand(), np.identity(), etc. Refer to the NumPy documentation for more details.
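A short sketch of a few of these helpers follows. The names ident, template, blank, and rand are made up for illustration:

```python
import numpy as np

# 3x3 identity matrix (float64 by default).
ident = np.identity(3)

# zeros_like copies the shape and dtype of an existing array.
template = np.array([[1, 2], [3, 4]])
blank = np.zeros_like(template)

# Uniform random samples in [0, 1) with shape (2, 3).
rand = np.random.rand(2, 3)

print(ident.shape)  # (3, 3)
print(blank.shape)  # (2, 2)
print(rand.shape)   # (2, 3)
```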

Array Indexing and Slicing

NumPy arrays can be indexed and sliced like Python lists. For example:

arr = np.array([1, 2, 3, 4])

# Indexing
print(arr[0]) # 1

# Slicing
print(arr[1:3]) # [2 3]

For multidimensional arrays, you can provide a tuple of indices/slices to select elements:

arr_2d = np.array([[1,2,3], [4,5,6], [7,8,9]])

print(arr_2d[1, 2]) # 6

print(arr_2d[:, 1]) # [2 5 8]
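Beyond list-style indexing, arrays can also be indexed with boolean masks and integer arrays. Unlike a list slice, a basic NumPy slice is a view on the same data. The sketch below uses illustrative variable names to show both points:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Boolean mask: keep only elements greater than 25.
print(arr[arr > 25])  # [30 40 50]

# Integer-array indexing: pick positions 0 and 3.
print(arr[[0, 3]])    # [10 40]

# A basic slice is a view, so writing through it changes arr.
view = arr[1:3]
view[0] = 99
print(arr)            # [10 99 30 40 50]

# Use .copy() when an independent array is needed.
safe = arr[1:3].copy()
safe[0] = -1
print(arr[1])         # 99
```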

NumPy also provides powerful broadcasting features for array operations.

Broadcasting

Broadcasting allows vectorized operations on arrays of different shapes. The smaller array is broadcast to match the larger array so that they have compatible shapes.

For example:

a = np.array([[1,2,3]]) # Shape (1, 3)
b = np.array([10, 20, 30]) # Shape (3,)

a + b

# Output:
# array([[11, 22, 33]])

Here b, with shape (3,), is broadcast to shape (1, 3) to match a before the addition.

Broadcasting follows these rules:

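  1. If the arrays have different numbers of dimensions, the shape of the one with fewer dimensions is padded with ones on its left side.
  2. Along each dimension, sizes are compatible when they are equal or when one of them is 1; a dimension of size 1 is stretched to match the other.
  3. If any dimension has sizes that differ and neither is 1, NumPy raises a ValueError.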
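To make the rules concrete, here is a small sketch (the array names are illustrative) in which size-1 dimensions are stretched, followed by a pair of shapes that cannot be broadcast:

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.array([10, 20, 30, 40])   # shape (4,), treated as (1, 4)

# Both size-1 dimensions stretch, giving a (3, 4) result.
grid = col + row
print(grid.shape)  # (3, 4)
print(grid[2])     # [12 22 32 42]

# Shapes (3,) and (4,) are incompatible: 3 != 4 and neither is 1.
try:
    np.ones(3) + np.ones(4)
    failed = False
except ValueError:
    failed = True
print(failed)      # True
```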

Broadcasting lets you write vectorized operations on arrays of different dimensions and avoid slow Python loops.

Universal Array Functions

NumPy provides vectorized versions of many mathematical operations, called universal functions or ufuncs. These operate element-wise on arrays.

For example:

a = np.array([1, 2, 4])

np.sqrt(a)

# Output: array([1.        , 1.41421356, 2.        ])

Here np.sqrt() calculates element-wise square root. Other useful ufuncs include np.exp, np.sin, np.add, np.greater, etc.
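Binary ufuncs such as np.add and np.greater are the functions behind the + and > operators on arrays. A brief sketch, with illustrative variable names:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

total = np.add(x, y)        # same result as x + y
bigger = np.greater(x, y)   # same result as x > y
print(total)   # [4 4 4]
print(bigger)  # [False False  True]

# Ufuncs broadcast too: the scalar is combined with every element.
shifted = np.add(x, 10)
print(shifted)  # [11 12 13]
```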

Many ufuncs also take an out parameter to store the output in an existing array rather than create a new one:

out = np.empty_like(a)   # same shape and integer dtype as a
np.power(a, 2, out=out)

print(out)
# Output: [ 1  4 16]

This is more efficient as it avoids allocating new memory.

Array Aggregations

NumPy provides common aggregation functions like sum, mean, std, min, max to aggregate array values:

arr = np.array([[1, 2], [3, 4]])

print(np.min(arr)) # 1
print(np.max(arr)) # 4
print(np.sum(arr)) # 10

print(np.mean(arr)) # 2.5
print(np.std(arr)) # 1.118033988749895

We can also specify the axis along which to compute the aggregations:

print(arr.sum(axis=0)) # [4 6]
print(arr.min(axis=1)) # [1 3]
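The axis argument names the dimension that is collapsed by the aggregation. The sketch below (illustrative names, non-square array so the two axes are easy to tell apart) also uses keepdims=True so that the result broadcasts back against the original array:

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)

col_sums = scores.sum(axis=0)    # collapses the rows -> shape (3,)
row_sums = scores.sum(axis=1)    # collapses the columns -> shape (2,)
print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]

# keepdims=True keeps the reduced axis with length 1: shape (2, 1).
row_means = scores.mean(axis=1, keepdims=True)
centered = scores - row_means    # subtract each row's mean from that row
print(centered)
# [[-1.  0.  1.]
#  [-1.  0.  1.]]
```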

Array Reshaping and Transpose

The shape of an array can be modified without changing the data:

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.reshape(3, 2))

# Output:
# [[1 2]
#  [3 4]
#  [5 6]]

The transpose() method swaps axes:

print(arr.transpose())

# [[1 4]
#  [2 5]
#  [3 6]]
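A few related conveniences are worth knowing. The sketch below (variable names are illustrative) lets NumPy infer one dimension with -1, uses the .T shorthand for transpose(), and shows that the total number of elements must stay the same:

```python
import numpy as np

arr = np.arange(6)   # [0 1 2 3 4 5]

# -1 asks NumPy to infer that dimension from the total size.
two_rows = arr.reshape(2, -1)
print(two_rows.shape)    # (2, 3)

# .T is shorthand for transpose().
print(two_rows.T.shape)  # (3, 2)

# flatten() returns a 1D copy of the data.
flat = two_rows.T.flatten()
print(flat)              # [0 3 1 4 2 5]

# Six elements cannot be reshaped to (4, 2), which needs eight.
try:
    arr.reshape(4, 2)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)   # True
```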

Array Concatenation and Splitting

NumPy provides operations like np.concatenate, np.stack, np.hstack, np.vstack etc. to combine arrays:

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

np.concatenate((a, b), axis=0)

# Output:
# [[1 2]
#  [3 4]
#  [5 6]]

np.hstack((a, b.T)) # Horizontal stack; b.T has shape (2, 1) to match a's two rows

# Output:
# [[1 2 5]
#  [3 4 6]]

np.vstack((a, b)) # Vertical stack

# Output:
# [[1 2]
#  [3 4]
#  [5 6]]

Similarly, np.split, np.hsplit, np.vsplit can be used to split arrays.
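As a rough illustration of the splitting functions, and of how np.stack differs from concatenation by adding a new axis, consider this sketch (array names are illustrative):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Split into two equal halves along the columns.
left, right = np.hsplit(arr, 2)
print(left.shape, right.shape)  # (3, 2) (3, 2)

# Split the rows at index 1: the first row, then the rest.
top, rest = np.split(arr, [1], axis=0)
print(top.shape, rest.shape)    # (1, 4) (2, 4)

# np.stack joins equal-shaped arrays along a NEW leading axis.
stacked = np.stack((left, right))
print(stacked.shape)            # (2, 3, 2)

# hstack undoes hsplit.
restored = np.hstack((left, right))
print(np.array_equal(restored, arr))  # True
```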

Linear Algebra

NumPy provides tools for linear algebra operations on arrays:

a = np.array([[1,1], [0,1]])
b = np.array([2,2])

# Solve the linear system a @ x = b
x = np.linalg.solve(a, b)
print(x)

# Output: [0. 2.]

Other linear algebra capabilities include matrix eigendecomposition, determinants, vector/matrix norms, matrix multiplication etc.
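A brief sketch of these capabilities follows, using a small diagonal matrix chosen only because its results are easy to verify by hand. Floating-point results are described approximately rather than printed exactly:

```python
import numpy as np

m = np.array([[2.0, 0.0],
              [0.0, 3.0]])
v = np.array([1.0, 1.0])

product = m @ v                      # matrix-vector multiplication
det = np.linalg.det(m)               # determinant, approximately 6
norm = np.linalg.norm(v)             # Euclidean norm, sqrt(2)
eigvals, eigvecs = np.linalg.eig(m)  # eigenvalues 2 and 3 (order not guaranteed)
inverse = np.linalg.inv(m)           # matrix inverse

print(product)  # [2. 3.]

# m multiplied by its inverse gives the identity, up to rounding.
print(np.allclose(m @ inverse, np.eye(2)))  # True
```

For solving linear systems, np.linalg.solve is generally preferable to computing the inverse explicitly, since it is both faster and numerically more stable.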

Random Sampling with np.random

NumPy’s random module np.random provides various functions for random number generation and sampling from different statistical distributions:

np.random.rand(2, 3) # Uniform distribution

# Output (values differ on every run unless a seed is set):
# array([[0.69646919, 0.28613933, 0.22685145],
#        [0.55131477, 0.71946897, 0.4236548 ]])

np.random.randn(2, 3) # Standard normal distribution

np.random.randint(1, 10, 5) # Random ints

np.random.choice([1, 2, 3], 5) # Random sample

This makes it easy to generate test data, draw samples for simulations, and cover many other use cases.
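For reproducible results, a seeded generator helps. The sketch below uses np.random.default_rng, the Generator interface that NumPy recommends for new code. It is not used in the examples above, and the seed value 42 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value is arbitrary

uniform = rng.random((2, 3))            # uniform samples in [0, 1)
normal = rng.standard_normal((2, 3))    # standard normal samples
ints = rng.integers(1, 10, size=5)      # integers in [1, 10)
picks = rng.choice([1, 2, 3], size=5)   # sampled with replacement

# A second generator with the same seed reproduces the same stream.
repeat = np.random.default_rng(seed=42).random((2, 3))
print(np.array_equal(uniform, repeat))  # True
```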

Reading and Writing Array Data

NumPy provides convenience functions to read data from files into arrays and write array data to files:

data = np.genfromtxt('data.csv', delimiter=',')

arr = np.array([[1, 2], [3, 4]])
np.save('arr.npy', arr)

arr_reloaded = np.load('arr.npy')

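NumPy reads and writes delimited text formats such as CSV (np.loadtxt, np.genfromtxt, np.savetxt) and its own binary formats, .npy for a single array and .npz for several arrays. It has no built-in JSON support.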
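Here is a hedged round-trip sketch using a text file and a multi-array .npz archive. The file names and the keys first and second are invented for the example, and a temporary directory is used so that nothing is left on disk:

```python
import os
import tempfile

import numpy as np

arr = np.array([[1.5, 2.0], [3.25, 4.0]])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "arr.csv")
    npz_path = os.path.join(tmp, "arrays.npz")

    # Text round trip: write as CSV, then read it back.
    np.savetxt(csv_path, arr, delimiter=",")
    from_csv = np.loadtxt(csv_path, delimiter=",")

    # Several named arrays in one .npz archive.
    np.savez(npz_path, first=arr, second=arr * 2)
    with np.load(npz_path) as bundle:
        second = bundle["second"]

print(np.allclose(from_csv, arr))  # True
print(second[0])                   # [3. 4.]
```

Text formats are portable and human-readable, while .npy and .npz preserve dtype and shape exactly and are usually faster to load.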

Summary

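In this guide, we looked at some of the key aspects of NumPy for numeric data processing in Python: the ndarray object, array creation functions, indexing and slicing, broadcasting, universal functions, aggregations, reshaping and transposing, concatenation and splitting, linear algebra, random sampling, and reading and writing array data.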

NumPy is a foundational package for scientific computing, data analysis, and machine learning applications in Python. Mastering NumPy enables you to work efficiently with large datasets in Python.

Check out the official NumPy user guide and reference documentation for more details on all available functionality. The NumPy API is quite extensive, so focus on the essential parts relevant to your specific data tasks.