NumPy Memory Mapping for Large Arrays

NumPy is a popular Python library used for scientific computing and working with multidimensional array data. One key feature of NumPy is its ability to memory map arrays, allowing you to work with arrays larger than available RAM. Memory mapping provides fast access to array data stored on disk without needing to load the entire array into memory. This how-to guide will explain what memory mapping is, why it is useful when working with large NumPy arrays, and provide examples for creating, accessing, and manipulating memory mapped arrays in Python.

Introduction
Memory Mapping Array Basics
Memory Map Open Modes
Accessing and Modifying Memory Mapped Arrays
Releasing Memory Mapped Arrays
Use Cases and Examples
Conclusion

Introduction

When processing and analyzing large datasets in Python, you may encounter MemoryError exceptions when creating large NumPy arrays that exceed available system RAM. For example:

import numpy as np

# Attempt to create a 500-million-element float64 array (about 4 GB)
large_array = np.random.rand(500000000)

# Raises MemoryError if the system cannot allocate that much memory
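
To see why an array of this size can fail, it helps to estimate its memory footprint before allocating it. The following is a minimal sketch; the variable names are illustrative and not part of the original example.

```python
import numpy as np

# Number of elements in the array from the example above
n_elements = 500_000_000

# np.random.rand returns float64 values, which take 8 bytes each
itemsize = np.dtype(np.float64).itemsize

# Total size in GiB (1 GiB = 2**30 bytes)
required_gib = n_elements * itemsize / 2**30
print(f"{required_gib:.2f} GiB")  # prints "3.73 GiB"
```

Whether this allocation succeeds depends on the RAM and swap available on the machine at the time.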

To work around this limitation, NumPy provides memory mapping capabilities to efficiently access array data stored on disk in a file without needing to load the full contents in memory. The array is still accessed and operated on like a standard in-memory NumPy array.

Memory mapping has several key advantages:

Work with arrays larger than available RAM
On-demand file access: the operating system reads only the parts of the file you actually touch
Read/write access to persistent storage for out-of-core computing
Convenience of array semantics and NumPy APIs

This guide will demonstrate how to create memory mapped NumPy arrays and cover topics like:

Memory mapping array basics
Memory map modes (read-only, read-write, copy-on-write)
Accessing and modifying memory mapped arrays
Releasing memory mapped arrays
Use cases and examples

Memory Mapping Array Basics

The basic steps to create a memory mapped NumPy array are:

Create or open the file that stores the array data on disk
Memory map the file to create the array proxy (np.memmap does both in a single call)
Use the array, accessing data from disk as needed
Flush pending changes and release the memory mapped array

Here is a simple example:

import numpy as np

# Create the data file and memory map it in a single call
mapped_array = np.memmap('myarray.dat', dtype=float, mode='w+', shape=(1000000,))

# Use array and access data from disk
mapped_array[0] = 1.23
print(mapped_array[0])

# Flush changes to disk, then release the memory mapped array
mapped_array.flush()
del mapped_array

The memmap constructor handles both creating the data file and memory mapping with several options:

filename - File path to map data from
dtype - Data type for array
mode - Memory map file open mode ('r', 'r+', 'w+', 'c')
shape - Shape of created array
order - Memory layout order ('C' or 'F')

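The memory mapped array can then be used like a standard NumPy array for slicing, indexing, iteration, etc. In the writable modes ('r+' and 'w+'), changes are written back to the file; call flush() to make sure pending changes reach the disk. None of this requires loading the full array into memory.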
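
As a small illustration of the constructor options, the sketch below creates a two-dimensional memory map and checks that the file on disk is exactly as large as the array data. The temporary directory and the file name grid.dat are assumptions made for this example.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch location for this example
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "grid.dat")

# mode='w+' creates the file and sizes it to hold a 3 x 4 float64 array
grid = np.memmap(path, dtype=np.float64, mode="w+", shape=(3, 4))

# Write one row and push the change to disk
grid[1, :] = np.arange(4)
grid.flush()

# 3 * 4 elements * 8 bytes each = 96 bytes
file_size = os.path.getsize(path)
row_sum = float(grid[1].sum())
print(file_size, grid.nbytes, row_sum)
```

The raw file contains only the array bytes, with no header, which is why dtype and shape have to be supplied again whenever the file is reopened.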

Memory Map Open Modes

There are several memmap open modes that control how the mapped array can be accessed and modified:

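'r' - Read-only mode. The file must already exist; assignments raise an exception.
'r+' - Read-write mode. The file must already exist; changes are written to the file.
'w+' - Read-write mode. Creates the file, or overwrites an existing one.
'c' - Copy-on-write mode. Reads come from the file; writes change only the in-memory copy and are never saved to the file.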

Here is an example of the different modes:

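import numpy as np

# Read-write (creates the file, or overwrites an existing one)
w_array = np.memmap('modes_demo.dat', dtype=float, mode='w+', shape=(10,))
w_array.flush()
del w_array

# Read-only (the file must already exist)
ro_array = np.memmap('modes_demo.dat', dtype=float, mode='r', shape=(10,))
try:
    ro_array[0] = 123
except ValueError as exc:
    print('Read-only:', exc) # Assigning to a read-only array raises ValueError

# Read-write (the file must already exist)
rw_array = np.memmap('modes_demo.dat', dtype=float, mode='r+', shape=(10,))
rw_array[0] = 123 # Writes value to the file
rw_array.flush()

# Copy-on-write
cw_array = np.memmap('modes_demo.dat', dtype=float, mode='c', shape=(10,))
cw_array[0] = 456 # Changes the in-memory copy only, not the file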

Choosing the appropriate open mode depends on how you need to interact with the underlying data. Read-only is good for accessing large data safely, with no risk of modifying the file. Read-write allows updating the data in place. Copy-on-write suits cases where you want to modify values temporarily, for example to experiment, without changing the file on disk.
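
The difference between copy-on-write and the file on disk can be checked directly. The sketch below is a minimal demonstration under assumed names (a temporary file called cow_demo.dat); it modifies a copy-on-write map and then reopens the file read-only to confirm the stored data is unchanged.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file for this example
path = os.path.join(tempfile.mkdtemp(), "cow_demo.dat")

# Create the file and fill it with ones
original = np.memmap(path, dtype=np.float64, mode="w+", shape=(10,))
original[:] = 1.0
original.flush()
del original

# Copy-on-write: the assignment changes memory only
cow = np.memmap(path, dtype=np.float64, mode="c", shape=(10,))
cow[0] = 123.0

# Read-only: assignments are rejected with ValueError
on_disk = np.memmap(path, dtype=np.float64, mode="r", shape=(10,))
try:
    on_disk[0] = 5.0
    read_only_rejected = False
except ValueError:
    read_only_rejected = True

print(cow[0], on_disk[0], read_only_rejected)
```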

Accessing and Modifying Memory Mapped Arrays

Memory mapped arrays support all the familiar NumPy array operations like indexing, slicing, iterating, etc. The key difference is that the operating system pages the data in from disk on demand instead of holding the entire array in main memory.

For example:

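import numpy as np

# Open the existing file read-write, since the last step below assigns values
mapped_array = np.memmap('myarray.dat', dtype=float, mode='r+', shape=(10,5))

# Indexing
a = mapped_array[2,3]

# Slicing
subarray = mapped_array[:,2:4]

# Iterating (over rows)
for val in mapped_array:
    print(val)

# Broadcasting a scalar across the first row
mapped_array[0,:] = 5.0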

Keep in mind that writable memory maps ('r+' and 'w+' modes) write changes back to the disk file. Call flush() to make sure pending changes have actually been written. This provides persistence without needing to manually read from or write to the file.

Operations that produce a new array, such as np.sort, fancy indexing, or arithmetic on the whole array, return their result in regular memory. Applied to the full memory map, they can therefore still exhaust RAM. The benefits of memory mapping are mostly realized when you work on slices or chunks of the data.
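
One way to see which operations stay on the mapped file and which produce separate in-memory results is np.shares_memory. The sketch below uses a tiny array and an assumed temporary file name (views_demo.dat) purely for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file for this example
path = os.path.join(tempfile.mkdtemp(), "views_demo.dat")

mm = np.memmap(path, dtype=np.float64, mode="w+", shape=(8,))
mm[:] = [5, 3, 8, 1, 9, 2, 7, 4]
mm.flush()

# A basic slice is a view onto the same mapped bytes
window = mm[2:5]

# np.sort returns a sorted copy held in regular memory
sorted_values = np.sort(mm)

slice_shares = np.shares_memory(window, mm)
sort_shares = np.shares_memory(sorted_values, mm)
print(slice_shares, sort_shares)  # prints "True False"

# The mapped data itself is left in its original order
first_value = float(mm[0])
```

For data that does not fit in RAM, this is the reason to prefer slicing and chunked loops over whole-array operations.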

Releasing Memory Mapped Arrays

np.memmap does not provide a close method. When you are finished using a memory mapped array, call flush() to write any pending changes to disk, then remove your reference to it using the del statement:

mapped_array.flush()
del mapped_array

The underlying mapping is released only once the last reference to it is gone, including any views or slices of the array. This also happens automatically when the array is garbage collected, but it is good practice to explicitly del the array when it is no longer needed.

Leaving many unused memory mapped arrays open can lead to reaching the limit on open files allowed by the OS. So properly releasing arrays when done is important.
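
The release pattern can be sketched as a write, flush, and delete, followed by reopening the file to confirm that the data persisted. The temporary file name release_demo.dat is an assumption made for this example.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file for this example
path = os.path.join(tempfile.mkdtemp(), "release_demo.dat")

# Write some values, flush them to disk, and drop the reference
writer = np.memmap(path, dtype=np.int32, mode="w+", shape=(5,))
writer[:] = np.arange(5)
writer.flush()
del writer

# Reopen read-only; dtype and shape must match what was written
reader = np.memmap(path, dtype=np.int32, mode="r", shape=(5,))
values = reader.tolist()
del reader

print(values)  # prints "[0, 1, 2, 3, 4]"
```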

Use Cases and Examples

Some common use cases where NumPy memory mapping can be advantageous:

Accessing array slices from large data files:

Read only the relevant portions of large array data stored on disk instead of loading the entire file into memory.

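# Map a very large array stored in NumPy's .npy format.
# np.load with mmap_mode reads the dtype and shape from the .npy header;
# np.memmap would treat the header bytes as array data.
full_data = np.load('data.npy', mmap_mode='r')

# Access just the first 100 rows
partial_data = full_data[:100]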
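
For files in NumPy's .npy format, np.load accepts a mmap_mode argument and takes the dtype and shape from the file header. The sketch below saves a small array to an assumed temporary file (table.npy) and then maps it, copying only the first two rows into regular memory.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file for this example
path = os.path.join(tempfile.mkdtemp(), "table.npy")

# Save a small 5 x 4 array in .npy format
np.save(path, np.arange(20, dtype=np.float64).reshape(5, 4))

# Map the file instead of loading it; shape and dtype come from the header
table = np.load(path, mmap_mode="r")

# Copy just the first two rows into a regular in-memory array
first_rows = np.array(table[:2])
print(table.shape, first_rows.tolist())
```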

Out-of-Core Computation:

Process array data larger than available RAM by streaming from disk.

# Create a large memory mapped array (mode='w+' creates a zero-filled file)
array = np.memmap('large_array.dat', dtype=float, shape=(100000000,), mode='w+')

# Compute in chunks; expensive_calculation is a placeholder for your own function
chunksize = 100000
for i in range(0, array.shape[0], chunksize):
    array[i:i+chunksize] = expensive_calculation(array[i:i+chunksize])

# Make sure all results are written to disk
array.flush()
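
A scaled-down, runnable version of the chunked loop might look like the sketch below. The function scale_chunk is a hypothetical stand-in for expensive_calculation, and the file name and sizes are chosen only to keep the example small.

```python
import os
import tempfile

import numpy as np

n = 1_000
# Hypothetical scratch file for this example
path = os.path.join(tempfile.mkdtemp(), "chunks_demo.dat")

# Create the memory map and fill it with 0, 1, ..., n-1
data = np.memmap(path, dtype=np.float64, mode="w+", shape=(n,))
data[:] = np.arange(n)
data.flush()


def scale_chunk(chunk):
    """Hypothetical per-chunk computation: double every value."""
    return chunk * 2.0


# Only one chunk's worth of results is held in memory at a time
chunksize = 256
for start in range(0, n, chunksize):
    stop = min(start + chunksize, n)
    data[start:stop] = scale_chunk(data[start:stop])

data.flush()
total = float(data.sum())
print(total)  # prints "999000.0"
```

The chunk size is a tuning knob: larger chunks mean fewer loop iterations but more memory per step.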

Serving Data to Multiple Processes:

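Use memory mapping to share array data across processes. Each process maps the same file, using the same dtype and shape.

# Parent process: create the file and flush the data to disk
data = np.memmap('/shared_array', dtype=float, shape=(500,500), mode='w+')
data.flush()

# Child process: map the same file read-only
data = np.memmap('/shared_array', dtype=float, shape=(500,500), mode='r')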
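
The parent and child pattern can be tried end to end by launching a second Python interpreter with subprocess. This is a sketch under stated assumptions: the temporary file name shared.dat is hypothetical, and the child is assumed to run the same interpreter with NumPy installed.

```python
import os
import subprocess
import sys
import tempfile

import numpy as np

# Hypothetical scratch file shared between the two processes
path = os.path.join(tempfile.mkdtemp(), "shared.dat")

# Parent: create the file, write data, and flush it to disk
parent_view = np.memmap(path, dtype=np.float64, mode="w+", shape=(4, 4))
parent_view[:] = 1.5
parent_view.flush()

# Child: map the same file read-only and print the sum of its values
child_code = (
    "import sys\n"
    "import numpy as np\n"
    "arr = np.memmap(sys.argv[1], dtype=np.float64, mode='r', shape=(4, 4))\n"
    "print(float(arr.sum()))\n"
)
result = subprocess.run(
    [sys.executable, "-c", child_code, path],
    capture_output=True,
    text=True,
    check=True,
)
child_total = float(result.stdout.strip())
print(child_total)  # prints "24.0"
```

Memory mapping shares the data but does not coordinate access, so concurrent writers still need their own synchronization.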

Conclusion

Memory mapping is a useful technique in NumPy for working with large arrays that don’t fit in memory. It provides an efficient way to access data from disk without loading entire files. The array can still be sliced, indexed, and iterated over like a normal NumPy array.

The main concepts covered in this guide include:

Creating memory mapped arrays with np.memmap
Different modes for read-only, read-write, and copy-on-write access
Indexing, slicing, and modifying memory mapped arrays
Properly releasing arrays when finished
Use cases like out-of-core computation and sharing data across processes

Memory mapping large arrays allows you to work with datasets larger than available RAM. It is a valuable technique for loading subsets of big data stored on disk and minimizing memory usage. NumPy’s memmap functionality helps enable convenient and efficient out-of-core data processing in Python.