Skip to content

NumPy: Sorting Arrays with sort(), argsort(), and lexsort()

Updated: at 01:53 AM

NumPy is the fundamental package for numerical computing in Python. It provides powerful array objects and fast vectorized operations for data manipulation and analysis. One of the most common data manipulation tasks is sorting array elements into a specified order. NumPy offers several methods for sorting arrays - sort(), argsort(), and lexsort().

This how-to guide will provide a comprehensive overview of array sorting in NumPy. We will cover the following topics in detail with example code snippets:

Table of Contents

Open Table of Contents

The Importance of Sorting Arrays

Sorting is the process of rearranging data into a specified order - alphabetical, numerical, chronological, etc. Sorting is an essential skill for any data analyst or data scientist working with Python. Ordered and sorted data is much easier to understand, analyze, and visualize compared to raw unsorted data.

Here are some key reasons why sorting arrays is important:

NumPy arrays provide an optimized environment for fast array operations in Python. Sorting NumPy arrays efficiently is vital for preparing and wrangling data for analysis.

Overview of NumPy Sorting Functions

NumPy provides three main methods for sorting numeric data in arrays:

  1. numpy.sort(array) - Returns a sorted copy of the input array.

  2. numpy.argsort(array) - Returns the indices that would sort the array.

  3. numpy.lexsort(key_arrays) - Indirect sorting using multiple key arrays.

The key differences between these functions are:

Let’s explore each method in further detail with examples.

In-Place Sorting with sort()

numpy.sort(array) sorts the elements of a NumPy array in-place. The original array is modified and no copy is made.

For example:

import numpy as np

arr = np.array([2, 1, 5, 3, 7, 4, 6])

print('Original array:')
print(arr)

np.sort(arr)

print('Sorted array:')
print(arr)

Output:

Original array:
[2 1 5 3 7 4 6]

Sorted array:
[1 2 3 4 5 6 7]

The array is sorted numerically in ascending order. sort() can also sort arrays lexicographically for strings:

arr = np.array(['banana', 'cherry', 'apple'])
np.sort(arr)

print(arr)

Output:

['apple' 'banana' 'cherry']

By default sort() uses quicksort internally for numeric data and mergesort for strings.

We can also specify the kind parameter to control the sorting algorithm:

a = np.array([4, 2, 1])

# Sort using quicksort
np.sort(a, kind='quicksort')

# Sort using heapsort
np.sort(a, kind='heapsort')

# Sort using mergesort
np.sort(a, kind='mergesort')

The order parameter defines the sorting order - ‘ascending’, ‘descending’, or None:

arr = np.array([2, 1, 4, 3])

# Sort in descending order
np.sort(arr, order='descending')

print(arr)

Output:

[4 3 2 1]

The axis parameter allows sorting along rows or columns for 2D arrays:

arr = np.array([[4, 3, 2], [2, 4, 1]])

# Sort along axis 0 (rows)
np.sort(arr, axis=0)

# Sort along axis 1 (columns)
np.sort(arr, axis=1)

In summary, numpy.sort() provides an optimized in-place sorting method for NumPy arrays.

Indirect Sorting with argsort()

numpy.argsort() returns the indices that would sort an array, rather than sorting the array elements themselves.

For example:

import numpy as np

arr = np.array([2, 1, 5, 3, 7, 4, 6])

sort_indices = np.argsort(arr)

print(sort_indices)

Output:

[1 0 4 2 6 3 5]

The indices provide the element positions needed to sort the array. We can use advanced integer indexing to rearrange arr based on sort_indices:

arr[sort_indices]

Output:

array([1, 2, 3, 4, 5, 6, 7])

This returns the sorted array by indexing into the original array using the order provided by argsort().

We can also sort a array reversely using argsort():

desc_indices = np.argsort(-arr)

arr[desc_indices]

Output:

array([7, 6, 5, 4, 3, 2, 1])

Applying -arr performs sorting in descending order.

The returned indices have many uses besides sorting:

Overall, argsort() provides efficient indirect sorting and ranking of NumPy arrays.

Lexsort for Indirect Sorting

numpy.lexsort() allows indirect sorting using multiple key arrays. It uses an algorithm called lexicographical sorting:

import numpy as np

names = ['Alice', 'Bob', 'Charlie']
ages = [25, 18, 22]

# Lexsort using two key arrays
ind = np.lexsort((ages, names))

print(ind)

# [1 2 0]

It returns the indices that would sort the array first by names then ages.

We can use advanced indexing on the original arrays to rearrange the data:

print(names[ind])
# ['Bob' 'Charlie' 'Alice']

print(ages[ind])
# [18 22 25]

This sorts the names alphabetically, resolving ties by sorting ages.

lexsort accepts any number of key arrays - the later arrays are used to break ties:

first_name = ['Alice', 'Bob', 'Charlie']
last_name = ['Jones', 'Smith', 'Williams']

ind = np.lexsort((first_name, last_name))

This sorts by last name then first name.

The power of lexsort() lies in its ability to perform complex multi-key sorting for data analysis and modeling tasks.

Sorting Array Rows and Columns

For 2D arrays, we can sort along rows or columns using the axis argument:

arr = np.array([[4, 3, 2], [2, 4, 1]])

# Sort each row
np.sort(arr, axis=1)

# Sort each column
np.sort(arr, axis=0)

Sorting rows or columns is useful for sorting tabular data stored in 2D arrays.

We can also leverage argsort() to get the sort order along an axis:

arr = np.array([[4, 3, 2], [2, 4, 1]])

# Indices to sort each row
row_ind = np.argsort(arr, axis=1)

# Indices to sort each column
col_ind = np.argsort(arr, axis=0)

These indices can be used to rearrange the rows or columns as needed.

Sorting Structured and Record Arrays

Structured arrays consist of elements with multiple fields that can be accessed by name. We can sort them by specifying the sorting field:

dtypes = [('name', 'U10'), ('age', 'i4'), ('height', 'f8')]
values = [('Alice', 25, 5.1), ('Bob', 18, 5.9)]

arr = np.array(values, dtype=dtypes)

np.sort(arr, order='age')

This sorts arr by the ‘age’ field in ascending order.

For record arrays, we use sorting keys:

arr = np.rec.array([(1, 2.), (4, 3.), (6, 4.)],
                   names='foo, bar')

np.sort(arr, key=lambda x: x[1])

This sorts by the ‘bar’ field values.

So structured and record arrays provide many options for flexible data sorting.

Handling Sorting with Missing Data

For float arrays with NaN missing values:

arr = np.array([2, np.nan, 5, np.nan, 1, 4])

np.sort(arr, kind='mergesort')

NaN values are sorted to the end by default. The mergesort algorithm handles NaN most efficiently.

We can fill missing values by passing an array of fill values:

fill = np.zeros(6)
fill[-1] = np.inf

np.sort(arr, kind='mergesort', fill_value=fill)

Here NaNs are replaced with 0 and infinity fills the last sorted position.

For string arrays:

arr = np.array(['Alice', np.nan, 'Bob'])
np.sort(arr, kind='mergesort')

NaN elements are sorted to the end for object arrays.

Careful handling of missing data ensures the sort order remains well-defined.

Comparing Performance of Sort Algorithms

NumPy provides a variety of sorting algorithms to choose from:

We can compare the performance of these algorithms by timing operations:

import numpy as np
import timeit

arr = np.random.randint(0, 10000, 100000)

t1 = timeit.timeit(lambda: np.sort(arr, kind='quicksort'), number=10)
t2 = timeit.timeit(lambda: np.sort(arr, kind='mergesort'), number=10)
t3 = timeit.timeit(lambda: np.sort(arr, kind='heapsort'), number=10)

print('Quicksort time: ', t1)
print('Mergesort time: ', t2)
print('Heapsort time: ', t3)

Quicksort is fastest for random data. Mergesort has reliable O(n log n) runtime. Heapsort is fastest for large arrays.

Choosing the optimal algorithm depends on your data properties and performance needs.

Applications of Sorting Arrays

Here are some examples of sorting NumPy arrays for real-world data analysis:

Statistics - Sorting provides samples for order statistics like median and percentiles. Extreme values become endpoints.

incomes = np.random.normal(100_000, 50_000, 100_000)

incomes_sorted = np.sort(incomes)

median = np.median(incomes_sorted)
lower_quartile = incomes_sorted[25000]

Data Processing - Sorting shuffles data into a standard order for downstream tasks.

data = np.loadtxt('dataset.txt')

data = data[np.lexsort(data.T)]

# Further processing on sorted data
res = process_dataset(data)

Machine Learning - Sorted order statistics can be used as features for modeling.

X = data[:, :-1]
y = data[:, -1]

ranks = np.argsort(X, axis=0)
X_ranks = np.take_along_axis(ranks, y, axis=0)

# Use rank features to train model
model.fit(X_ranks, y)

NumPy sorting enables efficient and flexible data preparation for analytics.

Conclusion

In summary, NumPy provides powerful sorting functionality for numerical data analysis in Python. The key highlights include:

Learning how to leverage sorting functions like sort(), argsort(), and lexsort() will enable you to prepare, manipulate, and munge data effectively for downstream analytics and modeling tasks. Sorting arrays unlocks simpler, more efficient data processing in NumPy.