Histogramming and Binning Data with NumPy in Python

NumPy is a fundamental package for scientific computing in Python. It provides powerful functionality for working with arrays and matrices that is optimized for performance and speed. One of the key features of NumPy is its ability to generate histograms and bin data, which are useful techniques for exploring, summarizing and visualizing the distribution of datasets.

In this comprehensive guide, we will cover the basics of histograms and binning data in NumPy, including practical examples and code samples.

Overview of Histograms
NumPy’s histogram() Method
Binning Data with np.digitize()
Using pandas for Analysis with Binned Data
Visualizing Histograms with Matplotlib
Example: Analyzing Histograms of Iris Measurements
Binning 2D Data
Choosing Optimal Bin Sizes
Common Histogram Use Cases
Conclusion

Overview of Histograms

A histogram visually displays the distribution of data by dividing the data range into bins or buckets, and then plotting the number of data points that fall into each bin. This allows you to see the shape and spread of the data, identify patterns, and detect outliers.

Some key properties of histograms:

Histograms display frequency distributions of data.
The x-axis represents the data variable being measured.
The y-axis depicts the number of data points (frequency) within each bin.
Bins are sequential, non-overlapping intervals that cover the dataset’s range.
Bin widths can vary but equal-width bins are commonly used.

import numpy as np

# Generate 1000 normally distributed data points
data = np.random.normal(size=1000)

# Use 20 equal-width bins
num_bins = 20

counts, bin_edges = np.histogram(data, bins=num_bins)

# Passing bins='auto' instead lets NumPy choose the number of bins

Histograms are a simple yet extremely useful tool for exploratory data analysis. They help identify the underlying distribution, presence of outliers, and features like modality (single peak, bi-modal, etc). This guides the choice of appropriate statistical models or data transformations for further analysis.

NumPy’s `histogram()` Method

NumPy provides the np.histogram() function to compute histograms on NumPy arrays efficiently without needing to write explicit for loops.

The syntax is:

np.histogram(a, bins=10, range=None, density=None, weights=None)

The key parameters are:

a: Input data array (it is flattened before binning)
bins: Number of equal-width bins, a sequence of bin edges, or the name of a bin-width estimator such as 'auto'
range: Tuple (min, max) giving the lower and upper range of the bins; values outside it are ignored
density: If True, normalize the result so that the histogram integrates to 1 over the range
weights: Optional array of weights, one per data point

Older NumPy releases also accepted a normed argument. It was deprecated in favor of density and removed in NumPy 1.24.

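It returns a tuple of two arrays: the counts in each bin and the bin edge values. Because the edges include both ends of the range, the edge array has one more element than the counts array.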
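As a minimal sketch of these return values (the sample data and seed are arbitrary choices for illustration), the following checks the relationship between counts and edges, and shows that with density=True the bar areas sum to 1:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
sample = rng.normal(size=500)

counts, bin_edges = np.histogram(sample, bins=10)
# 10 bins are delimited by 11 edges
n_counts, n_edges = len(counts), len(bin_edges)
# With the default range, every data point lands in some bin
total = counts.sum()

density, density_edges = np.histogram(sample, bins=10, density=True)
# Each density value times its bin width is that bin's share of the data
area = np.sum(density * np.diff(density_edges))

print(n_counts, n_edges, total, round(area, 6))
```

Note that density=True does not make the values themselves sum to 1. It is the values multiplied by the bin widths that do.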

Let’s see some examples of using np.histogram() on both simulated and real-world data.

import numpy as np

# Normally distributed data
data = np.random.normal(loc=0, scale=1, size=10000)

counts, bin_edges = np.histogram(data, bins='auto')

print("Bin edges: ", bin_edges)
print("Counts in each bin: ", counts)

Here ‘auto’ selects the bin width as the smaller of the Sturges and Freedman–Diaconis estimates, so the number of bins adapts to the data. The resulting counts trace out the familiar bell shape of a normal distribution.

For real-world data like iris measurements:

from sklearn.datasets import load_iris

iris = load_iris()

sepal_length = iris['data'][:, 0]

# 15 equal-width bins over the range 4 to 8, normalized to a probability density
counts, bin_edges = np.histogram(sepal_length, bins=15, range=(4, 8), density=True)

We can pass additional arguments like custom bin edges, weights, density normalization etc. as per the data properties and visualization needs.
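To illustrate explicit bin edges and weights together, here is a small sketch. The values, edges and weights are made up for the example and are not taken from any dataset:

```python
import numpy as np

values = np.array([0.5, 1.5, 1.7, 2.2])
weights = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical per-point weights

# Explicit, unequal-width bin edges: [0, 1), [1, 2), [2, 4]
edges = [0, 1, 2, 4]

plain_counts, _ = np.histogram(values, bins=edges)
weighted_counts, _ = np.histogram(values, bins=edges, weights=weights)

print(plain_counts)     # number of points per bin
print(weighted_counts)  # sum of the weights per bin
```

With weights, each point contributes its weight rather than 1 to its bin, so the middle bin holds 2.0 + 3.0 = 5.0.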

Binning Data with `np.digitize()`

While np.histogram() bins and counts data points, NumPy’s np.digitize() function assigns each data point to a bin and returns the bin index per data point.

The bins are specified as a one-dimensional array of bin edges, which must be monotonically increasing or decreasing.

bin_edges = [0, 10, 20, 30, 40]

data = [5, 17, 22, 45, 19]

bins = np.digitize(data, bin_edges)

print(bins)

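# Output: [1 2 3 5 2]

Here 5 falls in bin 1, 17 and 19 fall in bin 2, and 22 falls in bin 3. The value 45 lies beyond the last edge, so it receives index 5, which equals len(bin_edges). These indices are useful for grouping data points into bins for further processing.

By default each interval is closed on the left and open on the right. Passing right=True makes the intervals closed on the right instead.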
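The effect of right=True only shows up for values that sit exactly on a bin edge. A minimal sketch, with edges and values chosen purely for illustration:

```python
import numpy as np

edges = np.array([0, 10, 20, 30])
on_edge = np.array([10, 20])  # values that sit exactly on a bin edge

# Default: edges[i-1] <= x < edges[i]
left_closed = np.digitize(on_edge, edges)
# right=True: edges[i-1] < x <= edges[i]
right_closed = np.digitize(on_edge, edges, right=True)

print(left_closed, right_closed)
```

With the default, 10 and 20 open bins 2 and 3. With right=True they close bins 1 and 2.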

Using pandas for Analysis with Binned Data

Once we have assigned data points to bins using np.digitize(), we can leverage the power of pandas DataFrames to analyze the binned data and gain further insights.

import pandas as pd

bins = np.digitize(data, bin_edges)

# DataFrame with binned data
df = pd.DataFrame({'data': data, 'bin': bins})

# Average value per bin
print(df.groupby('bin').mean())

The pandas groupby() method lets us aggregate metrics per bin like mean, standard deviation, count etc. This allows analyzing trends in the binned data.
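Several statistics can be computed in one pass with agg(). The sketch below reuses the small example values from the np.digitize() section. The column names 'value' and 'bin' are arbitrary choices:

```python
import numpy as np
import pandas as pd

values = [5, 17, 22, 45, 19]
edges = [0, 10, 20, 30, 40]

frame = pd.DataFrame({'value': values, 'bin': np.digitize(values, edges)})

# One row per occupied bin, one column per statistic
stats = frame.groupby('bin')['value'].agg(['mean', 'std', 'count'])
print(stats)
```

Bins that contain a single point report NaN for the standard deviation, and empty bins (here bin 4) do not appear in the result at all.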

Visualizing Histograms with Matplotlib

NumPy computes histograms but does not draw them. For plotting, it is used alongside Matplotlib’s pyplot module.

We can plot a histogram of the raw data, bin counts, or the probability density function as follows:

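import numpy as np
from matplotlib import pyplot as plt

data = np.random.normal(size=1000)
counts, bin_edges = np.histogram(data, bins=30)

# One panel per variant so the plots do not overlap
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram of raw data
axes[0].hist(data)
axes[0].set_ylabel('Frequency')

# Using bin edges and counts from np.histogram()
axes[1].hist(bin_edges[:-1], bins=bin_edges, weights=counts)
axes[1].set_ylabel('Frequency')

# Probability density
axes[2].hist(data, density=True, bins=30)
axes[2].set_ylabel('Probability density')

for ax in axes:
    ax.set_xlabel('Bin range')
fig.suptitle('Histogram')
plt.show()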

We can customize the histogram plots using Matplotlib by adjusting the bin sizes, colors, alpha channels, legends etc.
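As one possible sketch of such customization, the example below overlays two samples using shared bin edges, semi-transparent colors and a legend. The two groups are synthetic data invented for the illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, size=500)  # synthetic sample
group_b = rng.normal(loc=1.5, size=500)  # synthetic sample, shifted

# Shared edges so that both histograms use identical bins
shared_edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=25)

fig, ax = plt.subplots()
ax.hist(group_a, bins=shared_edges, color='tab:blue', alpha=0.5, label='group A')
ax.hist(group_b, bins=shared_edges, color='tab:orange', alpha=0.5, label='group B')
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
ax.legend()
# Call plt.show() to display the figure in an interactive session
```

Using the same edges for both samples keeps the bars comparable. Letting each call pick its own bins would give different widths for each group.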

Example: Analyzing Histograms of Iris Measurements

Let’s put together the concepts so far into an end-to-end example of loading the Iris flower dataset, computing histograms of the measurements, analyzing the distribution and plotting the histogram for sepal length visually.

from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
data = iris['data']

# Histogram of sepal length
sepal_length = data[:, 0]
counts, bin_edges = np.histogram(sepal_length, bins=15, density=True)

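# Digitize sepal length values into bins.
# Passing only the interior edges yields indices 0-14 and keeps the maximum
# value in the last bin, which np.histogram() treats as closed on the right.
bins = np.digitize(sepal_length, bin_edges[1:-1])

# Pandas DataFrame with binned data
df = pd.DataFrame({'sepal_length': sepal_length, 'bin': bins})

# Analyze binned data: mean sepal length per bin
print(df.groupby('bin').sepal_length.mean())

# Compare these means with counts to see where the values concentrate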

# Plot histogram
plt.hist(sepal_length, bins=15, density=True)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Probability density')
plt.show()

This allows us to analyze and visualize the distribution of the sepal length values and gain insights into the dataset. The analysis can be extended to other features as well.

Binning 2D Data

For 2D data, we can use NumPy’s np.histogram2d() to generate a 2D histogram by binning along both axes.

It can reveal correlations, clusters and patterns in multi-dimensional data.

x = np.random.randn(1000)
y = 2 * x + np.random.randn(1000)

counts, xbins, ybins = np.histogram2d(x, y, bins=40)

# xbins, ybins contain bin edges
# counts is 2D array of bin counts

We can plot 2D histograms using Matplotlib’s pcolormesh() method:

# histogram2d puts x along the first axis of counts, so transpose for plotting
plt.pcolormesh(xbins, ybins, counts.T)
plt.xlabel('x')
plt.ylabel('y')
plt.colorbar()
plt.show()

This generates a heatmap-style visualization of the 2D distribution.

The bin counts array can also be used for further analysis like curve fitting.
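As a simple example of working with the counts array directly, the sketch below locates the densest cell and converts its indices to bin centers. The seed and variable names are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1000)
y = 2 * x + rng.standard_normal(1000)

counts, xedges, yedges = np.histogram2d(x, y, bins=40)

# counts[i, j] is the number of points in x-bin i and y-bin j
i, j = np.unravel_index(np.argmax(counts), counts.shape)

# Bin centers are the midpoints of consecutive edges
xcenters = 0.5 * (xedges[:-1] + xedges[1:])
ycenters = 0.5 * (yedges[:-1] + yedges[1:])

print("Densest cell:", xcenters[i], ycenters[j], "with", int(counts[i, j]), "points")

# Summing over the y axis collapses the grid to per-x-bin totals
x_marginal = counts.sum(axis=1)
```

The bin centers computed here are also the natural x and y coordinates to pass to a fitting routine, since the edge arrays are one element longer than each axis of counts.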

Choosing Optimal Bin Sizes

The bin size and range significantly impact the shape of the histogram. Too few bins will oversmooth the distribution, while too many bins can introduce artificial noise.

Some guidelines for selecting bin sizes:

Sturges’ formula - K = ceil(1 + log2(N)), where N is the number of data points and K is the number of bins. Gives a rough starting point.
Scott’s normal reference rule - bin width = 3.5 * std / N^(1/3), where std is the standard deviation. Well suited to approximately normal data.
Square root choice - K = sqrt(N), which gives bin width = (x_max - x_min) / sqrt(N). Simple and fast.
Freedman–Diaconis rule - bin width = 2 * IQR / N^(1/3), where IQR is the interquartile range of the data. Robust to outliers.
Shimazaki and Shinomoto method - chooses the bin width by minimizing a cost function estimated from the data.

Ultimately, domain knowledge should inform the choice, based on the tradeoffs between smoothing, overfitting and revealing true distribution patterns.
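The closed-form rules can be evaluated by hand to see how much they disagree on a given sample. The following is a rough sketch on synthetic normal data, taking the square root choice as K = sqrt(N) bins and rounding bin counts up:

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(size=1000)  # synthetic data for illustration
n = sample.size
data_range = sample.max() - sample.min()

# Sturges: number of bins from the sample size alone
k_sturges = int(np.ceil(1 + np.log2(n)))

# Scott: bin width from the standard deviation
w_scott = 3.5 * sample.std() / n ** (1 / 3)

# Square root choice: number of bins is sqrt(N)
k_sqrt = int(np.ceil(np.sqrt(n)))

# Freedman-Diaconis: bin width from the interquartile range
q75, q25 = np.percentile(sample, [75, 25])
w_fd = 2 * (q75 - q25) / n ** (1 / 3)

# Convert widths to bin counts so the rules can be compared
k_scott = int(np.ceil(data_range / w_scott))
k_fd = int(np.ceil(data_range / w_fd))

print(k_sturges, k_scott, k_sqrt, k_fd)
```

Sturges and the square root choice depend only on N, so they give 11 and 32 bins for any sample of 1000 points. Scott and Freedman–Diaconis also depend on the spread of the data.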

NumPy implements several of these rules directly: np.histogram() accepts estimator names such as 'auto', 'fd', 'scott', 'sturges' and 'sqrt' for the bins argument.
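To compare the built-in estimators without computing a full histogram, np.histogram_bin_edges() returns only the edges. A short sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(size=1000)  # synthetic data for illustration

n_bins = {}
for rule in ['auto', 'fd', 'scott', 'sturges', 'sqrt']:
    edges = np.histogram_bin_edges(sample, bins=rule)
    n_bins[rule] = len(edges) - 1  # one fewer bin than edges

print(n_bins)
```

For floating-point data like this, 'auto' uses the smaller of the 'fd' and 'sturges' bin widths, so its bin count matches the larger of those two.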

Common Histogram Use Cases

Some common use cases where histograms can provide valuable insights into data:

Exploring datasets and identifying inherent patterns, correlations, clusters, anomalies etc.
Comparing empirical distributions to theoretical probability distributions like normal, Poisson etc. Can inform data modeling and simulation.
Analyzing measurement uncertainties and variability. Histograms can reveal systematic biases.
Data cleaning and preprocessing - identifying and handling outliers, anomalous data.
Feature engineering - analyzing impact of transformations on data distributions.
Model evaluation - comparing predicted vs actual value distributions.
Identifying multi-modality, mixture models and subpopulations within heterogeneous data.
Monitoring data trends and shifts over time.
Visual data storytelling and presenting results to non-technical audiences.

Conclusion

To summarize, NumPy offers powerful primitives for generating histograms and binning data through its histogram(), digitize() and histogram2d() functions.

Histograms provide a simple yet flexible technique for visualizing distributions and gaining insights into the shape, spread, modality, outliers etc. in datasets. Binning data enables analyzing and aggregating measurements within sets of value ranges.

Combined with pandas for analysis and Matplotlib for plotting, NumPy’s histogramming and binning capabilities enable comprehensive exploratory data analysis and data visualization for both 1D and 2D data.