Detecting Missing Data in Pandas

Updated: at 04:46 AM

Missing data is a common occurrence in real-world datasets. Values can be missing for a variety of reasons: faulty data collection, data corruption, or entries that were never recorded. Python’s Pandas library provides several useful methods and functions to detect, analyze, and deal with missing values in datasets.

Detecting missing data is an important step in data preprocessing as the presence of missing values can skew results from data analysis. In this comprehensive guide, we will learn the different options available in Pandas to identify, visualize, and summarize missing values in datasets with code examples.

Checking for Null or NaN Values

The simplest way to check for missing data in Pandas is to check for Null or NaN (Not a Number) values. Pandas uses the NumPy NaN value for missing data in float columns and also treats None, NaT, and pd.NA as missing.

We can use the isnull() and notnull() methods to check for Null values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, np.nan],
                   'Col2': [3, np.nan, np.nan]})

print(df.isnull())
#     Col1   Col2
# 0  False  False
# 1  False   True
# 2   True   True

print(df.notnull())
#     Col1   Col2
# 0   True   True
# 1   True  False
# 2  False  False

The isnull() method returns True for missing values, while notnull() returns True for non-Null values.

We can also use the isna() method, which does exactly the same thing (isnull() is an alias for isna()):

print(df.isna())
# Same output as isnull() above

To count the number of Null values in each column:

print(df.isnull().sum())

# Col1    1
# Col2    2
# dtype: int64

The sum() method sums up the Trues, giving us the count of missing values per column.
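The same Boolean mask can also be summarized per row, or reduced to a single yes/no answer. The following is a minimal sketch on the toy DataFrame from above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2, np.nan],
                   "Col2": [3, np.nan, np.nan]})

# Number of missing values in each row (axis=1 sums across columns)
missing_per_row = df.isna().sum(axis=1)

# True if the DataFrame contains at least one missing value anywhere
has_missing = bool(df.isna().any().any())

# Only the rows that contain at least one missing value
incomplete_rows = df[df.isna().any(axis=1)]

print(missing_per_row.tolist())        # [0, 1, 2]
print(has_missing)                     # True
print(incomplete_rows.index.tolist())  # [1, 2]
```

Selecting the incomplete rows is often a useful first step for inspecting why values are missing.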

Detecting Any Missing Values

The isnull() and notnull() methods above only report where values are missing. Pandas also has built-in methods that act on those values directly, such as dropna() and fillna().

By default these methods treat NaN, None, NaT, and pd.NA as missing, whatever the data type of the column (integer, float, boolean, etc.).

For example:

df = pd.DataFrame({'Col1': [1, np.nan, 3],
                   'Col2': [3, 4, None]})

print(df.dropna())

#    Col1  Col2
# 0   1.0   3.0

# Col2 is float64 because None is stored as NaN in a numeric column
print(df.fillna(0))

#    Col1  Col2
# 0   1.0   3.0
# 1   0.0   4.0
# 2   3.0   0.0

Here dropna() removed rows with any missing value while fillna(0) replaced missing values with 0.

Later we will go over how to customize the behavior of these methods for different types of missing data.
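One detail worth seeing in isolation is what happens to None in a numeric column. The sketch below assumes default construction versus the nullable "Int64" extension dtype; the Series names are illustrative.

```python
import pandas as pd

# Default construction: None becomes NaN, so the integers are upcast to float64
s_default = pd.Series([3, 4, None])

# Nullable integer dtype: the values stay integers and None becomes <NA>
s_nullable = pd.Series([3, 4, None], dtype="Int64")

print(s_default.dtype)              # float64
print(s_nullable.dtype)             # Int64
print(s_nullable.isna().tolist())   # [False, False, True]
```

Either way, isna(), dropna() and fillna() see the third entry as missing; only the dtype of the column differs.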

Getting Statistics on Missing Data

Pandas provides various functions to get a statistical summary of missing values in the dataset. These can help identify patterns and correlations related to missing data.

Overall Percentage of Missing Data

The overall percentage of missing values in the DataFrame can be calculated as:

import numpy as np

total = df.isnull().sum().sum()
total_cells = np.prod(df.shape)  # rows * columns (np.product was removed in NumPy 2.0)
pct_missing = (total/total_cells) * 100

print(pct_missing)

This gives the overall percentage of missing cells.

Percentage of Missing Values Per Column

To get the percentage of missing values per column:

missing_val_count_by_column = (df.isnull().sum())
total_cells_by_column = df.shape[0]

pct_missing = (missing_val_count_by_column/total_cells_by_column) * 100
print(pct_missing)
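Because the mean of a Boolean mask is the fraction of True values, both percentages can also be computed more compactly. This is a sketch on a small made-up DataFrame, not a replacement for the step-by-step versions above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, np.nan, 3, 4],
                   "Col2": [np.nan, np.nan, 3, 4]})

# Fraction of missing values per column, expressed as a percentage
pct_by_column = df.isna().mean() * 100

# Fraction of missing cells in the whole DataFrame
pct_overall = df.isna().to_numpy().mean() * 100

print(pct_by_column.to_dict())  # {'Col1': 25.0, 'Col2': 50.0}
print(pct_overall)              # 37.5
```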

Correlation with Missing Values

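The corr() method, applied to the Boolean missing-value mask, shows whether missing values in one column tend to occur together with missing values in another:

missing_correlations = df.isnull().astype(int).corr()
print(missing_correlations)

Each entry of this correlation matrix compares the missing-value indicators of two columns. A value near 1 means the two columns tend to be missing in the same rows. Columns with no missing values at all show NaN because their indicator is constant.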
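To make the idea of correlated missingness concrete, here is a small sketch that correlates 0/1 missing indicators on a made-up DataFrame. The column names and values are assumptions chosen so that the pattern is obvious.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [5.0, np.nan, 7.0, np.nan],   # missing in the same rows as "a"
    "c": [np.nan, 2.0, np.nan, 4.0],   # missing exactly where "a" is present
})

# 1 where a value is missing, 0 where it is present
indicators = df.isna().astype(int)

# Pairwise Pearson correlation between the indicator columns
missing_corr = indicators.corr()

print(missing_corr.round(2))
```

Here "a" and "b" correlate positively (they go missing together), while "a" and "c" correlate negatively (one is missing when the other is present).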

Bar Plots for Missing Data

Visualizations can also help identify missing data patterns. For example, plotting a bar chart of the percentage of missing values per column:

import matplotlib.pyplot as plt

pct_missing.plot.bar()
plt.xlabel('Column')
plt.ylabel('Percent missing')
plt.title('Percent missing per column')

plt.show()

This bar plot lets us quickly compare the amount of missing values across columns.

Detecting Specific Missing Types

The methods above flag every value that Pandas treats as missing (NaN, None, NaT, and pd.NA). We can also specifically check for certain missing value types like NaN or None values.

Checking for NaN

To check only for NaN float values:

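print(df.isnull())

# Detects NaN, None, NaT and pd.NA

print(np.isnan(df))

# Only detects NaN; works on numeric columns only

The Pandas isnull() method detects every value Pandas treats as missing, while NumPy’s isnan() only checks for NaN float values and raises a TypeError on object (for example, string) columns.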
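The difference is easiest to see on concrete values. The sketch below uses made-up data to show which values isna() flags, and what np.isnan() does on numeric versus object columns.

```python
import numpy as np
import pandas as pd

# An object Series holding several candidate "missing" values
values = pd.Series([1.5, np.nan, None, pd.NaT, pd.NA, "", np.inf], dtype="object")
flags = values.isna().tolist()
print(flags)  # [False, True, True, True, True, False, False]

# np.isnan works element-wise on an all-numeric DataFrame
numeric = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})
nan_mask = np.isnan(numeric)

# On a DataFrame with an object column it raises TypeError instead
mixed = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})
try:
    np.isnan(mixed)
    isnan_failed = False
except TypeError:
    isnan_failed = True

print(isnan_failed)  # True
```

Note that neither the empty string nor infinity counts as missing for isna().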

Checking for None

To check only for None missing values:

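print(df.isnull())

# Detects all missing values

print(df.map(lambda x: x is None))  # use applymap() on pandas < 2.1

# Only detects None missing values

The map() method (named applymap() before Pandas 2.1) with a custom lambda function can be used to check only for None values. Note that None survives only in object columns; in numeric columns Pandas stores it as NaN.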

Checking for empty strings

To check only for empty string values:

print(df.isnull())

# Detects NaN and None, but not empty strings

print(df.eq(''))

# Only detects empty string values

The eq('') method checks for empty strings, which isnull() does not treat as missing.

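By combining these masks with the | operator, we can check for multiple specific missing value types:

print(df.isnull() | df.eq(''))

This returns True for NaN and None values as well as for empty strings. isnull() is used here rather than np.isnan() because a DataFrame that contains strings has object columns, on which np.isnan() fails.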
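Real datasets often use several placeholder strings for missing entries. One way to extend the combined mask is isin() with a list of placeholders; the list below is a hypothetical example and should be adapted to the dataset at hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "", None, "n/a"],
                   "score": [1.0, np.nan, 3.0, 4.0]})

# Hypothetical placeholder strings that should count as missing
PLACEHOLDERS = ["", "n/a"]

# True where the value is NaN/None or one of the placeholder strings
missing_mask = df.isna() | df.isin(PLACEHOLDERS)

print(missing_mask.sum().to_dict())  # {'name': 3, 'score': 1}
```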

Handling Different Missing Value Types

When dealing with missing data, we might want to treat different missing types differently.

For example, we might want to keep NaN values but replace empty strings with a placeholder. Methods like dropna() and fillna() act only on values that isnull() flags, so other markers such as empty strings must first be converted to NaN or handled with replace().

Dropping Rows by Missing Value Types

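To drop rows depending on which values are missing:

# Drop rows where any of the listed columns is missing
df.dropna(subset=['Col1','Col2'], how='any', inplace=True)

# Treat empty strings as missing, then drop rows only if all listed columns are missing
df = df.replace('', np.nan)
df.dropna(subset=['Col1','Col2'], how='all', inplace=True)

The subset parameter limits the check to the listed columns, and the how parameter controls whether a row is dropped when any ('any') or all ('all') of those columns are missing.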
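The effect of these dropna() options can be compared side by side on a small made-up DataFrame. The sketch also uses thresh, which keeps rows that have at least the given number of non-missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, np.nan, np.nan, 4],
                   "Col2": [1, 2, np.nan, np.nan],
                   "Col3": ["a", "b", "c", None]})

# Keep only rows with no missing value in any column
no_missing = df.dropna(how="any")

# Drop a row only when both Col1 and Col2 are missing
both_missing_dropped = df.dropna(subset=["Col1", "Col2"], how="all")

# Keep rows that have at least two non-missing values
at_least_two = df.dropna(thresh=2)

print(no_missing.index.tolist())            # [0]
print(both_missing_dropped.index.tolist())  # [0, 1, 3]
print(at_least_two.index.tolist())          # [0, 1]
```

None of these calls uses inplace=True, so the original df is left unchanged and the variants can be compared.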

Filling Missing Values by Type

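We can also fill different missing data types differently:

# Fill NaN in numeric columns with each column's median
df = df.fillna(value=df.median(numeric_only=True))

# Replace empty strings with 'UNK' (fillna() does not touch them)
df = df.replace('', 'UNK')

fillna() has no subset parameter. Passing a Series or dict as value fills each column with its own value, and replace() handles markers such as empty strings that Pandas does not treat as missing.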
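As a hedged illustration of column-specific filling, the sketch below passes a dict to fillna() and handles empty strings separately with replace(). The column names and the 'UNK' label are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                   "city": ["Oslo", "", None]})

# A dict maps each column to its own fill value
filled = df.fillna({"age": df["age"].median(), "city": "UNK"})

# Empty strings are not missing for pandas, so replace them explicitly
filled["city"] = filled["city"].replace("", "UNK")

print(filled["age"].tolist())   # [20.0, 30.0, 40.0]
print(filled["city"].tolist())  # ['Oslo', 'UNK', 'UNK']
```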

Converting Data Types

For smarter handling, we can also convert data types before checking for missing values:

# Convert object columns to numeric
df = df.apply(pd.to_numeric, errors='coerce')

# Fill the resulting NaNs (original None values have also become NaN)
df = df.fillna(df.median())

Converting object dtypes to numeric turns non-numeric values, including None and empty strings, into NaN. Then we can fill all of them in one step.
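Since coercion hides the difference between values that were already missing and values that merely failed to parse, it can help to record which entries were coerced. A minimal sketch on a made-up Series:

```python
import pandas as pd

raw = pd.Series(["10", "x", None, "3.5", ""])

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but turned into NaN during coercion
coerced = raw.notna() & numeric.isna()

print(int(numeric.isna().sum()))       # 3
print(coerced[coerced].index.tolist()) # [1, 4]
```

Here positions 1 and 4 ('x' and the empty string) were coerced, while position 2 was missing from the start.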

Checking for Missing Data in Schemas

For tabular data loaded from CSVs, databases, Excel etc, we can also detect columns that are missing compared to an expected schema.

Pandas has no dedicated schema validator, but comparing the loaded column index with the list of expected names is enough for this check.

For example:

import pandas as pd

schema = ['id', 'first_name', 'last_name', 'age']

# Read the header from the file itself; passing names=schema would overwrite the file's column names
data = pd.read_csv('data.csv')

print(data.columns)
# Index(['id', 'first_name', 'last_name'], dtype='object')

# Columns that are in the schema but not in the data
print(pd.Index(schema).difference(data.columns))
# Index(['age'], dtype='object')

This shows that the ‘age’ column present in the schema is missing from the actual data.

Schema validation is useful for catching discrepancies in production datasets.
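A self-contained version of this check might look like the sketch below. It reads an in-memory CSV through io.StringIO so that it runs without a file on disk; the CSV contents and variable names are made up for illustration.

```python
import io

import pandas as pd

expected_columns = ["id", "first_name", "last_name", "age"]

# Stand-in for a real file; this CSV has no "age" column
csv_text = "id,first_name,last_name\n1,Ada,Lovelace\n2,Alan,Turing\n"
data = pd.read_csv(io.StringIO(csv_text))

# Expected columns that the data does not contain
missing_columns = [c for c in expected_columns if c not in data.columns]

# Columns in the data that the schema does not mention
unexpected_columns = [c for c in data.columns if c not in expected_columns]

print(missing_columns)     # ['age']
print(unexpected_columns)  # []
```

Checking in both directions catches renamed columns as well, since a rename shows up as one missing and one unexpected name.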

Best Practices for Dealing with Missing Data

There is no single set of best practices for handling missing values. The best approach depends on the dataset and downstream usage of the data.

Conclusion

Detecting missing data is a key step in preparing real-world datasets for analysis. Pandas provides many convenient functions and methods for identifying, visualizing, and summarizing missing values in dataframes. By mastering these tools, data scientists can gain valuable insights into their dataset’s integrity and completeness. The methods discussed in this guide should provide a comprehensive overview of handling missing data in Pandas.