Missing data is a common occurrence in real-world datasets. Values can be missing for a variety of reasons: faulty data collection, data corruption, or entries that were simply never recorded. Python's Pandas library provides several useful methods and functions to detect, analyze, and deal with missing values in datasets.
Detecting missing data is an important step in data preprocessing as the presence of missing values can skew results from data analysis. In this comprehensive guide, we will learn the different options available in Pandas to identify, visualize, and summarize missing values in datasets with code examples.
Checking for Null or NaN Values
The simplest way to check for missing data in Pandas is to check for Null or NaN (Not a Number) values. Pandas uses the NumPy NaN value as its primary marker for missing data (None and NaT are treated as missing as well).
We can use the isnull() and notnull() methods to check for Null values:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': [1, 2, np.nan],
                   'Col2': [3, np.nan, np.nan]})

print(df.isnull())
#     Col1   Col2
# 0  False  False
# 1  False   True
# 2   True   True

print(df.notnull())
#     Col1   Col2
# 0   True   True
# 1   True  False
# 2  False  False
The isnull() method returns True for missing values, while notnull() returns True for non-missing values. We can also use the isna() method, which is an alias for isnull():
print(df.isna())
# Same output as isnull() above
To count the number of Null values in each column:
print(df.isnull().sum())
# Col1 1
# Col2 2
# dtype: int64
The sum() method adds up the True values, giving us the count of missing values per column.
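The same idea works row-wise by summing over axis=1; a small sketch counting missing values per row of the DataFrame above:

# Count missing values in each row
print(df.isnull().sum(axis=1))
# 0    0
# 1    1
# 2    2
# dtype: int64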
Detecting Any Missing Values
The isnull() and notnull() methods report where data is missing. Pandas also has built-in methods that act on those missing values directly:

- dropna() - drops rows or columns containing missing values
- fillna() - fills in missing values

Both methods use the same detection logic as isnull() (NaN, None, and NaT count as missing) and work across all column dtypes, including floats, objects, and datetimes.
For example:
df = pd.DataFrame({'Col1': [1, np.nan, 3],
                   'Col2': [3, 4, None]})

print(df.dropna())
#    Col1  Col2
# 0   1.0   3.0

print(df.fillna(0))
#    Col1  Col2
# 0   1.0   3.0
# 1   0.0   4.0
# 2   3.0   0.0
Here dropna() removed every row containing a missing value, while fillna(0) replaced the missing values with 0. Note that the None in Col2 was converted to NaN when the DataFrame was built, which is why both columns end up as floats.
Later we will go over how to customize the behavior of these methods for different types of missing data.
Getting Statistics on Missing Data
Pandas provides various functions to get a statistical summary of missing values in the dataset. These can help identify patterns and correlations related to missing data.
Overall Percentage of Missing Data
The overall percentage of missing values in the DataFrame can be calculated as:
total_missing = df.isnull().sum().sum()
total_cells = df.size  # total number of cells; avoids np.product, which is deprecated in recent NumPy
pct_missing = (total_missing / total_cells) * 100
print(pct_missing)
This gives the overall percentage of missing cells.
Percentage of Missing Values Per Column
To get the percentage of missing values per column:
missing_val_count_by_column = df.isnull().sum()
total_rows = df.shape[0]
pct_missing = (missing_val_count_by_column / total_rows) * 100
print(pct_missing)
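Since the mean of a boolean mask is the fraction of True values, the same result is available as a one-liner:

print(df.isnull().mean() * 100)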
Correlation with Missing Values
Correlating the boolean mask from isnull() with itself shows whether missing values in different columns tend to occur together:

missing_correlations = df.isnull().corr()
print(missing_correlations)

This correlation matrix indicates relationships such as missing values in one column co-occurring with missing values in another. (Note that df.isnull().corrwith(df) would instead pair each column's missingness with its own values, which is not meaningful.)
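A minimal self-contained sketch with hypothetical data, where Col1 and Col2 are always missing together:

df_demo = pd.DataFrame({'Col1': [1, np.nan, 3, np.nan],
                        'Col2': [5, np.nan, 7, np.nan],
                        'Col3': [1, 2, np.nan, 4]})
print(df_demo.isnull().corr())
#           Col1      Col2      Col3
# Col1  1.000000  1.000000 -0.577350
# Col2  1.000000  1.000000 -0.577350
# Col3 -0.577350 -0.577350  1.000000

The perfect correlation between Col1 and Col2 reveals that they are missing in exactly the same rows.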
Bar Plots for Missing Data
Visualizations can also help identify missing data patterns. For example, plotting a bar chart of the percentage of missing values per column:
import matplotlib.pyplot as plt
pct_missing.plot.bar()
plt.xlabel('Column')
plt.ylabel('Percent missing')
plt.title('Percent missing per column')
plt.show()
This bar plot lets us quickly compare the amount of missing values across columns.
Detecting Specific Missing Types
The methods above detect all types of missing data by default. We can also specifically check for certain missing value types like NaN or None values.
Checking for NaN
To check only for NaN float values:
print(df.isnull())
# Detects all missing
print(np.isnan(df))
# Only detects NaN
The Pandas isnull() method detects all missing data, while NumPy's isnan() only checks for NaN float values. Note that np.isnan() raises a TypeError on object-dtype columns, so it is only safe to use on a fully numeric DataFrame.
Checking for None
To check only for None missing values:
print(df.isnull())
# Detects all missing values
print(df.applymap(lambda x: x is None))
# Only detects None missing values
The applymap() method with a custom lambda function checks each cell for None specifically. (In pandas 2.1 and later, DataFrame.map() is the preferred name for applymap().)
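One caveat worth noting: None only survives in object-dtype columns; in numeric columns, pandas converts it to NaN when the data is created. A small sketch:

s_num = pd.Series([1, None])    # numeric dtype: None becomes NaN
s_obj = pd.Series(['a', None])  # object dtype: None is preserved

print(s_num.map(lambda x: x is None))
# 0    False
# 1    False
# dtype: bool

print(s_obj.map(lambda x: x is None))
# 0    False
# 1     True
# dtype: bool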
Checking for empty strings
To check only for empty string values:
print(df.isnull())
# Detects all missing
print(df.eq(''))
# Only detects empty string values
The eq('') method performs an element-wise comparison against the empty string.
By combining these masks, we can check for multiple missing value types at once:

print(df.isnull() | df.eq(''))

This returns True for standard missing values (NaN and None) as well as empty strings. (np.isnan(df) is avoided here because it fails on object-dtype columns.)
Handling Different Missing Value Types
When dealing with missing data, we might want to treat different missing types differently.
For example, we might want to keep NaN values but replace empty strings with the mean. Pandas provides options to customize how methods like dropna()
, fillna()
, etc handle different missing values.
Dropping Rows by Missing Value Types
Note that dropna() only treats NaN, None, and NaT as missing; it ignores empty strings. Within that limitation, we can control which rows are dropped:

# Drop rows with a missing value in any of the listed columns
df.dropna(subset=['Col1', 'Col2'], how='any', inplace=True)

# Drop rows only if all of the listed columns are missing
df.dropna(subset=['Col1', 'Col2'], how='all', inplace=True)

# To treat empty strings as missing too, convert them to NaN first
df = df.replace('', np.nan).dropna(subset=['Col1', 'Col2'])
The how parameter can be set to 'any' or 'all' to control whether a row is dropped when any or all of the columns in subset are missing.
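dropna() also accepts a thresh parameter for finer control over how much missing data is tolerated:

# Keep only rows that have at least 2 non-missing values
df = df.dropna(thresh=2)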
Filling Missing Values by Type
We can also fill different kinds of missing data differently. fillna() only targets NaN/None values and has no subset parameter, so empty strings are handled separately with replace():

# Fill NaN values in numeric columns with the column medians
df = df.fillna(df.median(numeric_only=True))

# Replace empty strings with 'UNK'
df = df.replace('', 'UNK')
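fillna() also accepts a dict mapping column names to fill values, which applies a different fill to each column (assuming here that Col2 holds strings):

df = df.fillna({'Col1': 0, 'Col2': 'UNK'})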
Converting Data Types
For smarter handling, we can also convert data types before checking for missing values:
# Convert object columns to numeric, coercing non-numeric values to NaN
df = df.apply(pd.to_numeric, errors='coerce')

# Fill the resulting NaNs with the column medians
df = df.fillna(df.median())

Converting object dtypes to numeric coerces every non-numeric value (including None and empty strings) to NaN, after which a single fillna() pass handles them all.
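Coercing the entire frame this way would also wipe out genuinely textual columns such as names, so in practice it is safer to restrict the conversion; a sketch with hypothetical column names:

# Only coerce the columns expected to be numeric
numeric_cols = ['Col1', 'Col2']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')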
Checking for Missing Data in Schemas
For tabular data loaded from CSVs, databases, Excel, etc., we can also detect missing columns by comparing the loaded DataFrame against an expected schema.

For example:

import pandas as pd

expected_columns = ['id', 'first_name', 'last_name', 'age']

data = pd.read_csv('data.csv')
print(data.columns)
# Index(['id', 'first_name', 'last_name'], dtype='object')

print(pd.Index(expected_columns).difference(data.columns))
# Index(['age'], dtype='object')

This shows that the 'age' column expected by the schema is missing from the actual data.
Schema validation is useful for catching discrepancies in production datasets.
Best Practices for Dealing with Missing Data
Some best practices when handling missing values:
- Detect missing values as early as possible in data pipelines.
- Visualize missing patterns using bar charts, heatmaps, etc. (see the heatmap sketch at the end of this section).
- Impute missing values carefully to avoid distorting relationships.
- When dropping rows, avoid discarding so many that valuable information is lost.
- Document reasons for missing values and any methods used to handle them.
- Track metrics on missing values to monitor data quality over time.
The best approach depends on the dataset and downstream usage of the data.
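As a concrete example of the visualization point above, here is a minimal heatmap sketch using matplotlib's imshow on the boolean mask from isnull(); dark cells mark missing entries:

import matplotlib.pyplot as plt

# Render the missing-value mask as an image: True (missing) plots dark
plt.imshow(df.isnull(), aspect='auto', interpolation='nearest', cmap='gray_r')
plt.xticks(range(len(df.columns)), df.columns)
plt.xlabel('Column')
plt.ylabel('Row')
plt.title('Missing-value heatmap')
plt.show()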
Conclusion
Detecting missing data is a key step in preparing real-world datasets for analysis. Pandas provides many convenient functions and methods for identifying, visualizing, and summarizing missing values in dataframes. By mastering these tools, data scientists can gain valuable insights into their dataset’s integrity and completeness. The methods discussed in this guide should provide a comprehensive overview of handling missing data in Pandas.