Handling missing data is an essential skill for any Python developer working with data. When data has missing values, it can skew results and impact the performance of machine learning models. The Pandas library provides powerful tools for detecting, filtering, and handling missing values in Python.
In this comprehensive guide, we will explore the isnull() and notnull() functions in Pandas for identifying missing data in Python.
What is Missing Data and Why Does it Matter?
Missing data refers to values in a dataset that are unobserved or incomplete. In Python, these missing values are represented by special values like NaN (Not a Number) for numeric data and None for objects.
Some common reasons why data might be missing include:
- Data was not collected or lost
- Respondents declined to answer survey questions
- Errors in data collection or measurement
- Data corruption or incorrectly formatted data
Regardless of the cause, missing data can reduce the representativeness and quality of a dataset. It can introduce bias and affect the validity of any analysis results. That’s why detecting and properly handling missing values is so important in data analysis and machine learning pipelines.
Using Pandas isnull() to Detect Missing Numeric and Object Values
The Pandas isnull() function detects missing values and returns a Boolean mask indicating whether each value is NaN or None.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, None, np.nan]})
print(df)
# A B
# 0 1.0 5.0
# 1 2.0 NaN
# 2 NaN NaN
print(df.isnull())
# A B
# 0 False False
# 1 False True
# 2 True True
In this example, df is a Pandas DataFrame containing two columns with missing numeric and object values. Calling isnull() returns a DataFrame showing where the NaN and None values are located.
Let's break down how isnull() works:
- Applies to DataFrames and Series
- Returns a Boolean mask with the same shape as the original object
- True indicates a missing value, False indicates a valid observed value
- Handles missing data for numeric, string, and other object dtypes
Some key points:
- NaN values are always considered missing by isnull()
- For object columns, both None and NaN are considered missing
- Other values like empty strings '' or 0 are NOT considered missing by default
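To make the last point concrete, here is a quick sketch showing that isnull() treats empty strings and zeros as valid, observed values:

```python
import pandas as pd
import numpy as np

# Only NaN and None register as missing; '' and 0 are ordinary values
s = pd.Series(['', 0, np.nan, None])
print(s.isnull().tolist())
# [False, False, True, True]
```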
Checking isnull on DataFrame Columns
To check missing values per column, use isnull() on a specific Series from the DataFrame:
print(df['A'].isnull())
# 0 False
# 1 False
# 2 True
# Name: A, dtype: bool
This returns a Boolean Series showing missing values just for that column.
Checking isnull on DataFrame Rows
isnull() itself takes no axis argument; it always returns a mask with the same shape as the DataFrame. To check for missing values per row, aggregate the mask along axis=1:
print(df.isnull().any(axis=1))
# 0    False
# 1     True
# 2     True
# dtype: bool
The result is a Boolean Series with one entry per row, indicating whether that row contains any missing values.
Using Pandas notnull() as the Inverse of isnull()
The notnull() function serves as the inverse of isnull(). It detects non-missing values and returns True for them:
print(df.notnull())
# A B
# 0 True True
# 1 True False
# 2 False False
In this example, notnull() highlights where the valid values are located in the DataFrame.
Some key properties of notnull():
- Returns a Boolean mask indicating valid non-missing values
- Inverse of isnull(), checking for the opposite condition
- Can be used similarly on DataFrames, Series, columns, or rows
- Useful for filtering, selecting, or omitting missing values
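As a small sketch of filtering with notnull(), using the same toy DataFrame as above, we can keep only the rows where a given column has an observed value:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, None, np.nan]})

# Boolean indexing with notnull() keeps only rows where 'B' is observed
valid_b = df[df['B'].notnull()]
print(valid_b)
#      A    B
# 0  1.0  5.0
```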
Counting Missing Values in Pandas
To get the total number of missing values in a DataFrame, chain the isnull() and sum() methods:
missing_count = df.isnull().sum().sum()
print(missing_count)
# 3
Breaking this down:
- Call isnull() on the DataFrame to get the Boolean mask
- The first sum() counts the True values in each column
- The second sum() adds the per-column counts into a single total
We can also summarize the number of missing values per column:
print(df.isnull().sum())
# A 1
# B 2
# dtype: int64
This returns a Series with the null count for each column. Useful for quickly identifying columns with excessive missing values.
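A handy variant is isnull().mean(), which gives the fraction of missing values per column, since the mean of a Boolean mask is the proportion of True values. A quick sketch on the same DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, None, np.nan]})

# Mean of the Boolean mask = proportion missing; multiply by 100 for percent
pct_missing = df.isnull().mean() * 100
print(pct_missing)
# A    33.333333
# B    66.666667
# dtype: float64
```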
Filtering Out Missing Data in Pandas
We can use isnull() and notnull() to filter out missing or non-missing values:
Drop Missing Values with dropna()
The dropna() method drops rows or columns containing missing values:
# Drop rows with any missing values
df.dropna()
#      A    B
# 0  1.0  5.0

# Drop columns with any missing values
# (both columns contain at least one NaN here, so both are dropped)
df.dropna(axis=1)
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2]
dropna() is a great way to filter out missing data when needed. It has several options for customizing the filtering:
- axis: Drop rows (0) or columns (1)
- how: Drop rows/columns where 'any' or 'all' of the values are missing
- thresh: Require at least that many non-missing values to keep a row/column
- subset: Only consider a subset of columns when checking for missing values
Fill Missing Values with fillna()
The fillna() method lets us fill in missing data with a specified value:
df.fillna(0)
# A B
# 0 1.0 5.0
# 1 2.0 0.0
# 2 0.0 0.0
Here we filled missing values with 0. Other common options are filling with the column mean or median, or forward filling from the last valid value.
This allows us to fill missing values while still retaining the original object shape. Useful for imputing simple dummy values when filtering is not desired.
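As a sketch of those alternatives on the same toy DataFrame: fillna() accepts a Series of per-column fill values such as df.mean(), and ffill() performs a forward fill:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, None, np.nan]})

# Fill each column's gaps with that column's mean (A: 1.5, B: 5.0)
mean_filled = df.fillna(df.mean())
print(mean_filled)

# Forward fill: carry the last valid observation down each column
forward_filled = df.ffill()
print(forward_filled)
```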
Imputing Missing Values in Pandas
For many applications, dropping or filling missing values with a simple dummy value is not sufficient. We need intelligent methods to predict and infer what the missing values should be.
Some options for imputing missing values include:
- Mean/Median/Mode Imputation: Replace with average, median, or most common value
- Regression Imputation: Use regression to predict missing values from other columns
- Time Series Methods: Impute based on trend over time
- Nearest Neighbors: Use values from similar rows
- Machine Learning: Train a model to predict missing values
Let’s look at an example of imputing missing values using the column mean:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputer.fit(df)
df_imputed = pd.DataFrame(imputer.transform(df), columns=df.columns)
print(df_imputed)
#      A    B
# 0  1.0  5.0
# 1  2.0  5.0
# 2  1.5  5.0
Here we used scikit-learn's SimpleImputer to calculate the mean for each column (1.5 for A, 5.0 for B, since 5.0 is B's only observed value) and impute missing values with it. This allows us to create a complete dataset by intelligently filling missing values.
More advanced methods like MICE can model correlations between columns for higher accuracy imputation.
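As one sketch of this idea, scikit-learn offers IterativeImputer, a MICE-style imputer that models each column as a function of the others. It is an experimental API, so the explicit enable import below is required:

```python
# IterativeImputer is experimental; this enable import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, None, np.nan]})

# Iteratively regress each column on the others to estimate missing entries
imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed.isnull().sum().sum())  # 0 missing values remain
```

On a tiny DataFrame like this the result is close to mean imputation; the benefit of modeling inter-column correlations shows up on larger, richer datasets.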
Use Cases and Examples
Detecting and handling missing data is necessary in many real-world applications:
- Healthcare: Patient records with missing lab results or demographic data. Imputing values is needed for predictive models.
- Survey research: Respondents skip questions or provide incomplete responses. Missing values must be flagged or addressed to avoid bias.
- Predictive maintenance: Sensor measurements from machinery contain gaps due to downtime. Values need interpolation for accurate forecasts.
- Fraud detection: Transaction logs have missing identifiers or timestamps. Dropping these rows may discard useful signals.
- Image recognition: Dataset images are corrupted or unlabeled. We may filter out missing labels or attempt to restore images.
The main considerations for choosing methods are:
- How prevalent and dispersed is the missing data?
- What is the cause of the missingness - random or systematic?
- How important is the completeness and integrity of data for analysis?
Pandas provides a flexible toolkit to explore these questions and handle missing data using the best approaches for different applications.
Conclusion
Detecting and handling missing data is a key skill for any Python developer working with data analysis and machine learning. Pandas' isnull() and notnull() provide simple yet powerful ways to identify, analyze, and filter missing values in your data.
Mastering these tools will enable you to wrangle messy datasets, isolate issues, and apply the appropriate remedies to missing data for more accurate modeling and analytics. The next step is to explore how to properly encode categories, transform features, and develop pipelines to feed cleaner, complete data into machine learning models.