Comprehensive Guide to Detecting Missing Data in Pandas with isnull() and notnull()

Handling missing data is an essential skill for any Python developer working with data. When data has missing values, it can skew results and impact the performance of machine learning models. The Pandas library provides powerful tools for detecting, filtering, and handling missing values in Python.

In this comprehensive guide, we will explore the isnull() and notnull() functions in Pandas for identifying missing data, and then look at counting, filtering, and imputing missing values.

What is Missing Data and Why Does it Matter?

Missing data refers to values in a dataset that were never observed or recorded. In Pandas, these missing values are represented by special markers such as NaN (Not a Number) for numeric data, None for object columns, and NaT for datetimes.
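
A minimal sketch of how these markers show up in practice (the Series and variable names here are illustrative):

import numpy as np
import pandas as pd

s_num = pd.Series([1.0, np.nan, None])  # None is coerced to NaN; dtype becomes float64
s_obj = pd.Series(['a', None, 'b'])     # in an object Series the missing entry stays None
s_dt = pd.Series(pd.to_datetime(['2021-01-01', None]))  # missing datetimes become NaT

print(s_num.dtype, s_obj.dtype, s_dt.dtype)
# float64 object datetime64[ns]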

Some common reasons why data might be missing include data entry errors, equipment or sensor failures, survey non-response, and records that fail to match when merging datasets.

Regardless of the cause, missing data can reduce the representativeness and quality of a dataset. It can introduce bias and affect the validity of any analysis results. That’s why detecting and properly handling missing values is so important in data analysis and machine learning pipelines.

Using Pandas isnull() to Detect Missing Numeric and Object Values

The Pandas isnull() function detects missing values and returns a Boolean mask indicating whether each value is NaN or None.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, None, np.nan]})

print(df)
#      A    B
# 0  1.0  5.0
# 1  2.0  NaN
# 2  NaN  NaN

print(df.isnull())
#        A      B
# 0  False  False
# 1  False   True
# 2   True   True

In this example, df is a Pandas DataFrame whose columns contain missing values entered as both np.nan and None (the None in column B is converted to NaN when the numeric column is created). Calling isnull() returns a Boolean DataFrame of the same shape showing where the missing values are located.

Some key points about how isnull() works:

  1. It operates element-wise and returns a Boolean object of the same shape as its input.
  2. It returns True wherever a value is NaN, None, or NaT, and False for valid values.
  3. It can be called on an entire DataFrame, a single Series, or an individual column.
  4. isna() is an alias with identical behavior.

Checking isnull on DataFrame Columns

To check missing values per column, use isnull() on a specific Series from the DataFrame:

print(df['A'].isnull())

# 0    False
# 1    False
# 2     True
# Name: A, dtype: bool

This returns a Boolean Series showing missing values just for that column.

Checking isnull on DataFrame Rows

isnull() itself does not accept an axis argument, but we can combine it with any(axis=1) to check for missing values per row:

print(df.isnull().any(axis=1))

# 0    False
# 1     True
# 2     True
# dtype: bool

The result is a Boolean Series with one entry per row, indicating whether that row contains any missing values.
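
The mask can also be used for Boolean indexing, for example to view only the rows that contain at least one missing value (a brief sketch on the same df):

print(df[df.isnull().any(axis=1)])

#      A    B
# 1  2.0  NaN
# 2  NaN  NaN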

Using Pandas notnull() as the Inverse of isnull()

The notnull() function serves as the inverse or opposite of isnull(). It detects and returns True for non-missing values:

print(df.notnull())

#        A      B
# 0   True   True
# 1   True  False
# 2  False  False

In this example, notnull() highlights where the valid values are located in the DataFrame.

Some key properties of notnull():

  1. It is the exact element-wise inverse of isnull(): every True becomes False and vice versa.
  2. It returns True for valid values and False for NaN, None, and NaT.
  3. notna() is an alias with identical behavior.
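
A short sketch of notnull() in practice: it pairs naturally with Boolean indexing, for example to keep only the rows where column A has a valid value (same df as above):

print(df[df['A'].notnull()])

#      A    B
# 0  1.0  5.0
# 1  2.0  NaN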

Counting Missing Values in Pandas

To get the total number of missing values in a DataFrame, use the isnull() and sum() methods:

missing_count = df.isnull().sum().sum()
print(missing_count)
# 3

Breaking this down:

  1. Call isnull() on the DataFrame to get the Boolean mask.
  2. The first sum() counts the True values in each column.
  3. The second sum() adds those per-column counts into a single total.

We can also summarize the number of missing values per column:

print(df.isnull().sum())

# A    1
# B    2
# dtype: int64

This returns a Series with the null count for each column, which is useful for quickly identifying columns with excessive missing values.
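
A related sketch using only standard Pandas methods: because True counts as 1, taking the mean of the Boolean mask gives the fraction of missing values per column:

print(df.isnull().mean())

# A    0.333333
# B    0.666667
# dtype: float64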

Filtering Out Missing Data in Pandas

We can use isnull() and notnull() to filter out missing or non-missing values:

Drop Missing Values with dropna()

The dropna() method drops rows or columns containing missing values:

# Drop rows with any missing values
df.dropna()

#      A    B
# 0  1.0  5.0

# Drop columns with any missing values
df.dropna(axis=1)

# Empty DataFrame
# Columns: []
# Index: [0, 1, 2]

Because both columns of this DataFrame contain at least one missing value, dropping columns with any missing values leaves an empty DataFrame. To drop only columns that are entirely missing, pass how='all'.

dropna() is a great way to filter out missing data when needed. It has several options for customizing the filtering (a short sketch follows the list):

  1. how='any' (the default) drops a row or column if any value is missing; how='all' drops it only if every value is missing.
  2. thresh=n keeps only rows or columns with at least n non-missing values.
  3. subset=[...] restricts the check to specific columns (or index labels when axis=1).
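
A brief sketch of thresh and subset on the same df:

# Keep only rows with at least 2 non-missing values
print(df.dropna(thresh=2))

#      A    B
# 0  1.0  5.0

# Drop rows only when column 'A' is missing
print(df.dropna(subset=['A']))

#      A    B
# 0  1.0  5.0
# 1  2.0  NaN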

Fill Missing Values with fillna()

The fillna() method lets us fill in missing data with a specified value:

df.fillna(0)

#       A    B
# 0   1.0  5.0
# 1   2.0  0.0
# 2   0.0  0.0

Here we filled missing values with 0. Other common options are using the mean, median, or forward fill of valid values.

This allows us to fill missing values while retaining the original shape of the object, which is useful for imputing simple placeholder values when dropping data is not desired.
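
A quick sketch of the alternatives mentioned above, filling with each column's mean or forward filling from the last valid value (same df):

# Fill each column's missing values with that column's mean
print(df.fillna(df.mean()))

#      A    B
# 0  1.0  5.0
# 1  2.0  5.0
# 2  1.5  5.0

# Forward fill: propagate the last valid value downward
print(df.ffill())

#      A    B
# 0  1.0  5.0
# 1  2.0  5.0
# 2  2.0  5.0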

Imputing Missing Values in Pandas

For many applications, dropping or filling missing values with a simple dummy value is not sufficient. We need intelligent methods to predict and infer what the missing values should be.

Some options for imputing missing values include:

  1. Replacing missing values with a column statistic such as the mean, median, or mode.
  2. Forward or backward filling, or interpolating between valid values.
  3. Model-based imputation that predicts missing values from the other columns, for example k-nearest neighbors or iterative (MICE-style) methods.

Let’s look at an example of imputing missing values using the column mean:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
imputer.fit(df)

df_imputed = pd.DataFrame(imputer.transform(df), columns=df.columns)

print(df_imputed)

#      A    B
# 0  1.0  5.0
# 1  2.0  5.0
# 2  1.5  5.0

Here we used scikit-learn’s SimpleImputer to calculate the mean for each column and impute missing values with it. This allows us to create a complete dataset by intelligently filling missing values.

More advanced methods like MICE can model correlations between columns for higher accuracy imputation.
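
For instance, scikit-learn provides an experimental IterativeImputer that implements a MICE-style approach. A minimal sketch, assuming the same df as above (no output is shown because the imputed values depend on the fitted model):

# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Each column with missing values is modeled as a function of the other columns
iterative_imputer = IterativeImputer(random_state=0)
df_mice = pd.DataFrame(iterative_imputer.fit_transform(df), columns=df.columns)
print(df_mice)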

Use Cases and Examples

Detecting and handling missing data is necessary in many real-world applications, from cleaning raw survey or sensor data to preparing complete feature matrices for machine learning models.

The main considerations for choosing methods are:

  1. How prevalent and dispersed is the missing data?
  2. What is the cause of the missingness - random or systematic?
  3. How important is the completeness and integrity of data for analysis?

Pandas provides a flexible toolkit to explore these questions and handle missing data using the best approaches for different applications.

Conclusion

Detecting and handling missing data is a key skill for any Python developer working with data analysis and machine learning. Pandas’ isnull() and notnull() provide simple yet powerful ways to identify, analyze, and filter missing values in your data.

Mastering these tools will enable you to wrangle messy datasets, isolate issues, and apply the appropriate remedies to missing data for more accurate modeling and analytics. The next step is to explore how to properly encode categories, transform features, and develop pipelines to feed cleaner, complete data into machine learning models.