How to Exclude Missing Values in Python Pandas

Handling missing data is a common task in data analysis and data science projects. Pandas, Python’s popular data analysis library, provides many useful features for dealing with missing values in DataFrames and Series objects.

In Pandas, missing values are represented by NaN (Not a Number). Filtering out missing values or excluding rows and columns containing NaN values is often necessary for preparing clean, complete datasets for analysis and modeling. This allows you to focus on just the valid, complete observations without distortion from null entries.

This guide will demonstrate various methods to exclude or drop missing values in Pandas using realistic examples, from detecting nulls to dropping, filling, and deliberately retaining them.

Checking for Null Values

The first step is detecting which cells in a DataFrame or Series contain NaN values. Pandas provides the isna() and notna() methods to check for nulls and non-nulls respectively:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan]})

df.isna()
# Returns True for each NaN value

df.notna()
# Returns True for each non-NaN value

The isnull() and notnull() methods are aliases of isna() and notna().

Summing the Boolean DataFrame returned by isna() gives the total count of missing values per column:

df.isna().sum()

A    1
B    2
dtype: int64

To count NaN values in the entire DataFrame, use isna().sum().sum().

These methods enable detecting nulls before handling them according to your requirements.
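Putting the detection methods together, here is a small runnable sketch using the same DataFrame as above:

```python
import numpy as np
import pandas as pd

# Sample DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan]})

# Per-column counts of missing values
per_column = df.isna().sum()
print(per_column['A'])  # 1
print(per_column['B'])  # 2

# Total count of missing values across the whole DataFrame
total = df.isna().sum().sum()
print(total)  # 3
```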

Dropping Rows and Columns with NaN Values

A common way to handle missing data is to simply exclude rows and/or columns containing NaN values. Pandas provides two main methods for this:

Drop Rows with dropna()

The dropna() method on DataFrames excludes any rows containing NaN values.

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan]})

df.dropna()
# Drops the second and third rows

By default dropna() drops any row with at least 1 NaN. Pass how='all' to only drop rows where all values are NaN:

df = pd.DataFrame({'A': [1, np.nan, np.nan],
                   'B': [5, np.nan, np.nan]})

df.dropna(how='all')
# Only drops third row with all NaNs

The thresh argument keeps only rows with at least that many non-NaN values, dropping the rest:

df.dropna(thresh=2)
# Keeps rows with >= 2 non-NaN values

subset allows specifying columns to consider when looking for NaNs:

df.dropna(subset=['A'])
# Only drops rows with NaN in column 'A'
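A runnable sketch combining these row-dropping options on one DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan]})

# Default: drop any row containing at least one NaN
complete = df.dropna()
print(len(complete))  # 1 -- only the first row survives

# thresh=2: keep rows with at least 2 non-NaN values
at_least_two = df.dropna(thresh=2)
print(len(at_least_two))  # 1

# subset: only consider column 'A' when checking for NaNs
a_present = df.dropna(subset=['A'])
print(len(a_present))  # 2
```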

Drop Columns with dropna(axis=1)

To exclude columns containing NaN values, pass axis=1 to dropna(). (The separate drop() method removes columns by name, e.g. df.drop(columns=['A']), regardless of nulls.)

df.dropna(axis=1)
# Drops any column containing at least one NaN

The how and thresh arguments work the same way along columns. For example, thresh=1 keeps columns with at least one non-NaN value, i.e. it drops only all-NaN columns:

df.dropna(axis=1, thresh=1)
# Drops columns that are entirely NaN

This provides an easy way to filter out columns missing too much data to be usable.
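A runnable sketch of column-wise dropping (note that thresh counts non-NaN values per column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [5, np.nan, 6],
                   'C': [np.nan, np.nan, np.nan]})

# Drop columns that are entirely NaN
no_empty = df.dropna(axis=1, how='all')
print(list(no_empty.columns))  # ['A', 'B']

# thresh=3: keep only columns with at least 3 non-NaN values
mostly_full = df.dropna(axis=1, thresh=3)
print(list(mostly_full.columns))  # ['A']
```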

Filling Missing Values

Sometimes it is preferable to fill or impute missing values rather than excluding entire rows or columns containing them. This retains more complete data.

Pandas provides several methods for imputation like fillna(), interpolate(), and replace().

fillna()

The fillna() method replaces NaN values with a specified value. Passing a single value fills all NaNs with it:

df.fillna(0) # Replace NaNs with 0

A dictionary can be passed to fill different values per column:

df.fillna({'A': 0, 'B': 5})

To propagate the last valid value forward into NaNs (or the next valid value backward), use forward fill or backward fill. Recent pandas versions prefer df.ffill() and df.bfill() over the deprecated fillna(method='ffill'):

df.ffill()

This carries over the last valid observation to fill NaNs in following rows.
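A runnable sketch of these filling options (df.ffill() assumes a reasonably recent pandas; older versions use fillna(method='ffill')):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3],
                   'B': [np.nan, 5, np.nan]})

# Fill every NaN with a constant
zeros = df.fillna(0)
print(zeros.isna().sum().sum())  # 0

# Fill per column using a dict
per_col = df.fillna({'A': -1, 'B': -2})
print(per_col.loc[1, 'A'])  # -1.0

# Forward-fill: carry the last valid observation forward
filled = df.ffill()
print(filled.loc[1, 'A'])  # 1.0
```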

interpolate()

For numeric data, interpolate() fills NaN values via linear interpolation of neighboring points:

s = pd.Series([1, np.nan, np.nan, 10])
s.interpolate()

# Fills NaNs by interpolating between 1 and 10

Various interpolation methods like ‘spline’ and ‘akima’ are supported.
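With the default linear method, the sketch below fills the gap between 1 and 10 in equal steps:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 10.0])

# Linear interpolation between the surrounding valid points (1 and 10)
filled = s.interpolate()
print(filled.tolist())  # [1.0, 4.0, 7.0, 10.0]
```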

replace()

The replace() method can substitute NaN with a new value:

df.replace(np.nan, 5)

This provides a simple, flexible way to fill null values; replace() returns a new DataFrame by default, leaving the original unchanged.

Filling missing data preserves more complete rows/columns and may produce better analysis results than dropping NaNs.

Excluding Nulls During Joins/Merges

NaN values can also be excluded when combining DataFrames using operations like join/merge.

By default, merge() performs an inner join on the shared key columns, so rows whose keys have no match in the other DataFrame are dropped:

df1 = pd.DataFrame({'A': [1, 2], 'B': [5, np.nan]})
df2 = pd.DataFrame({'A': [1, 3], 'C': [7, 8]})

df1.merge(df2)
# Inner join on 'A': only the A=1 row survives

An outer join keeps all rows from both DataFrames, filling missing matches with NaN:

df1.merge(df2, how='outer')
# Unmatched rows are kept, with NaN in the missing columns

To exclude those partially-matched rows after an outer join, pass indicator=True and filter for rows where _merge equals 'both':

(df1.merge(df2, how='outer', indicator=True)
    .query('_merge == "both"'))
# Keeps only rows matched in both DataFrames

This avoids carrying NaN values introduced by the join/merge into downstream analysis.
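A self-contained example of how merge() treats unmatched rows and how the indicator column lets you filter them:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [5, np.nan]})
df2 = pd.DataFrame({'A': [1, 3], 'C': [7, 8]})

# Default merge is an inner join on the shared key 'A':
# only A=1 appears in both frames, so the A=2 row is dropped.
inner = df1.merge(df2)
print(len(inner))  # 1

# An outer join keeps all rows, introducing NaNs for missing matches
outer = df1.merge(df2, how='outer', indicator=True)
print(len(outer))  # 3

# Filtering on the indicator keeps only fully matched rows
both = outer.query('_merge == "both"')
print(len(both))  # 1
```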

Treating NaN as a Value

In some cases, you may want to explicitly include NaN values rather than exclude them. This treats NaN as an actual value category or level in the data.

Methods like groupby(), value_counts(), and pivot_table() by default will ignore NaN values.

To include NaN values in the output, pass dropna=False:

df.groupby('A', dropna=False).mean()
# NaN is retained as a group

This lets you analyze null values alongside actual data.
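A runnable sketch of keeping NaN as a group; the dropna parameter on groupby() requires pandas >= 1.1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', np.nan, 'y'],
                   'B': [1, 2, 3, 4]})

# By default, rows with a NaN key are silently dropped
print(len(df.groupby('A')['B'].mean()))  # 2 groups: 'x', 'y'

# dropna=False keeps NaN as its own group
by_key = df.groupby('A', dropna=False)['B'].mean()
print(len(by_key))  # 3 groups: 'x', 'y', NaN

# value_counts() also accepts dropna=False
counts = df['A'].value_counts(dropna=False)
print(len(counts))  # 3
```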

Conclusion

Handling missing data is vital for preparing Pandas DataFrames for analysis and visualization. By detecting, dropping, filling, and carefully treating NaN values, you can wrangle messy datasets into clean, usable formats.

The techniques covered in this guide, including isna(), dropna(), fillna(), joining on complete rows, and handling NaN as a group, provide code-driven solutions to exclude or retain missing values when needed.

Mastering Pandas’ tools for missing data will boost your proficiency for real-world data science tasks. The skills are broadly applicable across disciplines including machine learning, analytics, finance, sciences and more.

For further learning, refer to the official Pandas documentation on Working with Missing Data. The example codes can be adapted to your specific datasets and use cases when encountering null values.