Skip to content

Filling Missing Values in Pandas using fillna() and interpolate()

Updated: at 02:14 AM

Missing data is a common issue when working with real-world datasets in Python. Values can be missing due to various reasons such as incorrect data entry, corruption, or failure to record measurements. Dealing with missing data is an inevitable step in most data analysis and machine learning pipelines.

Pandas, Python’s popular data analysis library, provides various methods to handle missing values in DataFrames and Series. The versatile fillna() method allows filling missing values with a static value or using an interpolation method. The interpolate() method offers more advanced interpolation techniques for ordered data.

In this comprehensive guide, we will explore how to fill missing values in Pandas using the fillna() and interpolate() functions.

Prerequisites

To follow this guide, you should have:

The code examples use Pandas 1.3.5 and Python 3.8.10. It is recommended to upgrade to the latest stable versions before trying the examples.

Table of Contents

Open Table of Contents

Checking for Missing Values

The first step is to check whether the DataFrame contains any missing values. Pandas provides various functions to detect null or NaN values:

import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, np.nan, 4],
                   'Col2': [5, np.nan, 7, 8]})

# Check if DataFrame has any missing values
df.isna().any()

# Count total missing values across columns
df.isna().sum().sum()

# Count missing values per column
df.isna().sum()

The output will reveal the columns and number of missing values present. This informs how to approach imputing them.

Filling Missing Values with fillna()

The fillna() method is used to replace or fill missing values (NaN/None) in Pandas objects with a specified value.

The syntax is:

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)

Let’s go through the parameters:

Filling with a Scalar Value

To fill missing values with a static scalar value like 0, we simply pass it to the value parameter:

df.fillna(value=0)

This replaces all NaN values with 0 in the DataFrame.

For example:

df = pd.DataFrame({'A': [1, np.nan, 3],
                   'B': [5, 2, np.nan]})

df.fillna(value=0)

     A    B
0  1.0  5.0
1  0.0  2.0
2  3.0  0.0

We can also specify a different fill value for each column by passing a dictionary:

df.fillna({'A': 0, 'B': 5})

     A    B
0  1.0  5.0
1  0.0  2.0
2  3.0  5.0

Filling with a Forward Fill

To propagate the previous valid value forward, use method='ffill':

df.fillna(method='ffill')

     A    B
0  1.0  5.0
1  1.0  2.0
2  3.0  2.0

This carries the last observed value forward to fill the missing values.

Filling with a Backward Fill

Use method='bfill' to propagate the next valid value backward:

df.fillna(method='bfill')

     A    B
0  1.0  5.0
1  3.0  2.0
2  3.0  NaN

This carries the next valid value backward to fill missing values.

Filling with Interpolate

To fill missing values by interpolating neighboring values, pass method='interpolate':

df.interpolate()

     A    B
0  1.0  5.0
1  2.0  2.0
2  3.0  3.5

This uses the default linear interpolation method. More complex interpolation methods are supported by the interpolate() function described later.

Forward Fill with Limit

The limit parameter controls how many consecutive missing values to fill while doing a forward or backward fill:

df = pd.DataFrame({'A': [1, np.nan, np.nan, 5, 7]})

df.fillna(method='ffill', limit=1)

     A
0  1.0
1  1.0
2  NaN
3  5.0
4  7.0

This fills at most 1 consecutive NaN value using the forward fill method. The second NaN value remains.

Filling in Place

By default, fillna() returns a copy of the DataFrame with filled values. To fill missing values in the original DataFrame itself, use inplace=True:

df.fillna(value=0, inplace=True)

Now df will be modified directly with the missing values filled.

Filling with interpolate()

While fillna() offers basic interpolation, for more complex numeric data, Pandas provides the powerful interpolate() method. It has various interpolation schemes like linear, quadratic, spline, etc.

The syntax for interpolate() is:

Series.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None)

DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None)

The parameters are similar to fillna(), with the interpolation method being the key addition.

Let’s go through some examples:

Linear Interpolation

This uses the standard method='linear' interpolation between points:

ser = pd.Series([1, 3, np.nan, 7, 4, np.nan, 12])

ser.interpolate()

0     1.0
1     3.0
2     5.0
3     7.0
4     4.0
5     8.0
6    12.0
dtype: float64

The NaN values are filled using linear interpolation between neighboring points.

Quadratic Interpolation

The method='quadratic' interpolates with a quadratic polynomial between points:

ser.interpolate(method='quadratic')

0     1.000000
1     3.000000
2     4.333333
3     7.000000
4     4.000000
5     8.500000
6    12.000000
dtype: float64

This fits a quadratic curve between the points to fill the NaNs.

Spline Interpolation

The method='spline' uses spline interpolation to fill missing data:

ser.interpolate(method='spline', order=2)

0     1.000000
1     3.000000
2     5.428572
3     7.000000
4     4.000000
5     8.285714
6    12.000000
dtype: float64

Spline interpolation fits smooth spline polynomials between known data points to impute missing values.

Time Series Interpolation

For time series data, methods like 'time' and 'cubic' can be used:

dates = pd.date_range('2020-01-01', periods=6, freq='W')
ts = pd.Series([1, np.nan, np.nan, 7, 4, np.nan], index=dates)

ts.interpolate(method='time')

2020-01-05    1.0
2020-01-12    4.0
2020-01-19    4.0
2020-01-26    7.0
2020-02-02    4.0
2020-02-09    7.5
Freq: W-SUN, dtype: float64

This interpolates along the time index, filling gaps with time sensitive interpolation methods.

Limiting Interpolation Range

The limit and limit_area parameters can restrict the range of values used for interpolation:

ser.interpolate(method='spline', limit=2, limit_area='inside')

0    1.0
1    3.0
2    NaN
3    7.0
4    4.0
5    NaN
6   12.0
dtype: float64

This only uses up to 2 valid points on either side of the NaN to interpolate. The ‘inside’ limit area further restricts it to not use outside points.

Filling in Place

To interpolate with the changes applied directly to the original Series, use inplace=True:

ser.interpolate(method='linear', inplace=True)

Practical Examples

Let’s look at some practical examples of filling missing data in Pandas DataFrames using the techniques covered.

Filling Missing Numeric Values

df = pd.DataFrame({'A': [1, 3, 6, np.nan],
                   'B': [5, np.nan, np.nan, np.nan]})

df.fillna(method='ffill')

   A    B
0  1  5.0
1  3  5.0
2  6  5.0
3  6  5.0

Forward fills numeric columns missing values.

Filling Missing Dates

dates = pd.date_range('2020-01-01', periods=3, freq='D')
df = pd.DataFrame({'Date': [dates[0], None, dates[2]]})

df['Date'].fillna(method='ffill')

0   2020-01-01
1   2020-01-01
2   2020-01-03
Name: Date, dtype: datetime64[ns]

Missing dates can be filled forward/backward nicely.

Interpolating Time Series

ts = pd.Series([1, np.nan, 5, 7, np.nan, 12],
               index=pd.date_range('2020-01-01', periods=6, freq='M'))

ts.interpolate(method='time')

2020-01-31    1.0
2020-02-29    3.0
2020-03-31    5.0
2020-04-30    7.0
2020-05-31    9.5
2020-06-30   12.0
Freq: M, dtype: float64

Time sensitive interpolation of missing values in time series data.

Filling Columns by Data Type

df = pd.DataFrame({'Int': [1, 3, None, 7],
                   'Float': [0.5, np.nan, 4.5, 8.7],
                   'Str': ['a', None, 'c', 'd']})

df.fillna({'Int': 0, 'Float': 0.0, 'Str': 'Missing'})

    Int  Float    Str
0   1   0.5       a
1   3   0.0       Missing
2   0   4.5       c
3   7   8.7       d

Fill columns by data type - numeric with 0, strings with ‘Missing’ text.

Here is a new section with real-life examples and complete runnable code for filling missing values in Pandas:

Real-life Examples

Let’s look at some real-life examples of handling missing data in Pandas DataFrames using the techniques we have learned.

Cleaning MessyDataset

This dataset has missing values coded as “NaN” strings that need to be converted to NaNs before filling:

import pandas as pd
import numpy as np

data = {'A': [1, np.nan, 'NaN', 4],
        'B': [5, np.nan, 'NaN', 8],
        'C': ['a', 'b', None, 'd']}
df = pd.DataFrame(data)

print(df)

   A    B    C
0  1  5.0    a
1  NaN  NaN    b
2  NaN  NaN  None
3  4  8.0    d

# Convert "NaN" strings to NaNs
df = df.replace('NaN', np.nan)

# Fill numeric columns forward
# and string column with "Missing"
df = df.fillna(method='ffill', numeric_only=False)

print(df)

   A    B       C
0  1  5.0       a
1  1  5.0       b
2  1  5.0  Missing
3  4  8.0       d

Filling Time Series Gaps

Fill gaps in time series data using time-sensitive interpolation methods:

dates = pd.date_range('2022-01-01', periods=10, freq='W')
values = [1.5, np.nan, 2.1, np.nan, 6.3, np.nan, 4.6, 5.1, np.nan, 8.9]
ser = pd.Series(values, index=dates)

print(ser)

2022-01-02    1.5
2022-01-09    NaN
2022-01-16    2.1
2022-01-23    NaN
2022-01-30    6.3
2022-02-06    NaN
2022-02-13    4.6
2022-02-20    5.1
2022-02-27    NaN
2022-03-06    8.9
Freq: W-SUN, dtype: float64

ser = ser.interpolate(method='time')

print(ser)

2022-01-02    1.5
2022-01-09    1.8
2022-01-16    2.1
2022-01-23    4.2
2022-01-30    6.3
2022-02-06    5.45
2022-02-13    4.6
2022-02-20    5.1
2022-02-27    6.5
2022-03-06    8.9
Freq: W-SUN, dtype: float64

The gaps are nicely filled using time sensitive interpolation.

Filling Missing Weather Data

Use domain knowledge to sensibly fill missing values in a weather dataset:

weather = {'Day': ['Mon', 'Tue', 'Wed','Thu', 'Fri','Sat', 'Sun'],
           'Temperature': [28.5, np.nan, 27.2, 26.4, np.nan, 25.1, 29.7],
           'Event': ['Sunny', 'Rain','Rain', 'Clouds', 'Rain', 'Sunny', 'Sunny']}

df = pd.DataFrame(weather)

print(df)

     Day  Temperature   Event
0    Mon        28.5   Sunny
1    Tue         NaN    Rain
2    Wed        27.2    Rain
3    Thu        26.4  Clouds
4    Fri         NaN    Rain
5    Sat        25.1   Sunny
6    Sun        29.7   Sunny

# Fill missing Temperatures with adjacent day's value
df['Temperature'] = df['Temperature'].interpolate(method='nearest')

# Fill missing Events with 'Clouds' as it is common in Philippines
df['Event'] = df['Event'].fillna('Clouds')

print(df)

     Day  Temperature     Event
0    Mon        28.5     Sunny
1    Tue        27.2     Clouds
2    Wed        27.2       Rain
3    Thu        26.4     Clouds
4    Fri        26.4     Clouds
5    Sat        25.1      Sunny
6    Sun        29.7      Sunny

Domain knowledge helps fill missing weather data appropriately.

These real-world examples demonstrate applying the techniques covered to handle missing data effectively.

Conclusion

Handling missing data is an essential skill for working with Pandas DataFrames. The fillna() method provides a simple way to fill NaN values with a static value or using resampling. For ordered data like time series, the powerful interpolate() method enables filling missing data through a variety of interpolation schemes.

The key points to remember are:

Mastering these techniques will enable handling missing data for effective data analysis and machine learning with Pandas.