Missing data is a common issue when working with real-world datasets in Python. Values can be missing due to various reasons such as incorrect data entry, corruption, or failure to record measurements. Dealing with missing data is an inevitable step in most data analysis and machine learning pipelines.
Pandas, Python’s popular data analysis library, provides various methods to handle missing values in DataFrames and Series. The versatile fillna()
method allows filling missing values with a static value or using an interpolation method. The interpolate()
method offers more advanced interpolation techniques for ordered data.
In this comprehensive guide, we will explore how to fill missing values in Pandas using the fillna()
and interpolate()
functions.
Prerequisites
To follow this guide, you should have:
- Basic knowledge of Python programming
- Experience working with Pandas DataFrames and Series
- Understanding of basic data analysis and visualization concepts
The code examples use Pandas 1.3.5 and Python 3.8.10. It is recommended to upgrade to the latest stable versions before trying the examples.
Table of Contents
Open Table of Contents
Checking for Missing Values
The first step is to check whether the DataFrame contains any missing values. Pandas provides various functions to detect null or NaN values:
import pandas as pd
df = pd.DataFrame({'Col1': [1, 2, np.nan, 4],
'Col2': [5, np.nan, 7, 8]})
# Check if DataFrame has any missing values
df.isna().any()
# Count total missing values across columns
df.isna().sum().sum()
# Count missing values per column
df.isna().sum()
The output will reveal the columns and number of missing values present. This informs how to approach imputing them.
Filling Missing Values with fillna()
The fillna()
method is used to replace or fill missing values (NaN/None) in Pandas objects with a specified value.
The syntax is:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
Let’s go through the parameters:
value
: Value to use to fill missing values. Can be a scalar, dict, Series or DataFrame. Default is None.method
: Interpolation method to use for numeric data. Default is None.axis
: Axis along which to fill. 0 for columns and 1 for rows. Default is None.inplace
: Modify the DataFrame directly or return a copy. Default is False.limit
: For forward and backward filling, maximum consecutive periods to fill.
Filling with a Scalar Value
To fill missing values with a static scalar value like 0, we simply pass it to the value
parameter:
df.fillna(value=0)
This replaces all NaN values with 0 in the DataFrame.
For example:
df = pd.DataFrame({'A': [1, np.nan, 3],
'B': [5, 2, np.nan]})
df.fillna(value=0)
A B
0 1.0 5.0
1 0.0 2.0
2 3.0 0.0
We can also specify a different fill value for each column by passing a dictionary:
df.fillna({'A': 0, 'B': 5})
A B
0 1.0 5.0
1 0.0 2.0
2 3.0 5.0
Filling with a Forward Fill
To propagate the previous valid value forward, use method='ffill'
:
df.fillna(method='ffill')
A B
0 1.0 5.0
1 1.0 2.0
2 3.0 2.0
This carries the last observed value forward to fill the missing values.
Filling with a Backward Fill
Use method='bfill'
to propagate the next valid value backward:
df.fillna(method='bfill')
A B
0 1.0 5.0
1 3.0 2.0
2 3.0 NaN
This carries the next valid value backward to fill missing values.
Filling with Interpolate
To fill missing values by interpolating neighboring values, pass method='interpolate'
:
df.interpolate()
A B
0 1.0 5.0
1 2.0 2.0
2 3.0 3.5
This uses the default linear interpolation method. More complex interpolation methods are supported by the interpolate()
function described later.
Forward Fill with Limit
The limit
parameter controls how many consecutive missing values to fill while doing a forward or backward fill:
df = pd.DataFrame({'A': [1, np.nan, np.nan, 5, 7]})
df.fillna(method='ffill', limit=1)
A
0 1.0
1 1.0
2 NaN
3 5.0
4 7.0
This fills at most 1 consecutive NaN value using the forward fill method. The second NaN value remains.
Filling in Place
By default, fillna()
returns a copy of the DataFrame with filled values. To fill missing values in the original DataFrame itself, use inplace=True
:
df.fillna(value=0, inplace=True)
Now df
will be modified directly with the missing values filled.
Filling with interpolate()
While fillna()
offers basic interpolation, for more complex numeric data, Pandas provides the powerful interpolate()
method. It has various interpolation schemes like linear, quadratic, spline, etc.
The syntax for interpolate()
is:
Series.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None)
DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None)
The parameters are similar to fillna()
, with the interpolation method
being the key addition.
Let’s go through some examples:
Linear Interpolation
This uses the standard method='linear'
interpolation between points:
ser = pd.Series([1, 3, np.nan, 7, 4, np.nan, 12])
ser.interpolate()
0 1.0
1 3.0
2 5.0
3 7.0
4 4.0
5 8.0
6 12.0
dtype: float64
The NaN values are filled using linear interpolation between neighboring points.
Quadratic Interpolation
The method='quadratic'
interpolates with a quadratic polynomial between points:
ser.interpolate(method='quadratic')
0 1.000000
1 3.000000
2 4.333333
3 7.000000
4 4.000000
5 8.500000
6 12.000000
dtype: float64
This fits a quadratic curve between the points to fill the NaNs.
Spline Interpolation
The method='spline'
uses spline interpolation to fill missing data:
ser.interpolate(method='spline', order=2)
0 1.000000
1 3.000000
2 5.428572
3 7.000000
4 4.000000
5 8.285714
6 12.000000
dtype: float64
Spline interpolation fits smooth spline polynomials between known data points to impute missing values.
Time Series Interpolation
For time series data, methods like 'time'
and 'cubic'
can be used:
dates = pd.date_range('2020-01-01', periods=6, freq='W')
ts = pd.Series([1, np.nan, np.nan, 7, 4, np.nan], index=dates)
ts.interpolate(method='time')
2020-01-05 1.0
2020-01-12 4.0
2020-01-19 4.0
2020-01-26 7.0
2020-02-02 4.0
2020-02-09 7.5
Freq: W-SUN, dtype: float64
This interpolates along the time index, filling gaps with time sensitive interpolation methods.
Limiting Interpolation Range
The limit
and limit_area
parameters can restrict the range of values used for interpolation:
ser.interpolate(method='spline', limit=2, limit_area='inside')
0 1.0
1 3.0
2 NaN
3 7.0
4 4.0
5 NaN
6 12.0
dtype: float64
This only uses up to 2 valid points on either side of the NaN to interpolate. The ‘inside’ limit area further restricts it to not use outside points.
Filling in Place
To interpolate with the changes applied directly to the original Series, use inplace=True
:
ser.interpolate(method='linear', inplace=True)
Practical Examples
Let’s look at some practical examples of filling missing data in Pandas DataFrames using the techniques covered.
Filling Missing Numeric Values
df = pd.DataFrame({'A': [1, 3, 6, np.nan],
'B': [5, np.nan, np.nan, np.nan]})
df.fillna(method='ffill')
A B
0 1 5.0
1 3 5.0
2 6 5.0
3 6 5.0
Forward fills numeric columns missing values.
Filling Missing Dates
dates = pd.date_range('2020-01-01', periods=3, freq='D')
df = pd.DataFrame({'Date': [dates[0], None, dates[2]]})
df['Date'].fillna(method='ffill')
0 2020-01-01
1 2020-01-01
2 2020-01-03
Name: Date, dtype: datetime64[ns]
Missing dates can be filled forward/backward nicely.
Interpolating Time Series
ts = pd.Series([1, np.nan, 5, 7, np.nan, 12],
index=pd.date_range('2020-01-01', periods=6, freq='M'))
ts.interpolate(method='time')
2020-01-31 1.0
2020-02-29 3.0
2020-03-31 5.0
2020-04-30 7.0
2020-05-31 9.5
2020-06-30 12.0
Freq: M, dtype: float64
Time sensitive interpolation of missing values in time series data.
Filling Columns by Data Type
df = pd.DataFrame({'Int': [1, 3, None, 7],
'Float': [0.5, np.nan, 4.5, 8.7],
'Str': ['a', None, 'c', 'd']})
df.fillna({'Int': 0, 'Float': 0.0, 'Str': 'Missing'})
Int Float Str
0 1 0.5 a
1 3 0.0 Missing
2 0 4.5 c
3 7 8.7 d
Fill columns by data type - numeric with 0, strings with ‘Missing’ text.
Here is a new section with real-life examples and complete runnable code for filling missing values in Pandas:
Real-life Examples
Let’s look at some real-life examples of handling missing data in Pandas DataFrames using the techniques we have learned.
Cleaning MessyDataset
This dataset has missing values coded as “NaN” strings that need to be converted to NaNs before filling:
import pandas as pd
import numpy as np
data = {'A': [1, np.nan, 'NaN', 4],
'B': [5, np.nan, 'NaN', 8],
'C': ['a', 'b', None, 'd']}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5.0 a
1 NaN NaN b
2 NaN NaN None
3 4 8.0 d
# Convert "NaN" strings to NaNs
df = df.replace('NaN', np.nan)
# Fill numeric columns forward
# and string column with "Missing"
df = df.fillna(method='ffill', numeric_only=False)
print(df)
A B C
0 1 5.0 a
1 1 5.0 b
2 1 5.0 Missing
3 4 8.0 d
Filling Time Series Gaps
Fill gaps in time series data using time-sensitive interpolation methods:
dates = pd.date_range('2022-01-01', periods=10, freq='W')
values = [1.5, np.nan, 2.1, np.nan, 6.3, np.nan, 4.6, 5.1, np.nan, 8.9]
ser = pd.Series(values, index=dates)
print(ser)
2022-01-02 1.5
2022-01-09 NaN
2022-01-16 2.1
2022-01-23 NaN
2022-01-30 6.3
2022-02-06 NaN
2022-02-13 4.6
2022-02-20 5.1
2022-02-27 NaN
2022-03-06 8.9
Freq: W-SUN, dtype: float64
ser = ser.interpolate(method='time')
print(ser)
2022-01-02 1.5
2022-01-09 1.8
2022-01-16 2.1
2022-01-23 4.2
2022-01-30 6.3
2022-02-06 5.45
2022-02-13 4.6
2022-02-20 5.1
2022-02-27 6.5
2022-03-06 8.9
Freq: W-SUN, dtype: float64
The gaps are nicely filled using time sensitive interpolation.
Filling Missing Weather Data
Use domain knowledge to sensibly fill missing values in a weather dataset:
weather = {'Day': ['Mon', 'Tue', 'Wed','Thu', 'Fri','Sat', 'Sun'],
'Temperature': [28.5, np.nan, 27.2, 26.4, np.nan, 25.1, 29.7],
'Event': ['Sunny', 'Rain','Rain', 'Clouds', 'Rain', 'Sunny', 'Sunny']}
df = pd.DataFrame(weather)
print(df)
Day Temperature Event
0 Mon 28.5 Sunny
1 Tue NaN Rain
2 Wed 27.2 Rain
3 Thu 26.4 Clouds
4 Fri NaN Rain
5 Sat 25.1 Sunny
6 Sun 29.7 Sunny
# Fill missing Temperatures with adjacent day's value
df['Temperature'] = df['Temperature'].interpolate(method='nearest')
# Fill missing Events with 'Clouds' as it is common in Philippines
df['Event'] = df['Event'].fillna('Clouds')
print(df)
Day Temperature Event
0 Mon 28.5 Sunny
1 Tue 27.2 Clouds
2 Wed 27.2 Rain
3 Thu 26.4 Clouds
4 Fri 26.4 Clouds
5 Sat 25.1 Sunny
6 Sun 29.7 Sunny
Domain knowledge helps fill missing weather data appropriately.
These real-world examples demonstrate applying the techniques covered to handle missing data effectively.
Conclusion
Handling missing data is an essential skill for working with Pandas DataFrames. The fillna()
method provides a simple way to fill NaN values with a static value or using resampling. For ordered data like time series, the powerful interpolate()
method enables filling missing data through a variety of interpolation schemes.
The key points to remember are:
- Check for missing values using
isna()
,isnull()
ornotna()
,notnull()
fillna()
allows filling missing values with a scalar, Series, DataFrame, or by interpolation- Use
method='ffill'
andmethod='bfill'
for forward and backward filling interpolate()
provides advanced interpolation methods likelinear
,spline
,time
etc.- Parameters like
axis
,inplace
,limit
give more control over the filling - Fill missing values appropriately based on data type - numeric, strings, dates etc.
Mastering these techniques will enable handling missing data for effective data analysis and machine learning with Pandas.