Pandas is a popular Python library used for data analysis and manipulation. One common data cleaning task is removing duplicate rows from a Pandas DataFrame. Duplicates can affect the quality of analysis and statistical modeling using the data. This comprehensive guide will demonstrate various methods to identify and eliminate duplicate rows in Pandas using example code snippets and detailed explanations.
Introduction
Duplicate or redundant records impact the accuracy of analysis because they can bias aggregated statistics like counts, sums, and averages. Real-world datasets often contain duplicate records due to issues in data collection, storage, or transmission. Data duplication also increases storage and memory requirements. As such, removing duplicates is an essential data cleaning step.
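As a quick illustration with hypothetical order data (the column names here are invented for the example), a single duplicated row is enough to shift an average:

```python
import pandas as pd

# The same order recorded twice skews the average order value.
orders = pd.DataFrame({'order_id': [1, 2, 2],
                       'amount': [100.0, 50.0, 50.0]})

print(orders['amount'].mean())                    # ~66.67, biased by the duplicate
print(orders.drop_duplicates()['amount'].mean())  # 75.0, after deduplication
```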
Pandas provides several methods for dealing with duplicate rows:

- `DataFrame.drop_duplicates()` removes duplicate rows based on all or specified columns.
- `DataFrame.duplicated()` returns a boolean Series indicating which rows are duplicates.
- `Series.unique()` returns the unique values in a column.
This guide will demonstrate how to use these methods with the help of examples and detailed explanations. We will also cover additional techniques like grouping, counting, and sorting to analyze duplicates in Pandas.
Detecting Duplicates
The first step in removing duplicates is identifying duplicate rows. We can check whether a Pandas DataFrame contains duplicates using the `DataFrame.duplicated()` method.
```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Claire', 'Bob'],
                   'Age': [25, 30, 25, 30]})
print(df)
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
# 3     Bob   30

print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3     True
```
The `duplicated()` method marks every repeated row as `True`, except for its first occurrence, which is marked `False`. This boolean Series can then be used to filter the DataFrame and extract the duplicate rows.
```python
print(df[df.duplicated()])
#   Name  Age
# 3  Bob   30
```
We can also specify the `subset` parameter to check for duplicates based on one or more columns.
```python
print(df.duplicated(subset=['Name']))
# 0    False
# 1    False
# 2    False
# 3     True

print(df.duplicated(subset=['Age']))
# 0    False
# 1    False
# 2     True
# 3     True
```

Note that row 3 is also marked `True` for Age, since its value of 30 already appeared in row 1.
The `keep` parameter controls which occurrence is exempt from being marked a duplicate. The options are `'first'` (the default), `'last'`, and `False`, which marks every occurrence.
```python
print(df.duplicated(subset='Age', keep='first'))
# 0    False
# 1    False
# 2     True
# 3     True

print(df.duplicated(subset='Age', keep='last'))
# 0     True
# 1     True
# 2    False
# 3    False

print(df.duplicated(subset='Age', keep=False))
# 0    True
# 1    True
# 2    True
# 3    True
```

Since both ages repeat, `keep=False` marks every row.
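Because `keep=False` flags every copy of a duplicated row, it is a convenient way to extract all of them at once:

```python
# Pull out every row whose Name appears more than once.
print(df[df.duplicated(subset=['Name'], keep=False)])
#   Name  Age
# 1  Bob   30
# 3  Bob   30
```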
Dropping Duplicates
The most common way to eliminate duplicates is to drop the redundant rows. The `DataFrame.drop_duplicates()` method removes duplicate rows based on all or specified columns.
```python
df.drop_duplicates()
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25

df.drop_duplicates(subset=['Name'])
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```

In this DataFrame, both calls give the same result, because the only fully duplicated row (index 3) is also the only duplicated name.
The `keep` parameter controls which duplicate row to keep. By default, it keeps the first occurrence.
```python
df.drop_duplicates(subset=['Age'], keep='first')
#     Name  Age
# 0  Alice   25
# 1    Bob   30

df.drop_duplicates(subset=['Age'], keep='last')
#      Name  Age
# 2  Claire   25
# 3     Bob   30
```
We can pass a list of columns to `subset` to check for duplicates across multiple columns.
```python
df.drop_duplicates(subset=['Name', 'Age'])
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```
By default, `drop_duplicates()` considers all columns. Passing `inplace=True` modifies the DataFrame in place instead of returning a new one.
```python
df.drop_duplicates(inplace=True)
print(df)
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```

The remaining examples assume the original four-row DataFrame, before this in-place modification.
Counting Duplicates
Counting duplicates gives insight into the scale of duplication and can help prioritize cleaning. The `DataFrame.duplicated()` and `DataFrameGroupBy.count()` methods can be used together to count duplicates.
```python
dup_rows = df[df.duplicated(subset=['Name'])]
print(dup_rows)
#   Name  Age
# 3  Bob   30

dup_counts = df.groupby('Name').count()
print(dup_counts)
#         Age
# Name
# Alice     1
# Bob       2
# Claire    1
```
Here we filter for the rows with duplicate names, then group by Name and count the occurrences. Bob has 2 records, so it is a duplicate.
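To list only the names that occur more than once, we can filter the grouped counts. A small sketch:

```python
# Keep only the names whose count exceeds 1, i.e. the duplicated names.
print(dup_counts[dup_counts['Age'] > 1])
#      Age
# Name
# Bob    2
```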
We can also analyze duplicate values within a single column using `Series.value_counts()`.
```python
age_counts = df['Age'].value_counts()
print(age_counts)
# 25    2
# 30    2
```
Both 25 and 30 appear twice, indicating duplicate ages.
Removing Duplicates by Group
Another strategy is grouping by the subset columns first, then dropping duplicates per group.
```python
df.groupby('Age', group_keys=False).apply(lambda x: x.drop_duplicates())
#      Name  Age
# 0   Alice   25
# 2  Claire   25
# 1     Bob   30
```
This groups `df` by Age, drops duplicates within each group, and concatenates the results. Passing `group_keys=False` keeps the original index instead of adding the group keys as an extra index level. Note that Claire is retained: within the Age 25 group, the Alice and Claire rows are distinct.
We can also group by multiple columns for more granular duplicate removal.
```python
df.groupby(['Name', 'Age'], group_keys=False).apply(lambda x: x.drop_duplicates())
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```
The benefit of grouping is that we can apply other aggregation functions instead of simply dropping duplicates. For example, we can group duplicate rows and sum or average their values, as sketched below.
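A minimal sketch, assuming we want one row per name with any duplicate ages averaged:

```python
# Collapse duplicate names into a single row, averaging their ages.
collapsed = df.groupby('Name', as_index=False)['Age'].mean()
print(collapsed)
#      Name   Age
# 0   Alice  25.0
# 1     Bob  30.0
# 2  Claire  25.0
```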
Removing Duplicates by Sorting
Sorting the DataFrame is another way to control duplicate removal: sorting determines which occurrence of a duplicate comes first, and therefore which row `drop_duplicates()` keeps.
```python
df.sort_values(['Name', 'Age']).drop_duplicates()
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```
This sorts ascending, first by Name and then by Age. The `drop_duplicates()` call then removes the second Bob row (index 3).
We can sort descending to keep the last rather than the first occurrence of each duplicate:
```python
df.sort_values(['Name', 'Age'], ascending=False).drop_duplicates()
#      Name  Age
# 2  Claire   25
# 3     Bob   30
# 0   Alice   25
```
Finding Unique Values
The `Series.unique()` method returns the unique values in a column. This can also be useful for duplicate analysis.
```python
print(df['Name'].unique())
# ['Alice' 'Bob' 'Claire']

print(df['Age'].unique())
# [25 30]
```
We can compare the number of unique values in a column against the number of rows to identify duplication.
```python
name_uniques = len(df['Name'].unique())
print(name_uniques)  # 3

age_uniques = len(df['Age'].unique())
print(age_uniques)  # 2
```
With only 3 unique names and 2 unique ages across the 4 rows, both columns contain duplicate values.
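Pandas also provides `DataFrame.nunique()`, which computes these per-column counts in one call:

```python
# Columns whose unique-value count is below the row count contain duplicates.
print(df.nunique())
# Name    3
# Age     2

print(len(df))  # 4
```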
Conclusion
This guide demonstrated various techniques to detect, analyze, and remove duplicate rows in Pandas using `drop_duplicates()`, `duplicated()`, `groupby()`, `sort_values()`, `count()`, and `unique()`. The key steps are:
- Check for duplicates with `duplicated()` and filter the DataFrame to extract them.
- Remove duplicates by dropping them with `drop_duplicates()`, or per group after sorting or aggregating.
- Count duplicates using `groupby()` and `value_counts()` to understand the scope of duplication.
- Find unique values with `unique()` to identify columns containing duplicates.
Eliminating unwanted duplicate data is an essential pre-processing step for ensuring data quality. The methods discussed provide a robust Pandas toolkit to programmatically remove duplicate rows in Python. Proper duplicate handling will improve the accuracy of downstream analytics and modeling applications.