Pandas is a popular Python library used for data analysis and manipulation. One common data cleaning task is removing duplicate rows from a Pandas DataFrame. Duplicates can affect the quality of analysis and statistical modeling using the data. This comprehensive guide will demonstrate various methods to identify and eliminate duplicate rows in Pandas using example code snippets and detailed explanations.
Introduction
Duplicate or redundant records impact the accuracy of analysis because they can bias aggregated statistics like counts, sums, and averages. Real-world datasets often contain duplicate records due to issues in data collection, storage, or transmission. Data duplication also increases storage and memory requirements. As such, removing duplicates is an essential data cleaning step.
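As a quick illustration with hypothetical order data (the column names here are invented for the example), a single duplicated row is enough to shift an average:

```python
import pandas as pd

# The same order recorded twice skews the average order value.
orders = pd.DataFrame({'order_id': [1, 2, 2],
                       'amount': [100.0, 50.0, 50.0]})

print(orders['amount'].mean())                    # ~66.67, biased by the duplicate
print(orders.drop_duplicates()['amount'].mean())  # 75.0, after deduplication
```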
Pandas provides several methods for dealing with duplicate rows:

- `DataFrame.drop_duplicates()` removes duplicate rows based on all or specified columns.
- `DataFrame.duplicated()` returns a boolean Series indicating which rows are duplicates.
- `Series.unique()` returns the unique values in a column.
This guide will demonstrate how to use these methods with the help of examples and detailed explanations. We will also cover additional techniques like grouping, counting, and sorting to analyze duplicates in Pandas.
Detecting Duplicates
The first step in removing duplicates is identifying duplicate rows. We can check whether a Pandas DataFrame contains duplicates using the `DataFrame.duplicated()` method.
```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Claire', 'Bob'],
                   'Age': [25, 30, 25, 30]})
print(df)
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
# 3     Bob   30

print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3     True
```
The `duplicated()` method marks every repeated row as `True`, except for its first occurrence, which is marked `False`. This boolean Series can then be used to filter the DataFrame and extract the duplicate rows.
```python
print(df[df.duplicated()])
#   Name  Age
# 3  Bob   30
```
We can also specify the `subset` parameter to check for duplicates based on one or more columns.
```python
print(df.duplicated(subset=['Name']))
# 0    False
# 1    False
# 2    False
# 3     True

print(df.duplicated(subset=['Age']))
# 0    False
# 1    False
# 2     True
# 3     True
```

Note that row 3 is also marked `True` for Age, since its value of 30 already appeared in row 1.
The `keep` parameter controls which occurrence is exempt from being marked a duplicate. The options are `'first'` (the default), `'last'`, and `False`, which marks every occurrence.
```python
print(df.duplicated(subset='Age', keep='first'))
# 0    False
# 1    False
# 2     True
# 3     True

print(df.duplicated(subset='Age', keep='last'))
# 0     True
# 1     True
# 2    False
# 3    False

print(df.duplicated(subset='Age', keep=False))
# 0    True
# 1    True
# 2    True
# 3    True
```

Since both ages repeat, `keep=False` marks every row.
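Because `keep=False` flags every copy of a duplicated row, it is a convenient way to extract all of them at once:

```python
# Pull out every row whose Name appears more than once.
print(df[df.duplicated(subset=['Name'], keep=False)])
#   Name  Age
# 1  Bob   30
# 3  Bob   30
```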
Dropping Duplicates
The most common way to eliminate duplicates is to drop the redundant rows. The `DataFrame.drop_duplicates()` method removes duplicate rows based on all or specified columns.
```python
df.drop_duplicates()
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25

df.drop_duplicates(subset=['Name'])
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```

In this DataFrame, both calls give the same result, because the only fully duplicated row (index 3) is also the only duplicated name.
The `keep` parameter controls which duplicate row to keep. By default, it keeps the first occurrence.
```python
df.drop_duplicates(subset=['Age'], keep='first')
#     Name  Age
# 0  Alice   25
# 1    Bob   30

df.drop_duplicates(subset=['Age'], keep='last')
#      Name  Age
# 2  Claire   25
# 3     Bob   30
```
We can pass a list of columns to `subset` to check for duplicates across multiple columns.
```python
df.drop_duplicates(subset=['Name', 'Age'])
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```
By default, `drop_duplicates()` considers all columns. Passing `inplace=True` modifies the DataFrame in place instead of returning a new one.
```python
df.drop_duplicates(inplace=True)
print(df)
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```

The remaining examples assume the original four-row DataFrame, before this in-place modification.
Counting Duplicates
Counting duplicates gives insight into the scale of duplication and can help prioritize cleaning. The `DataFrame.duplicated()` and `DataFrameGroupBy.count()` methods can be used together to count duplicates.
```python
dup_rows = df[df.duplicated(subset=['Name'])]
print(dup_rows)
#   Name  Age
# 3  Bob   30

dup_counts = df.groupby('Name').count()
print(dup_counts)
#         Age
# Name
# Alice     1
# Bob       2
# Claire    1
```
Here we filter for the rows with duplicate names, then group by Name and count the occurrences. Bob has 2 records, so it is a duplicate.
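To list only the names that occur more than once, we can filter the grouped counts. A small sketch:

```python
# Keep only the names whose count exceeds 1, i.e. the duplicated names.
print(dup_counts[dup_counts['Age'] > 1])
#      Age
# Name
# Bob    2
```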
We can also analyze duplicate values within a single column using `Series.value_counts()`.
```python
age_counts = df['Age'].value_counts()
print(age_counts)
# 25    2
# 30    2
```
Both 25 and 30 appear twice, indicating duplicate ages.
Removing Duplicates by Group
Another strategy is grouping by the subset columns first, then dropping duplicates per group.
```python
df.groupby('Age', group_keys=False).apply(lambda x: x.drop_duplicates())
#      Name  Age
# 0   Alice   25
# 2  Claire   25
# 1     Bob   30
```
This groups `df` by Age, drops duplicates within each group, and concatenates the results. Passing `group_keys=False` keeps the original index instead of adding the group keys as an extra index level. Note that Claire is retained: within the Age 25 group, the Alice and Claire rows are distinct.
We can also group by multiple columns for more granular duplicate removal.
```python
df.groupby(['Name', 'Age'], group_keys=False).apply(lambda x: x.drop_duplicates())
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```
The benefit of grouping is that we can apply other aggregation functions instead of simply dropping duplicates. For example, we can group duplicate rows and sum or average their values, as sketched below.
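A minimal sketch, assuming we want one row per name with any duplicate ages averaged:

```python
# Collapse duplicate names into a single row, averaging their ages.
collapsed = df.groupby('Name', as_index=False)['Age'].mean()
print(collapsed)
#      Name   Age
# 0   Alice  25.0
# 1     Bob  30.0
# 2  Claire  25.0
```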
Removing Duplicates by Sorting
Sorting the DataFrame is another way to control duplicate removal: sorting determines which occurrence of a duplicate comes first, and therefore which row `drop_duplicates()` keeps.
```python
df.sort_values(['Name', 'Age']).drop_duplicates()
#      Name  Age
# 0   Alice   25
# 1     Bob   30
# 2  Claire   25
```
This sorts ascending, first by Name and then by Age. The `drop_duplicates()` call then removes the second Bob row (index 3).
We can sort descending to keep the last rather than the first occurrence of each duplicate:
```python
df.sort_values(['Name', 'Age'], ascending=False).drop_duplicates()
#      Name  Age
# 2  Claire   25
# 3     Bob   30
# 0   Alice   25
```
Finding Unique Values
The `Series.unique()` method returns the unique values in a column. This can also be useful for duplicate analysis.
```python
print(df['Name'].unique())
# ['Alice' 'Bob' 'Claire']

print(df['Age'].unique())
# [25 30]
```
We can compare the number of unique values in a column against the number of rows to identify duplication.
```python
name_uniques = len(df['Name'].unique())
print(name_uniques)  # 3

age_uniques = len(df['Age'].unique())
print(age_uniques)  # 2
```
With only 3 unique names and 2 unique ages across the 4 rows, both columns contain duplicate values.
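Pandas also provides `DataFrame.nunique()`, which computes these per-column counts in one call:

```python
# Columns whose unique-value count is below the row count contain duplicates.
print(df.nunique())
# Name    3
# Age     2

print(len(df))  # 4
```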
Conclusion
This guide demonstrated various techniques to detect, analyze, and remove duplicate rows in Pandas using `drop_duplicates()`, `duplicated()`, `groupby()`, `sort_values()`, `count()`, and `unique()`. The key steps are:
- Check for duplicates with `duplicated()` and filter the DataFrame to extract them.
- Remove duplicates by dropping them with `drop_duplicates()`, or per group after sorting or aggregating.
- Count duplicates using `groupby()` and `value_counts()` to understand the scope of duplication.
- Find unique values with `unique()` to identify columns containing duplicates.
Eliminating unwanted duplicate data is an essential pre-processing step for ensuring data quality. The methods discussed provide a robust Pandas toolkit to programmatically remove duplicate rows in Python. Proper duplicate handling will improve the accuracy of downstream analytics and modeling applications.