
Practical Applications of Sets in Python for Data Deduplication

Updated at 03:23 AM

In Python, a set is an unordered collection of unique, hashable (immutable) objects. Sets are useful when working with data that contains many duplicates or where order doesn’t matter. Key properties that make sets ideal for removing duplicates and handling unique values: every element is stored exactly once, membership tests run in O(1) time on average, and set algebra operations such as union, intersection, and difference are built in.

In this comprehensive guide, we will explore practical applications of Python sets for handling unique values and deduplicating data, with examples throughout.


Removing Duplicate Values from a List

A common application of sets is removing duplicates from a list in Python. The set() constructor can be passed a list to return a set with only the unique elements from the list:

# List with duplicates
numbers = [1, 2, 3, 4, 3, 2, 1]

# Convert to set to get unique values
unique_nums = set(numbers)

print(unique_nums)
# {1, 2, 3, 4}

To get the unique elements back as a list, call list() on the set:

unique_nums = list(set(numbers))
print(unique_nums)
# [1, 2, 3, 4]

This technique works for any iterable, such as a list, tuple, or string, and provides a quick way to remove duplicates and derive unique values.
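
One caveat: sets are unordered, so converting to a set discards the original element order. If order matters, dict.fromkeys() offers a simple alternative, since dict keys are unique and, in Python 3.7+, preserve insertion order:

numbers = [1, 2, 3, 4, 3, 2, 1]

# dict keys are unique and keep insertion order
unique_ordered = list(dict.fromkeys(numbers))
print(unique_ordered)
# [1, 2, 3, 4]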

Removing Duplicate Lines from a Text File

Another common deduplication task is removing duplicate lines from a text file.

We can load the text file into a Python set to get unique lines only. Each line will be an individual set element:

with open('file.txt') as file:
  lines = set(file.readlines())

print(lines)
# Prints unique lines from file without duplicates

The set automatically handles removing any duplicate lines.

To get the unique lines back into a file:

with open('unique_lines.txt', 'w') as file:
  for line in lines:
    file.write(line)

This writes the set to a new file without the duplicate lines. Because sets are unordered, however, the lines may not come out in their original order.
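
If the original line order matters, a common pattern is to stream through the file while tracking lines already seen in a set (a small sketch, reusing the file names from above):

seen = set()

with open('file.txt') as infile, open('unique_lines.txt', 'w') as outfile:
  for line in infile:
    # Write each line only the first time it appears
    if line not in seen:
      seen.add(line)
      outfile.write(line)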

Finding Duplicates Between Two Lists

To find duplicates or common elements between two lists, we can use set intersections:

list1 = [1, 2, 3, 4]
list2 = [3, 4, 5]

duplicates = set(list1) & set(list2)
# {3, 4}

The & operator returns a set with the common elements. We can also get the duplicates as a list:

duplicates = list(set(list1) & set(list2))
# [3, 4]

This technique finds any elements contained in both iterables.

The intersection operator is useful for finding and analyzing overlaps between data sets.
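
The same idea extends to more than two lists: the intersection() method accepts any number of iterables, so a single call finds the elements common to all of them:

list1 = [1, 2, 3, 4]
list2 = [3, 4, 5]
list3 = [4, 5, 6]

# intersection() accepts any number of iterables
common = set(list1).intersection(list2, list3)
print(common)
# {4}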

Removing Common Elements from Two Lists

To remove the elements shared between two lists, we can use set differences:

list1 = [1, 2, 3]
list2 = [2, 3, 4]

set1 = set(list1)
set2 = set(list2)

unique_list1 = set1 - set2
# {1} Elements only in list1

unique_list2 = set2 - set1
# {4} Elements only in list2

The - operator returns a set with the elements that appear in the first set but not the second. This lets us isolate the elements unique to each list.

We can also get the unique elements back as lists:

unique_list1 = list(set1 - set2)
unique_list2 = list(set2 - set1)

This provides an easy way to remove the shared elements between two sets and derive the values exclusive to each.
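
Relatedly, the symmetric difference operator ^ combines both directions in one step, returning the elements that appear in exactly one of the two sets:

set1 = {1, 2, 3}
set2 = {2, 3, 4}

# Elements in either set, but not in both
exclusive = set1 ^ set2
print(exclusive)
# {1, 4}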

Removing Duplicates from Multiple Lists

To deduplicate data across multiple lists, we can leverage set unions:

list1 = [1, 2, 3]
list2 = [2, 3, 4]
list3 = [3, 4, 5]

master_set = set(list1) | set(list2) | set(list3)
# {1, 2, 3, 4, 5}

The | operator unions multiple sets together, giving us a combined set with all the unique elements across the lists.

Note that set() itself accepts only a single iterable; chaining the | operator, as above, merges any number of sets into a master set of unique values.
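
Equivalently, the union() method accepts any number of iterables, so several lists can be merged in one call:

master_set = set(list1).union(list2, list3)
print(master_set)
# {1, 2, 3, 4, 5}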

Tracking Unique Visitor Counts

Sets are useful for tracking unique site visitors and analyzing traffic sources:

visitor_ids = ['user123', 'user456', 'user123', 'user789']

unique_visitors = len(set(visitor_ids))
# 3 - Counts only the unique ids without duplicates

We can also analyze visitors by traffic source:

facebook_users = {'user123', 'user456'}
google_users = {'user123', 'user789'}

print('Facebook Users:', len(facebook_users)) # 2
print('Google Users:', len(google_users)) # 2

# Find users from both sources:
both_sources = facebook_users & google_users
print('Visited from Facebook and Google:', len(both_sources)) # 1

# Find total unique users from either source:
total_users = facebook_users | google_users
print('Total unique visitors:', len(total_users)) # 3

This allows useful insights, like the overlap between visitor segments or, as shown below, the visitors exclusive to each one.
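
For example, the difference operator isolates the visitors exclusive to one segment:

facebook_users = {'user123', 'user456'}
google_users = {'user123', 'user789'}

# Users who visited only via Facebook
facebook_only = facebook_users - google_users
print('Facebook only:', facebook_only)
# Facebook only: {'user456'}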

Removing Duplicates in Database Query Results

Another application of sets is removing duplicate rows returned from database queries.

We can load the query results into a set to eliminate duplicates and get only unique rows:

import psycopg2

conn = psycopg2.connect(database="mydata")

# Open cursor for executing queries
cur = conn.cursor()

# Fetch query results
cur.execute("SELECT id, name FROM users;")

# Load into set to remove duplicate rows
rows = set(cur.fetchall())

for row in rows:
  print(row)

# Close the cursor and connection when done
cur.close()
conn.close()

fetchall() returns each row as a tuple, and tuples are hashable, so they can be set elements. The set then removes any duplicate rows.

We can also use Pandas with SQLAlchemy to load data from multiple tables and remove duplicates, as sketched below.
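
A minimal sketch of that approach, assuming the same local mydata database and users table as above (the connection URL is a placeholder):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection URL for the local PostgreSQL database
engine = create_engine('postgresql://localhost/mydata')

# Load the query results into a DataFrame
df = pd.read_sql('SELECT id, name FROM users;', engine)

# drop_duplicates() removes exact duplicate rows
unique_rows = df.drop_duplicates()
print(unique_rows)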

Finding Unique Values in Pandas DataFrame Column

Pandas and sets can be used together for finding unique values in a DataFrame column:

import pandas as pd

data = pd.DataFrame({
  'Product': ['Widget', 'Gadget', 'Doohickey', 'Gizmo', 'Widget']
})

# Unique values in column as set
unique_products = set(data['Product'])
print(unique_products)
# {'Gadget', 'Widget', 'Doohickey', 'Gizmo'}

# Unique values as list
unique_list = list(set(data['Product']))
print(unique_list)
# ['Gadget', 'Widget', 'Doohickey', 'Gizmo']

This provides a convenient way to summarize and analyze datasets. Unique counts, statistical insights, and visualizations can be derived using the set values.

Sets can also deduplicate key columns before Pandas operations like joins and merges.
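
For comparison, pandas ships its own deduplication helpers. Continuing with the data DataFrame above, Series.unique() returns the distinct values in order of first appearance, and Series.nunique() counts them:

# Built-in pandas equivalents
print(data['Product'].unique())
# ['Widget' 'Gadget' 'Doohickey' 'Gizmo']

print(data['Product'].nunique())
# 4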

Conclusion

Sets are a useful built-in Python data type for removing duplicate values and deriving unique data insights. Operations like unions, intersections, and differences allow analyzing relationships between data sets. Sets offer efficient ways to deduplicate iterables and query results, and they can be combined with Pandas and SQL databases for powerful data-wrangling applications. With their guarantee of uniqueness, sets provide versatile options for solving duplicate-value issues in real-world Python programs.