In Python, a set is an unordered collection of unique, immutable objects. Sets are useful when working with data that contains many duplicates or data in which the order doesn't matter. Some key properties that make sets ideal for removing duplicates and handling unique values are:
- Sets contain only unique elements. Attempting to add a duplicate element to a set simply doesn't change the set, which makes sets useful for removing duplicates from data.
- Elements of a set are unordered. Sets do not record element position or insertion order the way lists and tuples do.
- Set elements must be immutable objects like strings, numbers, or tuples; lists and dictionaries cannot be set elements (see the short demo after this list).
- Basic set operations like union, intersection, difference, and symmetric difference can be used to derive insights from data by analyzing relationships between sets.
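A quick demonstration of the first three properties (a minimal sketch using only built-in behavior):
s = {1, 2, 3}
s.add(2)
print(s)
# {1, 2, 3} - adding a duplicate leaves the set unchanged
s.add((4, 5))
# Tuples are immutable, so they can be set elements
# s.add([4, 5]) would raise TypeError: unhashable type: 'list'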
In this guide, we will explore practical applications of sets in Python for handling unique values and deduplicating data, with worked examples.
Table of Contents
- Removing Duplicate Values from a List
- Removing Duplicate Lines from a Text File
- Finding Duplicates Between Two Lists
- Removing Common Elements from Two Lists
- Removing Duplicates from Multiple Lists
- Tracking Unique Visitor Counts
- Removing Duplicates in Database Query Results
- Finding Unique Values in Pandas DataFrame Column
- Conclusion
Removing Duplicate Values from a List
A common application of sets is removing duplicates from a list in Python. The set() constructor can be passed a list to return a set with only the unique elements from the list:
# List with duplicates
numbers = [1, 2, 3, 4, 3, 2, 1]
# Convert to set to get unique values
unique_nums = set(numbers)
print(unique_nums)
# {1, 2, 3, 4}
Sets can also be used to get the unique elements back as a list using list() on the set:
unique_nums = list(set(numbers))
print(unique_nums)
# [1, 2, 3, 4]
This technique works for any iterable container, such as lists, tuples, and strings. It provides a quick way to remove duplicates and derive unique values.
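For instance, because strings are iterables too, the same one-liner deduplicates their characters:
letters = set('banana')
print(letters)
# {'b', 'n', 'a'} (set order is arbitrary)
Keep in mind that converting to a set discards the original ordering. If order matters, one common alternative, relying on dicts preserving insertion order in Python 3.7+, is:
ordered_unique = list(dict.fromkeys(numbers))
# [1, 2, 3, 4]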
Removing Duplicate Lines from a Text File
Another common duplicates removal task is dealing with duplicate lines in a text file.
We can load the text file into a Python set to get unique lines only. Each line will be an individual set element:
with open('file.txt') as file:
    lines = set(file.readlines())
print(lines)
# Prints unique lines from file without duplicates
The set automatically handles removing any duplicate lines.
To get the unique lines back into a file:
with open('unique_lines.txt', 'w') as file:
    for line in lines:
        file.write(line)
This writes the set to a new file without the duplicate lines. Note that sets don't preserve order, so the lines may come out in a different order than in the original file.
Finding Duplicates Between Two Lists
To find duplicates or common elements between two lists, we can use set intersections:
list1 = [1, 2, 3, 4]
list2 = [3, 4, 5]
duplicates = set(list1) & set(list2)
# {3, 4}
The & operator returns a set with the common elements. We can also get the duplicates as a list:
duplicates = list(set(list1) & set(list2))
# [3, 4]
This technique finds the elements shared between any two iterables.
The intersection operator is useful for finding and analyzing overlaps between data sets.
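As a convenience, the intersection() method accepts any iterable directly, so only one explicit set conversion is needed:
duplicates = set(list1).intersection(list2)
# {3, 4}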
Removing Common Elements from Two Lists
To remove the elements shared between two lists, we can use set differences:
list1 = [1, 2, 3]
list2 = [2, 3, 4]
set1 = set(list1)
set2 = set(list2)
unique_list1 = set1 - set2
# {1} Elements only in list1
unique_list2 = set2 - set1
# {4} Elements only in list2
The - operator returns a set with elements that are in the first set but not the second. This allows us to isolate the unique elements per list.
We can also get the unique elements back as lists:
unique_list1 = list(set1 - set2)
unique_list2 = list(set2 - set1)
This provides an easy way to remove common duplicates between sets and derive values exclusive to each set.
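The symmetric difference operator ^, mentioned earlier, combines both results in a single step by returning the elements that appear in exactly one of the two sets:
exclusive = set1 ^ set2
print(exclusive)
# {1, 4}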
Removing Duplicates from Multiple Lists
To deduplicate data across multiple lists, we can leverage set unions:
list1 = [1, 2, 3]
list2 = [2, 3, 4]
list3 = [3, 4, 5]
master_set = set(list1) | set(list2) | set(list3)
# {1, 2, 3, 4, 5}
The | operator unions multiple sets together, giving us a combined set with all the unique elements across the lists.
We can also use the union() method, which accepts any number of iterables, to merge them into a master set of the unique values:
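master_set = set(list1).union(list2, list3)
# {1, 2, 3, 4, 5}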
Tracking Unique Visitor Counts
Sets are useful for tracking unique site visitors and analyzing traffic sources:
visitor_ids = ['user123', 'user456', 'user123', 'user789']
unique_visitors = len(set(visitor_ids))
# 3 - Counts only the unique ids without duplicates
We can also analyze visitors by traffic source:
facebook_users = {'user123', 'user456'}
google_users = {'user123', 'user789'}
print('Facebook Users:', len(facebook_users)) # 2
print('Google Users:', len(google_users)) # 2
# Find users from both sources:
both_sources = facebook_users & google_users
print('Visited from Facebook and Google:', len(both_sources)) # 1
# Find total unique users from either source:
total_users = facebook_users | google_users
print('Total unique visitors:', len(total_users)) # 3
This allows useful insights like overlap between visitor segments.
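Set difference similarly isolates the visitors unique to one source:
facebook_only = facebook_users - google_users
print('Facebook only:', facebook_only)
# {'user456'}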
Removing Duplicates in Database Query Results
Another sets application is removing duplicate rows returned from database queries.
We can load the query results into a set to eliminate duplicates and get only unique rows:
import psycopg2
conn = psycopg2.connect(database="mydata")
# Open cursor for executing queries
cur = conn.cursor()
# Fetch query results
cur.execute("SELECT id, name FROM users;")
# Load into set to remove duplicate rows
rows = set(cur.fetchall())
for row in rows:
    print(row)
conn.close()
This fetches the data as tuples which can be set elements. The set handles removing any duplicates.
We can also use Pandas with SQLAlchemy to handle data from multiple tables and remove duplicates.
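As a minimal sketch of that approach (the connection string and table here are placeholders for your own setup):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://localhost/mydata')  # hypothetical connection string
df = pd.read_sql('SELECT id, name FROM users;', engine)
# drop_duplicates() removes duplicate rows across all columns
unique_df = df.drop_duplicates()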
Finding Unique Values in Pandas DataFrame Column
Pandas and sets can be used together for finding unique values in a DataFrame column:
import pandas as pd
data = pd.DataFrame({
    'Product': ['Widget', 'Gadget', 'Doohickey', 'Gizmo', 'Widget']
})
# Unique values in column as set
unique_products = set(data['Product'])
print(unique_products)
# {'Gadget', 'Widget', 'Doohickey', 'Gizmo'} (set order is arbitrary)
# Unique values as list
unique_list = list(set(data['Product']))
print(unique_list)
# ['Gadget', 'Widget', 'Doohickey', 'Gizmo'] (order may vary)
This provides a convenient way to summarize and analyze datasets. Unique counts, statistical insights, and visualizations can be derived using the set values.
Deduplicating key columns with sets can also be useful before Pandas operations like joins and merges.
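For comparison, Pandas also has built-in methods that cover the same ground and are often preferable on large DataFrames:
# Unique values in order of first appearance (returns a NumPy array)
print(data['Product'].unique())
# Number of unique values
print(data['Product'].nunique())  # 4
# Drop duplicate rows from the whole DataFrame
deduped = data.drop_duplicates()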
Conclusion
Sets are a useful built-in Python data type for removing duplicate values and deriving unique data insights. Operations like unions, intersections, and differences allow analyzing relationships between data sets. Sets offer efficient ways to deduplicate iterables and query results, and they can be combined with Pandas and SQL databases for powerful data wrangling. With their ability to handle uniqueness, sets provide versatile options for solving duplicate-data problems in real-world Python programs.