Joining and merging datasets is a common task in data analysis projects. Pandas, Python’s popular data analysis library, provides various methods to combine DataFrames including merge()
and join()
which allow you to perform database-style joins.
This comprehensive guide will examine Pandas’ merge()
and join()
functions to conduct SQL-style joins on DataFrames. We will cover the similarities and differences between merge()
and join()
, their syntaxes, inner vs outer joins, joining types (left, right, inner, outer), joining on multiple columns, handling duplicates, and more. Examples and use cases are provided to illustrate how to effectively leverage these functions for merging, joining, and concatenating DataFrames in Python.
Table of Contents
Open Table of Contents
Overview of merge()
and join()
The merge()
and join()
functions in Pandas allow you to combine DataFrames together similar to joins in relational databases like SQL. This makes it very convenient for those familiar with database concepts to conduct merges and joins on DataFrames.
While their purposes are similar, there are some key differences between merge()
and join()
:
merge()
is a standalone function whilejoin()
is a DataFrame method.merge()
supports SQL-style notation for joins whilejoin()
uses a different syntax based on set logic.join()
only allows two DataFrames to be joined whilemerge()
supports joining multiple DataFrames at once.merge()
offers more flexibility and options for handling the join keys during the merge process.
In general, merge()
is more commonly used due to its similarity to SQL joins and added flexibility. However, join()
can be handy for quick inner or outer joins on the indexes of two DataFrames.
Inner Joins with merge()
and join()
Inner joins select only the intersecting rows between the two DataFrames where keys match in both the left (first) and right (second) DataFrame.
Syntax
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
left.join(right, how='inner', on=None, lsuffix='', rsuffix='')
left
- Left DataFrameright
- Right DataFramehow
- Type of join, ‘inner’ specifies inner joinon
- Column to join on, must be found in both DataFramesleft_on
- Column in left DataFrame to use as join keyright_on
- Column in right DataFrame to use as join key
Example
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [2, 3], 'C': [5, 6]})
df_inner = pd.merge(df1, df2, on='A')
print(df_inner)
# Output:
A B C
0 2 4 5
Here, we join df1
and df2
on column ‘A’ using an inner join. This only returns rows where ‘A’ appears in both DataFrames, so the row with 3 in ‘A’ is excluded.
Outer Joins with merge()
and join()
Outer joins return all rows from both the left and right DataFrames, filling in missing values where keys don’t match with NaN.
There are two types of outer joins:
-
Left Outer Join - Returns all rows from left DataFrame supplemented by matching right DataFrame rows. Unmatched rows have NaN for right DataFrame columns.
-
Right Outer Join - Returns all rows from right DataFrame supplemented by matching left DataFrame rows. Unmatched rows contain NaN for left DataFrame columns.
Syntax
pd.merge(left, right, how='outer', on=None, left_on=None, right_on=None)
left.join(right, how='outer', on=None, lsuffix='', rsuffix='')
- Set
how='outer'
for an outer join.
Example
df_leftouter = pd.merge(df1, df2, how='left', on='A')
print(df_leftouter)
df_rightouter = pd.merge(df1, df2, how='right', on='A')
print(df_rightouter)
# Output
A B C
0 1 3 NaN
1 2 4 5
A B C
0 2 4.0 5
1 3 NaN 6
The left outer join returns all rows from df1
plus matched rows from df2
, while the right outer join returns all rows from df2
plus matched rows from df1
. Unmatched rows contain NaN for the missing columns.
Joining on Multiple Columns
To merge on multiple columns, pass a list of column names to the on
parameter. The join will only match rows where all the key columns match between the DataFrames.
Syntax
pd.merge(left, right, on=['key1', 'key2'])
left.join(right, on=['key1', 'key2'])
Example
df3 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'Key2': [1, 2]})
df4 = pd.DataFrame({'A': [1, 4], 'C': [5, 6], 'Key2': [1, 3]})
df_mmult = pd.merge(df3, df4, on=['A', 'Key2'])
print(df_mmult)
# Output
A B Key2 C
0 1 3 1 5
Here the rows are matched only if both columns A and Key2 match between the DataFrames.
Specifying Left and Right Join Keys
We can specify different columns to join on from the left and right DataFrames using the left_on
and right_on
parameters.
The column names do not need to match between the left and right join keys. This allows joining DataFrames that may have different column names.
Syntax
pd.merge(left, right, left_on='key1', right_on='key2')
Example
df5 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df6 = pd.DataFrame({'C': [2, 4], 'D': [5, 6]})
df_leftright = pd.merge(df5, df6, left_on='A', right_on='C')
print(df_leftright)
# Output
A B C D
0 2 4 2 5
Here df5[A]
and df6[C]
are used as the join keys instead of requiring the columns to have matching names.
Handling Duplicates During Joins
By default, the merge()
function will return all combinations of duplicates during a join. The join()
method will only return the intersection of duplicates.
We can handle duplicates using the suffixes
, how
, and on
parameters:
lsuffix
/rsuffix
- Append suffixes to duplicate left/right column nameshow='outer'
- Joins will union together duplicates instead of producing all combinations.on
- Specify columns to join on rather than using all column name intersections.
Example
df1 = pd.DataFrame({'A': [1, 1], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'C': [3, 4]})
df_outer = df1.join(df2, how='outer') # Single row with [1, 3, 3, 4]
df_merge = pd.merge(df1, df2) # 4 duplicate rows
df_merge_suffix = pd.merge(df1, df2, lsuffix='_left', rsuffix='_right')
print(df_merge_suffix)
# Output
A_left B A_right C
0 1 3 1 3
1 1 4 1 3
2 1 3 1 4
3 1 4 1 4
Using lsuffix
/rsuffix
appends _left
/_right
to duplicate column names, while an outer join produces just a single row.
Joining and Concatenating DataFrames
In addition to merging and joining, we can also concatenate or stack DataFrames vertically to combine them.
Key differences:
- Joins match rows by key columns
- Concatenation stacks DataFrames vertically
- Use
join()
/merge()
to combine columns - Use
concat()
to append rows
Syntax
Vertical stack:
pd.concat([df1, df2], axis=0)
Example
df1 = pd.DataFrame({'A': [1, 2], 'B':[3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B':[7, 8]})
df_concat = pd.concat([df1, df2])
print(df_concat)
# Output
A B
0 1 3
1 2 4
0 5 7
1 6 8
concat()
vertically stacks df1
on top of df2
to combine the rows.
Summary
- Pandas
merge()
andjoin()
perform database-style joins on DataFrames. merge()
offers more flexibility whilejoin()
provides a quick method for combining two DataFrames.- Inner joins match intersecting keys, outer joins union keys, and outer joins also include unmatched rows.
- Multiple columns can be specified for the join keys to match rows.
concat()
vertically concatenates DataFrames unlike joins which combine horizontally.
Joins and merges are essential tools for combining DataFrames in data analysis workflows. Mastering Pandas’ merge()
, join()
and concat()
allows for efficient wrangling during machine learning, data science, and analytics projects.