Skip to content

Pandas: Database-Style Joins - merge(), join()

Updated: at 05:43 AM

Joining and merging datasets is a common task in data analysis projects. Pandas, Python’s popular data analysis library, provides various methods to combine DataFrames including merge() and join() which allow you to perform database-style joins.

This comprehensive guide will examine Pandas’ merge() and join() functions to conduct SQL-style joins on DataFrames. We will cover the similarities and differences between merge() and join(), their syntaxes, inner vs outer joins, joining types (left, right, inner, outer), joining on multiple columns, handling duplicates, and more. Examples and use cases are provided to illustrate how to effectively leverage these functions for merging, joining, and concatenating DataFrames in Python.

Table of Contents

Open Table of Contents

Overview of merge() and join()

The merge() and join() functions in Pandas allow you to combine DataFrames together similar to joins in relational databases like SQL. This makes it very convenient for those familiar with database concepts to conduct merges and joins on DataFrames.

While their purposes are similar, there are some key differences between merge() and join():

In general, merge() is more commonly used due to its similarity to SQL joins and added flexibility. However, join() can be handy for quick inner or outer joins on the indexes of two DataFrames.

Inner Joins with merge() and join()

Inner joins select only the intersecting rows between the two DataFrames where keys match in both the left (first) and right (second) DataFrame.

Syntax

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
left.join(right, how='inner', on=None, lsuffix='', rsuffix='')

Example

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [2, 3], 'C': [5, 6]})

df_inner = pd.merge(df1, df2, on='A')
print(df_inner)

# Output:
   A  B  C
0  2  4  5

Here, we join df1 and df2 on column ‘A’ using an inner join. This only returns rows where ‘A’ appears in both DataFrames, so the row with 3 in ‘A’ is excluded.

Outer Joins with merge() and join()

Outer joins return all rows from both the left and right DataFrames, filling in missing values where keys don’t match with NaN.

There are two types of outer joins:

Syntax

pd.merge(left, right, how='outer', on=None, left_on=None, right_on=None)
left.join(right, how='outer', on=None, lsuffix='', rsuffix='')

Example

df_leftouter = pd.merge(df1, df2, how='left', on='A')
print(df_leftouter)

df_rightouter = pd.merge(df1, df2, how='right', on='A')
print(df_rightouter)

# Output
   A  B   C
0  1  3 NaN
1  2  4   5
   A    B  C
0  2  4.0  5
1  3  NaN  6

The left outer join returns all rows from df1 plus matched rows from df2, while the right outer join returns all rows from df2 plus matched rows from df1. Unmatched rows contain NaN for the missing columns.

Joining on Multiple Columns

To merge on multiple columns, pass a list of column names to the on parameter. The join will only match rows where all the key columns match between the DataFrames.

Syntax

pd.merge(left, right, on=['key1', 'key2'])
left.join(right, on=['key1', 'key2'])

Example

df3 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'Key2': [1, 2]})
df4 = pd.DataFrame({'A': [1, 4], 'C': [5, 6], 'Key2': [1, 3]})

df_mmult = pd.merge(df3, df4, on=['A', 'Key2'])
print(df_mmult)

# Output
   A  B  Key2  C
0  1  3     1  5

Here the rows are matched only if both columns A and Key2 match between the DataFrames.

Specifying Left and Right Join Keys

We can specify different columns to join on from the left and right DataFrames using the left_on and right_on parameters.

The column names do not need to match between the left and right join keys. This allows joining DataFrames that may have different column names.

Syntax

pd.merge(left, right, left_on='key1', right_on='key2')

Example

df5 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df6 = pd.DataFrame({'C': [2, 4], 'D': [5, 6]})

df_leftright = pd.merge(df5, df6, left_on='A', right_on='C')
print(df_leftright)

# Output
   A  B  C  D
0  2  4  2  5

Here df5[A] and df6[C] are used as the join keys instead of requiring the columns to have matching names.

Handling Duplicates During Joins

By default, the merge() function will return all combinations of duplicates during a join. The join() method will only return the intersection of duplicates.

We can handle duplicates using the suffixes, how, and on parameters:

Example

df1 = pd.DataFrame({'A': [1, 1], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'C': [3, 4]})

df_outer = df1.join(df2, how='outer') # Single row with [1, 3, 3, 4]
df_merge = pd.merge(df1, df2) # 4 duplicate rows

df_merge_suffix = pd.merge(df1, df2, lsuffix='_left', rsuffix='_right')
print(df_merge_suffix)

# Output
   A_left  B  A_right  C
0       1  3        1  3
1       1  4        1  3
2       1  3        1  4
3       1  4        1  4

Using lsuffix/rsuffix appends _left/_right to duplicate column names, while an outer join produces just a single row.

Joining and Concatenating DataFrames

In addition to merging and joining, we can also concatenate or stack DataFrames vertically to combine them.

Key differences:

Syntax

Vertical stack:

pd.concat([df1, df2], axis=0)

Example

df1 = pd.DataFrame({'A': [1, 2], 'B':[3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B':[7, 8]})

df_concat = pd.concat([df1, df2])
print(df_concat)

# Output
   A  B
0  1  3
1  2  4
0  5  7
1  6  8

concat() vertically stacks df1 on top of df2 to combine the rows.

Summary

Joins and merges are essential tools for combining DataFrames in data analysis workflows. Mastering Pandas’ merge(), join() and concat() allows for efficient wrangling during machine learning, data science, and analytics projects.