Skip to content

Concatenating DataFrames and Series with Pandas' concat()

Updated: at 04:15 AM

Pandas is one of the most popular Python libraries used for data manipulation and analysis. The concat() function in Pandas provides a flexible way to concatenate or join together DataFrames and Series objects along an axis.

Concatenation refers to joining or appending objects end-to-end. With concat(), you can combine data from different sources into a single unified DataFrame or Series for further analysis and modeling.

In this comprehensive guide, we will cover the following topics related to using concat() in Pandas:

Table of Contents

Open Table of Contents

Overview of Concatenation in Pandas

Concatenation combines objects together by stacking them horizontally or vertically. The concat() function allows concatenation of DataFrames, Series, and Panel objects.

Key advantages of using concat():

Methods of Concatenation

There are two main methods of concatenation:

For 1-dimensional Series, vertical concatenation is equivalent to appending the Series.

Concatenating Two or More Objects with concat()

The main parameters to concat() are:

To vertically concatenate two DataFrames:

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']})

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']})

df_vert = pd.concat([df1, df2])

print(df_vert)

# Output
   A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3

The index is automatically reset. To preserve the original indexes, set ignore_index=True.

For horizontal concatenation, set axis=1:

df_horz = pd.concat([df1, df2], axis=1)

print(df_horz)

# Output
   A   B   A   B
0  A0  B0  A2  B2
1  A1  B1  A3  B3

join='outer' combines the outer union of indexes and inserts missing values for the mismatched indexes on the other axes.

Concatenating Objects with Overlapping Indexes

When concatenating objects that have overlapping indexes, you can control how they are handled using the join and ignore_index parameters:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                        'B': ['B0', 'B1', 'B2']},
                        index=[0, 1, 2])

df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4'],
                        'B': ['B2', 'B3', 'B4']},
                       index=[2, 3, 4])

df_inner = pd.concat([df1, df2], join='inner')

print(df_inner)

# Output
     A   B
2  A2  B2

Only overlapping indexes (2) are kept.

With ignore_index=True:

df_ignore = pd.concat([df1, df2], ignore_index=True)

print(df_ignore)

# Output
   A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A2  B2
4  A3  B3
5  A4  B4

All data is preserved but indexes are ignored and reset.

Adding MultiIndex Keys to Identify Source DataFrames

When concatenating a list of DataFrames, use the keys parameter to add an index level to identify the source:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']})

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']})

df3 = pd.DataFrame({'A': ['A4', 'A5'],
                    'B': ['B4', 'B5']})

df_list = [df1, df2, df3]

df_concat = pd.concat(df_list, keys=['x', 'y', 'z'])

print(df_concat)

# Output
      A   B
x 0  A0  B0
  1  A1  B1
y 0  A2  B2
  1  A3  B3
z 0  A4  B4
  1  A5  B5

This adds a multi-index with a new outer level identifying each DataFrame source.

Concatenating Along Columns with join='inner'

When concatenating along columns using axis=1 and join='inner', only the columns found in BOTH objects are kept:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']},
                   columns=['A', 'B'])

df2 = pd.DataFrame({'C': ['C0', 'C1'],
                    'D': ['D0', 'D1']},
                   columns=['C', 'D'])

df_col_inner = pd.concat([df1, df2], axis=1, join='inner')

print(df_col_inner)

# Empty DataFrame
# Columns: []
# Index: [0, 1]

No common columns between df1 and df2, so the result is empty.

Use join='outer' to keep columns from both:

df_col_outer = pd.concat([df1, df2], axis=1, join='outer')

print(df_col_outer)

   A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1

Ignoring Indexes and Setting Names

Use ignore_index=True to disregard existing indexes. New numeric indexes will be created:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']},
                  index=[0, 1])

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']},
                    index=[2, 3])

df_concat = pd.concat([df1, df2], ignore_index=True)

print(df_concat)

   A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3

To name the result index, use names:

df_concat = pd.concat([df1, df2],
                      ignore_index=True,
                      names=['ID'])

print(df_concat)

   A   B
ID
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3

Concatenating with Ordered Sorting

Use sort=True to sort the result DataFrame by the join key:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']},
                    index=[1, 2])

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']},
                    index=[2, 3])

df_sorted = pd.concat([df1, df2], sort=True)

print(df_sorted)

     A   B
1  A0  B0
2  A1  B1
2  A2  B2
3  A3  B3

For controlling order along columns, use sort=False:

df3 = pd.DataFrame({'C': ['C0', 'C1']},
                    index=[1, 2])

df = pd.concat([df1, df3], axis=1, sort=False)
print(df)

     A   B   C
1  A0  B0  C0
2  A1  B1  C1

Concatenation with Categorical Data

When concatenating categorical Series or columns, Pandas tries to prevent reordering of categories:

s1 = pd.Series(['a','b','c'], dtype='category')
s2 = pd.Series(['d','e'], dtype='category')

s_cat = pd.concat([s1,s2])

print(s_cat)

0    a
1    b
2    c
0    d
1    e
dtype: category
Categories (5, object): [a, b, c, d, e]

This preserves the original ordering of the categories [‘a’,‘b’,‘c’] before appending ‘d’ and ‘e’.

For DataFrames, combine along rows to preserve categories:

df1 = pd.DataFrame({'A': ['a', 'b', 'c']}, dtype='category')
df2 = pd.DataFrame({'A': ['d', 'e']}, dtype='category')

df_cat = pd.concat([df1, df2])

print(df_cat)

      A
0    a
1    b
2    c
0    d
1    e

Specifying Concatenation Semantics with copy

The copy parameter controls whether concatenation copies data (default) or views the same underlying data:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']})

df2 = df1.copy()
df2.loc[0,'A'] = 'foo'

df_copy = pd.concat([df1, df2], copy=True)
df_view = pd.concat([df1, df2], copy=False)

print(df1)

   A   B
0  A0  B0
1  A1  B1

print(df_copy)

   A   B
0  A0  B0
1  A1  B1
0  foo B0

print(df_view)

   A   B
0  foo B0
1  A1  B1

copy=True makes a full copy so df1 is not changed. copy=False uses a view, so df1 reflects the changes.

Optimizing Concatenation Performance

There are a few options to optimize concat() performance:

Example:

df_list = [df1, df2, df3]

df_concat = pd.concat(df_list,
                   ignore_index=True,
                   copy=False,
                   sort=False,
                   verify_integrity=False)

This will provide significant speedups for large data sets.

Alternative Ways to Concatenate

While concat() is the main method, there are also other ways to concatenate in Pandas:

concat() provides the most flexibility and options for concatenation.

Key Takeaways

Concatenation with concat() enables smoothly combining data from different sources for effective data preparation and analysis using Pandas.