Concatenating DataFrames and Series with Pandas' concat()

Pandas is one of the most popular Python libraries used for data manipulation and analysis. The concat() function in Pandas provides a flexible way to concatenate or join together DataFrames and Series objects along an axis.

Concatenation refers to joining or appending objects end-to-end. With concat(), you can combine data from different sources into a single unified DataFrame or Series for further analysis and modeling.

In this comprehensive guide, we will cover the following topics related to using concat() in Pandas:

Open Table of Contents

Overview of Concatenation in Pandas
- Methods of Concatenation
Concatenating Two or More Objects with concat()
Concatenating Objects with Overlapping Indexes
Adding MultiIndex Keys to Identify Source DataFrames
Concatenating Along Columns with join='inner'
Ignoring Indexes and Setting Names
Concatenating with Ordered Sorting
Concatenation with Categorical Data
Specifying Concatenation Semantics with copy
Optimizing Concatenation Performance
Alternative Ways to Concatenate
Key Takeaways

Overview of Concatenation in Pandas

Concatenation combines objects together by stacking them horizontally or vertically. The concat() function allows concatenation of DataFrames, Series, and Panel objects.

Key advantages of using concat():

Easily combine data from different sources into a single data structure
Specify the axis (columns or rows) along which concatenation is performed
Control the join and sort order of the concatenated objects
Manage indexes, including hierarchical indexes, on the result object
Specify keys for aligning concatenated objects
Handle missing data and duplicate indexes/columns
Configure copy vs view semantics for better performance

Methods of Concatenation

There are two main methods of concatenation:

Vertical concatenation (vstack) - Stacking objects vertically by combining along the rows. This increases the number of rows.
Horizontal concatenation (hstack) - Stacking objects horizontally by combining along the columns. This increases the number of columns.

For 1-dimensional Series, vertical concatenation is equivalent to appending the Series.

Concatenating Two or More Objects with `concat()`

The main parameters to concat() are:

objs - A sequence or mapping of objects to concatenate. This can contain DataFrames, Series, or Panel objects.
axis - The axis along which to concatenate:
- 0: vertically (rows)
- 1: horizontally (columns)
join - How to handle indexes on other axis:
- ‘inner’ - use intersection of indexes
- ‘outer’ - use union of indexes

To vertically concatenate two DataFrames:

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']})

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']})

df_vert = pd.concat([df1, df2])

print(df_vert)

# Output
   A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3

The index is automatically reset. To preserve the original indexes, set ignore_index=True.

For horizontal concatenation, set axis=1:

df_horz = pd.concat([df1, df2], axis=1)

print(df_horz)

# Output
   A   B   A   B
0  A0  B0  A2  B2
1  A1  B1  A3  B3

join='outer' combines the outer union of indexes and inserts missing values for the mismatched indexes on the other axes.

Concatenating Objects with Overlapping Indexes

When concatenating objects that have overlapping indexes, you can control how they are handled using the join and ignore_index parameters:

join='inner' - Use intersection of indexes.
ignore_index=True - Ignore indexes and reset to new numbered index.

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                        'B': ['B0', 'B1', 'B2']},
                        index=[0, 1, 2])

df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4'],
                        'B': ['B2', 'B3', 'B4']},
                       index=[2, 3, 4])

df_inner = pd.concat([df1, df2], join='inner')

print(df_inner)

# Output
     A   B
2  A2  B2

Only overlapping indexes (2) are kept.

With ignore_index=True:

df_ignore = pd.concat([df1, df2], ignore_index=True)

print(df_ignore)

# Output
   A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A2  B2
4  A3  B3
5  A4  B4

All data is preserved but indexes are ignored and reset.

Adding MultiIndex Keys to Identify Source DataFrames

When concatenating a list of DataFrames, use the keys parameter to add an index level to identify the source:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']})

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']})

df3 = pd.DataFrame({'A': ['A4', 'A5'],
                    'B': ['B4', 'B5']})

df_list = [df1, df2, df3]

df_concat = pd.concat(df_list, keys=['x', 'y', 'z'])

print(df_concat)

# Output
      A   B
x 0  A0  B0
  1  A1  B1
y 0  A2  B2
  1  A3  B3
z 0  A4  B4
  1  A5  B5

This adds a multi-index with a new outer level identifying each DataFrame source.

Concatenating Along Columns with `join='inner'`

When concatenating along columns using axis=1 and join='inner', only the columns found in BOTH objects are kept:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']},
                   columns=['A', 'B'])

df2 = pd.DataFrame({'C': ['C0', 'C1'],
                    'D': ['D0', 'D1']},
                   columns=['C', 'D'])

df_col_inner = pd.concat([df1, df2], axis=1, join='inner')

print(df_col_inner)

# Empty DataFrame
# Columns: []
# Index: [0, 1]

No common columns between df1 and df2, so the result is empty.

Use join='outer' to keep columns from both:

df_col_outer = pd.concat([df1, df2], axis=1, join='outer')

print(df_col_outer)

   A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1

Ignoring Indexes and Setting Names

Use ignore_index=True to disregard existing indexes. New numeric indexes will be created:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']},
                  index=[0, 1])

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']},
                    index=[2, 3])

df_concat = pd.concat([df1, df2], ignore_index=True)

print(df_concat)

   A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3

To name the result index, use names:

df_concat = pd.concat([df1, df2],
                      ignore_index=True,
                      names=['ID'])

print(df_concat)

   A   B
ID
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3

Concatenating with Ordered Sorting

Use sort=True to sort the result DataFrame by the join key:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']},
                    index=[1, 2])

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']},
                    index=[2, 3])

df_sorted = pd.concat([df1, df2], sort=True)

print(df_sorted)

     A   B
1  A0  B0
2  A1  B1
2  A2  B2
3  A3  B3

For controlling order along columns, use sort=False:

df3 = pd.DataFrame({'C': ['C0', 'C1']},
                    index=[1, 2])

df = pd.concat([df1, df3], axis=1, sort=False)
print(df)

     A   B   C
1  A0  B0  C0
2  A1  B1  C1

Concatenation with Categorical Data

When concatenating categorical Series or columns, Pandas tries to prevent reordering of categories:

s1 = pd.Series(['a','b','c'], dtype='category')
s2 = pd.Series(['d','e'], dtype='category')

s_cat = pd.concat([s1,s2])

print(s_cat)

0    a
1    b
2    c
0    d
1    e
dtype: category
Categories (5, object): [a, b, c, d, e]

This preserves the original ordering of the categories [‘a’,‘b’,‘c’] before appending ‘d’ and ‘e’.

For DataFrames, combine along rows to preserve categories:

df1 = pd.DataFrame({'A': ['a', 'b', 'c']}, dtype='category')
df2 = pd.DataFrame({'A': ['d', 'e']}, dtype='category')

df_cat = pd.concat([df1, df2])

print(df_cat)

      A
0    a
1    b
2    c
0    d
1    e

Specifying Concatenation Semantics with `copy`

The copy parameter controls whether concatenation copies data (default) or views the same underlying data:

df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']})

df2 = df1.copy()
df2.loc[0,'A'] = 'foo'

df_copy = pd.concat([df1, df2], copy=True)
df_view = pd.concat([df1, df2], copy=False)

print(df1)

   A   B
0  A0  B0
1  A1  B1

print(df_copy)

   A   B
0  A0  B0
1  A1  B1
0  foo B0

print(df_view)

   A   B
0  foo B0
1  A1  B1

copy=True makes a full copy so df1 is not changed. copy=False uses a view, so df1 reflects the changes.

Optimizing Concatenation Performance

There are a few options to optimize concat() performance:

Set copy=False to avoid duplicating data
Specify sort=False to avoid sorting if order not required
Set verify_integrity=False to skip index/column checks
Use ignore_index=True to avoid reindexing

Example:

df_list = [df1, df2, df3]

df_concat = pd.concat(df_list,
                   ignore_index=True,
                   copy=False,
                   sort=False,
                   verify_integrity=False)

This will provide significant speedups for large data sets.

Alternative Ways to Concatenate

While concat() is the main method, there are also other ways to concatenate in Pandas:

Series.append() - Append Series together and ignore indexes
DataFrame.append() - Append rows of DataFrames together, preserving indexes
DataFrame.join() - Join columns of DataFrames on an index
DataFrame.merge() - SQL-style merge operation on DataFrames

concat() provides the most flexibility and options for concatenation.

Key Takeaways

Use concat() to concatenate or join together DataFrames and Series along an axis
Set axis=0 to stack objects vertically (row-wise)
Set axis=1 to stack objects horizontally (column-wise)
join handles overlapping indexes while ignore_index resets the index
Add keys for multi-index to identify source objects
Control sort order, data copying, and category ordering
Optimize with copy=False, sort=False, and ignore_index=True

Concatenation with concat() enables smoothly combining data from different sources for effective data preparation and analysis using Pandas.