Pandas is one of the most popular Python libraries for data manipulation and analysis. The concat() function in Pandas provides a flexible way to concatenate DataFrame and Series objects along an axis.
Concatenation refers to joining or appending objects end-to-end. With concat(), you can combine data from different sources into a single unified DataFrame or Series for further analysis and modeling.
In this comprehensive guide, we will cover the following topics related to using concat() in Pandas:
Table of Contents
- Overview of Concatenation in Pandas
- Concatenating Two or More Objects with concat()
- Concatenating Objects with Overlapping Indexes
- Adding MultiIndex Keys to Identify Source DataFrames
- Concatenating Along Columns with join='inner'
- Ignoring Indexes and Setting Names
- Concatenating with Ordered Sorting
- Concatenation with Categorical Data
- Specifying Concatenation Semantics with copy
- Optimizing Concatenation Performance
- Alternative Ways to Concatenate
- Key Takeaways
Overview of Concatenation in Pandas
Concatenation combines objects by stacking them vertically or horizontally. The concat() function accepts DataFrame and Series objects.
Key advantages of using concat():
- Easily combine data from different sources into a single data structure
- Specify the axis (columns or rows) along which concatenation is performed
- Control the join and sort order of the concatenated objects
- Manage indexes, including hierarchical indexes, on the result object
- Specify keys that label each source object in the result
- Handle missing data and duplicate indexes/columns
- Control whether the input data is copied
Methods of Concatenation
There are two main methods of concatenation:
- Vertical concatenation (axis=0) - Stacking objects vertically by combining along the rows. This increases the number of rows.
- Horizontal concatenation (axis=1) - Stacking objects horizontally by combining along the columns. This increases the number of columns.
For 1-dimensional Series, vertical concatenation is equivalent to appending the Series.
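As a small illustration of the two directions applied to Series, the sketch below stacks two Series end-to-end and then places them side by side. The variable names and the column labels passed through keys are made up for this example.

```python
import pandas as pd

s1 = pd.Series([1, 2], name="x")
s2 = pd.Series([3, 4], name="x")

# Vertical: the values are stacked end-to-end and each Series keeps its labels
stacked = pd.concat([s1, s2])
print(stacked.tolist())        # [1, 2, 3, 4]
print(stacked.index.tolist())  # [0, 1, 0, 1]

# Horizontal: each Series becomes one column of a DataFrame;
# keys supplies the column labels (hypothetical names)
side_by_side = pd.concat([s1, s2], axis=1, keys=["first", "second"])
print(side_by_side.shape)             # (2, 2)
print(side_by_side.columns.tolist())  # ['first', 'second']
```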
Concatenating Two or More Objects with concat()
The main parameters to concat() are:
- objs - A sequence or mapping of objects to concatenate. This can contain DataFrames or Series.
- axis - The axis along which to concatenate:
  - 0: vertically (rows)
  - 1: horizontally (columns)
- join - How to handle indexes on the other axis:
  - 'inner' - use intersection of indexes
  - 'outer' - use union of indexes (the default)
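Since objs may be a mapping as well as a sequence, here is a minimal sketch of passing a dict. The dict keys ("left" and "right", chosen only for illustration) end up as the outer level of the result index.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1"]})
right = pd.DataFrame({"A": ["A2", "A3"]})

# With a dict, the keys label the rows that came from each object
combined = pd.concat({"left": left, "right": right})
print(combined.index.tolist())
# [('left', 0), ('left', 1), ('right', 0), ('right', 1)]
```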
To vertically concatenate two DataFrames:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']})
df_vert = pd.concat([df1, df2])
print(df_vert)
# Output
A B
0 A0 B0
1 A1 B1
0 A2 B2
1 A3 B3
The original indexes are preserved, so the labels 0 and 1 each appear twice. To replace them with a fresh numbered index, set ignore_index=True.
For horizontal concatenation, set axis=1:
df_horz = pd.concat([df1, df2], axis=1)
print(df_horz)
# Output
A B A B
0 A0 B0 A2 B2
1 A1 B1 A3 B3
join='outer' (the default) takes the union of the indexes on the other axis and inserts missing values (NaN) wherever an object has no matching label.
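To make the effect of join on the other axis concrete, the following sketch stacks two frames whose columns only partly match. The frame names and values are invented for the example.

```python
import pandas as pd

base = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
extra = pd.DataFrame({"B": ["B2"], "C": ["C2"]})

# outer: union of the column labels, gaps filled with NaN
outer = pd.concat([base, extra], join="outer", ignore_index=True)
print(outer.columns.tolist())          # ['A', 'B', 'C']
print(int(outer["C"].isna().sum()))    # 2 (the two rows from base have no C)

# inner: only the column labels shared by every input
inner = pd.concat([base, extra], join="inner", ignore_index=True)
print(inner.columns.tolist())          # ['B']
```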
Concatenating Objects with Overlapping Indexes
When concatenating objects that have overlapping indexes, you can control how they are handled using the join and ignore_index parameters:
- join='inner' - Use the intersection of the indexes on the axis that is not being concatenated.
- ignore_index=True - Ignore the existing indexes and reset to a new numbered index.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=[0, 1, 2])
df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4'],
'B': ['B2', 'B3', 'B4']},
index=[2, 3, 4])
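# join acts on the axis that is NOT being concatenated, so concatenate
# along columns (axis=1) to intersect the row indexes
df_inner = pd.concat([df1, df2], axis=1, join='inner')
print(df_inner)
# Output
A B A B
2 A2 B2 A2 B2
Only the overlapping index (2) is kept.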
With ignore_index=True:
df_ignore = pd.concat([df1, df2], ignore_index=True)
print(df_ignore)
# Output
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A2 B2
4 A3 B3
5 A4 B4
All data is preserved but indexes are ignored and reset.
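If duplicate labels should be treated as an error rather than silently kept or renumbered, concat() also accepts verify_integrity=True. The sketch below, with made-up frames, shows the check raising a ValueError when the row labels overlap.

```python
import pandas as pd

df_a = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df_b = pd.DataFrame({"A": ["A2", "A3"]}, index=[1, 2])

# Without the check, the duplicate label 1 is kept as-is
plain = pd.concat([df_a, df_b])
print(plain.index.is_unique)  # False

# With the check, overlapping labels raise a ValueError
try:
    pd.concat([df_a, df_b], verify_integrity=True)
    overlap_detected = False
except ValueError as exc:
    overlap_detected = True
    print(f"concat refused: {exc}")
```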
Adding MultiIndex Keys to Identify Source DataFrames
When concatenating a list of DataFrames, use the keys parameter to add an index level to identify the source:
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']})
df3 = pd.DataFrame({'A': ['A4', 'A5'],
'B': ['B4', 'B5']})
df_list = [df1, df2, df3]
df_concat = pd.concat(df_list, keys=['x', 'y', 'z'])
print(df_concat)
# Output
A B
x 0 A0 B0
1 A1 B1
y 0 A2 B2
1 A3 B3
z 0 A4 B4
1 A5 B5
This adds a multi-index with a new outer level identifying each DataFrame source.
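Once the source labels are in the index, they can be used for selection or turned into an ordinary column. This is a brief sketch; the level names "source" and "row" are hypothetical and chosen only for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

df_concat = pd.concat([df1, df2], keys=["x", "y"])

# Select every row that came from the frame labelled 'y'
from_y = df_concat.loc["y"]
print(from_y["A"].tolist())  # ['A2', 'A3']

# Name the index levels, then move the source label into a column
flat = df_concat.rename_axis(["source", "row"]).reset_index(level="source")
print(flat["source"].tolist())  # ['x', 'x', 'y', 'y']
```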
Concatenating Along Columns with join='inner'
When concatenating along columns using axis=1, join='inner' keeps only the row labels found in BOTH objects. Columns are never dropped:
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']},
columns=['A', 'B'])
df2 = pd.DataFrame({'C': ['C0', 'C1'],
'D': ['D0', 'D1']},
columns=['C', 'D'])
df_col_inner = pd.concat([df1, df2], axis=1, join='inner')
print(df_col_inner)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
df1 and df2 share the row labels 0 and 1, so both rows and all four columns are kept. The result would have no rows only if the two indexes had no labels in common.
Because the indexes match exactly, join='outer' gives the same result here:
df_col_outer = pd.concat([df1, df2], axis=1, join='outer')
print(df_col_outer)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
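The difference between the two join modes along columns only becomes visible when the row labels differ. A minimal sketch with partially overlapping indexes (frame names invented for the example):

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])
right = pd.DataFrame({"C": ["C1", "C2", "C3"]}, index=[1, 2, 3])

# inner: keep only row labels present in both frames
inner = pd.concat([left, right], axis=1, join="inner")
print(inner.index.tolist())    # [1, 2]
print(inner.columns.tolist())  # ['A', 'C']

# outer: keep every row label and fill the gaps with NaN
outer = pd.concat([left, right], axis=1, join="outer")
print(outer.index.tolist())           # [0, 1, 2, 3]
print(int(outer.isna().sum().sum()))  # 2
```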
Ignoring Indexes and Setting Names
Use ignore_index=True to disregard existing indexes. New numeric indexes will be created:
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']},
index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']},
index=[2, 3])
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
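The names parameter only labels the levels of a hierarchical index created with keys, so it has no effect together with ignore_index=True. To name a flat result index, use rename_axis():
df_concat = pd.concat([df1, df2],
ignore_index=True).rename_axis('ID')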
print(df_concat)
A B
ID
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
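For comparison, this is a sketch of how names is typically combined with keys to label the levels of the resulting hierarchical index. The key and level names are placeholders.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A2", "A3"]}, index=[2, 3])

# keys adds an outer index level; names labels the outer and inner levels
labeled = pd.concat(
    [df1, df2],
    keys=["first", "second"],
    names=["source", "row"],
)
print(list(labeled.index.names))  # ['source', 'row']
print(labeled.index.tolist())
# [('first', 0), ('first', 1), ('second', 2), ('second', 3)]
```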
Concatenating with Ordered Sorting
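Use sort=True to sort the labels on the non-concatenation axis (the columns, when stacking rows) if they are not already aligned. The rows themselves stay in concatenation order: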
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']},
index=[1, 2])
df2 = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']},
index=[2, 3])
df_sorted = pd.concat([df1, df2], sort=True)
print(df_sorted)
A B
1 A0 B0
2 A1 B1
2 A2 B2
3 A3 B3
When concatenating along columns with axis=1, sort applies to the row index instead. sort=False (the default) does not request any sorting:
df3 = pd.DataFrame({'C': ['C0', 'C1']},
index=[1, 2])
df = pd.concat([df1, df3], axis=1, sort=False)
print(df)
A B C
1 A0 B0 C0
2 A1 B1 C1
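The effect of sort is easiest to see when the inputs have different column orders. The sketch below uses two single-row frames with invented names and compares the resulting column order.

```python
import pandas as pd

first = pd.DataFrame({"B": ["B0"], "A": ["A0"]})
second = pd.DataFrame({"C": ["C1"], "A": ["A1"]})

# sort=False: columns appear in the order they are first seen
unsorted_cols = pd.concat([first, second], sort=False)
print(unsorted_cols.columns.tolist())  # ['B', 'A', 'C']

# sort=True: the non-concatenation axis (columns here) is sorted
sorted_cols = pd.concat([first, second], sort=True)
print(sorted_cols.columns.tolist())    # ['A', 'B', 'C']

# The row order is the concatenation order in both cases
print(sorted_cols["A"].tolist())       # ['A0', 'A1']
```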
Concatenation with Categorical Data
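When concatenating categorical Series or columns, Pandas keeps the category dtype only if all inputs have exactly the same categories. Otherwise the result falls back to a plain dtype:
s1 = pd.Series(['a','b','c'], dtype='category')
s2 = pd.Series(['d','e'], dtype='category')
s_cat = pd.concat([s1,s2])
print(s_cat)
0 a
1 b
2 c
0 d
1 e
dtype: object
Because s1 has the categories ['a','b','c'] and s2 has ['d','e'], the result is no longer categorical (it is object in pandas 1.x and 2.x). To keep a categorical result, give both inputs the same categories first, or combine them with pandas.api.types.union_categoricals().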
The same rule applies to DataFrame columns. Column A below also loses its category dtype because the two frames use different categories:
df1 = pd.DataFrame({'A': ['a', 'b', 'c']}, dtype='category')
df2 = pd.DataFrame({'A': ['d', 'e']}, dtype='category')
df_cat = pd.concat([df1, df2])
print(df_cat)
A
0 a
1 b
2 c
0 d
1 e
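Two ways of ending up with a categorical result are sketched here: merging the category sets with union_categoricals from pandas.api.types, or casting both inputs to one shared CategoricalDtype before calling concat(). The variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series(["a", "b", "c"], dtype="category")
s2 = pd.Series(["d", "e"], dtype="category")

# Plain concat with different category sets is not categorical
plain = pd.concat([s1, s2])
print(isinstance(plain.dtype, pd.CategoricalDtype))  # False

# Option 1: merge the categories, then wrap the result in a Series
merged = pd.Series(union_categoricals([s1, s2]))
print(merged.cat.categories.tolist())  # ['a', 'b', 'c', 'd', 'e']

# Option 2: cast both inputs to one shared dtype before concatenating
shared = pd.CategoricalDtype(["a", "b", "c", "d", "e"])
kept = pd.concat([s1.astype(shared), s2.astype(shared)], ignore_index=True)
print(isinstance(kept.dtype, pd.CategoricalDtype))   # True
```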
Specifying Concatenation Semantics with copy
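The copy parameter controls whether concat() always copies the input data (copy=True) or avoids copies where it can (copy=False). It does not change the contents of the result, and copy=False does not guarantee that the result is a view of the inputs:
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']})
df2 = df1.copy()
df2.loc[0,'A'] = 'foo'
df_copy = pd.concat([df1, df2], copy=True)
df_view = pd.concat([df1, df2], copy=False)
print(df1)
A B
0 A0 B0
1 A1 B1
print(df_copy)
A B
0 A0 B0
1 A1 B1
0 foo B0
1 A1 B1
print(df_view.equals(df_copy))
True
df1 is unchanged in both cases, because df2 was modified as an independent copy before concatenation, and both results hold the same four rows. In pandas 3.0 and later, where Copy-on-Write is always enabled, the copy keyword is ignored and deprecated.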
Optimizing Concatenation Performance
There are a few options that can reduce the work concat() does:
- Set copy=False to avoid duplicating data where possible
- Specify sort=False (the default) to avoid sorting when order is not required
- Leave verify_integrity=False (the default) to skip the duplicate-label check
- Use ignore_index=True when the original index labels are not needed
Example:
df_list = [df1, df2, df3]
df_concat = pd.concat(df_list,
ignore_index=True,
copy=False,
sort=False,
verify_integrity=False)
The gain depends on the size and dtypes of the data and on the pandas version, so measure before relying on these settings for large data sets.
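A related pattern that is often recommended, although not covered in the options above, is to collect the pieces in a list and call concat() once rather than growing a DataFrame inside a loop. The sketch below builds the chunks artificially to illustrate the idea; in practice they would come from files or queries.

```python
import pandas as pd

# Hypothetical chunks of three rows each
chunks = [pd.DataFrame({"value": range(start, start + 3)}) for start in range(0, 9, 3)]

# One concat call over the whole list, instead of one call per chunk
result = pd.concat(chunks, ignore_index=True)
print(result.shape)              # (9, 1)
print(result["value"].tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```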
Alternative Ways to Concatenate
While concat() is the main method, there are also other ways to combine objects in Pandas:
- Series.append() - Appended one Series to another. Deprecated in pandas 1.4 and removed in 2.0; use concat() instead
- DataFrame.append() - Appended rows of one DataFrame to another. Deprecated in pandas 1.4 and removed in 2.0; use concat() instead
- DataFrame.join() - Join columns of DataFrames on an index
- DataFrame.merge() - SQL-style merge operation on DataFrames
concat() provides the most flexibility and options for concatenation.
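As a rough sketch of how concat() covers two of these cases, the example below adds rows the way append() used to, and compares an index-based join() with a column-wise concat(). All frame names and values are invented for the example.

```python
import pandas as pd

# Adding rows: concat() in place of the removed DataFrame.append()
df = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
new_rows = pd.DataFrame({"A": ["A2"], "B": ["B2"]})
appended = pd.concat([df, new_rows], ignore_index=True)
print(appended["A"].tolist())  # ['A0', 'A1', 'A2']

# Joining on identical indexes: join() and concat(axis=1) line up the same rows
left = pd.DataFrame({"A": ["A0", "A1"]}, index=["k0", "k1"])
right = pd.DataFrame({"C": ["C0", "C1"]}, index=["k0", "k1"])
via_join = left.join(right)
via_concat = pd.concat([left, right], axis=1)
print(via_join.columns.tolist())    # ['A', 'C']
print(via_concat.columns.tolist())  # ['A', 'C']
```

merge() remains the better fit when rows must be matched on column values rather than on index labels.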
Key Takeaways
- Use concat() to concatenate or join together DataFrames and Series along an axis
- Set axis=0 to stack objects vertically (row-wise)
- Set axis=1 to stack objects horizontally (column-wise)
- join controls how labels on the other axis are combined, while ignore_index resets the index
- Add keys for multi-index to identify source objects
- Control sort order, data copying, and category handling
- Optimize with copy=False, sort=False, and ignore_index=True
Concatenation with concat() enables smoothly combining data from different sources for effective data preparation and analysis using Pandas.