Pandas is a popular Python library used for data analysis and manipulation. One of Pandas’ most useful features is the ability to easily modify DataFrame columns. This allows developers to shape datasets to best suit their needs. In this comprehensive guide, we will explore the various methods for adding, inserting, removing, and renaming columns in Pandas DataFrames.
Overview
A Pandas DataFrame is a two-dimensional, tabular data structure with labeled columns that can hold different data types such as strings, numbers, and booleans. Each column is a Pandas Series and plays the role of a variable in the dataset. When analyzing and preparing data in Python, it is often necessary to add, delete, or modify columns.
Pandas provides several methods to make these modifications efficiently without affecting the rest of the DataFrame. The key methods for column manipulation are:
- df[column_name] = column_values: Add/modify columns by assignment
- df.insert(loc, column_name, column_values): Insert column at specified location
- df.drop(columns=[column_names]): Delete columns by name
- df.rename(columns={old_name: new_name}): Rename columns by specifying a mapping
In the following sections, we will explore the proper usage of each method with examples.
Adding Columns
New columns can be added to a Pandas DataFrame by assigning values to a column name that does not exist yet.
The basic syntax is:
df[new_column_name] = column_values
The column values can be a Python list, NumPy array, Pandas Series, or scalar value that is broadcast across all rows.
For example:
import pandas as pd
data = {'Name': ['John', 'Mary'], 'Age': [25, 27]}
df = pd.DataFrame(data)
# Add new column with scalar value
df['Country'] = 'United States'
# Add column with list
df['Hobby'] = ['Tennis', 'Hiking']
print(df)
Name Age Country Hobby
0 John 25 United States Tennis
1 Mary 27 United States Hiking
The new columns are appended to the right end of the DataFrame. A list or array must contain exactly one value per row, otherwise Pandas raises a ValueError. A Series is the exception: it is aligned on its index labels, and rows without a matching label are filled with NaN.
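The length rule can be checked with a short sketch. The column names Hobby and Score and the variable length_error are chosen for illustration only; the point is how a list of the wrong length differs from a Series with a partial index.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [25, 27]})

# A list with three values cannot fill a two-row DataFrame.
try:
    df['Hobby'] = ['Tennis', 'Hiking', 'Chess']
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is matched on index labels instead of position.
# Row 0 has no entry in the Series, so it is filled with NaN.
df['Score'] = pd.Series([90], index=[1])

print(length_error)
print(df)
```

Because a Series is aligned rather than length-checked, a mismatched index fails silently with NaN values, so it is worth inspecting the result when assigning one.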
We can also insert a column at a specific location using insert(), which will be covered later.
Modifying Columns
Existing columns in a DataFrame can be modified by simply assigning new values to the column:
df[column_name] = new_column_values
The new values must match the length of the DataFrame, similar to adding new columns.
For example:
df['Age'] = [24, 26] # Modify Age column
print(df)
Name Age Country Hobby
0 John 24 United States Tennis
1 Mary 26 United States Hiking
Columns can also be modified with scalar values:
df['Country'] = 'Canada' # Set all rows to Canada
Or using columnar operations like applying mathematical functions:
df['Age'] = df['Age'] + 1 # Increment Age by 1
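The following sketch collects a few common ways to modify a column as a whole: arithmetic, string methods through the .str accessor, and a conditional update with .loc. The data and the threshold of 25 are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [24, 26]})

# Arithmetic is applied element-wise to the whole column.
df['Age'] = df['Age'] + 1

# String methods are available through the .str accessor.
df['Name'] = df['Name'].str.upper()

# .loc with a boolean mask rewrites only the rows that match.
df.loc[df['Age'] > 25, 'Age'] = 0

print(df)
```

Using .loc for the conditional update writes into the original DataFrame directly, which avoids the chained-assignment pitfalls of df[mask]['Age'] = 0.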
Inserting Columns
The insert() method allows inserting a new column at a specified location in the DataFrame.
The syntax is:
df.insert(loc, column_name, column_values)
Where loc is the zero-indexed insertion location (the numeric index of the column before which the new column will be inserted).
For example:
new_col = [10, 20]
df.insert(1, 'Points', new_col)
print(df)
Name Points Age Country Hobby
0 John 10 24 United States Tennis
1 Mary 20 26 United States Hiking
Here we inserted the ‘Points’ column with values [10, 20] at index position 1, between the ‘Name’ and ‘Age’ columns.
Inserting a column modifies the DataFrame in-place. The column index positions of existing columns will be shifted right by 1 after the insert location.
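Note that insert() accepts a single column name and a single set of values per call; passing a list of names to insert several columns at once is not supported. To add several columns, call insert() once per column, increasing loc each time if the columns should keep their order. Inserting a name that already exists, such as ‘Points’ here, raises a ValueError unless allow_duplicates=True is passed.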
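As a minimal sketch, two columns can be placed side by side starting at position 1 with one insert() call per column. The dictionary new_columns and the variable duplicate_error are illustrative names, not part of the Pandas API.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [24, 26]})

new_columns = {'Points': [10, 20], 'Score': [20, 30]}

# One insert() call per column; the position grows so the order is kept.
for offset, (name, values) in enumerate(new_columns.items()):
    df.insert(1 + offset, name, values)

# Inserting an existing name is rejected by default.
try:
    df.insert(1, 'Points', [0, 0])
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

print(list(df.columns))
print(duplicate_error)
```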
Removing Columns
To remove one or more columns, use the drop() method on the DataFrame:
df.drop(columns=[column_names], inplace=True)
The columns parameter accepts the name of the column(s) to remove as a list.
Setting inplace=True will modify the DataFrame in-place, otherwise drop() will return a copy with the columns removed.
For example:
# Remove single column
df.drop(columns=['Points'], inplace=True)
# Remove multiple columns
df.drop(columns=['Country', 'Hobby'], inplace=True)
print(df)
Name Age
0 John 24
1 Mary 26
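drop() matches column labels, not positions, so an integer list such as [0, 3] only works when the columns are actually labeled 0 and 3. To remove columns by position, look up their labels through df.columns first:
df_without_first = df.drop(columns=df.columns[[0]])
The axis=1 argument is only needed with the form df.drop(labels, axis=1), where it tells drop() that the labels refer to columns rather than rows. It is redundant when the columns keyword is used.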
The column index positions will be automatically shifted left after dropping columns.
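The sketch below contrasts the main ways of calling drop(): returning a copy, dropping by position via df.columns, and tolerating missing labels with errors='ignore'. The four-column DataFrame and the variable names trimmed, by_position, and safe are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Mary'],
    'Points': [10, 20],
    'Age': [24, 26],
    'Country': ['Canada', 'Canada'],
})

# Without inplace=True, drop() returns a new DataFrame and leaves df untouched.
trimmed = df.drop(columns=['Points'])

# Positions are translated into labels through df.columns before dropping.
by_position = df.drop(columns=df.columns[[0, 3]])

# errors='ignore' skips labels that are not present instead of raising KeyError.
safe = df.drop(columns=['Hobby'], errors='ignore')

print(list(trimmed.columns))
print(list(by_position.columns))
print(list(safe.columns))
```

Returning a copy and reassigning it (df = df.drop(...)) is often easier to follow than inplace=True, because each step leaves a named result that can be inspected.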
Renaming Columns
The rename() method is used to rename one or more DataFrame column names.
The basic syntax is:
df.rename(columns={old_name: new_name}, inplace=True)
This specifies a dictionary mapping between the old and new column names.
For example:
df.rename(columns={'Name': 'First Name'}, inplace=True)
print(df)
First Name Age
0 John 24
1 Mary 26
We can rename multiple columns at once:
df.rename(columns={'Name': 'First Name', 'Age': 'Age Years'}, inplace=True)
Because inplace=True is set, the original DataFrame is changed. Without it, rename() returns a new DataFrame and leaves the original untouched.
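rename() matches labels rather than positions. Integer keys therefore work only when the column labels are themselves integers, which is the default when a DataFrame is created without column names:
df.rename(columns={0: 'First Name', 1: 'Age Years'}, inplace=True)
This can be useful when the original column names are missing. To rename by position regardless of the current labels, assign a full list of names to df.columns.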
The rename() method does not modify dtype or any values in the columns. It only changes the column labels.
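A short sketch of how rename() treats its keys, assuming a DataFrame built without column names so that Pandas assigns the default labels 0 and 1. The variable name unchanged is illustrative.

```python
import pandas as pd

# No column names are given, so the labels default to 0 and 1.
df = pd.DataFrame([[25, 'John'], [27, 'Mary']])

# Integer keys match here because the labels themselves are integers.
df = df.rename(columns={0: 'Age', 1: 'Name'})

# A key that matches no existing label is silently ignored by default.
unchanged = df.rename(columns={'Missing': 'Other'})

# Positional renaming: replace the whole list of labels at once.
df.columns = ['Age Years', 'First Name']

print(list(unchanged.columns))
print(list(df.columns))
```

Since unmatched keys are ignored without a warning, a typo in the old name leaves the column as it was; passing errors='raise' to rename() turns that case into a KeyError.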
Adding Columns Via Parameters
There are a few other ways to inject new columns when creating a Pandas DataFrame:
1. Column Parameter
The columns parameter sets the column names when constructing a DataFrame from row data:
data = [[25, 'John'], [27, 'Mary']]
df = pd.DataFrame(data, columns=['Age', 'Name'])
print(df)
Age Name
0 25 John
1 27 Mary
2. Using Dictionary
A dictionary passed into the DataFrame will create columns from the keys:
data = {'Age': [25, 27], 'Name': ['John', 'Mary']}
df = pd.DataFrame(data)
print(df)
Age Name
0 25 John
1 27 Mary
3. Assign After Creation
New columns can also be added by assignment immediately after the DataFrame is created:
df = pd.DataFrame(data, columns=['Age', 'Name'])
df['Country'] = 'United States'
print(df)
Age Name Country
0 25 John United States
1 27 Mary United States
Inserting Columns Via Assigning Entire Rows
In some cases, it is useful to add an entire row that supplies values for several columns at once. This can be done by:
- Creating a new DataFrame from the row data
- Concatenating it onto the original DataFrame with pd.concat()
For example:
new_row = {'Name': 'Joe', 'Age': 22, 'Country': 'Canada'}
df_new = pd.DataFrame(new_row, index=[2])
# DataFrame.append() was removed in pandas 2.0; pd.concat() replaces it
df = pd.concat([df, df_new], ignore_index=True)
print(df)
Age Name Country
0 25 John United States
1 27 Mary United States
2 22 Joe Canada
Here we created a single-row DataFrame df_new and added it to the bottom of the original df. By passing ignore_index=True, Pandas will reindex the rows sequentially.
The same process can insert multiple rows by creating a multi-row DataFrame and appending.
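A sketch of the multi-row case, using pd.concat() because DataFrame.append() is no longer available from pandas 2.0 onward. The rows in new_rows are made-up data; the second one omits Country on purpose to show how missing values are handled.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 27],
    'Name': ['John', 'Mary'],
    'Country': ['United States', 'United States'],
})

new_rows = [
    {'Name': 'Joe', 'Age': 22, 'Country': 'Canada'},
    {'Name': 'Ana', 'Age': 31},  # no Country supplied
]

# Columns are matched by name, so the key order in each dict does not matter.
# ignore_index=True renumbers the rows 0..n-1.
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)

print(df)
```

Building one DataFrame from all new rows and concatenating once is generally preferable to concatenating inside a loop, since every concat call copies the data.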
Concatenating DataFrames
An alternative method to inject new columns is concatenating Pandas DataFrames using concat():
df1 = pd.DataFrame({'Age': [25, 27]})
df2 = pd.DataFrame({'Name': ['John', 'Mary']})
df = pd.concat([df1, df2], axis=1)
print(df)
Age Name
0 25 John
1 27 Mary
The axis=1 specifies to concatenate column-wise, stacking df2 next to df1.
This allows assembling DataFrames created separately into a combined dataset with the desired columns.
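Rows are matched on their index labels when concatenating column-wise, so both DataFrames should share the same index. Note that with axis=1, ignore_index=True replaces the column labels with 0, 1, and so on, rather than renumbering the rows. To renumber rows, call reset_index(drop=True) on each DataFrame before concatenating.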
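The effect of index alignment in a column-wise concat can be seen in a small sketch. The mismatched indexes [0, 1] and [1, 2] are chosen deliberately, and misaligned and aligned are illustrative variable names.

```python
import pandas as pd

df1 = pd.DataFrame({'Age': [25, 27]}, index=[0, 1])
df2 = pd.DataFrame({'Name': ['John', 'Mary']}, index=[1, 2])

# Rows are matched on index labels, so only label 1 lines up.
# Labels 0 and 2 each get a NaN in the column they are missing from.
misaligned = pd.concat([df1, df2], axis=1)

# Resetting both indexes first pairs the rows by position instead.
aligned = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)],
    axis=1,
)

print(misaligned)
print(aligned)
```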
Conclusion
Pandas provides a versatile set of methods for adding, inserting, removing, and renaming columns in DataFrames. Mastering these column manipulation techniques enables wrangling tabular data in Python to best fit the needs of data science and analysis workflows.
In summary:
- Use column assignment to add new columns or modify existing ones
- Insert columns at specific positions with insert()
- Remove columns with drop()
- Rename column names with rename()
- Add columns using the columns parameter, dictionaries, or row appends
- Concatenate DataFrames with concat()
With these tools, developers can shape Pandas DataFrames into the ideal schema for modeling, visualization, and machine learning tasks.