Pandas is a popular Python library used for data analysis and manipulation. One of Pandas’ key features is its powerful indexing functionality that allows you to slice, dice, and access specific subsets of data in DataFrames and Series objects quickly and easily. In Pandas, you can index DataFrames using labels (like column names) or integers representing the numerical locations of rows and columns.
This guide explains Pandas’ label-based and integer-position-based indexing in detail, with examples.
Overview of Pandas Indexes
In Pandas, indexes are used to keep track of and access data within DataFrames and Series objects. By default, Pandas will create a RangeIndex as the index when creating new DataFrames.
import pandas as pd
df = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': [4, 5, 6]})
print(df)
# Column1 Column2
# 0 1 4
# 1 2 5
# 2 3 6
print(df.index)
# RangeIndex(start=0, stop=3, step=1)
We can see above that the default index is a numeric RangeIndex from 0 to 2 (the number of rows minus 1).
You can supply your own labels by passing the index parameter during DataFrame creation:
data = {'Column1': [1, 2, 3], 'Column2': [4, 5, 6]}
df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)
# Column1 Column2
# a 1 4
# b 2 5
# c 3 6
print(df.index)
# Index(['a', 'b', 'c'], dtype='object')
Now the index contains custom string labels rather than the default integers.
An Index object is immutable: you can’t assign to an individual label in place (df.index[0] = 'z' raises a TypeError). The index as a whole can still be replaced later, for example with DataFrame.set_index() or DataFrame.reset_index().
Indexes make data retrieval much easier in Pandas by allowing label and integer based indexing, which we’ll explore next.
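The point about immutability can be checked with a short sketch. The DataFrame is the one from above; the variable names in_place_allowed and renamed are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': [4, 5, 6]},
                  index=['a', 'b', 'c'])

# Assigning to a single label in place is rejected by the Index object.
try:
    df.index[0] = 'z'
    in_place_allowed = True
except TypeError:
    in_place_allowed = False

# Renaming labels returns a new DataFrame and leaves df untouched.
renamed = df.rename(index={'a': 'z'})

# Assigning to df.index swaps in a brand-new Index of the same length.
df.index = ['x', 'y', 'w']
```

So the labels inside an Index cannot be edited one by one, but the Index attached to a DataFrame can be replaced wholesale.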
Retrieving Data with .loc and .iloc
Pandas provides two main attribute accessors for retrieving data from DataFrames using label and integer based indexing: .loc and .iloc.
.loc selects data by label. The selector can be a single row or column label, a slice of labels, a list of labels, or a Boolean array.
.iloc selects data by integer position. The selector can be a single position, a slice of integers, a list of positions, or a Boolean array.
Let’s see some examples of using .loc and .iloc on the following DataFrame:
import pandas as pd
data = {'Brand': ['Honda', 'Toyota', 'Ford', 'Tesla'],
'Price': [22000, 25000, 20000, 35000]}
df = pd.DataFrame(data)
print(df)
# Brand Price
# 0 Honda 22000
# 1 Toyota 25000
# 2 Ford 20000
# 3 Tesla 35000
To select a single row by label using .loc, pass the index label. This DataFrame uses the default RangeIndex, so its row labels are the integers 0 to 3:
single_row = df.loc[1]
print(single_row)
# Brand Toyota
# Price 25000
# Name: 1, dtype: object
For a single column by label, pass the column name:
single_col = df.loc[:, 'Price']
print(single_col)
# 0 22000
# 1 25000
# 2 20000
# 3 35000
# Name: Price, dtype: int64
For multiple rows or columns, pass a list of labels:
multi_rows = df.loc[[0, 1]]  # row labels 0 and 1 (Honda and Toyota)
multi_cols = df.loc[:, ['Brand', 'Price']]
To select a slice of rows with .loc, use slice notation with labels:
row_slice = df.loc[2:3]  # labels 2 and 3; the end label is included
print(row_slice)
# Brand Price
# 2 Ford 20000
# 3 Tesla 35000
For integer-based selection with .iloc, pass integer positions:
first_row = df.iloc[0]
first_col = df.iloc[:, 0]
row_slice = df.iloc[1:3]
In summary, .loc selects data by label and .iloc selects data by integer position.
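With the default RangeIndex, labels and positions happen to coincide, which can hide the difference. The sketch below makes Brand the index so the two diverge; the name cars is only for illustration.

```python
import pandas as pd

cars = pd.DataFrame(
    {'Brand': ['Honda', 'Toyota', 'Ford', 'Tesla'],
     'Price': [22000, 25000, 20000, 35000]}
).set_index('Brand')

# .loc takes labels: row label 'Toyota', column label 'Price'.
by_label = cars.loc['Toyota', 'Price']

# .iloc takes positions: second row, first (and only) column.
by_position = cars.iloc[1, 0]
```

Both expressions reach the same cell, one by name and one by position.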
Selecting Rows and Columns by Label and Integer Location
Let’s now take a deeper look at how to select specific DataFrame rows and columns using both label and integer based indexing.
To select a single row by label, use .loc and pass the index label:
row = df.loc[1]  # label 1 is the Toyota row under the default index
For a single row by integer location, use .iloc and pass the row’s integer position:
row = df.iloc[1]
For multiple rows by label, pass a list of labels to .loc:
rows = df.loc[[1, 2]]  # labels 1 and 2 (Toyota and Ford)
For multiple rows by integer position, pass a list of ints to .iloc:
rows = df.iloc[[1, 2]]
Selecting Columns
Selecting DataFrame columns works the same way: put the column selector after a comma, passing column names to .loc and column integer positions to .iloc.
Single column by label:
col = df.loc[:, 'Price']
Single column by integer location:
col = df.iloc[:, 1]
Multiple columns by label:
cols = df.loc[:, ['Brand', 'Price']]
Multiple columns by integer position:
cols = df.iloc[:, [0, 1]]
Selecting Subsets with Slices
You can also select subsets of rows and columns using slices with .loc and .iloc.
Slice rows between two labels (inclusive):
subset = df.loc[1:2]  # Toyota and Ford rows; label 2 is included
Slice rows between two integer positions (exclusive endpoint):
subset = df.iloc[1:3]
Slice columns between two labels:
subset = df.loc[:, :'Price']  # all rows, columns up to and including 'Price'
Slice columns between two integer positions:
subset = df.iloc[:,:1]
So slices with .loc include both endpoints, but slices with .iloc exclude the stop position.
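The endpoint rule is easiest to see with string labels. This is a small sketch using a Series called prices, a name introduced here only for illustration.

```python
import pandas as pd

prices = pd.Series([22000, 25000, 20000, 35000],
                   index=['Honda', 'Toyota', 'Ford', 'Tesla'])

# Label slice: both 'Toyota' and 'Ford' are included.
label_slice = prices.loc['Toyota':'Ford']

# Position slice: starts at position 1 and stops before position 2.
position_slice = prices.iloc[1:2]
```

The label slice returns two rows, while the position slice with the same starting row returns only one.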
Using Boolean Indexing for Selection
Pandas also allows selecting rows and columns from DataFrames using Boolean conditions or masks.
First, create a Boolean Series indicating whether each row meets a condition:
price_filter = df['Price'] > 22000
print(price_filter)
# 0 False
# 1 True
# 2 False
# 3 True
# Name: Price, dtype: bool
Pass this mask to .loc to filter rows:
df.loc[price_filter]
# Brand Price
# 1 Toyota 25000
# 3 Tesla 35000
This selects all rows where Price is over 22,000.
For columns, create a Boolean mask then pass it to .loc:
brand_col = df.loc[:, df.columns.str.contains('Brand')]
This selects any columns having ‘Brand’ in their label.
Boolean masks provide a powerful, flexible way to make complex selections from Pandas objects.
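As a sketch of more complex selections, conditions can be combined with & (and), | (or), and ~ (not). The variable names below are illustrative. It also shows one difference between the accessors: .iloc rejects a Boolean Series, so the mask is converted to a NumPy array first.

```python
import pandas as pd

df = pd.DataFrame({'Brand': ['Honda', 'Toyota', 'Ford', 'Tesla'],
                   'Price': [22000, 25000, 20000, 35000]})

# Each condition needs its own parentheses, because & binds more
# tightly than the comparison operators.
mid_range = df.loc[(df['Price'] > 20000) & (df['Price'] < 30000), 'Brand']

# ~ negates a mask: every row whose Brand is not 'Ford'.
not_ford = df.loc[~df['Brand'].isin(['Ford'])]

# .iloc needs a plain Boolean array rather than a Series.
mask = (df['Price'] > 22000).to_numpy()
expensive = df.iloc[mask]
```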
Reindexing and Altering Existing Indexes
The existing index of a DataFrame can be changed with several methods. Each returns a new DataFrame by default rather than modifying the original:
- DataFrame.reindex() conforms the rows to a new set of labels, reordering the data to match.
- DataFrame.reset_index() resets the index to a default integer RangeIndex.
- DataFrame.set_index() sets the DataFrame index to the specified column(s).
DataFrame.reindex()
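DataFrame.reindex() takes a list of new labels to conform the data to. The labels are matched against the existing index, so this example first makes Brand the index:
new_idx = ['Ford', 'Honda', 'Tesla', 'Toyota']
df.set_index('Brand').reindex(new_idx)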
This reorders the rows to match the new label ordering.
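We can also reindex with integer labels, which works here because the labels of the default index are the integers 0 to 3:
new_order = [2, 0, 3, 1]
df.reindex(new_order)
This reorders the rows to match the labels passed. Note that reindex() matches labels, not positions: a requested label that is missing from the index produces a row of NaN values.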
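The label-matching behavior of reindex() can be sketched with a label that does not exist in the original data. The label 'Kia' and the variable names are made up for illustration; fill_value is an existing reindex() parameter.

```python
import pandas as pd

prices = pd.Series([22000, 25000, 20000],
                   index=['Honda', 'Toyota', 'Ford'])

# 'Kia' is not in the original index, so it gets NaN.
# 'Toyota' is not requested, so it is dropped.
reordered = prices.reindex(['Ford', 'Honda', 'Kia'])

# fill_value supplies a replacement for missing labels instead of NaN.
filled = prices.reindex(['Ford', 'Honda', 'Kia'], fill_value=0)
```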
DataFrame.reset_index()
To reset the index to the default consecutive ints, use reset_index():
df.reset_index()
# index Brand Price
# 0 0 Honda 22000
# 1 1 Toyota 25000
# 2 2 Ford 20000
# 3 3 Tesla 35000
The existing index is moved into a new ‘index’ column.
DataFrame.set_index()
To create a new index from a column, use set_index():
df.set_index('Brand')
# Price
# Brand
# Honda 22000
# Toyota 25000
# Ford 20000
# Tesla 35000
The ‘Brand’ column values are now used as the new index.
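A short round trip shows how set_index() and reset_index() relate. Neither call changes the original DataFrame unless the result is assigned back. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Brand': ['Honda', 'Toyota'], 'Price': [22000, 25000]})

# set_index returns a new DataFrame; df keeps its 'Brand' column.
by_brand = df.set_index('Brand')

# reset_index turns the index back into a regular column.
restored = by_brand.reset_index()

# drop=True discards the old index instead of keeping it as a column.
dropped = by_brand.reset_index(drop=True)
```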
Multi-Level and Hierarchical Indexing
Pandas supports indexing with multi-level or hierarchical indexes that have multiple layers of labels.
For example:
data = {
('Tech', 'Apple'): [12, 15],
('Tech', 'Google'): [13, 14],
('Auto', 'Toyota'): [10, 12],
('Auto', 'Honda'): [11, 13]
}
df = pd.DataFrame(data).T  # transpose so the tuple keys become a two-level row index
print(df)
# 0 1
# Tech Apple 12 15
# Google 13 14
# Auto Toyota 10 12
# Honda 11 13
Here we have a two-level index - ‘Tech’/‘Auto’ and ‘Apple’/‘Google’/‘Toyota’/‘Honda’.
To select rows from the outer level, use .loc[] with the first index label:
df.loc['Tech']
# 0 1
# Apple 12 15
# Google 13 14
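For the inner level, provide both labels to .loc[] as a tuple. Because the tuple identifies exactly one row, the result is a Series:
df.loc[('Auto', 'Honda')]
# 0 11
# 1 13
# Name: (Auto, Honda), dtype: int64
The first element of the tuple is the outer-level label and the second is the inner-level label.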
You can also use .iloc on a DataFrame with a MultiIndex. It is purely positional and ignores the index levels:
df.iloc[1]     # second row, regardless of its labels
df.iloc[1, 0]  # value in the second row, first column
Multi-indexes allow organizing complex, hierarchical data in tabular format.
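A row MultiIndex can also be built explicitly. The sketch below uses pd.MultiIndex.from_tuples with the same numbers as above. The level names Sector and Company and the column names Q1 and Q2 are assumptions added for readability.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Tech', 'Apple'), ('Tech', 'Google'),
     ('Auto', 'Toyota'), ('Auto', 'Honda')],
    names=['Sector', 'Company'],  # illustrative level names
)
scores = pd.DataFrame({'Q1': [12, 13, 10, 11],
                       'Q2': [15, 14, 12, 13]}, index=index)

# Outer label only: all rows under 'Tech', indexed by Company.
tech = scores.loc['Tech']

# Full row key plus a column label: a single value.
honda_q2 = scores.loc[('Auto', 'Honda'), 'Q2']

# xs selects on the inner level without naming the outer label.
apple = scores.xs('Apple', level='Company')
```

Naming the levels makes calls like xs() easier to read than referring to levels by number.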
Best Practices for Indexing in Pandas
Here are some key best practices to follow when indexing in Pandas:
- Set meaningful indexes such as timestamps, categories, or IDs rather than default integers. This improves code readability.
- Use .loc when selecting by label. Use .iloc when selecting by integer position.
- Remember that .loc slices include both endpoints, while .iloc slices exclude the stop position.
- Use Boolean indexing to filter large datasets efficiently.
- Avoid chained indexing like df[col][row], which can cause unexpected results. Use a single call such as df.loc[row, col] instead.
- Avoid modifying a DataFrame in place while indexing it. Assign the result to a new variable instead.
- Don’t use .ix. It was deprecated in pandas 0.20 and removed in pandas 1.0 because it mixed .loc and .iloc behavior in ambiguous ways.
- Take advantage of MultiIndexes for grouped, hierarchical data.
- Store index labels in variables rather than hard-coding them, for maintainability.
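The advice on chained indexing can be illustrated with an assignment. This is a sketch under the assumption that the goal is to zero out prices above 22,000; the name discounted is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Brand': ['Honda', 'Toyota', 'Ford', 'Tesla'],
                   'Price': [22000, 25000, 20000, 35000]})

# Chained form to avoid:
#     df[df['Price'] > 22000]['Price'] = 0
# The first step may return a copy, so the assignment can be lost.

# Work on an explicit copy and use one .loc call that takes the
# row mask and the column label together.
discounted = df.copy()
discounted.loc[discounted['Price'] > 22000, 'Price'] = 0
```

Working on an explicit copy keeps the original df available, and the single .loc call makes it unambiguous which object receives the new values.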
Properly leveraging Pandas’ powerful indexing functionality will allow you to efficiently access, manipulate, and analyze data in Python. With the concepts covered in this guide, you should have a comprehensive understanding of how to use label and position based indexing in Pandas for slicing and dicing DataFrames and Series.