Pandas is a popular Python library used for data analysis and manipulation. At the core of Pandas are two main data structures - Series and DataFrames - which enable efficient data handling and analysis in Python. In this comprehensive guide, we will take a deep dive into Pandas Series and DataFrames, examining how they work and how to use them for data wrangling and analytics.
Introduction
Pandas Series and DataFrames are labeled data structures (one- and two-dimensional, respectively) optimized for data analysis in Python. Pandas extends Python’s built-in data types to add rich data manipulation capabilities like indexing, alignment, reshaping, merging, sorting, analytics, and more.
Here’s a quick overview of Pandas Series and DataFrames:
- Series - A one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). Think of a Series as a column in a table.
- DataFrame - A two-dimensional tabular data structure with labeled columns that can hold data of different types. You can think of a DataFrame as multiple Series objects grouped together to share the same index.
In this guide, we will explore the fundamentals of Pandas Series and DataFrames to understand:
- How to create Series and DataFrame objects
- Data indexing, alignment, and manipulation
- Performing calculations on data
- Handling missing data
- Reshaping, pivoting, and transforming DataFrames
- Merging, joining, and concatenating data
- Data aggregation and grouping
- Reading and writing data to files and databases
We will look at relevant code examples and applications to demonstrate key concepts. Let’s get started!
Pandas Series
A Pandas Series is a one-dimensional array that can hold data of any type - integers, floats, strings, Python objects, etc. What makes a Series unique is that it has an index which labels each data point. The index can be customized while creating the Series.
Here’s how to create a simple Pandas Series:
import pandas as pd
data = [1, 2, 3, 4, 5]
ser = pd.Series(data)
print(ser)
Output:
0 1
1 2
2 3
3 4
4 5
dtype: int64
By default, a Pandas Series uses 0-based integer indices. We can customize the index values by passing the index parameter:
data = ['a', 'b', 'c', 'd', 'e']
ser = pd.Series(data, index=[100, 101, 102, 103, 104])
print(ser)
Output:
100 a
101 b
102 c
103 d
104 e
dtype: object
The key aspects of a Pandas Series include:
- A one-dimensional array of data
- Mutable values, though most Pandas operations return new objects
- Values of any data type
- Custom index labels
- Index-based selection (covered below)
- Automatic alignment by index label during operations
- Integrated missing-data (NaN) handling and type inference
- Optimized for performance
These features make Pandas Series ideal for working with column data in data analysis and statistics applications.
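Automatic alignment by index label is worth a quick illustration. In this minimal sketch, two Series with partially overlapping custom indexes are added; labels present in only one operand produce NaN:

```python
import pandas as pd

# Two Series with partially overlapping index labels
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels: 'a' and 'd' have no counterpart -> NaN
total = s1 + s2
print(total)
```

Labels `b` and `c` sum normally (12.0 and 23.0), while the unmatched `a` and `d` come back as NaN, so no data is silently misaligned.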
Now let’s look at how to access data from a Series.
Data Selection in Series
We can select elements from a Series using various methods:
import pandas as pd
data = [10, 20, 30, 40, 50]
ser = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
# Select by integer position
print(ser.iloc[0]) # 10
# Slice by position range
print(ser.iloc[1:4])
# Select by index label
print(ser['a']) # 10
# Boolean indexing
print(ser[ser > 20])
Pandas Series supports vectorized operations and broadcasting. For example:
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([10, 20, 30, 40, 50])
print(s1 + s2)
# Adds corresponding elements
Output:
0 11
1 22
2 33
3 44
4 55
dtype: int64
We can also use Series as columns in a DataFrame. More on that later.
Missing Data in Series
Pandas uses NaN (Not a Number) to represent missing data in Series and DataFrames. For example:
import numpy as np
data = [1, 2, np.nan, 4, 5]
ser = pd.Series(data)
print(ser)
Output:
0 1.0
1 2.0
2 NaN
3 4.0
4 5.0
dtype: float64
We can use the isnull() and notnull() methods to detect missing data in Pandas:
import pandas as pd
ser = pd.Series([1, 2, pd.NA, 4, 5])
print(ser.isnull())
# Boolean Series: True only at the missing position
print(ser.notnull())
# Boolean Series: the inverse of isnull()
Pandas provides many utilities to handle missing data - filling NA values, dropping NA rows/columns, interpolation, etc.
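A short sketch of those utilities on a Series - filling with a constant, dropping, and linear interpolation:

```python
import pandas as pd
import numpy as np

ser = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled = ser.fillna(0)       # replace NaN with a constant
dropped = ser.dropna()       # drop the missing entries entirely
interp = ser.interpolate()   # fill gaps by linear interpolation

print(filled)
print(dropped)
print(interp)
```

Interpolation fills each gap from its neighbors, so the missing values become 2.0 and 4.0 here; which strategy is appropriate depends on the data.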
In summary, a Pandas Series is a powerful one-dimensional labeled array with a customizable index for efficient data analysis. It excels at working with columnar data in statistics and data science applications in Python.
Pandas DataFrames
A Pandas DataFrame is a two-dimensional data structure with labeled columns that can hold data of different types, like a table or spreadsheet. You can think of a DataFrame as a collection of Pandas Series objects that share a common index.
Let’s create a simple DataFrame from a dictionary of lists:
import pandas as pd
data = {'Name': ['John', 'Mary', 'Steve', 'Sarah'],
'Age': [28, 32, 35, 27],
'Salary': [80000, 90000, 72000, 60000]}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary
0 John 28 80000
1 Mary 32 90000
2 Steve 35 72000
3 Sarah 27 60000
The key aspects of a Pandas DataFrame include:
- A two-dimensional tabular data structure with labeled axes (rows and columns)
- Potentially columns of different data types
- Size-mutable: columns can be inserted and deleted
- Row and column indexes for fast label-based selection
- Integrated missing-data (NaN) handling and type inference
- Powerful indexing and alignment capabilities
- Vectorized operations and broadcasting across rows and columns
- Columnar storage for efficient analytics
- Easy to read, write and integrate with other data formats like CSV, SQL, Excel, etc.
These features make DataFrame ideal for data manipulation, analysis, and visualization in Python.
Now let’s explore how to access and select data from DataFrames.
Data Selection in DataFrames
We can select data from a DataFrame using:
- Column names
- Row index
- Slicing ranges
- Boolean indexing
import pandas as pd
data = {'Name': ['John', 'Mary', 'Steve', 'Sarah'],
'Age': [28, 32, 35, 27],
'Salary': [80000, 90000, 72000, 60000]}
df = pd.DataFrame(data)
# Select column
print(df['Name'])
# Slice rows
print(df[1:3])
# Boolean selection
print(df[df['Age'] > 30])
DataFrame also provides some special indexing and selection methods:
- loc - select by label
- iloc - select by integer position
- at, iat - select a single value by label or integer position, respectively
For example:
print(df.loc[1:2, ['Name', 'Salary']]) # Slice by label
print(df.iloc[1:3, 0:2]) # Slice by integer location
print(df.at[1, 'Age']) # Select single value by label
These index-based selection methods make it fast and convenient to access subsets of data in Pandas.
Adding/Deleting Columns
We can add a new column by simply assigning the values:
df['Height'] = [165, 187, 192, 154]
print(df)
To delete a column, use del or pop (pop also returns the removed values). Use one or the other - the column is gone after the first call:
del df['Height']
# or, equivalently:
# height = df.pop('Height')
Vectorized Operations
One of the main advantages of using Pandas DataFrames is that we can apply vectorized operations on entire columns and rows efficiently.
For example:
df = pd.DataFrame({'A': [1, 2, 3],
'B': [4, 5, 6]})
print(df + 5) # Add 5 to every element
print(df['A'] * 3) # Multiply column A by 3
print(df['A'] > 2) # Conditional filtering
Pandas uses the NumPy library to provide fast vectorized operations for data in Series and DataFrames. This enables optimized analytical routines.
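Because operations are column-wise, derived columns can be computed without explicit Python loops. A small sketch (the column names here are illustrative, not from the examples above):

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 40.0],
                   'quantity': [3, 5, 2]})

# Element-wise multiply: one vectorized step, no loop over rows
df['revenue'] = df['price'] * df['quantity']
print(df)
```

Each row's revenue is computed in a single NumPy-backed operation, which is both faster and clearer than iterating over rows.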
Handling Missing Data
Like Series, Pandas DataFrames use NaN to represent missing data:
import numpy as np
data = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan]}
df = pd.DataFrame(data)
print(df)
print(df.isnull()) # Detect missing data
We can fill or drop missing values in Pandas:
# Fill missing values
df = df.fillna(0)
# Drop rows with any null values
df = df.dropna()
# Drop columns with any null values
df = df.dropna(axis=1)
This makes Pandas very robust in handling imperfect real-world data.
DataFrame Metadata
Pandas DataFrames provide various metadata about the data:
print(df.columns) # Column names
print(df.dtypes) # Data types of each column
print(df.shape) # Dimensions (rows, columns)
print(df.size) # Total array size
print(df.memory_usage()) # Memory used
This metadata is helpful for exploratory data analysis.
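Alongside these attributes, the standard describe() method summarizes the numeric columns (count, mean, standard deviation, quartiles), which pairs well with the metadata above during exploration:

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 32, 35, 27],
                   'Salary': [80000, 90000, 72000, 60000]})

# Summary statistics per numeric column
summary = df.describe()
print(summary)
print(summary.loc['mean', 'Age'])  # mean of Age: 30.5
```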
Now that we’ve covered the basics of creating and manipulating data in Series and DataFrames, let’s go over some more advanced Pandas functionality.
Reshaping, Pivoting and Transforming Data
Pandas provides versatile facilities for reshaping, pivoting and transforming DataFrames to tidy up messy data into analysis-friendly formats.
Reshaping with Melt
The melt() method reshapes a DataFrame from wide format to long format:
import pandas as pd
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})
print(df)
melted = df.melt(ignore_index=True)
print(melted)
Output:
A B C
0 a 1 2
1 b 3 4
2 c 5 6
  variable value
0        A     a
1        A     b
2        A     c
3        B     1
4        B     3
5        B     5
6        C     2
7        C     4
8        C     6
This converts the columns to rows to transform from wide to long format.
We can also pivot data from long back to wide format using pivot(). For a clean round trip, melt with an identifier column so pivot() knows which rows belong together:
melted = df.melt(id_vars='A')
pivoted = melted.pivot(index='A', columns='variable', values='value')
print(pivoted)
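Note that pivot() requires each index/column pair to be unique; when there are duplicates, pivot_table() aggregates them instead (mean by default). A minimal sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA'],
                   'year': [2020, 2020, 2020],
                   'sales': [10, 30, 50]})

# Two NY/2020 rows: pivot() would raise, pivot_table() averages them
table = df.pivot_table(index='city', columns='year',
                       values='sales', aggfunc='mean')
print(table)
```

The duplicated NY rows collapse to their mean (20.0), while the single LA row passes through unchanged.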
Transforming with Applymap
We can apply element-wise transformations to entire DataFrames using applymap() (renamed to DataFrame.map in pandas 2.1+):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.applymap(lambda x: x**2) # square each element
print(df)
Output:
A B
0 1 16
1 4 25
2 9 36
This is useful for batch transforming DataFrames efficiently.
Working with Text Data
Pandas makes manipulating textual data easy. For example:
df = pd.DataFrame({'Text': ['Hello World', 'Python Programming']})
print(df['Text'].str.lower()) # Convert to lower case
print(df['Text'].str.contains('Python')) # Check for substring
print(df['Text'].str.split()) # Split strings into words
This enables powerful text processing capabilities.
Combining and Merging Datasets
Pandas provides various facilities for combining and merging DataFrames - concatenation, joins and merges.
Concatenation
We can concatenate or stack together DataFrames with similar structures column-wise or row-wise:
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']})
print(pd.concat([df1, df2])) # Stack rows (axis=0, the default)
print(pd.concat([df1, df2], axis=1)) # Side by side (axis=1)
This provides an easy way to combine DataFrames.
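One detail to watch: stacking preserves each frame's original index, so the labels repeat (0, 1, 0, 1). Passing ignore_index=True builds a fresh index instead; a short sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# ignore_index=True discards the old labels and numbers rows 0..3
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```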
Joins
SQL-like join operations merge DataFrames based on columns:
df1 = pd.DataFrame({'employee': ['John', 'Mary'],
'group': ['Accounting', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Mary', 'Sarah'],
'start_date': [2010, 2014]})
print(df1.merge(df2)) # SQL-style inner join
Output:
employee group start_date
0 Mary Engineering 2010
Other join types - outer, left, and right - are also supported via the how parameter.
Merges
The merge() function supports more options, such as different join types via the how parameter. For example:
df1 = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
df2 = pd.DataFrame({'A': [4, 5, 6], 'C': [40, 50, 60]})
print(pd.merge(df1, df2, on='A', how='outer')) # Outer merge on column A
Output:
   A     B     C
0  1  10.0   NaN
1  2  20.0   NaN
2  4   NaN  40.0
3  5   NaN  50.0
4  6   NaN  60.0
This provides great flexibility to combine heterogeneous datasets.
Data Aggregation and Grouping
Pandas allows aggregating, filtering, and transforming DataFrames by one or more columns. The groupby() method groups data by categories for aggregation.
For example, grouping by ‘category’:
import pandas as pd
data = {'category': ['A', 'A', 'B', 'B', 'C'],
'value': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
print(df.groupby('category').sum())
Output:
value
category
A 3
B 7
C 5
Common aggregation functions like sum(), mean(), count(), min(), and max() can be applied to the grouped data.
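Several aggregations can also be computed in one pass with agg(). A short sketch on the same grouped data:

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'A', 'B', 'B', 'C'],
                   'value': [1, 2, 3, 4, 5]})

# One groupby pass, multiple statistics per group
stats = df.groupby('category')['value'].agg(['sum', 'mean', 'max'])
print(stats)
```

Each group gets one row with a column per aggregation, which is often more convenient than calling sum(), mean(), and max() separately.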
We can group by multiple columns and compose complex aggregations:
data = {'category': ['A', 'A', 'B', 'B', 'C'],
'item': ['X', 'Y', 'X', 'Y', 'X'],
'value': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
# Group by 'category' and 'item'
print(df.groupby(['category', 'item']).sum())
This provides an efficient way to analyze and understand relationships between categories of data.
Reading and Writing Data
Pandas provides easy I/O with a variety of file formats and data sources. This allows loading datasets into DataFrames for analysis.
CSV Files
The read_csv() function loads CSV data into a DataFrame:
df = pd.read_csv('data.csv')
# Export DataFrame to CSV
df.to_csv('output.csv')
If the CSV file contains a header row, it is used for the column names of the resulting DataFrame.
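A self-contained round trip (using an in-memory buffer instead of a file so it runs anywhere; the column names are illustrative) shows the header row becoming column names:

```python
import pandas as pd
from io import StringIO

csv_text = "name,age\nJohn,28\nMary,32\n"
df = pd.read_csv(StringIO(csv_text))   # header row -> column names
print(df.columns.tolist())

out = StringIO()
df.to_csv(out, index=False)            # index=False omits the row labels
print(out.getvalue())
```

Without index=False, to_csv writes the integer index as an extra unnamed column, which is a common surprise when re-reading the file.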
Excel Files
Pandas can read and write Excel XLSX and XLS files:
df = pd.read_excel('data.xlsx')
df.to_excel('output.xlsx')
This provides interoperability with Excel.
SQL Databases
Pandas integrates with SQLAlchemy to load data from SQL databases like PostgreSQL, MySQL, etc:
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@hostname:port/dbname')
df = pd.read_sql('SELECT * FROM table_name', engine)
We can also write DataFrames to SQL tables:
df.to_sql('table_name', engine)
This enables analyzing SQL database data using Pandas.
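For a self-contained sketch that needs no database server, Pandas also accepts a plain sqlite3 connection from Python's standard library (the 'people' table name here is illustrative):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # throwaway in-memory database

df = pd.DataFrame({'name': ['John', 'Mary'], 'age': [28, 32]})
df.to_sql('people', conn, index=False)  # write the DataFrame as a table

# Query it back with ordinary SQL
result = pd.read_sql('SELECT name, age FROM people WHERE age > 30', conn)
print(result)
conn.close()
```

Only Mary satisfies the WHERE clause, so the result is a one-row DataFrame. For other databases, use a SQLAlchemy engine as shown above.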
JSON
JSON strings and files can be loaded into Pandas DataFrames:
df = pd.read_json('data.json')
from io import StringIO
json_str = '{"A": [1, 2], "B": [3, 4]}'
df = pd.read_json(StringIO(json_str)) # literal JSON strings should be wrapped in StringIO (pandas 2.1+)
HDF5 Format
Pandas provides high performance HDF5 file I/O using PyTables:
df.to_hdf('data.h5', key='table', mode='w')
pd.read_hdf('data.h5', key='table')
HDF5 stores DataFrames efficiently for high performance analytics.
Web APIs
Pandas can query web APIs and load JSON formatted data into DataFrames:
import requests
url = 'https://api.example.com/data'
r = requests.get(url)
df = pd.DataFrame(r.json())
This enables pulling data from RESTful web APIs directly into DataFrames.
In summary, Pandas provides versatile I/O with a variety of sources to import datasets for analysis, as well as export DataFrames.
Practical Examples
Let’s now look at some real-world examples demonstrating how to work with data using Pandas Series and DataFrames:
Exploring COVID-19 Datasets
This example extracts COVID-19 data from a CSV file into a DataFrame and calculates statistics:
import pandas as pd
df = pd.read_csv('covid_data.csv')
# Calculate stats
print(df['cases'].max())
print(df.groupby('country')['cases'].sum())
We can easily manipulate and analyze COVID data using Pandas.
Analyzing Stock Market Data
This example loads historical stock price data from Yahoo Finance via the yfinance library:
import yfinance as yf
import pandas as pd
msft = yf.Ticker("MSFT")
# Get historical prices
df = msft.history(period="max")
# Resample to business day frequency
df = df.resample('B').mean()
# Analysis
print(df['Close'].max())
print(df['Volume'].nlargest(3))
Pandas enables powerful analysis of financial data.
Data Cleaning in Machine Learning
When preparing data for machine learning, we often need to clean, transform, and normalize data. Pandas is ideal for this:
import pandas as pd
df = pd.read_csv('data.csv')
# Handle missing values
df = df.fillna(0)
# Scale the column so its values sum to 1
df['amount'] = df['amount'] / df['amount'].sum()
# Convert categoricals
df['category'] = df['category'].astype('category')
# Export cleaned data
df.to_csv('clean_data.csv', index=False)
In this way, Pandas helps prepare high-quality datasets for ML modeling.
The wide range of file formats and data sources supported makes Pandas suitable for virtually any data analysis task.
Conclusion
In this comprehensive guide, we explored:
- Pandas Series - For working with one-dimensional, columnar data
- DataFrames - For two-dimensional, tabular data
- Indexing and selection - For accessing subsets of data
- Manipulation - Adding, deleting, transforming, reshaping data
- Combine/Merge - Concatenating, joining and merging datasets
- GroupBy - Split-apply-combine operations
- Data I/O - Reading/writing from various file formats and data sources
Pandas Series and DataFrames enable efficient exploratory analysis and manipulation of structured data sets in Python. With the powerful tools and expressiveness of the Python language, Pandas provides a scalable data analysis environment for programmers, data scientists, engineers and analysts.
The wide adoption of Pandas for data science, machine learning, and general purpose data analysis demonstrates the immense value of having optimized data structures for working with tabular data in Python.