Data frames are one of the most important data structures for data manipulation and analysis in Python. As two-dimensional labeled data structures with columns of potentially different types, data frames are flexible, powerful, and widely used for working with tabular data from sources like CSV files, SQL databases, Excel spreadsheets, and JSON.
In this in-depth guide, we will cover everything you need to know about data frames in Python, including:
Table of Contents
- What Are Data Frames and Why They Matter
- Creating Data Frames from Various Sources
- Indexing, Slicing, and Selecting Data
- Adding, Removing, and Modifying Columns
- Data Frame Operations and Transformations
- Sorting and Filtering Rows
- Applying Functions to Data Frames
- Merging, Joining, and Concatenating Data Frames
- Handling Missing Data
- Exporting Data Frames
- Data Frame Best Practices and Tips
- Visualizing Data Frames
- Conclusion
What Are Data Frames and Why They Matter
A data frame is a two-dimensional, tabular data structure with labeled columns and rows. It is analogous to a spreadsheet, SQL table, or R data frame. The Pandas library provides the `DataFrame` class in Python for working with tabular data in rows and columns.
Some key properties of Pandas data frames:
- Each column is a Series object; different columns can hold different data types, like integers, strings, floating point values, etc.
- The column labels are typically strings representing the column names.
- The row index identifies each row, either by default integer position or by custom labels.
- Data frames are mutable and can be modified after creation.
- Columns can be retrieved, added, and deleted easily.
- Built-in methods and syntax make data frames intuitive to work with for data analysis.
Data frames are a fundamental Python data structure due to their versatility and convenience for many real-world data manipulation use cases. Key advantages include:
- Flexible mixed data types - Store different data in columns.
- Labeled axes - Meaningful row and column identification.
- Efficient operations - Optimized for speed.
- Convenience - Intuitive syntax for data tasks.
- Integrated handling of missing data.
- Interoperability - Work with NumPy, SciPy, Matplotlib, Scikit-Learn, etc.
Let’s look at a simple data frame example:
import pandas as pd

data = {'Name': ['John', 'Mary', 'Mike', 'Sarah'],
        'Age': [25, 32, 35, 28],
        'City': ['New York', 'Chicago', 'Seattle', 'Boston']}

df = pd.DataFrame(data)
print(df)

    Name  Age      City
0   John   25  New York
1   Mary   32   Chicago
2   Mike   35   Seattle
3  Sarah   28    Boston
This demonstrates constructing a data frame from a Python dictionary, with column labels taken from the dictionary keys and an integer index auto-created.
Data frames are used ubiquitously for ETL, data cleaning, visualization, machine learning, statistical analysis, and more. Mastering data frames is crucial for any Python developer working with data.
Creating Data Frames from Various Sources
There are many ways to construct pandas data frames from different data sources or Python structures. Let’s go through examples of the most common ones:
From a Python Dictionary
As shown above, pass a dictionary to the `DataFrame()` constructor, where keys become column names and values become column data. The values can be lists, NumPy arrays, or Pandas Series objects.
data = {'Product': ['Widget', 'Gadget'],
        'Price': [9.99, 14.50]}
df = pd.DataFrame(data)
From List of Dicts
For row-oriented data, provide a list of dictionaries with matching keys; each dictionary becomes one row of the data frame:
data = [{'Product': 'Widget', 'Price': 9.99},
        {'Product': 'Gadget', 'Price': 14.50}]
df = pd.DataFrame(data)
From NumPy 2D Arrays
Pass a 2D NumPy array, where each row becomes a data frame row. Column names are auto-generated as integers:
import numpy as np
array = np.array([[5, 9, 3], [2, 4, 7]])
df = pd.DataFrame(array)
Specify custom column names with the `columns` parameter:
df = pd.DataFrame(array, columns=['Col1', 'Col2', 'Col3'])
From CSV Files
Import CSV data into a data frame using `pd.read_csv()`:
df = pd.read_csv('data.csv')
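`pd.read_csv()` accepts many parameters to control parsing. Here is a minimal sketch, assuming a hypothetical sales.csv file with Date and Price columns:

# 'sales.csv' and its column names are assumed for illustration
df = pd.read_csv('sales.csv',
                 sep=',',                    # Column delimiter
                 usecols=['Date', 'Price'],  # Load only these columns
                 parse_dates=['Date'],       # Parse Date values as datetimes
                 dtype={'Price': float})     # Force the Price dtype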
From Excel Spreadsheets
Use `pd.read_excel()` to load Excel data into a data frame:
df = pd.read_excel('data.xlsx')
From SQL Databases
Load a data frame from a SQL table with `pd.read_sql_query()` or `pd.read_sql_table()`. Note that the `sqlalchemy` library needs to be installed, and you should replace the connection string with one relevant to your database:
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@localhost:5432/mydb')
df = pd.read_sql_query('SELECT * FROM mytable', engine)
These are some common ways to construct new data frames. Pandas provides great flexibility for ingesting data from multiple sources into easy-to-use data frame structures.
Indexing, Slicing, and Selecting Data
We often need to extract specific rows, columns, or cells from a data frame. Pandas provides a variety of indexing and slicing methods to make this convenient:
Column Selection
Select one column as a Series using its label:
ages = df['Age']
Select multiple columns by passing a list of column names. This returns a DataFrame containing the specified columns:
subset = df[['Age', 'City']] # Returns a DataFrame with just the 'Age' and 'City' columns
Row Selection
Select rows by index position (first 5 rows):
df[0:5]
Or slice by row labels with `.loc`. Label-based slicing requires the labels to be in the index, so first set a suitable index:

named = df.set_index('Name')
named.loc['Mary':'Mike']
Select Scalar Value
Select a specific scalar value using `.at[]` by row and column:
df.at[1, 'Age']
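For purely positional access, `.iat[]` works the same way with integer positions:

df.iat[1, 1]  # Second row, second column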
Boolean Indexing
Filter rows based on boolean selection criteria. This returns a DataFrame containing the rows where the condition is true:
df[df['Age'] > 30] # Returns rows where Age > 30
This flexibility allows easy exploration and analysis of data frames.
Adding, Removing, and Modifying Columns
Let’s look at how to mutate data frames by adding, removing, and modifying columns.
Add New Column
Create a new column by assigning a list of values to a new column name (the list length must match the number of rows):
df['Sales'] = [350, 600, 900]
Or perform operations between existing columns to generate a new one:
df['Sales'] = df['Price'] * 2 # New 'Sales' column based on 'Price'
Also use `df.insert()` to add a column at a specific position:
df.insert(1, 'Revenue', [350, 600, 900])
Remove Column
Delete a column using `pop()` or `del`:
df.pop('Sales')
del df['Revenue']
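A common alternative is `df.drop()`, which returns a new data frame instead of mutating in place (a small sketch; the column names are the ones from the examples above):

# Returns a copy without the listed columns
df = df.drop(columns=['Sales', 'Revenue'])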
Modify Column Values
Change column values by selection and assignment:
df['Price'] = df['Price'] * 0.9 # Discount 10%
df.loc[:, 'In Stock'] = True # Add stock column
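Assignment can also be conditional, updating only the rows that match a boolean mask (a sketch assuming a Price column):

# Cap any price above 100 at exactly 100
df.loc[df['Price'] > 100, 'Price'] = 100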
These operations provide an easy way to shape and wrangle data frames.
Data Frame Operations and Transformations
Pandas implements many methods familiar to Excel users for operating on data frames:
Sorting
Sort a data frame by one or more columns:
df = df.sort_values('Price') # Sort values by Price
df = df.sort_values(['Age', 'Name']) # Multi-column sort
Sort by multiple columns in descending order:
df = df.sort_values(['Age', 'Name'], ascending=[False, False])
Aggregation
Compute aggregations like sum, mean, and median over a column:
df['Price'].sum()
df['Price'].mean()
Perform aggregation by group using `groupby()`:
df.groupby('City')['Price'].count()
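Several statistics can be computed in one pass with `agg()` (a sketch using the column names from earlier examples):

# Count, mean, and max of Price per city
df.groupby('City')['Price'].agg(['count', 'mean', 'max'])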
Apply
Apply a custom Python function to a column:

def discount(price):
    return price * 0.85

df['Discounted Price'] = df['Price'].apply(discount)
Filter
Filter rows based on criteria:
df[df['Price'] > 10]
Transform
Perform data transformations like normalization:
from sklearn.preprocessing import StandardScaler
df['Normalized Price'] = StandardScaler().fit_transform(df[['Price']])
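If you would rather avoid the scikit-learn dependency, the same z-score normalization can be sketched in plain pandas (note that pandas' std() uses ddof=1 by default while StandardScaler uses ddof=0, so results differ slightly):

# Standardize manually: subtract the mean, divide by the standard deviation
df['Normalized Price'] = (df['Price'] - df['Price'].mean()) / df['Price'].std()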
These built-in vectorized operations run quickly without explicit Python loops.
Sorting and Filtering Rows
Sorting and filtering data frames are common tasks that Pandas makes easy.
Sorting Rows
Sort the data frame by one or more columns in ascending or descending order:
# Sort by a single column
df = df.sort_values('Price', ascending=False)
# Sort by multiple columns
df = df.sort_values(['Age', 'Score'], ascending=[True, False])
Filtering Rows
Filter rows by boolean indexing based on a condition:
# Filter for Price > 10
df[df['Price'] > 10]
# Filter by multiple criteria
df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
More complex filtering can be done using the `query()` method, which uses a SQL-like query syntax:
df.query('Price >= 10 & City == "New York"')
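Inside `query()`, local Python variables can be referenced with the `@` prefix:

min_price = 10
df.query('Price >= @min_price')  # @ pulls in the local variable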
Filtering and sorting enable easy data exploration and analysis.
Applying Functions to Data Frames
When we need to perform a custom transformation, calculation, or data manipulation across the data frame, we can apply Python functions:
Apply Element-wise
Use `apply()` on a Series to apply a function to each element of that column:

def convert_currency(val):
    return '$' + str(val)

df['Price'] = df['Price'].apply(convert_currency)
Apply Row or Column-wise
Apply a function to each row or column using `apply()`:
# Sum each row
df.apply(lambda row: row.sum(), axis=1)
# Find max of each column
df.apply(lambda col: col.max())
Apply with GroupBy
Apply functions by groups after splitting using `groupby()`:
df.groupby('City').apply(lambda x: x['Price'].max())
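A related tool is `groupby().transform()`, which broadcasts the group-level result back to the original row shape, useful for derived columns (a sketch; the column names are assumed):

# Each row gets its city's average price
df['City Avg Price'] = df.groupby('City')['Price'].transform('mean')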
Custom Transformations
Define custom functions for data cleaning and transformations:
def calculate_total(row):
    return row['Price'] * row['Quantity']

df['Total'] = df.apply(calculate_total, axis=1)
Applying functions provides flexibility to compute on data frames.
Merging, Joining, and Concatenating Data Frames
Combining data from different data frames is a common need. The main options for achieving this are:
Concatenation
Concatenate or stack data frames vertically by row:
df = pd.concat([df1, df2], axis=0)
Pass keys to label the rows from each source frame, which creates a hierarchical index:
df = pd.concat([df1, df2], axis=0, keys=['X', 'Y'])
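With keys set, each source frame's rows can be recovered by label from the resulting hierarchical index. A self-contained sketch with two assumed frames:

df1 = pd.DataFrame({'Price': [9.99, 14.50]})
df2 = pd.DataFrame({'Price': [5.25, 7.75]})

combined = pd.concat([df1, df2], keys=['X', 'Y'])
print(combined.loc['X'])  # Rows that came from df1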
Joins
Join or merge data frames horizontally by column:
df = df1.join(df2, lsuffix='_left', rsuffix='_right')
Perform SQL-style joins - inner, outer, left, right, etc:
df = df1.merge(df2, on='CustomerID', how='inner')
`join()` aligns on the index by default, so to join on a shared key column, set it as the index on both frames first:

df1.set_index('CustomerID').join(df2.set_index('CustomerID'))
Appending Rows
Append the rows of one data frame to another. `DataFrame.append()` was deprecated and removed in pandas 2.0, so use `pd.concat()` instead:

df = pd.concat([df, new_data], ignore_index=True)
These provide versatile options for combining data frames.
Handling Missing Data
Real-world data often has missing values that need to be managed. Pandas provides methods to handle missing data effectively:
Finding Null Values
Find missing values using `isnull()` and `notnull()`:
df[df['Age'].isnull()] # Rows where Age is NA
df['Age'].notnull() # True for non-NA values
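For a quick overview, chain `sum()` to count the missing values in each column:

df.isnull().sum()  # NA count per column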
Dropping Missing Data
Drop rows or columns with missing values. Note the `inplace=True` parameter, which modifies the data frame in place:
# Drop rows with any NA values
df.dropna(inplace=True)
# Drop columns with NA values
df.dropna(axis=1, inplace=True)
Filling Missing Data
Replace missing values with a scalar value, forward fill, or interpolation:
# Fill NA with 0
df.fillna(0, inplace=True)
# Forward fill (fillna(method='ffill') is deprecated; use ffill())
df.ffill(inplace=True)
# Interpolate
df.interpolate(inplace=True)
Use `fillna()` to fill based on the mean, median, mode, etc. Assign the result back to the column; calling it with `inplace=True` on a selected column can trigger chained-assignment warnings in newer pandas:

df['Age'] = df['Age'].fillna(df['Age'].mean())
Careful handling of missing data prevents errors and biases.
Exporting Data Frames
Once we have performed analysis and transformation, we can export the modified data frames:
Export to CSV
Write a data frame to a CSV file:
df.to_csv('data.csv', index=False)
Export to Excel
Export to an Excel XLSX file:
df.to_excel('data.xlsx', sheet_name='Sheet1')
Export to SQL
Write to a table in a SQL database:
df.to_sql('my_table', engine, index=False)
Export to Other Formats
Export data frames to other common formats like JSON or Parquet:
df.to_json('data.json', orient='records') # Export to JSON
df.to_parquet('data.parquet') # Export to Parquet
This enables sharing data frames with non-Python systems.
Data Frame Best Practices and Tips
Here are some tips for working effectively with pandas data frames:
- Use meaningful column names instead of integers.
- Convert data to appropriate data types - strings, numeric, timestamps, etc.
- Set the index to a meaningful column like customer_id if available.
- Use `inplace=True` to mutate data frames and avoid copies.
- Avoid chained indexing like `df['col'][0]`, as it can lead to unexpected results.
- Use `iloc` and `loc` for positional vs. label-based indexing.
- Manage large data frames in chunks for memory efficiency (see the sketch after this list).
- Prefer vectorized operations over row-by-row looping constructs like `apply()`.
- Become familiar with useful built-in methods like `groupby()`, `join()`, and `pivot()`.
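For the chunking tip above, `pd.read_csv()` accepts a `chunksize` parameter that yields the file in pieces rather than loading it all at once. A minimal sketch, assuming a hypothetical large_data.csv with a Price column:

total = 0
# Process 100,000 rows at a time to keep memory usage bounded
for chunk in pd.read_csv('large_data.csv', chunksize=100_000):
    total += chunk['Price'].sum()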
Visualizing Data Frames
Pandas integrates nicely with data visualization libraries like Matplotlib and Seaborn to create insightful plots from data frames.
Some examples:
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot
plt.scatter(df['Age'], df['Income'])

# Histogram
df['Price'].plot.hist()

# Boxplot
df.boxplot(column='Salary', by='Dept')

# Heatmap of correlations (numeric_only avoids errors on text columns)
sns.heatmap(df.corr(numeric_only=True), annot=True)
Visualization makes patterns, trends, and relationships more apparent. As the saying goes, “a picture is worth a thousand words.”
Conclusion
In this comprehensive guide, we covered the key aspects of using data frames in Python for effective data manipulation and analysis. Data frames are an essential component of the PyData stack. Learning how to create, transform, combine, clean, analyze, and visualize data frames will enable you to be more productive working with real-world datasets in Python.
The built-in methods and vectorized operations provided by Pandas data frames make many data tasks intuitive and fast. With this in-depth knowledge, you will be prepared to leverage data frames for your own data science, data engineering, and analytics applications in Python.
Some next steps to build on these data frame skills:
- Learn more advanced indexing and selection techniques
- Study multi-indexing and hierarchical data frames
- Apply operations and analysis across multiple related data frames
- Integrate and enrich data frames with APIs, databases, and other sources
- Use data frames for building machine learning features and pipelines
- Visualize insights and create interactive reports for stakeholders
There is always more to learn about maximizing the power of data frames in Python!