Skip to content

Mastering Data Frame Operations in Python: Concatenation, GroupBy, and More

Published: at 02:47 AM

When working with data in Python, it is common to have multiple related data frames that you need to apply operations and analysis across. For example, you may have separate data frames for different years or categories that you want to combine and compare. Applying functions and transformations across multiple data frames requires some careful handling, but Python provides several powerful tools to make this process efficient and straightforward.

In this comprehensive guide, we will explore methods for applying operations and analysis across multiple related data frames using Python and its key data analysis libraries like Pandas, NumPy, and SciPy.

Table of Contents

Open Table of Contents

Overview

A data frame is a two-dimensional tabular data structure with labeled columns and rows, similar to a spreadsheet. The Pandas library in Python provides excellent capabilities for working with data frames for data cleaning, preprocessing, and analysis.

When working with multiple related data frames, such as those split by year or category, we need to apply the same operations and analysis across all the frames in a scalable way while handling indices and labels appropriately.

Key topics we will cover include:

Understanding these techniques will allow you to efficiently apply analysis across large, complex datasets with many interrelated data frames in Python.

Importing Key Libraries

We will utilize the following core Python libraries for working with multiple data frames:

# Import Pandas for data frame operations
import pandas as pd

# Import NumPy for numerical operations on array data
import numpy as np

# Import SciPy for statistical tests and analyses
from scipy import stats

Pandas provides the DataFrame object and associated methods while NumPy enables array-based computations. SciPy contains statistical tests we can apply across data frames.

Concatenating Data Frames

Concatenation joins multiple data frames together into a larger combined data frame. This is useful when you have data split across multiple frames and need to bring it together for analysis.

It’s important to note that concatenation does not modify the original data frames. Instead, it returns a new concatenated data frame.

Pandas provides two main methods for concatenating data frames:

df_concat = pd.concat([df1, df2])
df_concat = df1.append(df2)

Both achieve the same result of concatenating df1 and df2.

By default they append rows, but you can specify axis=1 to concatenate by columns instead. You can pass any number of data frames in a list to concatenate together.

Set the ignore_index=True parameter to avoid duplicate indices when concatenating. You can also add keys to identify the source data frames:

df_concat = pd.concat([df1, df2], keys=['x', 'y'])

This labels each data frame with an identifying key for reference.

Applying Functions to Multiple Data Frames

To apply the same operation to multiple data frames, such as filtering or transforming values, you can use a for loop to iterate over each data frame.

It’s important to note that the operations within the loop do not modify the original data frames. To keep the results, you need to store the filtered data frames.

# Import Pandas for data frame operations
import pandas as pd

# Create sample data frames
data1 = {'Name': ['Alice', 'Bob', 'Charlie'],
         'Age': [25, 30, 35]}
data2 = {'Name': ['David', 'Eve', 'Frank'],
         'Age': [22, 28, 40]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# List of original data frames
df_list = [df1, df2]

# Empty list to hold filtered data frames
df_filtered = []

# Apply filtering to each data frame
for df in df_list:
    filtered_df = df[df['Age'] > 30]  # Example filtering condition
    df_filtered.append(filtered_df)

# Display filtered data frames
for i, df in enumerate(df_filtered):
    print(f"Filtered Data Frame {i + 1}:")
    print(df)

This prints:

Filtered Data Frame 1:
      Name  Age
2  Charlie   35
Filtered Data Frame 2:
    Name  Age
2  Frank   40

This loops through and filters each data frame based on a criteria, storing the filtered versions in a new list.

More complex functions can be applied as well within the loop besides filtering.

Using GroupBy on Multi-Frames

Pandas GroupBy allows you to group data frames by a category column and apply operations within each group. This can be used across multiple data frames:

# Import Pandas for data frame operations
import pandas as pd

# Create sample data frames
data1 = {'Category': ['A', 'B', 'A', 'B'],
         'Value': [10, 15, 20, 25]}
data2 = {'Category': ['A', 'B', 'A', 'B'],
         'Value': [30, 35, 40, 45]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# List of data frames
dfs = [df1, df2]

# Group each data frame by 'Category' column
grouped = [df.groupby('Category') for df in dfs]

# Apply aggregation to each group
df_means = [group.mean() for group in grouped]

# Concatenate the group means into a single data frame
result_df = pd.concat(df_means, keys=['df1', 'df2'])

# Display the result data frame
print("Resulting Data Frame:")
print(result_df)

This prints:

Resulting Data Frame:
              Value
    Category
df1 A          15.0
    B          20.0
df2 A          35.0
    B          40.0

This groups each data frame by ‘category’, applies the .mean() aggregation to each group, and stores the results into df_means, which is a list containing a data frame of group means for each original data frame in dfs.

Grouping multiple data frames results in a hierarchical index. The index levels can be reset with df_means[0].reset_index() if needed.

To concatenate the group means into a single data frame, you can use pd.concat(df_means).

Any aggregate operations like sum(), count(), max() etc. can be applied across the grouped data frames.

Joining and Merging Data Frames

Joining or merging allows you to combine data from different data frames. This is helpful when data is split across multiple data frames but needs to be linked.

Pandas provides two main functions for joining:

The merge() function handles all types of joins:

# Import Pandas for data frame operations
import pandas as pd

# Create sample data frames
data1 = {'ID': [1, 2, 3],
         'Name': ['Alice', 'Bob', 'Charlie']}
data2 = {'ID': [2, 3, 4],
         'Age': [25, 30, 35]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Merge df1 and df2 on the 'ID' column
df_merged = pd.merge(df1, df2, on='ID')

# Left join keeps all rows of df1
df_left = pd.merge(df1, df2, on='ID', how='left')

# Right join keeps all rows of df2
df_right = pd.merge(df1, df2, on='ID', how='right')

# Outer join keeps all rows of both
df_outer = pd.merge(df1, df2, on='ID', how='outer')

# Join on index
df1.set_index('ID', inplace=True)
df2.set_index('ID', inplace=True)
df_idx = df1.join(df2, how='inner')

# Display the merged and joined data frames
print("Merged Data Frame:")
print(df_merged)

print("\nLeft Join Data Frame:")
print(df_left)

print("\nRight Join Data Frame:")
print(df_right)

print("\nOuter Join Data Frame:")
print(df_outer)

print("\nJoined Data Frame on Index:")
print(df_idx)

This prints:

Merged Data Frame:
   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30

Left Join Data Frame:
   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0

Right Join Data Frame:
   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30
2   4      NaN   35

Outer Join Data Frame:
   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0
3   4      NaN  35.0

Joined Data Frame on Index:
       Name  Age
ID
2       Bob   25
3   Charlie   30

For index-only joins, use df1.join(df2):

# Join on index
df1.join(df2, how='inner')

You can join multiple data frames by passing a list:

df_joined = df1.join([df2, df3, df4], how='inner')

This performs an inner join across all the data frames.

Combining Data Frame Statistics

When working with multiple data frames, you may need to combine or compare summary statistics across the frames.

For example, calculating descriptive stats for each frame and aggregating together:

# Import Pandas for data frame operations
import pandas as pd

# Create sample data frames
data1 = {'ID': [1, 2, 3, 4],
         'Value1': [10, 15, 20, 25]}
data2 = {'ID': [1, 2, 3, 4],
         'Value2': [30, 35, 40, 45]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# List of data frames
df_list = [df1, df2]

# Empty list to hold summary statistics data frames
stats_list = []

# Calculate summary stats for each data frame
for df in df_list:
    df_stats = df.describe()
    stats_list.append(df_stats)

# Concatenate stats horizontally into one data frame
stats_df = pd.concat(stats_list, axis=1)

# Display the combined summary statistics data frame
print("Combined Summary Statistics Data Frame:")
print(stats_df)

This prints:

Combined Summary Statistics Data Frame:
             ID     Value1        ID     Value2
count  4.000000   4.000000  4.000000   4.000000
mean   2.500000  17.500000  2.500000  37.500000
std    1.290994   6.454972  1.290994   6.454972
min    1.000000  10.000000  1.000000  30.000000
25%    1.750000  13.750000  1.750000  33.750000
50%    2.500000  17.500000  2.500000  37.500000
75%    3.250000  21.250000  3.250000  41.250000
max    4.000000  25.000000  4.000000  45.000000

This calculates the describe() statistics for each data frame and concatenates the results horizontally into stats_df, with columns corresponding to each original data frame.

You can also use aggregation functions like pandas.concat() and pandas.merge() to combine group statistics:

# Group by category
grouped = df.groupby('category')

# Compute mean of each group
means = grouped.mean()

# Compute sum of each group
sums = grouped.sum()

# Merge the two DFs
stats = means.merge(sums, on='category')

This aggregates the data into separate means and sums data frames then merges them into a final stats data frame.

Applying Statistical Tests Across Data Frames

Python’s SciPy library provides statistical tests that can be applied across multiple data frames.

It’s essential to check test assumptions and ensure the data frames are compatible before applying statistical tests.

For example, performing a two sample t-test to compare values between two data frames:

# Import SciPy for statistical tests and analyses
from scipy import stats

# Example data frames df1 and df2
# Make sure to replace these with your actual data frames
df1 = pd.DataFrame({'values': [1.2, 2.5, 3.8, 4.1, 5.3]})
df2 = pd.DataFrame({'values': [2.0, 2.7, 3.6, 4.0, 5.1]})

# Extract columns from each DF
vec1 = df1['values']
vec2 = df2['values']

# Check vectors are equal length
assert len(vec1) == len(vec2)

# Perform a two-sample t-test assuming normality
t_stat, p_val = stats.ttest_ind(vec1, vec2)

# Print p-value
print("P-value:", p_val)

# Interpret the results
if p_val < 0.05:
    print("Reject the null hypothesis: There is a significant difference between the two data sets.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two data sets.")

This prints:

P-value: 0.9126844168523223
Fail to reject the null hypothesis: There is no significant difference between the two data sets.

This code snippet demonstrates how to perform a two-sample t-test between two data frames (df1 and df2) assuming that the data follows a normal distribution. It checks the equality of means between the two data sets and prints the p-value. You can replace df1 and df2 with your actual data frames and adapt the test as needed based on your specific requirements.

Tests like ANOVA, correlations, chi-square, and others can be applied across properly formatted data frames.

Case Study - Analyzing Store Sales Data Frames

Let’s walk through an applied example of using these techniques to analyze daily sales data frames from different store locations.

First, let’s import Pandas and create dummy sales data frames for store locations 1 to 4:

import pandas as pd

# Dummy sales data frames
sales_data = {
    'store_id': [1, 2, 3, 4],
    'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01'],
    'customer_id': [101, 102, 103, 104],
    'amount': [50.0, 75.0, 60.0, 90.0]
}

# Create sales data frames for each store
sales_dfs = []
for store_id in [1, 2, 3, 4]:
    df = pd.DataFrame(sales_data)
    df['store_id'] = store_id
    sales_dfs.append(df)

# Concatenate sales data frames into one
all_stores = pd.concat(sales_dfs, ignore_index=True)

We have created dummy sales data frames for each store location and concatenated them into a single data frame all_stores.

Next, let’s create a dummy customer demographics data frame:

# Dummy customer demographics data
customers_data = {
    'customer_id': [101, 102, 103, 104],
    'age': [30, 25, 35, 40],
    'gender': ['M', 'F', 'M', 'F']
}

# Create customer demographics data frame
customers_df = pd.DataFrame(customers_data)

Now, let’s merge the sales data with the customer demographics data:

# Merge sales data with customer demographics data
sales_with_cust = pd.merge(all_stores, customers_df, on='customer_id')

The sales_with_cust data frame now contains both sales data and customer demographics data merged on the ‘customer_id’ column.

Next, we can analyze sales by customer segment across all stores. For example, let’s calculate the total sales amount by gender:

# Analyze sales by customer segment across all stores
segment_stats = sales_with_cust.groupby('gender').agg({'amount': 'sum'})

# Display sales by customer segment
print("Sales by Customer Segment:")
print(segment_stats)

This code calculates the total sales amount for each gender and displays the results.

Complete Code Example:

# Import Pandas for data frame operations
import pandas as pd

# Dummy sales data frames
sales_data = {
    'store_id': [1, 2, 3, 4],
    'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01'],
    'customer_id': [101, 102, 103, 104],
    'amount': [50.0, 75.0, 60.0, 90.0]
}

# Create sales data frames for each store
sales_dfs = []
for store_id in [1, 2, 3, 4]:
    df = pd.DataFrame(sales_data)
    df['store_id'] = store_id
    sales_dfs.append(df)

# Concatenate sales data frames into one
all_stores = pd.concat(sales_dfs, ignore_index=True)

# Group the data by store location and calculate statistics
stats_by_store = all_stores.groupby('store_id').agg({'amount': ['sum', 'mean', 'max']})

# Display statistics by store
print("Statistics by Store:")
print(stats_by_store)

# Dummy customer demographics data
customers_data = {
    'customer_id': [101, 102, 103, 104],
    'age': [30, 25, 35, 40],
    'gender': ['M', 'F', 'M', 'F']
}

# Create customer demographics data frame
customers_df = pd.DataFrame(customers_data)

# Merge sales data with customer demographics data
sales_with_cust = pd.merge(all_stores, customers_df, on='customer_id')

# Display merged data
print("\nSales Data with Customer Demographics:")
print(sales_with_cust)

# Analyze sales by customer segment across all stores
segment_stats = sales_with_cust.groupby('gender').agg({'amount': 'sum'})

# Display sales by customer segment
print("\nSales by Customer Segment:")
print(segment_stats)

This prints:

Statistics by Store:
         amount
            sum   mean   max
store_id
1         275.0  68.75  90.0
2         275.0  68.75  90.0
3         275.0  68.75  90.0
4         275.0  68.75  90.0

Sales Data with Customer Demographics:
    store_id        date  customer_id  amount  age gender
0          1  2023-01-01          101    50.0   30      M
1          2  2023-01-01          101    50.0   30      M
2          3  2023-01-01          101    50.0   30      M
3          4  2023-01-01          101    50.0   30      M
4          1  2023-01-01          102    75.0   25      F
5          2  2023-01-01          102    75.0   25      F
6          3  2023-01-01          102    75.0   25      F
7          4  2023-01-01          102    75.0   25      F
8          1  2023-01-01          103    60.0   35      M
9          2  2023-01-01          103    60.0   35      M
10         3  2023-01-01          103    60.0   35      M
11         4  2023-01-01          103    60.0   35      M
12         1  2023-01-01          104    90.0   40      F
13         2  2023-01-01          104    90.0   40      F
14         3  2023-01-01          104    90.0   40      F
15         4  2023-01-01          104    90.0   40      F

Sales by Customer Segment:
        amount
gender
F        660.0
M        440.0

These examples demonstrate applying operations and analysis to the multiple related sales data frames using Python and Pandas.

Summary

In this comprehensive guide we explored practical techniques for applying transformations and analysis across multiple related data frames in Python, including:

Mastering these multi-data frame operations enables efficient wrangling of complex real-world data and unified analysis across disparate frames using Python’s versatile data analysis libraries.

The key takeaways are understanding the various methods to join, apply operations, and analyze at scale across collections of interrelated data frames. This unlocks the full potential of Python for data preparation, exploration and analytics.