Pandas is one of the most popular Python libraries for data analysis and manipulation. The df.describe() method provides a quick way to generate descriptive statistics for the numeric columns of a DataFrame. It outputs the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column.
Understanding how to use df.describe() is an essential skill for data scientists, analysts, and Python developers working with data. This guide explains what df.describe() does, how to use it correctly, and how to interpret the output statistics. Examples and sample code snippets are provided to help you use df.describe() for exploratory data analysis.
What is df.describe()?
The df.describe() method generates a high-level summary of statistics for a DataFrame. According to the Pandas API Reference:
“Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.”
It calculates the count, mean, standard deviation, minimum, quartiles, and maximum for the numeric columns of a DataFrame, or for a numeric Series. By default, only numeric dtypes such as int and float are summarized; boolean, object, and categorical columns are left out whenever at least one numeric column is present.
The output is a new DataFrame containing the summary statistics for each numeric column. Non-numeric columns are ignored unless requested through the include parameter.
Let’s look at a simple example:
import pandas as pd
data = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': ['a', 'b', 'c']
})
print(data.describe())
         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0
Here we can see the summary statistics generated for the numeric columns ‘A’ and ‘B’. The non-numeric column ‘C’ was excluded.
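To make the default column selection concrete, here is a small sketch with a mixed-dtype DataFrame. The column names (ints, floats, flags, text) are made up for illustration. It compares the columns summarized by default with those summarized when include='all' is passed.

```python
import pandas as pd

# Hypothetical frame with one column per common dtype.
mixed = pd.DataFrame({
    "ints": [1, 2, 3],
    "floats": [0.5, 1.5, 2.5],
    "flags": [True, False, True],
    "text": ["a", "b", "c"],
})

# Default: only the numeric columns are summarized.
default_cols = list(mixed.describe().columns)

# include="all": every column is summarized, whatever its dtype.
all_cols = list(mixed.describe(include="all").columns)

print(default_cols)
print(all_cols)
```

The boolean and text columns only appear in the second call, which is worth remembering when a column seems to be missing from the summary.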
When to Use df.describe()
df.describe() is commonly used for:
- Quick exploratory data analysis - summarize numeric columns to understand distributions
- Identify outliers and anomalies - min/max values highlight outliers
- Compare central tendency/dispersion - mean and standard deviation measures
- Understand impact of cleaning/preprocessing - compare before and after statistics
- Feature engineering and selection - spot constant, near-constant, or badly scaled numeric columns
- Model training and evaluation - summarize key input variables
It provides a high-level overview of the numeric data that can guide deeper analysis.
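As one illustration of the before/after comparison mentioned in the list, the sketch below places two describe() summaries side by side. The age column, the value 240, and the cutoff of 120 are all hypothetical and chosen only to make the effect visible.

```python
import pandas as pd

# Hypothetical raw data: 240 stands in for a data-entry error.
raw = pd.DataFrame({"age": [25, 31, 29, 240, 35]})

# A simple cleaning rule (an assumption for this sketch): drop ages above 120.
cleaned = raw[raw["age"] <= 120]

# Put both summaries next to each other for comparison.
comparison = pd.concat(
    [raw["age"].describe(), cleaned["age"].describe()],
    axis=1,
    keys=["before", "after"],
)
print(comparison)
```

The drop in count, mean, std, and max between the two columns shows at a glance what the cleaning step changed.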
How to Use df.describe()
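The basic syntax for df.describe() is:
df.describe(percentiles=None, include=None, exclude=None)
- df - the Pandas DataFrame to describe
- percentiles - optional list of percentiles (values between 0 and 1) to report. The default is [0.25, 0.5, 0.75].
- include - optional list of data types to include, or 'all'. Ex: include=['float', 'int']
- exclude - optional list of data types to exclude. Ex: exclude=['object']
- datetime_is_numeric - pandas 1.1 through 1.5 only: whether to treat datetime columns as numeric (default False). The parameter was removed in pandas 2.0, where datetime columns are always summarized as numeric.
By default, it will describe all numeric columns.
To generate statistics on non-numeric columns such as object columns, you need to specify include. For datetime columns, pandas 1.x needs datetime_is_numeric=True to produce numeric-style statistics, while pandas 2.0 and later produce them without an extra argument.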
Examples
Describe all numeric columns:
df.describe()
Describe only float columns:
df.describe(include=['float'])
Describe all except object columns:
df.describe(exclude=['object'])
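Treat datetimes as numeric (pandas 1.1 to 1.5 only; the argument was removed in pandas 2.0, where this behavior is the default):
df.describe(datetime_is_numeric=True)
The output DataFrame has one row per statistic and one column per described column. You can transpose it with df.describe().T to get one row per column instead.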
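describe() also accepts a percentiles argument for requesting percentiles other than the default quartiles. The sketch below combines it with the transpose. The DataFrame is a toy example, and 0.5 is passed explicitly so the median is present regardless of the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Ask for the 10th, 50th, and 90th percentiles instead of the quartiles.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])

# Transpose so that each original column becomes a row.
summary_t = summary.T

print(summary)
print(summary_t)
```

The transposed layout tends to be easier to read when a DataFrame has many columns, since each column's statistics sit on a single row.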
df.describe() Output Statistics
The df.describe() output provides the following summary statistics by column:
- count - number of non-NaN observations
- mean - average value
- std - sample standard deviation (dispersion around the mean)
- min - minimum value
- 25% - first quartile
- 50% - second quartile (median)
- 75% - third quartile
- max - maximum value
These measures allow you to understand the distribution, central tendency, and spread of numeric columns.
For example, columns with a small standard deviation have data clustered close to the mean, while a large standard deviation indicates dispersed data. The median gives the midpoint value, and the quartiles mark the values below which 25%, 50%, and 75% of the data fall.
The min and max can reveal outliers when they sit far from the quartiles. Comparing the mean with the median gives a first hint about symmetry, although these statistics alone cannot establish that the data is normally distributed.
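One common way to turn these statistics into an outlier check is the 1.5 x IQR rule, sketched below. The rule is a widely used convention, not something describe() applies itself, and the sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

stats = s.describe()

# Interquartile range and the conventional 1.5 * IQR fences.
iqr = stats["75%"] - stats["25%"]
lower = stats["25%"] - 1.5 * iqr
upper = stats["75%"] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

Flagged values are candidates for inspection rather than automatic removal; whether 95 is an error or a genuine extreme depends on the data.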
Handling Null Values
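By default, df.describe() excludes missing values (NaN) from its calculations, so the statistics describe only the non-missing data.
There is no option to include null values in the calculations. A call that is sometimes mistaken for one is:
df.describe(include='all')
This summarizes columns of every dtype; it does not add missing values to the statistics. The result contains NaN wherever a statistic does not apply to a column's type (for example, mean for an object column or top for a numeric column), and count still reports only the non-null entries in each column.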
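Because count only covers non-null entries, the number of missing values has to be derived separately. The sketch below shows two equivalent ways to do that on a small made-up DataFrame (the column names x and label are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "label": ["a", "b", None],
})

# Summary across all dtypes; count ignores the missing entries.
summary = df.describe(include="all")

# Missing values per column: total rows minus the non-null count...
missing = len(df) - df.count()

# ...or counted directly.
missing_alt = df.isna().sum()

print(summary)
print(missing)
```

Comparing count against len(df) is often the quickest completeness check when exploring a new dataset.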
Interpreting df.describe() Output
When exploring a new dataset, df.describe() provides insights into:
- Data types - which columns are numeric vs non-numeric
- Completeness - missing values if count is less than total rows
- Central tendency - mean and median represent “average” value
- Variability - range from min to max shows spread; standard deviation measures dispersion
- Outliers - min and max identify potential anomalies or errors
- Distribution - symmetry, normality, and shape
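For example, a mean that sits far from the median, pulled toward the min or max, suggests outliers or a skewed distribution. Standard deviation gives a sense of clustering: a small value means the data is tight around the mean, a large value means it is more dispersed.
Quartiles show whether the data is balanced or skewed between the lower and upper bounds. Together these metrics give a rough picture of the distribution's shape, though a formal normality check needs a plot or a statistical test.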
We can generate plots like histograms to visualize the distribution as well. Identifying anomalies or heavily skewed data may prompt data cleaning or transformation before analysis.
Example Analysis
Let’s use a DataFrame of basic numeric data to demonstrate interpreting df.describe():
data = {
'Normal': [1, 2, 3, 4, 5],
'Uniform': [1, 1, 1, 1, 1],
'Skewed': [1, 1, 1, 1, 100]
}
df = pd.DataFrame(data)
print(df.describe())
         Normal  Uniform      Skewed
count  5.000000      5.0    5.000000
mean   3.000000      1.0   20.800000
std    1.581139      0.0   44.274146
min    1.000000      1.0    1.000000
25%    2.000000      1.0    1.000000
50%    3.000000      1.0    1.000000
75%    4.000000      1.0    1.000000
max    5.000000      1.0  100.000000
- Normal has a mean equal to its median, a small standard deviation, and a min and max symmetric around the mean, which is consistent with a symmetric distribution (five points are too few to confirm normality).
- Uniform has identical values, so min/max/mean/median are all 1. Standard deviation is 0, showing no variance.
- Skewed has a median and quartiles of 1, but its high mean and max indicate a right-skewed distribution caused by a single outlier.
This shows how the summary statistics reflect shape and skew. We can make informed decisions about preprocessing, anomalies, and modeling based on these insights.
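The comparison between mean and median can be made explicit. The sketch below rebuilds the same DataFrame and computes the gap between mean and median in units of standard deviation, alongside pandas' own skew() method. The gap measure is an informal heuristic chosen for this illustration, not a pandas feature.

```python
import pandas as pd

df = pd.DataFrame({
    "Normal": [1, 2, 3, 4, 5],
    "Uniform": [1, 1, 1, 1, 1],
    "Skewed": [1, 1, 1, 1, 100],
})

summary = df.describe()

# Informal heuristic: how far the mean sits from the median, in std units.
# A constant column has std 0, so its gap is NaN (0 divided by 0).
gap = (summary.loc["mean"] - summary.loc["50%"]) / summary.loc["std"]

# Sample skewness as computed by pandas, for comparison.
skewness = df.skew()

print(gap)
print(skewness)
```

A gap near zero, as for Normal, points to symmetry, while the clearly positive gap and skewness for Skewed reflect the single large value pulling the mean upward.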
df.describe() for Categorical Columns
df.describe() focuses on numeric columns by default. For statistics on categorical or object data, you can use:
df.describe(include=['object'])
This will output the count, the number of unique values, the most frequent value (top), and the frequency of that value (freq) for each object column.
For example:
data = {'Category': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)
print(df.describe(include=['object']))
       Category
count         5
unique        3
top           A
freq          2
This allows basic EDA on categorical features like cardinality, modal category, and frequency distribution.
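describe() reports only the single most frequent category. For the full frequency distribution, value_counts() is a natural companion, as in this sketch that reuses the Category example.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B", "C", "A", "B"]})

# Series.describe() gives count, unique, top, and freq for the column.
summary = df["Category"].describe()

# Full frequency table, as absolute counts and as shares of the total.
counts = df["Category"].value_counts()
share = df["Category"].value_counts(normalize=True)

print(summary)
print(counts)
print(share)
```

Here A and B are tied with two occurrences each, which describe() cannot show because it reports only one top value; the frequency table makes the tie visible.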
The Pandas Profiling library (since renamed ydata-profiling) generates more detailed statistics and plots for deeper analysis of object, numeric, and datetime data.
Real World Example - Bike Sharing Dataset
Let’s walk through a real-world example using a public bike sharing dataset from Kaggle. The data has features like temperature, humidity, count of rental bikes, etc.
We’ll import it into a DataFrame and use df.describe() to explore:
import pandas as pd
bikes = pd.read_csv('bike_sharing.csv')
print(bikes.describe())
Output:
          timestamp    count  humidity  temperature
count  6.547000e+03   6547.0  6547.000     6547.000
mean   1.538513e+13  191.273     0.625        0.490
std    8.398935e+09  151.314     0.186        0.194
min    1.536912e+13      1.0     0.150       -0.100
25%    1.538169e+13     36.0     0.500        0.330
50%    1.538506e+13    146.0     0.600        0.500
75%    1.538874e+13    284.0     0.700        0.670
max    1.539133e+13    977.0     1.000        1.000
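The timestamp column was treated as numeric because it is stored as a number, not as a datetime. We can observe:
- count is 6547 for every column, so none of them has missing values, assuming the file has 6547 rows
- Mean bike rental count is about 191
- Average humidity is 0.625 and average temperature is 0.49
- Min temperature is -0.1 and min bike count is 1; the roughly 0-to-1 range suggests temperature is scaled and not recorded in degrees Celsius
- Max bike rentals spike to 977
- The standard deviation of 151 for count is large relative to the mean of 191, indicating wide variation in rentals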
This quick overview helps guide deeper analysis on the influence of weather on rental patterns. We may also want to preprocess timestamp and handle outliers before modeling the data.
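Preprocessing the timestamp could look like the sketch below. It uses a small stand-in DataFrame, not the real file, and it assumes the timestamps are Unix epoch seconds; the actual dataset's column names and timestamp unit may differ and should be checked first.

```python
import pandas as pd

# Hypothetical stand-in for the bike sharing data (not the real file).
bikes = pd.DataFrame({
    "timestamp": [1536912000, 1536915600, 1536919200, 1536922800],
    "count": [12, 30, 7, 55],
    "temperature": [0.30, 0.35, 0.33, 0.40],
})

# Assumption: timestamps are epoch seconds. Adjust unit= for the real data.
bikes["timestamp"] = pd.to_datetime(bikes["timestamp"], unit="s")

# The time span covered is more informative than a mean of raw epoch numbers.
span = bikes["timestamp"].max() - bikes["timestamp"].min()

# Summarize only the measurement columns.
numeric_summary = bikes.drop(columns="timestamp").describe()

print(span)
print(numeric_summary)
```

Separating the timestamp from the measurements keeps the summary table readable and avoids statistics such as the mean of raw epoch values, which are hard to interpret.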
Summary
Pandas df.describe() is an invaluable tool for quick exploratory data analysis on numeric columns. It outputs a statistical summary covering central tendency, dispersion, and the range of each column, which in turn hints at outliers and distribution shape.
Mastering df.describe() helps you quickly build intuition for cleaning, preprocessing, feature engineering, and modeling workflows in data science and analytics.
The key points to remember are:
- Generates statistics like mean, standard deviation, quartiles, min/max for numeric data
- Useful for initial EDA to understand distributions
- Identify anomalies and compare before/after cleaning or transformations
- Analyze numeric patterns in features for engineering and modeling
- Control inclusion/exclusion using options like include and exclude
- Interpret output to assess shape, spread, and skew
With a solid grasp of applying df.describe() and analyzing the output, you can extract powerful insights from your data using native Pandas functionality.