Pandas is one of the most popular Python libraries for data analysis and manipulation. Its df.describe() method provides a quick way to generate descriptive statistics for the numeric columns of a DataFrame: the count, mean, standard deviation, minimum, quartiles, and maximum of each column.
Understanding how to use df.describe() is an essential skill for data scientists, analysts, and Python developers working with data. This guide explains what df.describe() does, how to use it correctly, and how to interpret the output statistics. Real-world examples and sample code snippets are provided to help you master df.describe() for exploratory data analysis.
What is df.describe()?
The df.describe() method generates a high-level statistical summary of a DataFrame. According to the Pandas API Reference:
“Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.”
It calculates the count, mean, standard deviation, minimum, quartiles, and maximum for the numeric columns of a DataFrame, or for a Series. By default, it covers numeric types such as int and float; boolean columns are not treated as numeric and are left out unless you request them.
The output is a new DataFrame containing the summary statistics for each numeric column. Non-numeric columns are ignored unless you request them with include.
Let’s look at a simple example:
import pandas as pd
data = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': ['a', 'b', 'c']
})
print(data.describe())
A B
count 3.00000 3.00000
mean 2.00000 5.00000
std 1.00000 1.00000
min 1.00000 4.00000
25% 1.50000 4.50000
50% 2.00000 5.00000
75% 2.50000 5.50000
max 3.00000 6.00000
Here we can see the summary statistics generated for the numeric columns ‘A’ and ‘B’. The non-numeric column ‘C’ was excluded.
When to Use df.describe()
df.describe() is commonly used to:
- Run quick exploratory data analysis: summarize numeric columns to understand their distributions
- Spot outliers and anomalies: extreme min/max values point to candidates
- Compare central tendency and dispersion: mean and standard deviation across columns
- Understand the impact of cleaning/preprocessing: compare before and after statistics
- Support feature engineering and selection: see the scale and spread of candidate numeric columns
- Prepare model training and evaluation: summarize key input variables
It provides a high-level overview of the numeric data that can guide deeper analysis.
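The before/after comparison mentioned in the list can be sketched as follows. The column name temp_c, the sample values, and the plausible-range rule are all hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value and one implausible reading.
raw = pd.DataFrame({"temp_c": [21.5, 22.0, np.nan, 23.1, 250.0]})

# Hypothetical cleaning rule: keep only readings in a plausible range.
# NaN fails the range check, so the missing row is dropped as well.
clean = raw[raw["temp_c"].between(-50, 60)]

# Place both summaries side by side for comparison.
comparison = pd.concat(
    [raw["temp_c"].describe(), clean["temp_c"].describe()],
    axis=1,
    keys=["before", "after"],
)
print(comparison)
```

Reading across a row shows how each statistic moved; here the max and std should drop sharply once the implausible reading is removed.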
How to Use df.describe()
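The basic syntax for df.describe() is:
df.describe(percentiles=None, include=None, exclude=None)
- df: the Pandas DataFrame to describe
- percentiles: optional list of percentiles between 0 and 1 to report. Default is [0.25, 0.5, 0.75].
- include: optional list of data types to include. Ex: include=['float', 'int']
- exclude: optional list of data types to exclude. Ex: exclude=['object']
Pandas 1.1 through 1.5 also accept datetime_is_numeric (default False), which controls whether datetime columns are treated as numeric. The parameter was removed in pandas 2.0, where datetime columns are always summarized as numeric.
By default, it will describe all numeric columns.
To generate statistics on non-numeric columns such as object columns, specify include. On pandas versions before 2.0, set datetime_is_numeric=True to get numeric-style statistics for datetime columns.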
Examples
Describe all numeric columns:
df.describe()
Describe only float columns:
df.describe(include=['float'])
Describe all except object columns:
df.describe(exclude=['object'])
Treat datetimes as numeric (pandas 1.1 to 1.5 only; the argument was removed in pandas 2.0):
df.describe(datetime_is_numeric=True)
The output DataFrame has one row per statistic and one column per described column. You can transpose it with df.describe().T to get one row per column if needed.
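As a small sketch of two options that often go together, the snippet below requests custom percentiles through the percentiles argument of describe() and then transposes the result. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with two numeric columns and one text column.
df = pd.DataFrame({
    "price": [10.0, 12.5, 11.0, 50.0, 9.5],
    "qty": [1, 3, 2, 8, 1],
    "label": ["a", "b", "a", "c", "b"],
})

# Report the 10th, 50th and 90th percentiles instead of the quartiles.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])

# Transpose so that each described column becomes a row.
wide = summary.T
print(wide)
```

The transposed layout tends to be easier to scan when a DataFrame has many columns, since each column's statistics sit on a single line.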
df.describe() Output Statistics
The df.describe() output provides the following summary statistics for each column:
- count: number of non-NaN observations
- mean: average value
- std: sample standard deviation (dispersion around the mean)
- min: minimum value
- 25%: first quartile
- 50%: second quartile (median)
- 75%: third quartile
- max: maximum value
These measures allow you to understand the distribution, central tendency, and spread of numeric columns.
For example, columns with a small standard deviation have data clustered close to the mean, while large deviations indicate dispersed data. The median gives the midpoint value, while the quartiles mark the boundaries between the lower, middle, and upper portions of the data.
Potential outliers show up as a min or max far from the quartiles. Comparing the mean with the median and the spacing of the quartiles gives a first indication of whether the data is roughly symmetric; it is not a test of normality.
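One common way to turn the quartiles into an outlier check is the 1.5 × IQR rule. This is a conventional heuristic rather than something describe() does itself, and the sample values below are invented for the sketch.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
values = pd.Series([10, 12, 11, 13, 12, 95], name="reading")

stats = values.describe()

# Interquartile range and the conventional 1.5 * IQR fences.
iqr = stats["75%"] - stats["25%"]
lower = stats["25%"] - 1.5 * iqr
upper = stats["75%"] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Flagged values are candidates for review, not automatic deletions; whether 95 is an error or a genuine extreme depends on the data.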
Handling Null Values
By default, df.describe()
excludes missing values labeled NaN
from its calculations. This provides statistics representative of the nonmissing data.
To include null values, you can use:
df.describe(include='all')
This will output NaN for the central tendency and dispersion statistics, while count reflects the total rows.
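A short sketch of how count relates to missing data, using a made-up two-column DataFrame. The count row reports non-missing values, so subtracting it from the row total gives the number of missing entries, which matches isna().sum().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "score": [1.0, np.nan, 3.0, 4.0],
    "grade": ["a", "b", None, "b"],
})

# include="all" summarizes both the numeric and the text column.
summary = df.describe(include="all")
print(summary)

# count ignores missing values, so rows minus count gives the gaps.
missing = len(df) - df.count()
print(missing)

# The direct way to count missing values per column.
print(df.isna().sum())
```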
Interpreting df.describe() Output
When exploring a new dataset, df.describe() provides insights into:
- Data types: which columns pandas treats as numeric (only those appear by default)
- Completeness: values are missing if count is less than the total number of rows
- Central tendency: mean and median represent the “average” value
- Variability: the range from min to max shows spread; standard deviation measures dispersion
- Outliers: min and max identify potential anomalies or errors
- Distribution: hints about symmetry and shape
For example, a mean that sits far from the median, or a median close to the min or max, suggests outliers or a skewed distribution. Standard deviation gives a sense of clustering: small means tighter, large means more dispersed.
Quartiles show whether data is balanced or skewed between the lower and upper bounds. Together these metrics give a rough picture of shape, which is a starting point for assessing normality rather than a proof.
We can generate plots like histograms to visualize the distribution as well. Identifying anomalies or heavily skewed data may prompt data cleaning or transformation before analysis.
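Before reaching for a plotting library, the bin counts behind a histogram can be inspected directly. The sketch below uses NumPy's histogram function on invented values; the choice of three bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value.
values = pd.Series([1, 1, 1, 1, 100])

# Count how many values fall into each of three equal-width bins.
counts, edges = np.histogram(values, bins=3)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {n}")
```

Most of the mass in the first bin with a lone value in the last one is the same right-skew signal that a large gap between mean and median gives in the describe() output.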
Example Analysis
Let’s use a DataFrame of basic numeric data to demonstrate interpreting df.describe():
data = {
'Normal': [1, 2, 3, 4, 5],
'Uniform': [1, 1, 1, 1, 1],
'Skewed': [1, 1, 1, 1, 100]
}
df = pd.DataFrame(data)
print(df.describe())
Normal Uniform Skewed
count 5.000000 5.000 5.0
mean 3.000000 1.000 20.8
std 1.581139 0.000 44.3
min 1.000000 1.000 1.0
25% 2.000000 1.000 1.0
50% 3.000000 1.000 1.0
75% 4.000000 1.000 1.0
max 5.000000 1.000 100.0

Normal has mean equal to the median, a small standard deviation, and min/max placed symmetrically around the center. This is consistent with a symmetric distribution; the summary alone cannot show that the data is normally distributed.

Uniform has identical values, so min/max/mean/median are all 1. Standard deviation is 0, showing no variance.

Skewed has a median and quartiles of 1, but its high mean and max indicate a right-skewed distribution caused by a single outlier.
This shows how the summary statistics reflect shape and skew. We can make informed decisions about preprocessing, anomalies, and modeling based on these insights.
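The skew visible in the summary can also be quantified. As a sketch on the same three columns, the gap between mean and median is taken from the describe() output, and DataFrame.skew() gives a sample skewness for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "Normal": [1, 2, 3, 4, 5],
    "Uniform": [1, 1, 1, 1, 1],
    "Skewed": [1, 1, 1, 1, 100],
})

summary = df.describe()

# A mean well above the median hints at a long right tail.
gap = summary.loc["mean"] - summary.loc["50%"]
print(gap)

# Sample skewness per column; positive values indicate right skew.
skewness = df.skew()
print(skewness)
```

With only five rows these numbers are illustrative at best; on real data they are more useful as a ranking of which columns deserve a closer look.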
df.describe() for Categorical Columns
df.describe() focuses on numeric columns by default. For statistics on categorical or object data, you can use:
df.describe(include=['object'])
This will output the count, the number of unique values, the most frequent value (top), and the frequency of that value (freq) for each object column.
For example:
data = {'Category': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)
print(df.describe(include=['object']))
Category
count 5
unique 3
top A
freq 2
This allows basic EDA on categorical features like cardinality, modal category, and frequency distribution.
The Pandas Profiling library (now published as ydata-profiling) generates more detailed statistics and plots for deeper analysis of object, numeric, and datetime data.
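Because describe() reports only the single most frequent category, value_counts() is a natural companion when the full frequency distribution matters. A minimal sketch on the same Category data:

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B", "C", "A", "B"]})

# Summary of the column: count, unique, top, freq.
summary = df["Category"].describe()
print(summary)

# Full frequency table, as absolute counts and as shares of all rows.
counts = df["Category"].value_counts()
shares = df["Category"].value_counts(normalize=True)
print(counts)
print(shares)
```

Note that A and B are tied at two occurrences each, so top shows just one of them; the frequency table makes the tie visible.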
Real-World Example: Bike Sharing Dataset
Let’s walk through a real-world example using a public bike sharing dataset from Kaggle. The data has features like temperature, humidity, count of rental bikes, etc.
We’ll import it into a DataFrame and use df.describe() to explore:
import pandas as pd
bikes = pd.read_csv('bike_sharing.csv')
print(bikes.describe())
Output:
timestamp count humidity temperature
count 6.547000e+03 6547.0 6547.000 6547.000
mean 1.538513e+13 191.273 0.625 0.490
std 8.398935e+09 151.314 0.186 0.194
min 1.536912e+13 1.0 0.150 0.100
25% 1.538169e+13 36.0 0.500 0.330
50% 1.538506e+13 146.0 0.600 0.500
75% 1.538874e+13 284.0 0.700 0.670
max 1.539133e+13 977.0 1.000 1.000
The timestamp column was treated as numeric. We can observe:
- count matches the total number of rows for every column, so there are no missing values
- Mean rental count is about 191
- Average humidity is 0.625 and average temperature is 0.49
- Min temperature is 0.1 and min bike count is 1; the 0 to 1 range suggests temperature is on a normalized scale rather than in degrees Celsius
- Max bike rentals spike to 977
- A standard deviation of 151 for count, close to the mean, indicates widely dispersed rentals
This quick overview helps guide deeper analysis on the influence of weather on rental patterns. We may also want to preprocess timestamp and handle outliers before modeling the data.
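One possible way to preprocess the timestamp is sketched below on a small stand-in DataFrame. The column names follow the output above, but the values are invented and the assumption that timestamps are epoch seconds is hypothetical; the real file may use a different unit or format.

```python
import pandas as pd

# Hypothetical stand-in for the bike sharing data. The values and the
# epoch-seconds unit of "timestamp" are assumptions for illustration.
bikes = pd.DataFrame({
    "timestamp": [1538352000, 1538355600, 1538359200],
    "count": [16, 40, 32],
    "humidity": [0.81, 0.80, 0.75],
    "temperature": [0.24, 0.22, 0.22],
})

# Summarize only the measurement columns; a mean timestamp is not useful.
measurements = bikes.drop(columns=["timestamp"]).describe()
print(measurements)

# Convert epoch seconds to datetimes and report the covered time range.
bikes["timestamp"] = pd.to_datetime(bikes["timestamp"], unit="s")
time_range = (bikes["timestamp"].min(), bikes["timestamp"].max())
print(time_range)
```

Keeping the timestamp out of the numeric summary avoids rows of scientific-notation values that carry little meaning, while the min and max still show the period the data covers.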
Summary
Pandas df.describe() is an invaluable tool for quick exploratory data analysis on numeric columns. It outputs a statistical summary covering central tendency and dispersion, with hints about outliers and distribution shape.
Mastering df.describe() helps you build intuition efficiently for cleaning, preprocessing, feature engineering, and modeling workflows in data science and analytics.
The key points to remember are:
- Generates statistics like mean, standard deviation, quartiles, min/max for numeric data
- Useful for initial EDA to understand distributions
- Identify anomalies and compare before/after cleaning or transformations
- Analyze numeric patterns in features for engineering and modeling
- Control inclusion/exclusion using options like include and exclude
- Interpret output to assess shape, spread, and skew
With a solid grasp of applying df.describe() and analyzing the output, you can extract powerful insights from your data using native Pandas functionality.