A Comprehensive Guide to Pandas df.describe() for Descriptive Statistics on Numeric Columns

Updated: at 05:15 AM

Pandas is one of the most popular Python libraries used for data analysis and manipulation. The df.describe() method in Pandas provides a quick way to generate descriptive statistics on numeric columns in a DataFrame. It outputs the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column.

Understanding how to use df.describe() is an essential skill for data scientists, analysts, and Python developers working with data. This comprehensive guide will explain what df.describe() does, how to use it correctly, and how to interpret the output statistics. Real-world examples and sample code snippets are provided to help you master using df.describe() for exploratory data analysis.

What is df.describe()?

The df.describe() method generates a high-level summary of statistics for a DataFrame. According to the Pandas API Reference:

“Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.”

It calculates the count, mean, standard deviation, minimum, quartiles, and maximum for all numeric columns in a DataFrame or Series. By default, it computes the statistics on numeric types such as float and int. Boolean columns are not treated as numeric; when they are included, they are summarized like categorical data.

The output is a new DataFrame containing the summary statistics for each numeric column. Non-numeric columns are ignored unless you request them through the include argument.

Let’s look at a simple example:

import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': ['a', 'b', 'c']
})

print(data.describe())

Output:

         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0

Here we can see the summary statistics generated for the numeric columns ‘A’ and ‘B’. The non-numeric column ‘C’ was excluded.
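Since describe() is also available on a Series, the same summary can be produced for a single column. The snippet below is a minimal sketch; the Series name and values are chosen only for illustration.

```python
import pandas as pd

# A single numeric column, built by hand for illustration
s = pd.Series([1, 2, 3], name='A')

# Series.describe() returns a Series indexed by statistic name
stats = s.describe()
print(stats)

# Individual statistics can be read by label
print(stats['mean'], stats['50%'])
```

Reading statistics by label in this way is handy when a later step only needs one or two of the values.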

When to Use df.describe()

df.describe() is commonly used early in exploratory data analysis, because it provides a high-level overview of the numeric data that can guide deeper analysis.

How to Use df.describe()

The basic syntax for df.describe() is:

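df.describe(percentiles=None, include=None, exclude=None)

By default, it describes all numeric columns and reports the 25th, 50th, and 75th percentiles. Pass a list of fractions between 0 and 1 as percentiles to report different ones.

To generate statistics on non-numeric columns such as object columns, specify include. The datetime_is_numeric argument existed only in pandas 1.1 through 1.5. It was removed in pandas 2.0, where datetime columns are summarized with numeric-style statistics automatically.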

Examples

Describe all numeric columns:

df.describe()

Describe only float columns:

df.describe(include=['float'])

Describe all except object columns:

df.describe(exclude=['object'])

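Treat datetimes as numeric (pandas 1.1 through 1.5 only; the argument was removed in pandas 2.0, where passing it raises a TypeError):

df.describe(datetime_is_numeric=True)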

The output DataFrame has one column per described column and one row per statistic. You can transpose it with df.describe().T to get one row per column instead, which is easier to read when the DataFrame has many columns.
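The selection options above can be compared side by side on a small mixed-type DataFrame. This is a sketch only; the column names price, units, and label are invented for the example.

```python
import pandas as pd

# Hypothetical frame with a float, an int, and a string column
df = pd.DataFrame({
    'price': [9.99, 14.50, 3.25],
    'units': [3, 1, 7],
    'label': ['a', 'b', 'c'],
})

# Default: only the numeric columns are described
default_cols = list(df.describe().columns)

# include narrows the summary to the requested dtypes
float_cols = list(df.describe(include=['float']).columns)

# exclude drops the listed dtypes and describes what is left
non_numeric_cols = list(df.describe(exclude=['number']).columns)

# Transposed summary: one row per column, one column per statistic
transposed = df.describe().T

print(default_cols, float_cols, non_numeric_cols)
print(transposed)
```

The include and exclude arguments accept the same dtype specifications as DataFrame.select_dtypes, so 'number' here stands for every numeric dtype.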

df.describe() Output Statistics

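The df.describe() output provides the following summary statistics for each numeric column: count (the number of non-null values), mean, std (the sample standard deviation), min, the 25%, 50%, and 75% percentiles (the 50% row is the median), and max.

These measures allow you to understand the distribution, central tendency, and spread of numeric columns.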

For example, columns with a small standard deviation have data clustered close to the mean, while large deviations indicate dispersed data. The median gives the mid-point value while quartiles show the lower, middle, and upper portions of data.
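To make the meaning of each row concrete, the statistics can be recomputed one at a time with the corresponding Series methods. The sketch below uses made-up values; note that std() uses the sample formula (ddof=1), which is also what describe() reports.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Recompute each describe() row with the matching Series method
manual = pd.Series({
    'count': s.count(),        # number of non-null values
    'mean': s.mean(),
    'std': s.std(),            # sample standard deviation (ddof=1)
    'min': s.min(),
    '25%': s.quantile(0.25),   # first quartile
    '50%': s.quantile(0.50),   # median
    '75%': s.quantile(0.75),   # third quartile
    'max': s.max(),
}, dtype='float64')

described = s.describe()

# Largest absolute difference between the two summaries
max_diff = (manual - described).abs().max()
print(max_diff)
```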

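A min or max that lies far from the quartiles can hint at outliers, although describe() does not flag them directly. Comparing the mean with the median and looking at how evenly the quartiles are spaced gives a rough indication of skew. This is a first impression rather than a formal test of normality.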

Handling Null Values

By default, df.describe() excludes missing values labeled NaN from its calculations. This provides statistics representative of the non-missing data.

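There is no option that makes describe() count null values as data. The include='all' argument does something different: it adds columns of every dtype to the summary:

df.describe(include='all')

In that output, NaN marks statistics that do not apply to a column's dtype (for example, mean for an object column, or unique for a numeric column). The count row still reports the number of non-null values, not the total number of rows. To count missing values per column, use df.isna().sum().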
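The behavior around missing values can be checked on a small example. This is a sketch with invented column names (score, label), not data from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one numeric column with a missing value, one string column
df = pd.DataFrame({
    'score': [10.0, np.nan, 30.0, 20.0],
    'label': ['x', 'y', 'x', 'z'],
})

# NaN is skipped: count is the number of non-null values,
# and mean is computed over those values only
summary = df.describe()
print(summary)

# include='all' adds the string column; statistics that do not
# apply to a column's dtype show up as NaN
summary_all = df.describe(include='all')
print(summary_all)

# Missing values per column are counted separately
missing = df.isna().sum()
print(missing)
```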

Interpreting df.describe() Output

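When exploring a new dataset, df.describe() provides insights into the range, center, and spread of each numeric column.

For example, a column whose mean is far from its median, or whose max lies far above the 75% value, likely has outliers or a skewed distribution. Standard deviation gives a sense of clustering: a small value means the data sits close to the mean, a large value means it is more dispersed.

Quartiles show whether data is balanced or skewed between the lower and upper bounds. Together these metrics give a rough first impression of the distribution's shape, which a plot or a formal test can then confirm.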

We can generate plots like histograms to visualize the distribution as well. Identifying anomalies or heavily skewed data may prompt data cleaning or transformation before analysis.
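One common way to turn the quartiles into an outlier check is the 1.5 × IQR rule. The rule is a widely used convention rather than something describe() applies itself, and the values below are invented for the sketch.

```python
import pandas as pd

# Made-up measurements with one obviously extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 95], name='value')

stats = s.describe()

# Interquartile range from the 25% and 75% rows
iqr = stats['75%'] - stats['25%']

# Conventional fences: 1.5 * IQR beyond each quartile
lower = stats['25%'] - 1.5 * iqr
upper = stats['75%'] + 1.5 * iqr

# Values outside the fences are candidates for closer inspection
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())
```

Whether a flagged value is an error or a genuine extreme still has to be decided from knowledge of the data.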

Example Analysis

Let’s use a DataFrame of basic numeric data to demonstrate interpreting df.describe():

data = {
  'Normal': [1, 2, 3, 4, 5],
  'Uniform': [1, 1, 1, 1, 1],
  'Skewed': [1, 1, 1, 1, 100]
}

df = pd.DataFrame(data)
print(df.describe())

Output:

         Normal  Uniform      Skewed
count  5.000000      5.0    5.000000
mean   3.000000      1.0   20.800000
std    1.581139      0.0   44.274146
min    1.000000      1.0    1.000000
25%    2.000000      1.0    1.000000
50%    3.000000      1.0    1.000000
75%    4.000000      1.0    1.000000
max    5.000000      1.0  100.000000

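This shows how the summary statistics reflect shape and skew. In Normal, the mean equals the median and the quartiles are evenly spaced. In Uniform, every value is identical, so the standard deviation is 0. In Skewed, the mean (20.8) is far above the median (1.0) and the max (100) is far above the 75% value (1.0), which points to a single extreme value. We can make informed decisions about preprocessing, anomalies, and modeling based on these insights.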
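The comparison described here can also be read directly from the describe() output. The sketch below rebuilds the same three columns and computes two simple gaps; the variable names are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Normal': [1, 2, 3, 4, 5],
    'Uniform': [1, 1, 1, 1, 1],
    'Skewed': [1, 1, 1, 1, 100],
})

summary = df.describe()

# Gap between mean and median: a large positive gap suggests right skew
mean_median_gap = summary.loc['mean'] - summary.loc['50%']

# Distance from the third quartile to the max: a large value suggests
# one or more extreme high values
upper_tail = summary.loc['max'] - summary.loc['75%']

print(mean_median_gap)
print(upper_tail)
```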

df.describe() for Categorical Columns

df.describe() focuses on numeric columns by default. For statistics on categorical or object data, you can use:

df.describe(include=['object'])

This will output the count of non-null values, the number of unique values, the most frequent value (top), and the frequency of that value (freq) for each object column.

For example:

data = {'Category': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)

print(df.describe(include=['object']))

Output:

       Category
count         5
unique        3
top           A
freq          2

This allows basic EDA on categorical features like cardinality, modal category, and frequency distribution.
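Because the summary only reports the single most frequent value, it is often paired with value_counts() to see the full frequency distribution. The sketch below uses the same Category values and calls describe() on the column itself.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B']})

# Summary of the column: count, unique, top, freq
cat_summary = df['Category'].describe()
print(cat_summary)

# Full frequency table, one row per distinct value
counts = df['Category'].value_counts()
print(counts)
```

Here A and B are tied with two occurrences each, which the top and freq rows alone would not reveal.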

For deeper analysis of object, numeric, and datetime data, the ydata-profiling library (formerly Pandas Profiling) generates more detailed statistics and plots.

Real World Example - Bike Sharing Dataset

Let’s walk through a real-world example using a public bike sharing dataset from Kaggle. The data has features like temperature, humidity, count of rental bikes, etc.

We’ll import it into a DataFrame and use df.describe() to explore:

import pandas as pd

bikes = pd.read_csv('bike_sharing.csv')

print(bikes.describe())

Output:

              timestamp  count  humidity  temperature
count   6.547000e+03   6547.0  6547.000     6547.000
mean    1.538513e+13  191.273     0.625        0.490
std     8.398935e+09   151.314     0.186        0.194
min     1.536912e+13      1.0     0.150       -0.100
25%     1.538169e+13     36.0     0.500        0.330
50%     1.538506e+13    146.0     0.600        0.500
75%     1.538874e+13    284.0     0.700        0.670
max     1.539133e+13    977.0     1.000        1.000

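The timestamp column was treated as numeric, which suggests it is stored as a number rather than as a datetime. From the other columns we can observe that every column has 6547 non-null values. The count column ranges from 1 to 977, and its mean (191.3) is above its median (146), which points to a right-skewed distribution. Humidity ranges from 0.15 to 1.0 and temperature from -0.1 to 1.0, so both appear to be on a normalized scale.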

This quick overview helps guide deeper analysis on the influence of weather on rental patterns. We may also want to preprocess timestamp and handle outliers before modeling the data.
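One way to preprocess the timestamp is to convert it to a datetime before summarizing. The sketch below uses a tiny hypothetical stand-in for the bike data, and it assumes the timestamps are Unix seconds; the real file's column names and units may differ.

```python
import pandas as pd

# Hypothetical stand-in for the bike data (not the real dataset)
bikes = pd.DataFrame({
    'timestamp': [1_536_912_000, 1_536_915_600, 1_536_919_200],  # assumed Unix seconds
    'count': [12, 40, 25],
    'temperature': [0.30, 0.34, 0.36],
})

# Convert the numeric timestamp to a proper datetime column
bikes['timestamp'] = pd.to_datetime(bikes['timestamp'], unit='s')

# Summarize only the truly numeric columns
numeric_summary = bikes.select_dtypes(include='number').describe()
print(numeric_summary)

# Describe the time coverage separately
span = bikes['timestamp'].max() - bikes['timestamp'].min()
print(bikes['timestamp'].min(), bikes['timestamp'].max(), span)
```

Keeping the time range separate avoids statistics such as the mean of a raw timestamp, which are hard to interpret in scientific notation.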

Summary

Pandas df.describe() is an invaluable tool for quick exploratory data analysis on numeric columns. It outputs a statistical summary of central tendency and dispersion, which can hint at outliers and distribution shape.

Mastering df.describe() helps you build intuition quickly for cleaning, preprocessing, feature engineering, and modeling workflows in data science and analytics.

The key points to remember are that df.describe() summarizes numeric columns by default, excludes NaN values from its calculations, and accepts include and exclude to control which dtypes are described.

With a solid grasp of applying df.describe() and analyzing the output, you can extract powerful insights from your data using native Pandas functionality.