Pandas is one of the most popular Python libraries used for data analysis and manipulation. The `df.describe()` method in Pandas provides a quick way to generate descriptive statistics on numeric columns in a DataFrame. It outputs the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column.

Understanding how to use `df.describe()` is an essential skill for data scientists, analysts, and Python developers working with data. This guide explains what `df.describe()` does, how to use it correctly, and how to interpret the output statistics. Real-world examples and sample code snippets are provided to help you master using `df.describe()` for exploratory data analysis.

## What is df.describe()?

The `df.describe()` method generates a high-level summary of statistics for a DataFrame. According to the Pandas API Reference:

“Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.”

It calculates the count, mean, standard deviation, minimum, quartiles, and maximum for the numeric columns of a DataFrame, or for a single Series. By default, it summarizes only numeric dtypes such as int and float. Boolean, object, and categorical columns are left out unless you request them with `include`.

The output is a new DataFrame with one column per summarized input column and one row per statistic. Non-numeric columns are ignored unless specified.

Let’s look at a simple example:

```
import pandas as pd

# Two numeric columns and one text column
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': ['a', 'b', 'c']
})
print(data.describe())
```

```
         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0
```

Here we can see the summary statistics generated for the numeric columns ‘A’ and ‘B’. The non-numeric column ‘C’ was excluded.

## When to Use df.describe()

`df.describe()` is commonly used for:

- Quick exploratory data analysis - summarize numeric columns to understand distributions
- Identify outliers and anomalies - min/max values far from the quartiles highlight candidates
- Compare central tendency/dispersion - mean and standard deviation measures
- Understand impact of cleaning/preprocessing - compare before and after statistics
- Feature engineering and selection - spot constant columns (std of 0) and features on very different scales
- Model training and evaluation - summarize key input variables

It provides a high-level overview of the numeric data that can guide deeper analysis.
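The before/after comparison mentioned in the list can be sketched as follows. The data, the column name `reading`, and the sentinel value `999.0` are all made up for illustration; the idea is only to show two `describe()` results side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; 999.0 is an assumed sentinel for a bad reading.
raw = pd.DataFrame({"reading": [10.0, 12.0, 11.0, 999.0, 9.0, 13.0]})

# Replace the sentinel with NaN so that describe() skips it.
cleaned = raw.replace(999.0, np.nan)

# Put both summaries side by side; keys become the column labels.
comparison = pd.concat(
    [raw["reading"].describe(), cleaned["reading"].describe()],
    axis=1,
    keys=["raw", "cleaned"],
)
print(comparison)
```

In the `raw` column the sentinel inflates the mean, standard deviation, and max, while the `cleaned` column reports one fewer observation and statistics in the expected range.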

## How to Use df.describe()

The basic syntax for `df.describe()` is:

```
df.describe(percentiles=None, include=None, exclude=None)
```

- `df` - the Pandas DataFrame to describe
- `percentiles` - optional list of values between 0 and 1 to report. Default is `[.25, .5, .75]`
- `include` - optional list of data types to include, or `'all'` for every column. Ex: `include=['float', 'int']`
- `exclude` - optional list of data types to exclude. Ex: `exclude=['object']`

Pandas 1.1 through 1.5 also accepted `datetime_is_numeric` (default False), which controlled whether datetime columns were treated as numeric. The argument was removed in pandas 2.0, where datetime columns are always summarized like numeric data.

By default, it will describe all numeric columns.

To generate statistics on non-numeric columns such as object columns, you need to specify `include`.

### Examples

Describe all numeric columns:

```
df.describe()
```

Describe only float columns:

```
df.describe(include=['float'])
```

Describe all except object columns:

```
df.describe(exclude=['object'])
```

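Treat datetimes as numeric (pandas 1.1 to 1.5 only; the argument was removed in pandas 2.0, where datetime columns are summarized by default):

```
df.describe(datetime_is_numeric=True)
```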

The output DataFrame is indexed by statistic name, with one column per described column. You can transpose the output using `df.describe().T` so that each row corresponds to a column of the original DataFrame.
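As a minimal sketch of these options on an actual DataFrame, the example below uses a small made-up frame (the column names `price`, `quantity`, and `label` are assumptions for illustration) and shows default selection, `include`, custom `percentiles`, and the transposed layout.

```python
import pandas as pd

# Hypothetical mixed-type frame used only for illustration.
df = pd.DataFrame({
    "price": [9.99, 14.50, 3.25, 20.00],
    "quantity": [3, 1, 7, 2],
    "label": pd.Categorical(["a", "b", "a", "c"]),
})

# Default: only the numeric columns are summarized.
default_summary = df.describe()

# Restrict the summary to float columns.
float_summary = df.describe(include=["float"])

# Ask for the 5th and 95th percentiles instead of the quartiles.
tails = df.describe(percentiles=[0.05, 0.95])

# Transpose: one row per original column, one column per statistic.
wide = default_summary.T
print(wide[["mean", "std", "min", "max"]])
```

The transposed form is often easier to read when a DataFrame has many columns, because the statistics then run across the page instead of down it.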

## df.describe() Output Statistics

The `df.describe()` output provides the following summary statistics by column:

- **count** - non-NaN observations
- **mean** - average value
- **std** - standard deviation (dispersion from mean)
- **min** - minimum value
- **25%** - first quartile
- **50%** - second quartile (median)
- **75%** - third quartile
- **max** - maximum value

These measures allow you to understand the distribution, central tendency, and spread of numeric columns.

For example, columns with a small standard deviation have data clustered close to the mean, while large deviations indicate dispersed data. The median gives the mid-point value while quartiles show the lower, middle, and upper portions of data.

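A min or max that lies far from the quartiles points to potential outliers. Comparing the mean with the median, and the spacing of the quartiles around the median, gives a first indication of symmetry, although it is not a formal test of normality.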
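One common way to turn the quartiles into an outlier check is the 1.5 × IQR rule (Tukey's fences). This is a technique brought in here for illustration, not something `describe()` does itself, and the values below are made up with one planted outlier.

```python
import pandas as pd

# Illustrative values only; 100 is the planted outlier.
values = pd.Series([12, 14, 15, 15, 16, 18, 100], name="value")

stats = values.describe()
iqr = stats["75%"] - stats["25%"]

# Flag points more than 1.5 * IQR below Q1 or above Q3.
lower = stats["25%"] - 1.5 * iqr
upper = stats["75%"] + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # -> [100]
```

The 1.5 multiplier is a convention rather than a law; a wider fence may suit heavy-tailed data better.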

### Handling Null Values

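By default, `df.describe()` excludes missing values labeled `NaN` from its calculations. This provides statistics representative of the non-missing data, and `count` reports the number of non-missing values in each column rather than the total number of rows.

There is no option to include missing values in the statistics. A call that is sometimes mistaken for one is:

```
df.describe(include='all')
```

This describes every column regardless of dtype. `NaN` then appears in the output wherever a statistic does not apply to a column, such as the mean of a text column. It does not indicate missing data.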
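A short sketch of how missing values show up in practice, using a made-up two-column frame: `count` drops for the column with gaps, and subtracting it from the row count gives the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Illustrative data: one complete column and one with two missing values.
df = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0, 4.0],
    "with_gaps": [1.0, np.nan, 3.0, np.nan],
})

summary = df.describe()

# count only includes non-missing values, so the difference is the gap size.
missing = len(df) - summary.loc["count"]
print(summary.loc[["count", "mean"]])
print(missing)
```

The mean of `with_gaps` is computed from the two remaining values only, which is worth keeping in mind when a column is mostly empty.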

## Interpreting df.describe() Output

When exploring a new dataset, `df.describe()` provides insights into:

- **Data types** - which columns are numeric vs non-numeric
- **Completeness** - missing values if count is less than total rows
- **Central tendency** - mean and median represent “average” value
- **Variability** - range from min to max shows spread; standard deviation measures dispersion
- **Outliers** - min and max identify potential anomalies or errors
- **Distribution** - symmetry, normality, and shape

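For example, a column whose mean sits far from its median, pulled toward the min or the max, likely has outliers skewing the distribution. Standard deviation gives a sense of clustering - small is tighter, large more dispersed.

Quartiles show whether data is balanced or skewed: if the gap between the median and the 75% value is much larger than the gap between the 25% value and the median, the upper half is more spread out. Together these metrics give a rough, informal read on normality.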

We can generate plots like histograms to visualize the distribution as well. Identifying anomalies or heavily skewed data may prompt data cleaning or transformation before analysis.
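The mean-versus-median comparison can be computed directly from the `describe()` output. The sketch below uses made-up data and adds pandas' `skew()` as a cross-check; the column name `mean_minus_median` is just a label chosen for this example.

```python
import pandas as pd

# Illustrative data: one symmetric column and one with a long right tail.
df = pd.DataFrame({
    "symmetric": [10, 20, 30, 40, 50],
    "right_skewed": [1, 2, 2, 3, 50],
})

# One row per column makes it easy to add derived indicators.
summary = df.describe().T

# A positive gap means the mean is pulled above the median (right skew).
summary["mean_minus_median"] = summary["mean"] - summary["50%"]

# Sample skewness from pandas as an independent check.
summary["skew"] = df.skew()

print(summary[["mean", "50%", "mean_minus_median", "skew"]])
```

A gap near zero is consistent with a symmetric column, while a large positive gap together with positive skewness points to a right tail worth inspecting in a histogram.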

### Example Analysis

Let’s use a DataFrame of basic numeric data to demonstrate interpreting `df.describe()`:

```
import pandas as pd

# Three columns with different shapes: symmetric, constant, and one outlier
data = {
    'Normal': [1, 2, 3, 4, 5],
    'Uniform': [1, 1, 1, 1, 1],
    'Skewed': [1, 1, 1, 1, 100]
}
df = pd.DataFrame(data)
print(df.describe())
```

```
         Normal  Uniform      Skewed
count  5.000000      5.0    5.000000
mean   3.000000      1.0   20.800000
std    1.581139      0.0   44.274146
min    1.000000      1.0    1.000000
25%    2.000000      1.0    1.000000
50%    3.000000      1.0    1.000000
75%    4.000000      1.0    1.000000
max    5.000000      1.0  100.000000
```

- **Normal** has mean equal to median, a small standard deviation, and min/max symmetric around the median, which is consistent with a symmetric distribution.
- **Uniform** has identical values, so min/max/mean/median are all 1. Standard deviation is 0 showing no variance.
- **Skewed** median and quartiles are 1, but the high mean (20.8) and max (100) indicate a right-skewed distribution due to an outlier.

This shows how the summary statistics reflect shape and skew. We can make informed decisions about preprocessing, anomalies, and modeling based on these insights.

## df.describe() for Categorical Columns

`df.describe()` focuses on numeric columns by default. For statistics on categorical or object data, you can use:

```
df.describe(include=['object'])
```

This will output the **count**, the number of **unique** values, the most frequent value (**top**), and how often it occurs (**freq**) for each object column.

For example:

```
data = {'Category': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)
print(df.describe(include=['object']))

# Output:
#        Category
# count         5
# unique        3
# top           A
# freq          2
```

This allows basic EDA on categorical features like cardinality, modal category, and frequency distribution.
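As a small sketch of these categorical statistics alongside numeric ones, the example below describes a single text column, looks at the full frequency table with `value_counts()`, and then uses `include='all'` on a mixed frame. The data and column names are made up for illustration.

```python
import pandas as pd

# Illustrative frame with one text column and one numeric column.
df = pd.DataFrame({
    "category": ["A", "B", "C", "A", "B", "A"],
    "amount": [10, 20, 30, 40, 50, 60],
})

# Describing a text column returns count, unique, top and freq.
cat_summary = df["category"].describe()

# value_counts() gives the full frequency table behind top/freq.
counts = df["category"].value_counts()

# include='all' mixes both kinds of statistics; cells that do not
# apply to a column's dtype are filled with NaN.
mixed = df.describe(include="all")

print(cat_summary)
print(counts)
print(mixed)
```

Because `top` reports only one value, `value_counts()` is the safer choice when several categories are tied or when the whole distribution matters.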

The Pandas Profiling library (since renamed ydata-profiling) also generates more detailed statistics and plots for deeper analysis of object, numeric, and datetime data.

## Real World Example - Bike Sharing Dataset

Let’s walk through a real-world example using a public bike sharing dataset from Kaggle. The data has features like temperature, humidity, count of rental bikes, etc.

We’ll import it into a DataFrame and use `df.describe()` to explore:

```
import pandas as pd
bikes = pd.read_csv('bike_sharing.csv')
print(bikes.describe())
```

Output:

```
          timestamp    count  humidity  temperature
count  6.547000e+03   6547.0  6547.000     6547.000
mean   1.538513e+13  191.273     0.625        0.490
std    8.398935e+09  151.314     0.186        0.194
min    1.536912e+13      1.0     0.150       -0.100
25%    1.538169e+13     36.0     0.500        0.330
50%    1.538506e+13    146.0     0.600        0.500
75%    1.538874e+13    284.0     0.700        0.670
max    1.539133e+13    977.0     1.000        1.000
```

The timestamp column was treated as numeric. We can observe:

- **count** matches total rows so no missing values
- **mean** rental bikes is 191
- Average **humidity** is 0.625 and **temperature** 0.49
- **Min** temperature is -0.1 (the roughly 0 to 1 range suggests scaled units rather than degrees Celsius) and bike count is 1
- **Max** bike rentals spike to 977
- **Standard deviation** of 151 for count is large relative to the mean of 191, so rental counts vary widely

This quick overview helps guide deeper analysis on the influence of weather on rental patterns. We may also want to preprocess timestamp and handle outliers before modeling the data.

## Summary

Pandas `df.describe()` is an invaluable tool for quick exploratory data analysis on numeric columns. It outputs a statistical summary covering central tendency and dispersion, with hints about outliers and distribution shape.

Mastering `df.describe()` helps you build intuition quickly for cleaning, preprocessing, feature engineering, and modeling workflows in data science and analytics.

The key points to remember are:

- Generates statistics like mean, standard deviation, quartiles, min/max for numeric data
- Useful for initial EDA to understand distributions
- Identify anomalies and compare before/after cleaning or transformations
- Analyze numeric patterns in features for engineering and modeling
- Control inclusion/exclusion using options like `include` and `exclude`
- Interpret output to assess normality, shape, spread, and skew

With a solid grasp of applying `df.describe()` and analyzing the output, you can extract powerful insights from your data using native Pandas functionality.