A Comprehensive Guide to Pandas df.info() in Python

Pandas is one of the most popular and powerful data analysis libraries in Python. It provides efficient data structures like DataFrames and Series that make data analysis workflows much easier and more intuitive.

One important method in Pandas is df.info(), which allows us to get a quick overview of the DataFrame including the index, columns, data types, memory usage and more. Having a solid understanding of df.info() is critical for effective exploratory data analysis using Pandas.

In this comprehensive guide, we will dive deep into df.info() and learn how to use it to extract key details about a Pandas DataFrame. We will cover the following topics in-depth with example code snippets:

Overview of df.info()
Index Details
Column Details
Data Types Overview
Memory Usage
Use Cases and Examples
Additional Parameters
- verbose
- buf
- max_cols
- memory_usage
- show_counts (formerly null_counts)
How df.info() Works Internally
Comparison with df.describe()
Limitations to be Aware Of
Conclusion

Overview of df.info()

The df.info() method in Pandas provides an overview of the DataFrame by outputting information about the index, columns, data types, memory usage and more.

Here is the basic syntax:

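df.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)

It takes the following optional parameters:

verbose (bool): Whether to print the full per-column summary. Defaults to the pandas.options.display.max_info_columns setting.
buf (writable buffer): Where to send the output. Defaults to sys.stdout.
max_cols (int): The column count above which the output switches from the full summary to a truncated one. Defaults to pandas.options.display.max_info_columns.
memory_usage (bool, str): Whether to show the total memory usage of the DataFrame. Passing 'deep' inspects the contents of object columns for a more accurate figure.
show_counts (bool): Whether to show the non-null counts. By default they are shown only when the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. This parameter was called null_counts before pandas 1.2, and the old name was removed in pandas 2.0.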

Calling df.info() quickly outputs a concise summary of the DataFrame without having to write much code. This makes it very useful for initial exploratory data analysis.

Let’s look at a simple example:

import pandas as pd

df = pd.DataFrame({'A': [1, 2],
                   'B': [1.0, 3.0],
                   'C': ['a', 'b']})

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       2 non-null      int64
 1   B       2 non-null      float64
 2   C       2 non-null      object
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes

This quickly shows the DataFrame has:

2 rows labeled 0 to 1
3 columns - ‘A’, ‘B’, ‘C’
Data types of each column
Memory usage

As we can see, df.info() provides a neat summary of all the main details we need to know about the structure of a DataFrame. Now let’s look at each of these elements more closely.

Index Details

df.info() summarizes the index of the DataFrame on the second line of its output:

Index type (for example RangeIndex, Index or DatetimeIndex)
Number of index entries
First and last index labels

By default, a Pandas DataFrame gets an integer RangeIndex that labels the rows 0 to n-1.

We can supply our own labels instead through the index argument. Let’s see an example:

df = pd.DataFrame({'A': [1, 2],
                   'B': [1.0, 3.0]},
                  index=['row1', 'row2'])

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, row1 to row2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       2 non-null      int64
 1   B       2 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 96.0 bytes

Here the index line reads Index rather than RangeIndex, and it reports 2 entries whose labels run from row1 to row2.

The first word of that line is the index class, which hints at the kind of labels in use: RangeIndex for the default integer positions, Index for general labels such as strings, DatetimeIndex for timestamps and so on.
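To illustrate how the index line changes with the kind of index, here is a minimal sketch that builds a small DataFrame with a datetime index and captures the summary as text. The frame and variable names (df_dates, index_line) are made up for this example.

```python
import io

import pandas as pd

# Hypothetical frame with a daily datetime index (illustrative data only).
dates = pd.date_range("2024-01-01", periods=3, freq="D")
df_dates = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=dates)

# Capture the summary in a buffer so individual lines can be inspected.
buffer = io.StringIO()
df_dates.info(buf=buffer)

# The second line of the summary describes the index.
index_line = buffer.getvalue().splitlines()[1]
print(index_line)
```

The printed line should start with DatetimeIndex instead of RangeIndex, followed by the number of entries and the first and last dates.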

Column Details

In addition to index information, df.info() also provides details about the columns in the DataFrame:

Column names
Data type of each column
Number of non-null values in each column

This allows us to quickly check if the columns and data types are as expected.

Let’s see an example:

df = pd.DataFrame({'NumericCol': [1, 2],
                   'StringCol': ['a', 'b']},
                   index=['row1', 'row2'])

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, row1 to row2
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   NumericCol  2 non-null      int64
 1   StringCol   2 non-null      object
dtypes: int64(1), object(1)
memory usage: 128.0 bytes

Here we can see:

Column names: ‘NumericCol’, ‘StringCol’
Data types: int64, object
Non-null count: 2 for each column

This allows us to verify the DataFrame structure at a glance.

Data Types Overview

One of the most useful parts of df.info() is that it provides a quick overview of the data types of all the columns.

The data types summary is shown in the dtypes section of the output.

For example:

dtypes: float64(2), int64(1), object(1)

This summarizes the data types in the DataFrame as:

2 float64 columns
1 int64 column
1 object column

This allows us to easily verify that the columns have the expected types and detect any unexpected types that could lead to errors later on.

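Watch for object columns in particular: a column you expected to be numeric but that shows up as object usually contains strings or mixed types, which can make numeric calculations fail or behave unexpectedly.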
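As a small sketch of what this looks like in practice, the example below builds a column that mixes integers and strings, checks its dtype, and then converts it with pd.to_numeric. The column name and values are invented for illustration, and coercing bad values to NaN is only one possible way to handle them.

```python
import pandas as pd

# Illustrative column mixing integers and strings, so pandas stores it as object.
df_mixed = pd.DataFrame({"amount": [10, "20", "n/a"]})
before_dtype = df_mixed["amount"].dtype
print(before_dtype)

# One possible fix: coerce to numeric, turning unparseable values into NaN.
df_mixed["amount"] = pd.to_numeric(df_mixed["amount"], errors="coerce")
after_dtype = df_mixed["amount"].dtype
print(after_dtype)
```

After the conversion, df_mixed.info() would report the column as float64 with one fewer non-null value, because "n/a" became NaN.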

Memory Usage

When dealing with large datasets, understanding the memory footprint is important.

df.info() provides memory usage details of the DataFrame by default.

For example:

memory usage: 200.0+ bytes

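This is the total number of bytes used by the index and the column data. A trailing + means the figure is a lower bound: for object columns only the references are counted, not the Python objects they point to. The exact numbers also depend on the pandas version and platform.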

We can also get deep memory usage by passing memory_usage='deep':

df.info(memory_usage='deep')

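This inspects the actual contents of object columns, so the reported total is more accurate but takes longer to compute on large DataFrames. The output still shows a single total; for a per-column breakdown, call df.memory_usage(deep=True).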
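The difference between the shallow and deep figures can be seen per column with df.memory_usage(). The sketch below uses a made-up frame (df_text) with one integer column and one object column of strings.

```python
import pandas as pd

# Illustrative frame: one numeric column and one object column of Python strings.
df_text = pd.DataFrame({
    "id": [1, 2, 3],
    "name": pd.Series(["alice", "bob", "carol"], dtype=object),
})

# Shallow usage counts only the references held by the object column.
shallow = df_text.memory_usage(deep=False)

# Deep usage also counts the string objects those references point to.
deep = df_text.memory_usage(deep=True)

print(shallow)
print(deep)
```

The numeric column reports the same size either way, while the object column grows under deep=True. That gap is what the + in the default df.info() output is warning about.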

Use Cases and Examples

Now that we’ve seen what df.info() displays, let’s go over some examples of how it can be used for exploratory data analysis.

1. Verify DataFrame structure and metadata

As seen earlier, we can use df.info() after creating a new DataFrame to verify it has the expected index, columns, data types and size. This helps catch any mismatches between assumptions and reality about the DataFrame.

2. Profile new unknown data sources

When loading datasets from new sources, we may not know the structure, data types or size beforehand. df.info() allows quickly profiling the DataFrame to understand the data better.

3. Catch mixed data types

A column that shows up as object when a numeric type was expected often holds mixed types. Spotting this in the df.info() output prevents surprises later when doing computations on such data.

4. Check for missing data

The non-null count in df.info() output can reveal columns with missing values. This helps plan data cleaning steps like imputation.

5. Estimate memory usage

For big data applications, the memory footprint is important. df.info() provides an estimate of memory usage to optimize system configuration.

6. Monitor memory usage during transformations

We can insert df.info() at various points while transforming data to track how memory usage changes. This helps detect memory leaks or inefficient operations.

7. Compare DataFrames

df.info() can be used to print and compare summaries of two DataFrames side-by-side to understand how they differ.
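Monitoring and comparing, two of the use cases above, can be sketched with a small helper that captures the summary as text before and after a transformation. The helper name info_text and the sample data are hypothetical, and converting a repetitive string column to category is just one example of a memory-saving step.

```python
import io

import pandas as pd


def info_text(frame: pd.DataFrame) -> str:
    """Return the text that frame.info() would print (hypothetical helper)."""
    buffer = io.StringIO()
    frame.info(buf=buffer, memory_usage="deep")
    return buffer.getvalue()


# Illustrative data: a repetitive string column and a numeric column.
df_raw = pd.DataFrame({
    "city": pd.Series(["Oslo", "Rome", "Oslo", "Rome"] * 250, dtype=object),
    "sales": range(1000),
})
before = info_text(df_raw)

# Example transformation: store the repetitive strings as a categorical.
df_small = df_raw.assign(city=df_raw["city"].astype("category"))
after = info_text(df_small)

print(before)
print(after)
```

Comparing the two printed summaries shows the city column changing from object to category and the memory usage line dropping.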

Additional Parameters

We briefly introduced the extra parameters available for df.info() earlier. Let’s look at them in more depth with examples:

verbose

The verbose parameter controls whether to print the full summary or just the basic details.

df.info(verbose=False)

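This replaces the per-column table with a single line giving the number of columns and the first and last column names. The dtypes summary and memory usage are still printed.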
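A quick way to see what verbose=False keeps and drops is to capture the output as a string. This is a minimal sketch with an invented three-column frame (df_abc).

```python
import io

import pandas as pd

# Illustrative three-column frame.
df_abc = pd.DataFrame({"A": [1, 2], "B": [1.0, 3.0], "C": ["a", "b"]})

# Capture the short summary instead of printing it directly.
buffer = io.StringIO()
df_abc.info(verbose=False, buf=buffer)
short_summary = buffer.getvalue()

print(short_summary)
```

The captured text has no per-column table, but it still contains a Columns line, the dtypes line and the memory usage line.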

buf

We can pass a buffer or file handle to buf to redirect the output to a file or StringIO object.

For example:

import io

buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()  # the summary as a plain string

max_cols

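max_cols sets the column count above which df.info() switches from the full per-column table to the truncated summary. This is useful for wide DataFrames.

For example:

df.info(max_cols=5)

If the DataFrame has more than 5 columns, this prints the truncated summary; otherwise it prints the full per-column table.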
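The following sketch shows how max_cols behaves in recent pandas versions when verbose is left unset: once the frame has more columns than max_cols, the whole per-column table is replaced by the truncated summary. The helper capture_info and the frame df_wide are made up for this example.

```python
import io

import pandas as pd


def capture_info(frame: pd.DataFrame, **kwargs) -> str:
    """Return the df.info() output as a string (hypothetical helper)."""
    buffer = io.StringIO()
    frame.info(buf=buffer, **kwargs)
    return buffer.getvalue()


# Illustrative frame with three columns.
df_wide = pd.DataFrame({"A": [1], "B": [2], "C": [3]})

# Three columns exceed max_cols=2, so the truncated summary is used.
truncated = capture_info(df_wide, max_cols=2)

# Three columns do not exceed max_cols=3, so the full table is printed.
full = capture_info(df_wide, max_cols=3)

print(truncated)
print(full)
```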

memory_usage

Besides memory_usage='deep', which we discussed earlier, we can pass memory_usage=False to omit the memory line entirely or memory_usage=True to force it to be shown.

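show_counts (formerly null_counts)

Setting show_counts=True forces the Non-Null Count column to be shown, even for large DataFrames where it is hidden by default. Setting show_counts=False hides it. Older pandas versions called this parameter null_counts.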
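Here is a short sketch of toggling the non-null counts, assuming pandas 1.2 or later, where the parameter is named show_counts. The frame df_missing is invented, and df.isna().sum() is included as a complementary way to count the missing values directly.

```python
import io

import numpy as np
import pandas as pd

# Illustrative frame with missing values in both columns.
df_missing = pd.DataFrame({
    "score": [1.0, np.nan, 3.0],
    "label": ["x", None, None],
})

# Summary with the Non-Null Count column forced on.
buf_on = io.StringIO()
df_missing.info(buf=buf_on, show_counts=True)
with_counts = buf_on.getvalue()

# Summary with the Non-Null Count column hidden.
buf_off = io.StringIO()
df_missing.info(buf=buf_off, show_counts=False)
without_counts = buf_off.getvalue()

# Missing values per column, as numbers rather than text.
missing_per_column = df_missing.isna().sum()

print(with_counts)
print(without_counts)
print(missing_per_column)
```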

How df.info() Works Internally

Conceptually, df.info() gathers the index, column and data type details of the DataFrame and formats them as text.

The same information is available through the following public attributes and methods:

df.index - To get index details
df.columns - For column names
df.dtypes - To get data types
df.count() - For non-null counts
df.dtypes.value_counts() - For the data types summary (the older df.get_dtype_counts() was removed in pandas 1.0)
df.memory_usage() - For memory usage

The output summary string is constructed using this information.

Knowing this helps understand what operations are done internally by df.info(). We can avoid repeating any redundant operations in our own code.
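As a rough sketch, the main pieces of the summary can be gathered by hand from public attributes. This is not how pandas implements df.info() internally; it only shows where equivalent information can be found. The frame df_parts and the variable names are illustrative.

```python
import pandas as pd

# Illustrative frame with three columns of different dtypes.
df_parts = pd.DataFrame({"A": [1, 2], "B": [1.0, 3.0], "C": [True, False]})

# Index line: class name and number of entries.
index_kind = type(df_parts.index).__name__
n_rows = len(df_parts.index)

# Per-column details: non-null counts and dtypes.
non_null = df_parts.count()
dtypes = df_parts.dtypes

# Footer: how many columns have each dtype, and the total memory in bytes.
dtype_counts = dtypes.value_counts()
total_bytes = df_parts.memory_usage().sum()

print(f"{index_kind}: {n_rows} entries")
print(non_null)
print(dtype_counts)
print(f"memory usage: {total_bytes} bytes")
```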

Comparison with df.describe()

Both df.info() and df.describe() are used for exploratory data analysis with Pandas. But they provide different types of summaries:

df.info() - Provides metadata like index, columns, data types.
df.describe() - Provides summary statistics like mean, quartiles, count etc.

So df.info() complements df.describe() by providing structural metadata compared to just statistics.

Here is a comparison:

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

df.info()  # prints the summary itself and returns None
print(df.describe())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       2 non-null      int64
 1   B       2 non-null      int64
dtypes: int64(2)
memory usage: 96.0 bytes

              A         B
count  2.000000  2.000000
mean   1.500000  3.500000
std    0.707107  0.707107
min    1.000000  3.000000
25%    1.250000  3.250000
50%    1.500000  3.500000
75%    1.750000  3.750000
max    2.000000  4.000000

We can see df.info() provides structural metadata like column names, dtypes, index etc. while df.describe() provides statistical summary like mean, standard deviation etc.

Using both together gives a more comprehensive data profile.

Limitations to be Aware Of

While df.info() is very useful, some limitations to keep in mind:

It provides an overview but not detailed statistics like df.describe().
Without memory_usage='deep', the reported memory usage is a lower bound for DataFrames with object columns.
It does not show interactions between columns like correlations.
The summary is printed rather than returned (the method returns None), so it can only be accessed programmatically by capturing the text through the buf parameter.
There are no plotting capabilities to visualize the overview.
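One possible workaround for the printed-only limitation is to assemble a similar per-column summary as a DataFrame, which can then be stored, filtered or compared in code. The helper info_frame below is hypothetical and not part of pandas; it is a sketch built from df.dtypes, df.count() and df.memory_usage().

```python
import pandas as pd


def info_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column metadata as a DataFrame (hypothetical helper, not a pandas API)."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.count(),
        "bytes": frame.memory_usage(index=False, deep=True),
    })


# Illustrative frame with one missing value.
df_demo = pd.DataFrame({"A": [1, None], "B": ["x", "y"]})
summary = info_frame(df_demo)

print(summary)
```

Because summary is an ordinary DataFrame, expressions such as summary[summary["non_null"] < len(df_demo)] can pick out the columns that contain missing values.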

Conclusion

In this comprehensive guide, we explored df.info() in depth including its parameters, use cases, internal working and limitations.

The key takeaways are:

df.info() provides a quick overview of Pandas DataFrames for exploratory analysis.
It shows details like index, columns, data types and memory usage.
This allows verifying the DataFrame structure and metadata.
It helps catch issues with mixed data types, missing data and memory constraints.
df.info() complements df.describe() by providing structural metadata vs just statistics.

Overall, mastering df.info() provides a simple yet powerful way to understand the shape of DataFrames for effective data analysis in Python. It should be part of every Pandas user’s toolbox.

Hopefully this guide gives you the knowledge to use df.info() for profiling DataFrames confidently. The key is practice - use df.info() liberally when exploring datasets to build intuition. This will enable you to derive insights from data more effectively using Python.