Pandas is one of the most popular and powerful data analysis libraries in Python. It provides efficient data structures like DataFrames and Series that make data analysis workflows easier and more intuitive.
One important method in Pandas is df.info(), which allows us to get a quick overview of the DataFrame including the index, columns, data types, memory usage and more. Having a solid understanding of df.info() is critical for effective exploratory data analysis using Pandas.
In this comprehensive guide, we will dive deep into df.info() and learn how to use it to extract key details about a Pandas DataFrame, with example code snippets throughout.
Overview of df.info()
The df.info() method in Pandas provides an overview of the DataFrame by outputting information about the index, columns, data types, memory usage and more.
Here is the basic syntax:
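df.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)
It takes the following optional parameters:
- verbose (bool): Whether to print the full per-column summary. By default this follows the pandas.options.display.max_info_columns setting.
- buf (writable buffer): Where to send the output. Defaults to sys.stdout.
- max_cols (int): The column count above which the output switches from the full summary to a truncated one. Defaults to pandas.options.display.max_info_columns.
- memory_usage (bool, str): Whether to show the total memory usage of the DataFrame. Passing 'deep' also measures the contents of object columns. Defaults to pandas.options.display.memory_usage.
- show_counts (bool): Whether to show the non-null counts. By default they are shown only when the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. This parameter was named null_counts in older releases; that name was deprecated in pandas 1.2 and removed in 2.0.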
Calling df.info() quickly outputs a concise summary of the DataFrame without having to write much code. This makes it very useful for initial exploratory data analysis.
Let’s look at a simple example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2],
'B': [1.0, 3.0],
'C': ['a', 'b']})
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       2 non-null      int64
 1   B       2 non-null      float64
 2   C       2 non-null      object
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
This quickly shows the DataFrame has:
- 2 rows labeled 0 to 1
- 3 columns - ‘A’, ‘B’, ‘C’
- Data types of each column
- Memory usage
As we can see, df.info() provides a neat summary of all the main details we need to know about the structure of a DataFrame. Now let’s look at each of these elements more closely.
Index Details
df.info() provides useful details about the index of the DataFrame, including:
- Index type (for example RangeIndex or DatetimeIndex)
- Number of index entries
- First and last index labels
By default, a Pandas DataFrame gets a RangeIndex that labels the rows 0 to n-1.
We can supply our own labels through the index argument, or promote an existing column with df.set_index(). Let’s see an example:
df = pd.DataFrame({'A': [1, 2],
'B': [1.0, 3.0]},
index=['row1', 'row2'])
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, row1 to row2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       2 non-null      int64
 1   B       2 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 96.0 bytes
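Here we can see the index has 2 entries, with labels running from row1 to row2.
The first word of that line is the index type: RangeIndex for the default integer index, Index for generic labels such as strings, DatetimeIndex for dates, and so on. The index name and dtype are not printed; they are available as df.index.name and df.index.dtype.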
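To illustrate how the index line changes with the index type, here is a minimal sketch using a date-based index. The frame name ts and its values are made up for this example; the summary is captured through buf so the index line can be inspected on its own.

```python
import io
import pandas as pd

# Hypothetical data: three daily readings indexed by date.
dates = pd.date_range("2024-01-01", periods=3, freq="D")
ts = pd.DataFrame({"value": [10, 20, 30]}, index=dates)

# Capture the summary as text instead of printing it.
buffer = io.StringIO()
ts.info(buf=buffer)

# Line 0 is the class line; line 1 describes the index.
index_line = buffer.getvalue().splitlines()[1]
print(index_line)

# Details that info() does not print are available on the index itself.
print(ts.index.name, ts.index.dtype)
```

The index line should start with DatetimeIndex rather than RangeIndex or Index, followed by the entry count and the first and last labels.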
Column Details
In addition to index information, df.info() also provides details about the columns in the DataFrame:
- Column names
- Data type of each column
- Number of non-null values in each column
This allows us to quickly check if the columns and data types are as expected.
Let’s see an example:
df = pd.DataFrame({'NumericCol': [1, 2],
'StringCol': ['a', 'b']},
index=['row1', 'row2'])
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, row1 to row2
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   NumericCol  2 non-null      int64
 1   StringCol   2 non-null      object
dtypes: int64(1), object(1)
memory usage: 128.0 bytes
Here we can see:
- Column names: ‘NumericCol’, ‘StringCol’
- Data types: int64, object
- Non-null count: 2 for each column
This allows us to verify the DataFrame structure at a glance.
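The non-null count becomes more informative once some values are missing. The sketch below uses a made-up people frame (the names and values are purely illustrative) and shows how the same counts can be obtained as a Series with df.count().

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two of the three columns.
people = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee"],
    "age": [31, np.nan, 45, np.nan],
    "city": ["Oslo", "Lima", "Pune", "Rome"],
})

# The Non-Null Count column reports 3, 2 and 4 for a 4-row frame.
people.info()

# The same counts as a Series, plus the missing values per column.
non_null = people.count()
missing = len(people) - non_null
print(missing)
```

Any count below the number of index entries points to missing data in that column.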
Data Types Overview
One of the most useful parts of df.info() is that it provides a quick overview of the data types of all the columns.
The data types summary is shown in the dtypes section of the output.
For example:
dtypes: float64(2), int64(1), object(1)
This summarizes the data types in the DataFrame as:
- 2 float64 columns
- 1 int64 column
- 1 object column
This allows us to easily verify that the columns have the expected types and detect any unexpected types that could lead to errors later on.
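Detecting mixed data types is especially important for numeric calculations. A column we expect to be numeric but that shows up as object usually contains strings or mixed values, and arithmetic on it can fail or give unexpected results.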
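As a rough illustration of tracking down a mixed-type column, the sketch below builds a made-up raw frame in which one price was entered as a string. The column names are assumptions for this example; pd.to_numeric with errors="coerce" is one possible repair, not the only one.

```python
import pandas as pd

# Hypothetical column where a stray string slipped into numeric data.
raw = pd.DataFrame({"price": [9.99, "12.50", 7.25], "qty": [1, 2, 3]})

# Per-dtype column counts, the same tally shown on the "dtypes:" line.
print(raw.dtypes.value_counts())

# Which Python types actually occur inside the suspicious column?
print(raw["price"].map(type).value_counts())

# Coerce to numeric; values that cannot be parsed would become NaN.
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")
print(raw.dtypes)
```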
Memory Usage
When dealing with large datasets, understanding the memory footprint is important.
df.info() provides memory usage details of the DataFrame by default.
For example:
memory usage: 200.0+ bytes
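This shows the total memory used by the DataFrame’s values and index. The trailing + means the figure is a lower bound: for object columns, Pandas counts only the references by default, not the Python objects they point to.
We can get a more accurate total by passing memory_usage='deep':
df.info(memory_usage='deep')
This inspects the contents of object columns, which is slower on large DataFrames but gives a realistic footprint. The output is still a single total; for a per-column breakdown, use df.memory_usage(deep=True).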
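For a per-column view of memory, df.memory_usage() can be called directly. The sketch below uses a made-up frame named big with a repetitive string column, and shows one common way to shrink such a column by converting it to category. Exact byte counts depend on the Pandas and Python versions, so none are quoted here.

```python
import pandas as pd

# Hypothetical frame with one numeric and one repetitive string column.
big = pd.DataFrame({
    "id": range(1000),
    "label": ["item-" + str(i % 10) for i in range(1000)],
})

# Per-column bytes. For object-dtype columns the shallow figure counts
# only references, while deep also counts the strings themselves.
shallow = big.memory_usage()
deep = big.memory_usage(deep=True)
print(shallow)
print(deep)

# A repetitive string column usually shrinks when stored as category.
compact = big.astype({"label": "category"})
compact_usage = compact.memory_usage(deep=True)
print(compact_usage)
```

Calling info() before and after a conversion like this is a quick way to confirm that it actually reduced the total.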
Use Cases and Examples
Now that we’ve seen what df.info() displays, let’s go over some examples of how it can be used for exploratory data analysis.
1. Verify DataFrame structure and metadata
As seen earlier, we can use df.info() after creating a new DataFrame to verify it has the expected index, columns, data types and size. This helps catch any mismatches between assumptions and reality about the DataFrame.
2. Profile new unknown data sources
When loading datasets from new sources, we may not know the structure, data types or size beforehand. df.info() lets us quickly profile the DataFrame to understand the data better.
3. Catch mixed data types
The data types overview in df.info() can help identify mixed types: a column we expect to be numeric but that is reported as object usually contains strings or mixed values. Spotting this early prevents silent errors later when doing computations on such data.
4. Check for missing data
The non-null count in df.info() output can reveal columns with missing values: any count lower than the number of index entries means missing data. This helps plan data cleaning steps like imputation.
5. Estimate memory usage
For big data applications, the memory footprint is important. df.info() provides an estimate of memory usage that helps with sizing and optimization decisions.
6. Monitor memory usage during transformations
We can insert df.info() at various points while transforming data to track how memory usage changes. This helps detect inefficient operations, such as a step that unexpectedly turns a numeric column into object.
7. Compare DataFrames
The df.info() summaries of two DataFrames can be printed and compared side-by-side to understand how their structures differ.
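One way to compare two summaries without reading them by eye is to capture each as text and diff them. This is only a sketch: the helper info_text and the frames before and after are hypothetical names, and difflib from the standard library is just one possible way to do the comparison.

```python
import difflib
import io
import pandas as pd

def info_text(frame):
    """Return the df.info() summary of `frame` as a string."""
    buffer = io.StringIO()
    frame.info(buf=buffer)
    return buffer.getvalue()

# Hypothetical before/after pair: column "a" changes from int64 to float64.
before = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
after = before.astype({"a": "float64"})

# Lines starting with "-" come from `before`, lines with "+" from `after`.
diff = list(difflib.unified_diff(
    info_text(before).splitlines(),
    info_text(after).splitlines(),
    fromfile="before", tofile="after", lineterm="",
))
print("\n".join(diff))
```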
Additional Parameters
We briefly introduced the extra parameters available for df.info() earlier. Let’s look at them in more depth with examples:
verbose
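The verbose parameter controls whether to print the full summary or a shortened one.
df.info(verbose=False)
This omits the per-column table (column names, non-null counts and dtypes) and prints a one-line column range instead. The dtypes summary and memory usage are still shown.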
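To see what the shortened output looks like, the sketch below rebuilds the three-column frame from the overview and captures the non-verbose summary as a string.

```python
import io
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [1.0, 3.0], "C": ["a", "b"]})

# Capture the non-verbose summary so it can be inspected as text.
buffer = io.StringIO()
df.info(verbose=False, buf=buffer)
short = buffer.getvalue()
print(short)
```

The per-column table is replaced by a single line of the form "Columns: 3 entries, A to C".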
buf
We can pass a buffer or file handle to buf to redirect the output to a file or StringIO object.
For example:
import io
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()  # the summary text as a regular string
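The same parameter also accepts an open text file. The sketch below writes the summary to a file in the system temporary directory; the file name df_info.txt is an arbitrary choice for this example.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [1.0, 3.0]})

# Hypothetical output location; any writable text file works.
path = os.path.join(tempfile.gettempdir(), "df_info.txt")
with open(path, "w", encoding="utf-8") as f:
    df.info(buf=f)

# Read the file back to confirm what was written.
with open(path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```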
max_cols
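The max_cols parameter sets the column count above which df.info() switches to the truncated summary. This is useful for wide DataFrames.
For example:
df.info(max_cols=5)
If df has more than 5 columns, this prints the shortened output without the per-column table; otherwise the full summary is printed. It does not limit the output to the first 5 columns.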
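The threshold behavior can be checked on a small wide frame. In this sketch, wide and the helper info_text are hypothetical names introduced for the example.

```python
import io
import pandas as pd

def info_text(frame, **kwargs):
    """Return the df.info() summary as a string, forwarding keyword options."""
    buffer = io.StringIO()
    frame.info(buf=buffer, **kwargs)
    return buffer.getvalue()

# Hypothetical wide frame with 8 columns named c0..c7.
wide = pd.DataFrame({f"c{i}": [0, 1] for i in range(8)})

truncated = info_text(wide, max_cols=5)   # 8 columns > 5: short summary
full = info_text(wide, max_cols=10)       # 8 columns <= 10: full table
print(truncated)
print(full)
```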
memory_usage
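We discussed using memory_usage='deep' earlier to get a more accurate memory figure.
show_counts
Setting show_counts=True forces the Non-Null Count column to be displayed, even for DataFrames large enough that it would be hidden by default, while show_counts=False hides it. Older Pandas releases called this parameter null_counts; that name was removed in pandas 2.0.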
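A short sketch of toggling the counts column, again capturing the output as text. The helper info_text is a hypothetical convenience function, and the example assumes a Pandas version that accepts show_counts (1.2 or later).

```python
import io
import pandas as pd

def info_text(frame, **kwargs):
    """Return the df.info() summary as a string, forwarding keyword options."""
    buffer = io.StringIO()
    frame.info(buf=buffer, **kwargs)
    return buffer.getvalue()

df = pd.DataFrame({"A": [1, 2], "B": [1.0, 3.0]})

with_counts = info_text(df, show_counts=True)
without_counts = info_text(df, show_counts=False)
print(with_counts)
print(without_counts)
```

With show_counts=False the per-column table is still printed, but only with the column names and dtypes.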
How df.info() Works Internally
Under the hood, df.info() works by iterating through the columns of the DataFrame and collecting the index, column and data type details.
The same information is available through these public attributes and methods:
- df.index - To get index details
- df.columns - For column names
- df.dtypes - To get data types
- df.dtypes.value_counts() - For the data types summary (the older df.get_dtype_counts() was removed in pandas 1.0)
- df.count() - For non-null counts
- df.memory_usage() - For memory usage
The output summary string is constructed using this information.
Knowing this helps us understand what df.info() computes, so we can avoid repeating redundant operations in our own code.
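To make the idea concrete, here is a rough, info-like summary assembled only from public attributes. The function mini_info is a hypothetical name, and the sketch is illustrative only: it is not how Pandas implements df.info(), and its output format is deliberately simpler.

```python
import pandas as pd

def mini_info(frame):
    """Build a rough, info-like summary from public DataFrame attributes."""
    lines = [
        f"{type(frame.index).__name__}: {len(frame.index)} entries",
        f"Data columns (total {len(frame.columns)} columns):",
    ]
    non_null = frame.count()
    for name in frame.columns:
        lines.append(f"  {name}: {non_null[name]} non-null, {frame.dtypes[name]}")
    # Tally columns per dtype, similar to the "dtypes:" line.
    dtype_counts = frame.dtypes.astype(str).value_counts().sort_index()
    lines.append("dtypes: " + ", ".join(f"{k}({v})" for k, v in dtype_counts.items()))
    # Shallow total, including the index.
    lines.append(f"memory usage: {frame.memory_usage().sum()} bytes")
    return "\n".join(lines)

df = pd.DataFrame({"A": [1, 2], "B": [1.0, 3.0], "C": ["a", "b"]})
summary = mini_info(df)
print(summary)
```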
Comparison with df.describe()
Both df.info() and df.describe() are used for exploratory data analysis with Pandas. But they provide different types of summaries:
- df.info() - Provides metadata like index, columns, data types.
- df.describe() - Provides summary statistics like mean, quartiles, count etc.
So df.info() complements df.describe() by providing structural metadata rather than statistics.
Here is a comparison:
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.info()  # prints the summary itself and returns None, so no print() is needed
print(df.describe())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       2 non-null      int64
 1   B       2 non-null      int64
dtypes: int64(2)
memory usage: 96.0 bytes
              A         B
count  2.000000  2.000000
mean   1.500000  3.500000
std    0.707107  0.707107
min    1.000000  3.000000
25%    1.250000  3.250000
50%    1.500000  3.500000
75%    1.750000  3.750000
max    2.000000  4.000000
We can see df.info() provides structural metadata like column names, dtypes, index etc. while df.describe() provides a statistical summary like mean, standard deviation etc.
Using both together gives a more comprehensive data profile.
Limitations to be Aware Of
While df.info() is very useful, there are some limitations to keep in mind:
- It provides an overview but not detailed statistics like df.describe().
- Memory usage is an estimate and may vary between systems and Pandas versions; without memory_usage='deep' it undercounts object columns.
- It does not show interactions between columns like correlations.
- The summary is printed rather than returned (the method returns None), so it can only be accessed programmatically by capturing the text through buf.
- There are no plotting capabilities to visualize the overview.
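When the summary is needed as data rather than printed text, a similar table can be assembled from other DataFrame methods. The sketch below is one possible workaround, not a built-in feature; the frame and the column labels dtype, non_null and bytes are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, None], "B": [1.0, 3.0], "C": ["a", "b"]})

# One row per column: dtype, non-null count and memory in bytes.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.count(),
    "bytes": df.memory_usage(index=False, deep=True),
})
print(summary)

# The result is a regular DataFrame, so it can be filtered or sorted.
incomplete = summary[summary["non_null"] < len(df)]
print(incomplete.index.tolist())
```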
Conclusion
In this comprehensive guide, we explored df.info() in depth including its parameters, use cases, internal working and limitations.
The key takeaways are:
- df.info() provides a quick overview of Pandas DataFrames for exploratory analysis.
- It shows details like index, columns, data types and memory usage.
- This allows verifying the DataFrame structure and metadata.
- It helps catch issues with mixed data types, missing data and memory constraints.
- df.info() complements df.describe() by providing structural metadata vs just statistics.
Overall, mastering df.info() provides a simple yet powerful way to understand the shape of DataFrames for effective data analysis in Python. It should be part of every Pandas user’s toolbox.
Hopefully this guide gives you the knowledge to use df.info() for profiling DataFrames confidently. The key is practice - use df.info() liberally when exploring datasets to build intuition. This will enable you to derive insights from data more effectively using Python.