
Reading and Writing Data in Different Formats with Pandas


Pandas is a popular Python library used for data analysis and manipulation. One of its most powerful features is the ability to read and write data between Pandas DataFrames and external data sources like CSV, JSON, SQL databases, and Excel files. This allows data scientists to load data from various sources, process and analyze it using Pandas, and export it to storage or share with others.

This guide will provide a comprehensive overview of reading and writing data in CSV, JSON, SQL, and Excel formats with Pandas’ I/O tools. We will cover the following topics in-depth with example code snippets:

Table of Contents

- Reading CSV Files
- Writing DataFrames to CSV
- Reading JSON Data
- Writing DataFrames to JSON
- Reading Data from SQL Databases
- Writing DataFrames to SQL Databases
- Reading Excel Files
- Writing DataFrames to Excel
- Conclusion

Reading CSV Files

CSV (Comma Separated Values) is one of the most common and simplest data file formats used for storing tabular data. Pandas provides the read_csv() function to load CSV data into a DataFrame.

Here is the basic syntax:

import pandas as pd

df = pd.read_csv('data.csv')

This will read the CSV file ‘data.csv’ in the current directory and create a Pandas DataFrame (df) from its contents.

Below are some important parameters that can be passed to customize how Pandas reads and parses the CSV data:

- sep: field delimiter to use (default ',')
- header: row number to use for column names (default 0, the first line)
- usecols: subset of columns to load, by name or position
- dtype: data type(s) to apply to columns
- parse_dates: column(s) to parse as dates
- na_values: additional strings to recognize as missing values
- skiprows: rows to skip at the start of the file
- nrows: number of rows to read
- index_col: column(s) to use as the row index

For example, to read only the ‘Date’ and ‘Value’ columns from ‘data.csv’ into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv', usecols=['Date', 'Value'])

The usecols parameter accepts either column names or zero-based column indices, so you can selectively load only the data you need.

To read files that use a different delimiter, specify it with sep:

df = pd.read_csv('data.txt', sep='\t') # use tabs as delimiter

For fine-grained control over data types, use dtype:

dtype = {'Date': 'string', 'Value': 'float', 'Name': 'category'}
df = pd.read_csv('data.csv', dtype=dtype)

This parses ‘Date’ as a string, ‘Value’ as a floating point number, and ‘Name’ as a categorical column.
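
Several of these options are often combined. Here is a minimal sketch, assuming a hypothetical ‘data.csv’ that has a ‘Date’ column and uses ‘N/A’ to mark missing values:

import pandas as pd

# Hypothetical file: a 'Date' column plus other fields, with 'N/A' for missing data.
df = pd.read_csv(
    'data.csv',
    parse_dates=['Date'],   # parse 'Date' into datetime64 values
    na_values=['N/A'],      # treat 'N/A' as NaN, in addition to the defaults
    nrows=1000,             # read only the first 1000 data rows
)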

Writing DataFrames to CSV

The to_csv() method writes a Pandas DataFrame to a CSV file.

Basic usage:

df.to_csv('data.csv')

This writes the DataFrame df to ‘data.csv’ with the default options, including the row index.

Here are some common parameters to customize the output:

- index: whether to write the row index (default True)
- header: whether to write column names (default True)
- columns: subset of columns to write
- sep: field delimiter (default ',')
- na_rep: string to use for missing values (default empty string)
- mode: file write mode, 'w' to overwrite or 'a' to append

For example, to write a DataFrame without row indexes or column headers:

df.to_csv('data.csv', index=False, header=False)

You can also pass an open file object (or any writable buffer) to receive the CSV output:

with open('data.csv', 'w') as f:
    df.to_csv(f)

Specific columns can be written by passing the desired column names to to_csv():

df.to_csv('data.csv', columns=['Date', 'Value', 'Name'])
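
The delimiter, missing-value text, and write mode can be combined as well. A short sketch (the file names are illustrative):

# Write a tab-separated file with missing values rendered as 'NULL'.
df.to_csv('data.tsv', sep='\t', na_rep='NULL')

# Append to an existing CSV without repeating the column headers.
df.to_csv('data.csv', mode='a', header=False)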

Reading JSON Data

JSON (JavaScript Object Notation) is a common, human-readable data exchange format. Pandas provides read_json() to read JSON from files, URLs, or file-like objects into DataFrames.

Basic syntax:

df = pd.read_json('data.json')

This reads the ‘data.json’ file and creates a DataFrame.

Key parameters for read_json():

- orient: the expected JSON structure (see below)
- typ: whether to return a 'frame' (default) or 'series'
- convert_dates: list of columns to parse as dates (by default, date-like columns are converted)
- dtype: data types to apply to columns (or False to skip type inference)
- lines: read the file as one JSON object per line

The orient parameter handles different JSON formats:

- 'split': dict like {'index': [...], 'columns': [...], 'data': [...]}
- 'records': list like [{column: value}, ...]
- 'index': dict like {index: {column: value}}
- 'columns': dict like {column: {index: value}} (the default for DataFrames)
- 'values': just the values array
- 'table': JSON Table Schema

For example:

df = pd.read_json('data.json', orient='index')

This reads a JSON file whose top-level keys are row index labels.

To parse date strings:

df = pd.read_json('data.json', convert_dates=['Date'])
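
To make the orient behavior concrete, here is a self-contained sketch that parses a records-oriented JSON string (wrapped in StringIO, since recent pandas versions expect a path or file-like object rather than a literal string):

from io import StringIO
import pandas as pd

# Two records in 'records' orientation: a list of {column: value} dicts.
data = '[{"Date": "2023-01-01", "Value": 10}, {"Date": "2023-01-02", "Value": 20}]'

df = pd.read_json(StringIO(data), orient='records', convert_dates=['Date'])
print(df.dtypes)  # 'Date' is datetime64[ns], 'Value' is int64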

Writing DataFrames to JSON

The to_json() method converts a DataFrame to a JSON string.

Basic usage:

json_str = df.to_json()

This converts df to a JSON string.

Optional parameters:

- orient: output structure, with the same options as read_json (default 'columns' for DataFrames)
- date_format: 'epoch' (milliseconds since the epoch, the default) or 'iso' (ISO 8601)
- double_precision: decimal places used for floating point values (default 10)
- lines: write newline-delimited JSON (requires orient='records')
- index: whether to include the index (can only be disabled for orient='split' or 'table')

For example, to write JSON string in ‘records’ format:

df.to_json(orient='records')

To write the DataFrame to a JSON file instead, pass a file path:

df.to_json('data.json')

The index can be dropped (only valid for orient='split' or 'table'), and date_format='iso' writes dates in ISO 8601 format instead of the default epoch milliseconds:

df.to_json('data.json', orient='split', index=False, date_format='iso')
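
For log-style or streaming data, pandas also supports newline-delimited JSON (JSON Lines) in both directions. A brief sketch:

# Write one JSON object per line; lines=True requires orient='records'.
df.to_json('data.jsonl', orient='records', lines=True)

# Read it back line by line.
df = pd.read_json('data.jsonl', lines=True)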

Reading Data from SQL Databases

Pandas can connect to and read data from SQL databases using the read_sql() and read_sql_table() functions. This requires a SQLAlchemy engine connected to the database.

Import required libraries:

import pandas as pd
from sqlalchemy import create_engine

Create a SQLAlchemy engine to connect to SQLite:

engine = create_engine('sqlite:///database.db')

Read entire SQL table into DataFrame:

df = pd.read_sql_table('data', engine)

This reads the ‘data’ table from the connected database.

For read_sql(), pass a SQL query to read data:

query = "SELECT * FROM data WHERE id > 100"
df = pd.read_sql(query, engine)

Key parameters:

- sql: a SQL query string or a table name
- con: SQLAlchemy engine or connection
- index_col: column(s) to set as the DataFrame index
- params: parameters to bind into the query
- parse_dates: column(s) to parse as dates
- chunksize: if set, return an iterator of DataFrames with this many rows each
- columns: columns to select (only applies when reading a whole table)

For example, to read specific columns and set an index. Note that the columns parameter only applies when reading a whole table, so read_sql_table is used here; with a raw query, you would list the columns in the SELECT instead:

df = pd.read_sql_table('data', engine,
                       columns=['id', 'name', 'value'], index_col='id')
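
Two other options worth knowing are params, for binding values into a query safely, and chunksize, for processing large results in pieces. A sketch against the same engine (the table and column names are illustrative):

from sqlalchemy import text

# Bind the value as a parameter instead of interpolating it into the string.
df = pd.read_sql(text('SELECT * FROM data WHERE id > :min_id'),
                 engine, params={'min_id': 100})

# Stream a large result set as an iterator of 10,000-row DataFrames.
for chunk in pd.read_sql('SELECT * FROM data', engine, chunksize=10000):
    process(chunk)  # process() is a placeholder for your own logic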

Writing DataFrames to SQL Databases

Pandas can write DataFrame contents to a table in a SQL database using the to_sql() method.

Basic usage:

df.to_sql('data', engine)

This writes df to the ‘data’ table in the connected SQLite database. By default, to_sql raises a ValueError if the table already exists (if_exists='fail').

Key parameters:

- name: target table name
- con: SQLAlchemy engine or connection
- if_exists: behavior when the table exists: 'fail' (default), 'replace', or 'append'
- index: whether to write the DataFrame index as a column (default True)
- dtype: SQLAlchemy types to use per column
- chunksize: write rows in batches of this size

For example, to append new rows to an existing table:

df.to_sql('data', engine, if_exists='append')

Specifying data types (this requires importing sqlalchemy):

import sqlalchemy

dtypes = {'id': sqlalchemy.types.BIGINT(),
          'name': sqlalchemy.types.VARCHAR(length=255),
          'value': sqlalchemy.types.FLOAT()}

df.to_sql('data', engine, dtype=dtypes)
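
Here is a minimal end-to-end sketch using an in-memory SQLite database, so it runs without any setup:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')

df = pd.DataFrame({'name': ['a', 'b'], 'value': [1.0, 2.0]})

# Write without the index column, replacing the table if it already exists.
df.to_sql('data', engine, if_exists='replace', index=False)

# Read it back to verify the round trip.
print(pd.read_sql_table('data', engine))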

Reading Excel Files

Pandas provides read_excel() to read data from both legacy .xls and modern .xlsx workbooks (reading .xlsx requires the openpyxl package). The ExcelFile class is handy when you need several sheets from the same file, since it parses the file only once.

Import Pandas and create an ExcelFile:

import pandas as pd

xlsx = pd.ExcelFile('data.xlsx')

To parse a specific sheet, by name or index, into a DataFrame:

df = pd.read_excel(xlsx, 'Sheet1')

Key parameters:

- sheet_name: which sheet(s) to read, by name or zero-based index (default 0); None reads all sheets into a dict
- header: row to use for column names (default 0)
- index_col: column(s) to use as the row index
- usecols: subset of columns to read
- skiprows: rows to skip at the top of the sheet

For example, to get the third sheet (sheet indices are zero-based):

df = pd.read_excel(xlsx, sheet_name=2)

Reading with the second row as the header and the first column as the index:

df = pd.read_excel(xlsx, header=1, index_col=0)
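
Passing sheet_name=None reads every sheet in the workbook at once, returning a dict that maps sheet names to DataFrames:

# Read all sheets into a dict of {sheet name: DataFrame}.
sheets = pd.read_excel('data.xlsx', sheet_name=None)

for name, sheet_df in sheets.items():
    print(name, sheet_df.shape)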

Writing DataFrames to Excel

Pandas can export DataFrames to Excel using the to_excel() method.

Basic usage:

df.to_excel('data.xlsx')

This writes df to the file ‘data.xlsx’ with the default sheet name ‘Sheet1’.

Optional parameters:

- sheet_name: name of the sheet to write (default 'Sheet1')
- index: whether to write the row index (default True)
- header: whether to write column names (default True)
- startrow / startcol: zero-based upper-left cell to start writing at
- na_rep: string to use for missing values

To write to multiple sheets:

# Use ExcelWriter as a context manager; the file is saved on exit.
# (The standalone writer.save() call was removed in pandas 2.0.)
with pd.ExcelWriter('data.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')

Specifying start row and column:

df.to_excel('data.xlsx', index=False, header=False,
            startrow=1, startcol=1)

This writes df without header or index starting at cell B2.
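
startrow and startcol also make it possible to place several tables on one sheet. A sketch assuming two DataFrames, df1 and df2 (the three-row gap is an arbitrary layout choice):

with pd.ExcelWriter('data.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    # Start the second table below the first, leaving two blank rows.
    df2.to_excel(writer, sheet_name='Sheet1', startrow=len(df1) + 3)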

Conclusion

In this guide, we covered how to use Pandas to load and save data between DataFrames and various external sources like CSV, JSON, SQL, and Excel seamlessly. Pandas I/O tools like read_csv, to_json, read_sql, and to_excel enable easy data import and export so data scientists can access data from different sources and share it with others in common formats.

The key is to understand the different parameters and options available to handle variations in formats, structure, and data types across these sources. With a good grasp of the Pandas I/O API, you can efficiently move data between DataFrames and files for analysis.