Pandas is a popular Python library used for data analysis and manipulation. One of its most powerful features is the ability to read and write data between Pandas DataFrames and external data sources like CSV, JSON, SQL databases, and Excel files. This allows data scientists to load data from various sources, process and analyze it using Pandas, and export it to storage or share with others.
This guide will provide a comprehensive overview of reading and writing data in CSV, JSON, SQL, and Excel formats with Pandas’ I/O tools, covering each format in depth with example code snippets.
Reading CSV Files
CSV (Comma Separated Values) is one of the most common and simplest file formats for storing tabular data. Pandas provides the read_csv() function to load CSV data into a DataFrame.
Here is the basic syntax:
df = pandas.read_csv('data.csv')
This will read the CSV file ‘data.csv’ in the current directory and create a Pandas DataFrame (df) from its contents.
Below are some important parameters that can be passed to customize how Pandas reads and parses the CSV data:
sep - Delimiter that separates the values on each row. Comma (,) is the default.
header - Row number to use as column names (headers), default is 0 (first row). Pass None if there is no header row.
names - List of column names to use as the header.
index_col - Column number or name to use as the DataFrame index.
usecols - List of column names or indices to read from the CSV.
squeeze - Return the output as a Series if it has only one column, default False.
dtype - Dictionary specifying the data type for each column, e.g. {'col1': 'float64', 'col2': 'object'}.
converters - Dict specifying functions to convert values in certain columns, e.g. {'col1': float}.
skiprows - Number of rows to skip from the start of the file.
na_values - Scalar or list of strings to recognize as NaN/NA values.
For example, to read only the ‘Date’ and ‘Value’ columns from ‘data.csv’ into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv', usecols=['Date', 'Value'])
You can also pass a list of column names or column indices to selectively load data.
To handle irregular delimited files, specify the delimiter:
df = pd.read_csv('data.txt', sep='\t') # use tabs as delimiter
For fine-grained control over data types, use dtype:
dtype = {'Date': 'string', 'Value': 'float', 'Name': 'category'}
df = pd.read_csv('data.csv', dtype=dtype)
This parses the ‘Date’ column as strings, ‘Value’ as floating point numbers, and ‘Name’ as a categorical column.
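Several of the other parameters listed above can be combined in a single call. Here is a minimal sketch, assuming a headerless file ‘raw.csv’ whose first two lines are comments; the file name and column names are illustrative:
# Skip two comment lines, supply column names ourselves, treat 'N/A' and '-'
# as missing values, and use the Date column as the index
df = pd.read_csv('raw.csv',
                 header=None,
                 names=['Date', 'Value', 'Name'],
                 skiprows=2,
                 na_values=['N/A', '-'],
                 index_col='Date')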
Writing DataFrames to CSV
The to_csv() method writes a Pandas DataFrame to a CSV file.
Basic usage:
df.to_csv('data.csv')
This writes the DataFrame df to ‘data.csv’ without any options.
Here are some common parameters to customize the output:
sep - Delimiter for values, default ','.
na_rep - Representation for NA values, default is an empty string.
index - Whether to write the row names/index, default True.
header - Whether to write out the column names, default True.
path_or_buf - File path or object where the CSV output should be written.
compression - Compression type to use for the file. Allowed values are 'gzip', 'bz2', 'xz', and 'zip'.
For example, to write a DataFrame without row indexes or column headers:
df.to_csv('data.csv', index=False, header=False)
You can also pass a file buffer object to write the CSV output:
with open('data.csv', 'w') as f:
df.to_csv(f)
Specific columns can be written by passing the required column names to to_csv():
df.to_csv('data.csv', columns=['Date', 'Value', 'Name'])
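The na_rep and compression options shown above can be combined as well. A short hedged sketch that writes missing values as ‘NULL’ and gzip-compresses the output (the file name is illustrative):
# Write NaN cells as the literal string 'NULL' and compress with gzip
df.to_csv('data.csv.gz', na_rep='NULL', compression='gzip', index=False)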
Reading JSON Data
JSON (JavaScript Object Notation) is a common, human-readable data exchange format. Pandas provides read_json() to read JSON files or strings into DataFrames.
Basic syntax:
df = pd.read_json('data.json')
This reads ‘data.json’ file and creates a DataFrame.
Key parameters for read_json():
path_or_buf - File path, URL, or JSON string to read.
orient - Indication of the expected JSON string format. Possible values are 'split', 'records', 'index', 'columns', and 'values'. Default is 'columns'.
dtype - Data type to force on DataFrame columns.
convert_dates - Attempt to parse dates. Default True.
lines - Read the file as line-delimited JSON, with one record per line.
The orient parameter handles different JSON formats:
split - Dict-like with index/columns split into separate attributes.
records - List-like with each item a row.
index - Dict-like with keys as the row index.
columns - Dict-like with keys as the column names.
values - Just the values array.
For example:
df = pd.read_json('data.json', orient='index')
This reads a JSON file with row index keys.
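By contrast, a records-style payload is simply a list of row objects. A minimal sketch with made-up data (io.StringIO is used because passing raw JSON strings directly to read_json is deprecated in recent Pandas versions):
from io import StringIO

json_str = '[{"Date": "2024-01-01", "Value": 1.5}, {"Date": "2024-01-02", "Value": 2.0}]'
df = pd.read_json(StringIO(json_str), orient='records')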
To parse date strings:
df = pd.read_json('data.json', convert_dates=['Date'])
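For line-delimited JSON, where each line of the file is a standalone record (a format produced by many logging and streaming tools), set lines=True together with the ‘records’ orient. A hedged sketch assuming a file named ‘events.jsonl’:
# Each line of events.jsonl is a separate JSON object
df = pd.read_json('events.jsonl', orient='records', lines=True)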
Writing DataFrames to JSON
The to_json() method converts a DataFrame to a JSON string.
Basic usage:
json_str = df.to_json()
This converts df to a JSON string.
Optional parameters:
path_or_buf - File path or object to write the JSON output to. By default the output is returned as a string.
orient - Format of the returned JSON string. Possible values are the same as for read_json().
date_format - How to encode datetime objects: 'epoch' (milliseconds since the epoch, the default) or 'iso' (ISO 8601 strings).
lines - Write out line-delimited JSON, with each row on a new line.
compression - Compression type for file output - 'gzip', 'bz2', 'zip' or 'xz'.
For example, to write JSON string in ‘records’ format:
df.to_json(orient='records')
To write DataFrame into a JSON file:
df.to_json('data.json')
The row index can be dropped by switching to the ‘records’ orient, and dates can be written as ISO 8601 strings instead of the default epoch milliseconds:
df.to_json('data.json', orient='records', date_format='iso')
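Line-delimited output works the same way in reverse, pairing the ‘records’ orient with lines=True; the compressed file name below is illustrative:
# One JSON object per line, gzip-compressed via the compression argument
df.to_json('data.jsonl.gz', orient='records', lines=True, compression='gzip')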
Reading Data from SQL Databases
Pandas can connect to and read data from SQL databases using the read_sql() and read_sql_table() functions. This requires a SQLAlchemy engine that connects to the database.
Import required libraries:
import pandas as pd
from sqlalchemy import create_engine
Create a SQLAlchemy engine to connect to SQLite:
engine = create_engine('sqlite:///database.db')
Read entire SQL table into DataFrame:
df = pd.read_sql_table('data', engine)
This reads the ‘data’ table from the connected database.
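If you want a reproducible sandbox for these examples, you can first create a small throwaway ‘data’ table (this uses the to_sql() method covered in the next section; the contents are purely illustrative):
# Build a tiny table so the read examples have something to query
sample = pd.DataFrame({'id': [1, 101, 102],
                       'name': ['a', 'b', 'c'],
                       'value': [1.0, 2.5, 3.7]})
sample.to_sql('data', engine, if_exists='replace', index=False)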
For read_sql(), pass a SQL query to read data:
query = "SELECT * FROM data WHERE id > 100"
df = pd.read_sql(query, engine)
Key parameters:
sql - SQL query string (or table name) to execute.
con - SQLAlchemy engine object that handles the database connection.
index_col - Column to set as the index.
columns - Subset of columns to read from the table (only used when a table name, not a query, is passed).
coerce_float - Attempt to convert non-string, non-numeric values (such as decimal.Decimal) to floating point. Default True.
params - List or dict of parameters if the query uses variable substitution.
For example, to read specific columns and set the index (since the columns argument only applies when reading a whole table, select the columns in the query itself):
df = pd.read_sql('SELECT id, name, value FROM data', engine, index_col='id')
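The params argument keeps variable values out of the SQL string itself. A minimal sketch using SQLAlchemy’s text() construct with a named placeholder (the threshold value is illustrative):
from sqlalchemy import text

# Bind :min_id safely instead of formatting it into the query string
query = text("SELECT * FROM data WHERE id > :min_id")
df = pd.read_sql(query, engine, params={"min_id": 100})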
Writing DataFrames to SQL Databases
Pandas can write DataFrame contents to a table in a SQL database using the to_sql() method.
Basic usage:
df.to_sql('data', engine)
This writes df to the ‘data’ table in the connected SQLite database. By default (if_exists='fail') an error is raised if the table already exists.
Key parameters:
name - Table name.
con - SQLAlchemy engine.
schema - Schema name to use.
if_exists - What to do if the table exists: 'fail' (default), 'replace', or 'append'.
index - Whether to write the DataFrame index to the table.
index_label - Column name for the index.
chunksize - Number of rows to write at a time.
dtype - Dict specifying SQL data types for columns.
For example, to append new rows into existing table:
df.to_sql('data', engine, if_exists='append')
Specifying data types (note that the sqlalchemy package must be imported to access its type objects):
import sqlalchemy

dtypes = {'id': sqlalchemy.types.BIGINT(),
          'name': sqlalchemy.types.VARCHAR(length=255),
          'value': sqlalchemy.types.FLOAT()}
df.to_sql('data', engine, dtype=dtypes)
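For large DataFrames it is often worth writing in batches and skipping the index. A brief hedged sketch combining chunksize with if_exists:
# Replace any existing table, write 10,000 rows per batch, omit the index
df.to_sql('data', engine, if_exists='replace', index=False, chunksize=10000)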
Reading Excel Files
Pandas provides read_excel() to read data from both legacy Excel (.xls) and modern Excel (.xlsx) files; an appropriate engine such as openpyxl (for .xlsx) must be installed. You can pass a file path directly, or wrap the workbook in an ExcelFile object, which is convenient when parsing several sheets from the same file.
Import Pandas and create an ExcelFile:
import pandas as pd
xlsx = pd.ExcelFile('data.xlsx')
To parse a specific sheet by index/name into DataFrame:
df = pd.read_excel(xlsx, 'Sheet1')
Key parameters:
io - File path, URL, file buffer, or ExcelFile object to read.
sheet_name - Name or index of the sheet to parse.
header - Row number to use as column names, default 0.
names - List of column names to use.
index_col - Column number or name to use as the row index.
usecols - Subset of columns to parse.
squeeze - Return a Series instead of a DataFrame if the data has only one column, default False.
dtype - Data type for columns.
converters - Dict of functions to convert values in specific columns.
For example, to get third sheet:
df = pd.read_excel(xlsx, sheet_name=2)
Reading with header and index:
df = pd.read_excel(xlsx, header=1, index_col=0)
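Passing sheet_name=None parses every sheet at once and returns a dict keyed by sheet name, which is convenient when a workbook holds several related tables. A short sketch:
# Dict mapping sheet names to DataFrames
sheets = pd.read_excel('data.xlsx', sheet_name=None)
for name, frame in sheets.items():
    print(name, frame.shape)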
Writing DataFrames to Excel
Pandas can export DataFrames to Excel using the to_excel() method.
Basic usage:
df.to_excel('data.xlsx')
This writes df to the ‘data.xlsx’ file, using the default sheet name ‘Sheet1’.
Optional parameters:
excel_writer - File path or ExcelWriter object for the output.
sheet_name - Name of the sheet to write to.
na_rep - Representation for missing data, default is an empty string.
float_format - Format string for floating point numbers.
columns - Columns to write.
header - Whether to write out the column names.
index - Whether to write the row names.
index_label - Column label for the index column.
startrow - Upper left cell row at which to start writing the DataFrame.
startcol - Upper left cell column at which to start writing the DataFrame.
engine - Engine to use to write the Excel file, e.g. 'openpyxl' or 'xlsxwriter'.
To write to multiple sheets, use an ExcelWriter. As a context manager it saves and closes the file automatically (older Pandas versions used an explicit writer.save() call):
with pd.ExcelWriter('data.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')
Specifying start row and column:
df.to_excel('data.xlsx', index=False, header=False,
startrow=1, startcol=1)
This writes df without a header or index, starting at cell B2.
Conclusion
In this guide, we covered how to use Pandas to move data between DataFrames and various external sources like CSV, JSON, SQL, and Excel. Pandas I/O tools like read_csv, to_json, read_sql, and to_excel enable easy data import and export, so data scientists can access data from different sources and share it with others in common formats.
The key is to understand the different parameters and options available to handle variations in formats, structure, and data types across these sources. With a good grasp of the Pandas I/O API, you can efficiently move data between DataFrames and files for analysis.