Social Media Sentiment Analysis with Pandas

Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in text data in order to determine the writer’s attitude towards a particular topic, product, etc. This powerful technique allows us to extract insights from social media, reviews, survey responses, and other textual data to understand public sentiment.

In this comprehensive guide, we will demonstrate how to perform social media sentiment analysis in Python using the popular Pandas library.

Table of Contents

Overview
Importing Packages
Loading Data
Data Cleaning and Preprocessing
Performing Sentiment Analysis
Visualizing Insights
Practical Applications
Conclusion

Overview

Sentiment analysis involves the following key steps:

Collecting data - Gather textual data from sources like Twitter, Facebook, surveys, etc.
Preprocessing data - Clean and normalize the text data to prepare it for analysis. Tasks include converting to lowercase, removing stopwords, spelling correction, lemmatization, etc.
Analyzing sentiment - Classify the sentiment of each text unit as positive, negative or neutral using rule-based algorithms, machine learning models, or both.
Visualizing results - Create plots and graphs to effectively communicate insights from the sentiment analysis.

We will use Pandas, an open-source Python library, to demonstrate each of these steps in detail. Pandas provides fast, flexible data structures and data analysis tools that make it perfectly suited for processing textual data and time series.

Key Pandas features that are especially useful for sentiment analysis include:

DataFrames - Column-oriented data structures with labeled rows and columns, ideal for working with structured data.
Data cleaning - Tools like drop(), dropna(), etc. for preparing messy real-world data.
Merge/join - Combine datasets and manage relational data.
GroupBy - Split, apply and combine data in groups. Useful for grouping text by categories.
Time series - Powerful tools for working with time series data like timestamps.
Visualization - Seamless integration with Matplotlib to create informative plots and charts.

By leveraging these Pandas capabilities, we can efficiently ingest textual data from sources like Twitter, process and analyze the text, and then visualize the sentiment trends.
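As a rough illustration of a few of the features listed above, the sketch below cleans, joins, and groups a tiny hand-made dataset. The column names (post_id, source, channel_type) and the values are hypothetical and are not part of the tweet dataset used later in this guide.

```python
import pandas as pd

# Hypothetical toy data: three posts, one with missing text
posts = pd.DataFrame({
    "post_id": [1, 2, 3],
    "source": ["twitter", "twitter", "survey"],
    "text": ["love it", None, "not great"],
})

# Hypothetical lookup table used to demonstrate a merge
sources = pd.DataFrame({
    "source": ["twitter", "survey"],
    "channel_type": ["social", "direct"],
})

# Data cleaning: drop rows whose text is missing
clean = posts.dropna(subset=["text"])

# Merge/join: attach the channel type to each remaining post
merged = clean.merge(sources, on="source", how="left")

# GroupBy: count posts per channel type
counts = merged.groupby("channel_type").size()
print(counts)
```

The same three calls (dropna, merge, groupby) scale from this toy example to much larger tables without changes to the code.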

Importing Packages

Let’s first import Pandas and Matplotlib along with some other packages we’ll need:

import pandas as pd
import matplotlib.pyplot as plt

import re   # regular expressions
import string
from wordcloud import WordCloud

We import Pandas as pd and Matplotlib’s pyplot module as plt for convenience. We’ll also need the re (regular expressions) and string modules from Python’s standard library for text processing. The WordCloud class from the third-party wordcloud package lets us generate word clouds.

Loading Data

For this analysis, we will use a dataset of tweets containing the hashtag “#vaccines” scraped from Twitter using the GetOldTweets3 library. The data is stored in a CSV file called vaccine_tweets.csv.

Let’s load it into a Pandas DataFrame:

tweets_df = pd.read_csv('vaccine_tweets.csv', parse_dates=['date'])

We pass parse_dates=['date'] so that the ‘date’ column is parsed into Pandas datetime values (datetime64) while reading the CSV, rather than being left as plain strings.
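To see the effect of parse_dates without the real file, here is a minimal sketch that reads a small in-memory CSV. The two rows are made up for illustration and only stand in for vaccine_tweets.csv.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for vaccine_tweets.csv
csv_text = (
    "tweet_id,date,text\n"
    "1,2022-06-26 21:57:17,first tweet\n"
    "2,2022-06-25 22:06:20,second tweet\n"
)

sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# The column now has a datetime64 dtype, so the .dt accessor is available
print(sample_df["date"].dtype)
print(sample_df["date"].dt.day_name().tolist())
```

Having a real datetime column is what later makes it possible to group or floor timestamps by day.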

Let’s inspect the DataFrame using .head():

tweets_df.head()

The output is:

	tweet_id	date	username	text	is_retweet	nlikes	nreplies	nretweets	is_quoted
0	1414678193484544029	2022-06-26 21:57:17	strudel_pastry	@Reuters So tragic. 💔	False	2	0	0	False
1	1414678193460183041	2022-06-26 21:57:16	vaxfax	With a heavy heart, we mourn the loss of this precious child. Words cannot express our sorrow. Our thoughts are with the family.	False	0	0	0	False
2	1414678193427324928	2022-06-26 21:57:16	prohealth	Another precious life lost. When will we start taking vaccine injuries seriously? Our hearts break for this child and family. 💔	False	1	0	0	False
3	1414056318487949315	2022-06-25 22:06:20	VaccineEffects	Another child dead after vaccination. How many more have to die before the madness stops? https://t.co/ABC123 #vaccineinjury #LearnTheRisk	True	23	2	3	False
4	1385940217791684608	2022-04-22 00:56:27	VaccineTruth	Sadly, another child lost to vaccines. Condolences to the family. When will these products be banned? #vaccineinjury https://t.co/XYZ456	False	10	1	0	False

Note: To protect user privacy, the data shown in this guide is fictional, including usernames, tweet IDs, and URLs.

This gives us a sample of the dataset. Each row contains metadata like the tweet ID, timestamp, username, text content of the tweet, number of likes/replies, etc. This metadata provides useful context.

Our main text data is contained in the ‘text’ column. Now let’s start preprocessing this text for sentiment analysis.

Data Cleaning and Preprocessing

Real-world textual data tends to be messy and needs cleaning before analysis. For social media data, some common tasks include:

Remove URLs, handles, hashtags, emojis, etc.
Convert text to lowercase
Fix spelling errors
Remove punctuation
Remove stopwords like ‘a’, ‘and’, ‘the’
Lemmatize words to their base form

Let’s define a function preprocess_text() to perform these cleaning tasks:

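# mispell_fix(), remove_stopwords() and lemmatize_text() are helper
# functions that must be defined before preprocess_text() is called.

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r"http\S+", "", text)

    # Remove user handles
    text = re.sub(r"@\w+", "", text)

    # Remove hashtags
    text = re.sub(r"#\w+", "", text)

    # Remove emojis and other symbols: this drops every character that is
    # not a word character or whitespace (punctuation other than "_" too)
    text = re.sub(r"[^\w\s]", "", text)

    # Lowercase
    text = text.lower()

    # Fix spelling
    text = mispell_fix(text)

    # Remove any remaining ASCII punctuation (e.g. underscores, which \w keeps)
    text = text.translate(str.maketrans("", "", string.punctuation))

    # Remove stopwords
    text = remove_stopwords(text)

    # Lemmatize
    text = lemmatize_text(text)

    return text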

This function takes raw text as input, performs the cleaning and preprocessing steps listed above, and returns cleaned text ready for sentiment analysis. Spelling correction, stopword removal and lemmatization are delegated to separate helper functions (mispell_fix(), remove_stopwords() and lemmatize_text()), which are not shown here and must be defined before preprocess_text() is called.
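The guide does not show the three helper functions, so the following is only one possible set of stand-ins that makes preprocess_text() callable. The word lists and suffix rules are hypothetical and deliberately tiny; a real pipeline would more likely rely on a library such as NLTK or spaCy for stopwords and lemmatization.

```python
# Hypothetical stand-ins for the helpers used by preprocess_text().
# The word lists and suffix rules below are illustrative only.

STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in"}
SPELLING_FIXES = {"vaxx": "vaccine", "recieve": "receive", "teh": "the"}


def mispell_fix(text):
    """Replace known misspellings using a small lookup table."""
    return " ".join(SPELLING_FIXES.get(word, word) for word in text.split())


def remove_stopwords(text):
    """Drop tokens that appear in the stopword set."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def lemmatize_text(text):
    """Crude suffix stripping as a placeholder for real lemmatization."""
    def strip_suffix(word):
        if word.endswith("ies") and len(word) > 4:
            return word[:-3] + "y"
        if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
            return word[:-1]
        return word
    return " ".join(strip_suffix(word) for word in text.split())


print(lemmatize_text(remove_stopwords(mispell_fix("teh vaccines are safe"))))
```

The suffix rule is intentionally naive (it would also shorten words like "this"), which is why it should be treated as a placeholder rather than a usable lemmatizer.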

Let’s apply preprocess_text() to the ‘text’ column in our DataFrame:

tweets_df['text'] = tweets_df['text'].apply(preprocess_text)

This overwrites the text column in place with the cleaned text, so the original raw tweets are no longer available in the DataFrame. To keep them, assign the result to a new column instead.
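If the raw tweets are needed later (for example, to show original wording next to a sentiment label), the cleaned text can be written to a separate column. In this sketch, simple_clean() and the column name clean_text are hypothetical; simple_clean() merely stands in for preprocess_text().

```python
import pandas as pd


def simple_clean(text):
    # Hypothetical stand-in for preprocess_text(): lowercase and trim
    return text.lower().strip()


demo_df = pd.DataFrame({"text": ["  So TRAGIC ", "Feeling SAFE now"]})

# Write the cleaned text to a new column; the 'text' column is left untouched
demo_df["clean_text"] = demo_df["text"].apply(simple_clean)
print(demo_df)
```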

Our data is now ready for sentiment analysis!

Performing Sentiment Analysis

Sentiment analysis involves classifying each text unit as positive, negative or neutral based on its content. For social media, the text unit is usually a post, tweet, comment, etc.

Let’s define a function analyze_sentiment() that takes preprocessed text as input and returns the predicted sentiment:

from textblob import TextBlob

def analyze_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity == 0:
        return 'Neutral'
    else:
        return 'Negative'

Here we use the TextBlob library, which offers simple built-in sentiment analysis. The sentiment.polarity score returned by TextBlob ranges from -1 (most negative) to 1 (most positive). We use zero as a fixed cutoff: scores above zero are labeled positive, scores below zero negative, and scores of exactly zero neutral.
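A cutoff of exactly zero means that even a very weak score is labeled positive or negative. One possible variation, sketched below, treats a small band around zero as neutral. The function name label_polarity() and the default band of 0.05 are assumptions chosen for illustration, not values from TextBlob or from the original analysis.

```python
def label_polarity(polarity, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a sentiment label.

    neutral_band is a hypothetical tolerance: scores whose absolute
    value is at or below it are treated as neutral.
    """
    if polarity > neutral_band:
        return 'Positive'
    if polarity < -neutral_band:
        return 'Negative'
    return 'Neutral'


for score in (0.6, 0.02, -0.3):
    print(score, label_polarity(score))
```

Because the function takes a plain number, it can be applied to polarity scores from any scorer, and the band width can be tuned against a hand-labeled sample.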

Let’s apply this to our cleaned tweets:

tweets_df['sentiment'] = tweets_df['text'].apply(analyze_sentiment)

This adds a new sentiment column containing the predicted sentiment category for each tweet.
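A quick way to sanity-check the new column is to look at how the labels are distributed. The sketch below uses a small hypothetical Series in place of tweets_df['sentiment'].

```python
import pandas as pd

# Hypothetical labels standing in for tweets_df['sentiment']
labels = pd.Series(['Positive', 'Negative', 'Negative', 'Neutral'],
                   name='sentiment')

counts = labels.value_counts()                # absolute counts per label
shares = labels.value_counts(normalize=True)  # fractions that sum to 1
print(counts)
print(shares.round(2))
```

A distribution dominated by a single label can be a hint that the preprocessing or the cutoff needs another look.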

Let’s inspect tweets with positive sentiment:

print(tweets_df[tweets_df['sentiment']=='Positive']['text'].sample(5))

Example output:

wish received vaccine earlier prevent covid pray swift recovery
grateful covid vaccine keep outbreak school low
relief get covid vaccine feel safer now
getting vaccine open live shows makes happy
blessed received covid vaccine feel grateful incentive program

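In these samples, the tweets labeled positive do read as favorable toward vaccines, although a handful of examples is a spot check rather than a measure of accuracy.

In the same way, we can examine the tweets labeled negative, many of which express anti-vaccine viewpoints. Keep in mind that negative wording does not always indicate an anti-vaccine stance; a tweet mourning a loss, for example, can score as negative regardless of the author’s position.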

Visualizing Insights

Now that we have extracted sentiments from the tweets, let’s analyze and visualize the results to derive insights.

Pandas integrates beautifully with Matplotlib to create informative plots from data. Let’s visualize sentiment frequencies over time:

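# Count tweets per calendar day and sentiment (raw timestamps are per-second);
# the outer parentheses let the method chain span several lines
sentiment_counts = (tweets_df.groupby([tweets_df['date'].dt.floor('D'), 'sentiment'])
                      .size().reset_index(name='counts'))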

plt.figure(figsize=(10, 5))

plt.plot(sentiment_counts[sentiment_counts['sentiment']=='Positive']['date'],
         sentiment_counts[sentiment_counts['sentiment']=='Positive']['counts'],
         color='green')

plt.plot(sentiment_counts[sentiment_counts['sentiment']=='Negative']['date'],
         sentiment_counts[sentiment_counts['sentiment']=='Negative']['counts'],
         color='red')

plt.title('Vaccine Tweet Sentiments Over Time')
plt.ylabel('Tweet Count')
plt.legend(['Positive', 'Negative'])

plt.show()

This plots the number of positive and negative tweets over time.

We can immediately notice some interesting trends, like the spike in negative sentiment in early January 2022. As more people got vaccinated, positive sentiments seem to be increasing while negative ones are decreasing. This graph allows us to easily track sentiment changes.
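For comparing sentiments side by side, it can help to bucket the per-second timestamps by calendar day and reshape the counts into a wide table with one column per sentiment. The following is a minimal sketch on made-up rows; the variable names demo and daily are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for tweets_df
demo = pd.DataFrame({
    'date': pd.to_datetime(['2022-06-25 22:06:20', '2022-06-26 21:57:16',
                            '2022-06-26 21:57:17', '2022-06-26 08:00:00']),
    'sentiment': ['Negative', 'Negative', 'Positive', 'Negative'],
})

# Floor each timestamp to midnight, count per (day, sentiment),
# then pivot the sentiment level into columns; missing pairs become 0
daily = (demo.groupby([demo['date'].dt.floor('D'), 'sentiment'])
             .size()
             .unstack(fill_value=0))
print(daily)
```

With the data in this shape, calling daily.plot() draws one line per sentiment column, which avoids filtering the DataFrame separately for each label.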

We can also create plots based on word counts rather than just tweet counts. Let’s build a word cloud to visualize most common terms:

text = " ".join(tweets_df[tweets_df['sentiment']=='Positive']['text'])
wordcloud = WordCloud(width=800, height=400).generate(text)

plt.figure(figsize=(10,5))
plt.imshow(wordcloud)
plt.title('Frequent Words in Positive Tweets')
plt.axis('off')
plt.show()

We can see words like “safe”, “protection” and “effective” are dominating, indicating a trust in vaccines. Bigrams like “get vaccinated” also occur frequently in the positive tweets.

Similarly, we can generate a word cloud for negative sentiment tweets to identify common anti-vaccine terminology. Other visualizations like bar charts, heatmaps, treemaps, etc. can also provide insights into the data.
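A word cloud is essentially a picture of word frequencies, so it can be useful to look at the underlying counts directly. The sketch below counts words with the standard library; the three cleaned tweets are hypothetical stand-ins for tweets_df['text'].

```python
from collections import Counter

# Hypothetical cleaned tweets standing in for tweets_df['text']
cleaned = [
    "grateful covid vaccine keep outbreak school low",
    "relief get covid vaccine feel safer now",
    "blessed received covid vaccine feel grateful",
]

# Split each tweet on whitespace and tally every word
word_counts = Counter(word for tweet in cleaned for word in tweet.split())
print(word_counts.most_common(3))
```

The resulting counts can also feed a bar chart when exact numbers matter more than visual impression.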

Practical Applications

There are many potential applications of social media sentiment analysis with Python and Pandas:

Brand monitoring - Track brand mentions and public sentiment over time for marketing purposes.
Product feedback - Analyze user reviews of products to determine issues, defects or desirable features.
Issue monitoring - Identify trending issues, news and crises from online discussions.
Campaign success - Determine audience reception to marketing campaigns, promotions and launches.
Competitor analysis - Monitor commentary and attitudes towards competitors.
Customer service - Categorize and prioritize incoming queries on social media and forums.
Social studies - Understand attitudes and behavioral patterns around events, elections, policies etc.

Sentiment analysis surfaces business and social insights from textual data that would be impractical to uncover by reading it manually. Automating the process using Python and Pandas makes it scalable across massive datasets.

Conclusion

In this guide, we covered a practical end-to-end social media sentiment analysis workflow using Python and Pandas including:

Importing and exploring data in Pandas DataFrames
Cleaning and preparing messy text data for analysis
Implementing sentiment classification using TextBlob
Analyzing trends over time through insightful visualizations
Generating illuminating word clouds to reveal common terminology

Sentiment analysis allows us to tap into subjective opinions and emotions within textual data at a large scale. This technique has real-world applications across domains like business, social science, politics, public health and more. Through a worked example on vaccine-related tweets, this guide provided an overview of sentiment analysis using open-source Python tools like Pandas.