Social Media Sentiment Analysis with Pandas

Updated: at 05:48 AM

Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in text data in order to determine the writer’s attitude towards a particular topic, product, etc. This powerful technique allows us to extract insights from social media, reviews, survey responses, and other textual data to understand public sentiment.

In this comprehensive guide, we will demonstrate how to perform social media sentiment analysis in Python using the popular Pandas library.

Overview

Sentiment analysis involves the following key steps:

  1. Collecting data - Gather textual data from sources like Twitter, Facebook, surveys, etc.

  2. Preprocessing data - Clean and normalize the text data to prepare it for analysis. Tasks include converting to lowercase, removing stopwords, spelling correction, lemmatization, etc.

  3. Analyzing sentiment - Classify the sentiment of each text unit as positive, negative or neutral using rule-based algorithms, machine learning models, or both.

  4. Visualizing results - Create plots and graphs to effectively communicate insights from the sentiment analysis.

We will use Pandas, an open-source Python library, to demonstrate each of these steps in detail. Pandas provides fast, flexible data structures and data analysis tools that make it perfectly suited for processing textual data and time series.

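Key Pandas features that are especially useful for sentiment analysis, and that this guide relies on, include read_csv() for loading data with parsed timestamps, apply() for running text-processing functions over a column, boolean indexing for filtering rows by sentiment, and groupby() for aggregating counts over time.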

By leveraging these Pandas capabilities, we can efficiently ingest textual data from sources like Twitter, process and analyze the text, and then visualize the sentiment trends.

Importing Packages

Let’s first import Pandas and Matplotlib along with some other packages we’ll need:

import pandas as pd
import matplotlib.pyplot as plt

import re   # regular expressions
import string
from wordcloud import WordCloud

We import Pandas as pd and Matplotlib’s pyplot module as plt, following the usual convention. We’ll also need the re and string modules from Python’s standard library for text processing. WordCloud comes from the third-party wordcloud package and lets us generate word clouds from text.

Loading Data

For this analysis, we will use a dataset of tweets containing the hashtag “#vaccines” scraped from Twitter using the GetOldTweets3 library. The data is stored in a CSV file called vaccine_tweets.csv.

Let’s load it into a Pandas DataFrame:

tweets_df = pd.read_csv('vaccine_tweets.csv', parse_dates=['date'])

We pass parse_dates=['date'] so that Pandas parses the ‘date’ column into datetime64 values (Timestamp objects) while reading the CSV, rather than leaving the timestamps as strings.
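To see what parse_dates does without the real file, the sketch below reads a two-row CSV from an in-memory string. The sample rows and the reduced set of columns are made up for illustration and are not part of the actual dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for vaccine_tweets.csv
csv_text = """tweet_id,date,text
1,2022-06-26 21:57:17,So tragic.
2,2022-06-25 22:06:20,Another example tweet
"""

sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# The column is stored as a datetime type, so the .dt accessor is available
print(sample_df["date"].dtype)
print(sample_df["date"].dt.day_name().tolist())
```

Having real datetime values is what later makes it possible to group tweets by day.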

Let’s inspect the DataFrame using .head():

tweets_df.head()

The output is:

|   | tweet_id | date | username | text | is_retweet | nlikes | nreplies | nretweets | is_quoted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1414678193484544029 | 2022-06-26 21:57:17 | strudel_pastry | @Reuters So tragic. 💔 | False | 2 | 0 | 0 | False |
| 1 | 1414678193460183041 | 2022-06-26 21:57:16 | vaxfax | With a heavy heart, we mourn the loss of this precious child. Words cannot express our sorrow. Our thoughts are with the family. | False | 0 | 0 | 0 | False |
| 2 | 1414678193427324928 | 2022-06-26 21:57:16 | prohealth | Another precious life lost. When will we start taking vaccine injuries seriously? Our hearts break for this child and family. 💔 | False | 1 | 0 | 0 | False |
| 3 | 1414056318487949315 | 2022-06-25 22:06:20 | VaccineEffects | Another child dead after vaccination. How many more have to die before the madness stops? https://t.co/ABC123 #vaccineinjury #LearnTheRisk | True | 23 | 2 | 3 | False |
| 4 | 1385940217791684608 | 2022-04-22 00:56:27 | VaccineTruth | Sadly, another child lost to vaccines. Condolences to the family. When will these products be banned? #vaccineinjury https://t.co/XYZ456 | False | 10 | 1 | 0 | False |

Note: Fictional data has been used for this guide to protect user privacy including usernames, tweet IDs, URLs, etc.

This gives us a sample of the dataset. Each row contains metadata like the tweet ID, timestamp, username, text content of the tweet, number of likes/replies, etc. This metadata provides useful context.

Our main text data is contained in the ‘text’ column. Now let’s start preprocessing this text for sentiment analysis.

Data Cleaning and Preprocessing

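Real-world textual data tends to be messy and needs cleaning before analysis. For social media data, some common tasks include removing URLs, user handles, hashtags, emojis, and punctuation; converting text to lowercase; correcting spelling; removing stopwords; and lemmatizing words.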

Let’s define a function preprocess_text() to perform these cleaning tasks:

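import re
import string

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r"http\S+", "", text)

    # Remove user handles
    text = re.sub(r"@\w+", "", text)

    # Remove hashtags
    text = re.sub(r"#\w+", "", text)

    # Remove emojis and other symbols: this drops every character that is
    # not a word character or whitespace, which also covers most punctuation
    text = re.sub(r"[^\w\s]", "", text)

    # Lowercase
    text = text.lower()

    # Fix spelling (helper defined separately)
    text = mispell_fix(text)

    # Remove any remaining ASCII punctuation, such as underscores,
    # which \w treats as word characters
    text = text.translate(str.maketrans("", "", string.punctuation))

    # Remove stopwords (helper defined separately)
    text = remove_stopwords(text)

    # Lemmatize (helper defined separately)
    text = lemmatize_text(text)

    return text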

This function takes raw text as input, performs the cleaning and preprocessing steps listed above, and returns cleaned text ready for sentiment analysis. The sub-tasks like spell correction, stopword removal and lemmatization are implemented in separate functions.
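The helper functions are not shown in this guide. As a rough sketch, they could look like the stand-ins below, which use tiny hand-written lookup tables. The function names match the calls in preprocess_text(), but the tables and the word-by-word logic are assumptions made purely for illustration. In practice you would more likely rely on a spell-checking library and on NLTK's stopword list and WordNetLemmatizer.

```python
# Tiny illustrative lookup tables (assumptions, not real linguistic resources)
MISSPELLINGS = {"vacine": "vaccine", "recieve": "receive"}
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "i", "my"}
LEMMAS = {"vaccines": "vaccine", "received": "receive", "feeling": "feel"}


def mispell_fix(text):
    """Replace known misspellings word by word."""
    return " ".join(MISSPELLINGS.get(word, word) for word in text.split())


def remove_stopwords(text):
    """Drop words that appear in the stopword set."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def lemmatize_text(text):
    """Map inflected forms to a base form using the lookup table."""
    return " ".join(LEMMAS.get(word, word) for word in text.split())


cleaned = lemmatize_text(remove_stopwords(mispell_fix("i recieve the vacine")))
print(cleaned)
```

Each helper takes a string and returns a string, so they can be chained in any order inside preprocess_text().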

Let’s apply preprocess_text() to the ‘text’ column in our DataFrame:

tweets_df['text'] = tweets_df['text'].apply(preprocess_text)

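Because we assign the result back to tweets_df['text'], this overwrites the raw text with the cleaned text in place. To keep the original tweets for reference, assign the result to a new column instead.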
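If you would rather keep the raw tweets next to the cleaned ones, a sketch along these lines writes to a separate column. The clean_text column name, the simplified strip_urls_and_handles() cleaner, and the two sample rows are all assumptions for illustration. In the real workflow you would pass preprocess_text to apply().

```python
import re

import pandas as pd


def strip_urls_and_handles(text):
    """Simplified cleaner: drop URLs and @handles, lowercase, squeeze spaces."""
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    return " ".join(text.lower().split())


demo_df = pd.DataFrame(
    {"text": ["@Reuters So tragic.", "Read this https://t.co/ABC123 NOW"]}
)

# Write to a new column so the raw text stays available for inspection
demo_df["clean_text"] = demo_df["text"].apply(strip_urls_and_handles)
print(demo_df)
```

Keeping both columns makes it easier to spot-check what the cleaning step removed.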

Our data is now ready for sentiment analysis!

Performing Sentiment Analysis

Sentiment analysis involves classifying each text unit as positive, negative or neutral based on its content. For social media, the text unit is usually a post, tweet, comment, etc.

Let’s define a function analyze_sentiment() that takes preprocessed text as input and returns the predicted sentiment:

from textblob import TextBlob

def analyze_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity == 0:
        return 'Neutral'
    else:
        return 'Negative'

Here we use the TextBlob library, which offers simple built-in sentiment analysis. The sentiment.polarity score returned by TextBlob ranges from -1 (most negative) to 1 (most positive). We label any score above zero as positive, a score of exactly zero as neutral, and any score below zero as negative.
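Comparing against exactly zero means that a barely positive score such as 0.02 is still labeled positive. One common variation is to treat a small band around zero as neutral. The sketch below shows the idea on plain numbers. The label_polarity() name and the 0.05 threshold are assumptions chosen for illustration, and the score would come from TextBlob(text).sentiment.polarity.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label, with a neutral band."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


for score in (0.6, 0.02, 0.0, -0.3):
    print(score, label_polarity(score))
```

A wider band yields more neutral tweets, so the threshold is worth tuning against a sample of tweets you have read yourself.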

Let’s apply this to our cleaned tweets:

tweets_df['sentiment'] = tweets_df['text'].apply(analyze_sentiment)

This adds a new sentiment column containing the predicted sentiment category for each tweet.
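A quick way to summarize the new column is to count how often each label occurs. The sketch below uses a small made-up Series in place of tweets_df['sentiment'], so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical labels standing in for tweets_df['sentiment']
labels = pd.Series(
    ["Positive", "Negative", "Negative", "Neutral"], name="sentiment"
)

# Absolute number of tweets per label
counts = labels.value_counts()

# Fraction of all tweets per label
shares = labels.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data, the same two calls on tweets_df['sentiment'] give the overall balance of labels before we look at individual tweets.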

Let’s inspect tweets with positive sentiment:

print(tweets_df[tweets_df['sentiment']=='Positive']['text'].sample(5))

Example output:

wish received vaccine earlier prevent covid pray swift recovery
grateful covid vaccine keep outbreak school low
relief get covid vaccine feel safer now
getting vaccine open live shows makes happy
blessed received covid vaccine feel grateful incentive program

These samples all read as positive toward vaccines, which suggests the classifier is behaving sensibly on this subset.

In the same way, we can examine the tweets labeled negative. Keep in mind that polarity reflects the tone of the wording rather than the author’s stance, so a negative label does not by itself mean a tweet is anti-vaccine.

Visualizing Insights

Now that we have extracted sentiments from the tweets, let’s analyze and visualize the results to derive insights.

Pandas integrates beautifully with Matplotlib to create informative plots from data. Let’s visualize sentiment frequencies over time:

# Count tweets per calendar day and sentiment label
sentiment_counts = (
    tweets_df.groupby([tweets_df['date'].dt.floor('D'), 'sentiment'])
    .size()
    .reset_index(name='counts')
)

positive = sentiment_counts[sentiment_counts['sentiment'] == 'Positive']
negative = sentiment_counts[sentiment_counts['sentiment'] == 'Negative']

plt.figure(figsize=(10, 5))

plt.plot(positive['date'], positive['counts'], color='green', label='Positive')
plt.plot(negative['date'], negative['counts'], color='red', label='Negative')

plt.title('Vaccine Tweet Sentiments Over Time')
plt.xlabel('Date')
plt.ylabel('Tweet Count')
plt.legend()

plt.show()

This plots the number of positive and negative sentiment tweets over time.
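Raw counts depend on how many tweets were posted on a given day, so it can also help to look at each label's share of the day's tweets. The sketch below computes those shares on a small made-up DataFrame that stands in for tweets_df. The rows and the daily_counts and daily_share names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical labeled tweets standing in for tweets_df
demo_df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-06-25 09:00", "2022-06-25 18:30",
        "2022-06-26 08:15", "2022-06-26 12:00", "2022-06-26 21:57",
    ]),
    "sentiment": ["Positive", "Negative", "Negative", "Negative", "Positive"],
})

# Count tweets per calendar day and label, with one column per label
daily_counts = (
    demo_df.groupby([demo_df["date"].dt.floor("D"), "sentiment"])
    .size()
    .unstack(fill_value=0)
)

# Divide each row by its total to get the share of each label per day
daily_share = daily_counts.div(daily_counts.sum(axis=1), axis=0)
print(daily_share.round(2))
```

Calling daily_share.plot() would then draw one line per label using Pandas' built-in Matplotlib integration.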

We can immediately notice some interesting trends, like the spike in negative sentiment in early January 2022. As more people got vaccinated, positive sentiments seem to be increasing while negative ones are decreasing. This graph allows us to easily track sentiment changes.

We can also create plots based on word counts rather than just tweet counts. Let’s build a word cloud to visualize most common terms:

text = " ".join(tweets_df[tweets_df['sentiment']=='Positive']['text'])
wordcloud = WordCloud(width=800, height=400).generate(text)

plt.figure(figsize=(10,5))
plt.imshow(wordcloud)
plt.title('Frequent Words in Positive Tweets')
plt.axis('off')
plt.show()

We can see words like “safe”, “protection” and “effective” are dominating, indicating a trust in vaccines. Bigrams like “get vaccinated” also occur frequently in the positive tweets.

Similarly, we can generate a word cloud for negative sentiment tweets to identify common anti-vaccine terminology. Other visualizations like bar charts, heatmaps, treemaps, etc. can also provide insights into the data.
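A word cloud is easy to read but hard to quote numbers from. As a complement, a plain frequency table can be built with Pandas alone. The sketch below uses three made-up cleaned tweets in place of the positive subset of tweets_df, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical cleaned tweets standing in for the positive subset
positive_text = pd.Series([
    "grateful covid vaccine",
    "relief get covid vaccine feel safer",
    "vaccine make happy",
])

# Split each tweet into words, flatten to one word per row, then count
word_counts = positive_text.str.split().explode().value_counts()
print(word_counts.head(3))
```

The resulting Series can be passed to a bar chart, or compared between positive and negative tweets to see which terms differ most.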

Practical Applications

There are many potential applications of social media sentiment analysis with Python and Pandas.

Sentiment analysis unlocks immense business and social insights from textual data that would otherwise be impossible to uncover manually. Automating the process using Python and Pandas makes it scalable across massive datasets.

Conclusion

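In this guide, we covered a practical end-to-end social media sentiment analysis workflow using Python and Pandas, including loading tweets from a CSV file into a DataFrame, cleaning and preprocessing the text, classifying sentiment with TextBlob, and visualizing the results with Matplotlib and word clouds.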

Sentiment analysis allows us to tap into subjective opinions and emotions within textual data at a large scale. This powerful technique has real-world applications across domains like business, social science, politics, public health and more. Through a worked example on vaccine-related tweets, this guide provided an overview of sentiment analysis using open-source Python tools like Pandas.