Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in text data in order to determine the writer’s attitude towards a particular topic, product, etc. This powerful technique allows us to extract insights from social media, reviews, survey responses, and other textual data to understand public sentiment.
In this comprehensive guide, we will demonstrate how to perform social media sentiment analysis in Python using the popular Pandas library.
Overview
Sentiment analysis involves the following key steps:
- Collecting data - Gather textual data from sources like Twitter, Facebook, surveys, etc.
- Preprocessing data - Clean and normalize the text data to prepare it for analysis. Tasks include converting to lowercase, removing stopwords, spelling correction, lemmatization, etc.
- Analyzing sentiment - Classify the sentiment of each text unit as positive, negative or neutral using rule-based algorithms, machine learning models, or both.
- Visualizing results - Create plots and graphs to effectively communicate insights from the sentiment analysis.
We will use Pandas, an open-source Python library, to demonstrate each of these steps in detail. Pandas provides fast, flexible data structures and data analysis tools that make it perfectly suited for processing textual data and time series.
Key Pandas features that are especially useful for sentiment analysis include:
- DataFrames - Tabular, column-oriented data structures with labeled axes, ideal for working with tabular/structured data.
- Data cleaning - Tools like drop(), dropna(), etc. for preparing messy real-world data.
- Merge/join - Combine datasets and manage relational data.
- GroupBy - Split, apply and combine data in groups. Useful for grouping text by categories.
- Time series - Powerful tools for working with time series data like timestamps.
- Visualization - Seamless integration with Matplotlib to create informative plots and charts.
By leveraging these Pandas capabilities, we can efficiently ingest textual data from sources like Twitter, process and analyze the text, and then visualize the sentiment trends.
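As a rough illustration of three of the features listed above (data cleaning, merge/join and GroupBy), here is a minimal sketch on toy data. The `posts` and `users` DataFrames and their column names are hypothetical and are not part of the tweet dataset used later in this guide.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
posts = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "text": ["great product", None, "terrible service", "it is okay"],
    "label": ["Positive", "Neutral", "Negative", "Neutral"],
})
users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "DE", "US"]})

# Data cleaning: drop rows whose text is missing.
posts_clean = posts.dropna(subset=["text"])

# Merge/join: attach each author's country to their posts.
merged = posts_clean.merge(users, on="user_id", how="left")

# GroupBy: count posts per country and sentiment label.
counts = merged.groupby(["country", "label"]).size()
print(counts)
```

The same three calls scale from this toy example to a full tweet dataset, which is why they come up repeatedly in the rest of the workflow.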
Importing Packages
Let’s first import Pandas and Matplotlib along with some other packages we’ll need:
import pandas as pd
import matplotlib.pyplot as plt
import re # regular expressions
import string
from wordcloud import WordCloud
We import Pandas as pd and Matplotlib’s pyplot module as plt for convenience. We’ll also need the re and string modules from Python’s standard library for text processing. The WordCloud class from the third-party wordcloud package lets us generate word clouds.
Loading Data
For this analysis, we will use a dataset of tweets containing the hashtag “#vaccines” scraped from Twitter using the GetOldTweets3 library. The data is stored in a CSV file called vaccine_tweets.csv.
Let’s load it into a Pandas DataFrame:
tweets_df = pd.read_csv('vaccine_tweets.csv', parse_dates=['date'])
We pass parse_dates=['date'] so that Pandas converts the ‘date’ column of timestamps into datetime values while reading the CSV, instead of leaving them as plain strings.
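To see what parse_dates does without needing the real file, the sketch below reads a small in-memory CSV. The two rows and the `csv_text` variable are made up for illustration and only stand in for vaccine_tweets.csv.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for vaccine_tweets.csv.
csv_text = (
    "tweet_id,date,text\n"
    "1,2022-06-26 21:57:17,first tweet\n"
    "2,2022-06-25 22:06:20,second tweet\n"
)

sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# The column now has a datetime dtype, so the .dt accessor is available.
print(sample_df["date"].dtype)
print(sample_df["date"].dt.day_name().tolist())
```

Having real datetime values is what later makes it possible to group tweets by day rather than by raw string.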
Let’s inspect the DataFrame using .head():
tweets_df.head()
The output is:
| | tweet_id | date | username | text | is_retweet | nlikes | nreplies | nretweets | is_quoted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1414678193484544029 | 2022-06-26 21:57:17 | strudel_pastry | @Reuters So tragic. 💔 | False | 2 | 0 | 0 | False |
| 1 | 1414678193460183041 | 2022-06-26 21:57:16 | vaxfax | With a heavy heart, we mourn the loss of this precious child. Words cannot express our sorrow. Our thoughts are with the family. | False | 0 | 0 | 0 | False |
| 2 | 1414678193427324928 | 2022-06-26 21:57:16 | prohealth | Another precious life lost. When will we start taking vaccine injuries seriously? Our hearts break for this child and family. 💔 | False | 1 | 0 | 0 | False |
| 3 | 1414056318487949315 | 2022-06-25 22:06:20 | VaccineEffects | Another child dead after vaccination. How many more have to die before the madness stops? https://t.co/ABC123 #vaccineinjury #LearnTheRisk | True | 23 | 2 | 3 | False |
| 4 | 1385940217791684608 | 2022-04-22 00:56:27 | VaccineTruth | Sadly, another child lost to vaccines. Condolences to the family. When will these products be banned? #vaccineinjury https://t.co/XYZ456 | False | 10 | 1 | 0 | False |
Note: The data in this guide is fictional, including usernames, tweet IDs, and URLs, to protect user privacy.
This gives us a sample of the dataset. Each row contains metadata like the tweet ID, timestamp, username, text content of the tweet, number of likes/replies, etc. This metadata provides useful context.
Our main text data is contained in the ‘text’ column. Now let’s start preprocessing this text for sentiment analysis.
Data Cleaning and Preprocessing
Real-world textual data tends to be messy and needs cleaning before analysis. For social media data, some common tasks include:
- Remove URLs, handles, hashtags, emojis, etc.
- Convert text to lowercase
- Fix spelling errors
- Remove punctuation
- Remove stopwords like ‘a’, ‘and’, ‘the’
- Lemmatize words to their base form
Let’s define a function preprocess_text() to perform these cleaning tasks:
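import re
import string

def preprocess_text(text):
    """Clean one raw tweet and return text ready for sentiment analysis."""
    # Remove URLs
    text = re.sub(r"http\S+", "", text)
    # Remove user handles
    text = re.sub(r"@\w+", "", text)
    # Remove hashtags
    text = re.sub(r"#\w+", "", text)
    # Remove emojis and other symbols (this pattern also strips most punctuation)
    text = re.sub(r"[^\w\s]", "", text)
    # Lowercase
    text = text.lower()
    # Fix spelling (mispell_fix is defined separately)
    text = mispell_fix(text)
    # Remove any remaining punctuation, such as underscores
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stopwords (remove_stopwords is defined separately)
    text = remove_stopwords(text)
    # Lemmatize (lemmatize_text is defined separately)
    text = lemmatize_text(text)
    return text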
This function takes raw text as input, performs the cleaning and preprocessing steps listed above, and returns cleaned text ready for sentiment analysis. The sub-tasks like spell correction, stopword removal and lemmatization are implemented in separate functions.
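The guide does not show the three helper functions, so the sketch below gives deliberately simple stand-ins that make preprocess_text() runnable. These are assumptions for illustration, not the original implementations: the lookup tables `SPELLING_FIXES`, `STOPWORDS` and `LEMMAS` are hypothetical and tiny, and a real project would more likely rely on a spelling corrector, a full stopword list and a proper lemmatizer from an NLP library.

```python
# Hypothetical, deliberately simple stand-ins for the helpers that
# preprocess_text() calls. They are NOT the original implementations.

# Tiny illustrative lookup tables; a real project would use much larger ones.
SPELLING_FIXES = {"vacine": "vaccine", "recieved": "received"}
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "i", "my"}
LEMMAS = {"vaccines": "vaccine", "received": "receive", "feeling": "feel"}


def mispell_fix(text):
    """Replace known misspellings using a fixed lookup table."""
    return " ".join(SPELLING_FIXES.get(word, word) for word in text.split())


def remove_stopwords(text):
    """Drop words that appear in the stopword set."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def lemmatize_text(text):
    """Map known inflected forms to a base form via a lookup table."""
    return " ".join(LEMMAS.get(word, word) for word in text.split())


cleaned = lemmatize_text(remove_stopwords(mispell_fix("i recieved my vacine")))
print(cleaned)
```

Each helper takes and returns a plain string, so any of them can be swapped for a library-backed version without changing preprocess_text().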
Let’s apply preprocess_text() to the ‘text’ column in our DataFrame:
tweets_df['text'] = tweets_df['text'].apply(preprocess_text)
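This overwrites the text column in place with the cleaned text, so the original raw text is no longer available in the DataFrame. To keep the raw text, assign the result to a new column instead.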
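If the raw text is still needed later, for example to read the original tweets during a spot check, one option is to write the cleaned text to a separate column. This is a minimal sketch: the `clean_text` column name is an assumption, and `str.lower` merely stands in for preprocess_text() so the snippet runs on its own.

```python
import pandas as pd

demo_df = pd.DataFrame({"text": ["So HAPPY today!", "Not great..."]})

# Assumption: str.lower stands in for preprocess_text() to keep this runnable.
demo_df["clean_text"] = demo_df["text"].apply(str.lower)

# Both the raw and the cleaned versions remain available.
print(demo_df)
```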
Our data is now ready for sentiment analysis!
Performing Sentiment Analysis
Sentiment analysis involves classifying each text unit as positive, negative or neutral based on its content. For social media, the text unit is usually a post, tweet, comment, etc.
Let’s define a function analyze_sentiment() that takes preprocessed text as input and returns the predicted sentiment:
from textblob import TextBlob

def analyze_sentiment(text):
    # Polarity is a float between -1.0 (negative) and 1.0 (positive)
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity == 0:
        return 'Neutral'
    else:
        return 'Negative'
Here we use the TextBlob library, which offers simple built-in sentiment analysis. The sentiment.polarity score it returns ranges from -1 (most negative) to 1 (most positive). We use zero as a fixed cut-off: scores above zero are labeled positive, scores below zero negative, and a score of exactly zero neutral.
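With a cut-off of exactly zero, even a very weak score such as 0.01 counts as positive. One common variation is to treat a small band around zero as neutral. The sketch below shows that idea on plain numbers, without calling TextBlob; the function name `label_polarity` and the `threshold` default of 0.05 are assumptions chosen for illustration, not TextBlob settings.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a sentiment label.

    `threshold` is a hypothetical tuning parameter, not a TextBlob setting:
    scores within +/- threshold of zero are treated as neutral.
    """
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


for score in (0.6, 0.02, -0.3):
    print(score, label_polarity(score))
```

A suitable threshold depends on the data, so it is worth checking a sample of borderline tweets before settling on a value.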
Let’s apply this to our cleaned tweets:
tweets_df['sentiment'] = tweets_df['text'].apply(analyze_sentiment)
This adds a new sentiment column containing the predicted sentiment category for each tweet.
Let’s inspect tweets with positive sentiment:
print(tweets_df[tweets_df['sentiment']=='Positive']['text'].sample(5))
Example output:
wish received vaccine earlier prevent covid pray swift recovery
grateful covid vaccine keep outbreak school low
relief get covid vaccine feel safer now
getting vaccine open live shows makes happy
blessed received covid vaccine feel grateful incentive program
In this sample, the tweets labeled positive do read as favorable towards vaccines.
In the same way, we can examine the tweets labeled negative, which may include anti-vaccine viewpoints. A spot check like this suggests the classifier is behaving sensibly, although it is not a formal evaluation.
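Alongside reading individual tweets, it can help to look at how the labels are distributed overall. The sketch below uses a small hypothetical Series of labels in place of tweets_df['sentiment'].

```python
import pandas as pd

# Hypothetical labels standing in for tweets_df['sentiment'].
labels = pd.Series(
    ["Positive", "Negative", "Neutral", "Positive"], name="sentiment"
)

# Absolute number of tweets per label.
counts = labels.value_counts()

# Share of tweets per label (the values sum to 1).
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

A heavily skewed distribution, such as almost everything being neutral, is often a hint that preprocessing removed too much or that the cut-off needs adjusting.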
Visualizing Insights
Now that we have extracted sentiments from the tweets, let’s analyze and visualize the results to derive insights.
Pandas integrates beautifully with Matplotlib to create informative plots from data. Let’s visualize sentiment frequencies over time:
# Count tweets per calendar day and sentiment. Grouping on the raw timestamps
# would create one group per second, so floor them to the day first.
sentiment_counts = (
    tweets_df.groupby([tweets_df['date'].dt.floor('D'), 'sentiment'])
    .size()
    .reset_index(name='counts')
)
positive = sentiment_counts[sentiment_counts['sentiment'] == 'Positive']
negative = sentiment_counts[sentiment_counts['sentiment'] == 'Negative']

plt.figure(figsize=(10, 5))
plt.plot(positive['date'], positive['counts'], color='green')
plt.plot(negative['date'], negative['counts'], color='red')
plt.title('Vaccine Tweet Sentiments Over Time')
plt.ylabel('Tweet Count')
plt.legend(['Positive', 'Negative'])
plt.show()
This plots the number of positive and negative tweets over time.
We can immediately notice some interesting trends, like the spike in negative sentiment in early January 2022. Over time, positive sentiment seems to increase while negative sentiment decreases, which may coincide with more people getting vaccinated, although the plot alone cannot establish the cause. This graph allows us to easily track sentiment changes.
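Raw counts can be misleading when the total number of tweets varies from day to day, so it is sometimes useful to look at each sentiment's share of the daily total as well. Here is a minimal sketch using pd.crosstab on a small hypothetical DataFrame with the same date and sentiment columns as tweets_df.

```python
import pandas as pd

# Hypothetical data with the same column names as tweets_df.
demo = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-06-25 09:00", "2022-06-25 18:30",
        "2022-06-26 10:15", "2022-06-26 21:57",
    ]),
    "sentiment": ["Positive", "Negative", "Negative", "Negative"],
})

# Floor each timestamp to midnight so tweets from the same day share one key.
day = demo["date"].dt.floor("D")

# Rows are days, columns are sentiment labels, values are tweet counts.
daily = pd.crosstab(day, demo["sentiment"])

# Divide each row by its total to get the share of each sentiment per day.
share = daily.div(daily.sum(axis=1), axis=0)

print(daily)
print(share)
```

The resulting wide table can be passed straight to the DataFrame's plot() method, which draws one line per sentiment column.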
We can also create plots based on word counts rather than just tweet counts. Let’s build a word cloud to visualize the most common terms:
text = " ".join(tweets_df[tweets_df['sentiment']=='Positive']['text'])
wordcloud = WordCloud(width=800, height=400).generate(text)
plt.figure(figsize=(10,5))
plt.imshow(wordcloud)
plt.title('Frequent Words in Positive Tweets')
plt.axis('off')
plt.show()
We can see words like “safe”, “protection” and “effective” are dominating, indicating a trust in vaccines. Bigrams like “get vaccinated” also occur frequently in the positive tweets.
Similarly, we can generate a word cloud for negative sentiment tweets to identify common anti-vaccine terminology. Other visualizations like bar charts, heatmaps, treemaps, etc. can also provide insights into the data.
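A word cloud is a picture of word frequencies, so it can be useful to inspect the underlying counts directly. The sketch below uses collections.Counter from the standard library on a few hypothetical cleaned tweets; the `cleaned_tweets` list stands in for the cleaned text column.

```python
from collections import Counter

# Hypothetical cleaned tweets, already lowercased and stripped of stopwords.
cleaned_tweets = [
    "grateful covid vaccine feel safe",
    "vaccine safe effective",
    "relief get vaccine",
]

# Join all tweets, split on whitespace and count each word.
word_counts = Counter(" ".join(cleaned_tweets).split())

# Show the three most frequent words with their counts.
print(word_counts.most_common(3))
```

Exact counts like these make it easier to compare positive and negative tweets numerically, which a word cloud only shows approximately through font size.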
Practical Applications
There are many potential applications of social media sentiment analysis with Python and Pandas:
- Brand monitoring - Track brand mentions and public sentiment over time for marketing purposes.
- Product feedback - Analyze user reviews of products to determine issues, defects or desirable features.
- Issue monitoring - Identify trending issues, news and crises from online discussions.
- Campaign success - Determine audience reception to marketing campaigns, promotions and launches.
- Competitor analysis - Monitor commentary and attitudes towards competitors.
- Customer service - Categorize and prioritize incoming queries on social media and forums.
- Social studies - Understand attitudes and behavioral patterns around events, elections, policies etc.
Sentiment analysis surfaces business and social insights from textual data that would be impractical to uncover by reading it manually. Automating the process with Python and Pandas lets it scale to large datasets.
Conclusion
In this guide, we covered a practical end-to-end social media sentiment analysis workflow using Python and Pandas including:
- Importing and exploring data in Pandas DataFrames
- Cleaning and preparing messy text data for analysis
- Implementing sentiment classification using TextBlob
- Analyzing trends over time through insightful visualizations
- Generating illuminating word clouds to reveal common terminology
Sentiment analysis allows us to tap into subjective opinions and emotions within textual data at a large scale. The technique has real-world applications across domains like business, social science, politics, and public health. Using a worked example on vaccine-related tweets, this guide gave an overview of sentiment analysis with open-source Python tools like Pandas and TextBlob.