29 November 2024

Sentiment analysis part 1

by ted

My goal for this project is to analyze how people in data science feel about their jobs. More specifically, I will gather data from the r/datascience subreddit and figure out what the talk is there.

The code notebook without annotations can be viewed here.

Step 1 (connect to our data source)


# Connecting our credentials
import praw

reddit = praw.Reddit(
    client_id = "myid",
    client_secret = "mysecret",
    user_agent = "myuser"
)

# Test by printing top 4 titles on r/datascience
subreddit = reddit.subreddit("datascience")

for post in subreddit.hot(limit=4):
    print(post.title)

[output screenshot]

Success!

Step 2 (preprocess our data)


Our goal here is to prepare the reddit posts for our eventual sentiment analysis. We cannot jump right in, as the formatting will cause problems: URLs, special characters, emojis and so on. We might also want to lowercase all the text, though that could affect the sentiment analysis; for example, “don’t” and “DON’T” convey two different tones of sentiment.

Let’s however try and get a bit more data from our reddit posts first.

# Getting the posts from r/datascience
subreddit = reddit.subreddit("datascience")

# Getting the top 4 hot posts with number of comments, upvotes and the text
posts = []

for post in subreddit.hot(limit=4):
    posts.append({
        "title": post.title,
        "text": post.selftext,
        "score": post.score,
        "comments": post.num_comments
    })

print(posts)

Ok now let us try and fetch the actual comments.

# Fetching the comments in the 4 hot posts
comments = []

for post in subreddit.hot(limit=4):
    # Skip the "load more comments" stubs so we only collect actual comments
    post.comments.replace_more(limit=0)

    for comment in post.comments.list():
        comments.append({
            "post_title": post.title,
            "comment_text": comment.body,
            "comment_score": comment.score
        })

print(comments)

[output screenshot]

Cleaning the text data


We are getting to the first big obstacle of this whole project, which is cleaning and organizing our data. As said before, we are going to have to remove a lot of unnecessary noise such as URLs, emojis and symbols. Two Python tools that can help with this are RegEx (the built-in re module) and nltk. Before using these in code, let’s see why and how we can use them.

Benefits of nltk


  1. Stopword removal —> removes words such as “and”, “the”, “is”, which don’t add value for us.
    • e.g. “The fox is brown and fast” —> “fox brown fast”
  2. Tokenization —> splits sentences into a list of words for easier processing.
    • e.g. “fox brown fast” —> [“fox”, “brown”, “fast”].
  3. Stemming —> reduces words to their base form for easier processing (a quick sketch of all three follows this list).
    • e.g. “jumping” —> “jump”
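Here is a minimal sketch of those three steps in nltk; the example sentence is made up, and the two download calls fetch the tokenizer models and stopword lists:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")

sentence = "The fox is jumping over the lazy dogs"  # made-up example sentence

# 1. Tokenization: split the sentence into a list of lowercase words
tokens = word_tokenize(sentence.lower())

# 2. Stopword removal: drop common words that carry little sentiment
stop_words = set(stopwords.words("english"))
tokens = [word for word in tokens if word not in stop_words]

# 3. Stemming: reduce each word to its base form
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in tokens]

print(stems)  # something like ['fox', 'jump', 'lazi', 'dog']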

Benefits of RegEx


  1. Removes URLs —> “check this out https://google.com” —> “check this out”
  2. Removes special characters —> “data science is awesome🔥🔥🔥!” —> “data science is awesome”
  3. Removes extra whitespace —> “what is a linear \n regression ” —> “what is a linear regression”

By removing the noise with nltk and RegEx our model will be able to focus on the essential words that carry the sentiment, allowing our results to be more accurate and interpretable.
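As a quick sanity check, here is a small sketch of those substitutions applied to the example strings above (roughly the same patterns the clean_text function below will use):

import re

examples = [
    "check this out https://google.com",
    "data science is awesome🔥🔥🔥!",
    "what is a linear \n regression ",
]

for text in examples:
    text = re.sub(r"http\S+|www\S+", "", text)  # strip URLs while the links are still intact
    text = re.sub(r"[^a-zA-Z\s]", "", text)     # drop anything that is not a letter or whitespace
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace into single spaces
    print(text)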

Ok let’s get into the actual code.

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

# Cleaning our text
def clean_text(text):

    # Remove URLs first, while the links are still intact
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)

    # Remove special characters
    text = re.sub(r"[^a-zA-Z\s]", "", text)

    # Convert to lowercase
    text = text.lower()

    # Remove stopwords
    text = " ".join([
        word for word in text.split() if word not in stop_words
    ])

    return text

# Cleaning posts and comments
cleaned_posts = [{
    "title": clean_text(post["title"]),
    "text": clean_text(post["text"])
} for post in posts]

cleaned_comments = [{
    "post_title": clean_text(comment["post_title"]),
    "comment_text": clean_text(comment["comment_text"])
} for comment in comments]

Step 3 (framing our data)


We will use pandas, which is a data manipulation library; more specifically we will put its DataFrame to use, which will allow us to format our data like an Excel spreadsheet. We should also import NumPy, which we will need later for numerical functions.

import numpy as np
import pandas as pd 

# Convert posts to DataFrame
posts_df = pd.DataFrame(cleaned_posts)

# Convert comments to DataFrame
comments_df = pd.DataFrame(cleaned_comments)

# Proof of concept 
print(posts_df.head())

[output screenshot]

The “job” problem


To progress we need to get rid of the posts that don’t pertain to what we are trying to figure out, which is the sentiment on jobs in data science. How should we go about this? The method shouldn’t be too hard: we just remove the posts that don’t have the word “job” in them. But what about the words “internship”, “salary”, “work”, “workplace” and more that I can’t think of right now? What exactly are our criteria for this elimination?


We could either handpick a wordlist or design an algorithm to pick the words for us. Handpicking a wordlist will take less time than designing an algorithm, and the difference in accuracy should be minimal, so the algorithm is not worth the time or effort. For the subset let’s take 10-15 words.

# Filtering out posts not related to our analysis
# (note: no trailing "|" in the pattern, or the empty alternative would match every title)
filtered_posts_df = posts_df[posts_df["title"].str.contains(
    "job|salary|intern|work|career|pay|position|skill|profession|employ|hire|company|money",
    case = False, na = False
)]

Further data framing


# Merging filtered posts and comments
merged_df = pd.merge(
    filtered_posts_df,
    comments_df,
    left_on = "title",       # title column in the posts frame
    right_on = "post_title", # post_title column in the comments frame
    how = "inner"            # keeping only rows that match
)
merged_df.head()

[output screenshot]

# remove the post_title column
merged_df = merged_df.drop(columns = ["post_title"])

merged_df.head()

[output screenshot]

Let’s check the general health of our dataset.

# Check for missing values 
merged_df.isnull().sum()

[output screenshot]

# Check the number of rows and columns
merged_df.shape

Out[17] = (73,3)

Upon enumerating the dataset I found out we only have 4 different posts’ worth of data. Why this is the case… I’m not sure. One theory I can come up with is that in the process of cleaning the data we removed….WAIT

Remember way back when we first connected to our data source? Let’s bring that code back up.

# Connecting our credentials
import praw

reddit = praw.Reddit(
    client_id = "myid",
    client_secret = "mysecret",
    user_agent = "myuser"
)

# Test by printing top 4 titles on r/datascience
subreddit = reddit.subreddit("datascience")

for post in subreddit.hot(limit=4):  # The mistake is here
    print(post.title)

We only ever pulled 4 posts from the subreddit, so everything downstream was built on those 4 posts’ worth of comments. After raising the limit to the top 100 hot posts and re-running the fetching, cleaning, filtering and merging steps, the dataset looks much healthier:

# Check the number of rows and columns
merged_df.shape

Out[38] = (3616,3)

Now we should also add the comment upvotes to our data frame.

# Add the upvotes of each comment
merged_df["comment_score"] = [comment["comment_score"] for comment in comments]

ValueError: Length of values (3256) does not match length of index (3616)
merged_df.describe()

[output screenshot]

Some interesting information here: our dataset is retaining deleted comments, and we also seem to have quite a few duplicate comments. The relation to the upvotes problem is not clear, but let’s fix the duplication problem at least.

# Get rid of rows with "deleted" in the comment_text column
merged_df = merged_df[merged_df.comment_text != "deleted"]

merged_df.describe()

[output screenshot]

Fixing the upvotes problem


Before we go about deleting rows and rows of information, let’s try and add the upvotes column. The upvotes column is important since it adds more weight to the sentiment of a particular comment. For example, a comment that says “machine learning is awesome” with 20 upvotes should have more weight than a comment that says “machine learning sucks” with 5 upvotes.

Unfortunately this statistic was forgotten somewhere in the code so we need to go back and add it in,

# Cleaning posts and comments
cleaned_posts = [{
    "title": clean_text(post["title"]),
    "text": clean_text(post["text"])
} for post in posts]

cleaned_comments = [{
    "post_title": clean_text(comment["post_title"]),
    "comment_text": clean_text(comment["comment_text"])
} for comment in comments]

# Adding upvotes to the comments dictionary
cleaned_comments = [{
    "post_title": clean_text(comment["post_title"]),
    "comment_text": clean_text(comment["comment_text"]),
    "upvotes": comment["comment_score"]
} for comment in comments]

After rebuilding comments_df from the updated cleaned_comments and re-running the merge, the upvotes column shows up:

merged_df.head()

[output screenshot]

Step 4 (Data analysis)


To perform analysis on our data we need something to measure by. The upvotes are not enough since they do not correlate to positive or negative sentiment; they just act as a multiplier. Time to summon VADER.

Valence Aware Dictionary and sEntiment Reasoner


VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that works well with sentiment expressed on social media; for example, it accounts for things like capitalization, punctuation emphasis, emoticons and slang when scoring a statement.

VADER outputs a positive, neutral and negative score as well as a compound score between -1 and 1, with -1 being very negative, 1 being very positive and 0 in the middle for neutral. We will use this compound score to measure our text data, more specifically the comment_text column.
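To make the compound score concrete, here is a small sketch run on a few made-up comments (the exact numbers will vary, but capitalization and punctuation push the scores toward the extremes):

from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

# Made-up example comments, not real data from the subreddit
examples = [
    "my data science job is okay",
    "I love my data science job!",
    "I LOVE my data science job!!!",
    "the job market is terrible right now",
]

for text in examples:
    scores = sia.polarity_scores(text)
    # "compound" is the normalized overall score in [-1, 1]
    print(f"{scores['compound']:+.3f}  {text}")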

from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

# Add compound scores to the data frame
merged_df["sentiment"] = merged_df["comment_text"].apply(lambda x: sia.polarity_scores(x)["compound"])

merged_df.head()

[output screenshot]

Let’s look at some basic data

merged_df.describe()

[output screenshot]

Upvotes vs Sentiment


Let’s see the correlation between the upvotes and the sentiment score. We will use matplotlib and seaborn for their visualization capabilities.

import seaborn as sns 
import matplotlib.pyplot as plt 

# Scatter plot on sentiment score vs upvotes
sns.scatterplot(x = 'sentiment', y = 'upvotes', data = merged_df)

[scatter plot: sentiment vs. upvotes]

# Plotting frequency of sentiment score 
sns.histplot(data = merged_df, x = 'sentiment')

[histogram of sentiment scores]

The Final Merge


We are almost at the part where we can draw some final conclusions from our data. One obstacle stands in our way and that is the upvotes. Ideally we want to merge the upvotes with the sentiment score for a more accurate reflection of how the comments feel about jobs in data science.

The problem is our upvotes fall in the range of [-46, 636], so we need to squeeze these scores into a range with meaning. One function that can do this is the sigmoid function, \(σ(x) = \frac{1}{1 + e^{-x}}\). This function will take our range of upvotes between [-46, 636] and squeeze them into the range (0, 1). A better interval would be (-1, 1) so we can combine the upvotes and sentiment columns by simply adding them together. In this case our negative to positive sentiment interval would change from (-1, 1) —> (-2, 2), with 0 still being neutral. Here is the function that gives this interval change: \(σ(x) = \frac{2}{1 + e^{-x}}-1\).
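As a quick check that this rescaled sigmoid really covers (-1, 1) with 0 as the neutral point:

\[
\sigma(x) = \frac{2}{1 + e^{-x}} - 1, \qquad
\lim_{x \to -\infty} \sigma(x) = \frac{2}{1 + \infty} - 1 = -1, \qquad
\sigma(0) = \frac{2}{1 + 1} - 1 = 0, \qquad
\lim_{x \to +\infty} \sigma(x) = \frac{2}{1 + 0} - 1 = 1.
\]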

We still have 2 problems: 1) σ(10) ≈ 0.9999, so the difference between 10 upvotes and, let’s say, 100 upvotes will be negligible in our analysis. 2) Neutral comments with extreme upvotes will be skewed toward positive or negative sentiment by our analysis.

We can fix the first problem by changing our sigmoid function to \(σ(x) = \frac{100}{1 + e^{-0.01x}}-50\). This squeezes our upvotes into the interval (-50, 50). The benefit is that it keeps 0 as the neutral point, and even when we merge with sentiment the interval turns into (-50, 50) + (-1, 1) —> (-51, 51), with 0 still in the middle for neutral.
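Here is a quick sketch comparing the two versions on a few upvote counts from our range (the numbers are approximate): the plain (-1, 1) sigmoid saturates almost immediately, while the scaled version keeps 10, 100 and 636 upvotes clearly apart.

import numpy as np

def sigmoid_unit(x):
    # (-1, 1) version: saturates for |x| beyond roughly 10
    return 2 / (1 + np.exp(-x)) - 1

def sigmoid_scaled(x):
    # (-50, 50) version used below: the 0.01 factor stretches out the useful range
    return 100 / (1 + np.exp(-0.01 * x)) - 50

for upvotes in [-46, 0, 10, 100, 636]:
    print(upvotes, round(sigmoid_unit(upvotes), 4), round(sigmoid_scaled(upvotes), 2))

# 10 and 100 upvotes both map to roughly 1.0 in the unit version,
# but to roughly 2.5 and 23.1 in the scaled version.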

We can fix the second problem by not merging upvotes for sentiment scores in the range (-0.2, 0.2). Upvotes on neutral statements do not really tell us anything about overall sentiment.

# Defining the scaled sigmoid function
def sigmoid(x):
    return 100 / (1 + np.exp(-0.01 * x)) - 50

# Applying the function to the upvotes column
merged_df['transformed_upvotes'] = merged_df['upvotes'].apply(sigmoid)

# Computing the adjusted score
merged_df['final_score'] = np.where(
    # Only merge upvotes when the sentiment is outside (-0.2, 0.2)
    (merged_df['sentiment'] <= -0.2) | (merged_df['sentiment'] >= 0.2),

    # Adding sentiment and transformed_upvotes
    merged_df['sentiment'] + merged_df['transformed_upvotes'],

    # Otherwise keeping the sentiment score unchanged
    merged_df['sentiment']
)

Step 5 (Data visualization/analysis)


We saw that the sentiment regarding jobs in data science is generally neutral. Let’s see if our new score reflects that,

sns.histplot(data = merged_df, x = 'final_score')

[histogram of final_score]

We can more clearly view the distribution with a kernel density estimation plot which will smooth out the above histogram,

sns.displot(merged_df, x="final_score", kind="kde")

[KDE plot of final_score]

The (-50, 50) range for our upvotes is therefore a tradeoff: we avoid clumping points near the extremes, but the spread of points is then skewed toward the positive direction.

Word cloud analysis


To get more wordy, let’s generate a word cloud that will show us the most common words in our data, specifically in the comment_text column,

from wordcloud import WordCloud

# The data we are putting into the wordcloud
comment_words = " ".join(merged_df['comment_text'].astype(str))

# Collecting and formatting the wordcloud from the data
wordcloud = WordCloud(
    width = 800, height = 800,
    background_color = 'black',
    min_font_size = 10,
).generate(comment_words)

# Plotting the wordcloud
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

[word cloud of all comments]

Let’s extract some word clouds for positive and negative sentiment. Starting with final_score > 10,

# The data we are putting into the wordcloud
merged_df_pos = merged_df[merged_df['final_score'] > 10]
comment_words = " ".join(merged_df_pos['comment_text'].astype(str))

# Collecting and formatting the wordcloud from the data
wordcloud = WordCloud(
    width = 800, height = 800,
    background_color = 'black',
    min_font_size = 10,
).generate(comment_words)

# Plotting the wordcloud
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

[word cloud for final_score > 10]

Let’s up the positivity and retrieve data from rows with final_score > 30,

[word cloud for final_score > 30]

Let’s move on to final_score < -2, which will give us some insight into the negative sentiment side of our data,

[word cloud for final_score < -2]

Let’s get a bit more negative and set final_score < -5,

[word cloud for final_score < -5]

Step 6 (Conclusions)


We note that the overall sentiment on jobs in r/datascience is neutral; however, there are more positive than negative comments. The positive comments talk about the people, the time and the job itself, so there seems to be a positive sentiment towards the everyday side of the data science job. On the other hand, the negative comments speak of the business side of data science: it seems that the products or value that come out of the work data scientists do are not well-liked. Additionally, there seem to be negative opinions on AI and its effects on jobs and work.

There are some areas of improvement I would also like to mention.

  1. The data size can be increased if we pull from more than the top 100 reddit posts, or if we filter first and then take the top 100 posts.
  2. The filtering can be tweaked to allow for more data, or for less but more related data; the filter choice, as seen in the “job” problem, was arbitrary and assumes that every post containing the chosen words will have comments related to those words.
  3. The function used to merge the upvotes and sentiment score can be tweaked and improved to allow for a better data spread (a small sketch of one possible tweak follows below).
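As an example of point 3 (purely a sketch, not something used in this analysis), a signed log transform would spread out mid-sized upvote counts without saturating at a hard bound:

import numpy as np

def signed_log(x):
    # Maps upvotes to sign(x) * log(1 + |x|): 0 stays 0, and large counts
    # grow slowly instead of clumping near a fixed limit
    return np.sign(x) * np.log1p(np.abs(x))

for upvotes in [-46, 0, 5, 20, 100, 636]:
    print(upvotes, round(float(signed_log(upvotes)), 2))

The resulting scale would still need to be balanced against the (-1, 1) sentiment range before adding the two together.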

In part 2 we will see how our analysis here compares to other jobs in different subreddits and build some models with our data to predict the sentiment in other fields.
