Hi everyone,


The COVID-19 pandemic brought about distance learning in the 2020 academic term. Although some people adapted easily, others found it inefficient. Nowadays, the re-opening of schools is being discussed. Most experts suggest that at least one more semester should be online. As a student who spent the last semester in distance learning, I found a lot of time to learn natural language processing. Finally, I decided to explore what people think about distance learning.

I am planning this story as an end-to-end project. We are going to explore tweets related to distance learning to understand people’s opinions (a.k.a. opinion mining) and to discover facts. I will use a lexicon-based approach to determine the tweets’ polarities (I’ll explain it later). TextBlob will be our tool for that. We will also build a machine learning model that predicts whether a tweet is positive or negative by using a Bernoulli Naive Bayes classifier.

Our workflow is the following:


  1. Data Gathering
    - Twitter API
    - Retrieve tweets with tweepy

  2. Preprocessing and Cleaning
    - Drop duplicates
    - Data type conversions
    - Drop uninformative columns
    - Get rid of stop words, hashtags, punctuation, and one- or two-letter words
    - Tokenize the words
    - Apply lemmatization
    - Term frequency-inverse document frequency vectorization

  3. Exploratory Data Analysis
    - Visualize the data
    - Compare word counts
    - Investigate the distribution of creation times
    - Investigate the locations of tweets
    - Look at the popular tweets and the most frequent words
    - Make a word cloud

  4. Sentiment Analysis

  5. Machine Learning

  6. Summary


Before starting, please make sure that the following libraries are available in your workspace.



You can use the following commands to install non-built-in libraries.


pip install pycountry
pip install nltk
pip install textblob
pip install wordcloud
pip install scikit-learn

pickle is part of Python's standard library, so it does not need to be installed.

You can find the entire code here.


First of all, we need a Twitter Developer Account to use the Twitter API. You can get the account here. Approval can take a few days. I have already completed those steps. Once I had the account, I created a text file that contains the API credentials. It is located in the parent directory of the project. The content of the text file is shown below. If you want to use it, replace the placeholder values with your own.

CONSUMER KEY=your_consumer_key
CONSUMER KEY SECRET=your_consumer_key_secret
ACCESS TOKEN=your_access_token
ACCESS TOKEN SECRET=your_access_token_secret
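
A key file in this format can be read with a few lines of standard-library Python. The sketch below is only an illustration: the function name load_keys and the file name twitter_keys.txt are assumptions, not part of the original get_tweets.py.

```python
import tempfile
from pathlib import Path


def load_keys(path):
    """Read KEY=value lines from a text file into a dictionary."""
    keys = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split("=", 1)  # split only on the first "="
        keys[name.strip()] = value.strip()
    return keys


# Demonstration with a throwaway file that holds placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "twitter_keys.txt"  # hypothetical file name
    key_file.write_text(
        "CONSUMER KEY=your_consumer_key\n"
        "CONSUMER KEY SECRET=your_consumer_key_secret\n"
        "ACCESS TOKEN=your_access_token\n"
        "ACCESS TOKEN SECRET=your_access_token_secret\n",
        encoding="utf-8",
    )
    keys = load_keys(key_file)

print(sorted(keys))
```

Keeping the credentials outside the project directory, as described above, makes it harder to commit them to version control by accident.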

After that, I created a Python file called get_tweets.py to collect tweets (in English only) related to distance learning.

The script searches for tweets that contain the following hashtags


#distancelearning, #onlineschool, #onlineteaching, #virtuallearning, #onlineducation, #distanceeducation, #OnlineClasses, #DigitalLearning, #elearning, #onlinelearning


and the following keywords


“distance learning”, “online teaching”, “online education”, “online course”, “online semester”, “distance course”, “distance education”, “online class”, “e-learning”, “e learning”


It also filters the retweets to avoid duplication.
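
As a rough illustration of how such search terms can be turned into queries, the sketch below builds one query string per hashtag or keyword and appends Twitter's standard search operator -filter:retweets. The function name build_queries is hypothetical, and whether the original script sent one query per term or combined them is an assumption. Restricting results to English would be a separate request parameter and is not shown.

```python
HASHTAGS = [
    "#distancelearning", "#onlineschool", "#onlineteaching", "#virtuallearning",
    "#onlineducation", "#distanceeducation", "#OnlineClasses", "#DigitalLearning",
    "#elearning", "#onlinelearning",
]
KEYWORDS = [
    "distance learning", "online teaching", "online education", "online course",
    "online semester", "distance course", "distance education", "online class",
    "e-learning", "e learning",
]


def build_queries(hashtags, keywords):
    """Return one search query per term, each one excluding retweets."""
    # Wrap keywords in double quotes so multi-word phrases are matched exactly.
    terms = list(hashtags) + ['"{}"'.format(keyword) for keyword in keywords]
    # "-filter:retweets" asks the search endpoint to leave retweets out.
    return ["{} -filter:retweets".format(term) for term in terms]


queries = build_queries(HASHTAGS, KEYWORDS)
print(len(queries))
print(queries[0])
print(queries[10])
```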


The get_tweets function stores the retrieved tweets in a temporary pandas DataFrame and saves them as CSV files in the output directory. It took approximately 40 hours to collect 202,645 tweets. After that, it gave me the following files.

Figure: Output files (image by author)

To concatenate all CSV files into one, I created the concatenate.py file that contains the following code.
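
A minimal sketch of the concatenation step is given here. It is not the original concatenate.py: it uses the standard library's csv module rather than pandas, the function name concatenate_csv and the part_*.csv file names are hypothetical, and it assumes every input file is non-empty and shares the same header.

```python
import csv
import glob
import os
import tempfile


def concatenate_csv(input_pattern, output_path):
    """Append the rows of every CSV matching input_pattern into one file."""
    paths = sorted(glob.glob(input_pattern))
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = None
        for path in paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.DictReader(in_file)
                if writer is None:
                    # Take the header from the first file only.
                    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demonstration with two tiny files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for i, content in enumerate(["first tweet", "second tweet"]):
        part_path = os.path.join(tmp, "part_{}.csv".format(i))
        with open(part_path, "w", newline="", encoding="utf-8") as part_file:
            part_writer = csv.writer(part_file)
            part_writer.writerow(["Content", "Location"])
            part_writer.writerow([content, "unknown"])
    output_path = os.path.join(tmp, "tweets_raw.csv")
    total = concatenate_csv(os.path.join(tmp, "part_*.csv"), output_path)
    with open(output_path, newline="", encoding="utf-8") as merged_file:
        merged = list(csv.DictReader(merged_file))

print(total, [row["Content"] for row in merged])
```

If you do this with pandas instead, writing with index=False avoids the extra "Unnamed: 0" style columns that appear when a saved index is read back.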


In the end, we have the tweets_raw.csv file. Let’s see what it looks like.

import pandas as pd

# Load the tweets
tweets_raw = pd.read_csv("tweets_raw.csv")

# Display the first five rows
display(tweets_raw.head())

# Print the summary statistics
print(tweets_raw.describe())

# Print the info
tweets_raw.info()

At first glance, we can see that there are 202,645 tweets with the content, location, username, retweet count, favorites count, and creation time features in the DataFrame. There are also some missing values in the Location column. We’ll deal with them in the next step.

According to the information above, the Unnamed: 0 and Unnamed: 0.1 columns are not informative, so we’ll drop them. The data type of the Created at column should also be datetime. We also need to get rid of duplicated tweets, if there are any.

# We do not need the first two columns. Let's drop them.
tweets_raw.drop(columns=["Unnamed: 0", "Unnamed: 0.1"], inplace=True)

# Drop duplicated rows
tweets_raw.drop_duplicates(inplace=True)

# Created at column's type should be datetime
tweets_raw["Created at"] = pd.to_datetime(tweets_raw["Created at"])

# Print the info again
tweets_raw.info()

The tweet count has been reduced to 187,052 (there were 15,593 duplicated rows). The data type of the “Created at” column has also changed to datetime64[ns].

Now, let’s tidy up the tweets’ contents. We need to get rid of stopwords, punctuation, hashtags, mentions, links, and one- or two-letter words. We also need to tokenize the tweets.

Tokenization is the splitting of a sentence into words and punctuation marks. The sentence “This is an example.” can be tokenized as [“This”, “is”, “an”, “example”, “.”].

Stopwords are commonly used words that contribute little to the meaning of a sentence, such as “a”, “an”, “the”, “on”, “in”, and so forth.

Lemmatization is the process of reducing a word to its root form. This root form is called a lemma. For example, the lemma of the words “running”, “runs”, and “ran” is “run”.

Let’s define a function to do all of these operations.
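
A minimal sketch of such a function follows. It relies on assumptions rather than on the original code: it uses regular expressions and a tiny hand-written stop word set instead of NLTK's tokenizer and stop word list, the name preprocess is hypothetical, and lemmatization is left out, so a word like "classes" stays as it is.

```python
import re

# Tiny illustrative stop word set; NLTK's English list is far longer.
STOP_WORDS = {"a", "an", "the", "on", "in", "is", "are", "this",
              "and", "for", "with", "about"}


def preprocess(tweet, stop_words=STOP_WORDS):
    """Return the lowercase tokens of a tweet after the cleaning steps."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"[@#]\w+", " ", text)  # remove mentions and hashtags
    tokens = re.findall(r"[a-z]+", text)  # keep letters only, dropping punctuation
    # Drop stop words and one- or two-letter words.
    return [token for token in tokens if len(token) > 2 and token not in stop_words]


example = ("This is an example about #DistanceLearning with @teacher: "
           "https://example.com Online classes are great!")
print(preprocess(example))  # ['example', 'online', 'classes', 'great']
```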


After the function call, our Processed column will be like the following. You see that the tweets are tokenized and they do not contain stopwords, hashtags, links, and one or two-letter words. We also applied the lemmatization operation on them.


We got what we want. Do not worry about words such as learning, online, education, etc. We will deal with them later.

Tweet lengths and the number of words in the tweets might also be interesting in the exploratory data analysis. Let’s get them!

# Get the tweet lengths
tweets_raw["Length"] = tweets_raw["Content"].str.len()

# Get the number of words in tweets
tweets_raw["Words"] = tweets_raw["Content"].str.split().str.len()

# Display the new columns
display(tweets_raw[["Length", "Words"]])

Notice that we did not use the Processed tweets.


What about the locations?


When we called the info function of the tweets_raw DataFrame, we saw that there were some missing values in the “Location” column. The missing values are indicated as NaN. We’ll fill them with the “unknown” tag.

# Fill the missing values with unknown tag
tweets_raw["Location"] = tweets_raw["Location"].fillna("unknown")

How many unique locations do we have?

# Print the unique locations and number of unique locations
print("Unique Values:", tweets_raw["Location"].unique())
print("Unique Value count:", len(tweets_raw["Location"].unique()))

The output shows that the location information is messy. There are 37,119 unique locations. We need to group them by country. To achieve this, we will use the pycountry package in Python. If you are interested, you can find further information here.

Let’s define a function called get_countries which returns the country codes of the given locations.
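
The idea behind such a function can be sketched without any third-party package. The version below is an assumption-laden stand-in, not the original implementation: instead of pycountry's country database it uses a tiny hand-written lookup table (COUNTRY_CODES, hypothetical data), and it returns "unknown" when no country name is found.

```python
import re

# Tiny hand-written lookup used purely for illustration (hypothetical data).
COUNTRY_CODES = {
    "united states": "US",
    "usa": "US",
    "united kingdom": "GB",
    "england": "GB",
    "india": "IN",
}


def get_country(location, lookup=COUNTRY_CODES):
    """Return the code of the first known country name found in location."""
    text = location.lower()
    for name, code in lookup.items():
        # Match whole words so that "indiana" does not count as "india".
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            return code
    return "unknown"


def get_countries(locations):
    """Map every free-text location to a country code."""
    return [get_country(location) for location in locations]


codes = get_countries(["London, England", "Mumbai, India", "Indiana, USA", "the moon"])
print(codes)  # ['GB', 'IN', 'US', 'unknown']
```

With pycountry installed, the lookup table could be built by iterating over pycountry.countries and reading each entry's name and alpha_2 code.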



It worked! Now we have 156 unique country codes. We will use them for the exploratory data analysis part.

Now it’s time to vectorize the tweets. We’ll use tf-idf (term frequency-inverse document frequency) vectorization.

Tf-idf is a statistical measure that weighs how important a word is to a document relative to the whole corpus. We’ll use scikit-learn’s TfidfVectorizer. The vectorizer will calculate the weight of each word in the corpus and return a tf-idf matrix. You can find further information here.

w(i, j) = tf(i, j) × log(N / df(i))

tf(i, j) = term frequency (number of occurrences of term i in document j)
df(i) = document frequency (number of documents that contain term i)
N = number of documents
w(i, j) = tf-idf weight of term i in document j
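
To make the weighting concrete, here is a small worked example of the textbook formula w = tf × log(N / df) on three toy documents, written with the standard library only. The documents and the function name tfidf_weights are made up for illustration. Note that scikit-learn's TfidfVectorizer applies a smoothed idf and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three toy "tweets", already tokenized (made-up data).
docs = [
    ["teacher", "help", "payment"],
    ["help", "help", "exam"],
    ["payment", "deadline"],
]


def tfidf_weights(documents):
    """Return one {term: weight} dict per document using w = tf * log(N / df)."""
    n_docs = len(documents)
    # df: number of documents that contain each term at least once.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights


weights = tfidf_weights(docs)
print(round(weights[1]["help"], 3))  # 2 * ln(3/2) = 0.811
print(round(weights[1]["exam"], 3))  # 1 * ln(3/1) = 1.099
```

Even though "help" occurs twice in the second document, "exam" receives the higher weight because it appears in only one document.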


We are going to select only the top 5000 words for tf-idf vectorization due to memory constraints. You can experiment with more by using other approaches like hashing.

from sklearn.feature_extraction.text import TfidfVectorizer

# Create our contextual stop words
tfidf_stops = [
    "online", "class", "course", "learning", "learn",
    "teacher", "student", "grade", "classes", "computer", "onlineeducation",
    "onlinelearning", "school", "students", "virtual", "eschool",
    "virtuallearning", "educated", "educates", "teaches", "studies",
    "study", "semester", "elearning", "teachers", "lecturer", "lecture",
    "amp", "academic", "admission", "academician", "account", "action",
    "add", "app", "announcement", "application", "adult", "classroom",
    "system", "video", "essay", "homework", "work", "assignment", "paper",
    "get", "math", "project", "science", "physics", "lesson", "courses",
    "assignments", "know", "instruction", "email", "discussion", "home",
    "college", "exam", "use", "fall", "term", "proposal", "one", "review",
    "calculus", "search", "research", "algebra",
]

# Initialize a Tf-idf Vectorizer
vectorizer = TfidfVectorizer(max_features=5000, stop_words=tfidf_stops)

# Fit and transform the vectorizer
tfidf_matrix = vectorizer.fit_transform(tweets_processed["Processed"])

# Let's see what we have
display(tfidf_matrix)

# Create a DataFrame for tf-idf vectors and display the first rows
# (get_feature_names_out replaces get_feature_names in recent scikit-learn versions)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
display(tfidf_df.head())

It returned us a sparse matrix. You can look at its content below.


Finally, we save the new DataFrame as a CSV file so we can use it later without repeating all of these operations.

# Save the processed data as a csv file
tweets_processed.to_csv("tweets_processed.csv", index=False)

3. Exploratory Data Analysis

Exploratory data analysis is an indispensable part of data science projects. We can build our models as long as we understand what our data tell us.

# Load the processed DataFrame
tweets_processed = pd.read_csv("tweets_processed.csv", parse_dates=["Created at"])

First of all, let’s look at the creation times of the oldest and newest tweets in our data set.

# Print the minimum datetime
print("Since:", tweets_processed["Created at"].min())

# Print the maximum datetime
print("Until:", tweets_processed["Created at"].max())

The tweets were created between 23 July and 14 August 2020. What about the creation hours?


import matplotlib.pyplot as plt
import seaborn as sns

# Set the seaborn style
sns.set()

# Plot the histogram of hours (histplot replaces the deprecated distplot)
sns.histplot(tweets_processed["Created at"].dt.hour, bins=24, kde=True, stat="density")
plt.title("Hourly Distribution of Tweets")
plt.show()

The histogram shows that most of the tweets were created between 12:00 and 17:00. The most popular hour is around 15:00.

Let’s look at the locations that we have already processed.


# Print the value counts of Country column
print(tweets_processed["Country"].value_counts())

The locations are not very informative because 169,734 of them are unknown. But we can still check the top tweeting countries.

According to the bar plot, the United States, the United Kingdom, and India are the top 3 countries in our dataset.

Now, let’s look at the most popular tweets (in terms of retweets and favorites).


# Display the most popular tweets
display(
    tweets_processed
    .sort_values(by=["Favorites", "Retweet-Count"], ascending=False)
    [["Content", "Retweet-Count", "Favorites"]]
    .head(20)
)

Figure: Popular tweets

The frequent words in the tweets can also tell us a lot. Let’s get them from our Tf-idf matrix.

# Create a new DataFrame called frequencies
frequencies = pd.DataFrame(tfidf_matrix.sum(axis=0).T, index=vectorizer.get_feature_names_out(), columns=['total frequency'])

# Display the 20 most frequent words
display(frequencies.sort_values(by='total frequency', ascending=False).head(20))

Word clouds would be nicer.


Figure: Word cloud (image by author)

Apparently, people talk about “payment”. “Help” is one of the most frequent words. We can say that people are looking for help a lot :)

4. Sentiment Analysis

After preprocessing and EDA, we can finally focus on the main aim of this project. We are going to calculate the tweets’ sentiment features, such as polarity and subjectivity, by using TextBlob. It gives us these values by using predefined word scores. You can check the documentation for more information.

Polarity is a value that ranges from -1 to 1. It shows how negative or positive the given sentence is.

Subjectivity is another value, ranging from 0 to 1, that shows whether the sentence is about a fact or an opinion (objective or subjective).

Let’s calculate the polarity and subjectivity scores with TextBlob
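
With TextBlob itself, the scores come from the sentiment property, for example TextBlob(text).sentiment.polarity and TextBlob(text).sentiment.subjectivity. To show the idea of "predefined word scores" without any dependency, the sketch below averages the scores of a toy lexicon. The lexicon values and the function name score are invented for illustration, and this is not TextBlob's actual algorithm, which also handles things like negation and intensifiers.

```python
# Hypothetical word scores: (polarity in [-1, 1], subjectivity in [0, 1]).
LEXICON = {
    "great": (0.8, 0.75),
    "helpful": (0.5, 0.5),
    "boring": (-1.0, 1.0),
    "terrible": (-1.0, 1.0),
}


def score(tokens, lexicon=LEXICON):
    """Average the scores of the tokens that appear in the lexicon."""
    hits = [lexicon[token] for token in tokens if token in lexicon]
    if not hits:
        return 0.0, 0.0  # no scored words: treated as neutral and objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity


positive_example = score(["online", "class", "great", "helpful"])
negative_example = score(["lecture", "boring"])
neutral_example = score(["semester", "start", "monday"])
print(positive_example, negative_example, neutral_example)
```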



We need to classify the polarities as positive, neutral, and negative.
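
One simple way to do this is to split on the sign of the polarity score. The sketch below assumes a threshold of exactly zero and the label names "Positive", "Neutral", and "Negative"; the function name label_polarity is hypothetical, and the original code may have used different cut-offs.

```python
def label_polarity(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"  # a score of exactly zero


labels = [label_polarity(p) for p in (0.65, 0.0, -1.0)]
print(labels)  # ['Positive', 'Neutral', 'Negative']
```

On a DataFrame, the same function could be applied to the Polarity column with the apply method to create the Label column.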



We can also count them up like the following.


# Print the value counts of the Label column
print(tweets_processed["Label"].value_counts())

The results are different from what I expected. There are significantly more positive tweets than negative ones.

So far, we have tagged the tweets as positive, neutral, or negative. Let’s examine our findings in more depth. I will start with the label counts.

# Change the datatype as "category"
tweets_processed["Label"] = tweets_processed["Label"].astype("category")

# Visualize the Label counts
sns.countplot(x="Label", data=tweets_processed)
plt.title("Label Counts")
plt.show()

# Visualize the Polarity scores
plt.figure(figsize=(10, 10))
sns.scatterplot(x="Polarity", y="Subjectivity", hue="Label", data=tweets_processed)
plt.title("Subjectivity vs Polarity")
plt.show()

Since the lexicon-based analysis is not always reliable, we have to check the results manually. Let’s see the popular (in terms of retweets and favorites) tweets that have the highest/lowest polarity scores.

# Display the positive tweets
display(
    tweets_processed
    .sort_values(by=["Polarity", "Retweet-Count", "Favorites"], ascending=False)
    [["Content", "Retweet-Count", "Favorites", "Polarity"]]
    .head(20)
)

# Display the negative tweets
display(
    tweets_processed
    .sort_values(by=["Polarity", "Retweet-Count", "Favorites"], ascending=[True, False, False])
    [["Content", "Retweet-Count", "Favorites", "Polarity"]]
    .head(20)
)

Figure: Positive tweets
Figure: Negative tweets

According to the results above, TextBlob has done its job correctly! We can make word clouds for each label as we did above. To do this, I will define a function. The function will take a DataFrame and a label as arguments and vectorize the Processed tweets with a tf-idf vectorizer. Finally, it will make the word clouds for us. We will only look at the 50 most popular tweets because of computational constraints. You can try with more data.

Apparently, people whose tweets are negative find distance learning boring, horrible, and terrible. On the other hand, some people like the options for distance learning.

Let’s look at the positive and negative tweet counts by country.



Is there any relationship between the time and tweets’ polarities?


positive = tweets_processed.loc[tweets_processed.Label == "Positive"]["Created at"].dt.hour
negative = tweets_processed.loc[tweets_processed.Label == "Negative"]["Created at"].dt.hour

plt.hist(positive, alpha=0.5, bins=24, label="Positive", density=True)
plt.hist(negative, alpha=0.5, bins=24, label="Negative", density=True)
plt.title("Hourly Distribution of Tweets")
plt.legend(loc='upper right')
plt.show()

The histogram above shows no obvious relationship between the time of day and the tweets’ polarities.


I want to finish my exploration here to keep this story short.


5. Build a Machine Learning Model

We have labeled the tweets according to their polarity scores. Let’s build a machine learning model by using a Bernoulli Naive Bayes classifier. We will use our tf-idf vectors as the features and the labels as the target.

from sklearn.preprocessing import LabelEncoder

# Encode the labels
le = LabelEncoder()
tweets_processed["Label_enc"] = le.fit_transform(tweets_processed["Label"])

# Display the encoded labels
display(tweets_processed[["Label", "Label_enc"]].head())

We have encoded the labels.

“Positive” = 2

“Neutral” = 1

“Negative” = 0

# Select the features and the target
X = tweets_processed['Processed']
y = tweets_processed["Label_enc"]

Now, we need to split our data into train and test sets. We’ll use the stratify parameter of train_test_split since our classes are imbalanced.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=34, stratify=y)
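
To see what stratification preserves, the following standard-library sketch splits a toy label list class by class, so that each class keeps roughly the same share in the train and test sets. It is an illustration of the idea only, not scikit-learn's implementation; the function name stratified_split and the toy label counts are assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=34):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Imbalanced toy labels: 60 positive (2), 30 neutral (1), 10 negative (0).
labels = [2] * 60 + [1] * 30 + [0] * 10
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

The test set keeps the 6:3:1 class ratio of the full data, which a purely random split of an imbalanced dataset does not guarantee.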

Now, we can create our model. Since our earlier tf-idf vectorizer was fit on the entire dataset, we have to initialize a new one. Otherwise, information from the test set would leak into the model.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Create the tf-idf vectorizer
model_vectorizer = TfidfVectorizer()

# First fit the vectorizer with our training set
tfidf_train = model_vectorizer.fit_transform(X_train)

# Now we can transform our test data with the same vectorizer
tfidf_test = model_vectorizer.transform(X_test)

# Initialize the Bernoulli Naive Bayes classifier
nb = BernoulliNB()

# Fit the model
nb.fit(tfidf_train, y_train)

# Print the accuracy score (best of 10 cross-validation folds on the test set)
best_accuracy = cross_val_score(nb, tfidf_test, y_test, cv=10, scoring='accuracy').max()
print("Accuracy:", best_accuracy)

Although we did not do any hyperparameter tuning, the accuracy is not bad. Let’s look at the confusion matrix and classification report.

from sklearn.metrics import classification_report, confusion_matrix

# Predict the labels
y_pred = nb.predict(tfidf_test)

# Print the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix\n")
print(cm)

# Print the Classification Report
cr = classification_report(y_test, y_pred)
print("\n\nClassification Report\n")
print(cr)

There is still a lot to do to improve the model’s performance on negative tweets, but I will keep that for another story :)


Finally, we can save the model to use later.


import pickle

# Save the model
with open("model.pkl", "wb") as model_file:
    pickle.dump(nb, model_file)
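
Because new tweets must be transformed by the same fitted vectorizer before the classifier can score them, it can be convenient to store both objects together. The sketch below shows the save-and-load round trip with pickle, using plain dictionaries as stand-ins for the fitted vectorizer and model; the artifacts structure and its contents are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects; any picklable Python object works the same way.
artifacts = {
    "vectorizer": {"vocabulary": ["help", "payment"]},
    "model": {"name": "BernoulliNB"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    # Save both objects in a single file.
    with open(path, "wb") as out_file:
        pickle.dump(artifacts, out_file)
    # Load them back, as a later session would.
    with open(path, "rb") as in_file:
        restored = pickle.load(in_file)

print(restored == artifacts)
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.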

Summary

In summary, let’s remember what we did together. First, we collected tweets about distance learning by using the Twitter API and the tweepy library. After that, we applied common preprocessing steps to them, such as tokenization, lemmatization, removing stopwords, and so forth. We explored the data by using summary statistics and visualization tools. Then we used TextBlob to get the polarity scores of the tweets and interpreted our findings. We found that most of the tweets in our dataset express positive opinions about distance learning. Keep in mind that we only used a lexicon-based approach, which is not very reliable. I hope this story helps you understand the sentiment analysis of tweets.


Translated from: https://towardsdatascience.com/sentiment-analysis-on-the-tweets-about-distance-learning-with-textblob-cc73702b48bc





