Building a RoBERTa Model to Discover the Sentiment of Reddit Subgroups

How do you feel when you log in to your social media accounts and read the opening post? Does it put a smile on your face, or does it make you sad or angry? I have a mixed experience. Most of the time, social media posts make me happy. How? Well, we can't control what other people post, but we can control what we want to see on our social media accounts.

If you join a group with a high volume of negative comments, you will read those comments more often, and that makes you angry and sad. Leave those toxic groups before they take a toll on your mental health.

So if I asked you to find the toxic groups among your social media accounts, could you do that?

Well, this article will help you create a model that summarizes the sentiment of all the posts or comments in a group, so you can leave those groups before they make you feel like quitting social media.

We will use Reddit as the social media reference for this article. I will analyze my Reddit subgroups and check whether they have a high count of negative comments.

Why Reddit?

Social media platforms like Reddit and Twitter let you access users' posts and comments via an API. You can test and implement a sentiment analysis model on Reddit data.

This article is divided into two parts. In the first part, I will build a RoBERTa model. In the second part, we will analyze the sentiment of Reddit subgroups.

Building the RoBERTa Model

We will fine-tune the pre-trained RoBERTa model with a Twitter dataset. You can find the data here. The dataset contains tweets with positive and negative sentiment. I have chosen binary sentiment data to increase accuracy: binary predictions are easy to interpret, and they make the decision process easy.

The Hugging Face team's Transformers library will help us access the pre-trained RoBERTa model. The RoBERTa model performs exceptionally well on the General Language Understanding Evaluation (GLUE) NLP benchmark, matching human-level performance. Learn more about RoBERTa here. Learn more about the Transformers library here.

Now, let's go over the different parts of the code in sequence.

Part 1. Configuration and Tokenization

The pre-trained model has a configuration file containing pieces of information, such as the number of layers and the number of attention heads. The details of the RoBERTa model configuration file are shown below.

{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}
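
You can inspect this configuration yourself through the Transformers library. A minimal sketch (the printed values match the configuration above):

from transformers import RobertaConfig

# Download (or load from cache) the configuration of the roberta-base checkpoint
config = RobertaConfig.from_pretrained("roberta-base")
print(config.num_hidden_layers)    # 12
print(config.num_attention_heads)  # 12
print(config.hidden_size)          # 768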

Tokenization means converting Python strings or sentences into arrays or tensors of integers, which are indices in the model vocabulary. Each model has its own tokenizer, and the tokenizer also helps make the data ready for the model.

from transformers import RobertaTokenizer

roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
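
To see what the tokenizer produces, you can encode a short sentence. A minimal sketch (the sentence and the max_length value here are illustrative):

example = roberta_tokenizer.encode_plus("I love this subreddit!",
                                        add_special_tokens=True,
                                        max_length=16,
                                        pad_to_max_length=True,
                                        return_attention_mask=True)
# Token indices: starts with <s> (id 0), ends with </s> (id 2), padded with <pad> (id 1)
print(example['input_ids'])
# 1 for real and special tokens, 0 for padding
print(example['attention_mask'])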

Note: The final version of the code is available at the end of this article.

Part 2. Data Pre-Processing

In this section, we use the tokenizer to tokenize the sentences or input data. The model requires special tokens at the beginning and end of each sequence, such as [CLS] and [SEP], or <s> and </s> in RoBERTa's case.

import tensorflow as tf
import tensorflow_datasets as tfds

def convert_example_to_feature(review):
    # tokenize a single review and pad/truncate it to max_length
    return roberta_tokenizer.encode_plus(review,
                                         add_special_tokens=True,
                                         max_length=max_length,
                                         pad_to_max_length=True,
                                         return_attention_mask=True)

def map_example_to_dict(input_ids, attention_masks, label):
    # map the flat slices into the dictionary format the model expects
    return {"input_ids": input_ids,
            "attention_mask": attention_masks}, label

def encode_examples(ds, limit=-1):
    # prepare lists, so that we can build up the final TensorFlow dataset from slices
    input_ids_list = []
    attention_mask_list = []
    label_list = []
    if limit > 0:
        ds = ds.take(limit)
    for review, label in tfds.as_numpy(ds):
        bert_input = convert_example_to_feature(review.decode())
        input_ids_list.append(bert_input['input_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])
    return tf.data.Dataset.from_tensor_slices((input_ids_list,
                                               attention_mask_list,
                                               label_list)).map(map_example_to_dict)

max_length: This variable represents the maximum allowed sentence length. Its value should not exceed 512.

pad_to_max_length: If True, the tokenizer pads the sentence to max_length with padding tokens.

The RoBERTa model needs 3 inputs.

1. input_ids: The sequence of token indices for each data point.

2. attention_mask: It tells the model which positions are real tokens and which are padding.

3. label: The sentiment label of the data point.

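Putting these pieces together, we can build the encoded training and test sets. A minimal sketch, assuming ds_train and ds_test are TensorFlow datasets of (review, label) pairs and that batch_size is defined (the shuffle buffer size here is illustrative):

# encode, shuffle, and batch the training split; encode and batch the test split
ds_train_encoded = encode_examples(ds_train).shuffle(10000).batch(batch_size)
ds_test_encoded = encode_examples(ds_test).batch(batch_size)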

Part 3. Model Training and Fine-Tuning

The Transformers library loads the pre-trained RoBERTa model in one line of code. The weights are downloaded and cached on your local machine. We then fine-tune the model for our NLP task.

from transformers import TFRobertaForSequenceClassification

model = TFRobertaForSequenceClassification.from_pretrained("roberta-base")
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
model.fit(ds_train_encoded,
          epochs=number_of_epochs,
          validation_data=ds_test_encoded,
          callbacks=[metrics])

Use the pointers below to fine-tune the model.

1. A learning_rate value between 1e-05 and 1e-06 gives a good accuracy score.

2. An increase in batch size improves accuracy but also increases training time.

3. The pre-trained model does not need many epochs of training; 3 to 10 epochs will work fine.

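As a starting point, a set of hyperparameter values consistent with these pointers might look like the following. These exact values are illustrative assumptions, not the article's originals:

learning_rate = 1e-05    # within the suggested 1e-05 to 1e-06 range
batch_size = 32          # larger batches improve accuracy but slow down training
number_of_epochs = 4     # 3 to 10 epochs is enough for a pre-trained model
max_length = 128         # must not exceed 512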

Part 4. Accuracy, F1 Score and Saving the Model

The accuracy score helps you detect bias and variance in models, and improvements to the model mostly depend on it. Use the accuracy score for balanced data and the F1 score for unbalanced data. The F1 score tells us whether the model learns all classes equally or not. We will use a Keras callback function to calculate the F1 score of the model.

import os
from sklearn.metrics import f1_score

class ModelMetrics(tf.keras.callbacks.Callback):

    def on_train_begin(self, logs={}):
        self.count_n = 1

    def on_epoch_end(self, batch, logs={}):
        # save the model for this epoch in a numbered folder
        os.mkdir('/create/folder/' + str(self.count_n))
        self.model.save_pretrained('/folder/to/save/model/' + str(self.count_n))
        # predict the test set and compute the F1 score against the true labels
        y_val_pred = tf.nn.softmax(self.model.predict(ds_test_encoded))
        y_pred_argmax = tf.math.argmax(y_val_pred, axis=1)
        testing_copy = testing_sentences.copy()
        testing_copy['predicted'] = y_pred_argmax
        f1_s = f1_score(testing_sentences['label'], testing_copy['predicted'])
        print('\n f1 score is :', f1_s)
        self.count_n += 1

metrics = ModelMetrics()

We will use the save_pretrained method to save the model. You can save the model at each epoch; keep the model with the highest accuracy and delete the rest.

Analyze the Sentiment of a Reddit Subgroup

Once you complete building the RoBERTa model, we will detect the sentiment of a Reddit subgroup. These are the steps to follow to complete the task.

1. Fetch the comments of the Reddit subgroup. Learn more about how to fetch comments from Reddit here.
2. Check the sentiment of each comment using your RoBERTa model (a sketch of steps 1 to 3 follows this list).
3. Count the positive and negative comments of the Reddit subgroup.
4. Repeat the process for different Reddit subgroups.
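
Here is a minimal sketch of steps 1 to 3. It assumes the praw library for Reddit access, the roberta_tokenizer and max_length defined earlier, and a fine-tuned model saved at saved_model_path; the credentials, subreddit name, and saved_model_path are placeholders, and label 1 is assumed to mean positive, as in the training data:

import praw
import tensorflow as tf
from transformers import TFRobertaForSequenceClassification

# Placeholder credentials: create an app at reddit.com/prefs/apps to get real ones
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="sentiment-analysis-script")

# Load the fine-tuned model saved earlier with save_pretrained
model = TFRobertaForSequenceClassification.from_pretrained(saved_model_path)

positive, negative = 0, 0
for submission in reddit.subreddit("python").top(time_filter="week", limit=10):
    submission.comments.replace_more(limit=0)  # flatten "load more comments" stubs
    for comment in submission.comments.list():
        inputs = roberta_tokenizer.encode_plus(comment.body,
                                               add_special_tokens=True,
                                               max_length=max_length,
                                               pad_to_max_length=True,
                                               return_attention_mask=True,
                                               return_tensors="tf")
        outputs = model(inputs)
        logits = outputs[0]  # the first element of the model output is the logits
        prediction = int(tf.math.argmax(logits, axis=1).numpy()[0])
        if prediction == 1:   # assumption: label 1 = positive
            positive += 1
        else:
            negative += 1

print("positive comments:", positive, "negative comments:", negative)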

You can find a detailed explanation of steps 1 and 2 here. I have selected my five favorite subreddits for analysis. We analyze the comments of the top 10 weekly posts. I have restricted the number of comments due to the Reddit API request limits.

The counts of positive and negative comments will give you the overall sentiment of the Reddit subgroup. I have implemented these steps in the code, which you can find at the end of this article.

正面和负面评论的数量将为您提供Reddit子组的整体情感。 我已经在代码中实现了这些步骤。 您可以在本文末尾找到此代码。

Graph of sentiment analysis of my five favorite Reddit subgroups.

Reddit subgroups are highly regulated by their moderators. If your comment breaks any subreddit rules, a Reddit bot will delete it. The bots do not delete comments based on their sentiment, but you could say that most negative comments break subreddit rules.

Conclusion

In this article, you learned how to discover the sentiment of the social media platform Reddit. The article also covered building a RoBERTa model for a sentiment analysis task. With the help of pre-trained models, we can solve a lot of NLP problems. Models in the NLP field are maturing and getting more powerful, and the Hugging Face Transformers library makes it quite easy to access them. Try these models with different configurations and tasks.

Code to build a RoBERTa classification model

Code to analyze Reddit subgroup

Translated from: https://towardsdatascience.com/discover-the-sentiment-of-reddit-subgroup-using-roberta-model-10ab9a8271b8
