Emoji for text sentiment analysis: Improving sentiment analysis accuracy with emoji embedding

Abstract:
Due to the diversity and variability of Chinese syntax and semantics, accurately identifying and distinguishing individual emotions from online texts is challenging. To overcome this limitation, we incorporate a new source of individual sentiment, emojis, which contain thousands of graphic symbols and are increasingly being used for expressing emotion in online conversations. We examined popular sentiment analysis algorithms, including rule-based and classification algorithms, to evaluate the impact of supplementing emojis as additional features to improve the algorithm performance. Emojis were also translated into corresponding sentiment words when constructing features for comparison with those directly generated from emoji label words. In addition, considering different functions of emojis in texts, we classified all posts in the dataset by their emoji usage and examined the changes in algorithm performance. We found that emojis are effective as expanding features for improving the accuracy of sentiment analysis algorithms, and the algorithm performance can be further increased by taking different emoji usages into consideration. In this study, we developed an improved emoji-embedding model based on Bi-LSTM (namely, CEmo-LSTM), which achieves the highest accuracy (around 0.95) when analyzing online Chinese texts. We applied the CEmo-LSTM algorithm to a large dataset collected from Weibo from December 1, 2019 to March 20, 2020 to understand the sentiment evolution of online users during the COVID-19 pandemic. We found that the pandemic remarkably impacted individual sentiments and caused more passive emotions (e.g., horror and sadness). Our novel emoji-embedding algorithm creatively combined emojis as well as emoji usage with the sentiment analysis model and can handle emotion mining tasks more effectively and efficiently.

Main Work:
However, these studies mainly considered emojis as one feature and did not research the sentiment effects of emojis on the whole texts. Little attention has been given to the SA model combined with different emoji usages in texts.
In this study, we proposed an emoji-embedding architecture named CEmo-LSTM to improve the accuracy of sentiment identification and classification in SA tasks. We further evaluated the benefits of introducing emojis to the accuracy of SA in both the traditional rule-based and supervised learning algorithms. Additionally, the most effective approach for embedding emojis in SA algorithms was examined. We compared the performance of the CEmo-LSTM model with that of other mainstream SA models in different experimental settings. Finally, by collecting all posts and embedded emojis published by users on Weibo during the COVID-19 outbreak, we utilized CEmo-LSTM to analyze the sentiment evolution of online users and measured the impact of the COVID-19 pandemic on individual moods. To the best of our knowledge, this is the first study that comprehensively evaluates the effectiveness of introducing emoji usage into SA algorithms.

Research Process:

  1. Data collection: We collected all data from Weibo that were posted publicly by users located in Wuhan (the capital of the Hubei province in China), including microblog text, posting time, author ID, and gender, from December 1, 2019 to March 20, 2020. By comparing the sentiments in posts published by Wuhan users before and after the COVID-19 outbreak, we can analyze the sentiment evolution of online users and further explore the impact of COVID-19 on individual moods. Overall, 38,183,194 microblog posts from 2,239,472 unique users were collected. We found that emotion tokens (i.e., emoji characters) were commonly used in Weibo posts. There were 15,609,843 posts containing emoji symbols, accounting for 40.88% of the total posts. In addition, 1,279,828 users used emojis at least once, accounting for 57.15% of all unique users.
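The shares quoted in the data-collection step can be re-derived from the raw counts. The snippet below is only a quick arithmetic check; the helper name `share` is an illustrative assumption, not something from the study.

```python
# Counts quoted in the data-collection paragraph.
total_posts = 38_183_194
posts_with_emoji = 15_609_843
total_users = 2_239_472
users_with_emoji = 1_279_828


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


post_share = share(posts_with_emoji, total_posts)
user_share = share(users_with_emoji, total_users)
print(post_share, user_share)
```

Both values agree with the percentages reported in the text (40.88% of posts and 57.15% of users).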
  2. Annotation: Although there have been some annotated corpora on Chinese and English for SA [23,24], they do not explicitly model the interaction between emojis and text. To fill in this gap, we manually annotated a Chinese microblog corpus. A total of 10 annotators (graduate students majoring in data analytics) were engaged to label the corpus, which consists of 10,000 randomly selected microblog posts. The sentiment polarities of the posts were manually classified as positive, negative, and neutral, denoted by 1, -1, and 0, respectively (Table 1). The annotators were asked to label each post by considering both the plain text and embedded emojis.
    As there are several principal functions for which emojis are used (e.g., sentiment expression, sentiment enhancement, and sentiment modification) [25], the emoji usage of each post containing emojis was also annotated. Specifically, the emoji usage of each post was classified into three categories, strengthening, reversing (or revising), and uncertain, labelled by 1, -1, and 0, respectively, indicating whether the sentiment of the embedded emojis was consistent (1) or inconsistent (-1) with the sentiment of the text-only post (Table 2). The label 0 was used to denote when the effect of emojis in the post could not be confidently determined. We found that most emojis embedded in the posts were used to strengthen and clarify the sentiment of the original texts, accounting for approximately 73.6% of all posts with emojis included in the corpus. Finally, all 10,000 microblog posts were labelled with their sentiment polarities, of which 5499 posts containing emojis were also annotated with their emoji usages.
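The two label sets described in the annotation step could be held in a small record type. This is a minimal sketch: the class name `AnnotatedPost`, its field names, and the sample post are hypothetical, and only the label values (1, -1, 0) come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Label values as described in the annotation step.
SENTIMENT_LABELS = {1: "positive", -1: "negative", 0: "neutral"}
EMOJI_USAGE_LABELS = {1: "strengthening", -1: "reversing", 0: "uncertain"}


@dataclass(frozen=True)
class AnnotatedPost:
    """One annotated microblog post (hypothetical record layout)."""

    text: str
    sentiment: int
    emoji_usage: Optional[int] = None  # None for posts without emojis

    def __post_init__(self) -> None:
        if self.sentiment not in SENTIMENT_LABELS:
            raise ValueError(f"invalid sentiment label: {self.sentiment}")
        if self.emoji_usage is not None and self.emoji_usage not in EMOJI_USAGE_LABELS:
            raise ValueError(f"invalid emoji usage label: {self.emoji_usage}")


# Made-up example post; "[smile]" stands in for an emoji tag word.
example = AnnotatedPost(text="Great day [smile]", sentiment=1, emoji_usage=1)
print(SENTIMENT_LABELS[example.sentiment], EMOJI_USAGE_LABELS[example.emoji_usage])
```

Keeping `emoji_usage` optional reflects that only the posts containing emojis (5499 of the 10,000) carry a usage label.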
  3. CEmo-LSTM model:
    The architecture of the CEmo-LSTM model.
    As illustrated in Figure 1, our model includes the input sentence, word (emoji) representation, word embedding layer, Bi-LSTM layer, dropout layer, and a softmax layer. Given an input post 𝑆𝑖, the model first classifies the post according to whether there are any emojis embedded and evaluates the emoji usage of each post containing emojis. For posts containing emojis, both texts and emojis are input as features. Then, a microblog post can be described as {𝑤1, 𝑤2, …, 𝑤𝑖 ; 𝐸}, where 𝑤𝑖 denotes the word token and 𝐸 denotes the emoji. Through the embedding layer, both 𝑤𝑖 and 𝐸 are converted to the vector representation, 𝑑𝑖, as the input of the deep learning model to predict the sentiment polarity of a post. A Bi-LSTM layer is built to capture the representation of a microblog post, and a dropout layer is added to prevent over-fitting and improve the generalizability of the model. Finally, a softmax activation function is used to calculate a probability distribution 𝑝 over a set of sentiment polarities {1, −1, 0}. Consequently, a list of labels of input posts is predicted according to the corresponding output of the softmax layer.
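The final step of the model, turning the network's output into a probability distribution over {1, −1, 0} and picking a label, can be sketched in plain Python. This covers only the softmax stage, not the embedding or Bi-LSTM layers. The ordering of the output units in `POLARITIES` and the function names are assumptions for illustration.

```python
import math
from typing import List, Sequence

# Assumed order of the three output units; the source does not specify it.
POLARITIES = (1, -1, 0)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_polarity(logits: Sequence[float]) -> int:
    """Return the polarity whose softmax probability is highest."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return POLARITIES[best]


probs = softmax([2.0, 0.5, -1.0])
print(probs, predict_polarity([2.0, 0.5, -1.0]))
```

In a full implementation the logits would come from a dense layer applied to the Bi-LSTM output after dropout; here they are hand-written numbers.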
  4. Experiment design:
    RQ1: Does the supplementation of emojis promote the emotion recognition of texts? To answer this question, a rigorous contrast experiment was conducted. We compared the performance of SA algorithms on posts with embedded emojis and text-only posts, respectively, to measure the impact of emojis on emotion recognition. For text-only posts, a microblog text was described as {𝑤1, 𝑤2, …, 𝑤𝑖}, where 𝑤𝑖 denotes the word token. A post with embedded emojis was represented as {𝑤1, 𝑤2, …, 𝑤𝑖 ; 𝐸}, where E indicates the set of emoji tag words.
    RQ2: Can the tag words of emojis be directly used when constructing features? We examined whether the vagueness and ambiguity of emoji tag words would affect the sentiment identification of SA algorithms. Before constructing features, all emojis were converted into corresponding sentiment words (e.g., Sad, Happy) instead of emoji tag words based on their meanings and sentiments, and we evaluated the changes in algorithm performance. Accordingly, an emoji-embedded post was denoted as {𝑤1, 𝑤2, …, 𝑤𝑖 ; 𝐸𝑆}, where ES is the set of sentiment words translated from emojis.
    RQ3: Does classifying the training dataset by emoji usage improve the performance of SA algorithms? To answer this question, we classified the emoji usage of all posts containing emojis and examined the impact of introducing emoji usage on SA algorithms. We found that, in most posts on Weibo, the emotions expressed by emojis were consistent with the emotions of the plain text, and the main function of emojis was to clarify and enhance the sentiment of the sentence. Hence, strengthening posts in the corpus (labelled 1 in the emoji usage field) were selected and used to train SA models. A post classified by emoji usage 𝑈 was described as {𝑤𝑢1, 𝑤𝑢2, …, 𝑤𝑢𝑖 ; 𝐸𝑈}, where 𝑤𝑢𝑖 denotes the word token and 𝐸𝑈 stands for the set of embedded emojis.
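The feature sets used in the three research questions differ only in what is appended to the word tokens. The sketch below shows one way they might be built. The emoji tag words, the `EMOJI_TO_SENTIMENT` mapping, and the function names are hypothetical placeholders, since the source does not list the actual mapping.

```python
from typing import Dict, List, Sequence

# Hypothetical mapping from emoji tag words (E) to sentiment words (ES).
EMOJI_TO_SENTIMENT: Dict[str, str] = {"[smile]": "happy", "[tears]": "sad"}


def build_features(words: Sequence[str], emoji_tags: Sequence[str], setting: str) -> List[str]:
    """Build the token list for one post under a given experimental setting."""
    if setting == "text":      # {w1, ..., wi}
        return list(words)
    if setting == "text+E":    # {w1, ..., wi ; E}
        return list(words) + list(emoji_tags)
    if setting == "text+ES":   # {w1, ..., wi ; ES}
        # Unknown tags are kept unchanged (an assumption of this sketch).
        return list(words) + [EMOJI_TO_SENTIMENT.get(t, t) for t in emoji_tags]
    raise ValueError(f"unknown setting: {setting}")


def select_strengthening(posts: Sequence[dict]) -> List[dict]:
    """EU setting: keep only posts whose emoji usage label is 1."""
    return [p for p in posts if p.get("emoji_usage") == 1]


words, tags = ["today", "was", "fine"], ["[smile]"]
print(build_features(words, tags, "text+ES"))
```

For the EU setting, `select_strengthening` would be applied to the annotated corpus first, and the remaining posts would then be featurized with their emojis.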
  5. Baselines: Rule-based approach & Classification algorithms
    1) Rule-based approach: This approach relies on two lexicons: a traditional lexicon of sentiment words (sentiment lexicon, for short) and an emoji lexicon based on the sentiment of different emojis. Based on these two lexicons, we extracted all sentiment words and emojis contained in each post.
    Sentiment lexicon. To construct the sentiment lexicon, we first integrated four popular Chinese sentiment dictionaries: DUTIR, C-LIWC, HowNet, and NTUSD [29,30]. Then, by supplementing popular sentiment words used on the internet [31], we built a comprehensive sentiment lexicon, which is more suitable for SA on Weibo.
    Emoji lexicon. As there is significant heterogeneity [32,33] in the popularity of different emojis (i.e., in the Sina Weibo data used), the top 100 most popular emojis account for approximately 96% of all emojis used daily. We constructed an emoji lexicon (Table 3) based on the top 100 most frequently used emojis and classified them into three categories, positive, negative, and neutral, according to their official annotations and the emotions expressed. Each emoji was also assigned a sentiment value, with positive emojis scored from 1 to 5 and negative emojis scored from -1 to -5. The absolute value represents the emotional intensity.
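A lexicon-based scorer along the lines described above might look like the following. The source does not say how word and emoji scores are combined, so summing them and taking the sign is an assumption of this sketch. The lexicon entries are made-up placeholders, not items from the actual lexicons or Table 3.

```python
from typing import Dict, Sequence

# Placeholder lexicons; real entries would come from the merged dictionaries
# and the top-100 emoji lexicon. Word scores of +1/-1 are an assumption.
SENTIMENT_LEXICON: Dict[str, int] = {"good": 1, "terrible": -1}
EMOJI_LEXICON: Dict[str, int] = {"[smile]": 2, "[angry]": -4}  # range -5..5


def rule_based_polarity(words: Sequence[str], emoji_tags: Sequence[str]) -> int:
    """Sum lexicon scores of words and emojis; return 1, -1, or 0 by sign."""
    score = sum(SENTIMENT_LEXICON.get(w, 0) for w in words)
    score += sum(EMOJI_LEXICON.get(e, 0) for e in emoji_tags)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0


print(rule_based_polarity(["good"], ["[angry]"]))
```

With these placeholder values, a strongly negative emoji outweighs a mildly positive word, which illustrates how emoji intensity can change the predicted polarity of a short post.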
    2) Classification algorithms:
    Logistic Regression (LR): LR (text), LR (text+E), LR (text+ES), LR (EU)
    Support Vector Machine (SVM): SVM (text), SVM (text+E), SVM (text+ES), SVM (EU)
    Naive Bayes classifier (NB): NB (text), NB (text+E), NB (text+ES), NB (EU)
    Gradient Boosting Decision Tree (GBDT): GBDT (text), GBDT (text+E), GBDT (text+ES), GBDT (EU)
    Long Short-Term Memory (LSTM): LSTM (text), LSTM (text+E), LSTM (text+ES), LSTM (EU)
    Bidirectional Encoder Representation from Transformers (BERT): BERT (text), BERT (text+E), BERT (text+ES), BERT (EU)
    3) Evaluation metric:
    Ten-fold cross-validation was used, with accuracy λ = 𝑇∕𝑁, where 𝑇 is the number of predicted sentiment ratings identical to the manual sentiment ratings and 𝑁 is the number of posts.
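The metric λ = 𝑇∕𝑁 and a ten-fold split can be sketched without any external library. The fold assignment below (every k-th index, no shuffling) is an assumption made for simplicity; the source does not describe how the folds were formed.

```python
from typing import List, Sequence, Tuple


def accuracy(predicted: Sequence[int], manual: Sequence[int]) -> float:
    """lambda = T / N: share of predictions identical to the manual labels."""
    if len(predicted) != len(manual) or not manual:
        raise ValueError("inputs must be non-empty and of equal length")
    t = sum(1 for p, m in zip(predicted, manual) if p == m)
    return t / len(manual)


def k_fold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits; fold i holds indices i, i+k, ..."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        splits.append((train, folds[i]))
    return splits


print(accuracy([1, -1, 0, 1], [1, -1, 1, 1]))
splits = k_fold_indices(20)
```

In practice the accuracy would be computed on each held-out fold and averaged over the ten folds.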
  6. Results:
    1) Effect of emojis on the accuracy of sentiment recognition:
    Rule-based approach: We found that the performance of the algorithm with emoji posts (𝜆 = 0.561) was significantly better than with emoji-free posts (𝜆 = 0.360). Emojis are beneficial clues for the rule-based algorithm in SA tasks. This further indicates that emojis play an important role in clarifying and enhancing the sentiment of sentences. However, the accuracy of the rule-based algorithm in both scenarios was not satisfactory, possibly due to the short length of internet micro-texts and inadequate emotional clues.
    Classification algorithms:
    2) Feature comparison between emoji tag words and sentiment words
    3) Improving algorithm accuracy with sentiment strengthening
  7. Case study:
  8. Conclusion & discussion: