Aspect-level paper 8: Progressive Self-Supervised Attention Learning for Aspect-Level Sentiment Analysis (ACL 2019)

Paper link: https://arxiv.org/pdf/1906.01213v1.pdf

Code link:

Source: ACL 2019




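Paper information: title: Progressive Self-Supervised Attention Learning for Aspect-Level Sentiment Analysis (authors: Jialong Tang, Ziyao Lu, Jinsong Su, Yubin Ge, Linfeng Song, Le Sun and Jiebo Luo). The co-first authors are Jialong Tang, a PhD student (class of 2018) at the Institute of Software, Chinese Academy of Sciences, and Ziyao Lu, a master's student (class of 2018) at the School of Software, Xiamen University. The corresponding author is Associate Professor Jinsong Su.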






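Experiments on the SemEval 14 REST and LAPTOP datasets and on the colloquial TWITTER dataset show that the proposed progressive attention mechanism yields significant improvements on top of several state-of-the-art models.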


1. Abstract:

In aspect-level sentiment classification (ASC), it is prevalent to equip dominant neural models with attention mechanisms, for the sake of acquiring the importance of each context word on the given aspect. However, such a mechanism tends to excessively focus on a few frequent words with sentiment polarities, while ignoring infrequent ones. In this paper, we propose a progressive self-supervised attention learning approach for neural ASC models, which automatically mines useful attention supervision information from a training corpus to refine attention mechanisms. Specifically, we iteratively conduct sentiment predictions on all training instances. Particularly, at each iteration, the context word with the maximum attention weight is extracted as the one with active/misleading influence on the correct/incorrect prediction of every instance, and then the word itself is masked for subsequent iterations. Finally, we augment the conventional training objective with a regularization term, which enables ASC models to continue equally focusing on the extracted active context words while decreasing weights of those misleading ones.

2. Introduction

Aspect-level sentiment classification (ASC), as an indispensable task in sentiment analysis, aims at inferring the sentiment polarity of an input sentence in a certain aspect.
However, the existing attention mechanism in ASC suffers from a major drawback. Specifically, it is prone to focus too heavily on a few frequent words with sentiment polarities, while paying little attention to low-frequency ones. As a result, the performance of attentional neural ASC models is still far from satisfactory. We speculate that this is because “apparent patterns” and “inapparent patterns” exist widely in the data. Here, “apparent patterns” are high-frequency words with strong sentiment polarities, and “inapparent patterns” are low-frequency ones in the training data. NNs are easily affected by these two kinds of patterns: “apparent patterns” tend to be over-learned, while “inapparent patterns” often cannot be fully learned.




In the first three training sentences (Table 1 of the paper), because the context word “small” occurs frequently with negative sentiment, the attention mechanism pays more attention to it and directly relates the sentences containing it to negative sentiment. This inevitably causes another informative context word, “crowded”, to be partially neglected even though it also carries negative sentiment. Consequently, a neural ASC model incorrectly predicts the sentiment of the last two test sentences: in the first test sentence, the model fails to capture the negative sentiment implied by “crowded”; in the second test sentence, the attention mechanism focuses directly on “small” even though it is not related to the given aspect.


The authors therefore propose a novel progressive self-supervised attention learning approach for neural ASC models.

The contributions are three-fold:

(1) Through in-depth analysis, we point out the existing drawback of the attention mechanism for ASC.

(2) We propose a novel incremental approach to automatically extract attention supervision information for neural ASC models. To the best of our knowledge, our work is the first attempt to explore automatic attention supervision information mining for ASC.

(3) We apply our approach to two dominant neural ASC models: Memory Network (MN) (Tang et al., 2016b; Wang et al., 2018) and Transformation Network (TNet) (Li et al., 2018). Experimental results on several benchmark datasets demonstrate the effectiveness of our approach.






3.1 Memory Network (MN):


The final vector representation v(t) of the aspect t is then defined as the averaged aspect embedding of its words, and the attention output is

o = \sum_i \mathrm{softmax}(v_t^{\top} M m_i) h_i

where v_t denotes v(t).
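The attention formula above can be sketched in a few lines of plain Python. This is only an illustrative sketch: it assumes M is a square attention matrix, m_i is the memory vector of the i-th context word, and h_i is its representation; the function names and the toy numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mn_attention(v_t, M, memory, hidden):
    """Return (o, alpha) with alpha = softmax_i(v_t^T M m_i) and o = sum_i alpha_i * h_i."""
    # Row vector v_t^T M, computed once and reused for every context word.
    vM = [sum(v_t[r] * M[r][c] for r in range(len(v_t))) for c in range(len(M[0]))]
    scores = [sum(a * b for a, b in zip(vM, m_i)) for m_i in memory]
    alpha = softmax(scores)
    dim = len(hidden[0])
    o = [sum(alpha[i] * hidden[i][d] for i in range(len(hidden))) for d in range(dim)]
    return o, alpha

# Toy example: three context words, 2-dimensional vectors (made-up values).
v_t = [1.0, 0.0]
M = [[1.0, 0.0], [0.0, 1.0]]
memory = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
hidden = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
o, alpha = mn_attention(v_t, M, memory, hidden)
```

In this toy setup the first word has the largest score, so it receives the largest attention weight and dominates o.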

3.2 Transformation Network (TNet/TNet-ATT)


(1) The bottom layer is a Bi-LSTM that transforms the input x into the contextualized word representations h^{(0)}(x) = (h_1^{(0)}, h_2^{(0)}, \cdots, h_N^{(0)}) (i.e., the hidden states of the Bi-LSTM).

(2) The middle part, as the core of the whole model, contains L layers of Context-Preserving Transformation (CPT), where word representations are updated as h^{(l+1)}(x) = CPT(h^{(l)}(x)). The key operation of CPT layers is Target-Specific Transformation. It contains another Bi-LSTM for generating v(t) via an attention mechanism, and then incorporates v(t) into the word representations. Besides, CPT layers are also equipped with a Context-Preserving Mechanism (CPM) to preserve the context information and learn more abstract word-level features. In the end, we obtain the word-level semantic representations h(x) = (h_1, h_2, \cdots, h_N), with h_i = h_i^{(L)}.

(3) The topmost part is a CNN layer used to produce the aspect-related sentence representation o for the sentiment classification. 



3.3 Training objective (NLL)
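The heading names the objective as negative log-likelihood (NLL); the formula itself is not reproduced in these notes. As a rough illustration, a standard NLL over predicted sentiment distributions might look like the sketch below, assuming three sentiment classes indexed 0 to 2 (the class indexing, the function name, and the toy numbers are assumptions).

```python
import math

def nll_loss(pred_dists, gold_labels):
    """Sum over instances of -log p(gold label | sentence, aspect)."""
    return -sum(math.log(dist[y]) for dist, y in zip(pred_dists, gold_labels))

# Two toy instances with predicted distributions over 3 classes.
preds = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
gold = [0, 2]
loss = nll_loss(preds, gold)  # equals -(log 0.7 + log 0.8)
```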


4. Model


We first use the initial training corpus D to conduct model training, and then obtain the initial model parameters θ(0) (Line 1). Then, we continue training the model for K iterations, where influential context words of all training instances can be iteratively extracted (Lines 6-25). During this process, for each training instance (x,t,y), we introduce two word sets initialized as ∅ (Lines 2-5) to record its extracted context words: (1) s_a(x) consists of context words with active effects on the sentiment prediction of x. Each word of s_a(x) will be encouraged to remain considered in the refined model training, and (2) s_m(x) contains context words with misleading effects, whose attention weights are expected to be decreased. Specifically, at the k-th training iteration, we adopt the following steps to deal with (x,t,y):


Step 1: Lines 9-11


Step 3: Lines 13-20

Step 4: Lines 21-24 (see the paper for details)

where E(\alpha(x')) = -\sum_{i=1}^{N} \alpha(x'_i) \log \alpha(x'_i)
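The entropy above measures how peaked the attention distribution is. A minimal sketch of one mining step for a single instance follows. It assumes a word is extracted only when the entropy falls below a threshold eps; this gating rule is an assumption of the sketch, and the steps themselves are only referenced by line number in these notes. The rule from the abstract then decides where the max-attention word goes: s_a(x) if the prediction is correct, s_m(x) if it is not. The names attention_entropy, extract_word, and the "<mask>" token are illustrative.

```python
import math

MASK = "<mask>"  # placeholder token; the exact token string is an assumption

def attention_entropy(alpha):
    """E(alpha) = -sum_i alpha_i * log(alpha_i); zero weights contribute nothing."""
    return -sum(a * math.log(a) for a in alpha if a > 0.0)

def extract_word(tokens, alpha, y_pred, y_true, s_a, s_m, eps):
    """One mining step: update s_a / s_m in place and return the masked sentence."""
    if attention_entropy(alpha) >= eps:
        # Attention is too flat: no single word stands out, so extract nothing.
        return list(tokens)
    idx = max(range(len(tokens)), key=lambda i: alpha[i])
    # Correct prediction -> active word; incorrect prediction -> misleading word.
    (s_a if y_pred == y_true else s_m).add(tokens[idx])
    masked = list(tokens)
    masked[idx] = MASK  # hide the word from subsequent iterations
    return masked

tokens = ["the", "place", "is", "small", "and", "crowded"]
alpha = [0.02, 0.03, 0.02, 0.85, 0.03, 0.05]
s_a, s_m = set(), set()
masked = extract_word(tokens, alpha, "neg", "neg", s_a, s_m, eps=1.0)
```

Here the attention is concentrated on "small" and the prediction is correct, so "small" ends up in s_a and is masked for the next iteration.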

Through K iterations of the above steps, we manage to extract influential context words of all training instances. Table 2 illustrates the context word mining process of the first sentence shown in Table 1. In this example, we iteratively extract three context words in turn: “small”, “crowded” and “quick”. The former two words are included in s_a(x), while the last one is contained in s_m(x). Finally, the extracted context words of each training instance will be included in D, forming a final training corpus D_s with attention supervision information (Lines 26-29), which will be used to carry out the last model training (Line 30).
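The abstract describes the final training as the conventional objective plus a regularization term that keeps attention spread equally over the active words and lowers it on the misleading ones. One possible form of such a term is sketched below. It assumes uniform targets 1/|s_a(x)| for active words, zero targets for misleading words, a squared-distance penalty, and a coefficient gamma; these concrete choices and all function names are assumptions of the sketch, not necessarily the paper's exact formulation.

```python
def attention_targets(tokens, s_a, s_m):
    """Assumed targets: uniform over active words, 0 for misleading words, None if unsupervised."""
    n_active = sum(1 for w in tokens if w in s_a)
    targets = []
    for w in tokens:
        if w in s_a:
            targets.append(1.0 / n_active)
        elif w in s_m:
            targets.append(0.0)
        else:
            targets.append(None)
    return targets

def attention_penalty(alpha, targets):
    """Squared distance between model attention and targets on supervised positions only."""
    return sum((a - t) ** 2 for a, t in zip(alpha, targets) if t is not None)

def regularized_loss(nll, alpha, targets, gamma):
    """Conventional loss plus the weighted attention penalty."""
    return nll + gamma * attention_penalty(alpha, targets)

# Word sets from the example in the text; the attention weights are made up.
tokens = ["small", "crowded", "quick", "service"]
targets = attention_targets(tokens, {"small", "crowded"}, {"quick"})
alpha = [0.6, 0.1, 0.2, 0.1]
penalty = attention_penalty(alpha, targets)
loss = regularized_loss(1.0, alpha, targets, gamma=0.5)
```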



5.3 Parameter-Setting

GloVe embedding dimension: 300; OOV words are initialized from U[-0.25, 0.25]; other parameters are initialized from U[-0.01, 0.01]; dropout is used; optimizer: Adam; learning rate: 0.001.
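The initialization scheme listed above can be illustrated with a small sketch. It assumes the pre-trained vectors are available as a plain dictionary; the function names, the toy vocabulary, and the fixed seed are hypothetical and only serve the illustration.

```python
import random

def init_embeddings(vocab, pretrained, dim=300, seed=0):
    """Copy pre-trained vectors where available; draw OOV vectors from U[-0.25, 0.25]."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return table

def init_parameter(size, seed=0):
    """Draw a flat parameter vector from U[-0.01, 0.01]."""
    rng = random.Random(seed)
    return [rng.uniform(-0.01, 0.01) for _ in range(size)]

# Toy usage: one known word and one out-of-vocabulary word.
pretrained = {"good": [0.1] * 300}
table = init_embeddings(["good", "zzz_oov"], pretrained)
weights = init_parameter(10)
```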

Number of iterations: K = 5; regularization coefficients \gamma:



Evaluation metrics: accuracy and macro-F1.
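For reference, the two metrics can be computed as in the sketch below. It follows the usual definitions (macro-F1 is the unweighted mean of per-class F1 scores); the label names and toy predictions are made up for illustration.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
    return sum(f1_scores) / len(f1_scores)

gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

Macro-F1 weights every class equally, so it is more sensitive than accuracy to errors on rare classes.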

dataset split: 80% for training and 20% for testing.

5.4 Results

Explore the effect of changing \varepsilon_a.


Case study:

6. Conclusion and future work

