SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge
Abstract
Most existing pre-trained language representation models ignore the linguistic knowledge in text, even though such knowledge can facilitate language understanding in NLP tasks. To benefit downstream sentiment analysis tasks, we propose a novel language representation model called SentiLARE, which injects word-level linguistic knowledge, including part-of-speech (POS) tags and sentiment polarity (inferred from SentiWordNet), into the pre-trained model. First, we propose a context-aware sentiment attention mechanism that acquires the sentiment polarity of each word, given its POS tag, by querying SentiWordNet. Then, we devise a new pre-training task called label-aware masked language model to construct knowledge-aware language representations. Experiments show that SentiLARE achieves state-of-the-art performance on a variety of sentiment analysis tasks.
Model
Our task is defined as follows: given a text sequence $X = (x_1, x_2, \cdots, x_n)$ of length $n$, our goal is to obtain a representation of the whole sequence $H = (h_1, h_2, \cdots, h_n)^\top \in \mathbb{R}^{n \times d}$ that captures both contextual information and linguistic knowledge, where $d$ denotes the dimension of the representation vectors.
Figure 1 gives an overview of our model, which consists of two steps:
1) Acquiring the part-of-speech tag and the sentiment polarity for each word;
2) Conducting pre-training via the label-aware masked language model, which contains two pre-training sub-tasks, i.e., early fusion and late supervision.
Compared with existing BERT-style pre-trained models, our model enriches the input sequence with linguistic knowledge such as part-of-speech tags and sentiment polarity, and uses the label-aware masked language model to capture the relationship between sentence-level language representations and word-level linguistic knowledge.
Linguistic Knowledge Acquisition
Input: a text sequence $X = (x_1, x_2, \cdots, x_n)$, where $x_i \ (1 \le i \le n)$ indicates a word in the vocabulary.
Stanford Log-Linear Part-of-Speech Tagger: obtains the part-of-speech tag $pos_i$ of each word $x_i$. For simplicity, only five tags are considered: verb $(v)$, noun $(n)$, adjective $(a)$, adverb $(r)$, and others $(o)$.
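As a dependency-free sketch of this step, fine-grained Penn Treebank tags (the tagset produced by the Stanford tagger) can be collapsed into the five coarse tags with a simple prefix mapping; the demo tag list is illustrative:

```python
# Map fine-grained Penn Treebank POS tags to the five coarse tags
# used by SentiLARE: verb (v), noun (n), adjective (a), adverb (r), others (o).
# Any tagger that emits Penn Treebank tags can feed this mapping.

def to_coarse_tag(ptb_tag: str) -> str:
    if ptb_tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if ptb_tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    if ptb_tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if ptb_tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "o"      # everything else

tags = [to_coarse_tag(t) for t in ["DT", "JJ", "NN", "VBZ", "RB"]]
print(tags)  # ['o', 'a', 'n', 'v', 'r']
```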
SentiWordNet: querying with the pair $(x_i, pos_i)$ returns $m$ different senses,
- each of which contains a sense number, a positive / negative score, and a gloss: $(SN_i^{(j)}, Pscore_i^{(j)} / Nscore_i^{(j)}, G_i^{(j)})$, $1 \le j \le m$
- ($SN$ denotes the rank of each sense, $Pscore_i^{(j)} / Nscore_i^{(j)}$ denote the positive / negative scores obtained from SentiWordNet, and $G_i^{(j)}$ denotes the gloss, i.e., the definition of each sense)
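The per-sense record can be sketched as a small data structure; the `Sense` class and the toy `LEXICON` below are illustrative stand-ins for real SentiWordNet entries, not part of the original model:

```python
from dataclasses import dataclass

# Sketch of the per-sense record retrieved from SentiWordNet for a
# (word, POS) pair: sense rank SN, positive/negative scores, and gloss.
# The toy LEXICON is made up; real entries come from SentiWordNet.

@dataclass
class Sense:
    rank: int         # SN_i^(j): 1 = most frequent sense
    pos_score: float  # Pscore_i^(j) in [0, 1]
    neg_score: float  # Nscore_i^(j) in [0, 1]
    gloss: str        # G_i^(j): definition text of this sense

LEXICON = {
    ("good", "a"): [
        Sense(1, 0.75, 0.0, "having desirable or positive qualities"),
        Sense(2, 0.5, 0.0, "morally admirable"),
    ],
}

senses = LEXICON.get(("good", "a"), [])
print(len(senses))  # 2
```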
Inspired by SentiWordNet, we propose a context-aware attention mechanism that considers both the sense rank and the context-gloss similarity to determine the attention weight of each sense:

$$\alpha_i^{(j)} = \mathrm{softmax}\left(\frac{1}{SN_i^{(j)}} \cdot sim(X, G_i^{(j)})\right)$$
- $\frac{1}{SN_i^{(j)}}$ approximates the effect of sense frequency, since a smaller sense rank indicates that the sense is used more frequently in natural language.
- $sim(X, G_i^{(j)})$ denotes the textual similarity between the context and the gloss of each sense, an important feature commonly used in unsupervised word sense disambiguation. To compute the similarity between $X$ and $G_i^{(j)}$, we encode them with Sentence-BERT (SBERT), which achieves state-of-the-art performance on semantic textual similarity tasks, and take the cosine similarity between the resulting vectors:

$$sim(X, G_i^{(j)}) = \cos\left(SBERT(X), SBERT(G_i^{(j)})\right)$$
Having obtained the attention weight of each sense, we compute the sentiment score of each $(x_i, pos_i)$ pair by simply weighting the scores of all the senses:
$$s(x_i, pos_i) = \sum\limits_{j=1}^{m} \alpha_i^{(j)} \left(Pscore_i^{(j)} - Nscore_i^{(j)}\right)$$
Finally, the word-level sentiment polarity $polar_i$ for the pair $(x_i, pos_i)$ can be assigned with $Positive / Negative / Neutral$ when $s(x_i, pos_i)$ is positive / negative / zero, respectively. Note that if we cannot find any sense for $(x_i, pos_i)$ in SentiWordNet, $polar_i$ is assigned with $Neutral$.
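Putting the attention weights, the weighted score, and the polarity thresholding together, a self-contained sketch (with made-up sense ranks, scores, and context-gloss similarities) might look like:

```python
import math

# End-to-end sketch of the context-aware sentiment attention:
#   alpha_i^(j) = softmax_j( sim(X, G_i^(j)) / SN_i^(j) )
#   s(x_i, pos_i) = sum_j alpha_i^(j) * (Pscore_i^(j) - Nscore_i^(j))
# then threshold s at zero to get Positive / Negative / Neutral.
# All numbers in the demo call are illustrative.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def word_polarity(senses):
    # senses: list of (rank SN, pos_score, neg_score, sim_to_context)
    if not senses:
        return "Neutral"  # (x_i, pos_i) not found in SentiWordNet
    logits = [sim / sn for sn, _, _, sim in senses]
    alphas = softmax(logits)
    score = sum(a * (p - n) for a, (_, p, n, _) in zip(alphas, senses))
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(word_polarity([(1, 0.75, 0.0, 0.6), (2, 0.0, 0.25, 0.3)]))  # Positive
print(word_polarity([]))  # Neutral
```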
Pre-training Task
The steps above yield the knowledge-enhanced text sequence $X_k = \{(x_i, pos_i, polar_i)\}^n_{i=1}$.
We design a novel supervised pre-training task called the label-aware masked language model (LA-MLM), which introduces the sentence-level sentiment label $l$ during pre-training to capture the dependency between sentence-level language representations and individual words. It contains two independent sub-tasks: early fusion and late supervision.
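A simplified sketch of constructing one LA-MLM training instance follows; the masking probability, the triple format, and the single `[MASK]` placeholder are simplifications of the actual BERT-style masking scheme, and the example sequence is made up:

```python
import random

# Build one LA-MLM instance: mask some positions of the knowledge-enhanced
# sequence (x_i, pos_i, polar_i); the model must recover the word, the POS
# tag and the polarity at each masked position (80/10/10 replacement and
# subword handling from BERT are omitted for clarity).

def mask_sequence(triples, mask_prob=0.5, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, (tok, tag, pol) in enumerate(triples):
        if rng.random() < mask_prob:
            targets[i] = (tok, tag, pol)  # prediction targets
            masked.append(("[MASK]", "[MASK]", "[MASK]"))
        else:
            masked.append((tok, tag, pol))
    return masked, targets

seq = [("the", "o", "Neutral"), ("movie", "n", "Neutral"),
       ("was", "v", "Neutral"), ("great", "a", "Positive")]
masked, targets = mask_sequence(seq)
print(len(masked), len(targets))
```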
Early Fusion
$$(h^{EF}_{cls}, h^{EF}_{1}, \ldots, h^{EF}_{n}, h^{EF}_{sep}) = \mathrm{Transformer}(\hat{X}_k, l)$$
$\hat{X}_k$ contains:
- the embeddings used in BERT
- the part-of-speech (POS) embedding
- the word-level polarity embedding

The model is required to predict the word, the POS tag, and the word-level polarity at each masked position.
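A toy sketch of how the three embedding types could be summed per position, assuming random toy embedding tables and omitting BERT's position/segment embeddings (table sizes and the hidden dimension are made-up values):

```python
import numpy as np

# Early-fusion input sketch: each position sums the BERT token embedding
# with a POS-tag embedding and a word-level polarity embedding.

rng = np.random.default_rng(0)
d = 8                                  # hidden size (toy)
tok_emb = rng.normal(size=(100, d))    # token embedding table (toy vocab)
pos_tag_emb = rng.normal(size=(5, d))  # one row per coarse tag {v,n,a,r,o}
polar_emb = rng.normal(size=(3, d))    # Positive / Negative / Neutral

def embed(token_ids, tag_ids, polar_ids):
    # Sum the three lookups position-wise; position/segment embeddings
    # from BERT are omitted here for brevity.
    return tok_emb[token_ids] + pos_tag_emb[tag_ids] + polar_emb[polar_ids]

h = embed([3, 17, 42], [1, 2, 0], [0, 2, 1])
print(h.shape)  # (3, 8)
```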
Late Supervision
Based on the hidden states of [CLS] and of the masked positions, the model predicts the sentence-level label and the word-level information, respectively:
$$(h^{LS}_{cls}, h^{LS}_{1}, \ldots, h^{LS}_{n}, h^{LS}_{sep}) = \mathrm{Transformer}(\hat{X}_k)$$

Note that, unlike early fusion, the sentence-level label $l$ is not fed into the input here; it is the prediction target of the [CLS] hidden state.
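The late-supervision heads can be sketched with toy weights: the sentence label is predicted from the [CLS] state, and the masked-position heads (which predict word-level information) are analogous and omitted. The dimensions and weights below are made-up values:

```python
import numpy as np

# Late supervision sketch: a linear classification head over the [CLS]
# hidden state predicts the sentence-level sentiment label.

rng = np.random.default_rng(1)
d, n_labels = 8, 2                      # hidden size, label count (toy)
W = rng.normal(size=(d, n_labels))      # classifier weights (toy)
b = np.zeros(n_labels)                  # classifier bias

h_cls = rng.normal(size=d)              # stands in for h^{LS}_{cls}
logits = h_cls @ W + b
pred = int(np.argmax(logits))
print(pred in (0, 1))  # True
```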