CM-Net:
Previous approaches:
1. Slot filling and intent detection handled separately: model the utterance with an RNN or CNN, then apply a classification algorithm.
2. Slot filling and intent detection treated as related:
- Goo et al. (2018): the intent influences slot filling (once the intent is produced, it is injected into the LSTM that generates the slots).
- Zhang et al. (2018) propose a capsule model: word -> slot -> intent, then routed back. (limited in capturing complicated correlations among words, slots and intents; local context information, which has been shown highly useful for slot filling (Mesnil et al., 2014), is not explicitly modeled.)
- The CM-Net approach (quoting the paper):
- directly capture semantic relationships among words, slots and intents, which is conducted simultaneously at each word position in a collaborative manner.
- alternately perform information exchange among the task-specific features referred from memories, local context representations and global sequential information via the well-designed block, named CM-block.
The algorithm itself is actually quite simple:
Previously: the slots were obtained from $p(y^{slot} \mid \mathbf{H})$ (with Viterbi decoding at test time), and the intent was predicted by averaging the $\mathbf{h}_t$ and applying a classifier. CM-Net couples the slot and intent computations.
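A minimal PyTorch sketch of those earlier-style decoders, under my own assumptions about shapes (the CRF/Viterbi layer that would sit on top of the slot emissions is omitted):

```python
import torch
import torch.nn as nn

class BaselineDecoders(nn.Module):
    """Per-token slot emission scores plus mean-pooled intent classifier (CRF/Viterbi omitted)."""
    def __init__(self, hidden_dim, n_slots, n_intents):
        super().__init__()
        self.slot_proj = nn.Linear(hidden_dim, n_slots)
        self.intent_proj = nn.Linear(hidden_dim, n_intents)

    def forward(self, H):                                   # H: (batch, seq_len, hidden_dim)
        slot_logits = self.slot_proj(H)                     # emissions for p(y^slot | H)
        intent_logits = self.intent_proj(H.mean(dim=1))     # average of h_t, then classify
        return slot_logits, intent_logits
```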
The slot and intent memories hold the initially extracted, coarse-grained slot/intent features (randomly initialized, then learned).
$$
\begin{aligned}
\widetilde{\mathbf{h}}_{t}^{int} &= ATT\left(\mathbf{h}_{t}, \mathbf{M}^{\mathrm{int}}\right) \\
\mathbf{h}_{t}^{slot} &= ATT\left(\left[\mathbf{h}_{t} ; \widetilde{\mathbf{h}}_{t}^{int}\right], \mathbf{M}^{\mathrm{slot}}\right)
\end{aligned}
$$
$$
\begin{aligned}
ATT\left(\mathbf{h}_{t}, \mathbf{M}^{x}\right) &= \sum_{i} \alpha_{i} \mathbf{m}_{i}^{x} \\
\alpha_{i} &= \frac{\exp\left(\mathbf{u}^{\top} s_{i}\right)}{\sum_{j} \exp\left(\mathbf{u}^{\top} s_{j}\right)} \\
s_{i} &= \mathbf{h}_{t}^{\top} \mathbf{W} \mathbf{m}_{i}^{x}
\end{aligned}
$$
$$
\begin{aligned}
\widetilde{\mathbf{h}}_{t}^{slot} &= ATT\left(\mathbf{h}_{t}, \mathbf{M}^{\mathrm{slot}}\right) \\
\mathbf{h}_{t}^{int} &= ATT\left(\left[\mathbf{h}_{t} ; \widetilde{\mathbf{h}}_{t}^{slot}\right], \mathbf{M}^{\mathrm{int}}\right)
\end{aligned}
$$
That is, $\mathbf{h}_t \rightarrow \widetilde{\mathbf{h}}_{t}^{int} \rightarrow \mathbf{h}_{t}^{slot}$: in the first step, the ATT function retrieves an intent-memory summary conditioned on $\mathbf{h}_t$; in the second step, this intent-enriched representation interacts with the slot memory to obtain the position-specific slot information. The reverse direction (slot first, then intent) is computed symmetrically.
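A rough PyTorch sketch of this memory attention; I collapse the scoring into a single bilinear form, so it is not exactly the paper's parameterisation, and all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Bilinear attention over a memory: ATT(q, M) = sum_i softmax_i(q^T W m_i) * m_i."""
    def __init__(self, query_dim, mem_dim):
        super().__init__()
        self.W = nn.Linear(query_dim, mem_dim, bias=False)

    def forward(self, q, M):                     # q: (batch, query_dim), M: (n_mem, mem_dim)
        scores = self.W(q) @ M.t()               # (batch, n_mem)
        alpha = torch.softmax(scores, dim=-1)
        return alpha @ M                         # (batch, mem_dim): weighted memory readout

# Two-stage retrieval at one word position t (dims are made up for the example):
d, n_int, n_slot = 128, 16, 64
h_t = torch.randn(8, d)                          # word-level hidden state
M_int = torch.randn(n_int, d)                    # intent memory (randomly initialised, then learned)
M_slot = torch.randn(n_slot, d)                  # slot memory

att_int, att_slot = MemoryAttention(d, d), MemoryAttention(2 * d, d)
h_int_tilde = att_int(h_t, M_int)                                   # intent-memory summary
h_slot = att_slot(torch.cat([h_t, h_int_tilde], dim=-1), M_slot)    # intent-aware slot retrieval
```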
Borrowing from a 2018 model, CM-Net uses the S-LSTM (Yue Zhang's work; the authors suggest it could be replaced with BERT) and combines the retrieved information with the original representations to obtain new local features (see the figure in the paper):
It produces the six S-LSTM state-transition quantities ${i,o,f,l,r,u}$. (Explanation: "the hidden state is updated with abundant information from different perspectives, namely word embeddings, local contexts, slots and intents representations.")
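A heavily simplified sketch of that fusion idea; this is not the actual S-LSTM update with its six gates, just a gated convex mix of the four information sources, with hypothetical names and dimensions:

```python
import torch
import torch.nn as nn

class LocalFusionCell(nn.Module):
    """Schematic gated fusion of word embedding, local context (left/right neighbours),
    and the retrieved slot/intent features; the real CM-block uses the full S-LSTM
    update with {i, o, f, l, r, u} gates."""
    def __init__(self, dim):
        super().__init__()
        self.gates = nn.Linear(5 * dim, 4)       # one mixing weight per information source
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_t, h_left, h_right, h_slot, h_int):          # all: (batch, dim)
        feats = torch.stack([x_t, (h_left + h_right) / 2, h_slot, h_int], dim=1)  # (B, 4, dim)
        g = torch.softmax(self.gates(torch.cat([x_t, h_left, h_right, h_slot, h_int], -1)), -1)
        mixed = (g.unsqueeze(-1) * feats).sum(dim=1)   # convex combination of the four sources
        return torch.tanh(self.proj(mixed))            # updated local hidden state
```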
The third step is the global one: a BiLSTM over the fused local features captures global sequence information.
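Sketch of that global step (dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Third step: a BiLSTM over the fused local features to get global sequence information.
bilstm = nn.LSTM(input_size=128, hidden_size=64, bidirectional=True, batch_first=True)
local_feats = torch.randn(8, 20, 128)       # (batch, seq_len, feature_dim)
global_feats, _ = bilstm(local_feats)       # (8, 20, 128): 64 per direction, concatenated
```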
The final prediction functions are straightforward.
Overall, it is deep information fusion; the naming is quite fancy.
I am still not fully convinced, though. Can it really do this well without external KB information?
It is hard to see this generalizing broadly, because the extraction features and intent classification the model learns come solely from the training data and are not hooked to the structural properties of the sequence.
SF-ID network:
Its numbers are not as high as the previous model's (ELMo and BERT are very strong, but CM-Net surpasses them).
$$
c_{slot}^{i}=\sum_{j=1}^{T} \alpha_{i, j}^{S} h_{j}
$$

($c_{inte}$ is obtained analogously, also produced by attention.)
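A sketch of these context vectors; the scoring function here is a simplified bilinear one and the pooled $c_{inte}$ is my own shortcut, so treat it only as a shape illustration:

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """Per-position slot contexts c_slot^i = sum_j alpha_{i,j} h_j, plus one pooled intent
    context c_inte (simplified scoring, not necessarily the paper's exact form)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, dim, bias=False)

    def forward(self, H):                           # H: (batch, T, dim)
        scores = self.score(H) @ H.transpose(1, 2)  # (batch, T, T): position i attends to j
        alpha = torch.softmax(scores, dim=-1)
        c_slot = alpha @ H                          # (batch, T, dim): c_slot^i for every i
        c_inte = c_slot.mean(dim=1)                 # (batch, dim): pooled intent context (sketch)
        return c_slot, c_inte
```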
SF subnet:
$$
\begin{aligned}
f &= \sum V * \tanh\left(c_{slot}^{i}+W * c_{inte}\right) \\
r_{slot}^{i} &= f \cdot c_{slot}^{i}
\end{aligned}
$$
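A sketch of the SF subnet under my reading that $f$ acts as a single scalar fusion gate (the exact reduction over $i$ may differ in the paper; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class SFSubnet(nn.Module):
    """SF subnet sketch: f = sum_i v^T tanh(c_slot^i + W c_inte), then r_slot^i = f * c_slot^i."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, c_slot, c_inte):                            # (B, T, dim), (B, dim)
        fused = torch.tanh(c_slot + self.W(c_inte).unsqueeze(1))  # (B, T, dim)
        f = self.v(fused).sum(dim=1)                              # (B, 1): scalar fusion gate
        return f.unsqueeze(1) * c_slot                            # (B, T, dim): reinforced r_slot^i
```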
ID subnet:
$$
\begin{aligned}
r &= \sum_{i=1}^{T} \alpha_{i} \cdot r_{slot}^{i} \\
\alpha_{i} &= \frac{\exp\left(e_{i, i}\right)}{\sum_{j=1}^{T} \exp\left(e_{i, j}\right)} \\
e_{i, j} &= W * \tanh\left(V_{1} * r_{slot}^{i}+V_{2} * h_{j}+b\right) \\
r_{inte} &= r + c_{inte}
\end{aligned}
$$
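A sketch of the ID subnet following the formulas above (illustrative dimensions):

```python
import torch
import torch.nn as nn

class IDSubnet(nn.Module):
    """ID subnet sketch: e_{i,j} = w^T tanh(V1 r_slot^i + V2 h_j + b),
    alpha_i = exp(e_{i,i}) / sum_j exp(e_{i,j}), r = sum_i alpha_i r_slot^i, r_inte = r + c_inte."""
    def __init__(self, dim):
        super().__init__()
        self.V1 = nn.Linear(dim, dim, bias=False)
        self.V2 = nn.Linear(dim, dim, bias=True)        # the bias plays the role of b
        self.w = nn.Linear(dim, 1, bias=False)

    def forward(self, r_slot, H, c_inte):               # (B, T, dim), (B, T, dim), (B, dim)
        e = self.w(torch.tanh(self.V1(r_slot).unsqueeze(2)      # (B, T, 1, dim)
                              + self.V2(H).unsqueeze(1))        # (B, 1, T, dim)
                   ).squeeze(-1)                                # (B, T, T)
        alpha = torch.softmax(e, dim=-1).diagonal(dim1=1, dim2=2)   # (B, T): alpha_i
        r = (alpha.unsqueeze(-1) * r_slot).sum(dim=1)            # (B, dim)
        return r + c_inte                                        # r_inte
```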
Iteration Mechanism: (the $f$ function of the SF subnet is replaced by the one below; essentially $r_{inte}$ takes the place of $c_{inte}$)
$$
f=\sum V * \tanh\left(c_{slot}^{i}+W * r_{inte}\right)
$$
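Continuing the SFSubnet / IDSubnet sketches above, the iteration then just cycles SF -> ID with $r_{inte}$ feeding back in (the number of iterations is a hyper-parameter; shapes are illustrative):

```python
import torch

B, T, dim = 8, 20, 128
H = torch.randn(B, T, dim)
c_slot, c_inte = torch.randn(B, T, dim), torch.randn(B, dim)
sf, intent_net = SFSubnet(dim), IDSubnet(dim)     # classes from the sketches above

inte_feat = c_inte                                # first pass uses the raw intent context
for _ in range(3):                                # number of iterations is a hyper-parameter
    r_slot = sf(c_slot, inte_feat)                # f = sum_i v^T tanh(c_slot^i + W * inte_feat)
    inte_feat = intent_net(r_slot, H, c_inte)     # r_inte replaces c_inte in the next SF pass
```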
Output generation:
$$
\begin{aligned}
y_{inte} &= \operatorname{softmax}\left(W_{inte}^{hy} \operatorname{concat}\left(h_{T}, r_{inte}\right)\right) \\
y_{slot}^{i} &= \operatorname{softmax}\left(W_{slot}^{hy} \operatorname{concat}\left(h_{i}, r_{slot}^{i}\right)\right)
\end{aligned}
$$
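A sketch of these prediction layers (a CRF would normally sit on top of the slot probabilities; shapes are illustrative):

```python
import torch
import torch.nn as nn

class SFIDOutputs(nn.Module):
    """Intent from [h_T; r_inte], slot label at each position i from [h_i; r_slot^i]."""
    def __init__(self, dim, n_slots, n_intents):
        super().__init__()
        self.W_inte = nn.Linear(2 * dim, n_intents)
        self.W_slot = nn.Linear(2 * dim, n_slots)

    def forward(self, H, r_slot, r_inte):               # (B, T, d), (B, T, d), (B, d)
        y_inte = torch.softmax(self.W_inte(torch.cat([H[:, -1], r_inte], dim=-1)), dim=-1)
        y_slot = torch.softmax(self.W_slot(torch.cat([H, r_slot], dim=-1)), dim=-1)
        return y_slot, y_inte
```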
From the analysis, the intent seems to have limited influence on the slots: in the intent -> slot -> intent chain, the first step barely changes the results.
The information fusion here is also fairly simple, and the iteration mechanism is not particularly elaborate.
That said, producing the intent first does help slot prediction more.
DCMTL method:
Segment tagging and named entity tagging can be regarded as syntactic labeling, while slot filling is more like semantic labeling. With the information sharing ability of multi-task learning, once we learn the syntactic structure of an input sentence, filling in the semantic labels becomes much easier.
The slot labels are large-scale, informative and diverse in the E-commerce setting, and the syntactic structure of the input Chinese utterances is complicated, so the slot filling problem becomes hard to solve. If we directly train an end-to-end sequential model, the tagging performance suffers severely from data sparsity. When we try to handle slot filling (a semantic labeling task), low-level tasks such as named entity tagging or segment tagging (syntactic labeling tasks) may make mistakes first. If the low-level tasks go wrong, so does the target slot filling task: it is easy to make wrong decisions in the low-level tasks if we try to fill in all the labels at once, and the errors then propagate and hurt the high-level target, slot filling.
This motivates their model:
“However, when it comes to problems where different tasks maintain a strict order, in another word, the performance of high-level task dramatically depends on low-level tasks, the hierarchy structure is not compact and effective enough. Therefore, we propose cascade and residual connections to allow high-level tasks to take the tagging results and hidden states from low-level tasks as additional input. These connections serves as “shortcuts” that create a more closely coupled and efficient model. We call it deep cascade multi-task learning,”
In the original hierarchical setup, the lower layer's hidden states serve as the hidden input to the next layer. The authors add two improvements: the higher task receives not only the lower layer's hidden states but also its output (cascade connection), and additionally the lower layer's own input (residual connection).
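A sketch of how I understand the cascade and residual connections for one low-level/high-level task pair (names and dimensions are my own, not the paper's):

```python
import torch
import torch.nn as nn

class CascadeResidualLayer(nn.Module):
    """The high-level task consumes the low-level task's hidden states (hierarchical),
    its predicted tag distribution (cascade connection), and the low-level layer's own
    input (residual connection)."""
    def __init__(self, in_dim, hidden_dim, n_low_tags):
        super().__init__()
        self.low_rnn = nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.low_out = nn.Linear(2 * hidden_dim, n_low_tags)
        high_in = 2 * hidden_dim + n_low_tags + in_dim       # hidden + cascade + residual
        self.high_rnn = nn.LSTM(high_in, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                                    # x: (B, T, in_dim), low-level input
        low_h, _ = self.low_rnn(x)                           # low-level (syntactic) hidden states
        low_tags = torch.softmax(self.low_out(low_h), -1)    # cascade connection: soft tag scores
        high_in = torch.cat([low_h, low_tags, x], dim=-1)    # residual connection adds x back in
        high_h, _ = self.high_rnn(high_in)                   # features for the slot filling task
        return low_tags, high_h
```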
Since the experiments use a completely different dataset and do not compare against this year's methods, the comparison is not very strong, but the work is still worth learning from, in particular the Cascade Connection and Residual Connection; the writing is also quite good.