[Paper Notes] Convolutional Neural Networks for Sentence Classification

weixin_30788731

Original link: http://www.cnblogs.com/doragd/p/11320812.html

Model

\(x_i\in{\mathbb{R}^k}\): the \(k\)-dimensional word vector of the \(i\)-th word
A sentence of length \(n\) (after padding) is represented as \(x_{1:n}\), with shape \(n\times{k}\)
Filter \(w\in{\mathbb{R}^{hk}}\): a convolution kernel with window size \(h\times{k}\), stride = 1
\(c_i=f(w\cdot{x_{i:i+h-1}}+b)\in{\mathbb{R}}\)
- one component of the extracted feature vector, where \(b\in{\mathbb{R}}\) is a bias and \(f\) is a nonlinear function (e.g. tanh)
Feature map: \(c=[c_1,c_2,...,c_{n-h+1}]\in{\mathbb{R}^{n-h+1}}\)
Max-over-time pooling is then applied to the feature map, forming the pooling layer
- each feature map contributes only its largest value, \(\hat{c}=\max\{c\}\)
- The idea is to capture the most important feature (the one with the highest value) for each feature map.
- This pooling scheme naturally deals with variable sentence lengths.
- The model uses multiple filters (with varying window sizes) to obtain multiple features.
Finally, a fully connected layer with softmax produces the probability distribution over labels
Model variants:
- two channels of word vectors
  - one kept static throughout training
  - one fine-tuned via backpropagation
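
The pipeline above (filter, feature map, max-over-time pooling, softmax) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's implementation: the function names, the random weights, the choice of tanh for \(f\), and the sizes (`n = 7`, `k = 5`, window sizes 3, 4, 5, two labels) are all assumptions made for the example.

```python
import numpy as np

def conv_feature_map(x, w, b):
    """x: (n, k) sentence matrix, w: (h, k) filter, b: scalar bias.
    Returns c with c_i = f(w . x_{i:i+h-1} + b), shape (n - h + 1,)."""
    n, h = x.shape[0], w.shape[0]
    # Stack every window of h consecutive word vectors (stride 1).
    windows = np.stack([x[i:i + h] for i in range(n - h + 1)])  # (n-h+1, h, k)
    pre = np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b
    return np.tanh(pre)  # f = tanh is an assumption for this sketch

def sentence_features(x, filters):
    """Max-over-time pooling: one scalar per filter, whatever the sentence length."""
    return np.array([conv_feature_map(x, w, b).max() for w, b in filters])

def softmax(v):
    e = np.exp(v - v.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n, k = 7, 5                                   # hypothetical sentence length / embedding size
x = rng.normal(size=(n, k))
filters = [(rng.normal(size=(h, k)), 0.0) for h in (3, 4, 5)]
z = sentence_features(x, filters)             # penultimate layer, one entry per filter
W_out = rng.normal(size=(2, len(filters)))    # hypothetical 2-label output layer
probs = softmax(W_out @ z + np.zeros(2))
```

Because each filter is reduced to a single maximum, `z` has one entry per filter regardless of `n`, which is why the pooling handles variable sentence lengths.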

Employ dropout on the penultimate layer, with a constraint on the \(l_2\)-norms of the weight vectors
Dropout prevents co-adaptation of hidden units by randomly dropping out (i.e., setting to zero) a proportion \(p\) of the hidden units during forward-backpropagation.
Penultimate layer \(z=[\hat{c}_1,...,\hat{c}_m]\), for output unit \(y\)
- dropout uses \(y = w\cdot(z\odot{r})+b\)
  - \(r\in{\mathbb{R}^m}\) is a masking vector of Bernoulli random variables with probability \(p\) of being 1; gradients are backpropagated only through the unmasked units (reading \(p\) as both the dropped proportion and the keep probability is consistent at the paper's setting \(p=0.5\))
- At test time, dropout is not used
  - the learned weight vectors are scaled by \(p\) such that \(\hat{w}=pw\), and \(\hat{w}\) is used (without dropout) to score unseen sentences.
We additionally constrain the \(l_2\)-norms of the weight vectors by rescaling \(w\) to have \(||w||_2=s\) whenever \(||w||_2> s\) after a gradient descent step.

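\(c\): number of classes; \(l\): average sentence length; \(N\): dataset size; \(|V|\): vocabulary size; \(|V_{pre}|\): number of words present in the pre-trained word vectors
\(CV\): no standard train/test split exists, so 10-fold cross-validation is used
MR: Movie Reviews
- positive/negative
SST-1: Stanford Sentiment Treebank, an extension of MR that provides train/test splits and finer-grained labels
- very positive, positive, neutral, negative, very negative
SST-2: SST-1 with neutral reviews removed and binary labels
Subj: sentences labeled as subjective or objective
TREC: classify a question into one of six question types (about person, location, numeric information, etc.)
CR: Customer reviews of various products
- positive/negative
MPQA: Opinion polarity detection subtask of the MPQA dataset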

(Values obtained by grid search on the SST-2 dev set)

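CNN-rand: baseline; all word vectors are randomly initialized
CNN-static: uses pre-trained word vectors; words missing from them are randomly initialized, and the word vectors stay fixed throughout training
CNN-non-static: same as CNN-static, but the word vectors are fine-tuned
CNN-multichannel: two sets of word vectors, one static and one fine-tuned, treated as two channels; gradients are backpropagated through only one of the channels. Both channels are initialized with word2vec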
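
A rough sketch of the multichannel variant follows. In the paper, each filter is applied to both channels and the results are added before the nonlinearity; giving each channel its own weight slice (`w_static`, `w_tuned`) is an assumption of this sketch, as are the function name, the tanh nonlinearity, the sizes, and the stand-in gradient used to illustrate the update.

```python
import numpy as np

def multichannel_feature_map(x_static, x_tuned, w_static, w_tuned, b):
    """Feature map over two channels: per-window responses are summed, then f is applied."""
    n, h = x_static.shape[0], w_static.shape[0]
    c = np.empty(n - h + 1)
    for i in range(n - h + 1):
        s = np.sum(w_static * x_static[i:i + h]) + np.sum(w_tuned * x_tuned[i:i + h])
        c[i] = np.tanh(s + b)  # f = tanh is an assumption for this sketch
    return c

rng = np.random.default_rng(0)
n, k, h = 6, 4, 3                              # hypothetical sizes
x_static = rng.normal(size=(n, k))             # stands in for word2vec vectors
x_tuned = x_static.copy()                      # both channels start from the same vectors
w_static = rng.normal(size=(h, k))
w_tuned = rng.normal(size=(h, k))
c = multichannel_feature_map(x_static, x_tuned, w_static, w_tuned, 0.0)

# Only the fine-tuned channel receives gradient updates; the static one is left as is.
grad_tuned = rng.normal(size=(n, k))           # stand-in for a real gradient
x_tuned_new = x_tuned - 0.1 * grad_tuned
```

After a few such updates the two channels diverge: `x_static` still holds the original vectors while the tuned copy drifts toward the task.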

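To rule out other sources of randomness, the CV folds, the CNN parameter initialization, and the initialization of unknown words are kept uniform within each dataset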

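CNN-rand: with all word vectors randomly initialized, the model does not perform well on its own
CNN-static: adding fixed pre-trained word vectors improves results immediately [SOTA on MPQA]
CNN-non-static: pre-training plus fine-tuning does better still [SOTA on MR]
CNN-multichannel: fine-tuned plus static [SOTA on SST-2]
Multichannel vs. single channel: the two-channel design was meant to let the static vectors guard against overfitting (keeping the fine-tuned vectors from drifting too far from their original meaning), but the results are mixed. Rather than adding a channel, an alternative is to keep a single channel and add extra word-vector dimensions that are allowed to change during training
Static vs. fine-tuned: the authors also compare the top 4 most similar words, by cosine similarity, for static and fine-tuned vectors. Fine-tuning adapts the vectors to the specific task; for example, after fine-tuning, good becomes similar to nice. Words missing from the pre-trained vectors also become more meaningful after training: commas are associated with conjunctions, and exclamation marks with effusive expressions.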

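Dropout is an effective regularizer; adding it improved performance by 2%-4% (relative)
Words not in word2vec are randomly initialized from \(U[-a,a]\), where \(a\) is chosen so that the randomly initialized vectors have the same variance as the pre-trained ones; the variance of a uniform distribution on \([-a,a]\) is \(\frac{(a-(-a))^2}{12}=\frac{a^2}{3}\)
SENNA word vectors perform worse than word2vec, but it is unclear whether this is due to the architecture or to the training data
Adadelta gives results similar to Adagrad but needs fewer epochs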
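
The variance-matching initialization can be worked out directly: setting \(\frac{a^2}{3}\) equal to the variance \(\sigma^2\) of the pre-trained vectors gives \(a=\sqrt{3\sigma^2}\). A minimal sketch, assuming the variance is pooled over all entries of the pre-trained matrix (the notes do not say how it is measured) and using synthetic Gaussian data in place of real word2vec vectors; the function names are hypothetical.

```python
import numpy as np

def uniform_bound_matching_variance(pretrained):
    """Solve a^2 / 3 = var(pretrained) for a, so U[-a, a] matches the variance."""
    return float(np.sqrt(3.0 * pretrained.var()))

def init_unknown_words(num_words, dim, a, rng):
    """Sample each dimension of each unknown word vector from U[-a, a]."""
    return rng.uniform(-a, a, size=(num_words, dim))

rng = np.random.default_rng(0)
pretrained = rng.normal(scale=0.25, size=(5000, 50))   # synthetic stand-in, not word2vec
a = uniform_bound_matching_variance(pretrained)
unknown = init_unknown_words(20000, 50, a, rng)
```

With enough samples, `unknown.var()` comes out close to `pretrained.var()`, which is the property the initialization is after.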
