Deep Learning Paper Reading: Convolutional Neural Networks for Sentence Classification (TextCNN)

This is a paper that applies CNNs to sentence classification.

Abstract

We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.

  • For sentence-level classification tasks, a CNN is trained on top of pre-trained word vectors.
  • A simple CNN with little hyperparameter tuning and static word vectors already gives strong results.
  • Learning task-specific word vectors through fine-tuning yields further gains.
  • A simple modification to the architecture is proposed so that static and fine-tuned (task-specific) word vectors can be used together.
  • The CNN models here perform well on sentiment analysis and question classification, improving on the state of the art on 4 of 7 tasks.

 

Introduction

Deep learning models have achieved remarkable results in computer vision (Krizhevsky et al., 2012) and speech recognition (Graves et al., 2013) in recent years. Within natural language processing, much of the work with deep learning methods has involved learning word vector representations through neural language models (Bengio et al., 2003; Yih et al., 2011; Mikolov et al., 2013) and performing composition over the learned word vectors for classification (Collobert et al., 2011). Word vectors, wherein words are projected from a sparse, 1-of-V encoding (here V is the vocabulary size) onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions. In such dense representations, semantically close words are likewise close—in euclidean or cosine distance—in the lower dimensional vector space. Convolutional neural networks (CNN) utilize layers with convolving filters that are applied to local features (LeCun et al., 1998). Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing (Yih et al., 2014), search query retrieval (Shen et al., 2014), sentence modeling (Kalchbrenner et al., 2014), and other traditional NLP tasks (Collobert et al., 2011). In the present work, we train a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model. These vectors were trained by Mikolov et al. (2013) on 100 billion words of Google News, and are publicly available.1 We initially keep the word vectors static and learn only the other parameters of the model. Despite little tuning of hyperparameters, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are ‘universal’ feature extractors that can be utilized for various classification tasks. Learning task-specific vectors through fine-tuning results in further improvements. We finally describe a simple modification to the architecture to allow for the use of both pre-trained and task-specific vectors by having multiple channels. Our work is philosophically similar to Razavian et al. (2014) which showed that for image classification, feature extractors obtained from a pretrained deep learning model perform well on a variety of tasks—including tasks that are very different from the original task for which the feature extractors were trained.

  • For NLP, deep learning typically learns word vector representations through a neural language model and then composes the learned vectors for classification.
  • Words in a sparse 1-of-V encoding (V is the vocabulary size) are projected through a hidden layer into a lower-dimensional space, whose dimensions encode the semantic features of a word (see the sketch after this list).
  • In this dense semantic space, semantically similar words lie close to each other.
  • CNNs originated in computer vision and have since proved effective in NLP, e.g. in semantic parsing, search query retrieval, and sentence modeling.
  • This paper adds a single convolutional layer on top of word vectors obtained from an unsupervised neural language model; the vectors were trained on Google News.
  • The word vectors are first kept static and only the other parameters are learned; this already gives strong results, suggesting that the pre-trained vectors act as 'universal' feature extractors for other classification tasks.
  • Fine-tuning the word vectors on the task then yields further improvements.
  • This shows that even though the tasks are very different, the extracted features still transfer.
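As a tiny illustration of the 1-of-V projection mentioned above (toy sizes, hypothetical example, not from the paper): multiplying a one-hot vector by the hidden-layer weight matrix simply selects one row, which is the word's dense embedding.

```python
import numpy as np

V, k = 8, 4                          # toy vocabulary size and embedding dimension
E = np.random.randn(V, k)            # hidden-layer weights, i.e. the embedding matrix

one_hot = np.zeros(V)
one_hot[3] = 1.0                     # 1-of-V encoding of word index 3

dense = one_hot @ E                  # projection into the k-dimensional space
assert np.allclose(dense, E[3])      # equivalent to simply looking up row 3
```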

 

Model

The model architecture, shown in figure 1, is a slight variant of the CNN architecture of Collobert et al. (2011). Let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as

$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, (1)

where $\oplus$ is the concatenation operator. In general, let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \ldots, x_{i+j}$. A convolution operation involves a filter $w \in \mathbb{R}^{hk}$, which is applied to a window of h words to produce a new feature. For example, a feature $c_i$ is generated from a window of words $x_{i:i+h-1}$ by

$c_i = f(w \cdot x_{i:i+h-1} + b)$. (2)

Here $b \in \mathbb{R}$ is a bias term and f is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sentence $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ to produce a feature map

$c = [c_1, c_2, \ldots, c_{n-h+1}]$, (3)

with $c \in \mathbb{R}^{n-h+1}$. We then apply a max-over-time pooling operation (Collobert et al., 2011) over the feature map and take the maximum value $\hat{c} = \max\{c\}$ as the feature corresponding to this particular filter. The idea is to capture the most important feature—one with the highest value—for each feature map. This pooling scheme naturally deals with variable sentence lengths.
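To make equations (1)–(3) and the pooling step concrete, here is a minimal NumPy sketch of a single filter; the sentence length n, window size h, and dimension k are arbitrary example values, not the paper's settings.

```python
import numpy as np

n, k, h = 7, 5, 3                       # sentence length, word-vector dim, filter window
x = np.random.randn(n, k)               # rows are the word vectors x_1 ... x_n  (Eq. 1)
w = np.random.randn(h * k)              # one filter of length h*k
b = 0.1                                 # bias term

f = np.tanh                             # non-linearity, e.g. hyperbolic tangent

# Eq. (2): apply the filter to every window of h consecutive words
c = np.array([f(w @ x[i:i + h].reshape(-1) + b) for i in range(n - h + 1)])

# Eq. (3) gives the feature map of length n-h+1; max-over-time pooling keeps its largest entry
c_hat = c.max()
print(c.shape, c_hat)                   # (5,) and one scalar feature for this filter
```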

We have described the process by which one feature is extracted from one filter. The model uses multiple filters (with varying window sizes) to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.

In one of the model variants, we experiment with having two ‘channels’ of word vectors: one that is kept static throughout training and one that is fine-tuned via backpropagation (section 3.2). In the multichannel architecture, illustrated in figure 1, each filter is applied to both channels and the results are added to calculate $c_i$ in equation (2). The model is otherwise equivalent to the single channel architecture.

  • A sentence of n words, each a k-dimensional vector, is represented by concatenating the word vectors into an n×k matrix.
  • A convolution filter with a window of h words slides over the sentence to produce new features.
  • w holds the filter parameters, b is a bias, f is a non-linearity, and the resulting feature map is c.
  • Max-over-time pooling then keeps only the maximum value of each filter's feature map, i.e. its most important feature.
  • Multiple filters give multiple features, which are fed into a fully connected softmax layer that outputs the probability of each class.
  • One variant uses two channels of word vectors: one kept static and one fine-tuned via backpropagation (see the sketch after this list).
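Putting the pieces together, here is a minimal PyTorch sketch of the single-channel architecture: embedding, parallel convolutions with several window sizes, max-over-time pooling, dropout, and a final linear layer whose softmax gives the label distribution. It is an illustrative reimplementation under assumed names (TextCNN, vocab_size, num_classes), not the authors' code; window sizes 3/4/5 with 100 feature maps each follow the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, num_classes=2,
                 window_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # one Conv1d per window size h; each produces num_filters feature maps
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in window_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len) word indices
        x = self.embedding(tokens).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # convolution + ReLU, then max-over-time pooling per filter
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = torch.cat(pooled, dim=1)                # penultimate layer z
        return self.fc(self.dropout(z))             # logits; softmax/cross-entropy applied outside

# usage: logits = TextCNN()(torch.randint(0, 10000, (4, 20)))
```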

 

Regularization

For regularization we employ dropout on the penultimate layer with a constraint on l2-norms of the weight vectors (Hinton et al., 2012). Dropout prevents co-adaptation of hidden units by randomly dropping out—i.e., setting to zero—a proportion p of the hidden units during forward-backpropagation. That is, given the penultimate layer $z = [\hat{c}_1, \ldots, \hat{c}_m]$ (note that here we have m filters), instead of using

$y = w \cdot z + b$ (4)

for output unit y in forward propagation, dropout uses

$y = w \cdot (z \circ r) + b$, (5)

where $\circ$ is the element-wise multiplication operator and $r \in \mathbb{R}^m$ is a ‘masking’ vector of Bernoulli random variables with probability p of being 1. Gradients are backpropagated only through the unmasked units. At test time, the learned weight vectors are scaled by p such that $\hat{w} = pw$, and $\hat{w}$ is used (without dropout) to score unseen sentences. We additionally constrain l2-norms of the weight vectors by rescaling w to have $\|w\|_2 = s$ whenever $\|w\|_2 > s$ after a gradient descent step.

  • Dropout is applied to the penultimate layer, together with an l2-norm constraint on the weight vectors.
  • Random dropout prevents co-adaptation of the hidden units.
  • The pooled vector z is multiplied element-wise with a Bernoulli mask r (each entry is 1 with probability p); gradients are backpropagated only through the unmasked units, so at test time the learned weights are scaled by p and used without dropout (see the sketch after this list).
  • In addition, whenever the l2 norm of a weight vector exceeds s, it is rescaled so that its norm equals s.
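A minimal NumPy sketch of the two regularizers, following equations (4)–(5) literally (variable names are mine): a Bernoulli mask on the penultimate layer during training, weight scaling by p at test time, and rescaling of the weight vector whenever its l2 norm exceeds s.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, s = 300, 0.5, 3.0                   # number of filters, keep-probability, l2 constraint

z = rng.standard_normal(m)                # penultimate layer [c_hat_1, ..., c_hat_m]
w = rng.standard_normal(m)                # weights of one output unit
b = 0.0

# Eq. (5): training-time forward pass with a Bernoulli mask r (r_i = 1 with probability p)
r = rng.binomial(1, p, size=m)
y_train = w @ (z * r) + b                 # gradients would flow only through unmasked units

# test time: no mask; the learned weights are scaled by p instead
y_test = (p * w) @ z + b

# l2 constraint: after a gradient step, rescale w whenever its l2 norm exceeds s
norm = np.linalg.norm(w)
if norm > s:
    w *= s / norm                         # now the l2 norm of w equals s
```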

 

 

Datasets and Experiments

We test our model on various benchmarks. Summary statistics of the datasets are in table 1.

  • MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005).
  • SST-1: Stanford Sentiment Treebank—an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).
  • SST-2: Same as SST-1 but with neutral reviews removed and binary labels.
  • Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004).
  • TREC: TREC question dataset—task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002).
  • CR: Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004).
  • MPQA: Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).
  • Datasets:
    • MR: movie reviews; classify each one-sentence review as positive or negative.
    • SST-1: Stanford Sentiment Treebank, an extension of MR with provided train/dev/test splits and fine-grained labels (very positive to very negative).
    • SST-2: like SST-1, but with neutral reviews removed and binary labels.
    • Subj: subjectivity dataset; classify a sentence as subjective or objective.
    • TREC: question dataset; classify a question into one of 6 types.
    • CR: customer reviews; predict positive vs. negative.
    • MPQA: opinion polarity detection.

 

Hyperparameters and Training

For all datasets we use: rectified linear units, filter windows (h) of 3, 4, 5 with 100 feature maps each, dropout rate (p) of 0.5, l2 constraint (s) of 3, and mini-batch size of 50. These values were chosen via a grid search on the SST-2 dev set. We do not otherwise perform any dataset-specific tuning other than early stopping on dev sets. For datasets without a standard dev set we randomly select 10% of the training data as the dev set. Training is done through stochastic gradient descent over shuffled mini-batches with the Adadelta update rule (Zeiler, 2012).

  • Rectified linear units (ReLU) as the non-linearity.
  • Filter windows of 3, 4, and 5 words, with 100 feature maps each.
  • Dropout rate of 0.5.
  • l2-norm constraint s of 3.
  • Mini-batch size of 50.
  • These values were chosen via a grid search on the SST-2 dev set.
  • No other dataset-specific tuning apart from early stopping on the dev sets.
  • For datasets without a standard dev set, 10% of the training data is randomly held out.
  • Training uses stochastic gradient descent over shuffled mini-batches with the Adadelta update rule (see the sketch below).
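As an illustrative sketch of this training setup (reusing the hypothetical TextCNN class from the model section, with toy random data in place of a real dataset): shuffled mini-batches of 50 and the Adadelta update rule.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy stand-ins for tokenized sentences and binary labels (not a real dataset)
tokens = torch.randint(0, 10000, (500, 20))
labels = torch.randint(0, 2, (500,))
loader = DataLoader(TensorDataset(tokens, labels), batch_size=50, shuffle=True)

model = TextCNN()                                   # sketch defined in the model section above
optimizer = torch.optim.Adadelta(model.parameters())
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(5):                              # the paper uses early stopping on a dev set
    for batch_tokens, batch_labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_tokens), batch_labels)
        loss.backward()
        optimizer.step()
        # (the paper also rescales weight vectors whose l2 norm exceeds s after each step; omitted here)
```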

Pre-trained Word Vectors

Initializing word vectors with those obtained from an unsupervised neural language model is a popular method to improve performance in the absence of a large supervised training set (Collobert et al., 2011; Socher et al., 2011; Iyyer et al., 2014). We use the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors have dimensionality of 300 and were trained using the continuous bag-of-words architecture (Mikolov et al., 2013). Words not present in the set of pre-trained words are initialized randomly.

  • When a large supervised training set is not available, initializing word vectors from an unsupervised neural language model is a widely used way to improve performance.
  • The word vectors were trained on 100 billion words of Google News.
  • They are 300-dimensional and were trained with the continuous bag-of-words architecture.
  • Words not present in the pre-trained vocabulary are initialized randomly (see the sketch after this list).
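One way to reproduce this initialization with gensim (a sketch; the file name is the commonly distributed Google News binary and the task vocabulary here is made up): copy the 300-d word2vec vector when a word is covered, otherwise initialize it randomly.

```python
import numpy as np
from gensim.models import KeyedVectors

# publicly distributed 300-d Google News word2vec vectors (file name assumed)
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

vocab = ["good", "movie", "notintheembeddings123"]   # toy task vocabulary
dim = 300
embedding = np.empty((len(vocab), dim), dtype=np.float32)

for i, word in enumerate(vocab):
    if word in kv:                                   # covered by the pre-trained vectors
        embedding[i] = kv[word]
    else:                                            # unknown words are initialized randomly
        # (range is illustrative; a variance-matched choice is discussed later in the notes)
        embedding[i] = np.random.uniform(-0.25, 0.25, size=dim)
```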

Model Variations

We experiment with several variants of the model.

  • CNN-rand: Our baseline model where all words are randomly initialized and then modified during training.
  • CNN-static: A model with pre-trained vectors from word2vec. All words—including the unknown ones that are randomly initialized—are kept static and only the other parameters of the model are learned.
  • CNN-non-static: Same as above but the pretrained vectors are fine-tuned for each task.
  • CNN-multichannel: A model with two sets of word vectors. Each set of vectors is treated as a ‘channel’ and each filter is applied to both channels, but gradients are backpropagated only through one of the channels. Hence the model is able to fine-tune one set of vectors while keeping the other static. Both channels are initialized with word2vec.

In order to disentangle the effect of the above variations versus other random factors, we eliminate other sources of randomness—CV-fold assignment, initialization of unknown word vectors, initialization of CNN parameters—by keeping them uniform within each dataset.

  • Several variants of the model are compared (see the sketch after this list):
    • CNN-rand: the baseline; all word vectors are randomly initialized and updated during training.
    • CNN-static: word vectors are initialized with word2vec and kept fixed during training; only the other parameters are learned.
    • CNN-non-static: like CNN-static, but the pre-trained vectors are fine-tuned for each task.
    • CNN-multichannel: two channels of word vectors, both initialized with word2vec; one is adjusted via backpropagation and the other stays static.
  • To isolate the effect of these variations from other random factors, other sources of randomness (CV-fold assignment, initialization of unknown word vectors, initialization of CNN parameters) are kept uniform within each dataset.
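In implementation terms the four variants differ only in how the embedding table is initialized and whether it receives gradients; a minimal PyTorch sketch (the pretrained tensor here is a random stand-in for the word2vec matrix):

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 300)   # stand-in for the word2vec embedding matrix

# CNN-rand: random initialization, updated during training (the default)
emb_rand = nn.Embedding(10000, 300)

# CNN-static: word2vec initialization, kept fixed (no gradient updates)
emb_static = nn.Embedding.from_pretrained(pretrained, freeze=True)

# CNN-non-static: word2vec initialization, fine-tuned during training
emb_non_static = nn.Embedding.from_pretrained(pretrained, freeze=False)

# CNN-multichannel: use both emb_static and emb_non_static; each filter is applied to
# both channels and the results are added, but only emb_non_static receives gradients.
```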

 

Results and Discussion

Results of our models against other methods are listed in table 2. Our baseline model with all randomly initialized words (CNN-rand) does not perform well on its own. While we had expected performance gains through the use of pre-trained vectors, we were surprised at the magnitude of the gains. Even a simple model with static vectors (CNN-static) performs remarkably well, giving competitive results against the more sophisticated deep learning models that utilize complex pooling schemes (Kalchbrenner et al., 2014) or require parse trees to be computed beforehand (Socher et al., 2013). These results suggest that the pretrained vectors are good, ‘universal’ feature extractors and can be utilized across datasets. Finetuning the pre-trained vectors for each task gives still further improvements (CNN-non-static).

  • Randomly initialized word vectors (CNN-rand) do not perform well on their own.
  • Pre-trained word vectors bring surprisingly large gains; even the static model is competitive with more sophisticated deep models.
  • Fine-tuning the word vectors for each task (CNN-non-static) improves performance further.

Multichannel vs. Single-Channel Models

We had initially hoped that the multichannel architecture would prevent overfitting (by ensuring that the learned vectors do not deviate too far from the original values) and thus work better than the single channel model, especially on smaller datasets. The results, however, are mixed, and further work on regularizing the fine-tuning process is warranted. For instance, instead of using an additional channel for the non-static portion, one could maintain a single channel but employ extra dimensions that are allowed to be modified during training.

  • The multichannel model was expected to prevent overfitting and beat the single-channel model, especially on small datasets, but the results are mixed; better ways of regularizing the fine-tuning process are still needed.

Static vs. Non-static Representations

As is the case with the single channel non-static model, the multichannel model is able to fine-tune the non-static channel to make it more specific to the task-at-hand. For example, good is most similar to bad in word2vec, presumably because they are (almost) syntactically equivalent. But for vectors in the non-static channel that were finetuned on the SST-2 dataset, this is no longer the case (table 3). Similarly, good is arguably closer to nice than it is to great for expressing sentiment, and this is indeed reflected in the learned vectors. For (randomly initialized) tokens not in the set of pre-trained vectors, fine-tuning allows them to learn more meaningful representations: the network learns that exclamation marks are associated with effusive expressions and that commas are conjunctive (table 3).

  • As with the single-channel non-static model, the multichannel model can fine-tune its non-static channel to fit the task at hand.
  • For example, in word2vec the word good is most similar to bad, presumably because they are (almost) syntactically equivalent; after fine-tuning on SST-2 this is no longer the case, and for expressing sentiment good ends up closer to nice than to great, which shows that something useful really is being learned (see the sketch after this list).
  • Tokens not in the pre-trained vocabulary also learn more meaningful representations through fine-tuning: the network learns that exclamation marks go with effusive expressions and that commas are conjunctive.
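To inspect this kind of shift in one's own training run, one can compare a word's nearest neighbors under cosine similarity in the static and fine-tuned embedding matrices; a small sketch with random stand-in matrices (the real comparison would use the trained channels and the query words from table 3):

```python
import numpy as np

def nearest(idx, emb, topk=3):
    """Indices of the topk words closest to word idx under cosine similarity."""
    v = emb[idx]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-8)
    sims[idx] = -np.inf                  # exclude the query word itself
    return np.argsort(-sims)[:topk]

static = np.random.randn(1000, 300)                       # stand-in for the static channel
fine_tuned = static + 0.1 * np.random.randn(1000, 300)    # stand-in for the fine-tuned channel
print(nearest(42, static), nearest(42, fine_tuned))       # compare neighborhoods of word 42
```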

Further Observations

We report on some further experiments and observations:

  • Kalchbrenner et al. (2014) report much worse results with a CNN that has essentially the same architecture as our single channel model. For example, their Max-TDNN (Time Delay Neural Network) with randomly initialized words obtains 37.4% on the SST-1 dataset, compared to 45.0% for our model. We attribute such discrepancy to our CNN having much more capacity (multiple filter widths and feature maps).
  • Dropout proved to be such a good regularizer that it was fine to use a larger than necessary network and simply let dropout regularize it. Dropout consistently added 2%–4% relative performance.
  • When randomly initializing words not in word2vec, we obtained slight improvements by sampling each dimension from U[−a, a] where a was chosen such that the randomly initialized vectors have the same variance as the pre-trained ones. It would be interesting to see if employing more sophisticated methods to mirror the distribution of pre-trained vectors in the initialization process gives further improvements.
  • We briefly experimented with another set of publicly available word vectors trained by Collobert et al. (2011) on Wikipedia,8 and found that word2vec gave far superior performance. It is not clear whether this is due to Mikolov et al. (2013)’s architecture or the 100 billion word Google News dataset.
  • Adadelta (Zeiler, 2012) gave similar results to Adagrad (Duchi et al., 2011) but required fewer epochs.
  • Kalchbrenner et al. (2014) report much worse results with a CNN that is essentially the same as the single-channel model here; the discrepancy is likely because this CNN has more capacity (multiple filter widths and feature maps).
  • Dropout is such a good regularizer that a larger-than-necessary network can simply be used and left for dropout to regularize; it consistently adds 2%–4% relative performance.
  • For words not in word2vec, sampling each dimension from U[−a, a], with a chosen so that the random vectors have the same variance as the pre-trained ones, gives a slight improvement; more sophisticated ways of mirroring the distribution of the pre-trained vectors might help further (see the sketch after this list).
  • The word2vec vectors gave far better performance than the Wikipedia-trained vectors of Collobert et al. (2011); it is unclear whether this is due to the architecture or the 100-billion-word Google News data.
  • Adadelta gave results similar to Adagrad but needed fewer epochs.
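The choice of a above can be made concrete: a uniform distribution U[−a, a] has variance a²/3, so setting a = sqrt(3·Var) matches the variance of the pre-trained vectors. A small sketch with a random stand-in for the word2vec matrix:

```python
import numpy as np

pretrained = np.random.randn(10000, 300).astype(np.float32)  # stand-in for the word2vec matrix

# Var(U[-a, a]) = a^2 / 3, so choose a to match the variance of the pre-trained vectors
a = np.sqrt(3.0 * pretrained.var())
unknown_word_vector = np.random.uniform(-a, a, size=300)
```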

Conclusion

In the present work we have described a series of experiments with convolutional neural networks built on top of word2vec. Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. Our results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.

The experiments add to the evidence that unsupervised pre-training of word vectors is an important ingredient for deep learning in NLP.

Acknowledgments

We would like to thank Yann LeCun and the anonymous reviewers for their helpful feedback and suggestions.

 

Original Paper

Convolutional Neural Networks for Sentence Classification

 

 

 

 

 

 

 
