NLP Basics Experiment ②: THUCNews News Text Classification with TextCNN

I. TextCNN

Essential properties of CNNs and why they matter

  1. Sparse interactions: each output neuron is connected only to a local region of neurons in the previous layer, so the network learns local feature structure. Benefits: far fewer parameters and lower time complexity. Example hierarchy: edge features (lower layers) → facial-part features → face features (upper layers). A rough parameter-count comparison is sketched after this list.
  2. Parameter sharing: different parts of the same model reuse the same parameters, which gives the convolution layer translation invariance. Benefit: faster training.
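As a back-of-the-envelope illustration of the savings from sparse interactions and parameter sharing, the sketch below compares a fully connected layer with a 1-D convolution over the same input; the sizes are borrowed from the TCNNConfig used later in this post (600 tokens, 64-dim embeddings, 128 filters of width 5) and are only for illustration.

seq_len, embed_dim = 600, 64      # input: 600 tokens, each a 64-dim word vector
hidden_units = 128

# Fully connected layer: every input value connects to every hidden unit.
dense_params = seq_len * embed_dim * hidden_units    # 600 * 64 * 128 = 4,915,200 weights

# 1-D convolution: 128 filters of width 5 spanning the embedding dimension,
# shared across all positions in the sequence (parameter sharing).
kernel_size, num_filters = 5, 128
conv_params = kernel_size * embed_dim * num_filters  # 5 * 64 * 128 = 40,960 weights

print(dense_params, conv_params)  # the conv layer needs less than 1% of the dense layer's weights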

 


(1) Input layer: an N × K matrix, where N is the number of words in the sentence and K is the dimension of each word's vector representation. The vectors can be pre-trained word embeddings (kept fixed during training) or embeddings learned by the current network (updated during training).

  • Preprocessing: take the sentence "wait for the video and don't rent it". It is first tokenized into 9 tokens: [wait, for, the, video, and, do, n't, rent, it]; each token is then replaced by a number, namely its index in the vocabulary.
  • Embedding layer: each word is mapped into a low-dimensional space by an embedding method such as word2vec or GloVe; the embedding is essentially a feature extractor that encodes semantic features in a fixed number of dimensions. For example, if each word is represented by a vector of length 6 (i.e. the word-vector dimension is 6), wait might be represented as [1,0,0,0,0,0], and so on; the whole sentence then becomes a 9×6 matrix. A minimal shape-only sketch of this preprocessing follows below.
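The following is a minimal, shape-only sketch of this preprocessing in NumPy; the embedding matrix is random rather than a word2vec/GloVe model, since the point here is only the resulting 9×6 sentence matrix.

import numpy as np

tokens = ["wait", "for", "the", "video", "and", "do", "n't", "rent", "it"]   # 9 tokens
word_dict = {w: i for i, w in enumerate(sorted(set(tokens)))}                # word -> index in the vocabulary
ids = np.array([word_dict[w] for w in tokens])                               # shape (9,)

embedding_dim = 6
embedding = np.random.uniform(-1.0, 1.0, (len(word_dict), embedding_dim))    # vocab_size x 6 lookup table
sentence_matrix = embedding[ids]                                             # the 9 x 6 sentence matrix
print(sentence_matrix.shape)                                                 # (9, 6)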

(2) Convolution layer: one-dimensional convolution. The width of each kernel equals the word-vector dimension; the height can be chosen freely.

For example, set the kernel heights to [2, 3], with one kernel of each size. This yields two feature vectors of sizes T1: 8×1 and T2: 7×1, computed as (9 - 2 + 1) = 8 and (9 - 3 + 1) = 7; a quick check of these formulas is sketched after the list below.

  • 1-D convolution output length: (sentence length - kernel size + 1).
  • 2-D convolution output size: [(N - F + 2*pad) / stride] + 1, where N is the input height/width and F is the filter size.
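The small helpers below simply re-check the numbers from the example above with these two formulas (illustrative only).

def conv1d_out(seq_len, kernel_size):
    # 1-D 'valid' convolution over the sentence: seq_len - kernel_size + 1
    return seq_len - kernel_size + 1

def conv2d_out(n, f, pad=0, stride=1):
    # 2-D convolution: (N - F + 2*pad) / stride + 1
    return (n - f + 2 * pad) // stride + 1

print(conv1d_out(9, 2), conv1d_out(9, 3))   # 8 7  -> T1 is 8x1, T2 is 7x1
print(conv2d_out(9, 3))                     # 7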

(3) Pooling layer: pooling converts sentences of different lengths into a fixed-length vector representation.

After convolution with kernels of different heights, the output feature vectors have different lengths. 1-max pooling (alternatives include k-max pooling and average pooling) reduces each feature vector to a single value, its maximum, which represents that feature. The pooled values are then concatenated, so in the example above the final pooled feature vector is 2×1 (see the sketch below).
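Continuing the running example, here is a minimal NumPy sketch of 1-max pooling: the two feature maps of lengths 8 and 7 (filled with random numbers here) are each reduced to their maximum and then concatenated into a 2×1 vector.

import numpy as np

t1 = np.random.randn(8)                    # feature map from the height-2 kernel
t2 = np.random.randn(7)                    # feature map from the height-3 kernel
pooled = np.array([t1.max(), t2.max()])    # 1-max pooling: keep the maximum of each feature map
print(pooled.shape)                        # (2,) -> the final 2x1 pooled feature vector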

(4) Fully connected layer, with softmax producing a probability for each class.

As in a standard CNN, the output of the pooling layer is first flattened and then fed into the fully connected layer. To prevent overfitting, dropout is added before the output layer; the output is the predicted text category.

# Sentiment analysis toy example: label 0 = negative, 1 = positive
import tensorflow as tf
import numpy as np

tf.reset_default_graph()

# Text-CNN Parameter
embedding_size = 2 # dimension of the word embeddings
sequence_length = 3
num_classes = 2 # 0 or 1
filter_sizes = [2,2,2] # n-gram window
num_filters = 3

# 3 words sentences (=sequence_length is 3)
sentences = ["i love you","he loves me", "she likes baseball", "i hate you","sorry for that", "this is awful"]
labels = [1,1,1,0,0,0] # 1 is good, 0 is not good.

word_list = " ".join(sentences).split()
word_list = list(set(word_list))
word_dict = {w: i for i, w in enumerate(word_list)}  # word2id
vocab_size = len(word_dict)

inputs = []
for sen in sentences:
    inputs.append(np.asarray([word_dict[n] for n in sen.split()]))

outputs = []
for out in labels:
    outputs.append(np.eye(num_classes)[out]) # one-hot labels for the softmax cross-entropy loss

# ---- Model ----
# Graph inputs
X = tf.placeholder(tf.int32, [None, sequence_length])
Y = tf.placeholder(tf.int32, [None, num_classes])

# Model parameters: word embedding matrix
W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))

# Sentence matrix
embedded_chars = tf.nn.embedding_lookup(W, X) # [batch_size, sequence_length, embedding_size]
embedded_chars = tf.expand_dims(embedded_chars, -1) # add channel(=1) [batch_size, sequence_length, embedding_size, 1]

pooled_outputs = []
for i, filter_size in enumerate(filter_sizes):
    filter_shape = [filter_size, embedding_size, 1, num_filters]
    W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1))
    b = tf.Variable(tf.constant(0.1, shape=[num_filters]))

    # Convolution layer
    conv = tf.nn.conv2d(embedded_chars, # [batch_size, sequence_length, embedding_size, 1]
                        W,              # [filter_size(n-gram window), embedding_size, 1, num_filters(=3)]
                        strides=[1, 1, 1, 1],
                        padding='VALID')
    h = tf.nn.relu(tf.nn.bias_add(conv, b))
    
    # Pooling layer: 1-max pooling over the whole remaining time dimension
    pooled = tf.nn.max_pool(h,
                            ksize=[1, sequence_length - filter_size + 1, 1, 1], # window [1, height, width, 1] covers every position
                            strides=[1, 1, 1, 1],
                            padding='VALID')
    pooled_outputs.append(pooled) # dim of pooled : [batch_size(=6), output_height(=1), output_width(=1), num_filters(=3)]

num_filters_total = num_filters * len(filter_sizes)
h_pool = tf.concat(pooled_outputs, 3) # concat along the channel axis : [batch_size(=6), 1, 1, num_filters(=3) * 3 = 9]
h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total]) # [batch_size, num_filters_total]
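# (Optional) As described in the prose above, dropout is usually applied before the
# output layer to reduce overfitting; this toy example omits it, but it could be added
# like this (keep_prob would come from a placeholder so it can be set to 1.0 at test time):
# keep_prob = tf.placeholder(tf.float32)
# h_pool_flat = tf.nn.dropout(h_pool_flat, keep_prob)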

# ---- Model training ----
Weight = tf.get_variable('W', shape=[num_filters_total, num_classes], 
                    initializer=tf.contrib.layers.xavier_initializer())
Bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))

model = tf.nn.xw_plus_b(h_pool_flat, Weight, Bias)  
# Cross-entropy loss
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=model, labels=Y))
# Adam optimizer
optimizer = tf.train.AdamOptimizer(0.001).minimize(cost)

# ---- Prediction ----
# Softmax layer: probability of each class
hypothesis = tf.nn.softmax(model)
predictions = tf.argmax(hypothesis, 1)

# ---- Training loop ----
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

for epoch in range(5000):
    _, loss = sess.run([optimizer, cost], feed_dict={X: inputs, Y: outputs})
    if (epoch + 1)%1000 == 0:
        print('Epoch:', '%06d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))

# ---- Test ----
test_text = 'sorry hate you'
tests = []
tests.append(np.asarray([word_dict[n] for n in test_text.split()]))

predict = sess.run([predictions], feed_dict={X: tests})
result = predict[0][0]
if result == 0:
    print(test_text, "is negative.")
else:
    print(test_text, "is positive!")

The figure below shows the TextCNN structure proposed in the 2014 paper. The network structure of fastText completely ignores word order, yet the n-gram feature trick it relies on shows exactly how important local sequence information is. The core strength of a CNN is that it captures local correlations; in text classification specifically, a CNN can be used to extract n-gram-like key information from a sentence.

A detailed schematic of the TextCNN process is shown below:

 

TextCNN in detail: the first layer is the 7×5 sentence matrix on the far left of the figure; each row is a word vector of dimension 5, which can be thought of as the raw pixels of an image. This is followed by a 1-D convolution layer with filter_size = (2, 3, 4) and two output channels per filter size. The third layer is a 1-max pooling layer, so sentences of different lengths all become fixed-length representations after pooling. Finally, a fully connected softmax layer outputs the probability of each class.

 

Features: the features here are word vectors, which come in static and non-static flavors. The static approach uses pre-trained word vectors such as word2vec and does not update them during training; this is essentially transfer learning, and especially with small datasets static vectors often work quite well. The non-static approach updates the word vectors during training. The recommended variant is non-static with fine-tuning: initialize the word vectors with pre-trained word2vec vectors and adjust them during training, which speeds up convergence. Of course, with enough training data and compute, randomly initialized word vectors can also work. A minimal sketch of the static/non-static choice follows below.
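Below is a minimal TF 1.x sketch of the static vs. non-static choice. It assumes a pretrained [vocab_size, embedding_size] matrix is already available (here it is faked with random numbers); the only difference between the two variants is the trainable flag.

import numpy as np
import tensorflow as tf

vocab_size, embedding_size = 5000, 64
# hypothetical pretrained vectors, e.g. exported from a word2vec model
pretrained_vectors = np.random.uniform(-1.0, 1.0, (vocab_size, embedding_size)).astype(np.float32)

# static: initialize from the pretrained vectors and freeze them
W_static = tf.Variable(pretrained_vectors, trainable=False, name="embedding_static")

# non-static (fine-tuning): same initialization, but updated during training
W_nonstatic = tf.Variable(pretrained_vectors, trainable=True, name="embedding_nonstatic")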

 

Channels: in images, (R, G, B) can serve as different channels; for text, the input channels are usually different embedding types (e.g. word2vec vs. GloVe). In practice, static word vectors and fine-tuned word vectors are also used as two separate channels. A sketch follows below.
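Continuing the sketch above, one way to illustrate the two-channel idea is to look the same word ids up in both embedding tables and stack the results on a new channel axis, analogous to RGB channels in an image (the seq_length value here is borrowed from the config below).

seq_length = 600
X = tf.placeholder(tf.int32, [None, seq_length])

static_channel = tf.nn.embedding_lookup(W_static, X)        # [batch, seq_len, embed]
nonstatic_channel = tf.nn.embedding_lookup(W_nonstatic, X)  # [batch, seq_len, embed]

# [batch, seq_len, embed, 2]: two input channels for the convolution layer
embedded = tf.stack([static_channel, nonstatic_channel], axis=-1)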

One-dimensional convolution (conv-1d): images are 2-D data, while text represented by word vectors is 1-D data, so TextCNN uses 1-D convolution. The drawback is that filters with several different filter_size values must be designed in order to obtain receptive fields of different widths.

Pooling layer: quite a few papers use CNNs for text classification. One of the most interesting ideas, from "A Convolutional Neural Network for Modelling Sentences", is to replace pooling with (dynamic) k-max pooling: the pooling stage keeps the k largest values, and thereby retains some global sequence information. In a sentiment-analysis setting, for example:

            "I think the scenery here is quite nice, but there are really far too many people."

Although the first half expresses positive sentiment, the text as a whole is rather negative; k-max pooling can capture this kind of information well. A minimal sketch follows below.
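Here is a minimal k-max-pooling sketch using tf.nn.top_k; note that top_k returns the k largest activations per feature map in descending order, so unlike the dynamic k-max pooling in the paper above it does not preserve their original positions in the sequence.

import tensorflow as tf

# h: one activation per position per filter, e.g. [batch, steps, num_filters]
h = tf.placeholder(tf.float32, [None, 596, 128])
k = 3

# top_k operates on the last dimension, so move the time axis there first
h_t = tf.transpose(h, [0, 2, 1])                  # [batch, num_filters, steps]
values, _ = tf.nn.top_k(h_t, k=k)                 # [batch, num_filters, k]
k_max_pooled = tf.reshape(values, [-1, 128 * k])  # fixed-length vector per sentence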

 

II. THUCNews News Text Classification

1. Dataset

The experiment uses a subset of the THUCNews news text classification dataset provided by the Tsinghua NLP group (the full dataset contains roughly 740,000 documents and would take a long time to train on).

Download the official full dataset:

http://thuctc.thunlp.org/

This training run uses 10 of the categories, with 6,500 articles per category, for a total of 65,000 news articles.

 

2. Model

CNN model parameter configuration:

class TCNNConfig(object):
    """CNN configuration parameters"""

    embedding_dim = 64      # word-vector dimension
    seq_length = 600        # sequence length
    num_classes = 10        # number of classes
    num_filters = 128       # number of convolution filters
    kernel_size = 5         # convolution kernel size
    vocab_size = 5000       # vocabulary size

    hidden_dim = 128        # fully connected layer units

    dropout_keep_prob = 0.5 # dropout keep probability
    learning_rate = 1e-3    # learning rate

    batch_size = 64         # training batch size
    num_epochs = 10         # total number of epochs

    print_per_batch = 100   # print results every this many batches
    save_per_batch = 10     # write to TensorBoard every this many batches
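The actual model code comes with the repository in reference [2]; as a rough sketch only (not the repository's exact code), a TF 1.x model built from this configuration might look like the following: embedding, a single 1-D convolution, global max pooling over time, a fully connected layer with dropout, and a softmax output.

import tensorflow as tf

config = TCNNConfig()

input_x = tf.placeholder(tf.int32, [None, config.seq_length], name='input_x')
input_y = tf.placeholder(tf.float32, [None, config.num_classes], name='input_y')
keep_prob = tf.placeholder(tf.float32, name='keep_prob')

# embedding lookup: [batch, seq_length, embedding_dim]
embedding = tf.get_variable('embedding', [config.vocab_size, config.embedding_dim])
embedding_inputs = tf.nn.embedding_lookup(embedding, input_x)

# 1-D convolution over the sequence, then global max pooling over time
conv = tf.layers.conv1d(embedding_inputs, config.num_filters, config.kernel_size, name='conv')
gmp = tf.reduce_max(conv, axis=1, name='gmp')          # [batch, num_filters]

# fully connected layer with dropout and ReLU
fc = tf.layers.dense(gmp, config.hidden_dim, name='fc1')
fc = tf.nn.relu(tf.nn.dropout(fc, keep_prob))

# output layer: class scores and predicted class
logits = tf.layers.dense(fc, config.num_classes, name='fc2')
y_pred = tf.argmax(tf.nn.softmax(logits), 1)

# cross-entropy loss and Adam optimizer
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=input_y))
optim = tf.train.AdamOptimizer(config.learning_rate).minimize(loss)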

3. Results

Run from the command line: python run_cnn.py train

Training and evaluating...
Epoch: 1
Iter:      0, Train Loss:    2.3, Train Acc:  15.62%, Val Loss:    2.3, Val Acc:   9.98%, Time: 0:00:02 *
Iter:    100, Train Loss:    1.0, Train Acc:  67.19%, Val Loss:    1.1, Val Acc:  68.36%, Time: 0:00:05 *
Iter:    200, Train Loss:   0.51, Train Acc:  81.25%, Val Loss:   0.61, Val Acc:  80.54%, Time: 0:00:08 *
Iter:    300, Train Loss:   0.33, Train Acc:  90.62%, Val Loss:   0.41, Val Acc:  87.16%, Time: 0:00:10 *
Iter:    400, Train Loss:   0.27, Train Acc:  89.06%, Val Loss:   0.36, Val Acc:  89.80%, Time: 0:00:13 *
Iter:    500, Train Loss:    0.2, Train Acc:  93.75%, Val Loss:    0.3, Val Acc:  92.14%, Time: 0:00:16 *
Iter:    600, Train Loss:   0.14, Train Acc:  95.31%, Val Loss:   0.29, Val Acc:  92.64%, Time: 0:00:18 *
Iter:    700, Train Loss:  0.092, Train Acc:  95.31%, Val Loss:   0.29, Val Acc:  92.50%, Time: 0:00:21
Epoch: 2
Iter:    800, Train Loss:  0.093, Train Acc:  98.44%, Val Loss:   0.27, Val Acc:  92.94%, Time: 0:00:24 *
Iter:    900, Train Loss:  0.048, Train Acc: 100.00%, Val Loss:   0.25, Val Acc:  93.08%, Time: 0:00:27 *
Iter:   1000, Train Loss:  0.034, Train Acc: 100.00%, Val Loss:   0.29, Val Acc:  92.06%, Time: 0:00:29
Iter:   1100, Train Loss:   0.15, Train Acc:  98.44%, Val Loss:   0.26, Val Acc:  92.92%, Time: 0:00:32
Iter:   1200, Train Loss:   0.12, Train Acc:  96.88%, Val Loss:   0.25, Val Acc:  92.64%, Time: 0:00:34
Iter:   1300, Train Loss:   0.21, Train Acc:  92.19%, Val Loss:   0.21, Val Acc:  94.16%, Time: 0:00:37 *
Iter:   1400, Train Loss:   0.15, Train Acc:  95.31%, Val Loss:   0.23, Val Acc:  93.74%, Time: 0:00:39
Iter:   1500, Train Loss:  0.088, Train Acc:  96.88%, Val Loss:   0.29, Val Acc:  90.94%, Time: 0:00:42
Epoch: 3
Iter:   1600, Train Loss:  0.071, Train Acc:  98.44%, Val Loss:   0.21, Val Acc:  93.80%, Time: 0:00:44
Iter:   1700, Train Loss:  0.046, Train Acc:  98.44%, Val Loss:   0.19, Val Acc:  94.82%, Time: 0:00:47 *
Iter:   1800, Train Loss:  0.094, Train Acc:  96.88%, Val Loss:   0.23, Val Acc:  93.60%, Time: 0:00:50
Iter:   1900, Train Loss:  0.073, Train Acc:  96.88%, Val Loss:    0.2, Val Acc:  94.30%, Time: 0:00:52
Iter:   2000, Train Loss:  0.017, Train Acc: 100.00%, Val Loss:    0.2, Val Acc:  94.48%, Time: 0:00:55
Iter:   2100, Train Loss:  0.039, Train Acc: 100.00%, Val Loss:    0.2, Val Acc:  95.02%, Time: 0:00:58 *
Iter:   2200, Train Loss: 0.0094, Train Acc: 100.00%, Val Loss:   0.22, Val Acc:  93.96%, Time: 0:01:00
Iter:   2300, Train Loss:  0.069, Train Acc:  98.44%, Val Loss:    0.2, Val Acc:  94.52%, Time: 0:01:03
Epoch: 4
Iter:   2400, Train Loss:  0.018, Train Acc:  98.44%, Val Loss:   0.18, Val Acc:  95.22%, Time: 0:01:05 *
Iter:   2500, Train Loss:  0.082, Train Acc:  98.44%, Val Loss:   0.22, Val Acc:  94.06%, Time: 0:01:08
Iter:   2600, Train Loss:  0.012, Train Acc: 100.00%, Val Loss:    0.2, Val Acc:  95.04%, Time: 0:01:10
Iter:   2700, Train Loss:  0.077, Train Acc:  98.44%, Val Loss:   0.18, Val Acc:  95.46%, Time: 0:01:13 *
Iter:   2800, Train Loss:  0.025, Train Acc:  98.44%, Val Loss:   0.18, Val Acc:  95.36%, Time: 0:01:16
Iter:   2900, Train Loss:  0.091, Train Acc:  95.31%, Val Loss:   0.23, Val Acc:  93.90%, Time: 0:01:18
Iter:   3000, Train Loss:  0.036, Train Acc:  96.88%, Val Loss:   0.17, Val Acc:  95.40%, Time: 0:01:21
Iter:   3100, Train Loss:  0.039, Train Acc:  98.44%, Val Loss:   0.19, Val Acc:  95.26%, Time: 0:01:23
Epoch: 5
Iter:   3200, Train Loss:  0.012, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  95.08%, Time: 0:01:26
Iter:   3300, Train Loss: 0.0018, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  95.42%, Time: 0:01:28
Iter:   3400, Train Loss: 0.0025, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  95.04%, Time: 0:01:31
Iter:   3500, Train Loss:  0.053, Train Acc:  98.44%, Val Loss:   0.21, Val Acc:  94.64%, Time: 0:01:33
Iter:   3600, Train Loss:  0.022, Train Acc:  98.44%, Val Loss:   0.22, Val Acc:  94.14%, Time: 0:01:36
Iter:   3700, Train Loss: 0.0015, Train Acc: 100.00%, Val Loss:   0.18, Val Acc:  95.12%, Time: 0:01:38
No optimization for a long time, auto-stopping...

The best accuracy on the validation set was 95.46% (at iteration 2700), and training stopped automatically after only 5 epochs thanks to early stopping.

References

[1] Character-level CNN Chinese text classification with TensorFlow

https://blog.csdn.net/u011439796/article/details/77692621

[2] code

https://github.com/gaussic/text-classification-cnn-rnn

[3] TextCNN

https://zhuanlan.zhihu.com/p/25928551
