NLP in Practice: Chinese Text Classification with TextRCNN

vivian_ll

Column: NLP in Practice. Tags: neural networks, deep learning, machine learning, nlp, keras

Article link: https://blog.csdn.net/vivian_ll/article/details/106235802

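Text classification with the text-RCNN neural network

How it works

RCNN comes from the paper Recurrent Convolutional Neural Networks for Text Classification.

For a walkthrough, see the TextRCNN reading notes.

Network structure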

Word Representation Learning. RCNN uses a recurrent structure, a bi-directional recurrent neural network, to capture context. Each word is then represented by combining the word itself with its left and right contexts, and a linear transformation followed by the tanh activation function is applied to that representation.
Text Representation Learning. Once the representations of all words have been computed, an element-wise max-pooling layer is applied to capture the most important information across the entire text. Finally, a linear transformation followed by the softmax function produces the class probabilities.
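
The two steps above can be sketched in plain NumPy. This is a minimal, untrained illustration under stated assumptions, not the paper's reference code: the function name `rcnn_forward`, the parameter names, the zero initial contexts, and the use of tanh inside the recurrence are all choices made for this sketch.

```python
import numpy as np


def rcnn_forward(E, W_l, W_sl, W_r, W_sr, W2, b2, W4, b4):
    """Forward pass for one text. E has shape (T, d): one embedding per word."""
    T = E.shape[0]
    c = W_l.shape[0]               # size of each context vector
    left = np.zeros((T, c))        # left[i]: summary of the words before word i
    right = np.zeros((T, c))       # right[i]: summary of the words after word i

    # Word representation learning: one scan per direction.
    for i in range(1, T):
        left[i] = np.tanh(W_l @ left[i - 1] + W_sl @ E[i - 1])
    for i in range(T - 2, -1, -1):
        right[i] = np.tanh(W_r @ right[i + 1] + W_sr @ E[i + 1])

    # Combine [left context; word; right context], then linear map + tanh.
    X = np.concatenate([left, E, right], axis=1)   # (T, c + d + c)
    Y2 = np.tanh(X @ W2.T + b2)                    # (T, h)

    # Text representation learning: element-wise max over time, then softmax.
    y3 = Y2.max(axis=0)                            # (h,)
    logits = W4 @ y3 + b4
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


rng = np.random.default_rng(0)
T, d, c, h, k = 6, 4, 3, 5, 2      # words, embedding size, context size, hidden size, classes
probs = rcnn_forward(
    rng.normal(size=(T, d)),
    rng.normal(size=(c, c)), rng.normal(size=(c, d)),
    rng.normal(size=(c, c)), rng.normal(size=(c, d)),
    rng.normal(size=(h, 2 * c + d)), np.zeros(h),
    rng.normal(size=(k, h)), np.zeros(k),
)
print(probs.shape, probs.sum())
```

The max over the time axis is what lets texts of any length collapse into one fixed-size vector before the final classifier.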

Implementation in this post

Defining the network

The network has multiple inputs and a single output.
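
The three inputs are the padded token-id sequence itself plus two copies shifted by one position, which is how the training script in this post builds them. A small sketch of that construction follows; the helper name `make_context_inputs` and the sample ids are hypothetical.

```python
import numpy as np


def make_context_inputs(x):
    """x: (samples, maxlen) array of padded token ids."""
    current = x
    # Shifted right by one: position i holds token i-1 (the first token is repeated).
    left = np.hstack([x[:, :1], x[:, :-1]])
    # Shifted left by one: position i holds token i+1 (the last token is repeated).
    right = np.hstack([x[:, 1:], x[:, -1:]])
    return current, left, right


x = np.array([[0, 0, 11, 12, 13]])   # one front-padded sentence
current, left, right = make_context_inputs(x)
print(left.tolist())    # [[0, 0, 0, 11, 12]]
print(right.tolist())   # [[0, 11, 12, 13, 13]]
```

All three arrays keep the shape `(samples, maxlen)`, so they can share one embedding layer.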

from tensorflow.keras import Input, Model
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Embedding, Dense, SimpleRNN, Lambda, Concatenate, Conv1D, GlobalMaxPooling1D


class RCNN(object):
    def __init__(self, maxlen, max_features, embedding_dims,
                 class_num=5,
                 last_activation='softmax'):
        self.maxlen = maxlen
        self.max_features = max_features
        self.embedding_dims = embedding_dims
        self.class_num = class_num
        self.last_activation = last_activation

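    def get_model(self):
        # Three integer-id inputs of identical shape: the text itself plus the
        # two copies shifted by one token that feed the context RNNs.
        input_current = Input((self.maxlen,))
        input_left = Input((self.maxlen,))
        input_right = Input((self.maxlen,))

        # One embedding layer shared by all three inputs.
        embedder = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen)
        embedding_current = embedder(input_current)
        embedding_left = embedder(input_left)
        embedding_right = embedder(input_right)

        # Left context: forward RNN. Right context: RNN run backwards, whose
        # output is reversed again so that it lines up with the token positions.
        x_left = SimpleRNN(128, return_sequences=True)(embedding_left)
        x_right = SimpleRNN(128, return_sequences=True, go_backwards=True)(embedding_right)
        x_right = Lambda(lambda x: K.reverse(x, axes=1))(x_right)
        # Per token: [left context; word embedding; right context].
        x = Concatenate(axis=2)([x_left, embedding_current, x_right])

        # A width-1 convolution applies the same linear map + tanh to every token;
        # max-pooling over time then keeps the largest value of each feature.
        x = Conv1D(64, kernel_size=1, activation='tanh')(x)
        x = GlobalMaxPooling1D()(x)

        output = Dense(self.class_num, activation=self.last_activation)(x)
        model = Model(inputs=[input_current, input_left, input_right], outputs=output)
        return model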
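
The right-context branch runs its RNN with `go_backwards=True` and then reverses the result. Keras documents that with `go_backwards=True` the layer processes the input backwards and returns the reversed sequence, so the extra reverse puts the outputs back in token order. The toy below mimics this with a running sum standing in for the RNN cell; `backward_scan` is a hypothetical helper, not a Keras function.

```python
import numpy as np


def backward_scan(x):
    """Toy stand-in for an RNN with go_backwards=True: consume the sequence
    from the last step to the first and emit outputs in processing order."""
    state = 0
    outputs = []
    for value in x[::-1]:
        state = state + value      # a real RNN cell would update its hidden state here
        outputs.append(state)
    return np.array(outputs)


x = np.array([1, 2, 3, 4])
processing_order = backward_scan(x)     # the first entry belongs to the LAST token
token_order = processing_order[::-1]    # same role as K.reverse(x, axes=1) in the model

print(processing_order.tolist())  # [4, 7, 9, 10]
print(token_order.tolist())       # [10, 9, 7, 4]
```

After the flip, entry `i` summarizes tokens `i` through the end of the sequence, so it can be concatenated with the embedding at the same position.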

Data processing and training

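import os
import random

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical
# build_vocab, read_category, read_vocab, read_files, encode_sentences and
# encode_cate are expected to come from the project's own utils module.
from utils import *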

# Paths and other settings
data_dir = "./processed_data"
vocab_file = "./vocab/vocab.txt"
vocab_size = 40000

# Network settings
max_features = 40001
maxlen = 400
batch_size = 32
embedding_dims = 50
epochs = 10

print('Preprocessing and loading data...')
# Rebuild the vocabulary if it does not exist yet
if not os.path.exists(vocab_file):
    build_vocab(data_dir, vocab_file, vocab_size)
# Get the word-to-id and category-to-id mapping dictionaries
categories, cat_to_id = read_category()
words, word_to_id = read_vocab(vocab_file)

# Load the full dataset
x, y = read_files(data_dir)
data = list(zip(x, y))
del x, y
# Shuffle
random.shuffle(data)
# Split into training and test sets
train_data, test_data = train_test_split(data)
# Encode the words of each text and its category as ids
x_train = encode_sentences([content[0] for content in train_data], word_to_id)
y_train = to_categorical(encode_cate([content[1] for content in train_data], cat_to_id))
x_test = encode_sentences([content[0] for content in test_data], word_to_id)
y_test = to_categorical(encode_cate([content[1] for content in test_data], cat_to_id))

print('Padding sequences to shape (samples, timesteps)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Preparing model inputs...')
x_train_current = x_train
# Left input: the sequence shifted right by one (first token repeated)
x_train_left = np.hstack([np.expand_dims(x_train[:, 0], axis=1), x_train[:, 0:-1]])
# Right input: the sequence shifted left by one (last token repeated)
x_train_right = np.hstack([x_train[:, 1:], np.expand_dims(x_train[:, -1], axis=1)])
x_test_current = x_test
x_test_left = np.hstack([np.expand_dims(x_test[:, 0], axis=1), x_test[:, 0:-1]])
x_test_right = np.hstack([x_test[:, 1:], np.expand_dims(x_test[:, -1], axis=1)])
print('x_train_current shape:', x_train_current.shape)
print('x_train_left shape:', x_train_left.shape)
print('x_train_right shape:', x_train_right.shape)
print('x_test_current shape:', x_test_current.shape)
print('x_test_left shape:', x_test_left.shape)
print('x_test_right shape:', x_test_right.shape)

print('Building model...')
model = RCNN(maxlen, max_features, embedding_dims).get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])

print('Train...')
early_stopping = EarlyStopping(monitor='val_accuracy', patience=2, mode='max')
history = model.fit([x_train_current, x_train_left, x_train_right], y_train,
          batch_size=batch_size,
          epochs=epochs,
          callbacks=[early_stopping],
          validation_data=([x_test_current, x_test_left, x_test_right], y_test))

print('Test...')
result = model.predict([x_test_current, x_test_left, x_test_right])
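
`model.predict` returns one row of class probabilities per sample, so one more step is needed to turn `result` into labels and a score. A minimal sketch, using made-up arrays in place of the real `result` and `y_test`; `accuracy_from_probs` is a hypothetical helper.

```python
import numpy as np


def accuracy_from_probs(probs, y_onehot):
    """Return predicted class ids and the fraction that match the one-hot labels."""
    pred_ids = probs.argmax(axis=1)
    true_ids = y_onehot.argmax(axis=1)
    return pred_ids, float((pred_ids == true_ids).mean())


# Made-up stand-ins for `result` (softmax outputs) and `y_test` (one-hot labels).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.6, 0.3, 0.1]])
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1],
                     [1, 0, 0],
                     [1, 0, 0]])
pred_ids, acc = accuracy_from_probs(probs, y_onehot)
print(pred_ids.tolist(), acc)  # [0, 2, 1, 0] 0.75
```

With the script's own variables the call would be `accuracy_from_probs(result, y_test)`. Mapping the ids back to names through `categories` would work only if that list is ordered by id, which is an assumption about the `utils` helpers.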

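Plotting

Omitted; see the previous two posts for details.

Note: the same caveats apply as for textCNN and textRNN; see the previous two posts.

Summary: both RCNN and RNN train more slowly than CNN.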
