Natural Language Processing Series 2: Text Classification with Deep Learning, Part 1

Latest recommended article published on 2023-01-08 23:49:13

红色工程师qk

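With deep learning models, the work centers on building and tuning the model, so the overall workload is relatively much smaller. RNNs, LSTMs, and similar models have memory and therefore perform very well on text, but their obvious drawback is the heavy computation: without GPU acceleration they are not suited to large volumes of data. CNNs shone in Facebook's translation project, which also shows how important CNNs are in text processing, and they are noticeably faster than RNNs. This article tries multi-layer CNNs, parallel CNNs, combinations of RNN and CNN, RNNs based on Hierarchical Attention, transfer learning, multi-task learning, and joint model learning. For the single models and for joint model learning, we reproduced and borrowed from the first-place solution of the 2017 Zhihu Kanshan Cup (知乎看山杯) competition, and we thank its authors here. All of the deep learning code is implemented with the Keras framework. Keras makes building models very convenient and is well suited to quickly validating your own ideas and models.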

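1. Text preprocessing

Word segmentation and Word Embedding were covered earlier. Before text is fed into a neural network as input, it generally has to go through a Tokenizer, the blank positions have to be padded, and the embedding_matrix for the Word Embedding has to be built. Tokenizer and padding both use the API that ships with Keras. Because this process was not clear to me when I first used deep learning on text, I am sharing the code. The steps are as follows: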

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from gensim.models.word2vec import Word2Vec

# train_para_cut, train_content_cut, test_content_cut, w2v_file, w2v_dim and max_len
# are assumed to be defined earlier
max_nb_words = 100000  # keep the 100k most common words
tokenizer = Tokenizer(num_words=max_nb_words, filters='')
tokenizer.fit_on_texts(train_para_cut)  # fit on the already-segmented training corpus
word_index = tokenizer.word_index
vocab_size = len(word_index)
model = Word2Vec.load(w2v_file)  # load the previously trained Word Embedding model
word_vectors = model.wv
embeddings_index = dict()
# gensim < 4.0 API; gensim >= 4.0 replaced wv.vocab with wv.key_to_index (word -> index)
for word, vocab_obj in model.wv.vocab.items():
    if int(vocab_obj.index) < max_nb_words:
        embeddings_index[word] = word_vectors[word]
del model, word_vectors
print("word2vec size: {}".format(len(embeddings_index)))
num_words = min(max_nb_words, vocab_size)
not_found = 0
# pretrained weights, needed by the network's Embedding layer; row 0 stays zero for padding
embedding_matrix = np.zeros((num_words + 1, w2v_dim))
for word, i in word_index.items():
    if i > num_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        not_found += 1  # words without a pretrained vector keep an all-zero row
print("not found word in w2v: {}".format(not_found))
print("input layer size: {}".format(num_words))
print("GET Embedding Matrix Completed")

# convert texts to index sequences, then pad/truncate at the end to a fixed length
train_x = tokenizer.texts_to_sequences(train_content_cut)
train_x = pad_sequences(train_x, maxlen=max_len, padding="post", truncating="post")
test_x = tokenizer.texts_to_sequences(test_content_cut)
test_x = pad_sequences(test_x, maxlen=max_len, padding="post", truncating="post")
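
To make the Tokenizer and padding steps above easier to follow, here is a minimal pure-Python sketch of what they do conceptually. It is not the Keras implementation: the helper names `fit_word_index`, `texts_to_sequences`, and `pad_post` and the toy corpus are made up for illustration. It assumes, as Keras does, that words are ranked by frequency starting at index 1, that index 0 is reserved for padding, that only indices below `num_words` are kept, and that unknown words are dropped.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency: 1 = most frequent, 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index; drop unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_post(sequences, maxlen):
    """Mimic padding="post", truncating="post": cut the tail, then append zeros."""
    padded = []
    for seq in sequences:
        seq = seq[:maxlen]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded


# Hypothetical, already-segmented toy corpus (tokens separated by spaces)
docs = ["deep learning for text", "deep learning is fun", "deep text"]

word_index = fit_word_index(docs)
seqs = texts_to_sequences(docs, word_index, num_words=5)
padded = pad_post(seqs, maxlen=3)

print(word_index)  # {'deep': 1, 'learning': 2, 'text': 3, 'for': 4, 'is': 5, 'fun': 6}
print(seqs)        # [[1, 2, 4, 3], [1, 2], [1, 3]]
print(padded)      # [[1, 2, 4], [1, 2, 0], [1, 3, 0]]
```

Note that with `num_words=5` only indices 1 to 4 survive, so "is" and "fun" disappear from the second document. The same rule means `num_words=100000` keeps at most 99,999 distinct words in the sequences.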

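2. Necessary data statistics

Each document together with its label forms one input to the neural network. After the Tokenizer step the input has to be set to a fixed length, so you need statistics on document length, number of sentences, sentence length, and title length. I recommend pandas for this, since it is convenient and concise. The statistics for the human-versus-machine writing detection task are as follows: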

Label on train:
NEGATIVE 359631
POSITIVE 240369

Content length on train:
POSITIVE:
count 240369.000000
mean 1030.239369
std 606.937210
min 2.000000
25% 554.000000
50% 866.000000
75% 1350.000000
max 3001.000000
NEGATIVE:
count 359631.000000
mean 1048.659999
std 607.034089
min 186.000000
25% 574.000000
50% 882.000000
75% 1369.000000
max 3385.000000

Content length on test:
count 400000.000000
mean 1042.695075
std 608.866342
min 136.000000
25% 567.000000
50% 877.000000
75% 1362.000000
max 4042.000000
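
Statistics like the ones above can be produced with a few pandas calls. The sketch below is only an illustration under assumptions: the DataFrame, its column names `label` and `content`, and the toy rows are hypothetical, and length is counted in whitespace-separated tokens, which may differ from the unit used for the numbers above. Choosing `max_len` from the 95th percentile is one possible rule of thumb, not something the original analysis prescribes.

```python
import pandas as pd

# Hypothetical toy data standing in for the real training set
train = pd.DataFrame({
    "label": ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"],
    "content": ["a b c", "a b c d e f", "a b", "a b c d", "a b c d e"],
})

# Class balance (the "Label on train" block)
label_counts = train["label"].value_counts()

# Length of each document, counted here in whitespace-separated tokens
train["content_len"] = train["content"].str.split().str.len()

# count / mean / std / min / quartiles / max for each class
length_stats = train.groupby("label")["content_len"].describe()

# One possible way to pick the fixed input length: cover about 95% of documents
max_len = int(train["content_len"].quantile(0.95))

print(label_counts)
print(length_stats)
print("max_len:", max_len)
```

The same `describe()` call on the test set's length column gives the "Content length on test" block, and comparing the two shows whether train and test lengths follow a similar distribution.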