1. This post builds on https://blog.csdn.net/m0_38088359/article/details/83004972 and reproduces the method of the original article "Implementing a CNN for Text Classification in TensorFlow", dropping the embedding layer there and using Word2Vec word vectors in its place.
The original article uses six convolution kernels for feature extraction: two of size 2*embed_size, two of size 3*embed_size, and two of size 4*embed_size. The article is reproduced in full here, with slight parameter adjustments.
2. The full training procedure is as follows (no class encapsulation):
(1) Read the data and convert each character to its id, producing an id vector per document:
with open('./cnews/cnews.vocab.txt', encoding='utf8') as file:
    vocabulary_list = [k.strip() for k in file.readlines()]
with open('./cnews/cnews.train.txt', encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
    train_label_list = [k.split(maxsplit=1)[0] for k in line_list]
    train_content_list = [k.split(maxsplit=1)[1] for k in line_list]
with open('./cnews/cnews.test.txt', encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
    test_label_list = [k.split(maxsplit=1)[0] for k in line_list]
    test_content_list = [k.split(maxsplit=1)[1] for k in line_list]

word2id_dict = {word: idx for idx, word in enumerate(vocabulary_list)}

def content2vector(content_list):
    """Map every character of each document to its vocabulary id;
    out-of-vocabulary characters fall back to the <PAD> id."""
    content_vector_list = []
    for content in content_list:
        content_vector = []
        for word in content:
            if word in word2id_dict:
                content_vector.append(word2id_dict[word])
            else:
                content_vector.append(word2id_dict['<PAD>'])
        content_vector_list.append(content_vector)
    return content_vector_list

train_vector_list = content2vector(train_content_list)
test_vector_list = content2vector(test_content_list)
(2) Zero-pad and truncate each sentence's id vector to a uniform length:
import tensorflow.contrib.keras as kr

train_X = kr.preprocessing.sequence.pad_sequences(train_vector_list, 600)
test_X = kr.preprocessing.sequence.pad_sequences(test_vector_list, 600)
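Note that `pad_sequences` pads and truncates at the front by default (`padding='pre'`, `truncating='pre'`), so short texts get leading zeros and long texts keep only their last 600 ids. A minimal pure-Python sketch of that default behaviour (the helper name `pad_pre` is mine, for illustration only):

```python
def pad_pre(seqs, maxlen):
    """Replicate the pad_sequences defaults: pre-truncation to
    maxlen, then left-padding with 0 up to maxlen."""
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]                    # keep the last maxlen ids
        out.append([0] * (maxlen - len(s)) + s)  # left-pad with zeros
    return out

print(pad_pre([[1, 2, 3], [4, 5, 6, 7, 8]], 4))
# [[0, 1, 2, 3], [5, 6, 7, 8]]
```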
(3) Initialize the hyperparameters:
vocab_size = 5000
kernel_sizes = [2, 2, 3, 3, 4, 4]
dropout_keep_prob = 0.5
num_kernels = 128
batch_size = 64
seq_length = 600
embed_size = 128
hidden_dim = 256
num_classes = 10
learning_rate = 1e-3
embedding_dim = 128  # word-vector dimensionality
(4) Define the placeholders and the embedding layer:
import tensorflow as tf

X_holder = tf.placeholder(tf.int32, [None, seq_length])
Y_holder = tf.placeholder(tf.float32, [None, num_classes])
# The trailing 1 adds a channel axis, so the lookup below directly yields a
# [batch, seq_length, embedding_dim, 1] tensor ready for tf.nn.conv2d.
embedding = tf.get_variable('embedding', [vocab_size, embedding_dim, 1])
embedding_inputs = tf.nn.embedding_lookup(embedding, X_holder)
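Because the embedding variable carries that trailing channel dimension, the lookup already produces a 4-D tensor in conv2d's NHWC layout. A quick NumPy check of the resulting shape (the batch size of 2 is arbitrary; fancy indexing stands in for `tf.nn.embedding_lookup`):

```python
import numpy as np

vocab_size, embedding_dim, seq_length = 5000, 128, 600
embedding = np.zeros((vocab_size, embedding_dim, 1))        # same shape as the TF variable
X = np.random.randint(0, vocab_size, size=(2, seq_length))  # a dummy batch of id vectors
looked_up = embedding[X]  # NumPy fancy indexing mimics tf.nn.embedding_lookup
print(looked_up.shape)
# (2, 600, 128, 1) -> [batch, height, width, channels] for conv2d
```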
(5) Build the intermediate convolutional layers:
def conv_pool_concate(X_holder
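The definition is cut off here; per the `kernel_sizes` list above, the intent is six conv + max-pool branches whose outputs are concatenated. As a framework-independent illustration (not the author's TF code), here is a NumPy sketch of the computation such a layer performs, with random filter weights and names of my own choosing:

```python
import numpy as np

def conv_pool_concate_np(inputs, kernel_sizes=(2, 2, 3, 3, 4, 4), num_kernels=128):
    """For each kernel height k: a 'valid' convolution spanning the full
    embedding width, then a global max-pool over time; the six pooled
    branches are concatenated into one feature vector per document."""
    batch, seq_len, embed, _ = inputs.shape
    pooled = []
    for k in kernel_sizes:
        W = np.random.randn(k, embed, num_kernels) * 0.1  # one filter bank per branch
        # feature map: [batch, seq_len - k + 1, num_kernels]
        fmap = np.stack(
            [np.tensordot(inputs[:, i:i + k, :, 0], W, axes=([1, 2], [0, 1]))
             for i in range(seq_len - k + 1)], axis=1)
        pooled.append(fmap.max(axis=1))       # global max-pool over time
    return np.concatenate(pooled, axis=1)     # [batch, 6 * num_kernels]

out = conv_pool_concate_np(np.random.randn(2, 600, 128, 1))
print(out.shape)  # (2, 768), i.e. 6 branches * 128 kernels
```

With the hyperparameters above, the concatenated vector has 6 * 128 = 768 features per document, which would then feed the hidden and softmax layers.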