Deep Learning in Practice 1 -- TextCNN Text Classification
Preface
This started as a course project in my junior year, but back then I barely understood what I was doing and just followed a blog post that, in hindsight, had problems of its own (I can no longer find it). The original version used TensorFlow. Having since learned more about deep learning, I understand the project much better, so I have rewritten it with Keras.
Dataset used:
Link: https://pan.baidu.com/s/1L84hOsnJQocJ4u7yJxunAw
Extraction code: cd80
Topics covered
- Keras
- CNN
- Word embeddings
Data Processing
This experiment uses the Sogou news dataset: 50,000 articles in the training set, 10,000 in the test set, and 5,000 in the validation set, plus a vocabulary file containing the 5,000 most common characters. In the txt files each article occupies a single line, with the label at the beginning of the line; you can print the first line to see this format, as in the sketch below.
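A minimal way to peek at the raw format (the path assumes the dataset has been unpacked to ./cnews/, as in the rest of this post):
# each line is "<label><whitespace><article text>"
with open('./cnews/cnews.train.txt', encoding='utf-8') as file:
    first_line = file.readline().strip()
print(first_line[:60])  # the label followed by the first characters of the article body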
1. Separating text and labels
The first step is to separate the labels from the text, which can be done as follows:
with open('./cnews/cnews.train.txt',encoding='utf-8') as file:
    line_list = [k.strip() for k in file.readlines()] # strip() removes the trailing newline and whitespace
train_label_list = [k.split()[0] for k in line_list] # split() separates the label from the body
train_content_list = [k.split(maxsplit=1)[1] for k in line_list] # maxsplit=1 is important: it stops splitting after the label, so the body is kept intact
2. Vectorizing the text
Once text and labels are separated, the characters need to be turned into numbers. Why? A computer cannot read Chinese directly, so each character is mapped to an integer index. We have a 5,000-character vocabulary, and each distinct character gets its own id: if the vocabulary is ['你', '我', '它', ...], then every '你' in your text is represented by 0, every '我' by 1, and so on.
So the next step is to number the characters in the vocabulary.
with open('./cnews/cnews.vocab.txt',encoding = 'utf-8') as file:
    vacabulary_list = [k.strip() for k in file.readlines()] # read the vocabulary
word2id_dict = dict([(b,a) for a,b in enumerate(vacabulary_list)]) # map each character to its index
With this mapping in place, each article can be converted from characters to numbers.
content2idList = lambda content : [word2id_dict[word] for word in content if word in word2id_dict]
train_idlist_list = [content2idList(content) for content in train_content_list] # convert each article to the indices defined by word2id_dict
3. Making all articles the same length
Articles have different lengths, but the network input must have a fixed shape, so each article is truncated. Here we keep the last 600 characters of each article (you could just as well keep the first 600, or the first 1,000, ...), using Keras' pad_sequences:
train_text_600 = pad_sequences(train_idlist_list,seq_length) # keep the last 600 indices of each article
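By default pad_sequences pads and truncates at the front (padding='pre', truncating='pre'), which is why this keeps the last 600 characters. A small sketch to confirm the behaviour:
from keras.preprocessing.sequence import pad_sequences
demo = [[1, 2, 3], list(range(1, 701))]   # one short and one long sequence
out = pad_sequences(demo, maxlen=600)
print(out.shape)    # (2, 600)
print(out[0][:5])   # [0 0 0 0 0] -- short sequences are zero-padded at the front
print(out[1][0])    # 101         -- long sequences keep only their last 600 elements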
4. The label data
There are 10 classes in total, so the labels are one-hot encoded, as sketched below.
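A minimal sketch of the label pipeline used later (sklearn's LabelEncoder plus Keras' to_categorical; the label strings here are made up for illustration):
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical
labels = ['sports', 'finance', 'sports']   # hypothetical label strings
encoder = LabelEncoder()
y = encoder.fit_transform(labels)          # integer ids assigned in sorted order, e.g. [1, 0, 1]
Y = to_categorical(y, num_classes=2)       # one row per sample, one column per class
print(Y)                                   # [[0. 1.] [1. 0.] [0. 1.]]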
5. Using an Embedding layer to turn each character id into a vector
Conceptually, the Embedding layer starts from the ids obtained above: each character can be seen as a one-hot vector of dimension vocabulary_size (5,000 in this experiment). Training then learns a weight matrix with vocabulary_size rows and as many columns as the embedding dimension (the skip-gram model is a good reference for the underlying idea), which maps those one-hot vectors into a low-dimensional space, giving dense, low-dimensional character vectors. Note that the Embedding layer is the first layer of the model, so the character vectors for this corpus are learned while the classifier itself is trained. Alternatively, pre-trained word vectors can be used to represent the characters in the corpus.
If word embeddings are new to you, I recommend week 2 of the sequence-models course in Andrew Ng's Deep Learning Specialization.
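As a quick illustration of what the layer does shape-wise, a minimal sketch (assuming a vocabulary of 5,000 characters, sequences of 600 indices, and 300-dimensional vectors):
from keras.layers import Input, Embedding
from keras.models import Model
inp = Input(shape=(600,))                                                # 600 character indices per article
emb = Embedding(input_dim=5000, output_dim=300, input_length=600)(inp)  # lookup into a 5000 x 300 weight matrix
model = Model(inp, emb)
print(model.output_shape)                                                # (None, 600, 300): one 300-d vector per character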
Building a single-layer TextCNN model
This version uses only one convolution layer and one max-pooling layer. The complete code is below.
A few points to note first:
- For text data the kernel width must equal the embedding dimension, so the only size you can tune is the kernel length.
- Text data differs from images: an image is usually three-dimensional (height, width, channels), while the text here has only length and embedding dimension.
- The key to writing the code is understanding how the dimensions change from the input all the way to the output; a quick trace follows this list.
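A back-of-the-envelope trace of the shapes, using the hyperparameters of the single-kernel model below:
seq_length, embedding_dim = 600, 600
filters, kernel_size, pool_size = 256, 5, 20
conv_len = (seq_length - kernel_size) // 1 + 1   # 596 positions after Conv1D (stride 1, no padding)
pool_len = conv_len // pool_size                 # 29 positions after MaxPooling1D
flat_dim = pool_len * filters                    # 7424 units after Flatten
print(conv_len, pool_len, flat_dim)              # 596 29 7424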
#!/usr/bin/env python
# coding: utf-8
# In[1]:
# textCNN keras version
# written by ydc
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras import backend as K
from keras.engine.topology import Layer
from keras.layers import LSTM,RepeatVector,Dense, Activation,Add,Reshape,Input,Lambda,Multiply,Concatenate, Dot
from keras.models import Model
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import os
import h5py
with open('./cnews/cnews.train.txt',encoding='utf-8') as file:
    line_list = [k.strip() for k in file.readlines()] # strip() removes the trailing newline and whitespace
train_label_list = [k.split()[0] for k in line_list] # split() separates the label from the body
train_content_list = [k.split(maxsplit=1)[1] for k in line_list] # maxsplit=1: stop splitting after the label so the body stays intact
with open('./cnews/cnews.vocab.txt',encoding = 'utf-8') as file:
    vacabulary_list = [k.strip() for k in file.readlines()] # read the vocabulary
word2id_dict = dict([(b,a) for a,b in enumerate(vacabulary_list)]) # map each character to its index
content2idList = lambda content : [word2id_dict[word] for word in content if word in word2id_dict]
train_idlist_list = [content2idList(content) for content in train_content_list] # convert each article to word2id_dict indices
# In[2]:
print(len(train_idlist_list)) # the training set has 50,000 articles
seq_length = 600      # keep the last 600 characters of each article
embedding_dim = 600   # dimension of the character vectors
vocab_size = 5000     # vocabulary size
num_classes = 10      # number of classes
train_text_600 = pad_sequences(train_idlist_list,seq_length) # keep the last 600 indices of each article
# train_text_600 = np.array(train_text_600)
# print(train_text_600.shape)
# In[3]:
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from keras.layers import Conv1D,MaxPooling1D,Embedding,Flatten
labelEncoder = LabelEncoder() # encode the label strings as integers
train_y = labelEncoder.fit_transform(train_label_list) # after encoding, the labels are just the integers 0-9
train_Y = to_categorical(train_y,num_classes) # one-hot encode the labels
# In[9]:
# model
X = Input(shape=(seq_length,))
# Y = Input(shape=(num_classes,))
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length)(X)
# 50,000 texts, 600 characters each; conceptually each character is a 5000-dimensional one-hot vector,
# and the Embedding layer learns to map it to a 600-dimensional dense vector.
conv1D = Conv1D(filters=256,kernel_size=5,strides=1)(embedding) # the kernel spans 5 positions * the full vector width (it slides only along the sequence)
# input (batch_size, seq_length, embedding_dim); output (batch_size, (seq_length-kernel_size)/strides + 1, filters), i.e. (batch_size, 600-5+1, 256)
maxPooling = MaxPooling1D(pool_size=20)(conv1D)   # (batch_size, 29, 256)
flat = Flatten()(maxPooling) # batch_size is untouched; the remaining dimensions are flattened
dense = Dense(128)(flat) # a fully connected layer
result = Dense(num_classes,activation='softmax')(dense) # softmax over the 10 classes
model = Model(inputs=[X],outputs=[result])
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['acc'])
model.summary()
# In[ ]:
with open('./cnews/cnews.val.txt', encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
val_label_list = [k.split()[0] for k in line_list]
val_content_list = [k.split(maxsplit=1)[1] for k in line_list]
val_idlist_list = [content2idList(content) for content in val_content_list] # convert the validation articles to indices
val_X = pad_sequences(val_idlist_list,seq_length)
val_y = labelEncoder.transform(val_label_list) # reuse the encoder fitted on the training labels
val_Y = to_categorical(val_y,num_classes) # one-hot encoding
print(len(val_X))
# In[15]:
model.fit(train_text_600,train_Y,batch_size=64,epochs=20,validation_data=(val_X,val_Y))
model.save('./cnn.h5')
# In[10]:
with open('./cnews/cnews.test.txt', encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
test_label_list = [k.split()[0] for k in line_list]
test_content_list = [k.split(maxsplit=1)[1] for k in line_list]
test_idlist_list = [content2idList(content) for content in test_content_list] # convert the test articles to indices
test_X = pad_sequences(test_idlist_list,seq_length)
test_y = labelEncoder.transform(test_label_list) # reuse the encoder fitted on the training labels
test_Y = to_categorical(test_y,num_classes) # one-hot encoding
# In[ ]:
predicted = model.predict(test_X,verbose=1)
predicted_result = np.argmax(predicted, axis=1) # index of the highest-probability class for each article
#y_predict = list(map(str, predicted_result))
cnt = 0
for i in range(len(predicted_result)):
    if predicted_result[i] == test_y[i]:
        cnt += 1
print('Accuracy ', cnt / len(predicted_result))
#print('weighted f1-score:', metrics.f1_score(y_test, y_predict, average='weighted'))
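If per-class metrics are wanted (the commented-out f1-score line hints at this), sklearn's metrics module works directly on the integer labels; a short sketch:
from sklearn import metrics
print(metrics.accuracy_score(test_y, predicted_result))                # same number as the loop above
print(metrics.f1_score(test_y, predicted_result, average='weighted'))  # weighted f1-score
print(metrics.classification_report(test_y, predicted_result, target_names=labelEncoder.classes_))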
Building a multi-kernel TextCNN model
The only difference from the previous model is that three kernels of different sizes are applied in parallel for feature extraction. In my runs, however, the improvement was not obvious.
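For reference, the standard TextCNN formulation (Kim, 2014) applies a global max pool to each branch, so every kernel size contributes one 256-dimensional vector and the vectors are concatenated along the feature axis. A sketch of that variant (this is not the code used below, just the textbook layout):
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout, Dense
from keras.models import Model
X = Input(shape=(600,))
emb = Embedding(input_dim=5000, output_dim=300, input_length=600)(X)
branches = []
for k in (3, 4, 5):                                   # one branch per kernel size
    c = Conv1D(filters=256, kernel_size=k, activation='relu')(emb)
    branches.append(GlobalMaxPooling1D()(c))          # (batch_size, 256) per branch
merged = Concatenate()(branches)                      # (batch_size, 768)
out = Dense(10, activation='softmax')(Dropout(0.5)(merged))
kim_model = Model(X, out)
kim_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])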
#!/usr/bin/env python
# coding: utf-8
# In[1]:
# textCNN keras version
# written by ydc
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras import backend as K
from keras.engine.topology import Layer
from keras.layers import LSTM,RepeatVector,Dense, Activation,Add,Reshape,Input,Lambda,Multiply,Concatenate, Dot
from keras.models import Model
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import os
import h5py
with open('./cnews/cnews.train.txt',encoding='utf-8') as file:
    line_list = [k.strip() for k in file.readlines()] # strip() removes the trailing newline and whitespace
train_label_list = [k.split()[0] for k in line_list] # split() separates the label from the body
train_content_list = [k.split(maxsplit=1)[1] for k in line_list] # maxsplit=1: stop splitting after the label so the body stays intact
with open('./cnews/cnews.vocab.txt',encoding = 'utf-8') as file:
    vacabulary_list = [k.strip() for k in file.readlines()] # read the vocabulary
word2id_dict = dict([(b,a) for a,b in enumerate(vacabulary_list)]) # map each character to its index
content2idList = lambda content : [word2id_dict[word] for word in content if word in word2id_dict]
train_idlist_list = [content2idList(content) for content in train_content_list] # convert each article to word2id_dict indices
# In[15]:
print(len(train_idlist_list)) # the training set has 50,000 articles
seq_length = 600      # keep the last 600 characters of each article
embedding_dim = 300   # dimension of the character vectors
vocab_size = 5000     # vocabulary size
num_classes = 10      # number of classes
train_text_600 = pad_sequences(train_idlist_list,seq_length) # keep the last 600 indices of each article
# train_text_600 = np.array(train_text_600).reshape(-1,seq_length)
# permutation = np.random.permutation(train_text_600.shape[0])
# train_text_X = train_text_600[permutation,:]
#print(train_text_X.shape)
# print(train_text_600[0:100,0])
#print(train_text_600.shape)
# train_text_600 = np.array(train_text_600)
# print(train_text_600.shape)
# In[8]:
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from keras.layers import Conv1D,MaxPooling1D,Embedding,Flatten,Dropout
labelEncoder = LabelEncoder() # encode the label strings as integers
train_y = labelEncoder.fit_transform(train_label_list) # after encoding, the labels are just the integers 0-9
train_Y = to_categorical(train_y,num_classes) # one-hot encode the labels
# In[17]:
# model
X = Input(shape=(seq_length,))
# Y = Input(shape=(num_classes,))
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length)(X)
# 50,000 texts, 600 characters each; conceptually each character is a 5000-dimensional one-hot vector,
# and the Embedding layer learns to map it to a 300-dimensional dense vector.
conv1D1 = Conv1D(filters=256,kernel_size=5,strides=1)(embedding) # the kernel spans 5 positions * the full vector width (it slides only along the sequence)
# input (batch_size, seq_length, embedding_dim); output (batch_size, (seq_length-kernel_size)/strides + 1, filters), i.e. (batch_size, 600-5+1, 256)
maxPooling1 = MaxPooling1D(pool_size=20)(conv1D1)   # (batch_size, 29, 256)
conv1D2 = Conv1D(filters=256,kernel_size=3,strides=1)(embedding)
maxPooling2 = MaxPooling1D(pool_size=30)(conv1D2)   # (batch_size, 19, 256)
conv1D3 = Conv1D(filters=256,kernel_size=4,strides=1)(embedding)
maxPooling3 = MaxPooling1D(pool_size=25)(conv1D3)   # (batch_size, 23, 256)
concat = Lambda(lambda x: K.concatenate([x[0],x[1],x[2]],axis=1))([maxPooling1,maxPooling2,maxPooling3]) # join the three branches along the time axis
flat = Flatten()(concat) # batch_size is untouched; the remaining dimensions are flattened
drop = Dropout(0.3)(flat)
dense = Dense(128)(drop) # a fully connected layer
drop_dense = Dropout(0.2)(dense)
result = Dense(num_classes,activation='softmax')(drop_dense) # softmax over the 10 classes
model = Model(inputs=[X],outputs=[result])
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['acc'])
model.summary()
# In[ ]:
with open('./cnews/cnews.val.txt', encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
val_label_list = [k.split()[0] for k in line_list]
val_content_list = [k.split(maxsplit=1)[1] for k in line_list]
val_idlist_list = [content2idList(content) for content in val_content_list] # convert the validation articles to indices
val_X = pad_sequences(val_idlist_list,seq_length)
val_y = labelEncoder.transform(val_label_list) # reuse the encoder fitted on the training labels
val_Y = to_categorical(val_y,num_classes) # one-hot encoding
print(len(val_X))
# In[ ]:
model.fit(train_text_600,train_Y,batch_size=64,epochs=100,validation_data=(val_X,val_Y))
model.save('./cnn.h5')
# In[ ]:
with open('./cnews/cnews.test.txt', encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
test_label_list = [k.split()[0] for k in line_list]
test_content_list = [k.split(maxsplit=1)[1] for k in line_list]
test_idlist_list = [content2idList(content) for content in test_content_list] # convert the test articles to indices
test_X = pad_sequences(test_idlist_list,seq_length)
test_y = labelEncoder.transform(test_label_list) # reuse the encoder fitted on the training labels
test_Y = to_categorical(test_y,num_classes) # one-hot encoding
# In[ ]:
predicted = model.predict(test_X,verbose=1)
predicted_result = np.argmax(predicted, axis=1) # index of the highest-probability class for each article
#y_predict = list(map(str, predicted_result))
cnt = 0
for i in range(len(predicted_result)):
    if predicted_result[i] == test_y[i]:
        cnt += 1
print('Accuracy ', cnt / len(predicted_result))
#print('weighted f1-score:', metrics.f1_score(y_test, y_predict, average='weighted'))
Both models reach only about 90% accuracy on the test set. Pre-trained word2vec vectors are said to improve this, so I may add word2vec later.
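One common way to plug pre-trained vectors in is to build an embedding matrix aligned with the vocabulary and hand it to the Embedding layer through its weights argument. A sketch assuming a gensim word2vec model trained on character tokens (the file name char_vectors.bin is hypothetical):
import numpy as np
from gensim.models import KeyedVectors
from keras.layers import Embedding
w2v = KeyedVectors.load_word2vec_format('char_vectors.bin', binary=True)  # hypothetical pre-trained file
embedding_matrix = np.zeros((vocab_size, w2v.vector_size))
for word, idx in word2id_dict.items():
    if word in w2v:
        embedding_matrix[idx] = w2v[word]            # rows for characters missing from word2vec stay zero
embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=w2v.vector_size,
                            weights=[embedding_matrix],   # initialise with the pre-trained vectors
                            input_length=seq_length,
                            trainable=False)              # set to True to fine-tune them during training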