Some notes from running TextCNN

Text classification with a convolutional neural network

How it works

Implementation in this post

The TextCNN network structure:
[Figure: TextCNN network architecture]

1. Pad all input sentences to the same length.

2. Map each word to a dense word vector (Embedding).

3. Use sliding windows (convolution kernels) of different sizes to extract features.

4. Max-pool the extracted features and concatenate the results.

5. Finally, pass the result through a fully connected layer and softmax for classification.

The role of the fully connected layer is to take the feature vector extracted by the layers above and apply a learned weighting to it.
[Figure: fully connected layer]

The role of softmax is to map the raw input values into the range (0, 1), so the outputs can be read as class probabilities.

[Figure: softmax]
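
To make the softmax step concrete, here is a minimal sketch in plain NumPy (the score values are made up) showing how raw scores from the fully connected layer become probabilities that sum to 1:

import numpy as np

# Made-up raw scores (logits) for the 5 classes, as produced by the dense layer
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])

# Softmax: exponentiate, then normalize so the outputs sum to 1
exp_scores = np.exp(logits - logits.max())  # subtracting the max keeps the exponentials numerically stable
probs = exp_scores / exp_scores.sum()

print(probs)        # roughly [0.56, 0.21, 0.08, 0.03, 0.12]
print(probs.sum())  # 1.0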

Since my own computer does not have enough memory, I decided to run everything in Colab. (There are quite a few pitfalls here.)

1. Uploading the data

Since there are a lot of csv files, I started out uploading them one by one, which took a very long time (a little embarrassing). I then packed them into a rar archive and uploaded that instead; just remember to extract it after uploading. The code is as follows:

!unrar x origin.rar

The processed_data and vocab folders are also needed during the run, so pack, upload, and extract them in the same way, as sketched below.
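
For reference, a minimal sketch of those extraction commands (the archive names processed_data.rar and vocab.rar are assumptions; use whatever you actually named your uploads):

# Extract each uploaded archive into the Colab working directory
!unrar x origin.rar
!unrar x processed_data.rar   # assumed archive name for the processed_data folder
!unrar x vocab.rar            # assumed archive name for the vocab folder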

2. Building and training the model

2.1 Defining the network

from tensorflow.keras import Input, Model
# Embedding, Dense (fully connected, used here for classification), 1D convolution, global max pooling, concatenation, dropout
from tensorflow.keras.layers import Embedding, Dense, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout


class TextCNN(object):
    def __init__(self, maxlen, max_features, embedding_dims,
                 class_num=5,
                 last_activation='softmax'):
        self.maxlen = maxlen                    # maximum text length in words
        self.max_features = max_features        # vocabulary size, i.e. how many distinct words can be represented
        self.embedding_dims = embedding_dims    # dimension of the word vector each word is mapped to
        self.class_num = class_num              # number of classes
        self.last_activation = last_activation  # softmax as the output activation

    def get_model(self):  # builds the model
        input = Input((self.maxlen,))  # input sequence of length maxlen
        embedding = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen)(input)  # map word ids to dense vectors, giving a maxlen x embedding_dims matrix
        convs = []  # convolution branches
        for kernel_size in [3, 4, 5]:  # one branch per kernel size
            c = Conv1D(128, kernel_size, activation='relu')(embedding)  # convolution
            c = GlobalMaxPooling1D()(c)  # pooling
            convs.append(c)
        x = Concatenate()(convs)  # concatenate the pooled outputs

        output = Dense(self.class_num, activation=self.last_activation)(x)  # fully connected output layer
        model = Model(inputs=input, outputs=output)
        return model

This part runs without any problems.
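
As a quick sanity check, the class above can be instantiated and inspected like this (a minimal sketch that uses the same hyperparameters as the training script below):

# Build the model with the settings used later and print the layer shapes
model = TextCNN(maxlen=100, max_features=40001, embedding_dims=50).get_model()
model.summary()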

2.2 Data processing and training

from tensorflow.keras.preprocessing import sequence
import random
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.utils import to_categorical
from utils import *  # makes every helper function in utils available
# utils is the helper module (its contents are shown further below)

# Paths and related configuration
data_dir = "./processed_data"  # preprocessed data (stop words removed)
vocab_file = "./vocab/vocab.txt"  # the vocabulary built from the corpus
vocab_size = 40000

# Network configuration
max_features = 40001  # vocab_size + 1, because out-of-vocabulary words are mapped to id 40000
maxlen = 100
batch_size = 256
embedding_dims = 50
epochs = 8  # number of passes over the training data

print('Preprocessing and loading the data...')
# Rebuild the vocabulary if it does not exist yet
if not os.path.exists(vocab_file):
    build_vocab(data_dir, vocab_file, vocab_size)  # scan the texts and keep the most frequent words
# Get the word-to-id and category-to-id mapping dictionaries
categories, cat_to_id = read_category()
words, word_to_id = read_vocab(vocab_file)

# All of the data
x, y = read_files(data_dir)
data = list(zip(x, y))
del x, y
# Shuffle
random.shuffle(data)
# Split into training and test sets
train_data, test_data = train_test_split(data)
# Encode the words and the category of each text as ids
x_train = encode_sentences([content[0] for content in train_data], word_to_id)
y_train = to_categorical(encode_cate([content[1] for content in train_data], cat_to_id))
x_test = encode_sentences([content[0] for content in test_data], word_to_id)
y_test = to_categorical(encode_cate([content[1] for content in test_data], cat_to_id))

print('Padding the sequences so they form a samples * timestep matrix')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Building the model...')
model = TextCNN(maxlen, max_features, embedding_dims).get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])

print('Training...')
# Set up the callbacks
my_callbacks = [
    ModelCheckpoint('./cnn_model.h5', verbose=1),
    EarlyStopping(monitor='val_accuracy', patience=2, mode='max')
]

# Fit the model
history = model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          callbacks=my_callbacks,
          validation_data=(x_test, y_test))

#print('Predicting on the test set...')
#result = model.predict(x_test)

There are quite a few pitfalls here:

Pitfall 1: utils is a .py file written on my local machine, so a bare "from utils import *" does not work in Colab; you would need to make its path importable first. I tried that and it failed here several times, so I switched to a cruder approach: copy the entire contents of utils.py into the Colab notebook, delete the "from utils import *" line, and replace it with the code below. It is a clumsy method, but it saves you from wrestling with paths.
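
For completeness, the path-based import I could not get working would look roughly like this (a sketch; /content/my_code is just an example path for wherever utils.py was uploaded):

import sys

# Make the folder that contains utils.py importable, then import it as usual
sys.path.append('/content/my_code')  # example path; adjust to where utils.py actually lives
from utils import *

The pasted-in contents of utils.py follow.
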
import sys
from collections import Counter
import numpy as np
import tensorflow.keras as kr
import os

if sys.version_info[0] > 2:
    is_py3 = True
else:
    reload(sys)
    sys.setdefaultencoding("utf-8")
    is_py3 = False

def open_file(filename, mode='r'):
    """
    Common file helper that works under both python2 and python3.
    mode: 'r' or 'w' for read or write
    """
    if is_py3:
        return open(filename, mode, encoding='utf-8', errors='ignore')
    else:
        return open(filename, mode)

def read_file(filename):
    """Read a single file that contains samples from multiple categories."""
    contents = []
    labels = []
    with open_file(filename) as f:
        for line in f:
            try:
                raw = line.strip().split("\t")
                content = raw[1].split(' ')
                if content:
                    contents.append(content)
                    labels.append(raw[0])
            except:
                pass
    return contents, labels

def read_single_file(filename):
    """Read a single file that belongs to one category."""
    contents = []
    label = filename.split('/')[-1].split('.')[0]
    with open_file(filename) as f:
        for line in f:
            try:
                content = line.strip().split(' ')
                if content:
                    contents.append(content)
            except:
                pass
    return contents, label

def read_files(dirname):
    """Read every .txt file in a directory."""
    contents = []
    labels = []
    files = [f for f in os.listdir(dirname) if f.endswith(".txt")]
    for filename in files:
        content, label = read_single_file(os.path.join(dirname, filename))
        contents.extend(content)
        labels.extend([label]*len(content))
    return contents, labels

def build_vocab(train_dir, vocab_file, vocab_size=5000):
    """Build the vocabulary from the training set and save it to disk."""
    data_train, _ = read_files(train_dir)

    all_data = []
    for content in data_train:
        all_data.extend(content)

    counter = Counter(all_data)
    count_pairs = counter.most_common(vocab_size - 1)
    words, _ = list(zip(*count_pairs))
    # Add a <PAD> token used to pad every text to the same length
    words = ['<PAD>'] + list(words)
    open_file(vocab_file, mode='w').write('\n'.join(words) + '\n')


def read_vocab(vocab_file):
    """Read the vocabulary file."""
    # words = open_file(vocab_dir).read().strip().split('\n')
    with open_file(vocab_file) as fp:
        # under py2 every value would need to be converted to unicode
        words = [_.strip() for _ in fp.readlines()]
    word_to_id = dict(zip(words, range(len(words))))
    return words, word_to_id


def read_category():
    """Return the category list and its id encoding."""
    categories = ['car', 'entertainment', 'military', 'sports', 'technology']
    cat_to_id = dict(zip(categories, range(len(categories))))
    return categories, cat_to_id

def encode_cate(content, words):
    """Encode a sequence of tokens/categories as ids; unknown tokens map to id 40000."""
    return [(words[x] if x in words else 40000) for x in content]

def encode_sentences(contents, words):
    """Encode each sentence (a list of words) as a list of word ids."""
    return [encode_cate(x, words) for x in contents]

def process_file(filename, word_to_id, cat_to_id, max_length=600):
    """Convert a file into id representation."""
    contents, labels = read_file(filename)

    data_id, label_id = [], []
    for i in range(len(contents)):
        data_id.append([word_to_id[x] for x in contents[i] if x in word_to_id])
        label_id.append(cat_to_id[labels[i]])

    # Use keras' pad_sequences to pad every text to a fixed length
    x_pad = kr.preprocessing.sequence.pad_sequences(data_id, max_length)
    y_pad = kr.utils.to_categorical(label_id, num_classes=len(cat_to_id))  # convert labels to one-hot

    return x_pad, y_pad


def batch_iter(x, y, batch_size=64):
    """Generate batches of data."""
    data_len = len(x)
    num_batch = int((data_len - 1) / batch_size) + 1

    indices = np.random.permutation(np.arange(data_len))
    x_shuffle = x[indices]
    y_shuffle = y[indices]

    for i in range(num_batch):
        start_id = i * batch_size
        end_id = min((i + 1) * batch_size, data_len)
        yield x_shuffle[start_id:end_id], y_shuffle[start_id:end_id]

# The training cell is unchanged from section 2.2, except that the utils import is gone
# (the helper functions are defined above)
from tensorflow.keras.preprocessing import sequence
import random
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.utils import to_categorical

# Paths and related configuration
data_dir = "./processed_data"  # preprocessed data (stop words removed)
vocab_file = "./vocab/vocab.txt"  # the vocabulary built from the corpus
vocab_size = 40000

# Network configuration
max_features = 40001  # vocab_size + 1, because out-of-vocabulary words are mapped to id 40000
maxlen = 100
batch_size = 256
embedding_dims = 50
epochs = 8  # number of passes over the training data

print('Preprocessing and loading the data...')
# Rebuild the vocabulary if it does not exist yet
if not os.path.exists(vocab_file):
    build_vocab(data_dir, vocab_file, vocab_size)  # scan the texts and keep the most frequent words
# Get the word-to-id and category-to-id mapping dictionaries
categories, cat_to_id = read_category()
words, word_to_id = read_vocab(vocab_file)

# All of the data
x, y = read_files(data_dir)
data = list(zip(x, y))
del x, y
# Shuffle
random.shuffle(data)
# Split into training and test sets
train_data, test_data = train_test_split(data)
# Encode the words and the category of each text as ids
x_train = encode_sentences([content[0] for content in train_data], word_to_id)
y_train = to_categorical(encode_cate([content[1] for content in train_data], cat_to_id))
x_test = encode_sentences([content[0] for content in test_data], word_to_id)
y_test = to_categorical(encode_cate([content[1] for content in test_data], cat_to_id))

print('Padding the sequences so they form a samples * timestep matrix')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Building the model...')
model = TextCNN(maxlen, max_features, embedding_dims).get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])

print('Training...')
# Set up the callbacks
my_callbacks = [
    ModelCheckpoint('./cnn_model.h5', verbose=1),
    EarlyStopping(monitor='val_accuracy', patience=2, mode='max')
]

# Fit the model
history = model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          callbacks=my_callbacks,
          validation_data=(x_test, y_test))

#print('Predicting on the test set...')
#result = model.predict(x_test)

Then training starts and produces the following output:

Preprocessing and loading the data...
Padding the sequences so they form a samples * timestep matrix
WARNING: Logging before flag parsing goes to stderr.
W0719 13:12:37.727436 140625090512768 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/initializers.py:119: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0719 13:12:37.746002 140625090512768 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
x_train shape: (65924, 100)
x_test shape: (21975, 100)
Building the model...
Training...
Train on 65924 samples, validate on 21975 samples
Epoch 1/8
64768/65924 [============================>.] - ETA: 0s - loss: 1.4657 - acc: 0.3287W0719 13:12:50.143887 140625090512768 callbacks.py:1259] Early stopping conditioned on metric `val_accuracy` which is not available. Available metrics are: loss,acc,val_loss,val_acc

Epoch 00001: saving model to ./cnn_model.h5
65924/65924 [==============================] - 9s 132us/sample - loss: 1.4644 - acc: 0.3298 - val_loss: 1.3687 - val_acc: 0.3821
Epoch 2/8
65280/65924 [============================>.] - ETA: 0s - loss: 1.3437 - acc: 0.3930W0719 13:12:53.365740 140625090512768 callbacks.py:1259] Early stopping conditioned on metric `val_accuracy` which is not available. Available metrics are: loss,acc,val_loss,val_acc

Epoch 00002: saving model to ./cnn_model.h5
65924/65924 [==============================] - 3s 49us/sample - loss: 1.3437 - acc: 0.3930 - val_loss: 1.3347 - val_acc: 0.3922
Epoch 3/8
65280/65924 [============================>.] - ETA: 0s - loss: 1.3129 - acc: 0.4046W0719 13:12:56.593974 140625090512768 callbacks.py:1259] Early stopping conditioned on metric `val_accuracy` which is not available. Available metrics are: loss,acc,val_loss,val_acc

Epoch 00003: saving model to ./cnn_model.h5
65924/65924 [==============================] - 3s 49us/sample - loss: 1.3129 - acc: 0.4045 - val_loss: 1.3291 - val_acc: 0.4005
Epoch 4/8
64768/65924 [============================>.] - ETA: 0s - loss: 1.3008 - acc: 0.4083W0719 13:12:59.828505 140625090512768 callbacks.py:1259] Early stopping conditioned on metric `val_accuracy` which is not available. Available metrics are: loss,acc,val_loss,val_acc

Epoch 00004: saving model to ./cnn_model.h5
65924/65924 [==============================] - 3s 49us/sample - loss: 1.3005 - acc: 0.4084 - val_loss: 1.3299 - val_acc: 0.4019
Epoch 5/8
65280/65924 [============================>.] - ETA: 0s - loss: 1.2927 - acc: 0.4097W0719 13:13:03.061439 140625090512768 callbacks.py:1259] Early stopping conditioned on metric `val_accuracy` which is not available. Available metrics are: loss,acc,val_loss,val_acc

Epoch 00005: saving model to ./cnn_model.h5
65924/65924 [==============================] - 3s 49us/sample - loss: 1.2929 - acc: 0.4098 - val_loss: 1.3332 - val_acc: 0.4042
Epoch 6/8
65280/65924 [============================>.] - ETA: 0s - loss: 1.2867 - acc: 0.4140W0719 13:13:06.316307 140625090512768 callbacks.py:1259] Early stopping conditioned on metric `val_accuracy` which is not available. Available metrics are: loss,acc,val_loss,val_acc

Epoch 00006: saving model to ./cnn_model.h5
65924/65924 [==============================] - 3s 49us/sample - loss: 1.2868 - acc: 0.4138 - val_loss: 1.3436 - val_acc: 0.3949
Epoch 7/8
65536/65924 [============================>.] - ETA: 0s - loss: 1.2838 - acc: 0.4142W0719 13:13:09.551476 140625090512768 callbacks.py:1259] Early stopping conditioned on metric `val_accuracy` which is not available. Available metrics are: loss,acc,val_loss,val_acc

Epoch 00007: saving model to ./cnn_model.h5
65924/65924 [==============================] - 3s 49us/sample - loss: 1.2839 - acc: 0.4141 - val_loss: 1.3397 - val_acc: 0.4047
Epoch 8/8
65536/65924 [============================>.] - ETA: 0s - loss: 1.2793 - acc: 0.4174W0719 13:13:12.776823 140625090512768 callbacks.py:1259] Early stopping conditioned on metric `val_accuracy` which is not available. Available metrics are: loss,acc,val_loss,val_acc

Epoch 00008: saving model to ./cnn_model.h5
65924/65924 [==============================] - 3s 49us/sample - loss: 1.2794 - acc: 0.4174 - val_loss: 1.3450 - val_acc: 0.4027

Plotting the training history

import matplotlib.pyplot as plt
plt.switch_backend('agg')
%matplotlib inline

fig1 = plt.figure()
plt.plot(history.history['loss'],'r',linewidth=3.0)
plt.plot(history.history['val_loss'],'b',linewidth=3.0)
plt.legend(['Training loss', 'Validation Loss'],fontsize=18)
plt.xlabel('Epochs ',fontsize=16)
plt.ylabel('Loss',fontsize=16)
plt.title('Loss Curves :CNN',fontsize=16)
fig1.savefig('loss_cnn.png')
plt.show()

[Figure: loss curves]

fig2=plt.figure()
plt.plot(history.history['accuracy'],'r',linewidth=3.0)
plt.plot(history.history['val_accuracy'],'b',linewidth=3.0)
plt.legend(['Training Accuracy', 'Validation Accuracy'],fontsize=18)
plt.xlabel('Epochs ',fontsize=16)
plt.ylabel('Accuracy',fontsize=16)
plt.title('Accuracy Curves : CNN',fontsize=16)
fig2.savefig('accuracy_cnn.png')
plt.show()

[Figure: accuracy curves]
Pitfall 2: the TensorFlow in Colab is not version 2.0, so it does not recognize 'accuracy' and 'val_accuracy'; changing them to 'acc' and 'val_acc' fixes it.
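
A version-agnostic way around this is to pick whichever key actually exists in history.history (a sketch reusing the history object and matplotlib setup from the cells above, not what I ran at the time):

# TF 2.x logs 'accuracy'/'val_accuracy'; older tf.keras logs 'acc'/'val_acc'
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
val_acc_key = 'val_' + acc_key

plt.plot(history.history[acc_key], 'r', linewidth=3.0)
plt.plot(history.history[val_acc_key], 'b', linewidth=3.0)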

Printing the model structure

from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True, show_layer_names=True)

[Figure: model structure plot]
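
If plot_model complains that pydot or graphviz is missing, installing them in the notebook usually fixes it (a sketch; newer Colab images may already ship both):

!pip install pydot
!apt-get install -y graphviz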

