2021SC@SDUSC
数据集为IMDB 电影影评,总共有三个数据文件,在/data/rawData目录下,包括unlabeledTrainData.tsv,labeledTrainData.tsv,testData.tsv。在进行文本分类时需要有标签的数据(labeledTrainData),数据预处理如文本分类实战(一)—— word2vec预训练词向量中一样,预处理后的文件为/data/preprocess/labeledTrain.csv。
1.RCNN 模型结构
RCNN模型来源于论文Recurrent Convolutional Neural Networks for Text Classification。模型结构图如下:
RCNN 整体的模型构建流程如下:
1)利用Bi-LSTM获得上下文的信息,类似于语言模型。
2)将Bi-LSTM获得的隐层输出和词向量拼接[fwOutput, wordEmbedding, bwOutput]。
3)将拼接后的向量非线性映射到低维。
4)向量中的每一个位置的值都取所有时序上的最大值,得到最终的特征向量,该过程类似于max-pool。
5)softmax分类。
2.参数配置
import os import csv import time import datetime import random import json import warnings from collections import Counter from math import sqrt import gensim import pandas as pd import numpy as np import tensorflow as tf from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score warnings.filterwarnings("ignore")
# 配置参数 class TrainingConfig(object): epoches = 10 evaluateEvery = 100 checkpointEvery = 100 learningRate = 0.001 class ModelConfig(object): embeddingSize = 200 hiddenSizes = [128] # LSTM结构的神经元个数 dropoutKeepProb = 0.5 l2RegLambda = 0.0 outputSize = 128 # 从高维映射到低维的神经元个数 class Config(object): sequenceLength = 200 # 取了所有序列长度的均值 batchSize = 128 dataSource = "../data/preProcess/labeledTrain.csv" stopWordSource = "../data/english" numClasses = 1 # 二分类设置为1,多分类设置为类别的数目 rate = 0.8 # 训练集的比例 training = TrainingConfig() model = ModelConfig() # 实例化配置参数对象 config = Config()
3.生成训练数据
1)将数据加载进来,将句子分割成词表示,并去除低频词和停用词。
2)将词映射成索引表示,构建词汇-索引映射表,并保存成json的数据格式,之后做inference时可以用到。(注意,有的词可能不在word2vec的预训练词向量中,这种词直接用UNK表示)
3)从预训练的词向量模型中读取出词向量,作为初始化值输入到模型中。
4)将数据集分割成训练集和测试集
# 数据预处理的类,生成训练集和测试集 class Dataset(object): def __init__(self, config): self.config = config self._dataSource = config.dataSource self._stopWordSource = config.stopWordSource self._sequenceLength = config.sequenceLength # 每条输入的序列处理为定长 self._embeddingSize = config.model.embeddingSize self._batchSize = config.batchSize self._rate = config.rate self._stopWordDict = {} self.trainReviews = [] self.trainLabels = [] self.evalReviews = [] self.evalLabels = [] self.wordEmbedding =None self.labelList = [] def _readData(self, filePath): """ 从csv文件中读取数据集 """ df = pd.read_csv(filePath) if self.config.numClasses == 1: labels = df["sentiment"].tolist() elif self.config.numClasses > 1: labels = df["rate"].tolist() review = df["review"].tolist() reviews = [line.strip().split() for line in review] return reviews, labels def _labelToIndex(self, labels, label2idx): """ 将标签转换成索引表示 """ labelIds = [label2idx[label] for label in labels] return labelIds def _wordToIndex(self, reviews, word2idx): """ 将词转换成索引 """ reviewIds = [[word2idx.get(item, word2idx["UNK"]) for item in review] for review in reviews] return reviewIds def _genTrainEvalData(self, x, y, word2idx, rate): """ 生成训练集和验证集 """ reviews = [] for review in x: if len(review) >= self._sequenceLength: reviews.append(review[:self._sequenceLength]) else: reviews.append(review + [word2idx["PAD"]] * (self._sequenceLength - len(review))) trainIndex = int(len(x) * rate) trainReviews = np.asarray(reviews[:trainIndex], dtype="int64") trainLabels = np.array(y[:trainIndex], dtype="float32") evalReviews = np.asarray(reviews[trainIndex:], dtype="int64") evalLabels = np.array(y[trainIndex:], dtype="float32") return trainReviews, trainLabels, evalReviews, evalLabels def _genVocabulary(self, reviews, labels): """ 生成词向量和词汇-索引映射字典,可以用全数据集 """ allWords = [word for review in reviews for word in review] # 去掉停用词 subWords = [word for word in allWords if word not in self.stopWordDict] wordCount = Counter(subWords) # 统计词频 sortWordCount = sorted(wordCount.items(), key=lambda x: x[1], reverse=True) # 去除低频词 words = [item[0] for item in sortWordCount if item[1] >= 5] vocab, wordEmbedding = self._getWordEmbedding(words) self.wordEmbedding = wordEmbedding word2idx = dict(zip(vocab, list(range(len(vocab))))) uniqueLabel = list(set(labels)) label2idx = dict(zip(uniqueLabe