1 Overview
This text classification series will run to roughly ten posts, covering classification based on word2vec-pretrained embeddings as well as the latest pretrained models (ELMo, BERT, etc.). The full series is as follows:
The Jupyter notebook code is in the textClassifier repository; the Python code is in text_classfier under NLP-Project.
2 Dataset
The dataset consists of IMDB movie reviews. There are three data files in the /data/rawData directory: unlabeledTrainData.tsv, labeledTrainData.tsv, and testData.tsv. Text classification requires labeled data (labeledTrainData). The data is preprocessed in the same way as in Text Classification in Practice (1): Pretraining Word Vectors with word2vec; the preprocessed file is /data/preprocess/labeledTrain.csv.
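The training code in this post expects the preprocessed CSV to contain a whitespace-tokenized "review" column and a "sentiment" label column. A minimal loading sketch, using a two-row stand-in DataFrame since the real file is not included here:

```python
import pandas as pd

# Stand-in for pd.read_csv("../data/preProcess/labeledTrain.csv"); the real file
# is assumed to have these two columns after preprocessing.
df = pd.DataFrame({
    "review": ["this movie was great", "terrible plot and acting"],
    "sentiment": [1, 0],
})

labels = df["sentiment"].tolist()
# Each review is already space-separated after preprocessing, so split() tokenizes it
reviews = [line.strip().split() for line in df["review"].tolist()]
print(reviews[0])  # ['this', 'movie', 'was', 'great']
```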
3 Transformer Model Structure
The Transformer model comes from the paper Attention Is All You Need; for a detailed introduction to the Transformer, see that post. The model structure is shown in the figure below:
The Transformer has two components: the Encoder and the Decoder. For text classification only the Encoder is used; the Decoder is a generative component, mainly used for natural language generation.
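At the heart of the Encoder is scaled dot-product attention, softmax(QKᵀ/√d_k)V, which each attention head computes over the input sequence. A minimal NumPy sketch (toy shapes, not the model code below):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled to stabilize gradients
    # Numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of values

# Toy example: a sequence of 3 positions with key/value dimension 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Multi-head attention runs several such heads in parallel on projected slices of the embedding and concatenates the results, which is why the embedding size must be divisible by the number of heads.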
4 Parameter Configuration
import os
import csv
import time
import datetime
import random
import json
import warnings
from collections import Counter
from math import sqrt

import gensim
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

warnings.filterwarnings("ignore")
# Parameter configuration
class TrainingConfig(object):
    epoches = 10
    evaluateEvery = 100
    checkpointEvery = 100
    learningRate = 0.001


class ModelConfig(object):
    embeddingSize = 200
    filters = 128  # number of kernels in the inner 1-D convolution; the outer layer's kernel count should equal embeddingSize, so that each layer's output dimension matches its input dimension
    numHeads = 8  # number of attention heads (embeddingSize must be divisible by numHeads)
    numBlocks = 1  # number of transformer blocks
    epsilon = 1e-8  # small constant in the LayerNorm denominator
    keepProp = 0.9  # dropout keep probability in multi-head attention
    dropoutKeepProb = 0.5  # dropout keep probability for the fully connected layer
    l2RegLambda = 0.0


class Config(object):
    sequenceLength = 200  # roughly the mean length of all sequences
    batchSize = 128

    dataSource = "../data/preProcess/labeledTrain.csv"
    stopWordSource = "../data/english"

    numClasses = 1  # 1 for binary classification; for multi-class, set to the number of classes
    rate = 0.8  # proportion of the data used for training

    training = TrainingConfig()
    model = ModelConfig()


# Instantiate the configuration object
config = Config()
5 Generating Training Data
1) Load the data and split each sentence into word tokens, removing low-frequency words and stop words.
2) Map words to indices and build a word-to-index vocabulary, saved in JSON format so it can be reused at inference time. (Note: some words may be missing from the pretrained word2vec vocabulary; these are simply represented as UNK.)
3) Read the word vectors from the pretrained word2vec model and feed them into the model as initial embedding values.
4) Split the dataset into a training set and a test set.
# Data preprocessing class: generates the training and evaluation sets
class Dataset(object):
    def __init__(self, config):
        self.config = config
        self._dataSource = config.dataSource
        self._stopWordSource = config.stopWordSource

        self._sequenceLength = config.sequenceLength  # every input sequence is padded/truncated to a fixed length
        self._embeddingSize = config.model.embeddingSize
        self._batchSize = config.batchSize
        self._rate = config.rate

        self.stopWordDict = {}

        self.trainReviews = []
        self.trainLabels = []

        self.evalReviews = []
        self.evalLabels = []

        self.wordEmbedding = None

        self.labelList = []

    def _readData(self, filePath):
        """
        Read the dataset from a CSV file
        """
        df = pd.read_csv(filePath)

        if self.config.numClasses == 1:
            labels = df["sentiment"].tolist()
        elif self.config.numClasses > 1:
            labels = df["rate"].tolist()

        review = df["review"].tolist()
        reviews = [line.strip().split() for line in review]

        return reviews, labels

    def _labelToIndex(self, labels, label2idx):
        """
        Convert labels to index representation
        """
        labelIds = [label2idx[label] for label in labels]
        return labelIds

    def _wordToIndex(self, reviews, word2idx):
        """
        Convert words to indices; words not in the vocabulary map to UNK
        """
        reviewIds = [[word2idx.get(item, word2idx["UNK"]) for item in review] for review in reviews]
        return reviewIds

    def _genTrainEvalData(self, x, y, word2idx, rate):
        """
        Generate the training and validation sets
        """
        reviews = []
        for review in x:
            if len(review) >= self._sequenceLength:
                reviews.append(review[:self._sequenceLength])
            else:
                reviews.append(review + [word2idx["PAD"]] * (self._sequenceLength - len(review)))

        trainIndex = int(len(x) * rate)

        trainReviews = np.asarray(reviews[:trainIndex], dtype="int64")
        trainLabels = np.array(y[:trainIndex], dtype="float32")

        evalReviews = np.asarray(reviews[trainIndex:], dtype="int64")
        evalLabels = np.array(y[trainIndex:], dtype="float32")

        return trainReviews, trainLabels, evalReviews, evalLabels

    def _genVocabulary(self, reviews, labels):
        """
        Build the word embeddings and the word-to-index vocabulary (the full dataset can be used here)
        """
        allWords = [word for review in reviews for word in review]

        # Remove stop words
        subWords = [word for word in allWords if word not in self.stopWordDict]

        wordCount = Counter(subWords)  # count word frequencies
        sortWordCount = sorted(wordCount.items(), key=lambda x: x[1], reverse=True)

        # Remove low-frequency words
        words = [item[0] for item in sortWordCount if item