Text classification with a pretrained BERT model

迷茫猿小明

Last modified 2023-08-04 11:20:37

Category: Deep Learning. Tags: bert, sentiment classification, natural language processing

First published 2019-12-28 15:55:40

Article link: https://blog.csdn.net/bjjoy2009/article/details/103744991

This article is out of date. Transformer can be used instead, which is simpler and more convenient:
https://mp.weixin.qq.com/s/GmPGWHegdX5DgpCRi_BoMg

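Summary

Download the BERT code from git, download a pretrained BERT model, label your own data, implement a dataset loader, then train and evaluate a BERT classification model.
BERT code and models: https://github.com/google-research/bert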

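Project directory structure

The following folders sit side by side in the working directory:
bert folder: the project cloned with git clone
cased_L12_H768_A12 folder: the downloaded BERT model
data folder: the self-labeled data
output folder: the model saved after training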

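Data description

train.csv: training set; labels are required
dev.csv: development set; labels are required; used to evaluate accuracy and other metrics
test.csv: test set; true labels are not required; the model outputs the class probabilities of each example

The files have no header row. The first column is the label (0 = negative sentiment, 1 = positive sentiment) and the second column is the text, for example:
1,I like this book
0,I dislike this book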
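
The file layout described above can be produced and sanity-checked with a few lines of standard-library Python. This is only a minimal sketch: the sample sentences, the helper names `write_dataset` and `read_dataset`, and the temporary directory are all made up for illustration and are not part of the BERT repository. The point it shows is that text containing a comma must be quoted, which `csv.writer` does automatically and which `pandas.read_csv` understands by default.

```python
import csv
import os
import tempfile

# Hypothetical sample rows as (label, text); 0 = negative, 1 = positive.
rows = [
    (1, "I like this book"),
    (0, "I dislike this book"),
    (1, "Great plot, great characters"),  # the comma forces quoting
]

def write_dataset(path, rows):
    """Write (label, text) rows as a header-less CSV, label in column 0."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes any field that contains a comma
        for label, text in rows:
            writer.writerow([label, text])

def read_dataset(path):
    """Read the file back and reject rows that are not a valid (label, text) pair."""
    examples = []
    with open(path, newline="", encoding="utf-8") as f:
        for line_no, record in enumerate(csv.reader(f), start=1):
            if len(record) != 2 or record[0] not in ("0", "1"):
                raise ValueError("bad row %d: %r" % (line_no, record))
            examples.append((record[0], record[1]))
    return examples

data_dir = tempfile.mkdtemp()  # stand-in for the data folder
train_path = os.path.join(data_dir, "train.csv")
write_dataset(train_path, rows)
examples = read_dataset(train_path)
print(examples)
```

Running the same check on dev.csv and test.csv before training catches rows with a missing column or an unexpected label early.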

Code to add to run_classifier.py

(1) Add a data-loading class

import os
import pandas as pd  # add both imports at the top of run_classifier.py if not already present


class MyProcessor(DataProcessor):
    """Processor for header-less CSV files: label in column 0, text in column 1."""

    def _read_csv(self, data_dir, file_name):
        # os.path.join works whether or not data_dir ends with a slash.
        return pd.read_csv(os.path.join(data_dir, file_name), header=None)

    def _create_examples(self, df, set_type):
        """Turn each row into an InputExample with a guid such as "train-0"."""
        examples = []
        for i, row in df.iterrows():
            guid = "%s-%d" % (set_type, i)
            text_a = tokenization.convert_to_unicode(row[1])
            label = tokenization.convert_to_unicode(str(row[0]))
            examples.append(
                InputExample(guid=guid, text_a=text_a, label=label))
        return examples

    def get_train_examples(self, data_dir):
        return self._create_examples(self._read_csv(data_dir, "train.csv"), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(self._read_csv(data_dir, "dev.csv"), "dev")

    def get_test_examples(self, data_dir):
        # test.csv keeps the same two-column layout. Column 0 is still read as the
        # label, so fill it with a placeholder from get_labels() if the true label
        # is unknown.
        return self._create_examples(self._read_csv(data_dir, "test.csv"), "test")

    def get_labels(self):
        return ["0", "1"]

(2) Register MyProcessor in the main(_) function

def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)
    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "my": MyProcessor
    }
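
To make the link between the "my" key and the --task_name flag concrete, here is a rough sketch of the lookup that main(_) performs on this dictionary. The two classes are simplified stand-ins written for this example (the real ones live in run_classifier.py), and `select_processor` is a hypothetical helper name. The lowercasing and the error message follow my reading of the script, so treat them as an assumption and check your checkout.

```python
class DataProcessor(object):
    """Stand-in for the base class in run_classifier.py."""

    def get_labels(self):
        raise NotImplementedError()


class MyProcessor(DataProcessor):
    """Stand-in that only returns the label list; the real class also loads data."""

    def get_labels(self):
        return ["0", "1"]


# Only the custom entry is shown here; the real dictionary also holds cola, mnli, ...
processors = {"my": MyProcessor}


def select_processor(task_name):
    """Look up a processor class by task name (case-insensitive) and instantiate it."""
    key = task_name.lower()
    if key not in processors:
        raise ValueError("Task not found: %s" % key)
    return processors[key]()


processor = select_processor("MY")  # same result as --task_name=my
label_list = processor.get_labels()
print(label_list)  # prints ['0', '1']
```

The consequence is that the dictionary key should be written in lowercase, otherwise the lookup can never match it.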

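Training

From the bert_model_test folder, run the following commands in a terminal.
(1) Add the current directory and the bert folder to the Python path
export PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/bert
(2) Training command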

python bert/run_classifier.py --data_dir=data/ --task_name=my --vocab_file=cased_L12_H768_A12/vocab.txt --bert_config_file=cased_L12_H768_A12/bert_config.json --output_dir=output/ --do_train=true   --do_eval=true   --init_checkpoint=cased_L12_H768_A12/bert_model.ckpt --max_seq_length=32  --train_batch_size=32  --learning_rate=5e-5  --num_train_epochs=1.0 --save_checkpoints_steps=100 --iterations_per_loop=100

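Parameters:
data_dir: directory containing the data
task_name: the custom task name registered in main(_)
output_dir: where the trained model is saved
do_train: whether to run training
do_eval: whether to evaluate on the dev set, which reports accuracy
max_seq_length: maximum input length in WordPiece tokens; longer inputs are truncated and shorter ones are padded (128 works well for this experiment)
train_batch_size: number of examples fed in per training step
num_train_epochs: number of passes over the training set
save_checkpoints_steps: save a checkpoint every this many steps (one batch counts as one step)
iterations_per_loop: "How many steps to make in each estimator call"; not yet investigated, so for now it is set to the same value as save_checkpoints_steps, mirroring the defaults, where the two flags are also equal.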
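
How these flags interact is easier to see with numbers. The sketch below reproduces the step arithmetic as I understand it from run_classifier.py (total steps from dataset size, batch size and epochs; warmup steps from warmup_proportion, whose default I take to be 0.1). The function name `training_schedule` and the dataset size of 10000 are invented for the example.

```python
def training_schedule(num_examples, train_batch_size, num_train_epochs,
                      save_checkpoints_steps, warmup_proportion=0.1):
    """Return (total training steps, warmup steps, full checkpoint intervals)."""
    # One step consumes one batch, so an epoch is num_examples / batch_size steps.
    num_train_steps = int(num_examples / train_batch_size * num_train_epochs)
    # The learning rate is warmed up over the first fraction of the steps.
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    # How many times the save interval fits into the run. The Estimator may
    # write extra checkpoints (for example at the end), which are not counted.
    full_save_intervals = num_train_steps // save_checkpoints_steps
    return num_train_steps, num_warmup_steps, full_save_intervals


# Hypothetical dataset of 10000 rows with the flag values from the command above.
steps, warmup, intervals = training_schedule(
    num_examples=10000,
    train_batch_size=32,
    num_train_epochs=1.0,
    save_checkpoints_steps=100,
)
print(steps, warmup, intervals)  # prints 312 31 3
```

With a small dataset the total number of steps can be lower than save_checkpoints_steps, in which case no periodic checkpoint is written during the run.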

Prediction

python bert/run_classifier.py --data_dir=data/ --task_name=my --vocab_file=cased_L12_H768_A12/vocab.txt --bert_config_file=cased_L12_H768_A12/bert_config.json --output_dir=output/ --do_predict=true --init_checkpoint=output --max_seq_length=32
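
As far as I can tell from the script, prediction writes a file named test_results.tsv into output_dir, one line per test example, holding the tab-separated class probabilities in the order returned by get_labels(). A minimal sketch for turning that file into predicted labels follows. The helper name `read_predictions` and the two demo rows are made up, and the demo file is written to a temporary directory rather than the real output folder.

```python
import csv
import os
import tempfile

LABELS = ["0", "1"]  # must match the order of MyProcessor.get_labels()

def read_predictions(path, labels=LABELS):
    """Return a (predicted label, probability) pair for each line of the file."""
    results = []
    with open(path, newline="", encoding="utf-8") as f:
        for record in csv.reader(f, delimiter="\t"):
            if not record:
                continue  # skip blank lines
            probs = [float(p) for p in record]
            # Index of the largest probability, i.e. the argmax.
            best = max(range(len(probs)), key=probs.__getitem__)
            results.append((labels[best], probs[best]))
    return results

# Made-up stand-in for output/test_results.tsv.
demo_path = os.path.join(tempfile.mkdtemp(), "test_results.tsv")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("0.1\t0.9\n0.75\t0.25\n")

predictions = read_predictions(demo_path)
print(predictions)  # prints [('1', 0.9), ('0', 0.75)]
```

The lines come out in the same order as the rows of test.csv, so the predictions can be joined back to the input text by position.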