Using the BERT Model: Notes and Takeaways

tristan_tian

Article link: https://blog.csdn.net/tristan_tian/article/details/99626133

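I spent the past few days learning to use the BERT model.

First, a brief introduction:

BERT is a multi-layer bidirectional Transformer encoder, built on the Transformer model proposed by Vaswani et al. (2017) in the paper "Attention Is All You Need". In other words, BERT is just an encoder for NLP. It can produce representations for single sentences as well as for paired inputs such as question answering. For details, see https://blog.csdn.net/sinat_33761963/article/details/83578498.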

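I'm a little embarrassed to admit that I only used it in a simple way: I treated the 12-layer BERT as a black box and added a single fully connected layer on top to produce the classification result. Input text -> training -> results. The rest of this post mainly records how I used it.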
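To make the "black box plus one fully connected layer" idea concrete, here is a minimal pure-Python sketch of such a classification head. It is not BERT's actual code: the 4-dimensional `pooled` vector is a toy stand-in for the 768-dimensional sentence vector, and the weights, bias, and the function name `classify` are all made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, bias, labels):
    """Apply one fully connected layer followed by softmax.

    pooled:  sentence vector from the encoder (toy values in this sketch)
    weights: one row of weights per label
    bias:    one bias value per label
    """
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy stand-in for the encoder output; real BERT-base vectors have 768 dims.
pooled = [0.5, -1.0, 0.25, 2.0]
weights = [[1.0, 0.0, 0.0, 1.0],    # hypothetical weights for 'AD'
           [0.0, 1.0, 1.0, -1.0]]   # hypothetical weights for 'CTRL'
bias = [0.0, 0.0]
label, probs = classify(pooled, weights, bias, ['AD', 'CTRL'])
print(label, probs)
```

Fine-tuning learns the weights of this last layer and also updates the encoder underneath it.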

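BERT is an open-source project, and the source code can be downloaded from GitHub. It is worth mentioning that, because pre-training takes a great deal of time and resources, Google has kindly released pre-trained weights (models), so we only need to fine-tune. Unfortunately, there is only one Chinese pre-trained model (the chinese_L-12_H-768_A-12 directory used in the commands later in this post).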

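After unpacking the model, the files are:
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt (besides Chinese characters, it contains many special symbols that I could not make sense of)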
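Before training, it can save time to confirm that the unpacked directory really contains all of these files. The following is a small sketch of such a check; the helper name `missing_model_files` is my own, and the demo runs against a throwaway temporary directory rather than a real model.

```python
import os
import tempfile

# File names taken from the list above.
EXPECTED_FILES = [
    "bert_config.json",
    "bert_model.ckpt.data-00000-of-00001",
    "bert_model.ckpt.index",
    "bert_model.ckpt.meta",
    "vocab.txt",
]


def missing_model_files(model_dir):
    """Return the expected files that are not present in model_dir."""
    return [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# Demo on a throwaway directory that only contains an empty vocab.txt.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "vocab.txt"), "w").close()
    demo_missing = missing_model_files(demo_dir)
print(demo_missing)
```

For a real run, pass the path of the unpacked model directory instead; an empty list means nothing is missing.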

First of all, BERT requires TensorFlow 1.11.0 or later (I used 1.11.0).
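Version strings should be compared numerically, not as text (as text, "1.9.0" would sort after "1.11.0"). Here is a minimal sketch of such a check; the helper names are my own, and it assumes plain numeric versions such as "1.11.0" with no release-candidate suffix.

```python
def version_tuple(version):
    """Convert '1.11.0' into (1, 11, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split(".")[:3])


def meets_minimum(installed, minimum="1.11.0"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("1.11.0"))
print(meets_minimum("1.9.0"))
```

In a real environment the installed version would come from `tf.__version__` after `import tensorflow as tf`.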

Step 1: Modify run_classifier.py

Since I'm doing classification, this is the only script I looked at.

1. Prepare the data in the required format

class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    """Constructs a InputExample.

    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label

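As you can see, the inputs it expects are guid, text_a, text_b, and label, where text_b and label are optional. We are doing single-sentence classification, so text_b is not needed. Also note that the input has a length limit of 512 tokens, which for the Chinese model is roughly 512 characters.

So each line of our input must have the form: label, a tab (\t), then the text. Below is an example: the label AD, followed by a tab, followed by the text.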

AD	啊，看到的东西都可以说啊。嗯。嗯还有呢？这里面有什么东西
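A data file in this format can be produced and read back with a few lines of Python. This is only a sketch: the helper names `write_tsv` and `read_tsv` are mine, the second sample row and its CTRL label are invented for the demo, and it assumes the text itself contains no tabs or line breaks.

```python
import csv
import os
import tempfile


def write_tsv(path, rows):
    """Write (label, text) pairs, one per line, separated by a tab."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for label, text in rows:
            f.write("%s\t%s\n" % (label, text))


def read_tsv(path):
    """Read the file back into a list of [label, text] lists."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


rows = [
    ("AD", "啊，看到的东西都可以说啊。"),
    ("CTRL", "嗯还有呢？"),  # invented sample, label chosen arbitrarily
]
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "traintext.csv")
    write_tsv(demo_path, rows)
    loaded = read_tsv(demo_path)
print(loaded)
```

Each row comes back with the label in column 0 and the text in column 1.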

2. Subclass DataProcessor

Every task has its own Processor, so we need to write one for our task as well.

class MyProcessor(DataProcessor):
  """Processor for my task-news classification """
  def __init__(self):
    self.labels = ['AD', 'CTRL']

  def get_train_examples(self, data_dir):
    return self._create_examples(
      self._read_tsv(os.path.join(data_dir, 'traintext.csv')), 'train')

  def get_dev_examples(self, data_dir):
    return self._create_examples(
      self._read_tsv(os.path.join(data_dir, 'vaildtext.csv')), 'val')

  def get_test_examples(self, data_dir):
    return self._create_examples(
      self._read_tsv(os.path.join(data_dir, 'testtext.csv')), 'test')

  def get_labels(self):
    return self.labels

  def _create_examples(self, lines, set_type):
    """Creates examples for the training, validation, and test sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = '%s-%s' % (set_type, i)
      # Each line is [label, text]: column 0 is the label, column 1 the text.
      text_a = tokenization.convert_to_unicode(line[1])
      label = tokenization.convert_to_unicode(line[0])
      examples.append(InputExample(guid=guid, text_a=text_a, label=label))
    return examples

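In __init__, the line self.labels = ['AD', 'CTRL'] defines your own class labels. The data consists of three files: the training set, the validation set, and the test set (or the samples you want predictions for). The _create_examples function shows how the data is read. If reformatting your data is too much trouble, you can modify this function and class instead.

Once the processor is written, register it in the processors dict in the main function: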

  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "my": MyProcessor,
  }
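The key in this dict is what the --task_name flag has to match. Roughly, the script lowercases the flag value and looks it up, failing if the key is absent. The sketch below imitates that lookup with a stand-in class; the function name `get_processor` and the trimmed-down registry are my own simplifications, not the script's exact code.

```python
class MyProcessor:
    """Stand-in for the real MyProcessor; only the lookup matters here."""


# Trimmed-down registry for the demo; the real one also lists cola, mnli, etc.
processors = {"my": MyProcessor}


def get_processor(task_name):
    """Look up a processor by task name, ignoring case."""
    key = task_name.lower()
    if key not in processors:
        raise ValueError("Task not found: %s" % key)
    return processors[key]()


processor = get_processor("MY")
print(type(processor).__name__)

try:
    get_processor("unknown")
except ValueError as err:
    print(err)
```

So a task name that was never registered stops the run right away instead of silently picking a default.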

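3. Modify the loss output

The stock script runs validation only after all epochs have finished, and validation prints a single accuracy and loss. Modify the training part so that it prints the loss every n iterations.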

  if FLAGS.do_train:
    train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
    file_based_convert_examples_to_features(
        train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
    tf.logging.info("***** Running training *****")
    tf.logging.info("  Num examples = %d", len(train_examples))
    tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
    tf.logging.info("  Num steps = %d", num_train_steps)
    train_input_fn = file_based_input_fn_builder(
        input_file=train_file,
        seq_length=FLAGS.max_seq_length,
        is_training=True,
        drop_remainder=True)
    # Modified: log the training loss every 20 iterations.
    tensors_to_log = {'train loss': 'loss/Mean:0'}
    logging_hook = tf.train.LoggingTensorHook(
        tensors=tensors_to_log, every_n_iter=20)
    estimator.train(
        input_fn=train_input_fn, hooks=[logging_hook],
        max_steps=num_train_steps)

Step 2: Training

Create a train.sh to hold the training command

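# Path to the Chinese pre-trained model
export BERT_BASE_DIR=./chinese_L-12_H-768_A-12
# Path to the data
export DATA_DIR=.

# Comments cannot follow a trailing backslash, so the flags are explained here:
#   task_name:      key of the processor registered in main
#   do_train / do_eval / do_predict: whether to train, validate, and predict (on the test set)
#   max_seq_length: maximum sequence length, at most 512
#   output_dir:     output directory
python3 run_classifier.py \
  --task_name=my \
  --do_train=true \
  --do_eval=true \
  --do_predict=false \
  --data_dir=$DATA_DIR \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=512 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=15 \
  --output_dir=./mymodel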
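It helps to know how many steps these settings translate into. As far as I can tell, the script derives the step count from the number of examples, the batch size, and the number of epochs, and the warmup steps from a warmup proportion that defaults to 0.1. The sketch below reproduces that arithmetic; the dataset size of 1000 is purely hypothetical, and the function name `training_steps` is my own.

```python
def training_steps(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Return (total training steps, warmup steps) for the given settings."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# Hypothetical dataset of 1000 examples with the batch size and epochs above.
steps, warmup = training_steps(num_examples=1000, batch_size=4, num_epochs=15)
print(steps, warmup)  # 3750 375
```

With a batch size as small as 4, the step count grows quickly, so it is worth running this arithmetic before committing to 15 epochs.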

Step 3: Prediction (testing)

export BERT_BASE_DIR=./chinese_L-12_H-768_A-12
# Directory containing testtext.csv
export DATA_DIR=./mymodel
# TRAINED_CLASSIFIER is the output directory of the training run above. Do not
# append a specific checkpoint name, or the classification results will be wrong.
export TRAINED_CLASSIFIER=./mymodel

python3 run_classifier.py \
  --task_name=my \
  --do_predict=true \
  --data_dir=$DATA_DIR \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$TRAINED_CLASSIFIER \
  --max_seq_length=512 \
  --output_dir=./mymodel

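Actually, there is a simpler way: just change do_predict=false to true in the training command. The output directory will then contain a test_results.tsv file. Opening it shows the probability of each class, one column per class (so two columns for binary classification).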
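The probabilities can be turned back into labels by taking the largest column on each line; the columns follow the order of the label list returned by get_labels. Here is a small sketch under that assumption. The helper name `read_predictions` is mine, and the demo file contains made-up probabilities, not real model output.

```python
import os
import tempfile

LABELS = ['AD', 'CTRL']  # same order as MyProcessor.get_labels()


def read_predictions(path, labels):
    """Return (predicted label, probabilities) for each line of the file."""
    results = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            probs = [float(p) for p in line.split("\t")]
            best = max(range(len(probs)), key=lambda i: probs[i])
            results.append((labels[best], probs))
    return results


# Demo with made-up probabilities written to a temporary file.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "test_results.tsv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("0.9\t0.1\n0.2\t0.8\n")
    predictions = read_predictions(demo_path, LABELS)
print([label for label, _ in predictions])
```

The lines are in the same order as the rows of the test file, so the predictions can be matched back to the input by position.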

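Notes:

1. The test data file also needs a label column (any placeholder will do, as long as it is one of the labels in the label list)

2. Watch out for the TensorFlow version

3. The input is limited to 512 tokens (punctuation counts too; the usable length is actually a little less, because special tokens are added during tokenization)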
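The "a little less than 512" in the last note comes from the two marker tokens wrapped around a single sentence. The sketch below shows the idea under a simplifying assumption: one token per character, which roughly matches how Chinese text is split, while the real tokenizer treats Latin words, digits, and unknown characters differently. The function name is my own.

```python
def build_single_sentence_tokens(text, max_seq_length=512):
    """Truncate a single sentence and wrap it in [CLS] ... [SEP].

    Assumption: one token per character (a simplification of the
    real tokenizer, reasonable for plain Chinese text).
    """
    tokens = list(text)
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = tokens[: max_seq_length - 2]
    return ["[CLS]"] + tokens + ["[SEP]"]


tokens = build_single_sentence_tokens("字" * 600)
print(len(tokens))  # 512
```

So for single-sentence classification only the first 510 characters of a long text actually reach the model; anything beyond that is cut off.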
