Official BERT code: pre-training, step by step

First, the source code: https://github.com/google-research/bert

1. First, prepare the data needed for pre-training

The corresponding script is create_pretraining_data.py:

python create_pretraining_data.py \
  --input_file=./sample_text.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
  • max_predictions_per_seq: the maximum number of masked LM predictions per sequence. It is recommended to set this to about max_seq_length * masked_lm_prob (the script will not set it automatically); see the quick check below.
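
A quick back-of-the-envelope check of that rule of thumb, using the values from the command above:

python -c "print(128 * 0.15)"   # 19.2, so max_predictions_per_seq=20 leaves a small margin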

Input text format: one sentence per line (this matters for next sentence prediction), with blank lines between documents. For example, from the sample_text.txt included with the source code:

This text is included to make sure Unicode is handled properly: 力加勝北区ᴵᴺᵀᵃছজটডণত
Text should be one-sentence-per-line, with empty lines between documents.
This sample text is public domain and was randomly selected from Project Guttenberg.

The rain had only ceased with the gray streaks of morning at Blazing Star, and the settlement awoke to a moral sense of cleanliness, and the finding of forgotten knives, tin cups, and smaller camp utensils, where the heavy showers had washed away the debris and dust heaps before the cabin doors.
Indeed, it was recorded in Blazing Star that a fortunate early riser had once picked up on the highway a solid chunk of gold quartz which the rain had freed from its incumbering soil, and washed into immediate and glittering popularity.

The output is a set of tf.train.Example protos serialized into TFRecord file format.

Note: this script holds the entire input file in memory, so for large files you may need to split the input into shards, run the script once per shard to get a set of tf_examples.tf_record* files, and then pass all of those files as input to the next script, run_pretraining.py.
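
A minimal sketch of that workflow (the file names corpus.txt, shard_*, and /tmp/tf_examples.tfrecord_* are placeholders; the split below uses GNU split and cuts at line boundaries, so ideally adjust the cut points so no document is split across shards):

split -d -n l/8 corpus.txt shard_        # 8 shards: shard_00 ... shard_07

for f in shard_*; do
  python create_pretraining_data.py \
    --input_file=$f \
    --output_file=/tmp/tf_examples.tfrecord_${f#shard_} \
    --vocab_file=$BERT_BASE_DIR/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --masked_lm_prob=0.15 \
    --random_seed=12345 \
    --dupe_factor=5
done

run_pretraining.py treats --input_file as a comma-separated list of file patterns and globs them, so the shards can later be passed as --input_file="/tmp/tf_examples.tfrecord_*" (quote the pattern so the shell does not expand it first).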

2. Next, run the pre-training

The corresponding script is run_pretraining.py:

python run_pretraining.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
  • If you are pre-training from scratch, remove the init_checkpoint line (see the from-scratch sketch after this list).
  • The model configuration, including the vocab size, is specified in bert_config_file.
  • In practice, num_train_steps generally needs to be 10,000 or more.
  • max_seq_length and max_predictions_per_seq must be the same values that were passed to create_pretraining_data.py.
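
For pre-training from scratch on the sharded records, the call might look roughly like this (the step counts and learning rate are only illustrative; note there is no --init_checkpoint, and the sequence-length flags match step 1):

python run_pretraining.py \
  --input_file="/tmp/tf_examples.tfrecord_*" \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --learning_rate=1e-4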

3. Multi-machine, multi-GPU training

If you want to run multi-node, multi-GPU training, you can refer to the multi-node, multi-GPU acceleration code that NVIDIA has open-sourced: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT

When training on multiple GPUs, the data must be sharded in step 1, and the number of shards must be no smaller than the number of GPUs you plan to use. The NVIDIA version of run_pretraining.py checks for this:

  if FLAGS.horovod and len(input_files) < hvd.size():
    raise ValueError("Input Files must be sharded")
  if FLAGS.amp and FLAGS.manual_fp16:
    raise ValueError("AMP and Manual Mixed Precision Training are both activated! Error")

hvd.size() is the number of GPU workers; the number of data shards must be at least the number of GPUs, otherwise this check raises an error.

After sharding, run create_pretraining_data.py once on each shard to get the corresponding tf_examples.tfrecord_X files, where X is whatever index you assign to that shard.
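
With the shards in place, a single-node, 8-GPU Horovod launch can look roughly like this (only --horovod and --amp are taken from the snippet above; check the NVIDIA repo's README and scripts for the full flag set and the exact mpirun options it uses):

mpirun -np 8 --allow-run-as-root \
  -bind-to none -map-by slot \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  python run_pretraining.py \
    --input_file="/tmp/tf_examples.tfrecord_*" \
    --output_dir=/tmp/pretraining_output \
    --do_train=True \
    --bert_config_file=$BERT_BASE_DIR/bert_config.json \
    --train_batch_size=32 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --num_train_steps=100000 \
    --num_warmup_steps=10000 \
    --learning_rate=1e-4 \
    --horovod \
    --amp

With 8 shards and 8 GPUs, len(input_files) equals hvd.size(), so the sharding check above passes.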
