Further Pre-training BERT on Your Own Data

Latest recommended article published on 2024-07-22 08:47:08

bayou3

Tags: BERT, pre-training

Original link: https://github.com/pren1/A_Pipeline_Of_Pretraining_Bert_On_Google_TPU

Although the pre-trained model is good enough for many tasks, you may still want to train the model released by Google on your own domain-specific corpus for several additional epochs. That is, give your BERT model a chance to become familiar with your jargon. You can then expect better performance in the end.
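Reposted from the English original: https://github.com/pren1/A_Pipeline_Of_Pretraining_Bert_On_Google_TPU
Prepare a .txt file.
This file is your own data, on which BERT will be trained. The purpose of this training is to make BERT familiar with the jargon (proper nouns and so on) in your data.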
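The BERT repository's README describes the expected input for create_pretraining_data.py as plain text with one sentence per line and a blank line between documents. The following is a minimal sketch of producing such a file, assuming your raw data is already available as a list of documents, each split into sentences. The function names `format_corpus` and `write_pretraining_corpus` are illustrative and do not come from the original article.

```python
from pathlib import Path
from typing import Iterable, List


def format_corpus(documents: Iterable[List[str]]) -> str:
    """Return text with one sentence per line and a blank line between documents."""
    blocks = []
    for sentences in documents:
        # Collapse internal whitespace (including newlines) and drop empty sentences.
        cleaned = [" ".join(s.split()) for s in sentences]
        cleaned = [s for s in cleaned if s]
        if cleaned:
            blocks.append("\n".join(cleaned))
    # Documents are separated by exactly one blank line.
    return "\n\n".join(blocks) + "\n"


def write_pretraining_corpus(documents: Iterable[List[str]], path: str) -> None:
    """Write the formatted corpus to a UTF-8 text file."""
    Path(path).write_text(format_corpus(documents), encoding="utf-8")
```

Calling `write_pretraining_corpus(docs, "sample_text.txt")` would then give a file that can be uploaded to your storage bucket and passed as `--input_file`.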

The following steps assume that you already have a Google Cloud account and know how to use its shell and storage.
Activate the shell (open Google Cloud Shell) and run:
ctpu up --name=yourname-tpu --tpu-size=v3-8 --preemptible
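(Here yourname is any name you choose.) With --preemptible the price is $2.4/hour; without it the price is $8/hour. The difference is whether Google may terminate your run at any time.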
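To get a feel for the difference, here is a small sketch that estimates the TPU bill from the two hourly rates quoted above. The rates are taken from this article only and current Google Cloud pricing may differ; the function name `estimate_cost` is illustrative.

```python
PREEMPTIBLE_USD_PER_HOUR = 2.4  # rate quoted in this article with --preemptible
ON_DEMAND_USD_PER_HOUR = 8.0    # rate quoted in this article without the flag


def estimate_cost(hours: float, preemptible: bool = True) -> float:
    """Estimated TPU cost in USD for a run of the given length."""
    rate = PREEMPTIBLE_USD_PER_HOUR if preemptible else ON_DEMAND_USD_PER_HOUR
    return round(hours * rate, 2)
```

For a 10-hour run this gives 24.0 USD with `--preemptible` and 80.0 USD without it, not counting any time lost if a preemptible TPU is terminated and the run has to resume from a checkpoint.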

At the next prompt, enter: y
When asked whether to set an SSH password, enter: y, then set your own password.

ctpu status shows the current status.

Next, get the BERT source code:
git clone https://github.com/google-research/bert.git
cd bert

Then run:
python create_pretraining_data.py \
  --input_file=gs://sample_bucket_test/sample_text.txt \
  --output_file=gs://sample_bucket_test/tmp/tf_examples.tfrecord \
  --vocab_file=gs://sample_bucket_test/multi_cased_L-12_H-768_A-12/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
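The values of `--masked_lm_prob` and `--max_predictions_per_seq` are related: the BERT README suggests setting `max_predictions_per_seq` to roughly `max_seq_length * masked_lm_prob` (here 128 × 0.15 ≈ 19.2, rounded up to 20). The sketch below reproduces, as far as I understand create_pretraining_data.py, how many positions are masked in a single sequence. The function name `num_masked_tokens` is illustrative.

```python
def num_masked_tokens(num_tokens: int,
                      masked_lm_prob: float = 0.15,
                      max_predictions_per_seq: int = 20) -> int:
    """Number of positions selected for masked-LM prediction in one sequence."""
    # At least one token is masked, and never more than the per-sequence cap.
    wanted = max(1, int(round(num_tokens * masked_lm_prob)))
    return min(max_predictions_per_seq, wanted)
```

With the defaults above, a full 128-token sequence yields 19 masked positions, so the cap of 20 is just large enough. `--dupe_factor=5` means the input is duplicated five times, each time with different random masks.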
The create_pretraining_data.py command above creates the pre-training data. Finally, it is time to train your own model. Note that tpu_name is set to the name you gave the TPU earlier.
Run:
python run_pretraining.py \
  --input_file=gs://sample_bucket_test/tmp/tf_examples.tfrecord \
  --output_dir=gs://sample_bucket_test/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=gs://sample_bucket_test/multi_cased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=gs://sample_bucket_test/multi_cased_L-12_H-768_A-12/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5 \
  --use_tpu=True \
  --tpu_name=yourname-tpu
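The `--num_train_steps`, `--num_warmup_steps`, `--learning_rate` and `--train_batch_size` flags together determine how much data the model sees and how the learning rate evolves. The sketch below assumes the schedule used by the BERT repository's optimizer, as I understand it: a linear warm-up followed by a linear decay to zero over the whole run. The function names are illustrative.

```python
def examples_seen(train_batch_size: int = 32, num_train_steps: int = 20) -> int:
    """Total number of training examples processed during the run."""
    return train_batch_size * num_train_steps


def learning_rate_at(step: int,
                     init_lr: float = 2e-5,
                     num_train_steps: int = 20,
                     num_warmup_steps: int = 10) -> float:
    """Learning rate at a given global step (linear warm-up, then linear decay)."""
    if num_warmup_steps and step < num_warmup_steps:
        # Warm-up: ramp linearly from 0 towards init_lr.
        return init_lr * step / num_warmup_steps
    # Decay: linear from init_lr at step 0 to 0 at num_train_steps.
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)
```

With the values in the command, the run covers only 32 × 20 = 640 examples, which is enough for a smoke test but not for real domain adaptation; for an actual run you would raise `--num_train_steps` (and usually `--num_warmup_steps`) substantially.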
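Replace the gs://sample_bucket_test paths in the commands above (shown in italics in the original) with directories in your own storage. Like the original author, I also recommend pointing vocab_file and bert_config_file at the paths that Google already hosts, which saves the time and space of uploading the files yourself, for example: gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/vocab.txt (whether to use the uncased or the multi_cased model depends on your own needs). Note that the BERT README recommends --do_lower_case=False when a cased model is used.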
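To avoid editing each path by hand, the training command can be assembled from your bucket name. This is a minimal sketch: the helper `build_pretraining_args` is hypothetical, the flag values are copied from the command in this article, and the assumption that the official directory also contains `bert_config.json` and `bert_model.ckpt` should be checked before use.

```python
from typing import List

# Official directory mentioned in this article for vocab.txt; that it also
# holds bert_config.json and bert_model.ckpt is an assumption of this sketch.
OFFICIAL_MODEL_DIR = "gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12"


def build_pretraining_args(bucket: str, tpu_name: str,
                           model_dir: str = OFFICIAL_MODEL_DIR) -> List[str]:
    """Build the run_pretraining.py command line for your own bucket."""
    bucket = bucket.rstrip("/")
    model_dir = model_dir.rstrip("/")
    return [
        "python", "run_pretraining.py",
        f"--input_file={bucket}/tmp/tf_examples.tfrecord",
        f"--output_dir={bucket}/tmp/pretraining_output",
        "--do_train=True",
        "--do_eval=True",
        f"--bert_config_file={model_dir}/bert_config.json",
        f"--init_checkpoint={model_dir}/bert_model.ckpt",
        "--train_batch_size=32",
        "--max_seq_length=128",
        "--max_predictions_per_seq=20",
        "--num_train_steps=20",
        "--num_warmup_steps=10",
        "--learning_rate=2e-5",
        "--use_tpu=True",
        f"--tpu_name={tpu_name}",
    ]
```

The returned list can be passed to `subprocess.run(args, check=True)` from inside the cloned bert directory, or joined with spaces and pasted into the shell.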