LLM Must-Read Series 01-2: Reproducing the BERT Model (TensorFlow Version)

YaoAIPro

Article link: https://blog.csdn.net/qq_46883219/article/details/142378548

Author: 猥琐发育的

WeChat official account: DarkMythAI

Code: https://github.com/YaoAIPro/bert-reproduction

Preface

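In the previous article I summarized the BERT paper; in this one I reproduce the BERT model. The BERT source code released by the Google team targets TensorFlow 1.x, but I wanted to stay closer to the current stack, so I chose TensorFlow 2.10.0 for the reproduction. Because the API changed substantially in TensorFlow 2.x, getting the official source to run requires a large number of modifications, which was a fairly painful process.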

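Before installing TensorFlow-GPU, it is worth understanding how its version maps to the Python, CUDA, and cuDNN versions. On native Windows, the last TensorFlow release with GPU support is 2.10.0, and the highest Python version it supports is 3.10; the CUDA and cuDNN versions it requires are within what my system supports. Below are my reproduction environment and the TensorFlow-GPU version compatibility table.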

System: Windows 11 Pro
GPU: NVIDIA GeForce RTX 4060 Ti 16G
CUDA: 12.6
Python: 3.10.12
TensorFlow-GPU: 2.10.0
cudatoolkit: 11.2.2
cudnn: 8.1.0.77

[Figure: TensorFlow-GPU version compatibility table]
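
The version constraints described above can be captured in a small check. This is a minimal sketch and not part of the original repository: the `TF_2_10_REQUIREMENTS` table and the `check_python` helper are hypothetical names, and the values encode the published build configuration for TensorFlow 2.10 as I understand it (Python 3.7 to 3.10, CUDA 11.2, cuDNN 8.1).

```python
import sys

# Assumed requirements for TensorFlow 2.10 with GPU support (illustrative table).
TF_2_10_REQUIREMENTS = {
    "python_min": (3, 7),
    "python_max": (3, 10),
    "cuda": "11.2",
    "cudnn": "8.1",
}


def check_python(version_info=None, req=TF_2_10_REQUIREMENTS):
    """Return True if the (major, minor) Python version is inside the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = (version_info[0], version_info[1])
    return req["python_min"] <= major_minor <= req["python_max"]


if __name__ == "__main__":
    print("Python OK for TF 2.10:", check_python())
```

Running it inside the freshly created environment gives an early warning if the interpreter is newer than the TensorFlow build expects.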

Creating the environment

First, create the Anaconda virtual environment:

conda create -n bert-tf python==3.10.12
conda activate bert-tf

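Next, install the CUDA and cuDNN versions that tensorflow-gpu 2.10.0 supports. Both can be installed directly with conda, which is quick and convenient. I have a proxy, so I download from the default channel; without one, switching to a mirror is faster.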

conda install conda-forge::cudatoolkit==11.2.2
conda install conda-forge::cudnn==8.1.0.77

# List the installed cuda and cudnn packages
conda list cuda
conda list cudnn

# Remove cuda and cudnn
conda remove cudatoolkit
conda remove cudnn

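Then install tensorflow-gpu and six. Note that if you need to switch to a different tensorflow-gpu version, you must uninstall every installed TensorFlow package before installing the new version; otherwise the model will not run on the GPU. This is a pitfall I fell into myself. The code has already been modified and can be downloaded directly from the code repository.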

pip install tensorflow-gpu==2.10.0
pip install six==1.15.0
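
To spot leftover TensorFlow packages before switching versions, it can help to list everything installed whose name starts with `tensorflow`. The sketch below uses only the standard library; `tensorflow_distributions` is a hypothetical helper name, and it only reports what is installed without removing anything.

```python
from importlib import metadata


def tensorflow_distributions():
    """Return sorted (name, version) pairs for installed packages named tensorflow*."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Broken installs may have no name; skip them.
        if name and name.lower().startswith("tensorflow"):
            found.add((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    for name, version in tensorflow_distributions():
        print(f"{name}=={version}")
```

Helper packages such as `tensorflow-estimator` will also show up in the list. If the main packages report different versions, that may be a sign of the mixed install described above.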

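Downloading the model

I chose BERT-Base, Multilingual Cased (New, recommended); clicking the model name starts the download. In my fine-tuning experiments the base model used about 14 GB of GPU memory, so a card with at least 16 GB is needed to fine-tune BERT. I did not try the Large models, since the memory would most likely be insufficient. The models can be downloaded from https://github.com/google-research/bert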

BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Multilingual Uncased (Orig, not recommended) (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
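
The parameter counts in the list above can be roughly reproduced from the layer sizes. The following is a back-of-the-envelope sketch, not code from the repository: `bert_encoder_params` is a hypothetical helper, and the vocabulary size (30522, as in the English uncased vocabulary), the 512 positions, the 2 segment types, and the feed-forward width of 4x hidden are assumptions normally read from `bert_config.json`. It counts the embeddings, the encoder layers, and the pooler, and ignores the pre-training heads.

```python
def bert_encoder_params(num_layers, hidden, intermediate=None,
                        vocab_size=30522, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder (embeddings + layers + pooler)."""
    if intermediate is None:
        intermediate = 4 * hidden  # assumed feed-forward width
    layer_norm = 2 * hidden  # gamma and beta
    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (the head count does not change the total).
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    print(f"Base:  {bert_encoder_params(12, 768):,}")    # 109,482,240
    print(f"Large: {bert_encoder_params(24, 1024):,}")   # 335,141,888
```

Under these assumptions the totals land close to the 110M and 340M quoted above. The multilingual checkpoints ship a larger vocabulary, so their embedding table, and therefore their total, is larger than this estimate.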

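Downloading the datasets

The official repository provides fine-tuning code for only two kinds of datasets: GLUE and SQuAD. GLUE can be downloaded locally by running download_glue_data.py, whereas SQuAD has to be downloaded manually. SQuAD results are also evaluated with a separate script.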

GLUE

python download_glue_data.py

SQuAD

SQuAD 1.1	SQuAD 2.0
train-v1.1.json	train-v2.0.json
dev-v1.1.json	dev-v2.0.json
evaluate-v1.1.py	evaluate-v2.0.py
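
After downloading the JSON files, a quick sanity check is to walk the nested structure and count the entries. This is a minimal sketch: `count_squad` is a hypothetical helper, and `TOY_DATASET` is a made-up two-question sample that only mimics the SQuAD 1.1 layout (`data` -> `paragraphs` -> `qas` -> `answers`).

```python
import json

# Toy sample in the SQuAD 1.1 layout; the content is invented for illustration.
TOY_DATASET = {
    "version": "1.1",
    "data": [{
        "title": "BERT",
        "paragraphs": [{
            "context": "BERT was released by Google in 2018.",
            "qas": [
                {"id": "q1", "question": "Who released BERT?",
                 "answers": [{"text": "Google", "answer_start": 21}]},
                {"id": "q2", "question": "When was BERT released?",
                 "answers": [{"text": "2018", "answer_start": 31}]},
            ],
        }],
    }],
}


def count_squad(dataset):
    """Return (articles, paragraphs, questions) for a SQuAD-style dict."""
    articles = dataset["data"]
    paragraphs = sum(len(a["paragraphs"]) for a in articles)
    questions = sum(len(p["qas"]) for a in articles for p in a["paragraphs"])
    return len(articles), paragraphs, questions


def load_squad(path):
    """Load a SQuAD JSON file such as train-v1.1.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    print(count_squad(TOY_DATASET))  # (1, 1, 2)
```

For the real data, `count_squad(load_squad("train-v1.1.json"))` would report the corresponding totals, assuming the file sits in the current directory.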

Fine-tuning the model

Fine-tuning on GLUE, using CoLA as an example

python ../run_classifier.py ^
  --task_name="%Dataset%" ^
  --do_train=true ^
  --do_eval=true ^
  --data_dir="%GLUE_DIR%\%Dataset%" ^
  --vocab_file="%BERT_BASE_DIR%\vocab.txt" ^
  --bert_config_file="%BERT_BASE_DIR%\bert_config.json" ^
  --init_checkpoint="%BERT_BASE_DIR%\bert_model.ckpt" ^
  --max_seq_length=128 ^
  --train_batch_size=32 ^
  --learning_rate=2e-5 ^
  --num_train_epochs=1.0 ^
  --output_dir="E:\Learning\Algorithm-Reproduction\bert\output\%Dataset%"
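
The flags `--train_batch_size` and `--num_train_epochs` determine how many optimizer steps the run takes. The sketch below mirrors the arithmetic used in the official `run_classifier.py` as I understand it; `schedule_steps` is a hypothetical helper, the 10% warmup proportion is the official default, and the CoLA training-set size of 8551 is an assumption. The modified scripts in the repository may compute this differently.

```python
def schedule_steps(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Return (num_train_steps, num_warmup_steps) for a fine-tuning run."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


if __name__ == "__main__":
    # Assumed CoLA training size; batch size and epochs are taken from the command above.
    steps, warmup = schedule_steps(num_examples=8551, batch_size=32, num_epochs=1.0)
    print(steps, warmup)  # 267 26
```

With a single epoch the run is short, which makes it handy as a smoke test before committing to longer training.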

Fine-tuning on SQuAD 1.1

python ../run_squad.py ^
  --vocab_file=%BERT_BASE_DIR%\vocab.txt ^
  --bert_config_file=%BERT_BASE_DIR%\bert_config.json ^
  --init_checkpoint=%BERT_BASE_DIR%\bert_model.ckpt ^
  --do_train=True ^
  --train_file=%SQUAD_DIR%\train-v1.1.json ^
  --do_predict=True ^
  --predict_file=%SQUAD_DIR%\dev-v1.1.json ^
  --train_batch_size=12 ^
  --learning_rate=3e-5 ^
  --num_train_epochs=2.0 ^
  --max_seq_length=384 ^
  --doc_stride=128 ^
  --output_dir=E:\Learning\Algorithm-Reproduction\bert\output\%Dataset%
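
The `--max_seq_length` and `--doc_stride` flags control how a long passage is cut into overlapping windows. The following sketch reproduces the sliding-window logic of the official `run_squad.py` as I recall it, working on token counts only; `DocSpan` and `split_into_spans` are illustrative names, and the three reserved positions stand for `[CLS]` and the two `[SEP]` tokens.

```python
from collections import namedtuple

DocSpan = namedtuple("DocSpan", ["start", "length"])


def split_into_spans(num_doc_tokens, num_query_tokens,
                     max_seq_length=384, doc_stride=128):
    """Split a document into overlapping spans that fit next to the query."""
    # Reserve room for the query plus [CLS], [SEP], [SEP].
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0:
        raise ValueError("query is too long for the chosen max_seq_length")
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append(DocSpan(start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the document
        start += min(length, doc_stride)
    return spans


if __name__ == "__main__":
    # A 500-token passage with a 10-token question yields three windows.
    for span in split_into_spans(500, 10):
        print(span)
```

Each window becomes its own training or prediction example, so a smaller stride means more overlap and more examples per passage.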

Other scripts

Creating the pre-training dataset

python ../create_pretraining_data.py ^
  --input_file=%DATA_DIR%\origin_data\%Dataset%.txt ^
  --output_file=%DATA_DIR%\pretraining_data\%Dataset%.tfrecord ^
  --vocab_file=%BERT_BASE_DIR%\vocab.txt ^
  --do_lower_case=True ^
  --max_seq_length=128 ^
  --max_predictions_per_seq=20 ^
  --masked_lm_prob=0.15 ^
  --random_seed=12345 ^
  --dupe_factor=5
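
The flags `--masked_lm_prob` and `--max_predictions_per_seq` work together: the first sets the fraction of tokens to predict, the second caps the count. Below is a simplified sketch of that masking step, assuming the 80/10/10 replacement rule from the BERT paper; `num_to_predict` and `mask_tokens` are illustrative names, the tiny vocabulary is made up, and whole-word masking is not handled.

```python
import random


def num_to_predict(num_tokens, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Number of positions to mask: a fraction of the tokens, at least 1, capped."""
    return min(max_predictions_per_seq,
               max(1, int(round(num_tokens * masked_lm_prob))))


def mask_tokens(tokens, vocab, rng, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    # Special tokens are never masked.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    n = num_to_predict(len(tokens), masked_lm_prob, max_predictions_per_seq)
    output, labels = list(tokens), {}
    for i in sorted(candidates[:n]):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            output[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll >= 0.9:
            output[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return output, labels


if __name__ == "__main__":
    toy_tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
    masked, labels = mask_tokens(toy_tokens, ["dog", "ran", "hat"], random.Random(12345))
    print(masked, labels)
```

For a full 128-token sequence, `num_to_predict(128)` gives 19, which is why a cap of 20 pairs naturally with `--max_seq_length=128` and `--masked_lm_prob=0.15`.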

Extracting text features

python ../extract_features.py ^
  --input_file=%BASE_DIR%\dataset\input.txt ^
  --output_file=%BASE_DIR%\output\output.jsonl ^
  --vocab_file=%BERT_BASE_DIR%\vocab.txt ^
  --bert_config_file=%BERT_BASE_DIR%\bert_config.json ^
  --init_checkpoint=%BERT_BASE_DIR%\bert_model.ckpt ^
  --layers=-1,-2,-3,-4 ^
  --max_seq_length=128 ^
  --batch_size=8
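
The resulting `output.jsonl` holds one JSON object per input line. The sketch below shows one way to pull per-token vectors out of such a line. The field names (`linex_index`, `features`, `token`, `layers`, `index`, `values`) follow my recollection of the official `extract_features.py`, including the `linex_index` spelling, and should be checked against the actual file; `SAMPLE_LINE` is a made-up record with two-dimensional vectors and `token_vectors` is a hypothetical helper.

```python
import json

# Made-up record in the assumed output layout (real vectors have hidden-size entries).
SAMPLE_LINE = json.dumps({
    "linex_index": 0,
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2]},
                                      {"index": -2, "values": [0.3, 0.4]}]},
        {"token": "hello", "layers": [{"index": -1, "values": [0.5, 0.6]},
                                      {"index": -2, "values": [0.7, 0.8]}]},
    ],
})


def token_vectors(json_line, layer_index=-1):
    """Return a list of (token, vector) pairs for the requested layer."""
    record = json.loads(json_line)
    pairs = []
    for feature in record["features"]:
        for layer in feature["layers"]:
            if layer["index"] == layer_index:
                pairs.append((feature["token"], layer["values"]))
    return pairs


if __name__ == "__main__":
    for token, vector in token_vectors(SAMPLE_LINE, layer_index=-1):
        print(token, vector)
```

With `--layers=-1,-2,-3,-4` each token carries four such entries, so downstream code can pick a single layer or combine several.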

Pre-training the model

python ../run_pretraining.py ^
  --input_file=%BASE_DIR%\dataset\pretraining_data\%Dataset%.tfrecord ^
  --output_dir=%BASE_DIR%\output\pretraining_%Dataset% ^
  --do_train=True ^
  --do_eval=True ^
  --bert_config_file=%BERT_BASE_DIR%\bert_config.json ^
  --init_checkpoint=%BERT_BASE_DIR%\bert_model.ckpt ^
  --train_batch_size=32 ^
  --max_seq_length=128 ^
  --max_predictions_per_seq=20 ^
  --num_train_steps=20 ^
  --num_warmup_steps=10 ^
  --learning_rate=2e-5
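
The flags `--num_train_steps`, `--num_warmup_steps`, and `--learning_rate` define the learning-rate curve. The sketch below follows the schedule in the official `optimization.py` as I understand it (linear warmup from zero, then linear decay to zero over the total step count); `bert_learning_rate` is a hypothetical helper, and the modified TensorFlow 2 code in the repository may differ.

```python
import math


def bert_learning_rate(step, init_lr=2e-5, num_train_steps=20, num_warmup_steps=10):
    """Learning rate at a given global step: linear warmup, then linear decay."""
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    step = min(step, num_train_steps)
    # The decay is measured from step 0, not from the end of warmup.
    return init_lr * (1.0 - step / num_train_steps)


if __name__ == "__main__":
    for step in (0, 5, 10, 15, 20):
        print(step, bert_learning_rate(step))
    assert math.isclose(bert_learning_rate(5), 1e-5)
```

One consequence under this formula is that, with 10 warmup steps out of 20, the rate never reaches the nominal 2e-5: it climbs to just under 2e-5 at step 9 and drops to 1e-5 at step 10. A 20-step run like this is only a smoke test anyway.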