NLP之ALBERT:ALBERT的简介、安装和使用方法、案例应用之详细攻略
目录
《ALBERT: A Lite BERT for Self-supervised Learning of Language Representations》翻译与解读
3 THE ELEMENTS OF ALBERT—ALBERT的要素
3.1 MODEL ARCHITECTURE CHOICES—模型架构选择
5 CONCLUSION—结论
ALBERT的简介
ALBERT在GLUE基准开发集(dev)上使用单模型设置的性能结果
ALBERT-xxlarge在SQuAD和RACE基准测试上使用单模型设置的性能
ALBERT的安装和使用方法
ALBERT的案例应用
《ALBERT: A Lite BERT for Self-supervised Learning of Language Representations》翻译与解读
地址 | 论文:https://arxiv.org/abs/1909.11942;GitHub:https://github.com/google-research/ALBERT |
时间 | 2019年9月26日 |
作者 | Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut(Google Research;Toyota Technological Institute at Chicago) |
总结 | 此论文提出了一种基于BERT的轻量级自监督语言表示模型ALBERT,用于解决BERT模型规模过大带来的问题。 背景: >> BERT等预训练模型在下游任务上的表现通常与模型规模正相关,模型越大效果越好。 >> 但继续增大模型规模会受到GPU/TPU内存的限制,训练时间也会大幅增加。 解决方案: >> ALBERT提出两种参数削减技术:分解嵌入参数化和跨层参数共享。 >> 分解嵌入参数化将词表嵌入大小与隐藏层大小解耦,避免嵌入矩阵随隐藏层增大而膨胀。 >> 跨层参数共享使参数量不随网络深度线性增长。 >> 这两种技术大幅减少了参数数量,同时保持甚至提升下游任务表现。 核心特点: >> 模型规模小:与BERT-large相当的配置参数量约为其1/18,训练速度约快1.7倍。 >> 提出基于句子顺序预测(SOP)的自监督损失,弥补BERT中NSP损失的不足。 >> 在GLUE、SQuAD、RACE等下游任务上效果优于BERT,取得新的SOTA。 优势: >> 参数高效、易于扩展,参数量少于BERT-large但效果更好。 >> SOP损失提升了多句输入下游任务的表现。 >> 在多个下游任务上以较大幅度刷新了SOTA。 |
Abstract
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT. | 在预训练自然语言表示时,增加模型大小通常会导致在下游任务中表现提升。然而,在某一点上,由于GPU/TPU内存限制和更长的训练时间,进一步增加模型变得更加困难。为解决这些问题,我们提出了两种参数减少技术,以降低内存消耗并提高BERT (Devlin等人,2019)的训练速度。全面的经验证据表明,我们提出的方法相对于原始的BERT而言,能够更好地扩展模型。我们还使用了一种自监督损失,专注于建模句间连贯性,并展示它在具有多句输入的下游任务中始终提供帮助。因此,我们的最佳模型在GLUE、RACE和SQuAD基准测试中建立了新的最先进结果,同时与BERT-large相比具有更少的参数。代码和预训练模型可在https://github.com/google-research/ALBERT找到。 |
3 THE ELEMENTS OF ALBERT—ALBERT的要素
In this section, we present the design decisions for ALBERT and provide quantified comparisons against corresponding configurations of the original BERT architecture (Devlin et al., 2019). | 在本节中,我们介绍了ALBERT的设计决策,并与原始BERT架构的相应配置进行了量化比较(Devlin等,2019)。 |
3.1 MODEL ARCHITECTURE CHOICES模型架构选择
The backbone of the ALBERT architecture is similar to BERT in that it uses a transformer encoder (Vaswani et al., 2017) with GELU nonlinearities (Hendrycks & Gimpel, 2016). We follow the BERT notation conventions and denote the vocabulary embedding size as E, the number of encoder layers as L, and the hidden size as H. Following Devlin et al. (2019), we set the feed-forward/filter size to be 4H and the number of attention heads to be H/64. There are three main contributions that ALBERT makes over the design choices of BERT. | ALBERT架构的骨干与BERT相似,它使用了一个transformer编码器(Vaswani等,2017),其中包含GELU非线性激活函数(Hendrycks&Gimpel,2016)。我们遵循BERT的符号约定,将词汇嵌入大小表示为E,编码器层数表示为L,隐藏大小表示为H。根据Devlin等人(2019)的设定,我们将前馈/过滤器大小设置为4H,注意力头的数量设置为H/64。 ALBERT在设计选择上有三个主要贡献。
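作为补充说明,下面用一小段Python示意上述符号之间的数量关系(仅为示意,H=768是假设的示例取值,对应base规模常用的隐藏层大小):前馈/过滤器大小为4H,注意力头数为H/64:
# 由隐藏层大小H推导前馈层大小与注意力头数(仅为示意,H=768为示例取值)
H = 768                     # 隐藏层大小(hidden size)
ffn_size = 4 * H            # 前馈/过滤器大小 = 4H
num_heads = H // 64         # 注意力头数 = H/64
print(ffn_size, num_heads)  # 输出:3072 12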
分解嵌入参数化
Factorized embedding parameterization. In BERT, as well as subsequent modeling improvements such as XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), the WordPiece embedding size E is tied with the hidden layer size H, i.e., E ≡ H. This decision appears suboptimal for both modeling and practical reasons, as follows. From a modeling perspective, WordPiece embeddings are meant to learn context-independent representations, whereas hidden-layer embeddings are meant to learn context-dependent representations. As experiments with context length indicate (Liu et al., 2019), the power of BERT-like representations comes from the use of context to provide the signal for learning such context-dependent representations. As such, untying the WordPiece embedding size E from the hidden layer size H allows us to make a more efficient usage of the total model parameters as informed by modeling needs, which dictate that H ≫ E. | 分解嵌入参数化。在BERT以及后续的模型改进(如XLNet(Yang等人,2019)和RoBERTa(Liu等人,2019))中,WordPiece嵌入大小E与隐藏层大小H是绑定在一起的,即E ≡ H。无论从建模还是实际角度来看,这一设定似乎都不是最优的,原因如下。 从建模的角度来看,WordPiece嵌入旨在学习与上下文无关的表示,而隐藏层嵌入旨在学习与上下文相关的表示。正如关于上下文长度的实验所表明的那样(Liu et al., 2019),类似BERT的表示之所以强大,在于利用上下文提供学习这种上下文相关表示的信号。因此,将WordPiece嵌入大小E与隐藏层大小H解耦,可以让我们根据建模需求更高效地利用模型的总参数量,而建模需求决定了应有H ≫ E。
From a practical perspective, natural language processing usually requires the vocabulary size V to be large. If E ≡ H, then increasing H increases the size of the embedding matrix, which has size V × E. This can easily result in a model with billions of parameters, most of which are only updated sparsely during training. Therefore, for ALBERT we use a factorization of the embedding parameters, decomposing them into two smaller matrices. Instead of projecting the one-hot vectors directly into the hidden space of size H, we first project them into a lower dimensional embedding space of size E, and then project it to the hidden space. By using this decomposition, we reduce the embedding parameters from O(V × H) to O(V × E + E × H). This parameter reduction is significant when H ≫ E. We choose to use the same E for all word pieces because they are much more evenly distributed across documents compared to whole-word embedding, where having different embedding size (Grave et al. (2017); Baevski & Auli (2018); Dai et al. (2019)) for different words is important. | 从实际的角度来看,自然语言处理通常要求词表大小V很大。如果E ≡ H,那么增大H就会增大嵌入矩阵的规模(其大小为V × E),这很容易导致模型拥有数十亿参数,而其中大多数参数在训练过程中只会被稀疏地更新。 因此,对于ALBERT,我们对嵌入参数进行因式分解,将其分解为两个较小的矩阵:不再把one-hot向量直接投影到大小为H的隐藏空间,而是先把它们投影到大小为E的低维嵌入空间,再投影到隐藏空间。通过这种分解,嵌入参数量从O(V × H)降为O(V × E + E × H)。当H ≫ E时,这种参数削减非常显著。我们选择对所有词片段使用相同的E,因为与整词嵌入相比,词片段在文档中的分布要均匀得多;而对整词嵌入来说,为不同的单词设置不同的嵌入大小(Grave等人(2017);Baevski & Auli(2018);Dai等人(2019))才比较重要。
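下面给出一个示意性的最小代码(假设性实现,并非论文官方代码),用NumPy对比分解前后的嵌入参数量,并演示"one-hot → E维嵌入 → H维隐藏空间"的两步投影;其中V、H、E的取值仅为示例:
import numpy as np

V, H, E = 30000, 4096, 128  # 词表大小、隐藏层大小、嵌入大小(示例取值,满足 H ≫ E)

# 未分解:嵌入矩阵直接映射到H维隐藏空间,参数量为 V*H
params_tied = V * H
# 分解后:先映射到E维嵌入空间,再投影到H维隐藏空间,参数量为 V*E + E*H
params_factorized = V * E + E * H
print(params_tied, params_factorized)   # 122880000 与 4364288
print(params_tied / params_factorized)  # 嵌入参数量约缩减28倍

# 两步投影的前向示意:token_ids -> E维查表 -> 线性映射到H维
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(V, E)).astype(np.float32)  # V x E 嵌入矩阵
proj = rng.normal(size=(E, H)).astype(np.float32)      # E x H 投影矩阵
token_ids = np.array([101, 2054, 2003, 102])           # 假设的token id序列
hidden_in = word_emb[token_ids] @ proj                 # 形状为 (4, H)
print(hidden_in.shape)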
跨层参数共享
Cross-layer parameter sharing. For ALBERT, we propose cross-layer parameter sharing as another way to improve parameter efficiency. There are multiple ways to share parameters, e.g., only sharing feed-forward network (FFN) parameters across layers, or only sharing attention parameters. The default decision for ALBERT is to share all parameters across layers. All our experiments use this default decision unless otherwise specified. We compare this design decision against other strategies in our experiments in Sec. 4.5. Similar strategies have been explored by Dehghani et al. (2018) (Universal Transformer, UT) and Bai et al. (2019) (Deep Equilibrium Models, DQE) for Transformer networks. Different from our observations, Dehghani et al. (2018) show that UT outperforms a vanilla Transformer. Bai et al. (2019) show that their DQEs reach an equilibrium point for which the input and output embedding of a certain layer stay the same. Our measurement on the L2 distances and cosine similarity show that our embeddings are oscillating rather than converging. | 跨层参数共享。对于ALBERT,我们提出跨层参数共享作为提高参数效率的另一种方法。参数共享的方式有多种,例如只在层间共享前馈网络(FFN)参数,或只共享注意力参数。ALBERT的默认做法是跨层共享所有参数。除非另有说明,我们所有的实验都采用这一默认设置。我们在第4.5节的实验中将这一设计决策与其他策略进行了比较。 类似的策略已经由Dehghani等人(2018)(Universal Transformer,UT)和Bai等人(2019)(Deep Equilibrium Models,DQE)在Transformer网络中探索过。与我们的观察不同,Dehghani等人(2018)表明UT优于普通的Transformer;Bai等人(2019)表明他们的DQE会达到一个平衡点,在该点上某一层的输入嵌入和输出嵌入保持一致。而我们对L2距离和余弦相似度的测量表明,我们的嵌入是在振荡而不是收敛。
Figure 1 shows the L2 distances and cosine similarity of the input and output embeddings for each layer, using BERT-large and ALBERT-large configurations (see Table 1). We observe that the transitions from layer to layer are much smoother for ALBERT than for BERT. These results show that weight-sharing has an effect on stabilizing network parameters. Although there is a drop for both metrics compared to BERT, they nevertheless do not converge to 0 even after 24 layers. This shows that the solution space for ALBERT parameters is very different from the one found by DQE. | 图1显示了使用BERT-large和ALBERT-large配置的每层输入和输出嵌入的L2距离和余弦相似度(见表1)。我们观察到,ALBERT从一层到另一层的转换比BERT平滑得多。结果表明,权值共享对网络参数的稳定有一定的作用。尽管与BERT相比,这两个指标都有所下降,但即使在24层之后,它们也不会收敛到0。这表明ALBERT参数的解空间与DQE找到的解空间非常不同。
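下面是一个示意性的最小实现(假设性代码,与官方TensorFlow实现无关),用NumPy演示"跨层参数共享"的核心思想:只创建一层的参数,并把同一组参数连续应用L次,因此参数量不随层数增长。这里仅示意FFN子层,省略了注意力和LayerNorm:
import numpy as np

H, L = 768, 12  # 隐藏层大小与层数(示例取值)
rng = np.random.default_rng(0)

# 只创建"一层"的参数;跨层共享意味着所有L层复用同一组参数
shared_params = {
    "w1": rng.normal(scale=0.02, size=(H, 4 * H)).astype(np.float32),
    "b1": np.zeros(4 * H, dtype=np.float32),
    "w2": rng.normal(scale=0.02, size=(4 * H, H)).astype(np.float32),
    "b2": np.zeros(H, dtype=np.float32),
}

def ffn_layer(x, p):
    """一个简化的前馈子层(带残差连接,省略LayerNorm):GELU(xW1 + b1)W2 + b2 + x。"""
    h = x @ p["w1"] + p["b1"]
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h ** 3)))  # GELU近似
    return x + h @ p["w2"] + p["b2"]

x = rng.normal(size=(4, H)).astype(np.float32)  # 假设的4个token的隐藏表示
for _ in range(L):                              # 同一组参数被连续应用L次
    x = ffn_layer(x, shared_params)

n_params = sum(v.size for v in shared_params.values())
print(x.shape, n_params)  # 参数量与L无关;若不共享则还需乘以层数L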
句间一致性损失
Inter-sentence coherence loss. In addition to the masked language modeling (MLM) loss (Devlin et al., 2019), BERT uses an additional loss called next-sentence prediction (NSP). NSP is a binary classification loss for predicting whether two segments appear consecutively in the original text, as follows: positive examples are created by taking consecutive segments from the training corpus; negative examples are created by pairing segments from different documents; positive and negative examples are sampled with equal probability. The NSP objective was designed to improve performance on downstream tasks, such as natural language inference, that require reasoning about the relationship between sentence pairs. However, subsequent studies (Yang et al., 2019; Liu et al., 2019) found NSP’s impact unreliable and decided to eliminate it, a decision supported by an improvement in downstream task performance across several tasks. We conjecture that the main reason behind NSP’s ineffectiveness is its lack of difficulty as a task, as compared to MLM. As formulated, NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn compared to coherence prediction, and also overlaps more with what is learned using the MLM loss. | 句间连贯性损失。除了掩码语言建模(MLM)损失(Devlin等人,2019)之外,BERT还使用了一种称为下一句预测(NSP)的额外损失。NSP是一种二元分类损失,用于预测两个片段是否在原文中连续出现,具体如下:正例通过从训练语料库中取连续片段构造;负例由来自不同文档的片段配对而成;正例和负例以相同概率采样。NSP目标旨在提升需要推理句子对之间关系的下游任务(例如自然语言推理)的性能。然而,随后的研究(Yang et al., 2019;Liu et al., 2019)发现NSP的作用并不可靠,因而决定去掉它,而去掉后多个下游任务的性能反而有所提升,支持了这一决定。 我们推测,NSP效果不佳的主要原因在于,与MLM相比,它作为一项任务难度太低。按其构造方式,NSP把主题预测和连贯性预测混合在同一个任务中。然而,与连贯性预测相比,主题预测更容易学习,并且与MLM损失所学到的内容重叠更多。
We maintain that inter-sentence modeling is an important aspect of language understanding, but we propose a loss based primarily on coherence. That is, for ALBERT, we use a sentence-order prediction (SOP) loss, which avoids topic prediction and instead focuses on modeling inter-sentence coherence. The SOP loss uses as positive examples the same technique as BERT (two consecutive segments from the same document), and as negative examples the same two consecutive segments but with their order swapped. This forces the model to learn finer-grained distinctions about discourse-level coherence properties. As we show in Sec. 4.6, it turns out that NSP cannot solve the SOP task at all (i.e., it ends up learning the easier topic-prediction signal, and performs at random-baseline level on the SOP task), while SOP can solve the NSP task to a reasonable degree, presumably based on analyzing misaligned coherence cues. As a result, ALBERT models consistently improve downstream task performance for multi-sentence encoding tasks. | 我们认为句间建模是语言理解的一个重要方面,因此提出了一种主要基于连贯性的损失。也就是说,对于ALBERT,我们使用句子顺序预测(SOP)损失,它避开了主题预测,转而专注于句间连贯性的建模。SOP损失的正例采用与BERT相同的方式构造(取同一文档中的两个连续片段),而负例则是同样的两个连续片段,但顺序被交换。这迫使模型学习关于话语层面连贯性属性的更细粒度区分。正如我们在第4.6节所展示的,结果表明NSP根本无法解决SOP任务(即它最终学到的是更容易的主题预测信号,在SOP任务上的表现只有随机基线水平),而SOP却能在相当程度上解决NSP任务,推测是通过分析错位的连贯性线索做到的。因此,ALBERT模型在多句编码的下游任务上持续提升了性能。
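下面用一个假设性的小示例说明SOP训练样本的构造方式(仅为示意,并非官方数据流水线):正例取同一文档中相邻的两个片段并保持原顺序,负例则交换这两个片段的顺序:
import random

def make_sop_examples(doc_segments, rng=random.Random(0)):
    """从同一文档的相邻片段构造SOP样本:
    label=1 表示顺序正确(正例),label=0 表示两段被交换(负例)。"""
    examples = []
    for seg_a, seg_b in zip(doc_segments, doc_segments[1:]):
        if rng.random() < 0.5:
            examples.append({"segment_a": seg_a, "segment_b": seg_b, "label": 1})
        else:
            examples.append({"segment_a": seg_b, "segment_b": seg_a, "label": 0})
    return examples

doc = ["今天天气很好。", "我们决定去公园散步。", "公园里人很多。"]  # 假设的文档片段
for ex in make_sop_examples(doc):
    print(ex)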
5 CONCLUSION—结论
While ALBERT-xxlarge has less parameters than BERT-large and gets significantly better results, it is computationally more expensive due to its larger structure. An important next step is thus to speed up the training and inference speed of ALBERT through methods like sparse attention (Child et al., 2019) and block attention (Shen et al., 2018). An orthogonal line of research, which could provide additional representation power, includes hard example mining (Mikolov et al., 2013) and more efficient language modeling training (Yang et al., 2019). Additionally, although we have convincing evidence that sentence order prediction is a more consistently-useful learning task that leads to better language representations, we hypothesize that there could be more dimensions not yet captured by the current self-supervised training losses that could create additional representation power for the resulting representations. | 虽然ALBERT-xxlarge的参数比BERT-large少,并且结果显著更好,但由于其更大的结构,计算成本更高。因此,一个重要的下一步是通过稀疏注意力 (Child等人,2019) 和块注意力 (Shen等人,2018) 等方法加快ALBERT的训练和推断速度。一个正交的研究方向,可能提供额外的表示能力,包括硬例子挖掘 (Mikolov等人,2013) 和更高效的语言建模训练 (Yang等人,2019)。此外,尽管我们有令人信服的证据表明句子顺序预测是一项更一致有用的学习任务,可以导致更好的语言表示,但我们假设当前自监督训练损失尚未捕捉到的更多维度可能会为结果表示提供额外的表示能力。 |
ALBERT的简介
ALBERT是BERT(一种流行的无监督语言表示学习算法)的"轻量级(A Lite)"版本。ALBERT采用参数削减技术,使大规模配置成为可能,克服了以往的内存限制,并在模型退化问题上表现更好。
有关算法的技术描述,请参见我们的论文。
GitHub地址:https://github.com/google-research/ALBERT
1、更新
新版于2020年3月28日发布
添加了一个Colab教程,用于对GLUE数据集进行微调。
地址:https://github.com/google-research/albert/blob/master/albert_glue_fine_tuning_tutorial.ipynb
新版于2020年1月7日发布
v2 TF-Hub模型现在应该与TF 1.15兼容,因为我们从图中移除了原生的Einsum操作。请参见下面更新的TF-Hub链接。
新版于2019年12月30日发布
发布了中文模型。我们要感谢CLUE团队提供的训练数据。
ALBERT模型的第二版发布了。
- Base: [Tar file] [TF-Hub]
- Large: [Tar file] [TF-Hub]
- Xlarge: [Tar file] [TF-Hub]
- Xxlarge: [Tar file] [TF-Hub]
在这个版本中,我们对所有模型应用了"无dropout"、"额外训练数据"和"更长训练时间"三种策略。我们对ALBERT-base训练了1000万步,对其他模型训练了300万步。
与v1模型相比,结果如下:
Models | Average | SQuAD1.1 | SQuAD2.0 | MNLI | SST-2 | RACE |
---|---|---|---|---|---|---|
V2 | | | | | | |
ALBERT-base | 82.3 | 90.2/83.2 | 82.1/79.3 | 84.6 | 92.9 | 66.8 |
ALBERT-large | 85.7 | 91.8/85.2 | 84.9/81.8 | 86.5 | 94.9 | 75.2 |
ALBERT-xlarge | 87.9 | 92.9/86.4 | 87.9/84.1 | 87.9 | 95.4 | 80.7 |
ALBERT-xxlarge | 90.9 | 94.6/89.1 | 89.8/86.9 | 90.6 | 96.8 | 86.8 |
V1 | | | | | | |
ALBERT-base | 80.1 | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 |
ALBERT-large | 82.4 | 90.6/83.9 | 82.3/79.4 | 83.5 | 91.7 | 68.5 |
ALBERT-xlarge | 85.5 | 92.5/86.1 | 86.1/83.1 | 86.4 | 92.4 | 74.8 |
ALBERT-xxlarge | 91.0 | 94.8/89.3 | 90.2/87.4 | 90.8 | 96.9 | 86.5 |
比较表明,对于ALBERT-base、ALBERT-large和ALBERT-xlarge,v2都明显优于v1,说明应用上述三种策略的重要性。总体而言,ALBERT-xxlarge的v2略逊于v1,原因有两点:1)额外训练150万步并没有带来显著的性能提升(这两个模型唯一的区别是训练了150万步还是300万步);2)对于v1,我们在BERT、RoBERTa和XLNet给出的参数集合上做过一定的超参数搜索;而对于v2,除了RACE任务使用1e-5的学习率和0的ALBERT DR(ALBERT微调时的dropout率)之外,我们直接沿用了v1的超参数。原始(v1)的RACE超参数会导致v2模型发散。鉴于下游任务对微调超参数非常敏感,我们应谨慎看待这类所谓的轻微改进。
初始发布日期:2019年10月9日
2、结果
ALBERT在GLUE基准开发集(dev)上使用单模型设置的性能结果:
Models | MNLI | QNLI | QQP | RTE | SST | MRPC | CoLA | STS |
---|---|---|---|---|---|---|---|---|
BERT-large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
XLNet-large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
RoBERTa-large | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4 |
ALBERT (1M) | 90.4 | 95.2 | 92.0 | 88.1 | 96.8 | 90.2 | 68.7 | 92.7 |
ALBERT (1.5M) | 90.8 | 95.3 | 92.2 | 89.2 | 96.9 | 90.9 | 71.4 | 93.0 |
ALBERT-xxlarge在SQuAD和RACE基准测试上使用单模型设置的性能:
Models | SQuAD1.1 dev | SQuAD2.0 dev | SQuAD2.0 test | RACE test (Middle/High) |
---|---|---|---|---|
BERT-large | 90.9/84.1 | 81.8/79.0 | 89.1/86.3 | 72.0 (76.6/70.1) |
XLNet | 94.5/89.0 | 88.8/86.1 | 89.1/86.3 | 81.8 (85.5/80.2) |
RoBERTa | 94.6/88.9 | 89.4/86.5 | 89.8/86.8 | 83.2 (86.5/81.3) |
UPM | - | - | 89.9/87.2 | - |
XLNet + SG-Net Verifier++ | - | - | 90.1/87.2 | - |
ALBERT (1M) | 94.8/89.2 | 89.9/87.2 | - | 86.0 (88.2/85.1) |
ALBERT (1.5M) | 94.8/89.3 | 90.2/87.4 | 90.9/88.1 | 86.5 (89.0/85.5) |
ALBERT的安装和使用方法
1、预训练模型
- Base: [Tar file] [TF-Hub]
- Large: [Tar file] [TF-Hub]
- Xlarge: [Tar file] [TF-Hub]
- Xxlarge: [Tar file] [TF-Hub]
在代码中使用TF-Hub模块的示例:
import tensorflow_hub as hub

# input_ids、input_mask、segment_ids是形状为[batch_size, seq_length]的int32张量;
# is_training表示当前是否处于训练阶段,训练时加入"train"标签(用于启用dropout等训练期行为)
tags = set()
if is_training:
  tags.add("train")
albert_module = hub.Module("https://tfhub.dev/google/albert_base/1", tags=tags,
                           trainable=True)
albert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)
albert_outputs = albert_module(
    inputs=albert_inputs,
    signature="tokens",
    as_dict=True)
# If you want to use the token-level output, use
# albert_outputs["sequence_output"] instead.
output_layer = albert_outputs["pooled_output"]
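作为后续用法的示意,下面给出一个基于TF 1.x API的假设性示例(其中num_labels与labels为示意用的占位变量,并非上面代码中已有的内容):在pooled_output之上接一个全连接分类层,即可用于句子级分类任务。
import tensorflow.compat.v1 as tf

# 假设性示意:output_layer形状为[batch_size, hidden_size];
# num_labels为类别数,labels为形状[batch_size]的整型标签(两者均为占位)
logits = tf.layers.dense(output_layer, units=num_labels, name="classifier")
per_example_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)
loss = tf.reduce_mean(per_example_loss)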
预训练说明
要预训练ALBERT,请使用run_pretraining.py:
pip install -r albert/requirements.txt
python -m albert.run_pretraining \
--input_file=... \
--output_dir=... \
--init_checkpoint=... \
--albert_config_file=... \
--do_train \
--do_eval \
--train_batch_size=4096 \
--eval_batch_size=64 \
--max_seq_length=512 \
--max_predictions_per_seq=20 \
--optimizer='lamb' \
--learning_rate=.00176 \
--num_train_steps=125000 \
--num_warmup_steps=3125 \
--save_checkpoints_steps=5000
2、微调
在GLUE上微调
要在GLUE上微调和评估预训练的ALBERT,请参见方便脚本run_glue.sh。
更底层的用例可以直接使用run_classifier.py脚本。该脚本用于在单个GLUE基准任务(例如MNLI)上微调和评估ALBERT:
pip install -r albert/requirements.txt
python -m albert.run_classifier \
--data_dir=... \
--output_dir=... \
--init_checkpoint=... \
--albert_config_file=... \
--spm_model_file=... \
--do_train \
--do_eval \
--do_predict \
--do_lower_case \
--max_seq_length=128 \
--optimizer=adamw \
--task_name=MNLI \
--warmup_step=1000 \
--learning_rate=3e-5 \
--train_step=10000 \
--save_checkpoints_steps=100 \
--train_batch_size=128
每个GLUE任务的良好默认标志值可以在run_glue.sh中找到。
您可以从TF-Hub模块开始微调模型,而不是从原始检查点开始,方法是设置--albert_hub_module_handle=https://tfhub.dev/google/albert_base/1而不是--init_checkpoint。
您可以在tar文件或tf-hub模块的assets文件夹中找到spm_model_file。模型文件的名称是"30k-clean.model"。
预训练评估(run_pretraining.py的--do_eval)完成后,脚本应报告类似如下的输出:
***** Eval results *****
global_step = ...
loss = ...
masked_lm_accuracy = ...
masked_lm_loss = ...
sentence_order_accuracy = ...
sentence_order_loss = ...
在SQuAD上微调
要在SQuAD v1上微调和评估预训练模型,请使用run_squad_v1.py脚本:
pip install -r albert/requirements.txt
python -m albert.run_squad_v1 \
--albert_config_file=... \
--output_dir=... \
--train_file=... \
--predict_file=... \
--train_feature_file=... \
--predict_feature_file=... \
--predict_feature_left_file=... \
--init_checkpoint=... \
--spm_model_file=... \
--do_lower_case \
--max_seq_length=384 \
--doc_stride=128 \
--max_query_length=64 \
--do_train=true \
--do_predict=true \
--train_batch_size=48 \
--predict_batch_size=8 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--warmup_proportion=.1 \
--save_checkpoints_steps=5000 \
--n_best_size=20 \
--max_answer_length=30
您可以通过设置例如--albert_hub_module_handle=https://tfhub.dev/google/albert_base/1而不是--init_checkpoint,从TF-Hub模块而不是原始检查点开始微调模型。
对于SQuAD v2,请使用run_squad_v2.py脚本:
pip install -r albert/requirements.txt
python -m albert.run_squad_v2 \
--albert_config_file=... \
--output_dir=... \
--train_file=... \
--predict_file=... \
--train_feature_file=... \
--predict_feature_file=... \
--predict_feature_left_file=... \
--init_checkpoint=... \
--spm_model_file=... \
--do_lower_case \
--max_seq_length=384 \
--doc_stride=128 \
--max_query_length=64 \
--do_train \
--do_predict \
--train_batch_size=48 \
--predict_batch_size=8 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--warmup_proportion=.1 \
--save_checkpoints_steps=5000 \
--n_best_size=20 \
--max_answer_length=30
您可以通过设置例如--albert_hub_module_handle=https://tfhub.dev/google/albert_base/1而不是--init_checkpoint,从TF-Hub模块而不是原始检查点开始微调模型。
在RACE上微调
对于RACE,请使用run_race.py脚本:
pip install -r albert/requirements.txt
python -m albert.run_race \
--albert_config_file=... \
--output_dir=... \
--train_file=... \
--eval_file=... \
--data_dir=...\
--init_checkpoint=... \
--spm_model_file=... \
--max_seq_length=512 \
--max_qa_length=128 \
--do_train \
--do_eval \
--train_batch_size=32 \
--eval_batch_size=8 \
--learning_rate=1e-5 \
--train_step=12000 \
--warmup_step=1000 \
--save_checkpoints_steps=100
您可以通过设置例如--albert_hub_module_handle=https://tfhub.dev/google/albert_base/1而不是--init_checkpoint,从TF-Hub模块而不是原始检查点开始微调模型。
3、SentencePiece
生成SentencePiece词表的命令:
spm_train \
--input all.txt --model_prefix=30k-clean --vocab_size=30000 --logtostderr \
--pad_id=0 --unk_id=1 --eos_id=-1 --bos_id=-1 \
--control_symbols=[CLS],[SEP],[MASK] \
--user_defined_symbols="(,),\",-,.,–,£,€" \
--shuffle_input_sentence=true --input_sentence_size=10000000 \
--character_coverage=0.99995 --model_type=unigram
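训练完成后,SentencePiece会生成30k-clean.model和30k-clean.vocab两个文件。下面是一个用sentencepiece Python包加载模型并分词的简单示例(示例文本为假设):
import sentencepiece as spm

# 加载由上述spm_train命令生成的模型(--model_prefix=30k-clean)
sp = spm.SentencePieceProcessor()
sp.Load("30k-clean.model")

text = "ALBERT uses a SentencePiece vocabulary."  # 假设的示例文本
print(sp.EncodeAsPieces(text))  # 切分得到的词片段
print(sp.EncodeAsIds(text))     # 对应的id序列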
ALBERT的案例应用
更新中……