ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS

ABSTRACT
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE and SQuAD benchmarks while having fewer parameters compared to BERT-large.
The code and the pretrained models are available at https://github.com/google-research/ALBERT
1 INTRODUCTION
Full network pre-training has led to a series of breakthroughs in language representation learning. Many non-trivial NLP tasks, including those that have limited training data, have greatly benefited from these pre-trained models. One of the most compelling signs of these breakthroughs is the evolution of machine performance on a reading comprehension task designed for middle and high-school English exams in China, the RACE test: the paper that originally describes the task and formulates the modeling challenge reports a then state-of-the-art machine accuracy of 44.1%; the latest published result reports their model performance at 83.2%; the work we present here pushes it even higher, to 89.4%, a stunning 45.3% improvement that is mainly attributable to our current ability to build high-performance pretrained language representations.

Evidence from these improvements reveals that a large network is of crucial importance for achieving state-of-the-art performance. It has become common practice to pretrain large models and distill them down to smaller ones for real applications. Given the importance of model size, we ask: Is having better NLP models as easy as having larger models?

An obstacle to answering this question is the memory limitations of available hardware. Given that current state-of-the-art models often have hundreds of millions or even billions of parameters, it is easy to hit these limitations as we try to scale our models.

Training speed can also be significantly hampered in distributed training, as the communication overhead is directly proportional to the number of parameters in the model.

Existing solutions to the aforementioned problems include model parallelization and clever memory management. These solutions address the memory limitation problem, but not the communication overhead. In this paper, we address all of the aforementioned problems by designing A Lite BERT (ALBERT) architecture that has significantly fewer parameters than a traditional BERT architecture.

ALBERT incorporates two parameter reduction techniques that lift the major obstacles in scaling pre-trained models. The first one is factorized embedding parameterization. By decomposing the large vocabulary embedding matrix into two small matrices, we separate the size of the hidden layers from the size of the vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings. The second technique is cross-layer parameter sharing. This technique prevents the parameters from growing with the depth of the network. Both techniques significantly reduce the number of parameters for BERT without seriously hurting performance, thus improving parameter efficiency. An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster. The parameter reduction techniques also act as a form of regularization that stabilizes the training and helps with generalization.

To further improve the performance of ALBERT, we also introduce a self-supervised loss for sentence-order prediction (SOP). SOP primarily focuses on inter-sentence coherence and is designed to address the ineffectiveness of the next sentence prediction (NSP) loss proposed in the original BERT.

As a result of these design decisions, we are able to scale up to much larger ALBERT configurations that still have fewer parameters than BERT-large but achieve significantly better performance. We establish new state-of-the-art results on the well-known GLUE, SQuAD and RACE benchmarks for natural language understanding. Specifically, we push the RACE accuracy to 89.4%, the GLUE benchmark to 89.4, and the F1 score of SQuAD 2.0 to 92.2.
2 RELATED WORK
2.1 SCALING UP REPRESENTATION LEARNING FOR NATURAL LANGUAGE
Learning representations of natural language has been shown to be useful for a wide range of NLP tasks and has been widely adopted. One of the most significant changes in the last two years is the shift from pre-training word embeddings, whether standard or contextualized, to full-network pre-training followed by task-specific fine-tuning. In this line of work, it is often shown that larger model size improves performance. For example, Devlin et al. show that across three selected natural language understanding tasks, using larger hidden size, more hidden layers, and more attention heads always leads to better performance. However, they stop at a hidden size of 1024, presumably because of the model size and computation cost problems.

It is difficult to experiment with large models due to computational constraints, especially in terms of GPU/TPU memory limitations. Given that current state-of-the-art models often have hundreds of millions or even billions of parameters, we can easily hit memory limits. To address this issue, Chen et al. propose a method called gradient checkpointing to reduce the memory requirement to be sublinear at the cost of an extra forward pass. Gomez et al. propose a way to reconstruct each layer's activations from the next layer so that they do not need to store the intermediate activations. Both methods reduce memory consumption at the cost of speed. Raffel et al. proposed to use model parallelization to train a giant model. In contrast, our parameter-reduction techniques reduce memory consumption and increase training speed.

2.2 CROSS-LAYER PARAMETER SHARING
The idea of sharing parameters across layers has been previously explored with the Transformer architecture, but this prior work has focused on training for standard encoder-decoder tasks rather than the pretraining/finetuning setting.

Different from our observations, Dehghani et al. show that networks with cross-layer parameter sharing (Universal Transformer, UT) get better performance on language modeling and subject-verb agreement than the standard transformer. (Cross-layer parameter sharing here presumably means that the different transformer layers share a single set of network parameters.)
Very recently, Bai et al. propose a Deep Equilibrium Model (DQE) for transformer networks and show that DQE can reach an equilibrium point for which the input embedding and the output embedding of a certain layer stay the same. Our observations show that our embeddings are oscillating rather than converging. Hao et al. combine a parameter-sharing transformer with the standard one, which further increases the number of parameters of the standard transformer.
2.3 SENTENCE ORDERING OBJECTIVES
ALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text. Several researchers have experimented with pretraining objectives that similarly relate to discourse coherence. Coherence and cohesion in discourse have been widely studied and many phenomena have been identified that connect neighboring text segments.

Most objectives found effective in practice are quite simple. Skip-thought and FastSent sentence embeddings are learned by using an encoding of a sentence to predict words in neighboring sentences. Other objectives for sentence embedding learning include predicting future sentences rather than only neighbors and predicting explicit discourse markers. Our loss is most similar to the sentence ordering objective of Jernite et al., where sentence embeddings are learned in order to determine the ordering of two consecutive sentences. Unlike most of the above work, however, our loss is defined on textual segments rather than sentences.

BERT uses a loss based on predicting whether the second segment in a pair has been swapped with a segment from another document. We compare to this loss in our experiments and find that sentence ordering is a more challenging pretraining task and more useful for certain downstream tasks. Concurrently to our work, Wang et al. also try to predict the order of two consecutive segments of text, but they combine it with the original next sentence prediction in a three-way classification task rather than empirically comparing the two.
3 THE ELEMENTS OF ALBERT
In this section, we present the design decisions for ALBERT and provide quantified comparisons against corresponding configurations of the original BERT architecture.
3.1 MODEL ARCHITECTURE CHOICES
The backbone of the ALBERT architecture is similar to BERT in that it uses a transformer encoder with GELU nonlinearities. We follow the BERT notation conventions and denote the vocabulary embedding size as E, the number of encoder layers as L, and the hidden size as H. Following Devlin et al., we set the feed-forward/filter size to be 4H and the number of attention heads to be H/64.
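As a quick reference for this notation, the sketch below derives the feed-forward size and head count from H. The class and field names are our own illustration, not code from the paper, and the defaults roughly match an ALBERT-large-style configuration.

```python
from dataclasses import dataclass

@dataclass
class EncoderConfig:
    """Paper notation: E = vocabulary embedding size, L = number of encoder layers, H = hidden size."""
    vocab_embedding_size: int = 128   # E
    num_layers: int = 24              # L
    hidden_size: int = 1024           # H

    @property
    def ffn_size(self) -> int:
        # feed-forward/filter size is set to 4H
        return 4 * self.hidden_size

    @property
    def num_attention_heads(self) -> int:
        # number of attention heads is set to H/64
        return self.hidden_size // 64

cfg = EncoderConfig()
print(cfg.ffn_size, cfg.num_attention_heads)  # 4096 16
```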
There are three main contributions that ALBERT makes over the design choices of BERT.
Factorized embedding parameterization
In BERT, as well as subsequent modeling improvements such as XLNet and RoBERTa, the WordPiece embedding size E is tied with the hidden layer size H, i.e., E = H. This decision appears suboptimal for both modeling and practical reasons, as follows. From a modeling perspective, WordPiece embeddings are meant to learn context-independent representations, whereas hidden-layer embeddings are meant to learn context-dependent representations. As experiments with context length indicate, the power of BERT-like representations comes from the use of context to provide the signal for learning such context-dependent representations.

As such, untying the WordPiece embedding size E from the hidden layer size H allows us to make a more efficient usage of the total model parameters as informed by modeling needs, which dictate that H >> E.

From a practical perspective, natural language processing usually requires the vocabulary size V to be large. If E = H, then increasing H increases the size of the embedding matrix, which has size V × E. This can easily result in a model with billions of parameters, most of which are only updated sparsely during training.

Therefore, for ALBERT we use a factorization of the embedding parameters, decomposing them into two smaller matrices. Instead of projecting the one-hot vectors directly into the hidden space of size H, we first project them into a lower dimensional embedding space of size E, and then project it to the hidden space. By using this decomposition, we reduce the embedding parameters from O(V × H) to O(V × E + E × H). This parameter reduction is significant when H >> E. We choose to use the same E for all word pieces because they are much more evenly distributed across documents compared to whole-word embedding, where having different embedding sizes for different words is important.
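To make the factorization concrete, here is a minimal PyTorch-style sketch (our own illustration, not the authors' code); the sizes are only an example, with a 30000-token vocabulary, E = 128, and a large hidden size H = 4096.

```python
import torch
import torch.nn as nn

V, E, H = 30000, 128, 4096  # vocabulary size, embedding size, hidden size (example values)

# BERT-style tied embedding: a single V x H matrix.
tied_embedding = nn.Embedding(V, H)                 # V * H = 122,880,000 parameters

# ALBERT-style factorized embedding: a V x E lookup followed by an E x H projection.
factorized_embedding = nn.Sequential(
    nn.Embedding(V, E),                             # V * E =  3,840,000 parameters
    nn.Linear(E, H, bias=False),                    # E * H =    524,288 parameters
)

def num_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(num_params(tied_embedding), num_params(factorized_embedding))  # 122880000 4364288

# Both produce H-dimensional token representations for the transformer stack.
token_ids = torch.tensor([[7, 42, 7, 3]])
assert factorized_embedding(token_ids).shape == (1, 4, H)
```

With these example sizes, the factorization shrinks the embedding parameters by roughly a factor of 28, in line with the O(V × H) to O(V × E + E × H) reduction described above.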

Cross-layer parameter sharing
For ALBERT, we propose cross-layer parameter sharing as another way to improve parameter efficiency. There are multiple ways to share parameters, e.g., only sharing feed-forward network (FFN) parameters across layers, or only sharing attention parameters. The default decision for ALBERT is to share all parameters across layers. All our experiments use this default decision unless otherwise specified. We compare this design decision against other strategies in our experiments in Sec. 4.5.
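As a rough sketch of what sharing all parameters across layers means in practice, the encoder below applies one layer module repeatedly instead of stacking distinct copies; this is only illustrative and not the authors' implementation.

```python
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Applies a single Transformer layer num_layers times, so the parameter count
    is that of one layer regardless of depth."""

    def __init__(self, layer: nn.Module, num_layers: int):
        super().__init__()
        self.layer = layer              # one set of weights...
        self.num_layers = num_layers

    def forward(self, hidden_states):
        for _ in range(self.num_layers):
            hidden_states = self.layer(hidden_states)   # ...reused at every depth
        return hidden_states

# Example with a standard PyTorch encoder layer (H = 1024, 16 heads, FFN = 4H):
shared_layer = nn.TransformerEncoderLayer(
    d_model=1024, nhead=16, dim_feedforward=4096, batch_first=True
)
encoder = SharedEncoder(shared_layer, num_layers=24)
```

Sharing only the FFN or only the attention parameters would correspond to building each layer from a mix of shared and per-layer sub-modules instead.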

Similar strategies have been explored by Dehghani et al. (Universal Transformer, UT) and Bai et al. (2019) (Deep Equilibrium Models, DQE) for Transformer networks. Different from our observations, Dehghani et al. show that UT outperforms a vanilla Transformer. Bai et al. show that their DQEs reach an equilibrium point for which the input and output embedding of a certain layer stay the same. Our measurements on the L2 distances and cosine similarity show that our embeddings are oscillating rather than converging.
Figure 1: The L2 distances and cosine similarity (in terms of degree) of the input and output embeddings of each layer, for BERT-large and ALBERT-large.
Figure 1 shows the L2 distances and cosine similarity of the input and the output embeddings for each layer, using BERT-large and ALBERT-large configurations (see Table 1). We observe that the transitions from layer to layer are much smoother for ALBERT than for BERT. These results show that weight-sharing has an effect on stabilizing network parameters. Although there is a drop for both metrics compared to BERT, they nevertheless do not converge to 0 even after 24 layers. This shows that the solution space for ALBERT parameters is very different from the one found by DQE.
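For reference, the quantities plotted in Figure 1 can be computed along the following lines; this is only a sketch of one plausible measurement (the paper does not spell out the exact protocol), assuming the input and output hidden states of a layer are available as tensors.

```python
import torch
import torch.nn.functional as F

def layer_transition_stats(x_in: torch.Tensor, x_out: torch.Tensor):
    """L2 distance and angle (in degrees) between a layer's input and output embeddings,
    averaged over token positions. x_in and x_out have shape [batch, seq_len, hidden]."""
    l2 = (x_out - x_in).norm(dim=-1).mean()
    cos = F.cosine_similarity(x_out, x_in, dim=-1).clamp(-1.0, 1.0)
    degrees = torch.rad2deg(torch.acos(cos)).mean()
    return l2.item(), degrees.item()

# Toy usage with random states; in practice x_in/x_out come from the encoder's forward pass.
x_in, x_out = torch.randn(1, 8, 1024), torch.randn(1, 8, 1024)
print(layer_transition_stats(x_in, x_out))
```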

Inter-sentence coherence loss
In addition to the masked language modeling (MLM) loss, BERT uses an additional loss called next-sentence prediction (NSP). NSP is a binary classification loss for predicting whether two segments appear consecutively in the original text, as follows: positive examples are created by taking consecutive segments from the training corpus; negative examples are created by pairing segments from different documents. Positive and negative examples are sampled with equal probability. The NSP objective was designed to improve performance on downstream tasks, such as natural language inference, that require reasoning about the relationship between sentence pairs. However, subsequent studies found NSP's impact unreliable and decided to eliminate it, a decision supported by an improvement in downstream task performance across several tasks.

We conjecture that the main reason behind NSP's ineffectiveness is its lack of difficulty as a task, as compared to MLM. As formulated, NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn compared to coherence prediction, and also overlaps more with what is learned using the MLM loss.

We maintain that inter-sentence modeling is an important aspect of language understanding, but we propose a loss based primarily on coherence. That is, for ALBERT, we use a sentence-order prediction (SOP) loss, which avoids topic prediction and instead focuses on modeling inter-sentence coherence. The SOP loss uses as positive examples the same technique as BERT (two consecutive segments from the same document), and as negative examples the same two consecutive segments but with their order swapped. This forces the model to learn finer-grained distinctions about discourse-level coherence properties.
As we show in Sec. 4.6, it turns out that NSP cannot solve the SOP task at all (i.e., it ends up learning the easier topic-prediction signal, and performs at random-baseline level on the SOP task), while SOP can solve the NSP task to a reasonable degree, presumably based on analyzing misaligned coherence cues. As a result, ALBERT models consistently improve downstream task performance for multi-sentence encoding tasks.
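To make the construction of the two pretraining signals concrete, here is a small sketch contrasting NSP and SOP example creation; the helper names and sampling details are our own simplification of the description above.

```python
import random

def make_nsp_example(doc, other_doc):
    """BERT's NSP: the negative pairs a segment with one drawn from a different document."""
    seg_a, seg_b = doc[0], doc[1]
    if random.random() < 0.5:
        return (seg_a, seg_b), 1                    # positive: consecutive segments, same document
    return (seg_a, random.choice(other_doc)), 0     # negative: second segment from another document

def make_sop_example(doc):
    """ALBERT's SOP: the negative keeps both segments but swaps their order."""
    seg_a, seg_b = doc[0], doc[1]
    if random.random() < 0.5:
        return (seg_a, seg_b), 1                    # positive: original order
    return (seg_b, seg_a), 0                        # negative: swapped order, same topic

doc = ["He walked to the market.", "He bought a gallon of milk there."]
other = ["The lecture covered convex optimization.", "Attendance was unusually high."]
print(make_nsp_example(doc, other))
print(make_sop_example(doc))
```

Because an SOP negative comes from the same document, topic cues alone cannot separate the classes, which is why the task isolates coherence.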

3.2 MODEL SETUP
We present the differences between BERT and ALBERT models with comparable hyperparameter settings in Table 2.

Table 2: Hyperparameter settings for the BERT and ALBERT models.

Due to the design choices discussed above, ALBERT models have a much smaller parameter size compared to corresponding BERT models. For example, ALBERT-large has about 18x fewer parameters compared to BERT-large, 18M versus 334M. If we set BERT to have an extra-large size with H = 2048, we end up with a model that has 1.27 billion parameters and under-performs (Fig. 1). In contrast, an ALBERT-xlarge configuration with H = 2048 has only 60M parameters, while an ALBERT-xxlarge configuration with H = 4096 has 233M parameters, i.e., around 70% of BERT-large's parameters. Note that for ALBERT-xxlarge, we mainly report results on a 12-layer network because a 24-layer network (with the same configuration) obtains similar results but is computationally more expensive.

Footnote: Since a negative example is constructed using material from a different document, the negative-example segment is misaligned both from a topic and from a coherence perspective.
