Day 2: The ULMFiT Language Model

呆呆有库

Published 2022-01-04 20:50:43

Article link: https://blog.csdn.net/Aaadsda414114/article/details/122310987

The ULMFiT Model

ULMFiT Compared with Other Models

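The previous post introduced ELMo. ELMo involves the following steps:

Pretrain on a language modeling (LM) task;
Fine-tune the LM on a corpus from the target domain;
Finally, train on the target task itself.

ULMFiT follows the same steps. As the title of its paper, Universal Language Model Fine-tuning for Text Classification, suggests, it is a unified, LM-based approach to transfer learning for text classification.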

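People have long attempted transfer learning in NLP, but no approach had matched the success of ImageNet pretraining in computer vision. It may be unrealistic to expect one method to dominate everything, but the research so far suggests that language models, thanks to their ability to learn the internal structure of language, have become the mainstream choice.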

ULMFiT's released source code is fairly complete, so we walk through its PyTorch implementation:

PyTorch code:

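# ULMFiT's source release is fairly complete: it includes all the scripts used in the paper, detailed processing steps, and pretrained models, so you can reproduce the results or follow the same steps to train your own model. Below, the source code is examined following the three steps in the paper:
# 
# 1. Language model pretraining
# Building and training the language model is fairly simple. The code is as follows: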

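# Excerpt from the training script: md, em_sz, nh, nl, drops, prs, n_neg, sampled,
# opt_fn, lr, wd and cl are defined earlier in that script and are not shown here.
import numpy as np

# Build the model
m = to_gpu(get_language_model(md.n_tok, em_sz, nh, nl, md.pad_idx, decode_train=False, dropouts=drops))
# Loss function (this excerpt does not show crit being attached to the learner)
crit = CrossEntDecoder(prs, m[1].decoder, n_neg=n_neg, sampled=sampled).cuda()
# Train
learner = RNN_Learner(md, LanguageModel(m), opt_fn=opt_fn)
lrs = np.array([lr/6,lr/3,lr,lr])
learner.fit(lrs, 1, wds=wd, use_clr=(32,10), cycle_len=cl)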

# This consists of three main parts:
# 
# Building the language model. The code is as follows:
def get_language_model(n_tok, em_sz, nhid, nlayers, pad_token, decode_train=True, dropouts=None):
    if dropouts is None: dropouts = [0.5,0.4,0.5,0.05,0.3]
    rnn_enc = RNN_Encoder(n_tok, em_sz, n_hid=nhid, n_layers=nlayers, pad_token=pad_token, dropouti=dropouts[0], wdrop=dropouts[2], dropoute=dropouts[3], dropouth=dropouts[4])
    rnn_dec = LinearDecoder(n_tok, em_sz, dropouts[1], decode_train=decode_train, tie_encoder=rnn_enc.encoder)
    return SequentialRNN(rnn_enc, rnn_dec)

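# As shown, the language model is built from two parts, RNN_Encoder and LinearDecoder. Their code is as follows.
# Notes on this excerpt: EmbeddingDropout, WeightDrop, LockedDropout and repackage_var are helpers from the
# original codebase and are not reproduced here. RNN_Encoder.forward also relies on self.reset() and
# self.hidden, whose definitions are omitted. The LinearDecoder listed below takes no decode_train
# argument, so the call in get_language_model above presumably targets a variant of the class that does.
import warnings
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import set_grad_enabled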

class RNN_Encoder(nn.Module):

    """A custom RNN encoder network that uses
        - an embedding matrix to encode input,
        - a stack of LSTM or QRNN layers to drive the network, and
        - variational dropouts in the embedding and LSTM/QRNN layers

        The architecture for this network was inspired by the work done in
        "Regularizing and Optimizing LSTM Language Models".
        (https://arxiv.org/pdf/1708.02182.pdf)
    """

    initrange=0.1

    def __init__(self, ntoken, emb_sz, n_hid, n_layers, pad_token, bidir=False,
                 dropouth=0.3, dropouti=0.65, dropoute=0.1, wdrop=0.5, qrnn=False):
        """ Default constructor for the RNN_Encoder class

            Args:
                bs (int): batch size of input data
                ntoken (int): number of vocabulary (or tokens) in the source dataset
                emb_sz (int): the embedding size to use to encode each token
                n_hid (int): number of hidden activation per LSTM layer
                n_layers (int): number of LSTM layers to use in the architecture
                pad_token (int): the int value used for padding text.
                dropouth (float): dropout to apply to the activations going from one LSTM layer to another
                dropouti (float): dropout to apply to the input layer.
                dropoute (float): dropout to apply to the embedding layer.
                wdrop (float): dropout used for a LSTM's internal (or hidden) recurrent weights.

            Returns:
                None
          """

        super().__init__()
        self.ndir = 2 if bidir else 1
        self.bs, self.qrnn = 1, qrnn
        self.encoder = nn.Embedding(ntoken, emb_sz, padding_idx=pad_token)
        self.encoder_with_dropout = EmbeddingDropout(self.encoder)
        if self.qrnn:
            #Using QRNN requires cupy: https://github.com/cupy/cupy
            from .torchqrnn.qrnn import QRNNLayer
            self.rnns = [QRNNLayer(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz)//self.ndir,
                save_prev_x=True, zoneout=0, window=2 if l == 0 else 1, output_gate=True) for l in range(n_layers)]
            if wdrop:
                for rnn in self.rnns:
                    rnn.linear = WeightDrop(rnn.linear, wdrop, weights=['weight'])
        else:
            self.rnns = [nn.LSTM(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz)//self.ndir,
                1, bidirectional=bidir) for l in range(n_layers)]
            if wdrop: self.rnns = [WeightDrop(rnn, wdrop) for rnn in self.rnns]
        self.rnns = torch.nn.ModuleList(self.rnns)
        self.encoder.weight.data.uniform_(-self.initrange, self.initrange)

        self.emb_sz,self.n_hid,self.n_layers,self.dropoute = emb_sz,n_hid,n_layers,dropoute
        self.dropouti = LockedDropout(dropouti)
        self.dropouths = nn.ModuleList([LockedDropout(dropouth) for l in range(n_layers)])

    def forward(self, input):
        """ Invoked during the forward propagation of the RNN_Encoder module.
        Args:
            input (Tensor): input of shape (sentence length x batch_size)

        Returns:
            raw_outputs (tuple(list(Tensor), list(Tensor))): list of tensors evaluated from each RNN layer without using
            dropouth, and list of tensors evaluated from each RNN layer using dropouth.
        """
        sl,bs = input.size()
        if bs!=self.bs:
            self.bs=bs
            self.reset()
        with set_grad_enabled(self.training):
            emb = self.encoder_with_dropout(input, dropout=self.dropoute if self.training else 0)
            emb = self.dropouti(emb)
            raw_output = emb
            new_hidden,raw_outputs,outputs = [],[],[]
            for l, (rnn,drop) in enumerate(zip(self.rnns, self.dropouths)):
                current_input = raw_output
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    raw_output, new_h = rnn(raw_output, self.hidden[l])
                new_hidden.append(new_h)
                raw_outputs.append(raw_output)
                if l != self.n_layers - 1: raw_output = drop(raw_output)
                outputs.append(raw_output)

            self.hidden = repackage_var(new_hidden)
        return raw_outputs, outputs

class LinearDecoder(nn.Module):
    initrange=0.1
    def __init__(self, n_out, n_hid, dropout, tie_encoder=None, bias=False):
        super().__init__()
        self.decoder = nn.Linear(n_hid, n_out, bias=bias)
        self.decoder.weight.data.uniform_(-self.initrange, self.initrange)
        self.dropout = LockedDropout(dropout)
        if bias: self.decoder.bias.data.zero_()
        if tie_encoder: self.decoder.weight = tie_encoder.weight

    def forward(self, input):
        raw_outputs, outputs = input
        output = self.dropout(outputs[-1])
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        result = decoded.view(-1, decoded.size(1))
        return result, raw_outputs, outputs

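# The encoder encodes the input with a multi-layer LSTM, and the decoder then applies a linear layer that maps the output onto the vocabulary. One detail to note: during encoding, different parts of the network use different dropout rates.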
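To make the dropout detail concrete, here is a minimal sketch of locked (variational) dropout, the kind of dropout that the LockedDropout name and the encoder docstring refer to: one mask is sampled per sequence and reused at every time step. The class name LockedDropoutSketch and the (seq_len, batch_size, features) input layout are assumptions for illustration. This is not the original implementation.

```python
import torch
import torch.nn as nn


class LockedDropoutSketch(nn.Module):
    """Variational ("locked") dropout: one mask per sequence, shared across time steps."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is assumed to have shape (seq_len, batch_size, features).
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        # Sample the mask for a single time step, then let broadcasting reuse it over seq_len.
        probs = torch.full((1, x.size(1), x.size(2)), keep, device=x.device, dtype=x.dtype)
        mask = torch.bernoulli(probs)
        # Rescale so the expected activation is unchanged.
        return x * mask / keep


torch.manual_seed(0)
drop = LockedDropoutSketch(p=0.5)
drop.train()
x = torch.ones(4, 2, 6)
y = drop(x)
# Every time step sees the same mask.
same_mask = bool((y[0] == y[1]).all() and (y[1] == y[3]).all())
drop.eval()
unchanged = torch.equal(drop(x), x)
```

Ordinary nn.Dropout would draw a fresh mask for every element, including every time step. Sharing the mask over time is what distinguishes the locked variant.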
# 
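# Defining the loss function. An LM is usually trained with cross-entropy loss, but the source code uses a loss based on negative sampling. The code is as follows (T, V and pt_sample are helpers from the original codebase and are not shown):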
class CrossEntDecoder(nn.Module):
    initrange=0.1
    def __init__(self, prs, decoder, n_neg=4000, sampled=True):
        super().__init__()
        self.prs,self.decoder,self.sampled = T(prs).cuda(),decoder,sampled
        self.set_n_neg(n_neg)

    def set_n_neg(self, n_neg): self.n_neg = n_neg

    def get_rand_idxs(self): return pt_sample(self.prs, self.n_neg)

    def sampled_softmax(self, input, target):
        idxs = V(self.get_rand_idxs())
        dw = self.decoder.weight
        #db = self.decoder.bias
        output = input @ dw[idxs].t() #+ db[idxs]
        max_output = output.max()
        output = output - max_output
        num = (dw[target] * input).sum(1) - max_output
        negs = torch.exp(num) + (torch.exp(output)*2).sum(1)
        return (torch.log(negs) - num).mean()

    def forward(self, input, target):
        if self.decoder.training:
            if self.sampled: return self.sampled_softmax(input, target)
            else: input = self.decoder(input)
        return F.cross_entropy(input, target)

# Note that the sampled_softmax function first performs negative sampling and then computes the softmax and the cross-entropy.
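Read term by term, the value returned by sampled_softmax above can be written as a formula. The notation is introduced here for illustration and does not appear in the source: h_i is the decoder input at position i, w_v is row v of the decoder weight, t_i is the target index, S is the set of n_neg sampled indices, and N is the number of positions.

```latex
\[
\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\Bigl[\log\Bigl(e^{s_i - m} + 2\sum_{j \in S} e^{h_i^{\top} w_j - m}\Bigr) - (s_i - m)\Bigr],
\qquad s_i = h_i^{\top} w_{t_i},\quad m = \max_{i,\; j \in S} h_i^{\top} w_j .
\]
```

The shift m cancels between the two terms, so it only affects numerical stability. The factor 2 on the sampled terms is taken directly from the code. Note also that the variable named negs holds the whole denominator, including the target term.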
# 
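# Training. One small detail to note: the lrs argument holds four learning rates, one for each of the three LSTM layers and one for the final projection layer, so each gets its own learning rate. The use_clr argument is also passed. It configures STLR (slanted triangular learning rates).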
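As a rough illustration of STLR, the sketch below implements the slanted triangular schedule as defined in the ULMFiT paper: a short linear warm-up to the peak learning rate followed by a long linear decay. The function name stlr and its parameters are chosen here for illustration. Reading use_clr=(32,10) as ratio 32 with a cut fraction of 1/10 is an assumption, and the library's own scheduler may differ in detail.

```python
import math


def stlr(t: int, total_iters: int, lr_max: float, cut_frac: float = 0.1, ratio: float = 32.0) -> float:
    """Slanted triangular learning rate at iteration t (0-based).

    Assumes total_iters * cut_frac >= 1 so that the warm-up phase is not empty.
    """
    cut = math.floor(total_iters * cut_frac)  # iteration at which the rate peaks
    if t < cut:
        p = t / cut  # linear warm-up
    else:
        p = 1.0 - (t - cut) / (cut * (1.0 / cut_frac - 1.0))  # linear decay
    # ratio controls how much smaller the lowest rate is than lr_max.
    return lr_max * (1.0 + p * (ratio - 1.0)) / ratio


total = 1000
schedule = [stlr(t, total, lr_max=0.01) for t in range(total)]
peak_iter = max(range(total), key=lambda t: schedule[t])
```

With these settings the schedule starts at lr_max / 32, peaks after the first 10% of iterations, and decays over the remaining 90%.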

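Practical significance of ULMFiT:

It works across tasks that vary in document size, number of documents, and label type;
It uses a single architecture and training process;
It requires no custom feature engineering or preprocessing;
It requires no additional in-domain documents or labels.
It uses an AWD-LSTM with a set of tuned dropout hyperparameters.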

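Components of ULMFiT:

General-domain LM pretraining;
Target-task LM fine-tuning: discriminative fine-tuning, slanted triangular learning rates;
Target-task classifier fine-tuning: concat pooling, gradual unfreezing.
Combining these methods yields strong results on the datasets.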
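Of the components listed above, concat pooling is the easiest to show in a few lines. As described in the paper, the classifier input concatenates the last hidden state with the max-pooled and mean-pooled hidden states over all time steps. The function name concat_pool and the (seq_len, batch_size, hidden_size) layout are assumptions for this sketch, which also ignores padding.

```python
import torch


def concat_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    """Map (seq_len, batch_size, hidden_size) to (batch_size, 3 * hidden_size)."""
    last = hidden_states[-1]                        # hidden state at the final time step
    max_pooled = hidden_states.max(dim=0).values    # element-wise max over time
    mean_pooled = hidden_states.mean(dim=0)         # element-wise mean over time
    return torch.cat([last, max_pooled, mean_pooled], dim=1)


# Toy input: 4 time steps, batch of 2, hidden size 3.
H = torch.arange(24, dtype=torch.float32).reshape(4, 2, 3)
pooled = concat_pool(H)
```

Pooling over the whole sequence lets the classifier use information from anywhere in a long document, not only what survives in the final hidden state.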

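Strengths:

The idea is intuitive: pretrain, then fine-tune. It is also quite useful in practice.
It proposes a set of optimization strategies and explains the reasoning behind them fairly clearly.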
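Two of those strategies, discriminative fine-tuning and gradual unfreezing, can be sketched with plain PyTorch optimizer parameter groups. The model below is a hypothetical stand-in (four small linear layers), not the ULMFiT architecture. The factor 2.6 between the learning rates of adjacent layers is the value suggested in the paper, and the helper name unfreeze_last is made up for this example.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained model: three "layers" plus a classifier head.
layers = nn.ModuleList([nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2)])

base_lr = 0.01
decay = 2.6  # each lower layer gets the learning rate of the layer above divided by 2.6

# Discriminative fine-tuning: one parameter group per layer, each with its own learning rate.
n = len(layers)
param_groups = [
    {"params": layer.parameters(), "lr": base_lr / decay ** (n - 1 - i)}
    for i, layer in enumerate(layers)
]
optimizer = torch.optim.SGD(param_groups, lr=base_lr)


def unfreeze_last(layers: nn.ModuleList, k: int) -> None:
    """Make only the last k layers trainable."""
    for i, layer in enumerate(layers):
        trainable = i >= len(layers) - k
        for p in layer.parameters():
            p.requires_grad = trainable


# Gradual unfreezing: start with the last layer only, then unfreeze one more layer per epoch.
trainable_per_epoch = []
for epoch in range(1, n + 1):
    unfreeze_last(layers, epoch)
    trainable_per_epoch.append(sum(any(p.requires_grad for p in l.parameters()) for l in layers))
    # ... one epoch of training with `optimizer` would go here ...

lrs = [g["lr"] for g in optimizer.param_groups]
```

Layers closer to the output hold more task-specific knowledge, so they are trained first and with the largest learning rate. The lower layers are updated later and more gently.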

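Weaknesses:

There are many things to tune and watch out for. The three-stage procedure and the number of tricks can be off-putting.
It is evaluated only on text classification. What advantages does it have over BERT, ELMo, and similar models? Hopefully future work will explore applications to more tasks.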

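Summary:

Language model fine-tuning is especially useful in the following settings:

NLP for non-English languages, where training data for supervised pretraining tasks is scarce;
New NLP tasks for which no state-of-the-art architecture exists;
Tasks with a limited amount of labeled data (and some amount of unlabeled data).

One possible direction is to improve language model pretraining and fine-tuning and make them more scalable. Language modeling could also be augmented with additional tasks in a multi-task learning fashion, or enriched with additional supervision, such as syntax-sensitive dependencies, to create a model that is more general or better suited to certain downstream tasks, ideally in a weakly supervised manner so that it retains its universal properties.

Another direction is to apply the method to new tasks and models: other tasks with more complex interactions, such as entailment or question answering, may require novel ways to pretrain and fine-tune.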

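Thoughts on optimization:

The paper's authors summarize their contribution as follows: they propose ULMFiT, an effective and sample-efficient transfer learning method that can be applied to any NLP task. They also propose several novel fine-tuning techniques that, used together, prevent catastrophic forgetting and enable robust learning across a diverse range of tasks.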
