Explaining the CE Loss in Language Models

苏炘

Tags: language models, artificial intelligence, natural language processing

Article link: https://blog.csdn.net/weixin_44902962/article/details/132591705

Take Chinese-to-English translation with BART as an example: the Chinese source is “我发不出论文” and the English target is "I can't publish a paper."

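If we use the tokenizer provided by Hugging Face, it prepends [BOS] to both the Chinese and the English token sequences and appends [EOS] to the end of each sentence. What actually goes into the Encoder is therefore [BOS]我发不出论文[EOS] (that is, "<s>我发不出论文</s>"), and the corresponding target sentence is [BOS]I can't publish a paper.[EOS].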

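For Encoder+Decoder architectures, Hugging Face models take three key parameters:

The first is input_ids, the input to the Encoder. In the example above this is the Chinese sentence, and it comes with a matching attention_mask.

The second is decoder_input_ids, the English side. Note that because each token is predicted from the tokens before it, we feed the decoder [BOS]I can't publish a paper. (with [EOS] removed) and expect the model to return I can't publish a paper.[EOS] (this time without [BOS]).

The third is labels. If we pass labels instead of decoder_input_ids, Hugging Face shifts the labels one position to the right and uses the result as decoder_input_ids.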
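To make the alignment concrete, here is a minimal pure-Python sketch. The word-level tokens and the values in `toy_probs` are made up for illustration: a real tokenizer would produce subword ids, and the probabilities would come from the model's softmax output. The sketch pairs each decoder input prefix with the token the model is expected to predict next, then averages the negative log-probabilities of the correct tokens, which is what the cross-entropy (CE) loss measures.

```python
import math

# Hypothetical word-level tokens; a real tokenizer would emit subword ids.
target = ["[BOS]", "I", "can't", "publish", "a", "paper.", "[EOS]"]

decoder_input = target[:-1]  # drop [EOS]: what the decoder reads
expected = target[1:]        # drop [BOS]: what the decoder should predict

# At step t the decoder has seen decoder_input[: t + 1] and must predict expected[t].
for step, tok in enumerate(expected):
    prefix = " ".join(decoder_input[: step + 1])
    print(f"{prefix!r} -> {tok!r}")

# Made-up probabilities that the model assigns to each correct next token.
toy_probs = [0.9, 0.5, 0.8, 0.7, 0.6, 0.95]

# Cross-entropy: mean negative log-probability of the correct tokens.
ce_loss = -sum(math.log(p) for p in toy_probs) / len(toy_probs)
print(f"CE loss = {ce_loss:.4f}")
```

The loss shrinks toward 0 as the probabilities of the correct tokens approach 1, and a single badly predicted position (a probability near 0) dominates the average.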

Below is the shift-right code from the library source:


import torch


def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
    """
    Shift input ids one token to the right.
    """
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    # Copy every token except the last one into positions 1 and onward.
    shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
    # Put decoder_start_token_id at the first position; for BART it defaults
    # to the same id as the end-of-sentence token </s>.
    shifted_input_ids[:, 0] = decoder_start_token_id

    if pad_token_id is None:
        raise ValueError("self.model.config.pad_token_id has to be defined.")
    # replace possible -100 values in labels by `pad_token_id`
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)

    return shifted_input_ids
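As a rough illustration of what this function does to a batch, here is a list-based analogue that needs no tensor library. The helper name `shift_right` and all token ids are assumptions chosen for the example (0 for `<s>`, 1 for `<pad>`, 2 for `</s>`, and 100 to 102 as made-up word ids); it is a sketch of the same logic, not the library code.

```python
def shift_right(labels, pad_token_id, decoder_start_token_id):
    """List-based analogue of shift_tokens_right for a batch of label rows."""
    shifted = []
    for row in labels:
        # Prepend the start token and drop the last token of the row.
        new_row = [decoder_start_token_id] + row[:-1]
        # -100 marks positions ignored by the loss; the decoder still needs
        # a real token id there, so it is replaced by the pad id.
        new_row = [pad_token_id if t == -100 else t for t in new_row]
        shifted.append(new_row)
    return shifted


# Hypothetical ids: 0 = <s>, 1 = <pad>, 2 = </s>; 100..102 are made-up word ids.
labels = [
    [0, 100, 101, 102, 2],
    [0, 100, 2, -100, -100],
]
decoder_input_ids = shift_right(labels, pad_token_id=1, decoder_start_token_id=2)
for row in decoder_input_ids:
    print(row)
# [2, 0, 100, 101, 102]
# [2, 0, 100, 2, 1]
```

Each row keeps its length: the start token enters on the left and the last label token falls off on the right, so position t of the decoder input lines up with position t of the labels as "what has been seen" versus "what to predict".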

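In other words, the labels we pass in are [BOS]I can't publish a paper.[EOS]. After the shift right they become [decoder_start_token_id][BOS]I can't publish a paper., and the model is expected to output [BOS]I can't publish a paper.[EOS]. Some readers will then ask: if the model fails to predict that [BOS] at inference time, won't everything after it go wrong?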

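That is why generate includes a check: if the current sequence is shorter than 2 tokens (only the decoder start token is present), the probability of [BOS] is forced to 1, so the output always follows the expected format.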
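The check can be sketched in plain Python as follows. The function names `force_bos` and `softmax`, the 4-token vocabulary, and the score values are all hypothetical; the sketch only mirrors the idea described above and is not the library's implementation. In the transformers library this behavior appears to be controlled by the `forced_bos_token_id` generation setting.

```python
import math


def force_bos(scores, cur_len, bos_token_id):
    """If fewer than 2 tokens exist so far, allow nothing but BOS."""
    if cur_len < 2:
        # A score of -inf becomes probability 0 after softmax.
        return [0.0 if i == bos_token_id else -math.inf for i in range(len(scores))]
    return scores


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.1, 2.0, -1.0, 0.5]  # made-up scores over a 4-token vocabulary
probs = softmax(force_bos(scores, cur_len=1, bos_token_id=0))
print(probs)  # [1.0, 0.0, 0.0, 0.0]
```

From the second generated token onward (`cur_len >= 2`) the scores pass through unchanged, so the constraint only affects the first decoding step.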