Learning Notes: Chinese Text Generation with CPM

小鸡炖蘑菇@

Last modified: 2023-06-18 22:06:22


First published: 2023-06-18 22:05:23

Article link: https://blog.csdn.net/weixin_48799576/article/details/131273909


Project code: GitHub - yangjianxin1/CPM: Easy-to-use CPM for Chinese text generation

Paper: CPM: A large-scale generative Chinese Pre-trained language model - ScienceDirect

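CPM (Chinese Pretrained Models) is a large-scale Chinese pre-trained language model released by the Beijing Academy of Artificial Intelligence (BAAI) and Tsinghua University.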

Dataset

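The training corpus consists of about 260,000 Chinese essays that the project author crawled from the web. Each essay has the layout "title - date - author - body"; only the title and the body are used for training.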

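For each essay, the title and the body are joined with [sep] and the sequence ends with [eod], so the final form is: title[sep]body[eod]. In the code below, [sep] is the tokenizer's separator token and [eod] is the <eod> token.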

Data preprocessing

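CpmTokenizer is the tokenizer that accompanies the CPM pre-trained model and is used to segment Chinese text. convert_tokens_to_ids() maps the special token <eod> to its token ID, and the sep_token_id attribute gives the ID of the tokenizer's separator token.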

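from transformers import CpmTokenizer  # CpmTokenizer ships with Hugging Face transformers

tokenizer = CpmTokenizer(vocab_file="vocab/chinese_vocab.model")
eod_id = tokenizer.convert_tokens_to_ids("<eod>")   # end-of-document token
sep_id = tokenizer.sep_token_id                     # separator between title and body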

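Next, tokenizer.encode() tokenizes the title and the body and converts them to token IDs; add_special_tokens=False keeps the tokenizer from adding special tokens on its own. The title IDs, the separator ID, the body IDs, and the end-of-document ID are then concatenated in that order into the token_ids list.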

title_ids = tokenizer.encode(title, add_special_tokens=False)
article_ids = tokenizer.encode(article, add_special_tokens=False)
token_ids = title_ids + [sep_id] + article_ids + [eod_id]

Each essay is then cut into windows of 200 tokens (token IDs, not characters), and every window is appended to the training list train_list.

win_size = args.win_size  # 200
step = args.step  # 200
start_index = 0
end_index = win_size
data = token_ids[start_index:end_index]
train_list.append(data)
start_index += step
end_index += step
while end_index+50 < len(token_ids):  # keep another window only while more than 50 tokens remain beyond its end
    data = token_ids[start_index:end_index]
    train_list.append(data)
    start_index += step
    end_index += step
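To make the windowing rule easier to check, here is a minimal standalone sketch of the same loop wrapped in a function. The function name split_into_windows and the min_tail parameter are hypothetical names introduced for illustration; the constants 200 and 50 are taken from the snippet above.

```python
def split_into_windows(token_ids, win_size=200, step=200, min_tail=50):
    """Cut one tokenized essay into training samples, mirroring the loop above."""
    # The first window is always kept, even when the essay is shorter than win_size.
    samples = [token_ids[0:win_size]]
    start, end = step, win_size + step
    # A further window is kept only while more than `min_tail` tokens lie beyond its end.
    while end + min_tail < len(token_ids):
        samples.append(token_ids[start:end])
        start += step
        end += step
    return samples


if __name__ == "__main__":
    for n in (100, 450, 451):
        windows = split_into_windows(list(range(n)))
        print(n, [len(w) for w in windows])
```

Under this rule a 451-token essay yields two 200-token samples and its last 51 tokens are dropped, while a 450-token essay yields only the first window. Choosing step smaller than win_size would make consecutive windows overlap.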

Finally, the processed training data is saved so that it can be loaded directly at training time.

# serialize the training data
with open(args.save_path, "wb") as f:
    pickle.dump(train_list, f)

Building the dataset

The following function loads the training data.

def load_dataset(logger, args):
    """
    Load the training set.
    """
    logger.info("loading training dataset")
    train_path = args.train_path

    with open(train_path, "rb") as f:
        train_list = pickle.load(f)

    # test
    # train_list = train_list[:24]
    logger.info('len of train data:{}'.format(len(train_list)))
    train_dataset = CPMDataset(train_list, args.max_len)

    return train_dataset

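After preprocessing and tokenization, each training sample is a list of token IDs. Because the text is cut with a window of 200, a sample holds at most 200 tokens.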
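The post does not show CPMDataset or how samples are batched. Since samples can be shorter than the window, a batch has to be padded to a common length. The sketch below shows one way this might be done with plain lists. The names pad_batch, PAD_ID, and IGNORE_INDEX are hypothetical, the padding id 0 is an assumption, and -100 is used for padded labels because that is the value PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 0           # assumed padding id for the inputs (hypothetical choice)
IGNORE_INDEX = -100  # label value that the loss skips (PyTorch's default ignore_index)


def pad_batch(samples, max_len):
    """Truncate each sample to max_len, then right-pad inputs and labels to a common length."""
    truncated = [list(s[:max_len]) for s in samples]
    width = max(len(s) for s in truncated)
    # Inputs are padded with PAD_ID; labels are a copy of the inputs padded with IGNORE_INDEX,
    # so padded positions contribute neither to the loss nor to the accuracy.
    input_ids = [s + [PAD_ID] * (width - len(s)) for s in truncated]
    labels = [s + [IGNORE_INDEX] * (width - len(s)) for s in truncated]
    return input_ids, labels


if __name__ == "__main__":
    print(pad_batch([[5, 6, 7], [8]], max_len=2))
```

In a PyTorch pipeline a function like this would typically be passed to the DataLoader as collate_fn after converting the lists to tensors.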

Model

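The model follows the GPT architecture and was pre-trained on 100 GB of Chinese text; it is essentially a Chinese GPT-2. Earlier Chinese pre-trained models usually used a character-level vocabulary, with each character as one token. Chinese words, however, are often made up of several characters and carry rich semantic information as a unit, so a character-only vocabulary loses those associations. CPM therefore builds its vocabulary from both characters and words. In the code, the vocabulary is built with SentencePiece.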

    if args.pretrained_model:  # load a pre-trained model
        model = GPT2LMHeadModel.from_pretrained(args.pretrained_model)
    else:  # initialize a model from scratch using the config file
        model_config = GPT2Config.from_json_file(args.model_config)
        model = GPT2LMHeadModel(config=model_config)
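GPT-2 has largely the same structure as the Transformer decoder. The input combines token embeddings and position embeddings, and the model is built from masked multi-head attention, layer normalization, feed-forward layers, and residual connections. During training the attention mask is lower-triangular, and the output vector at position n is used to predict the token at position n+1. Because that vector encodes the first n tokens, this amounts to predicting token n+1 from the first n tokens.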
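The lower-triangular mask and the "position n predicts position n+1" rule can be illustrated without any deep learning library. The following is a small sketch with hypothetical helper names (causal_mask, next_token_pairs); it only shows the bookkeeping, not the attention computation itself.

```python
def causal_mask(n):
    """Lower-triangular mask: row i has 1 in columns 0..i, so position i sees only itself and earlier positions."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def next_token_pairs(token_ids):
    """Pair every prefix with the token that the model should predict after it."""
    return [(token_ids[: i + 1], token_ids[i + 1]) for i in range(len(token_ids) - 1)]


if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
    for prefix, target in next_token_pairs([10, 11, 12, 13]):
        print(prefix, "->", target)
```

A sequence of length L therefore provides L-1 training targets in a single forward pass, which is why the whole window can be trained on at once.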

Training

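The training samples are fed to the model batch by batch. Each step computes the loss and the token-level prediction accuracy, and the parameters are updated once every gradient_accumulation_steps batches.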

    for batch_idx, (input_ids, labels) in enumerate(train_dataloader):
        # catch CUDA out-of-memory errors
        try:
            input_ids = input_ids.to(device)
            labels = labels.to(device)
            outputs = model(input_ids, labels=labels)
            logits = outputs.logits
            loss = outputs.loss
            loss = loss.mean()

            # count the correctly predicted tokens and the total tokens in this batch
            batch_correct_num, batch_total_num = calculate_acc(logits, labels, ignore_index=ignore_index)
            # accumulate the correct and total counts for this epoch
            epoch_correct_num += batch_correct_num
            epoch_total_num += batch_total_num
            # accuracy of this batch
            batch_acc = batch_correct_num / batch_total_num

            total_loss += loss.item()
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            loss.backward()
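            # gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

            # update the parameters after accumulating gradients for the configured number of steps
            if (batch_idx + 1) % args.gradient_accumulation_steps == 0:
                # update the parameters
                optimizer.step()
                # update the learning rate
                scheduler.step()
                # clear the gradients
                optimizer.zero_grad()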

            if (batch_idx + 1) % args.log_step == 0:
                # compute the global step before it is logged
                step = epoch * len(train_dataloader) + batch_idx
                logger.info(
                    "batch {} of epoch {}, step {}, loss {}, batch_acc {}, lr {}".format(
                        batch_idx + 1, epoch + 1, step, loss.item() * args.gradient_accumulation_steps, batch_acc, scheduler.get_last_lr()))
                writer.add_scalar('train loss', loss.item()*args.gradient_accumulation_steps, step)
                writer.add_scalar('train acc', batch_acc, step)

            del input_ids, outputs

        except RuntimeError as exception:
            if "out of memory" in str(exception):
                logger.info("WARNING: ran out of memory")
                if hasattr(torch.cuda, 'empty_cache'):
                    torch.cuda.empty_cache()
            else:
                logger.info(str(exception))
                raise exception
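The helper calculate_acc used in the loop is not shown in the post. The following is a dependency-free sketch of what such a helper might compute, using nested Python lists instead of tensors. It assumes that the output at position i is scored against the label at position i+1 (the same shift GPT2LMHeadModel applies when computing its loss) and that positions labeled ignore_index are skipped. The name calculate_acc_sketch is hypothetical.

```python
def calculate_acc_sketch(logits, labels, ignore_index=-100):
    """Return (correct, total) for next-token prediction.

    logits: nested list of shape [batch][seq_len][vocab_size]
    labels: nested list of shape [batch][seq_len]
    """
    correct = total = 0
    for seq_logits, seq_labels in zip(logits, labels):
        # The output at position i is compared with the label at position i + 1.
        for scores, target in zip(seq_logits[:-1], seq_labels[1:]):
            if target == ignore_index:
                continue  # padded position, not counted
            predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
            correct += int(predicted == target)
            total += 1
    return correct, total


if __name__ == "__main__":
    toy_logits = [[[0.1, 0.9, 0.0], [0.2, 0.1, 0.7], [0.5, 0.3, 0.2]]]
    print(calculate_acc_sketch(toy_logits, [[0, 1, 2]]))
    print(calculate_acc_sketch(toy_logits, [[0, 1, -100]]))
```

The batch accuracy in the training loop is then correct divided by total; a real implementation would also need to guard against total being zero.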

Prediction

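At generation time, given the input title and the preceding context, the model outputs at every step a probability distribution over the next token, and the next token is then drawn with top-k sampling or top-p (nucleus) sampling.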

First, the input title and the prompt text are encoded in the same way as during preprocessing:

    title_ids = tokenizer.encode(title, add_special_tokens=False)
    context_ids = tokenizer.encode(context, add_special_tokens=False)
    input_ids = title_ids + [sep_id] + context_ids

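The encoded context is then passed to the trained model to predict the next token. The temperature-scaled logits are filtered with top-k and top-p (nucleus) filtering to give filtered_logits, softmax turns them into a probability distribution, and torch.multinomial() draws the ID of the next token from that distribution.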

def generate_next_token(input_ids):
    """
    Generate the next token for the given context.
    """
    outputs = model(input_ids=input_ids)
    logits = outputs.logits
    # next_token_logits holds the prediction scores at the last position, i.e. the logits for the next token
    next_token_logits = logits[0, -1, :]
    next_token_logits = next_token_logits / args.temperature
    # set the logit of <unk> to -inf so that the model can never generate the <unk> token
    next_token_logits[unk_id] = -float('Inf')
    filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=args.topk, top_p=args.topp)
    # torch.multinomial draws num_samples indices without replacement; entries with higher weight are more likely to be drawn
    next_token_id = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
    return next_token_id
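The body of top_k_top_p_filtering is not shown in the post. To make the filtering step concrete, here is a pure-Python sketch of temperature scaling, top-k filtering, and nucleus filtering on a plain list of logits. It is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, and the nucleus is taken to be the smallest set of highest-probability tokens whose cumulative probability reaches top_p.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(logits, top_k=0, top_p=0.0):
    """Return a copy of `logits` in which filtered-out entries are set to -inf."""
    filtered = list(logits)
    order = sorted(range(len(filtered)), key=lambda i: filtered[i], reverse=True)
    if top_k > 0:
        # Keep only the top_k largest logits.
        for i in order[top_k:]:
            filtered[i] = -math.inf
    if top_p > 0.0:
        # Keep the smallest prefix of the sorted tokens whose cumulative probability reaches top_p.
        probs = softmax(filtered)
        cumulative, keep = 0.0, set()
        for i in order:
            keep.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        filtered = [x if i in keep else -math.inf for i, x in enumerate(filtered)]
    return filtered


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=0.0, rng=random):
    """Scale by temperature, filter, then draw one index from the remaining distribution."""
    scaled = [x / temperature for x in logits]
    probs = softmax(top_k_top_p_filter(scaled, top_k, top_p))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]
    print(top_k_top_p_filter(toy_logits, top_k=2))
    print(top_k_top_p_filter(toy_logits, top_p=0.8))
    print(sample_next_token(toy_logits, temperature=0.7, top_k=3, top_p=0.9))
```

A lower temperature sharpens the distribution before filtering, so the same top_p keeps fewer tokens; a higher temperature flattens it and keeps more.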

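Generation as a whole is autoregressive: the trained model predicts the next token from the context, the predicted token is appended to the context, and the extended context is used to predict the following token. This loop repeats until the end-of-document token is generated. The generation code is as follows: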

def generate(max_len):
    # tokenize the title and the context
    title_ids = tokenizer.encode(title, add_special_tokens=False)
    context_ids = tokenizer.encode(context, add_special_tokens=False)
    input_ids = title_ids + [sep_id] + context_ids
    cur_len = len(input_ids)
    last_token_id = input_ids[-1]  # last token of the text generated so far
    input_ids = torch.tensor([input_ids], dtype=torch.long, device=device)

    while True:
        # only the last context_len tokens are fed to the model
        next_token_id = generate_next_token(input_ids[:, -args.context_len:])
        input_ids = torch.cat((input_ids, next_token_id.unsqueeze(0)), dim=1)
        cur_len += 1
        word = tokenizer.convert_ids_to_tokens(next_token_id.item())
        # if cur_len >= max_len:
        #     break
        # past the maximum length and at a line break
        # (the token IDs 8 and 3 are specific to this vocabulary)
        if cur_len >= max_len and last_token_id == 8 and next_token_id == 3:
            break
        # past the maximum length and a punctuation mark was generated
        if cur_len >= max_len and word in [".", "。", "！", "!", "?", "？", ",", "，"]:
            break
        # the end-of-document token was generated
        if next_token_id == eod_id:
            break
        # remember the token just generated so that the line-break check sees the previous token
        last_token_id = next_token_id.item()
    result = tokenizer.decode(input_ids.squeeze(0))
    return result
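The control flow of the loop can be tried out without a trained model. The sketch below is a simplified illustration with hypothetical names (generate_ids, counting_model): it swaps the model for a deterministic stand-in function and, unlike the code above, stops as soon as max_len is reached instead of waiting for a punctuation mark.

```python
def generate_ids(next_token_fn, prompt_ids, eod_id, max_len, context_len):
    """Autoregressive loop: feed the last `context_len` ids, append the predicted id, repeat."""
    ids = list(prompt_ids)
    while len(ids) < max_len:
        next_id = next_token_fn(ids[-context_len:])  # the model only sees a bounded context
        ids.append(next_id)
        if next_id == eod_id:  # stop at the end-of-document token
            break
    return ids


def counting_model(context):
    """Stand-in for the language model: always predicts the last id plus one."""
    return context[-1] + 1


if __name__ == "__main__":
    print(generate_ids(counting_model, [1, 2], eod_id=5, max_len=10, context_len=2))
    print(generate_ids(counting_model, [1, 2], eod_id=99, max_len=4, context_len=2))
```

Stopping exactly at max_len can cut a sentence in half, which is presumably why the code above keeps generating until a punctuation mark or line break appears. The price of that choice is that the original loop has no hard upper bound on the output length.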

Generated samples: