Engineering Practice: How to Run P-tuning Experiments on Custom Data and Deploy the Model for Prediction

NLP分享汇

Article link: https://blog.csdn.net/u014577702/article/details/124588548

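Follow the WeChat official account NLP分享汇. If you like the content, scan the QR code to follow; I post notes from my accumulated experience there every day.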

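· Background

Who says GPT is only good at generation? GPT can handle natural language understanding too. In 2021, a study from Tsinghua University, BAAI (the Beijing Academy of Artificial Intelligence), and other institutions challenged that stereotype: with the P-tuning method, GPT's natural language understanding ability can rival BERT's.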

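It all starts with this paper: "GPT Understands, Too"

Paper: https://arxiv.org/pdf/2103.10385.pdf

GitHub: https://github.com/THUDM/P-tuning

This article, however, is not going to rehash how P-tuning works. Plenty of blog posts explain the paper, and you can read those at your own pace. Here we use Baidu's PaddleNLP to walk through the practical engineering side of a P-tuning few-shot model.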

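· Preface

Few-shot learning studies how to learn a model that generalizes well from a small number of labeled training samples. It is valuable wherever training data is scarce or labels are very expensive to obtain. Baidu has combined its Chinese pre-trained model ERNIE 1.0 with prompt methods to provide a one-stop P-tuning implementation, and the code is very convenient to use. However, simply running the built-in data step by step from the README on Baidu's GitHub does not help much in practice. What we really want is to apply this open-source code to the few-shot scenarios we actually work on, such as risk control and quality inspection. This article aims to help readers with three things: 1) environment-setup and runtime problems you may hit when reproducing the example, along with fixes; 2) how to run P-tuning experiments on custom data; 3) how to deploy the trained model for single-sentence prediction.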

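· Prerequisites

Before running P-tuning on custom data, I recommend first following the README on GitHub to run the built-in FewCLUE dataset (a Chinese few-shot learning benchmark).

1) P-tuning on GitHub: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/few_shot/p-tuning

Code structure: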


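|—— ptuning.py # Main script for P-tuning training and evaluation
|—— dataset.py # Task conversion logic for the 9 FewCLUE datasets under P-tuning, plus plain text -> training data conversion
|—— model.py # P-tuning network structure
|—— evaluate.py # Evaluation functions for the 9 FewCLUE datasets
|—— predict.py # Prediction for the 9 FewCLUE datasets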

2) FewCLUE dataset

GitHub: https://github.com/CLUEbenchmark/FewCLUE

Paper: https://arxiv.org/abs/2107.07498

· Engineering Practice

1) Environment setup and runtime bugs

[BUG-1] ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version 'GLIBCXX_3.4.22' not found
[Fix] https://blog.csdn.net/u014577702/article/details/123453453?spm=1001.2014.3001.5502

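[BUG-2] requests.exceptions.ConnectionError: HTTPSConnectionPool(host='paddlenlp.bj.bcebos.com', port=443): Max retries exceeded with url: /model/transformers/ernie/ernie_v1_chn_base.pdparams (Caused by NewConnectionError)
[Fix] The machine cannot reach the model server. Download the required file manually (for example on a machine with network access) and place it where PaddleNLP expects to find it. By default PaddleNLP usually caches models under ~/.paddlenlp/models/<model name>/, but check the path on your own installation.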

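[BUG-3] libcublas.so cannot be found
[Fix] Add the directory that contains libcublas.so (or its symlink) to the dynamic linker's search path with the following command. Note that the existing value to append is $LD_LIBRARY_PATH, not $PATH.
export LD_LIBRARY_PATH=/home/pafl/anaconda3/envs/mypaddle/lib:$LD_LIBRARY_PATH
# General form:
export LD_LIBRARY_PATH=<path to your conda environment>/lib:$LD_LIBRARY_PATH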
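
Before re-running training, it can help to confirm that the library is actually visible on the search path. The following is a minimal sketch, not part of the PaddleNLP example: `find_shared_lib` is a hypothetical helper that simply lists files whose names start with a given prefix in each directory of a path-style string.

```python
import os
from pathlib import Path


def find_shared_lib(prefix, search_path):
    """Return files whose name starts with `prefix` in each directory
    of an os.pathsep-separated search path (hypothetical helper)."""
    hits = []
    for entry in search_path.split(os.pathsep):
        directory = Path(entry)
        # Skip empty entries and directories that do not exist.
        if not entry or not directory.is_dir():
            continue
        hits.extend(
            sorted(str(p) for p in directory.iterdir()
                   if p.name.startswith(prefix)))
    return hits


if __name__ == "__main__":
    matches = find_shared_lib("libcublas.so",
                              os.environ.get("LD_LIBRARY_PATH", ""))
    print(matches or "libcublas.so not found on LD_LIBRARY_PATH")
```

If the list is empty after the export, the directory you added probably does not contain the library, and the path needs to point at a different conda environment or CUDA installation.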

[BUG-4]
INFO 2021-10-26 19:01:38,701 launch_utils.py:327] terminate all the procs
ERROR 2021-10-26 19:01:38,702 launch_utils.py:584] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2021-10-26 19:01:41,705 launch_utils.py:327] terminate all the procs
[Fix] https://github.com/PaddlePaddle/PaddleNLP/issues/1238

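[BUG-5] Training fails with: ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
[Fix] On a single GPU you do not need paddle.distributed.launch. Run the training script directly with python ptuning.py and the same arguments.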

[BUG-6] ln: failed to create symbolic link 'libcudnn.so': File exists
[Fix] https://www.jianshu.com/p/b308d3bbde8a


[BUG-7] Fatal error: 'Segmentation fault' is detected by the operating system
[Fix] https://blog.csdn.net/u014577702/article/details/123453508

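[BUG-8] Fatal error: 'Access to an undefined portion of a memory object' is detected by the operating system
[Fix] This usually means the cudatoolkit or cudnn version does not match what the environment requires; upgrading to the matching version fixes it. In my case, upgrading cudatoolkit from 8.0 to 10.0 made everything run normally: conda install cudatoolkit=10.0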

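2) How to run P-tuning experiments on custom data

step-1 Implement a read function that returns the plain-text data as dictionaries, and change how the data is loaded in def do_train(), e.g.: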

# Assume the plain-text data has 2 tab-separated columns: text \t label
def read_fn(data_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            text, label = line.rstrip().split('\t')
            yield {"text": text, "label": label}

def do_train():
    ...
    # Original loading code, which uses the built-in fewclue data directly
    # train_ds, dev_ds, public_test_ds = load_dataset(
    #     "fewclue",
    #     name=args.task_name,
    #     splits=("train_0", "dev_0", "test_public"))
    # Loading code for your own data
    train_ds = load_dataset(read_fn, data_path="../train.txt", lazy=False)
    dev_ds = load_dataset(read_fn, data_path="../dev.txt", lazy=False)
    public_test_ds = load_dataset(read_fn, data_path="../test.txt", lazy=False)
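
A malformed line (a missing tab, or an extra one) makes the simple reader above fail with an unhelpful unpacking error. The sketch below is a more defensive variant that can be tried without PaddleNLP installed; the name `read_tsv` and the choice to skip blank lines are my own assumptions, not part of the original example.

```python
def read_tsv(data_path):
    """Yield {"text": ..., "label": ...} for each well-formed line.

    Hypothetical stricter variant of read_fn: blank lines are skipped and
    lines without exactly two tab-separated columns raise a clear error.
    """
    with open(data_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            parts = line.split("\t")
            if len(parts) != 2:
                raise ValueError(
                    f"{data_path}:{line_no}: expected 2 tab-separated "
                    f"columns, got {len(parts)}")
            text, label = parts
            yield {"text": text, "label": label}
```

Because it has the same signature and yields the same keys, it should be usable as a drop-in replacement for `read_fn` in the `load_dataset` calls, and it reports the offending file and line number when the data is dirty.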

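step-2 In the ./label_normalized/ directory, create a label map dictionary named mytask.json, which normalizes each label into a label word, e.g.:

# Example for a classification task whose labels are "Positive" and "Negative".
{
    "Negative":"负例",
    "Positive":"正例"
}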
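
The prediction code later in this article takes the label length from the first normalized label and indexes the same number of mask positions for every label, so all normalized label words should have the same length. The following sketch checks that before training; `check_label_map` is a hypothetical helper, and it assumes each Chinese character maps to exactly one token.

```python
import json


def check_label_map(label_map):
    """Return the shared length of the normalized labels, or raise
    ValueError if the map is unusable (hypothetical helper)."""
    if not label_map:
        raise ValueError("label map is empty")
    lengths = {len(v) for v in label_map.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError(
            "normalized labels must share one non-zero length, "
            f"got lengths {sorted(lengths)}")
    if len(set(label_map.values())) != len(label_map):
        raise ValueError("normalized labels must be unique")
    return lengths.pop()


label_map = {"Negative": "负例", "Positive": "正例"}
label_length = check_label_map(label_map)
print(json.dumps(label_map, ensure_ascii=False), label_length)
```

In practice you would load `label_map` from ./label_normalized/mytask.json with `json.load` instead of defining it inline.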

3) How to deploy the trained model for single-sentence prediction

# Write a separate .py file for single-sentence prediction
import argparse
import os
import sys
import random
import time
import json
from functools import partial

import numpy as np
import paddle
import paddle.nn.functional as F

import paddlenlp as ppnlp
from model import ErnieForPretraining
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.datasets import load_dataset

from data import create_dataloader, transform_fn_dict
from data import convert_example, convert_chid_example
from evaluate import do_evaluate, do_evaluate_chid

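parser = argparse.ArgumentParser()
# The script reads these four values from args, so they must be declared.
# The defaults below are assumptions; use the values from your own training run.
parser.add_argument("--task_name", type=str, default="mytask",
                    help="Key into transform_fn_dict (must be registered in data.py).")
parser.add_argument("--init_from_ckpt", type=str, default="../model_state.pdparams",
                    help="Path to the trained model parameters.")
parser.add_argument("--max_seq_length", type=int, default=512,
                    help="Maximum input sequence length after tokenization.")
parser.add_argument("--p_embedding_num", type=int, default=1,
                    help="Number of P-tuning prompt embeddings.")
args = parser.parse_args()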

def set_seed(seed):
    """sets random seed"""
    random.seed(seed)
    np.random.seed(seed)
    paddle.seed(seed)

@paddle.no_grad()
def do_predict(model, tokenizer, data_loader, label_normalize_dict):
    model.eval()

    # Keys are the original labels; values are the normalized label words.
    # Both lists keep the dictionary's order, so indices line up.
    normed_labels = list(label_normalize_dict.values())
    origin_labels = list(label_normalize_dict.keys())

    label_length = len(normed_labels[0])

    y_pred_labels = []

    for batch in data_loader:
        src_ids, token_type_ids, masked_positions = batch

        # [bs * label_length, vocab_size]
        prediction_probs = model.predict(
            input_ids=src_ids,
            token_type_ids=token_type_ids,
            masked_positions=masked_positions)

        batch_size = len(src_ids)
        vocab_size = prediction_probs.shape[1]

        # prediction_probs: [batch_size, label_length, vocab_size]
        prediction_probs = paddle.reshape(
            prediction_probs, shape=[batch_size, -1, vocab_size]).numpy()

        # [label_num, label_length]
        label_ids = np.array(
            [tokenizer(label)["input_ids"][1:-1] for label in normed_labels])

        y_pred = np.ones(shape=[batch_size, len(label_ids)])

        # Calculate joint distribution of candidate labels
        for index in range(label_length):
            y_pred *= prediction_probs[:, index, label_ids[:, index]]

        # Get max probs label's index
        y_pred_index = np.argmax(y_pred, axis=-1)

        for index in y_pred_index:
            y_pred_labels.append(origin_labels[index])

    return y_pred_labels

predict_file = {
    "mytask": "mytask_predict.json"
}

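def read_fn(data_path):
    # Each line is "id \t sentence"; split only on the first tab so that
    # a sentence containing tabs does not break the unpacking.
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            sid, text = line.rstrip().split('\t', 1)
            yield {"id": int(sid), "sentence": text}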

def write_my(task_name, output_file, pred_labels):
    test_ds = load_dataset(read_fn, data_path = "../test.txt", lazy=False)
    test_example = {}
    with open(output_file, 'w', encoding='utf-8') as f:
        for idx, example in enumerate(test_ds):
            test_example["id"] = example["id"]
            test_example["label"] = pred_labels[idx]
            str_test_example = json.dumps(test_example)
            f.write(str_test_example + "\n")

write_fn = {
    "mytask": write_my
}

if __name__ == "__main__":
    paddle.set_device('cpu')
    set_seed(1000)

    label_normalize_json = os.path.join("./label_normalized","mytask.json")

    init_from_ckpt = "../model_state.pdparams"

    label_norm_dict = None
    with open(label_normalize_json, encoding='utf-8') as f:
        label_norm_dict = json.load(f)

    convert_example_fn = convert_example
    predict_fn = do_predict

    print("Loading model parameters ...")
    model = ErnieForPretraining.from_pretrained('ernie-1.0')
    tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0')

    # Load the parameters of the model trained on your task
    if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
        state_dict = paddle.load(args.init_from_ckpt)
        model.set_dict(state_dict)
        print("Loaded parameters from %s" % args.init_from_ckpt)
    else:
        raise ValueError(
            "Please set --init_from_ckpt to a valid trained model file")

    # [src_ids, token_type_ids, masked_positions, masked_lm_labels]
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # src_ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
        Stack(dtype="int64"),  # masked_positions
    ): [data for data in fn(samples)]

    trans_func = partial(
        convert_example_fn,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        p_embedding_num=args.p_embedding_num,
        is_test=True)

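    # Task-related transform operations, e.g. numeric label -> text label, English -> Chinese
    transform_fn = partial(
        transform_fn_dict[args.task_name],
        label_normalize_dict=label_norm_dict,
        is_test=True)

    while True:
        print("Input sentence: ")
        line = sys.stdin.readline()
        if not line:  # EOF (e.g. Ctrl-D): stop the loop
            break
        sentence = line.strip()
        if not sentence:
            continue
        # This is the crudest possible approach: write the sentence to a file
        # and reload it as a dataset. Feel free to improve it when you have time.
        with open("../test.txt", "w", encoding='utf-8') as ftest:
            ftest.write('1' + '\t' + sentence)
        test_ds = load_dataset(read_fn, data_path="../test.txt", lazy=False)
        test_ds = test_ds.map(transform_fn, lazy=False)

        test_data_loader = create_dataloader(
            test_ds,
            mode='eval',
            batch_size=1,
            batchify_fn=batchify_fn,
            trans_fn=trans_func)

        # Everything from here on must stay inside the loop, otherwise it never runs.
        y_pred_labels = predict_fn(model, tokenizer, test_data_loader,
                                   label_norm_dict)
        # 'Positive' is the original label from the mytask.json example above;
        # change it to match the keys of your own label map.
        if y_pred_labels[0] == 'Positive':
            print("Positive")
        else:
            print("Negative")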
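
The scoring step in do_predict can be hard to follow from the indexing alone. For each candidate label it multiplies the probability that the model assigns to the label's first token at the first [MASK] position by the probability of its second token at the second position, and so on, then picks the label with the largest product. Here is a toy NumPy walk-through of that step; the probabilities and token ids are made up for illustration and do not come from a real model or tokenizer.

```python
import numpy as np

# Toy setup: batch of 1, two [MASK] positions, vocabulary of 5 token ids.
prediction_probs = np.array([[
    [0.1, 0.6, 0.1, 0.1, 0.1],  # distribution at mask position 0
    [0.1, 0.1, 0.1, 0.5, 0.2],  # distribution at mask position 1
]])

# Made-up token ids for two label words, each two tokens long.
label_ids = np.array([
    [1, 3],  # tokens of label 0
    [2, 4],  # tokens of label 1
])
origin_labels = ["Negative", "Positive"]

# Joint score of each label = product over mask positions of the
# probability of that label's token at that position.
y_pred = np.ones((prediction_probs.shape[0], len(label_ids)))
for index in range(label_ids.shape[1]):
    y_pred *= prediction_probs[:, index, label_ids[:, index]]

best = np.argmax(y_pred, axis=-1)
print(y_pred)                            # joint scores, shape [batch, num_labels]
print([origin_labels[i] for i in best])  # predicted original labels
```

Label 0 scores 0.6 * 0.5 = 0.3 and label 1 scores 0.1 * 0.2 = 0.02, so the first label wins. This also shows why every normalized label word needs the same number of tokens: the loop reads one token per label at each mask position.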

If you run into problems during your P-tuning experiments, send me a message on the official account. I check and reply regularly.