ERNIE 3.0 Practice Notes

I. Code Practice

(I) Basic Version

The overall framework of this experiment follows this blog post:

https://lizhiyang.blog.csdn.net/article/details/132394853

Based on the Ernie-3.0-medium-zh model, that post performs sentiment analysis on 600 movie reviews of the film 孤注一掷 (No More Bets) and visualizes them as a word cloud. It cleans the data with regular expressions and uses a DatasetBuilder subclass from paddlenlp.datasets to process it into the format [{'text_a': 'data', 'label': label}, ...].
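For context, the cleaning step amounts to stripping non-review noise with re before building that list of dicts. A minimal sketch of such a cleaner (the patterns below are illustrative, not the post's exact code):

import re

def clean_text(text):
    """Strip URLs, @mentions, and retweet separators, keeping the review text."""
    text = re.sub(r'https?://\S+', '', text)  # drop links
    text = re.sub(r'@\S+', '', text)          # drop @mentions
    text = re.sub(r'//', ' ', text)           # drop retweet separators
    return text.strip()

print(clean_text('太好看了!//@某人: http://t.cn/xyz'))  # -> 太好看了!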

1 Data Processing

1.1 Splitting the Dataset

import random

# Read the contents of the custom .txt file
with open('weibo_senti_100k.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Shuffle the data randomly
random.shuffle(lines)

# Compute the split indices
total_lines = len(lines)
train_end = int(total_lines * 0.7)
dev_end = int(total_lines * 0.9)

# Split the data
train_data = lines[:train_end]
dev_data = lines[train_end:dev_end]
test_data = lines[dev_end:]

# Write the training data to train.txt
with open('train.txt', 'w', encoding='utf-8') as file:
    file.writelines(train_data)

# Write the validation data to dev.txt
with open('dev.txt', 'w', encoding='utf-8') as file:
    file.writelines(dev_data)

# Write the test data to test.txt
with open('test.txt', 'w', encoding='utf-8') as file:
    file.writelines(test_data)
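Note that random.shuffle above is unseeded, so each run produces a different split. To make the split reproducible, seed the generator before shuffling (my addition, not part of the original script):

random.seed(42)       # any fixed value makes the shuffle, and hence the split, repeatable
random.shuffle(lines)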

The dataset used here is weibo_senti_100k: ChineseNlpCorpus/datasets/weibo_senti_100k/intro.ipynb at master · SophonPlus/ChineseNlpCorpus · GitHub

Split the Weibo dataset into three parts: 70% training, 20% validation, and 10% test. The raw file is UTF-8 encoded but has no column names. To add them: open the file in Notepad++ and convert it to ANSI, rename it to .csv so it opens normally, make the edits, save it as a Unicode .txt, and finally convert it back to UTF-8 in Notepad++ (a pure-Python alternative is sketched after the download link below).

Notepad++ download (Baidu Netdisk):

https://pan.baidu.com/s/14cRU0EjD0BiPl5doYj6pMA
[Extraction code]: kwii
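Alternatively, the whole encoding round-trip can be done in a few lines of Python, assuming the raw file really is UTF-8 and the goal is just to add a header row (the column names here are my own choice, matching the parsing code later on):

# Read the UTF-8 source and write it back, still UTF-8, with a header row prepended.
with open('weibo_senti_100k.txt', 'r', encoding='utf-8') as src:
    lines = src.readlines()

with open('weibo_senti_100k_with_header.txt', 'w', encoding='utf-8') as dst:
    dst.write('label\ttext_a\n')  # hypothetical column names
    dst.writelines(lines)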

1.2 Loading the Data

# Import DatasetBuilder
from paddlenlp.datasets import DatasetBuilder


class NewsData(DatasetBuilder):
    SPLITS = {
        'train': r'train.txt',  # training set
        'dev': r'dev.txt',      # validation set
        'test': r'test.txt'     # test set
    }

    def _get_data(self, mode, **kwargs):
        filename = self.SPLITS[mode]
        return filename

    def _read(self, filename):
        """Read the data file line by line."""
        with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                if line == '\n':
                    continue
                data = line.strip().split("\t")    # columns are separated by '\t'
                label, text_a = data
                text_a = text_a.replace(" ", "")
                if label in ['0', '1']:
                    yield {"text_a": text_a, "label": label}  # sample format: text_a, label; adjust to your data as needed

    def get_labels(self):
        return label_list   # class labels

-------------------------------------------
D:******lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[This is a warning, not an error; it is harmless and does not affect execution.]
# Define the dataset loading function
def load_dataset(name=None,
                 data_files=None,
                 splits=None,
                 lazy=None,
                 **kwargs):

    reader_cls = NewsData  # use the dataset format defined above
    print(reader_cls)
    if not name:
        reader_instance = reader_cls(lazy=lazy, **kwargs)
    else:
        reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)
    datasets = reader_instance.read_datasets(data_files=data_files, splits=splits)
    return datasets

# Load the training, validation, and test sets
label_list = ['0', '1']
train_ds, dev_ds, text_t = load_dataset(splits=['train', 'dev', 'test'])  # text_t is the test split


-------------------------------------
<class '__main__.NewsData'>

1.3 Inspecting the Data

# Show the first five samples
train_ds[:5]

--------------------------------------
[{'text_a': '[抓狂][抓狂][抓狂]起晚了[泪]', 'label': 0},
 {'text_a': '分享图片,不要啊~~~虽然我很喜欢周迅,可是八阿哥,你一定要等晴川啊~~[泪]', 'label': 0},
 {'text_a': '想shi的就上飞机吧???只见君去不见君还[泪]//@玉翠文章:不错很丰满,赛过杨贵妃,我喜欢,带着你超过盖茨4倍家财的嫁妆来吧[酷]请各路大仙作媒',
  'label': 0},
 {'text_a': '多谢支持!//@洪三水:[嘻嘻]画面不错,其他继续研究中。。//@曹欣Dyson:抢怪的那叫个多。-。-//@刘波BOB:钢铁侠,有没有?有没有?亮了!',
  'label': 1},
 {'text_a': '#周末节奏#美好的一天从早餐开始,黄金蛋炒饭,番茄牛尾汤[嘻嘻]', 'label': 1}]

2 The ERNIE 3.0 Model

2.1 Loading the Model

import os
import paddle
import paddlenlp
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "ernie-3.0-medium-zh"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_classes=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

---------------------------------------
[2024-01-26 13:15:04,015] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification'> to load 'ernie-3.0-medium-zh'.
[2024-01-26 13:15:04,016] [    INFO] - Already cached *****\.paddlenlp\models\ernie-3.0-medium-zh\model_state.pdparams
[2024-01-26 13:15:04,017] [    INFO] - Loading weights file model_state.pdparams from cache at *****\.paddlenlp\models\ernie-3.0-medium-zh\model_state.pdparams
[2024-01-26 13:15:04,276] [    INFO] - Loaded weights file from disk, setting weights to model.
[2024-01-26 13:15:09,982] [ WARNING] - Some weights of the model checkpoint at ernie-3.0-medium-zh were not used when initializing ErnieForSequenceClassification: ['ernie.encoder.layers.6.self_attn.k_proj.weight', 'ernie.encoder.layers.6.self_attn.q_proj.bias', 'ernie.encoder.layers.6.linear1.weight', 'ernie.encoder.layers.6.norm2.bias', 'ernie.encoder.layers.6.self_attn.k_proj.bias', 'ernie.encoder.layers.6.self_attn.v_proj.bias', 'ernie.encoder.layers.6.self_attn.out_proj.weight', 'ernie.encoder.layers.6.self_attn.v_proj.weight', 'ernie.encoder.layers.6.norm1.bias', 'ernie.encoder.layers.6.norm1.weight', 'ernie.encoder.layers.6.linear2.bias', 'ernie.encoder.layers.6.linear1.bias', 'ernie.encoder.layers.6.self_attn.q_proj.weight', 'ernie.encoder.layers.6.linear2.weight', 'ernie.encoder.layers.6.self_attn.out_proj.bias', 'ernie.encoder.layers.6.norm2.weight']
- This IS expected if you are initializing ErnieForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ErnieForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2024-01-26 13:15:09,982] [ WARNING] - Some weights of ErnieForSequenceClassification were not initialized from the model checkpoint at ernie-3.0-medium-zh and are newly initialized: ['classifier.bias', 'ernie.pooler.dense.bias', 'classifier.weight', 'ernie.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2024-01-26 13:15:10,009] [    INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'ernie-3.0-medium-zh'.
[2024-01-26 13:15:10,010] [    INFO] - Already cached *****\.paddlenlp\models\ernie-3.0-medium-zh\ernie_3.0_medium_zh_vocab.txt
[2024-01-26 13:15:10,028] [    INFO] - tokenizer config file saved in *****\.paddlenlp\models\ernie-3.0-medium-zh\tokenizer_config.json
[2024-01-26 13:15:10,030] [    INFO] - Special tokens file saved in *****\.paddlenlp\models\ernie-3.0-medium-zh\special_tokens_map.json
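To sanity-check the tokenizer, encode a single sentence and inspect the result (a quick illustrative check, not from the original post):

# Encode one sample; PaddleNLP tokenizers return a dict of lists.
encoded = tokenizer(text="今天天气真好[嘻嘻]", max_seq_len=128)
print(encoded["input_ids"])       # token ids, wrapped with [CLS] and [SEP]
print(encoded["token_type_ids"])  # all zeros for a single-sentence input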

You first need the utils package. conda install utils fails with PackagesNotFoundError: The following packages are not available from current channels; follow the blog post below and run the command it gives inside Anaconda.

Solution for "PackagesNotFoundError: The following packages are not available from current channels" (CSDN blog)

from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import convert_example, create_dataloader

# Batch size for running the model
batch_size = 32
# Maximum sequence length after tokenization
max_seq_length = 128

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack(dtype="int64")  # label
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

Even with utils installed, the third import above still fails: ImportError: cannot import name 'convert_example' from 'utils' (D:\gpu\anaconda\in\envs\py38\lib\site-packages\utils\__init__.py)

Following that path reveals that the package's __init__.py is empty; the problem is similar to the one in this post:

ImportError: cannot import name 'SVOInfo' from 'utils' (CSDN blog)

Copying the contents of utils.py from the open-source AI Studio project below into the local __init__.py fixed the import, and the code ran successfully.

『NLP Classic Projects』02: Optimizing Sentiment Analysis with the Pre-trained ERNIE Model – PaddlePaddle AI Studio (baidu.com)
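For reference, the two helpers imported from utils are roughly the following (paraphrased from the AI Studio project's utils.py; consult that file for the exact version):

import numpy as np
import paddle

def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    # Tokenize one sample into input ids and segment (token type) ids.
    encoded_inputs = tokenizer(text=example["text_a"], max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    return input_ids, token_type_ids

def create_dataloader(dataset, mode='train', batch_size=1,
                      batchify_fn=None, trans_fn=None):
    # Map samples through trans_fn, then batch them with batchify_fn.
    if trans_fn:
        dataset = dataset.map(trans_fn)
    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(
        dataset=dataset, batch_sampler=batch_sampler,
        collate_fn=batchify_fn, return_list=True)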

2.2 Model Training

import paddlenlp as ppnlp
import paddle
from paddlenlp.transformers import LinearDecayWithWarmup

# Peak learning rate during training
learning_rate = 5e-6
# Number of training epochs
epochs = 20  # 3
# Proportion of steps used for learning-rate warmup
warmup_proportion = 0.3
# Weight decay coefficient, a regularization strategy to reduce overfitting
weight_decay = 0.1

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
import paddle.nn.functional as F
from utils import evaluate
all_train_loss = []
all_train_accs = []
Batch = 0
Batchs = []
global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 10 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
            Batch += 10
            Batchs.append(Batch)
            all_train_loss.append(float(loss))  # convert the tensor to a Python float for plotting
            all_train_accs.append(acc)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
    evaluate(model, criterion, metric, dev_data_loader)
model.save_pretrained('/home/aistudio/checkpoint')
tokenizer.save_pretrained('/home/aistudio/checkpoint')

-------------------------------------
[2024-01-26 12:48:05,721] [    INFO] - Configuration saved in /home/aistudio/checkpoint\config.json
[2024-01-26 12:48:06,184] [    INFO] - Model weights saved in /home/aistudio/checkpoint\model_state.pdparams
[2024-01-26 12:48:06,186] [    INFO] - tokenizer config file saved in /home/aistudio/checkpoint\tokenizer_config.json
[2024-01-26 12:48:06,187] [    INFO] - Special tokens file saved in /home/aistudio/checkpoint\special_tokens_map.json
('/home/aistudio/checkpoint\\tokenizer_config.json',
 '/home/aistudio/checkpoint\\special_tokens_map.json',
 '/home/aistudio/checkpoint\\added_tokens.json')
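The training loop also relies on evaluate from utils; its core is roughly this (again a paraphrase of the AI Studio project's version):

import numpy as np
import paddle

@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    # Run the model over the dev set and report mean loss and accumulated accuracy.
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        losses.append(float(criterion(logits, labels)))
        metric.update(metric.compute(logits, labels))
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), metric.accumulate()))
    model.train()
    metric.reset()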

2.3 Plotting the Training Curves

import matplotlib.pyplot as plt

def draw_train_acc(Batchs, train_accs, train_loss):
    title = "training accs"
    plt.title(title, fontsize=24)
    plt.xlabel("batch", fontsize=14)
    plt.ylabel("acc", fontsize=14)
    plt.plot(Batchs, train_accs, color='green', label='training accs')
    plt.plot(Batchs, train_loss, color='red', label='training loss')
    plt.legend()
    plt.grid()
    plt.show()

draw_train_acc(Batchs, all_train_accs, all_train_loss)

2.4 Model Prediction

# Load the trained model parameters
import os
import paddle
params_path = 'checkpoint/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters successfully!")
from utils import predict
batch_size = 32
data = text_t  # the test split loaded in 1.2
label_map = {0: '0', 1: '1'}  # binary task: two labels only
results = predict(
    model, data, tokenizer, label_map, batch_size=batch_size)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
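The predict helper batches the raw samples, runs the model, and maps argmax indices back to label names; a condensed paraphrase of the project's version (it reuses convert_example from the sketch in 2.1):

import paddle
import paddle.nn.functional as F
from paddlenlp.data import Tuple, Pad

@paddle.no_grad()
def predict(model, data, tokenizer, label_map, batch_size=1):
    # Tokenize in test mode (no label field), pad into batches, decode argmax predictions.
    examples = [convert_example(d, tokenizer, max_seq_length=128, is_test=True)
                for d in data]
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment ids
    ): fn(samples)
    results = []
    model.eval()
    for start in range(0, len(examples), batch_size):
        input_ids, segment_ids = batchify_fn(examples[start:start + batch_size])
        logits = model(paddle.to_tensor(input_ids), paddle.to_tensor(segment_ids))
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy().tolist()
        results.extend(label_map[i] for i in idx)
    return results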

(II) Improved Version

The version above only outputs labels, not probabilities. The open-source Baidu AI Studio project below reports probabilities as well; it requires bash, or can be run directly on AI Studio.

PaddlePaddle AI Studio – AI Learning and Training Community (baidu.com)
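If you only need the confidence score, there is no need to switch projects: the predict sketch above already computes probs, so returning the maximum probability alongside each label is a small change (my variant, not from either project):

import paddle
import paddle.nn.functional as F

@paddle.no_grad()
def predict_with_probs(model, input_ids, segment_ids, label_map):
    # Return (label, probability) for every sample in one prepared batch.
    model.eval()
    probs = F.softmax(model(input_ids, segment_ids), axis=1)
    idx = paddle.argmax(probs, axis=1).numpy().tolist()
    conf = paddle.max(probs, axis=1).numpy().tolist()
    return [(label_map[i], p) for i, p in zip(idx, conf)]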

Other ways to build custom datasets:

How to Customize Datasets — PaddleNLP documentation
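The simplest of those methods skips DatasetBuilder entirely: PaddleNLP's load_dataset accepts a plain generator function, following the pattern in the documentation page above:

from paddlenlp.datasets import load_dataset

def read(data_path):
    # Yield one dict per line; load_dataset wraps the generator into a MapDataset.
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            label, text_a = line.strip().split('\t')
            yield {'text_a': text_a, 'label': int(label)}

train_ds = load_dataset(read, data_path='train.txt', lazy=False)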

Fine-tuning version:

Baidu PaddleHub: Fine-tuning ERNIE for Chinese Sentiment Analysis (Text Classification) (CSDN blog)
