ERNIE 3.0 Practice Notes

I. Code Practice

(I) Basic Version

The overall framework of this experiment follows this blog post:

https://lizhiyang.blog.csdn.net/article/details/132394853

Based on the Ernie-3.0-medium-zh model, that post performs sentiment analysis on 600 movie reviews of the film 孤注一掷 (No More Bets) and visualizes them as a word cloud. It cleans the data with regular expressions and uses a DatasetBuilder subclass from paddlenlp.datasets to process it into the format [{'text_a': 'data', 'label': label}, ...].
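For context, the cleaning step amounts to stripping non-review noise with re before building that list of dicts. A minimal sketch of such a cleaner (the patterns below are illustrative, not the post's exact code):

import re

def clean_text(text):
    """Strip URLs, @mentions, and retweet separators, keeping the review text."""
    text = re.sub(r'https?://\S+', '', text)  # drop links
    text = re.sub(r'@\S+', '', text)          # drop @mentions
    text = re.sub(r'//', ' ', text)           # drop retweet separators
    return text.strip()

print(clean_text('太好看了!//@某人: http://t.cn/xyz'))  # -> 太好看了!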

1 Data Processing

1.1 Splitting the Dataset

import random

# Read the contents of the custom .txt file
with open('weibo_senti_100k.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Shuffle the data randomly
random.shuffle(lines)

# Compute the split indices
total_lines = len(lines)
train_end = int(total_lines * 0.7)
dev_end = int(total_lines * 0.9)

# Split the data
train_data = lines[:train_end]
dev_data = lines[train_end:dev_end]
test_data = lines[dev_end:]

# Write the training data to train.txt
with open('train.txt', 'w', encoding='utf-8') as file:
    file.writelines(train_data)

# Write the validation data to dev.txt
with open('dev.txt', 'w', encoding='utf-8') as file:
    file.writelines(dev_data)

# Write the test data to test.txt
with open('test.txt', 'w', encoding='utf-8') as file:
    file.writelines(test_data)
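Note that random.shuffle above is unseeded, so each run produces a different split. To make the split reproducible, seed the generator before shuffling (my addition, not part of the original script):

random.seed(42)       # any fixed value makes the shuffle, and hence the split, repeatable
random.shuffle(lines)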

The dataset used here is weibo_senti_100k: ChineseNlpCorpus/datasets/weibo_senti_100k/intro.ipynb at master · SophonPlus/ChineseNlpCorpus · GitHub

Split the Weibo dataset into three parts: 70% training, 20% validation, and 10% test. The raw file is UTF-8 encoded but has no column names. To add them: open the file in Notepad++ and convert it to ANSI, rename it to .csv so it opens normally, make the edits, save it as a Unicode .txt, and finally convert it back to UTF-8 in Notepad++ (a pure-Python alternative is sketched after the download link below).

Notepad++ download (Baidu Netdisk):

https://pan.baidu.com/s/14cRU0EjD0BiPl5doYj6pMA
[Extraction code]: kwii
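Alternatively, the whole encoding round-trip can be done in a few lines of Python, assuming the raw file really is UTF-8 and the goal is just to add a header row (the column names here are my own choice, matching the parsing code later on):

# Read the UTF-8 source and write it back, still UTF-8, with a header row prepended.
with open('weibo_senti_100k.txt', 'r', encoding='utf-8') as src:
    lines = src.readlines()

with open('weibo_senti_100k_with_header.txt', 'w', encoding='utf-8') as dst:
    dst.write('label\ttext_a\n')  # hypothetical column names
    dst.writelines(lines)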

1.2 Loading the Data

# Import DatasetBuilder
from paddlenlp.datasets import DatasetBuilder


class NewsData(DatasetBuilder):
    SPLITS = {
        'train': r'train.txt',  # training set
        'dev': r'dev.txt',      # validation set
        'test': r'test.txt'     # test set
    }

    def _get_data(self, mode, **kwargs):
        filename = self.SPLITS[mode]
        return filename

    def _read(self, filename):
        """Read the data file line by line."""
        with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                if line == '\n':
                    continue
                data = line.strip().split("\t")    # columns are separated by '\t'
                label, text_a = data
                text_a = text_a.replace(" ", "")
                if label in ['0', '1']:
                    yield {"text_a": text_a, "label": label}  # sample format: text_a, label; adjust to your data as needed

    def get_labels(self):
        return label_list   # class labels

-------------------------------------------
D:******lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[This is a warning, not an error; it is harmless and does not affect execution.]
# Define the dataset loading function
def load_dataset(name=None,
                 data_files=None,
                 splits=None,
                 lazy=None,
                 **kwargs):

    reader_cls = NewsData  # use the dataset format defined above
    print(reader_cls)
    if not name:
        reader_instance = reader_cls(lazy=lazy, **kwargs)
    else:
        reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)
    datasets = reader_instance.read_datasets(data_files=data_files, splits=splits)
    return datasets

# Load the training, validation, and test sets
label_list = ['0', '1']
train_ds, dev_ds, text_t = load_dataset(splits=['train', 'dev', 'test'])  # text_t is the test split


-------------------------------------
<class '__main__.NewsData'>

1.3 Inspecting the Data

# Show the first five samples
train_ds[:5]

--------------------------------------
[{'text_a': '[抓狂][抓狂][抓狂]起晚了[泪]', 'label': 0},
 {'text_a': '分享图片,不要啊~~~虽然我很喜欢周迅,可是八阿哥,你一定要等晴川啊~~[泪]', 'label': 0},
 {'text_a': '想shi的就上飞机吧???只见君去不见君还[泪]//@玉翠文章:不错很丰满,赛过杨贵妃,我喜欢,带着你超过盖茨4倍家财的嫁妆来吧[酷]请各路大仙作媒',
  'label': 0},
 {'text_a': '多谢支持!//@洪三水:[嘻嘻]画面不错,其他继续研究中。。//@曹欣Dyson:抢怪的那叫个多。-。-//@刘波BOB:钢铁侠,有没有?有没有?亮了!',
  'label': 1},
 {'text_a': '#周末节奏#美好的一天从早餐开始,黄金蛋炒饭,番茄牛尾汤[嘻嘻]', 'label': 1}]

2 The ERNIE 3.0 Model

2.1 Loading the Model

import os
import paddle
import paddlenlp
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "ernie-3.0-medium-zh"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_classes=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

---------------------------------------
[2024-01-26 13:15:04,015] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification'> to load 'ernie-3.0-medium-zh'.
[2024-01-26 13:15:04,016] [    INFO] - Already cached *****\.paddlenlp\models\ernie-3.0-medium-zh\model_state.pdparams
[2024-01-26 13:15:04,017] [    INFO] - Loading weights file model_state.pdparams from cache at *****\.paddlenlp\models\ernie-3.0-medium-zh\model_state.pdparams
[2024-01-26 13:15:04,276] [    INFO] - Loaded weights file from disk, setting weights to model.
[2024-01-26 13:15:09,982] [ WARNING] - Some weights of the model checkpoint at ernie-3.0-medium-zh were not used when initializing ErnieForSequenceClassification: ['ernie.encoder.layers.6.self_attn.k_proj.weight', 'ernie.encoder.layers.6.self_attn.q_proj.bias', 'ernie.encoder.layers.6.linear1.weight', 'ernie.encoder.layers.6.norm2.bias', 'ernie.encoder.layers.6.self_attn.k_proj.bias', 'ernie.encoder.layers.6.self_attn.v_proj.bias', 'ernie.encoder.layers.6.self_attn.out_proj.weight', 'ernie.encoder.layers.6.self_attn.v_proj.weight', 'ernie.encoder.layers.6.norm1.bias', 'ernie.encoder.layers.6.norm1.weight', 'ernie.encoder.layers.6.linear2.bias', 'ernie.encoder.layers.6.linear1.bias', 'ernie.encoder.layers.6.self_attn.q_proj.weight', 'ernie.encoder.layers.6.linear2.weight', 'ernie.encoder.layers.6.self_attn.out_proj.bias', 'ernie.encoder.layers.6.norm2.weight']
- This IS expected if you are initializing ErnieForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ErnieForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2024-01-26 13:15:09,982] [ WARNING] - Some weights of ErnieForSequenceClassification were not initialized from the model checkpoint at ernie-3.0-medium-zh and are newly initialized: ['classifier.bias', 'ernie.pooler.dense.bias', 'classifier.weight', 'ernie.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2024-01-26 13:15:10,009] [    INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'ernie-3.0-medium-zh'.
[2024-01-26 13:15:10,010] [    INFO] - Already cached *****\.paddlenlp\models\ernie-3.0-medium-zh\ernie_3.0_medium_zh_vocab.txt
[2024-01-26 13:15:10,028] [    INFO] - tokenizer config file saved in *****\.paddlenlp\models\ernie-3.0-medium-zh\tokenizer_config.json
[2024-01-26 13:15:10,030] [    INFO] - Special tokens file saved in *****\.paddlenlp\models\ernie-3.0-medium-zh\special_tokens_map.json
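To sanity-check the tokenizer, encode a single sentence and inspect the result (a quick illustrative check, not from the original post):

# Encode one sample; PaddleNLP tokenizers return a dict of lists.
encoded = tokenizer(text="今天天气真好[嘻嘻]", max_seq_len=128)
print(encoded["input_ids"])       # token ids, wrapped with [CLS] and [SEP]
print(encoded["token_type_ids"])  # all zeros for a single-sentence input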

You first need the utils package. conda install utils fails with PackagesNotFoundError: The following packages are not available from current channels; follow the blog post below and run the command it gives inside Anaconda.

Solution for "PackagesNotFoundError: The following packages are not available from current channels" (CSDN blog)

from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import convert_example, create_dataloader

# Batch size for running the model
batch_size = 32
# Maximum sequence length after tokenization
max_seq_length = 128

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack(dtype="int64")  # label
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

Even with utils installed, the third import above still fails: ImportError: cannot import name 'convert_example' from 'utils' (D:\gpu\anaconda\in\envs\py38\lib\site-packages\utils\__init__.py)

Following that path reveals that the package's __init__.py is empty; the problem is similar to the one in this post:

ImportError: cannot import name 'SVOInfo' from 'utils' (CSDN blog)

Copying the contents of utils.py from the open-source AI Studio project below into the local __init__.py fixed the import, and the code ran successfully.

『NLP Classic Projects』02: Optimizing Sentiment Analysis with the Pre-trained ERNIE Model – PaddlePaddle AI Studio (baidu.com)
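For reference, the two helpers imported from utils are roughly the following (paraphrased from the AI Studio project's utils.py; consult that file for the exact version):

import numpy as np
import paddle

def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    # Tokenize one sample into input ids and segment (token type) ids.
    encoded_inputs = tokenizer(text=example["text_a"], max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    return input_ids, token_type_ids

def create_dataloader(dataset, mode='train', batch_size=1,
                      batchify_fn=None, trans_fn=None):
    # Map samples through trans_fn, then batch them with batchify_fn.
    if trans_fn:
        dataset = dataset.map(trans_fn)
    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(
        dataset=dataset, batch_sampler=batch_sampler,
        collate_fn=batchify_fn, return_list=True)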

2.2 Model Training

import paddlenlp as ppnlp
import paddle
from paddlenlp.transformers import LinearDecayWithWarmup

# Peak learning rate during training
learning_rate = 5e-6
# Number of training epochs
epochs = 20  # 3
# Proportion of steps used for learning-rate warmup
warmup_proportion = 0.3
# Weight decay coefficient, a regularization strategy to reduce overfitting
weight_decay = 0.1

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
import paddle.nn.functional as F
from utils import evaluate
all_train_loss = []
all_train_accs = []
Batch = 0
Batchs = []
global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 10 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
            Batch += 10
            Batchs.append(Batch)
            all_train_loss.append(float(loss))  # convert the tensor to a Python float for plotting
            all_train_accs.append(acc)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
    evaluate(model, criterion, metric, dev_data_loader)
model.save_pretrained('/home/aistudio/checkpoint')
tokenizer.save_pretrained('/home/aistudio/checkpoint')

-------------------------------------
[2024-01-26 12:48:05,721] [    INFO] - Configuration saved in /home/aistudio/checkpoint\config.json
[2024-01-26 12:48:06,184] [    INFO] - Model weights saved in /home/aistudio/checkpoint\model_state.pdparams
[2024-01-26 12:48:06,186] [    INFO] - tokenizer config file saved in /home/aistudio/checkpoint\tokenizer_config.json
[2024-01-26 12:48:06,187] [    INFO] - Special tokens file saved in /home/aistudio/checkpoint\special_tokens_map.json
('/home/aistudio/checkpoint\\tokenizer_config.json',
 '/home/aistudio/checkpoint\\special_tokens_map.json',
 '/home/aistudio/checkpoint\\added_tokens.json')
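The training loop also relies on evaluate from utils; its core is roughly this (again a paraphrase of the AI Studio project's version):

import numpy as np
import paddle

@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    # Run the model over the dev set and report mean loss and accumulated accuracy.
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        losses.append(float(criterion(logits, labels)))
        metric.update(metric.compute(logits, labels))
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), metric.accumulate()))
    model.train()
    metric.reset()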

2.3 Plotting the Training Curves

import matplotlib.pyplot as plt

def draw_train_acc(Batchs, train_accs, train_loss):
    title = "training accs"
    plt.title(title, fontsize=24)
    plt.xlabel("batch", fontsize=14)
    plt.ylabel("acc", fontsize=14)
    plt.plot(Batchs, train_accs, color='green', label='training accs')
    plt.plot(Batchs, train_loss, color='red', label='training loss')
    plt.legend()
    plt.grid()
    plt.show()

draw_train_acc(Batchs, all_train_accs, all_train_loss)

2.4 Model Prediction

# Load the trained model parameters
import os
import paddle
params_path = 'checkpoint/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters successfully!")
from utils import predict
batch_size = 32
data = text_t  # the test split loaded in 1.2
label_map = {0: '0', 1: '1'}  # binary task: two labels only
results = predict(
    model, data, tokenizer, label_map, batch_size=batch_size)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
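The predict helper batches the raw samples, runs the model, and maps argmax indices back to label names; a condensed paraphrase of the project's version (it reuses convert_example from the sketch in 2.1):

import paddle
import paddle.nn.functional as F
from paddlenlp.data import Tuple, Pad

@paddle.no_grad()
def predict(model, data, tokenizer, label_map, batch_size=1):
    # Tokenize in test mode (no label field), pad into batches, decode argmax predictions.
    examples = [convert_example(d, tokenizer, max_seq_length=128, is_test=True)
                for d in data]
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment ids
    ): fn(samples)
    results = []
    model.eval()
    for start in range(0, len(examples), batch_size):
        input_ids, segment_ids = batchify_fn(examples[start:start + batch_size])
        logits = model(paddle.to_tensor(input_ids), paddle.to_tensor(segment_ids))
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy().tolist()
        results.extend(label_map[i] for i in idx)
    return results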

(II) Improved Version

The version above only outputs labels, not probabilities. The open-source Baidu AI Studio project below reports probabilities as well; it requires bash, or can be run directly on AI Studio.

PaddlePaddle AI Studio – AI Learning and Training Community (baidu.com)
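If you only need the confidence score, there is no need to switch projects: the predict sketch above already computes probs, so returning the maximum probability alongside each label is a small change (my variant, not from either project):

import paddle
import paddle.nn.functional as F

@paddle.no_grad()
def predict_with_probs(model, input_ids, segment_ids, label_map):
    # Return (label, probability) for every sample in one prepared batch.
    model.eval()
    probs = F.softmax(model(input_ids, segment_ids), axis=1)
    idx = paddle.argmax(probs, axis=1).numpy().tolist()
    conf = paddle.max(probs, axis=1).numpy().tolist()
    return [(label_map[i], p) for i, p in zip(idx, conf)]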

Other ways to build custom datasets:

How to Customize Datasets — PaddleNLP documentation
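The simplest of those methods skips DatasetBuilder entirely: PaddleNLP's load_dataset accepts a plain generator function, following the pattern in the documentation page above:

from paddlenlp.datasets import load_dataset

def read(data_path):
    # Yield one dict per line; load_dataset wraps the generator into a MapDataset.
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            label, text_a = line.strip().split('\t')
            yield {'text_a': text_a, 'label': int(label)}

train_ds = load_dataset(read, data_path='train.txt', lazy=False)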

Fine-tuning version:

Baidu PaddleHub: Fine-tuning ERNIE for Chinese Sentiment Analysis (Text Classification) (CSDN blog)
