Large Model Learning Notes



Large models:

Large model = pre-training + fine-tuning

1. Pre-training

Process:

  • The initial model parameters are randomly initialized;
  • A huge corpus collected from the web is cleaned several times and fed to the model for self-supervised training, whose objective is to predict the next token (a minimal sketch of this objective follows below).
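A minimal sketch of the next-token objective, assuming the transformers library and the small 'gpt2' checkpoint purely for illustration:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tokenizer('The quick brown fox', return_tensors='pt')
# Passing input_ids as labels makes the model compute the shifted
# next-token cross-entropy loss used during pre-training
outputs = model(**inputs, labels=inputs['input_ids'])
outputs.loss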

Advantages:

  • Generality: a pre-trained model is broadly applicable and can be adapted to many downstream tasks.
  • Data efficiency: thanks to pre-training, the model can be fine-tuned on relatively little data and still reach good performance.

2. Fine-tuning

Fine-tuning means continuing to train a pre-trained model on a task-specific dataset so that it adapts to a particular application scenario.

Process:

  • Choose a pre-trained model: pick an already pre-trained model such as BERT or GPT.
  • Prepare a task dataset: a dataset for the specific task, e.g. sentiment analysis or question answering.
  • Train on the task: train the pre-trained model on this dataset, adjusting its parameters so that it performs the specific task better (a minimal sketch follows below).
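A minimal sketch of these three steps, assuming 'bert-base-uncased' and a toy two-example sentiment dataset purely for illustration (the full Trainer workflow is covered in section 8):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Step 1: a pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Step 2: a (tiny) task dataset
texts, labels = ['great movie', 'terrible plot'], [1, 0]
batch = tokenizer(texts, padding=True, return_tensors='pt')

# Step 3: one training step on the task data
outputs = model(**batch, labels=torch.tensor(labels))
outputs.loss.backward()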

Advantages:

  • Fast adaptation: fine-tuning lets a pre-trained model adapt to a specific task quickly, saving time and compute.
  • Better performance: through fine-tuning, a pre-trained model can reach its best performance on the specific task.

3. Data:

  • High quality
  • Diversity
  • Authenticity: real data, not model-generated

3.1 Downloading a dataset from Hugging Face

from datasets import load_dataset
ds = load_dataset('rotten_tomatoes', split='train') 
# A slice can be requested via split, e.g. split='train[100:500]' or split='train[:50%]'

# When loaded without split, ds is a DatasetDict indexed by split name; the sample output below comes from a Q&A-style dataset:
ds['train'][0]

{'qid': 1,
 'language': 'Java',
 'level': 'Easy',
 'question': 'What is Java?',
 'answer': 'Java is a high-level, class-based, object-oriented programming language.'}
 
ds['train'].column_names
# ['qid', 'language', 'level', 'question', 'answer']

ds['train'].features
#{'qid': Value(dtype='int64', id=None), 'language': Value(dtype='string', id=None), 'level': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None)}

data_files = {'train': 'train.csv', 'test': 'test.csv'}
ds = load_dataset('namespace/your_dataset_name', data_files=data_files)

3.2 Loading local datasets

  • Local dataset via a loading script
from datasets import load_dataset
ds = load_dataset('path/to/local/loading_script/loading_script.py', split='train')
  • CSV
from datasets import load_dataset
ds = load_dataset('csv', data_files='path/to/local/my_dataset.csv')
# Loaded as a DatasetDict by default, which contains a 'train' Dataset
# Pass split='train' if you only want the inner Dataset rather than the DatasetDict
# Method 2:
from datasets import Dataset
dataset = Dataset.from_csv('path/to/local/my_dataset.csv')

Loading multiple CSV files at once

# If the files are all in the same folder
dataset = load_dataset('csv', data_dir='./all_data/', split='train')

# If not, list the path of every CSV file explicitly
dataset = load_dataset('csv', data_files=['./csv1.csv', './csv2.csv'], split='train')
  • JSON
from datasets import load_dataset
ds = load_dataset('json', data_files='path/to/local/my_dataset.json')

3.3 Streaming a dataset instead of downloading it all at once

from datasets import load_dataset
ds = load_dataset('rotten_tomatoes', split='train', streaming=True)
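With streaming=True the result is an IterableDataset, so examples are consumed by iteration rather than indexing; a minimal sketch:

# Each example is fetched lazily over the network
next(iter(ds))

# take() also works lazily
for example in ds.take(3):
    print(example['text'])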

3.4 Splitting a dataset

split_dataset = dataset.train_test_split(test_size=0.2, shuffle=True,  seed=123)

# If the label distribution has to be preserved
split_dataset = dataset.train_test_split(test_size=0.2, stratify_by_column='label')
# splits while keeping the proportion of each label (the column must be a ClassLabel feature)
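If a validation split is also needed, the same call can simply be applied twice; a minimal sketch:

split = dataset.train_test_split(test_size=0.2, seed=123)
dev_test = split['test'].train_test_split(test_size=0.5, seed=123)
dataset_splits = {
    'train': split['train'],
    'validation': dev_test['train'],
    'test': dev_test['test'],
}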

3.5 Selecting and filtering

Method 1: slicing with [:num]

ds['train'][:2]
# returns a dict of lists
{'qid': [1, 2],
 'language': ['Java', 'Java'],
 'level': ['Easy', 'Easy'],
 'question': ['What is Java?', 'What is the difference between JDK and JRE?'],
 'answer': ['Java is a high-level, class-based, object-oriented programming language.',
  'JDK (Java Development Kit) is for development purposes, while JRE (Java Runtime Environment) is for running Java programs.']}

Method 2: selecting rows with select()

a = ds['train'].select([0,1,4])
# the result of select() is still a Dataset
Dataset({
    features: ['qid', 'language', 'level', 'question', 'answer'],
    num_rows: 3
})

Filtering: filter(func / lambda)

new_ds = ds['train'].filter(lambda x: 'Java' in x['language'])

3.6 Processing all data uniformly with map()

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')

def process_func(example):
    model_inputs = tokenizer(example['content'], max_length=512, truncation=True)
    labels = tokenizer(example['title'], max_length=32, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

processed_datasets = dataset.map(process_func, batched=True)

# When the model has no fast tokenizer, multiprocessing (num_proc) can be used instead,
# but the worker processes do not have the tokenizer defined, so it has to be passed
# into process_func as an argument (e.g. via fn_kwargs):
def process_func(example, tokenizer):
    ...

processed_datasets = dataset.map(process_func, num_proc=4, fn_kwargs={'tokenizer': tokenizer})

3.7 Removing columns

processed_datasets = dataset.map(process_func, remove_columns=dataset['train'].column_names)

3.8 Saving and loading

Save

save_path = './processed_dataset'
processed_dataset.save_to_disk(save_path)

Load

from datasets import load_from_disk
dataset = load_from_disk(save_path)

3.9 Building your own dataset

Data preparation steps:

  • Collect data pairs (question-answer)

  • Concatenate the pairs and add prompt text

  • Tokenize: tokenizer, pad, truncate

  • Split into training and test data

  • Load the dataset with a custom loading script (a sketch of the first four steps and the loading script follow below)
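A minimal sketch of the first four steps, using toy Q-A pairs, a purely illustrative prompt template, and the bert-base-chinese tokenizer as placeholders:

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')

qa_pairs = [
    {'question': 'What is Java?', 'answer': 'A high-level, object-oriented language.'},
    {'question': 'What is the JDK?', 'answer': 'The Java Development Kit.'},
]

# Concatenate each pair into a prompt (the template is an assumption; adapt it to your model)
texts = [f"Question: {p['question']}\nAnswer: {p['answer']}" for p in qa_pairs]
dataset = Dataset.from_dict({'text': texts})

# Tokenize with truncation/padding, then split into train and test
tokenized = dataset.map(lambda e: tokenizer(e['text'], max_length=128, truncation=True, padding='max_length'),
                        batched=True)
splits = tokenized.train_test_split(test_size=0.5, seed=42)

The custom loading script mentioned in the last step can look like the example below.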

import json
import datasets
from datasets import DownloadManager, DatasetInfo


class CMRC2018TRIAL(datasets.GeneratorBasedBuilder):

    def _info(self) -> DatasetInfo:
        """
            info方法, 定义数据集的信息,这里要对数据的字段进行定义
        :return:
        """
        return datasets.DatasetInfo(
            description="CMRC2018 trial",
            features=datasets.Features({
                    "id": datasets.Value("string"),
                    "context": datasets.Value("string"),
                    "question": datasets.Value("string"),
                    "answers": datasets.features.Sequence(
                        {
                            "text": datasets.Value("string"),
                            "answer_start": datasets.Value("int32"),
                        }
                    )
                })
        )

    def _split_generators(self, dl_manager: DownloadManager):
        """
            返回datasets.SplitGenerator
            涉及两个参数: name和gen_kwargs
            name: 指定数据集的划分
            gen_kwargs: 指定要读取的文件的路径, 与_generate_examples的入参数一致
        :param dl_manager:
        :return: [ datasets.SplitGenerator ]
        """
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, 
                                        gen_kwargs={"filepath": "./cmrc2018_trial.json"})]

    def _generate_examples(self, filepath):
        """
            生成具体的样本, 使用yield
            需要额外指定key, id从0开始自增就可以
        :param filepath:
        :return:
        """
        # Yields (key, example) tuples from the dataset
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)
            for example in data["data"]:
                for paragraph in example["paragraphs"]:
                    context = paragraph["context"].strip()
                    for qa in paragraph["qas"]:
                        question = qa["question"].strip()
                        id_ = qa["id"]

                        answer_starts = [answer["answer_start"] for answer in qa["answers"]]
                        answers = [answer["text"].strip() for answer in qa["answers"]]

                        yield id_, {
                            "context": context,
                            "question": question,
                            "id": id_,
                            "answers": {
                                "answer_start": answer_starts,
                                "text": answers,
                            },
                        }


dataset = load_dataset("./load_script.py", split="train")

3.10 Dataset with DataCollator

Adding padding

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

collator = DataCollatorWithPadding(tokenizer=tokenizer)

dl = DataLoader(tokenized_dataset, batch_size=4, collate_fn=collator, shuffle=True)
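A quick check that the collator pads each batch dynamically (this assumes tokenized_dataset only contains the tokenizer output columns, e.g. input_ids and attention_mask, plus optionally labels):

batch = next(iter(dl))
print(batch['input_ids'].shape)        # (4, length of the longest sequence in this batch)
print(batch['attention_mask'].shape)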

4. Pipeline

4.1 Listing all tasks supported by pipeline and their models

import transformers
from transformers.pipelines import SUPPORTED_TASKS

for k,v in SUPPORTED_TASKS.items():
  print(k,v)

This prints each task name together with the corresponding PyTorch and TensorFlow model classes.

4.2 Using a pipeline

from transformers import pipeline
pipe = pipeline('text-classification')
# help(pipe)
pipe('very good')

Use help(pipe) to see the input format the model expects at prediction time.
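The task name alone picks a default checkpoint; a specific model and device can also be requested explicitly (a sketch, reusing the Chinese sentiment checkpoint from the next subsection):

pipe = pipeline('text-classification',
                model='uer/roberta-base-finetuned-dianping-chinese',
                device=0)   # device=0 runs on the first GPU; omit it to stay on CPU
pipe('我觉得还可以')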

4.3 Implementing a pipeline manually

# Load the tokenizer and model
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-dianping-chinese')
model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-dianping-chinese')

# Input data
input_text = '我觉得还可以'
inputs = tokenizer(input_text, return_tensors='pt')
# inputs {'input_ids': tensor([[ 101, 2769, 6230, 2533, 6820, 1377,  809,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

# Model output
res = model(**inputs)
# res SequenceClassifierOutput(loss=None, logits=tensor([[ 0.3320, -0.4150]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

# Post-processing
logits = res.logits
logits = torch.softmax(logits, dim=-1)
# logits tensor([[0.6785, 0.3215]], grad_fn=<SoftmaxBackward0>)

# Take the index of the highest probability
pred = torch.argmax(logits).item()
# pred 0

model.config.id2label 
#{0: 'negative (stars 1, 2 and 3)', 1: 'positive (stars 4 and 5)'}

result = model.config.id2label.get(pred)
# result negative (stars 1, 2 and 3)


5. Tokenizer

A tokenizer is tied to a given model and must be used together with that model.

It converts text into numbers (the subword vocabulary is built from the frequency of character sequences).

5.1 Loading a tokenizer

AutoTokenizer automatically chooses the right tokenizer class for a given model.

  • Saving the tokenizer to a given path:
save_path = './tokenizer'
tokenizer.save_pretrained(save_path) 
  • Loading a local tokenizer:
tokenizer = AutoTokenizer.from_pretrained(save_path)

5.2 Tokenization

tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-dianping-chinese')
tokens = tokenizer.tokenize('very well')
# tokens ['very', 'well']  splits the text into tokens
tokenizer.vocab  # inspect the vocabulary

# convert tokens to ids
ids = tokenizer.convert_tokens_to_ids(tokens)
# ids [11785, 12010]

# convert ids back to tokens
tokens = tokenizer.convert_ids_to_tokens(ids)
#tokens ['very', 'well']

# convert tokens back to a string
str_text = tokenizer.convert_tokens_to_string(tokens)
#str_text very well

# add padding
ids = tokenizer.encode(str_text, add_special_tokens=True, padding='max_length', max_length=50, truncation=False)

# compute the attention_mask manually (this assumes the pad token id is 0)
attention_mask = [1 if i != 0 else 0 for i in ids]

5.3 A more convenient way

ids = tokenizer.encode(str_text, add_special_tokens=False) # add_special_tokens controls whether the special tokens ([CLS]/[SEP]) are added at the start and end
# [11785, 12010]

str_text = tokenizer.decode(ids, skip_special_tokens=False)
# str_text '[CLS] very well [SEP]'  (when the ids were encoded with add_special_tokens=True)

5.4 Calling the tokenizer directly

inputs = tokenizer.encode_plus(str_text,padding='max_length', max_length=20)
# {'input_ids': [101, 8228, 162, 11944, 11052, 9264, 117, 8997, 10564, 8303, 8228, 11485, 8984, 8165, 11052, 8217, 11300, 8118, 113, 10539, 114, 8228, 8228, 11285, 8231, 8118, 119, 102], 
#'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

inputs = tokenizer(str_text,padding='max_length', max_length=20)

5.5 Fast/Slow Tokenizer

Fast tokenizers are implemented in Rust,
slow tokenizers in Python.
A fast tokenizer returns some additional values (e.g. offset mappings).

# slow
tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-dianping-chinese',use_fast=False)
# fast
fast_tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-dianping-chinese',use_fast=True)

inputs = fast_tokenizer(str_text, return_offsets_mapping=True)  # offset mappings require a fast tokenizer

{'input_ids': [101, 8228, 162, 11944, 11052, 9264, 117, 8997, 10564, 8303, 8228, 11485, 8984, 8165, 11052, 8217, 11300, 8118, 113, 10539, 114, 8228, 8228, 11285, 8231, 8118, 119, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
'offset_mapping': [(0, 0), (0, 2), (3, 4), (4, 8), (9, 12), (13, 18), (18, 19), (20, 22), (23, 25), (25, 27), (28, 30), (31, 34), (34, 37), (37, 38), (39, 42), (43, 45), (45, 48), (48, 49), (50, 51), (51, 55), (55, 56), (57, 59), (60, 62), (62, 65), (66, 68), (68, 69), (69, 70), (0, 0)]}
 # (0, 2) means characters 0-2 of the original string form one token

inputs.word_ids()
#[None,0,1, 1, 2, 3, 4, 5, 6, 6, 7, 8, 8, 8, 9, 10, 10, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, None]
# a word may be split into several tokens, e.g. dreaming = dream + ing; word_ids() maps each token back to its word index

5.6 Loading a dataset

# Load the dataset
from datasets import load_dataset

dataset = load_dataset("glue", "mrpc", split="train")

# Load the model and tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = 'hi, how are you?'

encoded_text = tokenizer(text) 
# The input can be a single sentence or a list of sentences;
# it returns a dict containing input_ids (the encoded sentence), attention_mask (0 for positions that should not take part in the computation, 1 otherwise) and token_type_ids (marks which sentence each token belongs to when several sentences are passed in).
# The example below is a tokenized MRPC example, i.e. the dataset fields (sentence1, sentence2, label, idx) plus the tokenizer outputs:

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': array([  101,  7277,  2180,  5303,  4806,  1117,  1711,   117,  2292, 1119,  1270,   107,  1103,  7737,   107,   117,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102, 11336,  6732, 3384,  1106,  1140,  1112,  1178,   107,  1103,  7737,   107, 117,  7277,  2180,  5303,  4806,  1117,  1711,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102]),
'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}

decoded_text = tokenizer.decode(encoded_text['input_ids']) # returns the original sentence (plus special tokens)

5.7 Batch-processing the whole dataset with map()

def encode(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length", return_tensors='pt')

dataset = dataset.map(encode, batched=True, batch_size=1, drop_last_batch=True)

5.8 Changing the padding token

tokenizer.pad_token = tokenizer.eos_token

5.9 Truncation

tokenizer.truncation_side = 'left' # truncate from the left side

6. Model

Model types:

  • Encoder models: bidirectional attention, every token sees the whole context (early BERT); hard to use for text generation tasks.
  • Decoder models: unidirectional (causal) attention, a token only sees itself and the preceding context, not what follows (GPT, LLaMA).
  • Encoder-decoder models: the encoder uses bidirectional attention, the decoder uses unidirectional attention.

Depending on the task, a model head is attached on top of the base model:

  • Model (the base model itself)
  • *ForCausalLM
  • *ForMaskedLM
  • *ForSeq2SeqLM
  • *ForQuestionAnswering

Models without a model head

# Load the model (without a task head)
from transformers import AutoModel
model = AutoModel.from_pretrained('hfl/rbt3')
# Inspect the model configuration
model.config # the config controls details of what the model returns

# Input
inputs = tokenizer.encode_plus(str_text, padding='max_length', max_length=20, return_tensors='pt')
# Output
out = model(**inputs)
out.last_hidden_state # individual fields are taken out with the dot operator

For a model without a model head, last_hidden_state has shape (batch_size, seq_len, 768).

Models with a model head

  • The base model produces the hidden states first,
  • followed by dropout,
  • then a classification head: self.classifier = nn.Linear(hidden_size, num_labels),
  • and if labels are passed along with the input, the loss is computed as well (see the sketch below).
clz_model = AutoModelForSequenceClassification.from_pretrained('hfl/rbt3')

clz_model(**inputs)
# SequenceClassifierOutput(loss=None, logits=tensor([[ 1.1540, -0.3010]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
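Passing labels makes the head compute the loss as well; a minimal sketch (the label value is a placeholder):

import torch

out = clz_model(**inputs, labels=torch.tensor([1]))
out.loss    # cross-entropy loss for the given label
out.logits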

7. Evaluate

The evaluate library wraps a number of evaluation metrics.

7.1 Listing the supported metrics

import evaluate
# list the supported evaluation metrics
evaluate.list_evaluation_modules()

'LuckiestOne/valid_efficiency_score',
 'Fritz02/execution_accuracy',
 'huanghuayu/multiclass_brier_score',
 'jialinsong/apps_metric',
 'DoctorSlimm/bangalore_score',
 'agkphysics/ccc',
 'DoctorSlimm/kaushiks_criteria',
 'CZLC/rouge_raw',
 'bascobasculino/mot-metrics',
 ...

evaluate.list_evaluation_modules(include_community=False, with_details=True)

7.2 Loading a metric

accuracy = evaluate.load('accuracy')
# view the description
accuracy.description
#
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative

# view the description of the expected inputs
accuracy.inputs_description

7.3 Computing metrics

  • Computing everything at once
results = accuracy.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
  • Iterative computation
for ref, pred in zip([0, 1, 0], [1, 1, 0]):
	accuracy.add(reference=ref, prediction=pred)
accuracy.compute()
  • Batch computation
for refs, preds in zip([[0, 1], [1, 0]], [[1, 1], [1, 0]]):
	accuracy.add_batch(references=refs, predictions=preds)
accuracy.compute()

7.4 Combining several metrics

cls_metrics = evaluate.combine(['accuracy', 'f1','recall'])
cls_metrics.compute(predictions=[0, 1, 0], references=[1, 1, 0])
# {'accuracy': 0.6666666666666666, 'f1': 0.6666666666666666, 'recall': 0.5}

7.5 Visualizing evaluation results

from evaluate.visualization import radar_plot   # currently only radar plots are supported
data = [
   {"accuracy": 0.99, "precision": 0.8, "f1": 0.95, "latency_in_seconds": 33.6},
   {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_in_seconds": 11.2},
   {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_in_seconds": 87.6}, 
   {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_in_seconds": 101.6}
   ]
model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]

plot = radar_plot(data=data, model_names=model_names)

8. Trainer

8.1 Creating TrainingArguments

from transformers import Trainer,TrainingArguments
train_args = TrainingArguments(output_dir="./checkpoints",      # 输出文件夹
                               per_device_train_batch_size=64,  # 训练时的batch_size
                               per_device_eval_batch_size=128,  # 验证时的batch_size
                               logging_steps=10,                # log 打印的频率
                               evaluation_strategy="epoch",     # 评估策略
                               save_strategy="epoch",           # 保存策略
                               save_total_limit=3,              # 最大保存数
                               learning_rate=2e-5,              # 学习率
                               weight_decay=0.01,               # weight_decay
                               metric_for_best_model="f1",      # 设定评估指标
                               load_best_model_at_end=True)     # 训练完成后加载最优模型

8.2 Creating the Trainer

import evaluate
acc_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def eval_metric(eval_predict):
    predictions, labels = eval_predict
    predictions = predictions.argmax(axis=-1)
    acc = acc_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels)
    acc.update(f1)
    return acc
    
from transformers import DataCollatorWithPadding
trainer = Trainer(model=model, 
                  args=train_args, 
                  train_dataset=tokenized_datasets["train"], 
                  eval_dataset=tokenized_datasets["test"], 
                  data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
                  compute_metrics=eval_metric)

8.3 Training

trainer.train()

8.4 Evaluation

trainer.evaluate()

8.5 Prediction

trainer.predict(tokenized_datasets['test'])
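predict() returns a PredictionOutput with predictions, label_ids and metrics; a minimal sketch of turning the logits into class labels:

pred_output = trainer.predict(tokenized_datasets['test'])
pred_output.metrics                                   # test_loss plus the compute_metrics results
pred_labels = pred_output.predictions.argmax(axis=-1)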

9. Problems that can show up in NLP models

9.1 Error Analysis

  • Misspellings
  • Overly long sentences
  • Repetitive sentences

10. Model optimization

  • Gradient accumulation:
    Use a smaller batch_size and accumulate gradients over several steps before updating the parameters -> reduces the memory taken by forward activations.
    Add gradient_accumulation_steps=32 to TrainingArguments.
    Training time increases.
  • Gradient checkpointing:
    gradient_checkpointing=True drops some intermediate activations of the forward pass (they are recomputed during the backward pass).
  • Optimizer:
    optim='adafactor' keeps less optimizer state than Adam.
  • Freeze the model:
    Freeze the BERT encoder: iterate over all of its parameters and set requires_grad=False.
for name, param in model.bert.named_parameters():
	param.requires_grad = False
  • Data length:
    Truncate the input length.
    A combined sketch of these options follows below.
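A combined sketch of the memory-saving options above (the argument names come from transformers' TrainingArguments; the concrete values are placeholders):

from transformers import TrainingArguments

train_args = TrainingArguments(
    output_dir='./checkpoints',
    per_device_train_batch_size=2,       # small per-step batch ...
    gradient_accumulation_steps=32,      # ... accumulated to an effective batch size of 64
    gradient_checkpointing=True,         # recompute activations instead of storing them
    optim='adafactor',                   # lighter optimizer state than Adam
)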