Getting Started with Transformers

Latest recommended article published 2024-09-30 08:05:35

lilsyoss

Article link: https://blog.csdn.net/lilsyoss/article/details/132023526

pipeline

A pipeline is a ready-made wrapper around a range of common NLP tasks, so simple needs can be handled quickly.

from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("This morning I went to the ", max_length=10, num_return_sequences=5)

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

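Vector & Tensor

While we are here, a quick word on vectors and tensors. I won't explain them from a mathematical or physical angle, because I only use them in computing.

A vector is a one-dimensional array of numbers.

A tensor is the general term that covers arrays of any number of dimensions: a single number (0-D), a vector (1-D), a matrix (2-D), and 3-D, 4-D and higher arrays. There are better explanations online.

The referenced article is, in my view, enough for computing purposes. Pay attention to ndim (the number of dimensions, or nesting levels) and shape (the size along each dimension).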
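To make ndim and shape concrete, here is a minimal sketch in plain Python. The helpers `shape_of` and `ndim_of` are hypothetical names written for this illustration; they are meant to mirror what the `ndim` and `shape` attributes of array libraries report, and they assume the nested lists are regular (every sub-list at the same level has the same length).

```python
def shape_of(data):
    """Return the shape of a regularly nested list as a tuple."""
    shape = []
    # Walk down the first element of each level and record its length.
    while isinstance(data, list):
        shape.append(len(data))
        if not data:
            break
        data = data[0]
    return tuple(shape)


def ndim_of(data):
    """Number of dimensions = number of nesting levels."""
    return len(shape_of(data))


scalar = 7                                  # 0-D: a single number
vector = [1, 2, 3]                          # 1-D: a vector
matrix = [[1, 2, 3], [4, 5, 6]]             # 2-D: 2 rows, 3 columns
cube = [[[0, 1], [2, 3]],
        [[4, 5], [6, 7]],
        [[8, 9], [10, 11]]]                 # 3-D: 3 blocks of 2x2

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("cube", cube)]:
    print(name, "ndim =", ndim_of(value), "shape =", shape_of(value))
```

Reading the shape from left to right follows the nesting from the outermost list inwards, so the length of the shape tuple is always equal to ndim.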

Model

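Transformers recommends starting from a pretrained model, which shortens the cycle from training a model to actually using it.

Downloaded models and related files are cached in this directory: ~/.cache/huggingface/transformers (newer releases of the library use ~/.cache/huggingface/hub instead).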

Download and load a model:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Save the model locally:

model.save_pretrained("directory_on_my_computer")

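Note that this call only saves the model; the tokenizer has to be saved separately.

The directory then contains two files: config.json and pytorch_model.bin (newer releases may write model.safetensors in place of pytorch_model.bin).

A tokenizer converts the strings people write, usually called raw text, into arrays of numbers.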

sequences = ["Hello!", "Cool.", "Nice!"]

encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

These lists can then easily be converted into a tensor and sent to the model. The model only accepts tensors as input:

import torch

model_inputs = torch.tensor(encoded_sequences)

output = model(model_inputs)

Tokenizers

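As mentioned above, a tokenizer turns human language into numbers.

There are three approaches to splitting the text:

1. Character-based: the text is split into individual characters

2. Word-based: the text is split into words

3. Subword tokenization: frequently used words are kept whole, while rare words are broken down into smaller subword pieces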
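The three approaches can be compared with a small sketch in plain Python. This is only an illustration, not how the library implements its tokenizers: the function names and `TOY_VOCAB` are hypothetical, and the subword splitter uses a greedy longest-match rule in the style of WordPiece, with `##` marking a piece that continues a word.

```python
def char_tokenize(text):
    """Character-based: every character is a token."""
    return list(text)


def word_tokenize(text):
    """Word-based: split on whitespace."""
    return text.split()


def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of one word (WordPiece-style)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"the", "token", "##ization", "##izer", "##s"}

print(char_tokenize("Cool."))
print(word_tokenize("Nice tokenization"))
print(subword_tokenize("the", TOY_VOCAB))
print(subword_tokenize("tokenization", TOY_VOCAB))
print(subword_tokenize("tokenizers", TOY_VOCAB))
```

With this toy vocabulary the common word "the" stays whole, while the rarer "tokenizers" is broken into "token", "##izer" and "##s", which is the behaviour item 3 describes.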

model_inputs = tokenizer(sequence, padding="max_length", truncation=True, return_tensors="pt")

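padding: the model takes its input in batches, so every list of ids in the batch must have the same length. Shorter sequences are padded to match.

truncation: every model has an upper limit on the number of tokens it accepts, so sequences that are too long are cut down to fit.

return_tensors: returns the result in tensor format ("pt" means PyTorch). Without this argument the ids come back as plain lists and can be passed straight to tokenizer.decode. With it they become a tensor with a leading batch dimension, so they can no longer be passed to decode as they are; select a single sequence first.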
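What padding and truncation do to a batch can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `pad_and_truncate` is a hypothetical helper, the ids are illustrative, and the padding id of 0 is an assumption (it is the [PAD] id in BERT vocabularies). A real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
PAD_ID = 0  # assumption: 0 is the padding id, as in BERT vocabularies


def pad_and_truncate(batch, max_length, pad_id=PAD_ID):
    """Bring every id list in the batch to exactly max_length."""
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:max_length]            # truncation: cut off the excess
        n_pad = max_length - len(ids)     # padding: how many ids are missing
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Illustrative ids with three different lengths.
batch = [
    [101, 7592, 999, 102],
    [101, 4658, 102],
    [101, 3835, 999, 1012, 999, 102],
]

ids, mask = pad_and_truncate(batch, max_length=5)
for row, row_mask in zip(ids, mask):
    print(row, row_mask)
```

After this step every row has the same length, so the batch is rectangular and can be turned into a single tensor.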