Notes on using HuggingFace's Transformers library (hands-on with pipelines + a walkthrough of the official README)

https://github.com/huggingface/transformers

First, a short snippet you can run directly:

import torch
import transformers  # or: from transformers import *

tokenizer = transformers.BertTokenizer.from_pretrained('./bert-base-chinese/')
tokenizer.encode("我今天很开心!")
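A quick sanity check on top of this snippet (it only uses the tokenizer loaded above): encode turns the sentence into vocabulary ids with [CLS]/[SEP] added, and decode maps them back:

ids = tokenizer.encode("我今天很开心!", add_special_tokens=True)
print(ids)                    # vocabulary ids, starting with 101 ([CLS]) and ending with 102 ([SEP])
print(tokenizer.decode(ids))  # roughly "[CLS] 我 今 天 很 开 心 ! [SEP]"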

The transformers package mainly contains the following classes:
(figure: overview of the main classes in transformers)

  1. Installation
pip install transformers
# also install at least one of PyTorch or TensorFlow 2.0
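A quick way to confirm the install picked up a working version (the printed number will depend on when you install; the pipelines discussed below require 2.3 or later):

import transformers
print(transformers.__version__)  # e.g. 2.3.0 or newer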
  2. Included model architectures

BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
DistilBERT (from HuggingFace), released together with the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into DistilGPT2, RoBERTa into DistilRoBERTa, Multilingual BERT into DistilmBERT and a German version of DistilBERT.
CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
CamemBERT (from Inria/Facebook/Sorbonne) released with the paper CamemBERT: a Tasty French Language Model by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
ALBERT (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
T5 (from Google AI) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
XLM-RoBERTa (from Facebook AI), released together with the paper Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
MMBT (from Facebook), released together with the paper Supervised Multimodal Bitransformers for Classifying Images and Text by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
FlauBERT (from CNRS) released with the paper FlauBERT: Unsupervised Language Model Pre-training for French by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
BART (from Facebook) released with the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
ELECTRA (from Google Research/Stanford University) released with the paper ELECTRA: Pre-training text encoders as discriminators rather than generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
DialoGPT (from Microsoft Research) released with the paper DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
Reformer (from Google Research) released with the paper Reformer: The Efficient Transformer by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
MarianMT Machine translation models trained using OPUS data by Jörg Tiedemann. The Marian Framework is being developed by the Microsoft Translator Team.
Longformer (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.
Other community models, contributed by the community.
Want to contribute a new model? We have added a detailed guide and templates to guide you in the process of adding a new model. You can find them in the templates folder of the repository. Be sure to check the contributing guidelines and contact the maintainers or open an issue to collect feedback before starting your PR.

  3. Demo
    The README links to a text-writing demo; this part is skipped here.
  4. Quick tour
    Below is the official example.
    First, download the model files from the address below and load them locally.
    # I downloaded the Chinese whole-word-masking (WWM) model

https://huggingface.co/models

import torch
from transformers import *

# Transformers has a unified API
# for 10 transformer architectures and 30 pretrained weights.
#          Model          | Tokenizer          | Pretrained weights shortcut
'''
MODELS = [(BertModel,       BertTokenizer,       'bert-base-uncased'),
          (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
          (GPT2Model,       GPT2Tokenizer,       'gpt2'),
          (CTRLModel,       CTRLTokenizer,       'ctrl'),
          (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
          (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
          (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
          (DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),
          (RobertaModel,    RobertaTokenizer,    'roberta-base'),
          (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
         ]'''
# The table above lists the bundled model/tokenizer pairs; BertModel is the most commonly used one
# To use TensorFlow 2.0 versions of the models, simply prefix the class names with 'TF', e.g. `TFRobertaModel` is the TF 2.0 counterpart of the PyTorch model `RobertaModel`

# The following shows how to encode text into a sequence of hidden-states using a pretrained model:

# Instantiate the model and tokenizer.
# For a concrete run, pick one row of the MODELS table above, e.g. (using the local
# Chinese BERT directory from the opening snippet -- adjust the path to your own download):
model_class, tokenizer_class, pretrained_weights = BertModel, BertTokenizer, './bert-base-chinese/'

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
# tokenizer_class is the second column of the table above, e.g. BertTokenizer
# pretrained_weights tells the tokenizer which pretrained weights to load
model = model_class.from_pretrained(pretrained_weights)
# model_class is the first column of the table above, e.g. BertModel



'''
# For example purposes. Not runnable.
model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')
model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)
'''



# Use the tokenizer to convert natural-language text into ids, adding the special [CLS] and [SEP] tokens
x = tokenizer.encode('前三季度,中部、西部地区社零总额增速分别为10.1%和7.8%,快于东部地区7.3%的增速。', add_special_tokens=True)
# result: x = [101, 1184, ..., 102]

# Now feed it to the model with PyTorch: first wrap the ids in a tensor
input_demo = torch.tensor([x])
# result: input_demo = tensor([[ 101, 1184, ..., 102]])
with torch.no_grad():
    output_demo = model(input_demo)
# The input sentence is 47 tokens long (counting the special tokens)
# output_demo is a tuple:
# output_demo[0] has size torch.Size([1, 47, 768])  -- the last hidden state of every token
# output_demo[1] has size torch.Size([1, 768])      -- the pooled output derived from the [CLS] token
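# For clarity, the two elements of the tuple can also be unpacked directly (shapes as noted above):
last_hidden_states, pooled_output = output_demo
print(last_hidden_states.shape)  # torch.Size([1, 47, 768])
print(pooled_output.shape)       # torch.Size([1, 768])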
'''
for model_class, tokenizer_class, pretrained_weights in MODELS:
    # Load pretrained model/tokenizer
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights)

    # Encode text
    input_ids = torch.tensor([tokenizer.encode("Here is some text to encode", add_special_tokens=True)])  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]  # Models outputs are now tuples
'''

# Each architecture is provided with several classes for fine-tuning on downstream tasks, e.g.
BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
                      BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering]

# All the classes for an architecture can be initialized from the pretrained weights of that architecture
# Note that additional weights added for fine-tuning are only initialized
# and need to be trained on the down-stream task
pretrained_weights = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)
for model_class in BERT_MODEL_CLASSES:
    # Load pretrained model/tokenizer
    model = model_class.from_pretrained(pretrained_weights)

    # Models can return full list of hidden-states & attentions weights at each layer
    model = model_class.from_pretrained(pretrained_weights,
                                        output_hidden_states=True,
                                        output_attentions=True)
    # with output_hidden_states and output_attentions enabled, the returned tuple contains additional elements
    input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
    all_hidden_states, all_attentions = model(input_ids)[-2:]

    # Models are compatible with Torchscript
    model = model_class.from_pretrained(pretrained_weights, torchscript=True)
    traced_model = torch.jit.trace(model, (input_ids,))

    # Simple serialization for models and tokenizers
    model.save_pretrained('./directory/to/save/')  # save
    model = model_class.from_pretrained('./directory/to/save/')  # re-load
    tokenizer.save_pretrained('./directory/to/save/')  # save
    tokenizer = BertTokenizer.from_pretrained('./directory/to/save/')  # re-load

    # SOTA examples for GLUE, SQUAD, text generation...
  5. Quick tour of PyTorch / TF 2.0 interoperability
    The main point for me: inference and testing are faster in PyTorch.
import tensorflow as tf
import tensorflow_datasets
from transformers import *

# Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')  # this loads the TF 2.0 version of the model
data = tensorflow_datasets.load('glue/mrpc')

# Prepare dataset for GLUE as a tf.data.Dataset instance
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)

# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])  # the usual Keras compile

# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)  # the usual Keras fit

# Load the TensorFlow model in PyTorch for inspection
model.save_pretrained('./save/')
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)
# inference and testing are faster in PyTorch

# Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
sentence_0 = "This research was consistent with his findings."
sentence_1 = "His findings were compatible with this research."
sentence_2 = "His findings were not compatible with this research."
inputs_1 = tokenizer(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
inputs_2 = tokenizer(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')

pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()

print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")
  6. Fine-tuning / usage scripts
    To be filled in later.
  7. Using pipelines
    The docs describe the Pipeline, newly added in version 2.3, as an interface to a number of high-level tasks:
    Pipelines are high-level objects which automatically handle tokenization, running your data through a transformers model and outputting the result in a structured object.
    These high-level tasks are the following:
    feature-extraction: Generates a tensor representation for the input sequence
    i.e. produces a representation tensor for the input sequence
    ner: Generates named entity mapping for each word in the input sequence.
    named entity recognition
    sentiment-analysis: Gives the polarity (positive / negative) of the whole input sequence.
    sentiment analysis (see the sketch right after this list)
    text-classification: Initialize a TextClassificationPipeline directly, or see sentiment-analysis for an example.
    text classification
    question-answering: Provided some context and a question referring to the context, it will extract the answer to the question from the context.
    question answering
    fill-mask: Takes an input sequence containing a masked token and returns a list of the most probable filled sequences, with their probabilities.
    fill in the blank (masked-token prediction)
    summarization
    text summarization
    translation_xx_to_yy
    translation (exact usage still to be explored)
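    Before the hands-on examples, a minimal sketch of the pipeline call itself (not from the original README; with no model argument, pipeline() falls back to a default English checkpoint and downloads it on first use):

from transformers import pipeline

nlp = pipeline('sentiment-analysis')  # uses a default English model when none is specified
print(nlp('We are very happy to include pipeline into the transformers repository.'))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]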

Two hands-on examples.
First: fill-mask

https://huggingface.co/transformers/main_classes/pipelines.html#transformers.FillMaskPipeline

Suppose the downloaded model files are all in a folder xx on drive D:

BERT_MODEL_DIR = "D:/xx"
BERT_MODEL_PATH = "D:/xx/pytorch_model.bin"
BERT_CONFIG_PATH = "D:/xx/config.json"


model = pipeline('fill-mask',
                 model=BERT_MODEL_PATH,
                 config=BERT_CONFIG_PATH,
                 tokenizer=BERT_MODEL_DIR,
                 framework='pt',
                 topk=3)
# framework is 'pt' or 'tf', depending on which kind of model you downloaded
# topk is an int: the number of predictions to return

mask = model.tokenizer.mask_token  # the mask token of the loaded model

print(mask)  # see what the mask token looks like ([MASK] for BERT)

test_sentence = '我今天很' + mask + '乐'  # ask the model to fill in the blank in "我今天很_乐"
model.predict(test_sentence)

'''Model output:
[{'sequence': '[CLS] 我 今 天 很 快 乐 [SEP]',
  'score': 0.998837411403656,
  'token': 2571},
 {'sequence': '[CLS] 我 今 天 很 欢 乐 [SEP]',
  'score': 0.0009607075480744243,
  'token': 3614},
 {'sequence': '[CLS] 我 今 天 很 喜 乐 [SEP]',
  'score': 3.116693551419303e-05,
  'token': 1599}]
'''
# try multiple sentences at once

test_sentence2 = ['我今天很' + mask + '乐', '但是天' + mask + '很热']
model.predict(test_sentence2)
'''
[[{'sequence': '[CLS] 我 今 天 很 快 乐 [SEP]',
   'score': 0.998837411403656,
   'token': 2571},
  {'sequence': '[CLS] 我 今 天 很 欢 乐 [SEP]',
   'score': 0.0009607103420421481,
   'token': 3614},
  {'sequence': '[CLS] 我 今 天 很 喜 乐 [SEP]',
   'score': 3.1167113775154576e-05,
   'token': 1599}],
 [{'sequence': '[CLS] 但 是 天 气 很 热 [SEP]',
   'score': 0.9807906150817871,
   'token': 3698},
  {'sequence': '[CLS] 但 是 天 天 很 热 [SEP]',
   'score': 0.0060727293603122234,
   'token': 1921},
  {'sequence': '[CLS] 但 是 天 也 很 热 [SEP]',
   'score': 0.0018581727053970098,
   'token': 738}]]
'''
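
One extra step that may help when reading the output (not part of the original example, but it only uses the tokenizer already attached to the pipeline): the 'token' field is a vocabulary id, and convert_ids_to_tokens maps it back to the character:

for pred in model.predict(test_sentence):
    char = model.tokenizer.convert_ids_to_tokens([pred['token']])[0]
    print(char, round(pred['score'], 4))
# 快 0.9988
# 欢 0.001
# 喜 0.0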
