transformer parameters

Latest recommended article published on 2024-06-26 13:16:46

big_matster

Category: Common Modules Collection. Tags: transformer, deep learning, artificial intelligence

Article link: https://blog.csdn.net/kuxingseng123/article/details/128807436

pipelines

transformers.pipeline(task: str,
                      model: Optional = None,
                      config: Optional[Union[str, transformers.configuration_utils.PretrainedConfig]] = None,
                      tokenizer: Optional[Union[str, transformers.tokenization_utils.PreTrainedTokenizer]] = None,
                      framework: Optional[str] = None,
                      revision: Optional[str] = None,
                      use_fast: bool = True, **kwargs) → transformers.pipelines.base.Pipeline

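Parameter description

The parameters in the signature above, in brief:
task: the task name, which determines which pipeline is returned (for example 'sentiment-analysis', 'question-answering', or 'ner').
model: a model instance or a checkpoint identifier. If omitted, the default model for the task is used.
config: a configuration object or identifier used to instantiate the model.
tokenizer: a tokenizer instance or identifier.
framework: 'pt' for PyTorch or 'tf' for TensorFlow.
revision: the model version to use (a branch name, tag, or commit id).
use_fast: whether to use a fast tokenizer when one is available.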

>>> from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

>>> # Sentiment analysis pipeline
>>> pipeline('sentiment-analysis')

>>> # Question answering pipeline, specifying the checkpoint identifier
>>> pipeline('question-answering', model='distilbert-base-cased-distilled-squad', tokenizer='bert-base-cased')

>>> # Named entity recognition pipeline, passing in a specific model and tokenizer
>>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> pipeline('ner', model=model, tokenizer=tokenizer)

Commonly used BERT pretrained models

[Figure: table of commonly used BERT pretrained models]

Three ways to use the transformers library:

Use a pipeline

Specify a pretrained model

Load a pretrained model with AutoModels.

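According to the official documentation, BERT has two tokenizers: BertTokenizer and BertTokenizerFast. I recommend BertTokenizerFast. The reason is simple: although Chinese BERT works at the character level, splitting digits and English text into individual [0-9a-zA-Z] characters is meaningless. So when we tokenize, we need to know which characters were mapped to each input_id, so that when we make NER predictions we know where the entity boundaries fall. We will call this mapping the offset for now. (For classification or seq2seq tasks, the offset is less necessary.)
BertTokenizerFast can output the offsets we need while it tokenizes, so we do not have to work out the mapping again ourselves.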

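from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')

# Placeholder inputs (not from the original post); substitute your own data
string_list = ["小明2023年去了Beijing"]
max_seq_length = 128

# truncation=True cuts each sequence to max_length. Padding is off by default;
# pass padding=True or padding='max_length' to enable it.
# [CLS] and [SEP] are added at the start and end of each sequence.
# tokens: {
#     "input_ids":      [batch_size, length + 2]
#     "attention_mask": [batch_size, [1] * (length + 2)]
#     "token_type_ids": [batch_size, [0] * (length + 2)]
#     "offset_mapping": [batch_size, [(0, 0)] + [(begin_index, end_index)] * length + [(0, 0)]]
# }
tokens = tokenizer(string_list,
                   return_offsets_mapping=True,
                   max_length=max_seq_length,
                   truncation=True)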
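
To make the role of the offsets in NER more concrete, here is a minimal sketch of how per-token predictions could be mapped back to character spans in the original text. It is plain Python and does not call the transformers library. The function name labels_to_char_spans, the BIO label scheme, and the hand-written offsets and labels are all assumptions made for illustration; they are not real tokenizer or model output.

```python
from typing import List, Tuple


def labels_to_char_spans(
    text: str,
    offsets: List[Tuple[int, int]],
    labels: List[str],
) -> List[Tuple[int, int, str, str]]:
    """Merge BIO token labels into (start, end, type, surface) character spans.

    Hypothetical helper for illustration only. Tokens whose offset is empty
    (start == end), such as [CLS] and [SEP], are treated as special tokens.
    """
    spans = []
    current = None  # [start, end, entity_type] of the entity being built
    for (start, end), label in zip(offsets, labels):
        if start == end or label == "O":
            # A special token or an "outside" label closes any open entity.
            if current is not None:
                spans.append(current)
                current = None
            continue
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and current is not None and current[2] == ent_type:
            # Continuation of the open entity: extend its right boundary.
            current[1] = end
        else:
            # "B-" label (or a stray "I-" label): start a new entity.
            if current is not None:
                spans.append(current)
            current = [start, end, ent_type]
    if current is not None:
        spans.append(current)
    return [(s, e, t, text[s:e]) for s, e, t in spans]


# Hand-written example data (assumed, not produced by a real tokenizer or model).
text = "小明2023年去了Beijing"
offsets = [(0, 0), (0, 1), (1, 2), (2, 6), (6, 7), (7, 8), (8, 9), (9, 16), (0, 0)]
labels = ["O", "B-PER", "I-PER", "B-TIME", "I-TIME", "O", "O", "B-LOC", "O"]

spans = labels_to_char_spans(text, offsets, labels)
for span in spans:
    print(span)
```

With a real model, offsets would come from tokens["offset_mapping"] for one sequence and labels from the model's predictions. The boundary logic stays the same, which is why having the tokenizer return the offsets is convenient.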