Hugging Face Tutorial: Task Summary

Task Summary

This page shows the most common use cases when using the library. The available models allow for many different configurations and offer great versatility across use cases. The simplest approaches are presented here, showing usage for tasks such as question answering, sequence classification, and named entity recognition.

These examples leverage auto-models, which are classes that instantiate a model according to a given checkpoint, automatically selecting the correct model architecture. Please check the AutoModel documentation for more information. Feel free to modify the code to be more specific and adapt it to your particular use case.
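As a quick sketch of what that looks like (the "bert-base-cased" checkpoint here is just an illustrative choice):

from transformers import AutoModel, AutoTokenizer

# The auto classes read the checkpoint's configuration and pick the matching
# architecture automatically; "bert-base-cased" resolves to a BERT model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

print(type(model).__name__)  # prints: BertModel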

In order for a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These checkpoints are usually pre-trained on a large corpus of data and then fine-tuned on a specific task. This means the following:

Not all models were fine-tuned on all tasks. If you wish to fine-tune a model on a specific task, you can leverage one of the run_$TASK.py scripts in the examples directory. Fine-tuned models were fine-tuned on a specific dataset, which may or may not overlap with your use case and domain. As mentioned previously, you may leverage the example scripts to fine-tune your model, or you may create your own training script.

In order to do inference on a task, several mechanisms are made available by the library:

  • Pipelines: very easy-to-use abstractions, which require as little as two lines of code.
  • Direct model use: less abstraction, but more flexibility and power via direct access to a tokenizer (PyTorch/TensorFlow) and full inference capacity.

Both approaches are showcased here.

All tasks presented here leverage pre-trained checkpoints that were fine-tuned on specific tasks. Loading a checkpoint that was not fine-tuned on a specific task would load only the base transformer layers and not the additional head used for the task, initializing the weights of that head randomly. This would produce random output.
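A minimal sketch of what happens in that case, assuming the base "bert-base-cased" checkpoint (which carries no task head):

from transformers import AutoModelForSequenceClassification

# "bert-base-cased" was not fine-tuned for sequence classification, so only the
# base transformer weights are loaded; the classification head on top is newly
# (randomly) initialized, and transformers logs a warning that the model should
# be trained on a downstream task before being used for predictions.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")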

Sequence Classification

Sequence classification is the task of classifying sequences according to a given number of classes. An example of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a model on a GLUE sequence classification task, you may leverage the run_glue.py, run_tf_glue.py, run_tf_text_classification.py or run_xnli.py scripts.

Here is an example of using a pipeline for sentiment analysis: identifying whether a sequence is positive or negative. It leverages a model fine-tuned on SST-2, which is a GLUE task.
This returns a label ("POSITIVE" or "NEGATIVE") together with a score, as follows:

from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: NEGATIVE, with score: 0.9991

result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999
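The pipeline can also be pinned to an explicit checkpoint rather than the task default; a sketch, assuming the distilbert-base-uncased-finetuned-sst-2-english checkpoint (the SST-2 model this pipeline commonly resolves to):

from transformers import pipeline

# Passing model= selects a specific checkpoint instead of the task's default.
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("I love you")[0])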

Here is an example of doing sequence classification using a model to determine whether two sequences are paraphrases of each other. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name.
  2. Encode the two sequences together, letting the tokenizer add the model-specific separators and compute the attention masks.
  3. Pass the encoded pair through the model to obtain the classification logits.
  4. Compute the softmax of the logits to get probabilities over the two classes.
  5. Print the results.

Pytorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
not paraphrase: 10%
is paraphrase: 90%

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
not paraphrase: 94%
is paraphrase: 6%

TensorFlow

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")

paraphrase_classification_logits = model(paraphrase).logits
not_paraphrase_classification_logits = model(not_paraphrase).logits

paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
not paraphrase: 10%
is paraphrase: 90%

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
not paraphrase: 94%
is paraphrase: 6%

Extractive Question Answering

Extractive question answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the run_qa.py and run_tf_squad.py scripts.

Here is an example of using a pipeline to do question answering: extracting an answer from a text given a question. It leverages a model fine-tuned on SQuAD.

from transformers import pipeline
question_answerer = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

This returns an answer extracted from the text, a confidence score, alongside "start" and "end" values, which are the positions of the extracted answer in the text.

result = question_answerer(question="What is extractive question answering?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)
Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160

Here is an example of question answering using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name.
  2. Define a text and a few questions.
  3. Iterate over the questions, encoding each question together with the text.
  4. Pass the inputs through the model to retrieve the start and end logits.
  5. Take the argmax of the start and end logits to get the most likely beginning and end of the answer.
  6. Convert the corresponding token ids back to a string and print the results.

Pytorch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2.0 and pytorch

TensorFlow

from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
    input_ids = inputs["input_ids"].numpy()[0]

    outputs = model(inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
    # Get the most likely end of answer with the argmax of the score
    answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2.0 and pytorch

Language Modeling

Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling, GPT-2 with causal language modeling.

Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or on scientific papers, e.g. LysandreJik/arxiv-nlp.

Masked Language Modeling

Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the right of the mask) and the left context (tokens on the left of the mask).

Such a training creates a strong basis for downstream tasks requiring bi-directional context, such as SQuAD (question answering, see Lewis, Lui, Goyal et al., part 4.2). If you would like to fine-tune a model on a masked language modeling task, you may leverage the run_mlm.py script.

Here is an example of using a pipeline to replace a mask from a sequence:

from transformers import pipeline

unmasker = pipeline("fill-mask")

This outputs the sequences with the mask filled, the confidence score, and the token id in the tokenizer vocabulary:

from pprint import pprint

pprint(
    unmasker(
        f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."
    )
)
[{'score': 0.1793,
  'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.',
  'token': 3944,
  'token_str': ' tool'},
 {'score': 0.1135,
  'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.',
  'token': 7208,
  'token_str': ' framework'},
 {'score': 0.0524,
  'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.',
  'token': 5560,
  'token_str': ' library'},
 {'score': 0.0349,
  'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.',
  'token': 8503,
  'token_str': ' database'},
 {'score': 0.0286,
  'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.',
  'token': 17715,
  'token_str': ' prototype'}]

Here is an example of doing masked language modeling using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and loads the weights stored in the checkpoint.
  2. Define a sequence with a masked token, placing tokenizer.mask_token instead of a word.
  3. Encode that sequence into a list of ids and find the position of the masked token in that list.
  4. Retrieve the predictions at the index of the masked token: this tensor has the same size as the vocabulary, and the values are the scores attributed to each token. The model gives a higher score to tokens it deems probable in that context.
  5. Retrieve the top 5 tokens using the PyTorch topk or TensorFlow top_k methods.
  6. Replace the mask token with the predicted tokens and print the results.
Pytorch

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)

inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.

TensorFlow

from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)

inputs = tokenizer(sequence, return_tensors="tf")
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.

This prints five sequences, with the top 5 tokens predicted by the model.
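If you also want confidence scores like the ones the pipeline reported, a softmax over the same logits recovers them. A minimal PyTorch sketch, reusing the mask_token_logits and top_5_tokens variables from the PyTorch example above:

import torch

# Turn the logits at the masked position into probabilities, then look up the
# probability of each of the top-5 token ids.
probs = torch.softmax(mask_token_logits, dim=-1)
for token in top_5_tokens:
    print(tokenizer.decode([token]), round(probs[0, token].item(), 4))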

Causal Language Modeling

Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting for generation tasks. If you would like to fine-tune a model on a causal language modeling task, you may leverage the run_clm.py script.
Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the input sequence.

Pytorch

Here is an example of using the tokenizer and model, leveraging the top_k_top_p_filtering() method to sample the next token following an input sequence of tokens.

from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sequence = "Hugging Face is based in DUMBO, New York City, and"
inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]

# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
Hugging Face is based in DUMBO, New York City, and ...

TensorFlow

Here is an example of using the tokenizer and model, leveraging the tf_top_k_top_p_filtering() method to sample the next token following an input sequence of tokens.

from transformers import TFAutoModelForCausalLM, AutoTokenizer, tf_top_k_top_p_filtering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

sequence = "Hugging Face is based in DUMBO, New York City, and"
inputs = tokenizer(sequence, return_tensors="tf")
input_ids = inputs["input_ids"]

# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)

generated = tf.concat([input_ids, next_token], axis=1)

resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
print(resulting_string)
Hugging Face is based in DUMBO, New York City, and ...

This outputs a (hopefully) coherent next token following the original sequence, which in our case is the word is or features.

In the next section, we show how generation_utils.GenerationMixin.generate() can be used to generate multiple tokens up to a specified length, instead of one token at a time.
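As a preview, here is a minimal sketch of that difference using the same "gpt2" checkpoint as above: rather than sampling a single token by hand, generate() runs the whole sampling loop internally.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hugging Face is based in DUMBO, New York City, and", return_tensors="pt")

# generate() samples token after token until max_length (prompt included) is
# reached, so several new tokens are produced in one call.
outputs = model.generate(**inputs, max_length=30, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))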

Text Generation

In text generation (a.k.a. open-ended text generation) the goal is to create a coherent portion of text that is a continuation of the given context. The following example shows how GPT-2 can be used in a pipeline to generate text. As a default, all models apply Top-K sampling when used in pipelines, as configured in their respective configurations (see the GPT-2 config for example).

Pytorch

from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]

Here, the model generates a random text with a total maximal length of 50 tokens from the context "As far as I am concerned, I will". Behind the scenes, the pipeline object calls the method PreTrainedModel.generate() to generate text. The default arguments for this method can be overridden in the pipeline, as is shown above for the max_length and do_sample arguments. Below is an example of text generation using XLNet and its tokenizer, which includes calling generate() directly:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

"""In 1991, the remains of Russian Tsar Nicholas II and his family# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodologyPADDING_TEXT = 
... (except for Alexei and Maria) are discovered.
... The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
... remainder of the story. 1883 Western Siberia,
... a young Grigori Rasputin is asked by his father and a group of men to perform magic.
... Rasputin has a vision and denounces one of the men as a horse thief. Although his
... father initially slaps him for making such an accusation, Rasputin watches as the
... man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
... the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
... with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""prompt = "Today the weather is really nice and I am planning on "inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]

print(generated)
Today the weather is really nice and I am planning ...

TensorFlow

from transformers import TFAutoModelForCausalLM, AutoTokenizer

model = TFAutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

"""In 1991, the remains of Russian Tsar Nicholas II and his family# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodologyPADDING_TEXT = 
... (except for Alexei and Maria) are discovered.
... The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
... remainder of the story. 1883 Western Siberia,
... a young Grigori Rasputin is asked by his father and a group of men to perform magic.
... Rasputin has a vision and denounces one of the men as a horse thief. Although his
... father initially slaps him for making such an accusation, Rasputin watches as the
... man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
... the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
... with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""prompt = "Today the weather is really nice and I am planning on "inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="tf")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]

print(generated)
Today the weather is really nice and I am planning ...

Text generation is currently possible with GPT-2, OpenAI-GPT, CTRL, XLNet, Transfo-XL and Reformer in PyTorch, and with most models in TensorFlow as well. As can be seen from the example above, XLNet and Transfo-XL often need to be padded to work well. GPT-2 is usually a good choice for open-ended text generation because it was trained on millions of webpages with a causal language modeling objective.

For more information on how to apply different decoding strategies for text generation, please refer to our text generation blog post.

Named Entity Recognition

Named entity recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset, which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the run_ner.py script.

Here is an example of using a pipeline to do named entity recognition, specifically, trying to identify tokens as belonging to one of 9 classes, including:

  • I-MIS, miscellaneous entity
  • I-PER, person's name
  • I-ORG, organisation
  • I-LOC, location

It leverages a model fine-tuned on CoNLL-2003, fine-tuned by @stefan-it from dbmdz.

from transformers import pipeline
ner_pipe = pipeline("ner")
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
... therefore very close to the Manhattan Bridge which is visible from the window."""

This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above. Here are the expected results:

for entity in ner_pipe(sequence):
    print(entity)
{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}

Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and how "New York City", "DUMBO" and "Manhattan Bridge" have been identified as locations.
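The pipeline can also merge those sub-word tokens back into whole entities. A sketch, assuming the aggregation_strategy parameter available in more recent transformers releases:

from transformers import pipeline

# With an aggregation strategy, adjacent sub-word tokens such as "Hu",
# "##gging", "Face" and "Inc" are grouped into single entities, each with one
# combined span and score.
ner_pipe = pipeline("ner", aggregation_strategy="simple")
for entity in ner_pipe("Hugging Face Inc. is a company based in New York City."):
    print(entity)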

Here is an example of doing named entity recognition using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads the weights stored in the checkpoint.
  2. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
  3. Split words into tokens so that they can be mapped to predictions. We use a small hack by first completely encoding and decoding the sequence, so that we're left with a string that contains the special tokens.
  4. Encode that sequence into ids (special tokens are added automatically).
  5. Retrieve the predictions by passing the input to the model and getting the first output. This results in a distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for each token.
  6. Zip together each token with its prediction and print it.

Pytorch

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = (
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
    "therefore very close to the Manhattan Bridge."
)

inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()

outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

TensorFlow

from transformers import TFAutoModelForTokenClassification, AutoTokenizer
import tensorflow as tf

model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = (
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
    "therefore very close to the Manhattan Bridge."
)

inputs = tokenizer(sequence, return_tensors="tf")
tokens = inputs.tokens()

outputs = model(**inputs)[0]
predictions = tf.argmax(outputs, axis=2)

This outputs a list of each token mapped to its corresponding prediction. Differently from the pipeline, here every token has a prediction, as we didn't remove the "O" class, which means that no particular entity was found on that token.

In the above example, predictions is an integer that corresponds to the predicted class. We can use the model.config.id2label property to recover the class name corresponding to the class number, as illustrated below:

for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')

Summarization

Summarization is the task of summarizing a document or an article into a shorter text. If you would like to fine-tune a model on a summarization task, you may leverage the run_summarization.py script.

An example of a summarization dataset is the CNN/Daily Mail dataset, which consists of long news articles and was created for the task of summarization. If you would like to fine-tune a model on a summarization task, various approaches are described in this document.

Here is an example of using a pipeline to do summarization. It leverages a Bart model that was fine-tuned on the CNN/Daily Mail dataset.

from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney's Office by Immigration and Customs Enforcement and the Department of Homeland Security's
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

Because the summarization pipeline depends on the PreTrainedModel.generate() method, we can override the default arguments of PreTrainedModel.generate() directly in the pipeline for max_length and min_length as shown below. This outputs the following summary:

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]

Here is an example of doing summarization using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as Bart or T5.
  2. Define the article that should be summarized.
  3. Add the T5 specific prefix "summarize: ".
  4. Use the PreTrainedModel.generate() method to generate the summary.

In this example we use Google's T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including CNN/Daily Mail), it yields very good results.

Pytorch

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0]))
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>

TensorFlow

from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="tf", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0]))
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.

Translation

Translation is the task of translating a text from one language to another. If you would like to fine-tune a model on a translation task, you may leverage the run_translation.py script.

An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a translation task, various approaches are described in this document.

Here is an example of using a pipeline to do translation. It leverages a T5 model that was only pre-trained on a multi-task mixture dataset (including WMT), yet it yields impressive translation results.

from transformers import pipeline

translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]

Because the translation pipeline depends on the PreTrainedModel.generate() method, we can override the default arguments of PreTrainedModel.generate() directly in the pipeline, as is shown for max_length above.

Here is an example of doing translation using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. Translation is usually done using an encoder-decoder model, such as Bart or T5.
  2. Define the sentence that should be translated.
  3. Add the T5 specific prefix "translate English to German: ".
  4. Use the PreTrainedModel.generate() method to perform the translation.

Pytorch

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>

TensorFlow

from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="tf",
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.

We get the same translation as with the pipeline example.
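The <pad> and </s> markers in the decoded outputs above are the model's special tokens; tokenizer.decode() can drop them, as in this small sketch reusing the outputs from the example above:

# skip_special_tokens removes markers such as <pad> and </s> from the decode.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))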

This article is a Chinese translation of the Hugging Face tutorial, for learning purposes only.

Original link
