The pipeline() Function
The pipeline() function is the most basic tool in the Transformers library. Transformer models are used to solve a variety of NLP tasks, and the Transformers library provides the functionality to create and use these models. Let's first look at how pipeline() solves NLP problems.
sentiment-analysis
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier("I've been waiting for a HuggingFace course my whole life.")
# [{'label': 'POSITIVE', 'score': 0.9273151755332947}]
We can also pass several sentences at once for sentiment analysis:
classifier(["I've been waiting for a HuggingFace course my whole life.", 'I hate this so much!'])
# [{'label': 'POSITIVE', 'score': 0.9273151755332947},
# {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
Passing text to pipeline() involves three main steps:
- Preprocessing: the text is converted into a format the model can understand;
- Model inference: the preprocessed inputs are passed to the model;
- Postprocessing: the model's outputs are converted into the final result.
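The three steps above can be sketched schematically. The following is a toy stand-in for illustration only, not the real library internals: the vocabulary and per-token weights are entirely made up.

```python
# Toy sketch of the pipeline's three stages: preprocess -> model -> postprocess.
# The vocabulary and weights below are made up purely for illustration.

VOCAB = {'i': 0, 'love': 1, 'hate': 2, 'this': 3}

def preprocess(text):
    # Step 1: turn raw text into token ids the "model" understands.
    return [VOCAB[w] for w in text.lower().split() if w in VOCAB]

def model(token_ids):
    # Step 2: a fake model that scores sentiment from token ids.
    weights = {0: 0.0, 1: 2.0, 2: -2.0, 3: 0.0}  # made-up per-token weights
    return sum(weights[t] for t in token_ids)

def postprocess(logit):
    # Step 3: map the raw score to a labelled result.
    label = 'POSITIVE' if logit >= 0 else 'NEGATIVE'
    return {'label': label, 'score': abs(logit)}

def toy_pipeline(text):
    return postprocess(model(preprocess(text)))

print(toy_pipeline('I love this'))  # {'label': 'POSITIVE', 'score': 2.0}
print(toy_pipeline('I hate this'))  # {'label': 'NEGATIVE', 'score': 2.0}
```

The real pipeline replaces each stage with a tokenizer, a pretrained model, and task-specific postprocessing, but the flow is the same.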
Some of the pipelines currently available are:
- feature-extraction
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification
zero-shot-classification
Zero-shot classification is a machine-learning paradigm in which a model correctly classifies instances of categories it has never seen, without relying on any training examples for those categories. The core idea is to learn from data in existing categories and transfer that knowledge to new, unseen ones. Zero-shot classification is widely applicable to supervised learning tasks in computer vision, natural language processing, and beyond.
With pipeline(), you can specify the candidate labels directly and perform zero-shot classification:
from transformers import pipeline
classifier = pipeline('zero-shot-classification')
classifier(
    'This is a course about the Transformers library',
    candidate_labels=['education', 'politics', 'business'],
)
# {'sequence': 'This is a course about the Transformers library',
# 'labels': ['education', 'business', 'politics'],
# 'scores': [0.8445989489555359, 0.11197412759065628, 0.04342695698142052]}
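Note that the returned scores sum to 1 and the labels are sorted from most to least likely. Conceptually, the pipeline assigns each candidate label a raw score against the text and then normalizes. A minimal sketch of that normalization-and-ranking step, using made-up raw scores (the real per-label scores come from the model):

```python
import math

def normalize_and_rank(label_scores):
    # Softmax-normalize raw per-label scores so they sum to 1,
    # then sort labels from most to least likely.
    exp = {label: math.exp(s) for label, s in label_scores.items()}
    total = sum(exp.values())
    probs = {label: v / total for label, v in exp.items()}
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return {'labels': [l for l, _ in ranked], 'scores': [s for _, s in ranked]}

# Hypothetical raw scores for the example text above.
result = normalize_and_rank({'education': 2.5, 'politics': -0.5, 'business': 0.5})
print(result['labels'])       # ['education', 'business', 'politics']
print(sum(result['scores']))  # 1.0 (up to floating-point error)
```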
text-generation
Let's use a pipeline to generate some text. You first provide a prompt, and the model automatically completes it. Text generation involves some randomness, so it is normal for the generated content to vary.
from transformers import pipeline
generator = pipeline('text-generation')
generator('In the course, we will teach you how to')
# [{'generated_text': "In the course, we will teach you how to use your own ideas,
# thoughts, and ideas in practice through building a toolset.\n\nDon't put yourself
# in anyone's shoes, however, if you put a whole bunch of assumptions and beliefs"}]
The num_return_sequences argument controls how many different sequences are generated, and max_length controls the total length of the output text:
from transformers import pipeline
generator = pipeline('text-generation', num_return_sequences=3, max_length=10)
generator('I')
# [{'generated_text': 'I have already talked to the other parties who want'},
# {'generated_text': 'I don\'t know what to say."\n\n'},
# {'generated_text': 'I a very happy woman who still loves it,"'}]
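The variation across the three sequences comes from sampling: at each step the model draws the next token from a probability distribution rather than always taking the most likely one. A toy sketch of that sampling loop, with a made-up next-token table standing in for the model (a real model recomputes the distribution at every step from the text so far):

```python
import random

# Made-up next-token distributions, for illustration only.
NEXT = {
    'I': [('have', 0.4), ('am', 0.3), ("don't", 0.3)],
    'have': [('already', 0.5), ('never', 0.5)],
}

def generate(prompt, max_length=4, seed=None):
    # Sample tokens one at a time until max_length is reached
    # or no continuation is known for the last token.
    rng = random.Random(seed)
    tokens = [prompt]
    while len(tokens) < max_length and tokens[-1] in NEXT:
        words, probs = zip(*NEXT[tokens[-1]])
        tokens.append(rng.choices(words, weights=probs)[0])
    return ' '.join(tokens)

# Different seeds give different continuations, much like the
# three sequences returned by num_return_sequences=3.
for seed in range(3):
    print(generate('I', seed=seed))
```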
fill-mask
The fill-mask task fills in the blanks in a given text.
from transformers import pipeline
unmasker = pipeline('fill-mask')
unmasker('This course will teach you all about <mask> models', top_k=2)
# [{'score': 0.19631513953208923,
# 'token': 30412,
# 'token_str': ' mathematical',
# 'sequence': 'This course will teach you all about mathematical models'},
# {'score': 0.04449228197336197,
# 'token': 745,
# 'token_str': ' building',
# 'sequence': 'This course will teach you all about building models'}]
The top_k argument controls how many candidates are displayed. Here the model fills in the special **&lt;mask&gt;** word, which is usually called the **mask token**. Other mask-filling models may use different mask tokens, so verify what the correct mask word is when exploring other models.
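Mechanically, top_k just keeps the k highest-scoring fill candidates, best first. A toy sketch of that selection step, with made-up candidate scores shaped like the pipeline's output:

```python
def top_k(candidates, k):
    # Keep the k highest-scoring fill candidates, best first,
    # mirroring the effect of the pipeline's top_k parameter.
    return sorted(candidates, key=lambda c: c['score'], reverse=True)[:k]

# Made-up scores for candidates at the <mask> position above.
candidates = [
    {'token_str': 'building', 'score': 0.044},
    {'token_str': 'mathematical', 'score': 0.196},
    {'token_str': 'computer', 'score': 0.032},
]
print(top_k(candidates, k=2))
# [{'token_str': 'mathematical', ...}, {'token_str': 'building', ...}]
```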
Note: all of the code in this section was run in Colab.