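Study Notes 2 (A Complete Learning Workflow for Developing a Small Project)

Goal

Use an open-source deep learning model from Hugging Face, together with Gradio, to build a deep-learning text application (for example, GPT-based text continuation or BERT-based sentiment classification).

Some concepts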

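Gradio is an open-source project, now maintained by Hugging Face, and in practice you can think of it as a Python package. Install it with: pip install gradio. With Gradio, a few lines of Python are enough to generate an interactive web page automatically. It supports many input and output formats and common controls such as image upload/display boxes, text boxes, and various buttons. It can also generate a link that is reachable from outside your network, so non-technical people can quickly try out your algorithm.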
————————————————
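Copyright notice: this is an original article by CSDN blogger "snooby101", licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/xmh_free/article/details/127210992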
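
To make the "a few lines of Python" claim concrete, here is a minimal sketch. The names reverse_text and build_demo and the labels are placeholders I made up, not something from the quoted article. I import gradio inside build_demo only so that the text function can be tried even before the package is installed.

```python
def reverse_text(text: str) -> str:
    """Toy text function: return the input reversed (a stand-in for a real model)."""
    return text[::-1]


def build_demo():
    """Wrap reverse_text in a Gradio web UI (needs `pip install gradio`)."""
    import gradio as gr  # imported lazily; only needed when building the UI

    return gr.Interface(
        fn=reverse_text,
        inputs=gr.Textbox(label="Input text"),
        outputs=gr.Textbox(label="Reversed text"),
    )


print(reverse_text("gradio"))  # prints "oidarg"
```

Calling build_demo().launch() should start a local page, and launch(share=True) should additionally give a temporary public link.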

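Since its release in 2018, the BERT model has topped one leaderboard after another and opened up the pretrain-then-fine-tune paradigm in NLP.

HuggingFace is a chatbot startup headquartered in New York. It picked up on the BERT trend very early and set out to implement BERT in PyTorch. The project was originally called pytorch-pretrained-bert. It reproduced the original results and offered easy-to-use methods for experimenting and doing research on top of this powerful model.

As the number of users grew, the project developed into a fairly large open-source community, merged in many other pretrained language models, and added TensorFlow implementations. In the second half of 2019 it was renamed Transformers. At the time of writing (March 30, 2021) the project had 43k+ stars, so Transformers has arguably become the de facto basic toolkit for NLP.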
————————————————
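Copyright notice: this is an original article by CSDN blogger "PaperWeekly", licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/c9Yv2cf9I06K2A9E/article/details/118230669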

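BERT is built on a bidirectional Transformer, whereas GPT-2 is built on a unidirectional Transformer. "Bidirectional" and "unidirectional" here refer to the attention computation. BERT considers the words on both the left and the right of a masked word, so it fuses context from both directions, which makes it better suited to text-understanding tasks such as classification. GPT-2 only considers the words to the left of the position being predicted, which makes it better suited to text generation, so this chapter uses GPT-2 for a Chinese news text generation task.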
————————————————
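Copyright notice: this is an original article by CSDN blogger "wumg3000", licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.

Original link: https://blog.csdn.net/wumg3000/article/details/129392406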
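
To picture the bidirectional versus unidirectional difference, here is a small sketch of which positions each token is allowed to attend to. It only illustrates visibility in the attention step. The function name attention_mask is my own, and real models build these masks as tensors inside the library.

```python
def attention_mask(seq_len: int, causal: bool) -> list:
    """Row i lists which positions j token i may attend to (1 = visible, 0 = hidden)."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


bidirectional = attention_mask(4, causal=False)  # BERT-style: every token sees all tokens
unidirectional = attention_mask(4, causal=True)  # GPT-2-style: token i sees positions 0..i

for row in unidirectional:
    print(row)
```

The unidirectional mask comes out lower-triangular: each row gains one more visible position, which is why GPT-2 can only use the text to the left of the word it is predicting.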

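GPT-2 language model overview

GPT-2 is a language model that predicts the next word from the preceding text, so it can use what it learned during pretraining to generate text, such as news. It can also be fine-tuned on other data to generate text with a particular format or topic, such as poetry or plays. So next, we will use the GPT-2 model to do some text generation.

Next, I will use GPT-2 for a text generation task.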

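import random

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def select_top_k(predictions, k=10):
    # Assumed helper (the quoted snippet did not include its definition):
    # pick one token id at random from the k most likely next tokens
    top_k_ids = predictions[0, -1, :].topk(k).indices.tolist()
    return random.choice(top_k_ids)


# Load the pretrained GPT-2 tokenizer and model from a local ./gpt2 directory
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2")
model = GPT2LMHeadModel.from_pretrained("./gpt2")
model.eval()

# Example prompt (an assumption; the quoted snippet never defined `text`)
text = "Hello, I'm a language model,"
indexed_tokens = tokenizer.encode(text)
tokens_tensor = torch.tensor([indexed_tokens])

total_predicted_text = text
n = 100  # number of prediction steps
for _ in range(n):
    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]  # logits with shape (batch, sequence, vocabulary)

    predicted_index = select_top_k(predictions, k=10)
    total_predicted_text += tokenizer.decode([predicted_index])

    if '<|endoftext|>' in total_predicted_text:
        # Stop generating once the end-of-text marker appears
        break

    # Append the new token and feed the longer sequence back in
    indexed_tokens += [predicted_index]
    tokens_tensor = torch.tensor([indexed_tokens])

print(total_predicted_text)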
————————————————
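Copyright notice: this is an original article by CSDN blogger "wumg3000", licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/wumg3000/article/details/129392406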
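
The loop above relies on a helper called select_top_k. Here is a sketch of what such a helper could look like if it samples in proportion to probability rather than picking blindly. This is my own guess at the idea, not the quoted author's code. To keep it dependency-free it takes a plain list of scores for the last position, so with the torch loop above you would pass something like predictions[0, -1, :].tolist().

```python
import math
import random


def select_top_k(logits, k=10, rng=random):
    """Sample one token id from the k highest-scoring entries of `logits`.

    `logits` is a plain list of scores for the last position, one per vocabulary id.
    """
    # Keep the ids of the k largest scores, best first
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over just those k scores (subtract the max for numerical stability)
    max_score = logits[top_ids[0]]
    weights = [math.exp(logits[i] - max_score) for i in top_ids]
    # Draw one id with probability proportional to its weight
    return rng.choices(top_ids, weights=weights, k=1)[0]


scores = [0.1, 2.5, -1.0, 3.0, 0.7]
print(select_top_k(scores, k=2))  # prints either 1 or 3
```

A small k keeps the text close to the most likely continuation, while a larger k lets in more variety.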

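I ran it and it didn't work! Maybe that's because I hadn't downloaded the model first.

I went to Models - Hugging Face and found gpt2 · Hugging Face. In the lower left of the page there is a Model Card that says some things I seem to be able to understand!

Let me be a porter of the crystallized wisdom of humankind!

Here is the overview of GPT-2: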

GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.

More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the predictions for the token i only uses the inputs from 1 to i but not the future tokens.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks.

The model is best at what it was pretrained for however, which is generating texts from a prompt.

For more specialized tasks, you can look for fine-tuned versions on the model hub.
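
The "targets are the same sequence, shifted one token" part of the model card is easier to see with a toy example. In this sketch the word pieces are made up for readability (real GPT-2 works on integer token ids), and make_lm_pairs is a name I invented.

```python
def make_lm_pairs(token_ids):
    """Split one token sequence into (inputs, targets) for next-token prediction."""
    inputs = token_ids[:-1]   # everything except the last token
    targets = token_ids[1:]   # the same sequence moved one step ahead
    return inputs, targets


tokens = ["Hello", ",", " I", "'m", " a", " language", " model"]
inputs, targets = make_lm_pairs(tokens)
for i, target in enumerate(targets):
    # The prediction at position i may only look at inputs[0..i]
    print(inputs[: i + 1], "->", repr(target))
```

Each printed line is one training signal: the visible prefix on the left, the token to guess on the right.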

Usage

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')

set_seed(42)

generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

[{'generated_text': "Hello, I'm a language model, a language for thinking, a language for expressing thoughts."},
 {'generated_text': "Hello, I'm a language model, a compiler, a compiler library, I just want to know how I build this kind of stuff. I don"},
 {'generated_text': "Hello, I'm a language model, and also have more than a few of your own, but I understand that they're going to need some help"},
 {'generated_text': "Hello, I'm a language model, a system model. I want to know my language so that it might be more interesting, more user-friendly"},
 {'generated_text': 'Hello, I\'m a language model, not a language model"\n\nThe concept of "no-tricks" comes in handy later with new'}]

I dropped the code straight into PyCharm and it didn't work. It reported the following:

You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
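It says that the code changed the pretrained model's configuration.

I don't really understand it. I tried removing set_seed(42), but the result was basically the same.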

from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
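
One thing that may be worth checking here (this is my guess, not something the message says): in a PyCharm script, unlike an interactive session, the return value of generator(...) is not shown unless it is printed. The message itself is worded as a deprecation warning, so it may not be what is hiding the output. A sketch under those assumptions follows. format_generations and run_demo are names I made up, and calling run_demo() needs network access to download gpt2.

```python
def format_generations(results):
    """Turn the pipeline's list of {'generated_text': ...} dicts into printable text."""
    return "\n".join(
        f"[{i}] {item['generated_text']}" for i, item in enumerate(results, start=1)
    )


def run_demo():
    """Download gpt2 (first run only), generate, and print the results explicitly."""
    from transformers import pipeline, set_seed

    generator = pipeline("text-generation", model="gpt2")
    set_seed(42)
    results = generator(
        "Hello, I'm a language model,",
        max_new_tokens=20,        # length budget for the continuation only
        do_sample=True,           # sampling is needed to get several different outputs
        num_return_sequences=3,
    )
    print(format_generations(results))


# A stand-in result with the same shape as the pipeline output, to show the formatting
print(format_generations([{"generated_text": "Hello, I'm a language model, ..."}]))
```

Calling run_demo() at the bottom of the script should then print the generated texts instead of silently discarding them.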

So what else can I do? Go and read their technical documentation!

It says:

Each framework has a generate method for text generation implemented in their respective GenerationMixin class.

Regardless of your framework of choice, you can parameterize the generate method with a GenerationConfig class instance. Please refer to this class for the complete list of generation parameters, which control the behavior of the generation method.

To learn how to inspect a model’s generation configuration, what are the defaults, how to change the parameters ad hoc, and how to create and save a customized generation configuration, refer to the text generation strategies guide.

The term "ad hoc" comes from Latin. Baidu explains it as "for this purpose only".

Generation configuration

  • max_length (int, optional, defaults to 20): The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden (that is, superseded) by max_new_tokens, if also set.
  • max_new_tokens (int, optional): The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.
  • min_length (int, optional, defaults to 0): The minimum length of the sequence to be generated. Corresponds to the length of the input prompt + min_new_tokens. Its effect is overridden by min_new_tokens, if also set.
  • min_new_tokens (int, optional): The minimum numbers of tokens to generate, ignoring the number of tokens in the prompt.
  • early_stopping (bool or str, optional, defaults to False): Controls the stopping condition for beam-based methods, like beam-search. It accepts the following values: True, where the generation stops as soon as there are num_beams complete candidates; False, where a heuristic (a rule-of-thumb method) is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical, that is, standard, beam search algorithm).
  • max_time (float, optional): The maximum amount of time you allow the computation to run for in seconds. Generation will still finish the current pass after allocated time has been passed.

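There are plenty of other weird and wonderful settings, but I won't list them. I can't get through them all or remember them, so it isn't really worth the effort! So how should I set things up so that it generates text automatically without reporting an error? It looks hard! And if I stop using the pipeline method, how do I call the GPT model?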
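
One way to call the model without pipeline, as far as I can tell from the documentation excerpt above, is to put the generation settings in a GenerationConfig object and pass it to generate. The sketch below assumes a transformers version that ships GenerationConfig. To keep it runnable offline it builds a tiny GPT-2 with random weights, so the numbers, token ids, and sizes are all made up and the output is gibberish.

```python
import torch
from transformers import GenerationConfig, GPT2Config, GPT2LMHeadModel

torch.manual_seed(0)

# Tiny randomly initialised GPT-2 so the sketch runs offline; its output is gibberish.
# For real text, use GPT2LMHeadModel.from_pretrained("gpt2") plus its tokenizer instead.
config = GPT2Config(vocab_size=100, n_positions=64, n_embd=32, n_layer=2, n_head=2,
                    bos_token_id=0, eos_token_id=1)
model = GPT2LMHeadModel(config)
model.eval()

# Generation settings live in their own object rather than in the model config
gen_config = GenerationConfig(
    max_new_tokens=20,  # counts generated tokens only, not the prompt
    do_sample=True,
    top_k=10,
    eos_token_id=1,
    pad_token_id=1,
)

prompt_ids = torch.tensor([[5, 6, 7]])  # made-up token ids standing in for a tokenized prompt
with torch.no_grad():
    output_ids = model.generate(prompt_ids, generation_config=gen_config)

print(output_ids.shape)  # at most 3 prompt tokens + 20 new tokens
```

With the real gpt2 checkpoint, the prompt ids would come from the tokenizer and the result would go through tokenizer.decode to get text back.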

Let's look at the BERT-based sentiment classifier that dhl wrote up in a second instead.

import torch
import gradio as gr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a pretrained sentiment model (DistilBERT fine-tuned on SST-2)
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Sentiment classification function
def sentiment_analysis(text: str) -> str:
    encoded = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**encoded).logits
    prob = torch.nn.functional.softmax(logits, dim=-1)
    classes = ["negative", "positive"]  # label order of this checkpoint: 0 = negative, 1 = positive

    return classes[int(prob.argmax())]

# Build the web UI with Gradio (gr.Textbox replaces the older gradio.inputs / gradio.outputs modules)
input_text = gr.Textbox(lines=5, label="Input text")
output_text = gr.Textbox(label="Sentiment")

iface = gr.Interface(
    fn=sentiment_analysis,
    inputs=input_text,
    outputs=output_text,
    examples=[["I love Civil Engineering so much ."], ["I dislike(loathe) Peking University because it has no Department of Civil Engineering at all."]],
)

iface.launch(share=True)

That's all for today, even though the problem still isn't solved. Sob, sob, sob!
