Get to Know and Try Out Hugging Face (Comprehensive Guide)

I. References

Hugging Face official website: https://huggingface.co/

Hugging Face documentation: https://huggingface.co/docs

II. About Hugging Face

1. Introduction to Hugging Face

Hugging Face is an open-source AI community "building the future of AI", often described as "the GitHub of AI".

Name | Description | Link
Transformers | Provides APIs and tools to easily download and train state-of-the-art pretrained models. | https://huggingface.co/docs/transformers/index
Models | Download pretrained models. | https://huggingface.co/models
Datasets | Download datasets. | https://huggingface.co/datasets
Accelerate | Helps PyTorch users easily run multi-GPU/TPU/fp16 training. | https://huggingface.co/docs/accelerate/index

Hugging Face Hub

Name | Description | Documentation link
Repositories | Model repositories support version management, open-sourcing models, and more; usage is similar to GitHub. | https://huggingface.co/docs/hub/repositories
Models | Hosts pretrained models. | https://huggingface.co/docs/hub/models
Datasets | A lightweight dataset framework. | https://huggingface.co/docs/hub/datasets
Spaces | Spaces hosts many fun deep learning demo apps. | https://huggingface.co/docs/hub/spaces

2. Hugging Face models

See the companion post: Simple Usage of Hugging Face Models

3. Hugging Face datasets

See the companion post: Simple Usage of Hugging Face Datasets

III. Common Operations

1. Getting an Access Token

On the Hugging Face website, create an Access Token; it is used to log in to your Hugging Face account.

Note: select the Write permission.


2. Logging in to a Hugging Face Account

pip install huggingface_hub
huggingface-cli login
# Log in using a token from huggingface.co/settings/tokens
(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful
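
Alternatively, you can log in from Python via huggingface_hub. A minimal sketch; the token value below is a placeholder for your own token:

from huggingface_hub import login

# Log in programmatically; paste the token created at
# https://huggingface.co/settings/tokens (Write permission).
login(token="hf_xxxxxxxxxxxxxxxx")  # placeholder token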

3. Installing transformers

https://huggingface.co/docs/transformers/main/installation

# Install transformers
pip install transformers

# Install the huggingface packages
pip install huggingface huggingface_hub

Verify the installation:

python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"

Output:

[{'label': 'POSITIVE', 'score': 0.9998704791069031}]

IV. About pipelines

1. Common pipelines

Name | Task
feature-extraction | Get vector representations of text
fill-mask | Fill in the blanks in a given text (cloze task)
ner (named entity recognition) | Named entity recognition
question-answering | Question answering
sentiment-analysis | Sentiment analysis
summarization | Summarization
text-generation | Text generation
translation | Translation
zero-shot-classification | Zero-shot classification
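
As a quick illustration, the sketch below runs two of these pipelines with their default checkpoints (the first run downloads the models; scores will vary slightly):

from transformers import pipeline

# Cloze-style masked-word prediction (the default checkpoint uses the <mask> token)
unmasker = pipeline("fill-mask")
print(unmasker("Paris is the <mask> of France.", top_k=2))

# Classify text against arbitrary candidate labels, no fine-tuning needed
classifier = pipeline("zero-shot-classification")
print(classifier(
    "This course is about the Transformers library",
    candidate_labels=["education", "politics", "business"],
))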

2. sentiment-analysis

sentiment-analysis is the pipeline for sentiment analysis tasks.

Intuitively, this pipeline chains three stages: a tokenizer that turns raw text into numbers, the model forward pass, and post-processing that turns raw scores into labels.

2.1 Tokenizer stage

Like other neural networks, Transformer models cannot process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers the model can make sense of. For this we use a tokenizer, which is responsible for:

  • Splitting the input text into words, subwords, or symbols (such as punctuation), called tokens.
  • Mapping each token to an integer.
  • Adding any additional inputs that may be useful to the model.

We can use the AutoTokenizer class and its from_pretrained method to guarantee that all this preprocessing is done in exactly the same way as when the model was pretrained.

Given the model's checkpoint name, it automatically fetches the data associated with the model's tokenizer and caches it locally (downloaded the first time the code runs).

Since the default checkpoint of the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english, we can get the tokenizer we need as follows:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Automatically load the tokenizer used when this model was trained
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


With the tokenizer in hand, we can call it directly to carry out the steps described above:

raw_inputs = ["We are very happy to show you the Transformers library.", 'Oh, no.']

# padding=True 填充输入序列,使得批次内序列长度一致
# truncation=True 截断过长的序列
# return_tensors="pt" 返回PyTorch 张量
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

Output:

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

We get two tensors. input_ids holds the integer IDs of the tokens in each sequence, with 0 used to pad the shorter sequence; attention_mask marks real tokens with 1 and padding positions with 0, so the model knows which tokens to attend to.
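
To sanity-check the mapping, you can convert the IDs back to token strings:

# Decode the first sequence's IDs back into tokens
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# ['[CLS]', 'i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', ...]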

2.2 Model stage

We can download a pretrained model the same way we downloaded the tokenizer. Transformers provides an AutoModel class, which also has a from_pretrained method:

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

The AutoModel class can instantiate any model from a checkpoint, and it is the recommended way to instantiate a model.

In the code above we downloaded the same checkpoint we used in the pipeline earlier and instantiated a model with it. However, this architecture contains only the base Transformer module: given some inputs, it outputs hidden states. These hidden states are usually fed into another part of the model, known as the model head.
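
To make this concrete, here is a sketch (reusing the inputs from the Tokenizer stage) that prints the base model's hidden-state shape, then swaps in AutoModelForSequenceClassification, whose classification head turns those hidden states into sentiment logits:

import torch
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"],
    padding=True, truncation=True, return_tensors="pt",
)

# Base model: outputs hidden states only
model = AutoModel.from_pretrained(checkpoint)
print(model(**inputs).last_hidden_state.shape)  # torch.Size([2, 16, 768])

# Same checkpoint with a sequence-classification head: outputs logits
clf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
logits = clf_model(**inputs).logits
print(torch.nn.functional.softmax(logits, dim=-1))  # per-label probabilities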

2.3 Code example

from transformers import pipeline
 
classifier = pipeline("sentiment-analysis")
res = classifier(["We are very happy to show you the Transformers library.", 'Oh, no.'])

print(res)

Output:

[{'label': 'POSITIVE', 'score': 0.9997994303703308}, {'label': 'NEGATIVE', 'score': 0.9975526928901672}]

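In production, pin the model explicitly instead of relying on the pipeline default (which triggers the "No model was supplied" warning shown in the FAQ below). A minimal sketch:

from transformers import pipeline

# Pinning the checkpoint keeps behavior stable if the default ever changes
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("We are very happy to show you the Transformers library."))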

V. FAQ

Q:libgomp: Thread creation failed: Resource temporarily unavailable

(llama_factory) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/Downloads/models/LLaMA-Factory# python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://hf-mirror.com/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 629B [00:00, 2.06MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████| 268M/268M [00:25<00:00, 10.4MB/s]
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 368kB/s]
vocab.txt: 232kB [00:00, 1.06MB/s]

libgomp: Thread creation failed: Resource temporarily unavailable
(llama3) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/Downloads/models/LLaMA-Factory# python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://hf-mirror.com/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/opt/conda/envs/llama3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(

libgomp: Thread creation failed: Resource temporarily unavailable


Cause: not confirmed. This libgomp error generally means the process failed to create new threads, for example because a per-user thread/process limit (ulimit -u) was hit in the container.
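
A workaround that may help under that assumption is to cap thread usage before transformers is imported; a sketch in Python:

import os

# Limit OpenMP threads and disable tokenizers parallelism;
# these must be set before importing transformers/torch to take effect
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import pipeline
print(pipeline("sentiment-analysis")("I love you"))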

Q:ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.15.0.

(llama3) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/Downloads/models/LLaMA-Factory# python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/__init__.py", line 26, in <module>
    from . import dependency_versions_check
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/dependency_versions_check.py", line 57, in <module>
    require_version_core(deps[pkg])
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/utils/versions.py", line 117, in require_version_core
    return require_version(requirement, hint)
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/utils/versions.py", line 111, in require_version
    _compare_versions(op, got_ver, want_ver, requirement, pkg, hint)
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/utils/versions.py", line 44, in _compare_versions
    raise ImportError(
ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.15.0.
Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git main

Cause: version conflict between transformers and tokenizers.

Solution: install a version that satisfies the stated requirement, for example:

pip install --no-deps tokenizers==0.11.1
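
Then confirm which versions are actually installed:

python -c "import tokenizers, transformers; print(tokenizers.__version__, transformers.__version__)"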

Q: How to get an Access Token

Error when requesting access to a gated repository: Cannot access gated repo for url https://huggingface.co/api

How to get a Hugging Face Access Token; how to get a Hugging Face API key

Permission issues when downloading the Llama 2 model from Hugging Face

See the section [Getting an Access Token] above.
