Get to Know and Try Out Hugging Face Quickly (Comprehensive Edition)

Latest recommended article published on 2025-05-18 00:00:23

花花少年

Category: Deep Learning. Tags: Hugging Face

Article link: https://blog.csdn.net/m0_37605642/article/details/140794224

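I. References

Hugging Face official website

Hugging Face documentation

II. About Hugging Face

1. Introduction to Hugging Face

Hugging Face describes itself as "the AI community building the future" and is often called "the GitHub of AI".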

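Name	Description	Link
Transformers	Provides APIs and tools for easily downloading and training state-of-the-art pretrained models.	https://huggingface.co/docs/transformers/index
Models	Download pretrained models.	https://huggingface.co/models
Datasets	Download datasets.	https://huggingface.co/datasets
Accelerate	Makes it easy for PyTorch users to run on multiple GPUs/TPUs and to use fp16 mixed precision.	https://huggingface.co/docs/accelerate/index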

Hugging Face Hub

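Name	Description	Documentation link
Repositories	Model repositories manage model versions, host open-source models, and so on. They are used much like GitHub repositories.	https://huggingface.co/docs/hub/repositories
Models	Provides pretrained models.	https://huggingface.co/docs/hub/models
Datasets	A lightweight framework for datasets.	https://huggingface.co/docs/hub/datasets
Spaces	Spaces hosts many fun deep learning demo applications.	https://huggingface.co/docs/hub/spaces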

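2. Hugging Face models

See my other blog post: A Simple Guide to Using Hugging Face Models

3. Hugging Face datasets

See my other blog post: A Simple Guide to Using Hugging Face Datasets

III. Common Operations

1. Get an Access Token

Create an Access Token on the Hugging Face website; it is used to log in to your Hugging Face account.

Note: select the Write permission (required for pushing to the Hub; a Read token is enough if you only download).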

2. Log in to your Hugging Face account

pip install huggingface_hub
huggingface-cli login
# Log in using a token from huggingface.co/settings/tokens

(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful
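For scripts and notebooks, the same login can be done without the interactive prompt. The sketch below is one possible approach, assuming the token has been exported in an environment variable named HF_TOKEN; the helper names read_hf_token and login_from_env are made up for illustration. huggingface_hub.login is the library's programmatic counterpart of huggingface-cli login.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env(env_var="HF_TOKEN"):
    """Log in with the token from the environment; return True if login was called."""
    token = read_hf_token(env_var)
    if token is None:
        print(f"{env_var} is not set; run `huggingface-cli login` instead.")
        return False
    # Imported lazily so read_hf_token works even without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)  # raises an error if the token is rejected
    return True
```

Keeping the token in the environment rather than in the source file avoids committing it to a repository by accident.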

3. Install transformers

https://huggingface.co/docs/transformers/main/installation

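# Install transformers
pip install transformers

# Install huggingface_hub (the Hub client library; the PyPI package named "huggingface" is not needed)
pip install huggingface_hub

Check that the installation works: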

python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"

Output:

[{'label': 'POSITIVE', 'score': 0.9998704791069031}]
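If the check above fails, it helps to see which versions are actually installed. The following is a small sketch using only the standard library; the helper name installed_version and the list of packages inspected are illustrative choices, not part of any Hugging Face tooling.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this article relies on.
for name in ("transformers", "tokenizers", "huggingface_hub", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```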

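IV. About pipelines

1. Common pipelines

Name	Meaning
`feature-extraction`	Get vector representations of text
`fill-mask`	Fill in the blanks in a given text (cloze)
`ner` (named entity recognition)	Named entity recognition (finding entities such as people, places, and organizations)
`question-answering`	Question answering
`sentiment-analysis`	Sentiment analysis
`summarization`	Summarization
`text-generation`	Text generation
`translation`	Translation
`zero-shot-classification`	Zero-shot classification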

2. sentiment-analysis

sentiment-analysis is the pipeline for sentiment analysis tasks.

[Figure: an intuitive overview of the stages of the sentiment-analysis pipeline]

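2.1 Tokenizer stage

Like other neural networks, Transformer models cannot process raw text directly, so the first step of the pipeline is to convert the text input into numbers the model can understand. For this we use a tokenizer, which is responsible for:

Splitting the input text into tokens: words, subwords, or symbols (such as punctuation).
Mapping each token to an integer.
Adding other inputs that may be useful to the model.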
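The three steps can be imitated with a toy example. This is only a sketch: the vocabulary TOY_VOCAB and the functions toy_tokenize and toy_encode are invented for illustration, and the splitting rule is a simple regular expression rather than the subword algorithm a real tokenizer uses.

```python
import re

# Toy vocabulary; the special-token names mimic BERT-style models.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "you": 6, "!": 7}


def toy_tokenize(text):
    """Step 1: lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text, vocab=TOY_VOCAB):
    """Steps 2 and 3: map tokens to IDs, then add the special start/end tokens."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in toy_tokenize(text)]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


print(toy_tokenize("I love you!"))  # ['i', 'love', 'you', '!']
print(toy_encode("I love you!"))    # [2, 4, 5, 6, 7, 3]
print(toy_encode("I love pizza!"))  # [2, 4, 5, 1, 7, 3]  ("pizza" is unknown)
```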

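We can use the AutoTokenizer class and its from_pretrained method to make sure all of this preprocessing is done in exactly the same way as when the model was pretrained.

Given the model's checkpoint name, it automatically fetches the data associated with the model's tokenizer and caches it locally (the data is downloaded the first time the code runs).

Since the default checkpoint of the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english, we can get the tokenizer we need as follows: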

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Automatically load the tokenizer that was used when this model was trained
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

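With the tokenizer in hand, we can call it to carry out the steps described above:

raw_inputs = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]

# padding=True pads the input sequences so that all sequences in the batch have the same length
# truncation=True truncates sequences that are too long
# return_tensors="pt" returns PyTorch tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

Output: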

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

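As you can see, we get two tensors. input_ids holds the tokens converted to integer IDs: 101 and 102 are the [CLS] and [SEP] special tokens added by the tokenizer, and 0 is the ID of the padding token. attention_mask marks real tokens with 1 and padding positions with 0, so the model knows which positions to ignore.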
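The relationship between padding and the attention mask can be reproduced in a few lines of plain Python. This is a sketch of the idea only, not the tokenizer's actual implementation; pad_batch is a hypothetical helper, and the IDs are borrowed from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to the longest one and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 1045, 5223, 102], [101, 102]])
print(batch["input_ids"])       # [[101, 1045, 5223, 102], [101, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```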

2.2 Model stage

We can download the pretrained model the same way we did with the tokenizer. Transformers provides an AutoModel class, which also has a from_pretrained method:

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

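The AutoModel class picks the right architecture for a checkpoint and instantiates it, which makes it a convenient way to load a model.

In the code above, we downloaded the same checkpoint that the pipeline used earlier and instantiated a model from it. This architecture, however, contains only the base Transformer module: given some inputs, it outputs hidden states. These hidden states are usually fed into another part of the model, the model head.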
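To get class scores instead of hidden states, a model with a classification head is needed, and its raw outputs (logits) still have to be converted into probabilities. The sketch below is one way this could look: logits_from_checkpoint uses AutoModelForSequenceClassification and requires transformers, torch, and network access, so it is defined but not called here. The softmax and postprocess helpers are plain Python, the label order in ID2LABEL is an assumption about this checkpoint, and example_logits are made-up values for illustration, not results from the article.

```python
import math

# Assumed label order for this checkpoint: index 0 = NEGATIVE, index 1 = POSITIVE.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def logits_from_checkpoint(texts, checkpoint="distilbert-base-uncased-finetuned-sst-2-english"):
    """Run tokenizer + model with a classification head and return raw logits."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        return model(**inputs).logits.tolist()


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits):
    """Pick the most probable label for each row of logits."""
    results = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": ID2LABEL[best], "score": probs[best]})
    return results


example_logits = [[-1.5, 1.5], [4.0, -3.0]]  # hypothetical values
print(postprocess(example_logits))
```

Calling postprocess(logits_from_checkpoint(raw_inputs)) would chain the real model to the same post-processing step.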

2.3 Code example

from transformers import pipeline
 
classifier = pipeline("sentiment-analysis")
res = classifier(["We are very happy to show you the Transformers library.", 'Oh, no.'])

print(res)

Output:

[{'label': 'POSITIVE', 'score': 0.9997994303703308}, {'label': 'NEGATIVE', 'score': 0.9975526928901672}]
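When no model is given, the pipeline falls back to a default checkpoint and warns that this is not recommended in production (the warning appears in the FAQ logs below). A possible way to avoid that is to pin the model and revision explicitly. The sketch assumes the checkpoint name and the short revision hash reported in those logs; build_classifier is a hypothetical helper, and calling it requires transformers and network access.

```python
CHECKPOINT = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"  # revision hash taken from the log output in this article


def build_classifier(checkpoint=CHECKPOINT, revision=REVISION):
    """Create a sentiment-analysis pipeline pinned to one model revision."""
    from transformers import pipeline  # imported lazily; needs transformers installed

    return pipeline("sentiment-analysis", model=checkpoint, revision=revision)


# Usage (downloads the model on first run):
# classifier = build_classifier()
# print(classifier(["We are very happy to show you the Transformers library.", "Oh, no."]))
```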

V. FAQ

Q: `libgomp: Thread creation failed: Resource temporarily unavailable`

(llama_factory) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/Downloads/models/LLaMA-Factory# python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://hf-mirror.com/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 629B [00:00, 2.06MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████| 268M/268M [00:25<00:00, 10.4MB/s]
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 368kB/s]
vocab.txt: 232kB [00:00, 1.06MB/s]

libgomp: Thread creation failed: Resource temporarily unavailable

(llama3) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/Downloads/models/LLaMA-Factory# python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://hf-mirror.com/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/opt/conda/envs/llama3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(

libgomp: Thread creation failed: Resource temporarily unavailable

Cause: unknown.
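Although the cause was not determined here, this message generally means the OpenMP runtime (libgomp) could not start a worker thread, which typically happens when a per-user process/thread limit or a container resource limit has been reached. The sketch below shows a commonly suggested workaround, not a verified fix for this case: cap the number of OpenMP threads before importing torch or transformers, and inspect the process limit. The helper names are illustrative, and the resource module is available on Unix-like systems only.

```python
import os
import resource  # standard library, Unix-like systems only


def cap_openmp_threads(n=1):
    """Limit OpenMP to n threads; must run before torch/transformers are imported."""
    os.environ["OMP_NUM_THREADS"] = str(n)
    return os.environ["OMP_NUM_THREADS"]


def process_limits():
    """Return the (soft, hard) limit on processes/threads for the current user."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


print("OMP_NUM_THREADS =", cap_openmp_threads(1))
print("RLIMIT_NPROC (soft, hard) =", process_limits())
```

The same cap can be applied from the shell by prefixing the command with OMP_NUM_THREADS=1.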

Q: `ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.15.0.`

(llama3) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/Downloads/models/LLaMA-Factory# python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/__init__.py", line 26, in <module>
    from . import dependency_versions_check
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/dependency_versions_check.py", line 57, in <module>
    require_version_core(deps[pkg])
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/utils/versions.py", line 117, in require_version_core
    return require_version(requirement, hint)
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/utils/versions.py", line 111, in require_version
    _compare_versions(op, got_ver, want_ver, requirement, pkg, hint)
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/utils/versions.py", line 44, in _compare_versions
    raise ImportError(
ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.15.0.
Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git main

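Cause: a version conflict between transformers and tokenizers.

Solution: following the error message, install a tokenizers version inside the required range (or upgrade transformers, as the message also suggests).

pip install --no-dependencies 'tokenizers>=0.11.1,!=0.11.3,<0.14'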
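To see why 0.15.0 is rejected, the constraint from the error message can be checked by hand. This is a simplified sketch that handles plain numeric versions only; parse_version and tokenizers_version_ok are hypothetical helpers, and real tooling would normally rely on the packaging library instead.

```python
def parse_version(text):
    """Turn '0.15.0' into (0, 15, 0); handles plain numeric versions only."""
    return tuple(int(part) for part in text.split("."))


def tokenizers_version_ok(version):
    """Check a version against >=0.11.1, !=0.11.3, <0.14 from the error message."""
    v = parse_version(version)
    return (0, 11, 1) <= v < (0, 14) and v != (0, 11, 3)


for candidate in ("0.11.1", "0.11.3", "0.13.3", "0.15.0"):
    print(candidate, tokenizers_version_ok(candidate))
# 0.11.1 True
# 0.11.3 False
# 0.13.3 True
# 0.15.0 False
```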

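Q: How do I get an Access Token?

Error when requesting repository access: Cannot access gated repo for url https://huggingface.co/api

How to get a HuggingFace Access Token; how to get a HuggingFace API Key

Permission issues when downloading Llama2 models from Hugging Face

See the section "Get an Access Token" above.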
