I. References
II. About Hugging Face
1. Introduction to Hugging Face
Hugging Face describes itself as "the AI community building the future" and is often called "the GitHub of AI".
Name | Description | Link |
---|---|---|
Transformers | Provides APIs and tools to easily download and train state-of-the-art pretrained models. | https://huggingface.co/docs/transformers/index |
Models | Download pretrained models. | https://huggingface.co/models |
Datasets | Download datasets. | https://huggingface.co/datasets |
Accelerate | Helps PyTorch users run on multiple GPUs, on TPUs, or in fp16 with little extra code. | https://huggingface.co/docs/accelerate/index |
Hugging Face Hub
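Name | Description | Documentation link |
---|---|---|
Repositories | Model repositories manage model versions, open-source releases, and so on. They are used much like GitHub repositories. | https://huggingface.co/docs/hub/repositories |
Models | Hosts pretrained models. | https://huggingface.co/docs/hub/models |
Datasets | A lightweight dataset framework. | https://huggingface.co/docs/hub/datasets |
Spaces | Hosts many fun deep learning demo applications. | https://huggingface.co/docs/hub/spaces |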
2. Hugging Face models
See the companion post: A Simple Guide to Using Hugging Face Models
3. Hugging Face datasets
See the companion post: A Simple Guide to Using Hugging Face Datasets
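III. Common operations
1. Get an Access Token
On the Hugging Face website (https://huggingface.co/settings/tokens), create an Access Token. It is used to log in to your Hugging Face account.
Note: choose the Write permission.
2. Log in to your Hugging Face account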
pip install huggingface_hub
huggingface-cli login
# Log in using a token from huggingface.co/settings/tokens
(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# huggingface-cli login
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.
git config --global credential.helper store
Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful
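To confirm from a script that a token is available after logging in, a minimal sketch along these lines may help. It assumes the token is either in an `HF_TOKEN` environment variable or in the cache file under the home directory, mirroring the path printed in the log above. The helper names `find_saved_token` and `mask` are made up for illustration and are not part of `huggingface_hub`.

```python
import os
from pathlib import Path
from typing import Optional


def find_saved_token(home: Optional[Path] = None) -> Optional[str]:
    """Return a token from HF_TOKEN, else from the cache file, else None."""
    env_token = os.environ.get("HF_TOKEN")
    if env_token:
        return env_token.strip()
    # Assumed default location, matching the path printed in the login log.
    base = home if home is not None else Path.home()
    token_file = base / ".cache" / "huggingface" / "token"
    if token_file.is_file():
        return token_file.read_text(encoding="utf-8").strip() or None
    return None


def mask(token: Optional[str]) -> str:
    """Hide most of the token so it can be printed safely."""
    if not token:
        return "<no token found>"
    if len(token) > 8:
        return token[:3] + "..." + token[-2:]
    return "***"


print(mask(find_saved_token()))
```

Printing only a masked form keeps the secret out of notebooks and shared logs.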
3. Install transformers
https://huggingface.co/docs/transformers/main/installation
# Install transformers
pip install transformers
# Install huggingface_hub
pip install huggingface_hub
Check that the installation works:
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"
Output:
[{'label': 'POSITIVE', 'score': 0.9998704791069031}]
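The one-liner above downloads a model the first time it runs. If you only want to confirm that the packages are installed, a lighter check is to read their versions from package metadata. This is just a sketch using the standard library; `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Dict, Optional


def installed_versions(*packages: str) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions("transformers", "huggingface_hub", "tokenizers")
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Seeing the `tokenizers` version next to `transformers` also helps when diagnosing version conflicts like the one in the FAQ.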
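IV. About pipelines
1. Common pipelines
Name | Meaning |
---|---|
feature-extraction | Get vector representations of text |
fill-mask | Fill in the masked words in a given text (cloze) |
ner (named entity recognition) | Named entity recognition (tagging people, places, organizations, and similar entities) |
question-answering | Question answering |
sentiment-analysis | Sentiment analysis |
summarization | Summarization |
text-generation | Text generation |
translation | Translation |
zero-shot-classification | Zero-shot classification |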
2. sentiment-analysis
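sentiment-analysis is the pipeline for sentiment analysis tasks.
At a high level it runs in stages: a tokenizer converts raw text into numeric inputs, the model processes those inputs, and the model's output is converted into labels and scores.
2.1 Tokenizer stage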
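Like other neural networks, Transformer models cannot process raw text directly, so the first step of the pipeline is to convert the text input into numbers the model can understand. For this we use a tokenizer, which is responsible for:
- Splitting the input text into tokens: words, subwords, or symbols (such as punctuation).
- Mapping each token to an integer.
- Adding other inputs that may be useful to the model.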
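The three steps above can be sketched with a toy tokenizer. This is not how the real tokenizer is implemented: the vocabulary and the function names `toy_tokenize` and `toy_encode` are hypothetical, and unknown words are simply mapped to `[UNK]` here, whereas a real WordPiece tokenizer would break them into subwords.

```python
import re
from typing import Dict, List

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "we": 4, "love": 5, "you": 6, "!": 7,
}


def toy_tokenize(text: str) -> List[str]:
    """Step 1: lowercase, then split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text: str) -> List[int]:
    """Steps 2 and 3: map tokens to IDs and wrap them in special tokens."""
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in toy_tokenize(text)]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


print(toy_tokenize("We love you!"))   # ['we', 'love', 'you', '!']
print(toy_encode("We love you!"))     # [2, 4, 5, 6, 7, 3]
print(toy_encode("We adore you!"))    # [2, 4, 1, 6, 7, 3]
```

The `[CLS]` and `[SEP]` markers play the role of the "other inputs" in the third step: they tell the model where a sequence starts and ends.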
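We can use the AutoTokenizer class and its from_pretrained method to make sure all of this preprocessing is done exactly the way it was done when the model was pretrained.
Given a model checkpoint name, from_pretrained automatically fetches the data associated with that model's tokenizer and caches it locally, so it is downloaded only the first time the code runs.
Since the default checkpoint of the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english, we can run the following to get the tokenizer we need: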
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Automatically load the tokenizer that was used when this model was trained
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
With the tokenizer in hand, we can call it to carry out the steps described above:
raw_inputs = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
# padding=True pads the inputs so every sequence in the batch has the same length
# truncation=True truncates sequences that are too long
# return_tensors="pt" returns PyTorch tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
Output:
{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
2607, 2026, 2878, 2166, 1012, 102],
[ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0,
0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
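As you can see, we get two tensors. input_ids holds the tokens converted to integer IDs, where 0 is the ID of the padding token. attention_mask marks which positions are real tokens (1) and which are padding (0), so the model can ignore the padded positions.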
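To make the relationship between padding and the attention mask concrete, here is a small pure-Python sketch that rebuilds the second row of the output above from its unpadded IDs. `pad_batch` is a hypothetical helper written for illustration, not the tokenizer's actual implementation.

```python
from typing import Dict, List

PAD_ID = 0  # ID of the padding token in the output shown above


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded IDs copied from the tokenizer output above.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

padded = pad_batch([first, second])
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

The shorter sequence gains eight trailing zeros, and its mask has zeros in exactly those positions.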
2.2 Model stage
We can download the pretrained model the same way we downloaded the tokenizer. Transformers provides an AutoModel class that also has a from_pretrained method:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
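The AutoModel class can instantiate a model from any checkpoint: it reads the checkpoint's configuration to decide which architecture to build, so you do not have to name the model class yourself.
In the code above, we downloaded the same checkpoint used earlier in the pipeline and instantiated a model from it. This architecture contains only the base Transformer module: given some inputs, it outputs hidden states. These hidden states are usually passed on as input to another part of the model, the head.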
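For sentiment analysis, the head turns hidden states into one logit per class, and a softmax turns those logits into the scores the pipeline prints. In Transformers, a model with such a head is typically loaded through `AutoModelForSequenceClassification`; the sketch below covers only the last step in plain Python so it runs without downloading anything. The logits are hypothetical values, and the label order (index 0 NEGATIVE, index 1 POSITIVE) is an assumption about this checkpoint.

```python
import math
from typing import Dict, List

# Assumed label order for this checkpoint: index 0 is NEGATIVE, index 1 is POSITIVE.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits: List[float]) -> Dict[str, object]:
    """Pick the most probable class and report it in the pipeline's output shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Hypothetical logits for two sentences, not taken from a real run.
for row in ([-4.0, 4.5], [3.0, -3.0]):
    print(postprocess(row))
```

Subtracting the largest logit before exponentiating does not change the result but avoids overflow for large values.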
2.3 Code example
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
res = classifier(["We are very happy to show you the Transformers library.", 'Oh, no.'])
print(res)
Output:
[{'label': 'POSITIVE', 'score': 0.9997994303703308}, {'label': 'NEGATIVE', 'score': 0.9975526928901672}]
V. FAQ
Q: libgomp: Thread creation failed: Resource temporarily unavailable
(llama_factory) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/Downloads/models/LLaMA-Factory# python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://hf-mirror.com/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 629B [00:00, 2.06MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████| 268M/268M [00:25<00:00, 10.4MB/s]
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 368kB/s]
vocab.txt: 232kB [00:00, 1.06MB/s]
libgomp: Thread creation failed: Resource temporarily unavailable
(llama3) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/Downloads/models/LLaMA-Factory# python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://hf-mirror.com/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/opt/conda/envs/llama3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
libgomp: Thread creation failed: Resource temporarily unavailable
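Cause: not determined. The message means the OpenMP runtime (libgomp) asked the operating system for a new thread and was refused, which usually points to a process/thread or resource limit in the environment rather than to transformers itself.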
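One thing that may be worth trying, though it was not verified against this particular error, is to cap the number of OpenMP threads before importing torch or transformers and to look at the per-user process limit. The sketch below assumes a Unix-like system; `OMP_NUM_THREADS` is the standard OpenMP setting and `TOKENIZERS_PARALLELISM` is the variable named in the warning in the log above.

```python
import os

# Set these before importing torch or transformers so the runtimes pick them up.
os.environ.setdefault("OMP_NUM_THREADS", "1")
# Variable named in the tokenizers warning shown in the log.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

try:
    import resource  # Unix only

    # On Linux this limit counts threads as well as processes for the user.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print(f"process/thread limit: soft={soft}, hard={hard}")
except (ImportError, AttributeError, ValueError, OSError):
    print("process limit not available on this platform")

print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```

If the soft limit is small or the container caps the number of threads, raising that limit is likely a better fix than reducing thread counts.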
Q: ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.15.0.
(llama3) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/Downloads/models/LLaMA-Factory# python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/__init__.py", line 26, in <module>
from . import dependency_versions_check
File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/dependency_versions_check.py", line 57, in <module>
require_version_core(deps[pkg])
File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/utils/versions.py", line 117, in require_version_core
return require_version(requirement, hint)
File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/utils/versions.py", line 111, in require_version
_compare_versions(op, got_ver, want_ver, requirement, pkg, hint)
File "/opt/conda/envs/llama3/lib/python3.10/site-packages/transformers/utils/versions.py", line 44, in _compare_versions
raise ImportError(
ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.15.0.
Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git main
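Cause: the installed transformers and tokenizers versions are incompatible.
Fix: install a tokenizers version inside the range named in the error message, for example:
pip install --no-dependencies "tokenizers==0.11.1"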
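Before reinstalling, it can help to check which versions fall inside the range from the error message. The sketch below hard-codes that range and compares plain numeric versions as tuples. It is deliberately simplistic and not a full version-specifier parser (no pre-releases or suffixes), and `tokenizers_version_ok` is a hypothetical helper name.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '0.15.0' into (0, 15, 0); handles plain numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def tokenizers_version_ok(version: str) -> bool:
    """Check a version against >=0.11.1,!=0.11.3,<0.14 from the error message."""
    v = parse(version)
    return parse("0.11.1") <= v < parse("0.14") and v != parse("0.11.3")


for candidate in ("0.15.0", "0.13.3", "0.11.3", "0.11.1"):
    print(candidate, tokenizers_version_ok(candidate))
```

The version from the traceback, 0.15.0, fails the check, which matches the ImportError.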
Q: How do I get an Access Token?
Error when requesting access to a repository: Cannot access gated repo for url https://huggingface.co/api
How to get a HuggingFace Access Token; how to get a HuggingFace API key
See the section on getting an Access Token above.