Publishing your own model on Hugging Face (Ubuntu 19.0)

Ensheng Shi

Last modified: 2022-09-26 15:53:34


First published: 2022-08-14 17:00:31

Article link: https://blog.csdn.net/qq_36097393/article/details/126333377


This walkthrough uses Ubuntu as the example.

Preface

Hugging Face home page: https://huggingface.co/
Register an account there with your email address.

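Install Git LFS

Working with the Hugging Face Hub is much like working with GitHub. Plain Git hosting handles large files poorly, though (GitHub warns about files over 50 MB and rejects files over 100 MB), and a model checkpoint easily runs to several hundred MB, so Git LFS is needed.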

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
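
To see which files in a local folder are large enough to need LFS before committing, a small check like the one below can help. This is only a sketch: the function name `large_files` and the 50 MB default are illustrative assumptions, not part of any Hugging Face or Git tool.

```python
from pathlib import Path

# Assumed threshold for illustration: the 50 MB per-file size mentioned above.
SIZE_LIMIT_BYTES = 50 * 1024 * 1024


def large_files(folder, limit=SIZE_LIMIT_BYTES):
    """Return (relative path, size in bytes) for every file larger than `limit`."""
    root = Path(folder)
    found = []
    for path in sorted(root.rglob("*")):
        # Skip Git's own bookkeeping directory.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file() and path.stat().st_size > limit:
            found.append((str(path.relative_to(root)), path.stat().st_size))
    return found
```

Calling `large_files("./saved_pre_model")` on a folder that holds a model checkpoint would typically list the weights file, which is the one that has to go through LFS.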

Log in from the terminal

(base) workspace:~$ huggingface-cli login

        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .

Token:
Login successful
Your token has been saved to /home/t-enshengshi/.huggingface/token

To make creating the repo easier later on, use a token with write permission here rather than a read-only one.

Create and upload the repo

huggingface-cli repo create model_name
git clone https://huggingface.co/username/model_name

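Uploading works much like it does on GitHub: copy the model files into the cloned folder, then commit and push from inside it.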

git add .
git commit -m "first commit"
git push
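
The same create-and-upload flow can also be scripted from Python instead of going through git. The sketch below is an alternative to the commands above, not the method used in this post. It assumes the `huggingface_hub` package is installed and that you are already logged in with a write token; `repo_url` and `publish_folder` are hypothetical helper names, and `username` / `model_name` are the same placeholders as in the commands above.

```python
def repo_url(username: str, model_name: str) -> str:
    """Build the Hub URL that `git clone` uses above."""
    return f"https://huggingface.co/{username}/{model_name}"


def publish_folder(username: str, model_name: str, folder: str) -> str:
    """Create the repo if it does not exist and upload every file in `folder`.

    Assumes a write token has already been saved by `huggingface-cli login`.
    """
    # Imported lazily so that repo_url() works without the library installed.
    from huggingface_hub import HfApi

    api = HfApi()
    repo_id = f"{username}/{model_name}"
    api.create_repo(repo_id=repo_id, exist_ok=True)
    api.upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        commit_message="first commit",
    )
    return repo_url(username, model_name)
```

For example, `publish_folder("username", "model_name", "./saved_pre_model")` would upload the saved model and tokenizer files in one commit.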

Notes on the uploaded files

The uploaded files come in two parts: the model and the tokenizer.

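model.save_pretrained("./saved_pre_model")

save_pretrained takes a directory, not a file name. This writes pytorch_model.bin and config.json into that directory (recent versions of transformers may write model.safetensors instead of pytorch_model.bin).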

tokenizer.save_pretrained("./saved_pre_model")

This produces added_tokens.json, merges.txt, special_tokens_map.json, tokenizer_config.json, and vocab.json.
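
Before pushing, it can be worth checking that all of these files actually ended up in the folder. A minimal sketch follows; the helper name `missing_files` is hypothetical, and the expected names are simply the ones listed in this post for a RoBERTa-style model, so pass your own list if your model or library version writes different files.

```python
from pathlib import Path

# File names listed in this post for a RoBERTa-style checkpoint.
MODEL_FILES = ["pytorch_model.bin", "config.json"]
TOKENIZER_FILES = [
    "added_tokens.json",
    "merges.txt",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.json",
]


def missing_files(folder, expected=None):
    """Return the expected file names that are not present in `folder`."""
    if expected is None:
        expected = MODEL_FILES + TOKENIZER_FILES
    root = Path(folder)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list from `missing_files("./saved_pre_model")` means every file named above is in place.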

Usage

I used RoBERTa, so it serves as the example here.

import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("Ensheng/coco")
model = RobertaModel.from_pretrained("Ensheng/coco")

output

Downloading: 100%
916k/916k [00:00<00:00, 1.60MB/s]
Downloading: 100%
434k/434k [00:00<00:00, 411kB/s]
Downloading: 100%
941/941 [00:00<00:00, 25.7kB/s]
Downloading: 100%
1.63k/1.63k [00:00<00:00, 45.5kB/s]
Downloading: 100%
1.10k/1.10k [00:00<00:00, 29.4kB/s]
Downloading: 100%
738/738 [00:00<00:00, 16.7kB/s]
Downloading: 100%
481M/481M [00:08<00:00, 58.3MB/s]

Test the tokenizer and model

nl_tokens=tokenizer.tokenize("return maximum value")
code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.sep_token]
tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
print(context_embeddings)

output

tensor([[[-0.6205,  0.2075, -0.6909,  ...,  0.4914,  1.5620,  0.5642],
         [-0.6205,  0.2075, -0.6909,  ...,  0.4914,  1.5620,  0.5642],
         [-0.6205,  0.2075, -0.6909,  ...,  0.4914,  1.5620,  0.5642],
         ...,
         [-0.3708,  0.5695, -1.5493,  ..., -0.0023,  1.2854,  0.3780],
         [ 1.3056, -0.1004, -0.6191,  ..., -0.4956,  1.5792,  1.5347],
         [ 0.1874,  1.3228, -0.9529,  ..., -1.0119,  1.7750,  1.3678]]],
       grad_fn=<NativeLayerNormBackward0>)
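
The `grad_fn=<NativeLayerNormBackward0>` at the end of the output shows that PyTorch tracked gradients during this forward pass. If the embeddings are only needed for inference, the call can be wrapped in `torch.no_grad()`. The sketch below shows the effect with a small `LayerNorm` layer standing in for the model, so that it runs without downloading anything; the layer and tensor sizes are arbitrary assumptions.

```python
import torch

# A small stand-in for the model: LayerNorm has learnable parameters,
# so its output normally carries a grad_fn just like the output above.
layer = torch.nn.LayerNorm(4)
x = torch.randn(1, 3, 4)

# Default behaviour: the result is attached to the autograd graph.
with_grad = layer(x)

# Inference only: no graph is built, which saves memory.
with torch.no_grad():
    without_grad = layer(x)

print(with_grad.grad_fn is not None)  # True
print(without_grad.grad_fn is None)   # True
```

Applied to the example above, that would mean running `model(torch.tensor(tokens_ids)[None,:])[0]` inside a `with torch.no_grad():` block.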