From zero (no Python background needed): tokenizing, encoding, and decoding with Hugging Face transformers models

  • Set up a virtual environment in Anaconda
conda env list    # check which environments already exist
  • Create a new environment named transformers
conda create -n transformers python=3.6
  • Activate the new environment
conda activate transformers
  • Install the required libraries
pip install torch
pip install transformers
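To confirm that both libraries installed correctly, you can print their versions (a quick sanity check; both packages expose a standard __version__ attribute):
python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"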
  • Import the required tools
import torch
from transformers import AutoTokenizer
  • Basic operations on a single English sentence (for Chinese tokenization, look for the Harbin Institute of Technology (HIT) models on GitHub)
# Tokenize with bert-base-uncased + convert to ids + encode (prepare for model)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
tokens = tokenizer.tokenize("Sometime it last in love, sometime it hurts instead.") # tokenize the sentence
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # map the tokens to vocabulary ids

final_inputs = tokenizer.prepare_for_model(input_ids)   # add the special tokens and build the ids the model needs (a Transformer also needs an attention mask matrix)
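The three steps above can also be collapsed into a single call. As a quick sanity check, here is a minimal self-contained sketch (reusing the same bert-base-uncased tokenizer) that runs both paths and confirms they produce the same input_ids:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Sometime it last in love, sometime it hurts instead."

tokens = tokenizer.tokenize(text)                    # step 1: split into subword tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # step 2: map tokens to vocabulary ids
prepared = tokenizer.prepare_for_model(input_ids)    # step 3: add [CLS]/[SEP] and build the model inputs

direct = tokenizer(text)                             # the one-call shortcut does all three steps at once
assert prepared["input_ids"] == direct["input_ids"]  # both paths yield identical input_ids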
  • Decoding
# Decode with bert-base-uncased ———> [CLS] should i give up, or should i just keep chasing pavement, even if it leads nowhere. [SEP]
inputs = tokenizer("Should i give up, or should i just keep chasing pavement, even if it leads nowhere.")
sentence = tokenizer.decode(inputs["input_ids"]) # the decoded sentence; the tokenizer output contains input_ids, token_type_ids, and attention_mask
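To see exactly what the tokenizer returned, you can inspect the output directly; a minimal sketch continuing from the inputs above:

print(list(inputs.keys()))       # ['input_ids', 'token_type_ids', 'attention_mask']
print(inputs["attention_mask"])  # all 1s here: every position is a real token the model should attend to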

# Decode with roberta-base ———> <s>Should i give up, or should i just keep chasing pavement, even if it leads nowhere.</s>
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
inputs = tokenizer("Should i give up, or should i just keep chasing pavement, even if it leads nowhere.")
sentence2 = tokenizer.decode(inputs["input_ids"])
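If you want the sentence back without the [CLS]/[SEP] or <s>/</s> markers, decode accepts a skip_special_tokens flag (a standard parameter of tokenizer.decode); a minimal sketch continuing with the roberta-base tokenizer above:

sentence_clean = tokenizer.decode(inputs["input_ids"], skip_special_tokens=True)
print(sentence_clean)  # -> Should i give up, or should i just keep chasing pavement, even if it leads nowhere.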

The content above is based on https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation
