from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm')
text = '在此基础上,美国试图挑拨伊朗和伊拉克关系。'
tokenizer_out = tokenizer.tokenize(text)
print(tokenizer_out)
['在', '此', '基', '础', '上', ',', '美', '国', '试', '图', '挑', '拨', '伊', '朗', '和', '伊', '拉', '克', '关', '系', '。']
tokenize only splits the text into tokens; it does not add [CLS] and [SEP] at the beginning and end of the sentence.
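This is easy to check: mapping those tokens to ids with convert_tokens_to_ids (a small sketch using only standard tokenizer methods) gives the raw ids with no 101/102 at the ends:

token_ids = tokenizer.convert_tokens_to_ids(tokenizer_out)
print(token_ids)
# [1762, 3634, 1825, 4794, 677, 8024, ...] -- same as the encode output below, minus the leading 101 and trailing 102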
tokenizer_encode = tokenizer.encode(text)
print(tokenizer_encode)
[101, 1762, 3634, 1825, 4794, 677, 8024, 5401, 1744, 6407, 1745, 2904, 2884, 823, 3306, 1469, 823, 2861, 1046, 1068, 5143, 511, 102]
encode builds on tokenize: it converts each token to its id using BERT's own vocab.txt and adds [CLS] and [SEP] at the beginning and end of the sentence; their ids in the vocabulary are 101 and 102 respectively. The result is exactly the input_ids.
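The special-token ids can be read directly off the tokenizer, and convert_ids_to_tokens maps the ids back to readable tokens (a quick check using standard tokenizer attributes):

print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # 101 102
print(tokenizer.convert_ids_to_tokens(tokenizer_encode))  # starts with '[CLS]' and ends with '[SEP]'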
data = tokenizer(text)
print(data)
{
'input_ids': [101, 1762, 3634, 1825, 4794, 677, 8024, 5401, 1744, 6407, 1745, 2904, 2884, 823, 3306, 1469, 823, 2861, 1046, 1068, 5143, 511, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
Calling the tokenizer directly returns a dictionary containing input_ids, token_type_ids, and attention_mask.
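To feed this into the model, ask the tokenizer for PyTorch tensors and pass them to BertModel (a minimal sketch; the output attribute names assume a recent transformers version that returns a ModelOutput by default):

import torch
model = BertModel.from_pretrained('hfl/chinese-bert-wwm')
inputs = tokenizer(text, return_tensors='pt')  # same fields as above, as tensors with a batch dimension
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 23, 768]) -- one vector per token
print(outputs.pooler_output.shape)      # torch.Size([1, 768])     -- pooled [CLS] representation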
BERT structure
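The module tree below is simply what transformers prints for the loaded model (using the model object from the snippet above):

print(model)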
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)