Using the transformers library, Part 3: Data preprocessing

桉夏与猫

Category: transformers. Tags: nlp, machine learning, pytorch, neural networks, natural language processing

Article link: https://blog.csdn.net/qq_28790663/article/details/117073917

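Processing data

This part shows how to preprocess data with the Transformers library. The main tool is the tokenizer.

You can instantiate the tokenizer class that belongs to a specific model, or simply use the AutoTokenizer class.

A tokenizer splits a piece of text into words (or parts of words, punctuation marks, and so on). The resulting pieces are usually called tokens.

It then converts these tokens into numbers, so that a tensor can be built from them and fed to the model.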
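The two steps just described (split into tokens, then map tokens to numbers) can be sketched in plain Python. This is only an illustration: the regex split and the tiny `toy_vocab` are assumptions made up for this example, whereas a real tokenizer such as BERT's relies on a learned subword vocabulary.

```python
import re

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "hello": 1, ",": 2, "world": 3, "!": 4}


def toy_tokenize(text):
    """Split text into lowercase word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_convert_tokens_to_ids(tokens):
    """Map each token to its id, falling back to the [UNK] id for unknown tokens."""
    return [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]


tokens = toy_tokenize("Hello, world!")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```

The list of ids is what would eventually be turned into a tensor and passed to a model.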

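Note: if you plan to use a pretrained model, it is important to use the tokenizer that was paired with that model.

To automatically download the vocabulary (vocab) that was used during pretraining, use the from_pretrained() method: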

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

Basic usage

PreTrainedTokenizer has many methods, but the only one you need to remember is __call__: pass the sentence directly to the tokenizer object to get the result:

sentence = "Hello, I'm a single sentence!"
encoded_input = tokenizer(sentence)
print(encoded_input)

{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

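The result is a dictionary whose keys are strings and whose values are lists of integers.

The input_ids field holds the id of each token in the sentence.

attention_mask and token_type_ids are explained later in this article.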
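To make the structure concrete, here is a small sketch that inspects the dictionary. The values are copied from the printed output above rather than produced by a live tokenizer call, so the snippet runs without downloading anything.

```python
# Values copied from the output printed above (bert-base-cased).
encoded_input = {
    "input_ids": [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102],
    "token_type_ids": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}

# Every field describes the same token sequence, so all lists have equal length.
for key, values in encoded_input.items():
    print(key, len(values))

same_length = len({len(values) for values in encoded_input.values()}) == 1
print(same_length)
```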

The tokenizer can also decode a list of token ids back into text:

decoded_input = tokenizer.decode(encoded_input['input_ids'])
print(decoded_input)

[CLS] Hello, I'm a single sentence! [SEP]

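As you can see, the decoded sentence contains some special tokens that the tokenizer added for this model. Not every model needs these special tokens.

For example, with the gpt2-medium tokenizer, decoding returns the same string as the original sentence, with no special tokens added.

If you do not want any special tokens to be added, pass this argument when encoding:

add_special_tokens=False

Note: decode() has no add_special_tokens argument.

Use skip_special_tokens instead. Its default value is False, so pass

skip_special_tokens=True

to leave the special tokens out of the decoded text: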

decoded_input = tokenizer.decode(encoded_input['input_ids'],skip_special_tokens=True)
print(decoded_input)

Hello, I'm a single sentence!
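Conceptually, skipping special tokens amounts to filtering their ids out before converting ids back to tokens. The sketch below illustrates the idea only; it is not the library's implementation. The `id_to_token` table is read off the outputs shown above, and `toy_ids_to_tokens` is a hypothetical helper name.

```python
# Id-to-token pairs inferred from the outputs above; 101 and 102 are [CLS] and [SEP].
id_to_token = {
    101: "[CLS]", 8667: "Hello", 117: ",", 146: "I", 112: "'", 182: "m",
    170: "a", 1423: "single", 5650: "sentence", 106: "!", 102: "[SEP]",
}
special_ids = {101, 102}


def toy_ids_to_tokens(ids, skip_special_tokens=False):
    """Convert ids to tokens, optionally dropping the special ones first."""
    if skip_special_tokens:
        ids = [token_id for token_id in ids if token_id not in special_ids]
    return [id_to_token[token_id] for token_id in ids]


input_ids = [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102]
print(toy_ids_to_tokens(input_ids))
print(toy_ids_to_tokens(input_ids, skip_special_tokens=True))
```

The real decode() additionally joins the tokens back into a single string, which this sketch leaves out.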

If you have several sentences to process, put them in a list and pass the list directly to the tokenizer:

batch_sentences = ["Hello I'm a single sentence",
"And another sentence",
"And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}

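Each value in the result is now a list of lists, one inner list per sentence.

If you want to send a group of sentences through the tokenizer at once to build a batch, you will probably want to:

1. Pad every sentence to the length of the longest one

2. Truncate every sentence to the maximum length the model accepts

3. Return tensors

All of this can be done in a single call: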

batch = tokenizer(batch_sentences,padding=True,truncation=True,return_tensors='pt')
print(batch)

{'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
        [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
        [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]), 
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0]])}

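Note: input_ids is now a tensor.

All encoded sentences now have the same length. Shorter ones are padded with 0, and attention_mask is 0 at those padded positions so that the model can ignore them.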
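The padding and the attention mask shown above can be reproduced by hand. The following is a minimal sketch, assuming right-side padding with pad id 0 as in the output above; `pad_batch` is a hypothetical helper, not part of the library.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every id list to the longest length and build the matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids copied from the earlier batch output.
batch_ids = [
    [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
    [101, 1262, 1330, 5650, 102],
    [101, 1262, 1103, 1304, 1304, 1314, 1141, 102],
]
padded = pad_batch(batch_ids)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The lists match the tensor contents printed above; the tokenizer simply returns them wrapped in tensors when return_tensors='pt' is given.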

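Processing sentence pairs

Sometimes you need to feed a pair of sentences to the model, for example to classify whether two sentences are similar, or for a question-answering model. For BERT, the input takes this form:

[CLS] Sequence A [SEP] Sequence B [SEP]

You can pass the two sentences to the tokenizer together.

Note: pass them as two separate arguments, not as a list. A list would be treated as a batch of two independent sentences.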

sentence1="How old are you"
sentence2="I'm 6 years old"
encoded_input = tokenizer(sentence1,sentence2)
print(encoded_input)

{'input_ids': [101, 1731, 1385, 1132, 1128, 102, 146, 112, 182, 127, 1201, 1385, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

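In input_ids you can see that:

101 stands for [CLS]

102 stands for [SEP]

So when a sentence pair is passed in, the tokenizer automatically adds special tokens that separate the two sentences. token_type_ids is 0 for the first sentence (including [CLS] and the first [SEP]) and 1 for the second sentence and its closing [SEP].

As before, the result can be decoded: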

print(tokenizer.decode(encoded_input['input_ids']))

[CLS] How old are you [SEP] I'm 6 years old [SEP]
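The layout of input_ids and token_type_ids for a pair can be rebuilt by hand from the ids of the two sentences. This is a sketch of the BERT-style layout only; `toy_build_pair` is a hypothetical helper, and the ids are copied from the pair output shown earlier.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids seen in the outputs


def toy_build_pair(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] and mark which segment each position belongs to."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


ids_a = [1731, 1385, 1132, 1128]            # "How old are you"
ids_b = [146, 112, 182, 127, 1201, 1385]    # "I'm 6 years old"
pair = toy_build_pair(ids_a, ids_b)
print(pair)
```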

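What if you have many questions and many answers?

Pass them as two lists of equal length. The items at the same position in each list are paired together: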

batch_sentences = ["Hello I'm a single sentence",
"And another sentence",
"And the very very last one"]

batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
"And I should be encoded with the second sentence",
"And I go with the very last one"]

encoded_inputs = tokenizer(batch_sentences,batch_of_second_sentences)
print(encoded_inputs)

{'input_ids': 
[[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
[101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]], 
'token_type_ids': 
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'attention_mask': 
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Again, the decode method can be used to recover the text:

for ids in encoded_inputs['input_ids']:
    print(tokenizer.decode(ids))

[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]

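What you need to know about padding and truncation

Three arguments are involved: padding, truncation, and max_length.

padding controls how sentences that are too short get padded:

True or 'longest' pads to the length of the longest sentence in the batch (with a single sentence, no padding is applied)

'max_length' pads to the length given by max_length, or to the maximum length the model accepts if max_length is not given

False or 'do_not_pad' is the default: the sentences in the batch are not padded

truncation controls how sentences get truncated. It takes a boolean or a string:

True or 'longest_first' truncates to the length given by max_length, or to the maximum length the model accepts if max_length is not given. For a sentence pair, tokens are removed one at a time from the longer sentence until the pair fits.

'only_first' truncates only the first sentence of a pair

'only_second' truncates only the second sentence of a pair

False or 'do_not_truncate' applies no truncation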
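The difference between the truncation strategies is easiest to see on a small pair of id lists. The sketch below is a simplified illustration, not the library's code: `toy_truncate_pair` is a hypothetical helper, its max_length counts only content tokens (the real tokenizer also counts special tokens), and the tie-breaking rule in 'longest_first' is an assumption.

```python
def toy_truncate_pair(ids_a, ids_b, max_length, strategy="longest_first"):
    """Shorten a pair of id lists so that together they fit into max_length."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    ids_a, ids_b = list(ids_a), list(ids_b)
    overflow = len(ids_a) + len(ids_b) - max_length
    if overflow <= 0 or strategy == "do_not_truncate":
        return ids_a, ids_b
    if strategy == "only_first":
        return ids_a[:max(len(ids_a) - overflow, 0)], ids_b
    if strategy == "only_second":
        return ids_a, ids_b[:max(len(ids_b) - overflow, 0)]
    if strategy == "longest_first":
        # Remove one token at a time from whichever list is currently longer.
        # Assumption: on a tie, the second list loses the token.
        for _ in range(overflow):
            if len(ids_a) > len(ids_b):
                ids_a.pop()
            else:
                ids_b.pop()
        return ids_a, ids_b
    raise ValueError(f"unknown strategy: {strategy}")


first = [1, 2, 3, 4, 5]
second = [6, 7, 8, 9]
for name in ("only_first", "only_second", "longest_first", "do_not_truncate"):
    print(name, toy_truncate_pair(first, second, max_length=6, strategy=name))
```

With 'only_first' or 'only_second', one sentence absorbs the whole cut, while 'longest_first' spreads it across both.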

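Processing pre-tokenized input

Sometimes the input has already been split into words beforehand, as is common in named entity recognition (NER) or part-of-speech (POS) tagging.

In that case, just pass is_split_into_words=True when encoding: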

split_sentences=["Hello","I'm","a","single","sentence"]
encoded_input = tokenizer.encode(split_sentences,is_split_into_words=True)
print(encoded_input)

[101, 8667, 146, 112, 182, 170, 1423, 5650, 102]
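Notice that the five input words produced seven tokens between [CLS] and [SEP], because "I'm" was split further. For word-level tasks it helps to know which word each token came from. The sketch below illustrates that bookkeeping; the regex split is only an assumed stand-in for the model's real subword tokenizer, and `toy_encode_words` is a hypothetical helper.

```python
import re


def toy_encode_words(words):
    """Split each pre-tokenized word further and remember its originating word index."""
    tokens, word_ids = [], []
    for word_index, word in enumerate(words):
        # Stand-in for subword tokenization: separate word characters from punctuation.
        for piece in re.findall(r"\w+|[^\w\s]", word):
            tokens.append(piece)
            word_ids.append(word_index)
    return tokens, word_ids


split_words = ["Hello", "I'm", "a", "single", "sentence"]
tokens, word_ids = toy_encode_words(split_words)
print(tokens)
print(word_ids)
```

With a mapping like this, a label given for the word "I'm" can be assigned to all three of its tokens.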
