Sample input and output that trigger the error
The sample code is as follows:
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Register an extra special token on the tokenizer.
special_tokens_dict = {'cls_token': '<CLS>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

# `text` contains 3 sequences, but `tp` contains only 1.
text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
tp = ['first sentence']
output = tokenizer(text, tp)
print(output)
The error raised is as follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-66-71da4f74dda3> in <module>()
8 text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
9 tp = ['first sentence']
---> 10 output = tokenizer(text,tp)
11 print(output)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2377 if text_pair is not None and len(text) != len(text_pair):
2378 raise ValueError(
-> 2379 f"batch length of `text`: {len(text)} does not match batch length of `text_pair`: {len(text_pair)}."
2380 )
2381 batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
ValueError: batch length of `text`: 3 does not match batch length of `text_pair`: 1.
In the sample code, note that `text_pair` has length 1 while `text` has length 3. This mismatch is the cause of the error.
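A quick length check before calling the tokenizer surfaces the problem immediately. A minimal sketch, reusing the `text` and `tp` variables from the sample above:
# Guard against the batch-length mismatch up front.
if len(text) != len(tp):
    raise ValueError(f"got {len(text)} sequences but {len(tp)} pairs; lengths must match")
output = tokenizer(text, tp)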
Error analysis
Exception Class: ValueError
Raise code:
if is_split_into_words:
    is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
else:
    is_batched = isinstance(text, (list, tuple))

if is_batched:
    if isinstance(text_pair, str):
        raise TypeError(
            "when tokenizing batches of text, `text_pair` must be a list or tuple with the same length as `text`."
        )
    if text_pair is not None and len(text) != len(text_pair):
        raise ValueError(
            f"batch length of `text`: {len(text)} does not match batch length of `text_pair`: {len(text_pair)}."
        )
    batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
    return self.batch_encode_plus(...)
The analysis is as follows:
This error is raised from the `__call__` method of the Transformers `PreTrainedTokenizerBase` class (in `tokenization_utils_base.py`, as the traceback shows). The method tokenizes one or several sequences, or one or several pairs of sequences, and prepares them for the model. When `text` is passed as a batch, `text_pair` must be a list or tuple with the same length as `text`.
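To make that contract concrete, these are the call shapes `__call__` accepts; the sentences below are placeholders:
# Single sequence, optionally with a single paired sequence (both plain strings).
tokenizer("a question", "its context")

# Batched: `text` and `text_pair` are lists of equal length; pairs are matched by index.
tokenizer(["question 1", "question 2"], ["context 1", "context 2"])

# Batched without pairs: `text_pair` is simply omitted.
tokenizer(["question 1", "question 2"])

# Invalid: 2 texts against 1 pair raises the ValueError shown above.
# tokenizer(["question 1", "question 2"], ["only one context"])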
The solution
When `text` is a batch of sequences, make sure `text_pair` is a list or tuple with the same length as `text`.
The modified code:
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
special_tokens_dict = {'cls_token': '<CLS>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

# `text` and `tp` now both contain 3 elements, so the batch lengths match.
text = ["this is the first sentences", "this is the second sentece, ", "this one is the third sentence"]
tp = ['first sentence', "second", "third"]
output = tokenizer(text, tp)
print(output)
Output:
{'input_ids': [[5661, 318, 262, 717, 13439, 11085, 6827], [5661, 318, 262, 1218, 1908, 68, 344, 11, 220, 12227], [5661, 530, 318, 262, 2368, 6827, 17089]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}
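Two follow-up notes. First, GPT-2 has no separator or classification token by default, so each sentence and its pair are simply concatenated in `input_ids`; the first entry above is the 5 tokens of "this is the first sentences" followed by the 2 tokens of "first sentence". Second, because the sample registers a new `<CLS>` token in the tokenizer, the model's embedding matrix should be resized before the model is actually used; a minimal sketch:
# The vocabulary grew when '<CLS>' was added, so resize the model's token
# embeddings to match the tokenizer's new size.
model.resize_token_embeddings(len(tokenizer))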