【Bugs】pyarrow.lib.ArrowInvalid: Column 2 named label expected length 1004 but got length 1000
Another Singles' Day has come around, and I'm tired, sleepy, and miserable.
——2021.11.11
Bug description
In a PyTorch + HuggingFace environment I wanted to change the indices in the tokenizer-encoded input_ids. I tried to do it by assignment inside the map function and got this error:
tokenized_datasets_test = agnews_dataset['test'].map(tokenize_test,batched=True,
... remove_columns=["text",'label_name'])
0%| | 0/8 [00:00<?, ?ba/s]
Traceback (most recent call last):
File "<input>", line 2, in <module>
File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\arrow_dataset.py", line 1703, in map
desc=desc,
File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\arrow_dataset.py", line 185, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\fingerprint.py", line 398, in wrapper
out = func(self, *args, **kwargs)
File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\arrow_dataset.py", line 2065, in _map_single
writer.write_batch(batch)
File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\arrow_writer.py", line 411, in write_batch
pa_table = pa.Table.from_pydict(typed_sequence_examples)
File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table.from_pydict
File "pyarrow\table.pxi", line 1613, in pyarrow.lib.Table.from_arrays
File "pyarrow\table.pxi", line 1232, in pyarrow.lib.Table.validate
File "pyarrow\error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 2 named label expected length 1004 but got length 1000
The official documentation describes the parameter:
- batched (bool, default False) – Provide batch of examples to function.
So `map` feeds examples to the tokenizer batch by batch, and the function can return a batch whose new columns have a different number of rows than the input (here 1004 output rows against the untouched `label` column's 1000), which Arrow refuses to assemble into a table.
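A minimal sketch of how a batched map function can change the row count (the function body here is hypothetical; with a real HuggingFace tokenizer a typical trigger is `return_overflowing_tokens=True`, which splits one long text into several output rows):

```python
def tokenize_with_overflow(batch, max_len=4):
    # Stand-in for a tokenizer that splits long inputs into chunks,
    # so one input row can yield several output rows.
    input_ids = []
    for text in batch["text"]:
        tokens = list(range(len(text.split())))  # fake token ids
        for i in range(0, len(tokens), max_len):
            input_ids.append(tokens[i:i + max_len])
    return {"input_ids": input_ids}

batch = {"text": ["a b c d e f", "x y"], "label": [0, 1]}
out = tokenize_with_overflow(batch)
# 2 input rows produced 3 output rows, but "label" still has 2 entries:
print(len(out["input_ids"]), len(batch["label"]))  # 3 2
```

When `datasets` tries to write `input_ids` (3 rows) next to the untouched `label` column (2 rows) into one Arrow table, it fails with exactly the "expected length ... but got length ..." error above.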
Stopgap: remove batched=True.
No luck. ×
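Before rewriting the loader, one workaround worth noting: if the mapped function changes the row count, every original column must either be removed via `remove_columns` or kept in sync by the function itself. The sketch below is a toy re-implementation of the length check, not the real `datasets` code, just to show why dropping `label` makes the error go away:

```python
def map_batched(batch, fn, remove_columns=()):
    """Toy version of the check datasets/pyarrow performs when writing:
    all columns in the output batch must have the same number of rows."""
    out = {k: v for k, v in batch.items() if k not in remove_columns}
    out.update(fn(batch))
    lengths = {k: len(v) for k, v in out.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column length mismatch: {lengths}")
    return out

# A function that turns 2 input rows into 3 output rows:
grow = lambda b: {"input_ids": [[1], [2], [3]]}
batch = {"text": ["t1", "t2"], "label": [0, 1]}

# Keeping "label" reproduces the ArrowInvalid-style failure:
try:
    map_batched(batch, grow, remove_columns=("text",))
except ValueError as e:
    print(e)  # reports the mismatched column lengths

# Dropping every out-of-sync original column lets the write succeed:
ok = map_batched(batch, grow, remove_columns=("text", "label"))
print(len(ok["input_ids"]))  # 3
```

In the real call this would mean adding "label" to the `remove_columns` list (and re-attaching labels afterwards), at the cost of having to keep them aligned yourself.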
Solution
Along the way I found a HuggingFace staff member's explanation of this class of error (huggingface datasets issues 1817, huggingface datasets issues 116), but I couldn't follow it, so I decided to start from the raw dataset and rewrite the data_loader.
Solution: go custom and write your own data loader.
Reference: use HuggingFace Datasets to write a dataset loading script; see the official docs, Writing a dataset loading script.
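As a starting point for the custom route, here is a minimal sketch of a map-style dataset. The class name `AGNewsDataset` and the `toy_tokenize` function are illustrative stand-ins for the real AG News data and HuggingFace tokenizer; the `__len__`/`__getitem__` protocol is the one `torch.utils.data.DataLoader` consumes, so no Arrow table ever has to be rebuilt and each item is tokenized on access:

```python
class AGNewsDataset:
    """Map-style dataset: implements __len__ and __getitem__, the
    protocol torch.utils.data.DataLoader expects, so tokenization
    happens per item instead of in batched Arrow writes."""

    def __init__(self, texts, labels, tokenize):
        assert len(texts) == len(labels)
        self.texts = texts
        self.labels = labels
        self.tokenize = tokenize

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {"input_ids": self.tokenize(self.texts[idx]),
                "label": self.labels[idx]}

# Toy tokenizer standing in for the real HuggingFace tokenizer:
def toy_tokenize(text):
    return [hash(w) % 100 for w in text.split()]

ds = AGNewsDataset(["hello world", "foo"], [0, 1], toy_tokenize)
print(len(ds), ds[1]["label"])  # 2 1
```

Because indexing goes through `__getitem__`, you are free to modify the input_ids indices however you like before returning them, which was the original goal.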