[Bugs] pyarrow.lib.ArrowInvalid: Column 2 named label expected length 1004 but got length 1000

Another Double Eleven, and I am tired, sleepy, and miserable.
- 2021.11.11

Bug description

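In a PyTorch + Hugging Face environment, I wanted to change the indices in the input_ids produced by the tokenizer. I tried to do this by plain assignment, and the map call raised an error: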

tokenized_datasets_test = agnews_dataset['test'].map(tokenize_test,batched=True,
...                                                      remove_columns=["text",'label_name'])
  0%|          | 0/8 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "<input>", line 2, in <module>
  File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\arrow_dataset.py", line 1703, in map
    desc=desc,
  File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\fingerprint.py", line 398, in wrapper
    out = func(self, *args, **kwargs)
  File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\arrow_dataset.py", line 2065, in _map_single
    writer.write_batch(batch)
  File "D:\anaconda3\envs\pt16\lib\site-packages\datasets\arrow_writer.py", line 411, in write_batch
    pa_table = pa.Table.from_pydict(typed_sequence_examples)
  File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table.from_pydict
  File "pyarrow\table.pxi", line 1613, in pyarrow.lib.Table.from_arrays
  File "pyarrow\table.pxi", line 1232, in pyarrow.lib.Table.validate
  File "pyarrow\error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 2 named label expected length 1004 but got length 1000

The official documentation describes the parameter as follows:

  • batched (bool, default False) – Provide batch of examples to function.

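In other words, the examples are handed to the tokenizer one batch at a time, and the length of the encoded output can differ from batch to batch. The message itself indicates that, within one batch, the other columns came back with 1004 rows while the label column still had 1000.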
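The mismatch can be reproduced without any library. The sketch below is plain Python and only mimics the equal-length check that the traceback shows being run in pyarrow's Table.validate; `tokenize_batch`, `check_columns`, and the toy data are hypothetical, and it assumes the mapped function returns more rows than it received while the untouched label column keeps its original length.

```python
def tokenize_batch(batch):
    """Hypothetical batched function: long texts are split into two rows,
    so the function can return more rows than it was given."""
    input_ids = []
    for text in batch["text"]:
        tokens = [len(word) for word in text.split()]  # stand-in for real token ids
        if len(tokens) > 4:
            input_ids.append(tokens[:4])
            input_ids.append(tokens[4:])
        else:
            input_ids.append(tokens)
    return {"input_ids": input_ids}


def check_columns(columns):
    """Mimic the table validation: every column must have the same length."""
    expected = len(next(iter(columns.values())))
    for position, (name, values) in enumerate(columns.items()):
        if len(values) != expected:
            raise ValueError(
                f"Column {position} named {name} expected length {expected} "
                f"but got length {len(values)}"
            )
    return expected


batch = {
    "text": ["a short one", "this text is long enough to be split in two"],
    "label": [0, 1],
}
new_columns = tokenize_batch(batch)
# The retained label column still has one entry per original example.
merged = {**new_columns, "label": batch["label"]}
try:
    check_columns(merged)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Under that assumption, the lengths agree again if the function also returns one label entry for every row it produces, or if every original column is listed in remove_columns.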

Stopgap:

Remove batched=True.

That did not work. ×

Solution

Along the way I came across explanations of this kind of error from Hugging Face staff:

huggingface datasets issues 1817
huggingface datasets issues 116

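But I did not understand them, so I decided to go back to the raw dataset and change the data_loader instead.

Fix: take the custom route and write a custom data loader.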
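The actual loading script from this post is not shown, so the following is only a minimal sketch of the idea: a generator that reads the raw file and yields one (key, example) pair per row, which is the shape that `_generate_examples` in a dataset loading script yields. The CSV layout, the column names, `LABEL_NAMES`, and the function name `generate_examples` are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical label names; the names used in the post's dataset are not shown.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def generate_examples(csv_file):
    """Yield (key, example) pairs, exactly one example per CSV row."""
    reader = csv.DictReader(csv_file)
    for key, row in enumerate(reader):
        label = int(row["label"])
        yield key, {
            "text": row["text"].strip(),
            "label": label,
            "label_name": LABEL_NAMES[label],
        }


# Toy stand-in for the raw data file (assumed layout: label,text).
raw = io.StringIO(
    "label,text\n"
    "1,Team wins the final\n"
    "3,New chip announced\n"
)
examples = list(generate_examples(raw))
print(examples[0])
```

Because every yielded example carries all of its fields together, any per-example change made at this stage keeps the columns the same length. In a real loading script, a generator like this would form the body of the builder's `_generate_examples` method.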

References: "HuggingFace Datasets来写一个数据加载脚本" (writing a data loading script with HuggingFace Datasets); official documentation, "Writing a dataset loading script".
