[Original] Prefer standard JSON over JSONL to avoid errors when the datasets package loads data

AI莉莉兹

Last modified 2024-05-15 20:35:21

Tags: json, natural language processing

First published 2024-05-15 20:27:25

Article link: https://blog.csdn.net/leayc/article/details/138921530


Discovering the problem

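I was training a model with ORPO in LLaMA-Factory and, out of habit, had organized the dataset as jsonl.
The run suddenly failed while reading the data, and with a TypeError, which is rare at that stage. Comparing my data against the official template turned up nothing wrong, and a first pass over the content itself ruled out the obvious culprits, so the data-processing code looked correct. Stuck, I had no choice but to step into the library internals with a debugger. The error output, trimmed down, was: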

File "/home/xxx/miniforge3/envs/lf/lib/python3.11/site-packages/datasets/builder.py", line 2011, in _prepare_split_single
    writer.write_table(table)
    ...
  File "/home/xxx/miniforge3/envs/lf/lib/python3.11/site-packages/datasets/table.py", line 1957, in array_cast
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")
TypeError: Couldn't cast array of type
list<item: string>
to
null

The above exception was the direct cause of the following exception:
...
    for job_id, done, content in self._prepare_split_single(
  File "/home/xxx/miniforge3/envs/lf/lib/python3.11/site-packages/datasets/builder.py", line 2038, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Tracing it back, the call at the heart of the failure is:

...arrow_writer.py", line 585, in write_table
pa_table = table_cast(pa_table, self._schema)

Stepping further and inspecting the key variables showed that the schema of the failing data chunk was off:

self.schema
prompt: string
query: string
answer: list<item: string>
  child 0, item: string
history: list<item: null>
  child 0, item: null

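My data is a set of multi-turn conversations, so depending on the number of turns the history column should be either an empty list or a nested list of lists of str.
For comparison, a chunk that is processed normally has this schema: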

self.schema
prompt: string
query: string
answer: list<item: string>
  child 0, item: string
history: list<item: string>
  child 0, item: string

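Looking at the surrounding code, the schema is inferred from the data inside each chunk. A chunk in which every history happens to be empty presumably gives the inference nothing to determine the item type from, which would explain list<item: null>. My guess was therefore that the arrow library, when treating the data as a table, fails to cover some special case, so the schema ends up inconsistent from one chunk to the next and the cast fails. A search of the datasets repository confirmed that someone else had run into the same problem: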
https://github.com/huggingface/datasets/issues/6845
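To make the suspected mechanism concrete, here is a toy model in plain Python. It is not the inference code of pyarrow or datasets; the helper names (describe, infer_list_column, chunked) and the sample rows are made up for illustration, and the type strings only imitate the way the schemas are printed above. It shows how inferring a list column's type one chunk at a time can disagree with inferring it over the whole file.

```python
from typing import Any, Dict, Iterator, List


def describe(value: Any) -> str:
    """Toy type description for a JSON value (strings and lists only)."""
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        inner = {describe(v) for v in value}
        if not inner:
            # Nothing inside the list, so the item type cannot be determined.
            return "list<item: null>"
        if len(inner) == 1:
            return f"list<item: {inner.pop()}>"
        raise TypeError(f"mixed item types: {sorted(inner)}")
    raise TypeError(f"unsupported value: {value!r}")


def infer_list_column(rows: List[Dict[str, Any]], column: str) -> str:
    """Infer a list column's type using ONLY the rows that were passed in."""
    pooled = [item for row in rows for item in row[column]]
    return describe(pooled)


def chunked(rows: List[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical multi-turn data: the first two rows have no history at all.
rows = [
    {"query": "q1", "history": []},
    {"query": "q2", "history": []},
    {"query": "q3", "history": [["q1", "a1"]]},
    {"query": "q4", "history": []},
]

per_chunk = [infer_list_column(chunk, "history") for chunk in chunked(rows, 2)]
whole_file = infer_list_column(rows, "history")
print(per_chunk)   # the first chunk sees only empty lists
print(whole_file)  # the full file contains at least one non-empty history
```

In this toy model the first chunk contains only empty lists, so its inferred type differs from the one inferred over all rows. That is the same kind of mismatch as the null versus string item types in the schemas above.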

Solving the problem

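I had hoped to become a contributor and fix this bug myself, but by this point in debugging it was clear that the error sits many layers deep, and my time and ability are limited, so it will probably have to wait for an official fix.
Knowing the cause, I wondered whether the failure came from reading the file directly as a table. On a hunch I rewrote the data as a list inside a standard json file, that is, lines of json -> json list, and confirmed that training then ran normally.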
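The conversion itself needs only the standard library. The sketch below is one way to do it, not necessarily the exact script used for this post; the function name jsonl_to_json and the file names in the usage note are placeholders.

```python
import json


def jsonl_to_json(src_path: str, dst_path: str) -> int:
    """Rewrite a JSON Lines file as one standard JSON array.

    Returns the number of records written.
    """
    records = []
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            records.append(json.loads(line))

    with open(dst_path, "w", encoding="utf-8") as dst:
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        json.dump(records, dst, ensure_ascii=False, indent=2)
    return len(records)
```

A call such as jsonl_to_json("train.jsonl", "train.json") produces the file to point the training config at. The whole dataset is held in memory during the conversion, which is fine for a fine-tuning set of moderate size but may not suit very large files.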

Looking back

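If the guess above is right, then when no schema is specified and the writer is left to infer one from each chunk, there is some chance that types are consistent within a chunk but differ across chunks, so inference goes wrong and the subsequent normalization/cast raises an error. Relying on per-chunk inference presumably assumes that, for very large data, a sample is enough to handle a correctly formatted dataset. Evidently, though, a large dataset with a skewed distribution can trigger this kind of error.
Passing the features parameter at the relevant function entry point is one fix, but from a quick look at the code it is not that straightforward: it would probably have to be written as a set of configuration, followed by a mapping to standard types after loading.
I could not think of an elegant way to solve this quickly. Are there any resourceful readers who can offer suggestions?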