KeyError: 'instruction' when splitting a dataset in Python

wxr0616

Last modified 2024-05-15 16:59:47

First published 2024-04-12 21:02:55

Article link: https://blog.csdn.net/wxr0616/article/details/137694686

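Today, while splitting a JSON file into datasets, I hit a KeyError: 'instruction' error. The file is large, so I only skimmed part of it and assumed the structure was consistent. Here is a sample of the data: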

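    {
        "instruction": "Describe the principles of object-oriented programming (OOP).",
        "input": "The OOP principles include encapsulation, inheritance, polymorphism, and abstraction, which promote organized and maintainable code.",
        "output": "Evaluation: You have a good understanding of the principles of object-oriented programming. In your development experience, how have these principles guided the way you write code?"
    },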

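So at first I was baffled. Later, printing the keys of each entry in the JSON file showed that a small number of entries use a different format, and that is what raised KeyError: 'instruction':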

Keys in example: dict_keys(['instruction', 'input', 'output'])
Keys in example: dict_keys(['instruction', 'input', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['instruction', 'input', 'output'])
Keys in example: dict_keys(['instruction', 'input', 'output'])
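With a large file, a per-entry printout like the one above gets long quickly. A minimal sketch of a more compact check is to count how many entries share each key structure. The function name `count_key_structures` and the sample records are hypothetical and only illustrate the idea; this is not the diagnostic code used in the post.

```python
from collections import Counter


def count_key_structures(records):
    """Count how many records share each (sorted) tuple of keys."""
    return Counter(tuple(sorted(record.keys())) for record in records)


# Hypothetical sample data mixing the two structures seen in the output above
sample = [
    {"instruction": "a", "input": "b", "output": "c"},
    {"question": "q1", "answer": "r1", "output": "s1"},
    {"question": "q2", "answer": "r2", "output": "s2"},
]

# Print each distinct key structure with its number of occurrences
for keys, count in count_key_structures(sample).most_common():
    print(count, keys)
```

On the real file, `sample` would be replaced by the list returned from `json.load`, so every unexpected structure shows up once with its count.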

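So before processing the data, we need to check the key structure of each example and handle each structure accordingly: while reading the data, inspect each example's keys and extract only the keys that its structure provides. This keeps the code from failing on entries with a different structure.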

Here is the fixed code:

import json
from sklearn.model_selection import train_test_split

# Read the JSON file
with open('qiyeruangong.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

# Extract the required keys
X = []
y = []
for example in data:
    # Check the key structure of the example
    if 'instruction' in example and 'input' in example and 'output' in example:
        X.append(example['instruction'] + ' ' + example['input'])
        y.append(example['output'])
    elif 'question' in example and 'answer' in example and 'output' in example:
        # Only the question is used as the feature for this structure
        X.append(example['question'])
        y.append(example['output'])
    else:
        print("Unsupported example format:", example)

# Split into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)  # 0.25 x 0.8 = 0.2

# Optional: save the split datasets
# (UTF-8 with ensure_ascii=False keeps the Chinese text readable in the output files)
with open('train_data.json', 'w', encoding='utf-8') as train_file:
    json.dump({'X_train': X_train, 'y_train': y_train}, train_file, ensure_ascii=False)

with open('val_data.json', 'w', encoding='utf-8') as val_file:
    json.dump({'X_val': X_val, 'y_val': y_val}, val_file, ensure_ascii=False)

with open('test_data.json', 'w', encoding='utf-8') as test_file:
    json.dump({'X_test': X_test, 'y_test': y_test}, test_file, ensure_ascii=False)
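
The comment "0.25 x 0.8 = 0.2" in the split step means the two calls together give roughly a 60/20/20 train/validation/test split. The sketch below works through that arithmetic with the standard library only. The helper `split_sizes` is hypothetical, and it assumes each held-out part is rounded up, which is how I understand scikit-learn to treat a float `test_size`; treat the exact counts as an approximation for sizes that do not divide evenly.

```python
import math


def split_sizes(n_samples, test_size=0.2, val_size=0.25):
    """Return (train, val, test) counts for the two-step split (rounding up is an assumption)."""
    n_test = math.ceil(test_size * n_samples)   # first split: hold out the test set
    remaining = n_samples - n_test              # what is left for train + validation
    n_val = math.ceil(val_size * remaining)     # second split: validation set from the rest
    return remaining - n_val, n_val, n_test


print(split_sizes(100))
```

For 100 usable examples this gives 60 for training, 20 for validation, and 20 for testing. Examples skipped as "Unsupported example format" are not counted in `n_samples`.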

If this does not solve the problem, see the 2.0 follow-up to this fix: "KeyError: 'instruction' when fine-tuning chatglm3-6b with LLAMA-Factory".
