Loading the seamew/ChnSentiCorp Dataset

CCSBRIDGE

Last modified 2023-08-16 11:29:16

Tags: python, programming language

First published 2023-08-16 11:22:15

Article link: https://blog.csdn.net/weixin_47420447/article/details/132315580

1. Download the dataset

Download all files of seamew/ChnSentiCorp directly from Hugging Face. In practice, ChnSentiCorp.py and the .txt files do not need to be downloaded.
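
The download can also be scripted instead of done by hand. The following is a minimal sketch, assuming the `huggingface_hub` package is installed (the original post does not use it). The names `REPO_ID`, `SKIP_PATTERNS`, `is_needed` and `download` are hypothetical helpers introduced here for illustration, and `./ChnSentiCorp` is chosen to match the path used by the loading script later in this post.

```python
from fnmatch import fnmatch

REPO_ID = "seamew/ChnSentiCorp"
# Assumption: the loader script and the .txt files are the only files to skip.
SKIP_PATTERNS = ["*.py", "*.txt"]


def is_needed(filename: str) -> bool:
    """Return True if the file does not match any skip pattern."""
    return not any(fnmatch(filename, pattern) for pattern in SKIP_PATTERNS)


def download(local_dir: str = "./ChnSentiCorp") -> str:
    """Download the dataset repository into local_dir, skipping unneeded files."""
    # Imported lazily so that is_needed() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,
        ignore_patterns=SKIP_PATTERNS,
    )
```

Calling `download()` needs network access and returns the local folder path. `is_needed` only illustrates which file names the skip patterns filter out.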

2. Add a JSON file (it matters a lot, though I have not figured out why)

In the same directory as the files downloaded earlier, create a file named state.json with the following content:

{
  "_data_files": [
    {
      "filename": "chn_senti_corp-train.arrow"
    }
  ],
  "_fingerprint": "24c4fd9824d8b978",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_output_all_columns": false,
  "_split": "train"
}
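
A likely reason the file is needed: `load_from_disk` appears to expect the directory layout that `save_to_disk` writes, which includes a state.json naming the Arrow data files. The file can also be generated from a script rather than typed by hand. Below is a minimal sketch using only the standard library. The function `write_state_json` and its parameters `arrow_file` and `split` are hypothetical, and the fingerprint is simply copied from the JSON shown above.

```python
import json
from pathlib import Path


def write_state_json(dataset_dir, arrow_file="chn_senti_corp-train.arrow", split="train"):
    """Write a state.json that points at one Arrow file inside dataset_dir."""
    state = {
        "_data_files": [{"filename": arrow_file}],
        "_fingerprint": "24c4fd9824d8b978",  # copied from the JSON shown above
        "_format_columns": None,
        "_format_kwargs": {},
        "_format_type": None,
        "_indexes": {},
        "_output_all_columns": False,
        "_split": split,
    }
    path = Path(dataset_dir) / "state.json"
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path
```

For example, `write_state_json("./ChnSentiCorp")` produces the same content as the hand-written file. Pointing `arrow_file` and `split` at another Arrow file in the directory is an untested assumption.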

3. Load and read the data in your own Python script

# -*- coding: utf-8 -*-
# Optional: route traffic through a local proxy (requires `import os`).
# os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'
# os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'

from datasets import load_from_disk


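class Dataset:
    """Thin wrapper that returns (text, label) pairs from the saved dataset."""

    def __init__(self):
        # Directory containing the downloaded files and state.json.
        self.dataset = load_from_disk("./ChnSentiCorp")

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        # Fetch the row once instead of indexing the dataset twice.
        row = self.dataset[i]
        return row["text"], row["label"]


if __name__ == '__main__':
    dataset = Dataset()
    print(len(dataset), dataset[0])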
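
If loading fails, it can help to check the directory contents before calling `load_from_disk`. The sketch below is a hypothetical pre-flight check using only the standard library. It assumes that `load_from_disk` needs both state.json and dataset_info.json, plus every Arrow file listed under `_data_files`. The names `REQUIRED` and `missing_files` are made up for this example.

```python
import json
from pathlib import Path

# Assumption: these two files mark a directory as a saved dataset.
REQUIRED = ("state.json", "dataset_info.json")


def missing_files(dataset_dir):
    """Return the names of expected files that are absent from dataset_dir."""
    root = Path(dataset_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]

    # If state.json exists, also check the data files it refers to.
    state_path = root / "state.json"
    if state_path.is_file():
        state = json.loads(state_path.read_text(encoding="utf-8"))
        for entry in state.get("_data_files", []):
            if not (root / entry["filename"]).is_file():
                missing.append(entry["filename"])
    return missing
```

Running `missing_files("./ChnSentiCorp")` before constructing `Dataset()` should return an empty list when everything is in place.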