【Solved】load_dataset fails with "An error occurred while generating the dataset"

【Problem description】datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

I was using load_dataset to read a dataset hosted on HuggingFace, with the following code:

from transformers import pipeline
from transformers import Trainer, TrainingArguments

import os
os.environ["http_proxy"] = "http://127.0.0.1:7890"
os.environ["https_proxy"] = "http://127.0.0.1:7890"

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

I had installed the library with a plain pip install datasets, and the error appeared the first time I ran the code.

【Solution】
The datasets version I originally had installed was 2.13.1.
Downgrading it to 2.1.0 fixed the problem.

# Try downgrading to a more stable version
pip install datasets==2.1.0
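
Before and after changing versions, it can help to confirm which datasets version the active environment actually has. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this example and is not part of datasets.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="datasets"):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints e.g. the version that pip installed into the current environment,
# or None if datasets is missing from it.
print("datasets version:", installed_version("datasets"))
```

If this prints a different version than the one you just installed, pip is probably installing into a different environment than the one running the script.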

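【Other solutions】

  1. Clear the cache: delete the glue-related data in the Hugging Face cache folder so that the dataset files are downloaded and prepared again instead of being reused from a possibly corrupted cache. The default cache path is ~/.cache/huggingface/datasets; delete the glue-related contents of that folder and then rerun the code.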
from datasets import load_dataset
import shutil
import os

# Delete the cache of one specific dataset
cache_dir = os.path.expanduser("~/.cache/huggingface/datasets")
glue_cache_path = os.path.join(cache_dir, "glue")
if os.path.exists(glue_cache_path):
    shutil.rmtree(glue_cache_path)

# Reload the dataset
raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)
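
The snippet above assumes the cache folder is named exactly glue. Depending on the datasets version, the folder may carry a namespaced name instead (for example something like nyu-mll___glue), and the cache root may have been moved with the HF_DATASETS_CACHE environment variable. The following sketch only lists candidate folders so you can check them before deleting anything; find_dataset_cache_dirs is a hypothetical helper written for this post, not a datasets API.

```python
import os
from pathlib import Path


def find_dataset_cache_dirs(name, cache_dir=None):
    """List cache subfolders whose name contains `name` (case-insensitive).

    Assumption: the cache root is HF_DATASETS_CACHE if set, otherwise the
    default ~/.cache/huggingface/datasets.
    """
    if cache_dir is None:
        cache_dir = os.environ.get(
            "HF_DATASETS_CACHE",
            os.path.expanduser("~/.cache/huggingface/datasets"),
        )
    root = Path(cache_dir)
    if not root.is_dir():
        return []
    # Only directories are returned; lock files and other files are skipped.
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and name.lower() in p.name.lower()
    )


# Nothing is deleted here; the matches are only printed for inspection.
for path in find_dataset_cache_dirs("glue"):
    print(path)
```

Once the listed folders look right, each one can be removed with shutil.rmtree(path), as in the code above.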

  2. Upgrade (or downgrade) datasets

pip install --upgrade datasets
# Or, if upgrading does not help, try downgrading to a more stable version
pip install datasets==2.1.0
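
If changing versions does not help, a broken download left in the cache may be the cause. As an alternative to deleting the cache by hand, load_dataset accepts download_mode="force_redownload". The sketch below retries once with that option; load_with_redownload is a hypothetical wrapper written for this post, and it takes the loader as an argument so it can be tried without network access.

```python
def load_with_redownload(loader, *args, **kwargs):
    """Call loader(*args, **kwargs); on failure, retry once with a forced re-download."""
    try:
        return loader(*args, **kwargs)
    except Exception as err:
        print(f"First attempt failed ({err!r}); retrying with a forced re-download.")
        # Override any download_mode the caller passed for the retry only.
        retry_kwargs = {**kwargs, "download_mode": "force_redownload"}
        return loader(*args, **retry_kwargs)


# Intended usage (requires network access, so it is left commented out):
# from datasets import load_dataset
# raw_datasets = load_with_redownload(load_dataset, "glue", "mrpc")
```

Catching every Exception is deliberately broad for a troubleshooting script; in longer-lived code it would be better to catch only the error you actually see.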