Fixing huggingface.datasets failing to load datasets and metrics

诸神缄默不语

Last modified 2022-04-08 22:30:58

Category: AI study notes (人工智能学习笔记). Tags: Python, huggingface, datasets, NLP, yelp

First published 2022-04-08 20:19:12

Article link: https://blog.csdn.net/PolarisRisingWar/article/details/124042709

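While using Hugging Face's datasets package, I found that it could load neither datasets nor metrics. This post records the problem and shares how I solved it. I cover my code and environment, the error message, the cause, and the fix, first for datasets and then for metrics.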

Environment:
Operating system: Linux
Python version: 3.8.12
Code editor: VSCode + Jupyter Notebook
datasets version: 2.0.0

Datasets:

Code:

import datasets
dataset=datasets.load_dataset("yelp_review_full")

Error message:

ConnectionError                           Traceback (most recent call last)
/tmp/ipykernel_21708/3707219471.py in <module>
----> 1 dataset=datasets.load_dataset("yelp_review_full")

myenv/lib/python3.8/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1658 
   1659     # Create a dataset builder
-> 1660     builder_instance = load_dataset_builder(
   1661         path=path,
   1662         name=name,

myenv/lib/python3.8/site-packages/datasets/load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)
   1484         download_config = download_config.copy() if download_config else DownloadConfig()
   1485         download_config.use_auth_token = use_auth_token
-> 1486     dataset_module = dataset_module_factory(
   1487         path,
   1488         revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1236                         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1237                     ) from None
-> 1238                 raise e1 from None
   1239     else:
   1240         raise FileNotFoundError(

myenv/lib/python3.8/site-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1173             if path.count("/") == 0:  # even though the dataset is on the Hub, we get it from GitHub for now
   1174                 # TODO(QL): use a Hub dataset module factory instead of GitHub
-> 1175                 return GithubDatasetModuleFactory(
   1176                     path,
   1177                     revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
    531         revision = self.revision
    532         try:
--> 533             local_path = self.download_loading_script(revision)
    534         except FileNotFoundError:
    535             if revision is not None or os.getenv("HF_SCRIPTS_VERSION", None) is not None:

myenv/lib/python3.8/site-packages/datasets/load.py in download_loading_script(self, revision)
    511         if download_config.download_desc is None:
    512             download_config.download_desc = "Downloading builder script"
--> 513         return cached_path(file_path, download_config=download_config)
    514 
    515     def download_dataset_infos_file(self, revision: Optional[str]) -> str:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/datasets/yelp_review_full/yelp_review_full.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))

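The cause is clear: raw.githubusercontent.com cannot be reached from this machine.
If you can use a proxy, the best fix is simply to run the whole thing through the proxy.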
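As a minimal sketch of the proxy route, the snippet below exports proxy environment variables for the current Python process before loading the dataset. It relies on the requests library, which datasets uses for its downloads, honouring `HTTP_PROXY`/`HTTPS_PROXY`. The helper names `configure_proxy` and `load_yelp_via_proxy` are hypothetical, and any proxy address you pass in is your own; none of them come from the original setup.

```python
import os


def configure_proxy(proxy_url):
    """Export proxy settings so HTTP(S) requests from this process use proxy_url."""
    # Both upper- and lower-case names are set because tools differ in which they read.
    for key in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        os.environ[key] = proxy_url


def load_yelp_via_proxy(proxy_url):
    """Load yelp_review_full with downloads routed through proxy_url."""
    configure_proxy(proxy_url)
    import datasets  # imported after the environment is configured
    return datasets.load_dataset("yelp_review_full")
```

For example, `load_yelp_via_proxy("http://127.0.0.1:7890")` would be the call for a proxy listening locally on port 7890; that address is only a placeholder.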

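For cases where using a proxy directly is inconvenient, here is the workaround I used: run with a proxy on my local machine, then upload the resulting files to the environment where the code runs. (The local machine and the server may run different operating systems.)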

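I first tried downloading the Python loading script itself and uploading it to the server, but I could not get that to work. The script's data download link points to Google's cloud storage; downloading that data and uploading it did not work either, and neither did changing the download link to the S3 file. If you know a way that does work, please tell me.
In short, what worked for me was to load the dataset on my local machine first, save it to disk, upload the folder to the server, and then load the dataset directly from disk there.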

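Load the dataset locally and save it to the local disk (note that these are Windows paths):

import datasets
# Raw strings keep the backslashes in the Windows paths literal
dataset=datasets.load_dataset("yelp_review_full",cache_dir=r'mypath\data\huggingfacedatasetscache')

dataset.save_to_disk(r'mypath\data\yelp_review_full_disk')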

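Upload the saved folder to the server:
You can do this with bypy and Baidu Netdisk (百度网盘); see my earlier post "bypy：使用Linux命令行上传及下载百度云盘文件（远程服务器大文件传输必备）" (诸神缄默不语's blog, CSDN).
First upload the folder to 我的应用数据/bypy (My App Data/bypy) on the netdisk, then download it on the server. Note that downloading a folder fetches all files inside the remote folder into the local folder, rather than downloading the folder itself. The remote folder name must match what you actually uploaded; the command here uses yelp_full_review_disk, whereas the save_to_disk call above used yelp_review_full_disk: bypy downdir yelp_full_review_disk mypath/datasets/yelp_full_review_disk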
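One optional way to sidestep the folder-versus-contents subtlety is to pack the saved folder into a single archive before uploading and unpack it on the server. This is not what I did above, only a sketch of an alternative using the standard library; the function names `pack_dataset_dir` and `unpack_dataset_dir` are hypothetical.

```python
import shutil
from pathlib import Path


def pack_dataset_dir(dataset_dir, archive_stem):
    """Pack dataset_dir into '<archive_stem>.tar.gz' and return the archive path."""
    dataset_dir = Path(dataset_dir)
    # root_dir/base_dir keep the folder name itself as the top-level archive entry.
    return shutil.make_archive(
        str(archive_stem),
        "gztar",
        root_dir=str(dataset_dir.parent),
        base_dir=dataset_dir.name,
    )


def unpack_dataset_dir(archive_path, target_parent):
    """Extract the archive so the dataset folder reappears under target_parent."""
    shutil.unpack_archive(str(archive_path), str(target_parent))
```

The archive is then a single file to upload and download, and after unpacking the dataset folder has the same name it had locally.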

Then load the dataset from disk on the server:

dataset=datasets.load_from_disk("mypath/datasets/yelp_full_review_disk")
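If the transfer was incomplete, `load_from_disk` fails in ways that can be hard to read, so a quick check of the folder before loading may help. The sketch below assumes that `save_to_disk` writes a `dataset_dict.json` file (for a DatasetDict) or a `dataset_info.json` file (for a single Dataset) at the top of the folder; that is my understanding of the datasets 2.0.0 layout rather than something verified in this post, and the helper names are hypothetical.

```python
from pathlib import Path

# Marker files assumed to sit at the top level of a folder written by save_to_disk.
MARKER_FILES = ("dataset_dict.json", "dataset_info.json")


def looks_like_saved_dataset(path):
    """Return True if path is a directory containing one of the marker files."""
    folder = Path(path)
    return folder.is_dir() and any((folder / name).is_file() for name in MARKER_FILES)


def load_saved_dataset(path):
    """Load a dataset saved with save_to_disk, failing early if the folder looks wrong."""
    if not looks_like_saved_dataset(path):
        raise FileNotFoundError(f"{path} does not look like a save_to_disk folder")
    import datasets  # imported lazily so the check itself needs no extra packages
    return datasets.load_from_disk(str(path))
```

A typical call would be `load_saved_dataset("mypath/datasets/yelp_full_review_disk")`.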

After that, the dataset can be used as normal.
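Note that, according to the datasets documentation, the dataset can also be saved directly to an S3FileSystem (https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.filesystems.S3FileSystem). S3 is Amazon's object storage service, so this is roughly comparable to keeping the files on Google's or Baidu's cloud storage, and it might be more convenient than saving locally and then transferring to the server.
I have not looked into this feature, so I did not use it.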

Metrics:
Code:

metric=datasets.load_metric('accuracy')

Error message:

ConnectionError                           Traceback (most recent call last)
/tmp/ipykernel_24141/2186493793.py in <module>
----> 1 metric=datasets.load_metric('accuracy')

myenv/lib/python3.8/site-packages/datasets/load.py in load_metric(path, config_name, process_id, num_process, cache_dir, experiment_id, keep_in_memory, download_config, download_mode, revision, **metric_init_kwargs)
   1390     """
   1391     download_mode = DownloadMode(download_mode or DownloadMode.REUSE_DATASET_IF_EXISTS)
-> 1392     metric_module = metric_module_factory(
   1393         path, revision=revision, download_config=download_config, download_mode=download_mode
   1394     ).module_path

myenv/lib/python3.8/site-packages/datasets/load.py in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
   1322             except Exception as e2:  # noqa: if it's not in the cache, then it doesn't exist.
   1323                 if not isinstance(e1, FileNotFoundError):
-> 1324                     raise e1 from None
   1325                 raise FileNotFoundError(
   1326                     f"Couldn't find a metric script at {relative_to_absolute_path(combined_path)}. "

myenv/lib/python3.8/site-packages/datasets/load.py in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
   1310     elif is_relative_path(path) and path.count("/") == 0 and not force_local_path:
   1311         try:
-> 1312             return GithubMetricModuleFactory(
   1313                 path,
   1314                 revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
    598         revision = self.revision
    599         try:
--> 600             local_path = self.download_loading_script(revision)
    601             revision = self.revision
    602         except FileNotFoundError:

myenv/lib/python3.8/site-packages/datasets/load.py in download_loading_script(self, revision)
    592         if download_config.download_desc is None:
    593             download_config.download_desc = "Downloading builder script"
--> 594         return cached_path(file_path, download_config=download_config)
    595 
    596     def get_module(self) -> MetricModule:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/metrics/accuracy/accuracy.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))

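Metrics are simpler. Just download the Python file named in the error to the local machine, then load the metric from that file instead. The download does not need a proxy: I have not written a dedicated post on fetching GitHub files without a proxy, but see my earlier post on a similar topic, "PyG的Planetoid无法直接下载Cora等数据集的3个解决方式" (诸神缄默不语's blog, CSDN).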

metric=datasets.load_metric('mypath/accuracy.py')
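To keep the same code working both on machines that can reach GitHub and on those that cannot, the choice between the local script and the metric name can be wrapped in a small helper. This is only a sketch; `metric_source` and `load_metric_prefer_local` are hypothetical names, and the directory you pass in is wherever you placed the downloaded script.

```python
from pathlib import Path


def metric_source(name, local_dir):
    """Return the path to '<local_dir>/<name>.py' if it exists, else the metric name."""
    script = Path(local_dir) / f"{name}.py"
    return str(script) if script.is_file() else name


def load_metric_prefer_local(name, local_dir):
    """Load a metric from a local script when available, otherwise by name."""
    import datasets  # imported lazily; only needed when a metric is actually loaded
    return datasets.load_metric(metric_source(name, local_dir))
```

With accuracy.py placed in mypath, `load_metric_prefer_local("accuracy", "mypath")` resolves to the local script, and the returned metric can be used as usual, for example with `metric.compute(predictions=..., references=...)`.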

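References used while writing this post:

1. Documentation for the datasets loading methods: https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/loading_methods
2. Documentation for datasets.save_to_disk(): https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.Dataset.save_to_disk
3. "HuggingFace使用datasets加载数据时出现ConnectionError 无法获得数据可以将数据保存到本地" (zero requiem's blog, CSDN): the method is much the same as mine, except that the author uses Google Colab to load and save the dataset.
4. "ConnectionError: Couldn‘t reach https://raw.githubuserc//huggingface/datasets/1.15.1/datasets/squad/" (随便写写诶's blog, CSDN): this post seems to target an older datasets version. As far as I can tell, datasets are no longer stored at that location, so the method may no longer work.
5. "HuggingFace代码本地运行报错ConnectionError: Couldn‘t reach https://raw.githubuserc" (愚昧之山绝望之谷开悟之坡's blog, CSDN): I tried this method. After placing the Python file in the cache folder, I found that it still needed to download data from Google's cloud storage. After I put that data in the cache folder as well, it raised other errors that I could not resolve, so I gave up on this approach.
6. "HuggingFace 加载数据集报错 ConnectionError 无需GoogleColab" (zero requiem's blog, CSDN): similar to item 4.
7. "使用datasets库加载glue数据集时load_dataset发生Connection Error问题解决方法" (j_thame_myhome's blog, CSDN): upgrading datasets did not help in my case, because 2.0.0 was already the latest datasets version at the time.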