The datasets DatasetBuilder class

Instance attributes and methods

1. The as_dataset function
dataset_builder.as_dataset(
		split: Optional[Split] = None, 
		run_post_process=True, 
		ignore_verifications=False, 
		in_memory=False
    )

Returns a Dataset object built by this builder. Call download_and_prepare first so that the data is available locally.

>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder('rotten_tomatoes')
>>> builder.download_and_prepare()
>>> ds = builder.as_dataset(split='train')
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
2. The download_and_prepare function
dataset_builder.download_and_prepare(
        download_config: Optional[DownloadConfig] = None,
        download_mode: Optional[DownloadMode] = None,
        ignore_verifications: bool = False,
        try_from_hf_gcs: bool = True,
        dl_manager: Optional[DownloadManager] = None,
        base_path: Optional[str] = None,
        use_auth_token: Optional[Union[bool, str]] = None,
        **download_and_prepare_kwargs,
    )

Downloads the source data and prepares it in the local cache.

>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder('rotten_tomatoes')
>>> builder.download_and_prepare()
# The builder steps above, followed by as_dataset, are equivalent to:
>>> from datasets import load_dataset
>>> ds = load_dataset('rotten_tomatoes')
3. The get_exported_dataset_info function
dataset_builder.get_exported_dataset_info()

Returns a DatasetInfo object describing the dataset (description, citation, features, splits, sizes, and so on).

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder('rotten_tomatoes')
>>> ds_builder.get_exported_dataset_info()
DatasetInfo(description="Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed
sentences from Rotten Tomatoes movie reviews. This data was first used in Bo
Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales.'', Proceedings of the
ACL, 2005.
", citation='@InProceedings{Pang+Lee:05a,
  author =       {Bo Pang and Lillian Lee},
  title =        {Seeing stars: Exploiting class relationships for sentiment
                  categorization with respect to rating scales},
  booktitle =    {Proceedings of the ACL},
  year =         2005
}
', homepage='http://www.cs.cornell.edu/people/pabo/movie-review-data/', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=SupervisedKeysData(input='', output=''), task_templates=[TextClassification(task='text-classification', text_column='text', label_column='label')], builder_name='rotten_tomatoes_movie_review', config_name='default', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=1074810, num_examples=8530, dataset_name='rotten_tomatoes_movie_review'), 'validation': SplitInfo(name='validation', num_bytes=134679, num_examples=1066, dataset_name='rotten_tomatoes_movie_review'), 'test': SplitInfo(name='test', num_bytes=135972, num_examples=1066, dataset_name='rotten_tomatoes_movie_review')}, download_checksums={'https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz': {'num_bytes': 487770, 'checksum': 'a05befe52aafda71d458d188a1c54506a998b1308613ba76bbda2e5029409ce9'}}, download_size=487770, post_processing_size=None, dataset_size=1345461, size_in_bytes=1833231)
4. The get_all_exported_dataset_infos function
dataset_builder.get_all_exported_dataset_infos()

Returns a dict mapping each configuration name to its DatasetInfo object.

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder('rotten_tomatoes')
>>> ds_builder.get_all_exported_dataset_infos()
{'default': DatasetInfo(description="Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed
sentences from Rotten Tomatoes movie reviews. This data was first used in Bo
Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales.'', Proceedings of the
ACL, 2005.
", citation='@InProceedings{Pang+Lee:05a,
  author =       {Bo Pang and Lillian Lee},
  title =        {Seeing stars: Exploiting class relationships for sentiment
                  categorization with respect to rating scales},
  booktitle =    {Proceedings of the ACL},
  year =         2005
}
', homepage='http://www.cs.cornell.edu/people/pabo/movie-review-data/', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=SupervisedKeysData(input='', output=''), task_templates=[TextClassification(task='text-classification', text_column='text', label_column='label')], builder_name='rotten_tomatoes_movie_review', config_name='default', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=1074810, num_examples=8530, dataset_name='rotten_tomatoes_movie_review'), 'validation': SplitInfo(name='validation', num_bytes=134679, num_examples=1066, dataset_name='rotten_tomatoes_movie_review'), 'test': SplitInfo(name='test', num_bytes=135972, num_examples=1066, dataset_name='rotten_tomatoes_movie_review')}, download_checksums={'https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz': {'num_bytes': 487770, 'checksum': 'a05befe52aafda71d458d188a1c54506a998b1308613ba76bbda2e5029409ce9'}}, download_size=487770, post_processing_size=None, dataset_size=1345461, size_in_bytes=1833231)}