The datasets DatasetDict class

Class attributes/methods

1. The from_csv method
datasets.DatasetDict.from_csv(
        path_or_paths: Dict[str, PathLike],
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        **kwargs,
    )

Creates a DatasetDict from a dict mapping split names to CSV file paths.

>>> from datasets import DatasetDict
>>> ds = DatasetDict.from_csv({'train': 'path/to/dataset.csv'})
2. The from_json method
datasets.DatasetDict.from_json(
        path_or_paths: Dict[str, PathLike],
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        **kwargs,
    )

Creates a DatasetDict from a dict mapping split names to JSON file paths.

>>> from datasets import DatasetDict
>>> ds = DatasetDict.from_json({'train': 'path/to/dataset.json'})
3. The from_text method
datasets.DatasetDict.from_text(
        path_or_paths: Dict[str, PathLike],
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        **kwargs,
    )

Creates a DatasetDict from a dict mapping split names to text file paths.
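
Usage mirrors from_csv; a minimal sketch (the file path below is a placeholder):

>>> from datasets import DatasetDict
>>> ds = DatasetDict.from_text({'train': 'path/to/dataset.txt'})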

4. The from_parquet method
datasets.DatasetDict.from_parquet(
        path_or_paths: Dict[str, PathLike],
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        columns: Optional[List[str]] = None,
        **kwargs,
    )

Creates a DatasetDict from a dict mapping split names to Parquet file paths.
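
Usage follows the same pattern; the extra columns parameter restricts which columns are read. A minimal sketch (the file path below is a placeholder):

>>> from datasets import DatasetDict
>>> ds = DatasetDict.from_parquet({'train': 'path/to/dataset.parquet'})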

Instance attributes/methods

1. The data attribute

Returns the data of each Dataset in the DatasetDict.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.data
2. The cache_files attribute

Returns the cache files that back each Dataset in the DatasetDict.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.cache_files
{'test': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-test.arrow'}],
 'train': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-train.arrow'}],
 'validation': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}]}
3. The num_columns and num_rows attributes

Return the number of columns and the number of rows of each Dataset in the DatasetDict.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.num_columns
{'test': 2, 'train': 2, 'validation': 2}
>>> ds.num_rows
{'test': 1066, 'train': 8530, 'validation': 1066}
4. The column_names attribute

Returns the column names of each Dataset in the DatasetDict.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.column_names
{'test': ['text', 'label'],
 'train': ['text', 'label'],
 'validation': ['text', 'label']}
5. The shape attribute

Returns the shape (number of rows, number of columns) of each Dataset in the DatasetDict.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.shape
{'test': (1066, 2), 'train': (8530, 2), 'validation': (1066, 2)}
6. The unique method
dataset.unique(column)

Returns a list of the unique values in the given column, for each Dataset in the DatasetDict.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.unique("label")
{'test': [1, 0], 'train': [1, 0], 'validation': [1, 0]}
7. The map method
dataset.map(
        function: Optional[Callable] = None,
        with_indices: bool = False,
        with_rank: bool = False,
        input_columns: Optional[Union[str, List[str]]] = None,
        batched: bool = False,
        batch_size: Optional[int] = 1000,
        drop_last_batch: bool = False,
        remove_columns: Optional[Union[str, List[str]]] = None,
        keep_in_memory: bool = False,
        load_from_cache_file: bool = True,
        cache_file_names: Optional[Dict[str, Optional[str]]] = None,
        writer_batch_size: Optional[int] = 1000,
        features: Optional[Features] = None,
        disable_nullable: bool = False,
        fn_kwargs: Optional[dict] = None,
        num_proc: Optional[int] = None,
        desc: Optional[str] = None,
    )

Applies a function to every example in each Dataset of the DatasetDict; if batched is True, the function is applied to batches of examples instead.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> def add_prefix(example):
...     example["text"] = "Review: " + example["text"]
...     return example
>>> ds = ds.map(add_prefix)
>>> ds["train"][0:3]["text"]
['Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .',
 'Review: effective but too-tepid biopic']

# process a batch of examples
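# (assumes a tokenizer object is already defined, e.g. one loaded from the transformers library)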
>>> ds = ds.map(lambda example: tokenizer(example["text"]), batched=True)
# set number of processors
>>> ds = ds.map(add_prefix, num_proc=4)
8. The filter method
dataset.filter(
        function,
        with_indices=False,
        input_columns: Optional[Union[str, List[str]]] = None,
        batched: bool = False,
        batch_size: Optional[int] = 1000,
        keep_in_memory: bool = False,
        load_from_cache_file: bool = True,
        cache_file_names: Optional[Dict[str, Optional[str]]] = None,
        writer_batch_size: Optional[int] = 1000,
        fn_kwargs: Optional[dict] = None,
        num_proc: Optional[int] = None,
        desc: Optional[str] = None,
    )

Filters each Dataset in the DatasetDict with the given function, keeping only the examples for which the function returns True.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.filter(lambda x: x["label"] == 1)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 4265
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 533
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 533
    })
})
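Per the signature above, filter also supports batched=True, in which case the function receives a batch (a dict of lists) and should return a list of booleans; a minimal sketch:

>>> ds.filter(lambda batch: [label == 0 for label in batch["label"]], batched=True)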
9. The sort method
dataset.sort(
        column: str,
        reverse: bool = False,
        kind: str = None,
        null_placement: str = "last",
        keep_in_memory: bool = False,
        load_from_cache_file: bool = True,
        indices_cache_file_names: Optional[Dict[str, Optional[str]]] = None,
        writer_batch_size: Optional[int] = 1000,
    ) 

Sorts each Dataset in the DatasetDict by the values of the given column.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds["train"]["label"][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> sorted_ds = ds.sort("label")
>>> sorted_ds["train"]["label"][:10]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
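Passing reverse=True (see the signature above) sorts in descending order instead; a minimal sketch:

>>> desc_ds = ds.sort("label", reverse=True)
>>> desc_ds["train"]["label"][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]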
10. The shuffle method
dataset.shuffle(
        seeds: Optional[Union[int, Dict[str, Optional[int]]]] = None,
        seed: Optional[int] = None,
        generators: Optional[Dict[str, np.random.Generator]] = None,
        keep_in_memory: bool = False,
        load_from_cache_file: bool = True,
        indices_cache_file_names: Optional[Dict[str, Optional[str]]] = None,
        writer_batch_size: Optional[int] = 1000,
    )

Shuffles the rows of each Dataset in the DatasetDict.

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds["train"]["label"][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# set a seed
>>> shuffled_ds = ds.shuffle(seed=42)
>>> shuffled_ds["train"]["label"][:10]
[0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
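
Per the seeds parameter in the signature above, each split can also be shuffled with its own seed; a minimal sketch:

>>> shuffled_ds = ds.shuffle(seeds={"train": 42, "validation": 0, "test": 7})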