datasets
- Loading data (load_dataset)
load_dataset("json", data_files=data_files, field="xxx_field")
- Data file types:
json, text, csv, pandas
- Data file list:
from datasets import load_dataset
data_files = {"train": "xxx.train.json", "test": "xxx.test.json"}
my_dataset = load_dataset("json", data_files=data_files, field="xxx_field")
- Processing data (Dataset objects)
shuffle
examples = my_dataset["train"].shuffle(seed=42)
examples["label"][:10]
select
examples = my_dataset["train"].select([0, 10, 20, 30, 40, 50])
examples = my_dataset["train"].shuffle(seed=42).select([0, 10, 20, 30, 40, 50])
unique
unique_data = my_dataset["train"].unique("xxx_field")
unique_data_nums = len(unique_data)
rename_column
my_dataset = my_dataset.rename_column(
    original_column_name="xxx_field",
    new_column_name="yyy_field"
)