如何创建训练数据集

在 HuggingFace 上创建数据集非常方便,创建完成之后,通过 API 可以方便的下载并使用数据集,在 Google Colab 上进行模型调优,下载数据集速度非常快,本文通过 Dataset 库创建一个简单的训练数据集。

首先安装数据集依赖

  1. HuggingFace
datasets
huggingface_hub

创建数据集

替换为自己的 HuggingFace API key,运行一下数据集就生成了。Dataset 类提供了多种格式数据的上传,API 比较简单,例如list、dict、pandas 等等。

def upload():
    training_data = [
        [
            {"from": "user", "value": "hi"},
            {"from": "gpt", "value": "hello"}
        ],
        [
            {"from": "user", "value": "how are you?"},
            {"from": "gpt", "value": "I am doing well, thank you."}
        ],
        [
            {"from": "user", "value": "what is your name?"},
            {"from": "gpt", "value": "I am zidk2"}
        ],
        [
            {"from": "user", "value": "who are you?"},
            {"from": "gpt", "value": "I am zidk2, your assistant."}
        ],
        [
            {"from": "user", "value": "can you tell me your name?"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "please tell me your name"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "hello, what is your name?"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "name?"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "hi, who are you?"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "what can I call you?"},
            {"from": "gpt", "value": "You can call me zidk2."}
        ]
    ]

    dataset = Dataset.from_dict({"conversations": training_data })

    # Set up Hugging Face API
    hf_api = HfApi()

    # Provide your Hugging Face token
    token = os.getenv("HF_KEY")  # Replace with your Hugging Face token
    HfFolder.save_token(token)

    # Create a new dataset repository on Hugging Face
    repo_name = "test-data"  # Replace with your desired dataset name
    username = hf_api.whoami(token)['name']
    repo_id = f"{username}/{repo_name}"

    # Create the dataset repository on Hugging Face
    hf_api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

    # Push the dataset to Hugging Face Hub
    dataset.push_to_hub(repo_id)

    print(f"Dataset uploaded to Hugging Face Hub: https://huggingface.co/datasets/{repo_id}")

登录 HuggingFace 进行查看
在这里插入图片描述

总结

在 HuggingFace 上创建数据集的好处是,可以方便的在云平台上获取数据并进行训练,如果使用阿里云,可以上将数据集上传到魔搭进行使用。

  • 5
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值