如何创建训练数据集

最新推荐文章于 2024-09-07 21:01:55 发布

hawk2014bj

最新推荐文章于 2024-09-07 21:01:55 发布

阅读量155

点赞数 5

文章标签：自然语言处理数据集

本文链接：https://blog.csdn.net/hawk2014bj/article/details/141998652

版权

在 HuggingFace 上创建数据集非常方便，创建完成之后，通过 API 可以方便的下载并使用数据集，在 Google Colab 上进行模型调优，下载数据集速度非常快，本文通过 Dataset 库创建一个简单的训练数据集。

首先安装数据集依赖

HuggingFace

datasets
huggingface_hub

创建数据集

替换为自己的 HuggingFace API key，运行一下数据集就生成了。Dataset 类提供了多种格式数据的上传，API 比较简单，例如list、dict、pandas 等等。

def upload():
    training_data = [
        [
            {"from": "user", "value": "hi"},
            {"from": "gpt", "value": "hello"}
        ],
        [
            {"from": "user", "value": "how are you?"},
            {"from": "gpt", "value": "I am doing well, thank you."}
        ],
        [
            {"from": "user", "value": "what is your name?"},
            {"from": "gpt", "value": "I am zidk2"}
        ],
        [
            {"from": "user", "value": "who are you?"},
            {"from": "gpt", "value": "I am zidk2, your assistant."}
        ],
        [
            {"from": "user", "value": "can you tell me your name?"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "please tell me your name"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "hello, what is your name?"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "name?"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "hi, who are you?"},
            {"from": "gpt", "value": "I am zidk2."}
        ],
        [
            {"from": "user", "value": "what can I call you?"},
            {"from": "gpt", "value": "You can call me zidk2."}
        ]
    ]

    dataset = Dataset.from_dict({"conversations": training_data })

    # Set up Hugging Face API
    hf_api = HfApi()

    # Provide your Hugging Face token
    token = os.getenv("HF_KEY")  # Replace with your Hugging Face token
    HfFolder.save_token(token)

    # Create a new dataset repository on Hugging Face
    repo_name = "test-data"  # Replace with your desired dataset name
    username = hf_api.whoami(token)['name']
    repo_id = f"{username}/{repo_name}"

    # Create the dataset repository on Hugging Face
    hf_api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

    # Push the dataset to Hugging Face Hub
    dataset.push_to_hub(repo_id)

    print(f"Dataset uploaded to Hugging Face Hub: https://huggingface.co/datasets/{repo_id}")