在 HuggingFace 上创建数据集非常方便,创建完成之后,通过 API 可以方便的下载并使用数据集,在 Google Colab 上进行模型调优,下载数据集速度非常快,本文通过 Dataset 库创建一个简单的训练数据集。
首先安装数据集依赖
- HuggingFace
datasets
huggingface_hub
创建数据集
替换为自己的 HuggingFace API key,运行一下数据集就生成了。Dataset 类提供了多种格式数据的上传,API 比较简单,例如list、dict、pandas 等等。
def upload():
training_data = [
[
{"from": "user", "value": "hi"},
{"from": "gpt", "value": "hello"}
],
[
{"from": "user", "value": "how are you?"},
{"from": "gpt", "value": "I am doing well, thank you."}
],
[
{"from": "user", "value": "what is your name?"},
{"from": "gpt", "value": "I am zidk2"}
],
[
{"from": "user", "value": "who are you?"},
{"from": "gpt", "value": "I am zidk2, your assistant."}
],
[
{"from": "user", "value": "can you tell me your name?"},
{"from": "gpt", "value": "I am zidk2."}
],
[
{"from": "user", "value": "please tell me your name"},
{"from": "gpt", "value": "I am zidk2."}
],
[
{"from": "user", "value": "hello, what is your name?"},
{"from": "gpt", "value": "I am zidk2."}
],
[
{"from": "user", "value": "name?"},
{"from": "gpt", "value": "I am zidk2."}
],
[
{"from": "user", "value": "hi, who are you?"},
{"from": "gpt", "value": "I am zidk2."}
],
[
{"from": "user", "value": "what can I call you?"},
{"from": "gpt", "value": "You can call me zidk2."}
]
]
dataset = Dataset.from_dict({"conversations": training_data })
# Set up Hugging Face API
hf_api = HfApi()
# Provide your Hugging Face token
token = os.getenv("HF_KEY") # Replace with your Hugging Face token
HfFolder.save_token(token)
# Create a new dataset repository on Hugging Face
repo_name = "test-data" # Replace with your desired dataset name
username = hf_api.whoami(token)['name']
repo_id = f"{username}/{repo_name}"
# Create the dataset repository on Hugging Face
hf_api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
# Push the dataset to Hugging Face Hub
dataset.push_to_hub(repo_id)
print(f"Dataset uploaded to Hugging Face Hub: https://huggingface.co/datasets/{repo_id}")
登录 HuggingFace 进行查看
总结
在 HuggingFace 上创建数据集的好处是,可以方便的在云平台上获取数据并进行训练,如果使用阿里云,可以上将数据集上传到魔搭进行使用。