A roundup of Hugging Face datasets

swaption2009/20k-en-zh-translation-pinyin-hsk
Translation (English-Chinese, with pinyin and HSK level)
Source: https://mnemosyne-proj.org/cards/20000-chinese-sentences-translations-and-pinyin
Contributed by: Brian Vaughan http://brianvaughan.net/

RUCAIBox/Translation
Translation
WMT14 English-French (wmt14-fr-en)
WMT16 Romanian-English (wmt16-ro-en)
WMT16 German-English (wmt16-de-en)
WMT19 Czech-English (wmt19-cs-en)
WMT13 Spanish-English (wmt13-es-en)
WMT19 Chinese-English (wmt19-zh-en)
WMT19 Russian-English (wmt19-ru-en).

dbarbedillo/SMS_Spam_Multilingual_Collection_Dataset
The text has been further translated into Spanish, Chinese, Arabic, Bengali, Russian, Portuguese, Indonesian, Urdu, Japanese, Punjabi, Javanese, Turkish, Korean, Marathi, Ukrainian, Swedish, and Norwegian using M2M100_418M, a multilingual encoder-decoder (seq2seq) model created by Facebook AI and trained for many-to-many multilingual translation.
The original English text was taken from https://www.kaggle.com/uciml/sms-spam-collection-dataset; the Hindi, German, and French texts were taken from https://www.kaggle.com/datasets/rajnathpatel/multilingual-spam-data
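
As a hedged illustration of how such translations could be reproduced with the Transformers library (the checkpoint name facebook/m2m100_418M matches the model mentioned above; the sample sentence is made up):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load the many-to-many translation checkpoint referenced above.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

text = "Free entry in 2 a wkly comp to win FA Cup final tkts"  # illustrative SMS-style sentence

tokenizer.src_lang = "en"                       # source language: English
encoded = tokenizer(text, return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("zh"),  # target language: Chinese
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```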

projecte-aina/ca_zh_wikipedia
Chinese-Catalan translation (parallel sentences from Wikipedia; "ca" is Catalan, not Canadian)

wanng/wukong100m
Brief Introduction
The Chinese portion of the Noah-Wukong multilingual multimodal dataset: around 100M image-text pairs.

MMChat
Image-grounded chat pairs.
MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese. Each dialogue is associated with one or more images (at most 9 images per dialogue), and various strategies were applied to ensure the quality of the dialogues.

Jiangjie/ekar_chinese
Explainable Knowledge-intensive Analogical Reasoning benchmark (E-KAR).

Hello-SimpleAI/HC3-Chinese
Human-vs-ChatGPT answer comparison dataset (Chinese)

kuroneko5943/weibo16
Weibo sentiment analysis

wangrui6/Zhihu-KOL
Zhihu questions and answers: each question has multiple answers, which can be ranked by upvote count.
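
A purely illustrative sketch of ranking answers by upvotes (the record layout and the "upvotes" key below are assumptions, not this dataset's documented schema):

```python
# Hypothetical answer records for a single Zhihu question (made-up values).
answers = [
    {"answer": "Answer A", "upvotes": 120},
    {"answer": "Answer B", "upvotes": 870},
    {"answer": "Answer C", "upvotes": 35},
]

# Rank the answers for the question by upvote count, highest first.
for a in sorted(answers, key=lambda r: r["upvotes"], reverse=True):
    print(a["upvotes"], a["answer"])
```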

silver/personal_dialog
Chinese personalized dialogues, multi-turn

medical_dialog
Medical dialogues between patients and doctors

mteb/amazon_massive_intent
Intent classification (Amazon MASSIVE)

qanastek/MASSIVE
Intent classification and NER (slot filling)

GEM/RiSAWOZ
Multi-turn dialogues

sunzeyeah/chinese_chatgpt_corpus
train_data_external_v1.jsonl — one JSON object per line, with the fields:
  prompt: the prompt, string
  answers: list of answers, each containing:
    answer: the answer text, string
    score: score of the answer, int
    prefix: prefix to the answer, string
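
Put together, a single line of train_data_external_v1.jsonl would look roughly like the following (only the field names come from the list above; the values are made up for illustration):

```python
import json

# A made-up record that follows the field list above (illustrative values only).
record = {
    "prompt": "用一句话介绍一下月球。",
    "answers": [
        {"answer": "月球是地球唯一的天然卫星。", "score": 5, "prefix": ""},
        {"answer": "月球绕着地球运行。", "score": 3, "prefix": ""},
    ],
}

# Each line of the .jsonl file is one such JSON object.
print(json.dumps(record, ensure_ascii=False))
```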

BelleGroup/generated_train_0.5M_CN
BELLE: Bloom-Enhanced Large Language model Engine
prompt_cn.txt: the prompts used for generation
0.5M generated samples: to make model training easier, the Hugging Face release merges the original "instruction" and "input" fields of the raw generated files into a single "input" field, and renames the "output" field to "target" (a sketch of this conversion follows below).
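
A minimal sketch of that field merge (assuming the raw generated records carry "instruction", "input", and "output" keys; the exact concatenation rule is an assumption, not BELLE's published conversion script):

```python
def to_hf_format(record):
    """Merge "instruction" + "input" into one "input" field and rename
    "output" to "target", mirroring the released data layout."""
    merged = record["instruction"]
    if record.get("input"):          # append the optional input, when present
        merged = merged + "\n" + record["input"]
    return {"input": merged, "target": record["output"]}

raw = {
    "instruction": "把下面的句子翻译成英文。",
    "input": "今天天气很好。",
    "output": "The weather is nice today.",
}
print(to_hf_format(raw))
```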

Hugging Face is an open-source community that provides state-of-the-art NLP models, datasets, and other convenient tools. In NLP it is best known for its Transformer-based models. To make these easier to use, Hugging Face maintains several projects: Transformers provides thousands of pretrained models for tasks in text, audio, and computer vision; Datasets is a lightweight framework that simplifies downloading and preprocessing common public datasets; Accelerate helps PyTorch users run multi-GPU/TPU/fp16 training with minimal changes. Hugging Face also hosts Spaces, a platform with many interesting deep-learning demos. [1][2][3]

References:
1, 3: [Hugging Face Quick Start (focusing on Transformers models and the Datasets library)](https://blog.csdn.net/zhaohongfei_358/article/details/126224199)
2: [A Concise HuggingFace Tutorial](https://blog.csdn.net/weixin_44748589/article/details/126359019)
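
As a minimal sketch of the Datasets workflow described above, the entries in this list can be loaded by repository id (split and column names differ per dataset, and some repositories also require a configuration name; treat the snippet as illustrative rather than a guaranteed recipe):

```python
from datasets import load_dataset

# Download (and cache) a dataset from the Hugging Face Hub by repository id.
# Any repo id from the list above works the same way; some repos also need a
# configuration name as the second argument.
ds = load_dataset("swaption2009/20k-en-zh-translation-pinyin-hsk")

print(ds)              # show the available splits and their sizes
print(ds["train"][0])  # first record, assuming a "train" split; field names vary by dataset
```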