Datasets for Large Language Models: A Comprehensive Survey


This article takes an in-depth look at datasets for large language models (LLMs), including pre-training corpora, instruction fine-tuning datasets, preference datasets, and evaluation datasets. The survey covers 444 datasets spanning 8 languages and 32 domains, totaling more than 774.5 TB of pre-training data and over 700 million instances in the other categories. It highlights the critical role datasets play in the development of LLMs, while pointing out current challenges and future directions such as data diversity and quality assessment.


This post is part of the LLM article series and is a translation of "Datasets for Large Language Models: A Comprehensive Survey".

Abstract

This paper embarks on an exploration of datasets for large language models (LLMs), which play a crucial role in the remarkable advancements of LLMs. Datasets serve as the foundational infrastructure, analogous to a root system that sustains and nurtures the development of LLMs. Consequently, examining these datasets has emerged as a critical topic in research. To address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insight into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) pre-training corpora; (2) instruction fine-tuning datasets; (3) preference datasets; (4) evaluation datasets; and (5) traditional natural language processing (NLP) datasets. The survey sheds light on current challenges and points out potential avenues for future investigation. In addition, it provides a comprehensive review of existing available dataset resources, including statistics on 444 datasets covering 8 language categories and 32 domains. Information from 20 dimensions is incorporated into the dataset statistics. The total data size of the surveyed pre-training corpora exceeds 774.5 TB, and the other datasets comprise more than 700M instances. Our goal is to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies. Related resources are available at

### Chain-of-Thought Prompting Mechanism in Large Language Models

In large language models, chain-of-thought prompting is a method for enhancing reasoning capabilities by guiding the model through structured thought processes. The approach breaks complex problems down into simpler components and provides step-by-step guidance that mirrors human cognitive processing. Constructing these prompts typically involves selecting examples from training datasets where each example represents part of an overall problem-solving process[^2].

By decomposing tasks into multiple steps, this technique encourages deeper understanding and more accurate predictions than traditional prompting. For instance, on multi-hop question answering or logical deduction challenges, such chains allow models not only to generate correct answers but also to articulate the intermediate thoughts leading to those conclusions. This transparency improves interpretability while raising performance on various NLP benchmarks.

```python
def create_chain_of_thought_prompt(task_description, examples):
    """
    Creates a chain-of-thought prompt from a task description and examples.

    Args:
        task_description (str): Description of the task at hand.
        examples (list): List of (input, output) tuples used as demonstrations;
            each output should spell out the reasoning before the final answer.

    Returns:
        str: Formatted prompt string including both instructions and sample cases.
    """
    # Render each demonstration as an input/output pair, separated by blank lines.
    formatted_examples = "\n\n".join(
        f"Input: {question}\nOutput: {answer}" for question, answer in examples
    )
    return f"""
Task: {task_description}

Examples:
{formatted_examples}

Now try solving similar questions following the above pattern.
"""


# Example usage: each output articulates the reasoning chain before the final
# answer, which is what distinguishes chain-of-thought from plain few-shot prompting.
examples = [
    (
        "What color do you get mixing red and blue?",
        "Red and blue are primary colors; mixing them produces purple. Answer: Purple",
    ),
    (
        "If it rains tomorrow, will we have our picnic?",
        "A picnic takes place outdoors, and rain makes outdoor plans impractical. Answer: No",
    ),
]
print(create_chain_of_thought_prompt("Solve logic puzzles", examples))
```
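Once the model responds in this format, a common follow-up step is to separate the articulated reasoning from the final answer. The helper below is a minimal sketch assuming the `Answer:` marker convention used in the demonstrations above; the name `extract_final_answer` and the marker itself are illustrative choices of this sketch, not something prescribed by the survey.

```python
def extract_final_answer(completion: str) -> tuple[str, str]:
    """
    Splits a chain-of-thought completion into (reasoning, answer).

    Assumes the "Answer:" marker convention from the demonstrations above;
    the marker is an assumption of this sketch, not a standard.
    """
    marker = "Answer:"
    if marker in completion:
        # rpartition keeps everything before the *last* marker as reasoning,
        # so intermediate mentions of the marker do not truncate the chain.
        reasoning, _, answer = completion.rpartition(marker)
        return reasoning.strip(), answer.strip()
    # Fall back to treating the whole completion as the answer.
    return "", completion.strip()


# Example usage
reasoning, answer = extract_final_answer(
    "Red and blue are primary colors; mixing them produces purple. Answer: Purple"
)
print(reasoning)  # Red and blue are primary colors; mixing them produces purple.
print(answer)     # Purple
```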