Introduction to the GPQA Dataset
Paper: https://arxiv.org/pdf/2311.12022
Hugging Face: https://huggingface.co/datasets/Idavidrein/gpqa#dataset-card-for-gpqa
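GPQA is a multiple-choice question-answering dataset whose questions were carefully written and validated by experts in biology, physics, and chemistry, and are extremely difficult. Even experts answering outside their own field (for example, a physicist answering a chemistry question) reach only 34% accuracy, although they may spend more than 30 minutes and have full access to Google.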
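Dataset Overview
- Number of questions: 448 high-quality multiple-choice questions
- Domains: biology, physics, chemistry
- Authors: domain experts who hold or are pursuing a PhD
- Difficulty:
  - Experts in the same field average 65% accuracy (74% after discounting clear mistakes)
  - Non-expert validators average only 34% accuracy
  - The strongest GPT-4-based AI baseline reaches 39% accuracy
- Distinguishing feature: the questions are designed so that even non-experts with internet search struggle to answer them, hence the name “Google-proof.”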
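Significance of the Dataset
The main purpose of GPQA is to provide a testbed for developing scalable oversight methods for AI. Future AI systems may surpass human ability in generating scientific knowledge, which in turn makes it harder for humans to supervise AI output. GPQA offers a difficult and realistic experimental setting that helps researchers explore how oversight mechanisms can let humans obtain reliable information from AI systems that exceed their own abilities.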
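An Example Question
{
"question": "A large gene has dozens of exons, of which the central ones encode triple-helical repeats that connect the cytoskeleton, the sarcolemma, and the extracellular space. The most common mutations of the gene are deletions of central exons, which produce out-of-frame peptides and progressive organ degeneration. One solution is to use a Morpholino that recognizes the 5' end of the out-of-frame exon, blocks spliceosome binding, and causes exon skipping with in-frame joining. The loss of several exons is well tolerated by the organism. Which of the following structures is not involved in the proposed therapy?",
"choices": ["(A) lariat", "(B) antisense", "(C) R-loops", "(D) polyA tail"],
"answer": "C",
"explanation": "This describes the dystrophin gene and its FDA-approved oligonucleotide therapy, which skips exons to produce a functional but shorter dystrophin. The Morpholino binds the pre-mRNA in the antisense orientation. Every splicing event produces a circular lariat molecule with a 3' tail, which is then degraded. The spliced RNA receives a polyA tail at its 3' end. R-loops are triple-helix structures of DNA and pre-mRNA that result from transcription and are not involved in splicing or RNA maturation."
}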
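Uses of the Dataset
- Scalable oversight research: testing how to design reliable oversight methods once AI systems exceed human abilities.
- Domain knowledge testing: a rigorous benchmark for measuring the scientific domain knowledge of different AI systems.
- Model improvement: driving the development of stronger models for highly difficult scientific question answering.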
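License
The GPQA dataset is released under the CC BY 4.0 license, which permits free use and redistribution as long as attribution is preserved.
Note: to keep the questions from leaking into foundation-model training corpora, the dataset's authors ask that questions and answers not be displayed publicly online.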
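GPQA lets us not only understand the limitations of current AI systems but also explore what may become possible in scientific knowledge generation. If you work on AI or natural language processing research, it is a resource not to be missed!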
English Version
Introduction to the GPQA Dataset
GPQA is a challenging multiple-choice Q&A dataset with questions meticulously designed and validated by experts in biology, physics, and chemistry. These questions are extremely difficult, with non-expert validators achieving only 34% accuracy when answering questions outside their domain, even with over 30 minutes of unrestricted access to Google.
Overview of the Dataset
- Number of Questions: 448 high-quality multiple-choice questions
- Domains: Biology, Physics, Chemistry
- Created by: Experts with or pursuing PhDs in their respective fields
- Difficulty:
  - Experts achieve 65% accuracy on average (74% after removing clear mistakes).
  - Highly skilled non-experts achieve only 34% accuracy.
  - The strongest GPT-4-based AI baseline achieves 39% accuracy.
- Special Feature: These questions are “Google-proof,” meaning they are intentionally designed to be difficult to solve even with access to online resources.
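To put the accuracy figures above in context, it can help to compare them with random guessing. The following is a minimal sketch, assuming every question offers four options (as in the example question in this post), so that uniform guessing would score 25%. The names `REPORTED` and `margin_over_chance` are illustrative and not part of any GPQA tooling.

```python
# Accuracies quoted in the overview above, in percent.
REPORTED = {
    "in-domain experts": 65,
    "skilled non-experts": 34,
    "GPT-4-based baseline": 39,
}

# Assumption: four options per question, so uniform guessing scores 100 / 4.
NUM_CHOICES = 4
CHANCE_PCT = 100 / NUM_CHOICES


def margin_over_chance(accuracy_pct: float, chance_pct: float = CHANCE_PCT) -> float:
    """Return how many percentage points an accuracy sits above chance."""
    return accuracy_pct - chance_pct


margins = {name: margin_over_chance(acc) for name, acc in REPORTED.items()}
for name, points in margins.items():
    print(f"{name}: {points:+.1f} points over chance")
```

Under this assumption, skilled non-experts sit 9 points above chance and the GPT-4-based baseline 14 points above it, while in-domain experts sit 40 points above.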
Purpose of the Dataset
GPQA aims to provide a benchmark for developing scalable oversight methods for future AI systems. As AI advances, especially in generating scientific knowledge, it is crucial to establish mechanisms that allow humans to reliably evaluate and supervise outputs from AI systems that may surpass human expertise. The GPQA dataset presents a realistic and challenging environment to test such mechanisms.
Example Question
{
"question": "A large gene has dozens of exons, of which the central ones code for folded triple helical repeats that connect the cytoskeleton with sarcolemma and extracellular space. Each exon usually codes for one folded triple alpha helix. The most common mutations of the gene are central exon deletions that create out-of-frame peptides and progressive degenerative organ waste. A solution is to deliver a Morpholino that recognizes the 5' end of the out-of-frame exon in pre-mRNA. The molecule prevents binding of the spliceosome and creates exon skipping and in-frame joining. Several missing exons are well tolerated by an organism. Which structure below is not involved in the proposed therapy?",
"choices": ["(A) lariat", "(B) antisense", "(C) R-loops", "(D) polyA tail"],
"answer": "C",
"explanation": "The text describes the dystrophin gene and the FDA-approved oligonucleotide therapy that causes exon skipping by creating a functional, albeit shorter, dystrophin protein. Morpholino is bound to the pre-mRNA in an antisense orientation. Every splicing mechanism creates the lariat molecule that is circular with a 3' tail and soon degraded. The spliced RNA is polyadenylated at the 3' end. R-loops are triple helix structures formed between DNA and pre-mRNA during transcription and are not involved in splicing or RNA maturation."
}
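A record in the shape shown above could be turned into a model prompt and graded roughly as follows. This is a sketch under stated assumptions: the field names `question`, `choices`, and `answer` follow the example in this post, and the released dataset files may use different column names. The record below is a placeholder, not a real GPQA item, and `build_prompt` and `is_correct` are hypothetical helpers.

```python
import json

# Hypothetical record in the same shape as the example above; the question
# text is a placeholder, not a real GPQA item.
RAW = """
{
  "question": "Placeholder question text?",
  "choices": ["(A) first", "(B) second", "(C) third", "(D) fourth"],
  "answer": "C"
}
"""


def build_prompt(record: dict) -> str:
    """Render the question, one option per line, and a short instruction."""
    lines = [record["question"], *record["choices"], "Answer with a single letter."]
    return "\n".join(lines)


def is_correct(record: dict, prediction: str) -> bool:
    """Compare a predicted letter with the gold letter.

    Surrounding whitespace, parentheses, and letter case are ignored.
    """
    cleaned = prediction.strip().strip("()").upper()
    return cleaned == record["answer"].upper()


record = json.loads(RAW)
prompt = build_prompt(record)
print(prompt)
print(is_correct(record, " (c) "))
```

Normalizing the prediction before comparing keeps the grader tolerant of answers such as "(c)" or " C ", which differ from the gold letter only in formatting.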
Applications of the Dataset
- Scalable Oversight Research: Enables experiments on how to design supervision methods for AI systems exceeding human capabilities.
- Domain Knowledge Testing: Acts as a rigorous benchmark for evaluating scientific knowledge in AI systems.
- Model Development: Encourages the development of more robust models capable of handling difficult scientific Q&A tasks.
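When GPQA is used for domain knowledge testing, accuracy is often more informative when broken down by domain. The sketch below assumes each graded result carries a domain label along with the gold and predicted letters. The `RESULTS` data is made up for illustration, and `accuracy_by_domain` is a hypothetical helper rather than part of any official evaluation code.

```python
from collections import defaultdict

# Hypothetical graded results: (domain, gold letter, predicted letter).
RESULTS = [
    ("Biology", "C", "C"),
    ("Biology", "A", "B"),
    ("Physics", "D", "D"),
    ("Chemistry", "B", "B"),
    ("Chemistry", "A", "C"),
    ("Chemistry", "D", "D"),
]


def accuracy_by_domain(results):
    """Return {domain: fraction correct} for (domain, gold, predicted) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, gold, predicted in results:
        total[domain] += 1
        correct[domain] += int(gold == predicted)
    return {domain: correct[domain] / total[domain] for domain in total}


scores = accuracy_by_domain(RESULTS)
for domain, score in scores.items():
    print(f"{domain}: {score:.0%}")
```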
Licensing
The GPQA dataset is licensed under CC BY 4.0, allowing for free use and redistribution with proper attribution.
Note: To avoid the risk of dataset leakage into foundation model training corpora, the dataset creators explicitly request that examples not be shared online in plain text or images.
Conclusion
GPQA not only highlights the current limitations of AI systems but also paves the way for exploring the potential of AI in scientific knowledge generation. If you’re working on AI or natural language processing research, GPQA is an invaluable resource for testing and improving your models.
Postscript
Completed in Shanghai at 13:49 on December 19, 2024, with the assistance of the GPT-4o large language model.