GPQA: A Graduate-Level Google-Proof Q&A Benchmark — a PhD-expert-level dataset of biology, physics, and chemistry questions (Chinese and English)

Introduction to the GPQA Dataset

Paper: https://arxiv.org/pdf/2311.12022
Huggingface: https://huggingface.co/datasets/Idavidrein/gpqa#dataset-card-for-gpqa

GPQA is a multiple-choice question-answering dataset whose questions were carefully written and validated by experts in biology, physics, and chemistry, and are extremely difficult. Even experts answering outside their own field (for example, a physicist answering a chemistry question) achieve only 34% accuracy, despite being allowed more than 30 minutes per question with full access to Google.

Dataset Overview
  • Number of questions: 448 high-quality multiple-choice questions
  • Domains: biology, physics, chemistry
  • Created by: experts who hold or are pursuing PhDs in the relevant fields
  • Difficulty:
    • Experts in the same domain average 65% accuracy (74% after removing clear mistakes).
    • Non-expert validators average only 34% accuracy.
    • The strongest GPT-4-based AI baseline reaches 39% accuracy.
  • Key property: the questions are deliberately written so that even non-experts with unrestricted web search cannot easily derive the answers, hence the name "Google-proof".
Significance of the Dataset

The primary purpose of GPQA is to provide a testbed for developing scalable oversight methods for AI. Future AI systems may surpass human ability at generating scientific knowledge, which raises the challenge of how humans can supervise AI outputs. GPQA offers a difficult, realistic experimental setting in which researchers can study oversight mechanisms that let humans obtain reliable information from AI systems more capable than themselves.


An Example Question
{
  "question": "A large gene has dozens of exons, of which the central ones code for triple-helical repeats that connect the cytoskeleton, the sarcolemma, and the extracellular space. The most common mutations of the gene are deletions of central exons, producing out-of-frame peptides and progressive organ degeneration. A solution is a Morpholino that recognizes the 5' end of the out-of-frame exon, preventing spliceosome binding, creating exon skipping, and forming an in-frame junction. Several missing exons are well tolerated by the organism. Which structure below is not involved in the proposed therapy?",
  "choices": ["(A) lariat", "(B) antisense", "(C) R-loops", "(D) polyA tail"],
  "answer": "C",
  "explanation": "This describes the dystrophin gene and its FDA-approved oligonucleotide therapy, which skips an exon to produce a functional though shorter dystrophin protein. The Morpholino binds the pre-mRNA in an antisense orientation. Every splicing event produces a circular lariat molecule with a 3' tail that is soon degraded. The spliced RNA receives a polyA tail at its 3' end. R-loops are triple-helix structures formed between DNA and pre-mRNA as a result of transcription and are not involved in splicing or RNA maturation."
}
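To make the record format above concrete, here is a minimal Python sketch that turns such a record into a prompt and checks a model's lettered answer. The helper names `build_prompt` and `score` are hypothetical, and the inline record is an abbreviated mock of the example shown above (the official release stores the correct and incorrect answers in separate columns rather than a single `choices` list).

```python
# A GPQA-style record mirroring the example schema above (mock data;
# the question text is shortened for brevity).
record = {
    "question": "Which structure below is not involved in the proposed therapy?",
    "choices": ["(A) lariat", "(B) antisense", "(C) R-loops", "(D) polyA tail"],
    "answer": "C",
}

def build_prompt(record):
    """Format a multiple-choice prompt string from one record."""
    lines = [record["question"], ""]
    lines.extend(record["choices"])
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)

def score(predicted_letter, record):
    """Return 1 if the model's letter matches the gold answer, else 0."""
    return int(predicted_letter.strip().upper() == record["answer"])

print(build_prompt(record))
print(score("C", record))  # 1: matches the gold answer
print(score("A", record))  # 0: does not match
```

Because the answer key is a single letter, per-question scoring reduces to an exact string comparison, which keeps evaluation unambiguous across models.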

Applications of the Dataset
  1. Scalable oversight research: testing how to design reliable supervision methods when AI systems exceed human capabilities.
  2. Domain knowledge evaluation: a rigorous benchmark for measuring the scientific domain knowledge of different AI systems.
  3. Model improvement: driving the development of stronger models for highly difficult scientific question answering.
Licensing

The GPQA dataset is released under the CC BY 4.0 license, allowing free use and redistribution with attribution.

Note: to keep the questions from leaking into foundation-model training corpora, the authors ask that questions and answers not be posted publicly online.


Through GPQA we can not only see the limitations of current AI systems but also explore what may be possible in future scientific knowledge generation. If you work on AI or natural language processing research, this is an important resource not to be missed!

English Version

Introduction to the GPQA Dataset

GPQA is a challenging multiple-choice Q&A dataset with questions meticulously designed and validated by experts in biology, physics, and chemistry. These questions are extremely difficult, with non-expert validators achieving only 34% accuracy when answering questions outside their domain, even with over 30 minutes of unrestricted access to Google.


Overview of the Dataset
  • Number of Questions: 448 high-quality multiple-choice questions
  • Domains: Biology, Physics, Chemistry
  • Created by: Experts with or pursuing PhDs in their respective fields
  • Difficulty:
    • Experts achieve 65% accuracy on average (74% after removing clear mistakes).
    • Highly skilled non-experts achieve only 34% accuracy.
    • The strongest GPT-4-based AI baseline achieves 39% accuracy.
  • Special Feature: These questions are “Google-proof,” meaning they are intentionally designed to be difficult to solve even with access to online resources.

Purpose of the Dataset

GPQA aims to provide a benchmark for developing scalable oversight methods for future AI systems. As AI advances, especially in generating scientific knowledge, it is crucial to establish mechanisms that allow humans to reliably evaluate and supervise outputs from AI systems that may surpass human expertise. The GPQA dataset presents a realistic and challenging environment to test such mechanisms.


Example Question
{
  "question": "A large gene has dozens of exons, of which the central ones code for folded triple helical repeats that connect the cytoskeleton with sarcolemma and extracellular space. Each exon usually codes for one folded triple alpha helix. The most common mutations of the gene are central exon deletions that create out-of-frame peptides and progressive degenerative organ waste. A solution is to deliver a Morpholino that recognizes the 5' end of the out-of-frame exon in pre-mRNA. The molecule prevents binding of the spliceosome and creates exon skipping and in-frame joining. Several missing exons are well tolerated by an organism. Which structure below is not involved in the proposed therapy?",
  "choices": ["(A) lariat", "(B) antisense", "(C) R-loops", "(D) polyA tail"],
  "answer": "C",
  "explanation": "The text describes the dystrophin gene and the FDA-approved oligonucleotide therapy that causes exon skipping by creating a functional, albeit shorter, dystrophin protein. Morpholino is bound to the pre-mRNA in an antisense orientation. Every splicing mechanism creates the lariat molecule that is circular with a 3' tail and soon degraded. The spliced RNA is polyadenylated at the 3' end. R-loops are triple helix structures formed between DNA and pre-mRNA during transcription and are not involved in splicing or RNA maturation."
}
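As a sketch of how aggregate figures like the 39% GPT-4 baseline above are computed, the snippet below averages per-question correctness over pairs of gold and predicted answer letters. The records themselves would come from the Hugging Face release (the `load_dataset` call in the comment assumes the `gpqa_main` configuration name from the dataset card and requires accepting the access terms first); the letter lists here are mock data so the computation is runnable on its own.

```python
# The gated dataset can be fetched with the Hugging Face `datasets` library
# after accepting the terms on the dataset page, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("Idavidrein/gpqa", "gpqa_main")
# Below, mock (gold, predicted) answer letters stand in for a real run.

def accuracy(gold, predicted):
    """Fraction of questions where the predicted letter matches the gold letter."""
    assert len(gold) == len(predicted), "one prediction per question"
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

gold      = ["C", "A", "B", "D", "A", "C", "B", "D"]
predicted = ["C", "B", "B", "D", "A", "A", "B", "C"]  # hypothetical model output

print(f"accuracy = {accuracy(gold, predicted):.2%}")  # 5 of 8 correct
```

On the full benchmark the same computation runs over all 448 questions, which is why small accuracy gaps (e.g., 34% vs. 39%) correspond to only a couple dozen questions.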

Applications of the Dataset
  1. Scalable Oversight Research: Enables experiments on how to design supervision methods for AI systems exceeding human capabilities.
  2. Domain Knowledge Testing: Acts as a rigorous benchmark for evaluating scientific knowledge in AI systems.
  3. Model Development: Encourages the development of more robust models capable of handling difficult scientific Q&A tasks.

Licensing

The GPQA dataset is licensed under CC BY 4.0, allowing for free use and redistribution with proper attribution.

Note: To avoid the risk of dataset leakage into foundation model training corpora, the dataset creators explicitly request that examples not be shared online in plain text or images.


Conclusion

GPQA not only highlights the current limitations of AI systems but also paves the way for exploring the potential of AI in scientific knowledge generation. If you’re working on AI or natural language processing research, GPQA is an invaluable resource for testing and improving your models.

Postscript

Written in Shanghai at 13:49 on December 19, 2024, with the assistance of the GPT-4o model.
