逻辑推理数据集 Big Bench Hard (BBH) 介绍:中英双语

Big Bench Hard (BBH) 数据集全解析:原理、任务与实践评测

Paper: https://arxiv.org/pdf/2210.09261

Big Bench Hard (BBH) 是近年来引入的一组基准测试数据集,主要用于评估大型语言模型(LLM)的推理和逻辑能力。BBH 旨在解决当前大模型在复杂推理任务中的泛化能力问题,为进一步研究提供重要的测评标准。


1. BBH 数据集背景与意义

提出原因:

尽管现有的大语言模型如 GPT-3、InstructGPT 等在自然语言处理任务中表现出色,但它们在一些涉及多步逻辑推理、抽象思维和非直观问题的任务上仍存在明显不足。BBH 从 BIG-Bench 中挑选出 23 个此前模型表现未能超过平均人类评分者的任务,专门用于挑战模型的逻辑推理、注意力控制、记忆等核心能力。

数据集性质:
  • Few-Shot Prompting:BBH 数据集专注于少样本学习(few-shot learning)场景,即模型在极少量示例下进行推理和回答。
  • CoT(Chain-of-Thought):通过引导模型逐步推理,CoT 提升了任务性能,但仍不能完全匹配人类水平。
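下面给出一个极简的 Python 示意,展示 few-shot + CoT 提示在消息层面是如何组织的(示例问题与 "Let's think step by step." 的写法取自下文第 4 节的数据格式;函数与变量名仅为示意,并非 Olmes 的实际接口):

few_shot_examples = [
    {
        "question": "Sort the following words alphabetically: List: oven costume counterpart",
        # 省略中间推理步骤,完整写法见第 4 节 JSON 记录中的 assistant 消息
        "cot_answer": "... So the answer is costume counterpart oven.",
    },
]

def build_messages(task_instruction, examples, question):
    """把任务说明、few-shot 示例和目标问题组装成 chat 消息列表(仅为示意)。"""
    messages = []
    for i, ex in enumerate(examples):
        # 任务说明只拼在第一个示例前面,与第 4 节的记录格式一致
        prefix = (task_instruction + "\n\n") if i == 0 else ""
        messages.append({"role": "user",
                         "content": f"{prefix}Q: {ex['question']}\nA: Let's think step by step."})
        messages.append({"role": "assistant", "content": ex["cot_answer"]})
    # 最后附上待模型作答的目标问题
    messages.append({"role": "user",
                     "content": f"Q: {question}\nA: Let's think step by step."})
    return messages

msgs = build_messages(
    "Sort a list of words.",
    few_shot_examples,
    "Sort the following words alphabetically: List: syndrome therefrom",
)
print(msgs[-1]["content"])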

2. BBH 数据集的主要任务

BBH 数据集共包含 23 个高难度任务,其中部分任务的名称、目标以及挑战点总结如下:

| 任务名称 | 目标描述 | 挑战点 |
| --- | --- | --- |
| Boolean Expressions | 判断布尔表达式的真假。 | 多层次逻辑运算的正确执行。 |
| Causal Judgement | 根据前提判断因果关系。 | 正确区分相关性与因果性。 |
| Date Understanding | 理解和计算日期相关的问题。 | 日期转换和计算的复杂逻辑。 |
| Dyck Languages | 给出未闭合的括号序列,正确补全缺失的右括号使其平衡。 | 多层嵌套结构的注意力控制(参考表后示例)。 |
| Formal Fallacies | 区分演绎有效的论证与形式谬误。 | 复杂语境下的逻辑错误识别。 |
| Logical Deduction | 在多个对象的约束条件下推断其排列顺序(分三对象、五对象、七对象三个子任务)。 | 多步骤逻辑推理的准确性。 |
| Navigate | 根据一系列移动指令,判断是否回到出发点。 | 理解指令并正确执行路径规划。 |
| Object Counting | 在一组物体中正确数数。 | 精确的注意力控制和记忆能力。 |
| Tracking Shuffled Objects | 在一系列交换操作后跟踪被打乱顺序的物体。 | 持续的记忆跟踪和更新能力。 |
| Word Sorting | 将一组单词按照字典顺序排序。 | 文字排序规则的执行。 |
| Hyperbaton | 判断英语句子中多个形容词的排列顺序是否符合习惯(hyperbaton 指反常语序的修辞手法)。 | 复杂语言理解与语序判断。 |
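
以 Dyck Languages 为例,正确补全括号本质上等价于维护一个记录未闭合左括号的栈。下面是一个参考解法的最小 Python 示意(仅用于说明任务考查的嵌套结构跟踪能力,与 BBH 的评测代码无关):

def complete_dyck(prefix: str) -> str:
    """给定未闭合的括号前缀,返回使其平衡所需补全的右括号序列。"""
    close_of = {"(": ")", "[": "]", "{": "}", "<": ">"}
    stack = []
    for ch in prefix:
        if ch in close_of:                        # 左括号入栈
            stack.append(ch)
        elif stack and ch == close_of[stack[-1]]:
            stack.pop()                           # 与栈顶匹配的右括号出栈
    return "".join(close_of[ch] for ch in reversed(stack))

print(complete_dyck("([{<"))   # 期望输出:>}])
print(complete_dyck("([])<"))  # 期望输出:>

人用栈可以机械地解决这类问题,而语言模型必须在注意力中隐式维护这一状态,这也是该任务得分很低(0.14,见下节)的原因之一。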

3. 我的 BBH 任务测试结果:Tulu-CoT

我使用 Olmes 框架(https://github.com/allenai/olmes)对 Tulu 模型(Llama-3.1-Tulu-3-8B-SFT)在 BBH 数据集上进行了测试,具体运行配置如下:

# 本地模型快照路径(Llama-3.1-Tulu-3-8B-SFT)
MODEL_NAME=.cache/huggingface/hub/models--allenai--Llama-3.1-Tulu-3-8B-SFT/snapshots/6371bc13f9f3618b340524b14981f64386f613e0

# BBH 的 CoT 评测任务名、输出目录与批大小
TASK_NAME_03=bbh:cot-v1::tulu
OUTPUT_DIR_03=my-eval-bbh:cot-v1::tulu
BATCH_SIZE=4

# 通过 proxychains4 代理运行 olmes 评测
proxychains4 olmes \
    --model $MODEL_NAME \
    --task $TASK_NAME_03 \
    --batch-size $BATCH_SIZE \
    --output-dir $OUTPUT_DIR_03

最终得到的评测结果如下:

 INFO     [run_eval.py:668] Summary of primary scores: 
bbh:cot-v1::tulu: 0.692674
bbh_boolean_expressions:cot-v1::tulu: 0.928
bbh_causal_judgement:cot-v1::tulu: 0.57754
bbh_date_understanding:cot-v1::tulu: 0.82
bbh_disambiguation_qa:cot-v1::tulu: 0.628
bbh_dyck_languages:cot-v1::tulu: 0.14
bbh_formal_fallacies:cot-v1::tulu: 0.528
bbh_geometric_shapes:cot-v1::tulu: 0.532
bbh_hyperbaton:cot-v1::tulu: 0.944
bbh_logical_deduction_five_objects:cot-v1::tulu: 0.484
bbh_logical_deduction_seven_objects:cot-v1::tulu: 0.388
bbh_logical_deduction_three_objects:cot-v1::tulu: 0.852
bbh_movie_recommendation:cot-v1::tulu: 0.824
bbh_multistep_arithmetic_two:cot-v1::tulu: 0.64
bbh_navigate:cot-v1::tulu: 0.896
bbh_object_counting:cot-v1::tulu: 0.86
bbh_penguins_in_a_table:cot-v1::tulu: 0.767123
bbh_reasoning_about_colored_objects:cot-v1::tulu: 0.784
bbh_ruin_names:cot-v1::tulu: 0.74
bbh_salient_translation_error_detection:cot-v1::tulu: 0.536
bbh_snarks:cot-v1::tulu: 0.629213
bbh_sports_understanding:cot-v1::tulu: 0.928
bbh_temporal_sequences:cot-v1::tulu: 0.832
bbh_tracking_shuffled_objects_five_objects:cot-v1::tulu: 0.692
bbh_tracking_shuffled_objects_seven_objects:cot-v1::tulu: 0.688
bbh_tracking_shuffled_objects_three_objects:cot-v1::tulu: 0.752
bbh_web_of_lies:cot-v1::tulu: 1.0
bbh_word_sorting:cot-v1::tulu: 0.296
2024-12-18:11:31:37,365 INFO     [run_eval.py:674] Saving final metrics in my-eval-bbh:cot-v1::tulu/metrics-all.jsonl...
2024-12-18:11:31:37,366 INFO     [run_eval.py:686] Saving beaker metrics.json so in my-eval-bbh:cot-v1::tulu/metrics.json...
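
为了快速定位薄弱任务,可以把上面日志中的分数汇总解析成字典并按分数排序。下面的小脚本直接解析"任务名: 分数"形式的文本行(与 Olmes 输出文件的具体格式无关,仅为示意):

summary = """bbh_boolean_expressions:cot-v1::tulu: 0.928
bbh_dyck_languages:cot-v1::tulu: 0.14
bbh_word_sorting:cot-v1::tulu: 0.296"""  # 此处仅节选三行,实际可粘贴完整汇总

scores = {}
for line in summary.strip().splitlines():
    # 以最后一个冒号切分:前半是任务名,后半是分数
    name, value = line.rsplit(":", 1)
    scores[name.strip()] = float(value)

# 按分数升序输出,最前面的就是最薄弱的任务
for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{value:.3f}  {name}")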

4. BBH 数据集的格式

在 Olmes 评测框架中,每条 BBH 样本会被组织成如下的 JSON 请求记录,包含任务说明、带 CoT 推理过程的 few-shot 示例、目标问题以及标准答案。以下是 bbh_word_sorting 任务的一条典型记录:

{
  "request_type": "generate_until",
  "doc": {
    "index": 0,
    "input": "Sort the following words alphabetically: List: syndrome therefrom",
    "query": "Q: Sort the following words alphabetically: List: syndrome therefrom\nA: Let's think step by step.",
    "solution": "syndrome therefrom",
    "answer": "syndrome therefrom"
  },
  "request": {
    "context": {
      "messages": [
        {
          "role": "user",
          "content": "Sort a list of words.\n\nQ: Sort the following words alphabetically: List: oven costume counterpart\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"oven\": \"o\" (15). \"costume\": \"c\" (3). \"counterpart\": \"c\" (3). We now have: (3) [\"costume\" ? \"counterpart\"] < (15) \"oven\". Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their second letters.\nThe second letter: \"costume\": \"o\" (15). \"counterpart\": \"o\" (15). We now have: (15) [\"costume\" ? \"counterpart\"]. Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their third letters.\nThe third letter: \"costume\": \"s\" (19). \"counterpart\": \"u\" (21). We now have: (19) \"costume\" < (21) \"counterpart\". Hence, we have [\"costume\" < \"counterpart\"] < \"oven\". So the answer is costume counterpart oven."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: hypochlorite ponderosa phone credulity\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"hypochlorite\": \"h\" (8). \"ponderosa\": \"p\" (16). \"phone\": \"p\" (16). \"credulity\": \"c\" (3). We now have: (3) \"credulity\" < (8) \"hypochlorite\" < (16) [\"ponderosa\" ? \"phone\"]. Now let's sort this subpart [\"ponderosa\" ? \"phone\"] by looking at their second letters.\nThe second letter: \"ponderosa\": \"o\" (15). \"phone\": \"h\" (8). We now have: (8) \"phone\" < (15) \"ponderosa\". Hence, we have \"credulity\" < \"hypochlorite\" < [\"phone\" < \"ponderosa\"]. So the answer is credulity hypochlorite phone ponderosa."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: newt arson parthia seismography mugho aspect census\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"newt\": \"n\" (14). \"arson\": \"a\" (1). \"parthia\": \"p\" (16). \"seismography\": \"s\" (19). \"mugho\": \"m\" (13). \"aspect\": \"a\" (1). \"census\": \"c\" (3). We now have: (1) [\"arson\" ? \"aspect\"] < (3) \"census\" < (13) \"mugho\" < (14) \"newt\" < (16) \"parthia\" < (19) \"seismography\". Now let's sort this subpart [\"arson\" ? \"aspect\"] by looking at their second letters.\nThe second letter: \"arson\": \"r\" (18). \"aspect\": \"s\" (19). We now have: (18) \"arson\" < (19) \"aspect\". Hence, we have [\"arson\" < \"aspect\"] < \"census\" < \"mugho\" < \"newt\" < \"parthia\" < \"seismography\". So the answer is arson aspect census mugho newt parthia seismography."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: syndrome therefrom\nA: Let's think step by step."
        }
      ],
      "assistant_prefix": ""
    },
    "stop_sequences": [],
    "generation_kwargs": {
      "max_gen_toks": 512,
      "temperature": 0.0,
      "do_sample": false
    }
  },
  "idx": 0,
  "task_name": "bbh_word_sorting",
  "doc_id": 0,
  "native_id": 0,
  "label": "syndrome therefrom"
}
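
下面的 Python 片段演示如何读取这样一条记录并做最简单的答案比对:取出 few-shot 对话与标签,再从生成文本中抽取 "So the answer is" 之后的内容做精确匹配(字段名取自上面的示例,文件名为假设;Olmes 实际的抽取与打分逻辑可能更复杂):

import json

# 假设:已把上面的示例记录保存为本地文件(文件名仅为示意)
with open("bbh_word_sorting_record.json", "r", encoding="utf-8") as f:
    record = json.load(f)

messages = record["request"]["context"]["messages"]  # few-shot 对话 + 目标问题
label = record["label"]                               # 标准答案:"syndrome therefrom"
print(messages[-1]["content"])                        # 待模型作答的最后一条 user 消息

def extract_answer(generation: str) -> str:
    """从 CoT 生成结果中取出 "So the answer is" 之后的内容(示意)。"""
    marker = "So the answer is"
    if marker in generation:
        return generation.split(marker, 1)[-1].strip().rstrip(".")
    return generation.strip()

generation = "... So the answer is syndrome therefrom."
print(extract_answer(generation) == label)  # True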

5. BBH 数据集带来的挑战

  • 逻辑推理:如 Logical Deduction 和 Tracking Shuffled Objects,对模型的多步骤逻辑能力和记忆更新能力提出了挑战。
  • 复杂语言理解:如 Hyperbaton 和 Formal Fallacies,要求模型能够准确解析非标准句法和修辞手法。
  • 注意力控制与记忆:如 Object Counting 和 Navigate,模型需要精准执行任务并维持注意力。

6. 未来工作与改进方向

  • 模型结构优化:增强模型的多步骤推理能力。
  • 数据增强:针对特定任务生成更多训练样例,提升泛化性。
  • 动态记忆机制:改善模型的注意力和记忆管理,解决 Dyck Languages 等任务的挑战。

结语

BBH 数据集为评估大语言模型的极限推理能力提供了严谨的标准。Tulu-CoT 的测试结果(总体分数约 0.693)表明,模型在 Web of Lies(1.0)、Hyperbaton(0.944)、Boolean Expressions(0.928)等任务上表现突出,但在 Dyck Languages(0.14)、Word Sorting(0.296)等需要精细符号操作的任务上仍有明显差距,需要持续优化以应对更高难度的挑战。

英文版

Big Bench Hard (BBH) Dataset: A Comprehensive Analysis – Principles, Tasks, and Evaluation Practices

Paper: https://arxiv.org/pdf/2210.09261

Big Bench Hard (BBH) is a recently introduced benchmark dataset primarily designed to assess the reasoning and logical abilities of large language models (LLMs). BBH aims to address the generalization issues of large models when confronted with complex reasoning tasks, providing important evaluation standards for further research.


1. Background and Significance of BBH Dataset

Motivation:

While existing large language models such as GPT-3 and InstructGPT perform well on many natural language processing tasks, they still fall short on tasks that involve multi-step logical reasoning, abstract thinking, and non-intuitive problems. The BBH dataset selects 23 tasks from BIG-Bench on which prior model evaluations did not outperform the average human rater, specifically designed to challenge a model's core abilities in logical reasoning, attention control, and memory.

Nature of the Dataset:
  • Few-Shot Prompting: The BBH dataset focuses on few-shot learning scenarios, where models are tasked with reasoning and answering based on very few examples.
  • CoT (Chain-of-Thought): By guiding the model to reason step-by-step, CoT improves task performance, though it still cannot fully match human-level reasoning.
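Below is a minimal Python sketch of how a few-shot + CoT prompt is organized at the message level (the example question and the "Let's think step by step." phrasing come from the data format in Section 4; the function and variable names are illustrative only, not the actual Olmes API):

few_shot_examples = [
    {
        "question": "Sort the following words alphabetically: List: oven costume counterpart",
        # Intermediate reasoning omitted; see the assistant messages in the Section 4 JSON record
        "cot_answer": "... So the answer is costume counterpart oven.",
    },
]

def build_messages(task_instruction, examples, question):
    """Assemble the task instruction, few-shot examples, and target question into chat messages (sketch)."""
    messages = []
    for i, ex in enumerate(examples):
        # The task instruction is only prepended to the first example, matching the Section 4 record
        prefix = (task_instruction + "\n\n") if i == 0 else ""
        messages.append({"role": "user",
                         "content": f"{prefix}Q: {ex['question']}\nA: Let's think step by step."})
        messages.append({"role": "assistant", "content": ex["cot_answer"]})
    # Finally append the target question for the model to answer
    messages.append({"role": "user",
                     "content": f"Q: {question}\nA: Let's think step by step."})
    return messages

msgs = build_messages(
    "Sort a list of words.",
    few_shot_examples,
    "Sort the following words alphabetically: List: syndrome therefrom",
)
print(msgs[-1]["content"])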

2. Main Tasks in the BBH Dataset

The BBH dataset includes 23 challenging tasks, summarized with task names, goals, and challenges as follows:

| Task Name | Goal Description | Challenges |
| --- | --- | --- |
| Boolean Expressions | Evaluate the truth value of a Boolean expression. | Correct execution of multi-level logic operations. |
| Causal Judgement | Determine causality based on the given premises. | Correctly distinguishing correlation from causation. |
| Date Understanding | Understand and compute date-related queries. | Complex logic for date conversions and calculations. |
| Dyck Languages | Correctly complete the missing closing brackets of an unfinished bracket sequence so that it is balanced. | Attention control for deeply nested structures (see the sketch after the table). |
| Formal Fallacies | Distinguish deductively valid arguments from formal fallacies. | Recognizing logical errors in complex contexts. |
| Logical Deduction | Deduce the ordering of multiple objects from a set of constraints (three-, five-, and seven-object subtasks). | Accuracy in multi-step logical reasoning. |
| Navigate | Given a series of navigation instructions, determine whether the agent ends up back at the starting point. | Understanding instructions and executing path planning. |
| Object Counting | Correctly count objects in a set. | Precise attention control and memory capabilities. |
| Tracking Shuffled Objects | Track objects whose positions are repeatedly swapped. | Continuous memory tracking and updating. |
| Word Sorting | Sort a set of words in dictionary order. | Correct implementation of sorting rules. |
| Hyperbaton | Judge whether the adjectives in an English sentence are in the conventional order (hyperbaton refers to inverted word order as a rhetorical device). | Complex language understanding and word-order judgment. |
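
Taking Dyck Languages as an example, correctly closing the brackets amounts to maintaining a stack of unclosed opening brackets. Here is a minimal Python sketch of a reference solution (it only illustrates the nested-structure tracking the task probes; it is not the BBH evaluation code):

def complete_dyck(prefix: str) -> str:
    """Return the closing brackets needed to balance an unfinished Dyck-word prefix."""
    close_of = {"(": ")", "[": "]", "{": "}", "<": ">"}
    stack = []
    for ch in prefix:
        if ch in close_of:                        # push opening brackets
            stack.append(ch)
        elif stack and ch == close_of[stack[-1]]:
            stack.pop()                           # pop on a matching closing bracket
    return "".join(close_of[ch] for ch in reversed(stack))

print(complete_dyck("([{<"))   # expected: >}])
print(complete_dyck("([])<"))  # expected: >

A human can solve this mechanically with a stack, whereas a language model has to maintain that state implicitly in its attention, which is one reason this task scores so low (0.14, see the next section).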

3. My BBH Task Evaluation Results: Tulu-CoT

I evaluated the BBH dataset with the Olmes framework (https://github.com/allenai/olmes) and the Tulu model (Llama-3.1-Tulu-3-8B-SFT). The specific configuration for running the evaluation is as follows:

# Local model snapshot path (Llama-3.1-Tulu-3-8B-SFT)
MODEL_NAME=.cache/huggingface/hub/models--allenai--Llama-3.1-Tulu-3-8B-SFT/snapshots/6371bc13f9f3618b340524b14981f64386f613e0

# BBH CoT task name, output directory, and batch size
TASK_NAME_03=bbh:cot-v1::tulu
OUTPUT_DIR_03=my-eval-bbh:cot-v1::tulu
BATCH_SIZE=4

# Run the olmes evaluation through the proxychains4 proxy
proxychains4 olmes \
    --model $MODEL_NAME \
    --task $TASK_NAME_03 \
    --batch-size $BATCH_SIZE \
    --output-dir $OUTPUT_DIR_03

The final evaluation results are as follows:

 INFO     [run_eval.py:668] Summary of primary scores: 
bbh:cot-v1::tulu: 0.692674
bbh_boolean_expressions:cot-v1::tulu: 0.928
bbh_causal_judgement:cot-v1::tulu: 0.57754
bbh_date_understanding:cot-v1::tulu: 0.82
bbh_disambiguation_qa:cot-v1::tulu: 0.628
bbh_dyck_languages:cot-v1::tulu: 0.14
bbh_formal_fallacies:cot-v1::tulu: 0.528
bbh_geometric_shapes:cot-v1::tulu: 0.532
bbh_hyperbaton:cot-v1::tulu: 0.944
bbh_logical_deduction_five_objects:cot-v1::tulu: 0.484
bbh_logical_deduction_seven_objects:cot-v1::tulu: 0.388
bbh_logical_deduction_three_objects:cot-v1::tulu: 0.852
bbh_movie_recommendation:cot-v1::tulu: 0.824
bbh_multistep_arithmetic_two:cot-v1::tulu: 0.64
bbh_navigate:cot-v1::tulu: 0.896
bbh_object_counting:cot-v1::tulu: 0.86
bbh_penguins_in_a_table:cot-v1::tulu: 0.767123
bbh_reasoning_about_colored_objects:cot-v1::tulu: 0.784
bbh_ruin_names:cot-v1::tulu: 0.74
bbh_salient_translation_error_detection:cot-v1::tulu: 0.536
bbh_snarks:cot-v1::tulu: 0.629213
bbh_sports_understanding:cot-v1::tulu: 0.928
bbh_temporal_sequences:cot-v1::tulu: 0.832
bbh_tracking_shuffled_objects_five_objects:cot-v1::tulu: 0.692
bbh_tracking_shuffled_objects_seven_objects:cot-v1::tulu: 0.688
bbh_tracking_shuffled_objects_three_objects:cot-v1::tulu: 0.752
bbh_web_of_lies:cot-v1::tulu: 1.0
bbh_word_sorting:cot-v1::tulu: 0.296
2024-12-18:11:31:37,365 INFO     [run_eval.py:674] Saving final metrics in my-eval-bbh:cot-v1::tulu/metrics-all.jsonl...
2024-12-18:11:31:37,366 INFO     [run_eval.py:686] Saving beaker metrics.json so in my-eval-bbh:cot-v1::tulu/metrics.json...
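
To quickly locate the weakest tasks, the score summary in the log above can be parsed into a dictionary and sorted. The small script below parses text lines of the form "task_name: score" directly (it is independent of the exact Olmes output-file format and is for illustration only):

summary = """bbh_boolean_expressions:cot-v1::tulu: 0.928
bbh_dyck_languages:cot-v1::tulu: 0.14
bbh_word_sorting:cot-v1::tulu: 0.296"""  # only three lines excerpted; paste the full summary in practice

scores = {}
for line in summary.strip().splitlines():
    # Split on the last colon: task name on the left, score on the right
    name, value = line.rsplit(":", 1)
    scores[name.strip()] = float(value)

# Print in ascending order of score; the weakest tasks come first
for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{value:.3f}  {name}")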

4. BBH Dataset Format

In the Olmes framework, each BBH sample is represented as a JSON request record like the one below, containing the task description, few-shot examples with CoT reasoning, the target question, and the gold answer. Here is a typical record from the bbh_word_sorting task:

{
  "request_type": "generate_until",
  "doc": {
    "index": 0,
    "input": "Sort the following words alphabetically: List: syndrome therefrom",
    "query": "Q: Sort the following words alphabetically: List: syndrome therefrom\nA: Let's think step by step.",
    "solution": "syndrome therefrom",
    "answer": "syndrome therefrom"
  },
  "request": {
    "context": {
      "messages": [
        {
          "role": "user",
          "content": "Sort a list of words.\n\nQ: Sort the following words alphabetically: List: oven costume counterpart\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"oven\": \"o\" (15). \"costume\": \"c\" (3). \"counterpart\": \"c\" (3). We now have: (3) [\"costume\" ? \"counterpart\"] < (15) \"oven\". Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their second letters.\nThe second letter: \"costume\": \"o\" (15). \"counterpart\": \"o\" (15). We now have: (15) [\"costume\" ? \"counterpart\"]. Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their third letters.\nThe third letter: \"costume\": \"s\" (19). \"counterpart\": \"u\" (21). We now have: (19) \"costume\" < (21) \"counterpart\". Hence, we have [\"costume\" < \"counterpart\"] < \"oven\". So the answer is costume counterpart oven."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: hypochlorite ponderosa phone credulity\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"hypochlorite\": \"h\" (8). \"ponderosa\": \"p\" (16). \"phone\": \"p\" (16). \"credulity\": \"c\" (3). We now have: (3) \"credulity\" < (8) \"hypochlorite\" < (16) [\"ponderosa\" ? \"phone\"]. Now let's sort this subpart [\"ponderosa\" ? \"phone\"] by looking at their second letters.\nThe second letter: \"ponderosa\": \"o\" (15). \"phone\": \"h\" (8). We now have: (8) \"phone\" < (15) \"ponderosa\". Hence, we have \"credulity\" < \"hypochlorite\" < [\"phone\" < \"ponderosa\"]. So the answer is credulity hypochlorite phone ponderosa."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: newt arson parthia seismography mugho aspect census\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"newt\": \"n\" (14). \"arson\": \"a\" (1). \"parthia\": \"p\" (16). \"seismography\": \"s\" (19). \"mugho\": \"m\" (13). \"aspect\": \"a\" (1). \"census\": \"c\" (3). We now have: (1) [\"arson\" ? \"aspect\"] < (3) \"census\" < (13) \"mugho\" < (14) \"newt\" < (16) \"parthia\" < (19) \"seismography\". Now let's sort this subpart [\"arson\" ? \"aspect\"] by looking at their second letters.\nThe second letter: \"arson\": \"r\" (18). \"aspect\": \"s\" (19). We now have: (18) \"arson\" < (19) \"aspect\". Hence, we have [\"arson\" < \"aspect\"] < \"census\" < \"mugho\" < \"newt\" < \"parthia\" < \"seismography\". So the answer is arson aspect census mugho newt parthia seismography."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: syndrome therefrom\nA: Let's think step by step."
        }
      ],
      "assistant_prefix": ""
    },
    "stop_sequences": [],
    "generation_kwargs": {
      "max_gen_toks": 512,
      "temperature": 0.0,
      "do_sample": false
    }
  },
  "idx": 0,
  "task_name": "bbh_word_sorting",
  "doc_id": 0,
  "native_id": 0,
  "label": "syndrome therefrom"
}
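
The Python snippet below shows how such a record can be loaded and scored in the simplest possible way: extract the few-shot messages and the label, then take the text after "So the answer is" from the generation and compare it with the label by exact match (field names come from the example above, the file name is an assumption, and the actual Olmes extraction and scoring logic may be more elaborate):

import json

# Assumption: the example record above has been saved to a local file (name is illustrative only)
with open("bbh_word_sorting_record.json", "r", encoding="utf-8") as f:
    record = json.load(f)

messages = record["request"]["context"]["messages"]  # few-shot conversation + target question
label = record["label"]                               # gold answer: "syndrome therefrom"
print(messages[-1]["content"])                        # last user message the model must answer

def extract_answer(generation: str) -> str:
    """Extract the text after "So the answer is" from a CoT generation (sketch)."""
    marker = "So the answer is"
    if marker in generation:
        return generation.split(marker, 1)[-1].strip().rstrip(".")
    return generation.strip()

generation = "... So the answer is syndrome therefrom."
print(extract_answer(generation) == label)  # True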

5. Challenges Posed by the BBH Dataset

  • Logical Reasoning: Tasks such as Logical Deduction and Tracking Shuffled Objects challenge the model’s ability to handle multi-step logical reasoning and memory updates.
  • Complex Language Understanding: Tasks like Hyperbaton and Formal Fallacies require the model to accurately parse non-standard syntax and rhetorical devices.
  • Attention Control and Memory: Tasks such as Object Counting and Navigate demand precise task execution while maintaining attention.

6. Future Work and Directions for Improvement

  • Model Architecture Optimization: Enhance models’ ability to perform multi-step reasoning.
  • Data Augmentation: Generate more training examples for specific tasks to improve generalization.
  • Dynamic Memory Mechanisms: Improve the model’s attention and memory management to address challenges in tasks like Dyck Languages.

Conclusion

The BBH dataset provides a rigorous benchmark for evaluating the reasoning capabilities of large language models. The Tulu-CoT results (overall primary score of about 0.693) show strong performance on tasks such as Web of Lies (1.0), Hyperbaton (0.944), and Boolean Expressions (0.928), but a clear gap remains on tasks requiring fine-grained symbolic manipulation, such as Dyck Languages (0.14) and Word Sorting (0.296), so further optimization is needed to meet these harder challenges.

后记

2024年12月18日20点26分于上海,在 GPT-4o 大模型辅助下完成。
