逻辑推理数据集 Big Bench Hard (BBH) 介绍:中英双语

Big Bench Hard (BBH) 数据集全解析:原理、任务与实践评测

Paper: https://arxiv.org/pdf/2210.09261

Big Bench Hard (BBH) 是近年来引入的一组基准测试数据集,主要用于评估大型语言模型(LLM)的推理和逻辑能力。BBH 旨在解决当前大模型在复杂推理任务中的泛化能力问题,为进一步研究提供重要的测评标准。


1. BBH 数据集背景与意义

提出原因:

尽管现有的大语言模型如 GPT-3、InstructGPT 等在自然语言处理任务中表现出色,但它们在一些涉及多步逻辑推理、抽象思维和非直观问题的任务上仍存在明显不足。BBH 从 BIG-Bench 中挑选出 23 个此前模型表现未能超过平均人类评分者的任务,专门用于挑战模型的逻辑推理、注意力控制、记忆等核心能力。

数据集性质:
  • Few-Shot Prompting:BBH 数据集专注于少样本学习(few-shot learning)场景,即模型在极少量示例下进行推理和回答。
  • CoT(Chain-of-Thought):通过引导模型逐步推理,CoT 提升了任务性能,但仍不能完全匹配人类水平。
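下面给出一个极简的 Python 示意,展示 few-shot + CoT 提示在消息层面是如何组织的(示例问题与 "Let's think step by step." 的写法取自下文第 4 节的数据格式;函数与变量名仅为示意,并非 Olmes 的实际接口):

few_shot_examples = [
    {
        "question": "Sort the following words alphabetically: List: oven costume counterpart",
        # 省略中间推理步骤,完整写法见第 4 节 JSON 记录中的 assistant 消息
        "cot_answer": "... So the answer is costume counterpart oven.",
    },
]

def build_messages(task_instruction, examples, question):
    """把任务说明、few-shot 示例和目标问题组装成 chat 消息列表(仅为示意)。"""
    messages = []
    for i, ex in enumerate(examples):
        # 任务说明只拼在第一个示例前面,与第 4 节的记录格式一致
        prefix = (task_instruction + "\n\n") if i == 0 else ""
        messages.append({"role": "user",
                         "content": f"{prefix}Q: {ex['question']}\nA: Let's think step by step."})
        messages.append({"role": "assistant", "content": ex["cot_answer"]})
    # 最后附上待模型作答的目标问题
    messages.append({"role": "user",
                     "content": f"Q: {question}\nA: Let's think step by step."})
    return messages

msgs = build_messages(
    "Sort a list of words.",
    few_shot_examples,
    "Sort the following words alphabetically: List: syndrome therefrom",
)
print(msgs[-1]["content"])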

2. BBH 数据集的主要任务

BBH 数据集共包含 23 个高难度任务,其中部分任务的名称、目标以及挑战点总结如下:

| 任务名称 | 目标描述 | 挑战点 |
| --- | --- | --- |
| Boolean Expressions | 判断布尔表达式的真假。 | 多层次逻辑运算的正确执行。 |
| Causal Judgement | 根据前提判断因果关系。 | 正确区分相关性与因果性。 |
| Date Understanding | 理解和计算日期相关的问题。 | 日期转换和计算的复杂逻辑。 |
| Dyck Languages | 给出未闭合的括号序列,正确补全缺失的右括号使其平衡。 | 多层嵌套结构的注意力控制(参考表后示例)。 |
| Formal Fallacies | 区分演绎有效的论证与形式谬误。 | 复杂语境下的逻辑错误识别。 |
| Logical Deduction | 在多个对象的约束条件下推断其排列顺序(分三对象、五对象、七对象三个子任务)。 | 多步骤逻辑推理的准确性。 |
| Navigate | 根据一系列移动指令,判断是否回到出发点。 | 理解指令并正确执行路径规划。 |
| Object Counting | 在一组物体中正确数数。 | 精确的注意力控制和记忆能力。 |
| Tracking Shuffled Objects | 在一系列交换操作后跟踪被打乱顺序的物体。 | 持续的记忆跟踪和更新能力。 |
| Word Sorting | 将一组单词按照字典顺序排序。 | 文字排序规则的执行。 |
| Hyperbaton | 判断英语句子中多个形容词的排列顺序是否符合习惯(hyperbaton 指反常语序的修辞手法)。 | 复杂语言理解与语序判断。 |
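
以 Dyck Languages 为例,正确补全括号本质上等价于维护一个记录未闭合左括号的栈。下面是一个参考解法的最小 Python 示意(仅用于说明任务考查的嵌套结构跟踪能力,与 BBH 的评测代码无关):

def complete_dyck(prefix: str) -> str:
    """给定未闭合的括号前缀,返回使其平衡所需补全的右括号序列。"""
    close_of = {"(": ")", "[": "]", "{": "}", "<": ">"}
    stack = []
    for ch in prefix:
        if ch in close_of:                        # 左括号入栈
            stack.append(ch)
        elif stack and ch == close_of[stack[-1]]:
            stack.pop()                           # 与栈顶匹配的右括号出栈
    return "".join(close_of[ch] for ch in reversed(stack))

print(complete_dyck("([{<"))   # 期望输出:>}])
print(complete_dyck("([])<"))  # 期望输出:>

人用栈可以机械地解决这类问题,而语言模型必须在注意力中隐式维护这一状态,这也是该任务得分很低(0.14,见下节)的原因之一。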

3. 我的 BBH 任务测试结果:Tulu-CoT

我使用 Olmes 框架(https://github.com/allenai/olmes)对 Tulu 模型(Llama-3.1-Tulu-3-8B-SFT)在 BBH 数据集上进行了测试,具体运行配置如下:

# 本地模型快照路径(Llama-3.1-Tulu-3-8B-SFT)
MODEL_NAME=.cache/huggingface/hub/models--allenai--Llama-3.1-Tulu-3-8B-SFT/snapshots/6371bc13f9f3618b340524b14981f64386f613e0

# BBH 的 CoT 评测任务名、输出目录与批大小
TASK_NAME_03=bbh:cot-v1::tulu
OUTPUT_DIR_03=my-eval-bbh:cot-v1::tulu
BATCH_SIZE=4

# 通过 proxychains4 代理运行 olmes 评测
proxychains4 olmes \
    --model $MODEL_NAME \
    --task $TASK_NAME_03 \
    --batch-size $BATCH_SIZE \
    --output-dir $OUTPUT_DIR_03

最终得到的评测结果如下:

 INFO     [run_eval.py:668] Summary of primary scores: 
bbh:cot-v1::tulu: 0.692674
bbh_boolean_expressions:cot-v1::tulu: 0.928
bbh_causal_judgement:cot-v1::tulu: 0.57754
bbh_date_understanding:cot-v1::tulu: 0.82
bbh_disambiguation_qa:cot-v1::tulu: 0.628
bbh_dyck_languages:cot-v1::tulu: 0.14
bbh_formal_fallacies:cot-v1::tulu: 0.528
bbh_geometric_shapes:cot-v1::tulu: 0.532
bbh_hyperbaton:cot-v1::tulu: 0.944
bbh_logical_deduction_five_objects:cot-v1::tulu: 0.484
bbh_logical_deduction_seven_objects:cot-v1::tulu: 0.388
bbh_logical_deduction_three_objects:cot-v1::tulu: 0.852
bbh_movie_recommendation:cot-v1::tulu: 0.824
bbh_multistep_arithmetic_two:cot-v1::tulu: 0.64
bbh_navigate:cot-v1::tulu: 0.896
bbh_object_counting:cot-v1::tulu: 0.86
bbh_penguins_in_a_table:cot-v1::tulu: 0.767123
bbh_reasoning_about_colored_objects:cot-v1::tulu: 0.784
bbh_ruin_names:cot-v1::tulu: 0.74
bbh_salient_translation_error_detection:cot-v1::tulu: 0.536
bbh_snarks:cot-v1::tulu: 0.629213
bbh_sports_understanding:cot-v1::tulu: 0.928
bbh_temporal_sequences:cot-v1::tulu: 0.832
bbh_tracking_shuffled_objects_five_objects:cot-v1::tulu: 0.692
bbh_tracking_shuffled_objects_seven_objects:cot-v1::tulu: 0.688
bbh_tracking_shuffled_objects_three_objects:cot-v1::tulu: 0.752
bbh_web_of_lies:cot-v1::tulu: 1.0
bbh_word_sorting:cot-v1::tulu: 0.296
2024-12-18:11:31:37,365 INFO     [run_eval.py:674] Saving final metrics in my-eval-bbh:cot-v1::tulu/metrics-all.jsonl...
2024-12-18:11:31:37,366 INFO     [run_eval.py:686] Saving beaker metrics.json so in my-eval-bbh:cot-v1::tulu/metrics.json...
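
为了快速定位薄弱任务,可以把上面日志中的分数汇总解析成字典并按分数排序。下面的小脚本直接解析"任务名: 分数"形式的文本行(与 Olmes 输出文件的具体格式无关,仅为示意):

summary = """bbh_boolean_expressions:cot-v1::tulu: 0.928
bbh_dyck_languages:cot-v1::tulu: 0.14
bbh_word_sorting:cot-v1::tulu: 0.296"""  # 此处仅节选三行,实际可粘贴完整汇总

scores = {}
for line in summary.strip().splitlines():
    # 以最后一个冒号切分:前半是任务名,后半是分数
    name, value = line.rsplit(":", 1)
    scores[name.strip()] = float(value)

# 按分数升序输出,最前面的就是最薄弱的任务
for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{value:.3f}  {name}")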

4. BBH 数据集的格式

在 Olmes 评测框架中,每条 BBH 样本会被组织成如下的 JSON 请求记录,包含任务说明、带 CoT 推理过程的 few-shot 示例、目标问题以及标准答案。以下是 bbh_word_sorting 任务的一条典型记录:

{
  "request_type": "generate_until",
  "doc": {
    "index": 0,
    "input": "Sort the following words alphabetically: List: syndrome therefrom",
    "query": "Q: Sort the following words alphabetically: List: syndrome therefrom\nA: Let's think step by step.",
    "solution": "syndrome therefrom",
    "answer": "syndrome therefrom"
  },
  "request": {
    "context": {
      "messages": [
        {
          "role": "user",
          "content": "Sort a list of words.\n\nQ: Sort the following words alphabetically: List: oven costume counterpart\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"oven\": \"o\" (15). \"costume\": \"c\" (3). \"counterpart\": \"c\" (3). We now have: (3) [\"costume\" ? \"counterpart\"] < (15) \"oven\". Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their second letters.\nThe second letter: \"costume\": \"o\" (15). \"counterpart\": \"o\" (15). We now have: (15) [\"costume\" ? \"counterpart\"]. Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their third letters.\nThe third letter: \"costume\": \"s\" (19). \"counterpart\": \"u\" (21). We now have: (19) \"costume\" < (21) \"counterpart\". Hence, we have [\"costume\" < \"counterpart\"] < \"oven\". So the answer is costume counterpart oven."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: hypochlorite ponderosa phone credulity\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"hypochlorite\": \"h\" (8). \"ponderosa\": \"p\" (16). \"phone\": \"p\" (16). \"credulity\": \"c\" (3). We now have: (3) \"credulity\" < (8) \"hypochlorite\" < (16) [\"ponderosa\" ? \"phone\"]. Now let's sort this subpart [\"ponderosa\" ? \"phone\"] by looking at their second letters.\nThe second letter: \"ponderosa\": \"o\" (15). \"phone\": \"h\" (8). We now have: (8) \"phone\" < (15) \"ponderosa\". Hence, we have \"credulity\" < \"hypochlorite\" < [\"phone\" < \"ponderosa\"]. So the answer is credulity hypochlorite phone ponderosa."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: newt arson parthia seismography mugho aspect census\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"newt\": \"n\" (14). \"arson\": \"a\" (1). \"parthia\": \"p\" (16). \"seismography\": \"s\" (19). \"mugho\": \"m\" (13). \"aspect\": \"a\" (1). \"census\": \"c\" (3). We now have: (1) [\"arson\" ? \"aspect\"] < (3) \"census\" < (13) \"mugho\" < (14) \"newt\" < (16) \"parthia\" < (19) \"seismography\". Now let's sort this subpart [\"arson\" ? \"aspect\"] by looking at their second letters.\nThe second letter: \"arson\": \"r\" (18). \"aspect\": \"s\" (19). We now have: (18) \"arson\" < (19) \"aspect\". Hence, we have [\"arson\" < \"aspect\"] < \"census\" < \"mugho\" < \"newt\" < \"parthia\" < \"seismography\". So the answer is arson aspect census mugho newt parthia seismography."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: syndrome therefrom\nA: Let's think step by step."
        }
      ],
      "assistant_prefix": ""
    },
    "stop_sequences": [],
    "generation_kwargs": {
      "max_gen_toks": 512,
      "temperature": 0.0,
      "do_sample": false
    }
  },
  "idx": 0,
  "task_name": "bbh_word_sorting",
  "doc_id": 0,
  "native_id": 0,
  "label": "syndrome therefrom"
}
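
下面的 Python 片段演示如何读取这样一条记录并做最简单的答案比对:取出 few-shot 对话与标签,再从生成文本中抽取 "So the answer is" 之后的内容做精确匹配(字段名取自上面的示例,文件名为假设;Olmes 实际的抽取与打分逻辑可能更复杂):

import json

# 假设:已把上面的示例记录保存为本地文件(文件名仅为示意)
with open("bbh_word_sorting_record.json", "r", encoding="utf-8") as f:
    record = json.load(f)

messages = record["request"]["context"]["messages"]  # few-shot 对话 + 目标问题
label = record["label"]                               # 标准答案:"syndrome therefrom"
print(messages[-1]["content"])                        # 待模型作答的最后一条 user 消息

def extract_answer(generation: str) -> str:
    """从 CoT 生成结果中取出 "So the answer is" 之后的内容(示意)。"""
    marker = "So the answer is"
    if marker in generation:
        return generation.split(marker, 1)[-1].strip().rstrip(".")
    return generation.strip()

generation = "... So the answer is syndrome therefrom."
print(extract_answer(generation) == label)  # True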

5. BBH 数据集带来的挑战

  • 逻辑推理:如 Logical Deduction 和 Tracking Shuffled Objects,对模型的多步骤逻辑能力和记忆更新能力提出了挑战。
  • 复杂语言理解:如 Hyperbaton 和 Formal Fallacies,要求模型能够准确解析非标准句法和修辞手法。
  • 注意力控制与记忆:如 Object Counting 和 Navigate,模型需要精准执行任务并维持注意力。

6. 未来工作与改进方向

  • 模型结构优化:增强模型的多步骤推理能力。
  • 数据增强:针对特定任务生成更多训练样例,提升泛化性。
  • 动态记忆机制:改善模型的注意力和记忆管理,解决 Dyck Languages 等任务的挑战。

结语

BBH 数据集为评估大语言模型的极限推理能力提供了严谨的标准。Tulu-CoT 的测试结果(总体分数约 0.693)表明,模型在 Web of Lies(1.0)、Hyperbaton(0.944)、Boolean Expressions(0.928)等任务上表现突出,但在 Dyck Languages(0.14)、Word Sorting(0.296)等需要精细符号操作的任务上仍有明显差距,需要持续优化以应对更高难度的挑战。

英文版

Big Bench Hard (BBH) Dataset: A Comprehensive Analysis – Principles, Tasks, and Evaluation Practices

Paper: https://arxiv.org/pdf/2210.09261

Big Bench Hard (BBH) is a recently introduced benchmark dataset primarily designed to assess the reasoning and logical abilities of large language models (LLMs). BBH aims to address the generalization issues of large models when confronted with complex reasoning tasks, providing important evaluation standards for further research.


1. Background and Significance of BBH Dataset

Motivation:

While existing large language models such as GPT-3 and InstructGPT perform well on many natural language processing tasks, they still fall short on tasks that involve multi-step logical reasoning, abstract thinking, and non-intuitive problems. The BBH dataset selects 23 tasks from BIG-Bench on which prior model evaluations did not outperform the average human rater, specifically designed to challenge a model's core abilities in logical reasoning, attention control, and memory.

Nature of the Dataset:
  • Few-Shot Prompting: The BBH dataset focuses on few-shot learning scenarios, where models are tasked with reasoning and answering based on very few examples.
  • CoT (Chain-of-Thought): By guiding the model to reason step-by-step, CoT improves task performance, though it still cannot fully match human-level reasoning.
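Below is a minimal Python sketch of how a few-shot + CoT prompt is organized at the message level (the example question and the "Let's think step by step." phrasing come from the data format in Section 4; the function and variable names are illustrative only, not the actual Olmes API):

few_shot_examples = [
    {
        "question": "Sort the following words alphabetically: List: oven costume counterpart",
        # Intermediate reasoning omitted; see the assistant messages in the Section 4 JSON record
        "cot_answer": "... So the answer is costume counterpart oven.",
    },
]

def build_messages(task_instruction, examples, question):
    """Assemble the task instruction, few-shot examples, and target question into chat messages (sketch)."""
    messages = []
    for i, ex in enumerate(examples):
        # The task instruction is only prepended to the first example, matching the Section 4 record
        prefix = (task_instruction + "\n\n") if i == 0 else ""
        messages.append({"role": "user",
                         "content": f"{prefix}Q: {ex['question']}\nA: Let's think step by step."})
        messages.append({"role": "assistant", "content": ex["cot_answer"]})
    # Finally append the target question for the model to answer
    messages.append({"role": "user",
                     "content": f"Q: {question}\nA: Let's think step by step."})
    return messages

msgs = build_messages(
    "Sort a list of words.",
    few_shot_examples,
    "Sort the following words alphabetically: List: syndrome therefrom",
)
print(msgs[-1]["content"])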

2. Main Tasks in the BBH Dataset

The BBH dataset includes 23 challenging tasks, summarized with task names, goals, and challenges as follows:

| Task Name | Goal Description | Challenges |
| --- | --- | --- |
| Boolean Expressions | Evaluate the truth value of a Boolean expression. | Correct execution of multi-level logic operations. |
| Causal Judgement | Determine causality based on the given premises. | Correctly distinguishing correlation from causation. |
| Date Understanding | Understand and compute date-related queries. | Complex logic for date conversions and calculations. |
| Dyck Languages | Correctly complete the missing closing brackets of an unfinished bracket sequence so that it is balanced. | Attention control for deeply nested structures (see the sketch after the table). |
| Formal Fallacies | Distinguish deductively valid arguments from formal fallacies. | Recognizing logical errors in complex contexts. |
| Logical Deduction | Deduce the ordering of multiple objects from a set of constraints (three-, five-, and seven-object subtasks). | Accuracy in multi-step logical reasoning. |
| Navigate | Given a series of navigation instructions, determine whether the agent ends up back at the starting point. | Understanding instructions and executing path planning. |
| Object Counting | Correctly count objects in a set. | Precise attention control and memory capabilities. |
| Tracking Shuffled Objects | Track objects whose positions are repeatedly swapped. | Continuous memory tracking and updating. |
| Word Sorting | Sort a set of words in dictionary order. | Correct implementation of sorting rules. |
| Hyperbaton | Judge whether the adjectives in an English sentence are in the conventional order (hyperbaton refers to inverted word order as a rhetorical device). | Complex language understanding and word-order judgment. |
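
Taking Dyck Languages as an example, correctly closing the brackets amounts to maintaining a stack of unclosed opening brackets. Here is a minimal Python sketch of a reference solution (it only illustrates the nested-structure tracking the task probes; it is not the BBH evaluation code):

def complete_dyck(prefix: str) -> str:
    """Return the closing brackets needed to balance an unfinished Dyck-word prefix."""
    close_of = {"(": ")", "[": "]", "{": "}", "<": ">"}
    stack = []
    for ch in prefix:
        if ch in close_of:                        # push opening brackets
            stack.append(ch)
        elif stack and ch == close_of[stack[-1]]:
            stack.pop()                           # pop on a matching closing bracket
    return "".join(close_of[ch] for ch in reversed(stack))

print(complete_dyck("([{<"))   # expected: >}])
print(complete_dyck("([])<"))  # expected: >

A human can solve this mechanically with a stack, whereas a language model has to maintain that state implicitly in its attention, which is one reason this task scores so low (0.14, see the next section).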

3. My BBH Task Evaluation Results: Tulu-CoT

I evaluated the BBH dataset with the Olmes framework (https://github.com/allenai/olmes) and the Tulu model (Llama-3.1-Tulu-3-8B-SFT). The specific configuration for running the evaluation is as follows:

# Local model snapshot path (Llama-3.1-Tulu-3-8B-SFT)
MODEL_NAME=.cache/huggingface/hub/models--allenai--Llama-3.1-Tulu-3-8B-SFT/snapshots/6371bc13f9f3618b340524b14981f64386f613e0

# BBH CoT task name, output directory, and batch size
TASK_NAME_03=bbh:cot-v1::tulu
OUTPUT_DIR_03=my-eval-bbh:cot-v1::tulu
BATCH_SIZE=4

# Run the olmes evaluation through the proxychains4 proxy
proxychains4 olmes \
    --model $MODEL_NAME \
    --task $TASK_NAME_03 \
    --batch-size $BATCH_SIZE \
    --output-dir $OUTPUT_DIR_03

The final evaluation results are as follows:

 INFO     [run_eval.py:668] Summary of primary scores: 
bbh:cot-v1::tulu: 0.692674
bbh_boolean_expressions:cot-v1::tulu: 0.928
bbh_causal_judgement:cot-v1::tulu: 0.57754
bbh_date_understanding:cot-v1::tulu: 0.82
bbh_disambiguation_qa:cot-v1::tulu: 0.628
bbh_dyck_languages:cot-v1::tulu: 0.14
bbh_formal_fallacies:cot-v1::tulu: 0.528
bbh_geometric_shapes:cot-v1::tulu: 0.532
bbh_hyperbaton:cot-v1::tulu: 0.944
bbh_logical_deduction_five_objects:cot-v1::tulu: 0.484
bbh_logical_deduction_seven_objects:cot-v1::tulu: 0.388
bbh_logical_deduction_three_objects:cot-v1::tulu: 0.852
bbh_movie_recommendation:cot-v1::tulu: 0.824
bbh_multistep_arithmetic_two:cot-v1::tulu: 0.64
bbh_navigate:cot-v1::tulu: 0.896
bbh_object_counting:cot-v1::tulu: 0.86
bbh_penguins_in_a_table:cot-v1::tulu: 0.767123
bbh_reasoning_about_colored_objects:cot-v1::tulu: 0.784
bbh_ruin_names:cot-v1::tulu: 0.74
bbh_salient_translation_error_detection:cot-v1::tulu: 0.536
bbh_snarks:cot-v1::tulu: 0.629213
bbh_sports_understanding:cot-v1::tulu: 0.928
bbh_temporal_sequences:cot-v1::tulu: 0.832
bbh_tracking_shuffled_objects_five_objects:cot-v1::tulu: 0.692
bbh_tracking_shuffled_objects_seven_objects:cot-v1::tulu: 0.688
bbh_tracking_shuffled_objects_three_objects:cot-v1::tulu: 0.752
bbh_web_of_lies:cot-v1::tulu: 1.0
bbh_word_sorting:cot-v1::tulu: 0.296
2024-12-18:11:31:37,365 INFO     [run_eval.py:674] Saving final metrics in my-eval-bbh:cot-v1::tulu/metrics-all.jsonl...
2024-12-18:11:31:37,366 INFO     [run_eval.py:686] Saving beaker metrics.json so in my-eval-bbh:cot-v1::tulu/metrics.json...
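
To quickly locate the weakest tasks, the score summary in the log above can be parsed into a dictionary and sorted. The small script below parses text lines of the form "task_name: score" directly (it is independent of the exact Olmes output-file format and is for illustration only):

summary = """bbh_boolean_expressions:cot-v1::tulu: 0.928
bbh_dyck_languages:cot-v1::tulu: 0.14
bbh_word_sorting:cot-v1::tulu: 0.296"""  # only three lines excerpted; paste the full summary in practice

scores = {}
for line in summary.strip().splitlines():
    # Split on the last colon: task name on the left, score on the right
    name, value = line.rsplit(":", 1)
    scores[name.strip()] = float(value)

# Print in ascending order of score; the weakest tasks come first
for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{value:.3f}  {name}")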

4. BBH Dataset Format

In the Olmes framework, each BBH sample is represented as a JSON request record like the one below, containing the task description, few-shot examples with CoT reasoning, the target question, and the gold answer. Here is a typical record from the bbh_word_sorting task:

{
  "request_type": "generate_until",
  "doc": {
    "index": 0,
    "input": "Sort the following words alphabetically: List: syndrome therefrom",
    "query": "Q: Sort the following words alphabetically: List: syndrome therefrom\nA: Let's think step by step.",
    "solution": "syndrome therefrom",
    "answer": "syndrome therefrom"
  },
  "request": {
    "context": {
      "messages": [
        {
          "role": "user",
          "content": "Sort a list of words.\n\nQ: Sort the following words alphabetically: List: oven costume counterpart\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"oven\": \"o\" (15). \"costume\": \"c\" (3). \"counterpart\": \"c\" (3). We now have: (3) [\"costume\" ? \"counterpart\"] < (15) \"oven\". Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their second letters.\nThe second letter: \"costume\": \"o\" (15). \"counterpart\": \"o\" (15). We now have: (15) [\"costume\" ? \"counterpart\"]. Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their third letters.\nThe third letter: \"costume\": \"s\" (19). \"counterpart\": \"u\" (21). We now have: (19) \"costume\" < (21) \"counterpart\". Hence, we have [\"costume\" < \"counterpart\"] < \"oven\". So the answer is costume counterpart oven."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: hypochlorite ponderosa phone credulity\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"hypochlorite\": \"h\" (8). \"ponderosa\": \"p\" (16). \"phone\": \"p\" (16). \"credulity\": \"c\" (3). We now have: (3) \"credulity\" < (8) \"hypochlorite\" < (16) [\"ponderosa\" ? \"phone\"]. Now let's sort this subpart [\"ponderosa\" ? \"phone\"] by looking at their second letters.\nThe second letter: \"ponderosa\": \"o\" (15). \"phone\": \"h\" (8). We now have: (8) \"phone\" < (15) \"ponderosa\". Hence, we have \"credulity\" < \"hypochlorite\" < [\"phone\" < \"ponderosa\"]. So the answer is credulity hypochlorite phone ponderosa."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: newt arson parthia seismography mugho aspect census\nA: Let's think step by step."
        },
        {
          "role": "assistant",
          "content": "The first letter: \"newt\": \"n\" (14). \"arson\": \"a\" (1). \"parthia\": \"p\" (16). \"seismography\": \"s\" (19). \"mugho\": \"m\" (13). \"aspect\": \"a\" (1). \"census\": \"c\" (3). We now have: (1) [\"arson\" ? \"aspect\"] < (3) \"census\" < (13) \"mugho\" < (14) \"newt\" < (16) \"parthia\" < (19) \"seismography\". Now let's sort this subpart [\"arson\" ? \"aspect\"] by looking at their second letters.\nThe second letter: \"arson\": \"r\" (18). \"aspect\": \"s\" (19). We now have: (18) \"arson\" < (19) \"aspect\". Hence, we have [\"arson\" < \"aspect\"] < \"census\" < \"mugho\" < \"newt\" < \"parthia\" < \"seismography\". So the answer is arson aspect census mugho newt parthia seismography."
        },
        {
          "role": "user",
          "content": "Q: Sort the following words alphabetically: List: syndrome therefrom\nA: Let's think step by step."
        }
      ],
      "assistant_prefix": ""
    },
    "stop_sequences": [],
    "generation_kwargs": {
      "max_gen_toks": 512,
      "temperature": 0.0,
      "do_sample": false
    }
  },
  "idx": 0,
  "task_name": "bbh_word_sorting",
  "doc_id": 0,
  "native_id": 0,
  "label": "syndrome therefrom"
}
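
The Python snippet below shows how such a record can be loaded and scored in the simplest possible way: extract the few-shot messages and the label, then take the text after "So the answer is" from the generation and compare it with the label by exact match (field names come from the example above, the file name is an assumption, and the actual Olmes extraction and scoring logic may be more elaborate):

import json

# Assumption: the example record above has been saved to a local file (name is illustrative only)
with open("bbh_word_sorting_record.json", "r", encoding="utf-8") as f:
    record = json.load(f)

messages = record["request"]["context"]["messages"]  # few-shot conversation + target question
label = record["label"]                               # gold answer: "syndrome therefrom"
print(messages[-1]["content"])                        # last user message the model must answer

def extract_answer(generation: str) -> str:
    """Extract the text after "So the answer is" from a CoT generation (sketch)."""
    marker = "So the answer is"
    if marker in generation:
        return generation.split(marker, 1)[-1].strip().rstrip(".")
    return generation.strip()

generation = "... So the answer is syndrome therefrom."
print(extract_answer(generation) == label)  # True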

5. Challenges Posed by the BBH Dataset

  • Logical Reasoning: Tasks such as Logical Deduction and Tracking Shuffled Objects challenge the model’s ability to handle multi-step logical reasoning and memory updates.
  • Complex Language Understanding: Tasks like Hyperbaton and Formal Fallacies require the model to accurately parse non-standard syntax and rhetorical devices.
  • Attention Control and Memory: Tasks such as Object Counting and Navigate demand precise task execution while maintaining attention.

6. Future Work and Directions for Improvement

  • Model Architecture Optimization: Enhance models’ ability to perform multi-step reasoning.
  • Data Augmentation: Generate more training examples for specific tasks to improve generalization.
  • Dynamic Memory Mechanisms: Improve the model’s attention and memory management to address challenges in tasks like Dyck Languages.

Conclusion

The BBH dataset provides a rigorous benchmark for evaluating the reasoning capabilities of large language models. The Tulu-CoT results (overall primary score of about 0.693) show strong performance on tasks such as Web of Lies (1.0), Hyperbaton (0.944), and Boolean Expressions (0.928), but a clear gap remains on tasks requiring fine-grained symbolic manipulation, such as Dyck Languages (0.14) and Word Sorting (0.296), so further optimization is needed to meet these harder challenges.

后记

2024年12月18日20点26分于上海,在 GPT-4o 大模型辅助下完成。
