Replicating DeepSeek-Prover with Distilabel: A Detailed Walkthrough

Article link: https://blog.csdn.net/weixin_43837507/article/details/145744143

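Background

The paper "DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data" presents a method for generating theorem proofs from informal mathematical problems. By producing Lean 4 proof data at scale, it improves the ability of LLMs to do formal theorem proving. The authors build a synthetic dataset of 8 million formal statements with proofs and fine-tune DeepSeekMath 7B on it. The resulting model reaches 52% proof-generation accuracy on the Lean 4 miniF2F test, ahead of GPT-4 at 23%. On the Lean 4 FIMO benchmark it proves 5 problems, while GPT-4 proves none. These results indicate that large-scale synthetic data can substantially strengthen an LLM's mathematical reasoning and show the potential of synthetic data for theorem proving. Distilabel provides an attempt at reproducing the paper's pipeline.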

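For the paper's models and data, see deepseek-ai/DeepSeek-Prover-V1 on Hugging Face.

Distilabel is a framework for data annotation and processing that supports automated data generation, filtering, and labeling. It can plug in strong LLMs such as Llama 3 to automate tasks like formalizing mathematical theorems, scoring them, and generating proofs, which makes it suitable for building and refining machine-learning datasets. I plan to write a detailed technical post on the Distilabel framework later, so stay tuned.

This post focuses on how Distilabel reproduces the main steps of DeepSeek-Prover by building a pipeline.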

Overview

Process overview

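The paper's authors propose a method for generating Lean 4 proof data from informal mathematical problems. It converts high-school and undergraduate competition problems into formal mathematical statements. The full dataset-generation flow described in the paper is: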

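1. AutoFormalization: use an LLM to convert natural-language math problems into Lean 4 formal statements.

2. Scoring & Filtering: use an LLM to evaluate the generated theorems and filter out low-quality, irrelevant, or redundant statements.

3. Proof Generation: use an LLM to generate formal proofs that Lean 4 can check.

4. Proof Verification & Iteration

- Verify the correctness of the proofs with Lean 4.
- Fine-tune the LLM to improve its proof generation, then generate new theorems and proofs, repeating the cycle.

5. Fine-Tuning & Evaluation

- Fine-tune DeepSeekMath 7B on 8 million formal theorems with proofs.
- Evaluate performance on the Lean 4 miniF2F and FIMO benchmarks.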
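
To make steps 1 and 3 concrete, here is a toy illustration of what a formalized statement and its proof look like in Lean 4. It is not taken from the paper or from Distilabel: the informal problem ("show that a + b = b + a for all natural numbers a and b") and both theorem names are made up for illustration, while `Nat.add_comm` comes from Lean's core library.

```lean
-- Step 1 (autoformalization): only the statement, with the proof left open.
theorem toy_add_comm_statement (a b : Nat) : a + b = b + a := by
  sorry

-- Step 3 (proof generation): a complete proof, which step 4 would check with Lean 4.
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

Real competition problems are of course much harder to state and prove, which is why the paper filters statements and verifies every proof.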

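Distilabel implements steps 1, 2, and 3. The later steps described in the paper (verifying theorems with Lean 4, iterative refinement, fine-tuning DeepSeekMath 7B, and so on) are left for further exploration. In the paper, the whole flow is repeated until no further improvement is obtained.

Note: although this is called a replication, it really demonstrates how Distilabel pipelines can implement the different steps of the DeepSeek-Prover method. Some stages are omitted, but they can be added as extensions.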
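
The stopping rule mentioned above (iterate until nothing more is gained) can be sketched in plain Python. This is only a sketch under stated assumptions: `run_round` is a hypothetical callback standing in for one full cycle (generate statements and proofs, verify them with Lean 4, fine-tune), and `min_gain` is an illustrative threshold. Neither comes from the paper or from Distilabel.

```python
from typing import Callable, List


def iterate_until_no_gain(
    run_round: Callable[[int], float],
    max_rounds: int = 10,
    min_gain: float = 1e-6,
) -> List[float]:
    """Run rounds until the score stops improving; return the score history.

    `run_round(i)` is a hypothetical stand-in for one generate/verify/fine-tune
    cycle and returns that round's benchmark score.
    """
    scores: List[float] = []
    for i in range(max_rounds):
        score = run_round(i)
        # Stop as soon as a round fails to beat the previous one by `min_gain`.
        if scores and score - scores[-1] < min_gain:
            scores.append(score)
            break
        scores.append(score)
    return scores


# Made-up scores purely to exercise the loop; they are not results from the paper.
fake_scores = [0.30, 0.41, 0.45, 0.45, 0.50]
history = iterate_until_no_gain(lambda i: fake_scores[i], max_rounds=len(fake_scores))
print(history)
```

With these made-up scores the loop stops after the fourth round, because that round brings no gain over the third.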

Prerequisite: installing Distilabel

Before running the code, install distilabel with the following command:

pip install "distilabel[hf-inference-endpoints]"

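The examples use InferenceEndpointsLLM to call the model. This class uses the Hugging Face Hub as the model provider, but any other provider of sufficiently capable models can be used instead.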
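
Calls to the Hugging Face Hub normally need an access token. The check below is a minimal sketch that assumes the token is supplied through the `HF_TOKEN` environment variable, which is the variable `huggingface_hub` reads. Whether your Distilabel version picks it up in the same way should be confirmed in its documentation, and the helper name `hf_token_available` is made up for this example.

```python
import os
from typing import Mapping


def hf_token_available(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if a non-empty HF_TOKEN is present in the given environment."""
    return bool(env.get("HF_TOKEN", "").strip())


if not hf_token_available():
    # Assumption: the token is expected in HF_TOKEN (see the note above).
    print("HF_TOKEN is not set; requests to the Hugging Face Hub may be rejected.")
```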

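Building blocks

The pipeline needs three main components, each corresponding to a step in the paper:

Convert informal mathematical statements into formal Lean 4 theorems (AutoFormalization)
Score the theorems for relevance and filter them (Scorer)
Generate proofs for the theorems (Solver)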

Note
The same LLM (large language model) is reused across all tasks, so it only needs to be defined once:

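from distilabel.models import InferenceEndpointsLLM  # older releases: distilabel.llms

# One LLM instance, shared by all three tasks.
llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
)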

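1. DeepSeekProverAutoFormalization (automatic theorem formalization)

This task corresponds to the first step of the flow. Its goal is to translate an informal mathematical problem, written in natural language, into a formal theorem in Lean 4. The implementation is: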

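import re
from typing import Any, Dict, List, Optional, Union

from jinja2 import Template
from pydantic import PrivateAttr
from typing_extensions import override

from distilabel.steps.tasks import Task
# ChatType's location depends on the distilabel release; newer ones may expose it as distilabel.typing.ChatType.
from distilabel.steps.tasks.typing import ChatType

# Captures the contents of the lean4 code block in the model's reply.
_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX = r"```lean4(.*?)```"

# Jinja2 template for the user message; the few-shot section is optional.
template_deepseek_prover_auto_formalization = """\
Mathematical Problem in Natural Language:
{{ informal_statement }}
{%- if few_shot %}

Please use the following examples to guide you with the answer:
{%- for example in examples %}
- {{ example }}
{%- endfor %}
{% endif -%}"""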
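
To see what the template above is meant to produce, here is a plain-Python approximation of its logic. It is a sketch, not the Jinja2 rendering itself: exact whitespace may differ, and `build_user_prompt` as well as the sample problem and example are made up for illustration.

```python
from typing import List, Optional


def build_user_prompt(informal_statement: str, examples: Optional[List[str]] = None) -> str:
    """Approximate the user message built by the template (whitespace may differ)."""
    lines = ["Mathematical Problem in Natural Language:", informal_statement]
    if examples:  # the few-shot section appears only when examples are given
        lines.append("")
        lines.append("Please use the following examples to guide you with the answer:")
        lines.extend(f"- {example}" for example in examples)
    return "\n".join(lines)


problem = "Prove that the sum of two even integers is even."

# Zero-shot: only the problem statement.
print(build_user_prompt(problem))

# Few-shot: the same problem followed by a bulleted list of examples.
print(build_user_prompt(
    problem,
    examples=["theorem toy (a b : Nat) : a + b = b + a := by sorry"],
))
```

An empty or missing examples list yields the zero-shot prompt, so the few-shot section is driven entirely by whether examples are supplied.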


class DeepSeekProverAutoFormalization(Task):
    """Translate an informal math problem into a Lean 4 formal statement."""

    # Optional few-shot examples appended to the user prompt.
    examples: Optional[List[str]] = None
    system_prompt: str = "Translate the problem to Lean 4 (only the core declaration):\n```lean4\nformal statement goes here\n```"
    _template: Union[Template, None] = PrivateAttr(...)
    _few_shot: bool = PrivateAttr(default=False)

    def load(self) -> None:
        # Compile the Jinja2 template once, when the task is loaded.
        super().load()
        self._template = Template(template_deepseek_prover_auto_formalization)

    @property
    def inputs(self) -> List[str]:
        return ["informal_statement"]

    @property
    def outputs(self) -> List[str]:
        return ["formal_statement", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
        # The system prompt fixes the answer format; the user message carries the problem.
        return [
            {
                "role": "system",
                "content": self.system_prompt,
            },
            {
                "role": "user",
                "content": self._template.render(
                    informal_statement=input[self.inputs[0]],
                    few_shot=bool(self.examples),
                    examples=self.examples,
                ),
            },
        ]

    @override
    def format_output(  # type: ignore
        self, output: Union[str, None], input: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:  # type: ignore
        # The output may be None (the signature allows it); avoid passing None to re.search.
        if output is None:
            return {"formal_statement": None}
        match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)
        # Keep only the code inside the lean4 block; None if no block was found.
        return {"formal_statement": match.group(1).strip() if match else None}
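
The parsing done in format_output can be tried on its own with the standard library. The sketch below builds the same pattern as the regex defined above, assembling the three backticks from a variable only to keep the listing readable. The helper name `extract_formal_statement` and the sample replies are made up for illustration.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks
# Same pattern as _PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, assembled from FENCE.
PATTERN = re.compile(FENCE + r"lean4(.*?)" + FENCE, re.DOTALL)


def extract_formal_statement(output: Optional[str]) -> Optional[str]:
    """Return the contents of the first lean4 block, or None if there is none."""
    if output is None:
        return None
    match = PATTERN.search(output)
    return match.group(1).strip() if match else None


# A made-up model reply, for illustration only.
reply = (
    "Here is the statement:\n"
    + FENCE + "lean4\n"
    + "theorem toy (a b : Nat) : a + b = b + a := by sorry\n"
    + FENCE
)

print(extract_formal_statement(reply))                 # the theorem line
print(extract_formal_statement("no code block here"))  # None
print(extract_formal_statement(None))                  # None
```

Note that the pattern requires the block to be tagged `lean4`: a reply fenced as `lean` does not match and yields None, so the system prompt's format instruction matters.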

According to the paper, in few-shot learning