[Binary Code Decompilation + LLM] LLM4Decompile: Decompiling Binary Code with Large Language Models

Article link: https://blog.csdn.net/Arachis_X/article/details/136850578

LLM4Decompile: Decompiling Binary Code with Large Language Models

2024.3.8

Abstract

Decompilation aims to restore compiled code to human-readable source code, but struggles with details like names and structure. Large language models (LLMs) show promise for programming tasks, motivating their application to decompilation. However, there does not exist any open-source LLM for decompilation. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs ranging from 1B to 33B pre-trained on 4 billion tokens of C source code and the corresponding assembly code. The open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating the decompilation model from the perspective of program semantics. Experiments indicate that our LLM4Decompile has demonstrated the capability to accurately decompile 21% of the assembly code, which achieves a 50% improvement over GPT-4. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile

Decompilation aims to turn compiled code back into human-readable source code, but it struggles to recover details such as names and structure.

Large language models (LLMs) have shown promise on programming tasks, which motivates applying them to decompilation.

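However, no open-source LLM for decompilation exists yet. In addition, existing decompilation evaluations focus mainly on token-level accuracy and largely ignore whether the code can be executed, which is the most important property of any program.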

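We therefore release the first open-access decompilation LLMs, ranging from 1B to 33B parameters and pre-trained on 4 billion tokens of C source code and the corresponding assembly code.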

These open-source LLMs can serve as baselines for further work in the field.

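To make program evaluation practical, we introduce Decompile-Eval, the first dataset that takes re-compilability and re-executability into account for decompilation. The benchmark stresses the importance of evaluating decompilation models from the perspective of program semantics.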
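The two criteria can be illustrated with a minimal sketch. This is not the Decompile-Eval harness itself: the names `Outcome`, `check_sample`, and `summarize`, the `gcc` invocation with `-lm`, and the idea of appending a test harness whose `main` returns non-zero on a failed assertion are all assumptions made for illustration. It counts a sample as re-compilable if the decompiled C builds, and as re-executable if the resulting binary also exits with status 0.

```python
import os
import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class Outcome:
    recompiled: bool  # the decompiled C compiled without errors
    reexecuted: bool  # the compiled binary also passed its test harness


def check_sample(decompiled_c, test_harness_c, cc="gcc", timeout=10.0):
    """Compile decompiled C together with a test harness, then run it.

    Hypothetical helper: the harness is assumed to define main() and to
    exit with a non-zero status when an assertion fails.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.c")
        exe = os.path.join(tmp, "sample")
        with open(src, "w") as f:
            f.write(decompiled_c + "\n" + test_harness_c)
        try:
            build = subprocess.run(
                [cc, src, "-o", exe, "-lm"],
                capture_output=True, timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired):
            # Compiler missing or hung: treat as not re-compilable.
            return Outcome(False, False)
        if build.returncode != 0:
            return Outcome(False, False)
        try:
            run = subprocess.run([exe], capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # Built, but crashed on launch or ran too long.
            return Outcome(True, False)
        return Outcome(True, run.returncode == 0)


def summarize(outcomes):
    """Return the fraction of samples meeting each criterion."""
    n = len(outcomes)
    if n == 0:
        return {"re_compilability": 0.0, "re_executability": 0.0}
    return {
        "re_compilability": sum(o.recompiled for o in outcomes) / n,
        "re_executability": sum(o.reexecuted for o in outcomes) / n,
    }
```

Calling `summarize` on a list of `check_sample` results gives the two rates side by side. A sample can compile yet fail its tests, so re-executability is never higher than re-compilability, which is why it serves as the stricter, semantics-oriented measure.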

Experiments show that our LLM4Decompile accurately decompiles 21% of the assembly code, a 50% improvement over GPT-4.

Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile.