IDA F.L.I.R.T. Technology: In-Depth

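FLIRT stands for Fast Library Identification and Recognition Technology.

This is a translation of an article from the IDA website. I translated it according to my own understanding: my translation skills are limited, and since I am still learning IDA, so is my technical level. Apologies in advance for anything that is off.
Original source: https://hex-rays.com/products/ida/tech/flirt/in_depth/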

The Goal
The Difficulties
The Idea
The Implementation
The Results

The Goal

One major stumbling block in the disassembly of programs written in modern high-level languages is the time required to isolate library functions. This time may be considered lost because it does not bring us new knowledge: it is only a mandatory step that allows further analysis of the program and its meaningful algorithms. Unfortunately, this process has to be repeated for each and every new disassembly.

Sometimes, knowing the class of a library function can considerably ease the analysis of a program. This knowledge might be extremely helpful in discarding useless information. For example, a C++ function that works with streams usually has nothing to do with the main algorithm of a program.

It is interesting to note that every high-level language program uses a great number of standard library functions; sometimes up to 95% of all the functions called are standard functions. For one well-known compiler, the “Hello, world!” program contains:

        library functions       -       58
        function main()         -       1

Of course, this is an artificial example, but our analysis has shown that real-life programs contain, on average, 50% library functions. This is why the user of a disassembler is forced to waste more than half of their time isolating those library functions. The analysis of an unknown program resembles solving a gigantic crossword puzzle: the more letters we know, the easier it is to guess the next word. During a disassembly, more comments and meaningful names in a function mean a faster understanding of its purpose. Widespread use of standard libraries such as OWL, MFC and others further increases the share of standard functions in the target program.

A medium-sized program for Win32, written in C++ using modern technologies (e.g., AppExpert or a similar wizard), calls anywhere from 1000 to 2500 library functions.

To assist IDA users, we attempted to create an algorithm to recognize the standard library functions. We wanted to achieve a practical, usable result and therefore accepted some limitations:

  • we only consider programs written in C/C++

  • we do not attempt to achieve perfect function recognition: this is theoretically impossible. Moreover, the recognition of some functions may lead to undesirable consequences. It is worth noting that in modern C++ libraries one can find a lot of functions that are absolutely identical byte-for-byte but have different names. For example, the recognition of the following function would lead to many misidentifications:

                push    bp
                mov     bp, sp
                xor     ax, ax
                pop     bp
                ret

  • we only recognize and identify functions located in the code segment; we ignore the data segment.

  • when a function has been successfully identified, we assign it a name and possibly a comment. We do not aim to provide information about the function arguments or about the behaviour of the function.

and we imposed the following constraints upon ourselves:

  • we try to avoid false positives completely. We consider that a false positive (a function wrongly identified) is worse than a false negative (a function not identified). Ideally, there should be no false positives at all.

  • the recognition of the functions must require a minimum of processor and memory resources.

  • because of the versatile architecture of IDA (it supports dozens of very different processors), the identification algorithm must be platform-independent, i.e. it must work with programs compiled for any processor.

  • the main() function should be identified and properly labelled, as the library’s startup code is of no interest.

The Difficulties

Memory usage

The main obstacle to recognition and identification is the sheer quantity of functions and the size of the memory they occupy. If we evaluate the size of the memory occupied by all versions of all libraries produced by all compiler vendors for all memory models, we easily fall into the tens of gigabytes range.

Matters get even worse if we try to take OWL, MFC and similar libraries into account. The storage needed is huge. At this time, personal computer users cannot afford to set aside hundreds of megabytes of disk space for a simple disassembler utility. Therefore, we had to find an algorithm that reduces the size of the information needed to recognize standard library functions. Of course, the number of functions that should be recognized dictates the need for an efficient recognition algorithm: a simple brute-force search is not an option.

Variability

An additional difficulty arises from the presence of variant bytes in the program. Some bytes are corrected (fixed up) at load time, others become constants at link time, but most of the variant bytes originate from references to external names. In that case the compiler does not know the addresses of the called functions and leaves these bytes equal to zero. This so-called “fixup information” is usually written to a table in the output file (sometimes called the “relocation table” or “relocation information”). The example below

B8 0000s                             mov     ax, seg _OVRGROUP_
9A 00000000se                        call    _farmalloc
26: 3B 1E 0000e                      cmp     bx, word ptr es:__ovrbuffer

contains variant bytes. The linker will try to resolve external references, replacing zeroes with the addresses of called functions, but some bytes will stay untouched: references to dynamic libraries or bytes containing absolute addresses in the program. These references can be resolved only at load time by the system loader. It will try to resolve all external references and replace zeroes with absolute addresses. When the system loader cannot resolve an external reference, as is the case when the program refers to an unknown DLL, the program will simply not run.

Optimizations introduced by some linkers will also complicate the matter because constant bytes will sometimes be changed. For example:

               0000: 9A........        call    far ptr xxx

             is replaced by

               0000: 90                nop
               0001: 0E                push    cs
               0002: E8....            call    near ptr xxx

The program will execute as usual, but the replacement introduced by the linker effectively prohibits byte-for-byte comparison with a function template. The presence of variant bytes in a program makes the use of simple checksums for recognition impossible. If functions did not contain variant bytes, the CRC of the first N bytes would be enough to identify and select a group of functions in a hash table. The use of such tables would greatly decrease the size of the information required for identification: the name of a function, its length and checksum would suffice.
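To see why a plain checksum breaks down, the following Python sketch checksums the same call instruction before and after a fixup is applied, then again with the variant bytes masked out. The byte values, the `mask_variant` helper and the use of `zlib.crc32` are illustrative assumptions; the article does not say which checksum a naive scheme would use.

```python
import zlib
from typing import Iterable


def mask_variant(code: bytes, variant_offsets: Iterable[int]) -> bytes:
    """Return a copy of `code` with every variant byte forced to zero."""
    masked = bytearray(code)
    for offset in variant_offsets:
        masked[offset] = 0
    return bytes(masked)


# Hypothetical far call: opcode 9A followed by a 4-byte address fixup.
in_library = bytes.fromhex("9A00000000")  # as stored in the library file
in_program = bytes.fromhex("9A3412CDAB")  # after the address was patched in
fixups = range(1, 5)                      # offsets of the variant bytes

# Checksums of the raw bytes disagree once the fixup has been applied.
plain_equal = zlib.crc32(in_library) == zlib.crc32(in_program)

# Checksums agree again when the variant bytes are masked first.
masked_equal = (zlib.crc32(mask_variant(in_library, fixups))
                == zlib.crc32(mask_variant(in_program, fixups)))
```

Masking only helps when the positions of the variant bytes are known in advance, which is information a bare name, length and checksum triple does not carry.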

We have already mentioned the fact that the recognition of all standard library functions was not possible or even desirable.

One additional proof is the fact that some identical functions do exactly the same thing but are called in a different manner. For example, the functions strcmp() and fstrcmp() are identical in large memory models.

We face a dilemma here: we do not want to discard these functions from the recognition process, since they are not trivial and their labelling would help the user, but we are unable to distinguish them.

And there is more: consider this

                call    xxx
                ret
        or
                jmp     xxx

At first sight, these pieces of code are not interesting. The problem is that they are present, sometimes in significant numbers, in standard libraries. The libraries for the Borland C++ v1.5 OS/2 compiler contain 20 calls of this type, in important functions such as read(), write(), etc.

Plain comparison of these functions yields nothing. The only way to distinguish those functions is to discover what other function they call. Generally, all short functions (consisting merely of 2-3 instructions) are difficult to recognize and the probability of wrong recognition is very high. However, not recognizing them is undesirable, as it can lead to cascade failures: if we do not recognize the function tolower(), we may fail to recognize strlwr(), which refers to tolower().

Copyright

Finally, there is an obvious copyright problem: standard libraries may simply not be distributed with a disassembler.

The Idea

To address those issues, we created a database of all the functions from all libraries we wanted to recognize. IDA now checks, at each byte of the program being disassembled, whether this byte can mark the start of a standard library function.

The information required by the recognition algorithm is kept in a signature file. Each function is represented by a pattern. Patterns are the first 32 bytes of a function where all variant bytes are marked.

For example:

558BEC0EFF7604..........59595DC3558BEC0EFF7604..........59595DC3 _registerbgidriver
558BEC1E078A66048A460E8B5E108B4E0AD1E9D1E980E1C0024E0C8A6E0A8A76 _biosdisk
558BEC1EB41AC55604CD211F5DC3.................................... _setdta
558BEC1EB42FCD210653B41A8B5606CD21B44E8B4E088B5604CD219C5993B41A _findfirst

where variant bytes are displayed as “..”. Several functions start with the same byte sequence. Therefore a tree structure seems particularly well suited to the storage of those functions:

558BEC
      0EFF7604..........59595DC3558BEC0EFF7604..........59595DC3 _registerbgidriver
      1E
        078A66048A460E8B5E108B4E0AD1E9D1E980E1C0024E0C8A6E0A8A76 _biosdisk
        B4
          1AC55604CD211F5DC3                                       _setdta
          2FCD210653B41A8B5606CD21B44E8B4E088B5604CD219C5993B41A   _findfirst

Sequences of bytes are kept in the nodes of the tree. In this example, the root of the tree contains the sequence “558BEC”, and three subtrees stem from the root, starting with bytes 0E, 1E and B4 respectively. The subtree starting with B4 gives birth to two subtrees. Each subtree ends with leaves. The information about the function is kept in the leaf (only the name is visible in the above example).
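The tree described above can be sketched in Python as a prefix tree with wildcard entries. This is a simplified illustration, not the actual signature-file layout: it stores one byte per node instead of a byte sequence per node, and the names `Node`, `insert` and `match` are hypothetical. The first two patterns are shortened versions of the ones listed earlier.

```python
from typing import Dict, List, Optional

WILDCARD = None  # key used for a variant byte ("..") in a pattern


def parse_pattern(text: str) -> List[Optional[int]]:
    """Turn 'B8....8ED8' into [0xB8, None, None, 0x8E, 0xD8]."""
    return [WILDCARD if text[i:i + 2] == ".." else int(text[i:i + 2], 16)
            for i in range(0, len(text), 2)]


class Node:
    def __init__(self) -> None:
        self.children: Dict[Optional[int], "Node"] = {}
        self.names: List[str] = []  # leaf payload: function names


def insert(root: Node, pattern: str, name: str) -> None:
    """Add one pattern to the tree, sharing nodes with common prefixes."""
    node = root
    for item in parse_pattern(pattern):
        node = node.children.setdefault(item, Node())
    node.names.append(name)


def match(root: Node, code: bytes) -> List[str]:
    """Return the names of all patterns that match the start of `code`."""
    found: List[str] = []
    stack = [(root, 0)]
    while stack:
        node, pos = stack.pop()
        found.extend(node.names)
        if pos >= len(code):
            continue
        # Follow both the exact byte and the wildcard branch, if present.
        for key in (code[pos], WILDCARD):
            child = node.children.get(key)
            if child is not None:
                stack.append((child, pos + 1))
    return found


root = Node()
insert(root, "558BEC0EFF7604..........59595DC3", "_registerbgidriver")
insert(root, "558BEC1EB41AC55604CD211F5DC3", "_setdta")
insert(root, "558BEC1EB42FCD210653B41A8B5606CD21B44E8B4E088B5604CD219C5993B41A",
       "_findfirst")
```

Calling `match(root, code)` on the bytes at a candidate address walks only the branches whose bytes agree with the code, so most candidates are rejected after a few comparisons.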

The tree data structure simultaneously achieves two goals:

  • Memory requirements are decreased since we store bytes common to several functions in tree nodes. This saving is, of course, proportional to the number of functions starting with the same bytes.

  • It is well suited to fast pattern matching. The number of comparisons required to match a specific location within a program to all functions in a signature file grows logarithmically with the number of functions.

It would not be very wise to make a decision based on the first 32 bytes of a function alone. As already suggested, modern real-world libraries contain several functions starting with the same bytes:

558BEC
      56
        1E
          B8....8ED8
                   33C050FF7608FF7606..........83C406
                                                      8BF083FEFF
                    0. _chmod   (20 5F33)
                    1. _access  (18 9A62)

When two functions have the same first 32 bytes, they are stored in the same leaf of the tree. To resolve that situation, we calculate the CRC16 of the bytes starting from position 33 up to the first variant byte. The CRC is stored in the signature file. The number of bytes used to calculate that CRC also needs to be saved, as it differs from function to function. In the above example, the CRC16 is calculated on 20 bytes for the _chmod function (bytes 33..52) and on 18 bytes for the _access function.

There is, of course, a possibility that the first variant byte will be at the 33rd position. The length of the sequence of bytes used to calculate the CRC16 is then equal to zero. In practice, this happens rarely and this algorithm gives a very low number of false recognitions.
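A minimal Python sketch of this step follows. The article does not state which CRC16 parameters are used, so the sketch substitutes the common CRC-16/CCITT-FALSE variant purely as a stand-in; the `tail_checksum` helper and the convention of representing variant bytes as `None` are also assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF), bit-by-bit version."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def tail_checksum(body: List[Optional[int]],
                  prefix_len: int = 32) -> Tuple[int, int]:
    """Return (length, crc) for the bytes that follow the pattern prefix,
    stopping at the first variant byte (represented here as None)."""
    run = bytearray()
    for value in body[prefix_len:]:
        if value is None:
            break
        run.append(value)
    # The listings in the article show "00 0000" when no bytes are
    # available, so an empty run is reported as (0, 0) here.
    crc = crc16_ccitt(bytes(run)) if run else 0
    return len(run), crc
```

Storing the length next to the CRC matters because the matcher has to know how many bytes of the program to checksum before comparing.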

Sometimes functions have the same initial 32-byte pattern and the same CRC16, as in the example below:

05B8FFFFEB278A4606B4008BD8B8....8EC0
          0. _tolower (03 41CB) (000C:00)
          1. _toupper (03 41CB) (000C:FF)

We are unlucky: only 3 bytes were used to calculate the CRC16 and they were the same for both functions. In this case we will try to find a position at which all functions in a leaf have different bytes (in our example this position is 32+3+000C).

But even this method does not allow us to recognize all functions. Here is another example:
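The search for a distinguishing position can be sketched as below. The function name `find_discriminator` and the sample bodies are hypothetical, and the sketch assumes the compared bytes contain no variant bytes; a real implementation would have to skip those.

```python
from typing import Dict, Optional


def find_discriminator(bodies: Dict[str, bytes], start: int) -> Optional[int]:
    """Return the first offset >= start at which every function in the
    leaf has a different byte, or None if no such offset exists."""
    limit = min(len(body) for body in bodies.values())
    for offset in range(start, limit):
        values = {body[offset] for body in bodies.values()}
        if len(values) == len(bodies):
            return offset
    return None


# Made-up function bodies that differ only at offset 32 + 3 + 0x0C = 47,
# mirroring the _tolower/_toupper listing above (byte 00 versus byte FF).
filler = bytes([0x90]) * 60
leaf = {
    "_tolower": filler[:47] + b"\x00" + filler[48:],
    "_toupper": filler[:47] + b"\xFF" + filler[48:],
}
position = find_discriminator(leaf, start=32 + 3)
```

The search starts after the 32-byte pattern and the checksummed bytes because everything before that point is already known to be identical.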

... (partial tree is shown here)
                0D8A049850E8....83C402880446803C0075EE8BC7:
                  0. _strupr (04 D19F) (REF 0011: _toupper)
                  1. _strlwr (04 D19F) (REF 0011: _tolower)

These functions are identical at non-variant bytes and differ only by the functions they call. In this example the only way to distinguish the functions is to examine the name referenced from the instruction at offset 11.

The last method has a disadvantage: proper recognition of functions _strupr() and _strlwr() depends on the recognition of functions _toupper() and _tolower(). It means that in the case of failure because of the absence of a reference to _toupper() or _tolower(), we should defer recognition and repeat it later, after finding _tolower() or _toupper(). This has an impact on the general design of the algorithm: we need a second pass to resolve those deferred recognitions. Luckily, subsequent passes are only applied to a few locations in the program.
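The deferred recognition described above can be sketched as a loop that repeats until no further names can be assigned. The data layout (addresses mapped to candidate tuples) and the name `resolve_deferred` are assumptions made for this illustration, not the structures IDA actually uses.

```python
from typing import Dict, List, Tuple

# One candidate: (function name, address it references, name required there)
Candidate = Tuple[str, int, str]


def resolve_deferred(known: Dict[int, str],
                     pending: Dict[int, List[Candidate]]) -> Dict[int, str]:
    """Name deferred locations once the function they reference is known.

    `known` maps addresses to names found in the first pass; `pending`
    maps addresses to candidates that could not be told apart yet.
    """
    names = dict(known)
    todo = dict(pending)
    progress = True
    while todo and progress:          # repeat until nothing new is resolved
        progress = False
        for addr in list(todo):
            for name, ref_addr, ref_name in todo[addr]:
                if names.get(ref_addr) == ref_name:
                    names[addr] = name
                    del todo[addr]
                    progress = True
                    break
    return names


# Hypothetical addresses: 0x100 was recognized as _toupper in the first pass.
first_pass = {0x100: "_toupper"}
deferred = {
    0x200: [("_strupr", 0x100, "_toupper"), ("_strlwr", 0x100, "_tolower")],
}
result = resolve_deferred(first_pass, deferred)
```

Only the addresses in `pending` are revisited, which matches the observation that later passes touch just a few locations in the program.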

Finally, one can find functions that are identical in non-variant bytes and refer to the same names, but are called differently. Those functions have the same implementation but different names. Surprisingly, this is a frequent situation in standard libraries, especially in C++ libraries. (Translator's note: polymorphism, for example.)

We call this situation a collision. It occurs when functions attached to a leaf cannot be distinguished from each other by using the described methods. A classical example is:

558BEC1EB441C55606CD211F720433C0EB0450E8....5DCB................
   0. _remove (00 0000)
   1. _unlink (00 0000)

   or
8BDC36834702FAE9....8BDC36834702F6E9............................
   0. @iostream@$vsn            (00 0000)
   1. @iostream_withassign@$vsn (00 0000)

Artificial intelligence is the only way to resolve those cases. Since our goal was efficiency and speed, we decided to leave artificial intelligence for future developments of the algorithm.

The Implementation

In IDA version 3.6, the practical implementation of the algorithm matches the above description almost perfectly. We have limited ourselves to the C and C++ languages, but it will without doubt be possible to write preprocessors for other libraries in the future.

A separate signature file is provided for each compiler. This segregation decreases the probability of cross-compiler identification collisions. A special signature file, called the startup signature file, is applied to the entry point of the disassembled program to determine the generating compiler. Once it has been identified, we know which signature file should be used for the rest of the disassembly. Our algorithm successfully discerns the startup modules of the most popular compilers on the market.

Since we store all function signatures for one compiler in one signature file, it is not necessary to discriminate between the memory models (small, compact, medium, large, huge) of the libraries and/or the versions of the compilers. (Translator's note: that is rather remarkable.)

We use special startup signatures for every format of disassembled file. The signature exe.sig is used for programs running under MS DOS, lx.sig or ne.sig for OS/2, etc.

To decrease the probability of false recognition of short functions, we must absolutely remember any reference to an external name if such a reference exists. It may decrease, to some degree, the probability of recognizing the function in general, but we believe that such an approach is justified. It is better not to recognize than to recognize wrongly. Short functions (shorter than 4 bytes) that do not contain references to external names are not used in the creation of a signature file, and no attempt is made to recognize such functions.

The functions from <ctype.h> are short and refer to the array of character types; we therefore decided to treat references to this array as an exception: we calculate the CRC16 of the character-type array and store it in the signature file.

Without artificial intelligence, the collisions are solved by natural intelligence. The human creator of a signature file chooses the functions to include in and to discard from the signature file. This choice is very easy and is done in practice by editing a text file.

The patterns of the functions are not stored in a signature file in their original form (i.e., they do not look like the example figures). Instead of the patterns, we store arrays of bits marking the variant bytes, together with the values of the individual bytes. (Translator's note: as hex values?) Therefore the signature file contains no bytes from the original libraries, except for the names of the functions. The creation of a signature file involves two stages: the preprocessing of the libraries and the creation of the signature file. In the first stage the program ‘parselib’ is used. It preprocesses *.obj and *.lib files to produce a pattern file. The pattern file contains the patterns of the functions, their names, their CRC16 and all other information necessary to create the signature file. In the second stage the ‘sigmake’ program builds the signature file from the pattern file.
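One way to picture "an array of bits plus the values of the individual bytes" is the Python sketch below. The actual on-disk layout of a signature file is not described in the article, so the `encode` and `matches` functions and the bit ordering are assumptions chosen only to illustrate the idea.

```python
from typing import Tuple


def encode(pattern: str) -> Tuple[int, int, bytes]:
    """Split a textual pattern such as 'B8....8ED8' into
    (length in bytes, variant-bit mask, fixed byte values).

    Bit i of the mask, counted from the most significant end, is 1
    when byte i of the pattern is variant.
    """
    length = len(pattern) // 2
    mask = 0
    values = bytearray()
    for i in range(0, len(pattern), 2):
        pair = pattern[i:i + 2]
        mask <<= 1
        if pair == "..":
            mask |= 1                    # variant byte: no value stored
        else:
            values.append(int(pair, 16))
    return length, mask, bytes(values)


def matches(length: int, mask: int, values: bytes, code: bytes) -> bool:
    """Check `code` against the encoded pattern, skipping variant bytes."""
    if len(code) < length:
        return False
    fixed = iter(values)
    for i in range(length):
        is_variant = (mask >> (length - 1 - i)) & 1
        if not is_variant and code[i] != next(fixed):
            return False
    return True


length, mask, values = encode("B8....8ED8")
```

For this five-byte pattern the mask is `0b01100` and only three byte values are stored, so the variant positions cost one bit each instead of a full byte.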

This division into two stages allows the sigmake utility to be independent of the format of the input file. (Translator's note: that is, the library files.) Therefore it will be possible to write other preprocessors for files other than *.obj and *.lib in the future.

We decided to compress the created signature files (using the InfoZip algorithm) to decrease the disk space necessary for their storage.

For the sake of the user’s convenience, we attempted to recognize the main() function as often as possible. The algorithm for identifying this function differs from compiler to compiler and from program to program (DOS/OS2/Windows/GUI/Console…).

This algorithm is written, as a text string, in a signature file. Unfortunately, we have not yet been able to automate the creation of this algorithm.

The Results

As it turns out, the signature files compress well; they may be compressed by a factor bigger than 2. The reason for this compressibility is that about 95% of a signature file consists of function names. (Example: the signature file for MFC 2.x was 2.5 MB before compression and 700 KB after. It contains 33634 function names; an average of 21 bytes is stored per function.) Generally, the ratio of the size of a library to the size of its signature file varies from 100 to 500.
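The figures in the MFC example can be checked with a few lines of Python. The sketch assumes that the sizes are in kilobytes and megabytes of 1024 and 1024 × 1024 bytes; the article does not spell out the units.

```python
# Figures quoted in the MFC 2.x example (units are an assumption).
uncompressed_bytes = 2.5 * 1024 * 1024   # 2.5 MB before compression
compressed_bytes = 700 * 1024            # 700 KB after compression
function_count = 33634

# Average storage per function in the compressed file: about 21.3 bytes.
bytes_per_function = compressed_bytes / function_count

# Compression factor for this file: roughly 3.7, i.e. "bigger than 2".
compression_factor = uncompressed_bytes / compressed_bytes

print(f"{bytes_per_function:.1f} bytes per function")
print(f"compression factor {compression_factor:.1f}")
```

The result suggests that the quoted average of 21 bytes per function refers to the compressed file; the uncompressed file works out to about 78 bytes per function.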

The percentage of properly recognized functions is very high. Our algorithm recognized all but one function of the “Hello World” program. The unrecognized function consists of only one instruction:

        jmp     off_1234

We were especially pleased with the fact that there was no false recognition. However, this does not mean that false recognitions will not occur in the future. It should be noted that the algorithm only works with functions.

Data is sometimes located in the code segment, and therefore we need to mark some names as “data names”, not as “function names”. It is not easy to examine all names in a modern large library and mark all data names.

The implementation of these data names is planned for some time in the future.

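Translator's Afterword

The article mentions version 3.6, while the latest version is currently 8.4. It was evidently written quite a long time ago, yet its ideas still shine and are very instructive.

“Variant bytes” literally means bytes that can change; I provisionally rendered the term in Chinese as “异变字节”. A more precise rendering still needs some thought. If you are curious how variant bytes are marked, inspect a .pat file: they really are marked with ‘.’ 😉

For the passage on variant bytes, it helps to consult the relevant material from compiler theory: with static and dynamic libraries, the addresses of function calls are determined only at link time and at load time respectively. These addresses differ from one program to another, so they cannot be used as signature bytes.

Several misspellings in the original were corrected, for example dilemna -> dilemma and recgnition -> recognition.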
