[Translated] The LLVM Type System

The type system of LLVM IR (intermediate representation) was rewritten in LLVM 3.0. This article explains why, and how the new type system works.

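Source: http://blog.llvm.org/2011/11/llvm-30-type-system-rewrite.html
Original title: LLVM 3.0 Type System Rewrite
Author: Chris Lattner
Translator: 流左沙
Note: compared with the original, this version condenses, reorders, adds and removes some material in order to bring out the key points. Treat the original as authoritative.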

1、Basic types

The LLVM IR type system is a fairly straightforward part of the IR. It consists of three major parts:

primitive types (like 'double' and integer types),
derived types (like structs, arrays and vectors),
and a mechanism for handling forward declarations of types ('opaque').

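2、Goals of the type system

Structural type equality can be checked with an efficient pointer comparison.
Types can be forward declared and have their definition completed later, possibly somewhere else.
The type system can represent source code from many different languages.
It is simple and behaves predictably.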

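3、How the old type system worked

When an opaque type is later resolved (possibly at link time), a process called type refinement updates everything that points to that opaque type so that it refers to the resolved type instead.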

%T1 = type opaque
%T2 = type %T1*

%T2 is a PointerType to an OpaqueType. If we resolve %T1 to {} (an empty struct), then %T2 mutates to be a PointerType to the empty StructType.

3.1、Problems this caused

In order to guarantee that pointer equality checks work for structural type equivalence checks, VMCore is required to re-unique types whenever they are mutated during type resolution.
Another problem is that more than just types need to be updated: anything that contains a pointer to a type has to be updated, or it gets a dangling pointer. This issue manifested in a number of ways: for example every Value has a pointer to a type. In order to make the system a bit more efficient, Value::getType() actually performed a lazy "union find" step to ensure that it always returned a canonicalized and uniqued type. This made Value::getType() (a very common call) more expensive than it should be.
An even worse problem that this "type updating" problem caused is when you were manipulating and building IR through the LLVM APIs. Because types could move, it was very easy to get dangling pointers, causing a lot of confusion and a lot of broken clients of the LLVM API.
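To make the lazy "union find" step more concrete, here is a minimal sketch in plain C++. It is a toy model, not LLVM's old implementation: the names ToyType, ToyValue, canonical and refine are all made up for illustration, and a refined type is assumed to simply forward to its replacement.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy stand-in for an old-style type node (hypothetical, not an LLVM class).
struct ToyType {
    std::string desc;                // printable form, e.g. "opaque" or "{}"
    ToyType* forwardedTo = nullptr;  // set once this node has been refined away
    explicit ToyType(std::string d) : desc(std::move(d)) {}
};

// "Find": follow forwarding links to the canonical type, then compress
// the path so that later lookups take a single hop.
inline ToyType* canonical(ToyType* t) {
    ToyType* root = t;
    while (root->forwardedTo) root = root->forwardedTo;
    while (t->forwardedTo) {
        ToyType* next = t->forwardedTo;
        t->forwardedTo = root;
        t = next;
    }
    return root;
}

// "Union": refining `from` turns it into an alias of `to`.
inline void refine(ToyType* from, ToyType* to) {
    assert(from && to);
    ToyType* a = canonical(from);
    ToyType* b = canonical(to);
    if (a != b) a->forwardedTo = b;
}

// Toy stand-in for a Value: every getType() call pays for the lookup.
struct ToyValue {
    ToyType* ty;
    ToyType* getType() {
        ty = canonical(ty);  // lazily repair a pointer that went stale
        return ty;
    }
};
```

In this model a ToyValue created with an opaque type starts returning the resolved type after refine() is called, but only because every call to getType() does the extra chasing, which is the cost the paragraph above describes.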

3.2、Surprising behavior of type uniquing

Type names were not part of the type system; they were an "on the side" data structure. The names were not taken into consideration during type uniquing, and you could have multiple names for one type (which led to a lot of confusion). For example:

%T1 = type opaque
@G1 = external global %T1*

%T2 = type {}
@G2 = external global %T2*

If %T1 was later resolved to {}, then %T1 and %T2 would both become names for the same empty structure type: the type formerly known as "%T1" would be unified with the type formerly known as "%T2", and the IR would now print out as:

%T1 = type {}
@G1 = external global %T1*

%T2 = type {}
@G2 = external global %T1*

... note that G2 now has type "%T1*"! This is because the names in the type system were just a hash table on the side, so the asmprinter would pick one of the arbitrarily large number of names for a type when printing. This was "correct", but highly confusing to folks who did not know the ins and outs of the type system, and not helpful behavior. It also made reading .ll dumps from a C++ compiler very difficult, because it is very common to have many structurally identical types with different names.
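The effect can be reproduced with a small toy model. The sketch below is only an illustration under assumptions: OldTypeTable and its methods are hypothetical, types are identified by a printable structure string, and the printer is assumed to pick the first name that was registered (the real asmprinter's choice was arbitrary).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of structural uniquing with names kept on the side.
class OldTypeTable {
    std::map<std::string, int> uniqued_;              // structure -> type id
    std::map<int, std::vector<std::string>> names_;   // type id -> every name
public:
    // Structurally identical requests return the same id.
    int getType(const std::string& structure) {
        auto it = uniqued_.find(structure);
        if (it != uniqued_.end()) return it->second;
        int id = static_cast<int>(uniqued_.size());
        uniqued_[structure] = id;
        return id;
    }
    // Names are attached after the fact and play no part in uniquing.
    void addName(int id, const std::string& name) {
        assert(id >= 0 && id < static_cast<int>(uniqued_.size()));
        names_[id].push_back(name);
    }
    // The printer picks some name for the type; here, the first one added.
    std::string printName(int id) const {
        auto it = names_.find(id);
        if (it == names_.end()) return "<unnamed>";
        return it->second.front();
    }
};
```

Registering "{}" under both "%T1" and "%T2" yields one type id with two names, so a global declared with "%T2" prints with "%T1", just like @G2 in the listing above.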

3.3、Type up-references

A final problem (that I don't want to dwell on too much) is that we previously could have the situation where a type existed that had no name at all. While this is fine from the type system graph's perspective, this made printing types impossible if they were cyclic and had no names. The solution to this problem was a system known as type up-references.

An "up reference" allows you to refer to a lexically enclosing type without requiring it to have a name. For instance, a structure declaration may contain a pointer to any of the types it is lexically a member of. Examples of up references (with their equivalent as named type declarations) include:

{ \2 * }                %x = type { %x* } 
{ \2 }*                 %y = type { %y }*
\1*                     %z = type %z*        // Self-referential pointer.

An up reference is needed by the asmprinter for printing out cyclic types when there is no declared name for a type in the cycle. Because the asmprinter does not want to print out an infinite type string, it needs a syntax to handle recursive types that have no names (all names are optional in LLVM IR).

Type up-references were an elegant solution that allowed the asmprinter (and parser) to be able to represent an arbitrary recursive type in finite space, without requiring names. For example, a linked-list node type %intlist (a struct holding a pointer to itself and an i32) could be represented as "{ \2*, i32 }". It also allowed for construction of some nice (but surprising) types like "\1*" which was a pointer to itself!

Despite having some amount of beauty and elegance, type up-references were never well understood by most people and caused a lot of confusion. It is important to be able to strip the names out of an LLVM IR module (e.g. the -strip pass), but it is also important for compiler hackers to be able to understand the system!
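For readers who find the notation hard to follow, the numbering rule can be sketched as a tiny printer over a toy type graph. This is an illustration only, not the old asmprinter: Node and print are hypothetical names, only i32, pointers and structs are modeled, and "\N" is assumed to mean "the type N nesting levels up, counting the innermost enclosing type as 1".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy type graph node (hypothetical, not an LLVM class).
struct Node {
    enum Kind { Int32, Pointer, Struct } kind;
    std::vector<const Node*> elems;  // pointee (one entry) or struct members
};

// Print `n`. `stack` holds the types currently being printed, outermost
// first. Meeting one of them again emits "\N" instead of recursing forever.
inline std::string print(const Node* n, std::vector<const Node*>& stack) {
    auto it = std::find(stack.begin(), stack.end(), n);
    if (it != stack.end())
        return "\\" + std::to_string(stack.end() - it);
    if (n->kind == Node::Int32) return "i32";

    stack.push_back(n);
    std::string out;
    if (n->kind == Node::Pointer) {
        out = print(n->elems[0], stack) + "*";
    } else if (n->elems.empty()) {
        out = "{}";
    } else {
        out = "{ ";
        for (std::size_t i = 0; i < n->elems.size(); ++i) {
            if (i) out += ", ";
            out += print(n->elems[i], stack);
        }
        out += " }";
    }
    stack.pop_back();
    return out;
}

// Convenience entry point that starts with an empty nesting stack.
inline std::string print(const Node* n) {
    std::vector<const Node*> stack;
    return print(n, stack);
}
```

Building a struct whose first member points back at the struct and printing it gives "{ \2*, i32 }": inside the pointer, level 1 is the pointer itself and level 2 is the enclosing struct.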

4、The new type system in LLVM 3.0

The primitive and derived types are the same; only OpaqueType has been removed and StructType has been enhanced.

LLVM 3.0 uses a type system very similar to what C has, based on type completion. Basically, instead of creating an opaque type and replacing it later, you now create a StructType with no body, then specify the body later. To create the %intlist type, you'd now write something like this:

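// Create an identified struct named "intlist" that has no body yet.
StructType *IntList = StructType::create(SomeLLVMContext, "intlist");
// The elements may refer back to IntList itself: { %intlist*, i32 }.
Type *Elts[] = { PointerType::getUnqual(IntList), Type::getInt32Ty(SomeLLVMContext) };
// Complete the type by supplying its body.
IntList->setBody(Elts);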

... which is simple and to the point, much better than the 2.9 way. There are a few non-obvious ramifications to this design though.

4.1、Only struct types can be recursive

In the previous type system, an OpaqueType could be resolved to any arbitrary type, allowing such oddities as "%t1 = type %t1*", which is a pointer to itself. In the new type system, only IR structure types can have their body missing, so it is impossible to create a recursive type that doesn't involve a struct.

4.2、Literal and Identified structs

In the new type system, there are actually two different kinds of structure type: a "literal" structure (e.g. "{i32, i32}") and an "identified" structure (e.g. "%ty = type {i32, i32}").

Identified structures are the kind we are talking about: they can have a name, and can have their body specified after the type is created. Identified structures are not uniqued with other structure types, which is why they are produced with StructType::create(...). Because identified types are potentially recursive, the asmprinter always prints them by their name (or a number like %42 if the identified struct has no name).

Literal structure types work similarly to the old IR structure types: they never have names and are uniqued by structural identity. This means that they must have their body elements available at construction time, and they can never be recursive. When printed by the asmprinter, they are always printed inline without a name. Literal structure types are created by the StructType::get(...) methods, reflecting that they are uniqued (the call may or may not actually allocate a new StructType).
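The difference between the two kinds comes down to how they are created. The following sketch models it with standard C++ only; ToyContext and ToyStruct are hypothetical stand-ins whose get and create merely mimic the behavior described for StructType::get and StructType::create, and element types are simplified to strings.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical struct type: element types are plain strings for brevity.
struct ToyStruct {
    bool literal;                   // true: uniqued by structure, never named
    bool hasBody;                   // false: body not specified yet
    std::vector<std::string> body;

    // Only an identified struct without a body may be completed.
    void setBody(const std::vector<std::string>& elems) {
        assert(!literal && !hasBody && "only a body-less identified struct");
        body = elems;
        hasBody = true;
    }
};

class ToyContext {
    std::map<std::vector<std::string>, ToyStruct*> literals_;  // uniquing table
    std::vector<std::unique_ptr<ToyStruct>> owned_;
public:
    // Mimics StructType::get: the same elements give back the same object.
    ToyStruct* get(const std::vector<std::string>& elems) {
        auto it = literals_.find(elems);
        if (it != literals_.end()) return it->second;
        owned_.emplace_back(new ToyStruct{true, true, elems});
        return literals_[elems] = owned_.back().get();
    }
    // Mimics StructType::create: always a fresh object with no body.
    ToyStruct* create() {
        owned_.emplace_back(new ToyStruct{false, false, {}});
        return owned_.back().get();
    }
};
```

Two calls to get with the same elements return the same pointer, while two calls to create return distinct objects that stay distinct even after both receive the same body. Pointer equality therefore still answers "same type?", with no mutation-driven re-uniquing.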

A named identified struct and its type name correspond one to one: no two identified structs share a name.

Because StructType::create always returns a new identified type, we need some behavior for when you try to create two types with the same name. The solution is that VMCore detects the conflict and autorenames the later request by adding a suffix to the type: when you request a "foo" type, you may actually get back a type named "foo.42". This is consistent with other IR objects like instructions and functions, and the names are uniqued at the LLVMContext level.
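The renaming rule can be sketched as a small name table. This is an assumption-laden model rather than VMCore's code: NameTable and claim are hypothetical, and the suffix is taken from a single counter shared by all names, which is one plausible way to end up with names like "foo.42".

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical context-wide table of identified struct names.
class NameTable {
    std::set<std::string> used_;
    unsigned nextSuffix_ = 0;  // shared counter used to build ".N" suffixes
public:
    // Returns `wanted` if it is free, otherwise the first free `wanted.N`.
    std::string claim(const std::string& wanted) {
        std::string name = wanted;
        while (!used_.insert(name).second)
            name = wanted + "." + std::to_string(++nextSuffix_);
        return name;
    }
};
```

Claiming "A" twice hands back "A" and then "A.1". The caller has to use the returned name, since the requested one may already belong to a different type.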

However, this also makes the linker's job harder. Consider two modules:

x.ll:
  %A = type { i32 }
  @G = external global %A
y.ll:
  %A = type { i32 }
  @G = global %A zeroinitializer

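Both modules are loaded into the same LLVMContext, but each defines a type named A. Because type names must be unique, the in-memory representation actually looks like this: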

x.ll module:
  %A = type { i32 }
  @G = external global %A
y.ll module:
  %A.1 = type { i32 }
  @G = global %A.1 zeroinitializer

... and now it is quite clear that the @G objects have different types. When linking these two global variables, it is now up to the linker to remap the types of IR objects into a consistent set of types, and rewrite things into a consistent state. This requires the linker to compute the set of identical types and solve the same graph isomorphism problems that VMCore used to (see the "remapType" logic in lib/Linker/LinkModules.cpp if you're interested).
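A heavily simplified version of that remapping is sketched below. It is not the algorithm in LinkModules.cpp: remapTypes and stripSuffix are hypothetical helpers, bodies are flat lists of strings, and two types are assumed to match when their names agree after dropping a numeric ".N" suffix and their bodies are identical. The real linker has to handle recursive types, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Body = std::vector<std::string>;          // element types as strings
using TypeTable = std::map<std::string, Body>;  // identified name -> body

// "A.1" -> "A". Names without a numeric ".N" suffix come back unchanged.
inline std::string stripSuffix(const std::string& name) {
    std::size_t dot = name.rfind('.');
    if (dot == std::string::npos || dot + 1 == name.size()) return name;
    for (std::size_t i = dot + 1; i < name.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(name[i]))) return name;
    return name.substr(0, dot);
}

// Map every source type onto a destination type that has the same base
// name and an identical body. Types without a match map to themselves.
inline std::map<std::string, std::string>
remapTypes(const TypeTable& dst, const TypeTable& src) {
    std::map<std::string, std::string> mapping;
    for (const auto& entry : src) {
        auto it = dst.find(stripSuffix(entry.first));
        bool same = it != dst.end() && it->second == entry.second;
        mapping[entry.first] = same ? it->first : entry.first;
    }
    return mapping;
}
```

With x.ll contributing %A = { i32 } and y.ll contributing %A.1 = { i32 }, the mapping sends "A.1" to "A", after which both @G declarations can be rewritten to use one type.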

4.3、Advantages of this change

Type names are also useful information that can be propagated while optimization passes run.

5、Concepts

LLVM Language Reference Manual

5.1、Opaque Structure Types

Opaque structure types are used to represent named structure types that do not have a body specified. This corresponds (for example) to the C notion of a forward declared structure.