LLVM full LTO study notes

电影旅行敲代码

Last modified: 2022-02-13 10:51:53

Category: LLVM source code series. Tags: C, programming languages, backend

First published: 2022-02-06 01:00:25

Article link: https://blog.csdn.net/dashuniuniu/article/details/122769486

These are rough notes; don't rely on them. Watch Teresa Johnson's talk instead.

What is LTO (omitted)

What do LLVM LTO objects contain?

We use the example from the "Example of link time optimization" section of the LLVM LTO documentation, shown below.

--- a.h ---
extern int foo1(void);
extern void foo2(void);
extern void foo4(void);

--- a.c ---
#include "a.h"

static signed int i = 0;

void foo2(void) {
  i = -1;
}

static int foo3() {
  foo4();
  return 10;
}

int foo1(void) {
  int data = 0;

  if (i < 0)
    data = foo3();

  data = data + 42;
  return data;
}

--- main.c ---
#include <stdio.h>
#include "a.h"

void foo4(void) {
  printf("Hi\n");
}

int main() {
  return foo1();
}

Compiling a.c with and without LTO shows that a_lto.o is 1.2 KB larger than a.o.

$ clang -flto -c a.c -o a_lto.o
$ clang -c a.c -o a.o
$ ls -alh
...
-rw-r--r-- 1 wangliushuai wangliushuai 1.6K Feb  2 14:41 a.o
-rw-r--r-- 1 wangliushuai wangliushuai 2.8K Feb  2 14:41 a_lto.o

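We dump the object files with hexdump. A magic number is what usually identifies a file format. a.o is certainly an ordinary ELF file, and the ELF magic number is 7F 45 (E) 4C (L) 46 (F). Note that hexdump prints 16-bit words in host byte order by default, so on little-endian x86 the bytes 7F 45 4C 46 appear as 457f 464c. We therefore focus on the magic number of a_lto.o, 4342 dec0, that is, the bytes 42 43 C0 DE.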

$ hexdump a_lto.o | head
0000000 4342 dec0 1435 0000 0005 0000 0c62 2430
0000010 594d 66be fb8d 4fb4 c81b 4424 3201 0005
0000020 0c21 0000 0262 0000 020b 0021 0002 0000
0000030 0016 0000 8107 9123 c841 4904 1006 3932

$ hexdump a.o | head
0000000 457f 464c 0102 0001 0000 0000 0000 0000
0000010 0001 003e 0001 0000 0000 0000 0000 0000

man ascii tells us that 42 and 43 are B and C. LLVM IR has three representations: textual, in-memory, and bitcode. So a reasonable guess is that BC stands for bitcode.

Oct   Dec   Hex   Char                        Oct   Dec   Hex   Char
------------------------------------------------------------------------
# ...
002   2     02    STX (start of text)         102   66    42    B
003   3     03    ETX (end of text)           103   67    43    C
# ...

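Following that clue, we find the bitcode magic number in the LLVM bitcode documentation, shown below. Each subscript is the width of the field in bits, and the four 4-bit fields pack into the bytes C0 DE, so the magic number matches 4342 dec0 exactly: 4 bytes in total.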

$[\text{'B'}_8, \text{'C'}_8, \text{0x0}_4, \text{0xC}_4, \text{0xE}_4, \text{0xD}_4]$
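
The magic-number check described above can be sketched in a few lines of C++. This is a minimal illustration, not how lld or LLVM identify files: the names FileKind, classifyMagic, and classifyFile are hypothetical, and only the two byte sequences discussed in this section are recognized.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical classification used only for this sketch.
enum class FileKind { Bitcode, Elf, Unknown };

// Classify a buffer by its first four bytes, in file order
// (not in the 16-bit word order that hexdump prints).
FileKind classifyMagic(const unsigned char *data, std::size_t size) {
  if (size < 4)
    return FileKind::Unknown;
  // LLVM bitcode: 'B' 'C' 0xC0 0xDE
  if (data[0] == 'B' && data[1] == 'C' && data[2] == 0xC0 && data[3] == 0xDE)
    return FileKind::Bitcode;
  // ELF: 0x7F 'E' 'L' 'F'
  if (data[0] == 0x7F && data[1] == 'E' && data[2] == 'L' && data[3] == 'F')
    return FileKind::Elf;
  return FileKind::Unknown;
}

// Read the first four bytes of a file and classify them.
// A file that cannot be opened or is too short yields Unknown.
FileKind classifyFile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  unsigned char buf[4] = {};
  in.read(reinterpret_cast<char *>(buf), sizeof(buf));
  return classifyMagic(buf, static_cast<std::size_t>(in.gcount()));
}
```

Under these assumptions, classifyFile("a_lto.o") would report Bitcode and classifyFile("a.o") would report Elf for the two files built above.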

A dedicated tool, llvm-bcanalyzer, analyzes the bitcode file format. Its dump contains a lot of data.

$ llvm-bcanalyzer -dump a_lto.o
# ...
Summary of a_lto.o:
         Total size: 22592b/2824.00B/706W
        Stream type: LLVM IR
  # Toplevel Blocks: 4
  # ...
  Block ID #12 (FUNCTION_BLOCK):
      Num Instances: 3
         Total Size: 956b/119.50B/29W
    Percent of file: 4.2316%
       Average Size: 318.67/39.83B/9W
  Tot/Avg SubBlocks: 6/2.000000e+00
    Tot/Avg Abbrevs: 0/0.000000e+00
    Tot/Avg Records: 20/6.666667e+00
    Percent Abbrevs: 35.0000%

	Record Histogram:
		  Count    # Bits     b/Rec   % Abv  Record Kind
		      4       184      46.0          INST_STORE
		      3        57      19.0  100.00  INST_LOAD
		      3        24       8.0  100.00  INST_RET
		      3        66      22.0          DECLAREBLOCKS
		      2       128      64.0          INST_CALL
		      2        56      28.0          INST_BR
		      1        40                    INST_CMP2
		      1        46                    INST_ALLOCA
		      1        28            100.00  INST_BINOP

  Block ID #13 (IDENTIFICATION_BLOCK_ID):
# ...

  Block ID #14 (VALUE_SYMTAB):
# ...
  Block ID #15 (METADATA_BLOCK):
# ...
  Block ID #17 (TYPE_BLOCK_ID):
# ...
  Block ID #21 (OPERAND_BUNDLE_TAGS_BLOCK):
# ...
  Block ID #22 (METADATA_KIND_BLOCK):
# ...
  Block ID #23 (STRTAB_BLOCK):
# ...
  Block ID #24 (FULL_LTO_GLOBALVAL_SUMMARY_BLOCK):
# ...
  Block ID #25 (SYMTAB_BLOCK):
# ...

The bitcode file organizes its data in a defined format, which we won't analyze in detail. Instead we use llvm-dis to convert it into human-readable form, which shows that a_lto.o simply encodes LLVM IR.

A careful reading of the LLVM documentation turns up this passage (from the ThinLTO documentation):
"In ThinLTO mode, as with regular LTO, clang emits LLVM bitcode after the compile phase. The ThinLTO bitcode is augmented with a compact summary of the module. During the link step, only the summaries are read and merged into a combined summary index, which includes an index of function locations for later cross-module function importing. Fast and efficient whole-program analysis is then performed on the combined summary index."

; ModuleID = 'a_lto.o'
source_filename = "a.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@i = internal global i32 0, align 4

; Function Attrs: noinline nounwind optnone uwtable
define dso_local void @foo2() #0 {
entry:
  store i32 -1, i32* @i, align 4
  ret void
}

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @foo1() #0 {
entry:
  %data = alloca i32, align 4
  store i32 0, i32* %data, align 4
  %0 = load i32, i32* @i, align 4
  %cmp = icmp slt i32 %0, 0
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  %call = call i32 @foo3()
  store i32 %call, i32* %data, align 4
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  %1 = load i32, i32* %data, align 4
  %add = add nsw i32 %1, 42
  store i32 %add, i32* %data, align 4
  %2 = load i32, i32* %data, align 4
  ret i32 %2
}

; Function Attrs: noinline nounwind optnone uwtable
define internal i32 @foo3() #0 {
entry:
  call void @foo4()
  ret i32 10
}

declare dso_local void @foo4() #1

attributes #0 = { noinline nounwind optnone uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"uwtable", i32 1}
!2 = !{i32 7, !"frame-pointer", i32 2}
!3 = !{i32 1, !"ThinLTO", i32 0}
!4 = !{i32 1, !"EnableSplitLTOUnit", i32 1}
!5 = !{!"clang version 14.0.0 (https://github.com/llvm/llvm-project.git 58e7bf78a3ef724b70304912fb3bb66af8c4a10c)"}

^0 = module: (path: "a_lto.o", hash: (0, 0, 0, 0, 0))
^1 = gv: (name: "foo2", summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 1, live: 0, dsoLocal: 1, canAutoHide: 0), insts: 2, funcFlags: (readNone: 0, readOnly: 0, noRecurse: 0, returnDoesNotAlias: 0, noInline: 1, alwaysInline: 0, noUnwind: 1, mayThrow: 0, hasUnknownCall: 0, mustBeUnreachable: 0), refs: (^2)))) ; guid = 2494702099028631698
^2 = gv: (name: "i", summaries: (variable: (module: ^0, flags: (linkage: internal, visibility: default, notEligibleToImport: 1, live: 0, dsoLocal: 1, canAutoHide: 0), varFlags: (readonly: 1, writeonly: 1, constant: 0)))) ; guid = 2708120569957007488
^3 = gv: (name: "foo1", summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 1, live: 0, dsoLocal: 1, canAutoHide: 0), insts: 13, funcFlags: (readNone: 0, readOnly: 0, noRecurse: 0, returnDoesNotAlias: 0, noInline: 1, alwaysInline: 0, noUnwind: 1, mayThrow: 0, hasUnknownCall: 0, mustBeUnreachable: 0), calls: ((callee: ^5)), refs: (^2)))) ; guid = 7682762345278052905
^4 = gv: (name: "foo4") ; guid = 11564431941544006930
^5 = gv: (name: "foo3", summaries: (function: (module: ^0, flags: (linkage: internal, visibility: default, notEligibleToImport: 1, live: 0, dsoLocal: 1, canAutoHide: 0), insts: 2, funcFlags: (readNone: 0, readOnly: 0, noRecurse: 0, returnDoesNotAlias: 0, noInline: 1, alwaysInline: 0, noUnwind: 1, mayThrow: 0, hasUnknownCall: 0, mustBeUnreachable: 0), calls: ((callee: ^4))))) ; guid = 17367728344439303071
^6 = flags: 8
^7 = blockcount: 5

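One question remains: what happens if -flto is passed at compile time but not at link time? The linker is then handed LLVM IR rather than native object code. Does the link still succeed? It does, provided the linker understands bitcode. lld, for example, chooses the appropriate function for each input according to the type of the object file, so LTO bitcode files are handled along the path LinkerDriver::link -> compileBitcodeFiles. We skip the details for now and come back to them later.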

// Do actual linking. Note that when this function is called,
// all linker scripts have already been parsed.
template <class ELFT> void LinkerDriver::link(opt::InputArgList &args) {
  // ...
  if (!bitcodeFiles.empty()) {
    // ...

    // Do link-time optimization if given files are LLVM bitcode files.
    // This compiles bitcode files into real object files.
    //
    // With this the symbol table should be complete. After this, no new names
    // except a few linker-synthesized ones will be added to the symbol table.
    compileBitcodeFiles<ELFT>();

    // ...
  }
  // ...
}

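At this point we know that for full LTO the compiler output is a bitcode file that stores LLVM IR, and that lld chooses the appropriate function to process it according to the object file type.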

How the LTO process works

First, here is the full lld command used to process the LTO objects.

~/workspace/llvm-project/build/bin/ld.lld --hash-style=both --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o exe /lib/x86_64-linux-gnu/crt1.o /lib/x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/8/crtbegin.o -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/usr/local/bin/../lib -L/lib -L/usr/lib -plugin-opt=mcpu=x86-64 a-lto.o main-lto.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-linux-gnu/8/crtend.o /lib/x86_64-linux-gnu/crtn.o

Next comes lld's overall execution flow, which falls into three parts:

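Preparation: handles the configuration, searches for and opens the input files, and creates the corresponding lld objects for them;
LTO Backend: if lld finds LTO objects while processing its inputs (as noted earlier, LTO objects are simply bitcode-format files), it sets up a BitcodeCompiler and enters lld's LTO process;
- compute dead functions
- link the modules into a single IR Module
- update visibility
- build the LTO optimization pipeline
- run the actual optimizations
- code generation
lld's linking process
- all the "files" are ready at this point; aggregate the input sections
- gc-sections
- compute each symbol's offset and related information once the sections are aggregated, and perform relocation
- Identical Code Folding
- write the final output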

The detailed process is shown in the figure below:
[Figure: detailed lld execution flow]

Preparation

1. Configuration

2. Search and Open Files

3. Add and Create Files

4. Parse the Files

LTO Backend

5-(1) Compute Dead Functions

The IR we obtained earlier by running llvm-dis on the LTO-compiled bitcode file contains a series of gv entries. Each one is a Global Value Summary Entry.

There can be multiple entries in a combined summary index for symbols with weak linkage.
Compiling with ThinLTO causes the building of a compact summary of the module that is emitted into the bitcode. The summary is emitted into the LLVM assembly and identified in syntax by a caret (‘^’).

The figure below shows the gv entries in main_lto.o:
[Figure: gv entries in main_lto.o]

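The figure below shows the gv entries in a_lto.o (the same entries appear as ^1 to ^5 at the end of the llvm-dis output above):
[Figure: gv entries in a_lto.o]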

computeDeadSymbolsWithConstProp() calls computeDeadSymbolsAndUpdateIndirectCalls(), which starts from a root set and uses a worklist algorithm to compute the live set.

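In this example the root set contains only the main function. It is first marked live and pushed onto the worklist, and the worklist is then processed repeatedly by following the refs and calls edges.
[Figure: worklist traversal starting from main]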
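
The worklist idea can be sketched as follows. This is a simplified model, not LLVM's implementation: SummaryGraph, computeLive, and exampleGraph are hypothetical names, and the edges are transcribed by hand from the example's C source and the calls/refs fields in the summary above.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the summary index: each defined symbol maps to
// the symbols it calls or references.
using SummaryGraph = std::map<std::string, std::vector<std::string>>;

// Mark everything reachable from the roots as live, using a worklist.
std::set<std::string> computeLive(const SummaryGraph &graph,
                                  const std::vector<std::string> &roots) {
  std::set<std::string> live;
  std::vector<std::string> worklist;
  for (const std::string &root : roots)
    if (live.insert(root).second)
      worklist.push_back(root);

  while (!worklist.empty()) {
    std::string symbol = worklist.back();
    worklist.pop_back();
    auto it = graph.find(symbol);
    if (it == graph.end())
      continue; // Declaration only (for example printf): no outgoing edges.
    for (const std::string &target : it->second)
      if (live.insert(target).second) // Newly discovered, so visit it later.
        worklist.push_back(target);
  }
  return live;
}

// Edges for a.c and main.c, written out by hand for this sketch.
SummaryGraph exampleGraph() {
  return {
      {"main", {"foo1"}},
      {"foo1", {"foo3", "i"}},
      {"foo2", {"i"}},
      {"foo3", {"foo4"}},
      {"foo4", {"printf"}},
  };
}
```

With main as the only root, the traversal reaches foo1, foo3, foo4, and i but never foo2, so foo2 ends up outside the live set in this model.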

5-(2) Link the IRs together

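In this step the IRLinker "links" each source module into the destination module, based on the liveness information computed earlier, the symbol resolution results, and symbol visibility. It is mainly a process of materializing definitions on demand.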

5-(3) Update Call Visibility

5-(4) Build LTO Optimization Pipeline

buildLTODefaultPipeline predefines a set of passes and adds them to a ModulePassManager.

  /// Build an LTO default optimization pipeline to a pass manager.
  ///
  /// This provides a good default optimization pipeline for link-time
  /// optimization and code generation. It is particularly tuned to fit well
  /// when IR coming into the LTO phase was first run through \c
  /// addPreLinkLTODefaultPipeline, and the two coordinate closely.
  ///
  /// Note that \p Level cannot be `O0` here. The pipelines produced are
  /// only intended for use when attempting to optimize code. If frontends
  /// require some transformations for semantic reasons, they should explicitly
  /// build them.
  ModulePassManager buildLTODefaultPipeline(OptimizationLevel Level,
                                            ModuleSummaryIndex *ExportSummary);

5-(5) Optimization

  /// Run all of the passes in this manager over the given unit of IR.
  /// ExtraArgs are passed to each pass.
  PreservedAnalyses run(IRUnitT &IR, AnalysisManagerT &AM,
                        ExtraArgTs... ExtraArgs) {
  }
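
Conceptually, building a pipeline and then calling run amounts to registering passes in order and applying each one to the same IR unit. The toy model below only illustrates that shape; ToyModule, ToyPassManager, buildToyPipeline, and the pass names are all hypothetical, and none of this is LLVM's PassManager API.

```cpp
#include <algorithm>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy "module": function names plus the set known to be live.
struct ToyModule {
  std::vector<std::string> functions;
  std::set<std::string> live;
};

// Hypothetical toy pass manager: an ordered list of named passes.
class ToyPassManager {
public:
  using Pass = std::function<void(ToyModule &)>;

  void addPass(std::string name, Pass pass) {
    passes.emplace_back(std::move(name), std::move(pass));
  }

  // Run every pass in insertion order and return the names in run order.
  std::vector<std::string> run(ToyModule &module) const {
    std::vector<std::string> trace;
    for (const auto &entry : passes) {
      entry.second(module);
      trace.push_back(entry.first);
    }
    return trace;
  }

private:
  std::vector<std::pair<std::string, Pass>> passes;
};

// Analogue of a "build default pipeline" helper: a fixed, predefined order.
ToyPassManager buildToyPipeline() {
  ToyPassManager pm;
  pm.addPass("drop-dead-functions", [](ToyModule &m) {
    auto isDead = [&m](const std::string &f) { return m.live.count(f) == 0; };
    m.functions.erase(
        std::remove_if(m.functions.begin(), m.functions.end(), isDead),
        m.functions.end());
  });
  pm.addPass("sort-functions", [](ToyModule &m) {
    std::sort(m.functions.begin(), m.functions.end());
  });
  return pm;
}
```

The real pass managers also thread analysis results and PreservedAnalyses through each pass, which this sketch leaves out entirely.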

5-(6) Code Generation

After LTO finishes, createObjectFile() is called on the codegen output to create an InputFile that lld can process, named lto.tmp. From then on it is treated like a regular file, and parse() is run on lto.tmp.

Link Stuff

6. Aggregate InputSections

At this point all the preparation for linking is done. There are 38 sections in total, and we can see that a_lto.o and main_lto.o are gone: only a single lto.tmp remains.

[Figure: the aggregated input sections]

7. GC Sections

8. Finalize InputSections, Compute Offset

// This function scans over the InputSectionBase list sectionBases to create
// InputSectionDescription::sections.
//
// It removes MergeInputSections from the input section array and adds
// new synthetic sections at the location of the first input section
// that it replaces. It then finalizes each synthetic section in order
// to compute an output offset for each piece of each input section.
void OutputSection::finalizeInputSections() {}

// This function is very hot (i.e. it can take several seconds to finish)
// because sometimes the number of inputs is in an order of magnitude of
// millions. So, we use multi-threading.
//
// For any strings S and T, we know S is not mergeable with T if S's hash
// value is different from T's. If that's the case, we can safely put S and
// T into different string builders without worrying about merge misses.
// We do it in parallel.
void MergeNoTailSection::finalizeContents() {}
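
The hashing argument in the comment above can be illustrated with a small sketch. Strings whose hashes differ can never be merged, so each string can be routed to a shard chosen by its hash and deduplicated there independently. The names Shard, mergeStrings, and totalSize are hypothetical, the shards are filled sequentially here rather than on multiple threads, and this is not lld's code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-shard string builder: maps each unique string to its
// offset inside the shard and tracks the shard's total size in bytes.
struct Shard {
  std::unordered_map<std::string, std::size_t> offsets;
  std::size_t size = 0;
};

// Route every string to a shard by hash, then deduplicate within the shard.
// Equal strings have equal hashes, so duplicates always meet in one shard.
std::vector<Shard> mergeStrings(const std::vector<std::string> &inputs,
                                std::size_t numShards) {
  assert(numShards > 0 && "need at least one shard");
  std::vector<Shard> shards(numShards);
  for (const std::string &s : inputs) {
    Shard &shard = shards[std::hash<std::string>{}(s) % numShards];
    // emplace only inserts when the string is new to this shard.
    if (shard.offsets.emplace(s, shard.size).second)
      shard.size += s.size() + 1; // +1 for the terminating NUL byte.
  }
  return shards;
}

// Total bytes across all shards after deduplication.
std::size_t totalSize(const std::vector<Shard> &shards) {
  std::size_t total = 0;
  for (const Shard &shard : shards)
    total += shard.size;
  return total;
}
```

Because no shard ever needs to look at another shard's strings, each one could be handed to a separate thread, which is the property the comment relies on.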

9. Identical Code Folding

// ICF is short for Identical Code Folding. This is a size optimization to
// identify and merge two or more read-only sections (typically functions)
// that happened to have the same contents. It usually reduces output size
// by a few percent.
//
// In ICF, two sections are considered identical if they have the same
// section flags, section data, and relocations. Relocations are tricky,
// because two relocations are considered the same if they have the same
// relocation types, values, and if they point to the same sections *in
// terms of ICF*.

// [1] Safe ICF: Pointer Safe and Unwinding aware Identical Code Folding
// in the Gold Linker
// http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36912.pdf
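
The "same sections in terms of ICF" condition is circular, which is why it is usually resolved by iterating. The sketch below shows one simple way to do that with partition refinement: start from classes keyed on flags and data, then split classes until relocation targets agree. ToySection and computeIcfClasses are hypothetical names, relocation types and offsets are ignored, and this is a toy model rather than lld's algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified section: flags, raw contents, and the indices of
// the sections its relocations point to.
struct ToySection {
  unsigned flags;
  std::string data;
  std::vector<std::size_t> relocTargets;
};

// Return one class id per section; equal ids mean "foldable" in this model.
std::vector<std::size_t>
computeIcfClasses(const std::vector<ToySection> &sections) {
  const std::size_t n = sections.size();
  std::vector<std::size_t> classes(n);

  // Round 0: sections with equal flags and data share a class.
  std::map<std::pair<unsigned, std::string>, std::size_t> initialIds;
  for (std::size_t i = 0; i < n; ++i) {
    auto key = std::make_pair(sections[i].flags, sections[i].data);
    classes[i] =
        initialIds.emplace(std::move(key), initialIds.size()).first->second;
  }
  std::size_t classCount = initialIds.size();

  // Refinement: split a class when members' relocation targets fall into
  // different classes. Classes only ever split, so an unchanged class
  // count means a fixed point has been reached.
  for (;;) {
    std::map<std::pair<std::size_t, std::vector<std::size_t>>, std::size_t> ids;
    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i < n; ++i) {
      std::vector<std::size_t> targetClasses;
      for (std::size_t target : sections[i].relocTargets) {
        assert(target < n && "relocation target out of range");
        targetClasses.push_back(classes[target]);
      }
      auto key = std::make_pair(classes[i], std::move(targetClasses));
      next[i] = ids.emplace(std::move(key), ids.size()).first->second;
    }
    classes = std::move(next);
    if (ids.size() == classCount)
      return classes;
    classCount = ids.size();
  }
}
```

In this model two callers with identical bytes fold together only if the sections they point to also fold together, which mirrors the recursive definition in the comment.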

10. Write the Result
