Debugging How LLVM Constructs SSA

Raw debugging notes, rough content, not worth reading


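In the article "Constructing SSA" I described how SSA is built, from placing ϕ-functions and renaming through to the later SSA destruction. This post steps through LLVM in a debugger to show how it constructs the final SSA form.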

int fac(int num) {
	if (num == 1)
		return 1;
	return num * fac(num - 1);
}
int main() {
	fac(10);
}

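Before looking at how LLVM builds SSA, let's first see how to obtain IR that contains ϕ-instructions. If you are not familiar with the IR, the talk "2019 EuroLLVM Developers’ Meeting: V. Bridgers & F. Piovezan “LLVM IR Tutorial - Phis, GEPs …”" is the best video introduction to LLVM IR.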

Clang itself does not produce optimized LLVM IR. It produces fairly straightforward IR wherein locals are kept in memory (using allocas). The optimizations are done by opt on LLVM IR level, and one of the most important optimizations is indeed mem2reg which makes sure that locals are represented in LLVM’s SSA values instead of memory. - "How to get “phi” instruction in llvm without optimization"

// test.c
int foo(int a, int b) {
	int r;
	if (a > b)
		r = a;
	else 
		r = b;
	return r;
}

For the code above, the IR that clang emits directly is shown below. As we can see, it is still very raw.

// clang -S -emit-llvm test.c -o test_original.ll
; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"
; Function Attrs: noinline nounwind optnone ssp uwtable
define i32 @foo(i32 %a, i32 %b) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  %r = alloca i32, align 4
  store i32 %a, i32* %a.addr, align 4
  store i32 %b, i32* %b.addr, align 4
  %0 = load i32, i32* %a.addr, align 4
  %1 = load i32, i32* %b.addr, align 4
  %cmp = icmp sgt i32 %0, %1
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  %2 = load i32, i32* %a.addr, align 4
  store i32 %2, i32* %r, align 4
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = load i32, i32* %b.addr, align 4
  store i32 %3, i32* %r, align 4
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %4 = load i32, i32* %r, align 4
  ret i32 %4
}

attributes #0 = { noinline nounwind optnone ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 10, i32 15]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 7, !"PIC Level", i32 2}
!3 = !{!"clang version 10.0.0 (https://github.com/llvm/llvm-project.git 36663d506e31a43934f10dff5a3020d3aad41ef1)"}

LLVM uses -mem2reg to remove the alloca, store, and load instructions from the IR above and to convert the code into SSA IR.

This file promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form. - mem2reg: Promote Memory to Register

Commands for generating SSA IR

The commands that produce IR containing ϕ-instructions are:

$clang -S -emit-llvm -Xclang -disable-O0-optnone test.c // emit human-readable IR
$opt -mem2reg test.ll -o test.bc // convert the IR to SSA form
$llvm-dis test.bc // use llvm-dis to produce a human-readable form

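In the commands above, -disable-O0-optnone removes the optnone attribute so that opt is able to run passes on the function. The first command produces the following: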

; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"
; Function Attrs: noinline nounwind ssp uwtable
define i32 @foo(i32 %a, i32 %b) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  %r = alloca i32, align 4
  store i32 %a, i32* %a.addr, align 4
  store i32 %b, i32* %b.addr, align 4
  %0 = load i32, i32* %a.addr, align 4
  %1 = load i32, i32* %b.addr, align 4
  %cmp = icmp sgt i32 %0, %1
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  %2 = load i32, i32* %a.addr, align 4
  store i32 %2, i32* %r, align 4
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = load i32, i32* %b.addr, align 4
  store i32 %3, i32* %r, align 4
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %4 = load i32, i32* %r, align 4
  ret i32 %4
}

attributes #0 = { noinline nounwind ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 10, i32 15]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 7, !"PIC Level", i32 2}
!3 = !{!"clang version 10.0.0 (https://github.com/llvm/llvm-project.git 36663d506e31a43934f10dff5a3020d3aad41ef1)"}

The result of the second command, disassembled with llvm-dis, is:

; ModuleID = 'test.bc'
source_filename = "test.c"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"
; Function Attrs: noinline nounwind ssp uwtable
define i32 @foo(i32 %a, i32 %b) #0 {
entry:
  %cmp = icmp sgt i32 %a, %b
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  br label %if.end

if.else:                                          ; preds = %entry
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %r.0 = phi i32 [ %a, %if.then ], [ %b, %if.else ]
  ret i32 %r.0
}

attributes #0 = { noinline nounwind ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 10, i32 15]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 7, !"PIC Level", i32 2}
!3 = !{!"clang version 10.0.0 (https://github.com/llvm/llvm-project.git 36663d506e31a43934f10dff5a3020d3aad41ef1)"}

IRGen: Add optnone attribute on function during O0
[llvm-dev] Clang/LLVM 5.0 optnone attribute with -O0
LLVM opt mem2reg has no effect
Assignment 1: Introduction to LLVM
-O0 is not a recommended option for clang
opt is defunct when code built without optimizations

DominatorTreeWrapperPass

Dominator information is computed by DominatorTreeWrapperPass, which is also the first pass that the command opt -mem2reg test.ll -o test.bc runs on this module.

compute dominator tree & dominance frontier

In 2017 LLVM replaced the traditional LT (Lengauer-Tarjan) algorithm with the Semi-NCA algorithm for computing dominator information; see "2017 LLVM Developers’ Meeting: J. Kuderski “Dominator Trees and incremental updates that transcend time”".

First, use the command opt -dot-cfg ... to generate the CFG of the example code, shown below:
[Figure: cfg]

The SEMI-NCA algorithm

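The Semi-NCA algorithm was brought into LLVM by the patch "[Dominators] Use Semi-NCA instead of SLT to calculate dominators".
Note: for the details of the Semi-NCA algorithm, see "Revisiting the Computation of the Dominator Tree".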

Debug Process

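[Figure: debug process]
The figure above shows the call chain leading up to the entry of DominatorTreeWrapperPass. As we can see, the dominator pass is only a small part of the many passes involved. The inheritance relationships among the classes along the way are as follows:

The corresponding code and its comments are shown below: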

// LegacyPassManager.cpp

// PassManager manages ModulePassManagers
class PassManager : public PassManagerBase{
	// ...
};

// run - Execute all of the passes scheduled for execution. Keep track of 
// whether any of the passes modifies the module, and if so, return true.
bool PassManager::run(Module &M) {
	return PM->run(M);
}
//===----------------------------------------------------------------------===//
// MPPassManager
//
// MPPassManager manages ModulePasses and function pass managers.
// It batches all Module passes and function pass managers together and
// sequences them to process one module.
class MPPassManager : public Pass, public PMDataManager {
	// ...
};

// Execute all of the passes scheduled for execution by invoking
// runOnModule method. Keep track of whether any of the passes modifies
// the module, and if so, return true.
bool MPPassManager::runOnModule(Module &M) {
	// ...
	for (unsigned Index = 0; Index < getNumContainedPasses(); ++Index) {
		// ...
		LocalChanged |= MP->runOnModule(M);
		// ...
	}
}
// LegacyPassManager.cpp::FPPassManager::runOnFunction

// FPPassManager manages BBPassManagers and FunctionPasses.
// It batches all function passes and basic block pass managers together and 
// sequence them to process one function at a time before processing next
// function.
class FPPassManager : public ModulePass, public PMDataManager {
// ...
};

// Execute all of the passes scheduled for execution by invoking
// runOnFunction method. Keep track of whether any of the passes modifies
// the function, and if so, return true.
bool FPPassManager::runOnFunction(Function &F) {
  // ...
  for (unsigned Index = 0; Index < getNumContainedPasses(); ++Index) {
  	FunctionPass *FP = getContainedPass(Index);
  	bool LocalChanged = false;

	{
		// ...
		LocalChanged |= FP->runOnFunction(F);
		// ...
	}
  }
  // ...
}
bool DominatorTreeWrapperPass::runOnFunction(Function &F) {
  DT.recalculate(F);
  return false;
}

DominatorTreeBase::recalculate

Now we enter the actual dominator tree computation; SemiNCAInfo<DomTreeT>::CalculateFromScratch performs the concrete work.

/// recalculate - compute a dominator tree for the given function
void recalculate(ParentType &Func) {
  Parent = &Func;
  DomTreeBuilder::Calculate(*this);
}
// ...
template <class DomTreeT>
void Calculate(DomTreeT &DT) {
  SemiNCAInfo<DomTreeT>::CalculateFromScratch(DT, nullptr);
}

SemiNCAInfo<DomTreeT>::CalculateFromScratch is a textbook implementation of the Semi-NCA algorithm: the first step is doFullDFSWalk, and the second step is runSemiNCA.

static void CalculateFromScratch(DomTree &DT, BatchUpdatePtr BUI) {
	auto *Parent = DT.Parent;
	DT.reset();
	DT.Parent = Parent;
	SemiNCAInfo SNCA(nullptr); // Since we are rebuilding the whole tree,
							   // there is no point doing it incrementally.

	// Step #0: Number blocks in depth-first order and initialize variables used 
	// in later stages of the algorithm.
	DT.Roots = FindRoots(DT, nullptr);
	SNCA.doFullDFSWalk(DT, AlwaysDescend);

	SNCA.runSemiNCA(DT);
	if (BUI) {
		BUI->IsRecalculated = true;
		LLVM_DEBUG(
			dbgs() << "DomTree recalculated, skipping future batch updates\n");
	}

	if (DT.Roots.empty()) return;

	// Add a node for the root. If the tree is a PostDominatorTree it will be
	// the virtual exit (denoted by (BasicBlock *) nullptr) which postdominates
	// all real exits (including multiple exit blocks, infinite loops).
	NodePtr Root = IsPostDom ? nullptr : DT.Roots[0];

	DT.RootNode = (DT.DomTreeNodes[Root] = 
					std::make_unique<DomTreeNodeBase<NodeT>>(Root, nullptr)).get();
	SNCA.attachNewSubTree(DT, DT.RootNode);
}
runDFS

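runDFS is a typical stack-based depth-first traversal. It assigns a DFS number to each BasicBlock and records the reverse-children (predecessor) relation; I won't go into detail here.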

// Custom DFS implementation which can skip nodes based on a provided
// predicate. It also collects ReverseChildren so that we don't have to spend 
// time getting predecessors in SemiNCA.
//
// If IsReverse is set to true, the DFS walk will be performed backwards
// relative to IsPostDom -- using reverse edges for dominators and forward
// edges for postdominators.
template <bool IsReverse = false, typename DescendCondition>
unsigned runDFS(NodePtr V, unsigned LastNum, DescendCondition Condition, unsigned AttachToNum) {
	// ...
}

After runDFS, the original CFG looks like this.
[Figure: DFS]
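The kind of walk runDFS performs can be imitated on a toy graph. The following is a minimal sketch, not LLVM's code: the `Graph` alias, the `DFSResult` struct, and the `numberBlocks` helper are hypothetical names, and blocks are plain integers rather than BasicBlock pointers.

```cpp
#include <cassert>
#include <vector>

// Toy CFG: blocks are indices, Succs[b] lists the successors of block b.
using Graph = std::vector<std::vector<int>>;

// Hypothetical result record, loosely modeled on the per-node info in the text.
struct DFSResult {
  std::vector<int> NumToNode;                    // DFS number -> block (slot 0 unused)
  std::vector<int> DFSNum;                       // block -> DFS number (0 = not reached)
  std::vector<int> Parent;                       // block -> DFS number of its tree parent
  std::vector<std::vector<int>> ReverseChildren; // block -> predecessors seen in the walk
};

// Iterative depth-first walk with an explicit stack; numbering starts at 1.
DFSResult numberBlocks(const Graph &Succs, int Entry) {
  DFSResult R;
  R.NumToNode.push_back(-1);
  R.DFSNum.assign(Succs.size(), 0);
  R.Parent.assign(Succs.size(), 0);
  R.ReverseChildren.resize(Succs.size());

  std::vector<int> Stack{Entry};
  while (!Stack.empty()) {
    int BB = Stack.back();
    Stack.pop_back();
    if (R.DFSNum[BB] != 0)
      continue; // already numbered through another path
    R.DFSNum[BB] = static_cast<int>(R.NumToNode.size());
    R.NumToNode.push_back(BB);
    for (int Succ : Succs[BB]) {
      // Record the edge backwards so predecessors are available later.
      R.ReverseChildren[Succ].push_back(BB);
      if (R.DFSNum[Succ] == 0) {
        R.Parent[Succ] = R.DFSNum[BB]; // the last pusher wins, and is popped first
        Stack.push_back(Succ);
      }
    }
  }
  return R;
}
```

Encoding the example CFG as `{{1, 2}, {3}, {3}, {}}` (0 = entry, 1 = if.then, 2 = if.else, 3 = if.end, an assumed numbering), if.end ends up with two reverse children, one per incoming edge. The exact DFS numbers depend on the order in which successors are pushed.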

runSemiNCA

runSemiNCA splits into the two classic steps: the first computes the sdom values in reverse preorder, and the second computes the idom values in preorder by way of the NCA.

// This function requires DFS to be run before calling it.
void runSemiNCA(DomTreeT &DT, const unsigned MinLevel = 0) {
	const unsigned NextDFSNum(NumToNode.size());
	// Initialize IDoms to spanning tree parents.
	for (unsigned i = 1; i < NextDFSNum; ++i) {
		const NodePtr V = NumToNode[i];
		auto &VInfo = NodeToInfo[V];
		VInfo.IDom = NumToNode[VInfo.Parent];
	}

	// Step #1: Calculate the semidominators of all vertices.
	SmallVector<InfoRec *, 32> EvalStack;
	for (unsigned i = NextDFSNum - 1; i >= 2; --i) {
		NodePtr W = NumToNode[i];
		auto &WInfo = NodeToInfo[W];

		// Initialize the semi dominator to point to the parent node.
		WInfo.Semi = WInfo.Parent;
		for (const auto &N : WInfo.ReverseChildren) {
			if (NodeToInfo.count(N) == 0) // Skip unreachable predecessors.
				continue;
			
			const TreeNodePtr TN = DT.getNode(N);
			// Skip predecessors whose level is above the subtree we are processing.
			if (TN && TN->getLevel() < MinLevel)
				continue;
			
			unsigned SemiU = NodeToInfo[eval(N, i + 1, EvalStack)].Semi;
			if (SemiU < WInfo.Semi) WInfo.Semi = SemiU;
		}
	}

	// Step #2: Explicitly define the immediate dominator of each vertex.
	// 			IDom[i] = NCA(SDom[i], SpanningTreeParent(i)).
	// Note that the parents were stored in IDoms and later got invalidated
	// during path compression in Eval.
	for (unsigned i = 2; i < NextDFSNum; ++i) {
		const NodePtr W = NumToNode[i];
		auto &WInfo = NodeToInfo[W];
		const unsigned SDomNum = NodeToInfo[NumToNode[WInfo.Semi]].DFSNum;
		NodePtr WIDomCandidate = WInfo.IDom;
		while (NodeToInfo[WIDomCandidate].DFSNum > SDomNum)
			WIDomCandidate = NodeToInfo[WIDomCandidate].IDom;
		
		WInfo.IDom = WIDomCandidate;
	}
}

After Step #1 completes, the CFG looks like this.
[Figure: eval]
After Step #2 completes, the CFG looks like this.
[Figure: IDom]
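To make the two steps concrete, here is a minimal sketch of Semi-NCA on a toy graph. It is an illustration under stated assumptions, not LLVM's implementation: `Graph` and `computeIDoms` are hypothetical names, the DFS is recursive, and the eval step walks the spanning tree naively instead of using path compression.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

using Graph = std::vector<std::vector<int>>; // Succs[b] = successors of block b

// Returns the immediate dominator of every block reachable from Entry.
// The entry maps to itself; unreachable blocks map to -1.
std::vector<int> computeIDoms(const Graph &Succs, int Entry) {
  const int N = static_cast<int>(Succs.size());
  std::vector<int> Num(N, 0);   // block -> preorder number (0 = unreachable)
  std::vector<int> Block{-1};   // preorder number -> block
  std::vector<int> Parent{0};   // preorder number -> parent's preorder number
  std::vector<std::vector<int>> Preds(N);

  std::function<void(int, int)> DFS = [&](int BB, int ParentNum) {
    Num[BB] = static_cast<int>(Block.size());
    Block.push_back(BB);
    Parent.push_back(ParentNum);
    for (int S : Succs[BB]) {
      Preds[S].push_back(BB); // only reachable predecessors are recorded
      if (Num[S] == 0)
        DFS(S, Num[BB]);
    }
  };
  DFS(Entry, 0);

  const int Last = static_cast<int>(Block.size()) - 1;
  std::vector<int> Semi(Last + 1, 0), IDom(Last + 1, 0);
  for (int W = 1; W <= Last; ++W) {
    Semi[W] = W;
    IDom[W] = Parent[W]; // initialize IDoms to spanning-tree parents
  }

  // Step 1: semidominators, in reverse preorder.
  for (int W = Last; W >= 2; --W) {
    int Best = Parent[W];
    for (int PredBB : Preds[Block[W]]) {
      int P = Num[PredBB];
      if (P < W) {
        Best = std::min(Best, P);
        continue;
      }
      // Naive eval: scan the already processed ancestors of P numbered above W.
      for (int U = P; U > W; U = Parent[U])
        Best = std::min(Best, Semi[U]);
    }
    Semi[W] = Best;
  }

  // Step 2: IDom[W] = NCA(Semi[W], Parent[W]), in preorder.
  for (int W = 2; W <= Last; ++W) {
    int Cand = IDom[W];
    while (Cand > Semi[W])
      Cand = IDom[Cand];
    IDom[W] = Cand;
  }

  std::vector<int> Result(N, -1);
  Result[Entry] = Entry;
  for (int W = 2; W <= Last; ++W)
    Result[Block[W]] = Block[IDom[W]];
  return Result;
}
```

For the example CFG encoded as `{{1, 2}, {3}, {3}, {}}` (an assumed numbering with 0 = entry and 3 = if.end), every block has the entry as its immediate dominator, since neither branch alone dominates if.end.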

mem2reg

The mem2reg pass lives in llvm/lib/Transforms/Utils/Mem2Reg.cpp. I set a breakpoint inside the body of Mem2Reg.cpp::PromoteLegacyPass::runOnFunction; the call stack is shown below.

// commit 36663d506e31a43934f10dff5a3020d3aad41ef1
// vscode lldb

// Call Stack
(anonymous namespace)::PromoteLegacyPass::runOnFunction(llvm::Function&)   Mem2Reg.cpp
llvm::FPPassManager::runOnFunction(llvm::Function&)                        LegacyPassManager.cpp
llvm::FPPassManager::runOnModule(llvm::Module&)                            LegacyPassManager.cpp
(anonymous namespace)::MPPassManager::runOnModule(llvm::Module&)           LegacyPassManager.cpp
llvm::legacy::PassManagerImpl::run(llvm::Module&)                          LegacyPassManager.cpp
llvm::legacy::PassManager::run(llvm::Module&)                              LegacyPassManager.cpp
main opt.cpp

The body of runOnFunction is as follows:

// runOnFunction - To run this pass, first we calculate the alloca
// instructions that are safe for promotion, then we promote each one.
bool runOnFunction(Function &F) override {
	if (skipFunction(F))
		return false;
	
	DominatorTree &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
	AssumptionCache &AC = 
		getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
	return promoteMemoryToRegister(F, DT, AC);
}

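The execution of the whole program forms a call tree, but when the debugger hits a breakpoint it only shows the current path. Something like DominatorTreeWrapperPass has already finished running before this point.
[Figure: Call Stack]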

Place ϕ-functions

In LLVM the transformation from stack variables to register values is performed in optimization passes. Running a mem2reg optimization pass on the IR will transform memory objects to register values whenever possible (or the heuristics say so). The optimization pass is implemented in PromoteMemoryToRegister.cpp which analyzes the BasicBlocks and the alloca instructions for PHINode placement. The PHINode placement is calculated with algorithm by Sreedhar and Gao that has been modified to not use the DJ (Dominator edge, Join edge) graphs. According to Sreedhar and Gao the algorithm is approximately five times faster on average than the Cytron et al. algorithm. The speed gain results from calculating dominance frontiers for only nodes that potentially need phi nodes and well designed data structures. - LLVM SSA

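We know that SSA construction takes three steps:

  • compute the dominance information
  • insert ϕ-instructions
  • rename

Once the dominance information has been computed, the next job is inserting ϕ-instructions. This is done by PromoteMem2Reg::run(), which has two major parts: placing ϕ-instructions, and renaming.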

// PromoteMemoryToRegister.cpp
// This file promotes memory references to be register references. It promotes
// alloca instructions which only have loads and stores as uses. An alloca is
// transformed by using iterated dominator frontiers to place PHI nodes, then
// traversing the function in depth-first order to rewrite loads and stores as
// appropriate.

struct PromoteMem2Reg {
	// The alloca instructions being promoted.
	std::vector<AllocaInst *> Allocas;

	DominatorTree &DT;

	const SimplifyQuery SQ;

	// Reverse mapping of Allocas.
	DenseMap<AllocaInst *, unsigned> AllocaLookup;

	// The PhiNodes we're adding.
	//
	// That map is used to simplify some Phi nodes as we iterate over it, so
	// it should have deterministic iterators. We could use MapVector, but
	// since we already maintain a map from BasicBlock* to a stable numbering
	// (BBNumbers), the DenseMap is more efficient (also supports removal).
	DenseMap<std::pair<unsigned, unsigned>, PHINode *> NewPhiNodes;

	// For each PHI node, keep track of which entry in Allocas it corresponds
	// to.
	DenseMap<PHINode *, unsigned> PhiToAllocaMap;
	
	// The set of basic blocks the renamer has already visited.
	SmallPtrSet<BasicBlock *, 16> Visited;

	// Contains a stable numbering of basic blocks to avoid non-deterministic
	// behavior.
	DenseMap<BasicBlock *, unsigned> BBNumbers;

	// Lazily compute the number of predecessors a block has.
	DenseMap<const BasicBlock *, unsigned> BBNumPreds;

	void run();
private:
	void ComputeLiveInBlocks(AllocaInst *AI, AllocaInfo &Info, 
							const SmallPtrSetImpl<BasicBlock *> &DefBlocks,
							SmallPtrSetImpl<BasicBlock *> &LiveInBlocks);

	void RenamePass(BasicBlock *BB, BasicBlock *Pred,
					RenamePassData::ValVector &IncVals,
					RenamePassData::LocationVector &InstLocs,
					std::vector<RenamePassData> &WorkList);

	bool QueuePhiNode(BasicBlock *BB, unsigned AllocaIdx, unsigned &Version);
};

void PromoteMem2Reg::run() {
	Function &F = *DT.getRoot()->getParent();

	AllocaDbgDeclares.resize(Allocas.size());

	AllocaInfo Info;
	LargeBlockInfo LBI;
	ForwardIDFCalculator IDF(DT);

	// Part 1: place phi nodes
	for(unsigned AllocaNum = 0; AllocaNum != Allocas.size(); ++AllocaNum) {
		AllocaInst *AI = Allocas[AllocaNum];
		if (AI->use_empty()) {
			// If there are no uses of the alloca, just delete it now.
			AI->eraseFromParent();

			// Remove the alloca from the Allocas list, since it has been processed
			RemoveFromAllocasList(AllocaNum);
			++NumDeadAlloca;
			continue;
		}

		// Calculate the set of read and write-locations for each alloca. This is
		// analogous to finding the 'uses' and 'definitions' of each variable.
		Info.AnalyzeAlloca(AI);

		// If there is only a single store to this value, replace any loads of
		// it that are directly dominated by the definition with the value stored.
		if (Info.DefiningBlocks.size() == 1) {
			if (rewriteSingleStoreAlloca(AI, Info, LBI, SQ.DL, DT, AC)) {
				// The alloca has been processed, move on.
				RemoveFromAllocasList(AllocaNum);
				++NumSingleStore;
				continue;
			}
		}

		// If the alloca is only read and written in one basic block, just perform a 
		// linear sweep over the block to eliminate it.
		if (Info.OnlyUsedInOneBlock && 
			promoteSingleBlockAlloca(AI, Info, LBI, SQ.DL, DT, AC)) {
			// The alloca has been processed, move on.
			RemoveFromAllocasList(AllocaNum);
			continue;
		}

		// ...

		// Unique the set of defining blocks for efficient lookup.
		SmallPtrSet<BasicBlock *, 32> DefBlocks(Info.DefiningBlocks.begin(),
												Info.DefiningBlocks.end());
		
		// Determine which blocks the value is live in. These are blocks which lead
		// to uses.
		SmallPtrSet<BasicBlock *, 32> LiveInBlocks;
		ComputeLiveInBlocks(AI, Info, DefBlocks, LiveInBlocks);

		// At this point, we're committed to promoting the alloca using IDF's, and
		// the standard SSA construction algorithm. Determine which blocks need PHI
		// nodes and see if we can optimize out some work by avoiding insertion of
		// dead phi nodes.
		IDF.setLiveInBlocks(LiveInBlocks);
		IDF.setDefiningBlocks(DefBlocks);
		SmallVector<BasicBlock *, 32> PHIBlocks;
		IDF.calculate(PHIBlocks);
		llvm::sort(PHIBlocks, [this](BasicBlock *A, BasicBlock *B) {
			return BBNumbers.find(A)->second < BBNumbers.find(B)->second;
		});

		unsigned CurrentVersion = 0;
		for (BasicBlock *BB : PHIBlocks)
			QueuePhiNode(BB, AllocaNum, CurrentVersion);
	}

	// Part 2: rename pass
	// ...
}

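The first part of run() is a for loop that processes each alloca instruction and computes the ϕ-instructions it needs. Looking back at the original IR, there are three alloca instructions, and each store instruction can be viewed as a def.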

define i32 @foo(i32 %a, i32 %b) #0 {
entry:
  %a.addr = alloca i32, align 4  ; first alloca instruction, %a.addr
  %b.addr = alloca i32, align 4  ; second alloca instruction, %b.addr
  %r = alloca i32, align 4       ; third alloca instruction, %r
  store i32 %a, i32* %a.addr, align 4 ; definition of %a.addr
  store i32 %b, i32* %b.addr, align 4 ; definition of %b.addr
  %0 = load i32, i32* %a.addr, align 4 ; read of %a.addr
  %1 = load i32, i32* %b.addr, align 4 ; read of %b.addr
  %cmp = icmp sgt i32 %0, %1
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  %2 = load i32, i32* %a.addr, align 4
  store i32 %2, i32* %r, align 4   ; first definition of %r
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = load i32, i32* %b.addr, align 4
  store i32 %3, i32* %r, align 4   ; second definition of %r
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %4 = load i32, i32* %r, align 4  ; read of %r
  ret i32 %4
}

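Collecting alloca information

This part gathers information about each alloca instruction, such as which stores and which loads it has, and then filters out the alloca instructions that need no ϕ-instruction at all. What AllocaInfo cares about is: the BasicBlocks that contain the store instructions, the BasicBlocks that contain the load instructions, and whether they all lie in the same BasicBlock.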

// PromoteMemoryToRegister.cpp
struct AllocaInfo {
	// Scan the uses of the specified alloca, filling in the AllocaInfo used 
	// by the rest of the pass to reason about the uses of this alloca.
	void AnalyzeAlloca(AllocaInst *AI) {
		// As we scan the uses of the alloca instruction, keep track of stores,
		// and decide whether all of the loads and stores to the alloca are within
		// the same basic block.
		for (auto UI = AI->user_begin(), E = AI->user_end(); UI != E;) {
			// ...
		}
	}
};

Each of these situations is handled differently:

  • DefiningBlocks.size()=1
  • OnlyUsedInOneBlock
  • the general case
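The three-way split can be sketched with a toy model. Everything here is hypothetical and simplified: `Use`, `ToyAllocaInfo`, `analyze`, `classify`, and `Strategy` are illustration-only names, and each use is reduced to "store or load" plus a block name. In the real pass the single-store rewrite may fail and fall through to the later cases, which this sketch does not model.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one use of an alloca.
struct Use {
  bool IsStore;      // true for a store (a def), false for a load (a use)
  std::string Block; // name of the basic block containing the instruction
};

struct ToyAllocaInfo {
  std::vector<std::string> DefiningBlocks; // one entry per store
  std::vector<std::string> UsingBlocks;    // one entry per load
  bool OnlyUsedInOneBlock = true;
};

// Scan all uses once, recording where the defs and uses are and whether
// every load and store sits in the same block.
ToyAllocaInfo analyze(const std::vector<Use> &Uses) {
  ToyAllocaInfo Info;
  std::string OnlyBlock;
  bool First = true;
  for (const Use &U : Uses) {
    (U.IsStore ? Info.DefiningBlocks : Info.UsingBlocks).push_back(U.Block);
    if (First) {
      OnlyBlock = U.Block;
      First = false;
    } else if (OnlyBlock != U.Block) {
      Info.OnlyUsedInOneBlock = false;
    }
  }
  return Info;
}

enum class Strategy { SingleStore, SingleBlock, General };

// Pick the cheapest applicable strategy, in the order the cases are listed.
Strategy classify(const ToyAllocaInfo &Info) {
  if (Info.DefiningBlocks.size() == 1)
    return Strategy::SingleStore;
  if (Info.OnlyUsedInOneBlock)
    return Strategy::SingleBlock;
  return Strategy::General;
}
```

Fed with the uses from the example IR, %a.addr (one store in entry, loads in entry and if.then) lands in the single-store case, while %r (stores in if.then and if.else, a load in if.end) lands in the general case.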
DefiningBlocks.size()=1

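%a.addr in the example IR falls into this case. It is handled mainly by the rewriteSingleStoreAlloca() function. The core of this function is to eliminate the store-then-load round trip and substitute the value being stored directly at every place where the loads are used. The whole point is to reduce the number of ϕ-nodes inserted. The one thing I don't understand is this: with only a single store, can it really fail to dominate all the loads? Is it that the IR lacks the information needed to fully guarantee dominance?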

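As the figure below shows, after this step all instructions related to %a.addr are deleted, and %a, the value stored to %a.addr, directly replaces every use of a value loaded from %a.addr.
[Figure: alloca - a]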

// Rewrite as many loads as possible given a single store
// 
// When there is only a single store, we can use the domtree to trivially
// replace all of the dominated loads with the stored value. Do so, and return
// true if this has successfully promoted the alloca entirely. If this returns
// false there were some loads which were not dominated by the single store
// and thus must be phi-ed with undef. We fall back to the standard alloca
// promotion algorithm in that case.
static bool rewriteSingleStoreAlloca(AllocaInst *AI, AllocaInfo &Info,
									LargeBlockInfo &LBI, const DataLayout &DL,
									DominatorTree &DT, AssumptionCache *AC) {
	// ... code omitted
}
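The comment above already hints at why a single store may not dominate every load: a variable assigned on only one branch, as in `int x; if (c) x = 1; return x;`, is read in a block the store does not dominate. A minimal sketch of that check follows, assuming a hand-written immediate-dominator array; `dominates` and the block numbering are hypothetical and not LLVM's API.

```cpp
#include <cassert>
#include <vector>

// IDom[b] is the immediate dominator of block b; the entry block is its own
// immediate dominator. Returns true when block A dominates block B.
bool dominates(const std::vector<int> &IDom, int A, int B) {
  while (true) {
    if (A == B)
      return true;  // reached A while walking up the dominator tree
    if (IDom[B] == B)
      return false; // reached the entry without meeting A
    B = IDom[B];
  }
}

// Hypothetical CFG for `int x; if (c) x = 1; return x;`
//   0 = entry, 1 = then (holds the only store), 2 = end (holds the load)
//   edges: entry -> then, entry -> end, then -> end
// so both then and end are immediately dominated by entry.
bool singleStoreCoversLoad() {
  const std::vector<int> IDom = {0, 0, 0};
  const int StoreBlock = 1, LoadBlock = 2;
  return dominates(IDom, StoreBlock, LoadBlock);
}
```

Here `singleStoreCoversLoad()` is false: the load in `end` can be reached along entry -> end without passing the store, so it cannot simply be replaced by the stored value. For %a.addr in the example the store sits in the entry block, which dominates every other block, so the rewrite succeeds. Loads in the same block as the store additionally need an instruction-order check, which this sketch leaves out.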
OnlyUsedInOneBlock
The general case

In the general case, the first step is to compute the BasicBlocks at whose entry the AllocaInst is live.

ComputeLiveInBlocks

One drawback of minimal SSA form is that it may place φ-functions for a variable
at a point in the control-flow graph where the variable was not actually live prior
to SSA. - Static Single Assignment Book

One possible way to do this is to perform liveness analysis prior to SSA construction, and then use the liveness information to suppress the placement of φ-functions as described above; another approach is to construct minimal SSA and then remove the dead φ-functions using dead code elimination. - Static Single Assignment Book

Pruned SSA form: filter out the BasicBlocks where no ϕ-instruction needs to be inserted, because such a ϕ would be dead anyway.

// Determine which blocks the value is live in.
//
// These are blocks which lead to uses. Knowing this allows us to avoid
// inserting PHI nodes into blocks which don't lead to uses (thus, the
// inserted phi nodes would be dead).
void PromoteMem2Reg::ComputeLiveInBlocks(
	AllocaInst *AI, AllocaInfo &Info,
	const SmallPtrSetImpl<BasicBlock *> &DefBlocks,
	SmallPtrSetImpl<BasicBlock *> &LiveInBlocks) {
	// To determine liveness, we must iterate through the predecessors of blocks
	// where the def is live. Blocks are added to the worklist if we need to
	// check their predecessors. Start with all the using blocks.
	SmallVector<BasicBlock *, 64> LiveBlockWorklist(Info.UsingBlocks.begin(),
													Info.UsingBlocks.end());

	// If any of the using blocks is also a definition block, check to see if the
	// definition occurs before or after the use. If it happens before the use,
	// the value isn't really live-in.
	// ...
}
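The backward propagation described in these comments can be sketched as a small worklist algorithm. This is a simplified illustration, not the LLVM function: `Graph` and `computeLiveIn` are hypothetical names, blocks are integers, and the sketch assumes the using blocks passed in already exclude blocks that define the variable before reading it.

```cpp
#include <cassert>
#include <set>
#include <vector>

using Graph = std::vector<std::vector<int>>; // Preds[b] = predecessors of block b

// A block is live-in if some path from its entry reaches a use of the
// variable without first passing through a definition.
std::set<int> computeLiveIn(const Graph &Preds, const std::set<int> &DefBlocks,
                            const std::vector<int> &UsingBlocks) {
  std::set<int> LiveIn;
  std::vector<int> Worklist(UsingBlocks.begin(), UsingBlocks.end());
  while (!Worklist.empty()) {
    int BB = Worklist.back();
    Worklist.pop_back();
    if (!LiveIn.insert(BB).second)
      continue; // already processed
    for (int P : Preds[BB]) {
      // P defines the value, so liveness does not flow through P. If P also
      // reads the value before writing it, P is expected in UsingBlocks.
      if (DefBlocks.count(P))
        continue;
      Worklist.push_back(P);
    }
  }
  return LiveIn;
}
```

For %r in the example (defs in if.then and if.else, one use in if.end), only if.end is live-in, so it is the only block where a ϕ could be useful.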
Inserting ϕ-nodes
// Calculate iterated dominance frontiers
// 
// This uses the linear-time phi algorithm based on DJ-graphs mentioned in
// the file-level comment. It performs DF->IDF pruning using the live-in
// set, to avoid computing the IDF for blocks where an inserted PHI node
// would be dead.
void calculate(SmallVectorImpl<NodeTy*> &IDFBlocks);
DJ-graphs(Dominator edge, Join edge)

For the details of the DJ-graph, see the reading notes on the paper "A Linear Time Algorithm for Placing phi-Nodes".

With dominance frontiers, the compiler can determine more precisely where ϕ-functions might be needed. The basic idea is simple. A definition of x in block b forces a ϕ-function at every node in DF(b). Since that ϕ-function is a new definition of x, it may, in turn, force the insertion of additional ϕ-functions.

Iterated Dominance Frontier IDF.calculate(PHIBlocks)

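The heart of ϕ-node placement is the IDFCalculatorBase class. IDF stands for iterated dominance frontier, and the core algorithm is the DJ-graph one. In PromoteMem2Reg::run(), for a single alloca instruction, we have already executed IDF.setLiveInBlocks(LiveInBlocks) and IDF.setDefiningBlocks(DefBlocks). The next step is to compute the BasicBlocks where ϕ-nodes must be inserted, and the core of that step is IDF.calculate(PHIBlocks).

Using the example code together with the DJ-graph, let me explain the code below.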

template<class NodeTy, bool IsPostDom>
void IDFCalculatorBase<NodeTy, IsPostDom>::calculate(
	SmallVectorImpl<NodeTy *> &PHIBlocks) {
	// Use a priority queue keyed on dominator tree level so that inserted nodes
	// are handled from the bottom of the dominator tree upwards. We also augment
	// the level with a DFS number to ensure that the blocks are ordered in a
	// deterministic way.
	
	IDFPriorityQueue PQ;
	DT.updateDFSNumbers();

	for (NodeTy *BB : *DefBlocks) {
		if (DomTreeNodeBase<NodeTy> *Node = DT.getNode(BB)) {
			PQ.push({Node, std::make_pair(Node->getLevel(), Node->getDFSNumIn())});
		}
	}

	while(!PQ.empty()) {
		DomTreeNodePair RootPair = PQ.top();
		PQ.pop();
		DomTreeNodeBase<NodeTy> *Root = RootPair.first;
		unsigned RootLevel = RootPair.second.first;

		// Walk all dominator tree children of Root, inspecting their CFG edge with
		// target elsewhere on the dominator tree. Only targets whose level is at
		// most Root's level are added to the iterated dominator frontier of the
		// definition set.
		Worklist.clear();
		Worklist.push_back(Root);
		VisitedWorklist.insert(Root);

		while(!Worklist.empty()) {
			DomTreeNodeBase<NodeTy> *Node = Worklist.pop_back_val();
			NodeTy *BB = Node->getBlock();
			// Succ is the successor in the direction we are calculating IDF, so it is
			// successor for IDF, and predecessor for Reverse IDF.
			auto DoWork = [&](NodeTy *Succ) {
				DomTreeNodeBase<NodeTy> *SuccNode = DT.getNode(Succ);

				const unsigned SuccLevel = SuccNode->getLevel();
				if (SuccLevel > RootLevel)
					return;
				
				if (!VisitedPQ.insert(SuccNode).second)
					return;
				
				NodeTy *SuccBB = SuccNode->getBlock();
				if (useLiveIn && !LiveInBlocks->count(SuccBB))
					return;
				
				PHIBlocks.emplace_back(SuccBB);
				if (!DefBlocks->count(SuccBB))
					PQ.push(std::make_pair(
						SuccNode, std::make_pair(SuccLevel, SuccNode->getDFSNumIn())));
			};

			for (auto Succ : ChildrenGetter.get(BB))
				DoWork(Succ);
			
			for (auto DomChild : *Node) {
				if (VisitedWorklist.insert(DomChild).second)
					Worklist.push_back(DomChild);
			}
		}
	}
}

[Figure: DJ]
In the CFG, the two nodes if.then and if.else define %r, and the resulting ϕ block is if.end. Note that the original DefiningBlocks does not contain if.end, but because a phi-instruction has to be inserted in if.end, and that is a new def, if.end must be pushed onto PQ.
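The level-based search can be replayed on a toy dominator tree. The sketch below mirrors the structure of the code above but is not LLVM's implementation: `ToyDomTree` and `computeIDF` are hypothetical names, blocks are integers, and the dominator-tree levels and children are supplied by hand.

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <utility>
#include <vector>

struct ToyDomTree {
  std::vector<std::vector<int>> Succs;    // CFG successors per block
  std::vector<std::vector<int>> Children; // dominator-tree children per block
  std::vector<int> Level;                 // depth of each block in the dominator tree
};

std::vector<int> computeIDF(const ToyDomTree &DT, const std::set<int> &DefBlocks,
                            const std::set<int> &LiveIn) {
  // Max-heap keyed on (level, block): the deepest definitions come first.
  std::priority_queue<std::pair<int, int>> PQ;
  for (int BB : DefBlocks)
    PQ.push(std::make_pair(DT.Level[BB], BB));

  std::set<int> VisitedPQ, VisitedWorklist;
  std::vector<int> PHIBlocks;
  while (!PQ.empty()) {
    const int RootLevel = PQ.top().first;
    const int Root = PQ.top().second;
    PQ.pop();

    // Walk Root's dominator subtree, looking at CFG edges that leave it.
    std::vector<int> Worklist{Root};
    VisitedWorklist.insert(Root);
    while (!Worklist.empty()) {
      int Node = Worklist.back();
      Worklist.pop_back();
      for (int Succ : DT.Succs[Node]) {
        if (DT.Level[Succ] > RootLevel)
          continue; // deeper than Root: not in Root's dominance frontier
        if (!VisitedPQ.insert(Succ).second)
          continue; // already examined as a frontier candidate
        if (!LiveIn.count(Succ))
          continue; // a phi here would be dead (pruning)
        PHIBlocks.push_back(Succ);
        if (!DefBlocks.count(Succ)) // the phi is a new def, so iterate from it
          PQ.push(std::make_pair(DT.Level[Succ], Succ));
      }
      for (int Child : DT.Children[Node])
        if (VisitedWorklist.insert(Child).second)
          Worklist.push_back(Child);
    }
  }
  return PHIBlocks;
}
```

For the example (0 = entry, 1 = if.then, 2 = if.else, 3 = if.end, an assumed numbering), the entry is the dominator-tree parent of the other three blocks. With defs in blocks 1 and 2 and block 3 live-in, the result is the single block 3, matching the ϕ placed in if.end.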

PromoteMem2Reg::QueuePhiNode

Once the blocks that need a ϕ have been computed, LLVM creates a new PHINode object and records it in PhiToAllocaMap.

// Queue a phi-node to be added to a basic-block for a specific Alloca.
//
// Returns true if there wasn't already a phi-node for that variable.
bool PromoteMem2Reg::QueuePhiNode(BasicBlock *BB, unsigned AllocaNo,
								unsigned &Version) {
	// ...
}

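After the first part of run() finishes, %a.addr, %b.addr, and %r should be in the state shown below. At this point we have constructed the ϕ-nodes and collected the BasicBlocks into which they will be inserted.
[Figure: LLVM SSA diagram]
Note: since %a.addr and %b.addr are fairly simple, the red in the figure above means that their instructions have already been fully processed.

The in-memory LLVM IR has not been fully processed yet; the textual IR in the figure above is something I wrote by hand to convey the rough idea.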

rename

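Once PhiToAllocaMap has been collected, the next step is renaming. First, keep in mind that the IR being processed is the in-memory IR. LLVM IR is linked together through users and uses, so in memory it is a graph of pointers pointing back and forth. In "Constructing SSA" my presentation made rename look as if it literally meant giving things new names, but the essence of rename is to connect each def to the ϕ-instructions. The "name" is only the surface meaning; the name is the def. And in LLVM IR, the def is the value that a store instruction is about to store.

So the key to understanding LLVM's rename is:

  • pick out each store instruction and associate the value being stored with its alloca instruction, so that it can later be plugged into the operands of a ϕ-instruction
  • pick out each load instruction and replace it, depending on the situation, with the value stored by an earlier store instruction or with a ϕ-instruction
  • of course, this has to be done in the order in which values flow
  • finally, delete the store and load instructions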
void PromoteMem2Reg::run() {
	Function &F = *DT.getRoot()->getParent();

	AllocaDbgDeclares.resize(Allocas.size());

	AllocaInfo Info;
	LargeBlockInfo LBI;
	ForwardIDFCalculator IDF(DT);
	// Part 1: place phi nodes
	// ...
	// Part 2: rename pass
	// Walks all basic blocks in the function performing the SSA rename algorithm
	// and inserting the phi nodes we marked as necessary.
	std::vector<RenamePassData> RenamePassWorkList;
	RenamePassWorkList.emplace_back(&F.front(), nullptr, std::move(Values),
									std::move(Locations));
	do {
		RenamePassData PRD = std::move(RenamePassWorkList.back());
		RenamePassWorkList.pop_back();
		// RenamePass may add new worklist entries.
		RenamePass(PRD.BB, PRD.Pred, PRD.Values, PRD.Locations, RenamePassWorkList);
	} while (!RenamePassWorkList.empty());
}

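The code above prepares the data associated with the alloca instructions. We now have only one alloca instruction left to process (the other two have already been handled), so there is only one entry. RenamePassWorkList is then initialized with the first BasicBlock of the Function, and control passes to the core of the whole rename process, RenamePass().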

PromoteMem2Reg::RenamePass()

A central structure of the whole rename pass is IncomingVals, whose type is the ValVector in the structure below.

struct RenamePassData {
	using ValVector = std::vector<Value *>;
	BasicBlock *BB;
	BasicBlock *Pred;
	ValVector Values;
	
};

Carried through the worklist, IncomingVals plays a role similar to the stack in the rename procedure of "Constructing SSA": it stores the def that should currently be used.

Handling store instructions & load instructions
// Recursively traverse the CFG of the function, renaming loads and
// stores to the allocas which we are promoting.
//
// IncomingVals indicates what value each Alloca contains on exit from the
// predecessor block Pred.
void PromoteMem2Reg::RenamePass(BasicBlock *BB, BasicBlock *Pred,
								RenamePassData::ValVector &IncomingVals,
								RenamePassData::LocationVector &IncomingLocs,
								std::vector<RenamePassData> &Worklist) {
NextIteration:
	// If we are inserting any phi nodes into this BB, they will already be in the
	// block.
	// Part 1: fill in the phi nodes
	// Part 2: handle store instructions & load instructions
	// Don't revisit blocks
	if (!Visited.insert(BB).second)
		return;
	
	for (BasicBlock::iterator II = BB->begin(); !II->isTerminator();) {
		Instruction *I = &*II++; // get the instruction, increment iterator
		
		if (LoadInst *LI = dyn_cast<LoadInst>(I)) {
			AllocaInst *Src = dyn_cast<AllocaInst>(LI->getPointerOperand());
			if (!Src)
				continue;
			
			DenseMap<AllocaInst *, unsigned>::iterator AI = AllocaLookup.find(Src);
			if (AI == AllocaLookup.end())
				continue;
			
			Value *V = IncomingVals[AI->second];

			// If the load was marked as nonnull we don't want to lose
			// that information when we erase this Load. So we preserve
			// it with an assume.
			// ...

			// Anything using the load now uses the current value.
			LI->replaceAllUsesWith(V);
			BB->getInstList().erase(LI);
		} else if (StoreInst *SI = dyn_cast<StoreInst>(I)) {
			// Delete this instruction and mark the name as the current holder of the
			// value
			AllocaInst *Dest = dyn_cast<AllocaInst>(SI->getPointerOperand());
			if (!Dest)
				continue;
			
			DenseMap<AllocaInst *, unsigned>::iterator ai = AllocaLookup.find(Dest);
			if (ai == AllocaLookup.end())
				continue;
			
			// What value were we writing?
			unsigned AllocaNo = ai->second;
			IncomingVals[AllocaNo] = SI->getOperand(0);

			BB->getInstList().erase(SI);
		}
	}
	// Part 3: update the iteration state
}

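For a load instruction, every use of the load is replaced with the current value of its source operand, the alloca instruction, that is, with the current def. The load instruction is then deleted.

For a store instruction, the current def is updated to the stored value, and the store instruction is then deleted.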
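The two rules above can be tried out on a toy model. The sketch below is not LLVM code: `ToyInst`, `Op`, and `renameBlock` are hypothetical names made up for illustration, values are plain strings, and replaceAllUsesWith is only emulated within a single block. It assumes `incomingVals` is indexed by alloca number, as IncomingVals is in RenamePass().

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy IR (not LLVM's classes): values are just names.
enum class Op { Load, Store, Other };

struct ToyInst {
    Op op;
    std::string result;   // name defined by a Load or Other instruction
    std::string operand;  // Store: value being stored; Other: value being used
    int slot;             // index of the promoted alloca, -1 if none
};

// Renames loads and stores inside one block. incomingVals[slot] holds the
// current def of each promoted alloca and is updated in place.
std::vector<ToyInst> renameBlock(const std::vector<ToyInst> &block,
                                 std::vector<std::string> &incomingVals) {
    std::map<std::string, std::string> replaced; // erased load -> current def
    std::vector<ToyInst> out;
    for (ToyInst inst : block) { // copied on purpose, operands get rewritten
        // Emulates replaceAllUsesWith for loads erased earlier in this block.
        auto it = replaced.find(inst.operand);
        if (it != replaced.end())
            inst.operand = it->second;

        if (inst.slot >= 0)
            assert(static_cast<std::size_t>(inst.slot) < incomingVals.size());

        if (inst.op == Op::Load && inst.slot >= 0) {
            // The load disappears; its users see the current def instead.
            replaced[inst.result] = incomingVals[static_cast<std::size_t>(inst.slot)];
        } else if (inst.op == Op::Store && inst.slot >= 0) {
            // The store disappears; the stored value becomes the current def.
            incomingVals[static_cast<std::size_t>(inst.slot)] = inst.operand;
        } else {
            out.push_back(inst);
        }
    }
    return out;
}
```

Feeding it a block that stores `%a`, loads the slot into `%2`, uses `%2` in another instruction, and stores that result back leaves a single instruction that uses `%a` directly, with the slot's current def set to the result of that instruction.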

Filling in the ϕ-nodes
void PromoteMem2Reg::RenamePass(BasicBlock *BB, BasicBlock *Pred,
								RenamePassData::ValVector &IncomingVals,
								RenamePassData::LocationVector &IncomingLocs,
								std::vector<RenamePassData> &Worklist) {
NextIteration:
	// If we are inserting any phi nodes into this BB, they will already be in the
	// block.
	// Part 1: fill in the phi nodes
	if (PHINode *APN = dyn_cast<PHINode>(BB->begin())) {
		// If we have PHI nodes to update, compute the number of edges from Pred to
		// BB.
		if (PhiToAllocaMap.count(APN)) {
			// We want to be able to distinguish between PHI nodes being inserted by
			// this invocation of mem2reg from those phi nodes that already existed in
			// the IR before mem2reg was run. We determine that APN is being inserted
			// because it is missing incoming edges. All other PHI nodes being
			// inserted by this pass of mem2reg will have the same number of incoming
			// operands so far. Remember this count.
			unsigned NewPHINumOperands = APN->getNumOperands();

			unsigned NumEdges = std::count(succ_begin(Pred), succ_end(Pred), BB);

			// Add entries for all the phis.
			BasicBlock::iterator PNI = BB->begin();
			do {
				unsigned AllocaNo = PhiToAllocaMap[APN];
				
				// Update the location of the phi node.
				updateForIncomingValueLocation(APN, IncomingLocs[AllocaNo],
											APN->getNumIncomingValues() > 0);

				// Add N incoming values to the PHI node.
				for (unsigned i = 0; i != NumEdges; ++i) 
					APN->addIncoming(IncomingVals[AllocaNo], Pred);
				
				// The currently active variable for this block is now the PHI.
				IncomingVals[AllocaNo] = APN;

				// Get the next phi node.
				++PNI;
				APN = dyn_cast<PHINode>(PNI);
				if (!APN)
					break;
			} while(APN->getNumOperands() == NewPHINumOperands);
		}
	}
	// Part 2: handle store instructions & load instructions
	// Part 3: update the iteration state
}

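If the traversal reaches a ϕ-node, we must have arrived there from a predecessor. The IncomingVals array holds the defs passed down from that predecessor, and each one is added to its ϕ-node as a single operand in the form of a <def, pred> pair. The do{}while() form is used because a block usually contains many ϕ-nodes, although in our example there is only one.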
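As a rough illustration of the <def, pred> pairs, here is a small sketch. `ToyPhi` and `fillPhi` are hypothetical names and not part of LLVM; blocks and values are plain strings, and the successor list of the predecessor is passed in explicitly so that the edge count can be computed the way NumEdges is above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PHI node: a list of (value, predecessor) pairs.
struct ToyPhi {
    std::string name;
    std::vector<std::pair<std::string, std::string>> incoming;
};

// Adds one incoming pair per CFG edge pred -> bb, then makes the phi the
// current def, mirroring what the do/while loop does for each phi.
void fillPhi(ToyPhi &phi, const std::string &bb, const std::string &pred,
             const std::vector<std::string> &succsOfPred,
             std::string &currentDef) {
    auto numEdges = std::count(succsOfPred.begin(), succsOfPred.end(), bb);
    assert(numEdges > 0 && "pred must branch to bb");
    for (decltype(numEdges) i = 0; i != numEdges; ++i)
        phi.incoming.emplace_back(currentDef, pred);
    // From here on, the active value for this variable is the phi itself.
    currentDef = phi.name;
}
```

For the foo example, calling `fillPhi` once with pred `if.then` and current def `%a`, and once with pred `if.else` and current def `%b`, gives the phi two operands. A predecessor that reaches the block over two edges contributes two identical pairs.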

The whole iteration
// Recursively traverse the CFG of the function, renaming loads and
// stores to the allocas which we are promoting.
//
// IncomingVals indicates what value each Alloca contains on exit from the
// predecessor block Pred.
void PromoteMem2Reg::RenamePass(BasicBlock *BB, BasicBlock *Pred,
								RenamePassData::ValVector &IncomingVals,
								RenamePassData::LocationVector &IncomingLocs,
								std::vector<RenamePassData> &Worklist) {
NextIteration:
	// If we are inserting any phi nodes into this BB, they will already be in the
	// block.
	// Part 1: fill in the phi nodes
	// Part 2: handle store instructions & load instructions
	// Part 3: update the iteration state

	// 'Recurse' to our successors.
	succ_iterator I = succ_begin(BB), E = succ_end(BB);
	if (I == E)
		return;
	
	// Keep track of the successors so we don't visit the same successor twice
	SmallPtrSet<BasicBlock *, 8> VisitedSuccs;

	// Handle the first successor without using the worklist.
	VisitedSuccs.insert(*I);
	Pred = BB;
	BB = *I;
	for (; I != E; ++I)
		if (VisitedSuccs.insert(*I).second)
			Worklist.emplace_back(*I, Pred, IncomingVals, IncomingLocs);
	
	goto NextIteration;
}

Above RenamePass() there is also a do{}while() loop that drains the entries in Worklist. For our example code, the whole process is shown in the figure below:
Final result
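To make the traversal easier to follow, here is a toy re-creation of it for a single promoted variable. It is only a sketch under simplifying assumptions: `ToyBlock`, `WorkItem`, and `renameAll` are hypothetical names, each block performs at most one store, the phi placement is given as input, and duplicate edges are not counted. The inner `for (;;)` loop stands in for `goto NextIteration`, and each worklist entry carries its own copy of the incoming value.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy CFG for one promoted slot; not LLVM's data structures.
struct ToyBlock {
    std::vector<std::string> succs;
    std::string storedValue; // value stored to the slot here, "" if none
    bool hasPhi;             // result of the (omitted) phi placement step
};

struct WorkItem {
    std::string bb, pred, incomingVal; // incomingVal is copied per entry
};

using PhiOperands = std::vector<std::pair<std::string, std::string>>;

// Returns, for every block with a phi, its (value, pred) operands.
std::map<std::string, PhiOperands>
renameAll(const std::map<std::string, ToyBlock> &cfg, const std::string &entry) {
    std::map<std::string, PhiOperands> phis;
    std::set<std::string> visited;
    std::vector<WorkItem> worklist;
    worklist.push_back(WorkItem{entry, "", "undef"});

    while (!worklist.empty()) {
        WorkItem item = worklist.back();
        worklist.pop_back();
        std::string bb = item.bb, pred = item.pred, val = item.incomingVal;

        for (;;) { // plays the role of `goto NextIteration`
            const ToyBlock &blk = cfg.at(bb);
            if (blk.hasPhi) {                   // part 1: fill in the phi
                phis[bb].emplace_back(val, pred);
                val = "phi(" + bb + ")";
            }
            if (!visited.insert(bb).second)     // don't revisit blocks
                break;
            if (!blk.storedValue.empty())       // part 2: a store updates the def
                val = blk.storedValue;
            if (blk.succs.empty())              // part 3: move on to successors
                break;
            std::set<std::string> seen;
            seen.insert(blk.succs.front());
            for (std::size_t i = 1; i < blk.succs.size(); ++i)
                if (seen.insert(blk.succs[i]).second)
                    worklist.push_back(WorkItem{blk.succs[i], bb, val});
            pred = bb;
            bb = blk.succs.front();             // first successor, no worklist
        }
    }
    return phis;
}
```

With the diamond CFG of foo (entry branching to if.then and if.else, which store `%a` and `%b` respectively and both jump to if.end, where the phi was placed), the toy first walks entry, if.then, if.end, and then picks if.else from the worklist. The phi in if.end ends up with the operands (`%a`, if.then) and (`%b`, if.else).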

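Cleanup

The final cleanup is simple and consists of the following steps:

  • Delete the alloca instructions
  • Remove ϕ-nodes whose incoming values are all the same
  • Complete the ϕ-nodes that are missing incoming values because of unreachable basic blocks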
void PromoteMem2Reg::run() {
	// Cleanup part
	// Remove the allocas themselves from the function
	for (Instruction *A : Allocas) {
		// If there are any uses of the alloca instructions left, they must be in
		// unreachable basic blocks that were not processed by walking the dominator
		// tree. Just delete the users now.
		if (!A->use_empty())
			A->replaceAllUsesWith(UndefValue::get(A->getType()));
		A->eraseFromParent();
	}

	// Loop over all of the PHI nodes and see if there are any that we can get
	// rid of because they merge all of the same incoming values. This can
	// happen due to undef values coming into the PHI nodes. This process is
	// iterative, because eliminating one PHI node can cause others to be removed.
	// ...

	// At this point, the renamer has added entries to PHI nodes for all reachable
	// code. Unfortunately, there may be unreachable blocks which the renamer
	// hasn't traversed. If this is the case, the PHI nodes may not 
	// have incoming values for all predecessors. Look over all PHI nodes we have
	// created, inserting undef values if they are missing any incoming values.
	// ...
}
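The second cleanup step is elided in the listing above, so the following is only a sketch of the idea under my own assumptions, not the code LLVM runs. `ToyPhis`, `commonValue`, and `eliminateTrivialPhis` are hypothetical names, predecessors are left out, and the toy skips the legality checks a real simplifier has to perform. It treats a phi as removable when, ignoring "undef" and references to itself, all of its incoming values are the same, and it repeats until nothing changes, since removing one phi can make another one trivial.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: phi name -> its incoming values (predecessors omitted).
using ToyPhis = std::map<std::string, std::vector<std::string>>;

// Returns the single value a phi merges, or "" if it merges several values
// (or nothing but undef). "undef" and self-references are ignored.
std::string commonValue(const std::string &name,
                        const std::vector<std::string> &incoming) {
    std::string common;
    for (const std::string &v : incoming) {
        if (v == "undef" || v == name)
            continue;
        if (!common.empty() && v != common)
            return "";
        common = v;
    }
    return common;
}

// Repeatedly removes trivial phis and rewrites their uses in other phis.
// Returns the replacements that were made (phi name -> value).
std::map<std::string, std::string> eliminateTrivialPhis(ToyPhis &phis) {
    std::map<std::string, std::string> replacedBy;
    bool changed = true;
    while (changed) {
        changed = false;
        for (auto it = phis.begin(); it != phis.end(); ++it) {
            std::string v = commonValue(it->first, it->second);
            if (v.empty())
                continue;
            std::string dead = it->first;
            phis.erase(it);
            for (auto &entry : phis)           // rewrite uses in other phis
                for (std::string &in : entry.second)
                    if (in == dead)
                        in = v;
            for (auto &r : replacedBy)         // keep earlier results current
                if (r.second == dead)
                    r.second = v;
            replacedBy[dead] = v;
            changed = true;
            break; // the iterator is stale now, restart the scan
        }
    }
    return replacedBy;
}
```

For example, with `%p1 = phi(%x, undef)`, `%p2 = phi(%p1, %x)`, and `%p3 = phi(%x, %y)`, removing `%p1` turns `%p2` into `phi(%x, %x)`, which is then removed as well, while `%p3` stays.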

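At this point the whole process is complete, and the pass sets its status variable LocalChanged = true. Of course, because we ran the command opt -mem2reg test.ll -o test.bc, a BitcodeWriterPass follows afterwards.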
