[Computer Architecture Reading Notes] H.2 Detecting and Enhancing Loop-Level Parallelism

Original post: https://blog.csdn.net/dashuniuniu/article/details/121590373

H.1 Introduction: Exploiting Instruction-Level Parallelism Statically

Appendix H extends Section 3.2. It opens by listing four technical ingredients for exploiting ILP:

finding parallelism
reducing control dependences
reducing data dependences
using speculation

Compared with the work the CPU does at run time to raise ILP, compile-time techniques "can take into account a wider range of the program than a runtime approach might be able to incorporate". The drawback is just as clear: "they can use only compile time information. Without runtime information, compile time techniques must often be conservative and assume the worst case."

H.2 Detecting and Enhancing Loop-Level Parallelism

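This subsection builds on Section 3.2 and stays with loop-level parallelism; the loop given in Section 3.2 is too trivial. Starting from data dependence, the subsection introduces loop-level parallelism systematically. The example below does contain a data dependence, the theoretical constraint mentioned earlier: each iteration must carry out the semantics x[i] + s -> x[i]. The only loop-carried dependence in the loop body is the induction variable i, and that is not a "real" dependence: i exists only to make the program convenient to write.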

for (i = 1000; i > 0; i = i-1) 
	x[i] = x[i] + s;

The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence.

In computer science, an induction variable is a variable that gets increased or decreased by a fixed amount on every iteration of a loop or is a linear function of another induction variable

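Because loop-level parallelism requires recognizing the loop body, the induction variables, and the dependences, loop optimizations are easier to perform at the source level. How well a loop optimization pays off, however, depends on the specific microarchitecture, so loop optimization is not purely a source-level analysis.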

Data Dependence

for (i = 1; i <= 100; i = i + 1) {
	A[i + 1] = A[i] + C[i]; /* S1 */ 
	B[i + 1] = B[i] + A[i + 1]; /* S2 */
}

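This loop contains two different kinds of dependence:

loop-carried dependence: A[i + 1] in S1 depends on A[i], the result of the previous iteration; likewise B[i + 1] in S2 depends on the previous iteration's B[i];
dependence within the same iteration: B[i + 1] in S2 depends on the A[i + 1] computed in the same iteration.

If the loop had only the within-iteration dependence, its iterations could run in parallel. The loop-carried dependences here, however, prevent loop-level parallelism.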

A loop is parallel if it can be written without a cycle in the dependences, since the absence of a cycle means that the dependences give a partial ordering on the statements.

Now consider the next example: S1 depends on the result that S2 produced in the previous iteration, but S2 does not depend on S1.

for (i = 1; i <= 100; i = i + 1) {
	A[i] = A[i] + B[i]; /* S1 */ 
	B[i + 1] = C[i] + D[i]; /* S2 */
}

We can unroll the loop into the following form:

// Unrolled by a factor of 4
for (i = 1; i <= 100; i = i + 4) {
	A[i] = A[i] + B[i]; /* S1 */ 
	B[i + 1] = C[i] + D[i]; /* S2 */

	A[i + 1] = A[i + 1] + B[i + 1]; /* S3 */
	B[i + 2] = C[i + 1] + D[i + 1]; /* S4 */
	
	A[i + 2] = A[i + 2] + B[i + 2]; /* S5 */ 
	B[i + 3] = C[i + 2] + D[i + 2]; /* S6 */
	
	A[i + 3] = A[i + 3] + B[i + 3]; /* S7 */ 
	B[i + 4] = C[i + 3] + D[i + 3]; /* S8 */
}

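[S3, S4] cannot run in parallel with [S1, S2], because S3 depends on S2. Can a dependence between iterations be turned into a dependence inside a single iteration? It can: group S2 and S3 into the same iteration, as shown below.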

for (i = 1; i <= 100; i = i + 4) {
	A[i] = A[i] + B[i]; /* S1 */ 
	
	B[i + 1] = C[i] + D[i]; /* S2 */
	A[i + 1] = A[i + 1] + B[i + 1]; /* S3 */
	
	B[i + 2] = C[i + 1] + D[i + 1]; /* S4 */
	A[i + 2] = A[i + 2] + B[i + 2]; /* S5 */
	 
	B[i + 3] = C[i + 2] + D[i + 2]; /* S6 */
	A[i + 3] = A[i + 3] + B[i + 3]; /* S7 */
	 
	B[i + 4] = C[i + 3] + D[i + 3]; /* S8 */
}

The form above still needs one more adjustment. Written as below, every dependence falls inside a single iteration, which removes the loop-carried dependence.

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
	B[i + 1] = C[i] + D[i]; 
	A[i + 1] = A[i + 1] + B[i + 1];
}
B[101] = C[100] + D[100];
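
The equivalence of the two forms can be checked mechanically. The following is a minimal sketch, not part of the book: the function names, the array sizes, and the initial values are all assumptions chosen for illustration. It runs the original loop and the transformed loop on copies of the same data and compares the results.

```c
#include <string.h>

#define N 100  /* iteration count, as in the loops above */

/* Original loop: S1 reads the B[i] that S2 wrote in the previous iteration. */
static void original_loop(double A[N + 2], double B[N + 2],
                          const double C[N + 2], const double D[N + 2]) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i + 1] = C[i] + D[i];  /* S2 */
    }
}

/* Transformed loop: the first S1 and the last S2 are peeled off, so the
 * B[i + 1] that A[i + 1] needs is produced in the same iteration. */
static void transformed_loop(double A[N + 2], double B[N + 2],
                             const double C[N + 2], const double D[N + 2]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}

/* Returns 1 when both versions leave A and B bit-identical. */
int transformed_matches_original(void) {
    double A1[N + 2], B1[N + 2], A2[N + 2], B2[N + 2], C[N + 2], D[N + 2];
    for (int i = 0; i < N + 2; i++) {
        A1[i] = i * 0.5;      /* arbitrary test data */
        B1[i] = 100.0 - i;
        C[i] = i * 2.0;
        D[i] = 1.0 + i;
    }
    memcpy(A2, A1, sizeof A1);
    memcpy(B2, B1, sizeof B1);
    original_loop(A1, B1, C, D);
    transformed_loop(A2, B2, C, D);
    return memcmp(A1, A2, sizeof A1) == 0 && memcmp(B1, B2, sizeof B1) == 0;
}
```

Both versions perform exactly the same additions on the same operands, only grouped differently across iterations, so `transformed_matches_original()` is expected to return 1.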

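The discussion so far shows how important data dependence analysis is, and that the first job is to locate loop-carried dependences. One more concept worth mentioning is the recurrence: a recurrence does depend on the result of an earlier iteration, but how far back that iteration lies determines how much vectorization is possible.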

A recurrence is when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding, as in the above fragment.

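In the first loop below, iteration i depends on the result of iteration i-1, so it is genuinely hard to vectorize (parallelize). In the second loop, iteration i depends on iteration i-5, so iterations i-4, i-3, i-2, i-1 and i can execute in parallel. The distance between the current iteration and the earlier iteration it depends on is called the dependence distance: "The larger the distance, the more potential parallelism can be obtained by unrolling the loop."
Note: loop unrolling is mostly about ILP and has little to do with loop-level parallelism.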

for (i=2;i<=100;i=i+1) {
	Y[i] = Y[i-1] + Y[i]; 
}

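// Unrolled by a factor of 5.
// Unrolling helps parallelism because it widens the scope available to
// instruction scheduling and thereby raises ILP. In the example below the
// dependence between consecutive iterations is so tight that even
// unrolling cannot raise ILP.
// for (i=2;i+4<=100;i=i+5) {  // leftover iterations i=97..100 need a cleanup loop
//	Y[i] = Y[i-1] + Y[i];
//	Y[i+1] = Y[i] + Y[i+1];
//	Y[i+2] = Y[i+1] + Y[i+2];
//	Y[i+3] = Y[i+2] + Y[i+3];
//	Y[i+4] = Y[i+3] + Y[i+4];
// }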

for (i=5;i<=100;i=i+1) {
	Y[i] = Y[i-5] + Y[i]; // the dependence distance here is 5
	// Y[5] = Y[0] + Y[5]
	// Y[6] = Y[1] + Y[6]
	// Y[7] = Y[2] + Y[7]
	// Y[8] = Y[3] + Y[8]
	// Y[9] = Y[4] + Y[9]   at i = 9, iterations i-4, i-3, i-2, i-1 and i can run in parallel
}

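// With an unroll factor of 5, the instructions in the loop body have no
// dependences on one another and can be scheduled freely.
// Note: put another way, the dependence distance is 5, so iterations
// i-4, i-3, i-2, i-1 and i can run in parallel.
// Note: after this unrolling the loop can be fully vectorized.
// for (i=5;i+4<=100;i=i+5) {  // leftover iteration i=100 needs a cleanup step
//   Y[i] = Y[i-5] + Y[i];
//   Y[i+1] = Y[i-4] + Y[i+1];
//   Y[i+2] = Y[i-3] + Y[i+2];
//   Y[i+3] = Y[i-2] + Y[i+3];
//   Y[i+4] = Y[i-1] + Y[i+4];
// }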
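
One way to make the role of the dependence distance concrete is to execute groups of iterations in a deliberately "wrong" order and see whether the result changes. The sketch below is an illustration only: the helper names are made up, the loops start at index `dist` rather than at the exact bounds used above, and reversing the order inside a group stands in for running the group's iterations in parallel.

```c
#include <string.h>

#define LEN 101  /* indices 0..100, as in the loops above */

/* Reference: run Y[i] = Y[i-dist] + Y[i] in program order. */
static void run_in_order(long Y[LEN], int dist) {
    for (int i = dist; i < LEN; i++)
        Y[i] = Y[i - dist] + Y[i];
}

/* Run the same iterations in groups of `group`, visiting each group
 * backwards. If the iterations inside a group are independent, the
 * order inside the group cannot change the result. */
static void run_groups_reversed(long Y[LEN], int dist, int group) {
    for (int start = dist; start < LEN; start += group) {
        int end = start + group - 1;
        if (end > LEN - 1)
            end = LEN - 1;  /* last group may be shorter */
        for (int i = end; i >= start; i--)
            Y[i] = Y[i - dist] + Y[i];
    }
}

/* Returns 1 if reordering inside groups of `group` iterations is safe
 * for a recurrence with dependence distance `dist`. */
int groups_are_independent(int dist, int group) {
    long a[LEN], b[LEN];
    for (int i = 0; i < LEN; i++)
        a[i] = b[i] = i + 1;  /* arbitrary nonzero test data */
    run_in_order(a, dist);
    run_groups_reversed(b, dist, group);
    return memcmp(a, b, sizeof a) == 0;
}
```

With distance 5, groups of 5 iterations can be reordered freely (`groups_are_independent(5, 5)` returns 1), whereas with distance 1 the same grouping changes the result (`groups_are_independent(1, 5)` returns 0). A group larger than the distance, such as `groups_are_independent(5, 6)`, also fails.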

Finding Dependences

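This raises the question of how to find these dependences. Finding dependences comes up in all three of the tasks below, and in languages such as C/C++ data dependence analysis matters even more.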

(1) good scheduling of code
(2) determining which loops might contain parallelism
and (3) eliminating name dependences.

Affine array access and affine loops

This section uses affine functions to explain how to find dependences, or more precisely how to do array dependence analysis. So let us digress and cover that material first.

Nearly all dependence analysis algorithms work on the assumption that array indices are affine.
Most data dependence analysis work has restricted the domain to only solve
the affine memory disambiguation problem. This simple domain can be solved
exactly and efficiently.

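Non-affine loops make the analysis harder, sometimes to the point of being NP-hard, so work on loop dependence restricts its view to affine loops. The access below, for example, is non-affine because of the index i*i: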

a[i*i] = a[i * 4 + 5] + 3

In simplest terms, a one-dimensional array index is affine if it can be written in the form a * i + b, where a and b are constants and i is the loop index variable. The index of a multidimensional array is affine if the index in each dimension is affine. Sparse array accesses, which typically have the form x[y[i]], are one of the major examples of nonaffine accesses.

First, the definition of an affine function; array subscripts are usually affine functions.

A function of one or more variables, $i_1$ , $i_2$ , …, $i_n$ is affine, if it can be expressed as a sum of a constant, plus constant multiples of the variables.

$f(i_1, i_2, \dots, i_n) = c_0 + \sum_{k=1}^{n} c_k i_k$

An affine array access is defined as follows:

An array access is affine if:
the bounds of the loop and the index of each dimension of the array are affine expressions of loop variables and symbolic constants.

A loop is parallelizable when each iteration accesses a different set of data (CS 293S, Parallelism and Dependence Theory). The loop below does not meet that condition: iteration i reads A[i-1], which iteration i-1 writes.

for (i=1; i<N; i++)
	A[i] = A[i-1]+B[i]; 
	// A[1] = A[0]+B[1];
	// A[2] = A[1]+B[2];
	// A[3] = A[2]+B[3];

Before continuing, here are the concepts used so far and the ones needed later:

Concept	Explanation
True dependence	read after write
Antidependence	write after read
Output dependence	write after write
Source and Sink	Source: the statement (instance) executed earlier. Sink: the statement (instance) executed later. Graphically, a dependence is an edge from source to sink.
Distance vector	The distance vector is a vector `d(sink, source)` such that $d_k = sink_k - source_k$
Direction vector	
Loop-carried dependence	A dependence between accesses in different iterations of the loop. By contrast, a non-loop-carried dependence occurs when a location is both read and written in the same iteration.
Loop-independent dependence	Loop independent data dependence occurs between accesses in the same loop iteration.
Dependence graph	
Strip mining	

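To expose loop-level parallelism, the core goal is to "compute the set of statement instances which are dependent". For two memory accesses to the same array inside a loop, whether a dependence exists can be decided by asking whether the two affine index functions (say, one for the read of an array element and one for the write) produce the same value for some pair of index values. For the array access below we need to decide whether $i_m * 3 + 4$ (affine function 1) can equal $i_n * 4 + 5$ (affine function 2). It can, for example when $i_m = 3$ and $i_n = 2$.
Note: only simple array accesses are considered here; nonlinear (non-affine) accesses are left out.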

for (int i = 0; i < 100; i ++)
	a[i * 3 + 4] = a[i * 4 + 5] + 3; // e.g. for i = 2:
	                                 // a[2 * 3 + 4] = a[2 * 4 + 5], i.e. a[10] = a[13]
	                                 // for i = 3:
	                                 // a[3 * 3 + 4] = a[3 * 4 + 5], i.e. a[13] = a[17]
	                                 // Iteration 2 reads a[13] and iteration 3 writes it:
	                                 // a write-after-read dependence (anti-dependence).
	                                 // The data accessed by different iterations overlaps, so this is not a parallel loop.
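
For bounds this small, the question "do the two index functions ever collide?" can simply be answered by enumeration. The sketch below is a brute-force illustration with made-up names (`struct conflict`, `count_conflicts`); a real compiler would solve the equation symbolically rather than enumerate.

```c
#include <stddef.h>

/* One (write iteration, read iteration) pair that touches the same element. */
struct conflict {
    int i_write;
    int i_read;
};

/* Enumerates all pairs in [0, n) x [0, n) with 3*i_w + 4 == 4*i_r + 5,
 * stores the first pair found in *first (if first is not NULL) and
 * returns how many such pairs exist. */
int count_conflicts(int n, struct conflict *first) {
    int count = 0;
    for (int i_w = 0; i_w < n; i_w++) {
        for (int i_r = 0; i_r < n; i_r++) {
            if (3 * i_w + 4 == 4 * i_r + 5) {
                if (count == 0 && first != NULL) {
                    first->i_write = i_w;
                    first->i_read = i_r;
                }
                count++;
            }
        }
    }
    return count;
}
```

For `n = 100` the first pair reported is write iteration 3 and read iteration 2, matching the a[13] example in the comments above, and 25 pairs exist in total (write iterations 3, 7, 11, ..., 99).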

Take the following loop: how do we decide whether it contains a dependence?

for(i = 1; i <= n; i++) 
	for(j = 2 * i; j <= 100; j++)
		a[i+2*j+3][4*i+2*j] = a[1][2*i+1];

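For the code above we need to determine, subject to the loop bounds, whether there exist two pairs of $i$, $j$ for which the two affine functions (the array indices) evaluate to the same value (an informal description).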

Iteration Space

[Figure: iteration space]

Note: figure excerpted from http://www.cs.cmu.edu/afs/cs/academic/class/15745-s09/www/lectures/lect6-deps.pdf

Iteration Vectors

[Figure: iteration vectors]
Note: figure excerpted from http://www.cs.cmu.edu/afs/cs/academic/class/15745-s09/www/lectures/lect6-deps.pdf

Loop Dependence

[Figure: loop dependence]
Note: figure excerpted from http://www.cs.cmu.edu/afs/cs/academic/class/15745-s09/www/lectures/lect6-deps.pdf

Distance Vector

The dependence distance mentioned earlier was a single number; in general it is a vector.
[Figure: distance vector]
Note: figure excerpted from http://www.cs.cmu.edu/afs/cs/academic/class/15745-s09/www/lectures/lect6-deps.pdf

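The table above introduced the notions of source and sink. In the figure, the dependence vector introduced by C has a negative component, -1; in that direction the source would come after the sink, so no dependence exists.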

How to solve it

[Figure]
We continue with the loop from above:

for(i = 1; i <= n; i++) 
	for(j = 2 * i; j <= 100; j++)
		a[i+2*j+3][4*i+2*j] = a[1][2*i+1];

First, these constraints can be written as matrix-vector products: one set of bound constraints plus one set of subscript computations.

Like iteration space, array access can be represented as $F\mathbf{i} + f$; $F$ and $f$ represent the functions of the loop-index variables.

Formally, an array access, $A = < F, f, B, b >$;
where $\mathbf{i}$ = index variable vector;
$A$ maps $\mathbf{i}$ within the bounds $B\mathbf{i} + b \geq 0$ to the array element location $F\mathbf{i} + f$
Array Dependence Analysis : COMP 621 Special Topics

Access	Affine Expression
X[i, j]	$\begin{bmatrix} 1 & 0\\ 0 & 1\end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
X[6 - j*2]	$\begin{bmatrix} 0 & -2 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 6 \end{bmatrix}$
X[1, 5]	$\begin{bmatrix} 0 & 0\\ 0 & 0\end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 1 \\ 5 \end{bmatrix}$
X[0, i - 5, 2 * i + j]	$\begin{bmatrix} 0 & 0\\ 1 & 0\\ 2 & 1\end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 0 \\ -5 \\ 0 \end{bmatrix}$

Note: the table above is taken from Array Dependence Analysis: COMP 621 Special Topics
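
To connect the matrix form with ordinary subscript arithmetic, the sketch below stores the last row of the table, X[0, i - 5, 2 * i + j], as an $F$ matrix plus an $f$ vector and evaluates it for a given iteration. The struct layout and the names `affine_access`, `LAST_ROW`, and `eval_access` are assumptions made for this illustration.

```c
#define DIMS 3   /* array dimensions of X[0, i - 5, 2 * i + j] */
#define DEPTH 2  /* loop nest depth: the iteration vector is (i, j) */

/* An affine access F * iter + f, with F stored row by row. */
struct affine_access {
    int F[DIMS][DEPTH];
    int f[DIMS];
};

/* The last row of the table: X[0, i - 5, 2 * i + j]. */
const struct affine_access LAST_ROW = {
    .F = { {0, 0}, {1, 0}, {2, 1} },
    .f = { 0, -5, 0 },
};

/* Computes out = F * iter + f, one array dimension at a time. */
void eval_access(const struct affine_access *a, const int iter[DEPTH],
                 int out[DIMS]) {
    for (int d = 0; d < DIMS; d++) {
        out[d] = a->f[d];
        for (int k = 0; k < DEPTH; k++)
            out[d] += a->F[d][k] * iter[k];
    }
}
```

Evaluating `LAST_ROW` at (i, j) = (7, 3) yields the subscripts (0, 2, 17), the same as computing 0, i - 5 and 2 * i + j by hand.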

Let us now formalize this:

Consider two static accesses $A$ in a $d$-deep loop nest and $A'$ in a $d'$-deep loop nest respectively defined as

$A = < F, f, B, b >$ and $A' = < F', f', B', b' >$

$A$ and $A'$ are data dependent if

$B\mathbf{i} + b \geq 0$; $B'\mathbf{i}' + b' \geq 0$ and
$F\mathbf{i} + f = F'\mathbf{i}' + f'$
(and $\mathbf{i} \neq \mathbf{i}'$ for dependencies between instances of the same static access)

Solving this kind of problem has the following characteristics:

Array data dependence basically requires finding integer solutions to a system (often referred to as the dependence system) consisting of equalities and inequalities.
Equalities are derived from array accesses.
Inequalities are derived from the loop bounds.
It is an integer linear programming problem.
Integer linear programming is an NP-complete problem.
Several heuristics have been developed.
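
One well-known heuristic of this kind is the GCD test, which these notes do not otherwise cover; it is sketched here only as an example of a cheap, conservative check. For a store to x[a*i + b] and a load from x[c*i + d], a dependence can exist only if gcd(a, c) divides d - b. The function names below are made up for illustration, and the test ignores the loop bounds, so a positive answer means "cannot be ruled out", not "proven".

```c
#include <stdlib.h>

/* Greatest common divisor of |a| and |b| (Euclid's algorithm). */
static int gcd(int a, int b) {
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d].
 * Returns 0 when a dependence is impossible, 1 when it cannot be
 * ruled out. Loop bounds are ignored. */
int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(a, c);
    if (g == 0)
        return b == d;  /* both subscripts are constants */
    return (d - b) % g == 0;
}
```

For the earlier access pair a[i * 3 + 4] and a[i * 4 + 5], `gcd_test_may_depend(3, 4, 4, 5)` returns 1, consistent with the collision found by hand. For a store to x[2*i + 3] and a load from x[2*i], `gcd_test_may_depend(2, 3, 2, 0)` returns 0: one subscript is always odd and the other always even.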

for(i = 1; i <= n; i++) 
	for(j = 2 * i; j <= 100; j++)
		a[i+2*j+3][4*i+2*j] = a[1][2*i+1];

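Back to the example: the constraints for the code above are set up as follows. First list the two sets of constraints, that is, the inequalities, over $i_r$, $j_r$, $i_w$, $j_w$ and $n$.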
$\begin{bmatrix} 1 & 0 & 0 \\ -1 & 0 & 1 \\ -2 & 1 & 0 \\ 0 & -1 & 0 \end{bmatrix} \begin{bmatrix} i_w \\ j_w \\ n \end{bmatrix} + \begin{bmatrix} -1 \\ 0 \\ 0 \\ 100 \end{bmatrix} \geq \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$
$\begin{bmatrix} 1 & 0 & 0 \\ -1 & 0 & 1 \\ -2 & 1 & 0 \\ 0 & -1 & 0 \end{bmatrix} \begin{bmatrix} i_r \\ j_r \\ n \end{bmatrix} + \begin{bmatrix} -1 \\ 0 \\ 0 \\ 100 \end{bmatrix} \geq \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$

That is:
$\left\{ \begin{aligned} i_w \geq 1 \\ n \geq i_w \\ j_w \geq 2 * i_w \\ 100 \geq j_w \\ i_r \geq 1 \\ n \geq i_r \\ j_r \geq 2 * i_r \\ 100 \geq j_r \end{aligned} \right.$

The equalities that correspond to the array accesses are:

$\begin{bmatrix} 1 & 2 \\ 4 & 2 \end{bmatrix} \begin{bmatrix} i_w \\ j_w \end{bmatrix} + \begin{bmatrix} 3 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 2 & 0 \end{bmatrix} \begin{bmatrix} i_r \\ j_r \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \end{bmatrix}$
That is, the following pair of equations:
$\begin{aligned} i_w + 2 * j_w + 3 &= 1 \\ 4 * i_w + 2 * j_w &= 2 * i_r + 1 \end{aligned}$
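
For this particular system the answer can be read off directly: the first equation requires $i_w + 2 j_w = -2$, which is impossible when $i_w \geq 1$ and $j_w \geq 2$, and the second has an even left-hand side and an odd right-hand side. So there is no dependence between the write and the read. The sketch below double-checks that by enumeration; the function name is made up for illustration and the value of n is chosen by the caller.

```c
/* Counts integer solutions (i_w, j_w, i_r, j_r) that satisfy both the
 * loop bounds (the inequalities) and the subscript equalities for a
 * given upper bound n on i. */
long count_dependent_instances(int n) {
    long count = 0;
    for (int i_w = 1; i_w <= n; i_w++)
        for (int j_w = 2 * i_w; j_w <= 100; j_w++)
            for (int i_r = 1; i_r <= n; i_r++)
                for (int j_r = 2 * i_r; j_r <= 100; j_r++)
                    if (i_w + 2 * j_w + 3 == 1 &&
                        4 * i_w + 2 * j_w == 2 * i_r + 1)
                        count++;
    return count;
}
```

`count_dependent_instances(50)` returns 0. Values of n above 50 add nothing, because the inner loop bound j >= 2 * i already leaves no valid j once i exceeds 50.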

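Returning to Section H.2: dependence analysis is very important. Besides deciding whether a loop can be unrolled, it also supplies information about memory accesses during instruction scheduling. Its weakness is that it handles too few situations, "namely, among references within a single loop nest and using affine index functions". There are many situations it cannot handle, for example: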

When objects are referenced via pointers rather than array indices (but see discussion below)
When array indexing is indirect through another array, which happens with many representations of sparse arrays
When a dependence may exist for some value of the inputs but does not exist in actuality when the code is run since the inputs never take on those values
When an optimization depends on knowing more than just the possibility of a dependence but needs to know on which write of a variable does a read of that variable depend

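Points 3 and 4 feel somewhat forced; these limitations exist in other compiler optimizations too. Solving point 1 requires another technique, points-to analysis, which determines whether two pointers refer to the same memory. That question is hard to answer at compile time, so in most cases only fairly simple analyses are performed, for example Andersen's pointer analysis, which I have analyzed before.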

The book does not spend much space on points-to analysis. It lists the information that points-to analysis relies on:

Type information, which restricts what a pointer can point to. (No argument there: type information greatly shrinks the set that has to be analyzed.)
Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to. For example, if $p$ always points to an object allocated in a given source line and $q$ never points to that object, then $p$ and $q$ can never point to the same object.
Information derived from pointer assignments. For example, if $p$ may be assigned the value of $q$, then $p$ may point to anything $q$ points to. (These last two are what common points-to analysis algorithms have to handle.)
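
Why the compiler cares whether two pointers may alias can be shown with a small experiment. In the sketch below (names and sizes are assumptions for illustration), the loop body dst[i + 1] = src[i] + 1 has no loop-carried dependence when dst and src are different arrays, but it becomes a recurrence when they point to the same array. Running the iterations in the opposite order stands in for a reordering or parallelizing transformation.

```c
#include <string.h>

#define LEN 16

/* dst[i + 1] = src[i] + 1 for i = 0 .. n-2, in ascending order. */
static void shift_add_forward(int *dst, const int *src, int n) {
    for (int i = 0; i + 1 < n; i++)
        dst[i + 1] = src[i] + 1;
}

/* The same iterations in descending order. This is only a legal
 * reordering if the iterations are independent of one another. */
static void shift_add_backward(int *dst, const int *src, int n) {
    for (int i = n - 2; i >= 0; i--)
        dst[i + 1] = src[i] + 1;
}

/* Returns 1 if the iteration order does not affect the result.
 * aliased != 0 makes dst and src point to the same array. */
int order_independent(int aliased) {
    int a1[LEN] = {0}, a2[LEN] = {0}, b[LEN];
    for (int i = 0; i < LEN; i++)
        b[i] = 10 * i;  /* arbitrary source data for the non-aliased case */
    shift_add_forward(a1, aliased ? a1 : b, LEN);
    shift_add_backward(a2, aliased ? a2 : b, LEN);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

`order_independent(0)` returns 1 and `order_independent(1)` returns 0. Unless points-to analysis (or a programmer annotation such as C's `restrict`) rules out the aliased case, the compiler has to assume the worse of the two.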

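The remaining text covers two limitations of dependence analysis:

algorithmic limitations, for example in points-to analysis;
limitations of interprocedural analysis (IPA).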

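A compiler usually ends up in one of two awkward positions. In one, computing "fully accurate interprocedural information" is so expensive that no real compiler performs the analysis. In the other, the information that is computed is too imprecise to be of any use.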

Eliminating Dependent Computations

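In this subsection the book brings out an important and easily overlooked issue: correctness. The loop unrolling in Section 3.2 implicitly relied on another optimization, namely algebraic optimization.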

(((i - 1) - 1) - 1) -> i - 3

This kind of algebraic optimization is fine in theory but can be incorrect in practice, because computer arithmetic is not associative.

Although arithmetic with unlimited range and precision is associative, computer arithmetic is not associative, for either integer arithmetic, because of limited range, or floating-point arithmetic, because of both range and precision.

For details see the excellent article What Every Computer Scientist Should Know About Floating-Point Arithmetic, or https://en.wikipedia.org/wiki/IEEE_754.

Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the latter).
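
The quoted example is easy to reproduce. The sketch below assumes IEEE 754 double arithmetic and a compiler that is not allowed to reassociate floating-point operations (for instance, no `-ffast-math`); the function names are made up for illustration.

```c
/* (x + y) + z, evaluated left to right. */
double sum_left(double x, double y, double z) {
    double t = x + y;
    return t + z;
}

/* x + (y + z), the reassociated form. */
double sum_right(double x, double y, double z) {
    double t = y + z;
    return x + t;
}
```

With x = 1e30, y = -1e30 and z = 1, `sum_left` returns 1.0 while `sum_right` returns 0.0: in the second form the 1 is absorbed when it is added to -1e30, because a double cannot represent -1e30 + 1 exactly. This is why a compiler may not reassociate such expressions unless the programmer explicitly permits it.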

TODO

Steensgaard’s algorithm
https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of
https://www.agner.org/optimize/optimizing_assembly.pdf 12.10 Loop unrolling
Strip Mining
Loop Stream Detection
Why GCC does not favor loop unrolling and loop vectorization: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760#c3
https://godbolt.org/z/v8zcsW6qn

Reference

Combining loop transformations considering caches and scheduling
https://cseweb.ucsd.edu//~carter/perfprog.html
https://arxiv.org/pdf/1811.06043.pdf
https://suif.stanford.edu/~courses/cs243/lectures/l9.pdf
http://www.cs.cmu.edu/afs/cs/academic/class/15745-s09/www/lectures/lect6-deps.pdf