CS:APP Chapter 5 Summary (Writing Code the Compiler Can Optimize; Inlining)

Latest recommended article published on 2023-07-21 14:45:39

rookie19_

Category: Reading

Article link: https://blog.csdn.net/weixin_42100211/article/details/112744431

Optimization proceeds on three fronts:

an appropriate set of algorithms and data structures
write source code that the compiler can effectively optimize to turn into efficient executable code
divide a task into portions that can be computed in parallel

In general, programmers must make a trade-off between how easy a program is to implement and
maintain, and how fast it runs.

The memory aliasing problem:
The case where two pointers may designate the same memory location is known as memory aliasing. Whenever a piece of code handles more than one pointer, it is worth asking whether they could alias.
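A minimal sketch of why aliasing blocks optimization, modeled on the book's twiddle example. The two functions look equivalent, but they give different results when both pointers refer to the same location, so the compiler cannot rewrite the first into the second on its own.

```c
/* Adds the value at yp to the value at xp twice, re-reading *yp each time. */
void twiddle1(long *xp, long *yp) {
    *xp += *yp;
    *xp += *yp;
}

/* Same intent, but *yp is read only once. */
void twiddle2(long *xp, long *yp) {
    *xp += 2 * *yp;
}
```

With distinct pointers both functions add 2 * *yp to *xp. With xp == yp, twiddle1 quadruples the value while twiddle2 triples it.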
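The function call problem:

The compiler cannot merge or eliminate calls to f() when f() has side effects; in the book's example the main reason is that f() modifies a global variable.
Inlining the function can solve this problem.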

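Inlining has costs: you can no longer set a breakpoint on the inlined function when debugging, and function-level profiling of it becomes impossible.
Inlining does not apply to cases such as variadic functions and recursion. Doing it by hand also makes the code "uglier"; as noted above, performance and good looks rarely come together. Without optimization enabled, GCC does not inline. Note that marking a function with the inline keyword is not enough on its own; an optimization option must also be passed to the compiler.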
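The following sketch, modeled on the book's example, shows the side effect that blocks the optimization and what the call sequence looks like once it is inlined and simplified. The name func1_inlined is hypothetical and chosen only for this illustration.

```c
long counter = 0;

/* Has a side effect: every call changes the global counter. */
long f(void) {
    return counter++;
}

/* Four calls: returns 0+1+2+3 = 6 when counter starts at 0. */
long func1(void) {
    return f() + f() + f() + f();
}

/* One call: returns 4*0 = 0 when counter starts at 0, so it is NOT equivalent. */
long func2(void) {
    return 4 * f();
}

/* What func1 reduces to after f is inlined and the arithmetic is simplified. */
long func1_inlined(void) {
    long t = 4 * counter + 6;
    counter += 4;
    return t;
}
```

Once the body of f is visible at the call site, the compiler can see exactly how counter changes and can collapse the four calls safely.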

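Loop unrolling:
Data-flow analysis of the assembly code shows that some operations disappear when the stride (the number of elements processed per iteration) is greater than 1.
In addition, the loop condition test is itself part of the loop. Fewer tests means less time spent on them.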
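As an illustration, here is a sketch of 2x1 unrolling (stride 2, one accumulator) applied to a simple sum. The function names are made up for this example; the shape follows the book's unrolled combining loop.

```c
/* Baseline: one element per iteration. */
long sum_rolled(const long *data, long n) {
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += data[i];
    return acc;
}

/* 2x1 unrolling: two elements per iteration, half as many loop tests. */
long sum_unrolled2(const long *data, long n) {
    long acc = 0;
    long i;
    long limit = n - 1;              /* last index where data[i+1] is valid */
    for (i = 0; i < limit; i += 2)
        acc = (acc + data[i]) + data[i + 1];
    for (; i < n; i++)               /* handle a leftover element, if any */
        acc += data[i];
    return acc;
}
```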

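Instruction-level parallelism (not multiprocessing or multithreading):
Parallelism and unrolling are often applied together. This is where the "2x2" in the book's figure comes from: the first 2 is the unrolling stride and the second 2 is the degree of parallelism. acc0 and acc1 split the work into two parts, and the two partial results are merged at the end.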
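A sketch of the 2x2 form described above, using the acc0 and acc1 names from the text; the function name sum_2x2 is an assumption for this example. The two accumulators form independent dependency chains, so the processor can overlap their additions.

```c
/* 2x2: stride of 2, two independent accumulators. */
long sum_2x2(const long *data, long n) {
    long acc0 = 0;                   /* accumulates even-indexed elements */
    long acc1 = 0;                   /* accumulates odd-indexed elements */
    long i;
    long limit = n - 1;
    for (i = 0; i < limit; i += 2) {
        acc0 += data[i];
        acc1 += data[i + 1];
    }
    for (; i < n; i++)               /* leftover element, if any */
        acc0 += data[i];
    return acc0 + acc1;              /* merge the two partial results */
}
```

For integers the result is identical to the sequential sum. For floating-point data the changed order of operations can alter rounding, so the result may differ slightly.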

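Code motion:
One example is hoisting the length function out of the for-loop condition, which again reduces the work done inside the loop. When the hoisted call is cheap, the gain from this, like the gain from unrolling alone, is not dramatic: branch prediction inside a loop succeeds most of the time, so the processor rarely has to wait for the result of the condition. (If the hoisted call is itself linear in the input, as strlen is, moving it out changes the asymptotic cost and the gain is large.) Code motion should be cultivated as a programming habit.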
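A sketch of code motion using the book's lowercase-conversion example. In lower1, strlen runs on every iteration; in lower2 it is computed once before the loop.

```c
#include <stddef.h>
#include <string.h>

/* strlen(s) is re-evaluated on every iteration: quadratic in the length. */
void lower1(char *s) {
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
}

/* The length is hoisted out of the loop: linear in the length. */
void lower2(char *s) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
}
```

The compiler generally cannot do this hoisting itself here, because the loop writes to s and it cannot easily prove that strlen(s) stays the same.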

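Eliminating unnecessary memory references:
In the book's example, the loop writes its running result through a dereferenced pointer (the * operator) on every iteration. The fix is to introduce a local variable in the loop; the local variable is updated repeatedly in a register, and the result is written to memory once after the loop ends. Before the change, memory was written on every iteration. This too should be cultivated as a programming habit.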
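A sketch of the transformation, with hypothetical function names. The first version reads and writes *dest on each iteration; the second keeps the running value in a local variable and stores it once.

```c
/* Accumulates directly through the pointer: a load and a store per iteration. */
void sum_to_dest_v1(const long *data, long n, long *dest) {
    *dest = 0;
    for (long i = 0; i < n; i++)
        *dest += data[i];
}

/* Accumulates in a local (register) variable and writes memory once. */
void sum_to_dest_v2(const long *data, long n, long *dest) {
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += data[i];
    *dest = acc;
}
```

The two versions can behave differently if dest points into data, which is the aliasing concern that keeps the compiler from making this change automatically.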

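Replacing control transfer with data transfer:
Chapter 3 also covers this. When branch prediction accuracy is poor, it can be better to compute both outcomes and then select one. The misprediction penalty goes away because the conditional part shrinks to a simple select of the result, and the total time can end up lower.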
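A sketch modeled on the book's minmax example. The first version uses a data-dependent branch; the second computes both candidates and selects with conditional expressions, a style that compilers can often (though not always) translate into conditional moves.

```c
/* Branching style: swaps so that a[i] <= b[i], using an if statement. */
void minmax1(long a[], long b[], long n) {
    for (long i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

/* Functional style: compute min and max, then store both unconditionally. */
void minmax2(long a[], long b[], long n) {
    for (long i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}
```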

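Reducing function calls:
Benefit one: the call and ret instructions and the associated register save/restore work are eliminated.
Benefit two: it enables further optimizations, such as unrolling and reducing memory references.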
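A sketch of the idea using a simplified vector type loosely based on the book's; the struct layout and function names here are assumptions for illustration. The first loop calls a bounds-checked accessor per element, the second fetches the underlying array once and indexes it directly.

```c
typedef struct {
    long len;
    long *data;
} vec;

/* Bounds-checked accessor: returns 0 on a bad index, 1 on success. */
int get_vec_element(const vec *v, long index, long *dest) {
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* One function call (and one bounds check) per element. */
long sum_checked(const vec *v) {
    long acc = 0;
    for (long i = 0; i < v->len; i++) {
        long val;
        get_vec_element(v, i, &val);
        acc += val;
    }
    return acc;
}

/* Direct access: no calls in the loop, at the cost of some modularity. */
long sum_direct(const vec *v) {
    long len = v->len;
    const long *data = v->data;
    long acc = 0;
    for (long i = 0; i < len; i++)
        acc += data[i];
    return acc;
}
```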

SIMD:
Put simply, this is a way of making use of wider registers.
Current AVX vector registers are 32 bytes long, and therefore each can hold eight 32-bit numbers or four 64-bit numbers, where the numbers can be either integer or floating-point values. AVX instructions can then perform vector operations on these registers, such as adding or multiplying eight or four sets of values in parallel.
gcc supports extensions to the C language that let programmers express a program in terms of
vector operations that can be compiled into the vector instructions of AVX (as well as code based
on the earlier SSE instructions). This coding style is preferable to writing code directly in assembly
language, since gcc can also generate code for the vector instructions found on other processors.
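A minimal sketch of the GCC vector extension mentioned in the quote, assuming a compiler that supports the vector_size attribute (GCC and Clang do). The type name vec8_t and the function vsum are made up for this example. Each vector holds eight floats in 32 bytes; whether the additions become AVX instructions depends on the target and compiler flags.

```c
#include <string.h>

/* A 32-byte vector of eight floats (GCC/Clang extension). */
typedef float vec8_t __attribute__((vector_size(32)));

/* Sums n floats, eight lanes at a time, then finishes with scalar code. */
float vsum(const float *a, long n) {
    vec8_t acc = {0};                /* all eight lanes start at zero */
    long i = 0;
    for (; i + 8 <= n; i += 8) {
        vec8_t chunk;
        memcpy(&chunk, a + i, sizeof chunk);  /* safe for unaligned input */
        acc += chunk;                /* eight additions in one vector op */
    }
    float sum = 0;
    for (int k = 0; k < 8; k++)      /* reduce the eight lanes */
        sum += acc[k];
    for (; i < n; i++)               /* leftover elements */
        sum += a[i];
    return sum;
}
```

Because the additions happen in a different order than in a plain scalar loop, floating-point results can differ slightly from the sequential sum.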

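Operation precedence (reassociation):
I do not yet know how to generalize this optimization; it seems to rest entirely on data-flow analysis of the assembly code. I will look into it further when I have time.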
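For reference, the transformation in question looks roughly like the sketch below (function names are hypothetical). Only the placement of the parentheses changes. In the reassociated form, data[i] * data[i+1] does not depend on acc, so it can be computed while the previous update of acc is still in flight, leaving one multiplication per iteration on the critical path instead of two.

```c
/* Left-associated: both multiplications depend on acc. */
long prod_left(const long *data, long n) {
    long acc = 1;
    long i;
    for (i = 0; i < n - 1; i += 2)
        acc = (acc * data[i]) * data[i + 1];
    for (; i < n; i++)
        acc *= data[i];
    return acc;
}

/* Reassociated: the inner product is independent of acc. */
long prod_reassoc(const long *data, long n) {
    long acc = 1;
    long i;
    for (i = 0; i < n - 1; i += 2)
        acc = acc * (data[i] * data[i + 1]);
    for (; i < n; i++)
        acc *= data[i];
    return acc;
}
```

For floating-point data, changing the association can change rounding, which is one reason compilers are cautious about doing this on their own.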

Summary of the optimization techniques above:
High-level design.
Choose appropriate algorithms and data structures for the problem at hand. Be especially vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance.
Basic coding principles.
Avoid optimization blockers so that a compiler can generate efficient code. Eliminate excessive function calls. Move computations out of loops when possible. Consider selective compromises of program modularity to gain greater efficiency.
Eliminate unnecessary memory references. Introduce temporary variables to hold intermediate results. Store a result in an array or global variable only when the final value has been computed.
Low-level optimizations.
Structure code to take advantage of the hardware capabilities. Unroll loops to reduce overhead and to enable further optimizations. Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation.
Rewrite conditional operations in a functional style to enable compilation via conditional data transfers.
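In my own view, the optimizations that pay off most visibly in practice are inlining plus -O3, SIMD, and instruction-level parallelism (not multiprocessing or multithreading). Of these, only the first is easy to apply, and it applies only to C.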

A short primer on the store/load hardware:
For example, our reference machine has two load units, each of which can hold up to 72 pending read requests. It has a single store unit with a store buffer containing up to 42 write requests. Each
of these units can initiate 1 operation every clock cycle.
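At first I took the store buffer to be an ordinary cache that exists to speed things up, but that reading did not fit the context. After rereading the passage several times, my understanding is as follows:
Because the store unit's buffer is fairly large (see the passage quoted above), it can accept many store instructions at once. With that many stores pending, the data to be stored cannot be written to the processor's data cache right away, so a load cannot simply fetch from the cache (the store has not completed yet; this is what "must" means in the book's figure). The load has to search the store buffer for a matching address first. As I read it, this search is slower than fetching from the data cache with a known address; in other words, the store buffer is not a smaller, faster cache in front of the data cache. In the book's data-flow view, the load also has to wait until the pending store's data is available, which puts the store and the load on the same dependency chain. The conclusion: reading a value that was just written (a write/read dependency, where "the outcome of a memory read depends on a recent memory write") is relatively slow.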
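The book illustrates the write/read dependency with a function along the following lines. When src and dst point to different locations, each load is independent of the preceding store; when they point to the same location, each load must obtain the value written by the store just before it.

```c
/* Writes val to *dst, then computes the next val from *src, n times. */
void write_read(long *src, long *dst, long n) {
    long cnt = n;
    long val = 0;
    while (cnt) {
        *dst = val;          /* store */
        val = (*src) + 1;    /* load; depends on the store if src == dst */
        cnt--;
    }
}
```

Calling it with src == dst creates the slow case discussed above; calling it with distinct addresses does not.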

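Amdahl’s law
In parallel computing, the speedup of a program running on multiple processors is limited by the execution time of its serial part. Simply adding more processors does not necessarily make a system faster; the share of the system that can be parallelized has to be raised first, and processors then added in proportion, to get the largest speedup for the smallest investment.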
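As a small worked illustration, the law can be written as S = 1 / ((1 - alpha) + alpha / k), where alpha is the fraction of the original run time that is sped up and k is the speedup factor for that fraction (for example, the number of processors applied to the parallel part). The function name below is made up for this sketch.

```c
/* Overall speedup when a fraction alpha of the run time is sped up by factor k. */
double amdahl_speedup(double alpha, double k) {
    return 1.0 / ((1.0 - alpha) + alpha / k);
}
```

For instance, speeding up 60% of a program by a factor of 3 gives an overall speedup of about 1.67, and even as k grows without bound the speedup cannot exceed 1 / (1 - alpha) = 2.5.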

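GPROF triggers an interrupt at a fixed interval t, determines which function is executing at that moment, and adds t to that function's total. This is how it estimates running time. The method is not very precise for short runs (under about 1 s).
By default, GPROF does not report the running time of library functions; that time is charged to the caller.
When using timing data to find bottlenecks, keep in mind that a program may behave differently on particular inputs (algorithms have best-case, worst-case, and average-case behavior).
After one bottleneck is removed, a new one may appear (it was previously masked by the first).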

The book also mentions profilers more sophisticated than GPROF that can analyze a program at the level of basic blocks.