Chapter 5-04

Original post: 2015-11-18 00:22:30

Please indicate the source: http://blog.csdn.net/gaoxiangnumber1
5.9.2 Reassociation Transformation

Figure 5.26 shows a function combine7 that differs from the unrolled code of combine5 (Figure 5.16) only in the way the elements are combined in the inner loop. In combine5, the combining is performed by the statement
acc = (acc OP data[i]) OP data[i+1];
while in combine7 it is performed by the statement
acc = acc OP (data[i] OP data[i+1]);
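
For reference, a sketch of combine7 in the style of the chapter's combining functions; vec_ptr, data_t, OP, and IDENT are the chapter's generic vector type, element type, combining operation, and identity element:

/* Unroll loop by 2, with reassociation */
void combine7(vec_ptr v, data_t *dest)
{
    long int i;
    long int length = vec_length(v);
    long int limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    /* Combine 2 elements at a time, reassociated */
    for (i = 0; i < limit; i += 2)
        acc = acc OP (data[i] OP data[i+1]);

    /* Finish any remaining elements */
    for (; i < length; i++)
        acc = acc OP data[i];
    *dest = acc;
}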

Figure 5.27 demonstrates the effect of applying the reassociation transformation together with k-way loop unrolling.

Register mapping for the combine7 inner loop (single-precision multiply): i: %rdx; data: %rax; limit: %rbp; acc: %xmm1
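
A reconstruction of that inner loop (illustrative; actual gcc output may differ slightly):

.L121:                            loop:
    movss (%rax,%rdx,4), %xmm0        Get data[i]
    mulss 4(%rax,%rdx,4), %xmm0       Multiply by data[i+1]
    mulss %xmm0, %xmm1                Multiply by acc
    addq $2, %rdx                     i += 2
    cmpq %rdx, %rbp                   Compare limit:i
    jg .L121                          If >, goto loop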

The load operations resulting from the movss and the first mulss instructions load vector elements i and i + 1 from memory, and the first mul operation multiplies them together. The second mul operation then multiplies this result by the accumulated value acc.
We have two load and two mul operations, but only one of the mul operations forms a data-dependency chain between loop registers. When we then replicate this template n/2 times to show the computations performed in multiplying n vector elements (Figure 5.30), we see that we only have n/2 operations along the critical path. The first multiplication within each iteration can be performed without waiting for the accumulated value from the previous iteration. Thus, we reduce the minimum possible CPE by a factor of 2. As we increase k, we continue to have only one operation per iteration along the critical path.

For integer addition and multiplication, the fact that these operations are associative implies that this reordering will have no effect on the result. For the floating-point cases, we must once again assess whether this reassociation is likely to significantly affect the outcome.
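
A minimal demonstration that floating-point addition is not associative; the values are chosen to force rounding:

#include <stdio.h>

int main(void)
{
    float a = 1e20f, b = -1e20f, c = 3.14f;
    printf("%g\n", (a + b) + c);    /* a + b cancels exactly: prints 3.14 */
    printf("%g\n", a + (b + c));    /* b + c rounds to -1e20: prints 0 */
    return 0;
}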
5.10 Summary of Results for Optimizing Combining Code

By using multiple optimizations, we have been able to achieve a CPE close to 1.00 for all combinations of data type and operation using ordinary C code, a performance improvement of over 10X compared to the original version combine1.
5.11 Some Limiting Factors
The critical path in a data-flow graph representation of a program indicates a fundamental lower bound on the time required to execute a program. That is, if there is some chain of data dependencies in a program where the sum of all of the latencies along that chain equals T, then the program will require at least T cycles to execute.
The throughput bounds of the functional units also impose a lower bound on the execution time for a program. That is, assume that a program requires a total of N computations of some operation, that the microprocessor has only m functional units capable of performing that operation, and that these units have an issue time of i. Then the program will require at least N * i/m cycles to execute.
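As an illustrative instance (the numbers here are made up): a loop that performs N = 1000 multiplications on a machine with m = 1 fully pipelined multiplier (issue time i = 1) has a throughput bound of 1000 * 1/1 = 1000 cycles, while a single dependency chain through a 4-cycle multiplier would impose a latency bound of 4000 cycles. Multiple accumulators and reassociation aim to move performance from the latency bound toward the throughput bound.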
5.11.1 Register Spilling
The benefits of loop parallelism are limited by the ability to express the computation in assembly code. The IA32 instruction set has only a small number of registers to hold the values being accumulated. If the degree of parallelism p exceeds the number of available registers, then the compiler will resort to spilling, storing some of the temporary values on the stack. Once this happens, performance can drop significantly.
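The degree of parallelism p corresponds to the number of separate accumulators, each of which must live in its own register across iterations. A two-accumulator sketch of the inner loop, in the style of the chapter's combine6:

/* 2-way unrolling with 2-way parallelism */
for (i = 0; i < limit; i += 2) {
    acc0 = acc0 OP data[i];
    acc1 = acc1 OP data[i+1];
}
/* After the loop, combine the accumulators */
*dest = acc0 OP acc1;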

We see that for IA32, the lowest CPE is achieved when just k = 4 values are accumulated in parallel, and it gets worse for higher values of k. We also see that we cannot get down to the CPE of 1.00 achieved for x86-64.
Examining the IA32 code for the case of k = 5 shows the effect of this small register set:
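(The compiler's listing is not preserved here; the following sketch, not verbatim gcc output, illustrates the kind of inner-loop code that results, using the stack offsets discussed below.)

    imull (%eax,%edx,4), %ecx       acc0 *= data[i] (acc0 still in a register)
    movl -16(%ebp), %ebx            Load spilled acc1 from the stack
    imull 4(%eax,%edx,4), %ebx      acc1 *= data[i+1]
    movl %ebx, -16(%ebp)            Store acc1 back to the stack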

Accumulator values acc1 and acc4 have been “spilled” onto the stack, at offsets −16 and −28 relative to %ebp. In addition, the termination value limit is kept on the stack at offset −20. The loads and stores associated with reading these values from memory and then storing them back negate any benefit obtained by accumulating multiple values in parallel.
5.11.2 Branch Prediction and Misprediction Penalties
Recent versions of x86 processors have conditional move instructions, and gcc can generate code that uses these instructions when compiling conditional statements and expressions, rather than the more traditional realizations based on conditional transfers of control. The basic idea for translating into conditional moves is to compute the values along both branches of a conditional expression or statement, and then use conditional moves to select the desired value. Conditional move instructions can be implemented as part of the pipelined processing of ordinary instructions. There is no need to guess whether or not the condition will hold, and hence no penalty for guessing incorrectly.
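For example, a small function in the style of the book's earlier absdiff example computes values along both branches and lets a conditional move select between them:

int absdiff(int x, int y)
{
    /* Both y - x and x - y can be computed unconditionally;
       gcc can then emit a conditional move (e.g., cmovge) to pick one */
    return x < y ? y - x : x - y;
}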
Do Not Be Overly Concerned about Predictable Branches
We have seen that the effect of a mispredicted branch can be very high, but that does not mean that all program branches will slow a program down. In fact, the branch prediction logic found in modern processors is very good at discerning regular patterns and long-term trends for the different branch instructions.
Write Code Suitable for Implementation with Conditional Moves
Branch prediction is only reliable for regular patterns. Many tests in a program are completely unpredictable, such as whether a number is negative or positive. For these, the branch prediction logic will do very poorly, with a success rate of essentially 50%, no better than chance. For inherently unpredictable cases, program performance can be greatly enhanced if the compiler is able to generate code using conditional data transfers rather than conditional control transfers. This cannot be controlled directly by the C programmer, but some ways of expressing conditional behavior can be more directly translated into conditional moves than others.
We have found that gcc is able to generate conditional moves for code written in a more “functional” style, where we use conditional operations to compute values and then update the program state with these values, as opposed to a more “imperative” style, where we use conditionals to selectively update program state.
Suppose we are given two arrays of integers a and b, and at each position i, we want to set a[i] to the minimum of a[i] and b[i], and b[i] to the maximum. An imperative style of implementing this function is to check at each position i and swap the two elements if they are out of order:
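A sketch of such a function (a reconstruction; here called minmax1):

/* Rearrange two vectors so that for each i, b[i] >= a[i] */
void minmax1(int a[], int b[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            /* Conditionally swap the out-of-order pair */
            int t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}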

Our measurements for this function show a CPE of around 14.50 for random data and 3.00–4.00 for predictable data, a clear sign of a high misprediction penalty.
A functional style of implementing this function is to compute the minimum and maximum values at each position i and then assign these values to a[i] and b[i], respectively:
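A sketch of this version (again a reconstruction; here called minmax2):

/* Rearrange two vectors so that for each i, b[i] >= a[i] */
void minmax2(int a[], int b[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        /* Compute both values unconditionally, then store both */
        int min = a[i] < b[i] ? a[i] : b[i];
        int max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}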

Our measurements for this function show a CPE of around 5.0 regardless of whether the data are arbitrary or predictable.
5.12 Understanding Memory Performance
5.12.1 Load Performance
The performance of a program containing load operations depends on both the pipelining capability and the latency of the load unit. In our experiments with combining operations on a Core i7, we saw that the CPE never got below 1.00. One factor limiting the CPE for our examples is that they all require reading one value from memory for each element computed. Since the load unit can only initiate one load operation every clock cycle, the CPE cannot be less than 1.00. For applications where we must load k values for every element computed, we can never achieve a CPE lower than k.
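The experiment below measures a function whose performance is limited by load latency rather than load throughput: computing the length of a linked list. A sketch (the node type is an assumption consistent with the discussion):

typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;

/* Count the elements of a null-terminated list */
int list_len(list_ptr ls)
{
    int len = 0;
    while (ls) {
        len++;
        ls = ls->next;    /* each new pointer value comes from a load */
    }
    return len;
}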

Our measurements show that function list_len has a CPE of 4.00, which we claim is a direct indication of the latency of the load operation.
Consider the x86-64 assembly code for the loop:
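A reconstruction of that loop (len in %eax, ls in %rdi; line numbers are shown because the discussion below refers to them):

1 .L11:                  loop:
2   addl $1, %eax            Increment len
3   movq (%rdi), %rdi        ls = ls->next
4   testq %rdi, %rdi         Test ls
5   jne .L11                 If nonnull, goto loop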

The movq instruction on line 3 forms the critical bottleneck in this loop. Each successive value of register %rdi depends on the result of a load operation having the value in %rdi as its address. Thus, the load operation for one iteration cannot begin until the one for the previous iteration has completed. The CPE of 4.00 for this function is determined by the latency of the load operation.
5.12.2 Store Performance
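
The experiments below use a function that sets an array to zero; a sketch of both versions (clear_array_4 is reconstructed by analogy with the chapter's unrolling examples):

/* Set all elements of dest to 0 */
void clear_array(int *dest, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dest[i] = 0;
}

/* Same computation, unrolled by 4 */
void clear_array_4(int *dest, int n)
{
    int i;
    int limit = n - 3;
    for (i = 0; i < limit; i += 4) {
        dest[i] = 0;
        dest[i+1] = 0;
        dest[i+2] = 0;
        dest[i+3] = 0;
    }
    for (; i < n; i++)    /* finish any remaining elements */
        dest[i] = 0;
}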

As with the load operation, the store operation can operate in a fully pipelined mode, beginning a new store on every cycle. The first version, clear_array, has a CPE of 2.00; unrolling the loop four times (clear_array_4) brings the CPE down to 1.00. Thus, we have achieved the optimum of one new store operation per cycle.
The store operation does not affect any register values, so a series of store operations cannot create a data dependency. Only a load operation is affected by the result of a store operation, since only a load can read back the memory value that has been written by the store.
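The two examples below exercise a store followed by a load; a sketch in the style of the book's write_read function (the name and the sample calls are reconstructions):

/* Write val to *dest, then read from *src, cnt times */
void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;
    while (cnt--) {
        *dest = val;          /* store */
        val = (*src) + 1;     /* load; may depend on the store above */
    }
}

With an array int a[2], Example A corresponds to a call such as write_read(&a[0], &a[1], 3), where the source and destination addresses differ, and Example B to write_read(&a[0], &a[0], 3), where they match.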

Example A, with distinct source and destination addresses, has a CPE of 2.00. Example B illustrates a phenomenon we will call a write/read dependency: the outcome of a memory read depends on a recent memory write. Its CPE is 6.00; the write/read dependency causes a significant slowdown in the processing.

The store unit contains a store buffer containing the addresses and data of the store operations that have been issued to the store unit, but have not yet been completed, where completion involves updating the data cache. This buffer is provided so that a series of store operations can be executed without having to wait for each one to update the cache.
When a load operation occurs, it must check the entries in the store buffer for matching addresses. If it finds a match (meaning that any of the bytes being written have the same address as any of the bytes being read), it retrieves the corresponding data entry as the result of the load operation.

Register mapping for the write_read inner loop: dest: %ecx; val: %eax; src: %ebx; cnt: %edx
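
A reconstruction of the corresponding inner loop:

.L2:                      loop:
    movl %eax, (%ecx)         Write val to dest
    movl (%ebx), %eax         Read val from src
    addl $1, %eax             val = val + 1
    subl $1, %edx             cnt--
    jne .L2                   If != 0, goto loop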

The instruction movl %eax,(%ecx) is translated into two operations: The s_addr instruction computes the address for the store operation, creates an entry in the store buffer, and sets the address field for that entry. The s_data operation sets the data field for the entry. As we will see, the fact that these two computations are performed independently can be important to program performance.
The arcs on the right of the operators denote a set of implicit dependencies for these operations. The address computation of the s_addr operation must precede the s_data operation. The load operation generated by decoding the instruction movl (%ebx), %eax must check the addresses of any pending store operations, creating a data dependency between it and the s_addr operation. The figure shows a dashed arc between the s_data and load operations. This dependency is conditional: if the two addresses match, the load operation must wait until the s_data has deposited its result into the store buffer, but if the two addresses differ, the two operations can proceed independently.

The arc labeled (1) represents the requirement that the store address must be computed before the data can be stored. The arc labeled (2) represents the need for the load operation to compare its address with that for any pending store operations. Finally, the dashed arc labeled (3) represents the conditional data dependency that arises when the load and store addresses match.
Figure 5.36(b) shows just two chains of dependencies: the one on the left, with data values being stored, loaded, and incremented (only for the case of matching addresses), and the one on the right, decrementing variable cnt.

For the case of Example A of Figure 5.33, with differing source and destination addresses, the load and store operations can proceed independently, and hence the only critical path is formed by the decrementing of variable cnt. This would lead us to predict a CPE of just 1.00, rather than the measured CPE of 2.00. The extra cycle arises because the effort to compare load addresses with those of the pending store operations forms an additional bottleneck.
For the case of Example B, with matching source and destination addresses, the data dependency between the s_data and load instructions causes a critical path to form involving data being stored, loaded, and incremented.
With operations on registers, the processor can determine which instructions will affect which others as they are being decoded into operations. With memory operations, the processor cannot predict which will affect which others until the load and store addresses have been computed. The memory subsystem makes use of many optimizations, such as exploiting the potential parallelism when operations can proceed independently.
5.13 Life in the Real World: Performance Improvement Techniques
A number of basic strategies for optimizing program performance:
High-level design. Choose appropriate algorithms and data structures for the problem at hand.
Basic coding principles. Avoid optimization blockers so that a compiler can generate efficient code. Eliminate excessive function calls. Move computations out of loops when possible. Consider selective compromises of program modularity to gain greater efficiency. Eliminate unnecessary memory references. Introduce temporary variables to hold intermediate results. Store a result in an array or global variable only when the final value has been computed (see the sketch after this list).
Low-level optimizations. Unroll loops to reduce overhead and to enable further optimizations. Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation. Rewrite conditional operations in a functional style to enable compilation via conditional data transfers.
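
As an illustration of eliminating unnecessary memory references with a temporary variable (the sketch promised in the list above; the function names are made up):

/* Accumulates directly in *dest: a memory read and a write per iteration */
void sum_slow(int *a, int n, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < n; i++)
        *dest += a[i];
}

/* Accumulates in a local variable the compiler can keep in a register */
void sum_fast(int *a, int n, int *dest)
{
    int i;
    int acc = 0;
    for (i = 0; i < n; i++)
        acc += a[i];
    *dest = acc;    /* single store once the final value is computed */
}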
5.14 Identifying and Eliminating Performance Bottlenecks
5.14.1 Program Profiling
Program profiling involves running a version of a program in which instrumentation code has been incorporated to determine how much time the different parts of the program require.
Unix systems provide the profiling program gprof. This program generates two forms of information. First, it determines how much CPU time was spent for each of the functions in the program. Second, it computes a count of how many times each function gets called, categorized by which function performs the call. Both forms of information can be quite useful. The timings give a sense of the relative importance of the different functions in determining the overall run time. The calling information allows us to understand the dynamic behavior of the program.
Profiling with gprof requires three steps, as shown for a C program prog.c, which runs with command line argument file.txt:
1. The program must be compiled and linked for profiling. With gcc (and other C compilers) this involves simply including the flag ‘-pg’ on the command line:
unix> gcc -O1 -pg prog.c -o prog
2. The program is then executed as usual:
unix> ./prog file.txt
The program runs slightly slower than normal (by around a factor of 2), but otherwise the only difference is that it generates a file gmon.out.
3. gprof is invoked to analyze the data in gmon.out.
unix> gprof prog
The first part of the profile report lists the times spent executing the different functions, sorted in descending order. As an example, the following listing shows this part of the report for the three most time-consuming functions in a program:
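The original numbers are not reproduced here; gprof's flat profile has this general layout:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name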

Each row represents the time spent for all calls to some function. The first column indicates the percentage of the overall time spent on the function. The second shows the cumulative time spent by the functions up to and including the one on this row. The third shows the time spent on this particular function, and the fourth shows how many times it was called (not counting recursive calls).
Library function calls are normally not shown in the results by gprof. Their times are usually reported as part of the function calling them.
The second part of the profile report shows the calling history of the functions. The following is the history for a recursive function find_ele_rec:

This history shows both the functions that called find_ele_rec, as well as the functions that it called. The first two lines show the calls to the function: 158,655,725 calls by itself recursively, and 965,027 calls by function insert_string (which is itself called 965,027 times). Function find_ele_rec in turn called two other functions, save_string and new_ele, each a total of 363,039 times.
Some properties of gprof are worth noting:
The timing is not very precise. It is based on a simple interval-counting scheme with a coarse timer (samples on the order of every 10 milliseconds), so the reported times are reliable only for programs that run long enough, at least several seconds.
The calling information is quite reliable.
By default, the timings for library functions are not shown. Instead, these times are incorporated into the times for the calling functions.
5.14.2 Using a Profiler to Guide Optimization
Please indicate the source: http://blog.csdn.net/gaoxiangnumber1.
