Please indicate the source: http://blog.csdn.net/gaoxiangnumber1
5.9.2 Reassociation Transformation
Figure 5.26 shows a function combine7 that differs from the unrolled code of combine5 (Figure 5.16) only in the way the elements are combined in the inner loop. In combine5, the combining is performed by the statement
acc = (acc OP data[i]) OP data[i+1];
while in combine7 it is performed by the statement
acc = acc OP (data[i] OP data[i+1]);
Figure 5.27 demonstrates the effect of applying the reassociation transformation to achieve k-way loop unrolling with reassociation.
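In C terms, the reassociated 2 x 1a loop can be sketched as follows. This is a simplified stand-in for combine7, which in the book operates on a vector abstraction with a data_t type and OP/IDENT macros; the function name and the choice of double with OP = * here are assumptions for illustration:

```c
/* Sketch of 2-way unrolling with reassociation (2 x 1a pattern).
   The parenthesization (data[i] * data[i+1]) lets the two loads and
   the first multiply proceed without waiting for acc from the
   previous iteration; only the second multiply is on the critical path. */
double product2x1a(const double *data, long n) {
    double acc = 1.0;
    long i;
    for (i = 0; i < n - 1; i += 2)
        acc = acc * (data[i] * data[i + 1]);
    /* Finish any remaining element when n is odd. */
    for (; i < n; i++)
        acc = acc * data[i];
    return acc;
}
```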
i: %rdx; data: %rax; limit: %rbp; acc: %xmm1
The load operations resulting from the movss and the first mulss instructions load vector elements i and i + 1 from memory, and the first mul operation multiplies them together. The second mul operation then multiplies this result by the accumulated value acc.
We have two load and two mul operations, but only one of the mul operations forms a data-dependency chain between loop registers. When we then replicate this template n/2 times to show the computations performed in multiplying n vector elements (Figure 5.30), we see that we only have n/2 operations along the critical path. The first multiplication within each iteration can be performed without waiting for the accumulated value from the previous iteration. Thus, we reduce the minimum possible CPE by a factor of 2. As we increase k, we continue to have only one operation per iteration along the critical path.
For integer addition and multiplication, the fact that these operations are associative implies that this reordering will have no effect on the result. For the floating-point cases, we must once again assess whether this reassociation is likely to significantly affect the outcome.
5.10 Summary of Results for Optimizing Combining Code
By using multiple optimizations, we have been able to achieve a CPE close to 1.00 for all combinations of data type and operation using ordinary C code, a performance improvement of over 10X compared to the original version combine1.
5.11 Some Limiting Factors
The critical path in a data-flow graph representation of a program indicates a fundamental lower bound on the time required to execute a program. That is, if there is some chain of data dependencies in a program where the sum of all of the latencies along that chain equals T, then the program will require at least T cycles to execute.
The throughput bounds of the functional units also impose a lower bound on the execution time for a program. That is, assume that a program requires a total of N computations of some operation, that the microprocessor has only m functional units capable of performing that operation, and that these units have an issue time of i. Then the program will require at least N * i/m cycles to execute. For example, if a program requires N = 1000 multiplications and the processor has a single fully pipelined multiplier (m = 1, i = 1), then these multiplications alone require at least 1000 cycles.
5.11.1 Register Spilling
The benefits of loop parallelism are limited by the ability to express the computation in assembly code. The IA32 instruction set has only a small number of registers to hold the values being accumulated. If we have a degree of parallelism p that exceeds the number of available registers, then the compiler will resort to spilling, storing some of the temporary values on the stack. Once this happens, the performance can drop significantly.
We see that for IA32, the lowest CPE is achieved when just k = 4 values are accumulated in parallel, and it gets worse for higher values of k. We also see that we cannot get down to the CPE of 1.00 achieved for x86-64.
Examining the IA32 code for the case of k = 5 shows the effect of this small register set:
Accumulator values acc1 and acc4 have been “spilled” onto the stack, at offsets −16 and −28 relative to %ebp. In addition, the termination value limit is kept on the stack at offset −20. The loads and stores associated with reading these values from memory and then storing them back negate any gain obtained by accumulating multiple values in parallel.
5.11.2 Branch Prediction and Misprediction Penalties
Recent versions of x86 processors have conditional move instructions, and gcc can generate code that uses these instructions when compiling conditional statements and expressions, rather than the more traditional realizations based on conditional transfers of control. The basic idea for translating into conditional moves is to compute the values along both branches of a conditional expression or statement, and then use conditional moves to select the desired value. Conditional move instructions can be implemented as part of the pipelined processing of ordinary instructions. There is no need to guess whether or not the condition will hold, and hence no penalty for guessing incorrectly.
Do Not Be Overly Concerned about Predictable Branches
We have seen that the effect of a mispredicted branch can be very high, but that does not mean that all program branches will slow a program down. In fact, the branch prediction logic found in modern processors is very good at discerning regular patterns and long-term trends for the different branch instructions.
Write Code Suitable for Implementation with Conditional Moves
Branch prediction is only reliable for regular patterns. Many tests in a program are completely unpredictable, such as whether a number is negative or positive. For these, the branch prediction logic will do very poorly, possibly giving a prediction rate of 50%. For inherently unpredictable cases, program performance can be greatly enhanced if the compiler is able to generate code using conditional data transfers rather than conditional control transfers. This cannot be controlled directly by the C programmer, but some ways of expressing conditional behavior can be more directly translated into conditional moves than others.
We have found that gcc is able to generate conditional moves for code written in a more “functional” style, where we use conditional operations to compute values and then update the program state with these values, as opposed to a more “imperative” style, where we use conditionals to selectively update program state.
Suppose we are given two arrays of integers a and b, and at each position i, we want to set a[i] to the minimum of a[i] and b[i], and b[i] to the maximum. An imperative style of implementing this function is to check at each position i and swap the two elements if they are out of order:
Our measurements for this function show a CPE of around 14.50 for random data and 3.00–4.00 for predictable data, a clear sign of a high misprediction penalty.
A functional style of implementing this function is to compute the minimum and maximum values at each position i and then assign these values to a[i] and b[i], respectively:
Our measurements for this function show a CPE of around 5.0 regardless of whether the data are arbitrary or predictable.
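The two styles described above can be sketched roughly as follows (a reconstruction following the book's minmax example; treat the exact code and names as a sketch):

```c
/* Imperative style: a data-dependent branch decides whether to swap.
   On random data the branch is unpredictable, so this version pays
   frequent misprediction penalties. */
void minmax1(long a[], long b[], long n) {
    for (long i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

/* Functional style: compute min and max unconditionally, then assign.
   gcc can compile the ?: expressions into conditional moves, so there
   is no branch to mispredict. */
void minmax2(long a[], long b[], long n) {
    for (long i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}
```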
5.12 Understanding Memory Performance
5.12.1 Load Performance
The performance of a program containing load operations depends on both the pipelining capability and the latency of the load unit. In our experiments with combining operations on a Core i7, we saw that the CPE never got below 1.00. One factor limiting the CPE for our examples is that they all require reading one value from memory for each element computed. Since the load unit can only initiate one load operation every clock cycle, the CPE cannot be less than 1.00. For applications where we must load k values for every element computed, we can never achieve a CPE lower than k.
Our measurements show that function list_len has a CPE of 4.00, which we claim is a direct indication of the latency of the load operation.
Consider the x86-64 assembly code for the loop:
The movq instruction on line 3 forms the critical bottleneck in this loop. Each successive value of register %rdi depends on the result of a load operation having the value in %rdi as its address. Thus, the load operation for one iteration cannot begin until the one for the previous iteration has completed. The CPE of 4.00 for this function is determined by the latency of the load operation.
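The function under discussion can be sketched in C as follows (a reconstruction; the list type and field names are assumptions). Each iteration's pointer must be loaded from memory before the next iteration can begin, which is why the load latency bounds the CPE:

```c
/* Singly linked list node (field names assumed for illustration). */
typedef struct ELE {
    struct ELE *next;
    long data;
} list_ele, *list_ptr;

/* list_len: each iteration loads ls->next, and that loaded value is
   the address used by the next iteration's load, forming a chain of
   dependent loads along the critical path. */
long list_len(list_ptr ls) {
    long len = 0;
    while (ls) {
        len++;
        ls = ls->next;
    }
    return len;
}
```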
5.12.2 Store Performance
As with the load operation, the store operation can operate in a fully pipelined mode, beginning a new store on every cycle. The first version of clear_array has a CPE of 2.00; by unrolling the loop four times, clear_array_4 achieves a CPE of 1.00. Thus, we have achieved the optimum of one new store operation per cycle.
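The two versions can be sketched as follows (a reconstruction; the book's exact loop form may differ):

```c
/* clear_array: one store per element. */
void clear_array(long *dest, long n) {
    for (long i = 0; i < n; i++)
        dest[i] = 0;
}

/* clear_array_4: unrolled by 4, giving four independent stores per
   iteration so the pipelined store unit can begin one per cycle. */
void clear_array_4(long *dest, long n) {
    long i;
    for (i = 0; i + 3 < n; i += 4) {
        dest[i] = 0;
        dest[i + 1] = 0;
        dest[i + 2] = 0;
        dest[i + 3] = 0;
    }
    /* Finish the remaining 0-3 elements. */
    for (; i < n; i++)
        dest[i] = 0;
}
```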
The store operation does not affect any register values, so a series of store operations cannot create a data dependency. Only a load operation is affected by the result of a store operation, since only a load can read back the memory value that has been written by the store.
Example A, with differing source and destination addresses, has a CPE of 2.00. Example B illustrates a phenomenon we will call a write/read dependency: the outcome of a memory read depends on a recent memory write. Example B has a CPE of 6.00. The write/read dependency causes a slowdown in the processing.
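Both examples call a function like the following (a reconstruction of the book's write_read, which repeatedly stores a value and then loads and increments). Example A passes differing source and destination addresses; Example B passes the same address for both:

```c
/* write_read: store val through dst, then load through src and add 1.
   When src == dst (Example B), each load must wait for the preceding
   store, creating the write/read dependency. */
void write_read(long *src, long *dst, long n) {
    long cnt = n;
    long val = 0;
    while (cnt) {
        *dst = val;
        val = (*src) + 1;
        cnt--;
    }
}
```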
The store unit contains a store buffer containing the addresses and data of the store operations that have been issued to the store unit, but have not yet been completed, where completion involves updating the data cache. This buffer is provided so that a series of store operations can be executed without having to wait for each one to update the cache.
When a load operation occurs, it must check the entries in the store buffer for matching addresses. If it finds a match (meaning that any of the bytes being written have the same address as any of the bytes being read), it retrieves the corresponding data entry as the result of the load operation.
dest: %ecx; val: %eax; src: %ebx; cnt: %edx
The instruction movl %eax,(%ecx) is translated into two operations: The s_addr instruction computes the address for the store operation, creates an entry in the store buffer, and sets the address field for that entry. The s_data operation sets the data field for the entry. As we will see, the fact that these two computations are performed independently can be important to program performance.
The arcs on the right of the operators denote a set of implicit dependencies for these operations. The address computation of the s_addr operation must precede the s_data operation. The load operation generated by decoding the instruction movl (%ebx), %eax must check the addresses of any pending store operations, creating a data dependency between it and the s_addr operation. The figure shows a dashed arc between the s_data and load operations. This dependency is conditional: if the two addresses match, the load operation must wait until the s_data has deposited its result into the store buffer, but if the two addresses differ, the two operations can proceed independently.
The arc labeled (1) represents the requirement that the store address must be computed before the data can be stored. The arc labeled (2) represents the need for the load operation to compare its address with that for any pending store operations. Finally, the dashed arc labeled (3) represents the conditional data dependency that arises when the load and store addresses match.
Figure 5.36(b) shows just two chains of dependencies: the one on the left, with data values being stored, loaded, and incremented (only for the case of matching addresses), and the one on the right, decrementing variable cnt.
For the case of Example A of Figure 5.33, with differing source and destination addresses, the load and store operations can proceed independently, and hence the only critical path is formed by the decrementing of variable cnt. This would lead us to predict a CPE of just 1.00, rather than the measured CPE of 2.00. The reason for 2.00 is that the effort to compare load addresses with those of the pending store operations forms an additional bottleneck.
For the case of Example B, with matching source and destination addresses, the data dependency between the s_data and load instructions causes a critical path to form involving data being stored, loaded, and incremented.
With operations on registers, the processor can determine which instructions will affect which others as they are being decoded into operations. With memory operations, the processor cannot predict which will affect which others until the load and store addresses have been computed. The memory subsystem makes use of many optimizations, such as exploiting the potential parallelism when operations can proceed independently.
5.13 Life in the Real World: Performance Improvement Techniques
We have described a number of basic strategies for optimizing program performance:
High-level design. Choose appropriate algorithms and data structures for the problem at hand.
Basic coding principles. Avoid optimization blockers so that a compiler can generate efficient code. Eliminate excessive function calls. Move computations out of loops when possible. Consider selective compromises of program modularity to gain greater efficiency. Eliminate unnecessary memory references. Introduce temporary variables to hold intermediate results. Store a result in an array or global variable only when the final value has been computed.
Low-level optimizations. Unroll loops to reduce overhead and to enable further optimizations. Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation. Rewrite conditional operations in a functional style to enable compilation via conditional data transfers.
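The multiple-accumulator technique named above can be sketched as follows (function name and the choice of a sum of longs are assumptions for illustration):

```c
/* 2x2 unrolling: two parallel accumulators, combined at the end.
   acc0 and acc1 form independent dependency chains, so the two adds
   in each iteration can proceed in parallel. */
long sum2x2(const long *data, long n) {
    long acc0 = 0, acc1 = 0;
    long i;
    for (i = 0; i < n - 1; i += 2) {
        acc0 = acc0 + data[i];      /* even-indexed elements */
        acc1 = acc1 + data[i + 1];  /* odd-indexed elements */
    }
    /* Finish any remaining element when n is odd. */
    for (; i < n; i++)
        acc0 = acc0 + data[i];
    return acc0 + acc1;
}
```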
5.14 Identifying and Eliminating Performance Bottlenecks
5.14.1 Program Profiling
Program profiling involves running a version of a program in which instrumentation code has been incorporated to determine how much time the different parts of the program require.
Unix systems provide the profiling program gprof. This program generates two forms of information. First, it determines how much CPU time was spent for each of the functions in the program. Second, it computes a count of how many times each function gets called, categorized by which function performs the call. Both forms of information can be quite useful. The timings give a sense of the relative importance of the different functions in determining the overall run time. The calling information allows us to understand the dynamic behavior of the program.
Profiling with gprof requires three steps, as shown for a C program prog.c, which runs with command line argument file.txt:
1. The program must be compiled and linked for profiling. With gcc (and other C compilers) this involves simply including the flag ‘-pg’ on the command line, both when compiling and when linking:
unix> gcc -O1 -pg prog.c -o prog
2. The program is then executed as usual:
unix> ./prog file.txt
It runs slightly (around a factor of 2) slower than normal, but otherwise the only difference is that it generates a file gmon.out.
3. gprof is invoked to analyze the data in gmon.out.
unix> gprof prog
The first part of the profile report lists the times spent executing the different functions, sorted in descending order. As an example, the following listing shows this part of the report for the three most time-consuming functions in a program:
Each row represents the time spent for all calls to some function. The first column indicates the percentage of the overall time spent on the function. The second shows the cumulative time spent by the functions up to and including the one on this row. The third shows the time spent on this particular function, and the fourth shows how many times it was called (not counting recursive calls).
Library function calls are normally not shown in the results by gprof. Their times are usually reported as part of the function calling them.
The second part of the profile report shows the calling history of the functions. The following is the history for a recursive function find_ele_rec:
This history shows both the functions that called find_ele_rec, as well as the functions that it called. The first two lines show the calls to the function: 158,655,725 calls by itself recursively, and 965,027 calls by function insert_string (which is itself called 965,027 times). Function find_ele_rec in turn called two other functions, save_string and new_ele, each a total of 363,039 times.
Some properties of gprof are worth noting:
The timing is not very precise.
The calling information is quite reliable.
By default, the timings for library functions are not shown. Instead, these times are incorporated into the times for the calling functions.
5.14.2 Using a Profiler to Guide Optimization