Part 2
In high school, we learned that some mathematical operations are logically equivalent, along with rules for transforming expressions into easier-to-solve forms. But as programmers, we know that not all forms of a problem are computationally the same, because some can take advantage of particular programming languages or microprocessor features to reduce the clock cycles required for evaluation.
Alan Zeichick concludes his series by looking at tricks for ensuring data structure alignment and speeding up the evaluation of integer and floating-point math.
In the first part of this series, we examined six ways to speed up C/C++ applications running on either 32-bit or 64-bit operating systems. Let's take a look at eight additional optimizations.
#7. Sort and pad C and C++ structures to achieve natural alignment
If your compiler allows it, pad data structures to make their sizes a multiple of a word, doubleword, or quadword.
The best approach is to sort the structure members according to their type sizes, declaring members with larger type sizes first. Then, pad the structure so the size of the structure is a multiple of the largest member's type size.
For example, you should avoid structure declarations in which the members are not declared in order of their type sizes and the size of the structure is not a multiple of the size of the largest member's type. This code is suboptimal:
struct {
    char a[5];   // Smallest type size (1 byte * 5)
    long k;      // 4 bytes in this example
    double x;    // Largest type size (8 bytes)
} baz;
A better approach is:
struct {
    double x;    // Largest type size (8 bytes)
    long k;      // 4 bytes in this example
    char a[5];   // Smallest type size (1 byte * 5)
    char pad[7]; // Make structure size a multiple of 8.
} baz;
The compiler will use padding to "naturally align" elements by default. That means that pointer-size elements will grow when compiled for 64-bit targets, which can affect the structure's size and padding and cause unnecessary "data bloat" unless care is taken to avoid it.
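To see the effect, here's a minimal sketch (the struct names are mine, and exact sizes depend on your compiler and target). A struct with a pointer sandwiched between smaller members typically grows from 12 bytes on a 32-bit target to 24 bytes on a 64-bit target, while the sorted, explicitly padded version grows only from 12 to 16:

#include <stdio.h>

struct Record {           // members out of size order
    char tag;             // offset 0
    double *data;         // offset 4 (32-bit) or 8 (64-bit, after 7 pad bytes)
    int count;            // follows the pointer
};                        // typically 12 bytes on 32-bit, 24 on 64-bit

struct RecordSorted {     // largest member first
    double *data;
    int count;
    char tag;
    char pad[3];          // explicit tail padding
};                        // typically 12 bytes on 32-bit, 16 on 64-bit

int main() {
    printf("Record: %u bytes, RecordSorted: %u bytes\n",
           (unsigned)sizeof(struct Record),
           (unsigned)sizeof(struct RecordSorted));
    return 0;
}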
#8. Declare single-precision float constants using the f suffix
Most C and C++ compilers treat floating-point constants and arguments as double precision unless you specify otherwise. While that obviously gives you more accuracy, single precision occupies half the memory space and can often provide the precision necessary for a given computational problem.
So, if you're trying to use a constant value for the conversion factor between millimeters and inches, you might define it as 25.4f instead of as 25.4, to provide a single-precision constant value that may be adequate for your task.
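As a minimal illustration (the constant and function names here are mine, not from the original):

const float MM_PER_INCH = 25.4f;  // single-precision constant

float to_millimeters(float inches) {
    // float * float stays in single precision; with plain 25.4 (a double),
    // the multiply would be promoted to double and converted back to float.
    return inches * MM_PER_INCH;
}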
#9. Restructure floating-point math to reduce the number of operations, if possible
Floating-point operations have long latencies, even when you're using x87 (in 32-bit mode) or the SSE/SSE2 instruction set extensions (in 32-bit or 64-bit mode), so those operations should be a target for optimization, particularly given the depth of the AMD Opteron or Athlon64 processors' instruction pipeline.
However, you should be careful because even when algebraic rules would permit you to transform one floating-point operation into another, the algorithms within a microprocessor may not yield exactly the same results down to the least significant bits. You should consult a book on numerical analysis or experiment to make sure that it's okay to tinker with the math.
Here's an optimization example that involves the concept of data dependencies. This example, recommended by AMD, uses four-way unrolling to exploit the four-stage, fully pipelined floating-point adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximum sustained utilization. Mathematically, it's equivalent; computationally, it's faster.
The original code:
double a[100], sum;
int i;

sum = 0.0;
for (i = 0; i < 100; i++) {
    sum += a[i];
}
This version is faster, because the code implements four separate dependence chains instead of just one, which keeps the pipeline full. The /fp:fast compiler switch in Visual Studio 2005 can help do this sort of optimization automatically.
double a[100], sum1, sum2, sum3, sum4, sum;
int i;

sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;
for (i = 0; i < 100; i += 4) {
    sum1 += a[i];
    sum2 += a[i+1];
    sum3 += a[i+2];
    sum4 += a[i+3];
}
sum = (sum4 + sum3) + (sum1 + sum2);
#10. Manually extract subexpressions from floating-point operations
Because the C/C++ compiler won't change your math around, you'll have to apply algebraic optimizations yourself. As mentioned earlier, there's a slight chance that changing the way that the math is performed may change the least-significant bits of the results, so always exercise caution when performing this type of optimization. That /fp:fast compiler switch in Visual Studio 2005 can do some of these optimizations automatically, too.
Here's an example of code that can be optimized by removing a common subexpression. Before:
double a, b, c, d, e, f;

e = b * c / d;
f = b / d * a;
And after:
double a, b, c, d, e, f, t;

t = b / d;
e = c * t;
f = a * t;
Here's another example. Before:
double a, b, c, e, f;

e = a / c;
f = b / c;
And after:
double a, b, c, e, f, t;

t = 1.0 / c;
e = a * t;
f = b * t;
#11. Use 32-bit integers instead of 8-bit or 16-bit integers
This is a tip specifically for 32-bit applications running either natively on a 32-bit operating system or within a 64-bit operating system. You'll see better performance because operations on 8-bit or 16-bit integers are often less efficient. Of course, you'll be using two or four times the memory, so be aware of that trade-off.
By the way, 32-bit integers execute at full speed when you're running in 64-bit mode, so unless you need the extra bits for some application-specific reason, you should stick to 32-bit ones.
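Here's a hedged sketch of the difference (the function names are illustrative, and the actual code generated depends on your compiler):

#include <stdint.h>

// 8-bit induction variable: the compiler may emit extra sign-extension
// or truncation instructions to keep i behaving like an 8-bit value.
int32_t sum_narrow(const int32_t *a) {
    int32_t sum = 0;
    for (int8_t i = 0; i < 100; i++)
        sum += a[i];
    return sum;
}

// 32-bit induction variable: maps directly onto a native register,
// and still runs at full speed in 64-bit mode.
int32_t sum_native(const int32_t *a) {
    int32_t sum = 0;
    for (int32_t i = 0; i < 100; i++)
        sum += a[i];
    return sum;
}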
#12. If you don't need signed data types, note that operations on unsigned types can be faster
In many cases, such as with counters, array indexes, or quotients and remainders after division, it's almost certain that your integer will be non-negative. In such cases, operations on unsigned integers may be more efficient.
However, if you're using those integers in mixed-mode arithmetic (that is, integer combined with floating-point), AMD advises that integer-to-floating-point conversion of integers larger than 16 bits is faster with signed types, because the AMD64 architecture provides native instructions for converting signed integers to floating-point but has no instructions for converting unsigned integers.
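A small sketch of that mixed-mode case (the function names are mine; inspect your compiler's output to confirm the difference on your target):

// Signed 32-bit to double: the architecture has native conversion
// instructions for this (FILD with x87, CVTSI2SD with SSE2).
double scale_signed(int n, double factor) {
    return n * factor;
}

// Unsigned to double: with no native unsigned-conversion instruction,
// the compiler may emit extra fix-up code, particularly in 32-bit mode
// or for 64-bit unsigned values.
double scale_unsigned(unsigned int n, double factor) {
    return n * factor;
}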
#13. Replace integer division with multiplication when there are multiple divisions in an expression
Integer division is the slowest of all the integer arithmetic operations on most microprocessors, including the AMD Opteron and Athlon64 processors. So, multiply instead of dividing! (Be careful to ensure that you won't have an overflow during the computation of the divisors.)
Restructure code that uses two integer division operations, like this:
int i, j, k, m;

m = i / j / k;
It will run faster if you replace one of the integer divisions with the equivalent multiplication:
int i, j, k, m;

m = i / (j * k);
#14. Use SSE inline instructions for performing single-precision division and square root operations
Division and square root operations have a much longer latency than other 32-bit floating-point operations, advises AMD, and in some application programs, these operations occur so often as to seriously impact performance. It's often faster to drop to SSE inline instructions to perform that operation. (If your compiler can optimize to the SSE instruction set extensions, it may perform this task for you.)
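If your compiler supports intrinsics, you can get at these instructions without writing raw assembly. Here's a sketch using the SSE scalar intrinsics from xmmintrin.h; _mm_div_ss and _mm_sqrt_ss map to DIVSS and SQRTSS, and _mm_rsqrt_ss gives a faster reciprocal square root approximation (roughly 12 bits of precision):

#include <xmmintrin.h>  // SSE intrinsics

float sse_div(float a, float b) {
    // DIVSS: single-precision scalar divide
    return _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(a), _mm_set_ss(b)));
}

float sse_sqrt(float a) {
    // SQRTSS: single-precision scalar square root
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(a)));
}

float approx_rsqrt(float a) {
    // RSQRTSS: fast approximate 1/sqrt(a); add a Newton-Raphson step
    // if you need more than about 12 bits of precision.
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a)));
}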
If your 32-bit code has hot spots that use single-precision arithmetic only, you can set the precision-control field of the x87 floating-point unit to single precision, overriding Windows' default of double precision for those operations. Because the processor's latency for single-precision division and square roots is lower than for double precision, this saves you some type conversions and latency.
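On Windows with Visual C++, a sketch of that override might look like this, using the _controlfp() run-time function from float.h (the loop body is just a placeholder):

#include <float.h>

void hot_spot(float *data, int n) {
    unsigned int saved = _controlfp(0, 0);  // read current control word
    _controlfp(_PC_24, _MCW_PC);            // 24-bit significand: single precision

    for (int i = 0; i < n; i++)
        data[i] = 1.0f / data[i];  // x87 divide now runs at single-precision latency

    _controlfp(saved, _MCW_PC);             // restore the caller's precision mode
}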
Look for more
There are many great resources where one can find more information about application optimization for AMD64 processors. I can heartily recommend the Software Optimization Guide for AMD Athlon 64 and AMD Opteron Processors (PDF).
Also valuable are the AMD64 Architecture Reference Manuals, which can be ordered on CD-ROM or in hard copy, or downloaded at no charge. Those guides are geared more toward the assembly programmer, but they proved essential during the research for these articles on C/C++ optimization.
There are also some excellent AMD developer presentations, which are worth downloading and reading; many have great code samples.
A former mainframe software developer and systems analyst, Alan Zeichick is principal analyst at Camden Associates, an independent technology research firm focusing on networking, storage, and software development.