Optimizing Your C/C++ Applications

C/C++ optimizing compilers are great, but there *are* a few techniques for hand-tuning your code to run as efficiently as possible on the AMD Athlon64 and Opteron processors, especially when developing DLLs, device drivers, or other performance-bound pieces of code.
Alan Zeichick  
6/21/2004 
 
Sidebar: Know the Code
A code profiler can help determine which parts of your application should be optimized. You should always use a profiler both before and after optimization to see if your change actually made an improvement—with C/C++, you can never be sure if your hand-tuning methods are better than those of your optimizing compiler!

The smart engineers at AMD have spent a lot of time identifying places where C/C++ developers can optimize code for best performance. Some of these optimizations might look familiar—they're generally great coding practice for any 32-bit or 64-bit platform or operating system. In fact, most of these suggestions are appropriate whether you're developing 32-bit or 64-bit apps. Others are specifically for 64-bit applications running on the AMD64 architecture. This article will look at six optimizations; Part 2 will examine eight more.

#1. Declare functions as static when they are not used outside the file where they are defined

Declaring a function as static forces an internal linkage within that file, which can improve the performance of the code. Functions that are not declared as static default to external linkage, which may inhibit certain optimizations, such as aggressive inlining, with some C/C++ compilers.
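As a minimal sketch (the function names here are hypothetical), a helper used only inside its own file can be marked static, giving the compiler license to inline it:

```c
/* Internal linkage: square() is invisible outside this file, so the
   compiler may inline every call and discard the standalone copy. */
static int square(int x) {
    return x * x;
}

/* External linkage: callable from other files; the static helper
   inside the loop carries no call overhead once inlined. */
int sum_of_squares(int n) {
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += square(i);
    return total;
}
```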

#2: Use array notation instead of pointer notation when working with arrays

You can use either array operators ([]) or pointers to access the elements of an array. However, it's hard for the compiler to optimize pointers because the pointers are, by default, less constrained—the compiler has to assume that the pointer can read/write to any location in memory. Because array notation is less ambiguous, the compiler may be able to optimize it more effectively.

Because it's hard to know which way is more efficient, you may want to try coding your array reference using both pointers and array notation, and use a code profiler to see which way is fastest for your particular application.
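Here is a sketch of the two equivalent forms (function names are hypothetical); profile both to see which your compiler handles better:

```c
/* Pointer form: the compiler must assume p could alias other memory,
   which may inhibit some optimizations. */
int sum_ptr(const int *a, int n) {
    int total = 0;
    const int *p = a;
    while (n-- > 0)
        total += *p++;
    return total;
}

/* Array form: explicit indexing is easier for the compiler to analyze. */
int sum_idx(const int *a, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}
```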

#3. Completely unroll loops that have a small fixed loop count and a small loop body

As you know, many compilers do not aggressively unroll loops. Manually unrolling loops can benefit performance, because you're avoiding the loop overhead; for a small loop, that overhead can be significant.

For example, avoid a small loop like this:

// 3D-transform: Multiply vector V by 4x4 transform matrix M.
for (i = 0; i < 4; i++) {
   r[i] = 0;
   for (j = 0; j < 4; j++) {
      r[i] += m[j][i] * v[j];
   }
}

Instead, replace it with its completely unrolled equivalent, as shown here:

r[0] = m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3];
r[1] = m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3];
r[2] = m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3];
r[3] = m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3];
#4. Consider expression order when coding a compound branch

Complex branches—those which consist of many Boolean expressions with logical AND (&&) and logical OR (||) operators—provide great opportunities for optimization, if you know which AND and OR conditions are more likely to occur.

That's because the C/C++ compiler generates code that stops evaluating a compound condition as soon as the outcome is known: the branch condition is satisfied as soon as a true operand is found in an OR expression, and the branch is known to fail as soon as a false operand is found in an AND expression.

You might experiment with changing the order of the operands if you know that a true or false outcome is more probable for a specific operand. For example, in an AND expression, place the operand most likely to be false up front. Of course, you'll need to make sure that changing the order of the operands won't cause any unwanted side effects.
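A sketch under an assumed data distribution (the function and threshold are hypothetical): if most input lines are short, putting the cheap, usually-false length test first lets the AND chain exit before the more expensive substring search runs.

```c
#include <string.h>

/* Assumed: most lines are under 80 characters, so strlen(line) > 80
   is usually false and short-circuits the && before strstr() runs. */
int is_long_error_line(const char *line) {
    return strlen(line) > 80 && strstr(line, "ERROR") != NULL;
}
```

Neither operand has side effects here, so reordering them is safe; only the expected cost and probability of each test changes.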

#5. Leverage the early termination abilities of complex If statements

As seen above, as soon as a true operand is found in an OR expression, or a false one is found in an AND expression, evaluation of that expression stops because the outcome is known. If you have a good idea of how your data is distributed, you can arrange the test so that the common case short-cuts evaluation of the condition.

For example, say you're testing whether a and b fall within a specified numeric range. In the AND form below, evaluation stops at the first comparison that is false, so the test exits early whenever a value turns out to be out of range:

  if (a <= max && a >= min && b <= max && b >= min)

You can also logically reverse the test (by De Morgan's laws) so that it checks for out-of-range values directly; here the first true operand terminates the OR chain in the same situations:

  if (a > max || a < min || b > max || b < min)

#6. Use if-else statements in place of switch statements that have noncontiguous case expressions

If the case expressions are contiguous or nearly contiguous integer values, most compilers translate the switch statement into a jump table instead of a comparison chain. Jump tables generally improve performance because they reduce all of the branching to a single indirect jump, and the size of the control-flow code stays the same no matter how many cases there are. The amount of control-flow code that the processor must execute is also the same for all values of the switch expression.

However, if the case expressions are noncontiguous values, most compilers translate the switch statement as a comparison chain. Comparison chains are undesirable because they use dense sequences of conditional branches, which interfere with the processor's ability to successfully perform branch prediction. Also, the amount of control-flow code increases with the number of cases, and the amount of control-flow code that the processor must execute varies with the value of the switch expression.

For example, if the case expressions are contiguous (or nearly so, as with these character grades), a switch statement can provide good performance:

switch (grade)
{
   case 'A':
      ...
      break;
   case 'B':
      ...
      break;
   case 'C':
      ...
      break;
   case 'D':
      ...
      break;
   case 'F':
      ...
      break;
}

But if the case expressions aren't contiguous, the compiler will likely translate the code into a comparison chain instead of a jump table, and that can be slow:

switch (a)
{
   case 8:
      // Sequence for a==8
      break;
   case 16:
      // Sequence for a==16
      break;
   ...
   default:
      // Default sequence
      break;
}

In cases like this, replace the switch with a series of if-else statements:

if (a==8) {
   // Sequence for a==8
}
else if (a==16) {
   // Sequence for a==16
}
...
else {
   // Default sequence
}

Tune in next time
In the second part of this series, we'll look at eight more optimizations, focused on data structure alignment, integer operations and floating-point math.

A former mainframe software developer and systems analyst, Alan Zeichick is principal analyst at Camden Associates, an independent technology research firm focusing on networking, storage, and software development.

 

Part 2

 

In high school, we learned that some mathematical operations are logically equivalent, as well as rules for transforming operations into easier-to-solve formats. But as programmers, we know that not all forms of a problem are computationally the same, because some can take advantage of particular programming languages or microprocessor features to reduce the clock cycles required for evaluation.

Alan Zeichick concludes his series by looking at tricks for ensuring data structure alignment and speeding up the evaluation of integer and floating-point math.

In the first part of this series, we examined six ways to speed up C/C++ applications running on either 32-bit or 64-bit operating systems. Let's take a look at eight additional optimizations.

#7. Sort and pad C and C++ structures to achieve natural alignment

If your compiler allows it, pad data structures to make their sizes a multiple of a word, doubleword, or quadword.

The best approach is to sort the structure members according to their type sizes, declaring members with larger type sizes first. Then, pad the structure so the size of the structure is a multiple of the largest member's type size.

For example, you should avoid structure declarations in which the members are not declared in order of their type sizes and the size of the structure is not a multiple of the size of the largest member's type. This code is suboptimal:

struct {
   char a[5]; // Smallest type size (1 byte * 5)
   long k; // 4 bytes in this example
   double x; // Largest type size (8 bytes)
} baz;

A better approach is:

struct {
   double x; // Largest type size (8 bytes)
   long k; // 4 bytes in this example
   char a[5]; // Smallest type size (1 byte * 5)
   char pad[7]; // Make structure size a multiple of 8.
} baz;

The compiler will use padding to "naturally align" elements by default. That means pointer-sized elements will grow when compiled for 64-bit targets, which can change a structure's size and padding and cause unnecessary "data bloat" unless you take care to avoid it.

#8: Declare single-precision float constants using f

Most C and C++ compilers treat floating-point constants and arguments as double precision unless you specify otherwise. While that obviously gives you more accuracy, single precision occupies half the memory space and can often provide the precision necessary for a given computational problem.

So, if you need a constant for the conversion factor between millimeters and inches, you might define it as 25.4f instead of 25.4, to provide a single-precision constant value that may be adequate for your task.
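A sketch of the difference (the function names are hypothetical): with 25.4 the float argument is promoted to double, the division is performed in double precision, and the result is converted back; with 25.4f the whole computation stays in single precision.

```c
/* Double-precision constant: forces a promote/demote on every call. */
float mm_to_inches_slow(float mm) {
    return mm / 25.4;
}

/* Single-precision constant: the division stays in float. */
float mm_to_inches(float mm) {
    return mm / 25.4f;
}
```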

#9. Restructure floating-point math to reduce the number of operations, if possible

Floating-point operations have long latencies, whether you're using the x87 unit (in 32-bit mode) or the SSE/SSE2 instruction set extensions (in 32-bit or 64-bit mode), so those operations are a natural target for optimization, particularly given the depth of the AMD Opteron and Athlon64 processors' instruction pipeline.

However, you should be careful because even when algebraic rules would permit you to transform one floating-point operation into another, the algorithms within a microprocessor may not yield exactly the same results down to the least significant bits. You should consult a book on numerical analysis or experiment to make sure that it's okay to tinker with the math.

Here's an optimization example that involves the concept of data dependencies. This example, recommended by AMD, uses four-way unrolling to exploit the four-stage, fully pipelined floating-point adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximum sustained utilization. Mathematically, it's equivalent; computationally, it's faster.

The original code:

double a[100], sum;
int i;
sum = 0.0;
for (i = 0; i < 100; i++) {
   sum += a[i];
}

This version is faster, because the code implements four separate dependence chains instead of just one, which keeps the pipeline full. The /fp:fast compiler switch in Visual Studio 2005 can help do this sort of optimization automatically.

double a[100], sum1, sum2, sum3, sum4, sum;
int i;
sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;
for (i = 0; i < 100; i += 4) {
   sum1 += a[i];
   sum2 += a[i+1];
   sum3 += a[i+2];
   sum4 += a[i+3];
}
sum = (sum4 + sum3) + (sum1 + sum2);

#10. Manually extract subexpressions from floating-point operations

Because the C/C++ compiler generally won't rearrange your floating-point math for you, you'll have to apply algebraic optimizations yourself. As mentioned earlier, there's a slight chance that changing the way the math is performed may change the least-significant bits of the results, so always exercise caution when performing this type of optimization. That /fp:fast compiler switch in Visual Studio 2005 can do some of these optimizations automatically, too.

Here's an example of code that can be optimized by removing a common subexpression. Before:

double a, b, c, d, e, f;
e = b * c / d;
f = b / d * a;

And after:

double a, b, c, d, e, f, t;
t = b / d;
e = c * t;
f = a * t;

Here's another example. Before:

double a, b, c, e, f;
e = a / c;
f = b / c;

And after:

double a, b, c, e, f, t;
t = 1 / c;
e = a * t;
f = b * t;

#11. Use 32-bit integers instead of 8-bit or 16-bit integers

This is a tip specifically for 32-bit applications running either natively on a 32-bit operating system, or within a 64-bit operating system. You'll see better performance because operations on 8-bit or 16-bit integers are often less efficient. Of course, you'll be using twice or four times the memory, so be aware of that trade-off.

By the way, 32-bit integers execute at full speed when you're running in 64-bit mode, so unless you need the extra bits for some application-specific reason, you should stick to 32-bit ones.
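A sketch of the trade-off (hypothetical functions): in 32-bit code the 16-bit version may incur operand-size prefixes and partial-register operations on x86, while the 32-bit version runs at the processor's full native width.

```c
#include <stdint.h>

/* 16-bit counter and accumulator: may need operand-size prefixes
   and partial-register updates on x86. */
int16_t count_even16(const int16_t *a, int16_t n) {
    int16_t count = 0;
    for (int16_t i = 0; i < n; i++)
        if ((a[i] & 1) == 0)
            count++;
    return count;
}

/* 32-bit counter and accumulator: the native word size in 32-bit
   mode, and still full speed in 64-bit mode. */
int32_t count_even32(const int16_t *a, int32_t n) {
    int32_t count = 0;
    for (int32_t i = 0; i < n; i++)
        if ((a[i] & 1) == 0)
            count++;
    return count;
}
```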

#12. If you don't need signed data types, note that operations on unsigned types can be faster

In many cases, such as with counters, array indexes, or quotients and remainders after division, it's almost certain that your integer will be non-negative. In such cases, operations on unsigned integers may be more efficient.

However, if you're using those integers for mixed-mode (that is, integer combined with floating point) type of arithmetic, AMD advises that integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, because the AMD64 architecture provides native instructions for converting signed integers to floating-point but has no instructions for converting unsigned integers.
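A sketch of the effect (hypothetical functions): dividing an unsigned value by a power of two compiles to a single logical shift, while the signed version needs extra fix-up code so that negative inputs round toward zero.

```c
/* Unsigned: x / 8 becomes one logical right shift. */
unsigned int eighth_u(unsigned int x) {
    return x / 8;
}

/* Signed: the compiler must emit extra instructions so that, e.g.,
   -9 / 8 rounds toward zero to -1 rather than shifting down to -2. */
int eighth_s(int x) {
    return x / 8;
}
```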

#13. Replace integer division with multiplication when there are multiple divisions in an expression

Integer division is the slowest of all the integer arithmetic operations on most microprocessors, including the AMD Opteron and Athlon64 processors. So, multiply instead of dividing! (Be careful to ensure that you won't have an overflow during the computation of the divisors.)

For example, restructure code that uses two integer division operations, like this:

int i, j, k, m;
m = i / j / k;

so that one of the divisions is replaced with an equivalent multiplication, which is faster:

int i, j, k, m;
m = i / (j * k);

#14. Use SSE inline instructions for performing single-precision division and square root operations.

Division and square root operations have a much longer latency than other 32-bit floating-point operations, advises AMD, and in some application programs, these operations occur so often as to seriously impact performance. It's often faster to drop to SSE inline instructions to perform that operation. (If your compiler can optimize to the SSE instruction set extensions, it may perform this task for you.)

If your 32-bit code has hot spots that use single-precision arithmetic only, you can set the precision-control field of the x87 floating-point unit to single precision, overriding Windows' default of double precision for those operations. Because the processor completes single-precision division and square roots in fewer cycles than double precision, this saves you some type conversions and latency.
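A sketch using SSE intrinsics rather than raw inline assembly (the function names are hypothetical): the scalar divss and sqrtss instructions operate on the low single-precision lane of an XMM register, avoiding the x87 unit entirely.

```c
#include <xmmintrin.h>

/* Scalar single-precision divide via the SSE divss instruction. */
float sse_div(float a, float b) {
    return _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(a), _mm_set_ss(b)));
}

/* Scalar single-precision square root via the SSE sqrtss instruction. */
float sse_sqrt(float a) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(a)));
}
```

Both operations here are exact in IEEE single precision, so the results match the x87 unit for these inputs; for other inputs the least-significant bits may differ from double-precision computation.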

Look for more

There are many great resources where you can find more information about application optimization for AMD64 processors. I can heartily recommend the Software Optimization Guide for AMD Athlon 64 and AMD Opteron Processors (PDF).

Also valuable are the AMD64 Architecture Reference Manuals, which can be ordered on CD-ROM or in hard copy, or downloaded at no charge. Those guides are geared more toward the assembly programmer, but proved essential during the research for these articles on C/C++ optimization.

There are also some excellent AMD developer presentations, which are worth downloading and reading; many have great code samples.

