Three Optimization Tips for C++


This is an approximate transcript of my talk at Facebook NYC on December 4, 2012, which discusses optimization tips for C++ programs. The video of the talk is here and the accompanying slides are here.

Scope

Commonly given advice about approaching optimization in general, and optimization of C++ code in particular, includes:

  • Quoting Knuth more or less out of context
  • The classic one-two punch: (a) Don’t do it; (b) Don’t do it yet
  • Most programs are I/O bound
  • Choose the right algorithm
  • Avoid constructing objects unnecessarily
  • Use C++11’s rvalue references to implement move constructors

That’s great advice, save for two issues. First, it has become hackneyed by overuse and is often wielded dogmatically to smother new discussions before they even happen. Second, some of it is vague. For example, “choose the right algorithm” is vacuous without a good understanding of which algorithms are best supported by the computing fabric, which is complex enough to make certain algorithmic approaches better than others overall. So I won’t focus on the above at all; I assume familiarity with such matters and a general “Ok, now what to do?” attitude.

With that in mind, I’ll discuss simple high-level pieces of advice that are likely to lead to better code on modern computing architectures. There is no guarantee, but these are good rules of thumb to keep in mind for efficiently exploring a large optimization space.

Things I shouldn’t even

As mentioned, many of us are familiar with the classic advice regarding optimization. Nevertheless, a recap of a few “advanced basics” is useful for setting the stage properly.

Today’s CPUs are complex in a whole different way than CPUs were complex a few decades ago. Those older CPUs were complex in a rather deterministic way: there was a clock; each operation took a fixed number of cycles; each memory access was zero-wait; and generally there was little environmental influence on the implacable ticking: no pipelining, no speculation, no cache, no register renaming, and few if any unmaskable interrupts. That was a relatively simple model to optimize against. Today’s CPUs, however, have long abandoned the simplicity of their performance model in favor of achieving good performance statistically. Today’s deep cache hierarchies, deep pipelines, speculative execution, and many amenities for detecting and exploiting instruction-level parallelism make for faster execution on average, at the cost of deterministic, reproducible performance and a simple mental model of the machine.

But no worries. All we need to remember is that intuition is an ineffective approach to writing efficient code. Everything should be validated by measurements; at the very best, intuition is a good guide in deciding which approaches to try when optimizing something (and therefore in pruning the search space). And the best intuition to ever have is “I should measure this.” As Walter Bright once said, measuring gives you a leg up on experts who are too good to measure.

Aside from not measuring, there are a few common pitfalls to be avoided:

  • Measuring the speed of debug builds. We’ve all done that, and people showing puzzling results may have done that too, so keep it in mind whenever looking at numbers.
  • Setting up the stage such that the baseline and the benchmarked code work under different conditions. (Stereotypical example: the baseline runs first and changes the memory allocator state for the benchmarked code.)
  • Including ancillary work in the measurement. Typical noise is added by ancillary calls to the likes of malloc and printf, or by dealing with clock primitives and performance counters. Try to eliminate such noise from measurements, or make sure it’s present in equal amounts in the baseline code and the benchmarked code.
  • Optimizing code for statistically rare cases. Making sort work faster for sorted arrays to the detriment of all other arrays is a bad idea (http://stackoverflow.com/questions/6567326/does-stdsort-check-if-a-vector-is-already-sorted).
A few good, but lesser-known, things to do for fast code:

  • Prefer static linking and position-dependent code (as opposed to PIC, position-independent code).
  • Prefer 64-bit code and 32-bit data.
  • Prefer array indexing to pointers (this one seems to reverse every ten years).
  • Prefer regular memory access patterns.
  • Minimize control flow.
  • Avoid data dependencies.
This writeup won’t get into these, but the video presentation has a few words about each.

Reduce strength

The first tip is simple: When implementing an algorithm, use operations of the minimum strength possible. The poster child of strength reduction is replacing x / 2 with x >> 1 in source code. In 1985, that was a good thing to do; nowadays, you’re just making your compiler yawn.

The speed hierarchy of operations, from fastest to slowest, is:

  • comparisons
  • (u)int add, subtract, bitops, shift
  • floating point add, sub (separate unit!)
  • indexed array access (caveat: cache effects)
  • (u)int32 mul
  • FP mul
  • FP division, remainder
  • (u)int division, remainder

Interestingly, some operations on integers are in fact slower than operations on floating point numbers, with integer division and remainder as the worst offenders.

Let’s spin some code with a realistic example: suppose we want to compute the number of digits in a number. This is a classic: just divide the number by 10 until it reaches zero, counting the steps. Without further ado:

uint32_t digits10(uint64_t v) {
    uint32_t result = 0;
    do {
        ++result;
        v /= 10;
    } while (v);
    return result;
}

The dominant cost is the division. (Truth be told, it’s a multiplication, because many compilers transform division by a constant into multiplication; see e.g. http://goo.gl/LhPeH.) To reduce the strength of that operation, observe that digit counting can be reframed as a cascade of comparisons against powers of 10. Following the adage “most numbers are small,” we expect to encounter small numbers more often. When the number gets too large, we divide by a large amount and continue.

uint32_t digits10(uint64_t v) {
  uint32_t result = 1;
  for (;;) {
    if (v < 10) return result;
    if (v < 100) return result + 1;
    if (v < 1000) return result + 2;
    if (v < 10000) return result + 3;
    // Skip ahead by 4 orders of magnitude
    v /= 10000U;
    result += 4;
  }
}

This looks like partial loop unrolling, but it’s not; it’s a reformulation of the algorithm to use comparison instead of division as the core operation. Let’s take a look at the performance:

The horizontal axis is the number of digits and the vertical axis is the relative performance of the new function against the old one. The new digits10 is 1.7x to 6.5x faster.

Minimize array writes

To be faster, code should reduce the number of array writes, and more generally, writes through pointers.

On modern machines with large register files and ample register renaming hardware, you can assume that most named individual variables (numbers, pointers) end up sitting in registers. Operating with registers is fast and plays into the strengths of the hardware. Even when data dependencies, a major enemy of instruction-level parallelism, come into play, CPUs have special hardware dedicated to managing various dependency patterns. Operating with registers (i.e. named variables) is betting with the house. Do it.

In contrast, array operations (and indirect accesses in general) are less natural across the entire compiler-processor-cache hierarchy. Save for a few obvious patterns, array accesses are not mapped to registers. Also, whenever pointers are involved, the compiler must assume they could point to global data, meaning any function call may change the pointed-to data arbitrarily. And of all array operations, writes are the worst of the pack. Given that all traffic with memory is done at cache-line granularity, writing one word to memory is essentially a cache-line read followed by a cache-line write. So, given that array reads are to a good extent inevitable anyway, this piece of advice boils down to “avoid array writes wherever possible.”

Here’s an example where an alternative approach to a classic algorithm saves a lot of array writes. Consider the classic “integer to string” interview question. Here’s the stock solution:

uint32_t u64ToAsciiClassic(uint64_t value, char* dst) {
    // Write backwards.
    auto start = dst;
    do {
        *dst++ = '0' + (value % 10);
        value /= 10;
    } while (value != 0);
    const uint32_t result = dst - start;
    // Reverse in place.
    for (dst--; dst > start; start++, dst--) {
        std::iter_swap(dst, start);
    }
    return result;
}

The loop produces the digits in increasing order, which is why we need a reverse at the end. Reversing does extra writes to the array, so we’d better avoid it. To do so, we take a gambit: we make an additional “pass” through the number, which is extra work. But that work is rewarded with, you guessed it, fewer array writes, because we get to write the digits last to first. To count digits, we conveniently avail ourselves of digits10, which we just carefully optimized.

uint32_t uint64ToAscii(uint64_t v, char *const buffer) {
    auto const result = digits10(v);
    uint32_t pos = result - 1;
    while (v >= 10) {
        auto const q = v / 10;
        auto const r = static_cast<uint32_t>(v % 10);
        buffer[pos--] = '0' + r;
        v = q;
    }
    assert(pos == 0); // Last digit is trivial to handle
    *buffer = static_cast<uint32_t>(v) + '0';
    return result;
}

Results? To quote a classic: “not bad.”

More computation and fewer array writes helps. Don’t forget: computers are good at computation. The whole business of dealing with memory is more awkward.

One last pass

Let’s make a final pass through uint64ToAscii from a different angle. One simple insight is that digits10 is not counting; it’s searching. We’re looking for a number between 1 and 20 that grows with the logarithm of the input’s magnitude. Let’s take a look (P01, P02, …, are the respective powers of 10):


uint32_t digits10(uint64_t v) {
  if (v < P01) return 1;
  if (v < P02) return 2;
  if (v < P03) return 3;
  if (v < P12) {
    if (v < P08) {
      if (v < P06) {
        if (v < P04) return 4;
        return 5 + (v >= P05);
      }
      return 7 + (v >= P07);
    }
    if (v < P10) {
      return 9 + (v >= P09);
    }
    return 11 + (v >= P11);
  }
  return 12 + digits10(v / P12);
}

The search starts with a short gallop favoring small numbers, after which it goes into a hand-woven binary search. The second insight is that the conversion itself can proceed two digits at a time instead of one, which cuts the number of expensive operations in half.

unsigned u64ToAsciiTable(uint64_t value, char* dst) {
  static const char digits[201] =
    "0001020304050607080910111213141516171819"
    "2021222324252627282930313233343536373839"
    "4041424344454647484950515253545556575859"
    "6061626364656667686970717273747576777879"
    "8081828384858687888990919293949596979899";
  uint32_t const length = digits10(value);
  uint32_t next = length - 1;
  while (value >= 100) {
    auto const i = (value % 100) * 2;
    value /= 100;
    dst[next] = digits[i + 1];
    dst[next - 1] = digits[i];
    next -= 2;
  }
  // Handle last 1-2 digits
  if (value < 10) {
    dst[next] = '0' + uint32_t(value);
  } else {
    auto i = uint32_t(value) * 2;
    dst[next] = digits[i + 1];
    dst[next - 1] = digits[i];
  }
  return length;
}

The results are nothing to sneeze at! For comparison, the plot below shows the performance of both improved implementations, relative to the baseline. The best of the breed is the latest implementation, which hovers at an average of 4x over the baseline.

Summary

A quest to improve something should start by measuring it. It is surprising how often this near-tautology is ignored in optimizing software for speed. To accelerate code, try to reduce the strength of operations, which may lead you to a whole ’nother algorithm. Also, be stingy with indirect writes (such as array writes); of all memory operations, they are the most expensive.

Andrei will be next at the D Programming Language Conference on May 1-3 2013, hosted by Facebook at its headquarters in Menlo Park, California: http://dconf.org.
