Writing Cache-friendly Code

In the previous essay, Exhibiting Good Locality in Your Programs, we presented two functions named sumarrayrows and sumarraycols. We saw that sumarrayrows has a stride-1 reference pattern (it visits each element of the array sequentially), whereas sumarraycols has a stride-N reference pattern (it visits every Nth element of the contiguous array). In this essay, we will show how to quantify the idea of locality in terms of cache hits and cache misses.
      In general, if a cache has a block size of B bytes, then a stride-k reference pattern (where k is expressed in words) results in an average of min(1, (wordsize × k) / B) misses per loop iteration. This is minimized for k = 1.
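For instance, with the parameters used in the example below (4-byte words and 4-word, i.e. 16-byte, cache blocks), a stride-1 pattern incurs min(1, (4 × 1)/16) = 1/4 miss per iteration, while a stride-4 (or larger) pattern incurs min(1, (4 × 4)/16) = 1 miss per iteration.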
      Take sumarrayrows, for example:
int sumarrayrows(int a[M][N])
{
    int sum = 0;
    for (int i=0; i!=M; ++i)
        for (int j=0; j!=N; ++j)
            sum += a[i][j];
    return sum;
}
Since C stores arrays in row-major order, the inner loop of this function has a desirable stride-1 access pattern. Suppose that a is block aligned, words are 4 bytes, cache blocks are 4 words, and the cache is initially empty (a cold cache). Then the references to the array a will result in the following pattern of hits and misses:
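(The table below shows the first two rows of this pattern, assuming N = 8 purely for illustration; the numbers give the order of the references, m = miss, h = hit.)

a[i][j]    j = 0    j = 1    j = 2    j = 3    j = 4    j = 5    j = 6    j = 7
i = 0      1 [m]    2 [h]    3 [h]    4 [h]    5 [m]    6 [h]    7 [h]    8 [h]
i = 1      9 [m]    10 [h]   11 [h]   12 [h]   13 [m]   14 [h]   15 [h]   16 [h]
...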

      In this example, the reference to a[0][0] misses and the corresponding block, which contains a[0][0] through a[0][3], is loaded into the cache from memory. Thus, the next three references are all hits. The reference to a[0][4] causes another miss as a new block is loaded into the cache, the next three references are hits, and so on. In general, three out of four references will hit, which is the best we can do in this case with a cold cache.
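The same three-out-of-four figure, and the stride-k formula above, can be checked with a short cold-cache model. The sketch below is our own illustration: it counts a miss exactly when a reference is the first touch of its 16-byte block while scanning a one-dimensional array of 4-byte words with stride k.

#include <stdio.h>

#define WORDSIZE 4    /* bytes per word */
#define BLOCKSIZE 16  /* bytes per cache block (4 words) */

/* Cold-cache model: a reference misses exactly when it is the first
 * reference to its block. Since the scan only moves forward, tracking
 * the previously touched block is enough. Returns misses per iteration
 * for a stride-k (in words) scan over nwords words. */
static double miss_rate(int k, long nwords)
{
    long misses = 0, iters = 0;
    long prev_block = -1;
    for (long w = 0; w < nwords; w += k) {
        long block = (w * WORDSIZE) / BLOCKSIZE;
        if (block != prev_block)
            misses++;
        prev_block = block;
        iters++;
    }
    return (double)misses / iters;
}

int main(void)
{
    /* Expect 0.25, 0.50, 1.00, 1.00: min(1, (WORDSIZE * k) / BLOCKSIZE). */
    for (int k = 1; k <= 8; k *= 2)
        printf("stride-%d: %.2f misses per iteration\n", k, miss_rate(k, 1L << 20));
    return 0;
}

The stride-1 line prints 0.25, matching the one-miss-per-four-references pattern in the table above; strides of a block or more print 1.00.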
      But consider what happens if we make the seemingly innocuous change of permuting the loops, as in sumarraycols:
int sumarraycols(int a[M][N])
{
    int sum = 0;
    for (int j=0; j!=N; ++j)
        for (int i=0; i!=M; ++i)
            sum += a[i][j];
    return sum;
}
In this case, we are scanning the array column by column instead of row by row. If we are lucky and the entire array fits in the cache, then we will enjoy the same miss rate of 1/4. However, if the array is larger than the cache (the more likely case), then each and every access to a[i][j] will miss: consecutive inner-loop accesses are N words apart, so each lands in a different block, and by the time the scan returns to that block in the next column it has already been evicted.
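To put rough numbers on this (the cache size here is our own illustrative assumption): with 16-byte blocks, a 32 KB cache holds 2048 blocks, while one pass of the inner loop touches M distinct blocks, one per row. Once M is well beyond 2048, the block that held a[i][j] is evicted long before the scan comes back for a[i][j+1], so essentially every reference misses.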

      Higher miss rates can have a significant impact on running time. For example, on our desktop machine, sumarrayrows runs twice as fast as sumarraycols (a simple timing harness is sketched after the list below). To summarize, the two functions illustrate two important points about writing cache-friendly code:
   1. Repeated references to local variables are good because the compiler can cache them in the register file (temporal locality).
   2. Stride-1 reference patterns are good because caches at all levels of the memory hierarchy store data as contiguous blocks (spatial locality).
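As a rough way to check the timing claim for yourself, here is a minimal benchmark sketch. The array dimensions, the initialization values, and the clock()-based driver are our own choices for illustration; the actual speedup depends on the machine, the compiler flags, and the array size.

#include <stdio.h>
#include <time.h>

#define M 4096
#define N 4096

static int a[M][N];

int sumarrayrows(int a[M][N])
{
    int sum = 0;
    for (int i = 0; i != M; ++i)
        for (int j = 0; j != N; ++j)
            sum += a[i][j];
    return sum;
}

int sumarraycols(int a[M][N])
{
    int sum = 0;
    for (int j = 0; j != N; ++j)
        for (int i = 0; i != M; ++i)
            sum += a[i][j];
    return sum;
}

/* Time one call to f on the global array and return elapsed CPU seconds. */
static double time_sum(int (*f)(int [M][N]), int *result)
{
    clock_t start = clock();
    *result = f(a);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    /* Fill the array with small values so the sums stay within int range. */
    for (int i = 0; i != M; ++i)
        for (int j = 0; j != N; ++j)
            a[i][j] = (i + j) & 0x7;

    int r1, r2;
    double t_rows = time_sum(sumarrayrows, &r1);
    double t_cols = time_sum(sumarraycols, &r2);

    printf("sumarrayrows: sum = %d, %.3f s\n", r1, t_rows);
    printf("sumarraycols: sum = %d, %.3f s\n", r2, t_cols);
    return 0;
}

Both functions return the same sum, so any difference in the printed times comes purely from the memory-access pattern; compiling with moderate optimization (for example, gcc -O1) is usually enough to see the gap.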