Rearranging Loops to Increase Spatial Locality

zhangyubingcatherine

Category: Computer Systems. Tags: Spatial Locality, Rearranging Loops

Article link: https://blog.csdn.net/zhangyubingcatherine/article/details/17111649

Consider the problem of multiplying a pair of n×n matrices: C = AB. For example, if n=2, then

c11 = a11*b11 + a12*b21    c12 = a11*b12 + a12*b22
c21 = a21*b11 + a22*b21    c22 = a21*b12 + a22*b22

Matrix multiply is usually implemented using three nested loops, which are identified by their indexes i, j, and k. The following two versions, ijk and jik, take the same number of cycles per inner-loop iteration:

// Version ijk                      // Version jik                     
for (int i=0; i!=n; ++i)            for (int j=0; j!=n; ++j)           
    for (int j=0; j!=n; ++j) {          for (int i=0; i!=n; ++i) {     
        double sum = 0.0;                   double sum = 0.0;
        for (int k=0; k!=n; ++k)            for (int k=0; k!=n; ++k)   
            sum += A[i][k]*B[k][j];             sum += A[i][k]*B[k][j];
        C[i][j] += sum;                     C[i][j] += sum;            
    }                                   }

The inner loop of both routines scans a row of array A with a stride of 1 and a column of array B with a stride of n. Suppose a cache block holds four words, that is, four array elements, and n is so large that a single matrix row does not fit in the L1 cache. Then A misses once per block, or 0.25 misses per iteration, while every access to B misses, for a total of 1.25 misses per iteration.
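
The arithmetic behind the 1.25 figure can be written down as a small model. The following is a minimal sketch, not part of the original analysis: the function names `misses_per_access` and `misses_ijk` are hypothetical, and the model assumes that scanned data never survives in the cache between passes and that a block holds `block_elems` array elements.

```c
#include <assert.h>

/* Estimated misses per access for a fixed-stride scan.
   Assumptions (simplified model): nothing scanned stays cached between
   passes, and one cache block holds block_elems array elements. */
double misses_per_access(unsigned stride_elems, unsigned block_elems)
{
    assert(block_elems > 0);
    if (stride_elems >= block_elems)
        return 1.0;                             /* each access lands in a new block */
    return (double)stride_elems / block_elems;  /* misses amortized over a block */
}

/* Inner loop of versions ijk and jik:
   A[i][k] is read with stride 1, B[k][j] with stride n. */
double misses_ijk(unsigned n, unsigned block_elems)
{
    return misses_per_access(1, block_elems)
         + misses_per_access(n, block_elems);
}
```

Under these assumptions, `misses_ijk(n, 4)` for any n of at least 4 is the sum of 0.25 for A and 1.0 for B. The same helper can be applied to any other loop order by plugging in the strides of its inner loop.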

To increase spatial locality, loops are rearranged as follows:

// Version kij                      // Version ikj                     
for (int k=0; k!=n; ++k)            for (int i=0; i!=n; ++i)           
    for (int i=0; i!=n; ++i) {          for (int k=0; k!=n; ++k) {     
        double r = A[i][k];                 double r = A[i][k];
        for (int j=0; j!=n; ++j)            for (int j=0; j!=n; ++j)   
            C[i][j] += r*B[k][j];               C[i][j] += r*B[k][j];
    }                                   }

These two routines present an interesting trade-off. With two loads and a store per iteration, they require one more memory operation than versions ijk and jik. On the other hand, since the inner loop scans both B and C row-wise with a stride-1 access pattern, the miss rate on each array is only 0.25 misses per iteration, for a total of 0.50 misses per iteration for both versions kij and ikj.

Three conclusions follow. Pairs of versions with the same number of memory references and misses per iteration have almost identical measured performance (cycles per inner-loop iteration). Miss rate, in this case, is a better predictor of performance than the total number of memory accesses. For large values of n, the performance of the faster pair of versions (kij and ikj) is constant.
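
To try the comparison on a real machine, the loop fragments need to be wrapped in complete functions. The following is a minimal sketch under assumptions of my own: the names `matmul_ijk`, `matmul_ikj`, and the `AT` macro are hypothetical, and the matrices are stored row-major in flat arrays of `double`. Both functions accumulate into C, so C must be zeroed before the call.

```c
#include <assert.h>
#include <stddef.h>

/* Element (i, j) of an n-by-n matrix stored row-major in a flat array
   (storage layout is an assumption of this sketch). */
#define AT(M, n, i, j) ((M)[(size_t)(i) * (n) + (j)])

/* Version ijk: the inner loop reads a row of A (stride 1)
   and a column of B (stride n). */
void matmul_ijk(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i != n; ++i)
        for (size_t j = 0; j != n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k != n; ++k)
                sum += AT(A, n, i, k) * AT(B, n, k, j);
            AT(C, n, i, j) += sum;
        }
}

/* Version ikj: the inner loop reads a row of B and updates a row of C,
   both with stride 1; A[i][k] is held in a local variable. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i != n; ++i)
        for (size_t k = 0; k != n; ++k) {
            double r = AT(A, n, i, k);
            for (size_t j = 0; j != n; ++j)
                AT(C, n, i, j) += r * AT(B, n, k, j);
        }
}
```

Calling both functions on the same inputs with separately zeroed output arrays should give the same product, which is a quick check that the rearrangement changes only the access order. Timing code is left out on purpose, since cycle counts depend on the machine, the compiler flags, and n.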