# Rearranging Loops to Increase Spatial Locality


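Consider the problem of multiplying a pair of n×n matrices: C = AB. For example, if n = 2, then c11 = a11·b11 + a12·b21, c12 = a11·b12 + a12·b22, c21 = a21·b11 + a22·b21, and c22 = a21·b12 + a22·b22.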

Matrix multiply is usually implemented using three nested loops, which are identified by their indexes i, j, and k. The following two versions, ijk and jik, take the same number of cycles per inner loop iteration:

    // Version ijk                          // Version jik
    for (int i=0; i!=n; ++i)                for (int j=0; j!=n; ++j)
        for (int j=0; j!=n; ++j) {              for (int i=0; i!=n; ++i) {
            sum = 0.0;                              sum = 0.0;
            for (int k=0; k!=n; ++k)                for (int k=0; k!=n; ++k)
                sum += A[i][k]*B[k][j];                 sum += A[i][k]*B[k][j];
            C[i][j] += sum;                         C[i][j] += sum;
        }                                       }

The inner loops of the two routines scan a row of array A with a stride of 1 and a column of B with a stride of n. Supposing a block holds four words and the array size is so large that a single matrix row does not fit in the L1 cache, the miss rate for A is 0.25 misses per iteration and each access of array B results in a miss, for a total of 1.25 misses per iteration.
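The 1.25 figure is plain stride arithmetic, which the following minimal sketch spells out. The helper names `stream_miss_rate` and `ijk_misses_per_iteration` are hypothetical, and the model assumes block-aligned arrays that are too large for any block to still be cached when it is revisited.

```c
#include <assert.h>

/* Hypothetical helper: long-run average misses per access for one array that
   is scanned at a fixed stride (in words). Assumes a block-aligned array so
   large that no block is still cached when the scan comes back to it. */
double stream_miss_rate(long stride_words, long words_per_block) {
    assert(stride_words > 0 && words_per_block > 0);
    if (stride_words >= words_per_block)
        return 1.0;  /* every access lands in a different block */
    return (double)stride_words / (double)words_per_block;
}

/* Misses per inner-loop iteration for versions ijk and jik. */
double ijk_misses_per_iteration(long n, long words_per_block) {
    return stream_miss_rate(1, words_per_block)   /* A[i][k]: stride 1 */
         + stream_miss_rate(n, words_per_block);  /* B[k][j]: stride n */
}
```

With four-word blocks and any n of at least 4, `ijk_misses_per_iteration(n, 4)` evaluates to 0.25 + 1.0 = 1.25.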

To increase spatial locality, loops are rearranged as follows:

    // Version kij                          // Version ikj
    for (int k=0; k!=n; ++k)                for (int i=0; i!=n; ++i)
        for (int i=0; i!=n; ++i) {              for (int k=0; k!=n; ++k) {
            r = A[i][k];                            r = A[i][k];
            for (int j=0; j!=n; ++j)                for (int j=0; j!=n; ++j)
                C[i][j] += r*B[k][j];                   C[i][j] += r*B[k][j];
        }                                       }

The routines present an interesting trade-off: with two loads and a store, they require one more memory operation than versions ijk and jik. On the other hand, since the inner loop scans both B and C row-wise with a stride-1 access pattern, the miss rate on each array is only 0.25 misses per iteration, for a total of 0.50 misses per iteration for both versions kij and ikj.
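The 0.50 figure can also be checked by replaying the access pattern of version kij. The sketch below is a deliberately crude model, not a real cache simulator: the `Stream` type and `touch` helper are hypothetical, each array gets a one-block "cache", and only the inner-loop accesses to B and C are counted, as in the text (the load of `A[i][k]` sits outside the inner loop).

```c
#include <assert.h>

#define WORDS_PER_BLOCK 4  /* block size assumed in the text */

/* Hypothetical one-block "cache" per array: an access misses whenever it
   falls in a different block than the previous access to the same array. */
typedef struct {
    long last_block;
    long misses;
} Stream;

static void touch(Stream *s, long word_index) {
    long block = word_index / WORDS_PER_BLOCK;
    if (block != s->last_block) {
        s->misses++;
        s->last_block = block;
    }
}

/* Misses per inner-loop iteration for version kij on row-major n x n arrays.
   Requires rows that span at least two whole blocks. */
double kij_misses_per_iteration(long n) {
    assert(n >= 2 * WORDS_PER_BLOCK && n % WORDS_PER_BLOCK == 0);
    Stream b = { -1, 0 }, c = { -1, 0 };
    for (long k = 0; k != n; ++k)
        for (long i = 0; i != n; ++i)
            for (long j = 0; j != n; ++j) {
                touch(&b, k * n + j);  /* load B[k][j] */
                touch(&c, i * n + j);  /* load and store C[i][j]: same block */
            }
    return (double)(b.misses + c.misses) / ((double)n * n * n);
}
```

Under these assumptions, each stride-1 scan misses once per four-word block, so B and C contribute 0.25 each. Swapping the two outer loops to get version ikj leaves the inner loop, and therefore the count, unchanged.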
The conclusions are: pairs of versions with the same number of memory references and misses per iteration have almost identical measured performance (cycles per inner loop iteration); miss rate, in this case, is a better predictor of performance than the total number of memory accesses; and for large values of n, the performance of the faster pair of versions (kij and ikj) is constant.
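One way to try these conclusions on a given machine is a small timing harness like the sketch below. It is an illustration under assumptions, not the measurement setup behind the numbers above: matrices are stored as flat row-major buffers, `ns_per_iteration` is a hypothetical helper, and it reports CPU nanoseconds from `clock()` rather than cycles.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Version ijk on row-major n x n matrices stored in flat buffers. */
void matmul_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i != n; ++i)
        for (int j = 0; j != n; ++j) {
            double sum = 0.0;
            for (int k = 0; k != n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] += sum;
        }
}

/* Version kij: the inner loop walks B and C row-wise with stride 1. */
void matmul_kij(int n, const double *A, const double *B, double *C) {
    for (int k = 0; k != n; ++k)
        for (int i = 0; i != n; ++i) {
            double r = A[i * n + k];
            for (int j = 0; j != n; ++j)
                C[i * n + j] += r * B[k * n + j];
        }
}

typedef void (*matmul_fn)(int, const double *, const double *, double *);

/* Hypothetical harness: CPU nanoseconds per inner-loop iteration for one
   call of f on n x n inputs, or -1.0 if allocation fails. */
double ns_per_iteration(matmul_fn f, int n) {
    assert(n > 0);
    size_t count = (size_t)n * (size_t)n;
    double *A = malloc(count * sizeof *A);
    double *B = malloc(count * sizeof *B);
    double *C = malloc(count * sizeof *C);
    if (!A || !B || !C) {
        free(A); free(B); free(C);
        return -1.0;
    }
    for (size_t x = 0; x != count; ++x) {
        A[x] = (double)(x % 7);
        B[x] = (double)(x % 5);
        C[x] = 0.0;  /* both versions accumulate into C, so start from zero */
    }
    clock_t start = clock();
    f(n, A, B, C);
    clock_t stop = clock();
    volatile double sink = C[count - 1];  /* keep the result observable */
    (void)sink;
    free(A); free(B); free(C);
    return 1e9 * (double)(stop - start) / CLOCKS_PER_SEC / ((double)n * n * n);
}
```

Calling `ns_per_iteration(matmul_ijk, n)` and `ns_per_iteration(matmul_kij, n)` for a range of n gives a rough per-iteration comparison. The readings depend on the compiler, optimization flags, and cache sizes, and `clock()` is coarse, so small n should be repeated many times before drawing conclusions.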
