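Consider the problem of multiplying a pair of n×n matrices: C = AB. For example, if n=2, then
c11 = a11*b11 + a12*b21,   c12 = a11*b12 + a12*b22,
c21 = a21*b11 + a22*b21,   c22 = a21*b12 + a22*b22.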
Matrix multiplication is usually implemented using three nested loops, which are identified by their indexes i, j, and k. The following two versions, ijk and jik, take the same number of cycles per inner-loop iteration:
// Version ijk
for (int i=0; i!=n; ++i)
    for (int j=0; j!=n; ++j) {
        sum = 0.0;
        for (int k=0; k!=n; ++k)
            sum += A[i][k]*B[k][j];
        C[i][j] += sum;
    }

// Version jik
for (int j=0; j!=n; ++j)
    for (int i=0; i!=n; ++i) {
        sum = 0.0;
        for (int k=0; k!=n; ++k)
            sum += A[i][k]*B[k][j];
        C[i][j] += sum;
    }
The inner loops of the two routines scan a row of array A with a stride of 1 and a column of B with a stride of n. Supposing a block holds four words and the array size is so large that a single matrix row does not fit in the L1 cache, the miss rate for A is 0.25 misses per iteration and each access of array B results in a miss, for a total of 1.25 misses per iteration.
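The arithmetic behind the 1.25 figure can be written as a small model. The following is a minimal sketch in C, not a cache simulator: the function names misses_per_iter and misses_ijk are hypothetical, and the model assumes, as in the text, that the arrays are too large for any block to be reused once it has been evicted.

```c
#include <assert.h>

/* Estimated misses per inner-loop iteration for one array reference.
 * stride_words: distance in words between consecutive accesses
 *               (0 means the same element is reused each iteration).
 * block_words:  number of words held by one cache block.
 * Assumption: the array is too large to be reused from the cache. */
double misses_per_iter(long stride_words, long block_words)
{
    assert(block_words > 0 && stride_words >= 0);
    if (stride_words >= block_words)
        return 1.0;  /* every access lands in a new block */
    return (double)stride_words / (double)block_words;
}

/* Versions ijk and jik: A is scanned with stride 1, B with stride n. */
double misses_ijk(long n, long block_words)
{
    return misses_per_iter(1, block_words) + misses_per_iter(n, block_words);
}
```

With four-word blocks and n at least 4, misses_ijk evaluates to 0.25 + 1.0 = 1.25, matching the total stated above.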
To increase spatial locality, loops are rearranged as follows:
// Version kij
for (int k=0; k!=n; ++k)
    for (int i=0; i!=n; ++i) {
        r = A[i][k];
        for (int j=0; j!=n; ++j)
            C[i][j] += r*B[k][j];
    }

// Version ikj
for (int i=0; i!=n; ++i)
    for (int k=0; k!=n; ++k) {
        r = A[i][k];
        for (int j=0; j!=n; ++j)
            C[i][j] += r*B[k][j];
    }
These routines present an interesting trade-off. With two loads and a store per inner-loop iteration, they require one more memory operation than versions ijk and jik. On the other hand, since the inner loop scans both B and C row-wise with a stride-1 access pattern, the miss rate on each array is only 0.25 misses per iteration, for a total of 0.50 misses per iteration for both versions kij and ikj.

It is concluded that: (1) pairs of versions with the same number of memory references and misses per iteration have almost identical measured performance (cycles per inner-loop iteration); (2) miss rate, in this case, is a better predictor of performance than the total number of memory accesses; (3) for large values of n, the performance of the faster pair of versions (kij and ikj) is constant, that is, the number of cycles per inner-loop iteration does not grow with n.
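To try the comparison on a particular machine, one slow version and one fast version can be put side by side. The following is a minimal sketch in C under stated assumptions: matrices are stored as flat row-major arrays of double (so A[i][k] becomes A[i*n + k]), the caller zeroes C before each call because both versions accumulate into it, and the names matmul_ijk, matmul_kij, max_abs_diff, and seconds_for are hypothetical helpers that do not come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*matmul_fn)(size_t n, const double *A, const double *B, double *C);

/* Version ijk: inner loop reads a row of A (stride 1) and a column of B (stride n). */
void matmul_ijk(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i != n; ++i)
        for (size_t j = 0; j != n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k != n; ++k)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] += sum;
        }
}

/* Version kij: inner loop reads a row of B and updates a row of C, both stride 1. */
void matmul_kij(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t k = 0; k != n; ++k)
        for (size_t i = 0; i != n; ++i) {
            double r = A[i*n + k];
            for (size_t j = 0; j != n; ++j)
                C[i*n + j] += r * B[k*n + j];
        }
}

/* Largest absolute element-wise difference between two n x n matrices. */
double max_abs_diff(size_t n, const double *X, const double *Y)
{
    double worst = 0.0;
    for (size_t idx = 0; idx != n*n; ++idx) {
        double d = X[idx] - Y[idx];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}

/* CPU seconds consumed by a single call of f, as reported by clock(). */
double seconds_for(matmul_fn f, size_t n, const double *A, const double *B, double *C)
{
    clock_t start = clock();
    f(n, A, B, C);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Dividing the value returned by seconds_for by n*n*n gives the time per inner-loop iteration, a rough stand-in for the cycles-per-iteration measure used above. Since clock() has coarse resolution, n should be large enough for a run to take a noticeable fraction of a second, and max_abs_diff can confirm that both versions computed the same product.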