Rearranging Loops to Increase Spatial Locality

Original post, 2013-12-04 09:38:59

Consider the problem of multiplying a pair of n×n matrices: C = AB. For example, if n=2, then

c11 = a11*b11 + a12*b21        c12 = a11*b12 + a12*b22
c21 = a21*b11 + a22*b21        c22 = a21*b12 + a22*b22

Matrix multiply is usually implemented using three nested loops, which are identified by their indexes i, j, and k. The following two versions, ijk and jik, share the same number of cycles per inner-loop iteration:

// Version ijk                      // Version jik
for (int i=0; i!=n; ++i)            for (int j=0; j!=n; ++j)
    for (int j=0; j!=n; ++j) {          for (int i=0; i!=n; ++i) {
        double sum = 0.0;                   double sum = 0.0;
        for (int k=0; k!=n; ++k)            for (int k=0; k!=n; ++k)
            sum += A[i][k]*B[k][j];             sum += A[i][k]*B[k][j];
        C[i][j] += sum;                     C[i][j] += sum;
    }                                   }
The inner loops of both routines scan a row of array A with a stride of 1 and a column of array B with a stride of n. Suppose a cache block holds four words and n is so large that a single matrix row does not fit in the L1 cache. Then A misses once every four iterations (0.25 misses per iteration) and every access to B misses, for a total of 1.25 misses per iteration.
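
The arithmetic above can be written down as a small model. This is only a sketch under simplifying assumptions that go slightly beyond the text: each array element occupies one word, and no block stays in the cache long enough to be reused on a later pass. The function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Simplified model (an assumption, not a measurement): a reference that
// advances by strideWords words per inner-loop iteration touches a new
// cache block once every wordsPerBlock/strideWords iterations, and at most
// once per iteration.
double missesPerIteration(std::size_t strideWords, std::size_t wordsPerBlock) {
    assert(wordsPerBlock > 0);
    if (strideWords == 0) return 0.0;  // loop-invariant value, kept in a register
    return std::min(1.0, static_cast<double>(strideWords) / wordsPerBlock);
}

// Inner loop of versions ijk and jik:
// A[i][k] moves with stride 1, B[k][j] moves with stride n.
double missesIjk(std::size_t n, std::size_t wordsPerBlock) {
    return missesPerIteration(1, wordsPerBlock) +
           missesPerIteration(n, wordsPerBlock);
}
```

With four words per block, missesIjk(n, 4) evaluates to 0.25 + 1.0 = 1.25 for any n of at least 4, which matches the count given in the text.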

To increase spatial locality, the loops are rearranged as follows:

// Version kij                      // Version ikj
for (int k=0; k!=n; ++k)            for (int i=0; i!=n; ++i)
    for (int i=0; i!=n; ++i) {          for (int k=0; k!=n; ++k) {
        double r = A[i][k];                 double r = A[i][k];
        for (int j=0; j!=n; ++j)            for (int j=0; j!=n; ++j)
            C[i][j] += r*B[k][j];               C[i][j] += r*B[k][j];
    }                                   }
The routines present an interesting trade-off. With two loads and a store per inner-loop iteration, they require one more memory operation than versions ijk and jik. On the other hand, since the inner loop scans both B and C row-wise with a stride-1 access pattern, the miss rate on each array is only 0.25 misses per iteration, for a total of 0.50 misses per iteration for both versions kij and ikj.
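
Rearranging the loops changes only the order in which memory is visited, not the product itself. The following sketch wraps three of the versions (ijk, kij, ikj) as functions so that this can be checked; jik differs from ijk only in its two outer loops. The Matrix alias and the function names are assumptions made for the example, and C is assumed to start at zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Version ijk: the inner loop walks a row of A and a column of B.
Matrix multiplyIjk(const Matrix& A, const Matrix& B) {
    const std::size_t n = A.size();
    assert(B.size() == n);
    Matrix C(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i != n; ++i)
        for (std::size_t j = 0; j != n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k != n; ++k)
                sum += A[i][k] * B[k][j];
            C[i][j] += sum;
        }
    return C;
}

// Version kij: the inner loop walks a row of B and a row of C.
Matrix multiplyKij(const Matrix& A, const Matrix& B) {
    const std::size_t n = A.size();
    assert(B.size() == n);
    Matrix C(n, std::vector<double>(n, 0.0));
    for (std::size_t k = 0; k != n; ++k)
        for (std::size_t i = 0; i != n; ++i) {
            const double r = A[i][k];  // invariant in the inner loop
            for (std::size_t j = 0; j != n; ++j)
                C[i][j] += r * B[k][j];
        }
    return C;
}

// Version ikj: same inner loop as kij, outer two loops swapped.
Matrix multiplyIkj(const Matrix& A, const Matrix& B) {
    const std::size_t n = A.size();
    assert(B.size() == n);
    Matrix C(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i != n; ++i)
        for (std::size_t k = 0; k != n; ++k) {
            const double r = A[i][k];  // invariant in the inner loop
            for (std::size_t j = 0; j != n; ++j)
                C[i][j] += r * B[k][j];
        }
    return C;
}
```

Calling all three on the same small integer-valued inputs and comparing the results with == is a quick sanity check before measuring anything. With general floating-point inputs, exact equality may not hold on every compiler, so a tolerance would be the safer comparison there.
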
The conclusions are: pairs of versions with the same number of memory references and misses per iteration have almost identical measured performance (cycles per inner-loop iteration); miss rate, in this case, is a better predictor of performance than the total number of memory accesses; and for large values of n, the performance of the faster pair of versions (kij and ikj) is constant.
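
To try the comparison on your own machine, a rough timing harness might look like the sketch below. It is not the measurement setup behind the conclusions above: it reports wall-clock nanoseconds rather than cycles, stores each matrix row-major in one contiguous buffer, and uses helper names (kernelIjk, kernelIkj, nsPerIteration) invented for the example. Results will depend on the compiler, its optimization flags, n, and the cache sizes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Row-major n x n matrix in one contiguous buffer: element (i, j) is at i*n + j.
using Flat = std::vector<double>;

void kernelIjk(const Flat& A, const Flat& B, Flat& C, std::size_t n) {
    for (std::size_t i = 0; i != n; ++i)
        for (std::size_t j = 0; j != n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k != n; ++k)
                sum += A[i * n + k] * B[k * n + j];  // B is read with stride n
            C[i * n + j] += sum;
        }
}

void kernelIkj(const Flat& A, const Flat& B, Flat& C, std::size_t n) {
    for (std::size_t i = 0; i != n; ++i)
        for (std::size_t k = 0; k != n; ++k) {
            const double r = A[i * n + k];
            for (std::size_t j = 0; j != n; ++j)
                C[i * n + j] += r * B[k * n + j];    // B and C are read with stride 1
        }
}

// Average wall-clock nanoseconds per inner-loop iteration for one run of kernel.
template <typename Kernel>
double nsPerIteration(Kernel kernel, std::size_t n) {
    assert(n > 0);
    Flat A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
    const auto start = std::chrono::steady_clock::now();
    kernel(A, B, C, n);
    const auto stop = std::chrono::steady_clock::now();
    // Read a result through a volatile so the work is less likely to be optimized away.
    volatile double sink = C[0];
    (void)sink;
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / (static_cast<double>(n) * n * n);
}
```

Calling nsPerIteration(kernelIjk, n) and nsPerIteration(kernelIkj, n) for a range of n, and repeating each measurement a few times, gives a first impression of how the two loop orders scale. A single run is noisy, so treat individual numbers with caution.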
