Using Blocking to Increase Temporal Locality

Original post, December 4, 2013, 10:27:53

In the last essay, "Rearranging Loops to Increase Spatial Locality", we saw how some simple rearrangements of the loops could increase spatial locality. But observe that even with good loop nestings, the time per loop iteration grows as the array size grows. What is happening is that as the array size increases, the temporal locality decreases, and the cache experiences an increasing number of capacity misses. To fix this, we can use a general technique called blocking.

      The general idea of blocking is to organize the data structures in a program into large chunks called blocks. (In this context, the term “block” refers to an application-level chunk of data, not a cache block.) The program is structured so that it loads a chunk into the L1 cache, does all the reads and writes that it needs to on that chunk, then discards the chunk, loads in the next chunk, and so on.
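As a minimal sketch of this idea outside of matrix multiplication, consider applying several passes of a per-element update to a large one-dimensional array. The function names (`one_pass`, `passes_unblocked`, `passes_blocked`) and the update formula are illustrative assumptions, not part of the original essay. The blocked version finishes every pass on one chunk before touching the next, so a chunk that fits in cache is reused while it is still resident.

```c
#include <assert.h>
#include <stddef.h>

/* One pass over x[lo..hi): a stand-in for any per-element update
   (hypothetical workload, chosen only for illustration). */
static void one_pass(double *x, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; ++i)
        x[i] = 0.5 * x[i] + 1.0;
}

/* Unblocked: every pass sweeps the whole array. For a large n, an element
   may already have been evicted by the time the next pass reaches it. */
void passes_unblocked(double *x, size_t n, int passes)
{
    for (int p = 0; p < passes; ++p)
        one_pass(x, 0, n);
}

/* Blocked: run all passes on one chunk, then move on to the next chunk. */
void passes_blocked(double *x, size_t n, int passes, size_t chunk)
{
    assert(chunk > 0);
    for (size_t lo = 0; lo < n; lo += chunk) {
        /* Clamp the last chunk so it does not run past the array. */
        size_t hi = (n - lo < chunk) ? n : lo + chunk;
        for (int p = 0; p < passes; ++p)
            one_pass(x, lo, hi);
    }
}
```

Both functions apply the same sequence of updates to each element, so they produce the same array; only the order in which memory is visited differs. A suitable `chunk` depends on the cache size of the target machine and would have to be found by measurement.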

      Blocking a matrix multiply routine works by partitioning the matrices into submatrices and then exploiting the mathematical fact that these submatrices can be manipulated just like scalars. For example, if n = 8, then we could partition each matrix into four 4×4 submatrices:
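Written out, and assuming the usual 2×2 block layout in which each A_ij, B_ij, and C_ij is a 4×4 submatrix, the partition and the resulting block products are:

```latex
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad
B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}, \quad
C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}

\begin{aligned}
C_{11} &= A_{11}B_{11} + A_{12}B_{21}, & C_{12} &= A_{11}B_{12} + A_{12}B_{22}, \\
C_{21} &= A_{21}B_{11} + A_{22}B_{21}, & C_{22} &= A_{21}B_{12} + A_{22}B_{22}.
\end{aligned}
```

Each product such as A_11 B_11 is itself an ordinary 4×4 matrix multiplication, which is what lets the submatrices be treated like scalars.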


One version of blocked matrix multiplication, which we call the bijk version, is presented below. The basic idea behind this code is to partition A and C into 1×bsize row slivers and to partition B into bsize×bsize blocks. The innermost (j, k) loop pair multiplies a sliver of A by a block of B and accumulates the result into a sliver of C. The i loop iterates through the n row slivers of A and C, using the same block of B.

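/* Assumes `array` is a typedef for a 2-D array of double, e.g.
   typedef double array[MAXN][MAXN]; (not shown in the original). */
void bijk(array A, array B, array C, int n, int bsize)
{
    double sum = 0.0;
    int en = bsize * (n / bsize); /* Amount that fits evenly into blocks */

    /* Clear C, since the blocked loops accumulate into it. */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[i][j] = 0.0;

    for (int kk = 0; kk < en; kk += bsize) {        /* block row of B */
        for (int jj = 0; jj < en; jj += bsize) {    /* block column of B */
            for (int i = 0; i < n; ++i) {           /* row sliver of A and C */
                for (int j = jj; j < jj + bsize; ++j) {
                    sum = C[i][j];
                    for (int k = kk; k < kk + bsize; ++k)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
            }
        }
    }
}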

      The key idea is that the code loads a block of B into the cache, uses it up, and then discards it. References to A enjoy good spatial locality because each sliver is accessed with a stride of 1. There is also good temporal locality because the entire sliver is referenced bsize times in succession. References to B enjoy good temporal locality because the entire bsize×bsize block is accessed n times in succession. Finally, the references to C have good spatial locality because each element of the sliver is written in succession. Notice that references to C do not have good temporal locality because each sliver is only accessed one time.
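The bijk listing only covers the part of the matrices that divides evenly into blocks (the loops over kk and jj stop at en). The following is a minimal sketch of one way to cover an arbitrary n by clamping each block's upper bound, together with a plain ijk routine to check against. It assumes row-major matrices stored in flat arrays; the names `min_int`, `ijk_naive`, and `bijk_flat` are hypothetical and do not come from the original essay.

```c
#include <assert.h>
#include <stddef.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference: plain ijk triple loop, C = A*B, all n x n and row-major. */
void ijk_naive(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = sum;
        }
}

/* Blocked bijk loop order. Block bounds are clamped to n so that a
   trailing partial block is processed as well. */
void bijk_flat(const double *A, const double *B, double *C, int n, int bsize)
{
    assert(n >= 0 && bsize > 0);

    /* Clear C, since the blocked loops accumulate into it. */
    for (size_t idx = 0; idx < (size_t)n * n; ++idx)
        C[idx] = 0.0;

    for (int kk = 0; kk < n; kk += bsize) {
        int kend = min_int(kk + bsize, n);
        for (int jj = 0; jj < n; jj += bsize) {
            int jend = min_int(jj + bsize, n);
            for (int i = 0; i < n; ++i) {
                for (int j = jj; j < jend; ++j) {
                    double sum = C[(size_t)i * n + j];
                    for (int k = kk; k < kend; ++k)
                        sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
                    C[(size_t)i * n + j] = sum;
                }
            }
        }
    }
}
```

For each element of C, the blocked routine adds the products in the same increasing order of k as the naive routine, so calling both on the same inputs (for example n = 5, bsize = 2) and comparing the outputs is a simple correctness check before any timing is attempted.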

      Blocking can make code harder to read, but it can also pay big performance dividends. Blocking improves the running time by a factor of two over the best non-blocked version, from about 20 cycles per iteration down to about 10 cycles per iteration.
