Cache and Cache Line Fills -- Critical Word First

The following comes from the official Arm v8 documentation plus some digging around online, filtered through my own understanding:

Arm uses a Harvard architecture, with separate instruction and data buses, and therefore two kinds of cache (an instruction cache and a data cache). This is in contrast to the von Neumann architecture's single cache, which holds both instructions and data and is called a unified cache.

On Arm v8 processors, the L1 caches are separate for instructions and data, while the L2 cache is a unified cache.

The cache controller is a hardware block responsible for managing the cache memory; to a large extent it is invisible to the program. It automatically brings code and data from main memory into the cache memory, handles read and write requests from the CPU, and performs whatever accesses are necessary to the cache memory or to external memory (L2/L3/..., main memory).

When it receives a request from the CPU, it must check whether the data at the requested address has already been loaded into the cache. This check is known as a cache look-up.

Caches can be organized as fully associative, set associative, and so on. Arm's main caches have always been set associative. Compared with a direct-mapped cache, the biggest advantage of a set-associative cache is that it reduces the chance of cache thrashing, which speeds up program execution and makes it more predictable. The cost is extra hardware complexity and slightly higher power consumption, because several tags must be compared in every cycle.

The detailed terminology around set-associative caches can be picked up from the two figures in the Arm documentation:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch11s01s02.html

Put simply, the main memory space can be divided into chunks the size of one cache way, which means the data at any address can be placed into one of the cache ways. A cache way is in turn divided into a number of cache lines (also called blocks), and a cache line usually holds several words (a word being 4 bytes). The cache lines that have the same index across the different ways are collectively called a cache set. As said above, the data at any address can be placed into one of the cache ways, but only at one fixed position within that way (the cache line at a particular index). Putting the two together: an address can be placed in any one of the cache lines of a particular set.

For example, if a cache is made up of 4 ways, then each set in that cache consists of 4 cache lines with the same index, and it can hold data from any address whose index bits match.

An address is split into three fields:

tag (compared against the tag of every cache line in the set, to decide whether one of them hits) + index (selects the set) + offset (selects the unique word within a cache line)
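To make the split concrete, here is a minimal sketch in C of the decomposition and look-up, assuming a hypothetical 4-way, 32 KB cache with 64-byte lines (all sizes and structure names here are illustrative, not any particular core):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical geometry: one way = 32 KB / 4 = 8 KB; sets = 8 KB / 64 B
     * = 128, so 7 index bits; 64-byte lines give 6 offset bits; the tag is
     * the remaining high bits of the address. */
    #define WAYS        4u
    #define LINE_BYTES  64u
    #define SETS        128u

    struct line { bool valid; uint32_t tag; };
    static struct line cache[SETS][WAYS];    /* tags only; data omitted */

    /* Look-up: the index bits select the set, then the tag is compared
     * against all 4 lines of that set. */
    static bool lookup(uint32_t addr) {
        uint32_t offset = addr % LINE_BYTES;
        uint32_t index  = (addr / LINE_BYTES) % SETS;
        uint32_t tag    = addr / (LINE_BYTES * SETS);
        printf("addr=0x%08x -> tag=0x%x index=%u offset=%u\n",
               addr, tag, index, offset);
        for (unsigned w = 0; w < WAYS; w++)
            if (cache[index][w].valid && cache[index][w].tag == tag)
                return true;    /* hit in one of the set's 4 lines */
        return false;           /* miss: this would trigger a linefill */
    }

    int main(void) {
        lookup(0x80123ABC);
        return 0;
    }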

Caches can also be inclusive or exclusive.

The difference: in an inclusive cache, when the data for an address is present in the L1 cache, it may also be present in the L2 cache, and the linefill triggered by an L1 cache miss fills both the L2 and the L1 cache.

In an exclusive cache, when the data for an address is present in the L1 cache, it must not be present in the L2 cache at the same time.
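As a minimal sketch of the difference, here is a toy two-level model in C (tiny fully associative levels, made-up sizes and helper names, nothing like a real controller): the inclusive fill installs the line in both levels, while the exclusive fill removes it from L2 and pushes the L1 victim down.

    #include <stdio.h>
    #include <string.h>

    enum { L1_LINES = 2, L2_LINES = 4 };
    #define EMPTY (-1L)
    static long l1[L1_LINES] = { EMPTY, EMPTY };
    static long l2[L2_LINES] = { EMPTY, EMPTY, EMPTY, EMPTY };

    /* Insert at slot 0, shifting the rest down; returns the evicted entry. */
    static long insert(long *c, int n, long line) {
        long victim = c[n - 1];
        memmove(c + 1, c, (size_t)(n - 1) * sizeof *c);
        c[0] = line;
        return victim;
    }

    /* Inclusive: the line is filled into BOTH levels, so L1 contents stay
     * a subset of L2 contents. */
    static void l1_miss_inclusive(long line) {
        insert(l2, L2_LINES, line);
        insert(l1, L1_LINES, line);
    }

    /* Exclusive: the line lives in exactly one level. It is removed from
     * L2 when filled into L1, and the L1 victim is pushed down into L2. */
    static void l1_miss_exclusive(long line) {
        for (int i = 0; i < L2_LINES; i++)
            if (l2[i] == line) l2[i] = EMPTY;     /* remove from L2 */
        long victim = insert(l1, L1_LINES, line); /* fill L1 */
        if (victim != EMPTY) insert(l2, L2_LINES, victim);
    }

    static void dump(const char *tag) {
        printf("%s  L1: [%2ld %2ld]  L2: [%2ld %2ld %2ld %2ld]\n",
               tag, l1[0], l1[1], l2[0], l2[1], l2[2], l2[3]);
    }

    int main(void) {
        l1_miss_inclusive(10); l1_miss_inclusive(20); l1_miss_inclusive(30);
        dump("inclusive:");  /* 30 and 20 now sit in both L1 and L2 */

        memset(l1, 0xff, sizeof l1);  /* all-0xff bytes == -1 == EMPTY */
        memset(l2, 0xff, sizeof l2);
        l1_miss_exclusive(10); l1_miss_exclusive(20); l1_miss_exclusive(30);
        dump("exclusive:");  /* 10 was evicted from L1 down into L2 */
        return 0;
    }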

If the look-up finds that the leading part of the requested address matches the tag of a cache line, a hit occurs; if that cache line's flags also mark it as valid, the read or write is then performed on the cache memory.

When the look-up does not hit, or the cache line's flags mark it as invalid, a cache miss occurs, and the miss, together with the access request for that address, must be passed to the next level of the memory hierarchy, an L2 cache or memory further down. This can also lead to a cache linefill. A cache linefill copies a small piece of main memory (one cache line's worth) into the cache; at the same time, the requested data or instruction is supplied to the CPU, and all of this happens transparently to the software developer. The CPU does not have to wait for the whole linefill to complete before using the data it needs: the cache controller typically fetches the critical word (the word at the address that caused the miss) first. For example, if you execute a load instruction that misses in the cache and triggers a cache linefill, the fill takes several transfers to bring all the words of a cache line into the cache, but the cache controller fetches the critical word, the data the CPU is waiting for right now, first and hands it to the CPU immediately, while the rest of the cache line is filled in the background.
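A minimal sketch of the wrap-around fill order this implies (the 8-word line size is an assumption for illustration):

    #include <stdio.h>

    /* The beat holding the word that missed is delivered first; the
     * transfer then runs to the end of the line and wraps to the start. */
    #define WORDS_PER_LINE 8u

    int main(void) {
        unsigned critical = 5;  /* word index, within the line, that missed */
        printf("fill order:");
        for (unsigned i = 0; i < WORDS_PER_LINE; i++)
            printf(" %u", (critical + i) % WORDS_PER_LINE);
        printf("\n");  /* -> 5 6 7 0 1 2 3 4; the CPU restarts after the
                        * first beat while the rest fills in the background */
        return 0;
    }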

The detailed process of filling a cache line is described in the post below.

From: https://groups.google.com/forum/#!topic/comp.arch/iKsqAEFP4NE

When some caches miss, they just go out to memory and snort up a line in
sequential order.  I believe the MIPS R6000 is an example of this.

But with most designs, such as the 486, SPARC, 68040, etc., the first
item to be returned to the CPU is the item which caused the miss.  This
is called "critical word first".

One difference between the 486 and everybody else is that the 486 will
always read the next word from the other half of an aligned 2-word block
before reading the other two words needed to fill a cache line.

For example, if the miss is on address xxx1, the 486 will fill the cache line
in the order 1-0-3-2.  On SPARC or the 68040, the read continues to the end
of the block and wraps around, i.e. 1-2-3-0.

Why did Intel do it this way?  To optimize the performance of 64-bit
memory systems.  Using the conventional order, the CPU might read one word
from one 64-bit block, read two words from a second block, then hop back to the
first block for another word.  Intel's method always reads one block before
reading the other.
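A short C sketch of the two burst orders described here, for a 4-word line (the XOR form of the 486 order is my own restatement of "one aligned 2-word block before the other"):

    #include <stdio.h>

    /* Fill orders for a 4-word line:
     *   wrap-around (SPARC, 68040): critical word, then +1 mod 4 onward.
     *   486: critical word, then the other word of its aligned 2-word
     *        block, then the other block (XOR with 1, 2, 3). */
    int main(void) {
        for (unsigned miss = 0; miss < 4; miss++) {
            printf("miss on word %u -> wrap:", miss);
            for (unsigned i = 0; i < 4; i++)
                printf(" %u", (miss + i) % 4);
            printf("   486: %u %u %u %u\n",
                   miss, miss ^ 1, miss ^ 2, miss ^ 3);
        }
        return 0;
    }

For a miss on word 1 this prints wrap 1-2-3-0 and 486 1-0-3-2, matching the orders above.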

Does this little design tweak pay off?  Just recently, Motorola has provided
proof!  They've introduced two static RAM's with on-chip burst control logic
optimized for use as cache memories.  The MCM62940 is compatible with
conventional burst ordering.  The MCM62486 uses Intel-style burst ordering.
The MCM62940 is offered in the following speed grades: 15, 20, and 25 ns.
(Max. access time.)  The MCM62486 is offered in:  14, 19, and 24 ns.  Intel
wins by a nanosecond!


The part above is what I have understood; what follows is material I have not yet fully digested, kept here for reference.

This is discussed in Patterson & Hennessy, near end of Chapter 8.
They note it doesn't gain as much as you'd think.

How much this is worth depends a lot on the following attributes, at least:
a) I-cache versus D-cache
b) the pipeline structure
        instruction latencies
        issue parallelism
c) the memory system
        cache structure & buffering
        main memory latency & number of pipes to it

For the I-cache, when you have a cache miss:
        I1) You can stall the machine, fetch the entire cache block,
        then restart.  This is clearly the simplest.
        I2) You can do "early restart", where you begin executing as soon
        as the requested word is available.  This is sometimes called
        "Instruction Streaming" (in the MIPS R3000), i.e.,  when you
        cache miss:
                start fetching at word 0 of the block
                stall until the needed word is fetched, then stream
                if you branch elsewhere before the end of the block,
                stop streaming, stall the pipeline until block filled
                also, if a load/store causes a cache miss, complete
                the I-cache refill, then handle the D-cache miss
        I3) You can do "out-of-order fetch" in addition to early restart,
        and then do "wrapped fetch", so that you wrap-around to complete.

Cache miss penalty = access time + transfer time.
In all of these cases, the cache miss penalty is the same, but I2 and I3
try to "cover" some of the transfer time by converting stalls into
issues. (It's hard to do much about access time for I-cache misses :-)
I2 and I3 are the same, except that I3 gets to issue
instructions earlier.  This doesn't do much for short-latency operations,
but can help those with longer latency, especially if there are multiple
independent units.  In practice, this means that:
        - The approach probably helps floating point codes more than integer
        - It helps pipelined-FP units more than low-latency scalar ones
        - It can REALLY help the class of program that consists of
        a giant loop filled with sequential code larger than the cache
        size.  (Now you see the real reason why such features are included,
        as this class of program happens to be very important to chip designers,
        and they take care of themselves :-)
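To put rough numbers on the "covering", here is a toy stall-cycle calculation (all latencies assumed, purely illustrative) for a miss on word w of an 8-word block under I1, I2, and I3:

    #include <stdio.h>

    /* Miss penalty = access time + transfer time. I1 stalls for the full
     * penalty; I2 (early restart, fetching from word 0) stalls until the
     * needed word arrives; I3 (wrapped fetch) gets the needed word on the
     * first beat. The remaining beats are covered by instruction issue. */
    int main(void) {
        const unsigned access = 10, beat = 2, words = 8;  /* assumed */
        for (unsigned w = 0; w < words; w += 3) {         /* miss on word w */
            unsigned i1 = access + words * beat;
            unsigned i2 = access + (w + 1) * beat;  /* waits for words 0..w */
            unsigned i3 = access + beat;            /* critical word first */
            printf("miss on word %u: I1=%2u I2=%2u I3=%2u exposed cycles\n",
                   w, i1, i2, i3);
        }
        return 0;
    }

Note that for w = 0 the I2 and I3 stalls come out identical, which is exactly the fall-through case discussed next.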

For the I-cache, it is known that the first word of a cache block accounts
for an unusually high share of the misses, due to fall-thru. I think this
is true across a wide range of programs.  Hence, there is even less
difference between I2 and I3 than you'd expect, since they behave
identically in this case.  I can't recall the numbers, but I think the
simulations done when the R3000 was being designed showed relatively little
difference, but note that the R3000's FP operations are fairly low
latency also, hence it would have helped less than some others designed then.

For the D-cache, it's more of the same, except with more options,
and more complexity, as there are many more options for D-cache
implementation.  Here are a few of the relevant items:

- Unlike the I-cache, where word 0 of a block is more likely to cause a cache miss,
and this is true of almost every program, the effect is much more
program-dependent for the D-cache.  Some programs make many sequential
references likely to cause cache misses, others much less so.

- Unlike the I-cache, where successive instructions continue accessing
the same block, loads and stores may well access different blocks,
and this implies extra complexity or buffering.  For example, if you have:
        load        r1,a
        load        r2,a+4
        load        r3,a+8
        etc
then using read-data-streaming works real well, as does:
        load        r1,a
        load        r2,a+4
        div        r1,r3
        etc
where you get to start the divide early

On the other hand,
        load        r1,a
        load        r2,b                b in different cache block
        store        r1,c                c in third block
gets awkward.  Although a load of a+4 is probably referencing an
internal cache-line buffer, the load of b is referencing the cache,
and would do so at the same time as the cache is being refilled.
Should the refill have priority, or the load of b?
Some of your choices depend on how many memory pipes you have.

Big machines, with long memory latencies (in terms of cycle counts),
heavy pipelining for FP, and multiple memory pipes, can contemplate
more parallelism, than makes sense for a micro with 1 memory pipe,
where most attempts to "get ahead" quickly run into the constriction
of the single pipe.

All of this, especially the D-cache choices, is likely to be very
benchmark-dependent.  I'm sure there are programs where you can get
2X better by wrap-around with early restart, and others where you
get little or nothing.

All of this is just a long way of saying there is no simple answer
to the question "is this worth it?"  and that finding the answer
takes detailed simulation of many, big programs running on
simulators that know every detail of the pipeline, cache, etc.
(I've often commented in public talks that we had something like 400-500
VAX-mips used for the design of some future chip.  The designers
beat me up, saying it's more like 2X that these days...)
--
-john mashey        DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:          ma...@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:          408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:         MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

