Cache and Cache Line Fills -- Critical Word First

The following is drawn from the official Arm v8 documentation and from material found online, filtered through my own understanding:

Arm uses a Harvard architecture, with separate instruction and data buses, and therefore two kinds of cache (an instruction cache and a data cache). This differs from the von Neumann architecture's single cache, which holds both instructions and data and is called a unified cache.

On Arm v8 processors, the L1 caches are separate instruction and data caches, while the L2 cache is a unified cache.

The cache controller is a hardware block that manages the cache memory; to a large extent it is invisible to the program. It automatically moves code and data from main memory into the cache memory, handles memory read/write requests from the CPU, and performs whatever actions are needed to access the cache memory or the external memory (L2/L3/..., main memory).

When it receives a request from the CPU, it must check whether the data for the requested address is already in the cache. This check is known as a cache look-up.

Caches can be organized as fully associative, set associative, and so on. Arm's main caches have always been set associative. Compared with a direct-mapped cache, a set associative cache greatly reduces the chance of cache thrashing, which speeds up program execution and gives more deterministic behavior. The price is extra hardware complexity and a slight increase in power consumption, because several tags have to be compared in every cycle.

The detailed terminology around set associative caches is best understood from the two figures in the Arm documentation:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch11s01s02.html

Put simply, main memory can be thought of as divided into chunks the size of one cache way, so the data at any given address can be placed into a cache way. A cache way is in turn divided into a number of cache lines (also called blocks), and a cache line usually holds several words (a word being 4 bytes). The cache lines that share the same index across the different ways are collectively called a cache set. As stated above, the data at a given address can be placed into some cache way, but only at one fixed position within that way (the cache line at a particular index); putting the two together, an address can be placed in any cache line of one particular set. For example, in a cache made up of 4 ways, a set consists of 4 cache lines; these 4 lines share the same index and can hold the data of any address whose index field matches.

An address is split into three fields:

tag (compared against the tag of each cache line in the set to decide whether one hits) + index (selects the set) + offset (selects the exact word within the cache line)
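As a concrete illustration, here is a minimal C sketch of this address split, assuming a hypothetical geometry of a 32 KB, 4-way set associative cache with 64-byte lines (so 128 sets). The sizes and addresses are illustrative assumptions, not a specific Arm configuration:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 32 KB, 4-way set associative, 64-byte lines.
 * 32 KB / 64 B = 512 lines; 512 lines / 4 ways = 128 sets.
 * offset: 6 bits (within a 64-byte line), index: 7 bits (128 sets),
 * tag: the remaining upper bits of the address. */
#define LINE_BYTES   64u
#define NUM_SETS     128u
#define OFFSET_BITS  6u   /* log2(LINE_BYTES) */
#define INDEX_BITS   7u   /* log2(NUM_SETS)   */

static void split_address(uint32_t addr)
{
    uint32_t offset = addr & (LINE_BYTES - 1);                /* byte within the line */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* which set */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);     /* compared in each way */

    printf("addr 0x%08x -> tag 0x%05x, set index %3u, byte offset %2u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
}

int main(void)
{
    /* Two addresses with the same index but different tags map to the
     * same set; in a 4-way cache they can coexist in different ways,
     * whereas in a direct-mapped cache they would evict each other. */
    split_address(0x00001040);
    split_address(0x00009040);
    return 0;
}
```

The two example addresses share set index 65 but carry different tags, which is exactly the case where set associativity avoids the thrashing a direct-mapped cache would suffer.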

Caches can also be inclusive or exclusive. In an inclusive cache, when the data for an address is present in the L1 cache it may also be present in the L2 cache; when the L1 cache misses, the resulting cache line fill is performed into both the L2 cache and the L1 cache. In an exclusive cache, when the data for an address is present in the L1 cache, it must not be present in the L2 cache at the same time.

During a cache look-up, if the leading part of the requested address matches the tag of a cache line, a hit occurs; if that cache line is flagged as valid, the read or write is then performed on the cache memory.

When the look-up does not hit, or the cache line is flagged as invalid, a cache miss occurs. The miss and the access request for that address must then be passed to the next level of the memory hierarchy, an L2 cache or memory further down. This can also lead to a cache line fill. A cache line fill copies a small piece of main memory (one cache line's worth) into the cache, and at the same time the requested data or instruction is supplied to the CPU; all of this happens transparently to the software developer. The CPU does not have to wait for the whole line fill to complete before it can use the data it needs: the cache controller typically fetches the critical word first, i.e. the word at the address that caused the miss. For example, if you execute a load instruction that misses the cache and triggers a cache line fill, the fill takes several transfers to bring all the words of the cache line into the cache; the cache controller fetches the critical word first, the data the CPU needs right now, and hands it to the CPU, while the rest of the cache line is filled in the background.

The detailed process of filling a cache line is described in the material below.

from : https://groups.google.com/forum/#!topic/comp.arch/iKsqAEFP4NE

When some caches miss, they just go out to memory and snort up a line in
sequential order.  I believe the MIPS R6000 is an example of this.

But with most designs, such as the 486, SPARC, 68040, etc., the first
item to be returned to the CPU is the item which caused the miss.  This
is called "critical word first".

One difference between the 486 and everybody else is that the 486 will
always read the next word from the other half of an aligned 2-word block
before reading the other two words needed to fill a cache line.

For example, if the miss is on address xxx1, the 486 will fill the cache line
in the order 1-0-3-2.  On SPARC or the 68040, the read continues to the end
of the block and wraps around, i.e. 1-2-3-0.

Why did Intel do it this way?  To optimize the performance of 64-bit
memory systems.  Using the conventional order, the CPU might read one word
from one 64-bit block, read two words from a second block, then hop back to the
first block for another word.  Intel's method always reads one block before
reading the other.

Does this little design tweak pay off?  Just recently, Motorola has provided
proof!  They've introduced two static RAM's with on-chip burst control logic
optimized for use as cache memories.  The MCM62940 is compatible with
conventional burst ordering.  The MCM62486 uses Intel-style burst ordering.
The MCM62940 is offered in the following speed grades: 15, 20, and 25 ns.
(Max. access time.)  The MCM62486 is offered in:  14, 19, and 24 ns.
Intel wins by a nanosecond!
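To make the two orderings concrete, here is a small C sketch (mine, not part of the quoted post) that prints the fill order for each possible critical word of a 4-word line. The modulo rule models the wrap-around order, and the XOR rule is assumed here because it reproduces the 486-style interleaved order described above (1-0-3-2 for a miss on word 1):

```c
#include <stdio.h>

/* Fill orders for a 4-word cache line whose miss hit word `critical`:
 *   - wrap-around (SPARC/68040 style): (critical + i) mod 4, e.g. 1-2-3-0
 *   - Intel 486 interleaved order:      critical XOR i,      e.g. 1-0-3-2 */
#define LINE_WORDS 4

int main(void)
{
    for (int critical = 0; critical < LINE_WORDS; critical++) {
        printf("miss on word %d:  wrap-around ", critical);
        for (int i = 0; i < LINE_WORDS; i++)
            printf("%d ", (critical + i) % LINE_WORDS);

        printf("  486-style ");
        for (int i = 0; i < LINE_WORDS; i++)
            printf("%d ", critical ^ i);

        printf("\n");
    }
    return 0;
}
```

Either way the critical word comes back first; the two schemes differ only in how the remaining words of the line are ordered behind it.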


Everything above is the part I have understood; what follows is the part I have not yet digested, kept here for reference.

This is discussed in Patterson & Hennessy, near end of Chapter 8.
They note it doesn't gain as much as you'd think.

How much this is worth depends a lot on the following attributes, at least:
a) I-cache versus D-cache
b) the pipeline structure
        instruction latencies
        issue parallelism
c) the memory system
        cache structure & buffering
        main memory latency & number of pipes to it

For the I-cache, when you have a cache miss:
        I1) You can stall the machine, fetch the entire cache block,
        then restart.  This is clearly the simplest.
        I2) You can do "early restart", where you begin executing as soon
        as the requested word is available.  This is sometimes called
        "Instruction Streaming" (in the MIPS R3000), i.e.,  when you
        cache miss:
                start fetching at word 0 of the block
                stall until the needed word is fetched, then stream
                if you branch elsewhere before the end of the block,
                stop streaming, stall the pipeline until block filled
                also, if a load/store causes a cache miss, complete
                the I-cache refill, then handle the D-cache miss
        I3) You can do "out-of-order fetch" in addition to early restart,
        and then do "wrapped fetch", so that you wrap-around to complete.

Cache miss penalty = access time + transfer time.
In all of these cases, the cache miss penalty is the same, but I2 and I3
try to "cover" some of the transfer time by converting stalls into
issues. (It's hard to do much about access time for I-cache misses :-)
I2 and I3 are the same, except that I3 gets to issue
instructions earlier.  This doesn't do much for short-latency operations,
but can help those with longer latency, especially if there are multiple
independent units.  In practice, this means that:
        - The approach probably helps floating point codes more than integer
        - It helps pipelined-FP units more than low-latency scalar ones
        - It can REALLY help the class of program that consists of
        a giant loop filled with sequential code larger than the cache
        size.  (Now you see the real reason why such features are included,
        as this class of program happens to be very important to chip designers,
        and they take care of themselves :-)

For the I-cache, it is known that the first word of a cache block accounts
for an unusually high share of the misses, due to fall-thru. I think this
is true across a wide range of programs.  Hence, there is even less
difference between I2 and I3 than you'd expect, since they behave
identically in this case.  I can't recall the numbers, but I think the
simulations done when the R3000 was being designed showed relatively little
difference, but note that the R3000's FP operations are fairly low
latency also, hence would be helped less than some others designed then.

For the D-cache, it's more of the same, except with more options,
and more complexity, as there are many more options for D-cache
implementation.  Here are few of the relevant items:

- Unlike the I-cache, where word 0 of a block is more likely to cause a cache miss,
and this is true of almost every program, the effect is much more
program-dependent for the D-cache.  Some programs make many sequential
references likely to cause cache misses, others much less so.

- Unlike the I-cache, where successive instructions continue accessing
the same block, loads and stores may well access different blocks,
and this implies extra complexity or buffering.  For example, if you have:
        load        r1,a
        load        r2,a+4
        load        r3,a+8
        etc
then using read-data-streaming works real well, as does:
        load        r1,a
        load        r2,a+4
        div        r1,r3
        etc
where you get to start the divide early

On the other hand,
        load        r1,a
        load        r2,b                b in different cache block
        store        r1,c                c in third block
gets awkward.  Although a load of a+4 is probably referencing an
internal cache-line buffer, the load of b is referencing the cache,
and would do so at the same time as the cache is being refilled.
Should the refill have priority, or the load of b?
Some of your choices depend on how many memory pipes you have.

Big machines, with long memory latencies (in terms of cycle counts),
heavy pipelining for FP, and multiple memory pipes, can contemplate
more parallelism than makes sense for a micro with 1 memory pipe,
where most attempts to "get ahead" quickly run into the constriction
of the single pipe.

All of this, especially the D-cache choices, is likely to be very
benchmark-dependent.  I'm sure there are programs where you can get
2X better by wrap-around with early restart, and others where you
get little or nothing.

All of this is just a long way of saying there is no simple answer
to the question "is this worth it?"  and that finding the answer
takes detailed simulation of many, big programs running on
simulators that know every detail of the pipeline, cache, etc.
(I've often commented in public talks that we had something like 400-500
VAX-mips used for the design of some future chip.  The designers
beat me up, saying it's more like 2X that these days...)
--
-john mashey        DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:          ma...@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:          408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:         MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

