Memory Hierarchy

  • Reference: Computer Architecture (6th Edition)

The Principle of Locality


  • The Principle of Locality:
    • Programs access a relatively small portion of the address space at any instant of time.
    • It is a property of programs which is exploited in machine design.
  • Two Different Types of Locality:
    • Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
    • Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access); a small sketch follows this list
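
Both kinds of locality show up in even trivial code. A minimal C sketch (array and size are hypothetical) that exhibits them:

```c
#include <stdio.h>

#define N 1024              /* hypothetical array size */

int main(void) {
    static int a[N];        /* zero-initialized */
    long sum = 0;
    /* Spatial locality: a[0], a[1], ... are adjacent in memory, so after
     * a miss on a[i] the neighboring elements arrive in the same block.
     * Temporal locality: sum, i, and the loop instructions themselves
     * are re-referenced on every iteration. */
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("%ld\n", sum);
    return 0;
}
```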

Memory Hierarchy


  • Goal: Illusion of large, fast, cheap memory



Memory Hierarchy: Apple iMac G5

(figure)

Terminology

  • Hit: data appears in some block in the upper level (e.g., Block X)
    • Hit Rate: the fraction of memory accesses found in the upper level
      • Usually so high that we talk about the miss rate instead
    • Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
  • Miss: data needs to be retrieved from a block in the lower level (Block Y)
    • Miss Rate = 1 - (Hit Rate)
      • Miss rate is to average memory access time as MIPS is to CPU performance: an indirect measure that can be misleading on its own
    • Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
  • Hit Time << Miss Penalty

Performance Parameters of a Memory Hierarchy

  • Let $S$ denote capacity, $T_A$ access time, and $C$ cost per bit
  • Consider a two-level memory hierarchy built from $M_1$ and $M_2$:
    • Parameters of $M_1$: $S_1$, $T_{A_1}$, $C_1$
    • Parameters of $M_2$: $S_2$, $T_{A_2}$, $C_2$
  • Average cost per bit:
    $C = \dfrac{C_1 S_1 + C_2 S_2}{S_1 + S_2}$
  • Hit rate $H$ and miss rate $F$:
    $H = N_1/(N_1+N_2)$, $F = 1 - H$
    • $N_1$: number of accesses satisfied by $M_1$; $N_2$: number of accesses that must go to $M_2$
  • Average memory access time = hit time + miss rate × miss penalty
    $T_A = H\,T_{A_1} + (1-H)(T_{A_1}+T_M) = T_{A_1} + (1-H)\,T_M = T_{A_1} + F\,T_M$
    • Miss penalty $T_M$: time to replace a block from the lower level, plus the time to deliver it to the CPU
      • access time: time to reach the lower level = $f$(latency of the lower level)
      • transfer time: time to transfer the block = $f$(bandwidth between the upper and lower levels)
      • The time from issuing an access request to $M_2$ until the whole block has been loaded into $M_1$ is $T_M = T_{A_2} + T_B$, where $T_B$ is the time to transfer one block (it varies with the amount of data and the bus width)

Program Execution Time

  • CPU time = (CPU execution cycles + memory stall cycles) × clock cycle time
    • Memory stall cycles = memory accesses × miss rate × miss penalty

  • Example: assume the cache miss penalty is 50 clock cycles, every instruction takes 2.0 clock cycles when memory stalls are ignored, the cache miss rate is 2%, and there are on average 1.33 memory accesses per instruction (every instruction accesses the instruction cache, but not necessarily the data cache). Analyze the effect of the cache on performance.

  • CPU time = IC × (2.0 + 1.33 × 2% × 50) × clock cycle time = IC × 3.33 × clock cycle time
  • Effective CPI: 3.33
  • Without a cache, however, every access pays the full penalty: CPI = 2.0 + 1.33 × 50 = 68.5. The sketch below replays this arithmetic.
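
A minimal C sketch that simply replays the arithmetic of this example (all numbers are taken from the statement above):

```c
#include <stdio.h>

/* Worked example: miss penalty 50 cycles, base CPI 2.0,
 * miss rate 2%, 1.33 memory accesses per instruction. */
int main(void) {
    double base_cpi     = 2.0;
    double accesses     = 1.33;   /* memory accesses per instruction */
    double miss_rate    = 0.02;
    double miss_penalty = 50.0;   /* clock cycles */

    double cpi_with_cache = base_cpi + accesses * miss_rate * miss_penalty;
    double cpi_no_cache   = base_cpi + accesses * miss_penalty; /* every access misses */

    printf("CPI with cache:    %.2f\n", cpi_with_cache);  /* 3.33  */
    printf("CPI without cache: %.2f\n", cpi_no_cache);    /* 68.50 */
    return 0;
}
```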

Four Questions for Memory Hierarchy

Q1: Where can a block be placed in the upper level?

  • Block placement: mapping rules

  • Fully associative: any block of main memory can be placed in any location in the cache
    • Highest space utilization, lowest probability of conflict, most complex to implement
  • Direct mapped: each block of main memory can be placed in only one cache location; block $i$ of main memory maps to cache block $j = i \bmod M$ (where $M$ is the number of cache blocks); if $M = 2^m$, then in binary $j$ is simply the low-order $m$ bits of $i$
    • Lowest space utilization, highest probability of conflict, simplest to implement
  • Set associative: each block of main memory maps to exactly one set of the cache, but can be placed anywhere within that set; block $i$ maps to set $k = i \bmod G$ (where $G$ is the number of sets); if $G = 2^g$, then $k$ is the low-order $g$ bits of $i$
    • A compromise between direct mapped and fully associative: the higher the associativity, the higher the space utilization, the lower the probability of conflict, and the lower the miss rate (see the sketch after this list)
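
A small sketch of the mapping rules, assuming a hypothetical cache of M = 8 blocks organized as G = 4 two-way sets:

```c
#include <stdio.h>

/* Direct mapped: block i can live only in slot i mod M.
 * For M = 2^m this is just the low-order m bits of i (i & (M-1)). */
unsigned direct_slot(unsigned i, unsigned M) { return i % M; }

/* Set associative: block i maps to set i mod G, any way within the set.
 * Fully associative is the G = 1 case: every block maps to one big "set". */
unsigned set_of(unsigned i, unsigned G) { return i % G; }

int main(void) {
    unsigned M = 8, G = 4;            /* hypothetical cache geometry */
    for (unsigned i = 10; i <= 12; i++)
        printf("block %2u: direct-mapped -> slot %u, 2-way -> set %u\n",
               i, direct_slot(i, M), set_of(i, G));
    return 0;
}
```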

Q2: How is a block found if it is in the upper level?

  • Block identification: lookup algorithm


  • The tag identifies the block; the index selects the set (the lookup is sketched in code after the three figures below)

Fully Associative Cache

(figure)

2-Way Set-Associative Cache

(figure)

Direct-Mapped Cache

(figure)
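
A lookup sketch tying the three organizations together, with hypothetical parameters (32-byte blocks, 4 sets, 2 ways). A fully associative cache has no index bits and compares every tag; a direct-mapped cache is the WAYS = 1 case:

```c
#include <stdint.h>

#define OFFSET_BITS 5                      /* 32-byte blocks        */
#define INDEX_BITS  2                      /* 4 sets                */
#define NUM_SETS    (1u << INDEX_BITS)
#define WAYS        2                      /* 2-way set associative */

struct line { int valid; uint32_t tag; };
static struct line cache[NUM_SETS][WAYS];

int is_hit(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* set address  */
    uint32_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);    /* block marker */
    for (int w = 0; w < WAYS; w++)      /* compare tags of all ways in set */
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return 1;                   /* hit  */
    return 0;                           /* miss */
}
```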

Q3: Which block should be replaced on a miss?

  • Block replacement: replacement policy

  • Easy for direct mapped: there is no choice
  • Set associative or fully associative:
    • Random
    • First in, first out (FIFO)
    • LRU (Least Recently Used); one common implementation is sketched below
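
A minimal sketch of timestamp-based LRU for a single set (WAYS and the structure layout are illustrative, not a real cache's datapath):

```c
#include <stdint.h>

#define WAYS 4

/* One set of a set-associative cache; LRU is tracked with a global
 * access counter: the smallest stamp is the least recently used way. */
struct set {
    uint32_t tag[WAYS];
    int      valid[WAYS];
    uint64_t stamp[WAYS];
};

static uint64_t now;                        /* bumped on every access */

int access_set(struct set *s, uint32_t tag) {
    now++;
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {   /* hit: refresh stamp */
            s->stamp[w] = now;
            return 1;
        }
        if (!s->valid[w])                        /* prefer a free way  */
            victim = w;
        else if (s->valid[victim] && s->stamp[w] < s->stamp[victim])
            victim = w;                          /* older than current */
    }
    /* miss: replace the least recently used (or free) way */
    s->tag[victim]   = tag;
    s->valid[victim] = 1;
    s->stamp[victim] = now;
    return 0;
}
```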

Q4: What happens on a write?

  • Write strategy



  • Additional option: let writes to an un-cached address allocate a new cache line (“write allocate”), i.e., a write of data that is not currently in the cache.

Write allocate and No-write allocate

  • Write allocate (fetch on write)
    • The block is allocated on a write miss: the block containing the written word is first fetched into the cache, and the write then proceeds
    • Write-back caches generally use this policy (the data being written is then always in the cache)
  • No-write allocate (write around)
    • In this apparently unusual alternative, write misses do not affect the cache; the block is modified only in the lower-level memory
    • Write-through caches often use this policy

Write Policy Choices

  • Cache hit: write through / write back
  • Cache miss: no write allocate / write allocate
  • Common combinations (sketched in code below):
    • write through & no write allocate
    • write back & write allocate

Write Buffers for Write-Through Caches


  • Q. Why a write buffer?
    • So CPU doesn’t stall
  • Q. Why a buffer, why not just one register?
    • Bursts of writes are common.
  • Q. Are Read After Write (RAW) hazards an issue for write buffer?
    • Yes! Either drain the write buffer before the next read, or check the write buffer and send the read first when no entry conflicts.

Virtual Memory Address Space


  • User programs run in a standardized virtual address space (each process has its own virtual address space)
  • Address-translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory. The hardware supports “modern” OS features: protection, translation, sharing

Use virtual addresses for cache?

  • A: The synonym problem. If two address spaces share a physical frame, data may be in the cache twice. Maintaining consistency is a nightmare.

Four Memory Hierarchy Questions Revisited

Q1: Where Can a Block Be Placed in Main Memory?

  • Operating systems allow blocks to be placed anywhere in main memory. The strategy is fully associative.

Q2: How Is a Block Found If It Is in Main Memory?

  • Both paging and segmentation rely on a data structure that is indexed by the page or segment number.
    • For paging, this data structure is the page table, which contains the physical page address. It is indexed by the virtual page number, so its size is the number of pages in the virtual address space.
      • Given a 32-bit virtual address, 4 KB pages, and 4 bytes per page table entry (PTE), the size of the page table would be $(2^{32}/2^{12}) \times 2^2 = 2^{22}$ bytes, or 4 MB (checked in the sketch below).
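
A minimal sketch of this computation plus the page-table indexing it implies (the address 0x12345678 is just a hypothetical example):

```c
#include <stdint.h>
#include <stdio.h>

#define VA_BITS   32          /* 32-bit virtual addresses */
#define PAGE_BITS 12          /* 4 KB pages               */
#define PTE_SIZE  4           /* bytes per PTE            */

int main(void) {
    uint64_t num_pages  = 1ull << (VA_BITS - PAGE_BITS);  /* 2^20 entries */
    uint64_t table_size = num_pages * PTE_SIZE;           /* 2^22 bytes   */
    printf("page table: %llu entries, %llu MB\n",
           (unsigned long long)num_pages,
           (unsigned long long)(table_size >> 20));       /* 4 MB */

    uint32_t va  = 0x12345678u;                   /* hypothetical address  */
    uint32_t vpn = va >> PAGE_BITS;               /* index into page table */
    uint32_t off = va & ((1u << PAGE_BITS) - 1);  /* offset within page    */
    printf("VPN = 0x%x, offset = 0x%x\n", vpn, off);
    return 0;
}
```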

Q3: Which Block Should Be Replaced on a Virtual Memory Miss?

  • Replace the least-recently used (LRU) page.
    • A use bit or reference bit is provided. The operating system periodically clears the bits and later records them so it can determine which pages were touched during a particular time period.

Q4: What Happens on a Write?

  • Because of the great discrepancy in access time, the write strategy is always write back.
    • Virtual memory systems usually include a dirty bit. It allows blocks to be written to disk only if they have been altered since being read from the disk.

Page replacement policy

(figure)

Caching vs. Demand Paging

(figure)

Cache Design

Six Basic Cache Optimizations

  • Average memory access time = Hit time + Miss rate x Miss penalty

Reducing Miss Rate

  • (1) Larger Block size (compulsory misses)
  • (2) Larger Cache size (capacity misses)
  • (3) Higher Associativity (conflict misses)

Reducing Miss Penalty

  • (4) Multilevel Caches
  • (5) Giving Reads Priority over Writes
    • E.g., let reads complete before earlier writes that are still in the write buffer

Reducing hit time

  • (6) Avoiding address translation when indexing the cache

What causes a MISS?

Three Major Categories of Cache Misses:

  • Compulsory — The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.
  • Capacity — If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved.
  • Conflict — If the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses.

1. Larger Block Size to Reduce Miss Rate

  • Larger block sizes will reduce compulsory misses, because larger blocks take advantage of spatial locality.

Larger Blocks may Increase Conflict Misses

  • Since they reduce the number of blocks in the cache, larger blocks may increase conflict misses and even capacity misses if the cache is small.
    (figure)

As the figure shows, when the cache is only 16 KB or 4 KB, an overly large block size drives the miss rate back up: with limited capacity, oversized blocks leave room for only a few blocks in the cache, which induces the other kinds of misses.


Larger Blocks may Increase the Miss Penalty

  • Assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Thus, it can supply 16 bytes in 82 clock cycles, 32 bytes in 84 clock cycles, and so on.
    • 16-byte block → miss penalty: 82 cycles; 32-byte block → miss penalty: 84 cycles
    • The selection of block size depends on both the latency and bandwidth of the lower-level memory (the sketch below tabulates the penalty for several block sizes)
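
A minimal sketch that tabulates the miss penalty implied by the memory system just described (80 cycles of overhead, then 16 bytes per 2 cycles):

```c
#include <stdio.h>

/* Miss penalty = fixed overhead + transfer time for the block. */
int miss_penalty(int block_bytes) {
    return 80 + 2 * (block_bytes / 16);   /* 16 bytes every 2 cycles */
}

int main(void) {
    for (int b = 16; b <= 256; b *= 2)
        printf("%3d-byte block -> miss penalty %d cycles\n",
               b, miss_penalty(b));
    /* 16 -> 82, 32 -> 84, 64 -> 88, 128 -> 96, 256 -> 112 */
    return 0;
}
```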

2. Larger Caches to Reduce Miss Rate

  • The obvious way to reduce capacity misses is to increase the capacity of the cache.
    • The obvious drawback is potentially longer hit time and higher cost and power.
    • This technique has been especially popular in off-chip caches.

3. Higher Associativity to Reduce Miss Rate

  • The figure shows how miss rates improve with higher associativity.
    (figure)

Higher Associativity Increase the Clock Cycle Time

  • Greater associativity can come at the cost of increased hit time.
    • Clock cycle time 2-way = 1.36 × Clock cycle time 1-way
    • Clock cycle time 4-way = 1.44 × Clock cycle time 1-way
    • Clock cycle time 8-way = 1.52 × Clock cycle time 1-way

4. Multilevel Caches to Reduce Miss Penalty

  • The performance gap between processors and memory leads the architect to this question: Should I make the cache faster to keep pace with the speed of processors, or make the cache larger to overcome the widening gap between the processor and main memory?
    • One answer is, do both. Adding another level of cache between the original cache and memory simplifies the decision.
    • The first-level cache can be small enough to match the clock cycle time of the fast processor. Yet the second-level cache can be large enough to capture many accesses that would go to main memory, thereby lessening the effective miss penalty.

  • The hardware complexity grows too high, so in practice there are at most about three levels of cache

A Typical Memory Hierarchy

(figure)

Multilevel Caches in Multicore Architectures

(figure)

This creates the cache-coherence problem, which is covered later.


It Complicates Performance Analysis

  • Average memory access time = $\textrm{Hit time}_{L_1} + \textrm{Miss rate}_{L_1} \times (\textrm{Hit time}_{L_2} + \textrm{Miss rate}_{L_2} \times \textrm{Miss penalty}_{L_2})$
    • Local miss rate: the miss rate measured relative to the accesses that reach that cache, such as $\textrm{Miss rate}_{L_1}$ and $\textrm{Miss rate}_{L_2}$
    • Global miss rate: measured relative to all processor accesses. For the first-level cache it is still just $\textrm{Miss rate}_{L_1}$, but for the second-level cache it is $\textrm{Miss rate}_{L_1} \times \textrm{Miss rate}_{L_2}$ (a numeric sketch follows)
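
A minimal sketch with hypothetical numbers (L1 hit in 1 cycle with a 4% local miss rate; L2 hit in 10 cycles with a 50% local miss rate; 200-cycle L2 miss penalty):

```c
#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_l1 = 0.04;   /* local L1 miss rate */
    double hit_l2 = 10.0, miss_l2 = 0.50;   /* local L2 miss rate */
    double penalty_l2 = 200.0;

    double amat = hit_l1 + miss_l1 * (hit_l2 + miss_l2 * penalty_l2);
    double global_l2 = miss_l1 * miss_l2;   /* fraction missing both levels */

    printf("AMAT = %.2f cycles\n", amat);               /* 1 + 0.04*110 = 5.40 */
    printf("global L2 miss rate = %.1f%%\n", 100 * global_l2);  /* 2.0%        */
    return 0;
}
```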

5. Giving Priority to Read Misses over Writes to Reduce Miss Penalty

E.g., a read is allowed to complete before earlier writes that are still in the write buffer.

write-through cache

  • With a write-through cache the most important improvement is a write buffer of the proper size. Write buffers do complicate memory accesses because they might hold the updated value of a location needed on a read miss.
  • Assume a direct-mapped, write-through cache that maps 512 and 1024 to the same block, and a four-word write buffer that is not checked on a read miss. Will the value in $R_2$ always be equal to the value in $R_3$?
    • Not necessarily: it depends on when the pending data in the write buffer is actually written back to memory.
```
SW R3, 512(R0)    ; M[512] ← R3   (cache index 0; the store actually goes into the write buffer)
LW R1, 1024(R0)   ; R1 ← M[1024]  (cache index 0; the block at 1024 replaces the block at 512)
LW R2, 512(R0)    ; R2 ← M[512]   (cache index 0; this read miss may fetch the stale value from memory)
```

Solve the Problem

  • The simplest way is for the read miss to wait until the write buffer is empty.
  • The alternative is to check the contents of the write buffer on a read miss; if there are no conflicts and the memory system is available, let the read miss continue. (When there is a conflict, a more aggressive option is to forward the data directly from the write buffer, as sketched below.)
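
A sketch of the aggressive read-miss path; the buffer layout and `memory_read` are hypothetical stand-ins for the lower-level interface:

```c
#include <stdint.h>

#define WB_ENTRIES 4                       /* four-word write buffer */

/* Hypothetical write-buffer entries holding pending stores. */
static struct { uint32_t addr, data; int valid; } wbuf[WB_ENTRIES];

uint32_t memory_read(uint32_t addr);       /* assumed lower-level read */

/* On a read miss, scan the write buffer: if the missed address matches
 * a pending store, forward that data instead of reading the (stale)
 * value from memory; otherwise the read may proceed to memory. */
uint32_t read_miss(uint32_t addr) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].addr == addr)
            return wbuf[i].data;           /* forward from write buffer */
    return memory_read(addr);              /* no conflict: go to memory */
}
```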

write-back cache

  • The cost of writes by the processor in a write-back cache can also be reduced. Suppose a read miss will replace a dirty memory block. Instead of writing the dirty block to memory, and then reading memory, we could copy the dirty block to a buffer, then read memory, and then write memory. This way the processor read, for which the processor is probably waiting, will finish sooner.

6. Avoiding Address Translation during Indexing of the Cache to Reduce Hit Time

  • Cache must cope with the translation of a virtual address from the processor to a physical address to access memory.

Virtual caches vs. physical caches

  • Using virtual addresses for the cache eliminates address-translation time from a cache hit. But every process has its own virtual address space, so different virtual addresses may correspond to the same physical address. There are two solutions:
    • (1) Flush the cache whenever a process is switched, since after the switch the same virtual addresses refer to different physical addresses.
    • (2) Alternatively, increase the width of the cache address tag with a process-identifier tag (PID).

In the end, virtual caches never became popular, because of the cache-consistency problems they create.

7. Victim Cache to Reduce Miss Rate

  • Basic idea: between the cache and the path it uses to fetch data from the next-lower level, place a small fully associative cache that holds blocks evicted from the cache (the “victims”), in case they are reused (see the sketch below)
    • Very effective at reducing conflict misses, especially for small direct-mapped data caches
    • For example, a victim cache with 4 entries can remove 20%–90% of the conflict misses of a 4 KB direct-mapped cache
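
A minimal sketch of a 4-entry victim cache; the FIFO replacement and tag-only entries are simplifying assumptions:

```c
#include <stdint.h>

#define VC_ENTRIES 4                        /* e.g., a 4-entry victim cache */

/* Hypothetical fully associative victim cache holding evicted blocks. */
static struct { uint32_t tag; int valid; } victim[VC_ENTRIES];

/* Checked on a main-cache miss: compare against every entry. */
int victim_lookup(uint32_t tag) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim[i].valid && victim[i].tag == tag)
            return i;                       /* found: swap the block back */
    return -1;                              /* not here: go to next level */
}

/* Called when the main cache evicts a block. */
void victim_insert(uint32_t tag) {
    static int next = 0;                    /* simple FIFO replacement */
    victim[next].tag   = tag;
    victim[next].valid = 1;
    next = (next + 1) % VC_ENTRIES;
}
```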

Summary of Basic Cache Optimization

  • No optimization in this figure helps more than one category.

“+” means that the technique improves the factor; “−” means that it hurts that factor.

(figure)
