Memory Hierarchy

  • Reference: Computer Architecture (6th Edition)

The Principle of Locality


  • The Principle of Locality:
    • Programs access a relatively small portion of the address space at any instant of time.
    • It is a property of programs which is exploited in machine design.
  • Two Different Types of Locality:
    • Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
    • Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access); a small sketch follows this list
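
Both kinds of locality show up in even trivial code. A minimal C sketch (array and size are hypothetical) that exhibits them:

```c
#include <stdio.h>

#define N 1024              /* hypothetical array size */

int main(void) {
    static int a[N];        /* zero-initialized */
    long sum = 0;
    /* Spatial locality: a[0], a[1], ... are adjacent in memory, so after
     * a miss on a[i] the neighboring elements arrive in the same block.
     * Temporal locality: sum, i, and the loop instructions themselves
     * are re-referenced on every iteration. */
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("%ld\n", sum);
    return 0;
}
```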

Memory Hierarchy


  • Goal: Illusion of large, fast, cheap memory



Memory Hierarchy: Apple iMac G5

(figure)

Terminology

  • Hit: data appears in some block in the upper level (e.g., Block X)
    • Hit Rate: the fraction of memory accesses found in the upper level
      • Usually so high that we talk about the miss rate instead
    • Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
  • Miss: data needs to be retrieved from a block in the lower level (Block Y)
    • Miss Rate = 1 - (Hit Rate)
      • Miss rate is to average memory access time as MIPS is to CPU performance: an indirect measure that can be misleading on its own
    • Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
  • Hit Time << Miss Penalty

Performance Parameters of a Memory Hierarchy

  • Let $S$ denote capacity, $T_A$ access time, and $C$ cost per bit
  • Consider a two-level memory hierarchy built from $M_1$ and $M_2$:
    • Parameters of $M_1$: $S_1$, $T_{A_1}$, $C_1$
    • Parameters of $M_2$: $S_2$, $T_{A_2}$, $C_2$
  • Average cost per bit:
    $C = \dfrac{C_1 S_1 + C_2 S_2}{S_1 + S_2}$
  • Hit rate $H$ and miss rate $F$:
    $H = N_1/(N_1+N_2)$, $F = 1 - H$
    • $N_1$: number of accesses satisfied by $M_1$; $N_2$: number of accesses that must go to $M_2$
  • Average memory access time = hit time + miss rate × miss penalty
    $T_A = H\,T_{A_1} + (1-H)(T_{A_1}+T_M) = T_{A_1} + (1-H)\,T_M = T_{A_1} + F\,T_M$
    • Miss penalty $T_M$: time to replace a block from the lower level, plus the time to deliver it to the CPU
      • access time: time to reach the lower level = $f$(latency of the lower level)
      • transfer time: time to transfer the block = $f$(bandwidth between the upper and lower levels)
      • The time from issuing an access request to $M_2$ until the whole block has been loaded into $M_1$ is $T_M = T_{A_2} + T_B$, where $T_B$ is the time to transfer one block (it varies with the amount of data and the bus width)

Program Execution Time

  • CPU time = (CPU execution cycles + memory stall cycles) × clock cycle time
    • Memory stall cycles = memory accesses × miss rate × miss penalty

  • Example: assume the cache miss penalty is 50 clock cycles, every instruction takes 2.0 clock cycles when memory stalls are ignored, the cache miss rate is 2%, and there are on average 1.33 memory accesses per instruction (every instruction accesses the instruction cache, but not necessarily the data cache). Analyze the effect of the cache on performance.

  • CPU time = IC × (2.0 + 1.33 × 2% × 50) × clock cycle time = IC × 3.33 × clock cycle time
  • Effective CPI: 3.33
  • Without a cache, however, every access pays the full penalty: CPI = 2.0 + 1.33 × 50 = 68.5. The sketch below replays this arithmetic.
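
A minimal C sketch that simply replays the arithmetic of this example (all numbers are taken from the statement above):

```c
#include <stdio.h>

/* Worked example: miss penalty 50 cycles, base CPI 2.0,
 * miss rate 2%, 1.33 memory accesses per instruction. */
int main(void) {
    double base_cpi     = 2.0;
    double accesses     = 1.33;   /* memory accesses per instruction */
    double miss_rate    = 0.02;
    double miss_penalty = 50.0;   /* clock cycles */

    double cpi_with_cache = base_cpi + accesses * miss_rate * miss_penalty;
    double cpi_no_cache   = base_cpi + accesses * miss_penalty; /* every access misses */

    printf("CPI with cache:    %.2f\n", cpi_with_cache);  /* 3.33  */
    printf("CPI without cache: %.2f\n", cpi_no_cache);    /* 68.50 */
    return 0;
}
```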

Four Questions for Memory Hierarchy

Q1: Where can a block be placed in the upper level?

  • Block placement: mapping rules

  • Fully associative: any block of main memory can be placed in any location in the cache
    • Highest space utilization, lowest probability of conflict, most complex to implement
  • Direct mapped: each block of main memory can be placed in only one cache location; block $i$ of main memory maps to cache block $j = i \bmod M$ (where $M$ is the number of cache blocks); if $M = 2^m$, then in binary $j$ is simply the low-order $m$ bits of $i$
    • Lowest space utilization, highest probability of conflict, simplest to implement
  • Set associative: each block of main memory maps to exactly one set of the cache, but can be placed anywhere within that set; block $i$ maps to set $k = i \bmod G$ (where $G$ is the number of sets); if $G = 2^g$, then $k$ is the low-order $g$ bits of $i$
    • A compromise between direct mapped and fully associative: the higher the associativity, the higher the space utilization, the lower the probability of conflict, and the lower the miss rate (see the sketch after this list)
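
A small sketch of the mapping rules, assuming a hypothetical cache of M = 8 blocks organized as G = 4 two-way sets:

```c
#include <stdio.h>

/* Direct mapped: block i can live only in slot i mod M.
 * For M = 2^m this is just the low-order m bits of i (i & (M-1)). */
unsigned direct_slot(unsigned i, unsigned M) { return i % M; }

/* Set associative: block i maps to set i mod G, any way within the set.
 * Fully associative is the G = 1 case: every block maps to one big "set". */
unsigned set_of(unsigned i, unsigned G) { return i % G; }

int main(void) {
    unsigned M = 8, G = 4;            /* hypothetical cache geometry */
    for (unsigned i = 10; i <= 12; i++)
        printf("block %2u: direct-mapped -> slot %u, 2-way -> set %u\n",
               i, direct_slot(i, M), set_of(i, G));
    return 0;
}
```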

Q2: How is a block found if it is in the upper level?

  • Block identification: lookup algorithm


  • The tag identifies the block; the index selects the set (the lookup is sketched in code after the three figures below)

Fully Associative Cache

(figure)

2-Way Set-Associative Cache

(figure)

Direct-Mapped Cache

(figure)
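
A lookup sketch tying the three organizations together, with hypothetical parameters (32-byte blocks, 4 sets, 2 ways). A fully associative cache has no index bits and compares every tag; a direct-mapped cache is the WAYS = 1 case:

```c
#include <stdint.h>

#define OFFSET_BITS 5                      /* 32-byte blocks        */
#define INDEX_BITS  2                      /* 4 sets                */
#define NUM_SETS    (1u << INDEX_BITS)
#define WAYS        2                      /* 2-way set associative */

struct line { int valid; uint32_t tag; };
static struct line cache[NUM_SETS][WAYS];

int is_hit(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* set address  */
    uint32_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);    /* block marker */
    for (int w = 0; w < WAYS; w++)      /* compare tags of all ways in set */
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return 1;                   /* hit  */
    return 0;                           /* miss */
}
```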

Q3: Which block should be replaced on a miss?

  • Block replacement: replacement policy

  • Easy for direct mapped: there is no choice
  • Set associative or fully associative:
    • Random
    • First in, first out (FIFO)
    • LRU (Least Recently Used); one common implementation is sketched below
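
A minimal sketch of timestamp-based LRU for a single set (WAYS and the structure layout are illustrative, not a real cache's datapath):

```c
#include <stdint.h>

#define WAYS 4

/* One set of a set-associative cache; LRU is tracked with a global
 * access counter: the smallest stamp is the least recently used way. */
struct set {
    uint32_t tag[WAYS];
    int      valid[WAYS];
    uint64_t stamp[WAYS];
};

static uint64_t now;                        /* bumped on every access */

int access_set(struct set *s, uint32_t tag) {
    now++;
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {   /* hit: refresh stamp */
            s->stamp[w] = now;
            return 1;
        }
        if (!s->valid[w])                        /* prefer a free way  */
            victim = w;
        else if (s->valid[victim] && s->stamp[w] < s->stamp[victim])
            victim = w;                          /* older than current */
    }
    /* miss: replace the least recently used (or free) way */
    s->tag[victim]   = tag;
    s->valid[victim] = 1;
    s->stamp[victim] = now;
    return 0;
}
```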

Q4: What happens on a write?

  • Write strategy



  • Additional option: let writes to an un-cached address allocate a new cache line (“write allocate”), i.e., a write of data that is not currently in the cache.

Write allocate and No-write allocate

  • Write allocate (fetch on write)
    • The block is allocated on a write miss: the block containing the written word is first fetched into the cache, and the write then proceeds
    • Write-back caches generally use this policy (the data being written is then always in the cache)
  • No-write allocate (write around)
    • In this apparently unusual alternative, write misses do not affect the cache; the block is modified only in the lower-level memory
    • Write-through caches often use this policy

Write Policy Choices

  • Cache hit: write through / write back
  • Cache miss: no write allocate / write allocate
  • Common combinations (sketched in code below):
    • write through & no write allocate
    • write back & write allocate

Write Buffers for Write-Through Caches


  • Q. Why a write buffer?
    • So CPU doesn’t stall
  • Q. Why a buffer, why not just one register?
    • Bursts of writes are common.
  • Q. Are Read After Write (RAW) hazards an issue for write buffer?
    • Yes! Either drain the write buffer before the next read, or check the write buffer and send the read first when no entry conflicts.

Virtual Memory Address Space


  • User programs run in a standardized virtual address space (each process has its own virtual address space)
  • Address-translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory. The hardware supports “modern” OS features: protection, translation, sharing

Use virtual addresses for cache?

  • A: The synonym problem. If two address spaces share a physical frame, data may be in the cache twice. Maintaining consistency is a nightmare.

Four Memory Hierarchy Questions Revisited

Q1: Where Can a Block Be Placed in Main Memory?

  • Operating systems allow blocks to be placed anywhere in main memory. The strategy is fully associative.

Q2: How Is a Block Found If It Is in Main Memory?

  • Both paging and segmentation rely on a data structure that is indexed by the page or segment number.
    • For paging, this data structure is the page table, which contains the physical page address. It is indexed by the virtual page number, so its size is the number of pages in the virtual address space.
      • Given a 32-bit virtual address, 4 KB pages, and 4 bytes per page table entry (PTE), the size of the page table would be $(2^{32}/2^{12}) \times 2^2 = 2^{22}$ bytes, or 4 MB (checked in the sketch below).
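
A minimal sketch of this computation plus the page-table indexing it implies (the address 0x12345678 is just a hypothetical example):

```c
#include <stdint.h>
#include <stdio.h>

#define VA_BITS   32          /* 32-bit virtual addresses */
#define PAGE_BITS 12          /* 4 KB pages               */
#define PTE_SIZE  4           /* bytes per PTE            */

int main(void) {
    uint64_t num_pages  = 1ull << (VA_BITS - PAGE_BITS);  /* 2^20 entries */
    uint64_t table_size = num_pages * PTE_SIZE;           /* 2^22 bytes   */
    printf("page table: %llu entries, %llu MB\n",
           (unsigned long long)num_pages,
           (unsigned long long)(table_size >> 20));       /* 4 MB */

    uint32_t va  = 0x12345678u;                   /* hypothetical address  */
    uint32_t vpn = va >> PAGE_BITS;               /* index into page table */
    uint32_t off = va & ((1u << PAGE_BITS) - 1);  /* offset within page    */
    printf("VPN = 0x%x, offset = 0x%x\n", vpn, off);
    return 0;
}
```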

Q3: Which Block Should Be Replaced on a Virtual Memory Miss?

  • Replace the least-recently used (LRU) page.
    • A use bit or reference bit is provided. The operating system periodically clears the bits and later records them so it can determine which pages were touched during a particular time period.

Q4: What Happens on a Write?

  • Because of the great discrepancy in access time, the write strategy is always write back.
    • Virtual memory systems usually include a dirty bit. It allows blocks to be written to disk only if they have been altered since being read from the disk.

Page replacement policy

(figure)

Caching vs. Demand Paging

(figure)

Cache Design

Six Basic Cache Optimizations

  • Average memory access time = Hit time + Miss rate x Miss penalty

Reducing Miss Rate

  • (1) Larger Block size (compulsory misses)
  • (2) Larger Cache size (capacity misses)
  • (3) Higher Associativity (conflict misses)

Reducing Miss Penalty

  • (4) Multilevel Caches
  • (5) Giving Reads Priority over Writes
    • E.g., let reads complete before earlier writes that are still in the write buffer

Reducing hit time

  • (6) Avoiding address translation when indexing the cache

What causes a MISS?

Three Major Categories of Cache Misses:

  • Compulsory — The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.
  • Capacity — If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved.
  • Conflict — If the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses.

1. Larger Block Size to Reduce Miss Rate

  • Larger block sizes will reduce compulsory misses, because larger blocks take advantage of spatial locality.

Larger Blocks may Increase Conflict Misses

  • Since they reduce the number of blocks in the cache, larger blocks may increase conflict misses and even capacity misses if the cache is small.
    (figure)

As the figure shows, when the cache is only 16 KB or 4 KB, an overly large block size drives the miss rate back up: with limited capacity, oversized blocks leave room for only a few blocks in the cache, which induces the other kinds of misses.


Larger Blocks may Increase the Miss Penalty

  • Assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Thus, it can supply 16 bytes in 82 clock cycles, 32 bytes in 84 clock cycles, and so on.
    • 16-byte block → miss penalty: 82 cycles; 32-byte block → miss penalty: 84 cycles
    • The selection of block size depends on both the latency and bandwidth of the lower-level memory (the sketch below tabulates the penalty for several block sizes)
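
A minimal sketch that tabulates the miss penalty implied by the memory system just described (80 cycles of overhead, then 16 bytes per 2 cycles):

```c
#include <stdio.h>

/* Miss penalty = fixed overhead + transfer time for the block. */
int miss_penalty(int block_bytes) {
    return 80 + 2 * (block_bytes / 16);   /* 16 bytes every 2 cycles */
}

int main(void) {
    for (int b = 16; b <= 256; b *= 2)
        printf("%3d-byte block -> miss penalty %d cycles\n",
               b, miss_penalty(b));
    /* 16 -> 82, 32 -> 84, 64 -> 88, 128 -> 96, 256 -> 112 */
    return 0;
}
```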

2. Larger Caches to Reduce Miss Rate

  • The obvious way to reduce capacity misses is to increase the capacity of the cache.
    • The obvious drawback is potentially longer hit time and higher cost and power.
    • This technique has been especially popular in off-chip caches.

3. Higher Associativity to Reduce Miss Rate

  • The figure shows how miss rates improve with higher associativity.
    (figure)

Higher Associativity Increase the Clock Cycle Time

  • Greater associativity can come at the cost of increased hit time.
    • Clock cycle time 2-way = 1.36 × Clock cycle time 1-way
    • Clock cycle time 4-way = 1.44 × Clock cycle time 1-way
    • Clock cycle time 8-way = 1.52 × Clock cycle time 1-way

4. Multilevel Caches to Reduce Miss Penalty

  • The performance gap between processors and memory leads the architect to this question: Should I make the cache faster to keep pace with the speed of processors, or make the cache larger to overcome the widening gap between the processor and main memory?
    • One answer is, do both. Adding another level of cache between the original cache and memory simplifies the decision.
    • The first-level cache can be small enough to match the clock cycle time of the fast processor. Yet the second-level cache can be large enough to capture many accesses that would go to main memory, thereby lessening the effective miss penalty.

  • The hardware complexity grows too high, so in practice there are at most about three levels of cache

A Typical Memory Hierarchy

(figure)

Multilevel Caches in Multicore Architectures

(figure)

This creates the cache-coherence problem, which is covered later.


It Complicates Performance Analysis

  • Average memory access time = $\textrm{Hit time}_{L_1} + \textrm{Miss rate}_{L_1} \times (\textrm{Hit time}_{L_2} + \textrm{Miss rate}_{L_2} \times \textrm{Miss penalty}_{L_2})$
    • Local miss rate: the miss rate measured relative to the accesses that reach that cache, such as $\textrm{Miss rate}_{L_1}$ and $\textrm{Miss rate}_{L_2}$
    • Global miss rate: measured relative to all processor accesses. For the first-level cache it is still just $\textrm{Miss rate}_{L_1}$, but for the second-level cache it is $\textrm{Miss rate}_{L_1} \times \textrm{Miss rate}_{L_2}$ (a numeric sketch follows)
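
A minimal sketch with hypothetical numbers (L1 hit in 1 cycle with a 4% local miss rate; L2 hit in 10 cycles with a 50% local miss rate; 200-cycle L2 miss penalty):

```c
#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_l1 = 0.04;   /* local L1 miss rate */
    double hit_l2 = 10.0, miss_l2 = 0.50;   /* local L2 miss rate */
    double penalty_l2 = 200.0;

    double amat = hit_l1 + miss_l1 * (hit_l2 + miss_l2 * penalty_l2);
    double global_l2 = miss_l1 * miss_l2;   /* fraction missing both levels */

    printf("AMAT = %.2f cycles\n", amat);               /* 1 + 0.04*110 = 5.40 */
    printf("global L2 miss rate = %.1f%%\n", 100 * global_l2);  /* 2.0%        */
    return 0;
}
```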

5. Giving Priority to Read Misses over Writes to Reduce Miss Penalty

E.g., a read is allowed to complete before earlier writes that are still in the write buffer.

write-through cache

  • With a write-through cache the most important improvement is a write buffer of the proper size. Write buffers do complicate memory accesses because they might hold the updated value of a location needed on a read miss.
  • Assume a direct-mapped, write-through cache that maps 512 and 1024 to the same block, and a four-word write buffer that is not checked on a read miss. Will the value in $R_2$ always be equal to the value in $R_3$?
    • Not necessarily: it depends on when the pending data in the write buffer is actually written back to memory.
```
SW R3, 512(R0)    ; M[512] ← R3   (cache index 0; the store actually goes into the write buffer)
LW R1, 1024(R0)   ; R1 ← M[1024]  (cache index 0; the block at 1024 replaces the block at 512)
LW R2, 512(R0)    ; R2 ← M[512]   (cache index 0; this read miss may fetch the stale value from memory)
```

Solve the Problem

  • The simplest way is for the read miss to wait until the write buffer is empty.
  • The alternative is to check the contents of the write buffer on a read miss; if there are no conflicts and the memory system is available, let the read miss continue. (When there is a conflict, a more aggressive option is to forward the data directly from the write buffer, as sketched below.)
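
A sketch of the aggressive read-miss path; the buffer layout and `memory_read` are hypothetical stand-ins for the lower-level interface:

```c
#include <stdint.h>

#define WB_ENTRIES 4                       /* four-word write buffer */

/* Hypothetical write-buffer entries holding pending stores. */
static struct { uint32_t addr, data; int valid; } wbuf[WB_ENTRIES];

uint32_t memory_read(uint32_t addr);       /* assumed lower-level read */

/* On a read miss, scan the write buffer: if the missed address matches
 * a pending store, forward that data instead of reading the (stale)
 * value from memory; otherwise the read may proceed to memory. */
uint32_t read_miss(uint32_t addr) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].addr == addr)
            return wbuf[i].data;           /* forward from write buffer */
    return memory_read(addr);              /* no conflict: go to memory */
}
```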

write-back cache

  • The cost of writes by the processor in a write-back cache can also be reduced. Suppose a read miss will replace a dirty memory block. Instead of writing the dirty block to memory, and then reading memory, we could copy the dirty block to a buffer, then read memory, and then write memory. This way the processor read, for which the processor is probably waiting, will finish sooner.

6. Avoiding Address Translation during Indexing of the Cache to Reduce Hit Time

  • Cache must cope with the translation of a virtual address from the processor to a physical address to access memory.

Virtual caches vs. physical caches

  • Using virtual addresses for the cache eliminates address-translation time from a cache hit. But every process has its own virtual address space, so different virtual addresses may correspond to the same physical address. There are two solutions:
    • (1) Flush the cache whenever a process is switched, since after the switch the same virtual addresses refer to different physical addresses.
    • (2) Alternatively, increase the width of the cache address tag with a process-identifier tag (PID).

In the end, virtual caches never became popular, because of the cache-consistency problems they create.

7. Victim Cache to Reduce Miss Rate

  • Basic idea: between the cache and the path it uses to fetch data from the next-lower level, place a small fully associative cache that holds blocks evicted from the cache (the “victims”), in case they are reused (see the sketch below)
    • Very effective at reducing conflict misses, especially for small direct-mapped data caches
    • For example, a victim cache with 4 entries can remove 20%–90% of the conflict misses of a 4 KB direct-mapped cache
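
A minimal sketch of a 4-entry victim cache; the FIFO replacement and tag-only entries are simplifying assumptions:

```c
#include <stdint.h>

#define VC_ENTRIES 4                        /* e.g., a 4-entry victim cache */

/* Hypothetical fully associative victim cache holding evicted blocks. */
static struct { uint32_t tag; int valid; } victim[VC_ENTRIES];

/* Checked on a main-cache miss: compare against every entry. */
int victim_lookup(uint32_t tag) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim[i].valid && victim[i].tag == tag)
            return i;                       /* found: swap the block back */
    return -1;                              /* not here: go to next level */
}

/* Called when the main cache evicts a block. */
void victim_insert(uint32_t tag) {
    static int next = 0;                    /* simple FIFO replacement */
    victim[next].tag   = tag;
    victim[next].valid = 1;
    next = (next + 1) % VC_ENTRIES;
}
```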

Summary of Basic Cache Optimization

  • No optimization in this figure helps more than one category.

“+” means that the technique improves the factor; “−” means that it hurts that factor.

(figure)
