5.1
In this exercise we look at memory locality properties of matrix computation. The following code is written in C, where elements within the same row are stored contiguously. Assume each word is a 32-bit integer.
for (I=0; I<8; I++)
for (J=0; J<8000; J++)
A[I][J]=B[I][0]+A[J][I];
5.1.1 [5] <§5.1> How many 32-bit integers can be stored in a 16-byte cache block?
$1\ \text{byte} = 8\ \text{bits}$, so a 16-byte block holds
$16 \times 8 / 32 = 4$
32-bit integers.
5.1.2 [5] <§5.1> References to which variables exhibit temporal locality?
Accesses to $I$, $J$, and $B[I][0]$ exhibit temporal locality (each is accessed again within the loop).
5.1.3 [5] <§5.1> References to which variables exhibit spatial locality?
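$A[I][J]$ exhibits spatial locality: successive iterations of the inner loop access the next adjacent location.
$A[J][I]$, by contrast, touches locations that are far apart on successive iterations, so it is not considered to have spatial locality.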
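The difference between the two access patterns can be made concrete by computing how far apart consecutive inner-loop accesses land in a row-major array. The sketch below is illustrative only: the helper names `row_major_offset` and `inner_loop_stride_words` and the row length passed in are assumptions, not part of the exercise.

```c
#include <assert.h>
#include <stddef.h>

/* Word offset of element [row][col] in a row-major (C-style) array
   whose rows each hold `ncols` 32-bit words. */
size_t row_major_offset(size_t row, size_t col, size_t ncols) {
    assert(col < ncols);
    return row * ncols + col;
}

/* Distance in words between the accesses made at J and J+1.
   transposed == 0 models A[I][J]; transposed != 0 models A[J][I]. */
size_t inner_loop_stride_words(int transposed, size_t i, size_t j, size_t ncols) {
    if (transposed) {
        return row_major_offset(j + 1, i, ncols) - row_major_offset(j, i, ncols);
    }
    return row_major_offset(i, j + 1, ncols) - row_major_offset(i, j, ncols);
}
```

With an assumed row length of 8000 words, the `A[I][J]` pattern advances by 1 word per iteration, which usually stays inside the same 16-byte block, while the `A[J][I]` pattern advances by a whole row of 8000 words.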
Locality is affected by both the reference order and data layout. The same computation can also be written below in Matlab, which differs from C by storing matrix elements within the same column contiguously in memory.
for I=1:8
for J=1:8000
A(I,J)=B(I,0)+A(J,I);
end
end
5.1.4 [10] <§5.1> How many 16-byte cache blocks are needed to store all 32-bit matrix elements being referenced?
There are $8 \times 800 = 6400$ 32-bit matrix elements in total.
From 5.1.1, one 16-byte cache block holds 4 of them.
The total needed is therefore $6400 / 4 = 1600$ blocks.
5.1.5 [5] <§5.1> References to which variables exhibit temporal locality?
Accesses to $I$, $J$, and $B(I,0)$ exhibit temporal locality (each is accessed again within the loop).
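5.1.6 [5] <§5.1> References to which variables exhibit spatial locality?
$A(J,I)$ exhibits spatial locality: Matlab stores the elements of a column contiguously, so successive iterations of the inner loop over $J$ access the next adjacent location. $A(I,J)$ moves to a different column on every iteration, so its accesses are far apart.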
5.2
Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 32-bit memory address references, given as word addresses.
3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
5.2.1 [10] <§5.3> For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.
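The cache is initially empty.
- The cache has $16 = 2^4$ blocks, so the index field is $n = 4$ bits wide.
- Each block holds $1 = 2^0$ word, so $m = 0$ and no block-offset bits are needed.
The word addresses here fit in 8 bits, so the remaining 4 bits are used entirely for the tag.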
Word address | Binary address | Tag | Index | Hit/Miss |
---|---|---|---|---|
3 | 0000 0011 | $0000_{(2)} = 0$ | $0011_{(2)} = 3$ | Miss |
180 | 1011 0100 | $1011_{(2)} = 11$ | $0100_{(2)} = 4$ | Miss |
43 | 0010 1011 | $0010_{(2)} = 2$ | $1011_{(2)} = 11$ | Miss |
2 | 0000 0010 | $0000_{(2)} = 0$ | $0010_{(2)} = 2$ | Miss |
191 | 1011 1111 | $1011_{(2)} = 11$ | $1111_{(2)} = 15$ | Miss |
88 | 0101 1000 | $0101_{(2)} = 5$ | $1000_{(2)} = 8$ | Miss |
190 | 1011 1110 | $1011_{(2)} = 11$ | $1110_{(2)} = 14$ | Miss |
14 | 0000 1110 | $0000_{(2)} = 0$ | $1110_{(2)} = 14$ | Miss |
181 | 1011 0101 | $1011_{(2)} = 11$ | $0101_{(2)} = 5$ | Miss |
44 | 0010 1100 | $0010_{(2)} = 2$ | $1100_{(2)} = 12$ | Miss |
186 | 1011 1010 | $1011_{(2)} = 11$ | $1010_{(2)} = 10$ | Miss |
253 | 1111 1101 | $1111_{(2)} = 15$ | $1101_{(2)} = 13$ | Miss |
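The tag and index columns can be checked mechanically with shifts and masks. The following is a minimal sketch, assuming word addresses and a direct-mapped cache; the type `AddrFields` and the function `split_word_address` are illustrative names, not part of the exercise.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;   /* word offset inside the block */
} AddrFields;

/* Split a word address for a direct-mapped cache with 2^index_bits blocks
   of 2^offset_bits words each. */
AddrFields split_word_address(uint32_t word_addr, unsigned index_bits,
                              unsigned offset_bits) {
    AddrFields f;
    assert(index_bits + offset_bits < 32);
    f.offset = word_addr & ((1u << offset_bits) - 1u);
    f.index  = (word_addr >> offset_bits) & ((1u << index_bits) - 1u);
    f.tag    = word_addr >> (offset_bits + index_bits);
    return f;
}
```

For the 16-block, one-word-block cache, `split_word_address(180, 4, 0)` yields tag 11 and index 4.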
5.2.2 [10] <§5.3> For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with two-word blocks and a total size of 8 blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.
- The cache has $8 = 2^3$ blocks, so the index field is $n = 3$ bits wide.
- Each block holds $2 = 2^1$ words, so $m = 1$ and the lowest bit is the word offset.
The remaining 4 bits are used for the tag.
Word address | Binary address | Tag | Index | Hit/Miss |
---|---|---|---|---|
3 | 0000 0011 | $0000_{(2)} = 0$ | $001_{(2)} = 1$ | Miss |
180 | 1011 0100 | $1011_{(2)} = 11$ | $010_{(2)} = 2$ | Miss |
43 | 0010 1011 | $0010_{(2)} = 2$ | $101_{(2)} = 5$ | Miss |
2 | 0000 0010 | $0000_{(2)} = 0$ | $001_{(2)} = 1$ | Hit (row 1) |
191 | 1011 1111 | $1011_{(2)} = 11$ | $111_{(2)} = 7$ | Miss |
88 | 0101 1000 | $0101_{(2)} = 5$ | $100_{(2)} = 4$ | Miss |
190 | 1011 1110 | $1011_{(2)} = 11$ | $111_{(2)} = 7$ | Hit (row 5) |
14 | 0000 1110 | $0000_{(2)} = 0$ | $111_{(2)} = 7$ | Miss |
181 | 1011 0101 | $1011_{(2)} = 11$ | $010_{(2)} = 2$ | Hit (row 2) |
44 | 0010 1100 | $0010_{(2)} = 2$ | $110_{(2)} = 6$ | Miss |
186 | 1011 1010 | $1011_{(2)} = 11$ | $101_{(2)} = 5$ | Miss |
253 | 1111 1101 | $1111_{(2)} = 15$ | $110_{(2)} = 6$ | Miss |
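The hit/miss column can be reproduced by replaying the reference list through a small model of the cache. This is a sketch under the exercise's assumptions (direct-mapped, initially empty, word addresses); `simulate_direct_mapped` and `MAX_BLOCKS` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 64

/* Replay word addresses through a direct-mapped cache with 2^index_bits
   blocks of 2^offset_bits words; hit[i] records the outcome of access i.
   Returns the number of hits. The cache starts empty. */
size_t simulate_direct_mapped(const uint32_t *addrs, size_t n,
                              unsigned index_bits, unsigned offset_bits,
                              bool *hit) {
    bool valid[MAX_BLOCKS] = { false };
    uint32_t tags[MAX_BLOCKS] = { 0 };
    size_t hits = 0;

    assert((1u << index_bits) <= MAX_BLOCKS);
    for (size_t i = 0; i < n; i++) {
        uint32_t block = addrs[i] >> offset_bits;
        uint32_t index = block & ((1u << index_bits) - 1u);
        uint32_t tag = block >> index_bits;
        hit[i] = valid[index] && tags[index] == tag;
        if (hit[i]) {
            hits++;
        } else {            /* miss: fetch the block, replacing any old one */
            valid[index] = true;
            tags[index] = tag;
        }
    }
    return hits;
}
```

Calling it with `index_bits = 3` and `offset_bits = 1` models the cache of 5.2.2; `index_bits = 4` and `offset_bits = 0` models the cache of 5.2.1.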
5.2.3 [20] <§§5.3, 5.4> You are asked to optimize a cache design for the given references. There are three direct-mapped cache designs possible, all with a total of 8 words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks. In terms of miss rate, which cache design is the best? If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design?
C1: block size 1
- The cache has $8 = 2^3$ one-word blocks, so the index field is $n = 3$ bits wide and the remaining 5 bits of the 8-bit word address form the tag.
- Each block holds $1 = 2^0$ word, so $m = 0$.
The binary conversion steps are omitted.
Word address | Binary address | Tag | Index | Hit/Miss |
---|---|---|---|---|
3 | 00000 011 | 0 | 3 | Miss |
180 | 10110 100 | 22 | 4 | Miss |
43 | 00101 011 | 5 | 3 | Miss |
2 | 00000 010 | 0 | 2 | Miss |
191 | 10111 111 | 23 | 7 | Miss |
88 | 01011 000 | 11 | 0 | Miss |
190 | 10111 110 | 23 | 6 | Miss |
14 | 00001 110 | 1 | 6 | Miss |
181 | 10110 101 | 22 | 5 | Miss |
44 | 00101 100 | 5 | 4 | Miss |
186 | 10111 010 | 23 | 2 | Miss |
253 | 11111 101 | 31 | 5 | Miss |
The miss rate is 100%.
Total time (miss stalls plus access time): $12 \times 25 + 12 \times 2 = 324$ cycles
C2: block size 2
Each block holds $2 = 2^1$ words, so $m = 1$; with 4 blocks the index is 2 bits wide.
Word address | Binary address | Tag | Index | Hit/Miss |
---|---|---|---|---|
3 | 00000 01 1 | 0 | 1 | Miss |
180 | 10110 10 0 | 22 | 2 | Miss |
43 | 00101 01 1 | 5 | 1 | Miss |
2 | 00000 01 0 | 0 | 1 | Miss |
191 | 10111 11 1 | 23 | 3 | Miss |
88 | 01011 00 0 | 11 | 0 | Miss |
190 | 10111 11 0 | 23 | 3 | Hit |
14 | 00001 11 0 | 1 | 3 | Miss |
181 | 10110 10 1 | 22 | 2 | Hit |
44 | 00101 10 0 | 5 | 2 | Miss |
186 | 10111 01 0 | 23 | 1 | Miss |
253 | 11111 10 1 | 31 | 2 | Miss |
Address 2 misses because 43 has already replaced the block loaded by 3 at index 1; 181 hits because the block loaded by 180 is still at index 2.
Miss rate: $10/12 = 83.33\%$
Total time (miss stalls plus access time): $10 \times 25 + 12 \times 3 = 286$ cycles
C3: block size 4
Each block holds $4 = 2^2$ words, so $m = 2$; with 2 blocks the index is 1 bit wide.
Word address | Binary address | Tag | Index | Hit/Miss |
---|---|---|---|---|
3 | 00000 0 11 | 0 | 0 | Miss |
180 | 10110 1 00 | 22 | 1 | Miss |
43 | 00101 0 11 | 5 | 0 | Miss |
2 | 00000 0 10 | 0 | 0 | Miss |
191 | 10111 1 11 | 23 | 1 | Miss |
88 | 01011 0 00 | 11 | 0 | Miss |
190 | 10111 1 10 | 23 | 1 | Hit |
14 | 00001 1 10 | 1 | 1 | Miss |
181 | 10110 1 01 | 22 | 1 | Miss |
44 | 00101 1 00 | 5 | 1 | Miss |
186 | 10111 0 10 | 23 | 0 | Miss |
253 | 11111 1 01 | 31 | 1 | Miss |
Miss rate: $11/12 = 91.67\%$
Total time (miss stalls plus access time): $11 \times 25 + 12 \times 5 = 335$ cycles
C2 has both the lowest miss rate and the lowest total time (286 cycles, against 324 for C1 and 335 for C3), so it is the best of the three designs.
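The three totals above all follow the same cost model, which can be written as one small function. This is a sketch of that model only; the name `total_cycles` and its parameter names are illustrative.

```c
#include <assert.h>

/* Total cycles for a reference stream under the model used above:
   every reference pays the access time, every miss adds the stall. */
unsigned total_cycles(unsigned refs, unsigned misses,
                      unsigned access_cycles, unsigned miss_stall_cycles) {
    assert(misses <= refs);
    return misses * miss_stall_cycles + refs * access_cycles;
}
```

For example, `total_cycles(12, 10, 3, 25)` reproduces the figure for the 2-word-block design.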
There are many different design parameters that are important to a cache’s overall performance. Below are listed parameters for different direct-mapped cache designs.
Cache Data Size: 32 KiB
Cache Block Size: 2 words
Cache Access Time: 1 cycle
5.2.4 [15] <§5.3> Calculate the total number of bits required for the cache listed above, assuming a 32-bit address. Given that total size, find the total size of the closest direct-mapped cache with 16-word blocks of equal size or greater. Explain why the second cache, despite its larger data size, might provide slower performance than the first cache.
Background
KiB is a unit of bytes, whereas the question asks for the cache size in bits.
The unit B denotes bytes; the unit b denotes bits.
So the formula for the number of bits is
$2^n \times [1 + (32 - n - m - 2) + (2^m \times 32)]$
Summary of the given information
- 1 word = 4 bytes = 32 bits (so byte counts must be divided by 4, the number of bytes per word)
- Cache data size: 32 KiB
- Each cache block holds two words
First compute the number of blocks in the cache:
$32\ \text{KiB} / 4 / 2 = 4096 = 2^{12}$
So the index is $n = 12$ bits wide.
The word offset takes 1 bit and the byte offset takes 2 bits (see the figure on p. 270 of the RISC-V edition).
The tag width is therefore
$32 - 12 - 1 - 2 = 17$
Adding one valid bit gives
$17 + 1 = 18$
so the tag and valid bits need
$18 \times 4096 = 73728\ \text{bits}$
which is 9216 bytes.
The total cache size in bytes is
$9216 + 32768 = 41984$
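The same arithmetic can be packaged as a function of the data size and block size. The sketch below assumes 32-bit byte addresses, 4-byte words, one valid bit per block, and power-of-two sizes; `log2_exact` and `total_cache_bits` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Integer log2 for exact powers of two. */
unsigned log2_exact(uint64_t x) {
    unsigned bits = 0;
    assert(x != 0 && (x & (x - 1)) == 0);
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* Total storage bits (data + tag + valid) of a direct-mapped cache with
   32-bit byte addresses and 4-byte words. */
uint64_t total_cache_bits(uint64_t data_bytes, uint64_t words_per_block) {
    uint64_t block_bytes = words_per_block * 4;
    uint64_t blocks = data_bytes / block_bytes;
    unsigned n = log2_exact(blocks);            /* index bits */
    unsigned m = log2_exact(words_per_block);   /* word-offset bits */
    unsigned tag_bits = 32 - n - m - 2;         /* 2 byte-offset bits */
    return blocks * (1 + tag_bits + words_per_block * 32);
}
```

For the 32 KiB cache with 2-word blocks, `total_cache_bits(32 * 1024, 2)` returns 335872 bits, which is the 41984 bytes found above.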
The total cache size is computed as follows:
$\text{total size} = \text{data size} + (\text{valid bit size} + \text{tag size}) \times \text{blocks}$
$\text{data size} = \text{blocks} \times \text{block size} \times \text{word size}$
- word size = 4
- $\text{tag size} = 32 - \log_2(\text{blocks}) - \log_2(\text{block size}) - \log_2(\text{word size})$
- valid bit size = 1
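Changing each block from 2 words to 16 words reduces the tag size from 17 to 14.
This gives the following inequality:
$41984 \le (64 + 15) \times \text{blocks}$
so $\text{blocks} \ge 531$, and the next power of two above 531 is 1024.
The above is translated from the official solution, but I think it has a problem:
15 is in bits, while 64 and 41984 are in bytes, so why can they be added directly? The 15 should first be divided by 8, that is,
$41984 \le (64 + 15/8) \times \text{blocks}$
A larger block size may increase both the hit time and the miss penalty, and having fewer blocks may raise the miss rate. The second cache may therefore be slower than the first.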
5.2.5 [20] <§§5.3, 5.4> Generate a series of read requests that have a lower miss rate on a 2 KiB 2-way set associative cache than the cache listed above. Identify one possible solution that would make the cache listed have an equal or lower miss rate than the 2 KiB cache. Discuss the advantages and disadvantages of such a solution.
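Associative caches are designed to reduce conflict misses. A sequence of read requests that share the same 12-bit index field but have different tag fields will therefore generate many misses in the direct-mapped cache.
For the cache listed above, the sequence 0, 32768, 0, 32768, ... misses on every access, whereas a 2-way set-associative cache with LRU replacement, even with a smaller total capacity, hits on every access after the first two.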
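The ping-pong behaviour of that sequence can be checked with two small models. This is a sketch under stated assumptions: the addresses are treated as byte addresses, and the 2 KiB cache is assumed to use the same 2-word (8-byte) blocks as the 32 KiB cache, which the exercise does not specify. The function and macro names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 8u      /* two 4-byte words per block (assumed for both) */
#define DM_BLOCKS   4096u   /* 32 KiB direct-mapped cache */
#define SA_SETS     128u    /* 2 KiB, 2-way: 256 blocks in 128 sets */

/* Count misses in the 32 KiB direct-mapped cache, starting empty. */
size_t misses_direct_mapped(const uint32_t *addrs, size_t n) {
    static bool valid[DM_BLOCKS];
    static uint32_t tags[DM_BLOCKS];
    size_t misses = 0;
    for (size_t i = 0; i < DM_BLOCKS; i++) valid[i] = false;
    for (size_t i = 0; i < n; i++) {
        uint32_t block = addrs[i] / BLOCK_BYTES;
        uint32_t index = block % DM_BLOCKS;
        uint32_t tag = block / DM_BLOCKS;
        if (!(valid[index] && tags[index] == tag)) {
            misses++;
            valid[index] = true;
            tags[index] = tag;
        }
    }
    return misses;
}

/* Count misses in the 2 KiB 2-way set-associative cache with LRU. */
size_t misses_two_way_lru(const uint32_t *addrs, size_t n) {
    static bool valid[SA_SETS][2];
    static uint32_t tags[SA_SETS][2];
    static unsigned lru[SA_SETS];   /* way to evict next in each set */
    size_t misses = 0;
    for (size_t s = 0; s < SA_SETS; s++) {
        valid[s][0] = valid[s][1] = false;
        lru[s] = 0;
    }
    for (size_t i = 0; i < n; i++) {
        uint32_t block = addrs[i] / BLOCK_BYTES;
        uint32_t set = block % SA_SETS;
        uint32_t tag = block / SA_SETS;
        unsigned way;
        if (valid[set][0] && tags[set][0] == tag) {
            way = 0;
        } else if (valid[set][1] && tags[set][1] == tag) {
            way = 1;
        } else {
            misses++;
            way = lru[set];         /* evict the least recently used way */
            valid[set][way] = true;
            tags[set][way] = tag;
        }
        lru[set] = 1u - way;        /* the other way is now the older one */
    }
    return misses;
}
```

Feeding both functions the alternating sequence 0, 32768, 0, 32768, ... shows the direct-mapped cache missing every time while the 2-way cache misses only on the first two accesses.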
5.2.6 [15] <§5.3> The formula shown in Section 5.3 shows the typical method to index a direct-mapped cache, specifically (Block address) modulo (Number of blocks in the cache). Assuming a 32-bit address and 1024 blocks in the cache, consider a different indexing function, specifically (Block address[31:27] XOR Block address[26:22]). Is it possible to use this to index a direct-mapped cache? If so, explain why and discuss any changes that might need to be made to the cache. If it is not possible, explain why.
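Yes, this function can be used to index a direct-mapped cache.
However, because the two five-bit fields are XORed together, information about those bits is lost, so more tag bits must be stored to identify the address held in the cache.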
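A minimal sketch of the proposed indexing function, assuming the block address is held in a 32-bit unsigned value; the name `xor_index` is illustrative. It shows distinct bit patterns collapsing onto the same index, which is why the extra tag bits are needed.

```c
#include <assert.h>
#include <stdint.h>

/* Index from the exercise: Block address[31:27] XOR Block address[26:22]. */
uint32_t xor_index(uint32_t block_addr) {
    uint32_t high = (block_addr >> 27) & 0x1Fu;   /* bits 31..27 */
    uint32_t low  = (block_addr >> 22) & 0x1Fu;   /* bits 26..22 */
    uint32_t index = high ^ low;
    assert(index < 32);
    return index;
}
```

For example, a block address with only bit 27 set and one with only bit 22 set both map to index 1. The result is 5 bits wide, so it can take at most 32 distinct values.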
5.3 For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache
Tag | index | offset |
---|---|---|
31-10 | 9-5 | 4-0 |
5.3.1 [5] <§5.3> What is the cache block size (in words)?
The offset is 5 bits wide and counts bytes; expressed in words, that leaves 3 bits:
$2^3 = 8$
So the block size is 8 words.
5.3.2 [5] <§5.3> How many entries does the cache have?
The index is 5 bits wide:
$2^5 = 32$
So the cache has 32 entries.
5.3.3 [5] <§5.3> What is the ratio between total bits required for such a cache implementation over the data storage bits?
$\text{total bits} = 2^n \times (\text{block size} + \text{tag size} + \text{valid field size})$
- The block size is $2^3 \times 32$ bits, i.e. $2^3$ words, so $m = 3$.
- The index is 5 bits wide, so $n = 5$.
- The valid field is assumed to be 1 bit.
Total cache bits:
$2^5 \times (2^3 \times 32 + (32 - 5 - 3 - 2) + 1) = 32 \times (8 \times 32 + 23) = 8928$
Data storage bits, with 32 bytes per block and 32 blocks in total:
$32 \times 32 \times 8$
The ratio of the two is
$8928 / (32 \times 8 \times 32) = 1.0898$
However, I then noticed that the question gives no information about a valid bit, so the calculation above is probably wrong. Without the valid bit the ratio is
$(32 \times 8 + 22) / (32 \times 8)$
Each block carries a 22-bit tag in addition to its data, and the sum is divided by the data it holds,
that is, $(\text{tag} + \text{data}) / \text{data}$
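Both versions of the ratio can be evaluated per cache entry, since the number of entries cancels out. A short sketch, with `total_to_data_ratio` as an illustrative name:

```c
#include <assert.h>

/* Ratio of total storage bits to data bits for one cache entry.
   valid_bits is 1 to count a valid bit per entry, 0 to leave it out. */
double total_to_data_ratio(unsigned data_bits, unsigned tag_bits,
                           unsigned valid_bits) {
    assert(data_bits > 0);
    return (double)(data_bits + tag_bits + valid_bits) / (double)data_bits;
}
```

With 256 data bits and a 22-bit tag, the ratio is 1.08984375 when a valid bit is counted and 1.0859375 when it is not.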
Starting from power on, the following byte-addressed cache references are recorded
0 4 16 132 232 160 1024 30 140 3100 180 2180
5.3.4 [10] <§5.3> How many blocks are replaced?
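The largest address is 3100, which is 110000011100 in binary.
$n = 5,\ m = 3$
The index is 5 bits wide and the word offset $m$ is 3 bits wide.
These are byte addresses, so the offset field is 2 bits wider than it would be for word addresses: $3 + 2 = 5$.
The address therefore splits as 11 00000 11100 (tag, index, offset).
This gives the following table: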
Byte address | Binary address | Tag | Index | Hit/Miss |
---|---|---|---|---|
0 | 00 00000 00000 | 0 | 0 | Miss |
4 | 00 00000 00100 | 0 | 0 | Hit |
16 | 00 00000 10000 | 0 | 0 | Hit |
132 | 00 00100 00100 | 0 | 4 | Miss |
232 | 00 00111 01000 | 0 | 7 | Miss |
160 | 00 00101 00000 | 0 | 5 | Miss |
1024 | 01 00000 00000 | 1 | 0 | Miss |
30 | 00 00000 11110 | 0 | 0 | Miss |
140 | 00 00100 01100 | 0 | 4 | Hit |
3100 | 11 00000 11100 | 3 | 0 | Miss |
180 | 00 00101 10100 | 0 | 5 | Hit |
2180 | 10 00100 00100 | 2 | 4 | Miss |
Address 180 hits because it falls in the same 32-byte block as 160 (tag 0, index 5), which is still in the cache.
Look at the index first; if the tag stored there does not match, the block is replaced,
so the following replacements occur:
- 1024 (1,0) replaces (0,0)
- 30 (0,0) replaces (1,0)
- 3100 (3,0) replaces (0,0)
- 2180 (2,4) replaces (0,4)
That is 4 replacements in total.
5.3.5 [10] <§5.3> What is the hit ratio?
$4/12 = 33.3\%$
5.3.6 [20] <§5.3> List the final state of the cache, with each valid entry represented as a record of <index, tag, data>.
index | tag | data |
---|---|---|
0 | 3 | mem[3100] |
4 | 2 | mem[2180] |
5 | 0 | mem[180] |
7 | 0 | mem[232] |
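The replacement count, hit count, and final tags can be cross-checked by replaying the byte addresses through a model of this cache. The sketch below assumes the field layout given at the start of 5.3 and an initially empty cache; `CacheRun`, `replay`, and `ENTRIES` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES 32u   /* 5 index bits */

typedef struct {
    size_t hits;
    size_t replacements;          /* misses that evicted a valid block */
    bool valid[ENTRIES];
    uint32_t tag[ENTRIES];
} CacheRun;

/* Replay byte addresses through the cache with the field layout
   tag = bits 31..10, index = bits 9..5, offset = bits 4..0. */
CacheRun replay(const uint32_t *addrs, size_t n) {
    CacheRun run = { 0 };
    for (size_t i = 0; i < n; i++) {
        uint32_t index = (addrs[i] >> 5) & 0x1Fu;
        uint32_t tag = addrs[i] >> 10;
        if (run.valid[index] && run.tag[index] == tag) {
            run.hits++;
            continue;
        }
        if (run.valid[index]) {
            run.replacements++;   /* a different block was resident */
        }
        run.valid[index] = true;
        run.tag[index] = tag;
    }
    return run;
}
```

After the run, the `valid` and `tag` arrays hold the final state of each entry, and `hits` divided by the number of references gives the hit ratio.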
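The official solution gives its own results for the last three questions,
but they puzzle me: why is the index 6 bits and the tag 4 bits? 1024 in binary contains only a single 1, so why do two 1s appear? I therefore answered these three questions according to my understanding of the textbook text.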