Computer Organization and Design, Homework 07 (Fifth Edition)

JamSlade

Last modified: 2022-08-26 20:58:40


First published: 2022-06-14 16:26:32

Article link: https://blog.csdn.net/JamSlade/article/details/125097121


5.1

In this exercise we look at memory locality properties of matrix computation. The following code is written in C, where elements within the same row are stored contiguously. Assume each word is a 32-bit integer.

for (I=0; I<8; I++)
    for (J=0; J<8000; J++)
        A[I][J]=B[I][0]+A[J][I];

5.1.1 [5] <§5.1> How many 32-bit integers can be stored in a 16-byte cache block?

One byte is 8 bits, so a 16-byte block holds
$16\times 8 / 32 = 4$
32-bit integers.

5.1.2 [5] <§5.1> References to which variables exhibit temporal locality?

The references to $I$, $J$, and $B[I][0]$ exhibit temporal locality (each is accessed again on later loop iterations).

5.1.3 [5] <§5.1> References to which variables exhibit spatial locality?

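$A[I][J]$ exhibits spatial locality (each inner-loop iteration accesses the next adjacent location in the same row).
$A[J][I]$ jumps a full row between consecutive accesses, so it is not considered to have spatial locality.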

Locality is affected by both the reference order and data layout. The same computation can also be written below in Matlab, which differs from C by storing matrix elements within the same column contiguously in memory.

for I=1:8
    for J=1:8000
        A(I,J)=B(I,0)+A(J,I);
    end
end

5.1.4 [10] <§5.1> How many 16-byte cache blocks are needed to store all 32-bit matrix elements being referenced?

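The $A(I,J)$ references cover $8\times 8000=64000$ 32-bit matrix elements.
By 5.1.1, one 16-byte cache block holds 4 of them,
so these need
$64000/4 = 16000$
blocks. The $A(J,I)$ references cover another $8000\times 8$ elements, again 16000 blocks, of which $8\times 8/4 = 16$ are shared with $A(I,J)$. The 8 elements $B(I,0)$ add $8/4 = 2$ blocks. Assuming every column starts on a block boundary, the total is
$16000 + 16000 - 16 + 2 = 31986$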

5.1.5 [5] <§5.1> References to which variables exhibit temporal locality?

The references to $I$, $J$, and $B(I,0)$ exhibit temporal locality (each is accessed again on later loop iterations).

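5.1.6 [5] <§5.1> References to which variables exhibit spatial locality?

$A(J,I)$ exhibits spatial locality: Matlab stores each column contiguously, so the inner loop over $J$ walks down column $I$ one adjacent element at a time. $A(I,J)$ now jumps a full column between consecutive accesses, so it is not considered to have spatial locality.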

5.2

Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 32-bit memory address references, given as word addresses.

3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253

5.2.1 [10] <§5.3> For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.

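The cache is initially empty.

The cache has $16 = 2^4$ blocks, so the index field is $n = 4$ bits wide
and the index is a 4-bit binary number.
The block size is $1 = 2^0$ word, so $m = 0$
and the remaining 4 bits of the 8-bit word addresses shown below are all used for the tag.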

Word address	Binary address	Tag	Index	Hit/Miss
3	0000 0011	$0000_{(2)}$ = 0	$0011_{(2)}$ = 3	Miss
180	1011 0100	$1011_{(2)}$ = 11	$0100_{(2)}$ = 4	Miss
43	0010 1011	$0010_{(2)}$ = 2	$1011_{(2)}$ = 11	Miss
2	0000 0010	$0000_{(2)}$ = 0	$0010_{(2)}$ = 2	Miss
191	1011 1111	$1011_{(2)}$ = 11	$1111_{(2)}$ = 15	Miss
88	0101 1000	$0101_{(2)}$ = 5	$1000_{(2)}$ = 8	Miss
190	1011 1110	$1011_{(2)}$ = 11	$1110_{(2)}$ = 14	Miss
14	0000 1110	$0000_{(2)}$ = 0	$1110_{(2)}$ = 14	Miss
181	1011 0101	$1011_{(2)}$ = 11	$0101_{(2)}$ = 5	Miss
44	0010 1100	$0010_{(2)}$ = 2	$1100_{(2)}$ = 12	Miss
186	1011 1010	$1011_{(2)}$ = 11	$1010_{(2)}$ = 10	Miss
253	1111 1101	$1111_{(2)}$ = 15	$1101_{(2)}$ = 13	Miss

5.2.2 [10] <§5.3> For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with two-word blocks and a total size of 8 blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.

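The cache has $8 = 2^3$ blocks, so the index field is $n = 3$ bits wide
and the index is a 3-bit binary number.
The block size is $2 = 2^1$ words, so $m = 1$
and the remaining 4 bits of the 8-bit word addresses are used for the tag.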

Word address	Binary address	Tag	Index	Hit/Miss
3	0000 0011	$0000_{(2)}$ = 0	$001_{(2)}$ = 1	Miss
180	1011 0100	$1011_{(2)}$ = 11	$010_{(2)}$ = 2	Miss
43	0010 1011	$0010_{(2)}$ = 2	$101_{(2)}$ = 5	Miss
2	0000 0010	$0000_{(2)}$ = 0	$001_{(2)}$ = 1	Hit (block loaded by row 1)
191	1011 1111	$1011_{(2)}$ = 11	$111_{(2)}$ = 7	Miss
88	0101 1000	$0101_{(2)}$ = 5	$100_{(2)}$ = 4	Miss
190	1011 1110	$1011_{(2)}$ = 11	$111_{(2)}$ = 7	Hit (block loaded by row 5)
14	0000 1110	$0000_{(2)}$ = 0	$111_{(2)}$ = 7	Miss
181	1011 0101	$1011_{(2)}$ = 11	$010_{(2)}$ = 2	Hit (block loaded by row 2)
44	0010 1100	$0010_{(2)}$ = 2	$110_{(2)}$ = 6	Miss
186	1011 1010	$1011_{(2)}$ = 11	$101_{(2)}$ = 5	Miss
253	1111 1101	$1111_{(2)}$ = 15	$110_{(2)}$ = 6	Miss

5.2.3 [20] <§§5.3, 5.4> You are asked to optimize a cache design for the given references. There are three direct-mapped cache designs possible, all with a total of 8 words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks. In terms of miss rate, which cache design is the best? If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design?

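C1: block size 1 word

The cache holds 8 words, so C1 has $8 = 2^3$ blocks and the index field is $n = 3$ bits wide; the remaining 5 bits of the 8-bit address are the tag.
The block size is $1 = 2^0$ word, so $m = 0$

The binary conversion steps are omitted.

Word address	Binary address	Tag	Index	Hit/Miss
3	00000 011	0	3	Miss
180	10110 100	22	4	Miss
43	00101 011	5	3	Miss
2	00000 010	0	2	Miss
191	10111 111	23	7	Miss
88	01011 000	11	0	Miss
190	10111 110	23	6	Miss
14	00001 110	1	6	Miss
181	10110 101	22	5	Miss
44	00101 100	5	4	Miss
186	10111 010	23	2	Miss
253	11111 101	31	5	Miss

The miss rate is 100%, and the total time is
$12\times 25+ 12\times 2 = 324$ cycles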

C2: block size 2 words

The block size is $2 = 2^1$ words, so $m = 1$, and there are 4 blocks, so the index is 2 bits wide.

Word address	Binary address	Tag	Index	Hit/Miss
3	00000 01 1	0	1	Miss
180	10110 10 0	22	2	Miss
43	00101 01 1	5	1	Miss
2	00000 01 0	0	1	Miss
191	10111 11 1	23	3	Miss
88	01011 00 0	11	0	Miss
190	10111 11 0	23	3	Hit
14	00001 11 0	1	3	Miss
181	10110 10 1	22	2	Hit
44	00101 10 0	5	2	Miss
186	10111 01 0	23	1	Miss
253	11111 10 1	31	2	Miss

The miss rate is $10/12 = 83.33\%$, and the total time is
$10\times 25+ 12\times 3 = 286$ cycles

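C3: block size 4 words
The block size is $4 = 2^2$ words, so $m = 2$, and there are 2 blocks, so the index is 1 bit wide.

Word address	Binary address	Tag	Index	Hit/Miss
3	00000 0 11	0	0	Miss
180	10110 1 00	22	1	Miss
43	00101 0 11	5	0	Miss
2	00000 0 10	0	0	Miss
191	10111 1 11	23	1	Miss
88	01011 0 00	11	0	Miss
190	10111 1 10	23	1	Hit
14	00001 1 10	1	1	Miss
181	10110 1 01	22	1	Miss
44	00101 1 00	5	1	Miss
186	10111 0 10	23	0	Miss
253	11111 1 01	31	1	Miss
The miss rate is $11/12 = 91.67\%$, and the total time is
$11\times 25+ 12\times 5 = 335$ cycles

C2 has both the lowest miss rate and the lowest total time, so it is the best design.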

There are many different design parameters that are important to a cache’s overall performance. Below are listed parameters for different direct-mapped cache designs.

Cache Data Size: 32 KiB
Cache Block Size: 2 words
Cache Access Time: 1 cycle

5.2.4 [15] <§5.3> Calculate the total number of bits required for the cache listed above, assuming a 32-bit address. Given that total size, find the total size of the closest direct-mapped cache with 16-word blocks of equal size or greater. Explain why the second cache, despite its larger data size, might provide slower performance than the first cache.

Background
KiB is a unit of bytes, whereas the question below asks for the cache size in bits.

The unit B means bytes; only the lowercase b means bits.

So the formula for the number of bits is
$2^n \times [ 1 + (32 - n - m - 2) + (2^m\times 32) ]$

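Summary of the given information

1 word = 4 bytes = 32 bits (so a size in bytes is divided by 4 to get a size in words)
Cache data size: 32 KiB
Each cache block holds two words

First compute the number of blocks in the cache
$32\ \text{KiB}/4/2=32768/4/2=4096=2^{12}$

so the index is $n=12$ bits wide.

The word offset takes 1 bit and the byte offset takes 2 bits (see the figure on page 270 of the RISC-V edition)

so the number of tag bits is
$32 - 12 - 1 - 2 = 17$
plus one valid bit
$17 + 1 = 18$

The overhead is therefore
$18\times 4096 = 73728$ bits
that is, 9216 bytes.

The total cache size is
$9216 + 32768 = 41984$ bytes, or $41984\times 8 = 335872$ bits.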

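The total cache size is computed as follows
$\text{total size} = \text{data size} + (\text{valid bit size} + \text{tag size})\times \text{blocks}$
$\text{data size} = \text{blocks}\times \text{block size}\times \text{word size}$

word size = 4
$\text{tag size} = 32 - \log_2(\text{blocks}) - \log_2(\text{block size}) - \log_2(\text{word size})$
valid bit size = 1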

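Going from 2 words per block to 16 words per block shrinks the tag from 17 bits to 14 bits for the same number of blocks (15 bits with the valid bit),
which gives the following inequality
$41984\le (64+15)\times \text{blocks}$

$\text{blocks}\ge 531.4$, and the next power of two above that is 1024.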

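The above is translated from the answer key, but I think it has a problem.
The 15 is in bits, while the 64 and the 41984 are in bytes, so why can they be added directly? The 15 should be divided by 8 first, that is

$41984\le (64+15/8)\times \text{blocks}$

This gives $\text{blocks}\ge 637.3$, and the next power of two is still 1024.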

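A larger block size may require a longer hit time and a larger miss penalty, and a smaller number of blocks may cause a higher miss rate. So the second cache may be slower to access than the first.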

5.2.5 [20] <§§5.3, 5.4> Generate a series of read requests that have a lower miss rate on a 2 KiB 2-way set associative cache than the cache listed above. Identify one possible solution that would make the cache listed have an equal or lower miss rate than the 2 KiB cache. Discuss the advantages and disadvantages of such a solution.

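Associative caches are designed to reduce conflict misses. A sequence of read requests with the same 12-bit index field but different tag fields therefore produces many misses in the direct-mapped cache.

For the cache listed above, the sequence 0, 32768, 0, 32768, ... misses on every access, whereas a 2-way set-associative cache with LRU replacement, even one with a much smaller total capacity, hits on every access after the first two.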

5.2.6 [15] <§5.3> The formula shown in Section 5.3 shows the typical method to index a direct-mapped cache, specifically (Block address) modulo (Number of blocks in the cache). Assuming a 32-bit address and 1024 blocks in the cache, consider a different indexing function, specifically (Block address[31:27] XOR Block address[26:22]). Is it possible to use this to index a direct-mapped cache? If so, explain why and discuss any changes that might need to be made to the cache. If it is not possible, explain why.

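Yes, this function can be used to index a direct-mapped cache.
However, because these five-bit fields of the block address are XORed together, the information in them is lost, so more tag bits must be stored to identify the address in the cache.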

5.3 For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache

Tag	index	offset
31-10	9-5	4-0

5.3.1 [5] <§5.3> What is the cache block size (in words)?

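The offset is 5 bits, so a block is $2^5 = 32$ bytes; at 4 bytes per word this leaves 3 bits of word offset
$2^3=8$
so the block size is 8 words.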

5.3.2 [5] <§5.3> How many entries does the cache have?

The index field is 5 bits wide
$2^5 = 32$
so the cache has 32 entries.

5.3.3 [5] <§5.3> What is the ratio between total bits required for such a cache implementation over the data storage bits?

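$2^n\times(\text{block size} + \text{tag size} + \text{valid field size})$

The block size is $2^3\times 32$ bits, that is, $2^3$ words, so $m=3$
The index is 5 bits wide, so $n = 5$
The valid field is assumed to be 1 bit

Total bits in the cache

$2^5\times(2^3\times 32 + (32 - 5 - 3 - 2) + 1) = 32 \times(8\times 32 + 23) = 8928$

Data storage bits: one block is 32 bytes and there are 32 blocks
$32\times32\times 8 = 8192$

The ratio of the two is
$8928/(32\times8\times32) = 1.0898$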

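However, I then noticed that the problem gives no information about a valid bit, so the calculation above is probably wrong. Leaving the valid bit out:

$(32 \times 8 + 22) / (32 \times 8) = 278/256 \approx 1.086$
Each block carries a 22-bit tag in addition to its data, and the sum is divided by the data it carries

that is, $(\text{tag} + \text{data}) / \text{data}$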

Starting from power on, the following byte-addressed cache references are recorded:

0 4 16 132 232 160 1024 30 140 3100 180 2180

5.3.4 [10] <§5.3> How many blocks are replaced?

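The largest address among the first ten references is 3100, which is 110000011100 in binary.

$n = 5, m = 3$
The index is 5 bits wide and the word offset $m$ is 3 bits wide.

These are real (byte) addresses, so the last field is 2 bits wider than it would be for word addresses, that is, $3+2=5$ bits,
so 3100 splits as 11 00000 11100

This gives the following table

Byte address	Binary address	Tag	Index	Hit/Miss
0	00 00000 00000	0	0	Miss
4	00 00000 00100	0	0	Hit
16	00 00000 10000	0	0	Hit
132	00 00100 00100	0	4	Miss
232	00 00111 01000	0	7	Miss
160	00 00101 00000	0	5	Miss
1024	01 00000 00000	1	0	Miss
30	00 00000 11110	0	0	Miss
140	00 00100 01100	0	4	Hit
3100	11 00000 11100	3	0	Miss
180	00 00101 10100	0	5	Hit
2180	10 00100 00100	2	4	Miss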