Preface
In an earlier post I analyzed the Vortex GPGPU architecture: Hardware design and code structure analysis of Vortex GPGPU.
I have also covered part of the cache-related code in Vortex GPGPU:
1. Hardware code analysis of Vortex GPGPU (Cache Part 1)
2. Hardware code analysis of Vortex GPGPU (Cache Part 2)
3. Hardware code analysis of Vortex GPGPU (Cache Part 3)
In Cache Part 3 we analyzed cache accesses in detail: starting from receiving an access request to the cache, through accessing the individual banks inside the cache, and finally gathering the response signals. When multiple access requests arrive in the same cycle, the author uses an arbiter to pick one of them by priority. Digging into the arbiter code, the author has actually implemented quite a few arbiter variants; these modules can be put to good use later.
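The Vortex repo provides several arbiter modules (fixed-priority, round-robin, etc.). As a behavioral sketch of the idea only, not the author's RTL, here is a minimal round-robin arbiter in Python that grants exactly one of the requests asserted in a cycle; the class name and interface are my own invention:

```python
class RoundRobinArbiter:
    """Behavioral model of a round-robin arbiter: each cycle it grants
    exactly one asserted request, starting the search one past the
    previous winner so every requester is eventually served."""

    def __init__(self, num_reqs):
        self.num_reqs = num_reqs
        self.last_grant = num_reqs - 1  # first search starts at index 0

    def grant(self, req_mask):
        """req_mask: list of bools, one per requester. Returns the
        granted index, or None if no request is asserted this cycle."""
        for offset in range(1, self.num_reqs + 1):
            idx = (self.last_grant + offset) % self.num_reqs
            if req_mask[idx]:
                self.last_grant = idx
                return idx
        return None

arb = RoundRobinArbiter(4)
print(arb.grant([True, False, True, False]))  # 0
print(arb.grant([True, False, True, False]))  # 2
print(arb.grant([True, False, True, False]))  # 0
```

Fixed-priority arbitration is the same loop without updating `last_grant`; round-robin trades one extra register for fairness across requesters.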
The flow above is shown in the figure below. Part 3 covered the stages from AddrX to ARB, plus the core response merge stage; this post digs into the tag access and data access stages of the serialized cache pipeline.
This post continues the analysis of the VX_cache.sv code.
1. Bank access
// Banks access
for (genvar i = 0; i < NUM_BANKS; ++i) begin
    wire [`CS_LINE_ADDR_WIDTH-1:0] curr_bank_mem_req_addr;
    wire curr_bank_mem_rsp_valid;

    if (NUM_BANKS == 1) begin
        assign curr_bank_mem_rsp_valid = mem_rsp_valid_s;
    end else begin
        assign curr_bank_mem_rsp_valid = mem_rsp_valid_s && (`CS_MEM_TAG_TO_BANK_ID(mem_rsp_tag_s) == i);
    end

    `RESET_RELAY (bank_reset, reset);

    VX_cache_bank #(
        .BANK_ID      (i),
        .INSTANCE_ID  (INSTANCE_ID),
        .CACHE_SIZE   (CACHE_SIZE),
        .LINE_SIZE    (LINE_SIZE),
        .NUM_BANKS    (NUM_BANKS),
        .NUM_WAYS     (NUM_WAYS),
        .WORD_SIZE    (WORD_SIZE),
        .NUM_REQS     (NUM_REQS),
        .CRSQ_SIZE    (CRSQ_SIZE),
        .MSHR_SIZE    (MSHR_SIZE),
        .MREQ_SIZE    (MREQ_SIZE),
        .WRITE_ENABLE (WRITE_ENABLE),
        .UUID_WIDTH   (UUID_WIDTH),
        .TAG_WIDTH    (TAG_WIDTH),
        .CORE_OUT_BUF (CORE_REQ_BUF_ENABLE ? 0 : CORE_OUT_BUF),
        .MEM_OUT_BUF  (MEM_REQ_BUF_ENABLE ? 0 : MEM_OUT_BUF)
    ) bank (
        .clk             (clk),
        .reset           (bank_reset),
        // Core request
        .core_req_valid  (per_bank_core_req_valid[i]),
        .core_req_addr   (per_bank_core_req_addr[i]),
        .core_req_rw     (per_bank_core_req_rw[i]),
        .core_req_wsel   (per_bank_core_req_wsel[i]),
        .core_req_byteen (per_bank_core_req_byteen[i]),
        .core_req_data   (per_bank_core_req_data[i]),
        .core_req_tag    (per_bank_core_req_tag[i]),
        .core_req_idx    (per_bank_core_req_idx[i]),
        .core_req_ready  (per_bank_core_req_ready[i]),
        // Core response
        .core_rsp_valid  (per_bank_core_rsp_valid[i]),
        .core_rsp_data   (per_bank_core_rsp_data[i]),
        .core_rsp_tag    (per_bank_core_rsp_tag[i]),
        .core_rsp_idx    (per_bank_core_rsp_idx[i]),
        .core_rsp_ready  (per_bank_core_rsp_ready[i]),
        // Memory request
        .mem_req_valid   (per_bank_mem_req_valid[i]),
        .mem_req_addr    (curr_bank_mem_req_addr),
        .mem_req_rw      (per_bank_mem_req_rw[i]),
        .mem_req_wsel    (per_bank_mem_req_wsel[i]),
        .mem_req_byteen  (per_bank_mem_req_byteen[i]),
        .mem_req_data    (per_bank_mem_req_data[i]),
        .mem_req_id      (per_bank_mem_req_id[i]),
        .mem_req_ready   (per_bank_mem_req_ready[i]),
        // Memory response
        .mem_rsp_valid   (curr_bank_mem_rsp_valid),
        .mem_rsp_data    (mem_rsp_data_s),
        .mem_rsp_id      (`CS_MEM_TAG_TO_REQ_ID(mem_rsp_tag_s)),
        .mem_rsp_ready   (per_bank_mem_rsp_ready[i]),
        // initialization
        .init_enable     (init_enable),
        .init_line_sel   (init_line_sel)
    );

    if (NUM_BANKS == 1) begin
        assign per_bank_mem_req_addr[i] = curr_bank_mem_req_addr;
    end else begin
        assign per_bank_mem_req_addr[i] = `CS_LINE_TO_MEM_ADDR(curr_bank_mem_req_addr, i);
    end
end
The core here is the instantiated VX_cache_bank module. Even though the exact meaning of some of the assign statements is not yet fully clear, a rough pass over the connections between modules still helps in mapping out the design!
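The `NUM_BANKS == 1` special case and the `CS_LINE_TO_MEM_ADDR` macro deal with splitting off and re-attaching the bank index relative to the line address. As an illustrative model only (the real bit slicing is defined by the `CS_*` macros in the Vortex headers, which are not reproduced here), assuming the bank index occupies the low bits of the memory line address:

```python
def split_line_addr(mem_addr, num_banks):
    """Split a memory line address into (bank_id, in-bank line address),
    assuming the bank index sits in the low bits. Illustrative model of
    the role the CS_* macros play, not their actual definitions."""
    bank_bits = (num_banks - 1).bit_length()
    bank_id = mem_addr & ((1 << bank_bits) - 1)
    line_addr = mem_addr >> bank_bits
    return bank_id, line_addr

def line_to_mem_addr(line_addr, bank_id, num_banks):
    """Inverse mapping, analogous in spirit to CS_LINE_TO_MEM_ADDR:
    re-attach the bank index to rebuild the full memory line address."""
    bank_bits = (num_banks - 1).bit_length()
    return (line_addr << bank_bits) | bank_id

bank_id, line = split_line_addr(0b101101, 4)
print(bank_id, bin(line))  # 1 0b1011
```

With a single bank the bank field is zero bits wide, which is exactly why the RTL special-cases `NUM_BANKS == 1` and passes the line address through unchanged.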
Going back to the author's vortex/docs/cache_subsystem.md, the cache subsystem has the following features:
1. High-bandwidth with bank parallelism
2. Snoop protocol to flush data for CPU access (i.e., cache coherence is handled via snooping)
3. Generic design: Dcache, Icache, Shared Memory, L2 cache, L3 cache (i.e., a configurable cache hierarchy)
   1. Cache can be configured to be any level in the hierarchy
   2. Caches communicate via snooping
   3. Cache flush from AFU is passed down the hierarchy
2. Features of VX_cache and VX_bank as summarized by the author: a guide to the VX_bank design details
2.1 VX_cache features
VX_cache.v is the top module of the cache Verilog code, located in the /hw/rtl/cache directory.
1. Configurable (Cache size, number of banks, bank line size, etc.)
2. I/O signals
- Core Request
- Core Rsp
- DRAM Req
- DRAM Rsp
- Snoop Req
- Snoop Rsp
- Snoop Forwarding Out
- Snoop Forwarding In
3. Bank Select : Assigns valid and ready signals for each bank
4. Snoop Forwarder
5. DRAM Request Arbiter : Prepares cache response for communication with DRAM
6. Snoop Response Arbiter : Sends snoop response
7. Core Response Merge : Cache accesses one line at a time. As a result, each request may not come back in the same response. This module tries to recombine the responses by thread ID.
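Point 7, the core response merge, can be sketched behaviorally. Assuming each bank response carries the original request tag, the responding thread index, and a data word (these field names are my assumption, not the RTL's), recombining scattered per-bank responses into per-request responses looks like:

```python
def merge_core_responses(bank_rsps, num_threads):
    """bank_rsps: list of (tag, thread_idx, word) responses arriving
    from different banks, possibly over several cycles. Responses that
    share a tag belong to the same original core request; merge them
    into one response with a per-thread valid mask. Illustrative model
    only, not the Vortex RTL."""
    merged = {}
    for tag, tid, word in bank_rsps:
        valid, data = merged.setdefault(
            tag, ([False] * num_threads, [None] * num_threads))
        valid[tid] = True   # mark this thread's word as present
        data[tid] = word
    return merged

rsps = [('t0', 0, 0xA), ('t0', 2, 0xB), ('t1', 1, 0xC)]
print(merge_core_responses(rsps, 4))
```

The real hardware cannot buffer indefinitely, so it emits a response as soon as some subset of threads is ready; the point of the model is only that recombination is keyed by the request identity.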
It seems that so far we have not seen any signals related to the snooping protocol!
2.2 VX_bank features
VX_bank.v is the Verilog code that handles cache bank functionality, located in the /hw/rtl/cache directory.
The parts of a bank are as follows:
1. Allows for high throughput
2. Each bank contains queues to hold requests to the cache
3. I/O signals
- Core request
- Core Response
- DRAM Fill Requests
- DRAM Fill Response
- DRAM WB Requests
- Snp Request
- Snp Response
4. Request Priority: DRAM fill, miss reserve, core request, snoop request
5. Snoop Request Queue
6. DRAM Fill Queue
7. Core Req Arbiter : Requests to be processed by the bank
8. Tag Data Store :
- Registers for valid, dirty, dirtyb, tag, and data
- Length of registers determined by lines in the bank
9. Tag Data Access:
- I/O: stall, snoop info, force request miss
- Writes to cache or sends read response; hit or miss determined here
- A missed request goes to the miss reserve if it is not a snoop request or DRAM fill
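The hit/miss determination in point 9 boils down to a per-way tag compare gated by the valid bit, using exactly the registers listed under point 8. A minimal behavioral sketch (the data layout and names are my own, not the Vortex RTL's):

```python
def tag_data_access(store, line_idx, req_tag, num_ways):
    """Look up one cache set: compare the request tag against every
    way's stored tag; a hit requires the valid bit AND a tag match.
    `store[way][line_idx]` is a dict with 'valid', 'tag', 'data' fields,
    mirroring the register files listed above (illustrative model)."""
    for way in range(num_ways):
        entry = store[way][line_idx]
        if entry['valid'] and entry['tag'] == req_tag:
            return ('hit', way, entry['data'])
    # On a miss, a non-snoop / non-fill request would go to the
    # miss reserve (MSHR) while the fill is requested from memory.
    return ('miss', None, None)
```

In the RTL all way comparisons happen in parallel in one cycle; the Python loop is only a sequential rendering of the same logic.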
Now this is useful: it helps in understanding the microarchitectural design details. For the snoop protocol mentioned here, I recommend going back to the thread-level parallelism chapters of Computer Architecture: A Quantitative Approach. A cache serves three kinds of clients: the local core, the next-level memory, and remote cores. Remote cores rely on MSI (or its extensions MESI and MOESI) to solve the cache coherence problem, so it is no surprise that the I/O section includes Core, DRAM, and Snp signals.
2.3 A point of confusion
As for the several registers listed under Tag Data Store, let's first review which field registers a cache ought to have:
- Since a cache usually adopts a write-back policy, it needs a dirty bit. To support write-back, one of several dirty cache lines may have to be chosen for write-back to the next memory level, which also requires a replacement field and a replacement policy.
- Since a cache suffers compulsory misses on a cold start, the read misses or write misses they cause will, as the program runs, gradually fetch the corresponding blocks from the next memory level into cache lines. To mark which cache lines hold data and which do not, a valid bit records each line's validity. Indexing a (set-associative) cache requires a tag and data, so those two fields naturally exist.
- In a thread-level-parallel environment, the private caches of the different cores hold some data in the shared state, and since the program keeps running and its reads and writes of that data are serialized, at any given moment some cache lines are modified. In common design scenarios a write-invalidate protocol has advantages over a write-update protocol: for data that may never be used again on the next cycle, write-update only adds redundant operations and the corresponding bandwidth and power cost. A write-invalidate protocol therefore requires the cache to support shared, modified, and invalid state bits.
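The write-invalidate behavior described above can be made concrete with a toy MSI model. The state names and functions here are my own rendering of the textbook protocol, not Vortex code:

```python
# Minimal MSI write-invalidate sketch: each core keeps one of three
# states for a given line; a write by one core invalidates all others.
MODIFIED, SHARED, INVALID = 'M', 'S', 'I'

def local_write(states, core):
    """Core `core` writes the line: its copy becomes Modified; every
    other core's copy is invalidated (write-invalidate protocol)."""
    for c in states:
        states[c] = MODIFIED if c == core else INVALID
    return states

def local_read(states, core):
    """Core `core` reads the line: a Modified owner downgrades to
    Shared (supplying the data), and the reader's copy becomes Shared."""
    for c, s in states.items():
        if s == MODIFIED:
            states[c] = SHARED
    states[core] = SHARED
    return states

states = {'core0': INVALID, 'core1': INVALID}
local_write(states, 'core0')   # core0: M, core1: I
local_read(states, 'core1')    # core0: S, core1: S
```

This is exactly why, per the bullet above, one would expect per-line state bits beyond plain valid/dirty once coherence is in play.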
Jumping back: putting the analysis above together, look at point 8 again:
8. Tag Data Store :
- Registers for valid, dirty, dirtyb, tag, and data
- Length of registers determined by lines in the bank
I am still puzzled: the state bits needed to support MSI appear to be missing.
=======================================
If anyone in the know can explain why they are not needed, please point it out!!!
=======================================
3. Some background knowledge for debugging and testing
Honestly, after finishing the first three posts I realized there were signals I could no longer follow. Browsing GitHub, I found this in vortex/docs/codebase.md:
1. unit_tests: unit tests for some hardware components
2. driver: host drivers repository
- include: Vortex driver public headers
- stub: Vortex stub driver library
- fpga: software driver that uses Intel OPAE FPGA
- asesim: software driver that uses Intel ASE simulator
- vlsim: software driver that uses vlsim simulator
- rtlsim: software driver that uses rtlsim simulator
- simx: software driver that uses simX simulator
3. sim:
- vlsim: AFU RTL simulator
- rtlsim: processor RTL simulator
- simX: cycle approximate simulator for vortex
The author clearly states what each of these paths is for! My plan is to first sort out the relationships between the modules and their rough roles, and only afterwards work out the underlying principles in detail.
Summary
This installment is admittedly a light one. I was getting a bit lost, so I wanted to step back and see what other material the author provides; let's leave it there for now. Some material is already organized, but not enough yet for a full post!
Lately I've been busy with autumn recruiting. Job descriptions call for all kinds of knowledge, and after a few interviews I felt I still lacked some GPU architecture background, so I've been cramming like mad, and grinding coding problems on Nowcoder on the side!
As always: if you spot mistakes, please point them out bluntly. I take criticism gladly!