Hardware Code Analysis of Vortex GPGPU (Cache, Part 4)


Preface

An earlier post covered the architecture of Vortex GPGPU: "Hardware Design and Code Structure Analysis of Vortex GPGPU".

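Earlier posts also covered part of the cache design code in Vortex GPGPU:
1. Hardware Code Analysis of Vortex GPGPU (Cache, Part 1)
2. Hardware Code Analysis of Vortex GPGPU (Cache, Part 2)
3. Hardware Code Analysis of Vortex GPGPU (Cache, Part 3)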

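Part 3 walked through a cache access in detail: receiving the request, accessing the individual banks inside the cache, and then gathering the response signals. When several requests arrive in the same cycle, the author uses an arbiter to pick one of them first. Digging into the arbiter code shows that the author has implemented quite a few arbiters, and that module is worth reusing later.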

The figure below shows this flow. Part 3 covered the stages from AddrX to ARB and the core response merge stage; this part looks more closely at tag access and data access in the serial cache.
[Figure: cache access flow]

This post continues with the VX_cache.sv code.


1. Bank access?

    // Banks access
    for (genvar i = 0; i < NUM_BANKS; ++i) begin
        wire [`CS_LINE_ADDR_WIDTH-1:0] curr_bank_mem_req_addr;
        wire curr_bank_mem_rsp_valid;

        if (NUM_BANKS == 1) begin
            assign curr_bank_mem_rsp_valid = mem_rsp_valid_s;
        end else begin
            assign curr_bank_mem_rsp_valid = mem_rsp_valid_s && (`CS_MEM_TAG_TO_BANK_ID(mem_rsp_tag_s) == i);
        end

        `RESET_RELAY (bank_reset, reset);
        
        VX_cache_bank #(                
            .BANK_ID      (i),
            .INSTANCE_ID  (INSTANCE_ID),
            .CACHE_SIZE   (CACHE_SIZE),
            .LINE_SIZE    (LINE_SIZE),
            .NUM_BANKS    (NUM_BANKS),
            .NUM_WAYS     (NUM_WAYS),
            .WORD_SIZE    (WORD_SIZE),
            .NUM_REQS     (NUM_REQS),
            .CRSQ_SIZE    (CRSQ_SIZE),
            .MSHR_SIZE    (MSHR_SIZE),
            .MREQ_SIZE    (MREQ_SIZE),
            .WRITE_ENABLE (WRITE_ENABLE),
            .UUID_WIDTH   (UUID_WIDTH),
            .TAG_WIDTH    (TAG_WIDTH),
            .CORE_OUT_BUF (CORE_REQ_BUF_ENABLE ? 0 : CORE_OUT_BUF),
            .MEM_OUT_BUF  (MEM_REQ_BUF_ENABLE ? 0 : MEM_OUT_BUF)
        ) bank (          
            .clk                (clk),
            .reset              (bank_reset),
                    
            // Core request
            .core_req_valid     (per_bank_core_req_valid[i]),
            .core_req_addr      (per_bank_core_req_addr[i]),
            .core_req_rw        (per_bank_core_req_rw[i]),
            .core_req_wsel      (per_bank_core_req_wsel[i]),
            .core_req_byteen    (per_bank_core_req_byteen[i]),
            .core_req_data      (per_bank_core_req_data[i]),
            .core_req_tag       (per_bank_core_req_tag[i]),
            .core_req_idx       (per_bank_core_req_idx[i]),
            .core_req_ready     (per_bank_core_req_ready[i]),

            // Core response                
            .core_rsp_valid     (per_bank_core_rsp_valid[i]),
            .core_rsp_data      (per_bank_core_rsp_data[i]),
            .core_rsp_tag       (per_bank_core_rsp_tag[i]),
            .core_rsp_idx       (per_bank_core_rsp_idx[i]),
            .core_rsp_ready     (per_bank_core_rsp_ready[i]),

            // Memory request
            .mem_req_valid      (per_bank_mem_req_valid[i]),
            .mem_req_addr       (curr_bank_mem_req_addr),
            .mem_req_rw         (per_bank_mem_req_rw[i]),
            .mem_req_wsel       (per_bank_mem_req_wsel[i]),
            .mem_req_byteen     (per_bank_mem_req_byteen[i]),
            .mem_req_data       (per_bank_mem_req_data[i]),
            .mem_req_id         (per_bank_mem_req_id[i]),
            .mem_req_ready      (per_bank_mem_req_ready[i]),

            // Memory response
            .mem_rsp_valid      (curr_bank_mem_rsp_valid),
            .mem_rsp_data       (mem_rsp_data_s),
            .mem_rsp_id         (`CS_MEM_TAG_TO_REQ_ID(mem_rsp_tag_s)),
            .mem_rsp_ready      (per_bank_mem_rsp_ready[i]),

            // initialization    
            .init_enable        (init_enable),
            .init_line_sel      (init_line_sel)
        );

        if (NUM_BANKS == 1) begin
            assign per_bank_mem_req_addr[i] = curr_bank_mem_req_addr;
        end else begin
            assign per_bank_mem_req_addr[i] = `CS_LINE_TO_MEM_ADDR(curr_bank_mem_req_addr, i);
        end
    end   

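The core piece is the instantiated VX_cache_bank module. The exact meaning of some of the assign statements is not yet clear to me, but even a rough look at how the modules are wired together helps in sorting out the design.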
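The assign statements around the bank instance can be read as plain bit packing. The Python sketch below models them under one loud assumption: that the bank ID sits in the low bits of both the memory tag and the memory address. The helper names and that bit layout are hypothetical stand-ins for the `CS_MEM_TAG_TO_BANK_ID`, `CS_MEM_TAG_TO_REQ_ID` and `CS_LINE_TO_MEM_ADDR` macros, not their actual definitions.

```python
from typing import List

NUM_BANKS = 4
# Number of bits needed to name a bank (0 when there is a single bank).
BANK_SEL_BITS = (NUM_BANKS - 1).bit_length()


def mem_tag_to_bank_id(tag: int) -> int:
    """Assumed layout: the low BANK_SEL_BITS of the tag hold the bank ID."""
    return tag & ((1 << BANK_SEL_BITS) - 1)


def mem_tag_to_req_id(tag: int) -> int:
    """Assumed layout: the remaining upper bits hold the request ID."""
    return tag >> BANK_SEL_BITS


def line_to_mem_addr(line_addr: int, bank_id: int) -> int:
    """Rebuild a memory line address by appending the bank ID (assumed low bits)."""
    return (line_addr << BANK_SEL_BITS) | bank_id


def route_mem_rsp(rsp_valid: bool, rsp_tag: int) -> List[bool]:
    """Model of curr_bank_mem_rsp_valid for every bank i."""
    if NUM_BANKS == 1:
        return [rsp_valid]
    return [rsp_valid and mem_tag_to_bank_id(rsp_tag) == i
            for i in range(NUM_BANKS)]
```

With this reading, a memory response is seen as valid only by the bank whose ID is encoded in its tag, and every bank prepends nothing but its own ID when it sends a line address out to memory.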

Going back to vortex/docs/cache_subsystem.md provided by the author, the cache subsystem has the following elements:

1. High-bandwidth with bank parallelism
2. Snoop protocol to flush data for CPU access (cache coherence is handled with a snooping protocol)
3. Generic design: Dcache, Icache, Shared Memory, L2 cache, L3 cache (this points to the cache hierarchy)


1. Cache can be configured to be any level in the hierarchy
2. Caches communicate via snooping
3. Cache flush from AFU is passed down the hierarchy

2. VX_cache and VX_bank features as summarized by the author: a guide to the VX_bank design details

2.1 VX_cache features

VX_cache.v is the top module of the cache verilog code located in the /hw/rtl/cache directory.

1. Configurable (Cache size, number of banks, bank line size, etc.)
2. I/O signals
   - Core Request
   - Core Rsp
   - DRAM Req
   - DRAM Rsp
   - Snoop Req
   - Snoop Rsp
   - Snoop Forwarding Out
   - Snoop Forwarding In
3. Bank Select : Assigns valid and ready signals for each bank
4. Snoop Forwarder
5. DRAM Request Arbiter : Prepares cache response for communication with DRAM
6. Snoop Response Arbiter : Sends snoop response
7. Core Response Merge : Cache accesses one line at a time. As a result, each request may not come back in the same response. This module tries to recombine the responses by thread ID.
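Item 7 can be pictured with a small behavioral model. The Python sketch below groups per-bank responses that carry the same tag into one merged response with a per-lane valid mask. The tuple layout and the name `merge_core_responses` are hypothetical, and the real module works on ready/valid handshakes rather than on lists.

```python
from typing import Dict, List, Optional, Tuple

NUM_REQS = 4  # assumed number of request lanes (one per thread)

# Hypothetical view of one bank response: (tag, req_idx, data)
BankRsp = Tuple[int, int, int]
# Merged response: (tag, per-lane valid mask, per-lane data)
MergedRsp = Tuple[int, List[bool], List[Optional[int]]]


def merge_core_responses(bank_rsps: List[BankRsp]) -> List[MergedRsp]:
    """Recombine bank responses that share a tag into one response per tag."""
    lanes_by_tag: Dict[int, List[Optional[int]]] = {}
    for tag, req_idx, data in bank_rsps:
        # First response for a tag allocates an empty slot per lane.
        lanes = lanes_by_tag.setdefault(tag, [None] * NUM_REQS)
        lanes[req_idx] = data
    # dict preserves insertion order, so tags come out in arrival order.
    return [(tag, [d is not None for d in lanes], lanes)
            for tag, lanes in lanes_by_tag.items()]
```

Two banks answering lanes 0 and 2 of the same request would thus produce a single response whose mask marks exactly those two lanes as valid.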

So far I have not seen any signals related to the snooping protocol in the code.

2.2 VX_bank features

VX_bank.v is the verilog code that handles cache bank functionality and is located in the /hw/rtl/cache directory.
The parts of a bank are as follows:

1. Allows for high throughput
2. Each bank contains queues to hold requests to the cache
3. I/O signals
   - Core request
   - Core Response
   - DRAM Fill Requests
   - DRAM Fill Response
   - DRAM WB Requests
   - Snp Request
   - Snp Response
4. Request Priority: DRAM fill, miss reserve, core request, snoop request
5. Snoop Request Queue
6. DRAM Fill Queue
7. Core Req Arbiter : Requests to be processed by the bank
8. Tag Data Store :
   - Registers for valid, dirty, dirtyb, tag, and data
   - Length of registers determined by lines in the bank
9. Tag Data Access :
   - I/O: stall, snoop info, force request miss
   - Writes to cache or sends read response; hit or miss determined here
   - A missed request goes to the miss reserve if it is not a snoop request or DRAM fill
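
The request priority in item 4 amounts to a fixed-priority selection. Here is a minimal Python sketch, assuming each source simply raises a valid flag in a given cycle. The source names follow the list above, while `select_request` is a hypothetical helper, not a Vortex module.

```python
from typing import Dict, Optional

# Highest priority first, as listed in item 4.
PRIORITY = ("dram_fill", "miss_reserve", "core_req", "snoop_req")


def select_request(valid: Dict[str, bool]) -> Optional[str]:
    """Return the highest-priority source that is valid this cycle, if any."""
    for source in PRIORITY:
        if valid.get(source, False):
            return source
    return None
```

A core request therefore wins over a snoop request, but has to wait whenever a DRAM fill or a replay from the miss reserve is pending.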

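This is useful material for understanding the microarchitecture details. For the snoop protocol mentioned here, I suggest going back to the thread-level parallelism chapter of Computer Architecture: A Quantitative Approach. A cache serves three parties: the local core, the next-level memory, and remote cores. Remote cores rely on MSI (or its extensions MESI and MOESI) to keep caches coherent, so it is no surprise that the I/O section covers Core, DRAM, and Snp.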
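
As a reminder of what MSI involves, the textbook write-invalidate protocol can be written as a transition table. The Python sketch below is the generic textbook version, not anything taken from the Vortex RTL; the event and bus-action names are illustrative assumptions.

```python
from enum import Enum
from typing import Optional, Tuple


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


I, S, M = State.INVALID, State.SHARED, State.MODIFIED

# (current state, event) -> (next state, bus action or None)
TRANSITIONS = {
    (I, "cpu_read"): (S, "bus_read"),
    (I, "cpu_write"): (M, "bus_read_exclusive"),
    (S, "cpu_read"): (S, None),
    (S, "cpu_write"): (M, "bus_upgrade"),   # invalidates other copies
    (M, "cpu_read"): (M, None),
    (M, "cpu_write"): (M, None),
    (I, "snoop_read"): (I, None),
    (I, "snoop_write"): (I, None),
    (S, "snoop_read"): (S, None),
    (S, "snoop_write"): (I, None),          # another core is writing
    (M, "snoop_read"): (S, "write_back"),   # supply the dirty data
    (M, "snoop_write"): (I, "write_back"),
}


def step(state: State, event: str) -> Tuple[State, Optional[str]]:
    """Apply one local or snooped event to a cache line's MSI state."""
    return TRANSITIONS[(state, event)]
```

The snoop_* rows are the reason a coherent cache needs a snoop request port next to its core and DRAM ports.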

2.3 A point of confusion

As for the registers listed under Tag Data Store, let us first recall which fields a cache should have:

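  1. Caches usually use write-back, hence the dirty bit. To support write-back, the cache also has to pick a victim line on a miss and, if that line is dirty, write it back to the next-level memory first. This calls for replacement bits and a replacement policy.
  2. A cache incurs compulsory misses on a cold start. As the program runs, the resulting read and write misses gradually pull blocks in from the next-level storage to fill cache lines. A valid bit marks whether a given cache line has been filled.
  3. Cache lookup (set-associative) needs the tag and the data, so both are naturally present.
  4. In a thread-level parallel setting, the private caches of different cores hold some data in the shared state, and because the program keeps running and reads and writes to the data are serialized, some cache lines will be modified at certain moments. In common designs, write-invalidate has advantages over write-update: for data that may not be used again soon, write-update adds unnecessary operations along with the bandwidth pressure and power they bring. A write-invalidate protocol therefore requires the cache to support shared, modified, and invalid state bits.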
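
The first three fields can be tied together in a small model of one cache set. This is a minimal sketch of a generic write-back, write-allocate cache, not the Vortex tag/data store: the class names are hypothetical, round-robin replacement is an assumed policy, and data fetched from memory on a read miss is modelled as 0.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CacheLine:
    valid: bool = False   # line has been filled
    dirty: bool = False   # line differs from next-level memory
    tag: int = 0
    data: int = 0


class CacheSet:
    """One set of a write-back cache with round-robin replacement (assumed)."""

    def __init__(self, num_ways: int):
        self.lines = [CacheLine() for _ in range(num_ways)]
        self.next_victim = 0  # replacement state

    def access(self, tag: int, write: bool,
               data: int = 0) -> Tuple[bool, Optional[Tuple[int, int]]]:
        """Return (hit, writeback); writeback is (tag, data) of an evicted dirty line."""
        for line in self.lines:
            if line.valid and line.tag == tag:
                if write:
                    line.data, line.dirty = data, True
                return True, None
        # Miss: use an invalid way if one exists, otherwise evict round-robin.
        victim = next((l for l in self.lines if not l.valid), None)
        if victim is None:
            victim = self.lines[self.next_victim]
            self.next_victim = (self.next_victim + 1) % len(self.lines)
        writeback = (victim.tag, victim.data) if victim.valid and victim.dirty else None
        victim.valid, victim.tag, victim.dirty = True, tag, write
        victim.data = data if write else 0  # read fill modelled as 0
        return False, writeback
```

Only a line that is both valid and dirty produces a write-back when it is evicted, which is exactly why the two bits are stored separately.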

Coming back, and putting the analysis above together, look at item 8:

8. Tag Data Store :
   - Registers for valid, dirty, dirtyb, tag, and data
   - Length of registers determined by lines in the bank

I still find this puzzling: the bits needed to support MSI are missing.

=======================================

If anyone who knows this area well can explain why they are not needed, please do!

=======================================

3. Some background for debugging and testing

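After writing the first three parts I realized there were already signals I could not follow, so I browsed GitHub and found the following in vortex/docs/codebase.md: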

1. unit_tests: unit tests for some hardware components

2. driver: host drivers repository
  - include: Vortex driver public headers
  - stub: Vortex stub driver library
  - fpga: software driver that uses Intel OPAE FPGA
  - asesim: software driver that uses Intel ASE simulator
  - vlsim: software driver that uses vlsim simulator
  - rtlsim: software driver that uses rtlsim simulator
  - simx: software driver that uses simX simulator

3. sim:
  - vlsim: AFU RTL simulator
  - rtlsim: processor RTL simulator
  - simX: cycle approximate simulator for vortex

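The author states clearly what each of these paths is for. My plan is to first sort out the relationships among the modules and their rough roles, and only then work through the principles in detail.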


Summary

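This part is a light one. I was getting a bit lost while reading the code and wanted to step back and see what other material the author provides, so I will leave it here for now. Some further notes are already organized, but there is not yet enough of them for a full post.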

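I have been busy with the autumn recruiting season lately. Job descriptions ask for all sorts of knowledge, and after interviewing with a few companies I felt I was still short on GPU architecture knowledge, so I have been cramming it recently while also practicing coding problems on Nowcoder.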

As always, if you spot mistakes please point them out bluntly; I am glad to be corrected.
