Pseudo-LRU Design

King9Cc

Last modified: 2024-09-03 20:56:41

Tags: hardware architecture, GPU compute, FPGA development

First published: 2024-09-03 20:55:04

Article link: https://blog.csdn.net/weixin_46096236/article/details/141871496

LRU

LRU (Least Recently Used) is a replacement policy: when the cache is full, it decides which cache way should have its contents replaced.

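How it works

Overview of the LRU replacement policy

The basic idea of LRU is that when the cache has to replace a way, it picks the way that was used least recently.
A strict implementation records the order in which the ways of a set were accessed. This module instead uses a pseudo-LRU algorithm, which keeps a small binary tree of flag bits per set and uses it to approximate which way is least recently used.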

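Representing the LRU tree

In a 4-way set-associative cache, the LRU state of a set fits in 3 bits, one for each internal node a, b, and c of the tree. Together they point at the LRU way; the way that was just accessed is the MRU (Most Recently Used) way.
Writing the flags as {a, b, c}, the root b selects a pair of ways, and a or c selects the way within that pair. The pattern 00? makes way 0 the LRU, 10? makes it way 1, ?10 makes it way 2, and ?11 makes it way 3, where ? is a don't-care bit.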
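As a minimal sketch of these patterns, the following Python model decodes and updates the 3-bit state. It is not part of the original module: the function names victim_4way and touch_4way are made up for illustration, and the model assumes the flags are packed as {a, b, c} with a 0 in a node meaning that the LRU way is on the left.

```python
# Behavioral model of a 4-way tree pseudo-LRU (illustrative names).
# Flags are packed as {a, b, c}: bit 2 = a, bit 1 = b (root), bit 0 = c.

def victim_4way(flags: int) -> int:
    """Follow the flags from the root b down to the least recently used way."""
    a = (flags >> 2) & 1
    b = (flags >> 1) & 1
    c = flags & 1
    if b == 0:          # LRU is in the left pair (ways 0 and 1)
        return a        # a == 0 -> way 0, a == 1 -> way 1
    return 2 + c        # c == 0 -> way 2, c == 1 -> way 3


def touch_4way(flags: int, way: int) -> int:
    """Mark `way` as most recently used by pointing its path nodes away from it."""
    a = (flags >> 2) & 1
    b = (flags >> 1) & 1
    c = flags & 1
    if way < 2:
        b = 1               # left pair just used, so the LRU is on the right
        a = 1 - way         # way 0 used -> a points at way 1, and vice versa
    else:
        b = 0               # right pair just used, so the LRU is on the left
        c = 1 - (way - 2)   # way 2 used -> c points at way 3, and vice versa
    return (a << 2) | (b << 1) | c


# Four consecutive replacements starting from an all-zero state.
flags = 0b000
order = []
for _ in range(4):
    way = victim_4way(flags)
    order.append(way)
    flags = touch_4way(flags, way)
print(order)  # prints [0, 2, 1, 3]
```

Each replacement flips the nodes on the victim's path, so successive victims alternate between the two halves of the tree.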

For 8 ways the tree has seven internal nodes, a through g:

    //              d
    //         /        \
    //        b          f
    //      /   \       /  \
    //     a     c     e    g
    //    / \   / \   / \  / \
    //   0   1 2   3 4   5 6  7

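In this structure, a 0 in a node means the LRU way is in its left subtree, and a 1 means it is in its right subtree.

Suppose {a,b,c,d,e,f,g} = 7'b0 before an access, and way 1 is then accessed. In subtree a, way 1 is now the most recently used, so a = 0 (pointing at way 0). In subtree b, the left subtree was just used, so b = 1. At the root d, the left subtree was just used, so d = 1. The new state is therefore {a,b,c,d,e,f,g} = {0,1,c,1,e,f,g}, where c, e, f, and g keep their previous values. With all of them 0, the result is 7'b0101000.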
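The walk-through above can be reproduced with a short Python model. This is a sketch under assumptions rather than the module's implementation: the helper names touch and victim are hypothetical, and the flags are assumed to be stored in in-order tree order with a as the most significant bit, which is how {a,b,c,d,e,f,g} is written above.

```python
def touch(flags: int, way: int, num_ways: int) -> int:
    """Return the flags after `way` becomes the most recently used way."""
    lo, size = 0, num_ways                 # current subtree covers ways lo .. lo+size-1
    while size > 1:
        half = size // 2
        node = lo + half - 1               # in-order node index: 0 = a, 1 = b, ...
        bit = (num_ways - 2) - node        # a is the most significant flag bit
        if way < lo + half:                # accessed way is in the left subtree
            flags |= 1 << bit              # so the LRU path now goes right
        else:                              # accessed way is in the right subtree
            flags &= ~(1 << bit)           # so the LRU path now goes left
            lo += half
        size = half
    return flags


def victim(flags: int, num_ways: int) -> int:
    """Follow the flags from the root to the least recently used way."""
    lo, size = 0, num_ways
    while size > 1:
        half = size // 2
        bit = (num_ways - 2) - (lo + half - 1)
        if (flags >> bit) & 1:             # 1 means the LRU way is on the right
            lo += half
        size = half
    return lo


flags = touch(0b0000000, way=1, num_ways=8)
print(f"{flags:07b}")      # prints 0101000
print(victim(flags, 8))    # prints 4
```

After the access to way 1, the root d points to the right half, so the next victim comes from ways 4 to 7.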

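SRAM module lru_data

The module stores the LRU flags in an SRAM and reads the current state of a set through the read_en and read_addr signals.
Whenever a way in a set is accessed or replaced, the flags for that set are updated to reflect the new most recently used way.

State update

On a replacement or an access, the flags are updated according to the pseudo-LRU rule described above.
update_flags is computed from the current flags (lru_flags) and the way that has just become most recently used (new_mru), and is written back to the SRAM in the stage after the read.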
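To make the read, update, and write-back flow concrete, here is a small untimed Python model of the per-set flag storage for the 4-way case. It is a sketch, not the module itself: the class and method names are invented for illustration, the one-stage gap between reading lru_flags and writing update_flags is collapsed into a single call, and the storage is assumed to start at zero because the post does not show how the SRAM is initialized. The update table is transcribed from the 4-way case of the module listing.

```python
class LruModel:
    """Untimed model of lru_data plus the 4-way victim and update logic."""

    def __init__(self, num_sets: int):
        self.flags = [0b000] * num_sets      # one 3-bit {a, b, c} entry per set

    @staticmethod
    def _victim(flags: int) -> int:
        if flags & 0b010 == 0:               # b == 0: LRU is way 0 or 1, chosen by a
            return (flags >> 2) & 1
        return 2 + (flags & 1)               # b == 1: LRU is way 2 or 3, chosen by c

    @staticmethod
    def _update(flags: int, new_mru: int) -> int:
        a, c = flags & 0b100, flags & 0b001  # bits carried over unchanged
        return {0: 0b110 | c, 1: 0b010 | c, 2: a | 0b001, 3: a | 0b000}[new_mru]

    def access(self, set_index: int, way: int) -> None:
        """A hit on `way`: read the flags, mark the way MRU, write them back."""
        self.flags[set_index] = self._update(self.flags[set_index], way)

    def fill(self, set_index: int) -> int:
        """A miss: pick the victim way, which also becomes the new MRU."""
        way = self._victim(self.flags[set_index])
        self.flags[set_index] = self._update(self.flags[set_index], way)
        return way


lru = LruModel(num_sets=2)
print([lru.fill(0) for _ in range(4)])   # prints [0, 2, 1, 3]
lru.access(0, 0)                         # hit on way 0 in set 0
print(lru.fill(0))                       # prints 2
print(lru.fill(1))                       # prints 0 (set 1 is independent)
```

The fill path marks the chosen way as MRU, mirroring the listing, where new_mru takes fill_way whenever was_fill is set.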

module cache_lru
    #(parameter NUM_SETS = 1,
    parameter NUM_WAYS = 4,    // Must be 1, 2, 4, or 8
    parameter SET_INDEX_WIDTH = $clog2(NUM_SETS),
    parameter WAY_INDEX_WIDTH = $clog2(NUM_WAYS))
    (input                                clk,
    input                                 reset,

    // Fill interface. Used to request LRU to replace when filling.
    input                                 fill_en,
    input [SET_INDEX_WIDTH - 1:0]         fill_set,
    output logic[WAY_INDEX_WIDTH - 1:0]   fill_way,

    // Access interface. Used to move a way to the MRU position when
    // it has been accessed.
    input                                 access_en,
    input [SET_INDEX_WIDTH - 1:0]         access_set,
    input                                 update_en,
    input [WAY_INDEX_WIDTH - 1:0]         update_way);

    localparam LRU_FLAG_BITS =
        NUM_WAYS == 1 ? 1 :
        NUM_WAYS == 2 ? 1 :
        NUM_WAYS == 4 ? 3 :
        7;    // NUM_WAYS = 8

    logic[LRU_FLAG_BITS - 1:0] lru_flags;
    logic update_lru_en;
    logic [SET_INDEX_WIDTH - 1:0] update_set;
    logic[LRU_FLAG_BITS - 1:0] update_flags;
    logic [SET_INDEX_WIDTH - 1:0] read_set;
    logic read_en;
    logic was_fill;
    logic[WAY_INDEX_WIDTH - 1:0] new_mru;
`ifdef SIMULATION
    logic was_access;
`endif

    assign read_en = access_en || fill_en;
    assign read_set = fill_en ? fill_set : access_set;
    assign new_mru = was_fill ? fill_way : update_way;
    assign update_lru_en = was_fill || update_en;

    initial
    begin
        // Must be 1, 2, 4, or 8
        assert(NUM_WAYS <= 8 && (NUM_WAYS & (NUM_WAYS - 1)) == 0);
    end

    // This uses a pseudo-LRU algorithm.
    // Assuming four ways, the current state of each set is represented by
    // three bits. Imagine a tree:
    //
    //        b
    //      /   \
    //     a     c
    //    / \   / \
    //   0   1 2   3
    //
    // The leaves 0-3 represent ways, and the letters a, b, and c represent the 3 bits
    // which indicate a path to the *least recently used* way. A 0 stored in an interior
    // node indicates the left node and a 1 the right. Each time an element is moved
    // to the MRU, the bits along its path are set to the opposite direction. This means
    // it will take at least two accesses to other ways for a node in the MRU to move to
    // the LRU and be evicted. A strict LRU would take three accesses for the node to move
    // to the LRU. So, this is close enough to LRU to work well, but much simpler to implement.
    //
    sram_1r1w #(
        .DATA_WIDTH(LRU_FLAG_BITS),
        .SIZE(NUM_SETS),
        .READ_DURING_WRITE("NEW_DATA")
    ) lru_data(
        // Fetch existing flags
        .read_en(read_en),
        .read_addr(read_set),
        .read_data(lru_flags),

        // Update LRU (from next stage)
        .write_en(update_lru_en),
        .write_addr(update_set),
        .write_data(update_flags),
        .*);

    // XXX I bet there's a way to programmatically create update_flags
    // and fill_way with a generate loop instead of hard-coding like
    // I've done here.
    generate
        case (NUM_WAYS)
            1:
            begin
                assign fill_way = 0;
                assign update_flags = 0;
            end

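            2:
            begin
                // One flag per set: 0 means way 0 is LRU, 1 means way 1 is LRU,
                // the same convention as the 4- and 8-way cases.
                assign fill_way = lru_flags[0];
                assign update_flags[0] = !new_mru;
            end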

            4:
            begin
                always_comb
                begin
                    casez (lru_flags)
                        3'b00?: fill_way = 0;
                        3'b10?: fill_way = 1;
                        3'b?10: fill_way = 2;
                        3'b?11: fill_way = 3;
                        default: fill_way = '0;
                    endcase
                end

                always_comb
                begin
                    case (new_mru)
                        2'd0: update_flags = {2'b11, lru_flags[0]};
                        2'd1: update_flags = {2'b01, lru_flags[0]};
                        2'd2: update_flags = {lru_flags[2], 2'b01};
                        2'd3: update_flags = {lru_flags[2], 2'b00};
                        default: update_flags = '0;
                    endcase
                end
            end

            8:
            begin
                always_comb
                begin
                    casez (lru_flags)
                        7'b00?0???: fill_way = 0;
                        7'b10?0???: fill_way = 1;
                        7'b?100???: fill_way = 2;
                        7'b?110???: fill_way = 3;
                        7'b???100?: fill_way = 4;
                        7'b???110?: fill_way = 5;
                        7'b???1?10: fill_way = 6;
                        7'b???1?11: fill_way = 7;
                        default: fill_way = '0;
                    endcase
                end

                always_comb
                begin
                    case (new_mru)
                        3'd0: update_flags = {2'b11, lru_flags[5], 1'b1, lru_flags[2:0]};
                        3'd1: update_flags = {2'b01, lru_flags[5], 1'b1, lru_flags[2:0]};
                        3'd2: update_flags = {lru_flags[6], 3'b011, lru_flags[2:0]};
                        3'd3: update_flags = {lru_flags[6], 3'b001, lru_flags[2:0]};
                        3'd4: update_flags = {lru_flags[6:4], 3'b011, lru_flags[0]};
                        3'd5: update_flags = {lru_flags[6:4], 3'b001, lru_flags[0]};
                        3'd6: update_flags = {lru_flags[6:4], 1'b0, lru_flags[2], 2'b01};
                        3'd7: update_flags = {lru_flags[6:4], 1'b0, lru_flags[2], 2'b00};
                        default: update_flags = '0;
                    endcase
                end
            end

            default:
            begin
                initial
                begin
                    $display("%m invalid number of ways");
                    $finish;
                end
            end
        endcase
    endgenerate

    always_ff @(posedge clk)
    begin
        update_set <= read_set;
        was_fill <= fill_en;
    end

`ifdef SIMULATION
    always_ff @(posedge clk, posedge reset)
    begin
        if (reset)
            was_access <= 0;
        else
        begin
            // Can't update when the last cycle didn't perform an access.
            assert(!(update_en && !was_access));
            was_access <= access_en;    // Debug only
        end
    end
`endif
endmodule
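The comment in the listing says that, in the 4-way case, a way that has just become MRU cannot be evicted until at least two further accesses have occurred. That claim can be checked by brute force. The Python sketch below is an illustration only: the helper names are hypothetical, it assumes the same in-order flag layout as the listing, and it counts accesses to other ways rather than clock cycles.

```python
def path(way, num_ways):
    """Yield (bit_index, way_is_in_right_subtree) for each node from root to `way`."""
    lo, size = 0, num_ways
    while size > 1:
        half = size // 2
        bit = (num_ways - 2) - (lo + half - 1)   # in-order layout, a is the MSB
        right = way >= lo + half
        yield bit, right
        if right:
            lo += half
        size = half


def touch(flags, way, num_ways):
    """Point every node on the path to `way` away from it (way becomes MRU)."""
    for bit, right in path(way, num_ways):
        flags = flags & ~(1 << bit) if right else flags | (1 << bit)
    return flags


def is_victim(flags, way, num_ways):
    """True when every node on the path to `way` points toward it."""
    return all(((flags >> bit) & 1) == int(right) for bit, right in path(way, num_ways))


def min_accesses_until_victim(flags, way, num_ways):
    """Breadth-first search over accesses to other ways after `way` was touched."""
    frontier = {touch(flags, way, num_ways)}
    seen = set(frontier)
    steps = 0
    while frontier:
        if any(is_victim(f, way, num_ways) for f in frontier):
            return steps
        frontier = {touch(f, w, num_ways) for f in frontier
                    for w in range(num_ways) if w != way} - seen
        seen |= frontier
        steps += 1
    return None


for n in (4, 8):
    results = {min_accesses_until_victim(f, w, n)
               for f in range(1 << (n - 1)) for w in range(n)}
    print(n, results)   # prints "4 {2}" then "8 {3}"
```

Under these assumptions the minimum is one access per tree level (2 for 4 ways, 3 for 8 ways), whereas strict LRU needs every other way to be accessed first (3 and 7 accesses respectively).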

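Interface signals

clk: clock.
reset: reset.
fill_en: fill enable; a way in the cache needs to be replaced.
fill_set: index of the set being filled.
fill_way: output; index of the way chosen for replacement in fill_set.
access_en: access enable; a set in the cache is being accessed, so its LRU flags are read.
access_set: index of the set being accessed.
update_en: update enable; the LRU state must be updated. It may only be asserted in the cycle after access_en, which the simulation-only assertion checks.
update_way: index of the way that was just accessed and should become the MRU way.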

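Read/write conflict analysis

One read port, one write port (1R1W)
- 1R1W means the SRAM has one read port and one write port, so it can perform one read and one write in the same clock cycle. In a single cycle it can read from one address while writing to the same or a different address.
- This lets the cache logic write back updated LRU flags while it reads the flags for the next access, which helps pipeline throughput and latency.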

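SRAM operation
- read_en and read_addr: these control the read port. When read_en is high, the SRAM reads the data at the address given by read_addr.
- write_en and write_addr: these control the write port. When write_en is high, the SRAM writes write_data to the address given by write_addr.
- Because the read and write ports are separate, a read and a write in the same cycle do not conflict as long as they target different addresses. When they target the same address, the result depends on the memory's read-during-write behavior; the lru_data instance sets READ_DURING_WRITE to "NEW_DATA", which by its name requests that the read return the data being written.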
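The same-address case can be illustrated with a small Python model of a 1R1W memory. This is only a sketch under assumptions: the internals of sram_1r1w are not shown in this post, so the class below (a hypothetical Sram1R1W with a made-up new_data flag) assumes a registered read and assumes that the "NEW_DATA" setting forwards the value being written to a simultaneous read of the same address. In the other mode the model returns the old contents, whereas a real SRAM macro may return undefined data.

```python
class Sram1R1W:
    """Synchronous memory with one read port and one write port (illustrative)."""

    def __init__(self, size: int, new_data: bool = True):
        self.mem = [0] * size
        self.read_data = 0           # registered read output
        self.new_data = new_data     # assumed meaning of READ_DURING_WRITE("NEW_DATA")

    def clock(self, read_en=False, read_addr=0,
              write_en=False, write_addr=0, write_data=0) -> int:
        """Model one rising clock edge; both ports may be active at once."""
        collision = read_en and write_en and read_addr == write_addr
        if read_en:
            if collision and self.new_data:
                self.read_data = write_data            # forward the incoming value
            else:
                self.read_data = self.mem[read_addr]   # contents before this edge
        if write_en:
            self.mem[write_addr] = write_data
        return self.read_data


ram = Sram1R1W(size=4)
ram.clock(write_en=True, write_addr=2, write_data=0b110)
# Read and write the same address in the same cycle.
print(ram.clock(read_en=True, read_addr=2,
                write_en=True, write_addr=2, write_data=0b011))   # prints 3
# Read one address while writing another: no interaction.
print(ram.clock(read_en=True, read_addr=2,
                write_en=True, write_addr=0, write_data=0b101))   # prints 3
```

With forwarding enabled, a lookup that reads a set in the same cycle as its flags are written back still sees the updated flags.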

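This module is used in the IFETCH stage, where the PC (the instruction address) either advances by 4 or jumps. The reasoning here is that fetch does not keep reading the same instruction back to back (a dead loop), so the conflict case for this module, in which a set is accessed and its flags are then read and written at the same address in the same cycle, is not expected to occur.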

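Why reads and writes to the same address are rare

Given how the ICACHE is designed and used, reads and writes to the same address are rare, for the following reasons:

Sequential access: the processor normally fetches instructions in order. Even on a branch or a function call, the new instruction address differs from the address that was just read or written.
Cache misses: on a miss, the cache writes the incoming block into a free cache location and does not immediately read the newly written data back. In addition, the write usually completes in its own clock cycle, so a read and a write do not target the same address in the same cycle.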
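Set-associative structure: the ICACHE is usually set-associative, so different addresses map to different ways, and reads and writes that fall in the same set still operate on different ways. Note that lru_data itself holds one entry per set (its SIZE is NUM_SETS) and is indexed by set only, so this point concerns the cached lines rather than the LRU flags.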