【论文阅读】FULL-KV: Flexible and Ultra-Low-Latency In-Memory Key-Value Store System Design on CPU-FPGA

最新推荐文章于 2024-06-22 11:28:30 发布

想个id好难啊

最新推荐文章于 2024-06-22 11:28:30 发布

阅读量779

点赞数

分类专栏：论文阅读小笔记

本文链接：https://blog.csdn.net/weixin_37802735/article/details/108041274

版权

论文阅读小笔记专栏收录该内容

14 篇文章 4 订阅

订阅专栏

introduction

1.背景

RAM在数据中心的使用越来越多。根据ram的使用，将存储系分为两种类型
a) memory-cached systems：将数据存储在非易失性存储中，使用DRAM作为cache
b) in-memory storage systems：将数据完整地存储在DRAM之中，使用非易失性存储进行掉电恢复。
low latency, high bandwidth, and high I/O throughput
c) 为了加速系统性能并减少功耗：

(1) RDMA（remote direct memory access）：是一种完全由硬件执行I/O交换的工作方式.在这种方式中， DMA 控制器从CPU 完全接管对总线的控制，数据交换不经过CPU ，而直接在存和IO设备之间进行。RNIC-RDMA based network interface
(2) 将部分应用以及network功能移到FPGA进行加速

Network communication：决定分布式存储系统性能的重要因素。两种类型的技术被使用：
a) One-side RDMA：shifts the storage workload to clients
b) Two-side RDMA：performs the storage in the servers where the performance is bounded by CPU
在数据密集型应用如数据中心，传统的冯诺依曼结构就称为限制性能的瓶颈。
a) RDMA 通过减少network stack中CPU的参与提高性能。
b) 通过FPGA进行加速
key-value store (KVS) systems

a) 在目前的基于硬件的KVS中，KVpair的大小可以是固定或者可改变的。

当前hardware-based可支持的最大value大小为1 Mbytes （大部分小于4kbytes）但是和software based KVS还是有较大的差距。

b) 根据KVPs的储存模式可以分为两种设计思路 :decoupling/non-decooupling mode

(1) 前者是将KVPs作为整体进行储存；后者是将KV分离，再单独建hash table进行链接。
(2) 在设计支持大value存储的KVS中，前者更加划算（因为key对于KVP的整体大小贡献不大）

c) 基于硬件的KVS可以选择不同的存储媒介方案
(1) 当前工作中的方案设计（CAM内容寻找址）
在这里插入图片描述
(2) 存储媒介的选择主要基于两个方面
storage capacity
data access latency
主要有三种选择SRAM，DRAM，Flash

i) SRAM or DRAM is usually utilized to store the hash table, whereas DRAM or Flash memory for values.
ii) 考虑到数据中心的规模问题，DRAM成为hash table的首选。value有的存储在DDR3 Memory中，有的则为flash

d) hash collision problem ：由于映射表中不同的KVP可能指向同一个storage address

cuckoo hashing [30], hop- scotching hashing [31], and chained hashing [22]

(1) Chained hashing.The latency of operations increases as the workload of the hash table rises
(2) Cuckoo hashing can guarantee a constant time for a lookup and achieve high storage utilization （FULL-KV选择这个解决方案）
(3) Hopscotch- ing hashing consumes constant time on average.
more complex than cuckoo hashing

e) data consistency problem
(1) 原因在于hash table是共享数据的（并非某个数据独占的），同时searching和updating的过程都是有延时的。
(2) 有的文章【20】使用hashing function以及BRAM Memory来检测data conflict

i) 只要是不同的key映射到相同的BRAM的地址就认为其发生了data conflivt
ii) 但是over-detected，这会导致冲突率过分高（鉴于BRAM的容量十分小。）

(3) 我们的方法
仅仅缓存一定数目的最近的请求，再与当前的指令进行对比。如果没有发生映射地址重叠就按照fully-pipeline正常工作；如果发生了就等待先前的请求完成。

f) distributed KVS system
In such a large-scale distributed system, it is common that failures and repairs occur
(1) Consistent hashing
Only the KVs originally mapped to the failure node need to be remapped to nearby nodes, without remapping all the KVs
(2) 为了探索FULLKV的可拓展型，将FULL-KV拓展成多个节点，并使用custom client以及erhernet switch来测试系统性能

2.our work

FULL-KV, a flexible and ultra-low-latency IMKVS system based on the CPU-FPGA heterogeneous architecture.
a) 设计了高度并行的加速器结构来获得超低的延迟。
b) 采用decoupling的设计，将hash table存储在板上的DRAM memory，value存在host memory之中。
c) 我们设计了一种新的分段策略，可以提供超大值的支持。
d) 我们提出了一种新的内存预分配程序，它可以通过主机内存预分配来隐藏中断处理和PCIe接口的内存映射I/O (MMIO)的延迟。

architecture

下图为整体框架：
在这里插入图片描述

1.单节点结构

KVS加速器中有两个主要结构：KVS Engine，network offload engine

框架

(1) KVSE
在这里插入图片描述

(2) NOE
It is responsible for the transmission of KVS request and response messages over the 10GbE.

key-value store message (KVSM) protocol

to format the KVS request and response messages

在这里插入图片描述

the total value length, and the offset of the fragmented value, respectively

2.多节点结构

在这里插入图片描述

根据consistent hashing的机制，KVPs会通过不同的NOE接口分为不同的key区域。
The custom client also consists of an FPGA card and a computer.
KVSM generator can generate request KVSMs of different modes, such as PUT/GET/DELETE/Random operations for fixed/variable-length values.
KVSM receiver mainly receives the response KVSMs.

implementation

1.KVSE

在这里插入图片描述

初始化
host CPU预先申请一系列memory segment，并向segment management中写入相关信息（通过PCIeMMIO）
再从NOE接收到请求之后，KVSM parser会解析接受的KVSM语义。
将request header提交给hash engine，将值提交给Ｃ2H controller
hash engine会查找哈希表判断是否有相同的key值

(1) 如果有就会将结果传输给KVSM Packer
(2) 如果没有，就向segment management申请一列连续的memory space来储存value >
(3) 在获取到memory地址之后，就向hash table中写入新的条目，并为C2H Controller初始化一个DMA descriptor。
(4) 将日志信息发送给event controller
(5) 当cache的值到达一定的大小（默认为1k bytes），就向 DMA Controller提交一个解释器来发动value的传递。
(6) 最终将结果发送给KVSM Pacher

2.KVSM Parser and Packer

Parser
从request的KVSM中截取出kv-id，type等等并初始化一个request header提交给hash engine。而value传输给C2H controller。
因此hash management以及value的传输可以是并行的。
Packer
将处理结果格式化成KVSM的形式，之后提交给NOE。
从hash engine，C2H controller以及H2C controller会获得6种结果。

(1) Hash Engine -PUT hit, GET miss, and DELETE hit/miss
(2)C2H Controller -PUT miss
(3)H2C Controller -GET hit

3.Hash Engine

hash table-板上DDR4 Memory
value-host DRAM Memory

(1) First, calculate hash addresses by the two hash functions
(2) Second, search the hash table according to hash addresses and then compare the request key with those in the eight buckets
(3) Finally, execute corre- sponding processing

When a new hash table entry needs inserting into the hash table
从空的bucket中序号较低的bucket进行插入
When the hash collision happens, which means the optional eight buckets are all occupied in our design
one old hash table entry has to be kicked out in order to insert the new one
In case of PUT miss or DELETE hit >>
inserting a new hash table entry or deleting one
物理结构

The hash_fun and flag modules are designed to implement step 1
The ht_addr_cmp and FIFO modules are used to resolve the data consistency problem
The P2S, DDR4 controller, S2P, key_cmp, and kick_- key_cmp modules implement step 2
The key buffer and value_size & op_type buffer are used for data synchronization in the pipeline
The Hash Execute module is corresponding to step 3

4.C2H & H2C Controller

在这里插入图片描述

C2H Controller 从host获取value
fragmented descriptor is submitted to DMA Controller to launch DMA transmission after the amount of the value data reaches one frame
H2C Controller 向host memory存储值
multiple fragmented descriptors are first submitted to DMA Con- troller. After the amount reaches a frame, the received value data are directly packed into a response KVSM and trans- mitted to KVSM Packer.

5.Segment Management

We design a novel memory allocation scheme to hide the latency.

method
(1) pre-alloc
(2) 当segment的使用达到了默认的线程，event message就会发送给event controller（告知host更新used segment）
The segment buffer >>
implemented by a dual-port BRAM memory
The default segment size is 4MB, which is the maximum size
Each segment is man- aged mainly by its 64-bit address

Of the two BRAM ports
i) one is used to provide memory space for values
ii) the other to initialize and update segments by the host CPU

6.Synchronizer

Synchronizer provides the capacity for the host to build and update the hash table.
为host，hash engine以及event controller负责。

在这里插入图片描述

host
The host CPU caches the data to be synchronized in the sync_segment
hash engine
Synchronizer generates H2C DMA descriptors to fetch the data. The sync_data from the host is then transmitted to Hash Engine
event controller
After completing the transmission, Synchronizer will generate event messages to inform the host that the data in sync_buffer has been fetched