Bcache against Flashcache for Ceph Object Storage



Fast SSDs are getting cheaper every year, but they are still smaller and more expensive than traditional HDDs. HDDs, in turn, have much higher latency and are easily saturated, while we want the storage system to deliver both low latency and high capacity. There is a well-known practice for optimizing the performance of big, slow devices: caching. Since most of the data on a disk is rarely accessed while some fraction of it is accessed frequently, a small cache can provide a noticeably higher quality of service.
Server hardware and operating systems have many caches working at different levels. Linux has a page cache for block devices and dentry and inode caches at the filesystem layer. Disks have their own internal caches. CPUs have caches too. So why not add one more persistent cache layer in front of a slow disk?
In this article, we'll explain what we used, what problems we ran into and how we solved them by replacing the block device caching software. We'll start with an issue we had been having with flashcache in our Ceph cluster with an HDD backend.

The Environment


Ceph is a modern software-defined object store. It can be used in different ways, including storing virtual machine disks and providing an S3 API. We use it for several cases:
  • RBD devices for virtual machines.
  • CephFS for some internal applications.
  • Plain RADOS object storage with a self-written client.

The last use case is the one we'll be talking about here.
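To give an idea of this use case, here is a minimal sketch of a plain RADOS client using the librados Python bindings; the pool and object names are hypothetical.

```python
import rados

# Connect to the cluster using the standard config file.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    # Open an I/O context on a pool (pool name is an assumption).
    ioctx = cluster.open_ioctx("objects")
    try:
        # Store and read back an object directly in RADOS, without RBD or CephFS.
        ioctx.write_full("sample-key", b"payload stored directly in RADOS")
        data = ioctx.read("sample-key")
        print(data)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```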
Every 24-HDD server has an NVMe SSD, which is split into 48 partitions (actually, it used to be a RAID-1 array of two NVMes, but now we just use the two NVMe devices separately). So each OSD daemon gets one partition for its journal and one partition for its cache.
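As a rough illustration of that layout, the sketch below lays out the 24 × 2 = 48 partition arithmetic; the device names are purely assumptions.

```python
# Illustrative per-host layout: 24 OSDs, each with one NVMe journal
# partition and one NVMe cache partition (device names are assumptions).
NUM_OSDS = 24

layout = {
    f"osd.{i}": {
        "data":    f"/dev/sd{chr(ord('b') + i)}",   # HDD backing the OSD
        "journal": f"/dev/nvme0n1p{2 * i + 1}",     # OSD journal partition
        "cache":   f"/dev/nvme0n1p{2 * i + 2}",     # flashcache partition
    }
    for i in range(NUM_OSDS)
}

# One journal + one cache partition per OSD -> 48 NVMe partitions in total.
assert NUM_OSDS * 2 == 48
```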

The Flashcache


What is flashcache? It is a kernel module initially developed by Facebook, which allows one drive to be cached on another. It is built on top of the device mapper and can work in four different modes: write-through, write-around, write-back and write-only. The documentation can be found here. Write-around and write-through caches are not persistent across device removals and reboots. Since most of the time we need to cache both read and write workloads, we used the write-back mode.
Flashcache can be added to a device that is already in use, which is one of its benefits. All that is needed is to stop the service, create a cache device and start the service again using the newly created virtual flashcache device. Like all device mapper-based devices, flashcache devices are named dm-[0-9] in the system.
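For illustration, creating such a write-back cache device comes down to a single flashcache_create call, sketched here from Python; the device and cache names are assumptions.

```python
import subprocess

ssd_part = "/dev/nvme0n1p2"  # NVMe cache partition (assumption)
hdd_dev = "/dev/sdb"         # slow HDD holding OSD data (assumption)

# flashcache_create -p back <name> <ssd_device> <disk_device>
# builds a write-back ("back") cache named "cachedev" on top of the HDD.
subprocess.run(
    ["flashcache_create", "-p", "back", "cachedev", ssd_part, hdd_dev],
    check=True,
)

# The cached volume then appears via the device mapper (e.g. /dev/mapper/cachedev,
# backed by a dm-[0-9] node), and the OSD is started on it instead of the raw HDD.
```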
We had been using flashcache for a long time as a caching layer for Ceph with virtual machine disks. As described in the documentation, it was developed for random read/write workloads. We are able to configure the «sequential threshold», though, which is the maximum size of a request, in kilobytes, that will be cached. All requests larger than the specified size bypass the cache and go straight to the slow device.
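Assuming the sysctl names from the flashcache admin guide (the exact «ssd+disk» key depends on the devices the cache was built from, so treat the paths below as assumptions), tuning that threshold looks roughly like this:

```python
from pathlib import Path

# Assumed sysctl key: "<cache_dev>+<disk_dev>" for the cache built above.
cachedev_key = "nvme0n1p2+sdb"
threshold_kb = 1024  # requests larger than 1 MiB skip the cache

# Equivalent to: sysctl dev.flashcache.<key>.skip_seq_thresh_kb=<value>
sysctl_path = Path(f"/proc/sys/dev/flashcache/{cachedev_key}/skip_seq_thresh_kb")
sysctl_path.write_text(str(threshold_kb))
```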
Because our experience with it had been good, we tried to start using it under a different workload: with Ceph and our self-written clients working over RADOS.

The Problem


As soon as we started using flashcache in a different way, things changed. The first issue was flashcache's locking behavior. There was a bug that led to a deadlock: under high memory pressure, when flashcache needed to free some memory, it had to start a new thread to do so, but starting a new thread itself required allocating memory, which was impossible. The result of this bug was a sudden host hang.
Another problem we faced was a high rate of HDD utilization, which grew the larger the cluster became. Individual disks were utilized at up to 100% for dozens of seconds at a time. To understand what was happening