Distributed Block Storage QoS Rate-Limiting Algorithms: Introduction, Practice, and Impact on Upper-Layer Applications

杰特JET

Last modified 2022-01-20 16:25:06


First published 2021-04-23 13:02:04

Article link: https://blog.csdn.net/zxpoiu/article/details/116051428


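Distributed Block Storage QoS Rate-Limiting Algorithms and Their Impact on Upper-Layer Applications

Introduction to QoS Rate-Limiting Algorithms

There are two main rate-limiting strategies, the token bucket and the leaky bucket. Each is described below.

Token Bucket

Wikipedia describes the token bucket algorithm as follows: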

A token is added to the bucket every 1/r seconds.
The bucket can hold at the most b tokens. If a token arrives when the bucket is full, it is discarded.
When a packet (network layer PDU) of n bytes arrives,
- if at least n tokens are in the bucket, n tokens are removed from the bucket, and the packet is sent to the network.
- if fewer than n tokens are available, no tokens are removed from the bucket, and the packet is considered to be non-conformant.

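In other words, a bucket of fixed capacity holds tokens, and the capacity is the upper bound on the number of tokens. Tokens are added at fixed intervals until the bucket is full. Each IO request consumes one token: if the bucket has a token, the request consumes it and is allowed through; otherwise it cannot proceed (the algorithm can choose whether or not to discard the IO request). When limiting by bytes, each IO consumes a number of tokens equal to its IO size.

From this description, the token bucket algorithm has the following properties:

The token bucket algorithm controls the rate at which IO requests are processed by controlling the token refill rate;
The token bucket algorithm allows a certain amount of burst: as long as the bucket still has tokens, IO requests consume them and pass immediately. During this period the processing rate exceeds the refill rate, so the refill rate is really the average processing rate;
The token bucket algorithm cannot control the peak burst rate or the burst duration. The burst duration is determined by the actual IO request rate. If the actual request rate is constant and greater than the refill rate, then starting from a full bucket: burst duration = bucket capacity / (actual IO request rate - token refill rate)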
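As a rough illustration of the three properties above, here is a minimal token bucket sketch in C. All names (`token_bucket`, `tb_refill`, `tb_try_consume`, `tb_burst_seconds`) are hypothetical and are not taken from any of the projects discussed later. Elapsed time is passed in explicitly so the behavior is deterministic, and the burst formula assumes the bucket starts full.

```c
#include <stdbool.h>

/* Hypothetical token bucket; all names are illustrative only. */
struct token_bucket {
    double capacity;   /* b: maximum number of tokens the bucket can hold */
    double rate;       /* r: tokens added per second */
    double tokens;     /* tokens currently available */
};

/* Add the tokens earned over elapsed_sec seconds, capped at capacity. */
void tb_refill(struct token_bucket *tb, double elapsed_sec)
{
    tb->tokens += tb->rate * elapsed_sec;
    if (tb->tokens > tb->capacity) {
        tb->tokens = tb->capacity;
    }
}

/* Try to admit an IO that costs `cost` tokens (1 for IOPS limiting,
 * the IO size in bytes for bandwidth limiting). Returns true if admitted. */
bool tb_try_consume(struct token_bucket *tb, double cost)
{
    if (tb->tokens < cost) {
        return false;          /* non-conformant: no tokens are removed */
    }
    tb->tokens -= cost;
    return true;
}

/* Burst duration for a constant request rate above the refill rate,
 * starting from a full bucket: capacity / (request_rate - rate).
 * Returns -1 when the request rate never drains the bucket. */
double tb_burst_seconds(const struct token_bucket *tb, double request_rate)
{
    if (request_rate <= tb->rate) {
        return -1.0;
    }
    return tb->capacity / (request_rate - tb->rate);
}
```

For example, with a capacity of 1000 tokens, a refill rate of 100 tokens/s, and a constant request rate of 600 IO/s, the burst lasts 1000 / (600 - 100) = 2 seconds before requests start to be rejected.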

Leaky Bucket

Leaky bucket as a meter

Wikipedia defines the leaky bucket as a meter as follows:

A fixed capacity bucket, associated with each virtual connection or user, leaks at a fixed rate.
If the bucket is empty, it stops leaking.
For a packet to conform, it has to be possible to add a specific amount of water to the bucket: The specific amount added by a conforming packet can be the same for all packets, or can be proportional to the length of the packet.
If this amount of water would cause the bucket to exceed its capacity then the packet does not conform and the water in the bucket is left unchanged.

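This can be understood as follows:

A bucket leaks water at a fixed rate. Each IO request passing through adds water to the bucket; the amount added depends on what is being limited, for example the number of bytes for bandwidth or one unit per IO for IOPS. If adding the water would overflow the bucket, the IO is not allowed through; otherwise it is released.

This description is essentially the same as that of the token bucket, so the leaky bucket as a meter and the token bucket can be regarded as equivalent.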
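To make the equivalence concrete, the following sketch models the leaky bucket as a meter in C. The names (`leaky_meter`, `lm_leak`, `lm_try_add`) are hypothetical. The free space `capacity - level` plays the role of the token count in a token bucket, and leaking water corresponds to refilling tokens.

```c
#include <stdbool.h>

/* Hypothetical leaky-bucket meter; names are illustrative only. */
struct leaky_meter {
    double capacity;   /* maximum amount of water the bucket can hold */
    double leak_rate;  /* water leaked per second */
    double level;      /* current water level */
};

/* Leak for elapsed_sec seconds; an empty bucket stops leaking. */
void lm_leak(struct leaky_meter *lm, double elapsed_sec)
{
    lm->level -= lm->leak_rate * elapsed_sec;
    if (lm->level < 0.0) {
        lm->level = 0.0;
    }
}

/* Try to add `amount` of water for an arriving IO.
 * Returns true if the IO conforms and may be released. */
bool lm_try_add(struct leaky_meter *lm, double amount)
{
    if (lm->level + amount > lm->capacity) {
        return false;  /* would overflow: level is left unchanged */
    }
    lm->level += amount;
    return true;
}
```

An empty meter behaves like a full token bucket: it can absorb a burst of up to `capacity` units at once, after which IOs are admitted only as fast as the bucket leaks.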

Leaky bucket as a queue


Wikipedia describes this rate-limiting strategy as follows: The leaky bucket consists of a finite queue. When a packet arrives, if there is room on the queue it is appended to the queue; otherwise it is discarded. At every clock tick one packet is transmitted (unless the queue is empty).

The leaky bucket as a queue can therefore be regarded as a token bucket whose bucket size is 1.
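The queue variant can be sketched in C as a small ring buffer that releases at most one IO per clock tick. The names (`leaky_queue`, `lq_enqueue`, `lq_tick`) and the queue length of 4 are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define LQ_CAPACITY 4  /* hypothetical queue length */

/* Hypothetical leaky-bucket queue holding IO identifiers. */
struct leaky_queue {
    int items[LQ_CAPACITY];
    size_t head;   /* index of the oldest queued IO */
    size_t count;  /* number of queued IOs */
};

/* Enqueue an arriving IO; returns false (discard) if the queue is full. */
bool lq_enqueue(struct leaky_queue *q, int io_id)
{
    if (q->count == LQ_CAPACITY) {
        return false;
    }
    q->items[(q->head + q->count) % LQ_CAPACITY] = io_id;
    q->count++;
    return true;
}

/* Called once per clock tick: transmit at most one IO.
 * Returns true and stores the IO in *io_id if one was transmitted. */
bool lq_tick(struct leaky_queue *q, int *io_id)
{
    if (q->count == 0) {
        return false;
    }
    *io_id = q->items[q->head];
    q->head = (q->head + 1) % LQ_CAPACITY;
    q->count--;
    return true;
}
```

Because exactly one IO leaves per tick regardless of how many are queued, the output rate is perfectly smooth and no burst is possible, which is the behavior the text compares to a token bucket of size 1.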

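Mainstream Block Device Flow-Control Schemes

Three flow-control schemes are widely used in production engineering: QEMU, librbd, and SPDK. Each is described below.

QEMU

QEMU has supported block device IO throttling since version 1.1. It provides six configuration options that set rate limits separately for six scenarios covering IOPS and bandwidth. Version 1.7 added burst support to block device IO throttling, and version 2.6 refined it so that both the burst rate and the burst duration can be controlled. The parameters are:

Scenario	Base rate limit	Burst rate	Burst duration
Total IOPS	iops-total	iops-total-max	iops-total-max-length
Read IOPS	iops-read	iops-read-max	iops-read-max-length
Write IOPS	iops-write	iops-write-max	iops-write-max-length
Total bps	bps-total	bps-total-max	bps-total-max-length
Read bps	bps-read	bps-read-max	bps-read-max-length
Write bps	bps-write	bps-write-max	bps-write-max-length

Its core data structure is defined as follows: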

typedef struct LeakyBucket {
    uint64_t avg;             /* target (average) IO rate limit */
    uint64_t max;             /* burst IO rate limit */
    double  level;            /* bucket level in units */
    double  burst_level;      /* bucket level in units (for computing bursts) */
    uint64_t burst_length;    /* burst duration, in seconds */
} LeakyBucket;

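QEMU's throttling algorithm is implemented with leaky buckets. Its goal is to let the user run at the burst rate bkt.max for bkt.burst_length seconds, after which the rate drops to bkt.avg.

To achieve this, QEMU uses two buckets:

Main bucket: its size bucket_size is bkt.max * bkt.burst_length, and it leaks at the rate bkt.avg. Normal IO is accounted against the main bucket first.
Burst bucket: its size burst_bucket_size is bkt.max / 10 (100 ms worth of IO at the burst rate), and it leaks at the rate bkt.max.

If the main bucket is full, the IO must wait for it to leak. If the main bucket is not full and a burst bucket is in use (burst_length > 1 in the code below), the burst bucket is checked to decide whether the IO can be released. The burst bucket thus enforces the burst rate, and the size of the main bucket bounds the burst duration.

The key function that decides whether an IO can be released is: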

/* This function compute the wait time in ns that a leaky bucket should trigger
 *
 * @bkt: the leaky bucket we operate on
 * @ret: the resulting wait time in ns or 0 if the operation can go through
 */
int64_t throttle_compute_wait(LeakyBucket *bkt)
{
    double extra; /* the number of extra units blocking the io */
    double bucket_size;   /* I/O before throttling to bkt->avg */
    double burst_bucket_size; /* Before throttling to bkt->max */

    if (!bkt->avg) {
        return 0;
    }

    if (!bkt->max) {
        /* If bkt->max is 0 we still want to allow short bursts of I/O
         * from the guest, otherwise every other request will be throttled
         * and performance will suffer considerably. */
        bucket_size = (double) bkt->avg / 10;
        burst_bucket_size = 0;
    } else {
        /* If we have a burst limit then we have to wait until all I/O
         * at burst rate has finished before throttling to bkt->avg */
        bucket_size = bkt->max * bkt->burst_length;
        burst_bucket_size = (double) bkt->max / 10;
    }

    /* If the main bucket is full then we have to wait */
    extra = bkt->level - bucket_size;
    if (extra > 0) {
        return throttle_do_compute_wait(bkt->avg, extra);
    }

    /* If the main bucket is not full yet we still have to check the
     * burst bucket in order to enforce the burst limit */
    if (bkt->burst_length > 1) {
        assert(bkt->max > 0); /* see throttle_is_valid() */
        extra = bkt->burst_level - burst_bucket_size;
        if (extra > 0) {
            return throttle_do_compute_wait(bkt->max, extra);
        }
    }

    return 0;
}
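To put numbers on the two buckets, the following C sketch recomputes the bucket sizes used by throttle_compute_wait() above. The helper throttle_do_compute_wait() is not shown in this article, so `wait_ns` below is an assumption: it simply takes the time needed to leak `extra` units at `limit` units per second. All function names in the sketch are hypothetical.

```c
#include <stdint.h>

#define NS_PER_SEC 1000000000.0

/* Assumed behavior of the wait helper: time needed to leak `extra`
 * units at `limit` units per second, in nanoseconds. */
int64_t wait_ns(double limit, double extra)
{
    return (int64_t)(extra * NS_PER_SEC / limit);
}

/* Main bucket size, following the two branches of throttle_compute_wait(). */
double main_bucket_size(uint64_t avg, uint64_t max, uint64_t burst_length)
{
    if (!max) {
        return (double)avg / 10;        /* no burst limit: 100 ms worth of avg */
    }
    return (double)max * burst_length;  /* burst rate held for burst_length s */
}

/* Burst bucket size: 100 ms worth of IO at the burst rate. */
double burst_bucket_size(uint64_t max)
{
    return (double)max / 10;
}
```

With iops-total=1000, iops-total-max=5000, and iops-total-max-length=10, the main bucket holds 50000 units and the burst bucket holds 500 units. Under the assumption above, if the main bucket level is 100 units over its size, the IO waits 100 / 1000 s = 100 ms.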

librbd

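Ceph added IO throttling for RBD images in version 13.2.0 (Mimic). That release only supports limiting total IOPS. Bursts are supported and the burst rate is configurable, but the burst duration cannot be controlled (in effect it is fixed at 1 second and cannot be changed). Version 14.2.0 (Nautilus) added limits for the other five scenarios (read IOPS, write IOPS, total bps, read bps, and write bps), with burst behavior unchanged.

librbd's throttling is implemented with a token bucket: it supports bursts and a configurable burst rate, but not a configurable burst duration. The token refill interval can be tuned with the rbd_qos_schedule_tick_min parameter (default 50 ms). Users configure the base rate and the burst rate with the following parameters:

Scenario	Base rate limit	Burst rate
Total IOPS	rbd_qos_iops_limit	rbd_qos_iops_burst
Read IOPS	rbd_qos_iops_read_limit	rbd_qos_iops_read_burst
Write IOPS	rbd_qos_iops_write_limit	rbd_qos_iops_write_burst
Total bps	rbd_qos_bps_limit	rbd_qos_bps_burst
Read bps	rbd_qos_bps_read_limit	rbd_qos_bps_read_burst
Write bps	rbd_qos_bps_write_limit	rbd_qos_bps_write_burst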

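SPDK

SPDK implements QoS rate limiting in the bdev layer, using a token bucket. IOPS and bandwidth can be limited separately, but burst rates are not supported. Limits are configured through the bdev_set_qos_limit RPC, with the following parameters:

Parameter	Description
rw_ios_per_sec	IOPS limit
rw_mbytes_per_sec	Read/write bandwidth limit
r_mbytes_per_sec	Read bandwidth limit
w_mbytes_per_sec	Write bandwidth limit

SPDK registers the poller function bdev_channel_poll_qos to add tokens to the bucket. The refill interval is hard-coded as SPDK_BDEV_QOS_TIMESLICE_IN_USEC, 1 ms by default. The number of tokens added per timeslice is the configured per-second rate divided by the number of timeslices per second.

An IO is released only after it passes every configured token bucket. A single IO may drive a bucket negative; once a bucket is negative, no IO can be released until bdev_channel_poll_qos has refilled it back to a positive value.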

static int
bdev_channel_poll_qos(void *arg)
{
	struct spdk_bdev_qos *qos = arg;
	uint64_t now = spdk_get_ticks();
	int i;

	if (now < (qos->last_timeslice + qos->timeslice_size)) {
		/* We received our callback earlier than expected - return
		 *  immediately and wait to do accounting until at least one
		 *  timeslice has actually expired.  This should never happen
		 *  with a well-behaved timer implementation.
		 */
		return SPDK_POLLER_IDLE;
	}

	/* Reset for next round of rate limiting */
	for (i = 0; i < SPDK_BDEV_QOS_NUM_RATE_LIMIT_TYPES; i++) {
		/* We may have allowed the IOs or bytes to slightly overrun in the last
		 * timeslice. remaining_this_timeslice is signed, so if it's negative
		 * here, we'll account for the overrun so that the next timeslice will
		 * be appropriately reduced.
		 */
		if (qos->rate_limits[i].remaining_this_timeslice > 0) {
			qos->rate_limits[i].remaining_this_timeslice = 0;
		}
	}

	while (now >= (qos->last_timeslice + qos->timeslice_size)) {
		qos->last_timeslice += qos->timeslice_size;
		for (i = 0; i < SPDK_BDEV_QOS_NUM_RATE_LIMIT_TYPES; i++) {
			qos->rate_limits[i].remaining_this_timeslice +=
				qos->rate_limits[i].max_per_timeslice;
		}
	}

	return bdev_qos_io_submit(qos->ch, qos);
}
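The refill logic above can be modeled with a single signed counter. The sketch below is a simplified, hypothetical model in C, not SPDK code: the names `slice_limit`, `quota_per_timeslice`, `slice_try_submit`, and `slice_refill` are illustrative, the quota uses plain integer division, and any minimum per-timeslice quota a real implementation might enforce is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

#define TIMESLICE_USEC 1000      /* 1 ms, as described above */
#define USEC_PER_SEC   1000000

/* Hypothetical single-limit model; names are illustrative only. */
struct slice_limit {
    int64_t max_per_timeslice;        /* tokens granted per timeslice */
    int64_t remaining_this_timeslice; /* signed: may go negative on overrun */
};

/* Per-timeslice quota for a per-second limit (assumed integer rounding). */
int64_t quota_per_timeslice(uint64_t limit_per_sec)
{
    return (int64_t)(limit_per_sec * TIMESLICE_USEC / USEC_PER_SEC);
}

/* Admit an IO only while the bucket is positive; the cost is charged in
 * full, so a large IO can push the bucket below zero. */
bool slice_try_submit(struct slice_limit *l, int64_t cost)
{
    if (l->remaining_this_timeslice <= 0) {
        return false;
    }
    l->remaining_this_timeslice -= cost;
    return true;
}

/* Start `slices` new timeslices: unused tokens are dropped, overrun is
 * carried over, then one quota is added per elapsed timeslice. */
void slice_refill(struct slice_limit *l, int slices)
{
    if (l->remaining_this_timeslice > 0) {
        l->remaining_this_timeslice = 0;
    }
    l->remaining_this_timeslice += (int64_t)slices * l->max_per_timeslice;
}
```

In this model, a bandwidth limit of 104857600 bytes/s gives a quota of 104857 bytes per 1 ms timeslice. A single 1 MiB write is admitted while the counter is positive and leaves it at -943719, so roughly ten further timeslices must elapse before the next IO can pass. Dropping unused tokens at each refill is also why this scheme does not accumulate any burst allowance.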

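Impact of Rate-Limiting Strategies on Block Devices

Different QoS strategies give the block device above them a different experience, mainly in IO latency and %util.

Latency

Latency is determined by the burst capability and the refill frequency of the QoS strategy.

Burst capability: configuring a larger leaky bucket or token bucket, or using two buckets as QEMU does, improves the burst capability of the block device, so that latency stays low when it absorbs a burst of traffic.
- This can be tested with the fio command fio --group_reporting --rw=randwrite --bs=1M --numjobs=1 --iodepth=64 --ioengine=libaio --direct=1 --name test --size=2000G --filename=/dev/vdb --iodepth_low=0 --iodepth_batch_submit=64 --thinktime=950ms --thinktime_blocks=64, which submits a batch of 1M IOs at queue depth 64, waits 950 ms after the batch completes, and then repeats. If the block device's burst capability is poor, the observed symptoms are high IO wait latency, latency that grows linearly with queue depth, and bandwidth that cannot be driven up. Because each batch is followed by a long idle period, IO util is not high either.
Refill frequency: a low refill frequency causes severe tail latency. As a simple example, if the token bucket is refilled once per second and the IOs submitted within the current second exceed the limit, the excess IOs must wait for the next refill, so some of them can see latency approaching or exceeding 1 second.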
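The effect of the refill interval on the wait can be worked out with a small helper. This is a hypothetical model in C that assumes refills happen at exact multiples of the interval and that the bucket is already empty when the IO arrives; service time is ignored.

```c
#include <stdint.h>

/* Time, in ms, that an IO arriving at arrival_ms must wait for the next
 * refill when the bucket is already empty. Refills are assumed to happen
 * at multiples of interval_ms; an IO arriving exactly on a refill
 * boundary does not wait. */
uint64_t wait_for_refill_ms(uint64_t arrival_ms, uint64_t interval_ms)
{
    return (interval_ms - arrival_ms % interval_ms) % interval_ms;
}
```

An IO that arrives 110 ms into the period with the bucket exhausted waits 890 ms when the refill interval is 1000 ms, but only 40 ms when the interval is 50 ms. The worst-case wait is bounded by the refill interval, which is why a coarse interval shows up as tail latency.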

IO util

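The util value of a disk is defined as the fraction of total time during which the disk is processing IO, that is, the ratio of the time the disk queue has IO in it to the total time. If the rate-limiting algorithm spreads IO processing evenly over time (as with the leaky bucket as a queue, where IOs are processed one at a time in a steady trickle) so that the disk queue always contains IO, util will naturally be high.

A block device configured with a large burst capability, by contrast, can drain even a very deep queue quickly, so its util is naturally low.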

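Impact on Database Applications

Here we take MySQL running on a distributed block device as an example and discuss how the rate-limiting strategy affects SQL performance.

In MySQL, two kinds of IO have the largest effect on performance:

Flushing dirty pages
1. To reduce the number of IOs and improve read/write performance, MySQL introduces the buffer pool. Modifications are first applied in the buffer pool and flushed to disk at an appropriate time. When a page in memory differs from the corresponding page on disk, the in-memory page is called a "dirty page". Once it has been written to disk, the two copies are identical and the page is called a "clean page".
2. From this usage pattern we can infer that each dirty-page flush will consist of large IOs at high queue depth. If the block device's burst capability is poor, dirty pages are flushed slowly and, as described above, IO util is very likely to be low. In other words, this IO pattern does not make full use of the throughput that the rate limit allows.
3. The fix is simple:
  1. Reduce the queue depth (the amount of dirty pages flushed each time) to offset the limited burst capability.
  2. Flush more frequently, which raises IO util and makes better use of the rate limit.
Flushing the redo log. The binlog is not discussed here because the official position is that enabling it costs less than 1% in performance.
1. To guarantee correctness, the redo log is written sequentially by a single thread. If the block device's burst capability is poor, redo log writes see high latency, which drags down the TPS of the whole system.
2. If the redo log shares a disk with other IO, it cannot be given priority, and a dirty-page flush that triggers throttling may increase redo log latency.
3. There are several possible fixes:
  1. Increase the burst capability of the block device.
  2. Raise the priority of redo log IO so that it is submitted first. If the block storage system does not support IO priorities, a separate disk can be provisioned and dedicated to the redo log.
  3. Have the MySQL layer support a redo log with concurrent random writes (PolarDB has probably already implemented this).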
