每个key在其对应的dir/bucket下都会占有200B左右的空间。当dir/bucket下面的key数量
很多时,这将使得dir对象很大。不仅包含该dir对象的osd会使用很多内存,而且当dir
对象迁移时所有对该对象的写操作都会锁定[1]。
从hammer开始,就可以使用bucket index shard特性。将bucket的index对象分成多个
对象。但对于已经存在bucket,无法使用该特性。
对于bucket index shard,涉及4类情况[2]:
- bucket 创建
- put/copy/delete 对象
以put为例进行分析。
- bucket listing/stats/fix indexing
以list为例进行分析。
- bucket index log
有两种配置方法:
.. Note:: 需要重新启动radosgw实例才能使得配置生效。
创建一个名为mmm的bucket。
从[1]中看,该值不适合配置太大,否则严重影响bucket list操作。具体设置为多少,
需要测试来确定。
index shard对象。 对象的名字为.dir_XXX.$NUM。
初始化index对象的rgw_bucket_dir_header信息。
rgw_bucket_dir_header持久化为omap header。具体存放的内容为(结合encode方法看):
在将对象的各个数据片段写入数据pool后,需要更新bucket index的信息。在
"解析in中的op信息,将新key内容写到bucket dir对象的omap中。"步骤时的操作如下:
上面的两步都需要确定bucket index信息。具体对于bucket index分片的确定,在
BucketShard对象的初始化过程中完成。prepare和complete中都是通过调用
get_bucket_shard()来取定BucketShard信息的。下面对BucketShard bs的初始化过程
进行分析。
BucketShard信息:
binfo内容如下:
其中对于index shard来说关注的有num_shards和bucket_index_shard_hash_type。
**总结**
需要获取BucketShard信息时,套路如下
list_results 结果示例:
列举结果存放在 struct rgw_cls_list_ret 结构中。
最终keys解析到new_dir.m中,返回给rgw instance端。如下
[1] `RadosGW Big Index <http://cephnotes.ksperis.com/blog/2015/05/12/radosgw-big-index/>`_
[2] `Rgw bucket index scalability <http://tracker.ceph.com/projects/ceph/wiki/Rgw_-_bucket_index_scalability>
很多时,这将使得dir对象很大。不仅包含该dir对象的osd会使用很多内存,而且当dir
对象迁移时所有对该对象的写操作都会锁定[1]。
[root@yhg-2 cmds]# rados -p .rgw.buckets.index listomapvals .dir.yhg-yhg.5457.29 --cluster yhg | more
afuhan
value: (198 bytes) :
0000 : 08 03 c0 00 00 00 06 00 00 00 61 66 75 68 61 6e : ..........afuhan
0010 : 02 00 00 00 00 00 00 00 01 04 03 75 00 00 00 01 : ...........u....
0020 : 09 00 00 00 00 00 00 00 fc b7 10 57 00 00 00 00 : ...........W....
0030 : 20 00 00 00 62 62 62 38 61 61 65 35 37 63 31 30 : ...bbb8aae57c10
0040 : 34 63 64 61 34 30 63 39 33 38 34 33 61 64 35 65 : 4cda40c93843ad5e
0050 : 36 64 62 38 03 00 00 00 78 78 31 11 00 00 00 5a : 6db8....xx1....Z
0060 : 6f 6e 65 20 75 73 65 72 20 66 6f 72 20 79 68 67 : one user for yhg
0070 : 18 00 00 00 61 70 70 6c 69 63 61 74 69 6f 6e 2f : ....application/
0080 : 6f 63 74 65 74 2d 73 74 72 65 61 6d 09 00 00 00 : octet-stream....
0090 : 00 00 00 00 00 00 00 00 00 00 00 00 01 01 02 00 : ................
00a0 : 00 00 01 02 04 0f 00 00 00 79 68 67 2d 79 68 67 : .........yhg-yhg
00b0 : 2e 31 34 31 30 32 2e 37 00 00 00 00 00 00 00 00 : .14102.7........
00c0 : 00 00 00 00 00 00 : ......
从hammer开始,就可以使用bucket index shard特性。将bucket的index对象分成多个
对象。但对于已经存在bucket,无法使用该特性。
对于bucket index shard,涉及4类情况[2]:
- bucket 创建
- put/copy/delete 对象
以put为例进行分析。
- bucket listing/stats/fix indexing
以list为例进行分析。
- bucket index log
1. index shard配置
有两种配置方法:
通过region map设置
// 获取region配置
[root@yhg-2 cmds]# radosgw-admin region get --cluster yhg > /tmp/region
// 将 'bucket_index_max_shards'的值修改为 4
[root@yhg-2 cmds]# vim /tmp/region
// 更新region配置
[root@yhg-2 cmds]# radosgw-admin region put --cluster yhg < /tmp/region
// 更新到region map
[root@yhg-2 cmds]# radosgw-admin regionmap update --cluster yhg
通过集群配置文件设置
[client.radosgw.yhg-yhg-yhg-2]
rgw frontends = "civetweb port=80"
rgw bucket index max shards = 4
.. Note:: 需要重新启动radosgw实例才能使得配置生效。
2. 检查
创建一个名为mmm的bucket。
// 查看bucket的id
[root@yhg-2 cmds]# radosgw-admin metadata get bucket:mmm --cluster yhg | grep bucket_id
"bucket_id": "yhg-yhg.14236.1"
// 查看bucket 实例信息
[root@yhg-2 cmds]# radosgw-admin metadata get bucket.instance:mmm:yhg-yhg.14236.1 --cluster yhg | grep shard
"num_shards": 4,
"bi_shard_hash_type": 0
// 查看.dir对象分片
[root@yhg-2 cmds]# rados -p .rgw.buckets.index ls --cluster yhg | grep "yhg-yhg.14236.1"
.dir.yhg-yhg.14236.1.2
.dir.yhg-yhg.14236.1.0
.dir.yhg-yhg.14236.1.1
.dir.yhg-yhg.14236.1.3
3. bucket_index_max_shards 值的选择
从[1]中看,该值不适合配置太大,否则严重影响bucket list操作。具体设置为多少,
需要测试来确定。
4. create bucket 初始化各个index shard
创建bucket时需要初始化bucket index对象。index shard打开时,会创建多个index shard对象。 对象的名字为.dir_XXX.$NUM。
2593 int RGWRados::init_bucket_index(rgw_bucket& bucket, int num_shards)
2594 {
2595 librados::IoCtx index_ctx; // context for new bucket
2596
// 创建到集群中bucket.index_pool的IoCtx,用于后续对该pool进行操作
// 本例中bucket.index_pool是 .rgw.buckets.index
2597 int r = open_bucket_index_ctx(bucket, index_ctx);
2598 if (r < 0)
2599 return r;
2600
// 拼接bucket的.dir_XXX对象名字
// 例如:.dir.yhg-yhg.5457.24
// 其中, '.dir'为统一前缀,yhg-yhg.5457.24为bucket id/marker
2601 string dir_oid = dir_oid_prefix;
2602 dir_oid.append(bucket.marker);
2603
// 获取.dir_XXX对象map。如果没有打开index shard特性,该map中只有一个
// 项,就是<0, '.dir_XXX'>。
// 如果打开了index shard特性,.dir_XXX为分成num_shards个对象。
// 名字为.dir_XXX.$NUM
2604 map<int, string> bucket_objs;
2605 get_bucket_index_objects(dir_oid, num_shards, bucket_objs);
2606
// 调用了ceph osd端cls操作 'rgw bucket_init_index',
// 即调用了CLSRGWConcurrentIO()
2607 return CLSRGWIssueBucketIndexInit(index_ctx, bucket_objs, cct->_conf->rgw_bucket_index_max_aio)();
2608 }
对于index shard配置为4时,
[root@yhg-2 ~]# rados -p .rgw.buckets.index ls --cluster yhg | grep "yhg-yhg.14236.1"
.dir.yhg-yhg.14236.1.2
.dir.yhg-yhg.14236.1.0
.dir.yhg-yhg.14236.1.1
.dir.yhg-yhg.14236.1.3
4.1 CLSRGWConcurrentIO()
对bucket 的.dir_XXX对象调用了cls 操作 'rgw bucket_init_index'。 // cls/rgw/cls_rgw_client.h
246 int operator()() {
247 int ret = 0;
248 iter = objs_container.begin();
249 for (; iter != objs_container.end() && max_aio-- > 0; ++iter) {
// 最终调用了issue_bucket_index_init_op
// 即,调用了集群端cls操作, 'rgw bucket_init_index'
250 ret = issue_op(iter->first, iter->second);
4.2 ceph osd端 cls 操作 'rgw bucket_init_index'
初始化index对象的rgw_bucket_dir_header信息。
// cls/rgw/cls_rgw.cc
566 int rgw_bucket_init_index(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
rgw_bucket_dir_header持久化为omap header。具体存放的内容为(结合encode方法看):
// key为 header
542 struct rgw_bucket_dir_header {
543 map<uint8_t, rgw_bucket_category_stats> stats;
544 uint64_t tag_timeout;
545 uint64_t ver; // 本次操作时,只有有ver被设置为了1
546 uint64_t master_ver;
547 string max_marker
5. put object 如何确定bucket index
在将对象的各个数据片段写入数据pool后,需要更新bucket index的信息。在
"解析in中的op信息,将新key内容写到bucket dir对象的omap中。"步骤时的操作如下:
// RGWRados::Bucket::UpdateIndex::prepare
3468 r = index_op.prepare(CLS_RGW_OP_ADD);
// RGWRados::Bucket::UpdateIndex::complete
3487 r = index_op.complete(poolid, epoch, size,
3488 ut, etag, content_type, &acl_bl,
3489 meta.category, meta.remove_objs);
上面的两步都需要确定bucket index信息。具体对于bucket index分片的确定,在
BucketShard对象的初始化过程中完成。prepare和complete中都是通过调用
get_bucket_shard()来取定BucketShard信息的。下面对BucketShard bs的初始化过程
进行分析。
// rgw/rgw_rados.h
1529 int get_bucket_shard(BucketShard **pbs) {
1530 if (!bs_initialized) {
1531 int r = bs.init(bucket_info.bucket, obj);
// rgw/rgw_rados.cc
4589 int RGWRados::open_bucket_index_shard(rgw_bucket& bucket, librados::IoCtx& index_ctx,
4590 const string& obj_key, string *bucket_obj, int *shard_id)
// 建立index base对象所在pool的io上下文,
// 并返回拼接好的index base对象名字.dir.${bucket.marker}
4592 string bucket_oid_base;
4593 int ret = open_bucket_index_base(bucket, index_ctx, bucket_oid_base);
// 从bucket meta对象(.bucket.meta.${bucket.name}:${bucket.marker})中读出
// bucket的描述信息
4599 // Get the bucket info
4600 RGWBucketInfo binfo;
4601 ret = get_bucket_instance_info(obj_ctx, bucket, binfo, NULL, NULL);
// 采用简单的hash算法,计算出shard id,并拼接出bucket index对象的名字
// 比如,.dir.yhg-yhg.14236.1.1
4605 ret = get_bucket_index_object(bucket_oid_base, obj_key, binfo.num_shards,
4606 (RGWBucketInfo::BIShardsHashType)binfo.bucket_index_shard_hash_type, bucket_obj, shard_id)
BucketShard信息:
(gdb) print *bs
$62 = {
store = 0x3301c70,
bucket = {
name = "mmm",
data_pool = "dpool1",
data_extra_pool = ".rgw.buckets.extra",
index_pool = ".rgw.buckets.index",
marker = "yhg-yhg.14236.1",
bucket_id = "yhg-yhg.14236.1",
oid = ".bucket.meta.mmm:yhg-yhg.14236.1"
},
shard_id = 1,
index_ctx = {
io_ctx_impl = 0x7f5074008d10
},
bucket_obj = ".dir.yhg-yhg.14236.1.1"
}
binfo内容如下:
其中对于index shard来说关注的有num_shards和bucket_index_shard_hash_type。
(gdb) print binfo
$53 = {
bucket = {
name = "mmm",
data_pool = "dpool1",
data_extra_pool = ".rgw.buckets.extra",
index_pool = ".rgw.buckets.index",
marker = "yhg-yhg.14236.1",
bucket_id = "yhg-yhg.14236.1",
oid = ".bucket.meta.mmm:yhg-yhg.14236.1"
},
owner = "xx1",
flags = 0,
region = "yhg",
creation_time = 1461035367,
placement_rule = "default-placement",
has_instance_obj = true,
objv_tracker = {
read_version = {
ver = 1,
tag = "_TrpC7B0VOdoBEkokzucAQtd"
},
write_version = {
ver = 0,
tag = ""
}
},
ep_objv = {
ver = 0,
tag = ""
},
quota = {
max_size_kb = -1,
max_objects = -1,
enabled = false,
max_size_soft_threshold = -1,
max_objs_soft_threshold = -1
},
num_shards = 4,
bucket_index_shard_hash_type = 0 '\000',
static NUM_SHARDS_BLIND_BUCKET = 4294967295
}
**总结**
需要获取BucketShard信息时,套路如下
6589 BucketShard bs(this);
6590 int ret = bs.init(bucket, obj_instance)
6. bucket listing 获取所有对象列表
// rgw/rgw_rados.cc
2396 /**
2397 * get listing of the objects in a bucket.
2398 * bucket: bucket to list contents of
2399 * max: maximum number of results to return
2400 * prefix: only return results that match this prefix
2401 * delim: do not include results that match this string.
2402 * Any skipped results will have the matching portion of their name
2403 * inserted in common_prefixes with a "true" mark.
2404 * marker: if filled in, begin the listing with this object.
2405 * result: the objects are put in here.
2406 * common_prefixes: if delim is filled in, any matching prefixes are placed
2407 * here.
2408 */
2409 int RGWRados::Bucket::List::list_objects(int max, vector<RGWObjEnt> *result,
2410 map<string, bool> *common_prefixes,
2411 bool *is_truncated)
8084 int RGWRados::cls_bucket_list(rgw_bucket& bucket, rgw_obj_key& start, const string& prefix,
8085 uint32_t num_entries, bool list_versions, map<string, RGWObjEnt>& m,
8086 bool *is_truncated, rgw_obj_key *last_entry,
8087 bool (*force_check_filter)(const string& name))
...
8092 // key - oid (for different shards if there is any)
8093 // value - list result for the corresponding oid (shard), it is filled by the AIO callback
8094 map<int, string> oids; // 存放shard id
// CLSRGWIssueBucketList列举的结果存放在list_results中
8095 map<int, struct rgw_cls_list_ret> list_results;
// oids中存放bucket index shard 对象的名字
// 比如,
// (gdb) print oids
// $75 = std::map with 4 elements = {
// [0] = ".dir.yhg-yhg.14236.1.0",
// [1] = ".dir.yhg-yhg.14236.1.1",
// [2] = ".dir.yhg-yhg.14236.1.2",
// [3] = ".dir.yhg-yhg.14236.1.3"
//
// 调用了"4551 int RGWRados::open_bucket_index/get_bucket_index_objects"
// 获取了iods的名字列表(bucket index分片的名字)。
8096 int r = open_bucket_index(bucket, index_ctx, oids);
8097 if (r < 0)
8098 return r;
8099
8100 cls_rgw_obj_key start_key(start.name, start.instance);
// 对于oids中的所有对象,调用issue_op方法
// 详见:cls/rgw/cls_rgw_client.h:246
// issue_op中调用了 osd 端cls 函数 'rgw bucket_list'
8101 r = CLSRGWIssueBucketList(index_ctx, start_key, prefix, num_entries, list_versions,
8102 oids, list_results, cct->_conf->rgw_bucket_index_max_aio)();
list_results 结果示例:
(gdb) print list_results
$85 = std::map with 4 elements = {
[0] = {
dir = {
header = {
stats = std::map with 1 elements = {
[1 '\001'] = {
total_size = 18,
total_size_rounded = 8192,
num_entries = 2
}
},
tag_timeout = 0,
ver = 7,
master_ver = 0,
max_marker = "00000000006.219.3"
},
m = std::map with 2 elements = {
["h4h4"] = {
key = {
name = "h4h4",
instance = ""
},
ver = {
pool = 1,
epoch = 8
},
locator = "",
exists = true,
meta = {
category = 1 '\001',
size = 9,
mtime = {
tv = {
tv_sec = 1461035810,
tv_nsec = 0
}
},
etag = "bbb8aae57c104cda40c93843ad5e6db8",
owner = "xx1",
owner_display_name = "Zone user for yhg",
content_type = "application/octet-stream",
accounted_size = 9
},
pending_map = std::multimap with 0 elements,
index_ver = 4,
tag = "yhg-yhg.14236.5",
flags = 0,
versioned_epoch = 0
},
["sbsb"] = {
key = {
name = "sbsb",
instance = ""
},
ver = {
pool = 1,
epoch = 99
},
locator = "",
exists = true,
meta = {
category = 1 '\001',
size = 9,
mtime = {
tv = {
tv_sec = 1461055490,
tv_nsec = 0
}
},
etag = "bbb8aae57c104cda40c93843ad5e6db8",
owner = "xx1",
owner_display_name = "Zone user for yhg",
content_type = "application/octet-stream",
accounted_size = 9
},
pending_map = std::multimap with 0 elements,
index_ver = 6,
tag = "yhg-yhg.14236.29",
flags = 0,
versioned_epoch = 0
}
}
},
is_truncated = false
},
[1] = {
dir = {
header = {
stats = std::map with 1 elements = {
[1 '\001'] = {
total_size = 10485760,
total_size_rounded = 10485760,
num_entries = 1
}
},
tag_timeout = 0,
ver = 11,
master_ver = 0,
max_marker = "00000000010.67.3"
},
m = std::map with 1 elements = {
["tttt"] = {
key = {
name = "tttt",
instance = ""
},
ver = {
pool = 1,
epoch = 13
},
locator = "",
exists = true,
meta = {
category = 1 '\001',
size = 10485760,
mtime = {
tv = {
tv_sec = 1461132772,
tv_nsec = 0
}
},
etag = "219c7b0c38567750b218389f15c57e82",
owner = "xx1",
owner_display_name = "Zone user for yhg",
content_type = "application/octet-stream",
accounted_size = 10485760
},
pending_map = std::multimap with 0 elements,
index_ver = 10,
tag = "yhg-yhg.14236.49",
flags = 0,
versioned_epoch = 0
}
}
},
is_truncated = false
},
[2] = {
dir = {
header = {
stats = std::map with 1 elements = {
[1 '\001'] = {
total_size = 9,
total_size_rounded = 4096,
num_entries = 1
}
},
tag_timeout = 0,
ver = 11,
master_ver = 0,
max_marker = "00000000010.101.3"
},
m = std::map with 1 elements = {
["h5h5"] = {
key = {
name = "h5h5",
instance = ""
},
ver = {
pool = 1,
epoch = 34
},
locator = "",
exists = true,
meta = {
category = 1 '\001',
size = 9,
mtime = {
tv = {
tv_sec = 1461053473,
tv_nsec = 0
}
},
etag = "bbb8aae57c104cda40c93843ad5e6db8",
owner = "xx1",
owner_display_name = "Zone user for yhg",
content_type = "application/octet-stream",
accounted_size = 9
},
pending_map = std::multimap with 0 elements,
index_ver = 10,
tag = "yhg-yhg.14236.26",
flags = 0,
versioned_epoch = 0
}
}
},
is_truncated = false
},
[3] = {
dir = {
header = {
stats = std::map with 0 elements,
tag_timeout = 0,
ver = 1,
master_ver = 0,
max_marker = ""
},
m = std::map with 0 elements
},
is_truncated = false
}
}
rgw bucket_list
列举结果存放在 struct rgw_cls_list_ret 结构中。
375 struct rgw_cls_list_ret
376 {
377 rgw_bucket_dir dir;
378 bool is_truncated;
584 struct rgw_bucket_dir {
585 struct rgw_bucket_dir_header header;
586 std::map<string, struct rgw_bucket_dir_entry> m;
542 struct rgw_bucket_dir_header {
543 map<uint8_t, rgw_bucket_category_stats> stats;
544 uint64_t tag_timeout;
545 uint64_t ver;
546 uint64_t master_ver;
547 string max_marker;
516 struct rgw_bucket_category_stats {
517 uint64_t total_size;
518 uint64_t total_size_rounded;
519 uint64_t num_entries;
403 int rgw_bucket_list(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
// 读取bucket index 对象上的omap header
// (gdb) print new_dir.header
// $2 = {
// stats = std::map with 1 elements = {
// [1 '\001'] = {
// total_size = 18,
// total_size_rounded = 8192,
// num_entries = 2
// }
// },
// tag_timeout = 0,
// ver = 7,
// master_ver = 0,
// max_marker = "00000000006.219.3"
// }
//
415 struct rgw_cls_list_ret ret;
416 struct rgw_bucket_dir& new_dir = ret.dir;
417 int rc = read_bucket_header(hctx, &new_dir.header);
425 map<string, bufferlist> keys;
426 string start_key;
427 encode_list_index_key(hctx, op.start_obj, &start_key); // 没有什么作用
// 读取bucket index 对象的omap的各个k/v entry
428 rc = get_obj_vals(hctx, start_key, op.filter_prefix, op.num_entries + 1, &keys);
最终keys解析到new_dir.m中,返回给rgw instance端。如下
(gdb) print new_dir.m
$5 = std::map with 2 elements = {
["h4h4"] = {
key = {
name = "h4h4",
instance = ""
},
ver = {
pool = 1,
epoch = 8
},
locator = "",
exists = true,
meta = {
category = 1 '\001',
size = 9,
mtime = {
tv = {
tv_sec = 1461035810,
tv_nsec = 0
}
},
etag = "bbb8aae57c104cda40c93843ad5e6db8",
owner = "xx1",
owner_display_name = "Zone user for yhg",
content_type = "application/octet-stream",
accounted_size = 9
},
pending_map = std::multimap with 0 elements,
index_ver = 4,
tag = "yhg-yhg.14236.5",
flags = 0,
versioned_epoch = 0
},
["sbsb"] = {
key = {
name = "sbsb",
instance = ""
},
ver = {
pool = 1,
epoch = 99
},
locator = "",
exists = true,
meta = {
category = 1 '\001',
size = 9,
mtime = {
tv = {
tv_sec = 1461055490,
tv_nsec = 0
}
},
etag = "bbb8aae57c104cda40c93843ad5e6db8",
owner = "xx1",
owner_display_name = "Zone user for yhg",
content_type = "application/octet-stream",
accounted_size = 9
},
pending_map = std::multimap with 0 elements,
index_ver = 6,
tag = "yhg-yhg.14236.29",
flags = 0,
versioned_epoch = 0
}
}
参考
[1] `RadosGW Big Index <http://cephnotes.ksperis.com/blog/2015/05/12/radosgw-big-index/>`_
[2] `Rgw bucket index scalability <http://tracker.ceph.com/projects/ceph/wiki/Rgw_-_bucket_index_scalability>