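Ceph is an open-source, unified distributed storage system. RADOS provides its underlying object storage service and consists of monitors (mon) and object storage daemons (osd). The main entities RADOS operates on are pools, objects, and each object's xattrs and omap.
The RADOS gateway (RGW) is an object storage service built on top of RADOS. It exposes S3-compatible and Swift-compatible RESTful APIs to clients.
Bucket and object (key) are the two main data models that the RADOS gateway builds. This article describes how buckets and keys are designed in the gateway.
bucket: a container that holds keys. It can be thought of as a directory, except that buckets cannot be nested.
key: also called an object. It represents one complete piece of data uploaded to the storage service.
The RADOS gateway also defines data structures such as account, zone, and region. They are not the focus of this article and are not covered in detail.
The design of buckets and keys is introduced below through a set of real operations.
To create a bucket in the gateway and upload data, you first need to create a user and obtain a pair of authentication keys (access_key, secret_key).
gateway user
Create a user: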
# radosgw-admin user create --uid=yankun --display-name=yankun
{
"user_id": "yankun",
"display_name": "yankun",
"email": "",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "yankun",
"access_key": "FLNOEBKYFT7R0VA2ZH03",
"secret_key": "2a3O5epEHpnRw26Rb6tukdYJz6nQes6hCoO5fIM3"
}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"temp_url_keys": []
}
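Creating the user returns its access_key and secret_key. The s3cmd client is then used to create buckets and upload data.
In the s3cmd configuration file, set access_key, secret_key, and the service address.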
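As a minimal sketch of that configuration step, the helper below prints an s3cmd configuration to stdout. The function name make_s3cfg and the endpoint rgw.example.com:7480 are placeholders that do not come from this article; host_base, host_bucket, and use_https are standard s3cmd options, and setting host_bucket to the plain host is intended to keep s3cmd on path-style requests.

```shell
#!/bin/sh
# Print a minimal s3cmd configuration for an RGW endpoint.
# $1 = access_key, $2 = secret_key, $3 = RGW host[:port]
# (make_s3cfg is a hypothetical helper name, not part of s3cmd.)
make_s3cfg() {
    printf '[default]\n'
    printf 'access_key = %s\n' "$1"
    printf 'secret_key = %s\n' "$2"
    printf 'host_base = %s\n' "$3"
    # No %(bucket)s placeholder here, so the bucket is not used as a subdomain.
    printf 'host_bucket = %s\n' "$3"
    # Assumption: the gateway is served over plain HTTP.
    printf 'use_https = False\n'
}

# Example (placeholder endpoint, substitute your own gateway address):
# make_s3cfg "$ACCESS_KEY" "$SECRET_KEY" rgw.example.com:7480 > ~/.s3cfg
```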
Buckets in RGW
Create buckets:
# s3cmd mb s3://where_is_my_bucket
# s3cmd mb s3://where_is_my_bucket1
Show bucket information:
# radosgw-admin bucket stats --bucket=where_is_my_bucket
{
"bucket": "where_is_my_bucket",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets.index",
"id": "default.5762326.25",
"marker": "default.5762326.25",
"owner": "yankun",
"ver": "0#9",
"master_ver": "0#0",
"mtime": "2017-09-12 10:16:47.000000",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size_kb": 4105961,
"size_kb_actual": 4105964,
"num_objects": 3
}
},
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}
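Bucket entries:
Every bucket a user creates is recorded in the omap of the object yankun.buckets in the .users.uid pool. The omap key is the bucket name and the value is the bucket's information. For each user, the .users.uid pool holds two objects: {username} and {username}.buckets.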
# rados -p .users.uid listomapkeys yankun.buckets
where_is_my_bucket
where_is_my_bucket1
# rados -p .users.uid getomapval yankun.buckets where_is_my_bucket binary_where_is_my_bucket
Writing to binary_where_is_my_bucket
# ceph-dencoder type RGWBucketEnt import binary_where_is_my_bucket decode dump_json
{
"bucket": {
"name": "where_is_my_bucket",
"pool": ".rgw.buckets",
"data_extra_pool": ".rgw.buckets.extra",
"index_pool": ".rgw.buckets.index",
"marker": "default.5762326.25",
"bucket_id": "default.5762326.25"
},
"size": 4204504056,
"size_rounded": 4204507136,
"mtime": 1505182607,
"count": 3
}
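The size fields in this entry are in bytes, while radosgw-admin bucket stats reports size_kb and size_kb_actual. As a quick cross-check (an observation on this particular output, not a statement about how RGW computes the values), rounding the byte counts up to whole KB reproduces the numbers shown by bucket stats earlier. The helper name bytes_to_kb is made up for this sketch.

```shell
#!/bin/sh
# Round a byte count up to whole KB (1 KB = 1024 bytes).
# bytes_to_kb is a hypothetical helper for this cross-check only.
bytes_to_kb() {
    echo $(( ($1 + 1023) / 1024 ))
}

bytes_to_kb 4204504056   # "size" from the bucket entry
bytes_to_kb 4204507136   # "size_rounded" from the bucket entry
```

The two results can be compared against size_kb (4105961) and size_kb_actual (4105964) in the bucket stats output.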
The bucket's index object in RADOS:
For each bucket, RGW creates one object in the .rgw.buckets.index pool, named .dir.{bucket_id}.
# rados -p .rgw.buckets.index ls > .rgw.buckets.index
# grep default.5762326.25 .rgw.buckets.index
.dir.default.5762326.25
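The two naming rules seen so far ({username}.buckets and .dir.{bucket_id}) can be written as small helpers. This is only a sketch of the conventions observed in the listings above; the function names are made up, and the index name covers the single index object shown here.

```shell
#!/bin/sh
# Name of the per-user object whose omap lists the user's buckets.
user_buckets_obj() {
    printf '%s.buckets\n' "$1"
}

# Name of the bucket index object, as seen in the listing above.
bucket_index_obj() {
    printf '.dir.%s\n' "$1"
}

user_buckets_obj yankun
bucket_index_obj default.5762326.25

# Usage against a live cluster (commented out, needs rados access):
# rados -p .users.uid listomapkeys "$(user_buckets_obj yankun)"
# rados -p .rgw.buckets.index listomapkeys "$(bucket_index_obj default.5762326.25)"
```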
Bucket metadata:
A bucket's metadata is stored as a separate RADOS object in the .rgw pool, named .bucket.meta.{bucket_name}:{marker}.
# rados -p .rgw ls
where_is_my_bucket1
.bucket.meta.where_is_my_bucket1:default.5762326.26
where_is_my_bucket
.bucket.meta.where_is_my_bucket:default.5762326.25
# rados -p .rgw get .bucket.meta.where_is_my_bucket:default.5762326.25 binary.bucket.meta.where_is_my_bucket:default.5762326.25
# ceph-dencoder type RGWBucketInfo import binary.bucket.meta.where_is_my_bucket\:default.5762326.25 decode dump_json
{
"bucket": {
"name": "where_is_my_bucket",
"pool": ".rgw.buckets",
"data_extra_pool": ".rgw.buckets.extra",
"index_pool": ".rgw.buckets.index",
"marker": "default.5762326.25",
"bucket_id": "default.5762326.25"
},
"creation_time": 1505182607,
"owner": "yankun",
"flags": 0,
"region": "default",
"placement_rule": "default-placement",
"has_instance_obj": "true",
"quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"num_shards": 0,
"bi_shard_hash_type": 0
}
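The metadata object name follows the pattern .bucket.meta.{bucket_name}:{marker}, so it can be built from, or split back into, its two parts with plain shell parameter expansion. The helper names below are hypothetical, and the split assumes the bucket name itself contains no colon.

```shell
#!/bin/sh
# Build the metadata object name from bucket name ($1) and marker ($2).
bucket_meta_obj() {
    printf '.bucket.meta.%s:%s\n' "$1" "$2"
}

# Extract the bucket name: drop the fixed prefix, keep text before the first ':'.
meta_bucket_name() {
    n=${1#.bucket.meta.}
    printf '%s\n' "${n%%:*}"
}

# Extract the marker: drop the fixed prefix, keep text after the first ':'.
meta_marker() {
    n=${1#.bucket.meta.}
    printf '%s\n' "${n#*:}"
}

bucket_meta_obj where_is_my_bucket default.5762326.25
meta_bucket_name .bucket.meta.where_is_my_bucket1:default.5762326.26
meta_marker .bucket.meta.where_is_my_bucket1:default.5762326.26
```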
The bucket's ACL is stored in the user.rgw.acl xattr of the .bucket.meta.{bucket_name}:{marker} object.
# rados -p .rgw getxattr .bucket.meta.where_is_my_bucket:default.5762326.25 user.rgw.acl > binary.bucket.acl
# ceph-dencoder type RGWAccessControlPolicy import binary.bucket.acl decode dump_json
{
"acl": {
"acl_user_map": [
{
"user": "yankun",
"acl": 15
}
],
"acl_group_map": [],
"grant_map": [
{
"id": "yankun",
"grant": {
"type": {
"type": 0
},
"id": "yankun",
"email": "",
"permission": {
"flags": 15
},
"name": "yankun",
"group": 0
}
}
]
},
"owner": {
"id": "yankun",
"display_name": "yankun"
}
}
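The value 15 in "acl" and "flags" reads as a bitmask. The sketch below decodes it under the assumption that the bits are READ=1, WRITE=2, READ_ACP=4, and WRITE_ACP=8, as in RGW's ACL permission definitions; the article itself does not state this mapping, so check it against the source of your Ceph version. decode_acl_flags is a made-up helper name.

```shell
#!/bin/sh
# Decode an RGW ACL permission bitmask into names.
# Assumed bit values (not stated in this article):
#   READ=1, WRITE=2, READ_ACP=4, WRITE_ACP=8
decode_acl_flags() {
    flags=$1
    out=""
    if [ $(( flags & 1 )) -ne 0 ]; then out="$out READ"; fi
    if [ $(( flags & 2 )) -ne 0 ]; then out="$out WRITE"; fi
    if [ $(( flags & 4 )) -ne 0 ]; then out="$out READ_ACP"; fi
    if [ $(( flags & 8 )) -ne 0 ]; then out="$out WRITE_ACP"; fi
    # Strip the leading space left by the first append.
    echo "${out# }"
}

decode_acl_flags 15
```

Under that assumption, 15 means all four permissions are granted to the owner, which corresponds to full control.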
Objects in RGW
An object can only be stored in a bucket. Here a large file, where_is_my_object.txt, is created so that it can be uploaded to the bucket.
Create the large file:
# dd if=/dev/zero of=./where_is_my_object.txt bs=2M count=1000
# du where_is_my_object.txt -h
2.0G where_is_my_object.txt
Upload the large file to the bucket:
# s3cmd put where_is_my_object.txt s3://where_is_my_bucket
upload: 'where_is_my_object.txt' -> 's3://where_is_my_bucket/where_is_my_object.txt' [1 of 1]
2097152000 of 2097152000 100% in 123s 16.24 MB/s done
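Mapping between object and bucket
The file was uploaded to the bucket where_is_my_bucket, whose id is default.5762326.25. The relationship between the object and the bucket is maintained in the omap of the .dir.{bucket_id} object.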
# rados -p .rgw.buckets.index listomapkeys .dir.default.5762326.25
where_is_my_object.txt
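Object naming format:
An uploaded object is stored in RADOS as either a single object or several objects, depending on its size.
Object data is stored in the .rgw.buckets pool. If the uploaded data is larger than 512 KB, it is stored as several objects: one head object (512 KB) and one or more tail objects (up to 4 MB each by default). The head object is named {bucket_id}_{object_name}. For example, the object where_is_my_object.txt in the bucket where_is_my_bucket is stored in .rgw.buckets under the name:
default.5762326.25_where_is_my_object.txt. Tail objects are named {bucket_id}__shadow_{prefix}{N}, where {prefix} is the prefix recorded in the head object's manifest and N is a sequence number starting from 1.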
# du default.5762326.25_where_is_my_object.txt
512 default.5762326.25_where_is_my_object.txt
# du default.5762326.25__shadow_.h_oQhOgqDTmDZx2FUSm8zMTOlbhDQsq_99
4096 default.5762326.25__shadow_.h_oQhOgqDTmDZx2FUSm8zMTOlbhDQsq_99
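The naming pattern can be sketched as two helpers that rebuild the names shown above. The function names are hypothetical. The prefix argument is the random string visible in the tail object name (here .h_oQhOgqDTmDZx2FUSm8zMTOlbhDQsq_, including its leading dot and trailing underscore), which is also recorded in the head object's user.rgw.manifest xattr.

```shell
#!/bin/sh
# Head object name: {bucket_id}_{object_name}
head_obj_name() {
    printf '%s_%s\n' "$1" "$2"
}

# Tail object name: {bucket_id}__shadow_{prefix}{N}
# $1 = bucket_id, $2 = manifest prefix, $3 = sequence number (from 1)
tail_obj_name() {
    printf '%s__shadow_%s%s\n' "$1" "$2" "$3"
}

head_obj_name default.5762326.25 where_is_my_object.txt
tail_obj_name default.5762326.25 .h_oQhOgqDTmDZx2FUSm8zMTOlbhDQsq_ 99
```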
Object metadata:
Object metadata is stored in the xattrs of the head object.
# rados -p .rgw.buckets listxattr default.5762326.25_where_is_my_object.txt
user.rgw.acl
user.rgw.content_type
user.rgw.etag
user.rgw.idtag
user.rgw.manifest
user.rgw.x-amz-date
user.rgw.x-amz-meta-s3cmd-attrs
user.rgw.x-amz-storage-class
The object's user.rgw.manifest attribute:
# rados -p .rgw.buckets getxattr default.5762326.25_where_is_my_object.txt user.rgw.manifest > ./binary.default.5762326.25_where_is_my_object.txt.user.rgw.manifest
# ceph-dencoder type RGWObjManifest import binary.default.5762326.25_where_is_my_object.txt.user.rgw.manifest decode dump_json
{
"objs": [],
"obj_size": 2097152000,
"explicit_objs": "false",
"head_obj": {
"bucket": {
"name": "where_is_my_bucket",
"pool": ".rgw.buckets",
"data_extra_pool": ".rgw.buckets.extra",
"index_pool": ".rgw.buckets.index",
"marker": "default.5762326.25",
"bucket_id": "default.5762326.25"
},
"key": "",
"ns": "",
"object": "where_is_my_object.txt",
"instance": ""
},
"head_size": 524288,
"max_head_size": 524288,
"prefix": ".h_oQhOgqDTmDZx2FUSm8zMTOlbhDQsq_",
"tail_bucket": {
"name": "where_is_my_bucket",
"pool": ".rgw.buckets",
"data_extra_pool": ".rgw.buckets.extra",
"index_pool": ".rgw.buckets.index",
"marker": "default.5762326.25",
"bucket_id": "default.5762326.25"
},
"rules": [
{
"key": 0,
"val": {
"start_part_num": 0,
"start_ofs": 524288,
"part_size": 0,
"stripe_max_size": 4194304,
"override_prefix": ""
}
}
]
}
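From the manifest fields obj_size, head_size, and stripe_max_size, the number of tail objects and the size of the last one can be estimated. The sketch below assumes the simple layout shown here (a single rule with part_size 0, meaning a non-multipart upload); it is not RGW's own code, and the function names are made up.

```shell
#!/bin/sh
# Number of tail objects: data beyond the head, divided into stripes, rounded up.
# $1 = obj_size, $2 = head_size, $3 = stripe_max_size (all in bytes)
tail_count() {
    obj_size=$1; head_size=$2; stripe=$3
    if [ "$obj_size" -le "$head_size" ]; then
        echo 0
        return
    fi
    echo $(( (obj_size - head_size + stripe - 1) / stripe ))
}

# Size in bytes of the last tail object (0 if there are no tails).
last_tail_size() {
    obj_size=$1; head_size=$2; stripe=$3
    if [ "$obj_size" -le "$head_size" ]; then
        echo 0
        return
    fi
    rem=$(( (obj_size - head_size) % stripe ))
    if [ "$rem" -eq 0 ]; then echo "$stripe"; else echo "$rem"; fi
}

tail_count 2097152000 524288 4194304
last_tail_size 2097152000 524288 4194304
```

For this object the sketch gives 500 tail objects, the last one holding 3670016 bytes, which is consistent with a tail numbered 99 existing in the pool.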
Object ACL:
# rados -p .rgw.buckets getxattr default.5762326.25_where_is_my_object.txt user.rgw.acl > binary.object.acl
# ceph-dencoder type RGWAccessControlPolicy import binary.object.acl decode dump_json
{
"acl": {
"acl_user_map": [
{
"user": "yankun",
"acl": 15
}
],
"acl_group_map": [],
"grant_map": [
{
"id": "yankun",
"grant": {
"type": {
"type": 0
},
"id": "yankun",
"email": "",
"permission": {
"flags": 15
},
"name": "yankun",
"group": 0
}
}
]
},
"owner": {
"id": "yankun",
"display_name": "yankun"
}
}
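Restoring data manually
Based on the object model described above, a complete object can be retrieved without going through the RADOS gateway.
Create a test object named location_object: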
# du -h location_object
9.8M location_object
MD5 of the local object:
# md5sum location_object
24796d54d73d694168170135091f7eba location_object
Upload the object to where_is_my_bucket:
# s3cmd put location_object s3://where_is_my_bucket
upload: 'location_object' -> 's3://where_is_my_bucket/location_object' [1 of 1]
10200056 of 10200056 100% in 0s 77.72 MB/s
10200056 of 10200056 100% in 4s 2.18 MB/s done
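Object splitting:
By design, the object is stored in RADOS as 4 objects: one head object and 3 tail objects. After the 524288-byte head, 10200056 - 524288 = 9675768 bytes remain, which need three stripes of up to 4 MB.
Head object: default.5762326.25_location_object
Tail objects: default.5762326.25__shadow_{prefix}{1,2,3}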
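To make the split concrete, the sketch below prints the byte range that each part would cover, assuming a 524288-byte head followed by stripes of at most 4194304 bytes (the defaults described earlier). part_layout is a hypothetical helper, and the labels are for illustration only.

```shell
#!/bin/sh
# Print "label first_byte last_byte" for the head and each tail object.
# $1 = obj_size, $2 = head_size, $3 = stripe_max_size (all in bytes)
part_layout() {
    obj_size=$1; head_size=$2; stripe=$3
    ofs=0
    n=0
    while [ "$ofs" -lt "$obj_size" ]; do
        # Part 0 is the head; every later part is a tail stripe.
        if [ "$n" -eq 0 ]; then
            len=$head_size; label=head
        else
            len=$stripe; label="tail_$n"
        fi
        # The final part only holds whatever data is left.
        remaining=$(( obj_size - ofs ))
        if [ "$len" -gt "$remaining" ]; then len=$remaining; fi
        echo "$label $ofs $(( ofs + len - 1 ))"
        ofs=$(( ofs + len ))
        n=$(( n + 1 ))
    done
}

part_layout 10200056 524288 4194304
```

For location_object this lists four parts, with the third tail covering the last 1287160 bytes.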
Head object:
# rados -p .rgw.buckets ls | grep location
default.5762326.25_location_object
The object's prefix:
# rados -p .rgw.buckets getxattr default.5762326.25_location_object user.rgw.manifest > ./binary.default.5762326.25_location_object.user.rgw.manifest
# ceph-dencoder type RGWObjManifest import binary.default.5762326.25_location_object.user.rgw.manifest decode dump_json
{
"objs": [],
"obj_size": 10200056,
"explicit_objs": "false",
"head_obj": {
"bucket": {
"name": "where_is_my_bucket",
"pool": ".rgw.buckets",
"data_extra_pool": ".rgw.buckets.extra",
"index_pool": ".rgw.buckets.index",
"marker": "default.5762326.25",
"bucket_id": "default.5762326.25"
},
"key": "",
"ns": "",
"object": "location_object",
"instance": ""
},
"head_size": 524288,
"max_head_size": 524288,
"prefix": ".Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_",
"tail_bucket": {
"name": "where_is_my_bucket",
"pool": ".rgw.buckets",
"data_extra_pool": ".rgw.buckets.extra",
"index_pool": ".rgw.buckets.index",
"marker": "default.5762326.25",
"bucket_id": "default.5762326.25"
},
"rules": [
{
"key": 0,
"val": {
"start_part_num": 0,
"start_ofs": 524288,
"part_size": 0,
"stripe_max_size": 4194304,
"override_prefix": ""
}
}
]
}
The tail objects are: default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_{1,2,3}
Fetch the split objects:
Use rados to fetch each of the parts:
# rados -p .rgw.buckets get default.5762326.25_location_object ./location_head
# rados -p .rgw.buckets get default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_1 ./default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_1
# rados -p .rgw.buckets get default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_2 ./default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_2
# rados -p .rgw.buckets get default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_3 ./default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_3
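For objects with many tails, typing each command by hand is impractical. The sketch below only prints the rados get commands for the head and N tails so they can be reviewed before being run. The helper name print_fetch_cmds and the local file names part_0, part_1, ... are assumptions of this sketch.

```shell
#!/bin/sh
# Print (do not execute) the rados commands that fetch every part of an object.
# $1 = pool, $2 = bucket_id, $3 = object name, $4 = manifest prefix, $5 = tail count
print_fetch_cmds() {
    pool=$1; bucket_id=$2; object=$3; prefix=$4; tails=$5
    # part_0 is the head object.
    echo "rados -p $pool get ${bucket_id}_${object} ./part_0"
    i=1
    while [ "$i" -le "$tails" ]; do
        echo "rados -p $pool get ${bucket_id}__shadow_${prefix}${i} ./part_$i"
        i=$(( i + 1 ))
    done
}

print_fetch_cmds .rgw.buckets default.5762326.25 location_object \
    .Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_ 3
```

Piping the output to sh would run the commands against the cluster.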
Concatenate the parts:
# cat location_head > new_location_object
# cat default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_1 >> new_location_object
# cat default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_2 >> new_location_object
# cat default.5762326.25__shadow_.Ux77sSsCN2UdioL5XxO0Hx8Ph9oXb35_3 >> new_location_object
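The same concatenation can be written as a loop. This sketch assumes the parts were saved in the current directory as part_0 (head) and part_1 through part_N (tails); those file names and the helper name join_parts are assumptions, not something the commands above produce. A numeric loop is used instead of a shell glob because a glob would sort part_10 before part_2.

```shell
#!/bin/sh
# Concatenate part_0, part_1 ... part_N into one output file, in numeric order.
# $1 = output file, $2 = number of tail parts (N)
join_parts() {
    out=$1; tails=$2
    # Start the output with the head object.
    cat ./part_0 > "$out" || return 1
    i=1
    while [ "$i" -le "$tails" ]; do
        cat "./part_$i" >> "$out" || return 1
        i=$(( i + 1 ))
    done
}

# Example (needs the part files to exist):
# join_parts new_location_object 3 && md5sum new_location_object
```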
MD5 of new_location_object:
# md5sum new_location_object
24796d54d73d694168170135091f7eba new_location_object
Note: the reassembled object has the same MD5 as the original object, so its content is unchanged.