The NFSv4 protocol
nfs-ganesha supports both NFSv3 and NFSv4. NFSv4 is, as is well known, a stateful protocol and differs substantially from v3. NFSv4.1 additionally introduced RECLAIM_COMPLETE, which lets the server end the grace period early and speed up recovery; the recovery process described in this article therefore assumes NFSv4.1 or later.
Below is a brief introduction to some key concepts of the NFSv4.1 protocol. For a fuller understanding of the protocol, refer to RFC 5661; for the evolution of NFS and the key changes between versions, see the articles NFS各个版本之间的比较 and NFSv4.0 的变化.
Key concepts
lease
NFSv4 implements lease-based locking. Every lock has the same lease lifetime, and the client's read and write operations refresh the lease. Once the lease expires, the server releases the locks it covers. All of the state a server holds for a given client shares a single lease.
Client ID
A 64-bit ID assigned by the server once the client has been authenticated; it identifies the client in all subsequent communication.
stateid
A 128-bit ID identifying a particular file state (open/lock) that the server maintains for a particular state owner.
Whenever the server grants a client a lock of any type, it returns a unique stateid identifying that lock (or a group of locks); consequently, different open-owners opening the same file use different stateids. Delegations and layouts also have associated stateids. A stateid serves as a shorthand reference to a lock: given a stateid, the server can determine the associated state owner and filehandle, and whenever a stateid is used, the current filehandle must be the one associated with it. All stateids associated with a given client ID are tied to a common lease, meaning the server maintains both the stateids and the objects they represent.
A stateid sits at the same level as the clientid associated with the current session: a stateid is valid in every session created with its associated clientid, so a client may obtain a stateid in one session and use it in another session it has created.
state owner
Every piece of state has a corresponding state owner; a client request passes the server's checks only if it carries the correctly matched state owner and stateid.
filehandle
A filehandle is a unique identifier for a filesystem object; it is unique within a single NFS server.
The contents of a filehandle are opaque to the client, so the server is responsible for translating a filehandle into its internal representation of the filesystem object.
State reclaim and the grace period
When a server restart loses state and the associated lock information, the protocol must provide a recovery mechanism to re-establish that state. For most kinds of locking state (layouts being the exception), this takes the form of a request that lets the client re-establish on the server the locks it obtained from the previous server instance. These requests are variants of the requests normally used to create locks of that type; they are called "reclaim-type" requests, and the process of re-establishing such locks is called "reclaiming" them.
Because every client must get a chance to reclaim the locks it held, while other clients must not be granted conflicting locks in the meantime, NFSv4 defines a grace period for handling reclaims. During the grace period, client requests such as creating a client ID or a session are processed normally, but locking requests are specially restricted: only reclaim-type requests are allowed, unless the server can reliably determine (via persistent storage from the restarted instance; ganesha does not store state at this granularity) that granting a lock would not conflict with a subsequent reclaim. If a client sends a request for a new lock (a non-reclaim-type request) to a server that is in its grace period, the server replies NFS4ERR_GRACE.
After the two sides have established a session with the new clientid, the client uses reclaim-type requests (e.g. a LOCK request with its reclaim field set to true, or an OPEN request with claim type CLAIM_PREVIOUS) to re-establish its lock state. Once that is done, or if it has no lock state of this kind to reclaim, the client sends a RECLAIM_COMPLETE request (e.g. with rca_one_fs set to false, indicating it has reclaimed everything it intends to) to notify the server. Having sent RECLAIM_COMPLETE, the client may start issuing non-reclaim-type locking requests, although these may still receive NFS4ERR_GRACE if the server has not yet ended its grace period.
The simplest behaviour during the grace period is for the server to reject all READ and WRITE requests and all non-reclaim locking requests (such as other LOCKs and OPENs) with NFS4ERR_GRACE. If the server persistently stores lock information, READ and WRITE can in fact be handled safely, but ganesha does not persist lock information, so we do not consider that exception here.
Note: both the grace period and the lease are configurable in nfs-ganesha, but grace >= lease must hold; the defaults are 90 s and 60 s respectively.
rados_ng
Recovery backend overview and comparison
rados_ng is one of nfs-ganesha's three recovery backends. The characteristics of the three are as follows:
- rados_kv: This is the original rados recovery backend that uses a key-value store. It has support for "takeover" operations: merging one recovery database into another such that another cluster node could take over addresses that were hosted on another node. Note that this recovery backend may not survive crashes that occur during a grace period. If it crashes and then crashes again during the grace period, the server is likely to fail to allow any clients to recover (the db will be trashed).
- rados_ng: a more resilient rados_kv backend. This one does not support takeover operations, but it should properly survive crashes that occur during the grace period.
- rados_cluster: the new (experimental) clustered recovery backend. This one also does not support address takeover, but should be resilient enough to handle crashes that occur during the grace period.
See Difference among all the recovery backend.
rados_ng design approach
At startup, create a global write op, and set it up to clear out all of the old keys. We then will spool up new client creation (and removals) to that transaction during the grace period.
When lifting the grace period, synchronously commit the transaction
to the kvstore. After that point, all client creation and removal is done synchronously to the kvstore.
This allows for better resilience when the server crashes during the grace period. No changes are made to the backing store until the grace period has been lifted.
rados_ng recovery flow analysis
Flow diagrams
The following figures show the whole recovery process, both end to end and in detail:
- Figure 1: complete rados_ng recovery sequence diagram
- Figure 2: recovery-related flow at nfs-ganesha startup
- Figure 3: recovery-related flow when CREATE_SESSION is handled during session re-establishment
- Figure 4: ganesha OPEN handling flow
- Figure 5: ganesha LOCK handling flow
@1: on clid_count != reclaim_completes: both are static global variables. clid_count is the number of clients currently recorded in memory, and reclaim_completes is the number of clients that have finished reclaiming.
In nfs_start_grace, both are first reset to 0 and the client information is then loaded from the backend, so when the server enters the grace period at startup the two are unequal if any clients were mounted before the restart.
Only after startup completes and the clients resend their state to the server does each finished reclaim increment the reclaim count; once it equals clid_count, every client has finished reclaiming and the grace period can be ended.
Key structures and definitions
The structure that maintains recovery client information in memory:
//Note: clid_list only records recovery clients and is written only when the grace period starts; the detailed client information is kept in the hash table.
static struct glist_head clid_list = GLIST_HEAD_INIT(clid_list);
typedef struct clid_entry {
struct glist_head cl_list; /*< Link in the list */
struct glist_head cl_rfh_list;
char cl_name[PATH_MAX]; /*< Client name */
} clid_entry_t;
struct glist_head {
struct glist_head *next;
struct glist_head *prev;
};
//What this actually looks like in the backend:
6723145641117089793
value (60 bytes) :
00000000 3a 3a 66 66 66 66 3a 31 39 32 2e 31 36 38 2e 32 |::ffff:192.168.2|
00000010 2e 31 36 2d 28 33 35 3a 4c 69 6e 75 78 20 4e 46 |.16-(35:Linux NF|
00000020 53 76 34 2e 31 20 6c 6f 63 61 6c 68 6f 73 74 2e |Sv4.1 localhost.|
00000030 6c 6f 63 61 6c 64 6f 6d 61 69 6e 29 |localdomain)|
0000003c
Definition of the NFSv4 recovery backend structure, and how it is populated when rados_ng is the backend:
struct nfs4_recovery_backend {
int (*recovery_init)(void);
void (*recovery_shutdown)(void);
void (*recovery_read_clids)(nfs_grace_start_t *gsp,
add_clid_entry_hook add_clid,
add_rfh_entry_hook add_rfh);
void (*add_clid)(nfs_client_id_t *);
void (*rm_clid)(nfs_client_id_t *);
void (*add_revoke_fh)(nfs_client_id_t *, nfs_fh4 *);
void (*end_grace)(void);
void (*maybe_start_grace)(void);
bool (*try_lift_grace)(void);
void (*set_enforcing)(void);
bool (*grace_enforcing)(void);
bool (*is_member)(void);
int (*get_nodeid)(char **pnodeid);
};
struct nfs4_recovery_backend rados_ng_backend = {
.recovery_init = rados_ng_init,
.recovery_shutdown = rados_kv_shutdown,
.end_grace = rados_ng_cleanup_old,
.recovery_read_clids = rados_ng_read_recov_clids_takeover,
.add_clid = rados_ng_add_clid,
.rm_clid = rados_ng_rm_clid,
.add_revoke_fh = rados_kv_add_revoke_fh,
};
Filesystem object:
/**
* @brief Public structure for filesystem objects
*
* This structure is used for files of all types including directories
* and anything else that can be operated on via NFS.
*
 * All functions that create a new object handle should allocate
* memory for the complete (public and private) handle and perform any
* private initialization. They should fill the
* @c fsal_obj_handle::attributes structure. They should also call the
* @c fsal_obj_handle_init function with the public object handle,
* object handle operations vector, public export, and file type.
*
* @note Do we actually need a lock and ref count on the fsal object
* handle, since cache_inode is managing life cycle and concurrency?
* That is, do we expect fsal_obj_handle to have a reference count
* that would be separate from that managed by cache_inode_lru?
*/
struct fsal_obj_handle {
struct glist_head handles; /*< Link in list of handles under
the same FSAL. */
struct fsal_filesystem *fs; /*< Owning filesystem */
struct fsal_module *fsal; /*< Link back to fsal module */
struct fsal_obj_ops *obj_ops; /*< Operations vector */
pthread_rwlock_t obj_lock; /*< Lock on handle */
/* Static attributes */
object_file_type_t type; /*< Object file type */
fsal_fsid_t fsid; /*< Filesystem on which this object is
stored */
uint64_t fileid; /*< Unique identifier for this object within
the scope of the fsid, (e.g. inode number) */
struct state_hdl *state_hdl; /*< State related to this handle */
};
state owner:
/**
* @brief General state owner
*
* This structure encodes the owner of any state, protocol specific
* information is contained within the union.
*/
struct state_owner_t {
state_owner_type_t so_type; /*< Owner type */
struct glist_head so_lock_list; /*< Locks for this owner */
#ifdef DEBUG_SAL
struct glist_head so_all_owners; /**< Global list of all state owners */
#endif /* DEBUG_SAL */
pthread_mutex_t so_mutex; /*< Mutex on this owner */
int32_t so_refcount; /*< Reference count for lifecyce management */
int so_owner_len; /*< Length of owner name */
char *so_owner_val; /*< Owner name */
union {
state_nfs4_owner_t so_nfs4_owner; /*< All NFSv4 state owners */
state_nlm_owner_t so_nlm_owner; /*< NLM lock and share
owners */
#ifdef _USE_9P
state_9p_owner_t so_9p_owner; /*< 9P lock owners */
#endif
} so_owner;
};
state:
/**
* @brief Structure representing a single NFSv4 state
*
* Each state is identified by stateid and represents some resource or
* set of resources.
*
* The lists are protected by the state_lock
*/
struct state_t {
struct glist_head state_list; /**< List of states on a file */
struct glist_head state_owner_list; /**< List of states for an owner */
struct glist_head state_export_list; /**< List of states on the same
export */
#ifdef DEBUG_SAL
struct glist_head state_list_all; /**< Global list of all stateids */
#endif
pthread_mutex_t state_mutex; /**< Mutex protecting following pointers */
struct gsh_export *state_export; /**< Export this entry belongs to */
/* Don't re-order or move these next two. They are used for hashing */
state_owner_t *state_owner; /**< State Owner related to state */
struct fsal_obj_handle *state_obj; /**< owning object */
struct fsal_export *state_exp; /**< FSAL export */
union state_data state_data;
enum state_type state_type;
u_int32_t state_seqid; /**< The NFSv4 Sequence id */
int32_t state_refcount; /**< Refcount for state_t objects */
char stateid_other[OTHERSIZE]; /**< "Other" part of state id,
used as hash key */
struct state_refer state_refer; /**< For NFSv4.1, track the
call that created a
state. */
};
Summary
The normal rados_ng recovery flow has two phases. The first, before startup completes, is the server's own preparation; the second, after startup, is the recovery carried out through interaction between client and server.
What each phase does is summarized below:
- Phase 1:
  - nfs4_recovery_init: checks whether a recovery object exists in the backend and creates one if not, named ${nodeid}_recov, e.g. node0_recov; it also sets up a pending operation that will clear all kv pairs from the omap.
  - nfs_start_grace: starts the grace period (default 90 s), clears the client information held in memory, and reads the client information from the backend into the in-memory clid_list.
  - nfs_wait_for_grace_enforcement: checks whether the number of loaded clients equals the number of clients that have finished reclaiming; if they are equal, reclaiming is done and the grace period can end; otherwise it simply returns.
    In the normal case they are certainly unequal, so after this returns the server remains in the grace period, waiting for the clients' subsequent reclaim requests; if none arrive, the grace period ends on timeout.
- Phase 2:
  The client and server begin re-establishing the session. Session establishment takes five exchanges (see Figure 1), of which only the last two concern us here.
  - After the client sends an EXCHANGE_ID request, the server assigns it a new clientid; the client records this id and uses it to identify itself in all subsequent requests.
  - The client sends a CREATE_SESSION request; on receiving it, the server:
    - sets up an operation that will store the new client information in the backend;
    - sets cid_recov_tag in this client's record in the hash table to the client information;
    - walks clid_list looking for an entry for this clientid and sets its cid_allow_reclaim to true, meaning the client is allowed to reclaim. Having done all this, the server replies NFS4_OK; the client, now knowing it may reclaim, sends two requests to do so.
  - The client sends a reclaim-type OPEN request; on receiving it, the server:
    - creates a new open owner;
    - checks the clientid's cid_allow_reclaim and cid_reclaim_complete values; the former is true and the latter false (the client is allowed to reclaim and has not yet finished), so the reclaim conditions are met;
    - takes the state lock of the fsal object associated with the current filehandle;
    - adds the state to the hash table and associates it with that fsal object, with the newly created open owner as its owner;
    - updates the state's seqid and returns the newly created open stateid to the client.
  - The client sends a reclaim-type LOCK request using the stateid just returned; on receiving it, the server:
    - looks up the state record and its owner via that stateid;
    - searches the hash table for a lock-type record matching the clientid, owner, seqid, etc. from the client's request; there is none, since what was created earlier was an open state and open owner, so it creates a lock owner here and associates it with the open owner created earlier;
    - takes the state lock of the fsal object associated with the current filehandle;
    - creates a new lock-type state, associates it with the state created during the earlier OPEN, and likewise inserts it into the hash table;
    - performs the actual locking;
    - updates the state's seqid and returns the newly created lock stateid to the client.
  - Once the client receives the response to the reclaim-type LOCK request, it considers the reclaim finished and sends a RECLAIM_COMPLETE request. On receiving it, the server increments the reclaim_completes count and sets the client's cid_reclaim_complete to true, marking that client's reclaim as complete.
  - The reaper thread then checks clid_count against reclaim_completes; the two are now equal, so the grace period can end. The reaper thread ends the grace period and commits the previously prepared operations that clear the old omap data and write the new data, updating the backend. At this point recovery is fully complete.