OCF Developer's Guide, Chapter 5

weixin_33701294

Original link: http://blog.51cto.com/dangzhiqiang/1385911

5 Resource agent actions

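Each action is typically implemented in its own function or method. By convention, these are named <agent>_<action>, so the function implementing the start action of foobar is named foobar_start().

As a general rule, whenever a resource agent encounters an unrecoverable error, it may exit immediately, throw an exception, or otherwise stop executing. Typical cases are a misconfiguration, a missing binary, or a permissions problem. There is no need to pass such errors up the call stack.

It is the cluster manager's responsibility to initiate the appropriate recovery action based on the user's configuration. Where the configuration is explicit, the resource agent must not second-guess it.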
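The naming convention above pairs naturally with a small dispatcher that maps the action name given on the command line to the matching function. The following is a minimal sketch and not taken from the guide: the stub action bodies and the foobar_main wrapper are hypothetical, and the numeric return codes are hard-coded only so that the snippet runs without sourcing the OCF shell function library.

```shell
#!/bin/sh
# Stand-in values; a real agent gets these by sourcing ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3

# Hypothetical stub actions that follow the <agent>_<action> convention.
foobar_start()   { echo "starting";   return $OCF_SUCCESS; }
foobar_stop()    { echo "stopping";   return $OCF_SUCCESS; }
foobar_monitor() { echo "monitoring"; return $OCF_SUCCESS; }

# Map the requested action name to its function (hypothetical wrapper).
foobar_main() {
    case "$1" in
        start)   foobar_start ;;
        stop)    foobar_stop ;;
        monitor) foobar_monitor ;;
        *)
            echo "action '$1' is not implemented" >&2
            return $OCF_ERR_UNIMPLEMENTED
            ;;
    esac
}
```

In a complete agent script, the last line would typically hand over the first command-line argument, along the lines of foobar_main "$1"; exit $?.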

5.1 start action

When the start action is invoked, the resource agent must start the resource unless it is already running. This means the agent must verify the resource's configuration, query its state, and start the resource only if it is not yet running. A common way to do this is to call the validate_all and monitor functions first, as in the following example:

foobar_start() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is already running, bail out early
    if foobar_monitor; then
        ocf_log info "Resource is already running"
        return $OCF_SUCCESS
    fi

    # actually start up the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ...

    # After the resource has been started, check whether it started up
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # start up within the defined timeout, the cluster manager will
    # consider the start action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not started yet, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

5.2 stop action

When the stop action is invoked, the resource agent must stop the resource if it is running. This means the agent must verify the resource's configuration, query its state, and stop the resource only if it is currently running. A common way to do this is to call the validate_all and monitor functions first. It is important to understand that stop is a forcing operation: the resource agent must do everything in its power to shut the resource down, short of rebooting the node or powering it off. Consider the following example:

foobar_stop() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_SUCCESS")
            # Currently running. Normal, expected behavior.
            ocf_log debug "Resource is currently running"
            ;;
        "$OCF_RUNNING_MASTER")
            # Running as a Master. Need to demote before stopping.
            ocf_log info "Resource is currently running as Master"
            foobar_demote || \
                ocf_log warn "Demote failed, trying to stop anyway"
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Nothing to do.
            ocf_log info "Resource is already stopped"
            return $OCF_SUCCESS
            ;;
    esac

    # actually shut down the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ...

    # After the resource has been stopped, check whether it shut down
    # correctly. If the resource stops asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # shut down within the defined timeout, the cluster manager will
    # consider the stop action failed
    while foobar_monitor; do
        ocf_log debug "Resource has not stopped yet, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

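Note:

The return code for a successful stop action is $OCF_SUCCESS, not $OCF_NOT_RUNNING.

Important:

A failed stop action is a potentially dangerous situation, which the cluster manager will almost always try to resolve by fencing, that is, by forcibly removing the node from the cluster. This ultimately protects data, but it does interrupt the user's applications. A resource agent should therefore be very careful about returning an error from stop, and do so only after every suitable and reasonable method of shutting down the resource has been tried.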
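One way to exhaust the reasonable shutdown methods before reporting a failed stop is to escalate: ask politely, wait, then force. The sketch below is illustrative only. The helper name stop_with_escalation, its grace-period argument, and the use of plain signals against a PID are all assumptions; a real agent would use whatever shutdown interface its resource provides.

```shell
#!/bin/sh
# Stand-in values; normally provided by ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical helper: stop process $1, waiting up to $2 seconds after
# SIGTERM before escalating to SIGKILL.
stop_with_escalation() {
    local pid grace
    pid=$1
    grace=$2

    # Already gone: nothing to do, and that still counts as success.
    kill -0 "$pid" 2>/dev/null || return $OCF_SUCCESS

    # First attempt: graceful termination.
    kill -TERM "$pid" 2>/dev/null
    while [ "$grace" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        grace=$((grace - 1))
    done

    # Graceful stop did not finish in time: force it.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null
        sleep 1
    fi

    # Report an error only once every method has been tried.
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}
```

A stop function could call such a helper in place of the elided shutdown step and exit with an $OCF_ERR_ code only if the helper itself reports failure.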

5.3 monitor action

The monitor action queries the current state of a resource. It must distinguish between three states:

the resource is running (return $OCF_SUCCESS);
the resource is cleanly stopped (return $OCF_NOT_RUNNING);
the resource has run into a problem and must be considered failed (return the $OCF_ERR_ code that most closely describes the problem).

foobar_monitor() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    ocf_run frobnicate --test

    # This example assumes the following exit code convention
    # for frobnicate:
    # 0: running, and fully caught up with master
    # 1: gracefully stopped
    # any other: error
    case "$?" in
        0)
            rc=$OCF_SUCCESS
            ocf_log debug "Resource is running"
            ;;
        1)
            rc=$OCF_NOT_RUNNING
            ocf_log debug "Resource is not running"
            ;;
        *)
            ocf_log err "Resource has failed"
            exit $OCF_ERR_GENERIC
            ;;
    esac

    return $rc
}

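Stateful (master/slave) resource agents may need a more elaborate monitoring scheme, one that hints to the cluster manager which instance is best suited to take the Master role. Section 9.4, "Specifying a master preference", explains the details.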
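For a stateful agent, monitor also has to tell the Master role apart from the Slave role, because the stop, promote and demote examples in this chapter branch on $OCF_RUNNING_MASTER. The following is a minimal sketch under invented assumptions: an extra exit code (2) from frobnicate --test meaning "running as master", a stub frobnicate function, and the FROB_STATE knob are all hypothetical and exist only to make the snippet runnable.

```shell
#!/bin/sh
# Stand-in values; normally provided by ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# Hypothetical stub for the managed program. FROB_STATE selects the
# exit code that "frobnicate --test" reports (defaults to 1, stopped).
frobnicate() { return "${FROB_STATE:-1}"; }

foobar_monitor_stateful() {
    frobnicate --test
    case "$?" in
        0) return $OCF_SUCCESS ;;         # running as Slave
        1) return $OCF_NOT_RUNNING ;;     # cleanly stopped
        2) return $OCF_RUNNING_MASTER ;;  # running as Master (assumed code)
        *) return $OCF_ERR_GENERIC ;;     # anything else: failed
    esac
}
```

The sketch only covers role detection; expressing a master preference to the cluster manager is a separate matter that this snippet does not attempt.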

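Note:

The cluster manager's probes, which test whether a resource is currently running, also invoke the monitor action. Normally, monitor behaves exactly the same during a probe as during a regular monitor call. If a particular resource does need special handling for probes, the ocf_is_probe function exists for that purpose.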
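As a rough illustration of what sets a probe apart from a regular monitor call: a probe is a monitor action invoked with an interval of 0. The stand-in below mirrors that idea so that the snippet runs without the OCF shell library. The names is_probe_sketch and deep_check_wanted are hypothetical, and the reliance on __OCF_ACTION and OCF_RESKEY_CRM_meta_interval is an assumption about how the library's ocf_is_probe works, not something stated in this chapter.

```shell
#!/bin/sh
# Hypothetical stand-in for ocf_is_probe: true when the current action
# is "monitor" and the configured monitor interval is 0.
is_probe_sketch() {
    [ "${__OCF_ACTION:-}" = "monitor" ] &&
        [ "${OCF_RESKEY_CRM_meta_interval:-}" = "0" ]
}

# Example use: decide whether a check that depends on shared storage
# should run. Returns 0 (run it) for regular monitors, 1 for probes.
deep_check_wanted() {
    if is_probe_sketch; then
        return 1    # probe: skip the deep check
    fi
    return 0        # regular monitor: run it
}
```

A real agent would simply call ocf_is_probe after sourcing the shell function library rather than reimplementing it.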

5.4 validate-all action

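The validate-all action tests the resource agent's configuration and its working environment. validate-all should exit with one of the following return codes:

$OCF_SUCCESS: all is well; the configuration is valid and usable.
$OCF_ERR_CONFIGURED: the resource is misconfigured.
$OCF_ERR_INSTALLED: the resource may be configured correctly, but a vital component is missing on the node where validate-all is being executed.
$OCF_ERR_PERM: the resource is configured correctly and no component is missing, but there is a permissions problem (for example, a necessary file cannot be created).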
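When debugging an agent by hand, it can help to translate a numeric exit status back into the symbolic names used in this chapter. The helper below is a hypothetical convenience (ocf_rc_name is not part of the OCF shell library); the numbers are the values these codes conventionally carry, hard-coded here so that the snippet is self-contained.

```shell
#!/bin/sh
# Hypothetical helper: translate a numeric OCF exit code into the
# symbolic name used throughout this chapter.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}
```

For example, after running an agent's validate-all action from a shell, ocf_rc_name $? prints the name of whatever code it exited with.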

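validate-all is usually wrapped in a function that is called not only when the validate-all action is explicitly requested, but also from other functions. Developers must therefore keep in mind that this function may also run during the start, stop, and monitor actions.

Probes pose an additional challenge for validation. During a probe (when the cluster manager may expect the resource not to be running on the node where the probe is executed), some required components may legitimately be unavailable on the affected node. For example, shared data on a storage device may not be readable during a probe. The validate-all function may therefore need to treat probes specially, using the ocf_is_probe function.

foobar_validate_all() {
    # Test for configuration errors first
    if ! ocf_is_decimal "$OCF_RESKEY_eggs"; then
        ocf_log err "eggs is not numeric!"
        exit $OCF_ERR_CONFIGURED
    fi

    # Test for required binaries
    check_binary frobnicate

    # Check for data directory (this may be on shared storage, so
    # disable this test during probes)
    if ! ocf_is_probe; then
        if ! [ -d "$OCF_RESKEY_datadir" ]; then
            ocf_log err "$OCF_RESKEY_datadir does not exist or is not a directory!"
            exit $OCF_ERR_INSTALLED
        fi
    fi

    return $OCF_SUCCESS
}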

5.5 meta-data action

The meta-data action dumps the resource agent metadata to standard output. The output must follow the metadata format described in Section 2.4.

foobar_meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="foobar" version="0.1">
...
EOF
}

5.6 promote action

The promote action is optional. It only needs to be supported by stateful resource agents, that is, agents that distinguish between two roles: Master and Slave. The Slave role is functionally identical to the Started state of a stateless resource agent. So while a regular (stateless) resource agent only needs to implement start and stop, a stateful resource agent must also support the transition from the Started (Slave) role to the Master role.

foobar_promote() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # test the resource's current state
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_SUCCESS")
            # Running as slave. Normal, expected behavior.
            ocf_log debug "Resource is currently running as Slave"
            ;;
        "$OCF_RUNNING_MASTER")
            # Already a master. Unexpected, but not a problem.
            ocf_log info "Resource is already running as Master"
            return $OCF_SUCCESS
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Need to start before promoting.
            ocf_log info "Resource is currently not running"
            foobar_start
            ;;
        *)
            # Failed resource. Let the cluster manager recover.
            ocf_log err "Unexpected error, cannot promote"
            exit $rc
            ;;
    esac

    # actually promote the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --master-mode || exit $OCF_ERR_GENERIC

    # After the resource has been promoted, check whether the
    # promotion worked. If the resource promotion is asynchronous, the
    # agent may spin on the monitor function here -- if the resource
    # does not assume the Master role within the defined timeout, the
    # cluster manager will consider the promote action failed.
    while true; do
        foobar_monitor
        if [ $? -eq $OCF_RUNNING_MASTER ]; then
            ocf_log debug "Resource promoted"
            break
        else
            ocf_log debug "Resource still awaiting promotion"
            sleep 1
        fi
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

5.7 demote action

The demote action is optional. It only needs to be supported by stateful resource agents, that is, agents that distinguish between two roles: Master and Slave. The Slave role is functionally identical to the Started state of a stateless resource agent. So while a regular (stateless) resource agent only needs to implement start and stop, a stateful resource agent must also support the transition from the Master role back to the Started (Slave) role.

foobar_demote() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # test the resource's current state
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_RUNNING_MASTER")
            # Running as master. Normal, expected behavior.
            ocf_log debug "Resource is currently running as Master"
            ;;
        "$OCF_SUCCESS")
            # Already running as slave. Nothing to do.
            ocf_log debug "Resource is currently running as Slave"
            return $OCF_SUCCESS
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Getting a demote action
            # in this state is unexpected. Exit with an error
            # and let the cluster manager recover.
            ocf_log err "Resource is currently not running"
            exit $OCF_ERR_GENERIC
            ;;
        *)
            # Failed resource. Let the cluster manager recover.
            ocf_log err "Unexpected error, cannot demote"
            exit $rc
            ;;
    esac

    # actually demote the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --unset-master-mode || exit $OCF_ERR_GENERIC

    # After the resource has been demoted, check whether the
    # demotion worked. If the resource demotion is asynchronous, the
    # agent may spin on the monitor function here -- if the resource
    # does not assume the Slave role within the defined timeout, the
    # cluster manager will consider the demote action failed.
    while true; do
        foobar_monitor
        if [ $? -eq $OCF_RUNNING_MASTER ]; then
            ocf_log debug "Resource still awaiting demotion"
            sleep 1
        else
            ocf_log debug "Resource demoted"
            break
        fi
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

5.8 migrate_to action

The migrate_to action serves one of two purposes:

Initiate a native push type migration for the resource. In other words, instruct the resource to move from the node it is currently running on to a specific target node. The resource agent learns the target node from the $OCF_RESKEY_CRM_meta_migrate_target environment variable.
Freeze the resource in a freeze/thaw (also known as suspend/resume) type migration. In this mode, the resource does not need to know its destination.

The following example illustrates a push type migration:

foobar_migrate_to() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is not running, bail out early
    if ! foobar_monitor; then
        ocf_log err "Resource is not running"
        exit $OCF_ERR_GENERIC
    fi

    # actually migrate the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --migrate \
                       --dest=$OCF_RESKEY_CRM_meta_migrate_target \
                       || exit $OCF_ERR_GENERIC
    ...

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

By contrast, a freeze/thaw type migration may implement its freeze operation like this:

foobar_migrate_to() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is not running, bail out early
    if ! foobar_monitor; then
        ocf_log err "Resource is not running"
        exit $OCF_ERR_GENERIC
    fi

    # actually freeze the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --freeze || exit $OCF_ERR_GENERIC
    ...

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

5.9 migrate_from action

The migrate_from action serves one of two purposes:

Complete a native push type migration for the resource. In other words, check whether the migration completed correctly and the resource is now running on the local node. The resource agent learns the source node from the $OCF_RESKEY_CRM_meta_migrate_source environment variable.
Thaw the resource in a freeze/thaw (also known as suspend/resume) type migration. In this mode, the resource does not need to know where it came from.

The following example illustrates a push type migration:

foobar_migrate_from() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # After the resource has been migrated, check whether it resumed
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # run within the defined timeout, the cluster manager will
    # consider the migrate_from action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not yet migrated, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

By contrast, a freeze/thaw type migration may implement its thaw operation like this:

foobar_migrate_from() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # actually thaw the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --thaw || exit $OCF_ERR_GENERIC

    # After the resource has been migrated, check whether it resumed
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # run within the defined timeout, the cluster manager will
    # consider the migrate_from action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not yet migrated, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

5.10 notify action

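With notifications, instances of a clone (including master/slave resources, which are an extended kind of clone) can inform each other about their state. When notifications are enabled, any action on any instance of a clone carries a pre and a post notification. The cluster manager then invokes the notify action on all clone instances. While notify is executing, the following additional environment variables are available:

$OCF_RESKEY_CRM_meta_notify_type: the notification type (pre or post)
$OCF_RESKEY_CRM_meta_notify_operation: the operation (action) that the notification is about (start, stop, promote, demote, etc.)
$OCF_RESKEY_CRM_meta_notify_start_uname: name of the node on which the resource is being started (start notifications only)
$OCF_RESKEY_CRM_meta_notify_stop_uname: name of the node on which the resource is being stopped (stop notifications only)
$OCF_RESKEY_CRM_meta_notify_master_uname: name of the node on which the resource is running in the Master role
$OCF_RESKEY_CRM_meta_notify_promote_uname: name of the node on which the resource is being promoted to the Master role (promote notifications only)
$OCF_RESKEY_CRM_meta_notify_demote_uname: name of the node on which the resource is being demoted to the Slave role (demote notifications only)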
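A notify handler commonly combines the notification type and operation into a single key and branches on it. The following is a minimal sketch under stated assumptions: the ocf_log stub stands in for the real library function so that the snippet runs on its own, the two handled cases are arbitrary examples, and returning success unconditionally is a choice made for this sketch rather than a rule taken from the text above.

```shell
#!/bin/sh
# Stand-in value; normally provided by ocf-shellfuncs.
OCF_SUCCESS=0

# Hypothetical stand-in for ocf_log so the sketch is self-contained.
ocf_log() { echo "$1: $2"; }

foobar_notify() {
    local type_op
    # Build a key such as "pre-promote" or "post-stop".
    type_op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"

    case "$type_op" in
        pre-promote)
            ocf_log info "About to promote on ${OCF_RESKEY_CRM_meta_notify_promote_uname}"
            ;;
        post-demote)
            ocf_log info "Demoted on ${OCF_RESKEY_CRM_meta_notify_demote_uname}"
            ;;
        *)
            ocf_log debug "Ignoring $type_op notification"
            ;;
    esac

    # This sketch treats notifications as purely informational.
    return $OCF_SUCCESS
}
```

A real handler would replace the log calls with whatever the resource needs to do in response, for example reconfiguring a replication peer before a promotion.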