5 Resource agent actions

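Each action is typically implemented in a separate function or method. By convention, these are named <agent>_<action>, so the function implementing the start action of the foobar agent is named foobar_start().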
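The naming convention lends itself to a simple dispatcher. The following is a minimal sketch, not taken from any real agent: the foobar_dispatch function and the echo-only action stubs are hypothetical, and the return-code values are defined inline only so that the snippet runs on its own (a real agent would obtain them from the OCF shell function library).

```shell
# Defined inline for this sketch only; normally provided by the library.
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3

# Hypothetical stand-ins for the real action functions.
foobar_start()   { echo "start called";   return $OCF_SUCCESS; }
foobar_stop()    { echo "stop called";    return $OCF_SUCCESS; }
foobar_monitor() { echo "monitor called"; return $OCF_SUCCESS; }

# Map an action name onto the function named <agent>_<action>.
foobar_dispatch() {
    case "$1" in
        start)   foobar_start ;;
        stop)    foobar_stop ;;
        monitor) foobar_monitor ;;
        *)       return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}
```

An agent built this way would call the dispatcher once with the requested action name and exit with whatever code it returns.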


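As a general rule, whenever the resource agent runs into an error it cannot recover from, it may exit immediately, throw an exception, or otherwise stop executing. Typical examples are configuration errors, missing binaries, and permission problems. There is no need to pass such errors up the call stack.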
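As an illustration of exiting on the spot rather than propagating an error, consider the sketch below. Both function names are hypothetical, the return-code values are defined inline so the snippet is self-contained, and plain echo stands in for the agent's logging function.

```shell
# Defined inline for this sketch only; normally provided by the library.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5

# Hypothetical helper: terminates the whole agent when a binary is missing.
foobar_require_binary() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "ERROR: required binary $1 not found" >&2
        exit $OCF_ERR_INSTALLED
    fi
}

# Hypothetical caller: it does not check a return value, because it is
# only reached again when the binary exists.
foobar_prepare() {
    foobar_require_binary "$1"
    echo "prepared"
}
```

Because the helper exits directly, none of its callers needs error-handling code for this case.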


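It is the cluster manager's responsibility to initiate the appropriate recovery action based on the user's configuration. The resource agent must not second-guess that configuration.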


5.1 start action


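When the start action is invoked, the resource agent must start the resource unless it is already running. This means the agent must verify the resource's configuration, query its state, and start the resource only if it is not running. A common way of doing this is to call the validate_all and monitor functions first, as in the following example: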



foobar_start() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is already running, bail out early
    if foobar_monitor; then
        ocf_log info "Resource is already running"
        return $OCF_SUCCESS
    fi

    # actually start up the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ...

    # After the resource has been started, check whether it started up
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # start up within the defined timeout, the cluster manager will
    # consider the start action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not started yet, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}


5.2 stop action


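When the stop action is invoked, the resource agent must stop the resource if it is currently running. This means the agent must verify the resource's configuration, query its state, and stop the resource only if it is running. A common way of doing this is to call the validate_all and monitor functions first. It is important to understand that stop is a forced operation: the resource agent must do everything in its power to shut the resource down, short of rebooting the node or powering it off. Consider the following example: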



foobar_stop() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_SUCCESS")
            # Currently running. Normal, expected behavior.
            ocf_log debug "Resource is currently running"
            ;;
        "$OCF_RUNNING_MASTER")
            # Running as a Master. Need to demote before stopping.
            ocf_log info "Resource is currently running as Master"
            foobar_demote || \
                ocf_log warn "Demote failed, trying to stop anyway"
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Nothing to do.
            ocf_log info "Resource is already stopped"
            return $OCF_SUCCESS
            ;;
    esac

    # actually shut down the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ...

    # After the resource has been stopped, check whether it shut down
    # correctly. If the resource stops asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # shut down within the defined timeout, the cluster manager will
    # consider the stop action failed
    while foobar_monitor; do
        ocf_log debug "Resource has not stopped yet, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

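Note:

The expected exit code for a successful stop action is $OCF_SUCCESS, not $OCF_NOT_RUNNING.

Important:

A failed stop action is a potentially dangerous situation, which the cluster manager will almost always try to resolve by fencing. In other words, it forcibly evicts from the cluster the node on which the stop failed. This measure ultimately protects data, but it does disrupt applications and their users. A resource agent should therefore report a stop failure only with great care, after every reasonable means of shutting the resource down has been tried.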
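One way to make sure every avenue has been tried is to escalate through progressively harsher stop methods and report failure only when the last one has not worked. The sketch below is illustrative only: all function names and the FOOBAR_STATE variable are hypothetical, and the stubs merely simulate a service whose graceful shutdown request is ignored.

```shell
# Defined inline for this sketch only; normally provided by the library.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical stand-ins; a real agent would talk to the actual service.
FOOBAR_STATE=running
foobar_is_running()    { [ "$FOOBAR_STATE" = running ]; }
foobar_stop_graceful() { :; }                     # simulated: request ignored
foobar_stop_forced()   { FOOBAR_STATE=stopped; }  # simulated: always works

# Try each stop method in order until the resource is no longer running.
foobar_stop_escalating() {
    for method in foobar_stop_graceful foobar_stop_forced; do
        "$method"
        if ! foobar_is_running; then
            return $OCF_SUCCESS
        fi
    done
    # Every method has been tried; only now report a failure.
    return $OCF_ERR_GENERIC
}
```

A real implementation would also wait between attempts for the resource to go down; that is omitted here to keep the sketch short.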


5.3 monitor action


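The monitor action queries the current state of a resource. It must distinguish between three states:

  • the resource is running (return $OCF_SUCCESS);
  • the resource has stopped cleanly (return $OCF_NOT_RUNNING);
  • the resource has run into a problem and must be considered failed (return the $OCF_ERR_ code that best describes the problem).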



foobar_monitor() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    ocf_run frobnicate --test

    # This example assumes the following exit code convention
    # for frobnicate:
    # 0: running, and fully caught up with master
    # 1: gracefully stopped
    # any other: error
    case "$?" in
        0)
            rc=$OCF_SUCCESS
            ocf_log debug "Resource is running"
            ;;
        1)
            rc=$OCF_NOT_RUNNING
            ocf_log debug "Resource is not running"
            ;;
        *)
            ocf_log err "Resource has failed"
            exit $OCF_ERR_GENERIC
            ;;
    esac

    return $rc
}

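Stateful (master/slave) resource agents may need a more elaborate monitoring scheme, one that gives the cluster manager hints about which instance is best suited to take on the Master role. Section 9.4, on determining master preference, explains the details.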
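For a stateful agent, the monitor function also has to report the Master role, which other action functions test for via $OCF_RUNNING_MASTER. The following sketch shows one possible mapping. It rests on assumptions: frobnicate_test is a hypothetical stand-in for the status command, the FOOBAR_FAKE_STATUS variable exists only to drive the simulation, and exit code 2 meaning "running as master" is an invented convention.

```shell
# Defined inline for this sketch only; normally provided by the library.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# Hypothetical stand-in for the real status command.
frobnicate_test() { return "${FOOBAR_FAKE_STATUS:-1}"; }

foobar_monitor_stateful() {
    local status=0
    frobnicate_test || status=$?

    # Assumed exit code convention of the status command:
    # 0: running as slave, 1: cleanly stopped, 2: running as master
    case "$status" in
        0) return $OCF_SUCCESS ;;
        1) return $OCF_NOT_RUNNING ;;
        2) return $OCF_RUNNING_MASTER ;;
        *) return $OCF_ERR_GENERIC ;;
    esac
}
```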


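Note:

The cluster manager may invoke the monitor action as a probe, that is, a test of whether the resource is currently running. Normally, monitor behaves exactly the same during a probe as during a regular monitor operation. If a particular resource does need special treatment for probes, the ocf_is_probe function is available for that purpose.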
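As a rough illustration of probe-specific handling, the sketch below reports a missing data directory as "not running" during a probe but as an error otherwise. It is self-contained by design: the ocf_is_probe defined here is a simplified stand-in driven by the hypothetical FOOBAR_PROBE variable, not the library's implementation, and foobar_check_datadir is a hypothetical helper.

```shell
# Defined inline for this sketch only; normally provided by the library.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_NOT_RUNNING=7

# Simplified stand-in for the library function of the same name.
ocf_is_probe() { [ "${FOOBAR_PROBE:-0}" = 1 ]; }

# Hypothetical helper: check a data directory, tolerating its absence
# during a probe.
foobar_check_datadir() {
    if [ -d "$1" ]; then
        return $OCF_SUCCESS
    fi
    if ocf_is_probe; then
        # The directory may live on storage that is not attached to
        # this node, so the resource is simply not running here.
        return $OCF_NOT_RUNNING
    fi
    return $OCF_ERR_INSTALLED
}
```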


5.4 validate-all action


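The validate-all action tests the resource agent's configuration and its working environment. validate-all should exit with one of the following return codes:

  • $OCF_SUCCESS: all is well, the configuration is valid and usable;
  • $OCF_ERR_CONFIGURED: the resource is misconfigured;
  • $OCF_ERR_INSTALLED: the resource may be configured correctly, but a vital component is missing on the node where validate-all is being executed;
  • $OCF_ERR_PERM: the resource is configured correctly and no component is missing, but there is a permission problem (such as not being able to create a required file).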


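validate-all is usually wrapped in a function that is called not only when the corresponding action is invoked explicitly, but also from other functions. Developers must therefore keep in mind that this function may also run during the start, stop, and monitor actions.

Probes pose a further challenge for validation. During a probe (when the cluster manager may expect the resource not to be running on the node where the probe is executed), some required components may legitimately be unavailable on that node. For example, shared data on a storage device may not be readable during a probe. The validate_all function may thus need to treat probes specially, using the ocf_is_probe function.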



foobar_validate_all() {
    # Test for configuration errors first
    if ! ocf_is_decimal "$OCF_RESKEY_eggs"; then
        ocf_log err "eggs is not numeric!"
        exit $OCF_ERR_CONFIGURED
    fi

    # Test for required binaries
    check_binary frobnicate

    # Check for data directory (this may be on shared storage, so
    # disable this test during probes)
    if ! ocf_is_probe; then
        if ! [ -d "$OCF_RESKEY_datadir" ]; then
            ocf_log err "$OCF_RESKEY_datadir does not exist or is not a directory!"
            exit $OCF_ERR_INSTALLED
        fi
    fi

    return $OCF_SUCCESS
}


5.5 meta-data action


The meta-data action dumps the resource agent metadata to standard output. The output must follow the metadata format described in Section 2.4.



foobar_meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="foobar" version="0.1">
  <version>0.1</version>
  <longdesc lang="en">
...
EOF
}


5.6 promote action


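The promote action is optional. It must be supported only by stateful resource agents, that is, agents that distinguish between two roles: Master and Slave. The Slave role is functionally identical to the Started state of a stateless resource agent. Thus, while a regular (stateless) resource agent only needs to implement start and stop, a stateful resource agent must also support the transition from the Started (Slave) role to the Master role.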



foobar_promote() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # test the resource's current state
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_SUCCESS")
            # Running as slave. Normal, expected behavior.
            ocf_log debug "Resource is currently running as Slave"
            ;;
        "$OCF_RUNNING_MASTER")
            # Already a master. Unexpected, but not a problem.
            ocf_log info "Resource is already running as Master"
            return $OCF_SUCCESS
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Need to start before promoting.
            ocf_log info "Resource is currently not running"
            foobar_start
            ;;
        *)
            # Failed resource. Let the cluster manager recover.
            ocf_log err "Unexpected error, cannot promote"
            exit $rc
            ;;
    esac

    # actually promote the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --master-mode || exit $OCF_ERR_GENERIC

    # After the resource has been promoted, check whether the
    # promotion worked. If the resource promotion is asynchronous, the
    # agent may spin on the monitor function here -- if the resource
    # does not assume the Master role within the defined timeout, the
    # cluster manager will consider the promote action failed.
    while true; do
        foobar_monitor
        if [ $? -eq $OCF_RUNNING_MASTER ]; then
            ocf_log debug "Resource promoted"
            break
        else
            ocf_log debug "Resource still awaiting promotion"
            sleep 1
        fi
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}


5.7 demote action


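The demote action is optional. It must be supported only by stateful resource agents, that is, agents that distinguish between two roles: Master and Slave. The Slave role is functionally identical to the Started state of a stateless resource agent. Thus, while a regular (stateless) resource agent only needs to implement start and stop, a stateful resource agent must also support the transition from the Master role to the Started (Slave) role.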



foobar_demote() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # test the resource's current state
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_RUNNING_MASTER")
            # Running as master. Normal, expected behavior.
            ocf_log debug "Resource is currently running as Master"
            ;;
        "$OCF_SUCCESS")
            # Already running as slave. Nothing to do.
            ocf_log debug "Resource is currently running as Slave"
            return $OCF_SUCCESS
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Getting a demote action
            # in this state is unexpected. Exit with an error
            # and let the cluster manager recover.
            ocf_log err "Resource is currently not running"
            exit $OCF_ERR_GENERIC
            ;;
        *)
            # Failed resource. Let the cluster manager recover.
            ocf_log err "Unexpected error, cannot demote"
            exit $rc
            ;;
    esac

    # actually demote the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --unset-master-mode || exit $OCF_ERR_GENERIC

    # After the resource has been demoted, check whether the
    # demotion worked. If the resource demotion is asynchronous, the
    # agent may spin on the monitor function here -- if the resource
    # does not assume the Slave role within the defined timeout, the
    # cluster manager will consider the demote action failed.
    while true; do
        foobar_monitor
        if [ $? -eq $OCF_RUNNING_MASTER ]; then
            ocf_log debug "Resource still awaiting demotion"
            sleep 1
        else
            ocf_log debug "Resource demoted"
            break
        fi
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}


5.8 migrate_to action


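The migrate_to action serves one of two purposes:

  • Initiate a native push-type migration for the resource. In other words, instruct the resource to move from the node it is currently running on to a specific node. The resource agent learns the destination node from the $OCF_RESKEY_CRM_meta_migrate_target environment variable.
  • Freeze the resource in a freeze/thaw (also known as suspend/resume) type migration. In this mode, the resource does not need to know its destination.

The following example illustrates a push-type migration: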



foobar_migrate_to() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is not running, bail out early
    if ! foobar_monitor; then
        ocf_log err "Resource is not running"
        exit $OCF_ERR_GENERIC
    fi

    # actually migrate the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --migrate \
                       --dest="$OCF_RESKEY_CRM_meta_migrate_target" \
                       || exit $OCF_ERR_GENERIC
    ...

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}


By contrast, a freeze/thaw type migration may implement its freeze operation like this:



foobar_migrate_to() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is not running, bail out early
    if ! foobar_monitor; then
        ocf_log err "Resource is not running"
        exit $OCF_ERR_GENERIC
    fi

    # actually freeze the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --freeze || exit $OCF_ERR_GENERIC
    ...

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}


5.9 migrate_from action


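The migrate_from action serves one of two purposes:

  • Complete a native push-type migration for the resource. In other words, check whether the migration finished correctly and the resource is now running locally. The resource agent learns the source node from the $OCF_RESKEY_CRM_meta_migrate_source environment variable.
  • Thaw the resource in a freeze/thaw (also known as suspend/resume) type migration. In this mode, the resource does not need to know its source.

The following example illustrates a push-type migration: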



foobar_migrate_from() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # After the resource has been migrated, check whether it resumed
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # run within the defined timeout, the cluster manager will
    # consider the migrate_from action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not yet migrated, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}


By contrast, a freeze/thaw type migration may implement its thaw operation like this:



foobar_migrate_from() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # actually thaw the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --thaw || exit $OCF_ERR_GENERIC

    # After the resource has been migrated, check whether it resumed
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # run within the defined timeout, the cluster manager will
    # consider the migrate_from action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not yet migrated, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}


5.10 notify action


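With notifications, instances of clones (including master/slave resources, which are an extended kind of clone) can inform each other about their state. When notifications are enabled, every action on a clone instance carries a pre and a post notification. The cluster manager then invokes the notify action on all clone instances. While notify runs, the following additional environment variables are available:

  • $OCF_RESKEY_CRM_meta_notify_type: the notification type (pre or post)
  • $OCF_RESKEY_CRM_meta_notify_operation: the operation (action) that the notification is about (start, stop, promote, demote, etc.)
  • $OCF_RESKEY_CRM_meta_notify_start_uname: name of the node on which the resource is being started (start notifications only)
  • $OCF_RESKEY_CRM_meta_notify_stop_uname: name of the node on which the resource is being stopped (stop notifications only)
  • $OCF_RESKEY_CRM_meta_notify_master_uname: name of the node on which the resource currently runs in the Master role
  • $OCF_RESKEY_CRM_meta_notify_promote_uname: name of the node on which the resource is being promoted to the Master role (promote notifications only)
  • $OCF_RESKEY_CRM_meta_notify_demote_uname: name of the node on which the resource is being demoted to the Slave role (demote notifications only)

Notifications are particularly handy for master/slave resources that use a push scheme, in which the master is the publisher and the slaves are subscribers. Since a master can only publish once it has been promoted, the slaves can use the pre-promote notification to configure themselves to point at the right publisher.

Likewise, the subscribers will want to unsubscribe once the master gives up its role. The post-demote notification serves that purpose.

The following example illustrates the concept: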



foobar_notify() {
    local type_op
    type_op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"

    ocf_log debug "Received $type_op notification."
    case "$type_op" in
        'pre-promote')
            ocf_run frobnicate --slave-mode \
                               --master=$OCF_RESKEY_CRM_meta_notify_promote_uname \
                               || exit $OCF_ERR_GENERIC
            ;;
        'post-demote')
            ocf_run frobnicate --unset-slave-mode || exit $OCF_ERR_GENERIC
            ;;
    esac

    return $OCF_SUCCESS
}


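Note:

A master/slave resource agent may support a multi-master configuration, in which more than one master may exist at any given time. In that case, the $OCF_RESKEY_CRM_meta_notify_*_uname variables may each contain a space-separated list of host names rather than a single host name as in the example above. The resource agent then has to iterate over that list.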
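Iterating over such a list takes only a plain shell loop. The sketch below is illustrative: foobar_each_master is a hypothetical helper, and it merely prints what it would do instead of reconfiguring anything.

```shell
# Hypothetical helper: visit every host name in the promote list.
foobar_each_master() {
    local host
    # The expansion is deliberately unquoted so that the shell splits
    # the space-separated list into individual host names.
    for host in $OCF_RESKEY_CRM_meta_notify_promote_uname; do
        echo "would subscribe to master: $host"
    done
}
```

With the variable set to "node1 node2", the function prints one line per node; with a single host name it behaves exactly like the single-master case.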