[Erlang 0030] 理解Erlang/OTP Supervisor

Supervisors are used to build an hierarchical process structure called a supervision tree, a nice way to structure a fault tolerant application.
                                                                                                                                                                                     --Erlang/OTP Doc

     Supervisor的基本思想就是通过建立层级结构实现错误隔离和管理,具体方法是通过重启的方式保持子进程一直活着.如果supervisor是进程树的一部分,它会被它的supervisor自动终止,当它的supervisor让它shutdown的时候,它会按照子进程启动顺序的逆序终止其所有的子进程,最后终止掉自己.[Erlang 0025]理解Erlang/OTP - Application以log4erl项目为学习了Erlang/OTP application,我们说到application在start的方法中启动了log4erl的顶层监控树.今天我们继续跟进,看log4erl的监控树是怎么构建起来的,并做实验看supervisor如何通过重启恢复服务的.使用application:start(log4erl).启动起来之后的进程树:

下面是log4erl_sup文件的start_link方法,supervisor:start_link方法的执行是同步的,直到所有的子进程都启动了才会返回. supervisor:start_link会使用回调函数init/1.

start_link(Default_logger) ->
R = supervisor:start_link({local, ?MODULE}, ?MODULE, []),
%log4erl:start_link(Default_logger),
add_logger(Default_logger),
?LOG2("Result in supervisor is ~p~n",[R]),
R.

%%回调的方法init/1
init([]) ->
{ok,
{
{one_for_one,3,10},
[]
}
}.
  log4erl的顶层监控树的初始化相当简单仅仅定义了重启策略(RestartStrategy)和最大重启频率(maximum restart frequency):{one_for_one,3,10}.
 {one_for_one,3,10}表达的语义是{How, Max, Within} :在多长时间内(Within)重启了几次(Max),如何重启(HOW 重启策略); 设计最大重启频率是为了避免反复重启进入死循环,一旦超出了此阈值,supervisor进程会结束掉自己以及它所有的子进程,并通过进程树传递退出消息,更上层的supervisor就会采取适当的措施,要么重启终止的supervisor要么自己也终止掉.可能比较纠结这几个值怎么配置,多数资料上都会告诉你"如何配置完全取决于你的应用程序".这个还是有经验值的,生成环境的经验值是一小时内重启4次,也可以参考一些和你应用类似的开源项目看看它们是怎么配置的.如果填写的是{one_for_one,0,1}就是不允许重启,下面的示例中可以看到YAWS项目采用了这样的策略.

下面几个开源项目顶层supervisor的init方法:
%%rabbit_sup.erl  来自大名鼎鼎的rabbitmq
init([]) ->
{ok, {{one_for_all, 0, 1}, []}}.


%%yaws_sup.erl Yaws项目 - Yet Another Web Server
init([]) ->

ChildSpecs = child_specs(),
%% 0, 1 means that we never want supervisor restarts
{ok,{{one_for_all, 0, 1}, ChildSpecs}}.


%%ejabberd_sup ejabberd项目
init([]) ->
Hooks =
{ejabberd_hooks,
{ejabberd_hooks, start_link, []},
%%......................... 省略代码
{ok, {{one_for_one, 10, 1},
[Hooks,
GlobalRouter,
Cluster,
..................
Listener]}}.

重启策略
 
 one-for-one : 把子进程当成各自独立的,一个进程出现问题其它进程不会受到崩溃的进程的影响.该子进程死掉,只有这个进程会被重启
 one_for_all : 如果子进程终止,所有其它子进程也都会被终止,然后所有进程都会被重启.
 rest_for_one:如果一个子进程终止,在这个进程启动之后启动的进程都会被终止掉.然后终止掉的进程和连带关闭的进程都会被重启.
 simple_one_for_one 是one_for_one的简化版 ,所有子进程都动态添加同一种进程的实例

在log4erl_sup.erl的start_link中启动了顶层supervisor之后,添加了一个默认的logger: add_logger(Default_logger),
add_logger(Name) when is_atom(Name) ->
N = atom_to_list(Name),
add_logger(N);
add_logger(Name) when is_list(Name) ->
C1 = {Name,
{log_manager, start_link ,[Name]},
permanent,
10000,
worker,
[log_manager]},

?LOG2("Adding ~p to ~p~n",[C1, ?MODULE]),
supervisor:start_child(?MODULE, C1).
添加的logger是log4erl_sup的子进程,子进程启动和监控的方式通过child specification来指定.
 C1 =  {Name,{log_manager, start_link ,[Name]},permanent,10000,worker,[log_manager]}
 
C1的六个数据项分别为: {ID, StartEntery, Restart, Shutdown, Type, Modules}:
ID  :supervisor 用来在内部区分specification的,所以只要子进程规格说明之间不重复就可以.
Start  : 启动参数{M,F,A}
Restart  : 这个进程遇到错误之后是否重启
                permanent:遇到任何错误导致进程终止就会重启
                temporary:进程永远都不会被重启
                transient: 只有进程异常终止的时候会被重启
Shutdown  进程如何被干掉,这里是使用整型值2000的意思是,进程在被强制干掉之前有2000毫秒的时间料理后事自行终止.
              实际过程是supervisor给子进程发送一个exit(Pid,shutdown)然后等待exit信号返回,在指定时间没有返回则将子进程使用exit(Child,kill)
             这里的参数还有 brutal_kill 意思是进程马上就会被干掉
             infinity :当一个子进程是supervisor那么就要用infinity,意思是给supervisor足够的时间进行重启.
Type  这里只有两个值:supervisor worker ; 只要没有实现supervisor behavior的进程都是worker;
                    可以通过supervisor的层级结构来精细化对进程的控制.这个值主要作用是告知监控进程它的子进程是supervisor还是worker
Modules  是进程依赖的模块, 这个信息只有在代码热更新的时候才会被用到:标注了哪些模块需要按照什么顺序进行更新;通常这里只需要列出进程依赖的主模块. 如果子进程是supervisor gen_server gen_fsm Module名是回调模块的名称,这时Modules的值是只有一个元素的列表,元素就是回调模块的名称;如果子进程是gen_event Modules的值是 dynamic;关于dynamic参数余锋有一篇专门的分析:Erlang supervisor规格的dynamic行为分析  http://blog.yufeng.info/archives/1455
 
实际应用中log4erl中的logger会根据业务逻辑添加多个,我们也不是直接通过application:start(log4erl).而是调用  log4erl:conf(log4erl.conf)这个方法简单的封装了内部逻辑,实际调用的是 log4erl_conf:conf(File).我们这里定义一个简单的log4erl.conf文件,使用log4erl:conf(log4erl.conf).启动之后,我们看看它的进程树是什么样的:
%%log4erl.conf文件 内容我做了简单的缩排
%%mod
logger default_logger{
     file_appender default_app{
    dir = "./log", level = debug, file = default_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
     }
}

%%mail mod
logger mail_logger{
     file_appender mail_app{
     dir = "./log", level = debug, file = mail_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
     }
}
对应的进程树是这样的,进程之间的红线表示link关系:

我们沿着调用关系,逐步跟进代码:

%==== File : log4erl_conf =======

%%log4erl_conf:conf(File).
conf(File) ->
application:start(log4erl), %%启动log4erl
Tree = parse(leex(File)), %%解析配置文件
traverse(Tree). %%遍历配置项构造监控树

%%跟进遍历的逻辑,对于每一条配置执行的是element/1方法
traverse([]) ->
ok;
traverse([H|Tree]) ->
element(H),
traverse(Tree).

%%对于我们自定义的logger走的是{logger, Logger, Appenders}逻辑
element({cutoff_level, CutoffLevel}) ->
log_filter_codegen:set_cutoff_level(CutoffLevel);
element({default_logger, Appenders}) ->
appenders(Appenders);
element({logger, Logger, Appenders}) ->
log4erl:add_logger(Logger),
appenders(Logger, Appenders).

%==== File : log4erl =======
%%继续跟进我们走到log4erl:add_logger/1
add_logger(Logger) ->
try_msg({add_logger, Logger}).

%%try_msg 是的添加了异常捕获的通用方法
try_msg(Msg) ->
try
handle_call(Msg)
catch
exit:{noproc, _M} ->
io:format("log4erl has not been initialized yet. To do so, please run~n"),
io:format("> application:start(log4erl).~n"),
{error, log4erl_not_started};
E:M ->
?LOG2("Error message received by log4erl is ~p:~p~n",[E, M]),
{E, M}
end.

%%handle_call的代码片段
handle_call({add_logger, Logger}) ->
log_manager:add_logger(Logger);

%==== File : log_manager =======
%%逻辑转到log_manager的add_logger(Logger)
%%最终调用的是log4erl_sup:add_logger(Logger).这个我们上面已经分析过了
add_logger(Logger) ->
log4erl_sup:add_logger(Logger).

%%element方法在添加loger之后会添加appender
appenders([]) ->
ok;
appenders([H|Apps]) ->
appender(H),
appenders(Apps).

appenders(_, []) ->
ok;
appenders(Logger, [H|Apps]) ->
appender(Logger, H),
appenders(Logger, Apps).

appender({appender, App, Name, Conf}) ->
log4erl:add_appender({App, Name}, {conf, Conf}).

appender(Logger, {appender, App, Name, Conf}) ->
log4erl:add_appender(Logger, {App, Name}, {conf, Conf}).


%==== File : log4erl =======
%% Appender = {Appender, Name}
add_appender(Logger, Appender, Conf) ->
try_msg({add_appender, Logger, Appender, Conf}).

handle_call({add_appender, Logger, Appender, Conf}) ->
log_manager:add_appender(Logger, Appender, Conf);

%==== File : log_manager =======
add_appender(Logger, {Appender, Name} , Conf) ->
?LOG2("add_appender ~p with name ~p to ~p with Conf ~p ~n",[Appender, Name, Logger, Conf]),
log4erl_sup:add_guard(Logger, Appender, Name, Conf).

%==== File : log4erl_sup =======
add_guard(Logger, Appender, Name, Conf) ->
C = {Name,
{logger_guard, start_link ,[Logger, Appender, Name, Conf]},
permanent,
10000,
worker,
[logger_guard]},
?LOG2("Adding ~p to ~p~n",[C, ?MODULE]),
supervisor:start_child(?MODULE, C).

%==== File : logger_guard =======
start_link(Logger, Appender, Name, Conf) ->
%?LOG2("starting guard for logger ~p~n",[Logger]),
{ok, Pid} = gen_server:start_link(?MODULE, [Appender, Name], []),
case add_sup_handler(Pid, Logger, Conf) of
{error, E} ->
gen_server:call(Pid, stop),
{error, E};
_R ->
{ok, Pid}
end.

add_sup_handler(G_pid, Logger, Conf) ->
?LOG("add_sup()~n"),
gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).

handle_call({add_sup_handler, Logger, Conf}, _From, [{appender, Appender, Name}] = State) ->
?LOG2("Adding handler ~p with name ~p for ~p From ~p~n",[Appender, Name, Logger, _From]),
try
Res = gen_event:add_sup_handler(Logger, {Appender, Name}, Conf),
{reply, Res, State}
catch
E:R ->
{reply, {error, {E,R}}, State}
end;

gen_event:add_sup_handler会建立EventManager与Event Handler之间的link的关系,所以我们修改一下,注释掉这段,看看监控树是什么样子:

add_sup_handler(G_pid, Logger, Conf) ->

%    ?LOG("add_sup()~n"),
%    gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).
  ok.

注释掉之后可以看到logger和guard之间的link关系就不存在了.


kill进程的实验

后面我们会用各种情况杀掉进程,看这个进程树对异常的处理情况;我们的实验步骤:
1.发送退出消息Reason:some_reason给default_logger
2.发送退出消息Reason:kill 给default_logger
3.发送退出消息Reason:some_reason给logger_guard
4.发送退出消息Reason:some_reason给log4erl_sup
5.发送退出消息Reason:kill 给log4erl_sup
3> whereis(default_logger).
<0.45.0>
4> exit(whereis(default_logger),some_reason).
true
5> whereis(default_logger).
<0.45.0>
6> exit(whereis(default_logger),some_reason). %%由于gen_event默认process_flag(trap_exit, true),所以some_reason的退出消息并没有把它干掉
true
7> whereis(default_logger).
<0.45.0>
8> exit(whereis(default_logger),kill). %%向进程发送强制退出消息,
true

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === %首先能够看到log4erl报出的子进程终止的报告
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.45.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === %log4erl_sup重建default_logger,新进程pid是<0.69.0>
supervisor: {local,log4erl_sup}
started: [{pid,<0.69.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === %default_logger退出消息转变成为killed继续广播给link的进程,对应的logger_guard终止
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.46.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf, [{dir,"./log"},{level,debug},{file,default_log},{type,size},
{max,1000000},{suffix,log}, {rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === %logger_guard 重建
supervisor: {local,log4erl_sup}
started: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug}, {file,default_log},{type,size},
{max,1000000}, {suffix,log},{rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]
9> whereis(default_logger).
<0.69.0>
10> is_process_alive(pid(0,70,0)). %这是新启动的logger_guard进程
true
11> exit(pid(0,70,0),some_reason). %向进程发送一个退出消息
true

=SUPERVISOR REPORT==== 10-Jan-2012::11:07:51 ===
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: some_reason
Offender: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

12>
=PROGRESS REPORT==== 10-Jan-2012::11:07:51 ===
supervisor: {local,log4erl_sup}
started: [{pid,<0.76.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},{child_type,worker}]
12> is_process_alive(pid(0,70,0)).
false
13> whereis(default_logger). %退出消息广播对default_logger没有影响
<0.69.0>
14> whereis(log4erl_sup).
<0.44.0>
15> exit(whereis(log4erl_sup),some_reason). % Supervisor 初始化的时候也会设置 process_flag(trap_exit, true),
true
16> whereis(log4erl_sup).
<0.44.0>
17> exit(whereis(log4erl_sup),kill). %杀掉log4erl_sup 应用程序停止
true

=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.69.0>
registered_name: default_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.76.0>,killed}]
links: [#Port<0.1891>,#Port<0.1885>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 720
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.47.0>
registered_name: mail_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.48.0>,killed}]
links: [#Port<0.546>]
dictionary: []
trap_exit: true
status: running
heap_size: 377
stack_size: 24
reductions: 411
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:25 ===
crasher:
initial call: application_master:init/4
pid: <0.42.0>
registered_name: []
exception exit: killed
in function application_master:terminate/2
ancestors: [<0.41.0>]
messages: []
links: [<0.6.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 1555
neighbours:
18>
=INFO REPORT==== 10-Jan-2012::13:26:25 ===
application: log4erl
exited: killed
type: temporary
18>


 最后再贴一次log4erl项目的地址: http://code.google.com/p/log4erl/,建议下载下来代码自己动手做一下上面的实验.


下一篇:[Erlang 0031] Erlang Shell中的输出完整数据

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值