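This article is based on Elastic-Job 2.1.5.
It looks at what happens in a distributed deployment when one job server crashes or drops offline because of a network failure, so that the shards assigned to that server stop being executed, and at how to deal with it.
1. Code example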
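- @ElasticJobConf is a custom annotation from Elastic-job-spring-boot-starter, a starter written by the author. Its parameters and their meanings are identical to the native elastic-job ones, so the example should be easy to follow.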
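import com.dangdang.ddframe.job.api.ShardingContext;
import com.dangdang.ddframe.job.api.simple.SimpleJob;
// The import for @ElasticJobConf depends on the starter's package and is omitted here.

// Job name; cron expression (runs every 5s); total shard count; shard parameters; whether to overwrite the configuration in ZooKeeper
@ElasticJobConf(name = "zeng", cron = "0/5 * * * * ?", shardingTotalCount = 2, shardingItemParameters = "", overwrite = true)
public class JobTest implements SimpleJob {

    @Override
    public void execute(ShardingContext shardingContext) {
        // Print which shard this invocation is responsible for.
        System.out.println(shardingContext.toString());
    }
}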
- With a single job server running, the output is as follows. Both shards, shardingItem=0 and shardingItem=1, run on 172.30.60.210.
ShardingContext(jobName=zeng, taskId=zeng@-@0,1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=0, shardingParameter=null)
ShardingContext(jobName=zeng, taskId=zeng@-@0,1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
- Now look at the data in ZooKeeper (viewed with ZooInspector):
/<namespace>/zeng/instances contains one instance, 172.30.60.210.
/<namespace>/zeng/sharding contains two shards, 1 and 0; the instance node under each of them holds 172.30.60.210@-@9920.
- Now start a second server. Once it is up, resharding is triggered, and the first server runs only one shard.
ShardingContext(jobName=zeng, taskId=zeng@-@1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
- Check ZooKeeper again:
/<namespace>/zeng/instances now contains two instances, 172.30.60.210 and 172.30.60.84.
/<namespace>/zeng/sharding still contains shards 1 and 0; the instance node under shard 1 holds 172.30.60.210@-@9920, and the one under shard 0 holds 172.30.60.84@-@4408.
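The assignments seen so far (both shards on the only instance, then one shard each) are what an even split would produce. Below is a minimal sketch of such an average allocation. It is not elastic-job's actual code: the class and method names are made up for illustration, and the instance order is simply chosen so that the result matches the ZooKeeper data above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an even shard split; not the library's implementation. */
final class AverageAllocationSketch {

    /** Gives each instance an equal block of shard items, then hands leftovers to the first instances. */
    static Map<String, List<Integer>> assign(List<String> instances, int shardingTotalCount) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        if (instances.isEmpty()) {
            return result;
        }
        int perInstance = shardingTotalCount / instances.size();
        int index = 0;
        for (String instance : instances) {
            List<Integer> items = new ArrayList<>();
            for (int item = index * perInstance; item < (index + 1) * perInstance; item++) {
                items.add(item);
            }
            result.put(instance, items);
            index++;
        }
        // Items that do not divide evenly go one by one to the instances at the front of the list.
        int leftover = shardingTotalCount % instances.size();
        int firstLeftoverItem = perInstance * instances.size();
        index = 0;
        for (String instance : instances) {
            if (index < leftover) {
                result.get(instance).add(firstLeftoverItem + index);
            }
            index++;
        }
        return result;
    }

    public static void main(String[] args) {
        // One instance: it owns both shards.
        System.out.println(assign(Arrays.asList("172.30.60.210@-@9920"), 2));
        // Two instances (order assumed): one shard each.
        System.out.println(assign(Arrays.asList("172.30.60.84@-@4408", "172.30.60.210@-@9920"), 2));
    }
}
```

With three instances and eight shards the same routine yields [0, 1, 6], [2, 3, 7] and [4, 5].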
2. Simulating a crash
- Stop the 172.30.60.84 server abruptly, at about 14:31:35 (the last log entry on 172.30.60.84 is at 14:31:35).
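- In the log of 172.30.60.210, every run from 14:31:40 through 14:32:15 executes only one shard; resharding happens at 14:32:20, and from that run on both shards execute there. During the outage the instances in ZooKeeper did not change either; only at about 14:32:20 did the list drop to a single instance.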
[2020-12-11 14:31:40:001] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.simpl.PropertySettingJobFactory] [] [] [] [] [] [] - Producing instance of Job 'DEFAULT.zeng', class=com.dangdang.ddframe.job.lite.internal.schedule.LiteJob
[2020-12-11 14:31:40:001] [zeng_Worker-1] [DEBUG] [org.quartz.core.JobRunShell] [] [] [] [] [] [] - Calling execute on job DEFAULT.zeng
ShardingContext(jobName=zeng, taskId=zeng@-@1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
[2020-12-11 14:31:40:011] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.core.QuartzSchedulerThread] [] [] [] [] [] [] - batch acquisition of 1 triggers
[2020-12-11 14:31:45:000] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.simpl.PropertySettingJobFactory] [] [] [] [] [] [] - Producing instance of Job 'DEFAULT.zeng', class=com.dangdang.ddframe.job.lite.internal.schedule.LiteJob
[2020-12-11 14:31:45:000] [zeng_Worker-1] [DEBUG] [org.quartz.core.JobRunShell] [] [] [] [] [] [] - Calling execute on job DEFAULT.zeng
ShardingContext(jobName=zeng, taskId=zeng@-@1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
[2020-12-11 14:31:45:013] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.core.QuartzSchedulerThread] [] [] [] [] [] [] - batch acquisition of 1 triggers
[2020-12-11 14:31:50:001] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.simpl.PropertySettingJobFactory] [] [] [] [] [] [] - Producing instance of Job 'DEFAULT.zeng', class=com.dangdang.ddframe.job.lite.internal.schedule.LiteJob
[2020-12-11 14:31:50:001] [zeng_Worker-1] [DEBUG] [org.quartz.core.JobRunShell] [] [] [] [] [] [] - Calling execute on job DEFAULT.zeng
ShardingContext(jobName=zeng, taskId=zeng@-@1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
[2020-12-11 14:31:50:019] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.core.QuartzSchedulerThread] [] [] [] [] [] [] - batch acquisition of 1 triggers
[2020-12-11 14:31:55:000] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.simpl.PropertySettingJobFactory] [] [] [] [] [] [] - Producing instance of Job 'DEFAULT.zeng', class=com.dangdang.ddframe.job.lite.internal.schedule.LiteJob
[2020-12-11 14:31:55:000] [zeng_Worker-1] [DEBUG] [org.quartz.core.JobRunShell] [] [] [] [] [] [] - Calling execute on job DEFAULT.zeng
ShardingContext(jobName=zeng, taskId=zeng@-@1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
[2020-12-11 14:31:55:019] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.core.QuartzSchedulerThread] [] [] [] [] [] [] - batch acquisition of 1 triggers
[2020-12-11 14:32:00:001] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.simpl.PropertySettingJobFactory] [] [] [] [] [] [] - Producing instance of Job 'DEFAULT.zeng', class=com.dangdang.ddframe.job.lite.internal.schedule.LiteJob
[2020-12-11 14:32:00:001] [zeng_Worker-1] [DEBUG] [org.quartz.core.JobRunShell] [] [] [] [] [] [] - Calling execute on job DEFAULT.zeng
ShardingContext(jobName=zeng, taskId=zeng@-@1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
[2020-12-11 14:32:00:020] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.core.QuartzSchedulerThread] [] [] [] [] [] [] - batch acquisition of 1 triggers
[2020-12-11 14:32:05:000] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.simpl.PropertySettingJobFactory] [] [] [] [] [] [] - Producing instance of Job 'DEFAULT.zeng', class=com.dangdang.ddframe.job.lite.internal.schedule.LiteJob
[2020-12-11 14:32:05:000] [zeng_Worker-1] [DEBUG] [org.quartz.core.JobRunShell] [] [] [] [] [] [] - Calling execute on job DEFAULT.zeng
ShardingContext(jobName=zeng, taskId=zeng@-@1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
[2020-12-11 14:32:05:021] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.core.QuartzSchedulerThread] [] [] [] [] [] [] - batch acquisition of 1 triggers
[2020-12-11 14:32:10:001] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.simpl.PropertySettingJobFactory] [] [] [] [] [] [] - Producing instance of Job 'DEFAULT.zeng', class=com.dangdang.ddframe.job.lite.internal.schedule.LiteJob
[2020-12-11 14:32:10:001] [zeng_Worker-1] [DEBUG] [org.quartz.core.JobRunShell] [] [] [] [] [] [] - Calling execute on job DEFAULT.zeng
ShardingContext(jobName=zeng, taskId=zeng@-@1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
[2020-12-11 14:32:10:021] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.core.QuartzSchedulerThread] [] [] [] [] [] [] - batch acquisition of 1 triggers
[2020-12-11 14:32:15:000] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.simpl.PropertySettingJobFactory] [] [] [] [] [] [] - Producing instance of Job 'DEFAULT.zeng', class=com.dangdang.ddframe.job.lite.internal.schedule.LiteJob
[2020-12-11 14:32:15:000] [zeng_Worker-1] [DEBUG] [org.quartz.core.JobRunShell] [] [] [] [] [] [] - Calling execute on job DEFAULT.zeng
ShardingContext(jobName=zeng, taskId=zeng@-@1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
[2020-12-11 14:32:15:022] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.core.QuartzSchedulerThread] [] [] [] [] [] [] - batch acquisition of 1 triggers
[2020-12-11 14:32:20:001] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.simpl.PropertySettingJobFactory] [] [] [] [] [] [] - Producing instance of Job 'DEFAULT.zeng', class=com.dangdang.ddframe.job.lite.internal.schedule.LiteJob
[2020-12-11 14:32:20:001] [zeng_Worker-1] [DEBUG] [org.quartz.core.JobRunShell] [] [] [] [] [] [] - Calling execute on job DEFAULT.zeng
[2020-12-11 14:32:20:008] [zeng_Worker-1] [DEBUG] [com.dangdang.ddframe.job.lite.internal.sharding.ShardingService] [] [] [] [] [] [] - Job 'zeng' sharding begin.
[2020-12-11 14:32:20:027] [zeng_Worker-1] [DEBUG] [com.dangdang.ddframe.job.lite.internal.sharding.ShardingService] [] [] [] [] [] [] - Job 'zeng' sharding complete.
ShardingContext(jobName=zeng, taskId=zeng@-@0,1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=0, shardingParameter=null)
ShardingContext(jobName=zeng, taskId=zeng@-@0,1@-@READY@-@172.30.60.210@-@9920, shardingTotalCount=2, jobParameter=, shardingItem=1, shardingParameter=null)
[2020-12-11 14:32:20:051] [zeng_QuartzSchedulerThread] [DEBUG] [org.quartz.core.QuartzSchedulerThread] [] [] [] [] [] [] - batch acquisition of 1 triggers
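The timeline in this log can be checked with a little arithmetic. The sketch below assumes that the last log entry on 172.30.60.84 (14:31:35) was also its last run of shard 0 and that the job fires on a fixed 5s grid; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;

/** Hypothetical helper for counting the runs a shard skipped during an outage. */
final class MissedRunsSketch {

    /** Counts fire times strictly after lastRun and strictly before recovered, stepping by the job interval. */
    static int missedRuns(LocalDateTime lastRun, LocalDateTime recovered, Duration interval) {
        int missed = 0;
        for (LocalDateTime t = lastRun.plus(interval); t.isBefore(recovered); t = t.plus(interval)) {
            missed++;
        }
        return missed;
    }

    public static void main(String[] args) {
        LocalDateTime lastRun = LocalDateTime.of(2020, 12, 11, 14, 31, 35);   // last run of shard 0 on .84 (assumed)
        LocalDateTime recovered = LocalDateTime.of(2020, 12, 11, 14, 32, 20); // first run of shard 0 on .210
        System.out.println("gap: " + Duration.between(lastRun, recovered).getSeconds() + "s");
        System.out.println("missed runs of shard 0: " + missedRuns(lastRun, recovered, Duration.ofSeconds(5)));
    }
}
```

For the times above it reports a 45-second gap and 8 skipped runs of shard 0 (14:31:40 through 14:32:15).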
3. Root cause
- The nodes under /<namespace>/<job name>/instances are ephemeral nodes, so ZooKeeper notices when an instance registers, goes offline, or crashes.
- When a client session exceeds its sessionTimeout, the ZooKeeper server removes all ephemeral nodes and watchers that belong to that session.
- elastic-job uses the default sessionTimeout set in CuratorFrameworkFactory, which is 60s.
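Putting these together: when a job server crashes, ZooKeeper can take up to 60s to notice and delete the corresponding ephemeral nodes. The other job servers therefore also need up to 60s before they notice the crash and take over the crashed server's shards.
In other words, for a job whose interval is shorter than 60s, the shards on a crashed job server miss their scheduled runs until the session expires.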
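A rough way to size this exposure is to divide the detection window by the job interval. The sketch below also models one detail worth knowing: the timeout a client asks for is only a request, and the ZooKeeper server clamps it between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The class and method names are hypothetical, the tickTime of 2000 ms is the value from the sample zoo.cfg rather than anything known about the server used in this test, and the estimate ignores any extra delay in the server's expiry check.

```java
/** Hypothetical estimate of how many runs a shard can miss before its owner's session expires. */
final class MissedRunEstimate {

    /** Session timeout the server would grant, assuming the default bounds of 2x and 20x tickTime. */
    static long negotiatedTimeoutMs(long requestedMs, long tickTimeMs) {
        long min = 2 * tickTimeMs;
        long max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    /** Upper bound on the fire times that can fall inside one detection window of the given length. */
    static long maxMissedRuns(long intervalMs, long timeoutMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return (timeoutMs + intervalMs - 1) / intervalMs; // ceil(timeout / interval)
    }

    public static void main(String[] args) {
        // 5s job, full 60s window.
        System.out.println(maxMissedRuns(5_000, 60_000));
        // 5s job, 60s requested but clamped by a server with tickTime=2000 (assumed).
        System.out.println(maxMissedRuns(5_000, negotiatedTimeoutMs(60_000, 2_000)));
    }
}
```

With a full 60s window a 5s job can miss up to 12 runs. If the server caps the session at 40s (tickTime=2000 with default bounds), the bound drops to 8, which would be consistent with the roughly 45s gap observed above, although the server settings used in this test are not known.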
4. Solutions
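- Jobs with a short interval therefore need their own fallback on the job servers, so that missed shard runs do not affect the final outcome of the job.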
- elastic-job also lets you change the ZooKeeper sessionTimeout through configuration.
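One possible shape for the fallback mentioned in the first point is a catch-up watermark: each shard remembers the last time window it finished, and whichever server runs the shard next processes every window since then rather than only the current one. This is only a sketch under assumptions: the class, the in-memory map (a stand-in for a database or other durable store shared by all job servers), and the fixed window size are hypothetical, and the per-window work must be idempotent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical catch-up logic: a shard replays the windows it missed while nobody owned it. */
final class CatchUpJobSketch {

    // Stand-in for a durable store shared by all job servers, keyed by shard item.
    private final Map<Integer, Long> lastProcessedWindow = new ConcurrentHashMap<>();
    private final long intervalSeconds;

    CatchUpJobSketch(long intervalSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.intervalSeconds = intervalSeconds;
    }

    /** Returns the window start times (epoch seconds) handled by this run, oldest first. */
    List<Long> runShard(int shardingItem, long nowEpochSecond) {
        long current = nowEpochSecond - (nowEpochSecond % intervalSeconds); // align to the interval grid
        Long last = lastProcessedWindow.get(shardingItem);
        long start = (last == null) ? current : last + intervalSeconds;
        List<Long> processed = new ArrayList<>();
        for (long window = start; window <= current; window += intervalSeconds) {
            // Real code would do the idempotent business work for this window here.
            processed.add(window);
        }
        if (!processed.isEmpty()) {
            lastProcessedWindow.put(shardingItem, current);
        }
        return processed;
    }

    public static void main(String[] args) {
        CatchUpJobSketch job = new CatchUpJobSketch(5);
        System.out.println(job.runShard(0, 35)); // normal run
        System.out.println(job.runShard(0, 80)); // first run after a 45s outage replays the gap
    }
}
```

In a real job, runShard would be called from execute(ShardingContext) with shardingContext.getShardingItem(), and the watermark would have to live somewhere every job server can read and update.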
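5. Failover
Isn't there failover? The shards missed by the crashed server could be handed to other healthy servers.
That's right. But failover is not meant for short-interval periodic jobs like the one above; it suits jobs that run at relatively long intervals and take a long time per run.
For short-interval periodic jobs, enabling failover=true together with monitorExecution=true can generate a lot of network traffic to ZooKeeper and has some impact on performance. (I will cover this mechanism in more detail when time permits.)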
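To get a feel for that traffic, here is a back-of-the-envelope sketch. It assumes that with monitorExecution=true each shard creates one "running" node in ZooKeeper when a run starts and removes it when the run ends; failover bookkeeping would come on top of that and is not modelled. The class and method names are hypothetical.

```java
/** Hypothetical estimate of the ZooKeeper writes caused by execution monitoring alone. */
final class MonitorExecutionCost {

    /** One node creation plus one node deletion per shard per run (assumed behaviour). */
    static long runningNodeWritesPerHour(int shardingTotalCount, long intervalSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long runsPerHour = 3600 / intervalSeconds;
        return runsPerHour * shardingTotalCount * 2;
    }

    public static void main(String[] args) {
        System.out.println(runningNodeWritesPerHour(2, 5));    // the 5s job from this article
        System.out.println(runningNodeWritesPerHour(2, 3600)); // the same job run hourly
    }
}
```

For the two-shard, 5s job in this article that is 2880 writes per hour, against 4 for an hourly job; the figure grows linearly with the shard count and with the number of such jobs.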