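Successfully Starting the Cluster: Fixing the “Regular plan unhealthy!” Problem

This article describes the topology submission problems encountered while setting up a Heron cluster and how they were solved: the Aurora and Mesos startup problems, the THROTTLED instance state, and task failures caused by insufficient cluster resources. After adjusting the configuration and adding a node, the topology was submitted successfully.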

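Prior cluster configuration

Fixing the Aurora and Mesos startup problems

Description of the Aurora Instance “THROTTLED” problem

After the earlier Heron topology submission, the Instance in Aurora stayed in the following state:

[Image: Aurora instance status]

In addition, the stderr log file in Mesos contained the following error: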

E0312 05:05:38.146308 2767 runner.py:299] Regular plan unhealthy!

and:

-get: java.net.UnknownHostException: heron
Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc] <src> … <localdst>

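Investigation: I checked every cluster-related configuration file, and none of them defines a host named heron.

The Mesos UI shows the run details of the submitted task in the task.json file: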

{
     "processes":[
         {
             "daemon":false,
             "name":"fetch_heron_system",
             "max_failures":1,
             "ephemeral":false,
             "min_duration":5,
             "cmdline":"/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get hdfs://heron/dist/heron-core.tar.gz heron-core.tar.gz && tar zxf heron-core.tar.gz",
            "final":false
         },
         {
             "daemon":false,
             "name":"fetch_user_package",
             "max_failures":1,
             "ephemeral":false,
             "min_duration":5,
             "cmdline":"/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get /heron/topologies/aurora/WordCountTopology-yitian-tag-0--6983988323592208556.tar.gz topology.tar.gz && tar zxf topology.tar.gz",
            "final":false
         },
         {
             "daemon":false,
             "name":"launch_heron_executor",
             "max_failures":1,
             "ephemeral":false,
             "min_duration":5,
             "cmdline":"./heron-core/bin/heron-executor --shard=0 --topology-name=WordCountTopology --topology-id=WordCountTopology2b628b69-ed95-421d-9a2c-1704996f05df --topology-defn-file=WordCountTopology.defn --state-manager-connection=heron04:2181 --state-manager-root=/heron --tmaster-binary=./heron-core/bin/heron-tmaster --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath="./heron-core/lib/metricsmgr/*" --instance-jvm-opts="" --classpath="heron-api-examples.jar" --master-port={{thermos.ports[port1]}} --tmaster-controller-port={{thermos.ports[port2]}} --tmaster-stats-port={{thermos.ports[port3]}} --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=word:1073741824,consumer:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=/usr/lib/jvm/java-1.8.0-openjdk-amd64 --shell-port={{thermos.ports[http]}} --heron-shell-binary=./heron-core/bin/heron-shell --metrics-manager-port={{thermos.ports[port4]}} --cluster=aurora --role=yitian --environment=devel --instance-classpath="./heron-core/lib/instance/*" --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath="./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/*" --scheduler-port="{{thermos.ports[scheduler]}}" --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-master-port={{thermos.ports[metricscachemgr_masterport]}} --metricscache-manager-stats-port={{thermos.ports[metricscachemgr_statsport]}} --is-stateful=false --checkpoint-manager-classpath="./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*:" --checkpoint-manager-port={{thermos.ports[ckptmgr_port]}} --stateful-config-file=./heron-conf/stateful.yaml --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/*",
             "final":false
         },
         {
             "daemon":false,
             "name":"discover_profiler_port",
             "max_failures":1,
             "ephemeral":false,
             "min_duration":5,
             "cmdline":"echo {{thermos.ports[yourkit]}} > yourkit.port",
             "final":false
         }
     ],
     "name":"setup_and_run",
     "finalization_wait":30,
     "max_failures":1,
     "max_concurrency":0,
     "resources":{
         "gpu":0,
         "disk":2147483648,
         "ram":4294967296,
         "cpu":2
     },
     "constraints":[
         {
             "order":[
                 "fetch_heron_system",
                 "fetch_user_package",
                 "launch_heron_executor",
                 "discover_profiler_port"
             ]
         }
     ]
}
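When a task keeps failing, it helps to pull the individual commands out of task.json in the order they run, so that each one can be tried by hand. The following is a minimal sketch of that idea in Python. It works on a trimmed excerpt of the listing above (only the name and cmdline fields of the two fetch processes, which is a simplification made here), and the helper name ordered_commands is hypothetical.

```python
import json

# Trimmed excerpt of the task.json above: only the fields used here, and only
# the two fetch processes (the long executor command line is left out).
TASK_JSON = """
{
  "processes": [
    {"name": "fetch_heron_system",
     "cmdline": "/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get hdfs://heron/dist/heron-core.tar.gz heron-core.tar.gz && tar zxf heron-core.tar.gz"},
    {"name": "fetch_user_package",
     "cmdline": "/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get /heron/topologies/aurora/WordCountTopology-yitian-tag-0--6983988323592208556.tar.gz topology.tar.gz && tar zxf topology.tar.gz"}
  ],
  "constraints": [
    {"order": ["fetch_heron_system", "fetch_user_package"]}
  ]
}
"""


def ordered_commands(task):
    """Return (name, cmdline) pairs in the order given by the first 'order' constraint."""
    by_name = {p["name"]: p["cmdline"] for p in task["processes"]}
    order = task["constraints"][0]["order"]
    # Skip names that have no matching process entry in this excerpt.
    return [(name, by_name[name]) for name in order if name in by_name]


task = json.loads(TASK_JSON)
commands = ordered_commands(task)
for name, cmdline in commands:
    print(f"{name}:\n  {cmdline}")
```

The first command printed is the one that fetches heron-core.tar.gz, and it is the only one that uses an hdfs:// URI, which makes it the natural first candidate to run manually.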

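Solution

1. First, fix the -get: java.net.UnknownHostException: heron problem:

The task.json file lists the cmdline of each process, that is, the command it runs. When a process fails, it is this command that failed. The command of the first process is: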

/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get hdfs://heron/dist/heron-core.tar.gz heron-core.tar.gz

I ran this command on its own on heron04 (the master, where Hadoop is configured) and got exactly the same error:

yitian@heron04:~$ /home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get hdfs://heron/dist/heron-core.tar.gz heron-core.tar.gz
-get: java.net.UnknownHostException: heron
Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>

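I suspected the hdfs:// form used in the command. In a URI of this form, the component right after hdfs:// is read as the NameNode host, so heron was being treated as a hostname. After removing the hdfs:// prefix from the file path, the command succeeded: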

yitian@heron04:~$ /home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get /heron/dist/heron-core.tar.gz heron-core.tar.gz
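The difference between the two forms becomes visible when the strings are split into URI components. The sketch below uses Python's urllib.parse as a stand-in for Hadoop's own URI handling (an assumption: Hadoop does its parsing in Java, but the generic scheme://authority/path split is the same). The helper name describe is hypothetical, and the third form, hdfs:/// with an empty authority, was not tried in this post.

```python
from urllib.parse import urlparse


def describe(uri):
    """Split a location string into scheme, authority (host) and path."""
    parts = urlparse(uri)
    return {"scheme": parts.scheme, "authority": parts.netloc, "path": parts.path}


# The failing form: "heron" lands in the authority slot, i.e. it is a hostname.
failing = describe("hdfs://heron/dist/heron-core.tar.gz")

# The working form: no scheme and no authority, only a path.
working = describe("/heron/dist/heron-core.tar.gz")

# Not tried in this post: an explicit scheme with an empty authority.
empty_authority = describe("hdfs:///heron/dist/heron-core.tar.gz")

for label, parts in [("failing", failing), ("working", working),
                     ("empty authority", empty_authority)]:
    print(f"{label:16} {parts}")
```

In the failing form the path is /dist/heron-core.tar.gz and the host is heron, which matches the UnknownHostException. In the working form the whole string is a path, so the client presumably falls back to the default file system from its Hadoop configuration.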

This locates the error: in the client.yaml file under conf/aurora in the Heron cluster configuration, the path to the shared location of heron-core.tar.gz is set incorrectly.

Original:

# location of the core package
# heron.package.core.uri:                      "file:///vagrant/.herondata/dist/heron-core-release.tar.gz"
heron.package.core.uri:                      "hdfs://heron/dist/heron-core.tar.gz"
 
# Whether role/env is required to submit a topology. Default value is False.
heron.config.is.role.required:               True
heron.config.is.env.required:                True

Changed to:

# location of the core package
# heron.package.core.uri:                      "file:///vagrant/.herondata/dist/heron-core-release.tar.gz"
heron.package.core.uri:                      "/heron/dist/heron-core.tar.gz"
 
# Whether role/env is required to submit a topology. Default value is False.
heron.config.is.role.required:               True
heron.config.is.env.required:                True

This fixes the problem.

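2. Fixing the Aurora Instance “THROTTLED” problem

The stderr log file in Mesos showed: E0312 05:05:38.146308 2767 runner.py:299] Regular plan unhealthy!. I suspected that the cluster had too few nodes to satisfy the resources requested by the topology. (When the problem occurred, the cluster had only two nodes: heron04 with 5G and heron06 with 5G.)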
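To get a feel for how tight the resources are, the per-container request in the resources block of task.json can be compared with the node sizes. The sketch below is only back-of-the-envelope arithmetic under loud assumptions: it reads "5G" as 5 GiB and pretends that each node offers all of its memory to Mesos, which is optimistic because the operating system and the cluster daemons also need memory. The helper name containers_that_fit is hypothetical, and the result does not by itself confirm the cause of the failure.

```python
GIB = 1024 ** 3

# Per-container request, copied from the "resources" block of task.json.
request = {"ram": 4294967296, "disk": 2147483648, "cpu": 2}

# Node memory as stated in the post. Assumptions: "5G" means 5 GiB, and the
# whole amount is offered to Mesos (optimistic).
nodes = {"heron04": 5 * GIB, "heron06": 5 * GIB}


def containers_that_fit(node_memory, ram_per_container):
    """Count how many containers fit when each one must sit on a single node."""
    return sum(mem // ram_per_container for mem in node_memory.values())


ram_gib = request["ram"] / GIB
disk_gib = request["disk"] / GIB
fit = containers_that_fit(nodes, request["ram"])
headroom_gib = {name: (mem % request["ram"]) / GIB for name, mem in nodes.items()}

print(f"request per container: {ram_gib} GiB RAM, {disk_gib} GiB disk, {request['cpu']} CPUs")
print(f"containers that fit on {len(nodes)} nodes: {fit}")
print(f"memory left per node after one container: {headroom_gib}")
```

Each container asks for 4 GiB of RAM, so even under these optimistic assumptions a 5 GiB node holds at most one container and keeps only 1 GiB for everything else.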

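I therefore added a third node, heron05 (same configuration as heron06), to the cluster. With the first problem already fixed, I reconfigured the cluster so that it started correctly and submitted the topology again. This time the Heron topology was submitted successfully, as shown in the next section.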

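Successfully starting the Heron cluster: submitting the topology

After fixing the two problems above, I submitted the topology again with the command below. The submission succeeded!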

yitian@heron04:~$ heron submit aurora/yitian/devel --config-path ~/.heron/conf ~/.heron/examples/heron-api-examples.jar com.twitter.heron.examples.api.WordCountTopology WordCountTopology --deploy-deactivated
[2018-03-15 05:53:45 +0000] [INFO]: Using cluster definition in /home/yitian/.heron/conf/aurora
[2018-03-15 05:53:45 +0000] [INFO]: Launching topology: 'WordCountTopology'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/yitian/.heron/lib/uploader/heron-dlog-uploader.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/yitian/.heron/lib/statemgr/heron-zookeeper-statemgr.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
[2018-03-15 05:53:46 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Starting Curator client connecting to: heron04:2181  
[2018-03-15 05:53:46 -0700] [INFO] org.apache.curator.framework.imps.CuratorFrameworkImpl: Starting  
[2018-03-15 05:53:46 -0700] [INFO] org.apache.curator.framework.state.ConnectionStateManager: State change: CONNECTED  
[2018-03-15 05:53:46 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Directory tree initialized.  
[2018-03-15 05:53:46 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Checking existence of path: /heron/topologies/WordCountTopology  
[2018-03-15 05:53:50 -0700] [INFO] com.twitter.heron.uploader.hdfs.HdfsUploader: Target topology file already exists at '/heron/topologies/aurora/WordCountTopology-yitian-tag-0-8136175565428738886.tar.gz'. Overwriting it now  
[2018-03-15 05:53:50 -0700] [INFO] com.twitter.heron.uploader.hdfs.HdfsUploader: Uploading topology package at '/tmp/tmp2JPHpD/topology.tar.gz' to target HDFS at '/heron/topologies/aurora/WordCountTopology-yitian-tag-0-8136175565428738886.tar.gz'  
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/topologies/WordCountTopology  
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/packingplans/WordCountTopology  
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/executionstate/WordCountTopology  
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.scheduler.aurora.AuroraLauncher: Launching topology in aurora  
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.scheduler.utils.SchedulerUtils: Updating scheduled-resource in packing plan: WordCountTopology  
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Deleted node for path: /heron/packingplans/WordCountTopology  
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/packingplans/WordCountTopology  
  INFO] Creating job WordCountTopology
  INFO] Checking status of aurora/yitian/devel/WordCountTopology
Job create succeeded: job url=http://218.195.228.52:8081/scheduler/yitian/devel/WordCountTopology
[2018-03-15 05:54:06 -0700] [INFO] com.twitter.heron.scheduler.utils.SchedulerUtils: Setting Scheduler locations: topology_name: "WordCountTopology"
http_endpoint: "scheduler_as_lib_no_endpoint"
   
[2018-03-15 05:54:06 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/schedulers/WordCountTopology  
[2018-03-15 05:54:06 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the CuratorClient to: heron04:2181  
[2018-03-15 05:54:06 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the tunnel processes  
[2018-03-15 05:54:06 +0000] [INFO]: Successfully launched topology 'WordCountTopology'

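Component status after the topology was submitted

Aurora Scheduler status

[Image: Aurora Scheduler status (1 of 2)]

[Image: Aurora Scheduler status (2 of 2)]

Mesos status

The two tasks under Active Tasks are in the RUNNING state and have been assigned hosts.

[Image: Mesos Active Tasks]

Opening the sandbox of a task shows the files of the submitted Heron job:

[Image: Mesos task sandbox (1 of 2)]

[Image: Mesos task sandbox (2 of 2)]

Heron Tracker status

[Image: Heron Tracker status]

Heron UI status

The Logical and Physical PLAN of the submitted topology are visible in Heron UI. (Opening heron-ui is quite slow; the cause is still unknown.)

[Image: Heron UI logical and physical plan]

Note: a few smaller components still have problems, which will be fixed later.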
