Preliminary Cluster Configuration
Resolving Aurora and Mesos Startup Problems
Problem Description: Aurora Instance “THROTTLED”
After the Heron topology submission described earlier, the instance status in Aurora remained as follows:
In addition, the stderr log file in Mesos contained the following error:
E0312 05:05:38.146308 2767 runner.py:299] Regular plan unhealthy!
as well as:
-get: java.net.UnknownHostException: heron
Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc] <src> … <localdst>
Investigation: after checking all cluster-related configuration files, no host named heron was found anywhere.
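A quick sanity check is to ask the resolver directly; a minimal sketch, assuming getent is available on the nodes:

# Look up 'heron' across /etc/hosts, DNS, and any other NSS sources.
getent hosts heron || echo "no host named 'heron' is resolvable"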
The run details of the submitted task (the task.json file) can be viewed in the Mesos UI:
{
  "processes": [
    {
      "daemon": false,
      "name": "fetch_heron_system",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get hdfs://heron/dist/heron-core.tar.gz heron-core.tar.gz && tar zxf heron-core.tar.gz",
      "final": false
    },
    {
      "daemon": false,
      "name": "fetch_user_package",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get /heron/topologies/aurora/WordCountTopology-yitian-tag-0--6983988323592208556.tar.gz topology.tar.gz && tar zxf topology.tar.gz",
      "final": false
    },
    {
      "daemon": false,
      "name": "launch_heron_executor",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "./heron-core/bin/heron-executor --shard=0 --topology-name=WordCountTopology --topology-id=WordCountTopology2b628b69-ed95-421d-9a2c-1704996f05df --topology-defn-file=WordCountTopology.defn --state-manager-connection=heron04:2181 --state-manager-root=/heron --tmaster-binary=./heron-core/bin/heron-tmaster --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath="./heron-core/lib/metricsmgr/*" --instance-jvm-opts="" --classpath="heron-api-examples.jar" --master-port={{thermos.ports[port1]}} --tmaster-controller-port={{thermos.ports[port2]}} --tmaster-stats-port={{thermos.ports[port3]}} --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=word:1073741824,consumer:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=/usr/lib/jvm/java-1.8.0-openjdk-amd64 --shell-port={{thermos.ports[http]}} --heron-shell-binary=./heron-core/bin/heron-shell --metrics-manager-port={{thermos.ports[port4]}} --cluster=aurora --role=yitian --environment=devel --instance-classpath="./heron-core/lib/instance/*" --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath="./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/*" --scheduler-port="{{thermos.ports[scheduler]}}" --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-master-port={{thermos.ports[metricscachemgr_masterport]}} --metricscache-manager-stats-port={{thermos.ports[metricscachemgr_statsport]}} --is-stateful=false --checkpoint-manager-classpath="./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*:" --checkpoint-manager-port={{thermos.ports[ckptmgr_port]}} --stateful-config-file=./heron-conf/stateful.yaml --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/*",
      "final": false
    },
    {
      "daemon": false,
      "name": "discover_profiler_port",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "echo {{thermos.ports[yourkit]}} > yourkit.port",
      "final": false
    }
  ],
  "name": "setup_and_run",
  "finalization_wait": 30,
  "max_failures": 1,
  "max_concurrency": 0,
  "resources": {
    "gpu": 0,
    "disk": 2147483648,
    "ram": 4294967296,
    "cpu": 2
  },
  "constraints": [
    {
      "order": [
        "fetch_heron_system",
        "fetch_user_package",
        "launch_heron_executor",
        "discover_profiler_port"
      ]
    }
  ]
}
Problem Resolution
1. First, resolve the -get: java.net.UnknownHostException: heron problem:
task.json exposes each task's cmdline, i.e., the exact command it runs, so a failed task means that command failed. The first task's command is:
/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get hdfs://heron/dist/heron-core.tar.gz heron-core.tar.gz
Running this command by itself on heron04 (the master host, where Hadoop is configured) reproduces exactly the same error:
yitian@heron04:~$ /home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get hdfs://heron/dist/heron-core.tar.gz heron-core.tar.gz
-get: java.net.UnknownHostException: heron
Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>
The suspect is the hdfs:// form of the path: in a full HDFS URI, the component right after hdfs:// is parsed as the NameNode host, so heron was being resolved as a hostname instead of a directory. With the hdfs:// prefix removed, the command succeeds:
yitian@heron04:~$ /home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get /heron/dist/heron-core.tar.gz heron-core.tar.gz
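The hdfs:// scheme itself would also work if the authority part named the real NameNode. For illustration only, assuming the NameNode from fs.defaultFS runs on heron04 with the common default port 9000 (neither is confirmed here):

# Equivalent fetch with a fully-qualified URI; heron04:9000 must match
# fs.defaultFS in core-site.xml for this to work.
/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -get hdfs://heron04:9000/heron/dist/heron-core.tar.gz heron-core.tar.gz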
This pinpoints the misconfiguration: in the Heron cluster configuration, the client.yaml file under conf/aurora sets a wrong path for the shared location of heron-core.tar.gz.
Before:
# location of the core package
# heron.package.core.uri: "file:///vagrant/.herondata/dist/heron-core-release.tar.gz"
heron.package.core.uri: "hdfs://heron/dist/heron-core.tar.gz"
# Whether role/env is required to submit a topology. Default value is False.
heron.config.is.role.required: True
heron.config.is.env.required: True
After:
# location of the core package
# heron.package.core.uri: "file:///vagrant/.herondata/dist/heron-core-release.tar.gz"
heron.package.core.uri: "/heron/dist/heron-core.tar.gz"
# Whether role/env is required to submit a topology. Default value is False.
heron.config.is.role.required: True
heron.config.is.env.required: True
This resolves the first problem.
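Before resubmitting, it is worth confirming that the core package really sits at the path client.yaml now points to; a quick check:

# List the shared directory; heron-core.tar.gz should appear here.
/home/yitian/hadoop/hadoop-2.7.4/bin/hdfs dfs -ls /heron/dist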
2. Resolve the Aurora instance “THROTTLED” problem
The hint in the Mesos stderr log was: E0312 05:05:38.146308 2767 runner.py:299] Regular plan unhealthy!. Aurora puts a task into THROTTLED when it keeps failing and is being restarted too quickly, so the suspected cause was that the cluster had too few nodes to satisfy the topology's resource requirements. (When the problem occurred, the cluster had only two nodes: heron04 with 5 GB and heron06 with 5 GB.)
A third node, heron05 (same configuration as heron06), was therefore added to the cluster. After also applying the fix for the first problem, the cluster was reconfigured so that it starts correctly, and resubmitting the topology succeeded. The details follow.
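To confirm the shortfall, compare the per-container request in task.json (cpu 2, ram 4294967296 bytes = 4 GiB, disk 2 GiB) against what each Mesos agent actually advertises. A rough check, assuming the default agent port 5051 (/state on recent Mesos releases, /state.json on older ones):

# 'mem' in the agent state is reported in MB; an agent on a 5 GB node
# usually advertises less than 4096 MB once system overhead is deducted.
curl -s http://heron06:5051/state | grep -o '"mem":[0-9.]*'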
Successfully Starting the Heron Cluster – Submitting the Topology
After resolving the two problems above, rerunning the topology submission command produces the output below. The topology is submitted successfully!
yitian@heron04:~$ heron submit aurora/yitian/devel --config-path ~/.heron/conf ~/.heron/examples/heron-api-examples.jar com.twitter.heron.examples.api.WordCountTopology WordCountTopology --deploy-deactivated
[2018-03-15 05:53:45 +0000] [INFO]: Using cluster definition in /home/yitian/.heron/conf/aurora
[2018-03-15 05:53:45 +0000] [INFO]: Launching topology: 'WordCountTopology'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/yitian/.heron/lib/uploader/heron-dlog-uploader.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/yitian/.heron/lib/statemgr/heron-zookeeper-statemgr.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
[2018-03-15 05:53:46 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Starting Curator client connecting to: heron04:2181
[2018-03-15 05:53:46 -0700] [INFO] org.apache.curator.framework.imps.CuratorFrameworkImpl: Starting
[2018-03-15 05:53:46 -0700] [INFO] org.apache.curator.framework.state.ConnectionStateManager: State change: CONNECTED
[2018-03-15 05:53:46 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Directory tree initialized.
[2018-03-15 05:53:46 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Checking existence of path: /heron/topologies/WordCountTopology
[2018-03-15 05:53:50 -0700] [INFO] com.twitter.heron.uploader.hdfs.HdfsUploader: Target topology file already exists at '/heron/topologies/aurora/WordCountTopology-yitian-tag-0-8136175565428738886.tar.gz'. Overwriting it now
[2018-03-15 05:53:50 -0700] [INFO] com.twitter.heron.uploader.hdfs.HdfsUploader: Uploading topology package at '/tmp/tmp2JPHpD/topology.tar.gz' to target HDFS at '/heron/topologies/aurora/WordCountTopology-yitian-tag-0-8136175565428738886.tar.gz'
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/topologies/WordCountTopology
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/packingplans/WordCountTopology
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/executionstate/WordCountTopology
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.scheduler.aurora.AuroraLauncher: Launching topology in aurora
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.scheduler.utils.SchedulerUtils: Updating scheduled-resource in packing plan: WordCountTopology
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Deleted node for path: /heron/packingplans/WordCountTopology
[2018-03-15 05:53:54 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/packingplans/WordCountTopology
INFO] Creating job WordCountTopology
INFO] Checking status of aurora/yitian/devel/WordCountTopology
Job create succeeded: job url=http://218.195.228.52:8081/scheduler/yitian/devel/WordCountTopology
[2018-03-15 05:54:06 -0700] [INFO] com.twitter.heron.scheduler.utils.SchedulerUtils: Setting Scheduler locations: topology_name: "WordCountTopology"
http_endpoint: "scheduler_as_lib_no_endpoint"
[2018-03-15 05:54:06 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/schedulers/WordCountTopology
[2018-03-15 05:54:06 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the CuratorClient to: heron04:2181
[2018-03-15 05:54:06 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the tunnel processes
[2018-03-15 05:54:06 +0000] [INFO]: Successfully launched topology 'WordCountTopology'
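Beyond the job URL printed in the output above, the Aurora client can confirm the job from the command line; a minimal check, assuming the aurora client is on the PATH:

# Query the scheduler for the status of the freshly created job.
aurora job status aurora/yitian/devel/WordCountTopology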
Cluster Component Status After a Successful Submission
Aurora Scheduler status
Mesos status
The two tasks under Active Tasks have the status RUNNING and have been assigned to their respective hosts.
Clicking into a task's sandbox shows the files of the submitted Heron job:
Heron Tracker status
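A quick way to confirm that the Tracker is actually serving topology data, assuming it runs on its default port 8888:

# The Tracker's REST API lists the topologies it knows about.
curl -s http://localhost:8888/topologies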
Heron UI status
The Logical Plan and Physical Plan of the submitted topology are visible in the UI. (Opening the Heron UI is quite slow; the cause is not yet known!)
Note: a few minor components still have problems; those will be addressed later.