Mesos and Aurora Configuration in the Cluster
Topology submission command
yitian@heron04:~$ heron submit aurora/yitian/devel --config-path ~/.heron/conf ~/.heron/examples/heron-api-examples.jar com.twitter.heron.examples.api.WordCountTopology WordCountTopology --deploy-deactivated
[2018-03-12 06:35:51 +0000] [INFO]: Using cluster definition in /home/yitian/.heron/conf/aurora
[2018-03-12 06:35:52 +0000] [INFO]: Launching topology: 'WordCountTopology'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/yitian/.heron/lib/uploader/heron-dlog-uploader.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/yitian/.heron/lib/statemgr/heron-zookeeper-statemgr.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
[2018-03-12 06:35:53 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Starting Curator client connecting to: heron04:2181
[2018-03-12 06:35:53 -0700] [INFO] org.apache.curator.framework.imps.CuratorFrameworkImpl: Starting
[2018-03-12 06:35:53 -0700] [INFO] org.apache.curator.framework.state.ConnectionStateManager: State change: CONNECTED
[2018-03-12 06:35:53 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Directory tree initialized.
[2018-03-12 06:35:53 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Checking existence of path: /heron/topologies/WordCountTopology
[2018-03-12 06:35:57 -0700] [INFO] com.twitter.heron.uploader.hdfs.HdfsUploader: Target topology file already exists at '/heron/topologies/aurora/WordCountTopology-yitian-tag-0-4050739266681926687.tar.gz'. Overwriting it now
[2018-03-12 06:35:57 -0700] [INFO] com.twitter.heron.uploader.hdfs.HdfsUploader: Uploading topology package at '/tmp/tmpSkEzuj/topology.tar.gz' to target HDFS at '/heron/topologies/aurora/WordCountTopology-yitian-tag-0-4050739266681926687.tar.gz'
[2018-03-12 06:36:01 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/topologies/WordCountTopology
[2018-03-12 06:36:01 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/packingplans/WordCountTopology
[2018-03-12 06:36:01 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/executionstate/WordCountTopology
[2018-03-12 06:36:02 -0700] [INFO] com.twitter.heron.scheduler.aurora.AuroraLauncher: Launching topology in aurora
[2018-03-12 06:36:02 -0700] [INFO] com.twitter.heron.scheduler.utils.SchedulerUtils: Updating scheduled-resource in packing plan: WordCountTopology
[2018-03-12 06:36:02 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Deleted node for path: /heron/packingplans/WordCountTopology
[2018-03-12 06:36:02 -0700] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/packingplans/WordCountTopology
INFO] Creating job WordCountTopology
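The CuratorStateManager lines in the log above record which ZooKeeper nodes the submission created and deleted. As a rough sketch (the function name `surviving_nodes` and the regular expression are illustrative assumptions, not part of Heron), the following replays those events to list the nodes that remain after submission, which may help when checking what state a failed run left behind:

```python
import re

# Matches the "Created node" / "Deleted node" messages seen in the submit log.
NODE_EVENT = re.compile(
    r"CuratorStateManager: (Created|Deleted) node for path: (\S+)"
)


def surviving_nodes(log_lines):
    """Replay create/delete events and return the paths that still exist."""
    nodes = []
    for line in log_lines:
        match = NODE_EVENT.search(line)
        if match is None:
            continue  # not a node event, e.g. SLF4J or uploader output
        action, path = match.groups()
        if action == "Created" and path not in nodes:
            nodes.append(path)
        elif action == "Deleted" and path in nodes:
            nodes.remove(path)
    return nodes


# Shortened copies of the node events from the log above.
sample = [
    "CuratorStateManager: Created node for path: /heron/topologies/WordCountTopology",
    "CuratorStateManager: Created node for path: /heron/packingplans/WordCountTopology",
    "CuratorStateManager: Created node for path: /heron/executionstate/WordCountTopology",
    "CuratorStateManager: Deleted node for path: /heron/packingplans/WordCountTopology",
    "CuratorStateManager: Created node for path: /heron/packingplans/WordCountTopology",
]
remaining = surviving_nodes(sample)
for path in remaining:
    print(path)
```

The packing plan node is deleted and recreated once, so it ends up last in the list while all three nodes still exist.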
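Mesos runtime status
Check the running state on the agent host:
The kill command was run last. Before the kill, every task was in the FAILED state.
The Browse view shows the detailed run logs of the task, as follows:
The Sandbox shows the detailed run logs of the task:
Click the stderr file to view its contents: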
Log file created at: 2018/03/12 05:05:25
Running on machine: heron06
[DIWEF]mmdd hh:mm:ss.uuuuuu pid file:line] msg
Command line: /home/yitian/mesosdata/run/slaves/0f1a6aac-4d22-40f6-a8a1-1044bcd0a605-S0/frameworks/6663765c-74c6-4af4-8d75-18a8e11ad493-0000/executors/thermos-yitian-devel-WordCountTopology-0-ddf849f3-c077-48ba-b2f4-bc3d8b943156/runs/75bd9591-854a-47d0-8f16-6f96e2aa1cee/thermos_runner.pex --setuid=yitian --task_id=yitian-devel-WordCountTopology-0-ddf849f3-c077-48ba-b2f4-bc3d8b943156 --log_to_disk=DEBUG --hostname=heron06 --thermos_json=/home/yitian/mesosdata/run/slaves/0f1a6aac-4d22-40f6-a8a1-1044bcd0a605-S0/frameworks/6663765c-74c6-4af4-8d75-18a8e11ad493-0000/executors/thermos-yitian-devel-WordCountTopology-0-ddf849f3-c077-48ba-b2f4-bc3d8b943156/runs/75bd9591-854a-47d0-8f16-6f96e2aa1cee/task.json --sandbox=/home/yitian/mesosdata/run/slaves/0f1a6aac-4d22-40f6-a8a1-1044bcd0a605-S0/frameworks/6663765c-74c6-4af4-8d75-18a8e11ad493-0000/executors/thermos-yitian-devel-WordCountTopology-0-ddf849f3-c077-48ba-b2f4-bc3d8b943156/runs/75bd9591-854a-47d0-8f16-6f96e2aa1cee/sandbox --log_dir=/home/yitian/mesosdata/run/slaves/0f1a6aac-4d22-40f6-a8a1-1044bcd0a605-S0/frameworks/6663765c-74c6-4af4-8d75-18a8e11ad493-0000/executors/thermos-yitian-devel-WordCountTopology-0-ddf849f3-c077-48ba-b2f4-bc3d8b943156/runs/75bd9591-854a-47d0-8f16-6f96e2aa1cee --checkpoint_root=/home/yitian/mesosdata/run/slaves/0f1a6aac-4d22-40f6-a8a1-1044bcd0a605-S0/frameworks/6663765c-74c6-4af4-8d75-18a8e11ad493-0000/executors/thermos-yitian-devel-WordCountTopology-0-ddf849f3-c077-48ba-b2f4-bc3d8b943156/runs/75bd9591-854a-47d0-8f16-6f96e2aa1cee/checkpoints --container_sandbox=/home/yitian/mesosdata/run/slaves/0f1a6aac-4d22-40f6-a8a1-1044bcd0a605-S0/frameworks/6663765c-74c6-4af4-8d75-18a8e11ad493-0000/executors/thermos-yitian-devel-WordCountTopology-0-ddf849f3-c077-48ba-b2f4-bc3d8b943156/runs/75bd9591-854a-47d0-8f16-6f96e2aa1cee/sandbox --port=port4:31665 --port=http:31795 --port=metricscachemgr_masterport:31451 --port=yourkit:31819 --port=aurora:31795 --port=metricscachemgr_statsport:31052 
--port=scheduler:31768 --port=ckptmgr_port:31438 --port=port2:31209 --port=port3:31829 --port=port1:31471
E0312 05:05:38.146308 2767 runner.py:299] Regular plan unhealthy!
The log above is the error log found in the deepest directory of the sandbox. Why does this error occur?
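The stderr header documents the line format as `[DIWEF]mmdd hh:mm:ss.uuuuuu pid file:line] msg`. A minimal sketch, assuming that format holds for every log record (the names `parse_log_line`, `errors_only`, and `LogRecord` are hypothetical helpers, not part of Thermos or Mesos), shows how the error records could be picked out of a long stderr file:

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern for "[DIWEF]mmdd hh:mm:ss.uuuuuu pid file:line] msg".
LOG_LINE = re.compile(
    r"^(?P<level>[DIWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<pid>\d+) "
    r"(?P<source>[^\]\s]+:\d+)\] (?P<msg>.*)$"
)


class LogRecord(NamedTuple):
    level: str
    month: int
    day: int
    time: str
    pid: int
    source: str
    msg: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Return a LogRecord, or None for header and continuation lines."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return LogRecord(
        level=match["level"],
        month=int(match["month"]),
        day=int(match["day"]),
        time=match["time"],
        pid=int(match["pid"]),
        source=match["source"],
        msg=match["msg"],
    )


def errors_only(lines: Iterable[str]) -> List[LogRecord]:
    """Keep only records whose level is E (error) or F (fatal)."""
    records = (parse_log_line(line) for line in lines)
    return [r for r in records if r is not None and r.level in ("E", "F")]


record = parse_log_line(
    "E0312 05:05:38.146308 2767 runner.py:299] Regular plan unhealthy!"
)
print(record)
```

Header lines such as `Running on machine: heron06` do not match the pattern and are skipped, so only the `runner.py:299` record remains when the file is filtered with `errors_only`.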
Solution to this problem: see "Successfully Starting the Cluster: Solving the 'Regular plan unhealthy!' Problem".
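The Command line in the stderr output also carries the port assignments as `--port=name:port` flags. The sketch below (the helper names `parse_ports` and `shared_ports` are illustrative assumptions) collects them into a dictionary and reports any port number that more than one name maps to. In this log the names `http` and `aurora` both map to 31795; the sketch only surfaces that fact and does not imply it is the cause of the failure.

```python
import re
from collections import defaultdict

# Matches flags such as "--port=http:31795".
PORT_FLAG = re.compile(r"--port=([A-Za-z0-9_]+):(\d+)")


def parse_ports(command_line):
    """Map each port name to its assigned port number."""
    return {name: int(port) for name, port in PORT_FLAG.findall(command_line)}


def shared_ports(ports):
    """Return {port: [names]} for ports assigned to more than one name."""
    by_port = defaultdict(list)
    for name, port in ports.items():
        by_port[port].append(name)
    return {p: sorted(names) for p, names in by_port.items() if len(names) > 1}


# The --port flags copied from the Command line in the stderr output.
flags = (
    "--port=port4:31665 --port=http:31795 "
    "--port=metricscachemgr_masterport:31451 --port=yourkit:31819 "
    "--port=aurora:31795 --port=metricscachemgr_statsport:31052 "
    "--port=scheduler:31768 --port=ckptmgr_port:31438 --port=port2:31209 "
    "--port=port3:31829 --port=port1:31471"
)
ports = parse_ports(flags)
print(shared_ports(ports))
```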
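Aurora runtime status
Check heron04:8081:
The two instances in the topology are in the following state:
THROTTLED means the task is being throttled. What does this state signify, and why does it appear?
View the completed tasks:
The figure above shows the state transitions of instance 0, which ultimately ends in the FAILED state.
- What does the message "No health-check defined, task is assumed healthy." mean?
- Also, clicking heron06 on the right leads to a page that cannot be found. Why?
Solutions:
- Fixing the Aurora and Mesos problem: see "Successfully Starting the Cluster: Solving the 'Regular plan unhealthy!' Problem"
- Fixing the agent link that cannot be found: see "Configuring and Starting Aurora thermos_observer"
Heron runtime status
heron-tracker status:
heron-ui status:
The page stays unresponsive for a long time because its actual response time is very long. Even after the problems above were resolved, it still responds slowly. Why?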
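One way to narrow down the slowness is to time a request against heron-tracker and against heron-ui separately, to see which side accounts for the delay. This is only a sketch: the helper `time_request` is hypothetical, and the host, ports, and `/topologies` path in the commented usage are assumptions about this deployment that need to be adjusted to the actual setup.

```python
import time
import urllib.request


def time_request(url, timeout=30.0, opener=urllib.request.urlopen):
    """Fetch url once; return (elapsed_seconds, response_size_in_bytes)."""
    start = time.monotonic()
    with opener(url, timeout=timeout) as response:
        body = response.read()
    return time.monotonic() - start, len(body)


# Example usage (assumed addresses, adjust to the real cluster):
#   tracker = time_request("http://heron04:8888/topologies")
#   ui = time_request("http://heron04:8889/")
#   print("tracker: %.2fs, ui: %.2fs" % (tracker[0], ui[0]))
```

If the tracker request alone is already slow, the delay probably lies in the tracker or in the state manager it reads from rather than in heron-ui itself.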