1. Objective:
Test the performance of an Alluxio + HDFS (Hadoop 2.7.7) cluster.
2. Environment:
2 virtual machines, CentOS Linux release 7.5.1804 (Core), 3 GB RAM and 45 GB disk each.
192.168.10.156, hmaster156, Alluxio worker
192.168.10.162, hslave162, Alluxio master
3. Test procedure:
3.1 Data reads and writes
3.1.1 Uploading files to the Alluxio file system
(1) Files uploaded to HDFS (under the directory Alluxio mounts) automatically appear in the Alluxio file system
Create a test file on the local node (hslave162) and upload it to the HDFS directory /alluxio/data/; the file then appears automatically in the Alluxio file system:
hdfs dfs -put hadoop2alluxio_test.txt /alluxio/data
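To confirm the upload is visible on both sides, a check along these lines can be used. This is a sketch only: the helper name check_hdfs_to_alluxio is hypothetical, the paths are the ones used throughout this report, and it assumes the cluster is running with the alluxio script in the current bin directory, so the function is defined rather than executed.

```shell
# Hypothetical verification helper (check_hdfs_to_alluxio is not an Alluxio command).
# Assumes the cluster from this report is running; defined only, not invoked here.
check_hdfs_to_alluxio() {
  hdfs dfs -ls /alluxio/data/hadoop2alluxio_test.txt   # the file as stored in HDFS
  ./alluxio fs ls /hadoop2alluxio_test.txt             # the same file seen at the Alluxio root
}
```

The second listing works because the Alluxio root maps to the HDFS directory /alluxio/data/, so the same file appears under both paths.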
(2) Uploading a local file to the Alluxio file system
Create a test file local2alluxio_test.txt on the local node (hslave162, under /usr/local/alluxio-for-hadoop-2.7.7/performance_test/) and upload it to the Alluxio file system (the /alluxio/data/ directory).
On hslave162, in /usr/local/alluxio-for-hadoop-2.7.7/bin/, run the following command as the hadoop user:
# '/' is the Alluxio root directory, i.e. the HDFS directory /alluxio/data/
./alluxio fs copyFromLocal /usr/local/alluxio-for-hadoop-2.7.7/performance_test/local2alluxio_test.txt /
Note: after a file is uploaded from the local file system to Alluxio, it is usually persisted automatically to the corresponding HDFS directory (here, /alluxio/data/).
3.1.2 Persisting Alluxio files to HDFS
Create a test file alluxio2hadoop_test.txt on the local node (hslave162, under /usr/local/alluxio-for-hadoop-2.7.7/performance_test/) and upload it to the Alluxio file system (the /alluxio/data/ directory).
Persist the file alluxio2hadoop_test.txt from Alluxio to the HDFS directory /alluxio/data/. On hslave162, in /usr/local/alluxio-for-hadoop-2.7.7/bin/, run the following command as the hadoop user:
./alluxio fs persist /alluxio2hadoop_test.txt
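Whether the persist took effect can be checked from both sides: `alluxio fs stat` reports the file's persistence state in Alluxio, and an HDFS listing shows the persisted copy. The helper below is a sketch under the same assumptions as the commands above (hypothetical name, live cluster required, so it is only defined):

```shell
# Hypothetical check that alluxio2hadoop_test.txt was persisted to the UFS.
# Needs the running cluster from this report; defined only, not invoked here.
check_persisted() {
  ./alluxio fs stat /alluxio2hadoop_test.txt | grep -i persist   # persistence state should read PERSISTED
  hdfs dfs -ls /alluxio/data/alluxio2hadoop_test.txt             # the persisted copy in HDFS
}
```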
Problem 1 appeared:
Solution:
After a restart, persistence worked normally.
Problem 2 appeared:
The Alluxio worker process on the worker node (hmaster156) stopped on its own.
Workaround (treats the symptom, not the root cause):
On the worker node (hmaster156), in /usr/local/alluxio-for-hadoop-2.7.7/bin/, run the following command to start the Alluxio worker:
./alluxio-start.sh worker
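After restarting the worker it is worth confirming that it actually re-registered with the master, not just that the JVM came up. A possible check (hypothetical helper name; needs the live cluster, so it is only defined):

```shell
# Hypothetical liveness check for the restarted worker.
# Defined only, not invoked here, since it requires the running cluster.
check_worker_alive() {
  jps | grep -E 'AlluxioWorker|AlluxioJobWorker'   # worker JVMs running on this node
  ./alluxio fsadmin report capacity                # workers currently registered with the master
}
```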
This resolved both problem 1 and problem 2.
Problem analysis:
Something in the Alluxio cluster configuration makes the cluster unstable. The root cause has not been located yet and needs further investigation.
(1) After an interval, uploading a file from local to the Alluxio file system again still succeeded. Examining the logs on the Alluxio worker node (hmaster156) and the Alluxio master node (hslave162) turned up the following warnings in job_worker.log on the worker node (hmaster156):
# Log entries from the previous upload
2019-12-04 20:05:33,818 WARN SaslStreamServerDriver - Error received for channel: 947b2f0d-a4e3-4cdb-95d2-b6b9fe75d480. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:05:33,820 WARN SaslStreamServerDriver - Error received for channel: 96cccf3c-7aa7-4f71-8029-b22115b4a9e8. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
# Log entries from this upload
2019-12-04 20:36:45,846 WARN SaslStreamServerDriver - Error received for channel: 71bc4b60-e968-4e6d-931e-cbccb565ec1b. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:36:45,846 WARN SaslStreamServerDriver - Error received for channel: 7d8207ef-4d45-40bd-9b7b-014b9bd33aff. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
No relevant information could be found about this warning.
(2) After shutting down and restarting the Alluxio cluster, the worker still failed to start:
The logs showing problems are as follows:
1) job_master.log on the Alluxio master node (hslave162):
2019-12-04 20:52:34,386 WARN authentication.SaslStreamServerDriver (SaslStreamServerDriver.java:onError) - Error received for channel: ff20e3c4-d10d-4e1d-a6f2-f14da93236e2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:53:55,073 INFO network.NettyUtils (NettyUtils.java:checkNettyEpollAvailable) - EPOLL_MODE is available
2019-12-04 20:53:55,194 INFO network.TieredIdentityFactory (TieredIdentityFactory.java:localIdentity) - Initialized tiered identity TieredIdentity(node=hslave162, rack=null)
2019-12-04 20:53:55,229 INFO alluxio.ProcessUtils (ProcessUtils.java:run) - Starting Alluxio job master @ hslave162:20001.
2019-12-04 20:53:55,229 INFO raft.RaftJournalSystem (RaftJournalSystem.java:startInternal) - Initializing Raft Journal System
2019-12-04 20:53:55,238 INFO raft.JournalStateMachine (JournalStateMachine.java:<init>) - Initialized new journal state machine
2019-12-04 20:53:56,375 INFO master.AbstractPrimarySelector (AbstractPrimarySelector.java:setState) - Primary selector transitioning to SECONDARY
2019-12-04 20:53:56,377 INFO raft.RaftJournalSystem (RaftJournalSystem.java:startInternal) - Starting Raft journal system. Cluster addresses: [hslave162:20003]. Local address: hslave162:20003
2019-12-04 20:53:56,692 INFO transport.GrpcMessagingServer (GrpcMessagingServer.java:lambda$listen$1) - Successfully started messaging server at: hslave162:20003
2019-12-04 20:53:56,695 INFO state.ServerContext (ServerContext.java:transition) - hslave162:20003 - Transitioning to FOLLOWER
2019-12-04 20:53:56,708 INFO master.AbstractPrimarySelector (AbstractPrimarySelector.java:setState) - Primary selector transitioning to SECONDARY
2019-12-04 20:54:08,929 INFO state.ServerContext (ServerContext.java:transition) - hslave162:20003 - Transitioning to CANDIDATE
2019-12-04 20:54:08,932 INFO state.CandidateState (CandidateState.java:startElection) - hslave162:20003 - Starting election
2019-12-04 20:54:09,082 INFO state.ServerContext (ServerContext.java:transition) - hslave162:20003 - Transitioning to LEADER
2019-12-04 20:54:09,087 INFO state.ServerContext (ServerContext.java:setLeader) - hslave162:20003 - Found leader hslave162:20003
2019-12-04 20:54:09,254 INFO state.ServerStateMachine (ServerStateMachine.java:installSnapshot) - hslave162:20003 - Installing snapshot 1
2019-12-04 20:54:09,340 INFO raft.JournalStateMachine (JournalStateMachine.java:install) - Successfully installed snapshot up to SN -1
2019-12-04 20:54:13,898 INFO master.AbstractPrimarySelector (AbstractPrimarySelector.java:setState) - Primary selector transitioning to PRIMARY
2019-12-04 20:54:13,899 INFO master.AbstractPrimarySelector (AbstractPrimarySelector.java:setState) - Primary selector transitioning to PRIMARY
2019-12-04 20:54:13,900 INFO server.CopycatServer (CopycatServer.java:lambda$start$6) - Server started successfully!
2019-12-04 20:54:13,903 INFO raft.RaftJournalSystem (RaftJournalSystem.java:startInternal) - Started Raft Journal System in 17526ms
2019-12-04 20:54:13,903 INFO master.FaultTolerantAlluxioJobMasterProcess (FaultTolerantAlluxioJobMasterProcess.java:start) - Transitioning from secondary to primary
2019-12-04 20:54:15,664 INFO session.ClientSession (ClientSessionManager.java:lambda$register$2) - Registered session 48982
2019-12-04 20:54:15,683 INFO raft.RaftJournalSystem (RaftJournalSystem.java:catchUp) - Performing catchup. Last applied SN: -1. Catchup ID: -4000766732932347040
2019-12-04 20:54:36,065 INFO raft.RaftJournalSystem (RaftJournalSystem.java:catchUp) - Caught up in 20383ms. Last sequence number from previous term: -1.
2019-12-04 20:54:36,080 INFO master.AbstractMaster (AbstractMaster.java:stop) - JobMaster: Stopped secondary master.
2019-12-04 20:54:36,080 INFO master.FaultTolerantAlluxioJobMasterProcess (FaultTolerantAlluxioJobMasterProcess.java:start) - Secondary stopped
2019-12-04 20:54:36,081 INFO master.AbstractMaster (AbstractMaster.java:start) - JobMaster: Starting primary master.
2019-12-04 20:54:36,089 INFO master.FaultTolerantAlluxioJobMasterProcess (FaultTolerantAlluxioJobMasterProcess.java:start) - Primary started
2019-12-04 20:54:36,089 INFO master.AlluxioJobMasterProcess (AlluxioJobMasterProcess.java:startServing) - Alluxio job master web server version 2.2.0-SNAPSHOT starting (gained leadership). webAddress=/0.0.0.0:20002
2019-12-04 20:54:36,101 INFO util.log (Log.java:initialized) - Logging initialized @42323ms
2019-12-04 20:54:36,217 INFO web.WebServer (WebServer.java:start) - Alluxio Job Manager Master Web service starting @ /0.0.0.0:20002
2019-12-04 20:54:36,219 INFO server.Server (Server.java:doStart) - jetty-9.2.z-SNAPSHOT
2019-12-04 20:54:36,248 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@7511c06d{/metrics/json,null,AVAILABLE}
2019-12-04 20:54:45,395 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@731b9c92{/,null,AVAILABLE}
2019-12-04 20:54:45,402 INFO server.ServerConnector (AbstractConnector.java:doStart) - Started ServerConnector@4f3cf293{HTTP/1.1}{0.0.0.0:20002}
2019-12-04 20:54:45,402 INFO server.Server (Server.java:doStart) - Started @51624ms
2019-12-04 20:54:45,402 INFO web.WebServer (WebServer.java:start) - Alluxio Job Manager Master Web service started @ /0.0.0.0:20002
2019-12-04 20:54:45,402 INFO master.AlluxioJobMasterProcess (AlluxioJobMasterProcess.java:startServing) - Alluxio job master version 2.2.0-SNAPSHOT started (gained leadership). bindAddress=/0.0.0.0:20001, connectAddress=hslave162:20001, webAddress=/0.0.0.0:20002
2019-12-04 20:54:45,402 INFO master.AlluxioJobMasterProcess (AlluxioJobMasterProcess.java:startServingRPCServer) - Starting Alluxio job master gRPC server on address /0.0.0.0:20001
2019-12-04 20:54:45,474 INFO master.MasterProcess (MasterProcess.java:registerServices) - registered service JOB_MASTER_CLIENT_SERVICE
2019-12-04 20:54:45,475 INFO master.MasterProcess (MasterProcess.java:registerServices) - registered service JOB_MASTER_WORKER_SERVICE
2019-12-04 20:54:45,487 INFO master.AlluxioJobMasterProcess (AlluxioJobMasterProcess.java:startServingRPCServer) - Started Alluxio job master gRPC server on address hslave162:20001
2019-12-04 20:55:02,134 INFO job.JobMaster (JobMaster.java:registerWorker) - registerWorker(): WorkerNetAddress: WorkerNetAddress{host=hmaster156, rpcPort=30001, dataPort=30002, webPort=30003, domainSocketPath=, tieredIdentity=TieredIdentity(node=hmaster156)} id: 1575464035219
2) master.log on the Alluxio master node (hslave162):
2019-12-04 20:52:34,384 WARN SaslStreamServerDriver - Error received for channel: 822a2fd1-4a77-4295-807a-feec4e2df6e7. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:34,384 WARN SaslStreamServerDriver - Error received for channel: 0d7c81e3-0eac-4197-9fa8-e4faf45fa7b4. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:34,384 WARN SaslStreamServerDriver - Error received for channel: e212747a-db40-430d-997b-60c7eb7f6eb2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:34,384 WARN SaslStreamServerDriver - Error received for channel: b8dd20e4-13b7-48e1-a399-9bf8f0280658. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,045 WARN SaslStreamServerDriver - Error received for channel: 99a98071-e933-473f-b573-2c3d6d76fdb0. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,045 WARN SaslStreamServerDriver - Error received for channel: b0f4013b-f3ca-4f51-ba46-dbd6046576b2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,045 WARN SaslStreamServerDriver - Error received for channel: 5b8a399a-f7d7-4f13-90e1-f21c0ba33590. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,046 WARN SaslStreamServerDriver - Error received for channel: 7d4af118-010f-4220-8337-21260cdcb61d. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,046 WARN SaslStreamServerDriver - Error received for channel: e219d970-caae-4359-a4a5-8cde797100c2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:39,506 WARN SaslStreamServerDriver - Error received for channel: 2c0ac859-5cbd-45c9-8b48-f857b89dc652. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:39,506 WARN SaslStreamServerDriver - Error received for channel: 243d57cc-3c7e-4312-ae12-a7ca53c6059d. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:39,506 WARN SaslStreamServerDriver - Error received for channel: 60d1c342-33ed-4965-b945-2d57fa1c6749. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:42,470 INFO BackupLeaderRole - Closing backup-leader role.
2019-12-04 20:52:42,470 INFO AlluxioMasterProcess - Alluxio master ended (lost leadership)
2019-12-04 20:52:42,493 INFO ServerConnector - Stopped ServerConnector@4eaf5e18{HTTP/1.1}{0.0.0.0:19999}
2019-12-04 20:52:42,503 INFO ContextHandler - Stopped o.e.j.s.ServletContextHandler@508532d4{/,file:/usr/local/alluxio-for-hadoop-2.7.7/webui/master/build/,UNAVAILABLE}
2019-12-04 20:52:42,504 INFO ContextHandler - Stopped o.e.j.s.ServletContextHandler@6fa726e6{/metrics/json,null,UNAVAILABLE}
2019-12-04 20:52:42,504 INFO ContextHandler - Stopped o.e.j.s.ServletContextHandler@1a7f5e7d{/metrics/prometheus,null,UNAVAILABLE}
2019-12-04 20:52:42,505 INFO AbstractMaster - TableMaster: Stopping primary master.
2019-12-04 20:52:42,506 INFO AbstractMaster - TableMaster: Stopped primary master.
2019-12-04 20:52:42,506 INFO AbstractMaster - MetaMaster: Stopping primary master.
2019-12-04 20:52:42,506 INFO AbstractMaster - MetaMaster: Stopped primary master.
2019-12-04 20:52:42,508 INFO AbstractMaster - FileSystemMaster: Stopping primary master.
2019-12-04 20:52:42,508 INFO AbstractMaster - FileSystemMaster: Stopped primary master.
2019-12-04 20:52:42,508 INFO AbstractMaster - BlockMaster: Stopping primary master.
2019-12-04 20:52:42,509 INFO AbstractMaster - BlockMaster: Stopped primary master.
2019-12-04 20:52:42,509 INFO AbstractMaster - MetricsMaster: Stopping primary master.
2019-12-04 20:52:42,509 INFO AbstractMaster - MetricsMaster: Stopped primary master.
2019-12-04 20:52:42,511 INFO RaftJournalSystem - Shutting down raft journal
2019-12-04 20:52:42,511 INFO RaftJournalWriter - Closing journal writer. Last sequence numbers written/submitted/committed: 431/431/431
2019-12-04 20:52:42,702 INFO ServerContext - hslave162:19200 - Transitioning to INACTIVE
2019-12-04 20:52:42,778 INFO AbstractPrimarySelector - Primary selector transitioning to SECONDARY
2019-12-04 20:52:42,783 INFO ProcessUtils - Stopping Alluxio master @hslave162:19998.
2019-12-04 20:52:42,811 INFO RaftJournalSystem - Journal shutdown complete
2019-12-04 20:53:48,341 INFO BlockMasterFactory - Creating alluxio.master.block.BlockMaster
2019-12-04 20:53:48,342 INFO MetaMasterFactory - Creating alluxio.master.meta.MetaMaster
2019-12-04 20:53:48,341 INFO FileSystemMasterFactory - Creating alluxio.master.file.FileSystemMaster
2019-12-04 20:53:48,354 INFO MetricsMasterFactory - Creating alluxio.master.metrics.MetricsMaster
2019-12-04 20:53:48,355 INFO TableMasterFactory - Creating alluxio.master.table.TableMaster
2019-12-04 20:53:48,633 INFO NettyUtils - EPOLL_MODE is available
2019-12-04 20:53:48,657 INFO ExtensionFactoryRegistry - Loading core jars from /usr/local/alluxio-for-hadoop-2.7.7/lib
2019-12-04 20:53:48,734 INFO ExtensionFactoryRegistry - Loading extension jars from /usr/local/alluxio-for-hadoop-2.7.7/extensions
2019-12-04 20:53:48,759 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=hslave162, rack=null)
2019-12-04 20:53:48,760 INFO UnderDatabaseRegistry - Loading udb jars from /usr/local/alluxio-for-hadoop-2.7.7/lib
2019-12-04 20:53:48,777 INFO UnderDatabaseRegistry - Registered UDBs: hive
2019-12-04 20:53:48,780 INFO LayoutRegistry - Registered Table Layouts: hive
2019-12-04 20:53:49,042 WARN HdfsUnderFileSystem - Cannot create SupportedHdfsAclProvider. HDFS ACLs will not be supported.
2019-12-04 20:53:49,234 WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-12-04 20:53:49,371 WARN HdfsUnderFileSystem - Cannot create SupportedHdfsActiveSyncProvider.HDFS ActiveSync will not be supported.
2019-12-04 20:53:49,707 INFO ProcessUtils - Starting Alluxio master @hslave162:19998.
2019-12-04 20:53:49,707 INFO RaftJournalSystem - Initializing Raft Journal System
2019-12-04 20:53:49,728 INFO JournalStateMachine - Initialized new journal state machine
2019-12-04 20:53:52,644 INFO AbstractPrimarySelector - Primary selector transitioning to SECONDARY
2019-12-04 20:53:52,645 INFO RaftJournalSystem - Starting Raft journal system. Cluster addresses: [hslave162:19200]. Local address: hslave162:19200
2019-12-04 20:53:52,939 INFO GrpcMessagingServer - Successfully started messaging server at: hslave162:19200
2019-12-04 20:53:52,941 INFO ServerContext - hslave162:19200 - Transitioning to FOLLOWER
2019-12-04 20:53:52,952 INFO AbstractPrimarySelector - Primary selector transitioning to SECONDARY
2019-12-04 20:54:04,846 INFO ServerContext - hslave162:19200 - Transitioning to CANDIDATE
2019-12-04 20:54:04,852 INFO CandidateState - hslave162:19200 - Starting election
2019-12-04 20:54:05,007 INFO ServerContext - hslave162:19200 - Transitioning to LEADER
2019-12-04 20:54:05,017 INFO ServerContext - hslave162:19200 - Found leader hslave162:19200
2019-12-04 20:54:05,141 INFO ServerStateMachine - hslave162:19200 - Installing snapshot 1
2019-12-04 20:54:05,165 INFO JournalUtils - Reading journal entries
2019-12-04 20:54:05,227 INFO JournalUtils - Reading journal entries
2019-12-04 20:54:05,227 INFO JournalUtils - Reading journal entries
2019-12-04 20:54:05,236 INFO JournalUtils - Reading journal entries
2019-12-04 20:54:05,238 INFO JournalUtils - Reading journal entries
2019-12-04 20:54:05,238 INFO JournalUtils - Reading journal entries
2019-12-04 20:54:05,239 INFO JournalUtils - Reading journal entries
2019-12-04 20:54:05,239 INFO JournalUtils - Reading journal entries
2019-12-04 20:54:05,239 INFO JournalStateMachine - Successfully installed snapshot up to SN -1
2019-12-04 20:54:15,796 INFO AbstractPrimarySelector - Primary selector transitioning to PRIMARY
2019-12-04 20:54:15,797 INFO AbstractPrimarySelector - Primary selector transitioning to PRIMARY
2019-12-04 20:54:15,798 INFO CopycatServer - Server started successfully!
2019-12-04 20:54:15,804 INFO RaftJournalSystem - Started Raft Journal System in 23159ms
2019-12-04 20:54:15,856 INFO AlluxioMasterProcess - All masters started
2019-12-04 20:54:15,856 INFO FaultTolerantAlluxioMasterProcess - Secondary started
2019-12-04 20:54:15,859 INFO AbstractMaster - TableMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO AbstractMaster - MetaMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO AbstractMaster - FileSystemMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO AbstractMaster - BlockMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO AbstractMaster - MetricsMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO FaultTolerantAlluxioMasterProcess - Secondary stopped
2019-12-04 20:54:17,368 INFO ClientSession - Registered session 153458
2019-12-04 20:54:17,379 INFO RaftJournalSystem - Performing catchup. Last applied SN: 431. Catchup ID: -3833932302058852624
2019-12-04 20:54:37,479 INFO RaftJournalSystem - Caught up in 20101ms. Last sequence number from previous term: 431.
2019-12-04 20:54:37,483 INFO AbstractMaster - MetricsMaster: Starting primary master.
2019-12-04 20:54:37,491 INFO AbstractMaster - BlockMaster: Starting primary master.
2019-12-04 20:54:37,505 INFO AbstractMaster - FileSystemMaster: Starting primary master.
2019-12-04 20:54:37,505 INFO DefaultFileSystemMaster - Starting fs master as primary
2019-12-04 20:54:37,664 INFO AbstractMaster - MetaMaster: Starting primary master.
2019-12-04 20:54:37,700 INFO DefaultMetaMaster - Detected existing cluster ID cc03d9bd-c8d4-4aaf-b332-d848f27900d1
2019-12-04 20:54:37,755 INFO BackupLeaderRole - Creating backup-leader role.
2019-12-04 20:54:37,755 INFO AbstractMaster - TableMaster: Starting primary master.
2019-12-04 20:54:37,792 INFO AlluxioMasterProcess - All masters started
2019-12-04 20:54:37,812 INFO MetricsSystem - Starting sinks with config: {}.
2019-12-04 20:54:37,812 INFO AlluxioMasterProcess - Alluxio master web server version 2.2.0-SNAPSHOT starting (gained leadership). webAddress=/0.0.0.0:19999
2019-12-04 20:54:37,837 INFO log - Logging initialized @50486ms
2019-12-04 20:54:38,157 INFO WebServer - Alluxio Master Web service starting @ /0.0.0.0:19999
2019-12-04 20:54:38,182 INFO Server - jetty-9.2.z-SNAPSHOT
2019-12-04 20:54:38,211 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@3ecbc252{/metrics/prometheus,null,AVAILABLE}
2019-12-04 20:54:38,211 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@1449db74{/metrics/json,null,AVAILABLE}
2019-12-04 20:54:51,459 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@6bdeb4d7{/,file:/usr/local/alluxio-for-hadoop-2.7.7/webui/master/build/,AVAILABLE}
2019-12-04 20:54:51,466 INFO ServerConnector - Started ServerConnector@d12486e{HTTP/1.1}{0.0.0.0:19999}
2019-12-04 20:54:51,466 INFO Server - Started @64115ms
2019-12-04 20:54:51,466 INFO WebServer - Alluxio Master Web service started @ /0.0.0.0:19999
2019-12-04 20:54:51,466 INFO AlluxioMasterProcess - Alluxio master version 2.2.0-SNAPSHOT started (gained leadership). bindAddress=/0.0.0.0:19998, connectAddress=hslave162:19998, webAddress=/0.0.0.0:19999
2019-12-04 20:54:51,472 INFO AlluxioMasterProcess - Starting Alluxio master gRPC server on address /0.0.0.0:19998
2019-12-04 20:54:51,491 INFO MasterProcess - registered service METRICS_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,551 INFO MasterProcess - registered service BLOCK_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,551 INFO MasterProcess - registered service BLOCK_MASTER_WORKER_SERVICE
2019-12-04 20:54:51,650 INFO MasterProcess - registered service FILE_SYSTEM_MASTER_WORKER_SERVICE
2019-12-04 20:54:51,650 INFO MasterProcess - registered service FILE_SYSTEM_MASTER_JOB_SERVICE
2019-12-04 20:54:51,650 INFO MasterProcess - registered service FILE_SYSTEM_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,697 INFO MasterProcess - registered service META_MASTER_CONFIG_SERVICE
2019-12-04 20:54:51,697 INFO MasterProcess - registered service META_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,697 INFO MasterProcess - registered service META_MASTER_BACKUP_MESSAGING_SERVICE
2019-12-04 20:54:51,698 INFO MasterProcess - registered service META_MASTER_MASTER_SERVICE
2019-12-04 20:54:51,740 INFO MasterProcess - registered service TABLE_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,751 INFO DefaultSafeModeManager - Rpc server started, waiting 5000ms for workers to register
2019-12-04 20:54:51,751 INFO AlluxioMasterProcess - Started Alluxio master gRPC server on address hslave162:19998
2019-12-04 20:54:51,762 INFO FaultTolerantAlluxioMasterProcess - Primary started
3) job_worker.log on the Alluxio worker node (hmaster156):
2019-12-04 20:54:02,847 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 1): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:02,965 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 2): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:03,193 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 3): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:03,621 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 4): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:04,435 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 5): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:06,112 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 6): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:09,376 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 7): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:14,833 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 8): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:20,104 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 9): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:25,565 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 10): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:30,894 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 11): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:35,999 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 12): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:41,037 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 13): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:46,113 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 14): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:51,574 WARN retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 15): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:57,470 INFO network.NettyUtils (NettyUtils.java:checkNettyEpollAvailable) - EPOLL_MODE is available
2019-12-04 20:54:57,555 INFO network.TieredIdentityFactory (TieredIdentityFactory.java:localIdentity) - Initialized tiered identity TieredIdentity(node=hmaster156, rack=null)
2019-12-04 20:54:57,676 INFO util.log (Log.java:initialized) - Logging initialized @56199ms
2019-12-04 20:54:57,843 INFO alluxio.ProcessUtils (ProcessUtils.java:run) - Starting Alluxio job worker.
2019-12-04 20:54:57,852 INFO web.WebServer (WebServer.java:start) - Alluxio Job Manager Worker Web service starting @ /0.0.0.0:30003
2019-12-04 20:54:57,863 INFO server.Server (Server.java:doStart) - jetty-9.2.z-SNAPSHOT
2019-12-04 20:54:57,926 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@62833051{/metrics/json,null,AVAILABLE}
2019-12-04 20:55:02,954 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@3fcdcf{/,null,AVAILABLE}
2019-12-04 20:55:02,963 INFO server.ServerConnector (AbstractConnector.java:doStart) - Started ServerConnector@282308c3{HTTP/1.1}{0.0.0.0:30003}
2019-12-04 20:55:02,963 INFO server.Server (Server.java:doStart) - Started @61487ms
2019-12-04 20:55:02,963 INFO web.WebServer (WebServer.java:start) - Alluxio Job Manager Worker Web service started @ /0.0.0.0:30003
2019-12-04 20:55:02,971 INFO metrics.MetricsSystem (MetricsSystem.java:startSinksFromConfig) - Starting sinks with config: {}.
2019-12-04 20:55:03,350 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:start) - Started Alluxio job worker with id 1575464035219
2019-12-04 20:55:03,351 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:start) - Alluxio job worker version 2.2.0-SNAPSHOT started. bindHost=/0.0.0.0:30001, connectHost=hmaster156:30001, rpcPort=30001, webPort=30003
2019-12-04 20:55:03,351 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:startServingRPCServer) - Starting gRPC server on address hmaster156:30001
2019-12-04 20:55:03,417 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:startServingRPCServer) - Started gRPC server on address hmaster156:30001
4) worker.log on the Alluxio worker node (hmaster156):
2019-12-04 20:05:33,818 WARN SaslStreamServerDriver - Error received for channel: 947b2f0d-a4e3-4cdb-95d2-b6b9fe75d480. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:05:33,820 WARN SaslStreamServerDriver - Error received for channel: 96cccf3c-7aa7-4f71-8029-b22115b4a9e8. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:36:45,846 WARN SaslStreamServerDriver - Error received for channel: 71bc4b60-e968-4e6d-931e-cbccb565ec1b. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:36:45,846 WARN SaslStreamServerDriver - Error received for channel: 7d8207ef-4d45-40bd-9b7b-014b9bd33aff. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:35,477 WARN SaslStreamServerDriver - Error received for channel: db43d1a6-3837-46ac-8089-b8d8da9837b2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:35,481 WARN SaslStreamServerDriver - Error received for channel: 2c1540f4-cfda-4f4f-b208-1cac65f69a8c. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
(3) The worker was started on its own. Eleven hours later, uploading alluxio2hadoop_test.txt from local to the Alluxio file system was tried again, and the worker was found to have shut down by itself. Checking the logs turned up the following new warnings:
1) master.log on the Alluxio master node (hslave162):
2019-12-04 22:39:08,528 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-04 23:39:08,528 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 00:39:08,529 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 01:39:08,529 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 02:39:08,529 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 03:39:08,529 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 04:39:08,529 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 05:39:08,529 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 06:39:08,529 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 07:39:08,530 INFO ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 08:21:05,057 WARN SaslStreamServerDriver - Error received for channel: 47492e42-a761-4b01-b8ca-50d43a0f724d. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-05 08:21:05,058 WARN SaslStreamServerDriver - Error received for channel: cfabbc0d-1bdb-4361-80d0-1a079c2f54f2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2) job_worker.log on the Alluxio Worker node (hmaster156):
2019-12-05 03:59:42,002 WARN heartbeat.SleepingTimer (SleepingTimer.java:tick) - Job Worker Command Handling last execution took 1015 ms. Longer than the interval 1000
2019-12-05 07:20:14,726 WARN heartbeat.SleepingTimer (SleepingTimer.java:tick) - Job Worker Command Handling last execution took 1116 ms. Longer than the interval 1000
2019-12-05 08:02:25,518 WARN heartbeat.SleepingTimer (SleepingTimer.java:tick) - Job Worker Command Handling last execution took 1222 ms. Longer than the interval 1000
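As a quick sanity check, the overrun durations in the warnings above can be extracted with grep/sed to see whether they are growing over time. A minimal sketch (the sample lines are inlined in a temp file so it runs anywhere; in practice, point it at the real job_worker.log):

```shell
# Extract the "took N ms" values from SleepingTimer overrun warnings.
# Sample lines are inlined; replace the here-doc with the real job_worker.log.
cat > /tmp/job_worker_sample.log <<'EOF'
2019-12-05 03:59:42,002 WARN heartbeat.SleepingTimer (SleepingTimer.java:tick) - Job Worker Command Handling last execution took 1015 ms. Longer than the interval 1000
2019-12-05 07:20:14,726 WARN heartbeat.SleepingTimer (SleepingTimer.java:tick) - Job Worker Command Handling last execution took 1116 ms. Longer than the interval 1000
2019-12-05 08:02:25,518 WARN heartbeat.SleepingTimer (SleepingTimer.java:tick) - Job Worker Command Handling last execution took 1222 ms. Longer than the interval 1000
EOF
grep 'Longer than the interval' /tmp/job_worker_sample.log \
  | sed 's/.*took \([0-9]*\) ms.*/\1/'
```

Here the durations (1015, 1116, 1222 ms) increase over time, which would be consistent with the worker gradually falling behind on a 3 GB VM, though that is only a conjecture.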
No relevant material on these warnings could be found.
Solution:
By chance, the following output was noticed on the command line when starting Alluxio.
Checked task.log on the Alluxio Worker node (hmaster156); it contained no ERROR or WARN entries, only the following messages from when the worker was started:
Executing the following command on all worker nodes and logging to /usr/local/alluxio2.1.0-for-hadoop-2.7.7/logs/task.log: /usr/local/alluxio2.1.0-for-hadoop-2.7.7/bin/alluxio-start.sh -a worker noMount
[hmaster156] Connecting as hadoop...
[2019-12-05 15:17:34][hmaster156] nohup: ignoring input
[2019-12-05 15:17:34][hmaster156] Invalid mount mode: noMount
[2019-12-05 15:17:34][hmaster156] Usage: alluxio-start.sh [-hNwm] [-i backup] ACTION [MOPT] [-f]
[2019-12-05 15:17:34][hmaster156] Where ACTION is one of:
[2019-12-05 15:17:34][hmaster156] all [MOPT] Start all masters, proxies, and workers.
[2019-12-05 15:17:34][hmaster156] job_master Start the job_master on this node.
[2019-12-05 15:17:34][hmaster156] job_masters Start job_masters on master nodes.
[2019-12-05 15:17:34][hmaster156] job_worker Start a job_worker on this node.
[2019-12-05 15:17:34][hmaster156] job_workers Start job_workers on worker nodes.
[2019-12-05 15:17:34][hmaster156] local [MOPT] Start all processes locally.
[2019-12-05 15:17:34][hmaster156] master Start the local master on this node.
[2019-12-05 15:17:34][hmaster156] secondary_master Start the local secondary master on this node.
[2019-12-05 15:17:34][hmaster156] masters Start masters on master nodes.
[2019-12-05 15:17:34][hmaster156] proxy Start the proxy on this node.
[2019-12-05 15:17:35][hmaster156] proxies Start proxies on master and worker nodes.
[2019-12-05 15:17:35][hmaster156] safe Script will run continuously and start the master if it's not running.
[2019-12-05 15:17:35][hmaster156] worker [MOPT] Start a worker on this node.
[2019-12-05 15:17:35][hmaster156] workers [MOPT] Start workers on worker nodes.
[2019-12-05 15:17:35][hmaster156] logserver Start the logserver
[2019-12-05 15:17:35][hmaster156] restart_worker Restart a failed worker on this node.
[2019-12-05 15:17:35][hmaster156] restart_workers Restart any failed workers on worker nodes.
[2019-12-05 15:17:35][hmaster156]
[2019-12-05 15:17:35][hmaster156] MOPT (Mount Option) is one of:
[2019-12-05 15:17:35][hmaster156] Mount Mount the configured RamFS if it is not already mounted.
[2019-12-05 15:17:35][hmaster156] SudoMount Mount the configured RamFS using sudo if it is not already mounted.
[2019-12-05 15:17:35][hmaster156] NoMount Do not mount the configured RamFS.
[2019-12-05 15:17:35][hmaster156] Notice: to avoid sudo requirement but using tmpFS in Linux,
[2019-12-05 15:17:35][hmaster156] set ALLUXIO_RAM_FOLDER=/dev/shm on each worker and use NoMount.
[2019-12-05 15:17:35][hmaster156] NoMount is assumed if MOPT is not specified.
[2019-12-05 15:17:35][hmaster156]
[2019-12-05 15:17:35][hmaster156] -a asynchronously start all processes. The script may exit before all
[2019-12-05 15:17:35][hmaster156] processes have been started.
[2019-12-05 15:17:35][hmaster156] -f format Journal, UnderFS Data and Workers Folder on master.
[2019-12-05 15:17:35][hmaster156] -h display this help.
[2019-12-05 15:17:35][hmaster156] -i backup a journal backup to restore the master from. The backup should be
[2019-12-05 15:17:35][hmaster156] a URI path within the root under filesystem, e.g.
[2019-12-05 15:17:35][hmaster156] hdfs://mycluster/alluxio_backups/alluxio-journal-YYYY-MM-DD-timestamp.gz.
[2019-12-05 15:17:35][hmaster156] -N do not try to kill previous running processes before starting new ones.
[2019-12-05 15:17:35][hmaster156] -w wait for processes to end before returning.
[2019-12-05 15:17:35][hmaster156]
[2019-12-05 15:17:35][hmaster156] Supported environment variables:
[2019-12-05 15:17:35][hmaster156]
[2019-12-05 15:17:35][hmaster156] ALLUXIO_JOB_WORKER_COUNT - identifies how many job workers to start per node (default = 1)
[2019-12-05 15:17:35][hmaster156] Connection to hmaster156 closed.
From this, the inference is that the start command
./alluxio-start.sh all NoMount
was the problem: the worker-side log shows the mount mode arrived as noMount, which the script rejects ("Invalid mount mode: noMount") because the MOPT values (Mount / SudoMount / NoMount) are matched case-sensitively. Changing to the following command let the Alluxio cluster start normally:
./alluxio-start.sh all SudoMount
After waiting 18 hours, the worker had not shut itself down.
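The "Invalid mount mode" error comes down to case-sensitive matching of the MOPT argument. A minimal sketch of that validation (this mirrors the usage text above, but is not the actual alluxio-start.sh source):

```shell
# Sketch of a case-sensitive MOPT check; valid values are taken from the
# usage output above (Mount / SudoMount / NoMount). Not the real script.
check_mopt() {
  case "$1" in
    Mount|SudoMount|NoMount) echo "ok" ;;
    *) echo "Invalid mount mode: $1" ;;
  esac
}
check_mopt noMount   # prints: Invalid mount mode: noMount
check_mopt NoMount   # prints: ok
```

So noMount is rejected while NoMount would have been accepted; in this test SudoMount was used instead, letting the script mount the ramdisk itself via sudo.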
Problem 3 appeared:
The worker still sometimes failed to start. After rebuilding the cluster as alluxio-2.1.0 + hadoop-2.7.3, task.log on the Alluxio Master node (hslave162) showed the following:
[sudo] password for hadoop:
ERROR: mkdir /usr/local/alluxio-2.1.0/ramdisk failed
Solution:
On the worker machines, grant passwordless sudo to the user that runs the worker service by adding the following to /etc/sudoers:
# <user_name> ALL=(ALL) NOPASSWD: ALL
# Actual entry used here:
hadoop ALL=(ALL) NOPASSWD: ALL
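A syntax error in /etc/sudoers can lock out sudo entirely, so the entry is best added via visudo and checked afterwards with `sudo visudo -c`. As a lightweight sanity check, the entry's format can also be verified with grep (demonstrated here against a temp copy so it runs anywhere; in practice, grep /etc/sudoers itself):

```shell
# Sanity-check that the passwordless-sudo entry is present and well-formed.
# A temp copy stands in for /etc/sudoers here; always edit the real file
# with 'visudo' and validate with 'sudo visudo -c'.
cat > /tmp/sudoers_check <<'EOF'
hadoop ALL=(ALL) NOPASSWD: ALL
EOF
grep -Ec '^hadoop +ALL=\(ALL\) +NOPASSWD: +ALL$' /tmp/sudoers_check
```

A count of 1 confirms the entry matches the expected user/host/tag layout.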