[Alluxio+HDFS Cluster Series - 2] Testing Alluxio+HDFS (Hadoop 2.7.7) Cluster Performance (1)

1. Objective:

Test the performance of an Alluxio + HDFS (Hadoop 2.7.7) cluster.

2. Environment:

Two virtual machines, CentOS Linux release 7.5.1804 (Core), 3 GB RAM and 45 GB disk each.

192.168.10.156, hmaster156, Alluxio worker

192.168.10.162, hslave162, Alluxio master

3. Test Procedure:

3.1 Data Reads and Writes

3.1.1 Uploading files to the Alluxio file system

(1) Files uploaded to HDFS (under the directory mounted by Alluxio) automatically become visible in the Alluxio file system

Create a test file locally (on hslave162) and upload it to the /alluxio/data/ directory in HDFS; the file then automatically appears in the Alluxio file system.

hdfs dfs -put hadoop2alluxio_test.txt /alluxio/data
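
To confirm that the file is now visible in the Alluxio namespace, it can be listed through the Alluxio CLI (run from /usr/local/alluxio-for-hadoop-2.7.7/bin/ on hslave162; in this setup the Alluxio root / is mounted on hdfs:///alluxio/data/):

# The file put into hdfs:///alluxio/data should show up under the Alluxio root
./alluxio fs ls /
# Optionally read it back through Alluxio to confirm it is accessible
./alluxio fs cat /hadoop2alluxio_test.txt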

 

(2) Uploading a local file to the Alluxio file system

Create a test file local2alluxio_test.txt locally (on node hslave162, under /usr/local/alluxio-for-hadoop-2.7.7/performance_test/) and upload it to the Alluxio file system (which maps to the /alluxio/data/ directory in HDFS).

On node hslave162, under /usr/local/alluxio-for-hadoop-2.7.7/bin/, run the following command as the hadoop user:

# '/' is the Alluxio root directory, i.e. /alluxio/data/ in HDFS
./alluxio fs copyFromLocal /usr/local/alluxio-for-hadoop-2.7.7/performance_test/local2alluxio_test.txt /

Note: after data is uploaded from the local file system to Alluxio, it is usually persisted automatically to the corresponding HDFS directory (i.e. under /alluxio/data/); whether and when this happens depends on the write type configured for the Alluxio client.

3.1.2 Persisting Alluxio files to HDFS

Create a test file alluxio2hadoop_test.txt locally (on node hslave162, under /usr/local/alluxio-for-hadoop-2.7.7/performance_test/) and upload it to the Alluxio file system (which maps to the /alluxio/data/ directory in HDFS), as in the sketch below.
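
For reference, creating the test file and copying it into Alluxio can look like this (file name and target path follow this test; the file contents are arbitrary):

cd /usr/local/alluxio-for-hadoop-2.7.7/performance_test/
echo "alluxio persist test" > alluxio2hadoop_test.txt
# Copy the local file into the Alluxio root, as in section 3.1.1 (2)
/usr/local/alluxio-for-hadoop-2.7.7/bin/alluxio fs copyFromLocal alluxio2hadoop_test.txt /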

Persist the file alluxio2hadoop_test.txt from Alluxio to the HDFS directory /alluxio/data/. On node hslave162, under /usr/local/alluxio-for-hadoop-2.7.7/bin/, run the following command as the hadoop user:

./alluxio fs persist /alluxio2hadoop_test.txt
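
Whether the persist succeeded can then be checked from both sides (a quick sketch; the ls output of this Alluxio version includes a persistence state for each file):

# The file should now be reported as PERSISTED by Alluxio
./alluxio fs ls /alluxio2hadoop_test.txt
# ...and be present in the HDFS directory that backs the Alluxio root
hdfs dfs -ls /alluxio/data/alluxio2hadoop_test.txt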

Problem 1 appeared:

Solution:

After a restart, persistence worked normally.

Problem 2 appeared:

The AlluxioWorker process on the Alluxio worker node (hmaster156) stopped on its own.

Workaround (treats the symptom, not the root cause):

On the Alluxio worker node (hmaster156), under /usr/local/alluxio-for-hadoop-2.7.7/bin/, run the following command to start the AlluxioWorker:

./alluxio-start.sh worker
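
Whether the worker actually came back up can be confirmed, for example, with jps on hmaster156 and, from the master, with the fsadmin report (assuming both tools are available in this environment):

# On hmaster156: the AlluxioWorker (and AlluxioJobWorker) JVMs should be listed
jps | grep -i alluxio
# On hslave162, from the Alluxio bin directory: one registered worker should be reported
./alluxio fsadmin report capacity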

With this, both problem 1 and problem 2 are resolved.

Problem analysis:

Something in the Alluxio cluster configuration is making the cluster unstable. The root cause has not been identified yet and needs further investigation.

(1) After waiting a while and uploading a file from the local file system to Alluxio again, the upload still succeeded. Inspecting the logs on the Alluxio worker node (hmaster156) and the Alluxio master node (hslave162) revealed the following warnings in job_worker.log on the worker node (hmaster156):

# Log entries from the previous upload
2019-12-04 20:05:33,818 WARN  SaslStreamServerDriver - Error received for channel: 947b2f0d-a4e3-4cdb-95d2-b6b9fe75d480. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:05:33,820 WARN  SaslStreamServerDriver - Error received for channel: 96cccf3c-7aa7-4f71-8029-b22115b4a9e8. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close

# Log entries from this upload
2019-12-04 20:36:45,846 WARN  SaslStreamServerDriver - Error received for channel: 71bc4b60-e968-4e6d-931e-cbccb565ec1b. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:36:45,846 WARN  SaslStreamServerDriver - Error received for channel: 7d8207ef-4d45-40bd-9b7b-014b9bd33aff. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close

No relevant information could be found about these warnings.

 

(2) The Alluxio cluster was shut down and restarted, and the worker still failed to start.

Inspecting the various logs, those showing problems are listed below:

1) job_master.log on the Alluxio master node (hslave162):

2019-12-04 20:52:34,386 WARN  authentication.SaslStreamServerDriver (SaslStreamServerDriver.java:onError) - Error received for channel: ff20e3c4-d10d-4e1d-a6f2-f14da93236e2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:53:55,073 INFO  network.NettyUtils (NettyUtils.java:checkNettyEpollAvailable) - EPOLL_MODE is available
2019-12-04 20:53:55,194 INFO  network.TieredIdentityFactory (TieredIdentityFactory.java:localIdentity) - Initialized tiered identity TieredIdentity(node=hslave162, rack=null)
2019-12-04 20:53:55,229 INFO  alluxio.ProcessUtils (ProcessUtils.java:run) - Starting Alluxio job master @ hslave162:20001.
2019-12-04 20:53:55,229 INFO  raft.RaftJournalSystem (RaftJournalSystem.java:startInternal) - Initializing Raft Journal System
2019-12-04 20:53:55,238 INFO  raft.JournalStateMachine (JournalStateMachine.java:<init>) - Initialized new journal state machine
2019-12-04 20:53:56,375 INFO  master.AbstractPrimarySelector (AbstractPrimarySelector.java:setState) - Primary selector transitioning to SECONDARY
2019-12-04 20:53:56,377 INFO  raft.RaftJournalSystem (RaftJournalSystem.java:startInternal) - Starting Raft journal system. Cluster addresses: [hslave162:20003]. Local address: hslave162:20003
2019-12-04 20:53:56,692 INFO  transport.GrpcMessagingServer (GrpcMessagingServer.java:lambda$listen$1) - Successfully started messaging server at: hslave162:20003
2019-12-04 20:53:56,695 INFO  state.ServerContext (ServerContext.java:transition) - hslave162:20003 - Transitioning to FOLLOWER
2019-12-04 20:53:56,708 INFO  master.AbstractPrimarySelector (AbstractPrimarySelector.java:setState) - Primary selector transitioning to SECONDARY
2019-12-04 20:54:08,929 INFO  state.ServerContext (ServerContext.java:transition) - hslave162:20003 - Transitioning to CANDIDATE
2019-12-04 20:54:08,932 INFO  state.CandidateState (CandidateState.java:startElection) - hslave162:20003 - Starting election
2019-12-04 20:54:09,082 INFO  state.ServerContext (ServerContext.java:transition) - hslave162:20003 - Transitioning to LEADER
2019-12-04 20:54:09,087 INFO  state.ServerContext (ServerContext.java:setLeader) - hslave162:20003 - Found leader hslave162:20003
2019-12-04 20:54:09,254 INFO  state.ServerStateMachine (ServerStateMachine.java:installSnapshot) - hslave162:20003 - Installing snapshot 1
2019-12-04 20:54:09,340 INFO  raft.JournalStateMachine (JournalStateMachine.java:install) - Successfully installed snapshot up to SN -1
2019-12-04 20:54:13,898 INFO  master.AbstractPrimarySelector (AbstractPrimarySelector.java:setState) - Primary selector transitioning to PRIMARY
2019-12-04 20:54:13,899 INFO  master.AbstractPrimarySelector (AbstractPrimarySelector.java:setState) - Primary selector transitioning to PRIMARY
2019-12-04 20:54:13,900 INFO  server.CopycatServer (CopycatServer.java:lambda$start$6) - Server started successfully!
2019-12-04 20:54:13,903 INFO  raft.RaftJournalSystem (RaftJournalSystem.java:startInternal) - Started Raft Journal System in 17526ms
2019-12-04 20:54:13,903 INFO  master.FaultTolerantAlluxioJobMasterProcess (FaultTolerantAlluxioJobMasterProcess.java:start) - Transitioning from secondary to primary
2019-12-04 20:54:15,664 INFO  session.ClientSession (ClientSessionManager.java:lambda$register$2) - Registered session 48982
2019-12-04 20:54:15,683 INFO  raft.RaftJournalSystem (RaftJournalSystem.java:catchUp) - Performing catchup. Last applied SN: -1. Catchup ID: -4000766732932347040
2019-12-04 20:54:36,065 INFO  raft.RaftJournalSystem (RaftJournalSystem.java:catchUp) - Caught up in 20383ms. Last sequence number from previous term: -1.
2019-12-04 20:54:36,080 INFO  master.AbstractMaster (AbstractMaster.java:stop) - JobMaster: Stopped secondary master.
2019-12-04 20:54:36,080 INFO  master.FaultTolerantAlluxioJobMasterProcess (FaultTolerantAlluxioJobMasterProcess.java:start) - Secondary stopped
2019-12-04 20:54:36,081 INFO  master.AbstractMaster (AbstractMaster.java:start) - JobMaster: Starting primary master.
2019-12-04 20:54:36,089 INFO  master.FaultTolerantAlluxioJobMasterProcess (FaultTolerantAlluxioJobMasterProcess.java:start) - Primary started
2019-12-04 20:54:36,089 INFO  master.AlluxioJobMasterProcess (AlluxioJobMasterProcess.java:startServing) - Alluxio job master web server version 2.2.0-SNAPSHOT starting (gained leadership). webAddress=/0.0.0.0:20002
2019-12-04 20:54:36,101 INFO  util.log (Log.java:initialized) - Logging initialized @42323ms
2019-12-04 20:54:36,217 INFO  web.WebServer (WebServer.java:start) - Alluxio Job Manager Master Web service starting @ /0.0.0.0:20002
2019-12-04 20:54:36,219 INFO  server.Server (Server.java:doStart) - jetty-9.2.z-SNAPSHOT
2019-12-04 20:54:36,248 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@7511c06d{/metrics/json,null,AVAILABLE}
2019-12-04 20:54:45,395 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@731b9c92{/,null,AVAILABLE}
2019-12-04 20:54:45,402 INFO  server.ServerConnector (AbstractConnector.java:doStart) - Started ServerConnector@4f3cf293{HTTP/1.1}{0.0.0.0:20002}
2019-12-04 20:54:45,402 INFO  server.Server (Server.java:doStart) - Started @51624ms
2019-12-04 20:54:45,402 INFO  web.WebServer (WebServer.java:start) - Alluxio Job Manager Master Web service started @ /0.0.0.0:20002
2019-12-04 20:54:45,402 INFO  master.AlluxioJobMasterProcess (AlluxioJobMasterProcess.java:startServing) - Alluxio job master version 2.2.0-SNAPSHOT started (gained leadership). bindAddress=/0.0.0.0:20001, connectAddress=hslave162:20001, webAddress=/0.0.0.0:20002
2019-12-04 20:54:45,402 INFO  master.AlluxioJobMasterProcess (AlluxioJobMasterProcess.java:startServingRPCServer) - Starting Alluxio job master gRPC server on address /0.0.0.0:20001
2019-12-04 20:54:45,474 INFO  master.MasterProcess (MasterProcess.java:registerServices) - registered service JOB_MASTER_CLIENT_SERVICE
2019-12-04 20:54:45,475 INFO  master.MasterProcess (MasterProcess.java:registerServices) - registered service JOB_MASTER_WORKER_SERVICE
2019-12-04 20:54:45,487 INFO  master.AlluxioJobMasterProcess (AlluxioJobMasterProcess.java:startServingRPCServer) - Started Alluxio job master gRPC server on address hslave162:20001
2019-12-04 20:55:02,134 INFO  job.JobMaster (JobMaster.java:registerWorker) - registerWorker(): WorkerNetAddress: WorkerNetAddress{host=hmaster156, rpcPort=30001, dataPort=30002, webPort=30003, domainSocketPath=, tieredIdentity=TieredIdentity(node=hmaster156)} id: 1575464035219

2) master.log on the Alluxio master node (hslave162):

2019-12-04 20:52:34,384 WARN  SaslStreamServerDriver - Error received for channel: 822a2fd1-4a77-4295-807a-feec4e2df6e7. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:34,384 WARN  SaslStreamServerDriver - Error received for channel: 0d7c81e3-0eac-4197-9fa8-e4faf45fa7b4. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:34,384 WARN  SaslStreamServerDriver - Error received for channel: e212747a-db40-430d-997b-60c7eb7f6eb2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:34,384 WARN  SaslStreamServerDriver - Error received for channel: b8dd20e4-13b7-48e1-a399-9bf8f0280658. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,045 WARN  SaslStreamServerDriver - Error received for channel: 99a98071-e933-473f-b573-2c3d6d76fdb0. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,045 WARN  SaslStreamServerDriver - Error received for channel: b0f4013b-f3ca-4f51-ba46-dbd6046576b2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,045 WARN  SaslStreamServerDriver - Error received for channel: 5b8a399a-f7d7-4f13-90e1-f21c0ba33590. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,046 WARN  SaslStreamServerDriver - Error received for channel: 7d4af118-010f-4220-8337-21260cdcb61d. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:36,046 WARN  SaslStreamServerDriver - Error received for channel: e219d970-caae-4359-a4a5-8cde797100c2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:39,506 WARN  SaslStreamServerDriver - Error received for channel: 2c0ac859-5cbd-45c9-8b48-f857b89dc652. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:39,506 WARN  SaslStreamServerDriver - Error received for channel: 243d57cc-3c7e-4312-ae12-a7ca53c6059d. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:39,506 WARN  SaslStreamServerDriver - Error received for channel: 60d1c342-33ed-4965-b945-2d57fa1c6749. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:42,470 INFO  BackupLeaderRole - Closing backup-leader role.
2019-12-04 20:52:42,470 INFO  AlluxioMasterProcess - Alluxio master ended (lost leadership)
2019-12-04 20:52:42,493 INFO  ServerConnector - Stopped ServerConnector@4eaf5e18{HTTP/1.1}{0.0.0.0:19999}
2019-12-04 20:52:42,503 INFO  ContextHandler - Stopped o.e.j.s.ServletContextHandler@508532d4{/,file:/usr/local/alluxio-for-hadoop-2.7.7/webui/master/build/,UNAVAILABLE}
2019-12-04 20:52:42,504 INFO  ContextHandler - Stopped o.e.j.s.ServletContextHandler@6fa726e6{/metrics/json,null,UNAVAILABLE}
2019-12-04 20:52:42,504 INFO  ContextHandler - Stopped o.e.j.s.ServletContextHandler@1a7f5e7d{/metrics/prometheus,null,UNAVAILABLE}
2019-12-04 20:52:42,505 INFO  AbstractMaster - TableMaster: Stopping primary master.
2019-12-04 20:52:42,506 INFO  AbstractMaster - TableMaster: Stopped primary master.
2019-12-04 20:52:42,506 INFO  AbstractMaster - MetaMaster: Stopping primary master.
2019-12-04 20:52:42,506 INFO  AbstractMaster - MetaMaster: Stopped primary master.
2019-12-04 20:52:42,508 INFO  AbstractMaster - FileSystemMaster: Stopping primary master.
2019-12-04 20:52:42,508 INFO  AbstractMaster - FileSystemMaster: Stopped primary master.
2019-12-04 20:52:42,508 INFO  AbstractMaster - BlockMaster: Stopping primary master.
2019-12-04 20:52:42,509 INFO  AbstractMaster - BlockMaster: Stopped primary master.
2019-12-04 20:52:42,509 INFO  AbstractMaster - MetricsMaster: Stopping primary master.
2019-12-04 20:52:42,509 INFO  AbstractMaster - MetricsMaster: Stopped primary master.
2019-12-04 20:52:42,511 INFO  RaftJournalSystem - Shutting down raft journal
2019-12-04 20:52:42,511 INFO  RaftJournalWriter - Closing journal writer. Last sequence numbers written/submitted/committed: 431/431/431
2019-12-04 20:52:42,702 INFO  ServerContext - hslave162:19200 - Transitioning to INACTIVE
2019-12-04 20:52:42,778 INFO  AbstractPrimarySelector - Primary selector transitioning to SECONDARY
2019-12-04 20:52:42,783 INFO  ProcessUtils - Stopping Alluxio master @hslave162:19998.
2019-12-04 20:52:42,811 INFO  RaftJournalSystem - Journal shutdown complete
2019-12-04 20:53:48,341 INFO  BlockMasterFactory - Creating alluxio.master.block.BlockMaster 
2019-12-04 20:53:48,342 INFO  MetaMasterFactory - Creating alluxio.master.meta.MetaMaster 
2019-12-04 20:53:48,341 INFO  FileSystemMasterFactory - Creating alluxio.master.file.FileSystemMaster 
2019-12-04 20:53:48,354 INFO  MetricsMasterFactory - Creating alluxio.master.metrics.MetricsMaster 
2019-12-04 20:53:48,355 INFO  TableMasterFactory - Creating alluxio.master.table.TableMaster 
2019-12-04 20:53:48,633 INFO  NettyUtils - EPOLL_MODE is available
2019-12-04 20:53:48,657 INFO  ExtensionFactoryRegistry - Loading core jars from /usr/local/alluxio-for-hadoop-2.7.7/lib
2019-12-04 20:53:48,734 INFO  ExtensionFactoryRegistry - Loading extension jars from /usr/local/alluxio-for-hadoop-2.7.7/extensions
2019-12-04 20:53:48,759 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=hslave162, rack=null)
2019-12-04 20:53:48,760 INFO  UnderDatabaseRegistry - Loading udb jars from /usr/local/alluxio-for-hadoop-2.7.7/lib
2019-12-04 20:53:48,777 INFO  UnderDatabaseRegistry - Registered UDBs: hive
2019-12-04 20:53:48,780 INFO  LayoutRegistry - Registered Table Layouts: hive
2019-12-04 20:53:49,042 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsAclProvider. HDFS ACLs will not be supported.
2019-12-04 20:53:49,234 WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-12-04 20:53:49,371 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsActiveSyncProvider.HDFS ActiveSync will not be supported.
2019-12-04 20:53:49,707 INFO  ProcessUtils - Starting Alluxio master @hslave162:19998.
2019-12-04 20:53:49,707 INFO  RaftJournalSystem - Initializing Raft Journal System
2019-12-04 20:53:49,728 INFO  JournalStateMachine - Initialized new journal state machine
2019-12-04 20:53:52,644 INFO  AbstractPrimarySelector - Primary selector transitioning to SECONDARY
2019-12-04 20:53:52,645 INFO  RaftJournalSystem - Starting Raft journal system. Cluster addresses: [hslave162:19200]. Local address: hslave162:19200
2019-12-04 20:53:52,939 INFO  GrpcMessagingServer - Successfully started messaging server at: hslave162:19200
2019-12-04 20:53:52,941 INFO  ServerContext - hslave162:19200 - Transitioning to FOLLOWER
2019-12-04 20:53:52,952 INFO  AbstractPrimarySelector - Primary selector transitioning to SECONDARY
2019-12-04 20:54:04,846 INFO  ServerContext - hslave162:19200 - Transitioning to CANDIDATE
2019-12-04 20:54:04,852 INFO  CandidateState - hslave162:19200 - Starting election
2019-12-04 20:54:05,007 INFO  ServerContext - hslave162:19200 - Transitioning to LEADER
2019-12-04 20:54:05,017 INFO  ServerContext - hslave162:19200 - Found leader hslave162:19200
2019-12-04 20:54:05,141 INFO  ServerStateMachine - hslave162:19200 - Installing snapshot 1
2019-12-04 20:54:05,165 INFO  JournalUtils - Reading journal entries
2019-12-04 20:54:05,227 INFO  JournalUtils - Reading journal entries
2019-12-04 20:54:05,227 INFO  JournalUtils - Reading journal entries
2019-12-04 20:54:05,236 INFO  JournalUtils - Reading journal entries
2019-12-04 20:54:05,238 INFO  JournalUtils - Reading journal entries
2019-12-04 20:54:05,238 INFO  JournalUtils - Reading journal entries
2019-12-04 20:54:05,239 INFO  JournalUtils - Reading journal entries
2019-12-04 20:54:05,239 INFO  JournalUtils - Reading journal entries
2019-12-04 20:54:05,239 INFO  JournalStateMachine - Successfully installed snapshot up to SN -1
2019-12-04 20:54:15,796 INFO  AbstractPrimarySelector - Primary selector transitioning to PRIMARY
2019-12-04 20:54:15,797 INFO  AbstractPrimarySelector - Primary selector transitioning to PRIMARY
2019-12-04 20:54:15,798 INFO  CopycatServer - Server started successfully!
2019-12-04 20:54:15,804 INFO  RaftJournalSystem - Started Raft Journal System in 23159ms
2019-12-04 20:54:15,856 INFO  AlluxioMasterProcess - All masters started
2019-12-04 20:54:15,856 INFO  FaultTolerantAlluxioMasterProcess - Secondary started
2019-12-04 20:54:15,859 INFO  AbstractMaster - TableMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO  AbstractMaster - MetaMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO  AbstractMaster - FileSystemMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO  AbstractMaster - BlockMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO  AbstractMaster - MetricsMaster: Stopped secondary master.
2019-12-04 20:54:15,860 INFO  FaultTolerantAlluxioMasterProcess - Secondary stopped
2019-12-04 20:54:17,368 INFO  ClientSession - Registered session 153458
2019-12-04 20:54:17,379 INFO  RaftJournalSystem - Performing catchup. Last applied SN: 431. Catchup ID: -3833932302058852624
2019-12-04 20:54:37,479 INFO  RaftJournalSystem - Caught up in 20101ms. Last sequence number from previous term: 431.
2019-12-04 20:54:37,483 INFO  AbstractMaster - MetricsMaster: Starting primary master.
2019-12-04 20:54:37,491 INFO  AbstractMaster - BlockMaster: Starting primary master.
2019-12-04 20:54:37,505 INFO  AbstractMaster - FileSystemMaster: Starting primary master.
2019-12-04 20:54:37,505 INFO  DefaultFileSystemMaster - Starting fs master as primary
2019-12-04 20:54:37,664 INFO  AbstractMaster - MetaMaster: Starting primary master.
2019-12-04 20:54:37,700 INFO  DefaultMetaMaster - Detected existing cluster ID cc03d9bd-c8d4-4aaf-b332-d848f27900d1
2019-12-04 20:54:37,755 INFO  BackupLeaderRole - Creating backup-leader role.
2019-12-04 20:54:37,755 INFO  AbstractMaster - TableMaster: Starting primary master.
2019-12-04 20:54:37,792 INFO  AlluxioMasterProcess - All masters started
2019-12-04 20:54:37,812 INFO  MetricsSystem - Starting sinks with config: {}.
2019-12-04 20:54:37,812 INFO  AlluxioMasterProcess - Alluxio master web server version 2.2.0-SNAPSHOT starting (gained leadership). webAddress=/0.0.0.0:19999
2019-12-04 20:54:37,837 INFO  log - Logging initialized @50486ms
2019-12-04 20:54:38,157 INFO  WebServer - Alluxio Master Web service starting @ /0.0.0.0:19999
2019-12-04 20:54:38,182 INFO  Server - jetty-9.2.z-SNAPSHOT
2019-12-04 20:54:38,211 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@3ecbc252{/metrics/prometheus,null,AVAILABLE}
2019-12-04 20:54:38,211 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@1449db74{/metrics/json,null,AVAILABLE}
2019-12-04 20:54:51,459 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@6bdeb4d7{/,file:/usr/local/alluxio-for-hadoop-2.7.7/webui/master/build/,AVAILABLE}
2019-12-04 20:54:51,466 INFO  ServerConnector - Started ServerConnector@d12486e{HTTP/1.1}{0.0.0.0:19999}
2019-12-04 20:54:51,466 INFO  Server - Started @64115ms
2019-12-04 20:54:51,466 INFO  WebServer - Alluxio Master Web service started @ /0.0.0.0:19999
2019-12-04 20:54:51,466 INFO  AlluxioMasterProcess - Alluxio master version 2.2.0-SNAPSHOT started (gained leadership). bindAddress=/0.0.0.0:19998, connectAddress=hslave162:19998, webAddress=/0.0.0.0:19999
2019-12-04 20:54:51,472 INFO  AlluxioMasterProcess - Starting Alluxio master gRPC server on address /0.0.0.0:19998
2019-12-04 20:54:51,491 INFO  MasterProcess - registered service METRICS_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,551 INFO  MasterProcess - registered service BLOCK_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,551 INFO  MasterProcess - registered service BLOCK_MASTER_WORKER_SERVICE
2019-12-04 20:54:51,650 INFO  MasterProcess - registered service FILE_SYSTEM_MASTER_WORKER_SERVICE
2019-12-04 20:54:51,650 INFO  MasterProcess - registered service FILE_SYSTEM_MASTER_JOB_SERVICE
2019-12-04 20:54:51,650 INFO  MasterProcess - registered service FILE_SYSTEM_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,697 INFO  MasterProcess - registered service META_MASTER_CONFIG_SERVICE
2019-12-04 20:54:51,697 INFO  MasterProcess - registered service META_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,697 INFO  MasterProcess - registered service META_MASTER_BACKUP_MESSAGING_SERVICE
2019-12-04 20:54:51,698 INFO  MasterProcess - registered service META_MASTER_MASTER_SERVICE
2019-12-04 20:54:51,740 INFO  MasterProcess - registered service TABLE_MASTER_CLIENT_SERVICE
2019-12-04 20:54:51,751 INFO  DefaultSafeModeManager - Rpc server started, waiting 5000ms for workers to register
2019-12-04 20:54:51,751 INFO  AlluxioMasterProcess - Started Alluxio master gRPC server on address hslave162:19998
2019-12-04 20:54:51,762 INFO  FaultTolerantAlluxioMasterProcess - Primary started

3) job_worker.log on the Alluxio worker node (hmaster156):

2019-12-04 20:54:02,847 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 1): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:02,965 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 2): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:03,193 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 3): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:03,621 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 4): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:04,435 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 5): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:06,112 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 6): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:09,376 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 7): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:14,833 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 8): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:20,104 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 9): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:25,565 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 10): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:30,894 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 11): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:35,999 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 12): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:41,037 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 13): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2019-12-04 20:54:46,113 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 14): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:51,574 WARN  retry.RetryUtils (RetryUtils.java:retry) - Failed to load cluster default configuration with master (attempt 15): alluxio.exception.status.UnavailableException: Failed to handshake with master hslave162:19998 to load cluster default configuration values: UNAVAILABLE: Network closed for unknown reason
2019-12-04 20:54:57,470 INFO  network.NettyUtils (NettyUtils.java:checkNettyEpollAvailable) - EPOLL_MODE is available
2019-12-04 20:54:57,555 INFO  network.TieredIdentityFactory (TieredIdentityFactory.java:localIdentity) - Initialized tiered identity TieredIdentity(node=hmaster156, rack=null)
2019-12-04 20:54:57,676 INFO  util.log (Log.java:initialized) - Logging initialized @56199ms
2019-12-04 20:54:57,843 INFO  alluxio.ProcessUtils (ProcessUtils.java:run) - Starting Alluxio job worker.
2019-12-04 20:54:57,852 INFO  web.WebServer (WebServer.java:start) - Alluxio Job Manager Worker Web service starting @ /0.0.0.0:30003
2019-12-04 20:54:57,863 INFO  server.Server (Server.java:doStart) - jetty-9.2.z-SNAPSHOT
2019-12-04 20:54:57,926 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@62833051{/metrics/json,null,AVAILABLE}
2019-12-04 20:55:02,954 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@3fcdcf{/,null,AVAILABLE}
2019-12-04 20:55:02,963 INFO  server.ServerConnector (AbstractConnector.java:doStart) - Started ServerConnector@282308c3{HTTP/1.1}{0.0.0.0:30003}
2019-12-04 20:55:02,963 INFO  server.Server (Server.java:doStart) - Started @61487ms
2019-12-04 20:55:02,963 INFO  web.WebServer (WebServer.java:start) - Alluxio Job Manager Worker Web service started @ /0.0.0.0:30003
2019-12-04 20:55:02,971 INFO  metrics.MetricsSystem (MetricsSystem.java:startSinksFromConfig) - Starting sinks with config: {}.
2019-12-04 20:55:03,350 INFO  worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:start) - Started Alluxio job worker with id 1575464035219
2019-12-04 20:55:03,351 INFO  worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:start) - Alluxio job worker version 2.2.0-SNAPSHOT started. bindHost=/0.0.0.0:30001, connectHost=hmaster156:30001, rpcPort=30001, webPort=30003
2019-12-04 20:55:03,351 INFO  worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:startServingRPCServer) - Starting gRPC server on address hmaster156:30001
2019-12-04 20:55:03,417 INFO  worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:startServingRPCServer) - Started gRPC server on address hmaster156:30001

4) worker.log on the Alluxio worker node (hmaster156):

2019-12-04 20:05:33,818 WARN  SaslStreamServerDriver - Error received for channel: 947b2f0d-a4e3-4cdb-95d2-b6b9fe75d480. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:05:33,820 WARN  SaslStreamServerDriver - Error received for channel: 96cccf3c-7aa7-4f71-8029-b22115b4a9e8. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:36:45,846 WARN  SaslStreamServerDriver - Error received for channel: 71bc4b60-e968-4e6d-931e-cbccb565ec1b. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:36:45,846 WARN  SaslStreamServerDriver - Error received for channel: 7d8207ef-4d45-40bd-9b7b-014b9bd33aff. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:35,477 WARN  SaslStreamServerDriver - Error received for channel: db43d1a6-3837-46ac-8089-b8d8da9837b2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-04 20:52:35,481 WARN  SaslStreamServerDriver - Error received for channel: 2c1540f4-cfda-4f4f-b208-1cac65f69a8c. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close

(3) The worker was then started on its own. Eleven hours later, when re-testing the upload of alluxio2hadoop_test.txt from the local file system to Alluxio, the worker had again stopped by itself. Checking the logs, the newly appeared warnings were:

1) master.log on the Alluxio master node (hslave162):

2019-12-04 22:39:08,528 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-04 23:39:08,528 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 00:39:08,529 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 01:39:08,529 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 02:39:08,529 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 03:39:08,529 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 04:39:08,529 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 05:39:08,529 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 06:39:08,529 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 07:39:08,530 INFO  ServerConfigurationChecker - All server-side configurations are consistent.
2019-12-05 08:21:05,057 WARN  SaslStreamServerDriver - Error received for channel: 47492e42-a761-4b01-b8ca-50d43a0f724d. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2019-12-05 08:21:05,058 WARN  SaslStreamServerDriver - Error received for channel: cfabbc0d-1bdb-4361-80d0-1a079c2f54f2. Error: io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close

2) job_worker.log on the Alluxio worker node (hmaster156):

2019-12-05 03:59:42,002 WARN  heartbeat.SleepingTimer (SleepingTimer.java:tick) - Job Worker Command Handling last execution took 1015 ms. Longer than the interval 1000
2019-12-05 07:20:14,726 WARN  heartbeat.SleepingTimer (SleepingTimer.java:tick) - Job Worker Command Handling last execution took 1116 ms. Longer than the interval 1000
2019-12-05 08:02:25,518 WARN  heartbeat.SleepingTimer (SleepingTimer.java:tick) - Job Worker Command Handling last execution took 1222 ms. Longer than the interval 1000

No relevant information could be found about these warnings either.

Solution:

By chance, it was noticed that starting Alluxio produced some unexpected output on the command line.

Checking task.log on the Alluxio worker node (hmaster156) showed no ERROR or WARN entries, only the following output from starting the worker:

Executing the following command on all worker nodes and logging to /usr/local/alluxio2.1.0-for-hadoop-2.7.7/logs/task.log: /usr/local/alluxio2.1.0-for-hadoop-2.7.7/bin/alluxio-start.sh -a worker noMount
[hmaster156] Connecting as hadoop...
[2019-12-05 15:17:34][hmaster156] nohup: ignoring input
[2019-12-05 15:17:34][hmaster156] Invalid mount mode: noMount
[2019-12-05 15:17:34][hmaster156] Usage: alluxio-start.sh [-hNwm] [-i backup] ACTION [MOPT] [-f]
[2019-12-05 15:17:34][hmaster156] Where ACTION is one of:
[2019-12-05 15:17:34][hmaster156] all [MOPT]         	Start all masters, proxies, and workers.
[2019-12-05 15:17:34][hmaster156] job_master         	Start the job_master on this node.
[2019-12-05 15:17:34][hmaster156] job_masters        	Start job_masters on master nodes.
[2019-12-05 15:17:34][hmaster156] job_worker         	Start a job_worker on this node.
[2019-12-05 15:17:34][hmaster156] job_workers        	Start job_workers on worker nodes.
[2019-12-05 15:17:34][hmaster156] local [MOPT]       	Start all processes locally.
[2019-12-05 15:17:34][hmaster156] master             	Start the local master on this node.
[2019-12-05 15:17:34][hmaster156] secondary_master   	Start the local secondary master on this node.
[2019-12-05 15:17:34][hmaster156] masters            	Start masters on master nodes.
[2019-12-05 15:17:34][hmaster156] proxy              	Start the proxy on this node.
[2019-12-05 15:17:35][hmaster156] proxies            	Start proxies on master and worker nodes.
[2019-12-05 15:17:35][hmaster156] safe               	Script will run continuously and start the master if it's not running.
[2019-12-05 15:17:35][hmaster156] worker [MOPT]      	Start a worker on this node.
[2019-12-05 15:17:35][hmaster156] workers [MOPT]     	Start workers on worker nodes.
[2019-12-05 15:17:35][hmaster156] logserver          	Start the logserver
[2019-12-05 15:17:35][hmaster156] restart_worker     	Restart a failed worker on this node.
[2019-12-05 15:17:35][hmaster156] restart_workers    	Restart any failed workers on worker nodes.
[2019-12-05 15:17:35][hmaster156] 
[2019-12-05 15:17:35][hmaster156] MOPT (Mount Option) is one of:
[2019-12-05 15:17:35][hmaster156] Mount    	Mount the configured RamFS if it is not already mounted.
[2019-12-05 15:17:35][hmaster156] SudoMount	Mount the configured RamFS using sudo if it is not already mounted.
[2019-12-05 15:17:35][hmaster156] NoMount  	Do not mount the configured RamFS.
[2019-12-05 15:17:35][hmaster156] Notice: to avoid sudo requirement but using tmpFS in Linux,
[2019-12-05 15:17:35][hmaster156] set ALLUXIO_RAM_FOLDER=/dev/shm on each worker and use NoMount.
[2019-12-05 15:17:35][hmaster156] NoMount is assumed if MOPT is not specified.
[2019-12-05 15:17:35][hmaster156] 
[2019-12-05 15:17:35][hmaster156] -a         asynchronously start all processes. The script may exit before all
[2019-12-05 15:17:35][hmaster156] processes have been started.
[2019-12-05 15:17:35][hmaster156] -f         format Journal, UnderFS Data and Workers Folder on master.
[2019-12-05 15:17:35][hmaster156] -h         display this help.
[2019-12-05 15:17:35][hmaster156] -i backup  a journal backup to restore the master from. The backup should be
[2019-12-05 15:17:35][hmaster156] a URI path within the root under filesystem, e.g.
[2019-12-05 15:17:35][hmaster156] hdfs://mycluster/alluxio_backups/alluxio-journal-YYYY-MM-DD-timestamp.gz.
[2019-12-05 15:17:35][hmaster156] -N         do not try to kill previous running processes before starting new ones.
[2019-12-05 15:17:35][hmaster156] -w         wait for processes to end before returning.
[2019-12-05 15:17:35][hmaster156] 
[2019-12-05 15:17:35][hmaster156] Supported environment variables:
[2019-12-05 15:17:35][hmaster156] 
[2019-12-05 15:17:35][hmaster156] ALLUXIO_JOB_WORKER_COUNT - identifies how many job workers to start per node (default = 1)
[2019-12-05 15:17:35][hmaster156] Connection to hmaster156 closed.

From this it was inferred that the startup command

./alluxio-start.sh all NoMount

was the problem (the output above shows the worker being started with mount mode "noMount", which the script rejects as invalid). Changing it to the following command starts the Alluxio cluster normally:

./alluxio-start.sh all SudoMount
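
With SudoMount the worker should now have a RamFS mounted for its top storage tier. This can be spot-checked on hmaster156, assuming the default ramdisk location (the actual path is governed by alluxio.worker.tieredstore.level0.dirs.path):

# A ramfs/tmpfs mount for the Alluxio ramdisk should appear in the mount table
mount | grep -i ramdisk
df -h | grep -i ramdisk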

After waiting 18 hours, the worker had not stopped on its own.

Problem 3 appeared:

The worker still sometimes failed to start. After rebuilding the cluster as alluxio-2.1.0 + hadoop-2.7.3, the following appeared in task.log on the Alluxio master node (hslave162):

[sudo] password for hadoop:
ERROR: mkdir /usr/local/alluxio-2.1.0/ramdisk failed

Solution:

Grant passwordless sudo to the user that starts the worker service on the worker machine by adding the following to /etc/sudoers:

# <user_name> ALL=(ALL)       NOPASSWD: ALL
# The actual entry used here:
hadoop     ALL=(ALL)       NOPASSWD: ALL
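
It is safest to edit /etc/sudoers with visudo. Once the entry is in place, passwordless sudo for the hadoop user can be verified on the worker node:

# Should print "ok" without prompting for a password
sudo -n true && echo ok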

 

The cluster performance testing part is to be continued.
