As any self-respecting computer person knows, a stubborn tinkering spirit is essential. Today I wanted to use the long-dormant cluster built on VMware VMs, only to find Spark had all sorts of problems, presumably pits dug when I hastily set the cluster up earlier (I had used it a few days ago, but not in cluster mode, as I only now realize). Facing these pits, I decisively chose to reinstall, so once again I'm happily setting up the environment..
This time, though, I paid extra attention to the details, aiming to record the whole setup process clearly. Besides installing Scala and Spark, this post also covers running and debugging (the real point), including packaging a jar in IDEA and uploading it for execution, as well as submitting remotely from IDEA; both are recorded separately below.
For the various ways of submitting Spark jobs from IDEA, see my other article.
Cluster environment
Thanks to my 16 GB of RAM, I can run all four VMs at once and still play around comfortably. These four VMs already have Hadoop installed, and the Hadoop cluster works fine, so I kept it.
The versions are as follows:
Item | Version | Notes |
---|---|---|
Hadoop | 2.7.3 | |
Java | 1.8.0 | |
Scala | 2.11.8 | to be installed |
Spark | 2.2.0 | to be installed |
Install Scala on the master node
- Download, extract, rename, and move to a custom path
wget http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
tar -zxvf scala-2.11.8.tgz
mv scala-2.11.8 scala
- Update /etc/profile
sudo vi /etc/profile
# append at the end of the file
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
source /etc/profile
- Verify the installation
scala -version
Configure Spark on the master node
- Download, extract, rename, and move to a custom directory
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
mv spark-2.2.0-bin-hadoop2.7 spark
- Update /etc/profile
vi /etc/profile
# append at the end
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
- Edit the Spark configuration files
cd spark/conf
# rename first: strip the .template suffix
mv spark-env.sh.template spark-env.sh
mv slaves.template slaves
vi spark-env.sh
# append these variables at the end
export JAVA_HOME=/usr/local/java/jdk1.8.0_112
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SCALA_HOME=/usr/local/scala
export SPARK_MASTER_IP=192.168.146.130  # hadoop01
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
vi slaves
# append each worker node (hostname or IP) at the end
hadoop02
hadoop03
hadoop04
# spark-defaults.conf also needs attention; I skipped it at first, which caused errors
# its modification is covered further below
Distribute to the nodes and test the cluster
- Distribute copies to each Slave node (this could be scripted; I was lazy..)
# scala
scp -r scala hadoop02:/usr/local/
scp -r scala hadoop03:/usr/local/
scp -r scala hadoop04:/usr/local/
# spark
scp -r spark hadoop02:/usr/local/
scp -r spark hadoop03:/usr/local/
scp -r spark hadoop04:/usr/local/
# profile
sudo scp /etc/profile hadoop02:/etc/profile
sudo scp /etc/profile hadoop03:/etc/profile
sudo scp /etc/profile hadoop04:/etc/profile
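As noted above, the nine scp commands are easy to script. A minimal sketch, looping over the worker hosts (paths assume the install locations used in /etc/profile above; the echo makes this a dry run):

```shell
# Loop over the worker hosts instead of repeating each scp by hand.
# echo prints each command as a dry run; remove it to copy for real.
for host in hadoop02 hadoop03 hadoop04; do
  echo scp -r /usr/local/scala "$host:/usr/local/"
  echo scp -r /usr/local/spark "$host:/usr/local/"
  echo sudo scp /etc/profile "$host:/etc/profile"
done
```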
- Test the cluster
Since we only need Hadoop's HDFS here, there is no need to start all of Hadoop's services.
start-dfs.sh
Both hadoop/sbin and spark/sbin are on the PATH, and each directory contains its own start-all.sh. The safe way is to change into the Spark directory and run the script explicitly from there.
cd /usr/local/spark/sbin
./start-all.sh
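To sidestep that start-all.sh name clash permanently, one option (a sketch; the alias names are my own) is to give Spark's scripts unambiguous aliases:

```shell
# Unambiguous names for Spark's cluster scripts, avoiding the
# clash with Hadoop's start-all.sh/stop-all.sh on the PATH.
alias spark-start='/usr/local/spark/sbin/start-all.sh'
alias spark-stop='/usr/local/spark/sbin/stop-all.sh'
```

Put these in ~/.bashrc so they survive new shells.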
- Expected processes on each node
[hadoop@hadoop01 ~]$ jps
18822 SecondaryNameNode
18521 NameNode
18634 DataNode
18990 Master
19055 Jps
[hadoop@hadoop02 ~]$ jps
33380 DataNode
33589 Jps
33519 Worker
[hadoop@hadoop03 ~]$ jps
25876 Jps
25656 DataNode
25806 Worker
[hadoop@hadoop04 ~]$ jps
32162 Worker
32025 DataNode
32234 Jps
Note that there are 3 Workers here, because the master node is not configured to start one. It could be (just as HDFS runs a DataNode on all four nodes here), but since Spark will be running compute jobs, it is better to keep a Worker off the master node so jobs don't compete with the master for its resources.
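The Worker count can also be checked from the shell. A small sketch (the commented ssh loop assumes passwordless ssh between nodes, which the scp distribution above already required):

```shell
# count_workers: counts jps lines that end in "Worker"
count_workers() { grep -c 'Worker$'; }

# On the live cluster this would be driven over ssh, e.g.:
#   for h in hadoop02 hadoop03 hadoop04; do ssh "$h" jps; done | count_workers
# Against the jps output above it reports 3.
```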
- Spark UI showing normal status
Packaging the project in IDEA
- Example project
The example program reads the HDFS file Vote-demo.txt, builds a graph from it with GraphX, and prints the graph's edge count.
- Example code
RemoteDemo.scala
package Remote

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object RemoteDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("SimpleDemo")
      .setMaster("spark://hadoop01:7077")
      .setJars(List("I:\\IDEA_PROJ\\VISNWK\\out\\artifacts\\visnwk_jar\\visnwk.jar"))
    val sc = new SparkContext(conf)

    // Build a graph from a tab-separated edge list, skipping "#" comment lines.
    def loadEdges(fn: String): Graph[Any, String] = {
      val edges: RDD[Edge[String]] =
        sc.textFile(fn).filter(l => !l.startsWith("#")).map {
          line =>
            val fields = line.split("\t")
            Edge(fields(0).toLong, fields(1).toLong, "1.0")
        }
      val graph: Graph[Any, String] = Graph.fromEdges(edges, "defaultProperty")
      graph
    }

    val graph = loadEdges("hdfs://hadoop01:9000/TVCG/SNAP/DATASET/Vote-demo.txt")
    println(s"graph.edges.count() = ${graph.edges.count()}")
    sc.stop() // forgot this at first, which caused the error covered below
  }
}
- Package the project, remembering to specify the Main Class
Build the jar artifact
- Run configuration
Error: remote connection from IDEA fails
- Error details
Troubleshooting, round one
Spark://Hadoop01:7077
Change spark://host_name:7077 to spark://master_ip:7077
Update spark-env.sh accordingly
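Using the raw IP sidesteps hostname resolution; the cleaner alternative is making every node (and the dev machine) resolve the hostname consistently, e.g. an /etc/hosts entry for the master (worker lines follow the same pattern with their own IPs, which aren't listed in this post):

```
192.168.146.130  hadoop01
```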
- Restart the Spark cluster
./sbin/stop-all.sh
./sbin/start-all.sh
- The Spark UI at Master_IP:8080 shows normal status
Troubleshooting, round two
- Modify Spark's spark-defaults.conf
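The concrete edit was shown as a screenshot that isn't preserved here; purely as an illustration, the relevant kind of entry is a standard Spark property pointing at the master (whether this exact key was the one changed is an assumption):

```
spark.master    spark://192.168.146.130:7077
```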
Troubleshooting, round three
To rule out problems with the cluster itself, try submitting with spark-submit directly.
Package without bundling dependencies (note the resulting jar is only ~300 KB).
- The cluster printed the following error
[hadoop@hadoop01 bin]$ ./spark-submit --class "Remote.RemoteDemo" ~/visnwk-build.jar
17/07/03 19:48:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/03 19:49:03 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find AppClient.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:644)
at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:178)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
Fix (via the link here): add this at the very end of the example code:
sc.stop
- Submitting on the cluster now works
http://192.168.146.130:4040/jobs/ — the 4040 UI is only accessible while a job is running; once it finishes, the page is gone.
- The cluster output is correct
Back to the IDEA submission problem
- Compare this error with the one produced by a deliberately wrong IP
// with the bogus IP (xxxxx) the error is:
18/05/25 19:06:20 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://118.202.40.210:4042
18/05/25 19:06:20 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://xxxxx:7077...
18/05/25 19:06:23 WARN TransportClientFactory: DNS resolution for xxxxx:7077 took 2551 ms
18/05/25 19:06:23 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master xxxxx:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
// with the correct master_ip the error is:
18/05/25 19:07:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://118.202.40.210:4040
18/05/25 19:07:27 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.146.130:7077...
18/05/25 19:07:27 INFO TransportClientFactory: Successfully created connection to /192.168.146.130:7077 after 23 ms (0 ms spent in bootstraps)
18/05/25 19:07:27 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.146.130:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
// comparing the two logs: the final exception is identical, but the intermediate lines differ, so this is not a plain connection failure
- Suspected a problem with port 7077, but the binding turned out to be fine.
Next suspect: a version mismatch. The cluster runs scala-2.11.8 + Spark-2.2.0.
Fix (via the link here): bump the Spark version in the sbt build; it was still 2.1.0. Damn!
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "2.2.0"
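For reference, a minimal build.sbt consistent with the cluster versions (the project name is a guess taken from the visnwk.jar path used earlier):

```scala
name := "visnwk"          // guessed from the jar name above
version := "0.1"
scalaVersion := "2.11.8"  // must match the cluster's Scala line

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "2.2.0",
  "org.apache.spark" %% "spark-graphx" % "2.2.0"
)
```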
- Run again: that error is gone~ (it turned out setJars had been commented out), so restore it:
val conf = new SparkConf()
.setAppName("SimpleDemo")
.setMaster("spark://192.168.146.130:7077")
//.setIfMissing("spark.driver.host", "127.0.0.1") // if unset, the machine's physical IP is used by default
.setJars(List("I:\\IDEA_PROJ\\VISNWK\\out\\artifacts\\visnwk_jar\\visnwk.jar"))
val sc = new SparkContext(conf)
- A perfect finish:
18/05/25 19:25:36 INFO DAGScheduler: Job 0 finished: reduce at EdgeRDDImpl.scala:90, took 60.614562 s
graph.edges.count() = 103689 // finally!!
18/05/25 19:25:36 INFO SparkUI: Stopped Spark web UI at http://118.202.40.210:4040
18/05/25 19:25:36 INFO StandaloneSchedulerBackend: Shutting down all executors
18/05/25 19:25:36 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/05/25 19:25:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/05/25 19:25:36 INFO MemoryStore: MemoryStore cleared
18/05/25 19:25:36 INFO BlockManager: BlockManager stopped
18/05/25 19:25:36 INFO BlockManagerMaster: BlockManagerMaster stopped
18/05/25 19:25:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/05/25 19:25:36 INFO SparkContext: Successfully stopped SparkContext
18/05/25 19:25:36 INFO ShutdownHookManager: Shutdown hook called
18/05/25 19:25:36 INFO ShutdownHookManager: Deleting directory C:\Users\msi\AppData\Local\Temp\spark-fae200dd-12cc-4b8a-b2ec-751d641d3689
Process finished with exit code 0
Environment status shown at http://118.202.40.210:4040/environment/ while the job runs
Note: the local Windows machine's IP is 118.202.40.210
Other miscellaneous problems