Pitfalls I've run into (updated as I hit new ones):
1. After startup, jps looks normal, but the Hadoop web UI does not show the worker nodes; a restart sometimes fixes this.
2. When adding a node, re-running hdfs namenode -format causes a clusterID mismatch and the DataNode fails to start (covered earlier).
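One quick way to confirm the mismatch is to compare the clusterID lines in the VERSION files of the NameNode and DataNode storage directories (on a real cluster they live under dfs.namenode.name.dir/current and dfs.datanode.data.dir/current from hdfs-site.xml). A minimal sketch of the check, using sample files instead of the real paths:

```shell
# Sketch: compare HDFS clusterIDs. On a real cluster, replace the two
# fake files below with:
#   <dfs.namenode.name.dir>/current/VERSION  (NameNode)
#   <dfs.datanode.data.dir>/current/VERSION  (DataNode)
dir=$(mktemp -d)
printf 'clusterID=CID-aaa\n' > "$dir/nn_VERSION"
printf 'clusterID=CID-bbb\n' > "$dir/dn_VERSION"

# Pull the clusterID value out of each VERSION file
nn_id=$(grep '^clusterID=' "$dir/nn_VERSION" | cut -d= -f2)
dn_id=$(grep '^clusterID=' "$dir/dn_VERSION" | cut -d= -f2)

if [ "$nn_id" = "$dn_id" ]; then
  echo "clusterIDs match"
else
  echo "clusterID mismatch: $nn_id vs $dn_id"
fi
```

On a mismatched DataNode you can either copy the NameNode's clusterID into the DataNode's VERSION file, or wipe the DataNode's data directory and let it re-register.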
3. The core count and memory size shown in the YARN UI do not match the real cluster.
By default YARN assumes 8 cores and 8 GB per machine; if that is not your actual hardware, edit yarn-site.xml.
Add the following (set the values to your machines' real sizes; memory is in MB, and the file must be changed on every machine):
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
</property>
4. spark-submit --master yarn --deploy-mode cluster **.py fails when running Spark on the YARN cluster
# Error output
Exception in thread "main" org.apache.spark.SparkException: Application application_1543628881761_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The cause: the Spark jars cannot be found.
One fix is to upload the jars from the Spark installation directory to HDFS and add that HDFS path to spark-defaults.conf:
#spark-defaults.conf
spark.yarn.jars hdfs:///usr/local/spark/spark_jars/*
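For reference, the upload itself looks roughly like this (cluster-only commands, shown as a sketch; the target path matches the spark-defaults.conf line above, and $SPARK_HOME is assumed to point at your Spark install):

```shell
# Upload Spark's bundled jars to HDFS so YARN containers can fetch them.
# Requires a running HDFS; paths are assumptions -- adjust to your layout.
hdfs dfs -mkdir -p /usr/local/spark/spark_jars
hdfs dfs -put "$SPARK_HOME"/jars/* /usr/local/spark/spark_jars/
```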
5. Out-of-memory problems
Insufficient memory causes nodes to be lost and connections to them to fail:
18/12/03 19:35:37 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container killed by YARN for exceeding memory limits. 6.3 GB of 6.3 GB virtual memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/12/03 19:38:48 WARN TaskSetManager: Lost task 1.0 in stage 2.0 (TID 18, master, executor 2): FetchFailed(BlockManagerId(1, node1, 33421, None), shuffleId=0, mapId=5, reduceId=1, message=
Caused by: java.io.IOException: Failed to connect to node1/202......
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: node1/202.......
Caused by: java.net.ConnectException: Connection refused
So the lost nodes and refused connections were really caused by running out of memory.
The log is long, and at first I only noticed the "connection refused" lines, so I assumed a communication problem and kept digging there, even though ssh between the nodes worked fine. Later I read that it could be insufficient memory and tried lowering --executor-memory and --num-executors, which didn't help either, so I went back to the network theory.
In the end, removing the memory settings from spark-defaults.conf and then lowering --executor-memory and --num-executors fixed it.
(My first guess was that spark-defaults.conf outranks command-line flags, but it is actually the other way around: spark-submit flags override spark-defaults.conf, and SparkConf set in code overrides both. More likely the conf file carried other memory-related settings, e.g. a memoryOverhead value, that the flags alone did not replace.)
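The working submit ended up looking roughly like this (cluster-only command, shown as a sketch; app.py and the resource sizes are placeholders, not values from the original run):

```shell
# Sketch: run on YARN with explicitly lowered executor resources, so the
# containers fit under the NodeManager's memory limit.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-memory 1g \
  app.py
```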
6. Spark fails during a shuffle with "No space left on device".
During a shuffle, Spark first writes data to a local scratch directory on disk (the location can be set in spark-env.sh); if that disk fills up, the job fails.
Fixes: 1. free up space on that disk; 2. point the scratch directory at another disk (via the SPARK_LOCAL_DIRS variable).
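To see how full the current scratch disk is and redirect the scratch space, something like the following works (the /data/spark_tmp path is a made-up example; Spark's default scratch location is /tmp unless SPARK_LOCAL_DIRS is set):

```shell
# Check free space on the default shuffle-spill location
df -h /tmp

# Then, in $SPARK_HOME/conf/spark-env.sh on every node, point Spark at a
# larger disk (example path, adjust to your machines):
#   export SPARK_LOCAL_DIRS=/data/spark_tmp
```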
7.Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>