spark-submit, using KMeans as an example
Local mode:
Use setMaster("local") and run directly in IDEA with right-click → Run.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering._
import org.apache.spark.mllib.linalg.Vectors

object test {
  def main(args: Array[String]): Unit = {
    // 1. Build the Spark context.
    //    Local mode is hard-coded here; for the YARN submit later, remove setMaster
    //    (or use "yarn-client") so that the --master passed to spark-submit takes effect.
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("KMeans")
      .set("spark.driver.memory", "512m")
      .set("spark.executor.memory", "512m")
    val sc = new SparkContext(conf)
    println("mode: " + sc.master)
    // Suppress the noisy INFO/WARN log output
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    // 2. Read the sample data: one space-separated numeric vector per line
    val data = sc.textFile("file:///test/kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
    // 3. Train a KMeans model: 2 clusters, at most 20 iterations
    val model = KMeans.train(parsedData, 2, 20)
    // Print the cluster centers
    model.clusterCenters.foreach(println)
  }
}
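Besides printing the centers, the trained model can also be evaluated and applied; a minimal sketch, assuming the parsedData and model values from the code above (this is only an illustration, not part of the original example):

    // Within-set sum of squared errors: smaller means tighter clusters
    val WSSSE = model.computeCost(parsedData)
    println(s"Within Set Sum of Squared Errors = $WSSSE")
    // Assign every input vector to its nearest cluster center
    val assignments = model.predict(parsedData)
    assignments.take(10).foreach(println)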
The data can also be read from HDFS by using an hdfs:// path instead, for example:
val data = sc.textFile("hdfs:///test/kmeans_data.txt")
YARN mode:
First build a jar; this can be done with IDEA or with sbt (a build.sbt sketch is shown after these steps).
In IDEA, choose Build → Build Artifacts, then test → Build.
The corresponding jar file is then generated in a subdirectory of the project:
the test.jar we need is under /out/artifacts, next to the src directory.
Copy it to an easy-to-find directory, such as /export/spark_jar/.
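If you package with sbt instead of IDEA, a minimal build.sbt sketch could look like the following (the project name and exact versions are assumptions, picked to match the spark-2.3.1-bin-hadoop2.7 / Scala 2.11 environment used in this article); run sbt package and take the jar from target/scala-2.11/:

    name := "test"
    version := "0.1"
    scalaVersion := "2.11.12"
    // Spark itself is supplied by spark-submit at runtime, so mark it "provided"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "2.3.1" % "provided",
      "org.apache.spark" %% "spark-mllib" % "2.3.1" % "provided"
    )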
[root@master ~]# spark-submit --class test --master yarn file:///export/spark_jar/test.jar
Result: the cluster centers are printed in the driver output.
spark-submit command-line options:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--queue thequeue \
examples/target/scala-2.11/jars/spark-examples*.jar 10
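Applied to the test.jar built above, a submit command with explicit resources might look like this (the paths and the 512m memory sizes are simply the ones used earlier in this article, and it assumes setMaster is not hard-coded in the jar, as noted in the code comment above):
spark-submit --class test --master yarn --deploy-mode cluster --driver-memory 512m --executor-memory 512m --executor-cores 1 file:///export/spark_jar/test.jar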
Errors encountered:
1. Out of memory:
diagnostics: Application application_1555606869295_0001 failed 2 times due to AM Container for appattempt_1555606869295_0001_000002 exited with exitCode: -103
Failing this attempt.Diagnostics: [2019-04-19 01:02:32.548]Container [pid=18294,containerID=container_1555606869295_0001_02_000001] is running 125774336B beyond the 'VIRTUAL' memory limit. Current usage: 75.3 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1555606869295_0001_02_000001 :
The container needed 2.2 GB of virtual memory but the limit was only 2.1 GB, so YARN killed the container.
My spark.executor.memory was set to 1 GB, i.e. 1 GB of physical memory. YARN's default virtual-to-physical memory ratio is 2.1, which gives a virtual memory limit of 2.1 GB, less than the 2.2 GB needed. The fix is to increase the virtual-to-physical memory ratio: with a ratio of 2.5, for example, 1 GB × 2.5 = 2.5 GB of virtual memory, which covers the required 2.2 GB.
Solution:
Add the following setting to yarn-site.xml (and restart YARN for it to take effect):
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.5</value>
</property>
Alternatively, modify Hadoop's yarn-site.xml to disable the memory checks entirely:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
2. Warning: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Cause:
When a job is submitted to the YARN cluster, Spark has to upload its runtime jars (the ones under SPARK_HOME) every time. This is only a warning and can be ignored.
Solution:
Upload the Spark jars to HDFS once and point the configuration at that path.
(1) Upload:
[root@hadoop00 ~]# hadoop fs -mkdir -p /spark/jars
[root@hadoop00 ~]# hadoop fs -ls /
[root@hadoop00 ~]# hadoop fs -put /export/servers/spark-2.3.1-bin-hadoop2.7/jars/* /spark/jars/
(2) In Spark's conf/spark-defaults.conf, add: spark.yarn.jars hdfs://192.168.12.129:9000//spark/jars/*
(3) Run again and the warning is gone:
[root@hadoop00 ~]# spark-shell --master yarn --deploy-mode client