Spark Usage
1 Installation
Unpack the Spark distribution.
Edit /etc/profile to set SPARK_HOME and add Spark's bin directory to PATH.
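The /etc/profile entries might look like the following; the install path here is an assumption, taken from the spark-submit example later in these notes.

```shell
# Assumed install location (matches the examples jar path used below)
export SPARK_HOME=/opt/spark-2.3.1-bin-hadoop2.7
# Put the Spark launcher scripts on the PATH
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
```

Run `source /etc/profile` afterwards so the current shell picks up the change.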
Copy the Hadoop client configuration files core-site.xml, hdfs-site.xml, and yarn-site.xml from the Hadoop cluster.
Edit spark-env.sh and export HDFS_CONF_DIR pointing at the local directory that holds those three files.
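A sketch of the spark-env.sh addition; the directory is an assumption (any local directory containing the three copied files works).

```shell
# Directory holding the copied core-site.xml, hdfs-site.xml, yarn-site.xml (assumed path)
export HDFS_CONF_DIR=/opt/spark-2.3.1-bin-hadoop2.7/conf/hadoop-conf
```

Note that the spark-env.sh template also documents HADOOP_CONF_DIR for pointing Spark at Hadoop configuration files; setting it to the same directory is a common choice.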
Copy in the jersey jar and rename it.
[YARN] 2.2 GB of 2.1 GB virtual memory used. Killing container.
When a Spark job runs on the YARN cluster, the container may be killed with an error like: Current usage: 105.9 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Fix: in etc/hadoop/yarn-site.xml, disable the virtual-memory check by setting the property below to false:
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
2 Testing
Submit a job to the cluster:
spark-submit --master yarn --class org.apache.spark.examples.SparkPi /opt/spark-2.3.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.1.jar
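For real jobs you will usually also pass resource flags. A sketch (the memory and executor sizes are placeholder values, but the flags themselves are standard spark-submit options; the trailing 100 is SparkPi's slices argument). This requires a running YARN cluster.

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  --executor-cores 1 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark-2.3.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.1.jar 100
```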
Start spark-shell and run a word count:
// Read the input file from HDFS
val file = sc.textFile("hdfs://192.168.xxx.xxx:8020/user/test/aa")
// Split each line into words, then count occurrences per word
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Materialize the result on the driver
count.collect()
// Write the (word, count) pairs back to HDFS
count.saveAsTextFile("hdfs://master/user/test/wordcounttest_res")
Check the result: hdfs dfs -cat /user/test/wordcounttest_res/part-00000