Spark Getting Started 1: Running WordCount
I. Getting Started with Spark
1. Running in IDEA
- First create a Maven project and add the following dependency and plugin to the pom file:
```xml
<dependencies>
    <!-- Spark dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
</dependencies>

<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.2</version>
    <executions>
        <execution>
            <!-- Bind to Maven's compile phase -->
            <goals>
                <goal>compile</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```
- Create an object and write a main method.
- Create a SparkConf:
```scala
// SparkConf source
/**
 * Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
 *
 * Most of the time, you would create a SparkConf object with `new SparkConf()`, which will load
 * values from any `spark.*` Java system properties set in your application as well. In this case,
 * parameters you set directly on the `SparkConf` object take priority over system properties.
 *
 * For unit tests, you can also call `new SparkConf(false)` to skip loading external settings and
 * get the same configuration no matter what the system properties are.
 *
 * All setter methods in this class support chaining. For example, you can write
 * `new SparkConf().setMaster("local").setAppName("My app")`.
 *
 * @param loadDefaults whether to also load values from Java system properties
 *
 * @note Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified
 * by the user. Spark does not support modifying the configuration at runtime.
 */
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Serializable {...}
```
- Create a SparkContext:
```scala
// SparkContext source
/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * @note Only one `SparkContext` should be active per JVM. You must `stop()` the
 * active `SparkContext` before creating a new one.
 * @param config a Spark Config object describing the application configuration. Any settings in
 * this config overrides the default configs as well as system properties.
 */
class SparkContext(config: SparkConf) extends Logging {...}
```
- Call SparkContext's textFile method, passing in the path of the file to process.
- This returns an RDD object, which can be thought of as a List; you can then apply map, reduce, and other operations to it:

```scala
sc.textFile("path/to/my/file").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
```
- Finally, remember to close the SparkContext by calling its stop method.
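Putting the steps above together, here is a minimal sketch of the whole program, assuming a local run (the object name, app name, and input path are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleWordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark locally, using as many threads as there are cores
    val conf = new SparkConf().setMaster("local[*]").setAppName("SimpleWordCount")
    val sc = new SparkContext(conf)

    sc.textFile("path/to/my/file")   // read the text file as an RDD of lines
      .flatMap(_.split(" "))         // split each line into words
      .map((_, 1))                   // pair each word with a count of 1
      .reduceByKey(_ + _)            // sum the counts per word
      .collect()                     // bring the results back to the driver
      .foreach(println)

    sc.stop()                        // close the SparkContext when done
  }
}
```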
1.1 Notes
- Only one SparkContext can be active per JVM at a time.
- SparkContext is the main entry point to Spark; its primary constructor requires a SparkConf.
- SparkConf parameters can be configured as key-value pairs through its set methods (see the sketch after this list).
- A SparkContext can be used to obtain RDD objects.
- The textFile method (on SparkContext, not on the RDD) reads text data into an RDD that can then be operated on directly.
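The SparkConf scaladoc quoted above mentions that all setters chain; here is a quick sketch (the executor-memory setting is just an example parameter, not something the WordCount needs):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[*]")                // where to run
  .setAppName("My app")                 // name shown in the web UI
  .set("spark.executor.memory", "1g")   // any Spark parameter, as a key-value pair
```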
2. Running on Linux
2.1 Running with spark-shell
- Upload the tar file and extract it.
- As I understand it, Spark is a compute engine, a tool, so once extracted it can be run directly.
- Go into the bin directory and run the spark-shell command to open an interactive shell:

```shell
./spark-shell
```
- You can see it has already created a SparkContext:

```
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop102:4040
Spark context available as 'sc' (master = local[*], app id = local-1606478720737).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
- Then just run the Scala code directly; note that the file to process must be uploaded first:

```scala
sc.textFile("path/to/my/file").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
```
2.2 Running with spark-submit
- First, package the code we wrote in IDEA into a jar.
- In the code, remove the setMaster call on the SparkConf.
- The textFile argument can be changed to args(0), so the input path can be passed in dynamically later (see the sketch at the end of this section).
- Upload the jar.
- Run the command:

```shell
./spark-submit --class com.yire.SimpleWordCount swcNoMaster.jar /home/yire/input
```
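For reference, a submit-ready variant of the WordCount is sketched below: setMaster is gone (the master now comes from spark-submit's --master flag, defaulting to local), and the input path is read from args(0):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleWordCount {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master is supplied on the spark-submit command line
    val conf = new SparkConf().setAppName("SimpleWordCount")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))             // input path passed as the first program argument
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```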
3. Running on Spark's Own (Standalone) Cluster
- Suppose we now have a three-node cluster.
- Go into Spark's conf directory.
- Rename the spark-env.sh.template file to spark-env.sh, then edit it to specify the location of our master node:

```shell
mv spark-env.sh.template spark-env.sh
```

```shell
SPARK_MASTER_HOST=node1
SPARK_MASTER_PORT=7077   # optional; 7077 is already the default
```
- Distribute the configuration above to the other two nodes.
- Go to Spark's root directory and start the cluster:

```shell
sbin/start-all.sh   # start the cluster
sbin/stop-all.sh    # this one stops the cluster
```
- Now, when running spark-shell or spark-submit:
- spark-shell: you must add --master spark://node1:7077, otherwise it still runs in local mode:

```shell
./spark-shell --master spark://hadoop102:7077
```

- spark-submit: same as above:

```shell
bin/spark-submit --master spark://hadoop102:7077 --class com.yire.SimpleWordCount bin/swcNoMaster.jar /home/yire/input
```
- That completes the cluster test. Now try adding --deploy-mode cluster:

```shell
bin/spark-submit --master spark://hadoop102:7077 --deploy-mode cluster --class com.yire.SimpleWordCount bin/swcNoMaster.jar /home/yire/input
```
3.1 Configuring the Cluster's History Server
- Rename the spark-defaults.conf.template file to spark-defaults.conf:

```shell
mv spark-defaults.conf.template spark-defaults.conf
```

- Edit the spark-defaults.conf file:

```properties
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://hadoop102:9820/spark-logs
```
- Edit the spark-env.sh file to add the log configuration:

```shell
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080
-Dspark.history.fs.logDirectory=hdfs://hadoop102:9820/spark-logs
-Dspark.history.retainedApplications=30"
```
- Distribute the configuration to the other nodes.
- Restart the cluster, then start the history server:

```shell
sbin/start-all.sh
sbin/start-history-server.sh
```
3.2 Notes
- You can run ./spark-shell --help to see the available options:

```
Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
```
- --master specifies where to run: local is local mode, yarn is Hadoop's YARN cluster, and spark://host:port is a Spark standalone cluster.
- Of the arguments after --class, the first is the class containing the main entry point, the second is the jar, and everything from the third onward is passed to the program.
- --deploy-mode can be cluster or client; it must be placed after --master, otherwise (tested first-hand) it has no effect:
  - cluster: run on the cluster.
  - client: run locally (the default).
- When starting Spark's own cluster, run the start command on the node that the config file designates as master.
- Configuring the history server requires the Hadoop cluster to be running, and the spark-logs directory must already exist on HDFS.
4. Running on YARN (Important)
- Edit the Hadoop config file …/etc/hadoop/yarn-site.xml and distribute it:

```xml
<!-- Whether to launch a thread that checks each task's physical memory usage and kills
     the task directly if it exceeds its allocation; default is true -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<!-- Whether to launch a thread that checks each task's virtual memory usage and kills
     the task directly if it exceeds its allocation; default is true -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
```
- Edit conf/spark-env.sh and add the YARN_CONF_DIR setting:

```shell
mv spark-env.sh.template spark-env.sh
```

```shell
#export JAVA_HOME=/opt/module/jdk1.8.0_144
YARN_CONF_DIR=/opt/module/hadoop/etc/hadoop
```
- Start the HDFS and YARN clusters.
- Submit and run the job:

```shell
bin/spark-submit --master yarn --deploy-mode cluster --class com.yire.SimpleWordCount bin/swcNoMaster.jar /input
```
- After running, the job shows up in YARN's web UI, but clicking through to view its history turns up a missing page; at this point the Spark and YARN history servers need to be wired together.
4.1 Wiring Together the Spark and YARN History Servers
- Edit spark-defaults.conf:

```properties
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://hadoop102:9820/spark-logs
# Compared with the standalone cluster's history server, the following settings are new
spark.yarn.historyServer.address=hadoop102:18080
spark.history.ui.port=18080
```
- Edit the spark-env.sh file to add the log configuration:

```shell
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080
-Dspark.history.fs.logDirectory=hdfs://hadoop102:9820/spark-logs
-Dspark.history.retainedApplications=30"
```

- Start the history server:

```shell
sbin/start-history-server.sh
```
II. Appendix
1. Common Commands from This Article (for easy pasting)
1.1 spark-submit
- Local submit:

```shell
bin/spark-submit --class com.yire.SimpleWordCount bin/swcNoMaster.jar /home/yire/input
```

- Submit to the standalone cluster (client mode):

```shell
bin/spark-submit --master spark://hadoop102:7077 --class com.yire.SimpleWordCount bin/swcNoMaster.jar /home/yire/input
```

- Submit to the standalone cluster (cluster mode):

```shell
bin/spark-submit --master spark://hadoop102:7077 --deploy-mode cluster --class com.yire.SimpleWordCount bin/swcNoMaster.jar /home/yire/input
```

- Submit to YARN:

```shell
bin/spark-submit --master yarn --deploy-mode cluster --class com.yire.SimpleWordCount bin/swcNoMaster.jar /input
```
2. Ports
- History server: 18080
- Standalone cluster master port: 7077
- Standalone Master resource-monitoring web UI: port 8080 on the master node
- Status of jobs running in the current spark-shell (compute UI): 4040