Before installing Spark: setting up IDEA
The development environment spans two sides: Windows and CentOS 6.5.
Part 1: Windows environment setup
1: Install Java JDK 1.8
2: Environment settings
2.1: Environment variables
3: Install Scala 2.11.12 (note: do not install the newest version; match it to your OS and IDEA version, or you will run into version conflicts)
3.1 Download and install 2.11.12 (mirrors are easy to find with a quick search)
3.2 System variable setup
4: Install Maven
Note: if you want to run locally, install on Windows the same Hadoop and Spark versions that you configured on the Linux side (adjust for your own versions and install locations).
4.1 Install Maven 3.6.1 (do not use a version that is too old, or several later installs will fail)
4.2 Configure the Maven environment variables (PATH)
5: If you will run the software on Windows
Just install it.
Part 2: Configuring IDEA
1: Download and install IDEA
I installed version 2019.3 here, a fairly recent release.
1.2 Configure the JDK for IDEA
1.3 Configure Scala
Part 3: Developing the first WordCount program
1. Create a new Maven project
2. Enter the project information and click Next:
3. Be sure to select your own Maven repository
Point it at the Maven installation you set up earlier.
Choose the settings file carefully; otherwise dependency downloads will be painfully, endlessly slow.
4. The key step is configuring pom.xml
Here is the complete file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.wisdom.spark.demo</groupId>
  <artifactId>sparkwordcount</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <scala.version>2.11.12</scala.version>
    <spark.version>2.3.4</spark.version>
    <jackson.version>2.6.0</jackson.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.7.25</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.codahale.metrics</groupId>
      <artifactId>metrics-core</artifactId>
      <version>3.0.1</version>
    </dependency>
    <dependency>
      <groupId>com.codahale.metrics</groupId>
      <artifactId>metrics-json</artifactId>
      <version>3.0.1</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-core</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-annotations</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <!-- The artifact suffix must match scala.version (2.11) -->
    <dependency>
      <groupId>com.fasterxml.jackson.module</groupId>
      <artifactId>jackson-module-scala_2.11</artifactId>
      <version>${jackson.version}</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <!-- Scala 2.11 no longer supports the jvm-1.5 target; match the JDK -->
            <arg>-target:jvm-1.8</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
Wait for Maven to finish downloading the dependencies. If your connection is slow, change the mirror in settings.xml, for example to the Aliyun mirror; downloads will go much better!
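For reference, a mirror entry along these lines goes inside the <mirrors> section of settings.xml (an illustrative snippet using Aliyun's public repository):

```xml
<!-- settings.xml, inside <mirrors>: route requests for central through the Aliyun mirror -->
<mirror>
  <id>aliyun-public</id>
  <mirrorOf>central</mirrorOf>
  <name>Aliyun public mirror</name>
  <url>https://maven.aliyun.com/repository/public</url>
</mirror>
```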
Build succeeded:
6. Write the code:
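The code in this step appears in the original post only as a screenshot. Below is a minimal sketch of what it likely contained: the package and class name com.spark.demo.WordCount are taken from the spark-submit command used later in this guide, and the input and output paths come from the command-line arguments.

```scala
package com.spark.demo

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // args(0) = input path, args(1) = output path, as in the spark-submit call later
    val conf = new SparkConf().setAppName("WordCount")
    // For a local run inside IDEA you could additionally call conf.setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word
      .saveAsTextFile(args(1))    // write the results as text part files
    sc.stop()
  }
}
```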
7. Package succeeded:
Check the packaged file locally:
Installing Spark on the cluster
1: Setting up Spark on a pseudo-distributed cluster
1 Download: https://www.apache.org/
2 Upload to the Linux machine and extract
[root@hadoop ~]# cd /home/
[root@hadoop home]# cd tools/
[root@hadoop tools]# rz
rz waiting to receive.
Starting zmodem transfer.  Press Ctrl+C to cancel.
Transferring spark-2.3.1-bin-hadoop2.7.tgz...
  100%  220589 KB  2535 KB/sec  00:01:27  0 Errors
[root@hadoop tools]# tar -zxf spark-2.3.1-bin-hadoop2.7.tgz -C ../softwares/
[root@hadoop tools]#
3 Configure environment variables
export SPARK_HOME=/home/softwares/spark-2.3.1-bin-hadoop2.7
export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin:$FINDBUGS_HOME/bin:$SCALA_HOME/bin:$SPA
[root@hadoop spark-2.3.1-bin-hadoop2.7]# source /etc/profile
4: Start spark-shell
[root@hadoop spark-2.3.1-bin-hadoop2.7]# spark-shell
2018-09-11 19:41:41 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop:4040
Spark context available as 'sc' (master = local[*], app id = local-1536720131769).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
5: Reduce the information spark-shell prints
Open and edit the log4j configuration under SPARK_HOME/conf (copy log4j.properties.template to log4j.properties and raise the root log level).
Relaunch spark-shell:
[root@hadoop spark-2.3.1-bin-hadoop2.7]# spark-shell
18/09/11 19:53:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://hadoop:4040
Spark context available as 'sc' (master = local[*], app id = local-1536720844018).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 5+5
res0: Int = 10

scala>
6: Reading a local file with Spark
scala> val tetxFile=sc.textFile("file:/home/softwares/spark-2.3.1-bin-hadoop2.7/README.md")
scala> tetxFile.count
res2: Long = 103
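With the README loaded, the classic interactive word count is a one-liner in the same session (a sketch; the exact output depends on the file contents, so none is shown here):

```scala
scala> tetxFile.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _).sortBy(_._2, false).take(3)
```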
7: Reading an HDFS file
1: Start HDFS
[root@hadoop hadoop-2.9.1]# sbin/start-dfs.sh
Starting namenodes on [hadoop]
hadoop: starting namenode, logging to /home/softwares/hadoop-2.9.1/logs/hadoop-root-namenode-hadoop.out
localhost: starting datanode, logging to /home/softwares/hadoop-2.9.1/logs/hadoop-root-datanode-hadoop.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/softwares/hadoop-2.9.1/logs/hadoop-root-secondarynamenode-hadoop.out
[root@hadoop hadoop-2.9.1]# jps
60867 SecondaryNameNode
61061 Jps
60346 SparkSubmit
60730 DataNode
60637 NameNode
[root@hadoop hadoop-2.9.1]#
2: Read the file
scala> val textFile2=sc.textFile("hdfs://hadoop:8020/words")
textFile2: org.apache.spark.rdd.RDD[String] = hdfs://hadoop:8020/words MapPartitionsRDD[3] at textFile at <console>:24

scala> textFile2.count
res3: Long = 3

scala>
8: Running spark-shell on Hadoop YARN
Start the Hadoop cluster:
[root@hadoop hadoop-2.9.1]# sbin/start-yarn.sh
[root@hadoop spark-2.3.1-bin-hadoop2.7]# HADOOP_CONF_DIR=/home/softwares/hadoop-2.9.1/etc/hadoop/ spark-shell --master yarn
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
2: Setting up Spark on a fully distributed cluster
(1) Upload the package to the hadoop01 server and extract it
[hadoop@hadoop01 soft]$ tar zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /home/hadoop/apps/
# If the extracted directory name feels too long, rename it
[hadoop@hadoop01 /]$ cd /opt/module
[hadoop@hadoop01 module]$ mv spark-2.2.0-bin-hadoop2.7 spark-2.2.0
(2) Modify the spark-env.sh configuration file
# Copy SPARK_HOME/conf/spark-env.sh.template to spark-env.sh
[hadoop@hadoop01 module]$ cd spark-2.2.0/conf
[hadoop@hadoop01 conf]$ mv spark-env.sh.template spark-env.sh
# Edit spark-env.sh and add the following
[hadoop@hadoop01 conf]$ vim spark-env.sh
# Configure JAVA_HOME; it often works without it, but problems can occur, so set it anyway
export JAVA_HOME=/opt/module/jdk1.8
# Spark jobs will very likely read files from HDFS, so configure this
# If your Spark jobs only read local files and you do not need YARN, you can skip it
export HADOOP_CONF_DIR=/opt/module/hadoop-2.7.4/etc/hadoop
# Set the Master hostname
export SPARK_MASTER_HOST=hadoop01 # using the IP address is recommended here
# Port for submitting applications; 7077 is the default, change it here if needed
export SPARK_MASTER_PORT=7077
# Maximum number of CPU cores each Worker may use; my VM has only 3...
# On a real server with 32 cores you could set this to 32
export SPARK_WORKER_CORES=3
# Maximum memory each Worker may use; my VM has only 2 GB
# On a real server with 128 GB you could set this to 100g
export SPARK_WORKER_MEMORY=1g
(3) Modify the slaves configuration file to list the Worker hosts
[hadoop@hadoop01 conf]$ mv slaves.template slaves
[hadoop@hadoop01 conf]$ vim slaves
# The original content is localhost; replace it with:
hadoop01
hadoop02
hadoop03
(4) Rename the start-all.sh and stop-all.sh scripts under SPARK_HOME/sbin
For example, rename them to start-spark-all.sh and stop-spark-all.sh.
Reason:
If HADOOP_HOME is also configured on the cluster, HADOOP_HOME/sbin contains its own start-all.sh and stop-all.sh, so when you run those names the shell cannot tell whether you mean the Hadoop cluster or the Spark cluster. Renaming removes the conflict. Of course, you can also leave the names unchanged and always run the scripts from inside their own sbin directories, which avoids the conflict as well; configuring SPARK_HOME is mainly for the convenience of running the other Spark commands.
[hadoop@hadoop01 conf]$ cd ../sbin
[hadoop@hadoop01 sbin]$ mv start-all.sh start-spark-all.sh
[hadoop@hadoop01 sbin]$ mv stop-all.sh stop-spark-all.sh
(5) Distribute the Spark installation to the other nodes
[hadoop@hadoop01 apps]$ scp -r spark-2.2.0 hadoop02:`pwd`
[hadoop@hadoop01 apps]$ scp -r spark-2.2.0 hadoop03:`pwd`
[hadoop@hadoop01 apps]$ scp -r spark-2.2.0 hadoop04:`pwd`
(6) Configure the SPARK_HOME environment variable on every node in the cluster
[hadoop@hadoop01 conf]$ vim ~/.bash_profile
export SPARK_HOME=/opt/module/spark-2.2.0
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
[hadoop@hadoop01 conf]$ source ~/.bash_profile
# Configure the other nodes the same way. Cluster-wide file-distribution and batch-command scripts make this much more convenient, provided you remember how to use them.
(7) Start the Spark cluster on the Spark master node
# Note: if you skipped step (4), be sure to run this command from inside SPARK_HOME/sbin
# Alternatively, on the Master node run start-master.sh and start-slaves.sh separately
[hadoop@hadoop01 conf]$ start-spark-all.sh
(8) With the fully distributed setup configured, start everything.
8.1 First start HDFS and YARN
8.2 Then start Spark
(The step of uploading the WordCount jar packaged in IDEA earlier is omitted here; please do that yourself.)
8.3 Upload the input file and submit the job:
# --class:  fully qualified name of the main class
# --master: the Spark master URL
# followed by the jar uploaded to the VM, the input file, and the output path
spark-submit \
  --class com.spark.demo.WordCount \
  --master spark://192.168.91.101:7077 \
  sparkwordcount-1.0-SNAPSHOT.jar \
  file:/opt/data/test.txt \
  hdfs://hadoop01:9000/user/hadoop/output
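After the job completes, the output can be checked with the HDFS CLI (a sketch, assuming the output path used above and that the Hadoop binaries are on the PATH):

```shell
# List the job output directory on HDFS
hdfs dfs -ls hdfs://hadoop01:9000/user/hadoop/output
# Print the word counts from the first result partition
hdfs dfs -cat hdfs://hadoop01:9000/user/hadoop/output/part-00000
```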
Results:
That's it, a happy ending. To finish up, shut everything down in order:
Stop Spark first
Then YARN
And finally HDFS