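1. Install CentOS 7, edit /etc/sysconfig/network-scripts/ifcfg-eno16777736 with vi to set a static IP (the file name follows your network interface name), then restart the network: service network restart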
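As a rough sketch of what the static-IP settings might look like, the function below writes an ifcfg file to a path you choose. Only the device name eno16777736 and the address 192.168.153.4 come from these notes; the prefix length, gateway and DNS server are assumed values for illustration and need to be replaced with those of your own network.

```shell
# write_ifcfg FILE: write a static-IP ifcfg file (sketch; review before use).
write_ifcfg() {
  cat > "$1" <<'EOF'
TYPE=Ethernet
BOOTPROTO=static
NAME=eno16777736
DEVICE=eno16777736
ONBOOT=yes
IPADDR=192.168.153.4
# The three values below are assumptions, not taken from the original setup.
PREFIX=24
GATEWAY=192.168.153.2
DNS1=192.168.153.2
EOF
}
```

Writing to a scratch path first, for example write_ifcfg /tmp/ifcfg-eno16777736, lets you review the result before copying it into /etc/sysconfig/network-scripts/ as root.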
2. Change the hostname
vi /etc/hosts, delete the original contents and add:
192.168.153.4 hadoop000
192.168.153.4 localhost
vi /etc/hostname, delete the original contents and add:
hadoop000
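One way to double-check the mapping after editing is to look the name up in the hosts file directly. The helper below is a small sketch; lookup_host is a name made up for this example, not a system command.

```shell
# lookup_host FILE NAME: print the first address that FILE (hosts format)
# maps NAME to; exits non-zero if NAME is not listed.
lookup_host() {
  awk -v name="$2" '
    /^[[:space:]]*#/ { next }          # skip comment lines
    {
      for (i = 2; i <= NF; i++)
        if ($i == name) { print $1; found = 1; exit }
    }
    END { exit !found }
  ' "$1"
}
```

For example, lookup_host /etc/hosts hadoop000 should print the address entered above. The standard getent hosts hadoop000 command performs a similar check through the system resolver.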
3. Create the hadoop user, then create the software, app, source, lib and data directories under its home directory
useradd hadoop
passwd hadoop
Run as root:
visudo (recommended) or vi /etc/sudoers
Below the line root ALL=(ALL) ALL in that file, add:
hadoop ALL=(ALL) NOPASSWD:ALL
This adds the hadoop account to sudoers, so it can run any command through sudo without entering a password.
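The step above names the five directories but gives no command for them. A minimal sketch, assuming they should sit directly under the hadoop user's home directory (make_layout is a name chosen for this example):

```shell
# make_layout BASE: create the five working directories under BASE.
make_layout() {
  for d in software app source lib data; do
    mkdir -p "$1/$d" || return 1
  done
}
```

Run it as the hadoop user, for example make_layout "$HOME". Because mkdir -p is used, running it again is harmless.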
4. Basic environment setup
Java: jdk1.8.0_40
Hadoop: hadoop-2.6.0-cdh5.7.0 (download: http://archive.cloudera.com/cdh5/cdh/5/)
Maven: apache-maven-3.5.4 (download: https://archive.apache.org/dist/maven/maven-3/)
Scala: scala-2.11.8
vi ~/.bash_profile and add the following:
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_40
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export MAVEN_HOME=/home/hadoop/app/apache-maven-3.5.4
export PATH=$MAVEN_HOME/bin:$PATH
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
export SCALA_HOME=/home/hadoop/app/scala-2.11.8
export PATH=$SCALA_HOME/bin:$PATH
export SPARK_HOME=/home/hadoop/app/spark-2.4.4-bin-2.6.0-cdh5.7.0 #spark-2.4.2-bin-2.6.0-cdh5.7.0 #spark-2.1.1-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
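To keep the build from running out of memory, the profile above also sets MAVEN_OPTS (a machine with at least 4 GB of RAM is recommended; my VM has 8 GB of RAM and four 3.4 GHz CPU cores):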
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
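Before starting a long build it can help to confirm that the profile took effect. The sketch below reports where each tool resolves on PATH; check_tools is a helper name invented for this example.

```shell
# check_tools NAME...: report where each command resolves on PATH.
# Returns 0 only if every command was found.
check_tools() {
  missing=0
  for tool in "$@"; do
    if path=$(command -v "$tool"); then
      echo "ok       $tool -> $path"
    else
      echo "MISSING  $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

After source ~/.bash_profile, running check_tools java mvn scala hadoop should show each tool resolving to a path under /home/hadoop/app if the exports above are in place.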
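5. Download the Spark source, set the Scala version, edit pom.xml, and build
Source packages for every past release can be downloaded here: https://archive.apache.org/dist/spark/
I downloaded: spark-2.4.4.tgz
Unpack: tar -zxvf spark-2.4.4.tgz
Change directory: cd spark-2.4.4
Set the Scala version: ./dev/change-scala-version.sh 2.11
Open the pom.xml file in the spark-2.4.4 directory.
First find this block: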
<repository>
<id>central</id>
<!-- This should be at top, it makes maven try the central repo first and then others and hence faster dep resolution -->
<name>Maven Repository</name>
<url>https://repo.maven.apache.org/maven2</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
Directly below that block, add the following repository and save:
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
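If you would rather script this edit than make it by hand, something along these lines may work. It is only a sketch: add_cloudera_repo is a made-up helper, it assumes the central block looks like the one shown above (with the id and the closing tag on separate lines), and the indentation it writes is a guess. Keep a backup of pom.xml before trying it.

```shell
# add_cloudera_repo POM: insert the cloudera repository right after the
# central <repository> block. Does nothing if it is already present.
add_cloudera_repo() {
  grep -q '<id>cloudera</id>' "$1" && return 0
  awk '
    !inserted && /<id>central<\/id>/ { in_central = 1 }
    { print }
    in_central && /<\/repository>/ {
      print "    <repository>"
      print "      <id>cloudera</id>"
      print "      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>"
      print "    </repository>"
      in_central = 0
      inserted = 1
    }
  ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

Afterwards, grep -n cloudera pom.xml shows where the block landed, so you can confirm it sits inside the repositories section.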
Note that you do not need to change <useZincServer>true</useZincServer> to false,
nor do you need to add the following dependency:
<dependency>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
</dependency>
Build with the make-distribution script:
./dev/make-distribution.sh \
--name 2.6.0-cdh5.7.0 --tgz \
-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver \
-Dhadoop.version=2.6.0-cdh5.7.0
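What the options above mean:
--name, --tgz: --name sets the suffix of the package name and --tgz packs the result as a .tgz archive. For example, when building spark-2.4.4, a successful build produces spark-2.4.4-bin-2.6.0-cdh5.7.0.tgz in the spark-2.4.4 directory.
-Pyarn: enables YARN support.
-Phadoop-2.6: selects the build profile for the Hadoop 2.6 line.
-Dhadoop.version: sets the exact Hadoop version to build against.
-Phive -Phive-thriftserver: enables Hive and JDBC (Thrift server) support.
You can also add:
-Pscala-2.12: builds against Scala 2.12 instead (run ./dev/change-scala-version.sh 2.12 first; Scala 2.10 support was removed in Spark 2.3).
-DskipTests: skips the tests.
clean package: clean and package are the Maven build goals. clean removes leftovers from earlier builds, and package compiles and packages the code. make-distribution.sh already passes these goals and -DskipTests to Maven, so you only type them yourself when calling mvn directly.
You can add -X for verbose output. If a dependency cannot be resolved, search https://mvnrepository.com/ for the artifact and add the dependency to pom.xml.
With mvn you can also resume a failed build from the module that failed: -rf :xxx
After a successful build, the spark-2.4.4 directory contains the file: spark-2.4.4-bin-2.6.0-cdh5.7.0.tgz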
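The package name follows the pattern spark-VERSION-bin-NAME.tgz, where NAME is the value given to --name. A small sketch that builds the expected name and checks that the file exists (dist_tarball and check_dist are helper names invented here):

```shell
# dist_tarball VERSION NAME: file name the build is expected to produce.
dist_tarball() {
  echo "spark-$1-bin-$2.tgz"
}

# check_dist DIR VERSION NAME: succeed if the tarball exists in DIR and is non-empty.
check_dist() {
  f="$1/$(dist_tarball "$2" "$3")"
  if [ -s "$f" ]; then
    echo "found $f"
  else
    echo "not found: $f" >&2
    return 1
  fi
}
```

From inside the source directory, check_dist . 2.4.4 2.6.0-cdh5.7.0 confirms that the build left the archive where the next step expects it.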
6. Unpack and edit the configuration files
tar -zxvf spark-2.4.4-bin-2.6.0-cdh5.7.0.tgz -C /home/hadoop/app
cd /home/hadoop/app/spark-2.4.4-bin-2.6.0-cdh5.7.0/conf
(1) Configure spark-env.sh
cp spark-env.sh.template spark-env.sh
vi spark-env.sh and append:
export SCALA_HOME=/home/hadoop/app/scala-2.11.8
export SPARK_MASTER_HOST=hadoop000
export MASTER=spark://hadoop000:7077
(2) Configure slaves
cp slaves.template slaves
vi slaves and change the contents to: hadoop000
(3) Configure spark-config.sh
cd /home/hadoop/app/spark-2.4.4-bin-2.6.0-cdh5.7.0/sbin
vi spark-config.sh and append at the end:
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_40
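If you repeat this setup often, the appends to the configuration files can be scripted so that running them twice does not duplicate lines. A minimal sketch, where append_once is a helper name made up for this example:

```shell
# append_once FILE LINE: append LINE to FILE unless an identical line exists.
append_once() {
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example calls, using the paths from these notes:
#   append_once conf/spark-env.sh    'export SCALA_HOME=/home/hadoop/app/scala-2.11.8'
#   append_once sbin/spark-config.sh 'export JAVA_HOME=/home/hadoop/app/jdk1.8.0_40'
```

The match is on the whole line, so a line that sets the same variable to a different value is left alone and the new line is still appended; check the file by hand in that case.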
7. Start Spark
cd /home/hadoop/app/spark-2.4.4-bin-2.6.0-cdh5.7.0/sbin
./start-all.sh
Verify that the daemons started:
[hadoop@hadoop000 ~]$ jps
81315 Worker
87783 Jps
81162 Master
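The same check can be automated by scanning the jps output for both process names. A small sketch (has_daemons is a name chosen for this example):

```shell
# has_daemons: read jps output on stdin and succeed only if both
# a Master and a Worker process are listed.
has_daemons() {
  awk '
    $2 == "Master" { master = 1 }
    $2 == "Worker" { worker = 1 }
    END { exit !(master && worker) }
  '
}
```

A typical use is jps | has_daemons && echo "Master and Worker are running".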
8. Run the examples
cd /home/hadoop/app/spark-2.4.4-bin-2.6.0-cdh5.7.0/bin
[hadoop@hadoop000 bin]$ run-example org.apache.spark.examples.SparkPi 2>&1 | grep "roughly"
Pi is roughly 3.1469357346786735
Or start spark-shell and enter commands interactively (first create a word.txt file under /home/hadoop and put some words in it):
[hadoop@hadoop000 bin]$ cat /home/hadoop/word.txt
hadoop mapreduce hbase
flink spark sparksql
sql
hdfs impala
sparkstreaming spark hadoop
sqlserver oracle mysql
[hadoop@hadoop000 bin]$ spark-shell
19/11/20 18:30:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop000:4040
Spark context available as 'sc' (master = spark://hadoop000:7077, app id = app-20191120183048-0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val inputFile = "file:///home/hadoop/word.txt"
inputFile: String = file:///home/hadoop/word.txt
scala> val textFile = sc.textFile(inputFile)
textFile: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/word.txt MapPartitionsRDD[1] at textFile at <console>:26
scala> val wordcount = textFile.flatMap(line => line.split(" ")).map(word=>(word,1)).reduceByKey(_+_)
wordcount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:28
scala> wordcount.collect().foreach(println)
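Spark prints each result as a (word,count) pair, in no guaranteed order. To cross-check those counts without Spark, the same tally can be computed with standard shell tools. This is only a sketch for comparison; the wordcount function here is a shell helper written for this example and is unrelated to the Scala value of the same name.

```shell
# wordcount FILE: print "word count" pairs, one per line, sorted by word.
wordcount() {
  tr -s ' ' '\n' < "$1" | grep -v '^$' | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running wordcount /home/hadoop/word.txt on the file shown above gives a count of 2 for hadoop and for spark, and 1 for every other word, which is what the Spark job should report as well.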