Installing and Configuring Spark 1.6.0 on Hadoop 2.6.3
1. Configure Hadoop
(1) Download Hadoop
mkdir /usr/local/bigdata/hadoop
cd /usr/local/bigdata/hadoop
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.6.3/hadoop-2.6.3.tar.gz
tar zxvf hadoop-2.6.3.tar.gz
(2) Configure the Hadoop environment variables
export HADOOP_HOME=/usr/local/bigdata/hadoop/hadoop-2.6.3
export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:$PATH
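These exports need to go into a shell profile (for example /etc/profile or ~/.bashrc; exactly where is up to your setup) and be re-sourced before they take effect. Assuming /etc/profile, a quick sanity check that Hadoop resolves:
source /etc/profile
hadoop version
hadoop version should report 2.6.3. This also assumes JAVA_HOME was already exported when the JDK was installed, since the PATH line references it.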
2. Install and Configure Scala
mkdir /usr/local/bigdata/scala
cd /usr/local/bigdata/scala
wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
tar zxvf scala-2.10.4.tgz
export SCALA_HOME=/usr/local/bigdata/scala/scala-2.10.4
export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${SCALA_HOME}/bin:$PATH
Test: start the scala REPL, type 12*12, and press Enter.
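A sample session (the welcome banner and the res counter may look slightly different):
$ scala
scala> 12*12
res0: Int = 144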
3. Install and Configure Spark 1.6.0
Download the Spark build that matches your Hadoop version from http://spark.apache.org/downloads.html.
mkdir /usr/local/bigdata/spark
cd /usr/local/bigdata/spark
wget http://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
tar zxvf spark-1.6.0-bin-hadoop2.6.tgz
export SPARK_HOME=/usr/local/bigdata/spark/spark-1.6.0-bin-hadoop2.6
export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${SCALA_HOME}/bin:${SPARK_HOME}/bin:$PATH
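After re-sourcing the profile, a quick check that the Spark launch scripts are on the PATH:
spark-submit --version
This should print the Spark 1.6.0 version banner if SPARK_HOME and PATH are set correctly.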
cd /usr/local/bigdata/spark/spark-1.6.0-bin-hadoop2.6/conf
cp spark-env.sh.template spark-env.sh
Append the following to spark-env.sh:
export JAVA_HOME=/usr/java/jdk1.8.0_71
export SCALA_HOME=/usr/local/bigdata/scala/scala-2.10.4
export SPARK_MASTER_IP=XTYFB-CSJ06
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/usr/local/bigdata/hadoop/hadoop-2.6.3/etc/hadoop
cp slaves.template slaves
Edit slaves so it lists the worker host; here that is XTYFB-CSJ06 (or 127.0.1.1 for a single-machine setup).
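For a real multi-node cluster, slaves simply lists one worker hostname per line; a hypothetical example (these hostnames are illustrative, not from this setup):
worker-node-01
worker-node-02
worker-node-03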
4. Start Spark and Check the Cluster Status
cd /usr/local/bigdata/spark/spark-1.6.0-bin-hadoop2.6/sbin
Start the cluster:
./start-all.sh
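To shut the Master and Worker down later, run the companion script from the same sbin directory:
./stop-all.sh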
Check the processes with jps: a new Master and a new Worker process should appear.
Use jps -mlv to see the details of each process.
The Master web UI is available at http://172.16.80.226:8080/ and the Worker web UI at http://172.16.80.226:8081/.
Switch to the bin directory:
cd /usr/local/bigdata/spark/spark-1.6.0-bin-hadoop2.6/bin
Start the shell:
spark-shell
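spark-shell starts a Scala REPL with a SparkContext already created and bound to the variable sc. A quick way to confirm the shell is wired up (output format approximate):
scala> sc.version
res0: String = 1.6.0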
mkdir /usr/local/bigdata/spark/testData
vim /usr/local/bigdata/spark/testData/wcDemo1.txt
spark	hive	spark	hive	hive	redis	hdds	redis
(separate the words with tab characters, since the commands below split on \t)
Run the following Scala command in the shell to compute the word counts:
val rdd=sc.textFile("/usr/local/bigdata/spark/testData/wcDemo1.txt").flatMap(_.split("\t")).map(x=>(x,1)).reduceByKey(_+_).collect
The counts are printed as:
rdd: Array[(String, Int)] = Array((hive,3), (spark,2), (hdds,1), (redis,2))
To sort the result by key (the word) in ascending order:
val rdd=sc.textFile("/usr/local/bigdata/spark/testData/wcDemo1.txt").flatMap(_.split("\t")).map(x=>(x,1)).reduceByKey(_+_).sortByKey().collect
Result:
rdd: Array[(String, Int)] = Array((hdds,1), (hive,3), (redis,2), (spark,2))
To sort by key in descending order:
val rdd=sc.textFile("/usr/local/bigdata/spark/testData/wcDemo1.txt").flatMap(_.split("\t")).map(x=>(x,1)).reduceByKey(_+_).sortByKey(false).collect
Result:
rdd: Array[(String, Int)] = Array((spark,2), (redis,2), (hive,3), (hdds,1))
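Note that sortByKey orders by the word itself, not by the count. To rank words by frequency instead, RDD.sortBy (available in Spark 1.6) can sort on the value; a sketch:
val rdd=sc.textFile("/usr/local/bigdata/spark/testData/wcDemo1.txt").flatMap(_.split("\t")).map(x=>(x,1)).reduceByKey(_+_).sortBy(_._2, false).collect
This should yield something like Array((hive,3), (spark,2), (redis,2), (hdds,1)); the relative order of ties such as spark and redis is not guaranteed.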
To count the number of result rows:
val rdd=sc.textFile("/usr/local/bigdata/spark/testData/wcDemo1.txt").flatMap(_.split("\t")).map(x=>(x,1)).reduceByKey(_+_).sortByKey(false).count
Result:
rdd: Long = 4
To save the result to disk:
val rdd=sc.textFile("/usr/local/bigdata/spark/testData/wcDemo1.txt").flatMap(_.split("\t")).map(x=>(x,1)).reduceByKey(_+_).sortByKey(false).saveAsTextFile("/usr/local/bigdata/spark/testData/wcDemo_out")
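saveAsTextFile writes a directory rather than a single file: wcDemo_out will contain one part-NNNNN file per partition plus a _SUCCESS marker. To inspect the output (the part file name depends on the partition count):
cat /usr/local/bigdata/spark/testData/wcDemo_out/part-00000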
For word count (WC), each line of the input must be parsed into words, identical words must be grouped into the same bucket, and finally the frequency of each word in each bucket is computed.
flatMap turns one record into many (a one-to-many mapping); map turns one record into another (one-to-one); reduceByKey puts records with the same key into one bucket and aggregates them per key.
In this chain of RDD operations, every step up to the last is a transformation; collect, saveAsTextFile, and count are actions, which are what actually trigger the computation.
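As a recap, here is the same word count written out step by step with each stage named, suitable for pasting into spark-shell (comments mark which steps are transformations and which is the action):
val lines  = sc.textFile("/usr/local/bigdata/spark/testData/wcDemo1.txt")  // RDD[String]: one element per line (transformation)
val words  = lines.flatMap(_.split("\t"))                                  // one-to-many: each line becomes its words (transformation)
val pairs  = words.map(word => (word, 1))                                  // one-to-one: each word becomes a (word, 1) pair (transformation)
val counts = pairs.reduceByKey(_ + _)                                      // groups by word and sums the 1s (transformation)
counts.collect().foreach(println)                                          // collect is the action: it triggers the job and prints e.g. (hive,3)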
The results are shown in the screenshot below.