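Spark Core Study Notes
1: Spark Core: the kernel and the most important part of Spark; its role corresponds to MapReduce
Both Spark Core and MapReduce are used for offline (batch) data analysis
The core of Spark Core is the RDD (Resilient Distributed Dataset), which is made up of partitions
2: Spark SQL: corresponds to Hive
SQL and DSL statements -> Spark job (RDD) -> execution
3: Spark Streaming: corresponds to Storm
It turns a continuous stream of data into discrete batches, the DStream (discretized stream), which is still RDDs underneath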
=================================Spark Core content=======================================
I: What is Spark
1: Why learn Spark? What are the shortcomings of MapReduce?
(*) What is Spark?
Lightning-fast unified analytics engine
Apache Spark™ is a unified analytics engine for large-scale data processing.
hadoop 3.0 vs spark https://www.cnblogs.com/zdz8207/p/hadoop-3-new-spark.html
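(*) Main shortcoming of MapReduce: its core is the shuffle, which generates a large amount of I/O
2: Features
(1) Speed
(2) Ease of Use
(3) Generality
(4) Runs Everywhere (compatibility)
II: Spark architecture and deployment
1: Architecture: master-slave (the single Master is a single point of failure)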
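The official site provides a diagram: http://spark.apache.org/docs/2.2.1/cluster-overview.html
2: Installation and deployment:
Prerequisites: install Linux and JDK 1.8.x
Extract: tar -zxvf spark-2.2.1-bin-hadoop2.7.tgz -C /root/app/
Configuration files: /root/app/spark-2.2.1-bin-hadoop2.7/conf
spark-env.sh
slaves
(1) Pseudo-distributed: bigdata01 (mainly for development and testing)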
spark-env.sh
export JAVA_HOME=/opt/modules/jdk1.8.0_11
export SPARK_MASTER_HOST=bigdata01
export SPARK_MASTER_PORT=7077
slaves
bigdata01
Start: sbin/start-all.sh
Spark Web Console (served by the embedded Jetty server on port 8080): http://bigdata01:8080/
(2) Fully distributed: three machines (for production)
Master node: bigdata01
Worker nodes: bigdata02, bigdata03
spark-env.sh
export JAVA_HOME=/opt/modules/jdk1.8.0_11
export SPARK_MASTER_HOST=bigdata01
export SPARK_MASTER_PORT=7077
slaves
bigdata02
bigdata03
Copy the configured Spark directory to the worker nodes
scp -r spark-2.2.1-bin-hadoop2.7/ bigdata02:/opt/modules/
scp -r spark-2.2.1-bin-hadoop2.7/ bigdata03:/opt/modules/
Start on the master node:
[root@bigdata01 spark-2.2.1-bin-hadoop2.7]# sbin/start-all.sh
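http://bigdata01:8080/ is the Spark web UI monitoring page (port 8080)
7077 is the Master's RPC communication port
(3) Spark HA (two approaches):
(*) File-system directory based, for development and testing (single-machine environment)
(*) ZooKeeper based
Prerequisite: a running ZooKeeper ensemble
Master nodes: bigdata01
bigdata02
Worker nodes: bigdata02
bigdata03
Change: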
spark-env.sh
export JAVA_HOME=/opt/modules/jdk1.8.0_11
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=bigdata01:2181,bigdata02:2181,bigdata03:2181 -Dspark.deploy.zookeeper.dir=/spark"
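Start on bigdata01:
sbin/start-all.sh starts the Master and all Workers
A second (standby) Master must be started separately on bigdata02:
sbin/start-master.sh
Failover: if the active Master goes down, the standby Master takes over
(4) Allocating resources on the Worker servers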
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g
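III: Running Spark demo programs
1: Tools for running Spark jobs
(1) spark-shell: similar to the Scala REPL, and comparable to the Hive and Hadoop command-line clients or Oracle's SQL*Plus
Spark's interactive command-line tool
It has two run modes
(*) Local mode
[root@bigdata01 bin]# ./spark-shell
Does not connect to a cluster and runs locally, similar to Storm's local mode
Log: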
Spark context Web UI available at http://192.168.137.101:4040
Spark context available as 'sc' (master = local[*], app id = local-1537193211258).
Spark session available as 'spark'.
In application code: ***********.setMaster("local")
(*) Cluster mode
Connects to the cluster to run jobs, similar to Storm's cluster mode
[root@bigdata01 bin]# ./spark-shell --master spark://bigdata01:7077
Log:
Spark context Web UI available at http://192.168.137.101:4040
Spark context available as 'sc' (master = spark://bigdata01:7077, app id = app-20180917101324-0000).
Spark session available as 'spark'.
Write a WordCount program (word-frequency count):
scala> sc.textFile("hdfs://bigdata02:9000/input/words").flatMap(x=>x.split(" ")).map((_, 1)).reduceByKey(_+_).sortBy(_._2, false).collect
res1: Array[(String, Int)] = Array((hello,4), (spark,3), (hdoop,2), (hadoop,1), (hbase,1), (hive,1), (java,1))
Tip: can the result be written out as a single partition (one output file)?
scala> sc.textFile("hdfs://bigdata02:9000/input/words").flatMap(x=>x.split(" ")).map((_, 1)).reduceByKey(_+_)
.repartition(1).sortBy(_._2,false).saveAsTextFile("hdfs://bigdata02:9000/output/0917-02")
(2) spark-submit: the counterpart of the hadoop jar command, which submits a MapReduce job (a jar file)
Examples shipped with Spark: /opt/modules/spark-2.2.1-bin-hadoop2.7/examples/src/main
Example: Monte Carlo estimation of Pi (SparkPi.scala)
# Usage:
[root@bigdata01 spark-2.2.1-bin-hadoop2.7]# bin/spark-submit --master spark://bigdata01:7077 --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.2.1.jar 2001
Pi is roughly 3.141956357097818
(3) Developing programs in IDEA: WordCount
(1) Scala version
bin/spark-submit --master spark://bigdata01:7077 \
--class cn.edu360.spark.day01.ScalaWordCount /opt/jars/HelloSpark-1.0-SNAPSHOT-shaded.jar \
hdfs://bigdata01:9000/input/words/ hdfs://bigdata01:9000/output/0917-03
(2) Java version
bin/spark-submit --master spark://bigdata01:7077 \
--class cn.edu360.spark.day01.JavaWordCount /opt/jars/HelloSpark-1.0-SNAPSHOT-shaded.jar \
hdfs://bigdata01:9000/input/words/ hdfs://bigdata01:9000/output/0917-04
(3) Java Lambda version
bin/spark-submit --master spark://bigdata01:7077 \
--class cn.edu360.spark.day02.JavaLambdaWC /opt/jars/HelloSpark-1.0-SNAPSHOT-shaded.jar \
hdfs://bigdata01:9000/input/words/ hdfs://bigdata01:9000/output/0917-03
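The source of the submitted classes is not included in these notes. As a rough idea of what the Scala version could look like, here is a minimal sketch. The object name ScalaWordCountSketch is hypothetical, and reading the input path from args(0) and the output path from args(1) is an assumption based on the two HDFS paths passed on the command line.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for cn.edu360.spark.day01.ScalaWordCount (its source is not shown in these notes).
object ScalaWordCountSketch {
  // Split each line into words, count every word, and sort by count in descending order.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

  def main(args: Array[String]): Unit = {
    // Assumption: args(0) is the input path and args(1) is the output path.
    // The master is not set here, so spark-submit --master decides where the job runs.
    val conf = new SparkConf().setAppName("ScalaWordCountSketch")
    val sc = new SparkContext(conf)
    try countWords(sc.textFile(args(0))).saveAsTextFile(args(1))
    finally sc.stop()
  }
}
```

Keeping the counting logic in countWords, separate from main, makes it possible to try it on a small in-memory RDD in local mode before submitting to the cluster.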
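IV: How Spark executes a job
1. Walk through how the WordCount program is processed
2. The Spark job submission flow: similar to how YARN schedules jobs
Supplement:
Running a Spark program locally on Windows (the Hadoop version must be 2.8.3)
1: Put the hadoop.dllAndwinutils.exeForhadoop-2.8.3 folder in any directory
2: Set the HADOOP_HOME environment variable to that path: D:\developer\hadoop.dllAndwinutils.exeForhadoop-2.8.3
3: Append %HADOOP_HOME%\bin to Path
4: Configure Spark to run locally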
SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
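5: Read a local file: JavaRDD<String> lines = jsc.textFile("D:\\1.txt");
V: Spark RDDs and operators (functions, methods)
1. Important: what is an RDD
(*) RDD (Resilient Distributed Dataset)
(*) Array vs RDD: an Array lives on a single machine, while an RDD's data is spread across distributed servers, e.g. Worker1 and Worker2
(*) The basic abstraction of a dataset in Spark
(*) Read alongside the source code to understand the properties of an RDD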
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
A set of partitions
* - A function for computing each split
The function used to compute the data of the RDD
* - A list of dependencies on other RDDs
RDDs depend on one another (wide and narrow dependencies)
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
(*) How to create an RDD
1) By parallelizing a collection
scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8), 3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd1.partitions.length
res0: Int = 3
2) By reading an external data source directly
scala> val rdd1 = sc.textFile("hdfs://bigdata01:9000/input/words")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs://bigdata01:9000/input/words MapPartitionsRDD[2] at textFile at <console>:24
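Worth noting:
Spark SQL -> table (DataFrame) -> also an RDD underneath
Spark Streaming -> core: DStream input stream -> also RDDs underneath
2: RDD operators (functions, methods)
(1) Transformation: does not trigger computation; evaluation is lazy (like Scala's lazy)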
(*)textFile:
(*) map(func): applies func to every element of the source RDD and returns a new RDD
map(word => (word, 1))
scala> val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "eleohant"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val b = a.map(_.length)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26
scala> b.collect
res0: Array[Int] = Array(3, 6, 6, 3, 8)
scala> val c = a.zip(b)
c: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[2] at zip at <console>:28
scala> c.collect
res1: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (eleohant,8))
(*)flatMap:flatten+Map
scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> a.flatMap(1 to _).collect
res2: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val a = sc.parallelize(List(1,2,3), 3).flatMap(x => List(x, x, x)).collect
a: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)
(*) filter: keeps only the elements that satisfy the condition
scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:24
scala> val b = a.filter(_%2 ==0)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[17] at filter at <console>:26
scala> b.collect
res7: Array[Int] = Array(2, 4, 6, 8, 10)
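(*) mapPartitions(func): applies func to each partition of the source RDD and returns a new RDD
(*) mapPartitionsWithIndex(func): like mapPartitions, but func also receives the index of the partition it is processing
(*) union: union of two RDDs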
scala> val a = sc.parallelize(1 to 3, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> val b = sc.parallelize(5 to 7, 1)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> (a ++ b).collect
res4: Array[Int] = Array(1, 2, 3, 5, 6, 7)
(*) cartesian: Cartesian product
scala> val x = sc.parallelize(List(1,2,3,4,5))
x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> val y = sc.parallelize(List(6,7,8,9,10))
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24
scala> x.cartesian(y).collect
res5: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
(*) intersection: intersection of two RDDs
(*) distinct: removes duplicates
scala> val c = sc.parallelize(List("dog", "cat", "Ret", "Ret"))
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[19] at parallelize at <console>:24
scala> c.distinct.collect
res13: Array[String] = Array(dog, Ret, cat)
(*) groupBy: grouping
reduceByKey: groups by key and performs a local aggregation within each partition first (equivalent to having a Combiner)
scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:24
scala> a.groupBy(x => if(x%2 ==0) "even" else "cdd").collect
res6: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 8, 4, 6)), (cdd,CompactBuffer(5, 1, 3, 7, 9)))
(*)mapValues:
scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "eagle"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[23] at parallelize at <console>:24
scala> val b = a.map(x=>(x.length, x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[24] at map at <console>:28
scala> b.collect
res14: Array[(Int, String)] = Array((3,dog), (5,tiger), (4,lion), (3,cat), (5,eagle))
scala> b.mapValues("x"+_+"x").collect
res15: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (5,xeaglex))
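(2) Action: triggers computation (e.g. collect)
(*) collect: triggers computation and returns the result
(*) count: returns the number of elements
(*) first: returns the first element
(*) take(n): returns the first n elements
(*) saveAsTextFile: saves the RDD as text files
(*) foreach(func): applies func to every element of the RDD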
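3. Advanced RDD operators
(1) mapPartitions
mapPartitions can be seen as a variant of map. Both process the partitions in parallel; the main difference is the granularity of the call: the function passed to map is applied to each element of the RDD,
whereas the function passed to mapPartitions is applied to each partition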
scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> def mapFuncEle(e:Int):Int = {
| println("e="+e)
| e*2
| }
mapFuncEle: (e: Int)Int
scala> def mapFuncPart(iter: Iterator[Int]):Iterator[Int] ={
| println("run is partition")
| var res = for(e <- iter) yield e*2
| res
| }
mapFuncPart: (iter: Iterator[Int])Iterator[Int]
scala> val b = a.map(mapFuncEle).collect
b: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
scala> val c = a.mapPartitions(mapFuncPart).collect
c: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
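From the point of view of the input functions (mapFuncEle, mapFuncPart), map is a push model: each element is pushed into mapFuncEle. mapPartitions is a pull model: mapFuncPart pulls data from the partition through an iterator
The other difference between the two shows up on large datasets, in resource-initialization cost and batching: if (mapFuncEle, mapFuncPart) has to initialize an expensive resource, the cost differs
Take a database connection: in the example above mapFuncPart initializes only 3 resources (one per partition), while mapFuncEle initializes 10 (one per element). On a large dataset mapFuncPart is clearly far cheaper, and it also makes batch processing easier
Think about it: why does mapPartitions work on an iterator? A partition may hold too much data to load at once without running out of memory, so the iterator hands the elements over one at a time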
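The 3-versus-10 count can be reproduced without a cluster. The following is a minimal plain-Scala model, not Spark code: the nested Seq stands in for the three partitions of sc.parallelize(1 to 10, 3), and FakeConnection and open() are hypothetical stand-ins for an expensive resource such as a database connection.

```scala
// Hypothetical stand-in for an expensive resource such as a database connection.
final class FakeConnection { def double(x: Int): Int = x * 2 }

var opened = 0 // how many times the resource was initialized
def open(): FakeConnection = { opened += 1; new FakeConnection }

// Assumption: the same 3/3/4 split that the notes show for parallelize(1 to 10, 3).
val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

// map style: the function runs once per element, so the resource is opened per element.
opened = 0
val viaMap = partitions.map(_.map(x => open().double(x)))
val opensWithMap = opened

// mapPartitions style: the function runs once per partition and pulls elements from an iterator.
opened = 0
val viaMapPartitions = partitions.map { part =>
  val conn = open()
  part.iterator.map(x => conn.double(x)).toList
}
val opensWithMapPartitions = opened

println(s"map: $opensWithMap opens, mapPartitions: $opensWithMapPartitions opens")
```

Both styles produce the same doubled values; only the number of initializations differs (10 versus 3 here).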
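(2) mapPartitionsWithIndex
Operates on each partition of the RDD and also passes in the partition index
If the elements cannot be divided evenly among the partitions, the partitions end up with different sizes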
scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24
// index is the partition number; it is the iterator over that partition's elements
scala> val func = (index:Int, it:Iterator[Int]) => it.map(s"index:$index, ele:"+ _ )
func: (Int, Iterator[Int]) => Iterator[String] = <function2>
scala> a.mapPartitionsWithIndex(func).collect
res5: Array[String] = Array(index:0, ele:1, index:0, ele:2, index:0, ele:3, index:1, ele:4, index:1, ele:5, index:1, ele:6, index:2, ele:7, index:2, ele:8, index:2, ele:9, index:2, ele:10)
# Here the data divides evenly across the partitions
scala> val b = sc.parallelize(1 to 9, 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:24
scala> b.mapPartitionsWithIndex(func).collect
res6: Array[String] = Array(index:0, ele:1, index:0, ele:2, index:0, ele:3, index:1, ele:4, index:1, ele:5, index:1, ele:6, index:2, ele:7, index:2, ele:8, index:2, ele:9)
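(3) aggregate: aggregation
It aggregates twice:
1): local aggregation (within each partition)
2): global aggregation (across the partition results)
It is a curried method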
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
Parameters:
zeroValue: U: the initial value -> applied in both the local and the global operation
seqOp: (U, T) => U: the local operation
combOp: (U, U) => U: the global operation
Example 1:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5), 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
def func1(index:Int,iter:Iterator[Int]):Iterator[String]={
iter.toList.map(x=>"Partition ID:" + index +",value="+x).iterator
}
scala> rdd1.mapPartitionsWithIndex(func1).collect
res4: Array[String] = Array(
Partition ID:0,value=1,
Partition ID:0,value=2,
Partition ID:1,value=3,
Partition ID:1,value=4,
Partition ID:1,value=5)
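Calls:
rdd1.aggregate(0)(Math.max(_,_), _+_) => 7
Explanation:
Max of the first partition: 2
Max of the second partition: 5
Result: 0+2+5 = 7
rdd1.aggregate(0)(_+_, _+_) => 15
Explanation: first partition: 0+1+2
Second partition: 0+3+4+5
Result: 0+3+12 = 15
rdd1.aggregate(10)(_+_, _+_) => 45
Explanation: first partition: 10+1+2
Second partition: 10+3+4+5
Result: 10+13+22 = 45
rdd1.aggregate(10)(Math.max(_,_), _+_) => 30
Explanation: first partition: max(10,1,2) = 10
Second partition: max(10,3,4,5) = 10
Result: 10+10+10 = 30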
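The arithmetic above can be checked with a small plain-Scala model of the two aggregation steps. This is only a sketch of the semantics described in these notes, not Spark's implementation: aggregateModel is a hypothetical helper, and the nested Seq stands in for the two partitions shown by mapPartitionsWithIndex.

```scala
// Plain-Scala model of aggregate (hypothetical helper, no Spark needed).
def aggregateModel[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): U = {
  // Local step: fold every partition, each starting from zeroValue.
  val perPartition = partitions.map(_.foldLeft(zeroValue)(seqOp))
  // Global step: merge the partition results, starting from zeroValue once more.
  perPartition.foldLeft(zeroValue)(combOp)
}

// Same layout as rdd1: partition 0 holds 1,2 and partition 1 holds 3,4,5.
val parts = Seq(Seq(1, 2), Seq(3, 4, 5))

val maxThenSum0  = aggregateModel(parts, 0)(math.max(_, _), _ + _)
val sumThenSum0  = aggregateModel(parts, 0)(_ + _, _ + _)
val sumThenSum10 = aggregateModel(parts, 10)(_ + _, _ + _)
val maxThenSum10 = aggregateModel(parts, 10)(math.max(_, _), _ + _)

println(Seq(maxThenSum0, sumThenSum0, sumThenSum10, maxThenSum10))
```

With two partitions, zeroValue is used three times in total: once per partition and once in the global merge, which is why a zeroValue of 10 adds 30 to the plain sum.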
Example 2:
val rdd2 = sc.parallelize(List("a","b","c","d","e","f"), 2)
def func2(index:Int,iter:Iterator[String]):Iterator[String]={
iter.toList.map(x=>"Partition ID:" + index +",value="+x).iterator
}
scala> rdd2.mapPartitionsWithIndex(func2).collect
res0: Array[String] = Array(
Partition ID:0,value=a,
Partition ID:0,value=b,
Partition ID:0,value=c,
Partition ID:1,value=d,
Partition ID:1,value=e,
Partition ID:1,value=f)
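Calls (either result can appear, depending on which partition's result arrives first):
rdd2.aggregate("#")(_+_, _+_) ->##abc#def
->##def#abc
(4) aggregateByKey: works on <key, value> data; it aggregates locally first and then globally
aggregate and aggregateByKey are much alike in that both aggregate, but aggregateByKey aggregates per key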
1) Test data
val pairRDD = sc.parallelize(List(("cat", 1),("cat", 2), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)), 2)
2) See which animals are in each cage (partition)
def func3(index:Int, iter:Iterator[(String, Int)]) : Iterator[String] = {
iter.toList.map(x=>"Partition ID:" + index +",value="+x).iterator
}
scala> pairRDD.mapPartitionsWithIndex(func3).collect
res3: Array[String] = Array(
Partition ID:0,value=(cat,1),
Partition ID:0,value=(cat,2),
Partition ID:0,value=(mouse,4),
Partition ID:1,value=(cat,12),
Partition ID:1,value=(dog,12),
Partition ID:1,value=(mouse,2))
3) Calls
scala> pairRDD.reduceByKey(_+_).collect
res0: Array[(String, Int)] = Array((dog,12), (cat,15), (mouse,6))
scala> pairRDD.aggregateByKey(0)(_+_, _+_).collect
res1: Array[(String, Int)] = Array((dog,12), (cat,15), (mouse,6))
scala> pairRDD.aggregateByKey(100)(_+_, _+_).collect
res4: Array[(String, Int)] = Array((dog,112), (cat,215), (mouse,206))
The zero value 100 is added once per key in each partition that contains that key:
dog: 100+12
cat: (100+1+2)+(100+12)
mouse: (100+4)+(100+2)
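The per-key results can be reproduced with a plain-Scala model. As before this is a sketch of the semantics described here rather than Spark's implementation: aggregateByKeyModel is a hypothetical helper, and the nested Seq mirrors the two partitions printed by func3.

```scala
// Plain-Scala model of aggregateByKey (hypothetical helper, no Spark needed).
def aggregateByKeyModel[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
    seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
  // Local step: inside each partition, fold the values of every key starting from zeroValue.
  val perPartition: Seq[Map[K, U]] = partitions.map { part =>
    part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).foldLeft(zeroValue)(seqOp) }
  }
  // Global step: merge the per-partition results of every key with combOp.
  // zeroValue is not applied again here, unlike in aggregate.
  perPartition.flatten.groupBy(_._1).map { case (k, kus) => k -> kus.map(_._2).reduce(combOp) }
}

// Same layout as pairRDD with 2 partitions.
val cages = Seq(
  Seq(("cat", 1), ("cat", 2), ("mouse", 4)),
  Seq(("cat", 12), ("dog", 12), ("mouse", 2))
)

val withZero0   = aggregateByKeyModel(cages, 0)(_ + _, _ + _)
val withZero100 = aggregateByKeyModel(cages, 100)(_ + _, _ + _)

println(withZero0)
println(withZero100)
```

Because dog appears in only one partition it picks up the zero value once (112), while cat and mouse appear in both partitions and pick it up twice (215 and 206).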
4. Advanced RDD methods
(*)collectAsMap
scala> val rdd = sc.parallelize(List(("a", 1), ("b", 2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[39] at parallelize at <console>:24
scala> rdd.collect
res48: Array[(String, Int)] = Array((a,1), (b,2))
scala> rdd.collectAsMap
res49: scala.collection.Map[String,Int] = Map(b -> 2, a -> 1)
scala> val rdd = sc.parallelize(List(("a", 1), ("b", 2), ("b", 32)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[40] at parallelize at <console>:24
scala> rdd.collectAsMap
res51: scala.collection.Map[String,Int] = Map(b -> 32, a -> 1)
The result shows that when the RDD holds several values for the same key, a later value overwrites the earlier one, so the final map has unique keys with exactly one value each.
(*)countByKey
scala> val rdd = sc.parallelize(List(("a", 1), ("b", 2), ("b", 32), ("b", 12), ("c", 11)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[46] at parallelize at <console>:24
scala> rdd.countByKey
res54: scala.collection.Map[String,Long] = Map(a -> 1, b -> 3, c -> 1)
Counts how many times each key occurs
(*)countByValue
scala> val rdd = sc.parallelize(List(("a", 1), ("b", 2), ("b", 32), ("b", 12), ("c", 11)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[46] at parallelize at <console>:24
scala> rdd.countByValue
res55: scala.collection.Map[(String, Int),Long] = Map((b,12) -> 1, (b,2) -> 1, (b,32) -> 1, (c,11) -> 1, (a,1) -> 1)
scala> val rdd = sc.parallelize(List(("a", 1), ("b", 2), ("b", 32), ("b", 12), ("c", 11), ("c", 11)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[52] at parallelize at <console>:24
scala> rdd.countByValue
res56: scala.collection.Map[(String, Int),Long] = Map((b,12) -> 1, (b,2) -> 1, (b,32) -> 1, (c,11) -> 2, (a,1) -> 1)
Counts how many times each (key, value) pair occurs
scala> val rdd = sc.parallelize(List("a","b","a","c"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[61] at parallelize at <console>:24
scala> rdd.countByValue
res58: scala.collection.Map[String,Long] = Map(a -> 2, b -> 1, c -> 1)
(*)flatMapValues
Same as flatMap among the basic transformations, except that flatMapValues applies the flatMap to the V of each [K,V] pair.
scala> val rdd3 = sc.parallelize(List(("a", "1 2"), ("b", "3 4")))
rdd3: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[69] at parallelize at <console>:24
scala> val rdd4 = rdd3.flatMapValues(_.split(" "))
rdd4: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[70] at flatMapValues at <console>:26
scala> rdd4.collect
res61: Array[(String, String)] = Array((a,1), (a,2), (b,3), (b,4))
5. Using broadcast variables
See the diagram and the code in IPLocation.scala for details
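IPLocation.scala itself is not reproduced in these notes. As a rough illustration of the pattern only, the sketch below broadcasts a small lookup table once and reads it inside a map. The object name BroadcastSketch, the prefix table, and the sample addresses are all made up for this example and are not the rules used by the original program.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example; not the IPLocation.scala referenced above.
object BroadcastSketch {
  // Made-up lookup table mapping an address prefix to a label.
  val regionByPrefix: Map[String, String] = Map("10." -> "internal", "192.168." -> "lan")

  def tagRegions(sc: SparkContext, ips: Seq[String]): Array[(String, String)] = {
    // Ship the table to each executor once instead of with every task.
    val rules = sc.broadcast(regionByPrefix)
    sc.parallelize(ips).map { ip =>
      // rules.value reads the broadcast copy on the executor.
      val region = rules.value
        .collectFirst { case (prefix, name) if ip.startsWith(prefix) => name }
        .getOrElse("unknown")
      (ip, region)
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastSketch").setMaster("local[*]"))
    try tagRegions(sc, Seq("10.0.0.1", "8.8.8.8")).foreach(println)
    finally sc.stop()
  }
}
```

The table is created on the driver and read through rules.value inside the task, which is the usual shape for a read-only lookup shared by all tasks.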
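6. Using JdbcRDD (JdbcRDDDemo)
import java.sql.DriverManager
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object JdbcRDDDemo {
  // getConnection: () => Connection; a function that opens a new connection when called
  val conn = () => {
    DriverManager.getConnection("jdbc:mysql://bigdata01:3306/bigdata?characterEncoding=UTF-8", "root", "123456")
  }
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("JdbcRDDDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val jdbcRDD = new JdbcRDD(
      sc,
      conn,
      // the two ? placeholders receive the id range handled by each partition
      "select id, name, age from logs where id>=? and id<=?",
      1, // lowerBound: smallest id
      5, // upperBound: largest id
      2, // numPartitions: the id range is split into this many partitions
      // mapRow: converts one ResultSet row into a tuple
      rs => {
        val id = rs.getLong(1)
        val name = rs.getString(2)
        val age = rs.getInt(3)
        (id, name, age)
      }
    )
    val result = jdbcRDD.collect()
    println(result.toBuffer)
    sc.stop()
  }
}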
7. Using the Spark cache
scala> val rdd1 = sc.textFile("hdfs://bigdata02:9000/input/access/access_2013_05_30.txt")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs://bigdata02:9000/input/access/access_2013_05_30.txt MapPartitionsRDD[3] at textFile at <console>:24
scala> rdd1.count
res1: Long = 548160
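The time each run takes can be checked in the UI
http://192.168.137.101:4040/storage/
It also shows what fraction of the data is cached and how much memory it occupies
When the file is too large, not all of it is kept in memory. A file that is actually 30 MB can take up 90 MB once cached: the file on disk stores raw bytes, but with the default MEMORY_ONLY level the data is held in memory as deserialized Java objects,
which carry per-object overhead and therefore take more space than the file itself
The cache never takes up all of the memory; enough has to be left for the program to run
Notes:
cache speeds a program up, but caching an RDD that is used only once is pointless; it pays off when the RDD is reused repeatedly
cache is neither a transformation nor an action: it does not produce a new RDD, and it does not run anything immediately
Caching raw HDFS data directly is not recommended
Filter the HDFS data first and cache the result
Clear the cache when finished:
unpersist()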
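The notes above can be put together in a short sketch: filter first, cache the filtered RDD, reuse it, then release it. The object name CacheSketch and the three sample log lines are made up for illustration; a real job would read the access log from HDFS with sc.textFile.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical example illustrating cache() and unpersist().
object CacheSketch {
  // Returns (storage level after cache(), number of matching lines, storage level after unpersist()).
  def run(sc: SparkContext): (StorageLevel, Long, StorageLevel) = {
    // Made-up lines standing in for the HDFS access log.
    val lines = sc.parallelize(Seq("GET /a 200", "GET /b 404", "GET /c 200"))
    // Filter first, then cache only the part that will be reused.
    val ok = lines.filter(_.endsWith("200")).cache()
    val levelAfterCache = ok.getStorageLevel // cache() only marks the RDD; nothing has run yet
    ok.count()                               // first action computes the RDD and fills the cache
    val n = ok.count()                       // second action can read the cached partitions
    ok.unpersist()                           // release the cached data when finished
    (levelAfterCache, n, ok.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheSketch").setMaster("local[*]"))
    try println(run(sc))
    finally sc.stop()
  }
}
```

While the application is running, the cached RDD is listed on the storage page of the web UI between the first count and the unpersist call.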
8. The RDD caching mechanism
(*) Improves efficiency
(*) Source code analysis
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def cache(): this.type = persist()
(*) Default storage level: StorageLevel.MEMORY_ONLY
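9. RDD fault tolerance
A checkpoint is like a snapshot. Suppose the DAG of a Spark computation is very long and the cluster has to run all of it to produce the result. If an intermediate result is lost partway through,
Spark recomputes everything from the start by following the RDD dependencies, which is expensive. The intermediate results can be kept in memory or on disk with cache or persist, but that does not guarantee the data will never be lost:
if that memory fails or the disk breaks, Spark again recomputes from the start. This is why checkpoint exists: it takes the more important intermediate data in the DAG and saves it
to a highly available location (usually HDFS)
(*) What exactly is a checkpoint, and what problem does it solve?
1) In production, a Spark job often involves a very large number of transformed RDDs (say, ten thousand RDDs in one job), or a particular transformation produces an RDD that is very complex and slow to compute (say, more than an hour),
perhaps because the business logic is complicated. In such cases persisting the computed result has to be considered
2) Spark is good at multi-step iterative computation and at reuse across jobs. If earlier results can be reused, efficiency improves greatly, because shared steps need not be recomputed
3) Keeping data in memory with cache is the fastest option but also the least reliable; even data persisted to disk is not fully reliable, since disks fail too
4) Checkpoint was introduced to persist data more reliably. The checkpoint directory can be a local directory or HDFS; on HDFS the data is stored with multiple replicas, so it naturally benefits from HDFS's high reliability
5) Checkpoint targets the steps in the RDD computation chain that most need their data persisted (RDDs that are reused repeatedly later on)
(*) Drawback:
Fault tolerance through checkpointing costs additional I/O
(*) Review: the HDFS checkpoint: the SecondaryNameNode merges the logs
Oracle also has checkpoints: when one occurs, the database writer process is woken with the highest priority and writes the dirty data in memory to the data files (persistence)
(*) A checkpoint saves intermediate results
Two options
(*) Local directory (test environment)
(*) HDFS directory (production environment)
Note: in this mode spark-shell must be running against the cluster
(*) Using checkpoint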
scala> sc.setCheckpointDir("hdfs://bigdata02:9000/checkpoint0927")
scala> val rdd1 = sc.parallelize(1 to 1000)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> rdd1.checkpoint
[root@bigdata01 spark-2.2.1-bin-hadoop2.7]# hdfs dfs -ls /checkpoint0927
18/09/27 14:10:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - root supergroup 0 2018-09-27 09:21 /checkpoint0927/82b44a8c-d73a-4ded-b02e-b710f7b2a213
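HDFS still holds no data at this point: like a transformation, checkpoint is lazy. It only marks the RDD, and the data is written when an action runs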
scala> rdd1.collect
[root@bigdata01 spark-2.2.1-bin-hadoop2.7]# hdfs dfs -ls /checkpoint0927
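Running the action effectively goes through the lineage twice: the action (collect here) computes the RDD once, and then the checkpoint computes it again. So the usual practice is to cache first and then checkpoint, so the lineage is computed only once:
the checkpoint then takes the data just cached in memory and writes it to HDFS
The Spark source comments say the same: persisting the RDD in memory before calling checkpoint is strongly recommended. Once the checkpoint succeeds, all references to the RDD's parent RDDs are removed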
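The cache-then-checkpoint advice can be sketched as follows. The object name CheckpointSketch is hypothetical, and a local temporary directory is assumed as the checkpoint directory so the example runs in local mode; on a cluster the directory would be an HDFS path as in the transcript above.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example illustrating cache() followed by checkpoint().
object CheckpointSketch {
  // Returns (isCheckpointed before the action, isCheckpointed after the action).
  def run(sc: SparkContext, checkpointDir: String): (Boolean, Boolean) = {
    sc.setCheckpointDir(checkpointDir)
    // Cache first so that writing the checkpoint can reuse the computed data.
    val rdd = sc.parallelize(1 to 1000).map(_ * 2).cache()
    rdd.checkpoint()                // lazy: this only marks the RDD for checkpointing
    val before = rdd.isCheckpointed // still false, nothing has been written yet
    rdd.count()                     // the action computes the RDD and triggers the checkpoint write
    (before, rdd.isCheckpointed)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckpointSketch").setMaster("local[*]"))
    // Assumption: a local temp directory for testing; use an HDFS directory in production.
    try println(run(sc, Files.createTempDirectory("checkpoint").toString))
    finally sc.stop()
  }
}
```

The first flag is false and the second is true, which matches the observation above that the checkpoint directory stays empty until an action runs.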