[Big Data Study] Spark RDD Core, Part 3

Latest recommended article published on 2024-09-27 18:48:18

奔走觅衣粮

Category: Spark

Article link: https://blog.csdn.net/qq_35826412/article/details/86509291


SparkCore03
I. Spark Glossary
Glossary
The following table summarizes terms you’ll see used to refer to cluster concepts:
Term   Meaning
Application   User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar   A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime.
Driver program   The process running the main() function of the application and creating the SparkContext
Cluster manager   An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
Deploy mode   Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node   Any node that can run application code in the cluster
Executor   A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task   A unit of work that will be sent to one executor
Job   A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage   Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.

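1 Application = 1 Driver + N Executors
   Driver: a process; it runs main() and creates the SparkContext
       it can run on the local client (client mode)
       or inside the cluster (cluster mode)
   Executor: also a process; it executes tasks (e.g. map, filter)
       it runs on a worker node, which you can think of as the counterpart of a YARN NodeManager (NM)
       the tasks it runs are the operators applied to the data
   Job ==> action: each action triggers one job
   Stage ==> a job may be split into multiple stages, i.e. into smaller sets of tasks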

II. Partition-level operators
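1. mapPartitions
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

// DBUtils is the author's own connection helper; its source is not shown in this post.
object MapPartitionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("MapPartitionApp")
    val sc = new SparkContext(sparkConf)

    val students = new ListBuffer[String]()
    for (i <- 1 to 100) {
      students += "若泽数据实战班五期：" + i
    }
    students.foreach(println)

    // Spark operators work on RDDs, not on local collections, so convert the list to an RDD.
    // The second argument is the number of partitions: the RDD gets exactly that many (4 here).
    val stus = sc.parallelize(students, 4)

    // map runs the function once for every element.
    stus.map(x => {
      val connection = DBUtils.getConnection()
      println(connection + "~~~~~~~")

      DBUtils.returnConnection(connection)
    }).foreach(println) // Prints 100 times, i.e. 100 database connections. Never do this in real work.

    // For database access, prefer mapPartitions: the function runs once per partition.
    stus.mapPartitions(partition => {
      val connection = DBUtils.getConnection()
      println(connection + "~~~~~~~~~~~~")
      // TODO... write the data to the database

      DBUtils.returnConnection(connection)
      partition // the returned iterator becomes this partition of the new RDD
    }).foreach(println)
    sc.stop()
  }
}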
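
The difference in connection count can be reproduced without Spark or a database. The sketch below is a plain-Scala model, not Spark code: `FakePool` is a hypothetical stand-in for `DBUtils`, and `slices` only imitates how `parallelize(data, 4)` cuts a list into 4 partitions.

```scala
// Plain-Scala model of "one connection per element" vs "one connection per partition".
// FakePool is a hypothetical stand-in for the DBUtils helper used in the post.
object ConnectionCountSketch {
  final class FakePool {
    var opened = 0
    def getConnection(): Int = { opened += 1; opened }
    def returnConnection(connection: Int): Unit = ()
  }

  // Cut the data into numPartitions roughly equal slices.
  def slices[T](data: Seq[T], numPartitions: Int): List[Seq[T]] = {
    val size = math.ceil(data.size.toDouble / numPartitions).toInt
    data.grouped(size).toList
  }

  // map-style: a connection is taken and returned for every element.
  def perElement(parts: Seq[Seq[String]]): Int = {
    val pool = new FakePool
    parts.foreach { part =>
      part.foreach { elem =>
        val connection = pool.getConnection()
        pool.returnConnection(connection)
      }
    }
    pool.opened
  }

  // mapPartitions-style: one connection is shared by all elements of a partition.
  def perPartition(parts: Seq[Seq[String]]): Int = {
    val pool = new FakePool
    parts.foreach { part =>
      val connection = pool.getConnection()
      part.foreach(elem => ()) // every element would be written through `connection` here
      pool.returnConnection(connection)
    }
    pool.opened
  }

  def main(args: Array[String]): Unit = {
    val parts = slices((1 to 100).map(i => "student-" + i), 4)
    println(s"per element:   ${perElement(parts)} connections")
    println(s"per partition: ${perPartition(parts)} connections")
  }
}
```

With 100 elements in 4 partitions the first count is 100 and the second is 4, which is the whole argument for the partition-level operators.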

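2. foreachPartition
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

// DBUtils is the author's own connection helper; its source is not shown in this post.
object ForeachPartitionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("ForeachPartitionApp")
    val sc = new SparkContext(sparkConf)

    val students = new ListBuffer[String]()
    for (i <- 1 to 100) {
      students += "若泽数据实战班五期：" + i
    }

    // Spark operators work on RDDs, not on local collections, so convert the list to an RDD.
    // The second argument is the number of partitions: the RDD gets exactly that many (4 here).
    val stus = sc.parallelize(students, 4)

    // foreach runs the function once for every element:
    // stus.foreach(x => {
    //   val connection = DBUtils.getConnection()
    //   println(connection + "~~~~~~~")
    //
    //   DBUtils.returnConnection(connection)
    // }) // Prints 100 times, i.e. 100 database connections. Never do this in real work.

    // For database access, use foreachPartition: the function runs once per partition.
    stus.foreachPartition(partition => {
      val connection = DBUtils.getConnection()
      println(connection + "~~~~~~~~~~~~")
      // TODO... write the data to the database

      DBUtils.returnConnection(connection)
      // foreachPartition is an action and its function returns Unit, so nothing is returned here.
    })
    sc.stop()
  }
}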
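Note: in real work, when you need to write data to a database, choose foreachPartition directly. foreachPartition is an action, while mapPartitions is a transformation that does nothing until an action is called on its result, so foreachPartition gets the job done in one step.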
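
Why a transformation alone writes nothing can be seen with an analogy in plain Scala. This is only an analogy, not Spark: Scala's own `Iterator.map` is lazy in the same way a transformation is, and `foreach` forces the work the way an action does. The counter `writes` is a hypothetical stand-in for rows written to a database.

```scala
// Analogy in plain Scala: a lazy map (like a transformation) vs. foreach (like an action).
object LazyVsEagerSketch {
  // Returns (writes observed right after map, writes observed after foreach).
  def run(): (Int, Int) = {
    var writes = 0 // stand-in for "rows written to the database"
    val partition = Iterator("a", "b", "c")

    // Only describes the work; no element has been touched yet.
    val mapped = partition.map { x => writes += 1; x }
    val afterMap = writes

    // Pulls every element through the map function.
    mapped.foreach(_ => ())
    (afterMap, writes)
  }

  def main(args: Array[String]): Unit = {
    val (afterMap, afterForeach) = run()
    println(s"writes after map: $afterMap, writes after foreach: $afterForeach")
  }
}
```

The same reasoning is why the mapPartitions example in this post still needs a trailing `foreach(println)` before anything happens.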

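An important RDD operator: coalesce
repartition is implemented in the Spark source as a call to coalesce with shuffle = true:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
The doc comment on coalesce in the Spark source reads: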
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* @note With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner. The optional partition coalescer
* passed in must be serializable.
*/

// Read an external file
scala> val data = sc.textFile("file:///home/hadoop/data/ruozeinput.txt")
data: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/ruozeinput.txt MapPartitionsRDD[7] at textFile at <console>:24
// Original number of partitions
scala> data.partitions.size
res5: Int = 2
// coalesce returns a new RDD with fewer partitions; data itself is not modified
scala> data.coalesce(1)
res3: org.apache.spark.rdd.RDD[String] = CoalescedRDD[4] at coalesce at <console>:26
// Check the partition count on the RDD returned by coalesce (res3), not on data
scala> res3.partitions.size
res5: Int = 1

coalesce reduces the number of partitions.

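Can the number of partitions also be increased, for example from 1 partition to 2, 3 or more?
Yes, with repartition.
The difference between repartition and coalesce: repartition is implemented by calling coalesce with shuffle = true. It therefore always shuffles and can increase the partition count, whereas coalesce with its default shuffle = false can only reduce it.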

scala> data.repartition(4).partitions.size
res9: Int = 4
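
The behaviour described in the doc comment can be modelled in plain Scala. This is a deliberately simplified model, not Spark's implementation: the real coalescer also takes data locality into account, and the real shuffle assigns rows differently. The function names simply mirror the RDD methods.

```scala
// Simplified plain-Scala model of coalesce (no shuffle) and repartition (shuffle).
object CoalesceSketch {
  type Partition = Vector[Int]

  // No shuffle: each new partition claims whole parent partitions.
  // Asking for more partitions than exist leaves the count unchanged.
  def coalesce(parts: Vector[Partition], n: Int): Vector[Partition] =
    if (n >= parts.size) parts
    else parts.zipWithIndex
      .groupBy { case (_, i) => i * n / parts.size } // parent i goes to child i*n/size
      .toVector.sortBy(_._1)
      .map { case (_, members) => members.flatMap(_._1) }

  // Shuffle: every row may move, so the count can go up as well as down.
  // Rows are dealt out round-robin here purely for illustration.
  def repartition(parts: Vector[Partition], n: Int): Vector[Partition] = {
    val rows = parts.flatten.zipWithIndex
    Vector.tabulate(n)(p => rows.collect { case (row, i) if i % n == p => row })
  }

  def main(args: Array[String]): Unit = {
    val parts = Vector((1 to 34).toVector, (35 to 67).toVector, (68 to 100).toVector)
    println(coalesce(parts, 2).map(_.size))    // sizes after merging 3 partitions into 2
    println(coalesce(parts, 5).size)           // still 3: cannot grow without a shuffle
    println(repartition(parts, 5).map(_.size)) // 5 partitions after redistributing rows
  }
}
```

In the model, `coalesce(parts, 2)` only glues existing partitions together, which is why it corresponds to a narrow dependency, while `repartition` touches every row.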

Consider the following code:
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object CoalesceRePartitionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("CoalesceRePartitionApp")
    val sc = new SparkContext(sparkConf)

    val students = new ListBuffer[String]()
    for (i <- 1 to 100) {
      students += "若泽数据实战班五期：" + i
    }
    // Convert the local collection to an RDD with exactly 3 partitions.
    val stus = sc.parallelize(students, 3)

    // Assign the 100 people to 3 departments (one department per partition).
    stus.mapPartitionsWithIndex((index, partition) => {
      val emps = new ListBuffer[String] // collects everything in this partition
      while (partition.hasNext) {
        emps += ("~~~~~~" + partition.next() + ", old dept: [" + (index + 1) + "]")
      }
      emps.iterator
    }).foreach(println)

    println("============ separator ================")
    // Reduce the number of departments from 3 to 2.
    stus.coalesce(2).mapPartitionsWithIndex((index, partition) => {
      val emps = new ListBuffer[String]
      while (partition.hasNext) {
        emps += ("~~~~~~" + partition.next() + ", new dept: [" + (index + 1) + "]")
      }
      emps.iterator
    }).foreach(println)

    println("============ separator ================")
    // Increase the number of departments from 3 to 5.
    stus.repartition(5).mapPartitionsWithIndex((index, partition) => {
      val emps = new ListBuffer[String]
      while (partition.hasNext) {
        emps += ("~~~~~~" + partition.next() + ", new dept: [" + (index + 1) + "]")
      }
      emps.iterator
    }).foreach(println)

    sc.stop()
  }
}
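III. Shuffle operations
From the official documentation: certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for redistributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, which makes the shuffle a complex and costly operation.

To understand what happens during a shuffle, consider the reduceByKey operation. reduceByKey generates a new RDD in which all values for a single key are combined into a tuple: the key, and the result of executing a reduce function against all values associated with that key. The challenge is that the values for a single key do not necessarily reside in the same partition, or even on the same machine, yet they must be co-located to compute the result.
Costs of a shuffle: data serialization, disk I/O, and network I/O.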
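
How values for one key end up co-located can be sketched in plain Scala. The routing rule below, a non-negative `key.hashCode` modulo the partition count, is the same idea Spark's HashPartitioner uses; everything else (the object name, the in-memory `Seq`s standing in for partitions) is an assumption made for illustration.

```scala
// Plain-Scala sketch of a hash-partitioned shuffle followed by a per-key reduce.
object ShuffleSketch {
  type Pair = (String, Int)

  // Non-negative hashCode modulo the number of reduce-side partitions.
  def partitionFor(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Shuffle: route every pair from every map-side partition to its target partition.
  def shuffle(mapSide: Seq[Seq[Pair]], numPartitions: Int): Map[Int, Seq[Pair]] =
    mapSide.flatten.groupBy { case (key, _) => partitionFor(key, numPartitions) }

  // Reduce side: all values of a key now sit in one partition and can be summed.
  def reduceSide(shuffled: Map[Int, Seq[Pair]]): Map[String, Int] =
    shuffled.values.flatMap { pairs =>
      pairs.groupBy(_._1).map { case (key, vs) => key -> vs.map(_._2).sum }
    }.toMap

  def main(args: Array[String]): Unit = {
    val mapSide = Seq(
      Seq("hello" -> 1, "spark" -> 1),
      Seq("hello" -> 1, "hadoop" -> 1, "spark" -> 1))
    val shuffled = shuffle(mapSide, 2)
    shuffled.toSeq.sortBy(_._1).foreach { case (p, pairs) => println(s"partition $p: $pairs") }
    println(reduceSide(shuffled))
  }
}
```

In a real cluster the `shuffle` step is where serialization, disk writes and network transfer happen, because the source and target partitions generally live in different executors.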

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setMaster("local[2]").setAppName("WcAPP")
val sc = new SparkContext(sparkConf)

// Word count in Spark, step by step
val lines = sc.textFile("file:///E:/word.txt") // read the file; textFile returns an RDD[String]
val words = lines.flatMap(_.split("\t"))       // split each line on tabs and flatten into words
val pairs = words.map((_, 1))                  // turn each word into a (word, 1) tuple for counting
val jieguo = pairs.reduceByKey(_ + _)          // bring equal keys into the same partition and sum them, like sending equal keys to one reducer in MapReduce
jieguo.collect().foreach(println)
sc.stop()

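IV. Wide and narrow dependencies

In the diagram, each large box is an RDD and each small box inside it is a partition.
Narrow dependency
Each partition of the parent RDD is used by at most one partition of the child RDD.
Wide dependency
A partition of the parent RDD is used by multiple partitions of the child RDD.

reduceByKey is a wide dependency.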
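
The two definitions can be turned into a small check. The sketch below is plain Scala with a made-up representation: a dependency is a map from each child partition index to the set of parent partition indices it reads. It is only meant to make the rule concrete, not to mirror Spark's Dependency classes.

```scala
// Classify a dependency as narrow or wide from "child partition -> parent partitions read".
object DependencySketch {
  // Narrow: no parent partition is read by more than one child partition.
  def isNarrow(deps: Map[Int, Set[Int]]): Boolean = {
    val usesPerParent = deps.values.toSeq.flatten.groupBy(identity).map {
      case (parent, uses) => parent -> uses.size
    }
    usesPerParent.values.forall(_ <= 1)
  }

  def main(args: Array[String]): Unit = {
    val mapLike      = Map(0 -> Set(0), 1 -> Set(1), 2 -> Set(2)) // one-to-one
    val coalesceLike = Map(0 -> Set(0, 1), 1 -> Set(2))           // children claim whole parents
    val reduceByKey  = Map(0 -> Set(0, 1, 2), 1 -> Set(0, 1, 2))  // every child reads every parent
    println(s"map:         narrow = ${isNarrow(mapLike)}")
    println(s"coalesce:    narrow = ${isNarrow(coalesceLike)}")
    println(s"reduceByKey: narrow = ${isNarrow(reduceByKey)}")
  }
}
```

Note that a child partition may read several parents and still be narrow, as in the coalesce case; what makes a dependency wide is a parent partition being split across several children.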

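V. Stages
Every shuffle starts a new stage; put differently, every wide dependency produces a new stage.

Let's look at the WC example.

Open the web UI at http://hadoop001:4040, select the job, and expand the DAG visualization.

Step 1 is textFile, step 2 is flatMap, step 3 is map, and step 4 hits reduceByKey, so the job is split into two stages. The later stage can only run after the earlier stage has finished. Without a wide dependency there is no shuffle and the job has a single stage. textFile reads 44 B of input; the shuffle writes 71 B, and those 71 B are also the input of the next stage.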
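
The splitting rule can be written down for a simple, linear chain of operators. This is a toy model in plain Scala: `Op` and its `wide` flag are hypothetical, and Spark's real scheduler works on the full dependency graph rather than a list.

```scala
// Toy model: cut a linear lineage into stages at every wide (shuffle) operator.
object StageSketch {
  final case class Op(name: String, wide: Boolean)

  def stages(lineage: List[Op]): List[List[String]] =
    lineage.foldLeft(List(List.empty[String])) { (acc, op) =>
      if (op.wide) List(op.name) :: acc      // a shuffle opens a new stage
      else (op.name :: acc.head) :: acc.tail // narrow ops stay in the current stage
    }.map(_.reverse).reverse

  def main(args: Array[String]): Unit = {
    val wordCount = List(
      Op("textFile", wide = false),
      Op("flatMap", wide = false),
      Op("map", wide = false),
      Op("reduceByKey", wide = true))
    stages(wordCount).zipWithIndex.foreach { case (ops, i) =>
      println(s"stage $i: ${ops.mkString(" -> ")}")
    }
  }
}
```

For the word count lineage this yields two stages, with textFile, flatMap and map in the first and reduceByKey in the second, matching the description above.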

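VI. Difference between reduceByKey and groupByKey

The two operators return different RDD types: on (String, Int) pairs, reduceByKey returns RDD[(String, Int)], while groupByKey returns RDD[(String, Iterable[Int])].
With groupByKey, the sum is computed in a map that follows it. The first element (the word) is kept as it is with x._1, and the second element (the counts) is added up with x._2.sum, i.e. pairs.groupByKey().map(x => (x._1, x._2.sum)). This produces the same result as reduceByKey.

In the DAG view of the web UI, however, textFile reads the same 44 B of input, but the shuffle write is 91 B, more than the 71 B seen with reduceByKey. On production-size data the gap would be many times larger, so use groupByKey with caution.
reduceByKey writes less shuffle data than groupByKey because it pre-aggregates on the map side, much like a combiner running in the map task of a MapReduce program. Less data passes through the shuffle and over the network, which improves performance.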
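
The effect of map-side pre-aggregation can be illustrated by counting the records that would cross the shuffle. The sketch is plain Scala over in-memory `Seq`s standing in for map-side partitions; it counts records rather than bytes, so it does not reproduce the 71 B and 91 B figures from the web UI, only the direction of the difference.

```scala
// Count shuffled records with and without map-side combining, on the same input.
object CombineSketch {
  type Pair = (String, Int)

  // groupByKey-style: every (word, 1) pair is shuffled as it is.
  def shuffledWithoutCombine(parts: Seq[Seq[Pair]]): Int = parts.map(_.size).sum

  // reduceByKey-style: each partition first merges pairs that share a key,
  // so it ships at most one record per distinct key.
  def shuffledWithCombine(parts: Seq[Seq[Pair]]): Int =
    parts.map(part => part.groupBy(_._1).size).sum

  private def sumByKey(pairs: Seq[Pair]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, vs) => key -> vs.map(_._2).sum }

  def viaGroupByKey(parts: Seq[Seq[Pair]]): Map[String, Int] = sumByKey(parts.flatten)

  def viaReduceByKey(parts: Seq[Seq[Pair]]): Map[String, Int] =
    sumByKey(parts.flatMap(part => sumByKey(part).toSeq)) // combine locally, then merge

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      Seq("a", "a", "b").map(_ -> 1),
      Seq("a", "b", "b", "c").map(_ -> 1))
    println(s"records shuffled without combine: ${shuffledWithoutCombine(parts)}")
    println(s"records shuffled with combine:    ${shuffledWithCombine(parts)}")
    println(s"same result: ${viaGroupByKey(parts) == viaReduceByKey(parts)}")
  }
}
```

Both paths give the same word counts; only the amount of data moved differs, and the saving grows with the number of repeated keys per partition.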