1. The five main properties of an RDD
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
1). An RDD is composed of a list of partitions
protected def getPartitions: Array[Partition]
2). Operations on an RDD are in fact operations on its underlying partitions
def compute(split: Partition, context: TaskContext): Iterator[T]
3). A list of dependencies on other RDDs
protected def getDependencies: Seq[Dependency[_]] = deps
4). Optional: a Partitioner for key-value RDDs, which controls the partitioning strategy and the number of partitions.
Similar to the Partitioner interface in MapReduce, it controls which reducer a key is sent to.
/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
5). Optional: a list of preferred locations to compute each split on
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
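Putting the five hooks together: a minimal sketch of a hand-rolled RDD. The names RangeRDD and RangePartition are made up for illustration and are not part of the Spark API.
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
// Hypothetical partition: each split covers the sub-range [start, end)
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition
// Passing Nil to the RDD constructor declares an empty dependency list (property 3)
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {
  // property 1: the list of partitions
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices)).toArray
  // property 2: how to compute a single split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
  // property 5: no preferred locations for an in-memory range
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  // property 4: `partitioner` keeps its default None, since this is not a key-value RDD
}
With an existing SparkContext sc, new RangeRDD(sc, 10, 2).collect() would return 0 to 9, computed across two partitions.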
2. Ways to create an RDD
1). From a collection, using parallelize
val rdd = sc.parallelize(List(1,2,3,4,5,6))
2). By transforming an existing RDD
val rdd2 = rdd.map((_,100))
3). From an external dataset
val hadoopRDD = sc.textFile("./data/graph/g.txt")
3. The RDD type each operator produces
4. Operator categories
Common transformations: {
1. map
2. flatMap
3. mapPartitions
4. filter
5. distinct
6. groupByKey
7. reduceByKey
8. join
}
Common actions: {
1. foreach
2. saveAsTextFile
3. saveAsObjectFile
4. collect
5. collectAsMap
6. count
7. top
8. reduce
9. fold
10. aggregate
}
[Note]: countByKey is an action:
def countByKey(): Map[K, Long] = self.withScope {
self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
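A small sketch (assuming an existing SparkContext sc): reduceByKey is a lazy transformation that returns a new RDD, while countByKey and collect are actions that trigger a job and return results to the driver.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val reduced = pairs.reduceByKey(_ + _)     // transformation: nothing runs yet
val counts = pairs.countByKey()            // action: runs a job, returns a driver-side Map[String, Long]
println(reduced.collect().mkString(", "))  // collect() is the action that materializes `reduced`
println(counts)                            // Map(a -> 2, b -> 1)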
5. Inspecting an RDD's lineage (dependency chain) with toDebugString
ps.toDebugString
(1) MapPartitionsRDD[7] at sortBy at T3.scala:16 []
| ShuffledRDD[6] at sortBy at T3.scala:16 []
+-(1) MapPartitionsRDD[5] at sortBy at T3.scala:16 []
| MapPartitionsRDD[4] at sortBy at T3.scala:15 []
| ShuffledRDD[3] at sortBy at T3.scala:15 []
+-(1) MapPartitionsRDD[2] at sortBy at T3.scala:15 []
| MapPartitionsRDD[1] at map at T3.scala:15 []
| ParallelCollectionRDD[0] at parallelize at T3.scala:9 []
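In this output the number in parentheses is the partition count, and each `+-` level marks a shuffle (stage) boundary. A sketch (assuming an existing SparkContext sc) of a pipeline that yields a similar two-shuffle lineage:
val ps = sc.parallelize(1 to 10)
  .map(x => (x % 3, x))
  .sortBy(_._2)   // first shuffle
  .sortBy(_._1)   // second shuffle
println(ps.toDebugString)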
6. Wide / narrow dependencies
How the two are distinguished:
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD
Wide dependency: requires a shuffle and marks a stage boundary
In most production scenarios, prefer narrow dependencies whenever possible; a shuffle is expensive, costing both network I/O and disk I/O
Narrow dependencies: NarrowDependency => {
OneToOneDependency
RangeDependency
PruneDependency
}
Wide dependency: ShuffleDependency
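A minimal sketch (assuming an existing SparkContext sc) showing that the dependency type can be inspected directly: map keeps a narrow OneToOneDependency, while reduceByKey introduces a ShuffleDependency and therefore a new stage.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
val mapped = pairs.map { case (k, v) => (k, v * 10) }
val reduced = mapped.reduceByKey(_ + _)
println(mapped.dependencies)   // List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies)  // List(org.apache.spark.ShuffleDependency@...)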
7. reduceByKey vs. groupByKey
reduceByKey performs a map-side combine (partial aggregation) before the shuffle, while groupByKey does not (mapSideCombine = false), so reduceByKey generally outperforms groupByKey
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
combineByKeyWithClassTag[CompactBuffer[V]](
//.......
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
}
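A word-count sketch (assuming an existing SparkContext sc) showing the two APIs side by side; both return the same result, but reduceByKey pre-aggregates values on the map side so less data crosses the shuffle.
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map((_, 1))
val byReduce = words.reduceByKey(_ + _)             // map-side combine, then shuffle
val byGroup = words.groupByKey().mapValues(_.sum)   // every ("word", 1) pair is shuffled first
println(byReduce.collect().toMap)   // Map(a -> 3, b -> 2, c -> 1)
println(byGroup.collect().toMap)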
8. cache and persist
cache() simply delegates to persist() with the default storage level
def cache(): this.type = persist()
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
MEMORY_ONLY: cached in memory as deserialized objects
MEMORY_ONLY_SER: serialized, then cached in memory
Caching to disk is generally not recommended: if some partitions of the RDD are lost, recomputing them is usually faster than reading them back from disk
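A sketch (assuming an existing SparkContext sc; the file path is reused from section 2) of caching an RDD that several actions reuse, then releasing it:
import org.apache.spark.storage.StorageLevel
val lines = sc.textFile("./data/graph/g.txt")
  .filter(_.nonEmpty)
  .persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in memory: smaller footprint, some CPU cost
println(lines.count())    // first action computes and caches the partitions
println(lines.first())    // later actions read from the cache instead of recomputing
lines.unpersist()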
9. repartition vs. coalesce
repartition always shuffles, so it can increase the number of partitions
coalesce disables shuffle by default, so it can only decrease the number of partitions, not increase it
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
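A sketch (assuming an existing SparkContext sc) of how the partition count changes:
val rdd = sc.parallelize(1 to 100, 8)
println(rdd.coalesce(2).getNumPartitions)      // 2  (narrow, no shuffle)
println(rdd.coalesce(16).getNumPartitions)     // still 8: cannot grow without a shuffle
println(rdd.repartition(16).getNumPartitions)  // 16 (repartition = coalesce with shuffle = true)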
10. Sorting
1). case class (Product extends Ordered)
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("萝卜", "2.5", "2567"), ("青椒", "3.0", "675"), ("猪肉", "30", "250")))
products.collect().foreach(println)
val ps = products.map(p => Product(p._1, p._2.toDouble, p._3.toInt))
.sortBy(x => x)
ps.collect().foreach(println)
sc.stop()
}
}
case class Product(name: String, price: Double, amount: Int) extends Ordered[Product] {
override def compare(that: Product): Int = {
this.amount - that.amount
}
}
//Product(猪肉,30.0,250)
//Product(青椒,3.0,675)
//Product(白菜,2.0,1000)
//Product(萝卜,2.5,2567)
2). Implicit conversion (Product => Ordered)
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("萝卜", "2.5", "2567"), ("青椒", "3.0", "675"), ("猪肉", "30", "250")))
products.collect().foreach(println)
val ps = products.map(p => Product(p._1, p._2.toDouble, p._3.toInt))
.sortBy(x => x)
ps.collect().foreach(println)
sc.stop()
}
implicit def Product2Ordered(product: Product): Ordered[Product] = new Ordered[Product] {
override def compare(that: Product): Int = {
product.amount - that.amount
}
}
case class Product(name: String, price: Double, amount: Int)
}
//Product(猪肉,30.0,250)
//Product(青椒,3.0,675)
//Product(白菜,2.0,1000)
//Product(萝卜,2.5,2567)
3). Ordering.on
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("萝卜", "2.5", "2567"), ("土豆", "2.5", "3567"),("青椒", "3.0", "675"), ("猪肉", "30", "250")))
products.collect().foreach(println)
implicit val ord = Ordering[(Double, Int)].on[(String, Double, Int)](x => (-x._2, -x._3))
val ps = products.map(p => (p._1, p._2.toDouble, p._3.toInt)).sortBy(x => x)
ps.collect().foreach(println)
sc.stop()
}
}
//(猪肉,30.0,250)
//(青椒,3.0,675)
//(土豆,2.5,3567)
//(萝卜,2.5,2567)
//(白菜,2.0,1000)
11. Writing output to different files by key
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("萝卜", "2.5", "2567"), ("土豆", "2.5", "3567"), ("青椒", "3.0", "675"), ("猪肉", "30", "250")))
products.collect().foreach(println)
products.map(x => (x._1, x._2)).saveAsHadoopFile("out/hadoop/file", classOf[String], classOf[String], classOf[RZMultipleTextOutputFormat])
sc.stop()
}
class RZMultipleTextOutputFormat extends MultipleTextOutputFormat[Any,Any]{
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
s"${key}/${name}"
}
override def generateActualKey(key: Any, value: Any): Any = {
null
}
}
}
If the generateActualKey override is removed, every output line will also carry the key.