1. The five main properties of an RDD
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
1). An RDD is composed of a list of partitions
protected def getPartitions: Array[Partition]
2). Operations on an RDD are in fact operations on its underlying partitions
def compute(split: Partition, context: TaskContext): Iterator[T]
3). A list of dependencies on other RDDs
protected def getDependencies: Seq[Dependency[_]] = deps
4). Optional: a Partitioner for key-value RDDs, which controls the partitioning strategy and the number of partitions.
Similar to the Partitioner interface in MapReduce, it controls which reducer a key is sent to.
/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
5). Optional: a list of preferred locations to compute each split on
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
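Putting the five hooks together: a minimal sketch of a hand-rolled RDD. The names RangeRDD and RangePartition are made up for illustration and are not part of the Spark API.
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
// Hypothetical partition: each split covers the sub-range [start, end)
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition
// Passing Nil to the RDD constructor declares an empty dependency list (property 3)
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {
  // property 1: the list of partitions
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices)).toArray
  // property 2: how to compute a single split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
  // property 5: no preferred locations for an in-memory range
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  // property 4: `partitioner` keeps its default None, since this is not a key-value RDD
}
With an existing SparkContext sc, new RangeRDD(sc, 10, 2).collect() would return 0 to 9, computed across two partitions.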
2. Ways to create an RDD
1). From a collection, using parallelize
val rdd = sc.parallelize(List(1,2,3,4,5,6))
2). By transforming an existing RDD
val rdd2 = rdd.map((_,100))
3). From an external dataset
val hadoopRDD = sc.textFile("./data/graph/g.txt")
3. The RDD type each operator produces
4. Operator categories
Common transformations: {
1. map
2. flatMap
3. mapPartitions
4. filter
5. distinct
6. groupByKey
7. reduceByKey
8. join
}
Common actions: {
1. foreach
2. saveAsTextFile
3. saveAsObjectFile
4. collect
5. collectAsMap
6. count
7. top
8. reduce
9. fold
10. aggregate
}
[Note]: countByKey is an action:
def countByKey(): Map[K, Long] = self.withScope {
self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
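A small sketch (assuming an existing SparkContext sc): reduceByKey is a lazy transformation that returns a new RDD, while countByKey and collect are actions that trigger a job and return results to the driver.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val reduced = pairs.reduceByKey(_ + _)     // transformation: nothing runs yet
val counts = pairs.countByKey()            // action: runs a job, returns a driver-side Map[String, Long]
println(reduced.collect().mkString(", "))  // collect() is the action that materializes `reduced`
println(counts)                            // Map(a -> 2, b -> 1)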
5. Inspecting an RDD's lineage (dependency chain) with toDebugString
ps.toDebugString
(1) MapPartitionsRDD[7] at sortBy at T3.scala:16 []
| ShuffledRDD[6] at sortBy at T3.scala:16 []
+-(1) MapPartitionsRDD[5] at sortBy at T3.scala:16 []
| MapPartitionsRDD[4] at sortBy at T3.scala:15 []
| ShuffledRDD[3] at sortBy at T3.scala:15 []
+-(1) MapPartitionsRDD[2] at sortBy at T3.scala:15 []
| MapPartitionsRDD[1] at map at T3.scala:15 []
| ParallelCollectionRDD[0] at parallelize at T3.scala:9 []
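In this output the number in parentheses is the partition count, and each `+-` level marks a shuffle (stage) boundary. A sketch (assuming an existing SparkContext sc) of a pipeline that yields a similar two-shuffle lineage:
val ps = sc.parallelize(1 to 10)
  .map(x => (x % 3, x))
  .sortBy(_._2)   // first shuffle
  .sortBy(_._1)   // second shuffle
println(ps.toDebugString)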
6. Wide / narrow dependencies
How the two are distinguished:
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD
Wide dependency: requires a shuffle and marks a stage boundary
In most production scenarios, prefer narrow dependencies whenever possible; a shuffle is expensive, costing both network I/O and disk I/O
Narrow dependencies: NarrowDependency => {
OneToOneDependency
RangeDependency
PruneDependency
}
Wide dependency: ShuffleDependency
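A minimal sketch (assuming an existing SparkContext sc) showing that the dependency type can be inspected directly: map keeps a narrow OneToOneDependency, while reduceByKey introduces a ShuffleDependency and therefore a new stage.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
val mapped = pairs.map { case (k, v) => (k, v * 10) }
val reduced = mapped.reduceByKey(_ + _)
println(mapped.dependencies)   // List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies)  // List(org.apache.spark.ShuffleDependency@...)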
7. reduceByKey vs. groupByKey
reduceByKey performs a map-side combine (partial aggregation) before the shuffle, while groupByKey does not (mapSideCombine = false), so reduceByKey generally outperforms groupByKey
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
combineByKeyWithClassTag[CompactBuffer[V]](
//.......
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
}
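A word-count sketch (assuming an existing SparkContext sc) showing the two APIs side by side; both return the same result, but reduceByKey pre-aggregates values on the map side so less data crosses the shuffle.
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map((_, 1))
val byReduce = words.reduceByKey(_ + _)             // map-side combine, then shuffle
val byGroup = words.groupByKey().mapValues(_.sum)   // every ("word", 1) pair is shuffled first
println(byReduce.collect().toMap)   // Map(a -> 3, b -> 2, c -> 1)
println(byGroup.collect().toMap)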
8. cache and persist
cache() simply delegates to persist() with the default storage level
def cache(): this.type = persist()
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
MEMORY_ONLY: cached in memory as deserialized objects
MEMORY_ONLY_SER: serialized, then cached in memory
Caching to disk is generally not recommended: if some partitions of the RDD are lost, recomputing them is usually faster than reading them back from disk
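A sketch (assuming an existing SparkContext sc; the file path is reused from section 2) of caching an RDD that several actions reuse, then releasing it:
import org.apache.spark.storage.StorageLevel
val lines = sc.textFile("./data/graph/g.txt")
  .filter(_.nonEmpty)
  .persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in memory: smaller footprint, some CPU cost
println(lines.count())    // first action computes and caches the partitions
println(lines.first())    // later actions read from the cache instead of recomputing
lines.unpersist()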
9. repartition vs. coalesce
repartition always shuffles, so it can increase the number of partitions
coalesce disables shuffle by default, so it can only decrease the number of partitions, not increase it
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
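A sketch (assuming an existing SparkContext sc) of how the partition count changes:
val rdd = sc.parallelize(1 to 100, 8)
println(rdd.coalesce(2).getNumPartitions)      // 2  (narrow, no shuffle)
println(rdd.coalesce(16).getNumPartitions)     // still 8: cannot grow without a shuffle
println(rdd.repartition(16).getNumPartitions)  // 16 (repartition = coalesce with shuffle = true)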
10. Sorting
1). case class (Product extends Ordered)
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("萝卜", "2.5", "2567"), ("青椒", "3.0", "675"), ("猪肉", "30", "250")))
products.collect().foreach(println)
val ps = products.map(p => Product(p._1, p._2.toDouble, p._3.toInt))
.sortBy(x => x)
ps.collect().foreach(println)
sc.stop()
}
}
case class Product(name: String, price: Double, amount: Int) extends Ordered[Product] {
override def compare(that: Product): Int = {
this.amount - that.amount
}
}
//Product(猪肉,30.0,250)
//Product(青椒,3.0,675)
//Product(白菜,2.0,1000)
//Product(萝卜,2.5,2567)
2). Implicit conversion (Product => Ordered)
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("萝卜", "2.5", "2567"), ("青椒", "3.0", "675"), ("猪肉", "30", "250")))
products.collect().foreach(println)
val ps = products.map(p => Product(p._1, p._2.toDouble, p._3.toInt))
.sortBy(x => x)
ps.collect().foreach(println)
sc.stop()
}
implicit def Product2Ordered(product: Product): Ordered[Product] = new Ordered[Product] {
override def compare(that: Product): Int = {
product.amount - that.amount
}
}
case class Product(name: String, price: Double, amount: Int)
}
//Product(猪肉,30.0,250)
//Product(青椒,3.0,675)
//Product(白菜,2.0,1000)
//Product(萝卜,2.5,2567)
3). Ordering.on
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("萝卜", "2.5", "2567"), ("土豆", "2.5", "3567"),("青椒", "3.0", "675"), ("猪肉", "30", "250")))
products.collect().foreach(println)
implicit val ord = Ordering[(Double, Int)].on[(String, Double, Int)](x => (-x._2, -x._3))
val ps = products.map(p => (p._1, p._2.toDouble, p._3.toInt)).sortBy(x => x)
ps.collect().foreach(println)
sc.stop()
}
}
//(猪肉,30.0,250)
//(青椒,3.0,675)
//(土豆,2.5,3567)
//(萝卜,2.5,2567)
//(白菜,2.0,1000)
11. Writing output to different files by key
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("萝卜", "2.5", "2567"), ("土豆", "2.5", "3567"), ("青椒", "3.0", "675"), ("猪肉", "30", "250")))
products.collect().foreach(println)
products.map(x => (x._1, x._2)).saveAsHadoopFile("out/hadoop/file", classOf[String], classOf[String], classOf[RZMultipleTextOutputFormat])
sc.stop()
}
class RZMultipleTextOutputFormat extends MultipleTextOutputFormat[Any,Any]{
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
s"${key}/${name}"
}
override def generateActualKey(key: Any, value: Any): Any = {
null
}
}
}
If the generateActualKey override is removed, every output line will also carry the key.