1. Data
hadoop fs -cat /tmp/1234/1.txt
hello world
hello spark
hello hive
//Read the HDFS file into an RDD
scala> var data = sc.textFile("/tmp/1234/1.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
2. map — applies a function to each element of the RDD, turning every data item into a new element. Input and output partitions correspond one to one: the output RDD has exactly as many partitions as the input.
scala> var mapresult = data.map(line => line.split("\\s+"))
mapresult: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:23
//Collect the result of the map transformation
scala> mapresult.collect
res0: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive))
3. flatMap — like map, but each input element may produce zero or more output elements, and the nested results are flattened into a single RDD.
scala> var flatmapresult = data.flatMap(line => line.split("\\s+"))
flatmapresult: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:23
//Collect the result of the flatMap transformation
scala> flatmapresult.collect
res1: Array[String] = Array(hello, world, hello, spark, hello, hive)
A caveat when using flatMap:
flatMap treats a String as a sequence of characters.
Consider the following example (note that in this session the source file also contains a fourth line, hi spark):
scala> data.map(_.toUpperCase).collect
res32: Array[String] = Array(HELLO WORLD, HELLO SPARK, HELLO HIVE, HI SPARK)
scala> data.flatMap(_.toUpperCase).collect
res33: Array[Char] = Array(H, E, L, L, O, , W, O, R, L, D, H, E, L, L, O, , S, P, A, R, K, H, E, L, L, O, , H, I, V, E, H, I, , S, P, A, R, K)
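The same map-versus-flatMap contrast can be reproduced on plain Scala collections, which is a quick way to experiment outside a Spark shell. A minimal local sketch, no Spark required:

```scala
val lines = List("hello world", "hello spark", "hello hive")

// map keeps one output element per input element, so the result is nested
val mapped = lines.map(_.split("\\s+").toList)
// -> List(List(hello, world), List(hello, spark), List(hello, hive))

// flatMap flattens each element's results into a single sequence
val flat = lines.flatMap(_.split("\\s+"))
// -> List(hello, world, hello, spark, hello, hive)

// A String is itself a sequence of Chars, so flatMapping a function that
// returns a String produces individual characters -- the pitfall shown above
val chars = lines.flatMap(_.toUpperCase)
// chars.take(5) -> List(H, E, L, L, O)
```

This mirrors the RDD behavior exactly: map preserves structure, flatMap flattens it, and a String silently decomposes into characters.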
4. distinct — removes duplicate elements from the RDD.
scala> data.flatMap(line => line.split("\\s+")).collect
res61: Array[String] = Array(hello, world, hello, spark, hello, hive, hi, spark)
scala> data.flatMap(line => line.split("\\s+")).distinct.collect
res62: Array[String] = Array(hive, hello, world, spark, hi)
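On local collections the equivalent operation is List.distinct. One difference worth noting: the RDD version involves a shuffle and so returns elements in no guaranteed order (as res62 above shows), while List.distinct preserves first-occurrence order:

```scala
val words = List("hello", "world", "hello", "spark", "hello", "hive")

// List.distinct keeps the first occurrence of each element, in order;
// RDD.distinct yields the same set of values but in shuffle-dependent order
val uniq = words.distinct
// -> List(hello, world, spark, hive)
```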
5. coalesce
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
This function repartitions the RDD.
The first parameter is the target number of partitions; the second controls whether a shuffle is performed (default false). When shuffle is true, the data is redistributed across the new partitions via a shuffle.
Consider the following example:
scala> var data = sc.textFile("/tmp/lxw1234/1.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at textFile at <console>:21
scala> data.collect
res37: Array[String] = Array(hello world, hello spark, hello hive, hi spark)
scala> data.partitions.size
res38: Int = 2 //the RDD data has two partitions by default
scala> var rdd1 = data.coalesce(1)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at <console>:23
scala> rdd1.partitions.size
res1: Int = 1 //rdd1 has one partition
scala> var rdd1 = data.coalesce(4)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at <console>:23
scala> rdd1.partitions.size
res2: Int = 2 //If the target partition count is greater than the current one, shuffle must be set to true; otherwise the partition count stays unchanged
scala> var rdd1 = data.coalesce(4,true)
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:23
scala> rdd1.partitions.size
res3: Int = 4
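One way to see why a shuffle-free coalesce cannot increase the partition count is a toy model on local collections: without a shuffle, coalesce can only merge whole existing partitions, never split them. The sketch below illustrates that idea only; toyCoalesce is a hypothetical helper, not Spark's actual algorithm:

```scala
// Toy model: a "partitioned dataset" is just a Vector of Vectors.
// Merging groups whole partitions together; no record ever moves between
// partitions, so the count can only shrink (or stay the same).
def toyCoalesce[T](parts: Vector[Vector[T]], n: Int): Vector[Vector[T]] = {
  val target = math.min(n, parts.size)      // cannot grow without a shuffle
  parts.zipWithIndex
    .groupBy { case (_, i) => i % target }  // assign old partitions to new ones
    .toVector.sortBy(_._1)
    .map { case (_, group) => group.flatMap(_._1) }
}

val data = Vector(Vector("hello world", "hello spark"),
                  Vector("hello hive", "hi spark"))
toyCoalesce(data, 1).size  // 1 -- like data.coalesce(1)
toyCoalesce(data, 4).size  // 2 -- like data.coalesce(4): count unchanged
```

Asking for more partitions than exist is a no-op in this model, just as coalesce(4) on a 2-partition RDD is without shuffle = true.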
6. repartition
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
This function is simply coalesce with the shuffle parameter set to true.
scala> var rdd2 = data.repartition(1)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:23
scala> rdd2.partitions.size
res4: Int = 1
scala> var rdd2 = data.repartition(4)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at repartition at <console>:23
scala> rdd2.partitions.size
res5: Int = 4
For RDDs, union and subtract do not deduplicate, while intersection does.
Scala's List operations are broadly similar, though not identical: List.intersect and List.diff use multiset semantics and keep duplicate copies.
7. union
def union(other: RDD[T]): RDD[T]
Merges the two RDDs. The result is not deduplicated, and the partition counts are added together.
var rdd1 = sc.makeRDD(1 to 2,10)
var rdd2 = sc.makeRDD(2 to 5,2)
rdd1.union(rdd2).collect().foreach(println(_))
println(rdd1.union(rdd2).partitions.size)
Result:
elements: 1, 2, 2, 3, 4, 5
partitions: 12
The data is not deduplicated; the partition counts add up.
8. intersection
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
Returns the intersection of the two RDDs. The result is deduplicated, and the partition count is the larger of the two input RDDs' partition counts.
The numPartitions parameter sets the partition count of the returned RDD explicitly.
val rdd3=sc.parallelize(List(1,2,2,4,4),2)
val rdd4=sc.parallelize(List(2,2,4,4,5),10)
rdd3.intersection(rdd4).collect().foreach(println(_))
println(rdd3.intersection(rdd4).partitions.size)
Result: 2, 4
Partitions: 10
9. subtract
def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
Returns the elements that appear in this RDD but not in otherRDD. The result is not deduplicated, and the partition count is that of the RDD being subtracted from.
The parameters have the same meaning as in intersection.
var rdd1 = sc.makeRDD(Seq(1,2,2,3),5)
var rdd2 = sc.makeRDD(3 to 4,4)
rdd1.subtract(rdd2).collect().foreach(println(_))
println(rdd1.subtract(rdd2).partitions.size)
Elements: 1, 2, 2
Partitions: 5
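The three set operations can also be tried on local Scala collections, with one caveat worth knowing: ++ and diff behave much like union and subtract, but List.intersect keeps duplicate copies (multiset semantics), whereas RDD.intersection deduplicates. A plain-Scala comparison, no Spark:

```scala
val xs = List(1, 2, 2, 3)
val ys = List(3, 4)

// ++ concatenates without deduplicating, like RDD union
val u = xs ++ ys                 // List(1, 2, 2, 3, 3, 4)

// diff removes matched occurrences; with these inputs it agrees with
// RDD.subtract, but RDD.subtract removes *all* copies of a matching value,
// so the two can differ when the right side has fewer duplicates
val d = xs.diff(ys)              // List(1, 2, 2)

// List.intersect keeps duplicate copies (min count from each side);
// RDD.intersection would return just 2 and 4 here
val i = List(1, 2, 2, 4, 4).intersect(List(2, 2, 4, 4, 5))  // List(2, 2, 4, 4)
```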
10. zip
def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
zip combines two RDDs into an RDD of key/value pairs. It requires both RDDs to have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.
scala> var rdd1 = sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:21
scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at makeRDD at <console>:21
scala> rdd1.zip(rdd2).collect
res0: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D), (5,E))
scala> rdd2.zip(rdd1).collect
res1: Array[(String, Int)] = Array((A,1), (B,2), (C,3), (D,4), (E,5))
scala> var rdd3 = sc.makeRDD(Seq("A","B","C","D","E"),3)
scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
//If the two RDDs have different numbers of partitions, an exception is thrown
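Worth contrasting with plain Scala: List.zip silently truncates to the shorter collection instead of throwing, so code that works on local collections can fail when ported to RDDs with mismatched partitioning. A local sketch, no Spark:

```scala
val nums = List(1, 2, 3, 4, 5)
val letters = List("A", "B", "C")

// List.zip stops at the end of the shorter side rather than raising an
// error; RDD.zip instead demands equal partition and element counts
val pairs = nums.zip(letters)  // List((1,A), (2,B), (3,C))
```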